Frederick O'Loughlin

Creative Projects

Experiments, mad ideas, and other attempts to be innovative.

A news printer with a fast moving paper going through it

AI for Identifying News Publishers

Building and deploying a Neural Network that could identify the publisher behind a given news article.

As an experiment, I thought it would be interesting to try to build a machine learning model that could recognise the source of a news article using only the text of the article. My logic was that different news publishers do have unique styles: they focus on different topics, publish articles of different lengths and can have distinctive vocabularies.

I sourced a pre-existing dataset of news articles from Kaggle, pre-processed the articles and used them to train an LSTM. The dataset had examples from 12 different news sources, which means a random guess would be correct 8.33% of the time (1 in 12). After a few attempts and much tweaking of parameters, I was able to build a model that reached 72% accuracy when tested on data from the dataset that it hadn’t seen before.
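To make the baseline and the held-out test concrete, here is a minimal sketch in plain Python. It is not the code used for the project: the function names (`chance_accuracy`, `train_test_split`, `accuracy`) and the 20% test fraction are assumptions made for illustration, and the LSTM itself is left out.

```python
import random


def chance_accuracy(num_classes: int) -> float:
    """Expected accuracy of a uniform random guess over num_classes labels."""
    return 1.0 / num_classes


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle the items and hold back a fraction the model never sees in training.

    The 0.2 default is a hypothetical choice, not a value from the project.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predictions, labels) -> float:
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


if __name__ == "__main__":
    # 12 publishers, so a uniform random guess is right 1 time in 12.
    print(f"Chance baseline: {chance_accuracy(12):.2%}")
```

A trained model's predictions on the held-out portion would be passed to `accuracy` and compared against the chance baseline.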

I deployed the final version on my website for a time using a simple form interface that returned its guess of which publication the text came from. Having done this and after subjecting it to further review, I decided to take it back down again. The reason was that it had become clear the model was achieving far worse results than it had achieved in training. What had happened?

Essentially, I had underestimated how quickly the fast-paced world of news changes. The articles in the dataset I used were about two years old. That may seem recent, but the news cycle has moved on from much of what preoccupied it back then. As a result, I was seeing a clear example of concept drift, where the data used to train a machine learning model becomes less relevant with time and the model therefore becomes less accurate.
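One simple way to spot this kind of drift is to group predictions by when the article was published and compare accuracy per period against the accuracy seen at test time. The sketch below assumes labelled predictions are available as `(period, predicted, actual)` tuples. The function names, the 10-point tolerance and the sample records are all hypothetical and are not results from the project.

```python
from collections import defaultdict


def accuracy_by_period(records):
    """Return {period: accuracy} for records of (period, predicted, actual)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for period, predicted, actual in records:
        total[period] += 1
        correct[period] += int(predicted == actual)
    return {period: correct[period] / total[period] for period in sorted(total)}


def drifted_periods(per_period, reference_accuracy, tolerance=0.10):
    """List the periods whose accuracy fell more than `tolerance` below the reference.

    The tolerance is an arbitrary illustrative threshold.
    """
    return [
        period
        for period, acc in per_period.items()
        if reference_accuracy - acc > tolerance
    ]


if __name__ == "__main__":
    # Made-up records purely to show the shape of the input.
    records = [
        ("older", "Publisher A", "Publisher A"),
        ("older", "Publisher B", "Publisher B"),
        ("newer", "Publisher A", "Publisher B"),
        ("newer", "Publisher B", "Publisher B"),
    ]
    per_period = accuracy_by_period(records)
    print(per_period)
    print(drifted_periods(per_period, reference_accuracy=1.0))
```

A check like this only works if the true publisher is known for recent articles, which is the same labelled stream a continually learning model would need.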

Still, the project gave a fairly strong proof of concept for the idea. The next step would have been to deploy a model capable of continual learning from articles as they are released, but that felt beyond the scope of the experiment.

(Photo credit Bank Phrom, via Unsplash)
