For this project I am going to look at how new words enter common usage.
One of the responsibilities of a journalist is to teach readers. This is not limited to conveying news; it also includes teaching new vocabulary. This is particularly relevant in the fast-paced realm of technology, where artificial intelligence, the cloud, machine learning, and big data have become significantly more newsworthy as these concepts transform the industry and the process of innovation. But when do these terms enter the public vocabulary and become simply ML and AI? We should expect that journalists initially use only the unabbreviated term. As the concept starts to enter the public domain, journalists may use the abbreviation and the full term side by side, until eventually the abbreviation predominates. Let's see if this is true!
I decided to focus on the following words and abbreviations:
- Artificial Intelligence, AI, and A.I.
- Machine Learning, ML, and M.L.
- Natural Language Processing and NLP
- Neural Network and Neural Net
- Generative Adversarial Network and GANs
- Recurrent Neural Network, Recurrent Neural Net, RNN, and R.N.N.
- Application Programming Interface and API
- Deep Neural Network, Deep Neural Net, Deepmind, and Deep Mind
- Supervised Machine Learning, Unsupervised Machine Learning, and Reinforcement Learning
- LSTM, Embedding space
- Cloud, Big Data, Technology, Automation, Robot, AOL, Cyber Crime
I scraped articles from the Guardian published between 1999 and 2017 and counted the number of occurrences and co-occurrences of these words. The Guardian's online edition was the fifth most widely read in the world in 2014 (Source) and is thus a reasonable proxy for journalistic activity.
The most interesting results came from AI and ML. According to the 'Journalist Educator Hypothesis' above, I expected that the number of occurrences of the abbreviations would eventually overtake those of the full terms. However, we observe the opposite!
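To make the counting step concrete, here is a minimal sketch of how occurrences per year could be tallied. It is not the code used for this project: the `TERMS` dictionary, the `compile_term` and `count_by_year` helpers, and the `(year, text)` input format are all assumptions made for illustration, and the scraping itself is left out. Abbreviations are matched case-sensitively so that "AI" does not fire on the "ai" in "email", and lookarounds are used instead of `\b` so that dotted forms like "A.I." still match.

```python
import re
from collections import Counter, defaultdict

# Hypothetical term list (not the full list from the post).
# Value is True when the term should be matched case-sensitively.
TERMS = {
    "Artificial Intelligence": False,
    "AI": True,
    "A.I.": True,
    "Machine Learning": False,
    "ML": True,
}

def compile_term(term, case_sensitive):
    # Lookarounds instead of \b, so terms ending in "." (e.g. "A.I.") match
    # and "ML" does not match inside "HTML".
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.compile(pattern, 0 if case_sensitive else re.IGNORECASE)

PATTERNS = {term: compile_term(term, cs) for term, cs in TERMS.items()}

def count_by_year(articles):
    """articles: iterable of (year, text) pairs.

    Returns a dict mapping year -> Counter of term -> number of occurrences.
    """
    counts = defaultdict(Counter)
    for year, text in articles:
        for term, pattern in PATTERNS.items():
            counts[year][term] += len(pattern.findall(text))
    return counts

# Toy input standing in for scraped articles.
sample = [
    (2016, "Artificial intelligence (AI) is everywhere. A.I. and AI again; email is not AI."),
    (2017, "Machine learning, or ML, beats HTML."),
]
counts = count_by_year(sample)
```

This counts every mention; counting the number of articles that mention a term at least once would be an equally reasonable choice and could give a different picture.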
Timeline for AI versus Artificial Intelligence
Timeline for ML versus Machine Learning
One explanation may be that the target audience changed. Initially these kinds of tech articles may have been directed at already knowledgeable readers; as the topics became more popular and reached a wider readership, spelling out the full term became necessary. It may also indicate that journalists prefer the full term because the abbreviation comes across as more and more 'buzzwordy' as the popularity of the concept rises.
Speaking of buzzwords, let's have a look at a couple.
Timeline for Buzzwords
We can see, perhaps surprisingly, that 'Cloud' and 'Big Data' are actually on the downturn, whereas 'automation' and 'robot' have become much more common. If this is at all indicative of company behavior, it implies that there may have been a shift from virtual innovation to physical innovation.
Finally, most of the technical terms, like embedding space or the different types of machine learning, almost never occur, presumably because the Guardian is a news outlet aimed at a general audience.
Just for fun I also looked into co-occurrences of words, combining each full term with its abbreviations into a single category. We can see that there are actually not that many co-occurrences. The most common ones were AI with Robots, AI with Automation, AI with ML, and Big Data with Cloud.
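A rough sketch of the co-occurrence idea, under the assumption that two categories "co-occur" when both appear at least once in the same article. The `CATEGORIES` mapping, the `_rx` helper, and the function names are hypothetical and only cover a handful of the terms listed earlier.

```python
import re
from collections import Counter
from itertools import combinations

def _rx(term, case_sensitive=False):
    # Match the term as a whole token; abbreviations are case-sensitive.
    return re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)",
                      0 if case_sensitive else re.IGNORECASE)

# Hypothetical grouping of full terms and abbreviations into one category each.
CATEGORIES = {
    "AI": [_rx("artificial intelligence"), _rx("AI", True), _rx("A.I.", True)],
    "ML": [_rx("machine learning"), _rx("ML", True), _rx("M.L.", True)],
    "Robot": [_rx("robot"), _rx("robots")],
    "Automation": [_rx("automation")],
    "Big Data": [_rx("big data")],
    "Cloud": [_rx("cloud")],
}

def categories_in(text):
    """Return the set of category names mentioned at least once in text."""
    return {name for name, patterns in CATEGORIES.items()
            if any(p.search(text) for p in patterns)}

def cooccurrence_counts(texts):
    """Count, per unordered pair of categories, the articles containing both."""
    pairs = Counter()
    for text in texts:
        present = sorted(categories_in(text))  # sorted so each pair has one key
        pairs.update(combinations(present, 2))
    return pairs

# Toy input standing in for scraped articles.
texts = [
    "AI and robots will drive automation.",
    "Machine learning is a branch of artificial intelligence.",
    "Big data lives in the cloud.",
]
pairs = cooccurrence_counts(texts)
```

Calling `pairs.most_common()` then lists the category pairs from most to least frequent.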