Should we allow Google to dictate the language we use?

For decades, the Oxford English Dictionary (OED) was the reference work for the English language. Indeed, for most queries its verdict was regarded as final. But now that the Internet is with us, more and more people are turning to search engines like Google. Internet users type in terms or idioms they are unsure about and often decide, on the basis of the number of hits their searches return, whether or not a given usage counts as acceptable. In other words, the authority of traditional dictionaries and the importance of lexicographical competence are gradually waning. So is Google ruining our language?

Martin Bächtold, December 2006

Google finds nearly 14 million hits for misspellings like 'recieve'. Other typos are also surprisingly widespread: 'accomodate' scores over 7 million hits and 'guage' over 6 million. Search engines also break all records for stylistic howlers. For example, we find nearly 1,200,000 hits for the pleonasm 'plans for the future' (surely only a time traveller could make plans for the past?). In English, grammatical errors are also alarmingly common, for example concerning apostrophe use ('it's' confused with 'its') or homophones like 'there's' and 'theirs'. Surely, faced with such a dreadful 'colection' (957,000 hits) of errors, only a 'ninconpoop' (1,100 hits) or 'dilletante' (88,500 hits) would use a search engine to check their spelling or style? But unfortunately things just aren't quite that simple…

Translators are undoubtedly some of the most frequent users of Google, for their work is always closely scrutinised and they often use search engines to justify their linguistic choices. At the same time, texts translated into English may be produced for readers who don't necessarily have the language as their mother tongue. Consequently, some translators find it best to 'tone down' the style and supply their paying customers with something they will readily understand on the basis of their non-native grasp of English. But a little knowledge can be a dangerous thing, so it is not uncommon for a confused German-speaking customer to phone up and ask why 'seit Jahren' has been translated as 'for years' instead of literally 'since years', for example. Sometimes, the answers patiently given to such questions are not easy to comprehend or are simply dismissed as 'not logic' by the customer, which steps up the pressure on translators to resort to some other level of justification. In many instances they may feel that their best bet is to convince customers who have little feel for language that the translation was correct by invoking quantitative arguments, i.e. pointing out that their solution scores plenty of hits whereas the enquirer's presumed correction gets none. Although the statistics mostly come out clearly in favour of professional writers and native speakers, customers still occasionally grumble that the Internet is full of errors, and some even vehemently "refuse to let Google dictate language usage"!

Can search engines be used to prove the correctness or incorrectness of orthographic, grammatical or stylistic constructions? How do search engines influence our language usage?

How the brain processes errors

For people to recognise stylistic or spelling errors, the brain has to perceive them. The first area of the brain identified as being involved in speech processing was discovered in 1861 by the French physician and anatomist Paul Broca (1824-1880). He described a patient who could still understand simple sentences, but was no longer capable of speech. When the patient died, an autopsy revealed a lesion in the cerebral tissue in the left hemisphere of the brain, an observation that prompted Broca to locate language production in that part of the cerebrum.

Later on, Broca found that the skullcap warms up when the brain performs complex tasks. Since that discovery, methods for measuring and localising neural activity in the brain have improved quite considerably. Electroencephalography lets us track neural activity with millisecond precision, while magnetic resonance tomography captures fluctuations in blood oxygen levels and thus shows where that activity takes place. By combining these two methods, we can today describe the cerebral activity associated with language processing in minute spatio-temporal detail.

At the Max Planck Institute in Leipzig, scientists measured the responses triggered in the brain by the perception of incorrect sentences. In their experiments, respondents were presented with sentences containing syntactic errors (along the lines of 'The boy caught of the ball'). After 120 milliseconds (ms), such sentences provoked a clearly measurable response in a specific region of the brain. The brain proved slower to respond to semantic errors: faced with a sentence roughly equivalent to 'The table was breastfed', the brain only started ringing alarm bells after 400 ms. The inference is that the brain processes the grammar of a sentence first, then its meaning. The information in sentences containing either kind of error takes far longer for the brain to process: first the problem elements have to be 'corrected' in some way; only then does the brain make us consciously aware of the language content.

Cerebral activity in language recognition is a bit like water falling on a mountain, which gouges out channels and riverbeds over time so that the water can follow the path of least resistance as it flows down into the valley. If a large boulder falls into a riverbed, the flow of water is blocked and takes longer to reach its destination, either by finding its way around the unexpected obstacle and continuing along its customary path or, if the obstacle is too large, by taking a new route altogether. Put simply, this is roughly how language recognition in the brain works. When we learn language, frequently repeated combinations of words and idioms etch themselves into the brain. If the brain is faced with a serious linguistic anomaly, it finds no customary path to follow and has to work harder, pumping more blood to the region and firing more signals across its synapses. By contrast, correctly structured or worded sentences are easier to process, because the patterns encountered are familiar and thus easier to deal with. On the other hand, if an error is repeated frequently enough, the brain grows used to it and adjusts to accept it, thereafter no longer recognising it as an anomaly. Vitally, such flexibility is what enables change.

Neurologically speaking, grammar and orthography can be explained in terms of the brain's laziness. So the principle is clear: error-free sentences flow through cerebral pathways following the path of least resistance. The presence of errors causes blockages which take longer to process. This behavioural pattern could be used to develop a kind of 'brain-lobe grammar' by hooking up a group of respondents to appropriate machinery and measuring their responses to a series of words or sentences. All stimuli that elicited responses delayed by more than 100 ms would be prohibited or deemed 'incorrect'. Of course, the brain might overlook some very frequent errors altogether and take some time to learn that they were actually wrong. Other errors, those based on false assumptions by non-native speakers for example, would never be regarded by masters of the mother tongue in question as anything but anomalies.
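This thought experiment can be sketched in a few lines of code. The snippet below is purely illustrative: the baseline, the 100 ms threshold and the latencies attached to the example sentences are invented placeholders, not measurements from the Leipzig experiments.

```python
# A toy version of the 'brain-lobe grammar' thought experiment: any stimulus
# whose measured response is delayed by more than 100 ms relative to a
# baseline is deemed 'incorrect'. All numbers are invented placeholders.

BASELINE_MS = 120    # assumed latency for a well-formed sentence
THRESHOLD_MS = 100   # extra delay beyond which a stimulus is flagged

measurements = {
    "The boy caught the ball": 118,
    "The boy caught of the ball": 240,   # syntactic anomaly
    "The table was breastfed": 430,      # semantic anomaly
}

for sentence, latency in measurements.items():
    verdict = "incorrect" if latency - BASELINE_MS > THRESHOLD_MS else "acceptable"
    print(f"{latency:4d} ms  {verdict:<10}  {sentence}")
```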

Wittgenstein's language games

By measuring neurological activity we can establish when and where the brain processes linguistic stimuli. But this kind of neurolinguistics tells us nothing about how language is understood. How do we perceive the meaning of a word or sentence?

This was the very question asked by Ludwig Wittgenstein (1889–1951), unquestionably the leading language philosopher of the 20th century. Before his brilliant analysis of language, most philosophers and linguists regarded the word as a kind of abstract equivalent of a tangible object. For example, the word wine was viewed as standing for or symbolising a drink made from grapes. Wittgenstein proposed a radical break with this theory. To him, instead of individual words 'standing for' things, their meaning was determined by the way they were actually used. In other words, only if we know how words are used can we succeed in determining what a speaker meant in a certain situation.

Anyone reading an unfortunately misspelled stag night party invitation promising "an evening of swine, women and song" will immediately spot the familiar collocation and assume that the bridegroom-to-be is promising them drink, rather than some form of porcine entertainment (unless of course the misprint was a deliberate play on words intended to imply that poor behaviour was expected!). In fact, even if the word wine were completely omitted by mistake, any practised proofreader could reconstruct the intended message with reasonable certainty. Evidently, as we acquire language, the rules of use governing the word wine are etched into our brains and enable us to ascertain actual meanings in context.

Wittgenstein thought that the rules etched into our brains as a result of our functional interaction with the world were products of 'language games' that could prompt the brain to assimilate individual words, groups of words or even full-blown sets of specialist terminology. One example taken by Wittgenstein, a German speaker, was rot, the word for red in his native language. How did he know that such-and-such a colour was rot? One answer might be that he had learnt German. Thus, through Wittgenstein's eyes we recognise the colour not because we know that its wavelength lies somewhere between 625 and 740 nm, but purely and simply because the word in question was drummed into our skulls during childhood.

In his Philosophical Investigations, Wittgenstein considers the truth content of language games. The conclusion he reaches is that all games are equivalent and that none is more important than another. In short, since there is no supreme language authority to authenticate the validity of a statement, there can be no 'right' or 'wrong' language games. However, this is not the case for, say, politicians, who see very clear differences between individual language games. For when it comes to elections, every vote counts, and if they are to be elected they must convey their message to the electorate as clearly as possible. Consequently, when making statements they must always carefully consider whether their language game is the same as the one on which their public will draw. For as soon as what they say goes beyond the boundaries of their audience's linguistic experience, their words will lose their meaning and may even be classed as 'wrong'.

So products of language games can be classified and listed in a hierarchy. For politicians, messages that get across to more people are 'better': in other words, for them quantity is almost synonymous with quality. The problem is that there are no fixed rules in our respective language games. Nor are there any distinct borders; every audience uses a different frame of reference and the rules governing language usage are constantly shifting.

Changing language games

The combined products of all language games move like a glacier, with an accumulation zone and an ablation zone. The accumulation zone is where new snowflakes are constantly falling. Most are immediately blown away, but some remain and 'stick'. In time, the glacier flows into the ablation zone, where it melts into water and disappears.

All languages are constantly shifting, changing in many different ways, not merely through the addition of new words. Taking vocabulary as an example, most innovations breeze in and are instantly blown away again, but others gain a foothold and establish themselves as everyday terms. The process of change begins with all such innovations being regarded as errors that break established rules. Teachers underline them in red and desperately try to fend them off. Some innovations even manage to become established against all the odds, mostly as a result of being picked up by the mass media, and end up even being lexicographically recorded in the hallowed pages of the OED. So today's errors can become tomorrow's rules.

How we view language depends on where we happen to find ourselves on the 'language glacier'. Older users, who tend to occupy ablation zones, where outmoded words and rules are washed away into oblivion, are horrified by most innovations originating in the accumulation zone. The younger generation on the upper part of the glacier consider the language used further below in the ablation zone as antiquated or, in their kind of language, 'well past its sell-by date'.

A glacier moves under its own weight and the pressure that weight creates. No single snowflake or particle of ice can move the glacier forwards, but move it does, because every particle behaves in the same way, trying to improve its position relative to the centre of the Earth. In so doing it bumps into other particles. This is how tiny 'microstructures' weighing virtually nothing can move massive 'macrostructures' weighing thousands or millions of tonnes.

In language, change occurs through a similar pattern of interactions between micro- and macrostructures. In fact, the same pattern also applies in a free market economy, where Adam Smith's theory holds that an individual's economic egoism can generate greater overall prosperity. Language change is essentially driven by three elements of human behaviour: innovation, idleness and a desire to show off.

Instead of writing words out in full, text message addicts tend to abbreviate them and be innovative in other ways in a bid to impress their peers and differentiate themselves from the generation in the ablation zone, people who would never use trendy words like "cool" or "happening" in the way that youngsters bandy them about. Naturally, no single user can actually change the language they speak, but since all users' behaviour adheres to the three elements mentioned above, some interaction between micro- and macrostructures results, setting the language glacier in motion.

Now, going for a walk on a glacier is always a dangerous undertaking, but the risks are particularly high for anyone out wandering on an unstable language glacier. Not only is there a danger of moving too far up or down the glacier, its middle, too, is riddled with crevasses, ready to swallow up anyone whose footing is unsure, i.e. whose language games prompt them to express themselves in what others deem to be a sexist, racist, or otherwise highly offensive or insulting way. To assess these risks properly we need modern measuring equipment. We'd like to know precisely where the pitfalls are so that we can avoid them. With such a tool, anyone writing an article, advertising brochure or speech would know in advance whether and how certain terms and expressions will be construed, without having to wire up thousands of human guinea pigs to expensive neurological equipment.

Search engines as tools for measuring language games

In 1998, Larry Page and Sergey Brin founded the company Google Inc. in a garage and marketed the first test version of a new Internet search engine. Just eight years later, that search engine has an index of over 8 billion pages, and the company's market value is in excess of $150 billion. On the German-language part of the Internet, Google commands a market share among search engines of over 90%.

So what's the secret of Google's search engine? There were other search engines before Google came along, though one disadvantage they had was that the order of the websites or pages they displayed was often meaningless. Anyone who wanted to know something about the president of the United States, for example, might find that the first hit was a website cobbled together by some anonymous computer freak living in the back of beyond. So users had to open dozens of 'hits' until they more or less fortuitously chanced upon the desired information.

Google brought order to the Internet by optimising page ranking in a brilliantly simple way. Every page in Google's index is ranked according to the number of other pages that link to it. Furthermore, the ranking is recursive: a link from a page that itself attracts numerous links carries more weight than a link from, say, the unimportant and consequently isolated website created by our computer freak.

The algorithm developed by Page and Brin compiles an index for its users in which the order of the pages is determined by a constant, dynamic, democratic process. The electorate comprises the countless website designers, and the votes are their links to other websites. To reduce potential election fraud to a minimum, given the impossibility of checking all the votes, Page and Brin combined direct with indirect democracy. In their system anyone who creates a website receives a direct vote, but if they themselves are elected by lots of other voters, the confidence this reflects is rewarded, and they can pass on the votes thus accumulated.
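The idea can be sketched in a few lines of Python. The snippet below is a minimal, purely illustrative version of this 'vote-passing' principle rather than Google's actual implementation; the four-page link graph and the parameter values are invented for the example.

```python
# A minimal, illustrative version of the 'vote-passing' idea behind page
# ranking (not Google's actual implementation). Each page spreads its score
# over its outgoing links, and the scores are recomputed until they settle.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, outgoing in links.items():
            targets = outgoing or pages          # a dead end shares its score with everyone
            share = rank[page] / len(targets)    # a vote carries the voter's accumulated weight
            for target in targets:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

# Invented four-page web: page D links out but is never linked to itself.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A"]}
for page, score in sorted(pagerank(web).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this invented graph, page D casts a vote but never receives one, so it ends up with the lowest score, just like our computer freak's isolated website.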

In this way, Google's ranking reflects the behavioural maxims of all website creators who include links, though those maxims have never been scientifically investigated. Nonetheless, we can assume that people mainly link to websites whose content they deem to be important. The aspects more relevant to language-oriented Google users, namely correct orthography or grammatical form, are barely taken into account when those links are created. However, we can assume some kind of relationship between form and content. For example, anyone entering the word "screwdriver" into Google will find a description of the expected tool at the top of their list of hits, which clearly makes sense. Yet someone more interested in the cocktail of the same name will not be disappointed either, because the second hit provides a description of the drink in question.

Google has other advantages over lexicographical works like the OED. Not only is it free and more user-friendly, but in addition to finding the most relevant results it also provides the total number of hits, i.e. pages featuring the search string. So whereas the good old OED only gives a Boolean (i.e. right or wrong) answer, Google supplies its users with a response that is not only qualitative but also quantitative.

What's the more commonly used term: football or soccer? Google statistics effortlessly answer such questions. National Google pages even give users the option of focussing their searches on websites created in their respective country, which are likely to be of greater interest to them. Google also performs well in more advanced, contextual searches, quickly revealing important distinctions, like the fact that there are two different sports both called football in their respective countries (try searching for football in the UK and football in the USA, for example).

Language-oriented Google users produce work of a superior quality, because they use the search engine to tune into the relevant specialist terminology relatively quickly. After all, any writer's 'feel' for their language depends on their own individual experience. Systematically checking all known variants tests the compatibility between different people's language games, and the text produced in this manner flows more smoothly when processed by like-minded readers' brains, being found somehow 'easier to read' and 'more readily comprehensible'. There is no question here of any form of dictatorship: the hits Google finds are determined by democratic means, the number of hits being based on all the documents available on the Web.

Consequently, the spelling errors mentioned at the beginning of this paper should always be seen in relation to the correct results found: the misspelled 'recieve' may get 14 million hits, but the correct spelling scores 580 million; and incorrect renderings of 'accomodate' are far outnumbered by correct usages of accommodate. In English there are very few, if any, examples of misspellings that outnumber their correct equivalents in terms of the number of hits, perhaps because the language has many different variants (such as UK English versus US English) and is anyway so tolerant of alternative spellings. In German, by contrast, the best-known example of this involves a word meaning "sour cream": the correct spelling Schmant gets 15,600 hits, whereas the supposedly incorrect spelling 'Schmand' gets over 330,000. In the light of the maxim about language evolution suggesting that "today's errors are tomorrow's rules", any such very frequent 'deviant spellings' should not simply be dismissed as breaches of the rules, but should actually be interpreted as votes in favour of a reformed spelling.
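The kind of comparison meant here is easy to make explicit. The sketch below simply contrasts the hit counts quoted above; the helper function is hypothetical and the counts are typed in by hand rather than fetched from any real search API.

```python
# The hit counts quoted above, compared explicitly. The function is a
# hypothetical helper; in practice the counts would come from a search
# engine rather than being typed in by hand.

def compare_spellings(variant_a, hits_a, variant_b, hits_b):
    winner = variant_a if hits_a > hits_b else variant_b
    ratio = max(hits_a, hits_b) / min(hits_a, hits_b)
    print(f"{variant_a}: {hits_a:,}  vs  {variant_b}: {hits_b:,}  "
          f"-> '{winner}' is about {ratio:.0f} times more frequent")

compare_spellings("recieve", 14_000_000, "receive", 580_000_000)   # ratio of roughly 41
compare_spellings("Schmant", 15_600, "Schmand", 330_000)           # ratio of roughly 21
```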

These examples show that using Google to evaluate language use is, for the time being at least, an activity best left in the hands of professional writers, such as text editors or translators, because it presupposes familiarity with multiple variants of the language game. Searches involving groups of words are even more problematic. For instance, a European using Google to try and find information about the teams playing the world's favourite sport in America will be frustrated at having to trawl through quite a few pages before finding anything about soccer, as opposed to American football. Other problems can arise from the fact that the number of hits for individual search terms can be inflated by instances of their occurrence in very widespread composite words or phrases that actually have nothing to do with the meaning sought. Naturally, this phenomenon can easily undermine the seeker's judgement.

In inflecting languages, performing searches is particularly difficult, so Google users often have to try out several different forms of 'the same word' and can only draw statistically based conclusions about frequency of usage after totting up the combined numbers of hits. Another handicap is Google's wretched language recognition algorithm. For even if users access a national Google page and restrict their search to pages in their language of choice, words that happen to be spelt the same in other languages (often with a completely different meaning) produce a 'hit list' that actually comprises a high proportion of 'misses'.
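Totting up the forms can at least be done mechanically. The following sketch assumes a hypothetical hit-count lookup; the German forms of 'Haus' and the counts attached to them are invented placeholders, not real Google figures.

```python
# Totting up hits over inflected forms. get_hits() stands in for a
# hypothetical hit-count lookup; the German forms of 'Haus' and the
# counts attached to them are invented placeholders, not real figures.

def total_hits(forms, get_hits):
    """Sum the hit counts over all inflected forms of 'the same word'."""
    return sum(get_hits(form) for form in forms)

fake_counts = {"Haus": 1_000_000, "Hauses": 120_000, "Häuser": 400_000, "Häusern": 90_000}
print(total_hits(fake_counts, fake_counts.get))   # 1,610,000 in this invented example
```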

To be fair, Google was optimised for people seeking content, not for people performing searches that draw on their skill at the language game. The only Internet search engine specifically designed for language searches is the author's own Translation Search Machine Keybot, which combs the Web just like Google but only indexes texts that are available in more than one language. Unlike all other search engines, Keybot does not display entire pages; its index covers individual excerpts of a translation, in other words a section, a heading or the contents of a table cell. In a version not yet available on the Internet, the segmentation is taken even further and only individual domains are indexed. The result is the first language dictionary based on artificial intelligence. When launched, it will probably already be the most comprehensive dictionary ever made… as well as the dictionary containing the highest number of errors.
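A segment-level index of this kind could, in its simplest form, look something like the sketch below: a simplified illustration rather than Keybot's actual implementation, with the class, its field names and the example URL all invented purely for the sake of the example.

```python
# A speculative, simplified sketch of a segment-level index: aligned excerpts
# (source segment, its translation, the page they came from) are filed under
# every word they contain. The class, field names and example URL are
# invented for illustration and are not Keybot's actual implementation.

from collections import defaultdict

class SegmentIndex:
    def __init__(self):
        self._index = defaultdict(list)   # word -> list of (source, translation, url)

    def add_segment(self, source, translation, url):
        for text in (source, translation):
            for word in text.lower().split():
                self._index[word].append((source, translation, url))

    def search(self, term):
        return self._index.get(term.lower(), [])

idx = SegmentIndex()
idx.add_segment("seit Jahren", "for years", "https://example.org/de-en")
print(idx.search("jahren"))   # -> [('seit Jahren', 'for years', 'https://example.org/de-en')]
```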

The technology used for performing searches on the Internet is barely 10 years old and is still in its infancy. Within the foreseeable future we will witness the advent of search engines that have been optimised to only take account of linguistic aspects and the knowledge derived from our functional use of language. More and more frequently today, we're seeing cumbersome lexicographical works gathering dust on the bookshelves of young Internet users who opt instead to use Google, which is faster and more intelligent. Search engines will fundamentally change and also radically simplify how we deal with language. Man's age-old dream of a universal grammar will finally be fulfilled. In future, the one, simple rule applying to prescriptive spelling and grammar will be this: correct usage is frequent usage and frequent usage is correct usage. And anyone who claims that rule is wrong should bear in mind that today's errors are tomorrow's rules!


About the author:
Martin Bächtold studied history, German, philosophy and economics in Fribourg, Geneva, New York, Stanford and Zurich. In 1991 he set up the world's first fully automated translation agency, which passes assignments on to specialist translators. In 2005, he launched Keybot, the Internet's first search engine for translators. Keybot is one of the world's largest translation archives, finding search strings and their translations on multilingual websites.