Crowdsourcing editorial decisions with Google’s Ngram

For those of us beholden to a style or branding guide (or both), a conversation for the ages about word choice is inevitable. We might even be cognizant of its absurdity as it’s happening: “I can’t believe we’re talking about this.” “But it’s for [insert seemingly important project here]!” “I can’t believe we’re still talking about this.” We try to anticipate that perfect turn of phrase to which our audience will respond, but the more we beat the thing to death, the more we doubt ourselves.

But what if we could back up our choice with data from all of the published works in the world? In an era of crowdsourcing everything from galaxy classification to municipality renaming, Google brings us Ngram, which I was fortunate to learn about during a recent presentation at their offices.

A data-mining offshoot of Google’s initiative to digitize every book ever published (currently 15 million out of 129 million), Ngram allows for a simple search on the frequency of words or phrases over a specified time span, from a choice of languages (as of this posting, they are up to 11 corpora, including five for English).
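To make the idea concrete, here is a minimal sketch in Python of the kind of counting that sits behind a frequency-over-time chart. It is not Google’s method: the function name `yearly_frequency`, the `toy_corpus` dictionary, and the sentences in it are all made up for illustration, and the tokenizing is just a lowercase split on whitespace with no handling of punctuation.

```python
def yearly_frequency(corpus, phrase):
    """Return {year: share of that year's n-grams that match `phrase`}.

    `corpus` maps a year to a list of text strings (a hypothetical layout,
    chosen for this sketch only).
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    result = {}
    for year, texts in sorted(corpus.items()):
        hits = 0
        total = 0
        for text in texts:
            words = text.lower().split()
            # Slide a window of n words across the text to build its n-grams.
            grams = list(zip(*(words[i:] for i in range(n))))
            total += len(grams)
            hits += sum(1 for gram in grams if gram == target)
        result[year] = hits / total if total else 0.0
    return result


# Invented miniature corpus, for illustration only.
toy_corpus = {
    1880: ["she would not deign to reply", "he did not deign to look"],
    1980: ["she would not bother to reply", "he did not deign to look up"],
}

freq = yearly_frequency(toy_corpus, "deign")
for year, share in freq.items():
    print(year, round(share, 3))
```

The share is relative to everything published in the same year, so a year with more scanned text does not automatically look like a year with more usage. A two-word phrase works the same way, for instance `yearly_frequency(toy_corpus, "deign to")`.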

For example, “deign” is one of my favorite words, probably because I like to sound ironically antiquated. The Ngram of usage since 1880 confirms my belief that it is on the decline:

[Image: Ngram chart of “deign” since 1880]

…and a good editor would tell me to get off my high-falutin’ horse and pick another word.

For Ngram co-creator Jon Orwant, the Q&A following his presentation was feverish (no surprise, given the audience was a mix of programmers and literary types). Questions abounded about the intricacies of his methodology and the quality of the sample (he assured us that the minds behind Google’s search algorithms could also account for the potentially skewed sampling of the books they are scanning). As I pondered my crowdsourcing proposition, I wondered whether published works really capture the zeitgeist, especially in an era when so much is communicated off the printed page, via blogs, tweets, even video.

Still, I challenged myself to create an Ngram that perfectly demonstrated how a word choice could be made based on traceable growth. As a starting point, here’s the use of “world wide web” plateauing against “Web” and “Internet.”

[Image: Ngram chart comparing “world wide web,” “Web,” and “Internet”]

It’s an obvious choice, but I imagine editors faced this conundrum in the not-too-distant past. Other ideas?
