800 Episodes Word Cloud

On the occasion of Wondermark’s eight hundredth episode, I thought I would celebrate by looking at a complete corpus of words used in Wondermark, and creating a cloud from them (similar to my existing tag cloud of subject matter):

“Huh,” I thought to myself, “I suppose it is unsurprising that the most common words used in a large sample of comics probably closely resemble a list of common words found in the language in general.”

So no great discoveries here, unfortunately. It’s further complicated by the fact that the text I’m using as a corpus is an export of my Oh No Robot database, which contains user-submitted transcriptions of all my comics. Those transcriptions often contain transcriber-invented character names and extensive scene descriptions (both of which are great, but which somewhat muddy the dataset). The heavy incidence of the words “man” and “woman” in the cloud, for example, is probably due to transcriptions reading something like:

Man: I have started a bean farm.
Woman: We’ll be millionaires!
Man: Not if flies eat the crops first.
Woman: Time to invest heavily in pesticides.

In that sample transcription, the words “man” and “woman” both appear twice as frequently as any other word, despite not occurring in the dialogue at all.
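
To make that concrete, here is a rough Python sketch that counts the words in the sample transcription twice: once as written, and once with the speaker labels stripped. The helper names and the label pattern (a short name followed by a colon at the start of a line) are my own assumptions for illustration. Real transcriptions are surely messier than this.

```python
import re
from collections import Counter

# The sample transcription from above (straight apostrophe for simplicity).
SAMPLE = """\
Man: I have started a bean farm.
Woman: We'll be millionaires!
Man: Not if flies eat the crops first.
Woman: Time to invest heavily in pesticides.
"""

# A "word" here is a run of letters and apostrophes (straight or curly).
WORD_RE = re.compile(r"[a-z'’]+")

# Assumed label format: up to 30 non-colon characters, then a colon,
# at the start of a line. Dialogue that opens with "Note: ..." would
# be clipped too, so treat this as a heuristic only.
SPEAKER_RE = re.compile(r"^[^:\n]{1,30}:[ \t]*", re.MULTILINE)


def word_counts(text):
    """Return a Counter of lowercased words in the text."""
    return Counter(WORD_RE.findall(text.lower()))


def strip_speaker_labels(text):
    """Remove a leading 'Name:' label from every line."""
    return SPEAKER_RE.sub("", text)


raw = word_counts(SAMPLE)
dialogue_only = word_counts(strip_speaker_labels(SAMPLE))

print(raw["man"], raw["woman"])                      # 2 2
print(dialogue_only["man"], dialogue_only["woman"])  # 0 0
```

With the labels included, “man” and “woman” are the only words counted more than once. With them stripped, neither appears at all.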

It’d be neat to see, instead of a brute word-frequency cloud, something like a collection of statistically improbable phrases, or words that show up in Wondermark once and only once…things like that. I wonder what interesting things could be mined from the data? If you’d like to play around with the corpus yourself, dirty as the set is, here’s the text file I used. If you derive anything neat, let us know!
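
As a starting point for the “once and only once” idea, here is a small Python sketch that lists the words appearing exactly one time in a block of text. The function name and the tiny sample string are made up for illustration. To run it on the corpus, read the text file into a string first. Statistically improbable phrases would need a reference corpus of general English to compare against, so I have left those out.

```python
import re
from collections import Counter


def words_used_once(text):
    """Return, in alphabetical order, every word that occurs exactly once."""
    counts = Counter(re.findall(r"[a-z'’]+", text.lower()))
    return sorted(word for word, n in counts.items() if n == 1)


# A made-up sample, not taken from the corpus.
sample = "the bean farm and the fly farm"
print(words_used_once(sample))  # ['and', 'bean', 'fly']
```

Note that the speaker labels and scene descriptions would pollute this list as well, so it is probably worth cleaning those out before trusting the results.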