I'll start by briefly explaining some of the characteristics I decided to use to measure how "literary" a piece of text is:
- Sentence length
- Average word length
- Unique words
- Vocabulary Score (unique words / total words)
- Vocabulary Deviation
- Character Deviation
- Word Entropy
- Character Entropy
Sentence length and word length were calculated by splitting the text into components, with the splitting parameter for sentences being a full stop and for words being a space. Note that words can therefore include punctuation: "this" counts as a different word from "this." when it appears at the end of a sentence. This preserves some information about where words are used within sentences.
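The splitting described above can be sketched in a few lines of Python (the function names are mine, not necessarily the original program's):

```python
def sentences(text):
    # full stop as the sentence delimiter; drop empty fragments
    return [s.strip() for s in text.split(".") if s.strip()]

def words(text):
    # space as the word delimiter, so punctuation stays attached:
    # "this." at the end of a sentence differs from "this"
    return [w for w in text.split(" ") if w]

text = "This is short. This is not."
avg_sentence_length = sum(len(words(s)) for s in sentences(text)) / len(sentences(text))
avg_word_length = sum(len(w) for w in words(text)) / len(words(text))
```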
Unique words are simply words without repeats; for example, in the text chunk "hi hi there there there", the unique words are "hi" and "there". These can be used to produce what I've called a vocabulary score. The vocabulary score essentially gives us a measure of the range of words used that is (almost) independent of the length of the text. I say almost because the vocabulary of the English language is finite, so as text chunks become longer (for example, a very long novel such as War and Peace) the vocabulary score will naturally decrease.
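In code, the vocabulary score is a one-liner (again, a sketch rather than the original implementation):

```python
def vocabulary_score(ws):
    # unique words divided by total words
    return len(set(ws)) / len(ws)

# "hi hi there there there": 2 unique words out of 5 total
print(vocabulary_score("hi hi there there there".split(" ")))  # 0.4
```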
The vocabulary and character deviations involve counting the number of repeats of each word and character, then calculating the standard deviation of this data set. The former is a good measure of how comfortable an author is using a wide range of words regularly, as opposed to reaching for most of them only occasionally.
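This might look something like the following (I've assumed the population standard deviation; the original could equally use the sample version):

```python
from collections import Counter
from statistics import pstdev

def repeat_deviation(items):
    # tally how many times each distinct item appears,
    # then take the standard deviation of those counts
    counts = list(Counter(items).values())
    return pstdev(counts)

# "hi" appears twice and "there" three times -> pstdev([2, 3]) = 0.5
print(repeat_deviation("hi hi there there there".split(" ")))
```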
Finally, two very important measures of "literariness" are the entropy of words and characters. During initial experiments I found that these tend to stay fairly constant across the "literature" used for training. To calculate the entropy of a set, the following calculation is used:
Entropy = - SUM [ Ki log2 ( Ki ) ],
where Ki is the probability of a member of the set (in this case a word or character) occurring.
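The entropy formula above translates directly into code (my sketch, with the probabilities estimated from the frequency counts):

```python
from collections import Counter
from math import log2

def entropy(items):
    total = len(items)
    # Ki: probability of each distinct word or character occurring
    probs = [count / total for count in Counter(items).values()]
    return -sum(p * log2(p) for p in probs)

# two equally likely words carry exactly one bit of entropy
print(entropy(["hi", "there"]))  # 1.0
```

The same function works for characters by passing `list(text)` instead of a word list.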
The next step of the experiment was to find a transform of the values of the above features that yielded very similar results for all of my training set.
The following set of books were used in the training set:
- Franz Kafka, Metamorphosis
- Oscar Wilde, The Picture of Dorian Gray
- Sir Arthur Conan Doyle, The Hound of the Baskervilles
- Joseph Conrad, Heart of Darkness
- Jules Verne, Around the World in 80 Days
- J.D. Salinger, The Catcher in the Rye
- Anthony Burgess, A Clockwork Orange
- Bram Stoker, Dracula
- George Orwell, Down and Out in Paris and London
And a value, which we'll call L, was determined to be the smallest-deviating transform of the textual features, where:
L = (Word entropy + Character entropy - Word length) / Vocabulary score
Other features were found to range too much between the works.
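Putting the pieces together, the L value can be sketched as follows (the tokenisation and helper names are my assumptions, not necessarily the original program's):

```python
from collections import Counter
from math import log2

def entropy(items):
    total = len(items)
    return -sum((c / total) * log2(c / total) for c in Counter(items).values())

def L_score(text):
    # naive space tokenisation, as described earlier
    ws = [w for w in text.split(" ") if w]
    word_length = sum(len(w) for w in ws) / len(ws)
    vocab_score = len(set(ws)) / len(ws)
    # L = (word entropy + character entropy - word length) / vocabulary score
    return (entropy(ws) + entropy(list(text)) - word_length) / vocab_score
```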
Given an average L value, the distance from that mean was used to give any test text a literature score. Unfortunately this blog did not receive the title of literature, but then neither did Harry Potter and the Philosopher's Stone, or Twilight, or a Mills and Boon title, A Fool For Love, so it seems to work quite nicely.
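The scoring step described above amounts to a distance from the training mean (the mean and any threshold here are placeholders; the original values aren't published):

```python
def literature_score(l_value, mean_l):
    # smaller distance from the training-set mean of L -> more "literary"
    return abs(l_value - mean_l)
```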
A book I would consider literature that was not in the training set is Kidnapped by Robert Louis Stevenson, and it did indeed score highly on the literature front, which suggests the system works!
Below is the application; just click to launch it and post a comment with any results you might find:
Currently the results are a bit spurious, mostly because the thresholds are fiddly to get right, but with a bit of work and some other parameters I have been thinking about, this kind of textual analysis really might help us spot the next classics! The program currently works best with whole books; there are loads of free eBooks now in the public domain that you can test the software with!