“I before E, except after C”
This little rhyme, taught to children to help them remember the spelling of various words, is often the butt of jokes for being wrong so often. Words like deficiencies even go so far as to break the rule twice.
After a debate about this rule in the office, and reading yet another inconclusive article online, it occurred to me that none of the people saying the rule should be scrapped has actually sat down and determined whether it has any real value. What needs to happen is for someone to sit down and count how often the rule is violated in the English language. Don’t get me wrong, I don’t blame anyone for not wanting to sit down with a dictionary and count all the words; it’s a boring and repetitive task. What did occur to me is that I’m a computer guy, and computers are really good at boring and repetitive tasks.
Using the power of computing, I intend to put this damned debate to rest, once and for all.
In order to do this, the first thing I need is a list of words to test. I am not going through the Oxford English Dictionary and typing every single word, so I have to go online and find a list. The most logical place to look is on hacking sites, where it is not uncommon to maintain lists of words for guessing people’s passwords.
Another source I can use would be one of the online dictionaries. This would likely give me a highly accurate representation of words; however, this would take more effort on my part, and this is a stupid experiment to begin with: I’m not wasting a lot of time on it.
Probably the most relevant source would be a wordlist maintained for cryptographic analysis, optimally one maintained by the NSA (they have the most time and motivation to keep it up to date). I have created word lists for this purpose, but I just ran Moby Dick through a program to count the occurrence of words. Sharon has pointed out that nobody but me still speaks like that.
Originally I had obtained this word list:
However, the owner has decided that he needs to hide it away. Now I will use Wiktionary’s television show word frequency list. Not the best source of data, but it will do for my purposes.
Testing the Rule
(I had originally published these results on Wikipedia; however, they have a policy against being the initial publisher of research. It makes sense, but unfortunately it means that these results got removed with nowhere else to go. To see the original Wikipedia page, please go to http://en.wikipedia.org/w/index.php?title=I_before_E_except_after_C&oldid=52333742)
Using a lexicon of 627,935 words, I ran several SQL queries to test the validity of this rule. The results are interesting, but the lexicon appears to contain several instances of the same word (a -cy suffix and its plural variant). An attempt was made to handle these cases, but the results remain suspect:
1. I before E, except after C
   1. "IE" suffix ignored
Lexicon Size................... : 627,935
Words with apostrophes in them. : 127,774
Words ending in "ie"/"ei"...... :   2,220
New Lexicon Size............... : 497,941

Sample Size (all rules applied) :  26,050
Success                         :  20,380   78.2341%
Failure                         :   5,670   21.7658%
                                   ------   --------
Check Sum                       :  26,050   99.9999%
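The SQL queries themselves are not reproduced here. As a rough Python sketch of the same filtering (not the queries actually used), the rules above might look like the following. The function names `classify` and `tally` are made up for illustration, and the sketch assumes that a word counts as a failure if any one of its "ie"/"ei" pairs breaks the rule, which may not match how the original queries treated words with several pairs.

```python
def classify(word):
    """Return True if the word follows the rule, False if it breaks it,
    and None if the word is excluded or has no "ie"/"ei" pair at all."""
    w = word.lower()
    # Exclusions from the tables: apostrophes and words ending in "ie"/"ei".
    if "'" in w or w.endswith(("ie", "ei")):
        return None
    verdicts = []
    for i in range(len(w) - 1):
        pair = w[i:i + 2]
        if pair not in ("ie", "ei"):
            continue
        after_c = i > 0 and w[i - 1] == "c"
        # After C we expect "ei"; everywhere else we expect "ie".
        verdicts.append(pair == ("ei" if after_c else "ie"))
    if not verdicts:
        return None
    # Assumption: a single violating pair makes the whole word a failure.
    return all(verdicts)


def tally(words):
    """Count (success, failure) over the words that are in the sample."""
    success = failure = 0
    for word in words:
        verdict = classify(word)
        if verdict is True:
            success += 1
        elif verdict is False:
            failure += 1
    return success, failure
```

With this sketch, `classify("believe")` and `classify("receive")` are both `True`, `classify("deficiencies")` is `False`, and `classify("movie")` is `None` because of the "ie" suffix exclusion.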
Since the lexicon itself was considered flawed, a new lexicon was obtained and the same rules applied: http://www.comp.lancs.ac.uk/ucrel/bncfreq/flists.html
Lexicon Size................... : 4845
Words with apostrophes in them. :   12
Words ending in "ie"/"ei"...... :    4
New Lexicon Size............... : 4828

Sample Size (all rules applied) :  156
Success                         :  137   87.8205%
Failure                         :   19   12.1794%
                                  ----   --------
Check Sum                       :  156   99.9999%
Unfortunately, this does not take into account how frequently we use these words. The dataset gives each word's frequency of occurrence in PPM (parts per million), so the same test can be weighted by frequency:
                      Written                 Spoken
                -------------------    -------------------
Lexicon Weight  817,814                835,197
Sample Weight    13,283                  6,910
Success           8,553   64.3905%       3,819   55.2677%
Failure           4,730   35.6094%       3,091   44.7322%
Check Sum        13,283   99.9999%       6,910   99.9999%
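To show what weighting by frequency means, here is a minimal Python sketch under the same assumptions as the rule test: instead of counting each word once, it adds up the word's PPM value. The names `follows_rule` and `weighted_tally` are hypothetical, and the frequencies in the usage note are invented for illustration; they are not taken from the dataset.

```python
def follows_rule(word):
    """True/False for words in the sample, None for everything else."""
    w = word.lower()
    # Same exclusions as the unweighted test.
    if "'" in w or w.endswith(("ie", "ei")):
        return None
    results = [
        # After C we expect "ei"; everywhere else we expect "ie".
        w[i:i + 2] == ("ei" if i > 0 and w[i - 1] == "c" else "ie")
        for i in range(len(w) - 1)
        if w[i:i + 2] in ("ie", "ei")
    ]
    return all(results) if results else None


def weighted_tally(entries):
    """entries: iterable of (word, frequency_ppm) pairs.

    Returns (success_weight, failure_weight, success_share)."""
    success = failure = 0
    for word, ppm in entries:
        verdict = follows_rule(word)
        if verdict is True:
            success += ppm
        elif verdict is False:
            failure += ppm
    total = success + failure
    return success, failure, (success / total if total else 0.0)
```

For example, `weighted_tally([("their", 3), ("believe", 1), ("the", 10)])` returns `(1, 3, 0.25)`: one word passes and one fails, but the failing word carries three times the weight, and "the" is not in the sample at all.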
Accurate but Misleading
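Counted by distinct words, the rule appears to handle almost 90% of the cases, but the words that violate it are disproportionately common. Weighted by how often the words are actually used, the rule holds only about 64% of the time in written English and about 55% of the time in spoken English.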