I before E

“I before E, except after C”

This little rhyme, taught to children to help them remember the spelling of various words, is often the butt of jokes for being wrong so often. Words like deficiencies even go so far as to break the rule twice.

After a debate about this rule in the office, and after reading yet another inconclusive article online, it occurred to me that nobody who says the rule needs to be disposed of has actually sat down and determined whether it has any real value. What needs to happen is for someone to sit down and count how often the rule is violated in the English language. Don’t get me wrong, I don’t blame anyone for not wanting to sit down with a dictionary and count all the words; it’s a boring and repetitive task. What did occur to me is that I’m a computer guy, and computers are really good at boring and repetitive tasks.

Using the power of computing, I intend to put this damned debate to rest, once and for all.

Word Lists

To do this, the first thing I need is a list of words to test. I am not going through the Oxford English Dictionary and typing every single word, so I have to go online and find a list. The most logical place to look is on hacking sites, where it is not uncommon to maintain lists of words for guessing people’s passwords.

Another source I can use would be one of the online dictionaries. This would likely give me a highly accurate representation of words; however, this would take more effort on my part, and this is a stupid experiment to begin with: I’m not wasting a lot of time on it.

Probably the most relevant source would be a word list maintained for cryptographic analysis, optimally one maintained by the NSA (they have the most time and motivation to keep it up to date). I have created word lists for this purpose, but I just ran Moby Dick through a program to count the occurrences of words. Sharon has pointed out that nobody but me still speaks like that.
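For the curious, counting word occurrences in a book takes only a few lines. This is a minimal Python sketch, not the program I actually used; the function name word_counts and the choice to treat runs of letters and apostrophes as words are assumptions made for illustration.

```python
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count how often each word appears in a block of text.

    Assumption: a "word" is any run of letters or apostrophes,
    compared case-insensitively.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


if __name__ == "__main__":
    counts = word_counts("Call me Ishmael. Call me whatever you like.")
    # Show the most frequent words first
    for word, count in counts.most_common(3):
        print(word, count)
```

Feeding it the full text of a novel gives a frequency list, but one that reflects the vocabulary of that single book.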

Originally I had obtained this word list:

http://www.comp.lancs.ac.uk/ucrel/bncfreq/flists.html

However, the owner has decided that he needs to hide it away. Now I will use Wiktionary’s television show word frequency list. Not the best source of data, but it will do for my purposes.

http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists#TV_and_movie_scripts

Testing the Rule

(I had originally published these results on Wikipedia; however, they have a policy against being the initial publisher of research. It makes sense, but unfortunately it means that these results got removed with nowhere else to go. To see the original Wikipedia page, please go to http://en.wikipedia.org/w/index.php?title=I_before_E_except_after_C&oldid=52333742)

Using a lexicon of 627,935 words, I ran several SQL queries to test the validity of this rule. The results are interesting, but the lexicon appears to contain several instances of the same word (for example, a word with a -cy suffix and its plural variant). An attempt was made to handle these cases, but the results remain suspect:

Rules:

1. I before E, except after C
2. "IE" suffix ignored
 Lexicon Size................... : 627,935
 Words with apostrophes in them. : 127,774
 Words ending in "ie"/"ei"...... :   2,220
 New Lexicon Size............... : 497,941
 Sample Size (all rules applied) :  26,050

                         Success : 20,380  78.2341%
                         Failure :  5,670  21.7658%
                                   ------  --------
                       Check Sum : 26,050  99.9999%
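
The SQL queries themselves are not reproduced here, so the following is only a Python sketch of how such a count could be made, not the original method. It assumes one particular reading of the rules: a word is in the sample if it contains "ie" or "ei", has no apostrophe, and does not end in "ie" or "ei"; it fails the rule if it contains "cie" or an "ei" that does not directly follow a "c". The names in_sample, follows_rule and tally are made up for this example.

```python
import re

# Assumed definition of a violation: "ie" right after "c",
# or "ei" that is NOT right after "c".
VIOLATION = re.compile(r"cie|(?<!c)ei")


def in_sample(word: str) -> bool:
    """True if the word passes the filters and contains "ie" or "ei"."""
    w = word.lower()
    if "'" in w or "\u2019" in w:      # drop words with apostrophes
        return False
    if w.endswith(("ie", "ei")):       # drop the "ie"/"ei" suffix words
        return False
    return "ie" in w or "ei" in w


def follows_rule(word: str) -> bool:
    """True if the word contains no violation of the rhyme."""
    return VIOLATION.search(word.lower()) is None


def tally(words):
    """Return (success, failure) counts over the sampled words."""
    sample = [w for w in words if in_sample(w)]
    success = sum(1 for w in sample if follows_rule(w))
    return success, len(sample) - success


if __name__ == "__main__":
    demo = ["believe", "receive", "weird", "science", "pie", "don't", "cat"]
    print(tally(demo))  # (2, 2)
```

A different reading of the rule (for example, counting each "ie"/"ei" occurrence instead of each word) would give different totals, so the figures above should not be expected to match this sketch exactly.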

Since the lexicon itself was considered flawed, a new lexicon was obtained and the same rules applied: http://www.comp.lancs.ac.uk/ucrel/bncfreq/flists.html

 Lexicon Size................... : 4845
 Words with apostrophes in them. :   12
 Words ending in "ie"/"ei"...... :    4
 New Lexicon Size............... : 4828
 Sample Size (all rules applied) :  156

                         Success :    137  87.8205%
                         Failure :     19  12.1794%
                                   ------  --------
                       Check Sum :    156  99.9999%

Unfortunately, this does not take into account how frequently we use these words. The dataset gives the frequency of each word in PPM (parts per million).

                       Written             Spoken
                 ------------------- -------------------
 Lexicon Weight   817,814             835,197
  Sample Weight    13,283               6,910             

        Success     8,553  64.3905%     3,819  55.2677%
        Failure     4,730  35.6094%     3,091  44.7322%
      Check Sum    13,283  99.9999%     6,910  99.9999%
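
The weighted figures come from adding up each word's PPM value instead of counting each word once. As a rough Python sketch of that idea (again, not the original queries), assume the list is available as (word, ppm) pairs and that the rule is read the same way as before: "cie", or "ei" not directly after "c", counts as a failure. The function name weighted_tally and the numbers in the demo are invented for illustration and are not taken from the dataset.

```python
import re

# Assumed definition of a violation: "cie", or "ei" not preceded by "c".
VIOLATION = re.compile(r"cie|(?<!c)ei")


def weighted_tally(entries):
    """Sum PPM for rule-following and rule-breaking words.

    entries: iterable of (word, ppm) pairs.
    Returns (success_ppm, failure_ppm).
    """
    success = failure = 0
    for word, ppm in entries:
        w = word.lower()
        # Same filters as the unweighted count: no apostrophes,
        # no "ie"/"ei" suffix, and the word must contain "ie" or "ei".
        if "'" in w or w.endswith(("ie", "ei")):
            continue
        if "ie" not in w and "ei" not in w:
            continue
        if VIOLATION.search(w):
            failure += ppm
        else:
            success += ppm
    return success, failure


if __name__ == "__main__":
    # Made-up frequencies, purely to show the arithmetic.
    demo = [("believe", 10), ("their", 50), ("receive", 5),
            ("the", 1000), ("pie", 7)]
    ok, bad = weighted_tally(demo)
    print(ok, bad)                        # 15 50
    print(f"{100 * ok / (ok + bad):.1f}%")  # share of usage that follows the rule
```

One very common rule-breaker can outweigh many rare rule-followers, which is exactly what a per-word count hides.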

Accurate but Misleading

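Counted word by word, the rule appears to handle almost 90% of the cases. Weighted by usage, though, it holds only about 64% of the time in written English and 55% in spoken English, because the words that violate the rule are, on average, used more frequently than the words that follow it.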
