About the wordlist
The most time-consuming part in this project has been the compilation of the
wordlist. Writing the reference implementation has been trivial in comparison.
In the compilation of the wordlist I have scanned hundreds of thousands of
words using a combination of automated tools and manual sorting. I estimate
that over 300 hours have gone into the creation of the wordlist and around 50
ad-hoc awk scripts.
The wordlist still isn't perfect. See how you can help improve it.
Remember that at this stage the list is not final and encoded information will
not be compatible to future versions. When the list reaches version 1.0 it
will be frozen and no modifications will be done except, perhaps, spelling
corrections that are still accepted by the soundalike matching.
Word selection criteria
- The wordlist contains 1626 words.
- All words are between 4 and 7 letters long.
- No word in the list is a prefix of another word (e.g. visit, visitor).
- Five letter prefixes of words are sufficient to be unique.
The rest of the criteria are less strict. You may find exceptions to all of
them because it is difficult to satisfy them all at the same time.
- The words should be usable by people all over the world. The list is far
from perfect in that respect. It is heavily biased towards western culture
and English in particular. The international vocabulary is simply not big
enough. One can argue that even words like "hotel" or "radio" are not truly
international. You will find many English words in the list but I have tried
to limit them to words that are part of a beginner's vocabulary or words that
have close relatives in other european languages. In some cases a word has
a different meaning in another language or is pronounced very differently
but for the purpose of the encoding it is still ok - I assume that when the
encoding is used for spoken communication both sides speak the same language.
- The words should have more than one syllable. This makes them easier to
recognize when spoken, especially over a phone line. Again, you will find
many exceptions. For one syllable words I have tried to use words with 3 or
more consonants or words with diphthongs, making for a longer and more distinct
pronounciation. As a result of this requirement the average word length has
increased. I do not consider this to be a problem since my goal in limiting
the word length was not to reduce the average length of encoded data but to
limit the maximum length to fit in fixed-size fields or a terminal line
width.
- No two words on the list should sound too much alike. Soundalikes
such as "sweet" and "suite" are ruled out. One of the two is chosen and the
other should be accepted by the decoder's soundalike matching code or using
explicit aliases for some words.
- No offensive words. The rule was to avoid words that I would not
like to be printed on my business card. I have extended this to words that
by themselves are not offensive but are too likely to create combinations
that someone may find embarrassing or offensive. This includes words dealing
with religion such as "church" or "jewish" and some words with negative
meanings like "problem" or "fiasco". I am sure that a creative mind (or a
random number generator) can find plenty of embarrasing or offensive word
combinations using only words in the list but I have tried to avoid the more
obvious ones. One of my tools for this was simply a generator of random word
combinations - the problematic ones stick out like a sore thumb.
- Avoid words with tricky spelling or pronounciation. Even if the receiver
of the message can probably spell the word close enough for the soundalike
matcher to recognize it correctly I prefer avoiding such words. I believe this
will help users feel more comfortable using the system, increase the level
of confidence and decrease the overall error rate. Most words in the list can
be spelled more or less correctly from hearing, even without knowing the word.
- The word should feel right for the job. I know, this one is very
subjective but some words would meet all the criteria and still not feel
right for the purpose of mnemonic encoding. The word should feel like one of
the words in the radio phonetic alphabets (alpha, bravo, charlie, delta etc).
Notes:
When checking for soundalikes I have found that the standard soundex
algorithms are far too liberal and find too many words that supposedly sound
similar. It may be true that all vowels are pronounced as schwa in certain
cases, but completely eliminating vowels from the soundex comparison is going
a little too far . The consonant groups in the soundex algorithm are too
general while at the same time ignoring consonants that sound alike over a
limited bandwidth channel such as "F" and "S".
If you need a shorter wordlist for any purpose please use words from the
beginning of the list. It is sorted according to my ranking for word
quality.
The phonetic pronunciation database in the Moby wordlist has been particularly
useful in finding soundalikes by comparing the distance between the phonetic
representations rather than the standard spelling.
Actually, not all words are 4 to 7 letters long. 7 extra words with 3 letters
each are used for encoding 24 bit remainders i.e. when the encoded data length
is 3 modulu 4.
How you can help
You can help improve the wordlist by suggesting new words. Finding new words
that meet the criteria is not easy. As I approached my target of 1626 words
I found it was becoming asymptotically more difficult to find new words.
If you have a word to suggest:
- Verify that the word does not appear in the list.
- Check if the word already appears in the rejects list.
- Evaluate the word according to the above criteria.
- Send me email.
I would also appreciate your opinion about words already in the list. Remember
that if you find a word that you think shouldn't be there I can't just remove
it - I need a replacement first. If English is not your native language I
would especially like to hear your opinion about the usability of the list.
If you can show the list to potential users who know even less English than
you it would be even better.
The list is quite long. If you only have time for reviewing a small part make
sure it isn't the beginning of the list. I don't want to get only comments
about the first words in the list. To display a random selection of words you
can use this command:
head --bytes nnn /dev/urandom | mnencode
I must say it was quite frustrating to find so many good 8 letter
international words, but the 7 letter limit is still one of my primary criteria
that have no exceptions. If you feel that increasing the word quality and
making the words more international is important enough to reconsider 8 letter
words please tell me.
Resources used in compiling this wordlist
back to homepage