As most reasonable people are familiar with the Harry Potter books,
their content serves as ideal material for building a mnemonic system.
The mnemonic major system, in particular, is used to memorize number sequences.
In order to implement the steps outlined in this post you need the content of the Harry Potter books (or other book(s) if you prefer).
In a previous post I showed you how to
download fantasy books and extract their text.
Among the downloaded data were the Harry Potter books which I will use in this post.
Step 1: Learn the sound-number mapping
In the mnemonic major system each number from 0 to 9 is associated with one or more
consonant sounds. Use the following table as a reference.
Number | Sounds (IPA) | Letters with example words
0 | s, z | s (see), c (city), z (zero), x (xylophone)
1 | t, d, ð, θ | t (tee), d (dad), th (though), th (think)
2 | n | n (net)
3 | m | m (mother)
4 | r | r (right), l (colonel)
5 | l | l (love)
6 | ʤ, ʧ, ʃ, ʒ | ch (cheese), j (juice), g (ginger), sh (shell), c (cello, special), cz (czech), s (tissue, vision), sc (fascist), sch (eschew), t (ration), tsch (putsch), z (seizure)
7 | k, g | k (kid), c (cake), q (quarter), g (good), ch (loch)
8 | f, v | f (face), ph (phone), v (alive), gh (laugh)
9 | p, b | p (power), b (baby)
In English, letters are pronounced differently depending on context,
which is why some letters appear in several rows. In the end,
only the sound matters, not the spelling.
Here are a few examples for words, their IPA representation and the number they encode.
To familiarize yourself with the IPA notation, try to read the following excerpt.
Consider the same lines converted to number sequences.
I’m going to show you how to write a training program to internalize
the concept of mapping sounds to numbers in Step 4. But first, you need the ability
to convert text to numbers automatically. This happens in two steps: first, the text is converted
to IPA; then, the IPA is converted to numbers.
Step 2: Converting IPA to numbers
The process of converting IPA to numbers is very simple: I iterate through the IPA characters
and, if a number is associated with a character, append it to the result.
For example, major_decode_from_ipa('dɪfəkəlt') yields [1, 8, 7, 5, 1].
Additionally, I define a couple of functions for converting number sequences to and from strings.
For example, numseq_to_str([1, 8, 7, 5, 1]) yields '18751'.
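A minimal sketch of such a decoder, as a lookup table over IPA consonant characters (the exact character set may differ depending on the IPA source; for instance, affricates like ʤ are sometimes emitted as two characters such as dʒ, and the counterpart numseq_from_str is an assumed name):

```python
# Sketch of the IPA-to-number decoder described above.
# Assumes affricates appear as single characters (ʤ, ʧ).
SOUND_TO_DIGIT = {
    's': 0, 'z': 0,
    't': 1, 'd': 1, 'ð': 1, 'θ': 1,
    'n': 2,
    'm': 3,
    'r': 4,
    'l': 5,
    'ʤ': 6, 'ʧ': 6, 'ʃ': 6, 'ʒ': 6,
    'k': 7, 'g': 7, 'ɡ': 7,
    'f': 8, 'v': 8,
    'p': 9, 'b': 9,
}

def major_decode_from_ipa(ipa):
    # keep only the consonant sounds that carry a digit
    return [SOUND_TO_DIGIT[c] for c in ipa if c in SOUND_TO_DIGIT]

def numseq_to_str(numseq):
    return ''.join(str(n) for n in numseq)

def numseq_from_str(s):
    return [int(c) for c in s]
```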
Step 3: Converting text to IPA
In order to automatically convert text to IPA (and then to numbers) you need an IPA
converter. Python’s eng-to-ipa package is able to convert text to
IPA using the Carnegie Mellon University Pronouncing Dictionary.
According to the docs, eng-to-ipa will reprint words that cannot be found in the CMU dictionary
with an asterisk. Thus, “i’ll” and “sandwiches.” were not found. Clearly, the punctuation is the issue.
I preprocess the text in order to ease the conversion to IPA.
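A hypothetical sketch of such a preprocessing step (the function name and exact rules are illustrative, not necessarily the implementation used here): lowercase the text, normalize curly apostrophes, and strip everything that is not a letter, an apostrophe or a space:

```python
import re

def preprocess(text):
    # illustrative sketch: lowercase, normalize curly apostrophes,
    # and strip the punctuation that keeps eng-to-ipa from finding
    # words in the CMU dictionary
    text = text.lower().replace('\u2019', "'")
    text = re.sub(r"[^a-z' ]+", ' ', text)
    return re.sub(r'\s+', ' ', text).strip()
```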
As you can see, there are 3 ways to pronounce this sentence depending on whether you like
to pronounce sandwich with m, n or nd. In order to use the major system effectively,
you should use the version that sounds most natural to you.
Let’s look at another example.
The word gryffindor was not found in the CMU dictionary, which can be expected.
After a quick search for the word’s pronunciation I found YouGlish,
which uses YouTube videos to find IPAs. While their API is not free, a limited number of IPAs can be scraped
for our purpose.
As you can see, this function is semi-interactive. Without user intervention it will
get stuck on a CAPTCHA. Even with intervention, you’ll eventually reach their daily usage limit
and won’t be able to continue. For our purpose this should be good enough though.
As shown, many words have several possible pronunciations, and you need to choose
your preferred one. Other words are not in the CMU dictionary and
require scraping the IPA from another source, or are not available at all.
For these two reasons, you will need to build your own personal IPA dictionary.
I’m going to build my IPA dictionary by iterating through the words of the Harry Potter books
and adding each word and the corresponding IPA to my dictionary.
First, I define
some functions for managing my dictionary (a simple JSON file in this case).
Next, I iterate through the words of the books and enter
each word and whatever is returned by eng-to-ipa into my dictionary.
Each value in the dictionary is now a list of possible IPAs as that is what
eng-to-ipa returned. Before the dictionary can be used, we need to ensure
that each word has exactly one IPA. If the different IPAs for a word
decode to different numbers we need to ask the user for their preferred
pronunciation. Otherwise, the choice is of no consequence and we simply choose
the first one.
Next, I search the IPA dict for values followed by an asterisk (which, as you’ve seen earlier,
indicates that no IPA was found in the CMU dictionary) and attempt to scrape them from YouGlish.
And that’s all, my IPA dictionary is ready! I define a few functions for
converting text to IPA conveniently.
For example, text_to_ipa(load_ipa_dict(), 'Well done, Harry!') yields wɛl dən hɛri.
Using the major_decode_from_ipa function from the previous step I can now convert text to numbers.
To make it even more convenient, I define a function major_decode_from_text.
This function has the added benefit that it can group the result by the words in the source sentence
which is useful when you’re printing the number sequence for a longer text.
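A self-contained sketch of such a function, assuming a personal IPA dictionary that maps each word to a single IPA string (the SOUND_TO_DIGIT table repeats the mapping from Step 1, and group_by_word is an illustrative parameter name):

```python
SOUND_TO_DIGIT = {
    's': 0, 'z': 0, 't': 1, 'd': 1, 'ð': 1, 'θ': 1, 'n': 2, 'm': 3,
    'r': 4, 'l': 5, 'ʤ': 6, 'ʧ': 6, 'ʃ': 6, 'ʒ': 6, 'k': 7, 'g': 7,
    'ɡ': 7, 'f': 8, 'v': 8, 'p': 9, 'b': 9,
}

def major_decode_from_ipa(ipa):
    return [SOUND_TO_DIGIT[c] for c in ipa if c in SOUND_TO_DIGIT]

def major_decode_from_text(ipa_dict, text, group_by_word=False):
    # look each word up in the personal IPA dictionary, then decode;
    # with group_by_word=True the result keeps one sub-list per word
    words = text.lower().split()
    groups = [major_decode_from_ipa(ipa_dict.get(w, '')) for w in words]
    if group_by_word:
        return groups
    return [n for group in groups for n in group]
```

A real version would run the preprocessing step first so that punctuation doesn’t break the dictionary lookup.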
Step 4: Practice decoding words
In order to use the major system effectively, you need to be able to quickly convert
text to numbers in your mind.
Now that you can convert text to numbers using the computer, it’s easy to write
a simple training program for practicing doing the same thing in your mind.
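Such a training program could be sketched like this (the function names are illustrative; the quiz loop is interactive, with the answer-checking logic kept separate):

```python
import random

SOUND_TO_DIGIT = {
    's': 0, 'z': 0, 't': 1, 'd': 1, 'ð': 1, 'θ': 1, 'n': 2, 'm': 3,
    'r': 4, 'l': 5, 'ʤ': 6, 'ʧ': 6, 'ʃ': 6, 'ʒ': 6, 'k': 7, 'g': 7,
    'ɡ': 7, 'f': 8, 'v': 8, 'p': 9, 'b': 9,
}

def expected_digits(ipa):
    # the digit string the trainee should answer for a word's IPA
    return ''.join(str(SOUND_TO_DIGIT[c]) for c in ipa if c in SOUND_TO_DIGIT)

def interactive_decode_training(ipa_dict, rounds=10):
    # show random words; the trainee types the decoded digit string
    words = random.sample(list(ipa_dict), min(rounds, len(ipa_dict)))
    correct = 0
    for word in words:
        answer = input(f'{word} -> ').strip()
        expected = expected_digits(ipa_dict[word])
        if answer == expected:
            correct += 1
            print('correct!')
        else:
            print(f'wrong, the answer is {expected}')
    print(f'{correct}/{len(words)} correct')
```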
Continue practicing until the text-to-number conversion becomes second nature to you.
Step 5: Practice encoding numbers
When you want to memorize a number sequence using the major system, you need to find an appropriate encoding
for it. The following code helps you practice this concept.
Finding good (memorable) encodings for a given number sequence is a much more creative (and laborious) process
than decoding text. Let’s see if we can use the content of the Harry Potter books to encode number sequences.
Step 6: Find encodings automatically
Given a number sequence I’m going to search the Harry Potter books for suitable encodings.
The processes involved can be slow, so it makes sense to precompute all possible encodings and save them to index files.
Here are a few convenience functions that allow me to create index files from lists of strings
and query them for number sequences.
Next, I’m going to create several indexes for different types of text chunks extracted
from the Harry Potter books.
Well duh! All words that can be encoded are already in our IPA dictionary so
finding words for number sequences is straightforward.
To use this index, I define the following function.
Nouns are generally easier to imagine (and thus memorize) than other types of words.
For this reason, it makes sense to look for nouns specifically when trying to find encodings
for a number sequence.
I’m going to use Python’s NLTK package to process the Harry Potter text and identify nouns.
The process required for this is called part-of-speech tagging (POS tagging).
Here is an example to give you an idea of what the POS tagger does.
Now, I’m not a linguist, but unknown seems like a noun to me in this case, yet it is marked as an adjective (JJ). Still,
NLTK’s POS tagger generally does a good job identifying nouns. You can see that the tags for nouns start with NN.
Note: If you remove the call to preprocess, the tag for Dumbledore will be NNP (proper noun), not NN.
That is because preprocess lowercases the entire text and NLTK uses casing to determine the correct tag.
Thus, the following code will extract all nouns from a piece of text.
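A sketch of that extraction, with the tag filtering kept as a pure function so the NLTK calls (which require the nltk package and its tokenizer and tagger data) stay clearly separated:

```python
def extract_nouns(tagged_tokens):
    # keep every token whose POS tag starts with NN
    # (NN, NNS, NNP and NNPS all denote nouns)
    return [word for word, tag in tagged_tokens if tag.startswith('NN')]

# usage (requires nltk plus its 'punkt' and tagger data):
#   import nltk
#   tagged = nltk.pos_tag(nltk.word_tokenize(text))
#   nouns = extract_nouns(tagged)
```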
Then I can build my noun index like so.
Noun phrases are nouns with modifiers, e.g. ridiculous muggle protection act, restricted section, giant gryffindor hourglass,
slow-acting venoms, mrs weasley, ….
Noun phrases are useful for my purpose because they are as easy to remember as individual nouns (if not easier due to being more concrete)
and have the potential to encode longer number sequences.
The technique I’m going to use to find noun phrases is called Chunking.
Chunking is the segmentation and labelling of multi-token sequences.
In other words it takes tokens (e.g. a tokenized sentence) as input and produces
non-overlapping subsets of those tokens.
The way a list of tokens is chunked is defined by a chunk grammar. Here’s the grammar
I’m going to use to find noun phrases.
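A grammar along these lines, in NLTK’s RegexpParser syntax (a reconstructed sketch; the exact rules in the original may differ):

```python
# reconstructed chunk grammar for noun phrases
grammar = r"""
    NBAR:
        {<JJ.*>*<NN.*>+}    # zero or more adjectives, then one or more nouns
    NP:
        {<NBAR><IN><NBAR>}  # two NBARs connected by a preposition
        {<NBAR>}            # a single NBAR
"""
# chunker = nltk.RegexpParser(grammar)  # requires nltk
```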
The first rule in this grammar says that an NBAR chunk should be formed whenever the chunker finds
zero or more adjectives (of all types, that’s why it’s JJ.* not just JJ) followed
by one or more nouns (of all types).
The second rule says that an NP chunk should be formed whenever the chunker finds
a single NBAR chunk, or two NBAR chunks connected by a preposition.
Examples for phrases that would be caught by the second rule but not by the first are
head protruding over ron, death eater in disguise, ministry of magic, sound of laughter,
ray of purest sunlight, cup of strong tea.
The following is the code I use for extracting noun phrases from any text.
The first function is just a helper function that preprocesses, tokenizes and
POS tags the text, then feeds it to the chunker, iterates over the produced
subsets and reassembles the chunks I’m interested in into text.
Since single nouns would also count as noun phrases under this grammar,
I only keep the chunks consisting of more than one token.
And I build the index for noun phrases.
A clause is a group of words that contains a subject and a verb and functions as a member of a sentence.
I used the same technique as for noun phrases but with a different grammar.
I took the grammar I used from the NLTK docs.
A few example clauses:
And I build the index.
Sentences are easily extracted by a simple call to nltk.tokenize.sent_tokenize.
Double quotation marks need to be removed, otherwise they affect sentence tokenization.
Consider the following example.
This would be sent-tokenized like so:
The quotation mark following the exclamation mark in the third sentence prevents
the sent tokenizer from breaking it up into two sentences.
With double quotation marks removed this problem is solved.
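The quote removal itself is a one-liner (a sketch, handling both straight and curly double quotes):

```python
def remove_double_quotes(text):
    # straight and curly double quotation marks both interfere
    # with sentence tokenization
    return text.replace('"', '').replace('\u201c', '').replace('\u201d', '')

# sentences = nltk.tokenize.sent_tokenize(remove_double_quotes(text))
```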
Now that we have all the indexes we can define the remaining find functions.
Let’s do a quick integrity check and search for a number sequence mentioned earlier in this post.
Measuring the coverage of the indexes
Try finding encodings for a random number sequence and you’ll quickly realize
that in many cases there are no results at all.
In order to determine how useful our indexes are we need to measure the
probability of at least one encoding being found for a random number sequence of a certain length.
And the result is:
Not very impressive! You are very unlikely to find a match even for something as simple as a phone number.
To improve the numbers, one might use more books (and thus more content) and
repeat the previous steps, or use more sophisticated NLP techniques to recombine text chunks into new phrases.
I will, however, show you a technique for encoding number sequences of any length that doesn’t rely
purely on a high coverage of encodings for number sequences.
Step 7: Find noun sequences
In order to encode number sequences of an arbitrary length you need to combine several encodings
and link them together. For example, to encode 2184775142 you could use the words wand frog cauldron.
Then, create a story with those words to link them. For example, to memorize the word sequence wand frog cauldron
you could imagine using a wand to conjure a frog inside a cauldron.
I’m using nouns since they are especially well suited for building stories.
Here’s a function to find noun sequences automatically from our indexes.
The function is interactive: it asks you to choose a word from a list of possible nouns
repeatedly until the entire number sequence is encoded.
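For illustration, here is a non-interactive sketch of the same idea: it recursively splits the digit string into prefixes found in the index (here simply a dict from digit strings to nouns) instead of asking the user:

```python
def find_noun_sequences(index, numseq_str, max_results=10):
    # recursively split the digit string into prefixes that exist
    # in the index, preferring longer prefixes first
    if not numseq_str:
        return [[]]
    results = []
    for end in range(len(numseq_str), 0, -1):
        prefix, rest = numseq_str[:end], numseq_str[end:]
        if prefix in index:
            for tail in find_noun_sequences(index, rest, max_results):
                for noun in index[prefix]:
                    results.append([noun] + tail)
                    if len(results) >= max_results:
                        return results
    return results
```

With a toy index covering the example from above, the sequence 2184775142 decomposes into wand (21), frog (847) and cauldron (75142).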
Step 8: Build your personal word list
You don’t always have your computer at hand to find good encodings for you.
Plus, coming up with new encodings every time you need to memorize a number takes extra mental effort and time.
It’s better (and faster) to reuse the same encodings over and over again. In order to memorize number sequences efficiently,
you should have a list of encodings for all numbers from 00 to 99.
For this purpose, I created a few functions that will help you build your personal
word list and save it to a JSON file.
Step 9: Memorize your word list with the help of the Leitner system
Memorizing a hundred number-word pairs does require a significant effort but with the help of the
Leitner system nothing shall stop you.
The Leitner system is an implementation of the good old principle of spaced repetition.
Spaced repetition is a strategy for learning facts: the more often you recall a fact
correctly, the less frequently you review it.
You may use real flashcards of course but here are a few functions that implement a simple Leitner system.
Run interactive_leitner() to start training. It will show you the number of facts in each box
and ask you which box you would like to review.
In this case, I used box 0 as a staging box for new cards. This way you can add hundreds of facts
into the system at once without overwhelming your learning capacity.
When reviewing box 0, if you answer correctly the fact jumps to box 2, otherwise to box 1.
If you think you introduced enough new facts (I suggest 5-10), press Ctrl-C to stop.
When reviewing the other boxes, if you answer correctly the fact jumps to the next-higher box, otherwise back to 1.
Finish reviewing the box or press Ctrl-C to stop.
Once a fact reaches the box after max_box (box 6 in this case), it is considered successfully learned
and no longer comes up for review.
Review the boxes with decreasing frequency e.g. box 1 daily, box 2 every other day, box 3 once a week and so on.
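The movement rules described above can be sketched as a single function over a dict of boxes (a sketch, not the actual implementation; the example facts are hypothetical pegs, gnome → 23 and tub → 19):

```python
def move_fact(boxes, box_idx, fact, correct):
    # box 0 (staging): correct -> box 2, wrong -> box 1
    # other boxes:     correct -> next box, wrong -> back to box 1
    # a fact that moves past box 5 (into box 6) counts as learned
    boxes[box_idx].remove(fact)
    if box_idx == 0:
        dest = 2 if correct else 1
    else:
        dest = box_idx + 1 if correct else 1
    boxes.setdefault(dest, []).append(fact)
    return dest
```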
Adding the word list created in the previous step to our Leitner system is straightforward.
Continue practicing until all facts are beyond box 5.
Step 10: Practice memorizing number sequences
You are now fully ready to use the major system to memorize number sequences.
Consider the following training program. It shows you a number to memorize,
then distracts you with a mental math exercise before it asks you to enter the memorized number.
I used curses for some advanced user interaction in the training program.
Step 11: Apply the major system to real life situations
While the mnemonic major system seems tedious at first, with enough practice
it becomes an incredibly efficient technique for memorizing long (or short) number sequences quickly.
In everyday life, there are many opportunities for memorizing number sequences. For example, the next time
you’re eating out memorize the prices for each dish you order and impress everyone at the end when it’s
time to split the check and no-one knows what they owe.
Add numbers you want to memorize long-term to the Leitner system created in Step 9. For example,
to add important emergency phone numbers in case you lose your phone:
In this post, I introduced the mnemonic major system and showed you how to decode text
to number sequences both manually and automatically, find encodings in the content of your favorite fantasy books,
build your personal word list, use it to memorize number sequences of arbitrary length,
and write your own code to train all of these concepts.
I hope you found the information in this post useful.
In future posts, I will explore
other ways to apply data science to your favorite fantasy literature and maybe have a look
at other memory techniques as well.
Share your feedback in the comments and most importantly, start memorizing!
Update 1: Both sounds ð and θ should decode to 1
As suggested in a comment on Hacker News, the voiced and unvoiced pair ð, θ belong together
and should decode to 1.
Originally, I put θ into the group of sounds that decode to 8 because
the sound intuitively feels closer to f than to d to me.
However, many words can be pronounced either with ð or with θ which means that separating them
increases the number of words that require a manual IPA disambiguation in Step 3.
Thus, I am convinced that it is better to have them both decode to 1.
Update 2: Word list based on Characters
Based on NF’s suggestion in the comments, I’m going to create a list of two-digit pegs
using characters from the Harry Potter books. Instead of using the entire word for decoding, however,
only the first two decoded digits are used; otherwise we couldn’t find encodings for all two-digit number sequences.
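The first-two-digits rule can be sketched as follows, reusing the sound table from Step 1 (the IPA for Hagrid in the usage example is an assumption):

```python
SOUND_TO_DIGIT = {
    's': 0, 'z': 0, 't': 1, 'd': 1, 'ð': 1, 'θ': 1, 'n': 2, 'm': 3,
    'r': 4, 'l': 5, 'ʤ': 6, 'ʧ': 6, 'ʃ': 6, 'ʒ': 6, 'k': 7, 'g': 7,
    'ɡ': 7, 'f': 8, 'v': 8, 'p': 9, 'b': 9,
}

def first_two_digits(ipa):
    # only the first two decoded digits count for the character peg
    digits = [SOUND_TO_DIGIT[c] for c in ipa if c in SOUND_TO_DIGIT]
    return ''.join(str(d) for d in digits[:2])

# e.g. with the (assumed) IPA 'hægrɪd' for Hagrid,
# first_two_digits yields '74'
```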
The character names are quickly extracted with the help of a technique called Named-entity recognition (NER).
NER identifies so-called entities and places them into categories such as PERSON, DATE, GPE (which stands for
geopolitical entity), …
In this case, using spaCy instead of NLTK is simpler and faster. See spaCy’s docs
for a full list of named entity types.
This will output the identified characters into a text file. Unfortunately, for most
of the names we don’t have IPAs so you will have to build your word list manually.
You can reuse the interactive_create_wordlist function from before though, just use
a different filename and ignore the presented nouns. Instead, look for suitable names
in the file with the character names we just created.
Honestly, I didn’t find a good name for each number sequence so I used nouns sometimes.
Locations (GPE and LOC) could be extracted and used for the word list as well.