Names from a high order Markov Process and a simplified Katz back-off scheme

Various articles on RogueBasin have suggested that a Markov Process might be a good way to generate random names for a game. However, most of them have a weakness or two which can easily be solved. In this article, I will describe a basic first order Markov process seen in other articles on name generation, and discuss some of the short comings. I will then explain how to overcome these weakness using a higher order model, with a simplify Katz back-off scheme. This model is still fairly simple, but will allow you to come up with a data based name generator, even with a somewhat limited amount of data, which will produce pronounceable names with a similar feel to your training data.

Preliminaries (you can skip if you already know a bit about Markov processes)

Markov Process

A Markov model can be thought of as a generative model of data. In other words, it defines a full joint probability distribution which hypothesizes how the data was generated. However, this distribution has a very special property know as the Markov property. Essentially what this property says is that the output of the Markov process as any particular point in the output sequence depends entirely upon a fixed window of previous outputs. In other words, the following holds:

p(s_i | s_{i-1}, s_{i-2}, ..., s_0) = p(s_i | s_{i-1})

This equation gives the equation for a first order Markov model. Essentially what this says is that the current output depends solely upon the previous output. In the context of name generation, this might means that I will output a particular character based on the previous character. For example, if I just generated a 'z', I would be unlikely to generate another 'z', but would likely generate a vowel. The only trick then becomes coming up with the exact transition probabilities so I can ask things like what is the probability of generating an 'e' following a 'z'. It turns out that we do a simple maximum likelihood estimate from data by simply looking at all the times we observed 'ze', divided by all of the times that we say a 'z'.

Higher Order Models

In general though, a first order model might not be sufficient. It turns out that if we increase the size of the window we look back in time, we often get better looking outputs. This is unsurprising when we consider that many phonemes require more than one letter to express, or that syllabic dependencies might make a word unpronounceable. In order to have a higher order Markov model, we simply increase the size of our context from 1, to a bigger number. For example for an order 3 model the following would hold under the Markov property:

p(s_i | s_{i-1}, s_{i-2}, ..., s_0) = p(s_i | s_{i-1}, s_{i-2}, s_{i-3})

Training this model is accomplished in the same way as before, only we have to store more context. Now instead of asking things like the probability of observing an 'e' after a 'z', we have ask for the probability of an 'e' followed by the previous n letters, where n is the order of our model. The same principal applies though to train the model. We simply count the number of times our context of length n was followed by an 'e', and divide by the number of times we observed that context. We can very easily store all of these probabilities an a lookup table.

Problems with the Markov Process

Data Sparsity

We often don't have enough data to properly train our models. This is especially true if we have a higher order model. In general, the amount of data needed to train a model increase exponentially with the order of our model. If we don't have enough data, then the models will produce lower quality names, and be unable to generate some things it ought to, simply because the model never observed a particular character sequence in training. We can solve this by giving a Dirichlet prior to our model, which will be explained in the next section.

Zero Frequency

Another issue is that sometimes a higher order model will get stuck. It is possible to generate a sequence which leads to a context for which there was no training data. This is even worse than simply not having enough data to correctly assign probabilities as before, as we will not even have a table to perform a look up, which explains some of the crashes mentioned in other name generation articles on RogueBasin. We can solve this by employing a back-off scheme, which will be explained in the next section.

Start and End

There is one slight hicup with the way we have setup the model. We want every character in our generated sequence to be based on the previous n characters in the sequence, which works fine if we are n characters into the sequence. For example, suppose my data consisted of only two names: 'abb', and 'acc'. If we do not model the beginning of the names correctly, we might end up generating names like 'bb' or 'cc', when we never actually say a name begin this way. Similar issues exist with the end of names. This problem can be solved by prepending special start symbols to names in our training data, and by appending special end symbols to the end of names in our training. This will be explored more carefully in the next section.

Revised Markov Process

Dirichlet Prior

We can overcome the data sparsity issue to some degree by applying additive smoothing to all of our look up tables. This means that we simply add a constant to every count in our look up tables, even the spots where the count would have been 0. This constant can be any positive real number. For tables where we have very little data, the prior will smooth out the distributions so that we don't get too heavily swayed by a small number of observations. When we have a large number of observations, the prior will matter less since the counts are high enough to wash out that small constant. This also has the advantage that every probability will be non-zero. This is nice since it means we can generate any possible name, even if unlikely due to our training data.

In Bayesian terminology, this basically means we are adding a prior to the expected values from our distribution. Knowing what the Dirichlet distribution might give you some insight into what this prior means. The general intuition is that a prior closer to 0 means you trust the data more than your prior. A value of 1 corresponds to an uninformative prior. Values greater than 1 will tend to smooth more and make your distributions more uniform.

Katz Back-off

As mentioned, we can sometimes hit a problem when find ourselves trying to sample given a context we have never seen before. This is slightly different than the previous problem where we may have seen a context, but not often enough to have a good estimate of the probabilities. The zero-frequency problem is one where we have no table to even look up. Back-off schemes give a way out.

Essentially what we want to do is use a short context when ever we have no data at all. This means that when training an nth order model, we will simultaneously train up a n-1 order model, an n-2 order, and so on until a 0-order model. Then when ever we are given a context for which there is no data, we simply look up the probability in the next model of lower order.

This is slightly different than the original Katz back-off scheme, which gave a threshold for when you would backoff (in our case we just say that threshold is 0). Katz also gave a smoothing parameter which would be calculated using Good-Turing smoothing. While not to complicated, Good-Turing smoothing is probably beyond most people who have not studied natural language process more in depth than this tutorial is giving. For simplicity, we will just make this back-off weight 1, meaning we can ignore it.

Start and End Symbols

In order to get correct probabilities for the beginning of the a name though, we need to prepend n special start symbols to every name in our training data, where n is the order of our model. This should be a symbol which is never used in the training data to avoid confusion. Then when we train the model, the context for the first character will actually be a sequence of start symbols.

Similarly, we should model the end of words too. We do this in the same way, by having a special end of name symbol. Here we only ever need to append one such symbol, regardless of the order of our model, since once we hit this symbol, we will not generate observe any more data in that sequence.