Names from a high order Markov Process and a simplified Katz back-off scheme


Various articles on RogueBasin have suggested that a Markov Process might be a good way to generate random names for a game. However, most of them have a weakness or two which can easily be solved. In this article, I will describe the basic first order Markov process seen in other articles on name generation and discuss some of its shortcomings. I will then explain how to overcome these weaknesses using a higher order model with a simplified Katz back-off scheme. This model is still fairly simple, but it will let you build a data-based name generator, even with a somewhat limited amount of data, that produces pronounceable names with a similar feel to your training data.

Preliminaries (you can skip this if you already know a bit about Markov processes)

Markov Process

A Markov model can be thought of as a generative model of data. In other words, it defines a full joint probability distribution which hypothesizes how the data was generated. However, this distribution has a very special property known as the Markov property. Essentially, this property says that the output of the Markov process at any particular point in the output sequence depends entirely upon a fixed window of previous outputs. In other words, the following holds:

p(s_i | s_{i-1}, s_{i-2}, ..., s_0) = p(s_i | s_{i-1})

This is the equation for a first order Markov model. Essentially, it says that the current output depends solely upon the previous output. In the context of name generation, this means that I will output a particular character based on the previous character. For example, if I just generated a 'z', I would be unlikely to generate another 'z', but would likely generate a vowel. The only trick then is coming up with the exact transition probabilities, so that I can ask things like: what is the probability of generating an 'e' following a 'z'? It turns out that we can make a simple maximum likelihood estimate from data by counting all the times we observed 'ze' and dividing by all the times we saw a 'z'.
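
As a concrete illustration, here is a minimal sketch in Python of how those first order transition probabilities could be estimated by counting adjacent character pairs (the training names used here are just made-up examples):

 from collections import defaultdict

 def train_first_order(names):
     """Estimate p(next char | current char) by counting adjacent character pairs."""
     pair_counts = defaultdict(lambda: defaultdict(int))
     context_counts = defaultdict(int)
     for name in names:
         for prev, cur in zip(name, name[1:]):
             pair_counts[prev][cur] += 1
             context_counts[prev] += 1
     # Turn raw counts into probabilities: count('ze') / count('z')
     return {prev: {cur: n / context_counts[prev] for cur, n in nexts.items()}
             for prev, nexts in pair_counts.items()}

 # Hypothetical training data, purely for illustration
 probs = train_first_order(["zelda", "zoe", "eliza"])
 print(probs["z"])  # roughly equal chances of 'e', 'o' and 'a' after 'z'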

In general though, a first order model might not be sufficient. It turns out that if we increase the size of the window we look back in time, we often get better looking output. This is unsurprising when we consider that many phonemes require more than one letter to express, or that ignoring syllable-level dependencies might make a word unpronounceable. To get a higher order Markov model, we simply increase the size of our context from 1 to a bigger number. For example, for an order 3 model the following would hold under the Markov property:

p(s_i | s_{i-1}, s_{i-2}, ..., s_0) = p(s_i | s_{i-1}, s_{i-2}, s_{i-3})

Training this model is accomplished in the same way as before, only we have to store more context. Now, instead of asking for the probability of observing an 'e' after a 'z', we ask for the probability of an 'e' following the previous n letters, where n is the order of our model. The same principle applies to train the model, though. We simply count the number of times our context of length n was followed by an 'e', and divide by the number of times we observed that context. We can very easily store all of these probabilities in a lookup table.
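
As a sketch of how that lookup table might be built (again in Python, with illustrative function names and toy data), we can slide a window of length n over each name and count which character follows each context:

 from collections import defaultdict

 def train_markov(names, order):
     """Count how often each context of length `order` is followed by each
     character, then normalize the counts into probabilities."""
     counts = defaultdict(lambda: defaultdict(int))
     totals = defaultdict(int)
     for name in names:
         for i in range(order, len(name)):
             context = name[i - order:i]    # the previous `order` characters
             counts[context][name[i]] += 1
             totals[context] += 1
     return {ctx: {ch: n / totals[ctx] for ch, n in nexts.items()}
             for ctx, nexts in counts.items()}

 model = train_markov(["gandalf", "galadriel", "gimli"], order=3)
 print(model.get("gal", {}))  # which characters followed the context 'gal'

Note that this sketch ignores the beginning and end of each name; the start and end symbols described in the next section fix that.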

Start and End Symbols

There is one slight hiccup with the way we have set up the model. We want every character in our generated sequence to be based on the previous n characters in the sequence, which works fine once we are n characters into the sequence. In order to get correct probabilities for the beginning of a name, though, we need to prepend n special start symbols to every name in our training data, where n is the order of our model. This way we will be able to correctly model the beginning of names. Other approaches tend to ignore the fact that names start and end in special ways. For example, suppose my data consisted of only two names: 'abb' and 'acc'. If we do not model the beginning of the names correctly, we might end up generating names like 'bb' or 'cc', when we never actually saw a name begin this way. We just have to be careful that when we generate a name from our model, we then strip the start symbols off, since they are only markers and not actually significant.

Similarly, we should model the end of words too. We do this in the same way, by having a special end of name symbol. Here we only ever need to append one such symbol, regardless of the order of our model, since once we hit this symbol, we will not generate any more data in that sequence.
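
Putting the start and end symbols together with the lookup table, a minimal self-contained sketch of training and generation might look like the following (the marker characters '^' and '$' are arbitrary choices, assumed not to appear in the real names):

 import random
 from collections import defaultdict

 START, END = "^", "$"   # assumed marker characters, stripped from the final output

 def train_padded(names, order):
     """Train an order-n model on names padded with n start symbols and one end symbol."""
     counts = defaultdict(lambda: defaultdict(int))
     for name in names:
         padded = START * order + name + END
         for i in range(order, len(padded)):
             counts[padded[i - order:i]][padded[i]] += 1
     return counts

 def generate(counts, order):
     """Walk the model from the all-start context until the end symbol appears."""
     name = START * order
     while not name.endswith(END):
         context = name[-order:]
         chars, weights = zip(*counts[context].items())
         name += random.choices(chars, weights=weights)[0]
     return name[order:-1]   # strip the start markers and the end symbol

 counts = train_padded(["gandalf", "galadriel", "gimli"], order=3)
 print(generate(counts, order=3))

With a small training set and a high order, this pure model will mostly reproduce the training names verbatim; that is part of the motivation for the back-off scheme mentioned in the introduction.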