Difference between revisions of "Random name generation"
m (→Finite State Approach: fixed c++ link.) |
m (→Markov-Chains Approach: updated layout a little bit. (to stop the boxes from overflowing)) |
||
Line 197: | Line 197: | ||
{ | { | ||
tempFrequency = itr2->second; | tempFrequency = itr2->second; | ||
*fileStreamOut << itr->first << " " << itr2->first << " " << tempFrequency.returnCountBeginning() << " " << tempFrequency.returnCountWithin() << " " << tempFrequency.returnCountEnd() << std::endl; | *fileStreamOut << itr->first << " " << itr2->first << " " | ||
<< tempFrequency.returnCountBeginning() << " " << | |||
tempFrequency.returnCountWithin() << " " | |||
<< tempFrequency.returnCountEnd() << std::endl; | |||
} | } | ||
} | } | ||
Line 276: | Line 279: | ||
name.outputList(outFile); | name.outputList(outFile); | ||
std::cout << name.outputName(3, 9) << '\n' << name.outputName(3, 9) << '\n' << name.outputName(3, 9) << '\n' << name.outputName(3, 9) << '\n' << name.outputName(3, 9); | std::cout << name.outputName(3, 9) << '\n' << | ||
name.outputName(3, 9) << '\n' << | |||
name.outputName(3, 9) << '\n' << | |||
name.outputName(3, 9) << '\n' << | |||
name.outputName(3, 9); | |||
} | } | ||
Revision as of 12:44, 18 August 2008
There are a lot of different ways to tackle random name generation. Some use existing lists of names to generate variants of these names, such as the Markov-Chain Approach. Other variants use lists of syllables to randomly select names. And others combine different words or parts of words into new random names.
Markov-Chains Approach
Markov-Chains are usually used in the field of thermodynamics, but I will be using a fudged version of Markov-Chains to generate some random names. The basic premise of Markov-Chains is that if we have a big enough data set we should be able to predict the next event in a series of seemingly random events. So using this we need to create a dataset of names that we want to generate, lucky for you I have already done this. Here is the List of Names.
This list is every male and gender-neutral name from babynames.com. Now we need to create a program to parse this list for it's statistics. To make this interesting I have split the statistics up, 1. Letter combinations that begin names, 2. Letter combinations that end names, and 3. Letter combinations that are in between the beginning and end. This should give us slightly better names. The following is my code samples in C++, after compilation this program will output some randomly generated names.
// CWordFrequncy.h #ifndef CWORDFREQUENCY_H_ #define CWORDFREQUENCY_H_ class CWordFrequency { private: int countBeginning; int countEnd; int countWithin; public: CWordFrequency(); ~CWordFrequency(); void incrementCountBeginning(); void incrementCountEnd(); void incrementCountWithin(); int returnCountBeginning(); int returnCountEnd(); int returnCountWithin(); }; #endif
//CWordFrequency.cpp #include "CWordFrequency.h" CWordFrequency::CWordFrequency() : countBeginning(0), countEnd(0), countWithin(0) { } CWordFrequency::~CWordFrequency() { } void CWordFrequency::incrementCountBeginning() { ++countBeginning; } void CWordFrequency::incrementCountEnd() { ++countEnd; } void CWordFrequency::incrementCountWithin() { ++countWithin; } int CWordFrequency::returnCountBeginning() { return countBeginning; } int CWordFrequency::returnCountEnd() { return countEnd; } int CWordFrequency::returnCountWithin() { return countWithin; }
//CRandomName.h #include <fstream> #include <map> #include <string> #include <cstdlib> #include <ctime> #include <vector> #include <algorithm> #include "CWordFrequency.h" #ifndef CRANDOMNAME_H_ #define CRANDOMNAME_H_ class CRandomName { private: std::string errorMessage; std::ifstream *fileStreamIn; std::ofstream *fileStreamOut; std::map<char, std::map<char, CWordFrequency> > baseMap; std::map<char, CWordFrequency> sequenceFrequencyMap; CWordFrequency tempFrequency; public: CRandomName(); ~CRandomName(); void inputFile(std::ifstream &streamHandle); void processFile(); void outputList(std::ofstream &streamHandle); std::string outputName(double minLength, double maxLength); }; #endif
//CRandomName.cpp #include <iostream> #include "CRandomName.h" CRandomName::CRandomName() { srand(time(NULL)); } CRandomName::~CRandomName() { fileStreamOut->close(); fileStreamIn->close(); } void CRandomName::inputFile(std::ifstream &streamHandle) { fileStreamIn = &streamHandle; } void CRandomName::processFile() { std::string word; char base; char sequence; int wordPosition; while(!fileStreamIn->eof()) { *fileStreamIn >> word; for (wordPosition = 0; (wordPosition + 1) < (word.length()); wordPosition++) { base = word[wordPosition]; sequence = word[wordPosition + 1]; CWordFrequency &wf = baseMap[base][sequence]; if (wordPosition == 0) {wf.incrementCountBeginning();} else if ((wordPosition + 1) >= (word.length() - 1)) {wf.incrementCountEnd();} else if ((wordPosition > 0) && ((wordPosition + 1) < (word.length() - 1))) {wf.incrementCountWithin();} } } } void CRandomName::outputList(std::ofstream &streamHandle) { fileStreamOut = &streamHandle; std::map<char, std::map<char, CWordFrequency> >::iterator itr; std::map<char, CWordFrequency>::iterator itr2; for (itr = baseMap.begin(); itr != baseMap.end(); itr++) { sequenceFrequencyMap = itr->second; for (itr2 = sequenceFrequencyMap.begin(); itr2 != sequenceFrequencyMap.end(); itr2++) { tempFrequency = itr2->second; *fileStreamOut << itr->first << " " << itr2->first << " " << tempFrequency.returnCountBeginning() << " " << tempFrequency.returnCountWithin() << " " << tempFrequency.returnCountEnd() << std::endl; } } } std::string CRandomName::outputName(double minLength, double maxLength) { std::string name; std::vector<char> freqVector; double range = static_cast<double>((maxLength - minLength) + 1); int rangeLength = static_cast<int>(minLength + (range * ((double)rand() / (double)(RAND_MAX + 1)))); char a = static_cast<char> (65 + (26 * rand() / ( RAND_MAX + 1.0 ))); //I made this only go to //'Z' because I haven't finished compiling my list of names name += a; for(int counter = 1; counter < rangeLength; counter++) { int cdc = 0; if(baseMap.find(a) != baseMap.end()) { for (char b = 'A'; b <= 'Z'; b++) { if(baseMap[a].find(b) != baseMap[a].end()) { if(counter == 1) { for(int cc = 0; cc < (baseMap[a][b].returnCountBeginning()); cc++) { freqVector.push_back(b); cdc++; } } else if((counter + 1) >= (rangeLength - 1)) { for(int cc = 0; cc < baseMap[a][b].returnCountEnd();cc++) { freqVector.push_back(b); cdc++; } } else { for(int cc = 0; cc < baseMap[a][b].returnCountWithin();cc++) { freqVector.push_back(b); cdc++; } } } } } std::random_shuffle(freqVector.begin(), freqVector.end()); std::random_shuffle(freqVector.begin(), freqVector.end()); std::random_shuffle(freqVector.begin(), freqVector.end()); int c = (int)(((cdc) * rand() / ( RAND_MAX + 1.0 ))); name += freqVector.at(c); a = freqVector.at(c); } return name; }
//main.cpp #include <iostream> #include "CRandomName.h" int main() { CRandomName name; std::ifstream inFile("NameList.txt"); std::ofstream outFile("Stats.txt", std::ios_base::trunc); name.inputFile(inFile); name.processFile(); name.outputList(outFile); std::cout << name.outputName(3, 9) << '\n' << name.outputName(3, 9) << '\n' << name.outputName(3, 9) << '\n' << name.outputName(3, 9) << '\n' << name.outputName(3, 9); }
And here is some sample output:
ONMY TERLEMDA ZENEEVML CHIA SAWH SPTKIVCDE ILAN COONR KETADAN VIAN XAIIN URRIRERON FRTOKRR YONNAN IDANSZEDS CAALLSTF FLVAHE XHAM DUSEXNCST DEBAD SHEUO WARFA MUASS INEE
As you can see, some are ok but most aren't. This is because the dataset although big isn't quite big enough. To compensate for this, you could try to implement artificial rules to make the names more readable, but I will leave that as an exercise for the reader.
Finite State Approach
source : Original author Jeff Lait
This generator uses a very simple finite state machine to select between individual letters in a fashion that creates pronounceable player names.
The comments at the top explain the nature of the beast. The only syllable based component is where I trigger the termination timer using a quickly accelerating random chance to keep names short.
Following code is in C++
#include <stdio.h> #include <stdlib.h> #include <ctype.h> #include <string.h> //this function returns true if a random roll (1-100) is lower then c. //So: rand_chance(20) has a 20% chance of returning true. bool rand_chance(int c) { return( (rand() % 100 + 1) <= c); } //this function returns a random int between 0 and i-1. int rand_choice(int i) { return rand() % i; } void rand_name(char *text, int len) { // Very simple markov generator. // We repeat letters to make them more likely. const char *vowels = "aaaeeeiiiooouuyy'"; const char *frictive = "rsfhvnmz"; const char *plosive = "tpdgkbc"; const char *weird = "qwjx"; // State transitions.. // v -> f, p, w, v' // v' -> f, p, w // f -> p', v // p -> v, f' // w, p', f' -> v int syllables = 0; char state; int pos = 0; bool prime = false; // Initial state choice if (rand_chance(30)) state = 'v'; else if (rand_chance(40)) state = 'f'; else if (rand_chance(70)) state = 'p'; else state = 'w'; while (pos < len-1) { // Apply current state switch (state) { case 'v': text[pos++] = vowels[rand_choice(strlen(vowels))]; if (!prime) syllables++; break; case 'f': text[pos++] = frictive[rand_choice(strlen(frictive))]; break; case 'p': text[pos++] = plosive[rand_choice(strlen(plosive))]; break; case 'w': text[pos++] = weird[rand_choice(strlen(weird))]; break; } // Chance to stop.. if (syllables && pos >= 3) { if (rand_chance(20+pos*4)) break; } // Transition... switch (state) { case 'v': if (!prime && rand_chance(10)) { state = 'v'; prime = true; break; } else if (rand_chance(40)) state = 'f'; else if (rand_chance(70)) state = 'p'; else state = 'w'; prime = false; break; case 'f': if (!prime && rand_chance(50)) { prime = true; state = 'p'; break; } state = 'v'; prime = false; break; case 'p': if (!prime && rand_chance(10)) { prime = true; state = 'f'; break; } state = 'v'; prime = false; break; case 'w': state = 'v'; prime = false; break; } } text[0] = toupper(text[0]); text[pos++] = '\0'; } //Arguments: First is max length of random name // Second is numer of random names int main(int argc, char *argv[]) { srand ( time(NULL) ); int j = 1; if(argc > 2) { j = atoi(argv[2]); } while(j > 0) { if(argc > 1) { int i = atoi(argv[1]); char * a = (char*) malloc(sizeof(char) * i); rand_name(a, i); printf("%s\n",a); } else { char a [5]; rand_name(a, 5); printf("%s\n",a); } j--; } return 0; }
(The code is public domain)
This program accepts the following arguments, the first is the max length of the name, but most will be smaller. And the second is the number of names to generate. Defaults are 5 1.
Results:
./randomName 10 20 'tosezivc Fdo 'dybze Wask Bof Yqa 'vymi Fatyope Xar Qow Midu Tezk Tucik Hgy Eqyhp Hti Cevidi Sd'gaquz Ubeqy Spa
As you can see the algorithm has a bias for short names, and is not always pronounceable. But it is a start. Also as this is a proof of concept implementation, the random number generator is not properly seeded. Running multiple types after each other could result in the same random names (depending on the operating system).
Syllable-based Approach
If you have one or more information about it, add it here.