P_2(R)

  1. Sorry for the cryptic message I posted just before this one. Well, you ask me for instructions, I give you instructions... I have ideas to improve the algorithm, but I realize it's pointless if I don't package and distribute it properly: nobody but me will benefit from it. I have no Mac, so that means Windows packages first... Anyway, I rewrote the code to be much, much, much cleaner and more improvable in the future. I uploaded it to www.github.com so it'll be easier to share, and easier to contribute to for the ones who can code. In a great moment of lack of inspiration (of about 4 seconds), I named it "jaminique". There's no distributable package yet: you still have to install python3, clone with git and run it in a terminal, as sketched below. I need to do a (simple) user-friendly GUI first, then I can think about a user-friendly unzip-and-launch thing. Here is the url: https://github.com/laerne/jaminique
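     Until then, getting the code goes through git. Roughly, the commands look like this (the post doesn't name which script inside the repository to launch, so that last step is deliberately vague; look inside the repository for the entry point):

         git clone https://github.com/laerne/jaminique
         cd jaminique
         # then run the repository's main script with python3,
         # as with namegen.py in the next post
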
  2. Thanks for the positive feedback. I had ideas to improve the thing and, as always, some motivation helps in getting them done. The algorithm is definitely improvable, and I'm getting ideas about it (well, I opened my AI book at the "Natural Language Processing" section...). However, there are already options to dismiss results deemed "too probable" or "too improbable", and I can alter the boundaries to select worse results. For example, with the same American baby names, but selecting less good results:

         10.801808 : atan
         13.900588 : ko
         13.869776 : jajeneo
         13.255757 : rrene
         10.340706 : shanga

     For usage, it involves using the terminal, aka the command-line interface, aka driving the computer by text (I always thought linguists would like to communicate with their computer by typing in a language, but this seems not to be the case...). Read the instructions below first, and if it seems too hard, maybe wait until I produce something more distributable. Or go through a complete python3 tutorial until you can create and run the famous "hello, world!" program.

     So, the first thing you need is to install python, version 3. The process is heavily dependent on your platform. I don't have a mac myself, so I'm not sure how to install python there. There is a tutorial on the official site on how to get python working on each platform: https://docs.python.org/3/using/index.html. Section 4 speaks about mac. Sorry for the heavy reading this might involve.

     Now create a new text file with some equivalent of windows' notepad, copy my code above and paste it into the file. Save the file in a folder, for example Downloads, and for example with the name namegen.py. Also, create a list of names in a file (one name per line, avoid capital letters) in the same directory, for example names.txt.

     As I said, there's no graphical user interface (GUI) yet, so you have to run the program from the terminal. On mac, I think you can launch the terminal in Applications/Utilities. From the documentation linked above: "To run your script from the Terminal window you must make sure that /usr/local/bin is in your shell search path." To verify this, type in the terminal

         echo $PATH

     and see if "/usr/local/bin" appears in the output, separated from other items by colons (it should be there). If it isn't there, type

         export PATH="/usr/local/bin:${PATH}"

     to add it, and verify again. You have to run that export command for each terminal you open.

     Now you must change the current terminal directory (like in a file explorer), using the command cd. For example, if the script is in your Downloads directory, you can do any of:

         cd Downloads
         cd ~/Downloads
         cd $HOME/Downloads

     Now you can run the program by typing

         python3 namegen.py names.txt -n 50

     names.txt should be the path from the directory you are in to the file containing one name per line, and 50 is the number of names to generate. It should output a number, a colon and a name. The number is how "likely" a name is (the lower, the more likely); a sketch of how it's computed follows this post. If you have already used the terminal before, you can see the list of options with --help:

         python3 namegen.py --help
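     To make that "likeliness" number concrete, here is a minimal sketch of the filter, in the same spirit as the script (the helper names name_perplexity and keep are mine, not from the script, which computes the same quantity inline):

         # Sketch: the score printed next to each name is a perplexity,
         # the inverse geometric mean of the probabilities the model
         # assigned to each letter it picked while generating the name.

         def name_perplexity(step_probabilities):
             probability = 1.0
             for p in step_probabilities:
                 probability *= p
             return probability ** (-1.0 / len(step_probabilities))

         def keep(perplexity, min_perplexity=0.0, max_perplexity=10.0):
             # Below the minimum: "too probable" (too close to the examples).
             # Above the maximum: "too improbable" (ugly).
             return min_perplexity <= perplexity <= max_perplexity

         # A 3-letter name whose letters each had probability 0.2
         # scores 0.008 ** (-1/3), i.e. about 5.0: inside the bounds.
         print(name_perplexity([0.2, 0.2, 0.2]))  # ~5.0
         print(keep(5.0))                         # True
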
  3. Hey, guys. I found this topic via a google search this Monday, and it tickled my mind. It struck me that the "Everchanging Book of Names" software was:

       - windows only
       - closed source
       - not much improved in many years

     There's an open-source video game for linux/mac/windows called wesnoth that uses a name generator based on markov chains to generate the names of its characters. It seems a bit overkill (and concentration-risky) to launch a whole video game just to generate character names, so I felt challenged to write a small code base, usable on linux/mac and with the source available so you can improve it if you wish to. So I did my research, and came up with this little script.

     So how did the Everchanging Book of Names work? Since it's closed source, it's a bit harder to guess the algorithm. There is a help file with a small description, and it seems very linguistics-influenced. From what I understand, it splits words into phonemes, then into consonant-vowel clusters. Then the interesting part comes: it makes trigrams of consonant-vowel, consonant-any-consonant or vowel-any-vowel clusters, creates a word structure in terms of consonant/vowel sequence, and tries to match the trigrams onto it.

     Wesnoth uses a pure "graph" search method. It iterates by picking a letter that has been observed, in some word of the name dictionary, to follow the last three generated letters, until it reaches some length limit or an end-of-word marker, which is treated as a letter for the rest of the algorithm.

     I decided to go on with the wesnoth method, because it seems more general (usable with different alphabets or ways of writing phonemes). However, I went much more probabilistic, and allowed trigrams that don't appear in the dictionary to be generated, just with low probability (it's called Laplace smoothing). It decreases the quality of the results, but improves the randomness of the names. If you decide to discard names that already appear in the dictionary, it actually improves quality.

     For quickness and easiness, I wrote a pure command-line program in python 3. If you're geeky enough, you can try it out; I'm copying the script below. If enough people are interested, I may try to improve the script, like adding a quick GTK gui, or throwing in linguistic or other ideas (mine or yours) to improve name generation, and then upload it on github.

     EDIT: oh, I forgot to mention: you need to `feed` the script with the path to a name dictionary file, which is simply a txt file with one name per line (the format is sketched just after this post).

         #!/usr/bin/env python3
         import random
         import argparse

         argparser = argparse.ArgumentParser(
             usage='%(prog)s [options] dictionary [dictionary...]',
             description='Generate a random name from a flavor of existing names.' )
         argparser.add_argument( '-n', '--number', action='store', type=int, default=1,
             help='number of names to generate' )
         argparser.add_argument( '-u', '--uniform', action='store_true', default=False,
             help='ignore possible word weights and set them all to 1' )
         argparser.add_argument( '-s', '--ngram-size', action='store', type=int, default=2,
             help='how many previous characters are looked at to choose the next one (default 2)' )
         argparser.add_argument( '-l', '--min-length', action='store', type=int, default=3,
             help='minimum length of a generated name (default 3)' )
         argparser.add_argument( '-L', '--max-length', action='store', type=int, default=20,
             help='maximum length of a generated name (default 20)' )
         argparser.add_argument( '-o', '--original', action='store_true', default=False,
             help='discard words already existing in the dictionary' )
         argparser.add_argument( '-p', '--min-perplexity', action='store', type=float, default=0.0,
             help='tolerance threshold to discard words that match the examples too closely (default 0.0)' )
         argparser.add_argument( '-P', '--max-perplexity', action='store', type=float, default=10.0,
             help='tolerance threshold to discard ugly words (default 10.0)' )
         argparser.add_argument( 'dictionary', action='store', nargs='*',
             default=['namelists/viking_male.txt','namelists/viking_female.txt','namelists/viking_unknownGender.txt'],
             help='name dictionary files to learn from' )
         args = argparser.parse_args()

         dictfiles = args.dictionary
         n = args.number
         uniform_weight = args.uniform
         psize = args.ngram_size
         size = psize + 1
         min_length = args.min_length
         max_length = args.max_length
         min_perplexity = args.min_perplexity
         max_perplexity = args.max_perplexity
         only_original = args.original

         ## c = the character itself
         ## l = length of the prefix
         ## n = number of occurrences of the prefix
         ## k = number of occurrences of the character after the prefix
         class MaxRegularizer():
             def __init__( self ):
                 self.scores_ = {}
             def learn( self, l, c, n, k ):
                 # Longer prefixes weigh exponentially more than shorter ones.
                 score = (k/n) * 2**l
                 if c in self.scores_:
                     self.scores_[c] += score
                 else:
                     self.scores_[c] = score
             def scores( self ):
                 for t in self.scores_.items():
                     yield t

         new_regularizer = MaxRegularizer

         def dichotomic_find( random_access_collection, element ):
             # Binary search for the insertion point of element.
             i = 0
             j = len( random_access_collection )
             while i < j:
                 k = (i+j) // 2
                 if random_access_collection[k] > element:
                     j = k
                 elif random_access_collection[k] < element:
                     i = k + 1
                 else: # random_access_collection[k] == element
                     return k
             assert i == j
             return i

         class DiscretePicker:
             # Draw an index at random, weighted by unnormalized probabilities.
             def __init__( self, probabilities ):
                 self.cumulative_probabilities = []
                 accumulator = 0.0
                 for p in probabilities:
                     accumulator += p
                     self.cumulative_probabilities.append( accumulator )
             def pick( self ):
                 upper_bound = self.cumulative_probabilities[-1]
                 random_float = random.uniform( 0.0, upper_bound )
                 return dichotomic_find( self.cumulative_probabilities, random_float )

         def discrete_pick( probabilities ):
             picker = DiscretePicker( probabilities )
             return picker.pick()

         ## learning: count, for each prefix length, how often each character
         ## follows each prefix; '^' and '$' mark the start and end of a word ##
         all_words = set()
         prefix_counters = [{} for i in range(psize+1)]
         for df in dictfiles:
             with open( df, "r" ) as dictstream:
                 for word in dictstream:
                     word = word.split('#')[0].strip()  # drop comments
                     word_data = word.split(':')        # optional "name:weight" syntax
                     word = word_data[0]
                     all_words.add( word )
                     if not uniform_weight:
                         weight = float(word_data[1]) if len(word_data) > 1 else 1
                     else:
                         weight = 1
                     if word == '':
                         continue
                     word = '^' + word + '$'
                     ngrams = ['' for i in range(size+1)]
                     for c in word:
                         for i in range(1,size+1):
                             ngrams[i] += c
                             if len(ngrams[i]) > i:
                                 ngrams[i] = ngrams[i][1:]
                             prefix = ngrams[i][:-1]
                             character = ngrams[i][-1]
                             if character == '^':
                                 continue
                             if prefix not in prefix_counters[i-1]:
                                 prefix_counters[i-1][ prefix ] = {}
                             if character not in prefix_counters[i-1][ prefix ]:
                                 prefix_counters[i-1][ prefix ][ character ] = 0
                             prefix_counters[i-1][ prefix ][ character ] += weight

         def get_prefix_count( prefix_length, prefix ):
             if prefix not in prefix_counters[prefix_length]:
                 return 0, {}
             total = sum( prefix_counters[prefix_length][prefix].values() )
             return total, prefix_counters[prefix_length][prefix]

         ## generation ##
         num_generated_names = 0
         while num_generated_names < n:
             name = '^'
             name_probability = 1.0
             while len(name) < max_length:
                 # Blend the statistics of all prefix lengths up to psize.
                 s = min( psize, len(name) )
                 regularizer = new_regularizer()
                 for l in range( 0, s+1 ):
                     lprefix = name[-l:] if l > 0 else ''
                     total_occurrences, per_char_occurrences = get_prefix_count( l, lprefix )
                     for c, k in per_char_occurrences.items():
                         regularizer.learn( l, c, total_occurrences, k )
                 characters_only, strengths_only = zip( *regularizer.scores() )
                 i = discrete_pick( strengths_only )
                 character = characters_only[i]
                 if character == '$':
                     if len(name) < min_length:
                         continue  # too short to stop here, draw again
                     else:
                         break
                 name += character
                 name_probability *= strengths_only[i] / sum(strengths_only)
             name = name[1:]  # strip the '^' marker
             name_perplexity = name_probability ** (-1/len(name))
             if only_original and name in all_words:
                 pass # print( "duplicate!", name, '(%d)' % name_perplexity )
             elif name_perplexity > max_perplexity:
                 pass # print( "ugly!", name, '(%d)' % name_perplexity )
             elif name_perplexity < min_perplexity:
                 pass # print( "too likely!", name, '(%d)' % name_perplexity )
             else:
                 print( "%.6f : %s" % (name_perplexity, name) )
                 num_generated_names += 1

     I generated samples from American male baby names since the 2000s. Not perfect, but there are already interesting examples appearing, like "jaminique" or "trennett".
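     One more note on the dictionary format: the loading loop (the split('#') and split(':') calls) also accepts per-name weights and comments, so a hypothetical names.txt could look like this; the names and weights here are made up for illustration:

         # anything after a '#' is ignored
         ragnar:3        # weight 3: counts as three occurrences
         astrid          # no weight given: defaults to 1
         leif:0.5        # fractional weights are allowed (parsed with float)

     And a usage line exercising a few of the options, with names.txt as above:

         python3 namegen.py names.txt -n 10 -s 3 --original --max-perplexity 8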