Genetic Algorithms with Python

In this post we will explore deap - a genetic algorithms Python framework - by coding a complete example to grasp the basic patterns behind it.

The task we want to accomplish with our genetic algorithm is finding a suitable letter arrangement for a word clock, that is a clock composed by a matrix of characters that when lit in specific patterns reveal a sentence spelling the current time. This problem is not new: a solution with genetic algorithms originally appeared on hack a day.

Some basic knowledge of the ideas behind genetic algorithms, Python and OOP will help make sense of the following paragraphs.

Individual and fitness classes definition

To begin we need to install deap (pip install deap), and to import the necessary modules that will be used in the program.

Next we need to model the elements of our population by creating an Individual, that is a python class to hold the genome and fitness of each population member.

1 from deap import creator, base, tools, algorithms
2 
3 creator.create("FitnessMax", base.Fitness, weights=(1.0,))
4 creator.create("Individual", list, fitness=creator.FitnessMax)

The creator.create function might be confusing at first, as it does several things at once:

So, after executing lines 3 and 4 of the above snippet, the creator module will contain two new classes, FitnessMax and Individual.

These classes respectively derive from the base.Fitness class, part of deap, and from the list class, a core python class.

The FitnessMax class has values for the abstract base class attribute weights, while the Individual class adds the new attribute fitness in addition to the ones inherited from list.

A peculiar feature of our individuals is that each of them is associated with a Fitness. This class of the base module manages a list of float values that represent a metric of each individual fitness for being a solution to the problem. The weights list, defined in the class instantiation, defines how each individual’s fitness fares against the others: a high value fitness attribute in the same position of a positive weight will yield a better fitness than a lower value one, and, on the contrary, a negative weight will make lower values in the same position in the fitness tuple of the individual contribute to a better overall fitness.

Here we chose (when defining weigths of FitnessMax) to have one dimensional fitness, so that a single feature of the individuals will be used to pick the best ones, and with a positive weight, so that the highest values of that feature will be the best ones for our purposes.

To model the word clock a list of characters is suitable to represent the solution.

Factory to create individuals

Very similarly to creating new custom classes into the framework, deap allows to define functions with the toolbox.register function:

 1 toolbox = base.Toolbox()
 2 import random
 3 
 4 toolbox.register(
 5   "random_char",
 6   random.choice,
 7   "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
 8 
 9 DIM = 10 #matrix side
10 
11 toolbox.register(
12   "individual",
13   tools.initRepeat,
14   creator.Individual,
15   toolbox.random_char,
16   n=DIM * DIM)

With a pattern similar to the creator.create function, the toolbox.register method creates a new function with the name given as first argument and registers it in the toolbox module.

The behavior of the registered function is defined by the function passed in as second argument, to which the remaining parameters are fed in order, and the others remain to be passed at invocation time (much like using functools.partial).

Now the first call to register is a simple alias for the random.choice python function, to which the array of the alphabet is passed as a parameter; the second call instead makes use of a factory method offered in the tools module called initRepeat: this method takes a container class as first argument (so list or array are suitable classes to pass) and initializes it by calling n times the function passed as second argument. In the call at line 9, the initRepeat function is using the Individual class added to the creator, which was created as a list subclass, and it is using the random_char function registered in the toolbox as the function to be used 8 times as the container initializer.

So in the end we can use this toolbox.individual as a factory function to mint new members of our population, and initialize each of them with random content; if we

print toolbox.individual()

we obtain our first individual as a list of random characters representing a flattened version of the matrix of the word clock.

['T', 'N', 'H', 'V', 'B', 'L', 'Z', 'C', ... ]

In order to better represent the matrix we can add a __str__ method to the Individual class to let it represent itself as a string with a separator between each matrix line. With this method it is guaranteed that any manipulation on the string of the individual later on performed by mutation or mating is not affecting the constraint that there is a matrix of DIM by DIM rows and columns.

1 def __str__(individual):
2     s = ""
3     for i in range(len(individual)):
4         s += individual[i]
5         if i % DIM == DIM-1: s+='#'
6     return s
7 
8 creator.Individual.__str__ = __str__

Finally, the initRepeat pattern is again useful to fill another list, this time with randomly initialized individuals, in order to create the initial population to be processed by the genetic algorithm, and register this in the toolbox as the population function.

1 toolbox.register("population",
2                   tools.initRepeat,
3                   list,
4                   toolbox.individual)

Evolutionary functions

Deap provides a set of predefined genetic algorithms that rely upon particular functions defined by the user into the toolbox in order to perform the necessary evolutionary steps. We will use the eaMuPlusLambda, a popular evolution strategy, as our algorithm.

For this algorithm to work, the evaluate, mate, mutate and select functions must be defined in the toolbox, with these specific names. Let’s define them.

The most complex function to be defined is the function that evaluates each individual and assigns it its fitness, so that the other parts of the algorithm can build the next generation upon the best individual ancestors in the current iteration.

Fitness function

In our example the fitness function will be composed by a bunch of regular expressions representing all the wordings of the times that the clock has to support; these will be matched against the individual representation string and the number of matches will represent the score, or fitness, of each individual.

For this example we choose to represent the quarters of each hour, so a total of 48 regular expressions will be defined by the following initialization, and put in an array to be later used by our evaluate function.

1 #regex list:
2 hours = ("ONE", "TWO", "THREE", "FOUR", "FIVE", "SIX",
3         "SEVEN", "EIGHT", "NINE", "TEN", "ELEVEN", "TWELVE",)
4 
5 restrings = []
6 for h in hours: restrings.append(h+".+O.+CLOCK")
7 for h in hours: restrings.append("HALF.+PAST.+"+h)
8 for h in hours: restrings.append("QUARTER.+PAST.+"+h)
9 for h in hours: restrings.append("QUARTER.+TO.+"+h)

This initialization code simply defines a list of regular expressions by combining the twelve hours into four different patterns of words, giving 48 combinations where any character - including line separators - can be placed between the words for the regular expression to match. The regular expressions must be built one for each pattern of the word clock, since there is no way to find the number of matches of a regular expression containing alternative tokens.

The fitness function will simply return the number of regular expressions that match with the string representation of the individual, so that if all regular expressions match the individual is a solution to the word clock problem.

1 def evaluateInd(individual):
2     import re
3     s = str(individual)
4     scores = [re.compile(r).search(s) != None for r in restrings]
5     return (float(sum(scores)),)

The scores array is a boolean array that contains a True value if the corresponding regex is matching the string of the individual; the return value is the count of True values present in the evaluation array of all the 48 regular expressions defined. Note that the function return value is expressed as a one element list of floats, as required by the FitnessMax weigths property defined previously as (1.0, ), so that the count of matching regular expressions will tend to be maximized by the algorithm.

Mutation function

A simple mutation function for each individual could be to pick a random position in the individual list of characters and place a random character in place of the initial one. This method could work, but it can be very time consuming to randomly match all the words in the domain of the problem until a proper solution can be found.

A more efficient approach consists in building a keyword set that can be used to mutate the string by inserting whole words that we already know will be part of the final solution: it makes little sense to guess these words one random character at a time.

The following code builds a list of keywords from the list of regular expressions that had been previously initialized, relying on the fact that those expressions were a list of keywords each, separated by the “one or more character” (.+) token.

1 #build a keyword list for mutation
2 keywords = set()
3 for r in restrings:
4     for i in r.split(".+"):
5         keywords.add(i)
6 keywords = list(keywords)

With this list we can write a function that randomly picks a position into the genome of an individual and replaces the letters found at that position with a keyword from the problem domain:

1 def myMutation(individual):
2     kw = random.choice(keywords)
3     pos = random.randint(1,len(individual)-len(kw))
4     for i, ch in enumerate(kw):
5         individual[pos+i]=ch
6     return (individual,)

The function could be sped up further by avoiding placing the keyword across line boundaries, but to keep it simple we will let the fitness function penalize those solutions by not recognizing the keywords that fall across two lines, as the separator inserted by the __str__ method of the Individual class will prevent the keywords in the regular expressions from matching.

Mating and Selection functions

Given that both the individuals and the population in the project share the same list implementation, we can leverage predefined operators for the remaining mating and selection functions: these are predefined into the tools package and can be simply added to the toolbox.

For mating a pair of individuals and producing a new pair the cxTwoPoint can be used; for selection the selBest function is used to skim the best of each population round and add it to the next round of evolution. The previously defined evaluation and mutation functions can be added to the toolbox as well.

1 toolbox.register("mate", tools.cxTwoPoint)
2 toolbox.register("select", tools.selBest)
3 toolbox.register("evaluate", evaluateInd)
4 toolbox.register("mutate", myMutation)

Running the algorithm

Now that all the building blocks have been defined, we are finally ready to use the genetic algorithm eaMuPlusLambda to evolve an inital population to a good solution. In order to follow the evolution and stop only when a viable solution is found, the algorithm can be inserted into a loop that will go on until all the regular expressions of the fitness function are satified by the best individual. At each iteration the top individual is computed and the fitness is printed out, in order to monitor the convergence of the algorithm, but this section can be put outside of the loop to speed up iteration; alternatively the statistics and halloffame objects can be used together with the verbose option.

 1 if __name__ == "__main__": 
 2     pop = toolbox.population(n=1000)
 3 
 4     fit = 0.0
 5     while (fit < len(restrings)):
 6 
 7         algorithms.eaMuPlusLambda (
 8                 pop, toolbox, 
 9                 400, 100, #parents, children
10                 .2, .4, #probabilities
11                 1) #iterations
12 
13         top = sorted(pop, key=lambda x:x.fitness.values[0])[-1]
14         fit = top.fitness.values[0]
15         print fit
16 
17     print top

Results

With this method a couple of interesting solutions to the problem have evolved within 10 seconds or so from launching the process on a relatively speedy computer with DIM = 11. Here they are.

First solution 11x11 and sample highlight Second solution 11x11 and sample highlight

With DIM = 10 the algorithm often plateaued at around 46 satisfied expressions out of 48, trapped in a local solution from which nor mutation nor mating could come out of, even after many hours of running. In some other luckier runs, however, it has been able to produce solutions in half a minute or even less:

First solution 10x10 and sample highlight Second solution 10x10 and sample highlight

Conclusion and Download

This code can be modified introducing new regular expressions in order to create different languages or a weather clock, or perhaps as a base to solve different genetic problems. The __str__ function of the individual can also be altered to produce different shapes of word clocks.

The full source code for the genetic word clock is available for download here, it will be useful to experiment with the deap package, to understand the concepts behind it and for writing variations or improving on the problem solution.

Comments