Experiments with parallel genetic programming in Clojure
2009-08-26
I’ve been experimenting with genetic programming, learning Clojure as I go. I came to the point where I wanted to make my program parallel.
First of all, I am amazed at how readable and concise the code turns out to be. As an example, take a look at this function:
(defn choose-reproducing-parents [individuals] (take 2 (sort-by :fitness > (select-randomly 5 individuals))))
It doesn’t get more readable than this!
But the real joy came when I started to parallelize my code. Normally, the process would involve extra libraries, lots of fussing around with locks, and hours spent debugging deadlocks. So let’s look at an example function I needed to parallelize. Generating random code takes quite a bit of CPU time, especially if one needs to generate code for thousands of individuals. There is a function for that:
(defn create-random-individuals [number code-generator] (map create-individual-from-code (take number (repeatedly code-generator))))
And here’s the reworked parallel version:
(defn create-random-individuals [number code-generator] (pmap create-individual-from-code (take number (repeatedly code-generator))))
Can you spot the difference? Yes, that’s it. The little letter ‘p’ is all it takes for the work to be spread among all of my CPU cores.
Other functions required more work (again, notice how concise and readable the code is):
(defn produce-offspring [population number] (take number (repeatedly #(reproduce (choose-reproducing-parents population)))))
For those unfamiliar with Clojure, repeatedly produces a lazy sequence whose elements are produced by the function (in this case an anonymous function) supplied as the argument. take simply takes the first number elements of a sequence.
And now for the parallel version:
(defn produce-offspring [population number] (pmap reproduce (take number (repeatedly #(choose-reproducing-parents population)))))
Encouraged by this, I then moved on to parallelize the most time-consuming step of all GP programs: fitness evaluations. I’ll spare you the boring details of extended parallelization work I did on the function (but see examples above). The result was:
(defn test-generation [population fitness-function] (pmap #(set-fitness % (fitness-function %)) population))
After fitness was evaluated for all individuals, the next generation was produced and the fitness evaluations were run in parallel again.
This worked fine, but I thought there should be a better way. pmap limits me to a single multicore machine. This is fine for now, but in the future I plan to move to a distributed cluster, where the synchronous nature of map would be limiting. So I tried to write an asynchronous implementation.
First, I defined my pool: a collection where individuals are gathered once their fitness is evaluated:
(def pool (agent (vector)))
The pool is a Clojure agent. Agents are a synchronization primitive: you can send them actions (functions), which will be queued and executed in order.
As you can see, the pool initially starts as an empty vector. So how do individuals get to the pool? Their fitness needs to be evaluated, and then they need to be added to the pool. It all starts with this function:
(defn run-individuals [individuals] (dorun (map #(send % test-individual *fitness-function*) (map #(add-watcher (agent %) :send pool fitness-tested) individuals))))
First, we make each individual an agent, so that we can add a watcher to it. Watchers are a cool Clojure feature — they let you watch for state changes. Using add-watcher we add a watcher to each individual, telling it to send a fitness-tested action to the pool (which is also an agent, remember?). Then, once we’ve set up watching for state changes, we send a test-individual action to each individual, giving it the fitness function as a parameter. test-individual is a really simple function, all it does is call the fitness evaluation function and return the new state of the individual.
The dorun is necessary, because we’re dealing with lazy sequences and discarding the result (sending agent actions is a side effect). If the dorun wasn’t there, the entire sequence would never get evaluated and actions would never get sent.
Let’s see what happens once the pool is notified of a state change in an individual:
(defn fitness-tested [watched-population individual] (let [population (conj watched-population @individual)] (if (>= (count population) target-population-size) (let [new-population (prune-population population)] (run-individuals (produce-offspring new-population (- target-population-size (count new-population)))) new-population) population)))
First, we add the new individual to the pool. If we haven’t gathered enough individuals, we simply return the pool with the individual added — this updates the global pool.
The fun part begins when we have enough individuals to produce the next generation. Then, we prune the population, deleting the poorest individuals and produce a new batch of individuals, letting them go using the previously described run-individuals.
If you’ve done any parallel programming, you’ll probably worry about multiple threads modifying (pruning) the population simultaneously. Not to worry — Clojure agents are monitors, you are guaranteed that only one action will execute at a time.
We now have a fully asynchronous parallel GP implementation. Notice how there aren’t any queues, locks, thread pools to manage. All we have is a single global variable and a couple of simple functions. We don’t need any new data structures! The beauty of this solution is that because we’re using agents and watchers, Clojure does the queueing for us. Look, ma, no queues!
I am very happy with how easy and clean this solution turned out to be. I can now see why people keep raving about Clojure. Somebody finally did some serious thinking and implemented a new approach to parallel programming, not just a rehash of old ideas.
In less than an hour I went from a sequential implementation to a parallel asynchronous one. I’d say that’s impressive. And most importantly, the same code still runs on a single-cpu machine, with minimal performance impact. I am very impressed.
In case you haven't seen, Rich is working on using fork/join primatives from Java to make an even more efficient pvmap which has less overhead compared to pmap while still being fully capable of parallel work like pmap.
Here's an example from a while back of him testing in the branch, note things may have changed since this paste.
http://paste.lisp.org/display/84027
Closure newb here :-)
Interesting! Thanks for blogging about this.
I'm curious about your thoughts on the last step: actually running the code in a distributed environment.
How much effort do you think it would be to actually get the code running and working together on multiple machines?
I'm hoping there is a Clojure solution to this instead of the usual setting up "service containers", downloading tons of libraries, large configuration files, etc.
Patrick: I heard about this work, but haven't checked it out. However, in my case the overhead is not a problem, as the granularity is quite high. Evaluating fitness of a a single individual might take tens to hundreds of milliseconds. Compared to that, pmap overheads are quite small.
Mark: Clojure can't help me there by itself, I will need libraries. I plan to take a close look at swarmiji, which looks very promising.
Cool article. I'm interested in genetic algorithms as well and program in Clojure regularly, so this was just my cup of tea! On the topic of distributed Clojure, Rich has indicated that the maturity and robustness of existing offerings alleviate the need to bake this feature directly into Clojure. He's put together a nice list of options here:
http://groups.google.com/group/clojure/msg/4a7a866c45dc2101
I've been playing around with JGroups lately, and it seems like a pretty painless way to distribute tasks in Clojure.
Thanks for this post.
You use 8 functions that are not defined here
(declare select-randomly)
(declare create-individual-from-code)
(declare reproduce)
(declare set-fitness)
(declare test-individual)
(declare *fitness-function*)
(declare target-population-size)
(declare prune-population)
As a GP newbie, is there a simple example of where these are defined
so that I can run through some specific numbers?
John: I intended the code snippets to demonstrate just the parallel aspects. The functions you mentioned in turn use other functions, so I'd have to post the entire thing, which is definitely not ready for publishing yet. Perhaps some day.
That looks cool, thanks for the post. I've been doing the same thing using Gambit Scheme chosen for performance reasons as Gambit is probably the fastest Scheme implementation overall.
As you've probably discovered GP is extremely CPU intensive and being able to easily spread the load over multiple cores is useful. I'm looking at using Termite which is an Erlang style distributed processing system for Gambit that allows you to distribute load over multiple processes and hosts. It would be interesting to get a benchmark comparison of Clojure against Gambit (and perhaps SBCL) to see which would be the best option for GP using a Lisp style language.
Andrew: certainly, although I suspect that both Gambit and SBCL would beat Clojure in terms of raw single-core performance.
I chose to wrote my implementation in Clojure because I suspected it would be easy to scale to multiple cores -- and it turned out I was right. Even if my Clojure code was 2x slower than, say, SBCL, I am running it now on 8 cores, so I'm 4x ahead. I would likely never implement a multi-threaded solution using SBCL.
You are right - multicore is the way to go and as you've shown doing this in Clojure is simple. Writing threads in Clojure, Gambit or SBCL is hassle you can do without and it wouldn't help in Gambit because the threads are internal and not native threads.
"Even if my Clojure code was 2x slower than, say, SBCL, I am running it now on 8 cores, so I'm 4x ahead." - what if it's 10 x slower ? As a comparison, I compared SISC (JVM Scheme) against Gambit for a web application and it was 20 x slower. I'm not saying that Clojure will be necessarily but there is potential for it to be.
The future for GP has to be using massively parallel hardware like the NVidia CUDA graphics cards where you can get 196 cores for around $200 ! There is a C API to access these - it would be nice if there was a Scheme or Lisp implementation that could make use of them but I haven't found one yet.
BTW your Clojure code looks nice and very readble.
there is no big deal with implementing this kind of parallel stuff in SBCL - check out http://common-lisp.net/project/eager-future/ for example