Jan Rychter


Technology, usability and programming with an occasional business angle.

Blog Widget by LinkWithin
Profile
Search

Entries in Clojure (3)

Wednesday
Aug262009

Experiments with parallel genetic programming in Clojure

I’ve been experimenting with genetic programming, learning Clojure as I go. I came to the point where I wanted to make my program parallel.

First of all, I am amazed at how readable and concise the code turns out to be. As an example, take a look at this function:

(defn choose-reproducing-parents [individuals]
  (take 2 
    (sort-by :fitness > 
      (select-randomly 5 individuals))))

It doesn’t get more readable than this!

But the real joy came when I started to parallelize my code. Normally, the process would involve extra libraries, lots of fussing around with locks, and hours spent debugging deadlocks. So let’s look at an example function I needed to parallelize. Generating random code takes quite a bit of CPU time, especially if one needs to generate code for thousands of individuals. There is a function for that:

(defn create-random-individuals [number code-generator]
  (map create-individual-from-code (take number (repeatedly code-generator))))

And here’s the reworked parallel version:

(defn create-random-individuals [number code-generator]
  (pmap create-individual-from-code (take number (repeatedly code-generator))))

Can you spot the difference? Yes, that’s it. The little letter ‘p’ is all it takes for the work to be spread among all of my CPU cores.

Other functions required more work (again, notice how concise and readable the code is):

(defn produce-offspring [population number]
  (take number
	(repeatedly #(reproduce
		      (choose-reproducing-parents population)))))

For those unfamiliar with Clojure, repeatedly produces a lazy sequence whose elements are produced by the function (in this case an anonymous function) supplied as the argument. take simply takes the first number elements of a sequence.

And now for the parallel version:

(defn produce-offspring [population number]
  (pmap reproduce
	(take number
	      (repeatedly #(choose-reproducing-parents population)))))

Encouraged by this, I then moved on to parallelize the most time-consuming step of all GP programs: fitness evaluations. I’ll spare you the boring details of extended parallelization work I did on the function (but see examples above). The result was:

(defn test-generation [population fitness-function]
  (pmap #(set-fitness % (fitness-function %)) population))

After fitness was evaluated for all individuals, the next generation was produced and the fitness evaluations were run in parallel again.

This worked fine, but I thought there should be a better way. pmap limits me to a single multicore machine. This is fine for now, but in the future I plan to move to a distributed cluster, where the synchronous nature of map would be limiting. So I tried to write an asynchronous implementation.

First, I defined my pool: a collection where individuals are gathered once their fitness is evaluated:

(def pool (agent (vector)))

The pool is a Clojure agent. Agents are a synchronization primitive: you can send them actions (functions), which will be queued and executed in order.

As you can see, the pool initially starts as an empty vector. So how do individuals get to the pool? Their fitness needs to be evaluated, and then they need to be added to the pool. It all starts with this function:

(defn run-individuals [individuals]
  (dorun
   (map #(send % test-individual *fitness-function*)
	(map #(add-watcher (agent %) :send pool fitness-tested) individuals))))

First, we make each individual an agent, so that we can add a watcher to it. Watchers are a cool Clojure feature — they let you watch for state changes. Using add-watcher we add a watcher to each individual, telling it to send a fitness-tested action to the pool (which is also an agent, remember?). Then, once we’ve set up watching for state changes, we send a test-individual action to each individual, giving it the fitness function as a parameter. test-individual is a really simple function, all it does is call the fitness evaluation function and return the new state of the individual.

The dorun is necessary, because we’re dealing with lazy sequences and discarding the result (sending agent actions is a side effect). If the dorun wasn’t there, the entire sequence would never get evaluated and actions would never get sent.

Let’s see what happens once the pool is notified of a state change in an individual:

(defn fitness-tested [watched-population individual]
  (let [population (conj watched-population @individual)]
    (if (>= (count population) target-population-size)
      (let [new-population (prune-population population)]
	(run-individuals
	 (produce-offspring new-population
			    (- target-population-size (count new-population))))
	new-population)
      population)))

First, we add the new individual to the pool. If we haven’t gathered enough individuals, we simply return the pool with the individual added — this updates the global pool.

The fun part begins when we have enough individuals to produce the next generation. Then, we prune the population, deleting the poorest individuals and produce a new batch of individuals, letting them go using the previously described run-individuals.

If you’ve done any parallel programming, you’ll probably worry about multiple threads modifying (pruning) the population simultaneously. Not to worry — Clojure agents are monitors, you are guaranteed that only one action will execute at a time.

We now have a fully asynchronous parallel GP implementation. Notice how there aren’t any queues, locks, thread pools to manage. All we have is a single global variable and a couple of simple functions. We don’t need any new data structures! The beauty of this solution is that because we’re using agents and watchers, Clojure does the queueing for us. Look, ma, no queues!

I am very happy with how easy and clean this solution turned out to be. I can now see why people keep raving about Clojure. Somebody finally did some serious thinking and implemented a new approach to parallel programming, not just a rehash of old ideas.

In less than an hour I went from a sequential implementation to a parallel asynchronous one. I’d say that’s impressive. And most importantly, the same code still runs on a single-cpu machine, with minimal performance impact. I am very impressed.

Wednesday
Jul292009

Clojure performance revisited

Since many people asked me about this, here are some additional notes about Clojure performance.

First, something which came to me as a surprise: the single biggest performance jump I got with my application was achieved by switching from Java 5 to Java 6 (64-bit, Mac OS X). The jump was huge — from interpreting around 850,000 instructions per second right up to 1,300,000 instr/s. That’s a nearly 60% improvement that required ZERO work on my part. Two clicks in Java Preferences.

Invoking a function is expensive. I am back to old Common Lisp techniques of using macros instead of functions in many places.

Watch out for var lookups (yes, I mentioned this before, but this is important).

The other things I did were application-specific, so there isn’t much point in describing them here.

And if you’re interested in how the JIT performs, here’s a sample run of the application. As you can see, it takes almost 20 runs until the times stabilize at around 1.5 million interpreted instructions per second. The improvement is dramatic: 276% from the first run to the last one.


Executed 616154 instructions in 1.543867 seconds, instruction rate: 399097.84 inst/s
Executed 616154 instructions in 0.653465 seconds, instruction rate: 942902.9 inst/s
Executed 616154 instructions in 0.522443 seconds, instruction rate: 1179370.8 inst/s
Executed 616154 instructions in 0.492671 seconds, instruction rate: 1250639.9 inst/s
Executed 616154 instructions in 0.482119 seconds, instruction rate: 1278012.2 inst/s
Executed 616154 instructions in 0.424934 seconds, instruction rate: 1449999.2 inst/s
Executed 616154 instructions in 0.424169 seconds, instruction rate: 1452614.4 inst/s
Executed 616154 instructions in 0.416273 seconds, instruction rate: 1480168.1 inst/s
Executed 616154 instructions in 0.420429 seconds, instruction rate: 1465536.4 inst/s
Executed 616154 instructions in 0.421797 seconds, instruction rate: 1460783.2 inst/s
Executed 616154 instructions in 0.421114 seconds, instruction rate: 1463152.5 inst/s
Executed 616154 instructions in 0.4115 seconds, instruction rate: 1497336.5 inst/s
Executed 616154 instructions in 0.410837 seconds, instruction rate: 1499753.0 inst/s
Executed 616154 instructions in 0.411064 seconds, instruction rate: 1498924.8 inst/s
Executed 616154 instructions in 0.410936 seconds, instruction rate: 1499391.6 inst/s
Executed 616154 instructions in 0.410301 seconds, instruction rate: 1501712.1 inst/s
Executed 616154 instructions in 0.410638 seconds, instruction rate: 1500479.8 inst/s
Executed 616154 instructions in 0.408832 seconds, instruction rate: 1507108.0 inst/s
Executed 616154 instructions in 0.410466 seconds, instruction rate: 1501108.5 inst/s
Executed 616154 instructions in 0.410113 seconds, instruction rate: 1502400.5 inst/s
Executed 616154 instructions in 0.409741 seconds, instruction rate: 1503764.5 inst/s
Monday
Jul202009

Clojure performance tuning

Having done some actual coding in Clojure I can post notes that will hopefully help others. I have code that gets executed a lot (it’s a stack-based language interpreter) and needed to bring it up to reasonable performance. Here are some notes from the process:

  1. You really should know the difference between a seq and a list. If you use any of the seq functions (such as drop or take), your list will no longer have an O(1) count operation. Instead, it will become a LazySeq and count will become O(n). If count is something you call frequently (I do), you will want to avoid this. In my case, my stacks were implemented as lists and count gets called very frequently, so this was a serious and surprising problem. Surprising, because most tutorials skim over the difference, only emphasizing how general seqs are. So if your count performance takes a dive, see if you can replace (drop 2 mylist) with (pop (pop mylist)). The latter will keep the PersistentList structure.
  2. I implemented my stacks as both lists and vectors. There was almost no performance difference between the two (lists were actually slightly faster). I found this to be a surprising result, I expected vectors to be faster. I still think vectors might have lower memory requirements, but I don’t know how to check.
  3. For vector access, (v n) is faster than (nth v n) which is faster than (get v n). This is not something I expected. I think this will get ironed out in the future, as Clojure matures.
  4. Attempts to replace Clojure vectors containing integers with primitive Java int arrays produced no performance gains. In fact, they managed to hurt performance because of some necessary conversions.
  5. As expected, there are very few places where type declarations improve performance. But if you have loops that get executed a lot, you might want to check. Always measure, as the results might be unexpected, and remember that less is more.
  6. When measuring anything running on top of the JVM, let it run for a while before you draw any conclusions. The Hotspot JIT compiler does really cool things with the code, but it does them after a while. In my case, I run code for at least 10-20 seconds and watch the results. I take them into account only after they stabilize. It is common to see a 2x improvement between the first run and the last one.
  7. Accessing vars costs cycles. Clojure has no constants, so many people use *global-parameters*. This is not a good idea in performance-sensitive code. There are two ways around it:
    1. Define macros that expand to constants:
      (defmacro global-parameter [] *global-parameter*)
      and use (global-parameter) in your code.
    2. Enclose your function definitions in let forms, rebinding global constants to lexical variables:
      (let [global-parameter *global-parameter*] (defn my-function [] ...))

As for profiling tools, the old and tried method of actually measuring wall time worked best. I tried the YourKit profiler, but wasn’t impressed, at at $499 it is way overpriced. If they come out with a Clojure edition (drop some features) at $79, I’ll consider. I also tried JVisualVM, but it turned out that it is buggy on a Mac and the profiler doesn’t work. I hope this will be fixed in the future.