Counting Words in Clojure

2013-10-26

One of the early Clojure exercises on exercism.io is writing a word count algorithm. The exercise is easy enough in Ruby, but at first glance seems daunting for someone not accustomed to functional programming. In reality, though, with a proper introduction to the concept of reduce in Clojure, the exercise is quite enjoyable and not nearly so difficult.

To start, let's assume we have a string: one fish two fish red fish blue fish. How do we write an algorithm to get the following output?

{"one" 1, "fish" 4, "two" 1, "red" 1, "blue" 1}

In Ruby, we might split the string out into an array of words and iterate over them while keeping track of their frequency using a hash. Clojure gives us an easy way to do much the same:

(def words-to-count "one fish two fish red fish blue fish")

(frequencies (clojure.string/split words-to-count #"\s"))
; {"one" 1, "fish" 4, "two" 1, "red" 1, "blue" 1}

First, we split our string using a regular expression for whitespace (i.e., #"\s"), and then pass the resulting vector off to the frequencies function which does the rest of the work, returning the result we wanted.

So that wasn't so hard, but what if we didn't have the handy frequencies function? Enter reduce.

In a Clojure repl, we can type (doc reduce) to get some basic documentation on the function. According to the docs, reduce takes a function, an initial value, and a collection. For example, if we wanted to sum a a vector of numbers, we could use reduce as follows:

(reduce + 0 [1 2 3]) ; 6

The plus sign is our function, the zero our initial value, and the vector our collection. Understanding reduce in my mind is a fundamental requirement to understanding much of how functional programming works. And there's a great post on reduce which has helped me wrap my mind around the subtle function. The post deserves to be read in its entirety, but I will briefly present the points relevant to counting words here.

The core of reduce is in its use of looping and recursion. Let's go one step further with reduce before we return to our word count algorithm.

Assume we wanted to sum a vector without using reduce. How would that work? We would start by using loop and recur, Clojure's (and functional programming's) version of iteration.

(loop [acc 0 coll [1 2 3]]
  (if (empty? coll) acc
      (recur (+ acc (first coll)) (rest coll))))

We start by assigning some local values, acc for our accumulator which is set to zero, and coll for our initial collection. As with all recursive algorithms, we access the base case first, i.e., that our collection is empty. If it is in fact empty, we will return our accumulator. Otherwise, we use recur to pass new values for acc and coll: 1) the sum of our accumulator's value and the first value in the collection as acc, and 2) what's left of the collection without the first element, i.e., (rest coll), as coll. This recursion will continue until we empty out coll and have added all values to the accumulator.

Whereas the loop above will only use addition, we could easily write our own version of reduce to which we could pass the function with which we wanted to reduce our collection:

(defn my-reduce [fn init collection]
  (loop [acc init coll collection]
    (if (empty? coll) acc
        (recur (fn acc (first coll)) (rest coll)))))

Although the function above is a naive implementation, it nonetheless demonstrates how reduce is built from loop and recur.

Now that we've covered the basics of reduce, let's return to our word count algorithm above. We know now that we can use reduce in the absence of frequencies. Our first attempt might look something like the following:

(def split-words (clojure.string/split words-to-count #"\s"))

(defn count-words [words-to-count]
  (reduce _ {} words-to-count))

(count-words split-words)

Once we've split our string into a vector of words, we can easily formulate the skeleton of our call to reduce. What function will take the place of the underscore above? Let's write it now.

We'll call the function record-word-count. It will take two arguments, a map to keep track of each word and its count, and a word to enter into a new copy of the map.

(defn record-word-count [memo word]
  (assoc memo word (inc (memo word 0))))

We'll start from the inside and move outwards. First, with (memo word 0), we attempt to lookup the value stored in memo for our word. If we don't find it, we return 0. This is much like Ruby's Hash#fetch method which optionally takes a value for a missing key. Next, whatever value we return (zero or otherwise), we increment its value with inc. Finally, we use assoc to create a new map which will include our word and its updated value.

Our resulting code is as follows:

(def split-words (clojure.string/split words-to-count #"\s"))

(defn record-word-count [memo word]
  (assoc memo word (inc (memo word 0))))

(defn count-words [words-to-count]
  (reduce record-word-count {} words-to-count))

(count-words split-words)
; {"one" 1, "fish" 4, "two" 1, "red" 1, "blue" 1}