How to Leak a Goroutine and Then Fix It

2018-05-22

Many Go developers are familiar with the dictum, Never start a goroutine without knowing how it will stop. And yet, it remains incredibly easy to leak goroutines. Let’s look at one common way to leak a goroutine and how to fix it.

To do that, we are going to build a library with a custom map type whose keys expire after a configured duration. We will call the library ttl, and it will have an API that looks like this:

// Create a map with a TTL of 5 minutes
m := ttl.NewMap(5*time.Minute)
// Set a key
m.Set("my-key", []byte("my-value"))

// Read a key
v, ok := m.Get("my-key")
// "my-value"
fmt.Println(string(v))
// true, key is present
fmt.Println(ok)

// ... more than 5 minutes later
v, ok = m.Get("my-key")
// no value here
fmt.Println(string(v) == "")
// false, key has expired
fmt.Println(ok)

To ensure keys expire, we start a worker goroutine in the NewMap function:

func NewMap(expiration time.Duration) *Map {
    m := &Map{
        data:       make(map[string]expiringValue),
        expiration: expiration,
    }

    // start a worker goroutine
    go func() {
        for range time.Tick(expiration) {
            m.removeExpired()
        }
    }()

    return m
}

The worker goroutine will wake up every configured duration and invoke a method on the map to remove any expired keys. This means Set will need to record when each key should expire, which is why the data field stores an expiringValue type that associates the actual value with an expiration time:

type expiringValue struct {
    expiration time.Time
    data       []byte // the actual value
}
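
For completeness, the rest of the type might look something like this. The sync.Mutex and the method bodies below are just one reasonable implementation, not the only one, and the done channel we will add later for Close is omitted for now:

type Map struct {
    mu         sync.Mutex
    data       map[string]expiringValue
    expiration time.Duration
}

func (m *Map) Set(key string, value []byte) {
    m.mu.Lock()
    defer m.mu.Unlock()
    // record when this key should expire
    m.data[key] = expiringValue{
        expiration: time.Now().Add(m.expiration),
        data:       value,
    }
}

func (m *Map) Get(key string) ([]byte, bool) {
    m.mu.Lock()
    defer m.mu.Unlock()
    v, ok := m.data[key]
    // also check the expiration here so a key never outlives its TTL
    // between sweeps of the worker goroutine
    if !ok || time.Now().After(v.expiration) {
        return nil, false
    }
    return v.data, true
}

// removeExpired is the method the worker goroutine invokes on every tick.
func (m *Map) removeExpired() {
    m.mu.Lock()
    defer m.mu.Unlock()
    now := time.Now()
    for key, value := range m.data {
        if now.After(value.expiration) {
            delete(m.data, key)
        }
    }
}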

To the untrained eye, the invocation of the worker goroutine may seem fine. And if this wasn’t a post about leaking goroutines, it would be incredibly easy to scan over the lines without raising an eyebrow. Nonetheless, we leak a goroutine inside the constructor. The question is, how?

Let’s walk through a typical lifecycle of a Map. First, a caller creates an instance of the Map. After creating the instance, a worker goroutine is now running. Next, the caller might make any number of calls to Set and Get. Eventually, though, the caller will finish using the Map instance and release all references to it. At that point, the garbage collector would normally be able to collect the instance’s memory. However, the worker goroutine is still running and still holds a reference to the Map instance. Since there are no explicit calls to stop the worker, we have leaked a goroutine and have leaked the instance’s memory as well.

Let’s make the problem especially obvious. To do that, we will use the runtime package to view statistics about the memory allocator and the number of goroutines running at a particular moment in time.

func main() {
    go func() {
        var stats runtime.MemStats
        for {
            runtime.ReadMemStats(&stats)
            fmt.Printf("HeapAlloc    = %d\n", stats.HeapAlloc)
            fmt.Printf("NumGoroutine = %d\n", runtime.NumGoroutine())
            time.Sleep(5*time.Second)
        }
    }()

    for {
        work()
    }
}

func work() {
    m := ttl.NewMap(5*time.Minute)
    m.Set("my-key", []byte("my-value"))

    if _, ok := m.Get("my-key"); !ok {
        panic("no value present")
    }
    // m goes out of scope
}

It doesn’t take long to see that the heap allocations and the number of goroutines are growing much, much too fast.

HeapAlloc    = 76960
NumGoroutine = 18
HeapAlloc    = 2014278208
NumGoroutine = 1447847
HeapAlloc    = 3932578560
NumGoroutine = 2832416
HeapAlloc    = 5926163224
NumGoroutine = 4322524

So now it’s clear we need to stop that goroutine. Currently, the Map API provides no way to shut down the worker goroutine. It would be nice to avoid any API changes and still stop the worker goroutine when the caller is done with the Map instance. But only the caller knows when they are done.

A common pattern to solve this problem is to implement the io.Closer interface. When a caller is done with the Map, they can call Close to tell the Map to stop its worker goroutine.

func (m *Map) Close() error {
    close(m.done)
    return nil
}
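
With Close available, the caller releases the worker when it is done with the map. In the profiling example above, work might simply defer the call:

func work() {
    m := ttl.NewMap(5*time.Minute)
    // stop the worker goroutine when this function returns
    defer m.Close()

    m.Set("my-key", []byte("my-value"))

    if _, ok := m.Get("my-key"); !ok {
        panic("no value present")
    }
}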

The invocation of the worker goroutine in our constructor now looks like this:

func NewMap(expiration time.Duration) *Map {
    m := &Map{
        data:       make(map[string]expiringValue),
        expiration: expiration,
        done:       make(chan struct{}),
    }

    // start a worker goroutine
    go func() {
        ticker := time.NewTicker(expiration)
        defer ticker.Stop()
        for {
            select {
            case <-ticker.C:
                m.removeExpired()
            case <-m.done:
                return
            }
        }
    }()

    return m
}

Now the worker goroutine includes a select statement that checks the done channel in addition to the ticker’s channel. Note, we have also swapped out time.Tick for a time.Ticker, since time.Tick provides no way to stop the underlying ticker and would leak as well.

After making the changes, here is what our simplistic profiling looks like:

HeapAlloc    = 72464
NumGoroutine = 6
HeapAlloc    = 5175200
NumGoroutine = 59
HeapAlloc    = 5495008
NumGoroutine = 35
HeapAlloc    = 9171136
NumGoroutine = 240
HeapAlloc    = 8347120
NumGoroutine = 53

The numbers are hardly small, which is a result of work being invoked in a tight loop. More importantly, though, we no longer have the massive growth in the number of goroutines or heap allocations. And that’s what we’re after. Note, the final code may be found here.
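
The fix is also easy to guard with a quick regression test. Something like the sketch below works, with the caveat that comparing goroutine counts assumes nothing else in the test binary is starting goroutines at the same time:

func TestCloseStopsWorker(t *testing.T) {
    before := runtime.NumGoroutine()

    m := ttl.NewMap(time.Millisecond)
    m.Set("my-key", []byte("my-value"))
    m.Close()

    // give the worker a moment to observe the closed done channel
    time.Sleep(100 * time.Millisecond)

    if after := runtime.NumGoroutine(); after > before {
        t.Fatalf("worker still running: %d goroutines before, %d after", before, after)
    }
}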

If anything, this post provides an obvious example of why knowing when a goroutine will stop is so important. As a secondary conclusion, we might say that monitoring the number of goroutines in an application is just as important. Such a monitor provides a warning system if a goroutine leak sneaks into the codebase. It’s also worth keeping in mind that sometimes goroutine leaks take days if not weeks to manifest in an application. And so it’s worth having monitors for both shorter and longer timespans.
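
As one lightweight illustration, the standard library’s expvar package can publish the goroutine count next to the memory stats it already exports, ready to be scraped from /debug/vars by whatever monitoring system is in place:

// publish the current goroutine count; expvar registers its handler
// on http.DefaultServeMux, so it is served at /debug/vars
expvar.Publish("goroutines", expvar.Func(func() interface{} {
    return runtime.NumGoroutine()
}))

// assumes the application is not already serving on this address
go http.ListenAndServe("localhost:8080", nil)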

Thanks to Jean de Klerk and Jason Keene who read drafts of this post.