Lessons in concurrency from Erlang

Concurrent programming needn’t be hard. I always thought that writing concurrent code was one of the most mystifying parts of software development; and experience with helping others made me realise that this was a surprisingly common view. The main question is usually not “how do I write concurrent code?” though, but “how do I write reliable concurrent code that’s resilient enough for production usage?”.

If there’s a group of people who know a thing or two about writing reliable concurrent code, it’s the telecommunications company Ericsson. In 1998 they produced a telephone gateway (the AXD301) that had an immense nine 9s of availability, i.e. its uptime was a staggering 99.9999999%. Their programming language of choice? Erlang. So what can Erlang teach us about writing reliable concurrent code?

If you’re not familiar with Erlang, all you really need to know for this post is:

  1. A process is a concurrency type that behaves much like an OS thread with isolated memory and mechanisms for termination and control. They’re also very lightweight.
  2. Erlang’s concurrency model is based heavily on the Actor Model - the concept of an Actor being a concurrency primitive that can only communicate with others via messages, as opposed to direct function calls.

Pass Messages and not Memory

If you’ve spent any time developing in Go, you’ll have heard the idiom “Do not communicate by sharing memory; instead, share memory by communicating.” - that’s to say opt for sharing primitives via channels as opposed to sharing pointers or interacting with the variables of a higher scope.

In Erlang there is no shared memory between processes: each process uses memory that’s isolated from any others. Additionally, as a functional language, Erlang variables (or “bindings”) are immutable - so the risks associated with sharing memory would be minimal even if sharing were possible.

Eshell V11.2  (abort with ^G)
1> X = 1.
1
2> X = 2.
** exception error: no match of right hand side value 2

In a functional language this makes sense: what point is there to sharing references as opposed to values, other than to cause side effects?1 Likewise, variable immutability becomes a minor issue when you’re writing short recursive functions where the scope is limited, ephemeral and often re-created.

Idea One: Communication through messages - as opposed to memory - drastically reduces the risk of race conditions, and also makes code easier to comprehend and reason about.
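As a rough illustration of Idea One in Go, here is a minimal sketch of a counter that is owned by a single goroutine and can only be reached through messages. The names counterMsg, startCounter and addMany are hypothetical and exist purely for this example.

```go
package main

import "fmt"

// counterMsg is a hypothetical message type: add Delta to the counter and,
// if Reply is non-nil, send the new total back on it.
type counterMsg struct {
	Delta int
	Reply chan int
}

// startCounter launches the goroutine that owns the count and returns the
// channel that is the only way to reach it.
func startCounter() chan<- counterMsg {
	inbox := make(chan counterMsg)
	go func() {
		count := 0 // visible to this goroutine only
		for msg := range inbox {
			count += msg.Delta
			if msg.Reply != nil {
				msg.Reply <- count
			}
		}
	}()
	return inbox
}

// addMany sends n increments from n separate goroutines and returns the
// total reported by the counter afterwards.
func addMany(n int) int {
	inbox := startCounter()
	defer close(inbox)

	done := make(chan struct{})
	for i := 0; i < n; i++ {
		go func() {
			inbox <- counterMsg{Delta: 1}
			done <- struct{}{}
		}()
	}
	for i := 0; i < n; i++ {
		<-done
	}

	// Ask for the total with a zero-delta message carrying a reply channel.
	reply := make(chan int)
	inbox <- counterMsg{Reply: reply}
	return <-reply
}

func main() {
	fmt.Println(addMany(100)) // 100
}
```

No lock is needed here because only one goroutine ever touches count; every other goroutine has to ask for changes through the inbox.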

It’s obvious to state that “global variables bad”2 - but in concurrent applications this rings true more than ever. Anecdotally, as a contractor the worst issues I used to come across were often around -ahem- “proof of concept” codebases that had been pushed into production whilst relying upon globals.3

Idea Two: Whilst global variables are often seen as a code-smell, this is especially true in concurrent applications: take care around the scoping of any code that's written.
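To make Idea Two a little more concrete, the sketch below keeps all per-call state in a value that is created inside the function handling that call, rather than in package-level variables. The callSession type and handleCall function are hypothetical, loosely modelled on the telephony anecdote in the footnotes.

```go
package main

import (
	"fmt"
	"sync"
)

// callSession is a hypothetical per-call state holder. Each call gets its
// own value, so nothing is shared through package-level variables.
type callSession struct {
	callerID string
	answers  []string
}

// record appends an answer to this session only.
func (s *callSession) record(answer string) {
	s.answers = append(s.answers, answer)
}

// handleCall runs one call in isolation and returns a summary of it.
func handleCall(callerID string, answers []string) string {
	session := &callSession{callerID: callerID} // scoped to this call
	for _, a := range answers {
		session.record(a)
	}
	return fmt.Sprintf("%s:%v", session.callerID, session.answers)
}

func main() {
	calls := [][]string{{"yes", "no"}, {"no"}}
	results := make([]string, len(calls))

	var wg sync.WaitGroup
	for i, answers := range calls {
		wg.Add(1)
		go func(i int, answers []string) {
			defer wg.Done()
			// Each goroutine writes to its own slot, so no locking is needed.
			results[i] = handleCall(fmt.Sprintf("caller-%d", i), answers)
		}(i, answers)
	}
	wg.Wait()

	fmt.Println(results)
}
```

Because each session lives and dies inside handleCall, two concurrent calls have no way of seeing each other’s answers.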

What about the humble mutex?

One synchronisation primitive that you’ll see commonly used in concurrent code (albeit not in Erlang!) is the mutex. Go has sync.Mutex and Rust has std::sync::Mutex - and these allow you to “lock” any shared state, ensuring consistent reads/writes and reducing the risk of race conditions. These should be a last resort.

As a rule of thumb these should only be used when there’s no alternative but to share memory across threads; and when the Actor model is employed, it becomes a lot easier to fully embrace the idiom of sharing by communication. A common use case for a Mutex is when developing network services, where an abstraction around an underlying network socket may need to be guarded to prevent concurrent writes.
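For reference, the guarded-writer use case might look something like the sketch below. The lockedWriter type is hypothetical, and a bytes.Buffer stands in for the network socket so that the example runs on its own.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sync"
)

// lockedWriter is a hypothetical wrapper that serialises writes to w.
type lockedWriter struct {
	mu sync.Mutex
	w  io.Writer
}

// Write holds the lock for the duration of a single underlying write.
func (l *lockedWriter) Write(p []byte) (int, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.w.Write(p)
}

// writeConcurrently has n goroutines each write one 7-byte line and
// returns the number of bytes that reached the buffer.
func writeConcurrently(n int) int {
	var buf bytes.Buffer // stands in for a network socket
	lw := &lockedWriter{w: &buf}

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			fmt.Fprintf(lw, "msg %02d\n", i)
		}(i)
	}
	wg.Wait()
	return buf.Len()
}

func main() {
	fmt.Println(writeConcurrently(10)) // 70
}
```

The lock keeps each message intact, but every caller still holds a reference to the same shared writer, which is exactly the coupling that an actor-style abstraction avoids.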

Synchronisation becomes a lot easier if we borrow the idea of having small concurrent actors that (a) provide a single path for operations against the resources they own, and (b) export no shared state. Here’s a rather contrived example using file writes:

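// usage:
// fileWrite, _ := NewFileActorWriter("example.tmp")
// fileWrite <- "Hello World!"
// close(fileWrite)
import (
    "log"
    "os"
)

// FileActor owns the file; the write channel is the only path to it.
type FileActor struct {
    *os.File
    write chan string
}

// NewFileActorWriter creates the file, starts the goroutine that owns it,
// and returns the channel used to send it lines.
func NewFileActorWriter(filename string) (chan string, error) {
    file, err := os.Create(filename)
    if err != nil {
        return nil, err
    }

    actor := &FileActor{
        File:  file,
        write: make(chan string),
    }

    go actor.handle()
    return actor.write, nil
}

// handle writes each received line and closes the file once the channel
// has been closed by the caller.
func (f *FileActor) handle() {
    defer f.Close()

    for line := range f.write {
        if _, err := f.WriteString(line); err != nil {
            log.Println("file actor: write failed:", err)
        }
    }
}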

So when should you rely upon the trusty Mutex? Well... only in very specific instances:

[Figure: Do you actually need a mutex?]

Idea Three: The use of mutexes is often very important, but it can also be symptomatic of sharing memory needlessly - and can often be avoided by writing minimal actor abstractions around resources.

Failing Fast

When writing concurrent code the overhead of error handling increases hugely: suddenly errors not only need to be handled and/or propagated up the call-chain, but they also need to be communicated to any other dependent concurrent functions. Erlang encourages the idea of simply failing upon an error, relying upon some form of orchestration mechanism to recover and/or restart the process4. Combined with a transactional approach to interacting with external systems such as databases, this can alleviate the work associated with defensive boilerplate.

Erlang’s approach allows this via the use of a Supervision Tree 5; these trees ensure that individual worker nodes are managed via a supervisor node. A simplified supervisor could be achieved in Go like this:

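import (
    "context"
    "log"
)

// supervisableFunc is a worker that signals its exit by closing done.
type supervisableFunc func(context.Context, chan struct{})

func doSomething(ctx context.Context, done chan struct{}) {
    // Recover from any panic, then tell the supervisor that we have exited.
    defer func() {
        if r := recover(); r != nil {
            log.Println("doSomething panicked. recovering...", r)
        }

        close(done)
    }()

    for {
        select {
        case <-ctx.Done():
            return
            // case for inbound messages, actually allowing useful work
            // to commence.
        }
    }
}

// superviseSomething restarts worker every time it exits, until ctx is
// cancelled or times out.
func superviseSomething(ctx context.Context, worker supervisableFunc) {
    for {
        isDone := make(chan struct{})
        go worker(ctx, isDone)

        <-isDone
        if ctx.Err() != nil {
            break
        }
    }
}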

Placing the call to the goroutine in a loop, and only breaking out of the loop in the event that the context is cancelled and/or timed out, is a very simplistic approach to achieving supervision in Go. In conjunction with the usage of a recover() call per goroutine, as well as correct error logging, a lot of the usual error handling boilerplate can be minimised and located outside of the normal business logic.

Using this pattern we could construct a supervision tree with the root node consisting of a simple sync.WaitGroup - which would give a structure that looks something like this:

[Figure: Example Go Supervision Tree]
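In code, the root of such a tree might be sketched as follows. This is a simplified variant rather than a definitive implementation: the supervisor runs each worker inline instead of spawning it and waiting on a done channel, and the names worker, supervise, runTree and demo are all hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
)

// worker is a hypothetical unit of work; it may return or panic at any time.
type worker func(ctx context.Context)

// supervise restarts w whenever it returns or panics, until ctx is done.
func supervise(ctx context.Context, w worker) {
	for ctx.Err() == nil {
		func() {
			// A real supervisor would log the recovered value.
			defer func() { _ = recover() }()
			w(ctx)
		}()
	}
}

// runTree starts one supervisor per worker under a WaitGroup root and
// blocks until every supervisor has stopped.
func runTree(ctx context.Context, workers ...worker) {
	var root sync.WaitGroup
	for _, w := range workers {
		root.Add(1)
		go func(w worker) {
			defer root.Done()
			supervise(ctx, w)
		}(w)
	}
	root.Wait()
}

// demo runs a tree containing one worker that crashes twice, and returns
// how many times that worker was started.
func demo() int64 {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var starts int64
	flaky := func(_ context.Context) {
		if atomic.AddInt64(&starts, 1) < 3 {
			panic("simulated crash")
		}
		cancel() // third start: shut the whole tree down
	}
	steady := func(ctx context.Context) { <-ctx.Done() }

	runTree(ctx, flaky, steady)
	return atomic.LoadInt64(&starts)
}

func main() {
	fmt.Println(demo()) // 3
}
```

Cancelling the context is what brings the tree down: each supervisor notices on its next loop iteration, and the root’s Wait returns once they have all exited.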

The Worker leaves in the above tree could be Actors constructed around resources like database connections, caching layers, or they could form part of a pipeline like this:

[Figure: Pipeline]
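A pipeline of this kind could be sketched in Go as a chain of small workers, each of which owns one step and talks to its neighbours only via channels. The stage and runPipeline functions below are hypothetical, and the two string transformations are placeholders for real work.

```go
package main

import (
	"fmt"
	"strings"
)

// stage starts a worker that applies fn to every value from in, and
// closes its output once in has been closed and drained.
func stage(in <-chan string, fn func(string) string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for v := range in {
			out <- fn(v)
		}
	}()
	return out
}

// runPipeline feeds inputs through a trim stage and an upper-case stage,
// then collects the results in order.
func runPipeline(inputs []string) []string {
	source := make(chan string)
	go func() {
		defer close(source)
		for _, s := range inputs {
			source <- s
		}
	}()

	trimmed := stage(source, strings.TrimSpace)
	shouted := stage(trimmed, strings.ToUpper)

	var results []string
	for v := range shouted {
		results = append(results, v)
	}
	return results
}

func main() {
	fmt.Println(runPipeline([]string{" hello ", "world "})) // [HELLO WORLD]
}
```

Closing the source channel is the only shutdown signal needed: each stage closes its own output in turn, so the close ripples down the pipeline.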

This structure has multiple advantages: firstly, it allows workers to forgo defensive error-checking measures; secondly, it makes failures easier to pinpoint, avoiding obscure stack traces as error-handling bubbles up; thirdly, it provides assurances that all concurrent functions are running as expected at all times.

Idea Four: Traditional "defensive" programming practices can introduce needless complexity, whereas failing fast and having a very thin orchestration layer can help simplify even complex object relationships.

Conclusion

I’ve previously written about the pitfalls of writing code in the style of another language, and how it’s something that should be avoided at all costs… so this post could be viewed as a bit hypocritical. That’s not to say that you can’t learn from the patterns that are used for specific problems in other languages.

In this post I’ve barely scratched the surface of the Actor Model, or some of the native-Erlang concurrency tools such as links, monitors, or even Erlang’s out-of-the-box distributed programming support. We’re not trying to re-implement Erlang/OTP in another language though!

These are little more than simple ideas that - if implemented in a more idiomatic manner - could prove to be a useful tool for approaching concurrent problems.

A note about language constructs & Rust

Other than the concurrency primitives, there are very few similarities between Erlang and Go; and the observations in this post are just as applicable to any other language. The first one that springs to mind is Rust - with its claims of “fearless concurrency”, and its standard library support for spawning threads via std::thread::spawn and creating channels via std::sync::mpsc::channel. However it’s worth highlighting that these are native OS threads - and not the lightweight abstractions that Go and Erlang provide.

Rust is an interesting option for implementing these patterns though: although native OS threads are certainly “heavier” and share an address space, Rust’s unique memory ownership model stops threads from touching each other’s data without explicit synchronisation, which gives much of the isolation that one gains in Erlang. Additionally, the default immutability of variables is also an area of similarity with Erlang.


  1. Yeah, okay - you’ve got me: sharing references (i.e. pointers) is better for memory utilisation. This is minimised in a model where you’re sending messages - i.e. primitive types as opposed to complex types - though.

  2. If you do find yourself working on a project that has excessive globals - and you don’t fancy heavy refactoring initially - a good temporary solution can be using closures as a hack to maintain the expected state at a function level whilst slowly transitioning to a saner dependency injection model.

  3. This particular contract involved working on a voice-based telephony system which allowed people to take automated assessments over the phone. The use of globals actually led to cross talk between calls: and private details would randomly be sent to multiple callers…!

  4. In a world of Kubernetes and Netflix’s Chaos Monkey, I’m always surprised at how this approach of simply failing and allowing an orchestrator to do its thing isn’t more common.

  5. Actually, the Supervision Tree isn’t a design principle of Erlang - but of OTP (Open Telecom Platform). Docs: OTP - Design Principles

© 2021