Sparse set in Go

June 17, 2017
Go Cgo Sparse Set Data structures

This post is inspired by https://research.swtch.com/sparse

Complete code: https://github.com/ideahitme/go-and-learn/blob/master/golang/cgo/sparse.go

Implementing data structures in Go is easy due to the presence of slices and maps which allows data structures like queue, stack programmed in just few lines of code. Implementing sets should not be too difficult either, but let’s take a closer look at two different ways of doing it. First is the most commonly seen implementation via map and the other will utilize an interesting data structure called sparse set. The latter, however, will require us to do some C code invocation for additional optimization.

I would like to point out that the word set in this text means a data structure which stores unordered set of unique integers and is different from C++ std::set.

Let’s begin by defining an interface for our set implementation:


type Set interface { 
  Insert(uint32) // Insert new element if not present
  Find(uint32) bool // Check if element is in the set
  Iterate() []uint32 // Provide a way to iterate over inserted elements
  Clear() // Clear the set
}

(we omit Remove operation at least in the scope of this post)

MapSet

The most common way to implement a set is to use a hash map, one of the possible implementations is described below:

type MapSet map[uint32]bool

func (ms MapSet) Insert(x uint32) { 
  ms[x] = true
}

func (ms MapSet) Find(x uint32) bool{ 
  return ms[x]
}

func (ms MapSet) Iterate() []uint32 { 
  result := make([]uint32, 0, len(ms))
  for x := range ms { 
    result = append(result, x)
  }
  return result
}

func (ms MapSet) Clear() { 
  for x := range ms { 
    delete(ms, x)
  }
}

...

// usage
set := MapSet{}
set.Insert(1)
set.Find(1) //true

The MapSet implementation is completely fine and it is probably the simplest and cleanest solution in most of the cases. However it has two downsides, both Clear() and Iterate() methods require O(n) operations for the keys of the map to be extracted. That’s where sparse set comes to rescue.

Sparse Set

The idea behind sparse set is to maintain two arrays, namely dense and sparse:

dense array stores elements in the order of their arrival
sparse is an array which elements point to the location where given element is in dense

To understand it better, let’s consider the following example:

In the beginning our sparse set has no elements. And now we want to insert 4 into it:

Before: dense: [] sparse: []

After: dense:[4] sparse:[_,_,_,_,0]

Now let’s insert 2,

Before: dense:[4] sparse:[_,_,_,_,0]

After: dense:[4, 2] sparse:[_,_,1,_,0]

So dense simply stores the elements as they arrive into the set. While sparse stores the indexes where this element exist in dense. So if we want to find if element 2 is present in the set, all we have to do is check sparse[2], which will give us the index of dense where 2 is present. Therefore, for any x in the set, the following holds true dense[sparse[x]] == x. So far so good.

Let’s look at one of the naive Go implementations:

type NaiveSparseSet struct {
	dense  []uint32
	sparse []int
}

func (ns *NaiveSparseSet) Insert(x uint32) {
	ns.sparse[x] = len(ns.dense)
	ns.dense = append(ns.dense, x)
}

Apparently there is a problem, if we call (&NaiveSparseSet{}).Insert(1) the code will panic with index out of bound runtime error. To fix this problem we can ask the user to provide us with the maximum element value that can be stored in the set:

func NewNaiveSparseSet(maxValue uint32) *NaiveSparseSet {
	sparse := make([]int, maxValue)
	return &NaiveSparseSet{
		sparse: sparse,
	}
}

So with this little trick we can now implement Set interface quite simply:

func (ns *NaiveSparseSet) Insert(x uint32) {
	if ns.Find(x) {
		return
	}
	ns.sparse[x] = len(ns.dense)
	ns.dense = append(ns.dense, x)
}

func (ns *NaiveSparseSet) Find(x uint32) bool {
	return ns.sparse[x] < len(ns.dense) && ns.dense[ns.sparse[x]] == x
}

func (ns *NaiveSparseSet) Iterate() []uint32 {
	return ns.dense
}

func (ns *NaiveSparseSet) Clear() {
	ns.dense = nil
}

It all works and all our operations are constant-time operations, but we added new problems:

Initializing sparse now takes O(max-value) amount of operations
Consistently calling Clear will cause the underlying arrays to be generated and continuously garbage collected, which is not really necessary as will be shown further

Optimized Sparse Set

Let’s first solve the GC problem with underlying array of ns.dense. Instead of detaching the slice header, we can simply say that our set has no elements by adding a flag which indicates how many elements we actually have in the set:

type SparseSet struct {
	size   int32
  ...

Therefore, now Clear and Insert can be simply rewritten as:


func (s *SparseSet) Clear() {
	s.size = 0
}

func (s *SparseSet) Insert(x uint32) {
	if s.Find(x) {
		return
	}
	if int32(len(s.dense)) == s.size { //grow only when necessary
		s.dense = append(s.dense, x)
	} else {
		s.dense[s.size] = x
	}
	s.sparse[x] = s.size
	s.size++
}

Now to the dummy initialization problem. The following line caused the sparse slice to be allocated and initialized.

sparse := make([]int, maxValue)

While the former is required for us to use the memory, we need to understand if initialization phase is really necessary. To do that let’s check where we actually need to read the sparse data:

func (s *SparseSet) Find(x uint32) bool {
	return s.sparse[x] < s.size && s.dense[s.sparse[x]] == x
}

So if we had some garbage stored in the s.sparse[x] it would not really affect the correctness of the program, because it would mean that either s.sparse[x] points out of the s.dense boundary, or the value in s.dense[s.sparse[x]] is not actually x.

So how do we allocate a memory without initializing ? Obviously malloc() !

One of the great features of Go is the ability to invoke C code (more details here https://github.com/golang/go/wiki/cgo) and all we have to do is to import the C library:

/*
#include <stdlib.h>
*/
import "C"

So now our sparse allocation would look like:

func NewSparseSet(maxValue uint32) (*SparseSet, error) {
	//allocate a new array via malloc
	sparse := C.malloc(C.size_t(maxValue * 4))
	if sparse == nil {
		return nil, fmt.Errorf("failed to allocate memory")
	}
	return &SparseSet{sparse: unsafe.Pointer(sparse)}, nil
}

It is important to understand that when we call C code, we are interacting with C types and malloc(1000) returns *void type (pointer to anything) so we have to be precise with the size. Since we are expecting to store uint32 (4 bytes) we allocate 4 * maxValue amount of memory. Furthermore, we need to be able to do pointer arithmetics, therefore conversion to unsafe.Pointer is needed as well (more on unsafe package https://github.com/ideahitme/go-and-learn/tree/master/golang/unsafe).

This is how we can perform pointer arithmetics to navigate over the s.sparce (type unsafe.Pointer):

// at(i) returns pointer to sparse array in position i
func (s *SparseSet) at(shift uint32) unsafe.Pointer {
	return unsafe.Pointer(uintptr(s.sparse) + uintptr(4*shift))
}

So the rest of our code is as simple as:

func (s *SparseSet) Clear() {
	s.size = 0
}

func (s *SparseSet) Insert(x uint32) {
	if s.Find(x) {
		return //already in the set
	}

	if int32(len(s.dense)) == s.size { //grow only when necessary
		s.dense = append(s.dense, x)
	} else {
		s.dense[s.size] = x
	}

	mem := (*int32)(s.at(x))
	*mem = s.size
	s.size++
}

func (s *SparseSet) Find(x uint32) bool {
	mem := (*int32)(s.at(x))
	return *mem < s.size && s.dense[*mem] == x
}

func (s *SparseSet) Iterate() []uint32 {
	return s.dense
}

All methods now are constant time operations, we do not need to initialize the sparce and we can avoid constant creation of n.dense underlying arrays. But we forgot one little thing, didn’t we ? Of course, we forgot that malloc() and free() are very good friends.

It is important to keep in mind that Go runtime has no idea of the memory allocation performed in the C code, therefore it will never garbage collect s.sparce. Fortunately, this is quite easy to solve as well:

func (s *SparseSet) Free() {
	C.free(s.sparse)
}
...

func main() { 
  	s, err := NewSparseSet(100000)
	if err != nil {
		log.Fatal(err)
	}
	defer s.Free() //clean up
	s.Insert(100)
	fmt.Printf("is %d in the set: %t\n", 100, s.Find(100))
	fmt.Printf("is %d in the set: %t\n", 99, s.Find(99))
	s.Insert(99)
	fmt.Printf("is %d in the set: %t\n", 99, s.Find(99))

// is 100 in the set: true
// is 99 in the set: false
// is 99 in the set: true
}

Complete code for all implementations can be found in https://github.com/ideahitme/go-and-learn/blob/master/golang/cgo/sparse.go

Conclusion

Use map to implement set when:

Not operating on large data sets
Iteration is not really needed
Set does not need to be cleared

MapSet is the cleanest and recommended way of implementing sets and should be used in 99% of the cases.

Use naive sparse set implementation when:

Not operating on large data sets
Iteration is needed
Clear is not called often

NaiveSparseSet is has some advantages over its optimized version due to the abscense of direct C code invocation and it is type safety guarantees.

Use optimized sparse set implementation when:

Max value is not too large
Iteration is needed
Clear is called often

SparseSet is a great example of cases when we do not need to operate on initialized arrays, we can take advantage of reducing time and resource utilizations for initialization stage and work with “dirty” data.

What’s next ?

In the next post I will do some benchmarking and compare these implementations on various data sets and utilize some of the great profiling capabilities of Go. We will also try to find a way to circumvent the need for the max-value, i.e. the maximum number which can be stored in the sparse set.