Succinct Representation of Data Structures:Introduction and Bitvector.

Introduction

Although computer memories, at all levels of the hierarchy, have grown dramatically over the past few years, increased problem sizes continues to outstrip this growth. Minimizing space is crucial not only in keeping data in the fastest memory possible, but also in moving it from one level to another, be it from main memory to cache or from a web site around the world. Standard data compression, say Huffman code or grammar based code, applied to a large text file reduces space dramatically, but basic operations on the text require that it be fully decoded.

In this chapter we focus on representations that are not only terse but also permit the basic operations one would expect on the underlying data type to be performed quickly. Jacobson [33] seems to have been the first to apply the term succinct to such structures; the goal is to use the information-theoretic minimum number of bits and to support the expected operations on the data type in optimal time. Our archetypical example (discussed in Section 37.4) is the representation of a binary tree. Suppose, we would like to support the operations of navigating through a binary tree moving to either child or the parent of the current node, asking the size of the subtree rooted at the current node or giving the unique ‘number’ of the node so that data can be stored in that position of an array. At lg n bits per reference, this adds up to at least 5n lg n bits. However, there are only (2n+1//(2n + 1) binary trees, so the information-theoretic minimum space is fewer than 2n bits. Our archetypical data structure is a 2n + o(n)-bit representation that supports the operations noted above, and others, in constant time.

We consider a variety of abstract data types, or combinatorial objects, with the goal of producing such succinct data structures. Most, though not all, of the structures we consider are static. In most cases the construction of a succinct data structure from the standard representation is fairly straightforward in linear time.

Memory Model: We study the problems under the RAM model with word size Θ(lg n), where n is the input size of the problem under consideration. This supports arithmetic (addition, subtraction, multiplication and division), indexing and bit-wise boolean operations (AND, OR, NOT, XOR etc.) on words, and reading/writing of words from/to the memory in constant time.

Bitvector

A bitvector provides a simple way to represent a set from any universe that is easily mapped onto [m] . Membership queries (checking whether a given element from the universe is present in the set) can be answered in constant time (in fact a single bit probe) using a bitvector. Furthermore, one can easily support updates (inserting and deleting elements) in constant time. The most interesting twist on the bitvector came with Jacobson [33] considering two more operations:

• rank(i) : return the number of 1s before the position i, and

• select(i) : return the position of the i-th 1.

As we shall see, these operations are crucial to a number of more complex structures supporting a variety of data types. An immediate use is to support the queries:

• predecessor(x) : find the largest element y x in the set S,

• successor(x) : find the smallest element y x in the set S.

Given a bitvector of length m, Jacobson [33] gave a structure that takes o(m) bits of additional space and supports rank and select operations by making O(lg m) bit probes to the structure. On a RAM with word size Θ(lg m) bits, the structure given by Munro [40] enhanced this structure and the algorithms to support the operations in O(1) time, without increasing the space bound. We briefly describe the details of this structure.

The structure for computing rank, the rank directory, consists of the following:

• Conceptually break the bitvector into blocks of length llg2 ml. Keep a table containing the number of 1s up to the last position in each block. This takes O(m/ lg m) bits of space.

Conceptually break each block into sub-blocks of length l 1 lg ml. Keep a table containing the number of 1s within the block up to the last position in each sub-block. This takes O(m lg lg m/ lg m) bits.

• Keep a precomputed table giving the number of 1s up to every possible position in every possible distinct sub-block. Since there O(m) distinct possible blocks, and O(lg m) positions in each, this takes O(m lg m lg lg m) bits.

image

Thus, the total space occupied by this auxiliary structure is o(m) bits. The rank of an element is, then, simply the sum of three values, one from each table.

The structure for computing select uses three levels of directories and is more complex. The first one records the position of every (lg m lg lg m)-th 1 bit in the bitvector. This takes O(m/ lg lg m) bits. Let r be the subrange between two values in the first directory, and consider the sub-directory for this range. If r (lg m lg lg m)2 then explicitly store the positions of all ones, which requires O(r/ lg lg m) bits. Otherwise, subdivide the range and store the position (relative to the beginning of this range) of every (lg r lg lg m)-th one bit in the second level directory. This takes O(r/ lg lg m) bits for each range of size r, and hence O(m/ lg lg m) bits over the entire bitvector. After one more level of similar range subdivision, the range size will reduce to at most (lg lg m)4. Computing select on these small ranges is performed using a precomputed table. The total space occupied by this auxiliary structure is o(m) bits. The query algorithm is straightforward.

details.

See [9, 44] for This ‘indexable bitvector’ is used as a substructure in several succinct data structures. To represent a bitvector of length m, it takes m + o(m) bits of space. In general, if nothing is known about the bitvector then any representation needs at least m bits to distinguish between all possible bitvectors, and hence this is close to the optimal space. But if we also know the density (the number of ones) of the bitvector, then the space bound is no longer optimal, in general. The ‘fully indexable dictionary’ described in Section 37.3.2 gives a solution that takes nearly optimal space.

Using the ideas involved in constructing rank and select directories, one can also support the following generalizations of these two operations, using o(m) bits of extra space: Given a bitvector of length m, and a fixed binary pattern p of length up to (1 E) lg m, for some fixed constant 0 < E < 1

• rankp(i) : return the number of (possibly overlapping) occurrences of p before the position i, and

• selectp(i) : return the i-th occurrence of the pattern p.

One can extend the ideas of rank and select directories to support indexing into a fixed or variable length encoded text (e.g. Huffman coding, prefix-free encoding etc.) in constant time, using negligible extra space. See [33, 44] for some examples.

Related posts:

Leave a comment

Your email address will not be published. Required fields are marked *