We have seen binary search trees in Chapters 3 and 10.
When the data volume is large and does not fit in memory, an extension of the binary search tree to the disk-based environment is the B-tree, originally invented by Bayer and McCreight. In fact, since the B-tree is always balanced (all leaf nodes appear at the same level), it is an extension of the balanced binary search tree. Since each disk access exchanges a whole block of information between memory and disk rather than a few bytes, a node of the B-tree is expanded to hold more than two child pointers, up to the block capacity. To guarantee worst-case performance, the B-tree requires that every node (except the root) be at least half full. An exact-match query, insertion, or deletion needs to access O(logB n) nodes, where B is the page capacity in number of child pointers and n is the number of objects.
Nowadays, every database management system (see Chapter 60 for more on applications of data structures to database management systems) has implemented the B-tree or its variants. Since the invention of the B-tree, many variations have been proposed. In particular, Knuth defined the B*-tree as a B-tree in which every node has to be at least 2/3 full (instead of just 1/2 full). If a page overflows during insertion, the B*-tree applies a local redistribution scheme to delay splitting the node until a sibling node is also full. At that time, the two nodes are split into three. Perhaps the best variation of the B-tree is the B+-tree, whose idea was originally suggested by Knuth, but whose name was given by Comer. (Before Comer, Knuth used the name B*-tree to represent both the B*-tree and the B+-tree.) In a B+-tree, every object stays at the leaf level. Update and query algorithms need to be modified from those of the original B-tree accordingly.
The idea of the B-tree also motivates the design of many other disk-based index structures like the R-tree, the state-of-the-art spatial index structure (Chapter 21).
In this chapter, we describe the B-tree and B+-tree in more detail. In section 15.2, we briefly describe the disk-based environment and introduce some notation. The B-tree is described in section 15.3, while the B+-tree is described in section 15.4. Finally, in section 15.5 we further discuss some related issues.
The Disk-Based Environment
Most application software deals with data. For instance, a registration application may keep the name, address, and social security number of all students. The data has to be stored somewhere. There are three levels of storage. The computer CPU deals directly with the primary storage, which means the main memory (as well as cache). While data stored at this level can be accessed quickly, we cannot store everything in memory for two reasons. First, memory is expensive. Second, memory is volatile, i.e. if there is a power failure, information stored in memory is lost.
The secondary storage stands for magnetic disks. Although it has slower access, it is less expensive and it is non-volatile. This satisfies most needs. Data that does not need to be accessed often can also be stored in tertiary storage, e.g. tapes.
Since the CPU does not deal with disk directly, in order for any piece of data to be accessed, it has to be read from disk to memory first. Data is stored on disk in units called blocks or pages. Every disk access has to read/write one or multiple blocks. That is, even if we need to access a single integer stored in a disk block which contains thousands of integers, we need to read the whole block in. This tells us why internal memory data structures cannot be directly implemented as external-memory index structures.
Consider the binary search tree as an example. Suppose we implement every node as a disk block. The storage would be very inefficient. If a disk page is 8KB (=8192 bytes), while a node in the binary search tree is 16 bytes (four integers: a key, a value, and two child pointers), every page is only 0.2% full. To improve space efficiency, we should store multiple tree nodes in one disk page. However, queries and updates will still be inefficient, since they need to access O(log2 n) nodes, where n is the number of objects. Since it is possible that every node accessed is stored in a different disk page, we need to access O(log2 n) disk pages. On the other hand, a B-tree query/update needs to access only O(logB n) disk pages, which is a big improvement. A typical value of B is 100. Even if there are as many as billions of objects, the height of a B-tree, logB n, will be at most 4 or 5.
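The comparison above is easy to verify numerically. The following sketch computes the number of node (and hence, in the worst case, page) accesses for a balanced binary search tree versus a B-tree with fanout B = 100, for one billion objects:

```python
import math

def bst_accesses(n):
    # A balanced binary search tree touches about log2(n) nodes per query.
    return math.ceil(math.log2(n))

def btree_accesses(n, B):
    # A B-tree with fanout B touches about logB(n) pages per query.
    return math.ceil(math.log(n, B))

n = 1_000_000_000  # one billion objects
print(bst_accesses(n))        # 30 -- up to 30 disk pages if nodes scatter
print(btree_accesses(n, 100)) # 5  -- matches the "at most 4 or 5" claim
```

Even with a billion objects, the B-tree height stays at 5, while the binary search tree may incur 30 page accesses per operation.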
A fundamental question that database research addresses is how to reduce the gap between memory and disk. That is, given a large amount of data, how to organize it on disk in such a way that it can efficiently be updated and retrieved. Here we measure efficiency by counting the total number of disk accesses we make. A disk access can be either a read or a write operation. Without going into details on how the data is organized on disk, let's make a few assumptions. First, assume each disk page is identified by a number called its page ID. Second, given a page ID, there is a function DiskRead which reads the page into memory. Correspondingly, there is a DiskWrite function which writes the in-memory page onto disk. Third, we are provided two functions which allocate a new disk page and deallocate an existing disk page.
The four functions are listed below.
• DiskRead: given a page ID, read the corresponding disk page into memory and return the corresponding memory location.
• DiskWrite: given the location of an in-memory page, write the page to disk.
• AllocatePage: find an unoccupied page ID, allocate space for the page in memory and return its memory location.
• DeallocatePage: given a page ID, mark the corresponding disk page as being unoccupied.
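To make the interface concrete, here is a minimal in-memory simulation of the four functions. The class name, the dictionary standing in for the disk, and the use of Python dicts as "pages" are all assumptions for illustration, not part of any real storage API:

```python
class DiskSimulator:
    """Hypothetical stand-in for a disk: page IDs map to page contents."""

    def __init__(self):
        self.pages = {}    # "the disk": page ID -> stored page contents
        self.next_id = 0

    def allocate_page(self):
        # AllocatePage: find an unoccupied page ID, create space for the
        # page in memory, and return the ID with the in-memory page.
        pid = self.next_id
        self.next_id += 1
        self.pages[pid] = {}
        return pid, {}

    def disk_write(self, pid, page):
        # DiskWrite: write the in-memory page onto disk.
        self.pages[pid] = dict(page)

    def disk_read(self, pid):
        # DiskRead: read the disk page into memory and return it.
        return dict(self.pages[pid])

    def deallocate_page(self, pid):
        # DeallocatePage: mark the disk page as unoccupied.
        del self.pages[pid]

disk = DiskSimulator()
pid, page = disk.allocate_page()
page["key"] = 42
disk.disk_write(pid, page)
print(disk.disk_read(pid)["key"])  # 42
```

Index structures built on this interface, such as the B-tree of section 15.3, count one disk access per DiskRead or DiskWrite call.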
In the actual implementation, we should utilize a memory buffer pool. When we need to access a page, we should first check if it is already in the buffer pool, and we access the disk only when there is a buffer miss. Similarly, when we want to write a page out, we should write it to the buffer pool. An actual DiskWrite is performed under two circumstances: (a) The buffer pool is full and one page needs to be switched out of buffer. (b) The application program terminates. However, for our purposes we do not differentiate disk access and buffer pool access.
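The buffering scheme described above can be sketched as a small LRU buffer pool. The class, its capacity parameter, and the dictionary standing in for the disk are hypothetical choices for illustration; real systems use more elaborate replacement policies:

```python
from collections import OrderedDict

class BufferPool:
    """Minimal LRU buffer pool layered over a simulated disk."""

    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk              # "the disk": page ID -> contents
        self.pool = OrderedDict()     # cached pages, least recent first
        self.misses = 0

    def get(self, pid):
        if pid in self.pool:
            self.pool.move_to_end(pid)  # buffer hit: no disk access
            return self.pool[pid]
        self.misses += 1                # buffer miss: actual DiskRead
        if len(self.pool) >= self.capacity:
            # Pool full: evict the least recently used page, which
            # triggers an actual DiskWrite (circumstance (a) above).
            victim, page = self.pool.popitem(last=False)
            self.disk[victim] = page
        self.pool[pid] = self.disk[pid]
        return self.pool[pid]

disk = {0: "page0", 1: "page1", 2: "page2"}
buf = BufferPool(capacity=2, disk=disk)
buf.get(0); buf.get(1); buf.get(0)  # second access to page 0 is a hit
print(buf.misses)  # 2
```

Counting buffer-pool hits as disk accesses, as we do in this chapter, only overestimates the true cost, so the O(logB n) bounds still hold.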