Bulk Loading B+ trees: Top-down and Bottom-up approach


I know there exists a technique to bulk load sorted data into a B+ tree.

However, I have read in several places that bulk-loading can be approached in two ways: top-down and bottom-up. Resource (1) mentions that in the top-down approach, most of the internal nodes are sparse/half-full and none of the old entries go into a newer node, whereas in the bottom-up approach this sparsity can be circumvented.

Now, resource (2) discusses this bulk-loading technique in its own words, yet it is similar to resource (1). However, resource (2) proceeds to visualize how this bulk-loading can be implemented and ends up with a B+ tree with half-full nodes.

My question is

  1. Which of the resources is correct?
  2. How exactly is a top-down build different from a bottom-up build when considered for bulk-loading?
  3. Am I reading all of it wrong, and is bulk-loading implemented with a bottom-up approach only? (Resource (2) says that the bottom-up approach is implemented as part of a bulk-load utility.)

Resources:

  1. https://db-coder.github.io/DBInternalsReport.pdf
  2. https://slideplayer.com/slide/15127631/
  3. https://en.wikipedia.org/wiki/B-tree#Initial_construction

Note: Resource (2) is built with reference from the book Database Management Systems, (3rd edition), by Raghu Ramakrishnan and Johannes Gehrke. McGraw Hill, 2003.

I have given a thorough reading to the above-mentioned resources and other similar content online. Well, the only thing I have not done yet is ask ChatGPT.


1 Answer

Lemonina On

They describe two different approaches to bulk-loading a B+ tree, and both are correct.

  • top-down approach -- the algorithm starts with an empty tree and inserts the keys in sorted order, splitting full nodes as necessary. This can result in many sparse (half-full) internal nodes, since the algorithm splits a node as soon as it overflows.

  • bottom-up approach -- the algorithm first builds a set of leaf nodes from the sorted data, then constructs the internal levels on top of them, promoting separator keys upward. Because nodes are not split until they actually reach their maximum number of keys, the internal nodes end up more densely packed.

Both approaches have their advantages and disadvantages -- the choice depends on factors such as the size of the data set, the expected access patterns, and the available hardware resources. Some implementations may even use a hybrid approach.
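The bottom-up approach can be sketched in a few lines. This is a minimal illustration, not code from any of the linked resources: the node layout, the `ORDER` constant, and the helper names are my own assumptions. Sorted keys are packed into leaves left to right, then each internal level is built over the one below it, so at most the rightmost node of each level is under-full:

```python
ORDER = 4  # assumed fanout: max keys per leaf / max children per internal node

def smallest_key(node):
    """Smallest key in the subtree rooted at node.
    A leaf is a plain list of keys; an internal node is a dict."""
    while isinstance(node, dict):
        node = node["children"][0]
    return node[0]

def bulk_load(sorted_keys):
    """Build a B+ tree bottom-up from already-sorted keys.
    Returns (root, levels), where levels[0] is the leaf level."""
    # 1. Pack sorted keys into full leaves; only the last leaf may be under-full.
    leaves = [sorted_keys[i:i + ORDER]
              for i in range(0, len(sorted_keys), ORDER)]
    levels = [leaves]
    # 2. Repeatedly build the next internal level until a single root remains.
    current = leaves
    while len(current) > 1:
        parents = []
        for i in range(0, len(current), ORDER):
            children = current[i:i + ORDER]
            # One separator key per child after the first: the smallest
            # key reachable through that child.
            seps = [smallest_key(c) for c in children[1:]]
            parents.append({"keys": seps, "children": children})
        levels.append(parents)
        current = parents
    return current[0], levels
```

For example, loading the keys 1..16 with `ORDER = 4` produces four full leaves and a single root with separator keys 5, 9, and 13; a top-down insertion of the same sorted keys would instead leave a trail of half-full nodes from the repeated splits.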