Tuesday, December 4, 2007

B-Tree and its application

A b-tree of order m is an m-way tree such that
-all leaves are on the same level
-all internal nodes except the root are constrained to have at most m non empty children and at least m/2 non empty children . The root has at most m non empty children.
A balanced search tree in which every node has between m/2 and m children, where m>1 is a fixed integer. m is the order. The root may have as few as 2 children. This is a good structure if much of the tree is in slow memory (disk), since the height, and hence the number of accesses, can be kept small, say one or two, by picking a large m.
Also known as balanced multiway tree.
A B-tree is essentially just a sorted list of all the item identifiers from one of your data files. For example, if you have a customer file, and every customer item in the file uses a customer number as the item identifier, and if you use B-TREE-P to create a B-tree for the customer file in ZIP code order, then the resulting B-tree will simply be a list of all the customer numbers sorted by ZIP code. However, the B-TREE-P subroutines keep the sorted B-tree list structured in a special fashion that makes it very fast and easy to find any number in the list.
Just as there has to be a file to contain customer data, there has to be a file to contain a B-tree. Naturally, a good convention (and one followed in the examples already presented) is to create a file called B-TREE for keeping the B-tree data that the B-TREE-P subroutines create. Initially, the B-TREE file is completely empty. Then, each time the BTPINS subroutine is called by a program, another item identifier is inserted into the B-TREE file, and the file becomes a specially sorted and constructed list of identifiers.
The order in which item identifiers are sorted in a B-tree is controlled by the BTPKEY subroutine. Although the statements in BTPKEY may specify a very complicated sort for controlling how items in a B-tree are ordered (for example, by ZIP code by address by company by name), the only data actually saved in a B-tree are item identifiers. Therefore, it doesn't matter how complicated the sort is, since the size of the resulting B-tree file is always the same. As a very rough rule of thumb, a B-tree for a file takes approximately the same amount of space as about two SELECT lists of the file.
The actual structure of a B-tree consists of a number of nodes stored as items in the B-tree's file. Each node contains a portion of the sorted list of identifiers in the B-tree, along with pointers to other nodes in the B-tree. The number of identifiers and pointers stored in each node is controlled by a special size parameter that is passed as the second argument to the BTPINS and BTPDEL subroutines. The size parameter indicates the minimum number of identifiers in a node, and also half the maximum. For example, in the examples already presented, the node size used was 5, so each node contained from 5 to 10 item identifiers.
The B-tree node size can be any number from 1 up. Small sizes create B-trees that may be faster to search, but take up more disk space because there are more nodes with pointers. Extremely small nodes may cause very "deep" B-trees that end up being slow to search. Larger node sizes slow down searches, but take less disk space because there are fewer pointers. The disk space occupied by nodes also depends on the length of your data file's item identifiers. A node size of 50 is often a good, all-around, starting value. Once the B-tree is built, it can be examined using standard Pick file maintenance techniques to find the optimum node size that keeps items in the B-tree file nicely packed within the boundaries of Pick's frame structure. If desired, the B-tree can then easily be rebuilt with the optimum node size.
The first node created in a B-tree file is numbered 0, the next is 1 (even though the node might be for a different B-tree in the same file), and the next node is 2, and so on. As more identifiers are inserted into the file's B-trees, more nodes are created. The special item named NEXT.ID, which is automatically created in every B-tree file, contains the number of the next node that will be created.
A file can contain any number of B-trees, but each B-tree in the file must have a unique name, which can be any string of characters. Each B-tree name is saved as an item in the B-tree file, and contains the number of the root node in the B-tree. The root node for a given B-tree is where all searches through that tree happen to start. In the B-TREE-P examples, the B-TREE file contained three different B-trees, named ZIP, COMP, and LNAME, so the B-TREE file also contained three items with those names.

No comments: