Tokyo Cabinet中Bucket(桶)的设置与性能关系

关于TC中的Bucket的大小设置。

作者原文如下:

Tokyo Cabinet attains improvement in retrieval by loading RAM with the whole of a bucket array. If a bucket array is on RAM, it is possible to access a region of a target record by about one path of file operations. A bucket array saved in a file is not read into RAM with the `read' call but directly mapped to RAM with the `mmap' call. Therefore, preparation time on connecting to a database is very short, and two or more processes can share the same memory map.

If the number of elements of a bucket array is about half of records stored within a database, although it depends on characteristic of the input, the probability of collision of hash values is about 56.7% (36.8% if the same, 21.3% if twice, 11.5% if four times, 6.0% if eight times). In such case, it is possible to retrieve a record by two or less paths of file operations. If it is made into a performance index, in order to handle a database containing one million of records, a bucket array with half a million of elements is needed. The size of each element is 4 bytes. That is, if 2M bytes of RAM is available, a database containing one million records can be handled.

Bucket(桶)的容量将严重影响HashTable发生碰撞的概率,例如:
桶大小为row数量的2倍时,概率是21.3% ,而1/2时碰撞概率达到56.7%。按照这个概率来说,2次以内搞定的概率是:

(1-0.567)+(1-0.567)*(0.567) = 67%

3次内搞定的概率是80%,已经很可观了。

作者的建议是,至少将桶大小设置在rows数量的1/2以上,开销也不是很大,1row4字节,对于100W的数据,只需要2MB的内存。

Leave a Reply

Your email address will not be published. Required fields are marked *