Compressed adressing

Here comes another of my save memory/optimisation questions.

I have a table on disc, and the table width is variable, say 100-1000 bytes. My current solution is to use the maximum size as a fixed size so I can simply access the right row in the table by an offset to the file with the table. But this is spacewasting and also increases the readtime of the rows.

What do I do to make a table with variable size rows to be accessed efficiently? That is, when the adress of each row cannot be easily determined by just an offset. An access is typically a sequence of rows like [1,2,10,12,16,30,31], which would return the data for those rows. I have a few suggestions:

1. Store the rows in a b-tree (which I can easily do). The key used is the row nr and then I use some kind of function in the btree called

getMany(int[] keys) -> Object[]

This I don't think is that efficient since the jumps within the btrees creates alot of extra reads of b-tree pages.

2. Use compressed adressing. Store the rows sequentially on a flat file. But then have a smart index (compressed adressing) to find the actual adress of the row.

The smart index would be a long[] that contains:

* First a long 64 bit to the file position of a row.

* Then a number of longs that contains 4 16 bits offsets from the first 64 bit - as long as 16 bits is enough as offset. Then a new new 64 bit position is needed.

Eg, if the max row widht is 1000 bytes, there can first be the 64 bit position and then 32 16 bits offsets before the next 64 bit position. (since 2^16>32*1000)

This might be done even smarter if we stuff bits even more, never using more bits than absolutely necessary to describe an offset. Any smart suggestions to this?

Anyway, this smart index even in the simple form described, would only take approximately 2 bytes (16 bits) per row in the table. A table with 100 000 000 rows would require 200 MB, which could be kept in primary memory in my case.

If the smart index needs to be stored on disc, I still think the readtime of it would be so small compared to the read of the actual rows that it is acceptable.

Is there a better way to do quick access of the variable size rows? Suggestions of optimisations of the suggested solutions are appriciated, especially the 2nd solution which I believe in.

(Just a comment for those who wonders:

As some may see deletion of a row in the 2nd solution could cause a problem. But I will keep a sorted list (sorted by size) of the "removed" rows, and when a new row will be inserted it finds a removed space that it can fit in: the row that is equal or the smallest free space larger than the space required for the new row.)

Gil

[2771 byte] By [gilroittoa] at [2007-9-29 5:07:00]
# 1

just an idea:

store the average length of an record at the beginning of the index.

then store for each position only the distance it has to the position it should have when all records are of average length.

You'd need to store the maximum length of an index entry as well I guess

regards

Spieler

spielera at 2007-7-14 18:21:07 > top of Java-index,Other Topics,Algorithms...
# 2

You are entering the interesting realm of i.a. DB systems programming.

What about caching for access, large pages of fixed size with 100-200 records compressed, and such?

I am afraid that your question does not lend itself for a specific answer.

Maybe someone reading this knows a software package based on these ideas.

Wasn't there HyperSQL with an in-core SQL DB?

joop_eggena at 2007-7-14 18:21:07 > top of Java-index,Other Topics,Algorithms...
# 3
look here for some ideas about how oracle handles disk io for variousstructures (around figure 3) might give you some ideas for disk based hashes and storage of tables. http://ugweb.cs.ualberta.ca/~c391/manual/chapt2.htmlmatfud
matfuda at 2007-7-14 18:21:07 > top of Java-index,Other Topics,Algorithms...
# 4

Guessing from your description of how you would handle deleted rows, I'd guess that order is not of any importance to you. If it is not one possible solution would be pages with a fixed row size per page.

Then your smart index would reference the pages and their sizes instead of a constantly variable size.

RyanVBakera at 2007-7-14 18:21:07 > top of Java-index,Other Topics,Algorithms...
# 5
It's amazingly good. It will probably drasticly improve the performance in many cases.Thanks a bunch!Gil
gilroittoa at 2007-7-14 18:21:07 > top of Java-index,Other Topics,Algorithms...