Huge tree problem
Hello,
I want to store huge amount of combinations in memory. For each combination I want to store an integer value.
A combination looks like: 2,4,3,8. For this combination I build a tree composed of nodes: 2, 4, 3, 8. Then for node 8, I set the int value to be 1. if the same combination is encountered again, the in value will be incremented to 2.
I can also get the combination: 2,4,3. Then the int value of node "3" will be incremented to 1. The combinations are randomally generated, but they have limited length. For example, I know in advance that I can get combinations:
3 * 4 * 2 * 20 * 12. Then, I know that the first number in the sequaence will be between 1-3, the second between 1-4 and so on ... Valid combinations are can be either all the sequence or a part of it. For example:
- 1
- 1,3
- 1,3,2,19,10
all are valid combinations that can be excepted.
As was mentioned before, for each received combination an integer value of the last node in the sequence is incremented.
My problem is that I need to store this entire tree in memory, and I expect it to be very big.
I there some algorithm for dealing with huge trees allowing it to be compressed or arranged in a manner that will reduce size and efficiency of the tree ?
So how big is 'huge' - 1,000, 1,000,000, 1,000,000,000 combinations? How many items on average in a combination? How big is an item?Message was edited by: sabre150
Basically you're going to have to analyze your data. I got lost where you tried to define "combination" -- it originally seemed to be a series of integers but then it mutated to include asterisks -- and couldn't make any sense of anything after that, but hopefully you understand the data anyway.
You can represent such a tree (without pointers) using an array of ints. Look at section 17.9 of this link:
http://www.faqs.org/docs/thinkjava/chap17.htm
for an example using binary trees. As long as you know the number of symbols that can appear in each position and the length of the longest string you'll see, you can generalize this scheme for different -ary trees.
- A bit of a background:
The idea is actually to collect statistical information on users on a daily basis. I want this statistical information to be accessed fast and easy for future analysis. For example, if user has 3 questions, and for the first question there are 2 answers, the second 5 and the last 3 answers, there will be posible combinations of 2 * 5 * 3. But, in addition, user can leave a form before filling all questions. A typical result of a user that filled all 3 questions is: 2, 4, 2. A typical result for a user that filled only 2 questions is: 1, 3 and so on.
It is not reasonable to keep the data filled by all users since the system can serve a huge amount of users, and also reports should be produced in a decent amount of time. So I plan to keep all data of a user in a tree (A tree for each customer of course ...), that will hold for each combination the number of times it occured. This data will be kept on a daily basis compressed in the database. In order to produce reports I will load the data to the tree, and then be able to make fast analsis.
- Assumptions: the number of combinations can be really huge, but I also take in to consideration several assumptions:
* I collect the data on a daily basis, therefore, the tree (or whatever data structure that will be used) is being cleared every day. But ... I will want to output some reports on the data, that will sum information for more than one day.
* I assume that at any given day, 30% at the most of all combinations will be used (It will even be less ....). The more users will input data, the more combinations are possible to occure.
I have already built the tree and it seems to be working fine. I just worry about the size of it in memory, and wanted to know if there is some other Data structure or algorithm that can deal with huge amounts of combinations somehow in order to gain some kind of compression.
To: RadcliffePike
Thanks for your answer. I know that a tree can be represented by arrays. I have several problems with it:
- This is not a binary tree and I cannot rely on a final size of childs of each node.
- Even if I knew the exact size, a tree would save more memory, since in arrays you have to predefine the sapce for each set of childs.
- Even when using arrays, I gain no actual "compression", but only saving memory because of using int instead of an object to represnt the data.
Anyway, I will be happy to know if there is any solution to these problems when using arrays of integers (if there is ...).
Thanks again for your answer.
You might be better off just using a reporting API. Just write out hte information as and when you have it and let the reporting infrastructure collate it and query it for you.
ARM (OpenARM) is a reporting API that can be used (although I can't say I recommend it (It appears to have been written by a one armed dyslexic Fortran coding chimpansie)).
Cheers
Matt
Thanks for your answer.
It is a problem writing all the information for two main reasons:
- It is a huge amount of information that will cause the database to grow to a huge size very quickly.
- This huge amount of information could not be accessed and gathered in a reasonable time (I want a report to be out after a few seconds).
I have some experience with systems that collect a lot of data, and I wrote reporting systems that aggregates data from huge database log tables realtively pretty fast, But I never delt with a really huge amount of information that also needed to be fetched quickly.
Do u think this OpenARM can deal with this amount of data ? I estimate the data size with no compression can be more than 200MB per month.
Thats not much data for a database to handle. If you write out the correct information a
decent DB with good indexes should be able to generate your reports in short order.
Arm is an industry standard mechanims that allows you to augment your code to generate reports without caring where the information goes.
You call into the API's from your code and the information you report disappears off into the ether. It could end up ion a data base or most commonly in a commercial reporting tool (often backed by a database).
As I said, I would not recommend it. The java API's really stink. However it does do what it says it will and its performance is not bad.
matfud
I have already encountered cases in which much smaller amounts of data was not fetched after more than two hours on a strong server with oracle database installed.Anyway, I will take a look at this project. Maybe I am missing something.Thanks.
> I have already encountered cases in which much
> smaller amounts of data was not fetched after more
> than two hours on a strong server with oracle
> database installed.
>
That is because very often Oracle and other relational databases are not well designed and never re-designed, and some reports need to do really expensive joins, ordering and stuff that the database was not originally designed for.
Very often for these relatively small specific problems, a relational database is overkill, noone understands why things work by the speed they do etc, and an average talented programmer will come up with a much better, more easily maintainable and foremost, much faster solution. And cheaper since an Oracle installation isn't exactly free.
Gil
I would do like this:
A: 1 int[][] table that is sorted.
- Search for matches by Arrays.binarySearch
- Don't add data
B: 1 int[][] table that is NOT sorted
- search for matches lenearly
- add data by appending to the int[][] table
C: When the second int[][] table gets too big for linear search to be effective:
- merge the first and second int[][] table
- let the merged table be the new first table
- clear the second table to be empty
These tables are just the keys of course, and you might need 2 additional int[][] tables with the counts
This should save memory and give you lightning fast reporting
Gil
You could also use an int array and increment the value at the index that is the rank of the combination. If there are 2*5*3 answers and the first two are filled with 1 an 3, the element at rank(1,3,0) would be incremented. This might require more or less memory than a tree depending on the shape and implementation of the tree.
In either case, one way to halve the size is to use short values instead of ints. Whenever the value is about to overflow, it would be replaced with a negative number indicating (the negative of) an index in an int array (of 1-Short.MIN_VALUE elements) where the larger value is stored. This is safe if no more than -Short.MIN_VALUE combinations have more than Short.MAX_VALUE answers (that would require over a billion answers a day).
I wrote you an encoder and a decoder, that will convert an array of answers to questions to an integer and back. This will let you eliminate the overhead of a tree encodeing. (I rather arbitrarily wired the shape to be {2,3,5} so Q1 has two possible answers, namely 0 and 1, Q2 has 3 answers, 0,1,2 etc.)
They can be used to help in several of the schemes that have already been mentioned.
I would mention one other, just a varient on the last one, use byte[] rather than short. Then you are only burning a single byte for each unused combination. When a byte hits 255, you add a pair (encodeValue, fullCount) into a list. Which you can grovel over however you choose.
public static class ED{
int[] shape = {2,3,5};
void decode(int[] a, int n){
int base = 1;
for(int i=0; i<shape.length; i++){
if(n >< 0) {
a[i] = -1;
} else {
a[i] = n%shape[i];
n = n/shape[i] -1;
}
}
}
int encode(int[] a){
int result = 0;
for(int i = shape.length-1; i>-1; i--){
if(a[i] >= 0){
result += a[i];
if(i>0)result = (result+1)*shape[i-1];
}
}
return result;
}
void printArray(int[] a){
System.out.print("[");
for(int i=0;i<a.length; i++){
System.out.print(a[i]);
if(i><a.length-1) System.out.print(", ");
}
System.out.print("]");
}
void test(int n){
int[] a = new int[3];
for(int i=0; i><n; i++){
System.out.print(i + " - ");
decode(a,i);
printArray(a);
System.out.print(" - " + encode(a));
System.out.println();
}
}
public static void main(String args[]){
ED ed = new ED();
ed.test(42);
}
}
Sample of test output to show that the encoding and decoding works as expected.
0 - [0, -1, -1] - 0
1 - [1, -1, -1] - 1
2 - [0, 0, -1] - 2
3 - [1, 0, -1] - 3
4 - [0, 1, -1] - 4
5 - [1, 1, -1] - 5
6 - [0, 2, -1] - 6
7 - [1, 2, -1] - 7
8 - [0, 0, 0] - 8
9 - [1, 0, 0] - 9
10 - [0, 1, 0] - 10
11 - [1, 1, 0] - 11
12 - [0, 2, 0] - 12
13 - [1, 2, 0] - 13
14 - [0, 0, 1] - 14
15 - [1, 0, 1] - 15
16 - [0, 1, 1] - 16
17 - [1, 1, 1] - 17
18 - [0, 2, 1] - 18
19 - [1, 2, 1] - 19
20 - [0, 0, 2] - 20
21 - [1, 0, 2] - 21
22 - [0, 1, 2] - 22
23 - [1, 1, 2] - 23
24 - [0, 2, 2] - 24
25 - [1, 2, 2] - 25
26 - [0, 0, 3] - 26
27 - [1, 0, 3] - 27
28 - [0, 1, 3] - 28
29 - [1, 1, 3] - 29
30 - [0, 2, 3] - 30
31 - [1, 2, 3] - 31
32 - [0, 0, 4] - 32
33 - [1, 0, 4] - 33
34 - [0, 1, 4] - 34
35 - [1, 1, 4] - 35
36 - [0, 2, 4] - 36
37 - [1, 2, 4] - 37>
Thanks a lot,
This is a nice solution, but it has several limitations for me:
- I will have to reallocated space in memory for all combinations. In a tree form, only each combination that arrives, if new, is being allocated several tree nodes. I noted, that data is being collected on a daily basis. Therefore, it is logical to assume that not all combinations will occur on a single day.
- The "Answers" can be any number (sometime ids). Therefore, I will have to add the overhead off mapping each id to its array index. In tree I can just save each received answer (id, number or whatever) as a single value of a node. More than that, all the "partial" combinations also allocated space in the array. For example: 1, 3, -1 (user left on third question), will take space in the array. On tree it will not happen, since by the nature structure of a tree, you have a node on the way for each combination. you just have to count the second node (answer 3) instead of counting the last node.
- The system is very dynamic, I don't want it to be aware to the number of answers that can arrive for each question. In array I have to redefine the size. In tree I can expand whereever I want easily with no effort.
Thanks again for your answer. I copied the code you sent me, and I may find it to be useful.
In a tree formulation Every single node in the tree requires some number of bytes.
1 byte of answer number (like this was answer number 3 to Q4)
1 (or more) bytes of count for how often you have seen this item.
at least 2 ints of pointers to other tree elements. (I assume objects are 32 bit pointers)
Probably at least one int of object overhead (like a pointer to the Tree node class)
So let's call it 3 ints + 2 bytes = 14 bytes per node.
Consider a node like 2.3.0.4
You get to amortize the initial portion 2.3.0 across several different nodes (you pass through the same nodes to get to 2.3.0.4 as you do to get to 2.3.0.1 so we will only count the cost of a node as being the cost of a single node. This is however an underestimate. You actually pay the full cost of 4 nodes to get to 2.3.0.4 even if you NEVER independently use the node 2.3
So you are keeping roughly 14 bytes PER node that you see, you estimated node density to be 30% of the total possible.
IF you encode the answers into a single integer AND you implicitly keep those integers as array indices, you have essentially 0 overhead and 1 (or more) bytes of count per node. 1 byte instead of 14 bytes per data item.
Yes you pay a full size memory hit up front to keep an array. Yes you keep extra space for nodes you have not seen. As soon as you have collected 1/14 = 7% of the total number of possible answer patterns, the array storage method starts to win over an incremental tree method.
Even if you abandon the array with implicit pointers, you can keep an array of encodings, int[], explicitly and an array of counts, byte[], and you are paying 5 bytes per node instead of 14+ bytes per node.
Paying 5 bytes per node (so that you can get a method that grows incrementally and has no "holes") wins over the full array method only up until your loading is about 20%.
If you were correct in your 30% number (and I realize that those sorts of estimates are at best rough guides) the full array, even with 70% of it full of holes, is still a better solution. I think you have overestimated the impact of carrying around unused nodes, "holes", and have underestimated the overhead of keeping individual objects like tree nodes out in a heap somewhere.
There isn't any magic to packing stuff into memory. What I am doing with encode and decode is packing several small numbers into a single int. From your original description I was under the impression that your answer strings were small numbers. If not, you will need to make some accomodation. Keeping a few small arrays around to pack IDs into small numbers is a good thing to do if it saves you space in the final pack. That is what packing is all about, burning code and memory in one place to save even more memory in another place.
If you are trying to reduce your memory footprint, you need to be aware that keeping things as objects costs memory for the object overhead. To get rid of that overhead you want to regularize your data into arrays whenever possible. Less objects is less object overhead (and generally you pay in terms of more code that is essentailly encoding and decoding to move things in and out of an array). When you can eliminate data by keeping it implicitly rather that explicitly you do that too if the byte counts make sense.
The suggestions made by myself and others that you consider using arrays are still appropriate. If you want to reduce your memory you should consider using arrays, preferably one.
I have had the following two problems with servers that collate thier own statistics:
1) you are limited to the stats they create (and others that you can derive from those)
2) dynamic structures for collating stats are, in effect, a memory leak. If you don't collect the
stats and free the structures your server will run out of memory and fail to serve anymore
requests.
It is often preferable to just stream out the raw (timestamped) data. This allows you to
perform arbitrary statistical analysis without having to recode, redeploy, and restart your
server. It allows you to monitor live statistics (incrementally computed). It does not
consume memory on the server. You also have a wide number of choices as to where
the raw data is stored (DB/file/dedicated client/whatever).
If your DB is not up to handing the reporting task then write a custom client that reads the
raw data from the DB (or file or whatever). It can use any of the approaches listed in this thread.
The disadvantages of streaming out the raw data are that it can be a pain to get those stats back onto the server if, for example, it is a web server and requires a page that
shows its own usage stats). The stats arn't on the system so you would need a
mechanism for making them available to the originating server.
If you must collate the stats on the server then try hard to ensure that it uses static data structures. If they don't grow then there is no memory leak.
(by the way I have suffered the problem of 3rd party server software dieing due to its statistics fundtion being enabled (the random desktop machine that was supposed to collect the statistics from the server stopped working and caused the live server to fail. Stupid setup but there it is)
