map question

I am associating a String <key> with an int <value> in a Map. The two are in one-to-one correspondence: every String has exactly one int, and every int has exactly one String. Later I want to retrieve a key given a value, as well as a value given a key.

I am doing this in an attempt to save memory when loading a large delimited file with 700,000 rows. Two of the fields in the file have a limited number of distinct string values, and I want to store pointers to those values as ints instead of Strings in my objects. I will have a lookup table that lets me display the original string when I want to.

I want to read the file, pull the string as the <key> to the map, and use a counter to get a <value>. When I want to display the original string, I will go from the int <value> in my object and look up the original String.

But I don't know how to make the two-way trip without resorting to a Rube Goldberg scheme with two maps for each int-String association. That seems needlessly complicated, and I feel there must be a simpler approach.

Maybe I need to use an array (?); the indexes could be the pointer values, but then I don't know how to enforce uniqueness in the values at each array index.
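
The array idea above can be sketched roughly as follows. This is only one possible arrangement, not a standard library class: `StringTable`, `idFor`, and `valueFor` are hypothetical names. A single HashMap enforces uniqueness on the way in, and a List indexed by the int id handles the reverse lookup, so no second Map is needed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: assigns each distinct String a small int id and maps it back.
class StringTable {
    // String -> id; guarantees that a given String never receives two ids.
    private final Map<String, Integer> idsByValue = new HashMap<>();
    // id -> String; the list index is the id, so reverse lookup is a plain get().
    private final List<String> valuesById = new ArrayList<>();

    // Returns the existing id for value, or assigns the next free id.
    int idFor(String value) {
        Integer id = idsByValue.get(value);
        if (id == null) {
            id = valuesById.size();   // ids are handed out as 0, 1, 2, ...
            idsByValue.put(value, id);
            valuesById.add(value);
        }
        return id;
    }

    // Returns the original String for an id previously returned by idFor.
    String valueFor(int id) {
        return valuesById.get(id);
    }

    // Number of distinct Strings seen so far.
    int size() {
        return valuesById.size();
    }
}
```

While reading the file, each row object would store `table.idFor(field)` as an int, and display code would call `table.valueFor(id)` to get the String back.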

I just tried loading the file with the JVM argument "-Xmx512M", and it gave me "out of memory" after about 8 minutes of whirring.

[1407 byte] By [cumina] at [2007-11-26 16:36:42]
# 1

> I am doing this in an attempt to save memory on
> loading a large delimited file with 700,000 rows; two
> of the fields in the file have a limited number of
> distinct string values, and I want to store pointers
> to those values as ints instead of Strings in my
> objects. I will have a lookup table that lets me
> display the original string when I want to.

Maybe I am wrong, but doesn't Java do this already? (I believe it maintains a pool of strings.)

Anyway, why are you loading that many rows of data at once? Process them in chunks. There is no need to have all those rows in memory at once.

zadoka at 2007-7-8 23:01:40 > top of Java-index,Core,Core APIs...
# 2

I am keeping the data in memory because the rows of data are represented as a big Map in my program. The data in the text file is essentially one big data structure that I am representing and manipulating and querying in the program. If I could chunk it appropriately so the data could be fetched as needed from disk I would do that, but I am afraid I will end up trying to hack together some sort of RDBMS.

The text data is a structured vocabulary, a set of names and codes, keyed on one of the codes. Each of those keys is associated with a set of name-code (and other data bits) associations.

Does the Java string pool mechanism mean that if 20,000 objects all have the same string in various instance variables, there is only one actual string kept in memory?

If that is the case, then this particular approach to memory optimization is futile.

cumina at 2007-7-8 23:01:40
# 3

> I am keeping the data in memory because the rows of
> data are represented as a big Map in my program. The
> data in the text file is essentially one big data
> structure that I am representing and manipulating
> and querying in the program. If I could chunk it
> appropriately so the data could be fetched as needed
> from disk I would do that, but I am afraid I will end
> up trying to hack together some sort of RDBMS.

Why not use an RDBMS?

> If that is the case, then this particular approach to
> memory optimization is futile.

That is what I am thinking. However, you need to research it or maybe someone else here can confirm it because I am not 100% sure on that.

zadoka at 2007-7-8 23:01:40
# 4

> Why not use an RDBMS?

I want a free-standing program, not tied to a network link or a particular RDBMS. I believe Derby is an embeddable database, but I was hoping to let a user point my program at the text files and let the program figure out the rest without adding the overhead of an RDBMS. I have seen some other programs that do what I want to do (browse a big data store), and which work with more data than mine. The examples I am thinking of are very (!) fast and don't seem to use a database AFAIK. You just plug them in and they go.

One uses Lucene in a way that apparently involves thousands of files with Lucene indexes. The other is a mystery, done in Microsoft Visual Basic with a lot of pointer manipulation (according to the author, whom I have had occasion to talk to a few times, but never to the point of getting the "secret").

cumina at 2007-7-8 23:01:40
# 5
Well, you are going to need to come up with some way to quickly access your data without having to load it all into memory. Not knowing your requirements, all I can say is use an RDBMS or find some other way to index your data for quick retrieval.
zadoka at 2007-7-8 23:01:40
# 6
Yeah, it looks like I might have to make binary files with fixed field sizes and use random access lookup. I was hoping there was more of an "off-the-shelf" approach, but that may be wishful thinking.
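
For illustration, a fixed-size record layout with random access lookup might look something like the sketch below, using `java.io.RandomAccessFile`. The record layout (one int code followed by a name padded to 16 chars) and the names `FixedRecordFile`, `NAME_CHARS`, `write`, `readCode`, and `readName` are assumptions made up for this example, not anything from the actual data.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical layout: each record is an int code followed by a fixed-width name.
class FixedRecordFile {
    static final int NAME_CHARS = 16;                    // assumed field width
    static final int RECORD_BYTES = 4 + 2 * NAME_CHARS;  // int + 2 bytes per char

    // Writes record number index; names longer than NAME_CHARS are truncated.
    static void write(RandomAccessFile file, int index, int code, String name)
            throws IOException {
        file.seek((long) index * RECORD_BYTES);
        file.writeInt(code);
        for (int i = 0; i < NAME_CHARS; i++) {
            // Pad short names with spaces so every record has the same size.
            file.writeChar(i < name.length() ? name.charAt(i) : ' ');
        }
    }

    // Jumps straight to record number index and reads its code.
    static int readCode(RandomAccessFile file, int index) throws IOException {
        file.seek((long) index * RECORD_BYTES);
        return file.readInt();
    }

    // Reads the name field of record number index, dropping the padding.
    static String readName(RandomAccessFile file, int index) throws IOException {
        file.seek((long) index * RECORD_BYTES + 4);
        StringBuilder name = new StringBuilder(NAME_CHARS);
        for (int i = 0; i < NAME_CHARS; i++) {
            name.append(file.readChar());
        }
        return name.toString().trim();
    }
}
```

Because every record has the same length, the byte offset of record n is just n * RECORD_BYTES, so a lookup is a single seek rather than a scan of the file.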
cumina at 2007-7-8 23:01:40
# 7

It turns out that the problem was with using Scanner to read the lines of text, instead of a BufferedReader. When I read the files line by line with a BufferedReader, it takes 88 seconds to read in the whole 73 MB file and make the concept graph in memory.

I wasn't aware that using a Scanner to read big files made such a difference in memory use.
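
For anyone following along, reading a delimited file line by line with a BufferedReader could look roughly like this. It is a minimal sketch and not the actual program: `DelimitedLoader` and `readRows` are hypothetical names, and the "|" delimiter in the usage note is an assumption, since the thread never says which delimiter the file uses.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical loader: reads one line at a time and splits it into fields.
class DelimitedLoader {
    static List<String[]> readRows(Reader source, String delimiter) {
        // Quote the delimiter so characters like '|' are not treated as regex syntax.
        Pattern splitter = Pattern.compile(Pattern.quote(delimiter));
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                // The -1 limit keeps empty trailing fields instead of dropping them.
                rows.add(splitter.split(line, -1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }
}
```

A call such as `DelimitedLoader.readRows(new FileReader("vocabulary.txt"), "|")` would load a file; the file name here is made up. In a real loader each row would probably be turned into an object straight away rather than kept as a String array.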

cumina at 2007-7-8 23:01:40
# 8

> Maybe I am wrong, but doesn't Java do this already.
> (I believe it maintains a pool of strings?)

Java maintains a pool of Strings, all right. I'm not sure, but I don't think you get the String from the pool automatically. Have a look at String.intern(), http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#intern().

OleVVa at 2007-7-8 23:01:40
# 9

> > Maybe I am wrong, but doesn't Java do this already.
> > (I believe it maintains a pool of strings?)
>
> Java maintains a pool of Strings alright. I'm not
> sure, but I don't think you get the String from the
> pool automatically, but have a look at
> String.intern(),
> http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#intern().

I think Java does it automatically. You don't need to call intern explicitly.

http://javatechniques.com/public/java/docs/basics/string-equality.html

zadoka at 2007-7-8 23:01:40
# 10

Thanks for the link, it looks like a good source. At a glance, I interpret it this way: Java does "interning" (uses pooled Strings) automatically for String literals (those written with quotes in the source code), but not for Strings loaded from a file at runtime. The original poster will have to go ahead and find out.

OleVVa at 2007-7-8 23:01:40
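
The difference between literals and Strings built at runtime, as discussed in the last few posts, can be checked with a small experiment along these lines. `InternDemo` and `builtAtRuntime` are hypothetical names, and the char array simply stands in for text read from a file.

```java
// Small experiment: literals are pooled automatically, runtime Strings are not.
class InternDemo {
    // Stands in for a String read from a file: built at runtime, not a literal.
    static String builtAtRuntime() {
        return new String(new char[] {'a', 's', 'p', 'i', 'r', 'i', 'n'});
    }

    public static void main(String[] args) {
        String literal = "aspirin";
        String fromFile = builtAtRuntime();

        System.out.println(literal == fromFile);          // false: two separate objects
        System.out.println(literal.equals(fromFile));     // true: same characters
        System.out.println(literal == fromFile.intern()); // true: intern() returns the pooled copy
    }
}
```

So for values read from a file, sharing a single copy requires either an explicit call to `intern()` on each value or a lookup table kept by the program itself.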