Can Java handle this?
I have a raw data file coming into the system. It is a comma-delimited file and can have as many as 80k records. The problem is that the file may contain duplicate records (in no particular order), which I need to suppress based on the earliest date. I then need to load the result into a DB and also into another system where I can load only once.
First I thought about loading the records into a Collection (a List), sorting them, and suppressing the duplicates from there. But I'm worried about memory and don't think it's going to work.
Has anybody worked with this much data in Java before? Am I worrying too much? Or is there any other solution? Please let me know. Thanks!
[688 byte] By [tasneem93a] at [2007-10-2 15:34:30]

store all, select distinct?
Loading objects into a java.util.Set will prevent duplicates. You'll have to be sure to map the data into an Object and override hashCode and equals properly.
80,000 records * 1KB/record = 78MB of data. That's certainly a concern for memory, but not for disk. How big are the records? What's the size of the .csv file?
What percentage do you think are duplicates?
%
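A minimal sketch of the in-memory idea, adapted to the "earliest date" requirement: instead of a Set, it uses a Map keyed on whatever identifies a duplicate and keeps only the earliest line per key. The record layout (`key,date,payload` with ISO dates), the class name and the method names are assumptions for illustration, not something stated in the thread. The plain `split(",")` also assumes no quoted commas inside fields.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EarliestRecordFilter {

    // Assumed layout: key,date,payload with dates as yyyy-MM-dd.
    static List<String> keepEarliest(List<String> lines) {
        // One entry per key; the value is the earliest line seen so far.
        Map<String, String> earliestByKey = new LinkedHashMap<>();
        for (String line : lines) {
            String key = line.split(",", -1)[0];
            String kept = earliestByKey.get(key);
            if (kept == null || dateOf(line).isBefore(dateOf(kept))) {
                earliestByKey.put(key, line);
            }
        }
        return new ArrayList<>(earliestByKey.values());
    }

    static LocalDate dateOf(String line) {
        return LocalDate.parse(line.split(",", -1)[1]);
    }

    public static void main(String[] args) {
        List<String> input = List.of(
                "A1,2007-03-02,second copy",
                "B7,2007-01-15,only copy",
                "A1,2007-02-10,first copy");
        // Prints the 2007-02-10 line for A1, then the B7 line.
        keepEarliest(input).forEach(System.out::println);
    }
}
```

Memory use is bounded by the number of distinct keys rather than the number of lines, so a high duplicate rate helps here.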
> Has anybody worked with this much data in Java before?
> Am I worrying too much? Or is there any other
> solution? Please let me know. Thanks!
I did one two years ago. You need to give your app as much memory as you can, probably at least a gig; two would be much better. What eats up lots of memory is the Strings. If you don't have enough memory, you would have to do it bit by bit, e.g. using hard disk space.
Write a method that generates a hashcode for each record, such that
the hashcode is guaranteed to be unique.
Then you don't need to store the 80k records in memory.
Store only the hashcodes in memory, and for each new record,
compute the hashcode and check whether it already exists in
memory.
Depending on the nature of the data in your records, the hashcode could
be a String value or even a long value.
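A rough sketch of the "store only the hashcodes" idea, using a SHA-256 digest as the fingerprint. That choice is an assumption, not something proposed in the thread, and a digest is still not guaranteed unique; collisions are just extremely unlikely. The class and method names are made up for illustration. Note that this only detects exact repeats of a whole record; it does not by itself pick the earliest date.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

class FingerprintFilter {
    // ByteBuffer compares by content, so it works as a Set element
    // (a raw byte[] would compare by identity).
    private final Set<ByteBuffer> seen = new HashSet<>();

    /** Returns true the first time a record is offered, false for repeats. */
    boolean firstSighting(String record) {
        return seen.add(ByteBuffer.wrap(sha256(record)));
    }

    static byte[] sha256(String text) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        FingerprintFilter filter = new FingerprintFilter();
        System.out.println(filter.firstSighting("A1,2007-02-10,first copy")); // true
        System.out.println(filter.firstSighting("B7,2007-01-15,only copy"));  // true
        System.out.println(filter.firstSighting("A1,2007-02-10,first copy")); // false
    }
}
```

Each fingerprint costs 32 bytes plus object overhead, so this saves memory only when the records themselves are noticeably larger than that.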
> I did one two years ago. You need to give your app as
> much memory as you can, probably at least a gig;
> two would be much better.
That's preposterous.
You have no basis on which to make that claim because you have no idea how much data is in a given row.
jverda at 2007-7-13 15:07:59 >

> Write a method that generates a hashcode for each
> record, such that
> the hashcode is guaranteed to be unique..
Hashcodes in general are not unique.
You can make them unique if you have enough information about the actual values that can be present, but for 80k records read from a file, this is unlikely to be tractable.
The next best thing you can do is make collisions unlikely, and handle them when they do happen.
jverda at 2007-7-13 15:07:59 >
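One way to "handle them when they do happen", sketched under some assumptions: keep only an int hash and a file offset per distinct line, and when a hash matches, re-read the earlier line from disk and compare for real. The class name, method name and the callback parameter are hypothetical, duplicates are assumed to be exact whole-line repeats, and `RandomAccessFile.readLine` is unbuffered, so this trades speed for memory.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

class OffsetIndexDedup {

    /** Passes each distinct line of the file to sink, in first-seen order. */
    static void distinctLines(Path file, Consumer<String> sink) throws IOException {
        // hash of a line -> offsets of the distinct lines that produced that hash
        Map<Integer, List<Long>> offsetsByHash = new HashMap<>();
        try (RandomAccessFile scan = new RandomAccessFile(file.toFile(), "r");
             RandomAccessFile lookup = new RandomAccessFile(file.toFile(), "r")) {
            long offset = scan.getFilePointer();
            String line;
            while ((line = scan.readLine()) != null) {
                List<Long> candidates =
                        offsetsByHash.computeIfAbsent(line.hashCode(), h -> new ArrayList<>());
                boolean duplicate = false;
                for (long candidate : candidates) {
                    // Same hash: confirm by comparing against the earlier line itself.
                    lookup.seek(candidate);
                    if (line.equals(lookup.readLine())) {
                        duplicate = true;
                        break;
                    }
                }
                if (!duplicate) {
                    candidates.add(offset);
                    sink.accept(line);
                }
                offset = scan.getFilePointer();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("records", ".csv");
        // "Aa..." and "BB..." share a String.hashCode, so this exercises a collision.
        Files.write(tmp, List.of("Aa,2007-01-01", "BB,2007-01-01", "Aa,2007-01-01"));
        distinctLines(tmp, System.out::println);
        Files.delete(tmp);
    }
}
```

The two colliding lines are both kept because the on-disk comparison tells them apart; only the true repeat is dropped.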

> > Write a method that generates a hashcode for each
> > record, such that
> > the hashcode is guaranteed to be unique..
>
> Hashcodes in general are not unique.
>
Maybe I shouldn't have used the word "hashcode" then.
A method that, given a record, generates a unique String or long number
should be enough.
> You can make them unique if you have enough
> information about the actual values that can be
> present, but for 80k records read from a file, this
> is unlikely to be tractable.
>
A raw implementation would be to just concatenate all the values in a record into a String, and there you have your unique String.
(The problem here is that this won't save you any memory.)
If you know the nature of your data, you can improve on that.
So the OP has to evaluate the feasibility.
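A small sketch of the concatenation idea, with one caveat that is not raised in the thread: plain concatenation can make different records look alike when field boundaries shift ("ab" + "c" versus "a" + "bc"). Prefixing each value with its length avoids that. The class and method names here are hypothetical.

```java
import java.util.List;

class RecordKey {

    /** Builds a key from the field values; each value is prefixed with its length. */
    static String of(List<String> fields) {
        StringBuilder key = new StringBuilder();
        for (String field : fields) {
            key.append(field.length()).append(':').append(field);
        }
        return key.toString();
    }

    public static void main(String[] args) {
        // Plain concatenation cannot tell these two records apart.
        System.out.println(String.join("", "ab", "c").equals(String.join("", "a", "bc"))); // true
        // The length-prefixed keys differ: "2:ab1:c" versus "1:a2:bc".
        System.out.println(of(List.of("ab", "c")).equals(of(List.of("a", "bc")))); // false
    }
}
```

As noted above, a key like this is at least as long as the record, so it helps correctness, not memory.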
> > > Write a method that generates a hashcode for each
> > > record, such that
> > > the hashcode is guaranteed to be unique..
> >
> > Hashcodes in general are not unique.
> >
>
> Maybe I shouldn't have used the word "hashcode" then.
No, I knew what you meant, and I was also talking about "fingerprints" in general.
jverda at 2007-7-13 15:07:59 >

the ezest way of doing this is to first load everything into a table, and then select unique from it to load into the real table. no fingerprints or toeprints necessary.
jverda at 2007-7-13 15:07:59 >

hmm, echoes from reply #1
> hmm, echoes from reply #1
And it took a whole hour, too. Paraphrasing must not be as "ez" as other things...
it can be an honour to have some people agree with one, with some others, it makes one wonder more about the issue some more :)
> it can be an honour to have some people agree with
> one, with some others, it makes one wonder more about
> the issue some more :)
LOL! Indeed!
There is a saying, though: "Imitation is the sincerest form of flattery". :o)
Cheers~
when I first read that I thought you'd said
"Irritation is the sincerest form of flattery". :o)