Reading entire txt file into memory?

When you are using BufferedReader to read info into a buffer, that means you are reading the file into memory, correct? (Is that what buffer means?)

I want to look for pattern matches in text files (about 1000 of them) using the regex utils. But I don't want to read and examine the text files line by line. I want to read in the entire text file into memory first and then look for the pattern matches. The text files generally don't exceed about 15K in size. I'm only going one file at a time, too, so this won't give me any out of memory errors, will it?

And more importantly, how do I do it? I mean the "reading in the file" part only. I have my RegEx, I have my array of files to examine already. I just can't figure out the right code to use to read each file into memory before I look for pattern matches.

Could someone help, please?

[870 byte] By [adf1972a] at [2007-10-2 5:45:03]
# 1

> When you are using BufferedReader to read info into a
> buffer, that means you are reading the file into
> memory, correct? (Is that what buffer means?)

Yes.

> I want to look for pattern matches in text files
> (about 1000 of them) using the regex utils. But I
> don't want to read and examine the text files line by
> line.

Why not?

> I want to read in the entire text file into
> memory first and then look for the pattern matches.

Why?

> The text files generally don't exceed about 15K in
> size. I'm only going one file at a time, too, so
> this won't give me any out of memory errors, will
> it?

Depends on how much memory you've given the VM and how much of that it's using already at the time you read the files, but in general, probably not a problem.

> And more importantly, how do I do it? I mean the
> "reading in the file" part only.

Use BufferedReader to read line by line and then append each line (plus a newline, since BR.readLine() strips those off) to a StringBuilder.
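
A minimal sketch of that first approach, assuming Java 7 or later for try-with-resources. The class name LineSlurper and the method readAll are made up for illustration, and the charset is passed in explicitly because the thread does not say how the files are encoded.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical helper; the names are illustrative, not from the thread.
final class LineSlurper {
    private LineSlurper() {}

    /** Reads the whole file into one String, line by line. */
    static String readAll(String path, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), charset))) {
            String line;
            // readLine() returns null at end of file and strips the line terminator.
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n'); // put a newline back after every line
            }
        }
        return sb.toString();
    }
}
```

Note that this turns every line ending into a plain \n and adds one after the last line even if the file did not end with one. For regex matching that is usually harmless, but character offsets can differ from the file on disk.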

Or use a BufferedInputStream and an array that's as big as the file, and in a loop, try to read as much as is left into that array at an offset equal to how much has been read so far.
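
The byte-array variant could look roughly like this. Again a sketch: ByteSlurper and readAll are hypothetical names, try-with-resources needs Java 7 or later, and the cast of file.length() to int is only safe because the files here are small.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

// Hypothetical helper; the names are illustrative, not from the thread.
final class ByteSlurper {
    private ByteSlurper() {}

    /** Reads the whole file into a byte array sized to the file, then decodes it. */
    static String readAll(File file, Charset charset) throws IOException {
        byte[] data = new byte[(int) file.length()]; // assumes the file fits in an int-sized array
        int offset = 0;
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            // read() may return fewer bytes than requested, so keep asking for the rest.
            while (offset < data.length) {
                int n = in.read(data, offset, data.length - offset);
                if (n == -1) {
                    break; // file turned out shorter than file.length() reported
                }
                offset += n;
            }
        }
        return new String(data, 0, offset, charset);
    }
}
```

Unlike the readLine() version, this keeps the original line terminators exactly as they are in the file.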

I still think this is probably not a good approach though.

targaryena at 2007-7-16 1:54:58 > top of Java-index,Java Essentials,New To Java...
# 2

If your concern is that you have to detect patterns that might span multiple lines (for instance, a string of words that is broken across 2 lines), you could try reading 2-5 (or some other reasonable, small number) lines at a time. Of course, you'd have to keep track of where the line breaks are, and discard the first line in the group each time you read & append the next line.

Dick_Adamsa at 2007-7-16 1:54:58
# 3

> If your concern is that you have to detect patterns that
> might span multiple lines (for instance, a string of
> words that is broken across 2 lines), you could try
> reading 2-5 (or some other reasonable, small number)
> lines at a time. Of course, you'd have to keep track
> of where the line breaks are, and discard the first
> line in the group each time you read & append the
> next line.

Yes, that is exactly my concern. And since the spacing of what I'm looking for between the lines is totally unpredictable (could be 1 line between matches or 40), I don't think the reading in 2-5 lines at a time approach will work for me either.

That's why I want to read the whole thing into a buffer first. So how do I do that? Can someone please help?

adf1972a at 2007-7-16 1:54:58
# 4

Maybe I should elaborate on what I'm trying to do and why.

I am trying to pull out button names, definitions, and filepaths from 1000+ .htm files in an online help project and write them to a separate .txt file. The reason is to make sure that the button definitions are uniform.

I have written regexes that work to find the button names and definitions. Now I want to open each .htm file, find a button name, store it, then find the first button definition that occurs AFTER the button name. Then look for the next button name, and so on. So, alternating regexes.

My button objects can store multiple definitions for one button name, so eventually I want to output the arraylist of all of the button definitions used for each button, and where they occur.

adf1972a at 2007-7-16 1:54:58
# 5
> Can someone please
> help?

See the end of reply 1.

targaryena at 2007-7-16 1:54:58
# 6
> See the end of reply 1.

So I should stick all of the info into one big string? Will that still let me keep track of where the matcher found the last match and so on?

adf1972a at 2007-7-16 1:54:58
# 7

> > See the end of reply 1.
>
> So I should stick all of the info into one big
> string?

If you're trying to use regex across lines, then that's the simplest way to go.

> Will that still let me keep track of where
> the matcher found the last match and so on?

Not sure what you're asking here, but you can do whatever you can normally do with regex.
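
If the question is about remembering the position while alternating between two patterns over one big string, something along these lines might do. It relies on Matcher.find(int), which restarts the search at a given index, and Matcher.end(), which reports where the last match stopped. The class AlternatingScan, the method pair, and the "name=definition" output format are invented for this sketch, and it assumes both patterns have one capturing group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names and the output format are illustrative only.
final class AlternatingScan {
    private AlternatingScan() {}

    /**
     * Pairs each name match with the first definition match that starts at or
     * after the end of that name. Assumes both patterns define group 1.
     */
    static List<String> pair(String text, Pattern namePattern, Pattern defPattern) {
        List<String> pairs = new ArrayList<String>();
        Matcher name = namePattern.matcher(text);
        Matcher def = defPattern.matcher(text);
        int pos = 0; // where the next name search starts
        while (pos <= text.length() && name.find(pos)) {
            // Look for a definition only after the end of the name just found.
            if (!def.find(name.end())) {
                break; // a name with no definition after it ends the scan
            }
            pairs.add(name.group(1) + "=" + def.group(1));
            // Resume after the definition; always advance so empty matches cannot loop forever.
            pos = Math.max(def.end(), pos + 1);
        }
        return pairs;
    }
}
```

Calling name.start() or def.start() inside the loop would give the character offset of each match if you also need to record where it occurred.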

targaryena at 2007-7-16 1:54:58
# 8
Regex has flags that affect how matches behave across multiple lines. Check MULTILINE mode (^ and $ match at line boundaries) and DOTALL mode (. also matches line terminators).
ChuckBinga at 2007-7-16 1:54:58
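
A small sketch to try the flags out on a string that contains a line break. As far as java.util.regex goes, Pattern.MULTILINE only changes where ^ and $ can match, while Pattern.DOTALL is the flag that lets . run past a line terminator. The class RegexFlagsDemo and the method firstMatch are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for experimenting with Pattern flags.
final class RegexFlagsDemo {
    private RegexFlagsDemo() {}

    /** Returns the first match of regex in text under the given flags, or null if none. */
    static String firstMatch(String regex, int flags, String text) {
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        return m.find() ? m.group() : null;
    }
}
```

With the text "Save:\nwrites the file", the regex "Save:.*file" finds nothing with flags 0 but matches the whole string with Pattern.DOTALL, and "^writes" finds nothing with flags 0 but matches "writes" with Pattern.MULTILINE. A regex such as "Save:\\s+writes" crosses the line break without any flag, because \s already matches a newline.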
# 9

> > The text files generally don't exceed about 15K in
> > size. I'm only going one file at a time, too, so
> > this won't give me any out of memory errors, will
> > it?

15K? That is a very small text file. If that's as big as they get, you have nothing to worry about.

> I still think this is probably not a good approach
> though.

StreamTokenizer? I guess it depends what you are trying to look for.

bsampieria at 2007-7-16 1:54:58
# 10

> Maybe I should elaborate on what I'm trying to do and
> why.
>
> I am trying to pull out button names, definitions,
> and filepaths from 1000+ .htm files in an online
> help project and write them to a separate .txt file.
> The reason is to make sure that the button
> definitions are uniform.
>
> I have written regexes that work to find the button
> names and definitions. Now I want to open each .htm
> file, find a button name, store it, then find the
> first button definition that occurs AFTER the button
> name. Then look for the next button name, and so on.
> So, alternating regexes.
>
> My button objects can store multiple definitions for
> one button name, so eventually I want to output the
> arraylist of all of the button definitions used for
> each button, and where they occur.

This just cries out for 'perl' or 'Python'.

sabre150a at 2007-7-16 1:54:58