A slightly unusual String question
Hi everybody,
Does anyone know if there is a limit to how long Strings can be? I'm parsing an HTML file, and the opening and closing paragraph tags are not always on the same line. The only way I could think of to make sure I parsed out the paragraphs correctly was to squish the whole HTML body (which is very long) into one big String and scan through it for the tags. Hence my question: I don't want to kill my JVM with an OutOfMemoryError.
Any help on this would be great.
Thanks,
Jezzica85
[526 byte] By [jezzica85a] at [2007-11-27 8:13:04]

Jezzica,
Strings in Java are backed by char arrays. Like all arrays in Java, they are limited to 2,147,483,647 (Integer.MAX_VALUE) elements, which is a lot.
However, that is the theoretical maximum, not the practical maximum. You will almost certainly run out of memory long before then.
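To put rough numbers on that, here is a minimal sketch. It assumes two bytes per character, which is what a char array costs; newer JVMs can store Latin-1-only strings more compactly, so treat the figure as an estimate. The class and method names are made up for illustration. At the theoretical maximum the character data alone comes to about 4 GiB, which you can compare with the heap your JVM is actually allowed to use.

```java
class StringLimit {

    /** Approximate heap bytes for the characters alone, assuming one 2-byte char each. */
    static long approxCharBytes(long length) {
        return length * Character.BYTES;
    }

    public static void main(String[] args) {
        long maxLength = Integer.MAX_VALUE; // the array length limit quoted above
        long mib = 1024L * 1024L;

        System.out.println("Theoretical max length: " + maxLength + " chars");
        System.out.println("Character data at that length: about "
                + approxCharBytes(maxLength) / mib + " MiB");
        // The practical limit is whatever heap this JVM was started with (-Xmx)
        System.out.println("Heap available to this JVM: "
                + Runtime.getRuntime().maxMemory() / mib + " MiB");
    }
}
```

A single HTML file exported from Word is normally a few hundred kilobytes to a few megabytes, so it is nowhere near either number.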
So the best answer is that this is probably not a good idea.
Why is it a problem over multiple lines again?
And why are you writing your own parser?
Hi cotton,
I saved a Microsoft Word document as HTML so I could parse it better (RTF and XML looked like complete gobbledygook and HTML I could at least read to figure out what it meant). All I really need is a few properties from the header, then the text and its formatting. For some reason, I guess when Word saves as html sometimes it decides to randomly split lines.
I'm writing my own parser because I didn't know another one existed, especially not for Microsoft Word. I thought that since it's a proprietary format that the HTML it generated wouldn't be "normal." What parser are you talking about, and what does it give you back, especially with formatted text?
Thanks,
Jezzica85
Extract from Word docs, you say?
I've done that:
package com.doesthatevencompile.desktopsearch.filetypes.framework.msdoc;

import java.io.InputStream;

import org.apache.lucene.document.Document;
import org.textmining.text.extraction.WordExtractor;

import com.doesthatevencompile.desktopsearch.filetypes.framework.DocumentFieldHelper;
import com.doesthatevencompile.desktopsearch.filetypes.framework.DocumentHandler;
import com.doesthatevencompile.desktopsearch.filetypes.framework.DocumentHandlerException;

/** Turns a Word document stream into a Lucene Document holding the extracted text. */
public class TextMiningWordDocHandler implements DocumentHandler {

    public Document getDocument(InputStream is) throws DocumentHandlerException {
        String bodyText = null;
        try {
            // tm-extractors does the real work and hands back the text as a String
            bodyText = new WordExtractor().extractText(is);
        } catch (Exception e) {
            throw new DocumentHandlerException(
                    "Cannot extract text from a Word document", e);
        }

        // Only build a Lucene document when some non-blank text came back
        if ((bodyText != null) && (bodyText.trim().length() > 0)) {
            Document doc = new Document();
            DocumentFieldHelper.addFieldToDocument(doc, DocumentFieldHelper.KEYWORD_ALL_TEXT, bodyText);
            DocumentFieldHelper.setDocumentType(doc, DocumentFieldHelper.TYPE_DOC);
            return doc;
        }
        return null;
    }
}
The lib you need can be found here:
http://doesthatevencompile.com/current-projects/code-sniplets/lib/
It's called tm-extractors.
There is some Lucene code mixed into the snippet, which you don't need to worry about. Hopefully this is enough to set you on your way.
Hi, does that extract from Word as a Word document, or as HTML? I've had a lot of trouble with downloading outside classes in the past, so I usually tend to stay away from them (particularly Apache, which I can't make sense of), but if this can do what I need then I could give it a shot. Does this extractor keep formatting intact somehow? It looks really complicated.
Thanks,
Jezzica85
Maybe you can use a ParserCallback. It notifies you whenever it finds an HTML tag, and you can then look at the associated text and attributes of the tag.
Or, you can use the HTMLEditorKit to parse the document and then query its structure.
This link will give you a simple example of each approach:
http://www.exampledepot.com/egs/javax.swing.text.html/pkg.html
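In case the link goes away, here is a rough sketch of the ParserCallback approach, using the HTML parser that ships with Swing. The class name ParagraphCollector and its paragraphsOf helper are made up for this post, and the sketch ignores nested or unclosed paragraphs, so treat it as a starting point and not a finished parser. The parser reads from a Reader, so it does not care which line a tag sits on.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class ParagraphCollector extends HTMLEditorKit.ParserCallback {

    private final List<String> paragraphs = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <p> element

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.P) {
            current = new StringBuilder();
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (current != null) {
            current.append(data); // may be called several times per paragraph
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.P && current != null) {
            paragraphs.add(current.toString());
            current = null;
        }
    }

    /** Hypothetical helper: returns the text of every paragraph found in the HTML. */
    static List<String> paragraphsOf(String html) {
        ParagraphCollector collector = new ParagraphCollector();
        try {
            // third argument true: ignore any charset declared inside the document
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.paragraphs;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>split across\nlines</p><p>second</p></body></html>";
        for (String p : paragraphsOf(html)) {
            System.out.println("[" + p + "]");
        }
    }
}
```

For a real file you would pass a FileReader instead of a StringReader. The attrs argument of handleStartTag is where tag attributes such as class or style show up, if you need the formatting.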
> I saved a Microsoft Word document as HTML so I could
> parse it better (RTF and XML looked like complete
> gobbledygook and HTML I could at least read to figure
> out what it meant). All I really need is a few
> properties from the header, then the text and its
> formatting. For some reason, I guess when Word saves
> as html sometimes it decides to randomly split
> lines.
Sure, there's nothing wrong with that. In HTML, newline characters in running text are treated as ordinary whitespace, so it doesn't matter if a few newlines are tossed in.
If it matters to you then you shouldn't be using HTML.
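If the stray line breaks get in the way once you have the paragraph text in hand, you can collapse them yourself, roughly the way a browser does when it renders. A minimal sketch (the class and method names are made up):

```java
class Whitespace {

    /** Replaces every run of spaces, tabs, and line breaks with a single space. */
    static String collapse(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(collapse("one\n   two\r\nthree "));
    }
}
```

Don't apply this to the contents of a pre element, where line breaks are significant.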
> I'm writing my own parser because I didn't know
> another one existed, especially not for Microsoft
> Word. I thought that since it's a proprietary format
> that the HTML it generated wouldn't be "normal."
Well, it's pretty grotty HTML but it is well-formed HTML. And I thought you decided you could parse HTML better than RTF or XML? If you're writing your own HTML parser then that must mean you didn't know that Java has an XML parser built into it.
I would suggest saving the document as XML, then using a standard XML parser to extract the pieces you want.
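Something along these lines, using the DOM parser that comes with the JDK. This is only a sketch: the element name "para" in the example is invented, and you would have to open a saved file to see which element names Word actually uses. The class and method names are made up as well.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XmlParagraphs {

    /** Returns the text content of every element with the given tag name. */
    static List<String> textOf(String xml, String tagName) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            NodeList nodes = doc.getElementsByTagName(tagName);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent also gathers text from nested child elements
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // "para" is a placeholder element name, not Word's real one
        String xml = "<doc><para>one\ntwo</para><para>three</para></doc>";
        System.out.println(textOf(xml, "para"));
    }
}
```

For a file on disk, builder.parse also accepts a java.io.File directly. Line breaks inside an element come back as part of the text, so where the tags fall on the page makes no difference.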
