Reading an XML file, general question

Hi all,

I have a quick question that nobody I know seems able to give me a straight answer to. I'm thinking about getting Microsoft Word 2007 soon. From what I've heard, its files are XML-formatted, so you can read a Word document straight into Java with a FileReader and BufferedReader (without converting it to text first) and then parse it just like you would a text file. Is that true?

Thanks,

Jezzica85

[467 byte] By [jezzica85a] at [2007-11-27 7:13:31]
# 1
I believe that is correct: Word 2007 uses, or will use, an XML-based format. But that isn't necessarily going to be much help. Sure, XML is a text format, but the MS Word format is extremely complicated, and you may have trouble deciphering the contents of the file.
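
One caveat worth checking before reaching for a FileReader: as far as I know, a Word 2007 .docx file is a ZIP package whose main XML part is stored under the entry name word/document.xml, so the raw file itself is not readable as plain text. The sketch below pulls one entry out of a ZIP stream as text. The class name ZipEntryReader and the method readEntry are made up for illustration, and UTF-8 is an assumption about the entry's encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: reads one named entry of a ZIP stream as text. */
final class ZipEntryReader {

    /**
     * Returns the content of the entry called entryName, decoded as UTF-8
     * (an assumption), or null if the archive has no such entry.
     * The caller remains responsible for closing zipStream.
     */
    static String readEntry(InputStream zipStream, String entryName) throws IOException {
        ZipInputStream zip = new ZipInputStream(zipStream);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (entry.getName().equals(entryName)) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[4096];
                int n;
                // read() returns -1 at the end of the current entry
                while ((n = zip.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                return out.toString("UTF-8");
            }
        }
        return null; // entry not found
    }
}
```

Something like ZipEntryReader.readEntry(new FileInputStream("letter.docx"), "word/document.xml") would then hand back the XML as a String, where letter.docx is just a placeholder file name. The XML still has to be parsed after that.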
DrClapa at 2007-7-12 19:04:15 > top of Java-index,Java Essentials,Java Programming...
# 2
Oh wait a minute--does that all mean that we'd have to parse the XML code, even if we could read it into Java?
jezzica85a at 2007-7-12 19:04:15
# 3
Yes, you would, but there are of course helper classes to help with that problem.
puckstopper31a at 2007-7-12 19:04:15
# 4
Yes, it does mean that.
DrClapa at 2007-7-12 19:04:15
# 5
Cool, I'm cautiously optimistic. Are the helper classes in Java 5 or Java 6?
jezzica85a at 2007-7-12 19:04:15
# 6

I've got Word 2003, and it has the option to save as XML. I created a document with the contents "Hello, world!".

Here is the XML it produced (I was nice and indented it).

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:v="urn:schemas-microsoft-com:vml"
    xmlns:w10="urn:schemas-microsoft-com:office:word"
    xmlns:sl="http://schemas.microsoft.com/schemaLibrary/2003/core"
    xmlns:aml="http://schemas.microsoft.com/aml/2001/core"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
    xmlns:o="urn:schemas-microsoft-com:office:office"
    xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"
    xmlns:wsp="http://schemas.microsoft.com/office/word/2003/wordml/sp2"
    w:macrosPresent="no"
    w:embeddedObjPresent="no"
    w:ocxPresent="no"
    xml:space="preserve">
  <w:ignoreElements w:val="http://schemas.microsoft.com/office/word/2003/wordml/sp2"/>
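  <o:DocumentProperties>
    <o:Title>Hello, world</o:Title>
    <o:Author>Terrence Stamp</o:Author>
    <o:LastAuthor>Terrence Stamp</o:LastAuthor>
    <o:Revision>1</o:Revision>
    <o:TotalTime>1</o:TotalTime>
    <o:Created>2007-06-12T17:00:00Z</o:Created>
    <o:LastSaved>2007-06-12T17:01:00Z</o:LastSaved>
    <o:Pages>1</o:Pages>
    <o:Words>2</o:Words>
    <o:Characters>12</o:Characters>
    <o:Company>Nerd in VWs</o:Company>
    <o:Lines>1</o:Lines>
    <o:Paragraphs>1</o:Paragraphs>
    <o:CharactersWithSpaces>13</o:CharactersWithSpaces>
    <o:Version>11.8134</o:Version>
  </o:DocumentProperties>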
  <w:fonts>
    <w:defaultFonts
        w:ascii="Times New Roman"
        w:fareast="Times New Roman"
        w:h-ansi="Times New Roman"
        w:cs="Times New Roman"/>
  </w:fonts>
  <w:styles>
    <w:versionOfBuiltInStylenames w:val="4"/>
    <w:latentStyles w:defLockedState="off" w:latentStyleCount="156"/>
    <w:style w:type="paragraph" w:default="on" w:styleId="Normal">
      <w:name w:val="Normal"/>
      <w:rPr>
        <wx:font wx:val="Times New Roman"/>
        <w:sz w:val="24"/>
        <w:sz-cs w:val="24"/>
        <w:lang w:val="EN-US" w:fareast="EN-US" w:bidi="AR-SA"/>
      </w:rPr>
    </w:style>
    <w:style w:type="character" w:default="on" w:styleId="DefaultParagraphFont">
      <w:name w:val="Default Paragraph Font"/>
      <w:semiHidden/>
    </w:style>
    <w:style w:type="table" w:default="on" w:styleId="TableNormal">
      <w:name w:val="Normal Table"/>
      <wx:uiName wx:val="Table Normal"/>
      <w:semiHidden/>
      <w:rPr>
        <wx:font wx:val="Times New Roman"/>
      </w:rPr>
      <w:tblPr>
        <w:tblInd w:w="0" w:type="dxa"/>
        <w:tblCellMar>
          <w:top w:w="0" w:type="dxa"/>
          <w:left w:w="108" w:type="dxa"/>
          <w:bottom w:w="0" w:type="dxa"/>
          <w:right w:w="108" w:type="dxa"/>
        </w:tblCellMar>
      </w:tblPr>
    </w:style>
    <w:style w:type="list" w:default="on" w:styleId="NoList">
      <w:name w:val="No List"/>
      <w:semiHidden/>
    </w:style>
  </w:styles>
  <w:docPr>
    <w:view w:val="print"/>
    <w:zoom w:percent="100"/>
    <w:doNotEmbedSystemFonts/>
    <w:proofState w:spelling="clean" w:grammar="clean"/>
    <w:attachedTemplate w:val=""/>
    <w:defaultTabStop w:val="720"/>
    <w:punctuationKerning/>
    <w:characterSpacingControl w:val="DontCompress"/>
    <w:optimizeForBrowser/>
    <w:validateAgainstSchema/>
    <w:saveInvalidXML w:val="off"/>
    <w:ignoreMixedContent w:val="off"/>
    <w:alwaysShowPlaceholderText w:val="off"/>
    <w:compat>
      <w:breakWrappedTables/>
      <w:snapToGridInCell/>
      <w:wrapTextWithPunct/>
      <w:useAsianBreakRules/>
      <w:dontGrowAutofit/>
    </w:compat>
    <wsp:rsids>
      <wsp:rsidRoot wsp:val="008866C9"/>
      <wsp:rsid wsp:val="008866C9"/>
    </wsp:rsids>
  </w:docPr>
  <w:body>
    <wx:sect>
      <w:p wsp:rsidR="008866C9" wsp:rsidRDefault="008866C9">
        <w:r>
          <w:t>Hello, world!</w:t>
        </w:r>
      </w:p>
      <w:sectPr wsp:rsidR="008866C9">
        <w:pgSz w:w="12240" w:h="15840"/>
        <w:pgMar
            w:top="1440"
            w:right="1800"
            w:bottom="1440"
            w:left="1800"
            w:header="708"
            w:footer="708"
            w:gutter="0"/>
        <w:cols w:space="708"/>
        <w:docGrid w:line-pitch="360"/>
      </w:sectPr>
    </wx:sect>
  </w:body>
</w:wordDocument>

Arg! The filter is stripping out some stuff, but you get the idea.
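
For what it's worth, getting the visible text back out of a file like this does not take much code. Here is a minimal sketch using the standard DOM parser, assuming the Word 2003 namespace URI shown in the w:wordDocument element above. The class name WordMlDomText and the method textRuns are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: lists the text of every w:t element in a WordML string. */
final class WordMlDomText {

    // Namespace bound to the "w" prefix in the Word 2003 sample.
    static final String W_NS = "http://schemas.microsoft.com/office/word/2003/wordml";

    static List<String> textRuns(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Needed so elements can be looked up by namespace URI rather than prefix.
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        // Everything that is not a w:t element (styles, fonts, docPr) is ignored.
        NodeList nodes = doc.getElementsByTagNameNS(W_NS, "t");
        List<String> runs = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            runs.add(nodes.item(i).getTextContent());
        }
        return runs;
    }
}
```

For the sample document, the returned list should hold the single string "Hello, world!", since that is the only w:t element in it. DOM loads the whole file into memory, so this suits small documents better than very long ones.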

Hippolytea at 2007-7-12 19:04:15
# 7

One word fits that post, I think.

Wow.

It looks like parsing is definitely possible then, but a lot of lines would be skipped along the way. I imagine it wouldn't be much slower than parsing an average text file, since most of the lines would be skipped, even though the files I'm thinking about are very long. You'd need to be very careful about where you started and stopped parsing, though.
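
On the start-and-stop point: with a streaming parser you switch collection on and off by element rather than by line. A rough sketch with StAX (javax.xml.stream, bundled with Java 6) follows. It assumes the Word 2003 namespace URI from the sample, and the names WordMlStaxText and bodyText are made up here.

```java
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: collects paragraph text found inside w:body. */
final class WordMlStaxText {

    // Namespace bound to the "w" prefix in the Word 2003 sample.
    static final String W_NS = "http://schemas.microsoft.com/office/word/2003/wordml";

    static String bodyText(Reader in) throws XMLStreamException {
        XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
        StringBuilder text = new StringBuilder();
        boolean inBody = false; // true only between <w:body> and </w:body>
        while (xml.hasNext()) {
            int event = xml.next();
            if (event == XMLStreamConstants.START_ELEMENT && W_NS.equals(xml.getNamespaceURI())) {
                String name = xml.getLocalName();
                if (name.equals("body")) {
                    inBody = true;                      // start collecting here
                } else if (inBody && name.equals("t")) {
                    text.append(xml.getElementText());  // reads up to </w:t>
                }
            } else if (event == XMLStreamConstants.END_ELEMENT && W_NS.equals(xml.getNamespaceURI())) {
                String name = xml.getLocalName();
                if (name.equals("body")) {
                    inBody = false;                     // stop collecting here
                } else if (inBody && name.equals("p")) {
                    text.append('\n');                  // one output line per paragraph
                }
            }
        }
        xml.close();
        return text.toString();
    }
}
```

Because the parser hands over one event at a time, the fonts, styles and docPr sections are read past without being stored, which should keep memory use flat even for long files.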

I think maybe I'll try this, just to see what happens.

That is, if I have the patience...it looks like my nice little macro that saves as a text file would save me from quite the headache.

Thanks,

Jezzica85

jezzica85a at 2007-7-12 19:04:15
# 8

The "helper classes" that allow you to parse XML in Java have been there since at least Java 1.4. But that's just general XML parsing. Helper classes that actually know what to do with a w:docPr element and so on do not exist yet (as far as I know). However, chances are that you are only interested in a small part of the XML; most of it is Word framework stuff. So you could write your own XML-handling code that skips over whatever you aren't interested in.
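
A minimal sketch of that kind of "skip what you don't need" handler, using the SAX classes that ship with the JDK, might look like this. It assumes the Word 2003 namespace URI from the sample posted earlier in the thread. WordMlSaxText and extract are names invented for illustration, not part of any library.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: keeps the text of w:t elements and ignores everything else. */
final class WordMlSaxText extends DefaultHandler {

    // Namespace bound to the "w" prefix in the Word 2003 sample.
    static final String W_NS = "http://schemas.microsoft.com/office/word/2003/wordml";

    private final StringBuilder text = new StringBuilder();
    private boolean inText; // true while the parser is inside a w:t element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        inText = W_NS.equals(uri) && "t".equals(localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        inText = false;
        if (W_NS.equals(uri) && "p".equals(localName)) {
            text.append('\n'); // end of a paragraph
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inText) {
            text.append(ch, start, length); // may arrive in several chunks
        }
    }

    static String extract(InputSource source) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so uri and localName are filled in
        WordMlSaxText handler = new WordMlSaxText();
        factory.newSAXParser().parse(source, handler);
        return handler.text.toString();
    }
}
```

Calling WordMlSaxText.extract(new InputSource(new FileReader("hello.xml"))) would return the paragraph text, where hello.xml is a placeholder for a file saved from Word as XML. Elements such as w:docPr never trigger anything because the handler only reacts to w:t and w:p.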

DrClapa at 2007-7-12 19:04:15