Help with choosing a file format

I am creating a program for law students, practicing lawyers, law professors, and general law geeks that assists in annotating documents and then organizing them as the user needs. My idea is to annotate the documents using some kind of markup language, together with some way of telling other applications to ignore that markup when displaying the documents. However, I am uncertain how to do this, or whether such a file format even exists.

What kind of file should I try to use, and if none is really available, how do I create a new file format that fits my needs? (Actually, the more I think about it, the latter option, though harder, would probably work best for the program and for expanding on it in the future. So although recommendations of existing file formats are welcome, I would prefer a focus on creating a new one.)

Thanks,

M.

[890 byte] By [mitoguarda] at [2007-11-27 9:36:28]
# 1

Creating a new file format from scratch would be a severe disservice to your users, as it would prevent them from using the documents with any other application.

Do you really think Microsoft and others who create word processors are going to support some snot-nosed upstart's new document format that he created for a small niche market application?

Rather you should employ the power of what's already out there. MS Word documents are THE world standard for word processing.

If you want to store information that doesn't fit in the format, you can always save it in a separate file next to the original document. That's a good idea anyway: the original document stays intact and can be distributed electronically without the annotations, which your customers may not want their recipients to see.
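
A minimal sketch of the separate-file idea, in Java since this is a Java forum. Nothing here comes from the poster: the class name `SidecarAnnotations`, the `.annotations` suffix, and the choice of `java.util.Properties` as the storage format are all assumptions made for illustration. It relies on `java.nio.file`, so it needs Java 7 or later.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical helper: keeps annotations in a file next to the document,
// so the document itself is never modified.
final class SidecarAnnotations {

    // Assumed naming convention: "brief.doc" -> "brief.doc.annotations".
    static Path sidecarFor(Path document) {
        return document.resolveSibling(document.getFileName() + ".annotations");
    }

    // Writes the annotations to the sidecar file; the document is not touched.
    static void save(Path document, Properties annotations) {
        try (Writer out = Files.newBufferedWriter(sidecarFor(document), StandardCharsets.UTF_8)) {
            annotations.store(out, "Annotations for " + document.getFileName());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the annotations back; a missing sidecar means "no annotations yet".
    static Properties load(Path document) {
        Properties annotations = new Properties();
        Path sidecar = sidecarFor(document);
        if (!Files.exists(sidecar)) {
            return annotations;
        }
        try (Reader in = Files.newBufferedReader(sidecar, StandardCharsets.UTF_8)) {
            annotations.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return annotations;
    }
}
```

Sending someone only `brief.doc` then shares the document without the notes. A real annotation tool would likely need something richer than key/value pairs (for example, character offsets into the text), which this sketch does not attempt.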

jwentinga at 2007-7-12 23:05:32 > top of Java-index,Java Essentials,Java Programming...
# 2

> MS Word documents are THE world standard for word processing.

Would PDF be a viable option?

I guess it depends on what format the docs are in to start with, but my not-really-all-that-informed impression is that, in Java land at least, there are more and better tools for munging .pdfs than .docs.

jverda at 2007-7-12 23:05:32
# 3
And MS Word already has this feature (or one very like it)!
sabre150a at 2007-7-12 23:05:32
# 4

> Do you really think Microsoft and others who create word processors are going to support some snot-nosed upstart's new document format that he created for a small niche market application?

I was under the impression that, with chunked files, an external program that did not recognize a file format as a whole but did recognize parts of it would deal exclusively with the parts it recognized and display those. That is why I wished to create a new file format: parts of it could be safely ignored by other applications, while the file would still be generally usable in them.

Was I mistaken in this belief? If so, is there any way to add on to existing document formats that are not wholly compatible with my goals as they stand?
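
For readers unfamiliar with the chunked-file idea described here, a rough sketch follows. It assumes a made-up layout in the spirit of IFF or PNG (a 4-character ASCII tag, a 4-byte big-endian length, then the payload); the class name `ChunkedFile` and the tags `TEXT` and `NOTE` are invented for illustration and do not correspond to any existing document format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical chunked container: [4-byte ASCII tag][4-byte length][payload]...
final class ChunkedFile {

    // Appends one chunk to the buffer.
    static void writeChunk(ByteBuffer out, String tag, byte[] payload) {
        byte[] tagBytes = tag.getBytes(StandardCharsets.US_ASCII);
        if (tagBytes.length != 4) {
            throw new IllegalArgumentException("tag must be 4 ASCII characters");
        }
        out.put(tagBytes).putInt(payload.length).put(payload);
    }

    // Returns the payloads of chunks whose tag is in knownTags. Unknown chunks
    // are stepped over using their declared length. If a tag repeats, the last
    // occurrence wins in this simplified version.
    static Map<String, byte[]> readKnownChunks(byte[] file, Set<String> knownTags) {
        Map<String, byte[]> chunks = new LinkedHashMap<>();
        ByteBuffer in = ByteBuffer.wrap(file);
        while (in.remaining() >= 8) {
            byte[] tagBytes = new byte[4];
            in.get(tagBytes);
            String tag = new String(tagBytes, StandardCharsets.US_ASCII);
            int length = in.getInt();
            if (length < 0 || length > in.remaining()) {
                throw new IllegalArgumentException("corrupt chunk: " + tag);
            }
            if (knownTags.contains(tag)) {
                byte[] payload = new byte[length];
                in.get(payload);
                chunks.put(tag, payload);
            } else {
                in.position(in.position() + length); // skip what we don't understand
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        ByteBuffer out = ByteBuffer.allocate(64);
        writeChunk(out, "TEXT", "Hello".getBytes(StandardCharsets.US_ASCII));
        writeChunk(out, "NOTE", "cite".getBytes(StandardCharsets.US_ASCII));
        byte[] file = Arrays.copyOf(out.array(), out.position());
        // A reader that only knows TEXT never sees the NOTE chunk.
        System.out.println(readKnownChunks(file, Collections.singleton("TEXT")).keySet());
    }
}
```

This only works because the reader was written to tolerate unknown tags. A program that was not designed around the same container layout has no reason to behave this way, and a tolerant reader that later saves the file from what it loaded would silently drop the `NOTE` chunk.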

mitoguarda at 2007-7-12 23:05:32
# 5

> > MS Word documents are THE world standard for word processing.
>
> Would PDF be a viable option?

As he's generating the documents and supposedly wants the ability to edit them outside the system, probably not.

> I guess it depends on what format the docs are in to start with, but my not-really-all-that-informed impression is that, in Java land at least, there are more and better tools for munging .pdfs than .docs.

No. PDFs are read-only; .docs aren't.

POI is quite powerful. That of course gives it a bit of a learning curve, but nothing that can't be overcome.

You could of course recreate the entire PDF from scratch whenever you save the document, but that would be kinda defeating the purpose of the whole thing (and still leave it impossible to edit the document using an outside editor).

jwentinga at 2007-7-12 23:05:32
# 6
> and MSWord already has this feature (or one very like it)!

Yes, it does. Not sure if POI understands it though. And you may (as I indicated) not want the annotations inside the document if it's to be sent to others you don't want to read them.
jwentinga at 2007-7-12 23:05:32
# 7

> I was under the impression that, with chunked files, external programs that did not recognize the notation of a file format in whole, but recognized parts of it would deal exclusively with those parts they recognized and display those. Which is why I wished

Depends on the program. Some may ignore (and when saving the file again wipe out!) the parts they don't understand, others will refuse to read it at all, considering it corrupt.

> to deal with a new file format, with parts safely ignorable in other applications but still generally speaking usable for those applications.

Again, what makes you think anyone is going to change their word processors (or whatever) to support your new file format?

> Was I mistaken in this belief? If so, is there any way to add on to existing document formats not wholly compatible with the goals I have for them as is?

Not unless you want to get into contact with the owners of those document formats and convince them your plan would be a good idea for the next version of their software (and then you'd be limiting yourself to support for a version of that software that hardly anyone will have for quite some time).

And of course it's the same problem as the previous one: why would they listen to you in the first place? Unless you're the size of, say, Corel or Adobe, you're not important enough for any of the major players to change their software just to accommodate you.

jwentinga at 2007-7-12 23:05:32
# 8
> no. PDFs are read-only, .docs aren't.

Not if you have Acrobat, but, granted, that's much less common than Word.

> POI is quite powerful.

Really? I thought I'd heard recently from someone here that its Word support was krap.
jverda at 2007-7-12 23:05:32
# 9
I just threw POI out of a project in favour of JExcel, even though the latter is a couple of orders of magnitude slower. Reason: the formula support is better, and JExcel is apparently being kept more up to date than POI. POI didn't understand NOW() correctly.
ejpa at 2007-7-12 23:05:32