How much can parsing XML documents help?

Hi,

I don't know if anyone has contemplated this before, so if you have the slightest idea, please help me with it.

Suppose I were to build a medium-scale search engine, and I have XML records of all the documents I want to search on my disk. Should I use the traditional approach of loading everything into a database and working with queries, or would I gain something by searching the XML records directly and putting the parsing tools we are discussing here to better use?

Any suggestions would be helpful.

Thanks

Prashanth

[587 byte] By [varaprashanth] at [2007-9-26 1:23:59]
# 1

Prashanth,

The following is only my own opinion on using XML versus querying a DB:

1. Using XML instead of querying a DB will make your application more portable, as an XML file is just a plain text file. You can move your application to any PC or platform as needed.

2. By using XML, you don't have to deal with DB performance work such as SQL tuning or index maintenance.

3. It is always good for you to learn something new, like XML.

Joey

joey829 at 2007-6-29 1:03:29 > top of Java-index,Enterprise & Remote Computing,Enterprise Technologies...
# 2

The issue here is performance.

Performance comes from indexing.

Most search engines use an inverted index (sometimes called reverse indexing) to perform the search. It works something like this: if you search for the word "xml" in, say, Yahoo, it goes to its index entry for "xml", and that record holds the numbers identifying the documents that contain the word. This is a very rudimentary picture. With a database you get this indexing capability for free. With XML you may have to build it yourself or buy software that has already implemented it.

Hope this helps.
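To make the idea concrete, here is a minimal sketch of such an index in Java. It is only an illustration, not how any real search engine does it: the class name, the document numbers, and the naive tokenizer (lowercase, split on anything that is not a letter or digit) are all my own assumptions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: maps each word to the set of document numbers containing it.
public class InvertedIndexSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index one document. The tokenizer is deliberately naive (an assumption).
    public void add(int docId, String text) {
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Look up a single word; returns an empty set if the word was never indexed.
    public Set<Integer> search(String word) {
        return postings.getOrDefault(word.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Parsing XML documents with SAX");
        index.add(2, "Tuning SQL queries");
        index.add(3, "XML versus a database");
        System.out.println(index.search("xml")); // prints [1, 3]
    }
}
```

The lookup is a single map access instead of a scan over every XML record, which is where the speed comes from. If you go the XML route, the text passed to `add` would come out of your parser for each record.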

sudhirsrinivasan at 2007-6-29 1:03:29
# 3

I know you probably hate Micro$oft, but they did a phenomenal job a few years ago with www.gear.com (I think that was it). What they did was this:

1. The data all lives in two places:

a. A database, where all deletes, updates, and inserts go.

b. XML files

2. The website itself only hits the XML files. This makes for very fast searching, as they never have to open a connection to a DB and never have to send or receive data across the network. Also, everything is indexed by default (no need to put an index on a table that has to be rebuilt every now and then).

3. To keep the data up-to-date, whenever any changes happen, they are directed to the database.

a. A trigger is then kicked off, which instantiates an object that writes all the new data to temporary XML files on a server.

b. The temporary XML files are then (in one fell swoop) sent up to the main XML servers, overwriting the XML data for the website.
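I don't have their code, but the "write to a temp file, then overwrite the live one" step in 3a/3b could be sketched roughly like this in Java on a single machine. The class and method names are made up for illustration, and the real setup pushed files between servers, which this does not attempt.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: publish new XML so readers never see a half-written file.
public class XmlPublishSketch {
    public static void publish(Path liveFile, String xml) throws IOException {
        // Write the new data next to the live file so the move stays on one file system.
        Path dir = liveFile.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "publish", ".xml.tmp");
        Files.write(tmp, xml.getBytes(StandardCharsets.UTF_8));
        try {
            // Swap the temp file into place in one step where the platform allows it.
            // Whether an existing target is replaced is implementation-specific.
            Files.move(tmp, liveFile, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            // Fallback: a plain overwrite, which is not guaranteed to be atomic.
            Files.move(tmp, liveFile, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

The point of the temp file is that the website keeps reading the old, complete XML until the new one is fully written.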

They also had one section that usually got small changes, and for this section they did it a bit differently. The changes still went to the DB, but the object kicked off by the trigger simply "walked" the XML nodes and made the quick tweak to the XML files (both the live and the temp copies).

Hope this helps! See if you can find the info on it. Microsoft actually gave out papers on how they did this, and held conferences in a bunch of different cities.
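For what "walking the nodes and making a quick tweak" might look like in Java, here is a rough sketch using the standard DOM API. The `product`, `id`, and `price` names are invented for the example (I don't know what their XML looked like), and it works on strings to keep it short.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: change one value in place instead of regenerating the whole file.
public class XmlTweakSketch {
    public static String updatePrice(String xml, String productId, String newPrice) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // Walk every <product> element and tweak the one whose id matches.
        NodeList products = doc.getElementsByTagName("product");
        for (int i = 0; i < products.getLength(); i++) {
            Element product = (Element) products.item(i);
            if (productId.equals(product.getAttribute("id"))) {
                // Assumes each product has a <price> child; otherwise this throws.
                product.getElementsByTagName("price").item(0).setTextContent(newPrice);
            }
        }
        // Serialize the modified tree back to text.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

DOM loads the whole document into memory, so this only makes sense for files of modest size.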

6tr6tr at 2007-6-29 1:03:29