Regular expressions

Hi,

I have a file which I have to parse and extract all the possible instances of a date from. The file is in free-text format and I can't impose any restrictions on the content (or formatting).

I'm using Java regular expressions to catch some of the dates, but it occurred to me that someone may have already written something like this.

I've tried Google and had a swatch at the forums but couldn't find anything.

Catching dates with regular expressions must happen a lot. Does anyone know of any code already out there? (Yes, I am that lazy.)

Cheers

Mark

[594 byte] By [524467niknaksbarbequea] at [2007-11-26 12:18:26]
# 1
Try www.txt2re.com
568166mkoryaka at 2007-7-7 14:57:59 > top of Java-index,Archived Forums,Socket Programming...
# 2

> I have a file which I have to parse and extract all

> the possible instances of a date from. The File is

> free text format and I can't impose any restrictions

> on the content (or formatting)

I hope you can impose restrictions on the format of the dates?

91610kajbja at 2007-7-7 14:57:59
# 3
What date is this: 02-03-04? Feb. 3, 2004? Mar. 2, 2004? Mar. 4, 2002?
2959jverda at 2007-7-7 14:57:59
# 4
Thanks for all the replies.

I can't impose any restrictions on anything, it totally sucks.

Cheers

Mark

524467niknaksbarbequea at 2007-7-7 14:57:59
# 5
> I can't impose any restrictions on anything

Then you can't know what 02-03-04 is supposed to represent.

2959jverda at 2007-7-7 14:57:59
# 6

> I can't impose any restrictions on anything,

Then you must be able to accept

"third october seventeen eighty three at twelve twenty seven post meridian"

so almost anything could be a date.

> it totally sucks.

Understatement! Go back to your Boss and ask him for a specification for valid dates. If he fails to deliver then consider your position.

106725sabre150a at 2007-7-7 14:57:59
# 7

I consider my position every day, and it's not good ("touch your toes, relax....... Here it comes!!!!").

Thanks for all your help. I agree with everyone that's commented about the dates potentially being meaningless/wrong. I've raised all the concerns and got nowhere.

Some days work feels like I'm on my back with my legs in the air, spinning imaginary wheels going nowhere, for scant applause and zero recompense. Still beats bending over.

524467niknaksbarbequea at 2007-7-7 14:57:59
# 8

> Thanks for all your help, I agree with everyone

> that's commented about the dates potentially being

> meaningless/wrong I've raised all the concerns and

> got nowhere.

Okay. But out of curiosity, what will you do if you get "02-03-04"? There are three valid ways to interpret it. You'll have to pick one.
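To make the three readings concrete, here is a minimal sketch. It assumes a JDK with java.time (Java 8 or later, so newer than this thread); the class and method names are made up for illustration. It simply tries each field ordering and keeps the ones that yield a valid date.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not from any library: lists every way a
// two-digit "xx-xx-xx" string can be read as a date.
class AmbiguousDate {

    // "uu" is a two-digit year, which java.time resolves to 2000-2099.
    private static final String[] ORDERINGS = {"MM-dd-uu", "dd-MM-uu", "uu-MM-dd"};

    static Map<String, LocalDate> interpretations(String text) {
        Map<String, LocalDate> readings = new LinkedHashMap<>();
        for (String ordering : ORDERINGS) {
            try {
                readings.put(ordering, LocalDate.parse(text, DateTimeFormatter.ofPattern(ordering)));
            } catch (DateTimeParseException e) {
                // Not a valid date under this ordering (e.g. month 25), so skip it.
            }
        }
        return readings;
    }

    public static void main(String[] args) {
        interpretations("02-03-04")
                .forEach((ordering, date) -> System.out.println(ordering + " -> " + date));
    }
}
```

For "02-03-04" all three orderings succeed, which is exactly the problem: the code cannot choose for you. Something like "25-12-07" narrows it down a little, because 25 cannot be a month.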

2959jverda at 2007-7-7 14:57:59
# 9

You could always throw the problem back at your boss. Create a system that reads a set of regular expressions from a configuration file (say an XML file) and provide a default set of regular expressions for the obvious candidates.

Then, when your boss says ".. but it does not deal with ... " you can ask him to provide the appropriate regular expression for you to place in the configuration file.
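A rough sketch of that idea follows. To keep it short it assumes the configuration is plain text with one regex per line rather than XML, and the class name, the '#' comment convention and the two default patterns are all invented for illustration; they are nowhere near a complete set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class: scans free text with whatever regexes the config supplies.
class ConfigurableDateFinder {

    private final List<Pattern> patterns = new ArrayList<>();

    // configLines: one regex per line; blank lines and lines starting with '#' are ignored.
    ConfigurableDateFinder(List<String> configLines) {
        for (String line : configLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            patterns.add(Pattern.compile(trimmed, Pattern.CASE_INSENSITIVE));
        }
    }

    // Returns every match, grouped by pattern in config order.
    List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        for (Pattern pattern : patterns) {
            Matcher matcher = pattern.matcher(text);
            while (matcher.find()) {
                hits.add(matcher.group());
            }
        }
        return hits;
    }

    // Illustrative defaults only; in practice these lines would be read from the file.
    static List<String> defaultConfig() {
        return Arrays.asList(
            "# numeric dates such as 02-03-04 or 26/11/2007",
            "\\b\\d{1,2}[-/.]\\d{1,2}[-/.](?:\\d{4}|\\d{2})\\b",
            "# day, month name, year, such as 2nd March 2004",
            "\\b\\d{1,2}(?:st|nd|rd|th)?\\s+(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*\\.?\\s+\\d{4}\\b");
    }

    public static void main(String[] args) {
        ConfigurableDateFinder finder = new ConfigurableDateFinder(defaultConfig());
        System.out.println(finder.find("Dated 26/11/2007, due 2nd March 2004; ref 02-03-04."));
    }
}
```

The point of the design is that a new format means a new line in the file, not a new release. Note that this only finds candidate strings; deciding what each one means is a separate step.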

106725sabre150a at 2007-7-7 14:57:59
# 10

Also, if you define an interface for a 'date extraction actor' you could then specify in the XML file implementations of the interface that could be dynamically loaded. In this way, if someone hits you with a format you can't deal with using a regex then you can write a parser and add it to the XML file.
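As a sketch of what that could look like: the interface name, the example implementation and the loader below are all hypothetical, and the class name passed to the loader stands in for whatever string would be read from the XML file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical 'date extraction actor' contract.
interface DateExtractor {
    List<String> extract(String text);
}

// One example implementation; a hand-written parser could implement the same interface.
class IsoDateExtractor implements DateExtractor {
    private static final Pattern ISO = Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2}\\b");

    @Override
    public List<String> extract(String text) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = ISO.matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }
}

class ExtractorLoader {
    // Instantiates an extractor from a class name, as it would appear in the config file.
    // The class needs a no-argument constructor.
    static DateExtractor load(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(DateExtractor.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot load date extractor: " + className, e);
        }
    }

    public static void main(String[] args) {
        DateExtractor extractor = load(IsoDateExtractor.class.getName());
        System.out.println(extractor.extract("built 2007-11-26, shipped 2007-12-01"));
    }
}
```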

This would make for an interesting design and implementation and would allow you to shaft your boss (if you are that way inclined) rather than him shafting you.

106725sabre150a at 2007-7-7 14:57:59
# 11

Just because you asked....:) I live in the UK and the default format for dates here is DDMMYYYY, and we get our data from a British source. I would interpret 020304 as 2nd March 2004. I know this could be wrong, but I have been told to do the best I can. It really is buttocks.
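For what it's worth, the day-first reading could be sketched roughly like this. It assumes a JDK with java.time (Java 8 or later), two-digit years that all fall in 2000-2099, and a made-up class name; the regex only catches six-digit dates with optional separators.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds dd-mm-yy style tokens and reads them day first.
class DayFirstDates {

    // Three two-digit groups, optionally separated by '-', '/' or '.'.
    private static final Pattern CANDIDATE =
            Pattern.compile("\\b(\\d{2})[-/.]?(\\d{2})[-/.]?(\\d{2})\\b");

    // STRICT rejects impossible dates such as 31 February instead of adjusting them.
    private static final DateTimeFormatter DAY_FIRST =
            DateTimeFormatter.ofPattern("ddMMuu").withResolverStyle(ResolverStyle.STRICT);

    static List<LocalDate> extract(String text) {
        List<LocalDate> dates = new ArrayList<>();
        Matcher matcher = CANDIDATE.matcher(text);
        while (matcher.find()) {
            String digits = matcher.group(1) + matcher.group(2) + matcher.group(3);
            try {
                dates.add(LocalDate.parse(digits, DAY_FIRST));
            } catch (DateTimeParseException e) {
                // Six digits that are not a real day-first date; ignore them.
            }
        }
        return dates;
    }

    public static void main(String[] args) {
        System.out.println(extract("Delivered 020304, invoiced 25/12/07, bogus 31-02-04."));
    }
}
```

It will still happily accept any six digits that happen to form a valid day and month, so it is a best guess and nothing more.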

Thanks again for all input.

Mark

524467niknaksbarbequea at 2007-7-7 14:57:59