Regular expressions
Hi,
I have a file which I have to parse and extract all the possible instances of a date from. The file is free-text format and I can't impose any restrictions on the content (or formatting).
I'm using Java regular expressions to catch some of the dates, but it occurred to me that someone may have already written something like this.
I've tried Google and had a swatch at the forums but couldn't find anything.
Catching dates with regular expressions must happen a lot. Does anyone know of any code already out there? (Yes, I am that lazy.)
Cheers
Mark
# 2
> I have a file which I have to parse and extract all
> the possible instances of a date from. The File is
> free text format and I can't impose any restrictions
> on the content (or formatting)
I hope you can impose restrictions on the format of the dates?
# 3
What date is this: 02-03-04? Feb. 3, 2004? Mar. 2, 2004? Mar. 4, 2002?
# 4
Thanks for all the replies.
I can't impose any restrictions on anything, it totally sucks.
Cheers
Mark
# 5
> I can't impose any restrictions on anything
Then you can't know what 02-03-04 is supposed to represent.
# 6
> I can't impose any restrictions on anything,
Then you must be able to accept
"third october seventeen eighty three at twelve twenty seven post meridian"
so almost anything could be a date.
> it
> totally sucks.
Understatement! Go back to your Boss and ask him for a specification for valid dates. If he fails to deliver then consider your position.
# 7
I consider my position every day, it's not good ("touch your toes, relax....... Here it comes!!!!").
Thanks for all your help. I agree with everyone that's commented about the dates potentially being meaningless/wrong. I've raised all the concerns and got nowhere.
Some days work feels like I'm on my back with my legs in the air, spinning imaginary wheels going nowhere, for scant applause and zero recompense. Still beats bending over.
# 8
> Thanks for all your help, I agree with everyone
> that's commented about the dates potentially being
> meaningless/wrong I've raised all the concerns and
> got nowhere.
Okay. But out of curiosity, what will you do if you get "02-03-04"? There are three valid ways to interpret it. You'll have to pick one.
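To make the ambiguity concrete, here is a rough sketch that tries all three readings with java.time. The class name and the labels are made up for illustration, and note that the `yy` pattern letter maps two-digit years onto 2000-2099, which is itself an assumption you may not want.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.LinkedHashMap;
import java.util.Map;

class AmbiguousDate {
    // Label -> java.time pattern for each reading; the labels are made up for this sketch.
    private static final Map<String, String> READINGS = new LinkedHashMap<>();
    static {
        READINGS.put("month-day-year", "MM-dd-yy");
        READINGS.put("day-month-year", "dd-MM-yy");
        READINGS.put("year-month-day", "yy-MM-dd");
    }

    // Returns every reading under which the text parses as a date.
    static Map<String, LocalDate> interpretations(String text) {
        Map<String, LocalDate> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> reading : READINGS.entrySet()) {
            try {
                DateTimeFormatter f = DateTimeFormatter.ofPattern(reading.getValue());
                result.put(reading.getKey(), LocalDate.parse(text, f));
            } catch (DateTimeParseException e) {
                // Not a valid date under this reading (e.g. month 31); skip it.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        interpretations("02-03-04")
            .forEach((label, date) -> System.out.println(label + " -> " + date));
    }
}
```

For "02-03-04" all three readings succeed (2004-02-03, 2004-03-02 and 2002-03-04), so the code cannot choose for you. Something like "31-12-99" only survives the day-month-year reading.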
# 9
You could always throw the problem back at your boss. Create a system that reads a set of regular expressions from a configuration file (say an XML file) and provide a default set of regular expressions for the obvious candidates.
Then, when your boss says ".. but it does not deal with ... " you can ask him to provide the appropriate regular expression for you to place in the configuration file.
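Something along these lines might do as a starting point. It is only a sketch: it assumes the XML file uses the standard java.util.Properties XML layout (read with `Properties.loadFromXML`), and the class name, the entry keys and the two default expressions are placeholders, not a vetted set.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConfiguredDateFinder {
    // Hypothetical default configuration; in practice this would live in a file.
    static final String DEFAULT_XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
      + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
      + "<properties>\n"
      + "<entry key=\"01.numeric\">\\b\\d{1,2}[-/.]\\d{1,2}[-/.]\\d{2,4}\\b</entry>\n"
      + "<entry key=\"02.dayMonthName\">\\b\\d{1,2}(?:st|nd|rd|th)?\\s+"
      + "(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\\s+\\d{4}\\b</entry>\n"
      + "</properties>\n";

    private final List<Pattern> patterns = new ArrayList<>();

    // Builds a finder from Properties-style XML; keys only fix the order patterns are tried in.
    static ConfiguredDateFinder fromXml(String xml) {
        ConfiguredDateFinder finder = new ConfiguredDateFinder();
        try (InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))) {
            Properties props = new Properties();
            props.loadFromXML(in);
            for (String key : new TreeSet<>(props.stringPropertyNames())) {
                finder.patterns.add(Pattern.compile(props.getProperty(key)));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return finder;
    }

    // Returns every substring matched by any configured pattern, grouped by pattern.
    List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        for (Pattern p : patterns) {
            Matcher m = p.matcher(text);
            while (m.find()) {
                hits.add(m.group());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ConfiguredDateFinder finder = fromXml(DEFAULT_XML);
        System.out.println(finder.find("Invoice dated 02-03-04, due 2nd March 2004."));
    }
}
```

Adding a new format then means adding an `<entry>` to the file rather than touching the code. Keep in mind that `<` and `&` inside an expression have to be escaped as XML entities.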
# 10
Also, if you define an interface for a 'date extraction actor' you could then specify in the XML file implementations of the interface that could be dynamically loaded. In this way, if someone hits you with a format you can't deal with using a regex then you can write a parser and add it to the XML file.
This would make for an interesting design and implementation and would allow you to shaft your boss (if you are that way inclined) rather than him shafting you.
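As a sketch of what that could look like: the interface name `DateExtractor`, the two implementations and the loader below are all hypothetical, and the class names would come from your configuration file rather than being hard-coded. The second implementation is a small hand-written parser, to show that a plug-in does not have to be a regex.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical plug-in contract: return the date-like substrings found in the text.
interface DateExtractor {
    List<String> extract(String text);
}

// Regex-backed implementation for simple numeric dates such as 02-03-04.
class NumericDateExtractor implements DateExtractor {
    private static final Pattern NUMERIC =
        Pattern.compile("\\b\\d{1,2}[-/.]\\d{1,2}[-/.]\\d{2,4}\\b");

    @Override
    public List<String> extract(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = NUMERIC.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }
}

// Parser-backed implementation: accepts only tokens that are real ISO dates.
class IsoDateExtractor implements DateExtractor {
    @Override
    public List<String> extract(String text) {
        List<String> hits = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            try {
                LocalDate.parse(token); // strict yyyy-MM-dd, so 2004-02-30 is rejected
                hits.add(token);
            } catch (DateTimeParseException e) {
                // Not an ISO date; ignore this token.
            }
        }
        return hits;
    }
}

class ExtractorLoader {
    // Instantiates each named class via its no-argument constructor.
    static List<DateExtractor> load(List<String> classNames) {
        List<DateExtractor> extractors = new ArrayList<>();
        for (String name : classNames) {
            try {
                Class<? extends DateExtractor> type =
                    Class.forName(name).asSubclass(DateExtractor.class);
                extractors.add(type.getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("Cannot load extractor " + name, e);
            }
        }
        return extractors;
    }
}
```

The loader fails loudly on a bad class name, which is probably what you want when the configuration is the thing being argued about.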
# 11
Just because you asked....:) I live in the UK and the default format for dates here is DDMMYYYY and we get our data from a British source. I would interpret 020304 as 2nd March 2004. I know this could be wrong, but I have been told to do the best I can. It really is buttocks.
Thanks again for all input.
Mark
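For what it's worth, the day-first choice described in this post could be pinned down in java.time roughly as follows. This is a sketch under assumptions not stated in the thread: input is zero-padded, separators are stripped before parsing, and two-digit years land in 2000-2099. The class and method names are made up.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

class DayFirstDates {
    private static final DateTimeFormatter SHORT_YEAR = DateTimeFormatter.ofPattern("ddMMyy");
    private static final DateTimeFormatter LONG_YEAR = DateTimeFormatter.ofPattern("ddMMuuuu");

    // Reads a candidate as day, month, year; empty if it is not a date under that reading.
    static Optional<LocalDate> parse(String raw) {
        String digits = raw.replaceAll("[-/.\\s]", ""); // drop common separators
        DateTimeFormatter f = digits.length() == 8 ? LONG_YEAR : SHORT_YEAR;
        try {
            return Optional.of(LocalDate.parse(digits, f));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("020304"));     // day-first: 2 March 2004
        System.out.println(parse("02-03-2004")); // same date with separators and a full year
    }
}
```

One caveat: with the default (smart) resolver, java.time clamps a day that is too large for the month, so 30 February would quietly become the last day of February instead of failing. If that matters, the candidates need an extra check.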