StringTokenizer problem
Hi again everybody,
I'm trying to parse that stubborn file I keep posting about with a StringTokenizer, using ascii characters. So far, I've got this:
StringTokenizer parser = new StringTokenizer( line,
"\0032\0033\0040\0041\0044\0058\0059\0063\0133\0145\0146\0147\0148\0151" );
but it doesn't seem to work. The line doesn't parse. I tried removing the zeros in the ascii codes, and that turned the file into gibberish. Does anyone know what I'm doing wrong this time? As always, I really appreciate it and thanks so much for the help so far!
Jezzica85
Regex is probably better.
Are you aware of how StringTokenizer actually works? In the sense that it splits on each individual character in the delimiter string, and not on the delimiter string as a whole?
For example if given the line
Apple|Orange|Banana|:Apple|Orange|Banana
and the delimiter "|:", the returned tokens would be
Apple
Orange
Banana
Apple
Orange
Banana
and not
Apple|Orange|Banana
Apple|Orange|Banana
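For anyone who wants to check the behaviour described above, here is a minimal sketch. The class name DelimiterDemo and the helper tokens are made up for illustration; only StringTokenizer itself comes from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class DelimiterDemo {
    // Collects every token produced by StringTokenizer for the given delimiters.
    static List<String> tokens(String line, String delimiters) {
        List<String> result = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, delimiters);
        while (st.hasMoreTokens()) {
            result.add(st.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        // '|' and ':' are each treated as a delimiter on their own,
        // so "|:" is never matched as a two-character unit.
        System.out.println(tokens("Apple|Orange|Banana|:Apple|Orange|Banana", "|:"));
        // prints [Apple, Orange, Banana, Apple, Orange, Banana]
    }
}
```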
Hi Cotton, good to hear from you again! I know how StringTokenizer works, that's why I was trying to separate the ascii characters with escape sequences. What do you mean by regex is better? The only parsing techniques I know are StringTokenizer and split, and neither of them seem to work.
> Hi Cotton, good to hear from you again! I know how
> StringTokenizer works,
Well I thought I'd ask because it's often not really understood.
> that's why I was trying to
> separate the ascii characters with escape sequences.
Hmmm. What are those characters? Are they printable? I am pretty lazy and don't want to look them up. Control characters?
> What do you mean by regex is better?
It's mostly preferable because it's a fairly well known construct but mainly because it's much more powerful.
> The only parsing
> techniques I know are StringTokenizer and split, and
> neither of them seem to work.
I think you should post the content of one or more of your lines and some description of how you want to parse it. That would probably be best.
I have never seen that sort of escaping before, so I copied it into some code that breaks it up. Your string
"\0032\0033\0040\0041\0044\0058\0059\0063\0133\0145\0146\0147\0148\0151"
consists of these characters:
\003
2
\003
3
\004
0
\004
1
\004
4
\005
8
\005
9
\006
3
\013
3
\014
5
\014
6
\014
7
\014
8
\015
1
So you're alternating unprintable characters with digits. I don't think that is what you meant to do. However I have no idea what you did mean to do with that mess, so I can't suggest anything except "Don't do that".
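The code used for that breakdown wasn't posted, so the following is only a sketch of something that produces a similar listing (EscapeDump and describe are hypothetical names). The reason for the result is that a Java octal escape takes at most three octal digits, and a three-digit one has to start with 0 to 3, so "\0032" is read as \003 followed by the digit 2.

```java
public class EscapeDump {
    // Renders each char of s: control characters as a three-digit octal
    // escape, everything else as the character itself, space-separated.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            if (c < 0x20 || c == 0x7f) {
                sb.append(String.format("\\%03o", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two of the escapes from the original delimiter string.
        String sample = "\0032\0151";
        System.out.println(sample.length());   // prints 4, not 2
        System.out.println(describe(sample));  // prints \003 2 \015 1
    }
}
```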
The characters I'm splitting with are:
space, exclamation point, left and right parenthesis, comma, period, ellipsis, open/closed single quote, open/closed double quote, and long dash. (I think that's all of them)
Basically, I'm trying to parse a chapter in a book, and I want all the words, with no punctuation. I've heard of doing it with pattern matching (is that another name for regex? now that I think about it it sounds kind of familiar), but I don't know how to code that or how to use accented characters in the pattern.
> The characters I'm splitting with are:
> space, exclamation point, left and right parenthesis,
> comma, period, ellipsis, open/closed single quote,
> open/closed double quote, and long dash. (I think
> that's all of them)
So why are you doing that escaping nonsense?
Just use
StringTokenizer parser = new StringTokenizer( line, " !(),.'\"-?");
I added the ? and I am having a brain cramp on what ellipsis means. The only one you have to escape (and only to compile) is "
>
> Basically, I'm trying to parse a chapter in a book,
> and I want all the words, with no punctuation. I've
> heard of doing it with pattern matching (is that
> another name for regex? now that I think about it it
> sounds kind of familiar), but I don't know how to
> code that or how to use accented characters in the
> pattern.
An ellipsis is "..." It's one character according to ascii, 0133. The reason I was using the escaping is because " ' " takes out apostrophes, single opening quotes, and single closing quotes. I didn't want to take out the apostrophes. Also, I wanted to make sure the dash (0151) wasn't mistaken for a hyphen, which I want to keep. Is there a way to combine escaping those particular ascii characters with writing out the others in the Tokenizer?
The complexity of your rules suggests regex is the way to go here. So yes, the split method.
I would post that code and maybe someone can help you out with it. Tokenizer is just for simple parsing, but you have complexity here which isn't going to work without a lot more work on your part... work that regex will do happily for you.
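A rough sketch of the split approach, under some assumptions that were not confirmed in the thread: the codes 133, 145 to 148 and 151 look like Windows-1252 positions, and if the file is decoded with the right charset they arrive in Java as the Unicode characters U+2026 (ellipsis), U+2018/U+2019 (single curly quotes), U+201C/U+201D (double curly quotes) and U+2014 (long dash). Those can be written with \u escapes next to the ordinary punctuation. The class name WordSplitter and the sample sentence are made up, and the delimiter set is the union of the characters described and the codes in the first post.

```java
import java.util.ArrayList;
import java.util.List;

public class WordSplitter {
    // One or more delimiter characters in a row count as a single break.
    // The straight apostrophe (') and the hyphen (-) are deliberately left out,
    // so words like "it's" and "Anne-Marie" stay whole.
    private static final String DELIMITERS =
            "[ !(),.:;?\u2026\u2018\u2019\u201C\u201D\u2014]+";

    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        for (String piece : line.split(DELIMITERS)) {
            // split() yields a leading empty string when the line starts
            // with a delimiter, so skip empty pieces.
            if (!piece.isEmpty()) {
                result.add(piece);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "\u201CWell\u2026 it's late,\u201D said Anne-Marie\u2014quietly.";
        System.out.println(words(line));
        // prints [Well, it's, late, said, Anne-Marie, quietly]
    }
}
```

The same characters, without the square brackets and the +, could also be passed as the delimiter string to StringTokenizer. One caveat: if the book uses the curly closing quote (U+2019) as its apostrophe, this pattern will split "it’s" in two, and telling an apostrophe from a closing quote would need a more careful rule.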