StringTokenizer problem

Hi again everybody,

I'm trying to parse that stubborn file I keep posting about with a StringTokenizer, using ascii characters. So far, I've got this:

StringTokenizer parser = new StringTokenizer( line, "\0032\0033\0040\0041\0044\0058\0059\0063\0133\0145\0146\0147\0148\0151" );

but it doesn't seem to work. The line doesn't parse. I tried removing the zeros in the ascii codes, and that turned the file into gibberish. Does anyone know what I'm doing wrong this time? As always, I really appreciate it and thanks so much for the help so far!

Jezzica85

[653 byte] By [jezzica85a] at [2007-10-3 5:05:40]
# 1

Regex is probably better.

Are you aware how StringTokenizer actually works? In the sense that it parses based on each individual character of what you have in the delimiter string, and not on the delimiter string as a whole?

For example if given the line

Apple|Orange|Banana|:Apple|Orange|Banana

And the delimiter of "|:" the returned tokens would be

Apple

Orange

Banana

Apple

Orange

Banana

and not

Apple|Orange|Banana

Apple|Orange|Banana
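The behaviour described above can be checked with a small sketch. The class and method names here (DelimiterSetDemo, tokens) are made up for illustration; only StringTokenizer itself comes from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class DelimiterSetDemo {
    // Collects every token StringTokenizer produces for the given delimiter set.
    static List<String> tokens(String line, String delims) {
        List<String> out = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(line, delims);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // '|' and ':' are each a delimiter on their own, so the "|:" run
        // in the middle is skipped and six tokens come back.
        System.out.println(tokens("Apple|Orange|Banana|:Apple|Orange|Banana", "|:"));
    }
}
```

Runs of adjacent delimiters are skipped as a group, which is why no empty token appears between Banana and Apple.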

cotton.ma at 2007-7-14 23:11:35 > top of Java-index,Java Essentials,Java Programming...
# 2
Hi Cotton, good to hear from you again! I know how StringTokenizer works, that's why I was trying to separate the ascii characters with escape sequences. What do you mean by regex is better? The only parsing techniques I know are StringTokenizer and split, and neither of them seem to work.
jezzica85a at 2007-7-14 23:11:35
# 3

> Hi Cotton, good to hear from you again! I know how

> StringTokenizer works,

Well I thought I'd ask because it's often not really understood.

> that's why I was trying to

> separate the ascii characters with escape sequences.

Hmmm. What are those characters? Are they printable? I am pretty lazy and don't want to look them up. Control characters?

> What do you mean by regex is better?

It's mostly preferable because it's a fairly well known construct but mainly because it's much more powerful.

> The only parsing

> techniques I know are StringTokenizer and split, and

> neither of them seem to work.

I think you should post the content of one or more of your lines and some description of how you want to parse it. That would probably be best.

cotton.ma at 2007-7-14 23:11:35
# 4

I have never seen that sort of escaping before, so I copied it into some code that breaks it up. Your string

"\0032\0033\0040\0041\0044\0058\0059\0063\0133\0145\0146\0147\0148\0151"

consists of these characters:

\003

2

\003

3

\004

0

\004

1

\004

4

\005

8

\005

9

\006

3

\013

3

\014

5

\014

6

\014

7

\014

8

\015

1

So you're alternating unprintable characters with digits. I don't think that is what you meant to do. However I have no idea what you did mean to do with that mess, so I can't suggest anything except "Don't do that".
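A minimal sketch of why the string breaks up that way: in a Java string literal a backslash followed by digits is an octal escape of at most three digits, so "\0032" is the character \003 followed by the digit 2. The class, method, and constant names below are invented for this illustration.

```java
class OctalEscapeDemo {
    // What the original post wrote for "decimal 32": octal escape \003, then the digit '2'.
    static final String ASKER_STYLE = "\0032";
    // One way to really get character 32 (a space): a four-hex-digit Unicode escape.
    static final String INTENDED = "\u0020";

    // Lists the numeric code of each char in s, separated by spaces.
    static String codes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append((int) s.charAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(codes(ASKER_STYLE)); // prints "3 50": control char 3, then '2'
        System.out.println(codes(INTENDED));    // prints "32": a single space
    }
}
```

Dropping the leading zeros does not help either, because the digits are still read as octal: "\32" is character 26, not 32.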

DrClapa at 2007-7-14 23:11:35
# 5

The characters I'm splitting with are:

space, exclamation point, left and right parenthesis, comma, period, ellipsis, open/closed single quote, open/closed double quote, and long dash. (I think that's all of them)

Basically, I'm trying to parse a chapter in a book, and I want all the words, with no punctuation. I've heard of doing it with pattern matching (is that another name for regex? now that I think about it it sounds kind of familiar), but I don't know how to code that or how to use accented characters in the pattern.

jezzica85a at 2007-7-14 23:11:35
# 6

> The characters I'm splitting with are:

> space, exclamation point, left and right parenthesis,

> comma, period, ellipsis, open/closed single quote,

> open/closed double quote, and long dash. (I think

> that's all of them)

So why are you doing that escaping nonsense?

Just use

StringTokenizer parser = new StringTokenizer( line, " !(),.'\"-?");

I added the ? and I am having a brain cramp on what ellipsis means. The only one you have to escape (and only to compile) is "

>

> Basically, I'm trying to parse a chapter in a book,

> and I want all the words, with no punctuation. I've

> heard of doing it with pattern matching (is that

> another name for regex? now that I think about it it

> sounds kind of familiar), but I don't know how to

> code that or how to use accented characters in the

> pattern.

cotton.ma at 2007-7-14 23:11:35
# 7

An ellipsis is "..." It's one character according to ascii, 0133. The reason I was using the escaping is because " ' " takes out apostrophes, single opening quotes, and single closing quotes. I didn't want to take out the apostrophes. Also, I wanted to make sure the dash (0151) wasn't mistaken for a hyphen, which I want to keep. Is there a way to combine escaping those particular ascii characters with writing out the others in the Tokenizer?
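On the question of mixing escaped and plain characters in the Tokenizer: one possible sketch uses Java's \uXXXX escapes alongside ordinary characters in the same delimiter string. It rests on some assumptions. The codes 133, 145 to 148 and 151 are the decimal positions of these characters in the Windows-1252 code page rather than ASCII, and once the file is read with the right encoding they arrive in a Java String as U+2026, U+2018, U+2019, U+201C, U+201D and U+2014. It also assumes apostrophes in the file are the straight ' character. The class name, method name, and sample sentence are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class PunctuationTokenizerDemo {
    // Plain delimiters are typed directly; typographic ones use Unicode escapes.
    // The straight apostrophe and the hyphen are left out on purpose so they stay inside words.
    static final String DELIMS =
        " !(),."
        + "\u2026"        // ellipsis
        + "\u2018\u2019"  // opening and closing single quote
        + "\u201C\u201D"  // opening and closing double quote
        + "\u2014";       // long dash

    // Hypothetical sample: curly double quotes, a long dash and an ellipsis around ordinary words.
    static final String SAMPLE =
        "\u201CIt's a well-known fact\u2014or so she said\u2026\u201D";

    static List<String> words(String line) {
        List<String> out = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(line, DELIMS);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Expected tokens: It's, a, well-known, fact, or, so, she, said
        System.out.println(words(SAMPLE));
    }
}
```

If the file uses the curly closing quote (U+2019) as its apostrophe, this approach would split words like "It's" in two, and a rule that looks at the neighbouring characters would be needed instead.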

jezzica85a at 2007-7-14 23:11:35
# 8

The complexity of your rules suggests regex is the way to go here. So yes, the split method.

I would post that code and maybe someone can help you out with it. Tokenizer is just for simple parsing, but you have complexity here that isn't going to work without a lot more work on your part... work that regex will do happily for you.
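For reference, a rough sketch of the split approach with a regex character class. It is only one way to do it and carries the same assumptions as any fixed punctuation list: the set of delimiters is taken from the list given earlier in the thread, the typographic characters are written as Unicode escapes, and the class name, method name, and sample text are invented. Accented letters need no special handling here because the pattern names only the characters to split on.

```java
import java.util.ArrayList;
import java.util.List;

class RegexSplitDemo {
    // One or more of: space ! ( ) , . ellipsis, curly single/double quotes, long dash.
    // Inside [...] the characters . ( ) ! are literal, so no regex escaping is needed.
    static final String DELIM_PATTERN =
        "[ !(),.\u2026\u2018\u2019\u201C\u201D\u2014]+";

    // Hypothetical sample with accented letters, an apostrophe and a hyphen.
    static final String SAMPLE =
        "\u201CCaf\u00E9, na\u00EFve\u2014it's re-read\u2026\u201D";

    static List<String> words(String line) {
        List<String> out = new ArrayList<String>();
        for (String piece : line.split(DELIM_PATTERN)) {
            // split() yields an empty first element when the line starts with a delimiter.
            if (piece.length() > 0) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Four words are expected; the apostrophe and hyphen stay inside them.
        System.out.println(words(SAMPLE));
    }
}
```

The advantage over StringTokenizer shows up once the rules get contextual, for example treating a quote mark as an apostrophe only when it sits between two letters; that kind of rule can be added to the pattern but not to a delimiter string.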

cotton.ma at 2007-7-14 23:11:35