sentence generation

Hi,

I'm looking for a library / piece of code that can generate all possible sentences that conform to a given (simplified) regular expression.

e.g.:

Foo[abc]Bar[0-9]

would generate:

FooaBar0

FooaBar1

...

FooaBar8

FooaBar9

FoobBar0

....

FoocBar8

FoocBar9

Yes, this can get very large very fast, but that's OK. Is there anything out there?

thanks

Jurgen Voorneveld

[464 byte] By [Rivala] at [2007-10-2 20:13:34]
# 1

> I'm looking for a library / piece of code that can

> generate all possible sentences that conform to a

> given (simplified) regular expression.

Yikes.

> Yes this can get very large very fast but thats ok.

I say again, yikes.

> Is there anything out there?

Not to my (limited) knowledge; frankly I don't see how such a thing is even practical. What is your specific goal (brute-force password *******, etc.)? Perhaps some folks can help you explore alternate solutions...

~

yawmarka at 2007-7-13 22:55:41 > top of Java-index,Java Essentials,Java Programming...
# 2

I'm writing a small web spider where the user can supply a URL with some regular expressions in there. The program generates all the possible real URLs and downloads the files it finds. This is useful in situations where you know that for example filenames of pictures on a website conform to a simple naming rule. This way you can download pictures without having to download the HTML that normally comes with it.

http://www.sinfest.net/comics/sf20060515.gif

==>

http://www.sinfest.net/comics/sf200[0-6][0-12][0-31].gif

Rivala at 2007-7-13 22:55:41
# 3
That's a backwards way of doing it. Why not search for all

src=".*sf200[0-6][0-12][0-31].gif"

and then just download those?

Better regex:

src="(.*?sf200[\d]*\.(?:gif|jpg|jpeg|png))"

Message was edited by: sophisticatedd
sophisticatedda at 2007-7-13 22:55:41
# 4

> why not search for all

Because the OP is not searching HTML files for links, but leeching. wget can do this. Maybe look at the source for that?

OP, if you are sticking with a very small subset of regex (i.e. just [xyz] and [0-10] stuff) then I don't think it would be too hard to write yourself.
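A minimal sketch of that write-it-yourself approach, assuming the subset really is just literal characters plus single-character classes such as [abc] or [0-9]. The class and method names (SentenceGenerator, expand) are made up for illustration and do not come from any existing library. Note that, as in real regex syntax, a class like [0-12] here means the single characters 0, 1 and 2, not the numbers 0 to 12.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not an existing library): expands a tiny regex subset
// consisting of literal characters and single-character classes like [abc] or [0-9].
class SentenceGenerator {

    // Returns every string the pattern can produce, leftmost position varying slowest.
    static List<String> expand(String pattern) {
        List<String> results = new ArrayList<>();
        results.add("");
        int i = 0;
        while (i < pattern.length()) {
            char c = pattern.charAt(i);
            List<String> choices = new ArrayList<>();
            if (c == '[') {
                int close = pattern.indexOf(']', i);
                if (close < 0) {
                    throw new IllegalArgumentException("Unclosed '[' at index " + i);
                }
                String body = pattern.substring(i + 1, close);
                for (int j = 0; j < body.length(); j++) {
                    // "x-y" inside a class is a character range; anything else is a single character.
                    if (j + 2 < body.length() && body.charAt(j + 1) == '-') {
                        for (char ch = body.charAt(j); ch <= body.charAt(j + 2); ch++) {
                            choices.add(String.valueOf(ch));
                        }
                        j += 2;
                    } else {
                        choices.add(String.valueOf(body.charAt(j)));
                    }
                }
                i = close + 1;
            } else {
                // A literal character is a class with exactly one choice.
                choices.add(String.valueOf(c));
                i++;
            }
            // Cross product of everything generated so far with the choices at this position.
            List<String> next = new ArrayList<>(results.size() * choices.size());
            for (String prefix : results) {
                for (String choice : choices) {
                    next.add(prefix + choice);
                }
            }
            results = next;
        }
        return results;
    }

    public static void main(String[] args) {
        for (String s : expand("Foo[abc]Bar[0-9]")) {
            System.out.println(s);
        }
    }
}
```

For Foo[abc]Bar[0-9] this produces 30 strings, from FooaBar0 to FoocBar9. It keeps every result in memory, so for very large patterns an iterator would be the better design.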

Or maybe use better-defined data structures.

So:

http://www.example.com/pictures/goat_dairy_<date format="yyyymmdd" from="2006/01/01" to="2006/05/15">.jpg

and

http://www.example.com/pictures/donkey_<integer format="00" from="0" to="10">.jpg
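A rough sketch of how those two placeholders could be expanded, assuming the template has already been parsed into a prefix, a suffix and the range bounds (the parsing itself is left out). RangeTemplates, dateUrls and integerUrls are hypothetical names. The date pattern is written "yyyyMMdd" because that is how java.time spells year, month and day; lowercase "mm" would mean minutes there.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: expands already-parsed <date ...> and <integer ...> placeholders.
class RangeTemplates {

    // One URL per calendar day from 'from' to 'to', both inclusive.
    static List<String> dateUrls(String prefix, String suffix, String format,
                                 LocalDate from, LocalDate to) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(format);
        List<String> urls = new ArrayList<>();
        for (LocalDate day = from; !day.isAfter(to); day = day.plusDays(1)) {
            urls.add(prefix + day.format(formatter) + suffix);
        }
        return urls;
    }

    // One URL per integer from 'from' to 'to' inclusive, zero-padded to 'width' digits.
    static List<String> integerUrls(String prefix, String suffix, int width, int from, int to) {
        List<String> urls = new ArrayList<>();
        for (int n = from; n <= to; n++) {
            urls.add(prefix + String.format(Locale.ROOT, "%0" + width + "d", n) + suffix);
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> goats = dateUrls("http://www.example.com/pictures/goat_dairy_", ".jpg",
                "yyyyMMdd", LocalDate.of(2006, 1, 1), LocalDate.of(2006, 5, 15));
        List<String> donkeys = integerUrls("http://www.example.com/pictures/donkey_", ".jpg", 2, 0, 10);
        goats.forEach(System.out::println);
        donkeys.forEach(System.out::println);
    }
}
```

The advantage over digit classes is that only real calendar dates are generated, so there is no request for a "month 00" or a "day 31" in February.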

mlka at 2007-7-13 22:55:41
# 5
Ahh yes, I saw "spider" and, well... So what is the point of avoiding downloading the HTML?
sophisticatedda at 2007-7-13 22:55:41
# 6
> So what is the point of avoiding downloading the html?

Why download the HTML files? You know the format of the images (it's <date>.jpg).
mlka at 2007-7-13 22:55:41
# 7
I think that's silly. Suppose the page only has 5 images on it, and you feed the "spider"

goats[0-6][0-12][0-31].gif

You are going to make ~2912 connections (7 × 13 × 32 candidate names) just to download 5 images. It just seems unnecessary; it's a bit more efficient to just download the page.
sophisticatedda at 2007-7-13 22:55:41
# 8

> why not search for all src=".*sf200[0-6][0-12][0-31].gif"

I don't want to search, I want to download. Getting a list and then having to click on those links is annoying. The regex was just an example, not meant to be optimal.

> wget can do this. Look at the source for that?

Is wget written in Java? If I have to port weird code I could just as well write it from scratch. Plus wget gets its links from HTML files. I don't think it actually generates anything.

> Why download the HTML files? You know the format of the images (its <date>.jpg).

And when the picture isn't actually linked to from any HTML file, this method will still work.

> Just seems unneccesary, a bit more efficient to just download the page.

Things change when you want to download 3000 images. I just want to give this option to the user; the user can decide whether it's useful in individual cases.

Rivala at 2007-7-13 22:55:41
# 9

I think you aren't understanding me.

You aren't clicking on links.

make the regex pattern

download the html

match on the pattern

loop over all matches and download those images
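The matching step of that list might look roughly like this in Java. It is only a sketch: fetching the page and downloading each image are left out, the class name ImageLinkExtractor is made up, and the pattern assumes the images appear in plain src="..." attributes with double quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls matching image paths out of HTML that was already downloaded.
class ImageLinkExtractor {

    // Captures the value of src="..." when it ends in sf200<digits>.gif.
    private static final Pattern IMAGE_SRC =
            Pattern.compile("src=\"([^\"]*sf200\\d+\\.gif)\"");

    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = IMAGE_SRC.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1)); // group 1 is the attribute value without the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<img src=\"/comics/sf20060515.gif\"><img src=\"/logo.png\">";
        // Each extracted path would then be resolved against the page URL and downloaded.
        extract(html).forEach(System.out::println);
    }
}
```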

I know my suggestion is falling on deaf ears, and you are going to do it the way you want to do it, but it's just a suggestion.

sophisticatedda at 2007-7-13 22:55:41
# 10

> Is wget written in Java? If I have to port weird code

> I could just as well write it from

> scratch. Plus wget gets its links from HTML files. I

> don't think it actually generates

> anything.

Either wget or curl has this option, but both are C applications.

mlka at 2007-7-13 22:55:41
# 11

> I think you aren't understanding me.

I don't think you really understand what the OP is attempting to do.

Let's say you want all of Jeff's goat image library.

It is spread over 200 folders:

http://www.jverd.com/goat/0/

http://www.jverd.com/goat/1/

..

http://www.jverd.com/goat/200/

Each folder has 200 images in it, all named XXX.jpg, e.g.

http://www.jverd.com/goat/0/001.jpg

...

http://www.jverd.com/goat/200/001.jpg

...

http://www.jverd.com/goat/200/200.jpg

And each folder has 20 HTML index pages.

That's 4000 HTML index pages.

The OP wants to bypass downloading the index pages with:

http://www.jverd.com/goat/[0-200]/[000-200].jpg
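Note that [0-200] here is a numeric range, not a regex character class. curl's URL globbing uses the same bracket range notation, including zero-padded ranges like [000-200]. In Java it could be expanded along these lines. This is a sketch under my own assumptions: NumericRangeExpander is a made-up name, and padding is applied only when the lower bound is written with a leading zero.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: expands numeric ranges such as [0-200] or [000-200] in a URL template.
class NumericRangeExpander {

    private static final Pattern RANGE = Pattern.compile("\\[(\\d+)-(\\d+)\\]");

    static List<String> expand(String template) {
        List<String> out = new ArrayList<>();
        Matcher matcher = RANGE.matcher(template);
        if (!matcher.find()) {
            out.add(template); // no range left: the template is already a concrete URL
            return out;
        }
        String head = template.substring(0, matcher.start());
        String lower = matcher.group(1);
        int from = Integer.parseInt(lower);
        int to = Integer.parseInt(matcher.group(2));
        // Assumption: a lower bound with a leading zero (e.g. "000") asks for zero padding.
        boolean padded = lower.length() > 1 && lower.startsWith("0");
        // Expand the remaining ranges once, then combine them with each number of this range.
        List<String> tails = expand(template.substring(matcher.end()));
        for (int n = from; n <= to; n++) {
            String number = padded
                    ? String.format(Locale.ROOT, "%0" + lower.length() + "d", n)
                    : Integer.toString(n);
            for (String tail : tails) {
                out.add(head + number + tail);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> urls = expand("http://www.jverd.com/goat/[0-200]/[000-200].jpg");
        System.out.println(urls.size() + " URLs, first: " + urls.get(0));
    }
}
```

As written, the template includes folder 0 and image 000, so it yields 201 × 201 = 40401 URLs rather than 200 × 200.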

mlka at 2007-7-13 22:55:42
# 12
> I'm writing a small web spider where the user can

> supply a URL with some regular expressions in there.

That's a pretty anti-social kind of program, you realise? Consider the load on the server.
malcolmmca at 2007-7-13 22:55:42
# 13

@sophisticatedd

That will work, but I won't be able to download the files that aren't linked to, and I would have to download the entire website and check each HTML file to find all the links to the images. I might decide to build something like that later on, but not right now.

@mlk

I read the cURL and wget manpages. I can't find anything related to regular expressions or sentence generation in cURL. wget has a spidering option called --mirror and a -A to match file extensions but I think it still only downloads those things it finds links to.

Thanks for the suggestions

Rivala at 2007-7-13 22:55:42
# 14

> I read the cURL and wget manpages. I can't find

> anything related to regular expressions or sentence

> generation in cURL. wget has a spidering option

> called --mirror and a -A to match file extensions but

> I think it still only downloads those things it finds

> links to.

Mm, one of the command based web toys (links or lynx or something else) has this option.

mlka at 2007-7-13 22:55:42
# 15
You might want to consider something like JoBo (http://www.matuschek.net/software/jobo/)...

~
yawmarka at 2007-7-21 1:45:15
# 16

@malcolmmc

Yes, it is. I'm going to build in some be-nice-to-servers functionality eventually. Of course, programs such as Teleport Pro, wget, Intellitamper, etc. are all just as rude. I don't think this program will suddenly make things fall apart.
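One simple form that be-nice-to-servers functionality could take is a minimum delay between requests to the same host. The sketch below is only an illustration: PoliteThrottle and its method names are invented here, and the one-second interval is an arbitrary example value, not a recommendation from any standard.

```java
// Hypothetical helper: enforces a minimum pause between consecutive requests.
class PoliteThrottle {

    private final long minIntervalMillis;
    private boolean started = false;
    private long lastRequestMillis;

    PoliteThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // How long the caller should still wait at time 'nowMillis' before sending the next request.
    synchronized long millisToWait(long nowMillis) {
        if (!started) {
            return 0; // nothing sent yet, so no need to wait
        }
        long elapsed = nowMillis - lastRequestMillis;
        return Math.max(0, minIntervalMillis - elapsed);
    }

    // Records that a request was sent at time 'nowMillis'.
    synchronized void markRequest(long nowMillis) {
        started = true;
        lastRequestMillis = nowMillis;
    }

    // Convenience wrapper: sleeps if needed, then records the request using the system clock.
    void acquire() throws InterruptedException {
        long wait = millisToWait(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
        markRequest(System.currentTimeMillis());
    }

    public static void main(String[] args) throws InterruptedException {
        PoliteThrottle throttle = new PoliteThrottle(1000); // example value: one request per second
        for (int i = 0; i < 3; i++) {
            throttle.acquire();
            System.out.println("request " + i + " would be sent now");
        }
    }
}
```

The download loop would call acquire() before each request.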

@yawmark

Thanks for the link. This is definitely worth looking at. The tarball doesn't want to download, but I'll get it later. I don't think JoBo does what I want, but at the very least it will save me writing some protocol handlers and parsers.

Rivala at 2007-7-21 1:45:15
# 17
FYI: Using that type of program functionality on some webservers may cause its users to be banned. On my webserver I would treat such brute-force attempts to download every possible picture filename as a DoS attack.
EvolvedAnta at 2007-7-21 1:45:15
# 18

> Of course programs such as Teleport Pro, wget,

> Intellitamper, etc. are all just as rude. I don't

> think this program will suddenly make things fall

> apart

Pointing to the bad behaviour of others to excuse one's own bad behaviour isn't exactly the highest of ethical principles, I don't think.

DrClapa at 2007-7-21 1:45:15