sentence generation
Hi,
I'm looking for a library / piece of code that can generate all possible sentences that conform to a given (simplified) regular expression.
i.e.:
Foo[abc]Bar[0-9]
would generate:
FooaBar0
FooaBar1
...
FooaBar8
FooaBar9
FoobBar0
....
FoocBar8
FoocBar9
Yes, this can get very large very fast, but that's OK. Is there anything out there?
thanks
Jurgen Voorneveld
By Rivala at 2007-10-2 20:13:34
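For the simplified syntax in the example above (literal text plus bracketed character classes such as [abc] or [0-9]), a generator can be sketched in a few lines of Java. This is only a minimal sketch under stated assumptions: the class name ClassExpander and the method expand are made up for illustration, and nothing beyond literals, single characters and a-z style ranges inside brackets is handled (no escapes, negation, alternation or repetition).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an existing library.
class ClassExpander {

    /** Returns every string matched by a pattern made of literals and [...] classes. */
    static List<String> expand(String pattern) {
        List<String> results = new ArrayList<>();
        results.add("");
        int i = 0;
        while (i < pattern.length()) {
            List<String> choices = new ArrayList<>();
            char c = pattern.charAt(i);
            if (c == '[') {
                int close = pattern.indexOf(']', i);
                if (close < 0) {
                    throw new IllegalArgumentException("Unclosed '[' at index " + i);
                }
                String body = pattern.substring(i + 1, close);
                for (int j = 0; j < body.length(); j++) {
                    char start = body.charAt(j);
                    if (j + 2 < body.length() && body.charAt(j + 1) == '-') {
                        // Character range such as a-z or 0-9 (inclusive).
                        char end = body.charAt(j + 2);
                        for (char ch = start; ch <= end; ch++) {
                            choices.add(String.valueOf(ch));
                        }
                        j += 2;
                    } else {
                        choices.add(String.valueOf(start));
                    }
                }
                i = close + 1;
            } else {
                // A literal character has exactly one choice.
                choices.add(String.valueOf(c));
                i++;
            }
            // Cartesian product of everything built so far with the new choices.
            List<String> next = new ArrayList<>(results.size() * choices.size());
            for (String prefix : results) {
                for (String choice : choices) {
                    next.add(prefix + choice);
                }
            }
            results = next;
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> all = expand("Foo[abc]Bar[0-9]");
        System.out.println(all.size() + " strings, from " + all.get(0)
                + " to " + all.get(all.size() - 1));
    }
}
```

For Foo[abc]Bar[0-9] this yields 30 strings, FooaBar0 through FoocBar9, in the order listed in the question. Note that under real regex rules a class like [0-12] means the characters 0, 1 and 2, not the numbers 0 to 12, so numeric ranges need their own syntax.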

> I'm looking for a library / piece of code that can
> generate all possible sentences that conform to a
> given (simplified) regular expression.
Yikes.
> Yes, this can get very large very fast, but that's OK.
I say again, yikes.
> Is there anything out there?
Not to my (limited) knowledge; frankly I don't see how such a thing is even practical. What is your specific goal (brute-force password *******, etc.)? Perhaps some folks can help you explore alternate solutions...
~
I'm writing a small web spider where the user can supply a URL with some regular expressions in there. The program generates all the possible real URLs and downloads the files it finds. This is useful in situations where you know that for example filenames of pictures on a website conform to a simple naming rule. This way you can download pictures without having to download the HTML that normally comes with it.
http://www.sinfest.net/comics/sf20060515.gif
==>
http://www.sinfest.net/comics/sf200[0-6][0-12][0-31].gif
Rivala at 2007-7-13 22:55:41 >

That's a backwards way of doing it. Why not search for all
src=".*sf200[0-6][0-12][0-31].gif"
and then just download those?
A better regex:
src="(.*?sf200\d*\.(?:gif|jpg|jpeg|png))"
Message was edited by: sophisticatedd
> why not search for all
Because the OP is not searching HTML files for links, but leeching. wget can do this. Look at the source for that?
OP, if you are sticking with a very small subset of regex (i.e. just [xyz] and [0-10] stuff) then I don't think it would be too hard to write yourself.
Or maybe use better-defined data structures, such as:
http://www.example.com/pictures/goat_dairy_<date format="yyyymmdd" from="2006/01/01" to="2006/05/15">.jpg
and
http://www.example.com/pictures/donkey_<integer format="00" from="0" to="10">.jpg
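The date placeholder in the first template could be expanded roughly like this. It is a sketch only, assuming Java 8 or later for java.time; the class DateUrlGenerator and the method urlsForDates are hypothetical names, and parsing the <date ...> tag itself is left out.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the <date from=... to=...> idea.
class DateUrlGenerator {

    /** Builds prefix + yyyyMMdd + suffix for every calendar day from 'from' to 'to', inclusive. */
    static List<String> urlsForDates(String prefix, String suffix, LocalDate from, LocalDate to) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyyMMdd");
        List<String> urls = new ArrayList<>();
        for (LocalDate day = from; !day.isAfter(to); day = day.plusDays(1)) {
            urls.add(prefix + day.format(fmt) + suffix);
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> urls = urlsForDates(
                "http://www.example.com/pictures/goat_dairy_", ".jpg",
                LocalDate.of(2006, 1, 1), LocalDate.of(2006, 5, 15));
        System.out.println(urls.size() + " URLs, first: " + urls.get(0));
    }
}
```

Because the loop walks real calendar days, the range 2006/01/01 to 2006/05/15 produces 135 URLs and never an impossible date such as 20060231, which a purely digit-based pattern would also try.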
mlka at 2007-7-13 22:55:41 >

Ahh yes, I saw "spider" and, well...
So what is the point of avoiding downloading the HTML?
> So what is the point of avoiding downloading the html?
Why download the HTML files? You know the format of the images (it's <date>.jpg).
mlka at 2007-7-13 22:55:41 >

I think that's silly. Suppose the page only has 5 images on it, and you feed the "spider" goats[0-6][0-12][0-31].gif: you are going to make ~2912 connections (7 * 13 * 32 candidate names) just to download 5 images. It just seems unnecessary; it is a bit more efficient to just download the page.
> why not search for all src=".*sf200[0-6][0-12][0-31].gif"
I don't want to search, I want to download. Getting a list and then having to click on those
links is annoying. The regex was just an example, not meant to be optimal.
> wget can do this. Look at the source for that?
Is wget written in Java? If I have to port weird code I could just as well write it from
scratch. Plus wget gets its links from HTML files. I don't think it actually generates
anything.
> Why download the HTML files? You know the format of the images (its <date>.jpg).
And when the picture isn't actually linked to from any HTML file, this method will work as well.
> Just seems unnecessary, a bit more efficient to just download the page.
Things change when you want to download 3000 images. I just want to give this option to the user;
the user can decide whether it's useful in individual cases.
Rivala at 2007-7-13 22:55:41 >

I think you aren't understanding me.
You aren't clicking on links.
make the regex pattern
download the html
match on the pattern
loop over all matches and download those images
I know my suggestion is falling on deaf ears, and you are going to do it the way you want to do it, but it's just a suggestion.
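The match-and-loop steps listed above can be sketched with java.util.regex. This is an illustration, not anyone's actual spider: the class ImageLinkExtractor and the method extract are hypothetical, the pattern is a simplified variant of the one suggested earlier in the thread, and the HTML is assumed to be already downloaded into a String (fetching the page and the images is left out).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds image links of the form sf200<digits>.<ext> in HTML text.
class ImageLinkExtractor {

    // Group 1 captures the value of the src attribute.
    private static final Pattern IMG_SRC = Pattern.compile(
            "src=\"([^\"]*sf200\\d+\\.(?:gif|jpg|jpeg|png))\"",
            Pattern.CASE_INSENSITIVE);

    /** Returns every matching src value, in document order. */
    static List<String> extract(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample markup made up for this example.
        String html = "<img src=\"/comics/sf20060515.gif\">"
                + "<img src=\"/img/logo.png\">"
                + "<img src=\"/comics/sf20060516.gif\">";
        for (String src : extract(html)) {
            System.out.println(src); // a real spider would download each one here
        }
    }
}
```

This approach only ever finds files that some page links to, which is the trade-off being argued about in this thread.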
> Is wget written in Java? If I have to port weird code
> I could just as well write it from
> scratch. Plus wget gets its links from HTML files. I
> don't think it actually generates
> anything.
Either wget, or curl has this option, but they are both C/C++ applications.
mlka at 2007-7-13 22:55:41 >

> I think you aren't understanding me.
I don't think you really understand what the OP is attempting to do.
Let's say you want all of Jeff's goat image library.
It is spread over 200 folders:
http://www.jverd.com/goat/0/
http://www.jverd.com/goat/1/
..
http://www.jverd.com/goat/200/
Each folder has 200 images in it, all named XXX.jpg, i.e.
http://www.jverd.com/goat/0/001.jpg
...
http://www.jverd.com/goat/200/001.jpg
...
http://www.jverd.com/goat/200/200.jpg
And each folder has 20 HTML index pages.
That's 4000 HTML index pages.
The OP wants to bypass downloading the index pages with:
http://www.jverd.com/goat/[0-200]/[000-200].jpg
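A pattern like this uses numeric ranges rather than regex character classes, so it needs its own small expander. The following is a rough sketch under assumptions of my own: the class RangeExpander and method expand are hypothetical, [lo-hi] is read as an inclusive integer range, and the number of digits in the lower bound is taken as the zero-padding width (so [000-200] gives 000, 001, ... 200).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for templates containing [lo-hi] numeric ranges.
class RangeExpander {

    private static final Pattern RANGE = Pattern.compile("\\[(\\d+)-(\\d+)\\]");

    /** Expands every [lo-hi] range in the template; the leftmost range varies slowest. */
    static List<String> expand(String template) {
        Matcher m = RANGE.matcher(template);
        if (!m.find()) {
            return Collections.singletonList(template);
        }
        int lo = Integer.parseInt(m.group(1));
        int hi = Integer.parseInt(m.group(2));
        int width = m.group(1).length(); // assumption: padding follows the lower bound
        String head = template.substring(0, m.start());
        // Expand the remainder once and reuse it for every value of this range.
        List<String> tails = expand(template.substring(m.end()));
        List<String> out = new ArrayList<>();
        for (int n = lo; n <= hi; n++) {
            String number = String.format(Locale.ROOT, "%0" + width + "d", n);
            for (String tail : tails) {
                out.add(head + number + tail);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> urls = expand("http://www.jverd.com/goat/[0-200]/[000-200].jpg");
        System.out.println(urls.size() + " URLs, first: " + urls.get(0));
    }
}
```

With this reading the goat pattern expands to 201 * 201 = 40401 candidate URLs, and sf200[0-6][0-12][0-31].gif expands to 7 * 13 * 32 = 2912, which is where the connection count quoted earlier in the thread comes from.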
mlka at 2007-7-13 22:55:42 >

> I'm writing a small web spider where the user can
> supply a URL with some regular expressions in there.
That's a pretty anti-social kind of program, you realise? Consider the load on the server.
@sophisticatedd
That will work but I will not be able to download the files that aren't linked to and I will have to download the entire website and check each html file to find all the links to the images. I might decide to build something like that later on but not right now.
@mlk
I read the cURL and wget manpages. I can't find anything related to regular expressions or sentence generation in cURL. wget has a spidering option called --mirror and a -A to match file extensions but I think it still only downloads those things it finds links to.
Thanks for the suggestions
Rivala at 2007-7-13 22:55:42 >

> I read the cURL and wget manpages. I can't find
> anything related to regular expressions or sentence
> generation in cURL. wget has a spidering option
> called --mirror and a -A to match file extensions but
> I think it still only downloads those things it finds
> links to.
Mm, one of the command based web toys (links or lynx or something else) has this option.
mlka at 2007-7-13 22:55:42 >

You might want to consider something like JoBo (http://www.matuschek.net/software/jobo/)...
~
@malcolmmc
Yes, it is. I'm going to build in some be-nice-to-servers functionality eventually. Of course, programs such as Teleport Pro, wget, Intellitamper, etc. are all just as rude. I don't think this program will suddenly make things fall apart.
@yawmark
Thanks for the link. This is definitely worth looking at. The tarball doesn't want to download, but I'll get it later. I don't think JoBo does what I want, but at the very least it will save me writing some protocol handlers and parsers.
FYI: Using that type of program functionality on some web servers may cause its users to be banned. On my web server I would regard such brute-force attempts to download every possible picture filename as a DoS attack.
> Of course programs such as Teleport Pro, wget,
> Intellitamper, etc. are all just as rude. I don't
> think this program will suddenly make things fall
> apart
Pointing to the bad behaviour of others to excuse one's own bad behaviour isn't exactly the highest of ethical principles, I don't think.