Please help me with a regular expression!

Hi All

I have an HTML file and I would like to remove the script tags and their contents (only the contents of these tags should be removed).

Here is my code, but it does not work as I want: the HTML file is the same before and after (Test.html and Test2.html are still identical). Please help me correct it!

import java.util.regex.*;
import java.io.*;

public class Remover {
    public static void main(String[] args) throws Exception {
        File fin = new File("Test.html");
        File fout = new File("Test2.html");

        // Open an input and an output stream
        FileInputStream fis = new FileInputStream(fin);
        FileOutputStream fos = new FileOutputStream(fout);
        BufferedReader in = new BufferedReader(new InputStreamReader(fis));
        BufferedWriter out = new BufferedWriter(new OutputStreamWriter(fos));

        String regex = "<script[^>]*>(.*?)</script>";
        int flags = Pattern.MULTILINE | Pattern.DOTALL | Pattern.CASE_INSENSITIVE;
        Pattern p = Pattern.compile(regex.toString(), flags);
        Matcher m = p.matcher("\\n");

        String aLine = null;
        while ((aLine = in.readLine()) != null) {
            m.reset(aLine);
            String result = m.replaceAll("");
            out.write(result);
            out.newLine();
        }

        in.close();
        out.close();
    }
}

Thanks in advance,

Regards,

PS: I am not good at English, so sorry for my bad English!

[2456 byte] By [ecard104a] at [2007-11-27 1:24:18]
# 1
Try "\\<script\\>.*\\</script\\>". This works for a single line that contains both tags. If your expression covers multiple lines, you need to accommodate that.
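A minimal sketch of what accommodating multiple lines can look like, assuming the opening and closing tags may sit on different lines. The class name DotAllDemo, its method names and the sample strings are made up for illustration. By default `.` does not match line terminators, so the pattern only matches across lines when Pattern.DOTALL is set. The lazy `.*?` is a swapped-in choice, so that two script elements in the same text are removed separately rather than together with everything between them.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original post.
class DotAllDemo {
    // Same idea as the pattern above, but with a lazy quantifier.
    static final String REGEX = "<script>.*?</script>";

    // Without DOTALL, '.' stops at line terminators.
    static String stripSingleLine(String html) {
        return Pattern.compile(REGEX).matcher(html).replaceAll("");
    }

    // With DOTALL, '.' also matches '\n', so the element may span lines.
    static String stripMultiLine(String html) {
        return Pattern.compile(REGEX, Pattern.DOTALL).matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String oneLine = "a<script>x</script>b";
        String twoLines = "a<script>\nx\n</script>b";
        System.out.println(stripSingleLine(oneLine));                   // ab
        System.out.println(stripSingleLine(twoLines).equals(twoLines)); // true (nothing removed)
        System.out.println(stripMultiLine(twoLines));                   // ab
    }
}
```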
ChuckBinga at 2007-7-12 0:15:16 > top of Java-index,Java Essentials,Java Programming...
# 2
The regex is okay; the problem is that you're reading and processing the text one line at a time. Inline script elements usually span several lines. Try reading the whole file into a string and processing it that way. Assuming the file isn't too big, of course.
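A minimal sketch of that suggestion, assuming the file is small enough to hold in memory. The names WholeFileDemo, readAll and stripScripts are hypothetical, and a StringReader stands in for the FileReader on Test.html so the example runs on its own. The regex is the one from the original post, with the flags written inline as (?is).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of the original post.
class WholeFileDemo {
    // Reads everything from the reader into one String, line separators included.
    static String readAll(Reader in) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // (?i) = CASE_INSENSITIVE, (?s) = DOTALL; applied once to the whole text.
    static String stripScripts(String html) {
        return html.replaceAll("(?is)<script[^>]*>(.*?)</script>", "");
    }

    public static void main(String[] args) {
        // In the real program this would be a reader on "Test.html".
        Reader in = new StringReader("<head>\n<script type=\"x\">\nf();\n</script>\n</head>\n");
        System.out.print(stripScripts(readAll(in)));
    }
}
```

Because the text is read in blocks rather than with readLine(), the original line breaks survive untouched.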
uncle_alicea at 2007-7-12 0:15:16
# 3
Thanks, ChuckBing and uncle_alice. Please tell me more, because it still does not work. How do I correct the code above? What do I need? What method can I add for this? Please!!! Thank you for your help.
Message was edited by: ecard104
ecard104a at 2007-7-12 0:15:16
# 4
Anyone? Someone? Please!!!
ecard104a at 2007-7-12 0:15:16
# 5
uncle_alice gave you a suggested solution, "...Try reading the whole file into a string and processing it that way..." Did you do that?
ChuckBinga at 2007-7-12 0:15:16
# 6

I have edited the code: instead of reading and processing the text one line at a time, I now read each character and append it to a StringBuffer!

import java.util.regex.*;
import java.io.*;

public class Remover {
    public static void main(String[] args) throws Exception {
        File fin = new File("Test.html");
        File fout = new File("Test2.html");
        FileInputStream fis = new FileInputStream(fin);
        FileOutputStream fos = new FileOutputStream(fout);
        BufferedReader in = new BufferedReader(new InputStreamReader(fis));
        BufferedWriter out = new BufferedWriter(new OutputStreamWriter(fos));

        String regex = ".*?(<script>.*?</script>).*?";
        int flags = Pattern.MULTILINE | Pattern.DOTALL | Pattern.CASE_INSENSITIVE;
        Pattern p = Pattern.compile(regex.toString(), flags);
        Matcher m = p.matcher("");

        int ch;
        StringBuffer sb = new StringBuffer();
        while ((ch = in.read()) != -1) {
            sb.append((char) ch);
            String result = m.replaceAll("");
            out.write(result);
            //out.newLine();
        }

        in.close();
        out.close();
    }
}

Now Test2.html is an empty file; all the data has been deleted.

Message was edited by:

ecard104

ecard104a at 2007-7-12 0:15:16
# 7
while ((ch = in.read()) != -1)
Not sure read() will ever return -1.
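For what it's worth, the documentation for Reader.read() says it returns -1 once the end of the stream has been reached, so a loop of this shape does terminate. A minimal sketch, with a StringReader standing in for the file; the class name ReadUntilEofDemo and the method drain are made up for illustration. Note that the buffer is only used after the loop has finished, not on every character:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of the original post.
class ReadUntilEofDemo {
    // Appends characters one at a time until read() signals end of stream with -1.
    static String drain(Reader in) {
        StringBuilder sb = new StringBuilder();
        try {
            int ch;
            while ((ch = in.read()) != -1) {
                sb.append((char) ch);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The loop ends after the last character; the text is processed once, afterwards.
        String text = drain(new StringReader("a<script>x</script>b"));
        System.out.println(text.replaceAll("(?is)<script>.*?</script>", "")); // ab
    }
}
```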
calvino_inda at 2007-7-12 0:15:16
# 8
So please help me to correct it. Thanks for listening, my friends.
Regards, ecard104
ecard104a at 2007-7-12 0:15:16
# 9

Try this. I simplified a number of things, as well.

import java.util.regex.*;
import java.io.*;

public class Remover
{
    public static void main(String[] args) throws Exception
    {
        File fin = new File("Test.html");
        File fout = new File("Test2.html");
        BufferedReader in = new BufferedReader(new FileReader(fin));
        BufferedWriter out = new BufferedWriter(new FileWriter(fout));

        String str;
        StringBuffer sb = new StringBuffer();
        while ((str = in.readLine()) != null)
        {
            sb.append(str);
        }
        in.close();

        String regex = "<script>.*?</script>";
        String result = sb.toString().replaceAll(regex, "");
        out.write(result);
        out.close();
    }
}

ChuckBinga at 2007-7-12 0:15:16
# 10

Hi ChuckBing,

It still does not work as I want!

This is the Test.html code:

<html>
<head>
<title> title for script </title>
<script type="javascript"> function a () {} </script>
</head>
<body>
<h1>Hello World</h1>
</div> aaaa</div>
</body>
</html>

When I run your code, here is Test2.html:

<html> <head> <title> title for script </title> <script type="javascript"> function a () {} </script></head><body> <h1>Hello World</h1> </div> aaaa</div></body></html>

The script tag and its content are still not deleted, and all the tags are on one line!

I would like Test2.html to look like this:

<html>
<head>
//script tag and contents on it will be deleted
</head>
<body>
<h1>Hello World</h1>
</div> aaaa</div>
</body>
</html>

Thank you for helping me.

Regards,

Message was edited by:

ecard104

ecard104a at 2007-7-12 0:15:16
# 11

@Chuck, BufferedReader.readLine() drops the line separator, so you need to add it back in. Also, the regex still needs to have the DOTALL and CASE_INSENSITIVE flags set (for the real world, that is; it would work fine as-is on the sample text). Here I used an alternate means to set the flags.

@OP, the replaceAll() regex only needs to match the exact text you want to replace; by adding the .* to either end, you're telling it to replace everything.
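A small sketch of that difference, using a made-up one-line sample and the hypothetical class name ExactMatchDemo. With the leading `.*?` in place, each match also swallows all the text in front of the script element, so that text is deleted along with it:

```java
// Hypothetical demo class; not part of the original post.
class ExactMatchDemo {
    // Matches only the script element itself.
    static String stripExact(String html) {
        return html.replaceAll("(?is)<script>.*?</script>", "");
    }

    // The OP's wrapped pattern: the leading .*? consumes everything before the element.
    static String stripWrapped(String html) {
        return html.replaceAll("(?is).*?(<script>.*?</script>).*?", "");
    }

    public static void main(String[] args) {
        String html = "<p>a</p><script>x</script><p>b</p>";
        System.out.println(stripExact(html));   // <p>a</p><p>b</p>
        System.out.println(stripWrapped(html)); // <p>b</p>
    }
}
```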

import java.util.regex.*;
import java.io.*;

public class Remover
{
    public static void main(String[] args) throws Exception
    {
        // Wrap each FileReader/FileWriter once; a FileReader cannot wrap another Reader.
        BufferedReader in = new BufferedReader(new FileReader("Test.html"));
        BufferedWriter out = new BufferedWriter(new FileWriter("Test2.html"));

        String line = null;
        String sep = System.getProperty("line.separator");
        StringBuilder sb = new StringBuilder();
        while ((line = in.readLine()) != null)
        {
            // readLine() strips the separator, so put it back.
            sb.append(line).append(sep);
        }
        in.close();

        // (?is) sets CASE_INSENSITIVE and DOTALL inline; [^>]* allows attributes
        // such as type="javascript", as in the regex from the original post.
        String regex = "(?is)<script[^>]*>.*?</script>";
        String result = sb.toString().replaceAll(regex, "");
        out.write(result);
        out.close();
    }
}

uncle_alicea at 2007-7-12 0:15:16
# 12
@uncle_alice: The better definition of requirements and test data eliminated the "shooting in the dark" I did ;)
ChuckBinga at 2007-7-12 0:15:17
# 13

All too often, taking a shot in the dark or two is the only way to learn what the requirements are. I mean, if you can describe the problem clearly and succinctly, you're over halfway to solving it yourself (that's especially true when you're dealing with regexes). A big part of what we do here is teaching people to use the jargon correctly. Broken English we can handle; incomplete or inaccurate requirements we can't.

uncle_alicea at 2007-7-12 0:15:17
# 14

Thanks, all.

I have corrected my code and it now succeeds in deleting the script tag.

Here is the regex that matches the entire script tag:

String regex = "<script.*></script>";

With it, the deletion worked fine.

Here is the code:

import java.util.regex.*;
import java.io.*;

public class Remover
{
    public static void main(String[] args) throws Exception
    {
        File fin = new File("Test.html");
        File fout = new File("Test2.html");
        BufferedReader in = new BufferedReader(new FileReader(fin));
        BufferedWriter out = new BufferedWriter(new FileWriter(fout));

        String str;
        StringBuffer sb = new StringBuffer();
        while ((str = in.readLine()) != null)
        {
            sb.append(str);
        }
        in.close();

        String regex = "<script.*></script>";
        String result = sb.toString().replaceAll(regex, "");
        out.write(result);
        out.close();
    }
}

But the new problem is that the readLine method of BufferedReader leaves out the line separator, so the Test2.html file comes out in this format:

<html> <head> <title> title for script </title></head><body> <h1>Hello World</h1> </div> aaaa</div></body></html>

I would like to keep the line breaks, so I have edited the code:

int i = 0;
while ((str = in.readLine()) != null) {
    if (i > 0) sb.append("\n");
    sb.append(str);
    i++;
}

But now the Test2.html file comes out the same as Test.html again.

I don't understand what is happening, so I am confused. Maybe I have asked too much, sorry!
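One possible explanation, sketched under the assumption that the closing script tag in the real file sits on a different line from the opening tag: once "\n" is kept between lines, `.` no longer reaches across them unless DOTALL is set, so "<script.*></script>" stops matching. The class LineSeparatorDemo, its methods and the sample lines are hypothetical stand-ins, not the real Test.html.

```java
// Hypothetical demo class; not part of the original post.
class LineSeparatorDemo {
    // Stand-in for the lines returned by readLine(); not the real Test.html.
    static final String[] LINES = {"<head>", "<script type=\"x\">", "</script>", "</head>"};

    // Joins the lines the same way as the loop above, with sep between them.
    static String join(String[] lines, String sep) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) sb.append(sep);
            sb.append(lines[i]);
        }
        return sb.toString();
    }

    // The regex from this post: no flags, so '.' cannot match '\n'.
    static String stripOriginal(String html) {
        return html.replaceAll("<script.*></script>", "");
    }

    // Inline DOTALL and CASE_INSENSITIVE, attributes allowed, lazy body.
    static String stripDotAll(String html) {
        return html.replaceAll("(?is)<script[^>]*>.*?</script>", "");
    }

    public static void main(String[] args) {
        System.out.println(stripOriginal(join(LINES, "")));  // <head></head>
        String kept = join(LINES, "\n");
        System.out.println(stripOriginal(kept).equals(kept)); // true (nothing removed)
        System.out.println(stripDotAll(kept));                // <head>, an empty line, </head>
    }
}
```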

Thank you very much.

Message was edited by:

ecard104

ecard104a at 2007-7-12 0:15:17