Please help me with a regular expression!
Hi all,
I have an HTML file and I would like to remove the script tags together with their contents (only the contents of those tags should be removed).
Here is my code, but it does not work as I want: the output file is still exactly the same as the input (Test.html and Test2.html are identical). Please help me correct it!
import java.util.regex.*;
import java.io.*;
public class Remover{
public static void main(String[] args)
throws Exception{
File fin =new File("Test.html");
File fout =new File("Test2.html");
// Open an input and an output stream
FileInputStream fis =
new FileInputStream(fin);
FileOutputStream fos =
new FileOutputStream(fout);
BufferedReader in =new BufferedReader(
new InputStreamReader(fis));
BufferedWriter out =new BufferedWriter(
new OutputStreamWriter(fos));
String regex ="<script[^>]*>(.*?)</script>";
int flags = Pattern.MULTILINE | Pattern.DOTALL| Pattern.CASE_INSENSITIVE;
Pattern p = Pattern.compile(regex.toString(), flags);
Matcher m = p.matcher("\\n");
String aLine =null;
while((aLine = in.readLine()) !=null){
m.reset(aLine);
String result = m.replaceAll("");
out.write(result);
out.newLine();
}
in.close();
out.close();
}
}
Thanks in advance,
Regards,
P.S. I am not good at English, so sorry for my bad English!
[2456 byte] By [ecard104a] at [2007-11-27 1:24:18]

Try "\\<script\\>.*\\</script\\>". This works for a single line that contains both tags. If your expression covers multiple lines, you need to accommodate that.
The regex is okay; the problem is that you're reading and processing the text one line at a time. Inline script elements usually span several lines. Try reading the whole file into a string and processing it that way. Assuming the file isn't too big, of course.
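To illustrate the point about line-at-a-time processing, here is a minimal sketch that compares the two approaches on an in-memory string. The class name LineVsWhole and the sample HTML are made up for the demonstration; the pattern and flags are the ones from the first post (MULTILINE is left out because the pattern contains no ^ or $).

```java
import java.util.regex.Pattern;

final class LineVsWhole {
    // Same pattern as in the original post, with DOTALL and CASE_INSENSITIVE.
    static final Pattern SCRIPT = Pattern.compile(
            "<script[^>]*>(.*?)</script>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Mimics the original loop: every line is matched in isolation,
    // so a script element that spans several lines is never seen whole.
    static String stripPerLine(String html) {
        String[] lines = html.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            lines[i] = SCRIPT.matcher(lines[i]).replaceAll("");
        }
        return String.join("\n", lines);
    }

    // Applies the pattern to the complete text in one go.
    static String stripWhole(String html) {
        return SCRIPT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<head>\n<script type=\"text/javascript\">\n"
                + "function a() {}\n</script>\n</head>";
        // The per-line version leaves the text untouched.
        System.out.println(stripPerLine(html).equals(html));
        // The whole-string version removes the script element.
        System.out.println(stripWhole(html));
    }
}
```

If the script element happens to sit on a single line, both versions remove it, which is why the line-by-line code can look correct on small test files.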
Thanks ChuckBing and uncle_alice. Please tell me more, because it still does not work. How do I correct the code above? What do I need? Which method can I add for this? Please! Thank you for your help.
Message was edited by: ecard104
Anyone? Someone? Please!!!
uncle_alice gave you a suggested solution: "...Try reading the whole file into a string and processing it that way..." Did you do that?
I have edited the code: instead of reading and processing the text one line at a time, I now read each character and append it to a StringBuffer!
import java.util.regex.*;
import java.io.*;
public class Remover {
public static void main(String[] args)throws Exception {
File fin = new File("Test.html");
File fout = new File("Test2.html");
FileInputStream fis = new FileInputStream(fin);
FileOutputStream fos = new FileOutputStream(fout);
BufferedReader in = new BufferedReader(new InputStreamReader(fis));
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(fos));
String regex = ".*?(<script>.*?</script>).*?";
int flags = Pattern.MULTILINE | Pattern.DOTALL| Pattern.CASE_INSENSITIVE;
Pattern p = Pattern.compile(regex.toString(), flags);
Matcher m = p.matcher("");
int ch;
StringBuffer sb = new StringBuffer();
while ((ch = in.read())!= -1) {
sb.append((char)ch);
String result = m.replaceAll("");
out.write(result);
//out.newLine();
}
in.close();
out.close();
}
}
Test2.html is an empty file; all the data has been deleted.
Message was edited by:
ecard104
while ((ch = in.read())!= -1)
Not sure read() will ever return -1.
So please help me to correct it. Thanks for listening, my friends.
Regards,
ecard104
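A note on the empty output, with a small sketch. Reader.read() does return -1 at the end of the stream, so the read loop itself terminates. The empty file comes from the matcher: it is created over "" and is never given the buffer, so replaceAll("") keeps returning the empty string. The class EmptyOutputDemo and its method names are invented for this illustration, and a plain script pattern is used so that only the matcher issue is in play.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmptyOutputDemo {
    static final Pattern SCRIPT = Pattern.compile(
            "<script[^>]*>.*?</script>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Reads everything from the reader; read() returns -1 at end of stream.
    static String readAll(Reader in) {
        StringBuilder sb = new StringBuilder();
        try {
            int ch;
            while ((ch = in.read()) != -1) {
                sb.append((char) ch);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Mirrors the posted code: the matcher only ever sees "",
    // so the text argument is deliberately ignored here.
    static String stripWithStaleMatcher(String text) {
        Matcher m = SCRIPT.matcher("");
        return m.replaceAll("");
    }

    // Hands the collected text to the matcher before replacing.
    static String stripWithReset(String text) {
        Matcher m = SCRIPT.matcher("");
        m.reset(text);
        return m.replaceAll("");
    }

    public static void main(String[] args) {
        String text = readAll(new StringReader(
                "<p>a</p><script>x()</script><p>b</p>"));
        System.out.println(stripWithStaleMatcher(text).isEmpty()); // true
        System.out.println(stripWithReset(text));
    }
}
```

In the posted loop the replacement should also happen once, after the loop has finished filling the buffer, rather than once per character.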
Try this. I simplified a number of things, as well.
import java.util.regex.*;
import java.io.*;
public class Remover
{
public static void main(String[] args)throws Exception
{
File fin = new File("Test.html");
File fout = new File("Test2.html");
BufferedReader in = new BufferedReader(new FileReader(fin));
BufferedWriter out = new BufferedWriter(new FileWriter(fout));
String str;
StringBuffer sb = new StringBuffer();
while ((str = in.readLine())!= null)
{
sb.append(str);
}
in.close();
String regex = "<script>.*?</script>";
String result = sb.toString().replaceAll(regex, "");
out.write(result);
out.close();
}
}
Hi ChuckBing,
It still does not work as I want!
This is the Test.html code:
<html>
<head>
<title> title for script </title>
<script type="javascript"> function a () {} </script>
</head>
<body>
<h1>Hello World</h1>
</div> aaaa</div>
</body>
</html>
When I run your code, here is Test2.html:
<html> <head> <title> title for script </title> <script type="javascript"> function a () {} </script></head><body> <h1>Hello World</h1> </div> aaaa</div></body></html>
The script tag and its contents are still not deleted, and all the tags are on one line!
I would like Test2.html to look like the following:
<html>
<head>
//script tag and contents on it will be deleted
</head>
<body>
<h1>Hello World</h1>
</div> aaaa</div>
</body>
</html>
Thank you for helping me.
Regards,
Message was edited by:
ecard104
@Chuck, BufferedReader.readLine() drops the line separator, so you need to add it back in. Also, the regex still needs to have the DOTALL and CASE_INSENSITIVE flags set (for the real world, that is; it would work fine as-is on the sample text). Here I used an alternate means to set the flags.
@OP, the replaceAll() regex only needs to match the exact text you want to replace; by adding the .* to either end, you're telling it to replace everything.
import java.util.regex.*;
import java.io.*;
public class Remover
{
public static void main(String[] args)throws Exception
{
FileReader fin = new FileReader("Test.html");
FileWriter fout = new FileWriter("Test2.html");
// fin and fout are already a Reader and a Writer, so wrap them directly
BufferedReader in = new BufferedReader(fin);
BufferedWriter out = new BufferedWriter(fout);
String line = null;
String sep = System.getProperty("line.separator");
StringBuilder sb = new StringBuilder();
while ((line = in.readLine())!= null)
{
sb.append(line).append(sep);
}
in.close();
// [^>]* also accepts attributes such as type="javascript" in the opening tag
String regex = "(?is)<script[^>]*>.*?</script>";
String result = sb.toString().replaceAll(regex, "");
out.write(result);
out.close();
}
}
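To make the remark about the extra .* on either end concrete, here is a small sketch that runs a wrapped and an exact pattern over the same text. The class name ExactMatchDemo, the constant names and the sample HTML are assumptions made for the demonstration; both patterns use [^>]* so that an opening tag with attributes is matched as well.

```java
final class ExactMatchDemo {
    // Pattern wrapped in lazy .*? on both sides, as in the second attempt.
    static final String WRAPPED = "(?is).*?(<script[^>]*>.*?</script>).*?";
    // Pattern that matches only the script element itself.
    static final String EXACT = "(?is)<script[^>]*>.*?</script>";

    static String strip(String html, String regex) {
        return html.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String html = "<title>t</title>\n"
                + "<script type=\"text/javascript\">a()</script>\n"
                + "<h1>Hello</h1>";
        // The leading .*? swallows the text in front of the script element,
        // so the title is deleted together with the script.
        System.out.println(strip(html, WRAPPED));
        // Only the script element is removed; the title survives.
        System.out.println(strip(html, EXACT));
    }
}
```

With the wrapped pattern, each match runs from the end of the previous match through the closing script tag, so everything in front of the last script element disappears with it.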
@uncle_alice: The better definition of requirements and test data eliminated the "shooting in the dark" I did ;)
All too often, taking a shot in the dark or two is the only way to learn what the requirements are. I mean, if you can describe the problem clearly and succinctly, you're over half way to solving it yourself (that's especially true when you're dealing with regexes). A big part of what we do here is teaching people to use the jargon correctly. Broken English we can handle; incomplete or inaccurate requirements we can't.
Thanks all,
I have corrected my code and it succeeded in deleting the script tag.
Here is the regex that matches the entire script tag:
String regex = "<script.*></script>";
So it worked fine for the deletion.
Here is the code:
import java.util.regex.*;
import java.io.*;
public class Remover
{
public static void main(String[] args)throws Exception
{
File fin = new File("Test.html");
File fout = new File("Test2.html");
BufferedReader in = new BufferedReader(new FileReader(fin));
BufferedWriter out = new BufferedWriter(new FileWriter(fout));
String str;
StringBuffer sb = new StringBuffer();
while ((str = in.readLine())!= null)
{
sb.append(str);
}
in.close();
String regex = "<script.*></script>";
String result = sb.toString().replaceAll(regex,"");
out.write(result);
out.close();
}
}
But the new problem is that the readLine method of BufferedReader leaves out the line separator, so the Test2.html file has the following format:
<html> <head> <title> title for script </title></head><body> <h1>Hello World</h1> </div> aaaa</div></body></html>
I would like to keep the line breaks, so I have edited the code:
int i=0;
while ((str = in.readLine())!= null) {
if (i > 0) sb.append("\n");
sb.append(str);
i++;
}
But now Test2.html comes out the same as Test.html again.
I don't understand what is happening, so I am confused. Maybe I have asked too much, sorry!
Thank you very much.
Message was edited by:
ecard104
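A likely explanation, assuming the script element in the real file spans more than one line: without the DOTALL flag, the dot does not match line terminators, so once the "\n" characters are back in the buffer a pattern built on .* can no longer reach from the opening tag to the closing one. The (?is) prefix suggested earlier in the thread turns on DOTALL and CASE_INSENSITIVE. Below is a minimal sketch of the whole job; the class name ScriptRemover, the command-line arguments and the UTF-8 encoding are assumptions, and Files.readAllBytes needs Java 7 or later.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Pattern;

final class ScriptRemover {
    // (?i) ignores case, (?s) lets '.' match line terminators as well.
    static final Pattern SCRIPT =
            Pattern.compile("(?is)<script[^>]*>.*?</script>");

    static String removeScripts(String html) {
        return SCRIPT.matcher(html).replaceAll("");
    }

    // Same pattern without (?s), shown only for comparison.
    static String removeScriptsNoDotall(String html) {
        return html.replaceAll("(?i)<script[^>]*>.*?</script>", "");
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            // Reading the raw bytes keeps the original line separators,
            // so nothing has to be added back in afterwards.
            Path in = Paths.get(args[0]);
            Path out = Paths.get(args[1]);
            String html = new String(Files.readAllBytes(in),
                    StandardCharsets.UTF_8); // encoding is an assumption
            Files.write(out,
                    removeScripts(html).getBytes(StandardCharsets.UTF_8));
        } else {
            String sample = "<head>\n<script>\na()\n</script>\n</head>";
            // Without DOTALL the multi-line script is left in place.
            System.out.println(removeScriptsNoDotall(sample).equals(sample));
            System.out.println(removeScripts(sample));
        }
    }
}
```

It could be run as: java ScriptRemover Test.html Test2.html. A regex is fine for simple pages like the sample; for arbitrary HTML a real parser would be the more robust choice.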
