Generating HTML links from plain text

How can I scan a text (a String) and add links when a word is a URL?

I tried the following which works but it's too slow:

String text = rs.getString("text");
String[] text2scan = text.split("[\n ]");

for (int i = 0; i < text2scan.length; i++) {
    String replacement = null;
    String word = text2scan[i];
    log.info(word);

    if (word.indexOf(".com") != -1) {
        replacement = "<a href=\"" + word + "\">" + word + "</a>";
        text = text.replaceAll(word, replacement);
        text = text.replaceAll("www", "http://www");
        text = text.replaceAll("http://http://", "http://");
        text = text.replaceAll(">http://", ">");
    }
}

Isn't this already implemented somewhere?

Thanks

Niklas

[1290 byte] By [Niklasa] at [2007-11-26 15:50:12]
# 1

You have four calls to String.replaceAll in there, nested in a loop that runs once for each word (you split on spaces as well as newlines).

Every call to replaceAll allocates a new String.

Also, you are calling replaceAll on "text" rather than on the individual word from "text2scan". Each pass therefore rescans the whole text, including the parts you have already "fixed", and the subsequent calls have to fix them up again.
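
To make that concrete, here is a minimal sketch of the effect. It is a simplification of the original loop, not a copy of it: ReplaceAllPitfall, wrapAll and count are made-up names, and only the first of the four replaceAll calls is reproduced. It also shows a second pitfall: replaceAll treats its first argument as a regular expression, so an unquoted dot in a word matches any character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original post.
class ReplaceAllPitfall {

    // Wraps every occurrence of word in text in an anchor tag,
    // roughly what the loop body does for one token.
    static String wrapAll(String text, String word) {
        String replacement = "<a href=\"" + word + "\">" + word + "</a>";
        // Pattern.quote keeps the dots in the word literal;
        // quoteReplacement protects '$' and '\' in the replacement.
        return text.replaceAll(Pattern.quote(word), Matcher.quoteReplacement(replacement));
    }

    // Counts non-overlapping occurrences of needle in haystack.
    static int count(String haystack, String needle) {
        int n = 0;
        for (int i = haystack.indexOf(needle); i >= 0;
                i = haystack.indexOf(needle, i + needle.length())) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "see www.a.com or www.a.com";
        String once = wrapAll(text, "www.a.com");   // first token: both occurrences are wrapped
        String twice = wrapAll(once, "www.a.com");  // same token again: the links get nested
        System.out.println(count(once, "<a href"));   // 2
        System.out.println(count(twice, "<a href"));  // 6
        // An unquoted '.' matches any character:
        System.out.println("wwwXaYcom".replaceAll("www.a.com", "hit"));  // hit
    }
}
```

Because the same URL appears as two tokens, the second pass wraps the text inside the links created by the first pass, which is the kind of damage the later replaceAll calls then have to undo.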

Can you have more than one URL on a line? How does your code handle that?

One thing that might speed it up is to use a StringBuffer rather than a String for your manipulations.

Probably the best approach is to use regular expressions for the pattern matching and replacement. That needs only one pass over the text, rather than four passes for every word.

Check out the java.util.regex package.

Cheers,

evnafets

evnafetsa at 2007-7-8 22:09:52 > top of Java-index,Enterprise & Remote Computing,Web Tier APIs...
# 2

Thank you evnafets. The text does not contain many URLs; replacements are likely to be made only once or twice for the entire text. That gives me some ideas for optimizing it. I'll use a regular expression instead. I'll also create fewer String objects by declaring word and replacement outside the loop.

Maybe the logging of each word also makes it slow; I will remove it.

The text is split at both spaces and newlines, so in theory it can handle more than one URL on a line. I haven't tested it with more than one URL yet.

Niklasa at 2007-7-8 22:09:52
# 3

Because I was semi-interested in this issue and how best to solve it, here's a quick and dirty example program.

It does a very basic job, but should be pretty much equivalent to what you had originally.

package mypackage;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexpMakeHyperlinks {

    /**
     * Loads a file into a StringBuffer
     * @param f File to read
     * @return StringBuffer with contents of text file.
     * @throws IOException
     */
    public StringBuffer loadFile(File f) throws IOException {
        System.out.println("Reading file " + f.getAbsolutePath());
        BufferedReader in = new BufferedReader(new FileReader(f));
        StringWriter sw = new StringWriter();
        PrintWriter pw = new PrintWriter(new BufferedWriter(sw));
        String line = null;
        while ((line = in.readLine()) != null) {
            pw.println(line);
        }
        pw.flush();
        in.close();
        return sw.getBuffer();
    }

    /**
     * Writes the contents of a StringBuffer to file
     * @param f File to write to
     * @param sb Text to write to file
     * @throws IOException
     */
    public void writeToFile(File f, StringBuffer sb) throws IOException {
        System.out.println("Outputting to " + f.getAbsolutePath());
        BufferedReader in = new BufferedReader(new StringReader(sb.toString()));
        PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(f)));
        String line = null;
        while ((line = in.readLine()) != null) {
            out.println(line);
        }
        // close() flushes first and also releases the file handle
        out.close();
        in.close();
    }

    /**
     * Converts all urls like "www.google.com" into hyperlinks in the text.
     *
     * @param initialText The text to convert
     * @return Converted text.
     */
    public StringBuffer convertText(StringBuffer initialText) {
        StringBuffer result = new StringBuffer(initialText.length());
        // the dot before "com" is escaped so that it only matches a literal "."
        Pattern p = Pattern.compile("(href=\")?(http://)?www\\..*?\\.com\\b");
        Matcher m = p.matcher(initialText);
        while (m.find()) {
            String href = m.group();
            // ignore links that are already hyperlinks
            if (href.startsWith("href")) {
                continue;
            }
            // add on the http:// if necessary
            if (!href.startsWith("http://")) {
                href = "http://" + href;
            }
            // add the new text into the output; quoteReplacement keeps any
            // '$' or '\' in the matched text from being treated specially
            m.appendReplacement(result,
                    Matcher.quoteReplacement("<a href=\"" + href + "\">" + href + "</a>"));
        }
        m.appendTail(result);
        return result;
    }

    /**
     * Testing method for experimenting with the pattern matching
     * @param text
     */
    public void matchLink(String text) {
        System.out.println("Analyzing text " + text);
        // the dot before "com" is escaped so that it only matches a literal "."
        Pattern p = Pattern.compile("(http://)?www\\..*?\\.com\\b");
        Matcher m = p.matcher(text);
        while (m.find()) {
            System.out.println("Found: " + m.group());
        }
    }

    public static void main(String[] args) {
        RegexpMakeHyperlinks hyper = new RegexpMakeHyperlinks();
        String[] testStrings = {
            "www.google.com",
            "www.myspace.net",
            "make sure you visit www.yahoo.com for details",
            "Two sites: http://www.java.com and http://www.javabeans.com"
        };
        for (int i = 0; i < testStrings.length; i++) {
            hyper.matchLink(testStrings[i]);
        }
        System.out.println();
        System.out.println("-");
        System.out.println();

        File inFile = new File("makeHyperlinkTest.txt");
        File outFile = new File("makeHyperlinkTest.output");
        try {
            StringBuffer sb = hyper.loadFile(inFile);
            StringBuffer result = hyper.convertText(sb);
            hyper.writeToFile(outFile, result);
        }
        catch (Exception e) {
            System.out.println("Error " + e.getMessage());
            e.printStackTrace(System.out);
        }
    }

}

And a test file, makeHyperlinkTest.txt:

This file just contains plain text with hyperlinks to places like www.google.com and http://www.yahoo.com.

The program I have written will convert these links like www.google.com into hyperlinks for displaying as html.

Make sure you visit http://www.java.sun.com and read all about Java.

And if I already have an <a href="www.google.com">Hyperlink</a> in this file it should leave it alone right?

evnafetsa at 2007-7-8 22:09:52
# 4

Splendid. Reg exp is more difficult than I thought.

How can I also match and replace Strings like

http://www.java.sun.com/forum/index.html

http://www.google.com.br/google

www.google.com.mx

http://example.spaces.msn.com/example/index.html

www.mypage.cjb.net/something

members.geocities.com/name

Thanks

Niklas

Niklasa at 2007-7-8 22:09:52
# 5

It comes down to finding a better pattern to match with.

This one matches the examples you have given (it is starting to get complicated now).

However, it will probably also match some things you don't want it to:

Pattern p = Pattern.compile("(href=\")?(http://)?[A-Za-z]+(\\.\\w+)+(/[&?\\+\\%/\\.\\w]+)?");

Basically it is looking for:

- letters (e.g. www, forum, members)

- followed by one or more groups of a dot plus word characters, i.e. letters, digits or underscores (e.g. .java.sun.com, .google.com)

- optionally followed by a path starting with /. It assumes there is no whitespace in the URL, and it allows &, ?, + and % (as well as /, . and word characters) in the path.
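
As a quick check of the description above, the following sketch runs the pattern against the example URLs from the question. UrlPatternCheck and matchesWhole are illustrative names, not part of the thread; the pattern string itself is copied unchanged.

```java
import java.util.regex.Pattern;

// Hypothetical helper for trying the pattern out; not part of the original post.
class UrlPatternCheck {

    // The pattern from the post, unchanged.
    static final Pattern URL = Pattern.compile(
            "(href=\")?(http://)?[A-Za-z]+(\\.\\w+)+(/[&?\\+\\%/\\.\\w]+)?");

    // True if the whole candidate string is a single match.
    static boolean matchesWhole(String candidate) {
        return URL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] candidates = {
            "http://www.java.sun.com/forum/index.html",
            "http://www.google.com.br/google",
            "www.google.com.mx",
            "http://example.spaces.msn.com/example/index.html",
            "www.mypage.cjb.net/something",
            "members.geocities.com/name",
            "readme.txt"   // not a URL, but it has the same letters-dot-letters shape
        };
        for (String c : candidates) {
            System.out.println(matchesWhole(c) + "  " + c);
        }
    }
}
```

Every line prints true, including the one for readme.txt, which is an example of the unwanted matches mentioned above.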

It is probably still overly simplistic. Take a look around on the net. You will find a bunch of resources for regular expressions and detecting/validating strings.

Hope this helps,

evnafets

evnafetsa at 2007-7-8 22:09:52
# 6

Thank you very much. It works very well. How can we also achieve link generation when there are parameters, for instance

http://www.example.com.br/mycontext/view?id=398&parameter2=300

Again it is the regular expression that needs to change.

Can I achieve this by adding the = sign to the character class for the path, like this?

Pattern p = Pattern.compile("(href=\")?(http://)?[A-Za-z]+(\\.\\w+)+(/[&=?\\+\\%/\\.\\w]+)?");

Best regards

Niklas

Niklasa at 2007-7-8 22:09:52
# 7
Did you try it? It worked for me.
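
One way to try it, as a minimal sketch: run both versions of the pattern over the example URL and compare what each one finds. QueryStringCheck and firstMatch are illustrative names; the two pattern strings are copied from the posts above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name; the two patterns are copied from the posts above.
class QueryStringCheck {

    static final Pattern WITHOUT_EQUALS = Pattern.compile(
            "(href=\")?(http://)?[A-Za-z]+(\\.\\w+)+(/[&?\\+\\%/\\.\\w]+)?");
    static final Pattern WITH_EQUALS = Pattern.compile(
            "(href=\")?(http://)?[A-Za-z]+(\\.\\w+)+(/[&=?\\+\\%/\\.\\w]+)?");

    // Returns the first match of the pattern in the text, or null if there is none.
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String url = "http://www.example.com.br/mycontext/view?id=398&parameter2=300";
        // Without '=' in the path class the match stops just before the first '='.
        System.out.println(firstMatch(WITHOUT_EQUALS, url));
        // With '=' added, the whole URL is matched.
        System.out.println(firstMatch(WITH_EQUALS, url));
    }
}
```

The first line printed is the URL cut off after "view?id"; the second is the complete URL.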
evnafetsa at 2007-7-8 22:09:52
# 8

Yes it worked. Now there is only one more thing:

Email addresses now get their domain converted to a hyperlink.

How can I change the reg exp so that domains that are part of an email address don't get converted? This seems quite hard since the @ sign isn't part of the match.

Thanks

Niklas

Niklasa at 2007-7-8 22:09:52
# 9

The quick and dirty solution is to make the @ sign the first thing in the match, and then ignore anything that starts with @.

Pattern p = Pattern.compile("(@)?(href=\")?(http://)?[A-Za-z]+(\\.\\w+)+(/[&=?\\+\\%/\\.\\w]+)?");

// inside the while (m.find()) loop: ignore matches that are part of an email address
if (href.startsWith("@")) {
    continue;
}
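
Put together with the earlier conversion loop, the idea might look like the sketch below. EmailAwareLinker and linkify are made-up names, and the method works on a String instead of a StringBuffer for brevity; the pattern is the one given above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern from this post.
class EmailAwareLinker {

    static final Pattern URL = Pattern.compile(
            "(@)?(href=\")?(http://)?[A-Za-z]+(\\.\\w+)+(/[&=?\\+\\%/\\.\\w]+)?");

    static String linkify(String text) {
        StringBuffer result = new StringBuffer(text.length());
        Matcher m = URL.matcher(text);
        while (m.find()) {
            String href = m.group();
            // Skip email domains and existing href attributes. The skipped text is
            // copied unchanged by the next appendReplacement or by appendTail.
            if (href.startsWith("@") || href.startsWith("href")) {
                continue;
            }
            // add on the http:// if necessary
            if (!href.startsWith("http://")) {
                href = "http://" + href;
            }
            String link = "<a href=\"" + href + "\">" + href + "</a>";
            m.appendReplacement(result, Matcher.quoteReplacement(link));
        }
        m.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(linkify("mail niklas@example.com or visit www.example.com"));
        // A dotted local part is still picked up as if it were a host name:
        System.out.println(linkify("first.last@example.com"));
    }
}
```

The first call leaves niklas@example.com alone and links only www.example.com. The second call shows a limit of the quick and dirty approach: in first.last@example.com the part before the @ has the shape letters-dot-letters, so it is turned into a link to http://first.last.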

evnafetsa at 2007-7-8 22:09:52
# 10
Works perfectly and fast too. Thank you evnafets you are brilliant.
Niklasa at 2007-7-8 22:09:52
# 11

How can I also make it work for URLs that have a dash e.g.

http://www.trabalhoson-line.com/artigos.htm

I tried this but it doesn't work:

Pattern p = Pattern.compile("(@)?(href=\")?(http://)?[A-Za-z\\-]+(\\.\\w+)+(/[&\\n=?\\+\\%/\\.\\w]+)?");

We also want to allow for & and other common chars in the domain name and query string. How can this be achieved?

Many thanks

Niklasa at 2007-7-8 22:09:52
# 12

The following change worked for me:

Pattern p = Pattern.compile("(@)?(href=\")?(http://)?[a-zA-Z_0-9\\-]+(\\.\\w[a-zA-Z_0-9\\-]+)+(/[&\\n=?\\+\\%/\\.\\w]+)?");
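
A probable reason the earlier attempt failed: it added the dash only to the first label, while in www.trabalhoson-line.com the dash sits in the second label, which was still matched by \w+. The sketch below lists what each pattern finds in the example URL. DashCheck and allMatches are illustrative names; both pattern strings are copied from the posts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class; compares the earlier attempt with the revised pattern.
class DashCheck {

    static final Pattern ATTEMPT = Pattern.compile(
            "(@)?(href=\")?(http://)?[A-Za-z\\-]+(\\.\\w+)+(/[&\\n=?\\+\\%/\\.\\w]+)?");
    static final Pattern REVISED = Pattern.compile(
            "(@)?(href=\")?(http://)?[a-zA-Z_0-9\\-]+(\\.\\w[a-zA-Z_0-9\\-]+)+(/[&\\n=?\\+\\%/\\.\\w]+)?");

    // Collects every match of the pattern in the text, in order.
    static List<String> allMatches(Pattern pattern, String text) {
        List<String> found = new ArrayList<String>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String url = "http://www.trabalhoson-line.com/artigos.htm";
        // The attempt splits the URL at the dash into two separate matches.
        System.out.println(allMatches(ATTEMPT, url));
        // The revised pattern matches the URL as a whole.
        System.out.println(allMatches(REVISED, url));
    }
}
```

The attempt yields two matches, "http://www.trabalhoson" and "-line.com/artigos.htm", while the revised pattern yields the complete URL. Note that both patterns keep \n inside the path character class, so a path can continue across a line break; that may not be intended.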

I haven't tested it with all possible URLs, though, and I'd still like to hear your suggestions for improvements: there is probably a URL that this doesn't match, and there are probably matches that aren't URLs, for instance www.abc.123. That could be solved by checking against all possible top-level domains, but then I would have to update the regular expression whenever new top-level domains are added.
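
One way to avoid rewriting the regular expression for every new top-level domain, sketched under assumptions: keep the pattern loose and check the last label of each match against a separate allow-list. The class name TldFilter, the helper hasKnownTld and the five-entry list are all hypothetical; a real list would be much longer and would probably be loaded from configuration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-filter applied to each candidate found by the URL pattern.
class TldFilter {

    // Deliberately tiny, made-up allow-list for illustration only.
    static final Set<String> KNOWN_TLDS =
            new HashSet<String>(Arrays.asList("com", "net", "org", "br", "mx"));

    // Optional "http://" followed by the host: dot-separated labels of
    // letters, digits, underscores or dashes. Group 1 is the host.
    static final Pattern HOST = Pattern.compile(
            "(?:http://)?([a-zA-Z_0-9\\-]+(?:\\.[a-zA-Z_0-9\\-]+)+)");

    // True if the candidate starts with a host whose last label is in the allow-list.
    static boolean hasKnownTld(String candidate) {
        Matcher m = HOST.matcher(candidate);
        if (!m.lookingAt()) {
            return false;
        }
        String host = m.group(1);
        String tld = host.substring(host.lastIndexOf('.') + 1).toLowerCase(Locale.ROOT);
        return KNOWN_TLDS.contains(tld);
    }

    public static void main(String[] args) {
        System.out.println(hasKnownTld("www.abc.123"));
        System.out.println(hasKnownTld("www.google.com.mx"));
        System.out.println(hasKnownTld("http://www.trabalhoson-line.com/artigos.htm"));
    }
}
```

In the conversion loop, a match for which hasKnownTld returns false would simply be skipped with continue, the same way the href and @ cases are. With this list, www.abc.123 is rejected and the other two candidates are accepted.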

Niklasa at 2007-7-8 22:09:52