substring matching for lists

Can anyone suggest efficient data structure or algorithm to get the "subset" or "superset" of a string. What I mean by subset/superset is this:

{

"new york",

"new york city",

"new york state",

"york city"

}

superset("new york") ={"new york","new york city","new york state"}// left side is substring of every string on right side

subset("new york city") ={"new york","new york city","york city"}// every string on right side is substring of the left side

subset("new york state") ={"new york","new york state"}

The brute force way is use loop through each string and do string contains comparison but is there a more efficient method especially w/ith large string lists?

[1396 byte] By [javvvaa] at [2007-10-1 23:29:26]
# 1
Google for 'patricia tree', it might be what you want.kind regards,Jos
JosAHa at 2007-7-15 14:12:06 > top of Java-index,Other Topics,Algorithms...
# 2

If your string set is constant you can construct an index.

Given that you have n strings to search through, the string

you are looking for is k characters, and you get m answers,

following code will find the superset by comparing

O(k*m*log(n)) number of characters and the subset by

comparing O(k^2*m*log(n)) number of characters. It is

simple and fast but it demands a lot of memory.

I wrote the code with little care so it is probably full

with bugs.

import java.util.*;

public class Index {

private Row[] _rows = null;

public Index(String[] input) {

List rows = new Vector();

for (int i=0;i<input.length;i++) {

for (int k=0;k<input[i].length();k++) {

rows.add(new Row(input[i], input[i].substring(k)));

}

}

_rows = (Row[])rows.toArray(new Row[0]);

Arrays.sort(_rows);

}

public Set getSuperset(String part) {

int p = Arrays.binarySearch(_rows, new Row(null, part));

TreeSet ts = new TreeSet();

int pt = p;

while (pt>=0 && _rows[pt].startsWith(part)) {

ts.add(_rows[pt].getMain());

pt--;

}

pt = p;

while (pt<_rows.length && _rows[pt].startsWith(part)) {

ts.add(_rows[pt].getMain());

pt++;

}

return ts;

}

public Set getSubset(String part) {

TreeSet ts = new TreeSet();

for (int i=0;i<part.length();i++) {

ts.addAll(getSuperset(part.substring(i)));

}

return ts;

}

public class Row implements Comparable {

public String _main = null;

public String _part = null;

public Row(String main, String part) {

_main = main;

_part = part;

}

public String getMain() {

return _main;

}

public boolean startsWith(String part) {

return _part.startsWith(part);

}

public int compareTo(Object o) {

return _part.compareTo(((Row)o)._part);

}

}

}

>

parza at 2007-7-15 14:12:06 > top of Java-index,Other Topics,Algorithms...
# 3

So if string A is a substring of string B than the subset of B contains A and the superset of A contains B.

A classical algorithm that is used sometime to speed substring searching works like this. You create an integer signature for each string. For example, you could set a bit in the integer to indicate the presence of each individual letter of the alphabet. (Since there are more letters than the 32 bit location, you would actually hash each individual letter to a number between 1 and 32 and set that bit). What the signatures do is let you do a very fast substring rejection. If the integer signature of A is NOT a subset of the signature of B it means that A has character that does not show up in B and thus A could NOT be a subset of B.

Note integer subset testing is very fast. A is subset of B iff (A & B) == A

More typically, this encoding is done not on letters themselves but rather on di-grams. So to encode "new york" you hash "ne", "ew", "w ", " y"...

Now, this structure, only gives you a fast reject. If signature of A really is a subset of signature of B it does NOT tell you that A is a substring of B. You must do the substring comparison to check, but it can reject lots of strings that could not possibly match quickly.

The use of 32 bits for the signature was arbitrary and may not be appropriate. If your strings are long, way more than 32 characters, you would find yourself in general setting most of the bits, and if most of the bits are set you will typically get no rejections. So you must choose the length of your signature to match the characteristics (average length) of the data you expect to be processing.

Thus to find the superset of String A, you would rip through the entire list of strings using the fast signature comparison to do a fast reject and only do the slower substring comparison on the ones where the signatures were compatible.

There is a sense in which this is still just the brute force method, in that you compare every string to every other string, but by preprocessing each string once to create a signature, you can potentially speed up all subsequent substring processing.

BTW I think Jos was pulling your leg about Patricia trees. They are a very efficient method for storing a dictionary of words that allow you to quickly decide if a word is in the dictionary but they are used for matching full strings against full strings, or full strings against prefixes and are not useful for matching substrings. Of course I could be wrong about this. I'd be happy to let Jos teach me a new use for Patricia trees.

marlin314a at 2007-7-15 14:12:07 > top of Java-index,Other Topics,Algorithms...