Union-Closed Families of Sets
Hi, I have written a short program in Java that takes as input a text file containing an arbitrary number of mathematical sets (presumably with the property that none of the sets can be formed by union of any of the other sets, although thats not really important) and finds all possible combinations of unions between the sets and builds a new family of sets that is closed under union (ie, any two sets in the family, when unioned together, will always produce a set that is already in the family)
we can then run certain analysis on the sets and try to draw information from them...we are doing research on Frankl's union-closed conjecture if anybody is interested..
with some of the examples we feed into this algorithm, it can take several days to complete, and memory usage has been getting quite high on some of the larger families (although i am more concerned with cpu time than memory)...is there any common sense changes anybody can suggest to speed this thing up?
here is the code:
/** MathSet.java **/
import java.util.HashSet;
import java.util.Iterator;
public class MathSet extends HashSet {
public final boolean isSuperset(MathSet s) {
//Returns true if this mathematical set is a superset of the specified set.
boolean result = false;
if (this.containsAll(s)) result = true;
return result;
}
public final boolean isSubset(MathSet s) {
//Returns true if this mathematical set is a subset of the specified set.
boolean result = false;
if (s.containsAll(this)) result = true;
return result;
}
public final boolean isDisjoint(MathSet s) {
//Returns true if the specified set has no common elements with this mathematical set.
boolean result = false;
MathSet union = this.union(s);
if (union.isEmpty()) result = true;
return result;
}
public final MathSet union(MathSet s) {
//Returns the union with the specified set.
MathSet union = (MathSet) this.clone();
union.addAll(s);
return union;
}
public final MathSet intersection(MathSet s) {
//Returns the intersection with the specified set.
MathSet intersection = (MathSet) this.clone();
intersection.retainAll(s);
return intersection;
}
public final MathSet difference(MathSet s) {
//Returns the asymmetric difference between this mathematical set and the specified set.
MathSet difference = (MathSet) this.clone();
difference.removeAll(s);
return difference;
}
public final String toString() {
StringBuffer buffer = new StringBuffer();
String string;
buffer.append("{");
Iterator thisIt = this.iterator();
while (thisIt.hasNext()) {
buffer.append(thisIt.next());
if (thisIt.hasNext()) buffer.append(",");
}
buffer.append("}");
string = buffer.toString();
return string;
}
}
/** Family.java **/
import java.util.Iterator;
import java.util.Date;
import java.text.DateFormat;
import java.util.ArrayList;
public final class Family extends MathSet {
public Family() {
}
public final void finish() { //optimized completion algorithm
int added = 1, last = 0, tmpSize = 0, i = 0, j = 0, k = 0; //initialize a few values
MathSet[] members, current; //allocate space for arrays
MathSet buffer = new MathSet(), family = new MathSet(); //instantiate some temporary sets
ArrayList<MathSet> stack = new ArrayList(); //create an array list for our stack
stack.add((MathSet) this.clone()); //put this families current state into the first position on the stack
while (added > 0) { //keep doing this until no unique members are created
added = 0; //reset added counter
family.clear(); //clear temporary family
last = stack.size() - 1; //obtain position of current set
current = new MathSet[stack.get(last).size()]; //create array for members of current set
stack.get(last).toArray(current); //populate array
Date now = new Date(); //get the current time
System.out.println("Family contains " + this.size() + " sets as of " + DateFormat.getTimeInstance().format(now)); //and print it
//union each of the previous sets' members with the current sets members
for (i = (last - 2); i > 0; i--) { //for each previous set
tmpSize = stack.get(i).size(); //obtain size of this set
members = new MathSet[tmpSize]; //create an array for this set
stack.get(i).toArray(members); //populate array
for (j = tmpSize - 1; j > 0; j--) { //for each member in that set
for (k = current.length - 1; k > 0; k--) { //for each member in the current set
buffer = members.union(current[k]); //union these two members
if (this.add(buffer)) { //add to this family
family.add(buffer); //if this was a unique element, add it to the next set
added++; //and increment the counter to trigger another iteration of the loop
}
}
}
}
//union each element of the member set with every other member
tmpSize = stack.get(last).size(); //obtain size of the last set
for (i = tmpSize - 1; i > 0; i--) //for each member in this set
for (j = i; j < tmpSize; j++) { //and for each member after it
buffer = current.union(current[j]); //union these two members
if (this.add(buffer)) { //add to the family
family.add(buffer); //if this was a unique element, add it to the next set
added++; //and increment the counter to trigger another iteration of the loop
}
}
stack.add((MathSet) family.clone()); //add newly generated family to end to stack
}
Date now = new Date(); //get the current time
System.out.println("Family completed with " + this.size() + " sets as of " + DateFormat.getTimeInstance().format(now)); //and print it
} //end finish()
public final int weight(Character ch) {
//Returns the weight of the family with respect to the element specified.
int sum = 0;
Iterator<MathSet> thisIt = this.iterator();
while (thisIt.hasNext()) {
MathSet member = thisIt.next();
if (member.contains(ch)) sum++;
}
return sum;
}
}
Message was edited by:
h3nry
[6457 byte] By [
h3nrya] at [2007-10-2 20:27:15]

test:System.out.println();
public final class Family extends MathSet
{
public Family() {
}
public final void finish() { //optimized completion algorithm
//initialize a few values
int added = 1, last = 0, tmpSize = 0, i = 0, j = 0, k = 0;
MathSet[] members, current; //allocate space for arrays
//instantiate some temporary sets
MathSet buffer = new MathSet(), family = new MathSet();
//create an array list for our stack
ArrayList<MathSet> stack = new ArrayList();
//put this families current state into the first position on the stack
stack.add((MathSet) this.clone());
while (added > 0) { //keep doing this until no unique members are created
added = 0; //reset added counter
family.clear(); //clear temporary family
last = stack.size() - 1; //obtain position of current set
//create array for members of current set
current = new MathSet[stack.get(last).size()];
stack.get(last).toArray(current); //populate array
Date now = new Date(); //get the current time
System.out.println("Family contains " + this.size()
+ " sets as of " + DateFormat.getTimeInstance().format(now)); //and print it
// union each of the previous sets' members with the current sets members
for (i = (last - 2); i > 0; i--) { //for each previous set
tmpSize = stack.get(i).size(); //obtain size of this set
members = new MathSet[tmpSize]; //create an array for this set
stack.get(i).toArray(members); //populate array
for (j = tmpSize - 1; j > 0; j--) { //for each member in that set
for (k = current.length - 1; k > 0; k--) { //for each member in the current set
buffer = members.union(current[k]); //union these two members
if (this.add(buffer)) { //add to this family
family.add(buffer); //if this was a unique element, add it to the next set
added++; //and increment the counter to trigger another iteration of the loop
}
}
}
}
// union each element of the member set with every other member
tmpSize = stack.get(last).size(); //obtain size of the last set
for (i = tmpSize - 1; i > 0; i--) //for each member in this set
for (j = i; j < tmpSize; j++) { //and for each member after it
buffer = current.union(current[j]); //union these two members
if (this.add(buffer)) { //add to the family
family.add(buffer); //if this was a unique element, add it to the next set
added++; //and increment the counter to trigger another iteration of the loop
}
}
stack.add((MathSet) family.clone()); //add newly generated family to end to stack
}
Date now = new Date(); //get the current time
System.out.println("Family completed with " + this.size() + " sets as of "
+ DateFormat.getTimeInstance().format(now)); //and print it
} //end finish()
public final int weight(Character ch) {
//Returns the weight of the family with respect to the element specified.
int sum = 0;
Iterator<MathSet> thisIt = this.iterator();
while (thisIt.hasNext()) {
MathSet member = thisIt.next();
if (member.contains(ch)) sum++;
}
return sum;
}
}
Quick quesiton: What are the possible elements of the sets? Chars? What is the range?
thank you for taking an interest
we are currently using Character objects because primitives can't be used in a HashSet
we don't want a limitation on the number of unique elements, however realistically our underlying universe will likely never contain more than 70-80 unique elements.
this code is very ugly..but this is the program we are using to read in, complete, and output a family
import java.io.*;
import java.util.StringTokenizer;
public class Driver {
final static Character a = new Character('a'), b = new Character('b'), c = new Character('c');
final static Character d = new Character('d'), e = new Character('e'), f = new Character('f');
final static Character g = new Character('g'), h = new Character('h'), i = new Character('i');
final static Character j = new Character('j'), k = new Character('k'), l = new Character('l');
final static Character m = new Character('m'), n = new Character('n'), o = new Character('o');
final static Character p = new Character('p'), q = new Character('q'), r = new Character('r');
final static Character s = new Character('s'), t = new Character('t'), u = new Character('u');
final static Character v = new Character('v'), w = new Character('w'), x = new Character('x');
final static Character y = new Character('y'), z = new Character('z');
final static Character A = new Character('A'), B = new Character('B'), C = new Character('C');
final static Character D = new Character('D'), E = new Character('E'), F = new Character('F');
final static Character G = new Character('G'), H = new Character('H'), I = new Character('I');
final static Character J = new Character('J'), K = new Character('K'), L = new Character('L');
final static Character M = new Character('M'), N = new Character('N'), O = new Character('O');
final static Character P = new Character('P'), Q = new Character('Q'), R = new Character('R');
final static Character S = new Character('S'), T = new Character('T'), U = new Character('U');
final static Character V = new Character('V'), W = new Character('W'), X = new Character('X');
final static Character Y = new Character('Y'), Z = new Character('Z');
final static Character ch1 = new Character('1'), ch2 = new Character('2'), ch3 = new Character('3');
final static Character ch4 = new Character('4'), ch5 = new Character('5'), ch6 = new Character('6');
final static Character ch7 = new Character('7'), ch8 = new Character('8'), ch9 = new Character('9');
final static Character ch0 = new Character('0');
final static String BREAK = System.getProperty("line.separator");
public static void main(String[] args) throws IOException {
Family family = new Family();
boolean parseText = false;
String inputFile = "input\\" + "in8x2.txt";
String outputFile = "output\\" + "output8x2.txt";
try {
FileReader fro = new FileReader(inputFile);
FileWriter out = new FileWriter(outputFile);
BufferedReader bro = new BufferedReader(fro);
// declare String variable and prime the read
String stringFromFile = bro.readLine();
while(stringFromFile != null) {
StringTokenizer st;
st = new StringTokenizer(stringFromFile,",");
while ( st.hasMoreTokens() ) {
String string = st.nextToken();
if (string.equals("BOF")) {
System.out.println("Found BOF\n");
parseText = true;
}
else if (string.equals("#")) {
System.out.println("\n\nFinished loading generator from input file: "
+ inputFile);
System.out.println("\nCompleting family..."
+ "Be patient, this may take a considerable amount of time.\n");
family.finish();
System.out.println("\n\nDumping output to file: "
+ outputFile);
out.append(BREAK + BREAK + "Completed Family.");
out.append(BREAK + family.toString());
out.append(BREAK + BREAK + "size: " + family.size() + BREAK);
out.append(BREAK + "weight(a): " + family.weight(a) + BREAK);
out.append(BREAK + "sat(a): "
+ (String) new Double((double) family.weight(a)/family.size()).toString() + BREAK);
//System.out.println("\n" + family);
parseText = false;
} else {
MathSet member = new MathSet();
if (parseText) {
for (int i2=0;i2<string.length();i2++) {
Character el = new Character(' ');
char ch = string.charAt(i2);
if (ch=='a') el=a;
else if (ch=='a') el=a;
else if (ch=='b') el=b;
else if (ch=='c') el=c;
else if (ch=='d') el=d;
else if (ch=='e') el=e;
else if (ch=='f') el=f;
else if (ch=='g') el=g;
else if (ch=='h') el=h;
else if (ch=='i') el=i;
else if (ch=='j') el=j;
else if (ch=='k') el=k;
else if (ch=='l') el=l;
else if (ch=='m') el=m;
else if (ch=='n') el=n;
else if (ch=='o') el=o;
else if (ch=='p') el=p;
else if (ch=='q') el=q;
else if (ch=='r') el=r;
else if (ch=='s') el=s;
else if (ch=='t') el=t;
else if (ch=='u') el=u;
else if (ch=='v') el=v;
else if (ch=='w') el=w;
else if (ch=='x') el=x;
else if (ch=='y') el=y;
else if (ch=='z') el=z;
else if (ch=='A') el=A;
else if (ch=='B') el=B;
else if (ch=='C') el=C;
else if (ch=='D') el=D;
else if (ch=='E') el=E;
else if (ch=='F') el=F;
else if (ch=='G') el=G;
else if (ch=='H') el=H;
else if (ch=='I') el=I;
else if (ch=='J') el=J;
else if (ch=='K') el=K;
else if (ch=='L') el=L;
else if (ch=='M') el=M;
else if (ch=='N') el=N;
else if (ch=='O') el=O;
else if (ch=='P') el=P;
else if (ch=='Q') el=Q;
else if (ch=='R') el=R;
else if (ch=='S') el=S;
else if (ch=='T') el=T;
else if (ch=='U') el=U;
else if (ch=='V') el=V;
else if (ch=='W') el=W;
else if (ch=='X') el=X;
else if (ch=='Y') el=Y;
else if (ch=='Z') el=Z;
else if (ch=='0') el=ch0;
else if (ch=='1') el=ch1;
else if (ch=='2') el=ch2;
else if (ch=='3') el=ch3;
else if (ch=='4') el=ch4;
else if (ch=='5') el=ch5;
else if (ch=='6') el=ch6;
else if (ch=='7') el=ch7;
else if (ch=='8') el=ch8;
else if (ch=='9') el=ch9;
member.add(el);
}
System.out.print(stringFromFile + "......");
System.out.println("adding " + member);
family.add(member);
}
}
}
stringFromFile = bro.readLine(); // read next line
}
bro.close();
out.close();
}
catch(FileNotFoundException filenotfoundexception)
{
System.out.println("input.txt does not exist");
}
catch(IOException ioexception)
{
ioexception.printStackTrace();
}
}
}
and here is one of the contents of one of the txt files with an example family that my computer is actually working on as we speak
BOF
abc
abcd,abcdA
abce,abceA
abcf,abcfA
def
defg,defgB
defh,defhB
defi,defiB
ghi
ghij,ghijC
ghik,ghikC
ghil,ghilC
jkl
jklm,jklmD
jkln,jklnD
jklo,jkloD
mno
mnop,mnopE
mnoq,mnoqE
mnor,mnorE
pqr
pqrs,pqrsF
pqrt,pqrtF
pqru,pqruF
stu
stua,stuaG
stub,stubG
stuc,stucG
#>
h3nrya at 2007-7-13 23:10:08 >

give us some examples of the type of initial sets that you are working with.
How big is n, the number of distinct elements in the generating sets that you are starting with? (what is the size of the set you get when you union all the generating sets into one?)
What I am thinking is that if your examples are not too large you could probably pack them into integers and have them run at a reasonable rate.
Are you in hopes that you might just trip over a counter example by some random choice of initial generating sets?
Ah, you have already posted some sample input.Spare me the wading through your parserWhat does the comma mean, what do the blank lines mean, what are the sets?
lol...no, not hoping to randomly stumble over a counterexample
i am working with a research grant for the summer with one of my university professors
he and my brother ( for my brothers honours thesis) came up with a notion of reducibility for union-closed families of sets...there paper is still being edited and hasn't been published anywhere yet or i would refer you to it
in any case, the professor i am working with (and myself, but to a lesser degree than him) believe there is a good chance that by introducing some certain structure to the generators of union closed families we may be able to derive some results that could potentially help us to improve what we currently know about bounds associated with a potential minimal counterexample to the conjecture
we are not so foolish as to think we will stumble accross a counterexample..i myself am of the impression that the conjecture is likely true and no counterexample exists to be found
i included an input file in my previous post...i have about 75 complete matching input/output files if you would like to see more
h3nrya at 2007-7-13 23:10:08 >

each character in an uninterupted string (except comma) is a unique element and each uninterupted string is a member set
a comma denotes a new set, as does a line
so the file above would give a family of sets like this:
{ {a,b,c}, {a,b,c,d}, {a,b,c,d,A}, {a,b,c,e} ,{a,b,c,e,A}, {a,b,c,f}, {a,b,c,f,A} .....}
h3nrya at 2007-7-13 23:10:08 >

might I ask why your input method has two different ways to split between sets?
For the family of sets
{ {a,b,c}, {a,b,c,d}, {a,b,c,d,A}, {a,b,c,e} ,{a,b,c,e,A}, {a,b,c,f}, {a,b,c,f,A} .....}
why is your input not just?
abc
abcd
abcdA
abce
abceA
abcf
abcfA
...
Is their structure in the input that I am not understanding? Part of simplifying the code can be obtained by simplifying the input.
Your driver appears to allow up to 26 + 26 + 10 distinct elements. Do any of your examples actually use that full range?
I have to take off now. Be back in about an hour. I will think about how to simplify your representation to speed your calculations.
the only reason we allow both a comma and a line break as a delimiter is for convenience when creating a set..makes it easier to visualize in some cases
i understand your point about simplifying input, but the input takes less than a second to parse even in the worst of cases, and whether we use all line breaks or both line breaks and comma's the resultant family we will be passing to the finish method will be identical. the finish method can take days to finish for a large example...
we haven't yet tried any examples that need all 62 elements, i just defined all 62 so i wouldn't have to add more as our examples grew, i figure the memory i'm wasting is negligable.
this input..which is also currently running on a different machine uses most of those characters (and, incidentally, is one set per line)
abcdefg
abcdefgh
abcdefgo
abcdefgv
abcdefgC
abcdefgJ
abcdefgQ
hijklmn
hijklmnb
hijklmnp
hijklmnw
hijklmnD
hijklmnK
hijklmnR
opqrstu
opqrstuc
opqrstuj
opqrstux
opqrstuE
opqrstuL
opqrstuS
vwxyzAB
vwxyzABd
vwxyzABk
vwxyzABr
vwxyzABF
vwxyzABM
vwxyzABT
CDEFGHI
CDEFGHIe
CDEFGHIl
CDEFGHIs
CDEFGHIz
CDEFGHIN
CDEFGHIU
JKLMNOP
JKLMNOPf
JKLMNOPm
JKLMNOPt
JKLMNOPA
JKLMNOPH
JKLMNOPV
QRSTUVW
QRSTUVWg
QRSTUVWn
QRSTUVWu
QRSTUVWB
QRSTUVWI
QRSTUVWP
h3nrya at 2007-7-13 23:10:08 >

> is there any common> sense changes anybody can suggest to speed this thing> up?Use code tags when you post code.Use an automated profiler rather than trying to guess at bottlenecks.
i apologise about not using the code button, my bad
as for using a profiler to find bottlenecks, you must not have looked at the problem
i know exactly where the bottle neck is...the question is...can anybody see anything i can change to either:
a) save a few clock cycles per iteration of the loop (cause it can loop billions of times)
b) decrease the number of times it has to loop in order to ensure the output is still a union-closed family
its not a question of "where is the bottleneck" its more of a "i'm more of a mathematician than a programmer, and the code i have come up with isn't all that efficient, but i don't know how to go about speeding it up, can anybody enlighten me?"
clearly you're not going to help..but thankfully there are others who seem like they are willing to help
h3nrya at 2007-7-13 23:10:08 >

> Your driver appears to allow up to 26 + 26 + 10
> distinct elements. Do any of your examples actually
> use that full range?
I think marlin and I are trying to get to the same point. We are asking what the number of distinct elements for a specific reason. For example, lets say the elements were the lowercase ascii letters. In that case there would be 26 distinct elements. An int in Java contains 32 bits. That means you can use a bit for each letter and have some left over. Then not only can you store each set in a single int and save space, your set operations can be done by the hardware:
union = bitwise or: |
intersection = bitwise and: &
disjuntcion = bitwise xor: ^
etc...
You can still do this with the 62 values that you appear to allow using longs. If you have more than that, you can use multiple ints and/or longs.
ok i can see what you are saying..
we do have examples which use more than just 32 unique elements to the universe..however, if i switch to a primitive value such as a long, i can no longer hold them in a HashSet unless i use their wrapper class....will i save much time or do i have to replace the hashset with my own class that will hold primitives in order to see the big speed boost associated with the bitwise comparison?
> ok i can see what you are saying..
> we do have examples which use more than just 32
> unique elements to the universe..however, if i switch
> to a primitive value such as a long, i can no longer
> hold them in a HashSet unless i use their wrapper
What do you need a HashSet for? The entire set will be stored in the primitive.
yes but these primitives need to be stored in some sort of set that doesn't allow duplicates
currently i have a hashset, filled with hashsets, filled with elements
if i switch the sets with elements to primitives, then i will have a hash set containing primitives, which isn't possible..so i would have to wrap them in the Long wrapper class, so would this significantly slow down the bitwise operations?
> > ok i can see what you are saying..
> > we do have examples which use more than just 32
> > unique elements to the universe..however, if i
> switch
> > to a primitive value such as a long, i can no
> longer
> > hold them in a HashSet unless i use their wrapper
>
> What do you need a HashSet for? The entire set will
> be stored in the primitive.
Do you mean to store the set of sets? You can use a primitive collection that the apache commons collections or you could use a custom container for the longs.
> currently i have a hashset, filled with hashsets,
> filled with elements
> if i switch the sets with elements to primitives,
> then i will have a hash set containing primitives,
> which isn't possible..so i would have to wrap them in
> the Long wrapper class, so would this significantly
> slow down the bitwise operations?
You would unwrap before doing the ops and rewrap after. It would add overhead, yes but the increase that I believe you will get from the bitwise ops will still greatly improve the performance.
> Do you mean to store the set of sets? You can use a
> primitive collection that the apache commons
> collections or you could use a custom container for
> the longs.
yes, thats exactly what i meant
what do you mean by "a primitive collection that the apache commons collections"?
it would be nice if there is already a container that i can use which holds primitives and doesn't allow duplicates, and it sounds like you might be referring to one there...
If you look at the apache commons project there is a primitives section that contains primitive collections, however, it only contains Lists. You could still you this, you just would have to write some code to do it.
I would do it with the wrappers first. It that is still not fast enough, then look into optimizing that.
thank you very much!!! both dubwai and marlin314
the more i think about it the more obvious the bitwise solution you suggested seems
i am going to do as you suggested and get it going a wrapper class, and then look into converting the set of sets into a collection of primitives afterward
One more thing. Instead of that crazy set of constants and chained ifs create a method:
long bitFor(char c) {
switch (c):
case 'a':
return 0x01;
case 'b':
return 0x02;
case 'c':
return 0x04;
case 'd':
return 0x08;
case 'e':
return 0x10;
//...
}
Call this with each char and bitwise or the results to build the initial sets. To reverse a set to output:
String toString(long set) {
StringBuffer buffer = new StringBuffer();
if (set & 0x01 == 0x01) buffer.append('a');
//...
return buffer.toString();
}
I would make sure you build these methods at the same time. If I were doing it, I would copy the code from one and modify it to get the other or generate it.
I see that d filled you in about bits while I was gone.
I got interested in the problem and hacked up the bit set method using longs. I did the hashing locally too. There is nothing particularly hard or magic about hashing.
I keep a hash table long[] al, so that I can test whether a set has been added already, and I keep a threaded list of the active elements in the hash table so that I can quickly walk through them when I am closing with a new element.
It takes a few seconds to do each of the two examples that you supplied. The second one, that use lots of letters required that I create a large hash table which exceeded the default heap size on my VM so I was forced to boost it up. I googled somewhere to find out how to do that.
Take a look at the embedded Bit class. That is a better way to deal with the input.
Here is the code:
public static class UnionClosure{
int max = 0x80000; // this must be a power of two
int maxMinus1 = max-1; //this must be one less than power of two; see hash()
int safeHash = max*4/5; // more than 80% load and chains get too long.
long[] al = new long[max]; // hash table for the sets
int[] next = new int[max]; // linked list of sets in al, ordered by creation.
int listRoot = -1;// pointer to first on list
int listLast = -1;// pointer to last on list
int totalSets = 0;// count of sets in hash table
public static class Bit{
static ArrayList allBits = new ArrayList();
static long nextBit = 1;
char name;
long bit;
private Bit(char c){
name = c;
bit = nextBit;
if(nextBit == 0)System.out.println("FAILURE - TOO MANY LETTERS!");
nextBit += nextBit;
}
static Bit getBit(char c){
// if already in bitlist return existing bit
Bit b;
for(int i = 0; i<allBits.size(); i++){
b = (Bit) allBits.get(i);
if(c == b.name) return b;
}
// here only if bit was not found so create new one
b = new Bit(c);
allBits.add(b);
return b;
}
static void constructBits(String input){
// rips through input, creates all bits we need. No sets created.
for(int i = 0; i><input.length(); i++){
char c = input.charAt(i);
if(eltName(c))getBit(c);
}
}
static boolean eltName(char c){ // true if legal elt name a-z,A-Z,0-1
return (c>= 'a' && c <= 'z') ||
(c>= 'A' && c <= 'Z') ||
(c>= '0' && c <= '9');
}
static String toString(long l){
String res = "";
for(int i = 0; i<allBits.size(); i++){
Bit b = (Bit) allBits.get(i);
if((b.bit & l)!= 0) res+=b.name;
}
return res;
}
}// end class Bit
long union(long x, long y){return x|y;}
boolean sameSet(long x, long y){return x == y;}
int hash1(long set){
int a = (int) ((set>>32)^set) ;
int b = ((a>>16) + (17*a)) & maxMinus1;
return b;
}
int hash2(long set){ // used for double hashing to spread the sets.
int a = (int) ((set>>32)^set) ;
int b = (a>>16 ^ (101*a)) ;
return (b|1)& maxMinus1; // note: odd so relatively prime to max
}
void addSet(long set){ // only adds if set was not already in Family
if(totalSets > safeHash)System.out.println("oops"); // hash table too full
int h = hash1(set); // where it should go
int hInc = hash2(set);
int chain = 0; // for stats only
while(al[h] != 0){
chain++; // stats on hash collisions
if(sameSet(al[h],set))return; // no need to add duplicates
h += hInc; // double hash chaining
if(h >= max) h = h % max;
}
// here only if al[h] = 0, so we can add
al[h] = set;
totalSets++;
// instrumentation for viewing additions & chaining stats
//if(totalSets < 100 || totalSets %1000==0){
// System.out.println(totalSets +" adding set "+ Bit.toString(set)+ " chain "+chain);
//}
appendToList(h);
}
void appendToList(int h){
if(listLast == -1){ // this is first insertion
listRoot = h;
listLast = h;
next[h] = -1; // end of the line
} else {
next[listLast] = h;
next[h] = -1;
listLast = h;
}
}
void closeWith(long set){
int beforeCount = totalSets;
addSet(set); // first add the new generator
int stop = listLast; // this will be the one we just added
int p = listRoot;
while(p != stop){
addSet(union(al[p],set));
p = next[p];
}
System.out.println(totalSets + " Closing with "+ Bit.toString(set) + " added "+ (totalSets - beforeCount));
}
void processInput(String input){
long set = 0;
for(int i = 0; i< input.length(); i++){
char c = input.charAt(i);
if(Bit.eltName(c)) {
set |= (Bit.getBit(c)).bit;
} else { // illegal character is end of set
if(set != 0) {
closeWith(set);
set = 0;
}
}
}
if(set != 0) {
closeWith(set);
}
}
void dumpStats(){
System.out.println("Total sets = " + totalSets);
for(int i =0; i< Bit.allBits.size(); i++){
int t = 0; int ts = 0;
Bit b = (Bit) Bit.allBits.get(i);
long ab = b.bit;
int iSet = listRoot;
while(iSet != -1){
if((al[iSet]&ab) != 0){ts++;} // bump total subset count
t++; // bump total count
iSet = next[iSet];
}
System.out.println(b.name + " occured in "+ ts + " sets out of "+ t);
}
}
public static void main(String[] args){
String input =
// /* // second example
"abcdefg," +
"abcdefgh," +
"abcdefgo," +
"abcdefgv," +
"abcdefgC," +
"abcdefgJ," +
"abcdefgQ," +
"hijklmn," +
"hijklmnb," +
"hijklmnp," +
"hijklmnw," +
"hijklmnD," +
"hijklmnK," +
"hijklmnR," +
"opqrstu," +
"opqrstuc," +
"opqrstuj," +
"opqrstux," +
"opqrstuE," +
"opqrstuL," +
"opqrstuS," +
"vwxyzAB," +
"vwxyzABd," +
"vwxyzABk," +
"vwxyzABr," +
"vwxyzABF," +
"vwxyzABM," +
"vwxyzABT," +
"CDEFGHI," +
"CDEFGHIe," +
"CDEFGHIl," +
"CDEFGHIs," +
"CDEFGHIz," +
"CDEFGHIN," +
"CDEFGHIU," +
"JKLMNOP," +
"JKLMNOPf," +
"JKLMNOPm," +
"JKLMNOPt," +
"JKLMNOPA," +
"JKLMNOPH," +
"JKLMNOPV," +
"QRSTUVW," +
"QRSTUVWg," +
"QRSTUVWn," +
"QRSTUVWu," +
"QRSTUVWB," +
"QRSTUVWI," +
"QRSTUVWP";
// */
/*// first example
"abc abcd,abcdA abce,abceA abcf,abcfA," +
"def,defg,defgB,defh,defhB,defi,defiB," +
"ghi,ghij,ghijC,ghik,ghikC,ghil,ghilC," +
"jkl,jklm,jklmD,jkln,jklnD,jklo,jkloD"+
"mno,mnop,mnopE,mnoq,mnoqE,mnor,mnorE," +
"pqr,pqrs,pqrsF,pqrt,pqrtF,pqru,pqruF" +
"stu,stua,stuaG,stub,stubG,stuc,stucG";
*/
// simple set test
// "a b c d e";
UnionClosure uc = new UnionClosure();
Bit.constructBits(input);
uc.processInput(input);
uc.dumpStats();
System.out.println("thas all, folks!");
}
}
Sample Output:
1 Closing with abcdefg added 1
2 Closing with abcdefgh added 1
4 Closing with abcdefgo added 2
8 Closing with abcdefgv added 4
16 Closing with abcdefgC added 8
32 Closing with abcdefgJ added 16
64 Closing with abcdefgQ added 32
97 Closing with hijklmn added 33
98 Closing with bhijklmn added 1
132 Closing with hijklmnp added 34
200 Closing with hijklmnw added 68
336 Closing with hijklmnD added 136
608 Closing with hijklmnK added 272
1152 Closing with hijklmnR added 544
1473 Closing with opqrstu added 321
1506 Closing with copqrstu added 33
1540 Closing with ojpqrstu added 34
1928 Closing with opqrstux added 388
2704 Closing with opqrstuE added 776
4256 Closing with opqrstuL added 1552
7360 Closing with opqrstuS added 3104
8737 Closing with vwxyzAB added 1377
9058 Closing with dvwxyzAB added 321
9412 Closing with vkwxyzAB added 354
9800 Closing with vwrxyzAB added 388
12240 Closing with vwxyzABF added 2440
17120 Closing with vwxyzABM added 4880
26880 Closing with vwxyzABT added 9760
30849 Closing with CDEFGHI added 3969
32226 Closing with eCDEFGHI added 1377
33924 Closing with ClDEFGHI added 1698
35976 Closing with CDsEFGHI added 2052
38416 Closing with CDEzFGHI added 2440
49952 Closing with CDEFGHIN added 11536
73024 Closing with CDEFGHIU added 23072
82177 Closing with JKLMNOP added 9153
86146 Closing with fJKLMNOP added 3969
91492 Closing with JmKLMNOP added 5346
98536 Closing with JKtLMNOP added 7044
107632 Closing with JKLAMNOP added 9096
119168 Closing with JKLMHNOP added 11536
165312 Closing with JKLMNOPV added 46144
183618 Closing with QRSTUVW added 18306
192771 Closing with gQRSTUVW added 9153
205893 Closing with QnRSTUVW added 13122
224361 Closing with QRuSTUVW added 18468
249873 Closing with QRSBTUVW added 25512
284481 Closing with QRSTIUVW added 34608
330625 Closing with QRSTUPVW added 46144
Total sets = 330625
a occured in 165313 sets out of 330625
b occured in 211457 sets out of 330625
c occured in 211457 sets out of 330625
d occured in 211457 sets out of 330625
e occured in 211457 sets out of 330625
f occured in 211457 sets out of 330625
g occured in 211457 sets out of 330625
h occured in 211457 sets out of 330625
o occured in 211457 sets out of 330625
v occured in 211457 sets out of 330625
C occured in 211457 sets out of 330625
J occured in 211457 sets out of 330625
Q occured in 211457 sets out of 330625
i occured in 165313 sets out of 330625
j occured in 211457 sets out of 330625
k occured in 211457 sets out of 330625
l occured in 211457 sets out of 330625
m occured in 211457 sets out of 330625
n occured in 211457 sets out of 330625
p occured in 211457 sets out of 330625
w occured in 211457 sets out of 330625
D occured in 211457 sets out of 330625
K occured in 211457 sets out of 330625
R occured in 211457 sets out of 330625
q occured in 165313 sets out of 330625
r occured in 211457 sets out of 330625
s occured in 211457 sets out of 330625
t occured in 211457 sets out of 330625
u occured in 211457 sets out of 330625
x occured in 211457 sets out of 330625
E occured in 211457 sets out of 330625
L occured in 211457 sets out of 330625
S occured in 211457 sets out of 330625
y occured in 165313 sets out of 330625
z occured in 211457 sets out of 330625
A occured in 211457 sets out of 330625
B occured in 211457 sets out of 330625
F occured in 211457 sets out of 330625
M occured in 211457 sets out of 330625
T occured in 211457 sets out of 330625
G occured in 165313 sets out of 330625
H occured in 211457 sets out of 330625
I occured in 211457 sets out of 330625
N occured in 211457 sets out of 330625
U occured in 211457 sets out of 330625
O occured in 165313 sets out of 330625
P occured in 211457 sets out of 330625
V occured in 211457 sets out of 330625
W occured in 165313 sets out of 330625
thas all, folks!
Once you've optimized the basic algorithm the next big increase could come from parallel processing. Depending on the hardware you are using, you may be able to process two or more branches of permutation at once.
oh wow....thats WAY faster than what I had...no comparison..but when i run larger examples, i am running out of hash table space, and getting a whole lot of "oops"'swhat do i have to do to increase this?
You can increase the value of 'max'. Mind the comment about power of 2.
ah that was easy...is there a reason for the limit that was chosen?
> ah that was easy...is there a reason for the limit
> that was chosen?
Probably not anything specific. The limit merely keeps the size of the hashtable down. You could just set it to Integer.MAX_LENGTH if you don't care about that. I believe that would guarantee no more than 2 values in each hash bucket.
> > ah that was easy...is there a reason for the limit
> > that was chosen?
>
> Probably not anything specific. The limit merely
> keeps the size of the hashtable down. You could just
> set it to Integer.MAX_LENGTH if you don't care about
> that. I believe that would guarantee no more than 2
> values in each hash bucket.
Sorry, that's utter nonsense. Late on Friday and all. You could end up with a lot more than that in each bucket. The question is whether you will ever need so many buckets.
alright...thank you
so, with this algorithm, how much work would it be to get it calculating in parallel
the pc i am using has is a pentium d 2.8 ghz (dual core), and although the speed of this program is just fine...i might as well utilize what i have if its not a lot of extra work...
it'll be a learning experience for me
You need to take portion of the algorithm that can be run in parallel and define it as a Runnable. Then you will need to join the processes at the appropriate places. I'm not sure I have a great feel for your algorithm so you will need to determine what parts (if any) can be done in parallel and where you need to rejoin. Loop bodys are good candidates.
Note that there is overhead to doing this so you need to be able to do a good section of work without having to join in order to make it worthwhile. That is, the longer the threads can run without having to talk to each other, the more efficient the parallel processing.
There are a number of ways to skin this cat. You can start a new thread on each set of work or have long running threads that wait for requests or pull them from a queue for example. If you want to take a crack at it I'm sure people here (including myself) will be happy to help.
One more thing. The other more simple way to do this in parallel is to start two instances and solve two problems at once. If you have lots of problems to solve, this might make more sense.
> One more thing. The other more simple way to do this
> in parallel is to start two instances and solve two
> problems at once. If you have lots of problems to
> solve, this might make more sense.
I must agree. The easiest code to write is NO code. However, if your dual processors share memory, and you memory is full holding an enormous hash table, you may not have room to run two at a time.
To answer one of the earlier questions, the way I chose max was simple. I set it to a power of two and the system crashed with an "out of memory exception" so I cut it in half and tried again until I got a size that would fit.
When I ran a big problem, it hit the "oops" limit of a nearly full hash table. So I went out an looked up how to adjust the maximum heap size in the JVM. I picked some random number, (actually I just copied the boost memory flags that I saw on the site that I googled on how to boost up the memory) pasted it in and discovered that I could now allocate a larger table.
So, max needs to be a power to 2, however to be safe (with the double hash scheme that I used) it can't really be 2^31 it needs to be 2^30 or less. (I use cheap masking for caclulating the mod function and addition could overflow if the table size was 2^31.)
As for what is the optimal value of max, that is easy. It should be as large as possible and no larger. The larger the hash table, the fewer the collisions, the faster the algorithm. The only caveat is that the hash table must be initialized to 0 (done automatically in java) and if you were doing little tiny sets, the time initializing a massive hash table would be essentially wasted. Nothing about the problem suggests to me that you will ever be doing this with small sets.
Personally I would not worry about parallel processing. It basically runs in O(n) where n is the size of the family that is created. Adding a new set to the family is constant and fast. Perhaps you did not post any hard examples, the ones that ran for a couple of days, but the ones you did post took less that 5 seconds each on my modest box.
When a computation takes 5 seconds or less, even if you magically improve it to run in a nano second, you have only saved 5 seconds.
IF you actually wanted to improve it, and work on the math, the real question is why, when you already have some hundred thousand elements in the family (the closure of all first n generators) do you add a new generator and produce only 5 new sets?
Yes the easy answer is that, when you unioned the one new generator with the 100,000 sets, only 5 produced something new. The issue is - if you want to save work - you don't want to waste time unioning the generator with sets that will produce nothing new.
Why will they produce nothing new? What is the underlying structure that makes this calculation waste so much time. Is there any way to order the results so that you can ignore entire portions of the underlying lattice, because you know in advance that you will get nothing by unioning over there? And if there is such a structure, how does that relate to the original question.
Enjoy!
when i attempt to run this example (or any other where the family has > 66 million members) , the program basically grinds to a halt
at the moment it is using 800mb of ram, i only have 1.5gigs so i can't quite double the size of the hash table...is that where my problem is?
also, that example uses most of the available elements...there will likely be larger examples yet to come. what would need to be done to string together 2 longs,or maybe even just a long and an int
that is, if something can be done (short of more memory) to get the program so it will calculate those in a reasonable amount of time.
abc
abcd,abcdA
abce,abceA
abcf,abcfA
def
defg,defgB
defh,defhB
defi,defiB
ghi
ghij,ghijC
ghik,ghikC
ghil,ghilC
jkl
jklm,jklmD
jkln,jklnD
jklo,jkloD
mno
mnop,mnopE
mnoq,mnoqE
mnor,mnorE
pqr
pqrs,pqrsF
pqrt,pqrtF
pqru,pqruF
stu
stuv,stuvG
stuw,stuwG
stux,stuxG
vwx
vwxy,vwxyH
vwxz,vwxzH
vwx1,vwx1H
yz1
yz12,yz12I
yz13,yz13I
yz14,yz14I
234
2345,2345J
2346,2346J
2347,2347J
567
5678,5678K
5679,5679K
5670,5670K
890
890Z,890ZL
890Y,890YL
890X,890XL
ZYX
ZYXU,ZYXUM
ZYXV,ZYXVM
ZYXW,ZYXWM
UVW
UVWR,UVWRN
UVWS,UVWSN
UVWT,UVWTN
RST
RSTa,RSTaO
RSTb,RSTbO
RSTc,RSTcO
> when i attempt to run this example (or any other
> where the family has > 66 million members) , the
> program basically grinds to a halt
If I understand the problem correctly, it seems to have a factorial or exponential growth. I don't know that you can 'solve' this without quantum computers or something.
The hashtable may be an issue as it can degrade to a linked list if you get too many items in each bucket.
Have you thought about mariln's quesiton about avoiding finding the same sets over and over again? If there's a smarter way to solve the problem that may resolve your current issue.
2^26 = 67 108 864
2^26*12 = 805 306 368
So I assume that that is the size problem that you are having difficulty with. Yes if you have >66M elements you are getting close to the limits of a hash table that is only 67M elements big. This causes the almost instantaneous insertion into the hash table to degrade into searching nearly all 67M element for the few remaining open slots each time you want to do a single insertion and each time you want to check for equality. It just won't work. You need a bigger size. Hash tables do not work when they are nearly full.
If you are unable to double your memory, your next alternative is to modify the program so that it does not require a power of two for the hash table size. That would allow you to grow in smaller increments rather than needing to double when you run out of space.
If you look at the hash functions, you will notice that the last operation is to AND the returned value with maxMinus1, since max is a multiple of 2, this operation is reducing the number mod that power of 2. This is what forces the addresses to be within the bounds of the array.
You can replace that AND (&) oepration with a true MOD (%) operation and then you will have more flexibility in choice of the value max. (mods are slower than ands, not a whole lot, just a little, you shouldn't see much difference)
The trick is in hash2. This hash is used for the double hash chaining. Hash1 just has to fit the bounds of the array, that is a simple mod. Hash2 has slightly different requirmements. It is used to create the search chain, or the buckets, that hold the hash collisions. It defines the spacing that will be used to look from one element to the next. Thus if hash1(x) = 17 and hash2(x) = 7 you will be looking first in slot 17 then in slot 17+7 then in slot 17+2*7, then in slot 17+3*7...
In order for this to work hash2 must NOT return 0 ever. And if you want the buckets to span the hash table hash2(x) must be relatively prime to size of the table which is max.
The easiest way to achieve this (or one of the easy ways) is let max be some large prime, P, and then for the last step convert hash2 to be
(hash2value % (P-1)) + 1
This way hash2 is never zero and is relatively prime to max because it is less than P. (If you look at the code, you will see that I forced hash2 to be odd. That delt with the issue of being relatively prime to the size of the array when the array was a power of two.
So you will need to fix the two hash routines, and then in the insertion code as well, where the hash2 function is used, you will note once again an AND being used to force the hash table index to be in range (after adding the hash2 value). You must modify that as well to a MOD with the table size as well.
If you want to beef up the routine to use a long and an int for the hash table, that is not particularly hard. Just create another int array to go along side. The work will all be in the fact that your generators will now not be a single long, they will be a long and an int, so you must pass two values around everywhere that you used to just deal with a single long. The driving routine that cracked your inputs into bitsets in the first place can be modified to notice when the long overflows and to flow on into the int.
Of course, when you do that, your storage requirments get larger, intead of 12 bytes per family element you will be up to 16 bytes per family element. You will not be able to investigate such large families. Your 1.5 G will top out somewhere around 80M family elements.
The real speed up will come of course, when you quit using the computer to generate huges sets and instead use your mathematical insight into the structure of the problem and just prove the conjecture, then all of this code becomes unnecessary. :)
Good luck!
Marlin
