create abbrivations for a set of strings

Does anyone know of an algorithm which produces a "meaningful" abbreviations for a word or phrase.

I am working with a state machine and would like a way to reduce names of states and transitions to a simpler set of codes.

ie. Acknowlege would become ACK, Not Acknowleged would become NACK, etc.

[314 byte] By [radix_zeroa] at [2007-11-26 19:51:20]
# 1

http://en.wikipedia.org/wiki/ASCII

ASCII control characters

001 NegativeAcknowledgement

http://en.wikipedia.org/wiki/Negative-acknowledge_character

In telecommunications, a negative-acknowledge character (NAK) is a transmission control character sent by a station as a negative response to the station with which the connection has been set up.

Note 1: In binary synchronous communication protocol, the NAK is used to indicate that an error was detected in the previously received block and that the receiver is ready to accept retransmission of that block.

Note 2: In multipoint systems, the NAK is used as the not-ready reply to a poll.

BIJ001a at 2007-7-9 22:41:18 > top of Java-index,Other Topics,Algorithms...
# 2

I know of nothing that does this, but I think you could build something in an afternoon that would work reasonably well with out too much difficulty.

I based this design on a couple of data items, I looked at the two letter abbrievations for the states developed by the Post Office, and I also looked at the abbrievations for the most common male first names from census data. JIM(James), JOHN, BOB(Robert), MIKE(Michael), BILL(William), DAVE(David), ****(Richard), CHUCK(Charles),JOE(Joseph)...

First design point. Abbrievations should not be developed in isolation. In the interests of not introducing ambiguity ou do not want you two different words to abbrievate to the same string. One could develop some method that looks at a set of words/phrases and develops all the abbrievations simultaneously, or one could simply develop them one at a time, add them to a list and require the abbrievator to come up with an alternative whenever there is a collision.

Furthermore, in the interests of coding theory, where one tends to use short strings to encode the most frequently occuring information, it would be wise to stuff the strings into the abbrievator in frequency order if those frequencies are know. Thus the most frequent words can grab the short strings. It is possible that this concern does not apply in your application.

I view the system as having two components, A - the abbrievator that generates abbrievations and D, the dictionary that looks at candidates and rejects those that don't fit. In the interests of keeping the communication between these two components simple, I assume that A, actually returns not a String, but rather a list of Strings in its perfered order of desirability. D is then permitted to filter and shuffle these strings if it so desires. For example in developing the two letter codes for the states, you wouldn't modify the abbrievator, rather the Dictionary simply throws out all suggestions that were not two letters long.

This split into A and D allows you to deal with essentially linguistic things in A and exteraneous concerns in D. Other examples, in the candidates for two letter state names, two letters was key, pronouncability was not, yet in human names pronouncability is importand. If you were developing user names introduction of numbers, like Marlin314 or Tom17 may be required, in which case, it makes far more sense for the Dictionary that is looking at all the existing Toms to append a number to insure uniqueness than it is to expect A to generate all possible numeric appendages. So both components A and D have their role to play in the design.

Of course in its simplist form, D is nothing other than, grab the first element off of the list supplied by A that does not match another entry, and if that is not possible (where all entries matched) choose an entry near the front of the list and append a number for uniqueness.

Now for A itself I propose you do something simple like this:

You may want to build some elementary linguistic filters along these lines:

respell: internal CH to K, TH to T, PH to F, DoubleVowels to first Vowel, etc. Create a list of these sorts of transformations so that you can add to them when you realsize that you forgot one like OUGH -> UF

Break up word/phrase into list of syllables. Make assumption that your respellings eliminated double consenants that should be kept together and follow simple if linguistically incorrect rules, like if there is only one consenant between two vowels, the second vowel gets it, if there are two consenants, the syllable split is between the two, if there are three give first two to the first syllable and third to the second syllable, etc. Special case silent-E (a 4 letter word like consenant-vowel-consenant-E is a single syllable not two)

Then using those linguistic routines you can generate your abbrievations in some simple order like this:

1: If input is single word and single syllable return word unchanged

2: If input is phrase return first letter from each word ingnoring stop words

3: if input is phrase return first syllable from first word and first syllable from second word

4: return initial portion up to second vowel cluster. i.e. return initial C-V-C or V-C

5: return first syllable

6: If input is single word return first letter

7: return first 2 letters

8: return first 3 letters

9: if input is word return first letter and first letter from second syllable

10: if input is word return all combinations of first letter followed by first letter from some subsequent syllable.

11: if input is phrase return first syllable followed by first syllable from any subsequent words.

12: if input is word return first and last letter

13: if input is word strip all vowels

14: if input is word return last syllable

15: return word - unmodified

16: return phrase - unmodified

clearly you can add any other transformations that you want to the list. The order that I listed may seem strange but it was choosen like this. I assme that first you want to pronounce the abbrievation, so even though those are longer, I put them at the front of the list. Then I list the short abbrievations. IF your D filter is looking for 2 or 3 letter abbrievations it will filter out the longer pronuncable ones and grab the shorter ones further down the list.

In this form my list of names, James, John, Robert, Michael, William, David, Richard, Charles, Joseph, Thomas, Christopher, Daniel, Paul, Mark, Donald, George, Kenneth, Steven, Edward, Brian, Ronald, Anthony, Kevin, Jason, Matthew... becomes

JAM, JOHN, ROB, MICH, WILL, DAV, RICH, CHARL, JOS, THOM, CHRIST, DAN, PAUL, MARK, DON, GEORG, KENN, STEV, EDW, BRIAN, RON, ANTH, KEV, JAS, MATTH,...

Now we test this with some new short name spaces, like days of the week.

MOND,TUESD,WEDN,THURSD,FRID,SAT,SUND

Too long - Dictionary can filter to 2 letters

MO,TU,WE,TH,FR,SA,SU

Still too long D can filter to 2 or less

M,T,W,TH,F,S,SU

months

JAN,FEBR,MARCH,APR,MAY,JUNE,JUL,AUG,SEPT,OCT,NOV,DEC

filter to 2 or less

J,F,M,A,MA,JU,JL,A,S,O,N,D

filter to 2

JA,FE,MA,AP,MY,JU,JL,AU,SE,OC,NO,DE

filter to 3

JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC

zodiac - aries, taurus, gemini, cancer, leo, virgo, libra, scorpio, sagittarius, capricorn, aquarius, pisces

raw

ARIES,TAUR,GEM,CANC,LEO,VIRG,LIBR,SCORP,SAG,CAPR,AQ,PISC

2 or less

A,T,G,C,L,V,LI,S,SA,CA,AQ,P

exactly 2

AR,TA,GE,CA,LE,VI,LI, SC,SA,CR,AQ,PI

Anyway, that is my suggestion for a quick and dirty abbrievation generator.

Enjoy!

marlin314a at 2007-7-9 22:41:18 > top of Java-index,Other Topics,Algorithms...
# 3
kewl thanks
radix_zeroa at 2007-7-9 22:41:18 > top of Java-index,Other Topics,Algorithms...
# 4
yes I know but I am not certain of the relevance of your reply
radix_zeroa at 2007-7-9 22:41:18 > top of Java-index,Other Topics,Algorithms...