bijective function : String <-> primitive type
Hey,
We all know Strings are memory hungry. So instead of storing Strings, why not store a long value (or a byte, short, or int)?
For that we need a bijective function between the two sets (the String set and the long set).
Since long values are bounded, the String set will also be bounded: we will only be able to store Strings up to some maximum length, say L.
L depends on the characters you agree to allow in the String.
A long is 8 bytes; if we decide a character is encoded in 1 byte, a long value can encode a word of up to 8 characters. The critical issue is how to build the bijective function between the String representation and the long one.
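As a rough illustration of the 8-characters-in-a-long idea, here is a minimal sketch. The class name LongPacker and its methods are made up for this example. It assumes every character fits in one byte (code points 1 to 255) and forbids the '\0' character so that shorter strings can be told apart from padding. Under those assumptions the mapping is a bijection onto the set of values it produces, not onto all longs.

```java
final class LongPacker {
    // Packs up to 8 one-byte characters into a long, first character in the lowest byte.
    // Assumption for this sketch: characters are in the range 1..255 ('\0' is reserved as padding).
    static long pack(String s) {
        if (s.length() > 8) {
            throw new IllegalArgumentException("at most 8 characters fit in a long");
        }
        long value = 0L;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == 0 || c > 0xFF) {
                throw new IllegalArgumentException("character does not fit in one non-zero byte");
            }
            value |= (long) c << (8 * i);
        }
        return value;
    }

    // Reverses pack(): reads bytes from the lowest one up until only padding is left.
    static String unpack(long value) {
        StringBuilder sb = new StringBuilder(8);
        while (value != 0) {
            sb.append((char) (value & 0xFF));
            value >>>= 8; // unsigned shift, so a set top bit does not loop forever
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        long id = pack("hello");
        System.out.println(Long.toHexString(id)); // 6f6c6c6568
        System.out.println(unpack(id));           // hello
    }
}
```

Two longs produced this way can be compared with ==, which is where the saving over String.equals would come from.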
Now imagine we use the double primitive type: how many characters (1 byte each) can we potentially encode in a double?
Also, do you feel the need for such a thing?
By Mordana at 2007-9-30 0:47:36

A double is also 8 bytes, so you could encode at most 8 one-byte characters in it.
String representations of words are quite large because Java uses 2 bytes to represent a char.
To get a byte[] representation of a String, call String.getBytes(String charsetName), where charsetName names
one of the one-byte-per-character sets (US-ASCII or ISO-8859-1, for example).
matfud
The straightforward way to achieve such a bijective function is to define a character-to-bit-string translation table.
If S = { C1 ... CN } is the set of different characters you allow, then you need log2(N) bits to represent each character. For ease of implementation, you should round log2(N) up to a whole number of bits, so each character gets its own fixed-width bit pattern.
Then you loop, lookup, and translate.
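A minimal sketch of the loop, lookup, and translate step might look like the following. The class name TableCodec is invented for this example, and one detail goes beyond what is described above: code 0 is reserved to mean "no more characters", so the sketch uses ceil(log2(N + 1)) bits per character rather than ceil(log2(N)). That is an assumption made here so that strings shorter than the maximum length decode unambiguously.

```java
final class TableCodec {
    private final String alphabet;  // allowed characters; the character at index i gets code i + 1
    private final int bitsPerChar;  // smallest b with 2^b >= N + 1 (code 0 is reserved as terminator)
    private final int maxLength;    // how many codes fit in the 64 bits of a long

    TableCodec(String alphabet) {
        if (alphabet.isEmpty()) {
            throw new IllegalArgumentException("alphabet must not be empty");
        }
        this.alphabet = alphabet;
        this.bitsPerChar = 32 - Integer.numberOfLeadingZeros(alphabet.length());
        this.maxLength = 64 / bitsPerChar;
    }

    // Loop over the characters, look each one up, and shift its code into the long.
    long encode(String s) {
        if (s.length() > maxLength) {
            throw new IllegalArgumentException("at most " + maxLength + " characters fit");
        }
        long value = 0L;
        for (int i = s.length() - 1; i >= 0; i--) {
            int index = alphabet.indexOf(s.charAt(i));
            if (index < 0) {
                throw new IllegalArgumentException("character not in alphabet: " + s.charAt(i));
            }
            value = (value << bitsPerChar) | (index + 1);
        }
        return value;
    }

    // Only meant for values produced by encode(); the first character sits in the lowest bits.
    String decode(long value) {
        long mask = (1L << bitsPerChar) - 1;
        StringBuilder sb = new StringBuilder();
        while (value != 0) {
            int code = (int) (value & mask);
            sb.append(alphabet.charAt(code - 1));
            value >>>= bitsPerChar;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        TableCodec codec = new TableCodec("abcdefghijklmnopqrstuvwxyz");
        long id = codec.encode("ab");
        System.out.println(id);               // 65
        System.out.println(codec.decode(id)); // ab
    }
}
```

With the 26 lowercase letters this gives 5 bits per character, so a long holds up to 12 characters instead of 8.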
R
/ Chris
Thank you, Chris.
My aim is to develop an ID system as an Eclipse plug-in (just a vague idea, you know ;)) that has both the advantages of understandable strings and the compactness of primitive types. :)
Since char (the primitive type) supports up to 2^16 different values, a direct translation to 1 byte per char does not work in general. However, UTF-8 is an encoding that translates the most common characters (the ASCII range, which covers unaccented Latin letters and digits) into just 1 byte each, while other characters take more: accented Latin letters take 2 bytes and Chinese characters take 3. So I would do it like this:
byte[] bytes = s.getBytes("UTF-8");
The "bytes" variable only costs about 8 extra bytes, 4 for the address and 4 for the length (on a 32-bit JVM, not counting the array's object header). Then you don't need to worry about whether your string fits in a fixed-size primitive or not. If you use a long, you lose about 3.5 bytes on average anyway.
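To make the byte counts concrete, here is a small sketch that measures the UTF-8 length of a few sample strings. The class name Utf8Sizes and the sample strings are made up for illustration. It uses StandardCharsets.UTF_8 (available since Java 7) instead of the "UTF-8" charset name, which avoids the checked UnsupportedEncodingException.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Sizes {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("id42"));      // 4: ASCII characters take 1 byte each
        System.out.println(utf8Length("caf\u00e9")); // 5: the accented e takes 2 bytes
        System.out.println(utf8Length("\u4e2d"));    // 3: one Chinese character takes 3 bytes

        // The encoding is reversible, so the original string can be recovered from the bytes.
        byte[] bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(bytes, StandardCharsets.UTF_8).equals("caf\u00e9")); // true
    }
}
```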
Gil
> We all know Strings are memory hungry. So instead of storing Strings, why not store a long value (or a byte, short, or int)?
How do you intend to save memory with this method? Are you saying you would have a string table and store indexes into it? That sounds like interning.
asjfa at 2007-7-16 5:20:09

> Also Do you feel the need for such a thing?
I'm curious why you are asking. Actually, we have an application where we badly need such a mapping, and of course we use it. We map strings to ints, though (that is enough for us).
We need this because we have to constantly manipulate around 1M different strings in real time (we process incoming real-time events). By replacing those strings with 1M ints, we save a considerable amount of memory, save a lot of CPU cycles in processing them (especially in comparisons and in reading/writing them from/to streams), and put considerably less strain on the garbage collector thanks to the much smaller number of objects in the system. Our system is so fast that nobody could believe this was possible even in C/C++, let alone Java :)
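The post does not say how the mapping is implemented, so the following is only a guess at the general shape: a table that hands out the next free int the first time it sees a string and returns the same int afterwards. The class name StringIdTable and its methods are hypothetical. The sketch is not thread-safe, and it boxes ints in the map, which a memory-sensitive system would probably avoid.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StringIdTable {
    private final Map<String, Integer> idsByString = new HashMap<>();
    private final List<String> stringsById = new ArrayList<>();

    // Returns the id of s, assigning the next free id on first sight.
    int idOf(String s) {
        Integer id = idsByString.get(s);
        if (id == null) {
            id = stringsById.size();
            stringsById.add(s);
            idsByString.put(s, id);
        }
        return id;
    }

    // Returns the string that was registered under the given id.
    String stringOf(int id) {
        return stringsById.get(id);
    }

    public static void main(String[] args) {
        StringIdTable table = new StringIdTable();
        int a = table.idOf("editor.save");
        int b = table.idOf("editor.close");
        System.out.println(a == table.idOf("editor.save")); // true: same string, same id
        System.out.println(table.stringOf(b));              // editor.close
    }
}
```

Identifiers read from a config file could be passed through idOf once at load time; after that, equality checks are plain int comparisons, and stringOf recovers the readable form for logs or error messages.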
> My aim is to develop an ID system as an Eclipse
> plug-in (just a vague idea, you know ;)) that has both
> the advantages of understandable strings and the
> compactness of primitive types. :)
Depending on your needs, you could use either base64 or base32 encodings. This will give you different character sets and string lengths. However, it is not worth the effort if we are talking about fewer than 10K strings.
http://www.eclipse.org/eclipse/development/performance/bloopers.html
"Strings as identifiers are easy to read but terribly inefficient"
I want readability and efficiency. Let's say I have string identifiers in a config file. The string identifiers are parsed on loading and converted to something more efficient. That more efficient thing is then used transparently for equality checks.
Thank you for your input. I need to think about the matter some more...