text to binary (.txt to .bin)

Hello,

I'm developing an application on J2ME that needs to read from files.

I want to use binary files instead of text files, because I think this gives me two benefits:

1. reducing the jar size

2. speeding up the reading process.

That's the starting point... and I hope this approach is right.

I've written a small J2SE utility class to convert .txt files into .bin files and... GULP!!! A 22 KB .txt file becomes 44 KB once converted to .bin. Something is wrong, but I don't know what.

Here is the code I wrote:

package texttobin;

import java.io.*;

public class Main {

    public static void main(String[] args) {
        File inFile = null;  // the file to read
        File outFile = null; // the file to write

        // read the paths from the command line (both are needed)
        if (args.length > 1) {
            inFile = new File(args[0]);
            outFile = new File(args[1]);
        }

        // some checks
        if (inFile == null) return;
        if (outFile == null) return;

        try {
            // set up the streams
            FileOutputStream outputStream = new FileOutputStream(outFile);
            DataOutputStream dataOutputStream = new DataOutputStream(outputStream);
            FileInputStream sourceStream = new FileInputStream(inFile);
            byte readingByte[] = new byte[1];

            while (true) {
                if (sourceStream.read(readingByte) != -1) {
                    // to be sure: I convert the input byte I read into a string
                    // and pass the char inside it to the writeChar method...
                    dataOutputStream.writeChar(new String(readingByte).charAt(0));
                    // the same result is achieved with:
                    // dataOutputStream.writeChars(new String(readingByte));
                }
                else break;
            }

            dataOutputStream.close();
            outputStream.close();
            sourceStream.close();
        }
        catch (FileNotFoundException fnfe) {
            System.out.println(fnfe);
            return;
        }
        catch (IOException ioe) {
            System.out.println(ioe);
            return;
        }
    }
}

WHAT'S WRONG?!

THANKS A LOT!!!

daniele

[3949 byte] By [palmer_eldritcha] at [2007-11-27 7:41:56]
# 1
Nothing's wrong. One Java char is made up of 2 bytes, so your binary format is no improvement. I'd suggest using some compression format such as ZIP instead.
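
To see the doubling for yourself, here is a minimal sketch (the class and method names are made up for illustration) that writes the same text once with writeChar and once with write(int) into an in-memory buffer and compares the sizes. writeChar always emits two bytes per char, while write(int) keeps only the low eight bits, so it is only safe for characters below 256.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteCharSize {

    // Writes every char with writeChar (2 bytes each) and returns the byte count.
    static int sizeWithWriteChar(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        for (int i = 0; i < text.length(); i++) {
            out.writeChar(text.charAt(i));
        }
        out.flush();
        return buffer.size();
    }

    // Writes every char with write(int), which keeps only the low 8 bits (1 byte each).
    static int sizeWithWrite(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        for (int i = 0; i < text.length(); i++) {
            out.write(text.charAt(i));
        }
        out.flush();
        return buffer.size();
    }

    public static void main(String[] args) throws IOException {
        String text = "hello";
        System.out.println("writeChar: " + sizeWithWriteChar(text) + " bytes"); // 10
        System.out.println("write:     " + sizeWithWrite(text) + " bytes");     // 5
    }
}
```

That is the same factor of two as the 22 KB file growing to 44 KB.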
quittea at 2007-7-12 19:22:39 > top of Java-index,Java Essentials,Java Programming...
# 2
Thank you, this is clear, but if a binary file uses 2 bytes for each char and txt uses 7, why is the .bin bigger than the .txt? Is it not possible to make a .bin file smaller than the .txt?

Thank you again,

daniele

palmer_eldritcha at 2007-7-12 19:22:39
# 3
> this is clear but if a binary file uses 2 bytes for each char and txt uses 7, why the size of the bin is bigger than txt?

Because you don't understand. ISO-something: 1 byte/char. Java char = UTF-16 = 2 bytes/char. Nobody said anything about 7 bytes.

CeciNEstPasUnProgrammeura at 2007-7-12 19:22:39
# 4
> Nobody said anything about 7 bytes.

I think OP mistook the 7 (lower) bits used in ASCII encoding for 7 bytes ... that was 7 bits, right?

thomas.behra at 2007-7-12 19:22:39
# 5
Hi,

what I don't understand: why should it matter in terms of jar size whether the app reads from binary or from text files?

regards

BugBunny

BugBunnya at 2007-7-12 19:22:39
# 6
Is this what applications that run on my cellphone look like under the hood? *cringe* I'm surprised the thing even boots.
mkoryaka at 2007-7-12 19:22:39
# 7

Try

while (sourceStream.read(readingByte)!=-1)

{

dataOutputStream.write(readingByte);

}

This way you'll have a simple byte copy to the outFile.

Note that this does not guarantee the character encoding. To accomplish that, you should use the writeUTF(String) method to save and readUTF() to load (as said before, your file will get bigger).
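
A small sketch of that round trip, assuming plain J2SE and hypothetical class and method names. writeUTF stores a two-byte length followed by the text in Java's modified UTF-8, so an ASCII string costs its length plus two bytes, and a single call cannot hold more than 65535 encoded bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteUtfDemo {

    // Encodes one string with writeUTF: 2-byte length prefix + modified UTF-8 bytes.
    static byte[] encode(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeUTF(text);
        out.flush();
        return buffer.toByteArray();
    }

    // Reads back one string that was written with writeUTF.
    static String decode(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        return in.readUTF();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = encode("ciao");
        System.out.println(data.length + " bytes"); // 4 chars + 2-byte prefix = 6
        System.out.println(decode(data));
    }
}
```

For a file with thousands of short lines, the two-byte prefix per string is the overhead to weigh against the convenience of readUTF().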

[]'s

Danniel_Williana at 2007-7-12 19:22:39
# 8

> This way you'll have a simple byte copy to the

> outFile.

> Note that you're not assuring the char encoding. To

> accomplish

> that, you should use the writeUTF(String) method to

> save

> and readUTF() to load (as said before, your file will

> get bigger).

Thank you,

this way I get the same size as the original file...

But that doesn't help me much...

I'm just wondering how to read about 3000 lines of a text file as fast as possible. Reading a .txt file with only an InputStreamReader takes about 50 seconds on the emulator, and I'm sure there is a better way...

I've downloaded some free applications that use XML or .txt files converted to binary, and their performance is very good. I've seen a Bible in about 300K!

If you don't believe me, take a look:

http://gobible.jolon.org/

Uhm... sorry, I don't want to write off-topic things...

palmer_eldritcha at 2007-7-12 19:22:39
# 9

> I'm just wondering how is it possible to read about

> 3000 lines of a text file as fast as possible, from

> a txt file with only InputStreamReader I take about

> 50 seconds on the emulator and of course I'm sure

> that there is an answer...

After looking at your code I feel like giving a hint on speed improvement: use buffered reading. Accessing a stream byte by byte is slow. For example:

InputStream in = /* ... */;
BufferedInputStream buffy = new BufferedInputStream(in);
int b = 0;
while ((b = buffy.read()) != -1) {
    /* use the byte */
}
// don't forget: in.close();

Or do the buffering yourself:

InputStream in = /* ... */;
byte[] buffy = new byte[4096]; // use a reasonable buffer size; example here: 4K
int r = 0;
while ((r = in.read(buffy)) != -1) {
    /* use the byte array, r bytes are in there */
}
// don't forget in.close();
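
Here is the do-it-yourself variant as a complete sketch: it pulls a whole stream into memory in 4K chunks and then counts the lines in the resulting array. The class and method names are invented for the example, and the in-memory sample stands in for the real file. As far as I remember, CLDC does not ship BufferedInputStream but does have ByteArrayOutputStream, which is why the manual buffer may be the more portable route; please check that against the API docs of your target profile.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ChunkedRead {

    // Reads the stream to the end in 4K chunks and returns everything as one array.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int count;
        while ((count = in.read(chunk)) != -1) {
            collected.write(chunk, 0, count); // only the bytes actually read
        }
        return collected.toByteArray();
    }

    // Counts '\n'-terminated lines; a final line without a newline counts too.
    static int countLines(byte[] data) {
        int lines = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                lines++;
            }
        }
        if (data.length > 0 && data[data.length - 1] != '\n') {
            lines++;
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "one\ntwo\nthree".getBytes("ISO-8859-1");
        byte[] data = readAll(new ByteArrayInputStream(sample));
        System.out.println(data.length + " bytes, " + countLines(data) + " lines");
    }
}
```

Once the bytes are in memory, splitting them into lines is a plain array scan with no further stream calls, which is usually where the time goes on slow devices.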

quittea at 2007-7-12 19:22:39
# 10
> im surprised the thing even boots

Mine does, too. Quite often actually.

CeciNEstPasUnProgrammeura at 2007-7-12 19:22:39
# 11

> > im surprised the thing even boots

>

> Mine does, too. Quite often actually.

I don't understand you.

I'm sorry, my English is quite bad, but in my first post I wrote that the code was for J2SE.

OBVIOUSLY THAT CODE IS NOT FOR MOBILE and it is not optimized. Have you ever seen a MIDlet starting with a main method?!

The reason I posted this thread here is that I NEED HELP MAKING TEXT FILES SMALLER, as the subject suggests, and then reading them faster in a J2ME app.

I'm sorry again if I did not explain the problem clearly, but please, if you don't have an answer or a suggestion, don't write.

thank you,

d.

palmer_eldritcha at 2007-7-12 19:22:39
# 12
ZIP it or use a 7-bit encoding scheme (which isn't easy). As for faster reading: you need to see that the decoding of your file format doesn't take more time than the reading of an unencoded format would take.
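
For the curious, a rough sketch of what such a 7-bit scheme could look like: each character below 128 is written as 7 bits, back to back, so 8 characters fit into 7 bytes. Everything here (class name, method names, the choice to pass the character count separately) is an assumption made for the example, not an established format. Keep in mind that a file inside a JAR is normally deflated anyway, so the saving after compression may be small or even negative; measure before adopting it.

```java
public class SevenBitPacker {

    // Packs 7-bit ASCII text: 7 bits per character, most significant bit first.
    static byte[] pack(String text) {
        int n = text.length();
        byte[] out = new byte[(n * 7 + 7) / 8]; // round up to whole bytes
        int bitPos = 0;
        for (int i = 0; i < n; i++) {
            int c = text.charAt(i);
            if (c > 0x7F) {
                throw new IllegalArgumentException("not 7-bit ASCII at index " + i);
            }
            for (int bit = 6; bit >= 0; bit--) {
                if (((c >> bit) & 1) != 0) {
                    out[bitPos >> 3] |= (byte) (0x80 >> (bitPos & 7));
                }
                bitPos++;
            }
        }
        return out;
    }

    // Unpacks charCount characters; the count must be stored alongside the data,
    // because the padding bits in the last byte are otherwise ambiguous.
    static String unpack(byte[] packed, int charCount) {
        StringBuilder text = new StringBuilder(charCount);
        int bitPos = 0;
        for (int i = 0; i < charCount; i++) {
            int c = 0;
            for (int bit = 0; bit < 7; bit++) {
                int b = (packed[bitPos >> 3] >> (7 - (bitPos & 7))) & 1;
                c = (c << 1) | b;
                bitPos++;
            }
            text.append((char) c);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String text = "ciao sono daniele";
        byte[] packed = pack(text);
        System.out.println(text.length() + " chars -> " + packed.length + " bytes");
        System.out.println(unpack(packed, text.length()));
    }
}
```

The decoding loop touches every bit individually, which illustrates the warning above: on a slow phone, the unpacking may cost more time than reading the extra 1/8 of the bytes would.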
CeciNEstPasUnProgrammeura at 2007-7-12 19:22:39
# 13
> After looking at your code I feel like giving a hint on speed improvement: use buffered reading. Accessing a stream byte for byte is low-performant.

That's interesting, I will try, thank you!

palmer_eldritcha at 2007-7-12 19:22:39
# 14

> ZIP it or use a 7-bit encoding scheme (which isn't

> easy). As for faster reading: you need to see that

> the decoding of your file format doesn't take more

> time than the reading of an unencoded format would

> take.

Yes, you're right, but I thought that the bottleneck was the reading process, not the decoding algorithm (hopefully!!!)

I found an article that says:

-

An ASCII file is a binary file that stores ASCII codes. Recall that an ASCII code is a 7-bit code stored in a byte. To be more specific, there are 128 different ASCII codes, which means that only 7 bits are needed to represent an ASCII character.

However, since the minimum workable size is 1 byte, those 7 bits are the low 7 bits of any byte. The most significant bit is 0. That means, in any ASCII file, you're wasting 1/8 of the bits. In particular, the most significant bit of each byte is not being used.

-

http://www.cs.umd.edu/class/spring2003/cmsc311/Notes/BitOp/asciiBin.html

(and sorry for the mistake: 7 bits, not 7 bytes!!!)

Another interesting thing I've found is here:

http://www2.sys-con.com/ITSG/virtualcd/Java/archives/0607/heaton/index.html

There are some ways to store strings in binary files (like fstream in C++), but I haven't tested yet whether this reduces the size of the .bin.

The files are stored in the jar file, so zipping might reduce the size of the jar further, but I think that when the stream is opened it reads the real bytes of the contained text (well... I'm not sure... can you confirm it?)

thank you.

palmer_eldritcha at 2007-7-12 19:22:39
# 15

> The files are stored in the jar file, then zipping

> maybe can reduce more the size of the jar, but I

> think that when the stream opens it takes the real

> bytes of the contained text (well...I'm not

> sure...may you confirm it?)

JARs are nothing but ZIPs. Zipping a ZIP usually just adds to the overall size.
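
On the question of what the stream delivers: the archive stores compressed bytes, but reading an entry hands back the original, uncompressed content. A minimal J2SE sketch with java.util.zip, working entirely in memory (the class name, method names and the sample text are made up for the demonstration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipRoundTrip {

    // Builds a ZIP archive in memory that holds a single entry.
    static byte[] zip(String entryName, byte[] content) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        ZipOutputStream zip = new ZipOutputStream(buffer);
        zip.putNextEntry(new ZipEntry(entryName));
        zip.write(content);
        zip.closeEntry();
        zip.close();
        return buffer.toByteArray();
    }

    // Reads the first entry back; the stream yields the decompressed bytes.
    static byte[] unzipFirstEntry(byte[] archive) throws IOException {
        ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive));
        if (zip.getNextEntry() == null) {
            throw new IOException("empty archive");
        }
        ByteArrayOutputStream content = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int count;
        while ((count = zip.read(chunk)) != -1) {
            content.write(chunk, 0, count);
        }
        zip.close();
        return content.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sample = new StringBuilder();
        for (int i = 0; i < 3000; i++) {
            sample.append("line ").append(i).append(" of some sample text\n");
        }
        byte[] original = sample.toString().getBytes("ISO-8859-1");
        byte[] archive = zip("sample.txt", original);
        byte[] restored = unzipFirstEntry(archive);
        System.out.println("original: " + original.length + " bytes");
        System.out.println("zipped:   " + archive.length + " bytes");
        System.out.println("identical after unzip: " + Arrays.equals(original, restored));
    }
}
```

So a plain text resource already takes less space inside the JAR than on disk, and the application still reads it as ordinary text.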

CeciNEstPasUnProgrammeura at 2007-7-21 22:19:36
# 16
> That means, in any ASCII file, you're wasting 1/8 of the bits. In particular, the most significant bit of each byte is not being used.

Hah. That's what you English speakers/writers think ...

quittea at 2007-7-21 22:19:36
# 17
> Hah. That's what you english speakers/writers think ...

Not quite. ASCII only has 7 bits. Everything that uses the eighth bit is not ASCII, and then it's not an ASCII file.

CeciNEstPasUnProgrammeura at 2007-7-21 22:19:36
# 18
Oh c'mon, you know I haven't been serious ...
quittea at 2007-7-21 22:19:36
# 19

> > Hah. That's what you english speakers/writers think

> ...

>

> Not quite. ASCII only has 7 bits. Everything that

> uses the eighth bit is not ASCII, and then it's not an

> ASCII file.

Extended ASCII is usually referred to as ASCII and uses 8 bits :)

jwentinga at 2007-7-21 22:19:36
# 20

> Extended ASCII is usually referred to as ASCII and

> uses 8 bits :)

Then, if Java stores chars in 2 bytes = 16 bits and extended ASCII uses only 8 bits, is it reasonable to think that I can store strings in binary files using half the space, or am I still missing something?

palmer_eldritcha at 2007-7-21 22:19:36
# 21

> then, if java stores chars in 2 byte = 16bits and if

> extended ascii code uses only 8 bits, is it

> reasonable to think that I can store strings in

> binary files using the half space, or I still miss

> something?

Those binary files you're talking about contain a binary format of Java variables, thus a single char consumes two bytes; the encoding used is UTF-16. Normal text files as you know them are practically never pure ASCII(-7), but most probably one of those many 8-bit ASCII extensions standardized through the ISO under standard number 8859, or it's an operating system vendor specific, ISO-like 8-bit encoding such as Microsoft's Codepages (Cp1252 for instance). Anyway, files are handled byte-oriented. The only way you can save space is to use compression, as has been suggested several times.
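
The byte counts are easy to check with String.getBytes. The sketch below uses StandardCharsets, which needs Java SE 7 or later and is not part of J2ME, so treat it as a desktop-side experiment; the class and method names are hypothetical. The sample word is "città" written with a \u escape so the source file itself stays ASCII.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {

    // ISO-8859-1: exactly one byte per character (for characters it can represent).
    static int latin1Size(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    // UTF-16BE without a byte order mark: two bytes per Java char.
    static int utf16Size(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // UTF-8: one byte for ASCII, two or more for everything else.
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "citt\u00e0"; // 5 characters, the last one is not ASCII
        System.out.println("ISO-8859-1: " + latin1Size(text) + " bytes"); // 5
        System.out.println("UTF-16BE:   " + utf16Size(text) + " bytes");  // 10
        System.out.println("UTF-8:      " + utf8Size(text) + " bytes");   // 6
    }
}
```

In other words, an 8-bit text file is already at one byte per character; a Java-variable dump with writeChar is the two-byte case.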

quittea at 2007-7-21 22:19:36
# 22

Thank you,

now it's all clear!!!!!!!

Writing data with DataOutputStream.write(int) stores only the low 8 bits of each char, and that is the smallest I can get without compression.

Look at this (I'm posting it in case it is useful for someone else):

import java.io.*;

public class Main {

    // splits the char into its two bytes: [0] = low byte, [1] = high byte
    static byte[] charToByteArray(char c) {
        byte[] twoBytes = { (byte) (c & 0xff), (byte) ((c >> 8) & 0xff) };
        return twoBytes;
    }

    // converts the "significant" (low) byte back into the character
    static char byteToChar(byte b) {
        char c = (char) (b & 0xFF);
        return c;
    }

    public static void main(String[] args) {
        char string[] = new char[]{'c','i','a','o',' ','s','o','n','o',' ','d','a','n','i','e','l','e'};

        // both output paths are required on the command line
        if (args.length < 2) return;
        File txtFile = new File(args[0]); // the text file to write
        File binFile = new File(args[1]); // the binary file to write

        try {
            // set up the text stream
            FileOutputStream txtFileOutputStream = new FileOutputStream(txtFile);
            DataOutputStream txtDataOutputStream = new DataOutputStream(txtFileOutputStream);

            // set up the bin stream
            FileOutputStream binFileOutputStream = new FileOutputStream(binFile);
            DataOutputStream binDataOutputStream = new DataOutputStream(binFileOutputStream);

            System.err.println("char\tbyte[0]\tbyte[1]\tchar from byte[0]");

            for (int i = 0; i < string.length; i++) {
                // writes the character as is (write(int) keeps only the low 8 bits)
                txtDataOutputStream.write(string[i]);

                // splits the char into a pair of bytes
                byte b[] = charToByteArray(string[i]);

                // take the low byte of the char (the only one set for 8-bit text)
                // and write it to the bin file
                binDataOutputStream.write(b[0]);

                System.err.println(string[i] + "\t" + b[0] + "\t" + b[1] + "\t" + byteToChar(b[0]));
            }

            // close the streams
            txtDataOutputStream.close();
            binDataOutputStream.close();
        }
        catch (FileNotFoundException fnfe) {
            System.out.println(fnfe);
        }
        catch (IOException ioe) {
            System.out.println(ioe);
        }
    }
}

Thanks everyone again!

daniele

palmer_eldritcha at 2007-7-21 22:19:36