image compression algorithms for video stream
This is more of an algorithm question then a JMF question (maybe).
Scenario: User connects to webpage and loads applet. Applet makes a connection to a servlet. Servlet starts webcam and sends image data (in the form of bytes) from a webcam to the applet. The applet displays the image data.
Goals:
1. Try to achieve a fairly high frame rate (ideally >= 10 fps).
2. Keep the bandwidth consumption low (ideally so a person with a 56K modem could use it).
3. Allow the quality to be adjusted.
What I am wondering what is a good process of sending the data to achieve the above. Right now, I see two possible approaches:
1. Encode each frame as an jpeg, compress and send.
2. For each frame, compare to last frame. Record changes (if significant). Organize changes into byte array, compress and send.
Which of 1 or 2 will produce better results? Or is there a better way to do this?
[942 byte] By [
rkippena] at [2007-9-29 17:02:50]

Interesting.
The webcam produces every n seconds a JPEG.
No need for an applet, a HTML meta refresh would do fine.
An applet would be nice though.
There are more than one users out there.
So you have to place the JPEG, and for longer visitors
quicker delta files.
This requires the server to do some intensive work.
Paid back in hopefully less traffic for longer visitors.
In effect the server is playing an MPEG movie, with still frames
and deltas.
You can put every pixel group in a priority queue with factors:
- change in cumulative color
- duration of time not updated
Furthermore it makes sence to let the applet reload the entire frame
once in a while; when the queue would grow too large, or a long period
did pass.
try here
http://www.mpeg.org/MPEG/video.html#video-software
for a lot of mpeg 1,2,4 encoders/decoders.
This might be a way to go. Perhaps more efficient then trying to write
your own compression standard.
That way you could also stream to "standard" clients such as IE with
MediaPlayer or Real.
Mpeg 1 is pretty much a jazzed up version of what you propose.
10) Encode frame using JPEG.
20) Encode frame difference using JPEG
30) if reconstructed image differs by > errorMargin from inputImage goto 10.
40 goto 20.
This allows an arbitrary number of diff frames (which should be smaller)
between the static frames without the error growing to large (for
example when things move rapidly.
MPEG 1 is a bit more complex then this but not a huge amount.
matfud
> The webcam produces every n seconds a JPEG.
> No need for an applet, a HTML meta refresh would do
> fine.
Well, I tried this approach before. The problem was that each time the image is downloaded, the browser had to make a new connection to the server, which produces a limitation on the number of fps. Having an open connection is the way to go.
> An applet would be nice though.
> There are more than one users out there.
> So you have to place the JPEG, and for longer
> visitors
> quicker delta files.
>
delta files? do you mean byte stream?
There are no "files". Actually, writing to the server hard-drive would be more difficult to deal with because you would have to perform some sort of synchronization between writing to a file and the applet making a request to read the file.
> try here
> http://www.mpeg.org/MPEG/video.html#video-software
> for a lot of mpeg 1,2,4 encoders/decoders.
>
Here are a couple of links for anyone else:
http://www.hpl.hp.com/techreports/2002/HPL-2002-260.pdf
http://archive.dstc.edu.au/RDU/staff/jane-hunter/video-streaming.html
> This might be a way to go. Perhaps more efficient then
> trying to write
> your own compression standard.
>
Yes, I don't want to write a "new" compression algorithm. However, I want something simple that works with applets without downloading a new JRE or additional JMF packages (ideally the default browser java 1.1 applet would be all that is required). However, the com.sun.jpeg.codec.blah whatever package isn't available until 1.2. So I might have to implement the JPEG compression algorithm (help?).
> That way you could also stream to "standard" clients
> such as IE with
> MediaPlayer or Real.
>
I didn't consider this, but it is probably worthwhile, thanks.
> Mpeg 1 is pretty much a jazzed up version of what you
> propose.
> 10) Encode frame using JPEG.
> 20) Encode frame difference using JPEG
> 30) if reconstructed image differs by > errorMargin
> from inputImage goto 10.
> 40 goto 20.
>
Where does the data get sent? at step 10?
I was assuming that the encoded data would be pushed into a stream
if there is a client(s) waiting. Disposed of otherwise. Of course
you would need a decoder on the client. With a scheme like this you
just drop the data if no one is listening. If someone is listening you
go into the pseudo code described above.
You could just stream JPEGs but they tend to be quite large. If not much
is changing in the image then sending a difference image can save
a large number of bytes. The problem is that if you send only difference
images cumulative errors can build up (as JPEG is not lossless). Therefore
the encoder encodes an image [i1] and sends it. this is a marker frame
(an 'I' frame in MPEG I think). Each time this happens a new client can join the stream. After this you find the difference between image [i1]
and [i2] and compress it. This should produce a better compression
ratio as the difference images should contain less info then a full
frame. However you have to decompress the diff image and build the
complete image as the client would see it. If this reconstructed image
differs too much from the original image obtained from the webcam then
you will need to send a new complete frame so that subsequent diff
frames do not cause too much error to creep into the image.
Im not sure how you would do this using MPEG itself cos I havn't read
the specs for a long while. However it should be possible to allow new
clients to join the stream.
This means that each client can have a single connection to the server
(UDP perhaps). This is not ideal but multicast is not reliable on
the internet (gets dropped by most routers)
matfud
> If not much is changing in the image then sending a difference image can save a large number of bytes.
This is the part I am most interested in. How is the difference image done?
For example, suppose the following is the image:
aaaa
aaaa
aaaa
aaaa
Then suppose in the next frame the image is:
aaaa
abba
abba
aaab
Then how is the difference image stored? Is it also a JPEG?
Yes. In its simplest form it is just a pixel by pixel subtraction of the
two images. In areas where nothing has changed the difference between
the pixel values is 0.
(By the way JPEG can be lossless it is just remarkably rare to find
an implementaion that provides this aspect of the standard (lots of
issues with patents and all)
JPEG's baseline sequntial, which is the most common form of lossy compression, finds
the Discrete Cosine Transform of blocks of the image (blocks of size
8x8 pixels for example). This represntation contains spatial frequency
information. The precision of these blocks is reduced (they are quantized) by dividing by the values in a
a template block (quantisation table of which there can be many) which
reduces the magnitude of the values in the block).
The aim of this stage is to optimise the data for the second stage which is
coding the image. The coding is perfomed by runlength encoding (diagonal zigzag across the block). As this block is the DCT of the
image the last values tend to be zeros and can be compressed well.
The result of this run length encoding is then Huffman encoded (or
artimetically encoded for lossless) which minimises the number of bits
required to store each value.
This therefore is the basic outline of JPEG: Find frequency components
(DCT). Divide result by a quntisation table that minimises the residual.
Runlength encode this in such a way that you minimise the number of
bytes required to represent the residual. Huffman encode the result.
Huffman encoding finds the minimum number of bits that can represent the
data with which it is presented. So it is a case of optimise, optimise,
optimise.
If you diff two images that are similar then most of values in the 8x8 blocks in
the diff image will be close to zero. The spatial frequency components
of the block will therefore be mostly zero. Quantisation with a suitable
table will reduce the residuals even more. The block will be pretty
much all zeros. The run length encoding should
therefore "see" very long runs of the same value which it should be
efficient at compressing. This all results in a diff image being far more compressible
then the original image (as long as not much changed between one image
and the next).
So you transmit the JPEGed original image[i1] then you find the diff
between it and its sunsequent image (image[i2]-image[i1]). This is
encoded and transmitted. you keep transmitting encoded diff images until
the reconstruction of the original (done on the server of course for
just this reason) has to great an error when compared to the image it
should be. then you transmit another "full" image to resynch the clients
(taken from "Graphics File Formats: Reference and Guide". Wayne, C. and
Shepherd, B., 1995 chp 10. ISBN 1-884777-00-7)
MPEG actually uses images called I, P, B and D.
I is Intra-coded: encoded using only information from itself.
P is predictve-coded: picture encoded using motion compensated
prediction from past I or P pictures. (a fancy diff image that improves
compression more then diff does)
B is Bidirectional predictive-coded: picture encoded using motion
compensated prediction from past and/or future I and/or P pictures.
D is DC-coded: coded using only info from itself. Stores only the DC
components of the DCT blocks. This is the odd one out. MPEG streams are
either all D frames or contain none. Streams that are all D frames are
very lossy but fast to process (good for video editing previews).
matfud
> If you diff two images that are similar then most of values in the
> 8x8 blocks in the diff image will be close to zero. The spatial
> frequency components of the block will therefore be mostly zero.
> Quantisation with a suitable table will reduce the residuals even
> more. The block will be pretty much all zeros. The run length
> encoding should therefore "see" very long runs of the same value
> which it should be efficient at compressing.
Ok, this is the part that is difficult. I have done some experimentation using a 24-bit 320x240 picture (of a person looking into a webcam) and converting it to a JPEG. The resulting JPEG is about 10 KB. I then use the eraser tool and white-out about half the pixels (basically the back-ground is whited out, leaving only the head). I then save that as a JPEG and the resulting file size is 12 KB!
So, I must be missing something on how the new image is computed, if it is still a JPEG. Is the quality of the difference image lower than the quality of the original image?
Are you sure you got them the right way round. When I tried this the
size of the image dropped from 12k to about 8k.
It must be noted that whiting (or blacking) out chuncks of the image
is not quite the same as diffing the image. (but there shouldn't be
too much difference between the two)
It could be an artifact of the quality value you provide or the
shape of the image that has not changed etc. Anyway, as I said, MPEG
actually uses predictive encoding which is a gloryfied version of
diffing.
matfud
> Are you sure you got them the right way round. When I
> tried this the
> size of the image dropped from 12k to about 8k.
>
Yes, but as you said it is probably a high quality factory with sharp lines (not good for JPEG).
I have started writing an all 100% Java JPEG decoder (to work with Java 1.1). I found a link for one, but unfortunately it was broken. I also found the source for an encoder, so that will make things easier.
I will post when it's finished, until then if somebody has a (working) link that would be appreciated.
Also you can use buffering as in MS Media Player.
1. Applet sets cache-refresh time, for example 5 sec.
2. Initially applet gets 5 sec movie(JPEG or whatever).
3. When applete received JPEG and plays movie, you get another
5 sec movie.
The main relation here is network speed(NS) and cache time(CT).
CT = K / NS, where K - some coefficient.
Good bye, see ya.
> I will post when it's finished, until then if somebody
> has a (working) link that would be appreciated.
>
I am nearing completion. However, I have noticed that the
decoding of the JPEG image seems to be slower than the encoding because of all the prefix matching that has to be done to decode the Huffman codes.
Is there a simple way to improve the speed of the matching? Suppose the buffer contains 111101101011001. Also, suppose the AC Huffman table is about 256 in length. I don't want to have to check the prefix of 256 code words to decode a single entry. However, I am worried that building a tree won't be that more efficient, plus it will take a lot of time to implement. Does anyone have experience with the speed gain?
A tree should offer a large performance increase overa search. howeveryou will have to be careful about the way you implement the tree to achieve the benefit.
try this
public class HuffmanTree {
private TreeNode root = new TreeNode(0);
private TreeNode previousState=null;
private int shiftCount=0;
private int numBits=32;
private int code;
private int[] results = new int[32]; // larger then needed
private int resultCount = 0;
/**
* clears any state the tree has
*/
public void clearState() {
code=0;
previousState=null;
}
/**
* adds a node to the tree. All nodes in the path will be created if they don't
* already exist.
*
* @param value the value of the leaf
* @param c the code associated with the value
* @param numBits how many of the high order bits are part of the code
*/
public void addNode(int value,int c, int numBits) {
if(numBits<1) throw new IllegalArgumentException("numBits must be greater then 0");
if(numBits>32) numBits = 32;
System.out.println("Add node called");
TreeNode tn = null;
TreeNode parent = root;
boolean goRight=false;
while(numBits>0) {
goRight=c<0;
if(goRight)
tn=parent.right;
else
tn=parent.left;
if(tn==null) {
tn = new TreeNode(0);
if(goRight)
parent.right=tn;
else
parent.left=tn;
}
if(numBits==1)
tn.value=value;
parent=tn;
c=c<<1;
numBits--;
}
}
/**
* finds the numbers associated with the codes in the first numBits of the <code>code</code>
* parameter
*/
public int[] get(int code, int numBits){
if(numBits<1) throw new IllegalArgumentException("numBits must be greater then 0");
if(numBits>32) numBits = 32;
shiftCount=0;
resultCount = 0;
this.numBits=numBits;
this.code = code;
// if I was partway through a search an ran out of bits then
// continue from where I left off
if(previousState!=null){
TreeNode ps = previousState;
previousState=null;
int tmp = findMatch(ps); //start search at old tree position
//still not enough bits. This should never happen as the
// tree construction only allows a max of 32 bits to be used
if(previousState!=null) return new int[0];
results[resultCount++]=tmp;
}
// general search loop
while(shiftCount<this.numBits) {
System.out.println("finding resultCount "+resultCount);
int tmp = findMatch(root);
if(previousState==null) {
results[resultCount++]=tmp;
}
}
// create new array populate with results and return it
int[] r = new int[resultCount];
for(int i=0;i<resultCount;i++)
r[i]=results[i];
return r;
}
/**
* traverses the tree and returns the value at the terminal leaf. If not enough
* bits remain to find the leaf it saves its state in the previousState variable
*
*/
protected int findMatch(TreeNode node) {
while((node.left!=null)&&(node.right!=null)) {
if(shiftCount==numBits) {
previousState = node;
return 0;
}
boolean goRight = code<0;
if(goRight)
node=node.right;
else
node=node.left;
code=code<<1;
shiftCount++;
}
return node.value;
}
public static void main(String[] args) {
int[][] table = {{'e',0,3},
{'r',64,3},
{'i',112,4},
{'b',160,5},
{'t',224,3},
{'a',32,4},
{'v',96,5},
{'s',128,4},
{'g',168,5},
{' ',192,3},
{'d',48,4},
{'y',104,5},
{'n',144,4},
{'h',176,4}};
HuffmanTree ht = new HuffmanTree();
// populate tree
for(int i=0;i<table.length;i++) {
int code = table[i][1]<<(24);
//System.out.println(Integer.toBinaryString(code));
ht.addNode(table[i][0],table[i][1]><<24,table[i][2]);
}
//set up two ints with a variety of values. The first int cannot
// contain all required bits (has an extra 00 at the end). The
// tree remembers this state and when the second int is passed in it
// continues so the first letter of the second it appear to be "a" (a=0010)
Long bin = Long.valueOf("11101000100110010101110001100000",2); //"travg etc."
Long bin2 = Long.valueOf("10011101001110010011001000010010",2);
int v = bin.intValue();
int[] values = ht.get(v,32);
System.out.println("\nlength of results = "+values.length);
for(int i=0;i<values.length;i++){
System.out.println((char)values[i]);
}
v = bin2.intValue();
values = ht.get(v,32);
System.out.println("\nlength of results = "+values.length);
for(int i=0;i<values.length;i++){
System.out.println((char)values[i]);
}
}
class TreeNode {
protected TreeNode left;
protected TreeNode right;
protected int value;
protected TreeNode(int value) {
this.value = value;
}
}
}
This tree is designed so you can pull ints straight from a stream an
get the encoded values for that int. (each int can contain multiple
values). in general you call the get() method with a bitNum of 32 unless
you know that the huffman encoded stream ends before this. Beware that
the data must be in the MSbits of the int.
The tree allows searches for strings that are over 32 bits in length
(32 levels deep) however there is no facility to create such a tree
as the addNode method only allows 32 bit codes to be used. The addnode
method adds a single code. The code must be in the MSBs of the int.
The test table is taken from
http://www.la.unm.edu/~mbrettin/algorithms/huffman.html
I just modified to codes so they are byte aligned (therefore a 24 bit
left shift ensures the values are in the MSB's for easy addition to the
tree.
matfud
">
Getting there! Here is an overview of the structure:
public static void decodeJPEGData(byte[] data, int offset, JPEGTable jt, ... some additional buffers);
//data is the raw data
//offset is the starting index
int buffer = 0;
// load buffer with data from byte
int[] ret = new int[3]; // ret[0] stores the index, ret[1] stores the num bits to read, ret[2] stores the code
jt.match(buffer, ret); // goes to the jpeg table to find a match, result is stored in ret
if (ret[0] == -1) // no match (error)
else {
// ...
}
Actually, the JPEGTable contains 4 match methods, matchLumAC, matchLumDC, matchChromAC, matchChromDC. It will be interesting to see the difference in speed, but first I must finish doing the inv DCT...
for the Huffman tree decoder I posted you can sinply use byte arrays by casting
a byte to an int, right shifting by 24 and calling the get method with the result
and a numBits arg of 8.
It should be a bit faster if you can compose an int from 4 bytes and pass that in
though.
How are you implementing the Cosine Transform. A simple (but not efficient) method
is to use a fourier transform. However there are many techniques. Just interested
to know which one you chose.
matfud
Finally I finished the decoder. The JFIF specification is to use the DCT (Discrete Cosine Transform).
Here is the implementation, not sure if it is the same one you are talking about:
public class DCT {
private static double[][] c = new double[8][8];
static {
for (int i = 0; i < 8; i++)
c[0][i] = 1.0 / Math.sqrt(8);
for (int i = 1; i < 8; i++)
for (int j = 0; j < 8; j++)
c[i][j] = Math.sqrt(0.25) * Math.cos((2.0 * j + 1.0) * i * Math.PI * 0.0625);
}
/**
All arrays are 8x8. The result is stored in the "out" array. The temp array is used for
temporary storage.
*/
public static void IDCT(double[][] in, double[][] out, double[][] temp) {
for (int i = 0; i < 8; i++) {
for (int j = 0; j < 8; j++) {
temp[i][j] = 0.0;
for (int k = 0; k < 8; k++) {
temp[i][j] += in[i][k] * c[k][j];
}
}
}
for (int i = 0; i < 8; i++) {
for (int j = 0; j < 8; j++) {
out[i][j] = 0.0;
for (int k = 0; k < 8; k++) {
out[i][j] += c[k][i] * temp[k][j];
}
}
}
}
}
Actually, this implementation is considered slow. There is a faster implementation that comes with the
source code provided by the "Independent JPEG Group's software". However, it is written in C and is a
little cryptic. I will have to work on it later.
> Finally I finished the decoder. The JFIF
> specification is to use the DCT (Discrete Cosine
> Transform).
That is the spec. I was saying that if you can't find a pre written DCT but can find a pre written FFT algorithm
then you can persuade the FFT to perform the DCT for you. However it is not an efficient approach (but
can save coding time which is irrelevent now as you have written a DCT)
How much more of the decoder you you need to implement?
Did the huffman tree end up being faster then the tables?
matfud
> How much more of the decoder you you need to
> implement?
The decoder is finished. But there are some improvements to make, such as implementing the faster IDCT and looking for other ways the overall decompression can be made faster.
> Did the huffman tree end up being faster then the
> tables?
Yes. For a JPEG file of about 125 KB, the decoding part is 10-20 milliseconds faster. However, there were some changes to the HuffmanTree. The HuffmanTree you provided allows for multiple values to be associated with a prefix (it looks like) .However, when dealing with a stream of bytes there can only be one way to decompress the stream (the prefix property).In short, here is the modified Huffman tree.The "index" variable is really just the "value" variable. The "size" variable is the same as the "numBits" variable. I was deciding whether or not to store the size in the node or just use a counter when traversing the tree.
public class HuffmanTree {
private Node root = new Node();
private class Node {
public Node leftChild = null;
public Node rightChild = null;
public int index = 0;
public int size = 0;
public Node() {}
public Node(int index, int size) {
this.index = index;
this.size = size;
}
}
public void add(int index, int size, int code) {
code = code << (32 - size);
Node newNode = new Node(index, size); // create node here because size is changed
Node n = root;
while (size > 1) {
if (code < 0) {
if (n.rightChild == null)
n.rightChild = new Node();
n = n.rightChild;
}
else {
if (n.leftChild == null)
n.leftChild = new Node();
n = n.leftChild;
}
size--;
code = code << 1;
}
if (code < 0)
n.rightChild = newNode;
else
n.leftChild = newNode;
}
public void match(int buffer, int[] ret) {
Node n = root;
while (true) {
if (buffer < 0) { // 1 at start of buffer
n = n.rightChild;
}
else { // 0 at start of buffer
n = n.leftChild;
}
if (n == null) break;
if (n.leftChild == null && n.rightChild == null) {
ret[0] = n.index;
ret[1] = n.size;
//ret[2] == don't care
return;
}
buffer = buffer << 1;
}
ret[0] = -1; // no match found
}
}
The tree I posted maps one value to one prefix. Its more complex then your tree as it assumes that there
will be multiple pretfixs in the buffer variable (the int) and that the last prefix in that variable may or may not
be complete. It may require the next int of the buffer to allow the prefix to be fully decoded.
for example say a four bit prefix starts at the 30th bit of the buffer int. It would require 2 bits from the
next int to be able to complete the lookup. A fair bit of the code is there to ensure that a partial state
can be held between calls to the get method.
Also another reason the tree is a bit complicated is that it allows arbitrary length codes to be looked up
(although the tree addition method does not allow codes more then 32 bit long to be entered.)
The reason I chose to handle the lookup in this wierd way (rather then just passing in a byte buffer for
example) is that it keeps the code to represent the tree fairly small by getting rid of the array handling stuff.
It also is suitable for streaming systems as it does not need to store the entire array in memory to start
decoding.
Using an int rather then a byte allows the get method to be called less often (the int stores more prefixes
then the byte could)
To use it properly you will need a method that can take four bytes (from your socket) and squeeze them
into an int.
public static int makeInt(byte b3, byte b2, byte b1, byte b0) {
return (int)((((b3 & 0xff) << 24) |
((b2 & 0xff) << 16) |
((b1 & 0xff) << 8) |
((b0 & 0xff) << 0)));
}
However you might want a slightly more efficient one that takes a byte[] that way you can improve the
performance obtained reading from your socket by using the read(byte[]) style methods.
matfud
