The gist of the proof is that entropy is currenly DEFINED to be the what you get from an arithmetic encoder.
In the stone ages, there was a theoretical definition of entropy involving the number of bits needed to encode characters using a first order probability model, i.e. each character occurs with a certain fixed probability. The theory indicated that you could need a fractional number of bits to matach the probability of a certain character occuring.
The best at the time encoding method, Huffman, rounded that theoritical number up to the nearest bit so there was a slight difference between the encoding you got from Huffman and what you could theoretically do.
Then with a different viewpoint, thinking of the stream of code bits as being a single long floating point number, it was realized that one could in fact use a fractional number of bits to encode a character. (basically if a character theoretically requires 4.3 bits to encode, Huffman rounded that up to 5 and at the end of 5 bits you knew the character and knew nothing about the next character. With arithmetic coding, after 4 bits you almost know what character it is, and by 5 bits you know exactly what it is and also have biased the result for determining the next character. This propagation of error terms essentially allows one to match fractional number of bits used in the code stream to the fraction that is the probability of the next character.
The definition of entropy changed when there was the realization that one did not need to stick with a single character model of probability, but could use digram statistics or ANY other probability model to drive your arithmetic encoding. Once this was realized, it was simple to change the definition of entropy to include that notion. In short, entropy is essentially DEFINED as being the best you can do with an arithmetic encoder.
I know, that's not a proof. It is a hand wave of one. The point is, you need to look at your definition of entropy. See how it defines code size in terms of probabilities, improbabile strings should require longer codes, and then see how the length of an arithmetic code does in fact correspond to that probability model.
An arithmetic code simply requires a bit string long enough to locate you into an interval with a uniquly defined prefix. The width of that interval determines how many bits you need in your floating point number to be within that interval, but the width of the interval also happens to be exaclty the probability of that particular string occuring based on your probability model.
So your probability model basically maps strings onto the real line where the lengths of the intervals match the probability of that string occruing, and your arithmetic encoding represents a floating point number on that real line. The precision of your floating point number (the fact that you don't have an infinite number of bits) means that your floating point number is actually an interval.
Less likely strings map into smaller intervals and thus require more precision in the floating point number to encode. So the proof really is that entropy is defined to be what arithmetic encoding does.