First, if the sound stream is unending (streaming media), you might want to use UDP and allow packets to be lost. That is the practical way to keep playback in step with the source as it is being sent. With TCP, a lost packet is retransmitted and everything behind it waits, so the receiver ends up "lagging" behind the source as it aims to receive every last packet. It's sort of like frameskip: you have to drop a bit to keep up the speed.
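As a rough sketch of the "drop a bit to keep up" idea, the receiver can tag each UDP datagram with a sequence number and simply discard anything that arrives late. The 4-byte header layout and the names `pack_chunk` and `DropLateReceiver` are assumptions made up for this illustration, not part of any standard protocol, and sequence-number wraparound is ignored.

```python
import struct

# Hypothetical wire format: 4-byte big-endian sequence number, then raw audio.
HEADER = struct.Struct("!I")


def pack_chunk(seq, audio):
    """Build one datagram payload for a chunk of audio."""
    return HEADER.pack(seq) + audio


class DropLateReceiver:
    """Plays chunks in arrival order and never waits for a missing one."""

    def __init__(self):
        self.last_seq = -1   # highest sequence number played so far
        self.skipped = 0     # chunks we moved past without playing

    def accept(self, datagram):
        """Return the audio to play now, or None if the chunk is stale."""
        (seq,) = HEADER.unpack_from(datagram)
        if seq <= self.last_seq:
            return None      # late or duplicate: drop it, do not rewind
        self.skipped += seq - self.last_seq - 1
        self.last_seq = seq
        return datagram[HEADER.size:]
```

In a real receiver, each datagram read from a UDP socket with `recvfrom` would be passed to `accept`, and whatever comes back would go straight to the audio output.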
If the sounds are finite-length, you might try first uploading the entire sound, then issuing a "start playing" command once all receivers are ready.
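A minimal sketch of the "upload first, then start" approach: the sender keeps track of which receivers have confirmed they hold the whole sound, and only issues the start command once none are left. The class name `PlaybackCoordinator` and the receiver IDs are hypothetical, and how the "ready" and "start" messages travel over the network is left out.

```python
class PlaybackCoordinator:
    """Tracks which receivers still have to confirm the upload."""

    def __init__(self, receiver_ids):
        self.pending = set(receiver_ids)  # receivers not yet ready
        self.started = False

    def mark_ready(self, receiver_id):
        """Record a 'ready' report; return True exactly once, when the
        start command should be sent to every receiver."""
        self.pending.discard(receiver_id)
        if not self.pending and not self.started:
            self.started = True
            return True
        return False
```

The start command itself still reaches each receiver after a slightly different delay, so this narrows the gap between receivers rather than removing it.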
Mind you that there is simply no way to get 100% accurate timing over network connections. Even clock synchronization protocols must allow for a few milliseconds of error, and those go to great lengths to minimize it. Full synchronization is simply not an option: every network has inherent delays, and those delays have inherent randomness to them, as well as variance from connection to connection. There is no way to eliminate that.
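To see where the leftover error comes from, here is a sketch of the usual four-timestamp estimate that NTP-style protocols are built on. It assumes the delay is the same in both directions, and that assumption is exactly what a real network breaks: the offset estimate can be wrong by up to half the round-trip delay. The function name and argument names are made up for this example.

```python
def estimate_offset(t0, t1, t2, t3):
    """Estimate how far the server clock is ahead of the client clock.

    t0: client clock when the request was sent
    t1: server clock when the request arrived
    t2: server clock when the reply was sent
    t3: client clock when the reply arrived

    Returns (offset, round_trip_delay). The offset is only exact if the
    outbound and return delays are equal (an assumption, rarely true).
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay
```

For example, with a server clock 100 ms ahead and 5 ms of delay each way, `estimate_offset(0.0, 105.0, 106.0, 11.0)` gives an offset of 100 and a round-trip delay of 10. If the 10 ms were instead split unevenly between the two directions, the same timestamps could come from a different true offset, and the receiver has no way to tell.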