UDP packet checksum error
We have two SF4800 servers running Oracle RAC on Solaris 9 and Sun Cluster 3.1. We found that the udpInCksumErrs counter jumps suddenly when the server load increases. We used a packet analyzer to monitor mirrored network ports for both servers, but failed to find any UDP packets with an incorrect checksum.
What is the meaning of udpInCksumErrs? Will it affect the system?
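To tie the jumps to load, it can help to compare two snapshots of the counter. This is only a rough sketch: it assumes the counter is saved as text in the "name = value" layout that `netstat -s` prints, and the sample text and numbers below are made up for illustration, not real output from these servers.

```python
import re

# Illustrative snapshots only: shaped like "name = value" statistics text,
# with invented numbers. They are not captured from a real system.
SAMPLE_BEFORE = """
UDP     udpInDatagrams      = 1000     udpInErrors         =     0
        udpInCksumErrs      =    3     udpInOverflows      =     0
"""
SAMPLE_AFTER = """
UDP     udpInDatagrams      = 5000     udpInErrors         =     0
        udpInCksumErrs      =   15     udpInOverflows      =     0
"""


def parse_counters(text):
    """Return {name: value} for every 'name = number' pair in the text."""
    pairs = re.findall(r"(\w+)\s*=\s*(\d+)", text)
    return {name: int(value) for name, value in pairs}


def counter_delta(before, after, name="udpInCksumErrs"):
    """How much the named counter grew between the two snapshots."""
    return parse_counters(after)[name] - parse_counters(before)[name]


if __name__ == "__main__":
    print("udpInCksumErrs grew by", counter_delta(SAMPLE_BEFORE, SAMPLE_AFTER))
```

Taking the snapshots at a fixed interval and logging the delta next to the load figures would show whether the two really move together.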
By ACT_Alvin at 2007-11-26 7:06:24

# 1
udpInCksumErrs is a counter of incoming UDP packets with bad UDP checksums. Since I imagine Solaris drops these packets when it receives them, I guess it might cause some confusion for the application that sent them. 7/M.
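For reference, here is a minimal sketch of what a "bad UDP checksum" means for IPv4, following the checksum definition in RFC 768 (one's complement sum over a pseudo-header, the UDP header and the payload). It is not how Solaris implements the check; the function names and the addresses and ports in the demo are made up.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's complement sum, padding odd-length data with a zero byte."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:                      # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def pseudo_header(src_ip: str, dst_ip: str, udp_length: int) -> bytes:
    """IPv4 pseudo-header: source, destination, zero byte, protocol 17, length."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, 17, udp_length))


def udp_checksum_ok(src_ip: str, dst_ip: str, segment: bytes) -> bool:
    """segment is the UDP header plus payload exactly as captured."""
    stored = struct.unpack("!H", segment[6:8])[0]
    if stored == 0:
        return True                         # IPv4 sender did not compute one
    data = pseudo_header(src_ip, dst_ip, len(segment)) + segment
    return ones_complement_sum(data) == 0xFFFF


def build_udp(src_ip, dst_ip, sport, dport, payload: bytes) -> bytes:
    """Hypothetical helper for the demo: build a segment with a valid checksum."""
    length = 8 + len(payload)
    header = struct.pack("!HHHH", sport, dport, length, 0)
    data = pseudo_header(src_ip, dst_ip, length) + header + payload
    checksum = (~ones_complement_sum(data)) & 0xFFFF
    if checksum == 0:
        checksum = 0xFFFF                   # zero is reserved for "no checksum"
    return struct.pack("!HHHH", sport, dport, length, checksum) + payload


if __name__ == "__main__":
    good = build_udp("172.16.0.1", "172.16.0.2", 40000, 40001, b"lock request")
    bad = good[:-1] + bytes([good[-1] ^ 0x01])   # flip one bit in the payload
    print(udp_checksum_ok("172.16.0.1", "172.16.0.2", good))
    print(udp_checksum_ok("172.16.0.1", "172.16.0.2", bad))
```

A receiver that gets the second datagram would count it as a checksum error, which is what the counter tracks.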
# 2
Thanks for your information.
We can't find any UDP packets with checksum errors coming from our network. So we suspect the corrupted packets are coming from the heartbeats, but that is very difficult to trace.
Are they related to, or triggered by, external packets?
Does Sun Cluster have a tool to trace such a problem?
# 3
At last, we found the corrupted UDP packets in the heartbeat traffic between the two Sun Cluster nodes. We captured network packets on one heartbeat interface for 30 seconds and got 12 UDP packets with bad checksums. Any possible causes of this problem?
# 4
You don't say what the hardware is or what OS you are running or indeed whether the checksum errors are on both links or not. It's going to be difficult to recommend a course of action without these facts.
Thanks,
Tim
# 5
Well, he wrote that "We got two SF4800 running Oracle RAC on Solaris 9 and Sun Cluster 3.1." ... 7/M.
# 6
Doh, should have READ back along the rest of the thread! Apologies!!!
Tim
# 7
Shift happens.
Interesting point though, I wonder if it occurs on both nodes or just one.
It's quite worrying if it occurs on the heartbeat interface; it might mean that the cluster loses heartbeats.
After all, the interconnect should just be a direct connection with a crossover cable.
7/M.
# 8
As far as I know, the Sun Cluster heartbeat does not use UDP. It just uses raw Ethernet frames of type 0833. You can see these if you do a snoop on the interface. I'm pretty sure that the UDP packets are from the Oracle RAC distributed lock manager/cache fusion.
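To separate the two kinds of traffic in a capture, something like the sketch below could be used. It assumes the "type 0833" above is the hexadecimal EtherType 0x0833 (taken from this post, not checked elsewhere), that frames are untagged Ethernet II with no VLAN header, and that `classify_frame` is just a made-up name.

```python
import struct

HEARTBEAT_ETHERTYPE = 0x0833   # assumption: the "type 0833" quoted in this post
ETHERTYPE_IPV4 = 0x0800
IPPROTO_UDP = 17
ETH_HEADER_LEN = 14            # destination MAC, source MAC, EtherType
MIN_IPV4_HEADER_LEN = 20


def classify_frame(frame: bytes) -> str:
    """Label a raw, untagged Ethernet II frame by what it carries."""
    if len(frame) < ETH_HEADER_LEN:
        return "truncated"
    ethertype = struct.unpack("!H", frame[12:14])[0]
    if ethertype == HEARTBEAT_ETHERTYPE:
        return "heartbeat"
    if ethertype == ETHERTYPE_IPV4:
        if len(frame) < ETH_HEADER_LEN + MIN_IPV4_HEADER_LEN:
            return "truncated"
        protocol = frame[ETH_HEADER_LEN + 9]    # IPv4 protocol field
        return "udp" if protocol == IPPROTO_UDP else "other-ipv4"
    return "other"


if __name__ == "__main__":
    macs = b"\x00" * 12
    heartbeat = macs + b"\x08\x33" + b"\x00" * 46
    ipv4_udp = macs + b"\x08\x00" + b"\x45" + b"\x00" * 8 + b"\x11" + b"\x00" * 18
    print(classify_frame(heartbeat), classify_frame(ipv4_udp))
```

Only frames labelled "udp" carry a UDP checksum at all, so those are the ones worth checking further.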
I would have thought, again, that RAC would create the packets it needed and ask Solaris to send them. (I'm checking this with a colleague). From scanning the bugs database, it would appear that there are a number of recommendations:
* Ensure the latest patches have been applied
* Make sure the system has been installed in accordance with Sun's EIS checklist (if Sun installed the system this would have happened).
Which says:
When using supported network adapters which use the *ce* driver for private
transport, insert into file /etc/system:
set ce:ce_taskq_disable=1
Also see InfoDoc:79189
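A quick way to confirm that the setting is really active (and spelled with the "ce:" module prefix) is to scan a copy of the file. This is a rough sketch only: `ce_taskq_disabled` is a made-up helper, it treats lines starting with "*" or "#" as comments, and it only recognises the literal value 1.

```python
def ce_taskq_disabled(system_text: str) -> bool:
    """True if an active 'set ce:ce_taskq_disable=1' line is present."""
    for raw_line in system_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and lines treated here as comments.
        if not line or line.startswith("*") or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2 or parts[0] != "set":
            continue
        # Drop whitespace so "ce:ce_taskq_disable = 1" also matches.
        assignment = "".join(parts[1].split())
        if assignment == "ce:ce_taskq_disable=1":
            return True
    return False


if __name__ == "__main__":
    # Made-up file content for the demo, not taken from the poster's system.
    sample = "* private interconnect tuning\nset ce:ce_taskq_disable=1\n"
    print(ce_taskq_disabled(sample))
```

Pass it the text of /etc/system, for example `ce_taskq_disabled(open("/etc/system").read())`. A line without the module prefix, or one that is commented out, is reported as not set.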
Tim
# 9
Thanks, Abrante. We have the same problem on both nodes. They are directly connected with two crossover cables. We found UDP packets with bad checksums on all heartbeat interfaces.
Message was edited by: ACT_Alvin
# 10
Tim
Thanks for your suggestion.
Actually, we have had 3 Oracle RAC incidents before. All of them were related to the block locking problem. We suspect the UDP packets with bad checksums are the root cause.
Our system was installed by a Sun certified service provider, and I've checked the "system" file, which already sets ce_taskq_disable to 1.
# 11
I assume that your phrase 'set ce_taskq_disable to 1' is a typo, as it should be 'set ce:ce_taskq_disable=1' (with a ':' not an '_').
If this is what you have, I would open a case with Sun support, as they'll need to go through a detailed check of the patches and the settings you have in /etc/system.
Bottom line, it is probably a tuning issue. The corrupt headers will cause Oracle to have to retransmit the packets, resulting in lower performance.
Tim
# 12
Thanks, Tim. What information do I need to prepare for the investigation?
# 13
Support staff will ask you for explorer output. It should have been installed with the cluster and usually lives in /opt/SUNWexplo (if I remember correctly). I'm not entirely sure what the right options are. I think it is simply:
# /opt/SUNWexplo/bin/explorer
Tim
# 15
Tim, have you opened a case in SunSolve Support for us? May I have the service request number?
# 16
It's up to you to call (or email, or use the Online Support Center) your local service center. If nothing else, they'll need your contract number and the serial number of the affected box. 7/M.
# 17
Thanks, Abrante. I've submitted a service request in the Online Support Center.