UDP packet checksum error

We have two SF4800 servers running Oracle RAC on Solaris 9 and Sun Cluster 3.1. We found that the udpInCksumErrs counter jumps suddenly when the server load increases. We tried using a packet analyzer to monitor mirrored network ports for both servers, but failed to find any UDP packets with an incorrect checksum.

What is the meaning of udpInCksumErrs? Will it affect the system?
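To put a number on how fast the counter climbs under load, one option is to take two snapshots of the statistics and subtract them. The sketch below assumes the counters are available as "name = value" pairs, roughly the way `netstat -s` prints them; the exact layout and the sample values are assumptions, not data from our systems.

```python
import re


def parse_counters(text):
    """Return {name: int} for every 'name = value' pair found in the text."""
    return {name: int(value)
            for name, value in re.findall(r"(\w+)\s*=\s*(\d+)", text)}


def counter_delta(before, after, name="udpInCksumErrs"):
    """Difference of one counter between two statistics snapshots."""
    return parse_counters(after)[name] - parse_counters(before)[name]


# Made-up snapshots for illustration only.
SAMPLE_BEFORE = """
UDP     udpInDatagrams      =  1000     udpInErrors         =     0
        udpInCksumErrs      =     5     udpInOverflows      =     0
"""
SAMPLE_AFTER = """
UDP     udpInDatagrams      =  4000     udpInErrors         =     0
        udpInCksumErrs      =    17     udpInOverflows      =     0
"""

if __name__ == "__main__":
    print(counter_delta(SAMPLE_BEFORE, SAMPLE_AFTER))
```

Taking the snapshots at a fixed interval and logging the delta alongside the system load would show whether the two really move together.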

[388 byte] By [ACT_Alvin] at [2007-11-26 7:06:24]
# 1
udpInCksumErrs is a counter that counts the number of incoming UDP packets with bad UDP checksums. Since I imagine Solaris drops these packets when it receives them, I guess it might cause some confusion for the application that sent them. 7/M.
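For reference, here is a minimal sketch of what "bad UDP checksum" means for UDP over IPv4, following the RFC 768 definition (one's-complement sum over a pseudo-header plus the UDP header and data). The addresses, ports, and payload are hypothetical, and this only illustrates the arithmetic; it is not how the Solaris kernel implements the check.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum of data, zero-padded to an even length."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def pseudo_header(src_ip: str, dst_ip: str, udp_length: int) -> bytes:
    """IPv4 pseudo-header: source, destination, zero, protocol 17, UDP length."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, 17, udp_length))


def build_udp(src_ip, dst_ip, src_port, dst_port, payload: bytes) -> bytes:
    """Build a UDP segment (header + payload) with a valid checksum."""
    length = 8 + len(payload)
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    data = pseudo_header(src_ip, dst_ip, length) + header + payload
    checksum = ~ones_complement_sum(data) & 0xFFFF
    checksum = checksum or 0xFFFF  # a computed 0 is transmitted as all ones
    return struct.pack("!HHHH", src_port, dst_port, length, checksum) + payload


def udp_checksum_ok(src_ip, dst_ip, segment: bytes) -> bool:
    """True if the segment's checksum verifies, or if none was sent (field is 0)."""
    (stored,) = struct.unpack("!H", segment[6:8])
    if stored == 0:
        return True
    data = pseudo_header(src_ip, dst_ip, len(segment)) + segment
    return ones_complement_sum(data) == 0xFFFF


if __name__ == "__main__":
    # Hypothetical addresses and ports, chosen only for the demonstration.
    good = build_udp("192.168.0.1", "192.168.0.2", 5000, 5001, b"lock request")
    bad = good[:-1] + bytes([good[-1] ^ 0x01])  # flip one bit in the payload
    print(udp_checksum_ok("192.168.0.1", "192.168.0.2", good))
    print(udp_checksum_ok("192.168.0.1", "192.168.0.2", bad))
```

A receiver that finds the sum does not verify has no way to tell which bytes were damaged, so discarding the whole datagram is the only safe option.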
mAbrante at 2007-7-6 15:53:30 > top of Java-index,Solaris Operating System,Solaris Essentials - General Technical Questions...
# 2

Thanks for your information.

We can't find any UDP packets with checksum errors coming from our network. So we suspect the corrupted packets are coming from the heartbeats, but this is very difficult to trace.

Are they related to, or triggered by, external packets?

Does Sun Cluster have a tool to trace such a problem?

ACT_Alvin at 2007-7-6 15:53:30
# 3
At last, we found the corrupted UDP packets in the heartbeat traffic between the two Sun Cluster nodes. We captured network packets on a heartbeat interface for 30 seconds and got 12 UDP packets with bad checksums. What are the possible causes of this problem?
ACT_Alvin at 2007-7-6 15:53:30
# 4
You don't say what the hardware is or what OS you are running or indeed whether the checksum errors are on both links or not. It's going to be difficult to recommend a course of action without these facts. Thanks, Tim
TimRead at 2007-7-6 15:53:30
# 5
Well, he wrote that "We got two SF4800 running Oracle RAC on Solaris 9 and Sun Cluster 3.1." ... 7/M.
mAbrante at 2007-7-6 15:53:30
# 6
Doh, should have READ back along the rest of the thread! Apologies!!! Tim
TimRead at 2007-7-6 15:53:30
# 7

Shift happens.

Interesting point though, I wonder if it occurs on both nodes or just one.

It's quite worrying if it occurs on the heartbeat interface; it might mean that the cluster loses heartbeats.

After all, the interconnect should just be a direct connection with a crossover cable.

7/M.

mAbrante at 2007-7-6 15:53:30
# 8

As far as I know, the Sun Cluster heartbeat does not use UDP. It just uses raw Ethernet frames of type 0833. You can see these if you do a snoop on the interface. I'm pretty sure that the UDP packets are from the Oracle RAC distributed lock manager/cache fusion.
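As a rough illustration of that distinction, the sketch below sorts captured Ethernet frames by their EtherType field. The value 0x0833 is simply the type quoted above and is not independently verified here; the function name and labels are made up for the example, and VLAN-tagged frames are not handled.

```python
import struct

ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_HEARTBEAT = 0x0833  # value quoted in this post; treat as an assumption
IPPROTO_UDP = 17


def classify_frame(frame: bytes) -> str:
    """Label a raw Ethernet frame as 'heartbeat', 'udp', 'other', or 'truncated'."""
    if len(frame) < 14:  # 6 bytes destination + 6 bytes source + 2 bytes type
        return "truncated"
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype == ETHERTYPE_HEARTBEAT:
        return "heartbeat"
    if ethertype == ETHERTYPE_IPV4 and len(frame) >= 14 + 20:
        protocol = frame[14 + 9]  # protocol field of the IPv4 header
        if protocol == IPPROTO_UDP:
            return "udp"
    return "other"


if __name__ == "__main__":
    macs = b"\x00" * 12  # placeholder destination and source addresses
    heartbeat = macs + struct.pack("!H", ETHERTYPE_HEARTBEAT) + b"\x00" * 46
    ipv4_udp = (macs + struct.pack("!H", ETHERTYPE_IPV4)
                + bytes([0x45]) + b"\x00" * 8 + bytes([IPPROTO_UDP]) + b"\x00" * 10)
    print(classify_frame(heartbeat), classify_frame(ipv4_udp))
```

If every frame with a bad checksum lands in the "udp" bucket, the errors belong to whatever is sending UDP over the interconnect rather than to the heartbeat frames themselves.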

I would have thought, again, that RAC would create the packets it needed and ask Solaris to send them. (I'm checking this with a colleague). From scanning the bugs database, it would appear that there are a number of recommendations:

* Ensure the latest patches have been applied

* Make sure the system has been installed in accordance with Sun's EIS checklist (if Sun installed the system this would have happened).

Which says:

When using supported network adapters which use the *ce* driver for private transport, insert into file /etc/system:

set ce:ce_taskq_disable=1

Also see InfoDoc:79189

Tim

TimRead at 2007-7-6 15:53:30
# 9
Thanks, Abrante. We have the same problem on both nodes. They are directly connected with two crossover cables. We found UDP packets with bad checksums on all heartbeat interfaces. Message was edited by: ACT_Alvin
ACT_Alvin at 2007-7-6 15:53:30
# 10

Tim

Thanks for your suggestion.

Actually, we have had 3 Oracle RAC incidents before. All of them were related to a block locking problem. We suspect the UDP packets with bad checksums are the root cause.

Our system was installed by a Sun certified service provider, and I've checked the "system" file, which already sets ce_taskq_disable to 1.

ACT_Alvin at 2007-7-6 15:53:30
# 11

I assume that your phrase 'set ce_taskq_disable to 1' is a typo as it should be 'set ce:ce_taskq_disable=1' (with a ':' not an '_').
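To rule out that kind of mix-up, a small script can check the file contents for the module-qualified form. This is only a sketch: it assumes `set module:variable=value` lines and comment lines starting with `*` (or `#`), and the helper names are made up for the example.

```python
import re

# Matches "set [module:]variable = value"; the module prefix is optional.
SET_LINE = re.compile(r"^\s*set\s+(?:(\w+):)?(\w+)\s*=\s*(\S+)")


def system_tunables(text):
    """Return {(module, variable): value} for each 'set' line; module may be None."""
    tunables = {}
    for line in text.splitlines():
        if line.lstrip().startswith(("*", "#")):  # skip comment lines
            continue
        match = SET_LINE.match(line)
        if match:
            module, variable, value = match.groups()
            tunables[(module, variable)] = value
    return tunables


def ce_taskq_disabled(text):
    """True only if ce:ce_taskq_disable is set to 1 with the 'ce:' module prefix."""
    value = system_tunables(text).get(("ce", "ce_taskq_disable"))
    try:
        return value is not None and int(value, 0) == 1
    except ValueError:
        return False


if __name__ == "__main__":
    with_prefix = "* private interconnect tuning\nset ce:ce_taskq_disable=1\n"
    without_prefix = "set ce_taskq_disable=1\n"
    print(ce_taskq_disabled(with_prefix), ce_taskq_disabled(without_prefix))
```

On a real system the text would come from reading /etc/system; a line without the `ce:` prefix is reported as not set, which is the case described above.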

If this is what you have, I would open a case with Sun support, as they'll need to go through a detailed check of the patches and the settings you have in /etc/system.

Bottom line, it is probably a tuning issue. The corrupt headers will cause Oracle to have to retransmit the packets, resulting in lower performance.

Tim

TimRead at 2007-7-6 15:53:30
# 12
Thanks, Tim. Is there any information I need to prepare for the investigation?
ACT_Alvin at 2007-7-6 15:53:30
# 13
Support staff will ask you for explorer output. It should have been installed with the cluster and usually lives in /opt/SUNWexplo (if I remember correctly). I'm not entirely sure what the right options are. I think it is simply:

# /opt/SUNWexplo/bin/explorer

Tim
TimRead at 2007-7-6 15:53:30
# 14
Thanks, Tim.
ACT_Alvin at 2007-7-6 15:53:30
# 15
Tim, have you opened a case with SunSolve Support for us? May I have the service request number?
ACT_Alvina at 2007-7-21 15:05:03
# 16
It's up to you to call (or email, or use the Online Support Center) your local service center. If nothing else, they'll need your contract number and the serial number of the affected box. 7/M.
mAbrantea at 2007-7-21 15:05:03
# 17
Thanks, Abrante. I've submitted a service request in the Online Support Center.
ACT_Alvina at 2007-7-21 15:05:03