Core Dump

Hi,

I am working on a 5.10 Generic_118833-24 sun4u sparc SUNW,Sun-Fire-V890 machine. Our server is multithreaded and it dumps core every few minutes. To debug this I have used libumem. Below is the information it gives when I run ::umem_status.

::umem_status

Status: ready and active

Concurrency: 32

Logs: transaction=1m (inactive)

Message buffer:

umem allocator: buffer modified after being freed

modification occurred at offset 0x8 (0xdeadbeefdeadbeef replaced by 0xdeadbeeedeadbeef)

buffer=651e18 bufctl=65d608 cache: umem_alloc_56

previous transaction on buffer 651e18:

thread=4 time=T-0.007687000 slab=20e2d0 cache: umem_alloc_56

libumem.so.1'umem_cache_free+0x50

libumem.so.1'? (0xff35b3c4)

libCrun.so.1'__1c2k6Fpv_v_+0x4

SGwAPG40SFTPFileCollector'__1cJRWCStringHreplace6MIIpkcI_r0_+0x18c

SGwAPG40SFTPFileCollector'__1cOSFTPConnectionDget6MrknJRWCString_3nQFileTransferType_rkl_I_+0x48

SGwAPG40SFTPFileCollector'__1cNgetFileBySftp6FrnSRWTPtrSortedVector4nNFileCollnItemrknJRWCString_44ki_i_+0xab8

SGwAPG40SFTPFileCollector'__1cVSGwAPG40SFTPCollector6Fpv_0_+0x308

libc.so.1'? (0xfe0400b0)

umem: heap corruption detected

stack trace:

libumem.so.1'? (0xff35c5a8)

libumem.so.1'? (0xff35d6bc)

libumem.so.1'umem_cache_alloc+0x210

libumem.so.1'umem_alloc+0x60

libumem.so.1'malloc+0x28

libCrun.so.1'__1c2n6FI_pv_+0x2c

SGwAPG40SFTPFileCollector'__1cMRWCStringRefGgetRep6FIIpv_p0_+0x2c

SGwAPG40SFTPFileCollector'__1cJRWCString2t5B6MpkcI2I_v_+0x2c

SGwAPG40SFTPFileCollector'__1cNgetFileBySftp6FrnSRWTPtrSortedVector4nNFileCollnItemrknJRWCString_44ki_i_+0xc60

SGwAPG40SFTPFileCollector'__1cVSGwAPG40SFTPCollector6Fpv_0_+0x308

libc.so.1'? (0xfe0400b0)

When I examine the memory at that address, I get the following:

651e18/08X

0x651e18: deadbeef deadbeef deadbeee deadbeef deadbeef deadbeef deadbeef deadbeef

Compiler and Studio Version used :

Sun Studio 11

Sun Studio 11 C Compiler

Sun Studio 11 C++ Compiler

Sun Studio 11 Tools.h++ 7.1

Can anyone help me?

Regards

Vikas

Message was edited by:

nagaraja

[2281 byte] By [nagarajaa] at [2007-11-27 5:29:58]
# 1

Apparently some part of the code is writing to memory after it has been deleted. The heap corruption later causes a crash.

Try running the program under dbx with Run-Time Checking (RTC) enabled:

% dbx myprog

(dbx) check -all

(dbx) run

RTC can find many problems associated with invalid memory usage.
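The libumem report in the first post comes from a check of roughly the following kind: freed buffers are filled with the 0xdeadbeef pattern, and the pattern is verified when the buffer is handed out again. This is only an illustrative sketch, not libumem's code; the function name `first_modified_offset` is made up for the example, and it compares 32-bit words to match the dump, whereas libumem's own message suggests it compares 64-bit words.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fill pattern that the libumem report shows in freed buffers.
constexpr std::uint32_t kFreedPattern = 0xdeadbeefu;

// Hypothetical helper (not part of libumem): returns the byte offset of the
// first 32-bit word that no longer holds the freed pattern, or -1 if the
// buffer is still intact.
long first_modified_offset(const std::uint32_t* words, std::size_t count) {
    assert(words != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        if (words[i] != kFreedPattern) {
            return static_cast<long>(i * sizeof(std::uint32_t));
        }
    }
    return -1;
}
```

Fed the eight words from the memory dump in the first post, this returns 8, which agrees with the "offset 0x8" in the libumem message: something wrote to the buffer after it was freed.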

clamage45a at 2007-7-12 14:53:37 > top of Java-index,Development Tools,Solaris and Linux Development Tools...
# 2

Thank you very much for the reply.

In our code we do not call new, delete, malloc or free directly, so memory allocated and deleted by RWCString should never be touched by our code. Correct me if I am wrong.

So how can I find out exactly where the problem is?

I am asking because the problem occurs on a runtime machine that does not have dbx installed, so I will not be able to run dbx there.

A few more details:

1) The same code with the same number of threads does not dump core on a Solaris 9 machine. The process runs under the same load.

The hardware configuration (memory and CPU) is exactly the same for the Solaris 10 and Solaris 9 machines. The only difference is the Forte version used on Solaris 9:

Forte University Edition 6 update 2

Sun WorkShop 6 update 2 Compilers C

Sun WorkShop 6 update 2 Compilers C++

Sun WorkShop 6 update 2 Tools.h++ 7.1

2) On the Solaris 10 machine, if I reduce the number of threads in my process to 1 (the default setting is 5), it works perfectly. This is the big problem: finding heap corruption in a multithreaded process.

3) Also, when I check the information provided by libumem, none of the buffers is reported as corrupted.

nagarajaa at 2007-7-12 14:53:37
# 3

I think you are running into a problem with the std::string class that shows up when *all* of the following are true:

- the compiler is older than Sun Studio 8 (C++ 5.5) and lacks recent patches

- the default or explicit compilation target architecture is SPARC v7 or v8

- the code is run on an UltraSPARC (v8plus[ab])

- the machine where the program is run has very old versions of the C++ runtime libraries

- the program is multi-threaded

First, what is the patch level of the C++ compiler that was used to build the program? Run the command

CC -V

and report the output.

Next, what is the C++ runtime library on the system where the problem occurs? On that machine, run the command

showrev -p | grep SUNWlibC

and report the output.

In the best case, that machine will need to be updated with a new C++ runtime library patch (SUNWlibC).

In the worst case, you will need to patch your old compiler, or replace it with a current version, and rebuild the program. The target system might also need to get the current SUNWlibC patch.

About default architectures:

Until recently, Sun compilers compiled code by default for the old SPARC chips (v7 and v8) that preceded the UltraSPARC line introduced in (I think) 1994. The reason was that Solaris still supported those old chips. Current versions of Solaris support only UltraSPARC and later (v8plus, v8plusa, v8plusb, VIS, T1, T2, etc.). The default for current compilers is v8plus.

If you need to recompile old code, and if you do not need to support ancient hardware, specify v8plus as the target architecture. You will not only avoid any problem with the string class, you will get better run-time performance of your code.

clamage45a at 2007-7-12 14:53:37
# 4

Thanks once again.

Below is the information you requested.

1) clarence{eciknvi} % ../bin/CC -V

CC: Sun C++ 5.8 2005/10/13

The above command is run on development machine where we do our compilation.

2) showrev -p | grep SUNWlibC

Patch: 119963-05 Obsoletes: Requires: Incompatibles: Packages: SUNWlibC

Patch: 119963-08 Obsoletes: Requires: Incompatibles: Packages: SUNWlibC

The above command was run on the target machine where the binary runs. How can I find out whether the patch is the correct one?

3) If I run the file command on my executable, this is what I get:

file SGwAPG40SFTPFileCollector

SGwAPG40SFTPFileCollector: ELF 32-bit MSB executable SPARC32PLUS Version 1, V8+ Required, dynamically linked, not stripped

It says V8+ required. Does that mean it has already been compiled for v8plus? When I checked the CC man page, it says that with this compiler the default is

v8plus    This is the default and it means the compiler uses the instruction set for the V8plus version of the SPARC-V9 ISA.

I have one more piece of information. I ran the file command on librwtool.so.2. This is what I get:

file librwtool.so.2

librwtool.so.2: ELF 32-bit MSB dynamic lib SPARC Version 1, dynamically linked, not stripped, no debugging information available

This library is not in SPARC32PLUS format. Will this cause the problem?

When I run the file command on a library that we built with the above compiler version, the output shows SPARC32PLUS format:

file libSGwcdrapnBcpFormatter.so

libSGwcdrapnBcpFormatter.so: ELF 32-bit MSB dynamic lib SPARC32PLUS Version 1, V8+ Required, dynamically linked, not stripped

I also got dbx installed on the target machine and ran it as you suggested. dbx did not report any access to freed memory, even after the program dumped core.

nagarajaa at 2007-7-12 14:53:37
# 5

In your original email you listed WorkShop 6 update 2 (C++ 5.3) as the compiler, which is why I was concerned about old compilers and old libraries. If you are actually using Sun Studio 11 (C++ 5.8), and the target system is Solaris 10, there is no such problem. These products were published years after the std::string problem was fixed.

As you noted, the default architecture for C++ 5.8 is v8plus, and the file command shows that the program was indeed built that way.

librwtool is not an issue. It does not use the std::string class.

I think your problem must be memory corruption, and not a problem in the compiler or support libraries.

What happens when you run the program under dbx with RTC enabled?

clamage45a at 2007-7-12 14:53:37
# 6

I was able to solve the problem.

There was a global RWCString object that was accessed by all the threads. The threads assigned values to the object without any synchronization between them. Because operator= deletes the char * buffer whenever an assignment is made, one thread ended up accessing memory that another thread was deleting. After synchronizing access to the object, everything works fine.
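For anyone who finds this thread later, the fix looks roughly like the sketch below. It is not the actual code from the program: `std::string` and `std::mutex` stand in for RWCString and for whatever lock the real code uses, and the names `SharedString` and `hammer` are invented for the example.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical wrapper around a process-wide string that several threads
// update. std::string and std::mutex are stand-ins for RWCString and the
// real lock; neither is taken from the original program.
class SharedString {
public:
    // Replace the value. Holding the lock keeps the "release old buffer,
    // install new buffer" steps of assignment from interleaving with
    // another thread's assignment or read.
    void set(const std::string& value) {
        std::lock_guard<std::mutex> lock(mutex_);
        value_ = value;
    }

    // Return a copy made under the lock, so the caller never reads a
    // buffer that another thread is in the middle of releasing.
    std::string get() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return value_;
    }

private:
    mutable std::mutex mutex_;
    std::string value_;
};

// Start `threads` workers that each assign to `shared` `rounds` times and
// read it back. Returns the final value, which is one of the worker strings.
std::string hammer(SharedString& shared, int threads, int rounds) {
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&shared, t, rounds] {
            const std::string mine = "worker-" + std::to_string(t);
            for (int i = 0; i < rounds; ++i) {
                shared.set(mine);
                const std::string seen = shared.get();
                assert(seen.rfind("worker-", 0) == 0);  // always a whole value
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return shared.get();
}
```

Calling `hammer(shared, 5, 1000)` mirrors the five-thread default mentioned earlier. Returning a copy from `get()` rather than a reference is deliberate: a reference would let the caller touch the buffer after the lock is released, which is the same bug again.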

Thank you very much for all the help.

But I still cannot explain why the same code worked fine on the Solaris 9 machine.

nagarajaa at 2007-7-12 14:53:37
# 7
> But i am not able to answer the question how the same code was working fine on Solaris 9 machine ?

I believe it worked accidentally. Heap misuse alone introduces a lot of variations, but multithreading multiplies their number.
MaximKartasheva at 2007-7-12 14:53:37