Core dump on Solaris 10 allocating/deallocating memory

Hi,

We have a customer who reports that our application dumps core on Solaris 10 hosts. However, the core dump happens on some hosts but not on all.

I looked at several core dumps generated at different times. The call stacks differ, but they all start from a new thread creation and end in libc and the SGI STL.

Here is one example call stack:

#0 0xff1547b0 in getusa () from /lib/libc.so.1
#1 0x660f7a28 in ? ()
#2 0xff154fd8 in getsystemTZ () from /lib/libc.so.1
#3 0xff154f14 in _tzload () from /lib/libc.so.1
#4 0x3647e8 in _STLD::__malloc_alloc<0>::deallocate (__p=0xa1db70)
   at ../../../../../third_party/stlport/stl/_alloc.h:114
#5 0x365930 in _STLD::allocator<char>::deallocate (this=0x96b124, __p=0xa1db70 "000287890408", __n=13)
   at ../../../../../third_party/stlport/stl/_alloc.h:360
#6 0x36558c in _STLD::_String_base<char, _STLD::allocator<char> >::_M_deallocate_block (this=0x96b11c)
   at ../../../../../third_party/stlport/stl/_string.h:140
#7 0x364f0c in _STLD::_String_base<char, _STLD::allocator<char> >::~_String_base (this=0x96b11c, __in_chrg=2)
   at ../../../../../third_party/stlport/stl/_string.h:151
...
#16 0x39c374 in _STLD::__vector<device, _STLD::allocator<device> >::clear (this=0xfed7f7e4)
   at ../../../../../third_party/stlport/stl/_vector.h:488
#17 0x39bb44 in _STLD::vector<device, _STLD::allocator<device> >::clear (this=0xfed7f7e4)
   at ../../../../../third_party/stlport/stl/debug/_vector.h:262
...
#22 0x203590 in app_ThreadManager::_launch (pVoid=0x86d6c8) at appthrdmgr.cxx:125
#23 0x240d40 in _internal_thread_wrapper (parameter=0x9c6bf8) at appthrd.cxx:207

Other core dumps are generated when "new" or "free" is called. All the cores are the same at the top (#0 to #3) and bottom (#22, #23).

Our application has no problems on other Solaris versions. It may have a memory-usage bug, but we cannot easily reproduce the crash in our environment. Any suggestions about what the problem may be?

Thanks.

By duanrt at 2007-11-26 9:04:16
# 1

It looks like you have a store through an invalid pointer, which is corrupting the stack or the heap (or both). Storing through an invalid pointer has random effects, varying from no harm done to wrong program behavior to program crashes. The point where you notice the problem is usually far away in program space and time from the point of the error.

Sources of error include

- using an invalid pointer (not initialized, or containing garbage)

- using a pointer to a deleted object, or one that has gone out of scope

- deleting an object more than once

- writing beyond the bounds of an object (such as off-by-one)

- synchronization errors in a multi-threaded program

- building parts of a program with incompatible compiler options
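To make the out-of-bounds item concrete, here is a minimal sketch of a guard-zone check. It is an illustration only, not code from the application: GuardedBuffer, copy_id, kGuardSize and kGuardByte are hypothetical names. The guard zone sits in the same allocation as the data, so the simulated off-by-one write is harmless here, and the damage only becomes visible when intact() is called later. In a real heap the same stray byte would land in a neighbouring block or in allocator bookkeeping and could surface much later inside malloc or free.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical names throughout; none of this comes from the application.
const std::size_t kGuardSize = 8;        // bytes of guard zone after the data
const unsigned char kGuardByte = 0xAB;   // fill pattern for the guard zone

// A byte buffer with a guard zone placed directly behind the user region.
// Both live in one allocation, so the simulated overrun below stays inside
// memory this object owns and the demo itself is well-defined.
class GuardedBuffer {
public:
    explicit GuardedBuffer(std::size_t n)
        : size_(n), storage_(n + kGuardSize, kGuardByte) {
        std::memset(&storage_[0], 0, size_);  // zero the user region only
    }

    char* data() { return reinterpret_cast<char*>(&storage_[0]); }
    std::size_t size() const { return size_; }

    // True while every guard byte still holds the fill pattern.
    bool intact() const {
        for (std::size_t i = 0; i < kGuardSize; ++i) {
            if (storage_[size_ + i] != kGuardByte) return false;
        }
        return true;
    }

private:
    std::size_t size_;
    std::vector<unsigned char> storage_;
};

// Copies id plus its terminating NUL without checking buf.size(): this is
// the bug under study. The assert only keeps the demo inside the guard zone.
void copy_id(GuardedBuffer& buf, const char* id) {
    const std::size_t len = std::strlen(id) + 1;
    assert(len <= buf.size() + kGuardSize);
    std::memcpy(buf.data(), id, len);
}
```

Copying a 12-character string into a GuardedBuffer of size 12 overwrites the first guard byte with the NUL terminator, so intact() returns false afterwards; with size 13 the guard stays untouched.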

Try running the program under Sun Studio dbx with Run-Time Checking (check -all). It will help you find many of these errors.

Not all the checks are available on x86 platforms.

If you are not using Sun Studio C++, your debugging options will be more limited.
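One limited fallback that needs no special tooling is to instrument suspect allocations by hand. The following is a minimal sketch under stated assumptions: AllocTracker and checked_delete are hypothetical names, the tracker is not thread-safe (a multi-threaded program would need to guard it with a lock), and it cannot tell a stale pointer from a live one if the allocator has already reused the address. It only shows the idea of reporting a second delete instead of executing it.

```cpp
#include <cstddef>
#include <set>

// Hypothetical bookkeeping class: remembers which addresses are live so
// that releasing an unknown or already released address is reported
// instead of being passed on to delete.
class AllocTracker {
public:
    enum Result { Released, NotLive };

    // Record a freshly allocated address.
    void on_alloc(const void* p) { live_.insert(p); }

    // Released: p was live and is now forgotten; the caller may free it.
    // NotLive:  p was never recorded or has already been released.
    Result on_release(const void* p) {
        return live_.erase(p) == 1 ? Released : NotLive;
    }

    std::size_t live_count() const { return live_.size(); }

private:
    std::set<const void*> live_;
};

// Deletes p only if the tracker still considers it live.
// Returns false (and does nothing) for a double or invalid delete.
template <typename T>
bool checked_delete(AllocTracker& tracker, T* p) {
    if (tracker.on_release(p) == AllocTracker::NotLive) return false;
    delete p;
    return true;
}
```

Each allocation under suspicion is registered with on_alloc right after new, and every delete of such an object goes through checked_delete; a false return marks the place where a double or invalid delete was attempted.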

clamage45 at 2007-7-6 23:14:34
# 2
Heap or stack corruption is also what we suspected. The application has over 100,000 lines of C++ code, and the core dump is hard to reproduce. We do not have Sun Studio; we compile the code with GNU C++.
duanrt at 2007-7-6 23:14:34
# 3
If your hardware is SPARC, you can try dbx, the Sun Studio debugger, to catch illegal memory accesses with its 'check -access' feature. It can help find the offending code in 100,000 lines of C++.
MaximKartashev at 2007-7-6 23:14:34