Random crashes in programs compiled with Sun Studio 11

Hi all,

I am porting some server code from Windows to Solaris, and our main server crashes at random. This does not happen on Windows, Linux, or HP with the same code base.

I suspect a structural problem that I can't find. Perhaps somebody can look over the information below and spot what is wrong.

Because of Boost I have to use stlport4, and I also have to link both thread libraries (libthread.so and libpthread.so). My concern is that some mismatch between them causes my problems.

Does anybody see any danger in how I compile and link this? Any help would be appreciated.

regards

a desperate

System version:

SunOS sun4 5.8 Generic_108528-24 sun4u sparc SUNW,Ultra-80

Link line:

CC ApplicationServer.o ... -mt -library=stlport4 -L/km/sqstest_plato/PlatoServer/Interface/Release -lstubs -lskell -L/km/sqstest_plato/PlatoServer/Basics/Release -lbasic -L/km/sqstest_plato/PlatoServer/XMLBase/Release -lxmlbase -L/km/sqstest_plato/PlatoServer/libs/ReleaseInfo/Release -lReleaseInfo -L/km/sqstest_plato/PlatoServer/Basics/zlib/Release -lz -L/km/iona/asp/6.2/lib -lit_art -lit_poa -lit_ifc -lit_naming -lit_location -lit_iiop -lit_csi -L/km/poet/runtime/lib -lpt95Fbs -lpt95Fex -lpt95Fin -lpt95Fkn -lpt95Foq -lpt95Fsc -lpt95Ftm -L/km/sqstest_plato/PlatoServer/Basics/boost/lib -lboost_thread-sw-mt-1_33_1 -lboost_regex-sw-mt-1_33_1 -lboost_date_time-sw-mt-1_33_1 -lboost_filesystem-sw-mt-1_33_1 -L/km/sqstest_plato/libs/xerces/lib/solaris -lxerces-c -L/km/sqstest_plato/libs/xalan/lib/solaris -lxalan-c -lxalanMsg -R/usr/lib/lwp -lmtmalloc -lsocket -lnsl -lpthread -o ./ApplicationServer

Example compile line:

CC -I.. -I../poet_code -I/km/iona/asp/6.2/include -I/km/poet/inc -I/km/libs/flexlm/machind -I/km/sqstest_plato/PlatoServer/Interface -I/km/sqstest_plato/PlatoServer/Basics -I/km/sqstest_plato/PlatoServer/XMLBase -I/km/sqstest_plato/PlatoServer/XMLBase/poet_code -I../../bison++ -I/km/sqstest_plato/libs/xerces/src -I/km/sqstest_plato/libs/xalan/src -xO3 -library=stlport4 -D_ASSERTE=assert -features=extensions -features=rtti -w -DNDEBUG -D_GARBAGE_COLLECTOR +d -mt -D_APP_SERVER -c SomeFile.cpp

One of several similar dbx outputs:

dbx ApplicationServer core.odc.1

For information about new features see `help changes'

To remove this message, put `dbxenv suppress_startup_message 7.5' in your .dbxrc

Reading ApplicationServer

dbx: warning: core object name "ApplicationServ" matches

object name "ApplicationServer" within the limit of 14. assuming they match

core file header read successfully

Reading ld.so.1

Reading libstubs.so

Reading libskell.so

Reading libbasic.so

Reading libxmlbase.so

Reading libReleaseInfo.so

Reading libit_art_sc53.so.5

Reading libit_poa_sc53.so.5

Reading libit_ifc_sc53.so.5

Reading libit_naming_sc53.so.5

Reading libit_location_sc53.so.5

Reading libit_iiop_sc53.so.5

Reading libit_csi_sc53.so.5

Reading libpt95Fbs.so

Reading libpt95Fex.so

Reading libpt95Fin.so

Reading libpt95Fkn.so

Reading libpt95Foq.so

Reading libpt95Fsc.so

Reading libpt95Ftm.so

Reading libboost_thread-sw-mt-1_33_1.so

Reading libboost_regex-sw-mt-1_33_1.so

Reading libboost_date_time-sw-mt-1_33_1.so

Reading libboost_filesystem-sw-mt-1_33_1.so

Reading libxerces-c.so.27

Reading libxalan-c.so.110

Reading libxalanMsg.so.110

Reading libmtmalloc.so.1

Reading libsocket.so.1

Reading libnsl.so.1

Reading libpthread.so.1

Reading libstlport.so.1

Reading libCrun.so.1

Reading libm.so.1

Reading libthread.so.1

Reading libc.so.1

Reading libit_atli2_ip_sc53.so.5

Reading libit_atli2_sc53.so.5

Reading libdl.so.1

Reading librt.so.1

Reading libit_atli2_iop_sc53.so.5

Reading libit_giop_sc53.so.5

Reading libit_iiop_profile_sc53.so.5

Reading libgen.so.1

Reading libmp.so.2

Reading libaio.so.1

Reading libc_psr.so.1

Reading libit_ifc_aux_sc53.so.5

Reading de.so.2

Reading libit_cfr_handler_sc53.so.5

Reading libit_cfr_sc53.so.5

Reading libit_codeset_sc53.so.5

Reading libit_icuuc.so.2

Reading libit_icui18n.so.2

Reading libit_icudata.so.2

Reading libpt95Fli.so

Reading libpt95Fix.so

t@10 (l@10) terminated by signal SEGV (no mapping at the fault address)

0xfd037810: GetBase+0x001c: ld [%o0 + 12], %o0

(dbx) where

current thread: t@10

=>[1] PtBaseHandle::GetBase(0xc077bfc8, 0x4b58c0, 0x8, 0x7, 0x1, 0xfa6fb770), at 0xfd037810

[2] PtOnDemandSet::Query(0x9f3654, 0xfa6fb790, 0xfd2bafbc, 0x0, 0x4b58c0, 0x0), at 0xfd1cf6dc

[3] CPSWorkspace::ChangePropagation(0xfffffffc, 0x9f3450, 0x1, 0xc, 0xfa6fb78c, 0x1), at 0x4359b8

[4] CPSCall::Delete(0x9f3450, 0x0, 0xffffffff, 0xffffffff, 0x21f2a8, 0xff2d5a88), at 0x31b5d4

[5] CPSBase::delete_Object(0xa1f1a0, 0x9f3450, 0x2, 0x0, 0x31b5a8, 0x82fe10), at 0xff29f294

[6] CPSTestCase::delete_Object(0xa1f1a0, 0x9f3450, 0x2, 0xfcdcb0b8, 0x8b1f08, 0x857800), at 0x3f9f5c

[7] CSBulkManipulations::DeleteObjects(0x9da880, 0xfa6fbba0, 0xfa6fba30, 0xfa6fba34, 0x8c7a28, 0x3f9ef4), at 0x4a42a0

[8] POA_IBulkManipulations::DeleteObjects_itgen_dispatch(0x1668, 0xfa6fbd30, 0xfa6fbc94, 0x1400, 0xfe7b8d60, 0x9da8c8), at 0xfe5aaca4

[9] PortableServer::ServantBase::_dispatch(0x9da8c0, 0xfa6fbd30, 0xfa6fbd74, 0x400, 0x1, 0x4a4750), at 0xfe27415c

[10] IT_POA_RequestInterceptor::invoke(0xab8c94, 0x92b74c, 0x92b750, 0xfa6fbdb0, 0xfa6fbd74, 0xfe2f8d28), at 0xfe23f12c

[11] IT_GIOP_ServerRequest::execute(0x92b100, 0x92b100, 0xfde5f368, 0xfe121678, 0x2, 0xfbbdb6d0), at 0xfbb5f5d8

[12] IT_ATLI2_IP::IPPoolImpl::execute(0xba0520, 0x4, 0xfa6fbf2c, 0x1, 0xfe0ea92c, 0x0), at 0xfc24ca30

[13] IT_Work_WorkerThread::run(0xa7fb30, 0x3, 0xac70a0, 0xac70a0, 0xac70a0, 0xfe0ea92c), at 0xfdd88c14

(dbx) quit

[6194 byte] By [ScAra] at [2007-11-27 11:42:56]
# 1

I don't see -mt on the sample compilation line.

You need to use -mt on every compilation (CC and cc) command as well as on the link command line.

clamage45a at 2007-7-29 17:47:00 > top of Java-index,Development Tools,Solaris and Linux Development Tools...
# 2

Sorry, but the -mt option is in both commands.

regards

Arno

ScAra at 2007-7-29 17:47:00
# 3

In that case, I don't see anything obviously wrong with your compile or link commands.

Some observations:

-features=rtti -- You don't need this option. It is always the default.

+d -- This turns off all function inlining, which is a performance killer. It can be useful when debugging, however.

-w -- This turns off most compiler warnings. Sun C++ emits few warnings by default that are safe to ignore. I suggest you remove -w and investigate the warnings you get.

The default thread library on Solaris 8 has some problems. The optional LWP thread library has replaced the original library on later versions of Solaris. Try using the LWP thread library by adding these options to the link command:

-L/usr/lib/lwp -R/usr/lib/lwp

Other things to check:

Go to

http://developers.sun.com/sunstudio/downloads/patches/index.jsp

and see whether you have the current patches for Sun Studio 11. It is possible that you have run into a compiler bug that has been fixed.

In particular, be sure the run-time system has a recent update of the C++ runtime libraries. Although you are not using libCstd.so, all C++ programs use libCrun.so.

Run this command to find out the library patch level on your system (Solaris 8 for sparc):

showrev -p | grep 108434

You should see one or more lines mentioning 108434-xx where xx is the patch level. The current level is -22. If no patches are listed or if the patch level is much earlier than -22, get patches 108434-22 (32-bit) and 108435-22 (64-bit). The Sun Studio patch page will take you directly to the download page for the patches.

According to the dbx report as shown, the program crashes during multi-threaded static program initialization. Since it doesn't always crash, that suggests a race condition. Race conditions are notoriously difficult to find.

If you can compile and run on Solaris 9 or later (Solaris 8 is End Of Life), try Sun Studio 12. It has a Thread Analyzer tool ("THA") that reports race conditions.
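
To illustrate the kind of initialization race described above, here is a minimal sketch of one common way to guard one-time initialization with pthread_once. It is not taken from the poster's code: Registry, registry(), worker() and run_workers() are hypothetical names invented for this example, and the sketch assumes a POSIX threads environment.

```cpp
#include <pthread.h>
#include <cassert>

// Hypothetical shared resource; stands in for any object that must be
// constructed exactly once before worker threads use it.
struct Registry {
    int value;
};

static Registry*      g_registry   = 0;
static int            g_init_calls = 0;  // counts how often the initializer ran
static pthread_once_t g_once       = PTHREAD_ONCE_INIT;

// pthread_once expects a C-linkage function that takes no arguments.
extern "C" void init_registry() {
    ++g_init_calls;
    g_registry = new Registry();
    g_registry->value = 42;
}

// Every caller goes through pthread_once, so init_registry runs exactly
// once even when several threads arrive at the same time.
Registry& registry() {
    pthread_once(&g_once, init_registry);
    assert(g_registry != 0);
    return *g_registry;
}

extern "C" void* worker(void*) {
    return &registry();
}

// Starts up to n threads (at most 16) that all request the registry and
// returns how many times the initializer has run so far; -1 on a bad n.
int run_workers(int n) {
    pthread_t threads[16];
    if (n < 1 || n > 16) return -1;
    int started = 0;
    for (; started < n; ++started) {
        if (pthread_create(&threads[started], 0, worker, 0) != 0) break;
    }
    for (int i = 0; i < started; ++i) {
        pthread_join(threads[i], 0);
    }
    return g_init_calls;
}
```

Without the pthread_once guard, two threads could both see a null g_registry and both construct it, which is the sort of intermittent failure that is hard to reproduce. Whether anything like this applies to the crash in this thread is an open question.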

clamage45a at 2007-7-29 17:47:00
# 4

Meanwhile we have made some progress. After removing mtmalloc and the -R option from the link line, the crashes no longer occur.

I had added mtmalloc because this forum told me that it improves performance. Do you have a clue why it is a problem?

The patches on my development machine are up to date.

I will try removing the +d option. Why is performance better with inlining? We thought it was the other way around. What should we expect when we switch inlining off? You say it is a performance killer, so does it reduce performance dramatically?

One question about the LWP library: where on the command line should we put the -L and -R options for /usr/lib/lwp? In our experience, the order of link options also influences the behaviour of the application.

We have to develop on 5.8 at the moment, because most of our customers run on this version.

regards

Arno

ScAra at 2007-7-29 17:47:00
# 5

I'm looking into the mtmalloc issue and I'll let you know what I find.

To use the LWP thread library, add the -L and -R option to each command that performs a link step. (It does no harm to put them on compilation commands, but they have no effect.)

Examples of link steps are creating a shared library (-G option) or an executable program.

The important point is to use the -L and -R options on every link command, or you could wind up linking two different versions of the thread library into the program.
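
One way to check at run time which library a process actually bound to is to ask the dynamic linker where a symbol lives. The sketch below is only an illustration: providing_object() is a hypothetical helper, and it assumes the platform provides dlsym with RTLD_DEFAULT and dladdr in <dlfcn.h> (on some systems you also need to link with -ldl).

```cpp
#include <dlfcn.h>
#include <cassert>
#include <string>

// Returns the path of the object that provides `symbol` in the running
// process, or an empty string if the symbol or its origin cannot be found.
std::string providing_object(const char* symbol) {
    assert(symbol != 0);

    // Look the symbol up the way the default search order would resolve it.
    void* addr = dlsym(RTLD_DEFAULT, symbol);
    if (addr == 0) {
        return std::string();
    }

    // Ask the dynamic linker which loaded object contains that address.
    Dl_info info;
    if (dladdr(addr, &info) == 0 || info.dli_fname == 0) {
        return std::string();
    }
    return std::string(info.dli_fname);
}
```

Printing providing_object("pthread_create") early in main would show the path of the thread library the program really uses, which you can compare against /usr/lib/lwp.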

Most C++ programs, and the C++ Standard Library in particular, depend on function inlining to allow separation of programming concerns while still getting good performance.

One classic example is a member access function:

class T {
    int foo;
    // ...
public:
    int get_foo() { return foo; } // implicit inline function
    // ...
};

Access to private member foo is restricted. Non-member functions can read the value via function get_foo. The definition and use of member foo, and the contents of function get_foo, could change without affecting any code that uses T. You need only recompile, not modify source code that uses T objects.

The disadvantage is the cost of a function call to get at the int member. If access to foo is high bandwidth, program performance could suffer.

But in this case, get_foo is implicitly inline because it is defined inside the class. The compiler replaces the call to get_foo with the body of get_foo at the point of the call. When you write x.get_foo(), the compiler generates code as if you had written x.foo.

You get the benefit of program modularity and separation of concerns without any runtime cost.

When you use +d, you tell the compiler not to perform the inline substitution, and generate a function call instead. You often want to do that during debugging, so you can set breakpoints on calls to get_foo, which you could not otherwise do. In fact, the -g option sets the +d option to disable function inlining. There is seldom any reason to use +d explicitly.

For production code, you don't want to throw away the designed-in performance benefits of function inlining. Remove the +d option unless you have a specific reason for wanting it.
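
As a rough, self-contained sketch of the accessor pattern in a hot loop: the constructor and sum_foo() below are additions made up for this illustration, not part of the original class. With inlining enabled, a compiler is free to turn the get_foo() call in the loop into a plain member load; with inlining disabled, each iteration pays for a real function call.

```cpp
#include <cassert>
#include <cstddef>

class T {
    int foo;
public:
    explicit T(int value) : foo(value) {}

    // Defined inside the class body, so it is implicitly inline.
    int get_foo() const { return foo; }
};

// Sums the private member of every element through the accessor.
// This is the kind of tight loop where an out-of-line call per element
// can add up.
long sum_foo(const T* items, std::size_t count) {
    assert(items != 0 || count == 0);
    long total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        total += items[i].get_foo();
    }
    return total;
}
```

Building a loop like this once with and once without +d and timing both is a simple way to see how much the option costs in your own code.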

clamage45a at 2007-7-29 17:47:00
# 6

Thanks for the explanation. We will switch off the '+d' option in future.

But reading your comment on the -R and -L options, I think we have found the reason for our problems with mtmalloc: we set the -R option in our link lines because of it, as you can see in the example. However, we also link third-party libraries, and we have no influence on how those are linked; the CORBA libraries in particular make heavy use of threads and thread pools. So it looks to me as if we can't use the libraries from the lwp directory in general. That is a pity, because we want to use every possibility to increase performance on Solaris: compared with other platforms, the production application is typically 30-50% slower.

If you have any more hints for getting better performance, they are welcome.

regards

Arno

ScAra at 2007-7-29 17:47:00
# 7

Third party libraries that can't be recompiled are one of the few good reasons to use LD_LIBRARY_PATH and/or LD_PRELOAD. I haven't really followed what you are trying to do, but don't these 2 environment variables help?

Marc_Glissea at 2007-7-29 17:47:00