SIGBUS/SIGSEGV (Signal 10/11) while iterateOverReachableObjects

Hi,

I tried to run my JVMTI based heap analyser several times without process restart. This does not work currently and I do not know why.

The first run is successful. With the second run the process panics with SIGBUS or SIGSEGV (which of both occus is not in my hands) at some nearly reproducable point.

The first run does one iterateOverReachableObjects() and two iterateOverHeap's and consumes ~ 600 MB of normal heap memory. As the code is written by myself I am sure to free all unneeded memory. From the second run the heap increases no more. With the second run the programm aborts always around the 57.000th object within iterateOverReachableObjects().

The stacktrace given by the signal handler says the the signal does not occur in my code:

<i>feca0314 waitid(0, 5951, dbe7ec68, 3)

fe8f3fd8 RtpWriteCoreInternal (fe907044, fe9b1c58, fecc02a4, fecbc000, 0, 3cec0) + 4e8

fe8f40c0 RtpWriteCore (3d0, fe9c77e4, 0, fecc0284, 7f0, 400) + 28

fef89480 PAL_sa_handler (b, dbe7f8a8, dbe7f5f0, 0, a, ff0323c4) + 160

cb82d2e0 __1cUcall_chained_handler6FpnJsigaction_ipnHsiginfo_pv_i_ (b, 10b8f0, dbe7f8a8, dbe7f5f0, 10b8e8, 126cb8) + 78

cb6141b8 JVM_handle_solaris_signal (b, dbe7f8a8, dbe7f5f0, 1, 11, 276b18) + 96c

feee5bac __sighndlr (b, dbe7f8a8, dbe7f5f0, cb61382c, 0, 0) + c

feedf804 call_user_handler (b, dbe7f8a8, dbe7f5f0, 0, 0, 0) + 234

feedf9b4 sigacthandler (b, dbe7f8a8, dbe7f5f0, cb954000, 276fe8, 160900) + 64

called from signal handler with signal 11 (SIGSEGV)

cb7f8d18 __1cPCallbackWrapper2T5B6M_v_ (276518, 276fe8, 276fe8, dbe7fa18, 73655574, 11865a30) + 18c

cb7f97a0 __1cIReporterZobject_reference_callback6FnYjvmtiObjectReferenceKind_pnHoopDesc_3i_i_ (1, 1,

9492f340, b, 11876378, 27014) + 2d8

cb7f86cc __1cbJIterateOverReachableObjectOperationSiterate_over_class6MpnMklassOopDesc__i_ (41, 8e11

3c20, 8e111978, dbe7faa4, 7fff, d) + 520

cb7f6fd0 __1cbJIterateOverReachableObjectOperationEdoit6M_v_ (efbef780, cb972010, 1, 5000, 276fb4, 2

76bc8) + 458

cb5c34cc __1cMVM_OperationIevaluate6M_v_ (efbef780, 275b28, 276fb4, 276bc0, 276bc8, 7f6687) + 80

cb4d5d90 __1cIVMThreadDrun6M_v_ (cb96dca4, 275b28, 276bc0, 5460, 276bc8, efbef780) + 6d0

cb82af08 __1cG_start6Fpv_0_ (276b18, cb96e58c, cb967650, 0, 3314, cb954000) + 200

feee5854 _lwp_start (0, 0, 0, 0, 0, 0)</i>

Any ideas to eliminate this?

Thanks, Robert

[2619 byte] By [lldaedaluslla] at [2007-10-2 23:45:53]
# 1

__1cPCallbackWrapper2T5B6M_v_ is the c++ mangled name for CallbackWrapper::~CallbackWrapper so it looks like the crash is happening immediately after the agent's callback. Can you describe what the callback is doing? As a test, can you make the callback a "no-op" to see if the crash occurs? Another useful thing would be to post the fatal error log (hs_err_<pid>.log file).

alan.batemana at 2007-7-14 16:30:19 > top of Java-index,Developer Tools,Debugging and Profiling Tool APIs...
# 2

Hi,

the issue cleared as I made some tryouts in my code. One thing solved the problem, but the solution does not point to an error in my code.

Using this code in the callback caused the SIGBUS/SIGSEGV with the second data dump:

<code>

static const char *special = "*JVMTI unknown*";

(...)

refereeClassSignature = (char *) realloc(refereeClassSignature, 16);

if (secStrCpy(refereeClassSignature, special, 16) == -1)

return JVMTI_ITERATION_ABORT;

</code>

This code no more caused the SIGBUS/SIGSEGV with the second data dump:

<code>

static char *special = "*JVMTI unknown*";

(...)

refereeClassSignature = (char *) realloc(refereeClassSignature, 17);

if (secStrCpy(refereeClassSignature, special, 17) == -1)

return JVMTI_ITERATION_ABORT;

</code>

secStrCpy is nothing more than a wrapper around strlcpy. What I am wondering about is that the first code snippet caused that error even though the size is big enough (including the ending \0) for that special string (you might remember it from another post :-). Why it ran without errors after I have changed the realloc size to 17 I have no clue. In the case of such hard errors I never got those hs_err_.log files. The error output did not say that such file was written.

This is the output:

<code>

signal number (sig): 10

signal number (siginfo): 10

error number (siginfo) : 0

signal code (siginfo) : 1

error number (siginfo) : 0

signal code (siginfo) : 1

Consult siginfo(3HEAD) for more information about the codes

writeCore: Calling external hook

pal-rtp: Core hook. Calling RtpWriteCore()

</code>

But in other softer errors I had such files were generated. Sorry.

Robert

lldaedaluslla at 2007-7-14 16:30:19 > top of Java-index,Developer Tools,Debugging and Profiling Tool APIs...
# 3

Is this Solaris sparc? Or Solaris X86? And what C compiler are you using?

Can you show me the prototype of the callback function and the call to the Iterate functions? It's hard to see what is going on from what has been posted.

The only thing I can think of is some kind of issue with the types of the arguments sent into or returned by the callback functions, maybe a 32bit vs. 64bit difference in argument types? I'm guessing at this point.

-kto

kellyohaira at 2007-7-14 16:30:19 > top of Java-index,Developer Tools,Debugging and Profiling Tool APIs...
# 4

USING:

java version "1.5.0_06"

Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b05)

Java HotSpot(TM) Server VM (build 1.5.0_06-b05, mixed mode)

on Solaris9/FJSV,SPARC64-IV

and

java version "1.5.0"

Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0-b64)

Java HotSpot(TM) Client VM (build 1.5.0-b64, mixed mode, sharing)

on Windows XP SP2/Pentium 4

On Windows I never had such problems, but there I only analysed a very small test application. As compiler I used:

cc: Sun C 5.8 2005/10/13

Here are the two tried makefiles:

<code>

JDK_HOME=/opt/SMAW/SMAWjdk/1.5 # to be changed

CC_HOME=/opt/SMAW/SUNWspro_sunstudio11/SUNWspro/prod/bin # to be changed

SRC_DIR=.

OUTPUT_DIR=.

CLASSPATH=$(LIB_DIR)

SRC_FILES= $(SRC_DIR)/Heap_Analyser.c \

$(SRC_DIR)/Helpers.c

INCLUDES=\

-I$(JDK_HOME)/include \

-I$(JDK_HOME)/include/solaris \

-I$(SRC_DIR) \

-I$(OUTPUT_DIR) \

-I/usr/include

DEFINES=SOLARIS

all: libheap_analyser.so

libheap_analyser.so:

$(CC_HOME)/cc -l c -G -D$(DEFINES) $(INCLUDES) $(SRC_FILES) -o $(OUTPUT_DIR)/$@

ldd -r $(OUTPUT_DIR)/$@

clean:

rm -f *.so *.o

showcode: $(SRC_FILES)

$(CC_HOME)/cc -G -E $(INCLUDES) $(SRC_FILES) > showcode

</code>

and

<code>

JDK_HOME=/opt/SMAW/SMAWjdk/1.5 # to be changed

CC_HOME=/opt/SMAW/SUNWspro_sunstudio11/SUNWspro/prod/bin # to be changed

SRC_DIR=.

OUTPUT_DIR=.

CLASSPATH=$(LIB_DIR)

SRC_FILES= $(SRC_DIR)/Heap_Analyser.c \

$(SRC_DIR)/Helpers.c

INCLUDES=\

-I$(JDK_HOME)/include \

-I$(JDK_HOME)/include/solaris \

-I$(SRC_DIR) \

-I$(OUTPUT_DIR) \

-I/usr/include

DEFINES=SOLARIS

all: libheap_analyser.so

libheap_analyser.so:

$(CC_HOME)/cc -xO2 -mt -xarch=v8plus -fast -xregs=no%appl -K pic -z text -z ignore -z defs -l c -G -D$(DEFINES) $(INCLUDES) $(SRC_FILES) -o $(OUTPUT_DIR)/$@

ldd -r $(OUTPUT_DIR)/$@

clean:

rm -f *.so *.o

showcode: $(SRC_FILES)

$(CC_HOME)/cc -G -E $(INCLUDES) $(SRC_FILES) > showcode

</code>

Here comes the call to the iterate function:

<code>

error = (*jvmti)->IterateOverReachableObjects(jvmti, &tagRoot, &tagStack, &processObject, gdata);

</code>

... and here are the prototypes of the callbacks ...

<code>

jvmtiIterationControl JNICALL tagRoot(jvmtiHeapRootKind root_kind, jlong class_tag, jlong size, jlong* tag_ptr, void* user_data)

jvmtiIterationControl JNICALL tagStack(jvmtiHeapRootKind root_kind, jlong class_tag, jlong size, jlong* tag_ptr, jlong thread_tag, jint depth,jmethodID method, jint slot, void* user_data)

jvmtiIterationControl JNICALL processObject(jvmtiObjectReferenceKind reference_kind, jlong class_tag, jlong size, jlong* tag_ptr, jlong referrer_tag, jint referrer_index, void* user_data)

</code>

The analysed process has 32 bit address space.

After I made several improvements and restructured the agents code the situation is now that the analysed process panics (signal 10) with the FIRST run when I insert the problematic code at the corresponding position.

Robert

lldaedaluslla at 2007-7-14 16:30:19 > top of Java-index,Developer Tools,Debugging and Profiling Tool APIs...
# 5

It is possible you have run into a memory alignment issue. I think the compiler you are using is slightly newer than the one used to build the JDK you are using. You might try building your 32bit native code with the option -xmemalign=4s, and make sure that your 64bit native code is NOT built with this option. I'm fading on all the exact details here, but the Sun Compilers changed the default memory alignment rules in the newer compilers.

Also see the index.html and README files in the JDK demo/jvmti directory, there are some native compiler option hints in there that might help.

-kto

kellyohaira at 2007-7-14 16:30:19 > top of Java-index,Developer Tools,Debugging and Profiling Tool APIs...
# 6
Hi I tried that memalign option. The result is that I it works now with the problematic code too. Thank you very much even though this code is no more needed. But at least now we know one more point to pay attention to when using JVMTI. Robert
lldaedaluslla at 2007-7-14 16:30:19 > top of Java-index,Developer Tools,Debugging and Profiling Tool APIs...
# 7
This memory alignment issue is not really a JVM TI issue, but would be true for any native library that gets loaded into the VM process. Any JNI library built to back up the Java 'native' methods would need to be built this way. -kto
kellyohaira at 2007-7-14 16:30:19 > top of Java-index,Developer Tools,Debugging and Profiling Tool APIs...
# 8
Is there information somewhere which JDK/JRE has been compiled with which memory alignment? We will not always use java 1.5 and it might be possible that we need another alignment with the next java release, not? Thanks, Robert
lldaedaluslla at 2007-7-14 16:30:19 > top of Java-index,Developer Tools,Debugging and Profiling Tool APIs...
# 9

The memory alignment issue is unique to SPARC 32bit and it should be safe to use memalign=4s for 32bit SPARC for all JDK's. It was the Sun Studio compiler that changed this default, starting with Sun Studio 9 I think. If the 32bit SPARC JDK changed to require 8byte alignment, we could break existing JNI libraries, so I doubt that will change.

For SPARC 64bit it should be left as memalign=8s or not specified, the 64bit alignment hasn't changed and likely never will, in fact if you use memalign=4s with 64bit, you could create more SIGBUS errors.

-kto

kellyohaira at 2007-7-14 16:30:19 > top of Java-index,Developer Tools,Debugging and Profiling Tool APIs...
# 10
After clearing out this I would like to understand what happens in the memory and causes the SIGBUS (without the memalign option set) when that string is allocated with 16 bytes instead of 17 bytes (which works). Could you explain that here (shortly)?Thanks Robert
lldaedaluslla at 2007-7-14 16:30:19 > top of Java-index,Developer Tools,Debugging and Profiling Tool APIs...
# 11

When compiled with memalign=4s, the code accepts and allows for anything to be on any 4byte boundary. This is the way the Solaris 32bit JDK is built. So when the newer compilers changed to a memalign=8s default, we had to explcitly turn it down to 4s so that existing JNI libraries would still work. Addresses are passed back and forth from the VM and the JNI (and JVM TI agent) libraries, so it's important that they follow the same alignment rules.

When it's set to memalign=8s (and I'm not an expert in this compiler feature, just explaining what I see happening), then the compiler is under the impression that 64bit objects (like jlong) will always be on an 8byte boundary. But addresses coming from the VM are on 4byte boundaries. So you have a 50% chance that 4byte aligned is also 8byte aligned :^) I'm not exactly sure what C types are impacted by this alignment.

The 16->17 change did something to the layout of objects, either the heap layout or the stack layout, something that caused the VM addresses your callback got to change from not being 8byte aligned to being 8byte aligned. It would take a few hours with dbx to completely understand how this happens. I suspect the VM is trying to optimize space and may be assuming memalign=4s, which means that code compiled with memalign=8s is at risk, but that's some guessing on my part.

-kto

kellyohaira at 2007-7-14 16:30:19 > top of Java-index,Developer Tools,Debugging and Profiling Tool APIs...
# 12

When I have that memory alignment activated (xmemalign=4s) and want to do some dirty pointer arithmetics, it looks like that we again cause SIGBUS.

We have a pointer to a 64-Bit integer (jlong) and want to use it to encode one short, one long and one char in it by doing something like this:

// write the objectIndex at + 2 byte position of the 64 bit tag value

void setTagObjectIndex(jlong *tag, jint val)

{

jint *objectIndex;

char *tmp;

tmp = (char *) tag;

tmp = tmp + 2;

objectIndex = (jint *) tmp;

*objectIndex = val;

}

Is it true that this causes SIGBUS in case of that memory alignment? Is it possible to do with alignment activated and not to cause SIGBUS?

Thanks

Robert

lldaedaluslla at 2007-7-14 16:30:19 > top of Java-index,Developer Tools,Debugging and Profiling Tool APIs...
# 13
The 'int' pointers must always have 4 byte alignment. When packing/unpacking bytes like this, I'd suggest using shifts and masks, e.g. *(int*)(((*tag)>>8) & 0xFFFFFFFF) = val. -kto
kellyohaira at 2007-7-14 16:30:19 > top of Java-index,Developer Tools,Debugging and Profiling Tool APIs...