SPARC, X86 and big mistake
Hi,
I am developing an MPI C++ molecular simulation code.
After more than two months of debugging, I decided to run my code on a Sun cluster on different machines (SPARC and AMD x86 processors). It compiles and runs fine on both machines (2 to 8 processors) as long as I don't use many particles (100,000 at most). With more particles (500,000 to 1,000,000) the code crashes right after initialisation (before any MPI communication) with a SEGV signal, both on the SPARC machine and on the AMD x86 machine in 64-bit mode. If I compile in 32-bit mode on the x86 machine, the code runs very well. I know why I get the SEGV signal: the machine reports that an object doesn't exist, but this object (the list that holds my particles) should be valid whatever its size.
My code also runs well with GCC 4.0.1 on a Mac Intel (LAM MPI distribution) with 5,000,000 particles, and on the Sun AMD x86 machine with GCC 4.0.2 and MPICH2. I don't understand why the code runs well on the Sun AMD x86 machine in 32-bit mode but not in 64-bit mode, nor why on SPARC it only works with up to 100,000 particles.
According to pmap, my executable uses about 70 MB per processor, and the machines have about 12 GB of RAM. If anybody has an idea (not about a bug in the code, but about compilation options), please let me know.
Thanks.
PS: I don't use any compilation options.
By [frenchkoi] at [2007-11-26 8:33:02]

# 1
With the data you have provided, it is impossible to say what the problem is.
- It might be due to an error in your code that does not always appear. For example, storing via an invalid pointer is sometimes harmless, and sometimes crashes the program.
- Is your program multithreaded? It might have a race condition that doesn't always result in an error.
- The problem might be due to a bug in the compiler or runtime libraries.
With any of these kinds of problems, the program crash usually appears far away in space and time from the point of the error, making the reason hard to find.
On sparc, run the program under dbx with Run-Time Checking enabled. Dbx RTC will report many kinds of program errors at the point where they occur. It might also help you locate a compiler error, if that is the problem. (Not all the checks are available on x86, which is why I suggest debugging on sparc.)
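As a complement to the first point above (an invalid store may be harmless or may crash much later), here is a minimal C++ sketch of one way to make such an error show up at the point where it happens. It is not dbx RTC and not the poster's code: the `Particle` record and the `checked_store` helper are hypothetical names chosen for illustration, and the technique shown is simply bounds-checked access with `std::vector::at`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical particle record; the real types are not shown in this thread.
struct Particle {
    unsigned short type;
    double x;
};

// Stores p at position index with bounds checking.
// Returns false instead of writing outside the container.
bool checked_store(std::vector<Particle>& particles, std::size_t index,
                   const Particle& p) {
    try {
        particles.at(index) = p;  // at() throws std::out_of_range on a bad index
        return true;
    } catch (const std::out_of_range&) {
        return false;             // the error is reported where it occurs
    }
}
```

With `operator[]` the same out-of-range write would be undefined behaviour, which is the kind of error that may only surface with larger inputs or on another platform.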
# 2
I found the solution to my problem. Here is some information that could help other people using MPI.
In the "buggy" version, I built an MPI type to send my data between the processors:
MPI_Datatype TabType [3] = {MPI_UNSIGNED_SHORT,MPI_DOUBLE,MPI_UB};
but MPI_UB is the "bug". I don't know why, but when I send my data:
// Send to the neighbouring processors
nErr = MPI_Send(m_TabRightS, m_nRight, MPI_MOLE, m_nrightproc, 99, MPI_COMM_WORLD);
if (nErr != 0)
    throw CExcept("Send error N", nErr % 256);
nErr = MPI_Send(m_TabLeftS, m_nLeft, MPI_MOLE, m_nleftproc, 99, MPI_COMM_WORLD);
if (nErr != 0)
    throw CExcept("Send error N", nErr % 256);
// Receive from the other processors
nErr = MPI_Recv(m_TabRightR, m_nDim, MPI_MOLE, m_nleftproc, 99, MPI_COMM_WORLD, &Status);
if (nErr != 0)
    throw CExcept("Receive error N", nErr % 256);
nErr = MPI_Recv(m_TabLeftR, m_nDim, MPI_MOLE, m_nrightproc, 99, MPI_COMM_WORLD, &Status);
if (nErr != 0)
    throw CExcept("Receive error N", nErr % 256);
Every processor receives its data in the tables m_TabLeftR and m_TabRightR, but the last element of that datatype in these tables was replaced by "-1" (incompatible with my code) with mpCC, or by "0" (compatible with my code) with GCC. Finally, to receive all the information, I changed the type to:
MPI_Datatype TabType [2] = {MPI_UNSIGNED_SHORT,MPI_DOUBLE};
In this case I don't lose my last element. I don't really know why MPI_UB disturbs the exchange between processors, but I finally found my solution.
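For readers who hit the same problem: MPI_UB was deprecated in MPI-2.0 and removed in MPI-3.0, and the usual replacement for setting a type's extent is MPI_Type_create_resized. The layout of a struct can differ between 32-bit and 64-bit builds, so a displacement for MPI_UB that is right on one platform may be wrong on another; whether that is what happened here cannot be told from the thread. The sketch below is only an illustration under assumptions: the `Mole` record and its field names are hypothetical (the real structure is not shown), and the code contains no MPI calls. It only computes, from the compiler's own layout, the displacements one would pass to MPI_Type_create_struct and the extent one would pass to MPI_Type_create_resized.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical record with the two field types named in the post
// (MPI_UNSIGNED_SHORT and MPI_DOUBLE); the real structure is not shown.
struct Mole {
    unsigned short type;
    double value;
};

// Byte offsets of the two fields and the full size of one record,
// padding included.
struct MoleLayout {
    std::size_t displ[2];
    std::size_t extent;
};

// Takes the layout from the compiler instead of hard-coding byte offsets,
// so the values stay correct in both 32-bit and 64-bit builds.
MoleLayout describe_mole() {
    MoleLayout layout;
    layout.displ[0] = offsetof(Mole, type);
    layout.displ[1] = offsetof(Mole, value);
    layout.extent = sizeof(Mole);
    // Sanity checks: the fields do not overlap and both fit in the extent.
    assert(layout.displ[0] + sizeof(unsigned short) <= layout.displ[1]);
    assert(layout.displ[1] + sizeof(double) <= layout.extent);
    return layout;
}
```

In an MPI program these values would be converted to MPI_Aint; using sizeof(Mole) as the extent keeps consecutive array elements correctly spaced when more than one element is sent.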
Thanks.