SIGN in Sun Studio 11 (x86) running slowly

Hi,

I recently discovered that the sign (dsign) intrinsic function seems to be much slower than expected when using the -fast switch on an x86 platform. I compile with f95 -f77 -fast.

Specifically, the intrinsic sign function is much slower than a homegrown replacement consisting of a single if-then-else statement (see function "fastsign" below). However, when compiling with -g, or with no optional compiler switches at all, the intrinsic sign function is slightly faster than the homegrown one.

I'm experiencing this problem on Solaris 10, Sun Studio 11, Fortran 95 8.2 Patch 121020-02 2006/04/06 with an AMD Opteron CPU at 2.6GHz.

As a check, I ran the code on an UltraSparc II 450MHz with Solaris 9 and the Fortran 77 5.3 2001/05/15 compiler, also with the -fast switch and found that the intrinsic sign and the homegrown function mentioned above take about the same amount of time.

While it's not difficult to replace the uses of the intrinsic sign with the function I created, this problem only occurs when using the -fast switch.

Any help anyone could give me would be appreciated. Alternatively, if someone could suggest an alternative library to use, that would work too.

I realize that this post is sort of similar to a post by another user experiencing slow double intrinsic functions, so I apologize in advance if this should not be a new post.

Thanks in advance,

Jon

[code]

c
      program test1
c
      implicit none
      integer i, imax
      real*8 q1, q2, q3, q4, t1, t2
      real etime, dummy(2)
      parameter ( imax = 1000000 )
      real*8 fastsign
c
c--
c
c     Initialise CPU timer
c
      t1 = dble ( etime ( dummy ) )
c
      q1 = 1.0d0
c
      t1 = dble ( etime ( dummy ) )
      q3 = 0.0d0
      do i = 1, imax
         q2 = sign ( 0.5d0, q1 )
         q3 = q3 + ( 0.5d0 + q2 ) * 2.0d0
      enddo
      t1 = dble ( etime ( dummy ) ) - t1
c
      t2 = dble ( etime ( dummy ) )
      q4 = 0.0d0
      do i = 1, imax
         q2 = fastsign ( 0.5d0, q1 )
         q4 = q4 + ( 0.5d0 + q2 ) * 2.0d0
      enddo
      t2 = dble ( etime ( dummy ) ) - t2
c
      write(6,2000) q3, q4
      write(6,2010) 'd_sign', t1
      write(6,2010) 'fastsign', t2
      write(6,2010) 'sign/fastsign', t1 / t2
c
c
 2000 format(2e16.8)
 2010 format(a16,1pe16.8)
c
      stop
      end
c
c--
c
      function fastsign ( a1, a2 )
c
      real*8 fastsign, a1, a2
c
c--
c
      if (a2 .ge. 0.0d0) then
         fastsign = a1
      else
         fastsign = - a1
      endif
c
      return
      end
c
c--
c

[/code]

[2743 byte] By [PerryBothron] at [2007-11-26 9:51:38]
# 1

P.S.

I've just discovered that I can make the SIGN intrinsic function much faster by compiling with -fast -xarch=amd64a, replacing the -xarch=sse2 that was lumped into the -fast macro.

That being said, the function that I created is still faster than SIGN by almost a factor of 2. Is there anything else that I can do to close this gap?

Thanks,

Jon

PerryBothron at 2007-7-7 1:05:00 > top of Java-index,Development Tools,Solaris and Linux Development Tools...
# 2

If you compile with -S, you will see that the implementation of SIGN is inlined from a function called __f95_sign. You can find that function in /opt/SUNWspro/prod/lib/libm.il, or in .../prod/lib/amd64/libm.il for the version used with -xarch=amd64. It's just a text file, so you can change it if you like. (It might be prudent, though, to make a different .il file just for your version of __f95_sign.)

By the way, your implementation of fastsign assumes that a1 is positive. If a1 is negative, it will get incorrect results. It also assumes that a2 is not a NaN, but that probably doesn't matter.

It's also possible that your timing results have been skewed by the compiler inlining fastsign. That inlining happens earlier than the inlining of __f95_sign, so the compiler can apply more optimizations. In some cases it will delete a loop entirely. I haven't looked at what it's doing to your code. It's best to inhibit inlining with -xinline=.
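
One way to make the comparison less sensitive to these optimizations is to feed the loop a second argument that changes on every iteration and to print both sums, so that neither call is loop-invariant or dead. The following is only a sketch under those assumptions; the program name test2 and the array q are made up for illustration, and fastsign is copied unchanged from the original post.

```fortran
      program test2
      implicit none
      integer i, imax
      parameter ( imax = 1000000 )
      real*8 q(imax), q3, q4
      real etime, dummy(2), t1, t2
      real*8 fastsign
c
c     Alternate the sign of the second argument so that neither
c     call can be hoisted out of its loop as an invariant.
      do i = 1, imax
         q(i) = dble ( 1 - 2 * mod ( i, 2 ) )
      enddo
c
c     Time the intrinsic.
      t1 = etime ( dummy )
      q3 = 0.0d0
      do i = 1, imax
         q3 = q3 + ( 0.5d0 + sign ( 0.5d0, q(i) ) ) * 2.0d0
      enddo
      t1 = etime ( dummy ) - t1
c
c     Time the homegrown function.
      t2 = etime ( dummy )
      q4 = 0.0d0
      do i = 1, imax
         q4 = q4 + ( 0.5d0 + fastsign ( 0.5d0, q(i) ) ) * 2.0d0
      enddo
      t2 = etime ( dummy ) - t2
c
c     Printing both sums keeps the loops from being dead code.
      write(6,*) q3, q4
      write(6,*) 'sign    ', t1
      write(6,*) 'fastsign', t2
      end
c
      function fastsign ( a1, a2 )
      real*8 fastsign, a1, a2
      if (a2 .ge. 0.0d0) then
         fastsign = a1
      else
         fastsign = - a1
      endif
      return
      end
```

The two sums should agree. If one loop still reports a time close to zero, the compiler has probably simplified it further, and the ratio should not be trusted.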

igb at 2007-7-7 1:05:00
# 3

Hi,

I fixed my "fastsign" with a couple of DABS, but it didn't change the timing very much. If a2 in "fastsign" is NaN, then my homegrown function not working is the least of my problems.
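
The revised function was not posted. A minimal sketch of what a DABS-based fix might look like, assuming the intent is to return the magnitude of a1 with the sign of a2, is:

```fortran
      function fastsign ( a1, a2 )
c
c     Hypothetical corrected version: returns |a1| carrying the
c     sign of a2, so a negative a1 no longer gives a wrong result.
c
      real*8 fastsign, a1, a2
c
      if (a2 .ge. 0.0d0) then
         fastsign = dabs ( a1 )
      else
         fastsign = - dabs ( a1 )
      endif
c
      return
      end
```

This can still differ from the intrinsic when a2 is a NaN or a negative zero, since the comparison a2 .ge. 0.0d0 does not look at the sign bit.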

I also tried -xinline=no%, and in that case SIGN is definitely faster than "fastsign". However, for doing production runs with another code which uses the SIGN function a lot, I'll want to use inlining wherever I can for speed (right?).

Is there a way to make the intrinsic SIGN faster than the homegrown function?

Unfortunately, I don't know assembly.

Thanks,

Jon

PerryBothron at 2007-7-7 1:05:00
# 4

The problem is that your fastsign may not actually be faster in the "real world". As I said, I haven't checked what the Studio compiler is doing with it, but it may be deleting "dead code" when you use it in your benchmark.

On the other hand, it's entirely possible that your fastsign actually is faster. In that case, you would want to arrange for the compiler to call it rather than the library function. You can use the BIND(C) feature to name your function __f95_sign:

real*8 function fastsign(a, b) bind(c, name='__f95_sign')

However, you might need to add -nolibmil to the command line to avoid getting the version from libm.il.

igb at 2007-7-7 1:05:00
# 5

I've just discovered that because I've been compiling with the -f77 switch, __f95_sign wasn't being used. Instead, __d_sign was being used. Comparing run times with and without the -f77 switch shows that __f95_sign is faster than my homegrown "fastsign" function, and thus faster than __d_sign.

I guess the solution is to abandon the -f77 switch altogether so that __f95_sign will be used automatically.

Thanks for your help.

Jon

PerryBothron at 2007-7-7 1:05:00