Poor I/O Performance on W2100z Under Solaris 10 x86

I am experiencing extremely poor performance when doing large sequential writes to the Ultra 320 SCSI disk array attached to my <a href="http://www.sun.com/desktop/workstation/w2100z/index.jsp" target="_blank">W2100z</a> under Solaris 10 x86.

I have configured the system to dual boot between both Red Hat Enterprise Linux version 4 and Solaris 10 x86. When I execute my application under Linux, I get about 600 Mbps when writing to the SCSI disk array. This is the level of performance that I would expect from my current hardware configuration. However, when I boot the exact same machine under Solaris 10 I only see about 385 Mbps when writing to the disk.

I have tried several things to correct this issue, from tweaking kernel parameters to trying different mount options for the UFS filesystem, and I am still unable to identify the bottleneck. I would really like to use the Solaris OS for my application, but I will not be able to accept half the performance that my hardware can support.

Do any of you Solaris x86 gurus out there know of any issues with the HAL or possibly a kernel parameter that may be artificially restricting the I/O performance?

Here is a brief description of the hardware:

<ul>

<li>Workstation: W2100z

<li>RAM: 4096 MB

<li>SCSI Controller: <a href="http://www.sun.com/bigadmin/hcl/data/sol/components/details/890.html" target="_blank">Adaptec 2120S</a> (RAID 0 over two 73 GB Seagate Ultra 320 SCSI drives)

</ul>

By [Warren] at [2007-11-25 22:55:21]
# 1

What file system and file system parameters are you using? Could it be you are comparing a journaled filesystem to a non-journaled file system?

I have a W2100z myself but am new to Solaris. I read somewhere that you can tune the filesystem for much better performance by using "logging." It should already be enabled by default in Solaris 10.

Check out these links for a start:

<a href="http://docsun.cites.uiuc.edu/sun_docs/C/solaris_9/SUNWaadm/LOGVOLMGRADMIN/p40.html" target="_blank">http://docsun.cites.uiuc.edu/sun_docs/C/solaris_9/SUNWaadm/LOGVOLMGRADMIN/p40.html</a>

<a href="http://www.itworld.com/Comp/2377/swol-0922-supersys/" target="_blank">http://www.itworld.com/Comp/2377/swol-0922-supersys/</a>

<a href="http://www.tech-recipes.com/solaris_system_administration_tips798.html" target="_blank">http://www.tech-recipes.com/solaris_system_administration_tips798.html</a>

<a href="http://www.unixville.com/?q=node/view/132" target="_blank">http://www.unixville.com/?q=node/view/132</a>

<a href="http://sysunconfig.net/unixtips/solaris.html#ufs" target="_blank">http://sysunconfig.net/unixtips/solaris.html#ufs</a>

I hope this helps or someone with more experience with Solaris chimes in.

zemplar at 2007-7-5 17:10:43 > top of Java-index,Solaris Operating System,Solaris Essentials - General Technical Questions...
# 2

Hi zemplar:

Thank you for the suggestion. I have experimented both with and without journalling enabled. When 'logging' is enabled, which is the default for Solaris 10, the filesystem attempts to write in a sequential fashion to the journal instead of directly to the location of the updated file. This can improve performance in some instances, such as compiling, where a large number of small files are being updated. The OS then unrolls the journal and updates the files in their correct locations later, presumably when the system is less busy.

Unfortunately, my application is not a good prospect for running with a journal. It is performing large sequential writes to the disk. I believe the default journal size for Solaris 10 is 1 MB per 1 GB of disk. That should produce about a 60 MB journal on my system. I am writing 250 MB files, and the system is always busy. Therefore, my application should benefit from disabling the journal by adding <b>nologging</b> to <b>/etc/vfstab</b>.
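The journal estimate above can be checked with a few lines of C. This is only a sketch of the rule of thumb, not the actual UFS sizing code: the function name is made up, the 64 MB cap is an assumption passed in by the caller, and the slice size is taken from the mkfs output posted later in this thread (126576134, assumed to be a count of 512-byte sectors).

```c
#include <stdint.h>

/*
 * Rule of thumb quoted in the post: about 1 MB of UFS log per 1 GB of
 * file system. cap_mb is an assumed upper bound; pass 0 for "no cap".
 */
uint64_t ufs_log_estimate_mb(uint64_t fs_bytes, uint64_t cap_mb)
{
    const uint64_t gib = 1024ULL * 1024ULL * 1024ULL;
    uint64_t mb = fs_bytes / gib;       /* 1 MB per whole GB */

    if (mb == 0)
        mb = 1;                         /* assume at least 1 MB of log */
    if (cap_mb != 0 && mb > cap_mb)
        mb = cap_mb;
    return mb;
}
```

Calling ufs_log_estimate_mb(126576134ULL * 512ULL, 64) gives 60, which agrees with the "about 60 MB" figure and is well below the 250 MB file size.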

Under Linux, when running with the journal, the performance of my application is okay for a time, but then degrades quickly as the journal is forced to constantly unroll and the filesystem is basically asked to write all of my data twice. By not using the journal, I do see a boost in the performance of the application. However, and this is part of what I just do not understand about Solaris 10 x86, enabling or disabling the journal under Solaris 10 does not appear to change the performance at all. The maximum write performance remains capped at about 385 Mbps. I have verified that <b>logging</b> is not present in <b>/etc/mnttab</b> so I am at a loss as to why Solaris 10 x86 appears to always exhibit the same poor performance.

Warren at 2007-7-5 17:10:43
# 3

Have you tried 'prstat', 'iostat', or 'dtrace' to help isolate and debug if it is a Solaris filesystem issue, driver issue, etc.?

Still, I would be interested in knowing what linux filesystem (ext2/3, XFS, reiserfs 3/4) and parameters you are using in each OS for comparison and how you are testing. Is your RAID card using a Sun or Adaptec driver with Solaris 10?

Generally speaking, I've found Solaris much faster than Linux in comparable tests, though my testing centered more on processing power and video performance than storage throughput.

zemplar at 2007-7-5 17:10:43
# 4

Doesn't Linux use more aggressive caching on some of its filesystems?

On Solaris with a non-logging UFS filesystem you get synchronous file metadata updates [1]; AFAIK, I/O for changed file data is scheduled as soon as the last open handle for the file is closed; the VM system tries to make sure that modified file pages are written to disk after 30 seconds [2], ....

All of this should help reduce the possibility of data loss in case of a catastrophic failure. And it probably means that a benchmark measures real I/O on Solaris, while on Linux the data is still sitting in some cache and the I/O was avoided.

[1] Casper Dik's "fastfs" utility can be used to enable async metadata updates on a non-logging Solaris UFS filesystem:

<a href="http://gd.tuwien.ac.at/infosys/servers/isc/inn/unoff-contrib/fastfs.c" target="_blank">http://gd.tuwien.ac.at/infosys/servers/isc/inn/unoff-contrib/fastfs.c</a>

Often the safe but slower alternative is to enable logging on the UFS filesystem.

[2] See the "dopageflush" kernel variable in the Solaris fsflush kernel process. I'm not sure, but doesn't Linux run with ``dopageflush == 0''?

<a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/fs/fsflush.c#64" target="_blank">http://cvs.opensolaris.org/source/xref/usr/src/uts/common/fs/fsflush.c#64</a>

jkeil at 2007-7-5 17:10:43
# 5

Hi:

Thank you both for the assistance. I have, in fact, used the stat utilities to try and identify the bottleneck.

Here is the output from <b>iostat -xcnCXTdz 5</b> during a typical run:

<pre>
     cpu
 us sy wt id
  0  8  0 92
                    extended device statistics
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b device
    3.0  201.0   24.0  45686.2 33.9  1.0  166.1    4.9  99 100 c1
    3.0  201.0   24.0  45686.2 33.9  1.0  166.1    4.9  99 100 c1t0d0
</pre>

I have also tried some of the DTrace scripts from Sun like <b>iosnoop</b> and <b>rwsnoop</b>. Unfortunately, nothing has jumped out at me as being amiss.

The Linux file system where I am getting about 600 Mbps is a journalled ext3 partition. I realize that Linux aggressively caches files, but the size of the files I am processing far exceeds the 4 GB of physical RAM in the system. Even under those conditions, Linux is at least 30% faster than Solaris.

I just tried setting dopageflush to 0 and disabled logging on the UFS filesystem and I saw a small performance increase. I got about 410 Mbps. This is still well below the performance I am seeing under Linux.

Thank you again for all of your help.

Warren at 2007-7-5 17:10:43
# 6

Is that a performance problem for writing only? How does the box perform when <i>reading</i> huge files from the SCSI disk array?

Did you try a read test on the raw SCSI disk device? How does that compare against a filesystem read test?

E.g. "dd if=/dev/rdsk/c2t0d0p0 of=/dev/null bs=128k" (assuming c2t0d0p0 is your SCSI disk array).
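If you want a number from a program rather than from dd plus iostat, the same kind of sequential read test can be sketched in C. This is a rough illustration only: the function name and its parameters are made up, it relies on the POSIX open(), read() and clock_gettime() calls, and it reports decimal MB/s (10^6 bytes per second). On a raw device the block size should be a multiple of the sector size, as with bs=128k above.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/*
 * Sequentially read up to max_bytes from path in bs-sized chunks.
 * Returns the throughput in MB/s (10^6 bytes per second), or -1.0 on
 * error or if nothing could be read.
 */
double read_throughput_mbs(const char *path, size_t bs, size_t max_bytes)
{
    struct timespec t0, t1;
    size_t total = 0;
    int failed = 0;
    char *buf;
    int fd;

    if (bs == 0)
        return -1.0;
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1.0;
    buf = malloc(bs);
    if (buf == NULL) {
        close(fd);
        return -1.0;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while (total < max_bytes) {
        ssize_t n = read(fd, buf, bs);
        if (n < 0) {            /* read error */
            failed = 1;
            break;
        }
        if (n == 0)             /* end of file or device */
            break;
        total += (size_t)n;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free(buf);
    close(fd);

    double secs = (double)(t1.tv_sec - t0.tv_sec) +
        (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (failed || total == 0 || secs <= 0.0)
        return -1.0;
    return (double)total / 1e6 / secs;
}
```

For example, read_throughput_mbs("/dev/rdsk/c2t0d0p0", 128 * 1024, (size_t)1 << 30) would time the first gigabyte of the device, with the same caveat as above that c2t0d0p0 is only a guess at the array's device name.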

jkeil at 2007-7-5 17:10:43
# 7

No, I have not tried any large read performance tests. My application does not require high performance reading of files. I would guess that the performance would be higher than writing, but that is not really related to my issue. I am trying to ascertain why Solaris 10 seems to perform poorly during large sequential writes when contrasted with Linux on the exact same hardware.

I understand filesystems well enough to not expect the same performance when reading and writing. I am basing my complaints on the fact that Tom's Hardware Guide and Seagate say that my disk drives should each be able to write at about 40 MB/s when using a filesystem. Coupled together in a RAID 0 array, I would expect around 75 MB/s. That should be approximately 600 to 630 Mbps in writing performance (depending on whether a megabyte is counted as 10^6 or 2^20 bytes), which happens to be almost exactly what I am seeing under Linux. Under Solaris, the best I have seen so far is 410 Mbps or about 51 MB/s. I am simply baffled by this 30% performance hit.
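Since the thread mixes Mbps, MB/s and iostat's kbytes per second, a small conversion sketch may help keep the figures comparable. The function names are made up for illustration; the only assumptions are that Mbps and MB/s are decimal units and that iostat counts a kbyte as 1024 bytes.

```c
/* Decimal units: 1 Mbps = 10^6 bits/s, 1 MB/s = 10^6 bytes/s. */
double mbps_to_mbytes(double mbps)
{
    return mbps / 8.0;
}

double mbytes_to_mbps(double mbytes_per_s)
{
    return mbytes_per_s * 8.0;
}

/* iostat reports kr/s and kw/s in units of 1024 bytes per second. */
double iostat_kb_to_mbps(double kb_per_s)
{
    return kb_per_s * 1024.0 * 8.0 / 1e6;
}
```

With these, 410 Mbps is 51.25 MB/s, 75 MB/s is 600 Mbps, and the 45686.2 kw/s from the iostat output works out to roughly 374 Mbps, which is in the same range as the 385 Mbps measured by the application.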

I missed part of the last post: I am using the driver that shipped with Solaris 10, in case that helps.

Warren at 2007-7-5 17:10:43
# 8

<table border="0" align="center" width="90%" cellpadding="3" cellspacing="1"><tr><td class="SmallText"><b>Warren wrote on Mon, 19 September 2005 23:03</b></td></tr><tr><td class="quote">

No, I have not tried any large read performance tests. My application does not require high performance reading of files. I would guess that the performance would be higher than writing, but that is not really related to my issue. I am trying to ascertain why Solaris 10 seems to perform poorly during large sequential writes when contrasted with Linux on the exact same hardware.

</td></tr></table>

The idea of the "read from raw disk device" test is to find out the maximum possible transfer rate that the S10 x86 driver is able to deliver. Maybe this is a Solaris driver issue. (Even better would be a "write to raw disk device" test using /dev/zero as input file and the raw RAID array device as the output, but that'll destroy the filesystem on the RAID!)

If the Solaris x86 driver is unable to access the RAID array at rates higher than 45 Mbytes/sec using the raw disk device and bypassing the filesystem layer, then you'll never see filesystem performance better than that.

jkeil at 2007-7-5 17:10:43
# 9

Here is the output from <b>iostat</b> during the <b>dd</b> read test:

<pre>
     cpu
 us sy wt id
  0  0  0 100
                    extended device statistics
    r/s    w/s     kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  367.5    0.0  47039.8    0.0  0.0  1.0    0.0    2.7   0 100 c1
  367.5    0.0  47039.8    0.0  0.0  1.0    0.0    2.7   0 100 c1t0d0
</pre>

It capped out at about 48 MB/s reading. Why would that be? I have two Seagate ST373307LW Ultra 320 SCSI disk drives, each capable of at least 40 MB/s transfers and performing together in a RAID 0 array, connected to an Ultra 320 SCSI controller capable of 320 MB/s transfers, connected via 64-bit PCI operating at 66 MHz.

Do you know why the Solaris driver would only be capable of transferring data at about 45 MB/s?
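The hardware chain described above can be written down as a quick sanity check: on paper, the slowest stage sets the ceiling. This is a sketch with made-up names; the stage figures are the ones quoted in this post, and the PCI figure is simply bus width in bytes times clock rate.

```c
/* Peak rate of a parallel bus in MB/s: width in bytes times clock in MHz. */
double bus_peak_mbs(int width_bits, double clock_mhz)
{
    return (double)width_bits / 8.0 * clock_mhz;
}

/* A chain of stages can go no faster than its slowest stage. */
double chain_limit_mbs(const double *stage_mbs, int n)
{
    double limit;
    int i;

    if (n <= 0)
        return 0.0;
    limit = stage_mbs[0];
    for (i = 1; i < n; i++) {
        if (stage_mbs[i] < limit)
            limit = stage_mbs[i];
    }
    return limit;
}

/* Two 40 MB/s disks in RAID 0, Ultra 320 SCSI, 64-bit PCI at 66 MHz. */
const double w2100z_stages_mbs[3] = { 2 * 40.0, 320.0, 528.0 };
```

bus_peak_mbs(64, 66.0) is 528 MB/s, and chain_limit_mbs(w2100z_stages_mbs, 3) is 80 MB/s, so the disks themselves are the slowest stage on paper and the 48 MB/s seen in the read test is still well below that.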

Warren at 2007-7-5 17:10:43
# 10

Is it possible that you have configured the Linux filesystem on the outer part of the array (= the outer part of the disks), and the Solaris filesystem is on the inner part of the array (= the inner part of the disks)? Typically the transfer rates for a disk are higher on the outer parts of the disk.

OTOH, the <b>dd</b> read test used the <i>p0</i> slice, so the test should have started reading on the outer parts of the disk, where transfer rates are highest.

What is the stripe size that was used to build the array? The <b>dd</b> read test used 128 kbyte transfers; maybe the transfer rates improve when you use twice the stripe size as the <b>dd</b> block size? (Assuming the aac driver / aac hardware is clever enough to split a single big read request into two requests, one for each disk, that both run concurrently.)

There's also bug 6213082, "aac driver on x86 should support 64-bit DMA addresses"; it seems the aac driver sometimes uses extra data copy operations because the aac driver is restricted to DMA physical addresses < 4GB. Your box has 4GB of main memory; I guess 0.5 GB - 1 GB of the memory is remapped at an address >= 4GB, because PCI I/O space needs a 0.5 - 1 GB chunk of the 32-bit address space.

<a href="http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6213082" target="_blank">http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6213082</a>

jkeil at 2007-7-5 17:10:43
# 11

Nice catch on that bug j.keil!

Do you know of a way, possibly a DTrace script, that I could determine if the aac driver on my system is performing the extra memory copies described in the bug summary?

Also, just to clarify, you believe that this bug would still apply to my dual Opteron Sun system even though I only have 4 GB of memory because of the way that PCI I/O allocates 32-bit address space. Correct?

Warren at 2007-7-5 17:10:43
# 12
Although this would be a hassle, you could remove some RAM so that the system has less than 4 GB and rerun your Solaris benchmark to see whether this bug is indeed the culprit.
zemplar at 2007-7-5 17:10:43
# 13

<table border="0" align="center" width="90%" cellpadding="3" cellspacing="1"><tr><td class="SmallText"><b>Warren wrote on Tue, 20 September 2005 19:59</b></td></tr><tr><td class="quote">

Do you know of a way, possibly a DTrace script, that I could determine if the aac driver on my system is performing the extra memory copies described in the bug summary?

</td></tr></table>

Yes, probably with dtrace and the "fbt" provider. The following traces the function calls in the kernel that happen during the "ddi_dma_sync()" call that is mentioned in bug id 6213082:

<div class="pre"><pre>#!/usr/sbin/dtrace -s

#pragma D option flowindent

/* Start tracing when this thread enters ddi_dma_sync(). */
fbt::ddi_dma_sync:entry
{
	self->traceme = 1;
}

/* While the flag is set, record every kernel function entry and return. */
fbt:::
/self->traceme/
{
}

/* Stop tracing when ddi_dma_sync() returns. */
fbt::ddi_dma_sync:return
{
	self->traceme = 0;
}
</pre></div>

On my 32-bit x86 box with 2GB of memory, the output is simply a series of

<div class="pre"><pre>  0  -> ddi_dma_sync
  0    -> rootnex_dma_flush
  0    <- rootnex_dma_flush
  0  <- ddi_dma_sync
  0  -> ddi_dma_sync
  0    -> rootnex_dma_flush
  0    <- rootnex_dma_flush
  0  <- ddi_dma_sync
  1  -> ddi_dma_sync
  1    -> rootnex_dma_flush
  1    <- rootnex_dma_flush
  1  <- ddi_dma_sync
</pre></div>

Probably because hp->dmai_ibufp /* intermediate buffer address */ is always NULL (= not in use), there's not much activity in rootnex_dma_flush():

<a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/i86pc/io/rootnex.c#rootnex_dma_flush" target="_blank">http://cvs.opensolaris.org/source/xref/usr/src/uts/i86pc/io/rootnex.c#rootnex_dma_flush</a>

But when the intermediate buffer address hp->dmai_ibufp is not NULL, rootnex_dma_flush() calls rootnex_io_wtsync() or rootnex_io_rdsync():

<a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/i86pc/io/rootnex.c#rootnex_dma_flush" target="_blank">http://cvs.opensolaris.org/source/xref/usr/src/uts/i86pc/io/rootnex.c#rootnex_dma_flush</a>

<div class="pre"><pre>
	if (hp->dmai_ibufp) {
		if (cache_flags == DDI_DMA_SYNC_FORDEV) {
			rval = rootnex_io_wtsync(hp, MAP);
		} else {
			rval = rootnex_io_rdsync(hp);
		}
	}
</pre></div>

Apparently the extra memory copy happens inside rootnex_io_wtsync() and rootnex_io_rdsync(), using bcopy() calls:

<a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/i86pc/io/rootnex.c#rootnex_io_wtsync" target="_blank">http://cvs.opensolaris.org/source/xref/usr/src/uts/i86pc/io/rootnex.c#rootnex_io_wtsync</a>

<a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/i86pc/io/rootnex.c#rootnex_io_rdsync" target="_blank">http://cvs.opensolaris.org/source/xref/usr/src/uts/i86pc/io/rootnex.c#rootnex_io_rdsync</a>

I'm not sure if dtrace is able to trace the "bcopy" calls; bcopy contains hand-coded low-level assembler code, and dtrace has problems with code like that and doesn't trace it.

But if you see any rootnex_io_wtsync() or rootnex_io_rdsync() calls in the dtrace output, this would be a hint that the intermediate buffers are in use.

<table border="0" align="center" width="90%" cellpadding="3" cellspacing="1"><tr><td class="SmallText"><b>Quote:</b></td></tr><tr><td class="quote">

Also, just to clarify, you believe that this bug would still apply to my dual Opteron Sun system even though I only have 4 GB of memory because of the way that PCI I/O allocates 32-bit address space. Correct?

</td></tr></table>

Yes.

jkeil at 2007-7-5 17:10:43
# 14

<table border="0" align="center" width="90%" cellpadding="3" cellspacing="1"><tr><td class="SmallText"><b>Warren wrote on Tue, 20 September 2005 19:59</b></td></tr><tr><td class="quote">

Also, just to clarify, you believe that this bug would still apply to my dual Opteron Sun system even though I only have 4 GB of memory because of the way that PCI I/O allocates 32-bit address space. Correct?

</td></tr></table>

The mdb ::memlist command can be used to print base addresses and sizes for the physical memory installed in the machine:

echo ::memlist | mdb -k

How much of the 4GB of system memory is below / above the 4GB mark?

jkeil at 2007-7-5 17:10:43
# 15

Okay, I ran the supplied DTrace script; however, I did not see any calls to either <b>rootnex_io_wtsync()</b> or <b>rootnex_io_rdsync()</b> during my test.

Here is the output from the <b>::memlist</b> command:

<pre>
phys_install:
            ADDR             BASE             SIZE
fffffffff4c01000                0            9c000
fffffffff4c01020           100000           600000
fffffffff4c01040           700000         cf860000
</pre>

Thanks again for all of your help.

Warren at 2007-7-21 14:27:36
# 16

Interesting, there's only ~3.5 GB of physical RAM listed by ::memlist. The remaining 0.5 GB of RAM appears to be lost, or is hidden by PCI I/O space.

According to Mike Riley ...

<a href="http://groups.yahoo.com/group/solarisx86/message/28608" target="_blank">http://groups.yahoo.com/group/solarisx86/message/28608</a>

... I was expecting that Solaris is able to use all of the 4GB.

I guess "prtconf -p" does not report a memory size of 4096 Megabytes, either.

Well, apparently bug 6213082 doesn't explain the performance problem on your machine, because all of the physical memory is below 4GB.

j.keil at 2007-7-21 14:27:36
# 17

<table border="0" align="center" width="90%" cellpadding="3" cellspacing="1"><tr><td class="SmallText"><b>Warren wrote on Fri, 16 September 2005 21:19</b></td></tr><tr><td class="quote">

Here is the output from <b>iostat -xcnCXTdz 5</b> during a typical run:

<pre>
     cpu
 us sy wt id
  0  8  0 92
                    extended device statistics
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b device
    3.0  201.0   24.0  45686.2 33.9  1.0  166.1    4.9  99 100 c1
    3.0  201.0   24.0  45686.2 33.9  1.0  166.1    4.9  99 100 c1t0d0
</pre>

</td></tr></table>

The average write request size is 45686.2 [kw/s] / 201.0 [w/s] = 227.2 kbytes. Is that less or greater than the RAID's stripe size?

If it's less than the stripe size (and the RAID adapter does not cache the data), then you're probably measuring the maximum transfer rate of a single disk drive.

The "active queue" length is exactly 1.0 requests (with lots of waiting requests), so the Solaris aac driver appears to submit one 227.2 kbyte request to the RAID controller and wait until that request has completed before the next write request is submitted.

Maybe the Linux version of the aac driver submits more than one write request to the RAID controller, so that the RAID hardware can write on both SCSI disks in parallel?
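The arithmetic behind this reading of the iostat output can be sketched as two small helpers. The names are made up for illustration; the second helper is just Little's law (time in a queue = queue length / throughput), which is how the wait/actv columns relate to wsvc_t/asvc_t.

```c
/* Average request size in kbytes: kbytes per second over requests per second. */
double avg_request_kb(double kb_per_s, double ops_per_s)
{
    return ops_per_s > 0.0 ? kb_per_s / ops_per_s : 0.0;
}

/* Little's law: average time spent in a queue, in milliseconds. */
double queue_time_ms(double queue_len, double ops_per_s)
{
    return ops_per_s > 0.0 ? queue_len / ops_per_s * 1000.0 : 0.0;
}
```

For the write run, avg_request_kb(45686.2, 201.0) is about 227 kbytes, queue_time_ms(1.0, 204.0) is about 4.9 ms (the asvc_t column, using 3.0 reads plus 201.0 writes per second), and queue_time_ms(33.9, 204.0) is about 166 ms (the wsvc_t column). For the dd read run, avg_request_kb(47039.8, 367.5) is about 128 kbytes, matching bs=128k. So each request takes only a few milliseconds once it is active; most of the time is spent waiting for the single active slot.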

j.keil at 2007-7-21 14:27:36
# 18

Okay, the RAID 0 array's stripe size is 256 kbytes. That is larger than 227.2 kbytes. Also note that the RAID controller has 64 MB of cache on board. Any parameters (kernel, mount, etc.) that I can tweak to enable more active requests?
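To see why a request smaller than the stripe size may keep only one disk busy, here is a small sketch that counts how many member disks a single request touches. It assumes "stripe size" means the chunk written to one disk before moving to the next, and that chunks are laid out round-robin; the function name and parameters are made up for illustration.

```c
#include <stdint.h>

/*
 * Number of distinct member disks touched by one request on a RAID 0 set,
 * assuming stripe-sized chunks are laid out round-robin across the disks.
 */
unsigned raid0_disks_touched(uint64_t offset, uint64_t length,
    uint64_t stripe, unsigned ndisks)
{
    uint64_t first, last, units;

    if (length == 0 || stripe == 0 || ndisks == 0)
        return 0;
    first = offset / stripe;                /* first chunk touched */
    last = (offset + length - 1) / stripe;  /* last chunk touched */
    units = last - first + 1;
    return units >= ndisks ? ndisks : (unsigned)units;
}
```

With a 256 kbyte stripe and two disks, a 227 kbyte request that starts on a stripe boundary touches one disk, the same request starting 128 kbytes into a stripe touches both, and any 512 kbyte request touches both. With only one request active at a time, many requests would therefore keep just one of the two drives busy.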

Here is the output of <b>mkfs -m /dev/dsk/c1t0d0s0</b> for the Solaris partition in question:

<pre>

mkfs -F ufs -o nsect=63,ntrack=240,bsize=8192,fragsize=1024,

cgsize=7,free=1,rps=166,nbpi=8188,opt=t,apc=0,gap=0,nrpos=8,

maxcontig=32,mtb=n /dev/rdsk/c1t0d0s0 126576134

</pre>

Also, I discovered the following BIOS settings related to the PCI I/O memory hole: (Note: The typos are not my own, they are actually in the BIOS.)

<pre>

- Memory Hole remapping[Disabled, Software, Hardware]

Enable Memory remapping

Memory hole space will

mapping to above 4GB

address.

This function in useful

on 64bit OS or support

above 4GB memory OS.

- 4 GB Memory Hole Adjust[Manual, Auto]

Either manually adjust

the PCI memory hole or

select automatic.

Be aware that if Memhole

mapping is set to

Software, the actual

hole size will depend on

the size of installed

DIMMs.

</pre>

Memory Hole remapping was set to Disabled, so I changed it to Software and now <b>prtconf -p</b> reports 4095 Megabytes of RAM.

Here is the new output from the <b>::memlist</b> command:

<pre>
phys_install:
            ADDR             BASE             SIZE
fffffffff3601000                0            9c000
fffffffff3601020           100000           600000
fffffffff3601040           700000         bf860000
fffffffff3601060        100000000         40000000
</pre>
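The two ::memlist listings can be added up to see how much memory is installed and how much of it sits above the 4 GB boundary that matters for bug 6213082. This is a sketch: the struct and function names are made up, and the BASE/SIZE values are my reading of the hex columns in the two listings in this thread.

```c
#include <stddef.h>
#include <stdint.h>

struct mem_range {
    uint64_t base;
    uint64_t size;
};

/* BASE/SIZE pairs as read from the two ::memlist listings in this thread. */
const struct mem_range memlist_before[3] = {
    { 0x0ULL,         0x9c000ULL },
    { 0x100000ULL,    0x600000ULL },
    { 0x700000ULL,    0xcf860000ULL },
};

const struct mem_range memlist_after[4] = {
    { 0x0ULL,         0x9c000ULL },
    { 0x100000ULL,    0x600000ULL },
    { 0x700000ULL,    0xbf860000ULL },
    { 0x100000000ULL, 0x40000000ULL },
};

/* Sum of all range sizes in bytes. */
uint64_t total_bytes(const struct mem_range *r, size_t n)
{
    uint64_t sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += r[i].size;
    return sum;
}

/* Bytes of the listed ranges that lie at or above the 4 GB boundary. */
uint64_t bytes_above_4g(const struct mem_range *r, size_t n)
{
    const uint64_t limit = 1ULL << 32;
    uint64_t above = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        uint64_t end = r[i].base + r[i].size;

        if (r[i].base >= limit)
            above += r[i].size;
        else if (end > limit)
            above += end - limit;
    }
    return above;
}
```

By this count the first listing holds 0xcfefc000 bytes (about 3327 MB) with nothing above 4 GB, and the listing after the BIOS change holds 0xffefc000 bytes (about 4095 MB, in line with what prtconf -p reports) with exactly 1 GB remapped above the 4 GB mark.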

The performance appears to still be about the same. I re-ran the DTrace script after the BIOS change and I still do not see any calls to either <b>rootnex_io_wtsync()</b> or <b>rootnex_io_rdsync()</b> during my test.

I have been watching vmstat closely during my test and noticed something I did not expect. Based on my understanding of the Solaris disk cache, Solaris should be trying to cache the files in memory, at least up to the 4 GB of available physical memory. However, I do not see the free memory size decreasing to zero during my test. Is this expected behavior?

Warren at 2007-7-21 14:27:36
# 19

<table border="0" align="center" width="90%" cellpadding="3" cellspacing="1"><tr><td class="SmallText"><b>Warren wrote on Fri, 23 September 2005 19:04</b></td></tr><tr><td class="quote">

I have been watching vmstat closely during my test and noticed something I did not expect. Based on my understanding of the Solaris disk cache, Solaris should be trying to cache the files in memory, at least up to the 4 GB of available physical memory. However, I do not see the free memory size decreasing to zero during my test. Is this expected behavior?

</td></tr></table>

Yes, I think this was changed in Solaris 8 and newer. Cached file pages are immediately put on the free list. The data is cached and could be re-used, but is reported as free memory now.

<a href="http://sunsolve.sun.com/pub-cgi/show.pl?target=content/content8" target="_blank">http://sunsolve.sun.com/pub-cgi/show.pl?target=content/content8</a>

j.keil at 2007-7-21 14:27:36
# 20

<table border="0" align="center" width="90%" cellpadding="3" cellspacing="1"><tr><td class="SmallText"><b>Warren wrote on Fri, 23 September 2005 19:04</b></td></tr><tr><td class="quote">

Okay, the RAID 0 array's stripe size is 256 kbytes. That is larger than 227.2 kbytes. Also note that the RAID controller has 64 MB of cache on board. Any parameters (kernel, mount, etc.) that I can tweak to enable more active requests?

</td></tr></table>

Some guys on FreeBSD seem to have similar problems with write performance and the aac driver / Adaptec 2120S RAID controller.

The following link to a FreeBSD mailing list message contains both a hint that an enabled read cache on the 2120S kills write performance (==> the read cache should be disabled and the write cache enabled; apparently this can be changed in the SCSI BIOS), and a hint that the problem only occurs on amd64:

<a href="http://lists.freebsd.org/pipermail/freebsd-amd64/2004-December/003111.html" target="_blank">http://lists.freebsd.org/pipermail/freebsd-amd64/2004-December/003111.html</a>

<div class="pre"><pre>On Tue, 28 Dec 2004, Don Bowman wrote:

> yeah, having the read cache on wrecks the write speed.

> Something to do with having it write and then read it back into the

> cache.

Thanks, I'll try to reset system after container reconfiguration. Hope

this helps. Without reboot I cannot see any difference.

> This made a dramatic difference for me. I'm a dual Xeon, 5.3

Maybe this is the point. On i386/Dual Xeon I have no problems with raid5

controller speed...

> CLI > open aac0

> Executing: open "aac0"

>

> AAC0> container show cache 0

> Executing: container show cache 0

>

> Global Container Read Cache Size : 475136

> Global Container Write Cache Size : 40140800

>

> Read Cache Setting: DISABLE

> Write Cache Setting: ENABLE ALWAYS

> Write Cache Status: Active, not protected, battery not present

...

</pre></div>

A similar hint:

<a href="http://lists.freebsd.org/pipermail/freebsd-scsi/2004-May/001191.html" target="_blank">http://lists.freebsd.org/pipermail/freebsd-scsi/2004-May/001191.html</a>

j.keil at 2007-7-21 14:27:36
# 21

<table border="0" align="center" width="90%" cellpadding="3" cellspacing="1"><tr><td class="SmallText"><b>Warren wrote on Fri, 23 September 2005 19:04</b></td></tr><tr><td class="quote">

Okay, the RAID 0 array's stripe size is 256 kbytes. That is larger than 227.2 kbytes. Also note that the RAID controller has 64 MB of cache on board. Any parameters (kernel, mount, etc.) that I can tweak to enable more active requests?

</td></tr></table>

I had a closer look at the <i>sd</i> and <i>aac</i> opensolaris driver sources (should be very similar to what is released with Solaris 10 x86).

The <i>sd</i> driver implements a "throttle" mechanism, which is used to submit more than one active command to a SCSI-2 disk that supports tagged command queueing.

Some host adapters also support untagged queueing (internal queueing inside the host adapter driver).

When neither tagged nor untagged queueing is available with the host adapter driver, <i>sd</i> restricts itself to one outstanding I/O request at a time, by setting "un->un_throttle" to 1.

<a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/io/scsi/targets/sd.c#7979" target="_blank">http://cvs.opensolaris.org/source/xref/usr/src/uts/common/io/scsi/targets/sd.c#7979</a>

<div class="pre"><pre>
7979    /*
7980     * The following if/else code was relocated here from below as part
7981     * of the fix for bug (4430280). However with the default setup added
7982     * on entry to this routine, it's no longer absolutely necessary for
7983     * this to be before the call to sd_spin_up_unit.
7984     */
7985    if (SD_IS_PARALLEL_SCSI(un)) {
7986        /*
7987         * If SCSI-2 tagged queueing is supported by the target
7988         * and by the host adapter then we will enable it.
7989         */
7990        un->un_tagflags = 0;
7991        if ((devp->sd_inq->inq_rdf == RDF_SCSI2) &&
7992            (devp->sd_inq->inq_cmdque) &&
7993            (un->un_f_arq_enabled == TRUE)) {
7994            if (scsi_ifsetcap(SD_ADDRESS(un), "tagged-qing",
7995                1, 1) == 1) {
7996                un->un_tagflags = FLAG_STAG;
7997                SD_INFO(SD_LOG_ATTACH_DETACH, un,
7998                    "sd_unit_attach: un:0x%p tag queueing "
7999                    "enabled\n", un);
8000            } else if (scsi_ifgetcap(SD_ADDRESS(un),
8001                "untagged-qing", 0) == 1) {
8002                un->un_f_opt_queueing = TRUE;
8003                un->un_saved_throttle = un->un_throttle =
8004                    min(un->un_throttle, 3);
8005            } else {
8006                un->un_f_opt_queueing = FALSE;
8007                un->un_saved_throttle = un->un_throttle = 1;
8008            }
8009        } else if ((scsi_ifgetcap(SD_ADDRESS(un), "untagged-qing", 0)
8010            == 1) && (un->un_f_arq_enabled == TRUE)) {
8011            /* The Host Adapter supports internal queueing. */
8012            un->un_f_opt_queueing = TRUE;
8013            un->un_saved_throttle = un->un_throttle =
8014                min(un->un_throttle, 3);
8015        } else {
8016            un->un_f_opt_queueing = FALSE;
8017            un->un_saved_throttle = un->un_throttle = 1;
8018            SD_INFO(SD_LOG_ATTACH_DETACH, un,
8019                "sd_unit_attach: un:0x%p no tag queueing\n", un);
8020        }
</pre></div>

The <i>aac</i> driver appears to implement adapter internal queueing (= untagged-qing); I see a queue of 512 entries inside the aac driver:

<a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/intel/io/aac/aac_regs.h#344" target="_blank">http://cvs.opensolaris.org/source/xref/usr/src/uts/intel/io/aac/aac_regs.h#344</a>

<div class="pre"><pre>

344 struct aac_queue_entry qt_AdapNormCmdQueue \

345 [AAC_ADAP_NORM_CMD_ENTRIES];

</pre></div>

And functions like aac_do_async_io() appear to support multiple commands submitted to the command queue just fine:

<a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/intel/io/aac/aac.c#aac_do_async_io" target="_blank">http://cvs.opensolaris.org/source/xref/usr/src/uts/intel/io/aac/aac.c#aac_do_async_io</a>

<b>But:</b> The <i>aac</i> driver doesn't appear to announce its queueing ability to the outside world; the functions aac_tran_getcap() and aac_tran_setcap() don't support the SCSI_CAP_UNTAGGED_QING property.

<a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/intel/io/aac/aac.c#aac_tran_getcap" target="_blank">http://cvs.opensolaris.org/source/xref/usr/src/uts/intel/io/aac/aac.c#aac_tran_getcap</a>

<a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/intel/io/aac/aac.c#aac_tran_setcap" target="_blank">http://cvs.opensolaris.org/source/xref/usr/src/uts/intel/io/aac/aac.c#aac_tran_setcap</a>

This probably explains why <i>sd</i> refuses to submit more than one active command to the Adaptec RAID controller. And I guess that doesn't help performance.

To fix this, the <i>aac</i> driver probably needs to be changed: the functions aac_tran_getcap() and aac_tran_setcap() should accept the SCSI_CAP_UNTAGGED_QING property and should always report that untagged-qing is enabled. With the fixed <i>aac</i> driver, <i>sd</i> should start to submit up to three concurrent I/O requests.

For an example, see the USB mass storage device driver; it announces untagged queueing:

<a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/io/usb/scsa2usb/scsa2usb.c#scsa2usb_scsi_getcap" target="_blank">http://cvs.opensolaris.org/source/xref/usr/src/uts/common/io/usb/scsa2usb/scsa2usb.c#scsa2usb_scsi_getcap</a>

<a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/io/usb/scsa2usb/scsa2usb.c#scsa2usb_scsi_setcap" target="_blank">http://cvs.opensolaris.org/source/xref/usr/src/uts/common/io/usb/scsa2usb/scsa2usb.c#scsa2usb_scsi_setcap</a>
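The throttle decision in the sd.c excerpt above can be modelled in plain user-space C to show what the proposed aac change would do. This is a simplified sketch, not driver code: the struct and function names are made up, it only covers the parallel SCSI branch shown in the excerpt, and the starting throttle value passed in (256 in the example below) is an assumption.

```c
#include <stdbool.h>

/* What sd learns about the target (disk) from its inquiry data. */
struct target_model {
    bool scsi2;        /* inq_rdf == RDF_SCSI2 */
    bool cmdque;       /* inq_cmdque set */
    bool arq_enabled;  /* un_f_arq_enabled */
};

/* What the host adapter driver reports through its getcap/setcap entry points. */
struct hba_model {
    bool tagged_qing;    /* accepts "tagged-qing" */
    bool untagged_qing;  /* reports "untagged-qing" */
};

int min_int(int a, int b)
{
    return a < b ? a : b;
}

/* Mirrors the if/else chain in the excerpt: returns the resulting throttle. */
int model_sd_throttle(struct target_model t, struct hba_model h, int throttle)
{
    if (t.scsi2 && t.cmdque && t.arq_enabled) {
        if (h.tagged_qing)
            return throttle;              /* tagged queueing enabled */
        if (h.untagged_qing)
            return min_int(throttle, 3);  /* adapter-internal queueing */
        return 1;
    }
    if (h.untagged_qing && t.arq_enabled)
        return min_int(throttle, 3);
    return 1;                             /* one outstanding request */
}
```

With an adapter model that reports neither capability (as aac appears to today) the result is 1 regardless of the target; once untagged_qing is reported, the same call with a starting throttle of 256 returns 3, which is the "up to three concurrent I/O requests" mentioned above.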

j.keil at 2007-7-21 14:27:36
# 22

I disabled the read cache on the 2120S; unfortunately, there was no noticeable change in the disk write performance.

Interesting find in the driver. What are the chances of having Sun fix the driver and roll it into a patch for Solaris 10? Any possibility of fooling sd into submitting more than one request?

Warren at 2007-7-21 14:27:36
# 23

<table border="0" align="center" width="90%" cellpadding="3" cellspacing="1"><tr><td class="SmallText"><b>Warren wrote on Mon, 26 September 2005 19:21</b></td></tr><tr><td class="quote">

Interesting find in the driver. What are the chances of having Sun fix the driver and roll it into a patch for Solaris 10?

</td></tr></table>

Don't know. I guess a support contract could help get a fix faster :-)

For now I've reported the issue on the opensolaris bugs forum;

maybe a Sun engineer will comment on it:

<a href="http://www.opensolaris.org/jive/thread.jspa?threadID=2511&amp;tstart=0" target="_blank">http://www.opensolaris.org/jive/thread.jspa?threadID=2511&amp;tstart=0</a>

<table border="0" align="center" width="90%" cellpadding="3" cellspacing="1"><tr><td class="SmallText"><b>Quote:</b></td></tr><tr><td class="quote">

Any possibility of fooling sd into submitting more than one request?

</td></tr></table>

I guess it would be easier to download current opensolaris source,

set up an opensolaris build environment, add the two-line (?) fix to

aac, and compile opensolaris. Then replace /kernel/drv/aac and

/kernel/drv/amd64/aac with the modules compiled from opensolaris

sources. I'd expect that the interface between the kernel and a scsi

host bus driver is unchanged, so it should be OK to use the opensolaris

compiled aac kernel module on s10 x86. Of course this would be

completely unsupported.

Of course, for an officially supported fix, getting an s10 patch would be

preferred.

Btw.: to print the current <i>sd</i> throttle settings, you can use the following

mdb commands:

# <b>mdb -k</b>

> <b>::prtconf</b>

DEVINFO NAME

...

daaf4010 pci-ide, instance #0

daaf3c20 ide, instance #1

<i>d399a168</i> sd, instance #0

The <i>::prtconf</i> dcmd prints the device tree; somewhere in this tree you

should find the <i>aac</i> RAID controller device with a child node using

the <i>sd</i> driver.

Using the hexadecimal number in front of the <i>sd</i> instance, you

can print the <i>sd</i> driver's internal kernel state (I'm using the atapi

device on the <i>ide</i> controller as an example):

> <b>d399a168::print struct dev_info devi_driver_data | ::print struct scsi_device sd_private | ::print struct sd_lun</b>

Somewhere in the sd_lun structure you'll find these throttle related fields:

un_ncmds_in_driver = 0

un_ncmds_in_transport = 0

un_throttle = 0x3

un_saved_throttle = 0x3

un_busy_throttle = 0

un_min_throttle = 0x8

un_reset_throttle_timeid = 0

The important field is "un_throttle". I expect that you'll get the value "1" for

the aac raid disk device.

j.keil at 2007-7-21 14:27:36
# 24

You are correct, the value of un_throttle is 0x1 for the aac driver.

<table border="0" align="center" width="90%" cellpadding="3" cellspacing="1"><tr><td class="SmallText"><b>Quote:</b></td></tr><tr><td class="quote">

Don't know. I guess a support contract could help get a fix faster :-)

</td></tr></table>

Perhaps I could just purchase a well supported RAID controller. Any idea if there is a relatively inexpensive PCI Express Ultra320 SCSI RAID controller that has first rate driver support under Solaris x86?

Warren at 2007-7-21 14:27:36
# 25

<table border="0" align="center" width="90%" cellpadding="3" cellspacing="1"><tr><td class="SmallText"><b>Quote:</b></td></tr><tr><td class="quote">

Perhaps I could just purchase a well supported RAID controller. Any idea if there is a relatively inexpensive PCI Express Ultra320 SCSI RAID controller that has first rate driver support under Solaris x86?

</td></tr></table>

I'd first try the <a href="http://www.sun.com/bigadmin/hcl/data/sol/components/views/disk_controller_sun_certified.page1.html" target="_blank">http://www.sun.com/bigadmin/hcl/data/sol/components/views/disk_controller_sun_certified.page1.html</a> Sun Hardware Compatibility List. Notice I've put the <b>Sun certified</b> components first in the link.

I'm also glad you've found such a knowledgeable respondent as j.keil to help figure this out.

zemplar at 2007-7-21 14:27:36
# 26

<table border="0" align="center" width="90%" cellpadding="3" cellspacing="1"><tr><td class="SmallText"><b>zemplar wrote on Mon, 26 September 2005 21:55</b></td></tr><tr><td class="quote">

<table border="0" align="center" width="90%" cellpadding="3" cellspacing="1"><tr><td class="SmallText"><b>Quote:</b></td></tr><tr><td class="quote">

Perhaps I could just purchase a well supported RAID controller. Any idea if there is a relatively inexpensive PCI Express Ultra320 SCSI RAID controller that has first rate driver support under Solaris x86?

</td></tr></table>

I'd first try the <a href="http://www.sun.com/bigadmin/hcl/data/sol/components/views/disk_controller_sun_certified.page1.html" target="_blank">http://www.sun.com/bigadmin/hcl/data/sol/components/views/disk_controller_sun_certified.page1.html</a> Sun Hardware Compatibility List. Notice I've put the <b>Sun certified</b> components first in the link.

I'm also glad you've found such a knowledgeable respondent as j.keil to help figure this out.

</td></tr></table>

I noticed that the controller Warren is using, the 2120S, is certified with HCT level 2

tests, and you mentioned Sun-certified HCT level 1 tests. Is there some

difference in the test results between level 1 and level 2 that would show this

type of performance problem (as described in this thread) with HCT? Also, shouldn't there be some performance test results for the drivers published somewhere?

Bob

palowoda at 2007-7-21 14:27:36
# 27
I wholeheartedly agree, j.keil rules. Does anyone know where I might obtain a copy of the Sun Hardware Compatibility Tests? It would definitely be interesting to try them out.
Warren at 2007-7-21 14:27:36
# 28
Never mind, I located it <a href="http://www.sun.com/bigadmin/hcl/hcts/hcts.html" target="_blank">here</a>.
Warren at 2007-7-21 14:27:36
# 29

I don't think there has been a new controller or disk in the last 10 years that didn't support tagged queuing, so it is an unfortunate default to disable multiple requests to a controller, but I suppose it is better safe than sorry.

Be suspicious of an actv of 1.0 (reported by iostat) as an indication of a single-threaded limit to the controller: this is an average, and the instantaneous load level can easily slip through the cracks. However, the combination of an actv of 1.0 and a high wait%, together with un_throttle=1, appears conclusive that a single-threaded limit is the case.

Given the excellent investigation already in this thread, there isn't much to add. However, in lieu of hacking the driver or waiting for a patch, you can at least make the single-threaded access more efficient by tuning UFS and the RAID controller for larger I/O. You might want to try a UFS maxcontig of 128 (newfs -C or tunefs -a, and make sure maxphys is set to allow 1 MB transfers) and use a stripe unit of 1 MB as well. A stream of 1 MB sequential transfers is going to do better than 256 KB.
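
To make those sizes concrete: maxcontig is counted in filesystem blocks, so the size of one clustered transfer is maxcontig times the block size. The small C sketch below does that arithmetic. It assumes the default 8 KB UFS block size (an assumption; check the actual block size of your filesystem), and the function names are made up for illustration.

```c
#include <stdint.h>

/* Bytes in one UFS cluster: maxcontig blocks of bsize bytes each. */
uint64_t
ufs_cluster_bytes(uint32_t maxcontig, uint32_t bsize)
{
	return ((uint64_t)maxcontig * bsize);
}

/* Smallest maxcontig whose cluster covers target_bytes (rounds up). */
uint32_t
maxcontig_for(uint64_t target_bytes, uint32_t bsize)
{
	if (bsize == 0)
		return (0);	/* avoid dividing by zero on bad input */
	return ((uint32_t)((target_bytes + bsize - 1) / bsize));
}
```

Under the 8 KB assumption, ufs_cluster_bytes(32, 8192) gives 262144 bytes (256 KB) and ufs_cluster_bytes(128, 8192) gives 1048576 bytes (1 MB), which is why maxphys also has to allow 1 MB transfers for the larger maxcontig to have any effect.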

Note also that it is not easy to get much more than 50-70 MB/sec through UFS anyway, and your raw write bandwidth to the drives is already ~ 90 MB/sec. How hot is the CPU running in this example? If you have high CPU utilization there may be scheduling delays as well; you might do better with direct I/O. Also, if the CPU is strained, shorter time quanta set with priocntl(1) on the heavy-hitting processes can help I/O latencies.

HTH,

daveCfisk at 2007-7-21 14:27:36
# 30

Hi Dave:

Unfortunately, the CPUs are basically snoring. If you look back to the sample <b>iostat</b> output that I posted, the CPUs are 92% idle. The machine appears to be I/O bound.

Also, I tried modifying the maxcontig parameter on the filesystem to be 128; however, that managed to break Solaris' ability to mount the filesystem. I had to boot the installer disc, mount, and reset the maxcontig parameter to 32 in order to be able to boot the machine under Solaris 10 again. Any idea what may have gone wrong?

Thanks for your suggestions.

Warren at 2007-7-21 14:27:40