T3 IO bandwidth
We have a T3 array (not a T3+) with 9*36GB disks configured as a 290GB RAID5 array:
ID      TYPE                VENDOR       MODEL         REVISION     SERIAL
------  ------------------  -----------  ------------  -----------  --------
u1ctr   controller card     SLR-MI       375-0084-02-  0210/011804  020939
u1d1    disk drive          SEAGATE      ST336704FSUN  A726         3CD0SNRQ
u1d2    disk drive          SEAGATE      ST336704FSUN  A726         3CD0SQL4
u1d3    disk drive          SEAGATE      ST336704FSUN  A726         3CD0SNKW
u1d4    disk drive          SEAGATE      ST336704FSUN  A726         3CD0SN1C
u1d5    disk drive          SEAGATE      ST336704FSUN  A726         3CD0SNKN
u1d6    disk drive          SEAGATE      ST336704FSUN  A726         3CD0SMQ9
u1d7    disk drive          SEAGATE      ST336704FSUN  A726         3CD0SG45
u1d8    disk drive          SEAGATE      ST336704FSUN  A726         3CD0SM5Q
u1d9    disk drive          SEAGATE      ST336704FSUN  A726         3CD0SQF6
u1l1    loop card           SLR-MI       375-0085-01-  5.02 Flash   026127
u1l2    loop card           SLR-MI       375-0085-01-  5.02 Flash   026372
u1pcu1  power/cooling unit  TECTROL-CAN  300-1454-01(  0000         018786
u1pcu2  power/cooling unit  TECTROL-CAN  300-1454-01(  0000         027116
u1mpn   mid plane           SLR-MI       370-3990-01-  0000         017505
The array is connected to a V250, via a single port 1Gb FCAL card, running Solaris 10.
The T3 LUN is mounted as a UFS on /data1.
What I want to know is: what is the maximum sustainable I/O bandwidth, in MBytes/second, that we should see when we access files on the T3?
Thanks,
Paul.

# 1
Depends :-) What's the config of the T3 exactly (block size, cache settings)? What are the file system settings, and is there a VM involved? Are you after sustainable writes or reads, random access, or a mixture?
at 2007-7-5 17:48:55 >

# 2
So this is one of those questions that's highly loaded.
Based on your given output, we can surmise the following:
100 MB/s bandwidth to the array
10k RPM drives (system handbook)
2k or 8k application-coalesced reads/writes (UFS block size)
RAID 5
We don't know whether you're going to be accessing the files
for read or write, sequentially or randomly, or even how many
IOs/s the application feeds into that bandwidth.
Think about it this way: if you have an application that does 8k
read IOs, and can only perform a maximum of 200 read IOs
per second, the maximum throughput of the application is:
200 x 8k = 1600k, or about 1.56 MB/s
Hardly a heavily loaded storage subsystem. A JBOD could perform this work readily.
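The same arithmetic works for any I/O size and rate. A minimal Python sketch (the function name is made up for illustration, and 1 MB is taken as 1024 KB):

```python
def throughput_mb_s(iops, io_size_kb):
    """Throughput in MB/s for a given I/O rate and I/O size (1 MB = 1024 KB)."""
    return iops * io_size_kb / 1024.0

# The example from the post: 200 read IOs per second at 8k each.
print(throughput_mb_s(200, 8))   # 1.5625
```

Plug in the IOPS and I/O size you actually observe with iostat to see how far the application is from the 100 MB/s link.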
So you're gonna have to do some homework, mainly:
- get the IOPS of your application (iostat during a heavy load should help)
- find out your access pattern (read/write ratio, sequential vs. random)
- evaluate your volume segment size on the T3 based on the
access pattern.
Remember that UFS is considered an application, so it runs in
either 2k or 8k chunks. Make sure your application performs the
access pattern in these amounts. If it never exceeds 16k, your
best bet is to set the segment size to 16k. Otherwise you will
not be hitting all of the spindles on the T3 brick.

# 3
OK, back from holidays now :-(
The T3 is now connected to a V210 running Solaris 9 via the same PCI:FCAL card as before.
We are mainly doing sequential reads against large files (25GB) as part of an OLAP cube build process. The output file that is written is about 500MB due to data compression.
The array has 256MB cache and was setup with default settings for a single RAID 5 LUN.
We are using a T3 as it's a nice bit of kit and we only paid USD350 for it off ebay!
Paul.

# 4
I ran a set of profiles on the older T3, testing each I/O size and access type combination over a range of concurrent threads, from 1 to saturation on a single 1 Gbit FC-AL, on an E4000, which will give you a good idea of the older T3 bandwidth limits.
By default, I believe you have a 64 KB stripe unit, over a single 8+1 RAID-5 LUN. Also, as it sounds like you have a single brick this would be with write cache mirroring disabled.
The maximum raw bandwidth, at the LUN level, was 82 MB/sec (1024 bytes per KB), and this was with a 512 KB I/O size, sequential write, with 6 threads of I/O. The single-threaded throughput for the same test was only 27.7 MB/sec, so multi-threading plays a big role in achieving the maximum bandwidth capability of the array with sequential writes.
In order to achieve this under UFS you need to tune the kernel maxphys up from the default of 128 KB to, say, 1 MB to avoid fragmentation. The file system must also be tuned for large I/O with newfs -C 64, or tunefs -a 64, which targets a 512 KB I/O size.
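As a rough check on where the 512 KB and the value of 64 come from, here is a small Python sketch. It assumes the default 8 KB UFS block size and that the newfs/tunefs value is counted in file system blocks; the function names are illustrative only.

```python
def full_stripe_kb(data_disks, stripe_unit_kb):
    """Data held by one full stripe, excluding parity."""
    return data_disks * stripe_unit_kb

def contiguous_blocks(target_io_kb, fs_block_kb=8):
    """Number of file system blocks needed to make up one I/O of target_io_kb."""
    return target_io_kb // fs_block_kb

stripe = full_stripe_kb(8, 64)            # 8+1 RAID-5 with a 64 KB stripe unit
print(stripe, contiguous_blocks(stripe))  # 512 64
```

So a 512 KB I/O covers exactly one full stripe across the eight data disks, which is consistent with the 512 KB I/O size that gave the best write figure.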
The maximum read bandwidth was slightly less, at 78.4 MB/sec with six threads of 1 MB sequential read. The single threaded sequential read number for this test was much better than the corresponding write, coming in at 52 MB/sec. So, the read bandwidth is not as dependent on concurrency as the write bandwidth.
The maximum random access bandwidth, over the full seek range, was 31 MB/sec, with 6 threads of 256 KB I/O random read, and 25 MB/sec with 3 threads of 256 KB I/O random write.
Note that these are all maximum bandwidth numbers; higher throughput in terms of IOPS is obtained with small I/O.
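To put the read figures in terms of the 25 GB files mentioned earlier in the thread, here is a back-of-envelope Python sketch. It assumes the file system can actually deliver the raw LUN rates, which is not a given without tuning, and it takes 1 GB as 1024 MB.

```python
def seconds_to_read(file_gb, rate_mb_s):
    """Seconds to stream a file of file_gb (1 GB = 1024 MB) at rate_mb_s."""
    return file_gb * 1024 / rate_mb_s

# Measured raw sequential read rates for 1 MB I/O.
for label, rate in [("6 threads", 78.4), ("1 thread", 52.0)]:
    print(f"{label}: {seconds_to_read(25, rate) / 60:.1f} min")
```

That works out to roughly 5.4 minutes with six threads and 8.2 minutes with one, per 25 GB file.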
HTH,

# 5
How's it going, Dave?
Great post and some interesting figures, which lead me to two questions, one being a favour:
1. What *type* of data are you playing with here for these figures? Is it *proper* data or just synthetic to get some figures from the box?
2. When you did these benchmark runs, did you have the read-ahead on or off? I'm presuming off, but could you run with whatever the reverse is and post the figures?
cheers
TB
*edited for typo in orig post*

# 6
Hello TB,
Thanks. These numbers were generated with vxbench. It's just writing zeros and reading whatever happens to be there. The I/O threads are seek coordinated, however, so the bytes are assembled in order, as would need to be the case for a real payload.
Regarding the pre-fetch, that's a good question, and it reminds me that one should always be diligent about carefully documenting the configuration when measuring performance :-) Unfortunately, this detail is missing from my notes, and I no longer have access to the configuration.
For the subject case of large (1 MB) sequential reads I don't think the pre-fetch would have made much difference. I have seen cases where pre-fetch had a negative impact for random reads of any size, while, as I'm sure you know, pre-fetch is a huge win for small sequential reads. I have generally found that both pre-fetch and write-behind tend to have a negative impact, if any, when maximum-bandwidth large sequential I/O is the objective.
Regards,

# 7
Hi Dave,
I have to smile re: you not documenting the configuration; I fall into that trap from time to time and find myself wondering about stuff days after an event :-)
Erm, I guess following on from that, you can't answer my third question, which I didn't fit in my post: did you use any mount settings for the UFS?
With the pre-fetch set differently and different mount settings (direct I/O, for example), I should imagine you would get much different readings. Also, I point out the read-ahead because this term is sometimes misunderstood on this array: rh off = pre-fetch on, rh on = pre-fetch seek on. I should imagine that the 'negative' impact you sometimes see may be due to having rh on, where you are pretty much wasting an I/O per read when the data is sequentially located.
Cheers
TB

# 8
Hello TB,
I should clarify that the numbers I cited are raw I/O directly to the LUN.
To deliver this bandwidth to the filesystem, the filesystem I/O to the LUN needs to emulate the raw I/O in the experiment. Direct I/O will do the trick if your application is already issuing I/O that fits the description. Otherwise the filesystem page cache can be tuned for the required I/O size, and autoup used to regulate the concurrency (as best as you can with ufs) as a function of fsflush().
In older Solaris, ufs has some internal limits that hold it to about 50 MB/sec or so, such as the single-threaded memory scan. Direct I/O was largely inspired by such limits. The memory scan was replaced with a list in Solaris 8. However, there may still be a single-threaded limit in fsflush(); that's supposed to be fixed, but I don't know if it is yet. If you become short on page cache, you may hit this limit before you exhaust the array.
So having identified the maximum sustainable bandwidth of the array, the question becomes how to harness that capability from the file system. What I propose is to emulate the raw I/O in how you configure the system and the filesystem, as best as you can.
Once you have determined how to do so, it is important to document the configuration ;-)
Cheers,

# 9
So having identified the maximum sustainable bandwidth of the array, the question becomes how to harness that capability from the file system. Which I propose, is to emulate the raw I/O in how you configure the system, and filesystem, at least, as best as you can.
TB>Indeed matey
Once you have determined how to do so, it is important to document the configuration ;-)
TB>would be best I guess :-)
I had a look at your site (plug, plug) and it looks like a pretty cool product offering by the way ;-)

# 10
Hi
Thanks for a great post. I am trying to get our T3 RAID array to do more than 8 MB per second. It has a 64k block size with read-ahead turned on, as a single-brick RAID 5, 1-8 + hot spare. It is mounted as a standard UFS file system and I have made no changes to the kernel etc. When I back it up using ufsdump it's just slow. With Oracle it is also slow. Oracle has async I/O turned on. Can anyone give me some advice on what would need to be changed to tune this up? I have never seen this go quicker than 8 MB per second with iostat -xn 3.
The T3 is directly attached to a V440 running Solaris 9.
Regards
# 11
If you read through the conversation between Dave and TB here, you'll see Dave has suggested that the key to good performance on the T3 is being able to achieve similar throughput figures at the filesystem level. Have you tuned the filesystem or determined whether it is the bottleneck? What has your DBA suggested here? This kind of post goes outside the scope of Sun storage support and into an Oracle DBA discussion. My OCP is currently outdated, but I'll fire a suggestion at you. Get your sys admin and DBA to create an I/O profile for the database. If the database is I/O bound (a high amount of I/O), Oracle can be configured to use raw devices to improve performance (basically, what you are doing is eliminating the overhead of the filesystem).
# 12
I am the dba and sys admin. I have been running filebench with the varmail profile against the same t3 on a solaris 10 machine.
test 6   ufs normal    t3  IO Summary:  99359 ops 1644.9 ops/s, (253/253 r/w)  8.4mb/s, 1458us cpu/op, 29.0ms latency
test 7   ufs noatime   t3  IO Summary:  99467 ops 1646.9 ops/s, (253/254 r/w)  8.2mb/s, 1450us cpu/op, 30.5ms latency
test 8   ufs directio  t3  IO Summary: 103881 ops 1720.9 ops/s, (265/265 r/w)  8.6mb/s, 1446us cpu/op, 28.8ms latency (best for t3 ufs)
test 9   zfs normal    t3  IO Summary: 340518 ops 5639.0 ops/s, (868/868 r/w) 29.1mb/s,  901us cpu/op,  7.2ms latency
test 10  zfs noatime   t3  IO Summary: 336663 ops 5576.1 ops/s, (858/858 r/w) 30.0mb/s,  909us cpu/op,  6.3ms latency (best for t3 zfs)
You can see here that zfs performance is massive by comparison to ufs on the same box. On my production machine, which is Solaris 9, we are only getting about 8 MB/s max on read. Write is quicker, but not by a lot. I was under the impression that a T3 would be much quicker than a single drive. My single drives can do 4.5 MB/s; the T3 is only returning approximately 8.4.
What do you think? I'm starting to think I have a problem with firmware or something on the T3 disks.
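One way to read the filebench figures is to work out the average transfer size per read/write operation. The Python sketch below assumes that "(253/253 r/w)" means read and write operations per second and that "mb" means 2^20 bytes; neither is confirmed in the thread, and the function name is illustrative.

```python
def avg_io_kb(mb_per_s, reads_per_s, writes_per_s):
    """Average KB moved per read/write operation (1 MB = 1024 KB)."""
    return mb_per_s * 1024 / (reads_per_s + writes_per_s)

print(round(avg_io_kb(8.4, 253, 253), 1))    # test 6, ufs normal
print(round(avg_io_kb(29.1, 868, 868), 1))   # test 9, zfs normal
```

Under those assumptions both runs come out at about 17 KB per operation, so the varmail profile is exercising small I/O rather than streaming bandwidth, and the MB/s figure mostly reflects how many operations per second the file system completes.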
# 13
I don't have much time to look at this issue this weekend, but just by looking at your figures I would think that they show UFS as the bottleneck rather than the array firmware (keep in mind that it is always good to keep the system controller and disk firmware at the latest patch level). In our T3 test environment we are also using a synthetic load generator, but I'd need some more time to look through your figures to replicate them.
# 14
I've had a chance to take a look at some of the figures on this issue today. I also welcome input from anyone else who has anything they would like to contribute to this thread! First off, I decided not to use filebench; at this point I don't think it would be of benefit to me personally to start working with other filesystem benchmarking tools.
For the tests I have carried out, I have used a single T3 array configured with the factory RAID 5 LUN and a default UFS filesystem (Sol 10 # newfs /dev/rdsk/cNtNdNsN). I think that the very poor results that you are experiencing are the result of filebench doing random read/write or large IOs to the filesystem.
With the above-mentioned configuration, if I do random I/O (4 threads with 8k I/O size) with mixed read/write to the filesystem, I get throughput figures of roughly 1 MB/s. If I do the test with 4 threads, an I/O count of 1500+, an I/O size of 8k and a sequential read to the filesystem, I get 140 MB/s. For a sequential write I get 80 MB/s (if I use a single thread I get 300 KB/s; this should reflect the type of performance figures obtained by using tools like mkfile or dd).
This brings us back to the point I have made previously on this forum: select your storage based on your application I/O profile. So if your Oracle database is going to have a random, I/O-bound profile, a single T3 would not be up to the task. Furthermore, as your test results show, a more advanced filesystem (e.g. zfs, qfs or vxfs) may be needed to improve the performance if the budget restricts implementing more advanced hardware (e.g. t4 - hds).
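For anyone wanting to try a comparable sequential-read test without a dedicated load generator, here is a minimal Python sketch. It is not the tool used for the figures above; the function names and the per-thread region layout are assumptions made for illustration. Reads go through the normal file system page cache, so a file much larger than RAM is needed for the result to say anything about the array.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor

def read_range(path, offset, length, io_size):
    """Read `length` bytes sequentially from `offset`, `io_size` bytes per call."""
    done = 0
    with open(path, "rb", buffering=0) as f:   # unbuffered at the Python level only
        f.seek(offset)
        while done < length:
            chunk = f.read(min(io_size, length - done))
            if not chunk:                      # hit end of file early
                break
            done += len(chunk)
    return done

def sequential_read(path, threads=4, io_size=8 * 1024):
    """Give each thread one contiguous region of the file; return (bytes, MB/s)."""
    size = os.path.getsize(path)
    region = size // threads
    jobs = []
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for i in range(threads):
            offset = i * region
            # The last thread also picks up any remainder.
            length = size - offset if i == threads - 1 else region
            jobs.append(pool.submit(read_range, path, offset, length, io_size))
        total = sum(job.result() for job in jobs)
    elapsed = time.perf_counter() - start
    return total, total / elapsed / 2**20
```

A call such as sequential_read("/data1/bigfile", threads=4, io_size=8192) would mirror the 4-thread, 8k case; the path is hypothetical.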
