Poor performance - StorEdge 3320
Hello!
I'm experiencing what I believe to be abysmal performance with a StorEdge 3320 RAID unit (two RAID cards installed, 12 73GB disks) hooked up to a Fire x4100 with U320 SCSI card installed.
Whatever configuration I try, I am utterly unable to pass 40-50MB/s.
The connection from host to 3320 is at 320MB/s - I've verified this both from the server end (in both Solaris and FreeBSD) and at the 3320 end (view and edit channels says the host channel is 160MHz, wide). All the disks are synced at 320MB/s, verified in the 3320 terminal.
I've been trying various configs for weeks. I've fiddled with both single and split bus, varied the number of disks in the arrays, tried raid-0 and raid-5, varied the caching system and blocksizes, tried multiple raid-0 arrays with mirroring/striping on the host, and even just exported one disk per lun with NRAID and used zfs striping on the host (thus bypassing the expensive raid card), all to no avail.
Typical performance follows; this particular config is 4 luns exported, each one has a single disk on, and the filesystem is a 4-way zfs stripe:
extended device statistics
 r/s   w/s  kr/s    kw/s wait actv wsvc_t asvc_t %w  %b device
 0.0 109.8   0.0 14055.3  0.0 35.0    0.0  318.7  0 100 c3t0d0
 0.0 112.0   0.0 14336.9  0.0 35.0    0.0  312.5  0 100 c3t0d1
 0.0 112.0   0.0 14336.9  0.0 35.0    0.0  312.5  0 100 c3t0d2
 0.0 112.0   0.0 14336.9  0.0 35.0    0.0  312.5  0 100 c3t0d3
Note the %b is locked at 100% (this is the case whatever raid config I happen to be using) and yet each disk is only doing 14MB/s, giving a total in this particular config of just 56MB/s. This was done with:
zpool create -f test c3t0d0 c3t0d1 c3t0d2 c3t0d3
cd /test/
dd if=/dev/zero of=foo bs=16k
Varying bs does not change the result, as I suppose I could expect with a sequential write to zfs.
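As a quick sanity check on the iostat figures above, here is a minimal Python sketch. It assumes the garbled columns are w/s, kw/s, actv and asvc_t as in standard `iostat -x` output; the function names are made up for illustration. It sums the per-LUN write rates, derives the average write size, and uses Little's law (outstanding requests divided by time per request) to show that a queue of 35 requests at roughly 315 ms each accounts for the observed ~110 writes per second.

```python
# Figures copied from the iostat output above:
# (device, w/s, kw/s, actv, asvc_t in milliseconds)
ROWS = [
    ("c3t0d0", 109.8, 14055.3, 35.0, 318.7),
    ("c3t0d1", 112.0, 14336.9, 35.0, 312.5),
    ("c3t0d2", 112.0, 14336.9, 35.0, 312.5),
    ("c3t0d3", 112.0, 14336.9, 35.0, 312.5),
]


def total_write_kb_per_s(rows):
    """Aggregate write rate across all LUNs, in KB/s."""
    return sum(kw for _dev, _w, kw, _actv, _svc in rows)


def avg_write_size_kb(w_per_s, kw_per_s):
    """Average size of one write request, in KB."""
    return kw_per_s / w_per_s


def iops_from_queue(actv, asvc_t_ms):
    """Little's law: completions/s = requests in flight / seconds per request."""
    return actv / (asvc_t_ms / 1000.0)


if __name__ == "__main__":
    print("total KB/s:", total_write_kb_per_s(ROWS))
    for dev, w, kw, actv, svc in ROWS:
        print(dev,
              "avg write KB:", round(avg_write_size_kb(w, kw), 1),
              "predicted w/s:", round(iops_from_queue(actv, svc), 1))
```

If the column reading is right, each request is about 128 KB and the device is busy because it holds a full queue of slow requests, not because it is moving much data.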
I beg for some tips on where to look for this performance bottleneck. I am convinced that a ?5k disk array must be able to do better than 56MB/s. Or are they really this slow?
I have tried multiple cables and multiple scsi cards in different hosts, all with similar results, so I'm sure no hardware is at fault. I also have two of these 3320 RAID systems both of which are doing the same thing.
I would also like to note that I have a couple of 3320 JBOD systems with 12 73GB disks too, and within seconds of hooking these up and putting them into a zfs filesystem, I was happily pulling and pushing 230MB/s and beyond on them. That's the kind of performance I'd like from the substantially more expensive RAID systems. Or am I just not going to get it?
Thanks very much for any advice, I'm really at the end of my rope with these! :(
Bit of 3320 info:
Firmware Version: 4.12E
Bootrecord Version: 1.31K
Battery Backup Unit: Present
Message was edited by:
celsworth [added more 3320 info, firmware etc]
[2939 byte] By [celsworth] at [2007-11-26 9:30:46]

# 1
What can I say other than complain bitterly to Sun about this and get them to get DotHill to do something about their less than perfect controller firmware. Almost every version of the firmware that DotHill puts out does something to affect the performance of the 3000 series arrays.
I totally gave up on a 3310 for its spectacularly pathetic responses to writing a large file to the filesystem and then even deleting it was painfully slow. That was the same version as you have. They did the same thing with the FC arrays. When they released 4.11, the performance went downhill by as much as 15%.
Our 3510s are really pushing to get over 90 MB/s (that's using four fibre channels with multipathing) and we normally get about 60 MB/s under heavy load. We have the systems (6900s) and the SAN infrastructure to push the data. The 3510s are just really slow in my opinion.
So much for 200 MB full duplex. That is running 3.27R because we can't upgrade to 4.15 due to operational requirements. The last time we planned on upgrading to 4.13I, Sun told us not to do it due to the high likelihood of double disk failures in the array.
Now our 3510s are used as secondary storage because why chuck away hundreds of thousands of dollars? You need something to do a disk to disk backup.
I can't provide any help, but at least you know you are not the only person who thinks they are slow.
Stephen
# 2
Gents, please test your storage properly... I can get 90+ MB/s read on a single T3 with a 256MB controller ( factory set RAID 5 LUN ) and a single 100MB/s FC HBA on a Sun Blade 1000. To be rude and crude... dd is a shite tool for storage performance analysis. I've even seen it mentioned in spectrum infodocs; it should be banned, don't use it for anything except creating Linux boot floppies!
Performance can be affected by many factors. For example, large chunks of data ( say 128k ) being written to a default controller stripe size of 16k will cause a major bottleneck. A request of 128k would mean that the storage system would have to access 8 drives ( 8 x 16 = 128 ) to write 128k. Sustained requests like this from an application translate to lots of waiting time for each IO request to be completed. This is what's happening with dd: you're thrashing the storage system. Sun storage, just like EMC, HP or OTHER no-name brands, will suffer if poorly set up or poorly tested.
Unfortunately the Sun benchmarking tools are Sun internal and I cannot post information about them here. Contact Sun or an iForce storage specialist and hire them to test the storage performance if you are a current contract customer. Additionally there are tools like vxbench from Veritas. This tool is similar to ( but more basic than ) Sun's vdbench, but it was available to download a few months back ( I'll check for the URL and post back ). You can set these tools up to access the storage system just as the production application would, but this is the key... knowing your application's IO profile. No database, messaging system or any other application's IO profile used on storage systems today can be emulated by dd.
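The 8 x 16 = 128 arithmetic above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the request is aligned to a stripe-unit boundary, parity drives are ignored, and `drives_touched` is a made-up helper name, not part of any Sun or Veritas tool.

```python
import math


def drives_touched(request_kb, stripe_unit_kb, data_drives):
    """Number of data drives one aligned request is spread over.

    Assumes the request starts on a stripe-unit boundary; a request can
    never touch more drives than the array has.
    """
    chunks = math.ceil(request_kb / stripe_unit_kb)
    return min(chunks, data_drives)


if __name__ == "__main__":
    # 128k request on a 16k stripe unit, as in the post above
    print(drives_touched(128, 16, 11))
    # the same request on a 128k stripe unit
    print(drives_touched(128, 128, 11))
```

With a 16k stripe unit the 128k request fans out across 8 drives, while with a 128k stripe unit it stays on one drive.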
# 3
I read the original post again and I see you have experimented with various controller stripe sizes...
Sorry for the rant on this earlier, but I will pull out some more points about how rubbish dd is.
Here is a link to a blog from Dominic Kay of Sun:
http://blogs.sun.com/roller/page/dom?entry=filesystem_benchmarks_vxbench
He discusses some of the issues with the vxbench license, not sure if I want to link to it myself because of this information, but you can source the application from other posts such as the following:
http://mailman.eng.auburn.edu/pipermail/veritas-vx/2003-March/005727.html
But again Mr. Kay seems to suggest that if you use vxbench without a Veritas technical support engineer and you crash your storage system, you have no warranty. Based on this, I would strongly suggest that you hire Sun to perform the statistical analysis of your storage system. This way you can have the work carried out without risk of damage to the system.
Another filesystem benchmarking tool:
http://www.iozone.org/
Message was edited by:
m-lennon
# 4
I would suggest testing the hw raid controller with a block i/o tool. This will tell you if this is a controller limitation or a zfs issue.
In fact, if you look at the zfs-discuss archives on opensolaris.org you'll see a lot of information on this topic. The take away being that ZFS appears to have a per lun i/o limitation. Later revs of Solaris will fix this but if you're on the latest Solaris 10 release you don't have the most current code. You might want to try a Solaris Express build.
# 5
This has mostly nothing to do with the 3320 but I thought this should be included.
I would love to point out that the performance problems I have seen are expensive solutions provided by the best that Sun can offer. I am talking about a couple of million dollars (ok include a Sun Fire 6900) spent on this system to get exceptional performance for an exceptional database solution. I recall seeing names involved in the solution that would command good dollars at seminars and they have written many books on performance.
I have had nothing but problems with my 3510 solution and I just spent another couple of million dollars to get a TagmaStore 1100 to replace them. Why was this not purchased in the beginning? Because Sun make no money on them. They do make lots of money selling lots of 3510's and 6920's. The only money they probably make from selling HDS systems is the consulting that is bound to be associated with them.
So, I find it a bit insulting that someone says the setup is all wrong when I personally had nothing to do with it, and Sun did it all and charged a fortune for a dud solution that they can't make any better. I am personally sick of blaming firmware for every fault that happens with the 3510s. I am very sick of its inability to keep Quick I/O running when a bad block is encountered on the 3510 and the controller does not even know what to do with it and we start getting silent filesystem corruptions.
I now refuse to even consider the 3510's for anything other than something to store my mp3s on.
Check out the following Sun Alert. It does not mention 3320's but that is probably because they are just an updated 3310.
http://sunsolve.sun.com/search/document.do?assetkey=1-26-102127-1
In case you can't see it:
SE 3310 and SE3510 systems will experience performance degradation of varying degrees on systems running revisions 4.1x of the controller firmware. A performance impact may also be seen with the SE3511 array as well.
Oh, and there is no fix yet.
So, perhaps it is the 3320 that is sick and not the original poster's lack of ability.
Stephen
# 6
Probably more insults for you then... if you had nothing to do with the setup, and Sun were unable to deliver the performance you require, why are you talking about it on a storage support forum? As a multi-million-dollar customer, surely you must have some sort of clout to demand that the solution be upgraded to something with satisfactory performance? Furthermore, the 3510 is essentially an entry-level storage system, and what you are saying here is that, for your application, the solution you require is high end. Why didn't you order the 9990 ( HDS ) that you are talking about in the other thread? Why didn't you make a " proof of concept " request before the project began, ensuring the right storage system was used in the project?
I do have access to the SPE, so I can read the alert. Nowhere in my comments have I suggested that there is a problem with the abilities of the original person who started the thread, nor have I said anything about you. I did make the point that I was being " rude and crude ". Sorry for the insult, that was not my intention; I was trying to crudely emphasize how bad I think dd is for testing storage systems. Customer service will deal with your issues; here we are talking about " my array is performing poorly and I've tried many different configurations, is it really that bad? ".
Seriously though, dd is not the tool for storage performance analysis, period. Given the monetary figures you are talking about here, I would have at least expected some mention of a credible statistical performance analysis tool. vdbench and SWAT are Sun internal storage tools; a request could be put to Sun to demonstrate performance figures using these tools, then hire an unbiased 3rd party to validate the figures or show your conclusions that the 3510s are underperforming.
# 7
To get away from the politics of Stephen's company and Sun Microsystems, can we please get back to the topic of the original post in the thread ( the issue at hand, if you will ). A thought occurred to me, but I may be wrong... the 3320 is in the same family as the 3310. This storage system has poor write performance in dual controller configuration. If this issue is also applicable to the 3320, combined with the use of dd, this could explain such a huge performance gap between the JBOD and RAID storage systems. Make a service call with Sun and have your FE confirm if there is an issue with the 3320 as well.
# 8
Through some discussion with the vendor of the 3320 units, I've managed to secure a Sun engineer visit to site tomorrow.
I'll post back with whether he can improve throughput and how he did it if so.
Interesting point about the zfs io-per-lun limitation, and yes, I will upgrade Solaris to test this. I'm currently using Solaris Express 5/06 which won't have the io-per-lun improvements in. I'll try 7/06.
I have also tried ufs and even just some reading/writing to the raw disks all with similar results (struggles to top 50MB/s).
# 9
I had some additional information I wanted to add into the other thread, but I could not edit it.. here goes:
To confirm this issue on the 3320, configure the system in single controller configuration and run the tests again ( as you did with dd on JBOD and dual RAID ). If there is a difference, that will confirm that the issue exists with this platform as well. Ask the FE to demonstrate the system performance with vdbench in various configurations ( JBOD, single and dual RAID ); at least then you will have realistic throughput figures for the storage system. Take a serious look at an alternative to dd... please.
# 10
Another data point: Performance isn't everything. The reason dual controllers are used is because, in some cases, you could lose a connection to one controller. Or it could short out. Or .. <insert cause of failure here>. Cheap, fast, reliable: Pick two. :)
# 11
Well, results from the Sun engineer visit: practically nothing.
He was unable to determine what the problem was, although he acknowledged there was one, especially after I demonstrated a JBOD unit (using zfs, raid-10 over 12 disks) doing 150MB/s next to a RAID unit doing 40MB/s.
He retreated after 5 hours to regroup. I'm not entirely convinced he'll be back with a solution - especially after he resorted to googling for help (yes, because I obviously can't do that myself) and found this very thread :)
I've been trying other tests in the meantime, more real-world stuff such as creating multiple 1G files on both the RAID and JBOD units (many times the amount of memory the machine has (2G), to be sure it can't cache anything), then copying the files to swapspace. I haven't got any numbers to hand and there's not much point pasting them, but the JBOD unit was quicker every time.
I realise I'm going to sacrifice some performance for the reliability, but really, in a suitable configuration, I don't see how the JBOD unit is any less reliable than the RAID unit. I can use the JBOD unit in a split bus configuration, with two connections back to the host, and mirror data across the connections.
Furthermore, with two JBOD units, and two connections to the host from each, I can lose an entire 3320 and replace it without having to take the filesystem down.
Unless I'm missing something fundamental, with a little thought I can get pretty much the same fault tolerance as a RAID unit could offer me?
# 12
Hello,
from time to time [url=http://forum.sun.com/jive/profile.jspa?userID=124975]Dave Fisk[/url] makes very interesting and detailed posts on the Storage Forum... He had posted on the former Sun Hardware Support Forum as daveCfisk. Maybe he can comment on the performance.
This is another post by [url=http://forum.sun.com/jive/thread.jspa?threadID=67291&messageID=332168#332168]Adrian Cockcroft[/url], who mentions his former colleague in his blog.
Note: I'm unable to edit (remove the links) my post after someone answers. In this case report to the Forums staff or file an RFE to allow editing your own posts even after someone answers.
Regarding the ability to edit your own posts see [url=http://forum.java.sun.com/thread.jspa?threadID=731615&tstart=0]Forums UI and feature changes[/url].
The ability to edit your own posts was missing after switching to a new Forums software or release, it isn't a new feature.
Michael
# 13
While discussing I/O performance today with an HDS guru, we covered some very interesting points about block and stripe sizes. The stripe size that HDS uses in RAID 5 is 256 KB. It is almost impossible to change that unless you really need to use older systems for replication. Also, he said to use about a 4 MB setting for Veritas filesystems.
When he mentioned 256 KB, there were stunned looks in the room, including mine. So we did disk I/O 101. When he was finished, I knew why I have seen what I thought was poor performance, and it is directly related to this thread's problem with speed.
It is almost impossible to get the maximum speed that you think you should get. 2 Gb fibre (approx 200 MB/s) is extremely lucky to get 1 Gb (approx 100 MB/s) throughput, and 80 MB/s is pushing the boundaries. That's probably why vendors offer oversubscription for fibre channel switch ports. Obviously SCSI is going to be slower.
Basically, the size of the data has to be much larger than the stripe sizes on the storage. The problem is that most hosts are incapable of providing the large block sizes. Reducing the amount of work the heads do over the disks also has a major impact.
In a nutshell, if you are doing sequential writes and few random reads, you need to set your array as sequential with 128 KB stripe sizes and then try to send 512 KB sized data to it. If you can achieve this, your performance will be noticeably better. How you do this is going to be a challenge. This does not mean that you should set your stripe size to a small setting, say 4 KB, and try to send 16 KB blocks to it. It won't work that way and I can't give you the reason why.
So, in theory, what you are seeing could be the best you can actually achieve out of your array without doing some major mucking about with your system if that is possible.
Cache is paramount to being able to get performance with large block sizes.
Also, IOPS have a significant impact on how the striping and block sizes interact.
I wish I had gotten a copy of the information that was provided today so I could give it to you.
So my main complaint about DotHill firmware may not be fully justified. Of course, if you use JBODs, you don't have to worry about stripe sizes and you get better performance, but you don't get the hardware RAID reliability. Therefore, the decision is yours. Software or hardware RAID.
I hope this helps and that the information I received today was correct... It is a pity there is nothing easily available on the web to read about this. Perhaps it is just a ploy to not get my hopes up with the new storage?
Stephen
# 14
Interesting that you mention Dave, Michael; he has provided some valuable input on performance on the old forum. I was also interested to read Cockcroft's blog, I read it a week or so ago. This led us to start testing Atlas this weekend. We began by building large Postgres and Oracle databases to generate an I/O profile for the storage system ( I'm trying to replicate a typical enterprise DB I/O profile ). Later we will test it with vdbench, SWAT and Atlas. ( Hope the guys at ORtera don't mind that I reference their products here; Dave Fisk used the URL in his signature on the old forum and in his current profile for this forum, linked by Michael. ) We are currently bogged down with building lots of tables and running queries on the database though; you wouldn't believe the amount of work involved, I'm worn out ( it will be worth it though, not to rely 100% on load generating tools! ).
Torreysun is correct in their suggestion that there is a performance overhead when configured in dual controller configuration; this is an issue with the 3000 series entry-level systems. Let's not forget that statement: the 3000 series arrays are only entry or workgroup level systems and are limited in performance. Were any tests carried out with the system configured with a single controller? If so, what were the figures? The FE may be limited in what input he could provide; perhaps without a hardware fault, hardware related issue or poor configuration there is little else he could have done.
# 15
I have some numbers from a 2 Gb FC 3510, RAID-1 6+6@64 KB that may be of interest.
For a 128 KB Random Write, the maximum bandwidth was 41 MB/sec with 5 threads. The performance degraded at higher load levels.
For a 128 KB Sequential Write, the maximum bandwidth was 125 MB/sec with 15 threads.
For a 1 MB Random Write, the maximum bandwidth was 127 MB/sec with 9 threads.
For a 1 MB Sequential Write, the maximum bandwidth was 134 MB/sec with 5 threads. This represents the maximum sustained write bandwidth of the array measured.
A single thread of 128 KB Sequential Write, such as produced by dd, was only 21 MB/sec. This is why dd is not a good test unless you are only interested in single threaded performance.
A single thread of 1 MB Sequential Write was 83 MB/sec. So even with dd and a 1 MB I/O size you should be able to get more than 50 MB/sec.
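To illustrate the gap between one writer and several, here is a rough Python sketch of a multi-threaded sequential writer. It is not vdbench or any Sun tool, and `threaded_write` is a hypothetical helper written for this example. Each thread writes its own contiguous region with `os.pwrite`, so the device sees several sequential streams at once. It goes through the filesystem and page cache unless `path` points at a raw device, so treat the numbers as indicative only, and note that it overwrites whatever `path` points at.

```python
import os
import tempfile
import threading
import time


def threaded_write(path, threads, io_size, ios_per_thread):
    """Write threads * ios_per_thread blocks of io_size bytes to path.

    Each thread owns one contiguous region of the file. Returns
    (bytes_written, MB_per_second), with the final fsync included in the
    timing. Short writes are not retried; only returned counts are summed.
    """
    buf = b"\0" * io_size
    region = io_size * ios_per_thread
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    written = [0] * threads

    def worker(idx):
        base = idx * region
        for i in range(ios_per_thread):
            written[idx] += os.pwrite(fd, buf, base + i * io_size)

    start = time.perf_counter()
    try:
        workers = [threading.Thread(target=worker, args=(i,))
                   for i in range(threads)]
        for t in workers:
            t.start()
        for t in workers:
            t.join()
        os.fsync(fd)
    finally:
        os.close(fd)
    elapsed = max(time.perf_counter() - start, 1e-9)
    total = sum(written)
    return total, total / elapsed / 1e6


if __name__ == "__main__":
    # Small demo on a scratch file; a real test needs far more data than RAM.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "scratch.bin")
        for n in (1, 5, 15):
            total, rate = threaded_write(target, n, 128 * 1024, 64)
            print(n, "threads:", total, "bytes,", round(rate, 1), "MB/s")
```

Running it with 1 thread approximates what dd does; raising the thread count shows whether the array scales with more outstanding I/O.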
In the context of ZFS a key factor is the limit of I/O size to a maximum of 128 KB, and the limit of threads to 35 per LUN at the pool level. The thread limit is not too bad as there are not enough disks in the 3510/3320 to exhaust 35 threads, at least for RAID-1 configurations. For RAID-5 you might exhaust the 35 threads if you build a single RAID LUN of 11+1, hence two LUNs of 5+1 would be better. The use of multiple LUNs in general will get around the thread limit; however, most hope of internal sequential I/O is lost when you share the drives across LUNs.
For a 3320 that can only get 50 MB/sec on a 320 MB/sec channel, something is wrong. I would check iostat -E and have a look at /var/adm/messages as well for any complaints or errors. I would consider asking Sun to exchange the unit if it cannot deliver more than 50 MB/sec in sustained write performance. It could be that a marginal internal drive or other component is killing performance with silent retries or something like that.
HTH,
Dave (the ORtera man)
# 16
Hello All,
A few comments on the topic of IOPS, throughput, segment sizes, etc.
First of all, when seeing performance figures quoted, the first question that comes to mind is how those figures were obtained, i.e. understand the context in which they were produced. Understanding that will help you to determine if those figures are realistic for your application requirements.
I would say that most performance figures quoted, especially if they are part of marketing material, are not a measure or guarantee of what you will achieve in your application environment.
IOPS figures are produced using the smallest amount of data that can be read or written with SCSI, 512 bytes. Throughput figures are obtained by using very large block sizes; 1-2 MB is not exceptional.
Neither case represents your application requirement.
Also realize that if you have a 4Gb FC link and your app is using 8K I/Os, you need a source that can generate 50,000 IOPS to saturate the link!
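The figure above follows from dividing link bandwidth by I/O size. A small Python sketch, assuming a 4Gb FC link carries roughly 400 MB/s of payload (the same rule of thumb used earlier in the thread for 2 Gb at about 200 MB/s); `iops_to_saturate` is an illustrative name:

```python
def iops_to_saturate(link_bytes_per_s, io_bytes):
    """I/O operations per second needed to fill a link at a given I/O size."""
    return link_bytes_per_s / io_bytes


if __name__ == "__main__":
    fc_4gb = 400_000_000    # assumed usable payload rate of 4Gb FC, bytes/s
    u320 = 320_000_000      # Ultra320 SCSI bus rate, bytes/s
    print(iops_to_saturate(fc_4gb, 8_000))      # 8K taken as 8000 bytes
    print(iops_to_saturate(fc_4gb, 8_192))      # 8K taken as 8192 bytes
    print(iops_to_saturate(u320, 128 * 1024))   # 128 KB writes on U320
```

The same division shows why large I/O sizes matter for throughput tests: at 128 KB per write, a U320 bus needs only a few thousand writes per second to fill.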
And as mentioned earlier in the thread, dd is a single-threaded (!!!) block copy tool, not at all suitable for measuring performance.
Now how do you determine the right segment sizes? Again you need to have a proper understanding of your application/host environment. This includes the HBA driver stack, the filesystems on top of it, etc.
Key parameters to find are the HBA driver block sizes used, % write, sequential/random and data locality. By the last I mean that if you have a large volume but data is only referenced in a small area, access is very local within the total filesystem. Cache is good in such a situation.
Getting to the cache topic, cache only helps for writes and reuse of blocks for reads. Remember if you write the data it eventually always has to end up on disk. For write, cache only helps to speed up bursts of writes, on average it is the spindle speed that counts.
Now how to determine good segment size. You need to know how random or sequential your IO is. In general smaller segment sizes work better for random IO's with smaller block sizes.
In RAID 5, when dealing with sequential writes, it is good to try to get the host/app to write bigger blocks. A block that is bigger than or equal to the size of a segment width ( segment size of the RAID set * the number of data spindles, excluding the parity spindle(s) ) is the situation to aim for. This saves the process of parity updates for each segment-size write.
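The segment width rule above is simple to compute. A minimal Python sketch with made-up helper names, ignoring alignment (a write of the right size that does not start on a stripe boundary would still cause partial-stripe parity updates):

```python
def full_stripe_kb(segment_kb, total_drives, parity_drives=1):
    """Segment width: segment size times the number of data spindles."""
    data_drives = total_drives - parity_drives
    if data_drives < 1:
        raise ValueError("need at least one data drive")
    return segment_kb * data_drives


def covers_full_stripe(io_kb, segment_kb, total_drives, parity_drives=1):
    """True if one write is at least as large as the segment width."""
    return io_kb >= full_stripe_kb(segment_kb, total_drives, parity_drives)


if __name__ == "__main__":
    # RAID 5 as 5+1 with a 128 KB segment size
    print(full_stripe_kb(128, 6))
    # a 128 KB host write against that layout
    print(covers_full_stripe(128, 128, 6))
    # the same write against 5+1 with a 16 KB segment size
    print(covers_full_stripe(128, 16, 6))
```

For a 5+1 set with 128 KB segments the width is 640 KB, so a 128 KB write falls well short, while the same write covers the 80 KB width of a 5+1 set with 16 KB segments.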
There is much more to say about this topic and books have been filled, but I hope it helps a bit.
regards
Peter
# 17
I found your comment very interesting especially with the problems that I have experienced with the StorEdge 3000 systems and the rubbish that is published on performance. Sun put out a comparison that rated the 3510's almost the fastest storage in the world for what it is with regards to IOPS.
If you ignore all the junk, I put it down to you get what you pay for.
My 3510s were lucky to get more than 50 MB/s throughput at best. Occasionally backups from one 3510 to the other would rate at nearly 90 MB/s, which really annoyed me no end.
Anyway, as I need to extensively grow my system I purchased a fully configured 9990 (or HDS Tagmastore USP 1100). Quite a different piece of kit to a 3510, but it is mind-blowingly quick. Sustained throughput is 140 MB/s on each port (16 of them). Disk to disk backups are in the order of 180 MB/s. The key to this performance is obviously the 9990, but we experimented and made the Veritas File System block size just under 4 MB. The 3510 used 64 KB. I have ten times the amount of data on the 9990 as I had on 10 x 3510s each with a JBOD, and the performance is 250% of the 3510s'.
Same host though.. so go figure. I think the massive cache on the 9990 and the block size have made for most of the performance enhancements. The system is one large Oracle database server.
So what block size to use is obviously a very important but understated question. Almost everybody knows that Oracle writes in bursts of 1 MB per second, so why does almost every vendor suggest using small block sizes on the array? The biggest you can get on a 3510 is 256K. I had a single 3510 with a TB filesystem that ran like a crippled snail for Oracle. The 64k blocksize was changed to 256k and, after some modifications to the filesystem and the RAID layout, it fairly flies now.
Cheers
Stephen
# 18
Dave, I'd be interested to see some sequential write performance differences between single and dual controller configurations. I don't have access to this equipment right now so I can't perform the tests myself. There is a performance overhead on all the 3000 series arrays when in dual controller configuration and I think this explains the issue that the original poster is experiencing; unfortunately no one has mentioned performance differences between single and dual controller configurations up 'til now. Additionally there are some performance related issues caused by certain firmware revisions; check the spe handbook ( I can email specific details if you want ).
Stephen, in my opinion almost all computer storage manufacturers publish performance benchmark figures on various products; most of the time the benchmark figures are not applicable to the " enterprise application " real world. I also think that comparing the entry level 3510 to the 9990 is as simple minded as comparing the outright performance of a BMW 116i to a BMW M5. The M5 will rip down the autobahn at heart stopping speeds, but its V10 motor will probably guzzle heart stopping amounts of fuel if driven around the city. Sun Microsystems produces a range of hardware products to suit various enterprise needs; the 3510 is an entry level storage product aimed at small workgroups.
# 19
I did mention that you get what you pay for and what I was trying to suggest at the end of the post was how much difference configuring the 3510 away from the norm made to performance.
When my company first looked at the requirement for this database, the 3510s were quite new and revolutionary with their speed. Existing HDS systems at that time were still second generation, which had internal speeds of 1 Gbps. So Sun aggressively pushed the 3510 solution for performance as it touted 2 Gbps fibre channel speeds.
When you look at an 8 TB (yes, eight terabytes) solution based on the 3510s that was basically much faster (going by the published stats) than other systems that Sun could offer, no wonder it was chosen. In the two years of production, the system has performed well enough but in my opinion nowhere near the speed it should do. Each query that was performed on the database was spread across every disk on the 10 x 3510s each with a JBOD or, for the maths challenged, that's 220 x 36 GB 15K spindles.
I would be willing to take on that M5 with my 350Z and it is much cheaper. So that's the reason why people look at solutions that they think are of value and should be able to perform.
The new 9990 has 500% more capacity but only cost about 120% more than the original solution. Sun refused to even sell me the 9990, saying I should buy a 6920. When HDS got involved, they partnered with Sun and we now have a dual badged system.
Anyway, I am on holidays for the next six weeks so you won't hear from me again.
Stephen
# 20
I hope we hear from you again after your 6 week holiday Stephen, I always enjoy your input into the forum.
Message was edited by: m-lennon
# 21
Hi,
Did you get a resolution to this? I did have a similar issue and it required the settings on failures and default settings to be set to write-back, not write-through. If you want more detail drop me a line and I will dig the detail out for you.
Graham