reproducible control domain hang

I configured a T2000 as described in the beginner's guide (<a href="http://www.sun.com/blueprints/0207/820-0832.pdf">http://www.sun.com/blueprints/0207/820-0832.pdf</a>) with the exception of the device allocated for the root disk. For that I came up with my own variant of <a href="http://unixconsole.blogspot.com/2007/04/time-to-build-guest-domain.html">http://unixconsole.blogspot.com/2007/04/time-to-build-guest-domain.html</a>.

My variant of using a file involved creating the file with mkfile on a ZFS file system, that is:

<pre>

zpool create zfs mirror c1t0d0s4 c1t1d0s4
zfs create zfs/ldoms
zfs set compress=on zfs/ldoms
mkfile 32G /zfs/ldoms/root.img

</pre>

As I install Solaris in the ldom, the server (control domain) dies after extracting a few hundred megabytes of a flash archive. I have traced this down to it running out of memory.

Here's "vmstat 4" output on the control domain console:

<pre>

...

0 0 0 8636536 24776 1 70 0 0 0 0 0 0 0 0 0 1954 311 2244 0 23 76

0 0 0 8645096 22280 1 36 0 0 0 0 0 0 0 0 0 5957 370 9827 0 19 80

0 0 0 8649008 17720 1 40 0 296 313 0 44 0 0 0 0 9975 361 15877 0 24 75

0 0 0 8651104 15944 1 52 0 807 1671 0 700 0 0 0 0 10725 347 17545 0 26 74

0 0 0 8650800 18376 0 60 0 88 239 0 127 0 0 0 0 9816 391 15545 1 33 67

0 0 0 8640432 15936 0 76 0 497 3025 0 3874 0 0 0 0 11367 432 17975 0 35 65

0 0 0 8642968 17032 1 59 0 452 2028 0 842 0 0 0 0 10266 363 16127 0 27 73

kthr memory page disk faults cpu

r b w swap free re mf pi po fr de sr m0 m1 m2 m1 in sy cs us sy id

0 0 0 8644768 15744 0 56 0 387 1298 0 126 0 0 0 0 10170 330 16355 0 24 75

0 0 0 8652504 18368 1 113 0 372 2462 0 273 0 0 0 0 11171 321 18613 0 35 65

0 0 0 8652832 15720 1 134 0 411 6081 0 738 0 0 0 0 11541 332 18979 0 34 66

0 0 0 8652232 14312 1 94 0 413 1806 0 7775 0 0 0 0 10718 358 18271 0 38 62

0 0 0 8647360 12592 18 133 9 555 5176 0 17490 1 0 1 0 10394 320 16970 1 37 63

0 0 0 8645248 14408 2 73 22 486 5039 0 3111 2 1 1 0 11749 383 18336 0 40 59

2 0 43 8641800 2784 1 148 99 1070 1517 0 53982 19 9 9 5 8316 356 14226 0 43 57

0 0 116 8647032 800 1 42 127 134 312 3688 76207 14 7 7 1 2153 114 3726 0 29 71

</pre>

At this point the server froze. Note that in the last sample the "w" column shows 116 swapped-out processes, free memory is down to 800 KB, and the "de" column is 3688. Very bad news.
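As a rough illustration of how samples like these could be flagged automatically, here is a minimal sketch that filters "vmstat" output with awk. The function name and both thresholds (free memory below 8192 KB, scan rate above 10000) are assumptions chosen for this example, not tuned values, and the field positions assume the 22-column layout with four disk columns shown above.

```shell
#!/bin/sh
# Flag vmstat samples that suggest the system is running out of memory.
# Assumed layout (22 fields, four disk columns, as in the output above):
#   r b w swap free re mf pi po fr de sr d d d d in sy cs us sy id
# $1 = minimum free memory in KB (illustrative default 8192)
# $2 = maximum page scan rate    (illustrative default 10000)
flag_low_memory() {
    awk -v free_kb="${1:-8192}" -v sr_max="${2:-10000}" '
        # Skip header lines and anything that is not a 22-field numeric sample.
        NF != 22 || $1 !~ /^[0-9]+$/ { next }
        # Field 5 is free memory, field 12 is the scan rate (sr).
        $5 < free_kb || $12 > sr_max {
            printf "LOW free=%sK sr=%s w=%s de=%s\n", $5, $12, $3, $11
        }
    '
}
```

It would be used as a filter, for example `vmstat 4 | flag_low_memory 8192 10000`, so that only the suspicious samples are printed while the install runs.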

My initial thought was that I was running into some of the low-memory problems known to happen with the ZFS ARC. This does not seem to be the case. According to mdb, the ARC size is around 60 MB:

<pre>

# mdb unix.3 vmcore.3

...

> arc::print -td size

uint64_t size = 0t61455360

</pre>
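For reference, the byte count printed by mdb can be converted to MiB with a small helper. This is only a sketch: the function name is made up for this example, and it assumes the value is given in decimal, optionally with mdb's "0t" prefix.

```shell
#!/bin/sh
# Convert a byte count to MiB with one decimal place.
# Accepts mdb-style decimal values with a leading "0t" (assumption: decimal only).
bytes_to_mib() {
    awk -v b="${1#0t}" 'BEGIN { printf "%.1f\n", b / 1048576 }'
}

bytes_to_mib 0t61455360   # prints 58.6
```

So the ARC accounts for roughly 59 MiB of the 1 GB in the control domain, which is consistent with the "around 60 MB" estimate.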

The control domain is S10 11/06 with patch 118833-36, the patches required for ldoms, and many others. The ldom being installed is booted from an S10 11/06 netinstall image (118833-33).

By [mike.gerdtsa] at [2007-11-27 0:46:08]
# 1
Are you trying to jumpstart the control domain or a guest domain?
unixconsolea at 2007-7-11 23:12:01 > top of Java-index,Administration Tools,Logical Domains for CoolThreads Servers...
# 2
This would be the guest domain.
mike.gerdtsa at 2007-7-11 23:12:01
# 3
How much memory did you allocate for the control/service domain? Also, what are the bindings for your control and guest domain?
unixconsolea at 2007-7-11 23:12:01
# 4

I allocated 1 GB for the control domain and 4 GB for the service domain. Here are my control domain bindings:

<pre>

# ldm list-bindings primary

Name:primary

State: active

Flags: transition,control,vio service

OS:

Util:0.7%

Uptime: 7d 19h 25m

Vcpu:4

vid pid util strand

0 0 1.3% 100%

1 1 0.6% 100%

2 2 1.1% 100%

3 3 0.0% 100%

Mau:1

mau cpuset (0, 1, 2, 3)

Memory: 1G

real-addr phys-addr size

0x4000000 0x4000000 1G

Vars:auto-boot-on-error?=true

boot-device=<i>truncated so as to not mess up the forum formatting</i>

error-reset-recovery=sync

reboot-command=boot

IO:pci@780 (bus_a)

pci@7c0 (bus_b)

Vldc:primary-vldc0

(HV Control channel)]

[LDC: 0x1]

[LDom primary(Domain Services channel)]

[LDC: 0x3]

[LDom primary(FMA Services channel)]

[LDC: 0xf]

[LDom lab-test-prod-01 (Domain Services channel)]

Vldc:primary-vldc3

(SP channel)]

(SP channel)]

(SP channel)]

(SP channel)]

(SP channel)]

(SP channel)]

(SP channel)]

Vds:primary-vds0

vdsdev: rootdevice=/zfs/ldomroots/2007Q1.img

vdsdev: vol1device=/dev/md/dsk/d1000

[LDom lab-test-prod-01, dev-name: root]

[LDC: 0xc]

Vcc:primary-vcc0

[LDC: 0xe]

[LDom lab-test-prod-01, group: lab-test-prod-01, port: 5000]

port-range=5000-5100

Vsw:primary-vsw0

mac-addr=0:14:4f:f8:ec:5e

net-dev=e1000g0

[LDC: 0xb]

[LDom lab-test-prod-01, name: prod0, mac-addr:0x144ffb5821]

mode=prog,promisc

Vldcc: vldcc1 [FMA Services]

service: ldmfma

service: primary-vldc0 @ primary

[LDC: 0x4]

Vldcc: vldcc2 [SP Channel]

service: spfma

Vldcc: vldcc0 [Domain Services]

service: primary-vldc0 @ primary

[LDC: 0x2]

Vldcc: hvctl[Hypervisor Control]

service: primary-vldc0 @ primary

[LDC: 0x0]

Vcons: SP

</pre>

I have since removed the binding for vol1@primary-vds0 - it was a leftover from trying to use soft partitions and failing there too. Retrying the jumpstart after removing this led to a similar hang.

During this latest hang I was also running "prstat -s size -c" on the console (no user processes were growing) and I had run ::kmastat in mdb. ::kmastat did not finish but I did get...

<pre>

> ::kmastat

cache buf buf buf memory alloc alloc

name size in use total in use succeed fail

---

kmem_magazine_11621681232768554520

kmem_magazine_3324474572294912247640

kmem_magazine_764556 115081122304731800

kmem_magazine_15 12849757961130496432930

kmem_magazine_31 256611050344064114620

kmem_magazine_47 3845525211468811830

kmem_magazine_63 51212539222937648580

kmem_magazine_95 768854491522620

kmem_magazine_143115200 0 00

kmem_slab_cache56 24574 28381230195214781170

kmem_bufctl_cache 2411384056196608501600

kmem_bufctl_audit_cache128 794790 848318 13112115244368270

kmem_va_81928192 15539 17696 144965632205090

kmem_va_16384 16384249288471859227510

kmem_va_24576 2457639003910 10249830450640

kmem_va_32768 327685165242888700

kmem_va_40960 40960101878643219740

kmem_va_49152 491526105242887270

kmem_va_57344 573441936235929619670

kmem_va_65536 65536323328 2149580810290

kmem_alloc_889880 15360491520 293841450

kmem_alloc_1616 49180 516122072576 1065800760

kmem_alloc_24245007 1717082739236248610

kmem_alloc_3232 23743 386902170880 139371870

kmem_alloc_40406650 35456226918424376860

kmem_alloc_4848 51877 101022732364813376830

kmem_alloc_56563754 53550430080019051130

kmem_alloc_64642097 295043776512 114618120

kmem_alloc_80804665 2808029491207377180

kmem_alloc_96968837 14076169574420322610

kmem_alloc_112112 24855 2634035962881089210

kmem_alloc_1281281338256249971219201740

kmem_alloc_160160911184834406449335500

kmem_alloc_1921922703208192015596360

kmem_alloc_2242245508252048001769070

kmem_alloc_25625645475024576084452510

kmem_alloc_3203202623781474563922540

kmem_alloc_384384 41240 41598 1893171260281880

kmem_alloc_44844863964915224612520

kmem_alloc_51251235392454067212602590

kmem_alloc_6406409679258982490922110

kmem_alloc_768768234540960828250

kmem_alloc_8968961748491523420

kmem_alloc_1152 115261065081920028699730

kmem_alloc_1344 1344318813107283010

kmem_alloc_1600 160072749152938810

kmem_alloc_2048 204883058327 18604032550530

kmem_alloc_2688 2688 24686 24704 758906882135460

kmem_alloc_4096 40967676622592294750

kmem_alloc_8192 819227152715 222412803648030

kmem_alloc_122881228819193112962270

kmem_alloc_16384163841515245760393600

streams_mblk64 26144 272852629632 5065882530

streams_dblk_16 128741262457646232010

streams_dblk_80 1922921088278528 2935188700

streams_dblk_144 256230098304 236524130

streams_dblk_208 32017342016384013596580

streams_dblk_272 384190409607203280

streams_dblk_336 448064327689758620

streams_dblk_528 6400443276816677520

streams_dblk_10401152165819201776810

streams_dblk_14881600018327685540

streams_dblk_1936204824498304216040

streams_dblk_257626880192589824 504300160

streams_dblk_3920403206245763660

streams_dblk_81921120428192121820

streams_dblk_12112 1222400 02130

streams_dblk_163841120428192 10

streams_dblk_20304 2041600 0 00

streams_dblk_2457611200 0 00

streams_dblk_28496 2860800 0 00

streams_dblk_327681120428192 30

streams_dblk_36688 3680000 0 30

streams_dblk_4096011200 0 00

streams_dblk_44880 4499200 0 00

streams_dblk_491521120428192 10

streams_dblk_53072 5318400 0 00

streams_dblk_5734411200 0 00

streams_dblk_61264 6137600 0 00

streams_dblk_655361120428192 20

streams_dblk_69456 6956800 04615300

streams_dblk_7372811200 0 00

streams_dblk_esb 112 25600 256204997120 475991920

streams_fthdr26400 0 00

streams_ftblk23200 0 00

multidata24800 0 00

multidata_pdslab711200 0 00

multidata_pattbl 3200 0 00

log_cons_cache481311381922220

taskq_ent_cache5618372346188416148810

taskq_cache21689102245761140

id32_cache3211288192270

bp_map_8192819200 0234140

bp_map_163841638400 0 80

bp_map_245762457600 0 20

bp_map_327683276800 0 50

bp_map_409604096000 0 00

bp_map_491524915200 0 00

bp_map_573445734400 0 00

bp_map_655366553600 0 00

memseg_cache11200 0 00

mod_hash_entries 242656803276813472180

ipp_mod 30400 0 00

ipp_action36800 0 00

ipp_packet6400 0 00

sfmmuid_cache25683116327681517490

sfmmu_tsbinfo_cache6484186163842988640

sfmmu_tsb8k_cache819200 0 00

sfmmu_tsb_cache 819238406553601533700

sfmmu8_cache312 17465 46152 1575321665328210

sfmmu1_cache881732763286835213142190

pa_hment_cache6462184163841533180

ism_blk_cache27200 0 00

ism_ment_cache3200 0 00

seg_cache 724633561054067257527340

dev_info_node_cache 4802042561310728150

segkp_819281926980655360313590

segkp_163841638400 0 00

segkp_245762457600 0 00

segkp_3276832768628672 2202009666660

segkp_409604096000 0 00

umem_np_8192819200 0123900

umem_np_16384 1638400 061890

umem_np_24576 2457600 0 00

umem_np_32768 3276800 061330

umem_np_40960 4096000 0 00

umem_np_49152 4915200 0 00

umem_np_57344 5734400 0 00

umem_np_65536 6553600 060830

thread_cache792236279253952950830

lwp_cache90423628028672054240

turnstile_cache64633837737281131650

tslabel_cache4821138192 20

cred_cache1727212024576121320

rctl_cache4011831408901129828090

rctl_val_cache642371260422937620881320

task_cache104451281638466110

cyclic_id_cache643938192 30

dnlc_space_cache 2406803276828910

vn_cache 2409502 83325 2730393661573110

vsk_anchor_cache 40241288192580

file_cache565907145734425424920

stream_head_cache4002533041310721994390

queue_cache6565846484423682924480

syncq_cache160164481922500

qband_cache642938192 20

linkinfo_cache48131138192230

ciputctrl_cache 102400 0 00

serializer_cache 64249381927500

as_cache 21682136327681465480

marker_cache1280538192777190

anon_cache48 17448 18156145817671077740

anonmap_cache482825350325395231268200

segvn_cache1044633518466355256246350

flk_edges 4801138192 20

fdb_cache10400 0 00

timer_cache1362518192 30

physio_buf_cache 2480308192310

snode_cache152353414737286336870

ufs_inode_cache 3688333 75240 3081830413629960

directio_buf_cache27200 0 00

lufs_save 2401708192391620

lufs_bufs2560298192444730

lufs_mapentry_cache 1126102013926414477480

px1_dvma_81928192162419660848889020

mpt0_cache48061125734429303310

dv_node_cache120814486553616340

clnt_clts_endpnt_cache880738192 20

md_stripe_parent 9601361638421787670

md_stripe_child 3120722457624708490

md_mirror_parent 1600881638414869410

md_mirror_child 3040963276823761520

md_mirror_wow 1644000 0100

md_softpart_parent8807381921030

md_softpart_child30402481921150

ldc_memhdl_cache 48102811308192027820

ldc_memseg_cache 6451555849152288680

mac_impl_cache7525108192 50

dls_cache1688841638425750

soft_ring_cache 17600 0 00

dls_vlan_cache4891138192130

dls_link_cache6245128192 50

dld_ctl_1 100 0 00

dld_str_cache2561029819235760

px0_dvma_8192819200 0 00

kcf_sreq_cache4806481922440

kcf_areq_cache27200 0 00

kcf_context_cache 8800 0 00

ipsec_actions7228581922120

ipsec_selectors722858192 20

ipsec_policy724858192 40

ipsec_info3040248192135780

ip_minor_arena_1193128128408810

ipcl_conn_cache 480608040960396610

ipcl_tcpconn_cache 166434549830416960

ire_cache352366324576537730

tcp_timercache88481461638426110

tcp_sack_info_cache801078819210570

tcp_iphc_cache120341121638414460

squeue_cache1364428192 40

sctp_conn_cache 22321716384 10

sctp_faddr_cache 16800 0 00

sctp_set_cache2400 0 00

sctp_ftsn_set_cache1600 0 00

ire_gw_secattr_cache 3200 0 00

sctpsock 61600 0 00

sctp_assoc6400 0 00

socktpi_cache456546832768322200

socktpi_unix_cache456334163844110

udp_cache408467232768394220

process_cache304884104319488701000

exacct_object_cache400128819278813180

fctl_cache1120608192200

tl_cache 4323568327689290

keysock_1 100 0 00

spdsock_1 100 0 90

fnode_cache17673681923190

pipe_cache320429232768744590

fp2_cache7281108192320

fp0_cache7281108192180

fcp2_cache1168039491521248480

fcp0_cache1168039491521376410

ncp_ds_cache8000 0 00

ncp_mactl_cache2400 0 00

ncp_mabuf_cache 204800 0 00

kssl_cache156000 0 00

namefs_inodes_11216464210

port_cache807788192320

qif_head_cache2641028819235750

ip_minor_1 100 0 00

ar_minor_1 100 0 00

lnode_cache3221468192 20

icmp_minor_1100 0 00

zio_buf_51251214452457616520

zio_buf_102410245404096069870

zio_buf_15361536010163841110

zio_buf_20482048016327689670

zio_buf_25602560015409603590

zio_buf_307230720164915210540

zio_buf_35843584093276819100

zio_buf_40964096514573449390

zio_buf_512051200168192072300

zio_buf_61446144084915248370

zio_buf_71687168085734443330

zio_buf_81928192064915237970

zio_buf_10240 102400881920272500

zio_buf_12288 12288067372859370

zio_buf_14336 14336040573440157500

zio_buf_16384 16384162163267059248161440

zio_buf_20480 204800612288088550

zio_buf_24576 245760512288069190

zio_buf_28672 286720822937657190

zio_buf_32768 327680516384064770

zio_buf_40960 409600312288096720

zio_buf_49152 491520419660850270

zio_buf_57344 573440528672038720

zio_buf_65536 65536322324 21233664288240

zio_buf_73728 737280429491258060

zio_buf_81920 819200432768035130

zio_buf_90112 901120327033624580

zio_buf_98304 983040439321624580

zio_buf_1064961064960553248018270

zio_buf_11468811468817802816299600

zio_buf_122880122880056144001820

zio_buf_131072131072378385 5046272033754810

dmu_buf_impl_t328558207073728081249930

dnode_t 640323624576560

arc_buf_hdr_t128719286244236881082370

arc_buf_t 4055014089011281203130

zil_lwb_cache2161348192 50

zfs_znode_cache 1924378192 40

pty_map5641028192590

crypto_session_cache 9600 0 00

md_raid_parent12000 0 00

md_raid_child104000 0 00

md_raid_cbufs37600 0 00

md_trans_parent8000 0 00

md_trans_child24800 0 00

authkern_cache720858192126370

authloopback_cache72085819259450

authdes_cache_handle 8000 0 00

rnode_cache6481172491524280

nfs_access_cache 56010281922010

client_handle_cache3211468192150

rnode4_cache96000 0 00

svnode_cache4000 0 00

nfs4_access_cache 5600 0 00

client_handle4_cache 3200 0 00

nfs4_ace4vals_cache4800 0 00

nfs4_ace4_list_cache 26400 0 00

NFS_idmap_cache4800 0 00

lm_vnode 18400 0 00

lm_xprt3200 0 00

lm_sysid 1601448192 20

lm_client1280538192 20

lm_async 3200 0 00

lm_sleep 961688192 10

lm_config 801788192 10

dtrace_state_cache 409600 0 00

sd2_cache 5600 0 00

hsfs_hsnode_cache1921378192 10

---

Total [static]2703362484010

Total [hat_memload]1575321665328210

Total [kmem_msb]13693747261795580

Total [kmem_firewall]234209288956210

Total [kmem_va]277872640348910

Total [kmem_default]303939584 12368027020

Total [bp_map]0234290

Total [kmem_tsb_default] 6553601533700

Total [hat_memload1]86835213142190

Total [umem_np]0307950

Total [id32]8192270

Total [segkp]22675456380250

Total [px1_dvma] 19660848889020

Total [ip_minor_arena]128408810

Total [spdsock]0 90

Total [namefs_inodes] 64210

---

vmem memory memory memory alloc alloc

name in use total import succeed fail

- ---

</pre>

mike.gerdtsa at 2007-7-11 23:12:01
# 5
Hmm.. I have not run into this issue and can't reproduce it. What version of Solaris are you using? Also, I don't see the bindings for your guest domain.
unixconsolea at 2007-7-11 23:12:01
# 6

It looks like the control domain and the service domain are one and the same (the domain 'primary' contains both bus_a and bus_b). This domain has only 1G of memory so I think you are running into '6447701 ZFS hangs when iSCSI Target attempts to initialize its backing store' which is fixed in OpenSolaris build 49 and will be in the next update release of S10.

In the meantime a workaround is to ensure that there is at least 4G of memory in the control/service domain - aka 'primary'.
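One way to check a domain against that 4G floor is to parse the "Memory:" line of the bindings output. The sketch below is an assumption-laden example: the function name is hypothetical, and it assumes the size is printed as a number followed by G or M, as in the "Memory: 1G" line earlier in this thread. Actually raising the allocation is a separate step (the LDoms documentation describes "ldm set-memory" for this) and is not shown here.

```shell
#!/bin/sh
# Read `ldm list-bindings <domain>` output on stdin and print "ok" if the
# Memory: line is at least the given number of GB (default 4), else "low".
# Assumption: the size looks like "1G" or "512M".
check_domain_memory() {
    awk -v min_gb="${1:-4}" '
        /^[ \t]*Memory:/ {
            size  = $2
            unit  = substr(size, length(size))
            value = substr(size, 1, length(size) - 1) + 0
            if (unit == "M") value = value / 1024   # normalise MB to GB
            result = (value >= min_gb) ? "ok" : "low"
            print result
            exit
        }
    '
}
```

For example, `ldm list-bindings primary | check_domain_memory 4` would print "low" for the 1G configuration posted above.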

merwicka at 2007-7-11 23:12:01