Aug 22

Solaris had seen better days by the time Solaris 9 was released. No groundbreaking innovations had occurred, the SPARC architecture had started to lose its place as the data center chip of choice, and Linux was really kicking it in the teeth with its ease of access for younger sysadmins. An x86 version existed, but it was really just a hobby OS and no data center in its right mind would deploy it in production. Things looked bleak, and then came Solaris 10. Solaris 10, and the CoolThreads/Niagara CPUs, helped to put the shine back on Sun. Zones and containers helped to virtualize server hardware, giving a bit more return on investment, but what really did it for the geeks was ZFS. ZFS has been billed as “the last word in file systems” and I gotta say, I believe it. It combines volume management (think LVM), RAID and a transactional copy-on-write file system, and manages to increase performance all at the same time. Add to the equation that VMware recently released ESXi (the bare metal hypervisor that they had been charging 3500 per node for) at no cost, and you have a really sweet SAN-backed virtualization solution in the making.

First things first: install OpenSolaris and immediately patch it. You can find instructions on how to do that here, but the condensed version is:

pfexec pkg refresh
pfexec pkg image-update
pfexec mount -F zfs rpool/ROOT/opensolaris-2 /mnt
pfexec /mnt/boot/solaris/bin/update_grub -R /mnt

Depending on your internet connection, this may take anywhere from an hour to a few hours. The reason for the upgrade is that the shipping version of OpenSolaris (2008.05) has a bug in its serial number generation that prevents VMware from using volumes exported via iSCSI. Once you’ve upgraded Solaris, we need to create our pool. We’re going to assume three drives, c0t0d0, c0t0d1 and c0t0d2, and we’re going to put them into a raidz (worth looking up; think RAID 5, but without the write hole):

zpool create tank raidz c0t0d0 c0t0d1 c0t0d2

And you can check your handiwork by running:

zpool status -v tank
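
If you want a rough idea of how much space a raidz pool like this gives you, the sketch below estimates usable capacity as (drives - parity) x drive size. The 500 GB drive size is purely an assumption for illustration (drive sizes aren't given here), the function name is made up, and a real pool will report a bit less once metadata and padding are accounted for.

```python
def raidz_usable_gb(drive_count, drive_size_gb, parity=1):
    """Approximate usable capacity of one raidz vdev, in GB.

    parity=1 corresponds to plain raidz (single parity).
    """
    if drive_count <= parity:
        raise ValueError("need more drives than parity devices")
    # One drive's worth of space per parity level goes to redundancy.
    return (drive_count - parity) * drive_size_gb


# Three hypothetical 500 GB drives in a single-parity raidz.
print(raidz_usable_gb(3, 500))
```

Compare the estimate against what zpool list reports to see the real overhead on your hardware.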

So, we now have a ZFS pool called tank that is made of 3 drives. Next, we’re going to create a 100 gig volume that we’ll use in the SAN:

zfs create -V 100g tank/iscsi-vol

We now have a 100 gig volume called iscsi-vol in the tank pool. The next step is to share that bugger out via iSCSI:

zfs set shareiscsi=on tank/iscsi-vol

And we’re done. You can verify with:

iscsitadm list target -v
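
If you plan to carve out more than one volume, the three commands above are easy to template. Here is a minimal sketch that only builds and prints the command lines for a given pool, volume name and size; it does not run anything. The function name is hypothetical, and the commands themselves are exactly the ones shown above.

```python
import shlex


def iscsi_volume_commands(pool, volume, size):
    """Return the argv lists to create a zvol, share it and list targets."""
    dataset = f"{pool}/{volume}"
    return [
        ["zfs", "create", "-V", size, dataset],    # create the volume
        ["zfs", "set", "shareiscsi=on", dataset],  # export it over iSCSI
        ["iscsitadm", "list", "target", "-v"],     # verify the target exists
    ]


# Dry run: print the commands instead of executing them.
for argv in iscsi_volume_commands("tank", "iscsi-vol", "100g"):
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Printing first and pasting by hand is deliberate: a typo in a pool or volume name is much cheaper to catch before anything touches the disks.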

Now that we have the volume shared out, we need to get access to it with VMware. I’m assuming here that you have a single ESXi 3.5 Update 2 node to play with, so the steps below are for a VirtualCenter client connected to a single ESXi node. This is a pretty simple operation. In the VMware console, click on configuration and go to networking. Add a VMkernel port, then click properties and enable iSCSI for that adapter. Back on the main configuration tab, click on storage adapters and select properties for the iSCSI software adapter. You’ll need to enable the device and then click OK to close the window. Open that property window again and go to dynamic discovery. Here you’ll add the IP of the Solaris box and then click OK.

Right click on the iSCSI adapter and select rescan; this may take a minute. When it’s done, go into storage and click add storage. Looky what shows up in your VMFS storage pools: our new 100 gig volume.

May 5

Just a random tidbit I thought I would post. I have a server at home acting as an iSCSI SAN. I ran a batch of hdparm tests against it, a single SATA drive in that array, a 5 disc SAS array in a Compaq server and a 4 disc RAID 5 3Ware SATA array. Here are the results. These are averages taken over 5 passes, BTW, on an otherwise idle machine. The CPUs are all different speeds, but are all of the same class (dual core, 800 MHz FSB).

5 disc SAS array with 136g 10k drives

Timing cached reads: 13336 MB in 2.00 seconds = 6673.96 MB/sec
Timing buffered disk reads: 98 MB in 1.18 seconds = 83.31 MB/sec

4 disc Linux RAID 5 with 3Ware 9650SE and 500g 7200RPM drives

Timing cached reads: 6576 MB in 2.00 seconds = 3293.08 MB/sec
Timing buffered disk reads: 448 MB in 3.00 seconds = 149.20 MB/sec

Single 500g 7200 RPM SATA drive

Timing cached reads: 14220 MB in 2.00 seconds = 7119.78 MB/sec
Timing buffered disk reads: 198 MB in 3.02 seconds = 65.51 MB/sec

6 500g 7200 RPM SATA drives in a software RAID 5 array

Timing cached reads: 14364 MB in 2.00 seconds = 7191.86 MB/sec
Timing buffered disk reads: 852 MB in 3.00 seconds = 283.64 MB/sec

Now, for those of you who have storage experience: yes, I didn’t mention chunk size or any of that fun stuff. But the point that I’m trying to get across is that, if you have the CPU cycles to spare, software RAID can be wicked fast.
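
To put the buffered read numbers above side by side, here is a small sketch that pulls the MB/sec figure out of each hdparm line and compares it to the single drive. The regular expression and the dictionary labels are my own shorthand, not anything hdparm provides; the sample lines are copied from the results above.

```python
import re

# Matches the rate at the end of hdparm's "buffered disk reads" line.
BUFFERED = re.compile(r"Timing buffered disk reads:.*?=\s*([\d.]+)\s*MB/sec")


def buffered_mb_per_sec(hdparm_output):
    """Return the buffered disk read rate (MB/sec) from hdparm -t output."""
    match = BUFFERED.search(hdparm_output)
    if match is None:
        raise ValueError("no buffered disk read line found")
    return float(match.group(1))


results = {
    "5 disc SAS": "Timing buffered disk reads: 98 MB in 1.18 seconds = 83.31 MB/sec",
    "4 disc 3Ware RAID 5": "Timing buffered disk reads: 448 MB in 3.00 seconds = 149.20 MB/sec",
    "single SATA": "Timing buffered disk reads: 198 MB in 3.02 seconds = 65.51 MB/sec",
    "6 disc software RAID 5": "Timing buffered disk reads: 852 MB in 3.00 seconds = 283.64 MB/sec",
}

single = buffered_mb_per_sec(results["single SATA"])
for name, text in results.items():
    rate = buffered_mb_per_sec(text)
    print(f"{name}: {rate:.2f} MB/sec ({rate / single:.1f}x the single drive)")
```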

Apr 28

This tip assumes that you have two disks with equal-sized free partitions that can be used.
RAID 1 is commonly referred to as mirroring. Every bit that is written to disk 1 is also written to disk 2, so in the event of a failure of disk 1, disk 2 has a complete mirror of all data and you can keep right on going. To create a RAID 1 mirror of two partitions, run the following:

mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 /dev/$DISK1 /dev/$DISK2

Where $DISK1 is the first partition (sda1 for example) and $DISK2 is the second partition (sdb1).
To verify that it worked, cat /proc/mdstat.
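
If you would rather script that check than eyeball it, a minimal sketch follows. It scans mdstat-style text for an array line and confirms the array is active at the expected RAID level. The function name and the sample text are hypothetical stand-ins; on a real box you would pass in the contents of /proc/mdstat instead. It only handles the simple "mdX : active raidN ..." form.

```python
def array_is_active(mdstat_text, array, level):
    """Return True if `array` is listed as active with the given RAID level."""
    for line in mdstat_text.splitlines():
        fields = line.split()
        # Array lines look like: "md0 : active raid1 sdb1[1] sda1[0]"
        if len(fields) >= 4 and fields[0] == array and fields[1] == ":":
            return fields[2] == "active" and fields[3] == level
    return False


# Hypothetical sample; replace with open("/proc/mdstat").read() on a real system.
sample = """Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

unused devices: <none>
"""

print(array_is_active(sample, "md0", "raid1"))
```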

Apr 25

The file /proc/mdstat will tell you most of what you need to know when you are working with a software RAID array in Linux.

cat /proc/mdstat

Returns this on my file server at home

Personalities : [raid6] [raid5] [raid4] [raid1]
md0 : active raid1 sda1[0] sdb1[1] sdc1[2](S) sdd1[3](S) sde1[4](S) sdf1[5](S)
104320 blocks [2/2] [UU]
resync=DELAYED

md1 : active raid5 sdf2[6] sde2[4] sdd2[3] sdc2[2] sdb2[1] sda2[0]
2441396480 blocks level 5, 256k chunk, algorithm 2 [6/5] [UUUUU_]
[========>............] recovery = 43.8% (213871868/488279296) finish=88.4min speed=51686K/sec

unused devices: <none>

The first line of interest is the personalities. This tells you what RAID levels the box supports.

Next is the array name (md0), whether it’s active or not, its RAID level (1) and the devices and partitions that make up this array (sda1 through sdf1). Notice that 4 of the partitions have an (S) after them to indicate they are spares.

The next line is the number of blocks the array has (this can be translated into size) and a section in square brackets. The first number in the brackets is the number of drives that are configured to be active in this array, the next number is the number of drives that actually are active. In my case, 2 are configured and 2 are active, so 2/2. The last box tells us what state those drives are in, U for Up.
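
For translating blocks into size, the sketch below assumes the blocks in /proc/mdstat are 1 KiB each and prints the two block counts from my output in friendlier units. The helper names are made up for this example.

```python
def blocks_to_bytes(blocks, block_size=1024):
    """Convert an mdstat block count to bytes (assuming 1 KiB blocks)."""
    return blocks * block_size


def human_size(num_bytes):
    """Format a byte count using binary units, up to TiB."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.2f} {unit}"
        size /= 1024


print(human_size(blocks_to_bytes(104320)))      # md0, the small mirror
print(human_size(blocks_to_bytes(2441396480)))  # md1, the RAID 5 set
```

As a sanity check on md1: the recovery line shows 488279296 blocks per member, and 5 x 488279296 = 2441396480, which is what you would expect from a 6 drive RAID 5 that gives up one drive's worth of space to parity.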

Looking at the md1 line, we see a few differences. The immediate difference is the additional information about chunk size and algorithm. This is a RAID 5 set, so that information makes sense, but I’ll let you research it on your own. We can see that on this array 6 drives are configured but only 5 are active. That’s because the array is rebuilding, so one of the drives is out of sync. When the rebuild is complete, all of the drive slots will be U’s. Note that the out of sync drive is shown as “_”; since none of the devices is marked (F), the drive hasn’t failed, it’s just rebuilding.
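
The bracketed status and the recovery line are easy to pick apart in a script. This sketch (the function names and regular expressions are my own) pulls out the configured/active counts and re-derives the time remaining from the blocks done, blocks total and speed. The two sample lines are copied from the md1 output above.

```python
import re


def parse_status(line):
    """Return (configured, active, flags) from a line containing '[6/5] [UUUUU_]'."""
    match = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
    if match is None:
        raise ValueError("no status brackets found")
    return int(match.group(1)), int(match.group(2)), match.group(3)


def recovery_minutes_left(line):
    """Estimate minutes remaining from '(done/total) ... speed=NK/sec'."""
    match = re.search(r"\((\d+)/(\d+)\).*speed=(\d+)K/sec", line)
    if match is None:
        raise ValueError("no recovery progress found")
    done, total, speed = (int(group) for group in match.groups())
    # Blocks and speed are both in units of 1K, so the division yields seconds.
    return (total - done) / speed / 60


status_line = "2441396480 blocks level 5, 256k chunk, algorithm 2 [6/5] [UUUUU_]"
progress_line = ("[========>............] recovery = 43.8% "
                 "(213871868/488279296) finish=88.4min speed=51686K/sec")

configured, active, flags = parse_status(status_line)
print(f"{active} of {configured} drives active, {flags.count('_')} out of sync")
print(f"about {recovery_minutes_left(progress_line):.1f} minutes left")
```

The estimate lands within a fraction of a minute of the kernel's own finish= value, which is a nice confirmation of what those numbers mean.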