My Initial NetApp Thoughts
I’ve been living in a mostly EMC Fibre Channel world for the last 5+ years, both as a customer and as a consultant. Now, just two months after becoming a NetApp customer and going through the NCDA bootcamp, I’m starting to form some opinions.
First, NetApp seems to be a solid array so far. I’ve gone from pure FC to FC/iSCSI/NFS. Considering my current environment, we’re not taxing our FAS3140s too hard, but I certainly haven’t been starved for performance. We’re running ONTAP 7.3.3, so we’re not seeing some of the VAAI gains, but that will be solved with a software upgrade in the near future. The dedupe is good (although I would prefer in-line) and our arrays have just about every available license. We’re also using all the built-in backup tools for our entire backup solution (OSSV, SnapMirror, SAN Snapshots, etc). This is the first time I’ve been in an environment that didn’t use Veeam, NetBackup, Commvault, Backup Exec, etc. in some capacity. No tapes, no 3rd party backup.
I’m absolutely still learning OnCommand and reverse engineering the previous Engineer’s design decisions but I have to say so far the DR piece of NetApp has been a royal pain in the ass. It isn’t intuitive in the least and there are too many management points (or no single pane of glass). I have to jump from screen to screen to get information and the interfaces are slow. I don’t know if the new OnCommand System Manager 2.0R1 is an improvement over 1.1 but it can be painful to click around due to the waiting.
vSphere integration is decent. I’m used to Navisphere/Unisphere, but I think NetApp has done a great job with their integrations and interfaces. I like that the vSphere plugin doesn’t require an install in the vSphere client; it’s there just waiting for you the first time you log in to vSphere. And when vSphere 5 was released they had updated versions to go with it fairly quickly.
I’m not going to get into the debate about WAFL other than just saying for file storage it actually seems like a good algorithm. I think you have to really know what you’re doing to get the block performance out of it, though. But there are many more people smarter than I who can better have the WAFL discussion.
As far as support goes, it’s on par with other vendors. I’ve called in quite a few times and most of the calls have been great. However, the one ticket I have open that is really important is regarding one of our filers that crashed and produced a core dump. It has been 12 days now and I don’t have an answer back as to why the array crashed. I’ve had to call every day to get a status update and I’m on my third support engineer. It took a week just to get the core team to review the dump. Every vendor I’ve worked with has had good and bad. But this one is really bad and unfortunately it tarnishes the rest of the good calls and overall feelings.
To recap, I feel like NetApp has a solid lineup that can compete with the Clariion, Celerra, and VNX lines. I haven’t been up the scale to the VMAX and beyond so I can’t speak to a comparison at that level. And although my heart still lives with EMC the NetApp is getting the job done. I’m learning every day and keeping a positive attitude about the differences in the technologies and methodologies.
The Cloud is Cloudy
I’m just like so many IT folk these days who are pummeled with the pontifications of the industry regarding that “C” word. And while it is a movement I believe in I also treat it like any other facet of life and question the hell out of it. Why? How? Who? Some of us may be beyond acceptance at this point, but there are still many unanswered questions and issues in my mind.
The definition of cloud computing is debatable, but this is what I believe. True Cloud means the complete abstraction of an application or “workload” from a single or multiple points of failure. Translation: an application that runs in a cloud does not have outages due to hardware, network, or datacenter annihilation. The workload moves freely as necessary to avoid bottlenecks and downtime. It seems that many of the advanced technologies that we utilize in cloud infrastructure still depend on some of the core IT services that have been around a long time, such as DNS, and those technologies haven’t evolved much at all. The recent outages on Amazon’s EC2 and Microsoft’s 365 “Clouds” really make me question whether or not we can get to the true cloud without some additional fundamental changes in the core infrastructure services. Or perhaps it is just as simple as improving the controls on the infrastructure – preventing a costly mistake of an admin or automated system from taking down a cloud application? Or is the OS perhaps the piece that needs to change? (I don’t know a lot about PaaS yet, but I do read very interesting things about where that might be headed.)
Resiliency should be the 1st requirement of a cloud, public or private. For the sake of simplicity I’m not going to include the private cloud because I believe that, in general, this type of infrastructure is too costly and unattainable for most organizations (which is why the public cloud should be so appealing with its economies of scale). Enterprise public cloud is my focus because I see many organizations moving commodity applications to a public cloud infrastructure and continuing to get hit with major outages. For example, Microsoft Office 365 has had several major outages since launching earlier this year. In fact, they seem to expect outages of up to 18 full days per year (http://www.theregister.co.uk/2011/06/30/microsoft_cloud_uptime/). Do you allow your mail system to be down that much in your own private or non-cloud infrastructure?
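To put the 18-day figure in perspective, here is a quick back-of-the-envelope sketch in Python that converts downtime per year into an availability percentage and back. The "three nines" comparison is my own illustration and is not taken from the linked article; the function names are made up for this example.

```python
# Rough availability arithmetic; assumes a 365-day year.
HOURS_PER_YEAR = 365 * 24


def availability_pct(downtime_days):
    """Availability (in percent) implied by a number of full days of downtime per year."""
    return 100.0 * (1 - downtime_days / 365)


def downtime_hours(availability):
    """Hours of downtime per year implied by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability / 100.0)


if __name__ == "__main__":
    # 18 full days of outage per year
    print(f"18 days down per year = {availability_pct(18):.2f}% availability")
    # For comparison: what a "three nines" target would allow
    print(f"99.9% availability    = {downtime_hours(99.9):.2f} hours down per year")
```

In other words, 18 days a year works out to roughly 95% availability, which is a long way from the "nines" most of us would quote for an in-house mail system.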
I definitely think we’re moving in the right direction and that we will get to the right solution. The marketing machine is moving at a blistering pace, however. I’d like to see some more talk about what today’s enterprise public cloud can’t do for you. I think we need to evaluate our unencumbered trust in the cloud and keep the innovation train moving. It is amazing to me how many great minds are out there coming up with the new technology and then also evangelizing it. I’m very excited for the future of IT – that much is for certain. I just don’t think today’s Cloud is what I had envisioned. But, hey, maybe Cloud 2.0 will be the next big thing!
Apparent Cisco UCS 1.4(2b) Bug
Earlier this week I attempted to do something seemingly simple. It was actually my first production change to the UCS chassis. I added a new VLAN to the fabric and then added the VLAN to the appropriate service profiles. Upon UCSM reassociating the service profile with the blades I immediately noticed errors and warnings popping up in the UCSM.
After some head scratching and self-doubting of my UCS abilities I started up a TAC case. Luckily there were no ill effects from these errors yet, so I was optimistic that it would be some small oversight on my part and a quick fix by the TAC Engineer. What we found was that the SUP recv_q queue was quickly filling up on both fabric interconnects. According to TAC we just needed to flush the queue to clear the errors and we’d be off and running. She was correct and within a few minutes we were closing the case. She strongly suggested upgrading to 1.4(3m) or 1.4(3q) as, according to her, 1.4(2b) has been a very problematic release. I’d be interested to know if others have experienced any similar cases or general issues with 1.4(2b). I don’t have an article or documentation that this is actually a bug; that’s just what the TAC Engineer was claiming.
Here’s how we were able to flush the queue on the interconnects. It involves the debug plugin which is unsupported for use without a TAC Engineer’s involvement. The only way to get the tool (that I know of) is through a TAC Engineer and their policy is to delete it once they are done.
UCS-01-B# connect nxos
UCS-01-B(nxos)# show system internal mts buffers summary
node sapno recv_q pers_q npers_q log_q
sup 58715 209705 0 0 0
sup 1432 2 0 0 0
sup 284 0 2 0 0
sup 761 0 0 1 0
UCS-01-B(nxos)# exit
UCS-01-B# connect local-mgmt
UCS-01-B(local-mgmt)# pwd
workspace:/
UCS-01-B(local-mgmt)# ls
1 16 Sep 09 11:51:34 2010 cores
2 1024 May 19 13:14:49 2011 debug_plugin/
1 31 Sep 09 11:51:34 2010 diagnostics
2 1024 Sep 09 11:49:29 2010 lost+found/
2 1024 Sep 06 16:56:20 2011 techsupport/
1 2299442 Aug 09 12:10:39 2011 ucs-dplug.4.2.1.N1.1.42.gbin
Usage for workspace://
290835456 bytes total
10737664 bytes used
265081856 bytes free
UCS-01-B(local-mgmt)# copy workspace:///ucs-dplug.4.2.1.N1.1.42.gbin volatile:dplug
UCS-01-B(local-mgmt)# load-debug-plugin volatile:dplug
###############################################################
Warning: debug-plugin is for engineering internal use only!
For security reason, plugin image has been deleted.
###############################################################
Successfully loaded debug-plugin!!!
Linux(debug)# ps -ef | grep portAG
root 8028 4370 1 Aug09 ? 10:23:07 svc_sam_portAG --x
root 6565 6555 0 14:58 pts/1 00:00:00 grep portAG
Linux(debug)# kill 8028
Linux(debug)# ps -ef | grep portAG
root 6605 4370 27 14:59 ? 00:00:05 svc_sam_portAG --x
root 6649 6555 0 14:59 pts/1 00:00:00 grep portAG
Linux(debug)# exit
exit
UCS-01-B(local-mgmt)# exit
UCS-01-B# connect nxos
UCS-01-B(nxos)# show system internal mts buffers summary
node sapno recv_q pers_q npers_q log_q
sup 54263 16 0 0 0
sup 1432 2 0 0 0
sup 284 0 2 0 0
sup 761 0 0 1 0
Notice the recv_q size before and after the kill command. Hopefully this hasn’t affected many of you but if it has I hope you found this and it helped.
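If you want to keep an eye on this without eyeballing the table, a small script can parse the summary output and flag any queue that is backing up. This is only a sketch: it assumes you have already captured the text of "show system internal mts buffers summary" somehow, the function names are my own, and the threshold of 1000 is an arbitrary number I picked, not a Cisco-documented limit.

```python
def parse_mts_summary(text):
    """Turn the MTS buffers summary table into a list of dicts, one per data row."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows look like: sup 58715 209705 0 0 0
        # The header row and CLI prompts fail the all-digits check and are skipped.
        if len(parts) == 6 and all(p.isdigit() for p in parts[1:]):
            rows.append({
                "node": parts[0],
                "sapno": int(parts[1]),
                "recv_q": int(parts[2]),
                "pers_q": int(parts[3]),
                "npers_q": int(parts[4]),
                "log_q": int(parts[5]),
            })
    return rows


def backed_up(rows, threshold=1000):
    """Return rows whose recv_q exceeds the (arbitrary, assumed) threshold."""
    return [r for r in rows if r["recv_q"] > threshold]


# Output copied from the session above, before and after the kill.
BEFORE = """node sapno recv_q pers_q npers_q log_q
sup 58715 209705 0 0 0
sup 1432 2 0 0 0
sup 284 0 2 0 0
sup 761 0 0 1 0"""

AFTER = """node sapno recv_q pers_q npers_q log_q
sup 54263 16 0 0 0
sup 1432 2 0 0 0
sup 284 0 2 0 0
sup 761 0 0 1 0"""

if __name__ == "__main__":
    for label, text in (("before", BEFORE), ("after", AFTER)):
        for row in backed_up(parse_mts_summary(text)):
            print(f"{label}: sapno {row['sapno']} recv_q={row['recv_q']}")
```

Run against the two captures above, only the "before" output has a row over the threshold.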
Indy VMUG Slides
Here are the slides from the Indianapolis VMUG featuring a presentation on storage best practices with VMware by vSpecialist Brian Lewis.
Long Overdue Twins Pics Up
Added 3 new albums after a long period of collecting photos. Over 250 new ones for family and friends to enjoy!
Renaming vmnic (pNIC) Names in vSphere
There appear to be several causes of a condition where ESX(i) will assign new names/numbers to (or re-enumerate) your pNICs. A BIOS or firmware update has been known to cause this from time to time, but most recently I ran across this issue when trying to set up a host profile for a small cluster of DL380G6 servers with multiple quad-port gigabit PCI cards. The servers each had a total of 12 pNICs and most of the servers assigned the pNICs the same vmnic numbers. However, there were a couple of problem servers – one of which was missing vmnics 4 & 5 while the other had one of the PCI cards numbered differently than the rest.
So I set out to solve this problem for two reasons – consistency and to allow the application of a host profile. Host profiles will reassign pNICs to vSwitch port groups based on the vmnic number. So, for example, if you create a host profile on an ESX(i) server with vmnic0 attached to your Management network port group and then apply the profile to a second server, it will attach vmnic0 on that second server to your management port group. If you haven’t configured vmnic0 as your management interface then you just killed your connection to that host. Seems fairly straightforward, but when you get into large numbers of hosts with lots of pNICs it can get frustrating when someone (self included) isn’t careful about assigning consistent pNICs to the vSwitches and Port Groups.
Reassigning vmnic numbers is actually not such a tough task. Here’s what you need to do:
- On the ESX(i) host, use the appropriate tool to navigate and edit the /etc/vmware/esx.conf (vi or nano are both available on ESX(i))
- Search the file for “vmnic” and you should quickly find the section where the pNICs are assigned as vmnics
- Modify the vmnics as desired
- Reboot the host
Now that you’ve made the necessary adjustments to your enumerations you can apply your host profile to your servers without losing network connectivity to your hosts. Or you can just sleep at night knowing that all the pNICs of your hosts are systematically named and numbered to match.
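Before editing anything, it can help to see how each host currently maps PCI addresses to vmnic names. The sketch below assumes the relevant esx.conf lines look like /device/<PCI address>/vmkname = "vmnicN", which is how they appeared on the hosts I have looked at; check your own file before trusting it. The sample text and PCI addresses are made up, and the function names are my own.

```python
import re

# Assumed line format: /device/000:002:00.0/vmkname = "vmnic0"
VMKNAME_RE = re.compile(r'^/device/(?P<pci>[^/]+)/vmkname\s*=\s*"(?P<name>vmnic\d+)"')


def vmnic_map(conf_text):
    """Return {vmnic name: PCI address} for every vmnic entry found in the text."""
    mapping = {}
    for line in conf_text.splitlines():
        match = VMKNAME_RE.match(line.strip())
        if match:
            mapping[match.group("name")] = match.group("pci")
    return mapping


def missing_vmnics(mapping, expected_count):
    """List the vmnic names in 0..expected_count-1 that the host does not have."""
    return [f"vmnic{i}" for i in range(expected_count) if f"vmnic{i}" not in mapping]


# Made-up sample resembling a host that skipped vmnic4 and vmnic5.
SAMPLE = """/device/000:002:00.0/vmkname = "vmnic0"
/device/000:002:00.1/vmkname = "vmnic1"
/device/000:007:00.0/vmkname = "vmnic2"
/device/000:007:00.1/vmkname = "vmnic3"
/device/000:010:00.0/vmkname = "vmnic6"
/device/000:010:00.1/vmkname = "vmnic7"
/device/000:004:00.0/vmkname = "vmhba1"
"""

if __name__ == "__main__":
    mapping = vmnic_map(SAMPLE)
    for name in sorted(mapping, key=lambda n: int(n[5:])):
        print(f"{name} -> {mapping[name]}")
    print("missing:", missing_vmnics(mapping, 8))
```

Copy esx.conf off each host, run it through something like this, and compare the results side by side; the odd host out tends to be obvious. The script only reads the text, so the actual edit and reboot are still the manual steps listed above.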
Enabling FT on a Virtual vCenter that has Lazy Zeroed Disks
Enabling Fault Tolerance in vSphere has a few requirements that I won’t go into much here. I’ll assume that you know those requirements or that you can check them out HERE.
I ran across an issue at a client where, at one time, there was a space crunch and quite a few VMs were Storage vMotion’d and reconfigured with thin disks. Once that space issue was resolved everything was reconfigured back to thick disks. However, converting a thin provisioned vmdk back to thick does not automatically eager zero it. By default when you clone, Storage vMotion, or create a new disk it is lazy zeroed. In vSphere 4.1 they introduced an option to eager zero at creation time.
FT requires eager zeroed disks. Normally this isn’t a problem, as you can enable FT on a VM with a lazy zeroed disk as long as you power it off first; enabling FT will automatically convert the disk to eager zeroed. The problem I encountered was that we were trying to enable FT on the vCenter server itself, which had a lazy zeroed disk. FT has to be enabled through vCenter; it cannot be done by logging directly into an ESX(i) host with the vSphere client. So the vCenter VM needed to be powered off for the conversion, but vCenter needed to be running to enable FT.
We came up with two options. I would NOT recommend this first option but will describe it for educational purposes. This first option required us to clone the vCenter server. After cloning we would power off the original, power on the clone, and then enable FT from the new, temporary vCenter. Power off that vCenter and then power on the original with FT enabled. Then we were able to remove the temporary vCenter. This method was a little messy, tedious, and had several caveats. But it did work. Depending on how you have your vCenter database setup you definitely need to be careful as having a cloned vCenter can cause any number of problems. Again, I would NOT recommend doing this.
The second and recommended option is to use the vSphere CLI to eager zero your vCenter’s disk. It still requires that your vCenter be turned off so make the appropriate plans for this maintenance. You can download the vSphere CLI HERE. Note that this is the perl-based CLI not the PowerShell based PowerCLI. The command to run against your lazy zeroed disk is:
vmkfstools.pl --server=<ESX Host (Name or IP)> -i <path to source> <path to destination> -d <format>
The -i parameter clones the disk, so if there is a problem your source disk isn’t touched. For FT, the <format> value to use is eagerzeroedthick. After running you just need to go in and remove the source disk from the VM and add in the new disk using the vSphere client under VM Settings. Boot up the VM with its new eager zeroed disk and enable FT. Assuming you have all the other requirements met you should now be on your way with an FT-enabled vCenter.
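If you have several disks to convert, it can be handy to generate the command line rather than type it each time. The sketch below only builds and prints the command; it does not run anything. The host name and datastore paths are placeholders I made up, and the "[datastore] folder/file.vmdk" path style is what I would expect the vSphere CLI to accept, so verify it against your own environment.

```python
import shlex


def build_clone_command(host, source, destination, disk_format="eagerzeroedthick"):
    """Build the vmkfstools.pl argument list for cloning a disk into a new format.

    Argument order follows the command shown above: -i <source> <destination> -d <format>.
    """
    return [
        "vmkfstools.pl",
        "--server", host,
        "-i", source,
        destination,
        "-d", disk_format,
    ]


if __name__ == "__main__":
    # Hypothetical host and paths, for illustration only.
    cmd = build_clone_command(
        "esx01.example.local",
        "[datastore1] vcenter/vcenter.vmdk",
        "[datastore1] vcenter/vcenter-ezt.vmdk",
    )
    # shlex.join quotes the paths that contain spaces and brackets.
    print(shlex.join(cmd))
```

Building the arguments as a list keeps the datastore paths, which contain spaces and brackets, from being mangled by the shell if you later hand the list to something like subprocess.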
In larger environments an FT-enabled vCenter might not be ideal, but in smaller environments with a SQL Express DB it provides a good measure of resiliency without too much overhead or performance loss.
New Drew & Ethan Pics – 6 months old!
Just uploaded 155 new pics from this month and September. Drew is up to 17 lbs 8 oz and Ethan is 15 lbs 8 oz. They also were able to pick their first pumpkins today and slide down a slide for the first time (with help).
Unable to Install VMware Tools after P2V
So I’ve been P2V’ing a bunch of servers over the last 2 weeks and everything has gone exceptionally well. I haven’t ever used vCenter Converter so extensively, but I must say the latest build as of this writing (4.1) has performed wonderfully! To clarify, I am running vCenter Converter, not the Standalone Converter (4.3).
So I P2V’d more than a dozen servers and then the last machine came around. Sure enough, I ran into an issue that took nearly a week to resolve. Upon first boot of the VM the sysprep ran and did its thing and then I got a BSOD. After a reboot I went right into Windows, so I thought, “Hmm, that’s interesting.”
The first thing I notice is that VMware Tools is not installed even though I had requested it be installed during the sysprep process. So I start the normal procedure and as the wizard goes through I get several errors:
- “Setup failed to install the mouse driver automatically. This driver will have to be installed manually.”
- “Setup failed to install the VMXNet driver automatically. This driver will have to be installed manually.”
- “Error 25028. Setup failed to install the VMCI driver.”
- “Setup failed to install the SVGA driver automatically. This driver will have to be installed manually. Instructions for how to do this will appear at the end of the installation.”
After a week of re-P2V’ing the machine and mucking with VMware Tools I finally started grasping at straws. Google produced few results, almost all of which were related to Fusion anyway. I looked in Device Manager on the physical machine and noticed a problem with a piece of hardware. This “server” happened to be a Windows XP SP3 licensing server which had an Intel 915G motherboard and onboard graphics. The graphics controller had a yellow bang (!) and said it was unable to start. I stumbled over to Intel’s website and downloaded the latest driver, installed, and rebooted. No more yellow bang. <insert subdued “Yay!” here>
The next P2V attempt and subsequent VMware Tools installation went through without error! So the moral to this story is that if VMware Tools will not install double-check your hardware and drivers on your physical machine. I tried all sorts of things like uninstalling all of the ghost hardware (see this post – http://www.lexneck.com/2010/07/29/showing-hidden-devices-server-2008-r2/) but that had no effect. I might even go so far as to say that you might consider updating all your drivers to the latest supported versions before undergoing the P2V process.
This is the only time I’ve run into this issue in the many machines I’ve virtualized, so hopefully it won’t come up for you. But if it does, hopefully Google’s index is your friend and you find my post.
