Thursday, May 27, 2010

Ubuntu ext4 umount Takes a Long Time - BUG

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/543617

Fix Released in 10.04

Saturday, May 8, 2010

ghettoVCB - Back Up ESXi VMs

http://communities.vmware.com/docs/DOC-8760

vSphere RCLI on Windows 7 - "The ordinal 2821 could not be located in the dynamic link library LIBEAY32.dll" Error - Fix

perl.exe - ordinal not found: The ordinal 2821 could not be located in the dynamic link library LIBEAY32.dll


Rename libeay32.dll in C:\Windows\System32 to libeay32.dll.old, then copy the libeay32.dll shipped in C:\Program Files\VMware\VMware Remote CLI\Perl\bin into C:\Windows\System32.
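A minimal sketch of that swap, written as a small Python helper so the two paths are parameters rather than hard-coded (the function name swap_libeay32 is illustrative, not part of any VMware tool; on a default install the paths would be the System32 and Remote CLI Perl\bin folders above, and you would need an elevated prompt):

```python
import shutil
from pathlib import Path

def swap_libeay32(system32: Path, rcli_perl_bin: Path) -> None:
    """Back up the conflicting libeay32.dll, then copy in the DLL
    that ships with the VMware Remote CLI's bundled Perl."""
    old = system32 / "libeay32.dll"
    if old.exists():
        # keep the original around as libeay32.dll.old
        old.rename(system32 / "libeay32.dll.old")
    shutil.copy2(rcli_perl_bin / "libeay32.dll",
                 system32 / "libeay32.dll")
```

The same two steps can of course be done by hand in Explorer or with ren/copy in a command prompt; the point is simply: back up first, then replace.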

Friday, May 7, 2010

vSphere Client could not connect to vCenter Server - Operation Timeout

With ESX it is sometimes necessary to run the command service mgmt-vmware restart if you cannot connect with the VI Client, or if you need the host to re-read the esx.conf configuration file because changes made at the console are not visible in VirtualCenter. The service command is not available on ESXi; the supported method there is to access the Direct Console User Interface (DCUI) and select the restart option, as shown in the first image.
If you have console access or have enabled SSH on your ESXi host, you can run /sbin/services.sh restart to accomplish the same thing, as shown in the second image. This restarts the agents installed in /etc/init.d/, which with a default install includes hostd, ntpd, sfcbd, sfcbd-watchdog, slpd and wsmand. It will also restart the VMware HA agent (AAM) if that has been enabled on the host.
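Conceptually, /sbin/services.sh restart just walks the init scripts and calls each with "restart". A rough sketch of that behavior (restart_all and the directory argument are illustrative here, not part of ESXi; on a real host the directory would be /etc/init.d):

```shell
#!/bin/sh
# Sketch: invoke every executable init script in a directory with
# "restart" - roughly what /sbin/services.sh does on an ESXi host.
restart_all() {
    initdir="$1"
    for script in "$initdir"/*; do
        [ -x "$script" ] && "$script" restart
    done
}
```

Because this restarts hostd and friends, expect the VI Client to briefly lose its connection to the host while the agents come back up.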

Thursday, May 6, 2010

Simplest Motor - Amazing

http://www.facebook.com/video/video.php?v=1401027819272

KVM vs. VMware vSphere ESXi 4.0

It seems like Xen is in its last breathing stages, and the battle is going to heat up between KVM and VMware. Let's not talk about Hyper-V yet.

http://www.howtoforge.com/kvm-and-openvz-virtualization-and-cloud-computing-with-proxmox-ve

I am working on this and will update with some cool results.

vSphere HA Cluster - Good Read

http://www.virtualizationadmin.com/articles-tutorials/vmware-esx-articles/vmotion-drs-high-availability/configure-vmware-high-availability-vmha.html

Kickstart Ubuntu Server (Intrepid Bug)

https://bugs.launchpad.net/ubuntu/+bug/293586

There's a fixed i386 initrd available here:
http://people.ubuntu.com/~cjwatson/tmp/intrepid-busybox-fix/netboot/ubuntu-installer/i386/

As per Marc.
This is a huge bug. I hit it when trying to kickstart Intrepid Ibex yesterday, and I tracked it down exactly to that missing CONFIG_GETOPT_LONG. When will the initrd...

http://archive.ubuntu.com/ubuntu/dists/intrepid/main/installer-$ARCH/current/images/netboot/ubuntu-installer/$ARCH/initrd.gz

...be regenerated to include the new version of busybox ? For those waiting for it to be regenerated, here is a workaround that consists of patching a broken initrd, to make kickstart's "url --url http://xxx" option work again (and only this option, others will remain broken):

$ mkdir i && cd i
$ zcat ../initrd.gz | cpio --make-directories -i
$ patch -Np1 < ../fix-kickseed-url-handler.patch
$ find . | cpio -o -H newc | gzip -9 > ../initrd.gz
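If you are unsure about the unpack/patch/repack cycle, the cpio round trip can be sanity-checked on a scratch tree first. A minimal sketch, using throwaway paths rather than the real initrd (the -H newc format is the same one the Ubuntu installer initrd uses):

```shell
#!/bin/sh
# Round-trip a small tree through a gzipped newc cpio archive,
# the same pack/unpack steps used on the installer initrd above.
set -e
work=$(mktemp -d)
mkdir "$work/tree"
echo "hello" > "$work/tree/file.txt"
# pack: find | cpio -o -H newc | gzip, as in the workaround
( cd "$work/tree" && find . | cpio -o -H newc 2>/dev/null | gzip -9 ) \
    > "$work/initrd.gz"
# unpack: zcat | cpio --make-directories -i
mkdir "$work/out"
( cd "$work/out" && zcat "$work/initrd.gz" | cpio --make-directories -i 2>/dev/null )
cat "$work/out/file.txt"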

Wednesday, May 5, 2010

Exchange 2010 Cluster

http://www.shudnow.net/2009/10/29/exchange-2010-rtm-dag-using-server-2008-r2-%E2%80%93-part-1/

EMC SAN Hardware Failure

Customers of Exchange hosting provider Intermedia.net Inc. had their service interrupted earlier this month due to a hardware failure in Intermedia's EMC SAN, resulting in the hosting company crediting customers for failing to meet the terms of their service-level agreements (SLAs).

Intermedia Chief Operating Officer Jonathan McCormick sent a letter to impacted customers last week explaining the reasons for the April 16-17 outage. McCormick also posted an update last Thursday on Intermedia's official blog.

According to the statement sent to customers:

At approximately 6:15 a.m. PT on Thursday 4/16, a hardware failure occurred on one of the EMC storage area networks (SANs) located in Intermedia's New Jersey data center. The service processor for one of the controller nodes had a failure. This failure caused the entire load for that SAN to be shifted to the service processor on the redundant controller node.

The spare capacity on the single service processor was not enough to handle the entire load of all systems connected to the SAN, which caused a degradation of performance for the reading and writing of data to the SAN. The degradation of performance on the SAN in turn impacted the overall system's ability to process email messages, creating a queue of several hundred thousand messages within the system. The backlog was large enough that it took 32 hours for it to clear after the original event. At approximately 2 p.m. PT on Friday 4/17, all systems were functioning normally and mail delivery was considered to be "real-time."

The statement continues:

* The vendor [EMC] determined that the service processor failure occurred due to a unique bug in the specific version of firmware on the system. This bug caused the service processor to "panic" and automatically take itself off line. As the first corrective action, on Friday 4/17 at 11 p.m. PT, our vendor performed an emergency upgrade to the version of firmware running on the SAN. This newer version of firmware has a fix for the bug that caused the failure we experienced.

* Since the outage, as the second corrective action, we have added additional processing capacity to the SMTP hub farm in this domain. We have also performed performance tuning on the SMTP hubs to guarantee that they are able to more rapidly process a larger than normal queue of messages.

* Over the next several weeks, we will be taking additional corrective actions to make certain that there is enough spare capacity on the SAN to guarantee that it performs without performance degradation in the case of a single hardware failure. An additional SAN is being installed this week and starting as early as this weekend we will begin to migrate a portion of the existing systems to the new SAN. Additionally, we have engaged our SAN vendor to review the performance tuning of our SAN and implement adjustments to increase its overall performance capabilities. These events in tandem will guarantee that the SAN will be able to perform without an impact to the service in the event we experience another individual hardware error.

Intermedia declined to comment on which of EMC's SAN products was involved, and also declined to disclose the firmware level before and after the outage, citing security concerns. An EMC spokesperson also declined to comment.

"We can confirm that the issue impacted customers on two of our 21 domains," wrote Intermedia's spokesperson in an email to SearchStorage.com. "Impacted customers will be proactively credited on 4/23 under the terms of our service level agreement."

According to a blog post by Intermedia customer David Mok, who is chief technology officer at soccer social media website OleOle.com, Intermedia also suffered an outage in March which was attributed to similar causes.

Mok wrote on March 12:

Today I received their formal RFO (Reasons for Outage) letter via email which goes into great details describing why this outage occurred and what steps they are taking to try to prevent a re-occurrence for the same reasons in future. In a nutshell, there was a hardware failure in one of their EMC SAN devices, and this failure occurred in such a way that prevented the device's own in-built fault tolerance mechanisms from allowing the SAN to effectively remain "up" – that is, they are saying this is one of those failures that should not have happened. These devices are designed precisely NOT to fail under such circumstances, but nonetheless it did fail.

Intermedia's letter goes on to describe the actions they are taking along with the hardware vendor to guard against this in future. All very good and well. Now on to the little gem in the letter that I found the most surprising, and from which all technologists with "uptime" responsibility for Software as a Service (SaaS) systems would do well to learn.

Mok declined comment on the most recent outage, saying his email service had not been affected this time.

In its email to customers, Intermedia acknowledged having received "significant constructive feedback regarding our communication throughout the outage…we have developed a new client notification tool that will be used by the Technical Support organization to proactively notify and communicate with clients during a service interruption." Intermedia's spokesperson confirmed there were two outages, but did not disclose further details.