<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Tech Talk &#187; Clustering</title>
	<atom:link href="http://tech.philipsellers.com/category/clustering/feed/" rel="self" type="application/rss+xml" />
	<link>http://tech.philipsellers.com</link>
	<description>Philip Sellers&#039; random thoughts on technology</description>
	<lastBuildDate>Mon, 30 Jan 2012 15:01:26 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>A problem for VMware: If it&#8217;s &#8220;good enough&#8221; then why pay more?</title>
		<link>http://tech.philipsellers.com/2011/10/24/a-problem-for-vmware-if-its-good-enough-then-why-pay-more/</link>
		<comments>http://tech.philipsellers.com/2011/10/24/a-problem-for-vmware-if-its-good-enough-then-why-pay-more/#comments</comments>
		<pubDate>Mon, 24 Oct 2011 11:00:41 +0000</pubDate>
		<dc:creator>Philip</dc:creator>
				<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Virtualization]]></category>
		<category><![CDATA[VMware]]></category>
		<category><![CDATA[XenServer]]></category>
		<category><![CDATA[Competition]]></category>
		<category><![CDATA[VMware Fusion]]></category>
		<category><![CDATA[vSphere]]></category>

		<guid isPermaLink="false">http://tech.philipsellers.com/?p=1353</guid>
		<description><![CDATA[I have often commented to my coworkers that VMware is facing a &#8220;good enough&#8221; problem.  Even though I believe in VMware and their software, I&#8217;ve said there is a day coming soon when competing products will be &#8220;good enough&#8221; and customers will no longer see the need to buy VMware&#8217;s vSphere suite, even though it is [...]]]></description>
			<content:encoded><![CDATA[<p>I have often commented to my coworkers that VMware is facing a &#8220;good enough&#8221; problem.  Even though I believe in VMware and their software, I&#8217;ve said there is a day coming soon when competing products will be &#8220;good enough&#8221; and customers will no longer see the need to buy VMware&#8217;s vSphere suite, even though it is the better and more stable technology.  As a customer, I might put up with an occasional glitch or headache from the competitor if I didn&#8217;t have to pay much higher prices for similar technology.  And given how much Windows is deployed on VMware, there is a serious threat that customers will consolidate it all under Microsoft and their famous Enterprise Agreement as we move forward.  As a customer, I might overlook a feature here or there that does not exist, even if it&#8217;s a feature I would make use of.</p>
<p>I am not a VMware basher, just the opposite actually.  I serve as a primary VMware advocate in my company.   But, my company has not embraced the vCloud vision of VMware.   I am a VCP3 and VCP4 and hope to be a VCP5 in the near future.  I know their products well and use them on a daily basis, both at work (vSphere) and at home (Fusion).  But it is harder for me to make a technical or business case for their product.  The first issue is cost.  The second is the &#8220;good enough&#8221; factor, since we are not using some of the additional value they have added to their product in vSphere 4 and 5.</p>
<p>There are already good cases in the datacenter where all the advanced VMware features don&#8217;t matter, and in those cases my company has already adopted XenServer as a secondary hypervisor.  And XenServer works well, which becomes a problem.  We have proven its ability to run our workloads and consolidate servers.  In some cases, specifically our XenApp servers, the applications we run on it were built with their own high availability and fail-over, so the tried and true VMware features like clustering, HA and DRS do not matter.</p>
<p>In some other ways, VMware is eroding the existing value of their vSphere product suite by pulling features its customers are using.  The primary reason I have heard for doing this is overlap with new products they have purchased or developed.  Guest patch management is an example of this.  Since their Configuration Management product handles patch management, a feature that has existed in vSphere for two generations, Update Manager is now being downgraded to only patch vSphere hosts.  But the kicker in this case is that Configuration Manager does much more than patch management and is priced as such.  We aren&#8217;t seeking the additional features and VMware has priced themselves out of the game for us.</p>
<p>VMware&#8217;s decision on patch management leaves companies with a big void to fill.  But no solution, including the VMware Configuration Manager, fills the void as seamlessly as the Update Manager product that once patched my systems.  Because we have firewalls in between our vCenter and hosts, Update Manager worked well because it used the same vCenter ports for patching.  Configuration Manager and other solutions do not, which is actually kind of a pain.</p>
<p>VMware has cast a vision of the vCloud and added API sets for storage, security and networking that help to pave the path to the cloud for companies.  In our case, we have not embraced the cloud vision and, while we may in the future, today the enhancements added to vSphere have not added real value for us.  Unless a company embraces VMware&#8217;s vision and adopts these technologies, the vSphere suite continues to lose value for that company.</p>
<p>The cloud is a vision I have written about before and I stated then that it&#8217;s one that systems administration groups have little to do with influencing in organizations (<a href="http://tech.philipsellers.com/2011/08/31/the-political-challenge-of-moving-to-the-cloud/">see my original post here</a>).  This is a particular challenge for VMware and its advocates.  It is, frankly, a problem that worries me as a VMware believer and administrator.  But, as the tides change, such as the transition from Netware to Active Directory, we as administrators move where we need to and adapt, like the chameleons that we are.  But I am also wondering what other VMware administrators are feeling.  There was a time my company used all the features in vSphere.  Has VMware left you in a corner while they focus on the cloud?  Talk back to me&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://tech.philipsellers.com/2011/10/24/a-problem-for-vmware-if-its-good-enough-then-why-pay-more/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>TA3105 Long Distance VMotion session recap</title>
		<link>http://tech.philipsellers.com/2009/09/02/ta3105-long-distance-vmotion-session-recap/</link>
		<comments>http://tech.philipsellers.com/2009/09/02/ta3105-long-distance-vmotion-session-recap/#comments</comments>
		<pubDate>Thu, 03 Sep 2009 03:01:15 +0000</pubDate>
		<dc:creator>Philip</dc:creator>
				<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Virtualization]]></category>
		<category><![CDATA[VMware]]></category>
		<category><![CDATA[VMworld]]></category>
		<category><![CDATA[VMworld 2009]]></category>

		<guid isPermaLink="false">http://tech.philipsellers.com/?p=594</guid>
		<description><![CDATA[Long Distance VMotion is by far the best session I&#8217;ve attended and the most exciting news for me of the VMworld week thus far.  The session was a presentation of a research project performed by VMware, EMC and Cisco.  The session presented four options for performing a long distance VMotion using stock vSphere and existing [...]]]></description>
			<content:encoded><![CDATA[<p>Long Distance VMotion is by far the best session I&#8217;ve attended and the most exciting news for me of the VMworld week thus far.  The session was a presentation of a research project performed by VMware, EMC and Cisco.  The session presented four options for performing a long distance VMotion using stock vSphere and existing technologies, well, almost.  Three of the four include technologies currently available.</p>
<p>Why would you want to do a long distance VMotion?  In my case, we have two datacenters, geographically close to one another.  We currently stretch our cluster between the two locations and it allows us to float VMs using VMotion between the two.  The problem is that all storage is presented from our primary datacenter, so losing that datacenter would take the storage with it.  Long Distance VMotion is the notion of having two separate clusters, one in each datacenter, and being able to VMotion between them.</p>
<p>What was really news to me from this session (I&#8217;ll get to what was presented) was that we can present the same data stores to two different clusters and have them recognized on both clusters.  I am pretty sure I tried this way back in the 3.0 days and it failed to work.  This must have been added in 3.5 or 4.0 &#8211; I have not tried in recent years.</p>
<p>So, what was presented?  The three companies worked together to identify and trial a solution to allow for long distance VMotion.  At this point, there is a very narrow set of criteria that must be satisfied for this to work and to be supported.  Much of the restriction comes on the storage side, but the network also presents some problems.  Apparently, everything you need in vSphere is there, if you separate each datacenter into its own set of hosts.</p>
<p><strong>Requirements</strong></p>
<ul>
<li>Distance between datacenters must be less than 200 km</li>
<li>A single instance of vCenter to control both clusters</li>
<li>Each site must be configured in its own cluster &#8211; the cluster cannot be stretched</li>
<li>Dedicated Gigabit Ethernet network</li>
<li>A single VMware Distributed Switch stretched across both clusters (ok, didn&#8217;t know we could do that either)</li>
<li>Same IP subnet configured on both clusters for the VM to run</li>
<li>Cisco DCI (Datacenter Interconnect) type technology &#8211; if you have something similar by another vendor, you&#8217;ll be supported &#8211; this means that you should have a core network that can handle routing traffic to either location for the IP &#8211; the VM networks must be stretched between datacenters</li>
<li>Datacenter storage should be R/W on both sides</li>
<li>VMFS Storage is presented to both clusters &#8211; VMs in each datacenter are run from the local storage LUNs.</li>
<li>Must have less than 5 ms latency and at least 622 Mbps bandwidth (OC-12)</li>
<li>No FT across sites!</li>
</ul>
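<p>As a rough illustration, the measurable items in the list above can be turned into a pre-flight check.  This is only a sketch under stated assumptions: the <code>SiteLink</code> class, its field names and the <code>unmet_requirements</code> function are hypothetical and are not part of any VMware tool; only the thresholds come from the session.</p>

```python
from dataclasses import dataclass

# Thresholds quoted in the requirements list from the session.
MAX_DISTANCE_KM = 200
MAX_LATENCY_MS = 5
MIN_BANDWIDTH_MBPS = 622  # OC-12


@dataclass
class SiteLink:
    """Hypothetical description of the setup between two datacenters."""
    distance_km: float
    latency_ms: float
    bandwidth_mbps: float
    single_vcenter: bool
    stretched_cluster: bool
    ft_across_sites: bool


def unmet_requirements(link):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if link.distance_km >= MAX_DISTANCE_KM:
        problems.append("distance must be less than 200 km")
    if link.latency_ms >= MAX_LATENCY_MS:
        problems.append("latency must be less than 5 ms")
    if link.bandwidth_mbps < MIN_BANDWIDTH_MBPS:
        problems.append("bandwidth must be at least 622 Mbps")
    if not link.single_vcenter:
        problems.append("one vCenter instance must control both clusters")
    if link.stretched_cluster:
        problems.append("each site needs its own cluster, not a stretched one")
    if link.ft_across_sites:
        problems.append("FT across sites is not allowed")
    return problems
```

<p>The remaining items (stretched distributed switch, shared subnet, read/write storage on both sides) are configuration facts rather than numbers, so they would need their own checks.</p>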
<p>Surprisingly, we have most of this configured in our environment and it&#8217;s been the status quo for us for several years.  The biggest difference between our environment and this spec configuration is that we run a stretched cluster to achieve this.  Our datacenters are very close to one another and we only present storage from our primary datacenter so that we don&#8217;t have a split-brain scenario.  But, it does give me new things to think about and talk about with co-workers.  We currently don&#8217;t run two clusters or SRM because we like the flexibility to VMotion between datacenters &#8211; with that now a possibility, we may have something new to investigate&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://tech.philipsellers.com/2009/09/02/ta3105-long-distance-vmotion-session-recap/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>HP Blades, Virtual Connect &amp; ESX considerations</title>
		<link>http://tech.philipsellers.com/2009/03/09/hp-blades-virtual-connect-esx-considerations/</link>
		<comments>http://tech.philipsellers.com/2009/03/09/hp-blades-virtual-connect-esx-considerations/#comments</comments>
		<pubDate>Mon, 09 Mar 2009 13:51:07 +0000</pubDate>
		<dc:creator>Philip</dc:creator>
				<category><![CDATA[Blades]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Virtualization]]></category>
		<category><![CDATA[VMware]]></category>
		<category><![CDATA[ESX]]></category>
		<category><![CDATA[HP]]></category>

		<guid isPermaLink="false">http://tech.philipsellers.com/?p=368</guid>
		<description><![CDATA[I may have already mentioned that one of our projects for the year is to transition our corporate ESX cluster from 2U hardware onto blades.  The process of transitioning does not come without some concern and some caveats moving to the blade architecture.  We feel that blades are a good fit in our case for [...]]]></description>
			<content:encoded><![CDATA[<p>I may have already mentioned that one of our projects for the year is to transition our corporate ESX cluster from 2U hardware onto blades.  The process of transitioning does not come without some concern and some caveats moving to the blade architecture.  We feel that blades are a good fit in our case for this particular cluster (we run several ESX clusters).   Our <a href="http://tech.philipsellers.com/2009/03/05/vmware-view-implemented/">VMware View deployment</a> is our first production ESX workload on blade hardware.  We have learned a few things from this deployment that might be helpful.<span id="more-368"></span></p>
<p><strong>Multiple VLANs on a single NIC</strong><br />
When using Virtual Connect, the default configuration sets VLAN tagging support to &#8220;Tunnel VLAN Tags.&#8221;  The mode is self-explanatory and just means that the &#8220;Ethernet Network&#8221; chosen and assigned to the server profile in Virtual Connect is the only network visible to the blade.  For most blade users, this setting works fine and many ESX deployments might be ok with this configuration.  But for many ESX deployments, people require multiple VLANs to be brought up on the same physical NIC.  The &#8220;Tunnel VLAN Tags&#8221; mode does not allow for this functionality.  </p>
<p>To allow for multiple VLANs on a single NIC, you must log in to Virtual Connect, expand Ethernet Settings, select Advanced Settings and change the VLAN Tagging Support from Tunnel to &#8220;Map VLAN Tags.&#8221;  Map VLAN Tags exposes a Multiple VLANs option in the network assignment drop-down under your server profile.  Once you select Multiple VLANs, a new window appears and you may select as many VLANs as you need exposed to the server.  The ESX host is then required to tag its traffic on these NICs.</p>
<p><strong>Number of nodes per enclosure<br />
<span style="font-weight: normal;">Thou shalt not run more than 4 nodes of an ESX cluster in the same blade enclosure.   No, it&#8217;s not the 11th commandment, but it is an important rule to know.  Thanks to <a href="http://www.yellow-bricks.com/2009/02/09/blades-and-ha-cluster-design/">Duncan Epping&#8217;s article on the topic</a>, we discovered a major implementation hazard for ESX on blade architecture that we didn&#8217;t have to experience first hand.  <a href="http://tech.philipsellers.com/2009/03/05/month-of-silence-because-of-a-blade-enclosure/">We did have our own enclosure failure</a>, however, which made us aware that we could have been affected.  The pitfall is that HA has primary and secondary nodes in a cluster.  An ESX HA cluster can have up to five primary nodes, but never more.  The first five nodes in the HA cluster become primaries and these roles never get reassigned if a primary node fails.  The primary nodes are responsible for directing the HA activities.  So, you don&#8217;t want all your HA primary nodes running in the same enclosure.  If all five HA primary nodes are running in the same enclosure and it fails, you will not get the desired HA restarts on the other ESX nodes in your cluster.  Duncan&#8217;s article gives a great overview of the HA clustering architecture and sheds light on a little known consideration.  </span></strong></p>
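<p>The rule above is easy to audit with a small script.  The following is a minimal sketch, assuming you can export each host&#8217;s enclosure and the order in which hosts joined the cluster; the function names and host names are hypothetical, and the &#8220;first five become primaries&#8221; behavior is taken from the description above rather than queried from HA itself.</p>

```python
from collections import Counter

MAX_PRIMARIES = 5            # per the description above: up to five primaries
MAX_NODES_PER_ENCLOSURE = 4  # the rule of thumb from the post


def overloaded_enclosures(host_to_enclosure):
    """Enclosures holding more cluster nodes than the rule of thumb allows."""
    counts = Counter(host_to_enclosure.values())
    return sorted(enc for enc, n in counts.items() if n > MAX_NODES_PER_ENCLOSURE)


def primaries_share_enclosure(hosts_in_join_order, host_to_enclosure):
    """True when the first five hosts (assumed primaries) sit in one enclosure."""
    primaries = hosts_in_join_order[:MAX_PRIMARIES]
    enclosures = {host_to_enclosure[h] for h in primaries}
    return len(primaries) == MAX_PRIMARIES and len(enclosures) == 1


# Example: six hosts, the first five all in enclosure "enc1".
placement = {"esx1": "enc1", "esx2": "enc1", "esx3": "enc1",
             "esx4": "enc1", "esx5": "enc1", "esx6": "enc2"}
join_order = ["esx1", "esx2", "esx3", "esx4", "esx5", "esx6"]
print(overloaded_enclosures(placement))                  # enclosures over the limit
print(primaries_share_enclosure(join_order, placement))  # the dangerous case
```

<p>Keeping each enclosure at four nodes or fewer guarantees that at least one of the five primaries lives somewhere else, which is why the simpler count check is usually enough.</p>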
<p><strong>Service Console &amp; VMotion<br />
<span style="font-weight: normal;">Perhaps my favorite thing about Virtual Connect and ESX is the ability to creatively configure Service Console and VMotion using just two NICs, while providing redundancy and </span></strong>isolation as needed for these functions.  Let&#8217;s look at the best practices:  </p>
<ol>
<li>Service Console should have two NICs teamed for redundancy of this network link.</li>
<li>VMotion should have its own dedicated NIC for the best performance of VMotion traffic.</li>
</ol>
<p><em>So, what makes Virtual Connect suit this well?   </em><br />
First, Virtual Connect is redundant on the VC-Ethernet side.  We can create a single &#8220;Shared Uplink Set&#8221; with both Service Console and VMotion tagged traffic.  The stacking link between the two VC-Ethernet modules will provide for traffic on NIC0 to be rerouted on Bay 2 if the uplink on Bay 1 is down.  As long as both VC-Ethernet modules are functioning, the stacking links would be utilized.  </p>
<p>Second, we can use ESX NIC Teaming failover settings to keep the traffic separate, except when a failure occurs.   If you lost a VC-Ethernet module, your Service Console and VMotion traffic would be failed over using ESX onto the same NIC on the other VC-Ethernet module.  </p>
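<p>To make the failover behavior concrete, here is a small model of an active/standby teaming policy.  It is a sketch only: the <code>vmnic0</code>/<code>vmnic1</code> names and which NIC is active for which function are assumptions for illustration, not settings taken from our hosts.</p>

```python
# Assumed active/standby order per port group (illustrative only).
TEAMING = {
    "Service Console": {"active": "vmnic0", "standby": "vmnic1"},
    "VMotion": {"active": "vmnic1", "standby": "vmnic0"},
}


def carrier(port_group, failed=frozenset()):
    """Return the NIC that carries a port group's traffic, or None if both are down."""
    policy = TEAMING[port_group]
    for nic in (policy["active"], policy["standby"]):
        if nic not in failed:
            return nic
    return None


# Normal operation: each function has its own NIC.
print(carrier("Service Console"), carrier("VMotion"))
# One NIC (or its VC-Ethernet module) fails: both functions share the survivor.
print(carrier("Service Console", {"vmnic0"}), carrier("VMotion", {"vmnic0"}))
```

<p>With the active and standby roles mirrored like this, the two kinds of traffic only share a NIC while something is broken.</p>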
<p>There are really a lot of options in this space and this is my high-level implementation.  I think it&#8217;s a great solution and can&#8217;t find many trade-offs for this.  In a blade environment where NICs are at a premium, this is a wonderful solution. </p>
<p>(By the way, I didn&#8217;t devise this SC/VMotion configuration &#8211; it&#8217;s something someone else posted, but I can&#8217;t find the original blog post to give you credit&#8230; If it was you, please let me know.)</p>
]]></content:encoded>
			<wfw:commentRss>http://tech.philipsellers.com/2009/03/09/hp-blades-virtual-connect-esx-considerations/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Month of silence, because of a blade enclosure</title>
		<link>http://tech.philipsellers.com/2009/03/05/month-of-silence-because-of-a-blade-enclosure/</link>
		<comments>http://tech.philipsellers.com/2009/03/05/month-of-silence-because-of-a-blade-enclosure/#comments</comments>
		<pubDate>Thu, 05 Mar 2009 13:37:55 +0000</pubDate>
		<dc:creator>Philip</dc:creator>
				<category><![CDATA[Blades]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[My Projects]]></category>
		<category><![CDATA[cluster]]></category>
		<category><![CDATA[ESX]]></category>
		<category><![CDATA[Exchange]]></category>
		<category><![CDATA[fail]]></category>
		<category><![CDATA[failure]]></category>
		<category><![CDATA[Problems]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[support]]></category>
		<category><![CDATA[VMware]]></category>

		<guid isPermaLink="false">http://tech.philipsellers.com/?p=349</guid>
		<description><![CDATA[The past month of my life has been spent dealing with the fall-out over a massive failure of our local blade enclosure.  ]]></description>
			<content:encoded><![CDATA[<p>I can&#8217;t believe it, but it&#8217;s been almost a month since my last post.  And what a month it&#8217;s been around my work.  This has been one of the busiest and most difficult months that I can remember with the company.  I have my hands in several different technologies; VMware and our blades are just two of my primary responsibilities.  Over the past month, though, we&#8217;ve experienced a catastrophic failure of one of our blade enclosures.   The failure has only occurred once, but the fall-out from this has taken almost a month to work out.  And honestly, we&#8217;re still not through working out the kinks.  </p>
<p>Of course, my story has to begin on Friday the 13th&#8230;  Sometime around 9:00am, we started getting calls for both our SQL 2005 database cluster and our Exchange cluster.  After investigation, we found that the active nodes were both in the same enclosure and a third ESX host in the same enclosure was experiencing problems, too.  The problems were affecting both network and disk IO on the blades.  All of our blades boot from SAN, so the disk IO problem had to be a Fibre Channel issue.  </p>
<p>Several hours later, we were finally able to get enough response out of the nodes to be able to force a failover of services for Exchange, shortly followed by SQL 2005.  As I worked with HP support, nothing improved on the affected servers.  HP finally diagnosed a problem with the mid-plane of the enclosure.  </p>
<p>While waiting for the mid-plane to be dispatched to the field service folks, I requested that we go ahead and do a complete power-down on the enclosure and bring it up clean.  This required physically removing power from the enclosure after powering down everything that I could from the onboard administrator.  </p>
<p>After the reboot, everything looked much healthier.  The blades came back to life and everything began operating as expected.  After intense discussions on the HP side, we reseated our OAs and the sleeve that they plug into on the back side of the enclosure.  The net outcome was the same &#8211; everything still operating well.  Neither the OAs nor the sleeve was loose, so we doubted that was the cause.  </p>
<p>One nugget I learned from HP support (please vet this information on your own) is that the Virtual Connect interconnect modules require communication with the onboard administrators (OAs).  I&#8217;m still not sure I fully understand, but HP support did tell us that if VC lost communication to the OA, it&#8217;s possible that it caused our problems.  If this is so, this smells like very, very bad engineering and design&#8230;</p>
<p>Continued investigation on HP&#8217;s part has pointed us back to the original diagnosis &#8211; a faulty mid-plane.  Only by default did we return to that conclusion, however.  This is the only piece of hardware common to the problems.  Our only other conclusion was that this was a very bad &#8220;hiccup&#8221;, which obviously buys us no real peace of mind&#8230;  </p>
<p>So, sometime soon, we will be replacing the mid-plane of our enclosure.   I have, of course, lost some faith in the HP blade ecosystem.  We have plans to migrate our corporate VMware cluster onto blades, as well as some Citrix and other servers.  Losing an enclosure like this has un-nerved those plans.  We were fortunate <span style="text-decoration: line-through;">to have drug our feet </span> to only have 3 blades populated and serving anything at the time this happened.  I will post updates as we move forward&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://tech.philipsellers.com/2009/03/05/month-of-silence-because-of-a-blade-enclosure/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Twas the Night Before New Years, sysadmin style</title>
		<link>http://tech.philipsellers.com/2009/01/03/twas-the-night-before-new-years-sysadmin-style/</link>
		<comments>http://tech.philipsellers.com/2009/01/03/twas-the-night-before-new-years-sysadmin-style/#comments</comments>
		<pubDate>Sat, 03 Jan 2009 05:35:52 +0000</pubDate>
		<dc:creator>Philip</dc:creator>
				<category><![CDATA[Blades]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Microsoft]]></category>
		<category><![CDATA[My Projects]]></category>
		<category><![CDATA[ESX]]></category>
		<category><![CDATA[EVA]]></category>
		<category><![CDATA[HTC]]></category>
		<category><![CDATA[Problems]]></category>
		<category><![CDATA[Storage]]></category>
		<category><![CDATA[support]]></category>
		<category><![CDATA[Virtualization]]></category>
		<category><![CDATA[VMware]]></category>
		<category><![CDATA[Windows]]></category>

		<guid isPermaLink="false">http://tech.philipsellers.com/?p=240</guid>
		<description><![CDATA[Twas the night of new years, and all through the house, not a creature was stirring, not even a mouse.  The little one had passed out, and we&#8217;d put her to bed.  We had all celebrated with Carson, Dick Clark and the rest. Mom in her kerchief and I in my cap, had just settled [...]]]></description>
			<content:encoded><![CDATA[<p>Twas the night of new years, and all through the house, not a creature was stirring, not even a mouse.  The little one had passed out, and we&#8217;d put her to bed.  We had all celebrated with Carson, Dick Clark and the rest. Mom in her kerchief and I in my cap, had just settled in for a long winter&#8217;s nap.  When all of a sudden, I awoke to a clatter, it must be my text paging, I wonder what is the matter?  I spring from my bed and stumble to the Mac, oh man, my VMware at work has gone all to crap.</p>
<p>That&#8217;s how my 2009 started&#8230; about 13 hours later, I finally left work and resumed my long-interrupted nap.   <span id="more-240"></span>We had what seems to have been a storage meltdown behind our VMware farm yesterday.  Our file sharing cluster was also affected and so our few employees who were working on New Years Day, well, weren&#8217;t working at all.  The short version of the story goes like this.  Our scheduled backup process, using EMC Networker, kicked off VCB backups on the ESX 3.5 hosts around 1:30 am.  By 2:00 am, the process was trying to create snapshots on VMs and this caused some sort of meltdown due to SCSI reservations (found the SCSI reservation problem after VMware analysis).  Turns out the HP Insight Agents loaded on our VMware hosts were causing these SCSI reservation issues.  The agents were checking the disks at a consistent interval and we had not upgraded the agents to the latest revision, which was supported with ESX 3.5 &#8211; so not VMware&#8217;s fault &#8211; they have a great KB article about this issue (see KB <a href="http://kb.vmware.com/kb/1005009">1005009</a>).   As an immediate resolution, one of my co-workers removed the HP agents from our hosts and we worked our way through rebooting the entire farm, one host at a time, to remove the SCSI reservations.  I cross my fingers on VCB backups working when they kick off in an hour.  Had this been the only issue, we would have been fine.</p>
<p>Unfortunately, at around 4:30 a.m., while I was unaware, our file sharing cluster began experiencing troubles, too.  And this is where our detective skills have come up short.  We have been sleuthing to find the cause of some weirdness in both our file sharing and Exchange clusters for several weeks, now.  The file share cluster is, dumbly enough, critical in our environment.  Without it, our users&#8217; home directories are inaccessible and, since these home directories are defined in Active Directory, it seemingly hoses up our employees&#8217; workstations.  Things that should otherwise be speedy, say opening a program (any program) or browsing to your local hard drive, become unbearably slow.  Even running applications sometimes lock up as they attempt to access some unknown part of Windows during normal operation.  It brings our entire business operation to a crawl and that&#8217;s unacceptable.  (BTW, if all this sounds familiar, please leave comments or send me an email with suggestions.)</p>
<p>So, what actually happened to our Windows file sharing cluster?  We have an issue where we see the network utilization on the file share cluster drop to nothing, but the cluster nodes still respond to ping and other non-storage related network services &#8211; but, as we found out later in the process, not to anything which needed IO to respond.  After repeated network sniffs, we were seeing that traffic would come to the cluster and be acknowledged, but the node would not start sending data.  The break between the request and data could be as long as 20 or 30 seconds.  And that was consistent with our &#8216;outage&#8217; periods.  So, I decided to fail over the cluster shares from one node to the node that had been &#8216;solving&#8217; the problems in the past.  When I attempted to fail over the shares, they locked and never became accessible again.  After waiting for almost an hour, I rebooted a node trying to clear up the locks and let the other node take control.  That was never possible either.  A reboot of the second node only served to cause it to stall during boot, and never provide me a login screen.  Rebooting the other node, same result.  And then, a lengthy phone call with HP support after driving into the office.</p>
<p>The short version of this is that we are running Windows Server 2003 with SP2.  Apparently, therein lies our problems with a) clustering and b) storport.sys.  The StorPort driver issues are pretty well documented, and a fix for them, in combination with several other hotfixes, is HP&#8217;s recommendation to us.  The hotfixes were released outside of Microsoft&#8217;s normal patch schedule for the large number of customers having issues similar to this.  HP&#8217;s recommendation to us was to install the list of suggested hotfixes for post-SP2 (<a href="http://support.microsoft.com/kb/935640/en-us">Microsoft KB 935640</a>).  My co-workers successfully completed that on the file share cluster this evening without incident.  (Hallelujah!)  I got the call around 10:15 pm with the all clear.</p>
<p>The file sharing problem has been plaguing us for several weeks and we have not been able to deduce the full cause.  We have had theories and as soon as we seem to figure it out or know how it will act, we&#8217;re proven wrong.  At least until yesterday, we hope.  The next week will tell for sure.</p>
<p>I also mentioned issues on our Exchange cluster.  We&#8217;re not actually sure it&#8217;s having issues.  It may only be showing issues at the same time the file share cluster was having issues.  We believe the above issues on file shares are causing lock ups on client machines and their programs, so we are currently thinking Exchange&#8217;s perceived network issues are just the fact that our employees&#8217; Outlook is locked and can&#8217;t connect, so we see what looks like Exchange issues.  But, then again, we&#8217;re not 100% sure.  We still have some detective work to do here.  Where&#8217;s Sherlock Holmes when you need him?</p>
<p>As for VMware, we still have a punch list for a few additional things to do &#8211; putting newer HP agents on the ESX hosts for one.  There are some things we may want to customize here based on feedback from VMware.  We may want to disable the storage monitoring agent on these hosts, but more research is required.  Removing these agents altogether for now is our preferred fix.  So, for the next hour, I just need to keep myself awake and hope that VCB backups will go well tonight.</p>
]]></content:encoded>
			<wfw:commentRss>http://tech.philipsellers.com/2009/01/03/twas-the-night-before-new-years-sysadmin-style/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

