In this blog I plan on answering some of the questions people often ask me about stateless blades, or stateless computing resources. You may not realize it, but this is quite a large subject, so I’ve broken it down into a series so you won’t fall asleep reading a ridiculously enormous blog. I’ll be answering the following questions:
- What exactly are stateless blades?
- How can stateless blades simplify High Availability?
- How can stateless blades simplify Disaster Recovery?
- Can a stateless blade approach prevent hardware vendor lock-in?
- Can a stateless approach simplify hardware upgrades?
- Is it true that I can drive up server utilization with stateless blades?
- Should I care about stateless blades when running virtualized servers?
- Is Cisco’s UCS version of stateless blades really that much better than the rest?
- So are stateless blades for you?
If there is a specific question that you’d like answered that isn’t covered in this series, please feel free to email me at firstname.lastname@example.org
At the end of this series you will have an insight into why this approach can have a dramatic impact on your IT infrastructure, and why you need it for both physical and virtual environments.
What exactly are stateless blades?
Stateless computing resources aren’t a new concept. Heck, the idea is older than me. It’s been used in mainframes for years, where processing resources are assigned to run a task (or operating system). Once that task (operating system) has completed its job, the processing resource (blade) can be assigned to a different task (operating system), or, if there is a hardware failure, a free processing module (blade) can be assigned to restart that task (operating system). Egenera’s Chairman, President and CEO Pete Manca blogged about state almost 3 years before Cisco UCS and their “stateless computing model” was available. In fact, Egenera has been providing this kind of facility with PAN Manager on the BladeFrame for over 11 years. Today we offer these same features on HP, Dell, Fujitsu, NEC, and now IBM hardware platforms, providing customers with the choice they demand.
So, to quickly sum up what state is: state is all the things you can think of that give a server its identity, such as its UUID, the MAC addresses on its NICs, the World Wide Names on its HBAs, remote KVM details, which VLAN a NIC is attached to, the device it boots from, the operating system on that boot device, and the applications associated with that operating system. These things together define its state.
What PAN Manager does is disassociate a server’s state from the physical hardware. In a PAN-enabled environment the UUID, the MAC addresses, the WWNs, the boot order, etc. are all stored in an XML database that contains an XML definition for each server. You’ll often hear this definition referred to as a service or server profile; Egenera refers to it as a pServer. The boot devices required to achieve stateless computing are SAN or iSCSI based storage. This is done so that storage is not tied to any physical blade resource; otherwise that resource would have a state. Doing this allows for simple server provisioning, the ability to re-purpose servers, automated server failover, centralized management, simpler disaster recovery, and reduced physical infrastructure complexity. As Pete said in his blog, “It removes the complexity at its core”.
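To make the idea of separating state from hardware a bit more concrete, here is a minimal Python sketch. It is not PAN Manager’s actual schema or API; the class names, field names, addresses and the fail_over function are all made up for illustration. The point is simply that the profile carries the identity, and the blade is just a slot that a profile can be attached to or moved away from.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ServerProfile:
    """Hypothetical stand-in for a server profile (a pServer in Egenera terms)."""
    name: str
    uuid: str
    mac_addresses: List[str]
    wwns: List[str]
    vlans: List[int]
    boot_lun: str  # SAN or iSCSI target, never a local disk


@dataclass
class Blade:
    """A physical blade: it has a location, but no identity of its own."""
    slot: str
    healthy: bool = True
    profile: Optional[ServerProfile] = None


def assign(profile: ServerProfile, blade: Blade) -> None:
    """Attach a profile to a free, healthy blade."""
    if not blade.healthy:
        raise ValueError(f"blade {blade.slot} is not healthy")
    if blade.profile is not None:
        raise ValueError(f"blade {blade.slot} is already in use")
    blade.profile = profile


def fail_over(failed: Blade, spares: List[Blade]) -> Blade:
    """Move the profile from a failed blade to the first free healthy spare."""
    profile = failed.profile
    if profile is None:
        raise ValueError(f"blade {failed.slot} has no profile to move")
    failed.healthy = False
    for spare in spares:
        if spare.healthy and spare.profile is None:
            failed.profile = None
            spare.profile = profile
            return spare
    raise RuntimeError("no free healthy blade available")


# Example values below are invented for illustration only.
web01 = ServerProfile(
    name="web01",
    uuid="11111111-2222-3333-4444-555555555555",
    mac_addresses=["02:00:00:00:00:01"],
    wwns=["50:00:00:00:00:00:00:01"],
    vlans=[100],
    boot_lun="san-array-1/lun-12",
)
blade1 = Blade(slot="chassis1/slot1")
blade2 = Blade(slot="chassis1/slot2")

assign(web01, blade1)
target = fail_over(blade1, [blade2])
print(target.slot, target.profile.uuid)
```

Because the boot device lives on the SAN and the UUID, MACs and WWNs live in the profile, the spare blade comes up as the same server as far as the network, the storage and the operating system are concerned.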
Stateless blades can dramatically reduce the time it takes to provision a new server. You simply insert the blade, create a new profile or assign an existing one to that blade, and away you go. In fact, you can create a profile ahead of time. You can tell your SAN team what the WWNs for the vHBAs will be, and your networking team can tell you which VLANs will be used. Then, when the physical blade arrives, you assign the profile you made weeks ago to the blade and away you go. This can result in a huge reduction in deployment times for physical servers, and ultimately in the time to market for critical applications. If you want to get really slick you could use SAN based duplication to clone golden images to your boot device, which is what many service providers using PAN Manager actually do. They can literally roll out a new environment in tens of minutes using physical resources!
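The “create the profile ahead of time” workflow can be sketched along the same lines. Again, this is only an illustration in Python under my own assumptions: the address prefixes, the pool generator and the function names are hypothetical and are not how PAN Manager allocates addresses. It shows the profile’s MACs and WWNs being picked before any hardware exists, so the SAN team can be given the WWNs up front, with the blade bound later.

```python
from itertools import count


def address_pool(prefix, start=1):
    """Yield colon-separated addresses: a fixed prefix plus a 3-byte counter."""
    for n in count(start):
        yield f"{prefix}:{(n >> 16) & 0xFF:02x}:{(n >> 8) & 0xFF:02x}:{n & 0xFF:02x}"


# Made-up prefixes for illustration (02:... is a locally administered MAC range).
macs = address_pool("02:00:00")
wwns = address_pool("50:00:00:00:00")


def prepare_profile(name, vlans, boot_lun, nics=2, hbas=2):
    """Build a profile weeks before the blade arrives; no hardware is needed."""
    return {
        "name": name,
        "macs": [next(macs) for _ in range(nics)],
        "wwns": [next(wwns) for _ in range(hbas)],
        "vlans": list(vlans),
        "boot_lun": boot_lun,
        "blade": None,  # nothing physical is attached yet
    }


def san_request(profile):
    """Lines you could hand to the SAN team so the LUN is presented in advance."""
    return [f"{profile['name']}: present {profile['boot_lun']} to {wwn}"
            for wwn in profile["wwns"]]


def bind(profile, blade_slot):
    """When the blade finally arrives, attach the prepared profile to it."""
    profile["blade"] = blade_slot
    return profile


db01 = prepare_profile("db01", vlans=[200, 201], boot_lun="san-array-1/lun-20")
for line in san_request(db01):
    print(line)

bind(db01, "chassis2/slot4")
print(db01["blade"])
```

All the slow, cross-team work (zoning, LUN masking, VLAN assignment) happens against the profile, so the only step left on delivery day is the bind.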
Now, hardware vendors’ views of what state is vary. Some talk about RAID settings, but if you’re configuring RAID settings for a RAID controller you’re using internal disks, and that gives the computing resource a state. Others talk about managing firmware levels. Firmware levels can be important, and I think the ability to manage firmware centrally is a great feature, but it’s not a requirement for stateless computing. In my 8 years with Egenera, using PAN Manager on different hardware platforms, it has never impacted the ability to migrate from one stateless physical computing resource to another. I’ve often failed over from AMD processors to resources using Intel processors without incident.
Some talk about the firmware revs on NICs and HBAs as being critical, but many HBA and NIC vendors supply drivers that use a unified driver model, where the driver contains the firmware for the HBA or NIC and uploads it into the device during driver initialization, thus eliminating potential interoperability issues that cause systems to crash or hang. Also, associating firmware updates with a server profile means that if you want to perform automated failover due to a hardware failure, the failover time will be increased, as the blade’s firmware needs to be rolled forward or back, and this extended period of downtime could have serious financial implications for your business.
One vendor even associates NIC attributes, such as TCP checksum offloading, with a server profile. Again, it’s great that you can do that, but it’s not necessary for server failover or disaster recovery. Typically, if I’m tuning those kinds of settings it’s for a specific reason, and I can do this in the operating system, where part of the server profile’s state resides, and every one of my system admin team knows where to look for those settings regardless of the hardware platform being used.
Additionally, many organizations have standard builds for different types of services, so when they first deploy a server they specify a build for that server in which TCP checksum offloading is enabled.