20110127

Fun with HP Support Europe

Disclaimer: We are happy customers of HP Hardware.

The HP hardware is great, no complaints here, but what about the quality of the HP Support Department?

It's a disaster.

Here's the story:

In November 2010 we had a strange outage of one HP DL385 G5P server: it suddenly froze during operations. As we were running a DRBD device on it, this was really a bugger. The active machine tried to sync the data over the network to the passive machine (the one that froze). As we run DRBD with protocol C, every write has to be acknowledged by the passive node, and since those acks never came, the performance of the active DRBD node went from 100 to 0.

So we halted the faulty machine and transported it from our datacenter to our office for testing.

As written earlier, we hadn't had the time to test this machine until yesterday.

So we installed Ubuntu 10.04 on the machine and ran "apt-get install stress".
Running the "stress" tool was fun, because the machine reacted as we expected, but after one hour of doing nothing, the machine suddenly halted, triggered by the HP iLO management sensors.

So, we tried to power on the machine again, but no. No power. The machine started and stopped immediately.
As we are really trained in guessing hardware faults, we were sure that only the CPUs could be at fault. So we called the HP Support Desk (what else do we have a 24/7/4 care pack for?).
We described the problem and what we thought the cause was, and the guys from the HP Europe Hardware Support desk sent us a technician with a new systemboard, because they were thinking: "It's the systemboard, not the CPUs".

Here we go: as we are happy customers, we believed that.

This morning, the technician came with a new systemboard and replaced it. What happened?
Nothing. The machine doesn't start: it gets power and drops the power.
The technician was surprised and checked all parts and cabling again. He also guessed that the power supplies could be the problem, so we got some spare power supplies from our hardware pool and replaced them. No change. The machine starts and stops immediately.

Mr. HP Technician then "oracled" that the CPUs are at fault, because the first thing the machine does after power-up is to check the CPUs. When the CPUs fail, the machine powers down.

Now, the HP Technician called the HP Hardware Support desk again and tried to order a new systemboard (again), new CPUs and new power supplies, because he didn't know for sure that the CPUs were broken.

During the phone call with his colleague, the guy on the other end of the line told him: "It's not the CPU, it's the SPS Backplane" (that's the board which sits between the power supplies and the systemboard).

He said something like "I don't understand them, why don't they send what I want??" and left our office for another customer.

So, this afternoon the SPS Backplane arrived, and the very same technician from this morning came back one hour later than the backplane.
He replaced the backplane, and what? Right, the machine still doesn't power up properly.

So, back to the beginning. He called again, and now we are waiting for the CPUs (which are coming from Munich and Frankfurt as it seems).

Honestly, we are really experienced with HP hardware and we know, most of the time, what the cause of a failure is. Why can't the HP Europe Support Desk just listen to their customers, especially when they work with their hardware? I mean, it's not the first time this has happened, but during the last months the HP Europe Support has become much worse than ever.

I already talked about that matter with some people from HP Germany and with our distributor, but it seems nobody is interested.

I wonder how other customers of HP Europe / Germany are handled by the HP Europe Support. It can't be that, when you have a 24/7/4 support contract and all, your machine is up and running again only after more than 24 hours.

I'm happy to receive comments if you have experience with HP Europe Support (good or bad), too.

20110121

Puppet Recipe: How to determine the role of a drbd device?

It's not perfect, but this little facter script helps to determine which role a drbd device has.

This is a puppet facter plugin; you should put it somewhere under
/etc/puppet/modules/drbd/plugins/facter/drbd_role.rb

It checks which DRBD config layout you are running: older DRBD setups had their config in /etc/drbd.conf, while newer versions, especially on Ubuntu, have their resource config in /etc/drbd.d/*.res


require 'facter'

# Only define the fact on hosts that have the DRBD userland tools installed.
if File.exist?('/sbin/drbdadm')
        # Newer DRBD packages keep their resources in /etc/drbd.d/*.res,
        # older setups keep everything in /etc/drbd.conf.
        old_drbd = !File.exist?('/etc/drbd.d')

        filename = ""
        unless old_drbd
                # With several .res files the last one found wins.
                Dir.glob('/etc/drbd.d/*.res') do |fileitem|
                        filename = File.basename(fileitem, '.res')
                end
        end

        if old_drbd
                role = %x{drbdadm role `grep "resource" /etc/drbd.conf|awk '{print $2}'`}.chomp.downcase
        else
                # The resource name is passed to drbdadm exactly as configured (no downcase).
                resource_name = %x{grep "resource" /etc/drbd.d/#{filename}.res|awk '{print $2}'}.chomp
                role = %x{drbdadm role #{resource_name}}.chomp.downcase
        end

        Facter.add("drbd_role") do
                setcode do
                        role
                end
        end
end
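
The grep "resource" in the plugin also matches comment lines that merely contain the word. A slightly stricter way to pull the resource names out of a config file could look like the following shell sketch. The function name drbd_resources is made up for this example, and it assumes the usual "resource <name> {" layout with the name on the same line as the keyword.

```shell
#!/bin/sh
# Print the names of all resources defined in the given DRBD config file(s).
# Only lines whose first word is "resource" count; braces and double
# quotes around the name are stripped.
# drbd_resources is a hypothetical helper, not part of the DRBD tools.
drbd_resources() {
        awk '$1 == "resource" { gsub(/[{"]/, "", $2); print $2 }' "$@"
}
```

On a DRBD host the output could then be fed to drbdadm one resource at a time, for example: for r in $(drbd_resources /etc/drbd.d/*.res); do drbdadm role "$r"; done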


20110120

Shell Goodies: Fetching NIC Interfaces with carrier without SED/AWK

As I'm rewriting some parts of the dhcp boot mechanism of live-boot, I needed a way to fetch the network interfaces without using SED/AWK or whatever else could help to parse the "ip -oneline link show" output.
As we somehow don't have sed or awk in our initramfs tools, I scribbled this:

#!/bin/sh
for device in /sys/class/net/* ; do
        if [ -f "$device/carrier" ]; then
                # reading carrier fails while the interface is down, so hide the error
                carrier=$(cat "$device/carrier" 2> /dev/null)
                if [ "$carrier" = "1" ]; then
                        devicename=$(basename "$device")
                        if [ "$devicename" != "lo" ]; then
                                # "interface" holds the MAC address of the device
                                interface=$(cat "$device/address")
                                echo "$carrier of $devicename ($interface) is up"
                        fi
                fi
        fi
done
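
To try the logic outside the initramfs, the loop can be wrapped in a function that takes the sysfs directory as an argument. This is only a sketch: the function name nics_with_carrier and the optional directory argument are my additions for illustration, not part of live-boot. It sticks to shell builtins, so not even cat or basename is needed.

```shell
#!/bin/sh
# Print "<name> <mac>" for every non-loopback interface that has a carrier.
# $1 is the directory to scan (hypothetical parameter, defaults to /sys/class/net).
nics_with_carrier() {
        netdir=${1:-/sys/class/net}
        for device in "$netdir"/*; do
                [ -f "$device/carrier" ] || continue
                # strip the directory part, like basename would
                devicename=${device##*/}
                [ "$devicename" != "lo" ] || continue
                # reading carrier fails while the interface is down, so hide the error
                carrier=""
                read -r carrier 2> /dev/null < "$device/carrier" || true
                [ "$carrier" = "1" ] || continue
                address=""
                read -r address 2> /dev/null < "$device/address" || true
                echo "$devicename $address"
        done
}
```

Called without an argument, nics_with_carrier scans the real /sys/class/net. Pointing it at a directory filled with fake carrier and address files makes it easy to check the behaviour on a machine without the target hardware.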

20110119

8 Months of Hard Work -> Success

First of all great news:

We are now running roughly 350 hosts on Ubuntu Lucid (10.04 LTS) Server flavour, on bare metal (HP rackmounts DL360/DL365/DL380/DL385 from G5 via G5P and G6 to G7, HP BladeServers BL465c G5 and G7 with the Flex10 fabric) and VMware machines.

This was not the case until the last weekend.

In the past, we were running Ubuntu Jaunty (9.04), and that had to change, because 9.04 went EOL around the time Ubuntu Maverick (10.10) came out.

Well, normally it would be easy to follow the non-LTS releases with do-release-upgrade or apt-get update / apt-get dist-upgrade, but during our tests we found some really strange things.
We are running many different services on Ubuntu, and some of them involve DRBD setups. Especially these DRBD setups gave us problems.

First, Heartbeat1/2 no longer existed in 10.04 LTS, so we had to replace all our puppet recipes dealing with HA1/2 with pacemaker ones. This was one of the serious buggers.
Second, while we were test-upgrading from 9.04 to 9.10 to 10.04, we found out that during this update all DRBD devices were horribly broken (we don't know why, but they were, and we had no time to investigate).

Therefore, we decided that we had to totally redeploy our servers from scratch during operational times.

What does this mean:


  1. Setup the whole infrastructure, or update the existing infrastructure to deploy Ubuntu 10.04
  2. Test Deploy VMWare Machines and Bare Metal Test Machines
  3. Test new hardware, especially the BL465C G7 blade servers from HP, because of the new Flex10 Fabric NIC
  4. Test Database Setups with Replications for our Production Services. From MySQL 5.0 to 5.1 many things changed. This was crucial for us, because some of our databases are running under high load (IO, CPU and Memory wise)
  5. Test many pacemaker setups, and write puppet recipes for them (pacemaker + ipvs + ldirectord, pacemaker+drbd+mysql, pacemaker+apache2, pacemaker + bind, pacemaker + postfix etc.)
  6. Test FAI Deployment of Bare Metal


Well, the problem with all that: we only had 8 months of time, and we could not interrupt the daily operations.

Result: Many days with too many hours and a lot of brainfck involved.

At the time we started this adventure, we were 4 team members, and everybody got a share of the work.

My special topic was: Rewrite the FAIManager I wrote in 2008/2009. The result was DC².

I want to spare you the technical details of this adventure, but it was hard work. Especially when you get new hardware which is really untested, and you find problems during network boot setups.
In the last 5 days before the big bang started, I had to replace klibc's ipconfig network setup in the live-initramfs overlay with udhcpc. This was a success, but it cost working time.

Anyhow, last weekend was the high time for us. We started on Saturday, around 10am (UTC+1) and after 36 hours we were finished.

All of our services are redundant. So, we deployed the second line of our machines from scratch. We tested the product on this second line, and when we were sure that everything worked, we switched from the old Ubuntu 9.04 first line machines to the newly deployed Ubuntu 10.04 LTS line.
After the switch we re-checked the product services, so we were really sure that everything worked as before.
After the final test, we started to deploy the first line. On Sunday evening we were then ready to bring up the newly deployed machines as redundancy.

The last action on this Sunday was to drink some beer and smoke a cigar to celebrate our success.

All in all, it was a success: everything worked as expected and the downtime was not more than 30 minutes.

Coming to an end, this project wouldn't have worked out without many people involved.

  1. All OPS team members involved. Without their energy to work day and night this wouldn't have worked out nicely.
  2. All people working for Ubuntu, Debian and especially my dear friends from the FAI project.
  3. A special thanks to Stéphane Graber and the people from the LTSP project, who had already UDHCPd in their initramfs setup, from where I got the idea and parts of the implementation.
  4. The people from the Puppetlabs for their great software, FAI + Puppet are great!
  5. The people from the Qooxdoo Project, this is really a nifty JavaScript framework
  6. The people from the Django Project, the backend application runs with it
  7. David Fischer for his great rpc4django project, really a cool implementation for xmlrpc and json-rpc
  8. The developers of Google's Chromium Browser, Mozilla Firefox and Firebug
  9. Hewlett-Packard for the great hardware

20110111

5 Years to retirement

Oh well, we all know IT Business is not for old people.

As it happens, I turned 40 today, and I'm already thinking about my future at >= 45. What to do?

Doing the Google Recruitment Cycle?

I don't think this is really what I want. In one of my last replies to their HR crew I wrote: "You need young people, and not old people" so no Google for me, really. I mean, it would be interesting and fun to work for Google, but not at my age anymore.

Applying for a job at Canonical?

Honestly, I really like the people working for Canonical, but I can't imagine sitting at home most of the time and doing my work from there.
I need a handful of good and trustworthy people around me when doing my job. Eye2Eye communication is a must in my work life: discussing problems, finding solutions etc. This is what I like most about working for a company. So no Canonical for me.

Having my own business?

Oh well, this would be fun, but without a product?
Hey wait, there is DC² and I think this project has potential. I could imagine that this would be fun, integrating Linux automation systems, to help other companies to maintain their data centers at large.
But there is the problem with the money. I don't have it to start up a company, and I don't think that I'll get it in the next 5 years. And I'm not a fan of business angels or venture capitalists.

But wait...I don't need that much money. I could leave Germany and do my job in another country, and no, not the US or UK or some other developed country.
I'm thinking more of the African continent. Let's see how it is in Cameroon when I'm visiting my in-laws. Maybe that's a way to spend the last 20 years of my life ( ;-) ): helping to build up good IT environments there, educating young and smart Cameroonian (and / or other African) IT youngsters.

Let's see what the next 5 and more years have for me.

Anyways, I turned 40 and there is still a lot to do for me.

It's not that bad...Celebration starts now :)

20110110

Back on track

Happy New Year 2011 Everyone,

so I'm back in the office and also starting to work on Ubuntu, DC² and FAI after my vacation.

For our Zend Lovers, zend-framework 1.11.2 is uploaded right now to Natty and backports will be available too.

So, let's make this year 2011 even more awesome (as Jono would say) than 2010 :)

Happy Hacking, Folks