Dizwell Informatics

News from Nowhere

What were the odds?

In modernising Churchill to work for Oracle 12c and the latest 6.x releases of RHCSL, I’ve encountered a bizarre bug (#19476913 if you’re able to check up on it), whereby startup of the cluster stack on a remote node fails if its hostname is longer than (or equal to) the hostname of the local node.

That is, if you are running the Grid Infrastructure installer from Alpher (6 characters) and pushing to Bethe (5 characters) then the CRS starts on Bethe just fine: local 6 is greater than remote 5. But if you are running the GI installer on Gamow (5 characters) and pushing to Dalton (6 characters) then the installer’s attempt to restart the CRS on Dalton will fail, since now local 5 is less than remote 6. Alpher/Bethe managed to dodge this bullet, of course -but only by pure luck.

The symptoms are that during the installation of Grid Infrastructure, all works well until the root scripts are run, at which point (and after a long wait), this pops up:

Poke around in the [Details] of that dialog and you’ll see this:

CRS-2676: Start of 'ora.cssdmonitor' on 'dalton' succeeded 
CRS-2672: Attempting to start 'ora.cssd' on 'dalton' 
CRS-2672: Attempting to start 'ora.diskmon' on 'dalton' 
CRS-2676: Start of 'ora.diskmon' on 'dalton' succeeded 
CRS-2676: Start of 'ora.cssd' on 'dalton' succeeded 
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'dalton' 
CRS-2672: Attempting to start 'ora.ctssd' on 'dalton' 
CRS-2883: Resource 'ora.ctssd' failed during Clusterware stack start. 
CRS-4406: Oracle High Availability Services synchronous start failed. 
CRS-4000: Command Start failed, or completed with errors. 2017/02/18 10:21:41 
CLSRSC-117: Failed to start Oracle Clusterware stack Died at /u01/app/12.1.0/grid/crs/install/crsinstall.pm line 914.

The installation log is not much more useful: it just documents everything starting nicely until it fails for no discernible reason when trying to start ora.ctssd.

Take exactly the same two nodes and do the installation from the Dalton node, though, and everything just works -so it’s not, as I first thought it might be, something to do with networks, firewalls, DNS names resolution or the myriad other things that RAC depends on being ‘right’ before it will work. It’s purely and simply a matter of whether the local node’s name is longer or shorter than the remote node’s!

The problem is fixed in PSU 1 for 12.1.0.2, but it’s inappropriate to mandate its use in Churchill, since that’s supposed to work with the vanilla software available from OTN (I assume my readers lack support contracts, so everything has to work as-supplied from OTN for free).

The obvious fix for Churchill, therefore, is to (a) either make the ‘Gamow’ name one character longer (maybe spell it incorrectly as ‘gammow’?); or find a ‘D’ name that is both a physicist and only 4 characters long or fewer; or (c) change both names ensuring that the second is shorter than the first.

Largely due to the distinct lack of short-named, D-named physicists, I’ve gone for the (c) option: Churchill 1.7 therefore builds its Data Guard cluster using hosts geiger and dirac. Paul Dirac (that’s him on the top-left) was an English theoretical physicist, greatly admired by Richard Feynman (which makes him something of a star in these parts) and invented the relativistic equation of motion for the wave function of the electron. He used his equation to predict the existence of the positron -and of anti-matter in general, something for which he won a share of the 1933 Nobel prize for physics. Geiger is a frankly much less distinguished physicist whose main claim to fame is that he invented (most of) the Geiger counter and wasn’t (apparently) a Nazi. He gets into the Churchill Pantheon by the skin of his initial letter and not much else, to be honest!

Short version then: Churchill 1.7 now uses Alpher/Bethe and Geiger/Dirac clusters, and both Gamow and Dalton are no more. Quite a bit of documentation needs updating to take account of this trivial change! Hopefully, I should have that sorted by the end of the day. And that will teach me to test all parts of Churchill before declaring that ‘it works with 12c’. (Oooops!)