What were the odds?

In modernising Churchill to work for Oracle 12c and the latest 6.x releases of RHCSL, I’ve encountered a bizarre bug (#19476913 if you’re able to check up on it), whereby startup of the cluster stack on a remote node fails if its hostname is longer than (or equal to) the hostname of the local node.

That is, if you are running the Grid Infrastructure installer from Alpher (6 characters) and pushing to Bethe (5 characters) then the CRS starts on Bethe just fine: local 6 is greater than remote 5. But if you are running the GI installer on Gamow (5 characters) and pushing to Dalton (6 characters) then the installer’s attempt to restart the CRS on Dalton will fail, since now local 5 is less than remote 6. Alpher/Bethe managed to dodge this bullet, of course -but only by pure luck.
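The rule can be sketched as a quick pre-flight check. This is only an illustration of the behaviour described above: the function names are my own invention, and comparing the short (unqualified) host names is an assumption, since I haven't established whether the bug looks at short or fully-qualified names.

```shell
# Succeeds only when the local name is strictly longer than the remote one,
# which is the only combination described above as starting the CRS cleanly.
# Assumption: the short (unqualified) names are what matter.
gi_push_is_safe() {
    local local_name="${1%%.*}"
    local remote_name="${2%%.*}"
    [ "${#local_name}" -gt "${#remote_name}" ]
}

# Print a one-line verdict for a local/remote pair (hypothetical helper).
check_pair() {
    if gi_push_is_safe "$1" "$2"; then
        echo "$1 -> $2: ok"
    else
        echo "$1 -> $2: at risk"
    fi
}

check_pair alpher bethe    # local 6, remote 5
check_pair gamow dalton    # local 5, remote 6
```

Run against the two original pairs, the first is reported as ok and the second as at risk.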

The symptoms are that during the installation of Grid Infrastructure, all works well until the root scripts are run, at which point (and after a long wait), this pops up:

Poke around in the [Details] of that dialog and you’ll see this:

CRS-2676: Start of 'ora.cssdmonitor' on 'dalton' succeeded 
CRS-2672: Attempting to start 'ora.cssd' on 'dalton' 
CRS-2672: Attempting to start 'ora.diskmon' on 'dalton' 
CRS-2676: Start of 'ora.diskmon' on 'dalton' succeeded 
CRS-2676: Start of 'ora.cssd' on 'dalton' succeeded 
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'dalton' 
CRS-2672: Attempting to start 'ora.ctssd' on 'dalton' 
CRS-2883: Resource 'ora.ctssd' failed during Clusterware stack start. 
CRS-4406: Oracle High Availability Services synchronous start failed. 
CRS-4000: Command Start failed, or completed with errors.
2017/02/18 10:21:41 CLSRSC-117: Failed to start Oracle Clusterware stack
Died at /u01/app/12.1.0/grid/crs/install/crsinstall.pm line 914.

The installation log is not much more useful: it just documents everything starting nicely until it fails for no discernible reason when trying to start ora.ctssd.

Take exactly the same two nodes and do the installation from the Dalton node, though, and everything just works -so it’s not, as I first thought it might be, something to do with networks, firewalls, DNS name resolution or the myriad other things that RAC depends on being ‘right’ before it will work. It’s purely and simply a matter of whether the local node’s name is longer or shorter than the remote node’s!

The problem is fixed in PSU 1 for 12.1.0.2, but it’s inappropriate to mandate its use in Churchill, since that’s supposed to work with the vanilla software available from OTN (I assume my readers lack support contracts, so everything has to work as-supplied from OTN for free).

The obvious fix for Churchill, therefore, is to either (a) make the ‘Gamow’ name one character longer (maybe spell it incorrectly as ‘gammow’?); (b) find a ‘D’ name that is both a physicist and only 4 characters long or fewer; or (c) change both names, ensuring that the second is shorter than the first.

Largely due to the distinct lack of short-named, D-named physicists, I’ve gone for the (c) option: Churchill 1.7 therefore builds its Data Guard cluster using hosts geiger and dirac. Paul Dirac (that’s him on the top-left) was an English theoretical physicist, greatly admired by Richard Feynman (which makes him something of a star in these parts), who devised the relativistic equation of motion for the wave function of the electron. Dirac used his equation to predict the existence of the positron -and of anti-matter in general, something for which he won a share of the 1933 Nobel prize for physics. Geiger is a frankly much less distinguished physicist whose main claim to fame is that he invented (most of) the Geiger counter and wasn’t (apparently) a Nazi. He gets into the Churchill Pantheon by the skin of his initial letter and not much else, to be honest!

Short version then: Churchill 1.7 now uses Alpher/Bethe and Geiger/Dirac clusters, and both Gamow and Dalton are no more. Quite a bit of documentation needs updating to take account of this trivial change! Hopefully, I should have that sorted by the end of the day. And that will teach me to test all parts of Churchill before declaring that ‘it works with 12c’. (Oooops!)

Recompile with -fPIC

Let me start by wishing a happy New Year to all my readers, complete with fireworks from our local council’s display!

And then let’s swiftly move on to the bad news!

If you are interested in installing Oracle onto non-approved Linux distros, you are very soon going to have to contend with this sort of error message:

/usr/bin/ld: /u01/app/oracle/product/12.1.0/db_1/lib//libpls12.a(pci.o): relocation R_X86_64_32 against `.rodata.str1.4' can not be used when making a shared object; recompile with -fPIC

This will be found in the Oracle installer’s log immediately after the “linking phase” of the Oracle installation starts.
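If you'd rather not scroll through the log by eye, something like the following sketch will pull out the offending objects. The function name is hypothetical and the log file is passed in as an argument, because exactly where your installActions log lives depends on your inventory location.

```shell
# List every archive(member) that the linker rejected for not being
# position-independent, as recorded in the given installer log.
list_non_pic_objects() {
    grep 'recompile with -fPIC' "$1" |
        sed -n 's/^.*ld: \([^:]*([^)]*)\): relocation .*$/\1/p' |
        sort -u
}

# Usage (the path is a placeholder, not a real location):
#   list_non_pic_objects /path/to/installActions.log
```

For the error shown above, that would report the libpls12.a(pci.o) member and nothing else.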

Unfortunately, the error message dialog that appears at this point looks like this:

…and that particular error message has long been familiar from 12.1.0.1 installs on assorted distros. The workarounds then were to add various compilation flags to assorted makefiles.

But in this case, the graphical error dialog deceives: for a start, this is happening on 12.1.0.2, and although the dialog text is the same as back in the 12.1.0.1 days, the underlying cause is completely different. It’s only when you inspect the installActions log that you (eventually!) see the error text I showed above, which tells you that this is no “ordinary” compilation problem.

Welcome to the world of position-independent code.

Putting it as simply as I know how, the basic idea of position-independent code is that it allows execution of code regardless of its absolute memory address. It’s thus a ‘good thing’, on the whole.

Trouble is, if objects within the code you’re trying to compile haven’t themselves been compiled to be position-independent, then you aren’t yourself allowed to compile the code that references them into shared libraries.

As the error message above says, since “pci.o” (an object packaged inside the libpls12.a archive) isn’t position-independent, the linker can’t build it into a shared object. Note that the error message does not mean that your own attempt to link against libpls12 should use -fPIC: if it meant that, you could do something about it. No: it’s telling you that pci.o was compiled by Oracle without -fPIC. Only if they re-compile that object with the -fPIC compiler option switched on would the linking step then succeed.

If you’re best mates with Larry, then, perhaps you’ll be able to get him to do the necessary recompilations for you! Mere mortals, however, are stuck with the unhappy fact that vast swathes of the Oracle 12c code-base have not been compiled to be position-independent… and there’s nothing you can do to fix that when your version of gcc insists that they should be.

The problem doesn’t manifest itself on all distros: Ubuntu 16.04, for example, has no problem installing 12c at all (apart from the usual ones associated with not using a supported distro, of course!) But Ubuntu 16.10 does have this fatal problem. Similarly, Debian 8 is fine with Oracle 12c, but Debian 9 (the testing branch, so not yet ready for mainstream release) fails. And whereas Manjaro didn’t have a problem earlier in the year when I first released my mercury pre-installer script, it does now.

This, of course, gives us a clue: there’s clearly some component of these distros which is being upgraded over time so that later releases fail where earlier ones didn’t. So what’s the upgraded component causing all the trouble?

Perhaps unsurprisingly, given that it’s a compilation error that shows you there’s a problem in the first place, it turns out that the gcc compiler is the culprit.

If you do a fresh install of Ubuntu 16.04 (the long-term support version, so still very much current and relevant), whilst making sure NOT to update anything as part of the installation process itself, issuing the command gcc -v will show you that version 5.4.0 is in use. Do the same thing on Ubuntu 16.10, however, and you’ll discover you’re now using gcc version 6.2.0.

A fresh Debian 8.6 install, subjected to an apt-get install gcc command, ends up running gcc version 4.9.2. The same thing done to a fresh Debian 9 install results in a gcc version of 6.2.1.

Manjaro is a rolling release, of course, so its software components are forever being incrementally upgraded: it makes finding out what gcc version was in use at the start of the year rather tricky! So I don’t have hard evidence for the gcc version shift there -but my main desktop is currently reporting version 6.2.1, so I’ll stick my neck out and lay odds that, had I checked back in January 2016, I would have found it to be around version 5.something.

In short, for all three distros currently under my microscope, a shift from gcc 4- or 5-something to 6-something has taken place… and broken Oracle’s installation routine in the process.
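As a rough rule of thumb, that observation can be turned into a check like the one below. It's a sketch only: the function names are mine, and the "major version 6 or later" threshold is just the pattern seen on these three distros, not a documented boundary (what a distro does with its own gcc build can matter too).

```shell
# Extract the major version from a string such as "6.2.0" or "5".
gcc_major() {
    printf '%s\n' "${1%%.*}"
}

# Say whether a gcc version falls in the range that broke the Oracle
# linking phase on the distros tested in this post (assumed threshold: 6).
gcc_verdict() {
    if [ "$(gcc_major "$1")" -ge 6 ]; then
        echo "gcc $1: expect the -fPIC linking error"
    else
        echo "gcc $1: should link as before"
    fi
}

gcc_verdict 5.4.0    # Ubuntu 16.04
gcc_verdict 6.2.0    # Ubuntu 16.10
# On a live system: gcc_verdict "$(gcc -dumpversion)"
```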

It means that all distros will run into this compilation problem sooner or later, as they upgrade their gcc versions. Expect Fedora to keel over in short order, for example, when their version 26 is released next April (assuming they go for a late-release version of gcc 6.something, which I expect they will). No doubt we’ll have all moved on to installing Oracle 12c Release 2 by then, which is probably suitably position-independent throughout… so maybe no-one will ever have to worry about this issue again. But in the meantime… the constantly changing nature of gcc is a problem.

So, what’s to be done if you want Oracle 12c installed on these distros with their fancy new gcc versions? Nothing I could really think of, except to ensure that the old, functional versions of gcc and related development tools are installed …and that can be easier said than done!

On Debian 9 (‘testing’), for example, instead of just saying apt-get install gcc, you need to now say apt-get install gcc-5, which ensures the ‘right’ compiler version is installed, with which the Oracle installer can live. Thus can this screenshot be taken:

…which shows me happily querying the EMP table on 12.1.0.2 whilst demonstrating that I’m running Debian Testing (codenamed “Stretch”). That’s only possible by careful curating of your development tool versions.

The same sort of ‘install old gcc version and make it the default’ trick is required to get 12c running on Ubuntu 16.10 too:

-though I had to specifically install gcc-4.9 rather than “gcc-5”, since the compilation error still arose when ‘gcc-5’ was installed. These things get tricky!
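For the record, here is one way the ‘install an old gcc and make it the default’ step might be scripted on Debian-family distros. It is a hedged sketch rather than a record of exactly what I typed: using update-alternatives to switch the default is an assumption (it's one common mechanism, not the only one), the priority of 100 is arbitrary, and the function merely prints the commands so you can inspect them before running them as root.

```shell
# Print (without running) the commands that install an older gcc and
# register it as the default compiler. The argument is the package
# version suffix, e.g. 5 for Debian 9 or 4.9 for Ubuntu 16.10.
print_gcc_pin_commands() {
    local ver="$1"
    echo "apt-get install gcc-${ver}"
    echo "update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-${ver} 100"
    echo "gcc -dumpversion"
}

print_gcc_pin_commands 5
```

The final gcc -dumpversion is there purely so you can confirm that the older compiler really is the one now being picked up before you launch the Oracle installer.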

Anyway: there it is. Gcc’s constant version increments create havoc with Oracle 12c installations. Coming to a distro near you, soonish!

Fun Fedora 24

Just as my playing with the new Linux Mint release begins, so the Fedora team finalise a new version of their distro: Fedora 24 was released on 21st June.

It’s still very blue; it’s still very Gnome-y and therefore pretty awful as far as I’m concerned and I wouldn’t personally touch it with a feathered hat-band, let alone a bargepole.

But it’s out and therefore my Bogart preinstaller script, which makes Fedora a suitable platform for running Oracle Enterprise Edition, needs a run in the park to make sure it still works with the new version. Happily it does without any substantial changes at all.

However, I took the opportunity to do two things with Bogart. One was to remove its ability to prepare for an 11g installation. I know 11.2.0.4 is still supported, but you can’t get hold of that without a support contract; and if you’ve got a support contract, you won’t likely be wanting to run Oracle on an unsupported platform like Fedora! Meanwhile, any other version of 11g you can get your hands on has long-since been de-supported… so Bogart is now 12c only.

And that means, two: I’ve re-written the Oracle-on-Fedora article to reflect its new only-12c-ness.

The revised article is here, and the updated Bogart preinstaller script is here.

Churchill on Windows?!

I’ve had many requests over the years to repeat my ‘Churchill Framework’ on Windows, “Churchill” being my mostly-automated way of building a virtual RAC using Linux as the operating system of choice.

I’ve always refused: if you want a desktop RAC on your Windows PC, why not just deploy Churchill ‘proper’ and have three virtual machines running CentOS? It’s a RAC, and it’s still “on” Windows, isn’t it?!

Well, of course, that wasn’t quite the point my correspondents were making. They wanted a desktop RAC running on top of purely Windows operating systems. They aren’t Linux users, and they’re not interested in working at a command line. Could I please oblige?

Again, I’ve always said no, because Windows costs lots of money. It’s easy to build a 3-node or a 6-node setup in Linux, because you aren’t paying $1000 a pop every time you install your operating system! It seemed to me that RAC-on-Windows was a nice idea (I had it working back in 2001 with 9i on Windows 2000 after all), but it wasn’t very practical as a learning platform.

Happily for my correspondents, I’ve now changed my view in that regard. All the Windows-based would-be DBAs of my acquaintance are working for companies that supply them with MSDN subscriptions. And Microsoft’s Technet evaluation options allow even people with no MSDN access to download and use Windows Server 2012 and beyond for free, for at least 6 months.

So I’ve given in. There’s now available a new article for doing Desktop RAC using nothing but Windows. It bears a passing resemblance to ‘proper’ Churchill: there are three servers to build, with one acting as the supplier of shared storage and needed network services to the others. There’s even the use of iSCSI to provide the virtual shared storage layer. But it’s about as non-Churchill as it gets, really, because everything is hand-built… which explains the enormous number of screenshots and the overall length of the article!

Oracle 12.1.0.2 on Linux Mint 17.2

I am casting my eyes around for a new desktop operating system (yet again!). I’m currently running Windows 10, and whilst it’s OK, there are some privacy issues I have concerns about. So, though I’m happy-ish with the current status quo, I am doing some experiments on the side for an alternative.

Linux Mint is right up there with the best of them as a suitable desktop OS. In the past, I’ve run Linux Mint Debian Edition (LMDE), but this time I thought I’d take the ‘standard’ version for a run: it’s based on Ubuntu.

Of course, getting Oracle 12c running on any form of non-Red Hat/non-CentOS distro can be tricky, and although my old Gladstone script used to automate it for a number of ‘peripheral’ distros, I haven’t maintained it for a while.

So I’ve written a new automation script, stripping out a lot of the complexity from the original gladstone script in the process. The new script configures a fresh 64-bit Linux Mint 17.2 desktop for an Oracle 12.1.0.2 64-bit database installation, making the user that runs it the “oracle user” and owner of the resulting Oracle software installation. By running it, a “linking-error-fix.sh” script is written to the user’s desktop, for later use at the point when the Oracle software installation starts to throw up linking errors.

Installing Oracle on LM17.2 thus becomes a matter of downloading and unpacking the software from Oracle and the oracleonmint.sh script from me, double-clicking the shell script, supplying the root password and waiting for a lot of prerequisite software to be downloaded from the standard Linux Mint repositories. When that’s all finished, you’ll be prompted to reboot.

Once your PC is back up and running, you just launch Oracle’s own runInstaller as usual, click [Next] lots of times, and wait for the inevitable linking errors to occur once the installer tries to build Oracle binaries.

At that point, you just double-click the linking-error-fix.sh script that should be sitting on your desktop, click [Retry] in the Oracle installer …and wait for the installation and database creation process to complete. Update: There’s now a full-blown article documenting what you do, step by step.

No messing around with editing various obscure compiler files by hand: just run the first shell script to create the second; and run the second when the Oracle installer throws up its first linking error.

I’m running out of prime ministers these days, so this new script doesn’t have a fancy name. It’s just my “Oracle on Mint” script :-) It has been tested on all three of the Cinnamon, Mate and KDE versions of Linux Mint 17.2. There are no differences in how it runs on any of them.

Whether or not I do end up ditching Windows 10 for LM17.2, I can’t yet say: but being able to run Oracle on LM17.2 certainly makes the idea of a transition a whole lot more feasible.

Oracle 12c and Salisbury

Version 1.04 of Salisbury is now available. It contains two key enhancements over the previous version: (1) it automatically initialises hard disks, even when they contain no previous partition information; and (2) it works to create standalone, RAC or RAC+Data Guard Oracle Version 12c setups.

I have not updated the Salisbury home page yet, though, to link to the new release (this post is the only place to do so at the moment). That’s because I have yet to update all the associated articles to reflect a bit of syntax-tweaking I’ve had to introduce. Once I do that, I’ll make 1.04 available from the “proper” place.

In the meantime, here’s a quick explanation of that syntax change, brought about because of a silly design flaw I introduced to begin with.

When you’re setting up a Salisbury RAC, you probably and usually want the Oracle software copied across to the first node, but not to the second (because the second node doesn’t need it: it gets the software ‘pushed’ to it during the Oracle installation anyway, from the first, as part of the standard RAC installation process). To accomplish this, I originally had you say ORAVER=1120x on the bootstrap line when building your first node, and ORAVER=NONE when building the second.

Even though you said ORAVER=NONE, I still set up paths and environment variables which are correct for running Oracle 11g …because that was the only version of Oracle then available.

You now see the problem, I hope! Saying ORAVER=NONE certainly tells me you don’t want the Oracle software copied to your new server… but now I don’t know whether I should set paths and variables to expect, eventually, an 11g or 12c installation. The arrival of a new Oracle version creates an ambiguity that using one bootstrap parameter cannot overcome.

The solution was to invent a new bootstrap parameter: FILECOPY=y or FILECOPY=n. It does what it says on the lid: a value of “y” means you do want the Oracle software copied from Salisbury to the new server’s /osource directory. A value of “n” means you don’t. Meanwhile, ORAVER changes meaning ever-so-slightly: it now says what version of Oracle you intend to run, regardless of whether the installation software is to be copied to the new server or not.

In other words, for the first node of a new RAC, you’d say something like:

...oraver=12101 filecopy=y

…and for the second node, you’d say:

...oraver=12101 filecopy=n

This applies to 12c installations and to 11g ones, equally well. Technically, you can still say “ORAVER=NONE”, but this now means you don’t intend to run Oracle at all, so no directories or environment variables associated with running Oracle will be created for you at all. If you’re building 11g RACs using Salisbury, you will need to remember this new need to specify two parameters where one previously sufficed.
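To make the new semantics concrete, here is a small sketch of how the two parameters combine. It is not Salisbury's actual code: the function name and the output wording are invented for illustration, and treating the parameters as case-insensitive (and a missing FILECOPY as "n") are my assumptions for the purposes of the example.

```shell
# Interpret the oraver/filecopy part of a bootstrap line.
# Prints: "<oraver> <copy|nocopy> <oracle-env|no-oracle-env>"
interpret_bootstrap() {
    local oraver="" filecopy="" tok key val
    for tok in $1; do
        key="$(printf '%s' "${tok%%=*}" | tr '[:upper:]' '[:lower:]')"
        val="$(printf '%s' "${tok#*=}" | tr '[:upper:]' '[:lower:]')"
        case "$key" in
            oraver)   oraver="$val" ;;
            filecopy) filecopy="$val" ;;
        esac
    done
    local copy="nocopy" env="oracle-env"
    if [ "$filecopy" = "y" ]; then copy="copy"; fi
    # ORAVER=NONE now means "no Oracle at all": no paths or variables set up.
    if [ "$oraver" = "none" ]; then env="no-oracle-env"; fi
    echo "${oraver:-unset} $copy $env"
}

interpret_bootstrap "oraver=12101 filecopy=y"   # first node
interpret_bootstrap "oraver=12101 filecopy=n"   # second node
interpret_bootstrap "ORAVER=NONE"               # no Oracle at all
```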

Other than that slight change to bootstrap options, everything else remains as it was. In particular, the “speed keys” still work for 12c, just as they did for 11g, so “sk=1” builds you a node called “alpher” with IP address 192.168.8.101, “sk=2” builds you “bethe” on 192.168.8.102, and so on.

Of course, you will need to upload 12c software to the Salisbury server before you can build subsidiary Salisbury nodes at all: Oracle themselves made a change here by making the Grid Clusterware come in two zip files instead of one.

As before, you are required to change the names of the downloaded files before Salisbury can make use of them. In the case of 12c, you will need to end up with files named:

  • oradb-12101-1of2.zip
  • oradb-12101-2of2.zip
  • oragrid-12101-1of2.zip
  • oragrid-12101-2of2.zip
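The renaming is easily scripted. In the sketch below, the function name is hypothetical and the downloaded file names (linuxamd64_12c_database_1of2.zip and friends) are an assumption about what OTN hands you, so check them against what actually landed in your downloads folder.

```shell
# Map a downloaded 12c zip file name to the name Salisbury expects.
# Assumed input names end in database_Nof2.zip or grid_Nof2.zip.
salisbury_name() {
    local base="${1##*/}"
    case "$base" in
        *database_[12]of2.zip) echo "oradb-12101-${base##*_}" ;;
        *grid_[12]of2.zip)     echo "oragrid-12101-${base##*_}" ;;
        *) echo "unrecognised: $base" >&2; return 1 ;;
    esac
}

# Usage, from the directory holding the downloads:
#   for f in *of2.zip; do mv "$f" "$(salisbury_name "$f")"; done
salisbury_name linuxamd64_12c_database_1of2.zip
salisbury_name linuxamd64_12c_grid_2of2.zip
```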

Under the hood, as I explained in my last post, I’ve had to relax the NFS settings to be “insecure” so that 12c’s propensity to use Direct NFS doesn’t cause the database creation process to blow up. This new setting back-applies to 11g installations, too -not that you’d notice.

As I say, once I get a chance to update the doco linked to the home page, I’ll link to version 1.04 from there, too. In the meantime, this post will be the only place to link to it. Have fun!

Oracle 12c and NFS

Here’s a little something that can trip you up if you’re not expecting it. The standard NFS export options (the ones used by, for example, Salisbury) expect I/O requests to arrive from ports lower than 1024. However, the new Oracle 12c defaults to using Direct NFS -which uses ports above 1024.

The result is that if you are using the Oracle Universal Installer to create a starter database on an NFS mount, by default, the thing will fail with nasty-looking ORA-17500 and ORA-17503 errors (the message text will suggest that it’s not able to open various files).

Happily, the fix is to add the insecure option to the end of your various export options in the /etc/exports file on the NFS server itself. The next release of Salisbury will be doing this automatically. Secure NFS is obviously there for a reason, but when it trips up your laboratory-only 12c RAC installs, “insecuring” it is the kindest option!
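A minimal sketch of that edit follows. The filter name, the export path and the network in the example are all made up for illustration; the function simply reads exports lines on standard input and appends insecure to every option list that doesn't already mention it, leaving comment lines alone.

```shell
# Append "insecure" to each parenthesised option list in exports-style
# input, skipping comment lines and lines that already contain it.
add_insecure() {
    sed -e '/^[[:space:]]*#/b' -e '/insecure/b' -e 's/)/,insecure)/g'
}

# Example with a hypothetical export line:
printf '%s\n' '/srv/nfs/oradata 192.168.8.0/24(rw,sync)' | add_insecure

# To apply for real (as root), write the filtered output to a new file,
# check it, copy it over /etc/exports, then re-export with: exportfs -ra
```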

DB Express – Mystery Solved. Problems remain.

In my defence, it’s a new feature, my eyes are bad and I’m propped up in bed feeling like death cooled down a bit. It is also true that we all make mistakes. Thus it is that I am happy to report that there is a difference between

exec dbms_xdb_config.sethttpport(5500);

…and…

exec dbms_xdb_config.sethttpsport(5500);

Still can’t spot the difference? Well, the second command adds a secure http port to your database, because it’s got an extra ‘s’ in the middle of it. This will explain why, though I was able to manually add DB Express to my response-file-created databases, I could only reach it by typing in the URL http://192.168.8.101:5500/em. I’d inadvertently used the first command to add in DB Express capability, so I was only able to use unsecured http addresses.

Functionally, you can see the difference when you check the listener’s status. If you issue the first command and then check the listener, you’ll see something like this:

Listener Log File         /u01/app/oracle/diag/tnslsnr/alpher/listener/alert/log.xml
Listening Endpoints Summary...
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=alpher)(PORT=1521)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=alpher)(PORT=5500))(Presentation=HTTP)(Session=RAW))
Services Summary...
Service "orcl" has 1 instance(s).

And if you issue the second and then go check the listener, you’ll see this:

Listener Log File         /u01/app/oracle/diag/tnslsnr/alpher/listener/alert/log.xml
Listening Endpoints Summary...
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=alpher)(PORT=1521)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcps)(HOST=alpher)(PORT=5500))(Security=(my_wallet_directory=/u01/app/oracle/admin/orcl/xdb_wallet))(Presentation=HTTP)(Session=RAW))
Services Summary...
Service "orcl" has 1 instance(s).

The presence or absence of a mention of a wallet credential is the difference between a secured and unsecured http connection to the database.
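If you want to know which flavour you ended up with before opening a browser, the listener output can be checked mechanically. This is a sketch under the assumption that the endpoint lines look like the ones shown above; the function name is my own.

```shell
# Read listener status text on stdin and report how the given port is
# exposed for HTTP presentation: "https", "http" or "none".
dbexpress_scheme() {
    local port="$1" line
    line="$(grep "(PORT=${port})" | grep 'Presentation=HTTP' | head -n 1)"
    case "$line" in
        *PROTOCOL=tcps*) echo "https" ;;
        *PROTOCOL=tcp*)  echo "http" ;;
        *)               echo "none" ;;
    esac
}

# Usage on a database server:
#   lsnrctl status | dbexpress_scheme 5500
```

Whatever it reports is the scheme to put in front of the address when you type the /em URL into your browser.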

The fact that I have to add anything at all to a database created from a response file that was generated after having explicitly asked for DB Express integration is, however, still a bug in my book. But the fact that I used the wrong form of the command to add it -well, that is all down to me!

DB Express is still problematic in other ways, too. Here she is in Windows Internet Explorer 10:

And here she is in Google Chrome (version 593792 or so):

Spot the difference? Allow me to zoom in a fraction:

The blue background behind the ‘Configuration’, ‘Storage’, ‘Security’ and other menus is missing in Internet Explorer 10, leaving you with nothing but mysterious icons to navigate by. (You can hover over them to get a bit of colour and text back, but it’s a game of hit-and-miss at that point). Firefox, like Chrome, is happy to display the blue background.

So, congratulations, Oracle Corporation, on managing to produce a web tool that refuses to display correctly in the latest version of the one browser which still commands a majority share of the entire browser market! And you’ve only had six years to get it right! (Oh, and thanks for all the Flash, too!)

Another issue you might come up against:

I managed to get myself in this trap by building a database with DB Express on a virtual machine, visiting it in my browser successfully, and then re-building the VM with the same IP address as before. Every time I try to visit DB Express now, it generates this error. If I click ‘OK’ on that dialog, I get this:

It’s prompting for SYS’s password, because that’s who I tried to log on to the main DB Express as… but it’s not actually wanting that account’s credentials at all.

This is a long-standing problem with XDB applications (see, for example, this thread on OTN… none of the suggestions mentioned there actually fix this problem, but you get the point that this is not an unknown situation!)

Best suggestion for a workaround I can come up with right now: switch to using a different browser. Quite what you switch to when you’ve managed to lock yourself out like this four or five times, I haven’t yet worked out… but I’m lining up downloads of Opera, Avant Browser Ultimate, Comodo IceDragon, Slimboat and others as we speak!

Anyway, ‘flu-ish-inspired gripes aside, I have finally mastered the art of 12c response files and DB Express configuration (mostly), so a 12c Salisbury is but hours away…