Hypothetical

Suppose that about six weeks ago you, as a proactive kind of DBA, had noticed that your 2TB database was running at about 80% disk usage and had accordingly asked the SysAdmin to provision an additional 2TB slice of the SAN so that you could then add a new ASM disk to your database.

Imagine that the SysAdmin had provisioned as requested, and you as the DBA had applied the change in the form of adding a new ASM disk to your production instance, and that, in consequence, you’d been running at a much healthier 50% disk usage ever since. You’d probably feel pretty good at having been so proactive and helpful in avoiding space problems, right?
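
Just to make the hypothetical concrete: adding that disk would have been a one-liner against the ASM instance, something along these lines (the disk group name, disk name and Solaris device path here are all invented for illustration):

-- Run as SYSASM (or SYSDBA on older releases) against the +ASM instance.
-- Disk group, disk name and device path are hypothetical.
ALTER DISKGROUP data ADD DISK '/dev/rdsk/c2t5d0s6' NAME data_0001
  REBALANCE POWER 4;

-- Confirm the new disk is a full member of its group:
SELECT group_number, disk_number, name, path, header_status, total_mb, free_mb
FROM   v$asm_disk
ORDER  BY group_number, disk_number;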

Suppose that weeks pass and it is now late October…

Now imagine that, for some reason or other that made sense at the time, you kick off a new metadata-only Data Pump export which, workplace distractions being commonplace, you lose sight of. Six hours after you started it, you get told there’s a Sev 1 because the non-ASM, standard file system to which your data_pump_dir points has hit 100% usage and there’s no more free space. Foolish DBA!

But no matter, right? You just kill the export job, clear up the relevant file system… and suddenly the OS is happy that there’s space once more on its main hard disk.
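
For what it’s worth, ‘killing the export job’ properly means stopping it through Data Pump itself, not just killing the expdp client; a rough sketch, with a made-up job name and owner, might look like this:

-- Find the runaway export job (job name and owner below are hypothetical):
SELECT owner_name, job_name, operation, state
FROM   dba_datapump_jobs;

-- Attach to it and stop it for good, discarding the master table:
DECLARE
  h NUMBER;
BEGIN
  h := DBMS_DATAPUMP.ATTACH(job_name => 'SYS_EXPORT_FULL_01', job_owner => 'SYSTEM');
  DBMS_DATAPUMP.STOP_JOB(h, immediate => 1, keep_master => 0);
END;
/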

But pile up the hypotheticals: the OS reports itself happy, but suppose you nevertheless discover that, as a result of the space problems caused by your export, none of the nodeapps are listed as running on Node 1, and any attempt to start them with srvctl on that node ends with an error message to the effect that it can’t contact the OHASD/CRSD there.

Suppose GV$INSTANCE still returns a count of 2: Node 1 is therefore still up, but no-one can connect to it, because no services are able to run on it. Basically, your Node 1 has kind-of left the building and the only possibility of getting it back, you might reasonably think, would be a whole node reboot. Thank God Node 2 is still up and has no difficulty working alone for a few hours! It’s good enough to cope with the rest of the day’s workload anyway.
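
The sanity check implied here is nothing more exotic than a quick look at GV$INSTANCE from the node that still works:

-- Both instances still report themselves OPEN, even though Node 1
-- can no longer run any services:
SELECT inst_id, instance_name, host_name, status
FROM   gv$instance;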

So, in this hypothetical house of horrors, suppose that you arrange a scheduled outage in which you will reboot Node 1 and wait for it to come back up as a fully-fledged cluster member once more. It should only be a matter of moments before Node 1 is back to its normal happy state, noticing that the non-ASM disk has loads of space once more, right?

Only, imagine that it doesn’t. Imagine instead that it takes at least 10 minutes to restart and, in fact, it’s unresponsive at that point and looking like it might take another 10 minutes more. Imagine, indeed, that after another 10 minutes on top of that lot, you look at the ASM alert log for Node 1 and find these entries:

ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "1" is missing from group number "2"

At this point, hypothetically… you might start adding 2 and 2 together and getting an approximation of 4: for you would know that disk 1 is the new 2TB one that you added to the database way back in September.
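
Matching the disk number in that error to a physical device is a quick query against the ASM views; something along these lines would tell you which path ‘disk 1 of group 2’ corresponds to, and what ASM now thinks of its header (your own names and paths will obviously differ):

-- Which device is (or was) disk 1 of group 2, and does ASM still
-- recognise its header?
SELECT group_number, disk_number, name, path, header_status, mount_status
FROM   v$asm_disk
ORDER  BY group_number, disk_number;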

But why would that new disk, which has been in daily and heavy use ever since, be posing a problem now, rather than before now? You might start idly wondering whether, when it was provisioned, it was provisioned incorrectly somehow. This being the first reboot since that time, perhaps tonight (for it is now past midnight) is the first opportunity that mis-provisioning has had to reveal itself?

You might at this point very well make a big mental note: on no account reboot node 2, because if it loses the ability to read ASM disks too the entire primary site will have been destroyed.

It would make for an interesting night, wouldn’t it? Especially if the SysAdmin who did the disk provisioning back in September was no longer available for consultation because he was on paternity leave. In New Zealand.

What might you as the DBA do about this state of affairs? Apart from panic, I mean?!

Well, first I think you might very well get your manager to call the SysAdmin and get him off paternity leave in a hurry, and he might take a quick look over the disks and confirm that he’d partitioned the disk back in September to start from cylinder 0… which is, er… a big no-no.

It is, in fact, perhaps the biggest no-no there is when provisioning disk space for Oracle ASM. This is because doing so means your physical partition table starts at cylinder 0… but, unfortunately, Oracle writes its ASM-specific metadata at the beginning of the disk you give it, so it overwrites the partition table with its own ASM-specific data. When ASM data replaces disk partition data… you don’t have any disk partitions anymore. Though you won’t know about it yet, because the disk partition information was read into memory at the time the disk was added and has thus been readable ever since.

To stop that happening, you’re supposed to make sure you start your partitions at something other than cylinder 0. Then Solaris can write partition info literally at cylinder 0, and Oracle’s ASM data can start… NOT at cylinder 0!

Apparently, the only operating system that even allows you to add cylinder-0-partitioned disks to ASM is Solaris: Oracle on other operating systems spots the potential for disaster and prevents you from adding such a disk in the first place. Tough luck if, in this hypothetical situation, you’re stuck using Solaris, then!

Until you try to re-mount a disk after a reboot, you don’t know the partition table has been destroyed by Oracle’s ASM shenanigans. The partition information is in memory and the operating system is therefore happy. You can run like this forever… until you reboot the affected server, at which point the ASM overwrite of the disk partition information proves fatal.

The second thing you might do is raise a severity 1 SR with Oracle to see if there’s any possible way of fixing the partition table on this disk without destroying its ASM-ness. However, Oracle support being what it is, chances are good that they will simply hum-and-haw and make dark noises about checking your backups. (Have you ever restored a 2TB database from tape? I imagine it might take one or two days… or weeks…)

So then you might start thinking: we have a Data Guard setup. Let’s declare a disaster, switch over to the secondary site, and thus free up the primary’s disks to be re-partitioned correctly. And at this point, hypothetically of course, you might then realise that when we added a disk to the ASM groups back in September on the primary… er… we probably also did exactly the same on the standby!

This means (or would mean, because this is just hypothetical, right?!) that our disaster recovery site would be just as vulnerable to an inadvertent reboot or power outage as our primary is. And then you’d probably get the sysadmin who’s been contacted by phone to check the standby site and confirm your worst suspicions: the standby site is vulnerable.

At this point, you would have a single primary node running, provided it didn’t reboot for any reason. And a Data Guard site running, so long as it didn’t need to reboot. That warm glow of ‘my data is protected’ you would have been feeling about 12 hours ago would have long since disappeared.

Hypothetically speaking, you’ve just broken your primary and the disaster recovery site you were relying on to get you out of that fix is itself one power failure away from total loss. In which case, your multi-terabyte database that runs the entire city’s public transport system would cease to exist, for at least several days whilst a restore from tape took place.

If only they had decided to use ‘normal redundancy’ on their ASM disk groups! For then you would be able to drop the bad disk forcibly and know that other copies of data stored on the remaining good disks would suffice. But alas, they (hypothetically) had adopted external redundancy, for it runs on a SAN and SANs never go wrong…
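
The difference is baked in at disk group creation time: with normal redundancy, ASM keeps two copies of every extent in separate failure groups, so a bad disk can be dropped (with FORCE, if need be) and the group re-mirrors itself from the surviving copies; with external redundancy there is exactly one copy of everything and the SAN is trusted completely. A hypothetical illustration, with invented names and paths:

-- Normal redundancy: ASM mirrors extents itself across failure groups.
CREATE DISKGROUP data_norm NORMAL REDUNDANCY
  FAILGROUP fg1 DISK '/dev/rdsk/c2t5d0s6'
  FAILGROUP fg2 DISK '/dev/rdsk/c3t5d0s6';

-- External redundancy: one copy of everything; the SAN had better be right.
CREATE DISKGROUP data_ext EXTERNAL REDUNDANCY
  DISK '/dev/rdsk/c2t6d0s6';

-- Only the mirrored group could shrug off a lost disk like this:
ALTER DISKGROUP data_norm DROP DISK data_norm_0001 FORCE;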

At this point, you’ve been up in the wee small hours of the night for over 12 hours, but you might nevertheless come up with a cunning plan: use the fact that node 2 is still up (just!) and get it to add a new, good disk to the disk group and rebalance. The data gets distributed off the badly-configured disk onto the new one (which you’ve made triply sure was not partitioned at cylinder 0!).

You could then drop the badly-configured disk, using standard ASM ‘drop disk’ commands. The data would then be moved off the bad disks onto the good ones. You could then remove the bad disk from the ASM array and your Data Guard site would, at least, be protected from complete failure once more.
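
Spelled out, the plan amounts to a handful of statements run from the surviving node’s ASM instance, with a great deal of waiting in between (the disk group, disk and device names below are, as ever, made up):

-- 1. Add a correctly-partitioned replacement disk; this kicks off a rebalance
--    that spreads data onto it:
ALTER DISKGROUP data ADD DISK '/dev/rdsk/c4t5d0s6' NAME data_0002
  REBALANCE POWER 8;

-- 2. Watch the rebalance: when this returns no rows, it has finished
--    (and take EST_MINUTES with a very large pinch of salt):
SELECT group_number, operation, state, power, sofar, est_work, est_minutes
FROM   v$asm_operation;

-- 3. Drop the mis-partitioned disk, which triggers a second rebalance to move
--    its data onto the remaining good disks:
ALTER DISKGROUP data DROP DISK data_0001 REBALANCE POWER 8;

-- 4. Once V$ASM_OPERATION is empty again and the old disk shows HEADER_STATUS
--    of FORMER in V$ASM_DISK, the device can safely be handed back to the
--    SysAdmin for re-partitioning.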

Of course, Oracle support might tell you that it won’t work, because you can’t drop a disk from a disk group that uses external redundancy… they seem to have forgotten that the second node is still running. And you’ve certainly never tried this before, so you’re basically testing really critical stuff out on your production standby site first. But what choice do you have, realistically?!

So, hypothetically of course, you’d do it. You’d add a disk, wait for a rebalance to complete (and notice that ASM’s ability to predict when a rebalance operation is finished is pretty hopeless: if it tells you 25 minutes, it means an hour and a half). And then you’d drop a disk and wait for a rebalance to complete. And then you’d reboot one of the Data Guard nodes… and when it failed to come back up, you might slump in the Slough of Despond and declare failure. Managers being by this time very supportive, they might propose that we abandon in-house efforts to achieve a fix and call in Oracle technical staff for on-site help. And that decision having been taken in an emergency meeting, you might idly re-glance at your Data Guard site and discover that not only is the +ASM1 instance up and running after all, but so is database instance #1. It had actually all come up fine, but you had lacked the patience to wait for it to sort itself out and had declared failure prematurely. Impatient DBA!

Flushed with the (eventual) success of getting the Data Guard site running on all-known-good-disks, you might want to hurry up and get the primary site repaired in like manner. Only this is a production environment under heavy change management control, so you’ll likely be told it can only be fiddled with at 11pm. So you would be looking at having worked 45 hours non-stop before the fix is in.

Nevertheless, hypothetically, you might manage to stay up until 11pm, perform the same add/rebalance/drop/rebalance/reboot trick on the primary’s node 2… and, at around 3am, discover yourself the proud owner of a fully-functioning 2-node RAC cluster once again.

(The point being here that Node 2 on the primary was never rebooted, though that reboot had been scheduled to happen and the SysAdmin sometimes reboots both nodes at the same time, to ‘speed things up’ a bit! Had it been rebooted, it too would have failed to come back up and the entire primary site would have been lost, requiring a failover from the now-re-protected standby. But since Node 2 is still available, it can still do ASM re-structuring, using the ‘add-good-disk; rebalance; drop bad-disk; rebalance’ technique.)

There might be a little bit of pride at having been able to calmly and methodically work out a solution to a problem that seemed initially intractable. A bit of pleasure that you managed to save a database from having to be restored from tape (with an associated outage measured in days that would have cost the company millions). There might even be a bit of relief that it wasn’t you letting an export consume too much disk space that was the root cause, but a sysadmin partitioning a disk incorrectly weeks ago.

It would make for an interesting couple of days, I think. If it was not, obviously and entirely, hypothetical. Wouldn’t it??!