What were the odds?

In modernising Churchill to work with Oracle 12c and the latest 6.x releases of RHCSL, I’ve encountered a bizarre bug (#19476913 if you’re able to check up on it), whereby startup of the cluster stack on a remote node fails if its hostname is longer than (or equal in length to) the hostname of the local node.

That is, if you are running the Grid Infrastructure installer from Alpher (6 characters) and pushing to Bethe (5 characters), then the CRS starts on Bethe just fine: local 6 is greater than remote 5. But if you are running the GI installer on Gamow (5 characters) and pushing to Dalton (6 characters), then the installer’s attempt to restart the CRS on Dalton will fail, since now local 5 is less than remote 6. Alpher/Bethe managed to dodge this bullet, of course, but only by pure luck.

The symptoms are that during the installation of Grid Infrastructure, all works well until the root scripts are run, at which point (and after a long wait) an error dialog pops up.

Poke around in the [Details] of that dialog and you’ll see this:

CRS-2676: Start of 'ora.cssdmonitor' on 'dalton' succeeded 
CRS-2672: Attempting to start 'ora.cssd' on 'dalton' 
CRS-2672: Attempting to start 'ora.diskmon' on 'dalton' 
CRS-2676: Start of 'ora.diskmon' on 'dalton' succeeded 
CRS-2676: Start of 'ora.cssd' on 'dalton' succeeded 
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'dalton' 
CRS-2672: Attempting to start 'ora.ctssd' on 'dalton' 
CRS-2883: Resource 'ora.ctssd' failed during Clusterware stack start. 
CRS-4406: Oracle High Availability Services synchronous start failed. 
CRS-4000: Command Start failed, or completed with errors.
2017/02/18 10:21:41 CLSRSC-117: Failed to start Oracle Clusterware stack
Died at /u01/app/12.1.0/grid/crs/install/crsinstall.pm line 914.

The installation log is not much more useful: it just documents everything starting nicely until it fails for no discernible reason when trying to start ora.ctssd.

Take exactly the same two nodes and do the installation from the Dalton node, though, and everything just works. So it’s not, as I first thought it might be, something to do with networks, firewalls, DNS name resolution or the myriad other things that RAC depends on being ‘right’ before it will work. It’s purely and simply a matter of whether the local node’s name is longer or shorter than the remote node’s!

The problem is fixed in PSU 1 for 12.1.0.2, but it’s inappropriate to mandate its use in Churchill, since that’s supposed to work with the vanilla software available from OTN (I assume my readers lack support contracts, so everything has to work as-supplied from OTN for free).

The obvious fix for Churchill, therefore, is to (a) make the ‘Gamow’ name one character longer (maybe spell it incorrectly as ‘gammow’?); (b) find a ‘D’ name that belongs to a physicist and is only four characters long or fewer; or (c) change both names, ensuring that the second is shorter than the first.

Largely due to the distinct lack of short-named, D-named physicists, I’ve gone for the (c) option: Churchill 1.7 therefore builds its Data Guard cluster using hosts geiger and dirac. Paul Dirac (that’s him on the top-left) was an English theoretical physicist, greatly admired by Richard Feynman (which makes him something of a star in these parts), who invented the relativistic equation of motion for the wave function of the electron. He used his equation to predict the existence of the positron, and of anti-matter in general, something for which he won a share of the 1933 Nobel prize for physics. Geiger is a frankly much less distinguished physicist whose main claim to fame is that he invented (most of) the Geiger counter and wasn’t (apparently) a Nazi. He gets into the Churchill Pantheon by the skin of his initial letter and not much else, to be honest!

Short version then: Churchill 1.7 now uses Alpher/Bethe and Geiger/Dirac clusters, and both Gamow and Dalton are no more. Quite a bit of documentation needs updating to take account of this trivial change! Hopefully, I should have that sorted by the end of the day. And that will teach me to test all parts of Churchill before declaring that ‘it works with 12c’. (Oooops!)

State of Play

It has been almost a year since I did anything to my Churchill automation framework. (If you didn’t know or realise, Churchill lets you easily create virtual multi-node RAC and Active Data Guard environments).

I took a look at it lately to see about refreshing it, and discovered that whilst Churchill’s speedkeys assume you’re using CentOS 6.6, CentOS has actually released versions 6.7 and 6.8 since the last Churchill refresh (as have Red Hat, Scientific Linux and Oracle, of course)! It is also still the case that no version of Churchill works with a RHCSL 7.x O/S (because systemd screws up network automation).

No worries, I thought to myself: tweak a few things and at least bring Churchill up-to-date-ish with version 6.8 as the new speedkey default.

No chance!

In their wisdom, the CentOS developers decided to split the main DVD release of 6.8 over two DVDs. That means Churchill’s trick of copying its own installation media onto its own hard disks (for subsequent automated network installs onto the RAC nodes to use) fails, because it can only copy one of the two install DVDs. (There is a special, 6.8-only Dual Layer DVD you can use, and that would work… but I decided not to go there for now).

This, combined with the systemd debacle, basically means that Churchill is dead in the water. I have therefore made only one more tweak to it (Scientific Linux 6.7 is now assumed to be the speedkey default), and after that I won’t be maintaining it further.

Something new is required to automate 12c RACs on RHCSL 7.x environments… and I don’t know what that will be yet! Watch this space, I guess…

Churchill on Windows?!

I’ve had many requests over the years to repeat my ‘Churchill Framework’ on Windows, “Churchill” being my mostly-automated way of building a virtual RAC using Linux as the operating system of choice.

I’ve always refused: if you want a desktop RAC on your Windows PC, why not just deploy Churchill ‘proper’ and have three virtual machines running CentOS? It’s a RAC, and it’s still “on” Windows, isn’t it?!

Well, of course, that wasn’t quite the point my correspondents were making. They wanted a desktop RAC running on top of purely Windows operating systems. They aren’t Linux users, and they’re not interested in working at a command line. Could I please oblige?

Again, I’ve always said no, because Windows costs lots of money. It’s easy to build a 3-node or a 6-node setup in Linux, because you aren’t paying $1000 a pop every time you install your operating system! It seemed to me that RAC-on-Windows was a nice idea (I had it working back in 2001 with 9i on Windows 2000 after all), but it wasn’t very practical as a learning platform.

Happily for my correspondents, I’ve now changed my view in that regard. All the Windows-based would-be DBAs of my acquaintance are working for companies that supply them with MSDN subscriptions. And Microsoft’s Technet evaluation options allow even people with no MSDN access to download and use Windows Server 2012 and beyond for free, for at least 6 months.

So I’ve given in. A new article on doing desktop RAC using nothing but Windows is now available. It bears a passing resemblance to ‘proper’ Churchill: there are three servers to build, with one acting as the supplier of shared storage and the necessary network services to the others. There’s even the use of iSCSI to provide the virtual shared storage layer. But it’s about as non-Churchill as it gets, really, because everything is hand-built… which explains the enormous number of screenshots and the overall length of the article!

Hypothetical

Suppose that about six weeks ago you, as a proactive kind of DBA, had noticed that your 2TB database was running at about 80% disk usage and had accordingly asked the SysAdmin to provision an additional 2TB slice of the SAN so that you could then add a new ASM disk to your database.

Imagine that the SysAdmin had provisioned as requested, and you as the DBA had applied the change in the form of adding a new ASM disk to your production instance -and that, in consequence, you’d been running at a much healthier 50% disk usage ever since. You’d probably feel pretty good at having been so proactive and helpful in avoiding space problems, right?
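
In SQL terms, that September change might have looked roughly like this (the disk group and device names here are invented purely for illustration):

-- Check how full the disk group is, then add the newly provisioned 2TB slice
-- and let ASM rebalance existing data onto it.
SELECT name, total_mb, free_mb FROM v$asm_diskgroup;
ALTER DISKGROUP data ADD DISK '/dev/rdsk/c4t600A0B800012ABCDd0s6' REBALANCE POWER 4;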

Suppose that weeks pass and it is now late October…

Now imagine that, for some reason or other that made sense at the time, you kick off a new metadata-only Data Pump export which, workplace distractions being commonplace, you lose sight of. Then, six hours after you started it, you get told there’s a Sev 1 because the non-ASM, standard file system to which your data_pump_dir points has hit 100% usage and there’s no more free space. Foolish DBA!

But no matter, right? You just kill the export job, clear up the relevant hard disk… suddenly the OS is happy there’s space once more on its main hard disk.

But pile up the hypotheticals: the OS reports itself happy, but suppose you nevertheless discover that, as a result of the space problems caused by your export, none of the nodeapps are listed as running on Node 1, and any attempt to start them with srvctl ends with an error message to the effect that it can’t contact the OHASD/CRSD on that node.

Suppose GV$INSTANCE still returns a count of 2: Node 1 is therefore still up, but no-one can connect to it, because no services are able to run on it. Basically, your Node 1 has kind-of left the building and the only possibility of getting it back, you might reasonably think, would be a whole node reboot. Thank God Node 2 is still up and has no difficulty working alone for a few hours! It’s good enough to cope with the rest of the day’s workload anyway.
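
The sort of check being described might look like this (a sketch, not a transcript of the hypothetical session):

-- Both instances still register as OPEN, even though nothing can usefully
-- connect to Node 1 because none of its services will start.
SELECT inst_id, instance_name, status FROM gv$instance;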

So, in this hypothetical house of horrors, suppose that you arrange a scheduled outage in which you will reboot Node 1 and wait for it to come back up as a fully-fledged cluster member once more. It should only be a matter of moments before Node 1 is back to its normal happy state, noticing that the non-ASM disk has loads of space once more, right?

Only, imagine that it doesn’t. Imagine instead that it takes at least 10 minutes to restart and, in fact, is still unresponsive at that point and looking like it might take another 10 minutes more. Imagine, indeed, that after another 10 minutes on top of that lot, you look at the ASM alert log for Node 1 and find these entries:

ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "1" is missing from group number "2"

At this point, hypothetically… you might start adding 2 and 2 together and getting an approximation of 4: for you would know that disk 1 is the new 2TB one that you added to the database way back in September.

But why would that new disk, which has been in daily and heavy use ever since, be posing a problem now, rather than before now? You might start idly wondering whether, when it was provisioned, it was provisioned incorrectly somehow. This being the first reboot since that time, perhaps tonight (for it is now past midnight) is the first opportunity that mis-provisioning has had to reveal itself?

You might at this point very well make a big mental note: on no account reboot node 2, because if it loses the ability to read ASM disks too, the entire primary site will have been destroyed.

It would make for an interesting night, wouldn’t it? Especially if the SysAdmin who did the disk provisioning back in September was no longer available for consultation because he was on paternity leave. In New Zealand.

What might you as the DBA do about this state of affairs? Apart from panic, I mean?!

Well, first I think you might very well get your manager to call the SysAdmin and get him off paternity leave in a hurry, and he might take a quick look over the disks and confirm that he’d partitioned the disk back in September to start from cylinder 0… which is, er… a big no-no.

It is, in fact, perhaps the biggest no-no you can commit when provisioning disk space for Oracle ASM. This is because doing so means your physical partition table starts at cylinder 0… but, unfortunately, Oracle writes its ASM-specific information at the beginning of the disk you give it, so it over-writes the partition table information with its own ASM-specific data. When ASM data replaces disk partition data… you don’t have any disk partitions anymore. Though you won’t know about it yet, because the disk partition information was read into memory at the time the disk was added and has thus been readable ever since.

To stop that happening, you’re supposed to make sure you start your partitions at something other than cylinder 0. Then Solaris can write partition info literally at cylinder 0, and Oracle’s ASM data can start… NOT at cylinder 0!
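
On Solaris, a quick sanity check of a slice before handing it to ASM might look something like this (the device name is invented; the point is simply that the slice you give to ASM must not start at sector/cylinder 0):

# Print the VTOC for the disk and check the 'First Sector' column of the
# slice destined for ASM: it must not be 0.
prtvtoc /dev/rdsk/c4t600A0B800012ABCDd0s2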

Apparently, the only operating system that even allows you to add cylinder-0-partitioned disks is Solaris: Oracle on other operating systems spots the potential for disaster and prevents you from adding it in the first place. Tough luck if, in this hypothetical situation, you’re stuck using Solaris, then!

Until you try and re-mount a disk after a reboot, you don’t know the partition table has been destroyed by Oracle’s ASM shenanigans. The partition information is in memory and the operating system is therefore happy. You can run like this forever… until you reboot the affected server, at which point the ASM over-write of the disk partition information proves fatal.

The second thing you might do is raise a severity 1 SR with Oracle to see if there’s any possible way of fixing the partition table on this disk without destroying its ASM-ness. However, Oracle support being what it is, chances are good that they will simply hum and haw and make dark noises about checking your backups. (Have you ever restored a 2TB database from tape? I imagine it might take one or two days… or weeks…)

So then you might start thinking: we have a Data Guard setup. Let’s declare a disaster, switch over to the secondary site, and thus free up the primary’s disks to be re-partitioned correctly. And at this point, hypothetically of course, you might then realise that when we added a disk to the ASM disk groups back in September on the primary… er… we probably also did exactly the same on the standby!

This means (or would mean, because this is just hypothetical, right?!) that our disaster recovery site would be just as vulnerable to an inadvertent reboot or power outage as our primary is. And then you’d probably get the sysadmin who’s been contacted by phone to check the standby site and confirm your worst suspicions: the standby site is vulnerable.

At this point, you would have a single primary node running, provided it didn’t reboot for any reason. And a Data Guard site running, so long as it didn’t need to reboot. That warm glow of ‘my data is protected’ you would have been feeling about 12 hours ago would have long since disappeared.

Hypothetically speaking, you’ve just broken your primary and the disaster recovery site you were relying on to get you out of that fix is itself one power failure away from total loss. In which case, your multi-terabyte database that runs the entire city’s public transport system would cease to exist, for at least several days whilst a restore from tape took place.

If only they had decided to use ‘normal redundancy’ on their ASM disk groups! For then you would be able to drop the bad disk forcibly and know that other copies of data stored on the remaining good disks would suffice. But alas, they (hypothetically) had adopted external redundancy, for it runs on a SAN and SANs never go wrong…

At this point, you’ve been up in the wee small hours of the night for over 12 hours, but you might nevertheless come up with a cunning plan: use the fact that node 2 is still up (just!) and get it to add a new, good disk to the disk group and re-balance. The data is distributed off the badly-configured disk onto the new one (which you’ve made triply sure was not partitioned at cylinder 0!)

You could then drop the badly-configured disk, using standard ASM ‘drop disk’ commands. The data would then be moved off the bad disks onto the good ones. You could then remove the bad disk from the ASM array and your Data Guard site would, at least, be protected from complete failure once more.
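
As a rough sketch (disk group, device and disk names all invented for illustration), that sequence amounts to this, run from the surviving instance:

-- Bring in a correctly-partitioned disk and wait for the rebalance...
ALTER DISKGROUP data ADD DISK '/dev/rdsk/c4t600A0B800034EF01d0s6' NAME data_good REBALANCE POWER 8;
-- ...the rebalance is finished when this returns no rows for the group...
SELECT group_number, operation, state, est_minutes FROM v$asm_operation;
-- ...then drop the mis-partitioned disk and wait for the second rebalance.
ALTER DISKGROUP data DROP DISK data_bad REBALANCE POWER 8;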

Of course, Oracle support might tell you that it won’t work, because you can’t drop a disk from a disk group with external redundancy… they seem to have forgotten that the second node is still running. And you’ve certainly never tried this before, so you’re basically testing really critical stuff out on your production standby site first. But what choice do you have, realistically?!

So, hypothetically of course, you’d do it. You’d add a disk and wait for a rebalance to complete (and notice that ASM’s ability to predict when a rebalance operation will finish is pretty hopeless: if it tells you 25 minutes, it means an hour and a half). Then you’d drop a disk and wait for another rebalance to complete. And then you’d reboot one of the Data Guard nodes… and when it failed to come back up, you might slump in the Slough of Despond and declare failure.

Managers being by this time very supportive, they might propose abandoning in-house efforts to achieve a fix and calling in Oracle technical staff for on-site help. And, that decision having been taken in an emergency meeting, you might idly re-glance at your Data Guard site and discover that not only is the +ASM1 instance up and running after all, but so is database instance 1. It had actually all come up fine, but you had lacked the patience to wait for it to sort itself out and had declared failure prematurely. Impatient DBA!

Flushed with the (eventual) success of getting the Data Guard site running on all-known-good-disks, you might want to hurry up and get the primary site repaired in like manner. Only this is a production environment under heavy change management control, so you’ll likely be told it can only be fiddled with at 11pm. So you would be looking at having worked 45 hours non-stop before the fix is in.

Nevertheless, hypothetically, you might manage to stay up until 11pm, perform the same add/rebalance/drop/rebalance/reboot trick on the primary’s node 2… and, at around 3am, discover yourself the proud owner of a fully-functioning 2-node RAC cluster once again.

(The point being here that Node 2 on the primary was never rebooted, though that reboot had been scheduled to happen and the SysAdmin sometimes reboots both nodes at the same time, to ‘speed things up’ a bit! Had it been rebooted, it too would have failed to come back up and the entire primary site would have been lost, requiring a failover from the now-re-protected standby. But since Node 2 is still available, it can still do ASM re-structuring, using the ‘add-good-disk; rebalance; drop bad-disk; rebalance’ technique.)

There might be a little bit of pride at having been able to calmly and methodically work out a solution to a problem that seemed initially intractable. A bit of pleasure that you managed to save a database from having to be restored from tape (with an associated outage measured in days that would have cost the company millions). There might even be a bit of relief that it wasn’t you letting an export consume too much disk space that was the root cause, but a sysadmin partitioning a disk incorrectly weeks ago.

It would make for an interesting couple of days, I think. If it was not, obviously and entirely, hypothetical. Wouldn’t it??!

Asquith 1.09

It’s been a long time coming, but I’ve just released a new version of Asquith which now supports installing 11.2.0.4 standalone and clustered databases. Previously, in the 11g product range, it only supported 11.2.0.1 and 11.2.0.3 versions. It still supports 12c, too, of course.

The only change in behaviour over the previous version is that you supply an ORAVER=11204 bootstrap parameter when booting a member server (having previously copied the 11.2.0.4 installation media to your Asquith server, of course).

It will take a while to update various pages/articles to reflect the new ORAVER option, but hopefully by the end of the weekend I’ll have it all done.

Note that Salisbury doesn’t get this update: Asquith and Salisbury parted ways some time ago.

The Crackle of Static

I’ve been off-air for quite a while of late, I realise, and the sound of the static crackle in place of my words of wit and wisdom must be getting quite deafening for some. :-)

Well, it’s just because I’ve been incredibly busy at work. I am the development DBA: there was supposed to be an entirely separate team of operational DBAs, but they never materialised. Instead, a plan was hatched to outsource operational DBA responsibilities to an entirely separate company, but that never materialised either. Which has left yours truly as the development AND operations DBA… for an organisation whose primary database runs on Solaris, is RAC + Active Data Guard on ASM, uses Virtual Private Database and has 6TB of disk space allocated to it to match its 32 CPUs and 64GB of RAM. It’s a non-trivial database environment, in other words, and if I screw it up, millions of people will notice and questions might reasonably be asked in Parliament. :-( (I speak quite literally, too! This database runs the public transport system for the whole of Sydney: if it stuffs up, a million commuters will be screaming about it on the front page of the Sydney Morning Herald the next day!)

Oh, and I am also the SQL Server DBA. Indeed, I have morphed into being something of the world-wide corporate expert on SQL Server-to-Oracle replication.

I barely get time to think, basically, let alone maintain a blog.

So, when I can, I will; but often of late, I can’t.

One little nugget I will just drop here (because there’s no other place where it would make a lot of sense!): last Saturday night, I spent 6 hours repeatedly trying to patch our Standby site from 11.2.0.3.5 to 11.2.0.3.9, before declaring it a failure and rolling it back. The next Monday, a colleague took a look at the logs I’d spent a long time poring over to no effect and within seconds had said, “That’s your problem right there: what directory did you run opatch from?”

I’ll back up a little at this point: the standard Oracle doco on RAC installation on ASM makes provision for splitting the ownership of the database software from the ownership of the Grid Infrastructure software. Thus, it proposes creating users generically called “oracle” and “grid” to own each bit respectively. I’ll call this the ‘split ownership’ model of Oracle software installation. It’s standard, well-documented, but optional.

Anyone who has had a go at creating their own Salisbury or Asquith RAC and/or Data Guard setups will know, however, that I don’t really like the split ownership model: my installation instructions tell you to create an oracle user and have him own both the database and Grid Infrastructure software alike, which is what we might call the “unitary ownership” model. That’s perfectly standard, too, and fully supported. I’d have said it’s also more common than split ownership, which is really intended for vast organisations that have large, separate teams of SysAdmins and DBAs (who generally don’t talk much to each other).

Unfortunately, if your mindset is expecting unitary ownership, as mine is, you get into the habit of applying Oracle RAC & Grid patches by becoming root, travelling to the /home/oracle directory and issuing the necessary opatch auto commands. I say “unfortunately” because this won’t work in a split ownership environment!

Why not? Because although you run the patch in as the root user, it works only because the patching script issues commands to be run as the oracle and grid users at different times. And that means both the oracle and grid users have to have equal read/write access to the directory from which you launch the patch, otherwise one or other of them (or both if you try running it out of /root, for example) will not be able to determine their current working directory properly …and it’s downhill all the way from that point on.

So my six hours of frustration on Saturday arose entirely from the fact that I was sitting in /home/oracle when I ran the opatch command. When the time came for the grid user to issue his commands, he had no read/write privileges on oracle’s private home directory, and so the thing failed miserably. I didn’t recognise the symptoms for what they were (in fairness, it was late and my eyes had somewhat glazed over); my wiser, genius SysAdmin colleague spotted it at once.

Last night I made another attempt to run the same patch in. This time, I launched the opatch command from /u01/app, a directory to which both the grid and oracle users had equivalent access. Forty minutes later, the patching had completed successfully.
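
For what it’s worth, the pattern that worked boils down to something like this (the Grid home and patch staging paths are illustrative, not the exact ones from that night):

# As root. The crucial bit is the current directory: somewhere both the
# grid and oracle users can read, NOT a private home directory.
cd /u01/app
/u01/app/11.2.0/grid/OPatch/opatch auto /u01/app/stage/gipsu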

So, the short version: if you do split ownership of your RACs, watch out for where you run your patches from! You have to find a directory to which both owners have equivalent access. Private home directories won’t cut it. Yes, it’s probably documented a million times in the official doco… but I got burnt, nonetheless.

Asquith and the new Red Hat

Whilst I was busy planning my Paris perambulations, Red Hat went and released version 6.5 of their Enterprise Server distro. Oracle swiftly followed… and, even more remarkably, CentOS managed to be almost as swift, releasing their 6.5 version on December 1st. Scientific Linux has not yet joined this particular party, but I assume it won’t be long before they do.

I had also assumed Asquith would work unchanged with the new distro, but I hadn’t banked on the clumsy way I originally determined the distro version number, which meant the whole thing fell into a nasty, broken heap. Happily, it only took a minute or so to work out which bit of my crumbly code was responsible, and that’s now been fixed.

Asquith therefore has been bumped to a new 1.07 release, making it entirely compatible with any 6.5 Red Hat-a-like distro (and any future 6.x releases, come to that).

Another feature of this release is that the ‘speedkeys’ parameters have been altered so that they assume the use of version 6.5 of the relevant distro. That is, if you build your RAC nodes by using a bootstrap line that reads something like ks=http://192.168.8.250/kickstart.php?sk=1…, then you’ll be assumed to be using a 6.5 distro and the source OS for the new server will be assumed to reside in a <distro>/65 directory.

If you want to continue using 6.4 or 6.3 versions, of course, you can still spell that out (ks=http://192.168.8.250/kickstart.php?distro=centos&version=63…). You just can’t use speedkeys to do it.

An equivalent update to Salisbury has also just been released.

Here’s a curious thing

My 2-node production RAC had been suffering from ‘checkpoint incomplete’ messages in the alert log for a while, so back at the end of October I finally got off my bottom and bothered to take a look, only to discover the beast had been created with just 2 logs per thread, each of only 50MB.

My ‘standard, do it without even thinking’ approach to online logs has long been: 4 logs per thread minimum, each at least 500MB in size.

So, this database was under-specc’d by quite a long way. No problem: it is easy enough to alter database add logfile thread 1 ‘/blah/blah/log4a.rdo’ size 500M; several times until the requisite number of logs of the right size has been created. Problem solved.
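
Concretely, the pattern is something along these lines (group numbers and file names are illustrative; repeat for however many groups and threads you need):

-- Add new 500MB groups to each thread...
ALTER DATABASE ADD LOGFILE THREAD 1 GROUP 5 ('+DATA/proddb/log5a.rdo') SIZE 500M;
ALTER DATABASE ADD LOGFILE THREAD 2 GROUP 6 ('+DATA/proddb/log6a.rdo') SIZE 500M;
-- ...then drop each old 50MB group once v$log shows it INACTIVE.
ALTER DATABASE DROP LOGFILE GROUP 1;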

Now, this 2-node RAC happens to be the primary database in a primary-standby Active Data Guard setup. I did idly wonder whether the creation of the 500MB online logs would automatically happen over on the standby site, especially since we long ago issued the command alter system set standby_file_management=auto; but since “online” logs are never used on a database that is genuinely acting as a standby, it didn’t seem important to check it out one way or another. (I should clarify that redo generated by a primary is shipped to an Active Data Guard standby by LGWR and stored in standby redo logs, from where it is read by the managed recovery process and used to generate archived redo logs. So standby logs are definitely used at an Active Data Guard standby database, but not the “online” logs… those are there for when disaster strikes and the standby needs to become the new primary.)
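
If you want to see that distinction for yourself on a standby, a couple of quick queries will do it (just a sketch):

-- The standby redo logs are the ones actually receiving redo...
SELECT group#, thread#, bytes/1024/1024 AS mb, status FROM v$standby_log;
-- ...while the "online" groups just sit there waiting for a role transition.
SELECT group#, thread#, bytes/1024/1024 AS mb, status FROM v$log;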

So, anyway: long story short, I increased the size of the primary’s online logs and didn’t bother to check what had happened over on the standby. Redo continued to flow from the primary to the standby, and a check of the latency of redo transmission showed that all was well (the standby never lagged the primary by more than 12 seconds). All’s well that ends well, I guess.
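
A latency check of that sort might look something like this when run on the standby (not necessarily the exact query I used):

-- Transport lag: how far behind the standby is in receiving redo;
-- apply lag: how far behind it is in applying it.
SELECT name, value FROM v$dataguard_stats WHERE name IN ('transport lag', 'apply lag');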

Except that, one day, for no real reason, I did this:

SQL> archive log list
Database log mode              Archive Mode
Automatic archival             Enabled
Archive destination            +FRA
Oldest online log sequence     23445
Next log sequence to archive   23448
Current log sequence           23448

That’s on the primary node 1. And just for the hell of it, I did the same thing on the standby node 1:

SQL> archive log list
Database log mode              Archive Mode
Automatic archival             Enabled
Archive destination            +FRA
Oldest online log sequence     23244
Next log sequence to archive   0
Current log sequence           23245

And that’s an apparent discrepancy of around 200 archive logs! This worried me, so I checked the alert log of the standby:

Thu Nov 07 11:39:23 2013
RFS[2]: Selected log 10 for thread 1 sequence 23448 dbid -2003148368 branch 798942256
Thu Nov 07 11:39:23 2013
Media Recovery Waiting for thread 1 sequence 23448 (in transit)
Recovery of Online Redo Log: Thread 1 Group 10 Seq 23448 Reading mem 0

…which showed that the standby was actually processing redo from sequence 23448 or so, which is exactly the sequence being displayed by the archive log list command when run on primary node 1. So the alert log was saying “no discrepancy”, but the SQL*Plus archive log list command was saying “200 logs out of whack!”.

Puzzled, I dug a little deeper:

SQL> select to_char(first_time,'DD-MON-YYYY HH24:MI')
 2 from v$archived_log where sequence#=23245 and thread#=1;

TO_CHAR(FIRST_TIM
-----------------
25-OCT-2013 10:11

The standby’s response to the archive log list command showed that it thought log 23245 was the last one applied. This query shows that that specific log was created a couple of weeks ago, on 25th October at 10:11AM. So what happened around then that apparently stalled the increment of the redo log sequence number? Well, here’s primary node 1’s alert log for the relevant time:

Fri Oct 25 10:17:25 2013
alter database add logfile '+DATA/proddb/log1a.rdo' size 500m
Completed: alter database add logfile '+DATA/proddb/log1a.rdo' size 500m

…and that’s me resizing the primary’s online redo logs, at about 10:17AM on 25th October!

Personally, I think this is a bug. The standby was always receiving the latest redo, into its standby logs, as designed. Yet the SQL*Plus command was returning incorrect data, apparently flummoxed by the size discrepancy in the online logs between the primary and standby sites.

But whether it’s a bug or not, the fix-up suggested itself: take the standby out of managed recovery mode (alter database recover managed standby database cancel), switch to manual file management (alter system set standby_file_management=manual), then add new online redo logs of the right size and drop the originals. When all log groups are 500MB, simply reverse the process: file management becomes auto once more and recovery of the standby is re-commenced. Net result: the output from archive log list immediately ‘catches up’ and starts displaying exactly the log sequence numbers that the primary reports.
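
In SQL terms, that fix-up amounts to something like the following, run on the standby (sizes, group numbers and the use of real-time apply at the end are illustrative, not a transcript):

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER SYSTEM SET standby_file_management=MANUAL;
-- Add 500MB online log groups for each thread, then drop the old 50MB ones
-- (clearing a group first if its status prevents the drop).
ALTER DATABASE ADD LOGFILE THREAD 1 '+DATA' SIZE 500M;
ALTER DATABASE CLEAR LOGFILE GROUP 1;
ALTER DATABASE DROP LOGFILE GROUP 1;
ALTER SYSTEM SET standby_file_management=AUTO;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;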

Anyway, and happily, I don’t generally go around resizing redo logs more than once in the lifetime of a database. Just be aware that SQL*Plus gets a bit upset if you do and neglect to do it equivalently on both sides of your Data Guard setup.

Archibald Primrose, cut-throat, thief and leader of the infamous Slethwick Street gang of nineteenth-century East London pick-pockets, was…

Er, no.

Sorry… wrong notes. That’s actually Archibald Primrose, 5th Earl of Rosebery, sometime Prime Minister of the United Kingdom of Great Britain and Ireland (as it was back then).

An easy mistake to make, I rather think, all the same.

Anyway, “Slasher” Rosebery makes it to these pages because his name is associated with the secondary storage server a Data Guard environment will need to use. In the language of this blog, Rosebery is to Asquith what Balfour is to Salisbury: the secondary server in an Active Data Guard configuration using ASM via iSCSI shares. A new article on how to build one has just gone up.