Continuing the saga of “catching up” with software developments whilst I was en route to the UK from Australia, the question next to be addressed is simply: does Atlas work to allow Oracle 12c Release 2 to be installed onto the many and various distros it claims to support.
The answer is a bit mixed, I’m afraid!
First thing to say: I deliberately coded Atlas to specifically declare 12cR2 ‘didn’t work yet’ when I first released it. So, for it to work for 12cR2 at all, that bit of code has to be removed -and that means Atlas straightaway gets bumped to version 1.2.
Once that code has gone, the good news is that Atlas will successfully help 12cR2 be installed on the majority of the distros it works with (15 out of 19, as it happens). But the bad news is that if your distro is Ubuntu or even just Ubuntu-based, then the linking phase of the Oracle 12c Release 2 installation fails irreparably after Atlas has been run.
The complete list of distros and their results with Atlas for 12c Release 2 is therefore as follows:
Debian 8.2+ ............................ Works fine
Linux Mint 18+ ......................... Fails
Mint Debian Edition 2+ ................. Works fine
Red Hat ES 7.0+ ........................ Works fine
Scientific Linux 7.0+ .................. Works fine
CentOS 7.0+ ............................ Works fine
OpenSuse Leap 42+ ...................... Works fine
Antergos 2016.11+ ...................... Works fine
elementary OS 0.4+ ..................... Fails
Mageia 5+ .............................. Works fine
Korora 25+ ............................. Works fine
Zorin Core 12 .......................... Fails
Ubuntu 16+ ............................. Fails
Manjaro 15+ ............................ Works fine
Fedora 23+ ............................. Works fine
Peppermint Linux 7+ .................... Fails
GeckoLinux Static 422+ ................. Works fine
Chapeau Linux 24+ ...................... Works fine
PCLinuxOS 2016+ ........................ Works fine
As I say, the pattern is pretty obvious and I suspect the problem is that the gcc versioning tricks I had to pull to get 12cR1 to compile properly on any Ubuntu-based distro back in January are now the cause of woes for 12cR2. The failure always manifests itself in the following manner:
There’s no recovering from that as yet, but I hope to get it sorted within the week.
On the other hand, for any distro listed as working fine in the above list, you can expect the usual plain sailing:
That’s specifically Oracle 12.2.0.1 running on GeckoLinux, which is a spin of openSUSE Tumbleweed, but you get the same outcome for all non-Ubuntu distros.
Whilst testing all this, I discovered a couple of distros which had incremented their version strings since January (Manjaro, for instance, now reports itself as version ’17.something’, so the part of Atlas that checked for version strings containing the numbers 15 or 16 obviously needed updating). Those sorts of versioning updates are now also included in Atlas 1.2.
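The sort of check that needed updating can be sketched like this (a hypothetical illustration only -Atlas’s actual source isn’t reproduced here, and the function name is made up):

```shell
# Hypothetical sketch of the version-string test a pre-installer performs:
# accept Manjaro only if its reported version starts with 15, 16 or 17.
check_manjaro_version() {
  case "$1" in
    15*|16*|17*) echo "supported" ;;
    *)           echo "unsupported" ;;
  esac
}

check_manjaro_version "17.0.1"    # a current Manjaro release
check_manjaro_version "14.12"     # too old to be recognised
```

The point being that a hard-coded list of acceptable version prefixes has to chase the distro’s own version bumps forever, which is exactly the maintenance burden described above.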
There is very little updating of the Atlas doco to do, happily. For a start, you will obtain the new Atlas version just by running exactly the same wget command as was previously documented: the URL alias simply points to the latest version, but the URL itself doesn’t change.
When you install Oracle 12cR2 onto any of these non-standard distros (except the RHCSL ones, of course), you will get this dire-looking warning:
It’s better than the 12cR1 equivalent, which was to say ‘your system is inadequate’! Anyway, for all the distros for which Atlas works at all, it’s perfectly OK to say ‘yes, I want to continue’. The installation will succeed anyway.
Have fun… and wish me luck whilst battling with Ubuntu later this week 🙂
Let me start by wishing a happy New Year to all my readers, complete with fireworks from our local council’s display!
And then let’s swiftly move on to the bad news!
If you are interested in installing Oracle onto non-approved Linux distros, you are very soon going to have to contend with this sort of error message:
/usr/bin/ld: /u01/app/oracle/product/12.1.0/db_1/lib//libpls12.a(pci.o): relocation R_X86_64_32 against `.rodata.str1.4' can not be used when making a shared object; recompile with -fPIC
This will be found in the Oracle installer’s log immediately after the “linking phase” of the Oracle installation starts.
Unfortunately, the error message dialog that appears at this point looks like this:
…and that particular error message has long been familiar from 11gR2 installs on assorted distros. The workarounds then were to add various compilation flags to assorted makefiles.
But in this case, the graphical error dialog deceives: for a start, this is happening on 12.1.0.2, and although the dialog text is the same as back in the 11gR2 days, the underlying cause is completely different. It’s only when you inspect the installActions log that you (eventually!) see the error text I showed above, which tells you that this is no “ordinary” compilation problem.
Putting it as simply as I know how, the basic idea of position-independent code is that it allows execution of code regardless of its absolute memory address. It’s thus a ‘good thing’, on the whole.
Trouble is, if objects within the code you’re trying to compile haven’t themselves been compiled to be position-independent, then you aren’t yourself allowed to compile the code that references them into shared libraries.
As the error message above says, since “pci.o” isn’t position-independent, you can’t compile references to it into the libpls12 library. Note that the error message does not mean that your attempt to compile libpls12 should use -fPIC: if it meant that, you could do something about it. No: it’s telling you that pci.o was compiled by Oracle without -fPIC. Only if they re-compile that object with the -fPIC compiler option switched on would you then be able to compile it into the libpls12 library successfully.
If you’re best mates with Larry, then, perhaps you’ll be able to get him to do the necessary recompilations for you! Mere mortals, however, are stuck with the unhappy fact that vast swathes of the Oracle 12c code-base have not been compiled to be position-independent… and there’s nothing you can do to fix that when your version of gcc insists that they should be.
The problem doesn’t manifest itself on all distros: Ubuntu 16.04, for example, has no problem installing 12c at all (apart from the usual ones associated with not using a supported distro, of course!) But Ubuntu 16.10 does have this fatal problem. Similarly, Debian 8 is fine with Oracle 12c, but Debian 9 (the testing branch, so not yet ready for mainstream release) fails. And whereas Manjaro didn’t have a problem earlier in the year when I first released my mercury pre-installer script, it does now.
This, of course, gives us a clue: there’s clearly some component of these distros which is being upgraded over time so that later releases fail where earlier ones didn’t. So what’s the upgraded component causing all the trouble?
Perhaps unsurprisingly, given that it’s a compilation error that shows you there’s a problem in the first place, it turns out that the gcc compiler is the culprit.
If you do a fresh install of Ubuntu 16.04 (the long-term support version, so still very much current and relevant), whilst making sure NOT to update anything as part of the installation process itself, issuing the command gcc -v will show you that version 5.4.0 is in use. Do the same thing on Ubuntu 16.10, however, and you’ll discover you’re now using gcc version 6.2.0.
A fresh Debian 8.6 install, subjected to an apt-get install gcc command, ends up running gcc version 4.9.2. The same thing done to a fresh Debian 9 install results in a gcc version of 6.2.1.
Manjaro is a rolling release, of course, so its software components are forever being incrementally upgraded: that makes finding out what gcc version was in use at the start of the year rather tricky! So I don’t have hard evidence for the gcc version shift there -but my main desktop is currently reporting version 6.2.1, and I’d lay odds that, had I checked back in January 2016, I’d have found something around version 5.
In short, for all three distros currently under my microscope, a shift from gcc 4- or 5-something to 6-something has taken place… and broken Oracle’s installation routine in the process.
It means that all distros will come across this compilation problem as they upgrade their gcc versions. Expect Fedora to keel over in short order, for example, when version 26 is released next April (assuming they go for a late-release gcc 6, which I expect they will). No doubt we’ll all have moved on to installing Oracle 12c Release 2 by then, which is probably suitably position-independent throughout… so maybe no-one will ever have to worry about this issue again. But in the meantime… the constantly changing nature of gcc is a problem.
So, what’s to be done if you want Oracle 12c installed on these distros with their fancy new gcc versions? Nothing I could really think of, except to ensure that the old, functional versions of gcc and related development tools are installed …and that can be easier said than done!
On Debian 9 (‘testing’), for example, instead of just saying apt-get install gcc, you need to now say apt-get install gcc-5, which ensures the ‘right’ compiler version is installed, with which the Oracle installer can live. Thus can this screenshot be taken:
…which shows me happily querying the EMP table on 12.1.0.2 whilst demonstrating that I’m running Debian Testing (codenamed “Stretch”). That’s only possible by careful curating of your development tool versions.
The same sort of ‘install old gcc version and make it the default’ trick is required to get 12c running on Ubuntu 16.10 too:
-though I had to specifically install gcc-4.9 rather than “gcc-5”, since the compilation error still arose when ‘gcc-5’ was installed. These things get tricky!
Anyway: there it is. Gcc’s constant version increments create havoc with Oracle 12c installations. Coming to a distro near you, soonish!
No version of SQL Server supports replicating to Oracle 12c out of the box.
This took us a little by surprise at work, since things had been replicating just fine in our equally-12c test environment. The difference in Production that arose this week, however, was that somebody decided to re-initialise the SQL Server publish-and-subscribe replication in that environment. This caused SQL Server to drop the replication-controlling table from Oracle and to re-create it from scratch -and only at that point does the lack of 12c support reveal itself.
The symptoms are easy to describe: if you check your replication monitor in the SQL Server Management Studio, you’ll see this error (or variations on it):
The cause of this is also quite easily found. In the SQL Server msdb database (under Programmability > Stored Procedures > System Stored Procedures, if you’re browsing in the Management Studio), you’ll find a procedure called sys.sp_MSrepl_createdatatypemappings. It contains this code snippet:
IF OBJECT_ID(N'dbo.MSdbms_datatype', 'U') IS NULL
BEGIN
    print 'Creating table MSdbms_datatype'
    create table MSdbms_datatype
    (
        datatype_id int NOT NULL IDENTITY,
        dbms_id int NOT NULL,
        type sysname NOT NULL,
        createparams int NOT NULL DEFAULT 0,
        CONSTRAINT pk_MSdbms_datatype PRIMARY KEY (datatype_id),
        CONSTRAINT fk_MSdbms_datatype_dbms_id FOREIGN KEY (dbms_id) REFERENCES MSdbms (dbms_id)
    )
    exec dbo.sp_MS_marksystemobject 'MSdbms_datatype'

    -- Define default dbms data types
    exec sys.sp_MSrepl_MSSQLdatatypes 'MSSQLServer'
    exec sys.sp_MSrepl_DB2datatypes 'DB2'
    exec sys.sp_MSrepl_ORAdatatypes 'Oracle'
    exec sys.sp_MSrepl_ORAdatatypes 'Oracle', '8'
    exec sys.sp_MSrepl_ORAdatatypes 'Oracle', '9'
    exec sys.sp_MSrepl_ORAdatatypes 'Oracle', '10'
    exec sys.sp_MSrepl_ORAdatatypes 'Oracle', '11'
    exec sys.sp_MSrepl_SASdatatypes 'SYBASE'
END
You’ll note that it’s creating a table to govern the ‘data type mappings’ between SQL Server and various other databases -in other words, and as an example, where SQL Server says ‘INT’ or ‘BIGINT’, Oracle will say ‘NUMBER’. SQL Server needs to be able to translate its data types to those used by ‘foreign’ databases, and this code sets that translation mechanism up.
Unfortunately, you’ll also note that whilst the code goes on to populate this data mapping table with translation entries for Oracle versions 8, 9, 10 and 11… there’s nothing at all for Oracle 12! It’s this lack of a description of which SQL Server data types should become which Oracle 12c ones that explains replication’s inability to create its MSREPL7 table correctly.
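You can see the gap for yourself by querying the mapping metadata directly (a sketch: MSdbms is the table of known DBMS/version combinations referenced by the foreign key in the snippet above, and I’m assuming its documented dbms and version columns):

```sql
-- Hypothetical check: which Oracle versions does this instance know about?
select dbms, version
from msdb.dbo.MSdbms
where dbms = N'ORACLE';

-- On an unpatched instance, the versions listed stop at 11:
-- there is no row for Oracle 12.
```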
If, despite receiving that earlier error when first attempting to create a new subscription to an Oracle database, you describe the MSREPL7 table that will actually have been created in Oracle, you’ll see this sort of thing:
Notice all those “INTERVAL DAY(2) TO SECOND(6)” columns? They’re actually supposed to be NUMBER(10) and NUMBER(3) columns. Those BINARY FILE LOB columns are all supposed to be RAW(16), too.
MSREPL7 is the ‘governor’ for SQL Server replication: it records what tables are supposed to be replicating and what transaction they’re up to in the destination (Oracle) database. When the SQL Server source database next wants to send transactions to Oracle, it’s this table which tells it which transactions have already been sent and which have not. So when this ‘governor’ table is screwed up, you won’t get very far with any replication!
As an aside, things are slightly more complicated if you upgrade to 12c from a previously-working 11g. In that case, you’d already have had replication working nicely, even after completing the 12c upgrade process: the MSREPL7 table would have all the right data types, so there’d be no impediment to replication continuing to work. The other tables you’re replicating will also already exist in your newly-upgraded 12c database, complete with all the right columns, so they will replicate nicely as well. You will therefore think that all is well… because it is!
It’s only when someone decides to reinitialize replication that it will all go pear-shaped, because it’s only at that point that SQL Server will drop the tables being replicated and attempt to re-create them… with all the wrong data types! When you then attempt to push a number from SQL Server into a column in Oracle that’s been freshly-created as an INTERVAL data type, that’s when sparks will fly.
This is why we didn’t realise that upgrading to 12c would be an issue in our various test environments: no-one had attempted to re-initialize the replication; and with several multi-hundred-million-row tables to be pushed across into Oracle, it was a fair bet that no-one would be daft enough to do that in Production either! Unfortunately, my gambling instincts were wrong on this occasion: someone was daft enough to hit the wrong menu options, and thus the trouble began.
I couldn’t find much about this problem on Google, which was rather surprising (if one major database has trouble replicating to another major database, I would have expected to hear a lot about it by now!)
I did find enough to make me look at the column types in MSREPL7, and notice that all sorts of odd data types were in use. That got me hunting down any references to how SQL Server knows what data type maps to what in Oracle. And finally, that got me to the stored procedure whose code appears above.
Looking at that code, it’s just begging for an extra line to be added to it, no?! I mean, if it read:
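That is, the tail of the procedure with one extra line bolted on for Oracle 12 (a reconstruction, following the pattern of the existing entries):

```sql
    exec sys.sp_MSrepl_ORAdatatypes 'Oracle', '10'
    exec sys.sp_MSrepl_ORAdatatypes 'Oracle', '11'
    exec sys.sp_MSrepl_ORAdatatypes 'Oracle', '12'   -- the missing line
    exec sys.sp_MSrepl_SASdatatypes 'SYBASE'
```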
…then maybe all the troubles would go away, right?
But you can’t just run that sort of thing in the standard SQL Server Management Studio. For this, you need to run it in DAC (or ‘dedicated administrative connection’) mode. To do that, fire up SSMS as normal and connect to your SQL Server instance as normal. Click New Query to start a new query window. Finally, click the ‘change connection’ button, just to the left of the database selector:
That will pop-up a new dialog box to allow you to make a fresh connection to your SQL Server instance:
But notice this time that I’ve stuck ADMIN: in front of my usual ‘server name’ connection string. That will connect you in the right ‘god mode’. You can’t really tell anything is different from a non-DAC connection, apart from the ‘admin:’ identifiers scattered all round the SSMS interface if you look in all the right places!
Anyway, once you’ve got a DAC connection, you can type in this lot of SQL and run it:
exec sys.sp_MSrepl_ORAdatatypes 'Oracle', '12'
exec sys.sp_MSrepl_MSSQL_ORA_datatypemappings @source_dbms = N'MSSQLSERVER', @destination_dbms = N'ORACLE', @destination_version = '12'
If all goes according to plan, you should see this sort of response:
Once the data types have been created for Oracle 12, drop the MSREPL7 table from your Oracle database and then try re-initializing your replication subscription once again. This time, it should just work:
That shows me dropping MSREPL7 in a SQL*Plus window in blue; then I went and re-initialized the replication in SQL Server (which you can’t see me doing). You then see me describing the structure of MSREPL7 in the blue window: notice (a) the table has been re-created despite just being dropped; and (b) it’s been created with NUMBER and RAW data types for its columns, not INTERVALs and BLOBs. The Synchronization Status progress window finally shows that replication transactions were successfully delivered to the Oracle database.
You can also issue the required fixup SQL in the command-line sqlcmd tool. Connect to your database in DAC mode with the “-A” switch:
sqlcmd -S winsql\winsqldb -E -A
The -S takes the form it does here because it describes the server AND the instance to connect to (I use named instances). Hence “winsql” is the server name and “winsqldb” is the instance to connect to running on that server. The -E switch means “use Windows authentication”, because that’s what I do! If you use named users, you’d use the -U switch and specify the username (and the -P switch if you fancied typing his password in clear text on the command line!). The -A switch means, as I’ve mentioned, connect in DAC mode.
Note that DAC mode is only available locally: that is, you have to be working on the actual server that’s hosting the SQL Server instance. If you insist on running the fixup SQL remotely, you can enable remote DAC access by right-clicking the instance in SSMS, clicking Facets, selecting Surface Area Configuration from the drop-down menu and then setting RemoteDacEnabled to True. You also need the SQL Browser service to be running, which you can switch on from the normal services utility, or by running the SQL Server Configuration Manager, clicking the SQL Server Services item, right-clicking SQL Browser in the right-hand pane and clicking Start.
Incidentally: you may have thought (as I innocently did) that upgrading your SQL Server from (say) 2008 R2 to (say) 2016 would help. All I can tell you is that the code snippet which appears at the top of this piece is as identically 12c-deficient in SQL Server 2016 as it is in any previous version. I found this surprising and completely inexplicable: I’m prepared to forgive SQL Server vintage 2008 for not knowing about Oracle 12c; I find it much harder to understand why SQL Server 2016 is silent on the subject!
So, as far as I know, this trick of manually adding a reference to 12c in DAC mode is the only way to get any version of SQL Server replicating to Oracle at all.
Whether bolting on a ’12’ to the end of a bit of code and running it is actually supported by Microsoft, I can’t say: I’ve looked for any indication it’s a properly-supported technique in all the usual places and turned up nothing either way.
But if you’re replicating SQL Server to Oracle 11g and thinking of upgrading to 12c any time soon, it might be useful to bear in mind that you’re going to need this technique should you ever need to re-initialize your subscriptions 🙂
I mentioned some time ago that Churchill was looking pretty sick since no RHCSL 6.x distro with a version number greater than 6.6 seemed to deliver its software on a single DVD any more, and since multi-DVD installs break Churchill, Churchill was doomed.
But I was (happily) quite wrong on that point. It turns out that, in fact, CentOS is the only RHCSL distro that has chosen to package its distro in such a way that it breaks Churchill.
I’ve checked OEL 6.6, 6.7 and 6.8; then Scientific 6.6, 6.7 and 6.8; and I even checked genuine Red Hat 6.6, 6.7 and 6.8… and every single one of them works fine with Churchill.
So, as I mentioned last time, that now means Scientific Linux is the distro that Churchill will assume is in use if you use its ‘speed keys’ to build your Oracle server nodes. (Rather more bluntly, it means I’ve stopped using CentOS at all, for anything). You can still use OEL or Red Hat if you prefer, by spelling out your requirements on the boot strap line. But use a speed key to save some typing, and SL will be assumed.
And, of course, if you want to use CentOS 6.7 or 6.8 to build your Oracle servers using Churchill, forget it: because they just don’t work.
Why is the new default Scientific Linux and not, say, OEL? Simply because it’s (a) free to obtain without signing up for anything and (b) it’s not as butt-ugly as OEL when you boot it up (white text on a red background is neither pretty nor particularly legible).
Other than this issue of CentOS no longer working, the latest version of Churchill (we’re now up to version 1.6) introduces a rather more radical change of tack: no more support for Oracle 11g installations, of any sort, whatsoever. Version 1.5 is still available, and it does still do 11g quite nicely -but 11g is very long in the tooth now, and completely out of support unless you happen to be using 11.2.0.4 (for which an expensive paid support agreement is needed before it can even be downloaded). It therefore seemed a waste of time to keep 11g support in newer Churchill releases. So, 12c only it is.
Which of course means that Churchill’s hardware requirements have just gone through the roof to match Oracle’s own. A 2GB VM just doesn’t cut it any more. The new hardware minima are: 5GB RAM and a 40GB hard disk. So building a 2-node RAC and a 2-node Active Data Guard setup on your desktop now basically demands a 32GB desktop… and no laptops need apply.
All of which in turn means that I think it might be time to re-write the Churchill documentation (especially as it currently exists only as pages on the old version of this site): take the opportunity to spruce it up, refresh it, make it part of the native WordPress site and so on. Expect something soon-ish.
Having just finished a push for domestic consistency by installing Ubuntu 16.04 on practically every machine I control (which therefore definitely excludes a certain Significant Other’s Windows PC!), I now see that the Linux Mint crew have just released a new version of their green and fragrant distro.
Never one to let the opportunity of a new distro installation pass me by if I can help it, I have accordingly just installed the Mate version onto my ancient (circa 2009) laptop… and it’s running nicely. Pretty slow, of course, but hardly unbearable. I probably don’t want to be doing Blender renders on it any time soon, but it’s fine for a bit of web editing, browsing, video playing and photo retouching.
It also has its window controls (maximise, minimise and close) on the right side of the window title bar (which is to say, yes, it’s on the right-hand side, but it’s also the correct side as far as I’m concerned!). It is a small cosmetic change from vanilla Ubuntu, I guess; but it’s enough that it might persuade me to deploy it on my main PC instead of the current Ubuntu install. Time will tell.
I noticed in passing that though I had written simple installation scripts for Oracle 12c for a lot of other distros I’ve dabbled with over the years, I had curiously neglected to write one for Linux Mint, of whatever vintage.
So I’ve rectified that. On the Oracle articles page is a new one for installing 12.1.0.2 on Linux Mint 18. I’ve run out of inspiration for my script names now, so this one is just called Mentha -which, as we all know, is the Latin for ‘mint’.
I tested it on both the Cinnamon and Mate spins, but the screenshots in the article are all from Mate (for absolutely no reason at all, other than I happened to have it handy at the time the screenshots needed to be taken).
I should perhaps mention that I’m trialling new ‘slideshow’ technology in this article: the bazillion screenshots don’t take up nearly so much room on the screen, and they fade neatly from one to the next as you step through the slideshow, with the image captions providing the instructions. It looks good -but there are at least two issues with it. One, if you’re using script blockers, the slideshow won’t work properly. And two, the plain-text version of the article (obtainable by clicking the ‘print’ link at the very end of it) displays all the screenshots sans instructions, which is somewhat less than useful! If that is an issue for you, let me know and I’ll re-think. I am not entirely sure how many people bother with the plain-text versions of articles, so I don’t know how much of a deal-breaker this is. As I say, tell me (in the comments) if it is one for you…
With Ubuntu 16.04 as my new desktop, it was inevitable that I would need to cook up one of my pre-installer scripts to help automate the installation of Oracle 12c onto it.
Like the pre-installers for various other distros and O/Ses, I’ve given my script a name… and “Mandela” was the obvious choice. Ubuntu is a South African word that means ‘oneness of humanity’ (very roughly!). Mandela was intimately aware of it and was a great exponent of it. It also meant a lot to Mark Shuttleworth, the founder of Canonical, developer of Ubuntu, the O/S. It’s the reason why he called his distro ‘Ubuntu’ in the first place.
So anyway: I’ve put together an article on how you use mandela (the Oracle preinstaller) to make installation of 12c on Ubuntu relatively painless. The script is available for download either here or here.
The article about using clooney.sh to configure Solaris as a platform on which to run Oracle 11gR2 or 12c is finally complete.
Which means you have the perfect environment in which to replicate my recent bug-hunting expedition!
Take an 11gR2 (i.e., 11.2.0.3 or 11.2.0.4) database running on Solaris (or Linux or Windows, come to that), and create a non-SYS user with DBA privileges:
SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
PL/SQL Release 11.2.0.4.0 - Production
CORE    11.2.0.4.0      Production
TNS for Solaris: Version 11.2.0.4.0 - Production
NLSRTL Version 11.2.0.4.0 - Production

SQL> create user hjr identified by hjr;

User created.

SQL> grant dba to hjr;

Grant succeeded.
Now, connected as that new non-SYS user, here is a little script to create and populate two tables, one partitioned and one not (so you need the Enterprise Edition for this, otherwise you can’t use partitioning):
create table ABC (key NUMBER(10) not null);
insert into abc values (10);
insert into abc values (11);
insert into abc values (9);

create table DEF (key NUMBER(10) not null)
partition by range (key)
(partition p0 values less than (10),
 partition p1 values less than (11),
 partition p2 values less than (12),
 partition p3 values less than (13));

alter table ABC add primary key (key) using index;
create index DEF_IX on DEF (KEY) local;

insert into def values (10);
insert into def values (11);
commit;
Table ABC contains three rows; table DEF contains 2 -and they happen to share values with two of the rows in the ABC table. There’s an index on both tables, too: on the partitioned table, the index is locally partitioned.
Now, create a little procedure that does nothing more than join the two tables and select for values greater than or equal to something:
create or replace procedure codetest (v_k IN number)
is
  v_txt varchar2(1000);
begin
  select abc.key into v_txt
  from abc, def
  where abc.key = def.key
  and abc.key >= v_k
  and rownum < 2;
  dbms_output.put_line(v_txt);
end;
/
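With serveroutput switched on, running the procedure for each of the two overlapping values behaves itself -a transcript along these lines:

```sql
SQL> set serveroutput on
SQL> exec codetest(10);
10

PL/SQL procedure successfully completed.

SQL> exec codetest(11);
11

PL/SQL procedure successfully completed.
```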
So both values 10 and 11 work fine, and the little select statement contained within the procedure correctly fetches both values as required. So far, so normal.
Now do this:
alter index DEF_IX MODIFY PARTITION p1 unusable;
That renders the p1 partition of the index on the DEF table unusable. Note that partition p1 would hold values “less than 11 (and greater than 9)”, so no way, no how is value 11 involved in that partition. Now run the procedure again, selecting for value 11:
SQL> exec codetest(11);
BEGIN codetest(11); END;

*
ERROR at line 1:
ORA-01008: not all variables bound
ORA-06512: at "SYS.CODETEST", line 5
ORA-06512: at line 1
Though partition p1 cannot be relevant for the query, the query fails declaring that “not all variables bound” …a classic error that you often get when submitting SQL queries using bind or substitution variables (along the lines of “select * from emp where ename=:B1”), and when you’ve forgotten to assign a value to one of the variables.
But there are no bind or substitution variables in this code, except in the sense that v_k is one… and that definitely has had a value assigned to it. So why on Earth is the ORA-01008 being raised?
When this happened on one of my production databases, Artem (my esteemed co-DBA) and I puzzled about it for ages. We suspected corruption (but DBMS_REPAIR said all was fine). We re-wrote the select part of the PL/SQL code every which way we could think of, just in case some syntactical issue was tripping up the optimizer. We did this, we did that… but then Artem did the following:
SQL> alter index DEF_IX REBUILD PARTITION p1;

Index altered.

SQL> select status from dba_ind_partitions where index_name='DEF_IX';

STATUS
--------
USABLE
USABLE
USABLE
USABLE
(So we know that all the index partitions are fine once more). And:
SQL> alter index DEF_IX MODIFY PARTITION p1 unusable;

Index altered.

SQL> exec codetest(10);
10

PL/SQL procedure successfully completed.

SQL> exec codetest(11);
11

PL/SQL procedure successfully completed.
So this time, we’ve marked the same partition as before as unusable before we do anything else, and then we run the procedure selecting for value 10… which is involved in the index partition we’ve just invalidated. And yet the code still works! And, even more surprisingly, if the procedure worked fine on value 10, it goes on to work fine on value 11 too -the same value it had previously failed at in the presence of the same unusable index partition.
So then we immediately did this:
SQL> alter system flush shared_pool;

System altered.

SQL> exec codetest(11);
BEGIN codetest(11); END;

*
ERROR at line 1:
ORA-01008: not all variables bound
ORA-06512: at "SYS.CODETEST", line 5
ORA-06512: at line 1
Just by flushing the shared pool, the code reverts back to its previous behaviour, where it raises an ORA-01008 for value 11 …the same value it just worked for!
I’ve no way of knowing precisely what is going on, but it seems pretty apparent we have a bug here -but one that took a lot of hunting down! The ingredients needed to make the bug reveal itself would seem to be:
Joining a non-partitioned table to a partitioned table
The non-partitioned table has values less than any of the values in the partitioned table
The two tables share at least two values
You query to retrieve the higher of the two values in the partitioned table that overlap/match with values in the non-partitioned table. In this case, the overlapping values were 10 and 11 -query for 10 and the ORA-01008 never arises. Query for 11 in the presence of unusable index partitions and the non-existence of a previously-compiled shareable cursor and the ORA-01008 errors come thick and fast.
And your query must use “>=”. If it selects for “<=” or “=”, then the index invalidation never causes the ORA-01008 to appear, no matter what index partitions are unusable or what value you select for.
You have to have the initialization parameter skip_unusable_indexes set to TRUE (which is the default in 11g). If it’s set to FALSE, then the existence of unusable index partitions never causes an ORA-01008 to be raised under any circumstances.
You’re not doing this as SYS. SYS seems to be immune from whatever code-path causes the ORA-01008. Non-SYS users can be victims of it!
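Pulling those ingredients together, a minimal repro sketch might look something like this. To be clear about what’s assumed: only the index name DEF_IX and the procedure name codetest come from the transcripts above; the table names, column names, data values and partition boundaries are all my own inventions, chosen simply to satisfy the conditions in the list.

```sql
-- Hypothetical repro sketch: run as a non-SYS user, with the default
-- skip_unusable_indexes=TRUE in force. All names except DEF_IX and
-- codetest are invented for illustration.
create table part_tab (id number)
  partition by range (id)
    (partition p1 values less than (11),
     partition p2 values less than (21),
     partition p3 values less than (31),
     partition p4 values less than (maxvalue));

create index def_ix on part_tab(id) local;

create table plain_tab (id number);   -- the non-partitioned table

insert into plain_tab values (5);     -- lower than anything in part_tab
insert into plain_tab values (10);    -- shared value (the "safe" one)
insert into plain_tab values (11);    -- shared value (the "fatal" one)
insert into part_tab  values (10);
insert into part_tab  values (11);
commit;

create or replace procedure codetest (n in number) as
  v_count number;
begin
  select count(*) into v_count
  from   part_tab p, plain_tab q
  where  p.id = q.id
  and    p.id >= n;      -- the ">=" is one of the required ingredients
  dbms_output.put_line(n);
end;
/

alter index def_ix modify partition p1 unusable;
exec codetest(11)        -- the point at which the ORA-01008 would appear
```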
If you get all those conditions met, then marking an index partition unusable seems to cause the internal, recursive SQL (that’s generated under the hood whenever you make use of the database) to be using a bind variable to which no value has been assigned. The ORA-01008 is therefore “correct” in the sense that there’s genuinely a bind variable somewhere that’s missing a value, but it’s a problem with Oracle’s own internal SQL, not the code you’re running.
If you happen to query for value 10, you don’t encounter the problem in the presence of an unusable index partition -and the cursor that’s created (and stored in the library cache) when your query is successfully parsed is then available for use by the procedure when you then make it query for value 11. So if you’ve successfully queried for value 10, you can successfully query for value 11 (or any other value), because a parsed cursor is already at your disposal. But flush the shared pool, and that cursor is wiped -so that when you now select for value 11, a new cursor has to be compiled… and the problem with the recursive SQL needed to make that happen then rears its ugly head once more.
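You can watch this cursor-caching effect for yourself by peeking in the library cache before and after the flush. A sketch only: the text filter below assumes the invented table name from my repro above and is purely illustrative.

```sql
-- If a shareable parsed cursor for the procedure's query already exists,
-- subsequent executions reuse it and never trip over the broken
-- recursive SQL. Filter text is illustrative, not definitive.
select sql_id, executions, is_shareable
from   v$sql
where  lower(sql_text) like '%part_tab%';

alter system flush shared_pool;   -- wipes the cursor; the next call
                                  -- must hard-parse, and the bug returns
```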
Happily, the bug would seem to have been fixed in 12c, but 11.2.0.4 remains supported for a good few years yet! Therefore I’ll be raising an SR soon enough, and we’ll see what Oracle has to say about it. I’ve already raised an Oracle forums question about it, and got some useful suggestions which got us on the right track (though quite a few more comments about the quality of my code than were strictly warranted, I feel! Why do people look at a forest and see trees quite so often?)
There are so many variables that make the difference between the procedure failing or succeeding, though, that this truly did feel like Pooh and Piglet hunting Woozles. And at the end, I felt as if I’d fallen into a Heffalump trap for Poohs (or Piglets):
“I have been Foolish and Deluded,” said he, “and I am a Bear of No Brain at All.”
“You’re the Best Bear in All the World,” said Christopher Robin soothingly.
“Am I?” said Pooh hopefully. And then he brightened up suddenly.
As promised, Salisbury and Asquith have been “retired” and have accordingly disappeared from the front page. They can still be reached from the Downloads page, though, should anyone still need them.
Churchill is now very nearly completely documented and replaces both. The only thing still missing is the description of how to create a Wilson server to act as an Enterprise Manager Cloud Control, and that should be finished by the end of the week.
I’ve also set up my own “ownCloud” hosting service and am hosting the Churchill ISO from there rather than from Dropbox. I think it’s all working and the necessary files are available to all, but if you run into any problems, just drop me a line and I’ll get it sorted ASAP.
Suppose that about six weeks ago you, as a proactive kind of DBA, had noticed that your 2TB database was running at about 80% disk usage and had accordingly asked the SysAdmin to provision an additional 2TB slice of the SAN so that you could then add a new ASM disk to your database.
Imagine that the SysAdmin had provisioned as requested, and you as the DBA had applied the change in the form of adding a new ASM disk to your production instance -and that, in consequence, you’d been running at a much healthier 50% disk usage ever since. You’d probably feel pretty good at having been so proactive and helpful in avoiding space problems, right?
Suppose that weeks pass and it is now late October…
Now imagine that, for some reason that made sense at the time, you kick off a new metadata-only Data Pump export and -workplace distractions being commonplace- lose sight of it, until six hours after you started it you get told there’s a Sev 1: the non-ASM, standard file system to which your data_pump_dir points has hit 100% usage and there’s no free space left. Foolish DBA!
But no matter, right? You just kill the export job and clear out the offending dump files… and suddenly the OS is happy there’s space once more on its main hard disk.
But pile up the hypotheticals: the OS reports itself happy, but suppose you nevertheless discover that, as a result of the space problems caused by your export, none of the nodeapps are listed as running on Node 1, and any attempt to start them with srvctl on that node ends with an error message to the effect that it can’t contact the OHASD/CRSD on that node.
Suppose GV$INSTANCE still returns a count of 2: Node 1 is therefore still up, but no-one can connect to it, because no services are able to run on it. Basically, your Node 1 has kind-of left the building and the only possibility of getting it back, you might reasonably think, would be a whole node reboot. Thank God Node 2 is still up and has no difficulty working alone for a few hours! It’s good enough to cope with the rest of the day’s workload anyway.
So, in this hypothetical house of horrors, suppose you arrange a scheduled outage in which you will reboot Node 1 and wait for it to come back up as a fully-fledged cluster member once more. It should only be a matter of moments before Node 1 is back to its normal happy state, noticing that the non-ASM disk has loads of space once more, right?
Only, imagine that it doesn’t. Imagine instead that it takes at least 10 minutes to restart and is still unresponsive at that point, looking like it might take another 10 minutes more. Imagine, indeed, that after yet another 10 minutes on top of that lot, you look at the ASM alert log for Node 1 and find these entries:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "1" is missing from group number "2"
At this point, hypothetically… you might start adding 2 and 2 together and getting an approximation of 4: for you would know that disk 1 is the new 2TB one that you added to the database way back in September.
But why would that new disk, which has been in daily and heavy use ever since, be posing a problem now rather than before? You might start idly wondering whether, when it was provisioned, it was somehow provisioned incorrectly. This being the first reboot since that time, tonight (for it is now past midnight) is perhaps the first opportunity that mis-provisioning has had to reveal itself?
You might at this point very well make a big mental note: on no account reboot Node 2, because if it loses the ability to read the ASM disks too, the entire primary site will have been destroyed.
It would make for an interesting night, wouldn’t it? Especially if the SysAdmin who did the disk provisioning back in September was no longer available for consultation because he was on paternity leave. In New Zealand.
What might you as the DBA do about this state of affairs? Apart from panic, I mean?!
Well, first I think you might very well get your manager to call the SysAdmin and get him off paternity leave in a hurry -and he might take a quick look over the disks and confirm that he’d partitioned the disk back in September to start from cylinder 0… which is, er… a big no-no.
It is, in fact, perhaps the biggest no-no you can do when provisioning disk space for Oracle ASM. This is because doing so means your physical partition table starts at cylinder 0… but, unfortunately, Oracle’s ASM-specific information gets written at the beginning of the disk you give it, so it over-writes the partition table information with its own ASM-specific data. When ASM data replaces disk partition data… you don’t have any disk partitions anymore. Though you won’t know about it yet, because the disk partition information was read into memory at the time the disk was added and has thus been readable ever since.
To stop that happening, you’re supposed to make sure you start your partitions at something other than cylinder 0. Then Solaris can write partition info literally at cylinder 0, and Oracle’s ASM data can start… NOT at cylinder 0!
Apparently, the only operating system that even allows you to add cylinder-0-partitioned disks is Solaris: Oracle on other operating systems spots the potential for disaster and prevents you from adding it in the first place. Tough luck if, in this hypothetical situation, you’re stuck using Solaris, then!
Until you try and re-mount a disk after a reboot, you don’t know the partition table has been destroyed by Oracle’s ASM shenanigans. The partition information is in memory and the operating system is therefore happy. You can run like this forever… until you reboot the affected server, at which point the ASM over-write of the disk partition information proves fatal.
The second thing you might do is raise a severity 1 SR with Oracle to see if there’s any possible way of fixing the partition table on this disk without destroying its ASM-ness. However, Oracle support being what it is, chances are good that they will simply hum and haw and make dark noises about checking your backups. (Have you ever restored a 2TB database from tape? I imagine it might take one or two days… or weeks…)
So then you might start thinking: we have a Data Guard setup. Let’s declare a disaster, switch over to the secondary site, and thus free up the primary’s disks to be re-partitioned correctly. And at this point, hypothetically of course, you might then realise that when you added a disk to the ASM groups back in September on the primary… er… you probably also did exactly the same on the standby!
This means (or would mean, because this is just hypothetical, right?!) that our disaster recovery site would be just as vulnerable to an inadvertent reboot or power outage as our primary is. And then you’d probably get the sysadmin who’s been contacted by phone to check the standby site and confirm your worst suspicions: the standby site is vulnerable.
At this point, you would have a single primary node running, provided it didn’t reboot for any reason. And a Data Guard site running, so long as it didn’t need to reboot. That warm glow of ‘my data is protected’ you would have been feeling about 12 hours ago would have long since disappeared.
Hypothetically speaking, you’ve just broken your primary and the disaster recovery site you were relying on to get you out of that fix is itself one power failure away from total loss. In which case, your multi-terabyte database that runs the entire city’s public transport system would cease to exist, for at least several days whilst a restore from tape took place.
If only they had decided to use ‘normal redundancy’ on their ASM disk groups! For then you would be able to drop the bad disk forcibly and know that other copies of data stored on the remaining good disks would suffice. But alas, they (hypothetically) had adopted external redundancy, for it runs on a SAN and SANs never go wrong…
At this point, you’ve been up in the wee small hours of the night for over 12 hours, but you might nevertheless come up with a cunning plan: use the fact that node 2 is still up (just!) and get it to add a new, good disk to the disk group and re-balance. The data is distributed off the badly-configured disk onto the new one (which you’ve made triply sure was not partitioned at cylinder 0!)
You could then drop the badly-configured disk, using standard ASM ‘drop disk’ commands. The data would then be moved off the bad disks onto the good ones. You could then remove the bad disk from the ASM array and your Data Guard site would, at least, be protected from complete failure once more.
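Run from the ASM instance on the surviving node, that sequence might look roughly like the following. A sketch only: the disk group name, disk names and device path are invented for illustration, and the rebalance power is an arbitrary choice.

```sql
-- All names here are hypothetical. Run from the +ASM instance on
-- the node that is still up. The new disk must, of course, NOT be
-- partitioned starting at cylinder 0.
alter diskgroup data
  add disk '/dev/rdsk/c0t2d0s6' name data_0002
  rebalance power 6;

-- Watch progress; as noted below, EST_MINUTES is notoriously optimistic.
select operation, state, power, est_minutes
from   v$asm_operation;

-- Only once the first rebalance has finished:
alter diskgroup data
  drop disk data_0001      -- the mis-partitioned disk
  rebalance power 6;
```

The point of the two separate rebalances is that the drop is only safe once the add-rebalance has finished moving data onto the known-good disk; dropping before then would leave the group short of space to evacuate the bad disk.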
Of course, Oracle support might tell you that it won’t work, because you can’t drop a disk from a disk group with external redundancy… because they seem to have forgotten that the second node is still running. And you’ve certainly never tried this before, so you’re basically testing really critical stuff out on your production standby site first. But what choice do you have, realistically?!
So, hypothetically of course, you’d do it. You’d add a disk and wait for the rebalance to complete (noticing along the way that ASM’s ability to predict when a rebalance operation will finish is pretty hopeless: if it tells you 25 minutes, it means an hour and a half). Then you’d drop a disk and wait for another rebalance to complete. And then you’d reboot one of the Data Guard nodes… and when it failed to come back up, you might slump in the Slough of Despond and declare failure. Managers being by this time very supportive, they might propose that you abandon in-house efforts to achieve a fix and call in Oracle technical staff for on-site help. And, that decision having been taken in an emergency meeting, you might idly re-glance at your Data Guard site and discover that not only is the +ASM1 instance up and running after all, but so is database instance #1. It’s actually all come up fine, but you had lacked the patience to wait for it to sort itself out and had declared failure prematurely. Impatient DBA!
Flushed with the (eventual) success of getting the Data Guard site running on all-known-good-disks, you might want to hurry up and get the primary site repaired in like manner. Only this is a production environment under heavy change management control, so you’ll likely be told it can only be fiddled with at 11pm. So you would be looking at having worked 45 hours non-stop before the fix is in.
Nevertheless, hypothetically, you might manage to stay up until 11pm, perform the same add/rebalance/drop/rebalance/reboot trick on the primary’s node 2… and, at around 3am, discover yourself the proud owner of a fully-functioning 2-node RAC cluster once again.
(The point being here that Node 2 on the primary was never rebooted, though that reboot had been scheduled to happen and the SysAdmin sometimes reboots both nodes at the same time, to ‘speed things up’ a bit! Had it been rebooted, it too would have failed to come back up and the entire primary site would have been lost, requiring a failover from the now-re-protected standby. But since Node 2 is still available, it can still do ASM re-structuring, using the ‘add-good-disk; rebalance; drop bad-disk; rebalance’ technique.)
There might be a little bit of pride at having been able to calmly and methodically work out a solution to a problem that seemed initially intractable. A bit of pleasure that you managed to save a database from having to be restored from tape (with an associated outage measured in days that would have cost the company millions). There might even be a bit of relief that it wasn’t you letting an export consume too much disk space that was the root cause, but a sysadmin partitioning a disk incorrectly weeks ago.
It would make for an interesting couple of days, I think. If it was not, obviously and entirely, hypothetical. Wouldn’t it??!