PostgreSQL versus Oracle

It’s been an interesting few days, comparing a production Oracle database performing full-text search on about 10 million records with a rough-and-ready PostgreSQL prototype (with the aim of working out whether the business can save itself a couple of hundred thousand dollars in licensing fees, of course!).

Both systems ran on separate servers with identical hardware: 24GB of RAM; a pair of mirrored 1TB hard drives for the OS and software installation; a pair of mirrored 256GB solid-state disks for the database storage itself; a pair of quad-core, hyperthreaded Xeon CPUs; 64-bit CentOS 5.5 (yup, we run production Oracle on CentOS… gasp!).

The data structures were near-identical in each case: two tables, each containing about half the records, queried individually with the results then “union all’d” into a single set of matching records. One column in each table contained all the comma-separated keywords describing the documents’ contents. That column was indexed with Oracle Text in one case, and with a GIN index on a tsvector in the case of PostgreSQL. The data used was almost identical in each case, too: 9.4 million rows selected from a separate, third database and copied into the two sets of local search tables. Apart from a couple of thousand records that crept into the Oracle database after the PostgreSQL one was populated, there was no real difference between the two data sets. The Oracle version in use was 11.2; the PostgreSQL version was 8.4.4.
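For the curious, here’s roughly what the PostgreSQL side of that arrangement looks like. It’s a minimal sketch under assumptions: the table and column names are invented, since all that really matters is the GIN index over a tsvector derived from the keyword column.

-- Hypothetical names: docs_a and docs_b stand in for the two search tables
CREATE TABLE docs_a (
    id       bigint PRIMARY KEY,
    keywords text     -- comma-separated keywords describing the document
);

-- The GIN index over the tsvector derived from the keyword column
CREATE INDEX docs_a_fts_idx ON docs_a
    USING gin (to_tsvector('english', keywords));

-- Each table is queried individually and the results "union all'd"
SELECT id FROM docs_a
 WHERE to_tsvector('english', keywords) @@ to_tsquery('english', 'negligence')
UNION ALL
SELECT id FROM docs_b
 WHERE to_tsvector('english', keywords) @@ to_tsquery('english', 'negligence');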

Every search performed on the production Oracle database is captured and timed for monitoring and management purposes. I mined the record of those captures for a set of about 27,000 distinct search ‘phrases’ which I could run against the PostgreSQL database. In this way, production Oracle timings could be compared with equivalent PostgreSQL benchmark times.
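Replaying the phrases was nothing elaborate; something like the following sketch would do it. The file name, database name and query shape here are all my assumptions, not the actual harness:

#!/bin/bash
# Replay each mined phrase against PostgreSQL, recording elapsed seconds.
# Assumes one phrase per line in phrases.txt and a database called searchdb;
# phrases containing single quotes would need escaping first.
while IFS= read -r phrase; do
    start=$(date +%s.%N)
    psql -d searchdb -qAt -c "SELECT id FROM docs_a
        WHERE to_tsvector('english', keywords)
              @@ plainto_tsquery('english', '$phrase')" > /dev/null
    end=$(date +%s.%N)
    printf '%s\t%s\n' "$phrase" "$(echo "$end - $start" | bc)"
done < phrases.txt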

If you add up the total number of seconds it took the Oracle database to respond to all of those searches (and to return all the results), you get a grand total of 3,448 seconds. On PostgreSQL (identical hardware, near-identical data), the same searches took 2,207 seconds. Overall, therefore, PostgreSQL was about 36% faster than Oracle on full-text searches ((3,448 - 2,207) / 3,448 ≈ 36%). Not bad for free!

But I wondered how each database handled searches that resulted in many matches. In other words, was PostgreSQL consistently faster than Oracle, or did it handle high-match searches better or worse? Well, here’s the graph:

As you can see, as the searches return more and more records, both Oracle’s and PostgreSQL’s response times start to climb (not, perhaps, surprisingly)… but PostgreSQL slows down a lot earlier than Oracle, and a lot more dramatically. For a search term that matches only 15,000 records (which happens to be our default match limit, hence that vertical red line), PostgreSQL is faster than Oracle by quite a margin (about 60%). But for a search that matches 200,000 records, PostgreSQL is about twice as slow as Oracle; and by the time you’re matching more than 500,000 records, PostgreSQL is more than four times slower. Given that we very rarely perform searches that match that many records (on the grounds that that’s not so much a search as a trawl!), here are the same figures, zoomed in on the left-hand side of the graph:

I think the speed advantage PostgreSQL has over Oracle for a “reasonable” number of matches is much clearer here. But I couldn’t help noticing that the orange line (PostgreSQL) is a lot ‘wavier’ than the blue one (Oracle): in other words, whilst Oracle is slower than PostgreSQL in this range, it is more consistent in its response times. PostgreSQL stays faster, but it behaves in a rather less predictable fashion, especially once the number of records being returned exceeds about 25,000. I suspect this is a consequence of Oracle managing its own buffer cache, whilst PostgreSQL relies much more heavily on the file system cache -and Lord knows what the file system might want to do to that at any given time!

Finally, I wondered how each database coped with search strings of different lengths. Some people might just want documents about “negligence”, for example; others might ask for “negligence, banking, insurance, fraud, liability”… would the length of the search term submitted affect the performance of search at all?

I find this graph particularly fascinating. Oracle (blue) has no problem coping with more and more words in the search phrase: search response times come down almost linearly as the search terms proliferate, probably because each new search term makes the search more precise -and therefore the set of results gets smaller and smaller. But PostgreSQL (orange) behaves in almost completely the opposite way: the more search terms you supply, the slower full text search gets (despite the size of the results set getting ever smaller). It would seem that Oracle has an optimisation here that PostgreSQL lacks. I can only guess that Oracle constructs a bitmap for each search term, direct from the Oracle Text index, and then does a trivial bit of bit-wise AND-ing -it then knows which records are true for all search terms and goes to fetch them. Equally, I can only guess that PostgreSQL has to scan for the results of each search term separately, and only at the end performs row-elimination to arrive at the set of rows which are true for all search terms… and that it’s this multiple scanning of a large amount of data which causes the orange line to slope upwards as the number of search terms increases.
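If you want to watch this happening, EXPLAIN ANALYZE on a multi-term query is the place to look (the names are hypothetical again; the ‘&’ operator means every term must match):

EXPLAIN ANALYZE
SELECT id FROM docs_a
 WHERE to_tsvector('english', keywords)
       @@ to_tsquery('english', 'negligence & banking & insurance & fraud');
-- On 8.4 you should see a Bitmap Index Scan on the GIN index feeding a
-- Bitmap Heap Scan; the plan shape stays the same however many terms you
-- add, so the extra cost shows up in the timings rather than in the plan.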

So, all in all, it’s an interesting result. Do you get anything worth $10,000 when you buy your (dual-CPU SE1) Oracle license? Sure you do: you get a full text search that can handle huge numbers of matches and large numbers of search terms with aplomb. PostgreSQL, in contrast, struggles with both these extremes. But, on the other hand, for absolutely zero dollars, you get a screamingly-fast full text search capability with PostgreSQL -one that performs excellently provided those ‘extremes’ aren’t reached. Considering that any search containing more than half a dozen search terms is more like an essay than a realistic search, and considering that returning half a million matches is more a data dump than a sensible search facility, I’d have to conclude that PostgreSQL is more than capable of being the basis of a very cheap, very viable, very capable search engine for quite large data sets.

I have to say, I like PostgreSQL enormously, despite my very-much-Oracle background. It’s highly standards-compliant, which I like from a theoretical perspective. It’s nicely hands-on -you can fiddle, and tweak, and delve into it, in a way that Oracle has deliberately moved away from since about version 8.0. It has all the functionality most Oracle users could ever ask for, and then some. And it performs really very well indeed. Impressive, actually.

Gentium

A completely free font called Gentium (and its denser cousin, Gentium Book Basic) can be obtained from this website.

Just download the “Gentium Basic/Gentium Book Basic” link, all 848KB of it, and then unpack the resulting zip file. In Fedora, you can then view each font in turn by right-clicking it and selecting the Open with Font Viewer menu option. A button to Install Font will then be visible: click it, and you’re done. Repeat for each additional font variant (bold, bold italic, italic). The uncompressed folder can be deleted once all eight font files have been installed in this way.
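If you’d rather skip the pointing and clicking, a quick terminal alternative does the same job (the zip’s file name below is a stand-in for whatever the download is actually called):

# Unpack straight into your per-user font directory, then refresh the cache
unzip GentiumBasic.zip -d ~/.fonts/
fc-cache -f ~/.fonts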

Note that the font(s) may already be available from your distro’s standard package manager.

Linux Mint, for example, definitely has ‘ttf-sil-gentium’ and ‘ttf-sil-gentium-basic’ as packages in the standard repositories (so, presumably, Ubuntu does too). Try there first, anyway, before seeking to install from the website previously linked to.
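On Mint or Ubuntu, then, installation should be a one-liner, using the package names just mentioned:

sudo apt-get install ttf-sil-gentium ttf-sil-gentium-basic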

Update (27/1/2017): In Fedora 25, you can install the font and its various weights with the commands:

sudo dnf install sil-gentium-fonts
sudo dnf install sil-gentium-basic-book-fonts

(The ‘book’ fonts are slightly heavier/bolder versions of the basic fonts).

Heterogeneous Database Connectivity

I’m not sure I’ve ever put quite so many syllables in a Blog Post Title before?! But this is a little cri du coeur on the topic of getting Oracle databases talking to non-Oracle databases, such as PostgreSQL: because the technology is bloody difficult and doesn’t work properly!

There are two aspects to the problem, of course: getting Oracle talking to the ‘foreign’ database and/or getting the foreign database talking to Oracle. In the first case (Oracle-to-other), you can use the Oracle Gateways products to do the deed. A lot of vendor-specific ones are licensable, but a generic ODBC one is actually part of a standard 11g install and doesn’t, therefore, involve any additional software purchases or installations at all. In the second case (Other-to-Oracle), there will be all sorts of options -but the one I am specifically aware of in the PostgreSQL context is something called DBI-Link, which uses a bit of Perl, a sprinkling of YAML and a dash of ODBC to make the connection.

Neither approach is particularly difficult to set up, though I think it involves more typing than I’m entirely comfortable with!

In the case of Oracle’s ODBC Gateway, you first use UnixODBC on your Oracle box to create a Linux-to-other connection. This involves creating a system data source in /etc/odbc.ini and specifying where the PostgreSQL ODBC drivers can be found in /etc/odbcinst.ini. Such connections are almost trivial to set up (and can be tested with the isql tool that ships with UnixODBC). That done, you then visit $ORACLE_HOME/hs/admin and create an init.ora with a name that matches the data source name you’ve just created (so, if you edited odbc.ini and created a data source called hjr, you’d now create an inithjr.ora file in the hs/admin directory). Then you move to $ORACLE_HOME/network/admin and edit your listener.ora, adding a SID_LIST entry which says ‘HS=OK’ and ‘program=dg4odbc’. Finally, you edit your tnsnames.ora so that there’s an alias which points to the local host, requesting a connection to a SID whose name matches the data source name previously created. Restart your listener to get the new settings picked up, and you should be able to fire up SQL*Plus, create a database link that uses the new tnsnames alias, and select * from sometable@your_new_link. The listener will receive the call and know to invoke the dg4odbc program, which knows how to read the /etc/odbc.ini stuff: shortly thereafter, your PostgreSQL data will be winging its way into the Oracle database.
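To make that concrete, here’s an outline of the whole chain. Every host name, path and password below is invented, and the exact keywords can vary between driver and Oracle versions, so treat it as a sketch rather than gospel:

# /etc/odbcinst.ini -- where the PostgreSQL ODBC driver lives (path assumed)
[PostgreSQL]
Description = PostgreSQL ODBC driver
Driver      = /usr/lib/psqlodbc.so

# /etc/odbc.ini -- the system data source, here called 'hjr'
[hjr]
Driver     = PostgreSQL
Servername = pg.example.com
Port       = 5432
Database   = searchdb

# Test the Linux-to-PostgreSQL leg with UnixODBC's isql tool:
#   isql -v hjr dbuser dbpass

# $ORACLE_HOME/hs/admin/inithjr.ora
HS_FDS_CONNECT_INFO = hjr
HS_FDS_SHAREABLE_NAME = /usr/lib/libodbc.so

# listener.ora -- the extra SID_DESC in the SID_LIST
(SID_DESC =
  (SID_NAME = hjr)
  (ORACLE_HOME = /u01/app/oracle/product/11.2.0/db_1)
  (PROGRAM = dg4odbc)
)

# tnsnames.ora -- an alias pointing back at the local listener. The HS=OK
# flag traditionally lives here, though you'll see examples with it in the
# listener's SID_DESC as well.
HJR =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = tcp)(HOST = localhost)(PORT = 1521))
    (CONNECT_DATA = (SID = hjr))
    (HS = OK)
  )

After the listener restart, the SQL*Plus end is just this (the quoted identifiers matter, because PostgreSQL folds unquoted names to lower case):

CREATE DATABASE LINK pglink
  CONNECT TO "dbuser" IDENTIFIED BY "dbpass"
  USING 'HJR';

SELECT * FROM "sometable"@pglink;

In theory, the data should then simply flow.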

Except that it almost certainly won’t! I can get the above working when both the Oracle and PostgreSQL databases are 32-bit, almost with my eyes shut. But make either one of them (let alone both!) a 64-bit database running on a 64-bit server and you’re stuffed. I spent six hours trying to get it working earlier this week, and nothing I did could produce that magic moment where foreign data finally turns up in an Oracle context.

I’m not even going to get into the pig’s breakfast which is PostgreSQL’s DBI-Link software. I’ll just say that, for starters, the software which you download from the official PostgreSQL ‘foundry page’ is 3 years out of date and the author has a more up-to-date version available which, if you’re lucky, you’ll find by burrowing about a bit through a mountain of mailing list contributions. Secondly, there are approximately four hundred and thirteen elements which may, or may not, affect your chances of success. LD_LIBRARY_PATH might need changing, or it may not; it might need to reference /lib64 before, or maybe after, /lib. Your setting of NLS_LANG may or may not be an issue. You might have to set TWO_TASK, but then again you may not have to -and if you do, don’t ask too closely what it’s supposed to be set to. Feel free, variously, to experiment with your path, your ORACLE_HOME, your environment variables… there’s a very, very long list of them, any one of which might or might not be contributing to the problem! Oh, and whether or not you auto-start your PostgreSQL database may make a difference, too! Each time you adjust one of these degrees of freedom, you can certainly try selecting data from your Oracle database, but it probably won’t work.

Frankly, I’ve never seen anything like it. It’s the technological equivalent of a soufflé: if it stays up, it’s a miracle, but chances are it will just collapse in a non-functioning heap and look a mess. After 11 hours of trying, I gave up.

I thought I might put together a short series of posts showing the “fun” you can have with this stuff. We would start with a strictly 32-bit world, where everything more or less works as advertised. Then we’d try exactly the same thing in a 64-bit universe and see how hilarious it is when nothing at all works. But I’m so exhausted by my efforts earlier this week that I don’t think I can look a database in the eye again for quite a while! So, consider it an item on the to-do list!

Meanwhile, I did finally get 64-bit Oracle data into a 64-bit PostgreSQL database by the not-so-simple expedient of using three servers: one running Oracle, one running PostgreSQL and one running EnterpriseDB (which is PostgreSQL with “Oracle-compatible” knobs on). EnterpriseDB makes talking to Oracle a piece of cake -incredibly simple, actually, via JDBC as far as I can tell. Being at heart a PostgreSQL database itself, EnterpriseDB is also trivially easy to get talking to a ‘native PostgreSQL’ database, via PostgreSQL’s dblink software. The syntax is ghastly, but it is easy to set up. So, if you create views in EnterpriseDB on tables selected from the Oracle database, you can subsequently query those views from the native PostgreSQL database via dblink. It is, of course, slow as all buggery, but it is at least functional!
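For the record, the ‘ghastly’ syntax in question looks something like this. The connection details, view name and column types are all invented for the purpose:

-- Run from the native PostgreSQL database; dblink ships as a contrib module.
-- ora_docs is assumed to be a view in EnterpriseDB fronting an Oracle table.
SELECT *
  FROM dblink('host=edb.example.com dbname=edb user=edbuser password=secret',
              'SELECT id, title FROM ora_docs')
       AS t(id integer, title text);

The AS clause is the ghastly bit: dblink returns anonymous records, so you have to spell out every column name and type yourself, every single time.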

Personally, I don’t think Oracle-to-Other should be that hard -and in 32-bit land, it isn’t. But in 64-bit land, it’s a mess, and it oughtn’t to be. So there!

Unmetered

Living where I live (i.e., slightly less than 100km from Sydney city centre), we cannot get “proper” broadband. No cable company flogs its wares in our neighbourhood; no local telephone exchange is equipped for anything more than tin-can-and-string telephony, so ADSL is right off the options list! Our phone lines are so bad, we can’t even get ISDN -and that’s supposed to be accessible by 98% of the population! In desperation, we use Wireless Broadband, courtesy of Telstra Bigpond.

The plans available are enough to make you cry: the best I get is a 10GB monthly download limit for an eye-watering $129. Once you hit 10GB, there’s no extra data allowance available for purchase (and, in fairness, no extra charges): you just get shaped to speeds that would make a lethargic snail look sprightly. Friends and colleagues remark casually about their 200GB plans for $90 and wonder why I walk away in a hurry!

Nevertheless, I am generally happy with Bigpond: the connection is up about 95% of the time; the reception is good; speeds are excellent; and, when I need to, I can pack the whole thing, together with my netbook, and have Internet access on the morning train. Convenient, speedy, reliable -what more could you want (apart from more bandwidth and lower costs!)?

And my feeble 10GB allowance goes a lot further than you might think, thanks to the wonders of files.bigpond.com. Here, you’ll find Linux distro ISOs galore, a yum or apt-get update repository for the likes of Ubuntu or Fedora… and every single byte of this data munificence is completely unmetered, meaning that none of it counts in any way towards your 10GB monthly limit. Some months, therefore, I’ll actually download in excess of 25 or 30GB of stuff: 10GB on the ‘plan’ and maybe 20GB from files.bigpond.com, which magically ‘doesn’t count’!

Cue the inevitable sting in the tail: on June 30th, with all of three days’ notice, files.bigpond.com was taken down. No more unmetered access to anything… and therefore about 2/3rds of my effective monthly download limit abolished, at a stroke. This, as you can imagine, did not make for the happiest day of my life when I found out about it the moment I tried to download a new Fedora 13 ISO and kept getting redirected to the Bigpond home page. (Why they couldn’t email us to warn us, I have no idea: they’re happy enough to email a notification every time a phone bill arrives, after all!)

This was actually a deal-breaker for me. A loss of effective functionality so severe meant that I was fed up enough to go and find another ISP. As it turns out, there isn’t a single ISP in Australia that offers unmetered downloads for wireless broadband accounts, which is a bummer of major proportions! But there are lots of ISPs who will sell you a 6GB plan for $50 or so, with extra 6GB data blocks available on demand for about the same price (think Internode, for example). Once you’ve resigned yourself to never having access to unmetered downloads again, it’s a simple calculation to work out that Internode will sell you 12GB for $100, compared to Bigpond’s 10GB for $130 -roughly $8.30 per gigabyte versus $13: it’s not hard to work out where to go!

The only thing you can hold against Internode is that they use the Optus wireless network, which is half the speed of Telstra’s on a good day -and I’ve had reception difficulties with them in the past (though their coverage maps now indicate a lot of that should be ancient history). But still, more data for less money: what’s not to like?!

More out of a sense of duty than actually expecting a decent reply, I took the trouble to write to the Bigpond sales people in these terms: I like Bigpond’s wireless service; I don’t want to change providers; but without that unmetered content, your product is sub-standard and non-competitive. Please tell me some good news that means I won’t have to change.

The usual two days’ wait for a reply ticked by.

Then yesterday, I got it: dear Howard, please be advised that we’ve made http://mirror.aarnet.edu.au unmetered.

Well, this is a game-changer… and very, very unexpected (it’s always unexpected when Telstra/Bigpond actually listen to their customers!). AARNet is an excellent mirror -much better, in fact, than the original files.bigpond.com. It has all the appropriate distros in DVD ISO format (apart from CentOS, which is a bit of a bummer), and yum and apt-get repositories for updates. CPAN is there; so are Mozilla, Apache and a lot of others. For that to be unmetered makes me even happier than I was before, and renders any thought of moving to the likes of Internode completely moot. Well done, Bigpond!

Which begs the question, I suppose: why pull the plug on a valuable resource, only to put the plug back in after you’ve pissed off a significant proportion of your customers? If it’s that easy to unmeter a site like aarnet, why not arrange to do that first, and then announce that since a gold-plated unmetered site is now available, there’s no need for the home-brew bronze alloy version? It would have been the sensible thing to do, I think (unless they simply had no idea about their customers and honestly weren’t expecting the storm of protest and discontent their original switch-off decision provoked).

It reminds me a bit of Julia Gillard’s approach to being Prime Minister: announce a regional processing centre for refugees in East Timor one day and only then start negotiating with the government of that country as to whether it’s actually possible to do! Surely the negotiations might usefully have preceded the announcement? But then that would mean having to hold off on the announcement whilst the practicalities were nailed down. It’s always harder to actually achieve something (i.e., actually do some governing!) and then announce it than the other way around, of course: which is presumably why it’s so often the other way around these days!

Anyway, Bigpond get at least half a thumbs-up from me for being relatively nimble in their ability to turn a mess of their own making into a positive. And I shall now get back to downloading some more ISOs… unmetered!

Obscuring Email Addresses

Just a tiny tip: instead of including your email address in plain text in shell scripts which you might share around the Internet, use this:

echo h~r*di^wel#=com|tr ~*^#= j@zl.

The “tr” bit of that command translates each character in the first list into the corresponding character in the second. The list of “from” characters appears first, separated by a space from the list of “to” characters, in order. Thus, I’m asking to translate a tilde (~) to a “j”, an asterisk (*) to an “@” symbol, and so on. Issue the complete echo command at a Linux terminal prompt and the correct email address is revealed. Before that, it looks like mere gibberish!
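To cook up your own version, just run the translation in reverse over your real address first. The address and the mapping below are made up for illustration:

# Pick a mapping -- say @ becomes * and . becomes # -- then make the gibberish:
echo 'me@example.com' | tr '@.' '*#'
# prints me*example#com: that's the string you publish in your scripts.
# Readers recover the address by applying the inverse mapping:
echo 'me*example#com' | tr '*#' '@.'

Quoting the tr arguments, by the way, is a sensible precaution: it stops the shell from trying to glob characters like the asterisk before tr ever sees them.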