It’s been an interesting few days, comparing the performance of a production Oracle database performing full-text search on about 10 million records with a rough-and-ready PostgreSQL prototype (with the aim of working out whether the business can save itself a couple of hundred thousand dollars in licensing fees, of course!)
Both systems ran on separate servers with identical hardware: 24GB of RAM; a pair of mirrored 1TB hard drives for the OS and software installation; a pair of mirrored 256GB solid state hard disks for the database storage itself; a pair of quad-core, hyperthreaded Xeon CPUs; 64-bit Centos 5.5 (yup, we run production Oracle on Centos… gasp!) The data structures were near-identical in each case: two tables, each containing about half the records, queried individually with the results then “union all’d” into a single set of matching records. One column in each table contained all the comma-separated keywords describing the documents’ contents. That column was indexed with Oracle Text in one case, or a gin/tsvector index in the case of PostgreSQL. The data used was almost identical in each case, too: 9.4 million rows selected from a separate, third database and copied into the two sets of local search tables. Apart from a couple of thousand records that crept into the Oracle database that weren’t there at the time the PostgreSQL one was populated, there was no real difference between the two data sets. The Oracle version in use was 11.2; the PostgreSQL version in use was 8.4.4.
Each search performed on the production Oracle database is always captured and timed for monitoring and management purposes. I mined the record of those captures for a set of about 27,000 distinct search ‘phrases’ which I coule run against the PostgreSQL database. In this way, production Oracle timings could be compared to equivalent PostgreSQL benchmark times.
If you add up the total number of seconds it took the Oracle database to respond to all 5000 searches (and to return all the results), you get a grand total of 3,448 seconds. On PostgreSQL (identical hardware, near-identical data), it took 2,207 seconds. Overall, therefore, PostgreSQL is about 36% faster than Oracle on full text searches. Not bad for free!
But I wondered how each database handled searches that resulted in many matches. In other words, was PostgreSQL consistently faster than Oracle, or did it handle high-match searches better or worse than Oracle? Well, here’s the graph (click on it to make it more legible):
As you can see, as the searches return more and more records, both Oracle’s and PostgreSQL’s response times start to slow down (not, perhaps, unsurprisingly)… but PostgreSQL slows down a lot earlier than Oracle and a lot more dramatically. For a search term that matches only 15,000 records (which happens to be our default match, hence that vertical red line), PostgreSQL is faster than Oracle by quite a margin (about 60%). But for a search that matches 200,000 records, PostgreSQL is about twice as slow as Oracle; and by the time you’re matching more than 500,000 records, PostgreSQL is more than four times slower than Oracle. Given that we very rarely perform searches that match that many records (on the grounds that that’s not so much a search as a trawl!), here are the same figures, but zoomed in a bit to the left-hand side of the graph:
I think the speed advantage PostgreSQL has over Oracle for a “reasonable” number of matches is much clearer here. But I couldn’t help but notice that the orange line (PostgreSQL) was a lot ‘wavier’ than the blue (Oracle): that basically means that whilst Oracle is slower than PostgreSQL, it’s more consistent in its response times. PostgreSQL is always faster than Oracle, but it behaves in a rather less predictable fashion, especially when the number of records being returned is more than about 25,000. I suspect this is a consequence of Oracle managing its own buffer cache, whilst PostgreSQL relies much more heavily on the file system cache -and Lord knows what the file system might want to do to that at any given time!
Finally, I wondered how each database coped with search strings of different lengths. Some people might just want documents about “negligence”, for example; others might ask for “negligence, banking, insurance, fraud, liability”… would the length of the search term submitted affect the performance of search at all?
I find this graph particularly fascinating. Oracle (blue) has no problem coping with more and more words in the search phrase: search response times come down almost linearly as the search terms proliferate, probably because each new search term makes the search more precise -and therefore the set of results gets smaller and smaller. But PostgreSQL (orange) behaves in almost completely the opposite way: the more search terms you supply, the slower full text search gets (despite the size of the results set getting ever-smaller). It would seem that Oracle has an optimisation here that PostgreSQL lacks. I can only guess that Oracle constructs a bitmap of each search term, direct from the Oracle Text index, and then does a trivial bit of bit-wise addition -it now knows which records are true for all search terms and then goes to fetch them. Similarly, I can only guess that PostgreSQL has to scan for the results of each search term, and only at the end performs row-elimination to arrive at the set of rows which are true for all search terms… and that it’s this multiple scanning of the large amount of data which causes the orange line to slope upwards as the number of search terms increases.
So, all in all, it’s an interesting result. Do you get anything worth $10,000 when you buy your (dual-CPU SE1) Oracle license? Sure you do: you get a full text search that can handle huge numbers of matches and large numbers of search terms with aplomb. PostgreSQL in contrast struggles with both these extremes. But, on the other hand, for absolutely zero dollars, you get a screamingly-fast full text search capability with PostgreSQL -one that can perform excellently provided the ‘extremes’ aren’t met. Considering that any search containing more than a half-dozen search terms is more like an essay than a realistic search; and considering that returning half a million matches is more a data dump than a sensible search facility, I’d have to conclude that PostgreSQL is more than capable of being the basis of a very cheap, very viable, very capable search engine for quite large data sets.
I have to say, I like PostgreSQL enormously, despite my very-much-Oracle background. It’s highly standards compliant, which I like from a theoretical perspective. It’s nicely hands-on -you can fiddle, and tweak, and delve into it, in a way that Oracle has deliberately tried to move away from since about version 8.0. It has all the functionality most Oracle users could ever ask for, plus some. And it performs really very well indeed. Impressive, actually.