We’re currently wrapping up a project which uses PostgreSQL as its data store. In terms of the amount of data (or rather, the number of rows), it’s one of the biggest we’ve designed. The schema is just a handful of tables, but several of them hold tens of millions to hundreds of millions of rows, with one at over a billion rows and growing. The database is definitely write heavy.
Here are some things of note when working with PostgreSQL at this scale. I’ll discuss AWS Aurora, PostgreSQL data limits, indexing, upserts, and performance.
PostgreSQL can scale, if your hardware can
The heading of this section probably seems kind of obvious. I’m sure there’s an old saying like “there are no slow algorithms, just slow processors”. Maybe that’s just programmers pointing the finger at the hardware engineers, but I digress.
What I mean is that PostgreSQL doesn’t have hard limits on a lot of its parameters – it just comes down to: how much memory and storage do you have to play with?
The only set limit is table size, of 32TB by default. This would allow 32 billion 1KB rows to be stored, so it’s not something that should be hit any time soon.
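As a sanity check on that arithmetic (assuming decimal terabytes and kilobytes, as the round numbers suggest):

```python
# How many 1KB rows fit in a 32TB table?
# Assuming decimal units: 1TB = 10**12 bytes, 1KB = 10**3 bytes.
table_limit_bytes = 32 * 10**12
row_size_bytes = 1 * 10**3

max_rows = table_limit_bytes // row_size_bytes
print(max_rows)  # 32000000000, i.e. 32 billion rows
```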
But how can your hardware scale? This has been my first opportunity to work with Amazon Aurora Serverless databases. Essentially this means your database hardware can scale up and down on the fly, as your CPU/RAM requirements increase or decrease.
It’s not magic though. Sometimes you’ll get the dreaded “The DB cluster failed to scale from X capacity units to Y capacity units for this reason: A scaling point wasn’t found.” message. This is because Aurora can’t scale if there are active transactions in the database. You can tell it to force scaling but this might cause connections to be dropped.
The fix we’ve found for this is to make sure your application is well behaved and is cleaning up after itself in a timely manner, by making sure transactions are committed and connections are closed when finished.
This is generally good design for any application with any database though.
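As a sketch of that “clean up after yourself” pattern, here’s the context-manager discipline in Python. The stdlib sqlite3 module is used as a stand-in so the example is self-contained; PostgreSQL drivers such as psycopg2 support the same pattern.

```python
import sqlite3
from contextlib import closing

# Open a connection, do the work in a short transaction, and close promptly,
# so the database never sees a long-lived idle transaction.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

    # 'with conn:' commits on success and rolls back on exception,
    # so the transaction is never left dangling.
    with conn:
        conn.execute("INSERT INTO events (payload) VALUES (?)", ("hello",))

    count = conn.execute("SELECT count(*) FROM events").fetchone()[0]
# By here the transaction is committed and the connection is closed.
```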
Aurora can also be useful to determine what capacity of database server you actually need. You can see how much CPU and RAM your application requires and then migrate to a standard RDS instance of the same size, for more predictable billing.
Text storage and indexing
When storing text in a database you might be used to using a VARCHAR field with a specific length. You probably also want to make this a power of 2, for some unknown reason. For example, maybe a VARCHAR(64) sounds “about right” to store an email address.
In PostgreSQL, the VARCHAR, VARCHAR(n), and TEXT field types all use the same underlying storage mechanism, and there is no performance difference between them. Therefore, unless you’re very sure about the length of the text data you’re storing, you may as well use the TEXT type so you’re not limiting yourself to a certain length.
Where you might run into issues with this approach, though, is when using btree (unique) indexing on the text column. PostgreSQL limits the size of indexes for rows to around 2KB, give or take. Trying to insert a row that needs to index text data of this size will cause errors like:
index row size 5992 exceeds maximum 2712 for index <index name>.
There are a couple of options for fixing this. If you need a unique index on the column, you can use an expression index. For example, if you want to find a document by its unique content, your query might look like this:
SELECT * FROM documents WHERE content = 'my long document content';
And to speed it up, you’d create a unique index on content, which would of course fail if content were too long.
Instead, you can hash the value and index on that. Doing it manually might look something like this:
INSERT INTO documents (content, hash) VALUES ('my long document content', md5('my long document content'));
And then to query it:
SELECT * FROM documents WHERE hash = md5('my long document content');
Aside: consider a more collision-resistant hash function if you like; MD5 is used here as a well-known example.
This requires you to remember to keep the hash up to date. You could achieve this with a trigger function that keeps the hash column in sync as the documents table is updated, but that’s all getting a bit complicated.
Instead, let’s come back to the use of an expression index. An index on the hash of the value can be created like this:
CREATE UNIQUE INDEX index_name ON documents ( md5(content) );
Then your index is used when querying on the hash of that column:
SELECT * FROM documents WHERE md5(content) = md5('my long document content');
This doesn’t actually perform an md5 on each row’s content, since that’s been precomputed as part of the index. Just remember that, ironically, searching for the content without using the hash is going to be slower, since the unhashed content is not indexed.
You may still consider storing a hash as its own field if you want to move the hashing function to your application layer. For example, if you want to compute the hash across multiple fields in different ways, with more complexity than PostgreSQL provides.
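As a minimal sketch of that application-layer approach, here’s a hashing helper in Python using the stdlib hashlib. The function name and the separator scheme are illustrative assumptions; note that hashlib.md5(...).hexdigest() produces the same hex string as PostgreSQL’s built-in md5(), so the two can interoperate.

```python
import hashlib

def content_hash(*fields: str) -> str:
    """Hash one or more text fields into a single digest.

    A NUL separator is included so that ("ab", "c") and ("a", "bc")
    don't produce the same digest. MD5 matches PostgreSQL's md5();
    swap in hashlib.sha256 for a more collision-resistant function.
    """
    joined = "\x00".join(fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

h = content_hash("my long document content")
# The digest is a fixed-length 32-character hex string regardless of input size,
# so it's always safely under PostgreSQL's btree index row size limit.
```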
If you need to search inside text fields, also consider GIN and GiST indexes. These allow for “full text search”, but aren’t meant for unique indexing.
Upserts
The word upsert is a combination of update and insert, and it refers to an atomic operation that will either update an existing record, or insert a new one if it doesn’t exist.
PostgreSQL supports this type of operation, but doesn’t use the term UPSERT in the query. Instead, it’s an INSERT with extra parameters: either ON CONFLICT DO NOTHING or ON CONFLICT DO UPDATE …. The former allows you to ignore the conflict if the row exists; the latter allows you to update fields on the affected rows. The full documentation on the syntax of PostgreSQL “Upserts” is available on the PostgreSQL website.
Upserts (as I’ll continue to refer to them) have the advantage of not needing a query to check whether a row exists before choosing to update or insert it. This also eliminates race conditions when two processes are checking/inserting the same data at the same time.
PostgreSQL will only return the ID of the upserted row if it is inserted, or if using ON CONFLICT DO UPDATE. If you intend to try to insert data and get its ID back if it already exists, in a single operation, you’ll need to revert to using ON CONFLICT DO UPDATE, even if you don’t update anything.
For example, suppose you have a table like this:
CREATE TABLE users ( id serial, email text UNIQUE )
You might want to upsert an email and get back its id, regardless of whether the email already exists. Your insert would need to perform a query like this:
INSERT INTO users (email) VALUES ('email@example.com') ON CONFLICT (email) DO UPDATE SET email=EXCLUDED.email RETURNING id;
So you’re performing an update, but just setting the value back to what it was. EXCLUDED is a special table containing the values that would have been inserted. This may be a little hackish, but if you’re determined to perform only one query then this is the approach you should take.
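The same ON CONFLICT syntax exists in SQLite (3.24+), so here’s a runnable sketch of the “update the value back to itself” trick using Python’s stdlib sqlite3; the table and column names follow the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY AUTOINCREMENT, email TEXT UNIQUE)")

upsert = """
    INSERT INTO users (email) VALUES (?)
    ON CONFLICT (email) DO UPDATE SET email = excluded.email
"""
conn.execute(upsert, ("email@example.com",))
conn.execute(upsert, ("email@example.com",))  # conflicts; the no-op update runs instead

rows = conn.execute("SELECT id, email FROM users").fetchall()
# Still exactly one row: the second statement updated rather than inserted,
# and the existing row kept its original id.
```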
Something to be aware of when using upserts is that if you’re using an auto-incrementing (BIGSERIAL) primary key, you’ll be using up an ID for each failed insert. To see why, let’s take a step back and look at how PostgreSQL generates incrementing IDs.
When you create a BIGSERIAL field, PostgreSQL will automatically add a sequence for the field. In simple terms, this is a generator that atomically produces the next ID in sequence. You can actually query the sequence and get the next ID without using it:
dbname=# SELECT nextval('table_id_seq');
 nextval
---------
     233
(1 row)

dbname=# SELECT nextval('table_id_seq');
 nextval
---------
     234
(1 row)
Two IDs wasted!
When using the INSERT … ON CONFLICT … query, PostgreSQL actually queries the sequence to get the ID for the row that is to be inserted. It then attempts to insert the row, and on conflict the row and its ID are discarded, never to be seen again.
Is this a problem? Not really, but you should be aware of it in case you’re wondering why your database is mysteriously missing a bunch of IDs. You may be scratching your head over all the missing data if your maximum id is 1,000,000 but you only have 800,000 rows.
You might also run the risk of exhausting the sequence earlier than you expect. The SERIAL type is backed by a 4-byte integer, so it can store values up to 2,147,483,647 (about 2.1 billion). BIGSERIAL goes all the way to 9,223,372,036,854,775,807, or about 9.2 quintillion. Unless you’re giving an ID to every grain of sand on Earth (estimated to be around 10^21), you should be fine.
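To put those limits in perspective, here’s a back-of-the-envelope calculation of how long each sequence lasts at a given burn rate. The rate of 1,000 insert attempts per second is a made-up assumption; remember that failed upserts consume IDs too.

```python
SERIAL_MAX = 2**31 - 1     # 4-byte signed int: 2,147,483,647
BIGSERIAL_MAX = 2**63 - 1  # 8-byte signed int: 9,223,372,036,854,775,807

ids_per_second = 1_000  # hypothetical: every insert attempt, successful or not

serial_days = SERIAL_MAX / ids_per_second / 86_400
bigserial_years = BIGSERIAL_MAX / ids_per_second / 86_400 / 365

print(f"SERIAL exhausted in ~{serial_days:.0f} days")          # roughly 25 days
print(f"BIGSERIAL exhausted in ~{bigserial_years:.1e} years")  # hundreds of millions of years
```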
Performance is good, but not magic
At the hundreds of millions or billions of rows scale, PostgreSQL is still fast. I don’t have metrics or comparisons to other databases, as everyone’s schema is different, so it’s not always useful to compare these. Suffice to say though, if you can query using indexes you’ll get back results quickly. We’re still talking milliseconds to get results from billions of rows.
But any time you have to perform data updates on the whole table you’ll be waiting a while. Even if you can update a row in 1 millisecond, it would take eleven and a half days to update a billion rows. Aurora scales well, and can increase its capacity horizontally if you can get enough machines pushing data into it, but again, that’s not a magic bullet if you’ve got LOTS of data.
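For the curious, the arithmetic behind that “eleven and a half days” figure:

```python
rows = 1_000_000_000
seconds_per_row = 0.001  # an optimistic 1ms per row update

total_seconds = rows * seconds_per_row
days = total_seconds / 86_400  # 86,400 seconds in a day
print(f"{days:.1f} days")  # 11.6 days, i.e. about eleven and a half
```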
So try to be sure that your schema and indexes are absolutely correct before moving to the production data size, otherwise you might be in for a lot of downtime while you rejig stuff.
Aurora is not completely magical, but it is pretty magical. It can also be useful for finding out what sort of capacity your application needs before switching to a more standard-sized RDS (or other) database solution.
Be aware of the limits on PostgreSQL. Mostly there are none, but table size and row length for indexes need to be kept in mind.
Upserts can solve problems with atomic inserts without resorting to table locking, but do be aware of “unnecessary updates” and “ID exhaustion”. This isn’t necessarily a big problem but it is something to keep in mind.
Big data takes big time to manipulate.