Everything I've seen on optimizing Postgres on ZFS

mastax · on Dec 22, 2021

Trying to remember what I researched 6 months ago.

1. You don't want to use AIO on Linux because "the driver’s AIO implementation is a compatibility shim that just barely passes the POSIX standard"[0] and is bad for performance, at least in database-like applications that have an alternate threaded sync implementation.

2. You probably don't want to use O_DIRECT since it's currently implemented using a shim on top of AIO[1].

3. `primarycache=metadata` + fsync + sync I/O works just fine for databases unless you have a very fast SSD array[2] in which case you'll have to wait for real O_DIRECT support[3].

[0]: https://openzfs.github.io/openzfs-docs/Performance%20and%20T...

[1]: https://github.com/openzfs/zfs/pull/7823

[2]: https://github.com/openzfs/zfs/issues/8381

[3]: https://github.com/openzfs/zfs/pull/10018

hardwaresofton · on Dec 23, 2021

Hey mastax, thanks for the notes. I'm no expert but want to respond to the following:

1. AFAIK none of these parts of the stack (postgres, ZFS) are using asynchronous IO or Direct IO -- postgres doesn't support either, and ZFS supports async writes (that concept means something very specific in ZFS land), and ZFS doesn't support direct IO (though it comes up a lot).

2. See my response to 1

3. All my testing is actually on NVMe, and comparing the performance to mdraid/ext4 (I also tried LVM) it's actually pretty decent. People have been running databases on ZFS with great success for years and I'm quite possibly the least knowledgeable person who's written about it! You can be productive without O_DIRECT support (again Postgres doesn't use direct IO so). Absolutely worth the benefits that ZFS provides, in my mind.

R0b0t1 · on Dec 22, 2021

What about postgres passthrough to disk. Is it better to do something like that? I normally see postgres on zfs, but I wonder.

hardwaresofton · on Dec 23, 2021

Please see my sibling response -- Postgres doesn't do async nor direct IO, default is fdatasync() now on linux and before it was fsync().

Here's a breakdown of the write path that is mostly correct which I typed out on reddit[0]:

- (Postgres) transaction occurs, generating WAL records (write()s are issued)

- (ZFS) receives filesystem writes makes transaction(s) and puts them in the current transaction group ("txg"), in memory

  - Exception: assuming logbias=latency if size is bigger than zfs_immediate_write_sz then the write goes straight to disk and a pointer is written into the ZIL instead... Bit blurry on this

- (Postgres) wal_write_delay time passes (default 200ms) or enough wal_writer_flush_after (default 1MB) data is added

- (Postgres) initiates WAL flush to disk -- i.e. calling fsync()

- (ZFS) sees the fsync(), and this means it must do a file-specific ZIL flush - it is possible for the txg to be committed first which will also trigger a ZIL flush

- (ZFS) performs a ZIL flush, writing ZIL records from memory to disk

- (ZFS) zfs_txg_timeout seconds passes and a "SPA sync" is performed

- (ZFS) transactions in the current transaction group are processed (+/- compression, encryption, etc) and are written to disk

  - This does NOT concern the ZIL records, these are writes in the transaction group that have not already been committed

  - [CORRECTION] direct sync records written as ZIL blocks are also added to the current TxG in memory. They are processed identically to async writes come TxG commit time and aggregate with them (they are async writes once durability is assured with a ZIL write). (see this comment by /u/taratarabobara)

All that said I haven't read/written enough of either codebase (ZFS/Postgres) to consider myself an expert, so there may be logic bugs in there!

[0]: https://www.reddit.com/r/zfs/comments/rlfhxb/why_not_txg_tim...

hardwaresofton · on Dec 23, 2021

Hey all, super late but OP here... When I posted this, it got two points, (https://news.ycombinator.com/item?id=29637167) -- thought HN wasn't so interested in ZFS/Postgres.

Want to note that there is excellent advice in the related reddit threads, in order of the experimentation I was doing:

- https://www.reddit.com/r/zfs/comments/rjavyc/everything_ive_...

- https://www.reddit.com/r/zfs/comments/rlfhxb/why_not_txg_tim...

- https://www.reddit.com/r/PostgreSQL/comments/rlrvt1/what_hap...

In the post I only begin to scratch the surface of tuning (both ZFS and Postgres are very deep) but glad people liked it.

As a shameless plug, one of the reasons I've been doing so much research and trying to figure out where the state of the art has landed is because I'm planning to launch a cloud provider next year, it's called Nimbus Web Services[0] -- if you run on Hetzner, OVH or LeaseWeb, please check it out.

[0]: https://nimbusws.com

delhanty · on Dec 23, 2021

A super deep and informative post as always Victor.

I'm still digesting it, also cross-referencing it with the ZFS section of your Hetzner drive failure write up of December 13th.

https://vadosware.io/post/handling-your-first-dead-hetzner-h...

An idea for a future post might be

"Our ZFS and Postgres setup for Nimbus WS: Hetzner, OVH and LeaseWeb"

tmikaeld · on Dec 22, 2021

Lot's of what-if-s in this write-up - a bit of a tinkerers ball - would have loved more benchmarks with graphs to show the differences.

Interesting to see a 20% performance increase due to blocksize being 16k instead of 8k.

AdrianB1 · on Dec 22, 2021

Not a Postgres expert, but a MS SQL one: performance of the storage in well configured servers, after the warm up, is the least concern for perf tuning. In query tuning physical reads are a flag for bad configuration, logical reads are the first and foremost indicator and a 20% increase in storage performance brings less than 1% in total server performance. This is because if your query is well written and covered with indexes, reads are performed in memory and large writes are very rare.

Nonetheless, the article is interesting as a way to understand how Postgres and ZFS go together.

hardwaresofton · on Dec 23, 2021

Completely agreed -- the reason that ZFS actually warrants tuning here is because ZFS has the ARC -- basically reads/writes are going to go through the linux page cache AND this additional ARC component, and there's a bit more leverage there than a more traditional LVM-on-xfs/ext4 stack.

Writes always have to touch disk of course to be durable so the underlying does matter. I'm not super familiar with MSSQL but does it have Direct IO? I have to admit I know almost nothing about the internals of MSSQL or the characteristics of filesystems on Windows.

There's a wealth of complexity and tunables in just how ZFS will write to disk along with how it will use and optimize ARC. Basically there's a second WAL at the disk level (the ZIL/SLOG) to contend with so there are lots of ways to do things and properties of the storage system can change.

AdrianB1 · on Dec 23, 2021

MS SQL IO is tightly integrated with Windows, including modes (like instant file initialization) that are not accessible to user mode apps. That makes it 'kind of Direct IO' on Windows, I am not familiar with the Linux port but I don't have any expectations there.

MS SQL configuration and tuning is less and less about storage and more and more about server parameters (most defaults are coming from the SQL version that introduced it, ex. SQL 2000 almost 22 years ago), query and index tuning. The last 2 can make huge differences, many orders of magnitude, so this is where most time is spent optimizing. The server configuration part is usually a list of 10-20 parameters every DBA is changing from start almost never touch later. This makes it really easy on that part.

hardwaresofton · on Dec 23, 2021

You're right! This article actually wasn't meant to be a benchmarking-style post, I didn't actually share my notes and numbers, because I originally started just experimenting and kind of fell down the rabbit hole. This originally started as me trying to pick a local filesystem after moving off of Ceph-for-everything -- it started as an mdraid vs LVM vs ZFS experiment.

That said, the reasons I haven't published all my notes is because:

- I only used one testing software pgbench (which is notorious for producing)

- Good evaluation needs a long term assessment of drive fragmentation which is hard to do (and literally takes time)

- Sometimes I changed multiple variables at a time, and not all the values I have in my notes are averaged across multiple runs, etc.

So in the end this post was just gathering what other much smarter people have said that I took inspiration from so it's all in one place.

Some intuition as well on why 16k can work better than 8k is in that scale15x talk -- for sequential table scans, it's almost like you're pre-faulting. A contact I've made through this research also recently suggested to me that I was only scratching the surface -- bigger record sizes should be better, as long as write amplification can be managed and fragmentation avoided (via a SLOG or more tuning).

testesttest · on Dec 22, 2021

Yea, I have run postgres/citus on zfs with petabytes of data. His analysis on record size is lacking. E.g there is a trade-off between compression vs record size. More compression means more effective iops

hardwaresofton · on Dec 23, 2021

Please see the reddit discussion -- there's some great feedback there on this very point, and some experienced contacts actually think I didn't go far enough, better results at even more than 16k as long as you can control write amplification and fragmentation.

- https://www.reddit.com/r/zfs/comments/rjavyc/everything_ive_...

- https://www.reddit.com/r/zfs/comments/rlfhxb/why_not_txg_tim...

- https://www.reddit.com/r/PostgreSQL/comments/rlrvt1/what_hap...

testesttest · on Dec 23, 2021

Thanks will do! Also, your post would have saved me a bunch of time a few years back. Gj.

hardwaresofton · on Dec 23, 2021

I hope we can lure more of the wizards sitting in their chairs out there to give more talks and write more content in 2022 -- there is so much knowledge on how to run these systems that hasn't been shared yet. I'm spoiled but I want more.

crad · on Dec 22, 2021

Great writeup, lots of detail and links off to reference material. I wish more posts were like this.

hardwaresofton · on Dec 23, 2021

Appreciate ya! Getting more of the sage advice that better scientists/engineers have been generous enough to share is one of my past times.

If you liked this and you're into distributed systems you might like my posts on Paxos:

https://vadosware.io/post/paxosmon-gotta-concensus-them-all/

iRobbery · on Dec 22, 2021

Nice overview, bookmarked. Only thing i could say is missing are some graphs :)

hardwaresofton · on Dec 23, 2021

Thanks! I unfortunately wasn't really scientifically disciplined enough as I ran pgbench running through configurations, so I didn't include my notes or numbers but hopefully someday I can come back to this with a fresh wad of cash to spend on dedicated hardware fit for purpose and put some of this to rest!

One of the things about this kind of benchmarking/testing is that it really requires multiple days worth of evaluation (index bloat, disk fragmentation, etc) and is really hard to get right... I'm afraid I'd add more noise than signal.

Some great resources out there though, like Percona's post on MySQL:

https://www.percona.com/blog/mysql-zfs-performance-update/

Even in there though, the comments are full of reasonable objections, with some really sage advice in terms of fragmentation and really measuring whether ZFS can coalesce your writes. It's a hard topic!

R0b0t1 · on Dec 23, 2021

Run it on really bad hardware, might be easier to see the differences. Could use some of the dm drivers to inject latency on disks too.

hardwaresofton · on Dec 23, 2021

Ah thanks for this, I didn't know there was actually a dm driver to inject latency. TIL it's called dm-delay[0][1].

[0]: https://www.kernel.org/doc/html/latest/admin-guide/device-ma...

[1]: https://enodev.fr/posts/emulate-a-slow-block-device-with-dm-...