ZFS is not just a file system, but a hybrid storage management system that fuses the functions of a file system and a Logical Volume Manager (LVM).
In traditional architectures (e.g., ext4 on LVM), the file system and physical disk management are separate. The file system does not “see” when an underlying block is corrupt. ZFS eliminates this blindness by controlling both layers. The goal is not primarily speed, but absolute data integrity.
Core Mechanics: The “Why” Behind the Technology
Copy-on-Write (CoW)
Classic file systems overwrite data blocks when changes are made (“update-in-place”). If the power fails during this write, the block is inconsistent.
ZFS uses copy-on-write. If you change a file:
- ZFS writes the new data to a new, free block.
- Then updates the metadata pointers to point to the new block.
- Only releases the old block for overwriting after successful writing (if no snapshot accesses it).
Architectural consequence:
- The file system is always in a consistent state on the hard drive.
- There is no fsck (file system check) after crashes, because broken transactions are simply ignored on the next mount (they were never committed).
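The transactional flow above can be sketched as a toy Python model (hypothetical class and names; real ZFS works with block pointers inside an on-disk tree, not a dict):

```python
# Toy copy-on-write store: data blocks are never overwritten in place.
class CowStore:
    def __init__(self):
        self.blocks = {}        # block_id -> data
        self.next_id = 0
        self.pointer = None     # "metadata" pointer to the live block

    def write(self, data):
        # 1. Write the new data to a fresh, free block.
        new_id = self.next_id
        self.next_id += 1
        self.blocks[new_id] = data
        # 2. Flip the metadata pointer to the new block (the atomic commit).
        old_id, self.pointer = self.pointer, new_id
        # 3. Only now release the old block (unless a snapshot pins it).
        if old_id is not None:
            del self.blocks[old_id]

store = CowStore()
store.write(b"v1")
store.write(b"v2")
print(store.blocks[store.pointer])  # b'v2'
```

If the process dies between steps 1 and 2, the pointer still references the old, intact block: the on-disk state stays consistent, which is exactly why no fsck is needed.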
Merkle Tree & Checksums
ZFS does not trust the hard drive. Each data block is given a checksum. However, this checksum is not stored in the data block itself, but in the parent block (the pointer that points to the data).
This extends through the entire tree up to the root block (Merkle Tree).
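The key trick, storing a block's checksum in its parent, can be illustrated with a short Python sketch (using SHA-256 from the standard library as a stand-in for ZFS's checksum algorithms):

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# The parent block stores the checksum OF the child data block,
# not inside the data block itself.
data = bytearray(b"important payload")
parent = {"child_ptr": 0, "child_checksum": checksum(bytes(data))}

# Simulate silent bit rot on disk: a single bit flips.
data[3] ^= 0x01

# On read, ZFS recomputes the checksum and compares it with the parent.
corrupt = checksum(bytes(data)) != parent["child_checksum"]
print("corruption detected:", corrupt)  # True
```

Because the parent is itself checksummed by its own parent, all the way up to the root, a mismatch anywhere in the tree is detectable.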
Causality:
If a bit on the disk flips (“bit rot”), the checksum calculated on read no longer matches the one stored in the parent block.
- For Mirror/RAID-Z: ZFS detects the error, retrieves the correct data from parity/mirroring, delivers the correct data to the application and repairs the defective block in the background (“self-healing”).
- For single-disk: ZFS reports an I/O error instead of passing corrupt data to the application.
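Both cases can be combined into one toy sketch (hypothetical `read_and_heal` helper; real self-healing happens inside the ZFS I/O pipeline):

```python
import hashlib

def checksum(b: bytes) -> str:
    return hashlib.sha256(b).hexdigest()

good = b"block contents"
expected = checksum(good)   # checksum stored in the parent block pointer

# A two-way mirror: copy 0 suffered bit rot, copy 1 is intact.
mirror = [bytearray(b"block cont3nts"), bytearray(good)]

def read_and_heal(mirror, expected):
    for copy in mirror:
        if checksum(bytes(copy)) == expected:
            # Self-healing: rewrite every bad copy with the good data.
            for j, other in enumerate(mirror):
                if checksum(bytes(other)) != expected:
                    mirror[j] = bytearray(copy)
            return bytes(copy)
    # Single-disk case: no redundant copy left -> report an error
    # instead of handing corrupt data to the application.
    raise IOError("all copies corrupt")

print(read_and_heal(mirror, expected) == good)  # True
```

After the read, both mirror copies match the parent checksum again, which is the behavior `zpool status` reports as repaired checksum errors.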
Storage hierarchy (the “Pool”)
ZFS abstracts physical disks into a storage pool (zpool). Datasets (file systems) make use of this pool dynamically instead of using rigid partitions.
VDEVs (Virtual Devices)
The pool consists of one or more VDEVs. A VDEV is the redundancy unit. If an entire VDEV fails, the entire pool is lost.
| VDEV Type | Architecture Logic | Application Area |
| --- | --- | --- |
| Mirror | Data is mirrored (RAID 1/10). Best IOPS, fastest resilvering. | Databases, VMs, high performance. |
| RAID-Z1 | Single parity (similar to RAID 5). Tolerates 1 disk failure. | Less critical data, backups. |
| RAID-Z2 | Double parity (similar to RAID 6). Tolerates 2 failures. Recommended for large HDDs (>4 TB), since long rebuild times raise the risk of further failures. | Standard for file servers / NAS. |
| RAID-Z3 | Triple parity. Extreme security. | Cold storage, very large arrays. |
| dRAID | Distributed RAID. Distributes hot spares and parity across all disks. Dramatically faster rebuild than RAID-Z. | Enterprise arrays (>20 disks). |
Important: A RAID-Z VDEV cannot be grown by adding individual disks in most stable implementations as of today (RAID-Z expansion has only recently landed in OpenZFS). To gain capacity, you add a new VDEV to the pool.
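A quick capacity estimate follows directly from the table: each VDEV gives roughly (disks − parity) × disk size of usable space. A minimal sketch (hypothetical helper; it ignores padding, metadata overhead, and the default slop-space reservation):

```python
# Rough usable capacity per VDEV: (disks - parity) * disk size.
# Ignores padding, metadata overhead and reserved slop space.
def usable_tb(disks: int, parity: int, disk_tb: float) -> float:
    return (disks - parity) * disk_tb

print(usable_tb(6, 2, 8.0))  # 6x 8 TB in RAID-Z2 -> 32.0 TB usable
print(usable_tb(2, 1, 8.0))  # 2x 8 TB mirror     ->  8.0 TB usable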
Special VDEV classes
You can tune the performance by offloading loads to SSDs/NVMe:
- LOG (SLOG/ZIL): For synchronous writes only (e.g., databases, NFS). ZFS writes the intent log (ZIL) to this fast device before making the slow pool commit. Does not speed up the sequential writing of large files.
- CACHE (L2ARC): Extension of RAM cache to SSD. Only useful if the RAM is already maximized (see Section 4).
- SPECIAL: Stores metadata (and optionally small blocks) on SSDs. Massively speeds up directory operations (ls -la, find) on HDD pools. If the SPECIAL VDEV fails, the whole pool is lost (so it must be redundant!).
Caching & Performance (ARC)
ZFS uses the ARC (Adaptive Replacement Cache), which is located in the RAM. Unlike the simple LRU (Least Recently Used) of other systems, ARC balances between:
- MRU (Most Recently Used): What did you just touch?
- MFU (Most Frequently Used): What do you touch often?
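Why a pure LRU is not enough can be shown with a tiny simulation (hypothetical `lru_simulate` helper; real ARC is far more sophisticated, with ghost lists that adapt the MRU/MFU balance):

```python
from collections import OrderedDict

# Toy demonstration: a pure LRU cache forgets "hot" items after one big
# sequential scan -- the weakness ARC's frequency-aware MFU side fixes.
def lru_simulate(accesses, size):
    cache = OrderedDict()
    for key in accesses:
        cache.pop(key, None)
        cache[key] = True
        if len(cache) > size:
            cache.popitem(last=False)   # evict least recently used
    return set(cache)

hot = ["db_index"] * 50                  # touched constantly (MFU candidate)
scan = [f"scan_{i}" for i in range(10)]  # one-off sequential read

print("db_index" in lru_simulate(hot + scan, size=4))  # False: scan flushed it
```

ARC would keep `db_index` on its frequency list, so a single backup job or large file copy does not evict the working set.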
Architecture Decision:
ZFS uses almost all of the free RAM for the ARC by default. This is intentional (“Unused RAM is wasted RAM”). As soon as applications need memory, ZFS releases it immediately.
[SCREENSHOT: Overview of RAM distribution in TrueNAS/Proxmox (ARC, Services, Free)]
The L2ARC Myth:
L2ARC (SSD Cache) consumes RAM to manage its index table.
- Rule of thumb: Don’t add L2ARC if you have less than 64GB of RAM. Often, more RAM is more effective than a cache SSD.
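The RAM cost behind this rule of thumb can be estimated: every record cached in L2ARC needs a header in RAM, and the header size is roughly 70–100 bytes depending on the OpenZFS version (the figures below are ballpark assumptions, not exact values):

```python
# Rough L2ARC RAM overhead: each cached record keeps an ARC header in RAM.
# Header size varies by version; ~96 bytes is an assumed ballpark here.
def l2arc_ram_gb(l2arc_gb: float, recordsize_kb: int, header_bytes: int = 96) -> float:
    records = (l2arc_gb * 1024**3) / (recordsize_kb * 1024)
    return records * header_bytes / 1024**3

print(round(l2arc_ram_gb(1024, 128), 2))  # 1 TB L2ARC @ 128K records -> ~0.75 GB RAM
print(round(l2arc_ram_gb(1024, 8), 2))    # same @ 8K (VM/DB blocks)  -> ~12 GB RAM
```

With small record sizes, a large L2ARC can eat a double-digit share of RAM just for its index, RAM that would otherwise serve as much faster ARC.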
Features & Implications
Snapshots & Clones
Thanks to CoW, snapshots initially cost zero storage space. A snapshot is simply a timestamp that prevents blocks that existed at that time from being released. Only when data changes does the snapshot (delta) grow.
- Send/Recv: ZFS can send snapshots incrementally to another pool. This is more efficient than rsync, because ZFS knows exactly what has changed at the block level without having to scan the file system.
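The mechanism behind incremental send can be sketched in a few lines: every block records the transaction group (txg) it was born in, so the delta since a snapshot is simply "all blocks newer than the snapshot's txg" (toy data below):

```python
# Toy incremental send: each block stores the transaction group ("txg")
# it was written in. Blocks born after the snapshot's txg ARE the delta.
blocks = {0: 5, 1: 12, 2: 3, 3: 12}   # block_id -> birth txg
snapshot_txg = 10                      # txg at which the snapshot was taken

delta = sorted(bid for bid, birth in blocks.items() if birth > snapshot_txg)
print(delta)  # [1, 3] -- found without reading any file contents
```

rsync, by contrast, must walk the whole tree and compare file metadata (or hashes) to find the same information.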
Compression
Activating LZ4 or ZSTD compression often increases performance.
- Causality: CPUs today are extremely fast, hard disks (I/O) slow. It is faster to write compressed data and unpack it in RAM than to write uncompressed data to the slow disk.
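The effect is easy to demonstrate. LZ4 and ZSTD are not in the Python standard library, so the sketch below uses zlib at a fast compression level as a stand-in for the same trade-off:

```python
import zlib

# Stand-in for LZ4/ZSTD: repetitive data shrinks dramatically,
# so far fewer bytes have to reach the slow disk.
block = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n" * 100
compressed = zlib.compress(block, level=1)  # fast level, lz4-like philosophy

ratio = len(block) / len(compressed)
print(len(block), len(compressed))  # compressed output is a fraction of the input
```

Logs, text, and database pages often compress 2–5×; already-compressed media (video, JPEG) gains nothing, which is why LZ4 aborts early on incompressible blocks.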
Deduplication (Warning!)
Deduplication stores identical blocks only once.
- The problem: ZFS needs a “deduplication table” (DDT), which must be held in RAM for acceptable performance.
- Rule of thumb: ~1-5 GB RAM per 1 TB of data just for the table.
- Risk: If RAM fills up and the DDT spills to disk, performance collapses (“thrashing”). For general-purpose workloads, deduplication is an architectural mistake in 99% of cases.
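The rule of thumb follows from simple arithmetic: one DDT entry per unique block, at a commonly cited in-core size of roughly 320 bytes per entry (an assumed figure; the exact size varies by version):

```python
# Back-of-the-envelope DDT sizing: one table entry per unique block,
# ~320 bytes per in-core entry is a commonly cited (assumed) figure.
def ddt_ram_gb(data_tb: float, recordsize_kb: int = 128, entry_bytes: int = 320) -> float:
    blocks = (data_tb * 1024**4) / (recordsize_kb * 1024)
    return blocks * entry_bytes / 1024**3

print(round(ddt_ram_gb(1), 2))                    # ~2.5 GB per TB at 128K records
print(round(ddt_ram_gb(1, recordsize_kb=16), 2))  # ~20 GB per TB at 16K records
```

Note how the cost explodes with small record sizes (VMs, databases): exactly the workloads where dedup is most tempting are the ones where the DDT is largest.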
Conclusion & Areas of Application
ZFS is the de facto standard for software-defined storage on standalone systems whenever data integrity matters more than raw speed.
Deployment checklist:
- ECC RAM is strongly recommended (ZFS also runs without it, but loses a layer of protection against corruption in RAM itself).
- Do not use a hardware RAID controller! ZFS requires direct access to the disks (HBA mode / IT mode) to read SMART values and address sectors directly.
- Ideal for: NAS (TrueNAS), hypervisor storage (Proxmox), backup servers.