
A server that does its job is invisible. But as soon as websites run into timeouts, database queries hang, or the SSH login becomes a test of patience, alarm bells ring for you as an admin. The knee-jerk reaction is often: “We need more iron!” But is hardware scaling always the right answer? Often the problem lies deeper – in the code, in the database configuration, or in the architecture.

In this article, we’ll analyze how to distinguish between real bottlenecks and mere configuration errors, when scale-up or scale-out is the right strategy, and how to proactively design your capacity planning.

Not all scaling is the same

Before we buy hardware, we need to determine the direction. Technically, “scalability” means the ability to adapt resources to the workload without having to change the architecture. A distinction is made between two fundamental strategies:

1. Scale-Up

The “Bigger Hammer” principle. You take the existing server and give it more resources (CPU, RAM, NVMe).

  • Ideal for: Monolithic databases (SQL), legacy applications that are not cluster-enabled.
  • Limit: At some point, the largest available server on the market is reached (or the budget is exceeded).

2. Scale-Out

The principle of “more workers”. Instead of one large server, you use many small instances that share the load (clusters).

  • Ideal for: Web servers, stateless applications, microservices, containers (Kubernetes).
  • Limit: Complexity increases (load balancers, data synchronization, network latency).

[Image: Server scaling]

First warning signs: How do you recognize bottlenecks?

Before you release the budget for new hardware, you need to prove that the current hardware is actually the problem. A subjective “The server feels slow” is not enough here. Rather, you need to dive deep into system metrics to distinguish real bottlenecks from mere configuration errors. A look at htop is a good start, but often not detailed enough.

Here are the four main suspects and how to analyze them:

1. CPU: Usage vs. Load Average

Many admins make the mistake of only looking at the CPU usage in percentage. A CPU at 90% isn’t necessarily a problem—it may just be used efficiently (e.g., video encoding).

  • The true indicator: Load Average. The load average (seen via uptime or top) shows how many processes are currently waiting for the CPU or are being processed by it.
    • Rule of thumb: A load of 1.0 means that a CPU core is fully utilized.
    • The warning signal: If the load average consistently exceeds the number of available CPU cores (e.g. load 6.0 on a 4-core machine), processes pile up in the run queue waiting for CPU time. The system becomes sluggish and latency increases.
  • Special case virtualization: Pay attention to the %steal (st) value. If it is high, your VM itself may be idle, but the overloaded physical host underneath is not granting it any computing time.
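The rule of thumb above can be sketched in a few lines of Python. The `overloaded` helper is a hypothetical name for illustration; on a live Linux box you would feed it the first field of /proc/loadavg:

```python
import os

def overloaded(load1: float, cores: int) -> bool:
    """A 1-minute load average above the core count means processes are queuing."""
    return load1 > cores

# The example from the text: load 6.0 on a 4-core machine
print(overloaded(6.0, 4))   # -> True

# On a live Linux system, compare the real values:
if os.path.exists("/proc/loadavg"):
    load1 = float(open("/proc/loadavg").read().split()[0])
    cores = os.cpu_count() or 1
    print(f"1-min load: {load1}, cores: {cores}, overloaded: {overloaded(load1, cores)}")
```

Note that this deliberately uses the 1-minute value; for alerting, the 5- or 15-minute averages are less noisy.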

2. RAM: Caching vs. Swapping

Memory is often the most misunderstood resource. In Linux, the principle applies: “Free RAM is wasted RAM”. The operating system aggressively uses unused memory as a page cache for the file system to minimize disk access.

  • The real indicator: swapping & paging. Ignore the “Free Memory” value. Instead, look for “Available”. It only becomes critical when the server starts swapping.
    • The warning signal: If pages have to be actively moved out to the swap area (on the slow disk), performance drops massively. You can tell by heavy kernel activity (the kswapd process consuming a lot of CPU) and high si (swap-in) and so (swap-out) values in vmstat.
    • Worst case: The OOM Killer (Out of Memory Killer) strikes and indiscriminately terminates important processes (such as the database service) to save the system from crashing.
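A minimal sketch of the vmstat rule above: sustained non-zero si/so columns are the signal, not low “free” memory. The `swap_pressure` helper and the sample values are illustrative, not a standard tool:

```python
def swap_pressure(si: int, so: int) -> bool:
    """Sustained non-zero si (swap-in) / so (swap-out) columns in
    `vmstat 1` output mean the kernel is actively paging to swap."""
    return si > 0 or so > 0

# Illustrative values, as they might appear in two vmstat samples:
print(swap_pressure(si=0, so=0))      # -> False: page cache at work, no swapping
print(swap_pressure(si=120, so=340))  # -> True: the box is thrashing
```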

3. Storage I/O: IOPS vs. throughput

Often, the CPU is blamed, even though the hard drive is the real culprit. Databases in particular often do not generate a high data rate (MB/s), but an extremely large number of small read/write accesses (IOPS).

  • The real indicator: I/O Wait (%wa). You can find this value in top or iostat. A high iowait value means that the CPU is twiddling its thumbs and waiting for the hard drive to deliver data.
    • The warning signal: If your CPU load is low (user/system%), but the “wait” time is high, your storage is too slow.
    • The solution: No CPU upgrade will help here. You need disks with lower latency (switching from HDD/SATA SSD to NVMe) or more IOPS. Check with iotop which process is currently hammering the disk.
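The %wa value that top and iostat report is derived from the per-field counters in /proc/stat (user, nice, system, idle, iowait, …). As a sketch of how that percentage falls out of two samples, here is a small calculation with hypothetical jiffy counts:

```python
def iowait_percent(prev: list[int], curr: list[int]) -> float:
    """Compute %wa between two /proc/stat 'cpu' samples.
    Field order: user nice system idle iowait irq softirq steal."""
    deltas = [c - p for p, c in zip(prev, curr)]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0  # index 4 = iowait

# Two hypothetical samples taken one second apart (values in jiffies):
prev = [100, 0, 50, 800, 50, 0, 0, 0]
curr = [110, 0, 55, 820, 115, 0, 0, 0]
print(f"{iowait_percent(prev, curr):.1f}% iowait")  # -> 65.0% iowait
```

A result like this – little user/system time, most of the interval spent in iowait – is exactly the “CPU twiddling its thumbs” pattern described above.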

4. Network: Bandwidth vs. Limits

A network bottleneck does not have to mean that the 1 Gbit/s line is full. Often logical limits or packet rates are the real constraint.

  • The real indicator: Dropped Packets & Conntrack.
    • Saturation: Use iftop or nload to check whether you are reaching the physical limit of the interface.
    • Packet Rate (PPS): A DDoS attack or a faulty application can send millions of small packets that overwhelm the network card and CPU even though the bandwidth is far from exhausted.
    • State Limits: A common, silent death for web servers is a full conntrack table in the firewall. If the system can no longer track new connections (check dmesg for “table full, dropping packet”), the server becomes unreachable from the outside even though the hardware metrics look relaxed.
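On a live Linux firewall, the relevant numbers sit in /proc/sys/net/netfilter/ as nf_conntrack_count and nf_conntrack_max. A minimal sketch of the check (the `conntrack_usage` helper and the sample values are assumptions for illustration):

```python
def conntrack_usage(count: int, maximum: int) -> float:
    """Fraction of the conntrack table in use; near 1.0 the kernel
    starts logging 'table full, dropping packet'."""
    return count / maximum

# On a live system, read these from:
#   /proc/sys/net/netfilter/nf_conntrack_count
#   /proc/sys/net/netfilter/nf_conntrack_max
usage = conntrack_usage(count=61200, maximum=65536)
print(f"conntrack table {usage:.0%} full")  # -> conntrack table 93% full
```

Anything persistently above ~80% is a cue to raise nf_conntrack_max (and to check why so many connections are being tracked in the first place).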


Pre-purchase checklist: Is it really the hardware?

Nothing is more embarrassing (and expensive) than freeing up the budget for a server twice as large, only to find that the application is exactly as slow afterwards as it was before. In practice, an estimated 70% of all performance issues are not due to a lack of hardware, but to inefficient software or configuration.

Before you scale, work through this “health check” list mercilessly:


[ ] 1. The Database Check: Indexes & Slow Queries

The database is the most common bottleneck. A single bad SQL query can bring even a 64-core server to its knees.

  • Check indexes: Are WHERE, JOIN, and ORDER BY clauses covered by indexes? A full table scan across millions of rows is the death of any I/O performance.
  • Slow Query Log: Enable logging for slow queries (for example, > 1 second). Often it is queries that load an unnecessary amount of data (SELECT *) or N+1 problems when using ORMs (Object-Relational Mappers).
  • Explain Plan: Use EXPLAIN before your SQL statement to see if the database really uses the index.

[ ] 2. The Caching Strategy: Don’t Calculate Anything Twice

The fastest request is the one that does not reach the web server in the first place. Caching is almost always cheaper than CPU power.

  • Application cache: Are expensive database results (e.g., “Top 10 Products”) cached in Redis or Memcached?
  • OpCode Cache: In PHP, is the opcache enabled and large enough so that scripts don’t have to be recompiled every time they are called?
  • HTTP cache: Do you use Varnish or Nginx caching for static assets or entire HTML pages? A reverse proxy can deliver thousands of requests per second, while your backend server capitulates at 50.
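The application-cache idea above is the classic cache-aside pattern. A minimal sketch, using a plain dict as a stand-in for Redis/Memcached (the function names are illustrative, not a real API):

```python
import time

cache: dict[str, tuple[float, list]] = {}  # stand-in for Redis/Memcached
db_hits = 0

def expensive_query() -> list:
    """Placeholder for a slow 'Top 10 products' SQL query."""
    global db_hits
    db_hits += 1
    return ["product-%d" % i for i in range(10)]

def top_products(ttl: float = 60.0) -> list:
    """Cache-aside: serve from cache, fall back to the DB only on a miss."""
    entry = cache.get("top_products")
    if entry and time.monotonic() - entry[0] < ttl:
        return entry[1]                        # cache hit: no DB round-trip
    result = expensive_query()                 # cache miss: pay the full price once
    cache["top_products"] = (time.monotonic(), result)
    return result

top_products(); top_products(); top_products()
print("DB queried once, served from cache twice:", db_hits == 1)  # -> True
```

Three requests, one database query – that ratio, multiplied across all hot paths, is why caching is almost always cheaper than CPU power.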

[ ] 3. Code Profiling & APM: A Look Under the Hood

Sometimes the code itself is the problem.

  • Memory Leaks: Does a process’s RAM consumption increase steadily over days until it crashes? Then you have a memory leak that will also fill 1 TB of RAM at some point.
  • Inefficient loops: Is data processed in nested loops, which puts unnecessary strain on the CPU (O(n²) complexity)?
  • Blocking I/O: Does your code synchronously wait for an external API (e.g. payment gateway) to respond while the user waits? Use asynchronous jobs (queues) for such tasks. Tools such as New Relic or Datadog help with “Application Performance Monitoring” (APM) here.
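The “asynchronous jobs (queues)” advice can be sketched with Python's standard library: the request handler enqueues work and returns immediately, while a background worker talks to the slow external service. In production this role is played by a real job queue (e.g. a Redis-backed queue or RabbitMQ); this is only a minimal illustration:

```python
import queue
import threading

jobs: "queue.Queue[str]" = queue.Queue()
processed: list[str] = []

def worker() -> None:
    """Background worker: stand-in for the slow payment-gateway call."""
    while True:
        order_id = jobs.get()
        processed.append(order_id)  # the slow external call would happen here
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

# Request handler: enqueue and return immediately instead of blocking the user.
for order in ("order-1", "order-2"):
    jobs.put(order)

jobs.join()  # only the demo waits here; a real request handler would not
print(processed)  # -> ['order-1', 'order-2']
```

The user sees an instant “order received” response; the payment confirmation arrives asynchronously.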

[ ] 4. OS & Server Config: Releasing the Handbrake

Standard configurations are designed for safety and compatibility, not high load.

  • File Descriptors (Open Files): In Linux, “everything is a file” (including network sockets). The default limit (often 1024) is far too low for web servers. Check ulimit -n and increase it if necessary.
  • Worker limits: Are Apache (MaxRequestWorkers) or PHP-FPM (pm.max_children) configured to take advantage of the available hardware? If the pool is too small, users wait even though the CPU still has headroom. If it is too large, there is a risk of swapping.
  • KeepAlive: Is the keep-alive timeout set sensibly to reduce the overhead of establishing new connections?
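The worker-limit trade-off above is just arithmetic: the RAM left after OS and database reservations, divided by the average footprint of one worker. A rough sizing sketch (the numbers and the `max_children` helper are illustrative; measure your real per-worker memory with ps or smem first):

```python
def max_children(total_ram_mb: int, reserved_mb: int, per_child_mb: int) -> int:
    """Rough PHP-FPM pm.max_children sizing: RAM left after OS/DB
    reservations divided by the average worker footprint."""
    return (total_ram_mb - reserved_mb) // per_child_mb

# A 4 GiB box, 1 GiB reserved for OS + MySQL, ~64 MB per PHP worker:
print(max_children(4096, 1024, 64))  # -> 48
```

Setting the pool larger than this number does not add capacity; it adds swap risk.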

Conclusion: Only when you can tick off every item here and the load is still too high – then (and only then) is hardware the next logical step.

Decision: When which way?

The choice between scale-up and scale-out is rarely a matter of taste, but is usually dictated by the architecture of the application. Here are two classic scenarios that you will encounter in everyday life:

Scenario A: The database server (stateful) gasps

Relational databases (MySQL, PostgreSQL, MSSQL) are “stateful” in nature. They love consistency (ACID) and hate latency between memory and compute. When performance drops here, scale-up (vertical) is almost always the first, most cost-efficient and most stable step.

  • The RAM factor: Database performance stands and falls with caching. The goal is to keep the entire index (and ideally the “hot data”) in RAM (e.g. in the InnoDB buffer pool). If the DB has to access the disk for index lookups, the latency increases by a factor of 1,000.
    • Strategy: Maximize RAM and CPU cores. Switch from SATA SSD to NVMe during storage bottlenecks to dramatically increase IOPS.
  • The scale-out option (read replicas): If a single “monster server” is no longer enough, you can scale horizontally, but it’s complex.
    • Master-Slave: You provide the write server (master) with several read servers (slaves/replicas). However, your application must be intelligent enough to send write operations to the master and read operations to the replicas.
    • Sharding: Splitting the data across completely separate servers is the final expansion stage (the “supreme discipline”), but it entails massive administrative overhead.
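The “intelligent enough” routing required for read replicas can be sketched in a few lines. This is a deliberately naive illustration (the class name is hypothetical, and real routers must also handle transactions and replication lag):

```python
import itertools

class ReadWriteRouter:
    """Minimal read/write split: writes go to the master,
    reads round-robin across the replicas."""

    def __init__(self, master: str, replicas: list[str]):
        self.master = master
        self._replicas = itertools.cycle(replicas)

    def route(self, sql: str) -> str:
        # Naive heuristic: anything starting with SELECT is a read.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._replicas) if is_read else self.master

router = ReadWriteRouter("db-master", ["db-replica-1", "db-replica-2"])
print(router.route("SELECT * FROM orders"))         # -> db-replica-1
print(router.route("INSERT INTO orders VALUES(1)")) # -> db-master
print(router.route("SELECT 1"))                     # -> db-replica-2
```

Most ORMs and proxies (e.g. ProxySQL) implement exactly this split for you, plus the edge cases the sketch ignores, such as reading your own uncommitted writes.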

Scenario B: The webshop / frontend (stateless) is slow

Web servers (Nginx, Apache) or application servers (PHP-FPM, Node.js, Python) ideally process requests “statelessly”, meaning they store no state locally on disk. It doesn’t matter whether request A lands on server 1 and request B on server 2. Scale-out (horizontal) is mandatory here.

  • The Load Balancer as Conductor: You simply put a second, third or tenth server next to it. A load balancer (e.g. HAProxy, F5 or cloud LBs) distributes the incoming requests.
    • Strategy: Keep the individual servers (“nodes”) small. Many small servers are often more fail-safe than two huge ones.
  • Challenge: sessions. Watch out for user sessions (shopping carts, logins). If these are stored locally in the web server’s RAM, users lose their shopping cart when the load balancer sends them to a neighboring server.
    • Solution: Use an external session store (e.g. Redis or Memcached) or configure “Session Stickiness” on the load balancer.
  • Advantage of High Availability (HA): Scale-Out provides redundancy “for free”. If a server fails (hardware failure, patching), the load balancer takes it out of rotation and traffic flows seamlessly to the remaining nodes.
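The external-session-store solution can be illustrated in miniature: two “web servers” sharing one store (a plain dict standing in for Redis/Memcached; the class and method names are made up for this sketch):

```python
session_store: dict[str, dict] = {}  # stand-in for Redis/Memcached

class WebServer:
    """Any node can serve any request because sessions live externally."""

    def __init__(self, name: str, store: dict):
        self.name, self.store = name, store

    def add_to_cart(self, session_id: str, item: str) -> None:
        self.store.setdefault(session_id, {"cart": []})["cart"].append(item)

    def cart(self, session_id: str) -> list:
        return self.store.get(session_id, {"cart": []})["cart"]

node1 = WebServer("web-1", session_store)
node2 = WebServer("web-2", session_store)

node1.add_to_cart("sess-42", "keyboard")  # load balancer sent us to node 1
print(node2.cart("sess-42"))              # -> ['keyboard']: node 2 sees it too
```

With per-node in-memory sessions, the second print would come back empty – which is exactly the lost-shopping-cart scenario described above.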

Modern Scaling: Containers, Cloud & Orchestration

In modern environments, no one unscrews cases to add RAM modules anymore. Today, infrastructure is defined as code (IaC), and scaling happens dynamically via API.

1. Auto-scaling in the cloud (VM-based)

Hyperscalers such as AWS (Auto Scaling Groups) or Azure (Virtual Machine Scale Sets) have democratized the concept of elasticity. You no longer define a fixed number of servers, but a set of rules:

  • The rule: “If the average CPU load of all web servers is above 70% for more than 5 minutes, start two new instances.”
  • Scale-In (important for costs): “If the load drops below 30%, terminate the excess instances.”
  • The benefit: Your infrastructure “breathes” with the course of business. You pay for the peak load on Black Friday, but not for the quiet of a Sunday morning.
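The two rules above can be condensed into a single decision function. This is only a sketch of the policy logic (the function name and thresholds are from the example rules, not from any cloud SDK); real autoscaling groups add cooldown periods so the fleet does not oscillate:

```python
def scaling_decision(avg_cpu: float, current: int,
                     scale_out_at: float = 70.0, scale_in_at: float = 30.0,
                     min_instances: int = 2) -> int:
    """Return the new instance count for the rules from the text:
    above 70% add two instances, below 30% remove one (never below the floor)."""
    if avg_cpu > scale_out_at:
        return current + 2
    if avg_cpu < scale_in_at and current > min_instances:
        return current - 1
    return current

print(scaling_decision(85.0, 4))  # -> 6: Black Friday traffic, scale out
print(scaling_decision(12.0, 6))  # -> 5: Sunday morning, scale in gently
print(scaling_decision(50.0, 4))  # -> 4: inside the comfort band, do nothing
```

The asymmetry (add two, remove one) is deliberate: scaling out too slowly costs users, scaling in too slowly only costs money.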

2. Container orchestration with Kubernetes (K8s)

Kubernetes takes scaling to the extreme by packaging applications into containers (pods) and distributing them on a cluster of servers (nodes). K8s offers two dimensions of scaling:

  • HPA (Horizontal Pod Autoscaler): Kubernetes continuously checks your application’s metrics. As requests to your microservice increase, K8s immediately launches new pods – often within seconds, as containers start much faster than entire VMs.
  • Cluster Autoscaler: What happens if all servers in the cluster are full of containers? The Cluster Autoscaler automatically orders new virtual machines (nodes) from the cloud provider, adds them to the cluster and thus creates physical space for more containers.
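At the core of the HPA sits a simple formula, documented in the Kubernetes reference: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). A worked example in Python (the function name is ours; the formula is Kubernetes’):

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_metric: float, target_metric: float) -> int:
    """The HPA scaling formula from the Kubernetes documentation:
    desired = ceil(current * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# 4 pods averaging 90% CPU against a 70% target:
print(hpa_desired_replicas(4, 90, 70))  # -> 6 pods
# 6 pods averaging 35% CPU against the same target:
print(hpa_desired_replicas(6, 35, 70))  # -> 3 pods
```

The proportional shape of the formula is what makes the HPA converge: as replicas are added, the per-pod metric falls toward the target and the desired count stabilizes.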

Although this degree of automation initially requires more configuration effort (keyword: complexity), it is the gold standard for modern, highly available applications.



Conclusion: Acting instead of reacting

Server scalability is not a one-time event but an ongoing process. If you only upgrade once the server is already at its limit, you have often already lost money and the trust of your users.

Keep these three basic rules in mind:

  • Monitor correctly: Know your baseline. You need to know what “normal” load looks like in order to detect anomalies immediately.
  • Optimize first: Software before hardware. A missing database index or a memory leak cannot be fixed in the long run by throwing ever more RAM at it.
  • Plan for buffers: A CPU should not run above 70% permanently in regular operation. Only then do you have reserves to cushion spikes from marketing campaigns or backups.

Whether you choose the easy way (scale-up) or the flexible way (scale-out) depends on your architecture in the end. The only important thing is to identify your bottleneck precisely before you remove it. This way, you can always stay one step ahead of the rising expectations of your users.

Further links

Brendan Gregg: Linux Performance
The “bible” for performance analysis on Linux.
https://www.brendangregg.com/linuxperf.html
Prometheus Documentation
Standard tool for modern monitoring and alerting to make bottlenecks visible.
https://prometheus.io/docs/introduction/overview/
AWS: What is Autoscaling?
Explanation of the horizontal scaling principle using the example of the cloud
https://aws.amazon.com/de/autoscaling/
Google SRE: Monitoring Distributed Systems
How Google monitors and scales – the gold standard for DevOps and admins.
https://sre.google/sre-book/monitoring-distributed-systems/
