Architecture
Overview
At a high level, the nzyme architecture is simple:

- As many `nzyme-tap` installations as you want send recorded data over HTTPS to as many `nzyme-node` installations as you wish. Any HTTP load balancer technology can be used to distribute data to the `nzyme-node` instances.
- The `nzyme-node` installations automatically form a cluster if they are connected to the same PostgreSQL database (see the sketch after this list).
- The data sent by `nzyme-tap` is aggregated and summarized, meaning a tap only sends a fraction of the data it recorded.
- All state is automatically handled by PostgreSQL. You can remove and add `nzyme-node` nodes as you wish.
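For example, two `nzyme-node` instances become one cluster simply by pointing at the same database. A minimal sketch of the relevant configuration line, assuming a `database_url` key in `/etc/nzyme/nzyme.conf` (the key name and file location are assumptions here; the installation documentation is authoritative):

```conf
# Identical on node-01 and node-02, so both join the same cluster.
# Key name and file path are illustrative assumptions; see the install docs.
database_url: "postgresql://nzyme:YOUR_PASSWORD@psql.example.internal:5432/nzyme"
```

Because all cluster state lives in PostgreSQL, the nodes discover each other automatically; there is no peer list to maintain.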
A single-node architecture
A single-node architecture, with only one `nzyme-node` instance and one `nzyme-tap` collecting data. This is easy to set up and will work, but offers no redundancy.
```mermaid
flowchart TD
    N[nzyme-node]
    T[nzyme-tap]
    W[User via Web Interface]
    S((Storage))
    SQL[(PostgreSQL)]

    W -. HTTPS .-> N
    N -- local mount --> S
    N --> SQL
    T -- HTTPS --> N
```
A multi-node production architecture
A single-node architecture can be extended to a multi-node architecture of any size without rebuilding the setup. You can add additional nodes of any type as you wish, at any time.
```mermaid
flowchart TD
    subgraph Nodes [nzyme-node Cluster]
        N1[nzyme-node]
        N2[nzyme-node]
        N3[nzyme-node]
        S1((Storage))
        S2((Storage))
        S3((Storage))
    end

    subgraph PSQLCluster [PostgreSQL Cluster]
        direction LR
        SQL1[(PostgreSQL)]
        SQL2[(PostgreSQL)]
        SQL3[(PostgreSQL)]
    end

    subgraph DataCenter1 [Data Center 1]
        direction LR
        T1[nzyme-tap]
        T2[nzyme-tap]
    end

    subgraph Cloud [Public Cloud]
        direction LR
        T3[nzyme-tap]
    end

    subgraph OfficeHQ [Office / HQ Campus]
        direction LR
        T4[nzyme-tap]
        T5[nzyme-tap]
    end

    W[User via Web Interface]
    LB[Load Balancer]

    W -. HTTPS .-> LB
    LB -. HTTPS .-> Nodes
    N1 --> PSQLCluster
    N2 --> PSQLCluster
    N3 --> PSQLCluster
    N1 -- local mount --> S1
    N2 -- local mount --> S2
    N3 -- local mount --> S3
    T1 -. HTTPS .-> LB
    T2 -. HTTPS .-> LB
    T3 -. HTTPS .-> LB
    T4 -. HTTPS .-> LB
    T5 -. HTTPS .-> LB
```
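The load balancer in the diagram is deliberately generic: any HTTP load balancer works. As one hedged illustration, assuming nginx and assuming the nodes serve HTTPS on port 22900 (hostnames, port, and certificate paths below are placeholders for your environment):

```nginx
# Minimal sketch: TLS-terminating nginx distributing tap and web traffic
# across three nzyme-node instances. All values are illustrative.
upstream nzyme_nodes {
    server node-01.example.internal:22900;
    server node-02.example.internal:22900;
    server node-03.example.internal:22900;
}

server {
    listen 443 ssl;
    server_name nzyme.example.internal;

    ssl_certificate     /etc/nginx/tls/nzyme.crt;
    ssl_certificate_key /etc/nginx/tls/nzyme.key;

    location / {
        # Taps and web interface users submit to the same HTTPS endpoint.
        proxy_pass https://nzyme_nodes;
    }
}
```

Because all state is handled by PostgreSQL, any node can serve any request, so simple round-robin distribution is sufficient.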
Scaling
We are using two concepts here:
- Horizontal scaling: Scaling workload by adding more machines
- Vertical scaling: Scaling workload by increasing resources (CPU, RAM, IO, ...) of a single machine (making it bigger)
Note
Before attempting any horizontal or vertical scaling, you should always consider performance tuning nzyme first.
Scaling `nzyme-tap` data collection
Horizontally scaling an `nzyme-tap` is not feasible, because adding a second tap at the same collection point would only lead to data duplication.

That's why you should increase the resources of your `nzyme-tap` if it starts to run into performance issues:

- Identify whether you are running out of CPU or memory resources. It is very unlikely for `nzyme-tap` to run out of IO resources, because it does not interact with the disk very much.
- If you are running out of CPU resources, switch to CPUs with a faster clock speed (GHz) or, even better, with more cores/threads, and adjust the thread settings in the `nzyme-tap` configuration file as described in performance tuning (see the sketch after this list). The system is designed to run extremely efficiently across multiple cores.
- If you are running out of memory, increase memory. The `nzyme-tap` is written in Rust and has no required heap space setting. While extremely memory efficient, it will use as much memory as it needs; there is no configured limit like in Java.
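As a hedged sketch of such a thread adjustment, assuming the TOML format of the `nzyme-tap` configuration file and illustrative parameter names (the performance tuning documentation has the authoritative names and defaults):

```toml
# nzyme-tap configuration excerpt. Parameter names are illustrative assumptions.
# More broker threads spread packet processing across additional CPU cores.
[performance]
ethernet_brokers = 4   # assumed name: worker threads for wired capture
wifi_brokers = 4       # assumed name: worker threads for WiFi capture
```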
Scaling `nzyme-node` data processing
The `nzyme-node` component can and usually should be scaled horizontally. Additionally, running more `nzyme-node` servers gives you a higher degree of redundancy.

The most important part of scaling `nzyme-node` is determining the location of the bottleneck. Several parts of the architecture can cause performance problems. Here are some common symptoms and resolution steps:
Timeouts received when taps attempt to submit data
If `nzyme-tap` instances fail to send their recorded data and status reports to nzyme, they will report this in their own log files. Typically, you will see HTTP timeouts reported because the connection was closed before all data was transmitted.
There are some likely causes of this problem:

- Insufficient write speed into PostgreSQL, with the `nzyme-node` instance not fully utilized and waiting for PostgreSQL to process the received tap data. Tuning PostgreSQL or adding another PostgreSQL node should solve this issue (see the query sketch after this list).
- A CPU-bound `nzyme-node`: high CPU utilization of the `nzyme-node`, leading to slow processing of the received tap data. The taps can send huge amounts of compressed JSON that requires CPU resources to parse.
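To confirm that PostgreSQL is the bottleneck, you can inspect what the connected backends are waiting on. A minimal sketch using the standard `pg_stat_activity` view:

```sql
-- Group connections by state and wait event. Many backends stuck in
-- IO- or lock-related wait events point at a PostgreSQL write bottleneck.
SELECT state, wait_event_type, wait_event, count(*) AS connections
FROM pg_stat_activity
GROUP BY state, wait_event_type, wait_event
ORDER BY connections DESC;
```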
Closed connections or internal server errors received when taps attempt to submit data
Similar to the timeouts described above, you could also see closed connections or errors returned by the `nzyme-node` that received the data.

In this case, you should first check the `nzyme-node` log file for any obvious issues. Also check CPU and memory utilization: a node running out of memory can cause aborted connections and errors.
Also make sure to always adapt the `HEAP_SIZE` setting of `nzyme-node` to use the correct amount of memory. (The installation documentation covers this because it depends on the operating system you use.)
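As a hedged example, on Debian-style installations this is typically an environment variable picked up by the service; the file location is an assumption here, so follow the installation documentation for your OS:

```bash
# Assumed location: /etc/default/nzyme (Debian-style packaging).
# Give the Java-based nzyme-node a 4 GB heap, then restart the service.
HEAP_SIZE=4g
```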
Slow web interface or API calls
If the nzyme web interface works but feels sluggish, you are likely running into resource constraints on your PostgreSQL database. Check the CPU, memory, and IO load of your PostgreSQL nodes and increase their processing power if required.
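A quick, hedged first check for memory pressure on PostgreSQL is the buffer cache hit ratio via the standard `pg_stat_database` view (the ~99% threshold is a general PostgreSQL rule of thumb, not nzyme-specific guidance):

```sql
-- Share of block reads served from PostgreSQL's buffer cache.
-- Values well below ~99% on a busy database suggest more RAM or shared_buffers.
SELECT round(sum(blks_hit) * 100.0 / nullif(sum(blks_hit + blks_read), 0), 2)
    AS cache_hit_pct
FROM pg_stat_database;
```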