Architecture
Overview
At a high level, the nzyme architecture is simple:

- As many `nzyme-tap` installations as you want send recorded data over HTTPS to as many `nzyme-node` installations as you wish. Any HTTP load balancer technology can be used to distribute data to the `nzyme-node` instances.
- The `nzyme-node` installations automatically form a cluster if they are connected to the same PostgreSQL database (see the sketch after this list).
- The data sent by `nzyme-tap` is aggregated and summarized, meaning a tap only sends a fraction of the data it recorded.
- All state is automatically handled by PostgreSQL. You can remove and add `nzyme-node` nodes as you wish.
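For example, two `nzyme-node` instances become one cluster simply by pointing at the same database. A minimal sketch of the relevant configuration line, assuming a `database_url` key in `/etc/nzyme/nzyme.conf` (the key name and file location are assumptions here; the installation documentation is authoritative):

```conf
# Identical on node-01 and node-02, so both join the same cluster.
# Key name and file path are illustrative assumptions; see the install docs.
database_url: "postgresql://nzyme:YOUR_PASSWORD@psql.example.internal:5432/nzyme"
```

Because all cluster state lives in PostgreSQL, the nodes discover each other automatically; there is no peer list to maintain.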
A single-node architecture
A single-node architecture, with only one `nzyme-node` instance and one `nzyme-tap` collecting data. This is easy to set up and will work, but offers no redundancy.
```mermaid
flowchart TD
    N[nzyme-node]
    T[nzyme-tap]
    W[User via Web Interface]
    S((Storage))
    SQL[(PostgreSQL)]

    W -. HTTPS .-> N
    N -- local mount --> S
    N --> SQL
    T -- HTTPS --> N
```
A multi-node production architecture
A single-node architecture can be extended to a multi-node architecture of any size without rebuilding the setup. You can add additional nodes of any type as you wish, at any time.
```mermaid
flowchart TD
    subgraph Nodes [nzyme-node Cluster]
        N1[nzyme-node]
        N2[nzyme-node]
        N3[nzyme-node]
        S1((Storage))
        S2((Storage))
        S3((Storage))
    end

    subgraph PSQLCluster [PostgreSQL Cluster]
        direction LR
        SQL1[(PostgreSQL)]
        SQL2[(PostgreSQL)]
        SQL3[(PostgreSQL)]
    end

    subgraph DataCenter1 [Data Center 1]
        direction LR
        T1[nzyme-tap]
        T2[nzyme-tap]
    end

    subgraph Cloud [Public Cloud]
        direction LR
        T3[nzyme-tap]
    end

    subgraph OfficeHQ [Office / HQ Campus]
        direction LR
        T4[nzyme-tap]
        T5[nzyme-tap]
    end

    W[User via Web Interface]
    LB[Load Balancer]

    W -. HTTPS .-> LB
    LB -. HTTPS .-> Nodes
    N1 --> PSQLCluster
    N2 --> PSQLCluster
    N3 --> PSQLCluster
    N1 -- local mount --> S1
    N2 -- local mount --> S2
    N3 -- local mount --> S3
    T1 -. HTTPS .-> LB
    T2 -. HTTPS .-> LB
    T3 -. HTTPS .-> LB
    T4 -. HTTPS .-> LB
    T5 -. HTTPS .-> LB
```
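The load balancer in the diagram is deliberately generic: any HTTP load balancer works. As one hedged illustration, assuming nginx and assuming the nodes serve HTTPS on port 22900 (hostnames, port, and certificate paths below are placeholders for your environment):

```nginx
# Minimal sketch: TLS-terminating nginx distributing tap and web traffic
# across three nzyme-node instances. All values are illustrative.
upstream nzyme_nodes {
    server node-01.example.internal:22900;
    server node-02.example.internal:22900;
    server node-03.example.internal:22900;
}

server {
    listen 443 ssl;
    server_name nzyme.example.internal;

    ssl_certificate     /etc/nginx/tls/nzyme.crt;
    ssl_certificate_key /etc/nginx/tls/nzyme.key;

    location / {
        # Taps and web interface users submit to the same HTTPS endpoint.
        proxy_pass https://nzyme_nodes;
    }
}
```

Because all state is handled by PostgreSQL, any node can serve any request, so simple round-robin distribution is sufficient.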
Scaling
We are using two concepts here:
- Horizontal scaling: Scaling workload by adding more machines
- Vertical scaling: Scaling workload by increasing resources (CPU, RAM, IO, ...) of a single machine (making it bigger)
Note
Before attempting any horizontal or vertical scaling, you should always consider performance tuning nzyme first.
Scaling `nzyme-tap` data collection
Horizontally scaling an `nzyme-tap` is not feasible, because adding a second tap at the same collection point would only lead to data duplication.

That's why you should increase the resources of your `nzyme-tap` if it starts to run into performance issues:

- Identify whether you are running out of CPU or memory resources. It is very unlikely for `nzyme-tap` to run out of IO resources, because it does not interact with the disk very much.
- If you are running out of CPU resources, switch to CPUs with a faster clock speed (GHz) or, even better, with more cores/threads, and adjust the thread settings in the `nzyme-tap` configuration file as described in performance tuning (see the sketch after this list). The system is designed to run extremely efficiently across multiple cores.
- If you are running out of memory, increase memory. The `nzyme-tap` is written in Rust and has no required heap space setting. While extremely memory efficient, it will use as much memory as it needs; there is no configured limit like in Java.
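As a hedged sketch of such a thread adjustment, assuming the TOML format of the `nzyme-tap` configuration file and illustrative parameter names (the performance tuning documentation has the authoritative names and defaults):

```toml
# nzyme-tap configuration excerpt. Parameter names are illustrative assumptions.
# More broker threads spread packet processing across additional CPU cores.
[performance]
ethernet_brokers = 4   # assumed name: worker threads for wired capture
wifi_brokers = 4       # assumed name: worker threads for WiFi capture
```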
Scaling `nzyme-node` data processing
The `nzyme-node` component can and usually should be scaled horizontally. Additionally, running more `nzyme-node` servers gives you a higher degree of redundancy.

The most important part of scaling `nzyme-node` is determining the location of the bottleneck. Several parts of the architecture can cause performance problems. Here are some common symptoms and resolution steps:
Timeouts received when taps attempt to submit data
If `nzyme-tap` instances fail to send their recorded data and status reports to nzyme, they will report this in their own log files. Typically, you will see HTTP timeouts reported because the connection was closed before all data was transmitted.
There are some likely causes of this problem:

- Insufficient write speed into PostgreSQL, with the `nzyme-node` instance not fully utilized and waiting for PostgreSQL to process the received tap data. Tuning PostgreSQL or adding another PostgreSQL node should solve this issue (see the query sketch after this list).
- A CPU-bound `nzyme-node`: high CPU utilization of the `nzyme-node`, leading to slow processing of the received tap data. The taps can send huge amounts of compressed JSON that requires CPU resources to parse.
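To confirm that PostgreSQL is the bottleneck, you can inspect what the connected backends are waiting on. A minimal sketch using the standard `pg_stat_activity` view:

```sql
-- Group connections by state and wait event. Many backends stuck in
-- IO- or lock-related wait events point at a PostgreSQL write bottleneck.
SELECT state, wait_event_type, wait_event, count(*) AS connections
FROM pg_stat_activity
GROUP BY state, wait_event_type, wait_event
ORDER BY connections DESC;
```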
Closed connections or internal server errors received when taps attempt to submit data
Similar to the timeouts described above, you could also see closed connections or errors returned by the `nzyme-node` that received the data.

In this case, you should first check the `nzyme-node` log file for any obvious issues. Also check CPU and memory utilization: a node running out of memory can cause aborted connections and errors.
Also make sure to always adapt the `HEAP_SIZE` setting of `nzyme-node` to use the correct amount of memory. (The installation documentation covers this because it depends on the operating system you use.)
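As a hedged example, on Debian-style installations this is typically an environment variable picked up by the service; the file location is an assumption here, so follow the installation documentation for your OS:

```bash
# Assumed location: /etc/default/nzyme (Debian-style packaging).
# Give the Java-based nzyme-node a 4 GB heap, then restart the service.
HEAP_SIZE=4g
```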
Slow web interface or API calls
If the nzyme web interface works but feels sluggish, you are likely running into resource constraints on your PostgreSQL database. Check the CPU, memory, and IO load of your PostgreSQL nodes and increase their processing power if required.
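A quick, hedged first check for memory pressure on PostgreSQL is the buffer cache hit ratio via the standard `pg_stat_database` view (the ~99% threshold is a general PostgreSQL rule of thumb, not nzyme-specific guidance):

```sql
-- Share of block reads served from PostgreSQL's buffer cache.
-- Values well below ~99% on a busy database suggest more RAM or shared_buffers.
SELECT round(sum(blks_hit) * 100.0 / nullif(sum(blks_hit + blks_read), 0), 2)
    AS cache_hit_pct
FROM pg_stat_database;
```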