Maximizing DevOps Monitoring with Zabbix

Today, I want to share my experience working with Zabbix, its architecture, its pros, and its cons.

Zabbix is a universal monitoring tool that combines data collection, data visualization, and problem notification. It also allows for some advanced features, such as problem prediction.
My first encounter with this monitoring system was in 2014 when I joined a project where Zabbix was already in use for monitoring network devices (routers, switches). Back then, it was version 2.2, and it had a somewhat challenging web interface, even for that time. Over the course of five years, while working on the project, we went through several system upgrades until we finally transitioned to Zabbix 4.0 LTS. The number of monitored network devices grew to several hundred, and we added monitoring for VPN tunnels, physical servers, VMware vCenter, virtual machines, and some services like DNS and NTP. All of this required creating numerous XML templates, but it was worth it as Zabbix, together with the SIEM system, became the main tools for the OPS team.

My next job was at a product-oriented IT company with five offices around the world. We were using PRTG for infrastructure monitoring, and I hope none of you had to work with that system. By that time, I was already a Zabbix enthusiast, so within a few months, I migrated the entire IT department to the new monitoring system. In addition to resource monitoring, which was in place at my previous job, we added website monitoring, SSL certificates, and NGFW. Most of the templates were sourced from the official Zabbix website, as during that period, Zabbix developers started actively working on integrations with various products.

For nearly three years now, I have been working on one of the projects at Innovecs. Here, we monitor a much broader range of resources, but I’ll delve into that further in the article.

Popular monitoring systems

Before we delve into discussing Zabbix in more detail, let’s take a look at the other popular monitoring systems.

Prometheus is an open-source monitoring system providing powerful query language, storage, and visualization features for its users. It collects real-time metrics and records them in a time-series database. Prometheus is a tool that has a wide set of built-in functionalities, so Prometheus users don’t need to install various plugins or daemons to collect metrics. Prometheus is a common choice for Kubernetes monitoring because it was built for a cloud-native environment.

Datadog is a cloud-based SaaS solution for monitoring cloud applications, servers, databases, tools, and services. It provides real-time monitoring and analytics for complex applications and infrastructure, and it can detect anomalies and raise alerts based on machine learning models.

Grafana is not a monitoring system, but it would be wrong not to mention it. It is a data visualization tool that can integrate with various databases and monitoring systems, including both Zabbix and Prometheus. With its help, you can create beautiful interactive dashboards for data analysis.

Why Zabbix

Zabbix is the optimal monitoring solution for our project because it allows us to collect data and metrics from several dozen separate environments that do not have a network connection between them. In addition, we can create a large number of custom metrics to monitor such things as the number of certain errors in the service logs or records in the database.

What we need to monitor:

Templates provided by Zabbix:

Linux and Windows host status and resource utilization (CPU, RAM, disk space, etc.), plus the status of ntpd, sshd, and other services
Network status in different environments
Web monitoring of own sites (availability, authentication)
Third-party application status and health (Nginx, Jenkins, MySQL, MongoDB, ELK cluster, Filebeat, DNS servers)

Customized Zabbix templates:

Our own Java application deployed on Linux hosts (via JMX monitoring)
Our own .NET application deployed on Windows and Linux hosts
Our own .NET and Java applications deployed on Kubernetes

Zabbix templates created by our DevOps team:

Errors in Nginx logs (count, type), upstreams status, count of specific requests, response statuses
Errors in our own application logs (type, count, etc.)
Number of specific records in databases and how it changes per time frame
SSL certificates
Database replications
Critical files checksum

Zabbix architecture

The entire monitoring system consists of the following services:

Zabbix server, which includes the server itself, the web interface, and the database. The latest LTS version supports MySQL, Percona, MariaDB, and PostgreSQL. The web interface and the database can be deployed on the same host or separately. Since v6.0, Zabbix server supports an HA installation. It can be installed on all popular Unix-like operating systems.

Zabbix proxy is a process that collects monitoring data from one or more monitored devices and sends it to the Zabbix server, essentially working on behalf of the server. All collected data is buffered locally in the proxy's own database and then transferred to the Zabbix server, which prevents data loss when the connection is interrupted. It can be installed on all popular Unix-like operating systems. Zabbix proxy can work in two modes: active and passive.

The primary difference between active and passive Zabbix proxies is which side initiates the connection between the proxy and the Zabbix server. An active proxy connects to the server itself: it requests its configuration and pushes the collected data. This suits remote or isolated environments, because only outbound connections from the proxy to the server have to be allowed through the firewall. A passive proxy waits for the server to connect, deliver the configuration, and request the collected data, so the server must be able to reach the proxy over the network. In both modes, the proxy collects data from the monitored devices in the same way.

Zabbix agent is deployed on a monitoring target to actively monitor local resources and applications (hard drives, memory, processor statistics, etc.). It also allows us to create custom metrics on monitored hosts. Supported platforms: Windows (since XP), Linux, macOS, IBM AIX, FreeBSD, OpenBSD, Solaris.

Zabbix Java gateway is a process that provides native support for monitoring JMX applications. It should be installed on the server or proxy hosts that are used for Java application monitoring.

IPMI agent – a specific item type that allows us to monitor devices that have IPMI support (HP iLO, Dell DRAC, IBM RSA, Sun SSP, etc.)

SNMP agent – another specific item type that allows us to monitor devices such as routers, network switches, printers, etc. via the SNMP protocol

According to the documentation and my own experience, it’s good to have one centralized Zabbix server and a Zabbix proxy in each environment. In my opinion, DB replication and scheduled server snapshots are sufficient measures to ensure stable operation. All connections from the remote proxies must use encryption.

Customization (main reason why we use it)

Zabbix offers extensive customization options to tailor monitoring to your organization’s specific needs. Administrators can define host groups, templates, and item types to categorize and organize tracked items. As I mentioned earlier, Zabbix offers a large number of templates for monitoring equipment, services, and other products of many popular vendors. If the presented integrations are not enough, you can always create your own. As the primary data source, you can choose a Python or bash script or any other executable file that can be run from the Zabbix server/proxy or directly on the agent itself (for example, if you want to collect some information from the server logs in real-time).
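As a rough illustration of such a custom data source, here is a minimal Python sketch that counts the lines matching a pattern in a log file and prints a single number, which is the kind of value a custom item can store. The function names, the log path, and the pattern are hypothetical. How the script is attached to an agent or run from the server/proxy (for example, through a user parameter) depends on your setup.

```python
import re
import sys


def count_matches(lines, pattern):
    """Return how many of the given lines match the regular expression."""
    regex = re.compile(pattern)
    return sum(1 for line in lines if regex.search(line))


def count_in_file(path, pattern):
    """Count matching lines in a log file, tolerating odd encodings."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_matches(handle, pattern)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical invocation: count_errors.py /var/log/myapp/app.log "ERROR"
    print(count_in_file(sys.argv[1], sys.argv[2]))
```

Note that this sketch re-reads the whole file on every run. For large or frequently polled logs you would probably want to remember the last read position instead.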

Speaking of templates, don’t forget about autodiscovery. Using these two features together will allow you to automatically create items, graphs, and triggers. Moreover, the update frequency of values, graph granularity, trigger severity for notifications, and other template parameters can depend on the host name, host group, assigned macro values, database records, or the number of specific errors in a log file within a time interval. Furthermore, the elements created using these features can dynamically change in the event of modifications to the parameters mentioned above or any others that you can define. You can find more information about templates and autodiscovery in the official Zabbix documentation.
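To make autodiscovery a bit more concrete: a discovery rule (what Zabbix calls low-level discovery) typically receives a JSON list of objects whose keys are macros such as {#NAME}, and the item, trigger, and graph prototypes are then instantiated once per object. The sketch below builds such a list for certificate files. The macro names {#CERT_NAME} and {#CERT_PATH}, the function names, and the *.pem convention are made up for illustration, and the exact JSON shape expected can differ between Zabbix versions, so check the documentation for yours.

```python
import json
from pathlib import Path


def build_discovery(paths):
    """Turn file paths into a list of discovery objects keyed by macros."""
    return [
        {"{#CERT_NAME}": Path(p).stem, "{#CERT_PATH}": str(p)}
        for p in sorted(paths, key=str)
    ]


def discover_certificates(directory):
    """Return discovery JSON for every *.pem file in a directory."""
    return json.dumps(build_discovery(Path(directory).glob("*.pem")))
```

With output like this, a single item prototype can cover every certificate the script finds, and new certificates get monitored without manual changes.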

How we can visualize and use collected data

And so we collected a bunch of metrics and other data. What to do with them?

Graphs – provide a comprehensive view of various metrics; graphs are highly customizable and help to identify trends and patterns over time. Zabbix graphs offer both real-time and historical views. Graphs are integrated with the alert system and allow users to quickly respond to critical situations.

Problems – displays information about current problems or shows them in a historical view. On this page, you can see operational data for current problems, their duration, and the actions they triggered (sending a message with problem details, running a script on a remote server, etc.).

Reports – summarize monitoring data, including graphs and charts, for a specified time period. These documents provide key performance metrics, graphs, and information over a period of time. Reports are useful for documenting trends, troubleshooting, and sharing information with stakeholders. Including customizable visualizations and detailed data, Zabbix reports facilitate data-driven decision-making and assist in capacity planning. The reporting feature improves communication between IT teams, allowing them to effectively analyze and monitor system performance.

Dashboards are highly customizable. They allow us to consolidate various monitoring data, including graphs, maps, and visualizations, into a single centralized interface. This provides a quick overview of the health and performance of the whole infrastructure, a specific environment, or just one server or service. The latest LTS version of Zabbix supports 24 dashboard widget types.

Alerts and forecasts

Zabbix triggers serve as predefined conditions that, when met by collected monitoring data, automatically generate alerts or notifications, allowing system administrators to respond proactively to potential issues or anomalies in their IT infrastructure. Zabbix can send alerts to various destinations and through different channels, including email, SMS, custom alert scripts, and Webhooks. In addition, custom actions allow you to automatically respond to alerts, such as executing scripts both directly on the server/proxy itself and on agents.
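As an example of a custom alert script, here is a small Python sketch. It assumes the script is called with three command-line parameters (recipient, subject, and message), which is a common way to configure such a script in a media type. The payload field names (channel, title, text) belong to a hypothetical chat webhook and are not part of Zabbix. The actual delivery step is left out.

```python
import json
import sys


def build_payload(send_to, subject, message):
    """Build a JSON body for a hypothetical chat webhook."""
    return json.dumps(
        {"channel": send_to, "title": subject, "text": message},
        ensure_ascii=False,
    )


if __name__ == "__main__" and len(sys.argv) == 4:
    # Assumed parameter order: recipient, subject, message.
    print(build_payload(sys.argv[1], sys.argv[2], sys.argv[3]))
```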

Another cool thing that I discovered in Zabbix is “Forecasting”. What do I mean by forecasting in Zabbix? This system stores historical data and trends for all the metrics we collect, so we can create alerts about impending problems a little before they happen.

For instance, you can create a trigger that compares the average number of errors over the last hour to the number of errors at the same time one day, two days, and three days ago. If it differs by 30% or more in 2 out of 3 cases, this trigger can generate an alert. The alert triggered by this condition can provide insight into potential system capacity issues during specific time intervals.
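The logic of that trigger can be sketched in a few lines of Python. This is only an illustration of the "2 out of 3" comparison, not a Zabbix trigger expression. The function names and the handling of a zero baseline are my assumptions.

```python
def deviates(current, baseline, threshold=0.30):
    """True if current differs from baseline by at least threshold (relative)."""
    if baseline == 0:
        # Assumption: any errors against a zero baseline count as a deviation.
        return current != 0
    return abs(current - baseline) / baseline >= threshold


def should_alert(current_avg, past_avgs, threshold=0.30, min_votes=2):
    """Majority vote: alert when enough past baselines disagree with now."""
    votes = sum(deviates(current_avg, past, threshold) for past in past_avgs)
    return votes >= min_votes
```

For example, with a current hourly average of 140 errors and averages of 100, 95, and 128 for the same hour on the three previous days, two of the three comparisons exceed 30%, so should_alert(140, [100, 95, 128]) is True. Requiring a majority keeps one unusual day from causing a false alarm.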

Fly in The Ointment

Despite its numerous advantages, Zabbix also has disadvantages. Firstly, setting up and configuring Zabbix can be complex and require a deep understanding of system administration. Secondly, data visualization and reporting in Zabbix may be less flexible and intuitive compared to some competitors. Thirdly, with a large number of monitored devices and high data collection intensity, Zabbix can consume significant resources, potentially requiring infrastructure scaling. Lastly, for inexperienced users, the initial setup of Zabbix can be challenging and may require time and effort to master.

Tips

Use templates only – Zabbix templates are essential for efficient and standardized monitoring configuration across various hosts and devices. Once you have created a template, you no longer have to repeat the same operations for every host.

Autodiscovery is your friend – it streamlines repetitive tasks of setting up monitoring, speeds up deployment processes for new environments by automating the process of adding new hosts and services to monitoring, reducing manual configuration efforts, ensuring timely detection of changes in your environment, and enhancing the scalability and flexibility of your monitoring infrastructure.

Zabbix proxies help to reduce network traffic, improve scalability, increase security, and add the ability to monitor remote and isolated environments efficiently.

Create informative dashboards, as they provide more efficient and visually appealing monitoring of your infrastructure, enabling timely responses to events. Also, strive to include as much information as possible in alerts, especially when you have separate operational or support teams.

Trigger with forecasts. When configuring the monitoring of any system, it’s essential to think about what metrics to collect and what events to generate notifications for. In my opinion, this is the perfect opportunity to showcase some creativity and develop triggers that alert about system changes before they become critical. Trend expressions, arithmetic functions, majority processing, and predictive trigger functions can be helpful in achieving this goal.
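To show the idea behind predictive triggers, the sketch below fits a straight line to recent samples and estimates how long it will take the trend to reach a limit, for example a disk filling up. It is a simplified illustration in plain Python and not how Zabbix implements its own predictive functions. The function names and the sample format (timestamp in seconds, value) are assumptions.

```python
def linear_fit(points):
    """Least-squares slope and intercept for (time, value) pairs."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        raise ValueError("need at least two distinct timestamps")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in points) / var_t
    return slope, mean_v - slope * mean_t


def seconds_until(points, limit):
    """Estimate seconds from the last sample until the trend reaches limit.

    Returns None when the trend is flat or moving away from the limit.
    """
    slope, intercept = linear_fit(points)
    last_t = points[-1][0]
    current = slope * last_t + intercept
    if slope == 0 or (limit - current) / slope < 0:
        return None
    return (limit - current) / slope
```

With disk usage samples of 50%, 60%, and 70% taken one hour apart, seconds_until estimates roughly three hours until 100%. A trigger built on this kind of estimate can fire while there is still time to react.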

That’s pretty much it! I hope you find this guide useful.
