Zabbix
Zabbix is a great and awesome open source monitoring and alerting software. This is why Brian from @Awesome Open Source and I decided to make two episodes together. In this episode we will configure Zabbix to scan our network and dynamically add hosts for monitoring and alerting.
The first part can be found here: Zabbix Video on the Awesome Open Source channel
All commands that are used in the video as well as the Ansible playbook to roll out the Zabbix Agent can be found on my Github Repository
Watch the video on YouTube
So – your network has grown. You added devices. Containers, routers, PCs – whatever. But you don’t want things to go belly up because a disk was full or a link on your switch or router was down. Why did the device not tell me before? Rather than just stopping and leaving a line in the event log saying “Sorry pal, my disk is full” – it could have told me that before, right? That’s what monitoring and alerting is for.
(Zabbix is Awesome Open Source)
Good for you – there is an awesome open source solution called Zabbix for this. And because it is Awesome Open Source, Brian from the “Awesome Open Source” YouTube channel and myself decided to do these episodes together. So if you have not seen the teaser video last week or if you don’t know the Awesome Open Source YouTube channel then here’s a link for you where Brian actually shows us in his very unique and patient way how to install Zabbix and how to get it up and running in no time. Please check it out and make sure you leave him a like and a comment on his channel. Many thanks.
(Why use Zabbix and not anything else?)
So why and how do I use Zabbix? Why not Prometheus and Grafana like many other YouTubers suggest? I found those solutions a bit oversized for my home lab environment. Plus – they are lacking two or three features which I really love about Zabbix. First off, I don’t want to define all the devices and metrics by hand. I want to scan my network and have the devices added to the graphical user interface automatically. Also – I don’t really fancy defining stuff in text files. I want a user interface for that. And last but not least, I would love to use that tool not only for monitoring and alerting, but also as an inventory. Zabbix gives me all that.
(How I use Zabbix)
Here is how I use Zabbix. When I start Zabbix, the first thing I look at is the monitoring dashboard. Are there any problems? If yes, then they show up here in the Zabbix UI in different colors depending on the severity. If I click on one of those, I can see right away how things developed over time. If I have configured alerting, I also receive a text or mail from Zabbix outlining the problem. I can also go directly to the host list, which I can filter by groups. I put all Proxmox containers into one group, for example. I can pick one of those and see the latest data that has been collected. Let’s take this ebus container here. That one collects data from my heating system. Looking at the disk space you can see that there is something filling up the disk over time. In this case it’s an ever growing log file that never gets truncated. On my routers and Wifi access points I can quickly access data such as the throughput of the Wifi interfaces. On my switches I can see the status and throughput of every single port.
(How to get and install it)
There’s much more. But I am not trying to feature-sell zabbix to you. I think it’s much easier if you tried it out for yourself. If you have not watched the first episode on the Awesome Open Source YouTube channel then you might as well install it following the instructions on the zabbix.com Web site. But honestly – there are a couple of catches, so it’s much easier to watch Brian’s video and follow the steps there.
(Scan the network)
Cool – now that we have Zabbix up and running, we want to start monitoring our servers, containers and network devices. But the only host that we have available in the host list is our Zabbix Server itself. We could now of course create every single host in the network by clicking on that blue “Create Host” button in the Zabbix UI. But that’s a bit of a tedious task. Let’s rather go and have Zabbix scan our network and create the host entries automatically for us.
For this, we go to “Configuration” in the Zabbix UI, then to “Discovery”. You can see that there is already a rule which Zabbix created called Local Network. The network to be scanned is set to the subnet which my Zabbix host is in. Let’s tick the “Enabled” box here at the bottom. Now we go to “Actions”, then “Discovery Actions” and click on “Create Action”. Here we can tell Zabbix what it should do with discovered hosts. I want Zabbix to create a host entry in my inventory for every device it finds, so here under “Conditions” I click on “Add”. The type of the condition is “Discovery Rule”. I click on “Select” and choose the rule which I have just enabled. On the “Operations” tab I can now have Zabbix perform an action. I click on “Add”. As an operation I just select “Add Host”. We’ll talk about all the other options later.
We would now expect Zabbix to actually scan the network and add the hosts. But if you go to “Monitoring” and then “Discovery” then you won’t see any hosts being added. Let’s check why. We go back to “Configuration”, then “Discovery” and select the “Local Network” rule again. In this box here that is titled “Checks” you can see that it is actually scanning the network for hosts that have the Zabbix agent installed. So in fact it’s scanning the network for hosts with port 10050 open. But at this moment we do not have any Zabbix agent anywhere. We’ll do that later. So let’s rather do something more generic. I remove the Zabbix agent check and add three new ones: ICMP ping, HTTP on port 80 and SSH on port 22. So what I want Zabbix to do is ping all hosts and check whether HTTP on port 80 or SSH on port 22 is open. I’ll also change the interval to 5 minutes. You can even change this to 1 minute for starters. That’s it. Guys, for a more detailed explanation of each single field and example rules please see the Zabbix documentation chapter 15, part 1.
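To make the three checks more tangible, here is a rough shell sketch of what they ask of each address. It only illustrates the idea and is not how Zabbix implements discovery internally. The function names are made up for this example, the range 192.168.1.1-3 is a placeholder, the ping flags assume Linux, and the port test assumes that bash and timeout are installed.

```shell
# Expand a range written like "192.168.1.1-254" into single addresses.
expand_range() {
    prefix="${1%.*}"     # everything before the last dot, e.g. 192.168.1
    last="${1##*.}"      # the last octet or octet range, e.g. 1-254
    start="${last%-*}"
    end="${last#*-}"
    i="$start"
    while [ "$i" -le "$end" ]; do
        echo "${prefix}.${i}"
        i=$((i + 1))
    done
}

# Succeeds if a TCP connection to host $1, port $2 opens within one second.
port_open() {
    timeout 1 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Run the three checks from the discovery rule against one host.
probe_host() {
    host="$1"
    result=""
    ping -c 1 -W 1 "$host" >/dev/null 2>&1 && result="$result icmp"
    port_open "$host" 80 && result="$result http"
    port_open "$host" 22 && result="$result ssh"
    echo "$host:${result:- no response}"
}

# Harmless demo: only list the addresses that would be scanned.
expand_range 192.168.1.1-3

# To actually probe the network, something like:
# for ip in $(expand_range 192.168.1.1-254); do probe_host "$ip"; done
```

Probing 254 addresses one after the other with a one second timeout per check takes a while, which gives a feeling for why the discovery in Zabbix does not finish instantly either.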
Now – I’ve seen it happen that nothing changes here under Monitoring – Discovery. Restarting the server with systemctl restart zabbix-server did the trick. Nevertheless, this will now take a long time. You will see hosts being added gradually to the Zabbix interface. Time to take a coffee or go for a walk with the dog. After a while you should see that list fill up.
(Configure Hosts)
Awesome – the list has filled up and now it’s time to configure the hosts for Zabbix. If we go to Monitoring, then Hosts in the Zabbix UI then we can see the newly added hosts here. You can see that they all have this greyed out “ZBX” icon here – that’s because there is no agent yet. We’ll come to that. Now let’s first click on any of those discovered hosts and select “Configuration”. We can change a couple of obvious values here, such as the Visible name if I don’t want the host to be shown in lists with its DNS name. I strongly advise keeping the host name as is – maybe later you would want to use Zabbix as an inventory for Ansible – it will then come in handy if the host name is a proper DNS name. We can also assign the host to a group. Zabbix has created the group “Discovered hosts” here, but I can assign it to any host group, which I can also customize under Configuration – Host Groups.
Now for the interesting part – templates. We don’t want to manually define all the metrics that we want to collect from that host, like CPU, memory, disk space etc. For this, there are a huge number of templates readily available in Zabbix. If you click on “Select” next to templates and then again on “Select” in that dialogue box, then you get a list of theme-based template groups here. Let’s select “Templates”. Now I get a nice list of predefined templates for various vendors like Cisco, Mikrotik or HP and many others. There are also templates for operating systems such as Linux or Windows. In my case I select “Linux by Zabbix Agent” as I will be monitoring a Linux device.
Note that there is a passive and an active template here. In fact the Zabbix agent on the remote device can be active, which means it initiates the connection to the Zabbix server, or it can be passive, which means that it just listens on port 10050 and waits for the Zabbix server to connect. I use the passive one for local servers and the active one for laptops, for example. That’s it for the moment.
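On the agent side, the two modes map to two parameters in the agent’s config file: Server lists who may connect in for passive checks, and ServerActive names the server that the agent contacts for active checks. A small sketch that reports which mode a given config file enables could look like this. The function name agent_mode and the demo address 192.168.1.10 are made up for this example.

```shell
# Report whether a zabbix_agentd.conf enables passive checks, active checks or both.
agent_mode() {
    conf_file="$1"
    passive=no
    active=no
    # Only uncommented lines with a non-empty value count.
    grep -q '^Server=..*' "$conf_file" && passive=yes
    grep -q '^ServerActive=..*' "$conf_file" && active=yes
    case "${passive}${active}" in
        yesyes) echo "passive+active" ;;
        yesno)  echo "passive" ;;
        noyes)  echo "active" ;;
        *)      echo "none" ;;
    esac
}

# Demo on a throwaway file instead of a real agent config.
demo_conf="$(mktemp)"
printf 'Server=192.168.1.10\n# ServerActive=\n' > "$demo_conf"
agent_mode "$demo_conf"    # prints: passive
```

Whichever mode the agent is set up for should match the template variant that is assigned to the host in the Zabbix UI.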
(Add and configure the agent)
Amazing – the host is configured and all I need in order to collect data from it is to install the Zabbix agent on it. Let me quickly do this on two devices. One is a linux box and the other one is my Sandbox Router that is running OpenWrt. Let me do the router first. On OpenWrt, the zabbix agent is available as a software package. So I log into the OpenWrt user interface, select System, then Software, potentially update the list and search for “zabbix”. Fun fact here – I could even run the Zabbix Server on my OpenWrt router – but here I am only after the zabbix agent. That’s the first package here – zabbix-agentd. Furthermore, I also want to install all these zabbix-extra packages as they will provide me with more metrics. Cool – now I have the zabbix agent running on this router. Just one more thing to tweak. By default, the server IP address from which the agent accepts connections is set to the localhost address, 127.0.0.1 in the agent’s config file. So I need to replace that with the address of the zabbix server. I do this over ssh with this little sed command here. Also, I now need to restart the zabbix agent service. Done.
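The exact sed command from the video is in the Github repository. As a rough idea of what such a command does, here is a sketch that rewrites the Server line of an agent config. The server address 192.168.1.10 and the function name are placeholders. The file path and restart command in the closing comment are what I would expect on OpenWrt, so please verify them on your device.

```shell
# Placeholder: replace with the address of your own Zabbix server.
ZABBIX_SERVER_IP="192.168.1.10"

# Point the "Server=" line of an agent config at the Zabbix server.
# A backup of the original file is kept next to it with a .bak suffix.
set_agent_server() {
    conf_file="$1"
    server_ip="$2"
    # "^Server=" does not match "ServerActive=", so that line stays untouched.
    sed -i.bak "s/^Server=.*/Server=${server_ip}/" "$conf_file"
}

# Demo on a throwaway file instead of the real config.
demo_conf="$(mktemp)"
printf 'Server=127.0.0.1\nServerActive=127.0.0.1\n' > "$demo_conf"
set_agent_server "$demo_conf" "$ZABBIX_SERVER_IP"
cat "$demo_conf"

# On the router itself this would be something like (assumed path and init script):
# set_agent_server /etc/zabbix_agentd.conf "$ZABBIX_SERVER_IP"
# /etc/init.d/zabbix_agentd restart
```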
After a minute or so, the status of the host changes from grey or red to green – Zabbix is now collecting data from it. I can click on the latest data link and view the data points that Zabbix has already collected. I can even graph them already – OK, not much to see after a couple of seconds. But it works. I want to draw your attention to the fact that zabbix is already collecting over 160 data points from this single device. That’s because the templates have a large number of discovery rules. Imagine how long this would take me to build 160 data point collections by hand. Hooray for templates. On a Debian server like this rundeck server here, the Zabbix agent can be installed using apt. apt install zabbix-agent, then I launch the same sed statement. Just this time the config file is in a slightly different location. Restart the service. Done. After a minute or so – again, the host is available in the host list. Beautiful.
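Put together, the three steps on a Debian box could be sketched as follows. The script only prints the commands by default, so you can review them first. The server address is a placeholder, and the config path and service name are what I would expect from the Debian package, not something confirmed in the video.

```shell
# Dry run by default: commands are printed, not executed.
# Run with DRY_RUN=0 as root to really execute them.
DRY_RUN="${DRY_RUN:-1}"
ZABBIX_SERVER_IP="192.168.1.10"              # placeholder address
AGENT_CONF="/etc/zabbix/zabbix_agentd.conf"  # assumed Debian default path

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

install_agent() {
    run apt install zabbix-agent
    run sed -i.bak "s/^Server=.*/Server=${ZABBIX_SERVER_IP}/" "$AGENT_CONF"
    run systemctl restart zabbix-agent
}

install_agent
```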
At this point I just want to make a remark on manual Zabbix agent installation. You have seen that there is a bit of config work to be done on the client. That’s probably not a problem if you install the agent on two or three machines. You only need to do it once per machine. Nevertheless, if you want to implement this in a larger environment that is maybe rapidly changing then I really recommend using something like Ansible for the distribution of the agent. I will make a video on Ansible very shortly. If you are already using it, then please see the Ansible playbook on my Github repository.
(Alerting)
Right – now we are monitoring everything with Zabbix. To round this thing up, let’s add some alerting to our Zabbix installation. First I need to decide what should trigger an alert. I certainly don’t want to receive a text message when a machine goes high on CPU for a minute. But I do want to be alerted when the real bad stuff happens. For this, the severity of problems in Zabbix comes in handy. If we look at the dashboard again, then we can see that these problems have different severities. I’ll talk about that high memory utilization warning in a minute. So in a nutshell I could say – I want to receive alerts whenever something happens that has a severity of “High” or “Disaster”. Let’s do that. And guys, we will only touch a fraction of this. For a full documentation of the features here – again – see the Zabbix documentation chapter 10 “Notification upon events”.
The first thing I do is configure the ways of sending alerts, or “media” as they are called in Zabbix. I click on “Administration”, then “Media Types” here in the Zabbix UI. There’s a vast choice. Text message, Discord, e-mail, Telegram. I can even add my own media type, which can be e-mail, SMS – so a text message over a GSM modem if I have one attached to the server – or it could be a script or a webhook. So that can do virtually anything really. But let’s stick to the existing ones. Let’s configure e-mail for starters. I just need to type in all the parameters of my mail server here.
Once I have configured that, I go to Administration – Users and then to the user I want to add that media type to. So let’s say the admin here. I click on the Media tab and add the corresponding media type. For instance Email.
Last but not least I need to tell Zabbix who to alert depending on which condition. This is done under Configuration – Actions – Trigger Actions. We already have a disabled entry here called “Report problems to Zabbix administrators.” Let’s use this one. There is a Conditions box here. Let’s add a condition. I want to be alerted when the trigger severity is greater than or equal to “High”. Next I go over to the “Operations” tab and here I tell Zabbix what to do if the condition matches. The nice thing here is that there are operations that can be triggered when the problem occurs, recovery operations for when the problem has gone away, and update operations. In the operation details I can now define user groups or single users and many more parameters like which media to use, how often to send etc. Just don’t forget to tick the “Enabled” box and now I will receive notifications whenever something really bad happens in the environment.
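The condition “greater than or equal to High” relies on the fact that the Zabbix severities are ordered, from “Not classified” at the bottom to “Disaster” at the top. As a small illustration of that comparison, and not of anything Zabbix runs internally, here is a sketch with two made-up helper functions:

```shell
# Map a Zabbix severity name to its rank (0 = lowest, 5 = highest).
severity_rank() {
    case "$1" in
        "Not classified") echo 0 ;;
        Information)      echo 1 ;;
        Warning)          echo 2 ;;
        Average)          echo 3 ;;
        High)             echo 4 ;;
        Disaster)         echo 5 ;;
        *)                return 1 ;;   # unknown severity name
    esac
}

# Succeeds only if the given severity is "High" or above.
should_alert() {
    rank="$(severity_rank "$1")" || return 1
    [ "$rank" -ge "$(severity_rank High)" ]
}

for s in Warning Average High Disaster; do
    if should_alert "$s"; then
        echo "$s: alert"
    else
        echo "$s: no alert"
    fi
done
```

With this threshold, a “Warning” such as the high memory utilization one on the dashboard would not send a message, while “High” and “Disaster” problems would.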
(Sum up)
Awesome – Just two things before we close. Some real life scenarios where Zabbix has helped me save time and headache. I’ve already shown you the ever growing log scenario in the beginning. That’s a quite common thing. I have roughly 20 containers running 24/7 in my environment and I really don’t look into the Proxmox UI every day so getting alerted when the disk needs to be resized is really handy.
Another typical one is this vdr container on which my Sat TV used to run. If I monitor the memory usage long term, like over a month or two, then you can see that over time the memory is just eaten up. That’s a quite common problem with software written in C or C++ where the software does not properly free up memory after allocating it. So-called memory leaks. You can only identify them if you monitor long term. How long would it take me to debug the software and fix the issue? Probably weeks. Or – I just take the quick and dirty operational approach and restart the services every 30 days. Not clean, but it works and everyone is happy. Another common scenario is identifying bottlenecks. Look at my Shinobi CCTV container here. It has a quite constant CPU utilization. But if I figured out that there would be bottlenecks long term then I could just add another CPU core to the container, or I could do the opposite of course – if I see that it is over-provisioned then I could just remove one.
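The quick and dirty restart could be scheduled with cron, for example. Cron cannot express “every 30 days” exactly, so this sketch restarts on the first day of each month at 04:00 instead. The service name vdr and the file name in the closing comment are guesses for illustration only.

```shell
# Hypothetical service name: use whatever the leaking service is called.
SERVICE="vdr"

# /etc/cron.d format: minute hour day-of-month month day-of-week user command
CRON_LINE="0 4 1 * * root systemctl restart ${SERVICE}"
echo "$CRON_LINE"

# To install it inside the container (as root), something like:
# echo "$CRON_LINE" > /etc/cron.d/restart-vdr
```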
One last remark with regards to the high memory utilization warning in the dashboard here. The default zabbix linux template does not collect memory and CPU correctly from LXC containers. I had to define another template for that. I’ll put a link into the description or on the github repository.
That’s it guys – There would be so much more – let me know in the comments if you want follow ups on anything - Thanks so much for watching. And – like always – please stay safe, stay healthy, bye for now.