Metrics

1 - Prometheus

Overview

Prometheus is a time series database, meaning it's optimized to store and work with data organized in time order. It includes in its single binary:

  • Database engine
  • Collector
  • Simple web-based user interface

This allows you to collect and manage data with fewer tools and less complexity than other solutions.

Data Collection

Endpoints normally expose metrics to Prometheus by making a web page available that it can poll. This is done by including an instrumentation library (provided by Prometheus) or simply adding a listener on a high-level port that spits out some text when asked.

For systems that don't support Prometheus natively, there are a few add-on services to translate. These are called 'exporters' and they translate things such as SNMP into a web format Prometheus can ingest.

Alerting

You can also alert on the data collected. This is done through Alertmanager, a second package that works closely with Prometheus.

Visualization

You still need a dashboard tool like Grafana to handle visualizations, but you can get started quite quickly with just Prometheus.

1.1 - Installation

Install from the Debian Testing repo, as stable can be up to a year behind.

# Testing
echo 'deb http://deb.debian.org/debian testing main' | sudo tee -a /etc/apt/sources.list.d/testing.list

# Pin testing down to a low level so the rest of your packages don't get upgraded
sudo tee -a /etc/apt/preferences.d/not-testing << EOF
Package: *
Pin: release a=testing
Pin-Priority: 50
EOF

# Living Dangerously with test
sudo apt update
sudo apt install -t testing prometheus

Configuration

Use this for your starting config.

cat /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

This says: every 15 seconds, run down the job list. And there is one job - to check out the system at 'localhost:9090', which happens to be itself.

For every target listed, the scraper makes a web request for /metrics and stores the results. It ingests all the data presented and keeps it for 15 days. You can choose to ignore certain elements or retain them differently by adding config, but by default it takes everything given.

You can see this yourself by asking just like Prometheus would - hit the endpoint directly in your browser. For example, Prometheus makes its own metrics available at /metrics:

http://some.server:9090/metrics
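
If you'd rather check from the command line (assuming you're on the server itself), the same text Prometheus scrapes is a curl away:

# Show the first few lines of the metrics page
curl -s http://localhost:9090/metrics | head -20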

Operation

User Interface

You can access the Web UI at:

http://some.server:9090

At the top, select Graph (you should be there already) and in the Console tab click the dropdown labeled "insert metric at cursor". There you will see all the data being exposed. This is mostly about the Go language it's written in, and not super interesting. A simple Graph tab is available as well.

Data Composition

Data can be simple, like:

go_gc_duration_seconds_sum 3

Or it can be dimensional, which is accomplished by adding labels. In the example below, the metric go_gc_duration_seconds has 5 labeled sub-sets.

go_gc_duration_seconds{quantile="0"} 4.5697e-05
go_gc_duration_seconds{quantile="0.25"} 7.814e-05
go_gc_duration_seconds{quantile="0.5"} 0.000103396
go_gc_duration_seconds{quantile="0.75"} 0.000143687
go_gc_duration_seconds{quantile="1"} 0.001030941

In this example, the value of net_conntrack_dialer_conn_failed_total has several.

net_conntrack_dialer_conn_failed_total{dialer_name="alertmanager",reason="refused"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="alertmanager",reason="resolution"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="alertmanager",reason="timeout"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="alertmanager",reason="unknown"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="default",reason="refused"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="default",reason="resolution"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="default",reason="timeout"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="default",reason="unknown"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="snmp",reason="refused"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="snmp",reason="resolution"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="snmp",reason="timeout"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="snmp",reason="unknown"} 0

How is this useful? It allows you to do aggregations - such as looking at all the net_conntrack failures, and also looking at just the failures that were refused. All with the same data.
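
For example, these PromQL expressions (typed into the Graph view's expression box) aggregate the same underlying series three different ways:

# Total failures across every dialer and reason
sum(net_conntrack_dialer_conn_failed_total)

# Failures broken out by reason alone
sum by (reason) (net_conntrack_dialer_conn_failed_total)

# Just the refused connections
net_conntrack_dialer_conn_failed_total{reason="refused"}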

Removing Data

You may have a target you want to remove, such as a mistyped hostname that is now causing a large red bar on a dashboard. You can remove that mistake by enabling the admin API and issuing a delete.

sudo sed -i 's/^ARGS.*/ARGS="--web.enable-admin-api"/' /etc/default/prometheus

sudo systemctl restart prometheus

curl -s -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="badhost.some.org:9100"}'
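
Note that delete_series only marks the data with tombstones; to compact it away immediately there is a companion admin endpoint, clean_tombstones:

curl -s -X POST 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'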

The default retention is 15 days. You may want less than that, and you can configure --storage.tsdb.retention.time=1d similar to the above. ALL data has the same retention, however. If you want longer-term historical data you must run a separate instance or use something like VictoriaMetrics.
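
For example, to keep just one day of data - a sketch that also keeps the admin API flag from above, and uses restart since these are command-line arguments:

sudo sed -i 's/^ARGS.*/ARGS="--web.enable-admin-api --storage.tsdb.retention.time=1d"/' /etc/default/prometheus
sudo systemctl restart prometheus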

Next Steps

Let's get something interesting to see by adding some OS metrics.

Troubleshooting

If you can't start the Prometheus server, it may be an issue with the init file. The testing and stable repos use different defaults. Add some values explicitly to get new versions running.

sudo vi /etc/default/prometheus

ARGS="--config.file="/etc/prometheus/prometheus.yml  --storage.tsdb.path="/var/lib/prometheus/metrics2/"

1.2 - Node Exporter

This is a service you install on your endpoints that makes CPU/Memory/etc. metrics available to Prometheus.

Installation

On each device you want to monitor, install the node exporter with this command.

sudo apt install prometheus-node-exporter

Do a quick test to make sure it’s responding to scrapes.

curl localhost:9100/metrics
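
To spot-check a specific collector rather than scroll the full output, grep for a metric family; node_cpu_seconds_total is one of the standard node exporter metrics:

curl -s localhost:9100/metrics | grep '^node_cpu_seconds_total' | head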

Configuration

Back on your Prometheus server, add these new nodes as a job in the prometheus.yml file. Feel free to drop the initial job where Prometheus was scraping itself.

global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'servers'
    static_configs:
    - targets:
      - some.server:9100
      - some.other.server:9100
      - and.so.on:9100

sudo systemctl reload prometheus.service

Operation

You can check the status of your new targets at:

http://some.server:9090/classic/targets
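
Once the targets show as UP, try a query in the Graph view. This widely used PromQL expression (generic, not specific to this setup) gives per-instance CPU utilization as a percentage:

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)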

A lot of data is collected by default. On some low-power systems you may want less. For just the basics, customize the config to disable the defaults and only enable specific collectors.

In the case below we reduce collection to just CPU, memory, and hardware metrics. When scraping a Pi 3B, this reduces the scrape duration from 500 ms to 50 ms.

sudo sed -i 's/^ARGS.*/ARGS="--collector.disable-defaults --collector.hwmon --collector.cpu --collector.meminfo"/' /etc/default/prometheus-node-exporter
sudo systemctl restart prometheus-node-exporter

The available collectors are listed on the page:

https://github.com/prometheus/node_exporter

1.3 - SNMP Exporter

SNMP is one of the most prevalent (and clunky) protocols still widely used on network-attached devices. But it’s a good general-purpose way to get data from lots of different makes of products in a similar way.

But Prometheus doesn't understand SNMP. The solution is a translation service that acts as a middle-man and 'exports' data from those devices in a way Prometheus can ingest.

Installation

Assuming you’ve already installed Prometheus, install some SNMP tools and the exporter. If you have an error installing the mibs-downloader, check troubleshooting at the bottom.

sudo apt install snmp snmp-mibs-downloader
sudo apt install -t testing prometheus-snmp-exporter

Change the SNMP tools config file to allow use of installed MIBs. It’s disabled by default.

# The entry 'mibs:' in the file overrides the default path. Comment it out so the defaults kick back in.
sudo sed -i 's/^mibs/# &/' /etc/snmp/snmp.conf
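
To verify the MIBs are now being picked up, ask the tools to translate a well-known name into its numeric OID. Getting a number back (rather than an error) means the MIB path is good:

snmptranslate -On SNMPv2-MIB::sysName.0

.1.3.6.1.2.1.1.5.0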

Preparation

We need a target, so assuming you have a switch somewhere and can enable SNMP on it, let's query the switch for its name, AKA sysName. Here we're using version '2c' of the protocol with the read-only community string 'public'. Pretty standard.

Industry Standard Query

There are some well-known SNMP objects you can query, like System Name.

# Get the first value (starting at 0) of the sysName object
snmpget -Oqv -v 2c -c public some.switch.address sysName.0

Some-Switch

# Sometimes you have to use 'getnext' if 0 isn't populated
snmpgetnext -v 2c -c public some.switch.address sysName

Vendor Specific Query

Some vendors will release their own custom MIBs. These provide additional data for things that don’t have well-known objects. Add the MIBs to the system and ‘walk’ the tree to see what’s interesting.

# Unifi, for example
sudo cp UBNT-MIB.txt UBNT-UniFi-MIB.txt  /usr/share/snmp/mibs

# snmpwalk doesn't look for enterprise sections by default, so you have to 
# look at the MIB and add the specific company's OID number.
grep enterprises UBNT-*
...
UBNT-MIB.txt:    ubnt OBJECT IDENTIFIER ::= { enterprises 41112 }
...

snmpwalk -v2c -c public 10.10.202.246 enterprises.41112 

Note: If you get back an error or just the ‘iso’ prefixed value, double check the default MIB path.

Configuration

To add this switch to the Prometheus scraper, add a new job to the prometheus.yml file. This job includes the targets as normal, but also the path (since it's different from the default) and an optional parameter called module that's specific to the SNMP exporter. It also does something confusing - a relabel_config.

This is because Prometheus isn't actually talking to the switch, it's talking to the local SNMP exporter service. So we list all the targets normally, and then at the bottom say 'oh, by the way, do a switcheroo'. This allows Prometheus to display all the data normally, with no one the wiser.

...
...
scrape_configs:
  - job_name: 'snmp'
    static_configs:
      - targets:
        - some.switch.address    
    metrics_path: /snmp
    params:
      module: [if_mib]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9116  # The SNMP exporter's real hostname:port.

Operation

No configuration on the exporter side is needed. Reload the config and check the target list. Then examine data in the graph section. Add additional targets as needed and the exporter will pull in the data.

http://some.server:9090/classic/targets
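
You can also query the exporter directly, the same way Prometheus does, to preview what a target will return (adjust the module and addresses to match your config):

curl -s 'http://localhost:9116/snmp?module=if_mib&target=some.switch.address' | head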

These metrics are considered well known and so will appear in the database under names like sysUpTime and upsBasicBatteryStatus, not prefixed with snmp_ like you might expect.

Next Steps

If you have something non-standard, or you simply don’t want that huge amount of data in your system, look at the link below to customize the SNMP collection with the Generator.

SNMP Exporter Generator Customization

Troubleshooting

The snmp-mibs-downloader is just a handy way to download a bunch of default MIBs so that when you use the tools, all the cryptic numbers, like "1.3.6.1.2.1.17.4.3.1", are translated into pleasant names.

If you can't find the mibs-downloader, it's probably because it's in the non-free repo, and that's not enabled by default. Change your apt sources file like so:

sudo vi /etc/apt/sources.list

deb http://deb.debian.org/debian/ bullseye main contrib non-free
deb-src http://deb.debian.org/debian/ bullseye main contrib non-free

deb http://security.debian.org/debian-security bullseye-security main contrib non-free
deb-src http://security.debian.org/debian-security bullseye-security main contrib non-free

deb http://deb.debian.org/debian/ bullseye-updates main contrib non-free
deb-src http://deb.debian.org/debian/ bullseye-updates main contrib non-free

It may be that you only need to change one line.

1.4 - SNMP Generator

Installation

There is no need to install the Generator, as it comes with the SNMP exporter. But if you have a device that supplies its own MIB (and many do), you should add that to the default location.

# Mibs are often named SOMETHING-MIB.txt
sudo cp -n *MIB.txt /usr/share/snmp/mibs/

Preparation

You must identify the values you want to capture. Using snmpwalk is a good way to see what’s available, but it helps to have a little context.

The data is arranged like a folder structure that you drill down through. The folder names are all numeric, with '.' instead of slashes. So if you wanted to get a device's sysName, you'd click down through 1.3.6.1.2.1.1.5 and look in the file 0.

When you use snmpwalk, it starts wherever you tell it and drills down from there, printing out everything it finds.

How do you know that's where sysName is? A bunch of folks got together (the ISO folks) and decided everything in advance. Then they made some handy files (MIBs) and passed them out so you didn't have to remember all the numbers.

They allow vendors to create their own sections as well, for things that might not fit anywhere else.

A good place to start is looking at what the vendor made available. You see this by walking their section and including their MIB so you get descriptive names - only the ISO System MIB is included by default.

# The sysObjectID identifies the vendor section
# Note use of the MIB name without the .txt
$ snmpwalk -m +SOMEVENDOR-MIB -v 2c -c public some.address sysObjectID

SNMPv2-MIB::sysObjectID.0 = OID: SOMEVENDOR-MIB::somevendoramerica

# Then walk the vendor section using the name from above
$ snmpwalk -m +SOMEVENDOR-MIB -v 2c -c public some.address somevendoramerica

SOMEVENDOR-MIB::model.0 = STRING: SOME-MODEL
SOMEVENDOR-MIB::power.0 = INTEGER: 0
...
...

# Also check out the general System section
$ snmpwalk -m +SOMEVENDOR-MIB -v 2c -c public some.address system

# You can also walk the whole ISO tree. In some cases,
# there are thousands of entries and it's indecipherable
$ snmpwalk -m +SOMEVENDOR-MIB -v 2c -c public some.system iso

This can be a lot of information and you’ll need to do some homework to see what data you want to collect.

Configuration

The exporter's default configuration file is snmp.yml and contains about 57,000 lines of config. It's designed to pull data from whatever you point it at. Basically, it doesn't know what device it's talking to, so it tries to cover all the bases.

This isn't a file you should edit by hand. Instead, you create instructions for the generator, and it looks through the MIBs and creates one for you. Here's an example for a Samlex inverter.

vim ~/generator.yml
modules:
  samlex:
    walk:
      - sysLocation
      - inverterMode
      - power
      - vin
      - tempDD
      - tempDA

prometheus-snmp-generator generate
sudo cp /etc/prometheus/snmp.yml /etc/prometheus/snmp.yml.orig
sudo cp ~/snmp.yml /etc/prometheus
sudo systemctl reload prometheus-snmp-exporter.service

Configuration in Prometheus remains the same - but since we picked a new module name, we need to adjust that.

    ...
    ...
    params:
      module: [samlex]
    ...
    ...
sudo systemctl reload prometheus.service

Adding Data Prefixes

By default, the names are all over the place. The SNMP Exporter devs leave it this way because there are a lot of pre-built dashboards on downstream systems that expect the existing names.

If you are building your own downstream systems, you can prefix (as is best practice) as you like with a post-generation step. This example causes them all to be prefixed with samlex_.

prometheus-snmp-generator generate
sed -i 's/name: /name: samlex_/' snmp.yml

Combining MIBs

You can combine multiple systems in the generator file to create one snmp.yml file, and refer to them by the module name in the Prometheus file.

modules:
  samlex:
    walk:
      - sysLocation
      - inverterMode
      - power
      - vin
      - tempDD
      - tempDA
  ubiquiti:
    walk:
      - something
      - somethingElse  
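
On the Prometheus side, each device type then gets its own job referencing the matching module. A sketch with placeholder targets, and the relabel_configs from the SNMP Exporter section elided:

scrape_configs:
  - job_name: 'samlex'
    static_configs:
      - targets: ['some.inverter.address']
    metrics_path: /snmp
    params:
      module: [samlex]
    relabel_configs:
      ...
  - job_name: 'ubiquiti'
    static_configs:
      - targets: ['some.accesspoint.address']
    metrics_path: /snmp
    params:
      module: [ubiquiti]
    relabel_configs:
      ...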

Operation

As before, you can get a preview directly from the exporter (using a link like below). This data should show up in the Web UI too.

http://some.server:9116/snmp?module=samlex&target=some.device

Sources

https://github.com/prometheus/snmp_exporter/tree/main/generator

1.5 - Sensors DHT

DHT stands for Digital Humidity and Temperature. At less than $5, they are cheap and can be hooked to a Raspberry Pi easily. Add a Prometheus exporter if you want to do this at scale.

  • Connect the Sensor
  • Provision and Install the Python Libraries
  • Test the Libraries and the Sensor
  • Install the Prometheus Exporter as a Service
  • Create a Service Account
  • Add to Prometheus

Connect The Sensor

These usually come as a breakout board with three leads you can connect to the Raspberry Pi GPIO pins as follows:

  • Positive lead to pin 1 (power)
  • Negative lead to pin 6 (ground)
  • Middle or ‘out’ lead to pin 7 (that’s GPIO 4)

Image of DHT Connection

(From https://github.com/rnieva/Playing-with-Sensors---Raspberry-Pi)

Provision and Install

Use the Raspberry Pi Imager to provision the Pi with Raspberry Pi OS Lite (64-bit). Next, install the "adafruit_blinka" library as adapted from their instructions, and test.

# General updates
sudo apt update
sudo apt -y upgrade
sudo apt -y autoremove
sudo reboot

# These python components may already be installed, but making sure
sudo apt -y install python3-pip
sudo apt -y install python3-setuptools
sudo apt -y install python3-venv

# Make a virtual environment for the python process
sudo mkdir /usr/local/bin/sensor-dht
sudo python3 -m venv /usr/local/bin/sensor-dht --system-site-packages
cd /usr/local/bin/sensor-dht
sudo chown -R ${USER}:${USER} .
source bin/activate

# Build and install the library
pip3 install --upgrade adafruit-python-shell
wget https://raw.githubusercontent.com/adafruit/Raspberry-Pi-Installer-Scripts/master/raspi-blinka.py
sudo -E env PATH=$PATH python3 raspi-blinka.py

Test the Libraries and the Sensor

After logging back in, test the blinka lib.

cd /usr/local/bin/sensor-dht
source bin/activate
wget https://learn.adafruit.com/elements/2993427/download -O blinkatest.py
python3 blinkatest.py

Then install the DHT library from CircuitPython and create a script to test the sensor.

cd /usr/local/bin/sensor-dht
source bin/activate
pip3 install adafruit-circuitpython-dht

vi sensortest.py
import board
import adafruit_dht

dhtDevice = adafruit_dht.DHT11(board.D4)
temp = dhtDevice.temperature
humidity = dhtDevice.humidity

print(
  "Temp: {:.1f} C    Humidity: {}% ".format(temp, humidity)
)

dhtDevice.exit()

You may get occasional errors like RuntimeError: Checksum did not validate. Try again. These are safe to ignore - DHTs are not 100% solid.

Install the Prometheus Exporter as a Service

Add the Prometheus pips.

cd /usr/local/bin/sensor-dht
source bin/activate
pip3 install prometheus_client

And create a script like this.

nano sensor.py
import time

import board
import adafruit_dht
from prometheus_client import start_http_server, Gauge

dhtDevice = adafruit_dht.DHT11(board.D4)

temperature_gauge = Gauge('dht_temperature', 'Local temperature')
humidity_gauge = Gauge('dht_humidity', 'Local humidity')

# Expose the gauges on port 8000 for Prometheus to scrape
start_http_server(8000)

while True:

    try:
        temperature = dhtDevice.temperature
        temperature_gauge.set(temperature)

        humidity = dhtDevice.humidity
        humidity_gauge.set(humidity)

    except RuntimeError:
        # Read errors happen fairly often, as DHTs are hard to read. Just continue on.
        continue

    finally:
        time.sleep(60)

Create a service

sudo nano /lib/systemd/system/sensor.service
[Unit]
Description=Temperature and Humidity Sensing Service
After=network.target

[Service]
Type=idle
Restart=on-failure
User=root
ExecStart=/bin/bash -c 'cd /usr/local/bin/sensor-dht && source bin/activate && python sensor.py'

[Install]
WantedBy=multi-user.target

Enable and start it

sudo systemctl enable --now sensor.service

curl http://localhost:8000/metrics
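
The output should include the two gauges defined in sensor.py, something like this (values are illustrative):

# HELP dht_temperature Local temperature
# TYPE dht_temperature gauge
dht_temperature 22.0
# HELP dht_humidity Local humidity
# TYPE dht_humidity gauge
dht_humidity 41.0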

Create a Service Account

This service is running as root. You should consider creating a sensor account.

sudo useradd --home-dir /usr/local/bin/sensor-dht --system --shell /usr/sbin/nologin --comment "Sensor Service" sensor
sudo usermod -aG gpio sensor

sudo systemctl stop  sensor.service
sudo chown -R sensor:sensor /usr/local/bin/sensor-dht
sudo sed -i 's/User=root/User=sensor/' /lib/systemd/system/sensor.service

sudo systemctl daemon-reload 
sudo systemctl start sensor.service 

Add to Prometheus

Adding it requires logging into your Prometheus server and adding a job like the one below.

sudo vi /etc/prometheus/prometheus.yml
...
...
  - job_name: 'dht' 
    static_configs: 
      - targets: 
        - 192.168.1.45:8000

You will be able to find the node in your server at http://YOUR-SERVER:9090/targets?search=#pool-dht, and its data will show up with a leading dht_.

Sources

https://randomnerdtutorials.com/raspberry-pi-dht11-dht22-python/

You may want to raise errors to the log as in the above source.

2 - Smokeping

I’ve been using Smokeping for at least 20 years. Every so often I look at the competitors, but it’s still the best for a self-contained latency monitoring system.

Installation

On Debian Stretch:

# Install smokeping - apache gets installed automatically and the config enabled
sudo apt-get install smokeping

# Install and enable SpeedyCGI if you can find it, otherwise, fastCGI
sudo apt install libapache2-mod-fcgid

Configure

Edit the General Config file

sudo vim /etc/smokeping/config.d/General

owner    = Some Org
contact  = [email protected]
mailhost = localhost

cgiurl   = http://some.server.you.just.deployed.on/cgi-bin/smokeping.cgi

# specify this to get syslog logging
syslogfacility = local0

# each probe is now run in its own process
# disable this to revert to the old behaviour
# concurrentprobes = no

@include /etc/smokeping/config.d/pathnames

Edit the pathnames file. You must put in a value for sendmail (if you don't have it installed) so that Smokeping will run.

sudo vim /etc/smokeping/config.d/pathnames


sendmail = /bin/false
imgcache = /var/cache/smokeping/images
imgurl   = ../smokeping/images
...
...

Edit the Alerts

sudo vim /etc/smokeping/config.d/Alerts


to = [email protected]
from = [email protected]

Edit the Targets

# Add your own targets that you will measure by appending them to the bottom
# of the Targets file. These will show up in a menu on the left of the
# generated web page. An entry starting with + creates a top-level entry,
# and subsequent lines with ++ show up as sub-entries, like so:
#
#   + My Company
#   ++ My Company's Web Server 1
#   ++ My Company's Web Server 2
#
# Actual config requires a few extra lines, as below:

sudo vim /etc/smokeping/config.d/Targets

+ My_Company
menu = My Company
title = My Company

++ Web_Server_1
menu = Web Server 1
title = Web Server 1
host = web.server.org

# Restart smokeping and apache
sudo service smokeping restart
sudo service apache2 reload

Access smokeping at:

http://some.server.name/smokeping

Notes

The default resolution - i.e. polling frequency - is 20 requests over 5 minutes, or 1 request every 15 seconds.

http://collaboration.cmc.ec.gc.ca/science/rpn/biblio/ddj/Website/articles/SA/v12/i07/a5.htm
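
If you want a different cadence, those numbers come from the Database section of the config. A sketch of the relevant lines (the values shown are the defaults):

sudo vim /etc/smokeping/config.d/Database

# step is seconds per polling round; pings is probes sent each round
*** Database ***
step     = 300
pings    = 20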

ArchWiki suggests a workaround for sendmail

https://wiki.archlinux.org/index.php/smokeping