Installation

Install from the Debian Testing repo, as stable can be up to a year behind.

# Testing
echo 'deb http://deb.debian.org/debian testing main' | sudo tee -a /etc/apt/sources.list.d/testing.list

# Pin testing down to a low level so the rest of your packages don't get upgraded
sudo tee -a /etc/apt/preferences.d/not-testing << EOF
Package: *
Pin: release a=testing
Pin-Priority: 50
EOF

# Living Dangerously with test
sudo apt update
sudo apt install -t testing prometheus

Configuration

Use this for your starting config.

cat /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

This says every 15 seconds, run down the job list. And there is one job - to check out the system at ’localhost:9090’ which happens to be itself.

For every target listed, the scraper makes a web request for /metrics/ and stores the results. It ingests all the data presented and stores them for 15 days. You can choose to ignore certain elements or retain differently by adding config, but by default it takes everything given.

You can see this yourself by just asking like Prometheus would. Hit it up directly in your browser. For example, Prometheus is making metrics available at /metrics

http://some.server:9090/metrics

Operation

User Interface

You can access the Web UI at:

http://some.server:9090

At the top, select Graph (you should be there already) and in the Console tab click the dropdown labeled “insert metric at cursor”. There you will see all the data being exposed. This is mostly about the GO language it’s written in, and not super interesting. A simple Graph tab is available as well.

Data Composition

Data can be simple, like:

go_gc_duration_seconds_sum 3

Or it can be dimensional which is accomplished by adding labels. In the example below, the value of go_gc_duration_seconds has 5 labeled sub-sets.

go_gc_duration_seconds{quantile="0"} 4.5697e-05
go_gc_duration_seconds{quantile="0.25"} 7.814e-05
go_gc_duration_seconds{quantile="0.5"} 0.000103396
go_gc_duration_seconds{quantile="0.75"} 0.000143687
go_gc_duration_seconds{quantile="1"} 0.001030941

In this example, the value of net_conntrack_dialer_conn_failed_total has several.

net_conntrack_dialer_conn_failed_total{dialer_name="alertmanager",reason="refused"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="alertmanager",reason="resolution"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="alertmanager",reason="timeout"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="alertmanager",reason="unknown"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="default",reason="refused"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="default",reason="resolution"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="default",reason="timeout"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="default",reason="unknown"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="snmp",reason="refused"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="snmp",reason="resolution"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="snmp",reason="timeout"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="snmp",reason="unknown"} 0

How is this useful? It allows you to do aggregations - such as looking at all the net_contrack failures, and also look at the failures that were specifically refused. All with the same data.

Removing Data

You may have a target you want to remove. Such as a typo hostname that is now causing a large red bar on a dashboard. You can remove that mistake by enabling the admin API and issuing a delete

sudo sed -i 's/^ARGS.*/ARGS="--web.enable-admin-api"/' /etc/default/prometheus

sudo systemctl reload prometheus

curl -s -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="badhost.some.org:9100"}'

The default retention is 15 days. You may want less than that and you can configure --storage.tsdb.retention.time=1d similar to above. ALL data has the same retention, however. If you want historical data you must have a separate instance or use VictoriaMetrics.

Next Steps

Let’s get something interesting to see by adding some OS metrics

Troubleshooting

If you can’t start the prometheus server, it may be an issue with the init file. Test and Prod repos use different defaults. Add some values explicitly to get new versions running

sudo vi /etc/default/prometheus

ARGS="--config.file="/etc/prometheus/prometheus.yml  --storage.tsdb.path="/var/lib/prometheus/metrics2/"