
Security, Monitoring and Alerting

1 - CrowdSec

1.1 - Installation

Overview

CrowdSec has two main parts: detection and interdiction.

Detection is handled by the main CrowdSec binary. You tell it what files to keep an eye on, how to parse those files, and what something ‘bad’ looks like. It then keeps a list of IPs that have done bad things.

Interdiction is handled by any number of plugins called ‘bouncers’, so named because they block access or kick out bad IPs. They run independently and keep an eye on the list, to do things like edit the firewall to block access for a bad IP.

There is also the ‘crowd’ part. The CrowdSec binary downloads IPs of known bad-actors from the cloud for your bouncers to keep out and submits alerts from your systems.

Installation

With Debian, you can simply add the repo via their script and install with a couple of lines.

curl -s https://packagecloud.io/install/repositories/crowdsec/crowdsec/script.deb.sh | sudo bash
sudo apt install crowdsec
sudo apt install crowdsec-firewall-bouncer-nftables

This installs both the detection (crowdsec) and the interdiction (crowdsec-firewall-bouncer) parts. Assuming everything went well, crowdsec will check in with the cloud and download a baseline list of known bad actors, the firewall-bouncer will set up a basic drop list in the firewall, and crowdsec will start watching your syslog for intrusion attempts.

# Check out the very long drop list
sudo nft list ruleset | less

Note - if there are no rules, you may need to sudo systemctl restart nftables.service or possibly reboot (as I’ve found in testing)
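You can also confirm the firewall bouncer registered itself with the local API; a valid entry here means it’s able to pull the decision list.

# The bouncer should appear with a valid status
sudo cscli bouncers list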

Configuration

CrowdSec comes pre-configured to watch for ssh brute-force attacks. If you have specific services to watch you can add those as described below.
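To see which log files it’s already watching, peek at the acquisition file the package installed:

sudo cat /etc/crowdsec/acquis.yaml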

Add a Service

You probably want to watch a specific service, like a web server. Take a look at https://hub.crowdsec.net/ to see all the available components. For example, browse the collections and search for caddy. The more info link will show you how to install the collection:

sudo cscli collections list -a
sudo cscli collections install crowdsecurity/caddy

Tell CrowdSec where Caddy’s log files are.

sudo tee -a /etc/crowdsec/acquis.yaml << EOF

---
filenames:
 - /var/log/caddy/*.log
labels:
  type: caddy
---
EOF

Reload crowdsec for these changes to take effect

sudo systemctl reload crowdsec

Operation

DataFlow

CrowdSec works by pulling in data from the Acquisition files, Parsing the events, comparing to Scenarios, and then Deciding if action should be taken.

Acquisition of data from log files is based on entries in the acquis.yaml file, and the events given a label as defined in that file.

Those events feed the Parsers. There are a handful by default, but only the ones specifically interested in a given label will see it. They look for keywords like ‘FAILED LOGIN’ and then extract the IP.

Successfully parsed lines are fed to the Scenarios to see if what happened matters. The scenarios look for things like 10 FAILED LOGINs in 1 min. This separates an accidental bad password entry from a brute-force attempt.

Matching a scenario gets the IP added to the Decision List, i.e. the list of bad IPs. These have a configurable expiration, so that if you really do guess wrong 10 times in a row, you’re not banned forever.

The bouncers use this list to take action, like a firewall block, and will unblock you after the expiration.
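The expiration itself isn’t defined in the scenarios; it comes from the decision profile. On a stock install, /etc/crowdsec/profiles.yaml looks roughly like this (your version may differ slightly), with a ban of a few hours on offending IPs:

sudo cat /etc/crowdsec/profiles.yaml

name: default_ip_remediation
filters:
 - Alert.Remediation == true && Alert.GetScope() == "Ip"
decisions:
 - type: ban
   duration: 4h
on_success: break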

Collections

Parsers and Scenarios work best when they work together so they are usually distributed together as a Collection. You can have collections of collections as well. For example, the base installation comes with the linux collection that includes a few parsers and the sshd collection.

To see what Collections, Parsers and Scenarios are running, use the cscli command line interface.

sudo cscli collections list
sudo cscli collections inspect crowdsecurity/linux
sudo cscli collections inspect crowdsecurity/sshd

Inspecting a collection will tell you what parsers and scenarios it contains, as well as some metrics. To learn more about a collection and its components, you can check out its page:

https://hub.crowdsec.net/author/crowdsecurity/collections/linux

The metrics are a bit confusing until you learn that the ‘Unparsed’ column doesn’t mean unparsed so much as it means a non-event. These are just normal logfile lines that don’t have one of the keywords the parser was looking for, like ‘LOGIN FAIL’.

Status

Is anyone currently attacking you? The decisions list shows you any current bad actors and the alerts list shows you a summary of past decisions. If you are just getting started this is probably none, but if you’re open to the internet this will grow quickly.

sudo cscli decisions list
sudo cscli alerts list

But you are getting events from the cloud and you can check those with the -a option. You’ll notice that every 2 hours the community-blocklist is updated.

sudo cscli alerts list -a

After this collection has been running for a while, you’ll start to see these kinds of alerts:

sudo cscli alerts list
╭────┬───────────────────┬───────────────────────────────────────────┬─────────┬────────────────────────┬───────────┬─────────────────────────────────────────╮
│ ID │       value       │                  reason                   │ country │           as           │ decisions │               created_at                │
├────┼───────────────────┼───────────────────────────────────────────┼─────────┼────────────────────────┼───────────┼─────────────────────────────────────────┤
│ 27 │ Ip:18.220.128.229 │ crowdsecurity/http-bad-user-agent         │ US      │ 16509 AMAZON-02        │ ban:1     │ 2023-03-02 13:12:27.948429492 +0000 UTC │
│ 26 │ Ip:18.220.128.229 │ crowdsecurity/http-path-traversal-probing │ US      │ 16509 AMAZON-02        │ ban:1     │ 2023-03-02 13:12:27.979479713 +0000 UTC │
│ 25 │ Ip:18.220.128.229 │ crowdsecurity/http-probing                │ US      │ 16509 AMAZON-02        │ ban:1     │ 2023-03-02 13:12:27.9460075 +0000 UTC   │
│ 24 │ Ip:18.220.128.229 │ crowdsecurity/http-sensitive-files        │ US      │ 16509 AMAZON-02        │ ban:1     │ 2023-03-02 13:12:27.945759433 +0000 UTC │
│ 16 │ Ip:159.223.78.147 │ crowdsecurity/http-probing                │ SG      │ 14061 DIGITALOCEAN-ASN │ ban:1     │ 2023-03-01 23:03:06.818512212 +0000 UTC │
│ 15 │ Ip:159.223.78.147 │ crowdsecurity/http-sensitive-files        │ SG      │ 14061 DIGITALOCEAN-ASN │ ban:1     │ 2023-03-01 23:03:05.814690037 +0000 UTC │
╰────┴───────────────────┴───────────────────────────────────────────┴─────────┴────────────────────────┴───────────┴─────────────────────────────────────────╯

You may even need to unblock yourself

sudo cscli decisions list
sudo cscli decisions delete --id XXXXXXX
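If you’d rather not hunt for the decision ID, you can also remove (or add) a decision by IP; the address below is just a placeholder.

# Remove any decision against a specific IP (placeholder address)
sudo cscli decisions delete --ip 203.0.113.4

# Or ban one manually for a few hours
sudo cscli decisions add --ip 203.0.113.4 --duration 4h --reason "manual ban"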

Next Steps

You’re now taking advantage of the crowd part of CrowdSec and have added your own service. If you don’t have any alerts though, you may be wondering how well it’s actually working.

Take a look at the detailed activity if you want to look more closely at what’s going on.

1.2 - Detailed Activity

Inspecting Metrics

Data comes in through the parsers. To see what they are doing, let’s take a look at the Acquisition and Parser metrics.

sudo cscli metrics

Most of the ‘Acquisition Metrics’ lines will show as read but unparsed. This is because normal events are dropped; it only counts lines as parsed if they were passed on to a scenario. The ‘bucket’ column refers to event scenarios and is also blank, as there were no parsed lines to hand off.

Acquisition Metrics:
╭────────────────────────┬────────────┬──────────────┬────────────────┬────────────────────────╮
│         Source         │ Lines read │ Lines parsed │ Lines unparsed │ Lines poured to bucket │
├────────────────────────┼────────────┼──────────────┼────────────────┼────────────────────────┤
│ file:/var/log/auth.log │ 216        │ -            │ 216            │ -                      │
│ file:/var/log/syslog   │ 143        │ -            │ 143            │ -                      │
╰────────────────────────┴────────────┴──────────────┴────────────────┴────────────────────────╯

The ‘Parser Metrics’ will show the individual parsers - but not all of them. Only parsers that have at least one ‘hit’ are shown. In this example, only the syslog parser shows up. It’s a low-level parser that doesn’t look for matches, so every line is a hit.

Parser Metrics:
╭─────────────────────────────────┬──────┬────────┬──────────╮
│             Parsers             │ Hits │ Parsed │ Unparsed │
├─────────────────────────────────┼──────┼────────┼──────────┤
│ child-crowdsecurity/syslog-logs │ 359  │ 359    │ -        │
│ crowdsecurity/syslog-logs       │ 359  │ 359    │ -        │
╰─────────────────────────────────┴──────┴────────┴──────────╯

However, try a couple of failed SSH login attempts and you’ll see them, and how they feed up to the Acquisition Metrics.


Acquisition Metrics:
╭────────────────────────┬────────────┬──────────────┬────────────────┬────────────────────────╮
│         Source         │ Lines read │ Lines parsed │ Lines unparsed │ Lines poured to bucket │
├────────────────────────┼────────────┼──────────────┼────────────────┼────────────────────────┤
│ file:/var/log/auth.log │ 242        │ 3            │ 239            │ -                      │
│ file:/var/log/syslog   │ 195        │ -            │ 195            │ -                      │
╰────────────────────────┴────────────┴──────────────┴────────────────┴────────────────────────╯

Parser Metrics:
╭─────────────────────────────────┬──────┬────────┬──────────╮
│             Parsers             │ Hits │ Parsed │ Unparsed │
├─────────────────────────────────┼──────┼────────┼──────────┤
│ child-crowdsecurity/sshd-logs   │ 61   │ 3      │ 58       │
│ child-crowdsecurity/syslog-logs │ 442  │ 442    │ -        │
│ crowdsecurity/dateparse-enrich  │ 3    │ 3      │ -        │
│ crowdsecurity/geoip-enrich      │ 3    │ 3      │ -        │
│ crowdsecurity/sshd-logs         │ 8    │ 3      │ 5        │
│ crowdsecurity/syslog-logs       │ 442  │ 442    │ -        │
│ crowdsecurity/whitelists        │ 3    │ 3      │ -        │
╰─────────────────────────────────┴──────┴────────┴──────────╯

‘Lines poured to bucket’, however, is still empty. That means the action didn’t match a scenario defining a hack attempt. In fact, you may notice the ‘whitelists’ parser was triggered. Let’s ask crowdsec to explain what’s going on.

Detailed Parsing

To see which parsers got involved and what they did, you can ask.

sudo cscli explain --file /var/log/auth.log --type syslog

Here’s an ssh example of a failed login. The numbers, such as (+9 ~1), mean that the parser added 9 elements it parsed from the raw event and updated 1. Notice the whitelists parser at the end. It’s catching this event and dropping it, hence the ‘parser failure’. The failure message is a red herring, as this is how it’s supposed to work; it short-circuits as soon as it thinks something should be whitelisted.

line: Mar  1 14:08:11 www sshd[199701]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=192.168.1.16  user=allen
        ├ s00-raw
        |       └ 🟢 crowdsecurity/syslog-logs (first_parser)
        ├ s01-parse
        |       └ 🟢 crowdsecurity/sshd-logs (+9 ~1)
        ├ s02-enrich
        |       ├ 🟢 crowdsecurity/dateparse-enrich (+2 ~1)
        |       ├ 🟢 crowdsecurity/geoip-enrich (+9)
        |       └ 🟢 crowdsecurity/whitelists (~2 [whitelisted])
        └-------- parser failure 🔴

But why exactly did it get whitelisted? Let’s ask for a verbose report.

sudo cscli explain -v --file /var/log/auth.log --type syslog
line: Mar  1 14:08:11 www sshd[199701]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=192.168.1.16  user=someGuy
        ├ s00-raw
        |       └ 🟢 crowdsecurity/syslog-logs (first_parser)
        ├ s01-parse
        |       └ 🟢 crowdsecurity/sshd-logs (+9 ~1)
        |               └ update evt.Stage : s01-parse -> s02-enrich
        |               └ create evt.Parsed.sshd_client_ip : 192.168.1.16
        |               └ create evt.Parsed.uid : 0
        |               └ create evt.Parsed.euid : 0
        |               └ create evt.Parsed.pam_type : unix
        |               └ create evt.Parsed.sshd_invalid_user : someGuy
        |               └ create evt.Meta.service : ssh
        |               └ create evt.Meta.source_ip : 192.168.1.16
        |               └ create evt.Meta.target_user : someGuy
        |               └ create evt.Meta.log_type : ssh_failed-auth
        ├ s02-enrich
        |       ├ 🟢 crowdsecurity/dateparse-enrich (+2 ~1)
        |               ├ create evt.Enriched.MarshaledTime : 2023-03-01T14:08:11Z
        |               ├ update evt.MarshaledTime :  -> 2023-03-01T14:08:11Z
        |               ├ create evt.Meta.timestamp : 2023-03-01T14:08:11Z
        |       ├ 🟢 crowdsecurity/geoip-enrich (+9)
        |               ├ create evt.Enriched.Longitude : 0.000000
        |               ├ create evt.Enriched.ASNNumber : 0
        |               ├ create evt.Enriched.ASNOrg : 
        |               ├ create evt.Enriched.ASNumber : 0
        |               ├ create evt.Enriched.IsInEU : false
        |               ├ create evt.Enriched.IsoCode : 
        |               ├ create evt.Enriched.Latitude : 0.000000
        |               ├ create evt.Meta.IsInEU : false
        |               ├ create evt.Meta.ASNNumber : 0
        |       └ 🟢 crowdsecurity/whitelists (~2 [whitelisted])
        |               └ update evt.Whitelisted : %!s(bool=false) -> true
        |               └ update evt.WhitelistReason :  -> private ipv4/ipv6 ip/ranges
        └-------- parser failure 🔴

Turns out that private IP ranges are whitelisted by default so you can’t lock yourself out from inside. The parser crowdsecurity/whitelists has updated the property ’evt.Whitelisted’ to true and gave it a reason. That property appears to be a built-in that flags events to be dropped.

If you want to change the ranges, you can edit the yaml file. A sudo cscli hub list will show you which file that is. Add or remove entries from the ‘ip’ and ‘cidr’ lists it checks against. Any match causes whitelisted to become true.
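For reference, the stock whitelist parser (on this kind of install it lives under /etc/crowdsec/parsers/s02-enrich/) looks roughly like the below; contents may differ slightly by version, and the ip and cidr lists are what you would edit.

sudo cat /etc/crowdsec/parsers/s02-enrich/whitelists.yaml

name: crowdsecurity/whitelists
description: "Whitelist events from private ipv4 addresses"
whitelist:
  reason: "private ipv4/ipv6 ip/ranges"
  ip:
    - "127.0.0.1"
    - "::1"
  cidr:
    - "192.168.0.0/16"
    - "10.0.0.0/8"
    - "172.16.0.0/12"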

1.3 - Whitelisting

In the previous examples we’ve looked at the metrics and details of an internal-facing service, using failed SSH logins. Those aren’t prone to a lot of false positives. But other sources, like web access logs, can be.

False Positives

You’ll recall from looking at metrics that a high number of ‘Lines unparsed’ is normal. Those are simply entries that didn’t match any specific events the parser was looking for. Parsed lines, however, are ‘poured’ to a bucket, a bucket being a potential attack type.

sudo cscli metrics

Acquisition Metrics:
╭────────────────────────────────┬────────────┬──────────────┬────────────────┬────────────────────────╮
│             Source             │ Lines read │ Lines parsed │ Lines unparsed │ Lines poured to bucket │
├────────────────────────────────┼────────────┼──────────────┼────────────────┼────────────────────────┤
│ file:/var/log/auth.log         │ 69         │ -            │ 69             │ -                      │
│ file:/var/log/caddy/access.log │ 21         │ 21           │ -              │ 32                     │ <--Notice the high number in the 'poured' column
│ file:/var/log/syslog           │ 2          │ -            │ 2              │ -                      │
╰────────────────────────────────┴────────────┴──────────────┴────────────────┴────────────────────────╯

In the above example, “lines poured” is bigger than the number parsed. This is because some lines can match more than one scenario and end up in multiple buckets, like a malformed user agent asking for a page that doesn’t exist. Sometimes, that’s OK. Action isn’t taken until a given bucket meets a threshold. That threshold is defined in the scenarios, so let’s take a look there.

Scenario Metrics:
╭──────────────────────────────────────┬───────────────┬───────────┬──────────────┬────────┬─────────╮
│                Scenario              │ Current Count │ Overflows │ Instantiated │ Poured │ Expired │
├──────────────────────────────────────┼───────────────┼───────────┼──────────────┼────────┼─────────┤
│ crowdsecurity/http-crawl-non_statics │ -             │ -         │ 2            │ 17     │ 2       │
│ crowdsecurity/http-probing           │ -             │ 1         │ 2            │ 15     │ 1       │
╰──────────────────────────────────────┴───────────────┴───────────┴──────────────┴────────┴─────────╯

It appears the scenario ‘http-crawl-non_statics’ is designed to allow some light web-crawling. Of the 32 events ‘poured’ above, 17 of them went into its bucket and it ‘Instantiated’ tracking against 2 IPs, but neither ‘Overflowed’, which would cause an action to be taken.

However, ‘http-probing’ did. Assuming this is related to a web application you’re trying to use, you just got blocked. So let’s see what that scenario is looking for and what we can do about it.

sudo cscli hub list | grep http-probing

  crowdsecurity/http-probing                        ✔️  enabled  0.4      /etc/crowdsec/scenarios/http-probing.yaml

sudo cat  /etc/crowdsec/scenarios/http-probing.yaml
...
...
filter: "evt.Meta.service == 'http' && evt.Meta.http_status in ['404', '403', '400']"
capacity: 10
reprocess: true
leakspeed: "10s"
blackhole: 5m
...
...

You’ll notice that it’s simply looking for a few status codes, notably ‘404’. If you get more than 10 in 10 seconds, you get black-holed for 5 min. The next thing is to find out what web requests are triggering it. We could just look for 404s in the web access log, but we can also ask CrowdSec itself to tell us. This will be more important when the triggers are more subtle, so let’s give it a try now.

# Grep some 404 events from the main log to a test file
sudo grep 404 /var/log/caddy/access.log | tail  >  ~/test.log

# cscli explain with -v for more detail
sudo cscli explain -v --file ./test.log --type caddy

  ├ s00-raw
  | ├ 🟢 crowdsecurity/non-syslog (first_parser)
  | └ 🔴 crowdsecurity/syslog-logs
  ├ s01-parse
  | └ 🟢 crowdsecurity/caddy-logs (+19 ~2)
  |   └ update evt.Stage : s01-parse -> s02-enrich
  |   └ create evt.Parsed.request : /0/icon/Smith
  |   ...
  |   └ create evt.Meta.http_status : 404
  |   ...
  ├-------- parser success 🟢
  ├ Scenarios
    ├ 🟢 crowdsecurity/http-crawl-non_statics
    └ 🟢 crowdsecurity/http-probing

In this case, the client is asking for the file /0/icon/Smith and it doesn’t exist. It turns out the web client is asking just in case and accepting the 404 without complaint in the background. That’s fine for the app, but it matches two things under the Scenarios section: someone crawling the server, and someone probing it. To fix this, we’ll need to create a whitelist definition for the app.

You can also work it from the alerts side and inspect what happened (assuming you’ve caused an alert).

sudo cscli alert list

# This is an actual attack, and not something to be whitelisted, but it's a good example of how the inspection works.

╭─────┬──────────────────────────┬────────────────────────────────────────────┬─────────┬──────────────────────────────────────┬───────────┬─────────────────────────────────────────╮
│  ID │           value          │                   reason                   │ country │                  as                  │ decisions │                created_at               │
├─────┼──────────────────────────┼────────────────────────────────────────────┼─────────┼──────────────────────────────────────┼───────────┼─────────────────────────────────────────┤
│ 951 │ Ip:165.22.253.118        │ crowdsecurity/http-probing                 │ SG      │ 14061 DIGITALOCEAN-ASN               │ ban:1     │ 2025-02-26 13:53:08.589118208 +0000 UTC │


sudo cscli alerts inspect 951 -d

################################################################################################

 - ID           : 951
 - Date         : 2025-02-26T13:53:14Z
 - Machine      : 0e4a17d2f5d44270b7d543ac29c1dd4eWv2ozxHsRqoJWmRL
 - Simulation   : false
 - Remediation  : true
 - Reason       : crowdsecurity/http-probing
 - Events Count : 11
 - Scope:Value  : Ip:165.22.253.118
 - Country      : SG
 - AS           : DIGITALOCEAN-ASN
 - Begin        : 2025-02-26 13:53:08.589118208 +0000 UTC
 - End          : 2025-02-26 13:53:13.990699814 +0000 UTC
 - UUID         : eb454114-bc1e-455d-bfcc-f4772803e8bf


 - Context  :
╭────────────┬──────────────────────────────────────────────────────────────╮
│     Key    │                             Value                            │
├────────────┼──────────────────────────────────────────────────────────────┤
│ method     │ GET                                                          │
│ status     │ 403                                                          │
│ target_uri │ /                                                            │
│ target_uri │ /wp-includes/wlwmanifest.xml                                 │
│ target_uri │ /xmlrpc.php?rsd                                              │
│ target_uri │ /blog/wp-includes/wlwmanifest.xml                            │
│ target_uri │ /web/wp-includes/wlwmanifest.xml                             │
│ target_uri │ /wordpress/wp-includes/wlwmanifest.xml                       │
│ target_uri │ /website/wp-includes/wlwmanifest.xml                         │
│ target_uri │ /wp/wp-includes/wlwmanifest.xml                              │
│ target_uri │ /news/wp-includes/wlwmanifest.xml                            │
│ target_uri │ /2018/wp-includes/wlwmanifest.xml                            │
│ target_uri │ /2019/wp-includes/wlwmanifest.xml                            │
│ user_agent │ Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 │
│            │ (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36       │
╰────────────┴──────────────────────────────────────────────────────────────╯

Whitelist

To whitelist an app, we create a file with an expression that matches the behavior we saw above, such as the app’s attempts to load a file that doesn’t exist, and exempts it. You can only add these to the s02 stage folder, and the name element must be unique for each.

sudo vi /etc/crowdsec/parsers/s02-enrich/some-app-whitelist.yaml

This example uses the startsWith expression and assumes that all requests start the same

name: you/some-app
description: "Whitelist 404s for icon requests" 
whitelist: 
  reason: "icon request" 
  expression:   
    - evt.Parsed.request startsWith '/0/icon/'

If it’s less predictable, you can use a regular expression instead and combine with other expressions like a site match. In general, the more specific the better.

name: you/some-app-whitelist
description: "Whitelist 404s for icon requests" 
whitelist: 
  reason: "icon request" 
  expression:   
    - evt.Parsed.request matches '^/[0-9]/icon/.*' && evt.Meta.target_fqdn == "some-app.you.org"

Now you can restart crowdsec and test

sudo systemctl restart crowdsec.service

sudo cscli explain -v --file ./test.log --type caddy
 ├ s00-raw
 | ├ 🔴 crowdsecurity/syslog-logs
 | └ 🟢 crowdsecurity/non-syslog (+5 ~8)
 |  └ update evt.ExpectMode : %!s(int=0) -> 1
 |  └ update evt.Stage :  -> s01-parse
...
 ├ s02-enrich
 | ├ 🟢 you/some-app-whitelist (~2 [whitelisted])
 |  ├ update evt.Whitelisted : %!s(bool=false) -> true
 |  ├ update evt.WhitelistReason :  -> some icon request
 | ├ 🟢 crowdsecurity/dateparse-enrich (+2 ~2)
...
...
 | ├ 🟢 crowdsecurity/http-logs (+7)
 | └ 🟢 crowdsecurity/whitelists (unchanged)
 └-------- parser success, ignored by whitelist (audioserve icon request) 🟢

You’ll see in the above example that we successfully parsed the entry, but it was ‘ignored’ and didn’t go on to the Scenario section.

Regular Checking

You’ll find yourself doing this fairly regularly at first.

# Look for an IP on the ban list
sudo cscli alerts list

# Pull out the last several log entries for that IP
sudo grep SOME.IP.FROM.ALERTS /var/log/caddy/access.log | tail -10 > test.log 

# See what it was asking for
cat test.log | jq '.request'
cat test.log | jq '.request.uri'

# Ask caddy why it had a problem
sudo cscli explain -v --file ./test.log --type caddy

Troubleshooting

New Whitelist Has No Effect

If you have more than one whitelist, check the name you gave it on the first line. If that’s not unique, the whole thing will be silently ignored.

Regular Expression Isn’t Matching

CrowdSec uses the go-centric expr-lang. You may be used to unix regex where you’d escape slashes, for example. A tool like https://www.akto.io/tools/regex-tester is helpful.
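One quick way to iterate on an expression is to test a single log line with cscli explain (the --log flag takes a raw line instead of a file) and check whether your whitelist parser goes green. A rough sketch, reusing the caddy log from earlier:

# Grab one suspect line and test it by itself
LINE=$(sudo grep 404 /var/log/caddy/access.log | tail -1)
sudo cscli explain -v --log "$LINE" --type caddy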

1.4 - Custom Parser

When checking out the detailed metrics you may find that log entries aren’t being parsed. Maybe the log format has changed or you’re logging additional data the author didn’t anticipate. The best thing is to add your own parser.

Types of Parsers

There are several types of parsers and they are used in stages. Some are designed to work with the raw log entries while others take pre-parsed data and add to or enrich it. This way you can do branching, and not every parser needs to know how to read a syslog message.

Their Local Path will tell you what stage they kick in at. Use sudo cscli parsers list to display the details. s00-raw works with the ‘raw’ files while s01 and s02 work further down the pipeline. Currently, you can only create s00 and s01 level parsers.

Integrating with Scenarios

Useful parsers supply data that Scenarios are interested in. You can create a parser that watches the system logs for ‘FOOBAR’ entries, extracts the ‘FOOBAR-LEVEL’, and passes it on. But if nothing is looking for ‘FOOBARs’ then nothing will happen.

Let’s say you’ve added the Caddy collection. It’s pulled in a bunch of Scenarios you can view with sudo cscli scenarios list. If you look at one of the associated files you’ll see a filter section where they look for ‘evt.Meta.http_path’ and ‘evt.Parsed.verb’. They are all different though, so how do you know what data to supply?
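One way to answer that is to look at the scenario files themselves; cscli hub list shows where each one lives, and the filter line tells you which evt fields it expects. For example, reusing the http-probing scenario from earlier:

# Find the file behind a scenario and peek at its filter
sudo cscli hub list | grep http-probing
sudo grep -A1 '^filter' /etc/crowdsec/scenarios/http-probing.yaml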

Your best bet is to take an existing parser and modify it.

Examples

Note - CrowdSec is pretty awesome and after talking in the discord they’ve already accommodated both these scenarios within a release cycle or two. So these two examples are solved. I’m sure you’ll find new ones, though ;-)

A Web Example

Let’s say that you’ve installed the Caddy collection, but you’ve noticed basic auth login failures don’t trigger the parser. So let’s add a new file and edit it.

sudo cp /etc/crowdsec/parsers/s01-parse/caddy-logs.yaml /etc/crowdsec/parsers/s01-parse/caddy-logs-custom.yaml

You’ll notice two top level sections where the parsing happens; nodes and statics and some grok pattern matching going on.

Nodes allow you to try multiple patterns, and if any match, the whole section is considered successful. I.e. if the log could have either the standard HTTPDATE or a CUSTOMDATE, as long as it has one it’s good and the matching can move on. Statics just go down the list extracting data. If any fail, the whole event is considered a fail and dropped as unparseable.

All the parsed data gets attached to the event as ‘evt.Parsed.something’, and some of the statics move it to the evt values the Scenarios will be looking for. Caddy logs are JSON formatted, and so basically already parsed, and this example makes use of the JsonExtract method quite a bit.

# We added the caddy logs in the acquis.yaml file with the label 'caddy' and so we use that as our filter here
filter: "evt.Parsed.program startsWith 'caddy'"
onsuccess: next_stage
# debug: true
name: caddy-logs-custom
description: "Parse custom caddy logs"
pattern_syntax:
 CUSTOMDATE: '%{DAY:day}, %{MONTHDAY:monthday} %{MONTH:month} %{YEAR:year} %{TIME:time} %{WORD:tz}'
nodes:
  - nodes:
    - grok:
        pattern: '%{NOTSPACE} %{NOTSPACE} %{NOTSPACE} \[%{HTTPDATE:timestamp}\]%{DATA}'
        expression: JsonExtract(evt.Line.Raw, "common_log")
        statics:
          - target: evt.StrTime
            expression: evt.Parsed.timestamp
    - grok:
        pattern: "%{CUSTOMDATE:timestamp}"
        expression: JsonExtract(evt.Line.Raw, "resp_headers.Date[0]")
        statics:
          - target: evt.StrTime
            expression: evt.Parsed.day + " " + evt.Parsed.month + " " + evt.Parsed.monthday + " " + evt.Parsed.time + ".000000" + " " + evt.Parsed.year
    - grok:
        pattern: '%{IPORHOST:remote_addr}:%{NUMBER}'
        expression: JsonExtract(evt.Line.Raw, "request.remote_addr")
    - grok:
        pattern: '%{IPORHOST:remote_ip}'
        expression: JsonExtract(evt.Line.Raw, "request.remote_ip")
    - grok:
        pattern: '\["%{NOTDQUOTE:http_user_agent}\"]'
        expression: JsonExtract(evt.Line.Raw, "request.headers.User-Agent")
statics:
  - meta: log_type
    value: http_access-log
  - meta: service
    value: http
  - meta: source_ip
    expression: evt.Parsed.remote_addr
  - meta: source_ip
    expression: evt.Parsed.remote_ip
  - meta: http_status
    expression: JsonExtract(evt.Line.Raw, "status")
  - meta: http_path
    expression: JsonExtract(evt.Line.Raw, "request.uri")
  - target: evt.Parsed.request #Add for http-logs enricher
    expression: JsonExtract(evt.Line.Raw, "request.uri")
  - parsed: verb
    expression: JsonExtract(evt.Line.Raw, "request.method")
  - meta: http_verb
    expression: JsonExtract(evt.Line.Raw, "request.method")
  - meta: http_user_agent
    expression: evt.Parsed.http_user_agent
  - meta: target_fqdn
    expression: JsonExtract(evt.Line.Raw, "request.host")
  - meta: sub_type
    expression: "JsonExtract(evt.Line.Raw, 'status') == '401' && JsonExtract(evt.Line.Raw, 'request.headers.Authorization[0]') startsWith 'Basic ' ? 'auth_fail' : ''"

The very last line is where a status 401 is checked. It looks for a 401 and a request for Basic auth. However, this misses events where someone asks for a resource that is protected and the server responds telling them Basic is needed, i.e. when a bot is poking at URLs on your server and ignoring the prompts to log in. You can look at the log entries more easily with this command, which follows the log and decodes it while you recreate failed attempts.

sudo tail -f /var/log/caddy/access.log | jq

To change this, update the expression to also check the response header with an additional ? (or) condition.

    expression: "JsonExtract(evt.Line.Raw, 'status') == '401' && JsonExtract(evt.Line.Raw, 'request.headers.Authorization[0]') startsWith 'Basic ' ? 'auth_fail' : JsonExtract(evt.Line.Raw, 'status') == '401' && JsonExtract(evt.Line.Raw, 'resp_headers.Www-Authenticate[0]') startsWith 'Basic ' ? 'auth_fail' : ''"

Syslog Example

Let’s say you’re using dropbear and failed logins are not being picked up by the ssh parser

To see what’s going on, you use the crowdsec command line interface. The shell command is cscli and you can ask it about its metrics to see how many lines it’s parsed and if any of them are suspicious. Since we just restarted, you may not have any syslog lines yet, so let’s add some and check.

ssh [email protected]
logger "This is an innocuous message"

cscli metrics
INFO[28-06-2022 02:41:33 PM] Acquisition Metrics:
+------------------------+------------+--------------+----------------+------------------------+
|         SOURCE         | LINES READ | LINES PARSED | LINES UNPARSED | LINES POURED TO BUCKET |
+------------------------+------------+--------------+----------------+------------------------+
| file:/var/log/messages | 1          | -            | 1              | -                      |
+------------------------+------------+--------------+----------------+------------------------+

Notice that the line we just read is unparsed and that’s OK. That just means it wasn’t an entry the parser cared about. Let’s see if it responds to an actual failed login.

dbclient some.remote.host

# Enter some bad passwords and then exit with a Ctrl-C. Remember, localhost attempts are whitelisted so you must be remote.
[email protected]'s password:
[email protected]'s password:

cscli metrics
INFO[28-06-2022 02:49:51 PM] Acquisition Metrics:
+------------------------+------------+--------------+----------------+------------------------+
|         SOURCE         | LINES READ | LINES PARSED | LINES UNPARSED | LINES POURED TO BUCKET |
+------------------------+------------+--------------+----------------+------------------------+
| file:/var/log/messages | 7          | -            | 7              | -                      |
+------------------------+------------+--------------+----------------+------------------------+

Well, no luck. We will need to adjust the parser

sudo cp /etc/crowdsec/parsers/s01-parse/sshd-logs.yaml /etc/crowdsec/parsers/s01-parse/sshd-logs-custom.yaml

Take a look at the logfile and copy an example line over to https://grokdebugger.com/. Use a pattern like

Bad PAM password attempt for '%{DATA:user}' from %{IP:source_ip}:%{INT:port}

Assuming you get the pattern worked out, you can then add a section to the bottom of the custom parser file you created.

  - grok:
      name: "SSHD_AUTH_FAIL"
      pattern: "Login attempt for nonexistent user from %{IP:source_ip}:%{INT:port}"
      apply_on: message

1.5 - On Alpine

Install

There are some packages available, but (as of 2022) they are a bit behind and don’t include the config and service files. So let’s download the latest binaries from CrowdSec and create our own.

Download the current release

Note: Download the static versions. Alpine uses a different libc than other distros.

cd /tmp
wget https://github.com/crowdsecurity/crowdsec/releases/latest/download/crowdsec-release-static.tgz
wget https://github.com/crowdsecurity/cs-firewall-bouncer/releases/latest/download/crowdsec-firewall-bouncer.tgz

tar xzf crowdsec-firewall*
tar xzf crowdsec-release*
rm *.tgz

Install Crowdsec and Register with The Central API

You cannot use the wizard normally, as it expects systemd and doesn’t support OpenRC. Follow the Binary Install steps from CrowdSec’s binary instructions.

sudo apk add bash newt envsubst
cd /tmp/crowdsec-v*

# Docker mode skips configuring systemd
sudo ./wizard.sh --docker-mode

sudo cscli hub update
sudo cscli machines add -a
sudo cscli capi register

# A collection is just a bunch of parsers and scenarios bundled together for convenience
sudo cscli collections install crowdsecurity/linux 

Install The Firewall Bouncer

We need a netfilter tool so install nftables. If you already have iptables installed you can skip this step and set FW_BACKEND to that below when generating the API keys.

sudo apk add nftables

Now we install the firewall bouncer. There is no static build of the firewall bouncer yet from CrowdSec, but you can get one from Alpine testing (if you don’t want to compile it yourself)

# Change from 'edge' to other versions as needed
echo "http://dl-cdn.alpinelinux.org/alpine/edge/testing" >> /etc/apk/repositories
apk update
apk add cs-firewall-bouncer

Now configure the bouncer. We will once again do this manually because the install script has no support for non-systemd Linuxes. But cribbing from it, we see we can:

cd /tmp/crowdsec-firewall*

BIN_PATH_INSTALLED="/usr/local/bin/crowdsec-firewall-bouncer"
BIN_PATH="./crowdsec-firewall-bouncer"
sudo install -v -m 755 -D "${BIN_PATH}" "${BIN_PATH_INSTALLED}"

CONFIG_DIR="/etc/crowdsec/bouncers/"
sudo mkdir -p "${CONFIG_DIR}"
sudo install -m 0600 "./config/crowdsec-firewall-bouncer.yaml" "${CONFIG_DIR}crowdsec-firewall-bouncer.yaml"

Generate The API Keys

Note: If you used the APK, just do the first two lines to get the API_KEY (echo $API_KEY) and manually edit the file (vim /etc/crowdsec/bouncers/crowdsec-firewall-bouncer.yaml)

cd /tmp/crowdsec-firewall*
CONFIG_DIR="/etc/crowdsec/bouncers/"

SUFFIX=`tr -dc A-Za-z0-9 </dev/urandom | head -c 8`
API_KEY=`sudo cscli bouncers add cs-firewall-bouncer-${SUFFIX} -o raw`
FW_BACKEND="nftables"
API_KEY=${API_KEY} BACKEND=${FW_BACKEND} envsubst < ./config/crowdsec-firewall-bouncer.yaml | sudo install -m 0600 /dev/stdin "${CONFIG_DIR}crowdsec-firewall-bouncer.yaml"

Create RC Service Files

sudo touch /etc/init.d/crowdsec
sudo chmod +x /etc/init.d/crowdsec
sudo rc-update add crowdsec

sudo vim /etc/init.d/crowdsec
#!/sbin/openrc-run

command=/usr/local/bin/crowdsec
command_background=true

pidfile="/run/${RC_SVCNAME}.pid"

depend() {
   need localmount
   need net
}

Note: If you used the package from Alpine testing above it came with a service file. Just rc-update add cs-firewall-bouncer and skip this next step.

sudo touch /etc/init.d/cs-firewall-bouncer
sudo chmod +x /etc/init.d/cs-firewall-bouncer
sudo rc-update add cs-firewall-bouncer

sudo vim /etc/init.d/cs-firewall-bouncer
#!/sbin/openrc-run

command=/usr/local/bin/crowdsec-firewall-bouncer
command_args="-c /etc/crowdsec/bouncers/crowdsec-firewall-bouncer.yaml"
pidfile="/run/${RC_SVCNAME}.pid"
command_background=true

depend() {
  after firewall
}

Start The Services and Observe The Results

Start up the services and view the logs to see that everything started properly

sudo service crowdsec start
sudo service cs-firewall-bouncer status

sudo tail /var/log/crowdsec.log
sudo tail /var/log/crowdsec-firewall-bouncer.log

# The firewall bouncer should tell you about how it's inserting decisions it got from the hub

sudo cat /var/log/crowdsec-firewall-bouncer.log

time="28-06-2022 13:10:05" level=info msg="backend type : nftables"
time="28-06-2022 13:10:05" level=info msg="nftables initiated"
time="28-06-2022 13:10:05" level=info msg="Processing new and deleted decisions . . ."
time="28-06-2022 14:35:35" level=info msg="100 decisions added"
time="28-06-2022 14:35:45" level=info msg="1150 decisions added"
...
...

# If you are curious about what it's blocking
sudo nft list table crowdsec
...

1.6 - Cloudflare Proxy

Cloudflare offers an excellent reverse proxy and they filter most bad actors for you. But not all. Here’s a sample of what makes it through:

allen@www:~/$ sudo cscli alert list    
╭─────┬────────────────────┬───────────────────────────────────┬─────────┬────────────────────────┬───────────┬─────────────────────────────────────────╮
│  ID │        value       │               reason              │ country │           as           │ decisions │                created_at               │
├─────┼────────────────────┼───────────────────────────────────┼─────────┼────────────────────────┼───────────┼─────────────────────────────────────────┤
│ 221 │ Ip:162.158.49.136  │ crowdsecurity/jira_cve-2021-26086 │ IE      │ 13335 CLOUDFLARENET    │ ban:1     │ 2025-01-22 15:14:34.554328601 +0000 UTC │
│ 187 │ Ip:128.199.182.152 │ crowdsecurity/jira_cve-2021-26086 │ SG      │ 14061 DIGITALOCEAN-ASN │ ban:1     │ 2025-01-19 20:50:45.822199509 +0000 UTC │
│ 186 │ Ip:46.101.1.225    │ crowdsecurity/jira_cve-2021-26086 │ GB      │ 14061 DIGITALOCEAN-ASN │ ban:1     │ 2025-01-19 20:50:41.699518104 +0000 UTC │
│ 181 │ Ip:162.158.108.104 │ crowdsecurity/http-bad-user-agent │ SG      │ 13335 CLOUDFLARENET    │ ban:1     │ 2025-01-19 12:39:20.468268327 +0000 UTC │
│ 180 │ Ip:172.70.208.61   │ crowdsecurity/http-bad-user-agent │ SG      │ 13335 CLOUDFLARENET    │ ban:1     │ 2025-01-19 12:38:36.664997131 +0000 UTC │
╰─────┴────────────────────┴───────────────────────────────────┴─────────┴────────────────────────┴───────────┴─────────────────────────────────────────╯

You can see that CrowdSec took action, but it was the wrong one. It’s blocking the Cloudflare exit node and removing everyone’s access.

What we want is:

  • Identify the actual attacker
  • Block that somewhere effective (the firewall-bouncer can’t selectively block proxied traffic)

Identifying The Attacker

We could replace the CrowdSec Caddy log parser and use a different header, but there’s a hint in the CrowdSec parser that suggests using the trusted_proxies module.

##Caddy now sets client_ip to the value of X-Forwarded-For if users sets trusted proxies

Additionally, we can choose the CF-Connecting-IP header like francislavoie suggests, as X-Forwarded-For is easily spoofed.

Add a Trusted Proxy

To set Cloudflare as a trusted proxy we must identify all the Cloudflare exit node IPs to trust them. That would be hard to manage, but happily, there’s a handy caddy-cloudflare-ip module for that. Many thanks to WeidiDeng!

sudo caddy add-package github.com/WeidiDeng/caddy-cloudflare-ip
sudo vi /etc/caddy/Caddyfile
#
# Global Options Block
#
{
        servers {             
                trusted_proxies cloudflare  
                client_ip_headers CF-Connecting-IP  
        }    
}
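Restart Caddy so the new global options take effect.

sudo systemctl restart caddy.service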

After restarting Caddy, we can see the header change

sudo head /var/log/caddy/access.log  | jq '.request'
sudo tail /var/log/caddy/access.log  | jq '.request'

Before

  "remote_ip": "172.68.15.223",
  "client_ip": "172.68.15.223",

After

  "remote_ip": "172.71.98.114",
  "client_ip": "109.206.128.45",

And when consulting crowdsec, we can see it’s using the client_ip information.

sudo tail /var/log/caddy/access.log > test.log
sudo cscli explain -v --file ./test.log --type caddy

 ├ s01-parse
 | └ 🟢 crowdsecurity/caddy-logs (+14 ~2)
 |  └ update evt.Stage : s01-parse -> s02-enrich
 |  └ create evt.Parsed.remote_ip : 109.206.128.45 <-- Your Actual IP

And when launching a probe we can see it show up with the correct IP.

# Ask for lots of pages that don't exist to simulate a HTTP probe
for X in {1..100}; do curl -D - https://www.some.org/$X;done


sudo cscli decisions list   
╭─────────┬──────────┬───────────────────┬────────────────────────────┬────────┬─────────┬───────────────┬────────┬────────────┬──────────╮
│    ID   │  Source  │    Scope:Value    │           Reason           │ Action │ Country │       AS      │ Events │ expiration │ Alert ID │
├─────────┼──────────┼───────────────────┼────────────────────────────┼────────┼─────────┼───────────────┼────────┼────────────┼──────────┤
│ 2040067 │ crowdsec │ Ip:109.206.128.45 │ crowdsecurity/http-probing │ ban    │ US      │ 600 BADNET-AS │ 11     │ 3h32m5s    │ 235      │
╰─────────┴──────────┴───────────────────┴────────────────────────────┴────────┴─────────┴───────────────┴────────┴────────────┴──────────╯

This doesn’t do anything on its own (because traffic is proxied) but we can make it work if we change bouncers.

Changing Bouncers

The ideal approach would be to tell Cloudflare to stop forwarding traffic from the bad actors. There is a cloudflare-bouncer to do just that. It’s rate limited however, and only suitable for premium clients. There is also the CrowdSec Cloudflare Worker. It’s better, but still suffers from limits for non-premium clients.

Caddy Bouncer

Instead, we’ll use the caddy-crowdsec-bouncer. This is a layer 4 (protocol level) bouncer. It works inside Caddy and will block IPs based on the client_ip from the HTTP request.

Generate an API key for the bouncer with the bouncer add command - this doesn’t actually install anything, just generates a key.

sudo cscli bouncers add caddy-bouncer

Add the module to Caddy (which is the actual install).

sudo caddy add-package github.com/hslatman/caddy-crowdsec-bouncer

Configure Caddy

#
# Global Options Block
#
{
        
        crowdsec {
                api_key ABIGLONGSTRING
        }
        # Make sure to add the order statement
        order crowdsec first
}
www.some.org {

    crowdsec 

    root * /var/www/www.some.org
    file_server
}

And restart.

sudo systemctl restart caddy.service

Testing Remediation

Let’s test that probe again. Initially, you’ll get a 404 (not found), but after a while of that, it should switch to 403 (access denied)

for X in {1..100}; do curl -D - --silent https://www.some.org/$X | grep HTTP;done

HTTP/2 404 
HTTP/2 404 
...
...
HTTP/2 403 
HTTP/2 403 

Conclusion

Congrats! After much work you’ve traded 404s for 403s. Was it worth it? Probably. If an adversary’s probe had a chance to find something, it has less of a chance now.

Bonus Section

I mentioned earlier that the X-Forwarded-For header could be spoofed. Let’s take a look at that. Here’s an example.

# Comment out 'client_ip_headers CF-Connecting-IP' from your Caddy config, and restart.

for X in {1..100}; do curl -D - --silent -H "X-Forwarded-For: 192.168.0.2" https://www.some.org/$X | grep HTTP;done

HTTP/2 404 
HTTP/2 404 
...
...
HTTP/2 404 
HTTP/2 404

No remediation happens. Turns out Cloudflare appends by default, giving you:

sudo tail -f /var/log/caddy/www.some.org.log | jq

    "client_ip": "192.168.0.2",

      "X-Forwarded-For": [
        "192.168.0.2,109.206.128.45"
      ],

Caddy takes the first value, which is rather trusting but canonically correct, puts it as the client_ip and CrowdSec uses that.

Adjusting Cloudflare

You don’t need to, but you can configure Cloudflare to “Remove visitor IP headers”. This is counterintuitive, but the notes say “…Cloudflare will only keep the IP address of the last proxy”. In testing, it keeps the last value in the X-Forwarded-For string, and that’s what we’re after. It works for normal and forged headers.

  1. Log in to the Cloudflare dashboard and select your website
  2. Go to Rules > Overview
  3. Select “Manage Request Header Transform Rules”
  4. Select “Managed Transforms”
  5. Enable Remove visitor IP headers

The Overview page may look different depending on your plan, so you may have to hunt around for this setting.

Now when you test, you’ll get access denied regardless of your header

for X in {1..100}; do curl -D - --silent -H "X-Forwarded-For: 192.168.0.2" https://www.some.org/$X | grep HTTP;done

HTTP/2 404 
HTTP/2 404 
...
...
HTTP/2 403 
HTTP/2 403

Bonus Ending

You’ve added an extra layer of protection - but it’s not clear if it’s worth it. It may add to the proxy time, so use at your own discretion.

2 - Encryption

2.1 - GPG

GPG is an implementation of the OpenPGP standard (the term ‘PGP’ is trademarked by Symantec).

The best practice, that GPG implements by default, is to create a signing-only primary key with an encryption subkey1. These subkeys expire2 and must be extended or replaced from time to time.
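Extending is straightforward with newer GnuPG releases; a minimal sketch using --quick-set-expire and the example key fingerprint from further down this page:

# Extend the primary key's expiration by a year
gpg --quick-set-expire C621C2A8040C51F5C4AD9D2990A1676C9CB79C5D 1y

# Extend all of its subkeys as well ('*' selects every subkey)
gpg --quick-set-expire C621C2A8040C51F5C4AD9D2990A1676C9CB79C5D 1y '*'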

The Basics

The basics of gpg can be broken down into:

  • managing your keys
  • encrypting and decrypting your files
  • integrating gpg keys with mail and other utilities

Let’s skip the details of asymmetric key encryption and public/private key pairs, and just know that there are two keys: your private key and your public key. You encrypt with the public key, and you decrypt with the private key.

The private key is the one that matters. That’s the one you use to decrypt things. Your public key you can recreate, should you lose it, as long as you have your private key.

The public key is the one you pass out to your friends and even put on your web site when you want someone to send you something that only you can read. It sounds crazy, but through the wonders of mathematics, it can only be used to encrypt a file, never to decrypt one. So it doesn’t matter who you give it to. They can encrypt something, send it to you, and you can decrypt it - all without anyone sending a password.
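To actually hand the public key out, export it in ASCII-armored form (the address below is the same placeholder used in the examples further down).

# Export your public key so you can post or mail it
gpg --armor --export '[email protected]' > mykey.pub.asc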

You can also sign things. This is when you want to send something that anyone can read, but just want to be sure it came from you. More on that later. Let’s focus on secrecy.

Note - In my opinion, we can probably skip all the old command line stuff, not that it’s not good to know, it’s just slower to use as a novice.

http://ubuntuforums.org/showthread.php?t=680292

Key Management

To list keys

# If you don't use this list-option argument, you won't see all the subkeys
gpg --list-options show-unusable-subkeys --list-keys

# To add a new subkey to an existing key (interactive)
gpg --edit-key C621C2A8040C51F5C4AD9D2990A1676C9CB79C5D addkey

Encrypt and Decrypt

This will encrypt the file and apply the default option of appending .gpg on the end of the file

gpg -e -r '[email protected]' /path/to/some/file.txt

This will do the reverse - note you have to specify the output file or you will get to view the decrypted file via stdout, probably not what you wanted

gpg -o /path/to/some/file.txt -d /path/to/some/file.txt.gpg

3 - Event Management

Before it was SIEM

Back in the dawn of time, we called it ‘Central Logging’ and it looked kind of like this:

# The classical way you'd implement this is via a tiered system.

Log Shipper --\                   /--> Log Parser --\
Log Shipper ---+--> Log Broker --+---> Log Parser ---+--> Log Storage --> Log Visualizer 
Log Shipper --/                   \--> Log Parser --/

# The modern way is more distributed. The clients are more powerful so you spread the load out and they can connect to distributed storage directly.

Log Parser Shipper --\ /-- Log Storage <-\
Log Parser Shipper ---+--- Log Storage <--+-  Visualizer 
Log Parser Shipper --/ \-- Log Storage <-/

# ELK (Elasticsearch Logstash and Kibana) is a good example.

Logstash --\ /-- Elasticsearch <-\
Logstash ---+--- Elasticsearch <--+--> Kibana 
Logstash --/ \-- Elasticsearch <-/

More recently, there’s a move toward shippers like NXLog and Elasticsearch’s beats client. A native client saves you from deploying Java and is better suited for thin or micro instances.

# NXLog has an output module for Elasticsearch now. Beats is an Elasticsearch product.
nxlog --\   
nxlog ---+--> Elasticsearch <-- Kibana
beats --/ 

Windows has its own log forwarding technology. You can put it to work without installing anything on the clients. This makes Windows admins a lot happier.

# It's built-in and fine for windows events - just doesn't do text files. Beats can read the events and push to elasticsearch.
Windows Event Forwarding --\   
Windows Event Forwarding ---+--> Central Windows Event Manager -> Beats/Elasticsearch --> Kibana
Windows Event Forwarding --/ 

Unix has several ways to do it, but the most modern/least-overhead way is to use the native journald system.

# Built-in to systemd
journald send --> central journald receive --> Beats/Elasticsearch --> Kibana
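As a rough sketch of that journald pipeline (assuming systemd-journal-remote is installed on the central host; the hostname here is a placeholder), the client only needs the upload URL configured:

# On each client: /etc/systemd/journal-upload.conf
[Upload]
URL=http://logs.example.org:19532

# Enable the shipper on the clients and the receiver on the central host
sudo systemctl enable --now systemd-journal-upload.service       # client
sudo systemctl enable --now systemd-journal-remote.socket        # central host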

But Why?

The original answer used to be ‘reporting’. It was easier to get all the data together and do an analysis in one place.

Now the answer is ‘correlation’. If someone is probing your systems, they’ll do it very slowly and from multiple IPs to evade thresholds if they can, trying to break up patterns of attack. These patterns can become clear however, when you have a complete picture in one place.

3.1 - Elastic Stack

This is also referred to as ELK, an acronym that stands for Elasticsearch, Logstash and Kibana

This is a trio of tools that <www.elasticsearch.org> has packaged up into a simple and flexible way to handle, store and visualize data. Logstash collects the logs, parses them and stores them in Elasticsearch. Kibana is a web application that knows how to talk to Elasticsearch and visualizes the data.

Quite simple and powerful

To make use of this trio, start by deploying in this order:

  • Elasticsearch (first, you have to have some place to put things)
  • Kibana (so you can see what’s going on in elasticsearch easily)
  • Logstash (to start collecting data)

More recently, you can use the Elasticsearch Beats client in place of Logstash. These are natively compiled clients that have less capability, but are easier on the infrastructure than Logstash, a Java application.

3.1.1 - Elasticsearch

3.1.1.1 - Installation (Linux)

This is circa 2014 - use with a grain of salt.

This is generally the first step, as you need a place to collect your logs. Elasticsearch itself is a NoSQL database and well suited for pure-web style integrations.

Java is required, and you may wish to deploy Oracle’s java per the Elasticsearch team’s recommendation. You may also want to dedicate a data partition. By default, data is stored in /var/lib/elasticsearch and that can fill up. We will also install the ‘kopf’ plugin that makes it easier to manage your data.

Install Java and Elasticsearch

# (add a java repo)
sudo yum install java

# (add the elasticsearch repo)
sudo yum install elasticsearch

# Change the storage location
sudo mkdir /opt/elasticsearch
sudo chown elasticsearch:elasticsearch /opt/elasticsearch

sudo vim /etc/elasticsearch/elasticsearch.yml

    ...
    path.data: /opt/elasticsearch/data
    ...

# Allow connections on ports 9200, 9300-9400 and set the cluster IP

# By design, Elasticsearch is open so control access with care
sudo iptables --insert INPUT --protocol tcp --source 10.18.0.0/16 --dport 9200 --jump ACCEPT

sudo iptables --insert INPUT --protocol tcp --source 10.18.0.0/16 --dport 9300:9300 --jump ACCEPT

sudo vim /etc/elasticsearch/elasticsearch.yml
    ...
    # Failing to set the 'publish_host can result in the cluster auto-detecting an interface clients or other
    # nodes can't reach. If you only have one interface you can leave commented out. 
    network.publish_host: 10.18.3.1
    ...


# Increase the heap size
sudo vim  /etc/sysconfig/elasticsearch

    # Heap size defaults to 256m min, 1g max
    # Set ES_HEAP_SIZE to 50% of available RAM, but no more than 31g
ES_HEAP_SIZE=2g

# Install the kopf plugin and access it via your browser

sudo /usr/share/elasticsearch/bin/plugin -install lmenezes/elasticsearch-kopf
sudo service elasticsearch restart

In your browser, navigate to

http://10.18.3.1:9200/_plugin/kopf/

If everything is working correctly you should see a web page with KOPF at the top.

3.1.1.2 - Installation (Windows)

You may need to install on Windows to ensure the ‘maximum amount of serviceability with existing support staff’. I’ve used it on both Windows and Linux and it’s fine either way. Windows just requires a few more steps.

Requirements and Versions

The current version of Elasticsearch at time of writing these notes is 7.6. It requires an OS and Java. The latest of those supported are:

  • Windows Server 2016
  • OpenJDK 13

Installation

The installation instructions are at https://www.elastic.co/guide/en/elastic-stack-get-started/current/get-started-elastic-stack.html

Note: Elasticsearch has both a zip and an MSI. The former comes with a java distro, but the MSI includes a service installer.

Java

The OpenJDK 13 GA Releases at https://jdk.java.net/13/ no longer include installers or the JRE. But you can install via a MSI from https://github.com/ojdkbuild/ojdkbuild

Download the latest java-13-openjdk-jre-13.X and execute. Use the advanced settings to include the configuration of the JAVA_HOME and other useful variables.

To test the install, open a command prompt and check the version

C:\Users\allen>java --version
openjdk 13.0.2 2020-01-14
OpenJDK Runtime Environment 19.9 (build 13.0.2+8)
OpenJDK 64-Bit Server VM 19.9 (build 13.0.2+8, mixed mode, sharing)

Elasticsearch

Download the MSI installer from https://www.elastic.co/downloads/elasticsearch. It may be tagged as beta, but it installs the GA product well. Importantly, it also installs a windows service for Elasticsearch.

Verify the installation by checking your services for ‘Elasticsearch’, which should be running.
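You can also confirm it’s answering on the REST port; a small JSON blob with the cluster name and version means it’s up.

curl http://localhost:9200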

Troubleshooting

Elasticsearch only listening on localhost

By default, this is the case. You must edit the config file.

# In an elevated command prompt
notepad C:\ProgramData\Elastic\Elasticsearch\config\elasticsearch.yml

# add
discovery.type: single-node
network.host: 0.0.0.0

https://stackoverflow.com/questions/59350069/elasticsearch-start-up-error-the-default-discovery-settings-are-unsuitable-for

failure while checking if template exists: 405 Method Not Allowed

You can’t run newer versions of the filebeat with older versions of elasticsearch. Download the old deb and sudo apt install ./some.deb

https://discuss.elastic.co/t/filebeat-receives-http-405-from-elasticsearch-after-7-x-8-1-upgrade/303821 https://discuss.elastic.co/t/cant-start-filebeat/181050

3.1.1.3 - Common Tasks

This is circa 2014 - use with a grain of salt.

Configuration of elasticsearch itself is seldom needed. You will have to maintain the data in your indexes however. This is done by either using the kopf tool, or at the command line.

After you have some data in elasticsearch, you’ll see that your ‘documents’ are organized into ‘indexes’. This is simply a container for your data that was specified when logstash originally sent it, and the naming is arbitrarily defined by the client.

Deleting Data

The first thing you’re likely to need is to delete some badly-parsed data from your testing.

Delete all indexes with the name test*

curl -XDELETE http://localhost:9200/test*

Delete from all indexes documents of type ‘WindowsEvent’

curl -XDELETE http://localhost:9200/_all/WindowsEvent

Delete from all indexes documents that have the attribute ‘path’ equal to ‘/var/log/httpd/ssl_request.log’

curl -XDELETE 'http://localhost:9200/_all/_query?q=path:/var/log/httpd/ssl_request.log'

Delete from the index ’logstash-2014.10.29’ documents of type ‘shib-access’

curl -XDELETE http://localhost:9200/logstash-2014.10.29/shib-access

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-delete-by-query.html

Curator

All the maintenance by hand has to stop at some point and Curator is a good tool to automate some of it. This is a script that will do some curls for you, so to speak.

Install

wget https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py
sudo pip install elasticsearch-curator
sudo pip install argparse

Use

curator --help
curator delete --help

And in your crontab

# Note: you must escape % characters with a \ in crontabs
20 0 * * * curator delete indices --time-unit days --older-than 14 --timestring '\%Y.\%m.\%d' --regex '^logstash-bb-.*'
20 0 * * * curator delete indices --time-unit days --older-than 14 --timestring '\%Y.\%m.\%d' --regex '^logstash-adfsv2-.*'
20 0 * * * curator delete indices --time-unit days --older-than 14 --timestring '\%Y.\%m.\%d' --regex '^logstash-20.*'

Sometimes you’ll need to do an inverse match.

0 20 * * * curator delete indices --regex '^((?!logstash).)*$'

A good way to test your regex is by using the show indices method

curator show indices --regex '^((?!logstash).)*$'

Here are some old posts and links, but be aware the syntax has changed and it’s been several versions since these were written.

http://www.ragingcomputer.com/2014/02/removing-old-records-for-logstash-elasticsearch-kibana http://www.elasticsearch.org/blog/curator-tending-your-time-series-indices/ http://stackoverflow.com/questions/406230/regular-expression-to-match-line-that-doesnt-contain-a-word

Replication and Yellow Cluster Status

By default, elasticsearch assumes you want two nodes and will replicate your data; the default for new indexes is 1 replica. You may not want that to start with, however, so you can change the default and update the replica settings on your existing data in bulk with:

http://stackoverflow.com/questions/24553718/updating-the-default-index-number-of-replicas-setting-for-new-indices

Set all existing indexes to zero replicas (i.e. just the primary copy)

curl -XPUT 'localhost:9200/_settings' -d '
{ 
  "index" : { "number_of_replicas" : 0 } 
}'

Change the default for new indexes to zero replicas as well

curl -XPUT 'localhost:9200/_template/logstash_template' -d ' 
{ 
  "template" : "*", 
  "settings" : {"number_of_replicas" : 0 }
} '
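To double-check that both changes took, a couple of quick reads will do (a sketch - template and setting names as above):

# Confirm the template default
curl 'localhost:9200/_template/logstash_template?pretty'

# Confirm the live index settings
curl 'localhost:9200/_settings?pretty' | grep number_of_replicas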


Unassigned Shards

You will occasionally have a hiccup where you run out of disk space or something similar and be left with indexes that have no data in them or have shards unassigned. Generally, you will have to delete them but you can also manually reassign them.

http://stackoverflow.com/questions/19967472/elasticsearch-unassigned-shards-how-to-fix
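Before deciding, a quick look at what’s actually unassigned helps. The index name in the delete below is just an example.

# Unassigned shards show a state of UNASSIGNED in the cat output
curl 'localhost:9200/_cat/shards?v' | grep UNASSIGNED

# If the data is expendable, deleting the affected index is the quick fix
curl -XDELETE 'localhost:9200/logstash-2014.10.28'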

Listing Index Info

You can get a decent human readable list of your indexes using the cat api

curl localhost:9200/_cat/indices

If you want to list by size, the docs use this example

curl localhost:9200/_cat/indices?bytes=b | sort -rnk8 

3.1.2 - Kibana

3.1.2.1 - Installation (Windows)

Kibana is a Node.js app using the Express Web framework - meaning to us it looks like a web server running on port 5601. If you’re running elasticsearch on the same box, it will connect with the defaults.

https://www.elastic.co/guide/en/kibana/current/windows.html

Download and Extract

No MSI or installer is available for windows so you must download the .zip from https://www.elastic.co/downloads/kibana. Uncompress (this will take a while), rename it to ‘Kibana’ and move it to Program Files.

So that you may access it later, edit the config file at {location}/config/kibana.yml with WordPad and set the server.host entry to:

server.host: "0.0.0.0"

Create a Service

Download the service manager NSSM from https://nssm.cc/download and extract. Start an admin powershell, navigate to the extracted location and run the installation command like so:

C:\Users\alleng\Downloads\nssm-2.24\nssm-2.24\win64> .\nssm.exe install Kibana

In the pop-up, set the application path to the below. The startup path will auto-populate.

C:\Program Files\Kibana\kibana-7.6.2-windows-x86_64\bin\kibana.bat

Click ‘Install service’ and it should indicate success. Go to the service manager to find and start it. After a minute (check the process manager for the CPU to drop), you should be able to access it at:

http://localhost:5601/app/kibana#/home

3.1.2.2 - Troubleshooting

Rounding Errors

Kibana rounds to 16 significant digits

Turns out, if you have a value of type integer, that’s just the limit. While elasticsearch shows you this:

    curl http://localhost:9200/logstash-db-2016/isim-process/8163783564660983218?pretty
    {
      "_index" : "logstash-db-2016",
      "_type" : "isim-process",
      "_id" : "8163783564660983218",
      "_version" : 1,
      "found" : true,
      "_source":{"requester_name":"8163783564660983218","request_num":8163783618037078861,"started":"2016-04-07 15:16:16:139 GMT","completed":"2016-04-07 15:16:16:282 GMT","subject_service":"Service","request_type":"EP","result_summary":"AA","requestee_name":"Mr. Requester","subject":"mrRequest","@version":"1","@timestamp":"2016-04-07T15:16:16.282Z"}
    }

Kibana shows you this

View: Table / JSON / Raw
Field Action Value
request_num    8163783618037079000

Looking at the JSON will give you the clue - it’s being treated as an integer and not a string.

 "_source": {
    "requester_name": "8163783564660983218",
    "request_num": 8163783618037079000,
    "started": "2016-04-07 15:16:16:139 GMT",
    "completed": "2016-04-07 15:16:16:282 GMT",

Mutate it to string in logstash to get your precision back.
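A minimal sketch of that mutate, assuming the field is named request_num as in the example above (the exact convert syntax varies a little between logstash versions):

filter {
  mutate {
    # Convert before it's indexed so elasticsearch stores a string rather than a long
    convert => { "request_num" => "string" }
  }
}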

https://github.com/elastic/kibana/issues/4356

3.1.3 - Logstash

Logstash is a parser and shipper. It reads from (usually) a file, parses the data into JSON, then connects to something else and sends the data. That something else can be Elasticsearch, a syslog server, and others.

Logstash v/s Beats

But for most things these days, Beats is a better choice. Give that a look first.

3.1.3.1 - Installation

Note: Before you install logstash, take a look at Elasticsearch’s Beats. It’s lighter-weight for most tasks.

Quick Install

This is a summary of the current install page. Visit and adjust versions as needed.

# Install java
apt install default-jre-headless
apt-get install apt-transport-https
apt install gnupg2
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | apt-key add -

# Check for the current version - 7 is no longer the current version by now
echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | tee -a /etc/apt/sources.list.d/elastic-7.x.list
apt update
apt-get install logstash
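Then enable and start the service once you’ve dropped a config into /etc/logstash/conf.d/

systemctl enable logstash
systemctl start logstash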

Logstash has a NetFlow module, but it has been deprecated. One should instead use the Filebeat Netflow Module.

The rest of this page is circa 2014 - use with a grain of salt.

Installation - Linux Clients

Install Java

If you don’t already have it, install it. You’ll need at least 1.7 and Oracle’s is recommended. However, with older systems do yourself a favor and use the OpenJDK, as older versions of the Sun and IBM JVMs do things with cryptography that lead to strange bugs in recent releases of logstash.

# On RedHat flavors, install the OpenJDK and select it for use (in case there are others) with the system alternatives utility
sudo yum install java-1.7.0-openjdk

sudo /usr/sbin/alternatives --config java

Install Logstash

This is essentially:

( Look at https://www.elastic.co/downloads/logstash to get the latest version or add the repo)
wget (some link from the above page)
sudo yum --nogpgcheck localinstall logstash*

# You may want to grab a plugin, like the syslog output, though the elasticsearch output installs by default
cd /opt/logstash/
sudo bin/plugin install logstash-output-syslog

# If you're ready to configure the service
sudo vim /etc/logstash/conf.d/logstash.conf

sudo service logstash start

https://www.elastic.co/guide/en/logstash/current/index.html

Operating

Input

The most common use of logstash is to tail and parse log files. You do this by specifying a file and filter like so

[gattis@someHost ~]$ vim /etc/logstash/conf.d/logstash.conf


input {
  file {
    path => "/var/log/httpd/request.log"
  }
}
filter {
  grok {
    match => [ "message", "%{COMBINEDAPACHELOG}"]
  }
}
output {
  stdout {
    codec => rubydebug
  }
}

Filter

There are many different types of filters, but the main one you’ll be using is grok. It’s all about parsing the message into fields. Without this, you just have a bunch of un-indexed text in your database. It ships with some handy macros such as %{COMBINEDAPACHELOG} that takes this:

10.138.120.138 - schmoej [01/Apr/2016:09:39:04 -0400] "GET /some/url.do?action=start HTTP/1.1" 200 10680 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36" 

And turns it into

agent        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
auth         schmoej
bytes                   10680
clientip   10.138.120.138
httpversion 1.1
path           /var/pdweb/www-default/log/request.log
referrer   "-"
request   /some/url.do?action=start
response   200
timestamp   01/Apr/2016:09:39:04 -0400
verb        GET 

See the grok’ing section for more details

Output

We’re outputting to the console so we can see what’s going on with our config. If you get some output, but it’s not fully parsed because of an error in the parsing, you’ll see something like the below with a “_grokparsefailure” tag. That means you have to dig into a custom pattern as described in grok’ing.

Note: by default, logstash is ’tailing’ your logs, so you’ll only see new entries. If you’ve got no traffic you’ll have to generate some

{
       "message" => "test message",
      "@version" => "1",
    "@timestamp" => "2014-10-31T17:39:28.925Z",
          "host" => "some.app.private",
          "tags" => [
        [0] "_grokparsefailure"
    ]
}

If it looks good, you’ll want to send it on to your database. Change your output to look like so which will put your data in a default index that kibana (the visualizer) can show by default.

output {

  elasticsearch {
    hosts => ["10.17.153.1:9200"]
  }
}

Troubleshooting

If you don’t get any output at all, check that the logstash user can actually read the file in question. Check your log files and try running logstash as yourself with the output going to the console.

cat /var/log/logstash/*

/opt/logstash/bin/logstash -f /etc/logstash/conf.d/logstash.conf

3.1.3.2 - Operation

Basic Operation

Generally, you create a config with 3 sections;

  • input
  • filter
  • output

This example uses the grok filter to parse the message.

sudo vi /etc/logstash/conf.d/logstash.conf
input {
  file {
        path => "/var/pdweb/www-default/log/request.log"        
      }
}
filter {
  grok {
    match => [ "message", "%{COMBINEDAPACHELOG}"]
  }
}
output {
  stdout { }
}

Then you test it at the command line

# Test the config file itself
/opt/logstash/bin/logstash -f /etc/logstash/conf.d/logstash.conf --configtest

# Test the parsing of data
/opt/logstash/bin/logstash -e -f /etc/logstash/conf.d/logstash.conf

You should get some nicely parsed lines. If that’s the case, you can edit your config to add a sincedb and an actual destination.

input {
  file {
        path => "/var/pdweb/www-default/log/request.log"
        sincedb_path => "/opt/logstash/sincedb"
  }
}
filter {
  grok {
    match => [ "message", "%{COMBINEDAPACHELOG}"]
  }
}
output {
  elasticsearch {
    host => "some.server.private"
    protocol => "http"
  }
}

If instead you see output with a _grokparsefailure like below, you need to change the filter. Take a look at the common gotchas, then the parse failure section below it.

{
       "message" => "test message",
      "@version" => "1",
    "@timestamp" => "2014-10-31T17:39:28.925Z",
          "host" => "some.app.private",
          "tags" => [
        [0] "_grokparsefailure"
    ]
}

Common Gotchas

No New Data

Logstash reads new lines by default. If you don’t have anyone actually hitting your webserver, but you do have some log entries in the file itself, you can tell logstash to process the existing entries and not save its place in the file.

file {
  path => "/var/log/httpd/request.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"
}

Multiple Conf files

Logstash uses all the files in the conf.d directory - even if they don’t end in .conf. Make sure to remove any you don’t want as they can conflict.
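A quick sanity check of what will actually get loaded:

ls -la /etc/logstash/conf.d/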

Default Index

Logstash creates Elasticsearch indexes that look like:

logstash-%{+YYYY.MM.dd}

The logstash folks have some great material on how to get started. Really top notch.

http://logstash.net/docs/1.4.2/configuration#fieldreferences

Parse Failures

The Greedy Method

The best way to start is to change your match to a simple pattern and work out from there. Try the ‘GREEDYDATA’ pattern and assign it to a field named ‘Test’. This takes the form of:

%{GREEDYDATA:Test}

And it looks like:

filter {
  grok {
    match => [ "message" => "%{GREEDYDATA:Test}" ]
  }
}


       "message" => "test message",
      "@version" => "1",
    "@timestamp" => "2014-10-31T17:39:28.925Z",
          "host" => "some.app.private",
          "Test" => "The rest of your message

That should give you some output. You can then start cutting it up with the patterns (also called macros) found here;

You can also use the online grok debugger and the list of default patterns.

Combining Patterns

There may not be a standard pattern for what you want, but it’s easy to pull together several existing ones. Here’s an example that pulls in a custom timestamp.

Example:
Sun Oct 26 22:20:55 2014 File does not exist: /var/www/html/favicon.ico

Pattern:
match => { "message" => "(?<timestamp>%{DAY} %{MONTH} %{MONTHDAY} %{TIME} %{YEAR})"}

Notice the ‘?&lt;timestamp&gt;’ at the beginning of the parenthetical enclosure. That’s a named capture: it tells the pattern matching engine to stash whatever matched inside the parentheses into a field called ’timestamp’, without needing a predefined grok pattern for it. It’s the inline equivalent of a ( ) and \1 in sed, except the capture gets a name.

Optional Fields

Some log formats simply skip columns when they don’t have data. This will cause your parse to fail unless you make some fields optional with a ‘?’, like this:

match => [ "message", "%{HOSTNAME:VHost}? %{COMBINEDAPACHELOG} %{IP:XForwardedFor}?"]

Date Formats

http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html

Dropping Events

Oftentimes, you’ll have messages that you don’t care about and you’ll want to drop those. Best practice is to do coarse actions first, so you’ll want to compare and drop with a general conditional like:

filter {
  if [message] =~ /File does not exist/ {
    drop { }
  }
  grok {
    ...
    ...

You can also directly reference fields once you have grok’d the message

filter {
  grok {
    match => { "message" => "%{HOSTNAME:VHost}? %{COMBINEDAPACHELOG} %{IP:XForwardedFor}?"}
  }  
  if [request] == "/status" {
        drop { }
  }
}

http://logstash.net/docs/1.4.2/configuration#conditionals

Dating Messages

By default, logstash date stamps the message when it sees it. However, there can be a delay between when an action happens and when it gets logged to a file. To remedy this - and allow you to suck in old files without the date on every event being the same - you add a date filter.

Note - you actually have to grok out the date into its own variable, you can’t just attempt to match on the whole message. The combined apache macro below does this for us.

filter {
  grok {
    match => { "message" => "%{HOSTNAME:VHost}? %{COMBINEDAPACHELOG} %{IP:XForwardedFor}?"}
  }
  date {
    match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}

In the above case, ’timestamp’ is a parsed field and you’re using the date language to tell it what the component parts are

http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html

Sending to Multiple Servers

In addition to an elasticsearch server, you may want to send it to a syslog server at the same time.

    input {
      file {
        path => "/var/pdweb/www-default/log/request.log"
        sincedb_path => "/opt/logstash/sincedb"
      }
    }

    filter {
      grok {
        match => [ "message", "%{HOSTNAME:VHost}? %{COMBINEDAPACHELOG} %{IP:XForwardedFor}?"]
      }
      date {
        match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
      }

    }

    output {
      elasticsearch {
        host => "some.server.private"
        protocol => "http"
      }
      syslog {
        host => "some.syslog.server"
        port => "514"
        severity => "notice"
        facility => "daemon"
      }
    }

Deleting Sent Events

Sometimes you’ll accidentally send a bunch of events to the server and need to delete and resend corrected versions.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-delete-mapping.html

curl -XDELETE 'http://localhost:9200/_all/SOMEINDEX'
curl -XDELETE 'http://localhost:9200/_all/SOMEINDEX?q=path:"/var/log/httpd/ssl_request_log"'

3.1.3.3 - Index Routing

When using logstash as a broker, you will want to route events to different indexes according to their type. You have two basic ways to do this;

  • Using Mutates with a single output
  • Using multiple Outputs

The latter is significantly better for performance. The less you touch the event, the better it seems. When testing these two different configs in the lab, the multiple output method was about 40% faster when under CPU constraint. (i.e. you can always add more CPU if you want to mutate the events.)

Multiple Outputs

    input {
      ...
      ...
    }
    filter {
      ...
      ...
    }
    output {

      if [type] == "RADIUS" {
        elasticsearch {
          hosts => ["localhost:9200"]
          index => "logstash-radius-%{+YYYY.MM.dd}"
        }
      }

      else if [type] == "RADIUSAccounting" {
        elasticsearch {
          hosts => ["localhost:9200"]
          index => "logstash-radius-accounting-%{+YYYY.MM.dd}"
        }
      }

      else {
        elasticsearch {
          hosts => ["localhost:9200"]
          index => "logstash-test-%{+YYYY.MM.dd}"
        }
      }

    }

Mutates

If your source system includes a field to tell you what index to place it in, you might be able to skip mutating altogether, but often you must look at the contents to make that determination. Doing so does reduce performance.

input {
  ...
  ...
}
filter {
  ...
  ... 

  # Add a metadata field with the destination index based on the type of event this was
  if [type] == "RADIUS" {
    mutate { add_field => { "[@metadata][index-name]" => "logstash-radius" } } 
  }
  else  if [type] == "RADIUSAccounting" {
    mutate { add_field => { "[@metadata][index-name]" => "logstash-radius-accounting" } } 
  }
  else {
    mutate { add_field => { "[@metadata][index-name]" => "logstash-test" } } 
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "%{[@metadata][index-name]}-%{+YYYY.MM.dd}"
  }
}

https://www.elastic.co/guide/en/logstash/current/event-dependent-configuration.html#metadata

3.1.3.4 - Database Connections

You can connect Logstash to a database to poll events almost as easily as tailing a log file.

Installation

The JDBC plug-in ships with logstash so no installation of that is needed. However, you do need the JDBC driver for the DB in question.

Here’s an example for DB2, for which you can get the jar from either the server itself or the DB2 fix-pack associated with the DB Version you’re running. The elasticsearch docs say to just put it in your path. I’ve put it in the logstash folder (based on some old examples) and we’ll see if it survives upgrades.

sudo mkdir /opt/logstash/vendor/jars
sudo cp /home/gattis/db2jcc4.jar /opt/logstash/vendor/jars
sudo chown -R logstash:logstash /opt/logstash/vendor/jars

Configuration

Configuring the input

Edit the config file like so

sudo vim /etc/logstash/conf.d/logstash.conf

    input {
      jdbc {
        jdbc_driver_library => "/opt/logstash/vendor/jars/db2jcc4.jar"
        jdbc_driver_class => "com.ibm.db2.jcc.DB2Driver"
        jdbc_connection_string => "jdbc:db2://db1.tim.private:50000/itimdb"
        jdbc_user => "itimuser"
        jdbc_password => "somePassword"
        statement => "select * from someTable"
      }
    }

Filtering

You don’t need to do any pattern matching, as the input emits the event pre-parsed based on the DB columns. You may however, want to match a timestamp in the database.

    # A sample value in the 'completed' column is 2016-04-07 00:41:03:291 GMT

    filter {
      date {
        match => [ "completed" , "yyyy-MM-dd HH:mm:ss:SSS zzz" ]
      }
    }

Output

One recommended trick is to link the primary keys between the database and elasticsearch. That way, if you run the query again you update the existing elasticsearch records rather than create duplicate ones. Simply tell the output plugin to use the existing primary key from the database as the document_id when it sends it to elasticsearch.

    # Database key is the column 'id'

    output {
      elasticsearch {
        hosts => ["10.17.153.1:9200"]
        index => "logstash-db-%{+YYYY}"

        document_id => "${id}"

        type => "isim-process"

      }
    }

Other Notes

If any of your columns are non-string type, logstash and elasticsearch will happily store them as such. But be warned that kibana will round them to 16 digits due to a limitation of javascript.

https://github.com/elastic/kibana/issues/4356

Sources

https://www.elastic.co/blog/logstash-jdbc-input-plugin https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html

3.1.3.5 - Multiline Matching

Here’s an example that uses the multiline codec (preferred over the multiline filter, as it’s more appropriate when you might have more than one input)

input {
  file {
    path => "/opt/IBM/tivoli/common/CTGIM/logs/access.log"
    type => "itim-access"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "^<Message Id"
      negate => true
      what => previous
    }
  }
}

Getting a match can be difficult, as grok by default does not match against multiple lines. You can mutate to remove all the new lines, or use a seemingly secret preface, the ‘(?m)’ directive as shown below

filter {
  grok {
    match => { "message" => "(?m)(?<timestamp>%{YEAR}.%{MONTHNUM}.%{MONTHDAY} %{HOUR}:%{MINUTE}:%{SECOND}%{ISO8601_TIMEZONE})%{DATA}com.ibm.itim.security.%{WORD:catagory}%{DATA}CDATA\[%{DATA:auth}\]%{DATA}CDATA\[%{DATA:clientip}\]"}
  }
}

https://logstash.jira.com/browse/LOGSTASH-509

3.1.4 - Beats

Beats are a family of lightweight shippers that you should consider as a first-solution for sending data to Elasticsearch. The two most common ones to use are:

  • Filebeat
  • Winlogbeat

Filebeat is used both for files, and for other general types, like syslog and NetFlow data.

Winlogbeat is used to load Windows events into Elasticsearch and works well with Windows Event Forwarding.

3.1.4.1 - Linux Installation

On Linux

A summary from the general docs. View and adjust versions as needed.

If you haven’t already added the repo:

apt-get install apt-transport-https
apt install gnupg2
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | apt-key add -
echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | tee -a /etc/apt/sources.list.d/elastic-7.x.list
apt update

apt install filebeat
systemctl enable filebeat

Filebeat uses a default config file at /etc/filebeat/filebeat.yml. If you don’t want to edit that by hand, you can use the ‘modules’ to configure it for you. The module setup will also load dashboard elements into Kibana, so you must already have Kibana installed to make use of it.
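For example, to have the system module pick up the standard syslog and auth logs and load its dashboards (a sketch - run filebeat modules list to see what’s available on your version):

sudo filebeat modules enable system
sudo filebeat setup
sudo systemctl restart filebeat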

Here’s a simple test

mv /etc/filebeat/filebeat.yml /etc/filebeat/filebeat.yml.orig
vi /etc/filebeat/filebeat.yml
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/*.log
output.file:
  path: "/tmp/filebeat"
  filename: filebeat
  #rotate_every_kb: 10000
  #number_of_files: 7
  #permissions: 0600
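Run it in the foreground briefly against that config and check for output (Ctrl-C to stop):

filebeat -e
ls /tmp/filebeat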

3.1.4.2 - Windows Installation

Installation

Download the .zip version (the msi doesn’t include the server install script) from the URL below. Extract, rename to Filebeat and move it to the c:\Program Files directory.

https://www.elastic.co/downloads/beats/filebeat

Start an admin powershell, change to that directory and run the service install command. (Keep the shell up for later when done)

PowerShell.exe -ExecutionPolicy UnRestricted -File .\install-service-filebeat.ps1

Basic Configuration

Edit the filebeat config file.

write.exe filebeat.yml

You need to configure the input and output sections. The output is already set to elasticsearch localhost so you only have to change the input from the unix to the windows style.

  paths:
    #- /var/log/*.log
    - c:\programdata\elasticsearch\logs\*

Test as per normal

  ./filebeat test config -e

Filebeat specific dashboards must be added to Kibana. Do that with the setup argument:

  .\filebeat.exe setup --dashboards

To start Filebeat in the foreground (to see any interesting messages)

  .\filebeat.exe -e

If you’re happy with the results, you can stop the application then start the service

  Ctrl-C
  Start-Service filebeat

Adapted from the guide at

https://www.elastic.co/guide/en/beats/filebeat/7.6/filebeat-getting-started.html

3.1.4.3 - NetFlow Forwarding

The NetFlow protocol is now implemented in Filebeat. Assuming you’ve installed Filebeat and configured Elasticsearch and Kibana, you can use this input module to auto-configure the inputs, indexes and dashboards.

./filebeat modules enable netflow
filebeat setup -e

If you are just testing and don’t want to add the full stack, you can set up the netflow input, which the module is a wrapper for.

filebeat.inputs:
- type: netflow
  max_message_size: 10KiB
  host: "0.0.0.0:2055"
  protocols: [ v5, v9, ipfix ]
  expiration_timeout: 30m
  queue_size: 8192
output.file:
  path: "/tmp/filebeat"
  filename: filebeat

filebeat test config -e

Consider dropping all the fields you don’t care about as there are a lot of them. Use the include_fields processor to limit what you take in

  - include_fields:
      fields: ["destination.port", "destination.ip", "source.port", "source.mac", "source.ip"]

3.1.4.4 - Palo Example

# This filebeat config accepts TRAFFIC and SYSTEM syslog messages from a Palo Alto, 
# tags and parses them 

# This is an arbitrary port. The normal port for syslog is UDP 514
filebeat.inputs:
  - type: syslog
    protocol.udp:
      host: ":9000"

processors:
    # The message field will have "TRAFFIC" for  netflow logs and we can 
    # extract the details with a CSV decoder and array extractor
  - if:
      contains:
        message: ",TRAFFIC,"
    then:
      - add_tags:
          tags: "netflow"
      - decode_csv_fields:
          fields:
            message: csv
      - extract_array:
          field: csv
          overwrite_keys: true
          omit_empty: true
          fail_on_error: false
          mappings:
            source.ip: 7
            destination.ip: 8
            source.nat.ip: 9
            network.application: 14
            source.port: 24
            destination.port: 25
            source.nat.port: 26
      - drop_fields:
          fields: ["csv", "message"] 
    else:
        # The message field will have "SYSTEM,dhcp" for dhcp logs and we can 
        # do a similar process to above
      - if:
          contains:
            message: ",SYSTEM,dhcp"
        then:
        - add_tags:
            tags: "dhcp"
        - decode_csv_fields:
            fields:
              message: csv
        - extract_array:
            field: csv
            overwrite_keys: true
            omit_empty: true
            fail_on_error: false
            mappings:
              message: 14
        # The DHCP info can be further pulled apart using space as a delimiter
        - decode_csv_fields:
            fields:
              message: csv2
            separator: " "
        - extract_array:
            field: csv2
            overwrite_keys: true
            omit_empty: true
            fail_on_error: false
            mappings:
              source.ip: 4
              source.mac: 7
              hostname: 10
        - drop_fields:
            fields: ["csv","csv2"] # Can drop message too like above when we have watched a few        
  - drop_fields:
      fields: ["agent.ephemeral_id", "agent.hostname", "agent.id", "agent.type", "agent.version", "ecs.version","host.name","event.severity","input.type","hostname","log.source.address","syslog.facility", "syslog.facility_label", "syslog.priority", "syslog.priority_label","syslog.severity_label"]
      ignore_missing: true
filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false
setup.template.settings:
  index.number_of_shards: 1
output.elasticsearch:
  hosts: ["localhost:9200"]

3.1.4.5 - RADIUS Forwarding

Here’s an example of sending FreeRADIUS logs to Elasticsearch.

cat /etc/filebeat/filebeat.yml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/freeradius/radius.log
    include_lines: ['\) Login OK','incorrect']
    tags: ["radius"]
processors:
  - drop_event:
      when:
        contains:
          message: "previously"
  - if:
      contains:
        message: "Login OK"
    then: 
      - dissect:
          tokenizer: "%{key1} [%{source.user.id}/%{key3}cli %{source.mac})"
          target_prefix: ""
      - drop_fields:
          fields: ["key1","key3"]
      - script:
          lang: javascript
          source: >
            function process(event) {
                var mac = event.Get("source.mac");
                if(mac != null) {
                        mac = mac.toLowerCase();
                         mac = mac.replace(/-/g,":");
                         event.Put("source.mac", mac);
                }
              }            
    else:
      - dissect:
          tokenizer: "%{key1} [%{source.user.id}/<via %{key3}"
          target_prefix: ""
      - drop_fields: 
          fields: ["key1","key3"]
output.elasticsearch:
  hosts: ["http://logcollector.yourorg.local:9200"]        
  allow_older_versions: true
setup.ilm.enabled: false

3.1.4.6 - Syslog Forwarding

You may have an older system or appliance that can transmit syslog data. You can use filebeat to accept that data and store it in Elasticsearch.

Add Syslog Input

Install filebeat and test reception by writing to /tmp.

vi  /etc/filebeat/filebeat.yml

filebeat.inputs:
- type: syslog
  protocol.udp:
    host: ":9000"
output.file:
  path: "/tmp"
  filename: filebeat


sudo systemctl restart filebeat
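You can generate a test message from another box with the util-linux logger - the hostname below is a placeholder for your filebeat host.

logger --server filebeat.host.example --port 9000 --udp "test message"
ls /tmp/filebeat*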

pfSense Example

The instructions follow NetGate’s remote logging example.

Status -> System Logs -> Settings

Enable and configure. Internet rumor has it that it’s UDP only so the config above reflects that. Interpreting the output requires parsing the message section detailed in the filter log format docs.

'5,,,1000000103,bge1.1099,match,block,in,4,0x0,,64,0,0,DF,17,udp,338,10.99.147.15,255.255.255.255,2048,30003,318'

'5,,,1000000103,bge2,match,block,in,4,0x0,,84,1,0,DF,17,udp,77,157.240.18.15,205.133.125.165,443,61343,57'

'222,,,1000029965,bge2,match,pass,out,4,0x0,,128,27169,0,DF,6,tcp,52,205.133.125.142,205.133.125.106,5225,445,0,S,1248570004,,8192,,mss;nop;wscale;nop;nop;sackOK'

'222,,,1000029965,bge2,match,pass,out,4,0x0,,128,11613,0,DF,6,tcp,52,205.133.125.142,211.24.111.75,15305,445,0,S,2205942835,,8192,,mss;nop;wscale;nop;nop;sackOK'

3.2 - Loki

Loki is a system for handling logs (unstructured data) but is lighter-weight than Elasticsearch. It also has fewer add-ons. But if you’re already using Prometheus and Grafana and you want to do it yourself, it can be a better solution.

Installation

Install Loki and Promtail together. These are available in the debian stable repos at current version. No need to go to backports or testing

sudo apt install loki promtail
curl localhost:3100/metrics

Configuration

Default config files are created in /etc/loki and /etc/promtail. Promtail tails the /var/log/*log files, pushing them to the local loki on the default port (3100), and loki saves data in the /tmp directory. This is fine for testing.

Promtail runs as the promtail user (not root) and can’t read anything useful, so add them to the adm group.

sudo usermod -a -G adm promtail
sudo systemctl restart promtail

Grafana Integration

In grafana, add a datasource.

Configuration –> Add new data source –> Loki

Set the URL to http://localhost:3100

Then view the logs

Explore –> Select label (filename) –> Select value (daemon)
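You can also confirm Loki is ingesting from the command line before involving Grafana:

# List the labels and the filenames promtail has shipped so far
curl -s http://localhost:3100/loki/api/v1/labels
curl -s http://localhost:3100/loki/api/v1/label/filename/values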

Troubleshooting

error notifying frontend about finished query

Edit the timeout setting in your loki datasource. The default may be too short so set it to 30s or some such

Failed to load log volume for this query

If you added a logfmt parser like the GUI suggested, you may find not all your entries can be parsed, leading to this error.

3.3 - Network Traffic

Recording traffic on the network is critical for troubleshooting and compliance. For the latter, the most common strategy is to record the “flows”. These are the connections each host makes or accepts, and how much data is involved.

You can collect this information at the LAN on individual switches, but the WAN (at the router) is usually more important. And if the router is performing NAT, it’s the only place to record the mappings of internal to external IPs and ports.

Network Log System

A network flow log system usually has three main parts.

Exporter --> Collector --> Analyzer

The Exporter records the data, the Collector is where the data is stored, and the Analyzer makes the data more human-readable.

Example

We’ll use a Palo Alto NG Firewall as our exporter, and an Elasticsearch back-end. The data we are collecting is essentially log data, and Elasticsearch is probably the best at handling unstructured information.

At small scale, you can combine all of the collection and analysis parts on a single system. We’ll use windows servers in our example as well.

graph LR

A(Palo)
B(Beats)
C(ElasticSearch)
D(Kibana) 

subgraph Exporter
A
end

subgraph Collector and Analyzer
B --> C --> D
end

A --> B

Installation

Start with Elasticsearch and Kibana, then install Beats.

Configuration

Beats and Palo have a couple of protocols in common. NetFlow is the traditional protocol, but when you’re using NAT the best choice is the syslog protocol as the Palo will directly tell you the NAT info all in one record and you don’t have to correlate multiple interface flows to see who did what.

Beats

On the Beats server, start an admin powershell session, change to the Filebeat directory, edit the config file and restart the server.

There is a bunch of example text in the config so tread carefully and keep in mind that indentation matters. Stick this block right under the filebeat.inputs: line and you should be OK.

This config stanza has a processor block that decodes the CSV content sent over in the message field, extracts a few select fields, then discards the rest. There’s quite a bit left over though, so see tuning below if you’d like to reduce the data load even more.

cd "C:\Program Files\Filebeat"
write.exe filebeat.yml

filebeat.inputs:
- type: syslog
  protocol.udp:
    host: ":9000"
  processors:
  - decode_csv_fields:
      fields:
        message: csv
  - extract_array:
      field: csv
      overwrite_keys: true
      omit_empty: true
      fail_on_error: false
      mappings:
        source.ip: 7
        destination.ip: 8
        source.nat.ip: 9
        network.application: 14
        source.port: 24
        destination.port: 25
        source.nat.port: 26
  - drop_fields:
        fields: ["csv", "message"]

A larger example is under the beats documentation.

Palo Alto Setup

Perform steps 1 and 2 of the Palo setup guide with the notes below.

https://docs.paloaltonetworks.com/pan-os/10-0/pan-os-admin/monitoring/use-syslog-for-monitoring/configure-syslog-monitoring

  • In step 1 - The panw module defaults to 9001
  • In step 2 - Make sure to choose Traffic as the type of log

Tuning

You can reduce the amount of data even more by adding a few more Beats directives.

# At the very top level of the file, you can add this processor to affect global fields
processors:
- drop_fields:
   fields: ["agent.ephemeral_id","agent.id","agent.hostname","agent.type","agent.version","ecs.version","host.name"]

# You can also drop syslog fields that aren't that useful (you may need to put this under the syslog input)
- drop_fields:
    fields: ["event.severity","input.type","hostname","syslog.facility", "syslog.facility_label", "syslog.priority", "syslog.priority_label","syslog.severity_label"]

You may want even more data. See the Full Palo Syslog data to see what’s available, along with an example.

Conclusion

At this point you can navigate to the Kibana web console and explore the logs. There is no dashboard as this is just for log retention and covers the minimum required. If you’re interested in more, check out the SIEM and Netflow dashboards Elasticsearch offers.

Sources

Palo Shipping

https://docs.logz.io/shipping/security-sources/palo-alto-networks.html

3.4 - NXLog

This info on NXLog is circa 2014 - use with caution.

NXLog is best used when Windows Event Forwarding can’t be used and Filebeat isn’t sufficient.

Background

There are several solutions for capturing logs in Windows, but NXLog has some advantages;

  • Cross-platform and Open Source
  • Captures windows events pre-parsed
  • Native windows installer and service

You could just run logstash everywhere. But in practice, Logstash’s memory requirements are several times NXLog’s, and not everyone likes to install Java everywhere.

Deploy on Windows

Download from http://nxlog.org/download. This will take you to the sourceforge site and the MSI you can install from. This installation is clean and the service installs automatically.

Configure on Windows

NXLog uses a config file with blocks in the basic pattern of:

  • Input Block
  • Output Block
  • Routing Block

The latter is what ties together your inputs and outputs. You start out with one variable, called $raw_event, with everything in it. As you call modules, that variable gets parsed out into more useful individual variables.

Event Viewer Example

Here’s an example of invoking the module that pulls in entries from the windows event log.

  • Navigate to C:\Program Files (x86)\nxlog\conf
  • Edit the security settings on the file nxlog.conf. Change the ‘Users’ to have modify rights. This allows you to actually edit the config file.
  • Open that file in notepad and simply change it to look like so
    # Set the ROOT to the folder your nxlog was installed into
    define ROOT C:\Program Files (x86)\nxlog

    ## Default required locations based on the above
    Moduledir %ROOT%\modules
    CacheDir %ROOT%\data
    Pidfile %ROOT%\data\nxlog.pid
    SpoolDir %ROOT%\data
    LogFile %ROOT%\data\nxlog.log

    # Increase to DEBUG if needed for diagnosis
    LogLevel INFO

    # Input the windows event logs
    <Input in>
      Module      im_msvistalog
    </Input>


    # Output the logs to a file for testing
    <Output out>
        Module      om_file
        File        "C:/Program Files (x86)/nxlog/data/log-test-output.txt"
    </Output>

    # Define the route by mapping the input to an output
    <Route 1>
        Path        in => out
    </Route>

With any luck, you’ve now got some lines in your output file.

File Input Example

    # Set the ROOT to the folder your nxlog was installed into
    define ROOT C:\Program Files (x86)\nxlog

    ## Default required locations based on the above
    Moduledir %ROOT%\modules
    CacheDir %ROOT%\data
    Pidfile %ROOT%\data\nxlog.pid
    SpoolDir %ROOT%\data
    LogFile %ROOT%\data\nxlog.log

    # Increase to DEBUG if needed for diagnosis
    LogLevel INFO

    # Input a test file 
    <Input in>
        Module      im_file
        File ""C:/Program Files (x86)/nxlog/data/test-in.txt"
        SavePos     FALSE   
        ReadFromLast FALSE
    </Input>

    # Output the logs to a file for testing
    <Output out>
        Module      om_file
        File        "C:/Program Files (x86)/nxlog/data/log-test-output.txt"
    </Output>

    # Define the route by mapping the input to an output
    <Route 1>
        Path        in => out
    </Route>

Sending Events to a Remote Logstash Receiver

To be useful, you need to send your logs somewhere. Here’s an example of sending them to a Logstash receiver.

    # Set the ROOT to the folder your nxlog was installed into
    define ROOT C:\Program Files (x86)\nxlog

    ## Default required locations based on the above
    Moduledir %ROOT%\modules
    CacheDir %ROOT%\data
    Pidfile %ROOT%\data\nxlog.pid
    SpoolDir %ROOT%\data
    LogFile %ROOT%\data\nxlog.log

    # Increase to DEBUG if needed for diagnosis
    LogLevel INFO

    # Load the JSON module needed by the output module
    <Extension json>
        Module      xm_json
    </Extension>

    # Input the windows event logs
    <Input in>
      Module      im_msvistalog
    </Input>


    # Output the logs out using the TCP module, convert to JSON format (important)
    <Output out>
        Module      om_tcp
        Host        some.server
        Port        6379
        Exec to_json();
    </Output>

    # Define the route by mapping the input to an output
    <Route 1>
        Path        in => out
    </Route>

Restart the service in the windows services, and you are in business.

Note about JSON

You’re probably shipping logs to a logstash broker (or similar json-based tcp receiver). In that case, make sure to specify JSON on the way out, as in the example above, or you’ll spend hours trying to figure out why you’re getting a glob of plain text and lose all the pre-parsed windows event messages, which are nearly impossible to parse back from plain text.

Using that to_json() will replace the contents of the variable we mentioned earlier, $raw_event, with all of the already-parsed fields. If you hadn’t invoked a module to parse that data out, you’d just get a bunch of empty events, as the data would be replaced with a bunch of nothing.

3.4.1 - Drop Events

Exec

You can use the ‘Exec’ statement in any block and some pattern matching to drop events you don’t care about.

<Input in>
  Module im_file
  File "E:/Imports/get_accessplans/log-test.txt"
  Exec if $raw_event =~ /someThing/ drop();
</Input>

Or the inverse, with the operator !~

Dropping Events with pm_pattern

The alternative is the patternDB approach as it has some parallelization advantages you’ll read about in the docs should you dig into it further. This matters when you have lots of patterns to check against.

# Set the ROOT to the folder your nxlog was installed into
define ROOT C:\Program Files (x86)\nxlog

## Default required locations based on the above
Moduledir %ROOT%\modules
CacheDir %ROOT%\data
Pidfile %ROOT%\data\nxlog.pid
SpoolDir %ROOT%\data
LogFile %ROOT%\data\nxlog.log

# Increase to DEBUG if needed for diagnosis
LogLevel INFO

# Load the JSON module needed by the output module
<Extension json>
    Module      xm_json
</Extension>

# Input the windows event logs
<Input in>
  Module      im_msvistalog
</Input>

# Process log events 
<Processor pattern>
  Module  pm_pattern
  PatternFile %ROOT%/conf/patterndb.xml
</Processor>

# Output the logs out using the TCP module, convert to JSON format (important)
<Output out>
    Module      om_tcp
    Host        some.server
    Port        6379
    Exec to_json();
</Output>

# Define the route by mapping the input to an output
<Route 1>
    Path        in => pattern => out
</Route>

And create an XML file like so:

<?xml version="1.0" encoding="UTF-8"?>
<patterndb>

  <group>
    <name>eventlog</name>
    <id>1</id>

   <pattern>
      <id>2</id>
      <name>500s not needed</name> 
      <matchfield>
        <name>EventID</name>
        <type>exact</type>
        <value>500</value>
      </matchfield>
      <exec>drop();</exec>
    </pattern>

    
  </group>

</patterndb>

3.4.2 - Event Log

Limiting Log Messages

You may not want ALL the event logs. You can add a query to that module however, and limit logs to the security logs, like so

<Input in>
  Module im_msvistalog
  Query <QueryList><Query Id="0" Path="Security"><Select Path="Security">*</Select></Query></QueryList>
</Input>

You can break that into multiple lines for easier reading by escaping the returns. Here’s an example that ships the ADFS Admin logs.

<Input in>
  Module im_msvistalog
  Query <QueryList>\
            <Query Id="0">\
                <Select Path='AD FS 2.0/Admin'>*</Select>\
            </Query>\
        </QueryList>
</Input>

Pulling out Custom Logs

If you’re interested in very specific logs, you can create a custom view in the windows event viewer, and after selecting the criteria with the graphical tool, click on the XML tab to see what the query is. For example, to ship all the ADFS 2 logs (assuming you’ve turned on auditing), take the output of the XML tab (shown below) and modify it to be compliant with the nxlog format.

<QueryList>
  <Query Id="0" Path="AD FS 2.0 Tracing/Debug">
    <Select Path="AD FS 2.0 Tracing/Debug">*[System[Provider[@Name='AD FS 2.0' or @Name='AD FS 2.0 Auditing' or @Name='AD FS 2.0 Tracing']]]</Select>
    <Select Path="AD FS 2.0/Admin">*[System[Provider[@Name='AD FS 2.0' or @Name='AD FS 2.0 Auditing' or @Name='AD FS 2.0 Tracing']]]</Select>
    <Select Path="Security">*[System[Provider[@Name='AD FS 2.0' or @Name='AD FS 2.0 Auditing' or @Name='AD FS 2.0 Tracing']]]</Select>
  </Query>
</QueryList>

Here’s the query from a MS NPS

<QueryList>
  <Query Id="0" Path="System">
    <Select Path="System">*[System[Provider[@Name='NPS']]]</Select>
    <Select Path="System">*[System[Provider[@Name='HRA']]]</Select>
    <Select Path="System">*[System[Provider[@Name='Microsoft-Windows-HCAP']]]</Select>
    <Select Path="System">*[System[Provider[@Name='RemoteAccess']]]</Select>
    <Select Path="Security">*[System[Provider[@Name='Microsoft-Windows-Security-Auditing'] and Task = 12552]]</Select>
  </Query>
</QueryList>

3.4.3 - Input File Rotation

NXLog has decent ability to rotate its own output files, but it doesn’t come with a lot of methods to rotate input files - i.e. you’re reading in Accounting logs from a windows RADIUS server and it would be nice to archive those with NXLog, because Windows won’t do it. You could bust out some perl (if you’re on unix) and use the xm_perl module, but there’s a simpler way.

On windows, the solution is to use an exec block with a scheduled command. The forfiles executable is already present in windows and does the trick. The only gotcha is that ALL the parameters must be delimited like below.

So the command

forfiles /P "E:\IAS_Logs" /D -1 /C "cmd /c move @file \\server\share"

Becomes

<Extension exec>
  Module xm_exec
  <Schedule>
    When @daily
 Exec exec('C:\Windows\System32\forfiles.exe','/P','"E:\IAS_Logs"','/D','-1','/C','"cmd','/c','move','@file','\\server\share"');
  </Schedule>
</Extension>

A slightly more complex example with added compression and removal of old files (there isn’t a great command line zip utility for windows prior to PowerShell 5)

# Add log rotation for the windows input files
<Extension exec>
    Module xm_exec
    <Schedule>
     When @daily
         # Make a compressed copy of .log files older than 1 day
         Exec exec('C:\Windows\System32\forfiles.exe','/P','"E:\IAS_Logs"','/M','*.log','/D','-1','/C','"cmd','/c','makecab','@file"')
         # Delete original files after 2 days, leaving the compressed copies 
         Exec exec('C:\Windows\System32\forfiles.exe','/P','"E:\IAS_Logs"','/M','*.log','/D','-2','/C','"cmd','/c','del','@file"')
         # Move compressed files to the depot after 2 days
         Exec exec('C:\Windows\System32\forfiles.exe','/P','"E:\IAS_Logs"','/M','*.lo_','/D','-2','/C','"cmd','/c','move','@file','\\shared.ohio.edu\appshare\radius\logs\radius1.oit.ohio.edu"');
    </Schedule>
</Extension>

The @daily runs right at 0 0 0 0 0 (midnight every night). Check the manual for more precise cron controls

3.4.4 - Inverse Matching

You can use the ‘Exec’ statement to match inverse like so

<Input in>
  Module im_file
  File "E:/Imports/get_accessplans/log-test.txt"
  Exec if $raw_event !~ /someThing/ drop();
</Input>

However, when you’re using a pattern db this is harder as the REGEXP doesn’t seem to honor inverses like you’d expect. Instead, you must look for matches in your pattern db like normal;

<?xml version="1.0" encoding="UTF-8"?>
<patterndb>

  <group>
    <name>eventlog</name>
    <id>1</id>

    <pattern>
      <id>2</id>
      <name>Identify user login success usernames</name>

      <matchfield>
        <name>EventID</name>
        <type>exact</type>
        <value>501</value>
      </matchfield>

      <matchfield>
        <name>Message</name>
        <type>REGEXP</type>
        <value>windowsaccountname \r\n(\S+)</value>
        <capturedfield>
          <name>ADFSLoginSuccessID</name>
          <type>STRING</type>
        </capturedfield>
      </matchfield>

   </pattern>
  </group>
</patterndb>

Then, add a section to your nxlog.conf to take action when the above capture field doesn’t exist (meaning there wasn’t a regexp match).

...

# Process log events 
<Processor pattern>
  Module  pm_pattern
  PatternFile %ROOT%/conf/patterndb.xml
</Processor>

# Using a null processor just to have a place to put the exec statement
<Processor filter>
 Module pm_null
 Exec if (($EventID == 501) and ($ADFSLoginSuccessID == undef)) drop();
</Processor>

# Output the logs out using the TCP module, convert to JSON format (important)
<Output out>
    Module      om_tcp
    Host        some.server
    Port        6379
    Exec to_json();
</Output>

# Define the route by mapping the input to an output
<Route 1>
    Path        in => pattern => filter => out
</Route>

3.4.5 - Logstash Broker

When using logstash as a Broker/Parser to receive events from nxlog, you’ll need to explicitly tell it that the message is in json format with a filter, like so:

input {
  tcp {
    port => 6379
    type => "WindowsEventLog"
  }
}
filter {
  json {
    source => message
  }
}
output {
  stdout { codec => rubydebug }
}

3.4.6 - Manipulating Data

Core Fields

NXLog makes a handful of attributes about the event available to you. Some of these are from the ‘core’ module

$raw_event
$EventReceivedTime
$SourceModuleName
$SourceModuleType

Additional Fields

These are always present and added to by the input module or processing module you use. For example, the mseventlog module adds all the attributes from the windows event logs as attributes to the nxlog event. So your event contains:

$raw_event
$EventReceivedTime
$SourceModuleName
$SourceModuleType
$Message
$EventTime
$Hostname
$SourceName
$EventID
...

You can also create new attributes by using a processing module, such as parsing an input file’s XML. This will translate all the tags (within limits) into attributes.

<Extension xml>
    Module xm_xml
</Extension>
<Input IAS_Accounting_Logs>
    Module im_file
    File "E:\IAS_Logs\IN*.log"
    Exec parse_xml();
</Input>

And you can also add an Exec at any point to create or replace an attribute as desired

<Input IAS_Accounting_Logs>
    Module im_file
    File "E:\IAS_Logs\IN*.log"
    Exec $type = "RADIUSAccounting";
</Input>

Rewriting Data

Rather than manipulate everything in the input and output modules, use the pm_null module to group a block together.

<Processor rewrite>
    Module pm_null
    Exec parse_syslog_bsd();\
 if $Message =~ /error/ \
                {\
                  $SeverityValue = syslog_severity_value("error");\
                  to_syslog_bsd(); \
                }
</Processor>


<Route 1>
    Path in => rewrite => fileout
</Route>

3.4.7 - NPS Example

define ROOT C:\Program Files (x86)\nxlog

Moduledir %ROOT%\modules
CacheDir %ROOT%\data
Pidfile %ROOT%\data\nxlog.pid
SpoolDir %ROOT%\data
LogFile %ROOT%\data\nxlog.log


# Load the modules needed by the outputs
<Extension json>
    Module      xm_json
</Extension>

<Extension xml>
    Module xm_xml
</Extension>


# Inputs. Add the field '$type' so the receiver can easily tell what type they are.
<Input IAS_Event_Logs>
    Module      im_msvistalog
    Query \
 <QueryList>\
 <Query Id="0" Path="System">\
 <Select Path="System">*[System[Provider[@Name='NPS']]]</Select>\
 <Select Path="System">*[System[Provider[@Name='HRA']]]</Select>\
  <Select Path="System">*[System[Provider[@Name='Microsoft-Windows-HCAP']]]</Select>\
 <Select Path="System">*[System[Provider[@Name='RemoteAccess']]]</Select>\
 <Select Path="Security">*[System[Provider[@Name='Microsoft-Windows-Security-Auditing'] and Task = 12552]]</Select>\
 </Query>\
 </QueryList>
    Exec $type = "RADIUS";
</Input>

<Input IAS_Accounting_Logs>
    Module      im_file
    File  "E:\IAS_Logs\IN*.log"
    Exec parse_xml();
    Exec $type = "RADIUSAccounting";
</Input>


# Output the logs out using the TCP module, convert to JSON format (important)
<Output broker>
    Module      om_tcp
    Host        192.168.1.1
    Port        8899
    Exec to_json();
</Output>


# Routes
<Route 1>
    Path        IAS_Event_Logs,IAS_Accounting_Logs => broker
</Route>


# Rotate the input logs while we're at it, so we don't need a separate tool
<Extension exec>
    Module xm_exec
    <Schedule>
     When @daily
      #Note -  the Exec statement is one line but may appear wrapped
                Exec exec('C:\Windows\System32\forfiles.exe','/P','"E:\IAS_Logs"','/D','-1','/C','"cmd','/c','move','@file','\\some.windows.server\share\logs\radius1"');
    </Schedule>
</Extension>

3.4.8 - Parsing

You can also extract and set values with a pattern_db, like this; (Note, nxlog uses perl pattern matching syntax if you need to look things up)

<?xml version="1.0" encoding="UTF-8"?>
<patterndb>

  <group>
    <name>ADFS Logs</name>
    <id>1</id>

   <pattern>
      <id>2</id>
      <name>Identify user login fails</name>
      <matchfield>
        <name>EventID</name>
        <type>exact</type>
        <value>111</value>
      </matchfield>

      <matchfield>
        <name>Message</name>
        <type>REGEXP</type>
        <value>LogonUser failed for the '(\S+)'</value>
        <capturedfield>
          <name>ADFSLoginFailUsername</name>
          <type>STRING</type>
        </capturedfield>
      </matchfield>

      <set>
        <field>
          <name>ADFSLoginFail</name>
          <value>failure</value>
          <type>string</type>
        </field>
      </set>
    </pattern>
  </group>
</patterndb>

And a more complex example, where we’re matching against a string like:

2015-03-03T19:45:03 get_records 58  DailyAddAcct completed (Success) with: 15 Records Processed 0 adds 0 removes 0 modified 15 unchanged 


<?xml version="1.0" encoding="UTF-8"?>
<patterndb>

  <group>
    <name>Bbts Logs</name>
    <id>1</id>
  
    <pattern>
      <id>2</id>
      <name>Get TS Records</name> 
 
      <matchfield>
        <name>raw_event</name>
        <type>REGEXP</type>
        <value>^(\S+) get_record (\S+)\s+(\S+) completed \((\S+)\) with: (\S+) Records Processed (\S+) adds (\S+) removes (\S+) modified (\S+) unchanged</value>
 
        <capturedfield>
          <name>timestamp</name>
          <type>STRING</type>
        </capturedfield>

        <capturedfield>
          <name>Transaction_ID</name>
          <type>STRING</type>
        </capturedfield>

         <capturedfield>
          <name>Job_Subtype</name>
          <type>STRING</type>
        </capturedfield>

        <capturedfield>
          <name>Job_Status</name>
          <type>STRING</type>
        </capturedfield>

        <capturedfield>
          <name>Record_Total</name>
          <type>STRING</type>
        </capturedfield>

        <capturedfield>
          <name>Record_Add</name>
          <type>STRING</type>
        </capturedfield>

        <capturedfield>
          <name>Record_Remove</name>
          <type>STRING</type>
        </capturedfield>

        <capturedfield>
          <name>Record_Mod</name>
          <type>STRING</type>
        </capturedfield>

        <capturedfield>
          <name>Record_NoChange</name>
          <type>STRING</type>
        </capturedfield>

      </matchfield>

      <set>
        <field>
          <name>Job_Type</name>
          <value>Get_Records</value>
          <type>string</type>
        </field>
      </set>
    </pattern>


  </group>
</patterndb>

3.4.9 - Reprocessing

Sometimes you hit a parse error while testing and need to feed all your source files back in. The problem is that, by default, nxlog saves its position and reads only new entries.

Defeat this by adding a line to the nxlog config so it reads files from the beginning, and by deleting the ConfigCache file (so there’s no saved position to resume from).

    <Input IAS_Accounting_Logs>
        Module      im_file
        ReadFromLast FALSE
        File  "E:\IAS_Logs\IN*.log"
        Exec parse_xml();
        Exec $type = "RADIUSAccounting";
    </Input>

del "C:\Program Files (x86)\nxlog\data\configcache.dat"

Restart and it will begin reprocessing all the data. When you’re done, remove the ReadFromLast line and restart.

Note: If you had just deleted the cache file, nxlog would have resumed at the tail of the file. You could have told it not to save position, but you actually do want that for when you’re ready to resume normal operation.

https://www.mail-archive.com/[email protected]/msg00158.html

3.4.10 - Syslog

There are two components; loading the syslog extension module and adding an output and route to export the logs.

    <Extension syslog>
        Module xm_syslog
    </Extension>

    <Input IAS_Accounting_Logs>
        Module      im_file
        File  "E:\IAS_Logs\IN*.log"
        Exec $type = "RADIUSAccounting";
    </Input>

    <Output siem>
        Module om_udp
        Host 192.168.1.1
        Port 514
        Exec to_syslog_ietf();
    </Output>

    <Route 1>
        Path        IAS_Accounting_Logs => siem
    </Route>

3.4.11 - Troubleshooting

If you see this error message from nxlog:

ERROR Couldn't read next event, corrupted eventlog?; The data is invalid.

Congrats - you’ve hit a bug.

https://nxlog.org/support-tickets/immsvistalog-maximum-event-log-count-support

The work-around is to limit your log event subscriptions on the input side by using a query. Example:

<Input in>
  Module im_msvistalog
  Query <QueryList><Query Id="0" Path="Microsoft-Windows-PrintService/Operational"><Select Path="Microsoft-Windows-PrintService/Operational">*</Select></Query></QueryList>
  Exec if $EventID != 307 drop();
  Exec $type = "IDWorks";
</Input>

Parse failure on Windows to logstash

We found that nxlog made for the best Windows log-shipper, but it didn’t seem to parse the events in the event log. Output to logstash wasn’t in JSON format, and we confirmed this by writing directly to disk. This happens even though the event log input module parses each event into discrete fields.

Turns out you have to explicitly tell the output module to emit JSON. This isn’t well documented.
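A minimal sketch of what that looks like, assuming the standard xm_json extension is available and using a placeholder logstash host and port:

<Extension json>
    Module xm_json
</Extension>

<Input eventlog>
    Module im_msvistalog
</Input>

<Output logstash>
    Module om_tcp
    Host  some.logstash.server
    Port  5140
    Exec  to_json();
</Output>

<Route 1>
    Path  eventlog => logstash
</Route>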

3.4.12 - UNC Paths

When using Windows UNC paths, don’t forget that the backslash is also used for escaping characters. In a path like

    \\server\radius 

the \r is interpreted as a carriage return, so the path shows up mangled (something like \\server with adius pushed onto the next line) in your error log message. You’ll want to escape your backslashes like this;

\\\\server\\radius\\file.log
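So a file input reading from that share needs the doubled backslashes, something like this (server, share, and file names are placeholders):

<Input radius_share>
    Module im_file
    File  "\\\\server\\radius\\file.log"
</Input>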

3.4.13 - Unicode Normalization

Files you’re reading may be in any character set, and this can cause strange things when you modify or pass the data on, as an example at stack exchange shows. This isn’t a problem with Windows event logs, but Windows applications use several different charsets.

Best practice is to convert everything to UTF-8. This is especially true when invoking modules such as json, which don’t handle other encodings well.

NXLog has the ability to convert and can even do this automatically. However, there is some room for error. If you can, identify the encoding by looking at the file in a hex editor and comparing against MS’s identification chart.

Here’s a snippet of a manual conversion of a PowerShell-generated log, having looked at the first part and identified it as UTF-16LE.

...
<Extension charconv>
    Module xm_charconv
    AutodetectCharsets utf-8, utf-16, utf-32, iso8859-2, ucs-2le
</Extension>

<Input in1>
    Module      im_file
    File "E:/Imports/log.txt"
    Exec  $raw_event = convert($raw_event,"UTF-16LE","UTF-8");
</Input>
...

Notice however that the charconv module has an autodetect directive. You can use that as long as the charset you actually have is included in the AutodetectCharsets list (note the utf-16le entry in the example below).

<Extension charconv>
    Module xm_charconv
    AutodetectCharsets utf-8, utf-16, utf-16le, utf-32, iso8859-2
</Extension>


<Input sql-ERlogs>
    Module      im_file
    File 'C:\Program Files\Microsoft SQL Server\MSSQL11.SQL\MSSQL\Log\ER*'
    ReadFromLast TRUE
    Exec        convert_fields("AUTO", "utf-8");
</Input>

If you’re curious what charsets are supported, you can type this command on any unix system to see the names.

iconv -l

3.4.14 - Windows Files

Windows uses UTF-16 by default. Other services may use derivations thereof. In any event, it’s recommended to normalize things to UTF-8. Here’s a good example of what will happen if you don’t;

<http://stackoverflow.com/questions/27596676/nxlog-logs-are-in-unicode-charecters>

The answer to that question is to specify the actual charset, as “AUTO” doesn’t seem to detect it properly.

<Input in>
    Module      im_file
    File "E:/Imports/get_accessplans/log-test.txt"
    Exec if $raw_event == '' drop(); 
    Exec $Event = convert($raw_event,"UCS-2LE","UTF-8"); to_json();
    SavePos     FALSE   
    ReadFromLast FALSE
</Input>

From the manual on SQL Server

Microsoft SQL Server

Microsoft SQL Server stores its logs in UTF-16 encoding using a line-based format.
It is recommended to normalize the encoding to UTF-8. The following config snippet
will do that.

<Extension _charconv>
    Module xm_charconv
</Extension>

<Input in>
    Module im_file
    File "C:\\MSSQL\\ERRORLOG"
    Exec convert_fields('UCS-2LE','UTF-8'); if $raw_event == '' drop();
</Input>

As of this writing, the LineBased parser, the default InputType for im_file, is not able to properly read double-byte UTF-16 encoded files and will read an additional empty line (because of the double-byte CRLF). The above drop() call is intended to fix this.

convert_fields('UTF-16','UTF-8'); might also work instead of UCS-2LE.

3.5 - Windows Event Forwarding

If you’re in a Windows shop, this is the best way to keep the Windows admins happy. No installation of extra tools. ‘Keeps it in the MS family’ so to speak.

Configure your servers to push logs to a central location and use a client there to send them on. Beats works well for this.

The key seems to be:

  • Create a domain service account, or use the machine account
  • Add that account to the appropriate group on the client (typically Event Log Readers)

Then check the runtime status on the collector, as sketched below.
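One way to do that is with wecutil from an elevated prompt on the collector (the subscription name here is just a placeholder):

rem List the subscriptions defined on this collector
wecutil es

rem Show runtime status for one subscription, including per-source errors
wecutil gr "Some Subscription"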

For printing, in Event Viewer navigate to Microsoft-Windows-PrintService/Operational and enable it, as it’s not on by default.

Make sure to set the subscription’s event delivery optimization to Minimize Latency, or you’ll spend a long time wondering why there is no data.

Sources

https://hackernoon.com/the-windows-event-forwarding-survival-guide-2010db7a68c4
https://www.ibm.com/docs/en/netcoolomnibus/8?topic=acquisition-forwarded-event-log
https://www.youtube.com/watch?v=oyPuRE51k3o&t=158s

4 - Monitoring

Infrastructure monitoring is usually about metrics and alerts. You’re concerned about status and performance: is it up, how’s it doing, and when do we need to buy more?

4.1 - Metrics

4.1.1 - Prometheus

Overview

Prometheus is a time series database, meaning it’s optimized to store and work with data organized in time order. It includes in its single binary:

  • Database engine
  • Collector
  • Simple web-based user interface

This allows you to collect and manage data with fewer tools and less complexity than other solutions.

Data Collection

End-points normally expose metrics to Prometheus by making a web page available that it can poll. This is done by including an instrumentation library (provided by Prometheus) or simply adding a listener on a high-numbered port that spits out some text when asked.

For systems that don’t support Prometheus natively, there are a few add-on services to translate. These are called ’exporters’ and translate things such as SNMP into a web format Prometheus can ingest.

Alerting

You can also alert on the data collected. This is through the Alert Manager, a second package that works closely with Prometheus.

Visualization

You still need a dashboard tool like Grafana to handle visualizations, but you can get started quite quickly with just Prometheus.

4.1.1.1 - Installation

Install from the Debian Testing repo, as stable can be up to a year behind.

# Testing
echo 'deb http://deb.debian.org/debian testing main' | sudo tee -a /etc/apt/sources.list.d/testing.list

# Pin testing down to a low level so the rest of your packages don't get upgraded
sudo tee -a /etc/apt/preferences.d/not-testing << EOF
Package: *
Pin: release a=testing
Pin-Priority: 50
EOF

# Living Dangerously with test
sudo apt update
sudo apt install -t testing prometheus

Configuration

Use this for your starting config.

cat /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

This says every 15 seconds, run down the job list. And there is one job - to check out the system at ’localhost:9090’ which happens to be itself.

For every target listed, the scraper makes a web request for /metrics and stores the results. It ingests all the data presented and keeps it for 15 days. You can choose to ignore certain elements or retain them differently by adding config, but by default it takes everything it’s given.

You can see this yourself by just asking like Prometheus would. Hit it up directly in your browser. For example, Prometheus is making metrics available at /metrics

http://some.server:9090/metrics
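Or do the same from the command line:

curl -s http://some.server:9090/metrics | head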

Operation

User Interface

You can access the Web UI at:

http://some.server:9090

At the top, select Graph (you should be there already) and in the Console tab click the dropdown labeled “insert metric at cursor”. There you will see all the data being exposed. This is mostly about the Go language it’s written in, and not super interesting. A simple Graph tab is available as well.

Data Composition

Data can be simple, like:

go_gc_duration_seconds_sum 3

Or it can be dimensional which is accomplished by adding labels. In the example below, the value of go_gc_duration_seconds has 5 labeled sub-sets.

go_gc_duration_seconds{quantile="0"} 4.5697e-05
go_gc_duration_seconds{quantile="0.25"} 7.814e-05
go_gc_duration_seconds{quantile="0.5"} 0.000103396
go_gc_duration_seconds{quantile="0.75"} 0.000143687
go_gc_duration_seconds{quantile="1"} 0.001030941

In this example, the value of net_conntrack_dialer_conn_failed_total has several.

net_conntrack_dialer_conn_failed_total{dialer_name="alertmanager",reason="refused"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="alertmanager",reason="resolution"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="alertmanager",reason="timeout"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="alertmanager",reason="unknown"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="default",reason="refused"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="default",reason="resolution"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="default",reason="timeout"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="default",reason="unknown"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="snmp",reason="refused"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="snmp",reason="resolution"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="snmp",reason="timeout"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="snmp",reason="unknown"} 0

How is this useful? It allows you to do aggregations - such as looking at all the net_conntrack failures - and also to look at just the failures that were specifically refused. All with the same data.
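For example, in the Web UI’s expression box you could run either of these against the same underlying series (a sketch using the metric and label names from the sample above):

# Total failures, summed across all dialers and reasons
sum(net_conntrack_dialer_conn_failed_total)

# Only the refused failures, still broken out per dialer
net_conntrack_dialer_conn_failed_total{reason="refused"}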

Removing Data

You may have a target you want to remove. Such as a typo hostname that is now causing a large red bar on a dashboard. You can remove that mistake by enabling the admin API and issuing a delete

sudo sed -i 's/^ARGS.*/ARGS="--web.enable-admin-api"/' /etc/default/prometheus

sudo systemctl restart prometheus

curl -s -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="badhost.some.org:9100"}'

The default retention is 15 days. You may want less than that and you can configure --storage.tsdb.retention.time=1d similar to above. ALL data has the same retention, however. If you want historical data you must have a separate instance or use VictoriaMetrics.
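For example, following the same pattern as above (note the sed replaces the whole ARGS line, so combine flags if you need more than one, and restart since these are command-line flags):

sudo sed -i 's/^ARGS.*/ARGS="--storage.tsdb.retention.time=1d"/' /etc/default/prometheus
sudo systemctl restart prometheus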

Next Steps

Let’s get something interesting to see by adding some OS metrics

Troubleshooting

If you can’t start the prometheus server, it may be an issue with the init file. The testing and stable repos use different defaults. Add some values explicitly to get new versions running.

sudo vi /etc/default/prometheus

ARGS="--config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/var/lib/prometheus/metrics2/"

4.1.1.2 - Node Exporter

This is a service you install on your end-points that makes CPU/memory/etc. metrics available to Prometheus.

Installation

On each device you want to monitor, install the node exporter with this command.

sudo apt install prometheus-node-exporter

Do a quick test to make sure it’s responding to scrapes.

curl localhost:9100/metrics

Configuration

Back on your Prometheus server, add these new nodes as a job in the prometheus.yml file. Feel free to drop the initial job where Prometheus was scraping itself.

global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'servers'
    static_configs:
    - targets:
      - some.server:9100
      - some.other.server:9100
      - and.so.on:9100

sudo systemctl reload prometheus.service

Operation

You can check the status of your new targets at:

http://some.server:9090/classic/targets

A lot of data is collected by default. On some low-power systems you may want less. For just the basics, customize the config to disable the defaults and only enable specific collectors.

In the case below we reduce collection to just CPU, memory, and hardware metrics. When scraping a Pi 3B, this reduces the scrape duration from 500 ms to 50 ms.

sudo sed -i 's/^ARGS.*/ARGS="--collector.disable-defaults --collector.hwmon --collector.cpu --collector.meminfo"/' /etc/default/prometheus-node-exporter
sudo systemctl restart prometheus-node-exporter

The available collectors are listed on the page:

https://github.com/prometheus/node_exporter

4.1.1.3 - SNMP Exporter

SNMP is one of the most prevalent (and clunky) protocols still widely used on network-attached devices. But it’s a good general-purpose way to get data from lots of different makes of products in a similar way.

But Prometheus doesn’t understand SNMP. The solution is a translation service that acts as a middle-man and ‘exports’ data from those devices in a way Prometheus can ingest.

Installation

Assuming you’ve already installed Prometheus, install some SNMP tools and the exporter. If you have an error installing the mibs-downloader, check troubleshooting at the bottom.

sudo apt install snmp snmp-mibs-downloader
sudo apt install -t testing prometheus-snmp-exporter

Change the SNMP tools config file to allow use of installed MIBs. It’s disabled by default.

# The entry 'mibs:' in the file overrides the default path. Comment it out so the defaults kick back in.
sudo sed -i 's/^mibs/# &/' /etc/snmp/snmp.conf

Preparation

We need a target, so assuming you have a switch somewhere and can enable SNMP on it, let’s query the switch for its name, AKA sysName. Here we’re using version “2c” of the protocol with the read-only community string “public”. Pretty standard.

Industry Standard Query

There are some well-known SNMP objects you can query, like System Name.

# Get the first value (starting at 0) of the sysName object
snmpget -Oqv -v 2c -c public some.switch.address sysName.0

Some-Switch

# Sometimes you have to use 'getnext' if 0 isn't populated
snmpgetnext -v 2c -c public some.switch.address sysName

Vendor Specific Query

Some vendors will release their own custom MIBs. These provide additional data for things that don’t have well-known objects. Add the MIBs to the system and ‘walk’ the tree to see what’s interesting.

# Unifi, for example
sudo cp UBNT-MIB.txt UBNT-UniFi-MIB.txt  /usr/share/snmp/mibs

# snmpwalk doesn't look for enterprise sections by default, so you have to 
# look at the MIB and add the specific company's OID number.
grep enterprises UBNT-*
...
UBNT-MIB.txt:    ubnt OBJECT IDENTIFIER ::= { enterprises 41112 }
...

snmpwalk -v2c -c public 10.10.202.246 enterprises.41112 

Note: If you get back an error or just the ‘iso’ prefixed value, double check the default MIB path.

Configuration

To add this switch to the Prometheus scraper, add a new job to the prometheus.yml file. This job will include the targets as normal, but also the path (since it’s different from the default) and an optional parameter called module that’s specific to the SNMP exporter. It also does something confusing - a relabel_config.

This is because Prometheus isn’t actually talking to the switch, it’s talking to the local SNMP exporter service. So we list all the targets normally, and then at the bottom say ‘oh, by the way, do a switcheroo’. This allows Prometheus to display all the data normally with no one the wiser.

...
...
scrape_configs:
  - job_name: 'snmp'
    static_configs:
      - targets:
        - some.switch.address    
    metrics_path: /snmp
    params:
      module: [if_mib]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9116  # The SNMP exporter's real hostname:port.

Operation

No configuration on the exporter side is needed. Reload the config and check the target list. Then examine data in the graph section. Add additional targets as needed and the exporter will pull in the data.

http://some.server:9090/classic/targets

These metrics are considered well known, so they will appear in the database under names like sysUpTime and upsBasicBatteryStatus, and not be prefixed with snmp_ like you might expect.

Next Steps

If you have something non-standard, or you simply don’t want that huge amount of data in your system, look at the link below to customize the SNMP collection with the Generator.

SNMP Exporter Generator Customization

Troubleshooting

The snmp-mibs-downloader is just a handy way to download a bunch of default MIBs so when you use the tools, all the cryptic numbers, like “1.3.6.1.2.1.17.4.3.1” are translated into pleasant names.

If you can’t find the mibs-downloader it’s probably because it’s in the non-free repo and that’s not enabled by default. Change your apt sources file like so

sudo vi /etc/apt/sources.list

deb http://deb.debian.org/debian/ bullseye main contrib non-free
deb-src http://deb.debian.org/debian/ bullseye main contrib non-free

deb http://security.debian.org/debian-security bullseye-security main contrib non-free
deb-src http://security.debian.org/debian-security bullseye-security main contrib non-free

deb http://deb.debian.org/debian/ bullseye-updates main contrib non-free
deb-src http://deb.debian.org/debian/ bullseye-updates main contrib non-free

It may be that you only need to change one line.

4.1.1.4 - SNMP Generator

Installation

There is no need to install the Generator as it comes with the SNMP exporter. But if you have a device that supplies its own MIB (and many do), you should add that to the default location.

# Mibs are often named SOMETHING-MIB.txt
sudo cp -n *MIB.txt /usr/share/snmp/mibs/

Preparation

You must identify the values you want to capture. Using snmpwalk is a good way to see what’s available, but it helps to have a little context.

The data is arranged like a folder structure that you drill down through. The folder names are all numeric, with ‘.’ instead of slashes. So if you wanted to get a device’s sysName you’d click down through 1.3.6.1.2.1.1.5 and look in the file 0.

When you use snmpwalk it starts wherever you tell it and then starts drilling-down, printing out everything it finds.

How do you know that’s where sysName is at? A bunch of folks got together (the ISO folks) and decided everything in advance. Then they made some handy files (MIBs) and passed them out so you didn’t have to remember all the numbers.
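To make that concrete, these two queries ask for the same object, once by name (resolved through the MIBs) and once by raw OID:

# By name
snmpget -Oqv -v 2c -c public some.switch.address sysName.0

# By the underlying number
snmpget -Oqv -v 2c -c public some.switch.address .1.3.6.1.2.1.1.5.0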

They allow vendors to create their own sections as well, for things that might not fit anywhere else.

A good place to start is looking at what the vendor made available. You see this by walking their section and including their MIB so you get descriptive names - only the ISO System MIB is included by default.

# The sysObjectID identifies the vendor section
# Note use of the MIB name without the .txt
$ snmpwalk -m +SOMEVENDOR-MIB -v 2c -c public some.address sysObjectID

SNMPv2-MIB::sysObjectID.0 = OID: SOMEVENDOR-MIB::somevendoramerica

# Then walk the vendor section using the name from above
$ snmpwalk -m +SOMEVENDOR-MIB -v 2c -c public some.address somevendoramerica

SOMEVENDOR-MIB::model.0 = STRING: SOME-MODEL
SOMEVENDOR-MIB::power.0 = INTEGER: 0
...
...

# Also check out the general System section
$ snmpwalk -m +SOMEVENDOR-MIB -v 2c -c public some.address system

# You can also walk the whole ISO tree. In some cases,
# there are thousands of entries and it's indecipherable
$ snmpwalk -m +SOMEVENDOR-MIB -v 2c -c public some.system iso

This can be a lot of information and you’ll need to do some homework to see what data you want to collect.

Configuration

The exporter’s default configuration file is snmp.yml and contains about 57 thousand lines of config. It’s designed to pull data from whatever you point it at. Basically, it doesn’t know what device it’s talking to, so it tries to cover all the bases.

This isn’t a file you should edit by hand. Instead, you create instructions for the generator, and it looks through the MIBs and creates one for you. Here’s an example for a Samlex inverter.

vim ~/generator.yml
modules:
  samlex:
    walk:
      - sysLocation
      - inverterMode
      - power
      - vin
      - tempDD
      - tempDA

prometheus-snmp-generator generate
sudo cp /etc/prometheus/snmp.yml /etc/prometheus/snmp.yml.orig
sudo cp ~/snmp.yml /etc/prometheus
sudo systemctl reload prometheus-snmp-exporter.service

Configuration in Prometheus remains the same - but since we picked a new module name we need to adjust that.

    ...
    ...
    params:
      module: [samlex]
    ...
    ...

sudo systemctl reload prometheus.service

Adding Data Prefixes

By default, the names are all over the place. The SNMP Exporter devs leave it this way because there are a lot of pre-built dashboards on downstream systems that expect the existing names.

If you are building your own downstream systems, you can add a prefix (as is best practice) with a post-generation step. This example causes them all to be prefixed with samlex_.

prometheus-snmp-generator generate
sed -i 's/name: /name: samlex_/' snmp.yml

Combining MIBs

You can combine multiple systems in the generator file to create one snmp.yml file, and refer to them by the module name in the Prometheus file.

modules:
  samlex:
    walk:
      - sysLocation
      - inverterMode
      - power
      - vin
      - tempDD
      - tempDA
  ubiquiti:
    walk:
      - something
      - somethingElse  
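
On the Prometheus side you then point each set of targets at the right module - for example, two scrape jobs that differ only in their targets and module parameter (a sketch; each also needs the same metrics_path and relabel_configs as the job shown earlier):

  - job_name: 'snmp-samlex'
    static_configs:
      - targets:
        - some.inverter.address
    params:
      module: [samlex]

  - job_name: 'snmp-ubiquiti'
    static_configs:
      - targets:
        - some.switch.address
    params:
      module: [ubiquiti]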

Operation

As before, you can get a preview directly from the exporter (using a link like below). This data should show up in the Web UI too.

http://some.server:9116/snmp?module=samlex&target=some.device

Sources

https://github.com/prometheus/snmp_exporter/tree/main/generator

4.1.1.5 - Sensors DHT

DHT stands for Digital Humidity and Temperature. At less than $5 they are cheap and can be hooked to a Raspberry Pi easily. Add a Prometheus exporter if you want to do this at scale.

  • Connect the Senor
  • Provision and Install the Python Libraries
  • Test the Libraries and the Sensor
  • Install the Prometheus Exporter as a Service
  • Create a Service Account
  • Add to Prometheus

Connect The Sensor

These usually come as a breakout board with three leads you can connect to the Raspberry Pi GPIO pins as follows;

  • Positive lead to pin 1 (power)
  • Negative lead to pin 6 (ground)
  • Middle or ‘out’ lead to pin 7 (that’s GPIO 4)

Image of DHT Connection

(From https://github.com/rnieva/Playing-with-Sensors---Raspberry-Pi)

Provision and Install

Use the Raspberry Pi Imager to provision the Pi with Raspberry Pi OS Lite (64-bit). Next, install the “adafruit_blinka” library as adapted from their instructions and test.

# General updates
sudo apt update
sudo apt -y upgrade
sudo apt -y autoremove
sudo reboot

# These python components may already be installed, but making sure
sudo apt -y install python3-pip
sudo apt -y install --upgrade python3-setuptools
sudo apt -y install python3-venv

# Make a virtual environment for the python process
sudo mkdir /usr/local/bin/sensor-dht
sudo python3 -m venv /usr/local/bin/sensor-dht --system-site-packages
cd /usr/local/bin/sensor-dht
sudo chown -R ${USER}:${USER} .
source bin/activate

# Build and install the library
pip3 install --upgrade adafruit-python-shell
wget https://raw.githubusercontent.com/adafruit/Raspberry-Pi-Installer-Scripts/master/raspi-blinka.py
sudo -E env PATH=$PATH python3 raspi-blinka.py

Test the Libraries and the Sensor

After logging back in, test the blinka lib.

cd /usr/local/bin/sensor-dht
source bin/activate
wget https://learn.adafruit.com/elements/2993427/download -O blinkatest.py
python3 blinkatest.py

Then install the DHT library from CircuitPython and create a script to test the sensor.

cd /usr/local/bin/sensor-dht
source bin/activate
pip3 install adafruit-circuitpython-dht

vi sensortest.py
import board
import adafruit_dht

dhtDevice = adafruit_dht.DHT11(board.D4)
temp = dhtDevice.temperature
humidity = dhtDevice.humidity

print(
  "Temp: {:.1f} C    Humidity: {}% ".format(temp, humidity)
)

dhtDevice.exit()

You can get occasional errors like “RuntimeError: Checksum did not validate. Try again.” that are safe to ignore. These DHTs are not 100% solid.

Install the Prometheus Exporter as a Service

Add the Prometheus pips.

cd /usr/local/bin/sensor-dht
source bin/activate
pip3 install prometheus_client

And create a script like this.

nano sensor.py
import board
import adafruit_dht
import time
from prometheus_client import start_http_server, Gauge

dhtDevice = adafruit_dht.DHT11(board.D4)

temperature_gauge= Gauge('dht_temperature', 'Local temperature')
humidity_gauge = Gauge('dht_humidity', 'Local humidity')

start_http_server(8000)
    
while True:

 try:
  temperature = dhtDevice.temperature
  temperature_gauge.set(temperature)

  humidity = dhtDevice.humidity
  humidity_gauge.set(humidity)
 
 except:
  # Errors happen fairly often as DHT's are hard to read. Just continue on.
  continue

 finally:
  time.sleep(60)

Create a service

sudo nano /lib/systemd/system/sensor.service
[Unit]
Description=Temperature and Humidity Sensing Service
After=network.target

[Service]
Type=idle
Restart=on-failure
User=root
ExecStart=/bin/bash -c 'cd /usr/local/bin/sensor-dht && source bin/activate && python sensor.py'

[Install]
WantedBy=multi-user.target

Enable and start it

sudo systemctl enable --now sensor.service

curl http://localhost:8000/metrics
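If it’s working, you should see your two gauges (alongside the Python client’s default metrics), something like this - the values will of course differ:

# HELP dht_temperature Local temperature
# TYPE dht_temperature gauge
dht_temperature 21.0
# HELP dht_humidity Local humidity
# TYPE dht_humidity gauge
dht_humidity 38.0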

Create a Service Account

This service is running as root. You should consider creating a sensor account.

sudo useradd --home-dir /usr/local/bin/sensor-dht --system --shell /usr/sbin/nologin --comment "Sensor Service" sensor
sudo usermod -aG gpio sensor

sudo systemctl stop  sensor.service
sudo chown -R sensor:sensor /usr/local/bin/sensor-dht
sudo sed -i 's/User=root/User=sensor/' /lib/systemd/system/sensor.service

sudo systemctl daemon-reload 
sudo systemctl start sensor.service 

Add to Prometheus

Adding it requires logging into your Prometheus server and adding a job like below.

sudo vi /etc/prometheus/prometheus.yml
...
...
  - job_name: 'dht' 
    static_configs: 
      - targets: 
        - 192.168.1.45:8000

You will be able to find the node in your server at http://YOUR-SERVER:9090/targets?search=#pool-dht and data will show up with a leading dht_....

Sources

https://randomnerdtutorials.com/raspberry-pi-dht11-dht22-python/

You may want to raise errors to the log as in the above source.

4.1.2 - Smokeping

I’ve been using Smokeping for at least 20 years. Every so often I look at the competitors, but it’s still the best for a self-contained latency monitoring system.

Installation

On Debian Stretch:

# Install smokeping - apache gets installed automatically and the config enabled
sudo apt-get install smokeping

# Install and enable SpeedyCGI if you can find it, otherwise, fastCGI
sudo apt install libapache2-mod-fcgid

Configure

Edit the General Config file

sudo vim /etc/smokeping/config.d/General

owner    = Some Org
contact  = [email protected]
mailhost = localhost

cgiurl   = http://some.server.you.just.deployed.on/cgi-bin/smokeping.cgi

# specify this to get syslog logging
syslogfacility = local0

# each probe is now run in its own process
# disable this to revert to the old behaviour
# concurrentprobes = no

@include /etc/smokeping/config.d/pathnames

Edit the pathnames file. You must put in a value for sendmail (if you don’t have it) so that smokeping will run.

sudo vim /etc/smokeping/config.d/pathnames


sendmail = /bin/false
imgcache = /var/cache/smokeping/images
imgurl   = ../smokeping/images
...
...

Edit the Alerts

sudo vim /etc/smokeping/config.d/Alerts


to = [email protected]
from = [email protected]

Edit the Targets

# Add your own targets that you will measure by appending them to the bottom
# of the targets file. These will show up in a menu on the left of the generated
# web page. An entry starting with + creates a top-level entry, and subsequent
# lines starting with ++ show up as sub-entries, like so:
#
#        + My Company
#        ++ My Company's Web Server 1
#        ++ My Company's Web Server 2
#
# Actual config requires a few extra lines, as below;


sudo vim /etc/smokeping/config.d/Targets


    + My_Company

    menu =  My Company
    title = My Company


    ++ Web_Server_1
    menu = Web Server 1
    title = Web Server 1
    host = web.server.org


# Restart smokeping and apache

sudo service smokeping restart
sudo service apache2 reload

Access smokeping at:

http://some.server.name/smokeping

Notes

The default resolution - i.e. polling frequency - is 20 requests over 5 minutes, or 1 request every 15 seconds.
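That’s set in the Database config file; the stock values look something like this (step is in seconds, pings per step):

sudo vim /etc/smokeping/config.d/Database

*** Database ***
step     = 300
pings    = 20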

http://collaboration.cmc.ec.gc.ca/science/rpn/biblio/ddj/Website/articles/SA/v12/i07/a5.htm

ArchWiki suggests a workaround for sendmail

https://wiki.archlinux.org/index.php/smokeping

4.2 - Visualization

4.2.1 - Grafana