Datasets Enumeration and Description
In over 20 years, Spamhaus has ended up producing quite a number of different datasets. Some are very broad in terms of possible usage, others are extremely specific to a single purpose, and quite often email is far from the only field they can be successfully used in.
Understanding where they come from, what they’re intended to solve and what you can expect to be included in them is the basic foundation to understanding what they can do for you.
Here’s a list of the various datasets published by Spamhaus with their descriptions, along with the return code(s) they are associated with. Each return code is represented both as an IPv4 address (as used by the DNSBL semantics) and as an integer (HTTP API semantics).
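The code pairs quoted on this page (for example 127.0.0.2 and 1002, 127.0.1.102 and 2102, 127.0.2.24 and 3024) all fit a simple pattern: 127.0.X.Y corresponds to (X + 1) * 1000 + Y. The sketch below converts between the two forms under that assumption. The rule is inferred from the examples here and is not taken from a formal specification, and the function names are hypothetical.

```python
import ipaddress


def dnsbl_to_api_code(return_ip: str) -> int:
    """Convert a 127.0.X.Y DNSBL answer to the integer form (X + 1) * 1000 + Y.

    Assumption: this mapping is inferred from the pairs quoted in the text.
    """
    octets = ipaddress.IPv4Address(return_ip).packed
    if octets[0] != 127 or octets[1] != 0:
        raise ValueError(f"not a 127.0.X.Y return code: {return_ip}")
    return (octets[2] + 1) * 1000 + octets[3]


def api_code_to_dnsbl(code: int) -> str:
    """Inverse of dnsbl_to_api_code, under the same assumption."""
    thousands, last = divmod(code, 1000)
    if not (1 <= thousands <= 256 and 0 <= last <= 255):
        raise ValueError(f"code out of range: {code}")
    return f"127.0.{thousands - 1}.{last}"


if __name__ == "__main__":
    print(dnsbl_to_api_code("127.0.0.2"))  # 1002
    print(api_code_to_dnsbl(1020))         # 127.0.0.20
```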
Associated with the return code 127.0.0.2 (1002), the SBL is a manually maintained list of abuse-related resources, not necessarily of exclusively SMTP emitters. Resources that can be listed in the SBL are for example webservers or DNS servers (sometimes, even routers) providing service to abusing actors, either as a result of a compromise or because they’re dedicated to that purpose.
In general, outright blocking a source listed by the SBL at the SMTP level is supposed to be safe in terms of false positives. Another usage with a fairly low false-positive rate is checking the IPs contained in the Received headers of messages (so-called Deep-Header Parsing, or DHP from now on).
Due to the characteristics above, however, other uses are possible: for example, a sender whose domain is served by an SBL-listed DNS server has a non-trivial probability of being abusive too.
Similarly, if the message contains URLs resolving to SBL-listed addresses, there’s a reasonable chance the message is abusive. However, use of the SBL for these specific purposes is encouraged only within scoring systems, as a contribution to a decision taken upon multiple factors.
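Deep-Header Parsing, as described above, starts by pulling the IP addresses out of a message's Received headers so that each one can be checked against a dataset. The following is a minimal sketch of that extraction step only, assuming the common convention of writing the peer address in square brackets. The function name and the sample message are hypothetical, and a real deployment would probably also skip internal or private hops.

```python
import ipaddress
import re
from email import message_from_string

# Assumption: relays record the peer address as "[a.b.c.d]" in Received headers.
_BRACKETED_IPV4 = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def received_ips(raw_message: str) -> list[str]:
    """Return the distinct IPv4 addresses found in Received headers, in order."""
    msg = message_from_string(raw_message)
    found: list[str] = []
    for header in msg.get_all("Received", []):
        for candidate in _BRACKETED_IPV4.findall(header):
            try:
                ip = str(ipaddress.IPv4Address(candidate))
            except ValueError:
                continue  # looked like an address but is not a valid one
            if ip not in found:
                found.append(ip)
    return found


SAMPLE = (
    "Received: from mx.example.net (mx.example.net [198.51.100.7])\n"
    "\tby mail.example.org with ESMTP; Mon, 1 Jan 2024 00:00:00 +0000\n"
    "Received: from client.example.com (unknown [203.0.113.25])\n"
    "\tby mx.example.net with ESMTP; Mon, 1 Jan 2024 00:00:00 +0000\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

if __name__ == "__main__":
    print(received_ips(SAMPLE))  # ['198.51.100.7', '203.0.113.25']
```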
Because the SBL is a manually managed resource, it’s at the very least suboptimal when it comes to following fast-paced operations that constantly shift from one location to another. Therefore, in order to keep track of these operations (snowshoe and hailstorm sources are the first to come to mind), something else was needed to complement it. That is how the CSS dataset came to life, associated with its own return code 127.0.0.3 (1003).
The CSS is a completely automated sublist, listing SMTP emitters associated with a low reputation or confirmed abuse. This can mean either a resource controlled by an abusing actor or a compromised host. It should only be checked against the sending IP, and a listing can be used to outright reject the delivery.
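Checking a sending IP over DNS follows the usual DNSBL convention: the IPv4 octets are reversed and prepended to the list's zone. The sketch below only builds that query name. The zone is passed in by the caller, since the actual zone depends on how the data is accessed and is not stated here; `dnsbl.example` is a placeholder and the function name is hypothetical.

```python
import ipaddress


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    address = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(address).split(".")))
    return f"{reversed_octets}.{zone.strip('.')}"


if __name__ == "__main__":
    # "dnsbl.example" is a placeholder zone, not a real service.
    print(dnsbl_query_name("192.0.2.99", "dnsbl.example"))
    # 99.2.0.192.dnsbl.example
```

An A-record lookup on the resulting name would then return one or more 127.0.0.x answers if the address is listed, and no answer otherwise.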
DROP/EDROP is an additional flag added to SBL listings, indicating that the resource is known to be controlled by a bad actor. This means that a query returning the code for DROP/EDROP (127.0.0.9 / 1009) will always return the code for SBL listings as well. It indicates that the queried IP is part of IP resources assigned to known rogue entities, with bulletproof hosters and similar shady operators being typical examples.
It is strongly suggested to avoid any kind of interaction with entities listed by this dataset. This is by no means limited to SMTP: as a matter of fact Spamhaus has been inviting consumers to apply DROP/EDROP at the firewall or router level, dropping any traffic coming from (or going to) these network resources.
For this reason, DROP/EDROP is distributed in a number of different ways, in order to give users the largest possible flexibility.
The distinction between DROP and EDROP is:
if the network resource has been directly given by an RIR to the bad actors, it belongs to DROP
if the network resource has been delegated to the bad actor by an ISP, it belongs to EDROP
From any consumer perspective, the difference between the two sub-lists is negligible, and they can safely be treated as the same thing. As a result, not all the query methods provide this dataset in a way that allows the two components to be told apart.
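Applying DROP/EDROP at the firewall or router level, as suggested above, boils down to a prefix-membership test: is the peer address inside any listed network? The sketch below shows that test with Python's `ipaddress` module. The input format (one CIDR per line, with an optional `;` comment) and the sample prefixes, which are documentation ranges, are assumptions for illustration only.

```python
import ipaddress


def load_prefixes(lines):
    """Parse CIDR prefixes, ignoring blank lines and ';' comments (assumed format)."""
    prefixes = []
    for line in lines:
        entry = line.split(";", 1)[0].strip()
        if entry:
            prefixes.append(ipaddress.ip_network(entry))
    return prefixes


def is_dropped(ip: str, prefixes) -> bool:
    """True if the address falls inside any of the listed networks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in prefixes)


# Hypothetical sample data using documentation ranges, not real listings.
SAMPLE_LIST = [
    "; sample list",
    "192.0.2.0/24 ; example entry",
    "198.51.100.128/25",
    "",
]

if __name__ == "__main__":
    prefixes = load_prefixes(SAMPLE_LIST)
    print(is_dropped("192.0.2.77", prefixes))    # True
    print(is_dropped("198.51.100.5", prefixes))  # False
```

A linear scan is fine for a sketch. A packet filter would normally load the prefixes into its own set or table structure instead.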
It is widely known that compromised hosts (be they servers sitting in a datacenter, infected computers on someone’s desk, or vulnerable IoT devices in somebody’s cellar) are generally used to emit spam (among other bad deeds). XBL is a list of IPs that have recently been observed hosting compromised hosts. It can be composed of several independent contributions, each associated with its own return code in the range between 127.0.0.4 (1004) and 127.0.0.7 (1007). At the moment, however, the only component is the CBL (see https://www.abuseat.org), hence only 127.0.0.4 (1004) is currently being returned.
The first suggested use for this dataset is to outright block SMTP deliveries coming from an IP listed by it.
Hosts can be compromised and used for abusive purposes even without actually emitting spam, however. For this reason other usages are possible: for example, if a URL contained in the message body points to an exploited webserver, there’s a non-trivial chance that the message itself is spam, pointing the recipient to an abusive URL that will redirect them to the spammer’s website or, in the worst case, download malware of some sort.
Using the XBL to check the IPs URLs point to is therefore possible and suggested, but only as part of a scoring system where this is one of the indicators taken into account.
Similarly, using DHP against the XBL is possible, but the chance of false positives can be quite significant, particularly when the source is on a dynamically assigned address (meaning the sender inherited an IP that hosted a compromised system hours before) or behind NAT (where one host is compromised and most others are not, but all share the same public IP); therefore, it should only be used in a scoring system.
A side effect of having compromised hosts emitting spam is that you’ll end up seeing SMTP traffic reaching your MX from networks where no SMTP server is expected, most notably dynamic IP space used for residential connectivity pools.
This means that, even when an infected system is not yet known to the XBL, you can possibly identify it as an unwanted source based on where the traffic is originating.
This is, in the end, what led to the creation of PBL. Strictly speaking, it’s not an abuse-related list: it’s a list of dynamic and low-security IP space. In general, it’s address space that should never host an SMTP server, therefore any SMTP connection coming from this IP space is almost certainly abusive.
Since every message has to originate somewhere, DHP against PBL makes no sense and is highly discouraged. On the other hand, scoring based on PBL for URLs is possible, although not particularly performant.
Two return codes are associated with this dataset, telling whether the nature of the listed subnet has been inferred by Spamhaus (127.0.0.11 / 1011) or indicated directly by the ISP responsible for the network (127.0.0.10 / 1010).
Some bots are known to perform authentication credential hijacking or bruteforcing. Knowing if the peer in an authentication session is amongst those can be a precious datapoint when your application is trying to decide if a client session is legit or abusive.
AuthBL is basically that: a collection of bots known to use stolen credentials or authentication bruteforce.
For the most part, AuthBL is therefore a subset of the XBL, and it aims to help in any situation where credentials are in use and can be stolen, from SMTP-AUTH, to IMAP, to HTTP or other protocols that have nothing to do with email in the first place, like ssh or VoIP.
It’s associated with its own return code 127.0.0.20 / 1020.
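The return codes of the IP-based datasets described so far can be collected into a small lookup table. The sketch below labels a set of DNSBL answers using only the codes quoted above. The function name and the label strings are hypothetical, and anything not mentioned in the text is reported as unknown rather than guessed.

```python
import ipaddress

# Last octet of a 127.0.0.x answer -> dataset, as listed in the text above.
IP_DATASETS = {
    2: "SBL",
    3: "CSS",
    4: "XBL",
    5: "XBL",
    6: "XBL",
    7: "XBL",
    9: "DROP/EDROP",
    10: "PBL (indicated by the ISP)",
    11: "PBL (inferred by Spamhaus)",
    20: "AuthBL",
}


def describe_answers(answers):
    """Label each DNSBL answer with the dataset it belongs to."""
    labels = []
    for answer in answers:
        octets = ipaddress.IPv4Address(answer).packed
        if octets[:3] != bytes((127, 0, 0)):
            labels.append("not an IP-dataset code")
        else:
            labels.append(IP_DATASETS.get(octets[3], "unknown"))
    return labels


if __name__ == "__main__":
    # A DROP/EDROP listing is always accompanied by the SBL code.
    print(describe_answers(["127.0.0.2", "127.0.0.9"]))
    # ['SBL', 'DROP/EDROP']
```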
IPs are not, by far, the only thing in a message that can be associated with a reputation.
DBL is a database of domains with a poor reputation, at least from the end-user perspective.
In truth, what the DBL does is effectively keep track of every domain seen on the Internet, compute a reputation “score” for each, and produce a list of those that:
are above a certain threshold
have been observed active in the last X days
This list is what users would query.
Different return codes are used to tag the type of abuse the domain has been observed involved in whenever that information is available.
One thing that should be noted is that not all the records have the same meaning in terms of “badness”. Two separate sets of return codes are provided:
127.0.1.2-99 / 2002-2099 identify resources that are considered inherently bad or associated with a low reputation. In general, it means that the domain is “safe to block” according to Spamhaus data.
127.0.1.102-199 / 2102-2199 identify domains that, while not inherently “bad”, have been observed involved in abuse. Briefly referred to as “abused-legit”, the typical example is a domain that, due to a security issue, is currently serving malicious content. This second set of return codes is only suggested for use in scoring systems.
If queried for an IP, the DBL will return a positive reply with the return code 127.0.1.255: this should in every respect be treated as an error code, with the meaning “IP queries not supported”; in HTTP lookups, this error is conveyed by an HTTP code.
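The three cases above (safe to block, abused-legit, and the IP-query error) map naturally onto a small classifier over the last octet of a DBL answer. This is a minimal sketch using only the ranges quoted in the text. The function name and the returned labels are hypothetical.

```python
import ipaddress


def classify_dbl(answer: str) -> str:
    """Classify a DBL DNS answer according to the ranges described above."""
    octets = ipaddress.IPv4Address(answer).packed
    if octets[:3] != bytes((127, 0, 1)):
        return "not-dbl"
    last = octets[3]
    if last == 255:
        return "error"   # IP queries not supported: never treat as a listing
    if 2 <= last <= 99:
        return "block"   # inherently bad or low reputation
    if 102 <= last <= 199:
        return "score"   # abused-legit: use in scoring systems only
    return "unknown"


if __name__ == "__main__":
    print(classify_dbl("127.0.1.2"))    # block
    print(classify_dbl("127.0.1.102"))  # score
    print(classify_dbl("127.0.1.255"))  # error
```

Treating 127.0.1.255 as a separate outcome matters because it arrives as a positive DNS reply. A client that only checks whether any answer came back would wrongly count it as a listing.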
Ok, but then: what if somebody starts using a domain that has just been registered, before it even acquires a reputation?!?
Unsurprisingly, that happens all the time and has for quite a while. The actual numbers may differ, but something all research on domain abuse agrees upon is that the vast majority of newly registered domains will only be used for some bad deed for some time, then scrapped and left unused until they expire.
As a result, the vast majority of newly-registered domains can’t be trusted, turning the lack of reputation into a strong reputation indicator by itself, in a way.
ZRD is a database of domains that have been observed for the first time in the last 24 hours and can therefore be treated with extreme prejudice.
The fourth octet of the return code (in the range between 127.0.2.2 / 3002 and 127.0.2.24 / 3024) is used to indicate the time elapsed since the domain’s first observation, in hours.
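Reading the age out of a ZRD answer can be sketched as follows, taking the fourth octet directly as the number of hours, which is how the description above reads. The function name is hypothetical, and answers outside the quoted range yield `None` rather than a guess.

```python
import ipaddress
from typing import Optional


def zrd_age_hours(answer: str) -> Optional[int]:
    """Return the hours since first observation encoded in a ZRD answer."""
    octets = ipaddress.IPv4Address(answer).packed
    if octets[:3] == bytes((127, 0, 2)) and 2 <= octets[3] <= 24:
        return octets[3]
    return None  # not a ZRD code in the documented range


if __name__ == "__main__":
    print(zrd_age_hours("127.0.2.5"))  # 5
    print(zrd_age_hours("127.0.1.2"))  # None
```

A scoring system could use the returned value to weigh very recent domains more heavily than those close to the 24-hour mark.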