This is my second article about Log Pattern Recognition. In this series, we explore different approaches to automated pattern recognition.

## An introduction to Drain

The Drain paper describes an algorithm to identify log patterns (called log templates in the paper). Drain works in a single pass over streamed data. To reduce the time and the space required to ingest logs, Drain uses a fixed depth tree.

### Short comparison with LogMine

Drain’s approach is very different from LogMine’s:

• Drain assumes that the first tokens of a log are likely to be constant. Drain will still work if that is not the case, but pattern identification will be less effective.
• Logs are grouped by length: there is no alignment algorithm. This makes processing faster, but delimiters must be well defined.

At the same time, Drain also uses methods similar to LogMine’s:

• It only keeps a single log representative to reduce memory footprint
• On each new log, it updates this representative to create a new pattern
• The distance function used to calculate similarity between logs is the same

IBM’s version (Drain3) implements features such as masking which match the behavior of LogMine’s user-provided patterns. As this article focuses on the Drain algorithm as described by the paper, I will not dive into the (great) extensions made on Drain3.

### Using Drain

Let’s illustrate how Drain works by using the same logs as our LogMine example:

10:00:00 - [DEBUG] - User 123 disconnected
10:30:00 - [DEBUG] - User 123 disconnected
11:11:11 - [ERROR] - An error occurred while disconnecting user 456
12:12:12 - [DEBUG] - User 789 disconnected


Because Drain assumes beginning tokens are likely to be constant, and actually works better when they are, we will assume that a time pattern was defined. While such masking exists in Drain3, it is not described in the original paper, so we will simply modify the logs given to Drain:

<time> - [DEBUG] - User 123 disconnected
<time> - [DEBUG] - User 123 disconnected
<time> - [ERROR] - An error occurred while disconnecting user 456
<time> - [DEBUG] - User 789 disconnected


Note

In the examples below, we will keep most of the metadata as-is to match the behavior described in the paper. Be aware that the implementations recommend against this because it impacts performance and accuracy:

> The input for Drain3 is the unstructured free-text portion of log messages. It is recommended to extract structured headers like timestamp, hostname, severity, etc. from log messages before passing them to Drain3, in order to improve mining accuracy.
> — Expected Input (IBM/Drain3)

Executing Drain3, we get the following result (with <*> representing a wildcard):

Clusters:
1 (size 3): <time> - [DEBUG] - User <*> disconnected
2 (size 1): <time> - [ERROR] - An error occurred while disconnecting user 456


Awesome, right? Let’s see what happens behind the scenes!

## How does Drain work?

In order to extract patterns from logs, Drain executes the following steps:

1. Clean logs and split them into an array of tokens
2. Check for matching groups using the fixed depth tree
3. Check whether one of the found groups is “close” enough to the log
   1. Form a new pattern if there is a match
   2. Create a new group otherwise

Each group contains a log representative, called a Log Event, and the number of logs that matched the group.
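As a minimal sketch, such a group could be represented like this (the field names are mine, not the paper’s):

```python
from dataclasses import dataclass


@dataclass
class LogGroup:
    # The log representative ("Log Event"): a token list, possibly with <*> wildcards
    template: list[str]
    # Number of logs that matched this group
    size: int = 1
```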

### Cleaning logs

Log cleaning, called preprocessing based on domain knowledge in the paper, is the first step: it produces logs that can be properly grouped together. Logs are cleaned according to rules defined by the user or the implementation. For the logs we used earlier, you could define that a single - is not a token.

Common rules usually include:

• Removing leading and trailing characters
• A list of delimiters to define how to split the log
• Removing punctuation

In our example, the only rule applied was that any space character is a delimiter. This means that the output of this step is the following:

$Token_1$ $Token_2$ $Token_3$ $\dots$ $Token_n$
<time> - [DEBUG] - User 123 disconnected
<time> - [DEBUG] - User 123 disconnected
<time> - [ERROR] - An error occurred while
<time> - [DEBUG] - User 789 disconnected

I stopped at the word while for readability reasons, but there’s no limit to the number of tokens.

Note that while I represent the output of this step as a table, logs are ingested one at a time.
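To make this concrete, here is a minimal sketch of the cleaning step; the `<time>` masking regex is my own assumption, standing in for a user-defined rule:

```python
import re

# Hypothetical user-defined rule: mask HH:MM:SS timestamps as <time>
TIME_RE = re.compile(r"\b\d{2}:\d{2}:\d{2}\b")


def preprocess(line: str) -> list[str]:
    """Apply the masking rules, then split on whitespace (our only delimiter)."""
    return TIME_RE.sub("<time>", line).split()
```

For example, `preprocess("10:00:00 - [DEBUG] - User 123 disconnected")` returns `['<time>', '-', '[DEBUG]', '-', 'User', '123', 'disconnected']`.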

### Fixed Depth Tree

The second step is actually the core of Drain. This is the most interesting part of the algorithm and where most of the work is being done.

A fixed depth tree is a tree whose “height” is limited. In Drain, this depth can be adjusted by the user, and the number of children per node can be limited by a MaxChild parameter.

The tree is made of the following levels:

• The Root node does not encode any information and is only used to access the tree
• The first level encodes the number of tokens that compose a log message
• The N next levels encode the N beginning tokens. This is why Drain assumes that beginning tokens are likely to be constant.
• The leaf nodes hold a collection of log groups (including tokens from the upper level nodes)

The Root node and the leaves are excluded when counting the depth. Using our example and a Parse Tree with N=2:

To better illustrate what happens, let’s add the following log to our tree:

<time> Critical error - User not disconnected


As there was no matching group and no path matching the first tokens in our log, a new path and a new log group, highlighted in green, were created.

#### Searching for groups that might match the Log

Now that we’ve seen the tree structure, let’s see how it is used. In order to find candidate groups to assign our log to, we need to reach a leaf.

To do this, we follow the tree from the root node to the leaf. On the first level, we select the node matching the number of tokens of our log message.

For every level after that, we use the tokens of our log as keys. For instance, with the log <time> - [DEBUG] - User 123 disconnected, the algorithm will search for the following nodes:

• <time> $\Longrightarrow$ - for a Depth=2
• <time> $\Longrightarrow$ - $\Longrightarrow$ [DEBUG] for a Depth=3

The number of tokens in the tree depends on the depth. A deeper tree improves performance but increases the risk that a token in the tree is variable.

From the structure seen earlier, you can easily imagine how a variable token can lead to a tree with millions of nodes. This breaks both performance and grouping, because logs in different subtrees cannot get grouped together.

Note

To mitigate branch explosion, the paper defines that any token containing a digit is variable, and groups such tokens under a wildcard (<*>) node.

This is actually a very smart mitigation. In the best case scenario, logs are grouped properly and performance is not impacted. In the worst case scenario, we reduced the number of nodes in the tree but increased the number of groups in the leaves.

The original paper also limits the number of children a node can have and groups any log that does not match an existing child under a wildcard node. This mitigation avoids branch explosion but does not help merge logs previously grouped into different subtrees.

If the algorithm finds a candidate group, we can proceed to the next step. Otherwise, we create a new group and add it to the tree, as we did earlier with <time> Critical error - User not disconnected.
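The search, and the path creation that backs the group-creation fallback, can be sketched with nested dictionaries. This is my own simplification of the Parse Tree, not the paper’s exact data structure:

```python
def path_key(token: str) -> str:
    # The paper's mitigation: any token containing a digit is treated as variable
    return "<*>" if any(c.isdigit() for c in token) else token


def insert(root: dict, tokens: list[str], group, depth: int = 2) -> None:
    """Create the path for this log and attach the group to the leaf."""
    node = root.setdefault(len(tokens), {})  # level 1: number of tokens
    for token in tokens[:depth]:             # next N levels: beginning tokens
        node = node.setdefault(path_key(token), {})
    node.setdefault("groups", []).append(group)


def search(root: dict, tokens: list[str], depth: int = 2):
    """Follow the same path; return the candidate groups, or None if no path exists."""
    node = root.get(len(tokens))
    for token in tokens[:depth]:
        if node is None:
            return None
        node = node.get(path_key(token))
    return node.get("groups") if node else None
```

With Depth=2, ingesting `<time> - [DEBUG] - User 123 disconnected` creates the path `7 → <time> → -`, and `<time> Critical error - User not disconnected` (same length, different second token) lands on a different branch.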

#### Assigning our log to a group

Once a leaf node is reached, there is a list of candidate groups that might match our log. Because the tree depth is limited, many groups can end up on a single leaf. In our example, any log with the same length and starting with <time> - would be on the same leaf.

To group logs together, Drain introduces a notion of distance called Token Similarity. While it is presented a bit differently, the formula is exactly the same as LogMine’s!

$\text{simSeq} = \frac{\sum_{i=1}^{n}{\text{ equ}(seq_1(i),seq_2(i))}}{n}$

Where $seq_1$ and $seq_2$ are two logs ($seq_j(i)$ being the $i$-th token of $seq_j$), and $n$ the number of tokens

And $\text{equ}$ defined as:

$\text{equ}(t_1, t_2) = \begin{cases} 1 &\text{if } t_1=t_2\\ 0 &\text{otherwise} \end{cases}$

This can easily be applied on this log made of $n=5$ tokens.

| | | | | | |
|---|---|---|---|---|---|
| $Log_{Representative}$ | My | awesome | log | is | <*> |
| $Log_{Ingested}$ | My | error | log | is | awesome |
| Sequence Similarity | $\frac{1}{5}$ | 0 | $\frac{1}{5}$ | $\frac{1}{5}$ | $\frac{1}{5}$ |
| Sequence Similarity (Cumulative) | $\frac{1}{5}$ | $\frac{1}{5}$ | $\frac{2}{5}$ | $\frac{3}{5}$ | $\frac{4}{5}$ |

Which results in $simSeq(Log_{Representative},Log_{Ingested}) = \frac{4}{5}$.

When comparing to LogMine’s version, we can directly match $\text{equ}$ to the $\text{Score}$ function.

And $\text{Dist} = 1 - \text{simSeq}$, because $\text{Max}(\text{len}(P),\text{len}(Q))$ is always $n$ since $\text{len}(P) = \text{len}(Q)$.

If the result of this function is higher than the similarity threshold $st$, the group is identified as most suitable. This match-first behavior is exactly the same as LogMine’s. To optimize this operation, we can stop early once the sum has reached the similarity threshold.

If no group has been identified, a new group will be created.
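Putting these two steps together, here is a minimal sketch, assuming, as in the worked table above, that a wildcard in the representative counts as a match:

```python
def sim_seq(template: list[str], tokens: list[str]) -> float:
    """simSeq: fraction of positions where the tokens are equal.
    A <*> in the representative matches any token (as in the table above)."""
    n = len(template)
    return sum(t1 == t2 or t1 == "<*>" for t1, t2 in zip(template, tokens)) / n


def find_group(groups: list[list[str]], tokens: list[str], st: float = 0.5):
    """Match-first: return the first representative whose similarity reaches st."""
    for template in groups:
        if sim_seq(template, tokens) >= st:
            return template
    return None  # the caller will create a new group
```

Running it on the example above, `sim_seq("My awesome log is <*>".split(), "My error log is awesome".split())` gives 4/5.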

#### Merging a log with an existing group

When a group has been identified as a suitable log group, Drain will merge the log into the group. To do that, it compares the log being ingested to the current representative of the group, token by token, and creates a new representative:

• If the tokens are the same, the token is returned
• Else, a wildcard is returned

For example, the merge of the following two logs:

<time> - [DEBUG] - User 123 disconnected
<time> - [DEBUG] - User 123 disconnected


will return:

<time> - [DEBUG] - User 123 disconnected


And the merge of the following two logs:

<time> - [DEBUG] - User 123 disconnected
<time> - [DEBUG] - User 789 disconnected


will return:

<time> - [DEBUG] - User <*> disconnected


This behavior is also the same as the behavior we’ve seen in LogMine. Because logs in a group have the same length, we don’t need an algorithm to align the two logs.
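Because both token lists have the same length, this merge is a one-liner; a sketch:

```python
def merge(template: list[str], tokens: list[str]) -> list[str]:
    """Update the representative: keep identical tokens, wildcard the rest."""
    return [t1 if t1 == t2 else "<*>" for t1, t2 in zip(template, tokens)]
```

Merging the two DEBUG logs above yields `['<time>', '-', '[DEBUG]', '-', 'User', '<*>', 'disconnected']`.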

In the paper, there is a concept of log identifier (log ID): when a log is ingested, its ID is added to the group. Implementations, however, usually do not do this. Since Drain works on streamed data, determining a unique identifier for logs it does not store can be impossible. As an example, imagine ingesting the four logs we’ve defined earlier by feeding them into Drain’s stdin. Would the ID be the line number? What happens if a log was missed during collection, or if ordering is not deterministic?

This might be why implementations chose to only count the number of logs assigned to a cluster.

## Wrapping up

Drain has a very interesting approach to Log Parsing. Unlike LogMine, it assumes that beginning tokens are likely to be constant and that logs with different length are not from the same template.

Preprocessing rules can be added to help the algorithm and tuning can be done on several parameters:

• MaxChild, which limits the number of children a node of the Parse Tree can have
• Depth, which customizes the depth of the Parse Tree
• st, the similarity threshold

But it does not solve the issues we’ve seen with LogMine, especially values whose meaning spans several tokens ([Sun Dec 04 04:47:44 2005] is a single date but will be split into several tokens).

Unlike LogMine, Drain keeps track of how many logs were assigned to a cluster by default.

Finally, just like LogMine, Drain works on logs where token order does not change.

## Thanks

I’d like to thank Marc Schmitt and Brice Dutheil for reviewing this article.