Using Static Code Analysis to Improve Log Parsing
Introduction
Palantir has invested extensively in enabling remote debugging workflows powered by our log and telemetry infrastructure. Sometimes we must investigate issues with third-party software that we run, and thus care about capturing logs and metrics for these components. Combining regular expressions and static code analysis, we have been able to extract structured information from these otherwise unstructured logs. This blog post outlines how we do it.
Our logs and telemetry infrastructure
Backend and frontend components of Palantir’s products, Gotham and Foundry, emit log and metric information in a structured JSON format rather than as unstructured text. These structured log lines look like this:
{
  "type": "service.1",
  "message": "Accessing dataset.",
  "level": "DEBUG",
  "params": {
    "dataset": "my-dataset"
  },
  "time": "2019-07-28T09:31:04.46165-07:00"
}
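For illustration, here is a minimal sketch (not Palantir’s actual logging library) of how a service might emit an entry of this shape, keeping the message constant and the parameters in a separate field:
package main

import (
    "encoding/json"
    "os"
    "time"
)

// serviceLog mirrors the structured format shown above.
type serviceLog struct {
    Type    string            `json:"type"`
    Message string            `json:"message"`
    Level   string            `json:"level"`
    Params  map[string]string `json:"params"`
    Time    time.Time         `json:"time"`
}

// logDebug writes one structured entry. The message stays constant; anything
// variable goes into params rather than being formatted into the message string.
func logDebug(message string, params map[string]string) {
    entry := serviceLog{
        Type:    "service.1",
        Message: message,
        Level:   "DEBUG",
        Params:  params,
        Time:    time.Now(),
    }
    _ = json.NewEncoder(os.Stdout).Encode(entry)
}

func main() {
    logDebug("Accessing dataset.", map[string]string{"dataset": "my-dataset"})
}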
Note that these structured logs avoid a common logging practice of inserting parameters into the log message via string formatting, creating a variable message. Structured logs with known schemas facilitate indexing, searching, and analyzing log information. For instance, we could run a query against our log database to find every occurrence of this particular log entry:
SELECT *
FROM service.1-logs
WHERE message = "Accessing dataset."
Structured logs also allow for more complex queries like aggregations on specific fields within the log. Log aggregations offer a drill-down approach to debugging. Perhaps, for example, we’d learn something if we bucketed our query results by the params.dataset
field:
SELECT COUNT(*)
FROM service.1-logs
WHERE message = "Accessing dataset."
GROUP BY params.dataset
(Aside: we don’t actually use SQL when querying these logs, but use it here as it is widely understood.) This query’s results may reveal anomalous behavior around a certain dataset and tip us off where to look next. This flexibility in log querying has greatly improved the ability of product teams to become operationally responsible for their software, and makes supporting a product everywhere it’s deployed (often hundreds of installations) tractable for even small teams. There’s a little gap in coverage here, though — what about supporting and debugging open-source software that doesn’t produce logs in the format we expect?
Making third party logs consumable
Our Rubix team (Rubix is our name for our Kubernetes infrastructure) runs third-party software like etcd, vault, and Kubernetes that does not produce logs in this format, but we still want these logs to flow through our pipeline so that product developers can reliably use the same debugging workflow for both Palantir-written and third-party software. Over a year ago now, we wrote a small tool called planer to solve this problem. Planer uses regular expressions to impose Palantir’s standard log structure on unstructured log lines from third-party software, thus allowing us to index and analyze third-party logs through the same log pipeline.
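To illustrate the idea, here is a sketch (not planer’s actual implementation; the rule and the example line are hypothetical) of how a regular expression with named capture groups can lift an unstructured line into the structured format shown earlier:
package main

import (
    "encoding/json"
    "fmt"
    "regexp"
)

// A hypothetical planer-style rule: a regex whose named capture groups become params.
var rule = regexp.MustCompile(`^compacted raft log at (?P<index>\d+) \(took (?P<duration>.+)\)$`)

// structureLine applies the rule to a raw third-party log line and, on a match,
// emits an entry in the structured format, with each named group pulled into params.
func structureLine(raw string) (string, bool) {
    match := rule.FindStringSubmatch(raw)
    if match == nil {
        return "", false // no structure imposed on unmatched lines
    }
    params := map[string]string{}
    for i, name := range rule.SubexpNames() {
        if i > 0 && name != "" {
            params[name] = match[i]
        }
    }
    entry := map[string]interface{}{
        "type":    "service.1",
        "level":   "INFO",
        "message": "compacted raft log", // a constant message for this rule
        "params":  params,
    }
    out, _ := json.Marshal(entry)
    return string(out), true
}

func main() {
    if structured, ok := structureLine("compacted raft log at 12345 (took 1.2ms)"); ok {
        fmt.Println(structured)
    }
}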
Planer’s shortcomings
Writing regular expressions is hard — handcrafting these regexes can be very time-consuming, and the resulting regexes are often brittle. Moreover, planer will drop any log line that doesn’t match any of its configured regular expressions. For this reason, most of our early planer configurations used overly general regular expressions, which meant that the log lines we were searching for while debugging issues usually had their parameters embedded somewhere in the message instead of pulled out into the params field. In other words, we were capturing all the logs we wanted, but we weren’t imposing any useful structure on them, and that cost us a good amount of pain and valuable time while debugging in production.
What options do we have for improving planer configurations? We could either:
- Search existing logs and see which messages looked like they had parameters in them, and then handcraft regular expressions to match these lines.
- Inspect the source code of the third-party product and back out regular expressions from logging calls the code makes.
Both these options are slow and error-prone. We thus started looking for an approach that would improve the speed and accuracy with which we wrote more useful planer configurations.
Programmatically generating planer configuration
Most of the third-party software the Rubix team deploys and supports is written in Golang. Golang’s standard library includes the go/ast
package, which enables consumers to read and traverse the abstract syntax tree of a go package—any go package. Each node in a go AST has a concrete type. Logging calls in Golang (or, more generally, all method invocations) take the form of one of these concrete types: *ast.CallExpr
(CallExpr is short for call expression). For the etcd repository, these logging call expressions look like:
plog.Printf("finished scheduled compaction at %d (took %v)", compactMainRev, time.Since(totalStart))
plog.Warningf("%srequest %q with result %q took too long (%v) to execute", prefix, reqStringer.String(), result, d)
Taking advantage of the known method signatures of these logging call expressions (the same as the standard library’s fmt.Printf
method), we can traverse the AST for each package in the repository and, when we come to a call expression we recognize as a logging call, construct a regular expression based on the call expression’s arguments. In pseudo-code, this looks like:
configure AST reader with a list of call expressions to match:
- plog.Printf
- plog.Warningf

proposedRegexes = []Regex
for each package p in repository:
    for each file f in p:
        for each node in f's AST:
            if the node is a call expression:
                if the call expression's method matches any of our configured methods:
                    create regex by inspecting the arguments of the call expression
                    add regex to proposedRegexes
return proposedRegexes
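A sketch of what this traversal might look like with the standard library’s go/parser and go/ast packages is below (the extractRegex helper it calls is shown in the next snippet; directory and error handling are simplified):
package main

import (
    "go/ast"
    "go/parser"
    "go/token"
    "regexp"
)

// loggingMethods holds the method names on the plog logger that we treat as logging calls.
var loggingMethods = map[string]bool{
    "Printf":   true,
    "Warningf": true,
}

// proposeRegexesForDir parses every Go file in dir and proposes a regular
// expression for each matching logging call expression it finds.
func proposeRegexesForDir(dir string) ([]*regexp.Regexp, error) {
    fset := token.NewFileSet()
    pkgs, err := parser.ParseDir(fset, dir, nil, 0)
    if err != nil {
        return nil, err
    }
    var proposed []*regexp.Regexp
    for _, pkg := range pkgs {
        for _, file := range pkg.Files {
            ast.Inspect(file, func(node ast.Node) bool {
                callExpr, ok := node.(*ast.CallExpr)
                if !ok {
                    return true // keep walking into child nodes
                }
                // plog.Printf(...) parses as a selector expression: receiver plog, method Printf.
                sel, ok := callExpr.Fun.(*ast.SelectorExpr)
                if !ok {
                    return true
                }
                recv, ok := sel.X.(*ast.Ident)
                if !ok || recv.Name != "plog" || !loggingMethods[sel.Sel.Name] {
                    return true
                }
                // extractRegex is defined in the next snippet.
                if regex, _, found := extractRegex(callExpr); found {
                    proposed = append(proposed, regex)
                }
                return true
            })
        }
    }
    return proposed, nil
}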
A simplified, minimal Golang example for constructing a regex from a matched call expression, omitting type assertion and slice length checks for brevity, looks like:
func extractRegex(callExpr *ast.CallExpr) (*regexp.Regexp, []string, bool) {
    logArgs := callExpr.Args
    // The first argument is the format string, e.g. "finished scheduled compaction at %d (took %v)".
    message := strings.Trim(logArgs[0].(*ast.BasicLit).Value, "\"")
    logParams := make([]string, 0, len(logArgs)-1)
    logFmtArgs := logArgs[1:]
    for _, logArg := range logFmtArgs {
        // parseLogParamName is a helper (defined elsewhere in our generator) that derives
        // a capture-group name from the argument expression, e.g. compactMainRev -> compactmainrev.
        logParams = append(logParams, parseLogParamName(logArg))
    }
    return proposeRegex(message, logParams), logParams, true
}

var (
    fmtPlaceHolderRegex = regexp.MustCompile(`%[+#\-\s]?[a-zA-Z]`)
)

func proposeRegex(message string, logParams []string) *regexp.Regexp {
    fmtPlaceHolders := fmtPlaceHolderRegex.FindAllStringIndex(message, len(logParams))
    var proposedRegex string
    currPos, messageIndex := 0, 1
    for i, param := range logParams {
        matchRange := fmtPlaceHolders[i]
        // Capture any literal message text between the previous placeholder and this one.
        if currPos != matchRange[0] {
            formattedMessageGroup := fmt.Sprintf("(?P<%s>%s)", fmt.Sprintf("message%d", messageIndex), utils.EscapeMessageSpecialChars(message[currPos:matchRange[0]]))
            proposedRegex += formattedMessageGroup
            messageIndex++
        }
        currPos = matchRange[1]
        // Capture the parameter itself with a named group.
        proposedRegex += fmt.Sprintf("(?P<%s>.+)", param)
    }
    // Capture any literal message text after the final placeholder, then anchor the end.
    if currPos != len(message) {
        proposedRegex += fmt.Sprintf("(?P<message%d>%s)", messageIndex, utils.EscapeMessageSpecialChars(message[currPos:]))
    }
    proposedRegex += "$"
    return regexp.MustCompile(proposedRegex)
}
This code would take our etcd logging call expression from above:
plog.Printf("finished scheduled compaction at %d (took %v)", compactMainRev, time.Since(totalStart))
and return the following regular expression:
(?P<message1>finished scheduled compaction at )(?P<compactmainrev>.+)(?P<message2> \(took )(?P<timesince>.+)(?P<message3>\))$
We now have a list of regular expressions that are guaranteed (modulo bugs in the generator code, of course) to match logged messages!
Deciding which regular expressions to use
We now have all the regular expressions we need — too many, in fact. Running this on the etcd repository generated 441 regular expressions. We observed performance issues when we configured planer with this many regular expressions — the mean and P95 for single line matching were 0.32 milliseconds and 1.53 milliseconds, respectively. These match times would prevent planer from processing lines fast enough at our larger-scale deployments, where etcd has high log output. To match a log line, planer iterates through its configured regular expressions, returning as soon as it finds a regular expression that matches the log line. Some log lines thus take longer to match than others, depending on where the regular expression for a given line appears in our generated configuration.
We can use a simple trick to reduce the average processing time: let’s order the regular expressions such that the expressions that match the most log lines are checked first. To achieve this ordering, we need to query existing log data to see which messages appear most often. Using our log infrastructure, we were able to:
- Run a query aggregating logs by their message field (SELECT COUNT(*) FROM service.1-logs GROUP BY message). This query effectively results in a list of two-tuples, where each tuple looks like (message, number of appearances of that message).
- Keep a map of regular expressions to the number of log lines each regex matches.
- Run each message through every regex — if a regex matches, increment its count by the number of log lines in which the message appears.
- Output the regular expressions in order of the number of log lines each regular expression matches, from greatest to least. If a regular expression matched no log lines, it will not appear in our output.
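A sketch of this ordering and filtering step is below, assuming the aggregation results have already been fetched from our log infrastructure as (message, count) pairs; the types and function names here are illustrative:
package main

import (
    "fmt"
    "regexp"
    "sort"
)

// messageCount is one row of the GROUP BY message aggregation: a distinct
// message and how many log lines carried it.
type messageCount struct {
    Message string
    Count   int
}

// orderRegexes returns the proposed regexes sorted by how many existing log
// lines they matched, most frequently matched first. Regexes that matched no
// log lines are dropped from the output entirely.
func orderRegexes(proposed []*regexp.Regexp, counts []messageCount) []*regexp.Regexp {
    matched := make(map[*regexp.Regexp]int)
    for _, mc := range counts {
        for _, regex := range proposed {
            if regex.MatchString(mc.Message) {
                matched[regex] += mc.Count
            }
        }
    }
    ordered := make([]*regexp.Regexp, 0, len(matched))
    for regex := range matched {
        ordered = append(ordered, regex)
    }
    sort.Slice(ordered, func(i, j int) bool {
        return matched[ordered[i]] > matched[ordered[j]]
    })
    return ordered
}

func main() {
    proposed := []*regexp.Regexp{
        regexp.MustCompile(`(?P<message1>finished scheduled compaction at )(?P<compactmainrev>.+)`),
    }
    counts := []messageCount{{Message: "finished scheduled compaction at 12345 (took 30ms)", Count: 42}}
    fmt.Println(len(orderRegexes(proposed, counts))) // 1
}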
Using this approach, we simplified our planer configuration and dropped our mean and P95 line matching times to 0.25 and 0.42 milliseconds, respectively.
Shortcomings/limitations
Some issues remain with our approach of generating regular expressions from inspecting source code. Some repositories use several different call expressions for logging, which makes the manual process of configuring our regex generator with these expressions more difficult. Our regex construction also relies on statically defining the pattern for a capture group — using .+
or .*
here will allow the resulting regex to match a wide range of values, but incurs a regex performance cost. We can construct more performant search patterns for a capture group if we inspect the type of the variable the group is intended to capture. For example, if we are constructing a capture group for a boolean variable (fmt.Printf("some boolean variable = %t", boolVar)), we can use the regular expression (true|false) instead of .+.
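One simple way to do this, sketched below, keys off the formatting verb rather than fully resolving the argument’s type, since the verb already constrains what the rendered value can look like (the function and its verb-to-pattern mapping are illustrative):
package main

import "fmt"

// patternForVerb proposes a tighter capture pattern for a fmt formatting verb.
// Verbs we know nothing useful about fall back to the generic .+ pattern.
func patternForVerb(verb string) string {
    switch verb {
    case "%t": // booleans only ever render as true or false
        return "(true|false)"
    case "%d": // decimal integers, possibly negative
        return `-?\d+`
    case "%q": // double-quoted strings
        return `".*"`
    default:
        return ".+"
    }
}

func main() {
    fmt.Println(patternForVerb("%t")) // (true|false)
}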
We also rely on meaningful and consistent variable names for the resulting params in our planed log lines in order to simplify log indexing and analytics. For example, if a repository specifies a log line with a variable named username
and another log line with a variable named userName
, we’d like to deduplicate these. The approach of lowercasing all parameter names helps us a little, but there are more complex cases like deduplicating user
and username
.
Conclusion
While this process imposed structure on many unimportant log lines, some of our newly structured logs provide solid value for debugging; one example is etcd’s warn-level logging for requests that take longer than expected to execute. Thanks to our new planer configuration, we can aggregate these log lines by request type and more easily identify troublesome etcd queries. Parsing all the logging expressions for a given repository has also allowed us to blacklist certain log lines via a separate point in planer config, which means we can drop unhelpful noise and reduce the volume our logs pipeline has to handle. (This is particularly useful for reining in kubelet log volume, which scales with the number of Kubernetes nodes we run across all our deployments.)
This project currently works quite well for the Rubix team, as our team runs the most Golang OSS components at Palantir. For other teams to benefit, we need to write regex generators for other languages. Libraries like Javaparser can enable us to pick up on logging call expressions provided by libraries like Log4j, though this introduces the additional challenge of figuring out which log encoders Log4j is set up to use.
Author
James R.