Match and Handle Date/Time Formats in Td-Agent or Fluentd

When handling your log files with either td-agent or fluentd, the built-in formats are sometimes not enough. See the 'format (required)' section of the documentation for a complete list. But what can you do if your logs do not fit those common patterns? Do you have to switch to something like json, or skip structuring and parsing the data altogether by using the none format?

Behind most built-in log formats, such as apache, apache2, apache_error, nginx or syslog, is a combination of a regular expression and a specification of how to parse the time stamp part of each log line. The corresponding configuration lines of a source entry are format and time_format. When you need a little more flexibility, for example when parsing default Golang logs or the output of some fancier logging library, you can set both yourself so that fluentd or td-agent handles those logs as usual. Here is what a source block using those two fields looks like:

<source>
  type tail
  format /^(?<time>[^ ]* [^ ]*) (?<message>.*)$/
  time_format %Y/%m/%d %H:%M:%S
  path /var/log/upstart/my-service.log
  pos_file /var/log/td-agent/my-service.log.pos
  tag my-service
</source>

This source block, when put into a td-agent.conf, makes sure that the default Golang-style logs emitted by the upstart-run job my-service are parsed properly. The time format handles log lines similar to the following:

2016/01/09 14:21:24 Hello!

Here, "Hello!" becomes the message, while the time stamp is obtained by applying the time_format to the part of the line matched by the time group of the regex. The regex expects the time to consist of two chunks of non-whitespace characters separated by a single space. But how do you find out more about the percent-prefixed letters? They are a common time formatting notation, used among others by the Python functions time.strptime and time.strftime to parse and create string representations of date-time combinations. The corresponding documentation section describes their behavior in detail, and strftime.org offers a quicker overview with examples.
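To try out a format and time_format pair before putting it into td-agent.conf, the same two steps can be sketched in Python. This is only an approximation of what fluentd does, under a few assumptions: the helper name parse_line is made up for illustration, and Python spells named groups as (?P<name>...) where the fluentd config uses (?<name>...). The percent directives used here mean the same in both.

```python
import re
from datetime import datetime

# Same pattern as the `format` line above, in Python's named-group syntax.
LINE_RE = re.compile(r"^(?P<time>[^ ]* [^ ]*) (?P<message>.*)$")

# Same directives as the `time_format` line above.
TIME_FORMAT = "%Y/%m/%d %H:%M:%S"


def parse_line(line):
    """Hypothetical helper: split a log line into (timestamp, message).

    Returns None when the line does not match the pattern at all.
    Raises ValueError when the time group does not fit TIME_FORMAT.
    """
    match = LINE_RE.match(line)
    if match is None:
        return None
    timestamp = datetime.strptime(match.group("time"), TIME_FORMAT)
    return timestamp, match.group("message")


if __name__ == "__main__":
    print(parse_line("2016/01/09 14:21:24 Hello!"))
```

If the regex matches but strptime raises a ValueError, the time_format is the part that needs adjusting, not the regex.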

I'd like to conclude with a few examples which might save you some time when handling the time_format field. For one, here is a way to parse the fancier go-json-rest library log format in the simplest case:

format /^(?<remoteaddress>[0-9]*\.[0-9]*\.[0-9]*\.[0-9]*) - (?<remoteuser>.*) (?<time>[^ ]* [^ ]* [^ ]*) (?<message>.*)$/
time_format %d/%b/%Y:%H:%M:%S %z

The message could be further subdivided into relevant fields if needed.
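The time_format on its own can be checked quickly in Python. The sketch below assumes a made-up sample time stamp in the day/month-name/year layout with a numeric UTC offset; it is not taken from real go-json-rest output, and parse_access_time is a hypothetical helper name.

```python
from datetime import datetime, timedelta

# Same directives as the time_format line above.
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def parse_access_time(text):
    """Hypothetical helper: parse a time stamp with a numeric UTC offset."""
    return datetime.strptime(text, TIME_FORMAT)


if __name__ == "__main__":
    # Assumed sample value, for illustration only.
    parsed = parse_access_time("10/Oct/2000:13:55:36 -0700")
    print(parsed.isoformat())
    print(parsed.utcoffset() == timedelta(hours=-7))
```

Note that %z matches a numeric offset such as -0700, whereas %Z matches a zone name such as GMT.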

There are three historically allowed time formats for the representation of date/time stamps in the context of HTTP applications, according to the W3C:

Sun, 06 Nov 1994 08:49:37 GMT  ; RFC 822, updated by RFC 1123
Sunday, 06-Nov-94 08:49:37 GMT ; RFC 850, obsoleted by RFC 1036
Sun Nov  6 08:49:37 1994       ; ANSI C's asctime() format

To match those, the following time_format strings can be used:

time_format %a, %d %b %Y %H:%M:%S %Z
time_format %A, %d-%b-%y %H:%M:%S %Z
time_format %c
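As a sanity check, the three sample dates above can be run through the same directives in Python. This is a sketch under assumptions: parse_http_date is a hypothetical helper, and because Python's %c depends on the active locale, the third format is spelled out as its C-locale equivalent instead of using %c directly.

```python
from datetime import datetime

# One format string per historical HTTP date layout.
HTTP_DATE_FORMATS = [
    "%a, %d %b %Y %H:%M:%S %Z",  # RFC 822 / RFC 1123
    "%A, %d-%b-%y %H:%M:%S %Z",  # RFC 850
    "%a %b %d %H:%M:%S %Y",      # asctime(), written out instead of %c
]


def parse_http_date(text):
    """Hypothetical helper: try each format in turn, return the first hit."""
    for fmt in HTTP_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError("no known HTTP date format matches: %r" % text)


if __name__ == "__main__":
    for sample in (
        "Sun, 06 Nov 1994 08:49:37 GMT",
        "Sunday, 06-Nov-94 08:49:37 GMT",
        "Sun Nov  6 08:49:37 1994",
    ):
        print(parse_http_date(sample))
```

In Python the parsed values are naive datetimes: %Z accepts the zone name GMT but does not attach an offset, so the zone has to be handled separately if it matters.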

Have a good time getting the most out of your logs!