Merikanto

一簫一劍平生意,負盡狂名十五年

Intro to InfluxDB

The growth of time-based data creates a need for time series databases. You can of course store such data in a traditional relational database management system (RDBMS) like Oracle, SQL Server or MySQL, but somewhere down the line you will run into limitations when handling time series data.

There are few native time series databases to choose from, and InfluxDB from InfluxData is one of them. It is a data store designed to handle the high write and query loads of time series projects. InfluxDB is a fairly new product and is part of InfluxData’s TICK stack.

It is written in Go, the programming language developed at Google, and as mentioned it is designed specifically for time series data, with an emphasis on high availability and I/O speed. Installation is quick and practically painless compared with other big data storage solutions, yet more complex setups remain possible thanks to the built-in scalability for distributed systems.




Time Series Data

Because InfluxDB is a time series database, it is important to see what differentiates time series data from other kinds of data. In other words: what is time series data?

In essence, time series data is a collection of datapoints that meet the following requirements:

  • every datapoint is a measurement of a variable taken within a certain time interval
  • that time interval is continuous, so measurements are taken throughout the whole interval
  • the time between any two consecutive measurements is the same
  • each datapoint holds exactly one measured value for its point in time
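The requirements above can be checked mechanically. The following is a minimal sketch, assuming the timestamps are already available as a sorted list of Python `datetime` objects; the function name `is_regular` is made up for this illustration.

```python
from datetime import datetime, timedelta


def is_regular(timestamps):
    """Return True if the timestamps are strictly increasing and equally spaced."""
    if len(timestamps) < 2:
        return True
    step = timestamps[1] - timestamps[0]
    if step <= timedelta(0):
        # Duplicate or out-of-order timestamps break the "one value per time unit" rule.
        return False
    return all(b - a == step for a, b in zip(timestamps, timestamps[1:]))


start = datetime(2024, 1, 1, 12, 0, 0)
regular = [start + timedelta(seconds=10 * i) for i in range(5)]
gappy = regular[:2] + regular[3:]  # one measurement is missing

print(is_regular(regular))
print(is_regular(gappy))
```

A series with a missing measurement fails the check because one gap is twice as long as the others.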

Examples of time series data include the closing prices on the stock markets, periodically measured temperatures or CPU loads of your computer, and your heart rate measured during a sports session.

Time series are most often visualised as line charts, as in the example image, which shows heart rate data.

Time series are often used in statistics and pattern recognition, for example in the applied sciences and engineering.

One last important feature of time series data is that it has a natural order based on time, which distinguishes it from cross-sectional data, where no natural ordering exists.



Time Series DB

Now that we know what time series data is, we need to see where time series databases fit in. The quickest answer to what a time series database is would be “a database in which you can store and manage time series data”, but what exactly are the benefits of these native time series databases over the relational databases we have become accustomed to?

First of all, time series databases are exceptionally good at handling expiring data, which means that obsolete or irrelevant data from previous time periods can easily be filtered out and removed.
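To make the idea of expiring data concrete, here is a small sketch in plain Python. It is not how InfluxDB implements expiry internally; the function `drop_expired` and the sample values are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def drop_expired(points, max_age, now):
    """Keep only the (timestamp, value) points that are at most max_age old."""
    cutoff = now - max_age
    return [(ts, value) for ts, value in points if ts >= cutoff]


now = datetime(2024, 1, 8, tzinfo=timezone.utc)
points = [
    (datetime(2023, 12, 25, tzinfo=timezone.utc), 61.0),  # 14 days old
    (datetime(2024, 1, 7, tzinfo=timezone.utc), 64.5),    # 1 day old
]

# Keep one week of data; the older point is filtered out.
recent = drop_expired(points, timedelta(days=7), now)
print(recent)
```

Because every point carries a timestamp, expiry reduces to a single comparison against a cutoff, which is what makes it cheap in a time series store.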

If removing the data is not an option, you can downsample it instead. In that case you combine multiple datapoints within a certain time interval into one datapoint, effectively lowering the granularity of the data set.
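Downsampling can be sketched in a few lines. The example below averages raw points into fixed-width time buckets; the function name `downsample`, the use of epoch seconds and the choice of the mean as the aggregate are assumptions for this illustration, and other aggregates such as the minimum or maximum work the same way.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_seconds):
    """Average (epoch_seconds, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - (ts % bucket_seconds)  # round down to the bucket boundary
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Five raw points at 10-second resolution become two per-minute points.
raw = [(0, 60.0), (10, 62.0), (20, 64.0), (60, 70.0), (70, 74.0)]
per_minute = downsample(raw, 60)
print(per_minute)
```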

The main point of a time series database is of course the ability to run time-related queries on the data efficiently and quickly.



InfluxData TICK Stack

As mentioned before, InfluxDB is the database in the platform that InfluxData calls the TICK stack. Like InfluxDB, all the other programs are written in Go and are open source. The name is simply composed of the initial letters of each program: Telegraf, InfluxDB, Chronograf and Kapacitor.


Telegraf

Telegraf is the data collector of the TICK stack. It can collect metrics and values on the host system or from external systems through an HTTPS API, and it then writes the data to InfluxDB in the correct format.

Extending the example given earlier, you can easily have Telegraf collect CPU usage data for all processes on a host computer.


Chronograf

Chronograf is an application for ad hoc visualisation of the time series data stored in InfluxDB that is simple to install and use. It lets you create templates for quick reuse and comes with pre-configured dashboards for the most common datasets. Here’s an example of time series data visualisation:


Kapacitor

Kapacitor is InfluxData’s native data processing engine for InfluxDB. It lets the user process the collected time series data through either batch processing or stream processing. The engine enables users to add their own logic to create alerts with dynamic thresholds, compare metrics against patterns or detect statistical anomalies. Kapacitor’s domain-specific language is called TICKscript:

    stream
        // Select just the cpu measurement from our example database
        |from()
            .measurement('cpu')
        // Raise a critical alert when the idle CPU percentage drops below 70
        |alert()
            .crit(lambda: "usage_idle" < 70)
            // Whenever we get an alert, write it to a file
            .log('/tmp/alerts.log')
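For readers who do not know TICKscript, the logic of the script above can be sketched in plain Python. This only mirrors the filtering and threshold check; it is not Kapacitor's API, and the dictionary layout of a point and the name `critical_alerts` are assumptions for this illustration.

```python
def critical_alerts(points, threshold=70):
    """Yield every cpu point whose usage_idle field is below the threshold."""
    for point in points:
        idle = point["fields"].get("usage_idle")
        if point["measurement"] == "cpu" and idle is not None and idle < threshold:
            yield point


stream = [
    {"measurement": "cpu", "fields": {"usage_idle": 93.1}},
    {"measurement": "mem", "fields": {"used_percent": 41.0}},
    {"measurement": "cpu", "fields": {"usage_idle": 55.4}},
]

# Only the second cpu point is below the threshold of 70.
alerts = list(critical_alerts(stream))
print(alerts)
```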


Look Deeper into InfluxDB

Because InfluxDB is designed as a native time series database, it focuses mainly on quickly storing large amounts of incoming data and returning query results on these datasets rapidly.

Because it focuses on the Create and Read parts of the CRUD acronym and, to a lesser extent, on Update and Delete, InfluxDB is not really a CRUD system; it is more CR(ud) in essence.

The system is, however, ideal for monitoring metrics, IoT (Internet of Things) sensor data and real-time analysis, all of which produce data in abundance in day-to-day tasks, as highlighted in the introduction.

Features of InfluxDB as described by InfluxData are:

  • High performance, which allows for high write and read speeds along with native data compression

  • Thanks to being written in Go, the code has no external dependencies and compiles into a single binary

  • Clustering is built in, so no third-party software for distributed computing is needed to provide a high level of reliability and availability of the data

To achieve high speeds for both writing and retrieving data, InfluxData recommends SSDs over conventional mechanical hard disks for the machines that run InfluxDB.



InfluxQL

For people like me, who have worked with relational databases and SQL, getting started with InfluxDB will feel very familiar, because its query language is based on SQL syntax. It is called the Influx Query Language, or InfluxQL for short, and it lets SQL users put their knowledge to work on InfluxDB. Well-known clauses such as SELECT, WHERE and GROUP BY are present, as are frequently used functions such as MIN, MAX, MODE, MEDIAN and PERCENTILE, which make it possible to run routine analytical computations on the collected datasets. The MERGE and JOIN clauses of the early 0.8 releases are not part of InfluxQL from version 0.9 onwards.

As opposed to SQL databases, InfluxDB has no pre-defined schema, which makes it easy to add data in deviating formats. Note that since we are working with time series data, the primary key in InfluxDB is always time; the server assigns the timestamp when the client does not supply one.
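To give an impression of what InfluxQL looks like, the sketch below builds a typical query as a string in Python. The helper `mean_query` and the names `cpu`, `usage_idle` and `host` are assumptions for this illustration, and the helper does no escaping, so it should only be fed trusted input.

```python
def mean_query(measurement, field, window, lookback, tag=None):
    """Build an InfluxQL query that averages a field per time window."""
    group_by = f"time({window})"
    if tag:
        group_by += f', "{tag}"'  # optionally split the result per tag value
    return (
        f'SELECT MEAN("{field}") FROM "{measurement}" '
        f"WHERE time > now() - {lookback} GROUP BY {group_by}"
    )


query = mean_query("cpu", "usage_idle", window="10m", lookback="1h", tag="host")
print(query)
# SELECT MEAN("usage_idle") FROM "cpu" WHERE time > now() - 1h GROUP BY time(10m), "host"
```

Apart from the time-based grouping in `GROUP BY time(10m)`, the statement reads like ordinary SQL.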



Concepts

In the following part I will highlight some of the basic concepts of InfluxDB. I will often compare them with traditional relational database management systems, because my experience with RDBMSs helped me understand these concepts. I hope this will help others as well.


Timestamp

It should be clear by now that InfluxDB is designed for time series data, so it will not come as a surprise that time is literally of the essence. Time is stored as timestamps in what can be compared to a column, conveniently called time.

The time is stored in the RFC 3339 UTC format, which is YYYY-MM-DDThh:mm:ssZ.
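The format is easy to produce and parse with the Python standard library. The sketch below converts between Unix epoch seconds and the whole-second RFC 3339 form shown above; the helper names are made up for this illustration, and fractional seconds are left out for brevity.

```python
from datetime import datetime, timezone

RFC3339_SECONDS = "%Y-%m-%dT%H:%M:%SZ"


def to_rfc3339(epoch_seconds):
    """Format Unix epoch seconds as an RFC 3339 UTC timestamp."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime(RFC3339_SECONDS)


def from_rfc3339(text):
    """Parse an RFC 3339 UTC timestamp with whole seconds into an aware datetime."""
    return datetime.strptime(text, RFC3339_SECONDS).replace(tzinfo=timezone.utc)


print(to_rfc3339(1439856000))
# 2015-08-18T00:00:00Z
print(int(from_rfc3339("2015-08-18T00:00:00Z").timestamp()))
# 1439856000
```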


Fields

Field Keys, Field Values and Field Sets

Next there is a group called the fields. The first part, the field keys, can be compared to the column names in an RDBMS. They are strings and hold metadata: the information about what is measured.

Field values, in turn, hold the actual measured values and can be of several types (string, float, integer or boolean). Since this is a time series database, a field value is always associated with a timestamp.

It is worth noting that you cannot write a new point to InfluxDB without at least one field, and that fields are not indexed, so queries that filter on field values have to scan the data and are noticeably slower.

The collection of field key-value pairs on a point is called a field set. For the example data the eight field sets are:


Tags

Tag Keys, Tag Values and Tag Sets

Tags are set up similarly to fields, with the difference that both tag keys and tag values are strings and hold metadata. Tags are therefore used to add extra information about the measurements.

Although tags are optional, it is very useful to add them to the dataset because, contrary to fields, tags are indexed, which makes them an excellent choice for filtering the data.
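The practical difference between indexed tags and unindexed fields can be shown with a toy in-memory store. This is only a sketch of the principle and not InfluxDB's actual storage engine; the class `TinyStore` and its methods are invented for this illustration.

```python
from collections import defaultdict


class TinyStore:
    """Toy store that indexes tags but not fields."""

    def __init__(self):
        self.points = []
        self.tag_index = defaultdict(set)  # (tag_key, tag_value) -> point positions

    def write(self, tags, fields):
        position = len(self.points)
        self.points.append({"tags": tags, "fields": fields})
        for pair in tags.items():
            self.tag_index[pair].add(position)

    def by_tag(self, key, value):
        """Indexed lookup: only the matching points are touched."""
        return [self.points[i] for i in sorted(self.tag_index.get((key, value), ()))]

    def by_field(self, key, predicate):
        """Unindexed lookup: every stored point has to be inspected."""
        return [p for p in self.points if key in p["fields"] and predicate(p["fields"][key])]


store = TinyStore()
store.write({"host": "server01"}, {"usage_idle": 93.1})
store.write({"host": "server02"}, {"usage_idle": 55.4})
store.write({"host": "server01"}, {"usage_idle": 48.0})

print(store.by_tag("host", "server01"))
print(store.by_field("usage_idle", lambda v: v < 70))
```

The tag lookup goes straight to the matching positions, while the field filter walks the full list, which is why values you filter on often are better stored as tags.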

As with fields, the collection of tag key-value pairs is called a tag set. The tag sets for the example data are:


Measurement

A measurement acts as a container for the timestamps, fields and tags, and gives the user a way to describe the data it holds. Conceptually it can be compared to a table name in a relational database, and since it is a description of sorts, its type is string.


Retention policy

With a retention policy the user defines how long datapoints are kept in InfluxDB, which is called the DURATION, and how many copies of the data are stored in the cluster, which is called the REPLICATION.
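In InfluxQL these two settings appear as keywords in the `CREATE RETENTION POLICY` statement. The sketch below builds such a statement as a string; the helper function and the names `one_week` and `telemetry` are assumptions for this illustration, and no escaping is done.

```python
def create_retention_policy(name, database, duration, replication=1, default=False):
    """Build an InfluxQL CREATE RETENTION POLICY statement."""
    statement = (
        f'CREATE RETENTION POLICY "{name}" ON "{database}" '
        f"DURATION {duration} REPLICATION {replication}"
    )
    if default:
        statement += " DEFAULT"  # make this the policy used when none is specified
    return statement


print(create_retention_policy("one_week", "telemetry", "7d", default=True))
# CREATE RETENTION POLICY "one_week" ON "telemetry" DURATION 7d REPLICATION 1 DEFAULT
```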


Series

Now that we know about tag sets, measurements and retention policies, we can talk about series. A series is a collection of datapoints that share the same:

  • retention policy
  • measurement
  • tag set
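The three properties above together act as a grouping key. The following sketch groups points into series on that key; the dictionary layout of a point, the retention policy name `autogen` and the function names are assumptions for this illustration.

```python
from collections import defaultdict


def series_key(point):
    """A series is identified by retention policy, measurement and tag set."""
    return (
        point["retention_policy"],
        point["measurement"],
        tuple(sorted(point["tags"].items())),  # sorted so tag order does not matter
    )


def group_into_series(points):
    series = defaultdict(list)
    for point in points:
        series[series_key(point)].append(point)
    return dict(series)


points = [
    {"retention_policy": "autogen", "measurement": "cpu", "tags": {"host": "server01"}, "time": 0},
    {"retention_policy": "autogen", "measurement": "cpu", "tags": {"host": "server02"}, "time": 0},
    {"retention_policy": "autogen", "measurement": "cpu", "tags": {"host": "server01"}, "time": 10},
]

grouped = group_into_series(points)
print(len(grouped))  # two series: one per host
```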

Each series gets an arbitrary series name. Here are the four series of the example data:


Point

The basis of our data sets is the point: the field set, together with any tags, in the same series with the same timestamp. In RDBMS terms, a point can be compared to a single row of data.
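A point is usually sent to InfluxDB as one line of text in the line protocol: the measurement, then the tags, then the fields, then the timestamp. The sketch below formats a point that way. It is simplified and skips the escaping of spaces, commas and backslashes that a real client needs; the function names and sample values are assumptions for this illustration.

```python
def format_field(value):
    """Render a field value in line protocol notation."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    return '"' + str(value).replace('"', '\\"') + '"'  # strings are double-quoted


def to_line(measurement, tags, fields, timestamp_ns):
    """Format one point as: measurement,tags fields timestamp."""
    if not fields:
        raise ValueError("a point needs at least one field")
    tag_part = "".join(f",{key}={value}" for key, value in sorted(tags.items()))
    field_part = ",".join(f"{key}={format_field(value)}" for key, value in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


line = to_line(
    "cpu",
    {"region": "eu", "host": "server01"},
    {"usage_idle": 93.1, "cores": 4},
    1439856000000000000,
)
print(line)
# cpu,host=server01,region=eu usage_idle=93.1,cores=4i 1439856000000000000
```

The `ValueError` reflects the rule from the Fields section that a point cannot be written without at least one field, while the tag part may be empty.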


Database

The database is the main container that holds all information about users, queries and of course the time series data itself. As mentioned earlier, a database in InfluxDB is schemaless, which makes it easy to add new measurements with different fields and tags.