by Iavor Jelev
In the digital age, not only people but also devices generate more and more data. Whether it is a web server logging accesses or sensors storing measured values at regular intervals, the amount of data can very quickly become huge and hard to survey. At the same time, this increases the potential for gaining useful and important insights: optimizing processes, planning for the future, or identifying failures.
Often, the speed at which these insights are gained plays an important role. Machine learning (ML) helps us cope with the flood of data, and the format and type of the data to be analyzed are critical when choosing which algorithms to use. In this blog post, we will focus on one method: time series analysis. We will look at approaches to, for example, detect anomalies or make predictions about the future. Finally, we will list libraries and programming languages that can be used to implement such solutions.
What is a time series and how do you analyze it?
A time series is a collection of values, each of which refers to a specific timestamp. For example, a sensor that stores a measurement every 5 seconds along with the timestamp generates a time series. An ML algorithm normally treats each entry in its data set the same. Time series, in contrast, impose an explicit order through the time component. Respecting this order is not strictly necessary for the algorithm to run, but ignoring it costs accuracy in the results.
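As a minimal sketch of this idea (the temperature values and the 5-second interval are made up for illustration), such a series can be represented in R with the built-in `ts` class:

```r
# A toy time series, assuming a sensor that reports a temperature
# every 5 seconds. ts() stores only the order and the sampling rate,
# not the wall-clock timestamps, so 12 observations make up one minute.
readings <- c(21.3, 21.4, 21.4, 21.6, 21.9, 22.1)
sensor <- ts(readings, frequency = 12)
print(sensor)
```

The important point is that the values are kept in their temporal order; the object itself only records the sampling rate, not the actual timestamps.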
If you want to analyze a time series, there are several ways to go about it. For example, you can decompose it into different components to understand it better:
- Trend: In which direction does the average develop in the long term?
- Season: Is there a cyclical movement to be observed?
- Irregular component: These are outliers in the data set that may be explained by historical data or may simply represent "noise".
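As a quick illustration of such a decomposition, here is a sketch using `AirPassengers`, a monthly example series that ships with R (rather than our own data):

```r
# Decompose a built-in monthly series into trend, seasonal and
# irregular ("random") components.
parts <- decompose(AirPassengers)

# Plot the observed series together with its three components.
plot(parts)
```

The resulting plot shows the original series plus one panel per component, which often makes trend and season visible at a glance.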
Decomposing a time series is not mandatory if you are only interested in making predictions about future values. However, it can help you understand the data better, and that matters because there are many algorithms to choose from. The blog post introducing Machine Learning already pointed out that Machine Learning is not a magic solution for everything: certain models and algorithms are better suited to a certain type of data than others.
The components into which a time series can be decomposed (where decomposition is possible) are very useful in making this choice. The following is a list of algorithms along with the type of time series they are well suited for:
- Trend is discernible, but no season:
  - Simple Moving Average Smoothing
  - Simple Exponential Smoothing
  - Holt's Exponential Smoothing
- Trend and season are recognizable:
  - Holt-Winters Exponential Smoothing
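To make the names above concrete, here is a sketch using only base R on a made-up series with a clear upward trend and no season: a simple moving average via `filter()`, and Holt's method via `HoltWinters()` with the seasonal component switched off.

```r
# A made-up series: an upward trend plus some random noise.
set.seed(1)
x <- ts(1:40 + rnorm(40))

# Simple moving average smoothing: each point becomes the mean
# of a 5-value window around it (NA at the edges).
sma <- stats::filter(x, rep(1/5, 5))

# Holt's exponential smoothing: level and trend, no seasonal term.
fit <- HoltWinters(x, gamma = FALSE)

# Simple exponential smoothing would additionally set beta = FALSE.
```

Which of these smoothers is appropriate depends exactly on the components identified above: the simpler methods assume less structure in the data.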
Which tools can be used for the analyses?
Without going into too much detail, let's now look at a few examples that illustrate some of the concepts mentioned. For this, we will use a CSV file containing statistics about collected news, namely the feeds that we visualize in our press review use case. The file has the following format:
```
day,docs
2016-08-29,144
2016-08-30,134
2016-08-31,134
2016-09-01,152
2016-09-02,170
…
2017-09-30,48
2017-10-01,50
2017-10-02,94
```
As you can see, each line consists of a date and a number representing the crawled messages for that day. For the visualizations and the analysis we will use the R language. R is open source and very popular for statistical analysis: it ships with a very large collection of already implemented algorithms and has an active community. This allows us to demonstrate some of the methods mentioned with only a few lines of code. In these examples, we are concerned with the algorithms and their application to the series of numbers, so we will focus on the docs column (the temporal order is preserved by sorting by date) and make do with indexes instead of dates on the x-axis.
First, let's take a look at our data. We do that with the following code:
```r
pslog <- read.csv('pslog.csv', header = T)
plot.ts(pslog$docs)
```