Introduction to Machine Learning

by Iavor Jelev

Artificial intelligence is a hot topic at the moment. Not a week goes by without the media reporting new breakthroughs or areas of application: algorithms beating humans at games, self-driving cars, natural-language communication with virtual assistants. But how does a system become intelligent? Or, to put it another way: how does machine learning (ML) work?

Why machine learning?

Programming a system that makes automated decisions is not a trivial task. Let's take a problem we are all familiar with: detecting spam emails. How would one go about this task intuitively? For example, we can look at spam emails and identify the words or phrases that typically appear only in them. If one of these is found in an incoming email, the email is marked as spam. This is the classic filtering approach.

This is a quick way to get results, but we have to keep programming new rules as new topics appear in spam emails. And if spammers start using new spellings for terms that are already blocked, our filter stops working. So in the long run, this solution is not efficient and requires a lot of manual effort.
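
As a minimal sketch of this filtering approach, the Python snippet below checks an email against a fixed list of phrases. The blocklist and the function name is_spam are made up for illustration. The second call shows how a changed spelling gets past a rule that nobody has updated.

```python
# Hypothetical blocklist; a real filter would need far more rules.
BLOCKED_PHRASES = {"free money", "viagra", "lottery winner"}


def is_spam(email_text: str) -> bool:
    """Mark an email as spam if it contains any blocked phrase."""
    text = email_text.lower()
    return any(phrase in text for phrase in BLOCKED_PHRASES)


print(is_spam("Claim your FREE MONEY today"))  # True: a known phrase matches
print(is_spam("Claim your FR33 M0NEY today"))  # False: the new spelling slips through
print(is_spam("Meeting moved to 3 pm"))        # False: an ordinary email
```

Every new spelling or topic means another manual entry in BLOCKED_PHRASES.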

So we have found a solution, but it fails in the long run because everything the system knows has to come from us. For a truly viable solution, this has to happen automatically. The system should improve on its own, based on the available data. It should "learn".

What is machine learning?

A frequently cited formal definition for machine learning was given by Tom M. Mitchell in 1997:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Let's take a closer look at this definition in relation to our spam email example. We have data that is divided into two categories: spam and ham (unwanted and wanted). The task T is to decide, given a new email, which of the two categories it falls into. Rather than writing the rules for this ourselves, we let the system identify from the given data the words or phrases that help distinguish between the categories. Statistical methods can be used for this, which we will not discuss here. The result is a "model": a set of rules that allows us to carry out T on emails the system has never seen.

If we set aside a subset of the emails from both categories for testing, we can measure P. Since we know what the result should be for these emails, we can automatically check the model's suggestions for correctness and then know whether one model is "better" than another. All we need now is experience E: the already categorized emails that the model is built from. We can also build several models, for example by randomizing which emails are used for training and which for testing, and compare them to optimize P (always choosing the model that made the fewest errors on the test data).
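
The procedure described above can be sketched in a few lines of Python. The text does not name a statistical method, so this sketch assumes a small naive Bayes classifier with add-one smoothing. The sample emails, the function names and the split sizes are hypothetical as well. It is an illustration under these assumptions, not a production filter.

```python
import math
import random
from collections import Counter

# Hypothetical, already categorized emails; real data would be far larger.
EMAILS = [
    ("win free money now", "spam"),
    ("free lottery prize claim now", "spam"),
    ("cheap pills free shipping", "spam"),
    ("claim your free prize money", "spam"),
    ("meeting moved to monday morning", "ham"),
    ("please review the project report", "ham"),
    ("lunch on monday with the team", "ham"),
    ("the report is due next week", "ham"),
]


def train(examples):
    """Build the model: word counts and email counts per category."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    email_counts = Counter()
    for text, label in examples:
        email_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, email_counts, vocabulary


def classify(model, text):
    """Return the more probable category for a single email."""
    word_counts, email_counts, vocabulary = model
    total_emails = sum(email_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        # Add-one smoothing keeps the logarithms finite for unseen cases.
        score = math.log((email_counts[label] + 1) / (total_emails + 2))
        total_words = sum(word_counts[label].values())
        # The extra 1 reserves probability for words never seen in training.
        denominator = total_words + len(vocabulary) + 1
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


def accuracy(model, examples):
    """Performance measure P: fraction of emails classified correctly."""
    correct = sum(classify(model, text) == label for text, label in examples)
    return correct / len(examples)


def best_model(examples, rounds=5, test_size=2):
    """Try several random train/test splits and keep the best-scoring model."""
    best = None
    for seed in range(rounds):
        shuffled = examples[:]
        random.Random(seed).shuffle(shuffled)
        test, training = shuffled[:test_size], shuffled[test_size:]
        model = train(training)
        score = accuracy(model, test)
        if best is None or score > best[1]:
            best = (model, score)
    return best


model, score = best_model(EMAILS)
print(f"accuracy on held-out emails: {score:.2f}")
print(classify(model, "free prize money"))
```

With only two test emails per round, the measured accuracy is very coarse. A real evaluation would hold back considerably more data.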

We have now transformed our manual solution into one that works with machine learning. Our effort as users has been reduced to reporting every new spam email we find in our inbox to the algorithm, so that it can retrain itself.

Supervised vs. unsupervised

In our solution to the spam problem, we made use of a particular category of machine learning: supervised ML. This means that we helped the algorithm, in this case by sorting the data into two predefined result categories and making sure that each email is assigned to the correct one.

Another category, unsupervised machine learning, does not require this preliminary work. One example approach is clustering: recognizing groups and structure in a given set of data.

To illustrate this, let's imagine the customer database of an online store. We have information about the users and their purchases. With clustering, we can identify similarities among users and form groups (customer segments). This helps us understand our customer base better, and it also gives us a tool to optimize product recommendations for our customers. As mentioned earlier, no upfront work is required with this approach. However, we need to check the results manually, since there are no known correct answers to compare them against automatically.
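
To make the idea of clustering more concrete, here is a minimal sketch using k-means, one common clustering algorithm. The text does not prescribe a particular method, so k-means is an assumption here. The customer figures, the two features and the farthest-first choice of starting points are also assumptions, the last one made only to keep the run deterministic.

```python
import math

# Hypothetical customers: (orders per year, average basket value).
CUSTOMERS = [(1, 15), (30, 150), (2, 20), (28, 140), (3, 25), (32, 160)]


def k_means(points, k, max_iterations=100):
    """Group points into k clusters; returns (labels, centroids)."""
    # Farthest-first start: begin with the first point, then repeatedly add
    # the point that lies farthest from all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    labels = None
    for _ in range(max_iterations):
        # Assignment step: every point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        if new_labels == labels:
            break  # nothing changed, so the clustering has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


labels, centroids = k_means(CUSTOMERS, k=2)
print(labels)     # [0, 1, 0, 1, 0, 1]
print(centroids)  # [(2.0, 20.0), (30.0, 150.0)]
```

Note that no customer was labeled in advance; the two segments emerge from the data alone. Whether they are meaningful still has to be judged by a person. In a real application, the features would usually be brought onto a comparable scale first, because the basket value dominates the distance here.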

Solve everything with machine learning?

Machine learning, as great as it is, is not a magic solution to every problem. One important reason: there is no such thing as error-free machine learning. The process of learning is about minimizing errors; trying to eliminate them completely is, in practice, utopian.

So when does such an approach make sense? For example, when the task is too complex, either because the solution itself is too difficult to program by hand (computer vision) or because a simple solution has too many special cases to map manually (as in our spam example).

In practice, machine learning is often used together with big data. There is another reason for this besides the ones mentioned above: for the same algorithm, the results are usually better with more input data. So for smaller datasets, or for problems that can easily be hand-coded without errors, machine learning is not necessarily the best approach.

We have roughly outlined the concept of machine learning and two of its important categories. If you expect ML to work completely autonomously and error-free with arbitrary datasets, you will quickly be disappointed. But if you take the time to preprocess the data and to optimize or moderate the results, and if you accept low error rates, complex problems that are otherwise impossible (or difficult) to solve programmatically can be handled with manageable effort.

Otherwise, to return to the example from above, two to three spam emails a day can be identified manually with considerably less effort.