This blog is mainly based on the book and lecture notes by Professor Yaser S. Abu-Mostafa of Caltech on Learning from Data; you can benefit a lot from the lectures and videos.

“In God we trust; all others bring data.”
If you show a picture to a three-year-old and ask if there is a tree in it, you will likely get the correct answer. But if you ask a thirty-year-old what the definition of a tree is, you will likely get an inconclusive answer.
We didn’t learn what a tree is by studying the mathematical definition of a tree. We learned it by looking at trees. In other words, we learn from ‘Data’.
Learning from data is used in situations where we don’t have an analytic solution, but we do have data that we can use to construct an empirical solution. This premise covers a lot of territory, and indeed learning from data is one of the most widely used techniques in science, engineering, and economics.

Let us draw a figure to show the basic setup of the learning problem:
OK, if we have the data and fully understand the problem, which tool should we use?
Here are some tools which I strongly suggest:
- Matlab

If you are a student or academic at a university, you can probably use Matlab in the lab for free and install the academic version on your own computer. The current release, R2016a, comes with more exciting features, including most of the common machine learning methods.

Even if you are not an experienced programmer, you can learn Matlab from its detailed demos in a few hours.
- Python

If you do not have access to a free Matlab license and do not intend to spend money on one, Python is the best choice for you. Scikit-Learn provides simple and efficient tools for data mining and data analysis, but you will have to spend some time learning Python.
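To give a feel for how little code Scikit-Learn requires, here is a minimal sketch that trains a classifier on one of its built-in datasets. The dataset (iris), the model (logistic regression), and the 70/30 split are illustrative choices, not something prescribed by this post:

```python
# Minimal Scikit-Learn sketch: load data, split, fit, evaluate.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000)  # max_iter raised so the solver converges
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```

Note that the model only ever sees `X_train`/`y_train`; the held-out test set is touched once, at evaluation time — a habit that matters for the principles discussed below.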
- R

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows, and macOS. You can find a lot of useful packages for your particular needs; the price is that you need to spend more time learning R programming. Personally, I prefer Python to R.

I will talk about the details of how to use these tools on this blog in the future. Now, let us imagine we have already mastered these tools and we have the data. Should we begin? No, let us first spend some time on the principles of learning.
- Occam’s Razor
Although it is not an exact quote of Einstein’s, he is often credited with saying, “An explanation of the data should be made as simple as possible, but no simpler.” A similar principle, Occam’s razor, dates from the 14th century and is attributed to William of Occam; the ‘razor’ is meant to trim the explanation down to the bare minimum that is consistent with the data.
The simplest model that fits the data is also the most plausible.
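One way to see this is to fit the same data with a simple model and a needlessly complex one. The sketch below (my own illustration, not from the book) fits a truly linear target plus noise with a degree-1 and a degree-10 polynomial: the complex model always matches the training points at least as well, but it chases the noise and generalizes worse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target: y = 2x + 1 plus a little Gaussian noise.
x_train = np.linspace(0, 1, 20)
y_train = 2 * x_train + 1 + rng.normal(scale=0.1, size=x_train.size)
x_test = np.linspace(0.01, 0.99, 50)
y_test = 2 * x_test + 1 + rng.normal(scale=0.1, size=x_test.size)

results = {}
for degree in (1, 10):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

Because the degree-10 model contains the degree-1 model as a special case, its training error can never be higher; the question Occam’s razor answers is which one to trust on new data.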
- Sampling Bias
It is not that uncommon for someone in industry or academia to throw away training examples they do not like!
If the data is sampled in a biased way, learning will produce a similarly biased outcome.
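A tiny simulation makes the point. In this sketch (an illustration I made up, not an example from the book), we try to estimate the average income of a skewed population; a fair random sample lands near the truth, while a sample in which only the lower half of earners respond is badly off, no matter how many respondents we collect:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical skewed population of incomes (log-normal).
population = rng.lognormal(mean=10, sigma=1, size=100_000)

# Unbiased simple random sample.
fair = rng.choice(population, size=1000, replace=False)

# Biased sample: assume only people below the median income respond.
cutoff = np.median(population)
respondents = population[population < cutoff]
biased = rng.choice(respondents, size=1000, replace=False)

print(f"true mean:   {population.mean():,.0f}")
print(f"fair sample: {fair.mean():,.0f}")
print(f"biased sample: {biased.mean():,.0f}")
```

No amount of data fixes this: the biased sample converges, but to the wrong answer.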
- Data Snooping
Data snooping is the most common trap for practitioners in learning from data. The principle involved is simple enough.
If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.
As the saying goes, if you torture the data long enough, it will confess.
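The confession is easy to extract: if you screen enough candidate features against the same data you use to judge them, pure noise will look significant. This sketch (my own illustration, with made-up noise data) finds a “strong” correlation among thousands of random features, which then evaporates on fresh data that played no part in the search:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5000
X = rng.normal(size=(n, d))   # pure-noise "features"
y = rng.normal(size=n)        # target, independent of every feature

# Correlation of each feature with the target, computed on the full dataset.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
yc = (y - y.mean()) / y.std()
corr = Xc.T @ yc / n

best = int(np.argmax(np.abs(corr)))
print(f"best snooped correlation: {abs(corr[best]):.3f}")

# Re-measure the chosen feature on fresh data it never influenced.
X_fresh = rng.normal(size=(n, d))
y_fresh = rng.normal(size=n)
fresh_corr = np.corrcoef(X_fresh[:, best], y_fresh)[0, 1]
print(f"same feature on fresh data: {abs(fresh_corr):.3f}")
```

The cure is the one stated above: any data that influenced a choice — feature selection, normalization, model picking — can no longer give an honest assessment of the outcome.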