General Approach to Machine Learning Problems
Define the Problem
- What are you predicting?
- What is the format of your output?
- What data do you have?
- What is the format of your input?
- Is there background knowledge that you need to understand to do this well?
Gather Training Data
Often, the data you have will contain more information than you need. Once you understand what your data contains, you may need to pre-process it to get it into the correct format.
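As a rough illustration of pre-processing, here is a minimal sketch in plain Python. The records, the field names (height_cm, weight_kg, label), and the choice to drop incomplete rows are all assumptions made up for this example; your own data will call for different steps.

```python
# Hypothetical raw records: field names and values are made up for illustration.
raw_records = [
    {"id": "1", "name": "Ana", "height_cm": "170", "weight_kg": "65", "label": "yes"},
    {"id": "2", "name": "Ben", "height_cm": "182", "weight_kg": "", "label": "no"},
    {"id": "3", "name": "Cy", "height_cm": "165", "weight_kg": "58", "label": "no"},
]

FEATURES = ["height_cm", "weight_kg"]  # the only inputs we assume we need


def preprocess(records):
    """Keep the needed fields, convert text to numbers, skip incomplete rows."""
    inputs, outputs = [], []
    for record in records:
        values = [record.get(name, "") for name in FEATURES]
        if any(value == "" for value in values):
            continue  # drop rows with a missing feature (one possible policy)
        inputs.append([float(value) for value in values])
        outputs.append(1 if record["label"] == "yes" else 0)
    return inputs, outputs


X, y = preprocess(raw_records)
print(X)  # [[170.0, 65.0], [165.0, 58.0]]
print(y)  # [1, 0]
```

The extra fields (id, name) are ignored, and the row with a missing weight is skipped. Dropping rows is only one option; filling in a default value is another.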
Decide on a Machine Learning Algorithm
Research existing algorithms to decide which one to use. There are many things to consider:
- How simple is the algorithm to understand?
- How difficult is it to implement?
- Can it handle the amount of data you have?
- Does it do the correct thing (classification, regression, etc.)?
Split into Training and Testing Data
Once you have decided on an algorithm, you will need to develop your model. This means giving your computer data so that it can learn how to deal with a new piece of information. Just like you need background information to make an educated guess, so does a computer! This background information is your dataset.
After you develop your model, you want to evaluate how good it is. How do you do that? One option is to test the model on the same dataset you used to build it. The problem is that a model can sometimes fit its training dataset almost perfectly, so running that same data through the model gives you a near 100% success rate that says little about how it handles new data. When a model matches its training data closely but does poorly on new data, it is called overfitting! This is an issue because datasets are not always representative, and if your model overfits, it will have a hard time classifying a new data point.
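To make overfitting concrete, here is a small sketch of an extreme case: a "model" that only memorizes its training examples. The MemorizingModel class, the accuracy helper, and the data are all hypothetical and invented for this illustration; real algorithms overfit in subtler ways.

```python
from collections import Counter


class MemorizingModel:
    """A deliberately overfit 'model': it memorizes every training example."""

    def fit(self, inputs, labels):
        self.memory = {tuple(x): label for x, label in zip(inputs, labels)}
        # Fallback for inputs it has never seen: the most common training label.
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.memory.get(tuple(x), self.default)


def accuracy(model, inputs, labels):
    """Fraction of examples the model labels correctly."""
    correct = sum(model.predict(x) == label for x, label in zip(inputs, labels))
    return correct / len(labels)


# Made-up data: the label is 1 when the single input value is 5 or more.
train_inputs = [[1], [2], [3], [6], [7]]
train_labels = [0, 0, 0, 1, 1]
new_inputs = [[4], [5], [8], [9]]
new_labels = [0, 1, 1, 1]

model = MemorizingModel().fit(train_inputs, train_labels)
print(accuracy(model, train_inputs, train_labels))  # 1.0
print(accuracy(model, new_inputs, new_labels))      # 0.25
```

The perfect score on the training data looks great, but the model has learned nothing it can apply to inputs it has not seen before.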
To counter this, you can train and test on different data. For example, you build a model using the first 80% of your dataset (the training data), and you "hide" the remaining 20% (the test data) until after your model is built. You then use the test data to decide how good your model is.
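An 80/20 split like the one described above can be sketched in a few lines of plain Python. The function name train_test_split, its parameters, and the example data are assumptions for this illustration. The sketch also shuffles before cutting, which is an extra step added here so that a sorted dataset does not end up with only one kind of example in the test part.

```python
import random


def train_test_split(examples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the examples, then cut it into train and test parts."""
    shuffled = list(examples)              # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Made-up dataset of (input, label) pairs.
examples = [(x, x >= 5) for x in range(10)]
train, test = train_test_split(examples)
print(len(train), len(test))  # 8 2
```

The test part stays hidden while the model is being built and is only used afterwards to judge it.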
This concept is similar to taking an exam. Your teacher doesn't give you practice questions and then test you on those exact same questions, because you could just memorize the answers. Instead, they give you practice questions, and then test you on slightly (or very) different questions to see what you know.
Train
Use your training data to develop a model (run the machine learning algorithm). One example of a model is a decision tree (pretty much a flow chart). In the training step, you are creating the decision tree: the algorithm decides how to build the tree based on your training data.
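As a hedged sketch of what "creating the tree from training data" can look like, here is the smallest possible decision tree: a single yes/no question on one number (sometimes called a decision stump). The function train_stump and the data are hypothetical, and real decision tree algorithms choose their questions with more sophisticated criteria than the simple error count used here.

```python
def train_stump(values, labels):
    """Pick the threshold on one numeric feature with the fewest training mistakes.

    The resulting rule is: predict 1 if value >= threshold, else 0.
    """
    best_threshold, best_errors = None, len(labels) + 1
    for threshold in sorted(set(values)):
        # Count how many training examples this candidate rule gets wrong.
        errors = sum((value >= threshold) != bool(label)
                     for value, label in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


# Made-up training data: hours studied and whether the exam was passed (1) or not (0).
hours = [1, 2, 3, 6, 7, 8]
passed = [0, 0, 0, 1, 1, 1]

threshold = train_stump(hours, passed)
print(threshold)  # 6
```

Training here means trying each candidate question ("is the value at least this threshold?") and keeping the one that best matches the training data.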
Test
Now that you have your model (or in the example above, your decision tree), see how it performs on the test data.
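One simple way to measure performance is accuracy: the fraction of test examples the model gets right. The sketch below assumes a one-question decision tree of the form "predict 1 if the value is at least a threshold"; the threshold of 6, the helper names, and the test data are all made up for illustration.

```python
def predict(value, threshold):
    """One-question decision tree: is the value at least the threshold?"""
    return 1 if value >= threshold else 0


def accuracy(threshold, values, labels):
    """Fraction of examples where the prediction matches the true label."""
    correct = sum(predict(value, threshold) == label
                  for value, label in zip(values, labels))
    return correct / len(labels)


threshold = 6  # assumed to come from the training step

# Made-up test data that the model never saw during training.
test_hours = [2, 4, 5, 9]
test_passed = [0, 0, 1, 1]

print(accuracy(threshold, test_hours, test_passed))  # 0.75
```

Accuracy is only one possible measure; depending on the problem, other measures may tell you more.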