I recently completed Udacity’s free course UD120, Introduction to Machine Learning. This post summarizes what the course has to offer.
Here is the link to my GitHub repo, which includes all the code files for the mini projects done in the course.
The course started with an introduction to machine learning and its two main families of algorithms, supervised and unsupervised. It discussed the basic differences between them and where each might be applicable.
The first algorithm covered was the basic Naive Bayes classifier, with a mini project on classifying an email according to its author.
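As a rough illustration of what fitting a Naive Bayes classifier looks like in scikit-learn, here is a minimal sketch. The feature values are made up and simply stand in for two authors; the real mini project works on features extracted from email text, which this toy example does not reproduce.

```python
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy features: two well-separated groups standing in for two authors.
features_train = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8],
                  [5.0, 5.2], [4.8, 5.1], [5.2, 4.9]]
labels_train = [0, 0, 0, 1, 1, 1]

clf = GaussianNB()
clf.fit(features_train, labels_train)

# Predict the "author" of two unseen points, one near each group.
predictions = clf.predict([[1.0, 1.0], [5.0, 5.0]])
print(predictions)
```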
In the third lesson, the SVM algorithm was introduced. The email author mini project from the Naive Bayes lesson was implemented again using an SVM, and the two algorithms were compared on accuracy, training time, and prediction time.
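A comparison like this can be sketched with a small helper that times the fit and predict steps separately. The helper name `evaluate` and the toy data are my own assumptions for illustration, not code from the course, and timings on such a tiny dataset say nothing about real performance.

```python
from time import perf_counter

from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC


def evaluate(clf, X_train, y_train, X_test, y_test):
    """Fit clf, then report accuracy plus wall-clock training and prediction time."""
    start = perf_counter()
    clf.fit(X_train, y_train)
    train_seconds = perf_counter() - start

    start = perf_counter()
    predictions = clf.predict(X_test)
    predict_seconds = perf_counter() - start

    return {
        "accuracy": accuracy_score(y_test, predictions),
        "train_seconds": train_seconds,
        "predict_seconds": predict_seconds,
    }


# Hypothetical, well-separated toy data (not the email dataset).
X_train = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8],
           [5.0, 5.2], [4.8, 5.1], [5.2, 4.9]]
y_train = [0, 0, 0, 1, 1, 1]
X_test = [[0.8, 1.1], [5.1, 4.9]]
y_test = [0, 1]

results = {
    "naive_bayes": evaluate(GaussianNB(), X_train, y_train, X_test, y_test),
    "svm": evaluate(SVC(kernel="linear"), X_train, y_train, X_test, y_test),
}
for name, report in results.items():
    print(name, report)
```

The same helper works for any classifier with `fit` and `predict`, so other algorithms can be added to the comparison without changing it.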
In the fourth lesson, the Decision Tree algorithm for classification was introduced. A similar comparison of accuracy, training time, and prediction time was done on the email author classification mini project.
This part of the course elaborated on the importance of datasets in machine learning. The Enron Corpus dataset was introduced, along with a discussion of what questions one should ask of a dataset to get a feel for it.
The Linear Regression algorithm was introduced, along with a mini project on predicting Enron employees’ bonuses from their salaries. Metrics for evaluating a regression, such as SSE and R squared, were discussed.
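To make the two metrics concrete, here is a small sketch in plain Python that fits a line by least squares and computes SSE and R squared by hand. The salary and bonus numbers are invented, and the function names are my own; in scikit-learn, `LinearRegression` does the fitting and its `score` method reports R squared.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sse(ys, predictions):
    """Sum of squared errors between actual and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(ys, predictions))


def r_squared(ys, predictions):
    """1 - SSE / SST, where SST is the squared spread around the mean."""
    mean_y = sum(ys) / len(ys)
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse(ys, predictions) / sst


# Hypothetical salaries and bonuses in arbitrary units.
salaries = [1.0, 2.0, 3.0, 4.0]
bonuses = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(salaries, bonuses)
predictions = [slope * x + intercept for x in salaries]
print(slope, intercept, sse(bonuses, predictions), r_squared(bonuses, predictions))
```

SSE grows with the number of points, so it is hard to compare across datasets of different sizes, while R squared is normalized by the spread of the target.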
Outliers and their significance were introduced, along with methods to clean them. The mini project involved identifying outliers in the Enron employees’ salary and bonus data.
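This summary does not spell out the cleaning method, so the sketch below assumes one simple approach: fit a regression, then drop the fraction of points with the largest squared residuals and refit on what remains. The function name, the 10% fraction, and the data are all hypothetical.

```python
def drop_largest_residuals(xs, ys, predictions, fraction=0.1):
    """Return (x, y) pairs after removing the given fraction with the largest squared error."""
    rows = sorted(zip(xs, ys, predictions), key=lambda row: (row[1] - row[2]) ** 2)
    keep = len(rows) - int(len(rows) * fraction)
    return [(x, y) for x, y, _ in rows[:keep]]


# Hypothetical data that follows y = 2x + 1, except for one corrupted point at x = 5.
xs = [float(x) for x in range(1, 11)]
ys = [2 * x + 1 for x in xs]
ys[4] = 100.0

# Predictions from an assumed, already-fitted line y = 2x + 1.
predictions = [2 * x + 1 for x in xs]

cleaned = drop_largest_residuals(xs, ys, predictions)
print(len(cleaned), cleaned)
```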
Unsupervised learning was introduced, along with algorithms like k-means clustering. In the mini project for this lesson, clustering was applied to the Enron employees’ salary data.
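A minimal k-means sketch with scikit-learn follows. The points are made up and the choice of two clusters is an assumption for the example; with real data, the number of clusters is something you have to decide yourself.

```python
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two obvious groups.
points = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
          [8.0, 8.1], [7.9, 8.3], [8.2, 7.8]]

# random_state is fixed so repeated runs give the same assignment.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(labels)
print(kmeans.cluster_centers_)
```

The cluster numbers themselves are arbitrary: which group gets label 0 and which gets label 1 carries no meaning, only which points share a label.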
This lesson covered identifying the most important features of your data, using human intuition as well as algorithms.
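As one example of the algorithmic side, scikit-learn’s `SelectKBest` scores each feature against the labels and keeps the top k. This is only a sketch under my own assumptions: the data is invented so that the first feature separates the classes and the second does not, and the summary above does not say which selectors the lesson used.

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical data: column 0 separates the classes, column 1 is uninformative.
X = [[1.0, 5.0], [2.0, 3.0], [1.5, 4.0],
     [8.0, 4.0], [9.0, 5.0], [8.5, 3.0]]
y = [0, 0, 0, 1, 1, 1]

selector = SelectKBest(score_func=f_classif, k=1)
selector.fit(X, y)

# Boolean mask over the columns: True means the feature was kept.
kept = selector.get_support().tolist()
print(kept)
```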
This lesson covered how to preprocess data with feature scaling to improve your algorithms. The MinMaxScaler in sklearn was introduced.
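Min-max scaling maps each feature to the range 0 to 1 using x' = (x - min) / (max - min). A short sketch with `MinMaxScaler`, using made-up values:

```python
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single feature; the scaler expects one row per sample.
weights = [[115.0], [140.0], [175.0]]

scaler = MinMaxScaler()
scaled = scaler.fit_transform(weights)

# The smallest value maps to 0, the largest to 1, and 140 to (140 - 115) / (175 - 115).
print(scaled)
```

Because the minimum and maximum set the scale, a single extreme outlier can squash all the other values into a narrow band.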
This lesson covered using text data in machine learning. Concepts like Bag of Words, stemming, and the TfidfVectorizer were introduced. In the mini project, text learning algorithms were applied to the employee emails in the Enron Corpus.
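Here is a minimal sketch of turning text into TF-IDF features with scikit-learn. The two sentences are invented stand-ins for emails, and stemming is left out because it needs a separate library.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for email bodies.
emails = [
    "Please review the quarterly report before the meeting",
    "The meeting is moved to Friday",
]

# stop_words="english" drops very common words such as "the" and "is".
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(emails)

# vocabulary_ maps each remaining (lowercased) word to its column index.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(tfidf.shape)
```

Each row of `tfidf` is one document and each column one word, so the result can be fed to the same classifiers used on numeric features.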
PCA was introduced, and its use in feature selection and in unsupervised learning was discussed. In the mini project, PCA was used on images of the past 10 presidents of the USA to classify a new image.
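To show the basic mechanics on something much smaller than images, the sketch below projects made-up 2-D points that lie close to a line onto a single principal component. The data and the choice of one component are assumptions for illustration only.

```python
from sklearn.decomposition import PCA

# Hypothetical points lying roughly along the line y = 2x.
points = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0], [5.0, 10.1]]

pca = PCA(n_components=1)
projected = pca.fit_transform(points)

# Fraction of the total variance captured by the single component kept.
print(pca.explained_variance_ratio_)
print(projected.shape)
```

When one component captures nearly all the variance, as here, the second dimension can be dropped with little loss of information.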
This lesson covered validating a machine learning algorithm: splitting your dataset into training and testing parts using sklearn, the cross-validation technique, and Grid Search Cross Validation for parameter tuning.
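These pieces fit together roughly as in the sketch below, which uses the iris dataset bundled with scikit-learn rather than the course data. The parameter grid, the 30% test split, and the choice of an SVM are my own assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold out 30% of the data; it is only used for the final score.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Try every combination in the grid, scoring each with 3-fold cross-validation
# on the training portion only.
param_grid = {"kernel": ["linear", "rbf"], "C": [1, 10]}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

Keeping the test set out of the grid search matters: if the parameters were tuned on it, the final accuracy would be an optimistic estimate.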
Metrics for evaluating a machine learning algorithm, such as Accuracy, Precision, Recall, and F1 Score, were introduced.
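A small worked example may help tell these apart. The labels below are invented: out of eight samples there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical true labels and predictions for a binary problem (1 = positive).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)    # correct / total = 6 / 8
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

Accuracy alone can be misleading when one class is rare, which is why precision and recall are reported alongside it.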
Udacity UD120 is a really good course. It gives a high-level introduction to many concepts in machine learning, with a lot of hands-on practice through mini projects in each lesson that use scikit-learn in Python. As a beginner in Machine Learning, I found it really helpful. The only problem is that a lot of the sklearn functions used in the course have been deprecated in sklearn’s latest version, so you will have to research the alternatives.
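One example of such a change that I am sure of: the old `sklearn.cross_validation` and `sklearn.grid_search` modules were removed, and their contents now live in `sklearn.model_selection`. A sketch of an import that works either way is below; the fallback branch only matters on very old installs, and other deprecated functions will need their own lookup in the scikit-learn docs.

```python
try:
    # Current location in scikit-learn.
    from sklearn.model_selection import GridSearchCV, train_test_split
except ImportError:
    # Module paths used by very old scikit-learn releases.
    from sklearn.cross_validation import train_test_split
    from sklearn.grid_search import GridSearchCV

# Quick check that the imported function behaves as expected on toy data.
train, test = train_test_split(list(range(10)), test_size=0.2, random_state=0)
print(len(train), len(test))
```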