Introduction to the Machine Learning Process

Chapter 2 of Aurelien Geron’s Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, 2nd Edition, walks through a complete machine learning project: sourcing the data, understanding it, preparing it for the ML algorithms, generating the model, and simple deployment.  You can access this book through the SMU Library website as described at the beginning of the semester.  I believe > this < link should take you to the O’Reilly Learning Platform for Higher Education.

As a homework assignment, I want you to read Chapter 2 of the text and create a Jupyter Notebook that follows along with the project Geron describes.  Throughout the notebook, include markdown cells that note important concepts and ideas and explain the code.  Note: The section titled Get the Data details how to create a workspace with virtualenv.  Don’t do this part.  We are using conda, not virtualenv.  You need to submit this Jupyter Notebook to Canvas by class time on Feb 28.

Important remark: Now, you could be crafty and download a notebook from Geron’s GitHub repo for the book, but that would be pretty obvious.  You could change the variable names, but that would be pretty obvious too.  However, I’m operating under the assumption that you are all of a higher moral character than that.  Please don’t let me down.

Make sure you could answer the following questions if asked:

  1. What is a performance measure?  What is the difference between RMSE and MAE? When is one preferred over the other?
  2. From the housing dataset used in the project, which features/variables are numeric and which are categorical?
  3. What’s the difference between random sampling and stratified sampling?
  4. What is imputation? What is the SimpleImputer from sklearn?
  5. What is the purpose of one-hot encoding? 
  6. Why scale features?
  7. What is cross-validation and when/why would you use it?
  8. Can you combine the entire process into one Pipeline?
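
As a starting point for question 1, here is a minimal sketch comparing RMSE and MAE in plain Python. The helper names rmse and mae and the target and prediction values are made up for illustration; they are not taken from the book or from the housing dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute differences."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


# Hypothetical targets and two hypothetical sets of predictions.
y_true = [100.0, 200.0, 300.0, 400.0]
evenly_off = [110.0, 190.0, 310.0, 390.0]   # every prediction is off by 10
one_big_miss = [100.0, 200.0, 300.0, 440.0]  # one prediction is off by 40

# Both prediction sets have the same total absolute error (40),
# so MAE cannot tell them apart, while RMSE penalizes the single large miss.
print("evenly off:   RMSE =", rmse(y_true, evenly_off), " MAE =", mae(y_true, evenly_off))
print("one big miss: RMSE =", rmse(y_true, one_big_miss), " MAE =", mae(y_true, one_big_miss))
```

Running this in a notebook cell shows the same MAE for both sets of predictions but a larger RMSE for the set with the single large error. Think about what that implies for data with many outliers.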