Long days, no blog post

Categories: Journal, Machine Learning
Tags:
Comments: Comments Off
Published on: August 23, 2018

HAL8999 – [3,4]/100

  • Chapter 2 of Hands-On ML continues
  • Creation of test sets
  • Stratified sampling
  • sklearn’s StratifiedShuffleSplit
  • Visualizing data with matplotlib
  • Correlation coefficients

Getting a good train-test split

Since you can’t train a model and just expect it to work well right out of the box, it’s standard practice to split off about 20% of the data set to test the model against. The naive way to do this is to just grab 20% of the data at random, but that runs into a couple of issues:

  • depending on how you do it, you may grab a different train/test split every time the code runs (fixable by seeding the random number generator, as sketched below)
  • grabbing data points purely at random can let sampling bias creep in if you happen to get an unrepresentative sample
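
A minimal sketch of the fix for the first issue, using sklearn’s train_test_split; the tiny DataFrame here is a hypothetical stand-in for the book’s housing data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in data set; in the book this is the California housing DataFrame.
data = pd.DataFrame({"median_income": [1.2, 3.4, 2.8, 5.1, 0.9, 4.4, 3.0, 2.2]})

# Purely random split: without a fixed seed you get a different test set on
# every run, so over time the model "sees" the whole data set.
train_set, test_set = train_test_split(data, test_size=0.2)

# Fixing random_state makes the split reproducible across runs.
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
print(len(train_set), "train /", len(test_set), "test")
```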

Solution?

Stratified sampling

Rather than just grabbing data points at random, we can ensure a more representative distribution of sampled data points across key attributes (sex, income, ethnic background, etc.), so that random selection doesn’t introduce bias into the training and test sets.

In this example we can be pretty certain that median income correlates strongly with median housing price, so we want a representative distribution of districts with respect to median income. The way to do this is to add a column to the data set that groups median income into categories, then sample based on that category. This improves our chances of getting a representative sampling of the underlying median income attribute.
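
Here’s a minimal sketch of that recipe, assuming the book’s housing.csv is available locally; the income bin edges are one reasonable choice, not necessarily the exact ones from my notes. pd.cut builds the category column, and sklearn’s StratifiedShuffleSplit samples each category proportionally.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Assumes the book's California housing data has been downloaded locally.
housing = pd.read_csv("housing.csv")

# Bucket median income into 5 categories; these bin edges are an assumption.
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# One stratified 80/20 split, sampled proportionally from each income category.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# Drop the helper column so both sets match the original schema.
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
```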

Once the test set has been split off, we work entirely with the training set so as not to introduce bias based on knowledge of the test data.
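
To round out the bullet list above, a short sketch of that exploration step: correlation coefficients via pandas’ corr() and a matplotlib scatter plot. It assumes the strat_train_set produced by the previous sketch.

```python
import matplotlib.pyplot as plt

# Work on a copy of the training set only; the test set stays untouched.
housing = strat_train_set.copy()

# Pearson correlation of every numeric attribute against the target,
# skipping non-numeric columns like ocean_proximity.
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))

# Scatter plot of the strongest relationship: income vs. house value.
housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)
plt.show()
```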

