HAL8999 – [3,4]/100
- Chapter 2 of Hands on ML continues
- Creation of test sets
- Stratified sampling
- sklearn’s StratifiedShuffleSplit
- Visualizing data with matplotlib
- Correlation coefficients
Getting a good train-test split
Since you can’t train a model and just expect it to work well right out of the box, it’s standard practice to split off about 20% of the data set to test the model against. The naive way to do this is to grab 20% of the data at random, but that runs into a couple of issues:
- depending on how you do it, you may grab different train/test sets every time the model runs
- grabbing data points at random lets sampling bias creep in if you happen to get an unrepresentative sample
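The first issue can be sketched with a hand-rolled random split. This is only an illustration using Python’s standard `random` module; `naive_split` and its parameters are hypothetical names made up for this example, not something from the book or from sklearn. Without a seed each run shuffles differently, while a fixed seed makes the split repeatable.

```python
import random

def naive_split(n_rows, test_ratio=0.2, seed=None):
    """Shuffle row indices and cut off the first test_ratio share as the test set."""
    rng = random.Random(seed)          # seed=None -> a different shuffle on every run
    indices = list(range(n_rows))
    rng.shuffle(indices)
    test_size = int(n_rows * test_ratio)
    return indices[test_size:], indices[:test_size]   # (train, test)

# Same seed, same split: the test set is identical across calls.
train_a, test_a = naive_split(1000, seed=42)
train_b, test_b = naive_split(1000, seed=42)
print(test_a == test_b)            # True

# The split sizes follow the 80/20 ratio.
print(len(train_a), len(test_a))   # 800 200
```

Fixing the seed only addresses repeatability. It does nothing about the second issue, an unlucky and unrepresentative draw.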
Solution?
Stratified sampling
Rather than just grabbing data points at random, we can split the data into groups (strata) by some attribute (sex, income, ethnic background, etc.) and sample from each group in proportion to its size. This ensures that random selection hasn’t introduced bias into the training and test sets with respect to that attribute.
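To make the idea concrete, here is a minimal pure-Python sketch of stratified sampling: group the rows by category, then take the same fraction from every group. The function `stratified_split` and the toy labels are assumptions for illustration only; this is not how sklearn implements it internally.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_ratio=0.2, seed=0):
    """Return (train_idx, test_idx), taking test_ratio of each category."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for idx, label in enumerate(labels):
        by_category[label].append(idx)

    train_idx, test_idx = [], []
    for members in by_category.values():
        rng.shuffle(members)                       # random choice within the category
        n_test = round(len(members) * test_ratio)  # same share from every category
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return train_idx, test_idx

# Toy population: 70% of rows in category "a", 30% in category "b".
labels = ["a"] * 70 + ["b"] * 30
train_idx, test_idx = stratified_split(labels)

test_labels = [labels[i] for i in test_idx]
print(test_labels.count("a"), test_labels.count("b"))   # 14 6
```

The test set keeps the 70/30 mix of the full population, which a purely random draw of 20 rows would only match by luck.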
In this example we can be pretty certain that median income correlates strongly with median housing price, so we want a representative distribution of districts with respect to median income. The way to do this is to add a column to the data set that groups median income into categories; we can then sample based on the category. This improves our chance of getting a representative sampling of the underlying median income attribute.
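As a small worked example of the grouping step, the sketch below bins a handful of made-up income values using the same rule as this post’s code: divide by 1.5, round up, and merge everything above category 5 into category 5. The input values are invented purely to exercise the arithmetic, and `np.minimum` is used here as an equivalent way to apply the cap.

```python
import numpy as np

# Made-up median incomes, chosen only to show how the binning behaves.
median_income = np.array([0.5, 1.5, 2.0, 4.6, 8.3, 15.0])

# Divide by 1.5 and round up to limit the number of categories...
income_cat = np.ceil(median_income / 1.5)
# ...then cap the result so every high income lands in category 5.
income_cat = np.minimum(income_cat, 5)

print(income_cat)   # [1. 1. 2. 4. 5. 5.]
```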
Once the test set has been split off, we work entirely with the training set so as not to introduce bias based on knowledge of the test data.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.model_selection import StratifiedShuffleSplit

    # `housing` is assumed to be the DataFrame loaded earlier.
    # Add a column for the median_income stratified sample:
    # scale down, round up, and cap the category at 5.
    housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
    # Assign the result back instead of using inplace=True on a column selection.
    housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)

    housing["income_cat"].hist()
    plt.title("Full Data Set")
    plt.show()

    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1337)
    for train_index, test_index in split.split(housing, housing["income_cat"]):
        stratified_train = housing.loc[train_index]
        stratified_test = housing.loc[test_index]

    plt.title("Training and Test Sets")
    plt.hist([stratified_test["income_cat"], stratified_train["income_cat"]], stacked=True)
    plt.show()

    # Drop the helper column now that the split is done.
    stratified_train = stratified_train.drop("income_cat", axis=1)
    stratified_test = stratified_test.drop("income_cat", axis=1)

    housing = stratified_train.copy()
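One way to check that the stratified split did its job is to compare category proportions in the full data set against the test set. The sketch below does this on synthetic data, since the housing file is not loaded here: the lognormal incomes, the seed values, and the `proportions` helper are all assumptions for illustration. It also draws a purely random test set with `train_test_split` for comparison.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic stand-in for the housing data (an assumption, not the real file).
rng = np.random.default_rng(0)
data = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5000)})
data["income_cat"] = np.minimum(np.ceil(data["median_income"] / 1.5), 5)

# Stratified test set, using the same settings as the split in this post.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1337)
train_index, test_index = next(split.split(data, data["income_cat"]))
strat_test = data.iloc[test_index]   # split() yields positional indices

# Purely random test set for comparison.
_, random_test = train_test_split(data, test_size=0.2, random_state=1337)

def proportions(frame):
    """Share of rows in each income category."""
    return frame["income_cat"].value_counts(normalize=True).sort_index()

comparison = pd.DataFrame({
    "overall": proportions(data),
    "stratified": proportions(strat_test),
    "random": proportions(random_test),
})
comparison["strat_err"] = (comparison["stratified"] - comparison["overall"]).abs()
comparison["random_err"] = (comparison["random"] - comparison["overall"]).abs()
print(comparison.round(4))
```

The `strat_err` column should stay very close to zero for every category, while `random_err` depends on the luck of the draw.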

