HAL8999 – 3/100
- Chapter 2 of Hands on ML
- Cost functions
- virtualenv setup
- code to get the dataset
The chapter follows a rudimentary machine learning project from business case to final product. California census data is analyzed to build a model that predicts the median housing price in a district from other factors. The model is a linear regression, and a Root Mean Square Error (RMSE) function measures its performance, i.e. RMSE acts as the cost function.
\(\displaystyle \mathrm{RMSE}(X, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)^{2}}\)
The function h is the “hypothesis” function, which operates on the feature vector \(x^{(i)}\) of the i-th instance. \(y^{(i)}\) is the label for that instance and m is the number of instances. RMSE is far from the only cost function, but it seems to get a lot of use.
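To make the formula concrete, here is a minimal sketch of RMSE in plain Python. The function name `rmse` and the sample numbers are made up for illustration; they are not from the book, and the predictions stand in for the values h would return.

```python
import math


def rmse(predictions, labels):
    """Root Mean Square Error between predicted values h(x^(i)) and labels y^(i)."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    m = len(labels)
    if m == 0:
        raise ValueError("need at least one instance")
    # Square each prediction error, average over the m instances, take the root
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / m)


# Hypothetical predictions and labels, invented for this example
predicted = [3.0, 5.0, 2.0]
actual = [1.0, 5.0, 4.0]
print(rmse(predicted, actual))
```

The errors here are 2, 0 and -2, so the result is the square root of 8/3, about 1.63. Because the errors are squared before averaging, one large miss pulls the score up more than several small ones.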
From this point the author goes through the same dev environment setup process I went through a few days ago. It’s pretty clear from the instructions that the work is being done on a Mac.
The code to download the housing tarball is a little sloppy and would have downloaded it every time I ran the cell, so I added a simple test to only do the download if the file didn’t already exist.
    import os
    import tarfile
    import pandas as pd
    from six.moves import urllib

    DL_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
    HOUSING_PATH = os.path.join("datasets", "housing")
    HOUSING_URL = DL_ROOT + "datasets/housing/housing.tgz"

    def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
        """If the housing data isn't on disk, fetch it"""
        if not os.path.isdir(housing_path):
            os.makedirs(housing_path)
        tgz_path = os.path.join(housing_path, "housing.tgz")
        # Only download and unpack when the tarball isn't already there
        if not os.path.isfile(tgz_path):
            print("Downloading housing data…")
            urllib.request.urlretrieve(housing_url, tgz_path)
            housing_tgz = tarfile.open(tgz_path)
            housing_tgz.extractall(path=housing_path)
            housing_tgz.close()

    def load_housing_data(housing_path=HOUSING_PATH):
        """Load the housing data into a dataframe"""
        csv_path = os.path.join(housing_path, "housing.csv")
        return pd.read_csv(csv_path)

    fetch_housing_data()
    housing = load_housing_data()
    housing.head()
Calling `housing.info()` on the dataframe gives:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 20640 entries, 0 to 20639
    Data columns (total 10 columns):
    longitude             20640 non-null float64
    latitude              20640 non-null float64
    housing_median_age    20640 non-null float64
    total_rooms           20640 non-null float64
    total_bedrooms        20433 non-null float64
    population            20640 non-null float64
    households            20640 non-null float64
    median_income         20640 non-null float64
    median_house_value    20640 non-null float64
    ocean_proximity       20640 non-null object
    dtypes: float64(9), object(1)
    memory usage: 1.6+ MB
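The download guard in the fetch function above can also be sketched with only the Python 3 standard library, for anyone who doesn't need the `six` compatibility layer. This is a rough variant under my own assumptions, not the book's code: the name `fetch_if_missing`, its return value and the `filename` default are all hypothetical.

```python
import tarfile
import urllib.request
from pathlib import Path


def fetch_if_missing(url, dest_dir, filename="housing.tgz"):
    """Download and unpack a tarball unless it is already on disk.

    Returns True if a download happened, False if the file was already there.
    (Hypothetical helper; name and return value are illustrative.)
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    tgz_path = dest_dir / filename
    if tgz_path.is_file():
        return False  # tarball already present, skip the download
    urllib.request.urlretrieve(url, str(tgz_path))
    # The with-block closes the archive even if extraction raises
    with tarfile.open(tgz_path) as tgz:
        tgz.extractall(path=dest_dir)
    return True
```

Something like `fetch_if_missing(HOUSING_URL, HOUSING_PATH)` would then stand in for `fetch_housing_data()`. Returning a flag makes it easy to tell in a notebook whether a cell actually hit the network.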