Call me Sisyphus

Categories: Journal
Tags: No Tags
Comments: No Comments
Published on: September 22, 2018

I don’t know why I continue to use Windows. Every single god damned thing feels like pushing a giant rock up a hill. I’ll admit that doing most of my work on Linux for the past twenty years inclines me toward the Linux idiom for getting things done, and I’m likely just not thinking like a Windows user, but my god does MSFT make doing anything other than what they want you to do, in the way they want you to do it, a massive pain in the ass.

I started using computers with nothing but a command prompt and while I appreciate some of the amenities of modern computing, I’d still like a marginally functional command prompt with a reasonable set of tools. I swear it takes me ten times as long to get basic work done in Windows as it does in Linux.

I think the only thing that is keeping me from just installing Ubuntu on my desktop is my recollection of how clunky desktop Linux was a decade ago and my continuing vice of playing MMOs.

Also, I’m back from training people in what can only be described as a suburban Hellscape. How do you politely tell someone that you’d rather put a gun in your mouth than live in the same community they’ve chosen to make their home?

Cheating Irises

Categories: Journal, Machine Learning
Comments: Comments Off
Published on: August 28, 2018

HAL8999 7/100

I was sick yesterday but did spend some time looking over some “cheat sheets” that people had put together for various machine learning topics. Some were good, some were just stupid (I’m looking at you, Machine Learning in Emoji). Also went through a very simple classifier based on the iris data set.

Model Selection

Microsoft Azure

Neural networks

Neural Network Graphs

Python 4 Big Data

Python 4 Data Science


Transformers, more than meets the eye

Categories: Journal, Machine Learning
Comments: Comments Off
Published on: August 26, 2018

HAL8999 6/100

  • Watched the “Learn how to Learn” Google talk on YouTube
  • Updated the Jupyter notebooks for handson-ml from GitHub and read through the Ch2 notebook to address the CategoricalEncoder issue from yesterday
  • Looked at a basic transformer

Part of building a data pipeline is likely to include the creation of custom transformer classes to perform operations specific to the project or data source. For example, one of the products I work on stores xml data in a database with the newlines encoded as ‘\n’. When the data is pulled from the database those ‘\n’ sequences are converted to newline characters before the xml is passed to the parser. It’s a very simple operation but without it the data would fail xml validation.

The scikit-learn package provides a structure for building transformers for a data pipeline that is based on duck typing (“looks like a duck, walks like a duck, etc.”) rather than on object inheritance. Essentially, if your class has fit(X) and transform(X) methods, it counts as a transformer.
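As a rough sketch of that idea, here is a duck-typed transformer for the escaped-newline case described above. The class name `EscapedNewlineDecoder` is made up for illustration, and it assumes the stored data contains the literal two-character sequence backslash + n. It deliberately imports nothing from scikit-learn, since the whole point is that the method names are what matter.

```python
class EscapedNewlineDecoder:
    """Hypothetical duck-typed transformer: no scikit-learn base class needed."""

    def fit(self, X, y=None):
        # There is nothing to learn from the data, so just return self
        # so that fit(...).transform(...) can be chained.
        return self

    def transform(self, X):
        # Replace the literal two-character sequence backslash + 'n'
        # with a real newline character in every document.
        return [doc.replace("\\n", "\n") for doc in X]


# Assumed example input: one XML document as it might come out of the database.
raw = ["<root>\\n  <item>1</item>\\n</root>"]
decoded = EscapedNewlineDecoder().fit(raw).transform(raw)
print(decoded[0])
```

Because it exposes fit and transform, an object like this should be usable as a step in a scikit-learn pipeline, though that is not exercised here.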

from handson-ml import BrokeAsFuck

Categories: Journal, Machine Learning
Comments: Comments Off
Published on: August 25, 2018

HAL8999 – 5/100

Today while going back through the Hands On Machine Learning book Ch2 I learned that the CategoricalEncoder referenced in the section on handling categorical attributes still isn’t in scikit-learn. I checked the requirements.txt which shows scikit-learn=0.19.1. Checking my virtualenv, I should be good.

Turns out that the CategoricalEncoder isn’t going to be in scikit-learn until 0.20 so to get it you have to grab 0.20 from GitHub rather than just use pip.

Fucking hell… 

So, if you’re going to write a book, it’s probably a good idea to use the stable branch of your libraries rather than the bleeding edge dev branch.

It will be a good exercise to convert the book’s example code to work with the standard OneHotEncoder but I’ve always been a fan of “just works” as a design principle.

Long days, no blog post

Categories: Journal, Machine Learning
Comments: Comments Off
Published on: August 23, 2018

HAL8999 – [3,4]/100

  • Chapter 2 of Hands on ML continues
  • Creation of test sets
  • Stratified sampling
  • sklearn’s StratifiedShuffleSplit
  • Visualizing data with matplotlib
  • Correlation coefficients

Getting a good train-test split

Since you can’t train a model and just expect it to work well right out of the box, it’s standard practice to split off about 20% of the data set to test the model against. The naive way to do this is to just grab 20% of the data at random, but that runs into a number of issues:

  • depending on how you do it, you may grab different train/test sets every time the model runs
  • grabbing data points at random can let sampling bias creep in if you happen to get an unrepresentative sample
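A fixed random seed takes care of the first issue until the data set changes. One common alternative, sketched here with only the standard library, is to decide membership from a checksum of each row's identifier so a given row always lands on the same side. The function names and the `id` field are assumptions made for this example.

```python
import zlib


def in_test_set(identifier, test_ratio=0.2):
    # crc32 is deterministic across runs and machines, unlike random sampling.
    checksum = zlib.crc32(str(identifier).encode("utf-8")) & 0xFFFFFFFF
    # Rows whose checksum falls in the lowest test_ratio of the range go to the test set.
    return checksum < test_ratio * 2**32


def split_by_id(rows, id_key, test_ratio=0.2):
    train, test = [], []
    for row in rows:
        (test if in_test_set(row[id_key], test_ratio) else train).append(row)
    return train, test


# Made-up rows with a stable identifier.
rows = [{"id": i, "value": i * 10} for i in range(1000)]
train, test = split_by_id(rows, "id")
print(len(train), len(test))
```

Appending new rows later does not move any existing row between the two sets, which is the property a re-run random sample lacks.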


Stratified sampling

Rather than just grabbing data points at random, we can sample so that important attributes (sex, income, ethnic background, etc.) show up in the training and test sets in roughly the same proportions as in the full data set, so that random selection doesn’t introduce bias.

In this example we can be pretty certain that median income correlates strongly with median housing price and we want to be certain we get a representative distribution of districts with respect to median income. The way to do this is to add a column to the data set that groups median income into categories and we can then sample based on the category. This will improve our chance of getting a more representative sampling of the underlying median income attribute.
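The category-then-sample idea can be sketched in plain Python as follows. This is not the book's code (sklearn’s StratifiedShuffleSplit does the real work there); the bin width of 1.5, the cap of 5 categories, and the generated districts are all assumptions for illustration.

```python
import math
import random
from collections import defaultdict


def income_category(median_income, width=1.5, max_cat=5):
    # Bucket a continuous income into a small number of strata (assumed bins).
    return min(max(math.ceil(median_income / width), 1), max_cat)


def stratified_split(rows, key, test_ratio=0.2, seed=42):
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    train, test = [], []
    # Take the same fraction from every stratum instead of from the whole set.
    for members in strata.values():
        members = members[:]
        rng.shuffle(members)
        cut = round(len(members) * test_ratio)
        test.extend(members[:cut])
        train.extend(members[cut:])
    return train, test


# Hypothetical districts with made-up median incomes.
gen = random.Random(0)
districts = [{"median_income": gen.uniform(0.5, 10.0)} for _ in range(1000)]
train, test = stratified_split(
    districts, lambda d: income_category(d["median_income"])
)
print(len(train), len(test))
```

Each income category ends up with close to the same share of the test set as it has of the full data, which is exactly what purely random selection does not guarantee.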

Once the test set has been selected out we work entirely with the training set so as to not introduce bias based on knowledge of the test data.

One of these things is not like the other…

Categories: Journal, Machine Learning
Comments: Comments Off
Published on: August 20, 2018

HAL8999 – 3/100

  • Chapter 2 of Hands on ML
    • Cost functions
    • virtualenv setup
    • code to get the dataset

The chapter follows a rudimentary machine learning project from business case to final product. California census data is analyzed to build a model that predicts the median housing price in a district from other factors, using linear regression with Root Mean Square Error (RMSE) as the performance measure, i.e. the cost function.

\(\displaystyle RMSE(X, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)^{2}}\)

The function h is the “hypothesis” function which operates on the feature vector \(x^{(i)}\). RMSE isn’t the only cost function by any stretch of the imagination but it seems to get a lot of use.
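To make the formula concrete, here is a small worked sketch. The hypothesis function and the four data points are made up purely for illustration; only the RMSE calculation itself follows the definition above.

```python
import math


def rmse(predictions, targets):
    # Square root of the mean of the squared differences h(x_i) - y_i.
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    m = len(targets)
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(predictions, targets)) / m)


def h(x):
    # Hypothetical linear hypothesis with made-up weights.
    return 2.0 * x + 1.0


X = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 9.0, 9.0]  # the third point is off by 2 from h(x)
error = rmse([h(x) for x in X], y)
print(error)  # 1.0
```

Three of the four predictions are exact and one misses by 2, so the mean squared error is 4/4 = 1 and the RMSE is 1.0.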

From this point the author goes through the dev environment setup process I went through a few days ago and it’s pretty clear from the instructions that the work is being done on a Mac. 

The code to download the housing tarball is a little sloppy and would have downloaded it every time I ran the cell so I added a simple test to only do the download if the file didn’t already exist.
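A minimal sketch of that kind of guard, not the exact code from the notebook: the function name, the `downloader` parameter, and the URL in the usage comment are all hypothetical.

```python
import os
import urllib.request


def fetch_if_missing(url, dest_path, downloader=urllib.request.urlretrieve):
    """Download url to dest_path unless the file is already there.

    Returns True if a download happened, False if the existing file was reused.
    """
    if os.path.isfile(dest_path):
        return False
    # Create the target directory on first use.
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    downloader(url, dest_path)
    return True


# Usage (placeholder URL, not the book's):
# fetch_if_missing("https://example.invalid/housing.tgz", "datasets/housing/housing.tgz")
```

Passing the downloader in as a parameter is just a convenience so the guard can be tried out without touching the network.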


Steak and ML

Categories: Journal, Machine Learning
Comments: Comments Off
Published on: August 19, 2018

Achievement: HAL 8999 – 2/100

  • Completed chapter 1 of Hands On ML and worked through the exercises
  • Modified yesterday’s example to also run k-nearest-neighbors with both three and four neighbors. The four-neighbor prediction was further from the linear regression than the three-neighbor one, demonstrating that more is not always better.
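For a feel of how the neighbor count changes a prediction, here is a toy k-nearest-neighbors regression in plain Python. It is a sketch of the technique rather than the scikit-learn code used for the exercise, and the numbers are made up.

```python
def knn_predict(train_x, train_y, query, k):
    # Sort training points by distance to the query, then average
    # the target values of the k closest ones.
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))[:k]
    return sum(y for _, y in nearest) / k


# Made-up data: GDP per capita (thousands) against life satisfaction.
gdp = [10.0, 20.0, 30.0, 40.0, 50.0]
satisfaction = [5.0, 5.5, 6.0, 7.5, 7.0]

for k in (3, 4):
    print(k, knn_predict(gdp, satisfaction, 22.0, k))
```

With these numbers the fourth neighbor is a comparatively distant point with a high value, so adding it drags the prediction from 5.5 up to 6.0.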

Short list, but Ch1 is something of an overview, so a lot of concepts get thrown in without much context or depth. I got to the end and had a hard time connecting what I’d read with the specific questions asked at the end of the chapter. I ended up paging back through the chapter to locate the answers to questions which were oddly specific as opposed to focusing on the broad underlying concepts.

The fact that I did a chunk of the reading while grilling tri-tip and elote, and the rest later in the post-steak-and-Mexican-corn food coma, might also be part of why I ended up paging back through the chapter so much.

Sort yourself out

Categories: Journal, Machine Learning
Comments: Comments Off
Published on: August 18, 2018

Achievement: HAL 8999 – 1/100

  • Set up virtualenv for HAL8999
  • Installed sklearn, pandas, numpy, matplotlib
  • Unable to install tensorflow since I’m on python 3.7 and the pip installs only work for 3.6 and earlier. I can sort that out later.
  • Read up through Example 1-1 in Hands-On Machine Learning
    • Author is a little fast and loose with the example code and imports

I ended up burning close to an hour figuring out why my plot and model didn’t match the author’s even though we were using the same data. It turns out that I’d left out the following line when massaging / mangling the data:

It’s not immediately clear to me why presorting the values would make any difference but the unsorted dataframe included values well outside the range used in the author’s jupyter notebook. My guess is that by using the unsorted data I was applying the wrong GDP values to the wrong countries and so some outlier data made it into the model. Clearly my pandas-fu is weak. Po would be sad.

Also, Visual Studio Code is oddly pickier about the import of sklearn.linear_model and refused to initialize the model unless I specified the whole sklearn.linear_model.LinearRegression() where Jupyter was fine with linear_model.LinearRegression().

HAL 8999

Categories: Journal, Machine Learning
Tags: No Tags
Comments: Comments Off
Published on: August 18, 2018

Some druids apparently wandered by and cast Wall of Thorns in my yard so I spent a good sized chunk of the morning clearing blackberry brambles.  While my hands were busy being impaled by the thorns which unerringly find the weak points in my gloves I got to thinking about some of the things on my internal to do list as well as my achievement list which I’ve been falling down on.

Since I’m already learning machine learning on an informal basis anyway I may as well be like one of the cool kids and do that “100 days of machine learning challenges”. Since “100 days of machine learning” is a relatively uninspiring name and makes things sound like a slog, the HAL 8999 achievement was born. One hour of ML or ML related math per day, every day, for 100 days with an associated blog post to keep the record clear.

Displaying Markdown and Latex in Jupyter output cells

Categories: Journal
Comments: Comments Off
Published on: August 14, 2018

Just out of curiosity I started looking into how to get jupyter to display text and math formatted as markdown and latex in output cells. This got into my head when I was looking at symbolic integration and differentiation of functions and how to format the output in a more civilized manner. I don’t have the full solution to the question but formatting output as markdown turned out to be ridiculously easy since the heavy lifting had already been done.
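One way to get this, not necessarily the approach used here, relies on the rich display hooks IPython looks for on whatever object a cell returns. If the object has a `_repr_markdown_` method, the notebook renders its return value as markdown, including any math between dollar signs. The class name and the example formula are made up; `IPython.display.Markdown` with `display()` is the more explicit route.

```python
class MathResult:
    """Hypothetical wrapper so a notebook renders a result instead of printing raw text."""

    def __init__(self, title, latex):
        self.title = title
        self.latex = latex

    def _repr_markdown_(self):
        # IPython calls this hook, if present, when the object is a cell's result.
        return f"**{self.title}**\n\n$${self.latex}$$"

    def __repr__(self):
        # Plain-text fallback for terminals and logs.
        return f"{self.title}: {self.latex}"


result = MathResult("Power rule", r"\frac{d}{dx} x^n = n x^{n-1}")
result  # as the last expression in a notebook cell, this shows up as formatted markdown
```

Outside a notebook the object simply falls back to its plain `__repr__`, so the same code stays usable in a regular Python session.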

Displaying markdown and Latex in jupyter output cells


  • And of course the new WordPress editor is helpfully mangling both the code highlighting and output. Trust me, it works. I’ll have to sort out how to get “Gutenberg” to do the right thing later.
  • Thrashing around a bit now gets the code highlighted, but now the crayon tag is visible. Still no love for the actual Latex.
  • Rolling back to the old editor and fixing these posts. I’ll deal with the Gutenberg editor issues when I’m dragged kicking and screaming to it.