Building a BQML Linear Model with Looker Using Public Data
Author: Christelle Xu
In my last article I introduced you to BQML and its implementation and integration with Looker. My goal this time is to walk you through a specific example to help solidify those concepts with a tangible use case.
I’ll be basing my walkthrough on Google’s tutorial “Using BigQuery ML to Predict Birth Weight” which you can find here. Much of the BQML syntax will stay the same given that integrating this model in Looker simply involves building a Looker wrapper around BQML code. I’ll be building dimensions from the results to illustrate the visualisation capabilities available once BQML is integrated in Looker. Much of the technical justifications of the integration I covered in my previous blog so please do use it as a reference whilst implementing the model below.
By the end of this tutorial you’ll be able to build a dashboard like this to help facilitate the evaluation of your model:
Key evaluation metrics all in one place
We’ll be using the well known natality data set to create a model that predicts a baby’s birth weight. The data is publicly available through BigQuery so please follow along.
If you haven’t read part one of this series, I urge you to at least have it open. You can find it here. I introduce you to how BQML integrates with Looker and walk you through the logic behind the syntax.
I also assume you have basic Looker LookML experience and BigQuery knowledge.
Accessing The Data
When running a BQML model, you need to make sure that both your data and your model are sitting in the same location. In this case, however, Looker will be writing the model in its scratch schema, the process by which you’ll define when creating the necessary persistent derived table (PDT).
Given that the natality data set is sitting in the US, you’ll need to make sure that the connection you create in Looker also in the US. That may mean creating a new connection where your host connection is US based.
By default, you’ll have access to public data sets via your service account in BigQuery. You should also, therefore, have access to these data sets in Looker. But you’ll want to make sure that that connection hasn’t been manually turned off or removed.
Once the connection is in place, you can start pulling through the fields that will be used in your model. Best practice is to create an autogenerated file by creating a view from table, selecting ‘natality’ under publicdata:samples. Nest this autogenerated view under the ‘ag’ folder, and rename the view as ag_natalityfor clarity.
This will pull through all fields in your underlying table as dimensions where Looker will infer the underlying data type. Don’t make changes to this file.
Autogenerated files go under the ‘ag’ folder.
An autogenerated view.
Any changes we’ll make in the extended file, which includes the autogenerated file, will overwrite it with changes specified in the extended file. This is important if your underlying schema changes often — you can always regenerate the underlying autogenerated view while keeping your business logic in the extended view.
This brings us to our next step — extending the autogenerated view into an extended view to make the business logic changes. Namely you’ll notice that in the BQML Google tutorial that they cast mother_race() as a string. This is because mother_race() is a categorical variable but is saved as a numeric type. All this requires in Looker is a simple cast.