Machine Learning on Google Cloud using Big Query ML

Train and evaluate a classification model on the MNIST dataset using Big Query ML

Hrishi Shirodkar
10 min readMay 21, 2021
Photo by Robynne Hu on Unsplash

The top cloud computing platforms such as Google, Amazon and Microsoft have made significant investments and are betting big on democratizing Artificial Intelligence and Machine Learning. On top of that, more and more companies are adopting machine learning to increase profitability and better serve their customers through improved business processes.

Companies find it easier to experiment and measure their machine learning capabilities using the ML options provided by the cloud platform as these tools were primarily developed to make AI and ML technologies accessible to everyone. Therefore, companies do not need to wait until a mature team of data scientists with advanced AI skills is setup. They can instead leverage their existing resources to begin experimentation and ML modeling.

What is covered in this series?

In this series of three stories, I will introduce you to three different ways to perform machine learning on GCP. We will train and evaluate a classification model on the MNIST dataset using the following ML options offered by GCP. We will then use our trained model to correctly predict the handwritten digits on the MNIST test dataset.

ML options explored in this series:

Free Trial Credit offered by Google:

Google is giving a free trial credit of $300 to its new customers to fully explore and conduct an assessment of its Google Cloud Platform. This is a great way to get some hands-on experience with the AI products Google has to offer.

Pre-requisities:

  • Basic concepts of Machine Learning:
    This tutorial is more focused towards understanding how to utilize GCP to perform Machine Learning over learning ML concepts.
  • Google Cloud Account:
    Sign up using a new account on Google Cloud if you do not have an account already.
    Once your account is setup, use the navigation menu of the GCP console and go to the Billing page. The billing overview page should display a $300 trial credit applied to your account (I have additional credits as seen in the screenshot, but your account should show $300 when you sign up for the first time). Please confirm you have necessary credits before you proceed with the rest of the tutorial, you will otherwise incur charges.

The overview page under Billing should show the credits that have been applied to your account.

What is the MNIST Dataset and its purpose in this series?

MNIST is a very popular dataset among machine learning enthusiasts when it comes to learning new modeling techniques on real world data with no data preprocessing requirements. The data is well structured to begin with and easy to understand. I personally preferred using the MNIST dataset so that we stay focused on the task at hand to learn the ML options offered by GCP than rather be distracted by the complexity of the dataset itself.

The MNIST database (Modified National Institute of Standards and Technology database[1]) is a large database of handwritten digits that is commonly used for training various image processing systems.

The data files mnist_train.csv and mnist_test.csv contain gray-scale images of hand-drawn digits, from zero through nine.

Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels in total. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.

The training data set, (mnist_train.csv), has 785 columns. The first column, called “label”, is the digit that was drawn by the user. The rest of the columns contain the pixel-values of the associated image.

The test data set, (mnist_test.csv), is the same as the training set, except that it does not contain the “label” column.

Big Query ML:

What is BigQuery? It is an enterprise data warehouse that enables super-fast SQL queries on massive datasets using the processing power of Google’s infrastructure.

What is BigQuery ML? It lets you create and execute machine learning models in BigQuery using standard SQL queries. The primary objective is to democratize machine learning by letting SQL practitioners build models using existing SQL tools and skills. Current models supported by Big Query ML can be found here.

Step 1: Download MNIST train and test datasets.

We will be using only a subset of the original MNIST dataset to allow for faster training times. Go ahead and download the MNIST training (mnist_train.csv) and test (mnist_test.csv) datasets from the ML folder inside the git repository on your local machine. Here is the GitHub link.

Step 2: Create a cloud storage bucket and upload the MNIST datasets.

From the Project Dashboard, copy the Project ID. We will use this Project ID to setup a globally unique cloud storage bucket to upload the MNIST datasets.

From the navigation menu, select Cloud Storage and click on Browser.

Click on Create Bucket.

Use the <ProjectID>-ml as the name of your bucket and the region closest to you. Use default values for storage class, access control and advanced settings.

Click on CREATE.

Click on CREATE.

Once the CS bucket is created, click on Upload Files to choose the mnist_train.csv and mnist_test.csv from your local machine.

Once the files are uploaded, the CS bucket should look as follows:

Step 3: Upload MNIST datasets into Big Query from Cloud Storage.

From the navigation menu, scroll down to select BigQuery. You may choose to pin the product for faster look up in the future.

Select your Project from the left menu and then click on Create Dataset. Dataset is analogous to a database on Big Query.

Give mnist as the dataset ID and location same as that of your cloud storage bucket. Keep other options as default and click on Create Dataset.

After sometime, on the left menu panel you should see the dataset mnist created under your project.

Now, that we have our dataset, its time to create the tables to hold these files. We will be creating two separate tables, one for the training dataset and the other for the test dataset.

Click on Create Table to upload the training dataset from CS.

On the next screen, please choose the below options and then click on create table.

  • Source: Select the source as “Google Cloud Storage” and provide the location of the GCS bucket for the training dataset (use browse to choose the file). File format is csv.
  • Destination: Provide the table name as train.
  • Schema: Check the Auto detect box.

Follow the same procedure to create the test table.

Once the datasets are loaded, the train and test tables should show up under the mnist dataset.

Select the train table and click on Preview. Spend sometime here to familiarize yourself with the data. Every row on this table is an instance from the training dataset with its appropriate label.

You may run the below query in the query editor which provides the number of samples in the training dataset for each label. Simply replace the <project-id> in the from clause with your project # from the dashboard.

The training dataset has a total of 6000 samples. These are more or less evenly distributed based on the label.

We are now ready to run our classification model!

Step 4: Train the model using Logistic Regression.

Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable using a cost function such as Sigmoid.

Generally, logistic regression is utilized in binary classification problems where only two outcomes are possible (eg: Email is spam or not spam).

However, in this story, we will use it to train a multi-class prediction model. There are 10 classes, one class for each digit from 0 to 9.

Again, paste the above query in the editor and click Run to train the model. It will take approx. 5 minutes for BigQuery ML to train your model.

Once the model is trained, it will be added to your mnist dataset as “classification_model”. Click on it to review the model details. The aggregate metrics are available under Evaluation.

Based on the ROC AUC score, the results are quite promising. Your results may not be exactly same as mine but should be close enough.

Scroll down to view the confusion matrix.

Using the confusion matrix, we can observe that the model performs really well on certain labels vs others. It is performing really well on numbers 1 and 6. Predicting the number 5 definitely needs improvement and we can do so by adding more data or experimenting with other model parameters, but it will be a topic of discussion for another day.

ROC curve stands for Receiver Operating Characteristic curve and is commonly used with binary classifiers. ROC plots True Positive Rate against False Positive Rate.

You can compare classifiers by measuring the AUC (area under the curve) of the ROC curve for the two classifiers. A perfect classifier will ROC AUC equal to 1, whereas a purely random classifier will have ROC AUC equal to 0.5.

Confusion Matrix is used to evaluate the performance of a classifier. A perfect classifier will have only TPs and TNs (diagonal values)

Confusion Matrix

Step 5: Time to make some predictions.

We will use our newly trained model to make predictions on the test dataset. Note that our model was never exposed to the test instances during training.

Run the below query in the editor.

Once the query completes, click on Results to review the predictions.

In the screenshot below, the first instance in the test dataset has been predicted correctly as 0. The model has very high confidence in this prediction as suggested by the probability score of 0.998. If you observe closely, you will see that the model also gives us the probability scores for predicting other labels which is fairly low in this particular example.

Step 6: Delete Big Query Dataset and Cloud Storage Bucket.

After your assessment is complete, please go ahead and delete the Big Query dataset and cloud storage bucket to avoid any costs.

On the below screen, select the mnist dataset and then click on Delete Dataset. Confirm that the dataset has been deleted.

Similarly, go into cloud storage console and check the box for the <projectid-ml> bucket and click Delete. Confirm that the bucket does not exist any longer.

Conclusion:

We have only scratched the surface, but you can always experiment with other public datasets Big Query has to offer. Products such as Big Query ML definitely take us one step closer in democratizing Machine Learning for all, I cannot agree more!

Here are links to the other two stories in this series:

Word of Caution:

Please be careful while using the services on Google Cloud to stay within the $300 trial budget. Do not forget to turn off APIs, deleting any un-used instances or clearing off the cloud storage bucket after use so as to avoid any additional costs billed to your account.

--

--

Hrishi Shirodkar

Passionate about building products using ML & Big Data Technologies to solve real world problems! Have Masters in CS [ML specialization] from Georgia Tech Univ.