Warehousing with Google’s Big Query

Data, in the modern world, is decentralized and is being generated and collected at a record pace. To ensure that this data is collected and processed in a manner that enables businesses and organizations to achieve their business goals, specialized and optimized tools are required. ‘The right solution will enable businesses and organizations to store data with resiliency, and swiftly analyze large amounts of data such that it can be used to achieve business outcomes, with decisions powered by data and analytics.

Google’s BigQuery is one such tool. Google’s proprietary BigQuery is a serverless multi-cloud data warehouse that is highly scalable, cost-effective, and specifically designed for offering superior business agility. It democratizes your data-driven insights with built-in machine learning, powered by a flexible and end-to-end multi-cloud analytics solution. In addition to state-of-the-art machine learning, BigQuery also enables lower TCO at scale by almost 26-34% as compared to alternatives. Furthermore, BigQuery adapts to your data with zero operational overhead.

Photo by Taylor Vick on Unsplash

BigQuery’s architecture is developed for Big data. It works optimally when it is being fed with several petabytes of data to be cleaned, processed, and analyzed. BigQuery removed the requirements for humans to make interactive ad-hoc queries to massive data sets in read-only mode.

BigQuery has the following hierarchical structure:

Photo by ETA+ on Unsplash

Projects

From the context of BigQuery, all BigQuery resources are contained within a project. Since there is decoupling between Storage and Compute with BigQuery, the projects that store data and those that query data can be separate.

Datasets

You can utilize datasets to organise BigQuery views A dataset is bound to a location that may be regional (a specific geographical place) or multi-regional (a region that contains two or more geographical places). The location for a dataset can only be defined at the time of its creation.

Tables

BigQuery tables hold your data. Each table is defined by a schema (data types, column/row names, and other information). There are different types of tables, namely, native tables are supported by BigQuery storage, external tables exist on storage external to BigQuery, and views which are virtual tables defined by SQL queries.

Jobs

The actions that BigQuery runs, such as loading, exporting querying, or copying data are referred to as jobs. The location of the job is linked to the location of the dataset for executing the job.

Key Features of BigQuery:

Photo by Adam Śmigielski on Unsplash

Predictive Modelling with BigQuery ML

BigQuery enabled data analysts or data scientists to build machine learning models structured or semi-structured data sets several petabytes big. All this is achieved through simple SQL in minimal time.

BigQuery Omni for Multicloud data analytics

BigQuery Omni allows you to analyze data across the multi-cloud such as Azure and AWS, as a fully managed and end-to-end data analytics solution for the multi-cloud with a focus on saving costs and securing data.

BigQuery BI Engine for Interactive Data Analytics

With its highly optimized, in-memory analysis service, BigQuery BI Engine enables data analysts to obtain actionable insights from massive and complex datasets with a sub-second query response time and high scalability through high concurrency.

BigQuery GIS with Geospatial Analysis

As a unique feature, combine geospatial analysis with BigQuery’s serverless architecture in order to improve and augment your analytics workflows with location-based intelligence. Simplify your analyses and visualize your special data to unlock new potential for your business

Warehousing in BigQuery

If you are interested in a step by step guide, check out this youtube video

https://youtu.be/_Wm_GYO-r_Q

Warehousing with Google's Big Query from Anuj Syal

Loading dataset in BigQuery

Screenshot from Google Cloud

Big query provides multiple options for you to load the data:

Use a pre-existing connector, eg youtube analytics, google analytics
Google Cloud Storage
Big Query Console
Big Query CLI
Using python client libraries

Public Datasets

Google Cloud Public Datasets offer a powerful data repository of more than 200 high-demand public datasets from different industries. It also provides free storage cost and 1TB query cost per month if you intend to use it.

Exploring Ecommerce Public Dataset on Big Query

If you are familiar with simple SQL, Big Query allows you to explore it's biggest datasets for free. So as an example let's check out this Ecommerce dataset publicly available:

About the Dataset The dataset provides Google Analytics 360 data from the Google Merchandise Store , a real ecommerce store that sells Google-branded merchandise, in BigQuery. We will explore the all_sessions table

Query: Total unique visitors

SELECT
COUNT(*) AS product_views,
COUNT(DISTINCT fullVisitorId) AS unique_visitors
FROM `data-to-insights.ecommerce.all_sessions`;

Out:

product_views	unique_visitors
21493109	389934

Query: Total unique visitors by Channel grouping

SELECT
  COUNT(DISTINCT fullVisitorId) AS unique_visitors,
  channelGrouping
FROM `data-to-insights.ecommerce.all_sessions`
GROUP BY channelGrouping
ORDER BY channelGrouping DESC;

Out:

unique_visitors	channelGrouping
38101	Social
57308	Referral
11865	Paid Search
211993	Organic Search
3067	Display
75688	Direct
5966	Affiliates
62	(Other)

Query: Top Five products with the most views

SELECT
  COUNT(*) AS product_views,
  (v2ProductName) AS ProductName
FROM `data-to-insights.ecommerce.all_sessions`
WHERE type = 'PAGE'
GROUP BY v2ProductName
ORDER BY product_views DESC
LIMIT 5;

Out :

product_views	ProductName
316482	Google Men's 100% Cotton Short Sleeve Hero Tee White
221558	22 oz YouTube Bottle Infuser
210700	YouTube Men's Short Sleeve Hero Tee Black
202205	Google Men's 100% Cotton Short Sleeve Hero Tee Black
200789	YouTube Custom Decals

Query: Top Five products with the most unique views

#> You can use the SQL `WITH`
#> clause to help break apart a complex query into multiple steps.
WITH unique_product_views_by_person AS (
-- find each unique product viewed by each visitor
SELECT
 fullVisitorId,
 (v2ProductName) AS ProductName
FROM `data-to-insights.ecommerce.all_sessions`
WHERE type = 'PAGE'
GROUP BY fullVisitorId, v2ProductName )
-- aggregate the top viewed products and sort them
SELECT
  COUNT(*) AS unique_view_count,
  ProductName
FROM unique_product_views_by_person
GROUP BY ProductName
ORDER BY unique_view_count DESC
LIMIT 5

Out:

unique_view_count	ProductName
152358	Google Men's 100% Cotton Short Sleeve Hero Tee White
143770	22 oz YouTube Bottle Infuser
127904	YouTube Men's Short Sleeve Hero Tee Black
122051	YouTube Twill Cap
121288	YouTube Custom Decals

Final Query: Total number of distinct products ordered and the total number of total units ordered

SELECT
  COUNT(*) AS product_views,
  COUNT(productQuantity) AS orders,
  SUM(productQuantity) AS quantity_product_ordered,
  v2ProductName
FROM `data-to-insights.ecommerce.all_sessions`
WHERE type = 'PAGE'
GROUP BY v2ProductName
ORDER BY product_views DESC
LIMIT 5;

Out:

product_views	orders	quantity_product_ordered	v2ProductName
316482	3158	6352	Google Men's 100% Cotton Short Sleeve Hero Tee White
221558	508	4769	22 oz YouTube Bottle Infuser
210700	949	1114	YouTube Men's Short Sleeve Hero Tee Black
202205	2713	8072	Google Men's 100% Cotton Short Sleeve Hero Tee Black
200789	1703	11336	YouTube Custom Decals

This process of doing multiple queries helps us derive insights from data using SQL, and technically this data can be in petabytes and the process will remain somewhat same

Benefits of BigQuery:

Superior Insights with Predictive Analytics

Get updated analytics and information on all your business processes by querying streaming data in real-time. Utilize these insights to make data-driven decisions for your business and effectively predict business outcomes without moving data across.

Share and access analytics and insights securely from within your organization, enabling stakeholders to develop insightful reports and dashboards using BI-tool right out of the box.

Enhanced Security for your Data

Experience enhanced data resiliency, robust security, and reliability control offering 99.99% uptime SLA. Ensuring that your data is protected, secured and unreachable to unauthorized and unauthenticated access.

Provisioning and System Sizing

Unlike many relational database management systems (RDBMS), Google BigQuery dynamically allocates query resources as you consume them and deallocates resources as data is deleted or tables are dropped. Furthermore, allocated resources match the query type and complexity.

Storage Management

BigQuery utilizes a proprietary format called Capacitor. It is columnar in nature and holds many benefits including the fact that it can evolve with the query engine. Access patterns are used to determine the most optimal number of shards of data and how they are to be encoded for storage. BigQuery queries can either be stored on Google's Colossus platform or outside of BigQuery storage in the cloud, on Google Drive.

Maintenance

BigQuery receives constant updates from it’s engineering team. These upgrades cause little to no downtime on BigQuery’s operations. Ensuring optimal performance and minimal downtime as you collect essential insights for your business goals.

Backup and Recovery

Database administrators have always found backup and recovery to be extremely tedious and complex tasks. Costs rise as there is almost always a need for additional licenses and hardware. With BigQuery, backup and recovery is handled at the service level. BigQuery maintains a complete seven-day history of changes against your tables and lets you write specific queries to point-in-time snapshots of your data. If a table is deleted, its history is removed after a period of seven days.

Monitoring and Auditing

Using the BigQuery metric, you can monitor how BigQuery is behaving in the form of various charts and alerts. In order to have a proactive approach towards system health, you can create alerts that will be triggered based on thresholds defined by you. BigQuery also creates various logs, including audit logs of actions made by users.

Conclusion

As is apparent, BigQuery provides powerful enablement for your business and decision-making through optimized data processing, smart data insights, and resiliency in how this data is stored. It is powerful too used to allow your organization to utilize data to its advantage.

Anuj Syal's Blog

Anuj Syal's Blog

Warehousing with Google’s Big Query

BigQuery has the following hierarchical structure:

Projects

Datasets

Tables

Jobs

Key Features of BigQuery:

Predictive Modelling with BigQuery ML

BigQuery Omni for Multicloud data analytics

BigQuery BI Engine for Interactive Data Analytics

BigQuery GIS with Geospatial Analysis

Warehousing in BigQuery

Loading dataset in BigQuery

Public Datasets

Exploring Ecommerce Public Dataset on Big Query

Benefits of BigQuery:

Superior Insights with Predictive Analytics

Enhanced Security for your Data

Provisioning and System Sizing

Storage Management

Maintenance

Backup and Recovery

Monitoring and Auditing

Conclusion

Warehousing with Google’s Big Query

BigQuery has the following hierarchical structure:

Projects

Datasets

Tables

Jobs

Key Features of BigQuery:

Predictive Modelling with BigQuery ML

BigQuery Omni for Multicloud data analytics

BigQuery BI Engine for Interactive Data Analytics

BigQuery GIS with Geospatial Analysis

Warehousing in BigQuery

Loading dataset in BigQuery

Public Datasets

Exploring Ecommerce Public Dataset on Big Query

Benefits of BigQuery:

Superior Insights with Predictive Analytics

Share Insights Seamlessly

Enhanced Security for your Data

Provisioning and System Sizing

Storage Management

Maintenance

Backup and Recovery

Monitoring and Auditing

Conclusion