Warehousing with Google’s Big Query

Photo by CHUTTERSNAP on Unsplash

Warehousing with Google’s Big Query

·

8 min read

Data, in the modern world, is decentralized and is being generated and collected at a record pace. To ensure that this data is collected and processed in a manner that enables businesses and organizations to achieve their business goals, specialized and optimized tools are required. ‘The right solution will enable businesses and organizations to store data with resiliency, and swiftly analyze large amounts of data such that it can be used to achieve business outcomes, with decisions powered by data and analytics.

Google’s BigQuery is one such tool. Google’s proprietary BigQuery is a serverless multi-cloud data warehouse that is highly scalable, cost-effective, and specifically designed for offering superior business agility. It democratizes your data-driven insights with built-in machine learning, powered by a flexible and end-to-end multi-cloud analytics solution. In addition to state-of-the-art machine learning, BigQuery also enables lower TCO at scale by almost 26-34% as compared to alternatives. Furthermore, BigQuery adapts to your data with zero operational overhead.

taylor-vick-M5tzZtFCOfs-unsplash.jpg

Photo by Taylor Vick on Unsplash

BigQuery’s architecture is developed for Big data. It works optimally when it is being fed with several petabytes of data to be cleaned, processed, and analyzed. BigQuery removed the requirements for humans to make interactive ad-hoc queries to massive data sets in read-only mode.

BigQuery has the following hierarchical structure:

eta-qRmq4tXM9sI-unsplash.jpg

Photo by ETA+ on Unsplash

Projects

From the context of BigQuery, all BigQuery resources are contained within a project. Since there is decoupling between Storage and Compute with BigQuery, the projects that store data and those that query data can be separate.

Datasets

You can utilize datasets to organise BigQuery views A dataset is bound to a location that may be regional (a specific geographical place) or multi-regional (a region that contains two or more geographical places). The location for a dataset can only be defined at the time of its creation.

Tables

BigQuery tables hold your data. Each table is defined by a schema (data types, column/row names, and other information). There are different types of tables, namely, native tables are supported by BigQuery storage, external tables exist on storage external to BigQuery, and views which are virtual tables defined by SQL queries.

Jobs

The actions that BigQuery runs, such as loading, exporting querying, or copying data are referred to as jobs. The location of the job is linked to the location of the dataset for executing the job.

Key Features of BigQuery:

adam-smigielski-ZSct3GqtTL0-unsplash.jpg

Photo by Adam Śmigielski on Unsplash

Predictive Modelling with BigQuery ML

BigQuery enabled data analysts or data scientists to build machine learning models structured or semi-structured data sets several petabytes big. All this is achieved through simple SQL in minimal time.

BigQuery Omni for Multicloud data analytics

BigQuery Omni allows you to analyze data across the multi-cloud such as Azure and AWS, as a fully managed and end-to-end data analytics solution for the multi-cloud with a focus on saving costs and securing data.

BigQuery BI Engine for Interactive Data Analytics

With its highly optimized, in-memory analysis service, BigQuery BI Engine enables data analysts to obtain actionable insights from massive and complex datasets with a sub-second query response time and high scalability through high concurrency.

BigQuery GIS with Geospatial Analysis

As a unique feature, combine geospatial analysis with BigQuery’s serverless architecture in order to improve and augment your analytics workflows with location-based intelligence. Simplify your analyses and visualize your special data to unlock new potential for your business

Warehousing in BigQuery

If you are interested in a step by step guide, check out this youtube video

Warehousing with Google's Big Query from Anuj Syal

Loading dataset in BigQuery

image.png

Screenshot from Google Cloud

Big query provides multiple options for you to load the data:

  • Use a pre-existing connector, eg youtube analytics, google analytics
  • Google Cloud Storage
  • Big Query Console
  • Big Query CLI
  • Using python client libraries

Public Datasets

Google Cloud Public Datasets offer a powerful data repository of more than 200 high-demand public datasets from different industries. It also provides free storage cost and 1TB query cost per month if you intend to use it.

Exploring Ecommerce Public Dataset on Big Query

If you are familiar with simple SQL, Big Query allows you to explore it's biggest datasets for free. So as an example let's check out this Ecommerce dataset publicly available:

  1. Query: Total unique visitors
    SELECT
    COUNT(*) AS product_views,
    COUNT(DISTINCT fullVisitorId) AS unique_visitors
    FROM `data-to-insights.ecommerce.all_sessions`;
    

Out:

product_viewsunique_visitors
21493109389934
  • Query: Total unique visitors by Channel grouping
SELECT
  COUNT(DISTINCT fullVisitorId) AS unique_visitors,
  channelGrouping
FROM `data-to-insights.ecommerce.all_sessions`
GROUP BY channelGrouping
ORDER BY channelGrouping DESC;

Out:

unique_visitorschannelGrouping
38101Social
57308Referral
11865Paid Search
211993Organic Search
3067Display
75688Direct
5966Affiliates
62(Other)
  • Query: Top Five products with the most views
SELECT
  COUNT(*) AS product_views,
  (v2ProductName) AS ProductName
FROM `data-to-insights.ecommerce.all_sessions`
WHERE type = 'PAGE'
GROUP BY v2ProductName
ORDER BY product_views DESC
LIMIT 5;

Out :

product_viewsProductName
316482Google Men's 100% Cotton Short Sleeve Hero Tee White
22155822 oz YouTube Bottle Infuser
210700YouTube Men's Short Sleeve Hero Tee Black
202205Google Men's 100% Cotton Short Sleeve Hero Tee Black
200789YouTube Custom Decals
  • Query: Top Five products with the most unique views
#> You can use the SQL `WITH`
#> clause to help break apart a complex query into multiple steps.
WITH unique_product_views_by_person AS (
-- find each unique product viewed by each visitor
SELECT
 fullVisitorId,
 (v2ProductName) AS ProductName
FROM `data-to-insights.ecommerce.all_sessions`
WHERE type = 'PAGE'
GROUP BY fullVisitorId, v2ProductName )
-- aggregate the top viewed products and sort them
SELECT
  COUNT(*) AS unique_view_count,
  ProductName
FROM unique_product_views_by_person
GROUP BY ProductName
ORDER BY unique_view_count DESC
LIMIT 5

Out:

unique_view_countProductName
152358Google Men's 100% Cotton Short Sleeve Hero Tee White
14377022 oz YouTube Bottle Infuser
127904YouTube Men's Short Sleeve Hero Tee Black
122051YouTube Twill Cap
121288YouTube Custom Decals
  • Final Query: Total number of distinct products ordered and the total number of total units ordered
SELECT
  COUNT(*) AS product_views,
  COUNT(productQuantity) AS orders,
  SUM(productQuantity) AS quantity_product_ordered,
  v2ProductName
FROM `data-to-insights.ecommerce.all_sessions`
WHERE type = 'PAGE'
GROUP BY v2ProductName
ORDER BY product_views DESC
LIMIT 5;

Out:

product_viewsordersquantity_product_orderedv2ProductName
31648231586352Google Men's 100% Cotton Short Sleeve Hero Tee White
221558508476922 oz YouTube Bottle Infuser
2107009491114YouTube Men's Short Sleeve Hero Tee Black
20220527138072Google Men's 100% Cotton Short Sleeve Hero Tee Black
200789170311336YouTube Custom Decals

This process of doing multiple queries helps us derive insights from data using SQL, and technically this data can be in petabytes and the process will remain somewhat same

Benefits of BigQuery:

Superior Insights with Predictive Analytics

Get updated analytics and information on all your business processes by querying streaming data in real-time. Utilize these insights to make data-driven decisions for your business and effectively predict business outcomes without moving data across.

Share Insights Seamlessly

Share and access analytics and insights securely from within your organization, enabling stakeholders to develop insightful reports and dashboards using BI-tool right out of the box.

Enhanced Security for your Data

Experience enhanced data resiliency, robust security, and reliability control offering 99.99% uptime SLA. Ensuring that your data is protected, secured and unreachable to unauthorized and unauthenticated access.

Provisioning and System Sizing

Unlike many relational database management systems (RDBMS), Google BigQuery dynamically allocates query resources as you consume them and deallocates resources as data is deleted or tables are dropped. Furthermore, allocated resources match the query type and complexity.

Storage Management

BigQuery utilizes a proprietary format called Capacitor. It is columnar in nature and holds many benefits including the fact that it can evolve with the query engine. Access patterns are used to determine the most optimal number of shards of data and how they are to be encoded for storage. BigQuery queries can either be stored on Google's Colossus platform or outside of BigQuery storage in the cloud, on Google Drive.

Maintenance

BigQuery receives constant updates from it’s engineering team. These upgrades cause little to no downtime on BigQuery’s operations. Ensuring optimal performance and minimal downtime as you collect essential insights for your business goals.

Backup and Recovery

Database administrators have always found backup and recovery to be extremely tedious and complex tasks. Costs rise as there is almost always a need for additional licenses and hardware. With BigQuery, backup and recovery is handled at the service level. BigQuery maintains a complete seven-day history of changes against your tables and lets you write specific queries to point-in-time snapshots of your data. If a table is deleted, its history is removed after a period of seven days.

Monitoring and Auditing

Using the BigQuery metric, you can monitor how BigQuery is behaving in the form of various charts and alerts. In order to have a proactive approach towards system health, you can create alerts that will be triggered based on thresholds defined by you. BigQuery also creates various logs, including audit logs of actions made by users.

Conclusion

As is apparent, BigQuery provides powerful enablement for your business and decision-making through optimized data processing, smart data insights, and resiliency in how this data is stored. It is powerful too used to allow your organization to utilize data to its advantage.