Photo by CHUTTERSNAP on Unsplash
Warehousing with Google’s Big Query
8 min read
Data, in the modern world, is decentralized and is being generated and collected at a record pace. To ensure that this data is collected and processed in a manner that enables businesses and organizations to achieve their business goals, specialized and optimized tools are required. ‘The right solution will enable businesses and organizations to store data with resiliency, and swiftly analyze large amounts of data such that it can be used to achieve business outcomes, with decisions powered by data and analytics.
Google’s BigQuery is one such tool. Google’s proprietary BigQuery is a serverless multi-cloud data warehouse that is highly scalable, cost-effective, and specifically designed for offering superior business agility. It democratizes your data-driven insights with built-in machine learning, powered by a flexible and end-to-end multi-cloud analytics solution. In addition to state-of-the-art machine learning, BigQuery also enables lower TCO at scale by almost 26-34% as compared to alternatives. Furthermore, BigQuery adapts to your data with zero operational overhead.
Photo by Taylor Vick on Unsplash
BigQuery’s architecture is developed for Big data. It works optimally when it is being fed with several petabytes of data to be cleaned, processed, and analyzed. BigQuery removed the requirements for humans to make interactive ad-hoc queries to massive data sets in read-only mode.
BigQuery has the following hierarchical structure:
From the context of BigQuery, all BigQuery resources are contained within a project. Since there is decoupling between Storage and Compute with BigQuery, the projects that store data and those that query data can be separate.
You can utilize datasets to organise BigQuery views A dataset is bound to a location that may be regional (a specific geographical place) or multi-regional (a region that contains two or more geographical places). The location for a dataset can only be defined at the time of its creation.
BigQuery tables hold your data. Each table is defined by a schema (data types, column/row names, and other information). There are different types of tables, namely, native tables are supported by BigQuery storage, external tables exist on storage external to BigQuery, and views which are virtual tables defined by SQL queries.
The actions that BigQuery runs, such as loading, exporting querying, or copying data are referred to as jobs. The location of the job is linked to the location of the dataset for executing the job.
Key Features of BigQuery:
Photo by Adam Śmigielski on Unsplash
Predictive Modelling with BigQuery ML
BigQuery enabled data analysts or data scientists to build machine learning models structured or semi-structured data sets several petabytes big. All this is achieved through simple SQL in minimal time.
BigQuery Omni for Multicloud data analytics
BigQuery Omni allows you to analyze data across the multi-cloud such as Azure and AWS, as a fully managed and end-to-end data analytics solution for the multi-cloud with a focus on saving costs and securing data.
BigQuery BI Engine for Interactive Data Analytics
With its highly optimized, in-memory analysis service, BigQuery BI Engine enables data analysts to obtain actionable insights from massive and complex datasets with a sub-second query response time and high scalability through high concurrency.
BigQuery GIS with Geospatial Analysis
As a unique feature, combine geospatial analysis with BigQuery’s serverless architecture in order to improve and augment your analytics workflows with location-based intelligence. Simplify your analyses and visualize your special data to unlock new potential for your business
Warehousing in BigQuery
If you are interested in a step by step guide, check out this youtube video
Warehousing with Google's Big Query from Anuj Syal
Loading dataset in BigQuery
Screenshot from Google Cloud
Big query provides multiple options for you to load the data:
- Use a pre-existing connector, eg youtube analytics, google analytics
- Google Cloud Storage
- Big Query Console
- Big Query CLI
- Using python client libraries
Google Cloud Public Datasets offer a powerful data repository of more than 200 high-demand public datasets from different industries. It also provides free storage cost and 1TB query cost per month if you intend to use it.
Exploring Ecommerce Public Dataset on Big Query
If you are familiar with simple SQL, Big Query allows you to explore it's biggest datasets for free. So as an example let's check out this Ecommerce dataset publicly available:
- About the Dataset The dataset provides Google Analytics 360 data from the Google Merchandise Store , a real ecommerce store that sells Google-branded merchandise, in BigQuery. We will explore the all_sessions table
- Query: Total unique visitors
SELECT COUNT(*) AS product_views, COUNT(DISTINCT fullVisitorId) AS unique_visitors FROM `data-to-insights.ecommerce.all_sessions`;
- Query: Total unique visitors by Channel grouping
SELECT COUNT(DISTINCT fullVisitorId) AS unique_visitors, channelGrouping FROM `data-to-insights.ecommerce.all_sessions` GROUP BY channelGrouping ORDER BY channelGrouping DESC;
- Query: Top Five products with the most views
SELECT COUNT(*) AS product_views, (v2ProductName) AS ProductName FROM `data-to-insights.ecommerce.all_sessions` WHERE type = 'PAGE' GROUP BY v2ProductName ORDER BY product_views DESC LIMIT 5;
|316482||Google Men's 100% Cotton Short Sleeve Hero Tee White|
|221558||22 oz YouTube Bottle Infuser|
|210700||YouTube Men's Short Sleeve Hero Tee Black|
|202205||Google Men's 100% Cotton Short Sleeve Hero Tee Black|
|200789||YouTube Custom Decals|
- Query: Top Five products with the most unique views
#> You can use the SQL `WITH` #> clause to help break apart a complex query into multiple steps. WITH unique_product_views_by_person AS ( -- find each unique product viewed by each visitor SELECT fullVisitorId, (v2ProductName) AS ProductName FROM `data-to-insights.ecommerce.all_sessions` WHERE type = 'PAGE' GROUP BY fullVisitorId, v2ProductName ) -- aggregate the top viewed products and sort them SELECT COUNT(*) AS unique_view_count, ProductName FROM unique_product_views_by_person GROUP BY ProductName ORDER BY unique_view_count DESC LIMIT 5
|152358||Google Men's 100% Cotton Short Sleeve Hero Tee White|
|143770||22 oz YouTube Bottle Infuser|
|127904||YouTube Men's Short Sleeve Hero Tee Black|
|122051||YouTube Twill Cap|
|121288||YouTube Custom Decals|
- Final Query: Total number of distinct products ordered and the total number of total units ordered
SELECT COUNT(*) AS product_views, COUNT(productQuantity) AS orders, SUM(productQuantity) AS quantity_product_ordered, v2ProductName FROM `data-to-insights.ecommerce.all_sessions` WHERE type = 'PAGE' GROUP BY v2ProductName ORDER BY product_views DESC LIMIT 5;
|316482||3158||6352||Google Men's 100% Cotton Short Sleeve Hero Tee White|
|221558||508||4769||22 oz YouTube Bottle Infuser|
|210700||949||1114||YouTube Men's Short Sleeve Hero Tee Black|
|202205||2713||8072||Google Men's 100% Cotton Short Sleeve Hero Tee Black|
|200789||1703||11336||YouTube Custom Decals|
This process of doing multiple queries helps us derive insights from data using SQL, and technically this data can be in petabytes and the process will remain somewhat same
Benefits of BigQuery:
Superior Insights with Predictive Analytics
Get updated analytics and information on all your business processes by querying streaming data in real-time. Utilize these insights to make data-driven decisions for your business and effectively predict business outcomes without moving data across.
Share Insights Seamlessly
Share and access analytics and insights securely from within your organization, enabling stakeholders to develop insightful reports and dashboards using BI-tool right out of the box.
Enhanced Security for your Data
Experience enhanced data resiliency, robust security, and reliability control offering 99.99% uptime SLA. Ensuring that your data is protected, secured and unreachable to unauthorized and unauthenticated access.
Provisioning and System Sizing
Unlike many relational database management systems (RDBMS), Google BigQuery dynamically allocates query resources as you consume them and deallocates resources as data is deleted or tables are dropped. Furthermore, allocated resources match the query type and complexity.
BigQuery utilizes a proprietary format called Capacitor. It is columnar in nature and holds many benefits including the fact that it can evolve with the query engine. Access patterns are used to determine the most optimal number of shards of data and how they are to be encoded for storage. BigQuery queries can either be stored on Google's Colossus platform or outside of BigQuery storage in the cloud, on Google Drive.
BigQuery receives constant updates from it’s engineering team. These upgrades cause little to no downtime on BigQuery’s operations. Ensuring optimal performance and minimal downtime as you collect essential insights for your business goals.
Backup and Recovery
Database administrators have always found backup and recovery to be extremely tedious and complex tasks. Costs rise as there is almost always a need for additional licenses and hardware. With BigQuery, backup and recovery is handled at the service level. BigQuery maintains a complete seven-day history of changes against your tables and lets you write specific queries to point-in-time snapshots of your data. If a table is deleted, its history is removed after a period of seven days.
Monitoring and Auditing
Using the BigQuery metric, you can monitor how BigQuery is behaving in the form of various charts and alerts. In order to have a proactive approach towards system health, you can create alerts that will be triggered based on thresholds defined by you. BigQuery also creates various logs, including audit logs of actions made by users.
As is apparent, BigQuery provides powerful enablement for your business and decision-making through optimized data processing, smart data insights, and resiliency in how this data is stored. It is powerful too used to allow your organization to utilize data to its advantage.