<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Anuj Syal's Blog]]></title><description><![CDATA[Data Engineer | Youtuber and writer @Towardsdatascience | Love Python! | Based in Singapore |  Cloud and Machine Learning Expertise | Experience with Block Chain]]></description><link>https://anujsyal.com</link><generator>RSS for Node</generator><lastBuildDate>Fri, 10 Apr 2026 14:40:47 GMT</lastBuildDate><atom:link href="https://anujsyal.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Mastering Data Quality in ETL Pipelines with Great Expectations]]></title><description><![CDATA[In the world of data engineering, ensuring data quality is paramount. From business analysts relying on dashboards to C-level executives making strategic decisions, and data scientists training machine learning models — everyone depends on the qualit...]]></description><link>https://anujsyal.com/mastering-data-quality-in-etl-pipelines-with-great-expectations</link><guid isPermaLink="true">https://anujsyal.com/mastering-data-quality-in-etl-pipelines-with-great-expectations</guid><category><![CDATA[Data Science]]></category><category><![CDATA[data-quality]]></category><category><![CDATA[great-expectations]]></category><category><![CDATA[GCP]]></category><category><![CDATA[data]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Sun, 08 Dec 2024 06:07:05 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1733639243861/755fe485-32ce-49d3-b6d1-50b11e7e2d4e.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the world of data engineering, ensuring data quality is paramount. From business analysts relying on dashboards to C-level executives making strategic decisions, and data scientists training machine learning models — everyone depends on the quality of the data. If the data is flawed, the insights and decisions drawn from it are equally flawed.</p>
<p>This is where Great Expectations (GX) comes into play. Built on Python, Great Expectations is a powerful open-source data validation and quality framework that helps ensure data is clean, accurate, and ready for use. This blog will guide you through the why and how of data quality, the Great Expectations workflow, and a hands-on tutorial for implementation.</p>
<h2 id="heading-why-data-quality-matters"><strong>Why Data Quality Matters</strong></h2>
<p>When building ETL pipelines, your end goal is to deliver clean, validated data. This data could be used as:</p>
<ul>
<li><p><strong>Business Intelligence Reports</strong> for business analysts.</p>
</li>
<li><p><strong>Executive Dashboards</strong> for decision-makers.</p>
</li>
<li><p><strong>Training Datasets</strong> for machine learning models.</p>
</li>
</ul>
<p>If your data contains nulls, duplicates, incorrect formats, or misaligned schema, it undermines every stakeholder's effort. Data scientists can't build accurate models, analysts can't trust insights, and executives risk making bad decisions.</p>
<p>For data engineers, this makes <strong>data quality a moral and technical responsibility</strong>. It ensures that the hours spent building complex pipelines don't go to waste due to unreliable data.</p>
<h2 id="heading-enter-great-expectations"><strong>Enter Great Expectations</strong></h2>
<p><img src="https://legacy.017.docs.greatexpectations.io/img/gx-logo.svg" alt="Welcome | Great Expectations" class="image--center mx-auto" /></p>
<p>Great Expectations (GX) aims to make data quality checks <strong>declarative, automated, and shareable</strong>. With GX, you can define <strong>“expectations”</strong> for your data, which are essentially quality rules, such as:</p>
<ul>
<li><p>Ensuring no null values exist in critical columns.</p>
</li>
<li><p>Validating specific column values (like <code>country</code> only having "US", "IN", "UK").</p>
</li>
<li><p>Checking for minimum, maximum, and unique constraints on data.</p>
</li>
</ul>
<p>The framework integrates with most data sources, including <strong>databases, cloud storage (S3, GCS), Pandas, and Spark DataFrames</strong>. It also produces <strong>data quality reports</strong> that can be shared with business users, creating transparency about <strong>what data passed or failed the quality checks</strong>.</p>
<h2 id="heading-how-does-the-great-expectations-workflow-look"><strong>How Does the Great Expectations Workflow Look?</strong></h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733636454262/ad8581c4-3ac9-4063-ba7a-9f7904d43753.png" alt class="image--center mx-auto" /></p>
<p>A typical Great Expectations workflow has 4 key steps:</p>
<ol>
<li><p><strong>Environment Setup</strong>: Define the context to track your data sources and validation logic.</p>
</li>
<li><p><strong>Data Connection</strong>: Connect to databases, data warehouses, object storage, or files.</p>
</li>
<li><p><strong>Expectation Suite Definition</strong>: Specify the quality checks (null checks, range checks, etc.).</p>
</li>
<li><p><strong>Validation &amp; Reports</strong>: Run validation on the data and generate shareable reports.</p>
</li>
</ol>
<p>These steps map neatly onto a data engineer's workflow: Great Expectations can be slotted into each stage of the <strong>Extract, Transform, Load (ETL) cycle</strong> (see the sketch after the list below).</p>
<ul>
<li><p><strong>During Extraction</strong>: Check for nulls, data formats, or schema inconsistencies.</p>
</li>
<li><p><strong>During Transformation</strong>: Verify data quality after transformations (e.g., after joining datasets).</p>
</li>
<li><p><strong>Before Loading</strong>: Final validation before pushing to a warehouse like <strong>BigQuery, Snowflake, or Redshift</strong>.</p>
</li>
</ul>
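<p>As a rough illustration of that fit, here is a minimal sketch that validates a freshly extracted DataFrame before it moves on to the transform step. It uses the in-memory pandas data source; names such as <code>etl_checks</code> and the tiny sample DataFrame are illustrative and not part of the GCS example used later in this tutorial.</p>
<pre><code class="lang-python">import great_expectations as gx
import pandas as pd

context = gx.get_context()

# Register an in-memory pandas data source, a DataFrame asset, and a
# batch definition that always covers the whole DataFrame.
data_source = context.data_sources.add_pandas(name="etl_checks")
data_asset = data_source.add_dataframe_asset(name="extracted_rows")
batch_definition = data_asset.add_batch_definition_whole_dataframe("current_run")

# One expectation suite per pipeline stage keeps the checks easy to reason about.
suite = context.suites.add(gx.ExpectationSuite(name="extract_stage_suite"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="airline"))

validation_definition = context.validation_definitions.add(
    gx.ValidationDefinition(data=batch_definition, suite=suite, name="extract_stage_check")
)

# During extraction: validate the extracted rows before transforming them.
extracted_df = pd.DataFrame({"airline": ["Indigo", "Vistara", None]})
results = validation_definition.run(batch_parameters={"dataframe": extracted_df})
if not results.success:
    raise ValueError("Extraction-stage data quality checks failed")
</code></pre>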
<h2 id="heading-great-expectations-product-offerings"><strong>Great Expectations Product Offerings</strong></h2>
<p>Great Expectations offers two core products:</p>
<ul>
<li><p><strong>GX Core</strong>: The open-source version (focus of this blog).</p>
</li>
<li><p><strong>GX Cloud</strong>: A managed cloud version with a GUI for managing expectations and validation reports.</p>
</li>
</ul>
<p>In this guide, we'll focus on <strong>GX Core</strong>.</p>
<h2 id="heading-deep-dive-into-the-great-expectations-workflow"><strong>Deep Dive into the Great Expectations Workflow</strong></h2>
<h3 id="heading-1-environment-setup"><strong>1. Environment Setup</strong></h3>
<p>The first step is to <strong>create a Great Expectations context</strong>. The context tracks your data sources, expectation suites, and batch definitions. It works like a "brain" for your project, keeping all metadata in one place.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> great_expectations <span class="hljs-keyword">as</span> gx
print(gx.__version__)
<span class="hljs-keyword">import</span> os
os.environ[<span class="hljs-string">"GOOGLE_APPLICATION_CREDENTIALS"</span>] = <span class="hljs-string">"&lt;service_account_key_path"</span>
</code></pre>
<p>Since this demo uses Google Cloud for data storage and retrieval, we also need to <a target="_blank" href="https://cloud.google.com/iam/docs/service-account-overview">authenticate GCP access using a service account</a>.</p>
<h3 id="heading-2-define-a-data-source"><strong>2.</strong> Define a Data Source</h3>
<p>After the context is set, you need to <strong>connect your data source</strong>. This can be a local CSV, GCS/S3 file, DataFrame, or even a connection to a <strong>PostgreSQL, BigQuery, or Snowflake database</strong>.</p>
<p>Here’s how you can connect a CSV from a Google Cloud Storage (GCS) bucket:</p>
<pre><code class="lang-python">data_source_name = <span class="hljs-string">"flights_data_source"</span>
bucket_or_name = <span class="hljs-string">"flights-dataset-yt-tutorial"</span>
gcs_options = {}
data_source = context.data_sources.add_pandas_gcs(
    name=<span class="hljs-string">"flights_data_source"</span>, bucket_or_name=<span class="hljs-string">"flights-dataset-yt-tutorial"</span>, gcs_options=gcs_options
)
</code></pre>
<p>The sample data was already uploaded to Google Cloud Storage, and the built-in connector from Great Expectations is used to fetch it.</p>
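<p>For completeness, one way to stage such a file yourself is with the <code>google-cloud-storage</code> client library (a quick sketch; the bucket and file names below simply mirror the ones used in this tutorial):</p>
<pre><code class="lang-python">from google.cloud import storage

# Uses the same service-account credentials configured earlier via
# GOOGLE_APPLICATION_CREDENTIALS.
client = storage.Client()
bucket = client.bucket("flights-dataset-yt-tutorial")

# Upload the local CSV so the pandas GCS data source can read it.
blob = bucket.blob("goibibo_flights_data.csv")
blob.upload_from_filename("goibibo_flights_data.csv")
</code></pre>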
<h3 id="heading-3-define-data-asset"><strong>3.</strong> Define Data Asset</h3>
<p>A <strong>Data Asset</strong> is a specific file, table, or logical grouping of data from the source.</p>
<pre><code class="lang-python">data_source = context.data_sources.add_pandas_gcs(
    name=<span class="hljs-string">"flights_data_source"</span>, 
    bucket_or_name=<span class="hljs-string">"flights-dataset-yt-tutorial"</span>, 
    gcs_options={}
)
</code></pre>
<h3 id="heading-4-define-a-batch-definition"><strong>4.</strong> Define a Batch Definition</h3>
<p>A <strong>Batch Definition</strong> specifies which subset of the data (such as particular rows or partitions) should be used for validation. This is most useful on larger projects, where different batches of rows need to be tested in different ways. For example, datasets coming from different countries might be subject to different validations and would therefore require multiple batches.</p>
<p>In our case, to keep things simple, we use a single batch that covers the whole data asset.</p>
<pre><code class="lang-python">batch_definition_name = <span class="hljs-string">"goibibo_flights_data_whole"</span>
batch_definition_path = <span class="hljs-string">"goibibo_flights_data.csv"</span>
batch_definition = data_asset.add_batch_definition(
    name=batch_definition_name
)
batch = batch_definition.get_batch()
print(batch.head())
</code></pre>
<h3 id="heading-5-define-a-batch-definition"><strong>5.</strong> Define a Batch Definition</h3>
<p>Now that the data is connected with batch defined, you define a list of <strong>data quality checks (expectations)</strong>. This is done via an <strong>Expectation Suite</strong>, a collection of checks.</p>
<pre><code class="lang-python">suite =  context.suites.add(
    gx.ExpectationSuite(name=<span class="hljs-string">"flight_expectation_suite"</span>)
)
expectation1 = gx.expectations.ExpectColumnValuesToNotBeNull(column=<span class="hljs-string">"airline"</span>)
expectation2 = gx.expectations.ExpectColumnDistinctValuesToBeInSet(
    column=<span class="hljs-string">"class"</span>,
    value_set=[<span class="hljs-string">'economy'</span>,<span class="hljs-string">'business'</span>]
)
suite.add_expectation(expectation=expectation1)
suite.add_expectation(expectation=expectation2)
</code></pre>
<p>This defines two expectations:</p>
<ul>
<li><p><strong>expectation1:</strong> Column <code>airline</code> should not be NULL</p>
</li>
<li><p><strong>expectation2</strong>: Column <code>class</code> can only contain two values ['economy','business']</p>
</li>
</ul>
<p>You can create as many expectations as you want, each covering a specific data quality rule.</p>
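<p>For example, a couple of additional checks could look like the following (column names such as <code>price</code> and <code>flight_id</code> are placeholders for whatever your dataset actually contains):</p>
<pre><code class="lang-python"># Range check: fares should fall within a sane interval. mostly=0.99 tolerates
# up to 1% outliers before the expectation is marked as failed.
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="price", min_value=0, max_value=100000, mostly=0.99
    )
)

# Uniqueness check: every row should carry a distinct identifier.
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="flight_id")
)
</code></pre>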
<h3 id="heading-6-validation-and-run"><strong>6. Validation and Run</strong></h3>
<p>Once you’ve defined expectations, you must <strong>run validation</strong> to see if the data passes. Validation runs the expectations on batches of data and returns results (pass/fail) as <strong>shareable documentation</strong>.</p>
<pre><code class="lang-python"><span class="hljs-comment"># define `Validation Definition`: A Validation Definition is a fixed reference that links a Batch of data to an Expectation Suite.</span>

validation_definition = gx.ValidationDefinition(
    data=batch_definition, suite=suite, name=<span class="hljs-string">'flight_batch_definition'</span>
)
validation_definition = context.validation_definitions.add(validation_definition)
validation_results = validation_definition.run()
print(validation_results)
<span class="hljs-comment"># %%</span>
<span class="hljs-comment"># Create a Checkpoint with Actions for multiple validation_definition</span>
validation_definitions = [
    validation_definition <span class="hljs-comment"># can be multiple definitions</span>
]

<span class="hljs-comment"># Create a list of Actions for the Checkpoint to perform</span>
action_list = [
    <span class="hljs-comment"># This Action sends a Slack Notification if an Expectation fails.</span>
    gx.checkpoint.SlackNotificationAction(
        name=<span class="hljs-string">"send_slack_notification_on_failed_expectations"</span>,
        slack_token=<span class="hljs-string">"${validation_notification_slack_webhook}"</span>,
        slack_channel=<span class="hljs-string">"${validation_notification_slack_channel}"</span>,
        notify_on=<span class="hljs-string">"failure"</span>,
        show_failed_expectations=<span class="hljs-literal">True</span>,
    ),
    <span class="hljs-comment"># This Action updates the Data Docs static website with the Validation</span>
    <span class="hljs-comment">#   Results after the Checkpoint is run.</span>
    gx.checkpoint.UpdateDataDocsAction(
        name=<span class="hljs-string">"update_all_data_docs"</span>,
    ),
]

checkpoint = gx.Checkpoint(
    name=<span class="hljs-string">"flight_checkpoint"</span>,
    validation_definitions=validation_definitions,
    actions=action_list,
    result_format={<span class="hljs-string">"result_format"</span>: <span class="hljs-string">"COMPLETE"</span>},
)

context.checkpoints.add(checkpoint)

<span class="hljs-comment"># %%</span>
<span class="hljs-comment"># Run checkpoint</span>
validation_results = checkpoint.run()
print(validation_results)
</code></pre>
<p>The validation result tells you:</p>
<ul>
<li><p>Which expectations passed/failed.</p>
</li>
<li><p>How many records failed.</p>
</li>
<li><p>JSON reports that can be shared with business teams (a short snippet for inspecting these results in code follows this list).</p>
</li>
</ul>
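<p>To act on these results programmatically (for example, failing a pipeline task when a check does not pass), the object returned by <code>validation_definition.run()</code> can be inspected directly. A minimal sketch, assuming the attribute names of the GX 1.x validation result object:</p>
<pre><code class="lang-python">suite_results = validation_definition.run()

# Overall pass/fail for the whole suite.
print(suite_results.success)

# Per-expectation outcomes: each entry pairs the expectation that was run
# with whether it passed on this batch.
for result in suite_results.results:
    print(result.expectation_config, result.success)

# Summary counts, such as evaluated vs. successful expectations.
print(suite_results.statistics)
</code></pre>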
<h2 id="heading-reporting-amp-alerts"><strong>Reporting &amp; Alerts</strong></h2>
<p>One of the most powerful features of Great Expectations is <strong>reporting and alerting</strong>. After a validation run, you can:</p>
<ul>
<li><p><strong>Email the reports</strong> to business users.</p>
</li>
<li><p><strong>Log failures</strong> for debugging (see the sketch after this list).</p>
</li>
<li><p><strong>Trigger alerts via Slack or other services</strong>.</p>
</li>
</ul>
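<p>The logging case is straightforward: the checkpoint result exposes an overall <code>success</code> flag, so a thin wrapper can route failures to your usual logs (a small sketch, not tied to any particular logging setup):</p>
<pre><code class="lang-python">import logging

logger = logging.getLogger("data_quality")

results = checkpoint.run()
if results.success:
    logger.info("All flight data expectations passed")
else:
    logger.error("Flight data quality checks failed; see the Data Docs report for details")
</code></pre>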
<pre><code class="lang-python"><span class="hljs-comment"># Create a list of Actions for the Checkpoint to perform</span>
action_list = [
    <span class="hljs-comment"># This Action sends a Slack Notification if an Expectation fails.</span>
    gx.checkpoint.SlackNotificationAction(
        name=<span class="hljs-string">"send_slack_notification_on_failed_expectations"</span>,
        slack_token=<span class="hljs-string">"${validation_notification_slack_webhook}"</span>,
        slack_channel=<span class="hljs-string">"${validation_notification_slack_channel}"</span>,
        notify_on=<span class="hljs-string">"failure"</span>,
        show_failed_expectations=<span class="hljs-literal">True</span>,
    ),
    <span class="hljs-comment"># This Action updates the Data Docs static website with the Validation</span>
    <span class="hljs-comment">#   Results after the Checkpoint is run.</span>
    gx.checkpoint.UpdateDataDocsAction(
        name=<span class="hljs-string">"update_all_data_docs"</span>,
    ),
]

checkpoint = gx.Checkpoint(
    name=<span class="hljs-string">"flight_checkpoint"</span>,
    validation_definitions=validation_definitions,
    actions=action_list,
    result_format={<span class="hljs-string">"result_format"</span>: <span class="hljs-string">"COMPLETE"</span>},
)

context.checkpoints.add(checkpoint)
</code></pre>
<h2 id="heading-built-in-expectations"><strong>Built-in Expectations</strong></h2>
<p>Great Expectations provides a <strong>library of built-in expectations</strong>. Examples include:</p>
<ul>
<li><p><strong>Null checks</strong>: Ensure no nulls exist in a column.</p>
</li>
<li><p><strong>Range checks</strong>: Check if column values lie within a range.</p>
</li>
<li><p><strong>Data type checks</strong>: Ensure a column is of type integer, float, etc.</p>
</li>
<li><p><strong>Uniqueness checks</strong>: Verify column values are unique.</p>
</li>
</ul>
<p>You can see the full list of built-in expectations on the <a target="_blank" href="https://greatexpectations.io/expectations">official Expectation Gallery</a>.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>If you're a data engineer, <strong>implementing Great Expectations in your ETL pipelines is a game-changer</strong>. The power to automate, validate, and report on data quality is something every data-driven company needs. Great Expectations makes this process automated, reproducible, and shareable.</p>
<ul>
<li><p><strong>Use GX in your ETL pipelines</strong> to validate extracted and transformed data.</p>
</li>
<li><p><strong>Ensure transparency</strong> with quality reports that business teams can understand.</p>
</li>
<li><p><strong>Automate alerts and logging</strong> to notify when data fails checks.</p>
</li>
</ul>
<h1 id="heading-references">References</h1>
<p>Full source code on my <a target="_blank" href="https://github.com/syalanuj/youtube/tree/main/great_expectations_tutorial">GitHub</a></p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/oxOj30rl_xs">https://youtu.be/oxOj30rl_xs</a></div>
]]></content:encoded></item><item><title><![CDATA[Data Engineering vs Data Science]]></title><description><![CDATA[INTRODUCTION
In the tech-driven landscape of today, the dynamic duo of data engineering and data science stands at the forefront of innovation. While both fields play crucial roles in unlocking the potential of data, the limelight often gravitates to...]]></description><link>https://anujsyal.com/data-engineering-vs-data-science</link><guid isPermaLink="true">https://anujsyal.com/data-engineering-vs-data-science</guid><category><![CDATA[Data Science]]></category><category><![CDATA[dataengineering]]></category><category><![CDATA[Career]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Thu, 21 Sep 2023 15:13:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/Rm3nWQiDTzg/upload/faf32d10e649f8db1049ee3990738828.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">INTRODUCTION</h2>
<p>In the tech-driven landscape of today, the dynamic duo of data engineering and data science stands at the forefront of innovation. While both fields play crucial roles in unlocking the potential of data, the limelight often gravitates towards the captivating glamour of data science. However, it is time to look past the hype and recognize the unsung hero that is data engineering. </p>
<p>In this blog, we explore the often-underestimated field of data engineering and how it plays a critical role in helping data scientists gain valuable insights. Let's take a step back and recognize that while data science may get most of the attention, it is data engineering that quietly keeps data innovation moving forward.</p>
<h2 id="heading-what-is-data-engineering">WHAT IS DATA ENGINEERING?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1695307580352/bec0c674-c03b-47ff-b0a0-e05c890b3636.jpeg" alt="Photo by &lt;a href=&quot;https://unsplash.com/@alain_pham?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText&quot;&gt;Alain Pham&lt;/a&gt; on &lt;a href=&quot;https://unsplash.com/photos/P_qvsF7Yodw?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText&quot;&gt;Unsplash&lt;/a&gt;   " class="image--center mx-auto" /></p>
<blockquote>
<p><a target="_blank" href="https://unsplash.com/@alain_pham?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Alain Pham</a><a target="_blank" href="https://unsplash.com/photos/P_qvsF7Yodw?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>In simple terms, Data engineering is the process of making data organized and ready for further analysis. It is critical to the data lifecycle because it ensures that raw data from multiple sources is converted into a usable format that is available for analysis and decision-making. In the world of data-driven decision-making, Data Engineering emerges as the hero, laying the foundation for the vast architecture of insights and analysis. </p>
<p>From designing scalable data pipelines to creating efficient data storage, a data engineer takes on different roles and responsibilities to make behind-the-scenes magic happen, transforming raw information into useful insights. However, for this, they require a diversified skill set that blends technical expertise with problem-solving ability. </p>
<p>Let’s look over the core skills required for data engineers in more detail.</p>
<h3 id="heading-skills-required">SKILLS REQUIRED</h3>
<ol>
<li><p>Programming Languages: Proficiency in programming languages is essential. Python and Java are popular choices for building data pipelines, and Scala is widely used, particularly in big data frameworks such as Apache Spark.</p>
</li>
<li><p>Database Management: An in-depth knowledge of several database systems is required. SQL is essential, as is experience with both relational databases (e.g., MySQL, PostgreSQL) and NoSQL databases (e.g., MongoDB, Cassandra).  </p>
</li>
<li><p>ETL (Extract, Transform, Load) Tools: Data engineers must be adept in the use of ETL technologies and frameworks that enable data transportation and transformation. Commonly used tools include Apache NiFi, Apache Kafka, and Apache Airflow.</p>
</li>
<li><p>Big Data Technologies: Understanding big data technologies like Hadoop and Apache Spark is crucial for efficiently processing and analysing massive datasets. Understanding concepts such as MapReduce is handy.</p>
</li>
<li><p>Cloud Platforms: As more businesses shift their data infrastructure to the cloud, expertise in cloud platforms such as AWS, Azure, and Google Cloud, as well as Databricks, is growing more valuable. Proficiency in setting up and managing cloud-based data solutions, including Data Warehouses, Data Lakes, and the emerging Lakehouse architecture, is a key advantage.</p>
</li>
</ol>
<h2 id="heading-what-is-data-science">WHAT IS DATA SCIENCE?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1695307725318/a4521a2c-150f-4f3f-8530-3d5ad5572cfa.jpeg" alt="Photo by &lt;a href=&quot;https://unsplash.com/@logan_lense?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText&quot;&gt;Logan Moreno Gutierrez&lt;/a&gt; on &lt;a href=&quot;https://unsplash.com/photos/BQ95Oc7Nvvc?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText&quot;&gt;Unsplash&lt;/a&gt;   " class="image--center mx-auto" /></p>
<blockquote>
<p><a target="_blank" href="https://unsplash.com/@logan_lense?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Logan Moreno Gutierrez</a><a target="_blank" href="https://unsplash.com/photos/BQ95Oc7Nvvc?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>Data science is the study of data to extract valuable insights and knowledge, leading to informed decision-making and innovation. It combines elements from statistics, computer science, domain expertise, and machine learning to analyse complicated data sets. If data engineering lays the groundwork, data science breathes life into the information that illuminates the path forward.</p>
<p>Data Scientists, equipped with a combination of analytical skills and domain understanding, are responsible for several tasks. These mainly include building models, exploring data for patterns, and visual storytelling. Data Scientists also tend to have strong statistical knowledge, since statistics underpins the core of their field: analysis.</p>
<p>Let us explore the key skills needed to be a successful data scientist.</p>
<h3 id="heading-skills-required-1">SKILLS REQUIRED</h3>
<ol>
<li><p>Programming Languages: It is necessary to be adept in programming languages such as Python or R along with libraries like NumPy, Pandas for handling datasets. These languages are frequently used for data manipulation, analysis, and building of machine learning models.</p>
</li>
<li><p>Statistical Analysis: An extensive knowledge of statistics is required for planning experiments, evaluating hypotheses, and drawing meaningful conclusions from data. A strong grasp of statistical methods empowers data scientists to make informed decisions and extract valuable information from complex datasets.</p>
</li>
<li><p>Machine Learning: Building predictive and prescriptive models requires an in-depth understanding of various machine learning algorithms and techniques. Familiarity with libraries like scikit-learn, TensorFlow, or PyTorch is valuable for implementing machine learning algorithms.</p>
</li>
<li><p>Data Cleaning and Preprocessing: The ability to clean and preprocess data to remove irregularities and prepare it for analysis is an essential skill.</p>
</li>
<li><p>Data Visualization: Expertise in tools such as Matplotlib, Seaborn, or Tableau is essential for developing informative visualisations that convey complicated discoveries to non-technical stakeholders.</p>
</li>
</ol>
<h2 id="heading-difference">DIFFERENCE</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1695308494553/9bf26a71-0f88-4bdd-b999-65ad437135f1.jpeg" alt="Photo by &lt;a href=&quot;https://unsplash.com/@gregjeanneau?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText&quot;&gt;Gregoire Jeanneau&lt;/a&gt; on &lt;a href=&quot;https://unsplash.com/photos/0StwxZ4NigE?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText&quot;&gt;Unsplash&lt;/a&gt;   " class="image--center mx-auto" /></p>
<blockquote>
<p><a target="_blank" href="https://unsplash.com/@gregjeanneau?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Gregoire Jeanneau</a><a target="_blank" href="https://unsplash.com/photos/0StwxZ4NigE?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>As the digital environment evolves, the roles of Data Engineering and Data Science emerge as essential pillars, each with unique abilities. Let's dive into their differences and unravel their unique features.</p>
<h3 id="heading-field-focus">FIELD FOCUS</h3>
<p>Data Engineering has a broad scope that centres around the development and maintenance of data-related systems, pipelines, and overall infrastructure. Its primary goal is to efficiently collect, store, process, and effectively transform raw data into a thoroughly structured state suitable for analysis. </p>
<p>Data Science is largely concerned with extracting valuable insights from data through the use of a variety of sophisticated approaches. Statistical modelling is used to find patterns, machine learning techniques are used to forecast outcomes, and predictive analytics is used to obtain insight into future trends.</p>
<h3 id="heading-coding-practices">CODING PRACTICES</h3>
<p>Data Engineering places a strong emphasis on good coding practices, maintainable code, and production-ready solutions. Engineers are proficient in writing clean, modular, and efficient code using programming languages like Python, Java, or Scala. They specialise in creating robust pipelines and systems that can manage enormous amounts of data and are optimized for performance.</p>
<p>While Data Science professionals do write code, the emphasis is mostly on experimentation and analysis instead of production-level code. Coding is usually done in Jupyter notebooks, which are good for exploration and visualisation but could fail to adhere to the same level of software engineering standards as DE.</p>
<h3 id="heading-cloud-knowledge">CLOUD KNOWLEDGE</h3>
<p>DE professionals often possess in-depth knowledge of cloud platforms and their various services. They design and execute scalable data processing architectures that make use of the advantages of cloud infrastructure to manage and process massive datasets efficiently. Here, scalability is an important aspect, and knowledge of cloud services and distributed computing is crucial.</p>
<p>DS practitioners may not necessarily have a thorough understanding of cloud services, especially since their primary focus is data analysis. They may interact with cloud resources via more user-friendly interfaces or APIs, but they are not as skilled as data engineering specialists at optimising for scalability and cost-effectiveness. </p>
<h3 id="heading-productionalization">PRODUCTIONALIZATION</h3>
<p>DE specialists are responsible for taking data pipelines from development to production. They are well-versed in deploying complex data processing systems into production environments while taking security, scalability, and maintainability into account. They have a better understanding of the issues that might surface during productionization and are trained to deal with them.</p>
<p>DS professionals often prioritize data exploration, model development, and research. While working on models and analysis, they may not necessarily be as involved with the operational side of production systems. Ensuring that models work reliably and efficiently in real-world scenarios involves additional considerations, such as error handling and monitoring.</p>
<h3 id="heading-domain-knowledge">DOMAIN KNOWLEDGE</h3>
<p>DE specialists mostly deal with the design and maintenance of data pipelines, infrastructure, and data management systems. They might not always possess deep domain-specific knowledge or advanced statistical expertise since their primary objective is to ensure efficient data processing.</p>
<p>Data Scientists’ specialization in data analysis enables them to thoroughly grasp the underlying data and make informed conclusions. Their statistical grasping enables them to identify insights that Data Engineering specialists may not be focusing on.</p>
<h3 id="heading-tools-and-technologies">TOOLS AND TECHNOLOGIES</h3>
<p>DE professionals frequently use standard software development tools and practices. They employ tools such as Apache Spark, Apache Flink, Hadoop, and different ETL (Extract, Transform, Load) frameworks. They develop code in IDEs and use version control systems like Git to create modular and reusable components.</p>
<p>Data scientists use tools like Python's data science libraries (NumPy, pandas, scikit-learn), R, and specialized tools like TensorFlow and PyTorch for machine learning and deep learning tasks. Jupyter Notebooks are frequently used for exploratory data analysis and model experimentation, which is great for quick iterations and research but may skip some of the best practices that DE tools and workflows provide.</p>
<h2 id="heading-collaboration">COLLABORATION</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1695308910219/4861963f-b97b-4e58-b9f2-b998f64dc48f.jpeg" alt="Photo by &lt;a href=&quot;https://unsplash.com/@chrisliverani?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText&quot;&gt;Chris Liverani&lt;/a&gt; on &lt;a href=&quot;https://unsplash.com/photos/9cd8qOgeNIY?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText&quot;&gt;Unsplash&lt;/a&gt;   " class="image--center mx-auto" /></p>
<blockquote>
<p><a target="_blank" href="https://unsplash.com/@chrisliverani?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Chris Liverani</a><a target="_blank" href="https://unsplash.com/photos/9cd8qOgeNIY?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>Data engineers work together with software engineers, data analysts, and data scientists to navigate the complex environment of creating and maintaining robust data infrastructure. Their joint efforts provide smooth data flow, storage, and dependable processing, which supports the entire data ecosystem.</p>
<p>Data scientists often work closely with business stakeholders, domain experts, and decision-makers to translate data insights into actionable strategies. They play a crucial role in bridging the gap between technical insights and business outcomes. </p>
<h3 id="heading-collaborative-significance">COLLABORATIVE SIGNIFICANCE</h3>
<p>Collaboration between data engineers and data scientists is of the utmost importance for the smooth completion of analytics projects. Data engineers lay the framework for data scientists by building dependable infrastructure capable of storing and retrieving enormous amounts of data in multiple formats. This collaboration is critical because it allows for transparent communication across teams throughout the project journey - from understanding business requirements to applying machine learning models in real-world contexts. This collaborative cooperation ensures that data is effectively packaged and distributed for analysis, allowing data scientists to gain useful insights and make educated decisions.</p>
<h3 id="heading-benefits-of-collaboration">BENEFITS OF COLLABORATION</h3>
<p>Collaboration provides numerous benefits that go beyond the sum of individual efforts. Here are some significant benefits:</p>
<ol>
<li><p>Cross-Disciplinary Insights: Data engineers contribute technical knowledge, and data scientists bring analytical skills. Collaboration leads to the integration of technical expertise and analytical thinking, resulting in more comprehensive insights and new solutions.</p>
</li>
<li><p>Efficient Data Preparation: Data scientists rely on well-prepared, cleansed, and processed data for their research. Collaboration with data engineers ensures that data scientists obtain data in a format optimised for their analysis, reducing time spent on data preparation and reducing errors.</p>
</li>
<li><p>Seamless Model Deployment: Collaboration ensures that data engineers create tools that allow data science models to be deployed smoothly into production environments. This makes it easier to use predictive models in real-world circumstances, resulting in demonstrable economic benefit.  </p>
</li>
<li><p>Real-time Insights: Data engineers can collaborate to construct real-time data pipelines that provide data scientists with up-to-date information. This is especially useful for time-critical analysis and decision-making.</p>
</li>
</ol>
<p>In essence, collaboration between data engineers and data scientists is a bridge that unites technical foundations with analytical research, resulting in a coordinated, productive, and impactful data-driven environment.</p>
<h2 id="heading-conclusion">CONCLUSION</h2>
<p>In conclusion, the dynamic landscape of data-related roles is undergoing a significant shift, with Data Engineering emerging as the frontrunner while Data Science experiences a gradual decline. This change can be explained by Data Engineering’s growing importance in managing and improving data infrastructure to get useful insights. As organizations face the challenges of managing large volumes of data, the need for qualified Data Engineers continues to rise. They are beginning to understand that without a robust Data Engineering framework, the full impact of Data Science will not be achieved. In this digital age, the businesses that focus on Data Engineering are the ones that can unlock the real potential of their data and drive innovation leading to sustainable growth.</p>
]]></content:encoded></item><item><title><![CDATA[Top Data Certifications for a Successful 2024]]></title><description><![CDATA[In the fast-paced realm of data engineering, staying ahead of the curve with cutting-edge certifications is your passport to unlocking a world of exhilarating career prospects. As we brace for the challenges and opportunities of 2024, the demand for ...]]></description><link>https://anujsyal.com/top-data-certifications-for-a-successful-2024</link><guid isPermaLink="true">https://anujsyal.com/top-data-certifications-for-a-successful-2024</guid><category><![CDATA[Certification]]></category><category><![CDATA[AWS]]></category><category><![CDATA[Databricks]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[Microsoft]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Thu, 27 Jul 2023 15:36:29 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1690472030085/5dc35318-cafc-4bc0-ac73-880676c0508f.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the fast-paced realm of data engineering, staying ahead of the curve with cutting-edge certifications is your passport to unlocking a world of exhilarating career prospects. As we brace for the challenges and opportunities of 2024, the demand for skilled data engineers continues to soar, presenting an ideal moment to seize the best data engineering certifications available. Welcome to our blog, where we'll be your guiding light, illuminating the path for aspiring data engineers like you, showcasing the must-have certifications for the upcoming year.</p>
<p>Embark with me on this exhilarating journey, as we unravel the advantages of each certification, unveil the extraordinary job opportunities they open, and analyze the red-hot demand they command in the ever-evolving market. If you're a data engineer seeking to propel your career to unprecedented heights, brace yourself for what lies ahead in 2024 – a world of boundless opportunities and endless success!</p>
<h2 id="heading-aws-certified-big-data-specialityhttpsawsamazoncomcertificationcertified-big-data-specialty"><a target="_blank" href="https://aws.amazon.com/certification/certified-big-data-specialty/">AWS Certified Big Data Speciality</a></h2>
<p>Amazon Web Services (AWS) offers the AWS Certified Big Data Speciality certification, which validates data engineers' expertise in creating and implementing big data solutions using AWS services. The exam places immense value on hands-on expertise, ensuring that certified professionals have the required practical abilities. AWS certification additionally provides opportunities for networking with a community whose skills transfer across cloud platforms, broadening career choices beyond AWS-specific projects.</p>
<p>Let's delve into the essential features of this certification:</p>
<ol>
<li><p>AWS Big Data Services: This certification covers a wide range of AWS big data services, including Amazon S3, Amazon EMR, Amazon Redshift, Amazon Athena, AWS Glue, AWS Lambda, Amazon Kinesis, and more. It provides in-depth knowledge of these services and how they can be used for various big data scenarios.</p>
</li>
<li><p>Data Streaming and Real-time Analytics: The certification covers Amazon Kinesis, a service for ingesting, processing, and analyzing real-time streaming data. You'll learn how to capture and process data from various sources, perform real-time analytics, and gain insights from streaming data using Kinesis Data Streams, Kinesis Data Firehose, and Kinesis Data Analytics.</p>
</li>
<li><p>Data Warehousing and Business Intelligence: The certification delves into Amazon Redshift, AWS's data warehousing service. You'll gain a deep understanding of how to design and optimize Redshift clusters for data warehousing and learn techniques for building effective business intelligence solutions on top of Redshift.</p>
<p> <strong>DEMAND &amp; OPPORTUNITIES</strong></p>
<p> With AWS being one of the leading cloud service providers, organizations across industries seek professionals skilled in AWS Big Data services like Amazon EMR, Redshift, and Athena. AWS Certified Big Data opens up various job opportunities for data engineers, such as Big Data Engineer, Data Architect, Data Analyst, or Cloud Data Engineer. Having this certification sets you apart from the competition, providing you an advantage in the job market and raising your profile with future employers.</p>
<p> AWS certifications, including AWS Certified Big Data, have a global presence and are highly regarded in many countries. Globally and across industries, businesses are moving their data infrastructure to the cloud, requiring the demand for skilled data engineers familiar with AWS services. AWS certifications are highly sought after in the United States, particularly in technology hubs like Silicon Valley and major cities with a strong tech industry presence. Along with that, AWS certifications are also valued in countries like India, Australia, and Singapore, where there is substantial cloud adoption and a growing tech ecosystem.</p>
</li>
</ol>
<h2 id="heading-microsoft-certified-azure-data-engineer-associatehttpslearnmicrosoftcomen-uscertificationsazure-data-engineer"><a target="_blank" href="https://learn.microsoft.com/en-us/certifications/azure-data-engineer/">Microsoft Certified: Azure Data Engineer Associate</a></h2>
<p>The Microsoft Certified: Azure Data Engineer Associate certification provides comprehensive coverage of various Azure data services and tools, enabling data professionals to leverage the full potential of Azure's data capabilities. With this certification, data engineers gain deep knowledge of these services and learn how to design, implement, and manage data solutions on Azure. To earn this certification, candidates need to pass the DP-203 (Data Engineering on Microsoft Azure) exam, which replaced the earlier DP-200 and DP-201 exams. It covers a range of topics, including data storage, data processing, data integration, data security, and monitoring and optimization of Azure data solutions.</p>
<p>Here are a few key features that make this certification stand out:</p>
<ol>
<li><p>Azure Data Services Knowledge: This certification focuses on Azure data services and tools, equipping you with comprehensive knowledge of Azure's data offerings. It covers various services such as Azure Data Factory, Azure Databricks, Azure SQL Database, Azure Synapse Analytics, and Azure Cosmos DB.</p>
</li>
<li><p>Data Engineering Concepts: The certification delves into key data engineering concepts, including data ingestion, data transformation, data storage, data integration, data orchestration, data security, and data governance.</p>
</li>
<li><p>Collaboration and DevOps: The certification places a strong emphasis on collaboration and DevOps techniques in projects involving data engineering. You'll discover how to work well with cross-functional teams, apply DevOps principles to data pipelines, automate the procedures involved in data engineering, and put continuous integration and deployment into practice.</p>
</li>
</ol>
<p><strong>DEMAND &amp; OPPORTUNITIES</strong></p>
<p>There is a rising need for experts with experience in Azure data engineering as firms use Azure increasingly for their data needs. Being a Microsoft certification, it carries substantial industry recognition and credibility. This enhances certified data engineers' exposure to potential employers and improves their chances of landing data engineering positions involving projects and implementations related to Azure. Holding the Microsoft Certified: Azure Data Engineer Associate certification can lead to job roles like Azure Data Engineer, Data Architect, Data Integration Engineer, or Data Platform Engineer.</p>
<p>Azure certifications, including the Azure Data Engineer Associate, are in demand throughout the United States, particularly in industries like finance, healthcare, and technology. Azure certifications also have a strong demand in European countries, including the United Kingdom, Germany, and the Netherlands, where Azure is widely used. As Azure continues to expand its footprint worldwide and gain market share, the demand for professionals skilled in Azure data engineering is expected to grow in other countries as well like India, China and Japan. Due to the availability of Azure on a global scale and Microsoft's large market presence, the demand for people with the Microsoft Certified: Azure Data Engineer Associate certification can be seen in many nations across the world.</p>
<h2 id="heading-google-cloud-certified-professional-data-engineerhttpscloudgooglecomlearncertificationdata-engineer"><a target="_blank" href="https://cloud.google.com/learn/certification/data-engineer">Google Cloud Certified - Professional Data Engineer</a></h2>
<p>The Google Cloud Certified - Professional Data Engineer certification is designed to validate the skills and knowledge of data engineers in designing and building data processing systems and solutions on the Google Cloud Platform (GCP). The Professional Data Engineer certification offers comprehensive coverage of Google Cloud Platform's data services, including Google BigQuery, Google Cloud Storage, Google Cloud Dataflow, Google Cloud Pub/Sub, and more. By obtaining this certification, data engineers gain a deep understanding of these services and learn how to architect scalable, reliable, and secure data solutions on GCP.</p>
<p>Discover the noteworthy features that define this certification:</p>
<ol>
<li><p>Comprehensive GCP Data Engineering Knowledge: This certification covers a wide range of topics related to data engineering on the Google Cloud Platform. It encompasses data ingestion techniques, data transformation methods, data storage and processing solutions, data analysis and visualization tools, and machine learning integration for data engineering projects.</p>
</li>
<li><p>Advanced Data Engineering Concepts: In-depth advanced data engineering principles are covered in the certification, including designing data pipelines, building scalable data structures, optimizing data storage and retrieval, establishing data security and compliance standards in place, and incorporating data governance and quality procedures.</p>
</li>
<li><p>Hands-on Experience: The certification emphasizes practical experience with GCP data engineering tools and services. It assesses your ability to architect, build, and optimize data processing systems using GCP services like BigQuery, Dataflow, Dataproc, Pub/Sub, Cloud Storage, and Data Studio.</p>
</li>
</ol>
<p><strong>DEMAND &amp; OPPORTUNITIES</strong></p>
<p>As more organizations embrace cloud-based data solutions, the demand for professionals skilled in GCP data engineering is rapidly growing. Earning this certification demonstrates your proficiency in designing, building, and optimizing data solutions on GCP. This certification can create job opportunities as Data Engineer, Data Architect, Cloud Data Engineer, or Data Analyst.</p>
<p>Google Cloud certifications have been gaining traction globally as more organizations adopt Google Cloud Platform (GCP) for their data engineering needs. The demand for Google Cloud certifications, including the Professional Data Engineer, is prominent in the United States, especially in technology-driven regions like California. GCP certifications, including the Professional Data Engineer, also have a growing demand in countries like India, Singapore, and Australia, as Google Cloud expands its presence in the region.</p>
<h2 id="heading-databricks-certified-associate-developer-for-apache-sparkhttpswwwdatabrickscomlearncertificationapache-spark-developer-associate"><a target="_blank" href="https://www.databricks.com/learn/certification/apache-spark-developer-associate">Databricks Certified Associate Developer for Apache Spark</a></h2>
<p>Apache Spark is a distributed computing framework designed for big data processing and analytics. The Databricks Certified Associate Developer certification focuses on Spark and covers various aspects of its architecture, core components, and programming concepts. By obtaining this certification, data professionals gain a comprehensive understanding of Spark's capabilities and learn how to utilize its full potential to solve complex data problems.</p>
<p>Discover the noteworthy features that define this certification:</p>
<ol>
<li><p>Apache Spark Fundamentals: This certification covers the fundamental concepts of Apache Spark, including RDDs (Resilient Distributed Datasets), transformations, actions, Spark SQL, Spark Streaming, and MLlib. It provides a solid foundation in understanding Spark's core components and functionalities.</p>
</li>
<li><p>Hands-on Spark Development: The certification focuses on hands-on experience with Spark development. It includes exercises and projects that require you to write Spark applications using Scala, Python, or SQL. You'll learn how to work with Spark clusters, write efficient Spark code, and optimize Spark jobs.</p>
</li>
<li><p>Machine Learning with MLlib: The certification covers Spark's MLlib library, which provides a rich set of machine learning algorithms and tools. You'll gain expertise in using MLlib to train and evaluate machine learning models, perform feature engineering, and make predictions or recommendations using Spark.</p>
</li>
</ol>
<p><strong>DEMAND &amp; OPPORTUNITIES</strong></p>
<p>Apache Spark is widely adopted in industries that deal with large-scale data processing and analytics, creating a strong demand for professionals with Spark expertise. Earning this certification demonstrates your proficiency in Spark development and validates your ability to work with Spark clusters, design efficient data processing workflows, and apply Spark for machine learning tasks. Professionals holding this certification can pursue roles like Spark Developer, Data Engineer, Big Data Engineer, or Data Analyst.</p>
<p>As more organizations across different countries adopt Spark for their big data processing and analytics needs, the demand for professionals skilled in Spark development is expected to grow. Databricks certifications, including the Associate Developer for Apache Spark, have gained recognition among data engineering and data science professionals. Spark has a strong presence in countries such as the United States, United Kingdom, Canada, Australia, Germany, India, and many others.</p>
<h2 id="heading-databricks-certified-machine-learning-associatehttpswwwdatabrickscomlearncertificationmachine-learning-associate"><a target="_blank" href="https://www.databricks.com/learn/certification/machine-learning-associate">Databricks Certified Machine Learning Associate</a></h2>
<p>Databricks is a unified data analytics platform that brings together data engineering, data science, and business analytics in one collaborative environment. This platform provides a seamless experience for data scientists to develop, test, and deploy machine learning models at scale. The Databricks Certified Machine Learning Associate certification equips data scientists with the expertise to leverage the platform's capabilities and harness the power of machine learning. The Databricks Certified Machine Learning Associate certification is a valuable credential for individuals who want to showcase their expertise in applying machine learning techniques using the Databricks Unified Analytics Platform.</p>
<p>Take a closer look at the significant features inherent to this certification:</p>
<ol>
<li><p>Machine Learning Concepts: Essential machine learning ideas including supervised learning, unsupervised learning, and deep learning are covered in the certification. A strong foundation in machine learning principles is provided by its exploration of algorithms, model evaluation, and feature engineering techniques.</p>
</li>
<li><p>Databricks Platform: Candidates gain expertise in using Databricks notebooks, Databricks Runtime, and Databricks MLflow for building, training, and deploying machine learning models.</p>
</li>
<li><p>Integration with Big Data Technologies: The certification covers the integration of machine learning with big data technologies. Candidates learn how to work with large datasets stored in distributed file systems like Hadoop Distributed File System (HDFS) or cloud-based storage systems.</p>
</li>
</ol>
<p><strong>DEMAND &amp; OPPORTUNITIES</strong></p>
<p>Earning this certification demonstrates your proficiency in machine learning on the Databricks platform and certifies your ability to use Databricks tools for model development, training, and deployment. The Databricks Certified Machine Learning Associate certification opens up job opportunities as Machine Learning Engineer, Data Scientist, AI Engineer, or ML Platform Engineer. The certification offers industry recognition, specialized Databricks skills, expanded career opportunities, and a competitive advantage in the job market.</p>
<p>The demand for Databricks certifications, including the Machine Learning Associate, is driven by the adoption of Databricks as a unified analytics platform. Databricks certifications have gained popularity in the United States, as organizations leverage Databricks for machine learning initiatives. It is also sought after in European countries where Databricks is used for data engineering and machine learning tasks, including the United Kingdom, Germany, and the Nordics.</p>
<h2 id="heading-conclusion">CONCLUSION</h2>
<p>As we brace for the boundless opportunities of 2024, the field of data engineering promises immense potential for career growth. To thrive in this dynamic industry, aspiring data engineers must set their sights on certifications that elevate their skills and knowledge.</p>
<p>Among the top certifications to consider, the AWS Certified Big Data shines, equipping you with expertise in creating and implementing AWS-driven big data solutions. The Microsoft Certified: Azure Data Engineer Associate certification validates your proficiency in data engineering on the Azure platform. Meanwhile, the Google Cloud Certified - Professional Data Engineer addresses the surging demand for GCP data engineering prowess.</p>
<p>For those seeking to conquer Spark development and big data processing, the Databricks Certified Associate Developer for Apache Spark emerges as a compelling choice. And don't overlook the Databricks Certified Machine Learning Associate, showcasing your mastery in applying machine learning techniques through the Databricks platform.</p>
<p>As data engineering continues to evolve and drive data-driven insights, these certifications hold the power to accelerate your career, broaden your skill set, and position you as a highly sought-after data engineering professional. Embrace the opportunity that awaits, embark on your certification journey, and unlock a world of endless possibilities in the exhilarating realm of data engineering throughout 2024!</p>
]]></content:encoded></item><item><title><![CDATA[Top 5 New Data Engineering Technologies to Learn in 2023]]></title><description><![CDATA[In today's fast-paced digital world, keeping up with the latest advancements in data engineering is crucial to stay ahead of the competition. With the amount of data collected every day increasing, data engineering plays an important role in guarante...]]></description><link>https://anujsyal.com/top-5-new-data-engineering-technologies-to-learn-in-2023</link><guid isPermaLink="true">https://anujsyal.com/top-5-new-data-engineering-technologies-to-learn-in-2023</guid><category><![CDATA[chatgpt]]></category><category><![CDATA[apache]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[Delta Lake]]></category><category><![CDATA[Apache Superset]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Wed, 17 May 2023 05:44:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/HgoKvtKpyHA/upload/724711bdb4d3fa9e2c97918e2ed11d4f.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today's fast-paced digital world, keeping up with the latest advancements in data engineering is crucial to stay ahead of the competition. With the amount of data collected every day increasing, data engineering plays an important role in guaranteeing data accuracy, consistency, and reliability for enterprises.</p>
<p>In this blog, we will be discussing the top 5 new data engineering technologies that you should learn in 2023 to stay ahead of the curve. Each of the technologies we will be looking at brings a unique set of capabilities and benefits to the table that can help businesses improve their data engineering processes and make better data-driven decisions. So, let’s dive in and learn!</p>
<h2 id="heading-apache-superset">APACHE SUPERSET</h2>
<p><a target="_blank" href="https://superset.apache.org">Apache Superset</a> is a modern, open-source data visualisation and exploration platform that allows businesses to analyse and visualise data from multiple sources in real-time. Apache Superset was initially launched in 2016 by Airbnb as an internal tool but was later <a target="_blank" href="https://news.apache.org/foundation/entry/the-apache-software-foundation-announces70">open-sourced in 2017</a> and has since become a popular choice for businesses and organisations. Apache Superset is designed to be extremely scalable and capable of managing massive amounts of data without sacrificing performance.</p>
<p>The <strong>most notable feature</strong> of Apache Superset is its ability to connect to a wide range of data sources, including SQL-based databases, Druid, Hadoop, and cloud-based data warehouses such as Amazon Redshift and Google BigQuery. As a result, it is a very adaptable tool that can easily be integrated into existing data infrastructures.</p>
<p>Let’s explore some of the features of Apache Superset:</p>
<ol>
<li><p><strong>Data Visualisation:</strong> Provides various visualisation options, such as line charts, scatter plots, pivot tables, heat maps, and more. Users can customise these visualisations to suit their branding and style.</p>
</li>
<li><p><strong>Advanced Analytics:</strong> In addition to data visualisation, Apache Superset also offers advanced analytics features, including predictive analytics and machine learning capabilities. This enables firms to acquire insights into their data and make well-informed decisions based on real-time data analysis.</p>
</li>
<li><p><strong>Dashboard Sharing:</strong> Makes it easy for users to share their dashboards with others. Users can share dashboards via a URL or embed them in other applications using an iframe.</p>
</li>
<li><p><strong>Query Building</strong>: <a target="_blank" href="https://apache-superset.readthedocs.io/en/0.28.1/sqllab.html">Query builder interface</a> enables users to create complex queries using a drag-and-drop interface. Users can also write SQL queries directly if they prefer.</p>
</li>
</ol>
<p>Overall, Superset is anticipated to gain more popularity in 2023 as companies seek open-source substitutes for proprietary data visualisation software. If you're keen on data visualisation and reporting, Superset is an excellent tool to acquire knowledge.</p>
<h2 id="heading-apache-iceberg">APACHE ICEBERG</h2>
<p><a target="_blank" href="https://iceberg.apache.org">Apache Iceberg</a> is an open-source data storage and query processing platform that was developed to provide a modern, scalable, and efficient way of managing large datasets. It is made to accommodate a variety of workloads, such as batch and interactive processing, machine learning, and ad-hoc queries. Apache Iceberg was created by the team at Netflix and was open-sourced in 2018.</p>
<p>One of the <strong>most significant features</strong> of Apache Iceberg is its support for schema evolution. As datasets grow and change over time, it's crucial to be able to add or remove columns from a table without interfering with already-running applications or queries. Apache Iceberg allows users to add or remove columns in a table without having to rewrite the entire dataset. This makes it easy to evolve and maintain data models as business needs change.</p>
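<p>To make this concrete, here is a minimal sketch of schema evolution using PySpark SQL. It is illustrative only: it assumes a Spark session that has already been configured with an Iceberg catalog, and the catalog, database, and table names are made up.</p>
<pre><code class="lang-python"># Minimal sketch, assuming spark is a SparkSession configured with an Iceberg
# catalog named "demo" (catalog/table names below are purely illustrative).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-schema-evolution").getOrCreate()

# Create an Iceberg table, then evolve its schema.
# Iceberg tracks schema changes in metadata, so existing data files are not rewritten.
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
spark.sql("ALTER TABLE demo.db.events ADD COLUMN country STRING")
spark.sql("ALTER TABLE demo.db.events DROP COLUMN country")
</code></pre>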
<p>Let’s look over the benefits provided by Apache Iceberg:</p>
<ol>
<li><p><strong>Efficient Query Processing</strong>: Uses a columnar format that reduces the amount of data that needs to be read from disk, which improves query performance. It also supports predicate pushdown and other optimizations that further improve query performance.</p>
</li>
<li><p><strong>Data Consistency:</strong> Combination of versioning and snapshot isolation ensures that readers and writers never interfere with each other. Data is always in a consistent state, even during updates or when multiple users are accessing the same data simultaneously.</p>
</li>
<li><p><strong>Easy Integration:</strong> Designed to be easy to integrate with existing data processing frameworks, such as Apache Spark, Apache Hive, and Presto. It provides connectors for these frameworks, which makes it easy to start using Iceberg with minimal changes to existing code.</p>
</li>
<li><p><strong>Scalability:</strong> It supports partitioning and clustering, which allows users to organise their data into smaller, more manageable chunks. This makes it easier to distribute and process large datasets across multiple nodes in a cluster.</p>
</li>
<li><p><strong>Data Management:</strong> Provides a modern, efficient, and scalable way of managing large datasets. It makes it easy to store, organise, and query data, which can improve data quality and increase business agility.</p>
</li>
</ol>
<p>Hence, Apache Iceberg should be learnt for its ability to handle large datasets efficiently and its support for schema evolution, which are critical for modern data management scenarios. It is also a popular technology used by many organisations, making it a valuable skill to have.</p>
<h2 id="heading-great-expectations">GREAT EXPECTATIONS</h2>
<p><a target="_blank" href="https://greatexpectations.io">Great Expectations</a> is an open-source Python library that provides a set of tools for testing and validating data pipelines. First launched in October 2019 as an open-source project on GitHub, it enables users to specify "expectations" for their data - assertions or limitations on how their pipelines should behave. These expectations can be simple rules, like checking for missing values or checking that a column contains only certain values, or more complex constraints, like ensuring that the correlation between two columns falls within a certain range. Additionally, the library offers a number of tools for visualising and documenting data pipelines, making it simple to comprehend and troubleshoot complex data workflows.</p>
<p>Several key features make Great Expectations a valuable tool for data engineers:</p>
<ol>
<li><p><strong>Expectation Library:</strong> Provides a comprehensive library of pre-defined expectations for common data quality checks. Users can also define their own custom expectations to meet specific requirements.</p>
</li>
<li><p><strong>Data Documentation:</strong> Makes it easier to document and understand the data used in pipelines, providing data dictionaries that capture metadata, such as column descriptions, data sources, and data owners. This allows teams to collaborate and understand the data being used in their pipelines.</p>
</li>
<li><p><strong>Data Validation:</strong> Offers a range of validation tools, such as data profiling, schema validation, and batch validation, which help users catch issues and errors in their pipelines before they cause downstream problems.</p>
</li>
<li><p><strong>Extensibility:</strong> Easy integration with a wide range of data processing and analysis tools, such as Apache Spark, Pandas, and SQL databases. This allows users to use Great Expectations with their existing data stack and workflows.</p>
</li>
<li><p><strong>Automation:</strong> Provides a suite of tools for automating the testing and validation of data pipelines, including integration with workflow management tools, such as Apache Airflow and Prefect. This enables users to automate the monitoring and validation of their pipelines to ensure data quality and reliability over time.</p>
</li>
</ol>
<p>Data engineers should learn Great Expectations in 2023 because it offers a comprehensive suite of data validation, documentation, and automation tools. As data quality becomes increasingly important, Great Expectations provides a reliable solution for ensuring data integrity. Furthermore, its integration with popular data processing tools makes it a valuable addition to any data engineer's toolkit.</p>
<h2 id="heading-delta-lake">DELTA LAKE</h2>
<p><a target="_blank" href="https://delta.io">Delta Lake</a> is an open-source storage layer that is designed to improve the reliability, scalability, and performance of data lakes. It was initially released in 2019 by Databricks and has since gained popularity among data teams and has become an important tool for managing and maintaining data lakes. Data dependability is provided by Delta Lake, which is built on top of Apache Spark, using a transactional layer to make sure that all data updates are atomic and consistent.</p>
<p>Delta Lake has several features to offer that make it a valuable tool for data teams:</p>
<ol>
<li><p><strong>ACID Transactions:</strong> Delta Lake uses atomic, consistent, isolated, and durable (ACID) transactions to ensure data reliability. This means that data changes are atomic and consistent, and can be rolled back in the event of a failure.</p>
</li>
<li><p><strong>Schema Enforcement:</strong> Supports schema enforcement, which ensures that all data stored in the data lake conforms to a predefined schema. This helps to improve data quality and reduces the risk of errors and inconsistencies in the data.</p>
</li>
<li><p><strong>Data Versioning:</strong> Supports data versioning, allowing users to track changes to their data over time. This helps to ensure data lineage and enables teams to audit and understand changes to their data over time.</p>
</li>
<li><p><strong>Performance:</strong> Delta Lake is designed for performance and can support petabyte-scale data lakes. It also includes optimizations such as indexing and caching to improve query performance.</p>
</li>
<li><p><strong>Open Source:</strong> Delta Lake is an open-source project, meaning that it can be used and contributed to by the wider community. This helps to drive innovation and ensures that Delta Lake remains a flexible and evolving solution.</p>
</li>
</ol>
<p>Since its debut, Delta Lake has grown significantly in popularity, and by 2023 data engineers are expected to be familiar with this tool. With more businesses switching to cloud-based solutions for their data infrastructure, Delta Lake is becoming an increasingly important tool for data teams owing to its support for cloud storage services and its capacity to handle difficult data management problems. Furthermore, as more businesses seek to leverage the power of big data and advanced analytics to drive informed decision-making, the need for reliable and scalable data management solutions like Delta Lake will only continue to grow.</p>
<h2 id="heading-chatgpt">ChatGPT</h2>
<p><a target="_blank" href="https://en.wikipedia.org/wiki/ChatGPT">ChatGPT</a> is a large language model developed by OpenAI and released in June 2020, It is based on the GPT-3.5 architecture and designed to generate human-like responses to natural language queries and conversations. The model is capable of understanding and generating responses in multiple languages, and it can be fine-tuned on specific domains or tasks to improve its performance. ChatGPT's ability to perform multiple tasks such as text classification, sentiment analysis, and language translation can help data engineers to gain insights from unstructured data.</p>
<p>One of <strong>ChatGPT's key strengths</strong> is its capacity to generate open-ended responses to inquiries and conversations, enabling users to have impromptu conversations with the model. ChatGPT is trained on a massive corpus of text data, which allows it to generate responses that are contextually relevant and grammatically correct.</p>
<p>Some valuable features of ChatGPT that make it an all-rounder are:</p>
<ol>
<li><p><strong>Contextual understanding:</strong> ChatGPT can understand the context of a conversation and generate responses that are relevant to the topic being discussed.</p>
</li>
<li><p><strong>Machine learning:</strong> Based on deep learning algorithms that enable it to learn and improve over time based on the data it processes.</p>
</li>
<li><p><strong>Customization:</strong> ChatGPT can be fine-tuned on specific domains or tasks to improve its accuracy and effectiveness.</p>
</li>
<li><p><strong>Content Creation:</strong> Used to generate content for websites, blogs, and social media posts. This can save content creators time and effort while ensuring that the content generated is high-quality and engaging.</p>
</li>
<li><p><strong>Language translation:</strong> The ability to understand and generate responses in multiple languages makes it a valuable tool for language translation services.</p>
</li>
</ol>
<p>ChatGPT is an AI-powered chatbot that can help data engineers and other professionals automate repetitive tasks, streamline workflows, and improve productivity. As AI and natural language processing continue to advance, ChatGPT is poised to become an increasingly valuable tool for data engineering teams in 2023 and beyond. Learning how to use ChatGPT can help data engineers stay ahead of the curve and enhance their data engineering capabilities.</p>
<h2 id="heading-conclusion">CONCLUSION</h2>
<p>In conclusion, data engineering is an ever-evolving field, and staying up-to-date with the latest technologies and tools is crucial to gain a competitive edge in the industry. From Apache Superset, which provides powerful data visualisation capabilities, to Apache Iceberg, which offers easy and efficient table evolution, these technologies can help data engineers work more efficiently and effectively. Great Expectations can ensure data quality and maintain data integrity, while Delta Lake provides a reliable and efficient way to manage big data. On the other hand, ChatGPT offers an innovative and interactive way to create conversational AI models. By learning these technologies, data engineers can stay ahead of the curve and be better equipped to handle the complex challenges of data management and analysis. So, don't wait - start exploring these exciting tools and stay on top of the latest trends in data engineering in 2023 and beyond.</p>
<p>If you are further interested in this topic, check out my YouTube video:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/mDoNk2y1GUg">https://youtu.be/mDoNk2y1GUg</a></div>
]]></content:encoded></item><item><title><![CDATA[Create Your First ETL Pipeline with Python]]></title><description><![CDATA[When it comes to pursuing a career in the field of Data and specifically Data Engineering and many other tech-related fields, Python comes off as a powerful tool. As you will be forging ahead in your profession, this programming language will be conv...]]></description><link>https://anujsyal.com/create-your-first-etl-pipeline-with-python</link><guid isPermaLink="true">https://anujsyal.com/create-your-first-etl-pipeline-with-python</guid><category><![CDATA[Python]]></category><category><![CDATA[ETL]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[fundamentals]]></category><category><![CDATA[pandas]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Mon, 23 Jan 2023 07:54:38 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/JUDPnpHHRqs/upload/540fa343807fd073489b81ec130d706a.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When it comes to pursuing a career in the field of Data and specifically Data Engineering and many other tech-related fields, <a target="_blank" href="https://www.python.org/">Python</a> comes off as a powerful tool. As you will be forging ahead in your profession, this programming language will be convenient in many ways. </p>
<p>Without further ado, let’s dive into the fundamentals of Python that are needed to create your first ETL Pipeline!</p>
<h1 id="heading-a-demonstration-of-the-etl-process-using-python"><strong>A Demonstration of the ETL Process using Python</strong></h1>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=uqRRjcsUGgk">https://www.youtube.com/watch?v=uqRRjcsUGgk</a></div>
<p> </p>
<p>It may be helpful to use an actual bare-bones example to illustrate how to build an ETL pipeline to gain a better understanding of the subject. With this, we will better understand how easy Python is to use as a whole.</p>
<p>Create a file called <a target="_blank" href="http://etl.py">etl.py</a> in the text editor of your choice and add the following docstring.</p>
<pre><code class="lang-python"><span class="hljs-string">"""
Python Extract Transform Load Example
"""</span>
</code></pre>
<p>We will begin with a basic ETL Pipeline consisting of essential elements needed to extract the data, then transform it, and finally, load it into the right places. At this step, things are not as complex as they might seem, even if you are a complete beginner at it. </p>
<p>So as we go down the path, you will see how easy it is to use Python for building any such ETL pipelines. </p>
<h2 id="heading-importing-the-right-packages"><strong>Importing the Right Packages</strong></h2>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> sqlalchemy <span class="hljs-keyword">import</span> create_engine
</code></pre>
<p>Coming to this step, you will realize how resourceful Python is as a tool, because it carries an ecosystem of libraries around data and general programming that makes it painless as well as effective to use. Importing the right libraries is the first step to creating anything using Python. </p>
<p>Here, the first of the three imports is the requests library, which helps pull data from an API and handles the extraction of data. Apart from that, <a target="_blank" href="https://pandas.pydata.org/">Pandas</a> is another library to perform transformation and manipulation of data. This is similar to Excel on steroids, with the only difference being that it is based on using code. Hence, pandas can be used to read data in Excel and CSV formats (basically, anything in a tabular format), and we can easily transform it using pandas. The last on this list is <a target="_blank" href="https://sqlalchemy.org/">SQLAlchemy</a>, which is meant to support creating a connection to a database (essentially, an SQLite database). </p>
<h2 id="heading-step-1-extract"><strong>Step 1: Extract</strong></h2>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">extract</span>()-&gt; dict:</span>
    <span class="hljs-string">""" This API extracts data from
    http://universities.hipolabs.com
    """</span>
    API_URL = <span class="hljs-string">"http://universities.hipolabs.com/search?country=United+States"</span>
    data = requests.get(API_URL).json()
    <span class="hljs-keyword">return</span> data
</code></pre>
<p>The first step of creating the pipeline is the extract. As shown in the sample, we are extracting from an API source that is freely available for use. This particular call retrieves information on the universities available in the United States as a whole. When we run this API, it provides the data back to us in JSON format, as in the sample shown below.</p>
<pre><code class="lang-json">[{<span class="hljs-attr">"web_pages"</span>: [<span class="hljs-string">"http://www.marywood.edu"</span>], <span class="hljs-attr">"state-province"</span>: <span class="hljs-literal">null</span>, <span class="hljs-attr">"alpha_two_code"</span>: <span class="hljs-string">"US"</span>, <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Marywood University"</span>, <span class="hljs-attr">"country"</span>: <span class="hljs-string">"United States"</span>, <span class="hljs-attr">"domains"</span>: [<span class="hljs-string">"marywood.edu"</span>]}, {<span class="hljs-attr">"web_pages"</span>: [<span class="hljs-string">"http://www.lindenwood.edu/"</span>], <span class="hljs-attr">"state-province"</span>: <span class="hljs-literal">null</span>, <span class="hljs-attr">"alpha_two_code"</span>: <span class="hljs-string">"US"</span>, <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Lindenwood University"</span>, <span class="hljs-attr">"country"</span>: <span class="hljs-string">"United States"</span>, <span class="hljs-attr">"domains"</span>: [<span class="hljs-string">"lindenwood.edu"</span>]}, {<span class="hljs-attr">"web_pages"</span>: [<span class="hljs-string">"https://sullivan.edu/"</span>], <span class="hljs-attr">"state-province"</span>: <span class="hljs-literal">null</span>, <span class="hljs-attr">"alpha_two_code"</span>: <span class="hljs-string">"US"</span>, <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Sullivan University"</span>, <span class="hljs-attr">"country"</span>: <span class="hljs-string">"United States"</span>, <span class="hljs-attr">"domains"</span>: [<span class="hljs-string">"sullivan.edu"</span>]}, {<span class="hljs-attr">"web_pages"</span>: [<span class="hljs-string">"https://www.fscj.edu/"</span>], <span class="hljs-attr">"state-province"</span>: <span class="hljs-literal">null</span>, <span class="hljs-attr">"alpha_two_code"</span>: <span class="hljs-string">"US"</span>, <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Florida State College at Jacksonville"</span>, <span class="hljs-attr">"country"</span>: <span class="hljs-string">"United States"</span>, <span class="hljs-attr">"domains"</span>: [<span class="hljs-string">"fscj.edu"</span>]}, {<span class="hljs-attr">"web_pages"</span>: [<span class="hljs-string">"https://www.xavier.edu/"</span>], <span class="hljs-attr">"state-province"</span>: <span class="hljs-literal">null</span>, <span class="hljs-attr">"alpha_two_code"</span>: <span class="hljs-string">"US"</span>, <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Xavier University"</span>, <span class="hljs-attr">"country"</span>: <span class="hljs-string">"United States"</span>, <span class="hljs-attr">"domains"</span>: [<span class="hljs-string">"xavier.edu"</span>]}, {<span class="hljs-attr">"web_pages"</span>: [<span class="hljs-string">"https://home.tusculum.edu/"</span>], <span class="hljs-attr">"state-province"</span>: <span class="hljs-literal">null</span>, <span class="hljs-attr">"alpha_two_code"</span>: <span class="hljs-string">"US"</span>, <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Tusculum College"</span>, <span class="hljs-attr">"country"</span>: <span class="hljs-string">"United States"</span>, <span class="hljs-attr">"domains"</span>: [<span class="hljs-string">"tusculum.edu"</span>]}, {<span class="hljs-attr">"web_pages"</span>: [<span class="hljs-string">"https://cst.edu/"</span>], <span 
class="hljs-attr">"state-province"</span>: <span class="hljs-literal">null</span>, <span class="hljs-attr">"alpha_two_code"</span>: <span class="hljs-string">"US"</span>, <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Claremont School of Theology"</span>, <span class="hljs-attr">"country"</span>: <span class="hljs-string">"United States"</span>, <span class="hljs-attr">"domains"</span>: [<span class="hljs-string">"cst.edu"</span>]}]
</code></pre>
<p>The data we get here is in the bare-bones structure used in HTTP or HTTPS calls. We get this data as an output from the URL:</p>
<p> “<a target="_blank" href="http://universities.hipolabs.com/search?country=United+States”">http://universities.hipolabs.com/search?country=United+States”</a> </p>
<p>We will use the requests library to achieve the extraction and obtain the response as JSON, which is a dictionary within Python. </p>
<h2 id="heading-step-2-transform"><strong>Step 2: Transform</strong></h2>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">transform</span>(<span class="hljs-params">data:dict</span>) -&gt; pd.DataFrame:</span>
    <span class="hljs-string">""" Transforms the dataset into desired structure and filters"""</span>
    df = pd.DataFrame(data)
    print(<span class="hljs-string">f"Total Number of universities from API <span class="hljs-subst">{len(data)}</span>"</span>)
    df = df[df[<span class="hljs-string">"name"</span>].str.contains(<span class="hljs-string">"California"</span>)]
    print(<span class="hljs-string">f"Number of universities in california <span class="hljs-subst">{len(df)}</span>"</span>)
    df[<span class="hljs-string">'domains'</span>] = [<span class="hljs-string">','</span>.join(map(str, l)) <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> df[<span class="hljs-string">'domains'</span>]]
    df[<span class="hljs-string">'web_pages'</span>] = [<span class="hljs-string">','</span>.join(map(str, l)) <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> df[<span class="hljs-string">'web_pages'</span>]]
    df = df.reset_index(drop=<span class="hljs-literal">True</span>)
    <span class="hljs-keyword">return</span> df[[<span class="hljs-string">"domains"</span>,<span class="hljs-string">"country"</span>,<span class="hljs-string">"web_pages"</span>,<span class="hljs-string">"name"</span>]]
</code></pre>
<p>This step is mainly about transforming the data to be in the right format and sequence. Mostly, the transformation of any data is done around particular business conditions and their requirements. For this specific sample, we have assumed a hypothetical condition where we are searching for universities in California.</p>
<p>Firstly, the extracted data arrives as a dictionary. This data will then be read into a pandas data frame. But what does pandas do? To elaborate, the pandas data frame is the library's core data structure, and it enables us to convert this dictionary into a data frame. Further, we can think of a data frame as a CSV which has rows and columns, with various added functionalities. It's a comprehensive tool when it comes to transforming data.</p>
<p>Next, we will filter [line 5 in the snippet] for all the universities whose name contains “California”. As mentioned before, pandas is like Excel on steroids, with a simple syntax that helps us with any such actions required to filter or transform the data.</p>
<p><img src="https://lh5.googleusercontent.com/WizVEEtWK8-Bq1qMMTSs_82E4agXfaZsSHJ4Oj7SL_pESw4K4UWE-6R9cQQ2vhLU-hP4CBXZZDcrKgEpx2Pf3urbjvOtJRs2yx9DAwmiBegUHYZIcOcgn-5NZnIN_JI9e7KQqrk1BQEWf2CPOyJFWYEkmxIsz08cA-uJz0NBLEZk-UWVK8Ho5oG_UjjcUw" alt /></p>
<h2 id="heading-step-3-load"><strong>Step 3: Load</strong></h2>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load</span>(<span class="hljs-params">df:pd.DataFrame</span>)-&gt; <span class="hljs-keyword">None</span>:</span>
    <span class="hljs-string">""" Loads data into a sqllite database"""</span>
    disk_engine = create_engine(<span class="hljs-string">'sqlite:///my_lite_store.db'</span>)
    df.to_sql(<span class="hljs-string">'cal_uni'</span>, disk_engine, if_exists=<span class="hljs-string">'replace'</span>)
</code></pre>
<p>The last step of creating this pipeline is about connecting to an SQLite database on the disk (creating it if it does not already exist). This kind of database can live on a host or a server. We will then save this data frame into a table by using <strong>df.to_sql</strong>, which is a method of the pandas data frame. Here, we further provide the <a target="_blank" href="https://www.sqlite.org/">SQLite</a> engine with a condition that if such a table already exists, it shall ‘<strong>replace</strong>’ it (via the if_exists argument).</p>
<p>With this load step complete, the transformed data ends up in a database. Because each step is written as a function, we can reuse this set of functions again. Together, they form a complete ETL pipeline consisting of all the elements needed to perform such actions on the data.</p>
<h2 id="heading-running-your-first-etl-pipeline"><strong>Running your first ETL pipeline</strong></h2>
<pre><code class="lang-python">data = extract()
df = transform(data)
load(df)
</code></pre>
<p>Finally, we need to execute the functions that make up this pipeline. As a result, we will get a data frame with all the columns, which at this point we load into the SQLite database. Now, what's left to be done is to run it (in a notebook, with the ‘<strong>shift</strong>’ &amp; ‘<strong>enter</strong>’ keys).   </p>
<p>While it will take some time to execute the code and call the API, we will get back the total number of universities in the United States, along with the number of universities in California. We can then store this transformed data in an SQLite database file, visible in the file explorer. For any future requirements, we can reuse the same ETL pipeline and fetch the data that we need. </p>
<p><img src="https://lh3.googleusercontent.com/eW5UbZJTfUua2mT7eBoUdTyz9oy7Gyy0ljv-VvV1M3QRXM8531nG6UX4W8F4mpER5CakLHIvzZW9HuSXli_C7twnLKBHdrjU_FRf_SJxWqxfql2rSrwp0c2hVnSMR-4M6XUtQUeCFHadoC0TIX6bJWaqmZ0BnJrR2oaGJZ64XRFGzZ0PxPX4tM_qRxiJ_A" alt /></p>
<p>This is simply how the ETL process works using Python to achieve whatever we want to extract out of data.  </p>
<h1 id="heading-conclusion"><strong>Conclusion</strong></h1>
<p>A programming language as versatile as Python is marked under the essentials by many data engineers, data scientists, and developers, including software engineers. Therefore, as a beginner in the field of data engineering, core knowledge of Python is a must-have skill. However, it is not necessary to know everything under the sun when it comes to Python. I’d like to emphasize that we will always be learning as we go, so there is no need to panic and try to gulp down everything at once! </p>
<p>Check out the video link below that talks about the same for a better explanation!</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=uqRRjcsUGgk&amp;feature=youtu.be">https://www.youtube.com/watch?v=uqRRjcsUGgk&amp;feature=youtu.be</a></div>
<p> </p>
<h2 id="heading-source-code">Source Code</h2>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/syalanuj/youtube/blob/main/de_fundamentals_python">https://github.com/syalanuj/youtube/blob/main/de_fundamentals_python</a></div>
]]></content:encoded></item><item><title><![CDATA[A Step-by-Step Roadmap To Data Engineering]]></title><description><![CDATA[The specialized field of data engineering is ever-expanding and its elements of it are scattered all around. But how do we come to grips with not being confused & consumed in the process of learning? 
Initially, you must follow a roughly drawn map an...]]></description><link>https://anujsyal.com/a-step-by-step-roadmap-to-data-engineering</link><guid isPermaLink="true">https://anujsyal.com/a-step-by-step-roadmap-to-data-engineering</guid><category><![CDATA[Roadmap]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[Python]]></category><category><![CDATA[SQL]]></category><category><![CDATA[Databases]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Fri, 13 Jan 2023 11:37:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1673609306691/ee0cf5c2-b14f-49aa-a4ff-65f169ffa7f2.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The specialized field of data engineering is ever-expanding and its elements of it are scattered all around. But how do we come to grips with not being confused &amp; consumed in the process of learning? </p>
<p>Initially, you must follow a roughly drawn map and leave the rest to the skills and opportunities you opt for. A fun way to understand this roadmap is to imagine a smooth transition from a rookie to a professional data engineer with each sprint you take!</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/TjxmoAXkaAU">https://youtu.be/TjxmoAXkaAU</a></div>
<p> </p>
<h1 id="heading-sprint-1-strengthening-the-base-level"><strong>Sprint 1:</strong> Strengthening the Base Level</h1>
<h3 id="heading-fundamentals-i-focus-on-fundamentals-such-as-python-ds-amp-algorithms"><strong>Fundamentals I: Focus on fundamentals such as Python, DS &amp; Algorithms</strong></h3>
<p>Firstly, you must focus on fundamental skills such as Python, Data Structures &amp; Algorithms. Python is the language you will use to interact with various types of databases and tools, which is why it is sometimes described as an interactional language. At this step, it will be a fruitful decision to learn about data structures and algorithms, as they underpin most of the things that you will come across regularly later. </p>
<p><img src="https://lh6.googleusercontent.com/CeSVHiRBV8bTVFUppVd-BM8qQaK9FnD2xquLSrttuiJVzJiYMKSwlL5uMIvHQmgOhBBrZviUP6Ra7akkL_PFKQRSGHPTDeSFf0XYxKH6bOgMjD0g5SEaPUV6TuN1eZKurAV0iNpryGXPumt9K0ATZTrBAtmmvrt_u1lRG01StgRg8BcOiiwnwdp3ysiShg" alt /></p>
<p>Object-oriented languages such as Python have in-built data structures and assorted open-source packages for applying algorithms. However, it is still preferable to have a solid understanding of data structures and algorithms, as they help in writing optimized code. </p>
<h3 id="heading-fundamentals-ii-linux-commands-shell-scripting-git-networking-concepts"><strong>Fundamentals II: Linux Commands, Shell Scripting, Git, Networking Concepts</strong></h3>
<p><img src="https://lh5.googleusercontent.com/FgCSENUyt1VtyEE65iI6e12dfsAgRc_TiI3XMtATwUEhM5zzHd8bE_1-qkQWpaJEAPeTo_B05KIESEiE5kxZqcg7x6XjAgs_2VnYJ_SvaGjBS94iCZ9be8PFcQWrXyXEf265Lr6drn5PdWc7HEhmdBTAfdxcu0htfVAIND4c9WtfrC7ej-WNQA8yuheFiQ" alt /></p>
<p>Next in the fundamentals comes a mixed combination of skills like Linux commands, Shell Scripting, Git, and Networking Concepts. These are important for times when you will be dealing with virtual/cloud servers and several other platforms to transform, manage, and store the data. </p>
<h1 id="heading-sprint-2-database"><strong>Sprint 2:</strong> Database</h1>
<p><img src="https://lh6.googleusercontent.com/pZ4XXjWWFmGrINRhNT6wRtIXilOB2wclHfEJNrLrrE5Wo49CVqXMoBqmQX9W7XpL7IjXoxbf53s8qarybsTNdFBjm01XicTQEeDmJXNJweybS-fS2hcj6IlG5srYMPLjFc2AYDX0o3E3Wrkz2IsM8LcPSUqjrodL7YGxzuUSMBdkAnZB1qrTj02p17sm8Q" alt /></p>
<p>A full sprint dedicated to databases and SQL, as you will have maximum interaction with them in this journey. You need to focus on Database Fundamentals, SQL, Database Modelling, ACID Transaction, Relational, and Non-Relational Databases. Here, you are free to play and experiment with data as you go along and build a good understanding of these concepts.</p>
<p><img src="https://lh4.googleusercontent.com/J9MTDPABSsjWpQGrUCA2NJ9IisfjfQJZfUgD_H1Vu-BR3O_cNJxZJeYG8-2rq4BNJwABLSFwx1SNU3tLNTbKc9Nu7PYmim9G55ClAdWuZ2MAcbATLxxWg0dAne4IiOnSvDdznuENahpJag-R0qKLnKZ6BB5BfWpv-j1U7HHnd703yOFqEVqK2qEtlZ9tfg" alt /></p>
<h1 id="heading-sprint-3-data-lakes-amp-warehouses"><strong>Sprint 3:</strong> Data Lakes &amp; Warehouses</h1>
<p><img src="https://lh3.googleusercontent.com/bQE5eWICRT74CzcsG8RqW8A0jgmWpdir2vX8qulJhKBqDfwL8UEm_OBV2YBy3_nQIM5SWUY7hC1NG43RlR-fzXh1zjv191PV4rExBjNp6FWJ8noav2yzbG--omnEedQ6S9Pm3pip3uhvlIgE25vYIht1dMhPszUVDrqaQP9dbQrkwJtsbi6HV84mcOJfWg" alt /></p>
<p>Understand fundamentals of data lakes and warehouses, OLAP Vs. OLTP, S3, GCS, Big Query, Redshift, <a target="_blank" href="https://clickhouse.com/">ClickHouse</a>, Normalization Vs. Denormalization, and processing Big Data using <a target="_blank" href="https://www.mysql.com/">MySQL</a>. These concepts demand dedicated attention from the learners as they play a major role when storing data and managing it all in different places for different purposes. </p>
<h1 id="heading-sprint-4-distributed-systems"><strong>Sprint 4:</strong> Distributed Systems</h1>
<p><img src="https://lh3.googleusercontent.com/JjfxyzCxEbuGXFO_C4gIEM1TNgmPW4vuxBeSybe3X85rc3BxAy7FqPZZy7P5TgoyzoxmktCqSGp4fYg95cDDutiSf8u4yiU-OzfMZtf0Lm8ZS8X6wUs5svmOhGhbAaRa9rQLzQX-bk_zYowidnWao3CMZKPBF9YNeK38vk3nG_omgNWHlUi7lypOCXntSA" alt /></p>
<p>Distributed Systems are formed when multiple machines work together in groups to manage massive data sets which cannot be done by a single machine alone. These modern frameworks achieve big data processing using Distributed Systems. Hence, it is required to understand the fundamentals of Hadoop, Map Reduce, HDFS, and Cluster Tech such as EMR, <a target="_blank" href="https://cloud.google.com/dataproc">Dataproc</a>, and <a target="_blank" href="https://www.databricks.com/">Databricks</a>.</p>
<h1 id="heading-sprint-5-data-processing"><strong>Sprint 5:</strong> Data Processing</h1>
<p><img src="https://lh3.googleusercontent.com/r--rO0EtBnILf9OVDoF8rHfs5f12vbUlWvxGIPSZwLTzVcV6kaI0e63hkVBTjVxXuclDLYKgGsitEn2g-liOWe4H3L3RHNcq4aFkwgQH1pRObdaSnMCjIWBW7YSn7KbFXWuBnapySS4IroBP-z3JgFE7HkGwN7GufRZFMfMbTw6jTmUQnZF0XihU7M2lFQ" alt /></p>
<p>Data Processing is a step where your coding skills will be challenged. Why? Because you will be required to transform the raw data to get the most utility out of it. A programming language such as Python is a must-have here, as it is the coding language most often used. It is suggested that you get accustomed to a variety of tools such as <a target="_blank" href="https://pandas.pydata.org/">Pandas</a>, SQL, <a target="_blank" href="https://spark.apache.org/">Spark</a>, <a target="_blank" href="https://beam.apache.org/">Beam</a>, <a target="_blank" href="https://hadoop.apache.org/">Hadoop</a>, etc. </p>
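<p>As a small, hedged illustration of this kind of work, here is a PySpark sketch that filters and aggregates a raw file (the file path and column names are made up for the example):</p>
<pre><code class="lang-python"># Minimal PySpark sketch of a transformation step (paths/columns are illustrative).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transform-demo").getOrCreate()

orders = spark.read.csv("/data/raw/orders.csv", header=True, inferSchema=True)

# Keep only completed orders and aggregate revenue per country
revenue_by_country = (
    orders.filter(F.col("status") == "completed")
          .groupBy("country")
          .agg(F.sum("amount").alias("total_amount"))
)
revenue_by_country.show()
</code></pre>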
<h1 id="heading-sprint-6-orchestration"><strong>Sprint 6:</strong> Orchestration</h1>
<p><img src="https://lh3.googleusercontent.com/JENOHCZ0Qhp7TecTkZz-39vsZ_WIJFDjytDr7rP64oRYfhJrw9lpwgEuAD_FH5IR1rhPeGU1dQwzjyxiK7SUzNCfrbd1zbAlTrNDWLJW-PjKkxu5Lihrst936Hc168DPR3ix8MkNTDWWpc9-uD03R3rinxsfCCq-tQIq2qn_zmwxPIDfDIBtnI6F_Xxdqg" alt /></p>
<p>With the sixth sprint you will take, you need to learn to orchestrate pipelines using tools where you define the flow and schedule of your tasks. But that’s not just it! You should gain a detailed understanding of how to use <a target="_blank" href="https://airflow.apache.org/">Airflow</a> and create <a target="_blank" href="https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html">DAGs</a> (Directed Acyclic Graphs). Also, get a glimpse of other orchestration tools such as <a target="_blank" href="https://luigi-project.io/">Luigi</a> and <a target="_blank" href="https://www.jenkins.io/">Jenkins</a>. </p>
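<p>To give a flavour of what orchestration code looks like, here is a minimal sketch of an Airflow DAG with two dependent tasks (the task logic and schedule are made up for the example):</p>
<pre><code class="lang-python"># Minimal Airflow 2.x sketch: a daily DAG with two dependent Python tasks.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from the source...")

def load():
    print("writing data to the warehouse...")

with DAG(
    dag_id="simple_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Run extract before load
    extract_task &gt;&gt; load_task
</code></pre>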
<h1 id="heading-sprint-7-backend-frameworks"><strong>Sprint 7:</strong> Backend Frameworks</h1>
<p><img src="https://lh6.googleusercontent.com/lbJEacf1x_Gn1ubBAFneGx-sQOr2zNWfKj1RN_BC9qCdd-DB2-7fAfEvAsrzVGn3IvFaQfpxyrLmOQWJBWpIVyLbMPvTPvzegkYbgfOyDKAlFd6HYuyWmDm0lYUeabFncgL1YkvUQ6CgrzCy9-Gxc69bMHW95YaFXGXaj7QUPyQJvf-VZ0cXr8V4zoMngA" alt /></p>
<p>This part of the concept in data engineering overlaps with that of software engineering. Sometimes you need to serve your models and their functionality. Therefore, it is crucial to be well aware and learn how to create APIs with frameworks such as <a target="_blank" href="https://flask.palletsprojects.com/en/2.2.x/">Flask</a>, <a target="_blank" href="https://fastapi.tiangolo.com/">FastAPI</a>, and <a target="_blank" href="https://www.djangoproject.com/">Django</a>.</p>
<h1 id="heading-sprint-8-automation-amp-deployments"><strong>Sprint 8:</strong> Automation &amp; Deployments</h1>
<p><img src="https://lh3.googleusercontent.com/drTsKYLyDBQxDWvMRbLtAv58qtdo6aIJojAougCk01E27filBQ0SxiR_8FkQk1SYqbBvV540BFGYgRMLTaD9vCuJ401gRJXZBqXyPXpqMbZjtWuV2BeNirNaICXqa7tjjHUKzfrgLASec0M07n3FrelW09Z1Rz7MSZJGW1YHf2fEyvKM_KgyeSAM8rnMLQ" alt /></p>
<p>This kind of technology is important to understand as it lets you automate and deploy the codes using a variety of tools and platforms. For you, a few of the necessary learning technologies would be Containerization with <a target="_blank" href="https://www.docker.com/">Docke</a>r, CI/CD with <a target="_blank" href="https://github.com/">GitHub</a> Actions &amp; Infrastructure as code using <a target="_blank" href="https://www.terraform.io/">Terraform</a> and <a target="_blank" href="https://www.ansible.com/">Ansible</a>. </p>
<h1 id="heading-sprint-9-frontend-amp-dashboarding"><strong>Sprint 9:</strong> Frontend &amp; Dashboarding</h1>
<p><img src="https://lh3.googleusercontent.com/dtfI0v47FBScYg2J5B7sc1ZZJ2Q1iUcY14AhhKCIF_FoUmlIwxaWEP8FmT23syUUG6KjIWI6pqR_vH-xKFXbL6NCyTpptwtlHXLWBN5BclSvLXfrsVQOZCwyozQot8y4VwmkvNOEYwN-8CxoVG-bvjXu2fChd1Of_iHpCG56hksmE2lo0H7SCHteONRmlw" alt /></p>
<p>Frontend and exploration technologies are essential tools when it comes to showing the outcomes and actions taken on large data sets. In other words, they help visualize the ongoing changes and results using charts, graphs, and diagrams. Some of the popular tools to get used to are <a target="_blank" href="https://jupyter.org/">Jupyter Notebooks</a>, Dashboarding tools such as <a target="_blank" href="https://powerbi.microsoft.com/en/">PowerBI</a> &amp; <a target="_blank" href="https://www.tableau.com/">Tableau</a>, and Python frameworks such as <a target="_blank" href="https://plotly.com/dash/">Dash</a> &amp; <a target="_blank" href="https://plotly.com/">Plotly</a>, etc.</p>
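<p>As a small, hedged illustration, here is a Plotly Express sketch that turns a tiny made-up dataset into a bar chart:</p>
<pre><code class="lang-python"># Minimal Plotly Express sketch (the data is made up for the example).
import pandas as pd
import plotly.express as px

throughput = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar"],
    "rows_processed": [120_000, 150_000, 170_000],
})

fig = px.bar(throughput, x="month", y="rows_processed", title="Pipeline throughput by month")
fig.show()
</code></pre>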
<h1 id="heading-sprint-10-machine-learning"><strong>Sprint 10:</strong> Machine Learning</h1>
<p><img src="https://lh4.googleusercontent.com/vMsaZ5hitKBKAjZO_927TLrVF9l9OFslA3Knx4yeLZ2-PX0pWpHEO9fQkklxrqLI8RnQMZeQp8IHhX5F1Z1r0wrTOifgnO113z0OcqFD906ANA7qqkmSqzK0K079IZ2Wo5Qwvof66Iq2pLQDFaq6itLJGTfviitJ1SGNmvrSsVV1oeXBVPn3VGMIWELE-w" alt /></p>
<p>At this point, you’re already competent enough in the field of data engineering. However, to work as a professional alongside a team of other engineers, data scientists, and analysts, it is necessary to grasp the concepts of Machine Learning. ML models and algorithms are used by data scientists to study data and then make calculated predictions that can help business organizations make big decisions.</p>
<h1 id="heading-conclusion"><strong>Conclusion</strong></h1>
<p>An in-depth understanding of the core concepts is the first step when learning any subject as it promises you great success. Likewise, it is imperative to go with a clear and succinct approach in the advancing field of Data Engineering. This roadmap has simple guiding steps to ensure you can build a promising career with undying enthusiasm. The aspirants of data and its engineering must indulge in learning and practicing from a wide array of skills and technologies as time passes. </p>
<p>For more information and elaborative understanding, you can check out the video!</p>
]]></content:encoded></item><item><title><![CDATA[12 Must-Have Skills to become a Data Engineer]]></title><description><![CDATA[Are you passionate about using data to create innovative products and solutions? If so, a career as a data engineer may be the perfect fit for you. But what does it take to be successful in this field? In this blog, we will explore the skills and req...]]></description><link>https://anujsyal.com/12-must-have-skills-to-become-a-data-engineer</link><guid isPermaLink="true">https://anujsyal.com/12-must-have-skills-to-become-a-data-engineer</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[spark]]></category><category><![CDATA[#datawarehouse]]></category><category><![CDATA[big data]]></category><category><![CDATA[GitHub]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Tue, 03 Jan 2023 06:13:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1672725924394/68f0344e-22d5-4e59-8a42-5389387e8916.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Are you passionate about using data to create innovative products and solutions? If so, a career as a data engineer may be the perfect fit for you. But what does it take to be successful in this field? In this blog, we will explore the skills and requirements necessary to become a data engineer and succeed in this exciting profession.</p>
<p>To begin with the fundamentals, or say, to build an in-depth understanding, we must start from scratch.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1672726283888/2b120d07-201a-4765-a05f-97e4c3053bd2.png" alt class="image--center mx-auto" /></p>
<p>Full visual diagram: <a target="_blank" href="https://whimsical.com/de-skills-CYXEYCaa3U4zNL5JTYKv59">Visit this mind map on Whimsical</a></p>
<h2 id="heading-fundamental-skills"><strong>Fundamental Skills</strong></h2>
<ul>
<li><p><strong>SQL:</strong> Structured Query Language, also called See-Quell, is always at the top of the list for beginners in the domain. The language was developed in the early 1970s and is the standard language to interact with data in databases.</p>
<p>  Almost all databases and warehouses use a version of SQL as an interactional language.<br />  The popular standard relational databases are MySQL and PostgreSQL. Moreover, other tools and warehouses have adopted SQL as an abstraction, which even allows you to build ML models using SQL in BigQuery.</p>
</li>
<li><p><strong>Programming Language:</strong> Next comes the Programming Language. This is the language that engineers begin using to code with and is the central aspect of data engineering. For most of us, Python is the language of choice as it's easier to get started with.<br />  This language comes packed with data science packages and frameworks and is a perfect choice for production code. Alternatively, there are plenty of other languages (such as R, Scala, and Java) that can be used but Python is recommended.</p>
</li>
<li><p><strong>Git:</strong> Git is an important tool for version control, which is the practice of tracking and managing changes to software code. Every single change that you make becomes a part of your code base on some remote server/cloud.<br />  But how does Git help you? Git lets you save all the changes and actions that you take while coding, which works wonderfully when collaborating with your team, without losing your code. You simply create a new branch and send a pull request to merge code and voila! You’re ready to collaborate and work on your code. Check out my video tutorial on Git if you want to get started with this technology</p>
</li>
</ul>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/7_pr528CYQw">https://youtu.be/7_pr528CYQw</a></div>
<p> </p>
<p><a target="_blank" href="https://youtu.be/7_pr528CYQw">Complete Git and GitHub Tutorial for Beginners (Data Professionals)</a></p>
<ul>
<li><p><strong>Linux Commands &amp; Shell Scripting:</strong> As a practitioner in the world of data engineering, you will mostly be dealing with a Linux VM or server. Whether in a public cloud or on a private server, these machines inherently use some version of Linux such as Ubuntu, Fedora, etc.<br />  Therefore, to work with such machines, you are required to know the commands needed to navigate Linux servers. Basic commands such as cd, pwd, cp, and mv are a good start, with much more to learn further on. Shell scripting, meanwhile, is a great tool to automate these Linux commands so you do not have to run them manually.</p>
</li>
<li><p><strong>Data Structure &amp; Algorithm:</strong> Next in line are Data Structures and Algorithms. Even though you will not be required to create data structures on your own, an aspiring data engineer should still have an adequate understanding of DS &amp; Algo and the problem-solving skills that go with it (similar to software engineering). For this purpose, easy- to intermediate-level LeetCode problems will be enough for initial practice.</p>
</li>
</ul>
<h2 id="heading-concept-of-networking"><strong>Concept of Networking</strong></h2>
<p>As a data engineer, you would be responsible for quite a lot of deployments to VMs and servers. Therefore, It is important for someone dealing with VMs (Virtual Machines), Servers, and APIs (Application Programming Interfaces) to have a basic understanding of basic networking concepts such as IP (Internet Protocol), DNS (Domain Name Server), VPN, TCP, HTTP, Firewalls, etc.</p>
<h2 id="heading-databases"><strong>Databases</strong></h2>
<ol>
<li><p><strong>Fundamentals:</strong> A database is a space where data is stored. You will be interacting with many of these databases as a data engineer. For this reason, you need to understand the fundamental concepts of databases, such as tables, rows, columns, keys, joins, merges, and schema.</p>
</li>
<li><p><strong>SQL:</strong> SQL comes up once again when talking about databases, as it is the language you use to interact with them.</p>
</li>
<li><p><strong>ACID:</strong> This abbreviation stands for Atomicity, Consistency, Isolation, and Durability. This is a set of properties of database transactions intended to guarantee data validity, despite errors, power failures, and any other such mishaps.</p>
</li>
<li><p><strong>Database Modelling:</strong> Data Modelling or Schema Design helps extensively when building any database, be it applications or warehouses. That is why it is essential to have some knowledge of design patterns for creating schemas for databases. This includes star schema, flat design, snowflake model, etc.</p>
</li>
<li><p><strong>Database Scaling:</strong> Vertical scaling refers to increasing the configuration (CPU, memory, storage) of the single machine where the database is deployed, which can be scaled up further later on. Alternatively, with horizontal scaling, also known as sharding, you store the data across multiple machines instead of one.</p>
</li>
<li><p><strong>OLTP Vs. OLAP:</strong> OLAP (Online Analytical Processing) &amp; OLTP (Online Transaction Processing) are two different types of data processing systems. Online analytical processing (OLAP) uses complex queries to examine historical data that has been collected from OLTP systems.</p>
</li>
<li><p><strong>Relational Databases:</strong> These are traditional-style databases that power most applications. A single database can contain multiple tables with rows and columns. The most commonly used databases of this kind are PostgreSQL and MySQL.</p>
</li>
<li><p><strong>Non-Relational Databases:</strong> Non-relational databases store data using storage models other than a tabular schema, for example as documents, key-value pairs, or nodes and relations. This helps in storing the data in a form that suits the workload and then promptly extracting and fetching the data records. Non-relational databases further come in three different types that can be understood as and when needed.</p>
<ul>
<li><p><strong>Key-Value Databases</strong>: examples are Redis, DynamoDB, and FireBase</p>
</li>
<li><p><strong>Graph Database</strong>: examples are Neo4j and ArangoDB</p>
</li>
<li><p><strong>Wide Column Databases</strong>: examples are Apache Cassandra, and Google BigTable</p>
</li>
</ul>
</li>
</ol>
<h2 id="heading-data-warehousing"><strong>Data Warehousing</strong></h2>
<p>The inability of databases to store a huge amount of data leads us to a warehouse. These data warehouses can store large volumes of current and historical data for query and analysis. <strong>Data Warehousing is simply databases designed with analytical workloads in mind</strong>. These are powerful enough to perform complex aggregate queries and transformations to yield insights. Some of the key concepts to understand within warehousing are-</p>
<ol>
<li><p><strong>SQL</strong>: With the advent of powerful data warehouses that abstract away complexity, proficiency in SQL is all that is required to unlock their full potential.</p>
</li>
<li><p><strong>Normalization Vs. Denormalization</strong>: Normalization involves removing redundancy and inconsistencies from the data. Denormalization, in contrast, is the technique of merging data into a single table to speed up reads and queries at the cost of some redundancy.</p>
</li>
<li><p><strong>OLAP Vs. OLTP</strong>: The primary distinction between the two is that one uses data to gather important insights while the other supports transaction-oriented applications.</p>
</li>
</ol>
<p>Some of the popular data warehouses are:</p>
<ul>
<li><p>Google’s Big Query</p>
</li>
<li><p>AWS Redshift</p>
</li>
<li><p>Azure Synapse</p>
</li>
<li><p>Snowflake</p>
</li>
<li><p>ClickHouse</p>
</li>
<li><p>Hive</p>
</li>
</ul>
<h2 id="heading-data-lakesobject-storage"><strong>Data Lakes/Object Storage</strong></h2>
<p>These work as file storage sources where you can store your files or blob files. They are huge cloud networks that are used globally and are readily available to you.</p>
<h2 id="heading-distributed-systems"><strong>Distributed Systems</strong></h2>
<p>When multiple machines work together as a cluster, they form Distributed Systems. These systems are used when the data is huge and cannot be managed by a single machine. They have separate sets of technologies due to their own complexities. Some of the concepts you must know in depth-</p>
<ol>
<li><p>Big Data</p>
</li>
<li><p>Hadoop</p>
</li>
<li><p>HDFS</p>
</li>
<li><p>Map Reduce</p>
</li>
</ol>
<p>Some of the technologies that are built for this purpose include Cluster technologies like Kubernetes, Databricks, Custom Hadoop Cluster, etc. Open-source technologies are also available in distributed systems.</p>
<h2 id="heading-data-processing"><strong>Data Processing</strong></h2>
<p>This is where your coding skills will come into use for transforming the data, as raw data is never directly usable. As a data engineer, your job responsibility will mainly revolve around transforming the data so it can be served in the right format. This further includes cleaning the data and validating it. Pandas can be your first-hand tool for this process, as it’s an easy-to-use Python package built around data frames. SQL can also be used to transform big data, as most data warehouses support the language. <code>Spark</code> is the most popular framework used for big data transformation. Similarly, for stream processing, <code>Spark Streaming</code> is the preferred choice.</p>
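<p>As a tiny, hedged illustration of this kind of cleaning, here is a pandas sketch that deduplicates rows and fixes a column’s type (the column names and rules are made up for the example):</p>
<pre><code class="lang-python"># Minimal pandas cleaning sketch (columns and rules are illustrative).
import pandas as pd

# Raw data with a duplicate row and a missing value, typical of what lands in a pipeline
raw = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "amount": ["10.5", "3.2", "3.2", None],
})

# Basic cleaning: drop duplicate rows, cast amount to numeric, fill missing values
clean = (
    raw.drop_duplicates()
       .assign(amount=lambda d: pd.to_numeric(d["amount"]).fillna(0.0))
)
print(clean)
</code></pre>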
<h2 id="heading-orchestration"><strong>Orchestration</strong></h2>
<p>Orchestration is used to schedule and orchestrate jobs and to create pipelines and workflows. The <strong>best tool for orchestration is Airflow</strong>, as it uses Python-based Directed Acyclic Graphs to describe the workflow of jobs. From the simplest of tasks to the most complex ones, Airflow can handle everything. Some other orchestration tools are Luigi, NiFi, and Jenkins.</p>
<h2 id="heading-backend-frameworks"><strong>Backend Frameworks</strong></h2>
<p>As the name suggests, backend frameworks overlap with software engineering. They come into use when you need to serve a data set, model, or piece of functionality to some application. For this task, you will need to create backend APIs using frameworks such as Flask, Django, and FastAPI, all of which are Python-based. Some of the cloud-based alternatives are the GCP Vertex AI API for model deployments and AutoML APIs.</p>
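<p>As an illustration, here is a hedged sketch of a tiny FastAPI service; the endpoint and data are made up and simply stand in for a real model or table:</p>
<pre><code class="lang-python"># Minimal FastAPI sketch (endpoint and data are illustrative).
from fastapi import FastAPI

app = FastAPI()

# A made-up in-memory "dataset" standing in for a real model or table
UNIVERSITY_COUNTS = {"california": 44, "texas": 38}

@app.get("/universities/{state}")
def university_count(state: str):
    """Return a count for the requested state, defaulting to 0 when unknown."""
    return {"state": state, "count": UNIVERSITY_COUNTS.get(state.lower(), 0)}
</code></pre>
<p>Running it with uvicorn (for example, <code>uvicorn main:app --reload</code>, assuming the file is saved as main.py) exposes the data over HTTP for other applications to consume.</p>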
<h2 id="heading-frontend-andamp-dashboarding"><strong>Frontend &amp; Dashboarding</strong></h2>
<p>Frontend and exploration technologies are about displaying the results and actions performed through charts, images, and diagrams. There are plenty of tools available which might not come to use for a data engineer but are good to know about. The popular ones are Jupyter Notebooks, Dashboarding (PowerBI, Tableau), Python Frameworks (Dash, Gradio), etc.</p>
<h2 id="heading-automation-andamp-deployments"><strong>Automation &amp; Deployments</strong></h2>
<p>Automation and Deployments are about automating and deploying the codes using a variety of tools and technologies. A few of the important technologies include the following:</p>
<ol>
<li><p>Infrastructure as code: Using Terraform, Ansible, Shell Scripts</p>
</li>
<li><p>CI/CD: Using GitHub Actions, Jenkins</p>
</li>
<li><p>Containerization: Docker, Docker Compose</p>
</li>
</ol>
<h2 id="heading-machine-learning"><strong>Machine Learning</strong></h2>
<p>Machine learning algorithms (or models) are another concept well worth knowing. Machine learning is mainly used by data scientists to make predictions by analyzing current and historical data. However, data engineers should have a strong understanding of the basics of machine learning, as it enables them to deploy models and build more reliable pipelines for them, which directly helps data scientists make precise decisions. Hence, it is good to understand the fundamentals and frameworks of ML. Some of the platforms used for ML operations are Google AI Platform, Kubeflow, and SageMaker.</p>
<h2 id="heading-integrated-platforms"><strong>Integrated Platforms</strong></h2>
<p>Integrated platforms allow data scientists and data engineers to share integrated workflows in one place. AWS SageMaker, Databricks, and Hugging Face are some examples of integrated platforms.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>In the field of data engineering, there are a myriad of skills that one needs to learn, and that requires gaining hands-on experience too. As an aspiring data engineer, you get to choose from a wide variety of skills and tools to work with, and that’s the thrill of it all!</p>
<p>For more information and a deeper understanding, check out the video below!</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=0qmrsjW_rVI">https://www.youtube.com/watch?v=0qmrsjW_rVI</a></div>
<p> </p>
<p><a target="_blank" href="https://youtu.be/0qmrsjW_rVI"><strong>DATA ENGINEERING SKILLSET</strong></a></p>
]]></content:encoded></item><item><title><![CDATA[Data Engineering Explained]]></title><description><![CDATA[When we scroll through these sites in hopes to find something we need to buy (say, a shirt), we add it to the cart, or we just let it be saved for later. Within a few moments, you begin to see advertisements of the same or similar-looking shirts whil...]]></description><link>https://anujsyal.com/data-engineering-explained</link><guid isPermaLink="true">https://anujsyal.com/data-engineering-explained</guid><category><![CDATA[Data Science]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[big data]]></category><category><![CDATA[#datawarehouse]]></category><category><![CDATA[Data-lake]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Sat, 24 Dec 2022 08:34:50 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/04f0598c840abea21926a76e8492f28f.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When we scroll through these sites in hopes to find something we need to buy (say, a shirt), we add it to the cart, or we just let it be saved for later. Within a few moments, you begin to see advertisements of the same or similar-looking shirts while surfing other platforms.</p>
<p>For these creepy advertisements to be in the right spots, apart from data tracking using cookies, there is also a good amount of <strong>data engineering</strong> working behind the scenes.</p>
<p>In this blog post let's try to understand how data engineering works.</p>
<h2 id="heading-data-engineering-in-the-past"><strong>Data Engineering in the Past</strong></h2>
<p>First, let's try to understand data engineering in the past. A while back, when things were simple, data was scattered across multiple sources such as transaction databases (e.g. MySQL, Postgres), analytics tools (e.g. Google Analytics, Facebook Pixels), and CRM databases. This data was often accessed and analyzed by an Excel professional who would gather the data from various teams, manipulate it in Excel using pivot tables and other functions, and create a final report.</p>
<p><img src="https://lh6.googleusercontent.com/KURdXHpnKko5t2d2Yj4Ph0wTI_YEBpGXqbpiYuzON7RI5AAEJqVTp3REwPFkAgkS2dWtp0NL1ix5ey9HsVwfBe_X17VBN6BaRJ0PWCgoID6f2SIWgjCR3mxAHCkGjDw9N4N0ZD_2JAYG20WeAi2cxSYoUV0_RgEQjh3q5Byp8U4rkeFpgHrph_ZO33zqFg" alt="Illustration of data engineering in past by author" /></p>
<p>While this process worked for small applications, it was prone to error due to the manual intervention involved and became increasingly cumbersome as the amount of data grew.</p>
<h2 id="heading-understanding-extract-transform-load-etl"><strong>Understanding Extract Transform Load (ETL)</strong></h2>
<p>Extract, Transform, Load (ETL) is a process in data engineering that involves extracting data from various sources, transforming it into a format suitable for analysis or other purposes, and loading it into a target system or database. The purpose of ETL is to make it easier to work with data from different sources by bringing it into a centralized location or format.</p>
<p><img src="https://lh5.googleusercontent.com/vUpZH8EMB0h-O0Z5uKGFl5MjYzOpqyay2gQaEr4Mfh5iDidZyinOFTvcbMTY9D69L418ucN90z5YMOGemqpBz393bbP9Lbn-PI1Utp1DYElewEw1ixKJSI7gSwTI7DZctgDXOXldAMzXjnhD2WdfcPzUfb3rZSMpXnhsuEkpAtQxNw_QtQRH70NT8ADALg" alt="Illustration of ETL pipeline from author" class="image--center mx-auto" /></p>
<p>Any simple ETL pipeline first extracts data from all the sources, such as databases, APIs, files, and other data connectors. Then comes the next step, where it transforms the data. But what exactly does this step involve? To transform the collected data, the pipeline resolves ambiguities, handles missing fields and columns full of nulls, puts misplaced columns into the right place and format, performs joins or merges where needed, and sometimes also produces a pivot summary.</p>
<p>Finally, in the last step, this customized, transformed data is loaded into a sink. This ‘sink’ could simply be a database. Since this process is not just a one-time thing, it is likely to be repeated at a continuing frequency, so data engineers write scripts that run the whole ETL process on a weekly, monthly, or yearly basis. Apache Airflow is one good example of a tool for orchestrating ETL jobs.</p>
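<p>To make the three steps tangible, here is a compact, hypothetical sketch of an ETL job in Python, using Pandas for the transform and SQLite standing in for the sink; the file, table, and column names are invented for illustration.</p>
<pre><code>import sqlite3

import pandas as pd

def run_etl():
    # Extract: pull raw data from a source (a hypothetical CSV export here)
    raw = pd.read_csv("sales_export.csv")

    # Transform: drop incomplete rows, fix types, build a pivot-style summary
    clean = raw.dropna(subset=["customer_id", "amount"])
    clean["sale_date"] = pd.to_datetime(clean["sale_date"], errors="coerce")
    daily_totals = clean.groupby("sale_date", as_index=False)["amount"].sum()

    # Load: write the transformed data into a sink (SQLite stands in for the database)
    with sqlite3.connect("warehouse.db") as conn:
        daily_totals.to_sql("daily_sales", conn, if_exists="replace", index=False)

if __name__ == "__main__":
    run_etl()
</code></pre>
<p>In production, a script like this would be scheduled by an orchestrator rather than run by hand.</p>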
<h2 id="heading-insights-for-stakeholders-bi"><strong>Insights for Stakeholders (BI)</strong></h2>
<p>Business Intelligence, abbreviated as BI, is software used to accumulate, process, analyze, and visually represent large sets of data. BI tools are used to inform decision-making and drive business outcomes. As fascinating as it sounds, they are a great invention, as they make it possible for everyone to observe and make sense of their own data.</p>
<p><img src="https://lh3.googleusercontent.com/JmvtSYdJhQ8HvHyPy1LVQPsDvYCoNKOE_H0QTwER5WW-ml30mmrXeMtBU9NVRFACCcRInxsDLT8Z9menGfOE76wwm7kE_Wb_D-IfaptNBj89V87Pb4_PxLeMVKdgwtrHc55gOrn0CzQlpPfL_GkNu82LgKYtKx04wGrNDOFCrdHhX0Xg_PubkoHjqMvsfw" alt class="image--center mx-auto" /></p>
<p>Photo Credit: <a target="_blank" href="https://www.pexels.com/@tima-miroshnichenko/">Tima Miroshnichenko/Pexels</a></p>
<p>BI tools are made for end-users like stakeholders and analysts who get access to the recorded insights. Using BI tools, they are able to track KPIs (Key Performance Indicators) and trends and make decisions on the basis of well-curated data. Some popular BI tools are PowerBI, Tableau, and Google Data Studio. These tools make it easy to create charts, graphs, and maps, and the results can easily be exported to Excel sheets for future use.</p>
<h2 id="heading-data-warehousing"><strong>Data Warehousing</strong></h2>
<p>We studied how ETL helps in consistently pushing the data forward each day. But there is a limit to how much data can be stored in a database like MySQL, and for how long. That’s when Data Warehouse comes into the picture.</p>
<p><img src="https://lh3.googleusercontent.com/CcEzV-7CYzxa3zaUqkL_lQLzL4lEradDGYolIS3P3PPi-Ws8-Pl5bg3chCH-Y3llgvhY1T_-XcQ3LkgIk7hxyopFquKwifcdY5kdwppPgMG24Q7Ad12K0WcOLGbAfWZ3z2Cq2FQG-sHXeKP0UXoiWkdhDcgGuQuGgx_BCMe2zPOi5ClUJrIE3fO-zPpzRQ" alt class="image--center mx-auto" /></p>
<p>As its name suggests, this system works as a warehouse for data in large amounts, often historical data. Data stored in data warehouses is structured with analytical workloads in mind; the schema is usually denormalized so insights can be fetched without a lot of <code>joins</code>. A data warehouse is also termed an OLAP (Online Analytical Processing) system.</p>
<p>It mainly focuses on analytics, supporting heavy queries and analysis. In short, it is another tool that helps big organizations make mindful, strategic decisions by querying their historical records.</p>
<h2 id="heading-elt-data-lake-used-by-data-scientists"><strong>ELT/ Data Lake used by Data Scientists</strong></h2>
<p>The data warehouse is built to support the current business requirements; it contains structured, well-designed schemas that business users rely on to track KPIs and metrics.</p>
<p>Data scientists, however, need to build ML models that make forward-looking predictions from existing data. Their job is to hunt down every scrap of data they can find, so they are also interested in unstructured data such as logs and event data, which is not part of the warehouse.</p>
<p><img src="https://lh3.googleusercontent.com/SahP4-_jc8Dezpk8QwXLlLIkgd8Bjc6aR_pOY6ofeyjsYGzNJ52P7E5EidYKoK753W6eB2EXbTdmvVYQYbxlU4DS7Tana1d_n97hsH_6Glmy4QwgUeJcyvk97z9h6SlH5HyMeLl_IWtjQUcmkVtnyToqMYm6WMlM4IMDwlvUSzwD-JLEZkLPRafA-vyANw" alt="Illustration of ELT" class="image--center mx-auto" /></p>
<p>But to make all this possible, a data engineer has to do some work. Instead of ETL, the pipeline becomes ELT: extraction happens first, and then the data is loaded into a data lake. The data lake stores all the raw data without processing it, because data scientists need to see every column. This data is usually stored in blob storage like S3, HDFS, or GCS. Finally, the data scientists transform the data themselves, typically in Jupyter notebooks, to get what they need out of it.</p>
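<p>As a small, hypothetical illustration of the data-scientist side of ELT: Pandas can read raw files straight out of a data lake bucket inside a notebook (reading <code>gs://</code> or <code>s3://</code> paths requires the gcsfs or s3fs extras to be installed). The bucket path and column names below are made up.</p>
<pre><code>import pandas as pd

# Raw, untransformed events pulled straight from a (hypothetical) data lake path
events = pd.read_parquet("gs://my-data-lake/raw/events/2022-12-01/events.parquet")

# Nothing was dropped upstream, so every column is still there to explore
print(events.columns.tolist())

# The "T" in ELT happens here, after loading: keep only what the model needs
features = events[["user_id", "event_type", "event_ts"]].dropna()
</code></pre>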
<h2 id="heading-big-data-andamp-computing-spark"><strong>Big Data &amp; Computing Spark</strong></h2>
<p>Big Data was one of the most intriguing, overused buzzwords back then when it was first introduced in the world of data technology. But what do we mean when we use the term?</p>
<p>To be concise, the data which cannot be processed/used in a single server is called Big Data. But there is more to it!</p>
<p>For data to be classified as big data, it should exhibit the 4 Vs:</p>
<ol>
<li><p>Volume</p>
</li>
<li><p>Variety</p>
</li>
<li><p>Veracity (accuracy)</p>
</li>
<li><p>Velocity (speed)</p>
</li>
</ol>
<p><img src="https://lh4.googleusercontent.com/dDov_7aVoR00D089muH2HL9ffi86KxFiGNwvNlbttM3HgdTdpULZM5C58_z1F_jTgybMnJ-rrhytVTglNk7ooVx09tL12k-7xaF3NIDOTknteqfXrXjtvOugrNQ60Ta1K9Vyg-RvHpRRvBQvqwZLGKNCvSfeLMKOhhcNEj2qTDbBTWWBSYDCYdydXTNXyg" alt="Illustration of 4 Vs in Big Data by author" class="image--center mx-auto" /></p>
<p>Some of the key areas where <strong>big data</strong> comes to use are:</p>
<ul>
<li><p>Ecommerce websites processing thousands of sales and logistics transactions</p>
</li>
<li><p>Payment Systems</p>
</li>
<li><p>Financial Institutions</p>
</li>
<li><p>Blockchain Exchanges</p>
</li>
<li><p>Streaming Services like Youtube</p>
</li>
</ul>
<p>Petabytes of data cannot be stored on a single server, so data of this size has to be distributed across many machines in the cloud. For this purpose there are open-source frameworks like Apache Hadoop, which efficiently store and process data that is huge in volume. Such groups of servers are known as clusters, because you can keep adding as much storage and compute as you need.</p>
<p><img src="https://lh6.googleusercontent.com/ocpTSwwMZtGQIKL_AeIWZVYTn_NdJvM39c7ksPXojIih7bfVbfNbhBCn3IoAYvBB5qzB2gQGpXldZEFkqoeLNDFlnpfnXr-xUBlHfoCS4uEyiH4sNE7J7J2kvCmDBZ6MQr6ONAgydrQ8bvjCLxm2hb_xNNxGEu0AEBMlDg902Otd1PLlKHzUAChacSE90A" alt="Big Data ecosystem" class="image--center mx-auto" /></p>
<p>Cloud object stores such as GCS and S3 are other great options for this kind of storage, and they are even more resilient. Distributed storage like this provides scalability and redundancy, so the data can still be retrieved if a server crashes. On top of storage, dedicated technologies handle the distributed computing and streaming of this data; Spark and Kafka are good examples!</p>
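<p>To give a feel for the distributed-computing side, here is a minimal PySpark sketch of an aggregation job; the input path and column names are hypothetical, but the same code runs on a laptop or on a multi-node cluster.</p>
<pre><code>from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("big-data-demo").getOrCreate()

# Spark splits the input files across the cluster's executors automatically
events = spark.read.json("hdfs:///data/raw/events/")  # hypothetical path

daily_counts = (
    events.groupBy(F.to_date("event_ts").alias("day"))
    .count()
    .orderBy("day")
)

daily_counts.write.mode("overwrite").parquet("hdfs:///data/curated/daily_counts/")
</code></pre>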
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>The world of data engineering is huge and includes major components of data science as well. This means that data engineering and data science are not contrasting but complementary to each other. Data engineers design and build the pipelines that transform and transport data into the desired format, while data scientists use that data to extract as much value as possible for the business and its stakeholders.</p>
<p>However, in order to complement each other’s efforts, both data engineers and data scientists are supposed to learn data literacy skills and must be well aware of their respective contributions to the system. This is how any business organization flourishes and performs well in the market, understands the likes &amp; preferences of its consumer base, and makes important decisions that are crucial for growth.</p>
<h2 id="heading-looking-for-more-information">Looking for more information</h2>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=cAJCcpiVpOY">https://www.youtube.com/watch?v=cAJCcpiVpOY</a></div>
]]></content:encoded></item><item><title><![CDATA[Warehousing with Google’s Big Query]]></title><description><![CDATA[Data, in the modern world, is decentralized and is being generated and collected at a record pace. To ensure that this data is collected and processed in a manner that enables businesses and organizations to achieve their business goals, specialized ...]]></description><link>https://anujsyal.com/warehousing-with-googles-big-query</link><guid isPermaLink="true">https://anujsyal.com/warehousing-with-googles-big-query</guid><category><![CDATA[Databases]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data]]></category><category><![CDATA[data analysis]]></category><category><![CDATA[google cloud]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Thu, 24 Mar 2022 06:01:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/BNBA1h-NgdY/upload/v1648015810341/sxTshnNLX.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Data, in the modern world, is decentralized and is being generated and collected at a record pace. To ensure that this data is collected and processed in a manner that enables businesses and organizations to achieve their business goals, specialized and optimized tools are required. ‘The right solution will enable businesses and organizations to store data with resiliency, and swiftly analyze large amounts of data such that it can be used to achieve business outcomes, with decisions powered by data and analytics. </p>
<p>Google’s BigQuery is one such tool. Google’s proprietary BigQuery is a serverless multi-cloud data warehouse that is highly scalable, cost-effective, and specifically designed for offering superior business agility. It democratizes your data-driven insights with built-in machine learning, powered by a flexible and end-to-end multi-cloud analytics solution. In addition to state-of-the-art machine learning, BigQuery also enables lower TCO at scale by almost 26-34% as compared to alternatives. Furthermore, BigQuery adapts to your data with zero operational overhead. </p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1648022051095/yc5gh9JTu.jpg" alt="taylor-vick-M5tzZtFCOfs-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@tvick?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Taylor Vick</a> on <a href="https://unsplash.com/s/photos/big-data?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>BigQuery’s architecture is built for big data. It works optimally when it is fed several petabytes of data to be cleaned, processed, and analyzed. BigQuery removes the need to provision anything before running interactive, ad-hoc queries against massive read-only datasets. </p>
<h2 id="heading-bigquery-has-the-following-hierarchical-structure">BigQuery has the following hierarchical structure:</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1648022137561/tTQu9NcSI.jpg" alt="eta-qRmq4tXM9sI-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@etaplus?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">ETA+</a> on <a href="https://unsplash.com/s/photos/hierarchy?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<h4 id="heading-projects">Projects</h4>
<p>In the context of BigQuery, all resources are contained within a project. Since storage and compute are decoupled in BigQuery, the projects that store data and those that query data can be separate. </p>
<h4 id="heading-datasets">Datasets</h4>
<p>You can utilize datasets to organise BigQuery tables and views. A dataset is bound to a location that may be regional (a specific geographical place) or multi-regional (a region that contains two or more geographical places). The location of a dataset can only be defined at the time of its creation.</p>
<h4 id="heading-tables">Tables</h4>
<p>BigQuery tables hold your data. Each table is defined by a schema (data types, column names, and other information). There are different types of tables: native tables backed by BigQuery storage, external tables that live on storage outside BigQuery, and views, which are virtual tables defined by SQL queries.</p>
<h4 id="heading-jobs">Jobs</h4>
<p>The actions that BigQuery runs, such as loading, exporting, querying, or copying data, are referred to as jobs. A job runs in the location of the dataset it operates on. </p>
<h2 id="heading-key-features-of-bigquery">Key Features of BigQuery:</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1648022233117/qxWWZh00B.jpg" alt="adam-smigielski-ZSct3GqtTL0-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@smigielski?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Adam Śmigielski</a> on <a href="https://unsplash.com/s/photos/attributes?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<h4 id="heading-predictive-modelling-with-bigquery-ml">Predictive Modelling with BigQuery ML</h4>
<p>BigQuery ML enables data analysts and data scientists to build machine learning models on structured or semi-structured datasets that can be several petabytes in size. All of this is achieved through simple SQL in minimal time. </p>
<h4 id="heading-bigquery-omni-for-multicloud-data-analytics">BigQuery Omni for Multicloud data analytics</h4>
<p>BigQuery Omni allows you to analyze data across multiple clouds, such as Azure and AWS, as a fully managed, end-to-end data analytics solution focused on saving costs and securing data. </p>
<h4 id="heading-bigquery-bi-engine-for-interactive-data-analytics">BigQuery BI Engine for Interactive Data Analytics</h4>
<p>With its highly optimized, in-memory analysis service, BigQuery BI Engine enables data analysts to obtain actionable insights from massive and complex datasets with a sub-second query response time and high scalability through high concurrency. </p>
<h3 id="heading-bigquery-gis-with-geospatial-analysis">BigQuery GIS with Geospatial Analysis</h3>
<p>As a unique feature, combine geospatial analysis with BigQuery’s serverless architecture in order to improve and augment your analytics workflows with location-based intelligence. Simplify your analyses and visualize your special data to unlock new potential for your business</p>
<h2 id="heading-warehousing-in-bigquery">Warehousing in BigQuery</h2>
<p>If you are interested in a step-by-step guide, check out this YouTube video</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/_Wm_GYO-r_Q">https://youtu.be/_Wm_GYO-r_Q</a></div>
<blockquote>
<p>Warehousing with Google's Big Query from Anuj Syal</p>
</blockquote>
<h4 id="heading-loading-dataset-in-bigquery">Loading dataset in BigQuery</h4>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1648022421522/YaRO45HAP.png" alt="image.png" /></p>
<blockquote>
<p>Screenshot from <a target="_blank" href="https://cloud.google.com/">Google Cloud</a></p>
</blockquote>
<p>Big query provides multiple options for you to load the data:</p>
<ul>
<li>Use a pre-existing connector, eg youtube analytics, google analytics</li>
<li>Google Cloud Storage</li>
<li>Big Query Console</li>
<li>Big Query CLI</li>
<li>Using the Python client libraries (see the sketch below)</li>
</ul>
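<p>For the last option, here is a minimal sketch using the official <code>google-cloud-bigquery</code> Python client to load a CSV from Cloud Storage into a table; the project, dataset, table, and bucket names are placeholders.</p>
<pre><code>from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP credentials and project

# Hypothetical fully-qualified destination table and source file
table_id = "my-project.my_dataset.orders"
uri = "gs://my-bucket/exports/orders.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish

print(client.get_table(table_id).num_rows, "rows loaded")
</code></pre>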
<h4 id="heading-public-datasets">Public Datasets</h4>
<p>Google Cloud <a target="_blank" href="https://cloud.google.com/solutions/datasets">Public Datasets</a> offer a powerful data repository of more than 200 high-demand public datasets from different industries. The program also covers the storage cost of these datasets and gives you 1 TB of free queries per month if you intend to use them. </p>
<h4 id="heading-exploring-ecommerce-public-dataset-on-big-query">Exploring Ecommerce Public Dataset on Big Query</h4>
<p>If you are familiar with simple SQL, Big Query allows you to explore its biggest public datasets for free. So, as an example, let's check out this publicly available Ecommerce dataset:</p>
<ul>
<li><strong>About the Dataset</strong>
The dataset provides <a target="_blank" href="https://console.cloud.google.com/marketplace/product/obfuscated-ga360-data/obfuscated-ga360-data">Google Analytics 360 data from the Google Merchandise Store</a> , a real ecommerce store that sells Google-branded merchandise, in BigQuery. We will explore the all_sessions table</li>
</ul>
<ol>
<li>Query: Total unique visitors<pre><code><span class="hljs-keyword">SELECT</span>
<span class="hljs-keyword">COUNT</span>(*) <span class="hljs-keyword">AS</span> product_views,
<span class="hljs-keyword">COUNT</span>(<span class="hljs-keyword">DISTINCT</span> fullVisitorId) <span class="hljs-keyword">AS</span> unique_visitors
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`data-to-insights.ecommerce.all_sessions`</span>;
</code></pre></li>
</ol>
<p>Out: </p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>product_views</td><td>unique_visitors</td></tr>
</thead>
<tbody>
<tr>
<td>21493109</td><td>389934</td></tr>
</tbody>
</table>
</div><ul>
<li>Query: Total unique visitors by Channel grouping</li>
</ul>
<pre><code><span class="hljs-keyword">SELECT</span>
  <span class="hljs-keyword">COUNT</span>(<span class="hljs-keyword">DISTINCT</span> fullVisitorId) <span class="hljs-keyword">AS</span> unique_visitors,
  channelGrouping
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`data-to-insights.ecommerce.all_sessions`</span>
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> channelGrouping
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> channelGrouping <span class="hljs-keyword">DESC</span>;
</code></pre><p>Out:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>unique_visitors</td><td>channelGrouping</td></tr>
</thead>
<tbody>
<tr>
<td>38101</td><td>Social</td></tr>
<tr>
<td>57308</td><td>Referral</td></tr>
<tr>
<td>11865</td><td>Paid Search</td></tr>
<tr>
<td>211993</td><td>Organic Search</td></tr>
<tr>
<td>3067</td><td>Display</td></tr>
<tr>
<td>75688</td><td>Direct</td></tr>
<tr>
<td>5966</td><td>Affiliates</td></tr>
<tr>
<td>62</td><td>(Other)</td></tr>
</tbody>
</table>
</div><ul>
<li>Query: Top Five products with the most views</li>
</ul>
<pre><code><span class="hljs-keyword">SELECT</span>
  <span class="hljs-keyword">COUNT</span>(*) <span class="hljs-keyword">AS</span> product_views,
  (v2ProductName) <span class="hljs-keyword">AS</span> ProductName
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`data-to-insights.ecommerce.all_sessions`</span>
<span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">type</span> = <span class="hljs-string">'PAGE'</span>
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> v2ProductName
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> product_views <span class="hljs-keyword">DESC</span>
<span class="hljs-keyword">LIMIT</span> <span class="hljs-number">5</span>;
</code></pre><p>Out :</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>product_views</td><td>ProductName</td></tr>
</thead>
<tbody>
<tr>
<td>316482</td><td>Google Men's 100% Cotton Short Sleeve Hero Tee White</td></tr>
<tr>
<td>221558</td><td>22 oz YouTube Bottle Infuser</td></tr>
<tr>
<td>210700</td><td>YouTube Men's Short Sleeve Hero Tee Black</td></tr>
<tr>
<td>202205</td><td>Google Men's 100% Cotton Short Sleeve Hero Tee Black</td></tr>
<tr>
<td>200789</td><td>YouTube Custom Decals</td></tr>
</tbody>
</table>
</div><ul>
<li>Query: Top Five products with the most unique views</li>
</ul>
<pre><code><span class="hljs-comment">#&gt; You can use the SQL `WITH`</span>
<span class="hljs-comment">#&gt; clause to help break apart a complex query into multiple steps.</span>
<span class="hljs-keyword">WITH</span> unique_product_views_by_person <span class="hljs-keyword">AS</span> (
<span class="hljs-comment">-- find each unique product viewed by each visitor</span>
<span class="hljs-keyword">SELECT</span>
 fullVisitorId,
 (v2ProductName) <span class="hljs-keyword">AS</span> ProductName
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`data-to-insights.ecommerce.all_sessions`</span>
<span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">type</span> = <span class="hljs-string">'PAGE'</span>
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> fullVisitorId, v2ProductName )
<span class="hljs-comment">-- aggregate the top viewed products and sort them</span>
<span class="hljs-keyword">SELECT</span>
  <span class="hljs-keyword">COUNT</span>(*) <span class="hljs-keyword">AS</span> unique_view_count,
  ProductName
<span class="hljs-keyword">FROM</span> unique_product_views_by_person
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> ProductName
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> unique_view_count <span class="hljs-keyword">DESC</span>
<span class="hljs-keyword">LIMIT</span> <span class="hljs-number">5</span>
</code></pre><p>Out: </p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>unique_view_count</td><td>ProductName</td></tr>
</thead>
<tbody>
<tr>
<td>152358</td><td>Google Men's 100% Cotton Short Sleeve Hero Tee White</td></tr>
<tr>
<td>143770</td><td>22 oz YouTube Bottle Infuser</td></tr>
<tr>
<td>127904</td><td>YouTube Men's Short Sleeve Hero Tee Black</td></tr>
<tr>
<td>122051</td><td>YouTube Twill Cap</td></tr>
<tr>
<td>121288</td><td>YouTube Custom Decals</td></tr>
</tbody>
</table>
</div><ul>
<li>Final Query: Total number of distinct products ordered and the total number of units ordered</li>
</ul>
<pre><code><span class="hljs-keyword">SELECT</span>
  <span class="hljs-keyword">COUNT</span>(*) <span class="hljs-keyword">AS</span> product_views,
  <span class="hljs-keyword">COUNT</span>(productQuantity) <span class="hljs-keyword">AS</span> orders,
  <span class="hljs-keyword">SUM</span>(productQuantity) <span class="hljs-keyword">AS</span> quantity_product_ordered,
  v2ProductName
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`data-to-insights.ecommerce.all_sessions`</span>
<span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">type</span> = <span class="hljs-string">'PAGE'</span>
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> v2ProductName
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> product_views <span class="hljs-keyword">DESC</span>
<span class="hljs-keyword">LIMIT</span> <span class="hljs-number">5</span>;
</code></pre><p>Out:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>product_views</td><td>orders</td><td>quantity_product_ordered</td><td>v2ProductName</td></tr>
</thead>
<tbody>
<tr>
<td>316482</td><td>3158</td><td>6352</td><td>Google Men's 100% Cotton Short Sleeve Hero Tee White</td></tr>
<tr>
<td>221558</td><td>508</td><td>4769</td><td>22 oz YouTube Bottle Infuser</td></tr>
<tr>
<td>210700</td><td>949</td><td>1114</td><td>YouTube Men's Short Sleeve Hero Tee Black</td></tr>
<tr>
<td>202205</td><td>2713</td><td>8072</td><td>Google Men's 100% Cotton Short Sleeve Hero Tee Black</td></tr>
<tr>
<td>200789</td><td>1703</td><td>11336</td><td>YouTube Custom Decals</td></tr>
</tbody>
</table>
</div><p>This process of running successive queries helps us derive insights from data using SQL. Technically, the data could be petabytes in size and the process would remain much the same.</p>
<h2 id="heading-benefits-of-bigquery">Benefits of BigQuery:</h2>
<h4 id="heading-superior-insights-with-predictive-analytics">Superior Insights with Predictive Analytics</h4>
<p>Get updated analytics and information on all your business processes by querying streaming data in real-time. Utilize these insights to make data-driven decisions for your business and effectively predict business outcomes without moving data across. </p>
<h4 id="heading-share-insights-seamlessly">Share Insights Seamlessly</h4>
<p>Share and access analytics and insights securely from within your organization, enabling stakeholders to develop insightful reports and dashboards using BI tools right out of the box.</p>
<h4 id="heading-enhanced-security-for-your-data">Enhanced Security for your Data</h4>
<p>Experience enhanced data resiliency, robust security, and reliability controls backed by a 99.99% uptime SLA, ensuring that your data is protected, secure, and unreachable to unauthorized and unauthenticated access. </p>
<h4 id="heading-provisioning-and-system-sizing">Provisioning and System Sizing</h4>
<p>Unlike many relational database management systems (RDBMS), Google BigQuery dynamically allocates query resources as you consume them and deallocates resources as data is deleted or tables are dropped. Furthermore, allocated resources match the query type and complexity. </p>
<h4 id="heading-storage-management">Storage Management</h4>
<p>BigQuery utilizes a proprietary storage format called Capacitor. It is columnar in nature and holds many benefits, including the fact that it can evolve with the query engine. Access patterns are used to determine the most optimal number of shards of data and how they are encoded for storage. The data that BigQuery queries can either live in BigQuery’s own storage on Google's Colossus platform or outside of BigQuery storage in the cloud, for example on Google Drive. </p>
<h4 id="heading-maintenance">Maintenance</h4>
<p>BigQuery receives constant updates from its engineering team. These upgrades cause little to no downtime to BigQuery’s operations, ensuring optimal performance and minimal disruption as you collect essential insights for your business goals. </p>
<h4 id="heading-backup-and-recovery">Backup and Recovery</h4>
<p>Database administrators have always found backup and recovery to be extremely tedious and complex tasks. Costs rise as there is almost always a need for additional licenses and hardware. With BigQuery, backup and recovery is handled at the service level. BigQuery maintains a complete seven-day history of changes against your tables and lets you write specific queries to point-in-time snapshots of your data. If a table is deleted, its history is removed after a period of seven days. </p>
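<p>One way to query such a point-in-time snapshot is BigQuery's time-travel syntax (<code>FOR SYSTEM_TIME AS OF</code>). Here is a minimal sketch through the Python client; the table name is hypothetical.</p>
<pre><code>from google.cloud import bigquery

client = bigquery.Client()

# Read the (hypothetical) table as it looked one hour ago
query = """
    SELECT COUNT(*) AS row_count
    FROM `my-project.my_dataset.orders`
      FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
"""
for row in client.query(query).result():
    print(row.row_count)
</code></pre>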
<h4 id="heading-monitoring-and-auditing">Monitoring and Auditing</h4>
<p>Using BigQuery metrics, you can monitor how BigQuery is behaving through various charts and alerts. To take a proactive approach to system health, you can create alerts that trigger based on thresholds you define. BigQuery also creates various logs, including audit logs of actions made by users. </p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>As is apparent, BigQuery provides powerful enablement for your business and decision-making through optimized data processing, smart data insights, and resiliency in how this data is stored. It is a powerful tool that allows your organization to use data to its advantage. </p>
]]></content:encoded></item><item><title><![CDATA[Data Lake VS Data Warehouse]]></title><description><![CDATA[Data Lakes and Data Warehouses are used widely to store large amounts of data. However, they are not interchangeable terms. You will be surprised to know that both of these approaches are complementary to one another.
Let’s know about these two terms...]]></description><link>https://anujsyal.com/data-lake-vs-data-warehouse</link><guid isPermaLink="true">https://anujsyal.com/data-lake-vs-data-warehouse</guid><category><![CDATA[big data]]></category><category><![CDATA[data analysis]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Databases]]></category><category><![CDATA[data]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Mon, 17 Jan 2022 03:21:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/itZ0oxI2CCY/upload/v1642389531037/qAzWXRORG.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Data Lakes and Data Warehouses are used widely to store large amounts of data. However, they are not interchangeable terms. You will be surprised to know that both of these approaches are complementary to one another.
Let’s explore these two terms in depth in the segments below. </p>
<h2 id="heading-introduction-to-data-lake">Introduction to Data Lake</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1642389577788/vlTCX_FLfC.jpeg" alt="aaron-burden-aRya3uMiNIA-unsplash.jpg" />
Photo by <a href="https://unsplash.com/@aaronburden?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Aaron Burden</a> on <a href="https://unsplash.com/s/photos/lake?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
<p>A data lake is known to be a centralized repository. It enables you to accumulate all of your structured and unstructured data, and it does so at any scale. It allows you to store your data in raw, unstructured form and to run various types of analytics on it, from visualizations and dashboards to big data processing and machine learning, guiding you towards better decisions.</p>
<blockquote>
<p>Data  Lake is less structured, more like a lake where you dump everything first then find out usage later</p>
</blockquote>
<h2 id="heading-why-does-an-enterprise-need-a-data-lake">Why does an enterprise need a data lake?</h2>
<p>Organizations and firms that successfully generate business value from their data will outperform their peers. In various surveys, it was noted that organizations that implemented a data lake outperformed similar companies by around 10% in organic revenue growth. These firms were able to run new kinds of analytics on data such as clickstreams, social media, log files, and internet-connected devices stored in the data lake. </p>
<p>Ultimately, it helped them recognize and act on opportunities for faster business growth by attracting and retaining customers, increasing productivity, and making better-informed decisions. </p>
<h2 id="heading-what-value-does-a-data-lake-hold-in-an-enterprise">What value does a data lake hold in an enterprise?</h2>
<p>The ability to store plenty of data from tons of sources in a short time, and to empower users to combine and examine that data in various ways, often leads to better and quicker decision-making. The following are some instances that make it clear:</p>
<ul>
<li><strong>Enhanced Customer Interactions</strong>:</li>
</ul>
<p>A data lake can combine customer data from a CRM platform with social media analytics, a marketing platform that includes purchase history, and incident tickets. This lets the business identify its most valuable and promising customer cohorts, the reasons behind customer churn, and the rewards and other promotional activities that will improve loyalty. </p>
<ul>
<li><strong>Enhanced R&amp;D Innovation Choices</strong>:</li>
</ul>
<p>A data lake enables your R&amp;D teams to test their hypotheses, refine assumptions, and analyze results accordingly. That can mean selecting the right materials during product design for faster performance, or genomic research that eventually leads to better medication. </p>
<ul>
<li><strong>Improved Operational Efficiencies</strong>:
The Internet of Things (IoT) introduces more ways to collect data on processes like production, with live data arriving from internet-connected devices. A data lake makes it much easier to store and run analytics on this machine-generated IoT data, resulting in reduced operational cost and improved quality.</li>
</ul>
<h2 id="heading-positioning-data-lakes-in-the-cloud">Positioning Data Lakes in the Cloud</h2>
<p>Data lakes are an ideal workload to deploy in the cloud, as the cloud offers performance, reliability, scalability, availability, and a broad, diverse set of analytics engines.</p>
<p>Moreover, the major reasons customers see the cloud as an advantage for data lakes are better security, faster time to availability and deployment, more frequent feature updates, wider geographic coverage, elasticity, and costs tied to actual utilization.</p>
<blockquote>
<p>A good example for a Data Lake is Google Cloud Storage or Amazon S3</p>
</blockquote>
<h2 id="heading-introduction-to-data-warehouse">Introduction to Data Warehouse</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1642386035393/KvVawA_xv.jpeg" alt="joshua-tsu-x6vDHnMNJFw-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@joshdatsu?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Joshua Tsu</a> on <a href="https://unsplash.com/s/photos/water-tank?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>Data Warehouse is a central repository of information that is enabled to be analyzed in order to make informed decisions. Typically, the data flows into a data warehouse from transactional systems and other sources. </p>
<blockquote>
<p>Data Warehouse is more structured, more like a water tank where you define usage first then put in the data</p>
</blockquote>
<h2 id="heading-how-does-a-data-warehouse-work">How does a data warehouse work?</h2>
<p>You may find multiple databases in a data warehouse. Within each database, data is organized into tables and columns, and for each column a description of the data (such as its type) can be defined. Tables can be organized inside schemas, which you can think of as folders. Finally, when data is ingested, it is simply stored in the appropriate tables. </p>
<h2 id="heading-why-is-data-warehouse-important">Why is Data Warehouse important?</h2>
<p>Like a data lake, a data warehouse holds great value when it comes to informed decision-making. Not only that, it also consolidates data from plenty of sources. In addition, historical data analysis, data quality, accuracy, and consistency are some of the elements a data warehouse brings. Furthermore, separating analytics processing from transactional databases ultimately enhances the performance of both systems.</p>
<blockquote>
<p>A good example for Data Warehouse is Google's Big Query or Amazon Redshift</p>
</blockquote>
<h2 id="heading-data-lakes-and-data-warehouses-have-two-different-approaches-heres-how">Data Lakes and Data Warehouses have two different approaches- Here’s how</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1642389095821/pjsgLDzHr.jpeg" alt="oliver-roos-PCNdauVPbjA-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@fairfilter?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Oliver Roos</a> on <a href="https://unsplash.com/s/photos/two-paths?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>Depending on its needs, an organization will often want both a data warehouse and a data lake, because they serve different needs and use cases. </p>
<p>A data warehouse is quite different from a data lake. A data warehouse is a database optimized to analyze relational data arriving from transactional systems and line-of-business applications. </p>
<p>A data lake, on the other hand, serves different purposes: it stores relational data from line-of-business applications as well as data from mobile applications, social media, and IoT devices. In other words, it stores all of your data without requiring the schema to be designed up front.</p>
<p>Moreover, data warehouses are primarily used for batch reporting, visualizations, and BI analytics on structured data, whereas a data lake can be used for machine learning, data discovery, predictive analytics, and profiling over large amounts of data.</p>
<p>Organizations with data warehouses are seeing the perks of data lakes, and to maximize the benefits they are evolving their warehouses to include data lakes as well. This brings not only more diverse query capabilities but also advanced ways of discovering new insights.</p>
<h2 id="heading-how-do-data-lake-and-data-warehouse-work-together">How do data lake and data warehouse work together?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1642389355869/gP_VmxVYr.png" alt="image.png" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@pawankawan?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Pawan Kawan</a> on <a href="https://unsplash.com/s/photos/gears?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>These approaches are complementary to one another. The data warehouse structures and packages the data for quality, consistency, and performance with significantly increased concurrency. The data lake, on the other hand, focuses on keeping the original raw data in permanent storage. It does so at a reasonable cost while offering a new level of analytical agility. </p>
<p>These two different yet complementary solutions are recommended to be a part of any modern data architecture. </p>
]]></content:encoded></item><item><title><![CDATA[Why Get a Cloud Certificate in Data Engineering?]]></title><description><![CDATA[Learn the core concepts of data engineering to seek a job with professional data engineering skills, which will be in demand in 2022. Find which course suits you the best! 
What can you earn with a professional certificate?


Photo by Sasun Bughdarya...]]></description><link>https://anujsyal.com/why-get-a-cloud-certificate-in-data-engineering</link><guid isPermaLink="true">https://anujsyal.com/why-get-a-cloud-certificate-in-data-engineering</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Certification]]></category><category><![CDATA[google cloud]]></category><category><![CDATA[big data]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Thu, 02 Dec 2021 05:30:34 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/QiLPQeQSXD0/upload/v1638415158572/b7VyRerHF.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Learn the core concepts of data engineering to seek a job with professional data engineering skills, which will be in demand in 2022. Find which course suits you the best! </p>
<h2 id="heading-what-can-you-earn-with-a-professional-certificate">What can you earn with a professional certificate?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1638422650793/VHAAksTuN.jpeg" alt="sasun-bughdaryan-OyDZRZOlENw-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@sasun1990?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Sasun Bughdaryan</a> on <a href="https://unsplash.com/s/photos/earn?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>A professional certificate allows you to prepare for your upcoming job and gain skills that are adequate to secure a role and make a mark in the industry. These courses are based on developing skills rather than just reading theory. 
Once the program is completed, you will receive a certification after finishing the projects, which will help you secure a job. 
You will also gain relevant hands-on experience, which allows you to kick-start your career with a bang! </p>
<h2 id="heading-professional-data-engineering-certification">Professional Data Engineering Certification</h2>
<p>A professional data engineering certification allows you to learn the core concepts of big data and machine learning. It enables you to understand how to employ BigQuery for interactive insights on your data. </p>
<p>With a professional data engineer course, you will migrate existing SQL and Hadoop workloads to the cloud. It will also teach you various data processing techniques to engineer data. </p>
<h2 id="heading-different-options-for-certification">Different options for certification</h2>
<p>First, you need to select the type of certification that you require! Different cloud providers have learning and certification tracks of their own. Below are the certification tracks for the top three public cloud providers that offer certification around the data engineering skillset:</p>
<ol>
<li><a target="_blank" href="https://aws.amazon.com/certification/certified-data-analytics-specialty/?ch=tile&amp;tile=getstarted">AWS </a></li>
<li><a target="_blank" href="https://docs.microsoft.com/en-us/learn/certifications/roles/data-engineer">Azure </a></li>
<li><a target="_blank" href="https://cloud.google.com/certification/data-engineer">Google</a></li>
</ol>
<p>These are all different service providers, and you can select the certification according to your preference.</p>
<blockquote>
<p>I personally have started preparing for Google Cloud Professional Data Engineer Certification</p>
<p>One of the best parts of choosing Google is the exposure to ML and AI capabilities, which are more advanced than those of other cloud providers. Also, the learning track involves hands-on labs, which help you get accustomed to solving real-life problems.</p>
</blockquote>
<h2 id="heading-responsibilities-of-data-engineering">Responsibilities of Data Engineering</h2>
<p>A data engineer works to process and engineer data for operational and analytical purposes. The data engineer then cleanses and structures the data so that it can be used analytically. 
One of the significant responsibilities of a data engineer is to make the data easier to handle and to optimize the big data ecosystem within the organization. Here are some organizations that use big data insights, and you could become a part of them:</p>
<blockquote>
<p>1) Uber  2) Netflix  3) Starbucks  4) BDO  5) T-mobile  6) Facebook  7) Spotify  8) Amazon </p>
</blockquote>
<p>Most popular companies and organizations are using data analytics to provide their customers with relevant results. Taking a Cloud certification can be the first step towards your dream job.</p>
<h2 id="heading-why-to-study-for-a-certification">Why to study for a certification?</h2>
<p>A certification helps you stand out from other candidates applying for a job. It can help you secure a position if you are a fresher, and it can even help you earn promotions! Yes, if you want job promotions, then you are on the right track. </p>
<blockquote>
<p><strong>Best way to acquire Data Engineering Skillset</strong></p>
</blockquote>
<p>A professional data engineering certification is one of the best ways to land in 
the data engineering space. Most importantly, you will earn hands-on data engineering skills rather than a degree with only theoretical knowledge. </p>
<p><strong>Here is how you can prepare for a professional data engineer certification. </strong></p>
<h2 id="heading-preparing-for-a-google-data-engineering-certification">Preparing for a Google Data Engineering Certification</h2>
<p>If you are interested in a video version for preparation check out this youtube video</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/blCMQRhZgso">https://youtu.be/blCMQRhZgso</a></div>
<p>It is necessary to prepare for the professional data engineering exam to help you achieve a certification that will lead to a data engineering career. 
This Google Cloud Certification will examine the following abilities: </p>
<ul>
<li>Designing data processing systems </li>
<li>Ensuring solution quality </li>
<li>Operationalizing machine learning models </li>
<li>Building and operationalizing data processing systems </li>
</ul>
<p>If you are interested in what a job scope for a data engineer will look like, check out this <a target="_blank" href="https://careers.google.com/jobs/results/81200290256036550-data-engineer-users-and-products/"> job post from google</a> . This should also give a sneak peek of <strong>A day in life of data engineer</strong></p>
<h2 id="heading-complete-exam-guide-for-google-cloud-certification">Complete Exam Guide for Google Cloud Certification</h2>
<p>Follow this complete exam guide for Google Cloud Certifications.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1638422519002/b-wk2FHqP.jpeg" alt="daniil-silantev-ioYwosPYC0U-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@betagamma?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Daniil Silantev</a> on <a href="https://unsplash.com/s/photos/guide?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<ol>
<li><p>You need to prepare to design the data processing system, which includes mapping the storage systems to the business requirements, creating the data pipelines, and designing the schemas. It also covers publishing and visualizing data, for example with BigQuery. </p>
</li>
<li><p>You will also design the data processing solutions by considering some factors like infrastructure, system availability, fault tolerance, use of distributed systems. </p>
</li>
<li><p>You should also be able to migrate data, which covers the data warehousing and data processing phases.</p>
</li>
<li><p>You will need to build and operationalize the data processing systems for the storage and management of data. </p>
</li>
<li><p>You will need to operationalize the machine learning models to deploy (ML pipeline), ingest data, and continuously evaluate it. </p>
</li>
<li><p>Lastly, you will ensure the security and reliability of the solution. </p>
</li>
</ol>
<p>For more details check out the  <a target="_blank" href="https://cloud.google.com/certification/guides/data-engineer">official exam guide from Google</a> </p>
<h2 id="heading-certification-learning-path">Certification Learning Path</h2>
<h4 id="heading-google-cloud-certification-learning-path">Google Cloud Certification Learning Path</h4>
<p>Google itself provides a well-designed learning path that includes various videos and hands-on labs, so you can apply all the concepts of the course as you go. </p>
<p>The Google Cloud Certification course costs around $39, and once you are certified it will help you build a career and secure a job. 
Which cloud services do you need to cover?</p>
<p>You will need to cover the cloud services spanning the core functional concepts of storage, ingestion, analytics, machine learning, and serving data solutions. </p>
<p><strong>With this guide, you will get the Google Cloud certification in no time, and you will enjoy the perks of this certification real soon, take a step and change your data game today!</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1638421650173/c76fW2v9n.png" alt="image.png" /></p>
<blockquote>
<p> <a target="_blank" href="https://cloud.google.com/training/data-engineering-and-analytics#data-engineer-learning-path">Website Screenshot from Google Cloud Training</a> </p>
</blockquote>
<h4 id="heading-google-certified-professional-data-engineer-from-acloudguru-from-tim-berry">Google Certified Professional Data Engineer from <strong>AcloudGuru</strong> from <strong>Tim Berry</strong></h4>
<p>Acloudguru previously known as Linux Academy. According to the website, the primary focus of this course is to prepare you for the GCP Professional Data Engineer certification exam.  </p>
<p>Along the way you’ll solidify your foundations in data engineering and machine learning, ensuring that by the end of the course you will be able to design and build data processing solutions, operationalize machine learning models and gain a working knowledge of relevant GCP data processing tools and technologies.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1638421992551/QH-9bDlKH.png" alt="image.png" /></p>
<blockquote>
<p> <a target="_blank" href="https://acloudguru.com/course/google-certified-professional-data-engineer">Website Screenshot from <strong>AcloudGuru</strong></a> </p>
</blockquote>
<h2 id="heading-conclusion">Conclusion</h2>
<p>This brings us to the end of the article. Now you can prepare for the professional data engineer certification that will change your job-seeking experience, and take advantage of the best opportunities this certification opens up. 
You can choose from a wide variety of options and select the best certification for your career path to becoming a data engineer. 
It is always a good option to choose a certification that adds skills to your skillset and makes organizations want to hire you over other candidates. The best part is that you will outweigh the others in the job search. </p>
]]></content:encoded></item><item><title><![CDATA[Creating a Machine Learning Model with SQL]]></title><description><![CDATA[Even though Machine Learning is already really advanced, however, it has some weaknesses which can make it hard for you to use it.
Current machine learning workflows and its problems

If you've worked with ml models you may realize that structure and...]]></description><link>https://anujsyal.com/creating-a-machine-learning-model-with-sql</link><guid isPermaLink="true">https://anujsyal.com/creating-a-machine-learning-model-with-sql</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[SQL]]></category><category><![CDATA[google cloud]]></category><category><![CDATA[big data]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Fri, 12 Nov 2021 05:34:56 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1636687525869/46MaTuwL1.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Even though Machine Learning is already really advanced, however, it has some weaknesses which can make it hard for you to use it.</p>
<h2 id="current-machine-learning-workflows-and-its-problems">Current machine learning workflows and its problems</h2>
<ul>
<li>If you've worked with ML models, you may have realized that building and preparing them can be extremely time-intensive.</li>
<li>A typical data scientist must first export small amounts of data from the data store into an IPython notebook and into data-handling frameworks like pandas for Python.</li>
<li>If you're building a custom model, you first need to transform and pre-process all that data and perform all the feature engineering before you can even feed the data into the model.</li>
<li>Then, finally, after you've built your model in, say, TensorFlow or a similar library, you train it locally on your computer or on a VM with a small sample. That then requires you to go back, create more new data features, and improve performance, and you repeat and repeat and repeat; it's hard, so you stop after a few iterations.</li>
</ul>
<p>But hey there, don’t worry! You can work on building models, even if you are not the data scientist of your team.</p>
<h2 id="introducing-to-google-bigquery">Introducing to Google BigQuery</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636691624334/Rm6bgS779.jpeg" alt="patrick-lindenberg-1iVKwElWrPA-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@heapdump?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Patrick Lindenberg</a> on <a href="https://unsplash.com/s/photos/storage?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>BigQuery is a fully-managed petabyte-scale enterprise data warehouse. It is basically made up of two key things</p>
<ul>
<li>Fast SQL query engine</li>
<li>Fully managed data storage<blockquote>
<p>Big Query supports querying petabytes of data with standard SQL that everyone is used to.</p>
</blockquote>
</li>
</ul>
<p>Example:</p>
<pre><code><span class="hljs-meta">#standardSQL</span>
<span class="hljs-keyword">SELECT</span>
COUNT(*) <span class="hljs-keyword">AS</span> total_trips
<span class="hljs-keyword">FROM</span>
`bigquery-<span class="hljs-built_in">public</span>-data.san_francisco_bikeshare.bikeshare_trips`
</code></pre><h3 id="other-big-query-key-features">Other Big Query Key features</h3>
<ol>
<li>Serverless</li>
<li>Flexible pricing model</li>
<li>Standard GCP Data encryption and security</li>
<li>Perfect for BI and AI use cases</li>
<li>ML and predictive modeling with BigQuery ML</li>
<li>Really cheap storage - same as Google cloud storage buckets</li>
<li>Interactive data analysis with BigQuery BI Engine - connect to tableau, data studio, looker, etc</li>
</ol>
<h4 id="big-query-and-its-ml-features">Big query and its ml features</h4>
<p>As a part of one of the main features, big query allows building predictive machine learning models with just simple SQL syntax. With the petabyte processing power of google cloud, you can easily create models right there in the warehouse.
A sample syntax to create models looks like this</p>
<pre><code><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">OR</span> <span class="hljs-keyword">REPLACE</span> <span class="hljs-keyword">MODEL</span> <span class="hljs-string">`dataset.classification_model`</span>
OPTIONS
(
model_type=<span class="hljs-string">'logistic_reg'</span>,
labels = [<span class="hljs-string">'y'</span>]
)
<span class="hljs-keyword">AS</span>
</code></pre><h2 id="a-typical-workflow-using-big-query-ml-5-step">A typical workflow using Big Query ML [5 Step]</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636692633165/rgRnSkQ3R.png" alt="Untitled (1).png" /></p>
<blockquote>
<p>Flowchart diagram from the author</p>
</blockquote>
<h2 id="tutorial-building-a-classification-model-with-big-query-ml-simple-sql-syntax">Tutorial: Building a classification model with big query ML (Simple SQL Syntax)</h2>
<p>You can open up the <a target="_blank" href="https://console.cloud.google.com/bigquery">BigQuery console</a> and start replicating the steps below:</p>
<p>OR</p>
<p><strong>You can watch a Video Tutorial I created</strong></p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=7fJs1gEpPjo">https://www.youtube.com/watch?v=7fJs1gEpPjo</a></div>
<h4 id="dataset">Dataset</h4>
<p>I am using a public BigQuery dataset, <a target="_blank" href="https://console.cloud.google.com/marketplace/product/obfuscated-ga360-data/obfuscated-ga360-data">google_analytics_sample</a>. The dataset provides 12 months (August 2016 to August 2017) of obfuscated Google Analytics 360 data from the Google Merchandise Store. 
The data is typical of what an ecommerce website would see and includes the following information:</p>
<ul>
<li><p><strong>Traffic source data:</strong> information about where website visitors originate, including data about organic traffic, paid search traffic, and display traffic</p></li>
<li><p><strong>Content data:</strong> information about the behavior of users on the site, such as URLs of pages that visitors look at, how they interact with content, etc.</p></li>
<li><p><strong>Transactional data:</strong> information about the transactions on the Google Merchandise Store website.</p></li>
</ul>
<p><strong>Dataset license:</strong>
A public dataset is any dataset that is stored in BigQuery and made available to the general public through the Google Cloud Public Dataset Program. Public datasets are datasets that BigQuery hosts for you to access and integrate into your applications. Google pays for the storage of these datasets and provides public access to the data via a project; you pay only for the queries that you perform on the data, and the first 1 TB per month is free, subject to query pricing details. The dataset is available under the Creative Commons Attribution 4.0 License.</p>
<h4 id="machine-learning-problem-we-will-try-to-solve">Machine Learning problem we will try to solve</h4>
<p>We will try to predict whether a user will buy products on a return visit, hence we name our label <code>will_buy_on_return_visit</code>.</p>
<h3 id="step1-exploring-the-dataset">Step1: Exploring the dataset</h3>
<ul>
<li>Checking conversion rate</li>
</ul>
<pre><code><span class="hljs-keyword">WITH</span> visitors <span class="hljs-keyword">AS</span>(
<span class="hljs-keyword">SELECT</span>
COUNT(<span class="hljs-keyword">DISTINCT</span> fullVisitorId) <span class="hljs-keyword">AS</span> total_visitors
<span class="hljs-keyword">FROM</span> `bigquery-<span class="hljs-built_in">public</span>-data.google_analytics_sample.ga_sessions_20170801`
),
purchasers <span class="hljs-keyword">AS</span>(
<span class="hljs-keyword">SELECT</span>
COUNT(<span class="hljs-keyword">DISTINCT</span> fullVisitorId) <span class="hljs-keyword">AS</span> total_purchasers
<span class="hljs-keyword">FROM</span> `bigquery-<span class="hljs-built_in">public</span>-data.google_analytics_sample.ga_sessions_20170801`
<span class="hljs-keyword">WHERE</span> totals.transactions <span class="hljs-keyword">IS</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">NULL</span>
)
<span class="hljs-keyword">SELECT</span>
  total_visitors,
  total_purchasers,
  total_purchasers / total_visitors <span class="hljs-keyword">AS</span> conversion_rate
<span class="hljs-keyword">FROM</span> visitors, purchasers
</code></pre><ul>
<li>What are the top 5 selling products?</li>
</ul>
<pre><code><span class="hljs-keyword">SELECT</span>
  p.v2ProductName,
  p.v2ProductCategory,
  <span class="hljs-keyword">SUM</span>(p.productQuantity) <span class="hljs-keyword">AS</span> units_sold,
  <span class="hljs-keyword">ROUND</span>(<span class="hljs-keyword">SUM</span>(p.localProductRevenue/<span class="hljs-number">1000000</span>),<span class="hljs-number">2</span>) <span class="hljs-keyword">AS</span> revenue
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`</span>,
<span class="hljs-keyword">UNNEST</span>(hits) <span class="hljs-keyword">AS</span> h,
<span class="hljs-keyword">UNNEST</span>(h.product) <span class="hljs-keyword">AS</span> p
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> <span class="hljs-number">1</span>, <span class="hljs-number">2</span>
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> revenue <span class="hljs-keyword">DESC</span>
<span class="hljs-keyword">LIMIT</span> <span class="hljs-number">5</span>;
</code></pre><ul>
<li>Question: How many visitors bought on subsequent visits to the website?</li>
</ul>
<pre><code><span class="hljs-keyword">WITH</span> all_visitor_stats <span class="hljs-keyword">AS</span> (
<span class="hljs-keyword">SELECT</span>
  fullvisitorid, # <span class="hljs-number">741</span>,<span class="hljs-number">721</span> <span class="hljs-keyword">unique</span> visitors
  <span class="hljs-keyword">IF</span>(COUNTIF(totals.transactions &gt; <span class="hljs-number">0</span> <span class="hljs-keyword">AND</span> totals.newVisits <span class="hljs-keyword">IS</span> <span class="hljs-keyword">NULL</span>) &gt; <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>) <span class="hljs-keyword">AS</span> will_buy_on_return_visit
  <span class="hljs-keyword">FROM</span> `bigquery-<span class="hljs-built_in">public</span>-data.google_analytics_sample.ga_sessions_20170801`
  <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> fullvisitorid
)
<span class="hljs-keyword">SELECT</span>
  COUNT(<span class="hljs-keyword">DISTINCT</span> fullvisitorid) <span class="hljs-keyword">AS</span> total_visitors,
  will_buy_on_return_visit
<span class="hljs-keyword">FROM</span> all_visitor_stats
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> will_buy_on_return_visit
</code></pre><h3 id="step-2-select-features-and-create-your-training-dataset">Step 2. Select features and create your training dataset</h3>
<p>Now that we know a bit more about the data, let's create the final dataset we want to use for training</p>
<pre><code><span class="hljs-keyword">SELECT</span>
  * <span class="hljs-keyword">EXCEPT</span>(fullVisitorId)
<span class="hljs-keyword">FROM</span>
  <span class="hljs-comment"># features</span>
  (<span class="hljs-keyword">SELECT</span>
    fullVisitorId,
    <span class="hljs-keyword">IFNULL</span>(totals.bounces, <span class="hljs-number">0</span>) <span class="hljs-keyword">AS</span> bounces,
    <span class="hljs-keyword">IFNULL</span>(totals.timeOnSite, <span class="hljs-number">0</span>) <span class="hljs-keyword">AS</span> time_on_site
  <span class="hljs-keyword">FROM</span>
    <span class="hljs-string">`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`</span>
  <span class="hljs-keyword">WHERE</span>
    totals.newVisits = <span class="hljs-number">1</span>)
  <span class="hljs-keyword">JOIN</span>
  (<span class="hljs-keyword">SELECT</span>
    fullvisitorid,
    <span class="hljs-keyword">IF</span>(COUNTIF(totals.transactions &gt; <span class="hljs-number">0</span> <span class="hljs-keyword">AND</span> totals.newVisits <span class="hljs-keyword">IS</span> <span class="hljs-literal">NULL</span>) &gt; <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>) <span class="hljs-keyword">AS</span> will_buy_on_return_visit
  <span class="hljs-keyword">FROM</span>
      <span class="hljs-string">`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`</span>
  <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> fullvisitorid)
  <span class="hljs-keyword">USING</span> (fullVisitorId)
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> time_on_site <span class="hljs-keyword">DESC</span>
<span class="hljs-keyword">LIMIT</span> <span class="hljs-number">10</span>;
</code></pre><h3 id="step-3-create-a-model">Step 3: Create a Model</h3>
<p>This step uses a <code>CREATE MODEL</code> statement over the dataset created in the previous step</p>
<pre><code><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">OR</span> <span class="hljs-keyword">REPLACE</span> <span class="hljs-keyword">MODEL</span> <span class="hljs-string">`ecommerce.classification_model`</span>
OPTIONS
(
model_type=<span class="hljs-string">'logistic_reg'</span>,
labels = [<span class="hljs-string">'will_buy_on_return_visit'</span>]
)
<span class="hljs-keyword">AS</span>
<span class="hljs-comment">#standardSQL</span>
<span class="hljs-keyword">SELECT</span>
  * <span class="hljs-keyword">EXCEPT</span>(fullVisitorId)
<span class="hljs-keyword">FROM</span>
  <span class="hljs-comment"># features</span>
  (<span class="hljs-keyword">SELECT</span>
    fullVisitorId,
    <span class="hljs-keyword">IFNULL</span>(totals.bounces, <span class="hljs-number">0</span>) <span class="hljs-keyword">AS</span> bounces,
    <span class="hljs-keyword">IFNULL</span>(totals.timeOnSite, <span class="hljs-number">0</span>) <span class="hljs-keyword">AS</span> time_on_site
  <span class="hljs-keyword">FROM</span>
    <span class="hljs-string">`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`</span>
  <span class="hljs-keyword">WHERE</span>
    totals.newVisits = <span class="hljs-number">1</span>)
  <span class="hljs-keyword">JOIN</span>
  (<span class="hljs-keyword">SELECT</span>
    fullvisitorid,
    <span class="hljs-keyword">IF</span>(COUNTIF(totals.transactions &gt; <span class="hljs-number">0</span> <span class="hljs-keyword">AND</span> totals.newVisits <span class="hljs-keyword">IS</span> <span class="hljs-literal">NULL</span>) &gt; <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>) <span class="hljs-keyword">AS</span> will_buy_on_return_visit
  <span class="hljs-keyword">FROM</span>
      <span class="hljs-string">`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`</span>
  <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> fullvisitorid)
  <span class="hljs-keyword">USING</span> (fullVisitorId)
;
</code></pre><h3 id="step-4-evaluate-classification-model-performance">Step 4: Evaluate classification model performance</h3>
<p>Evaluate the performance of the model you just created using SQL</p>
<pre><code><span class="hljs-keyword">SELECT</span>
  roc_auc,
  <span class="hljs-keyword">CASE</span>
    <span class="hljs-keyword">WHEN</span> roc_auc &gt; <span class="hljs-number">.9</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">'good'</span>
    <span class="hljs-keyword">WHEN</span> roc_auc &gt; <span class="hljs-number">.8</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">'fair'</span>
    <span class="hljs-keyword">WHEN</span> roc_auc &gt; <span class="hljs-number">.7</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">'not great'</span>
  <span class="hljs-keyword">ELSE</span> <span class="hljs-string">'poor'</span> <span class="hljs-keyword">END</span> <span class="hljs-keyword">AS</span> model_quality
<span class="hljs-keyword">FROM</span>
  ML.EVALUATE(<span class="hljs-keyword">MODEL</span> ecommerce.classification_model,  (
<span class="hljs-keyword">SELECT</span>
  * <span class="hljs-keyword">EXCEPT</span>(fullVisitorId)
<span class="hljs-keyword">FROM</span>
  <span class="hljs-comment"># features</span>
  (<span class="hljs-keyword">SELECT</span>
    fullVisitorId,
    <span class="hljs-keyword">IFNULL</span>(totals.bounces, <span class="hljs-number">0</span>) <span class="hljs-keyword">AS</span> bounces,
    <span class="hljs-keyword">IFNULL</span>(totals.timeOnSite, <span class="hljs-number">0</span>) <span class="hljs-keyword">AS</span> time_on_site
  <span class="hljs-keyword">FROM</span>
    <span class="hljs-string">`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`</span>
  <span class="hljs-keyword">WHERE</span>
    totals.newVisits = <span class="hljs-number">1</span>
    <span class="hljs-keyword">AND</span> <span class="hljs-built_in">date</span> <span class="hljs-keyword">BETWEEN</span> <span class="hljs-string">'20170501'</span> <span class="hljs-keyword">AND</span> <span class="hljs-string">'20170630'</span>) <span class="hljs-comment"># eval on 2 months</span>
  <span class="hljs-keyword">JOIN</span>
  (<span class="hljs-keyword">SELECT</span>
    fullvisitorid,
    <span class="hljs-keyword">IF</span>(COUNTIF(totals.transactions &gt; <span class="hljs-number">0</span> <span class="hljs-keyword">AND</span> totals.newVisits <span class="hljs-keyword">IS</span> <span class="hljs-literal">NULL</span>) &gt; <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>) <span class="hljs-keyword">AS</span> will_buy_on_return_visit
  <span class="hljs-keyword">FROM</span>
      <span class="hljs-string">`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`</span>
  <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> fullvisitorid)
  <span class="hljs-keyword">USING</span> (fullVisitorId)
));
</code></pre><h4 id="future-steps-and-feature-engineering">Future steps and feature engineering</h4>
<p>If you are further interested in improving the performance of the model, you can always opt-in for adding in more features from the dataset</p>
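<p>If you would rather orchestrate these BigQuery ML steps from Python instead of the console, the same SQL can be submitted through the <code>google-cloud-bigquery</code> client library. The sketch below is illustrative only: it assumes application-default credentials and a project you own (<code>your-project-id</code> is a placeholder), and it names a hypothetical <code>classification_model_v2</code> that adds <code>pageviews</code> as an extra feature.</p>
<pre><code># Illustrative sketch: retraining and evaluating the model from Python.
# Assumes `pip install google-cloud-bigquery` and application-default credentials.
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project id

train_sql = """
CREATE OR REPLACE MODEL `ecommerce.classification_model_v2`
OPTIONS (model_type='logistic_reg', labels=['will_buy_on_return_visit']) AS
SELECT * EXCEPT(fullVisitorId)
FROM
  (SELECT
     fullVisitorId,
     IFNULL(totals.bounces, 0) AS bounces,
     IFNULL(totals.timeOnSite, 0) AS time_on_site,
     IFNULL(totals.pageviews, 0) AS pageviews          -- extra feature
   FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
   WHERE totals.newVisits = 1)
JOIN
  (SELECT
     fullVisitorId,
     IF(COUNTIF(totals.transactions &gt; 0 AND totals.newVisits IS NULL) &gt; 0, 1, 0)
       AS will_buy_on_return_visit
   FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
   GROUP BY fullVisitorId)
USING (fullVisitorId)
"""
client.query(train_sql).result()   # blocks until the training job finishes

# Evaluate the retrained model on its held-out evaluation split.
eval_sql = "SELECT roc_auc FROM ML.EVALUATE(MODEL `ecommerce.classification_model_v2`)"
for row in client.query(eval_sql).result():
    print("ROC AUC:", row.roc_auc)
</code></pre>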
<h2 id="conclusion">Conclusion</h2>
<p>Products like Google's BigQuery ML make building machine learning models accessible to more people.
With simple SQL syntax and Google's processing power, it is really easy to churn out real-life big data models.</p>
]]></content:encoded></item><item><title><![CDATA[Spark Streaming with Python]]></title><description><![CDATA[Photo by JJ Ying on Unsplash
Apache Spark Streaming is quite popular. Due to its integrated technology, Spark Streaming outperforms previous systems in terms of data stream quality and comprehensive approach.
Python and Spark Streaming do wonders for...]]></description><link>https://anujsyal.com/spark-streaming-with-python</link><guid isPermaLink="true">https://anujsyal.com/spark-streaming-with-python</guid><category><![CDATA[Python]]></category><category><![CDATA[spark]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data]]></category><category><![CDATA[big data]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Wed, 03 Nov 2021 05:15:57 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1635911185166/uYhG6kso8.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Photo by <a href="https://unsplash.com/@jjying?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">JJ Ying</a> on <a href="https://unsplash.com/s/photos/pipeline?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
<p>Apache Spark Streaming is quite popular. Due to its integrated technology, Spark Streaming outperforms previous systems in terms of data stream quality and comprehensive approach.</p>
<p>Python and Spark Streaming do wonders for industry giants when used together. Netflix is an excellent example of Python and Spark Streaming in action: the people behind the popular streaming platform have published multiple articles about how they use the technique to help us enjoy Netflix even more. Let's get started with the basics.</p>
<h2 id="what-is-spark-streaming-and-how-does-it-work">What is spark streaming, and how does it work?</h2>
<p>The Spark platform contains various modules, including Spark Streaming. Spark Streaming is a method for analyzing "unbounded" information, sometimes known as "streaming" information. This is accomplished by breaking the stream down into micro-batches and allowing windowed execution over many batches.</p>
<p>The Spark Streaming interface is a Spark API module. Python, Scala, and Java are all supported. It allows you to handle real-time data streams in a fault-tolerant and flexible manner. The Spark engine takes the data in batches and produces the end results as a stream of batches. </p>
<h2 id="what-is-a-streaming-data-pipeline">What is a streaming data pipeline?</h2>
<p>It is a technology that allows data to move smoothly and automatically from one location to another. This technology eliminates many typical issues that companies face, such as information leakage, bottlenecks, conflicting data, and duplicate entries.</p>
<p>Streaming data pipelines are data pipeline architectures that process thousands of inputs in real time and at scale. As an outcome, you'll be able to gather, analyze, and retain a lot of data. This functionality enables real-time applications, monitoring, and reporting.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635912373028/h3GNU3Jwa.jpeg" alt="agence-olloweb-Z2ImfOCafFk-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@olloweb?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Agence Olloweb</a> on <a href="https://unsplash.com/s/photos/concepts?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<h2 id="streaming-architecture-of-spark">Streaming Architecture of Spark.</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635912997193/aK1pB69W0.png" alt="image.png" /></p>
<blockquote>
<p>Spark streaming architecture diagram from  <a target="_blank" href="https://spark.apache.org/docs/latest/streaming-programming-guide.html">spark.apache.org</a> </p>
</blockquote>
<p>Spark Streaming's primary structure is batch-by-batch discrete-time streaming. The micro-batches are constantly allocated and analyzed, rather than traveling through the stream processing pipelines one item at a time. As a result, data is distributed to employees depending on accessible resources and location.</p>
<p>When data is received, it is divided into RDD divisions by the receiver. Because RDDs are indeed a key abstraction of Spark datasets, converting to RDDs enables group analysis with Spark scripts and tools.</p>
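<p>To make the micro-batch idea concrete, here is a minimal, self-contained PySpark Streaming sketch (not part of the original tutorial). It assumes a plain text source on a local socket at port 9999, for example one started with <code>nc -lk 9999</code>, and simply counts words in each batch:</p>
<pre><code># Minimal DStream word count, assuming a text source on localhost:9999
# (e.g. started with `nc -lk 9999`). Each 5-second micro-batch becomes an RDD,
# is transformed, and the per-batch counts are printed.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="MicroBatchWordCount")
ssc = StreamingContext(sc, 5)                      # 5-second batch interval

lines = ssc.socketTextStream("localhost", 9999)    # DStream of raw text lines
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))   # word counts per micro-batch
counts.pprint()                                    # print a sample of each batch

ssc.start()
ssc.awaitTermination()
</code></pre>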
<h2 id="real-life-spark-streaming-example-twitter-pyspark-streaming">Real-life spark streaming example (Twitter Pyspark Streaming )</h2>
<p>In this solution, I will build a streaming pipeline that pulls tweets from the internet for a specific keyword (Ether) and performs transformations on these real-time tweets to find the other top keywords associated with it.
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635913646959/J-cLQdOqG.png" alt="Untitled.png" /></p>
<blockquote>
<p>Real-life spark streaming example architecture by the author</p>
</blockquote>
<h4 id="video-tutorial">Video Tutorial</h4>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=jMtKh05xR-8">https://www.youtube.com/watch?v=jMtKh05xR-8</a></div>
<h3 id="step1-streaming-tweets-using-tweepy">Step1: Streaming tweets using tweepy</h3>
<pre><code><span class="hljs-keyword">import</span> tweepy
<span class="hljs-keyword">from</span> tweepy <span class="hljs-keyword">import</span> OAuthHandler
<span class="hljs-keyword">from</span> tweepy <span class="hljs-keyword">import</span> Stream
<span class="hljs-keyword">from</span> tweepy.streaming <span class="hljs-keyword">import</span> StreamListener
<span class="hljs-keyword">import</span> socket
<span class="hljs-keyword">import</span> json

<span class="hljs-comment"># Set up your credentials</span>
consumer_key=<span class="hljs-string">''</span>
consumer_secret=<span class="hljs-string">''</span>
access_token =<span class="hljs-string">''</span>
access_secret=<span class="hljs-string">''</span>


<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">TweetsListener</span>(<span class="hljs-params">StreamListener</span>):</span>

  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, csocket</span>):</span>
      self.client_socket = csocket

  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">on_data</span>(<span class="hljs-params">self, data</span>):</span>
      <span class="hljs-keyword">try</span>:
          msg = json.loads( data )
          print( msg[<span class="hljs-string">'text'</span>].encode(<span class="hljs-string">'utf-8'</span>) )
          self.client_socket.send( msg[<span class="hljs-string">'text'</span>].encode(<span class="hljs-string">'utf-8'</span>) )
          <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>
      <span class="hljs-keyword">except</span> BaseException <span class="hljs-keyword">as</span> e:
          print(<span class="hljs-string">"Error on_data: %s"</span> % str(e))
      <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>

  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">on_error</span>(<span class="hljs-params">self, status</span>):</span>
      print(status)
      <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">sendData</span>(<span class="hljs-params">c_socket</span>):</span>
  auth = OAuthHandler(consumer_key, consumer_secret)
  auth.set_access_token(access_token, access_secret)

  twitter_stream = Stream(auth, TweetsListener(c_socket))
  twitter_stream.filter(track=[<span class="hljs-string">'ether'</span>])

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
  s = socket.socket()         <span class="hljs-comment"># Create a socket object</span>
  host = <span class="hljs-string">"127.0.0.1"</span>     <span class="hljs-comment"># Get local machine name</span>
  port = <span class="hljs-number">5554</span>                 <span class="hljs-comment"># Reserve a port for your service.</span>
  s.bind((host, port))        <span class="hljs-comment"># Bind to the port</span>

  print(<span class="hljs-string">"Listening on port: %s"</span> % str(port))

  s.listen(<span class="hljs-number">5</span>)                 <span class="hljs-comment"># Now wait for client connection.</span>
  c, addr = s.accept()        <span class="hljs-comment"># Establish connection with client.</span>

  print( <span class="hljs-string">"Received request from: "</span> + str( addr ) )

  sendData( c )
</code></pre><h3 id="step2-coding-pyspark-streaming-pipeline">Step2: Coding PySpark Streaming Pipeline</h3>
<pre><code><span class="hljs-comment"># May cause deprecation warnings, safe to ignore, they aren't errors</span>
<span class="hljs-keyword">from</span> pyspark <span class="hljs-keyword">import</span> SparkContext
<span class="hljs-keyword">from</span> pyspark.streaming <span class="hljs-keyword">import</span> StreamingContext
<span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SQLContext
<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> desc
<span class="hljs-comment"># Can only run this once. restart your kernel for any errors.</span>
sc = SparkContext()

ssc = StreamingContext(sc, <span class="hljs-number">10</span> )
sqlContext = SQLContext(sc)
socket_stream = ssc.socketTextStream(<span class="hljs-string">"127.0.0.1"</span>, <span class="hljs-number">5554</span>)
lines = socket_stream.window( <span class="hljs-number">20</span> )
<span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> namedtuple
fields = (<span class="hljs-string">"tag"</span>, <span class="hljs-string">"count"</span> )
Tweet = namedtuple( <span class="hljs-string">'Tweet'</span>, fields )
<span class="hljs-comment"># Use Parenthesis for multiple lines or use \.</span>
( lines.flatMap( <span class="hljs-keyword">lambda</span> text: text.split( <span class="hljs-string">" "</span> ) ) <span class="hljs-comment">#Splits to a list</span>
  .filter( <span class="hljs-keyword">lambda</span> word: word.lower().startswith(<span class="hljs-string">"#"</span>) ) <span class="hljs-comment"># Checks for hashtag calls</span>
  .map( <span class="hljs-keyword">lambda</span> word: ( word.lower(), <span class="hljs-number">1</span> ) ) <span class="hljs-comment"># Lower cases the word</span>
  .reduceByKey( <span class="hljs-keyword">lambda</span> a, b: a + b ) <span class="hljs-comment"># Reduces</span>
  .map( <span class="hljs-keyword">lambda</span> rec: Tweet( rec[<span class="hljs-number">0</span>], rec[<span class="hljs-number">1</span>] ) ) <span class="hljs-comment"># Stores in a Tweet Object</span>
  .foreachRDD( <span class="hljs-keyword">lambda</span> rdd: rdd.toDF().sort( desc(<span class="hljs-string">"count"</span>) ) <span class="hljs-comment"># Sorts Them in a DF</span>
  .limit(<span class="hljs-number">10</span>).registerTempTable(<span class="hljs-string">"tweets"</span>) ) ) <span class="hljs-comment"># Registers to a table.</span>
</code></pre><h3 id="step3-running-the-spark-streaming-pipeline">Step3: Running the Spark Streaming pipeline</h3>
<ul>
<li>Open Terminal and run TweetsListener to start streaming tweets</li>
</ul>
<p><code>python TweetsListener.py</code></p>
<ul>
<li>In the Jupyter notebook, start the Spark streaming context; this will let the incoming stream of tweets flow into the Spark Streaming pipeline and perform the transformations stated in Step 2</li>
</ul>
<p><code>ssc.start()</code></p>
<h3 id="step4-seeing-real-time-outputs">Step4: Seeing real-time outputs</h3>
<p>Plot real-time information on a chart/dashboard from the temporary table <code>tweets</code> registered in Spark. This table will update every 3 seconds with fresh tweet analysis.</p>
<pre><code><span class="hljs-keyword">import</span> <span class="hljs-type">time</span>
<span class="hljs-keyword">from</span> IPython <span class="hljs-keyword">import</span> display
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
# <span class="hljs-keyword">Only</span> works <span class="hljs-keyword">for</span> Jupyter Notebooks!
%matplotlib <span class="hljs-keyword">inline</span> 

count = <span class="hljs-number">0</span>
<span class="hljs-keyword">while</span> count &lt; <span class="hljs-number">10</span>:
    <span class="hljs-type">time</span>.sleep( <span class="hljs-number">3</span> )
    top_10_tweets = sqlContext.<span class="hljs-keyword">sql</span>( <span class="hljs-string">'Select tag, count from tweets'</span> )
    top_10_df = top_10_tweets.toPandas()
    display.clear_output(wait=<span class="hljs-keyword">True</span>)
    plt.figure( figsize = ( <span class="hljs-number">10</span>, <span class="hljs-number">8</span> ) )
#     sns.barplot(x=<span class="hljs-string">'count'</span>,y=<span class="hljs-string">'land_cover_specific'</span>, data=df, palette=<span class="hljs-string">'Spectral'</span>)
    sns.barplot( x="count", y="tag", data=top_10_df)
    plt.<span class="hljs-keyword">show</span>()
    count = count + <span class="hljs-number">1</span>
</code></pre><p>Out:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635916208614/XqCVXtlGV.png" alt="image.png" /></p>
<h2 id="some-of-the-pros-and-cons-of-spark-streaming">Some of the pros and cons of Spark Streaming</h2>
<p>Now that we have gone through building a real-life solution of spark streaming pipeline, let's list down some pros and cons of using this approach.</p>
<p><strong>Pros</strong></p>
<ul>
<li>It offers exceptional speed for demanding jobs.</li>
<li>Fault tolerance.</li>
<li>It is simple to run on cloud platforms.</li>
<li>Support for multiple languages.</li>
<li>Integration with major frameworks.</li>
<li>The ability to connect to databases of various types.</li>
</ul>
<p><strong>Cons</strong></p>
<ul>
<li>Massive volumes of storage are required.</li>
<li>It's difficult to use, debug, and master.</li>
<li>There is a lack of documentation and instructional resources.</li>
<li>Data visualization support is unsatisfactory.</li>
<li>It is slow when dealing with small amounts of data.</li>
<li>Only a limited set of machine learning algorithms is supported.</li>
</ul>
<h2 id="conclusion">Conclusion</h2>
<p>Spark Streaming is indeed a technology for collecting and analyzing large amounts of data. Streaming data is likely to become more popular in the near future, so you should start learning about it now. Remember that data science is more than just constructing models; it also entails managing a full pipeline.</p>
<p>The basics of Spark Streaming were discussed in this post, as well as how to use it on a real-world dataset. We suggest you work with another sample or take real-time data to put everything we've learned into practice.</p>
]]></content:encoded></item><item><title><![CDATA[Building Computer Vision Datasets in Coco Format]]></title><description><![CDATA[Computer vision is among the biggest disciplines of machine learning with its vast range of uses and enormous potential. Its purpose is to duplicate the brain's incredible visual abilities. Algorithms for computer vision aren't magical. They require ...]]></description><link>https://anujsyal.com/building-computer-vision-datasets-in-coco-format</link><guid isPermaLink="true">https://anujsyal.com/building-computer-vision-datasets-in-coco-format</guid><category><![CDATA[Computer Vision]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[AI]]></category><category><![CDATA[ML]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Thu, 02 Sep 2021 03:25:23 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1630222621224/sMoxBZ-26.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Computer vision is among the biggest disciplines of machine learning with its vast range of uses and enormous potential. Its purpose is to duplicate the brain's incredible visual abilities. Algorithms for computer vision aren't magical. They require information to perform, and they'll only be as powerful as the information you provide. Based on the project, there are various sources to obtain the appropriate data.</p>
<p>The most famous object detection dataset is the Common Objects in Context dataset (COCO). It is commonly used to evaluate the performance of computer vision algorithms. The COCO dataset is labeled, delivering data for training supervised computer vision systems that can recognize the dataset's typical elements. Of course, these systems are far from flawless, so the COCO dataset serves as a baseline for assessing the systems' progress over time as a result of computer vision research.</p>
<p>In this article, we discuss the COCO file format, a standard for building computer vision datasets, along with object detection and image annotation methods.</p>
<h2 id="why-do-neural-nets-work-really-well-for-computer-vision">Why do neural nets work really well for computer vision?</h2>
<p>Artificial neural networks are a major subcategory of ML and constitute the core of deep learning techniques. Their origin and architecture are inspired by the human brain, and they work like biological neurons.</p>
<p>Neural networks perform effectively for computer vision because pictures do not always come with labels, and sub-labels for regions and elements must often be removed or cleverly reduced. Neural networks use training data to learn and increase their accuracy with experience. Once these learning techniques have been fine-tuned for precision, they become formidable resources in computing and AI, enabling us to quickly categorize and organize data.</p>
<p>When compared to manual classification by experienced scientists, tasks in voice recognition or image recognition may take only a few minutes rather than hours. Google’s technology is among the most famous applications of neural networks.</p>
<h2 id="why-still-there-is-a-need-to-create-a-custom-dataset">Why still there is a need to create a custom dataset</h2>
<p>Transfer learning has been a specific technique of machine learning in which a model created for one job is applied as the basis for a model on a different task. Considering the enormous compute as well as time resource base needed to establish neural network systems on such concerns, as well as the large leaps in expertise that they deliver on similar issues, it is a common strategy in machine learning in which pre-trained systems are used as the preliminary step on natural language data processing.</p>
<p>We can deal with these instances using transfer learning, which uses previously labeled data from a comparable task or topic.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1630223160476/sIUzWFdFu.jpeg" alt="neonbrand-zFSo6bnZJTw-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@neonbrand?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">NeONBRAND</a> on <a href="https://unsplash.com/s/photos/learning?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
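<p>As a rough illustration of the transfer learning idea described above, here is a minimal Keras sketch (not from the original article). It assumes you have already prepared a labeled image dataset, such as the Stanford Dogs data mentioned later in this post, as a <code>train_ds</code> tf.data pipeline:</p>
<pre><code># Minimal transfer-learning sketch with Keras (illustrative, not the article's code).
# Assumes `train_ds` is a tf.data.Dataset of (image, label) pairs you prepared yourself.
import tensorflow as tf

# Load a network pre-trained on ImageNet, without its original classification head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pre-trained feature extractor

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(120, activation="softmax"),  # e.g. 120 dog-breed classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, epochs=5)  # fine-tune only the new head on your own data
</code></pre>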
<h2 id="coco-file-format-is-a-standard-for-building-computer-vision-datasets">Coco File Format is a standard for building computer vision datasets</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1630224828996/Y0IR5slI1.jpeg" alt="Untitled.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@ikredenets?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Irene Kredenets</a> on <a href="https://unsplash.com/s/photos/coco?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>Analyzing visual environments is a major objective of computer vision; it includes detecting what items are present, localizing them in 2D and 3D, identifying their properties, and describing their relationships. As a result, the dataset can be used to train object recognition and classification methods. COCO is frequently used to test the efficiency of real-time object recognition techniques, and modern neural network frameworks understand the COCO dataset's structure.</p>
<p>Contemporary AI-driven systems are not yet capable of producing fully precise results, which is why the COCO dataset is a substantial reference point for computer vision: it is used to train, test, polish, and refine models, and it helps scale the annotation pipeline faster.</p>
<p>At a high level, the COCO standard specifies how your annotations and image metadata are saved on disk. Furthermore, the COCO dataset lends itself to transfer learning, in which the material used for one model is used to start another.</p>
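<p>Concretely, a COCO annotation file is a single JSON document with <code>images</code>, <code>annotations</code>, and <code>categories</code> sections. The small sketch below (with made-up file names and ids, purely for illustration) builds a minimal example in Python and writes it to disk:</p>
<pre><code># A minimal, hand-written COCO-style annotation file (illustrative values only).
import json

coco = {
    "images": [
        {"id": 1, "file_name": "dog_001.jpg", "width": 640, "height": 480},
    ],
    "categories": [
        {"id": 1, "name": "beagle", "supercategory": "dog"},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,                  # which image this annotation belongs to
            "category_id": 1,               # which label it carries
            "bbox": [120, 80, 200, 150],    # [x, y, width, height] in pixels
            "area": 200 * 150,
            "iscrowd": 0,
        },
    ],
}

with open("annotations.json", "w") as f:
    json.dump(coco, f, indent=2)
</code></pre>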
<h2 id="tutorial-to-build-computer-vision-dataset-using-datatorchhttpsdatatorchio">Tutorial to build Computer vision dataset using  <a target="_blank" href="https://datatorch.io/">Datatorch</a></h2>
<h3 id="video-tutorial">Video Tutorial</h3>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/E-_o2Q0YTrs">https://youtu.be/E-_o2Q0YTrs</a></div>
<p>Datatorch is one of the cloud-based, free-to-use annotation tools out there. It is a web-based platform that you can hop onto and quickly start annotating datasets.</p>
<p><strong>Step0: Discovering Data</strong></p>
<p>Solving any machine learning problem first starts with data. The first question is what problem you want to solve. Then the next question is where can I get this data. </p>
<p>In my case (hypothetical), I want to build an ML model that detects different dog breeds from photos. 
I am sourcing this relatively simple  <a target="_blank" href="https://www.kaggle.com/jessicali9530/stanford-dogs-dataset">Stanford Dogs Dataset</a> from Kaggle</p>
<p><strong> Step1: Create New Project </strong></p>
<p>After you log in you will see the dashboard main screen showing your projects and organization. This will be good when you are trying to work on multiple projects across different teams.
Now on the top right of the title bar, click on <code>+</code> and create a new project</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1630228367944/rcuZIZIih.png" alt="image.png" /></p>
<p><strong> Step2: Onboard Data</strong></p>
<p>Then go to <code>Dataset</code> tab from the left navigation bar, click on <code>+</code> to create a new dataset named <code>dogtypes</code>. After that you can easily drop the images</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1630229935048/1xnpMAMNo.png" alt="image.png" /></p>
<p>Or there is another option to directly connect to a cloud provider storage (AWS, Google, Azure)</p>
<p><strong> Step3: Start Annotating</strong></p>
<p>If you click on any of the images in the dataset, it will lead you directly to the annotation tool</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1630230774259/ATVSqrW6S.png" alt="image.png" /></p>
<ul>
<li><strong>Annotation Tools</strong> On the left there are the annotation tools you can use on the <strong>visualizer window</strong> in the center</li>
<li><strong>Dataset:</strong> List of all the images, click to annotate them</li>
<li><strong>Change/Create Labels: </strong> Click to change the label associated with annotation</li>
<li><strong>Annotation Details: </strong> After you have done some annotation in the image you will see the details here</li>
<li><strong>Tool Details/ Configuration: </strong> When you select an annotation tool the details/configuration appear on this. For example, if you select a brush tool you can change its size here</li>
</ul>
<p>To start annotating, you can just select an annotation tool from the options; which tool to use also depends on the type of model you are trying to build. For an object detection model, something like a bounding box or circle tool works well, whereas for a segmentation model you can use the brush tool or an AI-based superpixel tool to highlight the relevant pixels. For example, I just used a simple brush tool (with an increased size) to paint over the dog.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1630232861376/2J9V_jjcn.png" alt="image.png" /></p>
<p>Also, annotation is best learned by trying it yourself, or you can watch the tutorial on my YouTube channel.</p>
<ul>
<li><strong>Step4: Export Annotated Data to COCO Format</strong>
After you are done annotating, you can go to exports and export this annotated dataset in COCO format (a quick way to sanity-check the export is sketched below).</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1630233858690/EmfFWXV7y.png" alt="image.png" /></p>
<h2 id="conclusion">Conclusion</h2>
<p>If you're new to object detection and need to create a completely new dataset, the COCO format is an excellent option because of its simple structure and broad adoption. We have looked at the COCO dataset structure for the most common tasks: object identification and segmentation. COCO-style datasets are large-scale datasets suited for starter projects, production environments, and cutting-edge research. </p>
]]></content:encoded></item><item><title><![CDATA[Introduction to graph-based analytics using Cylynx Motif]]></title><description><![CDATA[Data visualization and analysis may help you find summary data, complexity, unseen connections, patterns, differences, irregularities, and insights in your dataset. Although there are several technologies available to help in the presentation of tabu...]]></description><link>https://anujsyal.com/introduction-to-graph-based-analytics-using-cylynx-motif</link><guid isPermaLink="true">https://anujsyal.com/introduction-to-graph-based-analytics-using-cylynx-motif</guid><category><![CDATA[data analysis]]></category><category><![CDATA[graph database]]></category><category><![CDATA[analytics]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data structures]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Thu, 12 Aug 2021 04:26:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1628738396830/MTM7TgimR.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Data visualization and analysis may help you find summary data, complexity, unseen connections, patterns, differences, irregularities, and insights in your dataset. Although there are several technologies available to help in the presentation of tabular information, this can be true with graph data.</p>
<p>Motif is one of the no-code graph visualization tools out there. This software will assist researchers, data analysts, and executives in establishing the links within their datasets and making graph analysis simpler and more interesting!</p>
<p>In this article, we will discuss graph-based analytics using Cylynx Motif, the challenges of graph data exploration, and the Motif graph intelligence platform. Keep scrolling to read more.</p>
<h2 id="what-is-graph-based-analytics">What is Graph-Based Analytics?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1628738463898/KqX0qodMb.jpeg" alt="alina-grubnyak-R84Oy89aNKs-unsplash.jpg" /></p>
<p>It is a fast-growing field of study in which graph-theoretic, analytical, and database approaches are used to describe, store, query, and run performance evaluations on graph-structured information. </p>
<p>Analysts can use these methods to figure out how a network's structure changes under various circumstances and to discover pathways between pairs of entities that satisfy specific constraints. They can also be used to recognize clusters or tightly interacting subsets within a graph and to discover subgraphs that are similar to a pattern of interest.</p>
<p>For these and several other activities, it is necessary to represent one's information as a graph of nodes or vertices that stand for items and edges that reflect connections between them. Several technology fields, like sensor networks, need huge graphs with billions of nodes and edges.</p>
<p>These could represent thousands of different sorts of things and connections in activities like situation monitoring. The interconnections in telecommunication systems can change over time, and certain entities might be quite closely linked to one another.</p>
<ul>
<li><strong>Financial services</strong></li>
</ul>
<p>GA for financial sectors is a compelling technique for visualizing networks, connections, and activities between individuals, companies, and things. The most well-known applications of GA so far have been in the analysis of social media information. </p>
<p>However, the technique can potentially transform the financial services sector, particularly when used to reinforce Artificial Intelligence (AI)-based analytics, for example to stabilize different processes or to reduce time-consuming activities like information processing, verification, and error correction.</p>
<p>Financial Institutions (FIs) may obtain priceless and quick insights into their networks (through cybersecurity threat control), counterparties (through counterparty credit risk), and users (mostly through KYC/AML), as well as the broader community (via supply chain analysis), by implementing GA successfully.</p>
<p>GA does have a big future. It's feasible to build an AI tool that can either automate decisions or enhance and support rational decision-making by integrating it with other scientific approaches. Machine Learning (ML) approaches have recently received a lot of attention in cutting-edge analyses, and GA may have a similar impact on finance and security innovation.</p>
<ul>
<li><strong>Supply chain management</strong></li>
</ul>
<p>We can save money on shipments by using the supply chain management graph. That form of graph may help you take advantage of economies of scale, particularly if you have a broader perspective of consumption, how everything relates to your stock, and how much you'll have to purchase in a given timeframe. </p>
<p>Perhaps you're now placing smaller, more frequent orders since you’re unable to implement this long-term strategy throughout the entire supply chain. If you've got a broader perspective (that the graph may assist you to see), you could purchase in large quantities, make more informed purchasing choices, and save cash.</p>
<p>You may also use the graph to improve your inventory. You can shift your stock around so it's in the appropriate location at the proper moment once you know how many usable components you need, where they are, how much time is needed to bring them to the business, and what transportation they'll have to travel in (with regard to your forecast).</p>
<p>Instead of placing new orders while keeping your stock untouched, you may use this to stay profitable. It might help you fine-tune your stock management so you don't hold as much in the store. After all, you know how your forecast will come out in terms of how many components you'll require, so you'll be able to efficiently arrange your inventory and avoid buying or storing more than you'll need.</p>
<p>The graph can assist you in comparing and contrasting providers and related items. You may analyze customer complaints and evaluate the process variation from different vendors once you have that picture of the providers versus all the parts and materials you're utilizing.</p>
<ul>
<li><strong>Customer 360</strong></li>
</ul>
<p>By looking at the graph, you can see how this user is related to certain other businesses; for instance, consumers could operate in any of your other business companies, or maybe they're friends and family. You got contract details, and you can access not only details regarding their existing vehicles - its actual BOM, how it's maintained, choices available, product definition, and many more – but also any other cars they've manufactured.</p>
<p>You can also examine the cars they've bought in the past, as well as the choices and requests they've made. Sensor information, telemetry statistics, billing information, and any customer encounters may all be included in this graph.</p>
<h2 id="about-the-motif-graph-intelligence-platform">About the Motif Graph Intelligence Platform</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1628738583264/-m3yMwcvG.png" alt="image.png" /></p>
<h3 id="what-is-motif">What is Motif?</h3>
<p>Motif is a graph analytics application that converts linked data into business intelligence without the use of programming. It allows users to accelerate data mining, research, and communication by helping companies make the connections between disparate data sets. Researchers, business analysts, and executives may use Motif to do a graphical search over graph data.</p>
<h3 id="what-was-the-motivation-for-developing-motif">What was the motivation for developing Motif?</h3>
<p>Motif wants to lower the barriers to entry for anybody interested in graph problems. One of the major issues encountered was integrating graph data into corporate decision-making in sectors such as finance and security. </p>
<p>The majority of existing options are custom-built, proprietary, time-consuming, and costly. Cylynx is trying to fix that with Motif, which makes 80 percent of the most widespread network visualization use cases as simple as possible while also adding interactive functionality to turn graph data into business insight.</p>
<h2 id="a-simple-tutorial-to-get-started-with-motif-using-movie-dataset">A Simple Tutorial to get started with Motif (using movie dataset)</h2>
<p> <strong> If you interested in a comprehensive tutorial, check out this video on my youtube channel </strong></p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/jPs5UxNNKQ4">https://youtu.be/jPs5UxNNKQ4</a></div>
<h3 id="steps">Steps</h3>
<ul>
<li>To get started go to the following  <a target="_blank" href="https://demo.cylynx.io/">Demo Motif Link</a> </li>
<li>As a first step you would need to import the data, click on the import data button. There are various ways you can start with the data.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1628740025652/ejsrI1Q7Y.png" alt="image.png" /></p>
<ul>
<li><p>Information about the dataset: In this tutorial, I will use a neo4j graph database server sample, the <code>Movie Dataset</code>, and connect directly using the server URL and credentials. To create a similar dataset, go to the <a target="_blank" href="https://sandbox.neo4j.com/">neo4j sandbox</a>. After you log in, you can create a new database server with the <code>Movie Dataset</code> preloaded; please note it will expire in 3 days.</p>
</li>
<li><p>After the server is connected, you need a query to extract the data. I am using a query that extracts information on movies and the corresponding actors. Click on execute query and import the nodes and edges.</p>
</li>
</ul>
<pre><code><span class="hljs-selector-tag">MATCH</span> (<span class="hljs-attribute">p</span>:Person)<span class="hljs-selector-tag">-</span><span class="hljs-selector-attr">[r:ACTED_IN]</span><span class="hljs-selector-tag">-</span>&gt;(<span class="hljs-attribute">m</span>:Movie) <span class="hljs-selector-tag">RETURN</span> <span class="hljs-selector-tag">p</span>,<span class="hljs-selector-tag">r</span>,<span class="hljs-selector-tag">m</span>
</code></pre><ul>
<li>After the data has been imported you will see the graph with all the nodes and edges populated in the visual interface</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1628740659599/N3fbrz1w_.png" alt="image.png" /></p>
<ul>
<li>On the right, there are quick action buttons like zoom-in, zoom-out, undo, redo, etc. On the left there is a control panel with the following options</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1628740936442/9KeL3blKg.png" alt="image.png" /></p>
<ul>
<li>The idea of using a tool like Motif for graph-based analytics is to explore the data and relationships visually. That is where the <code>Styles</code> tab helps you with in-depth exploration. You should be able to select different layout options, radius, node spacing, and focus nodes, as well as node styles and edge styles.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1628741350059/M-Z_DMHkD.png" alt="image.png" /></p>
<ul>
<li>All the options discussed are best understood through actual use. In my case, around 5 minutes of playing around with the data helped me get better insights about the dataset. I selected a Radial Layout, tied node size to the degree of connections, and added node colors with legends. The final output looked something like this</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1628741735214/VL0KUUbAi.png" alt="image.png" /></p>
<ul>
<li>I was able to answer quick questions by just visually looking at the graph, like who is the most popular actor/movie. If you are interested in knowing the detailed steps you can check out my  <a target="_blank" href="https://youtu.be/jPs5UxNNKQ4">youtube video</a> Or you can just try it  yourself by going to the  <a target="_blank" href="https://demo.cylynx.io/">demo link</a> </li>
</ul>
<h2 id="a-few-challenges-of-graph-data-exploration">A Few Challenges of Graph Data Exploration</h2>
<ul>
<li><strong>Finding the connection through the tabular form</strong></li>
</ul>
<p>The tabular representation of our data is quite difficult to understand in some cases. Companies and organizations that work with relational database structures tend to stay within the boundaries of that data, and these boundaries do not help the companies evolve.</p>
<p>Implementing graphical data exploration in these organizations will help them understand the relationships hidden in the tabular forms. It will also help these companies become more flexible and make new research and discoveries in the relevant fields.</p>
<ul>
<li><strong>Challenges in data exploration</strong></li>
</ul>
<p>Finding the connections between different things through graph data may take some time in several situations. This can be challenging for some companies, as they don't have enough time to spend on data exploration.</p>
<p>Data scientists and researchers play a huge role in these situations. They can bring new technologies and solutions into the company to make things easier; NetworkX and igraph are some of the tools used by data scientists, as in the sketch below.</p>
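<p>For instance, a data scientist might load relationships like the movie/actor ones from the tutorial above into NetworkX to answer the "who is most connected" question programmatically. A small illustrative sketch (the edges are made-up examples, not the actual Movie Dataset):</p>
<pre><code># Illustrative NetworkX sketch: exploring actor-movie relationships in code.
# The edges below are made-up examples, not the actual Movie Dataset.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Keanu Reeves", "The Matrix"),
    ("Carrie-Anne Moss", "The Matrix"),
    ("Keanu Reeves", "John Wick"),
])

# Degree = number of connections; a quick proxy for the "most popular" node.
most_connected = max(G.degree, key=lambda pair: pair[1])
print(most_connected)            # ('Keanu Reeves', 2)
print(nx.degree_centrality(G))
</code></pre>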
<ul>
<li><strong>Issues with high dimensional data</strong></li>
</ul>
<p>A graph built from a real-world scenario may have many edges, nodes, and properties, which can be challenging for new users. They will need extra time to work with these graphs and understand the properties.</p>
<ul>
<li><strong>Exchanging the results with others is challenging</strong></li>
</ul>
<p>There are times when you want to share results with friends and colleagues. Graph insights are hard to communicate in tabular form, and unlike many other technologies, there are few dedicated tools for sharing them.</p>
<h2 id="conclusion">Conclusion</h2>
<blockquote>
<p>Cylynx is continuously introducing and implementing new tools and software to make things easier, and providing products and techniques that help financial fraud authorities connect the links among various pieces of data.</p>
<p>Graphs are a great way to display and handle such information. Motif has been built to make graph exploration clear and accessible to researchers, data analysts, and executives by integrating the demands of our different application releases.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Building Deep Neural Networks the Easy Way | Perceptilabs]]></title><description><![CDATA[PerceptiLabs' visual simulation model offers a graphical user interface for creating, learning, and evaluating designs as well as allowing for further programming modifications. You can get quick repetitions and improved solutions that are easier to ...]]></description><link>https://anujsyal.com/building-deep-neural-networks-the-easy-way-or-perceptilabs</link><guid isPermaLink="true">https://anujsyal.com/building-deep-neural-networks-the-easy-way-or-perceptilabs</guid><category><![CDATA[Deep Learning]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Python]]></category><category><![CDATA[TensorFlow]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Mon, 02 Aug 2021 06:59:07 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1627875573736/3VMqTnM3k.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>PerceptiLabs' visual simulation model offers a graphical user interface for creating, learning, and evaluating designs as well as allowing for further programming modifications. You can get quick repetitions and improved solutions that are easier to describe.</p>
<p>The PerceptiLabs framework allows users to create modified model configurations without requiring deep scientific knowledge, and its end-to-end simulation lets users inspect and analyze the model in a transparent way, improving understanding and making error detection easier.</p>
<h2 id="what-is-perceptilabs">What is Perceptilabs?</h2>
<p>PerceptiLabs is essentially a visual user interface on top of TensorFlow: an advanced machine learning platform with a graphical modeling workflow that combines the freedom of programming with the convenience of a drag-and-drop interface. This makes model creation simpler, quicker, and accessible to a broader range of people.</p>
<p>It also includes pre-built models for a variety of domains that users can bring into their own projects, modify, and train on their own datasets. Tax fraud detection, object classification for pattern recognition, and other applications are among the system's use cases.</p>
<h2 id="perceptilabs-and-machine-learning">PerceptiLabs and machine learning</h2>
<p>PerceptiLabs was founded with the goal of making machine learning modeling easier for businesses of all sizes. Machine learning can play a vital role in how businesses develop, and PerceptiLabs is on a journey to enable businesses of all sizes to get started in this field.</p>
<p>It analyzes the ever-increasing amounts of data accessible today, helps businesses identify trends in that data, and provides predictions based on those trends. Every business has a range of applications, such as using object detection to predict which grocery stores are running low on stock or using image recognition to identify a person in a crowded scene.</p>
<p>Users can easily create machine learning models for any type of business with PerceptiLabs' visual modeling solution. It lets users drag, drop, and join elements, then configure variables while the software writes the corresponding code instantly. Users can quickly train and fine-tune their machine learning model, as well as observe its performance.</p>
<h2 id="modeling-workflow-of-perceptilabs">Modeling workflow of Perceptilabs</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627875353809/R0j0dMWT8.jpeg" alt="tobias-carlsson-d3Zu34NBg7A-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@tobias_carl?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Tobias Carlsson</a> on <a href="https://unsplash.com/s/photos/flow?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>Pre-made elements encapsulate TensorFlow code and simplify it into visual components, while still enabling custom code edits. The graphical interface lets you arrange these elements into a structure that depicts the design of your system, and makes it simple to add features such as one-hot encoding and dense layers.</p>
<p>As you alter the design in PerceptiLabs, every element also offers graphical information on how it has converted the dataset. This immediate overview reduces the requirement to execute the entire simulation before viewing results, allowing you to change more quickly.</p>
<p>Whenever you compare PerceptiLabs to any other platform, you'll see how much easier it is to visualize pictures and categorize information. You could also observe how every element alters the information, as well as how the alterations contributed to the final categorization.</p>
<p>During modeling, PerceptiLabs takes the first portion of the available dataset and re-runs the pipeline as you make adjustments, so you see how your modifications affect the outcome right away. This useful feature allows you to examine results without having to execute the algorithm on the whole sample.</p>
<h2 id="building-your-first-deep-learning-model-on-perceptilabsperceptilabscom">Building your first Deep Learning model on  <a target="_blank" href="perceptilabs.com">Perceptilabs</a></h2>
<h3 id="if-you-are-interested-in-a-more-comprehensive-video-tutorial-check-out-my-youtube-video-below">If you are interested in a more comprehensive video tutorial, check out my youtube video below</h3>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=Ez4la9Lwh04">https://www.youtube.com/watch?v=Ez4la9Lwh04</a></div>
<p><strong>Step 1: Install and run Perceptilabs on local</strong>
Open a terminal and use pip to install and run the tool locally (make sure you have a Python version &lt; 3.9)</p>
<pre><code><span class="hljs-attribute">pip</span> install perceptilabs
perceptilabs
</code></pre><p>After the setup the tool is up and running on localhost:8080
<strong> Step 2: Understanding the dataset </strong>
I am using the default sample dataset of <code>X-Ray scans of patients</code> provided in Perceptilabs.
The dataset has Xray scans with 3 labels - <code>Normal</code>,<code>Viral Pneumonia</code>, <code>Covid</code>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627881438767/uFzJA8XCG.png" alt="image.png" />
To import this dataset into Perceptilabs you need it in the right format: a
<code>data.csv</code> file that contains the path of each file with its corresponding label</p>
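<p>If you need to prepare such a file for your own images, a minimal sketch could look like this (the file paths and labels below are made up; adjust them to your folder structure):</p>
<pre><code class="lang-python">import pandas as pd

# Hypothetical image paths mapped to their labels
rows = [
    {"image": "images/normal_001.png", "label": "Normal"},
    {"image": "images/pneumonia_001.png", "label": "Viral Pneumonia"},
    {"image": "images/covid_001.png", "label": "Covid"},
]

pd.DataFrame(rows).to_csv("data.csv", index=False)
</code></pre>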
<p><strong>Step 3: Go to the model hub (first tab on the left), click on create model and import the dataset <code>data.csv</code> </strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627881874542/mHAtg1Phs.png" alt="image.png" />
Select the URL as input feature and labels as Target
 Keep the data partition as default [70% Train,20% Validation,10% Test]</p>
<p>Select the training settings</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627881898785/mQ6Q6_HrC.png" alt="image.png" />
Provide the details such as <code>Model Name</code>, <code>Epochs</code>, <code>Batch Size</code>, <code>Loss Function</code>, <code>Learning Rate</code>
and click on Customize and go to the Modelling window</p>
<p>Within the modelling window you will see all the layers of the neural network laid out as inferred by the modeling tool; it should look something like the screenshot below</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627886138302/C41n5WQSI.png" alt="image.png" />
By default it contains one convolution layer connected to the input images and two dense layers with a softmax to produce the final label output</p>
<p><strong> Step 4: Play around with the tool and multiple layers</strong>
You can add more deep learning components from within the modeling tool, or easily code a custom Keras layer yourself</p>
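<p>As an illustration, a custom layer you might drop into such a code component could be a small Keras sketch like the one below (the layer name and sizes are arbitrary, not something PerceptiLabs prescribes):</p>
<pre><code class="lang-python">import tensorflow as tf

# A hypothetical custom block: a dense layer followed by dropout
class SmallDenseBlock(tf.keras.layers.Layer):
    def __init__(self, units=64, **kwargs):
        super().__init__(**kwargs)
        self.dense = tf.keras.layers.Dense(units, activation="relu")
        self.dropout = tf.keras.layers.Dropout(0.2)

    def call(self, inputs, training=False):
        x = self.dense(inputs)
        return self.dropout(x, training=training)
</code></pre>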
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627886395628/p7phNib9I.png" alt="image.png" /></p>
<p>Building any deep learning model requires a lot of iterations, so the visual approach comes in handy: we can plug and play and see the results of each iteration</p>
<p><strong> Step 5: Start Training and see live stats (Statistics View)</strong>
Click on Run with current settings in the top bar to start training the model, passing in the model settings discussed previously; for a classification use case, a Cross-Entropy loss function makes the most sense.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627886577022/uaQ_6DXG4e.png" alt="image.png" /></p>
<p>After you start training, you will be redirected to the statistics view to see live statistics of the model while it is being trained: weight outputs, loss, and accuracy as each part of the dataset is processed. All of this analysis can be done layer by layer
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627886944645/y9UPt3nKM.png" alt="image.png" /></p>
<p>You should also be able to see the accuracy increase at a global level as the epochs pass</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627887080658/oS_j0_ljk.png" alt="image.png" /></p>
<p><strong> Step 6: Run Validation on Tests Dataset</strong>
Go to the test view and run the test to get model metrics and confusion matrix of labels</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627887354964/boqfW-qOs.png" alt="image.png" /></p>
<p>And after it is complete you should be able to see the quality of the model you have built using these test metrics such as <code>Model Accuracy</code>, <code>Precision</code>, <code>Recall</code></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627887428125/zdonHbxpN.png" alt="image.png" /></p>
<h2 id="why-something-like-perceptilabs-makes-sense">Why something like Perceptilabs makes sense?</h2>
<p>Data analysts may use this technology to perform more effectively with machine learning techniques and get a good understanding of them.</p>
<p><strong>Helps you get the Real-time information</strong></p>
<p>Real-time metrics and detailed summaries of every modeling element's data are available. You can simply follow and analyze the behavior of the variables, troubleshoot in real-time, and identify where your system may be improved.</p>
<p><strong>Helps you share them on GitHub</strong></p>
<p>PerceptiLabs allows you to maintain many models, evaluate them, and share the findings with your team quickly and efficiently. You can export your model as a TensorFlow model.</p>
<p><strong>Helps you overcome Compatibility problems</strong></p>
<p>When a company's researchers create models and put them into operation, everyone needs to be working with compatible tooling; otherwise, problems arise. According to some experts, this problem can be avoided if everyone in a firm uses PerceptiLabs' platform.
<strong> Helps you export your model </strong></p>
<p>Perceptilabs allows you to examine and explain how your model runs and executes, as well as why particular outcomes are being produced. You can also export your model as a trained TensorFlow model once you're happy with it.</p>
<p><strong>Advantages of using Perceptilabs</strong></p>
<p>This tool offers a wide range of benefits. Some of them are;</p>
<ul>
<li>Quick modeling - Includes a simple drag-and-drop user interface that helps make system design simple to create and analyze.</li>
<li>Visibility - It can be used to start understanding how your strategy performs so that it can be explained.</li>
<li>Versatility - Built as a graphical API on top of TensorFlow, this allows programmers to use TensorFlow's low-level Interface while also allowing them to use other Python libraries.</li>
</ul>
<h2 id="conclusion">Conclusion</h2>
<p>The process of developing models must be simplified if businesses are to embrace machine learning. PerceptiLabs offers graphical machine learning modeling solutions to help businesses implement it. It not only lets you develop machine learning models quickly, but also gives you a graphical representation of how the model is performing and lets you share that information with others.</p>
]]></content:encoded></item><item><title><![CDATA[Introduction to Pyspark ML Lib: Build your first linear regression model]]></title><description><![CDATA[Photo by Genessa Panainte on Unsplash

Machine learning that is applied to build personalizations, suggestions, and future analyses are becoming increasingly important as companies generate increasingly diversified and user-focused digital goods and ...]]></description><link>https://anujsyal.com/introduction-to-pyspark-ml-lib-build-your-first-linear-regression-model</link><guid isPermaLink="true">https://anujsyal.com/introduction-to-pyspark-ml-lib-build-your-first-linear-regression-model</guid><category><![CDATA[spark]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data analysis]]></category><category><![CDATA[Artificial Intelligence]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Mon, 26 Jul 2021 04:48:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1627257772013/FtuoIHX0F.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>Photo by <a href="https://unsplash.com/@genessapana?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Genessa Panainte</a> on <a href="https://unsplash.com/s/photos/spark?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>Machine learning applied to build personalization, recommendations, and predictive analyses is becoming increasingly important as companies generate increasingly diversified and user-focused digital goods and services. Rather than dealing with the complications of different datasets, the Apache Spark machine learning library (MLlib) enables data engineers to concentrate on specific data challenges and algorithms.</p>
<p>Linear regression is a linear technique for modeling the relationship between a dependent variable and one or more independent variables. It is one of the most fundamental and widely used kinds of predictive modeling.</p>
<h2 id="what-is-spark-mllib">What is Spark MLlib?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627257895913/zqfyMHpiV.jpeg" alt="marius-masalar-CyFBmFEsytU-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@marius?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Marius Masalar</a> on <a href="https://unsplash.com/s/photos/machine-learning?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>Spark MLlib is among the most appealing features of Spark as it has the capacity to enormously scale processing, which is precisely what machine learning models require. However, there are some machine learning models that cannot be properly implemented, which is a drawback.</p>
<p>MLlib is a comprehensive machine learning package that includes classification, regression, clustering, collaborative filtering, and underlying optimization primitives, as well as other popular learning methods and tools.</p>
<h2 id="what-is-a-linear-regression-model">What is a linear regression model?</h2>
<p>By fitting a line to the given data, regression methods illustrate the relationship between variables. A straight line is used in a linear model, whereas a curve is used in nonlinear models.</p>
<p>You can use regression to predict the value of a dependent variable from the values of one or more independent variables. The relationship between two quantitative variables is estimated using simple linear regression.</p>
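<p>For reference, the model being estimated is just a line (or a hyperplane once there are several features): the target is expressed as a weighted sum of the inputs plus an error term.</p>
<pre><code class="lang-latex">y = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n + \varepsilon
</code></pre>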
<h2 id="why-to-use-spark-mllib-for-ml">Why to use spark Mllib for ml</h2>
<p>Spark is a strong, unified platform for data scientists due to its speed. It is also a simple-to-use tool that helps them get the desired results quickly. This enables data scientists to tackle machine learning problems, along with graph computation, streaming, and interactive query handling, at a much larger scale.</p>
<p>R, Python, and Java are just a few of the languages available in Spark. The 2015 <a target="_blank" href="https://databricks.com/blog/2015/09/24/spark-survey-results-2015-are-now-available.html">Spark Survey</a>, which polled the Spark community, revealed that Python and R have seen especially fast growth. In particular, 58 percent of participants said they used Python and 18 percent said they were currently using the R API.</p>
<h2 id="interested-in-more-comprehensive-tutorial">Interested in more comprehensive tutorial:</h2>
<p>Check out this youtube video:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/9Sy1x1fa1no">https://youtu.be/9Sy1x1fa1no</a></div>
<h2 id="create-your-first-linear-regression-model-with-spark-mllib">Create your first linear regression model with Spark Mllib</h2>
<ul>
<li><p>Step 1: Pyspark environment setup
For a pyspark environment on a local machine, my preferred option is to use docker to run the <code>jupyter/pyspark-notebook</code> image. However, if you are interested in an extensive installation guide, check out my <a target="_blank" href="https://anujsyal.com/pyspark-installation-guide">blog post</a> or <a target="_blank" href="https://www.youtube.com/watch?v=Ql_jfk3UnHE">youtube video</a></p>
</li>
<li><p>Step 2: Create a spark session</p>
</li>
</ul>
<pre><code><span class="hljs-keyword">from</span> pyspark.<span class="hljs-keyword">sql</span> <span class="hljs-keyword">import</span> SparkSession
spark = SparkSession.builder.master("local").appName("linear_regression_model").getOrCreate()
</code></pre><ul>
<li>Step 3: Load the dataset</li>
</ul>
<p>For the dataset, I am using a simple  <a target="_blank" href="https://www.kaggle.com/quantbruce/real-estate-price-prediction">Real Estate dataset from Kaggle</a>, which contains simple real-estate data with continuous features like <code>distance from mrt station</code>, <code>coordinates</code>, <code>size</code>, etc.</p>
<p>After you download it, read the dataset into a spark dataframe</p>
<pre><code>real_estate = spark.<span class="hljs-keyword">read</span>.<span class="hljs-keyword">option</span>("inferSchema", "true").csv("real_estate.csv",<span class="hljs-keyword">header</span>=<span class="hljs-keyword">True</span>)
</code></pre><ul>
<li>Step 4: Explore the data and its attributes</li>
</ul>
<p>We can explore different attributes/columns of the data using a few inbuilt functions in spark.</p>
<p>printSchema() to see the columns with their data types</p>
<pre><code><span class="hljs-string">real_estate.printSchema()</span>
<span class="hljs-attr">Out:</span>
<span class="hljs-string">root</span>
 <span class="hljs-string">|--</span> <span class="hljs-attr">No:</span> <span class="hljs-string">integer</span> <span class="hljs-string">(nullable</span> <span class="hljs-string">=</span> <span class="hljs-literal">true</span><span class="hljs-string">)</span>
 <span class="hljs-string">|--</span> <span class="hljs-attr">X1 transaction date:</span> <span class="hljs-string">double</span> <span class="hljs-string">(nullable</span> <span class="hljs-string">=</span> <span class="hljs-literal">true</span><span class="hljs-string">)</span>
 <span class="hljs-string">|--</span> <span class="hljs-attr">X2 house age:</span> <span class="hljs-string">double</span> <span class="hljs-string">(nullable</span> <span class="hljs-string">=</span> <span class="hljs-literal">true</span><span class="hljs-string">)</span>
 <span class="hljs-string">|--</span> <span class="hljs-attr">X3 distance to the nearest MRT station:</span> <span class="hljs-string">double</span> <span class="hljs-string">(nullable</span> <span class="hljs-string">=</span> <span class="hljs-literal">true</span><span class="hljs-string">)</span>
 <span class="hljs-string">|--</span> <span class="hljs-attr">X4 number of convenience stores:</span> <span class="hljs-string">integer</span> <span class="hljs-string">(nullable</span> <span class="hljs-string">=</span> <span class="hljs-literal">true</span><span class="hljs-string">)</span>
 <span class="hljs-string">|--</span> <span class="hljs-attr">X5 latitude:</span> <span class="hljs-string">double</span> <span class="hljs-string">(nullable</span> <span class="hljs-string">=</span> <span class="hljs-literal">true</span><span class="hljs-string">)</span>
 <span class="hljs-string">|--</span> <span class="hljs-attr">X6 longitude:</span> <span class="hljs-string">double</span> <span class="hljs-string">(nullable</span> <span class="hljs-string">=</span> <span class="hljs-literal">true</span><span class="hljs-string">)</span>
 <span class="hljs-string">|--</span> <span class="hljs-attr">Y house price of unit area:</span> <span class="hljs-string">double</span> <span class="hljs-string">(nullable</span> <span class="hljs-string">=</span> <span class="hljs-literal">true</span><span class="hljs-string">)</span>
</code></pre><p>show() to check out a few rows and understand the data</p>
<pre><code>real_estate.show(<span class="hljs-number">2</span>)
<span class="hljs-symbol">Out:</span>
+---+-------------------+------------+--------------------------------------+-------------------------------+-----------+------------+--------------------------+
<span class="hljs-params">| No|</span>X1 transaction date<span class="hljs-params">|X2 house age|</span>X3 distance to the nearest MRT station<span class="hljs-params">|X4 number of convenience stores|</span>X5 latitude<span class="hljs-params">|X6 longitude|</span>Y house price of unit area<span class="hljs-params">|
+---+-------------------+------------+--------------------------------------+-------------------------------+-----------+------------+--------------------------+
|</span>  <span class="hljs-number">1</span><span class="hljs-params">|           2012.917|</span>        <span class="hljs-number">32.0</span><span class="hljs-params">|                              84.87882|</span>                             <span class="hljs-number">10</span><span class="hljs-params">|   24.98298|</span>   <span class="hljs-number">121.54024</span><span class="hljs-params">|                      37.9|</span>
<span class="hljs-params">|  2|</span>           <span class="hljs-number">2012.917</span><span class="hljs-params">|        19.5|</span>                              <span class="hljs-number">306.5947</span><span class="hljs-params">|                              9|</span>   <span class="hljs-number">24.98034</span><span class="hljs-params">|   121.53951|</span>                      <span class="hljs-number">42.2</span><span class="hljs-params">|
+---+-------------------+------------+--------------------------------------+-------------------------------+-----------+------------+--------------------------+
only showing top 2 rows</span>
</code></pre><p>describe() to see statistics of columns</p>
<pre><code>real_estate.describe().show()
<span class="hljs-symbol">Out:</span>
+-------+-----------------+-------------------+------------------+--------------------------------------+-------------------------------+--------------------+--------------------+--------------------------+
<span class="hljs-params">|summary|</span>               No<span class="hljs-params">|X1 transaction date|</span>      X2 house age<span class="hljs-params">|X3 distance to the nearest MRT station|</span>X4 number of convenience stores<span class="hljs-params">|         X5 latitude|</span>        X6 longitude<span class="hljs-params">|Y house price of unit area|</span>
+-------+-----------------+-------------------+------------------+--------------------------------------+-------------------------------+--------------------+--------------------+--------------------------+
<span class="hljs-params">|  count|</span>              <span class="hljs-number">414</span><span class="hljs-params">|                414|</span>               <span class="hljs-number">414</span><span class="hljs-params">|                                   414|</span>                            <span class="hljs-number">414</span><span class="hljs-params">|                 414|</span>                 <span class="hljs-number">414</span><span class="hljs-params">|                       414|</span>
<span class="hljs-params">|   mean|</span>            <span class="hljs-number">207.5</span><span class="hljs-params">| 2013.1489710144933|</span> <span class="hljs-number">17.71256038647343</span><span class="hljs-params">|                    1083.8856889130436|</span>              <span class="hljs-number">4.094202898550725</span><span class="hljs-params">|  24.969030072463745|</span>  <span class="hljs-number">121.53336108695667</span><span class="hljs-params">|         37.98019323671498|</span>
<span class="hljs-params">| stddev|</span><span class="hljs-number">119.6557562342907</span><span class="hljs-params">| 0.2819672402629999|</span><span class="hljs-number">11.392484533242524</span><span class="hljs-params">|                     1262.109595407851|</span>             <span class="hljs-number">2.9455618056636177</span><span class="hljs-params">|0.012410196590450208|</span><span class="hljs-number">0</span>.<span class="hljs-number">0153471</span>83004592374<span class="hljs-params">|        13.606487697735316|</span>
<span class="hljs-params">|    min|</span>                <span class="hljs-number">1</span><span class="hljs-params">|           2012.667|</span>               <span class="hljs-number">0</span>.<span class="hljs-number">0</span><span class="hljs-params">|                              23.38284|</span>                              <span class="hljs-number">0</span><span class="hljs-params">|            24.93207|</span>           <span class="hljs-number">121.47353</span><span class="hljs-params">|                       7.6|</span>
<span class="hljs-params">|    max|</span>              <span class="hljs-number">414</span><span class="hljs-params">|           2013.583|</span>              <span class="hljs-number">43.8</span><span class="hljs-params">|                              6488.021|</span>                             <span class="hljs-number">10</span><span class="hljs-params">|            25.01459|</span>           <span class="hljs-number">121.56627</span><span class="hljs-params">|                     117.5|</span>
+-------+-----------------+-------------------+------------------+--------------------------------------+-------------------------------+--------------------+--------------------+--------------------------+
</code></pre><p>Kaggle also provides the details on these attributes such as count, mean, standard deviation. This will allow you to decide on which parameters to use as features for the model</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627258998184/WQj3DqvWA.png" alt="image.png" /></p>
<ul>
<li>Step 5: Use VectorAssembler to transform the data into a feature column</li>
</ul>
<p>After you have decided which columns to use, apply VectorAssembler to format the dataframe</p>
<pre><code>from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=[ 
 <span class="hljs-string">'X1 transaction date'</span>,
 <span class="hljs-string">'X2 house age'</span>,
 <span class="hljs-string">'X3 distance to the nearest MRT station'</span>,
 <span class="hljs-string">'X4 number of convenience stores'</span>,
 <span class="hljs-string">'X5 latitude'</span>,
 <span class="hljs-string">'X6 longitude'</span>],
 outputCol=<span class="hljs-string">'features'</span>)

data_set = assembler.transform(real_estate)
final_data = data_set.select([<span class="hljs-string">'features'</span>,<span class="hljs-string">'Y house price of unit area'</span>])
final_data.show(<span class="hljs-number">2</span>)

<span class="hljs-symbol">Out:</span>
+--------------------+--------------------------+
<span class="hljs-params">|            features|</span>Y house price of unit area<span class="hljs-params">|
+--------------------+--------------------------+
|</span>[<span class="hljs-number">2012.917</span>,<span class="hljs-number">32.0</span>,<span class="hljs-number">84</span>...<span class="hljs-params">|                      37.9|</span>
<span class="hljs-params">|[2012.917,19.5,30...|</span>                      <span class="hljs-number">42.2</span><span class="hljs-params">|
+--------------------+--------------------------+
only showing top 2 rows</span>
</code></pre><ul>
<li>Step 6: Split into train and test sets</li>
</ul>
<pre><code><span class="hljs-attribute">train_data</span>,test_data = final_data.randomSplit([<span class="hljs-number">0</span>.<span class="hljs-number">7</span>,<span class="hljs-number">0</span>.<span class="hljs-number">3</span>])
</code></pre><ul>
<li>Step 7: Train your model (fit the model on the train data)</li>
</ul>
<pre><code><span class="hljs-keyword">from</span> pyspark.ml.regression <span class="hljs-keyword">import</span> LinearRegression

lr = LinearRegression(labelCol=<span class="hljs-string">'Y house price of unit area'</span>)
lrModel = lr.fit(train_data)
</code></pre><ul>
<li>Step 8: Evaluate the model on the test set</li>
</ul>
<p>Check the evaluation metrics after validating with the test set:</p>
<pre><code>test_stats = lrModel.evaluate(test_data)
print(<span class="hljs-string">f"RMSE: <span class="hljs-subst">{test_stats.rootMeanSquaredError}</span>"</span>)
print(<span class="hljs-string">f"R2: <span class="hljs-subst">{test_stats.r2}</span>"</span>)
print(<span class="hljs-string">f"R2: <span class="hljs-subst">{test_stats.meanSquaredError}</span>"</span>)

Out:
RMSE: <span class="hljs-number">7.553238336636628</span>
R2: <span class="hljs-number">0.6493363975473592</span>
R2: <span class="hljs-number">57.051409370037256</span>
</code></pre><p>Root Mean Squared Error (RMSE) on test data = 7.553238336636628</p>
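<p>To actually generate predictions, the fitted model is applied with <code>transform</code>; a quick sketch on the held-out test set (column names as above):</p>
<pre><code class="lang-python"># Predictions for the test split, alongside the true prices
predictions = lrModel.transform(test_data)
predictions.select("features", "Y house price of unit area", "prediction").show(5)

# The learned coefficients and intercept are available for inspection as well
print(lrModel.coefficients)
print(lrModel.intercept)
</code></pre>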
<h2 id="conclusion">Conclusion</h2>
<p>Spark isn't just a better approach to comprehend our information; it's also a lot faster. Spark transforms data analytics and research by enabling us to handle a wide variety of data challenges in a preferred language. Spark MLlib makes it simple for new data scientists to engage with their models right out of the package and specialists can fine-tune as needed.</p>
<p>Distributed systems may be the domain of data engineers, while machine learning methods and algorithms are the domain of data scientists. Spark has significantly improved and revolutionized machine learning by allowing data scientists to concentrate on the data challenges that matter to them while transparently benefiting from the performance, convenience, and integration of Spark's unified system.</p>
]]></content:encoded></item><item><title><![CDATA[Pyspark Installation Guide]]></title><description><![CDATA[Stick around if you're for a complete guide to set up a pyspark environment for data science applications; pyspark functionality as well as the best platforms to be explored.
What is Pyspark?
Pyspark, a robust language that must be considered to lear...]]></description><link>https://anujsyal.com/pyspark-installation-guide</link><guid isPermaLink="true">https://anujsyal.com/pyspark-installation-guide</guid><category><![CDATA[spark]]></category><category><![CDATA[Python]]></category><category><![CDATA[guide]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[big data]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Mon, 07 Jun 2021 08:11:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1623053313098/iNREdXNP2.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Stick around if you're for a complete guide to set up a pyspark environment for data science applications; pyspark functionality as well as the best platforms to be explored.</p>
<h2 id="what-is-pyspark">What is Pyspark?</h2>
<p>Pyspark is a robust tool worth learning if you're into the idea of creating more scalable pipelines and analyses. According to Chris Min, a data engineer, Pyspark essentially enables writing Spark apps in Python and makes data processing efficient in a distributed fashion. Python is not just a great language, but an all-in-one ecosystem for performing exploratory data analysis, creating ETLs for data platforms, and building ML pipelines. You might also say that PySpark is a whole library that can be used for large-scale data processing on a single machine or a cluster; moreover, it handles all that parallel processing for you without even touching Python's threading or multiprocessing modules. </p>
<h2 id="spark-is-the-real-deal-for-data-engineering">Spark is the Real Deal For Data Engineering</h2>
<p>According to the <a target="_blank" href="https://rdcu.be/clqD9">International Journal of Data Science and Analytics</a>, the emergence of Spark as a general-purpose cluster computing framework with language-integrated APIs in Python, Scala, and Java is a real thing right now. Its impressively advanced in-memory programming model and its libraries for structured data processing, scalable ML, and graph analysis increase its usefulness in the data science industry. And as a matter of fact, it is undeniable that past a certain scale of data processing, scaling with Pandas is hard. Being a data engineer involves a lot of large data processing, which isn't a big deal once you are well-versed in Spark.</p>
<h2 id="why-should-data-scientists-learn-spark">Why Should Data Scientists Learn spark?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1623052888461/eY2vW6ilG.jpeg" alt="sigmund-AxAPuIRWHGk-unsplash.jpg" /></p>
<p><a target="_blank" href="https://unsplash.com/photos/AxAPuIRWHGk">https://unsplash.com/photos/AxAPuIRWHGk</a></p>
<p>For a data scientist, learning Spark can be a game-changer. For large data processing, Spark is far better than Pandas while not so different in use, so switching to it is not a big deal, and you get real benefits in your data engineering operations. Spark has solutions to various issues and offers a complete collection of libraries to execute logic efficiently. It provides a clean and efficient experience, often better than Pandas, especially when dealing with large datasets, thanks to its high-performance analysis and user-friendly structure.</p>
<h1 id="exploring-all-the-options-for-pyspark-setup">Exploring All The Options for Pyspark Setup</h1>
<p>I also have a video version of this article, if you are interested feel free to watch this video on my youtube channel</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/Ql_jfk3UnHE">https://youtu.be/Ql_jfk3UnHE</a></div>
<p>Following is a set of options you can consider to set up the PySpark ecosystem. The list below covers the best platforms you can consider:</p>
<h2 id="setting-up-locally-spark-and-python-on-ubuntu">Setting Up Locally Spark and Python On Ubuntu</h2>
<ul>
<li>Install Java</li>
</ul>
<pre><code class="lang-python">sudo apt install openjdk<span class="hljs-number">-8</span>-jdk
</code></pre>
<ul>
<li>Download spark from <a target="_blank" href="https://spark.apache.org/downloads.html"><code>https://spark.apache.org/downloads.html</code></a> linux version</li>
<li>Set environment variables <code>sudo nano /etc/environment</code></li>
</ul>
<pre><code class="lang-python">JAVA_HOME=<span class="hljs-string">"/usr/lib/jvm/java-8-openjdk-amd64"</span>
<span class="hljs-comment">#Save and exit</span>
</code></pre>
<ul>
<li>To test <code>echo $JAVA_HOME</code> and see path to confirm installation</li>
<li>Open bashrc <code>sudo nano ~/.bashrc</code> and at the end of the file add <code>source /etc/environment</code></li>
<li>This should setup your Java environment on ubuntu</li>
<li>Install Spark: after downloading it in step 2, install it with the following commands</li>
</ul>
<pre><code class="lang-python">cd Downloads
sudo tar -zxvf spark<span class="hljs-number">-3.1</span><span class="hljs-number">.2</span>-bin-hadoop3<span class="hljs-number">.2</span>.tgz
</code></pre>
<ul>
<li>Configure environment for spark <code>sudo nano ~/.bashrc</code> and add the following</li>
</ul>
<pre><code class="lang-python">export SPARK_HOME=~/Downloads/spark<span class="hljs-number">-3.1</span><span class="hljs-number">.2</span>-bin-hadoop3<span class="hljs-number">.2</span>
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:~/anaconda3/bin
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON=<span class="hljs-string">"jupyter"</span>
export PYSPARK_DRIVER_PYTHON_OPTS=<span class="hljs-string">"notebook"</span>
export PYSPARK_PYTHON=python3
export PATH=$PATH:$JAVA_HOME/jre/bin
</code></pre>
<ul>
<li>Save and exit</li>
<li>To test <code>pyspark</code></li>
</ul>
<p><strong>Don't have ubuntu? Use VirtualBox</strong></p>
<p>Set up ubuntu on your local machine using VirtualBox. VirtualBox basically enables you to run a virtual computer on top of your own physical computer. You can use it to set up Spark and Python (20-30 mins approx):</p>
<ul>
<li>Start with downloading the <a target="_blank" href="https://www.virtualbox.org/wiki/Downloads">Virtualbox</a>.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1623052986399/Dsq0H_avU.png" alt="Untitled 1.png" /></p>
<blockquote>
<p>Screenshot from Virtualbox download</p>
</blockquote>
<ul>
<li>Download ubuntu ISO <a target="_blank" href="https://ubuntu.com/download/desktop/thank-you?version=20.04.2.0&amp;architecture=amd64">Image</a></li>
<li>In virtual box click on new and setup ubuntu 64 bit environment</li>
<li>Pass in the desired cpu cores, memory, and storage</li>
<li>Point to the downloaded ubuntu image</li>
</ul>
<h2 id="setting-up-locally-spark-and-python-on-mac">Setting Up Locally Spark and Python On Mac</h2>
<ul>
<li>Make sure Homebrew is installed and updated, if not go to this <a target="_blank" href="https://brew.sh/">link</a> or type in terminal</li>
</ul>
<pre><code class="lang-python">/usr/bin/ruby -e <span class="hljs-string">"$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"</span>
</code></pre>
<ul>
<li>Open terminal and Install Java</li>
</ul>
<pre><code class="lang-python">brew install java
<span class="hljs-comment">#to check if java installed?</span>
brew info java
</code></pre>
<ul>
<li>Install scala</li>
</ul>
<pre><code class="lang-python">brew install scala
</code></pre>
<ul>
<li>Install Spark</li>
</ul>
<pre><code class="lang-python">brew install apache-spark
</code></pre>
<ul>
<li>Install python</li>
</ul>
<pre><code class="lang-python">brew install python3
</code></pre>
<ul>
<li>Setup environment bashrc
Open file <code>sudo nano .bashrc</code></li>
<li>Add following env variables</li>
</ul>
<pre><code class="lang-python"><span class="hljs-comment">#java</span>
export JAVA_HOME=/Library/java/JavaVirtualMachines/adoptopenjdk<span class="hljs-number">-8.j</span>dk/contents/Home/
export JRE_HOME=/Library/java/JavaVirtualMachines/openjdk<span class="hljs-number">-13.j</span>dk/contents/Home/jre/
<span class="hljs-comment">#spark</span>
export SPARK_HOME=/usr/local/Cellar/apache-spark/<span class="hljs-number">2.4</span><span class="hljs-number">.4</span>/libexec
export PATH=/usr/local/Cellar/apache-spark/<span class="hljs-number">2.4</span><span class="hljs-number">.4</span>/bin:$PATH
<span class="hljs-comment">#pyspark</span>
export PYSPARK_PYTHON=/usr/local/bin/python3 <span class="hljs-comment"># or your path to python</span>
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=<span class="hljs-string">'notebook'</span>
</code></pre>
<ul>
<li>This should configure the pyspark setup; to test, type <code>pyspark</code> in the terminal</li>
</ul>
<h2 id="setting-up-locally-with-docker-and-jupyter-notebook-my-preferred-method">Setting up locally with docker and jupyter notebook (My preferred Method)</h2>
<p><strong>What is docker?</strong></p>
<p>Docker is an open platform for developing, shipping, and running applications. If you want to learn more about docker, check out this <a target="_blank" href="https://docs.docker.com/get-started/overview/">link</a> </p>
<p>Setting up Spark with docker and jupyter notebook is quite a simple task involving a few steps that help build up an optimal environment for PySpark to be run on Jupyter Notebook in no time. Follow the steps mentioned below:</p>
<ul>
<li>Install <a target="_blank" href="https://docs.docker.com/get-docker/">Docker</a></li>
<li>Use a pre-existing docker image <a target="_blank" href="https://hub.docker.com/r/jupyter/pyspark-notebook">jupyter/pyspark-notebook</a> by <a target="_blank" href="https://hub.docker.com/u/jupyter">jupyter</a></li>
<li>Pull Image</li>
</ul>
<pre><code class="lang-python">docker pull jupyter/pyspark-notebook
</code></pre>
<ul>
<li>Docker Run</li>
</ul>
<pre><code class="lang-python">docker run -d -p <span class="hljs-number">8888</span>:<span class="hljs-number">8888</span> jupyter/pyspark-notebook:latest
</code></pre>
<ul>
<li>Go to <a target="_blank" href="http://localhost:8888">localhost:8888</a>, create a new notebook, and run a cell with <code>import pyspark</code>; a fuller sanity check is sketched below</li>
</ul>
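<p>A slightly fuller sanity check, assuming nothing beyond pyspark being available in the notebook, might look like this:</p>
<pre><code class="lang-python">from pyspark.sql import SparkSession

# Local session inside the container
spark = SparkSession.builder.master("local[*]").appName("smoke_test").getOrCreate()

# Tiny throwaway DataFrame just to confirm everything works
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()

spark.stop()
</code></pre>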
<h2 id="databricks-setup">Databricks Setup</h2>
<p>Databricks, a unified analytics platform, provides well-managed Spark clusters in the cloud. It is an easy-to-use environment that encourages users to learn, collaborate, and work in a fully integrated workspace. Any Spark code can be scheduled without hassle, as Databricks supports pyspark natively.</p>
<ul>
<li>To start, <a target="_blank" href="https://databricks.com/try-databricks">create a databricks account</a> (this is usually done by databricks admins) and link it to your preferred cloud provider. For more information on getting started, check out this <a target="_blank" href="https://www.youtube.com/watch?v=3fqfWYBXj2A">video</a></li>
</ul>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=3fqfWYBXj2A">https://www.youtube.com/watch?v=3fqfWYBXj2A</a></div>
<ul>
<li>You have to start with creating a Databricks cluster.</li>
<li>Create a databricks notebook and test it with <code>import pyspark</code>, or with the quick check shown below</li>
</ul>
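<p>Databricks notebooks come with a SparkSession already created and exposed as <code>spark</code>, so a minimal check can simply use it:</p>
<pre><code class="lang-python"># `spark` is pre-defined in a Databricks notebook; no builder needed
df = spark.range(10)
df.show()
</code></pre>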
<h2 id="spark-and-python-on-aws-ec2">Spark and Python on AWS EC2</h2>
<p>Amazon EC2 instances are virtual machines provided by AWS. They come with a pre-installed OS via AMIs, but the rest of the dependencies need to be installed separately. </p>
<ul>
<li>Go to AWS Console and EC2</li>
<li>Select Ubuntu AMI</li>
<li>Follow the steps from Option 1</li>
</ul>
<p>In general, avoid this approach and use one of the other options instead</p>
<h2 id="pyspark-on-aws-sagemaker-notebooks"><strong>Pyspark on AWS Sagemaker Notebooks</strong></h2>
<p>Launched in 2017, Amazon SageMaker is a cloud-based machine-learning platform that is fully managed and decouples your environments across development, training, and deployment, letting you scale them separately whilst helping you optimize your spending and time. It is really easy to spin up Sagemaker notebooks with a click of a few buttons. An Amazon SageMaker notebook instance is a machine learning (ML) compute instance running the Jupyter Notebook environment. It comes with pre-configured Conda environments like python2, python3, PyTorch, TensorFlow, etc.</p>
<ul>
<li>Log in to your aws console and go to Sagemaker</li>
<li>Click on Notebook, Notebook Instances on the left side</li>
<li>Click on Create Notebook Instances, give it a name and select desired configurations</li>
<li>Select an instance type; maybe start small with ml.t2.medium, and spin up a more powerful instance later if needed</li>
<li>Click create and wait for a few minutes and then click on open jupyterlab to go to the notebook</li>
<li>Create a new notebook and write the following code snippet to run pyspark</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> sagemaker_pyspark
<span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession, DataFrame

classpath = <span class="hljs-string">":"</span>.join(sagemaker_pyspark.classpath_jars())
spark = SparkSession.builder.config(
        <span class="hljs-string">"spark.driver.extraClassPath"</span>, classpath
    ).getOrCreate()
</code></pre>
<ul>
<li>If you are interested to know more about Sagemaker, do check out my previous <a target="_blank" href="https://youtu.be/95332cm5ROo">video</a>, <a target="_blank" href="https://youtu.be/95332cm5ROo">Sagemaker in 11 Minutes</a></li>
</ul>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/95332cm5ROo">https://youtu.be/95332cm5ROo</a></div>
<h2 id="aws-emr-cluster-setup">AWS EMR Cluster Setup</h2>
<p>Amazon EMR, probably one of the best places to run Spark, helps you create Spark clusters very easily. It is equipped with features such as Amazon S3 connectivity, which makes it fast and convenient, plus integration with the EC2 spot market and EMR Managed Scaling.</p>
<p>To be precise, EMR is a managed big data service that supports data science applications written in Python, Scala, and Pyspark. It provides a convenient Spark cluster setup so that data scientists have a platform to develop and visualize on.</p>
<ul>
<li>Go to AWS console and search for EMR</li>
<li>Click on create a cluster</li>
<li>In General Configuration give it a name; in Software Configuration select the Spark application</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1623053201774/btbcrPm3O.png" alt="EMR Cluster setup 1.png" /></p>
<ul>
<li>In Hardware configuration select an instance type (maybe start small with m1.medium) and the number of instances in the cluster</li>
<li>In Security, select EC2 key pairs, usually created by an administrator; if you don't have one, you can follow the steps on the right to create programmatic access keys for the cluster to use</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1623053214057/njKjy_dB5.png" alt="EMR Cluster setup 2.png" /></p>
<ul>
<li>Keep the rest of the options at their defaults and create the cluster</li>
<li>After that, create an EMR notebook and select the newly created cluster to execute your jobs at scale</li>
</ul>
<h2 id="conclusion">Conclusion:</h2>
<p>Spark, a complete analytics engine, helps data scientists with large data processing jobs that are difficult to handle with Pandas. Thus, learning PySpark, a robust library, can help data engineers a lot in their day-to-day work. Now that you know the various platforms that let you set up Spark clusters on well-managed clouds, you can explore them yourself.</p>
]]></content:encoded></item><item><title><![CDATA[How To NFT?]]></title><description><![CDATA[NFTs are digital directories that are powered by a blockchain system which is the same infrastructure that characterizes common cryptocurrency. However, an NFT is a unique kind of cryptocurrency, and the blockchain database on which it is stored auth...]]></description><link>https://anujsyal.com/how-to-nft</link><guid isPermaLink="true">https://anujsyal.com/how-to-nft</guid><category><![CDATA[Cryptocurrency]]></category><category><![CDATA[Ethereum]]></category><category><![CDATA[Blockchain]]></category><category><![CDATA[Bitcoin]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Tue, 11 May 2021 06:27:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1620713415904/l1eYLV3AV.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>NFTs are digital directories that are powered by a blockchain system which is the same infrastructure that characterizes common cryptocurrency. However, an NFT is a unique kind of cryptocurrency, and the blockchain database on which it is stored authenticates whoever the legitimate holder of that cryptocurrency.</p>
<p>NFTs are considered an element of the Ethereum blockchain, much like other cryptocurrencies. Before we understand what an NFT is, let's look at the underlying technology behind NFTs, which is Ethereum.</p>
<h2 id="what-is-ethereum-blockchain">What is Ethereum blockchain?</h2>
<p>After Bitcoin, Ethereum is the second-largest blockchain by trading volume. However, it wasn't designed to serve only as an electronic currency. Instead, Ethereum's creators set out to build a different type of global, distributed computing platform, bringing blockchain's security and transparency to a wide variety of applications.</p>
<p>Various financial products, apps, and complex systems are already running on Ethereum, and only the creators' imagination limits its potential. Ethereum can be used to formalize, decentralize, preserve, and exchange almost anything.</p>
<h2 id="nft-values-digital-art-in-millions">NFT Values digital art in millions</h2>
<p>The value of NFT artwork is on the rise. Several artists are selling their masterpieces for millions. Recently Mike Winkelmann sold an NFT of his art piece for $69 million, and the auction house rated him among the most valuable living artists.</p>
<p>When NFTs initially came to everyone's notice, few people knew what they were or what they would be used for. They are now a flourishing industry, with items selling for millions of dollars each. </p>
<p>NFT token revenues exceeded $500 million in February, more than double the total for the whole year of 2020. More than 191,000 electronic art pieces have now been purchased for a record of $533 million, per the Blockchain Art report.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1620713898338/N9m0y545k.png" alt="image.png" /></p>
<h2 id="how-to-buy-your-first-nft">How to buy your first NFT?</h2>
<p>NFT marketplaces allow users to upload their own creative pieces as well as purchase other people’s art, which can be a wonderful experience. Furthermore, looking at what other people are offering will give you a better idea of what is trending and popular.</p>
<h3 id="opensea-marketplace">OpenSea Marketplace</h3>
<p>This is unquestionably one of the best places for generating your NFT. You can generate the token for yourself, quickly and easily, thanks to a very user-friendly creation flow. Even then, you can expect to pay a fee in ETH to get your NFT listed.</p>
<p>Generating the token is free, but selling it is not. Even so, OpenSea is a great option because it is well-known and draws a large number of customers. The platform has some very innovative tools that are worth exploring. For example, OpenSea allows you to bundle your NFT in offerings with those of other vendors, a one-of-a-kind feature that can be very useful because it expands your reach.</p>
<h2 id="how-to-create-your-nft-on-opensea">How to create your NFT on OpenSea?</h2>
<ul>
<li>Visit the official website and click on ‘Create’.</li>
<li>You will have to use your wallet to access the website.</li>
<li>After accepting the terms of service, add the details of your collection, including name, description, and logo.</li>
<li>After creating a collection, you can select the items you want to tokenize.</li>
<li>You can add images, sound, and 3D models in any format.</li>
<li>Different characteristics can also be added to the token.</li>
<li>After creating the token, you can start selling it online by clicking on ‘Sell’.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1620713868246/UK7whZJA7.png" alt="image.png" /></p>
<h2 id="current-problems-with-nfts">Current problems with NFTs</h2>
<ul>
<li>ERC-20 projects are unable to execute micropayment transfers on Ethereum due to rising gas prices, which makes it impractical to use the Ethereum network for one of its main applications. Ethereum charges gas fees: these are the fees paid to miners to complete transfers. The price isn't fixed; it varies with demand on the network. If a transaction's fee does not meet the miners' expectations, it will be postponed or refused entirely.</li>
<li>Non-fungible tokens (NFTs), which are individual pieces of crypto content, have been partly responsible for the thousands of tonnes of planet-warming carbon emissions produced by the tokens used to purchase and trade them. Many creators, particularly those who have already profited from the trend, believe it is a simple challenge to overcome. Others believe the existing strategies are ineffective.</li>
<li>A critical issue with NFTs is that the accounts holding them can be hacked like any other account, and the NFTs stolen outright. Some social media users announced that their Nifty Portal identities had been compromised and that NFTs valued at millions of dollars had been stolen.</li>
<li>Several creators have raised concerns about the environmental consequences of crypto art. The exact amount of energy used to mint artwork on the blockchain varies, but it can range from days to weeks or even months of an ordinary citizen's energy consumption.</li>
</ul>
<h2 id="ethereum-community-is-trying-to-curb-the-high-gas-price">Ethereum community is trying to curb the high gas price</h2>
<p>Eth2 is a collection of enhancements planned to increase the platform's performance, usability, stability, durability, and versatility. It would ease the gas problem by making Dai transactions and other DeFi services less costly.</p>
<p>By transitioning to proof-of-stake (PoS), Ethereum's creators hope to reduce the heavy resource demands of the existing proof-of-work consensus mechanism and its dependency on specialized hardware. The PoS framework that will be implemented on the Beacon Network lets participants secure the distributed Ethereum blockchain by putting up an economic stake, keeping the platform secure while reducing energy consumption.</p>
<p>Sharding, another expected enhancement, would allow the system to handle far more transactions than it currently can, lowering transaction costs by reducing competition for space in the next block. To spread the load, Eth2 can distribute transactions across a large number of shards.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Many traders are making risky investments on the NFT industry and NFT artwork with expectations of seeing their worth boom. Others buy NFTs primarily for the purpose of recognition, personal satisfaction, or simply to enter a new culture.</p>
<p>To summarize, an NFT is a digital piece of artwork paired with a token that makes it unique and records its ownership on a blockchain. Its value is determined by who created it and by how much people believe it is worth.</p>
<h1 id="more-interested-in-the-topic">More interested in the topic</h1>
<p>Check out this YouTube video I created for a detailed walkthrough of buying your first NFT. </p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/Malw5Wg79kk"></iframe>


<h1 id="follow-me-on-linkedin-and-twitter">Follow Me on Linkedin &amp; Twitter</h1>
<p>If you are interested in similar content, hit the follow button on Medium or follow me on  <a target="_blank" href="https://twitter.com/anuj_syal">Twitter</a>  and  <a target="_blank" href="https://www.linkedin.com/in/anuj-syal-727736101/">LinkedIn</a> </p>
]]></content:encoded></item><item><title><![CDATA[The SageMaker Saga]]></title><description><![CDATA[Many data scientists develop, train, and deploy ML models within a hosted environment.  Regrettably for them, they do not have the convenience and facility for scaling up or scaling down resources as and when required based on their models.
This is w...]]></description><link>https://anujsyal.com/the-sagemaker-saga</link><guid isPermaLink="true">https://anujsyal.com/the-sagemaker-saga</guid><category><![CDATA[AWS]]></category><category><![CDATA[Amazon Web Services]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Wed, 28 Apr 2021 06:06:34 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1619584927235/_g9qwG7rC.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Many data scientists develop, train, and deploy ML models within a hosted environment.  Regrettably for them, they do not have the convenience and facility for scaling up or scaling down resources as and when required based on their models.</p>
<p>This is where AWS SageMaker comes into the picture! It solves the issue by letting developers build and train models and get them to production faster, with minimal effort and at an economical cost. </p>
<h3 id="but-firstwhat-is-aws-you-ask">But first…what is AWS you ask?</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1619585596424/QS-2edvGo.jpeg" alt="hello-i-m-nik-r22qS5ejODs-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@helloimnik?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Hello I'm Nik</a> on <a href="https://unsplash.com/s/photos/amazon?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>Amazon Web Services (AWS) is the world's most comprehensive and widely adopted on-demand cloud platform, run by, you guessed it...Amazon, offering over 200 fully featured services from data centres around the world. AWS services can be used to build, monitor, and deploy any type of application in the cloud, enabling millions of people and businesses, including the fastest-growing start-ups, leading government agencies and the largest enterprises, to lower costs, innovate faster and become more agile.
Providing a massive global cloud infrastructure, AWS allows you to quickly innovate, iterate and experiment. With proven operational expertise, the flexibility to choose the services you need and far more functionality and features than any other cloud provider, AWS lets you focus on innovation, not just infrastructure.
As a language- and OS-agnostic platform, AWS provides a highly secure, scalable and reliable low-cost infrastructure in the cloud that powers hundreds of thousands of businesses and millions of customers in over 190 countries around the world.
Today AWS has one of the largest and most dynamic communities of customers and partners, drawn from virtually every industry and every company size.</p>
<h3 id="welcome-aws-sagemaker">Welcome, AWS SageMaker</h3>
<p>Launched in 2017, Amazon SageMaker is a fully managed, cloud-based machine-learning platform that decouples your development, training and deployment environments, letting you scale them separately while helping you optimise both spend and time. AWS SageMaker includes modules that data scientists and developers can use together or independently to build, train, and deploy ML models at any scale.
AWS SageMaker empowers everyday developers and scientists to use machine learning without any previous experience. Developers across the world are adopting SageMaker in different ways, some for the end-to-end flow and others just to scale up training jobs.</p>
<h3 id="why-aws-sagemaker-the-advantages">Why AWS SageMaker: The Advantages</h3>
<p>The AWS SageMaker comes with a pool of advantages, some of which I am listing below:</p>
<ul>
<li>It improves and enhances the productivity of a machine learning project</li>
<li>It aids in creating and managing compute instances in the least amount of time </li>
<li>It reduces the cost of building machine learning models by up to 70% </li>
<li>It can automatically create, train, and deploy models with complete visibility, starting by inspecting the raw data </li>
<li>It reduces the time required for data labelling tasks </li>
<li>It helps in storing all Machine Learning components in one place</li>
<li>It trains models faster and is highly scalable</li>
<li>It maintains uptime — Process keeps on running without any stoppage</li>
<li>It maintains high data security</li>
</ul>
<p>A big umbrella over all the ML services, SageMaker aims to provide a single place for all your Machine Learning and Data Science workflows. It covers every step involved, from provisioning cloud resources and importing data, to cleaning and labelling the data (including manual labelling), to training models, automating workflows and deploying models in production. </p>
<h3 id="aws-sagemaker-demo-in-10-minutes">AWS Sagemaker Demo in 10 minutes</h3>
<p>Looking for a quick start on the SageMaker console? Check out this video on YouTube.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/95332cm5ROo"></iframe>

<blockquote>
<p>Sagemaker in 11 minutes by Anuj Syal</p>
</blockquote>
<h3 id="exploring-the-full-potential-sagemakers-features-and-capabilities">Exploring the Full Potential: SageMaker’s Features and Capabilities</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1619585192671/NqEQ569gQ.png" alt="Untitled.png" /></p>
<blockquote>
<p>Source: https://aws.amazon.com/sagemaker/</p>
</blockquote>
<h4 id="prepare">Prepare</h4>
<p>Even if you don't have a labelled dataset, AWS SageMaker lets you enlist human labellers, for example via Amazon Mechanical Turk, to label your dataset correctly. One such capability is <code>Amazon SageMaker Ground Truth</code>, a fully managed data labelling service that helps you build the right training dataset. You can get started with labelling your data in minutes through the SageMaker Ground Truth console using custom or built-in data labelling workflows. </p>
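<p>If you prefer to check on a Ground Truth labelling job from code rather than the console, a minimal boto3 sketch could look like the one below. It assumes your AWS credentials and region are already configured, and the job name is a hypothetical placeholder.</p>
<pre><code class="lang-python">import boto3

# Assumes credentials and region are configured (e.g. via `aws configure`).
sm = boto3.client("sagemaker")

# List recent labelling jobs and their statuses.
jobs = sm.list_labeling_jobs(MaxResults=10)
for job in jobs["LabelingJobSummaryList"]:
    print(job["LabelingJobName"], job["LabelingJobStatus"])

# "my-image-labels" is a placeholder job name used purely for illustration.
detail = sm.describe_labeling_job(LabelingJobName="my-image-labels")
print(detail["LabelCounters"])
</code></pre>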
<h4 id="build">Build</h4>
<p>AWS SageMaker makes it easy to build ML models and get them ready for training by providing everything you need to quickly connect to your training data and to select and optimise the best algorithm and framework for your application. Amazon SageMaker includes hosted Jupyter notebooks that make it easy to explore and visualise your training data stored on Amazon S3. You can either connect directly to data in S3, or use AWS Glue to move data from Amazon DynamoDB, Amazon RDS, and Amazon Redshift into S3 for analysis in your notebook.
For ease of algorithm selection, AWS SageMaker ships with 10 of the most frequently used ML algorithms, pre-installed and optimised to deliver up to 10 times the performance you would get running them anywhere else. SageMaker also comes pre-configured to run Apache MXNet and TensorFlow, two of the most widely used open-source frameworks, and you still have the option of bringing your own framework.</p>
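<p>As a hedged illustration of the Build step, here is a minimal sketch using the SageMaker Python SDK to point a built-in algorithm (XGBoost) at training data in S3. The S3 bucket, paths and IAM role ARN are placeholders you would replace with your own.</p>
<pre><code class="lang-python">import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
region = session.boto_region_name
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder role ARN

# Resolve the container image for the built-in XGBoost algorithm.
image_uri = sagemaker.image_uris.retrieve("xgboost", region, version="1.5-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/models/",  # placeholder output location
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="reg:squarederror", num_round=100)

# CSV training data already uploaded to S3 (placeholder path).
train_input = TrainingInput("s3://my-bucket/data/train.csv", content_type="text/csv")
</code></pre>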
<h4 id="train">Train</h4>
<p>The next essential step in AWS SageMaker is training a model, and alongside it, evaluating the model. Training primarily involves an algorithm, and choosing that algorithm depends on several other factors; for effective and faster use, AWS SageMaker provides built-in algorithms as well.
Another key requirement for training a Machine Learning model is compute resources. The size of the training dataset and the desired speed of results help determine what resources are needed.
After training completes, you evaluate the model to test the accuracy of its inferences. The AWS SDK for Python (Boto3) or the high-level SageMaker Python SDK can be used to send inference requests to the model, and a Jupyter notebook is a convenient place to run both training and evaluation.</p>
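<p>Continuing the Build sketch above, kicking off a managed training job and locating the resulting model artefact takes only a couple of lines (the channel name and paths remain illustrative):</p>
<pre><code class="lang-python"># Launch a managed training job on the instances configured earlier.
estimator.fit({"train": train_input})

# After training completes, the serialized model artefact lives in S3.
print("Model artefact:", estimator.model_data)
</code></pre>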
<h4 id="deploy">Deploy</h4>
<p>Once your model is trained and tuned, AWS SageMaker makes it easy to deploy it in production so you can start generating predictions on new data (a process called inference). To deliver both high performance and high availability, SageMaker deploys your model on an auto-scaling cluster of Amazon EC2 instances spread across multiple availability zones. AWS SageMaker also comes with built-in A/B testing capabilities to help you test your model and experiment with different versions to achieve the best results.
AWS SageMaker takes away the heavy lifting of ML, so you can build, train, and deploy machine learning models easily and efficiently.</p>
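<p>To illustrate the Deploy step, the trained estimator from the earlier sketch could be hosted on a real-time endpoint and queried roughly as follows; the instance type and the sample CSV record are assumptions chosen purely for illustration.</p>
<pre><code class="lang-python">from sagemaker.serializers import CSVSerializer

# Stand up a real-time HTTPS endpoint backed by managed EC2 instances.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    serializer=CSVSerializer(),
)

# Send one CSV record for inference (feature values are made up).
result = predictor.predict("5.1,3.5,1.4,0.2")
print(result)

# Tear the endpoint down when finished to stop incurring cost.
predictor.delete_endpoint()
</code></pre>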
<h3 id="validating-a-model-with-sagemaker">Validating a Model with SageMaker</h3>
<p>You can validate your model either offline, using historical data, or online, using live traffic:</p>
<p><strong>Offline Testing:</strong> Historical data is used to send inference requests to the model, typically through a Jupyter notebook in Amazon SageMaker, and the responses are evaluated.</p>
<p><strong>Online Testing with Live Data:</strong> Multiple model variants are deployed behind a single Amazon SageMaker endpoint, and a portion of live traffic is directed to each variant for validation.</p>
<p><strong>Validating Using a "Holdout Set":</strong> A portion of the data, the "holdout set", is set aside and never used for training. The model is trained on the remaining data and then evaluated on the holdout set to check how well it generalizes to data it has not seen.</p>
<p><strong>K-fold Validation:</strong> The input data is split into k equal parts (folds). Each fold takes a turn as the validation data while the remaining k−1 folds are used as training data, and the k evaluation scores are combined into a final estimate of model quality.</p>
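<p>The holdout and k-fold ideas above are not SageMaker-specific; a small scikit-learn sketch on a synthetic dataset (used purely for illustration) shows both splits in a few lines.</p>
<pre><code class="lang-python">import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Holdout set: keep 20% of the data aside and never train on it.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_hold, y_hold))

# K-fold: each of the k folds takes a turn as validation data,
# while the remaining k-1 folds are used for training.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("5-fold accuracies:", np.round(scores, 3))
</code></pre>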
<h3 id="sneak-peek-aws-sagemaker-studio-and-architectural-view">Sneak Peek: AWS SageMaker Studio and Architectural View</h3>
<p>Amazon SageMaker Studio is a fully integrated development environment (IDE) for machine learning, where building, training, and deploying models can all be done under one roof.</p>
<ul>
<li><p><strong>Amazon SageMaker Notebooks:</strong> Used for easily creating and sharing Jupyter notebooks.</p>
</li>
<li><p><strong>Amazon SageMaker Experiments:</strong>  Used for tracking, organizing, comparing, and evaluating different ML experiments.</p>
</li>
<li><p><strong>Amazon SageMaker Debugger:</strong> As the name suggests, it is used for debugging and analyzing training issues of complex types and receiving alert notifications for the errors.</p>
</li>
<li><p><strong>Amazon SageMaker Model Monitor:</strong> This is used to detect quality deviations for deployed ML models.</p>
</li>
<li><p><strong>Amazon SageMaker Autopilot:</strong> It is used to build ML models automatically with full visibility and control.</p>
</li>
</ul>
<h3 id="final-words-conclusion">Final Words: Conclusion</h3>
<p>Machine learning is the future of application development, and AWS SageMaker is set to revolutionize the world of computing. The sheer productivity gains machine learning brings to applications will create new prospects for the adoption of ML services such as SageMaker. </p>
]]></content:encoded></item></channel></rss>