<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Anuj Syal's Blog]]></title><description><![CDATA[Data Engineer | Youtuber and writer @Towardsdatascience | Love Python! | Based in Singapore |  Cloud and Machine Learning Expertise | Experience with Block Chain]]></description><link>https://anujsyal.com</link><generator>RSS for Node</generator><lastBuildDate>Fri, 10 Apr 2026 14:40:47 GMT</lastBuildDate><atom:link href="https://anujsyal.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Mastering Data Quality in ETL Pipelines with Great Expectations]]></title><description><![CDATA[In the world of data engineering, ensuring data quality is paramount. From business analysts relying on dashboards to C-level executives making strategic decisions, and data scientists training machine learning models — everyone depends on the qualit...]]></description><link>https://anujsyal.com/mastering-data-quality-in-etl-pipelines-with-great-expectations</link><guid isPermaLink="true">https://anujsyal.com/mastering-data-quality-in-etl-pipelines-with-great-expectations</guid><category><![CDATA[Data Science]]></category><category><![CDATA[data-quality]]></category><category><![CDATA[great-expectations]]></category><category><![CDATA[GCP]]></category><category><![CDATA[data]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Sun, 08 Dec 2024 06:07:05 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1733639243861/755fe485-32ce-49d3-b6d1-50b11e7e2d4e.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the world of data engineering, ensuring data quality is paramount. From business analysts relying on dashboards to C-level executives making strategic decisions, and data scientists training machine learning models — everyone depends on the quality of the data. If the data is flawed, the insights and decisions drawn from it are equally flawed.</p>
<p>This is where Great Expectations (GX) comes into play. Built on Python, Great Expectations is a powerful open-source data validation and quality framework that helps ensure data is clean, accurate, and ready for use. This blog will guide you through the why and how of data quality, the Great Expectations workflow, and a hands-on tutorial for implementation.</p>
<h2 id="heading-why-data-quality-matters"><strong>Why Data Quality Matters</strong></h2>
<p>When building ETL pipelines, your end goal is to deliver clean, validated data. This data could be used as:</p>
<ul>
<li><p><strong>Business Intelligence Reports</strong> for business analysts.</p>
</li>
<li><p><strong>Executive Dashboards</strong> for decision-makers.</p>
</li>
<li><p><strong>Training Datasets</strong> for machine learning models.</p>
</li>
</ul>
<p>If your data contains nulls, duplicates, incorrect formats, or misaligned schema, it undermines every stakeholder's effort. Data scientists can't build accurate models, analysts can't trust insights, and executives risk making bad decisions.</p>
<p>For data engineers, this makes <strong>data quality a moral and technical responsibility</strong>. It ensures that the hours spent building complex pipelines don't go to waste due to unreliable data.</p>
<h2 id="heading-enter-great-expectations"><strong>Enter Great Expectations</strong></h2>
<p><img src="https://legacy.017.docs.greatexpectations.io/img/gx-logo.svg" alt="Welcome | Great Expectations" class="image--center mx-auto" /></p>
<p>Great Expectations (GX) aims to make data quality checks <strong>declarative, automated, and shareable</strong>. With GX, you can define <strong>“expectations”</strong> for your data, which are essentially quality rules, such as:</p>
<ul>
<li><p>Ensuring no null values exist in critical columns.</p>
</li>
<li><p>Validating specific column values (like <code>country</code> only having "US", "IN", "UK").</p>
</li>
<li><p>Checking for minimum, maximum, and unique constraints on data.</p>
</li>
</ul>
<p>The framework integrates with most data sources, including <strong>databases, cloud storage (S3, GCS), Pandas, and Spark DataFrames</strong>. It also produces <strong>data quality reports</strong> that can be shared with business users, creating transparency about <strong>what data passed or failed the quality checks</strong>.</p>
<h2 id="heading-how-does-the-great-expectations-workflow-look"><strong>How Does the Great Expectations Workflow Look?</strong></h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733636454262/ad8581c4-3ac9-4063-ba7a-9f7904d43753.png" alt class="image--center mx-auto" /></p>
<p>A typical Great Expectations workflow has 4 key steps:</p>
<ol>
<li><p><strong>Environment Setup</strong>: Define the context to track your data sources and validation logic.</p>
</li>
<li><p><strong>Data Connection</strong>: Connect to databases, data warehouses, object storage, or files.</p>
</li>
<li><p><strong>Expectation Suite Definition</strong>: Specify the quality checks (null checks, range checks, etc.).</p>
</li>
<li><p><strong>Validation &amp; Reports</strong>: Run validation on the data and generate shareable reports.</p>
</li>
</ol>
<p>These steps map neatly onto a data engineer's workflow: Great Expectations can be slotted into each stage of the <strong>Extract, Transform, Load (ETL) cycle</strong> (see the sketch after the list below).</p>
<ul>
<li><p><strong>During Extraction</strong>: Check for nulls, data formats, or schema inconsistencies.</p>
</li>
<li><p><strong>During Transformation</strong>: Verify data quality after transformations (e.g., after joining datasets).</p>
</li>
<li><p><strong>Before Loading</strong>: Final validation before pushing to a warehouse like <strong>BigQuery, Snowflake, or Redshift</strong>.</p>
</li>
</ul>
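<p>As a rough illustration of that fit, here is a minimal sketch that validates a freshly extracted DataFrame before it moves on to the transform step. It uses the in-memory pandas data source; names such as <code>etl_checks</code> and the tiny sample DataFrame are illustrative and not part of the GCS example used later in this tutorial.</p>
<pre><code class="lang-python">import great_expectations as gx
import pandas as pd

context = gx.get_context()

# Register an in-memory pandas data source, a DataFrame asset, and a
# batch definition that always covers the whole DataFrame.
data_source = context.data_sources.add_pandas(name="etl_checks")
data_asset = data_source.add_dataframe_asset(name="extracted_rows")
batch_definition = data_asset.add_batch_definition_whole_dataframe("current_run")

# One expectation suite per pipeline stage keeps the checks easy to reason about.
suite = context.suites.add(gx.ExpectationSuite(name="extract_stage_suite"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="airline"))

validation_definition = context.validation_definitions.add(
    gx.ValidationDefinition(data=batch_definition, suite=suite, name="extract_stage_check")
)

# During extraction: validate the extracted rows before transforming them.
extracted_df = pd.DataFrame({"airline": ["Indigo", "Vistara", None]})
results = validation_definition.run(batch_parameters={"dataframe": extracted_df})
if not results.success:
    raise ValueError("Extraction-stage data quality checks failed")
</code></pre>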
<h2 id="heading-great-expectations-product-offerings"><strong>Great Expectations Product Offerings</strong></h2>
<p>Great Expectations offers two core products:</p>
<ul>
<li><p><strong>GX Core</strong>: The open-source version (focus of this blog).</p>
</li>
<li><p><strong>GX Cloud</strong>: A managed cloud version with a GUI for managing expectations and validation reports.</p>
</li>
</ul>
<p>In this guide, we'll focus on <strong>GX Core</strong>.</p>
<h2 id="heading-deep-dive-into-the-great-expectations-workflow"><strong>Deep Dive into the Great Expectations Workflow</strong></h2>
<h3 id="heading-1-environment-setup"><strong>1. Environment Setup</strong></h3>
<p>The first step is to <strong>create a Great Expectations context</strong>. The context tracks your data sources, expectation suites, and batch definitions. It works like a "brain" for your project, keeping all metadata in one place.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> great_expectations <span class="hljs-keyword">as</span> gx
print(gx.__version__)
<span class="hljs-keyword">import</span> os
os.environ[<span class="hljs-string">"GOOGLE_APPLICATION_CREDENTIALS"</span>] = <span class="hljs-string">"&lt;service_account_key_path"</span>
</code></pre>
<p>Since this demo uses Google Cloud for data storage and retrieval, we also need to <a target="_blank" href="https://cloud.google.com/iam/docs/service-account-overview">authenticate GCP access using a service account</a>.</p>
<h3 id="heading-2-define-a-data-source"><strong>2.</strong> Define a Data Source</h3>
<p>After the context is set, you need to <strong>connect your data source</strong>. This can be a local CSV, GCS/S3 file, DataFrame, or even a connection to a <strong>PostgreSQL, BigQuery, or Snowflake database</strong>.</p>
<p>Here’s how you can connect a CSV from a Google Cloud Storage (GCS) bucket:</p>
<pre><code class="lang-python">data_source_name = <span class="hljs-string">"flights_data_source"</span>
bucket_or_name = <span class="hljs-string">"flights-dataset-yt-tutorial"</span>
gcs_options = {}
data_source = context.data_sources.add_pandas_gcs(
    name=<span class="hljs-string">"flights_data_source"</span>, bucket_or_name=<span class="hljs-string">"flights-dataset-yt-tutorial"</span>, gcs_options=gcs_options
)
</code></pre>
<p>The sample data was already uploaded to Google Cloud Storage, and the built-in connector from Great Expectations is used to fetch it.</p>
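<p>For completeness, one way to stage such a file yourself is with the <code>google-cloud-storage</code> client library (a quick sketch; the bucket and file names below simply mirror the ones used in this tutorial):</p>
<pre><code class="lang-python">from google.cloud import storage

# Uses the same service-account credentials configured earlier via
# GOOGLE_APPLICATION_CREDENTIALS.
client = storage.Client()
bucket = client.bucket("flights-dataset-yt-tutorial")

# Upload the local CSV so the pandas GCS data source can read it.
blob = bucket.blob("goibibo_flights_data.csv")
blob.upload_from_filename("goibibo_flights_data.csv")
</code></pre>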
<h3 id="heading-3-define-data-asset"><strong>3.</strong> Define Data Asset</h3>
<p>A <strong>Data Asset</strong> is a specific file, table, or logical grouping of data from the source.</p>
<pre><code class="lang-python">data_source = context.data_sources.add_pandas_gcs(
    name=<span class="hljs-string">"flights_data_source"</span>, 
    bucket_or_name=<span class="hljs-string">"flights-dataset-yt-tutorial"</span>, 
    gcs_options={}
)
</code></pre>
<h3 id="heading-4-define-a-batch-definition"><strong>4.</strong> Define a Batch Definition</h3>
<p>A <strong>Batch Definition</strong> specifies which subset of the data (such as particular rows or partitions) should be used for validation. This is most useful on larger projects, where different batches of rows need to be tested in different ways. For example, datasets coming from different countries might be subject to different validations and would therefore require multiple batches.</p>
<p>In our case, to keep things simple, we use a single batch that covers the whole data asset.</p>
<pre><code class="lang-python">batch_definition_name = <span class="hljs-string">"goibibo_flights_data_whole"</span>
batch_definition_path = <span class="hljs-string">"goibibo_flights_data.csv"</span>
batch_definition = data_asset.add_batch_definition(
    name=batch_definition_name
)
batch = batch_definition.get_batch()
print(batch.head())
</code></pre>
<h3 id="heading-5-define-a-batch-definition"><strong>5.</strong> Define a Batch Definition</h3>
<p>Now that the data is connected with batch defined, you define a list of <strong>data quality checks (expectations)</strong>. This is done via an <strong>Expectation Suite</strong>, a collection of checks.</p>
<pre><code class="lang-python">suite =  context.suites.add(
    gx.ExpectationSuite(name=<span class="hljs-string">"flight_expectation_suite"</span>)
)
expectation1 = gx.expectations.ExpectColumnValuesToNotBeNull(column=<span class="hljs-string">"airline"</span>)
expectation2 = gx.expectations.ExpectColumnDistinctValuesToBeInSet(
    column=<span class="hljs-string">"class"</span>,
    value_set=[<span class="hljs-string">'economy'</span>,<span class="hljs-string">'business'</span>]
)
suite.add_expectation(expectation=expectation1)
suite.add_expectation(expectation=expectation2)
</code></pre>
<p>This defines two expectations:</p>
<ul>
<li><p><strong>expectation1:</strong> Column <code>airline</code> should not be NULL</p>
</li>
<li><p><strong>expectation2</strong>: Column <code>class</code> can only contain two values ['economy','business']</p>
</li>
</ul>
<p>You can create as many expectations as you want, each covering a specific data quality rule.</p>
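<p>For example, a couple of additional checks could look like the following (column names such as <code>price</code> and <code>flight_id</code> are placeholders for whatever your dataset actually contains):</p>
<pre><code class="lang-python"># Range check: fares should fall within a sane interval. mostly=0.99 tolerates
# up to 1% outliers before the expectation is marked as failed.
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="price", min_value=0, max_value=100000, mostly=0.99
    )
)

# Uniqueness check: every row should carry a distinct identifier.
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="flight_id")
)
</code></pre>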
<h3 id="heading-6-validation-and-run"><strong>6. Validation and Run</strong></h3>
<p>Once you’ve defined expectations, you must <strong>run validation</strong> to see if the data passes. Validation runs the expectations on batches of data and returns results (pass/fail) as <strong>shareable documentation</strong>.</p>
<pre><code class="lang-python"><span class="hljs-comment"># define `Validation Definition`: A Validation Definition is a fixed reference that links a Batch of data to an Expectation Suite.</span>

validation_definition = gx.ValidationDefinition(
    data=batch_definition, suite=suite, name=<span class="hljs-string">'flight_batch_definition'</span>
)
validation_definition = context.validation_definitions.add(validation_definition)
validation_results = validation_definition.run()
print(validation_results)
<span class="hljs-comment"># %%</span>
<span class="hljs-comment"># Create a Checkpoint with Actions for multiple validation_definition</span>
validation_definitions = [
    validation_definition <span class="hljs-comment"># can be multiple definitions</span>
]

<span class="hljs-comment"># Create a list of Actions for the Checkpoint to perform</span>
action_list = [
    <span class="hljs-comment"># This Action sends a Slack Notification if an Expectation fails.</span>
    gx.checkpoint.SlackNotificationAction(
        name=<span class="hljs-string">"send_slack_notification_on_failed_expectations"</span>,
        slack_token=<span class="hljs-string">"${validation_notification_slack_webhook}"</span>,
        slack_channel=<span class="hljs-string">"${validation_notification_slack_channel}"</span>,
        notify_on=<span class="hljs-string">"failure"</span>,
        show_failed_expectations=<span class="hljs-literal">True</span>,
    ),
    <span class="hljs-comment"># This Action updates the Data Docs static website with the Validation</span>
    <span class="hljs-comment">#   Results after the Checkpoint is run.</span>
    gx.checkpoint.UpdateDataDocsAction(
        name=<span class="hljs-string">"update_all_data_docs"</span>,
    ),
]

checkpoint = gx.Checkpoint(
    name=<span class="hljs-string">"flight_checkpoint"</span>,
    validation_definitions=validation_definitions,
    actions=action_list,
    result_format={<span class="hljs-string">"result_format"</span>: <span class="hljs-string">"COMPLETE"</span>},
)

context.checkpoints.add(checkpoint)

<span class="hljs-comment"># %%</span>
<span class="hljs-comment"># Run checkpoint</span>
validation_results = checkpoint.run()
print(validation_results)
</code></pre>
<p>The validation result tells you:</p>
<ul>
<li><p>Which expectations passed/failed.</p>
</li>
<li><p>How many records failed.</p>
</li>
<li><p>JSON reports that can be shared with business teams (a short snippet for inspecting these results in code follows this list).</p>
</li>
</ul>
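<p>To act on these results programmatically (for example, failing a pipeline task when a check does not pass), the object returned by <code>validation_definition.run()</code> can be inspected directly. A minimal sketch, assuming the attribute names of the GX 1.x validation result object:</p>
<pre><code class="lang-python">suite_results = validation_definition.run()

# Overall pass/fail for the whole suite.
print(suite_results.success)

# Per-expectation outcomes: each entry pairs the expectation that was run
# with whether it passed on this batch.
for result in suite_results.results:
    print(result.expectation_config, result.success)

# Summary counts, such as evaluated vs. successful expectations.
print(suite_results.statistics)
</code></pre>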
<h2 id="heading-reporting-amp-alerts"><strong>Reporting &amp; Alerts</strong></h2>
<p>One of the most powerful features of Great Expectations is <strong>reporting and alerting</strong>. After a validation run, you can:</p>
<ul>
<li><p><strong>Email the reports</strong> to business users.</p>
</li>
<li><p><strong>Log failures</strong> for debugging (see the sketch after this list).</p>
</li>
<li><p><strong>Trigger alerts via Slack or other services</strong>.</p>
</li>
</ul>
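<p>The logging case is straightforward: the checkpoint result exposes an overall <code>success</code> flag, so a thin wrapper can route failures to your usual logs (a small sketch, not tied to any particular logging setup):</p>
<pre><code class="lang-python">import logging

logger = logging.getLogger("data_quality")

results = checkpoint.run()
if results.success:
    logger.info("All flight data expectations passed")
else:
    logger.error("Flight data quality checks failed; see the Data Docs report for details")
</code></pre>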
<pre><code class="lang-python"><span class="hljs-comment"># Create a list of Actions for the Checkpoint to perform</span>
action_list = [
    <span class="hljs-comment"># This Action sends a Slack Notification if an Expectation fails.</span>
    gx.checkpoint.SlackNotificationAction(
        name=<span class="hljs-string">"send_slack_notification_on_failed_expectations"</span>,
        slack_token=<span class="hljs-string">"${validation_notification_slack_webhook}"</span>,
        slack_channel=<span class="hljs-string">"${validation_notification_slack_channel}"</span>,
        notify_on=<span class="hljs-string">"failure"</span>,
        show_failed_expectations=<span class="hljs-literal">True</span>,
    ),
    <span class="hljs-comment"># This Action updates the Data Docs static website with the Validation</span>
    <span class="hljs-comment">#   Results after the Checkpoint is run.</span>
    gx.checkpoint.UpdateDataDocsAction(
        name=<span class="hljs-string">"update_all_data_docs"</span>,
    ),
]

checkpoint = gx.Checkpoint(
    name=<span class="hljs-string">"flight_checkpoint"</span>,
    validation_definitions=validation_definitions,
    actions=action_list,
    result_format={<span class="hljs-string">"result_format"</span>: <span class="hljs-string">"COMPLETE"</span>},
)

context.checkpoints.add(checkpoint)
</code></pre>
<h2 id="heading-built-in-expectations"><strong>Built-in Expectations</strong></h2>
<p>Great Expectations provides a <strong>library of built-in expectations</strong>. Examples include:</p>
<ul>
<li><p><strong>Null checks</strong>: Ensure no nulls exist in a column.</p>
</li>
<li><p><strong>Range checks</strong>: Check if column values lie within a range.</p>
</li>
<li><p><strong>Data type checks</strong>: Ensure a column is of type integer, float, etc.</p>
</li>
<li><p><strong>Uniqueness checks</strong>: Verify column values are unique.</p>
</li>
</ul>
<p>You can see the full list of built-in expectations on the <a target="_blank" href="https://greatexpectations.io/expectations">official Expectation Gallery</a>.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>If you're a data engineer, <strong>implementing Great Expectations in your ETL pipelines is a game-changer</strong>. The power to automate, validate, and report on data quality is something every data-driven company needs. Great Expectations makes this process automated, reproducible, and shareable.</p>
<ul>
<li><p><strong>Use GX in your ETL pipelines</strong> to validate extracted and transformed data.</p>
</li>
<li><p><strong>Ensure transparency</strong> with quality reports that business teams can understand.</p>
</li>
<li><p><strong>Automate alerts and logging</strong> to notify when data fails checks.</p>
</li>
</ul>
<h1 id="heading-references">References</h1>
<p>Full source code on my <a target="_blank" href="https://github.com/syalanuj/youtube/tree/main/great_expectations_tutorial">GitHub</a></p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/oxOj30rl_xs">https://youtu.be/oxOj30rl_xs</a></div>
]]></content:encoded></item><item><title><![CDATA[Data Engineering vs Data Science]]></title><description><![CDATA[INTRODUCTION
In the tech-driven landscape of today, the dynamic duo of data engineering and data science stands at the forefront of innovation. While both fields play crucial roles in unlocking the potential of data, the limelight often gravitates to...]]></description><link>https://anujsyal.com/data-engineering-vs-data-science</link><guid isPermaLink="true">https://anujsyal.com/data-engineering-vs-data-science</guid><category><![CDATA[Data Science]]></category><category><![CDATA[dataengineering]]></category><category><![CDATA[Career]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Thu, 21 Sep 2023 15:13:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/Rm3nWQiDTzg/upload/faf32d10e649f8db1049ee3990738828.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">INTRODUCTION</h2>
<p>In the tech-driven landscape of today, the dynamic duo of data engineering and data science stands at the forefront of innovation. While both fields play crucial roles in unlocking the potential of data, the limelight often gravitates towards the captivating glamour of data science. However, it is time to look past the hype and recognize the unsung hero that is data engineering. </p>
<p>In this blog, we explore the often-underestimated field of data engineering and how it plays a critical role in helping data scientists gain valuable insights. Let's take a step back and recognize that while data science may get most of the attention, it is data engineering that quietly keeps data innovation moving forward.</p>
<h2 id="heading-what-is-data-engineering">WHAT IS DATA ENGINEERING?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1695307580352/bec0c674-c03b-47ff-b0a0-e05c890b3636.jpeg" alt="Photo by &lt;a href=&quot;https://unsplash.com/@alain_pham?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText&quot;&gt;Alain Pham&lt;/a&gt; on &lt;a href=&quot;https://unsplash.com/photos/P_qvsF7Yodw?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText&quot;&gt;Unsplash&lt;/a&gt;   " class="image--center mx-auto" /></p>
<blockquote>
<p><a target="_blank" href="https://unsplash.com/@alain_pham?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Alain Pham</a><a target="_blank" href="https://unsplash.com/photos/P_qvsF7Yodw?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>In simple terms, Data engineering is the process of making data organized and ready for further analysis. It is critical to the data lifecycle because it ensures that raw data from multiple sources is converted into a usable format that is available for analysis and decision-making. In the world of data-driven decision-making, Data Engineering emerges as the hero, laying the foundation for the vast architecture of insights and analysis. </p>
<p>From designing scalable data pipelines to creating efficient data storage, a data engineer takes on different roles and responsibilities to make behind-the-scenes magic happen, transforming raw information into useful insights. However, for this, they require a diversified skill set that blends technical expertise with problem-solving ability. </p>
<p>Let’s look over the core skills required for data engineers in more detail.</p>
<h3 id="heading-skills-required">SKILLS REQUIRED</h3>
<ol>
<li><p>Programming Languages: Proficiency in programming languages is essential. Python and Java are popular choices for building data pipelines, and Scala is widely used, particularly in big data frameworks such as Apache Spark.</p>
</li>
<li><p>Database Management: An in-depth knowledge of several database systems is required. SQL is essential, as is experience with both relational databases (e.g., MySQL, PostgreSQL) and NoSQL databases (e.g., MongoDB, Cassandra).  </p>
</li>
<li><p>ETL (Extract, Transform, Load) Tools: Data engineers must be adept in the use of ETL technologies and frameworks that enable data transportation and transformation. Commonly used tools include Apache NiFi, Apache Kafka, and Apache Airflow.</p>
</li>
<li><p>Big Data Technologies: Understanding big data technologies like Hadoop and Apache Spark is crucial for efficiently processing and analysing massive datasets. Understanding concepts such as MapReduce is handy.</p>
</li>
<li><p>Cloud Platforms: As more businesses shift their data infrastructure to the cloud, expertise in cloud platforms such as AWS, Azure, and Google Cloud, as well as Databricks, is growing more valuable. Proficiency in setting up and managing cloud-based data solutions, including Data Warehouses, Data Lakes, and the emerging Lakehouse architecture, is a key advantage.</p>
</li>
</ol>
<h2 id="heading-what-is-data-science">WHAT IS DATA SCIENCE?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1695307725318/a4521a2c-150f-4f3f-8530-3d5ad5572cfa.jpeg" alt="Photo by &lt;a href=&quot;https://unsplash.com/@logan_lense?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText&quot;&gt;Logan Moreno Gutierrez&lt;/a&gt; on &lt;a href=&quot;https://unsplash.com/photos/BQ95Oc7Nvvc?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText&quot;&gt;Unsplash&lt;/a&gt;   " class="image--center mx-auto" /></p>
<blockquote>
<p><a target="_blank" href="https://unsplash.com/@logan_lense?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Logan Moreno Gutierrez</a><a target="_blank" href="https://unsplash.com/photos/BQ95Oc7Nvvc?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>Data science is the study of data to extract valuable insights and knowledge, leading to informed decision-making and innovation. It combines elements from statistics, computer science, domain expertise, and machine learning to analyse complicated data sets. If data engineering lays the groundwork, data science breathes life into the information that illuminates the path forward.</p>
<p>Data Scientists, equipped with a combination of analytical skills and domain understanding, are responsible for several tasks. These mainly include building models, exploring data for patterns, and visual storytelling. Data Scientists also tend to have strong statistical knowledge, since statistics underpins the core of their field: analysis.</p>
<p>Let us explore the key skills needed to be a successful data scientist.</p>
<h3 id="heading-skills-required-1">SKILLS REQUIRED</h3>
<ol>
<li><p>Programming Languages: It is necessary to be adept in programming languages such as Python or R along with libraries like NumPy, Pandas for handling datasets. These languages are frequently used for data manipulation, analysis, and building of machine learning models.</p>
</li>
<li><p>Statistical Analysis: An extensive knowledge of statistics is required for planning experiments, evaluating hypotheses, and drawing meaningful conclusions from data. A strong grasp of statistical methods empowers data scientists to make informed decisions and extract valuable information from complex datasets.</p>
</li>
<li><p>Machine Learning: Building predictive and prescriptive models requires an in-depth understanding of various machine learning algorithms and techniques. Familiarity with libraries like scikit-learn, TensorFlow, or PyTorch is valuable for implementing machine learning algorithms.</p>
</li>
<li><p>Data Cleaning and Preprocessing: The ability to clean and preprocess data to remove irregularities and prepare it for analysis is an essential skill.</p>
</li>
<li><p>Data Visualization: Expertise in tools such as Matplotlib, Seaborn, or Tableau is essential for developing informative visualisations that convey complicated discoveries to non-technical stakeholders.</p>
</li>
</ol>
<h2 id="heading-difference">DIFFERENCE</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1695308494553/9bf26a71-0f88-4bdd-b999-65ad437135f1.jpeg" alt="Photo by &lt;a href=&quot;https://unsplash.com/@gregjeanneau?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText&quot;&gt;Gregoire Jeanneau&lt;/a&gt; on &lt;a href=&quot;https://unsplash.com/photos/0StwxZ4NigE?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText&quot;&gt;Unsplash&lt;/a&gt;   " class="image--center mx-auto" /></p>
<blockquote>
<p><a target="_blank" href="https://unsplash.com/@gregjeanneau?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Gregoire Jeanneau</a><a target="_blank" href="https://unsplash.com/photos/0StwxZ4NigE?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>As the digital environment evolves, the roles of Data Engineering and Data Science emerge as essential pillars, each with unique abilities. Let's dive into their differences and unravel their unique features.</p>
<h3 id="heading-field-focus">FIELD FOCUS</h3>
<p>Data Engineering has a broad scope that centres around the development and maintenance of data-related systems, pipelines, and overall infrastructure. Its primary goal is to efficiently collect, store, process, and effectively transform raw data into a thoroughly structured state suitable for analysis. </p>
<p>Data Science is largely concerned with extracting valuable insights from data through the use of a variety of sophisticated approaches. Statistical modelling is used to find patterns, machine learning techniques are used to forecast outcomes, and predictive analytics is used to obtain insight into future trends.</p>
<h3 id="heading-coding-practices">CODING PRACTICES</h3>
<p>Data Engineering places a strong emphasis on good coding practices, maintainable code, and production-ready solutions. Engineers are proficient in writing clean, modular, and efficient code using programming languages like Python, Java, or Scala. They specialise in creating robust pipelines and systems that can manage enormous amounts of data and are optimized for performance.</p>
<p>While Data Science professionals do write code, the emphasis is mostly on experimentation and analysis instead of production-level code. Coding is usually done in Jupyter notebooks, which are good for exploration and visualisation but could fail to adhere to the same level of software engineering standards as DE.</p>
<h3 id="heading-cloud-knowledge">CLOUD KNOWLEDGE</h3>
<p>DE professionals often possess in-depth knowledge of cloud platforms and their various services. They design and execute scalable data processing architectures that make use of the advantages of cloud infrastructure to manage and process massive datasets efficiently. Here, scalability is an important aspect, and knowledge of cloud services and distributed computing is crucial.</p>
<p>DS practitioners may not necessarily have a thorough understanding of cloud services, especially since their primary focus is data analysis. They may interact with cloud resources via more user-friendly interfaces or APIs, but they are not as skilled as data engineering specialists at optimising for scalability and cost-effectiveness. </p>
<h3 id="heading-productionalization">PRODUCTIONALIZATION</h3>
<p>DE specialists are responsible for taking data pipelines from development to production. They are well-versed in deploying complex data processing systems into production environments while taking security, scalability, and maintainability into account. They have a better understanding of the issues that might surface during productionization and are trained to deal with them.</p>
<p>DS professionals often prioritize data exploration, model development, and research. While working on models and analysis, they may not necessarily be as involved with the operational side of production systems. Ensuring that models work reliably and efficiently in real-world scenarios involves additional considerations, such as error handling and monitoring.</p>
<h3 id="heading-domain-knowledge">DOMAIN KNOWLEDGE</h3>
<p>DE specialists mostly deal with the design and maintenance of data pipelines, infrastructure, and data management systems. They might not always possess deep domain-specific knowledge or advanced statistical expertise since their primary objective is to ensure efficient data processing.</p>
<p>Data Scientists’ specialization in data analysis enables them to thoroughly grasp the underlying data and make informed conclusions. Their statistical grasping enables them to identify insights that Data Engineering specialists may not be focusing on.</p>
<h3 id="heading-tools-and-technologies">TOOLS AND TECHNOLOGIES</h3>
<p>DE professionals frequently use standard software development tools and practices. They employ tools such as Apache Spark, Apache Flink, Hadoop, and different ETL (Extract, Transform, Load) frameworks. They develop code in IDEs and use version control systems like Git to create modular and reusable components.</p>
<p>Data scientists use tools like Python's data science libraries (NumPy, pandas, scikit-learn), R, and specialized tools like TensorFlow and PyTorch for machine learning and deep learning tasks. Jupyter Notebooks are frequently used for exploratory data analysis and model experimentation, which is great for quick iterations and research but may skip some of the best practices that DE tools and workflows provide.</p>
<h2 id="heading-collaboration">COLLABORATION</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1695308910219/4861963f-b97b-4e58-b9f2-b998f64dc48f.jpeg" alt="Photo by &lt;a href=&quot;https://unsplash.com/@chrisliverani?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText&quot;&gt;Chris Liverani&lt;/a&gt; on &lt;a href=&quot;https://unsplash.com/photos/9cd8qOgeNIY?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText&quot;&gt;Unsplash&lt;/a&gt;   " class="image--center mx-auto" /></p>
<blockquote>
<p><a target="_blank" href="https://unsplash.com/@chrisliverani?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Chris Liverani</a><a target="_blank" href="https://unsplash.com/photos/9cd8qOgeNIY?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>Data engineers work together with software engineers, data analysts, and data scientists to navigate the complex environment of creating and maintaining robust data infrastructure. Their joint efforts provide smooth data flow, storage, and dependable processing, which supports the entire data ecosystem.</p>
<p>Data scientists often work closely with business stakeholders, domain experts, and decision-makers to translate data insights into actionable strategies. They play a crucial role in bridging the gap between technical insights and business outcomes. </p>
<h3 id="heading-collaborative-significance">COLLABORATIVE SIGNIFICANCE</h3>
<p>Collaboration between data engineers and data scientists is of the utmost importance for the smooth completion of analytics projects. Data engineers lay the framework for data scientists by building dependable infrastructure capable of storing and retrieving enormous amounts of data in multiple formats. This collaboration is critical because it allows for transparent communication across teams throughout the project journey - from understanding business requirements to applying machine learning models in real-world contexts. This collaborative cooperation ensures that data is effectively packaged and distributed for analysis, allowing data scientists to gain useful insights and make educated decisions.</p>
<h3 id="heading-benefits-of-collaboration">BENEFITS OF COLLABORATION</h3>
<p>Collaboration provides numerous benefits that go beyond the sum of individual efforts. Here are some significant benefits:</p>
<ol>
<li><p>Cross-Disciplinary Insights: Data engineers contribute technical knowledge, and data scientists bring analytical skills. Collaboration leads to the integration of technical expertise and analytical thinking, resulting in more comprehensive insights and new solutions.</p>
</li>
<li><p>Efficient Data Preparation: Data scientists rely on well-prepared, cleansed, and processed data for their research. Collaboration with data engineers ensures that data scientists obtain data in a format optimised for their analysis, reducing time spent on data preparation and reducing errors.</p>
</li>
<li><p>Seamless Model Deployment: Collaboration ensures that data engineers create tools that allow data science models to be deployed smoothly into production environments. This makes it easier to use predictive models in real-world circumstances, resulting in demonstrable economic benefit.  </p>
</li>
<li><p>Real-time Insights: Data engineers can collaborate to construct real-time data pipelines that provide data scientists with up-to-date information. This is especially useful for time-critical analysis and decision-making.</p>
</li>
</ol>
<p>In essence, collaboration between data engineers and data scientists is a bridge that unites technical foundations with analytical research, resulting in a coordinated, productive, and impactful data-driven environment.</p>
<h2 id="heading-conclusion">CONCLUSION</h2>
<p>In conclusion, the dynamic landscape of data-related roles is undergoing a significant shift, with Data Engineering emerging as the frontrunner while Data Science experiences a gradual decline. This change can be explained by Data Engineering’s growing importance in managing and improving data infrastructure to get useful insights. As organizations face the challenges of managing large volumes of data, the need for qualified Data Engineers continues to rise. They are beginning to understand that without a robust Data Engineering framework, the full impact of Data Science will not be achieved. In this digital age, the businesses that focus on Data Engineering are the ones that can unlock the real potential of their data and drive innovation leading to sustainable growth.</p>
]]></content:encoded></item><item><title><![CDATA[Top Data Certifications for a Successful 2024]]></title><description><![CDATA[In the fast-paced realm of data engineering, staying ahead of the curve with cutting-edge certifications is your passport to unlocking a world of exhilarating career prospects. As we brace for the challenges and opportunities of 2024, the demand for ...]]></description><link>https://anujsyal.com/top-data-certifications-for-a-successful-2024</link><guid isPermaLink="true">https://anujsyal.com/top-data-certifications-for-a-successful-2024</guid><category><![CDATA[Certification]]></category><category><![CDATA[AWS]]></category><category><![CDATA[Databricks]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[Microsoft]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Thu, 27 Jul 2023 15:36:29 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1690472030085/5dc35318-cafc-4bc0-ac73-880676c0508f.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the fast-paced realm of data engineering, staying ahead of the curve with cutting-edge certifications is your passport to unlocking a world of exhilarating career prospects. As we brace for the challenges and opportunities of 2024, the demand for skilled data engineers continues to soar, presenting an ideal moment to seize the best data engineering certifications available. Welcome to our blog, where we'll be your guiding light, illuminating the path for aspiring data engineers like you, showcasing the must-have certifications for the upcoming year.</p>
<p>Embark with me on this exhilarating journey, as we unravel the advantages of each certification, unveil the extraordinary job opportunities they open, and analyze the red-hot demand they command in the ever-evolving market. If you're a data engineer seeking to propel your career to unprecedented heights, brace yourself for what lies ahead in 2024 – a world of boundless opportunities and endless success!</p>
<h2 id="heading-aws-certified-big-data-specialityhttpsawsamazoncomcertificationcertified-big-data-specialty"><a target="_blank" href="https://aws.amazon.com/certification/certified-big-data-specialty/">AWS Certified Big Data Speciality</a></h2>
<p>Amazon Web Services (AWS) offers the AWS Certified Big Data Speciality certification, which validates data engineers' expertise in creating and implementing big data solutions using AWS services. The exam places immense value on hands-on expertise, ensuring that certified professionals have the required practical abilities. AWS certification additionally provides opportunities for networking with a community whose skills transfer across cloud platforms, broadening career choices beyond AWS-specific projects.</p>
<p>Let's delve into the essential features of this certification:</p>
<ol>
<li><p>AWS Big Data Services: This certification covers a wide range of AWS big data services, including Amazon S3, Amazon EMR, Amazon Redshift, Amazon Athena, AWS Glue, AWS Lambda, Amazon Kinesis, and more. It provides in-depth knowledge of these services and how they can be used for various big data scenarios.</p>
</li>
<li><p>Data Streaming and Real-time Analytics: The certification covers Amazon Kinesis, a service for ingesting, processing, and analyzing real-time streaming data. You'll learn how to capture and process data from various sources, perform real-time analytics, and gain insights from streaming data using Kinesis Data Streams, Kinesis Data Firehose, and Kinesis Data Analytics.</p>
</li>
<li><p>Data Warehousing and Business Intelligence: The certification delves into Amazon Redshift, AWS's data warehousing service. You'll gain a deep understanding of how to design and optimize Redshift clusters for data warehousing and learn techniques for building effective business intelligence solutions on top of Redshift.</p>
<p> <strong>DEMAND &amp; OPPORTUNITIES</strong></p>
<p> With AWS being one of the leading cloud service providers, organizations across industries seek professionals skilled in AWS Big Data services like Amazon EMR, Redshift, and Athena. AWS Certified Big Data opens up various job opportunities for data engineers, such as Big Data Engineer, Data Architect, Data Analyst, or Cloud Data Engineer. Having this certification sets you apart from the competition, providing you an advantage in the job market and raising your profile with future employers.</p>
<p> AWS certifications, including AWS Certified Big Data, have a global presence and are highly regarded in many countries. Globally and across industries, businesses are moving their data infrastructure to the cloud, requiring the demand for skilled data engineers familiar with AWS services. AWS certifications are highly sought after in the United States, particularly in technology hubs like Silicon Valley and major cities with a strong tech industry presence. Along with that, AWS certifications are also valued in countries like India, Australia, and Singapore, where there is substantial cloud adoption and a growing tech ecosystem.</p>
</li>
</ol>
<h2 id="heading-microsoft-certified-azure-data-engineer-associatehttpslearnmicrosoftcomen-uscertificationsazure-data-engineer"><a target="_blank" href="https://learn.microsoft.com/en-us/certifications/azure-data-engineer/">Microsoft Certified: Azure Data Engineer Associate</a></h2>
<p>The Microsoft Certified: Azure Data Engineer Associate certification provides comprehensive coverage of various Azure data services and tools, enabling data professionals to leverage the full potential of Azure's data capabilities. With this certification, data engineers gain deep knowledge of these services and learn how to design, implement, and manage data solutions on Azure. To earn this certification, candidates need to pass the DP-203 (Data Engineering on Microsoft Azure) exam, which replaced the earlier DP-200 and DP-201 exams. It covers a range of topics, including data storage, data processing, data integration, data security, and monitoring and optimization of Azure data solutions.</p>
<p>Here are a few key features that make this certification stand out:</p>
<ol>
<li><p>Azure Data Services Knowledge: This certification focuses on Azure data services and tools, equipping you with comprehensive knowledge of Azure's data offerings. It covers various services such as Azure Data Factory, Azure Databricks, Azure SQL Database, Azure Synapse Analytics, and Azure Cosmos DB.</p>
</li>
<li><p>Data Engineering Concepts: The certification delves into key data engineering concepts, including data ingestion, data transformation, data storage, data integration, data orchestration, data security, and data governance.</p>
</li>
<li><p>Collaboration and DevOps: The certification places a strong emphasis on collaboration and DevOps techniques in projects involving data engineering. You'll discover how to work well with cross-functional teams, apply DevOps principles to data pipelines, automate the procedures involved in data engineering, and put continuous integration and deployment into practice.</p>
</li>
</ol>
<p><strong>DEMAND &amp; OPPORTUNITIES</strong></p>
<p>There is a rising need for experts with experience in Azure data engineering as firms use Azure increasingly for their data needs. Being a Microsoft certification, it carries substantial industry recognition and credibility. This enhances certified data engineers' exposure to potential employers and improves their chances of landing data engineering positions involving projects and implementations related to Azure. Holding the Microsoft Certified: Azure Data Engineer Associate certification can lead to job roles like Azure Data Engineer, Data Architect, Data Integration Engineer, or Data Platform Engineer.</p>
<p>Azure certifications, including the Azure Data Engineer Associate, are in demand throughout the United States, particularly in industries like finance, healthcare, and technology. Azure certifications also have a strong demand in European countries, including the United Kingdom, Germany, and the Netherlands, where Azure is widely used. As Azure continues to expand its footprint worldwide and gain market share, the demand for professionals skilled in Azure data engineering is expected to grow in other countries as well like India, China and Japan. Due to the availability of Azure on a global scale and Microsoft's large market presence, the demand for people with the Microsoft Certified: Azure Data Engineer Associate certification can be seen in many nations across the world.</p>
<h2 id="heading-google-cloud-certified-professional-data-engineerhttpscloudgooglecomlearncertificationdata-engineer"><a target="_blank" href="https://cloud.google.com/learn/certification/data-engineer">Google Cloud Certified - Professional Data Engineer</a></h2>
<p>The Google Cloud Certified - Professional Data Engineer certification is designed to validate the skills and knowledge of data engineers in designing and building data processing systems and solutions on the Google Cloud Platform (GCP). The Professional Data Engineer certification offers comprehensive coverage of Google Cloud Platform's data services, including Google BigQuery, Google Cloud Storage, Google Cloud Dataflow, Google Cloud Pub/Sub, and more. By obtaining this certification, data engineers gain a deep understanding of these services and learn how to architect scalable, reliable, and secure data solutions on GCP.</p>
<p>Discover the noteworthy features that define this certification:</p>
<ol>
<li><p>Comprehensive GCP Data Engineering Knowledge: This certification covers a wide range of topics related to data engineering on the Google Cloud Platform. It encompasses data ingestion techniques, data transformation methods, data storage and processing solutions, data analysis and visualization tools, and machine learning integration for data engineering projects.</p>
</li>
<li><p>Advanced Data Engineering Concepts: In-depth advanced data engineering principles are covered in the certification, including designing data pipelines, building scalable data structures, optimizing data storage and retrieval, establishing data security and compliance standards in place, and incorporating data governance and quality procedures.</p>
</li>
<li><p>Hands-on Experience: The certification emphasizes practical experience with GCP data engineering tools and services. It assesses your ability to architect, build, and optimize data processing systems using GCP services like BigQuery, Dataflow, Dataproc, Pub/Sub, Cloud Storage, and Data Studio.</p>
</li>
</ol>
<p><strong>DEMAND &amp; OPPORTUNITIES</strong></p>
<p>As more organizations embrace cloud-based data solutions, the demand for professionals skilled in GCP data engineering is rapidly growing. Earning this certification demonstrates your proficiency in designing, building, and optimizing data solutions on GCP. This certification can create job opportunities as Data Engineer, Data Architect, Cloud Data Engineer, or Data Analyst.</p>
<p>Google Cloud certifications have been gaining traction globally as more organizations adopt Google Cloud Platform (GCP) for their data engineering needs. The demand for Google Cloud certifications, including the Professional Data Engineer, is prominent in the United States, especially in technology-driven regions like California. GCP certifications, including the Professional Data Engineer, also have a growing demand in countries like India, Singapore, and Australia, as Google Cloud expands its presence in the region.</p>
<h2 id="heading-databricks-certified-associate-developer-for-apache-sparkhttpswwwdatabrickscomlearncertificationapache-spark-developer-associate"><a target="_blank" href="https://www.databricks.com/learn/certification/apache-spark-developer-associate">Databricks Certified Associate Developer for Apache Spark</a></h2>
<p>Apache Spark is a distributed computing framework designed for big data processing and analytics. The Databricks Certified Associate Developer certification focuses on Spark and covers various aspects of its architecture, core components, and programming concepts. By obtaining this certification, data professionals gain a comprehensive understanding of Spark's capabilities and learn how to utilize its full potential to solve complex data problems.</p>
<p>Discover the noteworthy features that define this certification:</p>
<ol>
<li><p>Apache Spark Fundamentals: This certification covers the fundamental concepts of Apache Spark, including RDDs (Resilient Distributed Datasets), transformations, actions, Spark SQL, Spark Streaming, and MLlib. It provides a solid foundation in understanding Spark's core components and functionalities.</p>
</li>
<li><p>Hands-on Spark Development: The certification focuses on hands-on experience with Spark development. It includes exercises and projects that require you to write Spark applications using Scala, Python, or SQL. You'll learn how to work with Spark clusters, write efficient Spark code, and optimize Spark jobs.</p>
</li>
<li><p>Machine Learning with MLlib: The certification covers Spark's MLlib library, which provides a rich set of machine learning algorithms and tools. You'll gain expertise in using MLlib to train and evaluate machine learning models, perform feature engineering, and make predictions or recommendations using Spark.</p>
</li>
</ol>
<p><strong>DEMAND &amp; OPPORTUNITIES</strong></p>
<p>Apache Spark is widely adopted in industries that deal with large-scale data processing and analytics, creating a strong demand for professionals with Spark expertise. Earning this certification demonstrates your proficiency in Spark development and validates your ability to work with Spark clusters, design efficient data processing workflows, and apply Spark for machine learning tasks. Professionals holding this certification can pursue roles like Spark Developer, Data Engineer, Big Data Engineer, or Data Analyst.</p>
<p>As more organizations across different countries adopt Spark for their big data processing and analytics needs, the demand for professionals skilled in Spark development is expected to grow. Databricks certifications, including the Associate Developer for Apache Spark, have gained recognition among data engineering and data science professionals. Spark has a strong presence in countries such as the United States, United Kingdom, Canada, Australia, Germany, India, and many others.</p>
<h2 id="heading-databricks-certified-machine-learning-associatehttpswwwdatabrickscomlearncertificationmachine-learning-associate"><a target="_blank" href="https://www.databricks.com/learn/certification/machine-learning-associate">Databricks Certified Machine Learning Associate</a></h2>
<p>Databricks is a unified data analytics platform that brings together data engineering, data science, and business analytics in one collaborative environment. This platform provides a seamless experience for data scientists to develop, test, and deploy machine learning models at scale. The Databricks Certified Machine Learning Associate certification equips data scientists with the expertise to leverage the platform's capabilities and harness the power of machine learning. The Databricks Certified Machine Learning Associate certification is a valuable credential for individuals who want to showcase their expertise in applying machine learning techniques using the Databricks Unified Analytics Platform.</p>
<p>Take a closer look at the significant features inherent to this certification:</p>
<ol>
<li><p>Machine Learning Concepts: Essential machine learning ideas including supervised learning, unsupervised learning, and deep learning are covered in the certification. A strong foundation in machine learning principles is provided by its exploration of algorithms, model evaluation, and feature engineering techniques.</p>
</li>
<li><p>Databricks Platform: Candidates gain expertise in using Databricks notebooks, Databricks Runtime, and Databricks MLflow for building, training, and deploying machine learning models.</p>
</li>
<li><p>Integration with Big Data Technologies: The certification covers the integration of machine learning with big data technologies. Candidates learn how to work with large datasets stored in distributed file systems like Hadoop Distributed File System (HDFS) or cloud-based storage systems.</p>
</li>
</ol>
<p><strong>DEMAND &amp; OPPORTUNITIES</strong></p>
<p>Earning this certification demonstrates your proficiency in machine learning on the Databricks platform and certifies your ability to use Databricks tools for model development, training, and deployment. The Databricks Certified Machine Learning Associate certification opens up job opportunities as Machine Learning Engineer, Data Scientist, AI Engineer, or ML Platform Engineer. The certification offers industry recognition, specialized Databricks skills, expanded career opportunities, and a competitive advantage in the job market.</p>
<p>The demand for Databricks certifications, including the Machine Learning Associate, is driven by the adoption of Databricks as a unified analytics platform. Databricks certifications have gained popularity in the United States, as organizations leverage Databricks for machine learning initiatives. It is also sought after in European countries where Databricks is used for data engineering and machine learning tasks, including the United Kingdom, Germany, and the Nordics.</p>
<h2 id="heading-conclusion">CONCLUSION</h2>
<p>As we brace for the boundless opportunities of 2024, the field of data engineering promises immense potential for career growth. To thrive in this dynamic industry, aspiring data engineers must set their sights on certifications that elevate their skills and knowledge.</p>
<p>Among the top certifications to consider, the AWS Certified Big Data shines, equipping you with expertise in creating and implementing AWS-driven big data solutions. The Microsoft Certified: Azure Data Engineer Associate certification validates your proficiency in data engineering on the Azure platform. Meanwhile, the Google Cloud Certified - Professional Data Engineer addresses the surging demand for GCP data engineering prowess.</p>
<p>For those seeking to conquer Spark development and big data processing, the Databricks Certified Associate Developer for Apache Spark emerges as a compelling choice. And don't overlook the Databricks Certified Machine Learning Associate, showcasing your mastery in applying machine learning techniques through the Databricks platform.</p>
<p>As data engineering continues to evolve and drive data-driven insights, these certifications hold the power to accelerate your career, broaden your skill set, and position you as a highly sought-after data engineering professional. Embrace the opportunity that awaits, embark on your certification journey, and unlock a world of endless possibilities in the exhilarating realm of data engineering throughout 2024!</p>
]]></content:encoded></item><item><title><![CDATA[Top 5 New Data Engineering Technologies to Learn in 2023]]></title><description><![CDATA[In today's fast-paced digital world, keeping up with the latest advancements in data engineering is crucial to stay ahead of the competition. With the amount of data collected every day increasing, data engineering plays an important role in guarante...]]></description><link>https://anujsyal.com/top-5-new-data-engineering-technologies-to-learn-in-2023</link><guid isPermaLink="true">https://anujsyal.com/top-5-new-data-engineering-technologies-to-learn-in-2023</guid><category><![CDATA[chatgpt]]></category><category><![CDATA[apache]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[Delta Lake]]></category><category><![CDATA[Apache Superset]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Wed, 17 May 2023 05:44:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/HgoKvtKpyHA/upload/724711bdb4d3fa9e2c97918e2ed11d4f.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today's fast-paced digital world, keeping up with the latest advancements in data engineering is crucial to stay ahead of the competition. With the amount of data collected every day increasing, data engineering plays an important role in guaranteeing data accuracy, consistency, and reliability for enterprises.</p>
<p>In this blog, we will be discussing the top 5 new data engineering technologies that you should learn in 2023 to stay ahead of the curve. Each of the technologies we will be looking at brings a unique set of capabilities and benefits to the table that can help businesses improve their data engineering processes and make better data-driven decisions. So, let’s dive in and learn!</p>
<h2 id="heading-apache-superset">APACHE SUPERSET</h2>
<p><a target="_blank" href="https://superset.apache.org">Apache Superset</a> is a modern, open-source data visualisation and exploration platform that allows businesses to analyse and visualise data from multiple sources in real-time. Apache Superset was initially launched in 2016 by Airbnb as an internal tool but was later <a target="_blank" href="https://news.apache.org/foundation/entry/the-apache-software-foundation-announces70">open-sourced in 2017</a> and has since become a popular choice for businesses and organisations. Apache Superset is designed to be extremely scalable and capable of managing massive amounts of data without sacrificing performance.</p>
<p>The <strong>most notable feature</strong> of Apache Superset is its ability to connect to a wide range of data sources, including SQL-based databases, Druid, Hadoop, and cloud-based data warehouses such as Amazon Redshift and Google BigQuery. As a result, it is a very adaptable tool that can easily be integrated into existing data infrastructures.</p>
<p>Let’s explore some of the features of Apache Superset:</p>
<ol>
<li><p><strong>Data Visualisation:</strong> Provides various visualisation options, such as line charts, scatter plots, pivot tables, heat maps, and more. Users can customise these visualisations to suit their branding and style.</p>
</li>
<li><p><strong>Advanced Analytics:</strong> In addition to data visualisation, Apache Superset also offers advanced analytics features, including predictive analytics and machine learning capabilities. This enables firms to acquire insights into their data and make well-informed decisions based on real-time data analysis.</p>
</li>
<li><p><strong>Dashboard Sharing:</strong> Makes it easy for users to share their dashboards with others. Users can share dashboards via a URL or embed them in other applications using an iframe.</p>
</li>
<li><p><strong>Query Building</strong>: <a target="_blank" href="https://apache-superset.readthedocs.io/en/0.28.1/sqllab.html">Query builder interface</a> enables users to create complex queries using a drag-and-drop interface. Users can also write SQL queries directly if they prefer.</p>
</li>
</ol>
<p>Overall, Superset is anticipated to gain more popularity in 2023 as companies seek open-source substitutes for proprietary data visualisation software. If you're keen on data visualisation and reporting, Superset is an excellent tool to acquire knowledge.</p>
<h2 id="heading-apache-iceberg">APACHE ICEBERG</h2>
<p><a target="_blank" href="https://iceberg.apache.org">Apache Iceberg</a> is an open-source data storage and query processing platform that was developed to provide a modern, scalable, and efficient way of managing large datasets. It is made to accommodate a variety of workloads, such as batch and interactive processing, machine learning, and ad-hoc queries. Apache Iceberg was created by the team at Netflix and was open-sourced in 2018.</p>
<p>One of the <strong>most significant features</strong> of Apache Iceberg is its support for schema evolution. As datasets grow and change over time, it's crucial to be able to add or remove columns from a table without interfering with already-running applications or queries. Apache Iceberg allows users to add or remove columns in a table without having to rewrite the entire dataset. This makes it easy to evolve and maintain data models as business needs change.</p>
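<p>To make this concrete, here is a minimal sketch of schema evolution using PySpark SQL. It is illustrative only: it assumes a Spark session that has already been configured with an Iceberg catalog, and the catalog, database, and table names are made up.</p>
<pre><code class="lang-python"># Minimal sketch, assuming spark is a SparkSession configured with an Iceberg
# catalog named "demo" (catalog/table names below are purely illustrative).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-schema-evolution").getOrCreate()

# Create an Iceberg table, then evolve its schema.
# Iceberg tracks schema changes in metadata, so existing data files are not rewritten.
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
spark.sql("ALTER TABLE demo.db.events ADD COLUMN country STRING")
spark.sql("ALTER TABLE demo.db.events DROP COLUMN country")
</code></pre>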
<p>Let’s look over the benefits provided by Apache Iceberg:</p>
<ol>
<li><p><strong>Efficient Query Processing</strong>: Uses a columnar format that reduces the amount of data that needs to be read from disk, which improves query performance. It also supports predicate pushdown and other optimizations that further improve query performance.</p>
</li>
<li><p><strong>Data Consistency:</strong> Combination of versioning and snapshot isolation ensures that readers and writers never interfere with each other. Data is always in a consistent state, even during updates or when multiple users are accessing the same data simultaneously.</p>
</li>
<li><p><strong>Easy Integration:</strong> Designed to be easy to integrate with existing data processing frameworks, such as Apache Spark, Apache Hive, and Presto. It provides connectors for these frameworks, which makes it easy to start using Iceberg with minimal changes to existing code.</p>
</li>
<li><p><strong>Scalability:</strong> It supports partitioning and clustering, which allows users to organise their data into smaller, more manageable chunks. This makes it easier to distribute and process large datasets across multiple nodes in a cluster.</p>
</li>
<li><p><strong>Data Management:</strong> Provides a modern, efficient, and scalable way of managing large datasets. It makes it easy to store, organise, and query data, which can improve data quality and increase business agility.</p>
</li>
</ol>
<p>Hence, Apache Iceberg should be learnt for its ability to handle large datasets efficiently and its support for schema evolution, which are critical for modern data management scenarios. It is also a popular technology used by many organisations, making it a valuable skill to have.</p>
<h2 id="heading-great-expectations">GREAT EXPECTATIONS</h2>
<p><a target="_blank" href="https://greatexpectations.io">Great Expectations</a> is an open-source Python library that provides a set of tools for testing and validating data pipelines. First launched in October 2019 as an open-source project on GitHub, it enables users to specify "expectations" for their data - assertions or limitations on how their pipelines should behave. These expectations can be simple rules, like checking for missing values or checking that a column contains only certain values, or more complex constraints, like ensuring that the correlation between two columns falls within a certain range. Additionally, the library offers a number of tools for visualising and documenting data pipelines, making it simple to comprehend and troubleshoot complex data workflows.</p>
<p>Several key features make Great Expectations a valuable tool for data engineers:</p>
<ol>
<li><p><strong>Expectation Library:</strong> Provides a comprehensive library of pre-defined expectations for common data quality checks. Users can also define their own custom expectations to meet specific requirements.</p>
</li>
<li><p><strong>Data Documentation:</strong> Makes it easier to document and understand the data used in pipelines, providing data dictionaries that capture metadata, such as column descriptions, data sources, and data owners. This allows teams to collaborate and understand the data being used in their pipelines.</p>
</li>
<li><p><strong>Data Validation:</strong> Offers a range of validation tools, such as data profiling, schema validation, and batch validation, which help users catch issues and errors in their pipelines before they cause downstream problems.</p>
</li>
<li><p><strong>Extensibility:</strong> Easy integration with a wide range of data processing and analysis tools, such as Apache Spark, Pandas, and SQL databases. This allows users to use Great Expectations with their existing data stack and workflows.</p>
</li>
<li><p><strong>Automation:</strong> Provides a suite of tools for automating the testing and validation of data pipelines, including integration with workflow management tools, such as Apache Airflow and Prefect. This enables users to automate the monitoring and validation of their pipelines to ensure data quality and reliability over time.</p>
</li>
</ol>
<p>Data engineers should learn Great Expectations in 2023 because it offers a comprehensive suite of data validation, documentation, and automation tools. As data quality becomes increasingly important, Great Expectations provides a reliable solution for ensuring data integrity. Furthermore, its integration with popular data processing tools makes it a valuable addition to any data engineer's toolkit.</p>
<h2 id="heading-delta-lake">DELTA LAKE</h2>
<p><a target="_blank" href="https://delta.io">Delta Lake</a> is an open-source storage layer that is designed to improve the reliability, scalability, and performance of data lakes. It was initially released in 2019 by Databricks and has since gained popularity among data teams and has become an important tool for managing and maintaining data lakes. Data dependability is provided by Delta Lake, which is built on top of Apache Spark, using a transactional layer to make sure that all data updates are atomic and consistent.</p>
<p>Delta Lake has several features to offer that make it a valuable tool for data teams:</p>
<ol>
<li><p><strong>ACID Transactions:</strong> Delta Lake uses atomic, consistent, isolated, and durable (ACID) transactions to ensure data reliability. This means that data changes are atomic and consistent, and can be rolled back in the event of a failure.</p>
</li>
<li><p><strong>Schema Enforcement:</strong> Supports schema enforcement, which ensures that all data stored in the data lake conforms to a predefined schema. This helps to improve data quality and reduces the risk of errors and inconsistencies in the data.</p>
</li>
<li><p><strong>Data Versioning:</strong> Supports data versioning, allowing users to track changes to their data over time. This helps to ensure data lineage and enables teams to audit and understand changes to their data over time.</p>
</li>
<li><p><strong>Performance:</strong> Delta Lake is designed for performance and can support petabyte-scale data lakes. It also includes optimizations such as indexing and caching to improve query performance.</p>
</li>
<li><p><strong>Open Source:</strong> Delta Lake is an open-source project, meaning that it can be used and contributed to by the wider community. This helps to drive innovation and ensures that Delta Lake remains a flexible and evolving solution.</p>
</li>
</ol>
<p>Since its debut, Delta Lake has grown significantly in popularity, and by 2023 data engineers are expected to be familiar with this tool. With more businesses switching to cloud-based solutions for their data infrastructure, Delta Lake is becoming an increasingly important tool for data teams owing to its support for cloud storage services and its capacity to handle difficult data management problems. Furthermore, as more businesses seek to leverage the power of big data and advanced analytics to drive informed decision-making, the need for reliable and scalable data management solutions like Delta Lake will only continue to grow.</p>
<h2 id="heading-chatgpt">ChatGPT</h2>
<p><a target="_blank" href="https://en.wikipedia.org/wiki/ChatGPT">ChatGPT</a> is a large language model developed by OpenAI and released in June 2020, It is based on the GPT-3.5 architecture and designed to generate human-like responses to natural language queries and conversations. The model is capable of understanding and generating responses in multiple languages, and it can be fine-tuned on specific domains or tasks to improve its performance. ChatGPT's ability to perform multiple tasks such as text classification, sentiment analysis, and language translation can help data engineers to gain insights from unstructured data.</p>
<p>One of <strong>ChatGPT's key strengths</strong> is its capacity to generate open-ended responses to inquiries and conversations, enabling users to have impromptu conversations with the model. ChatGPT is trained on a massive corpus of text data, which allows it to generate responses that are contextually relevant and grammatically correct.</p>
<p>Some valuable features of ChatGPT that make it an all-rounder are:</p>
<ol>
<li><p><strong>Contextual understanding:</strong> ChatGPT can understand the context of a conversation and generate responses that are relevant to the topic being discussed.</p>
</li>
<li><p><strong>Machine learning:</strong> Based on deep learning algorithms that enable it to learn and improve over time based on the data it processes.</p>
</li>
<li><p><strong>Customization:</strong> ChatGPT can be fine-tuned on specific domains or tasks to improve its accuracy and effectiveness.</p>
</li>
<li><p><strong>Content Creation:</strong> Used to generate content for websites, blogs, and social media posts. This can save content creators time and effort while ensuring that the content generated is high-quality and engaging.</p>
</li>
<li><p><strong>Language translation:</strong> The ability to understand and generate responses in multiple languages makes it a valuable tool for language translation services.</p>
</li>
</ol>
<p>ChatGPT is an AI-powered chatbot that can help data engineers and other professionals automate repetitive tasks, streamline workflows, and improve productivity. As AI and natural language processing continue to advance, ChatGPT is poised to become an increasingly valuable tool for data engineering teams in 2023 and beyond. Learning how to use ChatGPT can help data engineers stay ahead of the curve and enhance their data engineering capabilities.</p>
<h2 id="heading-conclusion">CONCLUSION</h2>
<p>In conclusion, data engineering is an ever-evolving field, and staying up-to-date with the latest technologies and tools is crucial to gain a competitive edge in the industry. From Apache Superset, which provides powerful data visualisation capabilities, to Apache Iceberg, which offers easy and efficient table evolution, these technologies can help data engineers work more efficiently and effectively. Great Expectations can ensure data quality and maintain data integrity, while Delta Lake provides a reliable and efficient way to manage big data. On the other hand, ChatGPT offers an innovative and interactive way to create conversational AI models. By learning these technologies, data engineers can stay ahead of the curve and be better equipped to handle the complex challenges of data management and analysis. So, don't wait - start exploring these exciting tools and stay on top of the latest trends in data engineering in 2023 and beyond.</p>
<p>If you are further interested in this topic, check out my YouTube video:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/mDoNk2y1GUg">https://youtu.be/mDoNk2y1GUg</a></div>
]]></content:encoded></item><item><title><![CDATA[Create Your First ETL Pipeline with Python]]></title><description><![CDATA[When it comes to pursuing a career in the field of Data and specifically Data Engineering and many other tech-related fields, Python comes off as a powerful tool. As you will be forging ahead in your profession, this programming language will be conv...]]></description><link>https://anujsyal.com/create-your-first-etl-pipeline-with-python</link><guid isPermaLink="true">https://anujsyal.com/create-your-first-etl-pipeline-with-python</guid><category><![CDATA[Python]]></category><category><![CDATA[ETL]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[fundamentals]]></category><category><![CDATA[pandas]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Mon, 23 Jan 2023 07:54:38 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/JUDPnpHHRqs/upload/540fa343807fd073489b81ec130d706a.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When it comes to pursuing a career in the field of Data and specifically Data Engineering and many other tech-related fields, <a target="_blank" href="https://www.python.org/">Python</a> comes off as a powerful tool. As you will be forging ahead in your profession, this programming language will be convenient in many ways. </p>
<p>Without further ado, let’s dive into the fundamentals of Python that are needed to create your first ETL Pipeline!</p>
<h1 id="heading-a-demonstration-of-the-etl-process-using-python"><strong>A Demonstration of the ETL Process using Python</strong></h1>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=uqRRjcsUGgk">https://www.youtube.com/watch?v=uqRRjcsUGgk</a></div>
<p> </p>
<p>It may be helpful to use an actual bare-bones example to illustrate how to build an ETL pipeline to gain a better understanding of the subject. With this, we will better understand how easy Python is to use as a whole.</p>
<p>Create a file called <a target="_blank" href="http://etl.py">etl.py</a> in the text editor of your choice and add the following docstring.</p>
<pre><code class="lang-python"><span class="hljs-string">"""
Python Extract Transform Load Example
"""</span>
</code></pre>
<p>We will begin with a basic ETL Pipeline consisting of essential elements needed to extract the data, then transform it, and finally, load it into the right places. At this step, things are not as complex as they might seem, even if you are a complete beginner at it. </p>
<p>So as we go down the path, you will see how easy it is to use Python for building any such ETL pipelines. </p>
<h2 id="heading-importing-the-right-packages"><strong>Importing the Right Packages</strong></h2>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> sqlalchemy <span class="hljs-keyword">import</span> create_engine
</code></pre>
<p>Coming to this step, you will realize how resourceful Python is as a tool, because it carries an ecosystem of libraries around data and general programming that makes it painless as well as effective to use. Importing the right libraries is the first step to creating anything using Python. </p>
<p>Here, the first of the three imports is the requests library, which helps pull data from an API and handles the extraction of data. Apart from that, <a target="_blank" href="https://pandas.pydata.org/">Pandas</a> is another library to perform transformation and manipulation of data. This is similar to Excel on steroids, with the only difference being that it is based on using code. Hence, pandas can be used to read data in Excel and CSV formats (basically, anything in a tabular format), and we can easily transform it using pandas. The last on this list is <a target="_blank" href="https://sqlalchemy.org/">SQLAlchemy</a>, which is meant to support creating a connection to a database (essentially, an SQLite database). </p>
<h2 id="heading-step-1-extract"><strong>Step 1: Extract</strong></h2>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">extract</span>()-&gt; dict:</span>
    <span class="hljs-string">""" This API extracts data from
    http://universities.hipolabs.com
    """</span>
    API_URL = <span class="hljs-string">"http://universities.hipolabs.com/search?country=United+States"</span>
    data = requests.get(API_URL).json()
    <span class="hljs-keyword">return</span> data
</code></pre>
<p>The first step of creating the pipeline is the extract. As shown in the sample, we are extracting from an API source that is freely available for use. This particular call retrieves information on the universities available in the United States as a whole. When we run this API, it provides the data back to us in JSON format, as in the sample shown below.</p>
<pre><code class="lang-json">[{<span class="hljs-attr">"web_pages"</span>: [<span class="hljs-string">"http://www.marywood.edu"</span>], <span class="hljs-attr">"state-province"</span>: <span class="hljs-literal">null</span>, <span class="hljs-attr">"alpha_two_code"</span>: <span class="hljs-string">"US"</span>, <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Marywood University"</span>, <span class="hljs-attr">"country"</span>: <span class="hljs-string">"United States"</span>, <span class="hljs-attr">"domains"</span>: [<span class="hljs-string">"marywood.edu"</span>]}, {<span class="hljs-attr">"web_pages"</span>: [<span class="hljs-string">"http://www.lindenwood.edu/"</span>], <span class="hljs-attr">"state-province"</span>: <span class="hljs-literal">null</span>, <span class="hljs-attr">"alpha_two_code"</span>: <span class="hljs-string">"US"</span>, <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Lindenwood University"</span>, <span class="hljs-attr">"country"</span>: <span class="hljs-string">"United States"</span>, <span class="hljs-attr">"domains"</span>: [<span class="hljs-string">"lindenwood.edu"</span>]}, {<span class="hljs-attr">"web_pages"</span>: [<span class="hljs-string">"https://sullivan.edu/"</span>], <span class="hljs-attr">"state-province"</span>: <span class="hljs-literal">null</span>, <span class="hljs-attr">"alpha_two_code"</span>: <span class="hljs-string">"US"</span>, <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Sullivan University"</span>, <span class="hljs-attr">"country"</span>: <span class="hljs-string">"United States"</span>, <span class="hljs-attr">"domains"</span>: [<span class="hljs-string">"sullivan.edu"</span>]}, {<span class="hljs-attr">"web_pages"</span>: [<span class="hljs-string">"https://www.fscj.edu/"</span>], <span class="hljs-attr">"state-province"</span>: <span class="hljs-literal">null</span>, <span class="hljs-attr">"alpha_two_code"</span>: <span class="hljs-string">"US"</span>, <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Florida State College at Jacksonville"</span>, <span class="hljs-attr">"country"</span>: <span class="hljs-string">"United States"</span>, <span class="hljs-attr">"domains"</span>: [<span class="hljs-string">"fscj.edu"</span>]}, {<span class="hljs-attr">"web_pages"</span>: [<span class="hljs-string">"https://www.xavier.edu/"</span>], <span class="hljs-attr">"state-province"</span>: <span class="hljs-literal">null</span>, <span class="hljs-attr">"alpha_two_code"</span>: <span class="hljs-string">"US"</span>, <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Xavier University"</span>, <span class="hljs-attr">"country"</span>: <span class="hljs-string">"United States"</span>, <span class="hljs-attr">"domains"</span>: [<span class="hljs-string">"xavier.edu"</span>]}, {<span class="hljs-attr">"web_pages"</span>: [<span class="hljs-string">"https://home.tusculum.edu/"</span>], <span class="hljs-attr">"state-province"</span>: <span class="hljs-literal">null</span>, <span class="hljs-attr">"alpha_two_code"</span>: <span class="hljs-string">"US"</span>, <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Tusculum College"</span>, <span class="hljs-attr">"country"</span>: <span class="hljs-string">"United States"</span>, <span class="hljs-attr">"domains"</span>: [<span class="hljs-string">"tusculum.edu"</span>]}, {<span class="hljs-attr">"web_pages"</span>: [<span class="hljs-string">"https://cst.edu/"</span>], <span 
class="hljs-attr">"state-province"</span>: <span class="hljs-literal">null</span>, <span class="hljs-attr">"alpha_two_code"</span>: <span class="hljs-string">"US"</span>, <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Claremont School of Theology"</span>, <span class="hljs-attr">"country"</span>: <span class="hljs-string">"United States"</span>, <span class="hljs-attr">"domains"</span>: [<span class="hljs-string">"cst.edu"</span>]}]
</code></pre>
<p>The data we get here is in the bare-bones structure used in HTTP or HTTPS calls. We get this data as an output from the URL:</p>
<p> “<a target="_blank" href="http://universities.hipolabs.com/search?country=United+States”">http://universities.hipolabs.com/search?country=United+States”</a> </p>
<p>We will use the requests library to achieve the extraction and obtain the response as JSON, which is a dictionary within Python. </p>
<h2 id="heading-step-2-transform"><strong>Step 2: Transform</strong></h2>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">transform</span>(<span class="hljs-params">data:dict</span>) -&gt; pd.DataFrame:</span>
    <span class="hljs-string">""" Transforms the dataset into desired structure and filters"""</span>
    df = pd.DataFrame(data)
    print(<span class="hljs-string">f"Total Number of universities from API <span class="hljs-subst">{len(data)}</span>"</span>)
    df = df[df[<span class="hljs-string">"name"</span>].str.contains(<span class="hljs-string">"California"</span>)]
    print(<span class="hljs-string">f"Number of universities in california <span class="hljs-subst">{len(df)}</span>"</span>)
    df[<span class="hljs-string">'domains'</span>] = [<span class="hljs-string">','</span>.join(map(str, l)) <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> df[<span class="hljs-string">'domains'</span>]]
    df[<span class="hljs-string">'web_pages'</span>] = [<span class="hljs-string">','</span>.join(map(str, l)) <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> df[<span class="hljs-string">'web_pages'</span>]]
    df = df.reset_index(drop=<span class="hljs-literal">True</span>)
    <span class="hljs-keyword">return</span> df[[<span class="hljs-string">"domains"</span>,<span class="hljs-string">"country"</span>,<span class="hljs-string">"web_pages"</span>,<span class="hljs-string">"name"</span>]]
</code></pre>
<p>This step is mainly about transforming the data to be in the right format and sequence. Mostly, the transformation of any data is done around particular business conditions and their requirements. For this specific sample, we have assumed a hypothetical condition where we are searching for universities in California.</p>
<p>Firstly, the extracted data arrives as a dictionary. This data will then be read into a pandas data frame. But what does pandas do? To elaborate, the pandas data frame is the library's core data structure, and it enables us to convert this dictionary into a data frame. Further, we can think of a data frame as a CSV which has rows and columns, with various added functionalities. It's a comprehensive tool when it comes to transforming data.</p>
<p>Next, we will filter [line 5 in the snippet] for all the universities whose name contains “California”. As mentioned before, pandas is like Excel on steroids, with a simple syntax that helps us with any such actions required to filter or transform the data.</p>
<p><img src="https://lh5.googleusercontent.com/WizVEEtWK8-Bq1qMMTSs_82E4agXfaZsSHJ4Oj7SL_pESw4K4UWE-6R9cQQ2vhLU-hP4CBXZZDcrKgEpx2Pf3urbjvOtJRs2yx9DAwmiBegUHYZIcOcgn-5NZnIN_JI9e7KQqrk1BQEWf2CPOyJFWYEkmxIsz08cA-uJz0NBLEZk-UWVK8Ho5oG_UjjcUw" alt /></p>
<h2 id="heading-step-3-load"><strong>Step 3: Load</strong></h2>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load</span>(<span class="hljs-params">df:pd.DataFrame</span>)-&gt; <span class="hljs-keyword">None</span>:</span>
    <span class="hljs-string">""" Loads data into a sqllite database"""</span>
    disk_engine = create_engine(<span class="hljs-string">'sqlite:///my_lite_store.db'</span>)
    df.to_sql(<span class="hljs-string">'cal_uni'</span>, disk_engine, if_exists=<span class="hljs-string">'replace'</span>)
</code></pre>
<p>The last step of creating this pipeline is about connecting to an SQLite database on the disk (creating it if it does not already exist). This kind of database can live on a host or a server. We will then save this data frame into a table by using <strong>df.to_sql</strong>, which is a method of the pandas data frame. Here, we further provide the <a target="_blank" href="https://www.sqlite.org/">SQLite</a> engine with a condition that if such a table already exists, it shall ‘<strong>replace</strong>’ it (via the if_exists argument).</p>
<p>With this load step complete, the transformed data ends up in a database. Because each step is written as a function, we can reuse this set of functions again. Together, they form a complete ETL pipeline consisting of all the elements needed to perform such actions on the data.</p>
<h2 id="heading-running-your-first-etl-pipeline"><strong>Running your first ETL pipeline</strong></h2>
<pre><code class="lang-python">data = extract()
df = transform(data)
load(df)
</code></pre>
<p>Finally, we need to execute the functions that make up this pipeline. As a result, we will get a data frame with all the columns, which at this point we load into the SQLite database. Now, what's left to be done is to run it (in a notebook, with the ‘<strong>shift</strong>’ &amp; ‘<strong>enter</strong>’ keys).   </p>
<p>While it will take some time to execute the code and call the API, we will get back the total number of universities in the United States, along with the number of universities in California. We can then store this transformed data in an SQLite database file, visible in the file explorer. For any future requirements, we can reuse the same ETL pipeline and fetch the data that we need. </p>
<p><img src="https://lh3.googleusercontent.com/eW5UbZJTfUua2mT7eBoUdTyz9oy7Gyy0ljv-VvV1M3QRXM8531nG6UX4W8F4mpER5CakLHIvzZW9HuSXli_C7twnLKBHdrjU_FRf_SJxWqxfql2rSrwp0c2hVnSMR-4M6XUtQUeCFHadoC0TIX6bJWaqmZ0BnJrR2oaGJZ64XRFGzZ0PxPX4tM_qRxiJ_A" alt /></p>
<p>This is simply how the ETL process works using Python to achieve whatever we want to extract out of data.  </p>
<h1 id="heading-conclusion"><strong>Conclusion</strong></h1>
<p>A programming language as versatile as Python is marked under the essentials by many data engineers, data scientists, and developers, including software engineers. Therefore, as a beginner in the field of data engineering, core knowledge of Python is a must-have skill. However, it is not necessary to know everything under the sun when it comes to Python. I’d like to emphasize that we will always be learning as we go, so there is no need to panic and try to gulp down everything at once! </p>
<p>Check out the video link below that talks about the same for a better explanation!</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=uqRRjcsUGgk&amp;feature=youtu.be">https://www.youtube.com/watch?v=uqRRjcsUGgk&amp;feature=youtu.be</a></div>
<p> </p>
<h2 id="heading-source-code">Source Code</h2>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/syalanuj/youtube/blob/main/de_fundamentals_python">https://github.com/syalanuj/youtube/blob/main/de_fundamentals_python</a></div>
]]></content:encoded></item><item><title><![CDATA[A Step-by-Step Roadmap To Data Engineering]]></title><description><![CDATA[The specialized field of data engineering is ever-expanding and its elements of it are scattered all around. But how do we come to grips with not being confused & consumed in the process of learning? 
Initially, you must follow a roughly drawn map an...]]></description><link>https://anujsyal.com/a-step-by-step-roadmap-to-data-engineering</link><guid isPermaLink="true">https://anujsyal.com/a-step-by-step-roadmap-to-data-engineering</guid><category><![CDATA[Roadmap]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[Python]]></category><category><![CDATA[SQL]]></category><category><![CDATA[Databases]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Fri, 13 Jan 2023 11:37:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1673609306691/ee0cf5c2-b14f-49aa-a4ff-65f169ffa7f2.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The specialized field of data engineering is ever-expanding and its elements of it are scattered all around. But how do we come to grips with not being confused &amp; consumed in the process of learning? </p>
<p>Initially, you must follow a roughly drawn map and leave the rest to the skills and opportunities you opt for. A fun way to understand this roadmap is to imagine a smooth transition from a rookie to a professional data engineer with each sprint you take!</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/TjxmoAXkaAU">https://youtu.be/TjxmoAXkaAU</a></div>
<p> </p>
<h1 id="heading-sprint-1-strengthening-the-base-level"><strong>Sprint 1:</strong> Strengthening the Base Level</h1>
<h3 id="heading-fundamentals-i-focus-on-fundamentals-such-as-python-ds-amp-algorithms"><strong>Fundamentals I: Focus on fundamentals such as Python, DS &amp; Algorithms</strong></h3>
<p>Firstly, you must focus on fundamental skills such as Python, Data Structures &amp; Algorithms. Python is the language you will use to interact with various types of databases and tools, which is why it is sometimes described as an interactional language. At this step, it will be a fruitful decision to learn about data structures and algorithms, as they underpin most of the things that you will come across regularly later. </p>
<p><img src="https://lh6.googleusercontent.com/CeSVHiRBV8bTVFUppVd-BM8qQaK9FnD2xquLSrttuiJVzJiYMKSwlL5uMIvHQmgOhBBrZviUP6Ra7akkL_PFKQRSGHPTDeSFf0XYxKH6bOgMjD0g5SEaPUV6TuN1eZKurAV0iNpryGXPumt9K0ATZTrBAtmmvrt_u1lRG01StgRg8BcOiiwnwdp3ysiShg" alt /></p>
<p>Object-oriented languages such as Python have in-built data structures and assorted open-source packages for applying algorithms. However, it is still preferable to have a solid understanding of data structures and algorithms, as they help in writing optimized code. </p>
<h3 id="heading-fundamentals-ii-linux-commands-shell-scripting-git-networking-concepts"><strong>Fundamentals II: Linux Commands, Shell Scripting, Git, Networking Concepts</strong></h3>
<p><img src="https://lh5.googleusercontent.com/FgCSENUyt1VtyEE65iI6e12dfsAgRc_TiI3XMtATwUEhM5zzHd8bE_1-qkQWpaJEAPeTo_B05KIESEiE5kxZqcg7x6XjAgs_2VnYJ_SvaGjBS94iCZ9be8PFcQWrXyXEf265Lr6drn5PdWc7HEhmdBTAfdxcu0htfVAIND4c9WtfrC7ej-WNQA8yuheFiQ" alt /></p>
<p>Next in the fundamentals comes a mixed combination of skills like Linux commands, Shell Scripting, Git, and Networking Concepts. These are important for times when you will be dealing with virtual/cloud servers and several other platforms to transform, manage, and store the data. </p>
<h1 id="heading-sprint-2-database"><strong>Sprint 2:</strong> Database</h1>
<p><img src="https://lh6.googleusercontent.com/pZ4XXjWWFmGrINRhNT6wRtIXilOB2wclHfEJNrLrrE5Wo49CVqXMoBqmQX9W7XpL7IjXoxbf53s8qarybsTNdFBjm01XicTQEeDmJXNJweybS-fS2hcj6IlG5srYMPLjFc2AYDX0o3E3Wrkz2IsM8LcPSUqjrodL7YGxzuUSMBdkAnZB1qrTj02p17sm8Q" alt /></p>
<p>A full sprint dedicated to databases and SQL, as you will have maximum interaction with them in this journey. You need to focus on Database Fundamentals, SQL, Database Modelling, ACID Transaction, Relational, and Non-Relational Databases. Here, you are free to play and experiment with data as you go along and build a good understanding of these concepts.</p>
<p><img src="https://lh4.googleusercontent.com/J9MTDPABSsjWpQGrUCA2NJ9IisfjfQJZfUgD_H1Vu-BR3O_cNJxZJeYG8-2rq4BNJwABLSFwx1SNU3tLNTbKc9Nu7PYmim9G55ClAdWuZ2MAcbATLxxWg0dAne4IiOnSvDdznuENahpJag-R0qKLnKZ6BB5BfWpv-j1U7HHnd703yOFqEVqK2qEtlZ9tfg" alt /></p>
<h1 id="heading-sprint-3-data-lakes-amp-warehouses"><strong>Sprint 3:</strong> Data Lakes &amp; Warehouses</h1>
<p><img src="https://lh3.googleusercontent.com/bQE5eWICRT74CzcsG8RqW8A0jgmWpdir2vX8qulJhKBqDfwL8UEm_OBV2YBy3_nQIM5SWUY7hC1NG43RlR-fzXh1zjv191PV4rExBjNp6FWJ8noav2yzbG--omnEedQ6S9Pm3pip3uhvlIgE25vYIht1dMhPszUVDrqaQP9dbQrkwJtsbi6HV84mcOJfWg" alt /></p>
<p>Understand fundamentals of data lakes and warehouses, OLAP Vs. OLTP, S3, GCS, Big Query, Redshift, <a target="_blank" href="https://clickhouse.com/">ClickHouse</a>, Normalization Vs. Denormalization, and processing Big Data using <a target="_blank" href="https://www.mysql.com/">MySQL</a>. These concepts demand dedicated attention from the learners as they play a major role when storing data and managing it all in different places for different purposes. </p>
<h1 id="heading-sprint-4-distributed-systems"><strong>Sprint 4:</strong> Distributed Systems</h1>
<p><img src="https://lh3.googleusercontent.com/JjfxyzCxEbuGXFO_C4gIEM1TNgmPW4vuxBeSybe3X85rc3BxAy7FqPZZy7P5TgoyzoxmktCqSGp4fYg95cDDutiSf8u4yiU-OzfMZtf0Lm8ZS8X6wUs5svmOhGhbAaRa9rQLzQX-bk_zYowidnWao3CMZKPBF9YNeK38vk3nG_omgNWHlUi7lypOCXntSA" alt /></p>
<p>Distributed Systems are formed when multiple machines work together in groups to manage massive data sets which cannot be done by a single machine alone. These modern frameworks achieve big data processing using Distributed Systems. Hence, it is required to understand the fundamentals of Hadoop, Map Reduce, HDFS, and Cluster Tech such as EMR, <a target="_blank" href="https://cloud.google.com/dataproc">Dataproc</a>, and <a target="_blank" href="https://www.databricks.com/">Databricks</a>.</p>
<h1 id="heading-sprint-5-data-processing"><strong>Sprint 5:</strong> Data Processing</h1>
<p><img src="https://lh3.googleusercontent.com/r--rO0EtBnILf9OVDoF8rHfs5f12vbUlWvxGIPSZwLTzVcV6kaI0e63hkVBTjVxXuclDLYKgGsitEn2g-liOWe4H3L3RHNcq4aFkwgQH1pRObdaSnMCjIWBW7YSn7KbFXWuBnapySS4IroBP-z3JgFE7HkGwN7GufRZFMfMbTw6jTmUQnZF0XihU7M2lFQ" alt /></p>
<p>Data Processing is a step where your coding skills will be challenged. Why? Because you will be required to transform the raw data to get the most utility out of it. A programming language such as Python is a must-have here, as it is the coding language most often used. It is suggested that you get accustomed to a variety of tools such as <a target="_blank" href="https://pandas.pydata.org/">Pandas</a>, SQL, <a target="_blank" href="https://spark.apache.org/">Spark</a>, <a target="_blank" href="https://beam.apache.org/">Beam</a>, <a target="_blank" href="https://hadoop.apache.org/">Hadoop</a>, etc. </p>
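<p>As a small, hedged illustration of this kind of work, here is a PySpark sketch that filters and aggregates a raw file (the file path and column names are made up for the example):</p>
<pre><code class="lang-python"># Minimal PySpark sketch of a transformation step (paths/columns are illustrative).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transform-demo").getOrCreate()

orders = spark.read.csv("/data/raw/orders.csv", header=True, inferSchema=True)

# Keep only completed orders and aggregate revenue per country
revenue_by_country = (
    orders.filter(F.col("status") == "completed")
          .groupBy("country")
          .agg(F.sum("amount").alias("total_amount"))
)
revenue_by_country.show()
</code></pre>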
<h1 id="heading-sprint-6-orchestration"><strong>Sprint 6:</strong> Orchestration</h1>
<p><img src="https://lh3.googleusercontent.com/JENOHCZ0Qhp7TecTkZz-39vsZ_WIJFDjytDr7rP64oRYfhJrw9lpwgEuAD_FH5IR1rhPeGU1dQwzjyxiK7SUzNCfrbd1zbAlTrNDWLJW-PjKkxu5Lihrst936Hc168DPR3ix8MkNTDWWpc9-uD03R3rinxsfCCq-tQIq2qn_zmwxPIDfDIBtnI6F_Xxdqg" alt /></p>
<p>With the sixth sprint you will take, you need to learn to orchestrate pipelines using tools where you define the flow and schedule of your tasks. But that’s not just it! You should gain a detailed understanding of how to use <a target="_blank" href="https://airflow.apache.org/">Airflow</a> and create <a target="_blank" href="https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html">DAGs</a> (Directed Acyclic Graphs). Also, get a glimpse of other orchestration tools such as <a target="_blank" href="https://luigi-project.io/">Luigi</a> and <a target="_blank" href="https://www.jenkins.io/">Jenkins</a>. </p>
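<p>To give a flavour of what orchestration code looks like, here is a minimal sketch of an Airflow DAG with two dependent tasks (the task logic and schedule are made up for the example):</p>
<pre><code class="lang-python"># Minimal Airflow 2.x sketch: a daily DAG with two dependent Python tasks.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from the source...")

def load():
    print("writing data to the warehouse...")

with DAG(
    dag_id="simple_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Run extract before load
    extract_task &gt;&gt; load_task
</code></pre>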
<h1 id="heading-sprint-7-backend-frameworks"><strong>Sprint 7:</strong> Backend Frameworks</h1>
<p><img src="https://lh6.googleusercontent.com/lbJEacf1x_Gn1ubBAFneGx-sQOr2zNWfKj1RN_BC9qCdd-DB2-7fAfEvAsrzVGn3IvFaQfpxyrLmOQWJBWpIVyLbMPvTPvzegkYbgfOyDKAlFd6HYuyWmDm0lYUeabFncgL1YkvUQ6CgrzCy9-Gxc69bMHW95YaFXGXaj7QUPyQJvf-VZ0cXr8V4zoMngA" alt /></p>
<p>This part of the concept in data engineering overlaps with that of software engineering. Sometimes you need to serve your models and their functionality. Therefore, it is crucial to be well aware and learn how to create APIs with frameworks such as <a target="_blank" href="https://flask.palletsprojects.com/en/2.2.x/">Flask</a>, <a target="_blank" href="https://fastapi.tiangolo.com/">FastAPI</a>, and <a target="_blank" href="https://www.djangoproject.com/">Django</a>.</p>
<h1 id="heading-sprint-8-automation-amp-deployments"><strong>Sprint 8:</strong> Automation &amp; Deployments</h1>
<p><img src="https://lh3.googleusercontent.com/drTsKYLyDBQxDWvMRbLtAv58qtdo6aIJojAougCk01E27filBQ0SxiR_8FkQk1SYqbBvV540BFGYgRMLTaD9vCuJ401gRJXZBqXyPXpqMbZjtWuV2BeNirNaICXqa7tjjHUKzfrgLASec0M07n3FrelW09Z1Rz7MSZJGW1YHf2fEyvKM_KgyeSAM8rnMLQ" alt /></p>
<p>This kind of technology is important to understand as it lets you automate and deploy the codes using a variety of tools and platforms. For you, a few of the necessary learning technologies would be Containerization with <a target="_blank" href="https://www.docker.com/">Docke</a>r, CI/CD with <a target="_blank" href="https://github.com/">GitHub</a> Actions &amp; Infrastructure as code using <a target="_blank" href="https://www.terraform.io/">Terraform</a> and <a target="_blank" href="https://www.ansible.com/">Ansible</a>. </p>
<h1 id="heading-sprint-9-frontend-amp-dashboarding"><strong>Sprint 9:</strong> Frontend &amp; Dashboarding</h1>
<p><img src="https://lh3.googleusercontent.com/dtfI0v47FBScYg2J5B7sc1ZZJ2Q1iUcY14AhhKCIF_FoUmlIwxaWEP8FmT23syUUG6KjIWI6pqR_vH-xKFXbL6NCyTpptwtlHXLWBN5BclSvLXfrsVQOZCwyozQot8y4VwmkvNOEYwN-8CxoVG-bvjXu2fChd1Of_iHpCG56hksmE2lo0H7SCHteONRmlw" alt /></p>
<p>Frontend and exploration technologies are essential tools when it comes to showing the outcomes and actions taken on large data sets. In other words, they help visualize the ongoing changes and results using charts, graphs, and diagrams. Some of the popular tools to get used to are <a target="_blank" href="https://jupyter.org/">Jupyter Notebooks</a>, Dashboarding tools such as <a target="_blank" href="https://powerbi.microsoft.com/en/">PowerBI</a> &amp; <a target="_blank" href="https://www.tableau.com/">Tableau</a>, and Python frameworks such as <a target="_blank" href="https://plotly.com/dash/">Dash</a> &amp; <a target="_blank" href="https://plotly.com/">Plotly</a>, etc.</p>
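<p>As a small, hedged illustration, here is a Plotly Express sketch that turns a tiny made-up dataset into a bar chart:</p>
<pre><code class="lang-python"># Minimal Plotly Express sketch (the data is made up for the example).
import pandas as pd
import plotly.express as px

throughput = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar"],
    "rows_processed": [120_000, 150_000, 170_000],
})

fig = px.bar(throughput, x="month", y="rows_processed", title="Pipeline throughput by month")
fig.show()
</code></pre>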
<h1 id="heading-sprint-10-machine-learning"><strong>Sprint 10:</strong> Machine Learning</h1>
<p><img src="https://lh4.googleusercontent.com/vMsaZ5hitKBKAjZO_927TLrVF9l9OFslA3Knx4yeLZ2-PX0pWpHEO9fQkklxrqLI8RnQMZeQp8IHhX5F1Z1r0wrTOifgnO113z0OcqFD906ANA7qqkmSqzK0K079IZ2Wo5Qwvof66Iq2pLQDFaq6itLJGTfviitJ1SGNmvrSsVV1oeXBVPn3VGMIWELE-w" alt /></p>
<p>At this point, you’re already competent enough in the field of data engineering. However, to work as a professional alongside a team of other engineers, data scientists, and analysts, it is necessary to grasp the concepts of Machine Learning. ML models and algorithms are used by data scientists to study data and then make calculated predictions that can help business organizations make big decisions.</p>
<h1 id="heading-conclusion"><strong>Conclusion</strong></h1>
<p>An in-depth understanding of the core concepts is the first step when learning any subject as it promises you great success. Likewise, it is imperative to go with a clear and succinct approach in the advancing field of Data Engineering. This roadmap has simple guiding steps to ensure you can build a promising career with undying enthusiasm. The aspirants of data and its engineering must indulge in learning and practicing from a wide array of skills and technologies as time passes. </p>
<p>For more information and elaborative understanding, you can check out the video!</p>
]]></content:encoded></item><item><title><![CDATA[12 Must-Have Skills to become a Data Engineer]]></title><description><![CDATA[Are you passionate about using data to create innovative products and solutions? If so, a career as a data engineer may be the perfect fit for you. But what does it take to be successful in this field? In this blog, we will explore the skills and req...]]></description><link>https://anujsyal.com/12-must-have-skills-to-become-a-data-engineer</link><guid isPermaLink="true">https://anujsyal.com/12-must-have-skills-to-become-a-data-engineer</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[spark]]></category><category><![CDATA[#datawarehouse]]></category><category><![CDATA[big data]]></category><category><![CDATA[GitHub]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Tue, 03 Jan 2023 06:13:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1672725924394/68f0344e-22d5-4e59-8a42-5389387e8916.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Are you passionate about using data to create innovative products and solutions? If so, a career as a data engineer may be the perfect fit for you. But what does it take to be successful in this field? In this blog, we will explore the skills and requirements necessary to become a data engineer and succeed in this exciting profession.</p>
<p>To begin with the fundamentals, or say, to build an in-depth understanding, we must start from scratch.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1672726283888/2b120d07-201a-4765-a05f-97e4c3053bd2.png" alt class="image--center mx-auto" /></p>
<p>Full visual diagram: <a target="_blank" href="https://whimsical.com/de-skills-CYXEYCaa3U4zNL5JTYKv59">Visit this mind map on Whimsical</a></p>
<h2 id="heading-fundamental-skills"><strong>Fundamental Skills</strong></h2>
<ul>
<li><p><strong>SQL:</strong> Structured Query Language, also called See-Quell, is always at the top of the list for beginners in the domain. The language was developed in the early 1970s and is the standard language to interact with data in databases.</p>
<p>  Almost all databases and warehouses use a version of SQL as an interactional language.<br />  The popular standard relational databases are MySQL and PostgreSQL. Moreover, other tools and warehouses have adopted SQL as an abstraction, which even allows you to build ML models using SQL in BigQuery.</p>
</li>
<li><p><strong>Programming Language:</strong> Next comes the Programming Language. This is the language that engineers begin using to code with and is the central aspect of data engineering. For most of us, Python is the language of choice as it's easier to get started with.<br />  This language comes packed with data science packages and frameworks and is a perfect choice for production code. Alternatively, there are plenty of other languages (such as R, Scala, and Java) that can be used but Python is recommended.</p>
</li>
<li><p><strong>Git:</strong> Git is an important tool for version control, which is the practice of tracking and managing changes to software code. Every single change that you make becomes a part of your code base on some remote server/cloud.<br />  But how does Git help you? Git lets you save all the changes and actions that you take while coding, which works wonderfully when collaborating with your team, without losing your code. You simply create a new branch and send a pull request to merge code and voila! You’re ready to collaborate and work on your code. Check out my video tutorial on Git if you want to get started with this technology</p>
</li>
</ul>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/7_pr528CYQw">https://youtu.be/7_pr528CYQw</a></div>
<p> </p>
<p><a target="_blank" href="https://youtu.be/7_pr528CYQw">Complete Git and GitHub Tutorial for Beginners (Data Professionals)</a></p>
<ul>
<li><p><strong>Linux Commands &amp; Shell Scripting:</strong> As a practitioner in the world of data engineering, you will mostly be dealing with a Linux VM or server. Whether in a public cloud or on a private server, these machines inherently use some version of Linux such as Ubuntu, Fedora, etc.<br />  Therefore, to work with such machines, you are required to know the commands needed to navigate Linux servers. Basic commands such as cd, pwd, cp, and mv are a good start, with much more to learn further on. Shell scripting, meanwhile, is a great tool to automate these Linux commands so you do not have to run them manually.</p>
</li>
<li><p><strong>Data Structure &amp; Algorithm:</strong> Next in line are Data Structures and Algorithms. Even though you will not be required to create data structures on your own, an aspiring data engineer should still have an adequate understanding of DS &amp; Algo and the problem-solving skills that go with it (similar to software engineering). For this purpose, easy- to intermediate-level LeetCode problems will be enough for initial practice.</p>
</li>
</ul>
<h2 id="heading-concept-of-networking"><strong>Concept of Networking</strong></h2>
<p>As a data engineer, you would be responsible for quite a lot of deployments to VMs and servers. Therefore, It is important for someone dealing with VMs (Virtual Machines), Servers, and APIs (Application Programming Interfaces) to have a basic understanding of basic networking concepts such as IP (Internet Protocol), DNS (Domain Name Server), VPN, TCP, HTTP, Firewalls, etc.</p>
<h2 id="heading-databases"><strong>Databases</strong></h2>
<ol>
<li><p><strong>Fundamentals:</strong> A database is a space where data is stored. You will be interacting with many of these databases as a data engineer. For this reason, you need to understand the fundamental concepts of databases, such as tables, rows, columns, keys, joins, merges, and schema.</p>
</li>
<li><p><strong>SQL:</strong> SQL comes up once again when talking about databases, as it is the language you use to interact with them.</p>
</li>
<li><p><strong>ACID:</strong> This abbreviation stands for Atomicity, Consistency, Isolation, and Durability. This is a set of properties of database transactions intended to guarantee data validity, despite errors, power failures, and any other such mishaps.</p>
</li>
<li><p><strong>Database Modelling:</strong> Data Modelling or Schema Design helps extensively when building any database, be it applications or warehouses. That is why it is essential to have some knowledge of design patterns for creating schemas for databases. This includes star schema, flat design, snowflake model, etc.</p>
</li>
<li><p><strong>Database Scaling:</strong> Vertical scaling refers to increasing the configuration (CPU, memory, storage) of the single machine where the database is deployed, which can be scaled up further later on. Alternatively, with horizontal scaling, also known as sharding, you store the data across multiple machines instead of one.</p>
</li>
<li><p><strong>OLTP Vs. OLAP:</strong> OLAP (Online Analytical Processing) &amp; OLTP (Online Transaction Processing) are two different types of data processing systems. Online analytical processing (OLAP) uses complex queries to examine historical data that has been collected from OLTP systems.</p>
</li>
<li><p><strong>Relational Databases:</strong> These are traditional-style databases that power most applications. A single database can contain multiple tables with rows and columns. The most commonly used databases of this kind are PostgreSQL and MySQL.</p>
</li>
<li><p><strong>Non-Relational Databases:</strong> Non-relational databases store data using storage models other than a tabular schema, for example as documents, key-value pairs, or nodes and relations. This helps in storing the data in a form that suits the workload and then promptly extracting and fetching the data records. Non-relational databases further come in three different types that can be understood as and when needed.</p>
<ul>
<li><p><strong>Key-Value Databases</strong>: examples are Redis, DynamoDB, and FireBase</p>
</li>
<li><p><strong>Graph Database</strong>: examples are Neo4j and ArangoDB</p>
</li>
<li><p><strong>Wide Column Databases</strong>: examples are Apache Cassandra, and Google BigTable</p>
</li>
</ul>
</li>
</ol>
<h2 id="heading-data-warehousing"><strong>Data Warehousing</strong></h2>
<p>The inability of databases to store a huge amount of data leads us to a warehouse. These data warehouses can store large volumes of current and historical data for query and analysis. <strong>Data Warehousing is simply databases designed with analytical workloads in mind</strong>. These are powerful enough to perform complex aggregate queries and transformations to yield insights. Some of the key concepts to understand within warehousing are-</p>
<ol>
<li><p><strong>SQL</strong>: With the advent of powerful data warehouses that abstract away complexity, proficiency in SQL is all that is required to unlock their full potential.</p>
</li>
<li><p><strong>Normalization Vs. Denormalization</strong>: Normalization involves removing redundancy and inconsistencies from the data. Denormalization, in contrast, is the technique of merging data into a single table to speed up reads and queries at the cost of some redundancy.</p>
</li>
<li><p><strong>OLAP Vs. OLTP</strong>: The primary distinction between the two is that one uses data to gather important insights while the other supports transaction-oriented applications.</p>
</li>
</ol>
<p>Some of the popular data warehouses are:</p>
<ul>
<li><p>Google’s Big Query</p>
</li>
<li><p>AWS Redshift</p>
</li>
<li><p>Azure Synapse</p>
</li>
<li><p>Snowflake</p>
</li>
<li><p>ClickHouse</p>
</li>
<li><p>Hive</p>
</li>
</ul>
<h2 id="heading-data-lakesobject-storage"><strong>Data Lakes/Object Storage</strong></h2>
<p>These work as file storage sources where you can store your files or blob files. They are huge cloud networks that are used globally and are readily available to you.</p>
<h2 id="heading-distributed-systems"><strong>Distributed Systems</strong></h2>
<p>When multiple machines work together as a cluster, they form Distributed Systems. These systems are used when the data is huge and cannot be managed by a single machine. They have separate sets of technologies due to their own complexities. Some of the concepts you must know in depth-</p>
<ol>
<li><p>Big Data</p>
</li>
<li><p>Hadoop</p>
</li>
<li><p>HDFS</p>
</li>
<li><p>Map Reduce</p>
</li>
</ol>
<p>Some of the technologies that are built for this purpose include Cluster technologies like Kubernetes, Databricks, Custom Hadoop Cluster, etc. Open-source technologies are also available in distributed systems.</p>
<h2 id="heading-data-processing"><strong>Data Processing</strong></h2>
<p>This is where your coding skills will come into use for transforming the data, as raw data is never directly usable. As a data engineer, your job responsibility will mainly revolve around transforming the data so it can be served in the right format. This further includes cleaning the data and validating it. Pandas can be your first-hand tool for this process, as it’s an easy-to-use Python package built around data frames. SQL can also be used to transform big data, as most data warehouses support the language. <code>Spark</code> is the most popular framework used for big data transformation. Similarly, for stream processing, <code>Spark Streaming</code> is the preferred choice.</p>
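<p>As a tiny, hedged illustration of this kind of cleaning, here is a pandas sketch that deduplicates rows and fixes a column’s type (the column names and rules are made up for the example):</p>
<pre><code class="lang-python"># Minimal pandas cleaning sketch (columns and rules are illustrative).
import pandas as pd

# Raw data with a duplicate row and a missing value, typical of what lands in a pipeline
raw = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "amount": ["10.5", "3.2", "3.2", None],
})

# Basic cleaning: drop duplicate rows, cast amount to numeric, fill missing values
clean = (
    raw.drop_duplicates()
       .assign(amount=lambda d: pd.to_numeric(d["amount"]).fillna(0.0))
)
print(clean)
</code></pre>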
<h2 id="heading-orchestration"><strong>Orchestration</strong></h2>
<p>Orchestration is used to schedule and orchestrate jobs and to create pipelines and workflows. The <strong>best tool for orchestration is Airflow</strong>, as it uses Python-based Directed Acyclic Graphs to describe the workflow of jobs. From the simplest of tasks to the most complex ones, Airflow can handle everything. Some other orchestration tools are Luigi, NiFi, and Jenkins.</p>
<h2 id="heading-backend-frameworks"><strong>Backend Frameworks</strong></h2>
<p>As the name suggests, backend frameworks overlap with software engineering. They come into use when you need to serve a data set, model, or piece of functionality to some application. For this task, you will need to create backend APIs using frameworks such as Flask, Django, and FastAPI, all of which are Python-based. Some of the cloud-based alternatives are the GCP Vertex AI API for model deployments and AutoML APIs.</p>
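<p>As an illustration, here is a hedged sketch of a tiny FastAPI service; the endpoint and data are made up and simply stand in for a real model or table:</p>
<pre><code class="lang-python"># Minimal FastAPI sketch (endpoint and data are illustrative).
from fastapi import FastAPI

app = FastAPI()

# A made-up in-memory "dataset" standing in for a real model or table
UNIVERSITY_COUNTS = {"california": 44, "texas": 38}

@app.get("/universities/{state}")
def university_count(state: str):
    """Return a count for the requested state, defaulting to 0 when unknown."""
    return {"state": state, "count": UNIVERSITY_COUNTS.get(state.lower(), 0)}
</code></pre>
<p>Running it with uvicorn (for example, <code>uvicorn main:app --reload</code>, assuming the file is saved as main.py) exposes the data over HTTP for other applications to consume.</p>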
<h2 id="heading-frontend-andamp-dashboarding"><strong>Frontend &amp; Dashboarding</strong></h2>
<p>Frontend and exploration technologies are about displaying the results and actions performed through charts, images, and diagrams. There are plenty of tools available which might not come to use for a data engineer but are good to know about. The popular ones are Jupyter Notebooks, Dashboarding (PowerBI, Tableau), Python Frameworks (Dash, Gradio), etc.</p>
<h2 id="heading-automation-andamp-deployments"><strong>Automation &amp; Deployments</strong></h2>
<p>Automation and Deployments are about automating and deploying the codes using a variety of tools and technologies. A few of the important technologies include the following:</p>
<ol>
<li><p>Infrastructure as code: Using Terraform, Ansible, Shell Scripts</p>
</li>
<li><p>CI/CD: Using GitHub Actions, Jenkins</p>
</li>
<li><p>Containerization: Docker, Docker Compose</p>
</li>
</ol>
<h2 id="heading-machine-learning"><strong>Machine Learning</strong></h2>
<p>Machine learning algorithms (or models) are another concept well worth knowing. Machine learning is mainly used by data scientists to make predictions by analyzing current and historical data. However, data engineers should have a strong understanding of the basics of machine learning, as it enables them to deploy models and build more reliable pipelines for them, which directly helps data scientists make precise decisions. Hence, it is good to understand the fundamentals and frameworks of ML. Some of the platforms used for ML operations are Google AI Platform, Kubeflow, and SageMaker.</p>
<h2 id="heading-integrated-platforms"><strong>Integrated Platforms</strong></h2>
<p>Integrated platforms allow data scientists and data engineers to share integrated workflows in one place. AWS SageMaker, Databricks, and Hugging Face are some examples of integrated platforms.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>In the field of data engineering, there are a myriad of skills that one needs to learn, and that requires gaining hands-on experience too. As an aspiring data engineer, you get to choose from a wide variety of skills and tools to work with, and that’s the thrill of it all!</p>
<p>For more information and a deeper understanding, check out the video below!</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=0qmrsjW_rVI">https://www.youtube.com/watch?v=0qmrsjW_rVI</a></div>
<p> </p>
<p><a target="_blank" href="https://youtu.be/0qmrsjW_rVI"><strong>DATA ENGINEERING SKILLSET</strong></a></p>
]]></content:encoded></item><item><title><![CDATA[Data Engineering Explained]]></title><description><![CDATA[When we scroll through these sites in hopes to find something we need to buy (say, a shirt), we add it to the cart, or we just let it be saved for later. Within a few moments, you begin to see advertisements of the same or similar-looking shirts whil...]]></description><link>https://anujsyal.com/data-engineering-explained</link><guid isPermaLink="true">https://anujsyal.com/data-engineering-explained</guid><category><![CDATA[Data Science]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[big data]]></category><category><![CDATA[#datawarehouse]]></category><category><![CDATA[Data-lake]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Sat, 24 Dec 2022 08:34:50 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/04f0598c840abea21926a76e8492f28f.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When we scroll through these sites in hopes to find something we need to buy (say, a shirt), we add it to the cart, or we just let it be saved for later. Within a few moments, you begin to see advertisements of the same or similar-looking shirts while surfing other platforms.</p>
<p>For these creepy advertisements to be in the right spots, apart from data tracking using cookies, there is also a good amount of <strong>data engineering</strong> working behind the scenes.</p>
<p>In this blog post let's try to understand how data engineering works.</p>
<h2 id="heading-data-engineering-in-the-past"><strong>Data Engineering in the Past</strong></h2>
<p>First, let's try to understand data engineering in the past. A while back, when things were simple, data was scattered across multiple sources such as transaction databases (e.g. MySQL, Postgres), analytics tools (e.g. Google Analytics, Facebook Pixels), and CRM databases. This data was often accessed and analyzed by an Excel professional who would gather the data from various teams, manipulate it in Excel using pivot tables and other functions, and create a final report.</p>
<p><img src="https://lh6.googleusercontent.com/KURdXHpnKko5t2d2Yj4Ph0wTI_YEBpGXqbpiYuzON7RI5AAEJqVTp3REwPFkAgkS2dWtp0NL1ix5ey9HsVwfBe_X17VBN6BaRJ0PWCgoID6f2SIWgjCR3mxAHCkGjDw9N4N0ZD_2JAYG20WeAi2cxSYoUV0_RgEQjh3q5Byp8U4rkeFpgHrph_ZO33zqFg" alt="Illustration of data engineering in past by author" /></p>
<p>While this process worked for small applications, it was prone to error due to the manual intervention involved and became increasingly cumbersome as the amount of data grew.</p>
<h2 id="heading-understanding-extract-transform-load-etl"><strong>Understanding Extract Transform Load (ETL)</strong></h2>
<p>Extract, Transform, Load (ETL) is a process in data engineering that involves extracting data from various sources, transforming it into a format suitable for analysis or other purposes, and loading it into a target system or database. The purpose of ETL is to make it easier to work with data from different sources by bringing it into a centralized location or format.</p>
<p><img src="https://lh5.googleusercontent.com/vUpZH8EMB0h-O0Z5uKGFl5MjYzOpqyay2gQaEr4Mfh5iDidZyinOFTvcbMTY9D69L418ucN90z5YMOGemqpBz393bbP9Lbn-PI1Utp1DYElewEw1ixKJSI7gSwTI7DZctgDXOXldAMzXjnhD2WdfcPzUfb3rZSMpXnhsuEkpAtQxNw_QtQRH70NT8ADALg" alt="Illustration of ETL pipeline from author" class="image--center mx-auto" /></p>
<p>Any simple ETL pipeline first extracts data from all the sources, such as databases, APIs, files, and other data connectors. Then comes the next step, where it transforms the data. But what exactly does this step involve? To transform the collected data, the pipeline resolves ambiguities, handles missing fields and columns full of nulls, puts misplaced columns into the right place and format, performs joins or merges where needed, and sometimes also produces a pivot summary.</p>
<p>Finally, in the last step, this customized, transformed data is loaded into a sink. This ‘sink’ could simply be a database. Since this process is not just a one-time thing, it is likely to be repeated at a continuing frequency, so data engineers write scripts that run the whole ETL process on a weekly, monthly, or yearly basis. Apache Airflow is one good example of a tool for orchestrating ETL jobs.</p>
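<p>To make the three steps tangible, here is a compact, hypothetical sketch of an ETL job in Python, using Pandas for the transform and SQLite standing in for the sink; the file, table, and column names are invented for illustration.</p>
<pre><code>import sqlite3

import pandas as pd

def run_etl():
    # Extract: pull raw data from a source (a hypothetical CSV export here)
    raw = pd.read_csv("sales_export.csv")

    # Transform: drop incomplete rows, fix types, build a pivot-style summary
    clean = raw.dropna(subset=["customer_id", "amount"])
    clean["sale_date"] = pd.to_datetime(clean["sale_date"], errors="coerce")
    daily_totals = clean.groupby("sale_date", as_index=False)["amount"].sum()

    # Load: write the transformed data into a sink (SQLite stands in for the database)
    with sqlite3.connect("warehouse.db") as conn:
        daily_totals.to_sql("daily_sales", conn, if_exists="replace", index=False)

if __name__ == "__main__":
    run_etl()
</code></pre>
<p>In production, a script like this would be scheduled by an orchestrator rather than run by hand.</p>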
<h2 id="heading-insights-for-stakeholders-bi"><strong>Insights for Stakeholders (BI)</strong></h2>
<p>Business Intelligence, abbreviated as BI, is software used to accumulate, process, analyze, and visually represent large sets of data. BI tools are used to inform decision-making and drive business outcomes. As fascinating as it sounds, they are a great invention, as they make it possible for everyone to observe and make sense of their own data.</p>
<p><img src="https://lh3.googleusercontent.com/JmvtSYdJhQ8HvHyPy1LVQPsDvYCoNKOE_H0QTwER5WW-ml30mmrXeMtBU9NVRFACCcRInxsDLT8Z9menGfOE76wwm7kE_Wb_D-IfaptNBj89V87Pb4_PxLeMVKdgwtrHc55gOrn0CzQlpPfL_GkNu82LgKYtKx04wGrNDOFCrdHhX0Xg_PubkoHjqMvsfw" alt class="image--center mx-auto" /></p>
<p>Photo Credit: <a target="_blank" href="https://www.pexels.com/@tima-miroshnichenko/">Tima Miroshnichenko/Pexels</a></p>
<p>BI tools are made for end-users like stakeholders and analysts who get access to the recorded insights. Using BI tools, they are able to track KPIs (Key Performance Indicators) and trends and make decisions on the basis of well-curated data. Some popular BI tools are PowerBI, Tableau, and Google Data Studio. These tools make it easy to create charts, graphs, and maps, and the results can easily be exported to Excel sheets for future use.</p>
<h2 id="heading-data-warehousing"><strong>Data Warehousing</strong></h2>
<p>We studied how ETL helps in consistently pushing the data forward each day. But there is a limit to how much data can be stored in a database like MySQL, and for how long. That’s when Data Warehouse comes into the picture.</p>
<p><img src="https://lh3.googleusercontent.com/CcEzV-7CYzxa3zaUqkL_lQLzL4lEradDGYolIS3P3PPi-Ws8-Pl5bg3chCH-Y3llgvhY1T_-XcQ3LkgIk7hxyopFquKwifcdY5kdwppPgMG24Q7Ad12K0WcOLGbAfWZ3z2Cq2FQG-sHXeKP0UXoiWkdhDcgGuQuGgx_BCMe2zPOi5ClUJrIE3fO-zPpzRQ" alt class="image--center mx-auto" /></p>
<p>As its name suggests, this system works as a warehouse for data in large amounts, often historical data. Data stored in data warehouses is structured with analytical workloads in mind; the schema is usually denormalized so insights can be fetched without a lot of <code>joins</code>. A data warehouse is also termed an OLAP (Online Analytical Processing) system.</p>
<p>It mainly focuses on analytics, supporting heavy queries and analysis. In short, it is another tool that helps big organizations make mindful, strategic decisions by querying their historical records.</p>
<h2 id="heading-elt-data-lake-used-by-data-scientists"><strong>ELT/ Data Lake used by Data Scientists</strong></h2>
<p>The data warehouse is built to support the current business requirements; it contains structured, well-designed schemas that business users rely on to track KPIs and metrics.</p>
<p>Data scientists, however, need to build ML models that make forward-looking predictions from existing data. Their job is to hunt down every scrap of data they can find, so they are also interested in unstructured data such as logs and event data, which is not part of the warehouse.</p>
<p><img src="https://lh3.googleusercontent.com/SahP4-_jc8Dezpk8QwXLlLIkgd8Bjc6aR_pOY6ofeyjsYGzNJ52P7E5EidYKoK753W6eB2EXbTdmvVYQYbxlU4DS7Tana1d_n97hsH_6Glmy4QwgUeJcyvk97z9h6SlH5HyMeLl_IWtjQUcmkVtnyToqMYm6WMlM4IMDwlvUSzwD-JLEZkLPRafA-vyANw" alt="Illustration of ELT" class="image--center mx-auto" /></p>
<p>But to make all this possible, a data engineer has to do some work. Instead of ETL, the pipeline becomes ELT: extraction happens first, and then the data is loaded into a data lake. The data lake stores all the raw data without processing it, because data scientists need to see every column. This data is usually stored in blob storage like S3, HDFS, or GCS. Finally, the data scientists transform the data themselves, typically in Jupyter notebooks, to get what they need out of it.</p>
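<p>As a small, hypothetical illustration of the data-scientist side of ELT: Pandas can read raw files straight out of a data lake bucket inside a notebook (reading <code>gs://</code> or <code>s3://</code> paths requires the gcsfs or s3fs extras to be installed). The bucket path and column names below are made up.</p>
<pre><code>import pandas as pd

# Raw, untransformed events pulled straight from a (hypothetical) data lake path
events = pd.read_parquet("gs://my-data-lake/raw/events/2022-12-01/events.parquet")

# Nothing was dropped upstream, so every column is still there to explore
print(events.columns.tolist())

# The "T" in ELT happens here, after loading: keep only what the model needs
features = events[["user_id", "event_type", "event_ts"]].dropna()
</code></pre>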
<h2 id="heading-big-data-andamp-computing-spark"><strong>Big Data &amp; Computing Spark</strong></h2>
<p>Big Data was one of the most intriguing, overused buzzwords back then when it was first introduced in the world of data technology. But what do we mean when we use the term?</p>
<p>To be concise, the data which cannot be processed/used in a single server is called Big Data. But there is more to it!</p>
<p>For data to be classified as big data, it should exhibit the 4 Vs:</p>
<ol>
<li><p>Volume</p>
</li>
<li><p>Variety</p>
</li>
<li><p>Veracity (accuracy)</p>
</li>
<li><p>Velocity (speed)</p>
</li>
</ol>
<p><img src="https://lh4.googleusercontent.com/dDov_7aVoR00D089muH2HL9ffi86KxFiGNwvNlbttM3HgdTdpULZM5C58_z1F_jTgybMnJ-rrhytVTglNk7ooVx09tL12k-7xaF3NIDOTknteqfXrXjtvOugrNQ60Ta1K9Vyg-RvHpRRvBQvqwZLGKNCvSfeLMKOhhcNEj2qTDbBTWWBSYDCYdydXTNXyg" alt="Illustration of 4 Vs in Big Data by author" class="image--center mx-auto" /></p>
<p>Some of the key areas where <strong>big data</strong> comes to use are:</p>
<ul>
<li><p>Ecommerce websites processing thousands of sales and logistics transactions</p>
</li>
<li><p>Payment Systems</p>
</li>
<li><p>Financial Institutions</p>
</li>
<li><p>Blockchain Exchanges</p>
</li>
<li><p>Streaming Services like Youtube</p>
</li>
</ul>
<p>Petabytes of data cannot be stored on a single server, so data of this size has to be distributed across many machines in the cloud. For this purpose there are open-source frameworks like Apache Hadoop, which efficiently store and process data that is huge in volume. Such groups of servers are known as clusters, because you can keep adding as much storage and compute as you need.</p>
<p><img src="https://lh6.googleusercontent.com/ocpTSwwMZtGQIKL_AeIWZVYTn_NdJvM39c7ksPXojIih7bfVbfNbhBCn3IoAYvBB5qzB2gQGpXldZEFkqoeLNDFlnpfnXr-xUBlHfoCS4uEyiH4sNE7J7J2kvCmDBZ6MQr6ONAgydrQ8bvjCLxm2hb_xNNxGEu0AEBMlDg902Otd1PLlKHzUAChacSE90A" alt="Big Data ecosystem" class="image--center mx-auto" /></p>
<p>Cloud object stores such as GCS and S3 are other great options for this kind of storage, and they are even more resilient. Distributed storage like this provides scalability and redundancy, so the data can still be retrieved if a server crashes. On top of storage, dedicated technologies handle the distributed computing and streaming of this data; Spark and Kafka are good examples!</p>
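<p>To give a feel for the distributed-computing side, here is a minimal PySpark sketch of an aggregation job; the input path and column names are hypothetical, but the same code runs on a laptop or on a multi-node cluster.</p>
<pre><code>from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("big-data-demo").getOrCreate()

# Spark splits the input files across the cluster's executors automatically
events = spark.read.json("hdfs:///data/raw/events/")  # hypothetical path

daily_counts = (
    events.groupBy(F.to_date("event_ts").alias("day"))
    .count()
    .orderBy("day")
)

daily_counts.write.mode("overwrite").parquet("hdfs:///data/curated/daily_counts/")
</code></pre>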
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>The world of data engineering is huge and includes major components of data science as well. This means that data engineering and data science are not contrasting but complementary to each other. Data engineers design and build the pipelines that transform and transport data into the desired format, while data scientists use that data to extract as much value as possible for the business and its stakeholders.</p>
<p>However, in order to complement each other’s efforts, both data engineers and data scientists are supposed to learn data literacy skills and must be well aware of their respective contributions to the system. This is how any business organization flourishes and performs well in the market, understands the likes &amp; preferences of its consumer base, and makes important decisions that are crucial for growth.</p>
<h2 id="heading-looking-for-more-information">Looking for more information</h2>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=cAJCcpiVpOY">https://www.youtube.com/watch?v=cAJCcpiVpOY</a></div>
]]></content:encoded></item><item><title><![CDATA[Warehousing with Google’s Big Query]]></title><description><![CDATA[Data, in the modern world, is decentralized and is being generated and collected at a record pace. To ensure that this data is collected and processed in a manner that enables businesses and organizations to achieve their business goals, specialized ...]]></description><link>https://anujsyal.com/warehousing-with-googles-big-query</link><guid isPermaLink="true">https://anujsyal.com/warehousing-with-googles-big-query</guid><category><![CDATA[Databases]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data]]></category><category><![CDATA[data analysis]]></category><category><![CDATA[google cloud]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Thu, 24 Mar 2022 06:01:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/BNBA1h-NgdY/upload/v1648015810341/sxTshnNLX.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Data, in the modern world, is decentralized and is being generated and collected at a record pace. To ensure that this data is collected and processed in a manner that enables businesses and organizations to achieve their business goals, specialized and optimized tools are required. ‘The right solution will enable businesses and organizations to store data with resiliency, and swiftly analyze large amounts of data such that it can be used to achieve business outcomes, with decisions powered by data and analytics. </p>
<p>Google’s BigQuery is one such tool. Google’s proprietary BigQuery is a serverless multi-cloud data warehouse that is highly scalable, cost-effective, and specifically designed for offering superior business agility. It democratizes your data-driven insights with built-in machine learning, powered by a flexible and end-to-end multi-cloud analytics solution. In addition to state-of-the-art machine learning, BigQuery also enables lower TCO at scale by almost 26-34% as compared to alternatives. Furthermore, BigQuery adapts to your data with zero operational overhead. </p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1648022051095/yc5gh9JTu.jpg" alt="taylor-vick-M5tzZtFCOfs-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@tvick?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Taylor Vick</a> on <a href="https://unsplash.com/s/photos/big-data?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>BigQuery’s architecture is built for big data. It works optimally when it is fed several petabytes of data to be cleaned, processed, and analyzed. BigQuery removes the need to provision anything before running interactive, ad-hoc queries against massive read-only datasets. </p>
<h2 id="heading-bigquery-has-the-following-hierarchical-structure">BigQuery has the following hierarchical structure:</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1648022137561/tTQu9NcSI.jpg" alt="eta-qRmq4tXM9sI-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@etaplus?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">ETA+</a> on <a href="https://unsplash.com/s/photos/hierarchy?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<h4 id="heading-projects">Projects</h4>
<p>In the context of BigQuery, all resources are contained within a project. Since storage and compute are decoupled in BigQuery, the projects that store data and those that query data can be separate. </p>
<h4 id="heading-datasets">Datasets</h4>
<p>You can utilize datasets to organise BigQuery tables and views. A dataset is bound to a location that may be regional (a specific geographical place) or multi-regional (a region that contains two or more geographical places). The location of a dataset can only be defined at the time of its creation.</p>
<h4 id="heading-tables">Tables</h4>
<p>BigQuery tables hold your data. Each table is defined by a schema (data types, column names, and other information). There are different types of tables: native tables backed by BigQuery storage, external tables that live on storage outside BigQuery, and views, which are virtual tables defined by SQL queries.</p>
<h4 id="heading-jobs">Jobs</h4>
<p>The actions that BigQuery runs, such as loading, exporting, querying, or copying data, are referred to as jobs. A job runs in the location of the dataset it operates on. </p>
<h2 id="heading-key-features-of-bigquery">Key Features of BigQuery:</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1648022233117/qxWWZh00B.jpg" alt="adam-smigielski-ZSct3GqtTL0-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@smigielski?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Adam Śmigielski</a> on <a href="https://unsplash.com/s/photos/attributes?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<h4 id="heading-predictive-modelling-with-bigquery-ml">Predictive Modelling with BigQuery ML</h4>
<p>BigQuery ML enables data analysts and data scientists to build machine learning models on structured or semi-structured datasets that can be several petabytes in size. All of this is achieved through simple SQL in minimal time. </p>
<h4 id="heading-bigquery-omni-for-multicloud-data-analytics">BigQuery Omni for Multicloud data analytics</h4>
<p>BigQuery Omni allows you to analyze data across multiple clouds, such as Azure and AWS, as a fully managed, end-to-end data analytics solution focused on saving costs and securing data. </p>
<h4 id="heading-bigquery-bi-engine-for-interactive-data-analytics">BigQuery BI Engine for Interactive Data Analytics</h4>
<p>With its highly optimized, in-memory analysis service, BigQuery BI Engine enables data analysts to obtain actionable insights from massive and complex datasets with a sub-second query response time and high scalability through high concurrency. </p>
<h3 id="heading-bigquery-gis-with-geospatial-analysis">BigQuery GIS with Geospatial Analysis</h3>
<p>As a unique feature, combine geospatial analysis with BigQuery’s serverless architecture in order to improve and augment your analytics workflows with location-based intelligence. Simplify your analyses and visualize your special data to unlock new potential for your business</p>
<h2 id="heading-warehousing-in-bigquery">Warehousing in BigQuery</h2>
<p>If you are interested in a step-by-step guide, check out this YouTube video</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/_Wm_GYO-r_Q">https://youtu.be/_Wm_GYO-r_Q</a></div>
<blockquote>
<p>Warehousing with Google's Big Query from Anuj Syal</p>
</blockquote>
<h4 id="heading-loading-dataset-in-bigquery">Loading dataset in BigQuery</h4>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1648022421522/YaRO45HAP.png" alt="image.png" /></p>
<blockquote>
<p>Screenshot from <a target="_blank" href="https://cloud.google.com/">Google Cloud</a></p>
</blockquote>
<p>Big query provides multiple options for you to load the data:</p>
<ul>
<li>Use a pre-existing connector, eg youtube analytics, google analytics</li>
<li>Google Cloud Storage</li>
<li>Big Query Console</li>
<li>Big Query CLI</li>
<li>Using the Python client libraries (see the sketch below)</li>
</ul>
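<p>For the last option, here is a minimal sketch using the official <code>google-cloud-bigquery</code> Python client to load a CSV from Cloud Storage into a table; the project, dataset, table, and bucket names are placeholders.</p>
<pre><code>from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP credentials and project

# Hypothetical fully-qualified destination table and source file
table_id = "my-project.my_dataset.orders"
uri = "gs://my-bucket/exports/orders.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish

print(client.get_table(table_id).num_rows, "rows loaded")
</code></pre>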
<h4 id="heading-public-datasets">Public Datasets</h4>
<p>Google Cloud <a target="_blank" href="https://cloud.google.com/solutions/datasets">Public Datasets</a> offer a powerful data repository of more than 200 high-demand public datasets from different industries. The program also covers the storage cost of these datasets and gives you 1 TB of free queries per month if you intend to use them. </p>
<h4 id="heading-exploring-ecommerce-public-dataset-on-big-query">Exploring Ecommerce Public Dataset on Big Query</h4>
<p>If you are familiar with simple SQL, Big Query allows you to explore its biggest public datasets for free. So, as an example, let's check out this publicly available Ecommerce dataset:</p>
<ul>
<li><strong>About the Dataset</strong>
The dataset provides <a target="_blank" href="https://console.cloud.google.com/marketplace/product/obfuscated-ga360-data/obfuscated-ga360-data">Google Analytics 360 data from the Google Merchandise Store</a> , a real ecommerce store that sells Google-branded merchandise, in BigQuery. We will explore the all_sessions table</li>
</ul>
<ol>
<li>Query: Total unique visitors<pre><code><span class="hljs-keyword">SELECT</span>
<span class="hljs-keyword">COUNT</span>(*) <span class="hljs-keyword">AS</span> product_views,
<span class="hljs-keyword">COUNT</span>(<span class="hljs-keyword">DISTINCT</span> fullVisitorId) <span class="hljs-keyword">AS</span> unique_visitors
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`data-to-insights.ecommerce.all_sessions`</span>;
</code></pre></li>
</ol>
<p>Out: </p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>product_views</td><td>unique_visitors</td></tr>
</thead>
<tbody>
<tr>
<td>21493109</td><td>389934</td></tr>
</tbody>
</table>
</div><ul>
<li>Query: Total unique visitors by Channel grouping</li>
</ul>
<pre><code><span class="hljs-keyword">SELECT</span>
  <span class="hljs-keyword">COUNT</span>(<span class="hljs-keyword">DISTINCT</span> fullVisitorId) <span class="hljs-keyword">AS</span> unique_visitors,
  channelGrouping
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`data-to-insights.ecommerce.all_sessions`</span>
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> channelGrouping
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> channelGrouping <span class="hljs-keyword">DESC</span>;
</code></pre><p>Out:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>unique_visitors</td><td>channelGrouping</td></tr>
</thead>
<tbody>
<tr>
<td>38101</td><td>Social</td></tr>
<tr>
<td>57308</td><td>Referral</td></tr>
<tr>
<td>11865</td><td>Paid Search</td></tr>
<tr>
<td>211993</td><td>Organic Search</td></tr>
<tr>
<td>3067</td><td>Display</td></tr>
<tr>
<td>75688</td><td>Direct</td></tr>
<tr>
<td>5966</td><td>Affiliates</td></tr>
<tr>
<td>62</td><td>(Other)</td></tr>
</tbody>
</table>
</div><ul>
<li>Query: Top Five products with the most views</li>
</ul>
<pre><code><span class="hljs-keyword">SELECT</span>
  <span class="hljs-keyword">COUNT</span>(*) <span class="hljs-keyword">AS</span> product_views,
  (v2ProductName) <span class="hljs-keyword">AS</span> ProductName
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`data-to-insights.ecommerce.all_sessions`</span>
<span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">type</span> = <span class="hljs-string">'PAGE'</span>
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> v2ProductName
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> product_views <span class="hljs-keyword">DESC</span>
<span class="hljs-keyword">LIMIT</span> <span class="hljs-number">5</span>;
</code></pre><p>Out :</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>product_views</td><td>ProductName</td></tr>
</thead>
<tbody>
<tr>
<td>316482</td><td>Google Men's 100% Cotton Short Sleeve Hero Tee White</td></tr>
<tr>
<td>221558</td><td>22 oz YouTube Bottle Infuser</td></tr>
<tr>
<td>210700</td><td>YouTube Men's Short Sleeve Hero Tee Black</td></tr>
<tr>
<td>202205</td><td>Google Men's 100% Cotton Short Sleeve Hero Tee Black</td></tr>
<tr>
<td>200789</td><td>YouTube Custom Decals</td></tr>
</tbody>
</table>
</div><ul>
<li>Query: Top Five products with the most unique views</li>
</ul>
<pre><code><span class="hljs-comment">#&gt; You can use the SQL `WITH`</span>
<span class="hljs-comment">#&gt; clause to help break apart a complex query into multiple steps.</span>
<span class="hljs-keyword">WITH</span> unique_product_views_by_person <span class="hljs-keyword">AS</span> (
<span class="hljs-comment">-- find each unique product viewed by each visitor</span>
<span class="hljs-keyword">SELECT</span>
 fullVisitorId,
 (v2ProductName) <span class="hljs-keyword">AS</span> ProductName
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`data-to-insights.ecommerce.all_sessions`</span>
<span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">type</span> = <span class="hljs-string">'PAGE'</span>
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> fullVisitorId, v2ProductName )
<span class="hljs-comment">-- aggregate the top viewed products and sort them</span>
<span class="hljs-keyword">SELECT</span>
  <span class="hljs-keyword">COUNT</span>(*) <span class="hljs-keyword">AS</span> unique_view_count,
  ProductName
<span class="hljs-keyword">FROM</span> unique_product_views_by_person
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> ProductName
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> unique_view_count <span class="hljs-keyword">DESC</span>
<span class="hljs-keyword">LIMIT</span> <span class="hljs-number">5</span>
</code></pre><p>Out: </p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>unique_view_count</td><td>ProductName</td></tr>
</thead>
<tbody>
<tr>
<td>152358</td><td>Google Men's 100% Cotton Short Sleeve Hero Tee White</td></tr>
<tr>
<td>143770</td><td>22 oz YouTube Bottle Infuser</td></tr>
<tr>
<td>127904</td><td>YouTube Men's Short Sleeve Hero Tee Black</td></tr>
<tr>
<td>122051</td><td>YouTube Twill Cap</td></tr>
<tr>
<td>121288</td><td>YouTube Custom Decals</td></tr>
</tbody>
</table>
</div><ul>
<li>Final Query: Total number of distinct products ordered and the total number of units ordered</li>
</ul>
<pre><code><span class="hljs-keyword">SELECT</span>
  <span class="hljs-keyword">COUNT</span>(*) <span class="hljs-keyword">AS</span> product_views,
  <span class="hljs-keyword">COUNT</span>(productQuantity) <span class="hljs-keyword">AS</span> orders,
  <span class="hljs-keyword">SUM</span>(productQuantity) <span class="hljs-keyword">AS</span> quantity_product_ordered,
  v2ProductName
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`data-to-insights.ecommerce.all_sessions`</span>
<span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">type</span> = <span class="hljs-string">'PAGE'</span>
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> v2ProductName
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> product_views <span class="hljs-keyword">DESC</span>
<span class="hljs-keyword">LIMIT</span> <span class="hljs-number">5</span>;
</code></pre><p>Out:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>product_views</td><td>orders</td><td>quantity_product_ordered</td><td>v2ProductName</td></tr>
</thead>
<tbody>
<tr>
<td>316482</td><td>3158</td><td>6352</td><td>Google Men's 100% Cotton Short Sleeve Hero Tee White</td></tr>
<tr>
<td>221558</td><td>508</td><td>4769</td><td>22 oz YouTube Bottle Infuser</td></tr>
<tr>
<td>210700</td><td>949</td><td>1114</td><td>YouTube Men's Short Sleeve Hero Tee Black</td></tr>
<tr>
<td>202205</td><td>2713</td><td>8072</td><td>Google Men's 100% Cotton Short Sleeve Hero Tee Black</td></tr>
<tr>
<td>200789</td><td>1703</td><td>11336</td><td>YouTube Custom Decals</td></tr>
</tbody>
</table>
</div><p>This process of running successive queries helps us derive insights from data using SQL. Technically, the data could be petabytes in size and the process would remain much the same.</p>
<h2 id="heading-benefits-of-bigquery">Benefits of BigQuery:</h2>
<h4 id="heading-superior-insights-with-predictive-analytics">Superior Insights with Predictive Analytics</h4>
<p>Get updated analytics and information on all your business processes by querying streaming data in real-time. Utilize these insights to make data-driven decisions for your business and effectively predict business outcomes without moving data across. </p>
<h4 id="heading-share-insights-seamlessly">Share Insights Seamlessly</h4>
<p>Share and access analytics and insights securely from within your organization, enabling stakeholders to develop insightful reports and dashboards using BI tools right out of the box.</p>
<h4 id="heading-enhanced-security-for-your-data">Enhanced Security for your Data</h4>
<p>Experience enhanced data resiliency, robust security, and reliability controls backed by a 99.99% uptime SLA, ensuring that your data is protected, secure, and unreachable to unauthorized and unauthenticated access. </p>
<h4 id="heading-provisioning-and-system-sizing">Provisioning and System Sizing</h4>
<p>Unlike many relational database management systems (RDBMS), Google BigQuery dynamically allocates query resources as you consume them and deallocates resources as data is deleted or tables are dropped. Furthermore, allocated resources match the query type and complexity. </p>
<h4 id="heading-storage-management">Storage Management</h4>
<p>BigQuery utilizes a proprietary storage format called Capacitor. It is columnar in nature and holds many benefits, including the fact that it can evolve with the query engine. Access patterns are used to determine the most optimal number of shards of data and how they are encoded for storage. The data that BigQuery queries can either live in BigQuery’s own storage on Google's Colossus platform or outside of BigQuery storage in the cloud, for example on Google Drive. </p>
<h4 id="heading-maintenance">Maintenance</h4>
<p>BigQuery receives constant updates from its engineering team. These upgrades cause little to no downtime to BigQuery’s operations, ensuring optimal performance and minimal disruption as you collect essential insights for your business goals. </p>
<h4 id="heading-backup-and-recovery">Backup and Recovery</h4>
<p>Database administrators have always found backup and recovery to be extremely tedious and complex tasks. Costs rise as there is almost always a need for additional licenses and hardware. With BigQuery, backup and recovery is handled at the service level. BigQuery maintains a complete seven-day history of changes against your tables and lets you write specific queries to point-in-time snapshots of your data. If a table is deleted, its history is removed after a period of seven days. </p>
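<p>One way to query such a point-in-time snapshot is BigQuery's time-travel syntax (<code>FOR SYSTEM_TIME AS OF</code>). Here is a minimal sketch through the Python client; the table name is hypothetical.</p>
<pre><code>from google.cloud import bigquery

client = bigquery.Client()

# Read the (hypothetical) table as it looked one hour ago
query = """
    SELECT COUNT(*) AS row_count
    FROM `my-project.my_dataset.orders`
      FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
"""
for row in client.query(query).result():
    print(row.row_count)
</code></pre>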
<h4 id="heading-monitoring-and-auditing">Monitoring and Auditing</h4>
<p>Using BigQuery metrics, you can monitor how BigQuery is behaving through various charts and alerts. To take a proactive approach to system health, you can create alerts that trigger based on thresholds you define. BigQuery also creates various logs, including audit logs of actions made by users. </p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>As is apparent, BigQuery provides powerful enablement for your business and decision-making through optimized data processing, smart data insights, and resiliency in how this data is stored. It is a powerful tool that allows your organization to use data to its advantage. </p>
]]></content:encoded></item><item><title><![CDATA[Data Lake VS Data Warehouse]]></title><description><![CDATA[Data Lakes and Data Warehouses are used widely to store large amounts of data. However, they are not interchangeable terms. You will be surprised to know that both of these approaches are complementary to one another.
Let’s know about these two terms...]]></description><link>https://anujsyal.com/data-lake-vs-data-warehouse</link><guid isPermaLink="true">https://anujsyal.com/data-lake-vs-data-warehouse</guid><category><![CDATA[big data]]></category><category><![CDATA[data analysis]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Databases]]></category><category><![CDATA[data]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Mon, 17 Jan 2022 03:21:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/itZ0oxI2CCY/upload/v1642389531037/qAzWXRORG.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Data Lakes and Data Warehouses are used widely to store large amounts of data. However, they are not interchangeable terms. You will be surprised to know that both of these approaches are complementary to one another.
Let’s explore these two terms in depth in the segments below. </p>
<h2 id="heading-introduction-to-data-lake">Introduction to Data Lake</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1642389577788/vlTCX_FLfC.jpeg" alt="aaron-burden-aRya3uMiNIA-unsplash.jpg" />
Photo by <a href="https://unsplash.com/@aaronburden?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Aaron Burden</a> on <a href="https://unsplash.com/s/photos/lake?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
<p>A data lake is known to be a centralized repository. It enables you to accumulate all of your structured and unstructured data, and it does so at any scale. It allows you to store your data in raw, unstructured form and to run various types of analytics on it, from visualizations and dashboards to big data processing and machine learning, guiding you towards better decisions.</p>
<blockquote>
<p>Data  Lake is less structured, more like a lake where you dump everything first then find out usage later</p>
</blockquote>
<h2 id="heading-why-does-an-enterprise-need-a-data-lake">Why does an enterprise need a data lake?</h2>
<p>Organizations and firms that successfully generate business value from their data will outperform their peers. In various surveys, it was noted that organizations that implemented a data lake outperformed similar companies by around 10% in organic revenue growth. These firms were able to run new kinds of analytics on data such as clickstreams, social media, log files, and internet-connected devices stored in the data lake. </p>
<p>Ultimately, it helped them recognize and act on opportunities for faster business growth by attracting and retaining customers, increasing productivity, and making better-informed decisions. </p>
<h2 id="heading-what-value-does-a-data-lake-hold-in-an-enterprise">What value does a data lake hold in an enterprise?</h2>
<p>The ability to store plenty of data from tons of sources in a short time, and to empower users to combine and examine that data in various ways, often leads to better and quicker decision-making. The following are some instances that make it clear:</p>
<ul>
<li><strong>Enhanced Customer Interactions</strong>:</li>
</ul>
<p>A data lake can combine customer data from a CRM platform with social media analytics, a marketing platform that includes purchase history, and incident tickets. This lets the business identify its most valuable and promising customer cohorts, the reasons behind customer churn, and the rewards and other promotional activities that will improve loyalty. </p>
<ul>
<li><strong>Enhanced R&amp;D Innovation Choices</strong>:</li>
</ul>
<p>A data lake enables your R&amp;D teams to test their hypotheses, refine assumptions, and analyze results accordingly. That can mean selecting the right materials during product design for faster performance, or genomic research that eventually leads to better medication. </p>
<ul>
<li><strong>Improved Operational Efficiencies</strong>:
The Internet of Things (IoT) introduces more ways to collect data on processes like production, with live data arriving from internet-connected devices. A data lake makes it much easier to store and run analytics on this machine-generated IoT data, resulting in reduced operational cost and improved quality.</li>
</ul>
<h2 id="heading-positioning-data-lakes-in-the-cloud">Positioning Data Lakes in the Cloud</h2>
<p>Data lakes are an ideal workload to deploy in the cloud, as the cloud offers performance, reliability, scalability, availability, and a broad, diverse set of analytics engines.</p>
<p>Moreover, the major reasons customers see the cloud as an advantage for data lakes are better security, faster time to availability and deployment, more frequent feature updates, wider geographic coverage, elasticity, and costs tied to actual utilization.</p>
<blockquote>
<p>A good example for a Data Lake is Google Cloud Storage or Amazon S3</p>
</blockquote>
<h2 id="heading-introduction-to-data-warehouse">Introduction to Data Warehouse</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1642386035393/KvVawA_xv.jpeg" alt="joshua-tsu-x6vDHnMNJFw-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@joshdatsu?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Joshua Tsu</a> on <a href="https://unsplash.com/s/photos/water-tank?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>Data Warehouse is a central repository of information that is enabled to be analyzed in order to make informed decisions. Typically, the data flows into a data warehouse from transactional systems and other sources. </p>
<blockquote>
<p>Data Warehouse is more structured, more like a water tank where you define usage first then put in the data</p>
</blockquote>
<h2 id="heading-how-does-a-data-warehouse-work">How does a data warehouse work?</h2>
<p>You may find multiple databases in a data warehouse. Within each database, data is organized into tables and columns, and for each column a description of the data (such as its type) can be defined. Tables can be organized inside schemas, which you can think of as folders. Finally, when data is ingested, it is simply stored in the appropriate tables. </p>
<h2 id="heading-why-is-data-warehouse-important">Why is Data Warehouse important?</h2>
<p>Like a data lake, a data warehouse holds great value when it comes to informed decision-making. Not only that, it also consolidates data from plenty of sources. In addition, historical data analysis, data quality, accuracy, and consistency are some of the elements a data warehouse brings. Furthermore, separating analytics processing from transactional databases ultimately enhances the performance of both systems.</p>
<blockquote>
<p>A good example for Data Warehouse is Google's Big Query or Amazon Redshift</p>
</blockquote>
<h2 id="heading-data-lakes-and-data-warehouses-have-two-different-approaches-heres-how">Data Lakes and Data Warehouses have two different approaches- Here’s how</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1642389095821/pjsgLDzHr.jpeg" alt="oliver-roos-PCNdauVPbjA-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@fairfilter?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Oliver Roos</a> on <a href="https://unsplash.com/s/photos/two-paths?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>Depending on its needs, an organization will often want both a data warehouse and a data lake, because they serve different needs and use cases. </p>
<p>A data warehouse is quite different from a data lake. A data warehouse is a database optimized to analyze relational data arriving from transactional systems and line-of-business applications. </p>
<p>A data lake, on the other hand, serves different purposes: it stores relational data from line-of-business applications as well as data from mobile applications, social media, and IoT devices. In other words, it stores all of your data without requiring the schema to be designed up front.</p>
<p>Moreover, data warehouses are primarily used for batch reporting, visualizations, and BI analytics on structured data, whereas a data lake can be used for machine learning, data discovery, predictive analytics, and profiling over large amounts of data.</p>
<p>Organizations with data warehouses are seeing the perks of data lakes, and to maximize the benefits they are evolving their warehouses to include data lakes as well. This brings not only more diverse query capabilities but also advanced ways of discovering new insights.</p>
<h2 id="heading-how-do-data-lake-and-data-warehouse-work-together">How do data lake and data warehouse work together?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1642389355869/gP_VmxVYr.png" alt="image.png" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@pawankawan?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Pawan Kawan</a> on <a href="https://unsplash.com/s/photos/gears?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>These approaches are complementary to one another. The data warehouse structures and packages the data for quality, consistency, and performance with significantly increased concurrency. The data lake, on the other hand, focuses on keeping the original raw data in permanent storage. It does so at a reasonable cost while offering a new level of analytical agility. </p>
<p>These two different yet complementary solutions are recommended to be a part of any modern data architecture. </p>
]]></content:encoded></item><item><title><![CDATA[Why Get a Cloud Certificate in Data Engineering?]]></title><description><![CDATA[Learn the core concepts of data engineering to seek a job with professional data engineering skills, which will be in demand in 2022. Find which course suits you the best! 
What can you earn with a professional certificate?


Photo by Sasun Bughdarya...]]></description><link>https://anujsyal.com/why-get-a-cloud-certificate-in-data-engineering</link><guid isPermaLink="true">https://anujsyal.com/why-get-a-cloud-certificate-in-data-engineering</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Certification]]></category><category><![CDATA[google cloud]]></category><category><![CDATA[big data]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Thu, 02 Dec 2021 05:30:34 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/QiLPQeQSXD0/upload/v1638415158572/b7VyRerHF.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Learn the core concepts of data engineering to seek a job with professional data engineering skills, which will be in demand in 2022. Find which course suits you the best! </p>
<h2 id="heading-what-can-you-earn-with-a-professional-certificate">What can you earn with a professional certificate?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1638422650793/VHAAksTuN.jpeg" alt="sasun-bughdaryan-OyDZRZOlENw-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@sasun1990?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Sasun Bughdaryan</a> on <a href="https://unsplash.com/s/photos/earn?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>A professional certificate allows you to prepare for your upcoming job and gain skills that are adequate to secure a role and make a mark in the industry. These courses are based on developing skills rather than just reading theory. 
Once the program is completed, you will receive a certification after finishing the projects, which will help you secure a job. 
You will also gain relevant hands-on experience, which allows you to kick-start your career with a bang! </p>
<h2 id="heading-professional-data-engineering-certification">Professional Data Engineering Certification</h2>
<p>A professional data engineering certification allows you to learn the core concepts of big data and machine learning. It enables you to understand how to employ BigQuery for interactive insights on your data. </p>
<p>With a professional data engineer course, you will migrate existing SQL and Hadoop workloads to the cloud. It will also teach you various data processing techniques to engineer data. </p>
<h2 id="heading-different-options-for-certification">Different options for certification</h2>
<p>First, you need to select the type of certification that you require! Different cloud providers have learning and certification tracks of their own. Below are the certification tracks for the top three public cloud providers that offer certification around the data engineering skillset:</p>
<ol>
<li><a target="_blank" href="https://aws.amazon.com/certification/certified-data-analytics-specialty/?ch=tile&amp;tile=getstarted">AWS </a></li>
<li><a target="_blank" href="https://docs.microsoft.com/en-us/learn/certifications/roles/data-engineer">Azure </a></li>
<li><a target="_blank" href="https://cloud.google.com/certification/data-engineer">Google</a></li>
</ol>
<p>These are all different service providers, and you can select the certification according to your preference.</p>
<blockquote>
<p>I personally have started preparing for Google Cloud Professional Data Engineer Certification</p>
<p>One of the best parts of choosing Google is the exposure to ML and AI capabilities, which are more advanced than those of other cloud providers. Also, the learning track involves hands-on labs, which help you get accustomed to solving real-life problems.</p>
</blockquote>
<h2 id="heading-responsibilities-of-data-engineering">Responsibilities of Data Engineering</h2>
<p>A data engineer works to process and engineer data for operational and analytical purposes. The data engineer then cleanses and structures the data so that it can be used analytically. 
One of the significant responsibilities of a data engineer is to make the data easier to handle and to optimize the big data ecosystem within the organization. Here are some organizations that use big data insights, and you could become a part of them:</p>
<blockquote>
<p>1) Uber  2) Netflix  3) Starbucks  4) BDO  5) T-mobile  6) Facebook  7) Spotify  8) Amazon </p>
</blockquote>
<p>Most popular companies and organizations are using data analytics to provide their customers with relevant results. Taking a Cloud certification can be the first step towards your dream job.</p>
<h2 id="heading-why-to-study-for-a-certification">Why to study for a certification?</h2>
<p>A certification helps you stand out from other candidates applying for a job. It can help you secure a position if you are a fresher, and it can even help you earn promotions! Yes, if you want job promotions, then you are on the right track. </p>
<blockquote>
<p><strong>Best way to acquire Data Engineering Skillset</strong></p>
</blockquote>
<p>A professional data engineering certification is one of the best ways to land in 
the data engineering space. Most importantly, you will earn hands-on data engineering skills rather than a degree with only theoretical knowledge. </p>
<p><strong>Here is how you can prepare for a professional data engineer certification. </strong></p>
<h2 id="heading-preparing-for-a-google-data-engineering-certification">Preparing for a Google Data Engineering Certification</h2>
<p>If you are interested in a video version for preparation check out this youtube video</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/blCMQRhZgso">https://youtu.be/blCMQRhZgso</a></div>
<p>It is necessary to prepare for the professional data engineering exam to help you achieve a certification that will lead to a data engineering career. 
This Google Cloud Certification will examine the following abilities: </p>
<ul>
<li>Designing data processing systems </li>
<li>Ensuring solution quality </li>
<li>Operationalizing machine learning models </li>
<li>Building and operationalizing data processing systems </li>
</ul>
<p>If you are interested in what a job scope for a data engineer will look like, check out this <a target="_blank" href="https://careers.google.com/jobs/results/81200290256036550-data-engineer-users-and-products/"> job post from google</a> . This should also give a sneak peek of <strong>A day in life of data engineer</strong></p>
<h2 id="heading-complete-exam-guide-for-google-cloud-certification">Complete Exam Guide for Google Cloud Certification</h2>
<p>Follow this complete exam guide for Google Cloud Certifications.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1638422519002/b-wk2FHqP.jpeg" alt="daniil-silantev-ioYwosPYC0U-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@betagamma?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Daniil Silantev</a> on <a href="https://unsplash.com/s/photos/guide?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<ol>
<li><p>You need to prepare to design the data processing system, which includes mapping the storage systems to the business requirements, creating the data pipelines, and designing the schemas. It also covers publishing and visualizing data, for example with BigQuery. </p>
</li>
<li><p>You will also design the data processing solutions by considering some factors like infrastructure, system availability, fault tolerance, use of distributed systems. </p>
</li>
<li><p>You should also be able to migrate data, which covers the data warehousing and data processing phases.</p>
</li>
<li><p>You will need to build and operationalize the data processing systems for the storage and management of data. </p>
</li>
<li><p>You will need to operationalize the machine learning models to deploy (ML pipeline), ingest data, and continuously evaluate it. </p>
</li>
<li><p>Lastly, you will ensure the security and reliability of the solution. </p>
</li>
</ol>
<p>For more details check out the  <a target="_blank" href="https://cloud.google.com/certification/guides/data-engineer">official exam guide from Google</a> </p>
<h2 id="heading-certification-learning-path">Certification Learning Path</h2>
<h4 id="heading-google-cloud-certification-learning-path">Google Cloud Certification Learning Path</h4>
<p>Google itself provides a well-designed learning path that includes various videos and hands-on labs, so you can apply all the concepts of the course as you go. </p>
<p>The Google Cloud Certification course costs around $39, and once you are certified it will help you build a career and secure a job. 
Which cloud services do you need to cover?</p>
<p>You will need to cover the cloud services spanning the core functional concepts of storage, ingestion, analytics, machine learning, and serving data solutions. </p>
<p><strong>With this guide, you will get the Google Cloud certification in no time, and you will enjoy the perks of this certification real soon, take a step and change your data game today!</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1638421650173/c76fW2v9n.png" alt="image.png" /></p>
<blockquote>
<p> <a target="_blank" href="https://cloud.google.com/training/data-engineering-and-analytics#data-engineer-learning-path">Website Screenshot from Google Cloud Training</a> </p>
</blockquote>
<h4 id="heading-google-certified-professional-data-engineer-from-acloudguru-from-tim-berry">Google Certified Professional Data Engineer from <strong>AcloudGuru</strong> from <strong>Tim Berry</strong></h4>
<p>Acloudguru previously known as Linux Academy. According to the website, the primary focus of this course is to prepare you for the GCP Professional Data Engineer certification exam.  </p>
<p>Along the way you’ll solidify your foundations in data engineering and machine learning, ensuring that by the end of the course you will be able to design and build data processing solutions, operationalize machine learning models and gain a working knowledge of relevant GCP data processing tools and technologies.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1638421992551/QH-9bDlKH.png" alt="image.png" /></p>
<blockquote>
<p> <a target="_blank" href="https://acloudguru.com/course/google-certified-professional-data-engineer">Website Screenshot from <strong>AcloudGuru</strong></a> </p>
</blockquote>
<h2 id="heading-conclusion">Conclusion</h2>
<p>This brings us to the end of the article. Now you can prepare for the professional data engineer certification that will change your job-seeking experience, and take advantage of the best opportunities this certification opens up. 
You can choose from a wide variety of options and select the best certification for your career path to becoming a data engineer. 
It is always a good option to choose a certification that adds skills to your skillset and makes organizations want to hire you over other candidates. The best part is that you will outweigh the others in the job search. </p>
]]></content:encoded></item><item><title><![CDATA[Creating a Machine Learning Model with SQL]]></title><description><![CDATA[Even though Machine Learning is already really advanced, however, it has some weaknesses which can make it hard for you to use it.
Current machine learning workflows and its problems

If you've worked with ml models you may realize that structure and...]]></description><link>https://anujsyal.com/creating-a-machine-learning-model-with-sql</link><guid isPermaLink="true">https://anujsyal.com/creating-a-machine-learning-model-with-sql</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[SQL]]></category><category><![CDATA[google cloud]]></category><category><![CDATA[big data]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Fri, 12 Nov 2021 05:34:56 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1636687525869/46MaTuwL1.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Even though Machine Learning is already really advanced, however, it has some weaknesses which can make it hard for you to use it.</p>
<h2 id="current-machine-learning-workflows-and-its-problems">Current machine learning workflows and its problems</h2>
<ul>
<li>If you've worked with ML models, you may have realized that building and preparing them can be extremely time-intensive.</li>
<li>A typical data scientist must first export small amounts of data from the data store into an IPython notebook and into data-handling frameworks like pandas for Python.</li>
<li>If you're building a custom model, you first need to transform and pre-process all that data and perform all the feature engineering before you can even feed the data into the model.</li>
<li>Then, finally, after you've built your model in, say, TensorFlow or a similar library, you train it locally on your computer or on a VM with a small sample. That then requires you to go back, create more new data features, and improve performance, and you repeat and repeat and repeat; it's hard, so you stop after a few iterations.</li>
</ul>
<p>But hey there, don’t worry! You can work on building models, even if you are not the data scientist of your team.</p>
<h2 id="introducing-to-google-bigquery">Introducing to Google BigQuery</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636691624334/Rm6bgS779.jpeg" alt="patrick-lindenberg-1iVKwElWrPA-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@heapdump?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Patrick Lindenberg</a> on <a href="https://unsplash.com/s/photos/storage?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>BigQuery is a fully-managed petabyte-scale enterprise data warehouse. It is basically made up of two key things</p>
<ul>
<li>Fast SQL query engine</li>
<li>Fully managed data storage<blockquote>
<p>Big Query supports querying petabytes of data with standard SQL that everyone is used to.</p>
</blockquote>
</li>
</ul>
<p>Example:</p>
<pre><code><span class="hljs-meta">#standardSQL</span>
<span class="hljs-keyword">SELECT</span>
COUNT(*) <span class="hljs-keyword">AS</span> total_trips
<span class="hljs-keyword">FROM</span>
`bigquery-<span class="hljs-built_in">public</span>-data.san_francisco_bikeshare.bikeshare_trips`
</code></pre><h3 id="other-big-query-key-features">Other Big Query Key features</h3>
<ol>
<li>Serverless</li>
<li>Flexible pricing model</li>
<li>Standard GCP Data encryption and security</li>
<li>Perfect for BI and AI use cases</li>
<li>ML and predictive modeling with BigQuery ML</li>
<li>Really cheap storage - same as Google cloud storage buckets</li>
<li>Interactive data analysis with BigQuery BI Engine - connect to tableau, data studio, looker, etc</li>
</ol>
<h4 id="big-query-and-its-ml-features">Big query and its ml features</h4>
<p>As a part of one of the main features, big query allows building predictive machine learning models with just simple SQL syntax. With the petabyte processing power of google cloud, you can easily create models right there in the warehouse.
A sample syntax to create models looks like this</p>
<pre><code><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">OR</span> <span class="hljs-keyword">REPLACE</span> <span class="hljs-keyword">MODEL</span> <span class="hljs-string">`dataset.classification_model`</span>
OPTIONS
(
model_type=<span class="hljs-string">'logistic_reg'</span>,
labels = [<span class="hljs-string">'y'</span>]
)
<span class="hljs-keyword">AS</span>
</code></pre><h2 id="a-typical-workflow-using-big-query-ml-5-step">A typical workflow using Big Query ML [5 Step]</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636692633165/rgRnSkQ3R.png" alt="Untitled (1).png" /></p>
<blockquote>
<p>Flowchart diagram from the author</p>
</blockquote>
<h2 id="tutorial-building-a-classification-model-with-big-query-ml-simple-sql-syntax">Tutorial: Building a classification model with big query ML (Simple SQL Syntax)</h2>
<p>You can open up the <a target="_blank" href="https://console.cloud.google.com/bigquery">BigQuery console</a> and start replicating the steps below:</p>
<p>OR</p>
<p><strong>You can watch a Video Tutorial I created</strong></p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=7fJs1gEpPjo">https://www.youtube.com/watch?v=7fJs1gEpPjo</a></div>
<h4 id="dataset">Dataset</h4>
<p>I am using a public BigQuery dataset, <a target="_blank" href="https://console.cloud.google.com/marketplace/product/obfuscated-ga360-data/obfuscated-ga360-data">google_analytics_sample</a>. The dataset provides 12 months (August 2016 to August 2017) of obfuscated Google Analytics 360 data from the Google Merchandise Store. 
The data is typical of what an ecommerce website would see and includes the following information:</p>
<ul>
<li><p><strong>Traffic source data:</strong> information about where website visitors originate, including data about organic traffic, paid search traffic, and display traffic</p></li>
<li><p><strong>Content data:</strong> information about the behavior of users on the site, such as URLs of pages that visitors look at, how they interact with content, etc.</p></li>
<li><p><strong>Transactional data:</strong> information about the transactions on the Google Merchandise Store website.</p></li>
</ul>
<p><strong>Dataset license:</strong>
A public dataset is any dataset that is stored in BigQuery and made available to the general public through the Google Cloud Public Dataset Program. Public datasets are datasets that BigQuery hosts for you to access and integrate into your applications. Google pays for the storage of these datasets and provides public access to the data via a project; you pay only for the queries that you perform on the data, and the first 1 TB per month is free, subject to query pricing details. The dataset is available under the Creative Commons Attribution 4.0 License.</p>
<h4 id="machine-learning-problem-we-will-try-to-solve">Machine Learning problem we will try to solve</h4>
<p>We will try to predict whether a user will buy products on a return visit, hence we name our label <code>will_buy_on_return_visit</code>.</p>
<h3 id="step1-exploring-the-dataset">Step1: Exploring the dataset</h3>
<ul>
<li>Checking conversion rate</li>
</ul>
<pre><code><span class="hljs-keyword">WITH</span> visitors <span class="hljs-keyword">AS</span>(
<span class="hljs-keyword">SELECT</span>
COUNT(<span class="hljs-keyword">DISTINCT</span> fullVisitorId) <span class="hljs-keyword">AS</span> total_visitors
<span class="hljs-keyword">FROM</span> `bigquery-<span class="hljs-built_in">public</span>-data.google_analytics_sample.ga_sessions_20170801`
),
purchasers <span class="hljs-keyword">AS</span>(
<span class="hljs-keyword">SELECT</span>
COUNT(<span class="hljs-keyword">DISTINCT</span> fullVisitorId) <span class="hljs-keyword">AS</span> total_purchasers
<span class="hljs-keyword">FROM</span> `bigquery-<span class="hljs-built_in">public</span>-data.google_analytics_sample.ga_sessions_20170801`
<span class="hljs-keyword">WHERE</span> totals.transactions <span class="hljs-keyword">IS</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">NULL</span>
)
<span class="hljs-keyword">SELECT</span>
  total_visitors,
  total_purchasers,
  total_purchasers / total_visitors <span class="hljs-keyword">AS</span> conversion_rate
<span class="hljs-keyword">FROM</span> visitors, purchasers
</code></pre><ul>
<li>What are the top 5 selling products?</li>
</ul>
<pre><code><span class="hljs-keyword">SELECT</span>
  p.v2ProductName,
  p.v2ProductCategory,
  <span class="hljs-keyword">SUM</span>(p.productQuantity) <span class="hljs-keyword">AS</span> units_sold,
  <span class="hljs-keyword">ROUND</span>(<span class="hljs-keyword">SUM</span>(p.localProductRevenue/<span class="hljs-number">1000000</span>),<span class="hljs-number">2</span>) <span class="hljs-keyword">AS</span> revenue
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`</span>,
<span class="hljs-keyword">UNNEST</span>(hits) <span class="hljs-keyword">AS</span> h,
<span class="hljs-keyword">UNNEST</span>(h.product) <span class="hljs-keyword">AS</span> p
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> <span class="hljs-number">1</span>, <span class="hljs-number">2</span>
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> revenue <span class="hljs-keyword">DESC</span>
<span class="hljs-keyword">LIMIT</span> <span class="hljs-number">5</span>;
</code></pre><ul>
<li>Question: How many visitors bought on subsequent visits to the website?</li>
</ul>
<pre><code><span class="hljs-keyword">WITH</span> all_visitor_stats <span class="hljs-keyword">AS</span> (
<span class="hljs-keyword">SELECT</span>
  fullvisitorid, # <span class="hljs-number">741</span>,<span class="hljs-number">721</span> <span class="hljs-keyword">unique</span> visitors
  <span class="hljs-keyword">IF</span>(COUNTIF(totals.transactions &gt; <span class="hljs-number">0</span> <span class="hljs-keyword">AND</span> totals.newVisits <span class="hljs-keyword">IS</span> <span class="hljs-keyword">NULL</span>) &gt; <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>) <span class="hljs-keyword">AS</span> will_buy_on_return_visit
  <span class="hljs-keyword">FROM</span> `bigquery-<span class="hljs-built_in">public</span>-data.google_analytics_sample.ga_sessions_20170801`
  <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> fullvisitorid
)
<span class="hljs-keyword">SELECT</span>
  COUNT(<span class="hljs-keyword">DISTINCT</span> fullvisitorid) <span class="hljs-keyword">AS</span> total_visitors,
  will_buy_on_return_visit
<span class="hljs-keyword">FROM</span> all_visitor_stats
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> will_buy_on_return_visit
</code></pre><h3 id="step-2-select-features-and-create-your-training-dataset">Step 2. Select features and create your training dataset</h3>
<p>Now that we know a bit more about the data, let's create the final dataset we want to use for training</p>
<pre><code><span class="hljs-keyword">SELECT</span>
  * <span class="hljs-keyword">EXCEPT</span>(fullVisitorId)
<span class="hljs-keyword">FROM</span>
  <span class="hljs-comment"># features</span>
  (<span class="hljs-keyword">SELECT</span>
    fullVisitorId,
    <span class="hljs-keyword">IFNULL</span>(totals.bounces, <span class="hljs-number">0</span>) <span class="hljs-keyword">AS</span> bounces,
    <span class="hljs-keyword">IFNULL</span>(totals.timeOnSite, <span class="hljs-number">0</span>) <span class="hljs-keyword">AS</span> time_on_site
  <span class="hljs-keyword">FROM</span>
    <span class="hljs-string">`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`</span>
  <span class="hljs-keyword">WHERE</span>
    totals.newVisits = <span class="hljs-number">1</span>)
  <span class="hljs-keyword">JOIN</span>
  (<span class="hljs-keyword">SELECT</span>
    fullvisitorid,
    <span class="hljs-keyword">IF</span>(COUNTIF(totals.transactions &gt; <span class="hljs-number">0</span> <span class="hljs-keyword">AND</span> totals.newVisits <span class="hljs-keyword">IS</span> <span class="hljs-literal">NULL</span>) &gt; <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>) <span class="hljs-keyword">AS</span> will_buy_on_return_visit
  <span class="hljs-keyword">FROM</span>
      <span class="hljs-string">`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`</span>
  <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> fullvisitorid)
  <span class="hljs-keyword">USING</span> (fullVisitorId)
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> time_on_site <span class="hljs-keyword">DESC</span>
<span class="hljs-keyword">LIMIT</span> <span class="hljs-number">10</span>;
</code></pre><h3 id="step-3-create-a-model">Step 3: Create a Model</h3>
<p>This step uses a <code>CREATE MODEL</code> statement over the dataset created in the previous step</p>
<pre><code><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">OR</span> <span class="hljs-keyword">REPLACE</span> <span class="hljs-keyword">MODEL</span> <span class="hljs-string">`ecommerce.classification_model`</span>
OPTIONS
(
model_type=<span class="hljs-string">'logistic_reg'</span>,
labels = [<span class="hljs-string">'will_buy_on_return_visit'</span>]
)
<span class="hljs-keyword">AS</span>
<span class="hljs-comment">#standardSQL</span>
<span class="hljs-keyword">SELECT</span>
  * <span class="hljs-keyword">EXCEPT</span>(fullVisitorId)
<span class="hljs-keyword">FROM</span>
  <span class="hljs-comment"># features</span>
  (<span class="hljs-keyword">SELECT</span>
    fullVisitorId,
    <span class="hljs-keyword">IFNULL</span>(totals.bounces, <span class="hljs-number">0</span>) <span class="hljs-keyword">AS</span> bounces,
    <span class="hljs-keyword">IFNULL</span>(totals.timeOnSite, <span class="hljs-number">0</span>) <span class="hljs-keyword">AS</span> time_on_site
  <span class="hljs-keyword">FROM</span>
    <span class="hljs-string">`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`</span>
  <span class="hljs-keyword">WHERE</span>
    totals.newVisits = <span class="hljs-number">1</span>)
  <span class="hljs-keyword">JOIN</span>
  (<span class="hljs-keyword">SELECT</span>
    fullvisitorid,
    <span class="hljs-keyword">IF</span>(COUNTIF(totals.transactions &gt; <span class="hljs-number">0</span> <span class="hljs-keyword">AND</span> totals.newVisits <span class="hljs-keyword">IS</span> <span class="hljs-literal">NULL</span>) &gt; <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>) <span class="hljs-keyword">AS</span> will_buy_on_return_visit
  <span class="hljs-keyword">FROM</span>
      <span class="hljs-string">`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`</span>
  <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> fullvisitorid)
  <span class="hljs-keyword">USING</span> (fullVisitorId)
;
</code></pre><h3 id="step-4-evaluate-classification-model-performance">Step 4: Evaluate classification model performance</h3>
<p>Evaluate the performance of the model you just created using SQL</p>
<pre><code><span class="hljs-keyword">SELECT</span>
  roc_auc,
  <span class="hljs-keyword">CASE</span>
    <span class="hljs-keyword">WHEN</span> roc_auc &gt; <span class="hljs-number">.9</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">'good'</span>
    <span class="hljs-keyword">WHEN</span> roc_auc &gt; <span class="hljs-number">.8</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">'fair'</span>
    <span class="hljs-keyword">WHEN</span> roc_auc &gt; <span class="hljs-number">.7</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">'not great'</span>
  <span class="hljs-keyword">ELSE</span> <span class="hljs-string">'poor'</span> <span class="hljs-keyword">END</span> <span class="hljs-keyword">AS</span> model_quality
<span class="hljs-keyword">FROM</span>
  ML.EVALUATE(<span class="hljs-keyword">MODEL</span> ecommerce.classification_model,  (
<span class="hljs-keyword">SELECT</span>
  * <span class="hljs-keyword">EXCEPT</span>(fullVisitorId)
<span class="hljs-keyword">FROM</span>
  <span class="hljs-comment"># features</span>
  (<span class="hljs-keyword">SELECT</span>
    fullVisitorId,
    <span class="hljs-keyword">IFNULL</span>(totals.bounces, <span class="hljs-number">0</span>) <span class="hljs-keyword">AS</span> bounces,
    <span class="hljs-keyword">IFNULL</span>(totals.timeOnSite, <span class="hljs-number">0</span>) <span class="hljs-keyword">AS</span> time_on_site
  <span class="hljs-keyword">FROM</span>
    <span class="hljs-string">`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`</span>
  <span class="hljs-keyword">WHERE</span>
    totals.newVisits = <span class="hljs-number">1</span>
    <span class="hljs-keyword">AND</span> <span class="hljs-built_in">date</span> <span class="hljs-keyword">BETWEEN</span> <span class="hljs-string">'20170501'</span> <span class="hljs-keyword">AND</span> <span class="hljs-string">'20170630'</span>) <span class="hljs-comment"># eval on 2 months</span>
  <span class="hljs-keyword">JOIN</span>
  (<span class="hljs-keyword">SELECT</span>
    fullvisitorid,
    <span class="hljs-keyword">IF</span>(COUNTIF(totals.transactions &gt; <span class="hljs-number">0</span> <span class="hljs-keyword">AND</span> totals.newVisits <span class="hljs-keyword">IS</span> <span class="hljs-literal">NULL</span>) &gt; <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>) <span class="hljs-keyword">AS</span> will_buy_on_return_visit
  <span class="hljs-keyword">FROM</span>
      <span class="hljs-string">`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`</span>
  <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> fullvisitorid)
  <span class="hljs-keyword">USING</span> (fullVisitorId)
));
</code></pre><h4 id="future-steps-and-feature-engineering">Future steps and feature engineering</h4>
<p>If you are further interested in improving the performance of the model, you can always opt-in for adding in more features from the dataset</p>
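<p>If you would rather orchestrate these BigQuery ML steps from Python instead of the console, the same SQL can be submitted through the <code>google-cloud-bigquery</code> client library. The sketch below is illustrative only: it assumes application-default credentials and a project you own (<code>your-project-id</code> is a placeholder), and it names a hypothetical <code>classification_model_v2</code> that adds <code>pageviews</code> as an extra feature.</p>
<pre><code># Illustrative sketch: retraining and evaluating the model from Python.
# Assumes `pip install google-cloud-bigquery` and application-default credentials.
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project id

train_sql = """
CREATE OR REPLACE MODEL `ecommerce.classification_model_v2`
OPTIONS (model_type='logistic_reg', labels=['will_buy_on_return_visit']) AS
SELECT * EXCEPT(fullVisitorId)
FROM
  (SELECT
     fullVisitorId,
     IFNULL(totals.bounces, 0) AS bounces,
     IFNULL(totals.timeOnSite, 0) AS time_on_site,
     IFNULL(totals.pageviews, 0) AS pageviews          -- extra feature
   FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
   WHERE totals.newVisits = 1)
JOIN
  (SELECT
     fullVisitorId,
     IF(COUNTIF(totals.transactions &gt; 0 AND totals.newVisits IS NULL) &gt; 0, 1, 0)
       AS will_buy_on_return_visit
   FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
   GROUP BY fullVisitorId)
USING (fullVisitorId)
"""
client.query(train_sql).result()   # blocks until the training job finishes

# Evaluate the retrained model on its held-out evaluation split.
eval_sql = "SELECT roc_auc FROM ML.EVALUATE(MODEL `ecommerce.classification_model_v2`)"
for row in client.query(eval_sql).result():
    print("ROC AUC:", row.roc_auc)
</code></pre>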
<h2 id="conclusion">Conclusion</h2>
<p>Products like Google's BigQuery ML make building machine learning models accessible to more people.
With simple SQL syntax and Google's processing power, it is really easy to churn out real-life big data models.</p>
]]></content:encoded></item><item><title><![CDATA[Spark Streaming with Python]]></title><description><![CDATA[Photo by JJ Ying on Unsplash
Apache Spark Streaming is quite popular. Due to its integrated technology, Spark Streaming outperforms previous systems in terms of data stream quality and comprehensive approach.
Python and Spark Streaming do wonders for...]]></description><link>https://anujsyal.com/spark-streaming-with-python</link><guid isPermaLink="true">https://anujsyal.com/spark-streaming-with-python</guid><category><![CDATA[Python]]></category><category><![CDATA[spark]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data]]></category><category><![CDATA[big data]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Wed, 03 Nov 2021 05:15:57 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1635911185166/uYhG6kso8.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Photo by <a href="https://unsplash.com/@jjying?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">JJ Ying</a> on <a href="https://unsplash.com/s/photos/pipeline?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
<p>Apache Spark Streaming is quite popular. Due to its integrated technology, Spark Streaming outperforms previous systems in terms of data stream quality and comprehensive approach.</p>
<p>Python and Spark Streaming do wonders for industry giants when used together. Netflix is an excellent example of Python and Spark Streaming in action: the people behind the popular streaming platform have published multiple articles about how they use the technique to help us enjoy Netflix even more. Let's get started with the basics.</p>
<h2 id="what-is-spark-streaming-and-how-does-it-work">What is spark streaming, and how does it work?</h2>
<p>The Spark platform contains various modules, including Spark Streaming. Spark Streaming is a method for analyzing "unbounded" information, sometimes known as "streaming" information. This is accomplished by breaking the stream down into micro-batches and allowing windowed execution over many batches.</p>
<p>The Spark Streaming interface is a Spark API module. Python, Scala, and Java are all supported. It allows you to handle real-time data streams in a fault-tolerant and flexible manner. The Spark engine takes the data in batches and produces the end results as a stream of batches. </p>
<h2 id="what-is-a-streaming-data-pipeline">What is a streaming data pipeline?</h2>
<p>It is a technology that allows data to move smoothly and automatically from one location to another. This technology eliminates many typical issues that companies face, such as information leakage, bottlenecks, conflicting data, and duplicate entries.</p>
<p>Streaming data pipelines are data pipeline architectures that process thousands of inputs in real time and at scale. As an outcome, you'll be able to gather, analyze, and retain a lot of data. This functionality enables real-time applications, monitoring, and reporting.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635912373028/h3GNU3Jwa.jpeg" alt="agence-olloweb-Z2ImfOCafFk-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@olloweb?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Agence Olloweb</a> on <a href="https://unsplash.com/s/photos/concepts?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<h2 id="streaming-architecture-of-spark">Streaming Architecture of Spark.</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635912997193/aK1pB69W0.png" alt="image.png" /></p>
<blockquote>
<p>Spark streaming architecture diagram from  <a target="_blank" href="https://spark.apache.org/docs/latest/streaming-programming-guide.html">spark.apache.org</a> </p>
</blockquote>
<p>Spark Streaming's primary structure is batch-by-batch discrete-time streaming. The micro-batches are constantly allocated and analyzed, rather than traveling through the stream processing pipelines one item at a time. As a result, data is distributed to employees depending on accessible resources and location.</p>
<p>When data is received, it is divided into RDD divisions by the receiver. Because RDDs are indeed a key abstraction of Spark datasets, converting to RDDs enables group analysis with Spark scripts and tools.</p>
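<p>To make the micro-batch idea concrete, here is a minimal, self-contained PySpark Streaming sketch (not part of the original tutorial). It assumes a plain text source on a local socket at port 9999, for example one started with <code>nc -lk 9999</code>, and simply counts words in each batch:</p>
<pre><code># Minimal DStream word count, assuming a text source on localhost:9999
# (e.g. started with `nc -lk 9999`). Each 5-second micro-batch becomes an RDD,
# is transformed, and the per-batch counts are printed.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="MicroBatchWordCount")
ssc = StreamingContext(sc, 5)                      # 5-second batch interval

lines = ssc.socketTextStream("localhost", 9999)    # DStream of raw text lines
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))   # word counts per micro-batch
counts.pprint()                                    # print a sample of each batch

ssc.start()
ssc.awaitTermination()
</code></pre>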
<h2 id="real-life-spark-streaming-example-twitter-pyspark-streaming">Real-life spark streaming example (Twitter Pyspark Streaming )</h2>
<p>In this solution, I will build a streaming pipeline that pulls tweets from the internet for a specific keyword (Ether) and performs transformations on these real-time tweets to find the other top keywords associated with it.
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635913646959/J-cLQdOqG.png" alt="Untitled.png" /></p>
<blockquote>
<p>Real-life spark streaming example architecture by the author</p>
</blockquote>
<h4 id="video-tutorial">Video Tutorial</h4>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=jMtKh05xR-8">https://www.youtube.com/watch?v=jMtKh05xR-8</a></div>
<h3 id="step1-streaming-tweets-using-tweepy">Step1: Streaming tweets using tweepy</h3>
<pre><code><span class="hljs-keyword">import</span> tweepy
<span class="hljs-keyword">from</span> tweepy <span class="hljs-keyword">import</span> OAuthHandler
<span class="hljs-keyword">from</span> tweepy <span class="hljs-keyword">import</span> Stream
<span class="hljs-keyword">from</span> tweepy.streaming <span class="hljs-keyword">import</span> StreamListener
<span class="hljs-keyword">import</span> socket
<span class="hljs-keyword">import</span> json

<span class="hljs-comment"># Set up your credentials</span>
consumer_key=<span class="hljs-string">''</span>
consumer_secret=<span class="hljs-string">''</span>
access_token =<span class="hljs-string">''</span>
access_secret=<span class="hljs-string">''</span>


<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">TweetsListener</span>(<span class="hljs-params">StreamListener</span>):</span>

  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, csocket</span>):</span>
      self.client_socket = csocket

  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">on_data</span>(<span class="hljs-params">self, data</span>):</span>
      <span class="hljs-keyword">try</span>:
          msg = json.loads( data )
          print( msg[<span class="hljs-string">'text'</span>].encode(<span class="hljs-string">'utf-8'</span>) )
          self.client_socket.send( msg[<span class="hljs-string">'text'</span>].encode(<span class="hljs-string">'utf-8'</span>) )
          <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>
      <span class="hljs-keyword">except</span> BaseException <span class="hljs-keyword">as</span> e:
          print(<span class="hljs-string">"Error on_data: %s"</span> % str(e))
      <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>

  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">on_error</span>(<span class="hljs-params">self, status</span>):</span>
      print(status)
      <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">sendData</span>(<span class="hljs-params">c_socket</span>):</span>
  auth = OAuthHandler(consumer_key, consumer_secret)
  auth.set_access_token(access_token, access_secret)

  twitter_stream = Stream(auth, TweetsListener(c_socket))
  twitter_stream.filter(track=[<span class="hljs-string">'ether'</span>])

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
  s = socket.socket()         <span class="hljs-comment"># Create a socket object</span>
  host = <span class="hljs-string">"127.0.0.1"</span>     <span class="hljs-comment"># Get local machine name</span>
  port = <span class="hljs-number">5554</span>                 <span class="hljs-comment"># Reserve a port for your service.</span>
  s.bind((host, port))        <span class="hljs-comment"># Bind to the port</span>

  print(<span class="hljs-string">"Listening on port: %s"</span> % str(port))

  s.listen(<span class="hljs-number">5</span>)                 <span class="hljs-comment"># Now wait for client connection.</span>
  c, addr = s.accept()        <span class="hljs-comment"># Establish connection with client.</span>

  print( <span class="hljs-string">"Received request from: "</span> + str( addr ) )

  sendData( c )
</code></pre><h3 id="step2-coding-pyspark-streaming-pipeline">Step2: Coding PySpark Streaming Pipeline</h3>
<pre><code><span class="hljs-comment"># May cause deprecation warnings, safe to ignore, they aren't errors</span>
<span class="hljs-keyword">from</span> pyspark <span class="hljs-keyword">import</span> SparkContext
<span class="hljs-keyword">from</span> pyspark.streaming <span class="hljs-keyword">import</span> StreamingContext
<span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SQLContext
<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> desc
<span class="hljs-comment"># Can only run this once. restart your kernel for any errors.</span>
sc = SparkContext()

ssc = StreamingContext(sc, <span class="hljs-number">10</span> )
sqlContext = SQLContext(sc)
socket_stream = ssc.socketTextStream(<span class="hljs-string">"127.0.0.1"</span>, <span class="hljs-number">5554</span>)
lines = socket_stream.window( <span class="hljs-number">20</span> )
<span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> namedtuple
fields = (<span class="hljs-string">"tag"</span>, <span class="hljs-string">"count"</span> )
Tweet = namedtuple( <span class="hljs-string">'Tweet'</span>, fields )
<span class="hljs-comment"># Use Parenthesis for multiple lines or use \.</span>
( lines.flatMap( <span class="hljs-keyword">lambda</span> text: text.split( <span class="hljs-string">" "</span> ) ) <span class="hljs-comment">#Splits to a list</span>
  .filter( <span class="hljs-keyword">lambda</span> word: word.lower().startswith(<span class="hljs-string">"#"</span>) ) <span class="hljs-comment"># Checks for hashtag calls</span>
  .map( <span class="hljs-keyword">lambda</span> word: ( word.lower(), <span class="hljs-number">1</span> ) ) <span class="hljs-comment"># Lower cases the word</span>
  .reduceByKey( <span class="hljs-keyword">lambda</span> a, b: a + b ) <span class="hljs-comment"># Reduces</span>
  .map( <span class="hljs-keyword">lambda</span> rec: Tweet( rec[<span class="hljs-number">0</span>], rec[<span class="hljs-number">1</span>] ) ) <span class="hljs-comment"># Stores in a Tweet Object</span>
  .foreachRDD( <span class="hljs-keyword">lambda</span> rdd: rdd.toDF().sort( desc(<span class="hljs-string">"count"</span>) ) <span class="hljs-comment"># Sorts Them in a DF</span>
  .limit(<span class="hljs-number">10</span>).registerTempTable(<span class="hljs-string">"tweets"</span>) ) ) <span class="hljs-comment"># Registers to a table.</span>
</code></pre><h3 id="step3-running-the-spark-streaming-pipeline">Step3: Running the Spark Streaming pipeline</h3>
<ul>
<li>Open Terminal and run TweetsListener to start streaming tweets</li>
</ul>
<p><code>python TweetsListener.py</code></p>
<ul>
<li>In the Jupyter notebook, start the Spark streaming context; this will let the incoming stream of tweets flow into the Spark Streaming pipeline and perform the transformations stated in Step 2</li>
</ul>
<p><code>ssc.start()</code></p>
<h3 id="step4-seeing-real-time-outputs">Step4: Seeing real-time outputs</h3>
<p>Plot real-time information on a chart/dashboard from the temporary table <code>tweets</code> registered in Spark. This table will update every 3 seconds with fresh tweet analysis.</p>
<pre><code><span class="hljs-keyword">import</span> <span class="hljs-type">time</span>
<span class="hljs-keyword">from</span> IPython <span class="hljs-keyword">import</span> display
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
# <span class="hljs-keyword">Only</span> works <span class="hljs-keyword">for</span> Jupyter Notebooks!
%matplotlib <span class="hljs-keyword">inline</span> 

count = <span class="hljs-number">0</span>
<span class="hljs-keyword">while</span> count &lt; <span class="hljs-number">10</span>:
    <span class="hljs-type">time</span>.sleep( <span class="hljs-number">3</span> )
    top_10_tweets = sqlContext.<span class="hljs-keyword">sql</span>( <span class="hljs-string">'Select tag, count from tweets'</span> )
    top_10_df = top_10_tweets.toPandas()
    display.clear_output(wait=<span class="hljs-keyword">True</span>)
    plt.figure( figsize = ( <span class="hljs-number">10</span>, <span class="hljs-number">8</span> ) )
#     sns.barplot(x=<span class="hljs-string">'count'</span>,y=<span class="hljs-string">'land_cover_specific'</span>, data=df, palette=<span class="hljs-string">'Spectral'</span>)
    sns.barplot( x="count", y="tag", data=top_10_df)
    plt.<span class="hljs-keyword">show</span>()
    count = count + <span class="hljs-number">1</span>
</code></pre><p>Out:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635916208614/XqCVXtlGV.png" alt="image.png" /></p>
<h2 id="some-of-the-pros-and-cons-of-spark-streaming">Some of the pros and cons of Spark Streaming</h2>
<p>Now that we have gone through building a real-life solution of spark streaming pipeline, let's list down some pros and cons of using this approach.</p>
<p><strong>Pros</strong></p>
<ul>
<li>It offers exceptional speed for demanding jobs.</li>
<li>Fault tolerance.</li>
<li>It is simple to run on cloud platforms.</li>
<li>Support for multiple languages.</li>
<li>Integration with major frameworks.</li>
<li>The ability to connect to databases of various types.</li>
</ul>
<p><strong>Cons</strong></p>
<ul>
<li>Massive volumes of storage are required.</li>
<li>It's difficult to use, debug, and master.</li>
<li>There is a lack of documentation and instructional resources.</li>
<li>Data visualization support is unsatisfactory.</li>
<li>It is slow when dealing with small amounts of data.</li>
<li>Only a limited set of machine learning algorithms is supported.</li>
</ul>
<h2 id="conclusion">Conclusion</h2>
<p>Spark Streaming is indeed a technology for collecting and analyzing large amounts of data. Streaming data is likely to become more popular in the near future, so you should start learning about it now. Remember that data science is more than just constructing models; it also entails managing a full pipeline.</p>
<p>The basics of Spark Streaming were discussed in this post, as well as how to use it on a real-world dataset. We suggest you work with another sample or take real-time data to put everything we've learned into practice.</p>
]]></content:encoded></item><item><title><![CDATA[Building Computer Vision Datasets in Coco Format]]></title><description><![CDATA[Computer vision is among the biggest disciplines of machine learning with its vast range of uses and enormous potential. Its purpose is to duplicate the brain's incredible visual abilities. Algorithms for computer vision aren't magical. They require ...]]></description><link>https://anujsyal.com/building-computer-vision-datasets-in-coco-format</link><guid isPermaLink="true">https://anujsyal.com/building-computer-vision-datasets-in-coco-format</guid><category><![CDATA[Computer Vision]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[AI]]></category><category><![CDATA[ML]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Thu, 02 Sep 2021 03:25:23 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1630222621224/sMoxBZ-26.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Computer vision is among the biggest disciplines of machine learning with its vast range of uses and enormous potential. Its purpose is to duplicate the brain's incredible visual abilities. Algorithms for computer vision aren't magical. They require information to perform, and they'll only be as powerful as the information you provide. Based on the project, there are various sources to obtain the appropriate data.</p>
<p>The most famous object detection dataset is the Common Objects in Context dataset (COCO). It is commonly used to evaluate the performance of computer vision algorithms. The COCO dataset is labeled, delivering data for training supervised computer vision systems that can recognize the dataset's typical elements. Of course, these systems are far from flawless, so the COCO dataset serves as a baseline for assessing the systems' progress over time as a result of computer vision research.</p>
<p>In this article, we discuss the COCO file format, a standard for building computer vision datasets, along with object detection and image annotation methods.</p>
<h2 id="why-do-neural-nets-work-really-well-for-computer-vision">Why do neural nets work really well for computer vision?</h2>
<p>Artificial neural networks are a major subcategory of ML and constitute the core of deep learning techniques. Their origin and architecture are inspired by the human brain, and they work like biological neurons.</p>
<p>Neural networks perform effectively for computer vision because pictures do not always come with labels, and sub-labels for regions and elements must often be removed or cleverly reduced. Neural networks use training data to learn and increase their accuracy with experience. Once these learning techniques have been fine-tuned for precision, they become formidable resources in computing and AI, enabling us to quickly categorize and organize data.</p>
<p>When compared to manual classification by experienced scientists, tasks in voice recognition or image recognition may take only a few minutes rather than hours. Google’s technology is among the most famous applications of neural networks.</p>
<h2 id="why-still-there-is-a-need-to-create-a-custom-dataset">Why still there is a need to create a custom dataset</h2>
<p>Transfer learning has been a specific technique of machine learning in which a model created for one job is applied as the basis for a model on a different task. Considering the enormous compute as well as time resource base needed to establish neural network systems on such concerns, as well as the large leaps in expertise that they deliver on similar issues, it is a common strategy in machine learning in which pre-trained systems are used as the preliminary step on natural language data processing.</p>
<p>We can deal with these instances using transfer learning, which uses previously labeled data from a comparable task or topic.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1630223160476/sIUzWFdFu.jpeg" alt="neonbrand-zFSo6bnZJTw-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@neonbrand?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">NeONBRAND</a> on <a href="https://unsplash.com/s/photos/learning?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
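<p>As a rough illustration of the transfer learning idea described above, here is a minimal Keras sketch (not from the original article). It assumes you have already prepared a labeled image dataset, such as the Stanford Dogs data mentioned later in this post, as a <code>train_ds</code> tf.data pipeline:</p>
<pre><code># Minimal transfer-learning sketch with Keras (illustrative, not the article's code).
# Assumes `train_ds` is a tf.data.Dataset of (image, label) pairs you prepared yourself.
import tensorflow as tf

# Load a network pre-trained on ImageNet, without its original classification head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pre-trained feature extractor

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(120, activation="softmax"),  # e.g. 120 dog-breed classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, epochs=5)  # fine-tune only the new head on your own data
</code></pre>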
<h2 id="coco-file-format-is-a-standard-for-building-computer-vision-datasets">Coco File Format is a standard for building computer vision datasets</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1630224828996/Y0IR5slI1.jpeg" alt="Untitled.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@ikredenets?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Irene Kredenets</a> on <a href="https://unsplash.com/s/photos/coco?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>Analyzing visual environments is a major objective of computer vision; it includes detecting what items are present, localizing them in 2D and 3D, identifying their properties, and describing their relationships. As a result, the dataset can be used to train object recognition and classification methods. COCO is frequently used to test the efficiency of real-time object recognition techniques, and modern neural network frameworks understand the COCO dataset's structure.</p>
<p>Contemporary AI-driven systems are not yet capable of producing fully precise results, which is why the COCO dataset is a substantial reference point for computer vision: it is used to train, test, polish, and refine models, and it helps scale the annotation pipeline faster.</p>
<p>At a high level, the COCO standard specifies how your annotations and image metadata are saved on disk. Furthermore, the COCO dataset lends itself to transfer learning, in which the material used for one model is used to start another.</p>
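<p>Concretely, a COCO annotation file is a single JSON document with <code>images</code>, <code>annotations</code>, and <code>categories</code> sections. The small sketch below (with made-up file names and ids, purely for illustration) builds a minimal example in Python and writes it to disk:</p>
<pre><code># A minimal, hand-written COCO-style annotation file (illustrative values only).
import json

coco = {
    "images": [
        {"id": 1, "file_name": "dog_001.jpg", "width": 640, "height": 480},
    ],
    "categories": [
        {"id": 1, "name": "beagle", "supercategory": "dog"},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,                  # which image this annotation belongs to
            "category_id": 1,               # which label it carries
            "bbox": [120, 80, 200, 150],    # [x, y, width, height] in pixels
            "area": 200 * 150,
            "iscrowd": 0,
        },
    ],
}

with open("annotations.json", "w") as f:
    json.dump(coco, f, indent=2)
</code></pre>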
<h2 id="tutorial-to-build-computer-vision-dataset-using-datatorchhttpsdatatorchio">Tutorial to build Computer vision dataset using  <a target="_blank" href="https://datatorch.io/">Datatorch</a></h2>
<h3 id="video-tutorial">Video Tutorial</h3>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/E-_o2Q0YTrs">https://youtu.be/E-_o2Q0YTrs</a></div>
<p>Datatorch is one of the cloud-based, free-to-use annotation tools out there. It is a web-based platform that you can hop onto and quickly start annotating datasets.</p>
<p><strong>Step0: Discovering Data</strong></p>
<p>Solving any machine learning problem first starts with data. The first question is what problem you want to solve. Then the next question is where can I get this data. </p>
<p>In my case (hypothetical), I want to build an ML model that detects different dog breeds from photos. 
I am sourcing this relatively simple  <a target="_blank" href="https://www.kaggle.com/jessicali9530/stanford-dogs-dataset">Stanford Dogs Dataset</a> from Kaggle</p>
<p><strong> Step1: Create New Project </strong></p>
<p>After you log in you will see the dashboard main screen showing your projects and organization. This will be good when you are trying to work on multiple projects across different teams.
Now on the top right of the title bar, click on <code>+</code> and create a new project</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1630228367944/rcuZIZIih.png" alt="image.png" /></p>
<p><strong> Step2: Onboard Data</strong></p>
<p>Then go to <code>Dataset</code> tab from the left navigation bar, click on <code>+</code> to create a new dataset named <code>dogtypes</code>. After that you can easily drop the images</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1630229935048/1xnpMAMNo.png" alt="image.png" /></p>
<p>Or there is another option to directly connect to a cloud provider storage (AWS, Google, Azure)</p>
<p><strong> Step3: Start Annotating</strong></p>
<p>If you click on any of the images in the dataset, it will lead you directly to the annotation tool</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1630230774259/ATVSqrW6S.png" alt="image.png" /></p>
<ul>
<li><strong>Annotation Tools</strong> On the left there are the annotation tools you can use on the <strong>visualizer window</strong> in the center</li>
<li><strong>Dataset:</strong> List of all the images, click to annotate them</li>
<li><strong>Change/Create Labels: </strong> Click to change the label associated with annotation</li>
<li><strong>Annotation Details: </strong> After you have done some annotation in the image you will see the details here</li>
<li><strong>Tool Details/ Configuration: </strong> When you select an annotation tool the details/configuration appear on this. For example, if you select a brush tool you can change its size here</li>
</ul>
<p>To start annotating, you can just select an annotation tool from the options; which tool to use also depends on the type of model you are trying to build. For an object detection model, something like a bounding box or circle tool works well, whereas for a segmentation model you can use the brush tool or an AI-based superpixel tool to highlight the relevant pixels. For example, I just used a simple brush tool (with an increased size) to paint over the dog.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1630232861376/2J9V_jjcn.png" alt="image.png" /></p>
<p>Also, annotation is best learned by trying it yourself, or you can watch the tutorial on my YouTube channel.</p>
<ul>
<li><strong>Step4: Export Annotated Data to COCO Format</strong>
After you are done annotating, you can go to exports and export this annotated dataset in COCO format (a quick way to sanity-check the export is sketched below).</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1630233858690/EmfFWXV7y.png" alt="image.png" /></p>
<h2 id="conclusion">Conclusion</h2>
<p>If you're new to object detection and need to create a completely new dataset, the COCO format is an excellent option because of its simple structure and broad adoption. We have looked at the COCO dataset structure for the most common tasks: object identification and segmentation. COCO-style datasets are large-scale datasets suited for starter projects, production environments, and cutting-edge research. </p>
]]></content:encoded></item><item><title><![CDATA[Introduction to graph-based analytics using Cylynx Motif]]></title><description><![CDATA[Data visualization and analysis may help you find summary data, complexity, unseen connections, patterns, differences, irregularities, and insights in your dataset. Although there are several technologies available to help in the presentation of tabu...]]></description><link>https://anujsyal.com/introduction-to-graph-based-analytics-using-cylynx-motif</link><guid isPermaLink="true">https://anujsyal.com/introduction-to-graph-based-analytics-using-cylynx-motif</guid><category><![CDATA[data analysis]]></category><category><![CDATA[graph database]]></category><category><![CDATA[analytics]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data structures]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Thu, 12 Aug 2021 04:26:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1628738396830/MTM7TgimR.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Data visualization and analysis may help you find summary data, complexity, unseen connections, patterns, differences, irregularities, and insights in your dataset. Although there are several technologies available to help in the presentation of tabular information, this can be true with graph data.</p>
<p>Motif is one of the no-code graph visualization tools out there. This software will assist researchers, data analysts, and executives in establishing the links within their datasets and making graph analysis simpler and more interesting!</p>
<p>In this article, we will discuss graph-based analytics using Cylynx Motif, the challenges of graph data exploration, and the Motif graph intelligence platform. Keep scrolling to read more.</p>
<h2 id="what-is-graph-based-analytics">What is Graph-Based Analytics?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1628738463898/KqX0qodMb.jpeg" alt="alina-grubnyak-R84Oy89aNKs-unsplash.jpg" /></p>
<p>It is a fast-growing field of study in which graph-theoretic, analytical, and database approaches are used to describe, store, query, and run performance evaluations on graph-structured information. </p>
<p>Analysts can use these methods to figure out how a network's structure changes under various circumstances and to discover pathways between pairs of entities that satisfy specific constraints. They can also be used to recognize clusters or tightly interacting subsets within a graph and to discover subgraphs that are similar to a pattern of interest.</p>
<p>For these and several other activities, it is necessary to represent one's information as a graph of nodes or vertices that stand for items and edges that reflect connections between them. Several technology fields, like sensor networks, need huge graphs with billions of nodes and edges.</p>
<p>These could represent thousands of different sorts of things and connections in activities like situation monitoring. The interconnections in telecommunication systems can change over time, and certain entities might be quite closely linked to one another.</p>
<ul>
<li><strong>Financial services</strong></li>
</ul>
<p>GA for financial sectors is a compelling technique for visualizing networks, connections, and activities between individuals, companies, and things. The most well-known applications of GA so far have been in the analysis of social media information. </p>
<p>However, the technique can potentially transform the financial services sector, particularly when used to reinforce Artificial Intelligence (AI)-based analytics, for example to stabilize different processes or to reduce time-consuming activities like information processing, verification, and error correction.</p>
<p>Financial Institutions (FIs) may obtain priceless and quick insights into their networks (through cybersecurity threat control), counterparties (through counterparty credit risk), and users (mostly through KYC/AML), as well as the broader community (via supply chain analysis), by implementing GA successfully.</p>
<p>GA does have a big future. It's feasible to build an AI tool that can either automate decisions or enhance and support rational decision-making by integrating it with other scientific approaches. Machine Learning (ML) approaches have recently received a lot of attention in cutting-edge analyses, and GA may have a similar impact on finance and security innovation.</p>
<ul>
<li><strong>Supply chain management</strong></li>
</ul>
<p>We can save money on shipments by using the supply chain management graph. That form of graph may help you take advantage of economies of scale, particularly if you have a broader perspective of consumption, how everything relates to your stock, and how much you'll have to purchase in a given timeframe. </p>
<p>Perhaps you're now placing smaller, more frequent orders since you’re unable to implement this long-term strategy throughout the entire supply chain. If you've got a broader perspective (that the graph may assist you to see), you could purchase in large quantities, make more informed purchasing choices, and save cash.</p>
<p>You may also use the graph to improve your inventory. You can shift your stock around so it's in the appropriate location at the proper moment once you know how many usable components you need, where they are, how much time is needed to bring them to the business, and what transportation they'll have to travel in (with regard to your forecast).</p>
<p>Instead of placing new orders while keeping your stock untouched, you may use this to stay profitable. It might help you fine-tune your stock management so you don't hold as much in the store. After all, you know how your forecast will come out in terms of how many components you'll require, so you'll be able to efficiently arrange your inventory and avoid buying or storing more than you'll need.</p>
<p>The graph can assist you in comparing and contrasting providers and related items. You may analyze customer complaints and evaluate the process variation from different vendors once you have that picture of the providers versus all the parts and materials you're utilizing.</p>
<ul>
<li><strong>Customer 360</strong></li>
</ul>
<p>By looking at the graph, you can see how this user is related to certain other businesses; for instance, consumers could operate in any of your other business companies, or maybe they're friends and family. You got contract details, and you can access not only details regarding their existing vehicles - its actual BOM, how it's maintained, choices available, product definition, and many more – but also any other cars they've manufactured.</p>
<p>You can also examine the cars they've bought in the past, as well as the choices and requests they've made. Sensor information, telemetry statistics, billing information, and any customer encounters may all be included in this graph.</p>
<h2 id="about-the-motif-graph-intelligence-platform">About the Motif Graph Intelligence Platform</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1628738583264/-m3yMwcvG.png" alt="image.png" /></p>
<h3 id="what-is-motif">What is Motif?</h3>
<p>Motif is a graph analytics application that converts linked data into business intelligence without the use of programming. It allows users to accelerate data mining, research, and communication by helping companies make the connections between disparate data sets. Researchers, business analysts, and executives may use Motif to do a graphical search over graph data.</p>
<h3 id="what-was-the-motivation-for-developing-motif">What was the motivation for developing Motif?</h3>
<p>Motif wants to lower the barriers to entry for anybody interested in graph problems. One of the major issues encountered was integrating graph data into corporate decision-making in sectors such as finance and security. </p>
<p>The majority of existing options are custom-built, proprietary, time-consuming, and costly. Cylynx is trying to fix that with Motif, which makes 80 percent of the most widespread network visualization use cases as simple as possible while also adding interactive functionality to turn graph data into business insight.</p>
<h2 id="a-simple-tutorial-to-get-started-with-motif-using-movie-dataset">A Simple Tutorial to get started with Motif (using movie dataset)</h2>
<p> <strong> If you interested in a comprehensive tutorial, check out this video on my youtube channel </strong></p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/jPs5UxNNKQ4">https://youtu.be/jPs5UxNNKQ4</a></div>
<h3 id="steps">Steps</h3>
<ul>
<li>To get started go to the following  <a target="_blank" href="https://demo.cylynx.io/">Demo Motif Link</a> </li>
<li>As a first step you would need to import the data, click on the import data button. There are various ways you can start with the data.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1628740025652/ejsrI1Q7Y.png" alt="image.png" /></p>
<ul>
<li><p>Information about the dataset: In this tutorial, I will use a neo4j graph database server sample, the <code>Movie Dataset</code>, and connect directly using the server URL and credentials. To create a similar dataset, go to the <a target="_blank" href="https://sandbox.neo4j.com/">neo4j sandbox</a>. After you log in, you can create a new database server with the <code>Movie Dataset</code> preloaded; please note it will expire in 3 days.</p>
</li>
<li><p>After the server is connected, you need a query to extract the data. I am using a query that extracts information on movies and the corresponding actors. Click on execute query and import the nodes and edges.</p>
</li>
</ul>
<pre><code><span class="hljs-selector-tag">MATCH</span> (<span class="hljs-attribute">p</span>:Person)<span class="hljs-selector-tag">-</span><span class="hljs-selector-attr">[r:ACTED_IN]</span><span class="hljs-selector-tag">-</span>&gt;(<span class="hljs-attribute">m</span>:Movie) <span class="hljs-selector-tag">RETURN</span> <span class="hljs-selector-tag">p</span>,<span class="hljs-selector-tag">r</span>,<span class="hljs-selector-tag">m</span>
</code></pre><ul>
<li>After the data has been imported you will see the graph with all the nodes and edges populated in the visual interface</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1628740659599/N3fbrz1w_.png" alt="image.png" /></p>
<ul>
<li>On the right, there are quick action buttons like zoom-in, zoom-out, undo, redo, etc. On the left there is a control panel with the following options</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1628740936442/9KeL3blKg.png" alt="image.png" /></p>
<ul>
<li>The idea of using a tool like Motif for graph-based analytics is to explore the data and relationships visually. That is where the <code>Styles</code> tab helps you with in-depth exploration. You should be able to select different layout options, radius, node spacing, and focus nodes, as well as node styles and edge styles.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1628741350059/M-Z_DMHkD.png" alt="image.png" /></p>
<ul>
<li>All the options discussed are best understood through actual use. In my case, around 5 minutes of playing around with the data helped me get better insights about the dataset. I selected a Radial Layout, tied node size to the degree of connections, and added node colors with legends. The final output looked something like this</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1628741735214/VL0KUUbAi.png" alt="image.png" /></p>
<ul>
<li>I was able to answer quick questions by just visually looking at the graph, like who is the most popular actor/movie. If you are interested in knowing the detailed steps you can check out my  <a target="_blank" href="https://youtu.be/jPs5UxNNKQ4">youtube video</a> Or you can just try it  yourself by going to the  <a target="_blank" href="https://demo.cylynx.io/">demo link</a> </li>
</ul>
<h2 id="a-few-challenges-of-graph-data-exploration">A Few Challenges of Graph Data Exploration</h2>
<ul>
<li><strong>Finding the connection through the tabular form</strong></li>
</ul>
<p>The tabular representation of our data is quite difficult to understand in some cases. Companies and organizations that work with relational database structures tend to stay within the boundaries of that data, and these boundaries do not help the companies evolve.</p>
<p>Implementing graphical data exploration in these organizations will help them understand the relationships hidden in the tabular forms. It will also help these companies become more flexible and make new research and discoveries in the relevant fields.</p>
<ul>
<li><strong>Challenges in data exploration</strong></li>
</ul>
<p>Finding the connections between different things through graph data may take some time in several situations. This can be challenging for some companies, as they don't have enough time to spend on data exploration.</p>
<p>Data scientists and researchers play a huge role in these situations. They can bring new technologies and solutions into the company to make things easier; NetworkX and igraph are some of the tools used by data scientists, as in the sketch below.</p>
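<p>For instance, a data scientist might load relationships like the movie/actor ones from the tutorial above into NetworkX to answer the "who is most connected" question programmatically. A small illustrative sketch (the edges are made-up examples, not the actual Movie Dataset):</p>
<pre><code># Illustrative NetworkX sketch: exploring actor-movie relationships in code.
# The edges below are made-up examples, not the actual Movie Dataset.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Keanu Reeves", "The Matrix"),
    ("Carrie-Anne Moss", "The Matrix"),
    ("Keanu Reeves", "John Wick"),
])

# Degree = number of connections; a quick proxy for the "most popular" node.
most_connected = max(G.degree, key=lambda pair: pair[1])
print(most_connected)            # ('Keanu Reeves', 2)
print(nx.degree_centrality(G))
</code></pre>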
<ul>
<li><strong>Issues with high dimensional data</strong></li>
</ul>
<p>A graph built from a real-world scenario may have many edges, nodes, and properties, which can be challenging for new users. They will need extra time to work with these graphs and understand the properties.</p>
<ul>
<li><strong>Exchanging the results with others is challenging</strong></li>
</ul>
<p>There are times when you want to share results with friends and colleagues. Graph insights are hard to communicate in tabular form, and unlike many other technologies, there are few dedicated tools for sharing them.</p>
<h2 id="conclusion">Conclusion</h2>
<blockquote>
<p>Cylynx is continuously introducing and implementing new tools and software to make things easier, and providing products and techniques that help financial fraud authorities connect the links among various pieces of data.</p>
<p>Graphs are a great way to display and handle such information. Motif has been built to make graph exploration clear and accessible to researchers, data analysts, and executives by integrating the demands of our different application releases.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Building Deep Neural Networks the Easy Way | Perceptilabs]]></title><description><![CDATA[PerceptiLabs' visual simulation model offers a graphical user interface for creating, learning, and evaluating designs as well as allowing for further programming modifications. You can get quick repetitions and improved solutions that are easier to ...]]></description><link>https://anujsyal.com/building-deep-neural-networks-the-easy-way-or-perceptilabs</link><guid isPermaLink="true">https://anujsyal.com/building-deep-neural-networks-the-easy-way-or-perceptilabs</guid><category><![CDATA[Deep Learning]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Python]]></category><category><![CDATA[TensorFlow]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Mon, 02 Aug 2021 06:59:07 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1627875573736/3VMqTnM3k.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>PerceptiLabs' visual simulation model offers a graphical user interface for creating, learning, and evaluating designs as well as allowing for further programming modifications. You can get quick repetitions and improved solutions that are easier to describe.</p>
<p>The PerceptiLabs framework allows users to create modified model configurations without requiring deep scientific knowledge, and its end-to-end simulation lets users inspect and analyze the model in a transparent way, improving understanding and making error detection easier.</p>
<h2 id="what-is-perceptilabs">What is Perceptilabs?</h2>
<p>PerceptiLabs is essentially a visual user interface on top of TensorFlow: an advanced machine learning platform with a graphical modeling workflow that combines the freedom of programming with the convenience of a drag-and-drop interface. This makes model creation simpler, quicker, and accessible to a broader range of people.</p>
<p>It also includes pre-built models for a variety of domains that users can bring into their own projects, modify, and train on their own datasets. Tax fraud detection, object classification for pattern recognition, and other applications are among the system's use cases.</p>
<h2 id="perceptilabs-and-machine-learning">PerceptiLabs and machine learning</h2>
<p>PerceptiLabs was founded with the goal of making machine learning modeling easier for businesses of all sizes. Machine learning can play a vital role in how businesses develop, and PerceptiLabs is on a journey to enable businesses of all sizes to get started in this field.</p>
<p>It analyzes the ever-increasing amounts of data accessible today, helps businesses identify trends in that data, and provides predictions based on those trends. Every business has a range of applications, such as using object detection to predict which grocery stores are running low on stock or using image recognition to identify a person in a crowded scene.</p>
<p>Users can easily create machine learning models for any type of business with PerceptiLabs' visual modeling solution. It lets users drag, drop, and join elements, then configure variables while the software writes the corresponding code instantly. Users can quickly train and fine-tune their machine learning model, as well as observe its performance.</p>
<h2 id="modeling-workflow-of-perceptilabs">Modeling workflow of Perceptilabs</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627875353809/R0j0dMWT8.jpeg" alt="tobias-carlsson-d3Zu34NBg7A-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@tobias_carl?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Tobias Carlsson</a> on <a href="https://unsplash.com/s/photos/flow?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>Pre-made elements encapsulate TensorFlow code and simplify it into visual components, while still enabling custom code edits. The graphical interface lets you arrange these elements into a structure that depicts the design of your system, and makes it simple to add features such as one-hot encoding and dense layers.</p>
<p>As you alter the design in PerceptiLabs, every element also offers graphical information on how it has converted the dataset. This immediate overview reduces the requirement to execute the entire simulation before viewing results, allowing you to change more quickly.</p>
<p>Whenever you compare PerceptiLabs to any other platform, you'll see how much easier it is to visualize pictures and categorize information. You could also observe how every element alters the information, as well as how the alterations contributed to the final categorization.</p>
<p>During modeling, PerceptiLabs takes the first portion of the available dataset and re-runs the pipeline as you make adjustments, so you see how your modifications affect the outcome right away. This useful feature allows you to examine results without having to execute the algorithm on the whole sample.</p>
<h2 id="building-your-first-deep-learning-model-on-perceptilabsperceptilabscom">Building your first Deep Learning model on  <a target="_blank" href="perceptilabs.com">Perceptilabs</a></h2>
<h3 id="if-you-are-interested-in-a-more-comprehensive-video-tutorial-check-out-my-youtube-video-below">If you are interested in a more comprehensive video tutorial, check out my youtube video below</h3>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=Ez4la9Lwh04">https://www.youtube.com/watch?v=Ez4la9Lwh04</a></div>
<p><strong>Step 1: Install and run Perceptilabs on local</strong>
Open a terminal and use pip to install and run the tool locally (make sure you have a Python version &lt; 3.9)</p>
<pre><code><span class="hljs-attribute">pip</span> install perceptilabs
perceptilabs
</code></pre><p>After the setup the tool is up and running on localhost:8080
<strong> Step 2: Understanding the dataset </strong>
I am using the default sample dataset of <code>X-Ray scans of patients</code> provided in Perceptilabs.
The dataset has Xray scans with 3 labels - <code>Normal</code>,<code>Viral Pneumonia</code>, <code>Covid</code>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627881438767/uFzJA8XCG.png" alt="image.png" />
To import this dataset into Perceptilabs you need it in the right format: a
<code>data.csv</code> file that contains the path of each file with its corresponding label</p>
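<p>If you need to prepare such a file for your own images, a minimal sketch could look like this (the file paths and labels below are made up; adjust them to your folder structure):</p>
<pre><code class="lang-python">import pandas as pd

# Hypothetical image paths mapped to their labels
rows = [
    {"image": "images/normal_001.png", "label": "Normal"},
    {"image": "images/pneumonia_001.png", "label": "Viral Pneumonia"},
    {"image": "images/covid_001.png", "label": "Covid"},
]

pd.DataFrame(rows).to_csv("data.csv", index=False)
</code></pre>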
<p><strong>Step 3: Go to the model hub (first tab on the left), click on create model and import the dataset <code>data.csv</code> </strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627881874542/mHAtg1Phs.png" alt="image.png" />
Select the URL as input feature and labels as Target
 Keep the data partition as default [70% Train,20% Validation,10% Test]</p>
<p>Select the training settings</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627881898785/mQ6Q6_HrC.png" alt="image.png" />
Provide the details such as <code>Model Name</code>, <code>Epochs</code>, <code>Batch Size</code>, <code>Loss Function</code>, <code>Learning Rate</code>
and click on Customize and go to the Modelling window</p>
<p>Within the modelling window you will see all the layers of the neural network laid out as inferred by the modeling tool; it should look something like the screenshot below</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627886138302/C41n5WQSI.png" alt="image.png" />
By default it contains one convolution layer connected to the input images and two dense layers with a softmax to produce the final label output</p>
<p><strong> Step 4: Play around with the tool and multiple layers</strong>
You can add more deep learning components from within the modeling tool, or easily code a custom Keras layer yourself</p>
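<p>As an illustration, a custom layer you might drop into such a code component could be a small Keras sketch like the one below (the layer name and sizes are arbitrary, not something PerceptiLabs prescribes):</p>
<pre><code class="lang-python">import tensorflow as tf

# A hypothetical custom block: a dense layer followed by dropout
class SmallDenseBlock(tf.keras.layers.Layer):
    def __init__(self, units=64, **kwargs):
        super().__init__(**kwargs)
        self.dense = tf.keras.layers.Dense(units, activation="relu")
        self.dropout = tf.keras.layers.Dropout(0.2)

    def call(self, inputs, training=False):
        x = self.dense(inputs)
        return self.dropout(x, training=training)
</code></pre>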
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627886395628/p7phNib9I.png" alt="image.png" /></p>
<p>Building any deep learning model requires a lot of iterations, so the visual approach comes in handy: we can plug and play and see the results of each iteration</p>
<p><strong> Step 5: Start Training and see live stats (Statistics View)</strong>
Click on Run with current settings in the top bar to start training the model, passing in the model settings discussed previously; for a classification use case, a Cross-Entropy loss function makes the most sense.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627886577022/uaQ_6DXG4e.png" alt="image.png" /></p>
<p>After you start training, you will be redirected to the statistics view to see live statistics of the model while it is being trained: weight outputs, loss, and accuracy as each part of the dataset is processed. All of this analysis can be done layer by layer
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627886944645/y9UPt3nKM.png" alt="image.png" /></p>
<p>You should also be able to see the accuracy increase at a global level as the epochs pass</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627887080658/oS_j0_ljk.png" alt="image.png" /></p>
<p><strong> Step 6: Run Validation on Tests Dataset</strong>
Go to the test view and run the test to get model metrics and confusion matrix of labels</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627887354964/boqfW-qOs.png" alt="image.png" /></p>
<p>And after it is complete you should be able to see the quality of the model you have built using these test metrics such as <code>Model Accuracy</code>, <code>Precision</code>, <code>Recall</code></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627887428125/zdonHbxpN.png" alt="image.png" /></p>
<h2 id="why-something-like-perceptilabs-makes-sense">Why something like Perceptilabs makes sense?</h2>
<p>Data analysts may use this technology to perform more effectively with machine learning techniques and get a good understanding of them.</p>
<p><strong>Helps you get the Real-time information</strong></p>
<p>Real-time metrics and detailed summaries of every modeling element's data are available. You can simply follow and analyze the behavior of the variables, troubleshoot in real-time, and identify where your system may be improved.</p>
<p><strong>Helps you share them on GitHub</strong></p>
<p>PerceptiLabs allows you to maintain many models, evaluate them, and share the findings with your team quickly and efficiently. You can export your model as a TensorFlow model.</p>
<p><strong>Helps you overcome Compatibility problems</strong></p>
<p>When a company's researchers create models and put them into operation, everyone needs to be working with compatible tooling; otherwise, problems arise. According to some experts, this problem can be avoided if everyone in a firm uses PerceptiLabs' platform.
<strong> Helps you export your model </strong></p>
<p>Perceptilabs allows you to examine and explain how your model runs and executes, as well as why particular outcomes are being produced. You can also export your model as a trained TensorFlow model once you're happy with it.</p>
<p><strong>Advantages of using Perceptilabs</strong></p>
<p>This tool offers a wide range of benefits. Some of them are;</p>
<ul>
<li>Quick modeling - Includes a simple drag-and-drop user interface that helps make system design simple to create and analyze.</li>
<li>Visibility - It can be used to start understanding how your strategy performs so that it can be explained.</li>
<li>Versatility - Built as a graphical API on top of TensorFlow, this allows programmers to use TensorFlow's low-level Interface while also allowing them to use other Python libraries.</li>
</ul>
<h2 id="conclusion">Conclusion</h2>
<p>The process of developing models must be simplified if businesses are to embrace machine learning. PerceptiLabs offers graphical machine learning modeling solutions to help businesses implement it. It not only lets you develop machine learning models quickly, but also gives you a graphical representation of how the model is performing and lets you share that information with others.</p>
]]></content:encoded></item><item><title><![CDATA[Introduction to Pyspark ML Lib: Build your first linear regression model]]></title><description><![CDATA[Photo by Genessa Panainte on Unsplash

Machine learning that is applied to build personalizations, suggestions, and future analyses are becoming increasingly important as companies generate increasingly diversified and user-focused digital goods and ...]]></description><link>https://anujsyal.com/introduction-to-pyspark-ml-lib-build-your-first-linear-regression-model</link><guid isPermaLink="true">https://anujsyal.com/introduction-to-pyspark-ml-lib-build-your-first-linear-regression-model</guid><category><![CDATA[spark]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data analysis]]></category><category><![CDATA[Artificial Intelligence]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Mon, 26 Jul 2021 04:48:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1627257772013/FtuoIHX0F.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>Photo by <a href="https://unsplash.com/@genessapana?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Genessa Panainte</a> on <a href="https://unsplash.com/s/photos/spark?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>Machine learning applied to build personalization, recommendations, and predictive analyses is becoming increasingly important as companies generate increasingly diversified and user-focused digital goods and services. Rather than dealing with the complications of different datasets, the Apache Spark machine learning library (MLlib) enables data engineers to concentrate on specific data challenges and algorithms.</p>
<p>Linear regression is a linear technique for modeling the relationship between a dependent variable and one or more independent variables. It is one of the most fundamental and widely used kinds of predictive modeling.</p>
<h2 id="what-is-spark-mllib">What is Spark MLlib?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627257895913/zqfyMHpiV.jpeg" alt="marius-masalar-CyFBmFEsytU-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@marius?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Marius Masalar</a> on <a href="https://unsplash.com/s/photos/machine-learning?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>Spark MLlib is among the most appealing features of Spark as it has the capacity to enormously scale processing, which is precisely what machine learning models require. However, there are some machine learning models that cannot be properly implemented, which is a drawback.</p>
<p>MLlib is a comprehensive machine learning package that includes classification, regression, clustering, collaborative filtering, and underlying optimization primitives, as well as other popular learning methods and tools.</p>
<h2 id="what-is-a-linear-regression-model">What is a linear regression model?</h2>
<p>By fitting a line to the given data, regression methods illustrate the relationship between variables. A straight line is used in a linear model, whereas a curve is used in nonlinear models.</p>
<p>You can use regression to predict the value of a dependent variable from the values of one or more independent variables. The relationship between two quantitative variables is estimated using simple linear regression.</p>
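<p>For reference, the model being estimated is just a line (or a hyperplane once there are several features): the target is expressed as a weighted sum of the inputs plus an error term.</p>
<pre><code class="lang-latex">y = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n + \varepsilon
</code></pre>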
<h2 id="why-to-use-spark-mllib-for-ml">Why to use spark Mllib for ml</h2>
<p>Spark is a strong, unified platform for data scientists due to its speed. It is also a simple-to-use tool that helps them get the desired results quickly. This enables data scientists to tackle machine learning problems, along with graph computation, streaming, and interactive query handling, at a much larger scale.</p>
<p>R, Python, and Java are just a few of the languages available in Spark. The 2015 <a target="_blank" href="https://databricks.com/blog/2015/09/24/spark-survey-results-2015-are-now-available.html">Spark Survey</a>, which polled the Spark community, revealed that Python and R have seen especially fast growth. In particular, 58 percent of participants said they used Python and 18 percent said they were currently using the R API.</p>
<h2 id="interested-in-more-comprehensive-tutorial">Interested in more comprehensive tutorial:</h2>
<p>Check out this youtube video:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/9Sy1x1fa1no">https://youtu.be/9Sy1x1fa1no</a></div>
<h2 id="create-your-first-linear-regression-model-with-spark-mllib">Create your first linear regression model with Spark Mllib</h2>
<ul>
<li><p>Step 1: Pyspark environment setup
For a pyspark environment on a local machine, my preferred option is to use docker to run the <code>jupyter/pyspark-notebook</code> image. However, if you are interested in an extensive installation guide, check out my <a target="_blank" href="https://anujsyal.com/pyspark-installation-guide">blog post</a> or <a target="_blank" href="https://www.youtube.com/watch?v=Ql_jfk3UnHE">youtube video</a></p>
</li>
<li><p>Step 2: Create a spark session</p>
</li>
</ul>
<pre><code><span class="hljs-keyword">from</span> pyspark.<span class="hljs-keyword">sql</span> <span class="hljs-keyword">import</span> SparkSession
spark = SparkSession.builder.master("local").appName("linear_regression_model").getOrCreate()
</code></pre><ul>
<li>Step 3: Load the dataset</li>
</ul>
<p>For the dataset, I am using a simple  <a target="_blank" href="https://www.kaggle.com/quantbruce/real-estate-price-prediction">Real Estate dataset from Kaggle</a>, which contains simple real-estate data with continuous features like <code>distance from mrt station</code>, <code>coordinates</code>, <code>size</code>, etc.</p>
<p>After you download it, read the dataset into a spark dataframe</p>
<pre><code>real_estate = spark.<span class="hljs-keyword">read</span>.<span class="hljs-keyword">option</span>("inferSchema", "true").csv("real_estate.csv",<span class="hljs-keyword">header</span>=<span class="hljs-keyword">True</span>)
</code></pre><ul>
<li>Step 4: Explore the data and its attributes</li>
</ul>
<p>We can explore different attributes/columns of the data using a few inbuilt functions in spark.</p>
<p>printSchema() to see the columns with their data types</p>
<pre><code><span class="hljs-string">real_estate.printSchema()</span>
<span class="hljs-attr">Out:</span>
<span class="hljs-string">root</span>
 <span class="hljs-string">|--</span> <span class="hljs-attr">No:</span> <span class="hljs-string">integer</span> <span class="hljs-string">(nullable</span> <span class="hljs-string">=</span> <span class="hljs-literal">true</span><span class="hljs-string">)</span>
 <span class="hljs-string">|--</span> <span class="hljs-attr">X1 transaction date:</span> <span class="hljs-string">double</span> <span class="hljs-string">(nullable</span> <span class="hljs-string">=</span> <span class="hljs-literal">true</span><span class="hljs-string">)</span>
 <span class="hljs-string">|--</span> <span class="hljs-attr">X2 house age:</span> <span class="hljs-string">double</span> <span class="hljs-string">(nullable</span> <span class="hljs-string">=</span> <span class="hljs-literal">true</span><span class="hljs-string">)</span>
 <span class="hljs-string">|--</span> <span class="hljs-attr">X3 distance to the nearest MRT station:</span> <span class="hljs-string">double</span> <span class="hljs-string">(nullable</span> <span class="hljs-string">=</span> <span class="hljs-literal">true</span><span class="hljs-string">)</span>
 <span class="hljs-string">|--</span> <span class="hljs-attr">X4 number of convenience stores:</span> <span class="hljs-string">integer</span> <span class="hljs-string">(nullable</span> <span class="hljs-string">=</span> <span class="hljs-literal">true</span><span class="hljs-string">)</span>
 <span class="hljs-string">|--</span> <span class="hljs-attr">X5 latitude:</span> <span class="hljs-string">double</span> <span class="hljs-string">(nullable</span> <span class="hljs-string">=</span> <span class="hljs-literal">true</span><span class="hljs-string">)</span>
 <span class="hljs-string">|--</span> <span class="hljs-attr">X6 longitude:</span> <span class="hljs-string">double</span> <span class="hljs-string">(nullable</span> <span class="hljs-string">=</span> <span class="hljs-literal">true</span><span class="hljs-string">)</span>
 <span class="hljs-string">|--</span> <span class="hljs-attr">Y house price of unit area:</span> <span class="hljs-string">double</span> <span class="hljs-string">(nullable</span> <span class="hljs-string">=</span> <span class="hljs-literal">true</span><span class="hljs-string">)</span>
</code></pre><p>show() to check out a few rows and understand the data</p>
<pre><code>real_estate.show(<span class="hljs-number">2</span>)
<span class="hljs-symbol">Out:</span>
+---+-------------------+------------+--------------------------------------+-------------------------------+-----------+------------+--------------------------+
<span class="hljs-params">| No|</span>X1 transaction date<span class="hljs-params">|X2 house age|</span>X3 distance to the nearest MRT station<span class="hljs-params">|X4 number of convenience stores|</span>X5 latitude<span class="hljs-params">|X6 longitude|</span>Y house price of unit area<span class="hljs-params">|
+---+-------------------+------------+--------------------------------------+-------------------------------+-----------+------------+--------------------------+
|</span>  <span class="hljs-number">1</span><span class="hljs-params">|           2012.917|</span>        <span class="hljs-number">32.0</span><span class="hljs-params">|                              84.87882|</span>                             <span class="hljs-number">10</span><span class="hljs-params">|   24.98298|</span>   <span class="hljs-number">121.54024</span><span class="hljs-params">|                      37.9|</span>
<span class="hljs-params">|  2|</span>           <span class="hljs-number">2012.917</span><span class="hljs-params">|        19.5|</span>                              <span class="hljs-number">306.5947</span><span class="hljs-params">|                              9|</span>   <span class="hljs-number">24.98034</span><span class="hljs-params">|   121.53951|</span>                      <span class="hljs-number">42.2</span><span class="hljs-params">|
+---+-------------------+------------+--------------------------------------+-------------------------------+-----------+------------+--------------------------+
only showing top 2 rows</span>
</code></pre><p>describe() to see statistics of columns</p>
<pre><code>real_estate.describe().show()
<span class="hljs-symbol">Out:</span>
+-------+-----------------+-------------------+------------------+--------------------------------------+-------------------------------+--------------------+--------------------+--------------------------+
<span class="hljs-params">|summary|</span>               No<span class="hljs-params">|X1 transaction date|</span>      X2 house age<span class="hljs-params">|X3 distance to the nearest MRT station|</span>X4 number of convenience stores<span class="hljs-params">|         X5 latitude|</span>        X6 longitude<span class="hljs-params">|Y house price of unit area|</span>
+-------+-----------------+-------------------+------------------+--------------------------------------+-------------------------------+--------------------+--------------------+--------------------------+
<span class="hljs-params">|  count|</span>              <span class="hljs-number">414</span><span class="hljs-params">|                414|</span>               <span class="hljs-number">414</span><span class="hljs-params">|                                   414|</span>                            <span class="hljs-number">414</span><span class="hljs-params">|                 414|</span>                 <span class="hljs-number">414</span><span class="hljs-params">|                       414|</span>
<span class="hljs-params">|   mean|</span>            <span class="hljs-number">207.5</span><span class="hljs-params">| 2013.1489710144933|</span> <span class="hljs-number">17.71256038647343</span><span class="hljs-params">|                    1083.8856889130436|</span>              <span class="hljs-number">4.094202898550725</span><span class="hljs-params">|  24.969030072463745|</span>  <span class="hljs-number">121.53336108695667</span><span class="hljs-params">|         37.98019323671498|</span>
<span class="hljs-params">| stddev|</span><span class="hljs-number">119.6557562342907</span><span class="hljs-params">| 0.2819672402629999|</span><span class="hljs-number">11.392484533242524</span><span class="hljs-params">|                     1262.109595407851|</span>             <span class="hljs-number">2.9455618056636177</span><span class="hljs-params">|0.012410196590450208|</span><span class="hljs-number">0</span>.<span class="hljs-number">0153471</span>83004592374<span class="hljs-params">|        13.606487697735316|</span>
<span class="hljs-params">|    min|</span>                <span class="hljs-number">1</span><span class="hljs-params">|           2012.667|</span>               <span class="hljs-number">0</span>.<span class="hljs-number">0</span><span class="hljs-params">|                              23.38284|</span>                              <span class="hljs-number">0</span><span class="hljs-params">|            24.93207|</span>           <span class="hljs-number">121.47353</span><span class="hljs-params">|                       7.6|</span>
<span class="hljs-params">|    max|</span>              <span class="hljs-number">414</span><span class="hljs-params">|           2013.583|</span>              <span class="hljs-number">43.8</span><span class="hljs-params">|                              6488.021|</span>                             <span class="hljs-number">10</span><span class="hljs-params">|            25.01459|</span>           <span class="hljs-number">121.56627</span><span class="hljs-params">|                     117.5|</span>
+-------+-----------------+-------------------+------------------+--------------------------------------+-------------------------------+--------------------+--------------------+--------------------------+
</code></pre><p>Kaggle also provides the details on these attributes such as count, mean, standard deviation. This will allow you to decide on which parameters to use as features for the model</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1627258998184/WQj3DqvWA.png" alt="image.png" /></p>
<ul>
<li>Step 5: Use VectorAssembler to transform the data into a feature column</li>
</ul>
<p>After you have decided which columns to use, apply VectorAssembler to format the dataframe</p>
<pre><code>from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=[ 
 <span class="hljs-string">'X1 transaction date'</span>,
 <span class="hljs-string">'X2 house age'</span>,
 <span class="hljs-string">'X3 distance to the nearest MRT station'</span>,
 <span class="hljs-string">'X4 number of convenience stores'</span>,
 <span class="hljs-string">'X5 latitude'</span>,
 <span class="hljs-string">'X6 longitude'</span>],
 outputCol=<span class="hljs-string">'features'</span>)

data_set = assembler.transform(real_estate)
final_data = data_set.select([<span class="hljs-string">'features'</span>,<span class="hljs-string">'Y house price of unit area'</span>])
final_data.show(<span class="hljs-number">2</span>)

<span class="hljs-symbol">Out:</span>
+--------------------+--------------------------+
<span class="hljs-params">|            features|</span>Y house price of unit area<span class="hljs-params">|
+--------------------+--------------------------+
|</span>[<span class="hljs-number">2012.917</span>,<span class="hljs-number">32.0</span>,<span class="hljs-number">84</span>...<span class="hljs-params">|                      37.9|</span>
<span class="hljs-params">|[2012.917,19.5,30...|</span>                      <span class="hljs-number">42.2</span><span class="hljs-params">|
+--------------------+--------------------------+
only showing top 2 rows</span>
</code></pre><ul>
<li>Step 6: Split into train and test sets</li>
</ul>
<pre><code><span class="hljs-attribute">train_data</span>,test_data = final_data.randomSplit([<span class="hljs-number">0</span>.<span class="hljs-number">7</span>,<span class="hljs-number">0</span>.<span class="hljs-number">3</span>])
</code></pre><ul>
<li>Step 7: Train your model (fit the model on the train data)</li>
</ul>
<pre><code><span class="hljs-keyword">from</span> pyspark.ml.regression <span class="hljs-keyword">import</span> LinearRegression

lr = LinearRegression(labelCol=<span class="hljs-string">'Y house price of unit area'</span>)
lrModel = lr.fit(train_data)
</code></pre><ul>
<li>Step 8: Evaluate the model on the test set</li>
</ul>
<p>Check the evaluation metrics after validating with the test set:</p>
<pre><code>test_stats = lrModel.evaluate(test_data)
print(<span class="hljs-string">f"RMSE: <span class="hljs-subst">{test_stats.rootMeanSquaredError}</span>"</span>)
print(<span class="hljs-string">f"R2: <span class="hljs-subst">{test_stats.r2}</span>"</span>)
print(<span class="hljs-string">f"R2: <span class="hljs-subst">{test_stats.meanSquaredError}</span>"</span>)

Out:
RMSE: <span class="hljs-number">7.553238336636628</span>
R2: <span class="hljs-number">0.6493363975473592</span>
R2: <span class="hljs-number">57.051409370037256</span>
</code></pre><p>Root Mean Squared Error (RMSE) on test data = 7.553238336636628</p>
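<p>To actually generate predictions, the fitted model is applied with <code>transform</code>; a quick sketch on the held-out test set (column names as above):</p>
<pre><code class="lang-python"># Predictions for the test split, alongside the true prices
predictions = lrModel.transform(test_data)
predictions.select("features", "Y house price of unit area", "prediction").show(5)

# The learned coefficients and intercept are available for inspection as well
print(lrModel.coefficients)
print(lrModel.intercept)
</code></pre>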
<h2 id="conclusion">Conclusion</h2>
<p>Spark isn't just a better approach to comprehend our information; it's also a lot faster. Spark transforms data analytics and research by enabling us to handle a wide variety of data challenges in a preferred language. Spark MLlib makes it simple for new data scientists to engage with their models right out of the package and specialists can fine-tune as needed.</p>
<p>Distributed systems may be the domain of data engineers, while machine learning methods and algorithms are the domain of data scientists. Spark has significantly improved and revolutionized machine learning by allowing data scientists to concentrate on the data challenges that matter to them while transparently benefiting from the performance, convenience, and integration of Spark's unified system.</p>
]]></content:encoded></item><item><title><![CDATA[Pyspark Installation Guide]]></title><description><![CDATA[Stick around if you're for a complete guide to set up a pyspark environment for data science applications; pyspark functionality as well as the best platforms to be explored.
What is Pyspark?
Pyspark, a robust language that must be considered to lear...]]></description><link>https://anujsyal.com/pyspark-installation-guide</link><guid isPermaLink="true">https://anujsyal.com/pyspark-installation-guide</guid><category><![CDATA[spark]]></category><category><![CDATA[Python]]></category><category><![CDATA[guide]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[big data]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Mon, 07 Jun 2021 08:11:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1623053313098/iNREdXNP2.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Stick around if you're for a complete guide to set up a pyspark environment for data science applications; pyspark functionality as well as the best platforms to be explored.</p>
<h2 id="what-is-pyspark">What is Pyspark?</h2>
<p>Pyspark is a robust tool worth learning if you're into the idea of creating more scalable pipelines and analyses. According to Chris Min, a data engineer, Pyspark essentially enables writing Spark apps in Python and makes data processing efficient in a distributed fashion. Python is not just a great language, but an all-in-one ecosystem for performing exploratory data analysis, creating ETLs for data platforms, and building ML pipelines. You might also say that PySpark is a whole library that can be used for large-scale data processing on a single machine or a cluster; moreover, it handles all that parallel processing for you without even touching Python's threading or multiprocessing modules. </p>
<h2 id="spark-is-the-real-deal-for-data-engineering">Spark is the Real Deal For Data Engineering</h2>
<p>According to the <a target="_blank" href="https://rdcu.be/clqD9">International Journal of Data Science and Analytics</a>, the emergence of Spark as a general-purpose cluster computing framework with language-integrated APIs in Python, Scala, and Java is a real thing right now. Its impressively advanced in-memory programming model and its libraries for structured data processing, scalable ML, and graph analysis increase its usefulness in the data science industry. And as a matter of fact, it is undeniable that past a certain scale of data processing, scaling with Pandas is hard. Being a data engineer involves a lot of large data processing, which isn't a big deal once you are well-versed in Spark.</p>
<h2 id="why-should-data-scientists-learn-spark">Why Should Data Scientists Learn spark?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1623052888461/eY2vW6ilG.jpeg" alt="sigmund-AxAPuIRWHGk-unsplash.jpg" /></p>
<p><a target="_blank" href="https://unsplash.com/photos/AxAPuIRWHGk">https://unsplash.com/photos/AxAPuIRWHGk</a></p>
<p>For a data scientist, learning Spark can be a game-changer. For large data processing, Spark is far better than Pandas while not so different in use, so switching to it is not a big deal, and you get real benefits in your data engineering operations. Spark has solutions to various issues and offers a complete collection of libraries to execute logic efficiently. It provides a clean and efficient experience, often better than Pandas, especially when dealing with large datasets, thanks to its high-performance analysis and user-friendly structure.</p>
<h1 id="exploring-all-the-options-for-pyspark-setup">Exploring All The Options for Pyspark Setup</h1>
<p>I also have a video version of this article, if you are interested feel free to watch this video on my youtube channel</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/Ql_jfk3UnHE">https://youtu.be/Ql_jfk3UnHE</a></div>
<p>Following is a set of options you can consider to set up the PySpark ecosystem. The list below covers the best platforms you can consider:</p>
<h2 id="setting-up-locally-spark-and-python-on-ubuntu">Setting Up Locally Spark and Python On Ubuntu</h2>
<ul>
<li>Install Java</li>
</ul>
<pre><code class="lang-python">sudo apt install openjdk<span class="hljs-number">-8</span>-jdk
</code></pre>
<ul>
<li>Download spark from <a target="_blank" href="https://spark.apache.org/downloads.html"><code>https://spark.apache.org/downloads.html</code></a> linux version</li>
<li>Set environment variables <code>sudo nano /etc/environment</code></li>
</ul>
<pre><code class="lang-python">JAVA_HOME=<span class="hljs-string">"/usr/lib/jvm/java-8-openjdk-amd64"</span>
<span class="hljs-comment">#Save and exit</span>
</code></pre>
<ul>
<li>To test <code>echo $JAVA_HOME</code> and see path to confirm installation</li>
<li>Open bashrc <code>sudo nano ~/.bashrc</code> and at the end of the file add <code>source /etc/environment</code></li>
<li>This should setup your Java environment on ubuntu</li>
<li>Install Spark: after downloading it in step 2, install it with the following commands</li>
</ul>
<pre><code class="lang-python">cd Downloads
sudo tar -zxvf spark<span class="hljs-number">-3.1</span><span class="hljs-number">.2</span>-bin-hadoop3<span class="hljs-number">.2</span>.tgz
</code></pre>
<ul>
<li>Configure environment for spark <code>sudo nano ~/.bashrc</code> and add the following</li>
</ul>
<pre><code class="lang-python">export SPARK_HOME=~/Downloads/spark<span class="hljs-number">-3.1</span><span class="hljs-number">.2</span>-bin-hadoop3<span class="hljs-number">.2</span>
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:~/anaconda3/bin
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON=<span class="hljs-string">"jupyter"</span>
export PYSPARK_DRIVER_PYTHON_OPTS=<span class="hljs-string">"notebook"</span>
export PYSPARK_PYTHON=python3
export PATH=$PATH:$JAVA_HOME/jre/bin
</code></pre>
<ul>
<li>Save and exit</li>
<li>To test <code>pyspark</code></li>
</ul>
<p><strong>Don't have ubuntu? Use VirtualBox</strong></p>
<p>Set up ubuntu on your local machine using VirtualBox. VirtualBox basically enables you to run a virtual computer on top of your own physical computer. You can use it to set up Spark and Python (20-30 mins approx):</p>
<ul>
<li>Start with downloading the <a target="_blank" href="https://www.virtualbox.org/wiki/Downloads">Virtualbox</a>.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1623052986399/Dsq0H_avU.png" alt="Untitled 1.png" /></p>
<blockquote>
<p>Screenshot from Virtualbox download</p>
</blockquote>
<ul>
<li>Download ubuntu ISO <a target="_blank" href="https://ubuntu.com/download/desktop/thank-you?version=20.04.2.0&amp;architecture=amd64">Image</a></li>
<li>In virtual box click on new and setup ubuntu 64 bit environment</li>
<li>Pass in the desired cpu cores, memory, and storage</li>
<li>Point to the downloaded ubuntu image</li>
</ul>
<h2 id="setting-up-locally-spark-and-python-on-mac">Setting Up Locally Spark and Python On Mac</h2>
<ul>
<li>Make sure Homebrew is installed and updated, if not go to this <a target="_blank" href="https://brew.sh/">link</a> or type in terminal</li>
</ul>
<pre><code class="lang-python">/usr/bin/ruby -e <span class="hljs-string">"$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"</span>
</code></pre>
<ul>
<li>Open terminal and Install Java</li>
</ul>
<pre><code class="lang-python">brew install java
<span class="hljs-comment">#to check if java installed?</span>
brew info java
</code></pre>
<ul>
<li>Install scala</li>
</ul>
<pre><code class="lang-python">brew install scala
</code></pre>
<ul>
<li>Install Spark</li>
</ul>
<pre><code class="lang-python">brew install apache-spark
</code></pre>
<ul>
<li>Install python</li>
</ul>
<pre><code class="lang-python">brew install python3
</code></pre>
<ul>
<li>Setup environment bashrc
Open file <code>sudo nano .bashrc</code></li>
<li>Add following env variables</li>
</ul>
<pre><code class="lang-python"><span class="hljs-comment">#java</span>
export JAVA_HOME=/Library/java/JavaVirtualMachines/adoptopenjdk<span class="hljs-number">-8.j</span>dk/contents/Home/
export JRE_HOME=/Library/java/JavaVirtualMachines/openjdk<span class="hljs-number">-13.j</span>dk/contents/Home/jre/
<span class="hljs-comment">#spark</span>
export SPARK_HOME=/usr/local/Cellar/apache-spark/<span class="hljs-number">2.4</span><span class="hljs-number">.4</span>/libexec
export PATH=/usr/local/Cellar/apache-spark/<span class="hljs-number">2.4</span><span class="hljs-number">.4</span>/bin:$PATH
<span class="hljs-comment">#pyspark</span>
export PYSPARK_PYTHON=/usr/local/bin/python3 <span class="hljs-comment"># or your path to python</span>
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=<span class="hljs-string">'notebook'</span>
</code></pre>
<ul>
<li>This should configure the pyspark setup; to test, type <code>pyspark</code> in the terminal</li>
</ul>
<h2 id="setting-up-locally-with-docker-and-jupyter-notebook-my-preferred-method">Setting up locally with docker and jupyter notebook (My preferred Method)</h2>
<p><strong>What is docker?</strong></p>
<p>Docker is an open platform for developing, shipping, and running applications. If you want to learn more about docker, check out this <a target="_blank" href="https://docs.docker.com/get-started/overview/">link</a> </p>
<p>Setting up Spark with docker and jupyter notebook is quite a simple task involving a few steps that help build up an optimal environment for PySpark to be run on Jupyter Notebook in no time. Follow the steps mentioned below:</p>
<ul>
<li>Install <a target="_blank" href="https://docs.docker.com/get-docker/">Docker</a></li>
<li>Use a pre-existing docker image <a target="_blank" href="https://hub.docker.com/r/jupyter/pyspark-notebook">jupyter/pyspark-notebook</a> by <a target="_blank" href="https://hub.docker.com/u/jupyter">jupyter</a></li>
<li>Pull Image</li>
</ul>
<pre><code class="lang-python">docker pull jupyter/pyspark-notebook
</code></pre>
<ul>
<li>Docker Run</li>
</ul>
<pre><code class="lang-python">docker run -d -p <span class="hljs-number">8888</span>:<span class="hljs-number">8888</span> jupyter/pyspark-notebook:latest
</code></pre>
<ul>
<li>Go to <a target="_blank" href="http://localhost:8888">localhost:8888</a>, create a new notebook, and run a cell with <code>import pyspark</code>; a fuller sanity check is sketched below</li>
</ul>
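<p>A slightly fuller sanity check, assuming nothing beyond pyspark being available in the notebook, might look like this:</p>
<pre><code class="lang-python">from pyspark.sql import SparkSession

# Local session inside the container
spark = SparkSession.builder.master("local[*]").appName("smoke_test").getOrCreate()

# Tiny throwaway DataFrame just to confirm everything works
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()

spark.stop()
</code></pre>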
<h2 id="databricks-setup">Databricks Setup</h2>
<p>Databricks, a unified analytics platform, provides well-managed Spark clusters in the cloud. It is an easy-to-use environment that encourages users to learn, collaborate, and work in a fully integrated workspace. Any Spark code can be scheduled without hassle, as Databricks supports pyspark natively.</p>
<ul>
<li>To start, <a target="_blank" href="https://databricks.com/try-databricks">create a databricks account</a> (this is usually done by databricks admins) and link it to your preferred cloud provider. For more information on getting started, check out this <a target="_blank" href="https://www.youtube.com/watch?v=3fqfWYBXj2A">video</a></li>
</ul>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=3fqfWYBXj2A">https://www.youtube.com/watch?v=3fqfWYBXj2A</a></div>
<ul>
<li>You have to start with creating a Databricks cluster.</li>
<li>Create a databricks notebook and test it with <code>import pyspark</code>, or with the quick check shown below</li>
</ul>
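<p>Databricks notebooks come with a SparkSession already created and exposed as <code>spark</code>, so a minimal check can simply use it:</p>
<pre><code class="lang-python"># `spark` is pre-defined in a Databricks notebook; no builder needed
df = spark.range(10)
df.show()
</code></pre>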
<h2 id="spark-and-python-on-aws-ec2">Spark and Python on AWS EC2</h2>
<p>Amazon EC2 instances are virtual machines provided by AWS. They come with a pre-installed OS via AMIs, but the rest of the dependencies need to be installed separately. </p>
<ul>
<li>Go to AWS Console and EC2</li>
<li>Select Ubuntu AMI</li>
<li>Follow the steps from Option 1</li>
</ul>
<p>In general, avoid this approach and use one of the other options instead</p>
<h2 id="pyspark-on-aws-sagemaker-notebooks"><strong>Pyspark on AWS Sagemaker Notebooks</strong></h2>
<p>Launched in 2017, Amazon SageMaker is a cloud-based machine-learning platform that is fully managed and decouples your environments across development, training, and deployment, letting you scale them separately whilst helping you optimize your spending and time. It is really easy to spin up Sagemaker notebooks with a click of a few buttons. An Amazon SageMaker notebook instance is a machine learning (ML) compute instance running the Jupyter Notebook environment. It comes with pre-configured Conda environments like python2, python3, PyTorch, TensorFlow, etc.</p>
<ul>
<li>Log in to your aws console and go to Sagemaker</li>
<li>Click on Notebook, Notebook Instances on the left side</li>
<li>Click on Create Notebook Instances, give it a name and select desired configurations</li>
<li>Select an instance type; maybe start small with ml.t2.medium, and spin up a more powerful instance later if needed</li>
<li>Click create and wait for a few minutes and then click on open jupyterlab to go to the notebook</li>
<li>Create a new notebook and write the following code snippet to run pyspark</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> sagemaker_pyspark
<span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession, DataFrame

classpath = <span class="hljs-string">":"</span>.join(sagemaker_pyspark.classpath_jars())
spark = SparkSession.builder.config(
        <span class="hljs-string">"spark.driver.extraClassPath"</span>, classpath
    ).getOrCreate()
</code></pre>
<ul>
<li>If you are interested to know more about Sagemaker, do check out my previous <a target="_blank" href="https://youtu.be/95332cm5ROo">video</a>, <a target="_blank" href="https://youtu.be/95332cm5ROo">Sagemaker in 11 Minutes</a></li>
</ul>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/95332cm5ROo">https://youtu.be/95332cm5ROo</a></div>
<h2 id="aws-emr-cluster-setup">AWS EMR Cluster Setup</h2>
<p>Amazon EMR, probably one of the best places to run Spark, helps you create Spark clusters very easily. It is equipped with features such as Amazon S3 connectivity, which makes it fast and convenient, plus integration with the EC2 spot market and EMR Managed Scaling.</p>
<p>To be precise, EMR is a managed big data service that supports data science applications written in Python, Scala, and Pyspark. It provides a convenient Spark cluster setup so that data scientists have a platform to develop and visualize on.</p>
<ul>
<li>Go to AWS console and search for EMR</li>
<li>Click on create a cluster</li>
<li>In General Configuration give it a name; in Software Configuration select the Spark application</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1623053201774/btbcrPm3O.png" alt="EMR Cluster setup 1.png" /></p>
<ul>
<li>In Hardware configuration select an instance type (maybe start small with m1.medium) and the number of instances in the cluster</li>
<li>In Security, select EC2 key pairs, usually created by an administrator; if you don't have one, you can follow the steps on the right to create programmatic access keys for the cluster to use</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1623053214057/njKjy_dB5.png" alt="EMR Cluster setup 2.png" /></p>
<ul>
<li>Keep the rest of the options at their defaults and create the cluster</li>
<li>After that, create an EMR notebook and select the newly created cluster to execute your jobs at scale</li>
</ul>
<h2 id="conclusion">Conclusion:</h2>
<p>Spark, a complete analytics engine, helps data scientists with large data processing jobs that are difficult to handle with Pandas. Thus, learning PySpark, a robust library, can help data engineers a lot in their day-to-day work. Now that you know the various platforms that let you set up Spark clusters on well-managed clouds, you can explore them yourself.</p>
]]></content:encoded></item><item><title><![CDATA[How To NFT?]]></title><description><![CDATA[NFTs are digital directories that are powered by a blockchain system which is the same infrastructure that characterizes common cryptocurrency. However, an NFT is a unique kind of cryptocurrency, and the blockchain database on which it is stored auth...]]></description><link>https://anujsyal.com/how-to-nft</link><guid isPermaLink="true">https://anujsyal.com/how-to-nft</guid><category><![CDATA[Cryptocurrency]]></category><category><![CDATA[Ethereum]]></category><category><![CDATA[Blockchain]]></category><category><![CDATA[Bitcoin]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Tue, 11 May 2021 06:27:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1620713415904/l1eYLV3AV.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>NFTs are digital directories that are powered by a blockchain system which is the same infrastructure that characterizes common cryptocurrency. However, an NFT is a unique kind of cryptocurrency, and the blockchain database on which it is stored authenticates whoever the legitimate holder of that cryptocurrency.</p>
<p>NFTs are considered an element of the Ethereum blockchain, much like other cryptocurrencies. Before we understand what an NFT is, let's look at the underlying technology behind NFTs, which is Ethereum.</p>
<h2 id="what-is-ethereum-blockchain">What is Ethereum blockchain?</h2>
<p>After Bitcoin, Ethereum is the second-largest blockchain by trading volume. However, it wasn't designed to serve only as an electronic currency. Instead, Ethereum's creators set out to build a different type of global, distributed computing platform, bringing blockchain's security and transparency to a wide variety of applications.</p>
<p>Various financial products, apps, and complex systems are already running on Ethereum, and only the creators' imagination limits its potential. Ethereum can be used to formalize, decentralize, preserve, and exchange almost anything.</p>
<h2 id="nft-values-digital-art-in-millions">NFT Values digital art in millions</h2>
<p>The value of NFT artwork is on the rise. Several artists are selling their masterpieces for millions. Recently Mike Winkelmann sold an NFT of his art piece for $69 million, and the auction house rated him among the most valuable living artists.</p>
<p>When NFTs initially came to everyone's notice, few people knew what they were or what they would be used for. They are now a flourishing industry, with items selling for millions of dollars each. </p>
<p>NFT token revenues exceeded $500 million in February, more than double the total for the whole year of 2020. More than 191,000 electronic art pieces have now been purchased for a record of $533 million, per the Blockchain Art report.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1620713898338/N9m0y545k.png" alt="image.png" /></p>
<h2 id="how-to-buy-your-first-nft">How to buy your first NFT?</h2>
<p>NFT marketplaces allow users to upload their own creative pieces as well as purchase other people’s art, which can be a wonderful experience. Furthermore, looking at what other people are offering will give you a better idea of what is trending and popular.</p>
<h3 id="opensea-marketplace">OpenSea Marketplace</h3>
<p>This is unquestionably one of the best places for generating your NFT. You can generate the token for yourself, quickly and easily, thanks to a very user-friendly creation flow. Even then, you can expect to pay a fee in ETH to get your NFT listed.</p>
<p>Generating the token is free, but selling it is not. Even so, OpenSea is a great option because it is well-known and draws a large number of customers. The platform has some very innovative tools that are worth exploring. For example, OpenSea allows you to bundle your NFT in offerings with those of other vendors, a one-of-a-kind feature that can be very useful because it expands your reach.</p>
<h2 id="how-to-create-your-nft-on-opensea">How to create your NFT on OpenSea?</h2>
<ul>
<li>Visit the official website and click on ‘Create’.</li>
<li>You will have to use your wallet to access the website.</li>
<li>After accepting the terms of service, add the details of your collection, including name, description, and logo.</li>
<li>After creating a collection, you can select the items you want to tokenize.</li>
<li>You can add images, sound, and 3D models in any format.</li>
<li>Different characteristics can also be added to the token.</li>
<li>After creating the token, you can start selling it online by clicking on ‘Sell’.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1620713868246/UK7whZJA7.png" alt="image.png" /></p>
<h2 id="current-problems-with-nfts">Current problems with NFTs</h2>
<ul>
<li>ERC-20 projects are unable to execute micropayment transfers on Ethereum due to rising gas prices, which makes it impractical to use the Ethereum network for one of its main applications. Ethereum charges gas fees: these are the fees paid to miners to complete transfers. The price isn't fixed; it varies with demand on the network. If a transaction's fee does not meet the miners' expectations, it will be postponed or refused entirely.</li>
<li>Non-fungible tokens (NFTs), which are individual pieces of crypto content, have been partly responsible for the thousands of tonnes of planet-warming carbon emissions produced by the tokens used to purchase and trade them. Many creators, particularly those who have already profited from the trend, believe it is a simple challenge to overcome. Others believe the existing strategies are ineffective.</li>
<li>A critical issue with NFTs is that the accounts holding them can be hacked like any other account, and the NFTs stolen outright. Some social media users announced that their Nifty Portal identities had been compromised and that NFTs valued at millions of dollars had been stolen.</li>
<li>Several creators have raised concerns about the environmental consequences of crypto art. The exact amount of energy used to mint artwork on the blockchain varies, but it can range from days to weeks or even months of an ordinary citizen's energy consumption.</li>
</ul>
<h2 id="ethereum-community-is-trying-to-curb-the-high-gas-price">Ethereum community is trying to curb the high gas price</h2>
<p>Eth2 is a collection of enhancements planned to increase the platform's performance, usability, stability, durability, and versatility. It would ease the gas problem by making Dai transactions and other DeFi services less costly.</p>
<p>By transitioning to proof-of-stake (PoS), Ethereum's creators hope to reduce the heavy resource demands of the existing proof-of-work consensus mechanism and its dependency on specialized hardware. The PoS framework that will be implemented on the Beacon Network lets participants secure the distributed Ethereum blockchain by putting up an economic stake, keeping the platform secure while reducing energy consumption.</p>
<p>Sharding, another expected enhancement, would allow the system to handle far more transactions than it currently can, lowering transaction costs by reducing competition for space in the next block. To spread the load, Eth2 can distribute transactions across a large number of shards.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Many traders are making risky investments on the NFT industry and NFT artwork with expectations of seeing their worth boom. Others buy NFTs primarily for the purpose of recognition, personal satisfaction, or simply to enter a new culture.</p>
<p>To summarize, an NFT is a digital piece of artwork paired with a token that makes it unique and records its ownership on a blockchain. Its value is determined by who created it and by how much people believe it is worth.</p>
<h1 id="more-interested-in-the-topic">More interested in the topic</h1>
<p>Check out this YouTube video I created for a detailed walkthrough of buying your first NFT. </p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/Malw5Wg79kk"></iframe>


<h1 id="follow-me-on-linkedin-and-twitter">Follow Me on Linkedin &amp; Twitter</h1>
<p>If you are interested in similar content, hit the follow button on Medium or follow me on  <a target="_blank" href="https://twitter.com/anuj_syal">Twitter</a>  and  <a target="_blank" href="https://www.linkedin.com/in/anuj-syal-727736101/">LinkedIn</a> </p>
]]></content:encoded></item><item><title><![CDATA[The SageMaker Saga]]></title><description><![CDATA[Many data scientists develop, train, and deploy ML models within a hosted environment.  Regrettably for them, they do not have the convenience and facility for scaling up or scaling down resources as and when required based on their models.
This is w...]]></description><link>https://anujsyal.com/the-sagemaker-saga</link><guid isPermaLink="true">https://anujsyal.com/the-sagemaker-saga</guid><category><![CDATA[AWS]]></category><category><![CDATA[Amazon Web Services]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Anuj Syal]]></dc:creator><pubDate>Wed, 28 Apr 2021 06:06:34 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1619584927235/_g9qwG7rC.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Many data scientists develop, train, and deploy ML models within a hosted environment.  Regrettably for them, they do not have the convenience and facility for scaling up or scaling down resources as and when required based on their models.</p>
<p>This is where AWS SageMaker comes into the picture! It solves the issue by letting developers build and train models and get them to production faster, with minimal effort and at an economical cost. </p>
<h3 id="but-firstwhat-is-aws-you-ask">But first…what is AWS you ask?</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1619585596424/QS-2edvGo.jpeg" alt="hello-i-m-nik-r22qS5ejODs-unsplash.jpg" /></p>
<blockquote>
<p>Photo by <a href="https://unsplash.com/@helloimnik?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Hello I'm Nik</a> on <a href="https://unsplash.com/s/photos/amazon?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>Amazon Web Services (AWS) is the world's most comprehensive and widely adopted on-demand cloud platform, run by, you guessed it...Amazon, offering over 200 fully featured services from data centres around the world. AWS services can be used to build, monitor, and deploy any type of application in the cloud, enabling millions of people and businesses, including the fastest-growing start-ups, leading government agencies and the largest enterprises, to lower costs, innovate faster and become more agile.
Providing a massive global cloud infrastructure, AWS allows you to quickly innovate, iterate and experiment. With proven operational expertise, the flexibility to choose the services you need and far more functionality and features than any other cloud provider, AWS lets you focus on innovation, not just infrastructure.
As a language- and OS-agnostic platform, AWS provides a highly secure, scalable and reliable low-cost infrastructure in the cloud that powers hundreds of thousands of businesses and millions of customers in over 190 countries around the world.
Today AWS has one of the largest and most dynamic communities of customers and partners, drawn from virtually every industry and every company size.</p>
<h3 id="welcome-aws-sagemaker">Welcome, AWS SageMaker</h3>
<p>Launched in 2017, Amazon SageMaker is a fully managed, cloud-based machine-learning platform that decouples your development, training and deployment environments, letting you scale them separately while helping you optimise both spend and time. AWS SageMaker includes modules that data scientists and developers can use together or independently to build, train, and deploy ML models at any scale.
AWS SageMaker empowers everyday developers and scientists to use machine learning without any previous experience. Developers across the world are adopting SageMaker in different ways, some for the end-to-end flow and others just to scale up training jobs.</p>
<h3 id="why-aws-sagemaker-the-advantages">Why AWS SageMaker: The Advantages</h3>
<p>The AWS SageMaker comes with a pool of advantages, some of which I am listing below:</p>
<ul>
<li>It improves and enhances the productivity of a machine learning project</li>
<li>It aids in creating and managing compute instances in the least amount of time </li>
<li>It reduces the cost of building machine learning models by up to 70% </li>
<li>It can automatically create, train, and deploy models with complete visibility, starting by inspecting the raw data </li>
<li>It reduces the time required for data labelling tasks </li>
<li>It helps in storing all Machine Learning components in one place</li>
<li>It trains models faster and is highly scalable</li>
<li>It maintains uptime — Process keeps on running without any stoppage</li>
<li>It maintains high data security</li>
</ul>
<p>A big umbrella over all the ML services, SageMaker aims to provide a single place for all your Machine Learning and Data Science workflows. It covers every step involved, from provisioning cloud resources and importing data, to cleaning and labelling the data (including manual labelling), to training models, automating workflows and deploying models in production. </p>
<h3 id="aws-sagemaker-demo-in-10-minutes">AWS Sagemaker Demo in 10 minutes</h3>
<p>Looking for a quick start on the SageMaker console? Check out this video on YouTube.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/95332cm5ROo"></iframe>

<blockquote>
<p>Sagemaker in 11 minutes by Anuj Syal</p>
</blockquote>
<h3 id="exploring-the-full-potential-sagemakers-features-and-capabilities">Exploring the Full Potential: SageMaker’s Features and Capabilities</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1619585192671/NqEQ569gQ.png" alt="Untitled.png" /></p>
<blockquote>
<p>Source: https://aws.amazon.com/sagemaker/</p>
</blockquote>
<h4 id="prepare">Prepare</h4>
<p>Even if you don't have a labelled dataset, AWS SageMaker lets you enlist human labellers, for example via Amazon Mechanical Turk, to label your dataset correctly. One such capability is <code>Amazon SageMaker Ground Truth</code>, a fully managed data labelling service that helps you build the right training dataset. You can get started with labelling your data in minutes through the SageMaker Ground Truth console using custom or built-in data labelling workflows. </p>
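<p>If you prefer to check on a Ground Truth labelling job from code rather than the console, a minimal boto3 sketch could look like the one below. It assumes your AWS credentials and region are already configured, and the job name is a hypothetical placeholder.</p>
<pre><code class="lang-python">import boto3

# Assumes credentials and region are configured (e.g. via `aws configure`).
sm = boto3.client("sagemaker")

# List recent labelling jobs and their statuses.
jobs = sm.list_labeling_jobs(MaxResults=10)
for job in jobs["LabelingJobSummaryList"]:
    print(job["LabelingJobName"], job["LabelingJobStatus"])

# "my-image-labels" is a placeholder job name used purely for illustration.
detail = sm.describe_labeling_job(LabelingJobName="my-image-labels")
print(detail["LabelCounters"])
</code></pre>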
<h4 id="build">Build</h4>
<p>AWS SageMaker makes it easy to build ML models and get them ready for training by providing everything you need to quickly connect to your training data and to select and optimise the best algorithm and framework for your application. Amazon SageMaker includes hosted Jupyter notebooks that make it easy to explore and visualise your training data stored on Amazon S3. You can either connect directly to data in S3, or use AWS Glue to move data from Amazon DynamoDB, Amazon RDS, and Amazon Redshift into S3 for analysis in your notebook.
For ease of algorithm selection, AWS SageMaker ships with 10 of the most frequently used ML algorithms, pre-installed and optimised to deliver up to 10 times the performance you would get running them anywhere else. SageMaker also comes pre-configured to run Apache MXNet and TensorFlow, two of the most widely used open-source frameworks, and you still have the option of bringing your own framework.</p>
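<p>As a hedged illustration of the Build step, here is a minimal sketch using the SageMaker Python SDK to point a built-in algorithm (XGBoost) at training data in S3. The S3 bucket, paths and IAM role ARN are placeholders you would replace with your own.</p>
<pre><code class="lang-python">import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
region = session.boto_region_name
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder role ARN

# Resolve the container image for the built-in XGBoost algorithm.
image_uri = sagemaker.image_uris.retrieve("xgboost", region, version="1.5-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/models/",  # placeholder output location
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="reg:squarederror", num_round=100)

# CSV training data already uploaded to S3 (placeholder path).
train_input = TrainingInput("s3://my-bucket/data/train.csv", content_type="text/csv")
</code></pre>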
<h4 id="train">Train</h4>
<p>The next essential step in AWS SageMaker is training a model, and alongside it, evaluating the model. Training primarily involves an algorithm, and choosing that algorithm depends on several other factors; for effective and faster use, AWS SageMaker provides built-in algorithms as well.
Another key requirement for training a Machine Learning model is compute resources. The size of the training dataset and the desired speed of results help determine what resources are needed.
After training completes, you evaluate the model to test the accuracy of its inferences. The AWS SDK for Python (Boto3) or the high-level SageMaker Python SDK can be used to send inference requests to the model, and a Jupyter notebook is a convenient place to run both training and evaluation.</p>
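<p>Continuing the Build sketch above, kicking off a managed training job and locating the resulting model artefact takes only a couple of lines (the channel name and paths remain illustrative):</p>
<pre><code class="lang-python"># Launch a managed training job on the instances configured earlier.
estimator.fit({"train": train_input})

# After training completes, the serialized model artefact lives in S3.
print("Model artefact:", estimator.model_data)
</code></pre>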
<h4 id="deploy">Deploy</h4>
<p>Once your model is trained and tuned, AWS SageMaker makes it easy to deploy it in production so you can start generating predictions on new data (a process called inference). To deliver both high performance and high availability, SageMaker deploys your model on an auto-scaling cluster of Amazon EC2 instances spread across multiple availability zones. AWS SageMaker also comes with built-in A/B testing capabilities to help you test your model and experiment with different versions to achieve the best results.
AWS SageMaker takes away the heavy lifting of ML, so you can build, train, and deploy machine learning models easily and efficiently.</p>
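<p>To illustrate the Deploy step, the trained estimator from the earlier sketch could be hosted on a real-time endpoint and queried roughly as follows; the instance type and the sample CSV record are assumptions chosen purely for illustration.</p>
<pre><code class="lang-python">from sagemaker.serializers import CSVSerializer

# Stand up a real-time HTTPS endpoint backed by managed EC2 instances.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    serializer=CSVSerializer(),
)

# Send one CSV record for inference (feature values are made up).
result = predictor.predict("5.1,3.5,1.4,0.2")
print(result)

# Tear the endpoint down when finished to stop incurring cost.
predictor.delete_endpoint()
</code></pre>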
<h3 id="validating-a-model-with-sagemaker">Validating a Model with SageMaker</h3>
<p>You can validate your model either offline, using historical data, or online, using live traffic:</p>
<p><strong>Offline Testing:</strong> Historical data is used to send inference requests to the model, typically through a Jupyter notebook in Amazon SageMaker, and the responses are evaluated.</p>
<p><strong>Online Testing with Live Data:</strong> Multiple model variants are deployed behind a single Amazon SageMaker endpoint, and a portion of live traffic is directed to each variant for validation.</p>
<p><strong>Validating Using a "Holdout Set":</strong> A portion of the data, the "holdout set", is set aside and never used for training. The model is trained on the remaining data and then evaluated on the holdout set to check how well it generalizes to data it has not seen.</p>
<p><strong>K-fold Validation:</strong> The input data is split into k equal parts (folds). Each fold takes a turn as the validation data while the remaining k−1 folds are used as training data, and the k evaluation scores are combined into a final estimate of model quality.</p>
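<p>The holdout and k-fold ideas above are not SageMaker-specific; a small scikit-learn sketch on a synthetic dataset (used purely for illustration) shows both splits in a few lines.</p>
<pre><code class="lang-python">import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Holdout set: keep 20% of the data aside and never train on it.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_hold, y_hold))

# K-fold: each of the k folds takes a turn as validation data,
# while the remaining k-1 folds are used for training.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("5-fold accuracies:", np.round(scores, 3))
</code></pre>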
<h3 id="sneak-peek-aws-sagemaker-studio-and-architectural-view">Sneak Peek: AWS SageMaker Studio and Architectural View</h3>
<p>Amazon SageMaker Studio is a fully integrated development environment (IDE) for machine learning, where building, training, and deploying models can all be done under one roof.</p>
<ul>
<li><p><strong>Amazon SageMaker Notebooks:</strong> Used for easily creating and sharing Jupyter notebooks.</p>
</li>
<li><p><strong>Amazon SageMaker Experiments:</strong>  Used for tracking, organizing, comparing, and evaluating different ML experiments.</p>
</li>
<li><p><strong>Amazon SageMaker Debugger:</strong> As the name suggests, it is used for debugging and analyzing training issues of complex types and receiving alert notifications for the errors.</p>
</li>
<li><p><strong>Amazon SageMaker Model Monitor:</strong> This is used to detect quality deviations for deployed ML models.</p>
</li>
<li><p><strong>Amazon SageMaker Autopilot:</strong> It is used to build ML models automatically with full visibility and control.</p>
</li>
</ul>
<h3 id="final-words-conclusion">Final Words: Conclusion</h3>
<p>Machine learning is the future of application development, and AWS SageMaker is set to revolutionize the world of computing. The sheer productivity gains machine learning brings to applications will create new prospects for the adoption of ML services such as SageMaker. </p>
]]></content:encoded></item></channel></rss>