Big Data Processing with Apache Beam
Every minute of every day, vast amounts of data are generated from a wide variety of sources, and extracting and processing useful information from it is tedious. Apache Beam comes into the picture to solve this problem.
Apache Beam is an open-source, unified programming model for defining and executing data processing pipelines, including ETL and both batch and streaming processing. Using your favorite supported language (currently Python, Java, and Go), you can write your jobs with the Apache Beam SDK and execute the pipeline on your favorite runner, such as Apache Spark, Apache Flink, or Google Cloud Dataflow.
Data Ingestion & Types of Data
Our data comes in two types: batch data and streaming data. Depending on the use case, we choose different architectural models to process it. Here, we will use Python for the operations that follow. The Apache Beam Python SDK requires Python 3.6 or higher. Now, install the Apache Beam SDK using one of the following commands.
Local
pip install apache-beam
Google Cloud Platform
pip install apache-beam[gcp]
Amazon Web Services
pip install apache-beam[aws]
For I/O operations, you can read and write data from various sources and sinks such as Avro, Parquet, BigQuery, Pub/Sub, MongoDB, TFRecord, etc.
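As a minimal sketch (the file paths below are hypothetical placeholders), switching a source or sink is just a matter of swapping the I/O transform; for example, reading Parquet records and writing them out as text:

import apache_beam as beam

# Minimal sketch: the same pipeline shape works for different connectors.
# The input and output paths are hypothetical placeholders.
# Reading Parquet assumes pyarrow is installed.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'ReadParquet' >> beam.io.ReadFromParquet('data/events.parquet')
        | 'ToString' >> beam.Map(str)  # each Parquet record arrives as a dict
        | 'WriteText' >> beam.io.WriteToText('data/events_as_text')
    )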
Batch Data
First of all, historical data is collected into a data lake, which holds raw (unprocessed) data. To do some processing and transformation, the data is pulled into a storage service (an S3 bucket, Cloud Storage, an on-premise storage device, etc.); this step is called extraction of data from the data lake.
beam.io.ReadFromText('')
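For example, a minimal batch pipeline sketch (the bucket paths here are hypothetical placeholders) reads the extracted text files, applies a trivial transform, and writes the result back to storage:

import apache_beam as beam

# Minimal batch pipeline sketch; the paths are hypothetical placeholders.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'ReadExtractedData' >> beam.io.ReadFromText('gs://my-bucket/extracted/*.txt')
        | 'StripWhitespace' >> beam.Map(lambda line: line.strip())
        | 'WriteCleanedData' >> beam.io.WriteToText('gs://my-bucket/processed/output')
    )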
Stream Data
This is real-time data generated by data centers, automobiles, maps, healthcare systems, logging devices, sensors, and so on. To ingest streaming data, use Apache Kafka or another messaging service (such as Cloud Pub/Sub or SNS). In Pub/Sub, you can filter data according to your needs.
beam.io.ReadFromPubSub(subscription=subscription_name)
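A minimal streaming sketch (the project and subscription names below are hypothetical placeholders) enables streaming mode and decodes the incoming Pub/Sub messages:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

# Streaming pipeline sketch; the subscription path is a hypothetical placeholder.
options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(
            subscription='projects/my-project/subscriptions/my-subscription')
        | 'DecodeBytes' >> beam.Map(lambda msg: msg.decode('utf-8'))  # messages arrive as bytes
    )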
Processing & Transform
First, create a Pipeline object and set the pipeline execution environment, i.e. the runner (Apache Spark, Apache Flink, Cloud Dataflow, etc.). Then create a PCollection from some external storage or data source and apply PTransforms to transform each element in the PCollection, producing an output PCollection.
You can filter, group, analyze, or otherwise process the data. Finally, store the final PCollection in an external storage system using the I/O libraries. When you run the pipeline, it creates a workflow graph of the pipeline, which executes asynchronously on the runner; a complete sketch follows the definitions below.
Pipeline: It encapsulates the entire process of reading bounded or unbounded data from external sources, transforming it, and saving the output to external storage systems such as BigQuery.
PCollection: It represents the data on which the pipeline works; it can be either bounded or unbounded. We can create PCollections from any external system (data lakes, geographical data, healthcare systems, etc.).
PTransform: It takes a PCollection as input, applies a processing function (ParDo, Map, Filter, etc.) to it, and produces another PCollection.
Pipeline I/O: It enables you to read/write data from/to various external sources.
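Putting these four concepts together, here is a minimal word-count style sketch (the file names are hypothetical placeholders): it builds a Pipeline, creates a PCollection with Pipeline I/O, applies a chain of PTransforms, and writes the result back out. By default it runs on the local DirectRunner; pass a different runner via pipeline options to execute it on Spark, Flink, or Dataflow.

import apache_beam as beam

# End-to-end sketch: read -> transform -> write.
# File names are hypothetical; the runner defaults to the local DirectRunner.
with beam.Pipeline() as pipeline:
    (
        pipeline
        # Pipeline I/O: create a PCollection from an external source.
        | 'Read' >> beam.io.ReadFromText('input.txt')
        # PTransforms: split lines into words, drop empties, count per word.
        | 'SplitWords' >> beam.FlatMap(lambda line: line.split())
        | 'DropEmpty' >> beam.Filter(lambda word: word != '')
        | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
        | 'CountPerWord' >> beam.CombinePerKey(sum)
        | 'Format' >> beam.Map(lambda kv: f'{kv[0]}: {kv[1]}')
        # Pipeline I/O: write the output PCollection to external storage.
        | 'Write' >> beam.io.WriteToText('word_counts')
    )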
Read more: https://tudip.com/blog-post/big-data-processing-with-apache-beam/