Big Data Processing with Apache Beam
Every minute of every day, vast amounts of data are generated from a wide variety of sources, and extracting and processing useful information from it is tedious. Apache Beam comes into the picture to solve this problem.
Apache Beam is an open-source, unified programming model for defining and executing data processing pipelines, including ETL and both batch and streaming processing. Using your favorite supported language (currently Python, Java, and Go), you can write your jobs with the Apache Beam SDK and execute the pipeline on your favorite runner, such as Apache Spark, Apache Flink, or Google Cloud Dataflow.
Data Ingestion & Types of Data
Our data comes in two types: batch data and streaming data. Depending on the use case, we choose different architectural models to process it. Here, we will use Python for the operations that follow. The Apache Beam Python SDK requires Python 3.6 or higher. Now, install the Apache Beam SDK using one of the following commands.
Local
pip install apache-beam
Google Cloud Platform
pip install apache-beam[gcp]
Amazon Web Services
pip install apache-beam[aws]
For I/O operations, you can read and write data from various sources and sinks such as Avro, Parquet, BigQuery, Pub/Sub, MongoDB, TFRecord, etc.
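As a minimal sketch (the file paths below are hypothetical placeholders), switching a source or sink is just a matter of swapping the I/O transform; for example, reading Parquet records and writing them out as text:

import apache_beam as beam

# Minimal sketch: the same pipeline shape works for different connectors.
# The input and output paths are hypothetical placeholders.
# Reading Parquet assumes pyarrow is installed.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'ReadParquet' >> beam.io.ReadFromParquet('data/events.parquet')
        | 'ToString' >> beam.Map(str)  # each Parquet record arrives as a dict
        | 'WriteText' >> beam.io.WriteToText('data/events_as_text')
    )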
Batch Data
First of all, historical data is collected into a data lake, which holds raw (unprocessed) data. To do some processing and transformation, the data is pulled into a storage service (an S3 bucket, Cloud Storage, an on-premise storage device, etc.); this step is called extraction of data from the data lake.
beam.io.ReadFromText('')
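For example, a minimal batch pipeline sketch (the bucket paths here are hypothetical placeholders) reads the extracted text files, applies a trivial transform, and writes the result back to storage:

import apache_beam as beam

# Minimal batch pipeline sketch; the paths are hypothetical placeholders.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'ReadExtractedData' >> beam.io.ReadFromText('gs://my-bucket/extracted/*.txt')
        | 'StripWhitespace' >> beam.Map(lambda line: line.strip())
        | 'WriteCleanedData' >> beam.io.WriteToText('gs://my-bucket/processed/output')
    )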
Stream Data
This is real-time data generated by data centers, automobiles, maps, healthcare systems, logging devices, sensors, and so on. To ingest streaming data, use Apache Kafka or another messaging service (such as Cloud Pub/Sub or SNS). In Pub/Sub, you can filter data according to your needs.
beam.io.ReadFromPubSub(subscription=subscription_name)
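A minimal streaming sketch (the project and subscription names below are hypothetical placeholders) enables streaming mode and decodes the incoming Pub/Sub messages:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

# Streaming pipeline sketch; the subscription path is a hypothetical placeholder.
options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(
            subscription='projects/my-project/subscriptions/my-subscription')
        | 'DecodeBytes' >> beam.Map(lambda msg: msg.decode('utf-8'))  # messages arrive as bytes
    )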
Processing & Transform
First, create a Pipeline object and set the pipeline execution environment, i.e. the runner (Apache Spark, Apache Flink, Cloud Dataflow, etc.). Then create a PCollection from some external storage or data source and apply PTransforms to transform each element in the PCollection, producing an output PCollection.
You can filter, group, analyze, or otherwise process the data. Finally, store the final PCollection in an external storage system using the I/O libraries. When you run the pipeline, it creates a workflow graph of the pipeline, which executes asynchronously on the runner; a complete sketch follows the definitions below.
Pipeline: It encapsulates the entire process of reading bounded or unbounded data from external sources, transforming it, and saving the output to external storage systems such as BigQuery.
PCollection: It represents the data on which the pipeline works; it can be either bounded or unbounded. We can create PCollections from any external system (data lakes, geographical data, healthcare systems, etc.).
PTransform: It takes a PCollection as input, applies a processing function (ParDo, Map, Filter, etc.) to it, and produces another PCollection.
Pipeline I/O: It enables you to read/write data from/to various external sources.
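Putting these four concepts together, here is a minimal word-count style sketch (the file names are hypothetical placeholders): it builds a Pipeline, creates a PCollection with Pipeline I/O, applies a chain of PTransforms, and writes the result back out. By default it runs on the local DirectRunner; pass a different runner via pipeline options to execute it on Spark, Flink, or Dataflow.

import apache_beam as beam

# End-to-end sketch: read -> transform -> write.
# File names are hypothetical; the runner defaults to the local DirectRunner.
with beam.Pipeline() as pipeline:
    (
        pipeline
        # Pipeline I/O: create a PCollection from an external source.
        | 'Read' >> beam.io.ReadFromText('input.txt')
        # PTransforms: split lines into words, drop empties, count per word.
        | 'SplitWords' >> beam.FlatMap(lambda line: line.split())
        | 'DropEmpty' >> beam.Filter(lambda word: word != '')
        | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
        | 'CountPerWord' >> beam.CombinePerKey(sum)
        | 'Format' >> beam.Map(lambda kv: f'{kv[0]}: {kv[1]}')
        # Pipeline I/O: write the output PCollection to external storage.
        | 'Write' >> beam.io.WriteToText('word_counts')
    )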
Read more: https://tudip.com/blog-post/big-data-processing-with-apache-beam/