Google Dataflow: the best way to handle batch and stream data processing

With businesses becoming increasingly data-driven, Google Dataflow is one of the best tools for enabling cost-efficient streaming and batch Big Data analytics.

Vladimir Fedak
Sep 14, 2020 · 4 min read

Our world operates more and more online: various business products and solutions deliver an ever-increasing flow of data, and companies that want to remain competitive must be able to process this data efficiently. Google Dataflow is one of the best solutions for batch and stream data processing available today, so below we describe its basics, possible use cases and implementation scenarios.

Google Dataflow is a managed service from Google Cloud that uses the Apache Beam SDK to build serverless batch and streaming data processing pipelines with unlimited horizontal scalability. Thanks to this, customers can handle virtually any data processing workload cost-efficiently, as they pay only for the resources actually used, regardless of the scope of the task.
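
To make this concrete, here is a minimal sketch of a Beam pipeline in the Python SDK, a simple word count; the gs:// paths are hypothetical placeholders, and the same code can later be submitted to Dataflow simply by changing the runner options:

```python
# A minimal Apache Beam word-count pipeline (Python SDK).
# Runs locally on the DirectRunner by default; the gs:// paths are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.txt")
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "CountPerWord" >> beam.CombinePerKey(sum)
        | "Format" >> beam.MapTuple(lambda word, n: f"{word}: {n}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output")
    )
```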

Dataflow provides the following advantages:

  • Automation of provisioning and management of the computing resources needed for your tasks. As Dataflow is a managed service with a serverless architecture, it can automatically scale up and down to meet the demands of your data processing tasks, both between jobs and during runtime.
  • Unlimited horizontal autoscaling to handle data processing at scale. Dataflow components such as Dataflow Shuffle, Dataflow SQL and Streaming Engine allow batch jobs to scale seamlessly to hundreds of terabytes of data on demand.
  • A unified model for batch and streaming data processing. There is no need to adjust your data processing logic, as the same Dataflow pipeline can switch seamlessly between streaming and batch modes.
  • Top-notch technology based on the innovative open-source Apache Beam SDK. Backed by a passionate open-source community and actively developed under the Apache Software Foundation, the Beam SDK is one of the most advanced approaches to handling data processing.
  • Reliability and consistency, ensuring every event is processed exactly once. This is crucial for data processing at scale, where thousands of events must be processed instantly and without duplication so the system does not stall. Dataflow removes this hurdle with a fault-tolerant architecture by design.
  • Cost-efficiency with flexible resource scheduling. The FlexRS feature places batch data processing jobs into a flexible queue; they are executed within six hours of submission, when cheaper resources (such as preemptible instances) are available, so data processing at scale sees substantial savings (see the options sketch after this list).
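
As a rough sketch of how these features are switched on, assuming the Python SDK, the service behavior is driven by pipeline options; the project, region and bucket names below are placeholders:

```python
# Hedged sketch: options for submitting a batch job to Dataflow with
# autoscaling and FlexRS enabled. Project, region and bucket are placeholders.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",                  # placeholder project ID
    region="us-central1",
    temp_location="gs://my-bucket/tmp",        # staging/temp bucket
    max_num_workers=50,                        # upper bound for autoscaling
    autoscaling_algorithm="THROUGHPUT_BASED",  # let Dataflow scale workers
    flexrs_goal="COST_OPTIMIZED",              # enable FlexRS flexible scheduling
)
```

Adding the `--streaming` flag (or `streaming=True`) switches the same pipeline to streaming mode, which is what the unified model mentioned above refers to.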

Dataflow use cases

Companies like Dow Jones, Unity, Quantiphi and Sky use Google Dataflow to enable a variety of business outcomes:

  • Stream analytics that ingests data from a variety of sources, transforms it and provides data analysts with easily digestible charts and diagrams, enabling real-time data analysis and helping extract maximum business value from dispersed information.
  • Sensor data logging and processing, which helps you keep a finger on the pulse of your Industry 4.0 facilities and IoT solutions with the Google IoT platform.
  • Real-time application of AI algorithms using services like the Google AI Platform and TensorFlow Extended (TFX) with CI/CD pipelines for ML. This helps provide real-time personalization, fraud detection, predictive analytics and so on (a streaming sketch follows this list).
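
To illustrate the stream analytics case, here is a hedged sketch of a streaming Beam pipeline that reads events from Pub/Sub and writes them to BigQuery; the topic, table and schema are hypothetical:

```python
# Hedged sketch: a streaming pipeline from Pub/Sub into BigQuery.
# The topic, table and schema below are hypothetical placeholders.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # run in streaming mode

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-project:analytics.events",
            schema="user:STRING,action:STRING,ts:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```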

Virtually any company that needs to make effective use of its machine-generated data can benefit from Google Dataflow. Let’s take a closer look at how you can start using it.

Starting to use Google Dataflow

Google provides detailed documentation on how to start using Dataflow. The most popular enterprise Big Data solutions use Java with Maven, Python, or Java with Eclipse, and these variants are covered in detail by Google. In addition, you can leverage one of Google’s preconfigured templates that plug into BigQuery and other services, as sketched below.
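
For instance, a Google-provided template can be launched programmatically through the Dataflow REST API. The sketch below assumes the google-api-python-client library and Application Default Credentials; the project, bucket, topic and table names are placeholders:

```python
# Hedged sketch: launching the Google-provided Pub/Sub-to-BigQuery template
# through the Dataflow REST API. All resource names are placeholders.
from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")
request = dataflow.projects().locations().templates().launch(
    projectId="my-gcp-project",
    location="us-central1",
    gcsPath="gs://dataflow-templates/latest/PubSub_to_BigQuery",
    body={
        "jobName": "pubsub-to-bq-example",
        "parameters": {
            "inputTopic": "projects/my-gcp-project/topics/events",
            "outputTableSpec": "my-gcp-project:analytics.events",
        },
        "environment": {"tempLocation": "gs://my-bucket/tmp"},
    },
)
response = request.execute()
print(response["job"]["id"])  # ID of the launched Dataflow job
```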

Dataflow also supports several APIs: integrations for Java and Python applications, as well as a SQL API and a RESTful API, should your project need them. Dataflow runs in the cloud, yet to achieve minimal latency you can configure regional endpoints, which keep data collection and processing close to the data’s geographical location and reduce the time needed to process it across Google availability zones. Should your system already use Apache Kafka, Google Dataflow can be integrated with it quite easily to ensure seamless batch and stream data handling, as in the sketch below.
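
As a hedged sketch of the Kafka integration, the Beam Python SDK ships a cross-language Kafka connector (it spins up a Java expansion service under the hood, so a working Java/Docker environment is assumed); the broker address and topic are placeholders:

```python
# Hedged sketch: consuming an Apache Kafka topic in a Beam pipeline via the
# cross-language Kafka connector. Broker address and topic are placeholders.
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "ReadKafka" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "kafka-broker:9092"},
            topics=["events"],
        )
        | "Values" >> beam.Map(lambda kv: kv[1])  # keep the record value bytes
        | "Print" >> beam.Map(print)
    )
```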

That said, both a fledgling startup and an established enterprise can benefit from deploying their data processing jobs to Google Dataflow, as it allows you to turn the wealth of your machine-generated data into a goldmine of actionable business insights. However, in-depth Google Cloud expertise is needed, as Dataflow is easy to learn but hard to master.

This is where experienced DevOps engineers from IT Svit can come in handy, as we have ample experience configuring and running Google Cloud infrastructures for our customers. So if you need help optimizing an existing Dataflow project or building a new one from scratch, contact IT Svit: we are always ready to assist!
