OpenMRS-FHIR Analytics Batch Streaming Mode

Ayeshmantha Perera
5 min read · Dec 17, 2020

The OpenMRS team is currently working with the Google Cloud Platform team on an analytics platform that provides a set of tools for transforming OpenMRS data into a FHIR-based data warehouse.

The platform supports both Streaming and Batch modes; this blog post focuses on Batch (bulk upload) mode. Batch mode reads the whole OpenMRS MySQL database, transforms it into FHIR resources, and uploads them to the target data warehouse on Google Cloud Platform.

Setting up the platform for Batch mode.

Currently, the repository for the FHIR analytics platform lives under the Google Cloud Platform GitHub organization. You can find it here.

Prerequisites

You can find the docker-compose files needed to set up the required environment in the repository itself.

Setting up OpenMRS with dummy data.

You can find the compose file here. This compose file includes a MySQL 5.7 distribution, with a database dump attached as a volume to the entry point, which spins up the database with dummy data.

You can also see a MySQL configuration file attached as a volume, which enables the binlog in MySQL. This is required when you are trying out the Debezium-based streaming (binlog) mode.

Setting up the sink FHIR store.

In its current state, you can point the platform either to a FHIR store in Google Cloud Platform or to a local FHIR store you set up for testing purposes.

I will be focusing on setting it up locally, since a guide on setting up the FHIR store in Google Cloud Platform is already available here.

For local usage, we have again provided a docker-compose file here.
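Once the local FHIR store container is up, a quick way to confirm the sink is reachable is a small HAPI FHIR client snippet like the one below. This is only a sketch: the base URL and port are assumptions about your local setup, not values taken from the compose file, so adjust them to match your environment.

```java
import ca.uhn.fhir.context.FhirContext;
import ca.uhn.fhir.rest.client.api.IGenericClient;
import org.hl7.fhir.r4.model.Bundle;
import org.hl7.fhir.r4.model.Patient;

public class SinkStoreSmokeTest {
  public static void main(String[] args) {
    // Assumed base URL for the local HAPI FHIR sink; change it to
    // wherever your docker-compose FHIR server is listening.
    String sinkFhirBase = "http://localhost:8098/fhir";

    FhirContext ctx = FhirContext.forR4();
    IGenericClient client = ctx.newRestfulGenericClient(sinkFhirBase);

    // List how many Patient resources are already in the sink store.
    Bundle patients = client.search()
        .forResource(Patient.class)
        .returnBundle(Bundle.class)
        .execute();

    System.out.println("Patients currently in the sink store: "
        + patients.getTotal());
  }
}
```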

Add the FHIR2 module to OpenMRS.

The FHIR2 module is a reimplementation of the OpenMRS FHIR module based on FHIR R4.

The different modes of the Analytics Platform make use of the FHIR2 module in different ways, but in the end all of them publish data to a FHIR store. That is the main role the FHIR2 module plays for the Analytics platform.

Although the streaming Atom Feed mode uses the OpenMRS Atom Feed module to capture changes to different types of OpenMRS resources, it still uses the FHIR2 module to fetch the FHIR representation of each changed OpenMRS resource.

In Batch mode, we directly use the FHIR2 module of OpenMRS to fetch the resource types that the end user defines as search-list parameters.
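To make that concrete, here is a rough sketch of the kind of FHIR search the batch pipeline issues against the FHIR2 module, written with the HAPI FHIR client. The base URL, credentials and page size are placeholder assumptions for a default local OpenMRS install, not values from the pipeline code.

```java
import ca.uhn.fhir.context.FhirContext;
import ca.uhn.fhir.rest.client.api.IGenericClient;
import ca.uhn.fhir.rest.client.interceptor.BasicAuthInterceptor;
import org.hl7.fhir.r4.model.Bundle;

public class OpenmrsFhir2Search {
  public static void main(String[] args) {
    // Assumed FHIR2 module endpoint and default admin credentials of a
    // local OpenMRS install; change these for your environment.
    String openmrsFhirBase = "http://localhost:8099/openmrs/ws/fhir2/R4";

    FhirContext ctx = FhirContext.forR4();
    IGenericClient client = ctx.newRestfulGenericClient(openmrsFhirBase);
    client.registerInterceptor(new BasicAuthInterceptor("admin", "Admin123"));

    // Fetch a page of Patient resources, the same kind of search the
    // batch pipeline runs for each resource type in its search list.
    Bundle page = client.search()
        .byUrl("Patient?_count=10")
        .returnBundle(Bundle.class)
        .execute();

    page.getEntry().forEach(entry ->
        System.out.println(entry.getResource().getIdElement().getValue()));
  }
}
```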

You need to download the latest released version and install the module through the admin page in OpenMRS (or copy that file to the modules directory of your OpenMRS installation).

Assuming you made it to this point 🤞🏾🤞🏾, it's time to look into Batch mode.

More about Batch Mode.

As of now, the core technology behind Batch mode is Apache Beam. Based on your requirements, we have defined a few runners for executing the Beam ETL pipeline.

1. Direct Runner

The Direct Runner executes pipelines on your machine and is designed to validate that pipelines adhere to the Apache Beam model as closely as possible. Instead of focusing on efficient pipeline execution, the Direct Runner performs additional checks to ensure that users do not rely on semantics that are not guaranteed by the model. Some of these checks include:

  • enforcing immutability of elements
  • enforcing encodability of elements
  • elements are processed in an arbitrary order at all points
  • serialization of user functions (DoFn, CombineFn, etc.)

Using the Direct Runner for testing and development helps ensure that pipelines are robust across different Beam runners. Besides, debugging failed runs can be a non-trivial task when a pipeline executes on a remote cluster. Instead, it is often faster and simpler to perform local unit testing on your pipeline code. Unit testing your pipeline locally also allows you to use your preferred local debugging tools.
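Below is a minimal, self-contained sketch of running a Beam pipeline on the Direct Runner. The transforms are placeholders standing in for the real fetch-and-convert steps of the ETL pipeline, so treat this as an illustration of the runner setup rather than the actual pipeline code.

```java
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class DirectRunnerSketch {
  public static void main(String[] args) {
    // Run the pipeline locally on the Direct Runner; no cluster needed.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    options.setRunner(DirectRunner.class);

    Pipeline pipeline = Pipeline.create(options);

    // Placeholder transforms standing in for the real fetch/convert steps.
    pipeline
        .apply(Create.of("Patient", "Encounter", "Observation"))
        .apply(MapElements.into(TypeDescriptors.strings())
            .via(resourceType -> "fetched " + resourceType));

    pipeline.run().waitUntilFinish();
  }
}
```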

2. Dataflow Runner

The Google Cloud Dataflow Runner uses the Cloud Dataflow managed service. When you run your pipeline with the Cloud Dataflow service, the runner uploads your executable code and dependencies to a Google Cloud Storage bucket and creates a Cloud Dataflow job, which executes your pipeline on managed resources in Google Cloud Platform.

The Cloud Dataflow Runner and service are suitable for large-scale, continuous jobs.
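Switching from the Direct Runner to Dataflow is mostly a matter of pipeline options. The sketch below shows the general shape; the project, region and bucket names are placeholders, and the exact options the analytics pipeline expects may differ.

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DataflowRunnerSketch {
  public static void main(String[] args) {
    // Placeholder project, region and bucket; use your own GCP values.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.as(DataflowPipelineOptions.class);
    options.setProject("my-gcp-project");
    options.setRegion("us-central1");
    options.setTempLocation("gs://my-bucket/temp");
    options.setRunner(DataflowRunner.class);

    Pipeline pipeline = Pipeline.create(options);
    // ... add the same transforms as before; only the runner changes.
    pipeline.run();
  }
}
```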

3. Spark Runner

The Apache Spark Runner can be used to execute Beam pipelines using Apache Spark. The Spark Runner can execute pipelines just like a native Spark application: deploying a self-contained application in local mode, running on Spark's Standalone resource manager, or using YARN or Mesos. A minimal configuration sketch follows the feature list below.

The Spark Runner executes Beam pipelines on top of Apache Spark, providing:

  • Batch and streaming (and combined) pipelines.
  • The same fault-tolerance guarantees as provided by RDDs and DStreams.
  • The same security features Spark provides.
  • Built-in metrics reporting using Spark’s metrics system, which reports Beam Aggregators as well.
  • Native support for Beam side-inputs via Spark's Broadcast variables.
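Here is the configuration sketch referenced above. The `local[*]` master is an assumption for trying things out on one machine; point it at your Standalone, YARN or Mesos cluster for real deployments.

```java
import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class SparkRunnerSketch {
  public static void main(String[] args) {
    // "local[*]" runs an embedded Spark with all available cores; set
    // sparkMaster to a cluster URL for real deployments.
    SparkPipelineOptions options =
        PipelineOptionsFactory.as(SparkPipelineOptions.class);
    options.setSparkMaster("local[*]");
    options.setRunner(SparkRunner.class);

    Pipeline pipeline = Pipeline.create(options);
    // ... same transforms as in the Direct Runner sketch.
    pipeline.run().waitUntilFinish();
  }
}
```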

These runners are defined as separate Maven profiles, and you can specify which one to use by its profile id. In this blog, I am focusing on the Direct Runner, which is defined in the default profile.

Currently, Batch mode dumps its output to Parquet files and also publishes the data to the FHIR store.

ETL Process

We are currently using Avro, a binary data format, as a pass-through format in the middle. The resources queried from the OpenMRS FHIR2 module are in JSON format, which is harder to convert to Parquet directly, whereas Avro records can easily be converted into Parquet.

The ETL process directly queries FHIR resources (JSON) from the OpenMRS FHIR2 module and then converts them into Avro objects using the Bunsen library, a handy library that helps end users load, transform and analyze FHIR data with Apache Spark. We are using the FHIR resource schemas provided by Bunsen to convert OpenMRS FHIR resources into Avro files.
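As a rough illustration of that conversion step, here is a sketch using Bunsen's Avro converter on a hand-built Patient resource. The method names reflect my understanding of Bunsen's `AvroConverter` API and the resource is a stand-in, so treat this as an assumption-laden example rather than the pipeline's actual code.

```java
import ca.uhn.fhir.context.FhirContext;
import com.cerner.bunsen.avro.AvroConverter;
import org.apache.avro.generic.IndexedRecord;
import org.hl7.fhir.r4.model.HumanName;
import org.hl7.fhir.r4.model.Patient;

public class FhirToAvroSketch {
  public static void main(String[] args) {
    // A hand-built Patient standing in for one fetched from the
    // OpenMRS FHIR2 module.
    Patient patient = new Patient();
    patient.setId("example-patient");
    patient.addName(new HumanName().setFamily("Perera").addGiven("Ayeshmantha"));

    // Bunsen derives an Avro schema from the FHIR resource definition
    // and converts the HAPI object into an Avro record (assumed API).
    FhirContext fhirContext = FhirContext.forR4();
    AvroConverter converter = AvroConverter.forResource(fhirContext, "Patient");
    IndexedRecord avroRecord = converter.resourceToAvro(patient);

    System.out.println("Avro schema: " + converter.getSchema().getName());
    System.out.println("Record: " + avroRecord);
  }
}
```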

These Avro records are then converted into Parquet files by the Beam pipeline in the Batch mode of the Analytics Platform. Later on, we can spin up the FHIR data warehouse using the Parquet files generated by the ETL pipeline.
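For the Parquet step, Beam ships a ParquetIO connector that can write Avro GenericRecords out as Parquet files. The sketch below shows the general shape of such a write; the helper method, output directory and the single-schema assumption are mine, not taken from the actual pipeline.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.values.PCollection;

public class AvroToParquetSketch {

  // Assumes 'records' already holds the Avro records produced from the
  // FHIR resources, all conforming to the same 'schema'.
  static void writeToParquet(PCollection<GenericRecord> records, Schema schema,
      String outputDir) {
    records
        .setCoder(AvroCoder.of(GenericRecord.class, schema))
        .apply(FileIO.<GenericRecord>write()
            .via(ParquetIO.sink(schema))
            .to(outputDir)
            .withSuffix(".parquet"));
  }
}
```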

That is all we offer in Batch mode as a platform for now. We will be working on adding more features in the upcoming months. Keep in touch for more updates. Cheers 🤞🏾🤞🏾🤞🏾
