Technologies

With every new institute and every new clinical picture considered by GCAM, the volume and complexity of the data grow. To meet these increasing requirements and to create the best possible conditions for evaluations and analyses, the experts at GCAM have chosen the following architectures:

 

The GCAM Data Lake approach in the Microsoft Azure Cloud:

Within the framework of the M2OLIE project, GCAM is step by step driving forward the design and implementation of the data lake on the Microsoft Azure cloud computing platform. In each step, a further basic clinical system is integrated into the existing database. All data collected for each patient is stored in the data lake.

Connection of the basic clinical systems: For communication between the internal network of a clinic and the Azure Cloud, a Windows server is operated within the clinic network. This interface server receives the data from each base system and transmits it to the cloud through a secure point-to-site VPN gateway into the Azure Virtual Network.

Ingestion zone: The ingestion zone is formed by blob storage. This object storage receives the data transmitted from the hospital network to the cloud and stores it unchanged in raw format.
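To illustrate, here is a minimal sketch of how the interface server could push a received export file unchanged into such an ingestion container, assuming Python with the azure-storage-blob SDK; the container name, environment variable and file path are placeholders rather than the actual GCAM configuration:

```python
# Sketch: push a received export file unchanged into the ingestion zone.
# Assumes the azure-storage-blob package; "ingestion" and the connection
# string variable are illustrative placeholders.
import os
from pathlib import Path
from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = os.environ["AZURE_STORAGE_CONNECTION_STRING"]

def upload_raw(file_path: str, source_system: str) -> None:
    """Store the file as-is, prefixed by the base system it came from."""
    service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    container = service.get_container_client("ingestion")
    blob_name = f"{source_system}/{Path(file_path).name}"
    with open(file_path, "rb") as data:
        container.upload_blob(name=blob_name, data=data, overwrite=False)

if __name__ == "__main__":
    upload_raw("exports/lab_results.hl7", source_system="lis")
```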

Transformation zone: In the transformation zone, all data stored in the blob storage is processed and converted into a standardized, structured form. Saving a file in the blob storage automatically triggers the event-driven Azure Functions service. Depending on the data format, a dedicated function is executed; it extracts the relevant information, converts it into a relational data schema and transfers it to the storage zone. After a patient's treatment has ended, the data held in the storage zone is anonymized. This step is carried out by Databricks, the Apache Spark-based analytics platform: the data to be anonymized is read, masked and overwritten in the storage.
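A minimal sketch of such an event-driven transformation step, using the decorator-based Python programming model for Azure Functions; the container name, the parsing helper and the record layout are hypothetical:

```python
# Sketch of a blob-triggered transformation function (Azure Functions v2
# Python model). Parsing logic and the downstream insert are placeholders;
# the real functions are format-specific.
import logging
import azure.functions as func

app = func.FunctionApp()

@app.blob_trigger(arg_name="blob",
                  path="ingestion/{name}",          # fires when a file lands in the ingestion zone
                  connection="AzureWebJobsStorage")
def transform_on_upload(blob: func.InputStream) -> None:
    logging.info("Processing %s", blob.name)
    raw = blob.read()
    record = parse_to_relational(raw)               # hypothetical, format-dependent parser
    write_to_storage_zone(record)                   # hypothetical insert into the SQL database

def parse_to_relational(raw: bytes) -> dict:
    """Extract the relevant fields into a flat, relational-friendly record."""
    return {"payload_size": len(raw)}               # placeholder extraction

def write_to_storage_zone(record: dict) -> None:
    logging.info("Would insert record: %s", record)
```

The anonymization pass could look roughly like the following Databricks/PySpark sketch; the table and column names are likewise assumptions:

```python
# Sketch of the Spark-based anonymization: read, mask identifiers, write back.
# Runs in a Databricks notebook, where `spark` is predefined; the result is
# written under a separate name here to keep the sketch simple.
from pyspark.sql import functions as F

patients = spark.table("storage_zone.patients")            # assumed table name
masked = (patients
          .withColumn("patient_key", F.sha2(F.col("patient_id").cast("string"), 256))
          .drop("patient_id", "name", "date_of_birth"))    # remove direct identifiers
masked.write.mode("overwrite").saveAsTable("storage_zone.patients_masked")
```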

[Figure: GCAM Azure]

Storage zone: The transformed data is kept in the storage zone, which has two components. On the one hand, the processed data from the different base systems is stored in an SQL database in integrated form to make it available for further analyses. On the other hand, all data is archived in raw format in another blob storage. For large files such as DICOM image data, only the metadata are stored in the SQL database, while the complete file is kept in the blob storage.
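A hedged sketch of this split storage pattern for DICOM files, assuming the pydicom, pyodbc and azure-storage-blob packages; the table, container and connection settings are placeholders:

```python
# Sketch: DICOM metadata goes into the SQL database, the full file is
# archived in blob storage. Table, container and connection names are
# assumptions for illustration.
import os
import pydicom
import pyodbc
from azure.storage.blob import BlobServiceClient

def archive_dicom(path: str) -> None:
    ds = pydicom.dcmread(path, stop_before_pixels=True)   # read metadata only

    # 1) Complete file into the archive blob storage
    blob_service = BlobServiceClient.from_connection_string(os.environ["ARCHIVE_STORAGE"])
    blob_name = f"dicom/{ds.StudyInstanceUID}/{ds.SOPInstanceUID}.dcm"
    with open(path, "rb") as f:
        blob_service.get_container_client("archive").upload_blob(blob_name, f, overwrite=False)

    # 2) Metadata row into the SQL database (hypothetical table dicom_metadata)
    conn = pyodbc.connect(os.environ["SQL_CONNECTION_STRING"])
    conn.execute(
        "INSERT INTO dicom_metadata (sop_instance_uid, study_instance_uid, modality, blob_path) "
        "VALUES (?, ?, ?, ?)",
        str(ds.SOPInstanceUID), str(ds.StudyInstanceUID), str(ds.Modality), blob_name,
    )
    conn.commit()
```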

Associated processes: To safeguard and monitor the complete data integration, two accompanying processes, the Master Service and the Alive Check, were implemented as dedicated Azure Functions.
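As an illustration of what an Alive Check might look like, here is a timer-triggered Azure Function in the same Python programming model; the five-minute schedule and the checks themselves are assumptions:

```python
# Sketch of an "Alive Check" style watchdog as a timer-triggered function.
import logging
import azure.functions as func

app = func.FunctionApp()

@app.timer_trigger(schedule="0 */5 * * * *", arg_name="timer")
def alive_check(timer: func.TimerRequest) -> None:
    if timer.past_due:
        logging.warning("Alive check is running late")
    # In the real pipeline this would verify that the interface server and the
    # transformation functions have reported activity recently and raise an
    # alert otherwise.
    logging.info("Alive check executed")
```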

 

The Hadoop data lake approach:

In the Ingestion Zone, new data is stored in anonymized form, so that no reference back to the supplying institute or to the patient can be established. Here, the data is already tagged with metadata and subjected to rudimentary checks (identification of transmission errors, duplicate check, etc.).
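A sketch of these ingestion steps with PySpark, which would be a natural fit on a Hadoop cluster; the file paths, column names and delivery identifier are assumptions:

```python
# Sketch of the ingestion-zone processing: pseudonymize identifiers, attach
# delivery metadata and perform a rudimentary duplicate check.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ingestion-zone").getOrCreate()

delivery = spark.read.option("header", True).csv("hdfs:///landing/delivery_2021_14.csv")

ingested = (delivery
            .withColumn("patient_key", F.sha2(F.col("patient_id"), 256))  # no way back to the patient
            .drop("patient_id", "institute")                              # remove direct references
            .withColumn("delivery_id", F.lit("2021_14"))                  # tag with delivery metadata
            .withColumn("ingested_at", F.current_timestamp())
            .dropDuplicates(["patient_key", "measurement_id"]))           # rudimentary duplicate check

ingested.write.mode("append").parquet("hdfs:///lake/ingestion/")
```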

After this first check, the data is transferred into the Raw Zone. During this transfer, data types and storage formats are harmonized. New deliveries are merged with the existing data and stored permanently, compressed where necessary; no information of any kind is deleted in the process. This is the actual raw data basis on which the next layers build.
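The move into the Raw Zone could be sketched as follows, again in PySpark; the schema, paths and partitioning are assumptions:

```python
# Sketch of the ingestion-to-raw transfer: harmonize types, append the new
# delivery to the permanent raw store without deleting anything.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("raw-zone").getOrCreate()

new_delivery = spark.read.parquet("hdfs:///lake/ingestion/")

harmonized = (new_delivery
              .withColumn("measured_at", F.to_timestamp("measured_at"))   # unify data types
              .withColumn("value", F.col("value").cast("double")))

# Append to the permanent, compressed raw store; existing data stays untouched.
(harmonized.write
           .mode("append")
           .option("compression", "snappy")
           .partitionBy("delivery_id")
           .parquet("hdfs:///lake/raw/measurements/"))
```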

In the Integration Zone, the raw data sets are related to each other and either transferred to the presentation layer (the Serving Zone) or subjected to more extensive analyses. With each new evaluation and analysis, new data areas can be formed in the Integration Zone, but these are always derived from the raw data.
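A sketch of how such an analysis-specific data area might be built from the raw data; the data sets and join key are assumptions:

```python
# Sketch of an integration-zone build: relate raw data sets and materialize a
# new, analysis-specific area that is always derived from the Raw Zone.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("integration-zone").getOrCreate()

measurements = spark.read.parquet("hdfs:///lake/raw/measurements/")
treatments = spark.read.parquet("hdfs:///lake/raw/treatments/")

cohort = (measurements
          .join(treatments, on="patient_key", how="inner")
          .select("patient_key", "measured_at", "value", "therapy", "outcome"))

cohort.write.mode("overwrite").parquet("hdfs:///lake/integration/therapy_response/")
```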

In the Serving Zone, the data of the Integration Zone is presented to the users. This can be either linked views of the raw data or the results of analyses and calculations.
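One possible way to publish such a result in the Serving Zone is to register it as a queryable table, sketched here with Spark on Hive; database and table names are assumptions:

```python
# Sketch: expose an integration-zone result as a table that users can query.
# Assumes a Hive metastore is available in the cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("serving-zone").enableHiveSupport().getOrCreate()
spark.sql("CREATE DATABASE IF NOT EXISTS serving")

(spark.read.parquet("hdfs:///lake/integration/therapy_response/")
      .write.mode("overwrite")
      .saveAsTable("serving.therapy_response"))
```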

At the same time, there is also the Open Data area, in which publicly accessible information is made available to users. This includes classifications as well as additional data (weather, locations, etc.).

Beyond the core task of structuring data and making analyses available, the Discovery & Sandbox Area is used to test new analysis procedures and methods.

Technologically, this data lake is based on a Hadoop cluster, which is well suited to this approach and scales easily.