This is the sixth blog in a series of six about Integration 2.0. If you missed one, read the fifth blog, Services!
The API and Event hubs may support most integration needs. Certain data, however, may require a specific approach, for instance when its size, speed or format calls for it, or when an external party demands it.
Business Intelligence (BI) & Analytics
Business Intelligence & Data Warehouse
BI is a way to do analytics over a combined set of (historical) data. This requires data to be integrated and made accessible to the analytics users and processes. Traditionally this is called “Data Integration” as opposed to “application integration”. Data integration has focused on gathering data and copying it in an integrated or aggregated format into a Data Warehouse, from which data marts were generated for specific user groups.
Analytics is about predicting what will happen in the future. Besides the traditional business transaction data, it may leverage “big data” as well. This data may differ from traditional data in volume (large amounts), speed (high velocity of data generation), format (it may be unstructured) and quality. This calls for specific storage to handle these amounts and variations of data, special ways of transporting high-speed streaming data, and additional processing to structure the data and improve its quality.
The need for real-time analytics makes batch processing or ETL a less suitable way of processing. Integration 2.0 addresses this with two patterns: Stream processing and Data Virtualization.
Streaming data is a continuous stream of small data records that can be generated by multiple sources simultaneously. The records generally arrive at a rate too high for conventional Enterprise Application Integration (EAI) means to handle instantly. Sources may be web-traffic events, RSS feeds or social media messages. The high number and high speed of these event streams require specific capabilities to handle them. These may be:
- Ingestion: Receive the data and store it immediately in a Data Lake, where it awaits further processing, possibly after filtering.
- Stream processing: Receive the data and process it immediately, e.g. by feeding it to a machine learning process that may trigger new events.
In general, these capabilities are part of big data platforms, like the Hadoop ecosystem.
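To make the difference between the two capabilities concrete, here is a minimal sketch in plain Python. It is only an illustration: the in-memory `data_lake` list, the function names and the moving-average check (standing in for a machine learning step) are assumptions for this example, not part of any particular big data platform.

```python
from collections import deque
from typing import Callable, Iterable

# Hypothetical in-memory stand-in for a Data Lake: raw records are kept as-is.
data_lake: list = []


def ingest(records: Iterable[dict], keep: Callable[[dict], bool] = lambda r: True) -> int:
    """Ingestion: store each record immediately; processing happens later."""
    stored = 0
    for record in records:
        if keep(record):  # optional filtering before storage
            data_lake.append(record)
            stored += 1
    return stored


def process_stream(records: Iterable[dict], window: int = 3, factor: float = 2.0) -> list:
    """Stream processing: handle each record on arrival and emit new events.

    A moving average over the last `window` values stands in for the
    machine learning step mentioned in the text.
    """
    recent = deque(maxlen=window)
    new_events = []
    for record in records:
        value = record["value"]
        if len(recent) == window:
            average = sum(recent) / window
            if value > factor * average:
                new_events.append(
                    {"type": "spike", "source": record["source"], "value": value}
                )
        recent.append(value)
    return new_events


clicks = [
    {"source": "web", "value": 10},
    {"source": "web", "value": 12},
    {"source": "rss", "value": 11},
    {"source": "web", "value": 40},  # well above the recent average
    {"source": "web", "value": 13},
]

stored = ingest(clicks, keep=lambda r: r["source"] == "web")
events = process_stream(clicks)
print(stored, events)
```

In the first case nothing happens to the data until a later job reads it from the lake. In the second case the new "spike" event is produced while the stream is still flowing.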
Flow based over ETL
Extract-Transform-Load (ETL) is generally used for traditional BI. It is a kind of batch processing that obtains data from different sources, copies and moves it, and combines and aggregates it into a data warehouse. This approach is the opposite of flow-based (event-driven) processing, where each transaction or event is handled as it occurs.
ETL may be very efficient. However, it comes with several downsides.
- ETL is not real-time: ETL is a time-consuming process that is scheduled at specific times. As results are usually not available before the complete batch ends, data is on average delayed by half the time between scheduled runs plus the duration of the ETL run.
- ETL may cause high system load: ETL is relatively efficient as it handles many items in a short time. However, this may cause a burst of high resource consumption, which may impact other processes. Hence ETL batches need to be scheduled carefully, e.g. to run outside office hours.
- If one element in the ETL process fails, the complete batch fails: Usually a complete ETL batch is treated as a single transaction, so when one of the items fails, the complete batch fails. This implies the following:
  - The cause of the error needs to be solved before any of the transactions can be handled correctly
  - Analysis is complex, as it requires analyzing a batch of items instead of a single item
  - If the cause cannot be solved, additional effort is required to isolate the erroneous item so that the ETL can run after all
  - After solving a specific issue, the time slot in which the ETL was planned may have passed, so a new slot has to be planned, and no up-to-date data is provided for this period
  - After solving a specific issue, a next issue may occur that again prevents the batch from running, requiring the previous steps to be repeated
- ETL requires specific ETL-tooling and ETL-skills
- ETL requires a predefined data structure. Changing this takes a lot of effort.
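As a rough illustration of the delay described in the first downside, the arithmetic can be written out as follows. The function name and the example schedules are made up for this sketch.

```python
def average_delay_hours(interval_hours: float, run_hours: float) -> float:
    """Average delay between an item occurring and its batch result being available.

    On average an item occurs halfway between two scheduled runs, and it
    then has to wait until the complete run has finished.
    """
    return interval_hours / 2 + run_hours


# A nightly batch that runs for two hours:
nightly = average_delay_hours(interval_hours=24, run_hours=2)

# An hourly batch that runs for fifteen minutes:
hourly = average_delay_hours(interval_hours=1, run_hours=0.25)

print(nightly, hourly)  # 14.0 0.75
```

Even with an hourly schedule the data is on average 45 minutes old in this example, whereas a flow-based process handles each item as it occurs.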
Therefore flow-based processes are preferred over batch-oriented processes like ETL.
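The difference in failure behaviour can be sketched in a few lines of Python. This is a simplified illustration, not how any specific ETL tool works: `transform`, `run_batch` and `run_flow` are hypothetical names, and a missing field stands in for whatever error may occur in practice.

```python
def transform(item: dict) -> dict:
    """Hypothetical transformation step; raises KeyError if 'amount' is missing."""
    return {"id": item["id"], "amount_cents": round(item["amount"] * 100)}


def run_batch(items: list) -> list:
    """Batch style: the whole run is one transaction.

    The list comprehension raises on the first bad item, so no results
    are returned at all, not even for the items that were fine.
    """
    return [transform(item) for item in items]


def run_flow(items: list) -> tuple:
    """Flow style: each item is handled on its own; failures are set aside."""
    loaded, failed = [], []
    for item in items:
        try:
            loaded.append(transform(item))
        except KeyError as error:
            failed.append({"item": item, "error": repr(error)})
    return loaded, failed


orders = [
    {"id": 1, "amount": 10.5},
    {"id": 2},  # erroneous item: the amount is missing
    {"id": 3, "amount": 20.0},
]

try:
    batch_result = run_batch(orders)
except KeyError:
    batch_result = []  # nothing is loaded until the cause is fixed

loaded, failed = run_flow(orders)
print(len(batch_result), len(loaded), len(failed))  # 0 2 1
```

In the flow-based variant only the erroneous item needs to be analyzed, while the other two are already available downstream.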
ETL however may be suitable in the following cases:
- A set of data should be loaded (initial load, historical data): If a particular set of past transactions has to be loaded, ETL may be an option. This is for instance the case for initial loads or data migrations.
- Event handling would cause an unacceptable system load: The Event-mechanism has more overhead than ETL. If this could cause resource issues ETL may be considered as an alternative.
- The event mechanism is unreliable: If there is no guaranteed delivery of events, ETL may be a more reliable alternative. (Note: Unreliable messaging will most likely cause other issues as well, so the underlying problem still has to be solved.)
Data Virtualization (DV)
Data Virtualization is a technique for data integration that may be used as an alternative to ETL. Traditionally all data is processed and physically loaded into a single data store (e.g. a data warehouse) using ETL, which, as we have seen, has several drawbacks.
Data Virtualization is an approach that creates views on multiple data sources. These may be sources like a data warehouse or a data lake, but also transaction systems. The advantage is the availability of real-time data without a long development process, which makes it a much more agile approach. DV also supports data governance through rules that define who may access which data. Data Virtualization should be used with care, though, as it depends on the availability of the data sources and may put additional load on real-time systems. Data caching may be used to mitigate these risks.
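The idea can be sketched in Python as a view that joins its sources at query time, applies an access rule per role and caches source results for a short period. Everything here is an assumption made for illustration: the `VirtualView` class, the two source functions and the role names do not come from a real DV product.

```python
import time
from typing import Callable


# Hypothetical source systems; in practice these would be live queries.
def crm_customers() -> list:
    return [{"customer_id": 1, "name": "Acme"}, {"customer_id": 2, "name": "Globex"}]


def warehouse_revenue() -> list:
    return [{"customer_id": 1, "revenue": 1200}, {"customer_id": 2, "revenue": 300}]


class VirtualView:
    """Joins the sources at query time; no combined copy is persisted."""

    def __init__(self, allowed_columns: dict, cache_seconds: float = 60.0):
        self.allowed_columns = allowed_columns  # governance rule: role -> visible columns
        self.cache_seconds = cache_seconds
        self._cache = {}  # source name -> (fetch time, rows)
        self.source_calls = 0

    def _fetch(self, name: str, source: Callable[[], list]) -> list:
        cached = self._cache.get(name)
        if cached and time.monotonic() - cached[0] < self.cache_seconds:
            return cached[1]  # spare the source system
        self.source_calls += 1
        rows = source()
        self._cache[name] = (time.monotonic(), rows)
        return rows

    def query(self, role: str) -> list:
        customers = self._fetch("crm", crm_customers)
        revenue = {
            row["customer_id"]: row["revenue"]
            for row in self._fetch("dwh", warehouse_revenue)
        }
        visible = self.allowed_columns.get(role, set())
        joined = [{**c, "revenue": revenue.get(c["customer_id"])} for c in customers]
        return [{k: v for k, v in row.items() if k in visible} for row in joined]


view = VirtualView(
    {
        "analyst": {"customer_id", "name", "revenue"},
        "support": {"customer_id", "name"},
    }
)
analyst_rows = view.query("analyst")
support_rows = view.query("support")  # answered from the cache
print(analyst_rows, support_rows, view.source_calls)
```

The cache is the trade-off mentioned above: it reduces the load on the source systems and the dependency on their availability, at the price of data that may be up to `cache_seconds` old.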
Internet of Things
The world of the Internet of Things (IoT) is still very dynamic. Several consortia are battling to set the standard for a platform that allows devices to communicate with hubs over short or longer ranges. Whatever the platform, the Event and API hubs may be used to expose the IoT functionality to the enterprise. The IoT platform may be a hierarchical one, consisting of things in a certain IoT domain that are connected by a device serving as an Edge platform. The Edge is connected to a central IoT platform, which in turn may be connected to the Event hub to trigger relevant IoT events, or to the API hub to directly observe or control IoT functionality.
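A toy model in Python may help to picture this hierarchy. All class names, the temperature threshold and the event format are invented for this sketch and do not refer to an actual IoT platform or standard.

```python
class EventHub:
    """Hypothetical stand-in for the Event hub: collects published events."""

    def __init__(self):
        self.events = []

    def publish(self, event: dict) -> None:
        self.events.append(event)


class Thing:
    """A device in an IoT domain with some observable state."""

    def __init__(self, thing_id: str):
        self.thing_id = thing_id
        self.state = {"on": False, "temperature": 20.0}


class Edge:
    """Connects the things of one IoT domain to the central platform."""

    def __init__(self, domain: str, things: list):
        self.domain = domain
        self.things = {thing.thing_id: thing for thing in things}


class IoTPlatform:
    """Central platform: forwards relevant events and offers observe/control."""

    def __init__(self, event_hub: EventHub, alert_above: float = 30.0):
        self.edges = {}
        self.event_hub = event_hub
        self.alert_above = alert_above

    def register(self, edge: Edge) -> None:
        self.edges[edge.domain] = edge

    def report(self, domain: str, thing_id: str, temperature: float) -> None:
        """A reading coming up from a thing; only relevant ones become events."""
        self.edges[domain].things[thing_id].state["temperature"] = temperature
        if temperature > self.alert_above:
            self.event_hub.publish(
                {"type": "overheated", "domain": domain,
                 "thing": thing_id, "temperature": temperature}
            )

    # The two methods below are what the API hub could expose.
    def observe(self, domain: str, thing_id: str) -> dict:
        return dict(self.edges[domain].things[thing_id].state)

    def control(self, domain: str, thing_id: str, on: bool) -> None:
        self.edges[domain].things[thing_id].state["on"] = on


hub = EventHub()
platform = IoTPlatform(hub)
platform.register(Edge("building-a", [Thing("sensor-1")]))

platform.report("building-a", "sensor-1", 22.5)  # normal reading, no event
platform.report("building-a", "sensor-1", 35.0)  # relevant, goes to the Event hub
platform.control("building-a", "sensor-1", on=True)
state = platform.observe("building-a", "sensor-1")
print(hub.events, state)
```

Only the reading above the threshold reaches the Event hub, while `observe` and `control` represent the direct access that would run through the API hub.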