I dedicated my first two posts to describing how and why to use the Azure Service Bus for connecting various services. Today, I would like to raise the bar even higher and introduce you to Azure Event Hubs. This solution is of particular importance when you’re processing large volumes of information, so it’s perfect for projects involving big data.
As per Microsoft documentation, Event Hubs can ingest large amounts of data and process or store it for business insights. Its primary purpose is high-throughput data streaming, where a customer or another service may send billions of requests per day.
Azure Event Hubs uses a partitioned consumer model to scale out event processing. It integrates with other big data and analytics services in Azure (e.g. Databricks, Stream Analytics, ADLS, and HDInsight).
Personally, I see Azure Event Hubs as a huge buffer where you can store messages. It automatically retains them for a given period. You can then read those messages much as you would read a file stream from a disk. What’s key to know is that you can scroll through all events and start reading anywhere, even at the beginning of the stream, and then process everything again.
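To make the buffer analogy concrete, here is a minimal pure-Python sketch (not the Azure SDK; the `EventLog` name is invented for illustration) of a log that retains events and can be re-read from any position:

```python
class EventLog:
    """Toy append-only log: events are retained and can be re-read from any index."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read_from(self, offset=0):
        # Reading never removes events, so the stream can be replayed at will.
        return iter(self._events[offset:])

log = EventLog()
for e in ["created", "updated", "deleted"]:
    log.append(e)

assert list(log.read_from(0)) == ["created", "updated", "deleted"]  # full replay
assert list(log.read_from(1)) == ["updated", "deleted"]             # resume mid-stream
```

The key property mirrored here is that consuming an event does not destroy it: any reader can rewind to the beginning and process everything again.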
In the context of this service, we should first focus on the word “event”. An event is a small piece of data that describes what happened. It should be compact and carry only the essential information about the change.
An event in Azure Event Hubs consists of the following data:
- Event Body (the payload itself)
- Sequence Number
- Custom Properties
- System Properties
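That structure can be sketched in plain Python (field names are illustrative; the real SDK exposes its own event type with equivalent parts):

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    body: bytes                                             # the payload itself
    sequence_number: int                                    # position assigned by the service
    custom_properties: dict = field(default_factory=dict)   # user-defined metadata
    system_properties: dict = field(default_factory=dict)   # set by the service

# A hypothetical order event: small body, only the essential metadata.
e = Event(body=b'{"orderId": 42}',
          sequence_number=0,
          custom_properties={"source": "checkout"})
assert e.sequence_number == 0
```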
All events are sent via AMQP 1.0, which is a session- and state-aware bidirectional communication channel.
In the IT world, building systems around events is called event-driven architecture. If you are interested in this topic, I encourage you to read Martin Fowler’s article that explains more about it.
Azure Service Bus vs Azure Event Hubs
Event Hubs is very similar to the previously described Service Bus; in a way, we can call Azure Event Hubs the brother of Azure Service Bus. That said, there are a number of differences between the two services, because each has a different responsibility.
The purpose of Azure Service Bus is to manage the transportation and delivery of a specific payload (a brokered message) from point A to point B. Many of its features, such as guaranteed delivery and visibility controls, come at a cost to scalability. Event Hubs, by contrast, was built with a focus on high-performance ingestion: it is responsible for quickly consuming and processing many events from numerous inputs at the same time.
To delve deeper into Azure Event Hubs, we first need to discuss partitions and how they affect scalability.
A partition is an ordered sequence of events that is held in an event hub. As newer events arrive, they are added to the end of this sequence. A partition can be thought of as a “commit log.” Event Hubs retains data for a configured retention time that applies across all partitions in the event hub. Events expire on a time basis; you cannot explicitly delete them. Because partitions are independent and contain their own sequence of data, they often grow at different rates.
As I wrote before, each partition maintains its own stream in which all events are kept in time order. This aids event hub scalability significantly, because the hub can hold many streams, each grouped by a specific key (e.g. an event type or a custom name).
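Key-based routing can be sketched with a simple hash-modulo scheme (a simulation only; the service’s actual partition hashing is an internal detail):

```python
import hashlib

def partition_for(key: str, partition_count: int) -> int:
    # A stable hash, so the same key always lands on the same partition,
    # which preserves per-key ordering.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % partition_count

# Four toy partitions; events carry a (key, payload) pair.
partitions = [[] for _ in range(4)]
for key, event in [("deviceA", 1), ("deviceB", 2), ("deviceA", 3)]:
    partitions[partition_for(key, 4)].append(event)

# Both "deviceA" events sit on one partition, still in arrival order.
p = partition_for("deviceA", 4)
assert [e for e in partitions[p] if e in (1, 3)] == [1, 3]
```

Because each key maps to exactly one partition, ordering is guaranteed per key rather than across the whole event hub.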
Azure Service Bus also provides this mechanism, working in much the same way; however, a partitioned topic or queue holds a maximum of 16 partitions. Event Hubs, on the other hand, has 4 partitions by default and can have up to 32 in the standard tier.
Azure Service Bus removes messages after they are read, even when they sit in partitions. Event Hubs keeps all events in their partition until the retention period lapses (a maximum of 7 days). We define the partition count when creating an event hub and cannot change it later, so we should think about how many we’ll need before creating the service.
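Time-based expiry can be sketched like this (a simulation, not the service’s implementation; the retention window here is an arbitrary stand-in for the configured value):

```python
RETENTION_SECONDS = 5  # stand-in for the configured retention period

def purge_expired(partition, now):
    # Events expire purely by age; consumers cannot delete them explicitly.
    # Each entry is a (timestamp, event) pair.
    return [(ts, e) for ts, e in partition if now - ts < RETENTION_SECONDS]

partition = [(0, "old-event"), (4, "recent-event")]
assert purge_expired(partition, now=6) == [(4, "recent-event")]
```

Note that nothing a consumer does affects retention: reading at offset 0 or offset 1000 leaves the partition contents unchanged until they age out.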
A note on defining event hub parameters
Event Hubs works according to the CAP theorem.
This theorem concerns the tension between consistency, availability, and partition tolerance. It states that for systems subject to network partitions, there is always a tradeoff between consistency and availability.
Brewer’s theorem defines these properties as follows:
- Partition tolerance: the ability of a data processing system to continue processing data even if a partition failure occurs.
- Availability: a non-failing node returns a reasonable response within a reasonable amount of time (with no errors or timeouts).
- Consistency: a read is guaranteed to return the most recent write for a given client.
It is best to understand the implications of this tradeoff before you set up an event hub for your solution. Additionally, the main purpose of your solution will determine the setup parameters.
A consumer group is a personalized view (state, position, or offset) of an event stream for one application or customer. Each consumer or application can read the data according to its own requirements. For instance, one customer may want to read each event only once and not store it; another may go back and read historical data again and again.
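These independent views can be sketched as separate cursors over the same shared log (pure Python; the group names are invented for illustration):

```python
events = ["e0", "e1", "e2", "e3"]

# Each consumer group keeps its own position into the same shared stream.
positions = {"realtime-group": 0, "replay-group": 0}

def read(group, count):
    start = positions[group]
    batch = events[start:start + count]
    positions[group] += len(batch)
    return batch

assert read("realtime-group", 4) == ["e0", "e1", "e2", "e3"]  # reads everything at once
assert read("replay-group", 2) == ["e0", "e1"]                # lags behind independently
assert read("replay-group", 2) == ["e2", "e3"]                # catches up later
```

One group racing ahead has no effect on the other, because each only advances its own cursor.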
Each consumer group can have a maximum of five concurrent readers on a partition. However, as Event Hubs operates on a partitioned consumer model, it is recommended that only one consumer be active on a partition at a time within a consumer group.
The best practice is to attach one reader to one partition within a consumer group, because this is the scenario Event Hubs was designed for. Adding up to four more readers to a partition should not have an impact on throughput and scalability.
Consumer groups offer two useful client-side features.
The first is the stream offset: a cursor that determines the starting point for reading from an event stream. The offset is managed by the consumer.
The second feature is the checkpoint. It persists the reader’s position (offset) within the stream, typically in Azure Blob Storage. This gives us high reliability: when an application disconnects from a consumer group and later reconnects, it can resume reading from the last checkpoint instead of starting over.
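Checkpointing can be sketched like this (a simulation; in practice the offset would be persisted to a durable store such as Azure Blob Storage rather than an in-memory dict):

```python
events = ["e0", "e1", "e2", "e3", "e4"]
checkpoint_store = {}  # stand-in for durable storage such as Blob Storage

def process(consumer_group, batch_size):
    # Resume from the last persisted offset (or the start of the stream).
    start = checkpoint_store.get(consumer_group, 0)
    batch = events[start:start + batch_size]
    # Persist the new offset so a reconnecting reader resumes here.
    checkpoint_store[consumer_group] = start + len(batch)
    return batch

assert process("group-a", 3) == ["e0", "e1", "e2"]
# Simulate a disconnect and reconnect: the next read resumes from the checkpoint.
assert process("group-a", 3) == ["e3", "e4"]
```

The design choice to separate the checkpoint store from the event stream is what lets a crashed or restarted consumer pick up exactly where it left off.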
Azure Event Hubs is very similar to Azure Service Bus in terms of functionality. However, its main objective is to consume and process huge quantities of events while maintaining high scalability and reliability.
Building Event Hubs around partitions has hit the bull’s eye, because it enables fast scaling and high throughput. The CAP theorem informed the architecture of this service, making it possible to reason about availability when implementing a partitioned service. Microsoft also allows shifting the CAP tradeoff from AP (available and partition-tolerant) to CP (consistent and partition-tolerant). Moreover, using checkpoints and offsets in your solution can give you extra reliability.
Consequently, Azure Event Hubs is a great solution for processing big data and can help you keep your solution running smoothly, even with fluctuating data volumes. If you wish to know more about how to use big data to transform your business, just get in touch!