Azure Data Factory is a crucial element of the Azure Big Data ecosystem. In essence, it orchestrates data flows and manages and triggers the execution of the individual pieces of an Azure Big Data application. The new version of Data Factory is an evolution of its predecessor, and it is known as Azure Data Factory V2 or, in short, ADF V2. Read on to learn how it will change the way we deal with Big Data in Azure.
I would like to invite you to a dedicated article series where I will take you on a journey showing how the Azure Data Factory has evolved. Today, I will highlight the most interesting features which have been introduced. The following posts will focus on the particular elements one by one. It is important to mention that the service is still in public preview.
Let’s get started!
Manually writing JSON files to create a pipeline? Not anymore!
Yes, you’ve read that correctly. Finally, this tedious, time-consuming and sometimes frustrating task is no longer an issue. Microsoft introduced Azure Data Factory Visual Tools, a completely new environment which will tremendously change the game and significantly improve pipeline development.
Azure Data Factory Visual Tools is a web-based application which you can access through the Azure Portal. The whole pipeline development lifecycle takes place here. With simple drag & drop operations you can configure and connect the individual building blocks of a pipeline.
What if you miss the JSON definitions of your pipelines, datasets or linked services? No worries, you can view them directly in Visual Tools. Furthermore, Azure Data Factory Visual Tools lets you connect to your Visual Studio Team Services Git repository (or to GitHub in the near future). Thanks to that, all your pipeline definition files can be stored in your repository.
Are you looking for more details about Azure Data Factory Visual Tools and their possibilities? Read my next post where I will focus only on this environment!
If this then do that or wait until…
Control Flow is something that was sorely missed in ADF V1. Controlling the execution of pipeline activities is a brand new feature released in ADF V2. With the set of new activities you can define the conditions under which an activity should or should not run, change the order of execution, and decide what the next step is when an activity succeeds, fails or completes.
In this release, we have several control flow activities at our disposal. They include constructs known from almost every programming language, like If, Until, For Each and Wait, and a few others like Web Activity, Execute Pipeline Activity or Get Metadata Activity.
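To make the success and failure branching more concrete, here is a minimal sketch in Python that builds an abridged pipeline definition as a dictionary and prints it as JSON. The activity names, the alert URL and the helper function are hypothetical, and the Copy activity's source and sink settings are left out, so treat it as an illustration of the dependency wiring rather than a deployable pipeline.

```python
import json


def activity(name, activity_type, depends_on=None, type_properties=None):
    """Build one activity entry in the general shape of an ADF V2 pipeline definition.

    `depends_on` is a list of (upstream activity name, dependency condition) pairs.
    """
    return {
        "name": name,
        "type": activity_type,
        "dependsOn": [
            {"activity": upstream, "dependencyConditions": [condition]}
            for upstream, condition in (depends_on or [])
        ],
        "typeProperties": type_properties or {},
    }


# Hypothetical activity names; only the dependency wiring matters here.
activities = [
    # Abridged: a real Copy activity also needs inputs, outputs, source and sink.
    activity("CopySales", "Copy"),
    # Runs only when the copy has succeeded.
    activity(
        "WaitBeforeNextStep",
        "Wait",
        depends_on=[("CopySales", "Succeeded")],
        type_properties={"waitTimeInSeconds": 30},
    ),
    # Runs only when the copy has failed, for example to call an alerting endpoint.
    activity(
        "NotifyFailure",
        "WebActivity",
        depends_on=[("CopySales", "Failed")],
        type_properties={"method": "POST", "url": "https://example.com/alert"},
    ),
]

pipeline = {"name": "ControlFlowSketch", "properties": {"activities": activities}}
print(json.dumps(pipeline, indent=2))
```

Using "Completed" as the dependency condition instead would run the downstream activity regardless of whether the copy succeeded or failed.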
More about developing conditions within the pipeline in another post!
It’s time to run the pipeline
Another improvement that came with Azure Data Factory V2 is the trigger. In addition, the whole processing model is more straightforward, convenient, flexible and easier to understand.
Do you remember the data slices definition for each data set and their coordination with activity windows? Fortunately, they disappeared and are not required anymore! Finally – thanks, Microsoft and ADF team! 🙂
So, the data slices in datasets are now just history. Therefore, let’s talk a little bit more about triggers and how to run pipelines. In ADF V2 we have two options to run the pipeline:
- Run it on-demand
- Run it through the trigger.
In the first case, we run the pipeline on demand. We can do it in many ways, including Visual Tools, the REST API, PowerShell or C#.
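As a rough illustration of the REST API route, the sketch below only builds the URL of the "create run" call; it does not send anything. The resource names are placeholders, and the api-version value is an assumption: the one shown here is the version published at general availability, so a factory that is still in preview may need a different one.

```python
from urllib.parse import quote, urlencode

MANAGEMENT_ENDPOINT = "https://management.azure.com"


def create_run_url(subscription_id, resource_group, factory, pipeline, api_version):
    """Build the URL used to start a pipeline run on demand (POST request)."""
    path = "/".join([
        "subscriptions", quote(subscription_id, safe=""),
        "resourceGroups", quote(resource_group, safe=""),
        "providers", "Microsoft.DataFactory",
        "factories", quote(factory, safe=""),
        "pipelines", quote(pipeline, safe=""),
        "createRun",
    ])
    return f"{MANAGEMENT_ENDPOINT}/{path}?{urlencode({'api-version': api_version})}"


# Placeholder values; replace them with your own subscription and factory details.
url = create_run_url(
    subscription_id="00000000-0000-0000-0000-000000000000",
    resource_group="my-resource-group",
    factory="my-data-factory",
    pipeline="CopySalesPipeline",
    api_version="2018-06-01",  # assumption, see the note above
)
print(url)
```

The actual request is a POST to this URL with an Azure AD bearer token in the Authorization header and, optionally, a JSON body containing the pipeline parameter values.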
The second option is something completely new. In order to run pipelines on a regular basis, we define a trigger. We can distinguish two types of this feature, i.e. a schedule trigger and a tumbling window trigger.
One of the differences between them is that the tumbling window trigger can process periods that lie in the past, which the schedule trigger does not support. As with on-demand runs, there are several ways to work with triggers, for example through Visual Tools or programmatically.
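The difference is easier to see with a small simulation. The sketch below is a simplified model in plain Python, not ADF code: the function names are made up, and it assumes a fixed interval and ignores details such as delays, retries and concurrency. It lists the past windows a tumbling window trigger would still process and the first time a schedule trigger would fire.

```python
from datetime import datetime, timedelta


def tumbling_windows(start, interval, now):
    """Yield (window_start, window_end) for every complete window that ended by `now`."""
    window_start = start
    while window_start + interval <= now:
        yield window_start, window_start + interval
        window_start += interval


def next_schedule_fire(start, interval, now):
    """Return the first fire time at or after `now`; earlier occurrences are skipped."""
    if now <= start:
        return start
    elapsed = now - start
    steps = -(-elapsed // interval)  # ceiling division on timedeltas
    return start + steps * interval


start = datetime(2018, 1, 1)
now = datetime(2018, 1, 4, 6, 0)
daily = timedelta(days=1)

# The tumbling window trigger backfills the three complete days before `now`.
for window_start, window_end in tumbling_windows(start, daily, now):
    print("window:", window_start, "to", window_end)

# The schedule trigger only looks forward from `now`.
print("next scheduled run:", next_schedule_fire(start, daily, now))
```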
At this point, I will end my story about triggers and I will get back to them in the future.
Make your pipeline more flexible with @parameters
Parameters are another new feature worth mentioning here. As you can imagine, they allow us to “inject” values that influence how an ADF pipeline behaves when it runs. Furthermore, using parameters in conjunction with control flow activities, ADF system variables and the expression language gives us really fancy possibilities for orchestrating pipeline behavior.
For instance, you can use pipeline parameters in the following situations:
- Changing the threshold for If condition activity
- Setting a different input or output path for Copy activity sink or source
- Changing the value of a property within the pipeline, linked service, dataset, trigger etc.
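The situations above can be sketched as an abridged pipeline definition, again built as a Python dictionary. The pipeline, activity, dataset and parameter names are all hypothetical, and the Copy activity's source and sink settings are omitted. The small helper at the end is not part of ADF; it simply checks that every parameter referenced in an expression is also declared.

```python
import json
import re

pipeline = {
    "name": "ParameterSketch",
    "properties": {
        # Declared parameters; values are supplied when the pipeline is run.
        "parameters": {
            "outputPath": {"type": "String"},
            "rowCount": {"type": "Int"},
            "threshold": {"type": "Int", "defaultValue": 1000},
        },
        "activities": [
            {
                "name": "CheckThreshold",
                "type": "IfCondition",
                "typeProperties": {
                    # The threshold of the If condition comes from a parameter.
                    "expression": {
                        "type": "Expression",
                        "value": "@greater(pipeline().parameters.rowCount, "
                                 "pipeline().parameters.threshold)",
                    },
                    "ifTrueActivities": [],
                    "ifFalseActivities": [],
                },
            },
            {
                "name": "CopyToOutput",
                "type": "Copy",
                # The output path is passed on to a (hypothetical) parameterized dataset.
                "outputs": [{
                    "referenceName": "OutputDataset",
                    "type": "DatasetReference",
                    "parameters": {"path": "@pipeline().parameters.outputPath"},
                }],
            },
        ],
    },
}


def referenced_parameters(definition):
    """Return the set of parameter names used in expressions within the definition."""
    return set(re.findall(r"pipeline\(\)\.parameters\.(\w+)", json.dumps(definition)))


declared = set(pipeline["properties"]["parameters"])
undeclared = referenced_parameters(pipeline) - declared
print("undeclared parameters:", sorted(undeclared))
```

A run would then supply values such as an output path and a row count, while the threshold falls back to its default unless it is overridden.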
I will cover this topic in more detail in another post where I will provide you with more real-life examples.
Welcome to the cloud, SSIS packages!
More and more components from the classic SQL Server Business Intelligence stack are brought to the Microsoft Azure ecosystem. Last time it was Analysis Services Tabular, now it’s time for Integration Services!
From now on you can run your SSIS packages in a dedicated environment called Integration Runtime (IR for short). IR is the compute infrastructure used by Azure Data Factory. Its purpose is not only SSIS package execution but also activity dispatch and data movement. I will not come back to the latter two, so I would like to say a little about them here:
- Data movement – the purpose is quite simple. Azure Data Factory uses the integration runtime to carry the whole burden of copying data between different services or data sources.
- Activity dispatch – it is used for triggering and monitoring transformation activities which are executed on other services like Azure SQL Database, Azure Machine Learning, an HDInsight cluster and others.
Integration Runtime does not affect how a SQL Server Integration Services package is developed. This means that you can still use SQL Server Data Tools (SSDT) for development, and you can also use it for deployment to your cloud environment (i.e. IR).
If you are curious how to create and maintain SSIS Integration Runtime and use it in practice then just follow my posts where I will dive into more detail.
It is just the beginning
Today, we started a series which describes the most important and valuable features introduced in Azure Data Factory V2. ADF V2 is certainly a service that deserves your attention.
The recent enhancements greatly improve the whole development lifecycle. Please be aware that this is still a public preview, so some features might not work correctly and some things might change before general availability (GA). Nevertheless, it is worth staying up to date and one step ahead!
I will cover all five topics in more detail in future articles. I encourage you to read them. Stay tuned, or for more information, get in touch!