Azure Data Factory pipeline definitions are stored as a set of JSON files. Previously, these files had to be created manually. Fortunately, this is no longer the case! With Azure Data Factory V2 Visual Tools you no longer have to worry about JSON files. Read on to see how easily you can create ADF pipelines!
Welcome back to my second post about Azure Data Factory V2. Today, I will cover Visual Tools in more detail. They introduce a completely new experience to ADF pipelines development.
The previous approach was very error-prone, and you often noticed an issue only during deployment. For simple pipelines such issues could be easy to detect and fix, but for more complex ones they were a real challenge.
The new feature gets rid of these problems. Additionally, if you have experience with SQL Server Integration Services packages, you will notice that the user interface looks familiar.
Open the gate to the new portal!
First of all, you have to provision ADF V2 on your Azure subscription. In order to open Visual Tools, go to your Azure Data Factory V2 instance and click on Author & Monitor.
The starting point is the Overview page where you can watch some introduction videos and tutorials.
To open the Visual Tools designer, click the pen icon in the top left corner. If you want to switch between ADF instances in different subscriptions, click the Azure Data Factory icon at the top. The gauge icon at the bottom opens the Monitor feature.
Notice: Due to enhancements introduced in ADF V2, the JSON definitions differ from those in ADF V1. Therefore, it is not possible to use Visual Tools with ADF V1; you have to create your pipelines from scratch. At this moment there is no tool that migrates pipelines from V1 to V2, and I have no information that such a tool will appear in the future.
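To give you a feel for the V2 format that Visual Tools generates for you, here is a sketch of a minimal pipeline definition built as a Python dictionary. Treat it as an illustration rather than a complete schema reference: the pipeline name and the Wait activity are made up for this example.

```python
import json

# Minimal sketch of an ADF V2 pipeline definition (illustrative only):
# activities sit under "properties.activities", and scheduling now lives
# in separate trigger objects rather than inside the pipeline itself.
pipeline = {
    "name": "ExamplePipeline",
    "properties": {
        "activities": [
            {
                "name": "WaitBriefly",
                "type": "Wait",
                "typeProperties": {"waitTimeInSeconds": 10},
            }
        ],
    },
}

print(json.dumps(pipeline, indent=2))
```

This is the same JSON you can inspect later through the Code button in the designer.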
Visual Tools provides a graphical interface for developing Azure Data Factory V2 pipelines. Let me say a little bit about the most important elements you can find there (see the diagram below).
- List of all pipelines and datasets defined within the current Azure Data Factory instance.
- “+” button – create and define a new pipeline or dataset.
- Connections – create and define a new linked service or integration runtime (including SSIS integration runtime).
- Publish All – deploys all elements (pipelines, datasets, triggers, linked services, etc.) to ADF. When your pipeline development is done, just click Publish All.
- Discard All – cancels all changes which were not published.
- ARM Template – download an ARM template containing the definitions of all pipelines, datasets, etc. This can be useful if you want to deploy the solution to another environment. You can find more details on how to deploy resources from an ARM template here.
- List of all activities split across several groups. We have several new activities under General (Execute Pipeline, Get Metadata, Lookup, Web) and Iteration & Conditions (ForEach, If Condition, Until, Wait).
- Designer canvas – this is the place where you create your pipeline workflow. Simply drag and drop the required activities and connect them. In today’s example, we have a simple pipeline with two activities. First it performs a copy activity and, on success, invokes a stored procedure on the database.
- Properties panel – here you can find all properties related to the selected activity (in our case copy activity). This panel displays a different set of attributes depending on the selected activity. If none of them are selected then Visual Tools will display the properties of the current pipeline. As you can see, you do not have to fill in JSON files to define an activity or dataset. You can set all the required properties just by entering the appropriate value in particular controls.
- Validate – check if your pipeline is defined properly. In case of any errors, the appropriate panel will appear with a list of issues.
- Test Run – run a pipeline in order to test it. It does not require the Publish All action, which means that it is run without deployment; the existing version will not be replaced.
- Trigger – run the pipeline on demand by using Trigger Now (this option requires that the pipeline is published/deployed). Create or modify existing triggers by using the New/Edit option. In ADF V2 we have two types of trigger (Schedule and Tumbling Window), which I will cover in another post.
- Code – do you miss JSON definitions? Just click the Code button to see the pipeline definition in JSON format.
- ADF object navigation tab – you can switch between particular pipelines, datasets, linked services or triggers and set up each of them.
- Designer canvas buttons – zoom in, zoom out, lock canvas, reset zoom level, zoom to fit, turn on/off multi-select, auto align, show lineage.
- Additional buttons – help/information, show notification, send feedback, sign out.
- Resource group and the current ADF instance name.
- Show/hide – Pipeline Validation Output – here you can find a list of potential errors when you click the Validate button. If everything is fine, then the list will be empty.
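The triggers mentioned above are also defined as JSON under the hood. Below is a hedged sketch of what a Schedule trigger definition might look like; the trigger name, start time, and referenced pipeline name are assumed example values, not part of this walkthrough.

```python
import json

# Illustrative sketch of an ADF V2 Schedule trigger: a recurrence plus a
# list of pipelines to start. Field names follow the documented schema,
# but treat this as an example, not a complete reference.
trigger = {
    "name": "DailyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2018-04-01T08:00:00Z",
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "ExamplePipeline",
                    "type": "PipelineReference",
                }
            }
        ],
    },
}

print(json.dumps(trigger, indent=2))
```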
Ok, let’s create a simple pipeline
Now you know what Visual Tools looks like, and I have explained most of its components. Next, I will show you how to create a simple pipeline step by step.
The pipeline task will be very simple: copy data from Blob Storage to Azure Data Lake Store. Before you start, make sure that you have an Azure Storage Account with Blob storage and an Azure Data Lake Store already created on your Azure subscription.
Create a pipeline
- Click the “+” button and then select Pipeline.
- Then, set the pipeline name.
Create datasets for Azure Blob Storage and Azure Data Lake
- In this next step, we need to provide details about the input and output datasets. Click the “+” button and then Dataset.
- The New Dataset panel will appear. As you can see, Visual Tools supports the creation of many different dataset types. Find Azure Blob Storage and click the Finish button. You can use the search text box to find the desired dataset more quickly.
- Set the name for the created dataset, i.e. AzureBlobOutput.
- Repeat the steps and create a dataset for Azure Data Lake Store. Name this dataset AzureDataLakeStoreInput.
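Behind the scenes, each dataset you create in the designer becomes a small JSON document. A hedged sketch of what the freshly created Blob Storage dataset might look like at this point, before any connection details are filled in (the `AzureBlob` type name follows the ADF V2 schema, but treat this as an illustration):

```python
import json

# Sketch of a newly created Azure Blob Storage dataset as Visual Tools
# might generate it; connection details are filled in in a later step.
dataset = {
    "name": "AzureBlobOutput",
    "properties": {
        "type": "AzureBlob",
        "typeProperties": {},  # folder/file path comes later
    },
}

print(json.dumps(dataset, indent=2))
```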
Create Linked Services for Blob and Data Lake Store
- In the bottom left corner, click on Connections.
- From the Connections panel, you can create Linked Services and Integration Runtimes. In order to create a new Linked Service, click the New button.
- The New Linked Service creation panel looks very similar to the one for datasets. Notice that here we have an additional split between Data Store and Compute. Find Azure Blob Storage and click Continue.
- You will then be navigated to a new panel where you can set up the Linked Service properties for Blob Storage. Set the Name, then choose the Azure subscription and the Storage account from which you want to copy data. You can test the connection by clicking the Test connection button. Then, confirm creation by clicking the Finish button.
- Repeat the same steps and set up a Linked Service for Azure Data Lake Store. Note that the set of properties differs from Azure Blob Storage. To set up the service principal ID and key, refer to the following article.
- Upon completion, you should have one pipeline, two datasets, and two linked services defined.
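A linked service is, again, just JSON underneath. Here is a hedged sketch of a Blob Storage linked service definition; the service name and connection string are placeholder values, and in practice the portal stores the account key securely rather than in plain text (the exact `type` name can vary between API versions).

```python
import json

# Illustrative sketch of an Azure Blob Storage linked service; the
# connection string below is a placeholder, not a working secret.
linked_service = {
    "name": "BlobStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": (
                "DefaultEndpointsProtocol=https;"
                "AccountName=<account>;AccountKey=<key>"
            )
        },
    },
}

print(json.dumps(linked_service, indent=2))
```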
Create a Copy Activity
- Switch to pipeline by clicking on the pipeline name on the left panel.
- From Activities, expand the Data Flow panel and drag & drop the Copy activity.
- Click on the created copy activity and set its properties. In the Properties panel, go to General to set the activity name, then to the Source tab to set AzureBlobOutput as the Source Dataset, and then to the Sink tab to set AzureDataLakeStoreInput as the Sink Dataset.
- You should then notice a green tick mark, which means that the Copy activity setup is correct.
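In the pipeline's JSON, the configured copy activity would look roughly like the sketch below. The activity name and the source/sink type names are assumed example values; only the dataset names come from this walkthrough.

```python
import json

# Hedged sketch of the copy activity inside the pipeline JSON: it
# references the two datasets created earlier by name.
copy_activity = {
    "name": "CopyBlobToDataLake",  # assumed activity name
    "type": "Copy",
    "inputs": [
        {"referenceName": "AzureBlobOutput", "type": "DatasetReference"}
    ],
    "outputs": [
        {"referenceName": "AzureDataLakeStoreInput", "type": "DatasetReference"}
    ],
    "typeProperties": {
        "source": {"type": "BlobSource"},
        "sink": {"type": "AzureDataLakeStoreSink"},
    },
}

print(json.dumps(copy_activity, indent=2))
```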
Set up Datasets’ input and output locations
- In order to finish this part, make sure that you have some files or folders in your source Azure Blob Storage.
- Click on the AzureBlobOutput dataset and switch to Connection tab in the properties panel.
- Set the Linked Service by selecting the appropriate item from the dropdown list.
- Next, set the File path where you store the files which you would like to copy. You can indicate a specific file or folder.
- If you select the Binary Copy option, the copied files will be treated as binary and no schema will be enforced.
- Repeat these same steps for the AzureDataLakeStoreInput dataset.
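After the Connection tab is filled in, the dataset's JSON (visible via the Code button) carries the linked service reference and the file path. A hedged sketch, where the linked service name and folder path are assumed example values; leaving out a "format" section is what corresponds to a binary copy.

```python
import json

# Sketch of the AzureBlobOutput dataset after its connection is set up;
# no "format" section means the files are copied as-is (binary copy).
dataset = {
    "name": "AzureBlobOutput",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "BlobStorageLinkedService",  # assumed name
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "folderPath": "input-container/source-folder"  # assumed path
        },
    },
}

print(json.dumps(dataset, indent=2))
```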
Validation and pipeline test run
Our pipeline is now ready, but before you perform a test run, let’s validate it.
- Select the created pipeline and click the Validate button.
- If everything is fine and validation did not detect any errors, you should see a success notification.
- Great! Now our pipeline is ready to run. At this moment we have two options:
- We can deploy the pipeline using Publish All and run it on demand by using Trigger Now.
- We can use the Test Run option.
The difference is that with Test Run, the pipeline is not deployed. This is especially useful when we work on a new version of an already deployed pipeline and do not want to replace the existing, working one: we can test and check whether the new version works correctly without touching the deployed pipeline. When you want to deploy the pipeline, click Publish All.
- Click Test Run to execute a test run of the pipeline. You can switch to the Output tab to see execution details. If everything works well, the status value is Succeeded. You can see more execution details by clicking the icons in the Actions column.
Notice: if you use Test Run, the pipeline execution logs are not visible in the Monitor module. In Monitor, you can only see the logs of published pipelines. I will write more about Monitor in the next article.
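To build intuition for what the Validate button checks, here is a toy structural check in Python. This is emphatically not ADF's actual validation logic, just a simplified illustration of the idea: walk the definition and collect a list of issues, where an empty list means the pipeline passes.

```python
# Toy approximation of a structural pipeline check (NOT ADF's real
# validation logic): returns a list of issues, empty if all is well.
def validate_pipeline(pipeline: dict) -> list:
    issues = []
    props = pipeline.get("properties", {})
    activities = props.get("activities", [])
    if not pipeline.get("name"):
        issues.append("Pipeline has no name")
    if not activities:
        issues.append("Pipeline contains no activities")
    for activity in activities:
        if "type" not in activity:
            issues.append(f"Activity {activity.get('name', '?')} has no type")
    return issues

example = {
    "name": "Demo",
    "properties": {"activities": [{"name": "Wait1", "type": "Wait"}]},
}
print(validate_pipeline(example))
```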
- The pipeline is ready! Click Publish All to deploy it to the Azure Data Factory instance. To run this pipeline on a regular schedule, you need to create a Trigger, which I will describe in a future post.
In this post, I showed you how to create a simple pipeline step by step, including all the required elements, i.e. datasets and linked services. The great thing is that we did it without writing a single line of JSON.
In the next post, I will show you how to apply a more complex logic which includes conditions, loops or parameters. Stay tuned! And if you’d like to chat – get in touch!