Azure Data Factory Overview For Beginners
To complete the Extract, Transform, and Load (ETL) process, engineers can depend on several tools and technologies. One of these tools is Azure Data Factory (ADF). At its core, ADF is a data pipeline orchestrator and ETL tool facilitating easy and streamlined data processing.
With ADF, we can transfer data at scale and with speed while creating bespoke data-driven workflows and scheduling pipelines. Besides its flexibility for processing data, ADF is also relatively easy to learn.
This makes Azure Data Factory a good fit for beginners in this field, and for situations where you need a reliable way to complete the task quickly.
In this guide, we will go through the steps to begin working with ADF. It covers:
- Creating datasets
- Creating a pipeline
- Debugging the pipeline
- Manual triggering of pipeline
- Scheduled triggering
- Monitoring the pipeline
Let’s get started.
Setting Up Azure Data Factory
Before moving to the setup process, make sure of a few things:
- Get a subscription with Azure. You can make an account for free and get started with the basics right away.
- Next, identify your role in the Azure account. To set up everything from scratch, take on the role of an administrator.
- However, to work on the child resources (datasets, pipelines, triggers, etc.), the Data Factory Contributor role is enough.
Here are the steps to create an Azure data factory.
1. Launch Data Factory
You can use Microsoft Edge or Google Chrome to access your Azure account. Once in, navigate to Azure Portal, click on Create a Resource, and select Integration. From the options given, find and click on Data Factory.
2. Add Resource
In the window that is now open, look for a tab named Basics. Then select your Azure Subscription. This is important because the data factory you are about to create will be attached to this subscription.
When prompted to choose a Resource Group, use the drop-down list to select an existing one or create a new one.
Follow the “What is Azure Resource Manager” guide to learn more about creating a resource group.
3. Select Region
Regions are the geographical locations where Azure runs its services, and the supported ones are listed on the platform. The region you pick determines where the Azure data factory metadata will be stored. Supported regions include:
- West US
- East US
- North Europe
4. Enter a Name and Version
The data factory needs a globally unique name. For trial purposes, you can use ADTTutorialDataFactory or anything else you want. If the name is not unique, you will get an error message; change the name and try again. With the name fixed, move to Version and select V2.
Read the Data Factory naming rules before you continue, because the same rules apply to the names you give to Data Factory artifacts later on.
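To make the naming step concrete, here is a minimal sketch of a local format check for a candidate factory name. The rules it encodes (3 to 63 characters, only letters, digits, and dashes, with every dash surrounded by a letter or digit) are an assumption based on the published naming rules and should be verified against the current version. The function name is hypothetical, and the check cannot tell you whether the name is globally unique; only Azure can.

```python
import re

# Assumed format rules (verify against the current naming-rules page):
# 3-63 characters, letters/digits/dashes only, starts and ends with a
# letter or digit, and never two dashes in a row.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def is_valid_factory_name(name: str) -> bool:
    """Return True if `name` satisfies the assumed format rules.

    This is a local format check only; global uniqueness is decided by
    Azure when the factory is created.
    """
    return 3 <= len(name) <= 63 and _NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["ADTTutorialDataFactory", "my--factory", "-factory", "ab"]:
        print(candidate, is_valid_factory_name(candidate))
```

Running a check like this before opening the portal only saves a round trip on obviously malformed names; the portal remains the authority.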
5. Git Configuration and Review
In the last step of setting up the ADF, move to the next tab, Git Configuration. Here, select Configure Git Later and click on Review & Create. The configuration has to pass validation before you can click on Create.
6. Open Azure Data Factory Studio
Once you have created the ADF, click on Go To Resource to open your Data Factory page. Towards the bottom, you will see Open Azure Data Factory Studio. This will open the data factory page in a new tab.
If the browser gets stuck while authorizing, check that it is not blocking third-party cookies and site data.
Next Step – Create a Linked Service
Creating a linked service is the next crucial step in ETL processing with Azure Data Factory. A linked service connects the Data Factory to the data store from which the data will be extracted. Linked services work the same way when you are working with a Synapse Workspace.
Creating this service is like identifying and defining the connection information required to connect the data factory to external sources.
Follow the steps to create a linked service in the Azure data factory.
1. Create New Service
Start by opening the Manage tab located on the left side panel. There you will find Linked Services; click on it and then click on New to create a new linked service.
2. Azure Blob Storage
After the Create Linked Service page opens, find and select Azure Blob Storage, then click on Continue. On the next page, do the following:
- Fill in a name – it can be anything. For tutorial purposes, let’s keep it AzureStorageLinkedService.
- Next, select your Subscription and account name from the drop-down list.
- In Test Connection, select “To Linked Service” and click on Create.
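Behind the form, Data Factory stores each linked service as a JSON definition. The sketch below builds what such a definition might look like for the linked service created above. The field layout is an assumption based on the commonly documented shape for Azure Blob Storage linked services, the helper name is hypothetical, and the connection string is a placeholder, not a working credential.

```python
import json


def blob_linked_service(name: str, connection_string: str) -> dict:
    """Build a JSON-style definition for an Azure Blob Storage linked service.

    The field layout is an assumption; compare it with the JSON view of
    your own linked service before relying on it.
    """
    return {
        "name": name,
        "properties": {
            "type": "AzureBlobStorage",
            "typeProperties": {"connectionString": connection_string},
        },
    }


if __name__ == "__main__":
    # Placeholder values: substitute your own storage account details.
    placeholder = (
        "DefaultEndpointsProtocol=https;"
        "AccountName=<accountName>;AccountKey=<accountKey>"
    )
    definition = blob_linked_service("AzureStorageLinkedService", placeholder)
    print(json.dumps(definition, indent=2))
```

Keeping the secret out of the definition itself (for example, by referencing a key vault) is usually preferable outside a tutorial setting.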
Moving On → Creating Datasets
Datasets are named views of the data held in your data stores. Simply put, a dataset represents the data that you will read or write when processing it in Azure Data Factory.
In ADF, you can create datasets with a simple procedure. The important thing to remember is that you will need to create two datasets, InputDataset and OutputDataset.
InputDataset
This dataset represents the source data in the input folder. Here you need to specify:
- Blob container
- Input folder
OutputDataset
This dataset represents the data that is sent to the destination. Here too, you need to specify:
- Blob container
- Output folder
The name of the output file depends on the run ID, which is generated for each pipeline run.
When creating the datasets, specify these details accurately. The source dataset settings tell the system where the data resides, and the sink dataset settings tell it where the data will be copied.
With this done, move forward and complete the following steps:
- In the basic configuration, click on the Author (pencil sign) tab and then click on the (+) sign located beside the search bar. From the drop-down menu, select Dataset.
- Dataset Page: On the new dataset page, select Azure Blob Storage and click on Continue. This will take you to the Select Format page, where you need to select Binary. Click on Continue.
- Set Properties: In this step, we will configure the properties of the dataset as follows:
- Enter InputDataset in the Name field.
- Under the Linked Service menu, select AzureStorageLinkedService.
- Select the file path by clicking on the Browse button. Once selected, click on OK.
Repeat the same process to configure OutputDataset. If the output folder does not exist yet, it will be created at runtime.
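The two datasets differ only in their name and folder, which a short sketch makes visible. It builds JSON-style definitions for InputDataset and OutputDataset in the Binary format, both pointing at AzureStorageLinkedService. The field layout is an assumption based on the commonly documented dataset shape, and the container name `adftutorial` and the folder names `input` and `output` are hypothetical examples, not values from this guide.

```python
import json


def binary_blob_dataset(name: str, linked_service: str,
                        container: str, folder: str) -> dict:
    """Build a JSON-style definition for a Binary dataset on Blob Storage.

    The layout is an assumption; compare it with the JSON view of a
    dataset created in the Studio.
    """
    return {
        "name": name,
        "properties": {
            "type": "Binary",
            "linkedServiceName": {
                "referenceName": linked_service,
                "type": "LinkedServiceReference",
            },
            "typeProperties": {
                "location": {
                    "type": "AzureBlobStorageLocation",
                    "container": container,
                    "folderPath": folder,
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical container and folder names, for illustration only.
    source = binary_blob_dataset(
        "InputDataset", "AzureStorageLinkedService", "adftutorial", "input")
    sink = binary_blob_dataset(
        "OutputDataset", "AzureStorageLinkedService", "adftutorial", "output")
    print(json.dumps([source, sink], indent=2))
```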
Creating a Pipeline in Azure Data Factory Studio
A pipeline is a group of activities that are scheduled to run together. All the activities in a pipeline work toward completing a single task, so creating a pipeline helps you organize and streamline those activities.
- Create a Pipeline: From the Author page, click on the (+) sign and select Pipeline. In the window that opens, you will see five different tabs at the bottom of the page.
- Copy Data: Click on the General tab followed by selecting Properties. Name the pipeline (we can keep it CopyPipeline) and then close the General panel.
- Activities: On the left-hand side, you will find the Activities panel. Under this, locate Move & transform and drag Copy Data onto the white surface area, which is the space given to create the pipeline. Below the tabs, you will find the Name field; enter CopyfromBlobtoBlob as the activity name.
- Validation: Switch to the Source tab and select InputDataset for Source Dataset. Next, in the Sink tab, choose OutputDataset for Sink Dataset. With this done, click on the Validate option on the top panel to check the pipeline settings.
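The pipeline assembled in the steps above can also be read as a single JSON document: one pipeline containing one Copy activity that reads from InputDataset and writes to OutputDataset. The sketch below is an approximation under that assumption. The exact fields the Studio generates may differ, and the helper name is hypothetical.

```python
import json


def copy_pipeline(pipeline_name: str, activity_name: str,
                  source_dataset: str, sink_dataset: str) -> dict:
    """Build a JSON-style pipeline definition with a single Copy activity.

    The layout is an assumption; the Studio may emit additional fields
    such as policies or store settings.
    """
    copy_activity = {
        "name": activity_name,
        "type": "Copy",
        "inputs": [
            {"referenceName": source_dataset, "type": "DatasetReference"}
        ],
        "outputs": [
            {"referenceName": sink_dataset, "type": "DatasetReference"}
        ],
        # Binary source and sink match the Binary format chosen for the datasets.
        "typeProperties": {
            "source": {"type": "BinarySource"},
            "sink": {"type": "BinarySink"},
        },
    }
    return {"name": pipeline_name, "properties": {"activities": [copy_activity]}}


if __name__ == "__main__":
    pipeline = copy_pipeline(
        "CopyPipeline", "CopyfromBlobtoBlob", "InputDataset", "OutputDataset")
    print(json.dumps(pipeline, indent=2))
```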
Next Up — Debugging the Pipeline
Debugging is simple. The goal is to make sure the pipeline is free from errors and issues before you publish it to Azure Data Factory. To run a debug session, follow these steps:
- Above the white surface area where you have just created the pipeline, find Debug. Click on it to start a test run of the pipeline.
- Confirm the status and results of the debug run for the pipeline you have just created.
Manually Triggering the Pipeline
Besides automatic processing of the pipeline, you can manually trigger the execution. Before manually triggering the pipeline, publish all the entities to the Data Factory.
For this, click on Publish, located on the top main menu. Once published, click on Add Trigger located on the Pipeline toolbar and select Trigger Now. When the Pipeline Run page opens up, click on OK.
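Trigger Now has a programmatic counterpart: the Azure management REST API exposes a "create run" operation for pipelines. The sketch below only builds the request URL and sends nothing. The URL layout and the `2018-06-01` API version are assumptions to check against the REST reference, and the subscription, resource group, and factory values shown are placeholders.

```python
from urllib.parse import quote

MANAGEMENT_ENDPOINT = "https://management.azure.com"
API_VERSION = "2018-06-01"  # assumed API version; verify before use


def create_run_url(subscription_id: str, resource_group: str,
                   factory_name: str, pipeline_name: str) -> str:
    """Return the URL that a POST request would target to start a pipeline run."""
    segments = [
        "subscriptions", quote(subscription_id, safe=""),
        "resourceGroups", quote(resource_group, safe=""),
        "providers", "Microsoft.DataFactory",
        "factories", quote(factory_name, safe=""),
        "pipelines", quote(pipeline_name, safe=""),
        "createRun",
    ]
    return f"{MANAGEMENT_ENDPOINT}/{'/'.join(segments)}?api-version={API_VERSION}"


if __name__ == "__main__":
    # Placeholder identifiers, for illustration only.
    print(create_run_url(
        "<subscriptionId>", "<resourceGroup>",
        "ADTTutorialDataFactory", "CopyPipeline"))
```

An actual call would be an authenticated POST with a bearer token, which is left out here on purpose.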
Monitoring the Pipeline
Monitoring is important to ensure that the pipeline is running according to the requirements. For this:
- Open the Monitor tab on the left-hand panel. It’s the tab below Author. Click on Refresh to update the list, then find the name of the pipeline you want to check.
- Select the name of the pipeline and check its status. To view more information, click on Details (it’s the image of spectacles). To further explore how the properties are configured, check out Copy Activity Overview.
- Here you can also confirm whether the system has created a new output folder or not.
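Clicking Refresh until the status settles is, in effect, a polling loop. The sketch below shows that loop in isolation. `get_status` is a hypothetical callable standing in for whatever returns the current run status, and the status names treated as final (`Succeeded`, `Failed`, `Cancelled`) are assumptions to check against the statuses the Monitor page actually shows.

```python
import time
from typing import Callable

# Assumed names of statuses after which a run no longer changes.
TERMINAL_STATUSES = {"Succeeded", "Failed", "Cancelled"}


def wait_for_run(get_status: Callable[[], str],
                 poll_seconds: float = 15.0,
                 max_polls: int = 40) -> str:
    """Poll `get_status` until it returns a terminal status or polls run out.

    Returns the last status seen, which may be non-terminal if
    `max_polls` was reached first.
    """
    status = get_status()
    polls = 1
    while status not in TERMINAL_STATUSES and polls < max_polls:
        time.sleep(poll_seconds)
        status = get_status()
        polls += 1
    return status


if __name__ == "__main__":
    # Simulated status sequence in place of a real monitoring call.
    simulated = iter(["Queued", "InProgress", "Succeeded"])
    print(wait_for_run(lambda: next(simulated), poll_seconds=0))
```

The cap on polls keeps the loop from running forever if a run gets stuck in a non-terminal state.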
Scheduling a Pipeline Trigger
Scheduling a trigger for the pipeline is not always necessary, but it is an option when the pipeline should run periodically. Here’s how it can be done.
1. Start by going to the Author page and clicking on Add Trigger. Then click on New/Edit and, on the Add Trigger page, click on Choose Trigger followed by New.
2. On this page, fill in the required fields. Enter the Start Date, Recurrence, and the End On date. Select Activated and then click on OK.
3. You might get a warning message here; click on OK and move forward. Back on the main page, click on Publish All to apply the changes to the Data Factory.
4. Once this is done, go to the Monitor tab and click on Refresh. In the Triggered By column, you will notice that the new runs show the scheduled trigger instead of Trigger Now. This change confirms that the scheduled trigger has been set.
5. To further check the progress, click on Trigger Runs and check that an output file is created for every pipeline run.
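The Start Date, Recurrence, and End On fields from step 2 map onto a small JSON definition. The sketch below builds one for a trigger that runs CopyPipeline at a fixed interval. The field layout is an assumption based on the commonly documented schedule-trigger shape, and the trigger name, dates, and 15-minute interval are hypothetical example values.

```python
import json
from datetime import datetime, timezone


def schedule_trigger(name: str, pipeline_name: str,
                     start: datetime, end: datetime,
                     frequency: str = "Minute", interval: int = 15) -> dict:
    """Build a JSON-style definition for a schedule trigger.

    `start` and `end` must be timezone-aware datetimes; they are written
    out in UTC. The layout is an assumption to verify against a trigger
    created in the Studio.
    """
    def to_utc_text(moment: datetime) -> str:
        if moment.tzinfo is None:
            raise ValueError("start and end must be timezone-aware")
        return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    return {
        "name": name,
        "properties": {
            "type": "ScheduleTrigger",
            "typeProperties": {
                "recurrence": {
                    "frequency": frequency,
                    "interval": interval,
                    "startTime": to_utc_text(start),
                    "endTime": to_utc_text(end),
                    "timeZone": "UTC",
                }
            },
            "pipelines": [
                {
                    "pipelineReference": {
                        "referenceName": pipeline_name,
                        "type": "PipelineReference",
                    }
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical trigger name and time window, for illustration only.
    trigger = schedule_trigger(
        "ExampleScheduleTrigger", "CopyPipeline",
        start=datetime(2022, 8, 30, 9, 0, tzinfo=timezone.utc),
        end=datetime(2022, 8, 30, 10, 0, tzinfo=timezone.utc),
    )
    print(json.dumps(trigger, indent=2))
```

A short window with a small interval, as in this example, is convenient for checking that one output file appears per run without leaving the trigger active for long.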
This entire exercise sums up how to operate Azure Data Factory Studio and run a pipeline that extracts data, transforms it, and loads it into the designated output destination.
Companies that have to transfer large volumes of data, especially from legacy systems, will find working with Azure Data Factory straightforward. It supports a wide range of file and data types, making your work faster and easier.
Posted by Punit Parmar on August 30, 2022