enrich data generated in the cloud
Fortunately, Microsoft Azure has answered these questions with Azure Data Factory (ADF), a platform that lets users build workflows that ingest data from both cloud and on-premises data stores and transform or process it using existing compute services. The results can then be published to cloud or on-premises data stores for business intelligence (BI) applications to consume.
What is Azure Data Factory?
How does Data Factory work?
The Data Factory service allows organizations to create data pipelines that move and transform data, and then run those pipelines on a specified schedule. This means the data consumed and produced by workflows is time-sliced, and a pipeline can be set to run once or on a recurring schedule.
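As a rough sketch of what a recurring schedule looks like, a schedule trigger in ADF is defined as a JSON resource that references the pipeline it should run. The names `DailyIngestTrigger` and `CopySalesDataPipeline` below are hypothetical placeholders, not names from this article:

```json
{
  "name": "DailyIngestTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2024-01-01T00:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "CopySalesDataPipeline",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```

Changing `frequency` and `interval` (for example, `"Hour"` with `"interval": 6`) adjusts how often the pipeline's time slices are produced.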
So how does Azure Data Factory work in practice? The pipelines in ADF typically perform the following three steps:
Collect and Connect: Connect to all the required sources of data and processing, such as file shares, SaaS services, FTP, and web services. Then use the Copy Activity in a data pipeline to move data from cloud and on-premises data stores into a centralized data store in the cloud for further analysis.
Enrich and Transform: Once data is available in a centralized data store in the cloud, it is transformed using compute services such as HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Machine Learning.
Publish: Deliver the transformed data from the cloud to on-premises sources such as SQL Server, or keep it in your cloud storage sources for ingestion by BI and analytics tools and other applications.
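The "Collect and Connect" step above can be sketched as a pipeline containing a single Copy Activity. This is a minimal illustration, assuming two datasets (`OnPremSqlDataset` and `BlobStagingDataset`) have already been defined; all names here are hypothetical:

```json
{
  "name": "IngestToBlobPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyFromOnPremSqlToBlob",
        "type": "Copy",
        "inputs": [
          { "referenceName": "OnPremSqlDataset", "type": "DatasetReference" }
        ],
        "outputs": [
          { "referenceName": "BlobStagingDataset", "type": "DatasetReference" }
        ],
        "typeProperties": {
          "source": { "type": "SqlSource" },
          "sink": { "type": "BlobSink" }
        }
      }
    ]
  }
}
```

The `source` and `sink` types tell the Copy Activity how to read from and write to the stores behind each dataset.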
What is Azure Data Factory used for?
Every cloud project needs data migration activities across disparate data sources, services, and networks, whether in the cloud or on-premises. ADF therefore acts as a necessary enabler for organizations stepping into the era of cloud computing.
Consider an example where an organization has to manage big data workflows in the cloud. It must deal with large volumes of data that might be stored either in cloud-based stores such as Azure Data Lake Store or Azure Blob Storage, or in some on-premises storage system. The organization has services to transform the data, but the challenge is figuring out how to automate moving the data to the cloud, then storing and processing it for further analysis. That's where Azure Data Factory comes in.
Some critical components in Data Factory
Data Factory has some key components that work to define processing events, input and output data, and the schedule and resources needed to execute the data flow in the desired environments:
Representing datasets and data structures within the data stores
An input dataset represents the input for an activity in the pipeline, and an output dataset represents the output of that activity. For instance, an Azure Blob dataset specifies the blob container and folder in Azure Blob Storage from which the pipeline should read data, while an Azure SQL Table dataset specifies the table to which the activity writes its output.
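As an illustration of the Azure Blob dataset described above, here is a minimal sketch of such a dataset definition. It assumes a linked service named `StorageLinkedService` already exists; the dataset name and folder path are hypothetical:

```json
{
  "name": "AzureBlobInputDataset",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": {
      "referenceName": "StorageLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "folderPath": "salesdata/input",
      "format": {
        "type": "TextFormat",
        "columnDelimiter": ","
      }
    }
  }
}
```

The dataset points at data; the connection details themselves live in the linked service it references.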
A pipeline is a group of activities
Pipelines are used to group activities into a unit that performs a task together, and a data factory may have more than one pipeline. For instance, a pipeline could contain a group of activities that take data from Azure Blob Storage and then run a Hive query on an HDInsight cluster to partition the data.
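The Hive example above might be sketched as a pipeline with an HDInsight Hive activity. This assumes an HDInsight linked service and a Hive script already uploaded to storage; the names and script path below are hypothetical placeholders:

```json
{
  "name": "PartitionDataPipeline",
  "properties": {
    "activities": [
      {
        "name": "PartitionWithHive",
        "type": "HDInsightHive",
        "linkedServiceName": {
          "referenceName": "HDInsightLinkedService",
          "type": "LinkedServiceReference"
        },
        "typeProperties": {
          "scriptPath": "scripts/partitiondata.hql",
          "scriptLinkedService": {
            "referenceName": "StorageLinkedService",
            "type": "LinkedServiceReference"
          }
        }
      }
    ]
  }
}
```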
Defining activities and actions to perform on your data
Currently, Azure Data Factory supports two types of activities: data movement activities and data transformation activities.
Linked services - connecting Data Factory to external resources
Linked services define the connection information Data Factory needs to connect to external resources. For instance, an Azure Storage linked service specifies a connection string to connect to the Azure Storage account.
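The Azure Storage linked service mentioned above can be sketched as follows. The account name and key are placeholders to be filled in (or, better, retrieved from a secure store such as Azure Key Vault):

```json
{
  "name": "StorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    }
  }
}
```

Datasets then reference this linked service by name, so connection credentials are defined once and reused across pipelines.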