Building an Analytical System on Azure Data Lake Gen2

We live in the world of Big Data and Analytics. It is a fast-changing world in which new technologies keep emerging, and the cloud has accelerated that pace considerably.

Among cloud vendors, Microsoft Azure is a leading provider, with a variety of PaaS offerings for building big data and analytics applications. Azure Data Lake Store is one of them: a very popular offering for storing big data, i.e. data of varying volume, variety and velocity. Earlier we had Azure Data Lake Gen1. However, for additional security and features, Microsoft has come up with ADLS Gen2. Let us go through a few aspects of building an analytical system with the latter.

Deployment and Governance

Firstly, every analytical system, like any software system, entails deployment and governance, and the Azure ecosystem is no exception. To achieve this, a shell scripting language like PowerShell is preferred. Read this article on deployment and governance of ADLS Gen2 using PowerShell:

Managing Azure Data Lake Gen2 with Powershell

Migrating to Azure Data Lake Gen2: Azure Databricks

Secondly, we build an analytical system to perform analytics. As we know, Azure Data Lake is a storage system. However, unlike Azure Data Lake Gen1, ADLS Gen2 does not have Azure Data Lake Analytics. Hence, Azure Databricks is an alternative for building an analytical system on top of ADLS Gen2. However, there are nuances in connecting Azure Databricks to ADLS Gen2. Read the article below to know more about it:

Azure Data Lake Gen2 and Azure Databricks
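
As a rough illustration of those nuances, here is a minimal sketch of one common way to connect: OAuth with an Azure AD service principal, configured through the ABFS driver settings. It is an assumption that this is the method the linked article uses. The function name and all argument values are hypothetical placeholders.

```python
def adls_gen2_oauth_conf(storage_account, client_id, client_secret, tenant_id):
    """Return the ABFS driver settings for service-principal (OAuth) access.

    All four arguments are placeholders; in practice the secret would
    come from a secret store rather than from source code.
    """
    suffix = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


# In a Databricks notebook, where `spark` is predefined, the settings
# could be applied to the current session like this:
#
# for key, value in adls_gen2_oauth_conf(
#         "mydatalake", "<app-id>", "<secret>", "<tenant-id>").items():
#     spark.conf.set(key, value)
```

Keeping the settings in a plain dictionary makes them easy to inspect before they are applied to a session.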

We know that Azure Data Lake is a file system. Hence, we need tools to manipulate file systems. The article below details this using Databricks:

Azure Data Lake and Azure Databricks file systems.
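
To give a flavour of what file-system manipulation can look like, the sketch below builds the `abfss://` URI that addresses a path in an ADLS Gen2 container, and shows in comments how it might be passed to the Databricks `dbutils.fs` utilities. The helper name, container, account and paths are hypothetical, and the commented calls assume a notebook session that already has access to the storage account.

```python
def abfss_uri(container, storage_account, path=""):
    """Build an abfss:// URI for a path inside an ADLS Gen2 container."""
    base = f"abfss://{container}@{storage_account}.dfs.core.windows.net"
    clean = path.strip("/")  # tolerate leading/trailing slashes
    return f"{base}/{clean}" if clean else f"{base}/"


# In a Databricks notebook, where `dbutils` is predefined:
#
# for entry in dbutils.fs.ls(abfss_uri("raw", "mydatalake", "sales/2020")):
#     print(entry.path, entry.size)
#
# dbutils.fs.mkdirs(abfss_uri("raw", "mydatalake", "sales/archive"))
```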

Migrating to Azure Data Lake Gen2: Azure Data Factory

Finally, any analytical system includes an ETL/orchestration engine. In the Microsoft Azure ecosystem, Azure Data Factory (ADF) is that service.

As far as ADF is concerned, the key practical aspect is authentication. There are multiple ways to authenticate. However, with ADLS Gen2, we can authenticate ADF with a managed identity. Please note that the following article features Managed Identity using RBAC (role-based access control):

Managed Identity between Azure Data Factory and Azure storage
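
For orientation, here is a minimal sketch of what an ADF linked service definition for ADLS Gen2 might look like when it relies on the factory's managed identity. The definition carries only the storage endpoint and no key or secret; access then depends on an RBAC role (for example Storage Blob Data Reader or Storage Blob Data Contributor) being assigned to the factory's identity on the storage account. The function name, linked service name and account name are hypothetical, and the linked article may set this up differently.

```python
import json


def adls_gen2_linked_service(name, storage_account):
    """Return a linked service definition that stores no credentials,
    so ADF falls back to its managed identity (an assumption of this sketch)."""
    return {
        "name": name,
        "properties": {
            "type": "AzureBlobFS",  # ADF connector type for ADLS Gen2
            "typeProperties": {
                "url": f"https://{storage_account}.dfs.core.windows.net"
            },
        },
    }


# Print the JSON that could be pasted into the ADF authoring UI
print(json.dumps(
    adls_gen2_linked_service("AdlsGen2ViaManagedIdentity", "mydatalake"),
    indent=2))
```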

However, RBAC is disallowed in many organizations. Hence, the next article covers Access Control Lists (ACLs). Refer to it for more details on ACLs:

Azure Data Lake Gen2 Managed Identity using Access Control Lists
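
ADLS Gen2 ACLs follow a POSIX-style notation in which each entry has the form `[default:]type:id:permissions`. The sketch below builds such entries for a managed identity. The helper name and the object ID are hypothetical, and how the resulting string is applied (portal, CLI, SDK or PowerShell) is left open here.

```python
def acl_entry(kind, permissions, object_id="", default=False):
    """Format one ACL entry, e.g. 'user:<object-id>:r-x'.

    kind        -- 'user', 'group', 'mask' or 'other'
    permissions -- three characters such as 'rwx', 'r-x' or '---'
    object_id   -- Azure AD object ID; empty for the owning user/group
    default     -- True for a default entry inherited by new children
    """
    if kind not in {"user", "group", "mask", "other"}:
        raise ValueError(f"unknown ACL entry type: {kind}")
    if len(permissions) != 3 or any(
            ch not in (allowed, "-") for ch, allowed in zip(permissions, "rwx")):
        raise ValueError(f"invalid permission string: {permissions}")
    prefix = "default:" if default else ""
    return f"{prefix}{kind}:{object_id}:{permissions}"


# Hypothetical object ID of the Data Factory's managed identity
adf_identity = "00000000-0000-0000-0000-000000000000"

# Read + execute on a directory, plus a default entry for new children
acl = ",".join([
    acl_entry("user", "r-x", adf_identity),
    acl_entry("user", "r-x", adf_identity, default=True),
])
print(acl)
```

Note that with ACLs the identity also needs execute permission on every parent directory of the target path, which is easy to overlook.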

Conclusion

We hope that the article was useful. We don’t guarantee its completeness or accuracy. Hence, we advise reader discretion.



I am a Data Scientist with 6+ years of experience.

