Mount an Azure Data Lake Gen2 in Azure Databricks Using a Service Principal

Mounting an Azure Data Lake in Azure Databricks can be done in several ways, but the most secure and recommended approach is through a service principal whose credentials are stored in an Azure Key Vault. In my attempt to set that up, I found the information scattered across many different articles, so I decided to write this blog post to cover everything you need to get it up and running.

More specifically, I will discuss the following topics:

1- Create a secret scope in Azure Databricks that is backed by an Azure Key Vault instance.
2- Set up permissions for your service principal on the data lake account.
3- Store the credentials your service principal needs in the key vault.
4- Build a function to mount your data lake.

Assumptions:
This article assumes the following:
1- That you have already created a service principal and know its application (client) ID and client secret.
2- That you have already created the Data Lake Gen2, Key Vault, and Databricks resources.

1- Create a secret scope in Azure Databricks that is backed by an Azure Key Vault instance

First things first, we need to authenticate Azure Databricks to the Key Vault instance so that it is at least able to read/list the secrets in that key vault. To do that through the UI, you need to go to the following URL:
https://<databricks-instance>#secrets/createScope

See the example below of how this URL will look (please make sure that you use the Databricks instance that appears in the URL when you launch Databricks from the Azure portal):
https://adb-XXXXXXXXXXXXXXXX.X.azuredatabricks.net/#secrets/createScope
Please note that this URL is case sensitive: the "S" in createScope must be capitalized as shown above.
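Because that capital letter is easy to miss, a tiny helper like the one below (a hypothetical convenience, not part of any Databricks API) can build the URL from your workspace hostname:

```python
def create_scope_url(databricks_instance):
    """Build the case-sensitive secret-scope creation URL.

    databricks_instance: the workspace host, e.g.
    'adb-1234567890123456.7.azuredatabricks.net'
    """
    # The capital 'S' in 'createScope' matters; a lowercase 's' will not work.
    return f"https://{databricks_instance}#secrets/createScope"

print(create_scope_url("adb-1234567890123456.7.azuredatabricks.net"))
# → https://adb-1234567890123456.7.azuredatabricks.net#secrets/createScope
```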

Once you launch this URL you will see a screen as shown below. You will need to enter the following information:
– The scope name (any name of your choice). Let’s call it key-vault-secret-scope for now. Remember this name, as we will use it later in our mounting function.
– The DNS Name and Resource ID of your Azure Key Vault instance.

The DNS Name and Resource ID are available under the Properties window of your Azure Key Vault resource. From there, use the Vault URI as your DNS Name and the Resource ID as your Resource ID, as shown below:

You can verify that this succeeded by going to your Key Vault instance –> Access policies –> and checking that the AzureDatabricks application has been added with Get/List permissions granted for secrets. (In an earlier version, this setup granted Databricks 16 extra permissions that it didn’t really need, so make sure you only grant it Get/List permissions on secrets, or more only if required.)
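If you prefer the command line over the UI, the legacy Databricks CLI (v0.x) can create the same Key Vault-backed scope. This is a sketch only; the subscription ID, resource group, and vault name below are placeholders you must replace with your own, and flag names may differ in newer CLI versions:

```shell
# Create a secret scope backed by Azure Key Vault (legacy Databricks CLI v0.x)
databricks secrets create-scope --scope key-vault-secret-scope \
  --scope-backend-type AZURE_KEYVAULT \
  --resource-id "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.KeyVault/vaults/<vault-name>" \
  --dns-name "https://<vault-name>.vault.azure.net/"
```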

2- Set up permissions for your service principal on the data lake account

Here there are also two ways of doing this: either through role-based access control (RBAC) or through access control lists (ACLs). I will discuss RBAC for the sake of simplicity.

Ideally, you want to grant your service principal the ability to read/write/delete data on the data lake storage account. For that, you can grant it the Storage Blob Data Contributor role, which should give it enough permissions to manipulate data and containers in the storage account. This can be done by going to your storage account –> Access Control (IAM) –> select Add –> Add role assignment –> select the Storage Blob Data Contributor role, and then select the service principal that you need to grant access to.
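The same role assignment can also be scripted with the Azure CLI. This is a sketch under the assumption that you know your subscription ID and resource group; every angle-bracketed value is a placeholder:

```shell
# Grant the service principal Storage Blob Data Contributor on the storage account
az role assignment create \
  --assignee "<service-principal-client-id>" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account-name>"
```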

3- Store the credentials for your service principal in a key vault.

To avoid hardcoding your service principal information in Databricks code, it is highly recommended that you store that information in your key vault. This is particularly important for your service principal’s client secret. So for the purpose of this exercise, let’s create a secret called sp-appid to store the application ID and a secret called sp-secret to store the service principal’s client secret. Your key vault would look something like this:
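You can add both secrets from the Azure portal, or with the Azure CLI as sketched below; the vault name and the two secret values are placeholders for your own:

```shell
# Store the service principal's application ID and client secret in Key Vault
az keyvault secret set --vault-name "<vault-name>" --name "sp-appid" --value "<application-client-id>"
az keyvault secret set --vault-name "<vault-name>" --name "sp-secret" --value "<client-secret>"
```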

4- Build a function to mount your data lake.

Finally, we are here. Once you have all of that set up, you should be ready for the Databricks mounting function. Your function should look like this:

def mount_data_lake(mount_path,
                    container_name,
                    storage_account_name,
                    service_principal_client_id,
                    azure_ad_directory_id,
                    service_principal_client_secret_key):
  """
  mount_path: the path that you want to mount to. For example, /mnt/mydata
  container_name: name of the container on the data lake.
  storage_account_name: name of the storage account that you want to mount.
  service_principal_client_id: application (client) id for the service principal.
  azure_ad_directory_id: tenant id.
  service_principal_client_secret_key: the client secret for your service principal.
  """
  configs = {"fs.azure.account.auth.type": "OAuth",
             "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
             "fs.azure.account.oauth2.client.id": service_principal_client_id,
             "fs.azure.account.oauth2.client.secret": service_principal_client_secret_key,
             "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/" + azure_ad_directory_id + "/oauth2/token"}
  try:
    storage_account_url = "abfss://" + container_name + "@" + storage_account_name + ".dfs.core.windows.net/"
    dbutils.fs.mount(
      source = storage_account_url,
      mount_point = mount_path,
      extra_configs = configs
    )
  except Exception as ex:
    if 'already mounted' in str(ex.args):
      print(f"Mount: {mount_path} is already mounted")
    else:
      raise Exception(f"Unable to mount {mount_path} on {storage_account_name}. Error message: {ex}")

And now you can simply call your function on any storage account/mount point that you like. For example:

service_principal_client_id = dbutils.secrets.get(scope="key-vault-secret-scope", key="sp-appid")
service_principal_client_secret_key = dbutils.secrets.get(scope="key-vault-secret-scope", key="sp-secret")
azure_ad_directory_id = "your tenant id"
storage_account_name = "name of your storage account"
container_name = "name of your container"
mount_path = "your mount point"

mount_data_lake(mount_path,
                container_name,
                storage_account_name,
                service_principal_client_id,
                azure_ad_directory_id,
                service_principal_client_secret_key)

Well, that’s it. Congrats! You are ready to start reading from and writing to your Data Lake Gen2 using this mount point. For example, if you named your mount point /mnt/mydata, and under the container that you mounted there is a folder called MyFolder containing a file called MyFile.csv, then you can read the CSV file like this:

file_path = '/mnt/mydata/MyFolder/MyFile.csv'
mydata = spark.read.format('csv').options(header = 'true', inferSchema = 'true').load(file_path)

Hope this helps. Please let me know in the comments below if you have any questions.
