Mounting an Azure Data Lake in Azure Databricks can be done in several ways, but the most secure and recommended approach is to use a service principal whose credentials are stored in Azure Key Vault. When I tried to set that up, I found the information scattered across many different articles, so I decided to write this blog post covering everything you need to get it up and running.
More specifically, I will discuss the following topics:
1- Create a secret scope in Azure Databricks that is backed by an Azure Key Vault instance.
2- Set up permissions for your service principal on the data lake account.
3- Store credentials necessary for your service principal in a key vault.
4- Build a function to mount your data lake.
Assumptions: This article assumes the following:
1- You have already created a service principal and know its application (client) id and secret key.
2- You have already created the Data Lake Gen2, Key Vault and Databricks resources.
1- Create a secret scope in Azure Databricks that is backed by an Azure Key Vault instance
First things first, we need to authorize Azure Databricks against the key vault instance so that it is at least able to get/list the secrets in that key vault. To do that through the UI, go to the following URL: https://<databricks-instance>#secrets/createScope
Here is an example of what this URL looks like (make sure you use your own Databricks instance, the one that appears in the URL when you launch Databricks from the Azure portal): https://adb-XXXXXXXXXXXXXXXX.X.azuredatabricks.net/#secrets/createScope Note that this URL is case sensitive: the S in Scope must be capitalized as shown above.
Once you open this URL you will see a screen as shown below. You will need to enter the following information: – The scope name (any name of your choice). Let’s call it key-vault-secret-scope for now; remember this name as we will use it later in our mounting function. – The DNS Name and Resource ID of your Azure Key Vault instance.
The DNS Name and Resource ID are available under the Properties window of your Azure Key Vault resource. From there, use the Vault URI as the DNS Name and the Resource ID as the Resource ID, as shown below:
You can verify that this worked by going to your key vault instance –> Access policies –> and confirming that the AzureDatabricks application is added with Get/List permissions granted for secrets (in an earlier version, this step granted Databricks 16 extra permissions that it didn’t really need, so make sure you only grant it Get/List permissions on secrets, or more than that only if required).
2- Set up permissions for your service principal on the data lake account
There are also two ways of doing this: it can be set up either through role-based access control (RBAC) or through an access control list (ACL). I will discuss RBAC for the sake of simplicity.
Ideally, you want to grant your service principal the ability to read/write/delete data on the data lake storage account. For that, you can assign it the Storage Blob Data Contributor role, which gives it enough permissions to manipulate data and containers in the storage account. This can be done by going to your storage account –> Access Control (IAM) –> Add –> Add role assignment –> select the Storage Blob Data Contributor role and then select the service principal that you need to grant access to.
3- Store credentials necessary for your service principal in a key vault.
To avoid hardcoding your service principal information in Databricks code, it is highly recommended that you store that information in your key vault. This is particularly important for the client secret of your service principal. So, for the purpose of this exercise, let’s create a secret called sp-appid to store the application id and a secret called sp-secret to store the service principal client secret. Your key vault would look something like this:
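Once both secrets are in place (and the secret scope from step 1 exists), you can sanity-check the wiring from a notebook. This is just an illustrative snippet; key-vault-secret-scope, sp-appid and sp-secret are the example names used in this post:
# List the secret names visible through the key-vault-backed scope
for secret in dbutils.secrets.list("key-vault-secret-scope"):
    print(secret.key)

# Fetch one of the secrets (the value is redacted if you try to print it in a notebook)
app_id = dbutils.secrets.get(scope="key-vault-secret-scope", key="sp-appid")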
4- Build a function to mount your data lake.
Finally, we are here. Once you have all of that set up, you should be ready for the Databricks function. Your function should look like this:
def mount_data_lake(mount_path,
                    container_name,
                    storage_account_name,
                    service_principal_client_id,
                    azure_ad_directory_id,
                    service_principal_client_secret_key):
    """
    mount_path: the path that you want to mount to. For example, /mnt/mydata
    container_name: name of the container on the data lake.
    storage_account_name: name of the storage account that you want to mount.
    service_principal_client_id: application (client) id of the service principal
    azure_ad_directory_id: tenant id
    service_principal_client_secret_key: the secret key of your service principal
    """
    configs = {"fs.azure.account.auth.type": "OAuth",
               "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
               "fs.azure.account.oauth2.client.id": service_principal_client_id,
               "fs.azure.account.oauth2.client.secret": service_principal_client_secret_key,
               "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/" + azure_ad_directory_id + "/oauth2/token"}
    try:
        storage_account_url = "abfss://" + container_name + "@" + storage_account_name + ".dfs.core.windows.net/"
        dbutils.fs.mount(
            source = storage_account_url,
            mount_point = mount_path,
            extra_configs = configs
        )
    except Exception as ex:
        if 'already mounted' in str(ex.args):
            print(f"Mount: {mount_path} is already mounted")
        else:
            raise Exception(f"Unable to mount {mount_path} on {storage_account_name}. Error message: {ex}")
And now you can simply call your function on any storage account/mount point that you like. For example:
service_principal_client_id = dbutils.secrets.get(scope="key-vault-secret-scope", key="sp-appid")
service_principal_client_secret_key = dbutils.secrets.get(scope="key-vault-secret-scope", key="sp-secret")
azure_ad_directory_id = "your tenant id"
storage_account_name = "name of your storage account"
container_name = "name of your container"
mount_path = "your mount point"
mount_data_lake(mount_path,
                container_name,
                storage_account_name,
                service_principal_client_id,
                azure_ad_directory_id,
                service_principal_client_secret_key)
Well, that’s it. Congrats! You are ready to start reading from and writing to your Data Lake Gen2 through this mount point. For example, if you named your mount point /mnt/mydata, and the container you mounted has a folder called MyFolder containing a file called MyFile.csv, then you can read the csv file like this:
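A minimal sketch of that read, assuming the file has a header row (adjust the options to match your data):
# Read the csv file from the mounted path into a Spark DataFrame
df = spark.read.csv("/mnt/mydata/MyFolder/MyFile.csv", header=True, inferSchema=True)
df.show(5)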
As discussed in this article by Databricks, while working in a notebook you can mount a Blob Storage container, or a folder inside a container, to the Databricks File System. The whole point of mounting a blob storage container is simply to use an abbreviated link to your data through the Databricks File System rather than having to refer to the full URL of your blob container every time you need to read or write data from it. More details on mounting and its usage can be found in the articles referenced above.
The purpose of this article is to suggest a way, using Python, to check whether a mountpoint has already been created and only attempt to create it if it doesn't exist.
This can simply be done if we know how to list existing mountpoints using Python. Luckily, Databricks offers this through the dbutils.fs.mounts() command. To access the actual mountpoints we can do something like this:
for mount in dbutils.fs.mounts():
    print(mount.mountPoint)
Knowing how to access mountpoints enables us to write some Python syntax to only mount if the mountpoint doesn’t exist. The code should look like the following:
storageAccountName = "your storage account name"
storageAccountAccessKey = "your storage account access key"
blobContainerName = "your blob container name"
if not any(mount.mountPoint == '/mnt/FileStore/MountFolder/' for mount in dbutils.fs.mounts()):
    try:
        dbutils.fs.mount(
            source = "wasbs://{}@{}.blob.core.windows.net".format(blobContainerName, storageAccountName),
            mount_point = "/mnt/FileStore/MountFolder/",
            extra_configs = {'fs.azure.account.key.' + storageAccountName + '.blob.core.windows.net': storageAccountAccessKey}
        )
    except Exception as e:
        print("already mounted. Try to unmount first")
The try/except block above acts as an error handler: if the blob container is somehow mounted already, it simply prints a message telling you to unmount first instead of failing the notebook.
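If you ever do need to refresh a mount (for example after rotating the storage account access key), a small hedged sketch of that cleanup, using the same example mount point, would be:
# Unmount the existing mount point before mounting it again with new settings
if any(mount.mountPoint == '/mnt/FileStore/MountFolder/' for mount in dbutils.fs.mounts()):
    dbutils.fs.unmount('/mnt/FileStore/MountFolder/')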
When you have SaaS systems and custom systems all over the place in your organization, there is a need for unification. The more systems you accumulate over a long period of time, the less standardization you have among them; different technologies and different architectural styles only add to that. If those systems expose any kind of API that is needed internally (by your developers) or externally (by your customers), then giving those APIs a consistent look and feel, along with a set of unified policies, becomes important.
Azure API Management (APIM) is one solution to this problem. APIM is an Azure resource that you can provision and have it sit between your API consumers and the APIs exposed by your systems.
My focus here is on one example: the D365 Web API. The first question that comes to mind is: why do we need to expose the D365 API through APIM when the API is already modern and well documented? Here are a few reasons:
The Common Data Service (CDS) has API limits. If you have systems that read data from your Dynamics 365 instance through APIM, you can cache the results and save on some API calls. APIM comes equipped with a built-in cache, but if you need a bigger cache you can attach an external cache system like Redis.
Because of the imposed per-user CDS limits, with APIM you can limit the number of API calls made by your consumer systems. As of this writing, Microsoft allows 25,000, 50,000 or 100,000 API calls per application or non-interactive user depending on your licencing model (see details here).
With APIM, you can set up authentication to your D365 Web API using an Azure AD application instead of a licensed user. Anyone who calls the API doesn’t need a user set up in D365 or a security role assigned, which makes user management easier. This ties back to number 2 above: the API calls here don’t count against a user quota because we are not authenticating as a licensed user. The type of user we will use is called an Application User, and this type doesn’t need a licence.
If the systems that talk to the D365 Web API expect results in XML format, you can transform the JSON output of the D365 API into XML. Other transformations are also supported by APIM.
APIM provides many other policies that you can put in place between your consumers and the D365 API; to mention a few, you can change headers, add more data to the request/response, and so on.
The D365 API is extensive, and it is time consuming for its consumers to learn it quickly. In APIM, you can expose endpoints only for what your API consumers actually need.
Analytics on who is calling your APIs and how many requests each endpoint receives.
You can package your API endpoints into products (groups of APIs that serve some defined purpose), and you can provide your consumers with subscription keys to track who is calling which API endpoint.
To this point, we haven’t done any real work. In summary, here is what we want to do:
Provision an APIM instance.
Create a simple API in APIM that calls our D365 Web API.
Setup the Authentication between APIM and D365 Web API using Azure AD and without consuming a Dynamics licence.
Add a send-request policy for token generation to implicitly obtain a token and send it to D365 API.
Call the APIs from APIM.
Provisioning an APIM Instance in Azure
Azure APIM comes in different pricing tiers. In this blog I opted for the Developer tier, but the steps below should work on all other tiers including the Consumption tier. Head to Create a resource in Azure, search for API Management and create it as shown below. The name needs to be globally unique. With the Developer tier, expect a wait time of at least 30 minutes for the resource to provision; if you want much faster provisioning, select the Consumption tier. Once you provision the APIM instance, its gateway will be accessible at https://{Name of your APIM resource}.azure-api.net/, and this URL will serve as the base URL for all the APIs that sit behind this APIM resource.
Create A simple APIM API that Calls D365 API
We can create APIs in APIM in different ways. If your API has an OpenAPI definition (previously called Swagger), you can import that file and APIM will create the operations for you. Unfortunately, with the Dynamics API we don’t have that luxury, so we need to create the API manually, starting from a blank API.
In your provisioned instance, click APIs on the left navigation, and click on Blank API.
In the dialog that appears, fill in a friendly name for your API and add the base URL of the Dynamics API. If you have many APIs that you want to hide behind one APIM resource, give each API a URL suffix; in my case, I call it d365api. So now, calling https://{your crm org}.crm3.dynamics.com/api/data/v9.1/ is equivalent to calling your APIM instance at https://apimd365.azure-api.net/d365api
Once an API is created, we need to add operations to it. Assume that the ESRI maps team at your organization wants to draw all of your contacts and accounts on a map and needs their data. You can easily expose two operations like this:
The way to do this is by adding an operation as shown below; you need to specify the path of each operation. Because Dynamics 365 exposes a REST API, you can get all the contacts by appending “contacts” to the base URL, and the same goes for accounts. Of course, if you only want to return specific data about those contacts or accounts, you can pass OData filters in the query part of the operation being created.
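Using the example names from this post (the apimd365 instance and the d365api suffix), the two operations end up looking roughly like this:
GET https://apimd365.azure-api.net/d365api/contacts (forwarded to https://{your crm org}.crm3.dynamics.com/api/data/v9.1/contacts)
GET https://apimd365.azure-api.net/d365api/accounts (forwarded to https://{your crm org}.crm3.dynamics.com/api/data/v9.1/accounts)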
At this point, we have a fully non-functional API :). The reason is that Dynamics 365 still doesn’t know about APIM and doesn’t trust any request that comes from it. Next, let’s see how we can implement server-to-server authentication without any intervention from the API caller.
Set the Authentication between APIM and D365
Note: Application and Client are used interchangeably in Azure AD terminology
APIM is very flexible from the API authentication/authorization point of view. It allows you to add an OAuth 2.0 or OpenID Connect authorization server as part of its configuration to obtain access tokens that you can later supply to your API calls. As you may know, the D365 API uses OAuth 2.0 to authenticate: you call an authorization endpoint that provides you with an access token, and then you pass this token in a header called “Authorization” with every call you make to the D365 API. Here we have two options: the first is letting the APIM consumers be responsible for calling the authorization endpoint, getting the token and providing it as a header; the second is making their life easy and abstracting the authorization away from them completely. I prefer the latter option because it is less of a headache for the end users of your APIM.
In the case we are tackling now, I have my Dynamics 365 and my APIM in the same tenant, which means they both operate under the same Azure Active Directory. To let APIM authenticate with the Dynamics API, we need to create an App Registration in Azure AD, give it permission to access Dynamics 365 and let the Dynamics 365 organization know about it. This App Registration is like a user identity that APIM uses to authenticate with the D365 API.
To create an Application Registration in Azure AD, click on App Registration, New Registration.
Give the Application Registration a friendly name and select Web as its type. In the redirect URL, fill in any value as this is not important, for example, use https://localhost
Take a note of the application ID and the tenant ID as we will need them later.
To complete the credentials, we need to generate a secret that will act as the password for the app registration. In a production environment, store this secret in a key vault; but for now, just create a secret, set its expiry and copy its value (this is your only chance to copy the secret, after that it will be masked forever; if you lose it, you need to generate a new one).
Our App Registration is created, but two things are missing: the permission to access the Dynamics API from this app registration, and letting the Dynamics system know about this app registration. To give the app registration the permission, click on API permissions and add Dynamics CRM.
Select the User_Impersonation permission and click Add Permission.
To make Dynamics 365 aware of this app registration, you need to create what is called an Application User (you need to be a Dynamics administrator to do this). The Application User is the representative inside Dynamics 365 that corresponds to the App Registration. Navigate to Settings->Security->Users, switch to the Application Users view and click New User. You only need the Application ID of the App Registration that you collected earlier. Fill in the other required fields and save; if everything is successful, the Application ID URI and the Object ID should auto-populate. Give this user a security role that grants access only to the data your API consumers need and nothing more. For our example to work, grant at least read access on the account and contact entities.
The last piece of information we need from Azure AD is the token generation URL; this URL is what APIM will call to get a token that will be used to authorize requests to the D365 API. In the Overview section of either Azure AD or your app registration, click on Endpoints in the command bar and copy the token endpoint URL.
By now, you should have the application ID, tenant ID, secret value and the token generation URL.
Add a send-request Policy For Token Generation
To make things easy for the APIM consumers, we want to implicitly authenticate with the D365 API using the App Registration and Azure AD. One of the most powerful features of APIM is the ability to do almost anything between receiving a call and forwarding it to the back-end service (the D365 API). In our case, we need to issue a call to the token generation URL with the proper App Registration credentials, get a token, and insert it into a header called “Authorization”.
Click on APIs, select the API we created above, click on All operations, and in the Inbound processing designer click on the little edit icon. The reason we apply this to all operations is that D365 requires a token on every operation, so we do it once here and it applies to each individual operation after that.
By default, your inbound processing looks like this, which means: do nothing on inbound calls, nothing at the back-end, and nothing with the outbound results. In our case, we want to add a request to the inbound stage that gets us a token.
Policies in APIM are very flexible and you can do many things with them to manipulate the call pipeline. In our case, we want to send a request to the token URL, and that’s done by a send-request policy. You will need most of the values you collected earlier in the App Registration step. Notice that the send-request policy stores the result in a variable called bearerToken, and just after the send-request policy, we use a set-header policy that creates a header called “Authorization” and populates it with the value of the access token. (Note: the send-request returns a bearerToken response object that, among other things, contains the access token that needs to go in the Authorization header; that explains the casting logic you see in the set-header policy.)
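Conceptually, the send-request step performs a standard OAuth 2.0 client-credentials call against the token generation URL, and the set-header step attaches the returned access token. As an illustration only, here is a rough Python sketch of that same exchange (the values in braces are placeholders for the ones collected in the App Registration step):
import requests

# Client-credentials token request: the same call the send-request policy issues
token_url = "https://login.microsoftonline.com/{tenant-id}/oauth2/token"
payload = {
    "grant_type": "client_credentials",
    "client_id": "{application-id}",
    "client_secret": "{secret-value}",
    "resource": "https://{your crm org}.crm3.dynamics.com",  # the D365 org the token is issued for
}
access_token = requests.post(token_url, data=payload).json()["access_token"]

# The set-header policy then forwards the call to D365 with this Authorization header
headers = {"Authorization": "Bearer " + access_token}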
Now is when the hard work “should” pay off. Open a browser tab (since we are doing a GET request, a browser should be enough; otherwise, use Postman) and paste in your operation’s URL, which with the example names used above would be something like https://apimd365.azure-api.net/d365api/contacts
If you see the list of contacts returned, then you have done everything right. If you don’t, check the returned error; most of the time it is security related, and a review of all the values in the app registration and the send-request policy is a good place to start debugging.
Subscription Keys and Products
APIM comes equipped with two powerful features: Products and Subscriptions. Products are a way to group specific APIs together so they can be managed as a unit. Subscriptions are basically a method of tracking API consumers by asking them to provide a subscription key with each request. You can control API limits and analyse usage by subscription key; rate limiting and throttling policies are very common policies in APIM (more on that here). This adds a simple layer of security and tracking capability to your APIM; for example, our call to get contacts would now look something like https://apimd365.azure-api.net/d365api/contacts?subscription-key={your subscription key}
Each subscription has a primary and a secondary key that you can provide to your APIM consumers, and if a key is compromised, you can regenerate it. Also, subscription keys can be passed as a header if you don’t want to expose them in the URL.
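As a hedged illustration of the header option (Ocp-Apim-Subscription-Key is APIM’s default subscription header name, and the URL reuses the example names from this post):
import requests

# Call the APIM-fronted contacts operation, passing the subscription key as a header
response = requests.get(
    "https://apimd365.azure-api.net/d365api/contacts",
    headers={"Ocp-Apim-Subscription-Key": "{your subscription key}"},
)
print(response.status_code)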
Summary
We have created an APIM resource that sits between your D365 API callers and the API itself. APIM provides a lot of control over what happens to a request over its lifetime. We showed the importance of authenticating/authorizing using an app registration and an application user, and how this approach saves us from consuming the CDS API limits of licensed users. We also saw how we can use policies to politely hijack the request, issue a call to the token endpoint, and set the Authorization header from its response. This post is in no way an exhaustive treatment of APIM; to learn more about its wide feature set, visit the official documentation here.