Let’s say that you have a CSV file, a blob container, and access to a Databricks workspace. The purpose of this mini blog is to show how easy the process is from having a file on your local computer to reading the data in Databricks. I will walk through uploading the CSV file manually to an Azure blob container and then reading it in Databricks using Python code.
Step 1: Upload the file to your blob container
This can be done simply by navigating to your blob container. From there, click the upload button and select the file you are interested in. Once selected, click the upload button in the upload blade. See the screenshot below.
Once uploaded, you will be able to see the file available in your blob container as shown below:
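As an aside, if you would rather script the upload than click through the portal, the azure-storage-blob Python package can do the same thing. Below is a minimal sketch, assuming the package is installed (pip install azure-storage-blob) and reusing the same placeholder account, key, container, and file names as the rest of this post:

from azure.storage.blob import BlobServiceClient

# Authenticate against the storage account with its URL and access key (see Step 2)
account_url = 'https://nameofyourstorageaccount.blob.core.windows.net'
service = BlobServiceClient(account_url=account_url, credential='thekeyfortheblobcontainer')

# Upload the local CSV to Sales/SalesFile.csv inside the container
blob_client = service.get_blob_client(container='yourblobcontainername', blob='Sales/SalesFile.csv')
with open('SalesFile.csv', 'rb') as data:
    blob_client.upload_blob(data, overwrite=True)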
Step 2: Get the credentials Databricks needs to connect to your blob container
From the Azure portal, navigate to All resources, select your blob storage account, and under Settings select Access keys. Once there, copy the key under key1 to a local notepad.
Step 3: Configure Databricks to read the file
Here, you need to navigate to your Databricks workspace (create one if you don’t have one already) and launch it. Once launched, go to Workspace and create a new Python notebook.
To start reading the data, you first need to configure your Spark session to use the credentials for your blob container. This can be done simply through the spark.conf.set command. More precisely, we start with the following:
# Replace these placeholder values with your own storage account name and access key
storage_account_name = 'nameofyourstorageaccount'
storage_account_access_key = 'thekeyfortheblobcontainer'
# Tell Spark how to authenticate against the storage account
spark.conf.set('fs.azure.account.key.' + storage_account_name + '.blob.core.windows.net', storage_account_access_key)
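A quick side note: pasting the raw key into a notebook is fine for a one-off test, but for anything shared you may prefer Databricks secrets. Here is a minimal sketch, assuming you have already created a secret scope; the scope and key names below are hypothetical:

# Hypothetical secret scope/key names -- create them first with the Databricks CLI
storage_account_access_key = dbutils.secrets.get(scope='blob-credentials', key='storage-account-key')
spark.conf.set('fs.azure.account.key.' + storage_account_name + '.blob.core.windows.net', storage_account_access_key)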
Once done, we need to build the file path in the blob container and read the file as a Spark DataFrame.
blob_container = 'yourblobcontainername'
# wasbs paths have the form wasbs://<container>@<account>.blob.core.windows.net/<path>
filePath = "wasbs://" + blob_container + "@" + storage_account_name + ".blob.core.windows.net/Sales/SalesFile.csv"
# Read the CSV into a Spark DataFrame, inferring column types and treating the first row as a header
salesDf = spark.read.format("csv").load(filePath, inferSchema=True, header=True)
And congrats, we are done. You can use the display command to take a sneak peek at the data, as shown below.
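For example, the cell below renders the DataFrame as an interactive table (display is a Databricks notebook utility) and prints the schema that inferSchema derived:

# Render the DataFrame as an interactive table in the notebook
display(salesDf)
# Print the column names and inferred types
salesDf.printSchema()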