Microsoft Fabric's underlying storage architecture is known as OneLake: a unified data foundation that consolidates an organization's entire data estate in a single data lake. OneLake can also incorporate data that isn't physically stored in OneLake through the use of OneLake Shortcuts.

Data stored in other public clouds, such as AWS S3 buckets, isn't physically stored in Azure OneLake, but it can be accessed logically by creating a shortcut within a Fabric Lakehouse, where the data is read from the external source on demand.

In this post we'll do the following:

  • Create an AWS S3 Bucket
  • Add data to the S3 bucket
  • Create a shortcut to access the S3 data from a Fabric Lakehouse

Our End Goal

When we're finished with the configuration, we'll be able to access data from the AWS S3 bucket as if it were part of the Fabric Lakehouse. Here's an image of our end goal:

Image of S3 shortcut added to Lakehouse

With that end in mind, let's get started!

💡
Note: this is a walk-through of the S3 shortcut process. While the data integration uses AWS and Azure security features, check with your Fabric and AWS administrators to ensure your own organization's best practices and policies are applied.

Create an S3 Bucket

In this post I'll create a new bucket from scratch in AWS. Head over to the S3 section of the AWS console and click the link to create a new bucket.

I'll give the bucket a unique name, in this case rhk-shared-bucket. I'll leave all other configuration options as defaults.

💡
It's not necessary to enable any public access to the bucket, since we'll be creating a service account in AWS IAM to access data from Fabric using an access key.
Creating a bucket
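
If you'd rather script this step than click through the console, here's a minimal sketch using boto3, the AWS SDK for Python. The region is an assumption; substitute your own.

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Create the bucket. In us-east-1 no location configuration is needed;
# in any other region, also pass
# CreateBucketConfiguration={"LocationConstraint": "<your-region>"}.
s3.create_bucket(Bucket="rhk-shared-bucket")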

Upload Data to the S3 Bucket

The bucket wouldn't be that useful without some data, so I'll create a folder in the bucket to hold it. In this hypothetical scenario I'll pretend I have a website running on AWS that drops user review files into the bucket, and I'll ingest them into the Fabric data lake using Spark DataFrames and Python.

After uploading files to a folder in the bucket, I can see them from the AWS S3 console:

Files uploaded to bucket folder
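
Uploads can also be scripted. Here's a minimal boto3 sketch, where the local filename and the web_reviews folder (key prefix) are hypothetical examples:

import boto3

s3 = boto3.client("s3")

# Upload a local review file into a folder (key prefix) inside the bucket.
s3.upload_file(
    Filename="web_reviews_2024_01.json",
    Bucket="rhk-shared-bucket",
    Key="web_reviews/web_reviews_2024_01.json",
)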

Create a Service Account for Fabric

So far the data is in the S3 bucket, but only AWS administrators can read the files. We want Fabric to read the data without leaving the bucket open to the Internet, so let's create a service account for Fabric.

Head over to the AWS IAM Service and create a new user. The user doesn't need any permissions or roles assigned at this time--we'll use a bucket policy to authorize this user to read the S3 bucket later.

When creating a shortcut to the S3 bucket from Fabric, we will need to provide a user access key and secret key.

Click on the Create access key link on the new IAM user's summary page to create an access key.

Select the application running outside AWS use case, then press Next.

Next, provide a description for the key (for reference later).

After the access key and secret access key are created, copy them and save them somewhere secure. These are the two strings we'll need to provide to Fabric when configuring the shortcut.
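
The console steps above can also be scripted. A minimal boto3 sketch, where the user name fabric-s3-reader is an assumption:

import boto3

iam = boto3.client("iam")

# Create the service account. No permissions are attached here; the bucket
# policy we add in the next step grants read access.
iam.create_user(UserName="fabric-s3-reader")

# Create the access key pair. The secret is returned only once, so save
# both values somewhere secure.
resp = iam.create_access_key(UserName="fabric-s3-reader")
access_key = resp["AccessKey"]["AccessKeyId"]
secret_key = resp["AccessKey"]["SecretAccessKey"]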

Add S3 Permissions for the IAM User

Fabric will use the access and secret keys to authenticate to AWS. We need to ensure that the IAM user has permissions to read content from the S3 bucket. In AWS, this is done by adding a bucket policy to the S3 bucket.

💡
Fabric requires GetObject, GetBucketLocation and ListBucket permissions to the target S3 bucket. 

From the S3 bucket permissions tab, create a bucket policy similar to the following:

{
    "Version": "2012-10-17",
    "Id": "<a unique ID for the policy>",
    "Statement": [
        {
            "Sid": "<a unique statement id>",
            "Effect": "Allow",
            "Principal": {
                "AWS": "<get the user's ARN from IAM>"
            },
            "Action": [
                "s3:GetObject",
                "s3:GetBucketLocation",
                "s3:ListBucket"
            ],
            "Resource": [
                "<this bucket's ARN>",
                "<this bucket's ARN>/*"
            ]
        }
    ]
}
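
The policy can also be applied programmatically. Here's a boto3 sketch of the same template, where the statement id, account number, and user name are hypothetical placeholders to replace with your own values:

import json
import boto3

s3 = boto3.client("s3")

# Same policy as the template above, with hypothetical placeholder values.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowFabricRead",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:user/fabric-s3-reader"},
            "Action": ["s3:GetObject", "s3:GetBucketLocation", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::rhk-shared-bucket",
                "arn:aws:s3:::rhk-shared-bucket/*",
            ],
        }
    ],
}

s3.put_bucket_policy(Bucket="rhk-shared-bucket", Policy=json.dumps(policy))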

Look Up the S3 URL

When configuring Fabric, we'll need the Internet URL for our bucket. There are multiple ways to discover the URL, but I find a convenient one is to copy it from the properties page of a file stored in the bucket.

Copy the bucket Internet-routable URL for use in Fabric later

Find the URL and save it along with the secrets created in the last step.
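
If you'd rather construct the URL than copy it, S3 buckets follow the virtual-hosted-style URL pattern. A quick sketch, where the region is an assumption:

# Virtual-hosted-style S3 URL: https://<bucket>.s3.<region>.amazonaws.com
bucket = "rhk-shared-bucket"
region = "us-east-1"  # assumption: use your bucket's actual region
print(f"https://{bucket}.s3.{region}.amazonaws.com")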

Now that we have data in the bucket, know the bucket's base URL, and have the access and secret keys for the IAM account used to access the bucket's contents, we can head over to Fabric to create the shortcut!

Add a Shortcut to the S3 Files

Adding the shortcut to Fabric is simple from this point:

  1. Select the menu option to add a shortcut to the Files folder in a lakehouse
  2. Provide the S3 bucket URL
  3. Provide the IAM user access key and secret key
  4. Specify the folder within the bucket that the shortcut should point to, and give the shortcut a name

Select the Add Shortcut Menu Command

Find the shortcut command on the context menu of the Files folder in the Lakehouse.

Click the New Shortcut menu item

Select Amazon S3 from the External Sources menu

External Source Selection

Provide Connection Details

Next we need to create a new connection profile for the S3 bucket. Paste in the URL and keys created in AWS, and provide a connection name to use for future reference.

💡
Providing authentication information is a one-time task. Fabric stores the connection information for the bucket, so additional shortcuts to other folders in the bucket can re-use this connection profile. If the security information changes, the connection details can be updated later from Fabric's connection settings.
Enter S3 and IAM connection information into Fabric

Provide the Shortcut Name and S3 Folder Source

The final step in creating the shortcut is to provide a logical name to display in the Lakehouse browser window, as well as the folder in S3 the shortcut should point to.

💡
A shortcut fundamentally maps a virtual folder in the Lakehouse to a physical folder in the S3 bucket.
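
For example, the shortcut created here, named s3_shared_web_reviews, surfaces the bucket folder's files in the Lakehouse under Files/s3_shared_web_reviews, which is exactly the path we'll read from in the test below.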

After the shortcut appears in the file explorer, we can click on it to see the list of files in the S3 bucket.

💡
The files in the S3 bucket aren't moved from AWS to Azure storage. Fabric will read the files from AWS on demand. Keep in mind this will incur egress charges on the AWS side as files are transmitted between clouds.

Test by Reading the Folder Contents into a DataFrame

To test that the shortcut works correctly, we can read the folder contents into a Spark DataFrame.

# Read every JSON file behind the shortcut into a Spark DataFrame.
# The "Files/" prefix resolves to the Lakehouse Files folder, where the
# shortcut lives.
df = spark.read.option("multiline", "true") \
    .json("Files/s3_shared_web_reviews/*")

display(df)
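
Once the DataFrame loads, a couple of standard Spark calls make for a quick sanity check that the shortcut is reading live data:

# Confirm the review fields parsed as expected, and count the rows
# pulled on demand from S3.
df.printSchema()
print(df.count())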

Summary

And that's a wrap.  We've created a shortcut from Microsoft Fabric to an AWS S3 bucket, and now we can directly access the data being stored by other applications running on the AWS cloud!

Most importantly, data engineers, data analysts, and data scientists don't need to set up pipelines to move data between clouds, since Fabric handles that for us.