RBA Consulting

Recently, I spent several months helping a client establish development best practices in a Fabric environment they were building. Fabric as a platform is getting a lot of attention. It can be easy to start ingesting data from a myriad of sources into your lakehouses and then refine it into the classic medallion architecture.

For the most part, you can accomplish what you need for ingestion in a relatively low-code/no-code manner, and your data analysts and engineers can start working with the data immediately in Python notebooks using the libraries of their choice, such as PySpark, pandas, duckdb, etc.

However, the problems the client was running into were software development lifecycle issues, the kind our industry has worked hard to solve over the last 10-15 years. They had some initial ideas they wanted to try out and turned some people loose in the Fabric environment. The team made good headway proving out those ideas, but things were quickly spiraling out of control in terms of process and organization.

In Fabric, a workspace is a shared environment everyone uses at the same time. Imagine if a bunch of people were all using the same computer at the same time for their development, and on top of it they weren’t even using different accounts. Everybody was just throwing everything they did in a shared C:\fabric-dev folder. It might work for one or two people, but once you start adding to the team and people’s different needs, it’s really easy to get wrapped around the axle and lose control of what’s in the workspace.

I’m just barely old enough that at the start of my career I saw one or two projects where there was no version control and everyone was just working in a shared network drive. It’s not a fun way to work, and I’d rather not revisit it if I don’t have to!

In this post, we’re going to look at some high-level recommendations for structuring workspaces, working with feature branches/workspaces, and various gotchas that can catch you along the way, especially as you’re starting out experimenting with Fabric as a platform.

Fabric Workspace Structure and Git Integration

First things first, how should you set up your workspaces? This is where a little planning ahead can save you a lot of headaches down the road. Often people start with a simple structure: they create a dev, test, and prod workspace, connect the dev workspace to Git, and then use Fabric’s built-in deployment pipelines to move items between environments.

This works at first but becomes a real problem once you want a more mature development process where things are developed in feature branches and merged in through pull requests after review (more on this later).

One thing you’ll end up learning in Fabric is that you’re always going to need more workspaces, especially when you want to implement a robust development lifecycle with feature branching, pull requests, and so on.

There are a lot of different approaches to organizing your workspaces; the following is a simple hub-style approach. You’re likely going to end up with a lot more workspaces, so it’s good to plan ahead. Fabric has tools to help organize workspaces, like the ability to assign them to different domains, and you should leverage them, especially if you are pursuing a data mesh architecture. But even in a single domain I’d recommend splitting everything into a minimum of 3-4 workspaces per environment depending on your needs. Each environment should have at minimum an engineering, store, and ingestion workspace. If you’re going to be developing Power BI reports off your data, I’d recommend a presentation workspace as well.

Let’s look at this structure and explain the benefits of setting your environment up this way. If you do the math for 3 environments with 4 workspaces each, we’re already up to 12 workspaces, but the benefits outweigh the extra work of maintaining them.

Store

This is the workspace that will store all the data. Fabric stores item definitions as templates analogous to ARM or Bicep templates to deploy them to other environments/workspaces.

Splitting your workspaces into at least two, “store” and “everything else”, is the biggest initial win. When things are deployed, either through Deployment Pipelines or Fabric CI/CD, only the definitions move, not the data, so by separating out our data it becomes much easier to support a development workflow with feature branches. If my notebook and the lakehouse with its data both lived in the Dev workspace, then when I branch out to a new workspace my notebook would point to the new copy of the lakehouse, which is empty. Now I also need a process to populate my new lakehouse, which eats up storage and capacity.

There are newer options when branching to keep notebooks pointing at the original datasources, but that’s a bit of a band-aid that you don’t need at all if you structure things correctly up front.
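
To make the separation concrete, here is a minimal sketch of how a notebook in one workspace could address a table in a lakehouse that lives in a different (store) workspace, using the OneLake ABFS path format. This is an illustration, not necessarily how the notebooks in this setup are wired. The table name customers is hypothetical, and schema-enabled lakehouses would need an extra schema segment after Tables.

```python
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def onelake_table_path(workspace: str, lakehouse: str, table: str) -> str:
    """Build the ABFS path of a table in a lakehouse, whatever workspace it lives in."""
    return f"abfss://{workspace}@{ONELAKE_HOST}/{lakehouse}.Lakehouse/Tables/{table}"


# "customers" is a hypothetical table name used only for illustration.
path = onelake_table_path("store-dev", "customer_silver", "customers")
print(path)

# In a Fabric PySpark notebook the path could then be read with something like:
#   df = spark.read.format("delta").load(path)
```

Because the path names the store workspace explicitly, the same notebook code reads the same data regardless of which feature workspace it runs in.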

Ingestion

This is where all the work happens to ingest your data into Fabric and push it to your store lakehouses, whether that’s notebooks, copy jobs, pipelines, etc. If you have a simple enough ingestion process you could realistically combine this with the Engineering workspace. The caveat is that permissions in Fabric are easiest to manage at the workspace level, so in any of these scenarios, if you need to lock access down, it’s much easier to do by moving the items into their own workspace.

Engineering

Here is the home for all your data engineering notebooks and pipelines that orchestrate them. We want the pipelines that execute the notebooks to be in the same workspace because it greatly simplifies the deployment process.

This is the opposite of what we want with notebooks and lakehouses. When a notebook references a lakehouse in the same workspace, branching out makes it point at the copy of that lakehouse in the new workspace, which starts out empty, so the notebook can’t really do anything.

When a pipeline points to a notebook in the same workspace, it will reference the new copy of that notebook in the new workspace, so when we’re updating our notebooks and orchestration we don’t have to swap notebook references to test our notebook and pipeline changes together. Plus, when we deploy, the pipelines will always point to the notebooks in the same (correct) workspace, and we don’t have to do any further configuration to fix those references in the deployment process.

Presentation

Here is where all reports and semantic models live. The real benefit here is permissions, depending on the connection types in your semantic model. With workspace identities or service principals, you can grant someone view access to a presentation workspace without having to give them access to any of the underlying data or processes.

All of these workspaces can be stored in the same Git branch by using the workspace sub-folder feature of the Fabric Git integration. When you sync a workspace to Git, you can give it a sub-folder to pull from:

📁 workspaces
  📁 engineering
  📁 ingestion
  📁 store
  📁 presentation

In addition to the structure, I’d recommend creating workspace identities for each of the workspaces. Workspace identities are a new(er) feature that basically creates a service principal under the hood that is tied to the workspace. They’re not fully supported across all scenarios in Fabric, but the support is growing and they greatly simplify some things like semantic model connections or pipeline integration, depending on your needs.

Fabric CI/CD, Branching Strategy and Feature Workspaces

If you do any reading in the online Fabric forums, one of the common pain points is the CI/CD process, especially around the built-in Deployment Pipelines in Fabric. The pipelines UI can be very laggy, and it can be hard to figure out what needs to be deployed, especially as your number of workspace items grows.

They can also be brittle, especially if something gets accidentally deleted. For instance, in our example workspace setup, if a lakehouse gets accidentally deleted, even if you undo and restore from Git, the GUID in Fabric has changed because it’s a new item. Now all those deployment pipeline relationships are broken and you’ll likely have to delete the affected workspace in all environments and redeploy. Then any deployment rules will have to be manually updated for each notebook as well. It’s a very painful and cumbersome process.

I strongly recommend you implement a CI/CD deployment as part of your Azure DevOps or GitHub build pipelines using the Fabric CI/CD library, now officially supported by Microsoft.

This is a Python library that offers a robust toolkit for deploying your Fabric items from the Git repository, along with parameterization rules and more that greatly simplify the deployment process. For instance, if you have notebooks that use a lakehouse named customer_silver, you can set up a single parameter substitution rule in the parameter.yml files the library uses.
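
As a rough sketch of what a deployment script built on the library might look like, the following plans one deployment per workspace sub-folder and publishes the store workspace first so that later workspaces can resolve references to it. The folder layout and workspace naming follow the examples in this post; the workspace_ids mapping, the item types listed per area, and the build_plan/deploy helpers are assumptions made for illustration. Authentication is left at the library defaults, which may differ by version.

```python
# Item types per workspace area are illustrative; adjust to what you actually store.
# Dict order doubles as deploy order: store first, so its lakehouses exist
# before anything that references them is deployed.
AREAS = {
    "store": ["Lakehouse"],
    "ingestion": ["Notebook", "DataPipeline"],
    "engineering": ["Notebook", "DataPipeline"],
    "presentation": ["SemanticModel", "Report"],
}


def build_plan(repo_root: str, environment: str) -> list:
    """Return one deployment step per workspace area, in deploy order."""
    return [
        {
            "workspace_name": f"{area}-{environment}",
            "environment": environment,
            "repository_directory": f"{repo_root}/workspaces/{area}",
            "item_types": item_types,
        }
        for area, item_types in AREAS.items()
    ]


def deploy(plan: list, workspace_ids: dict) -> None:
    """Publish every step with fabric-cicd. workspace_ids maps names to GUIDs (hypothetical)."""
    # Imported here so the planning code above runs without the library installed.
    from fabric_cicd import FabricWorkspace, publish_all_items, unpublish_all_orphan_items

    for step in plan:
        workspace = FabricWorkspace(
            workspace_id=workspace_ids[step["workspace_name"]],
            environment=step["environment"],
            repository_directory=step["repository_directory"],
            item_type_in_scope=step["item_types"],
        )
        publish_all_items(workspace)
        unpublish_all_orphan_items(workspace)


plan = build_plan(".", "dev")
for step in plan:
    print(step["workspace_name"], "<-", step["repository_directory"])
```

A build pipeline would call deploy(plan, workspace_ids) after a merge to Main, with the environment name passed in as a pipeline parameter.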

The approach I use has all the main workspaces we’ve seen before but then developers get their own feature workspaces and create feature branches in the git repo. They can then sync their workspaces to their feature branch and commit using the Fabric source control integration.

In this scenario none of the “main” shared workspaces are synced to Git; they only get updated via an Azure DevOps pipeline or GitHub Actions workflow using the Fabric CI/CD library. Think of it like traditional software development. Developers work locally in feature branches on their computers (the feature workspace), and once their changes are ready they raise a PR for their feature branch back into Main. Once their changes are approved and merged in, an automated build deploys the latest up to the dev server (the dev workspace).

In theory, you could have a single workspace per developer, but depending on their role, it may make sense to give them multiple workspaces that correspond to the areas they’ll work in most often.

Some people keep their main dev workspaces synced to Git as well, but I think this will cause problems in the long run and force you to manually update the workspaces when changes are merged. We don’t let people commit directly to Main without going through a PR in other software development processes, and we should strive for the same here in Fabric.

This also helps address problems in the built-in Fabric Deployment Pipelines. When deploying, rather than configuring substitution rules for each notebook, we can take the logical lakehouse GUID stored in Git and use named parameter substitutions in our engineering workspace deployment, so every notebook that uses that lakehouse will automatically have its references updated as we deploy to different environments.

    - find_value: "7b6a7a83-afe6-417d-a2db-c50a0270b772" # customer_silver lakehouse id
      replace_value:
        dev: $workspace.store-dev.$items.Lakehouse.customer_silver.$id
        test: $workspace.store-test.$items.Lakehouse.customer_silver.$id
        prod: $workspace.store-prod.$items.Lakehouse.customer_silver.$id

As you can see, we find the original GUID and tell the deployment library to replace it with the following value per environment. This uses named parameterization to automatically pull the right value for the right workspace/lakehouse based on the environment. With deployments properly configured this way, our deleted lakehouse example becomes much easier to handle. We can simply redeploy to our target store workspace to recreate it, then deploy our engineering workspaces, and everything will have the correct references properly updated. This could be the difference between a 5-minute wait for a deploy to finish and hours of tedious click-ops to fix a problem.
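
Conceptually, a rule like this is a per-environment text substitution over the item files. The sketch below illustrates that idea only; it is not the library's implementation. The apply_find_replace helper, the resolve callback, and the placeholder GUIDs are all made up for the example, standing in for the lookup the library performs against the target workspace.

```python
def apply_find_replace(text, rules, environment, resolve):
    """Replace each rule's find_value with the value resolved for one environment."""
    for rule in rules:
        expression = rule["replace_value"].get(environment)
        if expression is None:
            continue  # no entry for this environment: leave the text unchanged
        text = text.replace(rule["find_value"], resolve(expression))
    return text


rules = [{
    "find_value": "7b6a7a83-afe6-417d-a2db-c50a0270b772",
    "replace_value": {
        "dev": "$workspace.store-dev.$items.Lakehouse.customer_silver.$id",
        "test": "$workspace.store-test.$items.Lakehouse.customer_silver.$id",
    },
}]

# Placeholder GUIDs standing in for what a real lookup would return.
fake_ids = {
    "$workspace.store-dev.$items.Lakehouse.customer_silver.$id": "00000000-0000-0000-0000-00000000000d",
    "$workspace.store-test.$items.Lakehouse.customer_silver.$id": "00000000-0000-0000-0000-00000000000e",
}

notebook_line = '# META       "default_lakehouse": "7b6a7a83-afe6-417d-a2db-c50a0270b772",'
deployed_line = apply_find_replace(notebook_line, rules, "test", fake_ids.__getitem__)
print(deployed_line)
```

The point of the named form is that the right-hand side is looked up at deploy time, so a recreated lakehouse with a new GUID needs no rule changes.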

You can find all these values by pulling your Fabric repo locally and looking at the notebook-content.py for a given notebook. They’re all in the header at the top of the file.

# METADATA ********************

# META {
# META   "kernel_info": {
# META     "name": "synapse_pyspark"
# META   },
# META   "dependencies": {
# META     "lakehouse": {
# META       "default_lakehouse": "7b6a7a83-afe6-417d-a2db-c50a0270b772",
# META       "default_lakehouse_name": "customer_silver",
# META       "default_lakehouse_workspace_id": "732077b5-c555-4358-9ac9-80fafa5c5e48",
# META       "known_lakehouses": [
# META         {
# META           "id": "ddd6e1e1-55aa-4665-aa05-fbdbc8bdd72c"
# META         },
# META         {
# META           "id": "f3dc5d9a-e156-44f0-8b30-8395a2bc687f"
# META         }
# META       ]
# META     },
# META     "environment": {
# META       "environmentId": "acb5449d-fdb8-4c92-ae95-26ed42d949d0",
# META       "workspaceId": "2b327290-14c1-494e-9595-7d85e6e90099"
# META     },
# META     "mirrored_db": {
# META       "known_mirrored_dbs": [
# META         {
# META           "id": "ff208c6c-dafc-4d99-8053-c9a56baf117b"
# META         }
# META       ]
# META     }
# META   }
# META }

All these values can be parameterized during deployment, and the documentation for the Fabric CI/CD library is pretty good. I highly recommend reviewing the available features and their examples. You should be able to get up and running relatively quickly.
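
If you have many notebooks, collecting these IDs by hand gets tedious. The following is a small sketch that reads the first "# META" block of a notebook-content.py file as JSON and pulls out the default lakehouse reference. The read_notebook_header and default_lakehouse helpers are hypothetical names, and the sample text is a trimmed version of the header shown above.

```python
import json

SAMPLE = '''# METADATA ********************

# META {
# META   "kernel_info": {
# META     "name": "synapse_pyspark"
# META   },
# META   "dependencies": {
# META     "lakehouse": {
# META       "default_lakehouse": "7b6a7a83-afe6-417d-a2db-c50a0270b772",
# META       "default_lakehouse_name": "customer_silver",
# META       "default_lakehouse_workspace_id": "732077b5-c555-4358-9ac9-80fafa5c5e48"
# META     }
# META   }
# META }

print("first cell")
'''


def read_notebook_header(source: str) -> dict:
    """Parse the first run of '# META ' lines in a notebook file as JSON."""
    prefix = "# META "  # trailing space keeps '# METADATA ***' lines out
    meta_lines = []
    for line in source.splitlines():
        if line.startswith(prefix):
            meta_lines.append(line[len(prefix):])
        elif meta_lines:
            break  # the first block has ended; stop before any later blocks
    return json.loads("\n".join(meta_lines))


def default_lakehouse(header: dict) -> tuple:
    """Return (name, id) of the notebook's default lakehouse."""
    lakehouse = header["dependencies"]["lakehouse"]
    return lakehouse["default_lakehouse_name"], lakehouse["default_lakehouse"]


header = read_notebook_header(SAMPLE)
print(default_lakehouse(header))
```

Run over every notebook-content.py in the repo, this gives a quick list of GUIDs that need a matching rule in parameter.yml.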

Special Considerations

Just a few other thoughts when it comes to moving to automated deployments and a mature development process in Fabric.

Semantic Models

When changing Semantic Model connections from environment to environment, only the owner of the model can change the connection properties. If you’ve set up this type of automated deployment from the beginning, your service principal used to deploy already owns your Semantic Model. If you’re moving existing workspaces to this model and don’t want to delete and recreate the workspace, you can write a simple PowerShell script and run it as part of your initial deploy for your service principal to take over ownership of existing Semantic Models using the Power BI REST API.
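
The same takeover call can be made from any language. Here is a minimal Python sketch, rather than the PowerShell version described above, that builds the request for the Power BI REST API "Take Over In Group" operation. The take_over_request helper is a hypothetical name, and acquiring an access token for the service principal is assumed to happen elsewhere.

```python
import urllib.request

POWER_BI_API = "https://api.powerbi.com/v1.0/myorg"


def take_over_request(workspace_id: str, dataset_id: str, access_token: str) -> urllib.request.Request:
    """Build the POST request that transfers semantic model ownership to the caller."""
    url = f"{POWER_BI_API}/groups/{workspace_id}/datasets/{dataset_id}/Default.TakeOver"
    return urllib.request.Request(
        url,
        data=b"",  # the operation takes no request body
        method="POST",
        headers={"Authorization": f"Bearer {access_token}"},
    )


# Placeholder IDs and token for illustration only.
request = take_over_request("<workspace-id>", "<dataset-id>", "<token>")
print(request.get_method(), request.full_url)

# To actually send it, with a real token and real IDs:
#   urllib.request.urlopen(request)
```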

Workspace Naming

Fabric generally treats workspace names as case-insensitive, but the deployment library does not. I generally just keep everything lowercase to avoid headaches.

Variable Libraries

Variable libraries in Fabric greatly simplify moving things from environment to environment, and the Fabric CI/CD library has a neat feature where, if the environment you’re deploying to (defined in the parameter.yml files) matches the value set name of a Variable Library being deployed, that value set will be automatically set as the active one.

In normal Deployment Pipelines, you’d have to deploy first and then set the correct value set as active, so this saves yet another “ClickOps” step.

Pipeline Schedules

Let’s say you’ve got a Data Pipeline in Fabric that runs some notebooks and does some transformation from your Silver to Gold layer. In Dev, on small datasets, you may want that to run often, or even just on-demand manually as needed, but in Prod, run it on a schedule once a day. Using Deployment Pipelines, your schedules are currently overwritten every time you deploy and must be manually updated after the fact.

With the Fabric CI/CD library, you can define multiple schedules and leave the Test and Prod schedules disabled in Dev. Then, in the parameter.yml file for your workspace, a key/value replace rule enables or disables the appropriate schedules per environment:

key_value_replace:
    - find_key: $.schedules[?(@.jobType=="Execute")].enabled
      replace_value:
          PPE: false
          PROD: true
      file_path: "**/.schedules"
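
To show the effect of that rule, here is a conceptual sketch of the same change applied by hand to a schedules document. The JSON shape (a top-level schedules array whose entries have jobType and enabled fields) is inferred from the JSONPath above and may not match the real file exactly; set_schedule_enabled is a hypothetical helper, not part of the library.

```python
import json

# Assumed shape of a .schedules file, inferred from the JSONPath in the rule.
SAMPLE_SCHEDULES = json.dumps({
    "schedules": [
        {"jobType": "Execute", "enabled": False},
    ]
})


def set_schedule_enabled(schedules_json: str, job_type: str, enabled: bool) -> str:
    """Set 'enabled' on every schedule whose jobType matches, and return the new JSON."""
    document = json.loads(schedules_json)
    for schedule in document.get("schedules", []):
        if schedule.get("jobType") == job_type:
            schedule["enabled"] = enabled
    return json.dumps(document, indent=2)


# Values per environment, mirroring the replace_value block in the rule.
ENABLED_BY_ENVIRONMENT = {"PPE": False, "PROD": True}

prod_schedules = set_schedule_enabled(SAMPLE_SCHEDULES, "Execute", ENABLED_BY_ENVIRONMENT["PROD"])
print(prod_schedules)
```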

Triggering Pipelines From Devops

One of the shortcomings of Fabric is the granularity of permissions. For instance, it would be really nice to have an Orchestration workspace that you could grant access to in higher environments to manually trigger Data Pipeline runs. Maybe a notebook had a bug and some tables got dropped from a Lakehouse, and you deploy your fix. If the pipeline that triggers your notebook only runs once a day in Test, it’d be nice to manually trigger it just this once.

Unfortunately, to execute a pipeline you need to have Contributor rights in a workspace, and if you have Contributor rights or above you basically have the keys to the castle in that workspace. We obviously don’t want that in higher environments so something we’ve done is create new standalone Azure DevOps pipelines to trigger specific pipelines in higher environments. Then we can manage the permissions to those pipelines, so now certain people have the permission to trigger Data Pipelines without having to give them the ability to do anything else in higher environments.

Basically we’re using DevOps as a UI to select pipelines, taking advantage of the fact that the service principal used to deploy already needs to be a Contributor in the workspace.

# =============================================================================
# Trigger Data Pipelines - Engineering Workspace
# =============================================================================
# Manually trigger Fabric data pipelines in the engineering workspace.
# Select the environment and which pipelines to run using the options below.
#
# To add a new pipeline:
#   1. Add a new boolean parameter below with displayName matching the pipeline name
#   2. Add a corresponding entry to the 'pipelineSelections' object in the PowerShell script
# =============================================================================

trigger: none
pr: none

parameters:
  # -------------------------------------------------------------------------
  # ENVIRONMENT SELECTION
  # -------------------------------------------------------------------------
  - name: environment
    displayName: 'Environment'
    type: string
    default: 'dev'
    values:
      - dev
      - test
      - prod

  # -------------------------------------------------------------------------
  # PIPELINE SELECTIONS
  # Add new pipelines here as boolean parameters
  # -------------------------------------------------------------------------
  - name: MainBronzeToGold
    displayName: 'Main Bronze to Gold Pipeline'
    type: boolean
    default: false

  - name: Bronze2Silver_Pipeline
    displayName: 'Bronze2Silver Pipeline'
    type: boolean
    default: false

variables:
  azureServiceConnection: '<ServiceConnection>'
  # Workspace name pattern: engineering-{environment}
  workspaceName: 'engineering-${{ parameters.environment }}'

pool:
  vmImage: 'windows-latest'

stages:
  - stage: TriggerPipelines
    displayName: 'Trigger Data Pipelines (${{ parameters.environment }})'
    jobs:
      - job: TriggerJob
        displayName: 'Trigger Fabric Data Pipelines'
        steps:
          - checkout: self

          - task: AzureCLI@2
            displayName: 'Trigger selected data pipelines'
            inputs:
              azureSubscription: '$(azureServiceConnection)'
              scriptType: 'pscore'
              scriptLocation: 'inlineScript'
              inlineScript: |
                $ErrorActionPreference = "Stop"
                
                # ---------------------------------------------------------------
                # PIPELINE SELECTIONS
                # To add a new pipeline: add a key-value pair below where:
                #   - Key = Fabric pipeline display name (exact match)
                #   - Value = $true/$false from the corresponding parameter
                # ---------------------------------------------------------------
                $pipelineSelections = @{
                    "Main Bronze to Gold Pipeline"  = $${{ parameters.MainBronzeToGold }}
                    "Bronze2Silver Pipeline"             = $${{ parameters.Bronze2Silver_Pipeline}}
                }
                
                # Build list of selected pipelines
                $pipelines = @($pipelineSelections.GetEnumerator() | Where-Object { $_.Value -eq $true } | Select-Object -ExpandProperty Key)
                
                if ($pipelines.Count -eq 0) {
                    Write-Host "##vso[task.logissue type=warning]No pipelines selected. Please select at least one pipeline to trigger."
                    exit 0
                }
                
                Write-Host "Environment: ${{ parameters.environment }}" -ForegroundColor Cyan
                Write-Host "Workspace: $(workspaceName)" -ForegroundColor Cyan
                Write-Host ""
                Write-Host "Pipelines to trigger:" -ForegroundColor Yellow
                foreach ($p in $pipelines) {
                    Write-Host "  - $p" -ForegroundColor Cyan
                }
                Write-Host ""
                
                # Execute the trigger script
                & "$(Build.SourcesDirectory)/.deploy/trigger-data-pipelines.ps1" `
                    -WorkspaceName "$(workspaceName)" `
                    -Pipelines $pipelines

Where trigger-data-pipelines.ps1 uses the Fabric REST API to get the pipeline ID by name and then triggers it with this snippet:

function Start-DataPipeline {
    <#
    .SYNOPSIS
        Triggers a data pipeline job.
    #>
    param(
        [Parameter(Mandatory = $true)]
        [string]$WorkspaceId,

        [Parameter(Mandatory = $true)]
        [string]$PipelineId,

        [Parameter(Mandatory = $true)]
        [string]$PipelineName
    )

    $headers = Get-AuthHeaders
    $url = "$script:FabricApiBase/workspaces/$WorkspaceId/items/$PipelineId/jobs/instances?jobType=Pipeline"

    try {
        Write-Host "  Starting pipeline: $PipelineName" -ForegroundColor Cyan
        $response = Invoke-WebRequest -Uri $url -Headers $headers -Method Post -UseBasicParsing
        
        Write-Host "    Started successfully" -ForegroundColor Green
        return $true
    }
    catch {
        Write-Host "    Failed to start: $_" -ForegroundColor Red
        return $false
    }
}
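
The name-to-ID lookup is not shown above, so here is a sketch of that step in Python for illustration. It assumes the workspace items were already fetched from the Fabric REST API "list items" endpoint, whose response carries id, displayName, and type per item. The base URL is an assumption for the value of $script:FabricApiBase, and find_pipeline_id and run_url are hypothetical helper names.

```python
FABRIC_API = "https://api.fabric.microsoft.com/v1"  # assumed value of $script:FabricApiBase


def find_pipeline_id(items: list, display_name: str) -> str:
    """Return the ID of the one Data Pipeline with this exact display name."""
    matches = [
        item["id"]
        for item in items
        if item.get("type") == "DataPipeline" and item.get("displayName") == display_name
    ]
    if len(matches) != 1:
        raise LookupError(f"expected one pipeline named {display_name!r}, found {len(matches)}")
    return matches[0]


def run_url(workspace_id: str, pipeline_id: str) -> str:
    """Build the on-demand job URL, matching the one used in Start-DataPipeline."""
    return f"{FABRIC_API}/workspaces/{workspace_id}/items/{pipeline_id}/jobs/instances?jobType=Pipeline"


# Sample data shaped like the "value" array of a list-items response (IDs are placeholders).
sample_items = [
    {"id": "pipeline-1", "displayName": "Bronze2Silver Pipeline", "type": "DataPipeline"},
    {"id": "notebook-1", "displayName": "Bronze2Silver Pipeline", "type": "Notebook"},
]

pipeline_id = find_pipeline_id(sample_items, "Bronze2Silver Pipeline")
print(run_url("<workspace-id>", pipeline_id))
```

Filtering on type as well as name matters because display names are only unique per item type, so a notebook and a pipeline can share one.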

Final Thoughts

That wraps up some initial thoughts on how to streamline your development lifecycle. In future posts I plan to address how you can automate workspace creation and management and also walk through how we address topics like managing capacity usage especially in smaller capacities during initial POC work.

If you’re evaluating Fabric and start with a smaller F2 – F8 capacity, it’s very easy to get frustrated running into capacity limits. There are a few tricks and tips you can use to help stay under limits without having to jump 2-3 capacity tiers so stay tuned for that as well.

About the Author

Nick Olson

Principal Software Consultant

Nick has dedicated the past 20+ years of his career to development on the Microsoft tech stack, cultivating a deep expertise in this domain. He possesses an extensive, proven track record of successfully completing diverse development projects.