Running a Databricks notebook connected to Git via ADF, independently of the Git username


In our company, to orchestrate Databricks notebook runs, we learned through experimentation to connect our notebooks (which live in a Git repository) to ADF pipelines. However, there is an issue.

As you can see in the screenshot attached to this question, the path to the notebook depends on the employee's username, which is not a stable solution in production.

What are the possible solutions?

  • Update: the main issue is keeping the employee username out of production to avoid future failures, whether it appears in the path used by ADF or in a secondary storage location that a Lookup activity could read but that still sits on the production side.

Path selection in ADF: [screenshots of the notebook path picker in the Notebook activity settings]


There are 2 answers

Answer by Alex Ott (best answer, score 1):

If you want to avoid having the username in the path, you can create a folder inside Repos and do the checkout there. The full steps:

  • In Repos, at the top level, click the down arrow next to the "Repos" header, select "Create", and then select "Folder". Give it a name, for example "Staging".


  • Create a repository inside that folder: click the down arrow next to the "Staging" folder, select "Create", and then select "Repo".


After that you can navigate to that repository in the ADF UI.

It's also recommended to set permissions on the folder, so only specific people can update projects inside it.
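
If you prefer to script that checkout instead of clicking through the UI (for example from a CI/CD job), the Databricks Repos REST API (POST /api/2.0/repos) can create the checkout under the shared folder. Here is a minimal sketch in Python; the workspace URL, Git repository URL, and project name are hypothetical, the personal access token is assumed to sit in the DATABRICKS_TOKEN environment variable, and the /Repos/Staging folder must already exist:

```python
import os

import requests

# Assumed values -- replace with your own workspace URL and Git repository.
WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # hypothetical
TOKEN = os.environ["DATABRICKS_TOKEN"]  # PAT with permission to manage Repos

# Create the checkout under the shared "Staging" folder
# instead of /Repos/<username>/...
resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/repos",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "url": "https://github.com/my-org/etl-project.git",  # hypothetical repo
        "provider": "gitHub",
        "path": "/Repos/Staging/etl-project",  # stable, username-free path for ADF
    },
)
resp.raise_for_status()
print(resp.json()["id"])  # repo id; keep it if you want to script updates later
```

The returned repo id can later be passed to PATCH /api/2.0/repos/{repo_id} to pull a specific branch or tag before a pipeline run, so production always executes a known state of the code.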

Answer by Utkarsh Pal (score 5):

You can use Azure DevOps source control to manage the development and production Databricks notebooks and other related code, scripts, and documents in Git.

Keep your notebooks in logically organized repositories in GitHub and reference the same paths in the Notebook activities of your Azure Data Factory pipelines.

If you want to pass a dynamic path to the Notebook activity, keep the list of notebook file paths in a placeholder such as a text/CSV file or a SQL table.

Then use a Lookup activity in ADF to read that list, pass the Lookup output to a ForEach activity, and put a Notebook activity inside the ForEach, passing the path for the current iteration as a parameter. This way you avoid a hard-coded file path in the pipeline. A sketch of this wiring follows.
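
Wired together, the pattern looks roughly like the sketch below, written with the azure-mgmt-datafactory Python SDK. This is only an illustration, not part of the original answer: the dataset NotebookPathsTable (an Azure SQL table whose NotebookPath column holds values such as /Repos/Staging/etl-project/ingest), the linked service AzureDatabricksLS, and all resource names are assumptions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    ActivityDependency, AzureSqlSource, DatabricksNotebookActivity,
    DatasetReference, Expression, ForEachActivity, LinkedServiceReference,
    LookupActivity, PipelineResource,
)

SUBSCRIPTION_ID = "<subscription-id>"  # hypothetical
RESOURCE_GROUP = "my-rg"               # hypothetical
FACTORY = "my-adf"                     # hypothetical

# 1. The Lookup reads every row; each row carries one notebook path.
lookup = LookupActivity(
    name="LookupNotebookPaths",
    dataset=DatasetReference(type="DatasetReference",
                             reference_name="NotebookPathsTable"),
    source=AzureSqlSource(),
    first_row_only=False,
)

# 2. The inner Notebook activity takes its path from the current item,
#    so no username or file path is hard-coded in the pipeline.
run_notebook = DatabricksNotebookActivity(
    name="RunNotebook",
    notebook_path={"value": "@item().NotebookPath", "type": "Expression"},
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="AzureDatabricksLS"),
)

# 3. The ForEach iterates over the Lookup output once the Lookup succeeds.
foreach = ForEachActivity(
    name="ForEachNotebook",
    items=Expression(value="@activity('LookupNotebookPaths').output.value"),
    activities=[run_notebook],
    depends_on=[ActivityDependency(activity="LookupNotebookPaths",
                                   dependency_conditions=["Succeeded"])],
)

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)
client.pipelines.create_or_update(
    RESOURCE_GROUP, FACTORY, "RunDatabricksNotebooks",
    PipelineResource(activities=[lookup, foreach]),
)
```

The same pipeline can of course be built in the ADF authoring UI; the SDK form is just easier to show in text, and it makes the Lookup, ForEach, and Notebook wiring explicit.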