What are Init scripts?

Init scripts are (usually small) shell scripts that run during the start-up of each cluster node, before the Spark driver or worker JVM starts. These scripts are primarily used to pre-install Python and R libraries. They can also add certificates that we might need to access specific resources.
They are particularly useful when we want to:
1. Ensure we have a specific library/package configuration in place.
2. Share our own set-up easily with colleagues.
More information can be found here: Cluster node initialization scripts
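To give a flavour, an init script is just a short bash file. The sketch below is illustrative only: the package names, versions and certificate path are hypothetical, and it assumes plain pip and Rscript are available on the node.

#!/bin/bash
# Pin a Python library to a specific version on every node (hypothetical package and version)
pip install openpyxl==3.1.2
# Install an R package, assuming R is available on the chosen runtime
Rscript -e 'install.packages("janitor", repos = "https://cloud.r-project.org")'
# Add a certificate to the node's trust store (hypothetical source path)
cp /dbfs/FileStore/certs/my-ca.crt /usr/local/share/ca-certificates/my-ca.crt
update-ca-certificates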

Custom init scripts

There are init scripts associated with each Databricks runtime. You can also create your own.

Init script GitHub repo

The simplest way to access working init scripts is to clone the init script repo.

Instructions on how to clone a GitHub repo can be found in the Databricks folder. The DASH.sh file will now be in the repo section of the Workspace tab.

Creating a .sh file

You can also create your own file.

Log into Databricks.
Start in your respective “Workspace” explorer (tab located at upper left corner in Databricks home screen).

In the upper right corner, click “Add” and select “File”.
Name the file “my_init_script.sh” (or something you like) and click “create file”.

In the file you created, list what you want to be installed.

An example init script can be found here: example init script (*)
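For illustration, an init script that matches the check carried out later in this guide (upgrading pandas from the runtime default to 1.5.3) could be as small as the sketch below; it is not necessarily the exact contents of the linked file.

#!/bin/bash
# Upgrade pandas beyond the version shipped with the runtime (12.2 LTS ships 1.4.2)
pip install pandas==1.5.3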

Copy the path of your .sh file

For whichever .sh file you are using, in the Workspace explorer right-click on the filename (or click on the three vertical dots to its right) and select “Copy > Path”.
The path is now copied to your clipboard. It should look something like this: /Workspace/Users/firstname.lastname@defra.gov.uk/my_init_script.sh

Create a PC cluster with a custom init script

Go to the “Compute” explorer (the tab in the upper left corner, usually four lines below “Workspace”).
Click on “Create with Personal Compute”.

In the window that opens, make the relevant choices about name, runtime and node type.
It is suggested to use Runtime 12.2 LTS unless you have a particular requirement to use another.

Click on the “Advanced options” tab. Under the “Advanced options” tab select the “Init Scripts” tab.
In the form asking for “Init script path” paste the file path of the init script from above.
The destination box, to the left of it, should read “Workspace”.
Note that this means you will have to delete the prefix “/Workspace” from the path, i.e. the path will read something like: “/Users/firstname.lastname@defra.gov.uk/my_init_script.sh”.
Click “Add”.

Click “Create Cluster”.

If you click back on the “Compute” tab you should see your cluster in the list. Under status it should have a spinning icon while the cluster spins up.
Creating a cluster takes a while (roughly 6-7 minutes with the init script provided). Once the cluster is up and running, the spinning icon will turn into a tick.

Check it is working

If we used Runtime 12.2 LTS the pre-configured pandas version is 1.4.2 (see Databricks Runtime 12.2 LTS for Machine Learning - Azure Databricks | Microsoft Learn for a full list), but in our example init script we specifically asked for 1.5.3.

To test, create a notebook in your Databricks workspace by clicking “New > Notebook”.

By default it is a Python notebook, named “Untitled Notebook” plus the date; you can rename it by double-clicking on the name. Check it is running on your cluster by looking at the box just to the right of the “Run all” button, near the top right-hand side. If there is no green circle, use the drop-down to select your cluster.

In the first cell type:

import pandas as pd  
print(pd.__version__)

Then run it by clicking on the play icon (or pressing Shift + Return).

This should print: “1.5.3”.

Alternative way to check
Click on your cluster from the compute page.
Click the tab “Apps” and then “Web Terminal”.
On the terminal that opens, type “pip show pandas”.
This prints information about the pandas package; the version should read “1.5.3”.

You could also use “pip list” to print the version of every pip package.
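For example, either of the following terminal commands (a sketch, assuming pip is on the terminal's PATH) will confirm the installed version:

# Show full package metadata for pandas, including the "Version:" line
pip show pandas
# Or filter the full package list down to pandas entries
pip list | grep -i pandas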

If you get the correct version of pandas you have created a PC cluster that is initialised with your custom init script! 😊

(*) Users are discouraged from using ad-hoc init scripts. If you want an init script to be used more extensively, please make a formal GitHub PR at this repo: init-scripts