Init scripts are (usually small) shell scripts that run during the
start-up of each cluster node before the Spark driver or worker JVM
starts. These scripts are primarily used to pre-install Python and R
libraries. They can also add certificates that we might need to access
specific resources.
They are particularly useful where we want to:
1. Ensure we have a specific library/package configuration in
place.
2. Share our own setup easily with colleagues.
More information can be found here: Cluster
node initialization scripts
There are init scripts associated with each Databricks runtime. You can also create your own.
The simplest way to access working init scripts is to clone the init script repo.
Instructions on how to clone a GitHub repo can be found in the Databricks folder. Once cloned, the DASH.sh file will be in the repo section of the Workspace tab.
You can also create your own file.
Log in to Databricks.
Start in your respective “Workspace” explorer (tab located at upper left
corner in Databricks home screen).
In the upper right corner, click add and select “File”.
Name the file “my_init_script.sh” (or something you like) and click
“create file”.
In the file you created, list what you want to be installed.
An example init script can be found here: example init script (*)
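As a rough illustration of what such a file might contain, the sketch below builds the text of a minimal init script from a set of pinned packages. It is not the example script linked above. The helper name build_init_script is hypothetical, and the pip location /databricks/python/bin/pip is an assumption about the cluster image that you should check against your runtime.

```python
def build_init_script(pinned_packages, pip_path="/databricks/python/bin/pip"):
    """Return the text of a small bash init script that installs pinned packages.

    pinned_packages maps a package name to an exact version,
    e.g. {"pandas": "1.5.3"}.
    pip_path is an ASSUMPTION about where the cluster's pip lives.
    """
    lines = [
        "#!/bin/bash",
        "set -e  # stop at the first failing command",
    ]
    # Sort so the generated script is the same on every run.
    for name, version in sorted(pinned_packages.items()):
        lines.append(f"{pip_path} install {name}=={version}")
    return "\n".join(lines) + "\n"


# Build a script that pins pandas to the version used later in this guide.
script_text = build_init_script({"pandas": "1.5.3"})
print(script_text)
```

The printed text could be pasted into “my_init_script.sh”. Pinning exact versions (==) keeps the cluster reproducible when the script is shared with colleagues.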
For whichever .sh file you are using, in the Workspace explorer right
click on the filename (or click on the three vertical dots to the right)
and select “Copy > Path”.
The path is now copied to the clipboard. It should look something like this:
/Workspace/Users/firstname.lastname@defra.gov.uk/my_init_script.sh
Go to the “Compute” explorer (tab located at upper left corner
usually 4 lines below “Workspace”).
Click on “Create with Personal Compute”.
In the window that opens, make the relevant choices about name,
runtime and node type.
It is suggested to use Runtime 12.2 LTS unless you know of particular
requirements to use another.
Click on the “Advanced options” tab. Under the “Advanced options” tab
select the “Init Scripts” tab.
In the form asking for “Init script path” paste the file path of the
init script from above.
The destination box, to the left of it, should read “Workspace”.
Note that this means you will have to delete the prefix “/Workspace” from
the path, i.e. the path will read something like:
“/Users/firstname.lastname@defra.gov.uk/my_init_script.sh”
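The prefix removal is easy to get wrong by hand, so here is a minimal sketch of it in Python. The function name to_init_script_path is hypothetical and only illustrates the string edit described above; it is not part of Databricks.

```python
WORKSPACE_PREFIX = "/Workspace"


def to_init_script_path(copied_path):
    """Drop a leading "/Workspace" from a path copied from the Workspace explorer.

    Paths that do not start with "/Workspace/" are returned unchanged.
    """
    if copied_path.startswith(WORKSPACE_PREFIX + "/"):
        return copied_path[len(WORKSPACE_PREFIX):]
    return copied_path


copied = "/Workspace/Users/firstname.lastname@defra.gov.uk/my_init_script.sh"
print(to_init_script_path(copied))
# prints /Users/firstname.lastname@defra.gov.uk/my_init_script.sh
```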
Click “Add”.
Click “Create Cluster”.
If you click back on the “Compute” tab you should see your cluster in
the list. Under status it should have a spinning icon as the cluster
spins up.
Look out of the window; it takes a while to create a cluster (about 6
minutes with the init script provided). After 6-7 minutes our cluster should be up and running.
The spinning icon will turn to a tick.
If we used Runtime 12.2 LTS, the pre-configured pandas
version is 1.4.2 (see Databricks
Runtime 12.2 LTS for Machine Learning - Azure Databricks | Microsoft
Learn for a full list), but in our example init script we
specifically asked for 1.5.3.
To test, create a notebook in your Databricks workspace by clicking “New > Notebook”.
By default it is a Python notebook. It will be named “Untitled Notebook” plus the date. You can change this if you wish by double-clicking on the name. Check it is running on your cluster by looking at the box just to the right of the “Run all” button, near the top right-hand side. If there is no green circle, you need to use the drop-down to select your cluster.
In the first cell type:
import pandas as pd
print(pd.__version__)
Then run it by clicking on the play icon (or press Shift + Return).
This should print: “1.5.3”.
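If you would rather have the notebook tell you whether the version matches, the check can be sketched as below. The helper name check_version is hypothetical; it relies only on importlib.metadata from the Python standard library (Python 3.8 and later), and the expected version 1.5.3 is the one requested in the example init script.

```python
from importlib.metadata import PackageNotFoundError, version


def check_version(package, expected):
    """Return True if `package` is installed at exactly the `expected` version."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        # The package is missing altogether, so the init script did not install it.
        print(f"{package} is not installed")
        return False
    print(f"{package}: installed {installed}, expected {expected}")
    return installed == expected


# Compare the installed pandas with the version pinned in the init script.
check_version("pandas", "1.5.3")
```

A result of False suggests that the init script did not run, or that the cluster was created without it.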
Alternative way to check
Click on your cluster from the compute page.
Click the tab “Apps” and then “Web Terminal”.
On the terminal that opens, type “pip show pandas”.
This prints information about the pandas package and the version should
read “1.5.3”
You could also use “pip list” to print the version of every pip package.
If you get the correct version of pandas, you have created a Personal Compute cluster that is initialised with your custom init script! 😊
(*) Users are discouraged from using ad-hoc init scripts. If you want an init script to be used more extensively, please make a formal GitHub PR at this repo: init-scripts