PySpark Vector Files¶
Read vector files into a Spark DataFrame with geometry encoded as Well Known Binary (WKB).
Full documentation is available here.
Requirements¶
This library was developed using Databricks Runtime 10.4 LTS and uses the versions of python, pandas and pyspark that come pre-installed on that runtime. It also requires GDAL 3.4.3, which was the most recent version of GDAL available from ubuntugis-unstable as of 2022-08-11.
You can install GDAL on your cluster using an init script. See here for an example.
Install pyspark-vector-files¶
Within a Databricks notebook¶
%pip install pyspark-vector-files
From the command line¶
python -m pip install pyspark-vector-files
Quick start¶
Read the first layer from one or more files with a given extension into a single Spark DataFrame:
from pyspark_vector_files import read_vector_files
sdf = read_vector_files(
path="/path/to/files/",
suffix=".ext",
)
More examples are available here.
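Since the geometry is encoded as Well Known Binary, it may help to see what that encoding looks like. The following is a minimal sketch using only the Python standard library; it is independent of this package, and the helper names `encode_point` and `decode_point` are hypothetical, introduced here for illustration. It covers only the simplest case, a little-endian 2D point.

```python
import struct

# WKB layout for a 2D point: a 1-byte byte-order flag (1 = little-endian),
# a 4-byte unsigned geometry type (1 = Point), then x and y as 8-byte doubles.
WKB_POINT_LE = "<BIdd"


def encode_point(x, y):
    """Pack a 2D point as little-endian WKB (hypothetical helper)."""
    return struct.pack(WKB_POINT_LE, 1, 1, x, y)


def decode_point(wkb):
    """Unpack a little-endian WKB point into an (x, y) tuple (hypothetical helper)."""
    byte_order, geometry_type, x, y = struct.unpack(WKB_POINT_LE, wkb)
    if byte_order != 1 or geometry_type != 1:
        raise ValueError("not a little-endian WKB point")
    return (x, y)


wkb = encode_point(1.0, 2.0)
print(len(wkb))           # 21 bytes: 1 + 4 + 8 + 8
print(decode_point(wkb))  # (1.0, 2.0)
```

Real files usually contain lines and polygons as well as points, so in practice you would decode the bytes with a geometry library rather than by hand.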
Local development¶
To ensure compatibility with Databricks Runtime 10.4 LTS, this package was developed on a Linux machine running the Ubuntu 20.04 LTS operating system using Python 3.8.10, GDAL 3.4.3, and Spark 3.2.1.
Install Python 3.8.10 using pyenv¶
See the pyenv-installer’s Installation / Update / Uninstallation instructions.
Install Python 3.8.10 globally:
pyenv install 3.8.10
Then set it as the local version for the repository you’re working in:
pyenv local 3.8.10
Install GDAL 3.4.3¶
Add the UbuntuGIS unstable Personal Package Archive (PPA) and update your package list:
sudo add-apt-repository ppa:ubuntugis/ubuntugis-unstable \
&& sudo apt-get update
Install GDAL 3.4.3. I found I also had to install python3-gdal (even though I’m going to use poetry to install it in a virtual environment later) to avoid version conflicts:
sudo apt-get install -y gdal-bin=3.4.3+dfsg-1~focal0 \
libgdal-dev=3.4.3+dfsg-1~focal0 \
python3-gdal=3.4.3+dfsg-1~focal0
Verify the installation:
ogrinfo --version
# GDAL 3.4.3, released 2022/04/22
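If you would rather check the version from a script, something like the following sketch could work. It assumes the `ogrinfo --version` output has the form shown above; the names `parse_gdal_version`, `installed_gdal_version` and `EXPECTED_GDAL` are hypothetical and not part of this package.

```python
import re
import shutil
import subprocess

# The GDAL version this README pins (assumption: compared as a tuple of ints).
EXPECTED_GDAL = (3, 4, 3)


def parse_gdal_version(text):
    """Extract (major, minor, patch) from text like 'GDAL 3.4.3, released ...'."""
    match = re.search(r"GDAL (\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no GDAL version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def installed_gdal_version():
    """Return the installed GDAL version, or None if ogrinfo is not on PATH."""
    if shutil.which("ogrinfo") is None:
        return None
    result = subprocess.run(
        ["ogrinfo", "--version"], capture_output=True, text=True, check=True
    )
    return parse_gdal_version(result.stdout)


# Parse the sample output shown above rather than relying on a local install.
sample = "GDAL 3.4.3, released 2022/04/22"
print(parse_gdal_version(sample) == EXPECTED_GDAL)
```

Calling `installed_gdal_version()` and comparing the result with `EXPECTED_GDAL` would then flag a mismatched install before you run `poetry install`.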
Install poetry 1.1.13¶
See poetry’s osx / linux / bashonwindows install instructions.
Clone this repository¶
git clone https://github.com/Defra-Data-Science-Centre-of-Excellence/pyspark_vector_files.git
Install dependencies using poetry¶
poetry install