Usage

Read various spatial vector formats into a Spark DataFrame.

Basic usage

Read the first layer from a file or files into a single Spark DataFrame:

sdf = read_vector_files(
    path="/path/to/files/",
    suffix=".ext",
)

Filename pattern matching

Read files that begin with “abc” into a single Spark DataFrame:

sdf = read_vector_files(
    path="/path/to/files/",
    pattern="abc*",
    suffix=".ext",
)

Read files that end with four digits into a single Spark DataFrame:

sdf = read_vector_files(
    path="/path/to/files/",
    pattern="*[0-9][0-9][0-9][0-9]",
    suffix=".ext",
)

For more information on pattern matching using Unix shell-style wildcards, see Python’s fnmatch module.
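The same wildcard rules can be tried out directly with the standard library's fnmatch function; the snippet below matches bare file stems purely for illustration:

from fnmatch import fnmatch

# "*" matches any run of characters; "[0-9]" matches a single digit
fnmatch("abc_parks", "abc*")                          # True
fnmatch("boundaries_2021", "*[0-9][0-9][0-9][0-9]")   # True
fnmatch("boundaries_v2", "*[0-9][0-9][0-9][0-9]")     # False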

Reading files from nested folders

By default, the library will only look within the specified folder. To enable recursive searching of subdirectories, use the recursive argument.

Given the following folder structure:

/path/to/files
|    file_0.ext
|    file_1.ext
|
|-- subfolder
|        file_2.ext
|        file_3.ext

sdf = read_vector_files(
    path="/path/to/files/",
    suffix=".ext",
)

will read file_0.ext and file_1.ext, while

sdf = read_vector_files(
    path="/path/to/files/",
    suffix=".ext",
    recursive=True,
)

will read file_0.ext, file_1.ext, subfolder/file_2.ext, and subfolder/file_3.ext.

Reading layers

Read a specific layer from a file or files, using the layer name:

sdf = read_vector_files(
    path="/path/to/files/",
    suffix=".ext",
    layer_identifier="layer_name"
)

or layer index:

sdf = read_vector_files(
    path="/path/to/files/",
    suffix=".ext",
    layer_identifier=1
)

GDAL Virtual File Systems

Read compressed files using GDAL Virtual File Systems:

sdf = read_vector_files(
    path="/path/to/files/",
    suffix=".gz",
    layer_identifier="layer_name",
    vsi_prefix="/vsigzip/",
)

For more information, see GDAL Virtual File Systems.
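The same pattern should extend to GDAL's other virtual file systems. As a sketch (assuming each archive contains the target layer), zipped files could be read with the /vsizip/ prefix:

sdf = read_vector_files(
    path="/path/to/files/",
    suffix=".zip",
    layer_identifier="layer_name",
    vsi_prefix="/vsizip/",
)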

User-defined Schema

By default, a schema will be generated from the first file in the folder. For a single tabular dataset that has been partitioned across several files, this will work fine.

However, it won’t work for a format like GML, as not every file will necessarily contain the same fields. In this case, you can define a schema yourself. You will also need to set the coerce_to_schema flag to True.

from pyspark.sql.types import BinaryType, LongType, StringType, StructField, StructType

schema = StructType(
    [
        StructField("id", LongType()),
        StructField("category", StringType()),
        StructField("geometry", BinaryType()),
    ]
)

sdf = read_vector_files(
    path="/path/to/files/",
    suffix=".ext",
    layer_identifier="layer_name",
    schema=schema,
    coerce_to_schema=True,
)

Concurrency Strategy

By default, the function will parallelise across files.

This should work well for a single dataset that has been partitioned across several files, especially if it has been partitioned so that the individual files can be comfortably read into memory on a single machine.

However, the function also provides a way of parallelising across chunks of rows within a file or files.

sdf = read_vector_files(
    path="/path/to/files/",
    suffix=".ext",
    concurrency_strategy="rows",
)

By default, a chunk will consist of 1 million rows but you can change this using the ideal_chunk_size parameter.

sdf = read_vector_files(
    path="/path/to/files/",
    suffix=".ext",
    concurrency_strategy="rows",
    ideal_chunk_size=5_000_000,
)

Warning

Reading chunks adds a substantial overhead as files have to be opened to get a row count. The “rows” strategy should only be used for a single large file or a small number of large files.

Reading a GeoPackage using Spark’s JDBC drivers

Register the GeoPackage Dialect

To read a GeoPackage using Spark’s JDBC drivers, you will need to register a custom mapping of GeoPackage to Spark Catalyst types.

To do this, run the following in a Databricks notebook cell:

%scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types._

object GeoPackageDialect extends JdbcDialect {

  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlite")

  override def getCatalystType(
    sqlType: Int,
    typeName: String,
    size: Int,
    md: MetadataBuilder
  ): Option[DataType] = typeName match {
    case "BOOLEAN" => Some(BooleanType)
    case "TINYINT" => Some(ByteType)
    case "SMALLINT" => Some(ShortType)
    case "MEDIUMINT" => Some(IntegerType)
    case "INT" | "INTEGER" => Some(LongType)
    case "FLOAT" => Some(FloatType)
    case "DOUBLE" | "REAL" => Some(DoubleType)
    case "TEXT" => Some(StringType)
    case "BLOB" => Some(BinaryType)
    case "GEOMETRY" | "POINT" | "LINESTRING" | "POLYGON" | "MULTIPOINT" |
         "MULTILINESTRING" | "MULTIPOLYGON" | "GEOMETRYCOLLECTION" | "CIRCULARSTRING" |
         "COMPOUNDCURVE" | "CURVEPOLYGON" | "MULTICURVE" | "MULTISURFACE" | "CURVE" |
         "SURFACE" => Some(BinaryType)
    case "DATE" => Some(DateType)
    case "DATETIME" => Some(StringType)
  }
}

JdbcDialects.registerDialect(GeoPackageDialect)

Once you’ve registered the GeoPackage dialect, you can list the layers in a given GeoPackage, get the coordinate reference system or bounding box of a given layer, or read a layer into a Spark DataFrame with the geometry column encoded as Well Known Binary.

List layers

To list the layers in a given GeoPackage, use list_layers():

list_layers(
    path="/path/to/file.gpkg"
)

Get the coordinate reference system of a layer

To get the coordinate reference system of a layer in a given GeoPackage, use get_crs():

get_crs(
    path="/path/to/file.gpkg",
    layer_name="layer_name",
)

Get the bounding box of a layer

To get the bounding box of a layer in a given GeoPackage, use get_bounds():

get_bounds(
    path="/path/to/file.gpkg",
    layer_name="layer_name",
)

Read a layer into a Spark DataFrame

To read a layer into a Spark DataFrame, use read_gpkg(). If you don’t supply a layer, the first layer will be read:

read_gpkg(
    path="/path/to/file.gpkg"
)

Read a specific layer

If you supply a layer name, that layer will be read:

read_gpkg(
    path="/path/to/file.gpkg",
    layer_name="layer_name",
)

Supply a custom geometry column name

The function assumes that the geometry column is called geom. If it’s called something else, supply the name using original_geometry_column_name:

read_gpkg(
    path="/path/to/file.gpkg",
    layer_name="layer_name",
    original_geometry_column_name="geometry",
)

GeoPackage Geometry Encoding

GeoPackage Geometry Encoding consists of two parts:

  1. A GeoPackage Binary (GPB) header

  2. A Well Known Binary (WKB) geometry

By default, the function will split the original geometry column into these two parts, unpack the GPB header, and return both as, respectively, gpb_header and wkb_geometry.
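As a rough illustration of that layout (not the library's implementation), a header with a little-endian, XY-only envelope (i.e. 40 bytes) could be split from the WKB like this:

import struct

# Minimal sketch based on the GeoPackage Binary layout, assuming a little-endian
# header with an [minx, maxx, miny, maxy] envelope (i.e. a 40-byte GPB header)
def split_gpb(gpb: bytes):
    magic, version, flags, srs_id = struct.unpack_from("<2sBBi", gpb, 0)
    assert magic == b"GP"  # every GPB blob starts with the bytes "GP"
    minx, maxx, miny, maxy = struct.unpack_from("<4d", gpb, 8)  # the envelope
    return gpb[:40], gpb[40:]  # (gpb_header, wkb_geometry)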

If you don’t want the GPB header, set drop_gpb_header to True:

read_gpkg(
    path="/path/to/file.gpkg",
    layer_name="layer_name",
    drop_gpb_header=True,
)

The length of the GPB header depends on the format of the envelope (bounding box) it contains. The function assumes that the envelope will be [minx, maxx, miny, maxy] and that the GPB header will therefore be 40 bytes long. However, the envelope may also be [minx, maxx, miny, maxy, minz, maxz] or [minx, maxx, miny, maxy, minm, maxm], giving a 56-byte header, or [minx, maxx, miny, maxy, minz, maxz, minm, maxm], giving a 72-byte header.

In these cases you will need to set header_length to the appropriate value:

read_gpkg(
    path="/path/to/file.gpkg",
    layer_name="layer_name",
    header_length=56,
)