API Reference

pyspark_vector_files.read_vector_files(path: str, ogr_to_spark_type_map: MappingProxyType = OGR_TO_SPARK, pattern: str = '*', suffix: str = '*', recursive: bool = False, ideal_chunk_size: int = 1000000, geom_field_name: str = 'geometry', geom_field_type: Tuple[int, int] = (8, 0), coerce_to_schema: bool = False, spark_to_pandas_type_map: MappingProxyType = SPARK_TO_PANDAS, concurrency_strategy: str = 'files', vsi_prefix: Optional[str] = None, schema: Optional[StructType] = None, layer_identifier: Optional[Union[str, int]] = None) → DataFrame

Read vector file(s) into a Spark DataFrame.

Parameters:
  • path (str) – Path to a folder of vector files.

  • ogr_to_spark_type_map (MappingProxyType) – A mapping of OGR to Spark data types. Defaults to OGR_TO_SPARK.

  • pattern (str) – A filename pattern. This will be passed to pathlib’s Path.glob method. For more information, see https://docs.python.org/3/library/pathlib.html#pathlib.Path.glob. Defaults to “*”.

  • suffix (str) – A file extension pattern. This will be passed to pathlib’s Path.glob method. For more information, see https://docs.python.org/3/library/pathlib.html#pathlib.Path.glob. Defaults to “*”.

  • recursive (bool) – If True, recursive globbing is enabled. For more information, see https://docs.python.org/3/library/pathlib.html#pathlib.Path.rglob. Defaults to False.

  • ideal_chunk_size (int) – The maximum number of rows to be read into a chunk. Defaults to 1_000_000.

  • geom_field_name (str) – The name of the geometry column. Defaults to “geometry”.

  • geom_field_type (Tuple[int, int]) – The data type of the geometry column when it is passed to Spark. Defaults to (8, 0), which represents binary data.

  • coerce_to_schema (bool) – If True, all files or chunks will be forced to fit the supplied schema. Missing columns will be added and additional columns will be removed. Defaults to False.

  • spark_to_pandas_type_map (MappingProxyType) – A mapping of Spark to Pandas data types. Defaults to SPARK_TO_PANDAS.

  • concurrency_strategy (str) – The concurrency strategy to use, can be “files” or “rows”. Defaults to “files”.

  • vsi_prefix (str, optional) – The GDAL virtual file system prefix(es) to use. For more information, see https://gdal.org/user/virtual_file_systems.html. Defaults to None.

  • schema (StructType, optional) – A user-defined Spark schema. Defaults to None.

  • layer_identifier (Union[str, int], optional) – A layer name or index. If None is given, the first layer will be returned. Defaults to None.

Returns:

A Spark DataFrame with geometry encoded as Well Known Binary (WKB).

Return type:

SparkDataFrame
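
Examples

A minimal usage sketch. The folder paths, suffixes, and column names below are illustrative only, not taken from the library’s documentation:

>>> sdf = read_vector_files(
    "/dbfs/mnt/data/vector_files/",
    suffix=".shp",
)

Reading compressed files via a GDAL virtual file system prefix:

>>> sdf = read_vector_files(
    "/dbfs/mnt/data/zipped_vector_files/",
    suffix=".zip",
    vsi_prefix="/vsizip/",
)

Coercing heterogeneous files to a user-defined schema (the column names here are hypothetical):

>>> from pyspark.sql.types import BinaryType, StringType, StructField, StructType
>>> schema = StructType(
    [
        StructField("id", StringType()),
        StructField("geometry", BinaryType()),
    ]
)
>>> sdf = read_vector_files(
    "/dbfs/mnt/data/vector_files/",
    suffix=".shp",
    schema=schema,
    coerce_to_schema=True,
)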

pyspark_vector_files.gpkg.list_layers(path: str, spark: Optional[SparkSession] = None) → List[str]

List the layers in a GeoPackage.

Examples

>>> list_layers(
    "/dbfs/mnt/base/unrestricted/source_rpa_spatial_data_mart/dataset_rpa_reference_parcels/format_GPKG_rpa_reference_parcels/LATEST_rpa_reference_parcels/reference_parcels.gpkg"
)
['reference_parcels']

Parameters:
  • path (str) – Path to the GeoPackage. Must use the File API Format, i.e. start with /dbfs instead of dbfs:.

  • spark (Optional[SparkSession], optional) – The SparkSession to use. If none is provided, the active session will be used. Defaults to None.

Returns:

A list of layer names.

Return type:

List[str]

pyspark_vector_files.gpkg.get_crs(path: str, layer_name: str, spark: Optional[SparkSession] = None) → str

Get the CRS of a layer in a GeoPackage.

Examples

>>> get_crs(
    "/dbfs/mnt/base/unrestricted/source_rpa_spatial_data_mart/dataset_rpa_reference_parcels/format_GPKG_rpa_reference_parcels/LATEST_rpa_reference_parcels/reference_parcels.gpkg",
    layer_name="reference_parcels",
)
'PROJCS["OSGB 1936 / British National Grid",GEOGCS["OSGB 1936",DATUM["OSGB_1936",SPHEROID["Airy 1830",6377563.396,299.3249646,AUTHORITY["EPSG","7001"]],AUTHORITY["EPSG","6277"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4277"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",49],PARAMETER["central_meridian",-2],PARAMETER["scale_factor",0.9996012717],PARAMETER["false_easting",400000],PARAMETER["false_northing",-100000],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","27700"]]'

Parameters:
  • path (str) – Path to the GeoPackage. Must use the File API Format, i.e. start with /dbfs instead of dbfs:.

  • layer_name (str) – The name of the layer to read.

  • spark (Optional[SparkSession], optional) – The SparkSession to use. If none is provided, the active session will be used. Defaults to None.

Returns:

Well-known Text Representation of the Spatial Reference System.

Return type:

str

pyspark_vector_files.gpkg.read_gpkg(path: str, original_geometry_column_name: str = 'geom', header_length: int = 40, drop_gpb_header: bool = False, layer_name: Optional[str] = None, spark: Optional[SparkSession] = None) → DataFrame

Read a GeoPackage into a Spark DataFrame.

Parameters:
  • path (str) – Path to the GeoPackage. Must use the File API Format, i.e. start with /dbfs instead of dbfs:.

  • original_geometry_column_name (str) – The name of the geometry column of the layer in the GeoPackage. Defaults to “geom”.

  • header_length (int) – The length of the GPB header, in bytes. Defaults to 40.

  • drop_gpb_header (bool) – Whether to exclude the GPB header from the returned DataFrame. Defaults to False.

  • layer_name (str, optional) – The name of the layer to read. If none is provided, the first layer is read. Defaults to None.

  • spark (SparkSession, optional) – The SparkSession to use. If none is provided, the active session will be used. Defaults to None.

Returns:

A Spark DataFrame with geometry encoded as Well Known Binary (WKB).

Return type:

SparkDataFrame
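
Examples

A minimal usage sketch. The path and layer name below are illustrative only, not taken from the library’s documentation:

>>> sdf = read_gpkg(
    "/dbfs/mnt/data/example.gpkg",
    layer_name="example_layer",
    drop_gpb_header=True,
)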