rasterio/DESIGN.rst
Sean Gillies 763469d15c
Draft of project design notes (#2345)
* Draft of project design notes

* More design notes

* Mention array vs point issue

* Add a bit about tools and the CLI

* More introduction
2021-12-13 10:20:48 -07:00


============
Design Notes
============
Rasterio's design can be deduced from its code, but we can make it even more
comprehensible by writing about it in simple language. That's what this
document is about: describing the abstractions and design of the software to
project developers.

Rasterio has low-level abstractions and higher-level abstractions. Let's be
clear: none of them are as high as some users want. Rasterio has no zonal stats
feature. No NDVI feature. No interactive mapping features. But it does provide
low-level abstractions that can be used to build these features in other
applications.

Interfaces
==========
Rasterio has interfaces that are not yet described using abstract base classes
or any other formal interface system. The following subsections describe them
briefly.

DataAccessor
------------
This interface is involved with opening a dataset for access and is implemented
by the DatasetReader and DatasetWriter classes. Their constructors take a str
or os.PathLike object and, internally, attempt to adapt it to a
rasterio.path.Path object.

A DataAccessor is in some ways analogous to a Python I/O stream. It has an
access mode: "r", "r+", "w", or "w+". It can be in an open or closed state. It
is a context manager. It has methods that read or write unlabeled arrays of
raster pixels to or from a dataset or optional windows (think slices) of a
dataset. A DataAccessor has more attributes than a Python I/O stream. There's
no "encoding" but there is a "crs" describing the coordinate reference system
for the pixels and a "transform", "gcps", or "rpcs" attribute describing how the
array indices map to coordinates in that system.

Raster bands are not one of rasterio's abstractions. We don't read data from
the band of a dataset. We read multi-dimensional data from a dataset via a
DataAccessor.
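
The description above is informal. As a rough illustration only, it could be
written down as a typing.Protocol. The names DataAccessorLike and
InMemoryAccessor, the member list, and the tuple form of the window are
assumptions made for this sketch and are not part of rasterio.

.. code-block:: python

    from typing import Any, Optional, Protocol, runtime_checkable


    @runtime_checkable
    class DataAccessorLike(Protocol):
        """Hypothetical sketch of the DataAccessor interface (not rasterio API)."""

        mode: str       # "r", "r+", "w", or "w+"
        closed: bool    # open or closed state
        crs: Any        # coordinate reference system of the pixels
        transform: Any  # maps array indices to coordinates

        def read(self, window: Optional[Any] = None) -> Any: ...
        def write(self, arr: Any, window: Optional[Any] = None) -> None: ...
        def close(self) -> None: ...
        def __enter__(self) -> "DataAccessorLike": ...
        def __exit__(self, *exc_info: Any) -> None: ...


    class InMemoryAccessor:
        """Toy stand-in that keeps its pixels in a nested list."""

        def __init__(self, rows, mode="r"):
            self._rows = [list(r) for r in rows]
            self.mode = mode
            self.closed = False
            self.crs = None
            self.transform = None

        def read(self, window=None):
            if self.closed:
                raise ValueError("accessor is closed")
            if window is None:
                return [list(r) for r in self._rows]
            # window is ((row_start, row_stop), (col_start, col_stop))
            (r0, r1), (c0, c1) = window
            return [row[c0:c1] for row in self._rows[r0:r1]]

        def write(self, arr, window=None):
            if self.closed or self.mode == "r":
                raise ValueError("accessor is closed or read-only")
            if window is None:
                self._rows = [list(r) for r in arr]
                return
            (r0, r1), (c0, c1) = window
            for i, row in enumerate(arr):
                self._rows[r0 + i][c0:c1] = row

        def close(self):
            self.closed = True

        def __enter__(self):
            return self

        def __exit__(self, *exc_info):
            self.close()

Like a stream, the toy accessor is opened in a mode, used in a with block, and
closed on exit. Note that read and write take a window, never a band object.
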

Array
-----
A DataAccessor trades in dense (not sparse), unlabeled NumPy arrays with at
least two dimensions: row and column, or line and pixel. In the case of
multichannel/multiband datasets, like RGB imagery, there can also be a third
dimension corresponding to the channel or band. For these, the dimensions would
be: band, row, and column, in that order.
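
The axis order can be seen with plain NumPy arrays. This is a small sketch
with made-up array sizes; it does not read any dataset.

.. code-block:: python

    import numpy as np

    # A single-band raster has two dimensions: (rows, columns).
    single = np.zeros((4, 5), dtype="uint8")

    # A three-band raster, such as RGB imagery: (bands, rows, columns).
    rgb = np.zeros((3, 4, 5), dtype="uint8")
    rgb[0] = 255  # fill the first band only

    # Many image libraries expect (rows, columns, bands) instead. Moving
    # the band axis to the end converts between the two layouts.
    rgb_last = np.moveaxis(rgb, 0, -1)

    print(single.shape, rgb.shape, rgb_last.shape)

Indexing the first axis of the three-dimensional array, as in rgb[0], yields
one two-dimensional band.
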

Elements of these arrays generally represent values integrated over an area.
Gridded, possibly sparse, point data can be handled, but that interpretation is
not the default, as it is in xarray, for example.

rasterio.path.Path
------------------
GDAL's GDALOpenEx takes an array of UTF-8 encoded bytes as its primary
argument. These bytes may contain a filename, a URL, an RDBMS connection
string, XML, or JSON. Almost any kind of dataset address, really. GDAL puts no
constraint on the content at all. A future format driver might use an array of
emoji to address data and GDAL would be fine with that.

A rasterio.path.Path object contains a GDAL dataset address and has an as_vsi()
method, the result of which can be UTF-8 encoded and given to GDALOpenEx.
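
To make the idea concrete, here is a minimal sketch of an address object with
an as_vsi() method. The DatasetAddress class, its encoded() helper, and the
small scheme table are hypothetical; rasterio's own implementation handles
more cases and may differ in detail. The /vsis3/, /vsigs/, and /vsicurl/
prefixes are GDAL virtual file system prefixes.

.. code-block:: python

    from dataclasses import dataclass
    from urllib.parse import urlparse

    # A few of GDAL's virtual file system prefixes, keyed by URL scheme.
    VSI_PREFIXES = {"s3": "/vsis3/", "gs": "/vsigs/"}
    CURL_SCHEMES = {"http", "https", "ftp"}


    @dataclass(frozen=True)
    class DatasetAddress:
        """Hypothetical stand-in for rasterio.path.Path; holds a raw address."""

        address: str

        def as_vsi(self) -> str:
            parts = urlparse(self.address)
            if parts.scheme in VSI_PREFIXES:
                # s3://bucket/key becomes /vsis3/bucket/key
                return VSI_PREFIXES[parts.scheme] + parts.netloc + parts.path
            if parts.scheme in CURL_SCHEMES:
                # Remote files go through /vsicurl/ with the URL kept whole.
                return "/vsicurl/" + self.address
            # Filenames, connection strings, XML, and so on pass through.
            return self.address

        def encoded(self) -> bytes:
            # The bytes that would be handed to GDALOpenEx.
            return self.as_vsi().encode("utf-8")


    print(DatasetAddress("s3://bucket/key.tif").as_vsi())

Anything the sketch does not recognize is passed through untouched, in
keeping with GDAL placing no constraint on the content of an address.
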

This interface isn't meant for public consumption. We might make it private, to
the extent that anything can be private in Python.

DataPath
--------
By analogy to Python's pathlib.Path, a rasterio DataPath has an open() method
that returns a DataAccessor.

rasterio.io.MemoryFile and rasterio.io.FilePath implement the DataPath
interface.

Tools
-----
The issue at https://github.com/rasterio/rasterio/issues/1300 describes
rasterio's higher-level tool abstraction. A tool is more or less the guts of a
command line program, minus the argument and option parsing. It works on named
datasets, not on arrays or Python objects.

The tool abstraction is: given names of input and output files and driver and
environment configuration parameters, the tool transforms pixels quickly and
efficiently, absorbing the complexity of lazy data loading and concurrency.

Opening a dataset
=================
rasterio.open() accepts a variety of inputs and returns a DataAccessor.

If the input implements DataPath, open() delegates to the input object. If the
input can be adapted to DataPath, open() delegates to the adapter. If the
input is a str or os.PathLike, it is adapted to rasterio.path.Path and passed
to a DataAccessor constructor.
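
The three cases can be sketched as a small dispatch function. Everything
named here (Accessor, DataPath, MemoryFileSketch, FileObjectPath,
open_dataset) is a placeholder invented for illustration. In particular,
treating any object with a read attribute as adaptable is an assumption of
this sketch, not a statement about how rasterio.open() decides.

.. code-block:: python

    import io
    import os
    import pathlib
    from abc import ABC, abstractmethod


    class Accessor:
        """Placeholder for DatasetReader and DatasetWriter."""

        def __init__(self, address, mode="r"):
            self.address = address
            self.mode = mode


    class DataPath(ABC):
        """Placeholder for the DataPath interface."""

        @abstractmethod
        def open(self, mode="r"):
            """Return an Accessor for the data this object addresses."""


    class MemoryFileSketch(DataPath):
        def __init__(self, name="/vsimem/sketch.tif"):
            self.name = name

        def open(self, mode="r"):
            return Accessor(self.name, mode)


    class FileObjectPath(DataPath):
        """Adapter giving an open Python file object a DataPath interface."""

        def __init__(self, fileobj):
            self.fileobj = fileobj

        def open(self, mode="r"):
            return Accessor(f"<{type(self.fileobj).__name__}>", mode)


    def open_dataset(fp, mode="r"):
        if isinstance(fp, DataPath):
            # Case 1: the input implements DataPath, so delegate to it.
            return fp.open(mode)
        if hasattr(fp, "read"):
            # Case 2: the input can be adapted, so delegate to the adapter.
            return FileObjectPath(fp).open(mode)
        if isinstance(fp, (str, os.PathLike)):
            # Case 3: a name or path is handed to an accessor constructor.
            return Accessor(os.fspath(fp), mode)
        raise TypeError(f"cannot open object of type {type(fp).__name__}")


    print(open_dataset(MemoryFileSketch()).address)
    print(open_dataset(io.BytesIO(b"")).address)
    print(open_dataset(pathlib.PurePosixPath("data/example.tif")).address)

The sketch uses an explicit base class rather than checking for an open()
method, because pathlib paths also have an open() method and would otherwise
be mistaken for a DataPath.
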

Data types
==========
Rasterio uses NumPy data types and translates these to GDAL types before
calling GDAL methods.
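
A translation of this kind can be sketched as a lookup table. The numeric
codes are values of GDAL's GDALDataType enumeration; the GDT_CODES table and
the gdal_type_code helper are hypothetical, cover only a few types, and are
not rasterio's actual mapping.

.. code-block:: python

    import numpy as np

    # GDALDataType enumeration values for some common types.
    GDT_CODES = {
        "uint8": 1,    # GDT_Byte
        "uint16": 2,   # GDT_UInt16
        "int16": 3,    # GDT_Int16
        "uint32": 4,   # GDT_UInt32
        "int32": 5,    # GDT_Int32
        "float32": 6,  # GDT_Float32
        "float64": 7,  # GDT_Float64
    }


    def gdal_type_code(dtype) -> int:
        """Return the GDAL type code for a NumPy dtype (hypothetical helper)."""
        name = np.dtype(dtype).name  # accepts strings, types, and dtype objects
        try:
            return GDT_CODES[name]
        except KeyError:
            raise TypeError(f"no GDAL type in this sketch for {name!r}") from None


    print(gdal_type_code("float32"), gdal_type_code(np.uint8))

Normalizing through np.dtype first means the caller can pass a string, a
NumPy scalar type, or the dtype of an existing array.
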

GDAL context
============
GDAL relies on global state in the form of format drivers, a connection pool,
an error stack, caches, and configuration for these and optional software
features. Rasterio presents this context as a Python object:
rasterio.env.local._env. The rasterio.env.Env context manager is rasterio's
abstraction for configuration of the context. Importing rasterio creates the
absolute minimum of GDAL global state. It is not until an instance of
rasterio.env.Env is created and its context is entered, whether explicitly or
implicitly (by calling rasterio.open), that format drivers are registered and
rasterio.env.local._env is no longer None.

Many of rasterio's functions and methods require GDAL's context to be fully
initialized. To make this easy to ensure, we can use decorators from the
rasterio.env module. See, for example, the exists function in
rasterio/shutil.pyx.
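
The decorator pattern can be sketched in pure Python. The local object, the
Env class, the ensure_env decorator, and the exists function below are toy
stand-ins written for this note. They mimic only the behavior described
above and register no drivers.

.. code-block:: python

    import functools
    import threading

    local = threading.local()  # toy stand-in for rasterio.env.local


    def current_env():
        # The attribute is missing until some thread sets it.
        return getattr(local, "_env", None)


    class Env:
        """Toy context manager standing in for rasterio.env.Env."""

        def __init__(self, **options):
            self.options = options
            self._created = False

        def __enter__(self):
            if current_env() is None:
                # First entry: this is where drivers would be registered.
                local._env = dict(self.options)
                self._created = True
            return self

        def __exit__(self, *exc_info):
            if self._created:
                local._env = None
                self._created = False


    def ensure_env(func):
        """Run func inside an Env if no context is active yet."""

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if current_env() is not None:
                return func(*args, **kwargs)
            with Env():
                return func(*args, **kwargs)

        return wrapper


    @ensure_env
    def exists(path):
        # A real implementation would ask GDAL. Here we only show that the
        # context is guaranteed to be set up by the time the body runs.
        assert current_env() is not None
        return bool(path)

A decorated function can be called with or without an enclosing with Env()
block. In the first case it reuses the active context, in the second it
creates a temporary one.
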

Errors and exceptions
=====================
GDAL maintains an error stack and a registry of handlers that are called when
an error is pushed onto the stack. Rasterio registers a handler that routes
GDAL error messages to Python's logging module. We don't enable registration of
other handlers. Instead, users and developers should work with Python's
logging module.

Additionally, we check the error stack after calling GDAL functions from Cython
extension code and raise a Python exception if the last error's GDAL error
class is 3 (CE_Failure) or higher. Several functions in rasterio._err exist to
help: exc_wrap_int, exc_wrap_pointer, etc.
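
The check can be modeled in pure Python. The error class numbers follow
GDAL's CPLErr enumeration. The error_stack list, push_error, GDALSketchError,
and exc_wrap_int_sketch are hypothetical stand-ins, loosely modeled on the
idea behind exc_wrap_int, and do not reproduce the code in rasterio._err.

.. code-block:: python

    import logging

    log = logging.getLogger("gdal_sketch")

    # GDAL's CPLErr error classes.
    CE_None, CE_Debug, CE_Warning, CE_Failure, CE_Fatal = range(5)

    LOG_LEVELS = {
        CE_Debug: logging.DEBUG,
        CE_Warning: logging.WARNING,
        CE_Failure: logging.ERROR,
        CE_Fatal: logging.CRITICAL,
    }

    error_stack = []  # toy stand-in for GDAL's error stack


    class GDALSketchError(RuntimeError):
        """Raised by the sketch for errors of class CE_Failure or higher."""


    def push_error(err_class, err_no, msg):
        """Record an error and route its message to Python's logging."""
        error_stack.append((err_class, err_no, msg))
        level = LOG_LEVELS.get(err_class, logging.INFO)
        log.log(level, "GDAL error %d: %s", err_no, msg)


    def exc_wrap_int_sketch(retval):
        """Return retval unless the last recorded error is class 3 or higher."""
        if error_stack:
            err_class, err_no, msg = error_stack[-1]
            error_stack.clear()
            if err_class >= CE_Failure:
                raise GDALSketchError(f"[{err_no}] {msg}")
        return retval

Warnings and debug messages are logged but do not interrupt the caller. Only
failures and fatal errors become exceptions.
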

GDAL raster band cache
======================
GDAL has a per-process in-memory LRU (least recently used) raster block cache.
Calling a DataAccessor's read method populates the cache with blocks.
Subsequent reads from the same accessor may reuse those cached blocks. Calling
a DataAccessor's write method updates cached blocks. Updated blocks are written
to the dataset's storage when they are evicted from the cache or when the
DataAccessor is closed, which flushes all of the dataset's cached blocks.
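
The behavior described above is that of a write-back LRU cache. The toy model
below, with the hypothetical class BlockCacheSketch and a plain dict standing
in for storage, illustrates the eviction and flush rules. It is not GDAL's
implementation, and unlike GDAL's cache it is not shared across datasets.

.. code-block:: python

    from collections import OrderedDict


    class BlockCacheSketch:
        """Toy LRU write-back cache over a dict of block_id -> bytes."""

        def __init__(self, storage, max_blocks=2):
            self.storage = storage
            self.max_blocks = max_blocks
            self._blocks = OrderedDict()  # block_id -> [data, dirty]

        def _touch(self, block_id, data, dirty):
            # Mark the block as most recently used, then evict the least
            # recently used blocks until the cache fits again.
            self._blocks[block_id] = [data, dirty]
            self._blocks.move_to_end(block_id)
            while len(self._blocks) > self.max_blocks:
                old_id, (old_data, old_dirty) = self._blocks.popitem(last=False)
                if old_dirty:
                    self.storage[old_id] = old_data  # write back on eviction

        def read(self, block_id):
            if block_id in self._blocks:
                data, dirty = self._blocks[block_id]
            else:
                data, dirty = self.storage[block_id], False
            self._touch(block_id, data, dirty)
            return data

        def write(self, block_id, data):
            # Writes only update the cache; storage is updated later.
            self._touch(block_id, data, True)

        def flush(self):
            # Closing a dataset corresponds to flushing its cached blocks.
            for block_id, (data, dirty) in self._blocks.items():
                if dirty:
                    self.storage[block_id] = data
            self._blocks.clear()

A consequence worth noting is that a write is not visible in storage until
the block is evicted or the cache is flushed.
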

Rasterio has no abstraction for this cache.

Command line interface
======================
Rasterio includes a command line program named "rio". It shares a set of
options with the "fio" program from the Fiona project (the vector counterpart
to rasterio). The rio program has one level of subcommands. The subcommands do
different things, though there is a little bit of overlap so that users don't
always have to call multiple commands to get a slightly different result.
Raster operations don't compose as readily as line-oriented text operations do.