From 763469d15c37c34104b8edb6256c6fdb9fdb4ae4 Mon Sep 17 00:00:00 2001 From: Sean Gillies Date: Mon, 13 Dec 2021 10:20:48 -0700 Subject: [PATCH] Draft of project design notes (#2345) * Draft of project design notes * More design notes * Mention array vs point issue * Add a bit about tools and the CLI * More introduction --- DESIGN.rst | 158 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 158 insertions(+) create mode 100644 DESIGN.rst diff --git a/DESIGN.rst b/DESIGN.rst new file mode 100644 index 00000000..1d55c1e9 --- /dev/null +++ b/DESIGN.rst @@ -0,0 +1,158 @@ +============ +Design Notes +============ + +Rasterio's design can be deduced from its code, but we can make it even more +comprehensible by writing about it in simple language. That's what this +document is about: describing the abstractions and design of the software to +project developers. + +Rasterio has low level abstractions and higher level abstractions. Let's be +clear: none of them are as high as some users want. Rasterio has no zonal stats +feature. No NDVI feature. No interactive mapping features. But it does provide +low-level abstractions that can be used to build these features in other +applications. + +Interfaces +========== + +Rasterio has interfaces that are not yet described using abstract base classes +or other formal interface system. The following subsections describe them +briefly. + +DataAccessor +------------ + +This interface is involved with opening a dataset for access and is implemented +by the DatasetReader and DatasetWriter classes. Their constructors take a str +or os.PathLike object and, internally, attempt to adapt it to a +rasterio.path.Path object. + +A DataAccessor is in some ways analogous to a Python I/O stream. It has an +access mode: "r", "r+", "w", or "w+". It can be in open or closed state. It is +a context manager. 
It has methods that read or write unlabeled arrays of raster +pixels to or from a dataset or optional windows (think slices) of a dataset. A +DataAccessor has more attributes than a Python I/O stream. There's no "encoding" +but there is a "crs" describing the coordinate reference system for the pixels +and a "transform", "gcps", or "rpcs" attribute describing how the array indices +map to coordinates in that system. + +Raster bands are not one of rasterio's abstractions. We don't read data from +the band of a dataset. We read multi-dimensional data from a dataset via a +DataAccessor. + +Array +----- + +A DataAccessor trades in not-sparse (dense) unlabeled Numpy arrays with a +minimum dimension of 2: row and column, or line and pixel. In the case of +multichannel/multiband datasets, like RGB imagery, there can also be a third +dimension corresponding to the channel or band. For these, the dimensions would +be: band, row, and column, in that order. + +Elements of these arrays generally represent values integrated over an area. +Gridded, possibly sparse, point data can be handled, but it is not the default +as it is with, for example, xarray. + +rasterio.path.Path +------------------ + +GDAL's GDALOpenEx takes an array of UTF-8 encoded bytes as its primary +argument. These bytes may contain a filename, a URL, an RDBMS connection +string, XML, or JSON. Almost any kind of dataset address, really. GDAL puts no +constraint on the content at all. A future format driver might use an array of +emoji to address data and GDAL would be fine with that. + +A rasterio.path.Path object contains a GDAL dataset address and has an as_vsi() +method, the result of which can be UTF-8 encoded and given to GDALOpenEx. + +This interface isn't meant for public consumption. We might make it private, to +the extent that anything can be private in Python. + +DataPath +-------- + +By analogy to Python's pathlib.Path, a rasterio DataPath has an open() method +that returns a DataAccessor. 
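The relationship between the two interfaces described so far can be sketched in plain Python. This is only an illustration under stated assumptions, not rasterio code: ToyAccessor, ToyMemoryPath, and the Grid alias are hypothetical names, nested lists stand in for arrays, and the DataPath protocol shown here is one possible formalization of an interface the notes describe only informally.

```python
from typing import List, Protocol, runtime_checkable

# Hypothetical stand-in for a dense 2-D array: a list of rows.
Grid = List[List[int]]


class ToyAccessor:
    """Toy DataAccessor: has an access mode, an open or closed state,
    and works as a context manager, much like a Python I/O stream."""

    def __init__(self, pixels: Grid, mode: str = "r") -> None:
        if mode not in ("r", "r+", "w", "w+"):
            raise ValueError(f"invalid mode: {mode!r}")
        self._pixels = pixels
        self.mode = mode
        self.closed = False

    def read(self) -> Grid:
        """Return a copy of the pixels; refuse if the accessor is closed."""
        if self.closed:
            raise ValueError("dataset is closed")
        return [row[:] for row in self._pixels]

    def close(self) -> None:
        self.closed = True

    def __enter__(self) -> "ToyAccessor":
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.close()


@runtime_checkable
class DataPath(Protocol):
    """Assumed shape of the DataPath interface: anything with open()."""

    def open(self, mode: str = "r") -> ToyAccessor: ...


class ToyMemoryPath:
    """Hypothetical in-memory DataPath holding pixels directly."""

    def __init__(self, pixels: Grid) -> None:
        self._pixels = pixels

    def open(self, mode: str = "r") -> ToyAccessor:
        return ToyAccessor(self._pixels, mode)


path = ToyMemoryPath([[1, 2], [3, 4]])
with path.open() as dataset:
    pixels = dataset.read()
```

Because the protocol is structural, ToyMemoryPath satisfies DataPath without inheriting from it, which mirrors how the notes describe interfaces that exist without abstract base classes.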
+ +rasterio.io.MemoryFile and rasterio.io.FilePath implement the DataPath +interface. + +Tools +----- + +The issue at https://github.com/rasterio/rasterio/issues/1300 describes +rasterio's higher level tool abstraction. A tool is more or less the guts of a +command line program, minus the argument and option parsing. It works on named +datasets, not on arrays or Python objects. + +The tool abstraction is: given names of input and output files and driver and +environment configuration parameters, the tool transforms pixels quickly and +efficiently, absorbing the complexity of lazy data loading and concurrency. + +Opening a dataset +================= + +rasterio.open() accepts a variety of inputs and returns a DataAccessor. + +If the input implements DataPath, open() delegates to the input object. If the +input can be adapted to DataPath, open() delegates to the adapter. If the +input is a str or os.PathLike, it is adapted to rasterio.path.Path and passed +to a DataAccessor constructor. + +Data types +========== + +Rasterio uses Numpy data types and translates these to GDAL types before +calling GDAL methods. + +GDAL context +============ + +GDAL relies on global state in the form of format drivers, a connection pool, +an error stack, caches, and configuration for these and optional software +features. Rasterio presents this context as a Python object: +rasterio.env.local._env. The rasterio.env.Env context manager is rasterio's +abstraction for configuration of the context. Importing rasterio creates the +absolute minimum of GDAL global state. It is not until an instance of +rasterio.env.Env is created and its context is entered, whether explicitly or +implicitly (by calling rasterio.open), that format drivers are registered and +rasterio.env.local._env becomes not None. + +Many methods of rasterio require GDAL's context to be fully initialized. To +make this easy to ensure, we can use decorators from the rasterio.env module. 
+See for example the exists function in rasterio/shutil.pyx. + +Errors and exceptions +===================== + +GDAL maintains an error stack and a registry of handlers that are called when +an error is pushed onto the stack. Rasterio registers a handler that routes +GDAL error messages to Python's logger. We don't enable registration of other +handlers. Instead, users and developers should work with Python's logger. +Additionally, we check the error stack after calling GDAL functions from Cython +extension code and raise a Python exception if the last error is of GDAL type +>= 3. Several functions in rasterio._err exist to help: exc_wrap_int, +exc_wrap_pointer, etc. + +GDAL raster band cache +====================== + +GDAL has a per-process in-memory LRU (least recently used) raster block cache. +A DataAccessor's read method results in cached blocks. Subsequent reads from +the same accessor may reuse those cached blocks. Calling a DataAccessor's write +method will update cached blocks. Cached blocks are written to the dataset's +storage when evicted from the cache or when the DataAccessor is closed, +flushing all the dataset's cached blocks. + +Rasterio has no abstraction for this cache. + +Command line interface +====================== + +Rasterio includes a command line program named "rio". It shares a set of +options with the "fio" program from the Fiona project (the vector counterpart +to rasterio). The rio program has one level of subcommands. The subcommands do +different things, though there is a little bit of overlap so that users don't +always have to call multiple commands to get a slightly different result. +Raster operations don't compose as readily as line-oriented text operations do.
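The command line layout described above, a program with one level of subcommands and a set of options common to all of them, can be sketched as follows. This is a rough illustration only: rio itself is built on click, while this sketch uses the standard library's argparse to stay dependency free. The program name "toy-rio" and the arguments given to each subcommand are assumptions made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with shared options and a single level of subcommands."""
    # Options common to every subcommand live on a parent parser, so a
    # sibling program could reuse the same definitions.
    shared = argparse.ArgumentParser(add_help=False)
    shared.add_argument("-v", "--verbose", action="count", default=0)

    parser = argparse.ArgumentParser(prog="toy-rio")
    subcommands = parser.add_subparsers(dest="command", required=True)

    # Each subcommand takes dataset names, not arrays or Python objects.
    info = subcommands.add_parser("info", parents=[shared])
    info.add_argument("input")

    convert = subcommands.add_parser("convert", parents=[shared])
    convert.add_argument("input")
    convert.add_argument("output")

    return parser


args = build_parser().parse_args(["info", "-v", "example.tif"])
```

Keeping the shared options on a separate parent parser is one way to express the idea that two programs can present the same options while keeping their subcommands independent.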