
User guide🔗

This user guide is a starting point for a deeper understanding of the inner workings of the boxs library. It is meant for users who want to learn about individual details or who plan to extend the library by developing their own value types or transformers.

Data organization🔗

Boxs keeps track of data items, and of the dependencies between them, that are created when executing Python code. Data items are identified by an automatically derived data_id, but for easier usage, user-defined names that are unique within a single run are supported, too. Each execution leads to a new set of data items without overwriting anything, which makes it possible to compare them across different "runs". All data items are organized in "boxes", which can be used for grouping related items. Multiple boxes can share a single Storage, which actually stores the data and its meta-data.

Diagram of a workflow

Each individual data item can be referenced by 3 different ids:

  • box_id: The id of the box, in which the data item is stored.
  • data_id: The id identifying the same data entity across multiple runs.
  • run_id: The id that identifies the run in which this data item was created.
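As a rough illustration of how the three ids combine into a unique reference, the sketch below uses a hypothetical `ItemKey` dataclass. It is not part of boxs, which has its own reference types; it only shows that two items can share a data_id and still be different items because their run_id differs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemKey:
    """Hypothetical stand-in for a reference to a single data item."""

    box_id: str   # the box in which the item is stored
    data_id: str  # identifies the same data entity across runs
    run_id: str   # identifies the run that created this item


# The same data entity, created in two different runs.
first = ItemKey("my-box-id", "data-1", "run-a")
second = ItemKey("my-box-id", "data-1", "run-b")

same_entity = first.data_id == second.data_id  # same data_id in both runs
same_item = first == second  # differs, because the run_id is different
```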

Warning

Boxs treats all data as immutable. Once written, a data item can't be updated. Instead a new data item with a different run_id should be used. Even though deleting a run and reusing its run_id is possible, it is HIGHLY discouraged since doing so can lead to inconsistencies especially with dependency tracking across different runs.

User API🔗

The boxs user API can be imported from the boxs package. All classes and functions that are meant for users are importable from the top-level package.

store()🔗

boxs.store() is the function that stores an actual value. It takes several arguments that influence where the data is stored, how the value is serialized, how its data id is calculated and what additional meta-data is stored along with it.

The full signature of the function looks like this:

def store(
    value,
    *parents,
    name=None,
    origin=ORIGIN_FROM_FUNCTION_NAME,
    tags=None,
    meta=None,
    value_type=None,
    run_id=None,
    box=None
) -> boxs.DataInfo:

The function returns an object of type boxs.DataInfo. This object contains a reference to the stored data, which allows loading it again at a later time, and some additional meta information about it, including how it was stored.

store() arguments are described as follows:

value🔗

value contains the data that should be stored. Out of the box, boxs natively supports a few common types, for example strings, bytes, streams, files and directories.

With some limitations, it can store lists and dicts, too, as long as every contained value can be serialized to JSON without a custom JSONEncoder.

All other types of values can be stored either by explicitly setting the value_type argument or by adding a value type that supports them to the box which will contain the value.

*parents🔗

Boxs supports tracking dependencies between different values. This can be helpful e.g. when trying to understand the impact of changes in complex situations or when data is reused from different runs. Adding already stored data items as parents to a new data item is done by providing them as additional positional arguments to the store() call, e.g.:

import boxs

...

def fetch_data():
    ...
    data = boxs.store(data_values, box='my-box-id', name='data')
    return data

def partition_data(data):
    ...
    train_data = boxs.store(train_values, data, box='my-box-id', name='train_data')
    eval_data = boxs.store(eval_values, data, box='my-box-id', name='eval_data')
    return train_data, eval_data

...

data = fetch_data()
train_data, eval_data = partition_data(data)

In this example the second function partition_data(data) stores two new data items train_data and eval_data using data as a parent data item.

origin=ORIGIN_FROM_FUNCTION_NAME🔗

Stored data items are referenced by a data_id that is automatically derived from the origin of the data, defined by this keyword argument, and from their parents. The origin is meant to describe, in textual form, where some data originated. Its value can be either a str or a callable that returns a string. By default, a callable is used that extracts from the stack the name of the function from which store() is called.

The callable can optionally take an OriginContext object, from whose attributes the origin can be constructed. For more information, take a look at the OriginMappingFunction type.
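The following sketch illustrates the idea with the standard library only. The helper `origin_from_function_name` looks up the calling function via `inspect`, and `derive_data_id` combines an origin with parent ids into a short hash. Both names and the hashing scheme are assumptions made for illustration; how boxs actually computes its data ids is not described here.

```python
import hashlib
import inspect


def origin_from_function_name(level=1):
    """Return the name of the function `level` frames above this call."""
    frame = inspect.stack()[level]
    return frame.function


def derive_data_id(origin, parent_ids=()):
    """Hypothetical: hash the origin together with the ids of all parents."""
    digest = hashlib.sha256()
    digest.update(origin.encode("utf-8"))
    for parent_id in parent_ids:
        digest.update(b"\x00")  # separator between the individual parts
        digest.update(parent_id.encode("utf-8"))
    return digest.hexdigest()[:16]


def fetch_data():
    # The origin resolves to 'fetch_data', the function calling the helper.
    return origin_from_function_name()
```

With such a scheme, the same origin and parents always yield the same id, while adding a parent or renaming the function changes it completely, which is why named data items are more convenient for day-to-day use.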

name=None🔗

Referencing data only by its automatically generated data_id is not very convenient for users. With every new dependency or change to the origin, the data id changes to something completely different, so it becomes hard to keep track of things. To alleviate this, data items can be named by providing a value for the name keyword argument when storing the data. These names have to be unique within a run and can be used to refer to a specific data item from the command line. By default, no name is given.

tags=None🔗

Often it can be helpful to group data items by some criteria. This is why a set of tags can be assigned to each data item when storing new data. These tags are mappings from string keys to string values. The tags can later be used for listing data items or for determining how they should be handled. By default, no tags are used.

meta=None🔗

The meta keyword argument is meant for storing arbitrary meta-data about the item that might be useful later. This can be information about the size or source of the data, e.g. the date on which its data source was last updated. All keys in meta must be strings. The values can be of any type that is supported by the Python JSON encoder, so even dicts and lists can be used, as long as all values they contain can be serialized to JSON.
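A quick way to check these constraints before calling store() is sketched below. The helper `check_meta` is hypothetical and not part of boxs; it only relies on the standard `json` module and checks the top-level keys explicitly, because `json.dumps` would silently convert some non-string keys.

```python
import json


def check_meta(meta):
    """Raise TypeError if `meta` could not be stored as JSON with string keys."""
    for key in meta:
        if not isinstance(key, str):
            raise TypeError(f"meta key {key!r} is not a string")
    # json.dumps raises TypeError for values the default encoder can't handle.
    json.dumps(meta)
    return meta
```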

Boxs uses the same meta-data internally, for keeping track of the type of the value as well as some useful information like checksums or size of the data. So when inspecting the meta attribute of a data item, not only user-defined meta-data is shown.

value_type=None🔗

ValueTypes are the mechanism that makes storing values possible in the first place. In order to know how to serialize and deserialize a specific value, boxs needs a corresponding value type. For a set of commonly used types, e.g. files, strings, bytes or streams, boxs has a predefined list of value types, so the value_type argument doesn't need to be set and can stay at its default None. For custom types, though, a value_type has to be provided.

run_id=None🔗

Boxs automatically generates a new run_id every time it is run in a new process. This run_id allows correlating the version of a specific data item with the single invocation of the script that created the data. There might be situations where the user wants to bypass this automatic mechanism and manage run ids manually. In this case, providing a custom run_id overrides the automatically generated one.
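The behaviour can be pictured with the following sketch: one id is generated lazily per process and reused for every call, unless the caller passes an explicit value. The function `get_run_id` and the use of `uuid4` are assumptions for illustration; the format of the ids that boxs generates may differ.

```python
import uuid

_RUN_ID = None  # generated once per process, on first use


def get_run_id(override=None):
    """Return the explicit run id if given, else the per-process one."""
    global _RUN_ID
    if override is not None:
        return override
    if _RUN_ID is None:
        _RUN_ID = uuid.uuid4().hex
    return _RUN_ID
```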

box=None🔗

Boxs organizes data in collections called "boxes". This keyword argument specifies in which box the data should be stored. Its value can be either a str with the box_id of the box, or the Box object itself. If no box is specified, the default_box from the configuration is used. Since a box is required, a ValueError is raised if no box is specified either as keyword argument or in the configuration.
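The lookup order described above can be summarized in a few lines. `resolve_box_id` is a hypothetical helper written for this guide, not a boxs function; it accepts a box id string or any object with a `box_id` attribute.

```python
def resolve_box_id(box=None, default_box=None):
    """Hypothetical helper: pick the box id for a store() call."""
    if box is None:
        box = default_box  # fall back to the configured default
    if box is None:
        raise ValueError("No box specified and no default_box configured")
    # Accept either a box id string or an object carrying a box_id attribute.
    return box if isinstance(box, str) else box.box_id
```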

load()🔗

Once a value has been stored, the next question is how to load it when it is needed again. For this, boxs provides the boxs.load() function, which takes a reference to the data item and returns the stored value, which can be of any type.

def load(data, value_type=None) -> Any:

data🔗

The data item whose value should be loaded. data can be one of two types, either boxs.data.DataInfo or boxs.data.DataRef. DataInfo is the type returned by the store() function. It contains information about the data item, how it was stored and its ancestors. In contrast, DataRef contains only the ids necessary for uniquely identifying a data item.

For convenience, both DataInfo and DataRef provide a load() method that internally uses the boxs.load() function.

value_type=None🔗

The ValueType to use for converting the stored data to a Python value. Usually, this doesn't need to be provided and can be left at its default None: when a value is stored, the value type that was used is added to the meta-data of the item, so that the same value type can be reused when the value is loaded again. Sometimes, though, a user wants to use a different value type when loading data. In this case, the value_type provided explicitly when calling load() overrides the value type that is stored with the value.

info()🔗

info() returns the DataInfo about a data item from its DataRef reference. This function is usually not called directly, because it is more convenient to use the corresponding property DataRef.info that uses the info() function internally.

ValueType🔗

Within a Python script, different types of data are used. In order to know how these different values can be stored and loaded, boxs uses the concept of ValueTypes. A value type usually corresponds to one specific Python type that it supports. ValueType defines two methods that are used for actually writing the value to storage and reading it at a later time:

    @abc.abstractmethod
    def write_value_to_writer(self, value, writer):
        raise NotImplementedError

    @abc.abstractmethod
    def read_value_from_reader(self, reader):
        raise NotImplementedError

ValueType.write_value_to_writer(value, writer) takes two arguments, value and writer. value is the value that should be stored. The writer argument points to a storage-specific implementation of the Writer interface, which allows writing the data and additional info to the storage.

ValueType.read_value_from_reader(reader) takes only a single argument. reader contains a storage-specific implementation of the Reader interface, which allows reading the data or info back from storage.
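To make the two methods more concrete, here is a minimal sketch of a value type that stores values as UTF-8 encoded JSON. `MemoryWriter` and `MemoryReader` are simple stand-ins defined only for this example, not the boxs classes, and `JsonValueType` does not inherit from the real ValueType base class so that the snippet runs without boxs installed. The only thing it assumes about readers and writers is the `as_stream()` method described in the Storage section.

```python
import io
import json


class MemoryWriter:
    """Stand-in for a storage-specific Writer (not the boxs class)."""

    def __init__(self):
        self.buffer = io.BytesIO()

    def as_stream(self):
        return self.buffer


class MemoryReader:
    """Stand-in for a storage-specific Reader (not the boxs class)."""

    def __init__(self, content):
        self._content = content

    def as_stream(self):
        return io.BytesIO(self._content)


class JsonValueType:
    """Sketch of a value type that stores values as UTF-8 encoded JSON."""

    def write_value_to_writer(self, value, writer):
        stream = writer.as_stream()
        stream.write(json.dumps(value).encode("utf-8"))

    def read_value_from_reader(self, reader):
        stream = reader.as_stream()
        return json.loads(stream.read().decode("utf-8"))


# Round trip: write a value, then read it back from the written bytes.
writer = MemoryWriter()
value_type = JsonValueType()
value_type.write_value_to_writer({"answer": 42}, writer)
loaded = value_type.read_value_from_reader(MemoryReader(writer.buffer.getvalue()))
```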

Each Box contains a pre-defined list of value types for common types like strings, bytes or files and directories. These value types are automatically used depending on the type of value. This works by them implementing another method of the ValueType interface, supports(value):

    def supports(self, value):
        return False

This method is given a value that should be stored and returns whether it supports writing this value or not. When a box has to store a value without an explicitly defined value type, it loops through its list of default types and uses the first type whose supports(value) method returns True.
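The selection loop can be sketched as follows. `StringType`, `BytesType` and `find_value_type` are illustrative names, and raising a ValueError when no type matches is an assumption; the error that boxs raises in that case is not specified here.

```python
class StringType:
    """Illustrative value type that only accepts str values."""

    def supports(self, value):
        return isinstance(value, str)


class BytesType:
    """Illustrative value type that only accepts bytes-like values."""

    def supports(self, value):
        return isinstance(value, (bytes, bytearray))


def find_value_type(value, value_types):
    """Return the first value type whose supports() accepts `value`."""
    for value_type in value_types:
        if value_type.supports(value):
            return value_type
    # Assumption: the real library may signal this case differently.
    raise ValueError(f"No value type supports {type(value).__name__}")
```

Because the first match wins, the order of the list matters: a more specific type has to come before a more general one.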

ValueType defines two additional methods, that are used for recreating a value that has been stored before. When a value has been stored, boxs calls the get_specification() method of the used value type, which returns a string specification of the value type. This specification is then added as an additional field 'value_type' to the meta-data. Once this value should be loaded, the corresponding value type is recreated using the class method from_specification(cls, specification) which takes the specification string from the meta-data and returns a ValueType instance that is then used for reading the value.
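The round trip through the meta-data might look like the sketch below. `EncodedStringType` is a hypothetical value type and the `'<class name>:<encoding>'` specification format is an assumption chosen for the example; the text above only guarantees that the specification is a string.

```python
class EncodedStringType:
    """Hypothetical value type configured with a text encoding."""

    def __init__(self, encoding="utf-8"):
        self.encoding = encoding

    def get_specification(self):
        # Assumed format: '<class name>:<encoding>'.
        return f"{type(self).__name__}:{self.encoding}"

    @classmethod
    def from_specification(cls, specification):
        # Everything after the first colon is the encoding.
        _, encoding = specification.split(":", 1)
        return cls(encoding)


# When storing: the specification ends up in the meta-data.
meta = {"value_type": EncodedStringType("latin-1").get_specification()}

# When loading: an equivalent value type is recreated from the meta-data.
restored = EncodedStringType.from_specification(meta["value_type"])
```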

Box🔗

A Box is the class that actually implements the logic of storing and loading values. Once created, its interface matches the free functions store(), load() and info() from the boxs package.

Before a box can be used, it needs to be defined. This is done by creating a new instance of its class:

import boxs

...
box = boxs.Box('my-box-id', storage)

Its constructor takes a string containing the box_id and the underlying storage object that actually stores the data of the items in the box.

When a Box is created, it registers itself with its box_id. This allows finding the box by its id at a later time, using the get_box(box_id) function.

A box comes with a pre-defined list of value types to support storing some common types. Additional value types can be added with its add_value_type(value_type) method. This method adds the new value type at the beginning of the list, so that it takes precedence over the standard types.
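The two mechanisms, registration by id and precedence of added value types, can be pictured with a simplified model. `SketchBox`, the module-level `_BOXES` dict and this `get_box` are stand-ins written for illustration and do not reproduce the real implementation.

```python
_BOXES = {}  # hypothetical registry mapping box_id -> box


class SketchBox:
    """Hypothetical, simplified box: registers itself and keeps value types."""

    def __init__(self, box_id, default_types=()):
        self.box_id = box_id
        self.value_types = list(default_types)
        _BOXES[box_id] = self  # register under its id

    def add_value_type(self, value_type):
        # New types go first so they win over the predefined ones.
        self.value_types.insert(0, value_type)


def get_box(box_id):
    """Look up a previously created box by its id."""
    return _BOXES[box_id]
```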

Storage🔗

Storage is the interface that storage implementations have to implement in order to be used by boxs for storing and loading data.

Reading and writing items🔗

A Storage implementation provides the means to read and write data items by creating new storage-specific readers and writers for each data item that should be stored or loaded. The two methods of the Storage interface responsible for this are create_reader(item) and create_writer(item, name, tags).

In both cases the argument item is of type boxs.storage.Item which contains the ids of the item to be read or written.

    @abc.abstractmethod
    def create_reader(self, item):
        """
        Returns:
            boxs.storage.Reader: The reader that will load the data from the
                storage.
        """

    @abc.abstractmethod
    def create_writer(self, item, name=None, tags=None):
        """
        Returns:
            boxs.storage.Writer: The writer that will write the data into the
                storage.
        """

Writer🔗

A Writer implementation has to inherit from the boxs.storage.Writer base class. The base class defines a set of properties and methods that are used within boxs to write data items. When implementing the interface, only 2 methods are needed:

    @abc.abstractmethod
    def as_stream(self):
        """
        Return a stream to which the data content should be written.

        Returns:
            io.RawIOBase: The binary io-stream.
        """

    @abc.abstractmethod
    def write_info(self, info):
        """
        Write the info for the data item to the storage.

        Args:
            info (Dict[str,Any]): The information about the new data item.
        """

as_stream() is used by the individual value types to transfer the actual data of the values that should be stored. The implementation has to return a binary stream, that is not already opened.

The second method, write_info(info), is called by boxs once the data has been written. It takes a single dictionary as argument, which contains information describing the data item. The Writer implementation shouldn't assume anything about the format of the dictionary. The only guaranteed property is that it can be serialized using the standard JSON library.

Both methods must raise a boxs.errors.DataCollision exception when an item with the same ids already exists. Alternatively, the error can be raised when the writer is created.
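A minimal in-memory sketch of these rules is shown below. `DictWriter` keeps content and info in a plain dict, and `DataCollision` is a local stand-in for boxs.errors.DataCollision so that the snippet runs without boxs installed; a real implementation would inherit from boxs.storage.Writer and raise the library's own exception.

```python
import io


class DataCollision(Exception):
    """Local stand-in for boxs.errors.DataCollision."""


class DictWriter:
    """Sketch of a writer that keeps content and info in a shared dict."""

    def __init__(self, store, key):
        self._store = store
        self._key = key
        # Raising on creation is the alternative the text allows.
        if key in store:
            raise DataCollision(f"{key} already exists")
        self._buffer = io.BytesIO()

    def as_stream(self):
        # Value types write the binary content of the item to this stream.
        return self._buffer

    def write_info(self, info):
        # Check again, in case another writer finished in the meantime.
        if self._key in self._store:
            raise DataCollision(f"{self._key} already exists")
        self._store[self._key] = (self._buffer.getvalue(), dict(info))
```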

Reader🔗

Similar to the Writer class, the Reader class has the corresponding 2 methods:

    @abc.abstractmethod
    def as_stream(self):
        """
        Return a stream from which the data content can be read.

        Returns:
            io.RawIOBase: A stream instance from which the data can be read.
        """

    @property
    @abc.abstractmethod
    def info(self):
        """Dictionary containing information about the data."""

as_stream() is used by the individual value types to read the actual data of the value that was stored. The implementation has to return a binary stream, that is not already opened.

info is a property that returns the info dictionary that was previously written. Ideally, the implementation caches the info once it has been read.

Both methods must raise a boxs.errors.DataNotFound exception when the item that should be read doesn't exist. Alternatively, the error can be raised when the reader is created.
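The reader side can be sketched in the same spirit. `DictReader` reads from a dict that maps a key to a `(content, info)` tuple, caches the info after the first access, and raises a local `DataNotFound` stand-in for boxs.errors.DataNotFound. The storage layout is an assumption made purely for this example.

```python
import io


class DataNotFound(Exception):
    """Local stand-in for boxs.errors.DataNotFound."""


class DictReader:
    """Sketch of a reader over a dict of key -> (content, info)."""

    def __init__(self, store, key):
        self._store = store
        self._key = key
        self._info = None  # filled on first access

    def _entry(self):
        try:
            return self._store[self._key]
        except KeyError:
            raise DataNotFound(f"{self._key} does not exist") from None

    def as_stream(self):
        content, _ = self._entry()
        return io.BytesIO(content)

    @property
    def info(self):
        # Cache the info so the storage is only consulted once.
        if self._info is None:
            _, info = self._entry()
            self._info = dict(info)
        return self._info
```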

Querying and manipulating a storage🔗

Besides creating the Writer or Reader, the Storage interface contains additional methods, that are used for querying or manipulating the stored data items. These methods are currently only used by the command-line interface.

Warning

The Storage interface is not meant to be used directly by the user, but should be regarded as an implementation detail of boxs whose interface might change between versions. This does NOT include the interfaces Writer and Reader which have to be used by ValueType implementations and therefore should be stable.

    @abc.abstractmethod
    def list_runs(self, box_id, limit=None, name_filter=None):
        """
        List the runs within a box stored in this storage.

        The runs should be returned in descending order of their start time.

        Args:
            box_id (str): `box_id` of the box in which to look for runs.
            limit (Optional[int]): Limits the returned runs to maximum `limit` number.
                Defaults to `None` in which case all runs are returned.
            name_filter (Optional[str]): If set, only include runs which have names
                that have the filter as prefix. Defaults to `None` in which case all
                runs are returned.

        Returns:
            List[boxs.storage.Run]: The runs.
        """

    @abc.abstractmethod
    def list_items(self, item_query):
        """
        List all items that match a given query.

        The item query can contain parts of box id, run id or run name and data id or
        data name. If a query value is not set (`== None`) it is not used as a filter
        criteria.

        Args:
            item_query (boxs.storage.ItemQuery): The query which defines which items
                should be listed.

        Returns:
            List[boxs.storage.Item]: The items.
        """

    @abc.abstractmethod
    def set_run_name(self, box_id, run_id, name):
        """
        Set the name of a run.

        The name can be updated and removed by providing `None`.

        Args:
            box_id (str): `box_id` of the box in which the run is stored.
            run_id (str): Run id of the run which should be named.
            name (Optional[str]): New name of the run. If `None`, an existing name
                will be removed.

        Returns:
            boxs.storage.Run: The run with its new name.
        """

    @abc.abstractmethod
    def delete_run(self, box_id, run_id):
        """
        Delete all the data of the specified run.

        Args:
            box_id (str): `box_id` of the box in which the run is stored.
            run_id (str): Run id of the run which should be deleted.
        """

Transformer🔗

Transformers are a mechanism for extending how data is stored and what meta-data is stored alongside. This works by wrapping the Reader and Writer that are created by the Storage and returning a different reader/writer.

Boxs comes with some ready-to-use transformers. The StatisticsTransformer gathers additional statistics about each data item, like the size in bytes or the number of lines. Another built-in transformer is the ChecksumTransformer, which calculates checksums when storing data and verifies the checksum when that data is loaded again. This makes it possible to detect transfer errors and can be used later for de-duplicating the data stored in a storage.

Using transformers🔗

Transformers are enabled per box. A transformer is enabled by passing it as a positional argument when instantiating the box that should use it:

import boxs

box = boxs.Box(
    'my-box-id',
    boxs.FileSystemStorage('/my/path/to/storage/dir'),
    boxs.StatisticsTransformer(),
    boxs.ChecksumTransformer(),
)

Implementing transformers🔗

The Transformer base class contains only two simple methods:

    def transform_writer(self, writer):
        return writer

    def transform_reader(self, reader):
        return reader

transform_writer(writer) takes a Writer instance as argument and returns a new writer. This makes it possible to intercept the write operations and modify the data as it is written. The writer gives access to the item-specific meta-data, too, so that the transformed writer can add new attributes. The implementation in the base class returns the same writer it gets and therefore does nothing.

transform_reader(reader) works in the same way. It receives a Reader instance that can be wrapped with a custom implementation that modifies the data as it is being read. Modifying meta-data can be done, too, but is not recommended, since it creates an inconsistency between the meta-data in the storage and the meta-data seen by the value type loading the value. The base implementation returns the reader without any modification.

To make the implementation easier, boxs.transformer already contains a reader and a writer that delegate all their methods to a wrapped reader/writer. These DelegatingReader and DelegatingWriter classes can be used for implementing a custom transformer. Similarly, if the data stream should be intercepted, the boxs.io.DelegatingStream class can be used for modifying the read() or write() operations on the stream.
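The wrapping pattern is sketched below with stand-in classes, so that it runs without boxs installed. `BaseWriter`, `CountingStream`, `CountingWriter` and `SizeTransformer` are hypothetical names, and the `meta` attribute and the `size_in_bytes` key are assumptions; a real transformer would more likely build on the delegating classes mentioned above.

```python
import io


class BaseWriter:
    """Stand-in writer holding content and a meta dict (not the boxs class)."""

    def __init__(self):
        self.meta = {}
        self.info = None
        self._buffer = io.BytesIO()

    def as_stream(self):
        return self._buffer

    def write_info(self, info):
        self.info = info


class CountingStream:
    """Wraps a binary stream and counts the bytes written to it."""

    def __init__(self, wrapped):
        self._wrapped = wrapped
        self.count = 0

    def write(self, data):
        self.count += len(data)
        return self._wrapped.write(data)


class CountingWriter:
    """Delegates to another writer, recording the written size as meta-data."""

    def __init__(self, delegate):
        self._delegate = delegate
        self._stream = None

    @property
    def meta(self):
        return self._delegate.meta

    def as_stream(self):
        if self._stream is None:
            self._stream = CountingStream(self._delegate.as_stream())
        return self._stream

    def write_info(self, info):
        if self._stream is not None:
            self.meta["size_in_bytes"] = self._stream.count
        self._delegate.write_info(info)


class SizeTransformer:
    """Sketch of a transformer that only changes the write path."""

    def transform_writer(self, writer):
        return CountingWriter(writer)

    def transform_reader(self, reader):
        return reader  # reading is left untouched
```

The transformed writer behaves like the original one from the point of view of a value type, which is what allows several transformers to be stacked on the same box.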

Configuration🔗

Even though boxs tries to minimize the number of steps necessary to use it, some aspects of it can be configured. For this, it internally uses a configuration that is returned by its get_config() function. The configuration is automatically created on first use.

Configurable values🔗

default_box🔗

One configuration value that can be set is the default_box. This value is a string, that contains the box_id of the box that should be used, if no box is explicitly specified. The value can be either set directly

import boxs

config = boxs.get_config()
config.default_box = 'my-default-box'

or as part of the environment by specifying the environment variable BOXS_DEFAULT_BOX.
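The interplay between the two ways of setting the value can be pictured as follows. `SketchConfig` is a hypothetical class, and giving an explicitly set value priority over the environment variable is an assumption of this sketch, not a documented guarantee.

```python
import os


class SketchConfig:
    """Hypothetical config: explicit value first, then the environment."""

    def __init__(self, environ=None):
        self._environ = os.environ if environ is None else environ
        self._default_box = None

    @property
    def default_box(self):
        if self._default_box is not None:
            return self._default_box
        # Falls back to None if the variable is not set.
        return self._environ.get("BOXS_DEFAULT_BOX")

    @default_box.setter
    def default_box(self, box_id):
        self._default_box = box_id
```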

init_module🔗

The value init_module contains the name of a Python module that should be automatically imported once boxs has been initialized. This makes it possible to ensure that a specific box has been defined before the code using it is executed.

The value can be either set directly

import boxs

config = boxs.get_config()
config.init_module = 'my_box_init'

or as part of the environment by specifying the environment variable BOXS_INIT_MODULE.

Warning

Setting this value at run-time will lead to the module getting imported if it hasn't been loaded yet. Be careful about circular dependencies between this module and boxs.
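The import-on-set behaviour mentioned in the warning can be approximated with `importlib`. The helper `import_init_module` is a sketch written for this guide and not the boxs implementation; it skips the import when the module is already loaded.

```python
import importlib
import sys


def import_init_module(module_name):
    """Import the configured init module unless it is already loaded."""
    if module_name in sys.modules:
        return sys.modules[module_name]
    return importlib.import_module(module_name)
```

A module such as the 'my_box_init' from the example above would typically just create the boxes that the rest of the code expects to exist.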

How to use boxs🔗

Now with some knowledge about the different concepts within boxs at our hands, let's dive into the topic of how to put the library to good use.

Install the library🔗

Use stable release from PyPI🔗

All stable versions of boxs are available on PyPI and can be downloaded and installed from there. The easiest way to install it into your Python environment is by using pip:

pip install boxs

Use from source🔗

The boxs Git repository is publicly available and can easily be cloned into a new directory on your local machine:

$ cd /your/local/directory
$ git clone https://gitlab.com/kantai/boxs.git
$ cd boxs

If you want to make changes to the library, please follow the guidance in the README.md on how to set up the necessary tools for testing your changes.

If you just want to use the library, it is sufficient to add the path to your local boxs repository to your $PYTHONPATH variable, e.g.:

$ export PYTHONPATH="$PYTHONPATH:/your/local/directory/boxs"

Last update: 2022-02-01