API Reference

This is the API for the signac (core) application.

The Project

Attributes

Project.build_job_search_index(index[, _trust])

Build a job search index.

Project.build_job_statepoint_index([...])

Build a state point index to identify jobs with specific parameters.

Project.check()

Check the project's workspace for corruption.

Project.clone(job[, copytree])

Clone job into this project.

Project.config

Get project's configuration.

Project.create_access_module([filename, ...])

Create the access module for indexing.

Project.create_linked_view([prefix, ...])

Create or update a persistent linked view of the selected data space.

Project.detect_schema([exclude_const, ...])

Detect the project's state point schema.

Project.data

Get data associated with this project.

Project.doc

Get document associated with this project.

Project.document

Get document associated with this project.

Project.dump_statepoints(statepoints)

Dump the state points and associated job ids.

Project.export_to(target[, path, copytree])

Export all jobs to a target location, such as a directory or a (compressed) archive file.

Project.find_job_ids([filter, doc_filter, index])

Find the job_ids of all jobs matching the filters.

Project.find_jobs([filter, doc_filter])

Find all jobs in the project's workspace.

Project.fn(filename)

Prepend a filename with the project's root directory path.

Project.get_id()

Get the project identifier.

Project.get_statepoint(jobid[, fn])

Get the state point associated with a job id.

Project.groupby([key, default])

Group jobs according to one or more state point or document parameters.

Project.groupbydoc([key, default])

Group jobs according to one or more document values.

Project.import_from([origin, schema, sync, ...])

Import the data space located at origin into this project.

Project.id

Get the project identifier.

Project.index([formats, depth, skip_errors, ...])

Generate an index of the project's workspace.

Project.isfile(filename)

Check if a filename exists in the project's root directory.

Project.min_len_unique_id()

Determine the minimum length required for a job id to be unique.

Project.num_jobs()

Return the number of initialized jobs.

Project.open_job([statepoint, id])

Get a job handle associated with a state point.

Project.read_statepoints([fn])

Read all state points from a file.

Project.repair([fn_statepoints, index, job_ids])

Attempt to repair the workspace after it got corrupted.

Project.reset_statepoint(job, new_statepoint)

Overwrite the state point of this job while preserving job data.

Project.root_directory()

Alias for path.

Project.stores

Get HDF5-stores associated with this project.

Project.sync(other[, strategy, exclude, ...])

Synchronize this project with the other project.

Project.update_cache()

Update the persistent state point cache.

Project.update_statepoint(job, update[, ...])

Change the state point of this job while preserving job data.

Project.workspace

The project's workspace directory.

Project.write_statepoints([statepoints, fn, ...])

Dump state points to a file.

class signac.Project(config=None)

Bases: object

The handle on a signac project.

Application developers should usually not need to directly instantiate this class, but use get_project() instead.

Parameters

config – The project configuration to use. By default, it loads the first signac project configuration found while searching upward from the current working directory (Default value = None).

FN_CACHE = '.signac_sp_cache.json.gz'

The default filename for the state point cache file.

FN_DOCUMENT = 'signac_project_document.json'

The project’s document filename.

FN_STATEPOINTS = 'signac_statepoints.json'

The default filename to read from and write state points to.

KEY_DATA = 'signac_data'

The project’s datastore key.

build_job_search_index(index, _trust=False)

Build a job search index.

Parameters
  • index (list) – A document index.

  • _trust – (Default value = False).

Returns

A job search index based on the provided index.

Return type

JobSearchIndex

Deprecated since version 1.3: This will be removed in 2.0.

build_job_statepoint_index(exclude_const=False, index=None)

Build a state point index to identify jobs with specific parameters.

This method generates pairs of state point keys and mappings of values to a set of all corresponding job ids. The pairs are ordered by the number of different values. Since state point keys may be nested, they are represented as a tuple. For example:

>>> for i in range(4):
...     project.open_job({'a': i, 'b': {'c': i % 2}}).init()
...
>>> for key, value in project.build_job_statepoint_index():
...     print(key)
...     pprint.pprint(value)
...
('b', 'c')
defaultdict(<class 'set'>,
            {0: {'3a530c13bfaf57517b4e81ecab6aec7f',
                 '4e9a45a922eae6bb5d144b36d82526e4'},
             1: {'d49c6609da84251ab096654971115d0c',
                 '5c2658722218d48a5eb1e0ef7c26240b'}})
('a',)
defaultdict(<class 'set'>,
            {0: {'4e9a45a922eae6bb5d144b36d82526e4'},
             1: {'d49c6609da84251ab096654971115d0c'},
             2: {'3a530c13bfaf57517b4e81ecab6aec7f'},
             3: {'5c2658722218d48a5eb1e0ef7c26240b'}})

Values that are constant over the complete data space can be optionally ignored with the exclude_const argument set to True.

Parameters
  • exclude_const (bool) – Exclude entries that are shared by all jobs that are part of the index (Default value = False).

  • index – A document index.

Yields

tuple – Pairs of state point keys and mappings of values to a set of all corresponding job ids.

Deprecated since version 1.3: This will be removed in 2.0. Use the detect_schema() function instead.

check()

Check the project’s workspace for corruption.

Raises

signac.errors.JobsCorruptedError – When one or more jobs are identified as corrupted.
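
For example, a minimal sketch of verifying the workspace (assuming a project handle obtained via get_project()):

import signac

project = signac.get_project()
try:
    project.check()
except signac.errors.JobsCorruptedError as error:
    # One or more jobs have a missing or invalid state point file.
    print("Workspace corrupted:", error)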

clone(job, copytree=<function copytree>)

Clone job into this project.

Create an identical copy of job within this project.

See signac clone for the command line equivalent.

Parameters
  • job (Job) – The job to copy into this project.

  • copytree – The function used to copy the job's directory tree into this project's workspace. Defaults to shutil.copytree().

Returns

The job instance corresponding to the copied job.

Return type

Job

Raises

DestinationExistsError – In case that a job with the same id is already initialized within this project.
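
For example, a minimal sketch of cloning jobs from another project; the source project path is hypothetical:

import signac

project = signac.get_project()
other = signac.get_project('/path/to/other_project')  # hypothetical path

for job in other:
    try:
        project.clone(job)
    except signac.errors.DestinationExistsError:
        pass  # a job with the same id is already initialized here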

property config

Get project’s configuration.

Returns

Dictionary containing project’s configuration.

Return type

_ProjectConfig

create_access_module(filename=None, main=True, master=None)

Create the access module for indexing.

This method generates the access module required to make this project’s index part of a main index.

Parameters
  • filename (str) – The name of the access module file. Defaults to the standard name and should usually not be changed.

  • main (bool) – If True, add directives for the compilation of a master index when executing the module (Default value = True).

  • master (bool) – Deprecated parameter. Replaced by main.

Returns

Access module name.

Return type

str

Deprecated since version 1.5: This will be removed in 2.0. Access modules are deprecated.

create_linked_view(prefix=None, job_ids=None, index=None, path=None)

Create or update a persistent linked view of the selected data space.

Similar to export_to(), this function expands the data space for the selected jobs, but instead of copying data will create symbolic links to the individual job workspace directories. This is primarily useful for browsing through the data space using a file-browser with human-interpretable directory paths.

By default, the paths of the view will be based on variable state point keys as part of the implicit schema of the selected jobs that we create the view for. For example, creating a linked view for a data space with schema

>>> print(project.detect_schema())
{
 'foo': 'int([0, 1, 2, ..., 8, 9], 10)',
}

by calling project.create_linked_view('my_view') will look similar to:

my_view/foo/0/job -> workspace/b8fcc6b8f99c56509eb65568922e88b8
my_view/foo/1/job -> workspace/b6cd26b873ae3624653c9268deff4485
...

It is possible to control the paths using the path argument, which behaves in the exact same manner as the equivalent argument for export_to().

Note

The behavior of this function is almost equivalent to project.export_to('my_view', copytree=os.symlink) with the major difference that view hierarchies are actually updated, meaning that invalid links are automatically removed.

See signac view for the command line equivalent.

Parameters
  • prefix (str) – The path where the linked view will be created or updated (Default value = None).

  • job_ids (iterable) – If None (the default), create the view for the complete data space, otherwise only for this iterable of job ids.

  • index – A document index (Default value = None).

  • path – The path (function) used to structure the linked data space (Default value = None).

Returns

A dictionary that maps the source directory paths to the linked directory paths.

Return type

dict

property data

Get data associated with this project.

This property should be used for large array-like data, which can’t be stored efficiently in the project document. For examples and usage, see Centralized Project Data.

Equivalent to:

return project.stores['signac_data']

See also

H5Store

Usage examples.

Returns

An HDF5-backed datastore.

Return type

H5Store

detect_schema(exclude_const=False, subset=None, index=None)

Detect the project’s state point schema.

See signac schema for the command line equivalent.

Parameters
  • exclude_const (bool) – Exclude all state point keys that are shared by all jobs within this project (Default value = False).

  • subset – A sequence of jobs or job ids specifying a subset over which the state point schema should be detected (Default value = None).

  • index – A document index (Default value = None).

Returns

The detected project schema.

Return type

ProjectSchema
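
For example, a brief sketch of inspecting the detected schema; the returned ProjectSchema can be printed directly:

import signac

project = signac.get_project()
schema = project.detect_schema(exclude_const=True)
print(schema)  # summary of variable state point keys and their value ranges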

property doc

Get document associated with this project.

Alias for document().

Returns

The project document.

Return type

BufferedJSONAttrDict

property document

Get document associated with this project.

Returns

The project document.

Return type

BufferedJSONAttrDict

dump_statepoints(statepoints)

Dump the state points and associated job ids.

Equivalent to:

{project.open_job(sp).id: sp for sp in statepoints}

Parameters

statepoints (iterable) – A list of state points.

Returns

A mapping, where the key is the job id and the value is the state point.

Return type

dict

Deprecated since version 1.8: This will be removed in 2.0. State point backup files are being removed in favor of Project.update_cache().

export_to(target, path=None, copytree=None)

Export all jobs to a target location, such as a directory or a (compressed) archive file.

Use this function in combination with find_jobs() to export only a select number of jobs, for example:

project.find_jobs({'foo': 0}).export_to('foo_0.tar')

The path argument enables users to control how exactly the exported data space is to be expanded. By default, the path-function will be based on the implicit schema of the exported jobs. For example, exporting jobs that all differ by a state point key foo with project.export_to('data/'), the exported directory structure could look like this:

data/foo/0
data/foo/1
...

That would be equivalent to specifying path=lambda job: os.path.join('foo', job.sp.foo).

Instead of a function, we can also provide a string, where fields for state point keys are automatically formatted. For example, the following two path arguments are equivalent: “foo/{foo}” and “foo/{job.sp.foo}”.

Any attribute of job can be used as a field here, so job.doc.bar, job.id, and job.ws can also be used as path fields.

A special {{auto}} field allows us to expand the path automatically with state point keys that have not been specified explicitly. So, for example, one can provide path="foo/{foo}/{{auto}}" to specify that the path shall begin with foo/{foo}/, but is then automatically expanded with all other state point key-value pairs. How key-value pairs are concatenated can be controlled via the format-specifier, so for example, path="{{auto:_}}" will generate a structure such as

data/foo_0
data/foo_1
...

Finally, providing path=False is equivalent to path="{job.id}".

See also

import_from() :

Previously exported or non-signac data spaces can be imported.

signac export :

See signac export for the command line equivalent.

Parameters
  • target – A path to a directory to export to. The target must not already exist. Besides directories, possible targets are tar files (.tar), gzipped tar files (.tar.gz), zip files (.zip), bzip2-compressed files (.bz2), and xz-compressed files (.xz).

  • path – The path (function) used to structure the exported data space. This argument must either be a callable which returns a path (str) as a function of job, a string where fields are replaced using the job-state point dictionary, or False, which means that we just use the job-id as path. Defaults to the equivalent of {{auto}}.

  • copytree – The function used for the actual copying of directory tree structures. Defaults to shutil.copytree(). Can only be used when the target is a directory.

Returns

A dict that maps the source directory paths to the target directory paths.

Return type

dict
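
A short sketch combining a job selection with the path argument; the state point key foo and the target names are assumptions for illustration:

import os
import signac

project = signac.get_project()

# Export selected jobs to a gzipped tar archive, structured as foo/<value>/
# followed by the remaining state point keys.
project.find_jobs({'foo': 0}).export_to('foo_0.tar.gz', path='foo/{foo}/{{auto}}')

# The equivalent structure expressed with a callable path function:
project.export_to('data/', path=lambda job: os.path.join('foo', str(job.sp.foo)))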

find_job_ids(filter=None, doc_filter=None, index=None)

Find the job_ids of all jobs matching the filters.

The optional filter arguments must be a Mapping of key-value pairs and JSON serializable.

Note

Providing a pre-calculated index may vastly increase the performance of this function.

Parameters
  • filter (Mapping) – A mapping of key-value pairs that all indexed job state points are compared against (Default value = None).

  • doc_filter (Mapping) – A mapping of key-value pairs that all indexed job documents are compared against (Default value = None).

  • index – A document index. If not provided, an index will be computed (Default value = None).

Returns

The ids of all indexed jobs matching both filters.

Raises
  • TypeError – If the filters are not JSON serializable.

  • ValueError – If the filters are invalid.

  • RuntimeError – If the filters are not supported by the index.

Deprecated since version 1.3: This will be removed in 2.0. Use find_jobs() instead, then access ids with job.id. Replicate the original behavior with [job.id for job in project.find_jobs()].

find_jobs(filter=None, doc_filter=None)

Find all jobs in the project’s workspace.

The optional filter arguments must be a Mapping of key-value pairs and JSON serializable. The filter argument is used to search against job state points, whereas the doc_filter argument compares against job document keys.

See signac find for the command line equivalent.

Parameters
  • filter (Mapping) – A mapping of key-value pairs that all indexed job state points are compared against (Default value = None).

  • doc_filter (Mapping) – A mapping of key-value pairs that all indexed job documents are compared against (Default value = None).

Returns

JobsCursor of jobs matching the provided filter(s).

Return type

JobsCursor

Raises
  • TypeError – If the filters are not JSON serializable.

  • ValueError – If the filters are invalid.

  • RuntimeError – If the filters are not supported by the index.
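
For example, a minimal sketch using a state point filter together with a document filter; the keys foo and converged are assumptions:

import signac

project = signac.get_project()

# Select jobs with state point foo == 0 whose document marks them as converged.
for job in project.find_jobs({'foo': 0}, doc_filter={'converged': True}):
    print(job.id, job.sp.foo)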

fn(filename)

Prepend a filename with the project’s root directory path.

Parameters

filename (str) – The name of the file.

Returns

The joined path of project root directory and filename.

Return type

str
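
For example (the file name is hypothetical):

import signac

project = signac.get_project()
params_path = project.fn('params.json')  # <project root>/params.json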

get_id()

Get the project identifier.

Returns

The project id.

Return type

str

Deprecated since version 1.3: This will be removed in 2.0. Use project.id instead.

classmethod get_job(root=None)

Find a Job in or above the current working directory (or provided path).

Parameters

root (str) – The job root directory. If no root directory is given, the current working directory is assumed to be the job directory (Default value = None).

Returns

The job instance.

Return type

Job

Raises

LookupError – When job cannot be found.

classmethod get_project(root=None, search=True, **kwargs)

Find a project configuration and return the associated project.

Parameters
  • root (str) – The starting point to search for a project, defaults to the current working directory.

  • search (bool) – If True, search for project configurations inside and above the specified root directory, otherwise only return projects with a root directory identical to the specified root argument (Default value = True).

  • **kwargs – Optional keyword arguments that are forwarded to the Project class constructor.

Returns

An instance of Project.

Return type

Project

Raises

LookupError – When project configuration cannot be found.

get_statepoint(jobid, fn=None)

Get the state point associated with a job id.

The state point is retrieved from the internal cache, from the workspace or from a state points file.

Parameters
  • jobid (str) – A job id to get the state point for.

  • fn (str) – The filename of the file containing the state points, defaults to FN_STATEPOINTS.

Returns

The state point corresponding to jobid.

Return type

dict

Raises

KeyError – If the state point associated with the job id could not be found.

Deprecated since version 1.3: This will be removed in 2.0. Use open_job(id=jobid).statepoint() instead.

groupby(key=None, default=None)

Group jobs according to one or more state point or document parameters.

Prepend the key with ‘sp.’ or ‘doc.’ to specify the query namespace. If no prefix is specified, group by state point key.

This method can be called on any JobsCursor such as the one returned by find_jobs() or by iterating over a project.

Examples

# Group jobs by state point parameter 'a'.
for key, group in project.groupby('a'):
    print(key, list(group))

# Group jobs by document value 'a'.
for key, group in project.groupby('doc.a'):
    print(key, list(group))

# Group jobs by jobs.sp['a'] and job.document['b']
for key, group in project.groupby(('a', 'doc.b')):
    print(key, list(group))

# Find jobs where job.sp['a'] is 1 and group them
# by job.sp['b'] and job.sp['c'].
for key, group in project.find_jobs({'a': 1}).groupby(('b', 'c')):
    print(key, list(group))

# Group by job.sp['d'] and job.document['count'] using a lambda.
for key, group in project.groupby(
    lambda job: (job.sp['d'], job.document['count'])
):
    print(key, list(group))

If key is None, jobs are grouped by id, placing one job into each group.

If default is None, only jobs with the key defined will be grouped. Jobs without the key will be filtered out and not included in any group.

Parameters
  • key (str, iterable, or callable) – The grouping key(s) passed as a string, iterable of strings, or a callable that will be passed one argument, the job (Default value = None).

  • default – A default value to be used when a given key is not present. The value must be sortable and is only used if not None (Default value = None).

Yields
  • key – Grouped key.

  • group (iterable of Jobs) – Iterable of Job instances matching this group key.

groupbydoc(key=None, default=None)

Group jobs according to one or more document values.

This method can be called on any JobsCursor such as the one returned by find_jobs() or by iterating over a project.

Examples

# Group jobs by document value 'a'.
for key, group in project.groupbydoc('a'):
    print(key, list(group))

# Find jobs where job.sp['a'] is 1 and group them
# by job.document['b'] and job.document['c'].
for key, group in project.find_jobs({'a': 1}).groupbydoc(('b', 'c')):
    print(key, list(group))

# Group by whether 'd' is a field in the job.document using a lambda.
for key, group in project.groupbydoc(lambda doc: 'd' in doc):
    print(key, list(group))

If key is None, jobs are grouped by id, placing one job into each group.

Parameters
  • key (str, iterable, or callable) – The document grouping parameter(s) passed as a string, iterable of strings, or a callable that will be passed one argument, document (Default value = None).

  • default – A default value to be used when a given document key is not present. The value must be sortable and is only used if not None (Default value = None).

Deprecated since version 1.7: This will be removed in 2.0. Use groupby with a ‘doc.’ filter instead, see https://docs.signac.io/en/latest/query.html#query-namespaces.

property id

Get the project identifier.

Returns

The project id.

Return type

str

import_from(origin=None, schema=None, sync=None, copytree=None)

Import the data space located at origin into this project.

This function will walk through the data space located at origin and will try to identify data space paths that can be imported as a job workspace into this project.

The schema argument expects a function that takes a path argument and returns a state point dictionary. A default function is used when no argument is provided. The default schema function will simply look for state point manifest files (usually named signac_statepoint.json) and then import all data located within that path into the job workspace corresponding to the state point specified in the manifest file.

Alternatively, the schema argument may be a string that is converted into a schema function. For example, providing foo/{foo:int} as the schema argument means that all directories under foo/ will be imported and their names will be interpreted as the value for foo within the state point.

Tip

Use copytree=os.replace or copytree=shutil.move to move dataspaces on import instead of copying them.

Warning: Imports can fail due to conflicts. Moving data instead of copying may therefore lead to inconsistent states and users are advised to apply caution.

See also

export_to() : Export the project data space.

signac import :

See signac import for the command line equivalent.

Parameters
  • origin – The path to the data space origin, which is to be imported. This may be a path to a directory, a zip file, or a tarball archive (Default value = None).

  • schema – An optional schema function, which is either a string or a function that accepts a path as its first and only argument and returns the corresponding state point as dict. (Default value = None).

  • sync – If True, the project will be synchronized with the imported data space. If a dict of keyword arguments is provided, the arguments will be used for sync() (Default value = None).

  • copytree – Specify which exact function to use for the actual copytree operation. Defaults to shutil.copytree().

Returns

A dict that maps the source directory paths to the target directory paths.

Return type

dict
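
A brief sketch importing a directory with a string schema; the origin path and the foo/{foo:int} layout are assumptions:

import signac

project = signac.get_project()

# Interpret each directory name under 'foo/' as the integer value of 'foo'.
mapping = project.import_from(origin='/data/external', schema='foo/{foo:int}')
for src, dst in mapping.items():
    print(src, '->', dst)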

index(formats=None, depth=0, skip_errors=False, include_job_document=True)

Generate an index of the project’s workspace.

This generator function indexes every file in the project’s workspace up to the specified depth. The job document, if it exists, is always indexed; other files need to be specified with the formats argument.

See signac project -i for the command line equivalent.

for doc in project.index({r'.*\.txt': 'TextFile'}):
    print(doc)

Parameters
  • formats (str, dict) – The format definitions, given as a pattern string (e.g. r'.*\.txt') or as a mapping from pattern strings to formats (e.g. {r'.*\.txt': 'TextFile'}). If None, only the job document is indexed (Default value = None).

  • depth (int) – Specifies the crawling depth. A value of 0 means no limit (Default value = 0).

  • skip_errors (bool) – Skip all errors which occur during indexing. This is useful when trying to repair a broken workspace (Default value = False).

  • include_job_document (bool) – Include the contents of job documents (Default value = True).

Yields

dict – Index document.

Deprecated since version 1.8: This will be removed in 2.0. Indexing is deprecated.

classmethod init_project(name=None, root=None, workspace=None, make_dir=True)

Initialize a project.

It is safe to call this function multiple times with the same arguments. However, a RuntimeError is raised if an existing project configuration would conflict with the provided initialization parameters.

See signac init for the command line equivalent.

Parameters
  • name (str, optional) – The name of the project to initialize (Default value = None).

  • root (str, optional) – The root directory for the project. Defaults to the current working directory.

  • workspace (str, optional) – The workspace directory for the project. Defaults to a subdirectory workspace in the project root.

  • make_dir (bool, optional) – Create the project root directory if it does not exist yet (Default value = True).

Returns

Initialized project, an instance of Project.

Return type

Project

Raises

RuntimeError – If the project root path already contains a conflicting project configuration.

isfile(filename)

Check if a filename exists in the project’s root directory.

Parameters

filename (str) – The name of the file.

Returns

True if filename exists in the project’s root directory.

Return type

bool

min_len_unique_id()

Determine the minimum length required for a job id to be unique.

This method’s runtime scales with the number of jobs in the workspace.

Returns

Minimum string length of a unique job identifier.

Return type

int
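
For example, a sketch that prints the shortest unambiguous prefix of each job id:

import signac

project = signac.get_project()
n = project.min_len_unique_id()
for job in project:
    print(job.id[:n])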

num_jobs()

Return the number of initialized jobs.

Returns

Count of initialized jobs.

Return type

int

Deprecated since version 1.8: This will be removed in 2.0. The num_jobs method is deprecated. Use len(project) instead.

open_job(statepoint=None, id=None)

Get a job handle associated with a state point.

This method returns the job instance associated with the given state point or job id. Opening a job by a valid state point never fails. Opening a job by id requires a lookup of the state point from the job id, which may fail if the job was not previously initialized.

Parameters
  • statepoint (dict) – The job’s unique set of state point parameters (Default value = None).

  • id (str) – The job id (Default value = None).

Returns

The job instance.

Return type

Job

Raises
  • KeyError – If the attempt to open the job by id fails.

  • LookupError – If the attempt to open the job by an abbreviated id returns more than one match.
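
For example, a minimal sketch of opening a job by state point and again by id; the state point keys are assumptions:

import signac

project = signac.get_project()

# Opening by state point never fails; init() creates the workspace directory.
job = project.open_job({'a': 0, 'b': {'c': 1}}).init()

# Opening by id requires that the job was previously initialized.
same_job = project.open_job(id=job.id)
assert same_job.id == job.id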

property path

The path to the project directory.

Type

str

read_statepoints(fn=None)

Read all state points from a file.

See also

dump_statepoints

Dump the state points and associated job ids.

write_statepoints

Dump state points to a file.

Parameters

fn (str) – The filename of the file containing the state points, defaults to FN_STATEPOINTS.

Returns

State points.

Return type

dict

Deprecated since version 1.8: This will be removed in 2.0. State point backup files are being removed in favor of Project.update_cache().

repair(fn_statepoints=None, index=None, job_ids=None)

Attempt to repair the workspace after it got corrupted.

This method will attempt to repair lost or corrupted job state point manifest files using a state points file or a document index or both.

Parameters
  • fn_statepoints (str) – The filename of the file containing the state points, defaults to FN_STATEPOINTS.

  • index – A document index (Default value = None).

  • job_ids – An iterable of job ids that should get repaired. Defaults to all jobs.

Raises

signac.errors.JobsCorruptedError – When one or more corrupted jobs could not be repaired.

reset_statepoint(job, new_statepoint)

Overwrite the state point of this job while preserving job data.

This method will change the job id if the state point has been altered.

Danger

Use this function with caution! Resetting a job’s state point may sometimes be necessary, but can possibly lead to incoherent data spaces.

Parameters
  • job (Job) – The job that should be reset to a new state point.

  • new_statepoint (mapping) – The job’s new state point.

Raises
  • DestinationExistsError – If a job associated with the new state point is already initialized.

  • OSError – If the move failed due to an unknown system related error.

Deprecated since version 1.3: This will be removed in 2.0. Use job.statepoint = new_statepoint instead.

root_directory()

Alias for path.

Deprecated since version 1.8: This will be removed in 2.0. Use Project.path instead.

property stores

Get HDF5-stores associated with this project.

Use this property to access an HDF5 file within the project’s root directory using the H5Store dict-like interface.

This is an example of accessing an HDF5 file called 'my_data.h5' within the project’s root directory:

project.stores['my_data']['array'] = np.random.rand(32, 4)

This is equivalent to:

H5Store(project.fn('my_data.h5'))['array'] = np.random.rand(32, 4)

Both the project.stores and the H5Store itself support attribute access. The above example could therefore also be expressed as:

project.stores.my_data.array = np.random.rand(32, 4)

Returns

The HDF5-Store manager for this project.

Return type

H5StoreManager

sync(other, strategy=None, exclude=None, doc_sync=None, selection=None, **kwargs)

Synchronize this project with the other project.

Try to clone all jobs from the other project to this project. If a job is already part of this project, try to synchronize the job using the optionally specified strategies.

See signac sync for the command line equivalent.

Parameters
  • other (Project) – The other project to synchronize this project with.

  • strategy – A file synchronization strategy (Default value = None).

  • exclude – Files with names matching the given pattern will be excluded from the synchronization (Default value = None).

  • doc_sync – The function applied for synchronizing documents (Default value = None).

  • selection – Only sync the given jobs (Default value = None).

  • **kwargs – This method also accepts the same keyword arguments as the sync_projects() function.

Raises
  • DocumentSyncConflict – If there are conflicting keys within the project or job documents that cannot be resolved with the given strategy or if there is no strategy provided.

  • FileSyncConflict – If there are differing files that cannot be resolved with the given strategy or if no strategy is provided.

  • SchemaSyncConflict – In case that the check_schema argument is True and the detected state point schema of this and the other project differ.
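
A minimal sketch; the other project path and the exclude pattern are assumptions:

import signac

project = signac.get_project()
other = signac.get_project('/path/to/other_project')  # hypothetical path

# Clone missing jobs from the other project; skip log files during file sync.
# A file synchronization strategy is required if conflicting files exist.
project.sync(other, exclude='*.log')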

temporary_project(name=None, dir=None)

Context manager for the initialization of a temporary project.

The temporary project is by default created within the root project’s workspace to ensure that they share the same file system. This is an example of how this method can be used for the import and synchronization of external data spaces.

with project.temporary_project() as tmp_project:
    tmp_project.import_from('/data')
    project.sync(tmp_project)

Parameters
  • name (str) – An optional name for the temporary project. Defaults to a unique random string.

  • dir (str) – Optionally specify where the temporary project root directory is to be created. Defaults to the project’s workspace directory.

Returns

An instance of Project.

Return type

Project

to_dataframe(*args, **kwargs)

Export the project metadata to a pandas DataFrame.

The arguments to this function are forwarded to JobsCursor.to_dataframe().

Parameters
  • *args, **kwargs – Positional and keyword arguments forwarded to JobsCursor.to_dataframe().

Return type

DataFrame
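
For example, a short sketch (requires pandas; the column selection assumes the default 'sp.' prefix):

import signac

project = signac.get_project()
df = project.to_dataframe(usecols=lambda col: col.startswith('sp.'))
print(df.head())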

update_cache()

Update the persistent state point cache.

This function updates a persistent state point cache, which is stored in the project root directory. Most data space operations, including iteration and filtering or selection are expected to be significantly faster after calling this function, especially for large data spaces.

update_statepoint(job, update, overwrite=False)

Change the state point of this job while preserving job data.

By default, this method will not change existing parameters of the state point of the job.

This method will change the job id if the state point has been altered.

Warning

While appending to a job’s state point is generally safe, modifying existing parameters may lead to data inconsistency. Use the overwrite argument with caution!

Parameters
  • job (Job) – The job whose state point shall be updated.

  • update (mapping) – A mapping used for the state point update.

  • overwrite (bool, optional) – If False, an error will be raised if the update modifies the values of existing keys in the state point. If True, any existing keys will be overwritten in the same way as dict.update(). Use with caution! (Default value = False).

Raises
  • KeyError – If the update contains keys which are already part of the job’s state point and overwrite is False.

  • DestinationExistsError – If a job associated with the new state point is already initialized.

  • OSError – If the move failed due to an unknown system related error.

Deprecated since version 1.3: This will be removed in 2.0. Use job.update_statepoint() instead.

property workspace

The project’s workspace directory.

See signac project -w for the command line equivalent.

Type

str

write_statepoints(statepoints=None, fn=None, indent=2)

Dump state points to a file.

If the file already contains state points, all new state points will be appended, while the old ones are preserved.

See also

dump_statepoints

Dump the state points and associated job ids.

Parameters
  • statepoints (iterable) – A list of state points, defaults to all state points which are defined in the workspace.

  • fn (str) – The filename of the file containing the state points, defaults to FN_STATEPOINTS.

  • indent (int) – Specify the indentation of the JSON file (Default value = 2).

Deprecated since version 1.8: This will be removed in 2.0. State point backup files are being removed in favor of Project.update_cache().

The JobsCursor class

Attributes

JobsCursor.export_to(target[, path, copytree])

Export all jobs to a target location, such as a directory or a (zipped) archive file.

JobsCursor.groupby([key, default])

Group jobs according to one or more state point or document parameters.

JobsCursor.groupbydoc([key, default])

Group jobs according to one or more document values.

JobsCursor.to_dataframe([sp_prefix, ...])

Convert the selection of jobs to a pandas DataFrame.

class signac.contrib.project.JobsCursor(project, filter=None, doc_filter=None)

Bases: object

An iterator over a search query result.

Application developers should not directly instantiate this class, but use find_jobs() instead.

Enables simple iteration and grouping operations.

Parameters
  • project (Project) – Project handle.

  • filter (Mapping) – A mapping of key-value pairs that all indexed job state points are compared against (Default value = None).

  • doc_filter (Mapping) – A mapping of key-value pairs that all indexed job documents are compared against (Default value = None).

Notes

Iteration is performed by acquiring job ids from the project using Project._find_job_ids(). When no filter (filter = None) is provided, that method can take a much faster execution path, so not passing a filter (or passing None explicitly) to this constructor is strongly recommended over passing an empty filter (filter = {}) when iterating over the entire data space.

export_to(target, path=None, copytree=None)

Export all jobs to a target location, such as a directory or a (zipped) archive file.

See also

export_to()

For full details on how to use this function.

Parameters
  • target (str) – A path to a directory or archive file to export to.

  • path (str or callable) – The path (function) used to structure the exported data space (Default value = None).

  • copytree (callable) – The function used for copying of directory tree structures. Defaults to shutil.copytree(). Can only be used when the target is a directory (Default value = None).

Returns

A dictionary that maps the source directory paths to the target directory paths.

Return type

dict

groupby(key=None, default=None)

Group jobs according to one or more state point or document parameters.

Prepend the key with ‘sp.’ or ‘doc.’ to specify the query namespace. If no prefix is specified, group by state point key.

This method can be called on any JobsCursor such as the one returned by find_jobs() or by iterating over a project.

Examples

# Group jobs by state point parameter 'a'.
for key, group in project.groupby('a'):
    print(key, list(group))

# Group jobs by document value 'a'.
for key, group in project.groupby('doc.a'):
    print(key, list(group))

# Group jobs by jobs.sp['a'] and job.document['b']
for key, group in project.groupby(('a', 'doc.b')):
    print(key, list(group))

# Find jobs where job.sp['a'] is 1 and group them
# by job.sp['b'] and job.sp['c'].
for key, group in project.find_jobs({'a': 1}).groupby(('b', 'c')):
    print(key, list(group))

# Group by job.sp['d'] and job.document['count'] using a lambda.
for key, group in project.groupby(
    lambda job: (job.sp['d'], job.document['count'])
):
    print(key, list(group))

If key is None, jobs are grouped by id, placing one job into each group.

If default is None, only jobs with the key defined will be grouped. Jobs without the key will be filtered out and not included in any group.

Parameters
  • key (str, iterable, or callable) – The grouping key(s) passed as a string, iterable of strings, or a callable that will be passed one argument, the job (Default value = None).

  • default – A default value to be used when a given key is not present. The value must be sortable and is only used if not None (Default value = None).

Yields
  • key – Grouped key.

  • group (iterable of Jobs) – Iterable of Job instances matching this group key.

groupbydoc(key=None, default=None)

Group jobs according to one or more document values.

This method can be called on any JobsCursor such as the one returned by find_jobs() or by iterating over a project.

Examples

# Group jobs by document value 'a'.
for key, group in project.groupbydoc('a'):
    print(key, list(group))

# Find jobs where job.sp['a'] is 1 and group them
# by job.document['b'] and job.document['c'].
for key, group in project.find_jobs({'a': 1}).groupbydoc(('b', 'c')):
    print(key, list(group))

# Group by whether 'd' is a field in the job.document using a lambda.
for key, group in project.groupbydoc(lambda doc: 'd' in doc):
    print(key, list(group))

If key is None, jobs are grouped by id, placing one job into each group.

Parameters
  • key (str, iterable, or callable) – The document grouping parameter(s) passed as a string, iterable of strings, or a callable that will be passed one argument, document (Default value = None).

  • default – A default value to be used when a given document key is not present. The value must be sortable and is only used if not None (Default value = None).

Deprecated since version 1.7: This will be removed in 2.0. Use groupby with a ‘doc.’ filter instead, see https://docs.signac.io/en/latest/query.html#query-namespaces.

next()

Return the next element.

This function is deprecated. Users should use next(iter(...)) instead.

Deprecated since version 0.9.6: This will be removed in 2.0. Use next(iter(…)) instead.

to_dataframe(sp_prefix='sp.', doc_prefix='doc.', usecols=None, flatten=False)

Convert the selection of jobs to a pandas DataFrame.

This function exports the job metadata to a pandas.DataFrame. All state point and document keys are prefixed by default to be able to distinguish them.

Parameters
  • sp_prefix (str, optional) – Prefix state point keys with the given string. Defaults to “sp.”.

  • doc_prefix (str, optional) – Prefix document keys with the given string. Defaults to “doc.”.

  • usecols (list-like or callable, optional) – Used to select a subset of columns. If list-like, must contain strings corresponding to the column names that should be included. For example, ['sp.a', 'doc.notes']. If callable, the column will be included if the function called on the column name returns True. For example, lambda x: 'sp.' in x. Defaults to None, which uses all columns from the state point and document. Note that this filter is applied after the doc and sp prefixes are added to the column names.

  • flatten (bool, optional) – Whether nested state points or document keys should be flattened. If True, {'a': {'b': 'c'}} becomes a column named a.b with value c. If False, it becomes a column named a with value {'b': 'c'}. Defaults to False.

Returns

A pandas DataFrame with all job metadata.

Return type

DataFrame
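
A brief sketch selecting only document columns of a filtered set of jobs; the filter key is an assumption:

import signac

project = signac.get_project()
df = project.find_jobs({'foo': 0}).to_dataframe(
    usecols=lambda col: col.startswith('doc.'),
    flatten=True,  # expand nested keys into columns such as 'doc.a.b'
)
print(df)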

The Job class

Attributes

Job.clear()

Remove all job data, but not the job itself.

Job.close()

Close the job and switch to the previous working directory.

Job.data

Get data associated with this job.

Job.doc

Alias for document.

Job.document

Get document associated with this job.

Job.fn(filename)

Prepend a filename with the job's workspace directory path.

Job.get_id()

Job's state point unique identifier.

Job.id

Get the unique identifier for the job's state point.

Job.init([force])

Initialize the job's workspace directory.

Job.isfile(filename)

Return True if file exists in the job's workspace.

Job.move(project)

Move this job to project.

Job.open()

Enter the job's workspace directory.

Job.remove()

Remove the job's workspace including the job document.

Job.reset()

Remove all job data, but not the job itself.

Job.reset_statepoint(new_statepoint)

Overwrite the state point of this job while preserving job data.

Job.sp

Alias for statepoint.

Job.statepoint

Get or set the job's state point.

Job.stores

Get HDF5 stores associated with this job.

Job.sync(other[, strategy, exclude, doc_sync])

Perform a one-way synchronization of this job with the other job.

Job.update_statepoint(update[, overwrite])

Change the state point of this job while preserving job data.

Job.workspace()

Alias for path.

Job.ws

Alias for path.

class signac.contrib.job.Job(project, statepoint=None, _id=None)

Bases: object

The job instance is a handle to the data of a unique state point.

Application developers should not directly instantiate this class, but use open_job() instead.

Jobs can be opened by statepoint or _id. If both values are provided, it is the user’s responsibility to ensure that the values correspond.

Parameters
  • project (Project) – Project handle.

  • statepoint (dict) – State point for the job. (Default value = None)

  • _id (str) – The job identifier. (Default value = None)

FN_DOCUMENT = 'signac_job_document.json'

The job’s document filename.

FN_MANIFEST = 'signac_statepoint.json'

The job’s state point filename.

The state point file is a human-readable file containing the job’s state point and is stored in each job’s workspace directory.

KEY_DATA = 'signac_data'

The job’s datastore key.

clear()

Remove all job data, but not the job itself.

This function will do nothing if the job was not previously initialized.

See signac rm -c for the command line equivalent.

close()

Close the job and switch to the previous working directory.

property data

Get data associated with this job.

This property should be used for large array-like data, which can’t be stored efficiently in the job document. For examples and usage, see Job Data Storage.

Equivalent to:

return job.stores['signac_data']

Returns

An HDF5-backed datastore.

Return type

H5Store

property doc

Alias for document.

Warning

Even deep copies of doc will modify the same file, so changes will still effectively be persisted between deep copies. If you need a deep copy that will not modify the underlying persistent JSON file, use the call operator to get an equivalent plain dictionary: job.doc().

See signac document for the command line equivalent.

Returns

The job document handle.

Return type

JSONDict

property document

Get document associated with this job.

Warning

Even deep copies of document will modify the same file, so changes will still effectively be persisted between deep copies. If you need a deep copy that will not modify the underlying persistent JSON file, use the call operator to get an equivalent plain dictionary: job.document(). For more information, see JSONDict.

See signac document for the command line equivalent.

Returns

The job document handle.

Return type

JSONDict

fn(filename)

Prepend a filename with the job’s workspace directory path.

Parameters

filename (str) – The name of the file.

Returns

The full workspace path of the file.

Return type

str

get_id()

Job’s state point unique identifier.

Returns

The job id.

Return type

str

Deprecated since version 1.3: This will be removed in 2.0. Use job.id instead.

property id

Get the unique identifier for the job’s state point.

Returns

The job id.

Return type

str

init(force=False)

Initialize the job’s workspace directory.

This function will do nothing if the directory and the job state point already exist and the state point is valid.

Returns the calling job.

See signac job -c for the command line equivalent.

Parameters

force (bool) – Overwrite any existing state point files, e.g., to repair them if they got corrupted (Default value = False).

Returns

The job handle.

Return type

Job

Raises
  • OSError – If the workspace directory cannot be created or any other I/O error occurs when attempting to save the state point file.

  • JobsCorruptedError – If the job state point on disk is corrupted.
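
For example, a minimal sketch; the state point key is an assumption:

import signac

project = signac.get_project()

# init() returns the job handle, so creation and use can be chained.
job = project.open_job({'a': 1}).init()
print(job.path)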

isfile(filename)

Return True if file exists in the job’s workspace.

Parameters

filename (str) – The name of the file.

Returns

True if file with filename exists in workspace.

Return type

bool

move(project)

Move this job to project.

This function will attempt to move this instance of job from its original project to a different project.

See signac move for the command line equivalent.

Parameters

project (Project) – The project to move this job to.

open()

Enter the job’s workspace directory.

You can use the Job class as a context manager:

with project.open_job(my_statepoint) as job:
    pass  # manipulate your job data

Opening the context will switch into the job’s workspace directory; leaving it will switch back to the previous working directory.

property path

The path to the job directory.

See signac job -w for the command line equivalent.

Type

str

remove()

Remove the job’s workspace including the job document.

This function will do nothing if the workspace directory does not exist.

See signac rm for the command line equivalent.

reset()

Remove all job data, but not the job itself.

This function will initialize the job if it was not previously initialized.

reset_statepoint(new_statepoint)

Overwrite the state point of this job while preserving job data.

This method will change the job id if the state point has been altered.

For more information, see Modifying the State Point.

Danger

Use this function with caution! Resetting a job’s state point may sometimes be necessary, but can possibly lead to incoherent data spaces.

Parameters

new_statepoint (dict) – The job’s new state point.

Deprecated since version 1.8: This will be removed in 2.0. Use job.statepoint = new_statepoint instead.

property sp

Alias for statepoint.

property statepoint

Get or set the job’s state point.

Setting the state point to a different value will change the job id.

For more information, see Modifying the State Point.

Warning

The state point object behaves like a dictionary in most cases, but because it persists changes to the filesystem, making a copy requires explicitly converting it to a dict. If you need a modifiable copy that will not modify the underlying JSON file, you can access a dict copy of the state point by calling it, e.g. sp_dict = job.statepoint() instead of sp = job.statepoint. For more information, see JSONAttrDict.

See signac statepoint for the command line equivalent.

Danger

Use this function with caution! Resetting a job’s state point may sometimes be necessary, but can possibly lead to incoherent data spaces.

Returns

Returns the job’s state point.

Return type

dict
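
For example, a short sketch contrasting the live handle with a detached copy (assuming an initialized job handle):

sp_view = job.statepoint    # live, file-backed mapping
sp_copy = job.statepoint()  # detached plain dict
sp_copy['a'] = 42           # modifies only the copy, not the job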

property stores

Get HDF5 stores associated with this job.

Use this property to access an HDF5 file within the job’s workspace directory using the H5Store dict-like interface.

This is an example of accessing an HDF5 file called ‘my_data.h5’ within the job’s workspace:

job.stores['my_data']['array'] = np.random.rand(32, 4)

This is equivalent to:

H5Store(job.fn('my_data.h5'))['array'] = np.random.rand(32, 4)

Both the stores and the H5Store itself support attribute access. The above example could therefore also be expressed as:

job.stores.my_data.array = np.random.rand(32, 4)

Returns

The HDF5-Store manager for this job.

Return type

H5StoreManager

sync(other, strategy=None, exclude=None, doc_sync=None, **kwargs)

Perform a one-way synchronization of this job with the other job.

By default, this method will synchronize all files and document data with the other job to this job until a synchronization conflict occurs. There are two different kinds of synchronization conflicts:

  1. The two jobs have files with the same, but different content.

  2. The two jobs have documents that share keys, but those keys are associated with different values.

A file conflict can be resolved by providing a ‘FileSync’ strategy or by excluding files from the synchronization. An unresolvable conflict is indicated by raising a FileSyncConflict exception.

A document synchronization conflict can be resolved by providing a doc_sync function that takes the source and the destination document as first and second argument.

Parameters
  • other (Job) – The other job to synchronize from.

  • strategy – A synchronization strategy for file conflicts. If no strategy is provided, a SyncConflict exception will be raised upon conflict (Default value = None).

  • exclude (str) – A filename exclude pattern. All files matching this pattern will be excluded from synchronization (Default value = None).

  • doc_sync – A synchronization strategy for document keys. If this argument is None, by default no keys will be synchronized upon conflict.

  • dry_run – If True, do not actually perform the synchronization.

  • **kwargs – Extra keyword arguments will be forwarded to the sync_jobs() function, which actually executes the synchronization operation.

Raises

FileSyncConflict – In case that a file synchronization results in a conflict.

update_statepoint(update, overwrite=False)

Change the state point of this job while preserving job data.

By default, this method will not change existing parameters of the state point of the job.

This method will change the job id if the state point has been altered.

For more information, see Modifying the State Point.

Warning

While appending to a job’s state point is generally safe, modifying existing parameters may lead to data inconsistency. Use the overwrite argument with caution!

Parameters
  • update (dict) – A mapping used for the state point update.

  • overwrite (bool, optional) – If False, an error will be raised if the update modifies the values of existing keys in the state point. If True, any existing keys will be overwritten in the same way as dict.update(). Use with caution! (Default value = False).

Raises
  • KeyError – If the update contains keys which are already part of the job’s state point and overwrite is False.

  • DestinationExistsError – If a job associated with the new state point is already initialized.

  • OSError – If the move failed due to an unknown system related error.
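
For example, a minimal sketch; the state point keys are assumptions:

import signac

project = signac.get_project()
job = project.open_job({'a': 1}).init()

# Appending a new key is safe; existing keys are left untouched.
job.update_statepoint({'b': 2})

# Changing an existing key requires overwrite=True and changes the job id.
job.update_statepoint({'a': 5}, overwrite=True)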

workspace()

Alias for path.

Deprecated since version 1.8: This will be removed in 2.0. Use Job.path instead.

property ws

Alias for path.

Deprecated since version 1.8: This will be removed in 2.0. Use Job.path instead.

The Collection

class signac.Collection(docs=None, primary_key='_id', compresslevel=0, _trust=False)

A collection of documents.

The Collection class manages a collection of documents in memory or in a file on disk. A document is defined as a dictionary mapping of key-value pairs.

An instance of Collection may be used to manage and search documents. For example, given a collection with member data, where each document contains a name entry and an age entry, we can find the names of all members who are 32 years old like this:

members = [
    {'name': 'John',  'age': 32},
    {'name': 'Alice', 'age': 28},
    {'name': 'Kevin', 'age': 32},
    # ...
    ]

member_collection = Collection(members)
for doc in member_collection.find({'age': 32}):
    print(doc['name'])

To iterate over all documents in the collection, use:

for doc in collection:
    print(doc)

By default a collection object will reside in memory. However, it is possible to manage a collection associated to a file on disk. To open a collection which is associated with a file on disk, use the Collection.open() class method:

with Collection.open('collection.txt') as collection:
    for doc in collection.find({'age': 32}):
        print(doc)

The collection file is by default opened in a+ mode, which means it can be read from and written to and will be created if it does not exist yet.

Parameters
  • docs (iterable) – Initialize the collection with these documents.

  • primary_key (str) – The name of the key which serves as the primary index of the collection. Selecting documents by primary key has time complexity of O(N) in the worst case and O(1) on average. All documents must have a primary key value. The default primary key is _id.

  • compresslevel (int) – The level of compression to use. Any positive value implies compression and is used by the underlying gzip implementation. Default value is 0 (no compression).

Raises

ValueError – When the first argument is a string.

clear()

Remove all documents from the collection.

close()

Close this collection instance.

In case that the collection is associated with a file-object, all changes are flushed to the file and the file is closed.

It is not possible to re-open the same collection instance after closing it.

delete_many(filter)

Delete all documents that match the filter.

Parameters

filter (dict) – A document that should be deleted must match this filter.

delete_one(filter)

Delete one document that matches the filter.

Parameters

filter (dict) – The document that should be deleted must match this filter.
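
For example, a brief sketch using the member collection from above:

# Remove a single matching document, then all remaining matches.
member_collection.delete_one({'age': 32})
member_collection.delete_many({'age': 32})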

dump(file=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)

Dump the collection in JSON-encoding to file.

The file argument defaults to sys.stdout, which means the encoded blob will be printed to the screen if no file argument is provided.

For example, to dump to a file on disk, one could write:

with open('my_collection.txt', 'w') as file:
    collection.dump(file)

Parameters

file – The file to write the encoded blob to (Default value = sys.stdout).

find(filter=None, limit=0)

Find all documents matching filter, but not more than limit.

This function searches the collection for all documents that match the given filter and returns a result vector. For example:

for doc in collection.find(my_filter):
    print(doc)

Nested values should be searched using the . operator, for example:

docs = collection.find({'nested.value': 42})

will return documents with a nested structure: {'nested': {'value': 42}}.

The result of find() can be stored and iterated over multiple times. In addition, the result vector can be queried for its size:

docs = collection.find(my_filter)

print(len(docs))    # the number of documents matching

for doc in docs:    # iterate over the result vector
    pass

Arithmetic Operators

  • $eq: equal

  • $ne: not equal

  • $gt: greater than

  • $gte: greater or equal than

  • $lt: less than

  • $lte: less or equal than

project.find({"a": {"$lt": 5})

Matches all docs with a less than 5.

Logical Operators

That includes $and and $or; both expect a list of expressions.

project.find({"$or": [{"a": 4}, {"b": {"$gt": 3}}]})

Matches all docs, where a is 4 or b is greater than 3.

Exists operator

Determines whether a specific key exists, or not, e.g.:

project.find({"a": {"$exists": True}})

Array operator

To determine whether specific elements are in ($in), or not in ($nin) an array, e.g.:

project.find({"a": {"$in": [0, 1, 2]}})

Matches all docs, where a is either 0, 1, or 2. Usage of $nin is equivalent.

Regular expression operator

Allows the “on-the-fly” evaluation of regular expressions, e.g.:

project.find({"protocol": {"$regex": "foo"}})

Will match all docs with a protocol that contains the term ‘foo’.

$type operator

Matches when a value is of a specific type, e.g.:

collection.find({"protocol": {"$type": "str"}})

Finds all docs where the value of protocol is of type str. Other types that can be checked are: int, float, bool, list, and null.

$where operator

Matches an arbitrary python expression, e.g.:

project.find({"foo": {"$where": "lambda x: x.startswith('bar')"}})

Matches all docs, where the value for foo starts with the word ‘bar’.

Parameters
  • filter (dict) – All documents must match the given filter (Default value = None).

  • limit (int) – Do not return more than limit number of documents. A limit value of 0 (the default) means no limit.

Returns

A result object that iterates over all matching documents.

Return type

_CollectionSearchResults

Raises

ValueError – In case that the filter argument is invalid.

find_one(filter=None)

Return one document that matches the filter or None.

doc = collection.find_one(my_filter)
if doc is None:
    print("No result found for filter", my_filter)
else:
    print("Doc matching filter:", my_filter, doc)
Parameters

filter (dict) – The returned document must match the given filter (Default value = None).

Returns

A matching document or None.

Return type

dict

Raises

ValueError – In case that the filter argument is invalid.

flush()

Write all changes to the associated file.

If the collection instance is associated with a file-object, calling the flush() method will write all changes to this file.

This method is also called when the collection is explicitly or implicitly closed.

property ids

Get an iterator over the primary key in the collection.

Returns

iterator over the primary key in the collection.

Return type

iterable

index(key, build=False)

Get (and optionally build) the index for a given key.

An index allows documents to be accessed by a specific key with minimal time complexity, e.g.:

age_index = member_collection.index('age')
for _id in age_index[32]:
    print(member_collection[_id]['name'])

This means we can access documents by the ‘age’ key in O(1) time on average in addition to the primary key. Using the find() method will automatically build all required indexes for the particular search.

Once an index has been built, it will be internally managed by the class and updated with subsequent changes. An index returned by this method is always current with the latest state of the collection.

Parameters
  • key (str) – The primary key of the requested index.

  • build (bool) – If True, build a non-existing index if necessary, otherwise raise KeyError (Default value = False).

Returns

Index for the given key.

Return type

dict

Raises

KeyError – In case the build is False and the index has not been built yet or no index is present for the key.

insert_one(doc)

Insert one document into the collection.

If the document does not have a value for the collection’s primary key yet, it will be assigned one.

_id = collection.insert_one(doc)
assert _id in collection

Note

The document will be updated in place if it has no primary key, and must therefore be mutable!

Parameters

doc (dict) – The document to be inserted.

Returns

The _id of the inserted document.

Return type

str

main()

Start a command line interface for this Collection.

Use this function to interact with this instance of Collection on the command line. For example, executing the following script:

# find.py
with Collection.open('my_collection.txt') as c:
    c.main()

will enable us to search for documents on the command line like this:

$ python find.py '{"age": 32}'
{"name": "John", "age": 32}
{"name": "Kevin", "age": 32}
Raises

ValueError – When both --id and --indent are selected.

classmethod open(filename, mode=None, compresslevel=None)

Open a collection associated with a file on disk.

Using this factory method will return a collection that is associated with a collection file on disk. For example:

with Collection.open('collection.txt') as collection:
    for doc in collection:
        print(doc)

will read all documents from the collection.txt file or create the file if it does not exist yet.

Modifications to the file will be written to the file when the flush() method is called or the collection is explicitly closed by calling the Collection.close() method or implicitly by leaving the with-clause:

with Collection.open('collection.txt') as collection:
    collection.update(my_docs)
# All changes to the collection have been written to collection.txt.

The open-modes work as expected, so for example to open a collection file in read-only mode, use Collection.open('collection.txt', 'r').

Opening a gzip (*.gz) file also works as expected. Because gzip does not support a combined read and write mode, mode=*+ is not available. Be sure to open the file in read, write, or append mode as required. Due to the manner in which gzip works, opening a file in mode=wt will effectively erase the current file, so take care using mode=wt.

Parameters
  • filename (str) – Name of file to read the documents from or create the file if it does not exist.

  • mode (str) – Open the file with mode (Default value = None).

  • compresslevel (int) – The level of compression to use. Any positive value implies compression and is used by the underlying gzip implementation. (Default value = None)

Returns

An instance of Collection.

Return type

Collection

Raises

RuntimeError – When the file open-mode is not None for an in-memory collection, or when a compressed collection is not opened in binary mode.

property primary_key

Get the name of the collection’s primary key (default=’_id’).

classmethod read_json(file=None)

Construct an instance of Collection from a JSON file.

Parameters

file – The json file to read, provided as either a filename or a file-like object (Default value = None).

Returns

A Collection containing the documents read from the JSON file.

Return type

Collection

replace_one(filter, replacement, upsert=False)

Replace one document that matches the given filter.

The first document matching the filter will be replaced by the given replacement document. If the upsert argument is True, the replacement will be inserted in case that no document matches the filter.

Parameters
  • filter (dict) – A document that should be replaced must match this filter.

  • replacement (dict) – The replacement document.

  • upsert (bool) – If True, insert the replacement document in the case that no document matches the filter (Default value = False).

Returns

The id of the replaced (or upserted) document.

Return type

str

Raises

ValueError – In case that the filter argument is invalid.
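
For example, a short sketch, assuming an in-memory Collection with a single hypothetical document:

from signac import Collection

collection = Collection([{'name': 'John', 'age': 32}])

# Replace the first matching document; with upsert=True the replacement
# would be inserted even if no document matched the filter.
_id = collection.replace_one({'name': 'John'}, {'name': 'John', 'age': 33}, upsert=True)
assert collection[_id]['age'] == 33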

to_json(file=None)

Dump the collection as a JSON file.

This function returns the JSON-string directly if the file argument is None.

Parameters

file – The filename or a file-like object to write the JSON string to (Default value = None).

Returns

The JSON string, when no file argument is provided.

Return type

str
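
For example, a minimal sketch of a round trip through a JSON file (the filename is hypothetical):

from signac import Collection

collection = Collection([{'name': 'John', 'age': 32}])

# Serialize to a JSON string when no file argument is given ...
blob = collection.to_json()

# ... or write directly to a file and read it back with read_json().
collection.to_json('my_collection.json')
restored = Collection.read_json('my_collection.json')
assert len(restored.find()) == 1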

update(docs)

Update the collection with these documents.

Any existing documents with the same primary key will be replaced.

Parameters

docs (iterable) – A sequence of documents to be upserted into the collection.
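
For example, a brief sketch, assuming documents that carry explicit _id values (the default primary key):

from signac import Collection

collection = Collection([{'_id': 'a1', 'value': 1}])

# Documents with an existing primary key are replaced, new ones are inserted.
collection.update([{'_id': 'a1', 'value': 2}, {'_id': 'b2', 'value': 3}])
assert collection['a1']['value'] == 2
assert len(collection.find()) == 2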

The JSONDict

This class implements the interface for the job’s statepoint and document attributes, but can also be used stand-alone:

signac.JSONDict

alias of BufferedJSONAttrDict

The H5Store

This class implements the interface to the job’s data attribute, but can also be used stand-alone:

class signac.H5Store(filename, **kwargs)

An HDF5-backed container for storing array-like and dictionary-like data.

The H5Store is a MutableMapping and therefore behaves similar to a dict, but all data is stored persistently in the associated HDF5 file on disk.

Supported types include:

  • built-in types (int, float, str, bool, NoneType, array)

  • numpy arrays

  • pandas data frames (requires pandas and pytables)

  • mappings with values that are supported types

Values can be accessed as attributes (h5s.foo) or via key index (h5s['foo']).

Examples

>>> from signac import H5Store
>>> with H5Store('file.h5') as h5s:
...     h5s['foo'] = 'bar'
...     assert 'foo' in h5s
...     assert h5s.foo == 'bar'
...     assert h5s['foo'] == 'bar'
>>>

The H5Store can be used as a context manager to ensure that the underlying file is opened; however, most built-in types (excluding arrays) can be read and stored without the need to explicitly open the file. To access arrays (reading or writing), the file must always be opened!

To open a file in read-only mode, use the open() method with mode='r':

>>> with H5Store('file.h5').open(mode='r') as h5s:
...     pass
>>>
Parameters
  • filename (str) – The filename of the underlying HDF5 file.

  • **kwargs – Additional keyword arguments to be forwarded to the h5py.File constructor. See the documentation for the h5py.File constructor for more information.

clear()

Remove all data from this store.

Danger

All data will be removed, this action cannot be reversed!

close()

Close the underlying HDF5 file.

property file

Access the underlying instance of h5py.File.

This property exposes the underlying h5py.File object enabling use of functions such as create_dataset() or requires_dataset().

Note

The store must be open to access this property!

Returns

The h5py file-object that this store is operating on.

Return type

h5py.File

Raises

H5StoreClosedError – When the store is closed at the time of accessing this property.
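
For example, a minimal sketch, assuming that h5py and numpy are installed and that 'file.h5' is writable (the dataset name is hypothetical):

import numpy as np
from signac import H5Store

with H5Store('file.h5').open() as h5s:
    # Use the underlying h5py.File handle directly, e.g. to create a dataset.
    h5s.file.create_dataset('coords', data=np.zeros((10, 3)))
    assert h5s.file['coords'].shape == (10, 3)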

property filename

Return the H5Store filename.

flush()

Flush the underlying HDF5 file.

get(k[, d]) → D[k] if k in D, else d.  d defaults to None.
items() → a set-like object providing a view on D's items
keys() → a set-like object providing a view on D's keys
property mode

Return the default opening mode of this H5Store.

open(mode=None)

Open the underlying HDF5 file.

Parameters

mode – The file open mode to use. Defaults to ‘a’ (append).

Returns

This H5Store instance.

pop(k[, d]) → v, remove specified key and return the corresponding value.

If the key is not found, d is returned if given, otherwise KeyError is raised.

popitem() → (k, v), remove and return some (key, value) pair as a 2-tuple; raise KeyError if D is empty.

setdefault(key, value)

Set a value for a key if that key is not already set.

update([E, ]**F) → None.  Update D from mapping/iterable E and F.

If E is present and has a .keys() method, does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, does: for (k, v) in E: D[k] = v. In either case, this is followed by: for k, v in F.items(): D[k] = v.

values() → an object providing a view on D's values

The H5StoreManager

This class implements the interface to the job’s stores attribute, but can also be used stand-alone:

class signac.H5StoreManager(prefix)

Bases: DictManager

Helper class to manage multiple instances of H5Store within a directory.

Example (assuming that the ‘stores/’ directory exists):

>>> stores = H5StoreManager('stores/')
>>> stores.data
<H5Store(filename=stores/data.h5)>
>>> stores.data.foo = True
>>> dict(stores.data)
{'foo': True}
Parameters

prefix – The directory prefix shared by all stores managed by this class.

cls

alias of H5Store

keys()

Return an iterable of keys.

property prefix

Return the prefix.

Top-level functions

The signac framework aids in the management of large and heterogeneous data spaces.

It provides a simple and robust data model to create a well-defined, indexable storage layout for data and metadata. This makes it easier to operate on large data spaces, streamlines post-processing and analysis, and makes data collectively accessible.

signac.TemporaryProject(name=None, cls=None, **kwargs)

Context manager for the generation of a temporary project.

This is a factory function that creates a Project within a temporary directory and must be used as context manager, for example like this:

with TemporaryProject() as tmp_project:
    tmp_project.import_from('/data')
Parameters
  • name (str) – An optional name for the temporary project. Defaults to a unique random string.

  • cls – The class of the temporary project. Defaults to Project.

  • **kwargs – Optional keyword arguments that are forwarded to the TemporaryDirectory class constructor, which is used to create a temporary root directory.

Yields

Project – An instance of Project.

signac.buffered(buffer_size=None, force_write=None, *args, **kwargs)

Enter context to buffer all operations for this backend.

Parameters
  • buffer_size (int) – The capacity of the buffer to use within this context (resets after the context is exited).

  • force_write (bool) – This argument does nothing and is only present for compatibility with signac 1.x.
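
For example, a minimal sketch, assuming an initialized project whose jobs carry a document (the 'status' key is hypothetical):

import signac

project = signac.get_project()

# All document writes inside this context are buffered and flushed on exit.
with signac.buffered():
    for job in project:
        job.doc.status = 'pending'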

signac.diff_jobs(*jobs)

Find differences among a list of jobs’ state points.

The resulting diff is a dictionary where the keys are job ids and the values are each job’s state point minus the intersection of all provided jobs’ state points. The comparison is performed over the combined set of keys and values.

See signac diff for the command line equivalent.

Parameters

*jobs (sequence[Job]) – Sequence of jobs to diff.

Returns

A dictionary where the keys are job ids and values are the unique parts of that job’s state point.

Return type

dict

Examples

>>> import signac
>>> project = signac.init_project()
>>> job1 = project.open_job({'constant': 42, 'diff1': 0, 'diff2': 1}).init()
>>> job2 = project.open_job({'constant': 42, 'diff1': 1, 'diff2': 1}).init()
>>> job3 = project.open_job({'constant': 42, 'diff1': 2, 'diff2': 2}).init()
>>> print(job1)
c4af2b26f1fd256d70799ad3ce3bdad0
>>> print(job2)
b96b21fada698f8934d58359c72755c0
>>> print(job3)
e4289419d2b0e57e4852d44a09f167c0
>>> signac.diff_jobs(job1, job2, job3)
{'c4af2b26f1fd256d70799ad3ce3bdad0': {'diff2': 1, 'diff1': 0},
'b96b21fada698f8934d58359c72755c0': {'diff2': 1, 'diff1': 1},
'e4289419d2b0e57e4852d44a09f167c0': {'diff2': 2, 'diff1': 2}}
>>> signac.diff_jobs(*project)
{'c4af2b26f1fd256d70799ad3ce3bdad0': {'diff2': 1, 'diff1': 0},
'b96b21fada698f8934d58359c72755c0': {'diff2': 1, 'diff1': 1},
'e4289419d2b0e57e4852d44a09f167c0': {'diff2': 2, 'diff1': 2}}

signac.export(docs, index, mirrors=None, update=False, num_tries=3, timeout=60, **kwargs)

Export docs to index and optionally associated files to mirrors.

The behavior of this function is equivalent to:

for doc in docs:
    export_one(doc, index, mirrors, num_tries)

If the update argument is set to True, the export algorithm will automatically identify stale index documents, that is, documents that refer to files or state points that have been removed and are no longer part of the data space. Any document that shares the root, but not the _id field, with any of the updated documents is considered stale and removed. Using update in combination with an empty docs sequence will raise ExportError, since it is not possible to identify stale documents in that case.

Note

This function will automatically delegate to specialized implementations for special index types. For example, if the index argument is a MongoDB document collection, the index documents will be exported via export_pymongo().

Parameters
  • docs – The index documents to export.

  • index – The collection to export the index to.

  • mirrors – An optional set of mirrors to export files to.

  • update (bool) – If True, remove stale index documents, that is, documents that refer to files or state points that no longer exist.

  • num_tries (int) – The number of automatic retry attempts in case of mirror connection errors.

  • timeout (int) – The time in seconds to wait before an automatic retry attempt.

  • kwargs – Optional keyword arguments to pass to delegate implementations.

Raises

ExportError – When using the update argument in combination with an empty docs sequence.

Deprecated since version 1.3: This will be removed in 2.0. The indexing module is deprecated.

signac.export_one(doc, index, mirrors=None, num_tries=3, timeout=60)

Export one document to index and an optionally associated file to mirrors.

Parameters
  • doc – A document with a file_id entry.

  • index – The index collection to export to.

  • mirrors – An optional set of mirrors to export files to.

  • num_tries (int) – The number of automatic retry attempts in case of mirror connection errors.

  • timeout (int) – The time in seconds to wait before an automatic retry attempt.

Returns

The id and file id after successful export.

Deprecated since version 1.3: This will be removed in 2.0. The indexing module is deprecated.

signac.export_pymongo(docs, index, mirrors=None, update=False, num_tries=3, timeout=60, chunksize=100)

Optimized export() function for pymongo index collections.

The behavior of this function is roughly equivalent to:

for doc in docs:
    export_one(doc, index, mirrors, num_tries)

Note

All index documents must be JSON-serializable to be able to be exported to a MongoDB collection.

Parameters
  • docs – The index documents to export.

  • index (pymongo.collection.Collection) – The database collection to export the index to.

  • num_tries (int) – The number of automatic retry attempts in case of mirror connection errors.

  • timeout (int) – The time in seconds to wait before an automatic retry attempt.

  • chunksize (int) – The buffer size for export operations.

Deprecated since version 1.3: This will be removed in 2.0. The indexing module is deprecated.

signac.export_to_mirror(doc, mirror, num_tries=3, timeout=60)

Export a file associated with doc to mirror.

Parameters
  • doc – A document with a file_id entry.

  • mirror – A file-system object to export the file to.

  • num_tries (int) – The number of automatic retry attempts in case of mirror connection errors.

  • timeout (int) – The time in seconds to wait before an automatic retry attempt.

Returns

The file id after successful export.

Deprecated since version 1.3: This will be removed in 2.0. The indexing module is deprecated.

signac.fetch(doc_or_id, mode='r', mirrors=None, num_tries=3, timeout=60, ignore_local=False)

Fetch the file associated with this document or file id.

This function retrieves a file associated with the provided index document or file id and behaves like the built-in open() function, e.g.:

for doc in index:
    with signac.fetch(doc) as file:
        do_something_with(file)
Parameters
  • doc_or_id – A file_id or a document with a file_id value.

  • mode – Mode to use for opening files.

  • mirrors – An optional set of mirrors to fetch the file from.

  • num_tries (int) – The number of automatic retry attempts in case of mirror connection errors.

  • timeout (int) – The time in seconds to wait before an automatic retry attempt.

Returns

The file associated with the document or file id.

Return type

A file-like object

Deprecated since version 1.3: This will be removed in 2.0. The indexing module is deprecated.

signac.flush()

Execute all deferred JSONDict write operations.

Deprecated since version 1.7: This will be removed in 2.0.

signac.get_buffer_load()

Get the total amount of data currently stored in the buffer.

Returns

The size of all data contained in the buffer in the appropriate units for a particular buffering implementation.

Return type

int

signac.get_buffer_size()

Get the current buffer capacity.

Returns

The amount of data that can be stored before a flush is triggered in the appropriate units for a particular buffering implementation.

Return type

int

signac.get_database(name, hostname=None, config=None)

Get a database handle.

The database handle is an instance of Database, which provides access to the document collections within one database.

db = signac.db.get_database('MyDatabase')
docs = db.my_collection.find()

Please note that a collection that did not exist at the point of access will be created automatically.

Parameters
  • name (str) – The name of the database to get.

  • hostname (str) – The name of the configured host. Defaults to the first configured host, or the host specified by default_host.

  • config (common.config.Config) – The config object to retrieve the host configuration from. Defaults to the global configuration.

Returns

The database handle.

Return type

pymongo.database.Database

Deprecated since version 1.3: This will be removed in 2.0. The database package is deprecated.

signac.get_job(root=None)

Find a Job in or above the current working directory (or provided path).

Parameters

root (str) – The job root directory. If no root directory is given, the current working directory is assumed to be within the current job workspace directory (Default value = None).

Returns

Job handle.

Return type

Job

Raises

LookupError – If this job cannot be found.

Examples

When the current directory is a job workspace directory:

>>> signac.get_job()
signac.contrib.job.Job(project=..., statepoint={...})

signac.get_project(root=None, search=True, **kwargs)

Find a project configuration and return the associated project.

Parameters
  • root (str) – The starting point to search for a project, defaults to the current working directory.

  • search (bool) – If True, search for project configurations inside and above the specified root directory, otherwise only return projects with a root directory identical to the specified root argument (Default value = True).

  • **kwargs – Optional keyword arguments that are forwarded to get_project().

Returns

An instance of Project.

Return type

Project

Raises

LookupError – If no project configuration can be found.
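
For example, a short sketch (the explicit root path is a placeholder):

import signac

# Load the project containing the current working directory ...
project = signac.get_project()

# ... or load a project rooted at an explicit path without searching upward.
project = signac.get_project(root='/path/to/project', search=False)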

signac.index(root='.', tags=None, depth=0, **kwargs)

Generate a main index.

A main index is compiled from other indexes by searching for modules named signac_access.py and compiling all indexes which are yielded from a function get_indexes(root) defined within that module as well as the indexes generated by crawlers yielded from a function get_crawlers(root) defined within that module.

This is a minimal example for a signac_access.py file:

import signac

def get_indexes(root):
    yield signac.index_files(root, r'.*\.txt')

Internally, this function constructs an instance of MainCrawler and all extra key-word arguments will be forwarded to the constructor of said main crawler.

Parameters
  • root (str) – Look for access modules under this directory path.

  • tags – If tags are provided, do not execute subcrawlers that don’t match the same tags.

  • depth (int) – Limit the search to the specified directory depth.

  • kwargs – These keyword-arguments are forwarded to the internal MainCrawler instance.

Yields

The main index documents as instances of dict.

Deprecated since version 1.3: This will be removed in 2.0. The indexing module is deprecated.

signac.index_files(root='.', formats=None, depth=0)

Generate a file index.

This generator function yields file index documents, where each index document corresponds to one file.

To index all files in the current working directory, simply execute:

for doc in signac.index_files():
    print(doc)

A file associated with a file index document can be fetched via the fetch() function:

for doc in signac.index_files():
    with signac.fetch(doc) as file:
        print(file.read())

This is especially useful if the file index is part of a collection (Collection) which can be searched for specific entries.

To limit the file index to files with a specific filename format, provide a regular expression as the formats argument. For example, to index all files ending in .txt, execute:

for doc in signac.index_files(formats=r'.*\.txt'):
    print(doc)

We can associate specific formats with filename patterns by providing a dictionary as the formats argument, where the key is the filename pattern and the value is an arbitrary format string, e.g.:

for doc in signac.index_files(formats=
    {r'.*\.txt': 'TextFile', r'.*\.zip': 'ZipFile'}):
    print(doc)
Parameters
  • root (str) – The directory to index, defaults to the current working directory.

  • formats – Limit the index to files that match the given regular expression and optionally associate formats with given patterns.

  • depth (int) – Limit the search to the specified directory depth.

Yields

The file index documents as dicts.

Deprecated since version 1.3: This will be removed in 2.0. The indexing module is deprecated.

signac.init_project(name=None, root=None, workspace=None, make_dir=True)

Initialize a project.

It is safe to call this function multiple times with the same arguments. However, a RuntimeError is raised if an existing project configuration would conflict with the provided initialization parameters.

Parameters
  • name (str, optional) – The name of the project to initialize.

  • root (str, optional) – The root directory for the project. Defaults to the current working directory.

  • workspace (str, optional) – The workspace directory for the project. Defaults to a subdirectory workspace in the project root.

  • make_dir (bool, optional) – Create the project root directory, if it does not exist yet (Default value = True).

Returns

The initialized project instance.

Return type

Project

Raises

RuntimeError – If the project root path already contains a conflicting project configuration.
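
For example, a minimal sketch (the root and workspace paths are placeholders):

import signac

# Initialize (or re-open) a project in the current working directory.
project = signac.init_project()

# Initialize a project at an explicit root with a custom workspace directory.
project = signac.init_project(root='/path/to/project', workspace='my_workspace')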

signac.is_buffered()

Check if this backend is currently buffered.

signac.set_buffer_size(new_capacity)

Update the buffer capacity.

Parameters

new_capacity (int) – The new capacity of the buffer in the appropriate units for a particular buffering implementation.
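
For example, a brief sketch, assuming that the default buffering implementation measures capacity in bytes:

import signac

if not signac.is_buffered():
    signac.set_buffer_size(32 * 1024 * 1024)  # e.g. a 32 MiB capacity

print(signac.get_buffer_size(), signac.get_buffer_load())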

Submodules

signac.cite module

Functions to support citing this software.

signac.cite.bibtex(file=None)

Generate bibtex entries for signac.

The bibtex entries will be printed to screen unless a filename or a file-like object is provided, in which case they will be written to the corresponding file.

Note

A full reference should also include the version of this software. Please refer to the documentation on how to cite a specific version.

Parameters

file – A str or file-like object. Defaults to sys.stdout.

Deprecated since version 1.8: This will be removed in 2.0. The cite module is deprecated.

signac.cite.reference(file=None)

Generate formatted reference entries for signac.

The references will be printed to screen unless a filename or a file-like object is provided, in which case they will be written to the corresponding file.

Note

A full reference should also include the version of this software. Please refer to the documentation on how to cite a specific version.

Parameters

file – A str or file-like object. Defaults to sys.stdout.

Deprecated since version 1.8: This will be removed in 2.0. The cite module is deprecated.

signac.sync module

Synchronization of jobs and projects.

Jobs may be synchronized by copying all data from the source job to the destination job. This means all files are copied and the documents are synchronized. Conflicts, that is, cases in which both jobs contain conflicting data, may be resolved with a user-defined strategy.

The synchronization of projects is in essence the synchronization of all jobs in the destination project with the ones in the source project, plus the synchronization of the project document. If a specific job does not exist yet at the destination, it is simply cloned; otherwise it is synchronized.

A sync strategy is a function (or functor) that takes the source job, the destination job, and the name of the file generating the conflict as arguments, and returns a Boolean decision on whether to overwrite the file. Several default strategies are defined within this module as part of the FileSync class:

  1. always – Always overwrite on conflict.

  2. never – Never overwrite on conflict.

  3. update – Overwrite when the modification time of the source file is newer.

  4. Ask – Ask the user interactively about each conflicting filename.

For example, to synchronize two projects resolving conflicts by modification time, use:

dest_project.sync(source_project, strategy=sync.FileSync.update)

Unlike files, which are always either overwritten as a whole or not at all, documents can be synchronized in a more fine-grained manner with a sync function. Such a function (or functor) takes the source and the destination document as arguments and performs the synchronization. The user is encouraged to implement their own sync functions, but there are a few default functions implemented as part of the DocSync class:

  1. NO_SYNC – Do not perform any synchronization.

  2. COPY – Apply the same strategy used to resolve file conflicts.

  3. update – Equivalent to dst.update(src).

  4. ByKey – Synchronize the source document key by key, more information below.

This is how we could synchronize two jobs, where the documents are synchronized with a simple update function:

dst_job.sync(src_job, doc_sync=sync.DocSync.update)

The DocSync.ByKey functor attempts to synchronize the destination document with the source document without overwriting any data. That means this function behaves similarly to update() for a non-intersecting set of keys, but in addition will preserve nested mappings without overwriting values. In addition, any key conflict, that is, keys that are present in both documents but have differing data, will raise a DocumentSyncConflict exception. The user may explicitly decide to overwrite certain keys by providing a “key-strategy”, which is a function that takes the conflicting key as argument and returns a Boolean decision on whether to overwrite that specific key. For example, to sync two jobs, where conflicting keys should only be overwritten if they contain the term ‘foo’, we could execute:

dst_job.sync(src_job, doc_sync=sync.DocSync.ByKey(lambda key: 'foo' in key))

This means that all documents are synchronized ‘key-by-key’ and only conflicting keys that contain the word “foo” will be overwritten; any other conflict would raise a DocumentSyncConflict exception. A key-strategy may also be a regular expression, so the synchronization above could also be achieved with:

dst_job.sync(src_job, doc_sync=sync.DocSync.ByKey('foo'))

class signac.sync.DocSync

Bases: object

Collection of document synchronization functions.

class ByKey(key_strategy=None)

Bases: object

Synchronize documents key by key.

COPY = 'copy'

Copy (and potentially overwrite) documents like any other file.

NO_SYNC = False

Do not synchronize documents.

static update(src, dst)

Perform a simple update.

class signac.sync.FileSync

Bases: object

Collection of file synchronization strategies.

class Ask

Bases: object

Resolve sync conflicts by asking whether a file should be overwritten interactively.

static always(src, dst, fn)

Resolve sync conflicts by always overwriting.

classmethod keys()

Return keys.

static never(src, dst, fn)

Resolve sync conflicts by never overwriting.

static update(src, dst, fn)

Resolve sync conflicts based on newest modified timestamp.

signac.sync.sync_jobs(src, dst, strategy=None, exclude=None, doc_sync=None, recursive=False, follow_symlinks=True, preserve_permissions=False, preserve_times=False, preserve_owner=False, preserve_group=False, deep=False, dry_run=False)

Synchronize the dst job with the src job.

By default, this method will synchronize all files and document data of dst job with the src job until a synchronization conflict occurs. There are two different kinds of synchronization conflicts:

  1. The two jobs have files with the same name, but different content.

  2. The two jobs have documents that share keys, but those keys are mapped to different values.

A file conflict can be resolved by providing a ‘FileSync’ strategy or by excluding files from the synchronization. An unresolvable conflict is indicated with the raise of a FileSyncConflict exception.

A document synchronization conflict can be resolved by providing a doc_sync function that takes the source and the destination document as first and second argument.

Parameters
  • src (Job) – The src job, data will be copied from this job’s workspace.

  • dst (Job) – The dst job, data will be copied to this job’s workspace.

  • strategy (callable) – A synchronization strategy for file conflicts. The strategy should be a callable with signature strategy(src, dst, filepath) where src and dst are the source and destination instances of Job and filepath is the filepath relative to the project root. If no strategy is provided, an errors.SyncConflict exception will be raised upon conflict. (Default value = None)

  • exclude (str) – A filename exclusion pattern. All files matching this pattern will be excluded from the synchronization process. (Default value = None)

  • doc_sync (attribute or callable from DocSync) – A synchronization strategy for document keys. The default is to use a safe key-by-key strategy that will not overwrite any values on conflict, but instead raises a DocumentSyncConflict exception.

  • recursive (bool) – Recursively synchronize sub-directories encountered within the job workspace directories. (Default value = False)

  • follow_symlinks (bool) – Follow and copy the target of symbolic links. (Default value = True)

  • preserve_permissions (bool) – Preserve file permissions (Default value = False)

  • preserve_times (bool) – Preserve file modification times (Default value = False)

  • preserve_owner (bool) – Preserve file owner (Default value = False)

  • preserve_group (bool) – Preserve file group ownership (Default value = False)

  • dry_run (bool) – If True, do not actually perform any synchronization operations. (Default value = False)

  • deep (bool) – (Default value = False)
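
For example, a minimal sketch, assuming an initialized project and two hypothetical jobs:

from signac import get_project, sync

project = get_project()
src_job = project.open_job({'a': 0}).init()
dst_job = project.open_job({'a': 1}).init()

# Copy data from src_job to dst_job, overwriting files whose source copy is
# newer and recursing into subdirectories of the workspaces.
sync.sync_jobs(src_job, dst_job, strategy=sync.FileSync.update, recursive=True)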

signac.sync.sync_projects(source, destination, strategy=None, exclude=None, doc_sync=None, selection=None, check_schema=True, recursive=False, follow_symlinks=True, preserve_permissions=False, preserve_times=False, preserve_owner=False, preserve_group=False, deep=False, dry_run=False, parallel=False, collect_stats=False)

Synchronize the destination project with the source project.

Try to clone all jobs from the source to the destination. If the destination job already exists, try to synchronize the job using the optionally specified strategy.

Parameters
  • source (Project) – The project presenting the source for synchronization.

  • destination (Project) – The project that is modified for synchronization.

  • strategy (callable) – A synchronization strategy for file conflicts. The strategy should be a callable with signature strategy(src, dst, filepath) where src and dst are the source and destination instances of Project and filepath is the filepath relative to the project root. If no strategy is provided, an errors.SyncConflict exception will be raised upon conflict. (Default value = None)

  • exclude (str) – A filename exclusion pattern. All files matching this pattern will be excluded from the synchronization process. (Default value = None)

  • doc_sync (attribute or callable from DocSync) – A synchronization strategy for document keys. The default is to use a safe key-by-key strategy that will not overwrite any values on conflict, but instead raises a DocumentSyncConflict exception.

  • selection (sequence of Job or job ids (str)) – Only synchronize the given selection of jobs. (Default value = None)

  • check_schema (bool) – If True, only synchronize if this and the other project have a matching state point schema. See also: detect_schema(). (Default value = True)

  • recursive (bool) – Recursively synchronize sub-directories encountered within the job workspace directories. (Default value = False)

  • follow_symlinks (bool) – Follow and copy the target of symbolic links. (Default value = True)

  • preserve_permissions (bool) – Preserve file permissions (Default value = False)

  • preserve_times (bool) – Preserve file modification times (Default value = False)

  • preserve_owner (bool) – Preserve file owner (Default value = False)

  • preserve_group (bool) – Preserve file group ownership (Default value = False)

  • dry_run (bool) – If True, do not actually perform the synchronization operation, just log what would happen theoretically. Useful to test synchronization strategies without the risk of data loss. (Default value = False)

  • deep (bool) – (Default value = False)

  • parallel (bool) – (Default value = False)

  • collect_stats (bool) – (Default value = False)

Returns

Returns stats if collect_stats is True, else None.

Return type

NoneType or FileTransferStats

Raises
  • DocumentSyncConflict – If there are conflicting keys within the project or job documents that cannot be resolved with the given strategy or if there is no strategy provided.

  • FileSyncConflict – If there are differing files that cannot be resolved with the given strategy or if no strategy is provided.

  • SchemaSyncConflict – In case that the check_schema argument is True and the detected state point schema of this and the other project differ.
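
For example, a brief sketch (the project root paths are placeholders):

from signac import get_project, sync

source = get_project(root='/path/to/source')
destination = get_project(root='/path/to/destination')

# Preview the synchronization without modifying anything, collecting stats.
stats = sync.sync_projects(
    source, destination,
    strategy=sync.FileSync.never,
    dry_run=True,
    collect_stats=True,
)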

signac.warnings module

Module for signac deprecation warnings.

exception signac.warnings.SignacDeprecationWarning

Bases: UserWarning

Indicates the deprecation of a signac feature, API or behavior.

This class indicates a user-relevant deprecation and is therefore a UserWarning, not a DeprecationWarning which is hidden by default.

signac.errors module

Errors raised by signac.

exception signac.errors.AuthenticationError

Bases: Error, RuntimeError

Authentication error.

exception signac.errors.BufferException

Bases: Error

An exception occurred in buffered mode.

exception signac.errors.BufferedFileError(files)

Bases: BufferException

Raised when an error occurred while flushing one or more buffered files.

files

A dictionary of files that caused issues during the flush operation, mapped to a possible reason for the issue or None in case that it cannot be determined.

exception signac.errors.ConfigError

Bases: Error, RuntimeError

Error with parsing or reading a configuration file.

exception signac.errors.DestinationExistsError(destination)

Bases: Error, RuntimeError

The destination for a move or copy operation already exists.

Parameters

destination (str) – The destination causing the error.

exception signac.errors.DocumentSyncConflict(keys)

Bases: SyncConflict

Raised when a synchronization operation fails due to a document conflict.

keys

The keys that caused the conflict.

exception signac.errors.Error

Bases: Exception

Base class used for signac Errors.

exception signac.errors.ExportError

Bases: Error, RuntimeError

Error exporting documents to a mirror.

exception signac.errors.FetchError

Bases: FileNotFoundError

Error in fetching data.

exception signac.errors.FileSyncConflict(filename)

Bases: SyncConflict

Raised when a synchronization operation fails due to a file conflict.

filename

The filename of the file that caused the conflict.

exception signac.errors.IncompatibleSchemaVersion

Bases: Error

The project’s schema version is incompatible with this version of signac.

exception signac.errors.InvalidKeyError

Bases: ValueError

Raised when a user uses a non-conforming key.

exception signac.errors.JobsCorruptedError(job_ids)

Bases: Error, RuntimeError

The state point manifest file of one or more jobs cannot be opened or is corrupted.

Parameters

job_ids – The job id(s) of the corrupted job(s).

exception signac.errors.KeyTypeError

Bases: TypeError

Raised when a user uses a key of invalid type.

exception signac.errors.SchemaSyncConflict(schema_src, schema_dst)

Bases: SyncConflict

Raised when a synchronization operation fails due to schema differences.

exception signac.errors.StatepointParsingError

Bases: Error, RuntimeError

Indicates an error that occurred while trying to identify a state point.

exception signac.errors.SyncConflict

Bases: Error, RuntimeError

Raised when a synchronization operation fails.

exception signac.errors.WorkspaceError(error)

Bases: Error, OSError

Raised when there is an issue creating or accessing the workspace.

Parameters

error – The underlying error causing this issue.

synced_collections package

Data Types

synced_collections.synced_collection module

Implement the SyncedCollection class.

class signac.synced_collections.data_types.synced_collection.SyncedCollection(parent=None, *args, **kwargs)

Bases: Collection

An abstract Collection type that is synced with a backend.

This class extends collections.abc.Collection and adds a number of abstract internal methods that must be implemented by its subclasses. These methods can be split into two groups of functions that are designed to be implemented by separate subtrees in the inheritance hierarchy that can then be composed:

Concrete Collection Types

These subclasses should implement the APIs for specific types of collections. For instance, a list-like SyncedCollection should implement the standard methods for sequences. In addition, they must implement the abstract methods defined by SyncedCollection.

Backend

These subclasses encode the process by which in-memory data is converted into a representation suitable for a particular backend. For instance, a JSON backend should know how to save a Python object into a JSON-encoded file and then read that object back.

Since these functionalities are effectively completely orthogonal, members of a given group should be interchangeable. For instance, a dict-like SyncedCollection can be combined equally easily with JSON, MongoDB, or SQL backends.

Validation

Due to the restrictions of a particular backend or the needs of a particular application, synced collections may need to restrict the data that they can store. Validators provide a standardized mechanism for this. A validator is a callable that parses any data added to a SyncedCollection and raises an Exception if any invalid data is provided. Validators cannot modify the data and should have no side effects. They are purely provided as a mechanism to reject invalid data. For example, a JSON validator would raise Exceptions if it detected non-string keys in a dict.

Since SyncedCollection is designed for extensive usage of inheritance, validators may be inherited by subclasses. There are two attributes that subclasses of SyncedCollection can define to control the validators used:

  • _validators: A list of callables that will be inherited by all subclasses.

  • _all_validators: A list of callables that will be used to validate this class, and this class alone.

When a SyncedCollection subclass is initialized (note that this is at class definition time, not when instances are created), its _register_validators() method will be called. If this class defines an _all_validators attribute, this set of validators will be used by all instances of this class. Otherwise, _register_validators() will traverse the MRO and collect the _validators attributes from all parents of a class, and store these in the _all_validators attribute for the class.
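
As a purely illustrative sketch, a subclass could declare inherited validators like this; the validator function and class name below are hypothetical, and a usable subclass would additionally have to provide a concrete backend:

from signac.synced_collections.data_types.synced_dict import SyncedDict

def _require_string_keys(data):
    # Hypothetical validator: reject mappings with non-string keys.
    if isinstance(data, dict):
        for key in data:
            if not isinstance(key, str):
                raise TypeError(f"Non-string key: {key!r}")

class ValidatedDict(SyncedDict):
    # Validators listed in _validators are inherited by subclasses;
    # _register_validators() collects them along the MRO at class definition time.
    _validators = [_require_string_keys]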

Note

Typically, a synced collection will be initialized with resource information, and data will be pulled from that resource. However, initializing with both data and resource information is a valid use case. In this case, the initial data will be validated by the standard validators, however, it will not be checked against the contents stored in the synced resource and is assumed to be consistent. This constructor pattern can be useful to avoid unnecessary resource accesses.

Thread safety

Whether or not SyncedCollection objects are thread-safe depends on the implementation of the backend. Thread-safety of SyncedCollection objects is predicated on backends providing an atomic write operation. All concrete collection types use mutexes to guard against concurrent write operations, while allowing read operations to happen freely. The validity of this mode of access depends on the write operations of a SyncedCollection being atomic, specifically the _save_to_resource() method. Whether or not a particular subclass of SyncedCollection is thread-safe should be indicated by that subclass setting the _supports_threading class variable to True. This variable is set to False by SyncedCollection, so subclasses must explicitly opt-in to support threading by setting this variable to True.

Backends that support multithreaded execution will have multithreaded support turned on by default. This support can be enabled or disabled using the enable_multithreading() and disable_multithreading() methods. enable_multithreading() will raise a ValueError if called on a class that does not support multithreading.

Parameters

parent (SyncedCollection, optional) – A parent instance of SyncedCollection or None. If None, the collection owns its own data; otherwise, it is nested within its parent. Every SyncedCollection either owns its own data or has a parent (Default value = None).

_LoadSaveType

alias of _LoadAndSave

abstract property _backend

The backend associated with a given collection.

This property is abstract to enforce that subclasses implement it. Since it’s only internal, subclasses can safely override it with just a raw attribute; this property just serves as a way to enforce the abstract API for subclasses.

Type

str

classmethod _from_base(data, **kwargs)

Dynamically resolve the type of object to the corresponding synced collection.

This method assumes that data has already been validated. This assumption can always be met, since this method should only be called internally by other methods that modify the internal collection data. While this requirement does require that all calling methods be responsible for validation, it confers significant performance benefits because it can instruct any invoked class constructors not to validate, which is especially important for nested collections.

Parameters
  • data (Collection) – Data to be converted from base type.

  • **kwargs – Any keyword arguments to pass to the collection constructor.

Returns

Synced object of corresponding base type.

Return type

Collection

Notes

This method relies on the internal registry of subclasses populated by __init_subclass__() and the is_base_type() method to determine the subclass with the appropriate backend and data type. Once an appropriate type is determined, that class’s constructor is called. Since this method relies on the constructor and other methods, it can be concretely implemented here rather than requiring subclass implementations.

_load()

Load the data from the backend.

This method encodes the recursive logic required to handle the loading of nested collections. For a collection contained within another collection, only the root is ever responsible for loading the data. This method handles the appropriate recursive calls, then farms out the actual reading to the abstract method _load_from_resource().

abstract _load_from_resource()

Load data from underlying backend.

This method must be implemented for each backend. Backends may choose to return None, signaling that no modification should be performed on the data in memory. This mode is useful for backends where the underlying resource (e.g. a file) may not initially exist, but can be transparently created on save.

Returns

An equivalent unsynced collection satisfying is_base_type() that contains the data in the underlying resource (e.g. a file).

Return type

Collection or None

classmethod _register_validators()

Register all inherited validators to this class.

This method is called by __init_subclass__ when subclasses are created to control what validators will be applied to data added to instances of that class. By default, the _all_validators class variable defined on the class itself determines the validation rules for that class. If that variable is not defined, then all parents of the class are searched, and a list of validators is constructed by concatenating the _validators class variable for each parent class that defines it.

_save()

Save the data to the backend.

This method encodes the recursive logic required to handle the saving of nested collections. For a collection contained within another collection, only the parent is ever responsible for storing the data. This method handles the appropriate recursive calls, then farms out the actual writing to the abstract method _save_to_resource().

abstract _save_to_resource()

Save data to the backend.

This method must be implemented for each backend.

abstract _to_base()

Dynamically resolve the synced collection to the corresponding base type.

This method should not load the data from the underlying resource; it should simply convert the current in-memory representation of a synced collection to its naturally corresponding unsynced collection type.

Returns

An equivalent unsynced collection satisfying is_base_type().

Return type

Collection

abstract _update(data)

Update the in-memory representation to match the provided data.

The purpose of this method is to update the SyncedCollection to match the data in the underlying resource. The result of calling this method should be that self == data. The reason that this method is necessary is that SyncedCollections can be nested, and nested collections must also be instances of SyncedCollection so that synchronization occurs even when nested structures are modified. Recreating the full nested structure every time data is reloaded from file is highly inefficient, so this method performs an in-place update that only changes entries that need to be changed.

Parameters

data (Collection) – A collection satisfying is_base_type().

_validate(data)

Validate the input data.

Parameters

data (Collection) – A collection satisfying is_base_type().

classmethod disable_multithreading()

Disable all safety checks and thread locks required for thread safety.

The mutex locks required to enable multithreading introduce nontrivial performance costs, so they can be disabled for classes that support it.

classmethod enable_multithreading()

Enable safety checks and thread locks required for thread safety.

Support for multithreaded execution can be disabled by calling disable_multithreading(); calling this method reverses that.

abstract classmethod is_base_type(data)

Check whether data is of the same base type (such as list or dict) as this class.

Parameters

data (Any) – The input data to test.

Returns

Whether or not the object can be converted into this synced collection type.

Return type

bool

class signac.synced_collections.data_types.synced_collection._LoadAndSave(collection)

Bases: object

A context manager for SyncedCollection to wrap saving and loading.

Any write operation on a synced collection must be preceded by a load and followed by a save. Moreover, additional logic may be required to handle other aspects of the synchronization, particularly the acquisition of thread locks. This class abstracts this concept, making it easy for subclasses to customize the behavior if needed (for instance, to introduce additional locks).

synced_collections.synced_dict module

Implements the SyncedDict.

This implements a dict-like data structure that also conforms to the SyncedCollection API and can be combined with any backend type to give a dict-like API to a synchronized data structure.

class signac.synced_collections.data_types.synced_dict.SyncedDict(data=None, _validate=True, *args, **kwargs)

Bases: SyncedCollection, MutableMapping

Implement the dict data structure along with values access through attributes named as keys.

The SyncedDict inherits from SyncedCollection and MutableMapping. Therefore, it behaves like a dict.

Parameters
  • data (Mapping, optional) – The initial data to populate the dict. If None, defaults to {} (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

While the SyncedDict object behaves like a dict, there are important distinctions to remember. In particular, because operations are reflected as changes to an underlying backend, copying (even deep copying) a SyncedDict instance may exhibit unexpected behavior. If a true copy is required, you should use the _to_base() method to get a dict representation, and if necessary construct a new SyncedDict.

clear() → None.  Remove all items from D.
get(k[, d]) → D[k] if k in D, else d.  d defaults to None.
classmethod is_base_type(data)

Check whether the data is an instance of mapping.

Parameters

data (any) – Data to be checked.

Return type

bool

items() → a set-like object providing a view on D's items
keys() → a set-like object providing a view on D's keys
pop(k[, d]) → v, remove specified key and return the corresponding value.

If the key is not found, d is returned if given, otherwise KeyError is raised.

popitem() → (k, v), remove and return some (key, value) pair as a 2-tuple; raise KeyError if D is empty.

reset(data)

Update the instance with new data.

Parameters

data (mapping) – Data to update the instance.

Raises

ValueError – If the data is not a mapping.

setdefault(k[, d]) → D.get(k, d), also set D[k]=d if k not in D
update([E, ]**F) → None.  Update D from mapping/iterable E and F.

If E is present and has a .keys() method, does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, does: for (k, v) in E: D[k] = v. In either case, this is followed by: for k, v in F.items(): D[k] = v.

values() → an object providing a view on D's values

synced_collections.synced_list module

Implements the SyncedList.

This implements a list-like data structure that also conforms to the SyncedCollection API and can be combined with any backend type to give a list-like API to a synchronized data structure.

class signac.synced_collections.data_types.synced_list.SyncedList(data=None, _validate=True, *args, **kwargs)

Bases: SyncedCollection, MutableSequence

Implementation of list data structure.

The SyncedList inherits from SyncedCollection and MutableSequence. Therefore, it behaves similar to a list.

Parameters
  • data (Sequence, optional) – The initial data to populate the list. If None, defaults to [] (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

While the SyncedList object behaves like a list, there are important distinctions to remember. In particular, because operations are reflected as changes to an underlying backend, copying (even deep copying) a SyncedList instance may exhibit unexpected behavior. If a true copy is required, you should use the _to_base() method to get a list representation, and if necessary construct a new SyncedList.

append(item)

S.append(value) – append value to the end of the sequence

clear() → None -- remove all items from S

extend(iterable)

S.extend(iterable) – extend sequence by appending elements from the iterable

insert(index, item)

S.insert(index, value) – insert value before index

classmethod is_base_type(data)

Check whether the data is a non-string Sequence.

Parameters

data (Any) – Data to be checked

Return type

bool

remove(value)

S.remove(value) – remove first occurrence of value. Raise ValueError if the value is not present.

reset(data)

Update the instance with new data.

Parameters

data (non-string Sequence) – Data to update the instance.

Raises

ValueError – If the data is not a non-string sequence.

Backends

synced_collections.backends.collection_json module

Implements a JSON SyncedCollection backend.

class signac.synced_collections.backends.collection_json.BufferedJSONAttrDict(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: BufferedJSONDict, AttrDict

A buffered JSONAttrDict.

class signac.synced_collections.backends.collection_json.BufferedJSONAttrList(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: BufferedJSONList

A BufferedJSONList whose dict-like children will be of type BufferedJSONAttrDict.

class signac.synced_collections.backends.collection_json.BufferedJSONCollection(parent=None, filename=None, *args, **kwargs)

Bases: SerializedFileBufferedCollection, JSONCollection

A JSONCollection that supports I/O buffering.

This class implements the buffer protocol defined by BufferedCollection. The concrete implementation of buffering behavior is defined by the SerializedFileBufferedCollection.

class signac.synced_collections.backends.collection_json.BufferedJSONDict(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: BufferedJSONCollection, SyncedDict

A buffered JSONDict.

class signac.synced_collections.backends.collection_json.BufferedJSONList(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: BufferedJSONCollection, SyncedList

A buffered JSONList.

class signac.synced_collections.backends.collection_json.JSONAttrDict(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: JSONDict, AttrDict

A dict-like data structure that synchronizes with a persistent JSON file.

Unlike JSONDict, this class also supports attribute-based access to dictionary contents, e.g. doc.foo == doc['foo'].

Examples

>>> doc = JSONAttrDict('data.json', write_concern=True)
>>> doc['foo'] = "bar"
>>> assert doc.foo == doc['foo'] == "bar"
>>> assert 'foo' in doc
>>> del doc['foo']
>>> doc['foo'] = dict(bar=True)
>>> doc
{'foo': {'bar': True}}
>>> doc.foo.bar = False
>>> doc
{'foo': {'bar': False}}
Parameters
  • filename (str, optional) – The filename of the associated JSON file on disk (Default value = None).

  • write_concern (bool, optional) – Ensure file consistency by writing changes back to a temporary file first, before replacing the original file (Default value = False).

  • data (collections.abc.Mapping, optional) – The initial data passed to JSONAttrDict. If None, defaults to {} (Default value = None).

  • parent (JSONCollection, optional) – A parent instance of JSONCollection or None. If None, the collection owns its own data (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

While the JSONAttrDict object behaves like a dict, there are important distinctions to remember. In particular, because operations are reflected as changes to an underlying file, copying (even deep copying) a JSONAttrDict instance may exhibit unexpected behavior. If a true copy is required, you should use the call operator to get a dictionary representation, and if necessary construct a new JSONAttrDict instance.

class signac.synced_collections.backends.collection_json.JSONAttrList(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: JSONList

A JSONList whose dict-like children will be of type JSONAttrDict.

class signac.synced_collections.backends.collection_json.JSONCollection(filename=None, write_concern=False, *args, **kwargs)

Bases: SyncedCollection

A SyncedCollection that synchronizes with a JSON file.

This collection implements synchronization by reading and writing the associated JSON file in its entirety for every read/write operation. This backend is a good choice for maximum accessibility and transparency since all data is immediately accessible in the form of a text file with no additional tooling, but is likely a poor choice for high performance applications.

Thread safety

The JSONCollection is thread-safe. To make these collections safe, the write_concern flag is ignored in multithreaded execution, and the write is always performed via a write to a temporary file followed by a replacement of the original file. The file replacement operation uses os.replace(), which is guaranteed to be atomic by the Python standard library.

Parameters
  • filename (str) – The filename of the associated JSON file on disk.

  • write_concern (bool, optional) – Ensure file consistency by writing changes back to a temporary file first, before replacing the original file (Default value = False).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

property filename

The name of the associated JSON file on disk.

Type

str

class signac.synced_collections.backends.collection_json.JSONDict(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: JSONCollection, SyncedDict

A dict-like data structure that synchronizes with a persistent JSON file.

Examples

>>> doc = JSONDict('data.json', write_concern=True)
>>> doc['foo'] = "bar"
>>> assert doc['foo'] == "bar"
>>> assert 'foo' in doc
>>> del doc['foo']
>>> doc['foo'] = dict(bar=True)
>>> doc
{'foo': {'bar': True}}
Parameters
  • filename (str, optional) – The filename of the associated JSON file on disk (Default value = None).

  • write_concern (bool, optional) – Ensure file consistency by writing changes back to a temporary file first, before replacing the original file (Default value = False).

  • data (collections.abc.Mapping, optional) – The initial data passed to JSONDict. If None, defaults to {} (Default value = None).

  • parent (JSONCollection, optional) – A parent instance of JSONCollection or None. If None, the collection owns its own data (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

While the JSONDict object behaves like a dict, there are important distinctions to remember. In particular, because operations are reflected as changes to an underlying file, copying (even deep copying) a JSONDict instance may exhibit unexpected behavior. If a true copy is required, you should use the call operator to get a dictionary representation, and if necessary construct a new JSONDict instance.
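
For illustration, a minimal sketch of the copy behavior described in the warning above; the filename is a placeholder:

from signac.synced_collections.backends.collection_json import JSONDict

doc = JSONDict("data.json")
doc["foo"] = "bar"

# The call operator returns a plain dict that is decoupled from the file.
snapshot = doc()
snapshot["foo"] = "baz"  # does not modify data.json
assert doc["foo"] == "bar"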

class signac.synced_collections.backends.collection_json.JSONList(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: JSONCollection, SyncedList

A list-like data structure that synchronizes with a persistent JSON file.

Only non-string sequences are supported by this class.

Examples

>>> synced_list = JSONList('data.json', write_concern=True)
>>> synced_list.append("bar")
>>> assert synced_list[0] == "bar"
>>> assert len(synced_list) == 1
>>> del synced_list[0]
Parameters
  • filename (str, optional) – The filename of the associated JSON file on disk (Default value = None).

  • write_concern (bool, optional) – Ensure file consistency by writing changes back to a temporary file first, before replacing the original file (Default value = False).

  • data (non-str collections.abc.Sequence, optional) – The initial data passed to JSONList. If None, defaults to [] (Default value = None).

  • parent (JSONCollection, optional) – A parent instance of JSONCollection or None. If None, the collection owns its own data (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

While the JSONList object behaves like a list, there are important distinctions to remember. In particular, because operations are reflected as changes to an underlying file, copying (even deep copying) a JSONList instance may exhibit unexpected behavior. If a true copy is required, you should use the call operator to get a list representation, and if necessary construct a new JSONList instance.

class signac.synced_collections.backends.collection_json.MemoryBufferedJSONAttrDict(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: MemoryBufferedJSONDict, AttrDict

A buffered JSONAttrDict.

class signac.synced_collections.backends.collection_json.MemoryBufferedJSONAttrList(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: MemoryBufferedJSONList

A MemoryBufferedJSONList whose dict-like children will be of type MemoryBufferedJSONAttrDict.

class signac.synced_collections.backends.collection_json.MemoryBufferedJSONCollection(parent=None, filename=None, *args, **kwargs)

Bases: SharedMemoryFileBufferedCollection, JSONCollection

A JSONCollection that supports I/O buffering.

This class implements the buffer protocol defined by BufferedCollection. The concrete implementation of buffering behavior is defined by the SharedMemoryFileBufferedCollection.

class signac.synced_collections.backends.collection_json.MemoryBufferedJSONDict(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: MemoryBufferedJSONCollection, SyncedDict

A buffered JSONDict.

class signac.synced_collections.backends.collection_json.MemoryBufferedJSONList(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: MemoryBufferedJSONCollection, SyncedList

A buffered JSONList.

signac.synced_collections.backends.collection_json.json_attr_dict_validator(data)

Validate data for JSONAttrDict.

This validator combines the logic of several validators into one to make validation more efficient:
  • JSON format validation

  • Ensuring no dots are present in string keys

  • Converting non-str keys to strings. This is a backwards compatibility layer that will be removed in signac 2.0.

Parameters

data – Data to validate.

Raises
  • KeyTypeError – If key data type is not supported.

  • TypeError – If the data type of data is not supported.
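
For illustration, a hedged sketch of calling the validator directly: JSON-compatible data passes silently, while unsupported value types raise TypeError:

from signac.synced_collections.backends.collection_json import json_attr_dict_validator

json_attr_dict_validator({"a": [1, 2.5, "text", None]})  # passes silently

try:
    json_attr_dict_validator({"a": object()})  # object() is not JSON-serializable
except TypeError as error:
    print(f"rejected: {error}")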

synced_collections.backends.collection_mongodb module

Implements a MongoDB SyncedCollection backend.

class signac.synced_collections.backends.collection_mongodb.MongoDBCollection(collection=None, uid=None, parent=None, *args, **kwargs)

Bases: SyncedCollection

A SyncedCollection that synchronizes with a MongoDB document.

In MongoDB, a database is composed of multiple MongoDB collections, which are analogous to tables in SQL databases but, unlike relational tables, do not enforce a schema. In turn, collections are composed of documents, which are analogous to rows in a table but are much more flexible, storing any valid JSON object in a JSON-like encoded format known as BSON (“binary JSON”).

Each MongoDBCollection can be represented as a MongoDB document, so this backend stores the MongoDBCollection as a single document within the collection provided by the user. The document is identified by a unique key provided by the user.

Thread safety

The MongoDBCollection is not thread-safe.

Parameters
  • collection (pymongo.collection.Collection) – The MongoDB collection in which to store data.

  • uid (dict) – The unique key-value mapping added to the data and stored in the document so that it is uniquely identifiable in the MongoDB collection. The key “data” is reserved and may not be part of this uid.

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

The user is responsible for providing a unique id such that there are no possible collisions between different MongoDBCollection instances stored in the same MongoDB collection. Failure to do so may result in data corruption if multiple documents appear to be associated with the same uid.

property collection

Get the collection being synced to.

Type

pymongo.collection.Collection

property uid

Get the unique mapping used to identify this collection.

Type

dict
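
For illustration, a hedged sketch of a MongoDB-backed dict; it assumes a running MongoDB server and the pymongo package, and the database, collection, and uid names are placeholders:

from pymongo import MongoClient

from signac.synced_collections.backends.collection_mongodb import MongoDBDict

client = MongoClient()  # connects to localhost:27017 by default
mongo_collection = client["my_database"]["my_collection"]

# The uid must be unique within mongo_collection; the key "data" is reserved.
doc = MongoDBDict(collection=mongo_collection, uid={"signac_id": "experiment_1"})
doc["temperature"] = 300
assert doc["temperature"] == 300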

class signac.synced_collections.backends.collection_mongodb.MongoDBDict(collection=None, uid=None, data=None, parent=None, *args, **kwargs)

Bases: MongoDBCollection, SyncedDict

A dict-like data structure that synchronizes with a document in a MongoDB collection.

Examples

>>> doc = MongoDBDict('data')
>>> doc['foo'] = "bar"
>>> assert doc['foo'] == "bar"
>>> assert 'foo' in doc
>>> del doc['foo']
>>> doc['foo'] = dict(bar=True)
>>> doc
{'foo': {'bar': True}}
Parameters
  • collection (pymongo.collection.Collection, optional) – A pymongo.collection.Collection instance (Default value = None).

  • uid (dict, optional) – The unique key-value mapping identifying the collection (Default value = None).

  • data (non-str collections.abc.Mapping, optional) – The initial data passed to MongoDBDict. If None, defaults to {} (Default value = None).

  • parent (MongoDBCollection, optional) – A parent instance of MongoDBCollection or None. If None, the collection owns its own data (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

While the MongoDBDict object behaves like a dict, there are important distinctions to remember. In particular, because operations are reflected as changes to an underlying database, copying a MongoDBDict instance may exhibit unexpected behavior. If a true copy is required, you should use the call operator to get a dictionary representation, and if necessary construct a new MongoDBDict instance.

class signac.synced_collections.backends.collection_mongodb.MongoDBList(collection=None, uid=None, data=None, parent=None, *args, **kwargs)

Bases: MongoDBCollection, SyncedList

A list-like data structure that synchronizes with a document in a MongoDB collection.

Only non-string sequences are supported by this class.

Examples

>>> synced_list = MongoDBList('data')
>>> synced_list.append("bar")
>>> assert synced_list[0] == "bar"
>>> assert len(synced_list) == 1
>>> del synced_list[0]
Parameters
  • collection (pymongo.collection.Collection, optional) – A pymongo.collection.Collection instance (Default value = None).

  • uid (dict, optional) – The unique key-value mapping identifying the collection (Default value = None).

  • data (non-str collections.abc.Sequence, optional) – The initial data passed to MongoDBList. If None, defaults to [] (Default value = None).

  • parent (MongoDBCollection, optional) – A parent instance of MongoDBCollection or None. If None, the collection owns its own data (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

While the MongoDBList object behaves like a list, there are important distinctions to remember. In particular, because operations are reflected as changes to an underlying database, copying a MongoDBList instance may exhibit unexpected behavior. If a true copy is required, you should use the call operator to get a list representation, and if necessary construct a new MongoDBList instance.

synced_collections.backends.collection_redis module

Implements a Redis SyncedCollection backend.

class signac.synced_collections.backends.collection_redis.RedisCollection(client=None, key=None, *args, **kwargs)

Bases: SyncedCollection

A SyncedCollection that synchronizes with a Redis database.

This backend stores data in Redis by associating it with the provided key.

Thread safety

The RedisCollection is not thread-safe.

Parameters
  • client (redis.Redis) – The Redis client used to persist data.

  • key (str) – The key associated with this collection in the Redis database.

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

property client

The Redis client used to store the data.

Type

redis.Redis

property key

The key associated with this collection stored in Redis.

Type

str
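
For illustration, a hedged sketch of a Redis-backed dict; it assumes a local Redis server and the redis package, and the key name is a placeholder:

import redis

from signac.synced_collections.backends.collection_redis import RedisDict

client = redis.Redis()  # connects to localhost:6379 by default
doc = RedisDict(client=client, key="experiment_1")
doc["foo"] = "bar"
assert doc["foo"] == "bar"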

class signac.synced_collections.backends.collection_redis.RedisDict(client=None, key=None, data=None, parent=None, *args, **kwargs)

Bases: RedisCollection, SyncedDict

A dict-like data structure that synchronizes with a persistent Redis database.

Examples

>>> doc = RedisDict('data')
>>> doc['foo'] = "bar"
>>> assert doc['foo'] == "bar"
>>> assert 'foo' in doc
>>> del doc['foo']
>>> doc['foo'] = dict(bar=True)
>>> doc
{'foo': {'bar': True}}
Parameters
  • client (redis.Redis, optional) – A redis client (Default value = None).

  • key (str, optional) – The key of the collection (Default value = None).

  • data (collections.abc.Mapping, optional) – The initial data passed to RedisDict. If None, defaults to {} (Default value = None).

  • parent (RedisCollection, optional) – A parent instance of RedisCollection or None. If None, the collection owns its own data (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

While the RedisDict object behaves like a dict, there are important distinctions to remember. In particular, because operations are reflected as changes to an underlying database, copying a RedisDict instance may exhibit unexpected behavior. If a true copy is required, you should use the call operator to get a dictionary representation, and if necessary construct a new RedisDict instance.

class signac.synced_collections.backends.collection_redis.RedisList(client=None, key=None, data=None, parent=None, *args, **kwargs)

Bases: RedisCollection, SyncedList

A list-like data structure that synchronizes with a persistent Redis database.

Only non-string sequences are supported by this class.

Examples

>>> synced_list = RedisList('data')
>>> synced_list.append("bar")
>>> assert synced_list[0] == "bar"
>>> assert len(synced_list) == 1
>>> del synced_list[0]
Parameters
  • client (redis.Redis, optional) – A Redis client (Default value = None).

  • key (str, optional) – The key of the collection (Default value = None).

  • data (non-str collections.abc.Sequence, optional) – The initial data passed to RedisList. If None, defaults to [] (Default value = None).

  • parent (RedisCollection, optional) – A parent instance of RedisCollection or None. If None, the collection owns its own data (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

While the RedisList object behaves like a list, there are important distinctions to remember. In particular, because operations are reflected as changes to an underlying database, copying a RedisList instance may exhibit unexpected behavior. If a true copy is required, you should use the call operator to get a list representation, and if necessary construct a new RedisList instance.

synced_collections.backends.collection_zarr module

Implements a Zarr SyncedCollection backend.

class signac.synced_collections.backends.collection_zarr.ZarrCollection(group=None, name=None, codec=None, *args, **kwargs)

Bases: SyncedCollection

A SyncedCollection that synchronizes with a Zarr group.

Since Zarr is designed for storage of array-like data, this backend implements synchronization by storing the collection in a 1-element object array. The user provides the group within which to store the data and the name of the data in the group.

Thread safety

The ZarrCollection is not thread-safe.

Parameters
  • group (zarr.hierarchy.Group) – The Zarr group in which to store data.

  • name (str) – The name under which this collection is stored in the Zarr group.

  • codec (numcodecs.abc.Codec) – The encoding mechanism for the data. If not provided, defaults to JSON encoding (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

property codec

The encoding method used for the data.

Type

numcodecs.abc.Codec

property group

The Zarr group storing the data.

Type

zarr.hierarchy.Group

property name

The name of this data in the Zarr group.

Type

str
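
For illustration, a hedged sketch of a Zarr-backed dict; it assumes the zarr package and uses an in-memory group, and the name is a placeholder:

import zarr

from signac.synced_collections.backends.collection_zarr import ZarrDict

group = zarr.group()  # in-memory group; a persistent store could be opened instead
doc = ZarrDict(group=group, name="experiment_1")
doc["foo"] = "bar"
assert doc["foo"] == "bar"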

class signac.synced_collections.backends.collection_zarr.ZarrDict(group=None, name=None, data=None, parent=None, *args, **kwargs)

Bases: ZarrCollection, SyncedDict

A dict-like data structure that synchronizes with a Zarr group.

Examples

>>> doc = ZarrDict('data')
>>> doc['foo'] = "bar"
>>> assert doc['foo'] == "bar"
>>> assert 'foo' in doc
>>> del doc['foo']
>>> doc['foo'] = dict(bar=True)
>>> doc
{'foo': {'bar': True}}
Parameters
  • group (zarr.hierarchy.Group, optional) – The group in which to store data (Default value = None).

  • name (str, optional) – The name of the collection (Default value = None).

  • data (collections.abc.Mapping, optional) – The initial data passed to ZarrDict. If None, defaults to {} (Default value = None).

  • parent (ZarrCollection, optional) – A parent instance of ZarrCollection or None. If None, the collection owns its own data (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

While the ZarrDict object behaves like a dict, there are important distinctions to remember. In particular, because operations are reflected as changes to an underlying database, copying (even deep copying) a ZarrDict instance may exhibit unexpected behavior. If a true copy is required, you should use the call operator to get a dictionary representation, and if necessary construct a new ZarrDict instance.

class signac.synced_collections.backends.collection_zarr.ZarrList(group=None, name=None, data=None, parent=None, *args, **kwargs)

Bases: ZarrCollection, SyncedList

A list-like data structure that synchronizes with a Zarr group.

Only non-string sequences are supported by this class.

Examples

>>> synced_list = ZarrList('data')
>>> synced_list.append("bar")
>>> assert synced_list[0] == "bar"
>>> assert len(synced_list) == 1
>>> del synced_list[0]
Parameters
  • group (zarr.hierarchy.Group, optional) – The group in which to store data (Default value = None).

  • name (str, optional) – The name of the collection (Default value = None).

  • data (non-str collections.abc.Sequence, optional) – The initial data passed to ZarrList. If None, defaults to [] (Default value = None).

  • parent (ZarrCollection, optional) – A parent instance of ZarrCollection or None. If None, the collection owns its own data (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

While the ZarrList object behaves like a list, there are important distinctions to remember. In particular, because operations are reflected as changes to an underlying database, copying (even deep copying) a ZarrList instance may exhibit unexpected behavior. If a true copy is required, you should use the call operator to get a list representation, and if necessary construct a new ZarrList instance.

Buffers

synced_collections.buffers.buffered_collection module

Defines a buffering protocol for SyncedCollection objects.

Depending on the choice of backend, synchronization may be an expensive process. In that case, it can be helpful to allow many in-memory modifications to occur before any synchronization is attempted. Since many collections could be pointing to the same underlying resource, maintaining proper data coherency across different instances requires careful consideration of how the data is stored. The appropriate buffering methods can differ for different backends; as a result, the basic interface simply lays out the API for buffering and leaves implementation details for specific backends to handle. Judicious use of buffering can dramatically speed up code paths that might otherwise involve, for instance, heavy I/O. The specific buffering mechanism must be implemented by each backend since it depends on the nature of the underlying data format.

All buffered collections expose a local context manager for buffering. In addition, each backend exposes a context manager BufferedCollection.buffer_backend() that indicates to all buffered collections of that backend that they should enter buffered mode. These context managers may be nested freely, and buffer flushes will occur when all such managers have been exited.

with collection1.buffered:
    with type(collection1).buffer_backend:
        collection2['foo'] = 1
        collection1['bar'] = 1
        # collection2 will flush when this context exits.

    # This operation will write straight to the backend.
    collection2['bar'] = 2

    # collection1 will flush when this context exits.
class signac.synced_collections.buffers.buffered_collection.BufferedCollection(*args, **kwargs)

Bases: SyncedCollection

A SyncedCollection defining an interface for buffering.

The default behavior of this class is not to buffer. This class simply defines an appropriate interface for buffering behavior so that client code can rely on these methods existing, e.g. to be able to do things like with collection.buffered.... This feature allows client code to indicate to the collection when it is safe to buffer reads and writes, which usually means guaranteeing that the synchronization destination (e.g. an underlying file or database entry) will not be modified by other processes concurrently with the set of operations within the buffered block. However, in the default case the result of this will be a no-op and all data will be immediately synchronized with the backend.

The BufferedCollection overrides the _load() and _save() methods to check whether buffering is enabled or not. If not, the behavior is identical to the parent class. When in buffered mode, however, the BufferedCollection introduces two additional hooks that can be overridden by subclasses to control how the collection behaves while buffered: _save_to_buffer() and _load_from_buffer().

Thread safety

Whether or not buffering is thread safe depends on the buffering method used. In general, both the buffering logic and the data type operations must be thread safe for the resulting collection type to be thread safe.

_flush()

Flush data associated with this instance from the buffer.

classmethod _flush_buffer()

Flush all data in this class’s buffer.

property _is_buffered

Check if we should write to the buffer or not.

_load()

Load data from the backend but buffer if needed.

This method is identical to the SyncedCollection implementation except that it determines whether data is actually synchronized or instead read from a temporary buffer based on the buffering mode.

_load_from_buffer()

Load data from the buffer.

By default, this method simply calls _load_from_resource(). Subclasses must implement specific buffering strategies.

Returns

An equivalent unsynced collection satisfying is_base_type() that contains the buffered data. By default, the buffered data is just the data in the resource.

Return type

Collection

_save()

Synchronize data with the backend but buffer if needed.

This method is identical to the SyncedCollection implementation for sync except that it determines whether data is actually synchronized or instead written to a temporary buffer based on the buffering mode.

_save_to_buffer()

Store data in buffer.

By default, this method simply calls _save_to_resource(). Subclasses must implement specific buffering strategies.

classmethod backend_is_buffered()

Check if this backend is currently buffered.

classmethod buffer_backend(*args, **kwargs)

Enter context to buffer all operations for this backend.
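
For illustration, a hedged sketch of backend-wide buffering using the JSON backend documented above; it assumes buffer_backend() is entered as a context manager, as its signature suggests, and the filename is a placeholder:

from signac.synced_collections.backends.collection_json import BufferedJSONDict

doc = BufferedJSONDict("data.json")
with BufferedJSONDict.buffer_backend():
    for i in range(1000):
        doc[f"key_{i}"] = i  # accumulated in the buffer instead of rewriting the file each time
# The buffer is flushed to data.json when the outermost buffering context exits.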

synced_collections.buffers.file_buffered_collection module

A standardized buffering implementation for file-based backends.

All file-based backends can use a similar buffering protocol. In particular, integrity checks can be performed by checking for whether the file has been modified since it was originally loaded into the buffer. However, various specific components are abstract and must be implemented by child classes.

class signac.synced_collections.buffers.file_buffered_collection.FileBufferedCollection(parent=None, filename=None, *args, **kwargs)

Bases: BufferedCollection

A SyncedCollection that can buffer file I/O.

This class provides a standardized buffering protocol for all file-based backends. All file-based backends can use the same set of integrity checks prior to a buffer flush to ensure that no conflicting modifications are made. Specifically, they can check whether the file has been modified on disk since it was originally loaded to the buffer. This class provides the basic infrastructure for that and defines standard methods that can be used by all classes. Subclasses must define the appropriate storage mechanism.

Note

Important notes for developers:
  • This class should be inherited before any other collections. This requirement is due to the extensive use of multiple inheritance. Since this class is designed to be combined with other SyncedCollection types without making those types aware of buffering behavior, it transparently hooks into the initialization process, but this is dependent on its constructor being called before those of other classes.

  • All subclasses must define a class level _BUFFER_CAPACITY variable that is used to determine the maximum allowable buffer size.

Parameters

filename (str, optional) – The filename of the associated file on disk (Default value = None).

Warning

Although it can be done safely, in general modifying two different collections pointing to the same underlying resource while both are in different buffering modes is unsupported and can lead to undefined behavior. This class makes a best effort at performing safe modifications, but it is possible to construct nested buffered contexts for different objects that can lead to an invalid buffer state, or even situations where there is no obvious indicator of what is the canonical source of truth. In general, if you need multiple objects pointing to the same resource, it is strongly recommended to keep them in identical buffering states at all times.

classmethod buffer_backend(buffer_size=None, force_write=None, *args, **kwargs)

Enter context to buffer all operations for this backend.

Parameters
  • buffer_size (int) – The capacity of the buffer to use within this context (resets after the context is exited).

  • force_write (bool) – This argument does nothing and is only present for compatibility with signac 1.x.

classmethod disable_multithreading()

Disable all safety checks and thread locks required for thread safety.

This method also disables the managed buffer-related thread safety in addition to what the parent method does.

classmethod enable_multithreading()

Enable safety checks and thread locks required for thread safety.

This method adds managed buffer-related thread safety in addition to what the parent method does.

classmethod get_buffer_capacity()

Get the current buffer capacity.

Returns

The amount of data that can be stored before a flush is triggered in the appropriate units for a particular buffering implementation.

Return type

int

classmethod get_current_buffer_size()

Get the total amount of data currently stored in the buffer.

Returns

The size of all data contained in the buffer in the appropriate units for a particular buffering implementation.

Return type

int

classmethod set_buffer_capacity(new_capacity)

Update the buffer capacity.

Parameters

new_capacity (int) – The new capacity of the buffer in the appropriate units for a particular buffering implementation.
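
For illustration, a hedged sketch of inspecting and adjusting the buffer capacity through a concrete file-buffered type such as BufferedJSONDict; for the serialized buffer described below, the units are bytes of file data:

from signac.synced_collections.backends.collection_json import BufferedJSONDict

print(BufferedJSONDict.get_buffer_capacity())      # current capacity
print(BufferedJSONDict.get_current_buffer_size())  # data currently held in the buffer
BufferedJSONDict.set_buffer_capacity(64 * 1024 * 1024)  # e.g. raise the capacity to 64 MiB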

synced_collections.buffers.serialized_file_buffered_collection module

Buffering for file-based backends using a serialized buffer.

The buffering method implemented here involves a single buffer of serialized data. All collections in buffered mode encode their data into this buffer on save and decode from it on load.

class signac.synced_collections.buffers.serialized_file_buffered_collection.SerializedFileBufferedCollection(parent=None, filename=None, *args, **kwargs)

Bases: FileBufferedCollection

A FileBufferedCollection based on a serialized data store.

This class extends the FileBufferedCollection and implements a concrete storage mechanism in which data is encoded (by default, into JSON) and stored into a buffer. This buffer functions as a central data store for all collections and is a synchronization point for various collections pointing to the same underlying file. This serialization method may be a bottleneck in some applications; see the Warnings section for more information.

The buffer size and capacity for this class is measured in the total number of bytes stored in the buffer that correspond to file data. This is not the total size of the buffer, which also contains additional information like the hash of the data and the file metadata (which are used for integrity checks), but it is the relevant metric for users.

Note

Important note for subclasses: This class should be inherited before any other collections. This requirement is due to the extensive use of multiple inheritance: since this class is designed to be combined with other SyncedCollection types without making those types aware of buffering behavior, it transparently hooks into the initialization process, but this is dependent on its constructor being called before those of other classes.

Thread safety

This buffering method is thread safe. This thread safety is independent of the safety of an individual collection backend; the backend must support thread safe writes to the underlying resource in order for a buffered version using this class to be thread safe for general use. The thread safety guaranteed by this class only concerns buffer reads, writes, and flushes. All these operations are serialized because there is no way to prevent one collection from triggering a flush while another still thinks its data is in the cache.

Parameters

filename (str, optional) – The filename of the associated file on disk (Default value = None).

Warning

  • Although it can be done safely, in general modifying two different collections pointing to the same underlying resource while both are in different buffering modes is unsupported and can lead to undefined behavior. This class makes a best effort at performing safe modifications, but it is possible to construct nested buffered contexts for different objects that can lead to an invalid buffer state, or even situations where there is no obvious indicator of what is the canonical source of truth. In general, if you need multiple objects pointing to the same resource, it is strongly recommended to keep them in identical buffering states at all times.

  • The overhead of this buffering method is quite high due to the constant encoding and decoding of data. For performance-critical applications where memory is not highly constrained and virtual memory limits are absent, the SharedMemoryFileBufferedCollection may be more appropriate.

  • Due to the possibility of read operations triggering a flush, the contents of the buffer may be invalidated on loads as well. To prevent this, even nominally read-only operations are serialized. As a result, although this class is thread safe, it effectively serializes all operations and is therefore not performant.

synced_collections.buffers.memory_buffered_collection module

A standardized buffering implementation for file-based backends.

The buffering method implemented here involves a single buffer of references to in-memory objects containing data. These objects are the base types of a given SyncedCollection type, e.g. a dict for all dict-like collections, and are the underlying data stores for those types. This buffering method exploits the fact that all mutable collection types in Python are references, so modifying one such collection results in modifying all of them, thereby removing any need for more complicated synchronization protocols.

class signac.synced_collections.buffers.memory_buffered_collection.SharedMemoryFileBufferedCollection(parent=None, filename=None, *args, **kwargs)

Bases: FileBufferedCollection

A SyncedCollection that defers all I/O when buffered.

This class extends the FileBufferedCollection and implements a concrete storage mechanism in which collections store a reference to their data in a buffer. This method takes advantage of the reference-based semantics of built-in Python mutable data types like dicts and lists. All collections referencing the same file are pointed to the same underlying data store in buffered mode, allowing all changes in one to be transparently reflected in the others. To further improve performance, the buffer size is determined only based on the number of modified collections stored, not the total number. As a result, the maximum capacity is only reached when a large number of modified collections are stored, and unmodified collections are only removed from the buffer when a buffered context is exited (rather than when buffer capacity is exhausted). See the Warnings section for more information.

The buffer size and capacity for this class is measured in the total number of collections stored in the buffer that have undergone any modifications since their initial load from disk. A sequence of read-only operations will load data into the buffer, but the apparent buffer size will be zero.

Note

Important note for subclasses: This class should be inherited before any other collections. This requirement is due to the extensive use of multiple inheritance: since this class is designed to be combined with other SyncedCollection types without making those types aware of buffering behavior, it transparently hooks into the initialization process, but this is dependent on its constructor being called before those of other classes.

Thread safety

This buffering method is thread safe. This thread safety is independent of the safety of an individual collection backend; the backend must support thread safe writes to the underlying resource in order for a buffered version using this class to be thread safe for general use. The thread safety guaranteed by this class only concerns buffer reads, writes, and flushes. All these operations are serialized because there is no way to prevent one collection from triggering a flush while another still thinks its data is in the cache; however, this shouldn’t be terribly performance-limiting since in buffered mode we’re avoiding I/O anyway and that’s the only thing that can be effectively parallelized here.

Parameters

filename (str, optional) – The filename of the associated file on disk (Default value = None).

Warning

  • Although it can be done safely, in general modifying two different collections pointing to the same underlying resource while both are in different buffering modes is unsupported and can lead to undefined behavior. This class makes a best effort at performing safe modifications, but it is possible to construct nested buffered contexts for different objects that can lead to an invalid buffer state, or even situations where there is no obvious indicator of what is the canonical source of truth. In general, if you need multiple objects pointing to the same resource, it is strongly recommended to keep them in identical buffering states at all times.

  • This buffering method has no upper bound on the buffer size if all operations on buffered objects are read-only operations. If a strict upper bound is required, for instance due to strict virtual memory limits on a given system, use of the SerializedFileBufferedCollection will allow limiting the total memory usage of the process.
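
For illustration, a hedged sketch of opting into the shared-memory buffering strategy via the JSON backend classes documented above; the filename is a placeholder and buffer_backend() is assumed to be entered as a context manager:

from signac.synced_collections.backends.collection_json import MemoryBufferedJSONDict

# MemoryBufferedJSONDict buffers references in memory, avoiding the encode/decode
# overhead of the serialized buffer at the cost of a looser bound on buffer size.
doc = MemoryBufferedJSONDict("data.json")
with MemoryBufferedJSONDict.buffer_backend():
    doc["foo"] = "bar"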

Miscellaneous Modules

synced_collections.utils module

Define common utilities.

class signac.synced_collections.utils.AbstractTypeResolver(abstract_type_identifiers, cache_blocklist=None)

Bases: object

Mapping between recognized types and their abstract parents.

Synced collections are heavily reliant on checking the types of objects to determine the appropriate type of behavior in various scenarios. For maximum generality, most of these checks use the ABCs defined in collections.abc. The price of this flexibility is that isinstance checks with these classes are very slow because the __instancecheck__ hooks are implemented in pure Python and require checking many different cases.

Rather than attempting to directly optimize this behavior, this class provides a workaround by which we can amortize the cost of type checks. Given a set of types that must be resolved and a way to identify each of these (which may be expensive), it maintains a local cache of all instances of a given type that have previously been observed. This reduces the cost of type checking to a simple dict lookup, except for the first time a new type is observed.

Parameters
  • abstract_type_identifiers (Mapping) – A mapping from a string identifier for a group of types (e.g. "MAPPING") to a callable that can be used to identify that type. Due to insertion order guarantees of dictionaries in Python>=3.6 (officially 3.7), it may be beneficial to order this dictionary with the most frequently occurring types first. However, unless users have many different concrete types implementing the same abstract interface (e.g. many Mapping types identified via isinstance(obj, Mapping)), any performance gain should be negligible since the callables will only be executed once per type.

  • cache_blocklist (Sequence, optional) – A sequence of string identifiers from abstract_type_identifiers that should not be cached. If there are cases where objects of the same type would be classified into separate groups based on the callables in abstract_type_identifiers, this argument allows users to specify that this type should not be cached. This argument should be used sparingly because performance will quickly degrade if many calls to get_type() involve types that cannot be cached. The identifiers (keys in abstract_type_identifiers) corresponding to elements of the blocklist should be placed first in the abstract_type_identifiers dictionary since they will never be cached and are therefore the most likely callables to be used repeatedly (Default value = None).

abstract_type_identifiers

A mapping from string identifiers for an abstract type to callables that accept an object and return True if the object is of the key type and False otherwise.

Type

Dict[str, Callable[Any, bool]]

type_map

A mapping from concrete types to the corresponding named abstract type from abstract_type_identifiers.

Type

Dict[Type, str]

get_type(obj)

Get the type string corresponding to this data type.

Parameters

obj (Any) – The object whose type should be checked.

Returns

The name of the type, where valid types are the keys of the dict argument to the constructor. If the object's type cannot be identified, None is returned.

Return type

str
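
For illustration, a minimal sketch of amortized type checking with AbstractTypeResolver; the identifier names are placeholders:

from collections.abc import Mapping, Sequence

from signac.synced_collections.utils import AbstractTypeResolver

resolver = AbstractTypeResolver(
    {
        "MAPPING": lambda obj: isinstance(obj, Mapping),
        "SEQUENCE": lambda obj: isinstance(obj, Sequence) and not isinstance(obj, str),
    }
)

# The first lookup for a concrete type runs the callables; later lookups are cached dict hits.
assert resolver.get_type({"a": 1}) == "MAPPING"
assert resolver.get_type([1, 2, 3]) == "SEQUENCE"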

class signac.synced_collections.utils.SyncedCollectionJSONEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)

Bases: JSONEncoder

A json.JSONEncoder that handles objects encodable using default().

Warning

  • JSON encoding of numpy arrays is not invertible; once encoded, reloading the data will result in converting arrays to lists and numpy numbers into ints or floats.

  • This class assumes that the in-memory data for a SyncedCollection is up-to-date. If the data has been changed on disk without updating the collection, or if this class is used to serialize the data before any method of the collection is invoked that would load the data from disk, the resulting serialized data may be incorrect.

default(o: Any) → Dict[str, Any]

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)

signac.synced_collections.utils.default(o: Any) → Dict[str, Any]

Get a JSON-serializable version of compatible types.

This function is suitable for use with JSON-serialization tools as a way to serialize SyncedCollection objects and NumPy arrays. It will attempt to obtain a JSON-serializable representation of an object that is otherwise not serializable by attempting to access its _data attribute.

Warning

  • JSON encoding of numpy arrays is not invertible; once encoded, reloading the data will result in converting arrays to lists and numpy numbers into ints or floats.

  • This function assumes that the in-memory data for a SyncedCollection is up-to-date. If the data has been changed on disk without updating the collection, or if this function is used to serialize the data before any method is invoked that would load the data from disk, the resulting serialized data may be incorrect.
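
For illustration, a hedged sketch of serializing NumPy-containing data with the helpers above; it assumes numpy is installed and, per the warning, the encoding is not invertible (arrays become lists):

import json

import numpy

from signac.synced_collections.utils import SyncedCollectionJSONEncoder, default

data = {"values": numpy.arange(3)}
as_text = json.dumps(data, cls=SyncedCollectionJSONEncoder)  # encoder class route
as_text_2 = json.dumps(data, default=default)                # plain default-hook route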

synced_collections.validators module

Validators for SyncedCollection API.

A validator is any callable that raises exceptions when called with invalid data. Validators should act recursively for nested data structures and should not return any values, only raise errors. This module implements built-in validators, but client code is free to implement and add additional validators to collection types as needed.

signac.synced_collections.validators.json_format_validator(data)

Validate input data can be serialized to JSON.

Parameters

data – Data to validate.

Raises
  • KeyTypeError – If key data type is not supported.

  • TypeError – If the data type of data is not supported.

signac.synced_collections.validators.no_dot_in_key(data)

Raise an exception if there is a dot (.) in a mapping’s key.

Parameters

data – Data to validate.

Raises
  • KeyTypeError – If key data type is not supported.

  • InvalidKeyError – If the key contains invalid characters or is otherwise malformed.

signac.synced_collections.validators.require_string_key(data)

Raise an exception if a key in a mapping is not a string.

Almost all supported backends require string keys.

Parameters

data – Data to validate.

Raises

KeyTypeError – If key type is not a string.
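
For illustration, a minimal sketch of calling the built-in validators directly; they return nothing for valid data and raise for invalid data (the exception class names are printed rather than imported here):

from signac.synced_collections.validators import (
    json_format_validator,
    no_dot_in_key,
    require_string_key,
)

json_format_validator({"a": [1, 2.5, "text", None]})  # passes silently

for bad_data, validator in [({"a.b": 1}, no_dot_in_key), ({1: "x"}, require_string_key)]:
    try:
        validator(bad_data)
    except Exception as error:
        print(type(error).__name__)  # e.g. InvalidKeyError, KeyTypeError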