API Reference

This is the API for the signac (core) application.

The Project

Attributes

Project.build_job_search_index(index[, _trust])

Build a job search index.

Project.build_job_statepoint_index([...])

Build a state point index to identify jobs with specific parameters.

Project.check()

Check the project's workspace for corruption.

Project.clone(job[, copytree])

Clone job into this project.

Project.config

Get project's configuration.

Project.create_access_module([filename, ...])

Create the access module for indexing.

Project.create_linked_view([prefix, ...])

Create or update a persistent linked view of the selected data space.

Project.detect_schema([exclude_const, ...])

Detect the project's state point schema.

Project.data

Get data associated with this project.

Project.doc

Get document associated with this project.

Project.document

Get document associated with this project.

Project.dump_statepoints(statepoints)

Dump the state points and associated job ids.

Project.export_to(target[, path, copytree])

Export all jobs to a target location, such as a directory or a (compressed) archive file.

Project.find_job_ids([filter, doc_filter, index])

Find the job_ids of all jobs matching the filters.

Project.find_jobs([filter, doc_filter])

Find all jobs in the project's workspace.

Project.fn(filename)

Prepend a filename with the project's root directory path.

Project.get_id()

Get the project identifier.

Project.get_statepoint(jobid[, fn])

Get the state point associated with a job id.

Project.groupby([key, default])

Group jobs according to one or more state point or document parameters.

Project.groupbydoc([key, default])

Group jobs according to one or more document values.

Project.import_from([origin, schema, sync, ...])

Import the data space located at origin into this project.

Project.id

Get the project identifier.

Project.index([formats, depth, skip_errors, ...])

Generate an index of the project's workspace.

Project.isfile(filename)

Check if a filename exists in the project's root directory.

Project.min_len_unique_id()

Determine the minimum length required for a job id to be unique.

Project.num_jobs()

Return the number of initialized jobs.

Project.open_job([statepoint, id])

Get a job handle associated with a state point.

Project.read_statepoints([fn])

Read all state points from a file.

Project.repair([fn_statepoints, index, job_ids])

Attempt to repair the workspace after it got corrupted.

Project.reset_statepoint(job, new_statepoint)

Overwrite the state point of this job while preserving job data.

Project.root_directory()

Alias for path.

Project.stores

Get HDF5-stores associated with this project.

Project.sync(other[, strategy, exclude, ...])

Synchronize this project with the other project.

Project.update_cache()

Update the persistent state point cache.

Project.update_statepoint(job, update[, ...])

Change the state point of this job while preserving job data.

Project.workspace

The project's workspace directory.

Project.write_statepoints([statepoints, fn, ...])

Dump state points to a file.

class signac.Project(config=None)

Bases: object

The handle on a signac project.

Application developers should usually not need to directly instantiate this class, but use get_project() instead.

Parameters

config – The project configuration to use. By default, it loads the first signac project configuration found while searching upward from the current working directory (Default value = None).

FN_CACHE = '.signac_sp_cache.json.gz'

The default filename for the state point cache file.

FN_DOCUMENT = 'signac_project_document.json'

The project’s document filename.

FN_STATEPOINTS = 'signac_statepoints.json'

The default filename to read from and write state points to.

KEY_DATA = 'signac_data'

The project’s datastore key.

build_job_search_index(index, _trust=False)

Build a job search index.

Parameters
  • index (list) – A document index.

  • _trust – (Default value = False).

Returns

A job search index based on the provided index.

Return type

JobSearchIndex

Deprecated since version 1.3: This will be removed in 2.0.

build_job_statepoint_index(exclude_const=False, index=None)

Build a state point index to identify jobs with specific parameters.

This method generates pairs of state point keys and mappings of values to a set of all corresponding job ids. The pairs are ordered by the number of different values. Since state point keys may be nested, they are represented as a tuple. For example:

>>> for i in range(4):
...     project.open_job({'a': i, 'b': {'c': i % 2}}).init()
...
>>> for key, value in project.build_job_statepoint_index():
...     print(key)
...     pprint.pprint(value)
...
('b', 'c')
defaultdict(<class 'set'>,
            {0: {'3a530c13bfaf57517b4e81ecab6aec7f',
                 '4e9a45a922eae6bb5d144b36d82526e4'},
             1: {'d49c6609da84251ab096654971115d0c',
                 '5c2658722218d48a5eb1e0ef7c26240b'}})
('a',)
defaultdict(<class 'set'>,
            {0: {'4e9a45a922eae6bb5d144b36d82526e4'},
             1: {'d49c6609da84251ab096654971115d0c'},
             2: {'3a530c13bfaf57517b4e81ecab6aec7f'},
             3: {'5c2658722218d48a5eb1e0ef7c26240b'}})

Values that are constant over the complete data space can be optionally ignored with the exclude_const argument set to True.

Parameters
  • exclude_const (bool) – Exclude entries that are shared by all jobs that are part of the index (Default value = False).

  • index – A document index.

Yields

tuple – Pairs of state point keys and mappings of values to a set of all corresponding job ids.

Deprecated since version 1.3: This will be removed in 2.0. Use the detect_schema() function instead.

check()

Check the project’s workspace for corruption.

Raises

signac.errors.JobsCorruptedError – When one or more jobs are identified as corrupted.
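
For example, a minimal sketch of verifying the workspace (assuming a project handle obtained via get_project()):

import signac

project = signac.get_project()
try:
    project.check()
except signac.errors.JobsCorruptedError as error:
    # One or more jobs have a missing or invalid state point file.
    print("Workspace corrupted:", error)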

clone(job, copytree=<function copytree>)

Clone job into this project.

Create an identical copy of job within this project.

See signac clone for the command line equivalent.

Parameters
  • job (Job) – The job to copy into this project.

  • copytree – The function used to copy the job's directory tree into this project's workspace. Defaults to shutil.copytree().

Returns

The job instance corresponding to the copied job.

Return type

Job

Raises

DestinationExistsError – In case that a job with the same id is already initialized within this project.
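
For example, a minimal sketch of cloning jobs from another project; the source project path is hypothetical:

import signac

project = signac.get_project()
other = signac.get_project('/path/to/other_project')  # hypothetical path

for job in other:
    try:
        project.clone(job)
    except signac.errors.DestinationExistsError:
        pass  # a job with the same id is already initialized here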

property config

Get project’s configuration.

Returns

Dictionary containing project’s configuration.

Return type

_ProjectConfig

create_access_module(filename=None, main=True, master=None)

Create the access module for indexing.

This method generates the access module required to make this project’s index part of a main index.

Parameters
  • filename (str) – The name of the access module file. Defaults to the standard name and should usually not be changed.

  • main (bool) – If True, add directives for the compilation of a master index when executing the module (Default value = True).

  • master (bool) – Deprecated parameter. Replaced by main.

Returns

Access module name.

Return type

str

Deprecated since version 1.5: This will be removed in 2.0. Access modules are deprecated.

create_linked_view(prefix=None, job_ids=None, index=None, path=None)

Create or update a persistent linked view of the selected data space.

Similar to export_to(), this function expands the data space for the selected jobs, but instead of copying data will create symbolic links to the individual job workspace directories. This is primarily useful for browsing through the data space using a file-browser with human-interpretable directory paths.

By default, the paths of the view will be based on variable state point keys as part of the implicit schema of the selected jobs that we create the view for. For example, creating a linked view for a data space with schema

>>> print(project.detect_schema())
{
 'foo': 'int([0, 1, 2, ..., 8, 9], 10)',
}

by calling project.create_linked_view('my_view') will look similar to:

my_view/foo/0/job -> workspace/b8fcc6b8f99c56509eb65568922e88b8
my_view/foo/1/job -> workspace/b6cd26b873ae3624653c9268deff4485
...

It is possible to control the paths using the path argument, which behaves in the exact same manner as the equivalent argument for export_to().

Note

The behavior of this function is almost equivalent to project.export_to('my_view', copytree=os.symlink) with the major difference that view hierarchies are actually updated, meaning that invalid links are automatically removed.

See signac view for the command line equivalent.

Parameters
  • prefix (str) – The path where the linked view will be created or updated (Default value = None).

  • job_ids (iterable) – If None (the default), create the view for the complete data space, otherwise only for this iterable of job ids.

  • index – A document index (Default value = None).

  • path – The path (function) used to structure the linked data space (Default value = None).

Returns

A dictionary that maps the source directory paths to the linked directory paths.

Return type

dict

property data

Get data associated with this project.

This property should be used for large array-like data, which can’t be stored efficiently in the project document. For examples and usage, see Centralized Project Data.

Equivalent to:

return project.stores['signac_data']

See also

H5Store

Usage examples.

Returns

An HDF5-backed datastore.

Return type

H5Store

detect_schema(exclude_const=False, subset=None, index=None)

Detect the project’s state point schema.

See signac schema for the command line equivalent.

Parameters
  • exclude_const (bool) – Exclude all state point keys that are shared by all jobs within this project (Default value = False).

  • subset – A sequence of jobs or job ids specifying a subset over which the state point schema should be detected (Default value = None).

  • index – A document index (Default value = None).

Returns

The detected project schema.

Return type

ProjectSchema
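
For example, a brief sketch of inspecting the detected schema; the returned ProjectSchema can be printed directly:

import signac

project = signac.get_project()
schema = project.detect_schema(exclude_const=True)
print(schema)  # summary of variable state point keys and their value ranges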

property doc

Get document associated with this project.

Alias for document().

Returns

The project document.

Return type

BufferedJSONAttrDict

property document

Get document associated with this project.

Returns

The project document.

Return type

BufferedJSONAttrDict

dump_statepoints(statepoints)

Dump the state points and associated job ids.

Equivalent to:

{project.open_job(sp).id: sp for sp in statepoints}

Parameters

statepoints (iterable) – A list of state points.

Returns

A mapping, where the key is the job id and the value is the state point.

Return type

dict

Deprecated since version 1.8: This will be removed in 2.0. State point backup files are being removed in favor of Project.update_cache().

export_to(target, path=None, copytree=None)

Export all jobs to a target location, such as a directory or a (compressed) archive file.

Use this function in combination with find_jobs() to export only a select number of jobs, for example:

project.find_jobs({'foo': 0}).export_to('foo_0.tar')

The path argument enables users to control how exactly the exported data space is to be expanded. By default, the path-function will be based on the implicit schema of the exported jobs. For example, exporting jobs that all differ by a state point key foo with project.export_to('data/'), the exported directory structure could look like this:

data/foo/0
data/foo/1
...

That would be equivalent to specifying path=lambda job: os.path.join('foo', job.sp.foo).

Instead of a function, we can also provide a string, where fields for state point keys are automatically formatted. For example, the following two path arguments are equivalent: “foo/{foo}” and “foo/{job.sp.foo}”.

Any attribute of job can be used as a field here, so job.doc.bar, job.id, and job.ws can also be used as path fields.

A special {{auto}} field allows us to expand the path automatically with state point keys that have not been specified explicitly. So, for example, one can provide path="foo/{foo}/{{auto}}" to specify that the path shall begin with foo/{foo}/, but is then automatically expanded with all other state point key-value pairs. How key-value pairs are concatenated can be controlled via the format-specifier, so for example, path="{{auto:_}}" will generate a structure such as

data/foo_0
data/foo_1
...

Finally, providing path=False is equivalent to path="{job.id}".

See also

import_from() :

Previously exported or non-signac data spaces can be imported.

signac export :

See signac export for the command line equivalent.

Parameters
  • target – A path to a directory to export to. The target must not already exist. Besides directories, possible targets are tar files (.tar), gzipped tar files (.tar.gz), zip files (.zip), bzip2-compressed files (.bz2), and xz-compressed files (.xz).

  • path – The path (function) used to structure the exported data space. This argument must either be a callable which returns a path (str) as a function of job, a string where fields are replaced using the job-state point dictionary, or False, which means that we just use the job-id as path. Defaults to the equivalent of {{auto}}.

  • copytree – The function used for the actual copying of directory tree structures. Defaults to shutil.copytree(). Can only be used when the target is a directory.

Returns

A dict that maps the source directory paths to the target directory paths.

Return type

dict
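
A short sketch combining a job selection with the path argument; the state point key foo and the target names are assumptions for illustration:

import os
import signac

project = signac.get_project()

# Export selected jobs to a gzipped tar archive, structured as foo/<value>/
# followed by the remaining state point keys.
project.find_jobs({'foo': 0}).export_to('foo_0.tar.gz', path='foo/{foo}/{{auto}}')

# The equivalent structure expressed with a callable path function:
project.export_to('data/', path=lambda job: os.path.join('foo', str(job.sp.foo)))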

find_job_ids(filter=None, doc_filter=None, index=None)

Find the job_ids of all jobs matching the filters.

The optional filter arguments must be a Mapping of key-value pairs and JSON serializable.

Note

Providing a pre-calculated index may vastly increase the performance of this function.

Parameters
  • filter (Mapping) – A mapping of key-value pairs that all indexed job state points are compared against (Default value = None).

  • doc_filter (Mapping) – A mapping of key-value pairs that all indexed job documents are compared against (Default value = None).

  • index – A document index. If not provided, an index will be computed (Default value = None).

Returns

The ids of all indexed jobs matching both filters.

Raises
  • TypeError – If the filters are not JSON serializable.

  • ValueError – If the filters are invalid.

  • RuntimeError – If the filters are not supported by the index.

Deprecated since version 1.3: This will be removed in 2.0. Use find_jobs() instead, then access ids with job.id. Replicate the original behavior with [job.id for job in project.find_jobs()].

find_jobs(filter=None, doc_filter=None)

Find all jobs in the project’s workspace.

The optional filter arguments must be a Mapping of key-value pairs and JSON serializable. The filter argument is used to search against job state points, whereas the doc_filter argument compares against job document keys.

See signac find for the command line equivalent.

Parameters
  • filter (Mapping) – A mapping of key-value pairs that all indexed job state points are compared against (Default value = None).

  • doc_filter (Mapping) – A mapping of key-value pairs that all indexed job documents are compared against (Default value = None).

Returns

JobsCursor of jobs matching the provided filter(s).

Return type

JobsCursor

Raises
  • TypeError – If the filters are not JSON serializable.

  • ValueError – If the filters are invalid.

  • RuntimeError – If the filters are not supported by the index.
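
For example, a minimal sketch using a state point filter together with a document filter; the keys foo and converged are assumptions:

import signac

project = signac.get_project()

# Select jobs with state point foo == 0 whose document marks them as converged.
for job in project.find_jobs({'foo': 0}, doc_filter={'converged': True}):
    print(job.id, job.sp.foo)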

fn(filename)

Prepend a filename with the project’s root directory path.

Parameters

filename (str) – The name of the file.

Returns

The joined path of project root directory and filename.

Return type

str
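
For example (the file name is hypothetical):

import signac

project = signac.get_project()
params_path = project.fn('params.json')  # <project root>/params.json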

get_id()

Get the project identifier.

Returns

The project id.

Return type

str

Deprecated since version 1.3: This will be removed in 2.0. Use project.id instead.

classmethod get_job(root=None)

Find a Job in or above the current working directory (or provided path).

Parameters

root (str) – The job root directory. If no root directory is given, the current working directory is assumed to be the job directory (Default value = None).

Returns

The job instance.

Return type

Job

Raises

LookupError – When job cannot be found.

classmethod get_project(root=None, search=True, **kwargs)

Find a project configuration and return the associated project.

Parameters
  • root (str) – The starting point to search for a project, defaults to the current working directory.

  • search (bool) – If True, search for project configurations inside and above the specified root directory, otherwise only return projects with a root directory identical to the specified root argument (Default value = True).

  • **kwargs – Optional keyword arguments that are forwarded to the Project class constructor.

Returns

An instance of Project.

Return type

Project

Raises

LookupError – When project configuration cannot be found.

get_statepoint(jobid, fn=None)

Get the state point associated with a job id.

The state point is retrieved from the internal cache, from the workspace or from a state points file.

Parameters
  • jobid (str) – A job id to get the state point for.

  • fn (str) – The filename of the file containing the state points, defaults to FN_STATEPOINTS.

Returns

The state point corresponding to jobid.

Return type

dict

Raises

KeyError – If the state point associated with the job id could not be found.

Deprecated since version 1.3: This will be removed in 2.0. Use open_job(id=jobid).statepoint() instead.

groupby(key=None, default=None)

Group jobs according to one or more state point or document parameters.

Prepend the key with ‘sp.’ or ‘doc.’ to specify the query namespace. If no prefix is specified, group by state point key.

This method can be called on any JobsCursor such as the one returned by find_jobs() or by iterating over a project.

Examples

# Group jobs by state point parameter 'a'.
for key, group in project.groupby('a'):
    print(key, list(group))

# Group jobs by document value 'a'.
for key, group in project.groupby('doc.a'):
    print(key, list(group))

# Group jobs by jobs.sp['a'] and job.document['b']
for key, group in project.groupby(('a', 'doc.b')):
    print(key, list(group))

# Find jobs where job.sp['a'] is 1 and group them
# by job.sp['b'] and job.sp['c'].
for key, group in project.find_jobs({'a': 1}).groupby(('b', 'c')):
    print(key, list(group))

# Group by job.sp['d'] and job.document['count'] using a lambda.
for key, group in project.groupby(
    lambda job: (job.sp['d'], job.document['count'])
):
    print(key, list(group))

If key is None, jobs are grouped by id, placing one job into each group.

If default is None, only jobs with the key defined will be grouped. Jobs without the key will be filtered out and not included in any group.

Parameters
  • key (str, iterable, or callable) – The grouping key(s) passed as a string, iterable of strings, or a callable that will be passed one argument, the job (Default value = None).

  • default – A default value to be used when a given key is not present. The value must be sortable and is only used if not None (Default value = None).

Yields
  • key – Grouped key.

  • group (iterable of Jobs) – Iterable of Job instances matching this group key.

groupbydoc(key=None, default=None)

Group jobs according to one or more document values.

This method can be called on any JobsCursor such as the one returned by find_jobs() or by iterating over a project.

Examples

# Group jobs by document value 'a'.
for key, group in project.groupbydoc('a'):
    print(key, list(group))

# Find jobs where job.sp['a'] is 1 and group them
# by job.document['b'] and job.document['c'].
for key, group in project.find_jobs({'a': 1}).groupbydoc(('b', 'c')):
    print(key, list(group))

# Group by whether 'd' is a field in the job.document using a lambda.
for key, group in project.groupbydoc(lambda doc: 'd' in doc):
    print(key, list(group))

If key is None, jobs are grouped by id, placing one job into each group.

Parameters
  • key (str, iterable, or callable) – The document grouping parameter(s) passed as a string, iterable of strings, or a callable that will be passed one argument, document (Default value = None).

  • default – A default value to be used when a given document key is not present. The value must be sortable and is only used if not None (Default value = None).

Deprecated since version 1.7: This will be removed in 2.0. Use groupby with a ‘doc.’ filter instead, see https://docs.signac.io/en/latest/query.html#query-namespaces.

property id

Get the project identifier.

Returns

The project id.

Return type

str

import_from(origin=None, schema=None, sync=None, copytree=None)

Import the data space located at origin into this project.

This function will walk through the data space located at origin and will try to identify data space paths that can be imported as a job workspace into this project.

The schema argument expects a function that takes a path argument and returns a state point dictionary. A default function is used when no argument is provided. The default schema function will simply look for state point manifest files (usually named signac_statepoint.json) and then import all data located within that path into the job workspace corresponding to the state point specified in the manifest file.

Alternatively, the schema argument may be a string that is converted into a schema function. For example, providing foo/{foo:int} as the schema argument means that all directories under foo/ will be imported and their names will be interpreted as the value for foo within the state point.

Tip

Use copytree=os.replace or copytree=shutil.move to move dataspaces on import instead of copying them.

Warning: Imports can fail due to conflicts. Moving data instead of copying may therefore lead to inconsistent states and users are advised to apply caution.

See also

export_to() : Export the project data space.

signac import :

See signac import for the command line equivalent.

Parameters
  • origin – The path to the data space origin, which is to be imported. This may be a path to a directory, a zip file, or a tarball archive (Default value = None).

  • schema – An optional schema function, which is either a string or a function that accepts a path as its first and only argument and returns the corresponding state point as dict. (Default value = None).

  • sync – If True, the project will be synchronized with the imported data space. If a dict of keyword arguments is provided, the arguments will be used for sync() (Default value = None).

  • copytree – Specify which exact function to use for the actual copytree operation. Defaults to shutil.copytree().

Returns

A dict that maps the source directory paths to the target directory paths.

Return type

dict
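
A brief sketch importing a directory with a string schema; the origin path and the foo/{foo:int} layout are assumptions:

import signac

project = signac.get_project()

# Interpret each directory name under 'foo/' as the integer value of 'foo'.
mapping = project.import_from(origin='/data/external', schema='foo/{foo:int}')
for src, dst in mapping.items():
    print(src, '->', dst)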

index(formats=None, depth=0, skip_errors=False, include_job_document=True)

Generate an index of the project’s workspace.

This generator function indexes every file in the project’s workspace up to the specified depth. The job document, if it exists, is always indexed; other files need to be specified with the formats argument.

See signac project -i for the command line equivalent.

for doc in project.index({r'.*\.txt': 'TextFile'}):
    print(doc)

Parameters
  • formats (str, dict) – The format definitions, given as a pattern string (e.g. r'.*\.txt') or as a mapping from pattern strings to formats (e.g. {r'.*\.txt': 'TextFile'}). If None, only the job document is indexed (Default value = None).

  • depth (int) – Specifies the crawling depth. A value of 0 means no limit (Default value = 0).

  • skip_errors (bool) – Skip all errors which occur during indexing. This is useful when trying to repair a broken workspace (Default value = False).

  • include_job_document (bool) – Include the contents of job documents (Default value = True).

Yields

dict – Index document.

Deprecated since version 1.8: This will be removed in 2.0. Indexing is deprecated.

classmethod init_project(name=None, root=None, workspace=None, make_dir=True)

Initialize a project.

It is safe to call this function multiple times with the same arguments. However, a RuntimeError is raised if an existing project configuration would conflict with the provided initialization parameters.

See signac init for the command line equivalent.

Parameters
  • name (str, optional) – The name of the project to initialize (Default value = None).

  • root (str, optional) – The root directory for the project. Defaults to the current working directory.

  • workspace (str, optional) – The workspace directory for the project. Defaults to a subdirectory workspace in the project root.

  • make_dir (bool, optional) – Create the project root directory if it does not exist yet (Default value = True).

Returns

Initialized project, an instance of Project.

Return type

Project

Raises

RuntimeError – If the project root path already contains a conflicting project configuration.

isfile(filename)

Check if a filename exists in the project’s root directory.

Parameters

filename (str) – The name of the file.

Returns

True if filename exists in the project’s root directory.

Return type

bool

min_len_unique_id()

Determine the minimum length required for a job id to be unique.

This method’s runtime scales with the number of jobs in the workspace.

Returns

Minimum string length of a unique job identifier.

Return type

int
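
For example, a sketch that prints the shortest unambiguous prefix of each job id:

import signac

project = signac.get_project()
n = project.min_len_unique_id()
for job in project:
    print(job.id[:n])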

num_jobs()

Return the number of initialized jobs.

Returns

Count of initialized jobs.

Return type

int

Deprecated since version 1.8: This will be removed in 2.0. The num_jobs method is deprecated. Use len(project) instead.

open_job(statepoint=None, id=None)

Get a job handle associated with a state point.

This method returns the job instance associated with the given state point or job id. Opening a job by a valid state point never fails. Opening a job by id requires a lookup of the state point from the job id, which may fail if the job was not previously initialized.

Parameters
  • statepoint (dict) – The job’s unique set of state point parameters (Default value = None).

  • id (str) – The job id (Default value = None).

Returns

The job instance.

Return type

Job

Raises
  • KeyError – If the attempt to open the job by id fails.

  • LookupError – If the attempt to open the job by an abbreviated id returns more than one match.
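
For example, a minimal sketch of opening a job by state point and again by id; the state point keys are assumptions:

import signac

project = signac.get_project()

# Opening by state point never fails; init() creates the workspace directory.
job = project.open_job({'a': 0, 'b': {'c': 1}}).init()

# Opening by id requires that the job was previously initialized.
same_job = project.open_job(id=job.id)
assert same_job.id == job.id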

property path

The path to the project directory.

Type

str

read_statepoints(fn=None)

Read all state points from a file.

See also

dump_statepoints

Dump the state points and associated job ids.

write_statepoints

Dump state points to a file.

Parameters

fn (str) – The filename of the file containing the state points, defaults to FN_STATEPOINTS.

Returns

State points.

Return type

dict

Deprecated since version 1.8: This will be removed in 2.0. State point backup files are being removed in favor of Project.update_cache().

repair(fn_statepoints=None, index=None, job_ids=None)

Attempt to repair the workspace after it got corrupted.

This method will attempt to repair lost or corrupted job state point manifest files using a state points file or a document index or both.

Parameters
  • fn_statepoints (str) – The filename of the file containing the state points, defaults to FN_STATEPOINTS.

  • index – A document index (Default value = None).

  • job_ids – An iterable of job ids that should get repaired. Defaults to all jobs.

Raises

signac.errors.JobsCorruptedError – When one or more corrupted jobs could not be repaired.

reset_statepoint(job, new_statepoint)

Overwrite the state point of this job while preserving job data.

This method will change the job id if the state point has been altered.

Danger

Use this function with caution! Resetting a job’s state point may sometimes be necessary, but can possibly lead to incoherent data spaces.

Parameters
  • job (Job) – The job that should be reset to a new state point.

  • new_statepoint (mapping) – The job’s new state point.

Raises
  • DestinationExistsError – If a job associated with the new state point is already initialized.

  • OSError – If the move failed due to an unknown system related error.

Deprecated since version 1.3: This will be removed in 2.0. Use job.statepoint = new_statepoint instead.

root_directory()

Alias for path.

Deprecated since version 1.8: This will be removed in 2.0. Use Project.path instead.

property stores

Get HDF5-stores associated with this project.

Use this property to access an HDF5 file within the project’s root directory using the H5Store dict-like interface.

This is an example of accessing an HDF5 file called 'my_data.h5' within the project’s root directory:

project.stores['my_data']['array'] = np.random.rand(32, 4)

This is equivalent to:

H5Store(project.fn('my_data.h5'))['array'] = np.random.rand(32, 4)

Both the project.stores and the H5Store itself support attribute access. The above example could therefore also be expressed as:

project.stores.my_data.array = np.random.rand(32, 4)

Returns

The HDF5-Store manager for this project.

Return type

H5StoreManager

sync(other, strategy=None, exclude=None, doc_sync=None, selection=None, **kwargs)

Synchronize this project with the other project.

Try to clone all jobs from the other project to this project. If a job is already part of this project, try to synchronize the job using the optionally specified strategies.

See signac sync for the command line equivalent.

Parameters
  • other (Project) – The other project to synchronize this project with.

  • strategy – A file synchronization strategy (Default value = None).

  • exclude – Files with names matching the given pattern will be excluded from the synchronization (Default value = None).

  • doc_sync – The function applied for synchronizing documents (Default value = None).

  • selection – Only sync the given jobs (Default value = None).

  • **kwargs – This method also accepts the same keyword arguments as the sync_projects() function.

Raises
  • DocumentSyncConflict – If there are conflicting keys within the project or job documents that cannot be resolved with the given strategy or if there is no strategy provided.

  • FileSyncConflict – If there are differing files that cannot be resolved with the given strategy or if no strategy is provided.

  • SchemaSyncConflict – In case that the check_schema argument is True and the detected state point schema of this and the other project differ.
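
A minimal sketch; the other project path and the exclude pattern are assumptions:

import signac

project = signac.get_project()
other = signac.get_project('/path/to/other_project')  # hypothetical path

# Clone missing jobs from the other project; skip log files during file sync.
# A file synchronization strategy is required if conflicting files exist.
project.sync(other, exclude='*.log')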

temporary_project(name=None, dir=None)

Context manager for the initialization of a temporary project.

The temporary project is by default created within the root project’s workspace to ensure that they share the same file system. This is an example of how this method can be used for the import and synchronization of external data spaces.

with project.temporary_project() as tmp_project:
    tmp_project.import_from('/data')
    project.sync(tmp_project)

Parameters
  • name (str) – An optional name for the temporary project. Defaults to a unique random string.

  • dir (str) – Optionally specify where the temporary project root directory is to be created. Defaults to the project’s workspace directory.

Returns

An instance of Project.

Return type

Project

to_dataframe(*args, **kwargs)

Export the project metadata to a pandas DataFrame.

The arguments to this function are forwarded to JobsCursor.to_dataframe().

Parameters
  • *args, **kwargs – Positional and keyword arguments forwarded to JobsCursor.to_dataframe().

Return type

DataFrame
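
For example, a short sketch (requires pandas; the column selection assumes the default 'sp.' prefix):

import signac

project = signac.get_project()
df = project.to_dataframe(usecols=lambda col: col.startswith('sp.'))
print(df.head())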

update_cache()

Update the persistent state point cache.

This function updates a persistent state point cache, which is stored in the project root directory. Most data space operations, including iteration and filtering or selection are expected to be significantly faster after calling this function, especially for large data spaces.

update_statepoint(job, update, overwrite=False)

Change the state point of this job while preserving job data.

By default, this method will not change existing parameters of the state point of the job.

This method will change the job id if the state point has been altered.

Warning

While appending to a job’s state point is generally safe, modifying existing parameters may lead to data inconsistency. Use the overwrite argument with caution!

Parameters
  • job (Job) – The job whose state point shall be updated.

  • update (mapping) – A mapping used for the state point update.

  • overwrite (bool, optional) – If False, an error will be raised if the update modifies the values of existing keys in the state point. If True, any existing keys will be overwritten in the same way as dict.update(). Use with caution! (Default value = False).

Raises
  • KeyError – If the update contains keys which are already part of the job’s state point and overwrite is False.

  • DestinationExistsError – If a job associated with the new state point is already initialized.

  • OSError – If the move failed due to an unknown system related error.

Deprecated since version 1.3: This will be removed in 2.0. Use job.update_statepoint() instead.

property workspace

The project’s workspace directory.

See signac project -w for the command line equivalent.

Type

str

write_statepoints(statepoints=None, fn=None, indent=2)

Dump state points to a file.

If the file already contains state points, all new state points will be appended, while the old ones are preserved.

See also

dump_statepoints

Dump the state points and associated job ids.

Parameters
  • statepoints (iterable) – A list of state points, defaults to all state points which are defined in the workspace.

  • fn (str) – The filename of the file containing the state points, defaults to FN_STATEPOINTS.

  • indent (int) – Specify the indentation of the JSON file (Default value = 2).

Deprecated since version 1.8: This will be removed in 2.0. State point backup files are being removed in favor of Project.update_cache().

The JobsCursor class

Attributes

JobsCursor.export_to(target[, path, copytree])

Export all jobs to a target location, such as a directory or a (zipped) archive file.

JobsCursor.groupby([key, default])

Group jobs according to one or more state point or document parameters.

JobsCursor.groupbydoc([key, default])

Group jobs according to one or more document values.

JobsCursor.to_dataframe([sp_prefix, ...])

Convert the selection of jobs to a pandas DataFrame.

class signac.contrib.project.JobsCursor(project, filter=None, doc_filter=None)

Bases: object

An iterator over a search query result.

Application developers should not directly instantiate this class, but use find_jobs() instead.

Enables simple iteration and grouping operations.

Parameters
  • project (Project) – Project handle.

  • filter (Mapping) – A mapping of key-value pairs that all indexed job state points are compared against (Default value = None).

  • doc_filter (Mapping) – A mapping of key-value pairs that all indexed job documents are compared against (Default value = None).

Notes

Iteration is performed by acquiring job ids from the project using Project._find_job_ids(). When no filter (filter = None) is provided, that method can take a much faster execution path, so not passing a filter (or passing None explicitly) to this constructor is strongly recommended over passing an empty filter (filter = {}) when iterating over the entire data space.

export_to(target, path=None, copytree=None)

Export all jobs to a target location, such as a directory or a (zipped) archive file.

See also

export_to()

For full details on how to use this function.

Parameters
  • target (str) – A path to a directory or archive file to export to.

  • path (str or callable) – The path (function) used to structure the exported data space (Default value = None).

  • copytree (callable) – The function used for copying of directory tree structures. Defaults to shutil.copytree(). Can only be used when the target is a directory (Default value = None).

Returns

A dictionary that maps the source directory paths to the target directory paths.

Return type

dict

groupby(key=None, default=None)

Group jobs according to one or more state point or document parameters.

Prepend the key with ‘sp.’ or ‘doc.’ to specify the query namespace. If no prefix is specified, group by state point key.

This method can be called on any JobsCursor such as the one returned by find_jobs() or by iterating over a project.

Examples

# Group jobs by state point parameter 'a'.
for key, group in project.groupby('a'):
    print(key, list(group))

# Group jobs by document value 'a'.
for key, group in project.groupby('doc.a'):
    print(key, list(group))

# Group jobs by jobs.sp['a'] and job.document['b']
for key, group in project.groupby(('a', 'doc.b')):
    print(key, list(group))

# Find jobs where job.sp['a'] is 1 and group them
# by job.sp['b'] and job.sp['c'].
for key, group in project.find_jobs({'a': 1}).groupby(('b', 'c')):
    print(key, list(group))

# Group by job.sp['d'] and job.document['count'] using a lambda.
for key, group in project.groupby(
    lambda job: (job.sp['d'], job.document['count'])
):
    print(key, list(group))

If key is None, jobs are grouped by id, placing one job into each group.

If default is None, only jobs with the key defined will be grouped. Jobs without the key will be filtered out and not included in any group.

Parameters
  • key (str, iterable, or callable) – The grouping key(s) passed as a string, iterable of strings, or a callable that will be passed one argument, the job (Default value = None).

  • default – A default value to be used when a given key is not present. The value must be sortable and is only used if not None (Default value = None).

Yields
  • key – Grouped key.

  • group (iterable of Jobs) – Iterable of Job instances matching this group key.

groupbydoc(key=None, default=None)

Group jobs according to one or more document values.

This method can be called on any JobsCursor such as the one returned by find_jobs() or by iterating over a project.

Examples

# Group jobs by document value 'a'.
for key, group in project.groupbydoc('a'):
    print(key, list(group))

# Find jobs where job.sp['a'] is 1 and group them
# by job.document['b'] and job.document['c'].
for key, group in project.find_jobs({'a': 1}).groupbydoc(('b', 'c')):
    print(key, list(group))

# Group by whether 'd' is a field in the job.document using a lambda.
for key, group in project.groupbydoc(lambda doc: 'd' in doc):
    print(key, list(group))

If key is None, jobs are grouped by id, placing one job into each group.

Parameters
  • key (str, iterable, or callable) – The document grouping parameter(s) passed as a string, iterable of strings, or a callable that will be passed one argument, document (Default value = None).

  • default – A default value to be used when a given document key is not present. The value must be sortable and is only used if not None (Default value = None).

Deprecated since version 1.7: This will be removed in 2.0. Use groupby with a ‘doc.’ filter instead, see https://docs.signac.io/en/latest/query.html#query-namespaces.

next()

Return the next element.

This function is deprecated. Users should use next(iter(...)) instead.

Deprecated since version 0.9.6: This will be removed in 2.0. Use next(iter(…)) instead.

to_dataframe(sp_prefix='sp.', doc_prefix='doc.', usecols=None, flatten=False)

Convert the selection of jobs to a pandas DataFrame.

This function exports the job metadata to a pandas.DataFrame. All state point and document keys are prefixed by default to be able to distinguish them.

Parameters
  • sp_prefix (str, optional) – Prefix state point keys with the given string. Defaults to “sp.”.

  • doc_prefix (str, optional) – Prefix document keys with the given string. Defaults to “doc.”.

  • usecols (list-like or callable, optional) – Used to select a subset of columns. If list-like, must contain strings corresponding to the column names that should be included. For example, ['sp.a', 'doc.notes']. If callable, the column will be included if the function called on the column name returns True. For example, lambda x: 'sp.' in x. Defaults to None, which uses all columns from the state point and document. Note that this filter is applied after the doc and sp prefixes are added to the column names.

  • flatten (bool, optional) – Whether nested state points or document keys should be flattened. If True, {'a': {'b': 'c'}} becomes a column named a.b with value c. If False, it becomes a column named a with value {'b': 'c'}. Defaults to False.

Returns

A pandas DataFrame with all job metadata.

Return type

DataFrame
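
A brief sketch selecting only document columns of a filtered set of jobs; the filter key is an assumption:

import signac

project = signac.get_project()
df = project.find_jobs({'foo': 0}).to_dataframe(
    usecols=lambda col: col.startswith('doc.'),
    flatten=True,  # expand nested keys into columns such as 'doc.a.b'
)
print(df)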

The Job class

Attributes

Job.clear()

Remove all job data, but not the job itself.

Job.close()

Close the job and switch to the previous working directory.

Job.data

Get data associated with this job.

Job.doc

Alias for document.

Job.document

Get document associated with this job.

Job.fn(filename)

Prepend a filename with the job's workspace directory path.

Job.get_id()

Job's state point unique identifier.

Job.id

Get the unique identifier for the job's state point.

Job.init([force])

Initialize the job's workspace directory.

Job.isfile(filename)

Return True if file exists in the job's workspace.

Job.move(project)

Move this job to project.

Job.open()

Enter the job's workspace directory.

Job.remove()

Remove the job's workspace including the job document.

Job.reset()

Remove all job data, but not the job itself.

Job.reset_statepoint(new_statepoint)

Overwrite the state point of this job while preserving job data.

Job.sp

Alias for statepoint.

Job.statepoint

Get or set the job's state point.

Job.stores

Get HDF5 stores associated with this job.

Job.sync(other[, strategy, exclude, doc_sync])

Perform a one-way synchronization of this job with the other job.

Job.update_statepoint(update[, overwrite])

Change the state point of this job while preserving job data.

Job.workspace()

Alias for path.

Job.ws

Alias for path.

class signac.contrib.job.Job(project, statepoint=None, _id=None)

Bases: object

The job instance is a handle to the data of a unique state point.

Application developers should not directly instantiate this class, but use open_job() instead.

Jobs can be opened by statepoint or _id. If both values are provided, it is the user’s responsibility to ensure that the values correspond.

Parameters
  • project (Project) – Project handle.

  • statepoint (dict) – State point for the job. (Default value = None)

  • _id (str) – The job identifier. (Default value = None)

FN_DOCUMENT = 'signac_job_document.json'

The job’s document filename.

FN_MANIFEST = 'signac_statepoint.json'

The job’s state point filename.

The state point file is a human-readable file containing the job’s state point and is stored in each job’s workspace directory.

KEY_DATA = 'signac_data'

The job’s datastore key.

clear()

Remove all job data, but not the job itself.

This function will do nothing if the job was not previously initialized.

See signac rm -c for the command line equivalent.

close()

Close the job and switch to the previous working directory.

property data

Get data associated with this job.

This property should be used for large array-like data, which can’t be stored efficiently in the job document. For examples and usage, see Job Data Storage.

Equivalent to:

return job.stores['signac_data']

Returns

An HDF5-backed datastore.

Return type

H5Store

property doc

Alias for document.

Warning

Even deep copies of doc will modify the same file, so changes will still effectively be persisted between deep copies. If you need a deep copy that will not modify the underlying persistent JSON file, use the call operator to get an equivalent plain dictionary: job.doc().

See signac document for the command line equivalent.

Returns

The job document handle.

Return type

JSONDict

property document

Get document associated with this job.

Warning

Even deep copies of document will modify the same file, so changes will still effectively be persisted between deep copies. If you need a deep copy that will not modify the underlying persistent JSON file, use the call operator to get an equivalent plain dictionary: job.document(). For more information, see JSONDict.

See signac document for the command line equivalent.

Returns

The job document handle.

Return type

JSONDict

fn(filename)

Prepend a filename with the job’s workspace directory path.

Parameters

filename (str) – The name of the file.

Returns

The full workspace path of the file.

Return type

str

get_id()

Job’s state point unique identifier.

Returns

The job id.

Return type

str

Deprecated since version 1.3: This will be removed in 2.0. Use job.id instead.

property id

Get the unique identifier for the job’s state point.

Returns

The job id.

Return type

str

init(force=False)

Initialize the job’s workspace directory.

This function will do nothing if the directory and the job state point already exist and the state point is valid.

Returns the calling job.

See signac job -c for the command line equivalent.

Parameters

force (bool) – Overwrite any existing state point files, e.g., to repair them if they got corrupted (Default value = False).

Returns

The job handle.

Return type

Job

Raises
  • OSError – If the workspace directory cannot be created or any other I/O error occurs when attempting to save the state point file.

  • JobsCorruptedError – If the job state point on disk is corrupted.
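
For example, a minimal sketch; the state point key is an assumption:

import signac

project = signac.get_project()

# init() returns the job handle, so creation and use can be chained.
job = project.open_job({'a': 1}).init()
print(job.path)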

isfile(filename)

Return True if file exists in the job’s workspace.

Parameters

filename (str) – The name of the file.

Returns

True if file with filename exists in workspace.

Return type

bool

move(project)

Move this job to project.

This function will attempt to move this instance of job from its original project to a different project.

See signac move for the command line equivalent.

Parameters

project (Project) – The project to move this job to.

open()

Enter the job’s workspace directory.

You can use the Job class as a context manager:

with project.open_job(my_statepoint) as job:
    pass  # manipulate your job data

Opening the context will switch into the job’s workspace directory; leaving it will switch back to the previous working directory.

property path

The path to the job directory.

See signac job -w for the command line equivalent.

Type

str

remove()

Remove the job’s workspace including the job document.

This function will do nothing if the workspace directory does not exist.

See signac rm for the command line equivalent.

reset()

Remove all job data, but not the job itself.

This function will initialize the job if it was not previously initialized.

reset_statepoint(new_statepoint)

Overwrite the state point of this job while preserving job data.

This method will change the job id if the state point has been altered.

For more information, see Modifying the State Point.

Danger

Use this function with caution! Resetting a job’s state point may sometimes be necessary, but can possibly lead to incoherent data spaces.

Parameters

new_statepoint (dict) – The job’s new state point.

Deprecated since version 1.8: This will be removed in 2.0. Use job.statepoint = new_statepoint instead.

property sp

Alias for statepoint.

property statepoint

Get or set the job’s state point.

Setting the state point to a different value will change the job id.

For more information, see Modifying the State Point.

Warning

The state point object behaves like a dictionary in most cases, but because it persists changes to the filesystem, making a copy requires explicitly converting it to a dict. If you need a modifiable copy that will not modify the underlying JSON file, you can access a dict copy of the state point by calling it, e.g. sp_dict = job.statepoint() instead of sp = job.statepoint. For more information, see JSONAttrDict.

See signac statepoint for the command line equivalent.

Danger

Use this function with caution! Resetting a job’s state point may sometimes be necessary, but can possibly lead to incoherent data spaces.

Returns

Returns the job’s state point.

Return type

dict
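
For example, a short sketch contrasting the live handle with a detached copy (assuming an initialized job handle):

sp_view = job.statepoint    # live, file-backed mapping
sp_copy = job.statepoint()  # detached plain dict
sp_copy['a'] = 42           # modifies only the copy, not the job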

property stores

Get HDF5 stores associated with this job.

Use this property to access an HDF5 file within the job’s workspace directory using the H5Store dict-like interface.

This is an example of accessing an HDF5 file called ‘my_data.h5’ within the job’s workspace:

job.stores['my_data']['array'] = np.random.rand(32, 4)

This is equivalent to:

H5Store(job.fn('my_data.h5'))['array'] = np.random.rand(32, 4)

Both the stores and the H5Store itself support attribute access. The above example could therefore also be expressed as:

job.stores.my_data.array = np.random.rand(32, 4)

Returns

The HDF5-Store manager for this job.

Return type

H5StoreManager

sync(other, strategy=None, exclude=None, doc_sync=None, **kwargs)

Perform a one-way synchronization of this job with the other job.

By default, this method will synchronize all files and document data with the other job to this job until a synchronization conflict occurs. There are two different kinds of synchronization conflicts:

  1. The two jobs have files with the same, but different content.

  2. The two jobs have documents that share keys, but those keys are associated with different values.

A file conflict can be resolved by providing a ‘FileSync’ strategy or by excluding files from the synchronization. An unresolvable conflict is indicated by raising a FileSyncConflict exception.

A document synchronization conflict can be resolved by providing a doc_sync function that takes the source and the destination document as first and second argument.

Parameters
  • other (Job) – The other job to synchronize from.

  • strategy – A synchronization strategy for file conflicts. If no strategy is provided, a SyncConflict exception will be raised upon conflict (Default value = None).

  • exclude (str) – A filename exclude pattern. All files matching this pattern will be excluded from synchronization (Default value = None).

  • doc_sync – A synchronization strategy for document keys. If this argument is None, by default no keys will be synchronized upon conflict.

  • dry_run – If True, do not actually perform the synchronization.

  • **kwargs – Extra keyword arguments will be forwarded to the sync_jobs() function, which actually executes the synchronization operation.

Raises

FileSyncConflict – In case that a file synchronization results in a conflict.

update_statepoint(update, overwrite=False)

Change the state point of this job while preserving job data.

By default, this method will not change existing parameters of the state point of the job.

This method will change the job id if the state point has been altered.

For more information, see Modifying the State Point.

Warning

While appending to a job’s state point is generally safe, modifying existing parameters may lead to data inconsistency. Use the overwrite argument with caution!

Parameters
  • update (dict) – A mapping used for the state point update.

  • overwrite (bool, optional) – If False, an error will be raised if the update modifies the values of existing keys in the state point. If True, any existing keys will be overwritten in the same way as dict.update(). Use with caution! (Default value = False).

Raises
  • KeyError – If the update contains keys which are already part of the job’s state point and overwrite is False.

  • DestinationExistsError – If a job associated with the new state point is already initialized.

  • OSError – If the move failed due to an unknown system related error.
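
For example, a minimal sketch; the state point keys are assumptions:

import signac

project = signac.get_project()
job = project.open_job({'a': 1}).init()

# Appending a new key is safe; existing keys are left untouched.
job.update_statepoint({'b': 2})

# Changing an existing key requires overwrite=True and changes the job id.
job.update_statepoint({'a': 5}, overwrite=True)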

workspace()

Alias for path.

Deprecated since version 1.8: This will be removed in 2.0. Use Job.path instead.

property ws

Alias for path.

Deprecated since version 1.8: This will be removed in 2.0. Use Job.path instead.

The Collection

class signac.Collection(docs=None, primary_key='_id', compresslevel=0, _trust=False)

A collection of documents.

The Collection class manages a collection of documents in memory or in a file on disk. A document is defined as a dictionary mapping of key-value pairs.

An instance of Collection may be used to manage and search documents. For example, given a collection with member data, where each document contains a name entry and an age entry, we can find the names of all members who are 32 years old like this:

members = [
    {'name': 'John',  'age': 32},
    {'name': 'Alice', 'age': 28},
    {'name': 'Kevin', 'age': 32},
    # ...
    ]

member_collection = Collection(members)
for doc in member_collection.find({'age': 32}):
    print(doc['name'])

To iterate over all documents in the collection, use:

for doc in collection:
    print(doc)

By default a collection object will reside in memory. However, it is possible to manage a collection associated to a file on disk. To open a collection which is associated with a file on disk, use the Collection.open() class method:

with Collection.open('collection.txt') as collection:
    for doc in collection.find({'age': 32}):
        print(doc)

The collection file is by default opened in a+ mode, which means it can be read from and written to and will be created if it does not exist yet.

Parameters
  • docs (iterable) – Initialize the collection with these documents.

  • primary_key (str) – The name of the key which serves as the primary index of the collection. Selecting documents by primary key has time complexity of O(N) in the worst case and O(1) on average. All documents must have a primary key value. The default primary key is _id.

  • compresslevel (int) – The level of compression to use. Any positive value implies compression and is used by the underlying gzip implementation. Default value is 0 (no compression).

Raises

ValueError – When the first argument is a string.

clear()

Remove all documents from the collection.

close()

Close this collection instance.

In case that the collection is associated with a file-object, all changes are flushed to the file and the file is closed.

It is not possible to re-open the same collection instance after closing it.

delete_many(filter)

Delete all documents that match the filter.

Parameters

filter (dict) – A document that should be deleted must match this filter.

delete_one(filter)

Delete one document that matches the filter.

Parameters

filter (dict) – The document that should be deleted must match this filter.
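
For example, a brief sketch using the member collection from above:

# Remove a single matching document, then all remaining matches.
member_collection.delete_one({'age': 32})
member_collection.delete_many({'age': 32})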

dump(file=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)

Dump the collection in JSON-encoding to file.

The file argument defaults to sys.stdout, which means the encoded blob will be printed to the screen if no file argument is provided.

For example, to dump to a file on disk, one could write:

with open('my_collection.txt', 'w') as file:
    collection.dump(file)

Parameters

file – The file to write the encoded blob to (Default value = sys.stdout).

find(filter=None, limit=0)

Find all documents matching filter, but not more than limit.

This function searches the collection for all documents that match the given filter and returns a result vector. For example:

for doc in collection.find(my_filter):
    print(doc)

Nested values should be searched using the . operator, for example:

docs = collection.find({'nested.value': 42})

will return documents with a nested structure: {'nested': {'value': 42}}.

The result of find() can be stored and iterated over multiple times. In addition, the result vector can be queried for its size:

docs = collection.find(my_filter)

print(len(docs))    # the number of documents matching

for doc in docs:    # iterate over the result vector
    pass

Arithmetic Operators

  • $eq: equal

  • $ne: not equal

  • $gt: greater than

  • $gte: greater or equal than

  • $lt: less than

  • $lte: less or equal than

project.find({"a": {"$lt": 5})

Matches all docs with a less than 5.

Logical Operators

That includes $and and $or; both expect a list of expressions.

project.find({"$or": [{"a": 4}, {"b": {"$gt": 3}}]})

Matches all docs, where a is 4 or b is greater than 3.

Exists operator

Determines whether a specific key exists, or not, e.g.:

project.find({"a": {"$exists": True}})

Array operator

To determine whether specific elements are in ($in), or not in ($nin) an array, e.g.:

project.find({"a": {"$in": [0, 1, 2]}})

Matches all docs, where a is either 0, 1, or 2. Usage of $nin is equivalent.

Regular expression operator

Allows the “on-the-fly” evaluation of regular expressions, e.g.:

project.find({"protocol": {"$regex": "foo"}})

Will match all docs with a protocol that contains the term ‘foo’.

$type operator

Matches when a value is of a specific type, e.g.:

collection.find({"protocol": {"$type": "str"}})

Finds all docs where the value of protocol is of type str. Other types that can be checked are: int, float, bool, list, and null.

$where operator

Matches an arbitrary python expression, e.g.:

project.find({"foo": {"$where": "lambda x: x.startswith('bar')"}})

Matches all docs, where the value for foo starts with the word ‘bar’.

Parameters
  • filter (dict) – All documents must match the given filter (Default value = None).

  • limit (int) – Do not return more than limit number of documents. A limit value of 0 (the default) means no limit.

Returns

A result object that iterates over all matching documents.

Return type

_CollectionSearchResults

Raises

ValueError – In case that the filter argument is invalid.

find_one(filter=None)

Return one document that matches the filter or None.

doc = collection.find_one(my_filter)
if doc is None:
    print("No result found for filter", my_filter)
else:
    print("Doc matching filter:", my_filter, doc)
Parameters

filter (dict) – The returned document must match the given filter (Default value = None).

Returns

A matching document or None.

Return type

dict

Raises

ValueError – In case that the filter argument is invalid.

flush()

Write all changes to the associated file.

If the collection instance is associated with a file-object, calling the flush() method will write all changes to this file.

This method is also called when the collection is explicitly or implicitly closed.

property ids

Get an iterator over the primary key in the collection.

Returns

iterator over the primary key in the collection.

Return type

iterable

index(key, build=False)

Get (and optionally build) the index for a given key.

An index allows documents to be accessed by a specific key with minimal time complexity, e.g.:

age_index = member_collection.index('age')
for _id in age_index[32]:
    print(member_collection[_id]['name'])

This means we can access documents by the ‘age’ key in O(1) time on average in addition to the primary key. Using the find() method will automatically build all required indexes for the particular search.

Once an index has been built, it will be internally managed by the class and updated with subsequent changes. An index returned by this method is always current with the latest state of the collection.

Parameters
  • key (str) – The primary key of the requested index.

  • build (bool) – If True, build a non-existing index if necessary, otherwise raise KeyError (Default value = False).

Returns

Index for the given key.

Return type

dict

Raises

KeyError – In case the build is False and the index has not been built yet or no index is present for the key.

insert_one(doc)

Insert one document into the collection.

If the document does not have a value for the collection’s primary key yet, it will be assigned one.

_id = collection.insert_one(doc)
assert _id in collection

Note

The document will be updated in place if it has no primary key, and must therefore be mutable!

Parameters

doc (dict) – The document to be inserted.

Returns

The _id of the inserted document.

Return type

str

main()

Start a command line interface for this Collection.

Use this function to interact with this instance of Collection on the command line. For example, executing the following script:

# find.py
with Collection.open('my_collection.txt') as c:
    c.main()

will enable us to search for documents on the command line like this:

$ python find.py '{"age": 32}'
{"name": "John", "age": 32}
{"name": "Kevin", "age": 32}
Raises

ValueError – When both --id and --indent are selected.

classmethod open(filename, mode=None, compresslevel=None)

Open a collection associated with a file on disk.

Using this factory method will return a collection that is associated with a collection file on disk. For example:

with Collection.open('collection.txt') as collection:
    for doc in collection:
        print(doc)

will read all documents from the collection.txt file or create the file if it does not exist yet.

Modifications to the file will be written to the file when the flush() method is called or the collection is explicitly closed by calling the Collection.close() method or implicitly by leaving the with-clause:

with Collection.open('collection.txt') as collection:
    collection.update(my_docs)
# All changes to the collection have been written to collection.txt.

The open-modes work as expected, so for example to open a collection file in read-only mode, use Collection.open('collection.txt', 'r').

Opening a gzip (*.gz) file also works as expected. Because gzip does not support a combined read and write mode, mode=*+ is not available. Be sure to open the file in read, write, or append mode as required. Due to the manner in which gzip works, opening a file in mode=wt will effectively erase the current file, so take care using mode=wt.

Parameters
  • filename (str) – Name of file to read the documents from or create the file if it does not exist.

  • mode (str) – Open the file with mode (Default value = None).

  • compresslevel (int) – The level of compression to use. Any positive value implies compression and is used by the underlying gzip implementation. (Default value = None)

Returns

An instance of Collection.

Return type

Collection

Raises

RuntimeError – When the file open-mode is not None for an in-memory collection, or when a compressed collection is not opened in binary mode.

property primary_key

Get the name of the collection’s primary key (default=’_id’).

classmethod read_json(file=None)

Construct an instance of Collection from a JSON file.

Parameters

file – The json file to read, provided as either a filename or a file-like object (Default value = None).

Returns

A Collection containing the documents read from the JSON file.

Return type

Collection

replace_one(filter, replacement, upsert=False)

Replace one document that matches the given filter.

The first document matching the filter will be replaced by the given replacement document. If the upsert argument is True, the replacement will be inserted in case that no document matches the filter.

Parameters
  • filter (dict) – A document that should be replaced must match this filter.

  • replacement (dict) – The replacement document.

  • upsert (bool) – If True, insert the replacement document in the case that no document matches the filter (Default value = False).

Returns

The id of the replaced (or upserted) document.

Return type

str

Raises

ValueError – In case that the filter argument is invalid.
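
For example, a short sketch, assuming an in-memory Collection with a single hypothetical document:

from signac import Collection

collection = Collection([{'name': 'John', 'age': 32}])

# Replace the first matching document; with upsert=True the replacement
# would be inserted even if no document matched the filter.
_id = collection.replace_one({'name': 'John'}, {'name': 'John', 'age': 33}, upsert=True)
assert collection[_id]['age'] == 33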

to_json(file=None)

Dump the collection as a JSON file.

This function returns the JSON-string directly if the file argument is None.

Parameters

file – The filename or a file-like object to write the JSON string to (Default value = None).

Returns

The JSON string, when no file argument is provided.

Return type

str
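
For example, a minimal sketch of a round trip through a JSON file (the filename is hypothetical):

from signac import Collection

collection = Collection([{'name': 'John', 'age': 32}])

# Serialize to a JSON string when no file argument is given ...
blob = collection.to_json()

# ... or write directly to a file and read it back with read_json().
collection.to_json('my_collection.json')
restored = Collection.read_json('my_collection.json')
assert len(restored.find()) == 1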

update(docs)

Update the collection with these documents.

Any existing documents with the same primary key will be replaced.

Parameters

docs (iterable) – A sequence of documents to be upserted into the collection.
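
For example, a brief sketch, assuming documents that carry explicit _id values (the default primary key):

from signac import Collection

collection = Collection([{'_id': 'a1', 'value': 1}])

# Documents with an existing primary key are replaced, new ones are inserted.
collection.update([{'_id': 'a1', 'value': 2}, {'_id': 'b2', 'value': 3}])
assert collection['a1']['value'] == 2
assert len(collection.find()) == 2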

The JSONDict

This class implements the interface for the job’s statepoint and document attributes, but can also be used stand-alone:

signac.JSONDict

alias of BufferedJSONAttrDict

The H5Store

This class implements the interface to the job’s data attribute, but can also be used stand-alone:

class signac.H5Store(filename, **kwargs)

An HDF5-backed container for storing array-like and dictionary-like data.

The H5Store is a MutableMapping and therefore behaves similar to a dict, but all data is stored persistently in the associated HDF5 file on disk.

Supported types include:

  • built-in types (int, float, str, bool, NoneType, array)

  • numpy arrays

  • pandas data frames (requires pandas and pytables)

  • mappings with values that are supported types

Values can be accessed as attributes (h5s.foo) or via key index (h5s['foo']).

Examples

>>> from signac import H5Store
>>> with H5Store('file.h5') as h5s:
...     h5s['foo'] = 'bar'
...     assert 'foo' in h5s
...     assert h5s.foo == 'bar'
...     assert h5s['foo'] == 'bar'
>>>

The H5Store can be used as a context manager to ensure that the underlying file is opened; however, most built-in types (excluding arrays) can be read and stored without the need to explicitly open the file. To access arrays (reading or writing), the file must always be opened!

To open a file in read-only mode, use the open() method with mode='r':

>>> with H5Store('file.h5').open(mode='r') as h5s:
...     pass
>>>
Parameters
  • filename (str) – The filename of the underlying HDF5 file.

  • **kwargs – Additional keyword arguments to be forwarded to the h5py.File constructor. See the documentation for the h5py.File constructor for more information.

clear()

Remove all data from this store.

Danger

All data will be removed, this action cannot be reversed!

close()

Close the underlying HDF5 file.

property file

Access the underlying instance of h5py.File.

This property exposes the underlying h5py.File object enabling use of functions such as create_dataset() or requires_dataset().

Note

The store must be open to access this property!

Returns

The h5py file-object that this store is operating on.

Return type

h5py.File

Raises

H5StoreClosedError – When the store is closed at the time of accessing this property.
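
For example, a minimal sketch, assuming that h5py and numpy are installed and that 'file.h5' is writable (the dataset name is hypothetical):

import numpy as np
from signac import H5Store

with H5Store('file.h5').open() as h5s:
    # Use the underlying h5py.File handle directly, e.g. to create a dataset.
    h5s.file.create_dataset('coords', data=np.zeros((10, 3)))
    assert h5s.file['coords'].shape == (10, 3)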

property filename

Return the H5Store filename.

flush()

Flush the underlying HDF5 file.

get(k[, d]) → D[k] if k in D, else d.  d defaults to None.
items() → a set-like object providing a view on D's items
keys() → a set-like object providing a view on D's keys
property mode

Return the default opening mode of this H5Store.

open(mode=None)

Open the underlying HDF5 file.

Parameters

mode – The file open mode to use. Defaults to ‘a’ (append).

Returns

This H5Store instance.

pop(k[, d]) → v, remove specified key and return the corresponding value.

If the key is not found, d is returned if given, otherwise KeyError is raised.

popitem() → (k, v), remove and return some (key, value) pair as a 2-tuple; raise KeyError if D is empty.

setdefault(key, value)

Set a value for a key if that key is not already set.

update([E, ]**F) → None.  Update D from mapping/iterable E and F.

If E is present and has a .keys() method, does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, does: for (k, v) in E: D[k] = v. In either case, this is followed by: for k, v in F.items(): D[k] = v.

values() → an object providing a view on D's values

The H5StoreManager

This class implements the interface to the job’s stores attribute, but can also be used stand-alone:

class signac.H5StoreManager(prefix)

Bases: DictManager

Helper class to manage multiple instances of H5Store within a directory.

Example (assuming that the ‘stores/’ directory exists):

>>> stores = H5StoreManager('stores/')
>>> stores.data
<H5Store(filename=stores/data.h5)>
>>> stores.data.foo = True
>>> dict(stores.data)
{'foo': True}
Parameters

prefix – The directory prefix shared by all stores managed by this class.

cls

alias of H5Store

keys()

Return an iterable of keys.

property prefix

Return the prefix.

Top-level functions

The signac framework aids in the management of large and heterogeneous data spaces.

It provides a simple and robust data model to create a well-defined, indexable storage layout for data and metadata. This makes it easier to operate on large data spaces, streamlines post-processing and analysis, and makes data collectively accessible.

signac.TemporaryProject(name=None, cls=None, **kwargs)

Context manager for the generation of a temporary project.

This is a factory function that creates a Project within a temporary directory and must be used as context manager, for example like this:

with TemporaryProject() as tmp_project:
    tmp_project.import_from('/data')
Parameters
  • name (str) – An optional name for the temporary project. Defaults to a unique random string.

  • cls – The class of the temporary project. Defaults to Project.

  • **kwargs – Optional keyword arguments that are forwarded to the TemporaryDirectory class constructor, which is used to create a temporary root directory.

Yields

Project – An instance of Project.

signac.buffered(buffer_size=None, force_write=None, *args, **kwargs)

Enter context to buffer all operations for this backend.

Parameters
  • buffer_size (int) – The capacity of the buffer to use within this context (resets after the context is exited).

  • force_write (bool) – This argument does nothing and is only present for compatibility with signac 1.x.
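
For example, a minimal sketch, assuming an initialized project whose jobs carry a document (the 'status' key is hypothetical):

import signac

project = signac.get_project()

# All document writes inside this context are buffered and flushed on exit.
with signac.buffered():
    for job in project:
        job.doc.status = 'pending'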

signac.diff_jobs(*jobs)

Find differences among a list of jobs’ state points.

The resulting diff is a dictionary where the keys are job ids and the values are each job’s state point minus the intersection of all provided jobs’ state points. The comparison is performed over the combined set of keys and values.

See signac diff for the command line equivalent.

Parameters

*jobs (sequence[Job]) – Sequence of jobs to diff.

Returns

A dictionary where the keys are job ids and values are the unique parts of that job’s state point.

Return type

dict

Examples

>>> import signac
>>> project = signac.init_project()
>>> job1 = project.open_job({'constant': 42, 'diff1': 0, 'diff2': 1}).init()
>>> job2 = project.open_job({'constant': 42, 'diff1': 1, 'diff2': 1}).init()
>>> job3 = project.open_job({'constant': 42, 'diff1': 2, 'diff2': 2}).init()
>>> print(job1)
c4af2b26f1fd256d70799ad3ce3bdad0
>>> print(job2)
b96b21fada698f8934d58359c72755c0
>>> print(job3)
e4289419d2b0e57e4852d44a09f167c0
>>> signac.diff_jobs(job1, job2, job3)
{'c4af2b26f1fd256d70799ad3ce3bdad0': {'diff2': 1, 'diff1': 0},
'b96b21fada698f8934d58359c72755c0': {'diff2': 1, 'diff1': 1},
'e4289419d2b0e57e4852d44a09f167c0': {'diff2': 2, 'diff1': 2}}
>>> signac.diff_jobs(*project)
{'c4af2b26f1fd256d70799ad3ce3bdad0': {'diff2': 1, 'diff1': 0},
'b96b21fada698f8934d58359c72755c0': {'diff2': 1, 'diff1': 1},
'e4289419d2b0e57e4852d44a09f167c0': {'diff2': 2, 'diff1': 2}}

signac.export(docs, index, mirrors=None, update=False, num_tries=3, timeout=60, **kwargs)

Export docs to index and optionally associated files to mirrors.

The behavior of this function is equivalent to:

for doc in docs:
    export_one(doc, index, mirrors, num_tries)

If the update argument is set to True, the export algorithm will automatically identify stale index documents, that is, documents that refer to files or state points that have been removed and are no longer part of the data space. Any document that shares the root, but not the _id field, with any of the updated documents is considered stale and removed. Using update in combination with an empty docs sequence will raise ExportError, since it is not possible to identify stale documents in that case.

Note

This function will automatically delegate to specialized implementations for special index types. For example, if the index argument is a MongoDB document collection, the index documents will be exported via export_pymongo().

Parameters
  • docs – The index documents to export.

  • index – The collection to export the index to.

  • mirrors – An optional set of mirrors to export files to.

  • update (bool) – If True, remove stale index documents, that is, documents that refer to files or state points that no longer exist.

  • num_tries (int) – The number of automatic retry attempts in case of mirror connection errors.

  • timeout (int) – The time in seconds to wait before an automatic retry attempt.

  • kwargs – Optional keyword arguments to pass to delegate implementations.

Raises

ExportError – When using the update argument in combination with an empty docs sequence.

Deprecated since version 1.3: This will be removed in 2.0. The indexing module is deprecated.

signac.export_one(doc, index, mirrors=None, num_tries=3, timeout=60)

Export one document to index and an optionally associated file to mirrors.

Parameters
  • doc – A document with a file_id entry.

  • index – The index collection to export to.

  • mirrors – An optional set of mirrors to export files to.

  • num_tries (int) – The number of automatic retry attempts in case of mirror connection errors.

  • timeout (int) – The time in seconds to wait before an automatic retry attempt.

Returns

The id and file id after successful export.

Deprecated since version 1.3: This will be removed in 2.0. The indexing module is deprecated.

signac.export_pymongo(docs, index, mirrors=None, update=False, num_tries=3, timeout=60, chunksize=100)

Optimized export() function for pymongo index collections.

The behavior of this function is roughly equivalent to:

for doc in docs:
    export_one(doc, index, mirrors, num_tries)

Note

All index documents must be JSON-serializable to be able to be exported to a MongoDB collection.

Parameters
  • docs – The index documents to export.

  • index (pymongo.collection.Collection) – The database collection to export the index to.

  • num_tries (int) – The number of automatic retry attempts in case of mirror connection errors.

  • timeout (int) – The time in seconds to wait before an automatic retry attempt.

  • chunksize (int) – The buffer size for export operations.

Deprecated since version 1.3: This will be removed in 2.0. The indexing module is deprecated.

signac.export_to_mirror(doc, mirror, num_tries=3, timeout=60)

Export a file associated with doc to mirror.

Parameters
  • doc – A document with a file_id entry.

  • mirror – A file-system object to export the file to.

  • num_tries (int) – The number of automatic retry attempts in case of mirror connection errors.

  • timeout (int) – The time in seconds to wait before an automatic retry attempt.

Returns

The file id after successful export.

Deprecated since version 1.3: This will be removed in 2.0. The indexing module is deprecated.

signac.fetch(doc_or_id, mode='r', mirrors=None, num_tries=3, timeout=60, ignore_local=False)

Fetch the file associated with this document or file id.

This function retrieves a file associated with the provided index document or file id and behaves like the built-in open() function, e.g.:

for doc in index:
    with signac.fetch(doc) as file:
        do_something_with(file)
Parameters
  • doc_or_id – A file_id or a document with a file_id value.

  • mode – Mode to use for opening files.

  • mirrors – An optional set of mirrors to fetch the file from.

  • num_tries (int) – The number of automatic retry attempts in case of mirror connection errors.

  • timeout (int) – The time in seconds to wait before an automatic retry attempt.

Returns

The file associated with the document or file id.

Return type

A file-like object

Deprecated since version 1.3: This will be removed in 2.0. The indexing module is deprecated.

signac.flush()

Execute all deferred JSONDict write operations.

Deprecated since version 1.7: This will be removed in 2.0.

signac.get_buffer_load()

Get the total amount of data currently stored in the buffer.

Returns

The size of all data contained in the buffer in the appropriate units for a particular buffering implementation.

Return type

int

signac.get_buffer_size()

Get the current buffer capacity.

Returns

The amount of data that can be stored before a flush is triggered in the appropriate units for a particular buffering implementation.

Return type

int

signac.get_database(name, hostname=None, config=None)

Get a database handle.

The database handle is an instance of Database, which provides access to the document collections within one database.

db = signac.db.get_database('MyDatabase')
docs = db.my_collection.find()

Please note that a collection that did not exist at the point of access will be created automatically.

Parameters
  • name (str) – The name of the database to get.

  • hostname (str) – The name of the configured host. Defaults to the first configured host, or the host specified by default_host.

  • config (common.config.Config) – The config object to retrieve the host configuration from. Defaults to the global configuration.

Returns

The database handle.

Return type

pymongo.database.Database

Deprecated since version 1.3: This will be removed in 2.0. The database package is deprecated.

signac.get_job(root=None)

Find a Job in or above the current working directory (or provided path).

Parameters

root (str) – The job root directory. If no root directory is given, the current working directory is assumed to be within the current job workspace directory (Default value = None).

Returns

Job handle.

Return type

Job

Raises

LookupError – If this job cannot be found.

Examples

When the current directory is a job workspace directory:

>>> signac.get_job()
signac.contrib.job.Job(project=..., statepoint={...})

signac.get_project(root=None, search=True, **kwargs)

Find a project configuration and return the associated project.

Parameters
  • root (str) – The starting point to search for a project, defaults to the current working directory.

  • search (bool) – If True, search for project configurations inside and above the specified root directory, otherwise only return projects with a root directory identical to the specified root argument (Default value = True).

  • **kwargs – Optional keyword arguments that are forwarded to get_project().

Returns

An instance of Project.

Return type

Project

Raises

LookupError – If no project configuration can be found.
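
For example, a short sketch (the explicit root path is a placeholder):

import signac

# Load the project containing the current working directory ...
project = signac.get_project()

# ... or load a project rooted at an explicit path without searching upward.
project = signac.get_project(root='/path/to/project', search=False)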

signac.index(root='.', tags=None, depth=0, **kwargs)

Generate a main index.

A main index is compiled from other indexes by searching for modules named signac_access.py and compiling all indexes which are yielded from a function get_indexes(root) defined within that module as well as the indexes generated by crawlers yielded from a function get_crawlers(root) defined within that module.

This is a minimal example for a signac_access.py file:

import signac

def get_indexes(root):
    yield signac.index_files(root, r'.*\.txt')

Internally, this function constructs an instance of MainCrawler and all extra key-word arguments will be forwarded to the constructor of said main crawler.

Parameters
  • root (str) – Look for access modules under this directory path.

  • tags – If tags are provided, do not execute subcrawlers that don’t match the same tags.

  • depth (int) – Limit the search to the specified directory depth.

  • kwargs – These keyword-arguments are forwarded to the internal MainCrawler instance.

Yields

The main index documents as instances of dict.

Deprecated since version 1.3: This will be removed in 2.0. The indexing module is deprecated.

signac.index_files(root='.', formats=None, depth=0)

Generate a file index.

This generator function yields file index documents, where each index document corresponds to one file.

To index all files in the current working directory, simply execute:

for doc in signac.index_files():
    print(doc)

A file associated with a file index document can be fetched via the fetch() function:

for doc in signac.index_files():
    with signac.fetch(doc) as file:
        print(file.read())

This is especially useful if the file index is part of a collection (Collection) which can be searched for specific entries.

To limit the file index to files with a specific filename format, provide a regular expression as the formats argument. For example, to index all files ending in .txt, execute:

for doc in signac.index_files(formats=r'.*\.txt'):
    print(doc)

We can associate specific formats with filename patterns by providing a dictionary as the formats argument, where the key is the filename pattern and the value is an arbitrary format string, e.g.:

for doc in signac.index_files(formats=
    {r'.*\.txt': 'TextFile', r'.*\.zip': 'ZipFile'}):
    print(doc)
Parameters
  • root (str) – The directory to index, defaults to the current working directory.

  • formats – Limit the index to files that match the given regular expression and optionally associate formats with given patterns.

  • depth (int) – Limit the search to the specified directory depth.

Yields

The file index documents as dicts.

Deprecated since version 1.3: This will be removed in 2.0. The indexing module is deprecated.

signac.init_project(name=None, root=None, workspace=None, make_dir=True)

Initialize a project.

It is safe to call this function multiple times with the same arguments. However, a RuntimeError is raised if an existing project configuration would conflict with the provided initialization parameters.

Parameters
  • name (str, optional) – The name of the project to initialize.

  • root (str, optional) – The root directory for the project. Defaults to the current working directory.

  • workspace (str, optional) – The workspace directory for the project. Defaults to a subdirectory workspace in the project root.

  • make_dir (bool, optional) – Create the project root directory, if it does not exist yet (Default value = True).

Returns

The initialized project instance.

Return type

Project

Raises

RuntimeError – If the project root path already contains a conflicting project configuration.
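
For example, a minimal sketch (the root and workspace paths are placeholders):

import signac

# Initialize (or re-open) a project in the current working directory.
project = signac.init_project()

# Initialize a project at an explicit root with a custom workspace directory.
project = signac.init_project(root='/path/to/project', workspace='my_workspace')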

signac.is_buffered()

Check if this backend is currently buffered.

signac.set_buffer_size(new_capacity)

Update the buffer capacity.

Parameters

new_capacity (int) – The new capacity of the buffer in the appropriate units for a particular buffering implementation.
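
For example, a brief sketch, assuming that the default buffering implementation measures capacity in bytes:

import signac

if not signac.is_buffered():
    signac.set_buffer_size(32 * 1024 * 1024)  # e.g. a 32 MiB capacity

print(signac.get_buffer_size(), signac.get_buffer_load())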

Submodules

signac.cite module

Functions to support citing this software.

signac.cite.bibtex(file=None)

Generate bibtex entries for signac.

The bibtex entries will be printed to screen unless a filename or a file-like object is provided, in which case they will be written to the corresponding file.

Note

A full reference should also include the version of this software. Please refer to the documentation on how to cite a specific version.

Parameters

file – A str or file-like object. Defaults to sys.stdout.

Deprecated since version 1.8: This will be removed in 2.0. The cite module is deprecated.

signac.cite.reference(file=None)

Generate formatted reference entries for signac.

The references will be printed to screen unless a filename or a file-like object is provided, in which case they will be written to the corresponding file.

Note

A full reference should also include the version of this software. Please refer to the documentation on how to cite a specific version.

Parameters

file – A str or file-like object. Defaults to sys.stdout.

Deprecated since version 1.8: This will be removed in 2.0. The cite module is deprecated.

signac.sync module

Synchronization of jobs and projects.

Jobs may be synchronized by copying all data from the source job to the destination job. This means all files are copied and the documents are synchronized. Conflicts, that is, cases in which both jobs contain conflicting data, may be resolved with a user-defined strategy.

The synchronization of projects is in essence the synchronization of all jobs in the destination project with the ones in the source project, plus the synchronization of the project document. If a specific job does not exist yet at the destination, it is simply cloned; otherwise it is synchronized.

A sync strategy is a function (or functor) that takes the source job, the destination job, and the name of the file generating the conflict as arguments, and returns a Boolean decision on whether to overwrite the file. Several default strategies are defined within this module as part of the FileSync class:

  1. always – Always overwrite on conflict.

  2. never – Never overwrite on conflict.

  3. update – Overwrite when the modification time of the source file is newer.

  4. Ask – Ask the user interactively about each conflicting filename.

For example, to synchronize two projects resolving conflicts by modification time, use:

dest_project.sync(source_project, strategy=sync.FileSync.update)

Unlike files, which are always either overwritten as a whole or not at all, documents can be synchronized in a more fine-grained manner with a sync function. Such a function (or functor) takes the source and the destination document as arguments and performs the synchronization. The user is encouraged to implement their own sync functions, but there are a few default functions implemented as part of the DocSync class:

  1. NO_SYNC – Do not perform any synchronization.

  2. COPY – Apply the same strategy used to resolve file conflicts.

  3. update – Equivalent to dst.update(src).

  4. ByKey – Synchronize the source document key by key, more information below.

This is how we could synchronize two jobs, where the documents are synchronized with a simple update function:

dst_job.sync(src_job, doc_sync=sync.DocSync.update)

The DocSync.ByKey functor attempts to synchronize the destination document with the source document without overwriting any data. That means this function behaves similarly to update() for a non-intersecting set of keys, but in addition will preserve nested mappings without overwriting values. In addition, any key conflict, that is, keys that are present in both documents but have differing data, will raise a DocumentSyncConflict exception. The user may explicitly decide to overwrite certain keys by providing a “key-strategy”, which is a function that takes the conflicting key as argument and returns a Boolean decision on whether to overwrite that specific key. For example, to sync two jobs, where conflicting keys should only be overwritten if they contain the term ‘foo’, we could execute:

dst_job.sync(src_job, doc_sync=sync.DocSync.ByKey(lambda key: 'foo' in key))

This means that all documents are synchronized ‘key-by-key’ and only conflicting keys that contain the word “foo” will be overwritten; any other conflict would raise a DocumentSyncConflict exception. A key-strategy may also be a regular expression, so the synchronization above could also be achieved with:

dst_job.sync(src_job, doc_sync=sync.DocSync.ByKey('foo'))

class signac.sync.DocSync

Bases: object

Collection of document synchronization functions.

class ByKey(key_strategy=None)

Bases: object

Synchronize documents key by key.

COPY = 'copy'

Copy (and potentially overwrite) documents like any other file.

NO_SYNC = False

Do not synchronize documents.

static update(src, dst)

Perform a simple update.

class signac.sync.FileSync

Bases: object

Collection of file synchronization strategies.

class Ask

Bases: object

Resolve sync conflicts by asking whether a file should be overwritten interactively.

static always(src, dst, fn)

Resolve sync conflicts by always overwriting.

classmethod keys()

Return keys.

static never(src, dst, fn)

Resolve sync conflicts by never overwriting.

static update(src, dst, fn)

Resolve sync conflicts based on newest modified timestamp.

signac.sync.sync_jobs(src, dst, strategy=None, exclude=None, doc_sync=None, recursive=False, follow_symlinks=True, preserve_permissions=False, preserve_times=False, preserve_owner=False, preserve_group=False, deep=False, dry_run=False)

Synchronize the dst job with the src job.

By default, this method will synchronize all files and document data of dst job with the src job until a synchronization conflict occurs. There are two different kinds of synchronization conflicts:

  1. The two jobs have files with the same name, but different content.

  2. The two jobs have documents that share keys, but those keys are mapped to different values.

A file conflict can be resolved by providing a ‘FileSync’ strategy or by excluding files from the synchronization. An unresolvable conflict is indicated with the raise of a FileSyncConflict exception.

A document synchronization conflict can be resolved by providing a doc_sync function that takes the source and the destination document as first and second argument.

Parameters
  • src (Job) – The src job, data will be copied from this job’s workspace.

  • dst (Job) – The dst job, data will be copied to this job’s workspace.

  • strategy (callable) – A synchronization strategy for file conflicts. The strategy should be a callable with signature strategy(src, dst, filepath) where src and dst are the source and destination instances of Job and filepath is the filepath relative to the project root. If no strategy is provided, an errors.SyncConflict exception will be raised upon conflict. (Default value = None)

  • exclude (str) – A filename exclusion pattern. All files matching this pattern will be excluded from the synchronization process. (Default value = None)

  • doc_sync (attribute or callable from DocSync) – A synchronization strategy for document keys. The default is to use a safe key-by-key strategy that will not overwrite any values on conflict, but instead raises a DocumentSyncConflict exception.

  • recursive (bool) – Recursively synchronize sub-directories encountered within the job workspace directories. (Default value = False)

  • follow_symlinks (bool) – Follow and copy the target of symbolic links. (Default value = True)

  • preserve_permissions (bool) – Preserve file permissions (Default value = False)

  • preserve_times (bool) – Preserve file modification times (Default value = False)

  • preserve_owner (bool) – Preserve file owner (Default value = False)

  • preserve_group (bool) – Preserve file group ownership (Default value = False)

  • dry_run (bool) – If True, do not actually perform any synchronization operations. (Default value = False)

  • deep (bool) – (Default value = False)
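
For example, a minimal sketch, assuming an initialized project and two hypothetical jobs:

from signac import get_project, sync

project = get_project()
src_job = project.open_job({'a': 0}).init()
dst_job = project.open_job({'a': 1}).init()

# Copy data from src_job to dst_job, overwriting files whose source copy is
# newer and recursing into subdirectories of the workspaces.
sync.sync_jobs(src_job, dst_job, strategy=sync.FileSync.update, recursive=True)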

signac.sync.sync_projects(source, destination, strategy=None, exclude=None, doc_sync=None, selection=None, check_schema=True, recursive=False, follow_symlinks=True, preserve_permissions=False, preserve_times=False, preserve_owner=False, preserve_group=False, deep=False, dry_run=False, parallel=False, collect_stats=False)

Synchronize the destination project with the source project.

Try to clone all jobs from the source to the destination. If the destination job already exists, try to synchronize the job using the optionally specified strategy.

Parameters
  • source (Project) – The project presenting the source for synchronization.

  • destination (Project) – The project that is modified for synchronization.

  • strategy (callable) – A synchronization strategy for file conflicts. The strategy should be a callable with signature strategy(src, dst, filepath) where src and dst are the source and destination instances of Project and filepath is the filepath relative to the project root. If no strategy is provided, an errors.SyncConflict exception will be raised upon conflict. (Default value = None)

  • exclude (str) – A filename exclusion pattern. All files matching this pattern will be excluded from the synchronization process. (Default value = None)

  • doc_sync (attribute or callable from DocSync) – A synchronization strategy for document keys. The default is to use a safe key-by-key strategy that will not overwrite any values on conflict, but instead raises a DocumentSyncConflict exception.

  • selection (sequence of Job or job ids (str)) – Only synchronize the given selection of jobs. (Default value = None)

  • check_schema (bool) – If True, only synchronize if this and the other project have a matching state point schema. See also: detect_schema(). (Default value = True)

  • recursive (bool) – Recursively synchronize sub-directories encountered within the job workspace directories. (Default value = False)

  • follow_symlinks (bool) – Follow and copy the target of symbolic links. (Default value = True)

  • preserve_permissions (bool) – Preserve file permissions (Default value = False)

  • preserve_times (bool) – Preserve file modification times (Default value = False)

  • preserve_owner (bool) – Preserve file owner (Default value = False)

  • preserve_group (bool) – Preserve file group ownership (Default value = False)

  • dry_run (bool) – If True, do not actually perform the synchronization operation, just log what would happen theoretically. Useful to test synchronization strategies without the risk of data loss. (Default value = False)

  • deep (bool) – (Default value = False)

  • parallel (bool) – (Default value = False)

  • collect_stats (bool) – (Default value = False)

Returns

Returns stats if collect_stats is True, else None.

Return type

NoneType or FileTransferStats

Raises
  • DocumentSyncConflict – If there are conflicting keys within the project or job documents that cannot be resolved with the given strategy or if there is no strategy provided.

  • FileSyncConflict – If there are differing files that cannot be resolved with the given strategy or if no strategy is provided.

  • SchemaSyncConflict – In case that the check_schema argument is True and the detected state point schema of this and the other project differ.
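
For example, a brief sketch (the project root paths are placeholders):

from signac import get_project, sync

source = get_project(root='/path/to/source')
destination = get_project(root='/path/to/destination')

# Preview the synchronization without modifying anything, collecting stats.
stats = sync.sync_projects(
    source, destination,
    strategy=sync.FileSync.never,
    dry_run=True,
    collect_stats=True,
)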

signac.warnings module

Module for signac deprecation warnings.

exception signac.warnings.SignacDeprecationWarning

Bases: UserWarning

Indicates the deprecation of a signac feature, API or behavior.

This class indicates a user-relevant deprecation and is therefore a UserWarning, not a DeprecationWarning which is hidden by default.

signac.errors module

Errors raised by signac.

exception signac.errors.AuthenticationError

Bases: Error, RuntimeError

Authentication error.

exception signac.errors.BufferException

Bases: Error

An exception occurred in buffered mode.

exception signac.errors.BufferedFileError(files)

Bases: BufferException

Raised when an error occurred while flushing one or more buffered files.

files

A dictionary of files that caused issues during the flush operation, mapped to a possible reason for the issue or None in case that it cannot be determined.

exception signac.errors.ConfigError

Bases: Error, RuntimeError

Error with parsing or reading a configuration file.

exception signac.errors.DestinationExistsError(destination)

Bases: Error, RuntimeError

The destination for a move or copy operation already exists.

Parameters

destination (str) – The destination causing the error.

exception signac.errors.DocumentSyncConflict(keys)

Bases: SyncConflict

Raised when a synchronization operation fails due to a document conflict.

keys

The keys that caused the conflict.

exception signac.errors.Error

Bases: Exception

Base class used for signac Errors.

exception signac.errors.ExportError

Bases: Error, RuntimeError

Error exporting documents to a mirror.

exception signac.errors.FetchError

Bases: FileNotFoundError

Error in fetching data.

exception signac.errors.FileSyncConflict(filename)

Bases: SyncConflict

Raised when a synchronization operation fails due to a file conflict.

filename

The filename of the file that caused the conflict.

exception signac.errors.IncompatibleSchemaVersion

Bases: Error

The project’s schema version is incompatible with this version of signac.

exception signac.errors.InvalidKeyError

Bases: ValueError

Raised when a user uses a non-conforming key.

exception signac.errors.JobsCorruptedError(job_ids)

Bases: Error, RuntimeError

The state point manifest file of one or more jobs cannot be opened or is corrupted.

Parameters

job_ids – The job id(s) of the corrupted job(s).

exception signac.errors.KeyTypeError

Bases: TypeError

Raised when a user uses a key of invalid type.

exception signac.errors.SchemaSyncConflict(schema_src, schema_dst)

Bases: SyncConflict

Raised when a synchronization operation fails due to schema differences.

exception signac.errors.StatepointParsingError

Bases: Error, RuntimeError

Indicates an error that occurred while trying to identify a state point.

exception signac.errors.SyncConflict

Bases: Error, RuntimeError

Raised when a synchronization operation fails.

exception signac.errors.WorkspaceError(error)

Bases: Error, OSError

Raised when there is an issue creating or accessing the workspace.

Parameters

error – The underlying error causing this issue.

synced_collections package

Data Types

synced_collections.synced_collection module

Implement the SyncedCollection class.

class signac.synced_collections.data_types.synced_collection.SyncedCollection(parent=None, *args, **kwargs)

Bases: Collection

An abstract Collection type that is synced with a backend.

This class extends collections.abc.Collection and adds a number of abstract internal methods that must be implemented by its subclasses. These methods can be split into two groups of functions that are designed to be implemented by separate subtrees in the inheritance hierarchy that can then be composed:

Concrete Collection Types

These subclasses should implement the APIs for specific types of collections. For instance, a list-like SyncedCollection should implement the standard methods for sequences. In addition, they must implement the abstract methods defined by SyncedCollection.

Backend

These subclasses encode the process by which in-memory data is converted into a representation suitable for a particular backend. For instance, a JSON backend should know how to save a Python object into a JSON-encoded file and then read that object back.

Since these functionalities are effectively completely orthogonal, members of a given group should be interchangeable. For instance, a dict-like SyncedCollection can be combined equally easily with JSON, MongoDB, or SQL backends.

Validation

Due to the restrictions of a particular backend or the needs of a particular application, synced collections may need to restrict the data that they can store. Validators provide a standardized mechanism for this. A validator is a callable that parses any data added to a SyncedCollection and raises an Exception if any invalid data is provided. Validators cannot modify the data and should have no side effects. They are purely provided as a mechanism to reject invalid data. For example, a JSON validator would raise Exceptions if it detected non-string keys in a dict.

Since SyncedCollection is designed for extensive usage of inheritance, validators may be inherited by subclasses. There are two attributes that subclasses of SyncedCollection can define to control the validators used:

  • _validators: A list of callables that will be inherited by all subclasses.

  • _all_validators: A list of callables that will be used to validate this class, and this class alone.

When a SyncedCollection subclass is initialized (note that this is at class definition time, not when instances are created), its _register_validators() method will be called. If this class defines an _all_validators attribute, this set of validators will be used by all instances of this class. Otherwise, _register_validators() will traverse the MRO and collect the _validators attributes from all parents of a class, and store these in the _all_validators attribute for the class.
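
As a purely illustrative sketch, a subclass could declare inherited validators like this; the validator function and class name below are hypothetical, and a usable subclass would additionally have to provide a concrete backend:

from signac.synced_collections.data_types.synced_dict import SyncedDict

def _require_string_keys(data):
    # Hypothetical validator: reject mappings with non-string keys.
    if isinstance(data, dict):
        for key in data:
            if not isinstance(key, str):
                raise TypeError(f"Non-string key: {key!r}")

class ValidatedDict(SyncedDict):
    # Validators listed in _validators are inherited by subclasses;
    # _register_validators() collects them along the MRO at class definition time.
    _validators = [_require_string_keys]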

Note

Typically, a synced collection will be initialized with resource information, and data will be pulled from that resource. However, initializing with both data and resource information is a valid use case. In this case, the initial data will be validated by the standard validators, however, it will not be checked against the contents stored in the synced resource and is assumed to be consistent. This constructor pattern can be useful to avoid unnecessary resource accesses.

Thread safety

Whether or not SyncedCollection objects are thread-safe depends on the implementation of the backend. Thread-safety of SyncedCollection objects is predicated on backends providing an atomic write operation. All concrete collection types use mutexes to guard against concurrent write operations, while allowing read operations to happen freely. The validity of this mode of access depends on the write operations of a SyncedCollection being atomic, specifically the _save_to_resource() method. Whether or not a particular subclass of SyncedCollection is thread-safe should be indicated by that subclass setting the _supports_threading class variable to True. This variable is set to False by SyncedCollection, so subclasses must explicitly opt-in to support threading by setting this variable to True.

Backends that support multithreaded execution will have multithreaded support turned on by default. This support can be enabled or disabled using the enable_multithreading() and disable_multithreading() methods. enable_multithreading() will raise a ValueError if called on a class that does not support multithreading.

Parameters

parent (SyncedCollection, optional) – A parent instance of SyncedCollection or None. If None, the collection owns its own data; otherwise, it is nested within its parent. Every SyncedCollection either owns its own data or has a parent (Default value = None).

_LoadSaveType

alias of _LoadAndSave

abstract property _backend

The backend associated with a given collection.

This property is abstract to enforce that subclasses implement it. Since it’s only internal, subclasses can safely override it with just a raw attribute; this property just serves as a way to enforce the abstract API for subclasses.

Type

str

classmethod _from_base(data, **kwargs)

Dynamically resolve the type of object to the corresponding synced collection.

This method assumes that data has already been validated. This assumption can always be met, since this method should only be called internally by other methods that modify the internal collection data. While this requirement does require that all calling methods be responsible for validation, it confers significant performance benefits because it can instruct any invoked class constructors not to validate, which is especially important for nested collections.

Parameters
  • data (Collection) – Data to be converted from base type.

  • **kwargs – Any keyword arguments to pass to the collection constructor.

Returns

Synced object of corresponding base type.

Return type

Collection

Notes

This method relies on the internal registry of subclasses populated by __init_subclass__() and the is_base_type() method to determine the subclass with the appropriate backend and data type. Once an appropriate type is determined, that class’s constructor is called. Since this method relies on the constructor and other methods, it can be concretely implemented here rather than requiring subclass implementations.

_load()

Load the data from the backend.

This method encodes the recursive logic required to handle the loading of nested collections. For a collection contained within another collection, only the root is ever responsible for loading the data. This method handles the appropriate recursive calls, then farms out the actual reading to the abstract method _load_from_resource().

abstract _load_from_resource()

Load data from underlying backend.

This method must be implemented for each backend. Backends may choose to return None, signaling that no modification should be performed on the data in memory. This mode is useful for backends where the underlying resource (e.g. a file) may not initially exist, but can be transparently created on save.

Returns

An equivalent unsynced collection satisfying is_base_type() that contains the data in the underlying resource (e.g. a file).

Return type

Collection or None

classmethod _register_validators()

Register all inherited validators to this class.

This method is called by __init_subclass__ when subclasses are created to control what validators will be applied to data added to instances of that class. By default, the _all_validators class variable defined on the class itself determines the validation rules for that class. If that variable is not defined, then all parents of the class are searched, and a list of validators is constructed by concatenating the _validators class variable for each parent class that defines it.

_save()

Save the data to the backend.

This method encodes the recursive logic required to handle the saving of nested collections. For a collection contained within another collection, only the parent is ever responsible for storing the data. This method handles the appropriate recursive calls, then farms out the actual writing to the abstract method _save_to_resource().

abstract _save_to_resource()

Save data to the backend.

This method must be implemented for each backend.

abstract _to_base()

Dynamically resolve the synced collection to the corresponding base type.

This method should not load the data from the underlying resource; it should simply convert the current in-memory representation of a synced collection to its naturally corresponding unsynced collection type.

Returns

An equivalent unsynced collection satisfying is_base_type().

Return type

Collection

abstract _update(data)

Update the in-memory representation to match the provided data.

The purpose of this method is to update the SyncedCollection to match the data in the underlying resource. The result of calling this method should be that self == data. The reason that this method is necessary is that SyncedCollections can be nested, and nested collections must also be instances of SyncedCollection so that synchronization occurs even when nested structures are modified. Recreating the full nested structure every time data is reloaded from file is highly inefficient, so this method performs an in-place update that only changes entries that need to be changed.

Parameters

data (Collection) – A collection satisfying is_base_type().

_validate(data)

Validate the input data.

Parameters

data (Collection) – A collection satisfying is_base_type().

classmethod disable_multithreading()

Disable all safety checks and thread locks required for thread safety.

The mutex locks required to enable multithreading introduce nontrivial performance costs, so they can be disabled for classes that support it.

classmethod enable_multithreading()

Enable safety checks and thread locks required for thread safety.

Support for multithreaded execution can be disabled by calling disable_multithreading(); calling this method reverses that.

abstract classmethod is_base_type(data)

Check whether data is of the same base type (such as list or dict) as this class.

Parameters

data (Any) – The input data to test.

Returns

Whether or not the object can be converted into this synced collection type.

Return type

bool

class signac.synced_collections.data_types.synced_collection._LoadAndSave(collection)

Bases: object

A context manager for SyncedCollection to wrap saving and loading.

Any write operation on a synced collection must be preceded by a load and followed by a save. Moreover, additional logic may be required to handle other aspects of the synchronization, particularly the acquisition of thread locks. This class abstracts this concept, making it easy for subclasses to customize the behavior if needed (for instance, to introduce additional locks).

synced_collections.synced_dict module

Implements the SyncedDict.

This implements a dict-like data structure that also conforms to the SyncedCollection API and can be combined with any backend type to give a dict-like API to a synchronized data structure.

class signac.synced_collections.data_types.synced_dict.SyncedDict(data=None, _validate=True, *args, **kwargs)

Bases: SyncedCollection, MutableMapping

Implement the dict data structure along with values access through attributes named as keys.

The SyncedDict inherits from SyncedCollection and MutableMapping. Therefore, it behaves like a dict.

Parameters
  • data (Mapping, optional) – The initial data to populate the dict. If None, defaults to {} (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

While the SyncedDict object behaves like a dict, there are important distinctions to remember. In particular, because operations are reflected as changes to an underlying backend, copying (even deep copying) a SyncedDict instance may exhibit unexpected behavior. If a true copy is required, you should use the _to_base() method to get a dict representation, and if necessary construct a new SyncedDict.

clear() → None.  Remove all items from D.
get(k[, d]) → D[k] if k in D, else d.  d defaults to None.
classmethod is_base_type(data)

Check whether the data is an instance of mapping.

Parameters

data (any) – Data to be checked.

Return type

bool

items() → a set-like object providing a view on D's items
keys() → a set-like object providing a view on D's keys
pop(k[, d]) → v, remove specified key and return the corresponding value.

If the key is not found, d is returned if given, otherwise KeyError is raised.

popitem() → (k, v), remove and return some (key, value) pair as a 2-tuple; raise KeyError if D is empty.

reset(data)

Update the instance with new data.

Parameters

data (mapping) – Data to update the instance.

Raises

ValueError – If the data is not a mapping.

setdefault(k[, d]) → D.get(k, d), also set D[k]=d if k not in D
update([E, ]**F) → None.  Update D from mapping/iterable E and F.

If E is present and has a .keys() method, does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, does: for (k, v) in E: D[k] = v. In either case, this is followed by: for k, v in F.items(): D[k] = v.

values() → an object providing a view on D's values

synced_collections.synced_list module

Implements the SyncedList.

This implements a list-like data structure that also conforms to the SyncedCollection API and can be combined with any backend type to give a list-like API to a synchronized data structure.

class signac.synced_collections.data_types.synced_list.SyncedList(data=None, _validate=True, *args, **kwargs)

Bases: SyncedCollection, MutableSequence

Implementation of list data structure.

The SyncedList inherits from SyncedCollection and MutableSequence. Therefore, it behaves similar to a list.

Parameters
  • data (Sequence, optional) – The initial data to populate the list. If None, defaults to [] (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

While the SyncedList object behaves like a list, there are important distinctions to remember. In particular, because operations are reflected as changes to an underlying backend, copying (even deep copying) a SyncedList instance may exhibit unexpected behavior. If a true copy is required, you should use the _to_base() method to get a list representation, and if necessary construct a new SyncedList.

append(item)

S.append(value) – append value to the end of the sequence

clear() → None -- remove all items from S

extend(iterable)

S.extend(iterable) – extend sequence by appending elements from the iterable

insert(index, item)

S.insert(index, value) – insert value before index

classmethod is_base_type(data)

Check whether the data is a non-string Sequence.

Parameters

data (Any) – Data to be checked

Return type

bool

remove(value)

S.remove(value) – remove first occurrence of value. Raise ValueError if the value is not present.

reset(data)

Update the instance with new data.

Parameters

data (non-string Sequence) – Data to update the instance.

Raises

ValueError – If the data is not a non-string sequence.

Backends

synced_collections.backends.collection_json module

Implements a JSON SyncedCollection backend.

class signac.synced_collections.backends.collection_json.BufferedJSONAttrDict(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: BufferedJSONDict, AttrDict

A buffered JSONAttrDict.

class signac.synced_collections.backends.collection_json.BufferedJSONAttrList(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: BufferedJSONList

A BufferedJSONList whose dict-like children will be of type BufferedJSONAttrDict.

class signac.synced_collections.backends.collection_json.BufferedJSONCollection(parent=None, filename=None, *args, **kwargs)

Bases: SerializedFileBufferedCollection, JSONCollection

A JSONCollection that supports I/O buffering.

This class implements the buffer protocol defined by BufferedCollection. The concrete implementation of buffering behavior is defined by the SerializedFileBufferedCollection.

class signac.synced_collections.backends.collection_json.BufferedJSONDict(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: BufferedJSONCollection, SyncedDict

A buffered JSONDict.

class signac.synced_collections.backends.collection_json.BufferedJSONList(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: BufferedJSONCollection, SyncedList

A buffered JSONList.

class signac.synced_collections.backends.collection_json.JSONAttrDict(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: JSONDict, AttrDict

A dict-like data structure that synchronizes with a persistent JSON file.

Unlike JSONDict, this class also supports attribute-based access to dictionary contents, e.g. doc.foo == doc['foo'].

Examples

>>> doc = JSONAttrDict('data.json', write_concern=True)
>>> doc['foo'] = "bar"
>>> assert doc.foo == doc['foo'] == "bar"
>>> assert 'foo' in doc
>>> del doc['foo']
>>> doc['foo'] = dict(bar=True)
>>> doc
{'foo': {'bar': True}}
>>> doc.foo.bar = False
>>> doc
{'foo': {'bar': False}}
Parameters
  • filename (str, optional) – The filename of the associated JSON file on disk (Default value = None).

  • write_concern (bool, optional) – Ensure file consistency by writing changes back to a temporary file first, before replacing the original file (Default value = False).

  • data (collections.abc.Mapping, optional) – The initial data passed to JSONAttrDict. If None, defaults to {} (Default value = None).

  • parent (JSONCollection, optional) – A parent instance of JSONCollection or None. If None, the collection owns its own data (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

While the JSONAttrDict object behaves like a dict, there are important distinctions to remember. In particular, because operations are reflected as changes to an underlying file, copying (even deep copying) a JSONAttrDict instance may exhibit unexpected behavior. If a true copy is required, you should use the call operator to get a dictionary representation, and if necessary construct a new JSONAttrDict instance.

class signac.synced_collections.backends.collection_json.JSONAttrList(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: JSONList

A JSONList whose dict-like children will be of type JSONAttrDict.

class signac.synced_collections.backends.collection_json.JSONCollection(filename=None, write_concern=False, *args, **kwargs)

Bases: SyncedCollection

A SyncedCollection that synchronizes with a JSON file.

This collection implements synchronization by reading and writing the associated JSON file in its entirety for every read/write operation. This backend is a good choice for maximum accessibility and transparency since all data is immediately accessible in the form of a text file with no additional tooling, but is likely a poor choice for high performance applications.

Thread safety

The JSONCollection is thread-safe. To make these collections safe, the write_concern flag is ignored in multithreaded execution, and the write is always performed via a write to a temporary file followed by a replacement of the original file. The file replacement operation uses os.replace(), which is guaranteed to be atomic by the Python standard library.

Parameters
  • filename (str) – The filename of the associated JSON file on disk.

  • write_concern (bool, optional) – Ensure file consistency by writing changes back to a temporary file first, before replacing the original file (Default value = False).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

property filename

The name of the associated JSON file on disk.

Type

str

class signac.synced_collections.backends.collection_json.JSONDict(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: JSONCollection, SyncedDict

A dict-like data structure that synchronizes with a persistent JSON file.

Examples

>>> doc = JSONDict('data.json', write_concern=True)
>>> doc['foo'] = "bar"
>>> assert doc['foo'] == "bar"
>>> assert 'foo' in doc
>>> del doc['foo']
>>> doc['foo'] = dict(bar=True)
>>> doc
{'foo': {'bar': True}}
Parameters
  • filename (str, optional) – The filename of the associated JSON file on disk (Default value = None).

  • write_concern (bool, optional) – Ensure file consistency by writing changes back to a temporary file first, before replacing the original file (Default value = False).

  • data (collections.abc.Mapping, optional) – The initial data passed to JSONDict. If None, defaults to {} (Default value = None).

  • parent (JSONCollection, optional) – A parent instance of JSONCollection or None. If None, the collection owns its own data (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

While the JSONDict object behaves like a dict, there are important distinctions to remember. In particular, because operations are reflected as changes to an underlying file, copying (even deep copying) a JSONDict instance may exhibit unexpected behavior. If a true copy is required, you should use the call operator to get a dictionary representation, and if necessary construct a new JSONDict instance.
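
For illustration, a minimal sketch of the copy behavior described in the warning above; the filename is a placeholder:

from signac.synced_collections.backends.collection_json import JSONDict

doc = JSONDict("data.json")
doc["foo"] = "bar"

# The call operator returns a plain dict that is decoupled from the file.
snapshot = doc()
snapshot["foo"] = "baz"  # does not modify data.json
assert doc["foo"] == "bar"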

class signac.synced_collections.backends.collection_json.JSONList(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: JSONCollection, SyncedList

A list-like data structure that synchronizes with a persistent JSON file.

Only non-string sequences are supported by this class.

Examples

>>> synced_list = JSONList('data.json', write_concern=True)
>>> synced_list.append("bar")
>>> assert synced_list[0] == "bar"
>>> assert len(synced_list) == 1
>>> del synced_list[0]
Parameters
  • filename (str, optional) – The filename of the associated JSON file on disk (Default value = None).

  • write_concern (bool, optional) – Ensure file consistency by writing changes back to a temporary file first, before replacing the original file (Default value = False).

  • data (non-str collections.abc.Sequence, optional) – The initial data passed to JSONList. If None, defaults to [] (Default value = None).

  • parent (JSONCollection, optional) – A parent instance of JSONCollection or None. If None, the collection owns its own data (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

While the JSONList object behaves like a list, there are important distinctions to remember. In particular, because operations are reflected as changes to an underlying file, copying (even deep copying) a JSONList instance may exhibit unexpected behavior. If a true copy is required, you should use the call operator to get a list representation, and if necessary construct a new JSONList instance.

class signac.synced_collections.backends.collection_json.MemoryBufferedJSONAttrDict(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: MemoryBufferedJSONDict, AttrDict

A buffered JSONAttrDict.

class signac.synced_collections.backends.collection_json.MemoryBufferedJSONAttrList(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: MemoryBufferedJSONList

A MemoryBufferedJSONList whose dict-like children will be of type MemoryBufferedJSONAttrDict.

class signac.synced_collections.backends.collection_json.MemoryBufferedJSONCollection(parent=None, filename=None, *args, **kwargs)

Bases: SharedMemoryFileBufferedCollection, JSONCollection

A JSONCollection that supports I/O buffering.

This class implements the buffer protocol defined by BufferedCollection. The concrete implementation of buffering behavior is defined by the SharedMemoryFileBufferedCollection.

class signac.synced_collections.backends.collection_json.MemoryBufferedJSONDict(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: MemoryBufferedJSONCollection, SyncedDict

A buffered JSONDict.

class signac.synced_collections.backends.collection_json.MemoryBufferedJSONList(filename=None, write_concern=False, data=None, parent=None, *args, **kwargs)

Bases: MemoryBufferedJSONCollection, SyncedList

A buffered JSONList.

signac.synced_collections.backends.collection_json.json_attr_dict_validator(data)

Validate data for JSONAttrDict.

This validator combines the logic of several validators into one to make validation more efficient:
  • JSON format validation

  • Ensuring no dots are present in string keys

  • Converting non-str keys to strings. This is a backwards compatibility layer that will be removed in signac 2.0.

Parameters

data – Data to validate.

Raises
  • KeyTypeError – If key data type is not supported.

  • TypeError – If the data type of data is not supported.
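
For illustration, a hedged sketch of calling the validator directly: JSON-compatible data passes silently, while unsupported value types raise TypeError:

from signac.synced_collections.backends.collection_json import json_attr_dict_validator

json_attr_dict_validator({"a": [1, 2.5, "text", None]})  # passes silently

try:
    json_attr_dict_validator({"a": object()})  # object() is not JSON-serializable
except TypeError as error:
    print(f"rejected: {error}")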

synced_collections.backends.collection_mongodb module

Implements a MongoDB SyncedCollection backend.

class signac.synced_collections.backends.collection_mongodb.MongoDBCollection(collection=None, uid=None, parent=None, *args, **kwargs)

Bases: SyncedCollection

A SyncedCollection that synchronizes with a MongoDB document.

In MongoDB, a database is composed of multiple MongoDB collections, which are analogous to tables in SQL databases but, unlike relational tables, do not enforce a schema. In turn, collections are composed of documents, which are analogous to rows in a table but are much more flexible, storing any valid JSON object in a JSON-like encoded format known as BSON (“binary JSON”).

Each MongoDBCollection can be represented as a MongoDB document, so this backend stores the MongoDBCollection as a single document within the collection provided by the user. The document is identified by a unique key provided by the user.

Thread safety

The MongoDBCollection is not thread-safe.

Parameters
  • collection (pymongo.collection.Collection) – The MongoDB collection in which to store data.

  • uid (dict) – The unique key-value mapping added to the data and stored in the document so that it is uniquely identifiable in the MongoDB collection. The key “data” is reserved and may not be part of this uid.

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

The user is responsible for providing a unique id such that there are no possible collisions between different MongoDBCollection instances stored in the same MongoDB collection. Failure to do so may result in data corruption if multiple documents appear to be associated with the same uid.

property collection

Get the collection being synced to.

Type

pymongo.collection.Collection

property uid

Get the unique mapping used to identify this collection.

Type

dict
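
For illustration, a hedged sketch of a MongoDB-backed dict; it assumes a running MongoDB server and the pymongo package, and the database, collection, and uid names are placeholders:

from pymongo import MongoClient

from signac.synced_collections.backends.collection_mongodb import MongoDBDict

client = MongoClient()  # connects to localhost:27017 by default
mongo_collection = client["my_database"]["my_collection"]

# The uid must be unique within mongo_collection; the key "data" is reserved.
doc = MongoDBDict(collection=mongo_collection, uid={"signac_id": "experiment_1"})
doc["temperature"] = 300
assert doc["temperature"] == 300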

class signac.synced_collections.backends.collection_mongodb.MongoDBDict(collection=None, uid=None, data=None, parent=None, *args, **kwargs)

Bases: MongoDBCollection, SyncedDict

A dict-like data structure that synchronizes with a document in a MongoDB collection.

Examples

>>> doc = MongoDBDict('data')
>>> doc['foo'] = "bar"
>>> assert doc['foo'] == "bar"
>>> assert 'foo' in doc
>>> del doc['foo']
>>> doc['foo'] = dict(bar=True)
>>> doc
{'foo': {'bar': True}}
Parameters
  • collection (pymongo.collection.Collection, optional) – A pymongo.collection.Collection instance (Default value = None).

  • uid (dict, optional) – The unique key-value mapping identifying the collection (Default value = None).

  • data (non-str collections.abc.Mapping, optional) – The initial data passed to MongoDBDict. If None, defaults to {} (Default value = None).

  • parent (MongoDBCollection, optional) – A parent instance of MongoDBCollection or None. If None, the collection owns its own data (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

While the MongoDBDict object behaves like a dict, there are important distinctions to remember. In particular, because operations are reflected as changes to an underlying database, copying a MongoDBDict instance may exhibit unexpected behavior. If a true copy is required, you should use the call operator to get a dictionary representation, and if necessary construct a new MongoDBDict instance.

class signac.synced_collections.backends.collection_mongodb.MongoDBList(collection=None, uid=None, data=None, parent=None, *args, **kwargs)

Bases: MongoDBCollection, SyncedList

A list-like data structure that synchronizes with a document in a MongoDB collection.

Only non-string sequences are supported by this class.

Examples

>>> synced_list = MongoDBList('data')
>>> synced_list.append("bar")
>>> assert synced_list[0] == "bar"
>>> assert len(synced_list) == 1
>>> del synced_list[0]
Parameters
  • collection (pymongo.collection.Collection, optional) – A pymongo.collection.Collection instance (Default value = None).

  • uid (dict, optional) – The unique key-value mapping identifying the collection (Default value = None).

  • data (non-str collections.abc.Sequence, optional) – The initial data passed to MongoDBList. If None, defaults to [] (Default value = None).

  • parent (MongoDBCollection, optional) – A parent instance of MongoDBCollection or None. If None, the collection owns its own data (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

While the MongoDBList object behaves like a list, there are important distinctions to remember. In particular, because operations are reflected as changes to an underlying database, copying a MongoDBList instance may exhibit unexpected behavior. If a true copy is required, you should use the call operator to get a list representation, and if necessary construct a new MongoDBList instance.

synced_collections.backends.collection_redis module

Implements a Redis SyncedCollection backend.

class signac.synced_collections.backends.collection_redis.RedisCollection(client=None, key=None, *args, **kwargs)

Bases: SyncedCollection

A SyncedCollection that synchronizes with a Redis database.

This backend stores data in Redis by associating it with the provided key.

Thread safety

The RedisCollection is not thread-safe.

Parameters
  • client (redis.Redis) – The Redis client used to persist data.

  • key (str) – The key associated with this collection in the Redis database.

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

property client

The Redis client used to store the data.

Type

redis.Redis

property key

The key associated with this collection stored in Redis.

Type

str
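
For illustration, a hedged sketch of a Redis-backed dict; it assumes a local Redis server and the redis package, and the key name is a placeholder:

import redis

from signac.synced_collections.backends.collection_redis import RedisDict

client = redis.Redis()  # connects to localhost:6379 by default
doc = RedisDict(client=client, key="experiment_1")
doc["foo"] = "bar"
assert doc["foo"] == "bar"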

class signac.synced_collections.backends.collection_redis.RedisDict(client=None, key=None, data=None, parent=None, *args, **kwargs)

Bases: RedisCollection, SyncedDict

A dict-like data structure that synchronizes with a persistent Redis database.

Examples

>>> doc = RedisDict('data')
>>> doc['foo'] = "bar"
>>> assert doc['foo'] == "bar"
>>> assert 'foo' in doc
>>> del doc['foo']
>>> doc['foo'] = dict(bar=True)
>>> doc
{'foo': {'bar': True}}
Parameters
  • client (redis.Redis, optional) – A redis client (Default value = None).

  • key (str, optional) – The key of the collection (Default value = None).

  • data (collections.abc.Mapping, optional) – The initial data passed to RedisDict. If None, defaults to {} (Default value = None).

  • parent (RedisCollection, optional) – A parent instance of RedisCollection or None. If None, the collection owns its own data (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

While the RedisDict object behaves like a dict, there are important distinctions to remember. In particular, because operations are reflected as changes to an underlying database, copying a RedisDict instance may exhibit unexpected behavior. If a true copy is required, you should use the call operator to get a dictionary representation, and if necessary construct a new RedisDict instance.

class signac.synced_collections.backends.collection_redis.RedisList(client=None, key=None, data=None, parent=None, *args, **kwargs)

Bases: RedisCollection, SyncedList

A list-like data structure that synchronizes with a persistent Redis database.

Only non-string sequences are supported by this class.

Examples

>>> synced_list = RedisList('data')
>>> synced_list.append("bar")
>>> assert synced_list[0] == "bar"
>>> assert len(synced_list) == 1
>>> del synced_list[0]
Parameters
  • client (redis.Redis, optional) – A Redis client (Default value = None).

  • key (str, optional) – The key of the collection (Default value = None).

  • data (non-str collections.abc.Sequence, optional) – The initial data passed to RedisList. If None, defaults to [] (Default value = None).

  • parent (RedisCollection, optional) – A parent instance of RedisCollection or None. If None, the collection owns its own data (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

While the RedisList object behaves like a list, there are important distinctions to remember. In particular, because operations are reflected as changes to an underlying database, copying a RedisList instance may exhibit unexpected behavior. If a true copy is required, you should use the call operator to get a list representation, and if necessary construct a new RedisList instance.

synced_collections.backends.collection_zarr module

Implements a Zarr SyncedCollection backend.

class signac.synced_collections.backends.collection_zarr.ZarrCollection(group=None, name=None, codec=None, *args, **kwargs)

Bases: SyncedCollection

A SyncedCollection that synchronizes with a Zarr group.

Since Zarr is designed for storage of array-like data, this backend implements synchronization by storing the collection in a 1-element object array. The user provides the group within which to store the data and the name of the data in the group.

Thread safety

The ZarrCollection is not thread-safe.

Parameters
  • group (zarr.hierarchy.Group) – The Zarr group in which to store data.

  • name (str) – The name under which this collection is stored in the Zarr group.

  • codec (numcodecs.abc.Codec) – The encoding mechanism for the data. If not provided, defaults to JSON encoding (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

property codec

The encoding method used for the data.

Type

numcodecs.abc.Codec

property group

The Zarr group storing the data.

Type

zarr.hierarchy.Group

property name

The name of this data in the Zarr group.

Type

str
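
For illustration, a hedged sketch of a Zarr-backed dict; it assumes the zarr package and uses an in-memory group, and the name is a placeholder:

import zarr

from signac.synced_collections.backends.collection_zarr import ZarrDict

group = zarr.group()  # in-memory group; a persistent store could be opened instead
doc = ZarrDict(group=group, name="experiment_1")
doc["foo"] = "bar"
assert doc["foo"] == "bar"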

class signac.synced_collections.backends.collection_zarr.ZarrDict(group=None, name=None, data=None, parent=None, *args, **kwargs)

Bases: ZarrCollection, SyncedDict

A dict-like data structure that synchronizes with a Zarr group.

Examples

>>> doc = ZarrDict('data')
>>> doc['foo'] = "bar"
>>> assert doc['foo'] == "bar"
>>> assert 'foo' in doc
>>> del doc['foo']
>>> doc['foo'] = dict(bar=True)
>>> doc
{'foo': {'bar': True}}
Parameters
  • group (zarr.hierarchy.Group, optional) – The group in which to store data (Default value = None).

  • name (str, optional) – The name of the collection (Default value = None).

  • data (collections.abc.Mapping, optional) – The initial data passed to ZarrDict. If None, defaults to {} (Default value = None).

  • parent (ZarrCollection, optional) – A parent instance of ZarrCollection or None. If None, the collection owns its own data (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

While the ZarrDict object behaves like a dict, there are important distinctions to remember. In particular, because operations are reflected as changes to an underlying database, copying (even deep copying) a ZarrDict instance may exhibit unexpected behavior. If a true copy is required, you should use the call operator to get a dictionary representation, and if necessary construct a new ZarrDict instance.

class signac.synced_collections.backends.collection_zarr.ZarrList(group=None, name=None, data=None, parent=None, *args, **kwargs)

Bases: ZarrCollection, SyncedList

A list-like data structure that synchronizes with a Zarr group.

Only non-string sequences are supported by this class.

Examples

>>> synced_list = ZarrList('data')
>>> synced_list.append("bar")
>>> assert synced_list[0] == "bar"
>>> assert len(synced_list) == 1
>>> del synced_list[0]
Parameters
  • group (zarr.hierarchy.Group, optional) – The group in which to store data (Default value = None).

  • name (str, optional) – The name of the collection (Default value = None).

  • data (non-str collections.abc.Sequence, optional) – The initial data passed to ZarrList. If None, defaults to [] (Default value = None).

  • parent (ZarrCollection, optional) – A parent instance of ZarrCollection or None. If None, the collection owns its own data (Default value = None).

  • *args – Positional arguments forwarded to parent constructors.

  • **kwargs – Keyword arguments forwarded to parent constructors.

Warning

While the ZarrList object behaves like a list, there are important distinctions to remember. In particular, because operations are reflected as changes to an underlying database, copying (even deep copying) a ZarrList instance may exhibit unexpected behavior. If a true copy is required, you should use the call operator to get a list representation, and if necessary construct a new ZarrList instance.

Buffers

synced_collections.buffers.buffered_collection module

Defines a buffering protocol for SyncedCollection objects.

Depending on the choice of backend, synchronization may be an expensive process. In that case, it can be helpful to allow many in-memory modifications to occur before any synchronization is attempted. Since many collections could be pointing to the same underlying resource, maintaining proper data coherency across different instances requires careful consideration of how the data is stored. The appropriate buffering methods can differ for different backends; as a result, the basic interface simply lays out the API for buffering and leaves implementation details for specific backends to handle. Judicious use of buffering can dramatically speed up code paths that might otherwise involve, for instance, heavy I/O. The specific buffering mechanism must be implemented by each backend since it depends on the nature of the underlying data format.

All buffered collections expose a local context manager for buffering. In addition, each backend exposes a context manager BufferedCollection.buffer_backend() that indicates to all buffered collections of that backend that they should enter buffered mode. These context managers may be nested freely, and buffer flushes will occur when all such managers have been exited.

with collection1.buffered:
    with type(collection1).buffer_backend:
        collection2['foo'] = 1
        collection1['bar'] = 1
        # collection2 will flush when this context exits.

    # This operation will write straight to the backend.
    collection2['bar'] = 2

    # collection1 will flush when this context exits.
class signac.synced_collections.buffers.buffered_collection.BufferedCollection(*args, **kwargs)

Bases: SyncedCollection

A SyncedCollection defining an interface for buffering.

The default behavior of this class is not to buffer. This class simply defines an appropriate interface for buffering behavior so that client code can rely on these methods existing, e.g. to be able to do things like with collection.buffered.... This feature allows client code to indicate to the collection when it is safe to buffer reads and writes, which usually means guaranteeing that the synchronization destination (e.g. an underlying file or database entry) will not be modified by other processes concurrently with the set of operations within the buffered block. However, in the default case the result of this will be a no-op and all data will be immediately synchronized with the backend.

The BufferedCollection overrides the _load() and _save() methods to check whether buffering is enabled or not. If not, the behavior is identical to the parent class. When in buffered mode, however, the BufferedCollection introduces two additional hooks that can be overridden by subclasses to control how the collection behaves while buffered: _save_to_buffer() and _load_from_buffer().

Thread safety

Whether or not buffering is thread safe depends on the buffering method used. In general, both the buffering logic and the data type operations must be thread safe for the resulting collection type to be thread safe.

_flush()

Flush data associated with this instance from the buffer.

classmethod _flush_buffer()

Flush all data in this class’s buffer.

property _is_buffered

Check if we should write to the buffer or not.

_load()

Load data from the backend but buffer if needed.

This method is identical to the SyncedCollection implementation except that it determines whether data is actually synchronized or instead read from a temporary buffer based on the buffering mode.

_load_from_buffer()

Load data from the buffer.

By default, this method simply calls _load_from_resource(). Subclasses must implement specific buffering strategies.

Returns

An equivalent unsynced collection satisfying is_base_type() that contains the buffered data. By default, the buffered data is just the data in the resource.

Return type

Collection

_save()

Synchronize data with the backend but buffer if needed.

This method is identical to the SyncedCollection implementation for sync except that it determines whether data is actually synchronized or instead written to a temporary buffer based on the buffering mode.

_save_to_buffer()

Store data in buffer.

By default, this method simply calls _save_to_resource(). Subclasses must implement specific buffering strategies.

classmethod backend_is_buffered()

Check if this backend is currently buffered.

classmethod buffer_backend(*args, **kwargs)

Enter context to buffer all operations for this backend.
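
For illustration, a hedged sketch of backend-wide buffering using the JSON backend documented above; it assumes buffer_backend() is entered as a context manager, as its signature suggests, and the filename is a placeholder:

from signac.synced_collections.backends.collection_json import BufferedJSONDict

doc = BufferedJSONDict("data.json")
with BufferedJSONDict.buffer_backend():
    for i in range(1000):
        doc[f"key_{i}"] = i  # accumulated in the buffer instead of rewriting the file each time
# The buffer is flushed to data.json when the outermost buffering context exits.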

synced_collections.buffers.file_buffered_collection module

A standardized buffering implementation for file-based backends.

All file-based backends can use a similar buffering protocol. In particular, integrity checks can be performed by checking for whether the file has been modified since it was originally loaded into the buffer. However, various specific components are abstract and must be implemented by child classes.

class signac.synced_collections.buffers.file_buffered_collection.FileBufferedCollection(parent=None, filename=None, *args, **kwargs)

Bases: BufferedCollection

A SyncedCollection that can buffer file I/O.

This class provides a standardized buffering protocol for all file-based backends. All file-based backends can use the same set of integrity checks prior to a buffer flush to ensure that no conflicting modifications are made. Specifically, they can check whether the file has been modified on disk since it was originally loaded to the buffer. This class provides the basic infrastructure for that and defines standard methods that can be used by all classes. Subclasses must define the appropriate storage mechanism.

Note

Important notes for developers:
  • This class should be inherited before any other collections. This requirement is due to the extensive use of multiple inheritance. Since this class is designed to be combined with other SyncedCollection types without making those types aware of buffering behavior, it transparently hooks into the initialization process, but this is dependent on its constructor being called before those of other classes.

  • All subclasses must define a class level _BUFFER_CAPACITY variable that is used to determine the maximum allowable buffer size.

Parameters

filename (str, optional) – The filename of the associated file on disk (Default value = None).

Warning

Although it can be done safely, in general modifying two different collections pointing to the same underlying resource while both are in different buffering modes is unsupported and can lead to undefined behavior. This class makes a best effort at performing safe modifications, but it is possible to construct nested buffered contexts for different objects that can lead to an invalid buffer state, or even situations where there is no obvious indicator of what is the canonical source of truth. In general, if you need multiple objects pointing to the same resource, it is strongly recommended to keep them in identical buffering states at all times.

classmethod buffer_backend(buffer_size=None, force_write=None, *args, **kwargs)

Enter context to buffer all operations for this backend.

Parameters
  • buffer_size (int) – The capacity of the buffer to use within this context (resets after the context is exited).

  • force_write (bool) – This argument does nothing and is only present for compatibility with signac 1.x.

classmethod disable_multithreading()

Disable all safety checks and thread locks required for thread safety.

This method also disables the managed buffer-related thread safety in addition to what the parent method does.

classmethod enable_multithreading()

Enable safety checks and thread locks required for thread safety.

This method adds managed buffer-related thread safety in addition to what the parent method does.

classmethod get_buffer_capacity()

Get the current buffer capacity.

Returns

The amount of data that can be stored before a flush is triggered in the appropriate units for a particular buffering implementation.

Return type

int

classmethod get_current_buffer_size()

Get the total amount of data currently stored in the buffer.

Returns

The size of all data contained in the buffer in the appropriate units for a particular buffering implementation.

Return type

int

classmethod set_buffer_capacity(new_capacity)

Update the buffer capacity.

Parameters

new_capacity (int) – The new capacity of the buffer in the appropriate units for a particular buffering implementation.
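
For illustration, a hedged sketch of inspecting and adjusting the buffer capacity through a concrete file-buffered type such as BufferedJSONDict; for the serialized buffer described below, the units are bytes of file data:

from signac.synced_collections.backends.collection_json import BufferedJSONDict

print(BufferedJSONDict.get_buffer_capacity())      # current capacity
print(BufferedJSONDict.get_current_buffer_size())  # data currently held in the buffer
BufferedJSONDict.set_buffer_capacity(64 * 1024 * 1024)  # e.g. raise the capacity to 64 MiB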

synced_collections.buffers.serialized_file_buffered_collection module

Buffering for file-based backends using a serialized buffer.

The buffering method implemented here involves a single buffer of serialized data. All collections in buffered mode encode their data into this buffer on save and decode from it on load.

class signac.synced_collections.buffers.serialized_file_buffered_collection.SerializedFileBufferedCollection(parent=None, filename=None, *args, **kwargs)

Bases: FileBufferedCollection

A FileBufferedCollection based on a serialized data store.

This class extends the FileBufferedCollection and implements a concrete storage mechanism in which data is encoded (by default, into JSON) and stored into a buffer. This buffer functions as a central data store for all collections and is a synchronization point for various collections pointing to the same underlying file. This serialization method may be a bottleneck in some applications; see the Warnings section for more information.

The buffer size and capacity for this class is measured in the total number of bytes stored in the buffer that correspond to file data. This is not the total size of the buffer, which also contains additional information like the hash of the data and the file metadata (which are used for integrity checks), but it is the relevant metric for users.

Note

Important note for subclasses: This class should be inherited before any other collections. This requirement is due to the extensive use of multiple inheritance: since this class is designed to be combined with other SyncedCollection types without making those types aware of buffering behavior, it transparently hooks into the initialization process, but this is dependent on its constructor being called before those of other classes.

Thread safety

This buffering method is thread safe. This thread safety is independent of the safety of an individual collection backend; the backend must support thread safe writes to the underlying resource in order for a buffered version using this class to be thread safe for general use. The thread safety guaranteed by this class only concerns buffer reads, writes, and flushes. All these operations are serialized because there is no way to prevent one collection from triggering a flush while another still thinks its data is in the cache.

Parameters

filename (str, optional) – The filename of the associated file on disk (Default value = None).

Warning

  • Although it can be done safely, in general modifying two different collections pointing to the same underlying resource while both are in different buffering modes is unsupported and can lead to undefined behavior. This class makes a best effort at performing safe modifications, but it is possible to construct nested buffered contexts for different objects that can lead to an invalid buffer state, or even situations where there is no obvious indicator of what is the canonical source of truth. In general, if you need multiple objects pointing to the same resource, it is strongly recommended to keep them in identical buffering states at all times.

  • The overhead of this buffering method is quite high due to the constant encoding and decoding of data. For performance-critical applications where memory is not highly constrained and virtual memory limits are absent, the SharedMemoryFileBufferedCollection may be more appropriate.

  • Due to the possibility of read operations triggering a flush, the contents of the buffer may be invalidated on loads as well. To prevent this, even nominally read-only operations are serialized. As a result, although this class is thread safe, it effectively serializes all operations and is therefore not performant.

synced_collections.buffers.memory_buffered_collection module

A standardized buffering implementation for file-based backends.

The buffering method implemented here involves a single buffer of references to in-memory objects containing data. These objects are the base types of a given SyncedCollection type, e.g. a dict for all dict-like collections, and are the underlying data stores for those types. This buffering method exploits the fact that all mutable collection types in Python are references, so modifying one such collection results in modifying all of them, thereby removing any need for more complicated synchronization protocols.

class signac.synced_collections.buffers.memory_buffered_collection.SharedMemoryFileBufferedCollection(parent=None, filename=None, *args, **kwargs)

Bases: FileBufferedCollection

A SyncedCollection that defers all I/O when buffered.

This class extends the FileBufferedCollection and implements a concrete storage mechanism in which collections store a reference to their data in a buffer. This method takes advantage of the reference-based semantics of built-in Python mutable data types like dicts and lists. All collections referencing the same file are pointed to the same underlying data store in buffered mode, allowing all changes in one to be transparently reflected in the others. To further improve performance, the buffer size is determined only based on the number of modified collections stored, not the total number. As a result, the maximum capacity is only reached when a large number of modified collections are stored, and unmodified collections are only removed from the buffer when a buffered context is exited (rather than when buffer capacity is exhausted). See the Warnings section for more information.

The buffer size and capacity for this class is measured in the total number of collections stored in the buffer that have undergone any modifications since their initial load from disk. A sequence of read-only operations will load data into the buffer, but the apparent buffer size will be zero.

Note

Important note for subclasses: This class should be inherited before any other collections. This requirement is due to the extensive use of multiple inheritance: since this class is designed to be combined with other SyncedCollection types without making those types aware of buffering behavior, it transparently hooks into the initialization process, but this is dependent on its constructor being called before those of other classes.

Thread safety

This buffering method is thread safe. This thread safety is independent of the safety of an individual collection backend; the backend must support thread safe writes to the underlying resource in order for a buffered version using this class to be thread safe for general use. The thread safety guaranteed by this class only concerns buffer reads, writes, and flushes. All these operations are serialized because there is no way to prevent one collection from triggering a flush while another still thinks its data is in the cache; however, this shouldn’t be terribly performance-limiting since in buffered mode we’re avoiding I/O anyway and that’s the only thing that can be effectively parallelized here.

Parameters

filename (str, optional) – The filename of the associated file on disk (Default value = None).

Warning

  • Although it can be done safely, in general modifying two different collections pointing to the same underlying resource while both are in different buffering modes is unsupported and can lead to undefined behavior. This class makes a best effort at performing safe modifications, but it is possible to construct nested buffered contexts for different objects that can lead to an invalid buffer state, or even situations where there is no obvious indicator of what is the canonical source of truth. In general, if you need multiple objects pointing to the same resource, it is strongly recommended to keep them in identical buffering states at all times.

  • This buffering method has no upper bound on the buffer size if all operations on buffered objects are read-only operations. If a strict upper bound is required, for instance due to strict virtual memory limits on a given system, use of the SerializedFileBufferedCollection will allow limiting the total memory usage of the process.
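
For illustration, a hedged sketch of opting into the shared-memory buffering strategy via the JSON backend classes documented above; the filename is a placeholder and buffer_backend() is assumed to be entered as a context manager:

from signac.synced_collections.backends.collection_json import MemoryBufferedJSONDict

# MemoryBufferedJSONDict buffers references in memory, avoiding the encode/decode
# overhead of the serialized buffer at the cost of a looser bound on buffer size.
doc = MemoryBufferedJSONDict("data.json")
with MemoryBufferedJSONDict.buffer_backend():
    doc["foo"] = "bar"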

Miscellaneous Modules

synced_collections.utils module

Define common utilities.

class signac.synced_collections.utils.AbstractTypeResolver(abstract_type_identifiers, cache_blocklist=None)

Bases: object

Mapping between recognized types and their abstract parents.

Synced collections are heavily reliant on checking the types of objects to determine the appropriate type of behavior in various scenarios. For maximum generality, most of these checks use the ABCs defined in collections.abc. The price of this flexibility is that isinstance checks with these classes are very slow because the __instancecheck__ hooks are implemented in pure Python and require checking many different cases.

Rather than attempting to directly optimize this behavior, this class provides a workaround by which we can amortize the cost of type checks. Given a set of types that must be resolved and a way to identify each of these (which may be expensive), it maintains a local cache of all instances of a given type that have previously been observed. This reduces the cost of type checking to a simple dict lookup, except for the first time a new type is observed.

Parameters
  • abstract_type_identifiers (Mapping) – A mapping from a string identifier for a group of types (e.g. "MAPPING") to a callable that can be used to identify that type. Due to insertion order guarantees of dictionaries in Python>=3.6 (officially 3.7), it may be beneficial to order this dictionary with the most frequently occurring types first. However, unless users have many different concrete types implementing the same abstract interface (e.g. many Mapping types identified via isinstance(obj, Mapping)), any performance gain should be negligible since the callables will only be executed once per type.

  • cache_blocklist (Sequence, optional) – A sequence of string identifiers from abstract_type_identifiers that should not be cached. If there are cases where objects of the same type would be classified into separate groups based on the callables in abstract_type_identifiers, this argument allows users to specify that this type should not be cached. This argument should be used sparingly because performance will quickly degrade if many calls to get_type() involve types that cannot be cached. The identifiers (keys in abstract_type_identifiers) corresponding to elements of the blocklist should be placed first in the abstract_type_identifiers dictionary since they will never be cached and are therefore the most likely callables to be used repeatedly (Default value = None).

abstract_type_identifiers

A mapping from string identifiers for an abstract type to callables that accept an object and return True if the object is of the key type and False otherwise.

Type

Dict[str, Callable[Any, bool]]

type_map

A mapping from concrete types to the corresponding named abstract type from abstract_type_identifiers.

Type

Dict[Type, str]

get_type(obj)

Get the type string corresponding to this data type.

Parameters

obj (Any) – The object whose type should be checked.

Returns

The name of the type, where valid types are the keys of the dict argument to the constructor. If the object's type cannot be identified, None is returned.

Return type

str
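
For illustration, a minimal sketch of amortized type checking with AbstractTypeResolver; the identifier names are placeholders:

from collections.abc import Mapping, Sequence

from signac.synced_collections.utils import AbstractTypeResolver

resolver = AbstractTypeResolver(
    {
        "MAPPING": lambda obj: isinstance(obj, Mapping),
        "SEQUENCE": lambda obj: isinstance(obj, Sequence) and not isinstance(obj, str),
    }
)

# The first lookup for a concrete type runs the callables; later lookups are cached dict hits.
assert resolver.get_type({"a": 1}) == "MAPPING"
assert resolver.get_type([1, 2, 3]) == "SEQUENCE"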

class signac.synced_collections.utils.SyncedCollectionJSONEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)

Bases: JSONEncoder

A json.JSONEncoder that handles objects encodable using default().

Warning

  • JSON encoding of numpy arrays is not invertible; once encoded, reloading the data will result in converting arrays to lists and numpy numbers into ints or floats.

  • This class assumes that the in-memory data for a SyncedCollection is up-to-date. If the data has been changed on disk without updating the collection, or if this class is used to serialize the data before any method of the collection is invoked that would load the data from disk, the resulting serialized data may be incorrect.

default(o: Any) → Dict[str, Any]

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)

signac.synced_collections.utils.default(o: Any) → Dict[str, Any]

Get a JSON-serializable version of compatible types.

This function is suitable for use with JSON-serialization tools as a way to serialize SyncedCollection objects and NumPy arrays. It will attempt to obtain a JSON-serializable representation of an object that is otherwise not serializable by attempting to access its _data attribute.

Warning

  • JSON encoding of numpy arrays is not invertible; once encoded, reloading the data will result in converting arrays to lists and numpy numbers into ints or floats.

  • This function assumes that the in-memory data for a SyncedCollection is up-to-date. If the data has been changed on disk without updating the collection, or if this function is used to serialize the data before any method is invoked that would load the data from disk, the resulting serialized data may be incorrect.
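
For illustration, a hedged sketch of serializing NumPy-containing data with the helpers above; it assumes numpy is installed and, per the warning, the encoding is not invertible (arrays become lists):

import json

import numpy

from signac.synced_collections.utils import SyncedCollectionJSONEncoder, default

data = {"values": numpy.arange(3)}
as_text = json.dumps(data, cls=SyncedCollectionJSONEncoder)  # encoder class route
as_text_2 = json.dumps(data, default=default)                # plain default-hook route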

synced_collections.validators module

Validators for SyncedCollection API.

A validator is any callable that raises exceptions when called with invalid data. Validators should act recursively for nested data structures and should not return any values, only raise errors. This module implements built-in validators, but client code is free to implement and add additional validators to collection types as needed.

signac.synced_collections.validators.json_format_validator(data)

Validate input data can be serialized to JSON.

Parameters

data – Data to validate.

Raises
  • KeyTypeError – If key data type is not supported.

  • TypeError – If the data type of data is not supported.

signac.synced_collections.validators.no_dot_in_key(data)

Raise an exception if there is a dot (.) in a mapping’s key.

Parameters

data – Data to validate.

Raises
  • KeyTypeError – If key data type is not supported.

  • InvalidKeyError – If the key contains invalid characters or is otherwise malformed.

signac.synced_collections.validators.require_string_key(data)

Raise an exception if a key in a mapping is not a string.

Almost all supported backends require string keys.

Parameters

data – Data to validate.

Raises

KeyTypeError – If key type is not a string.
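
For illustration, a minimal sketch of calling the built-in validators directly; they return nothing for valid data and raise for invalid data (the exception class names are printed rather than imported here):

from signac.synced_collections.validators import (
    json_format_validator,
    no_dot_in_key,
    require_string_key,
)

json_format_validator({"a": [1, 2.5, "text", None]})  # passes silently

for bad_data, validator in [({"a.b": 1}, no_dot_in_key), ({1: "x"}, require_string_key)]:
    try:
        validator(bad_data)
    except Exception as error:
        print(type(error).__name__)  # e.g. InvalidKeyError, KeyTypeError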