Details
DATA-M
DATA-M is a service provided by Prometheus to support the Configurable Data Curation System (CDCS) ecosystem. The CDCS is a modular web framework that has been developed at the National Institute of Standards and Technology (NIST) over the past several years, initially to manage and operate scientific data.
The CDCS framework provides the tools to build a platform that follows the FAIR data principles, which give it the following capabilities by design:
Findable: the system makes it easy to find data by letting users provide rich metadata descriptions for each item of a dataset and by assigning each item a globally unique identifier,
Accessible: data can be accessed in a standard and controlled way (authentication),
Interoperable: the metadata are written in a format (XML) that can be read by other applications and can also be exported to other formats (e.g. JSON),
Reusable: datasets can be downloaded and used in a different setting.
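To illustrate the Interoperable point, the sketch below converts a small XML metadata record to JSON using only the Python standard library. The record content and the `element_to_dict` helper are hypothetical, and this is not the CDCS exporter itself, only an illustration of the idea.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical metadata record; element names are for illustration only.
RECORD_XML = """<record>
  <title>Tensile test of alloy A</title>
  <author>J. Doe</author>
  <author>R. Roe</author>
</record>"""


def element_to_dict(element):
    """Recursively convert an element to plain Python data.

    Leaf elements become their stripped text; repeated child tags become
    lists. Attributes are ignored to keep the sketch short.
    """
    children = list(element)
    if not children:
        return (element.text or "").strip()
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # Second occurrence of a tag: promote the entry to a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


root = ET.fromstring(RECORD_XML)
record_json = json.dumps({root.tag: element_to_dict(root)}, indent=2)
print(record_json)
```

Repeated elements such as `author` end up as a JSON array, which is one common convention; other mappings are possible.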
The application is actively maintained and has been forked many times by other organizations. The CDCS is open source, and several instances are available on GitHub, such as the Materials Data Curation System (MDCS): https://github.com/usnistgov/MDCS.
Plugin System
The CDCS has been engineered with flexibility and extensibility in mind. For that reason, it uses a plugin system so that customizations and enhancements can be added easily and remain maintainable.
More than 50 CDCS plugins are already available through the PyPI package repository. Here are some of the current main features of the framework:
Dynamic data model:
Defined by XML Schemas (following community standards for interoperability)
Indexes and UIs dynamically generated from schema,
Curate, explore, export and share data between systems.
Modularity, build a new system by:
Selecting a set of desired features (Django Apps), or creating new ones,
Uploading a data model (XML Schema),
Customizing settings and theme,
Persistent IDs, SAML2 authentication.
Available Plugins
Here is a list of the five most used plugins:
core_main_app: Core APIs
core_curate_app: Dynamic UI generation from template for data entry.
core_explore_by_keyword_app: Search UI with full text search capabilities
core_exporters_app: Export search results to different formats
core_linked_records_app: Assign a globally unique PID to records and link records together
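Because plugins are Django apps, enabling them would typically go through Django's `INSTALLED_APPS` setting. The fragment below is a sketch under the assumption that each plugin's app label matches the package name listed above; a real project needs further apps and settings, so check each package's own documentation.

```python
# settings.py fragment (sketch). Assumes each plugin's app label matches the
# package name listed above; this is an assumption, not taken from the text.
INSTALLED_APPS = [
    # Standard Django apps
    "django.contrib.auth",
    "django.contrib.contenttypes",
    "django.contrib.sessions",
    # CDCS plugins from the list above
    "core_main_app",
    "core_curate_app",
    "core_explore_by_keyword_app",
    "core_exporters_app",
    "core_linked_records_app",
]
```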
Data Repository
Users have data in different formats, in multiple places, using different vocabularies. The goal of the CDCS is to provide an effective research data lifecycle that supports FAIR Data principles (Findable, Accessible, Interoperable, Reusable).
Data repositories let you create or reuse a data model for your domain and start collaborating on the curation of datasets (with rich and structured metadata). Whether you are curating data daily or want to share finalized datasets with the community (with persistent identifiers for findability and referencing), data repositories let you build and customize a system that meets your needs. Authorized users can explore the curated data using full text search or custom search forms generated from the custom data model. All the features are available via a web UI, and they can also be used in scripts to automate curation and data retrieval through the REST API.
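As a rough idea of what scripted access could look like, the sketch below builds an authenticated query request with the Python standard library without sending it. The instance URL, the endpoint path, the payload shape and the use of HTTP Basic authentication are all assumptions made for illustration; consult the REST API documentation of your instance for the actual routes.

```python
import base64
import json
import urllib.request

BASE_URL = "https://cdcs.example.org"  # hypothetical instance URL
QUERY_PATH = "/rest/data/query/"       # assumed endpoint path, not from this document


def build_query_request(query, username, password):
    """Build (but do not send) an authenticated POST request carrying a JSON query."""
    credentials = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        BASE_URL + QUERY_PATH,
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_query_request({"title": "alloy"}, "user", "secret")
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```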
Data Registry
Data registries are specialized data repositories focusing on the discovery of resources. Registries come with a predefined data model that lets users provide extensive metadata about resources managed by their organization. These systems can share data with each other by joining a network of trusted registries, providing data to this network and harvesting data from it.
What is a Resource?
It's a metadata document about a physical or virtual entity such as an organization, a dataset, a website, or a piece of software. A resource may capture fields such as a title, a list of authors, a description, contact information, a publication date, and any other useful information.
Why was it built?
Users have resources that need to be discovered by the community. The registry is a specialized CDCS instance that focuses on Findability of these resources.
What are the main features of a registry?
A common data model developed by the materials community and customizable for different domains (e.g. healthcare, energy),
Search UI tailored to a data model, with full text search and advanced resource filtering,
Register resources that might be of interest for the community,
Connect registries together by setting up harvesters and providers using the OAI-PMH protocol,
Use a standardized metadata model to explore information about a dataset in a searchable way.
Data Flow
The CDCS can be used in several domains, thanks to its dynamic data models. Those data models (XML Schemas) can be either user defined or defined by a community of users willing to work in a standardized way. Once a model is set, (meta)data (XML) uploaded to the system by authorized users is validated against this data template and indexed in the database of your choice. In addition to these (meta)data documents, the CDCS also allows the storage of files of any type (PDF, images, text, ...). Users coming to the system can then retrieve information through query endpoints (full text search, search by field) and download the datasets they need.
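The flow described above (validate on upload, index, then query) can be sketched in a few lines. This is a toy illustration only: `REQUIRED_FIELDS` is a simplified stand-in for XML Schema validation (which the standard library does not provide), and `TinyIndex` is a hypothetical in-memory index, not the database layer used by the CDCS.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

REQUIRED_FIELDS = {"title", "description"}  # toy stand-in for an XML Schema


def validate(xml_text):
    """Parse the document and check that required child elements are present.

    Raises ValueError on malformed XML or missing fields.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        raise ValueError(f"not well-formed XML: {exc}") from exc
    missing = REQUIRED_FIELDS - {child.tag for child in root}
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return root


class TinyIndex:
    """In-memory full text index mapping each word to document ids."""

    def __init__(self):
        self.documents = {}
        self.words = defaultdict(set)

    def add(self, doc_id, xml_text):
        root = validate(xml_text)  # documents are only indexed if valid
        self.documents[doc_id] = xml_text
        for text in root.itertext():
            for word in re.findall(r"\w+", text.lower()):
                self.words[word].add(doc_id)

    def search(self, keyword):
        return sorted(self.words.get(keyword.lower(), set()))


index = TinyIndex()
index.add("doc-1", "<data><title>Steel fatigue</title>"
                   "<description>Cyclic loading results</description></data>")
index.add("doc-2", "<data><title>Copper creep</title>"
                   "<description>Long term loading</description></data>")
print(index.search("loading"))
```

A document that fails validation raises `ValueError` before anything is stored, which mirrors the idea that only data conforming to the template enters the system.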
Data Harvesting
The CDCS framework supports multiple types of search across multiple instances. The OAI-PMH protocol implemented in registries enables organizations to connect to other registry instances and harvest their information for fast and efficient searches.
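To make the harvesting step more concrete, here is a minimal sketch of the client side of an OAI-PMH `ListRecords` exchange: building the request URL and extracting record identifiers and the resumption token from a response. The base URL and the sample response are hypothetical, and this is not the CDCS harvester code; only the verb, argument names and XML namespace come from the OAI-PMH 2.0 protocol.

```python
import urllib.parse
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords URL; a resumption token replaces other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) from a response."""
    root = ET.fromstring(xml_text)
    identifiers = [
        element.text
        for element in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_element = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_element.text if token_element is not None and token_element.text else None
    return identifiers, token


# Hypothetical, heavily trimmed response for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url("https://registry.example.org/oai")  # hypothetical URL
identifiers, token = parse_list_records(SAMPLE_RESPONSE)
next_url = list_records_url("https://registry.example.org/oai", resumption_token=token)
print(first_url, identifiers, next_url, sep="\n")
```

A harvester would repeat the request with each returned token until the response carries no token, at which point the list is complete.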
Federated Search
A CDCS instance can be configured to grant access to some or all hosted data on the system to other CDCS instances. This allows users of an instance to broadcast queries to a set of systems and federate results in one place.
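The broadcast-and-merge idea can be sketched as follows. Each remote instance is represented by a plain Python callable, which is an assumption made to keep the example self-contained; in practice each callable would wrap a network request to that instance. The function names and result format are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def federated_search(query, instances):
    """Send one query to every instance in parallel and merge the results.

    `instances` maps an instance name to a callable taking the query and
    returning a list of hits. Failures are collected instead of aborting
    the whole search.
    """
    def run(item):
        name, search = item
        try:
            return name, search(query), None
        except Exception as exc:  # one unreachable instance should not break the rest
            return name, [], exc

    results, errors = [], {}
    with ThreadPoolExecutor(max_workers=max(1, len(instances))) as pool:
        for name, hits, error in pool.map(run, instances.items()):
            if error is not None:
                errors[name] = error
            results.extend({"source": name, "record": hit} for hit in hits)
    return results, errors


# Stand-ins for real instances (hypothetical).
def local_search(query):
    return [f"local:{query}-1"]


def remote_search(query):
    return [f"remote:{query}-1", f"remote:{query}-2"]


def offline_search(query):
    raise ConnectionError("instance unreachable")


results, errors = federated_search(
    "alloy",
    {"local": local_search, "remote": remote_search, "offline": offline_search},
)
print(results)
print(errors)
```

Tagging each hit with its source keeps the merged list traceable, and collecting errors per instance lets the caller report partial results when one system is down.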