Updating DAR resource files¶

Every Data Access Request (DAR) project consists of below key files:

1) config

2) governance/project

3) access/access

4) metadata/dataset

They are all toml files. The cr8tor initiate command uses a cookie-cutter template which provides you with the basic version of them. The content needs to be updated with relevant information before you can successfully run the orchestration workflow.

config¶

The config file contains the following fields, which must be populated with relevant information:

Field	Description
`Source-Organization`	eg. `LSC SDE`
`Organization-Address`	eg. `Lancashire Teaching Hospitals NHS Trust, PR2 9HT`
`Contact-Name`	eg. `LSC SDE Program Team`
`Contact-Email`	eg. `lsc.sde@test.com`

governance/project¶

The governance/project contains the following fields, which must be populated with relevant information:

[project]

Field	Description
`description`	eg. `example-project created using CR8TOR.`
`reference`	auto-generated by cookie-cutter based on project_name. Pattern cr8-<project_name>
`name`	auto-generated by cookie-cutter based on project_name
`project_name`	follow organisation policy for project naming convention

[project.destination]

Field	Description
`type`	Destination type. Supported values: [`filestore`, `postgresql`]
`name`	Destination name (for filestore destinations, this drives storage location), e.g., [`LSC`, `NW`]
`format`	Data format for filestore destinations. Supported values: [`csv`, `duckdb`]

Supported Destination Types¶

The CR8TOR Publisher supports two main destination types, each with different behaviors and requirements:

Filestore Destination (`filestore`)¶

When type = "filestore", the Publish Service stores data as files in a mounted filestore rather than a database.

Key Characteristics:

File-based Storage: Data is stored as files (CSV or DuckDB format) in mounted filesystem paths
Destination-Specific Storage: The name field drives storage location via environment variables (TARGET_STORAGE_ACCOUNT_{NAME}_SDE_MNT_PATH)
Two-stage Process: Data flows through staging container first, then moves to production container
BagIt Packaging: Files are organized following BagIt standards with SHA512 checksums for integrity verification
Format Options: Supports both CSV and DuckDB formats for flexible data consumption

Required Fields:

type = "filestore"
name - Destination name (e.g., "LSC", "NW") that determines storage mount point
format - Data format: "csv" or "duckdb"

Directory Structure Created:

{storage_mount_path}/
├── staging/
│   └── {project_name}/
│       └── {project_start_time}/
│           └── data/outputs/
└── production/
    └── {project_name}/
        └── {project_start_time}/
            └── data/outputs/

PostgreSQL Destination (`postgresql`)¶

When type = "postgresql", the Publish Service loads data directly into a PostgreSQL database and creates OPAL resources for secure access.

Key Characteristics:

Database Storage: Data is loaded directly into PostgreSQL tables
OPAL Integration: Automatically creates Obiba OPAL components for secure data access:
Creates OPAL project named after the CR8TOR project
Establishes OPAL resources pointing to PostgreSQL tables
Creates DataSHIELD permission groups ({project_name}_group)
Sets up DataSHIELD and resource-level permissions
Access Control: Leverages OPAL's DataSHIELD framework for secure, privacy-preserving data analysis
Table Filtering: Filters tables based on project name and start time patterns, excluding DLT internal tables

Required Fields:

type = "postgresql"
name - Destination name (used for OPAL project identification)
format - Not applicable for PostgreSQL destinations

OPAL Resources Created:

Resource Naming: tre_postgresql_{schema}_{table}
Project Structure: One OPAL project per CR8TOR project
Group Management: DataSHIELD groups with "use" and "view" permissions
User Management: Ensures default DataSHIELD user (dsuser_default) exists

Choose filestore for file-based data delivery and postgresql for database-hosted data with integrated access control through OPAL's DataSHIELD framework.

[repository]

Field	Description
`codeRepository`	Auto-generated by cookie-cutter. e.g., `https://github.com/lsc-sde-crates/cr8-example-project`
`description`	Auto-generated by cookie-cutter.
`name`	Auto-generated by cookie-cutter.

[requesting_agent]

Field	Description
`name`	e.g. Prof. Jane Doe

[requesting_agent.affiliation]

Field	Description
`url`	e.g. https://someuni.com
`name`	e.g. Jane Doe's University

access/access¶

The access/access contains the following fields, which must be populated with relevant information:

[source]

The required source fields depend on the type of the source. The Solution supports extracting data from multiple source types including Databricks SQL warehouse endpoints and various SQL databases. The required fields for each type are validated by cr8tor and cr8tor-publisher app (using pydantic models).

Supported Source Types¶

The following source types are currently supported:

databrickssql - Databricks Unity Catalog
postgresql - PostgreSQL databases
mysql - MySQL databases
mssql - Microsoft SQL Server databases
sqlserver - Microsoft SQL Server databases (alternative identifier)

Common Source Fields¶

Field	Description
`name`	Name of the data source, e.g., `My Database Connection`.
`type`	Must match one of the supported source types listed above.
`host_url`	Host URL of the database server.

Databricks Unity Catalog (databrickssql)¶

For Databricks Unity Catalog sources, the following additional fields are required:

Field	Description
`http_path`	HTTP path for the Databricks SQL Warehouse endpoint, e.g., `/sql/1.0/warehouses/0aec44b2e70e201d`.
`port`	Optional. Default is 443 for Databricks SQL endpoints.
`catalog`	Databricks Unity Catalog name from which data will be extracted.

SQL Databases (postgresql, mysql, mssql)¶

For SQL database sources, the following additional fields are required:

Field	Description
`database`	Database name to connect to.
`port`	Database connection port (varies by database type).

[credentials]

Credentials fields are tied to the source type. Different source types require different credential configurations:

Databricks Unity Catalog Credentials¶

For Databricks SQL endpoints, we use Databricks Service Principal (SPN). See here how to create a new SPN and assign required roles and permissions.

Field	Description
`provider`	Credential provider, e.g., `AzureKeyVault`.
`spn_clientid`	Key name in secrets provider containing Service Principal client ID, e.g., `databricksspnclientid`.
`spn_secret`	Key name in secrets provider containing Service Principal secret, e.g., `databricksspnsecret`.

SQL Database Credentials¶

For SQL database sources (PostgreSQL, MySQL, MSSQL), we use username/password authentication:

Field	Description
`provider`	Credential provider, e.g., `AzureKeyVault`.
`username_key`	Key name in secrets provider containing database username, e.g., `db-username`.
`password_key`	Key name in secrets provider containing database password, e.g., `db-password`.

metadata/dataset¶

We expect a single file per dataset (in terms of Databricks SQL source, it is per Unity Catalog schema). The metadata/dataset file is contains the following fields, which must be populated with relevant information:

Field	Description
`name`	Name of the dataset, e.g., `dataset_1`.
`description`	Description of our dataset.
`schema_name`	Schema name, e.g., `z__cr8tor_poc`.

Warning

Now, we can define the tables and columns we want to retrieve, but this is optional. If we do not provide tables and/or columns, the solution will retrieve all available tables and/or columns.

[[tables]]

Field	Description
`name`	Name of the table, e.g., `domain`.

[[tables.columns]]

Field	Description
`name`	Name of the column, e.g., `domain_concept_id`.