CSV-to-JSON-LD User Guide
This is a place to organize information to teach and guide the casual user.
License
This project is open source and licensed under the MIT License.
What is the tool?
This is a metadata publishing tool developed for Work Package 1 (WP1) of the MARCO-BOLO project (MARine COastal BiOdiversity Long-term Observations). WP1 focuses on data literacy and metadata flow across the project. This tool helps researchers and data managers transform metadata from CSV files into JSON-LD conforming to Schema.org and ready for harvesting by the ODIS Catalog.
Why was this tool created? [Answer]
How the Tool Works
This tool helps you turn metadata stored in CSV files into a format that can be read by machines and shared widely across the web, specifically in a format called JSON-LD, which follows the Schema.org standard.
Documentation
To make all this possible, the tool brings together four key technologies:
Step By Step Explanation
1. LinkML: The Blueprint
LinkML (Linked Data Modeling Language) is used to define a schema, basically a blueprint that tells us:
- what kinds of metadata we expect (e.g., dataset title, creator, location)
- what format each field should have (e.g., a date, a URL, a number)
- which fields are required
We use LinkML to write these rules in a way that can be both human- and machine-readable. This schema ensures that everyone entering metadata is using the same structure.
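To make the idea of a blueprint concrete, here is a minimal Python sketch. It is not LinkML itself (LinkML schemas are written in YAML) and it is not the project's real schema: the field names, the `DATASET_BLUEPRINT` dictionary, and the `check_record` helper are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical blueprint: field name -> (expected Python type, required?).
# The real MARCO-BOLO schema is written in LinkML, not as a Python dict.
DATASET_BLUEPRINT = {
    "title": (str, True),
    "creator": (str, True),
    "dateCreated": (date, False),
}


def check_record(record: dict, blueprint: dict) -> list[str]:
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, (expected_type, required) in blueprint.items():
        value = record.get(field)
        if value is None:
            if required:
                problems.append(f"missing required field: {field}")
            continue
        if not isinstance(value, expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    return problems


good = {
    "title": "Seagrass survey",
    "creator": "A. Researcher",
    "dateCreated": date(2024, 5, 1),
}
bad = {"title": "Seagrass survey", "dateCreated": "last spring"}

print(check_record(good, DATASET_BLUEPRINT))
print(check_record(bad, DATASET_BLUEPRINT))
```

The second record is reported twice: once for the missing required `creator`, and once because `dateCreated` is free text rather than a date.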
2. CSV-W: Structured Spreadsheets
Most people are comfortable using spreadsheets, so we use CSV files to collect metadata. But plain CSV files don't include descriptions of what each column means. That's where CSV-W (CSV on the Web) comes in.
CSV-W adds a metadata file alongside each CSV, which explains:
- what each column represents
- how to interpret the data (e.g., what kind of value it is, which field in the schema it maps to)

This lets us treat CSV files like structured, interoperable datasets, not just a bunch of text.
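As a rough illustration of what such a metadata file contains, the sketch below builds a minimal CSV-W table description in Python and prints it as JSON. The property names (`url`, `tableSchema`, `columns`, `datatype`, `required`, `propertyUrl`) come from the CSV-W vocabulary; the two columns, their mappings, and the `build_csvw_metadata` helper are assumptions, and the project's real metadata files contain considerably more.

```python
import json


def build_csvw_metadata(csv_name: str, columns: list[dict]) -> dict:
    """Build a minimal CSV-W table description for one CSV file."""
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_name,
        "tableSchema": {"columns": columns},
    }


# Hypothetical column descriptions, for illustration only.
person_columns = [
    {"name": "id", "titles": "id", "datatype": "string", "required": True},
    {
        "name": "name",
        "titles": "name",
        "datatype": "string",
        "required": True,
        # propertyUrl says which vocabulary term the column maps to.
        "propertyUrl": "https://schema.org/name",
    },
]

metadata = build_csvw_metadata("Person.csv", person_columns)
print(json.dumps(metadata, indent=2))
```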
3. W3IDs: Persistent Links
Each field or class in the schema (like `creator`, `identifier`, or `dataset`) is assigned a W3ID (Web Identifier).
W3IDs are permanent URLs that act as stable identifiers for these concepts. For example:
https://w3id.org/marco-bolo/Dataset
Even if we update the website or move things around, this W3ID will always point to the current definition of "Dataset" in our schema. It's like giving every concept a permanent name tag.
4. Schema.org: Speaking a Common Language
When we convert metadata into JSON-LD, we map it to Schema.org, a vocabulary used by Google, Bing, and many others to understand web content.
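The "name tag" idea can be sketched in a few lines of Python. The base URL is taken from the `Dataset` example; that every other term follows the same pattern is an assumption, and `w3id_for` and `expand_keys` are hypothetical helpers rather than part of the tool.

```python
# Base URL taken from the Dataset example. That every term follows the
# same pattern is an assumption of this sketch.
W3ID_BASE = "https://w3id.org/marco-bolo/"


def w3id_for(term: str) -> str:
    """Return the assumed W3ID for a schema term such as 'Dataset'."""
    return W3ID_BASE + term


def expand_keys(record: dict) -> dict:
    """Replace short field names with their (assumed) full W3IDs."""
    return {w3id_for(key): value for key, value in record.items()}


print(w3id_for("Dataset"))
print(expand_keys({"creator": "A. Researcher"}))
```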
This means the metadata you publish:
- can be discovered by search engines
- fits into global data-sharing platforms like the ODIS Catalog
- supports automated reuse and integration across domains
Schema.org helps your dataset "speak the same language" as other data on the internet.
Bringing It All Together
- You fill in CSV templates based on the LinkML schema.
- CSV-W files describe what each column in those spreadsheets means.
- GitHub Actions validate your data against the schema to catch any problems.
- The tool converts your metadata into Schema.org-compliant JSON-LD.
- W3IDs make sure all terms have stable, referenceable definitions.
- You publish the JSON-LD on the web, and it can be harvested into global catalogs.
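The conversion step in this list can be pictured with a short Python sketch that reads one CSV row and emits a Schema.org `Dataset` in JSON-LD. This is not the tool's actual implementation: the CSV content, the `COLUMN_TO_PROPERTY` mapping, and the `row_to_jsonld` helper are assumptions, and the real output contains more properties.

```python
import csv
import io
import json

# Hypothetical CSV content; the real templates have more columns.
DATASET_CSV = """id,name,description,keywords
mbo_abc123,Seagrass survey,Shoot density counts,seagrass
"""

# Assumed mapping from CSV column name to Schema.org property name.
COLUMN_TO_PROPERTY = {
    "id": "identifier",
    "name": "name",
    "description": "description",
    "keywords": "keywords",
}


def row_to_jsonld(row: dict) -> dict:
    """Turn one CSV row into a minimal Schema.org Dataset document."""
    document = {"@context": "https://schema.org/", "@type": "Dataset"}
    for column, prop in COLUMN_TO_PROPERTY.items():
        if row.get(column):  # skip empty cells
            document[prop] = row[column]
    return document


rows = list(csv.DictReader(io.StringIO(DATASET_CSV)))
documents = [row_to_jsonld(row) for row in rows]
print(json.dumps(documents[0], indent=2))
```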
Suggested approach
There are multiple ways you could use this tool on GitHub and on your local machine, but we are going to focus on the workflow that we think works best across MARCO-BOLO WPs. Here are the general steps; we'll walk through each of them in detail below.
- Fork the GitHub Repository.
- Add your task information (e.g. dataset) to the CSV files.
- Submit a Pull Request to the original MARCO-BOLO repository.
- Your work will be reviewed and merged with this central repository, which will be registered with ODIS.
1. Fork the GitHub Repository.
- If you don't have one, create a GitHub account: https://github.com/signup
- Create a copy of the repository, or 'fork', in which to do your work.
- Click the 'fork' button, or go to https://github.com/marco-bolo/csv-to-json-ld/fork
- Select yourself as the owner.
- Click 'Create fork'.
- You should now have a copy of the repository at
https://github.com/your-github-username/csv-to-json-ld/
2. Add your task information (e.g. dataset) to the CSV files.
The steps in this section are suggestions and will be revised based on what proves easiest for MBO users.
- There are a variety of ways you can do this. Here are our recommendations:
- If you are savvy with GitHub, clone your copy locally, use your editor of choice to update the CSV files and push the updates to your fork.
- If you are unfamiliar with GitHub, download a local zip file of the repo by clicking the green `Code` button and selecting `Download ZIP`. Alternatively, you can go to: https://github.com/your-github-username/csv-to-json-ld/archive/refs/heads/main.zip
- Unzip the downloaded file, and use your editor of choice to make updates to the CSV files.
- Upload the CSV files that have changed to your GitHub fork by clicking the `Add file` button and selecting `Upload files`. Include a meaningful 'commit' message describing the changes you have made. As long as the file name is the same, it will overwrite the copy that is hosted on GitHub.
3. Submit a Pull Request to the original MARCO-BOLO repository.
- To merge your changes with the original MARCO-BOLO repository and the WP1 team, you need to make a Pull Request (PR).
- On your GitHub fork, click the `Contribute` button and select `Open pull request`.
- Add a meaningful title and description of the changes you have made.
- Click `Reviewers` to select a WP1 Team member to review and approve your changes.
- Click `Create pull request`.
4. Your work will be reviewed and merged with this central repository, which will be registered with ODIS.
Very much in development
- A WP1 Team member will review your changes and communicate with you via GitHub about any changes that need to be made.
- From here, we need to figure out if the JSON-LD files should have been generated on their fork, or if we do them after merging.
- If the latter, this would be part of the review process since it will have to clear validation to create the JSON-LD.
GitHub.dev approach
- GitHub offers an environment for editing the CSV files. The advantage of working in this space is you avoid local copies of the files. The disadvantage is you have a limited set of CSV editing functions.
- To access this environment, go to
https://github.dev/your-github-username/csv-to-json-ld
- Sign in to GitHub when prompted and authorize GitHub.dev to access your account.
- If this is your first time, click on the `Extensions` icon (a group of stacked squares) and install the "Excel Viewer" extension from MESCIUS. This enables spreadsheet-style editing of CSV files.
- Tick the box: "Use this profile as the default for new windows". This configures your browser to open CSVs with a table-based view.
- You may need to open the CSV by right-clicking on the file and clicking `open with` > `CSV Editor Excel Viewer`. You can make this the default open option for CSVs via the same menu.
Validating through GitHub Actions
This is most relevant to the WP1 team, who will likely be supervising validation.
The workflow for validating the CSVs and generating the JSON-LD can be found here: https://github.com/marco-bolo/csv-to-json-ld/blob/main/.github/workflows/build-jsonld.yaml Currently it runs in response to any push or pull request. We may eventually switch to a manual trigger to give us more control over validation and iteration.
When the workflow is triggered, the run is logged in the GitHub `Actions` tab. If you click on `Actions`, you will see the various workflows on the left-hand side. Click on `Build JSON-LD` to view any runs of this workflow. Your run will be titled with the commit title. If you are unsure whether it is your build, you can filter by `Actor` on the right-hand side of the table.
A green check mark (✅) to the left of your build means your changes passed validation. A red cross (❌) means there were errors.
I passed validation (✅), what next?
Download your JSON-LD output. If you click on the build title, it will bring you to the page for that build. At the bottom are `Artifacts`, or files that were produced by the GitHub Action. In the build results, click `schema-org-jsonld-outputs` to download the output as a zip file.
Note: These artifacts are temporary and will expire after 90 days. Be sure to store the files elsewhere for long-term access.
we need to decide what happens next
I failed validation (❌), what next?
Review the build logs. If you click on the build title, it will bring you to the page for that build. At the bottom of a failed GitHub Action are `Annotations`. If you click on the item(s) under annotations, it will bring you to the log of the build. You will be brought to the last error in the log and can scroll to review. You can also expand other sections of the log by clicking on the title of the section (e.g. 'Post Checkout').
The log should include a summary of the errors that looks like this:
Errors detected:
When validating remote/Person.csv-metadata.json
ERROR Type: Required in CSV 'file:/work/Person.csv', Row: 3, Column: '4'
ERROR Type: Required in CSV 'file:/work/Person.csv', Row: 3, Column: '5'
For example, the above message indicates that the 4th and 5th columns of the 3rd row (counting the header as row 1) are invalid because they are required fields but are empty.
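If a log contains many such lines, a short script can pull out the file, row, and column for each one. The sketch below is a convenience, not part of the tool; the regular expression is written against the two sample lines above, so other error types may be formatted differently, and `parse_errors` is a hypothetical helper.

```python
import re

# Pattern written against the sample error lines above; other error
# types may be formatted differently.
ERROR_LINE = re.compile(
    r"ERROR Type: (?P<type>\w+) in CSV '(?P<csv>[^']+)', "
    r"Row: (?P<row>\d+), Column: '(?P<column>\d+)'"
)


def parse_errors(log_text: str) -> list[dict]:
    """Extract error type, file name, row and column from a build log."""
    errors = []
    for match in ERROR_LINE.finditer(log_text):
        errors.append(
            {
                "type": match["type"],
                # Keep only the file name, e.g. 'Person.csv'.
                "file": match["csv"].rsplit("/", 1)[-1],
                "row": int(match["row"]),
                "column": int(match["column"]),
            }
        )
    return errors


log = """Errors detected:
When validating remote/Person.csv-metadata.json
ERROR Type: Required in CSV 'file:/work/Person.csv', Row: 3, Column: '4'
ERROR Type: Required in CSV 'file:/work/Person.csv', Row: 3, Column: '5'
"""

for error in parse_errors(log):
    print(f"{error['file']}: row {error['row']}, column {error['column']} ({error['type']})")
```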
Hosting and Registering JSON-LD with ODIS
This is in development, as we may have a single endpoint, the MBO GitHub repo, for all MBO JSON-LD to be crawled by ODIS.
To make your metadata discoverable by ODIS:
- Host the generated JSON-LD at a stable public URL (e.g., through GitHub Pages).
- Register the URL with ODIS so it can be harvested and indexed.
What could go wrong?
Required Fields and Validation Rules
Each CSV template has fields marked as required, and some fields must also follow validation rules (e.g., format restrictions or uniqueness constraints). These ensure your metadata is structured correctly and interoperable with global catalogs like ODIS.
Required Fields by CSV Template
Before filling out any table, note that most templates inherit common required fields. These include:
Universal Required Fields
Field | Meaning |
---|---|
`id` | A unique permanent identifier (e.g. `mbo_abc123`) |
`metadataPublisherId` | The ID of a Person or Organization who is publishing this metadata |
`metadataDescribedForActionId` | The ID of an Action that this record is describing (except for Action.csv itself) |
These fields are required in nearly every table. If they are missing or point to invalid IDs, validation will fail.
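Before pushing changes, you can look for empty required cells locally. The following is a minimal sketch, not the validation the GitHub Action performs: it only checks that the three universal fields are non-empty, and the sample rows and IDs are made up for illustration.

```python
import csv
import io

UNIVERSAL_REQUIRED = ["id", "metadataPublisherId", "metadataDescribedForActionId"]


def find_missing(csv_text: str, required: list[str]) -> list[tuple[int, str]]:
    """Return (row number, field) pairs for empty required cells.

    Row numbers count the header as row 1, as the validation log does.
    """
    missing = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):
        for field in required:
            if not (row.get(field) or "").strip():
                missing.append((row_number, field))
    return missing


# Hypothetical rows; the IDs are invented for this example.
sample = """id,metadataPublisherId,metadataDescribedForActionId,name
mbo_abc123,mbo_person_1,mbo_action_1,Seagrass survey
mbo_abc124,,mbo_action_1,Kelp survey
"""

print(find_missing(sample, UNIVERSAL_REQUIRED))
```

To check a real file, read it with `open(path, newline="", encoding="utf-8").read()` and pass the text to `find_missing`.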
Additional Required Fields by Table
Action.csv
- `actionStatus`
- `resultStatus`

*(Note: `metadataDescribedForActionId` is not required here because this is the root action being described)*

Audience.csv
- `audienceType`

ContactPoint.csv
- `contactType`

DataDownload.csv
- `contentUrl`
- `encodingFormat`

Dataset.csv
- `name`
- `description`
- `keywords`

DatasetComment.csv
- `text`

DefinedTerm.csv
- `name`

EmbargoStatement.csv
- `embargoDate`

GeoShape.csv
- `containedInPlace`

HowTo.csv
- `name`
- `description`

HowToStep.csv
- `position`
- `text`

HowToTip.csv
- `text`

License.csv
- `name`
- `url`

MonetaryGrant.csv
- `name`
- `amount`

Organization.csv
- `name`

Person.csv
- `name`

Place.csv
- `name`
- `address`

PropertyValue.csv
- `propertyID`
- `value`

PublishingStatusDefinedTerm.csv
- `name`

Service.csv
- `serviceType`

SoftwareApplication.csv
- `name`
- `applicationCategory`

SoftwareSourceCode.csv
- `codeRepository`

Taxon.csv
- `scientificName`

✅ Tip: If any required field is missing, the GitHub Action will fail validation during the `validate-csvws-build-jsonld` step.
Validation Rules (SHACL Constraints)
The system also applies additional validation rules using SHACL. These rules ensure the integrity of the metadata graph:
Rule | Type | Description |
---|---|---|
MBO Identifier Must Be Unique | ❌ Violation | Each `id` (e.g. `mbo_tool_001`) must appear in only one CSV file. It cannot represent multiple entities across files. |
Entity Should Be Referenced | ⚠️ Warning | Any entity you define (e.g. a `Person`, `Place`, or `SoftwareApplication`) should be referenced somewhere else in the metadata (e.g. as a `creator`, `location`, or `usedSoftware`). |

⚠️ Warnings won't stop your JSON-LD from being generated, but violations will.
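The two rules can be approximated in plain Python to show what they look for. This sketch does not use SHACL and is much simpler than the real constraints: it treats every non-`id` cell as a possible reference and does not split multivalued cells. The table contents and both helper functions are assumptions made for illustration.

```python
from collections import defaultdict


def find_duplicate_ids(tables: dict[str, list[dict]]) -> dict[str, list[str]]:
    """Map each id that appears in more than one file to those files."""
    files_by_id = defaultdict(set)
    for file_name, rows in tables.items():
        for row in rows:
            files_by_id[row["id"]].add(file_name)
    return {
        identifier: sorted(files)
        for identifier, files in files_by_id.items()
        if len(files) > 1
    }


def find_unreferenced_ids(tables: dict[str, list[dict]]) -> list[str]:
    """Return ids that never appear as a value in any non-id column."""
    defined, used = set(), set()
    for rows in tables.values():
        for row in rows:
            defined.add(row["id"])
            for column, value in row.items():
                if column != "id" and value:
                    used.add(value)
    return sorted(defined - used)


# Hypothetical rows, as they might be read with csv.DictReader.
tables = {
    "Person.csv": [
        {"id": "mbo_person_1", "name": "A. Researcher"},
        {"id": "mbo_tool_001", "name": "B. Analyst"},
    ],
    "SoftwareApplication.csv": [
        {"id": "mbo_tool_001", "name": "Plankton counter", "authorId": "mbo_person_1"},
    ],
}

print(find_duplicate_ids(tables))     # would be a violation
print(find_unreferenced_ids(tables))  # would be a warning
```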
Required Table Relationships
Before filling out any MARCO-BOLO CSV tables, it's important to understand how they depend on each other.
Minimum Required Files for a Dataset
To create a valid `Dataset.csv` row, you must also provide records in:
File | Why it's needed |
---|---|
`Dataset.csv` | The dataset record itself |
`Action.csv` | To define the `metadataDescribedForActionId` value |
`Person.csv` or `Organization.csv` | To define the `metadataPublisherId` value |

These relationships apply to every other table as well. No table stands alone: they all describe a resource that must be attributed (publisher) and scoped (action).
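A quick way to see these relationships is to check that each Dataset row points at IDs that actually exist. The sketch below assumes rows have already been read into dictionaries (for example with `csv.DictReader`); the IDs and the `check_dataset_references` helper are hypothetical, and the check covers only the two required references named above.

```python
def check_dataset_references(
    datasets: list[dict], actions: list[dict], publishers: list[dict]
) -> list[str]:
    """Return messages for Dataset rows whose required references do not resolve."""
    action_ids = {row["id"] for row in actions}
    publisher_ids = {row["id"] for row in publishers}
    problems = []
    for row in datasets:
        action_id = row.get("metadataDescribedForActionId", "")
        publisher_id = row.get("metadataPublisherId", "")
        if action_id not in action_ids:
            problems.append(f"{row['id']}: no Action with id '{action_id}'")
        if publisher_id not in publisher_ids:
            problems.append(
                f"{row['id']}: no Person or Organization with id '{publisher_id}'"
            )
    return problems


# Hypothetical rows; the IDs are invented for this example.
actions = [{"id": "mbo_action_1"}]
people = [{"id": "mbo_person_1"}]
organizations = [{"id": "mbo_org_1"}]
datasets = [
    {
        "id": "mbo_ds_1",
        "metadataDescribedForActionId": "mbo_action_1",
        "metadataPublisherId": "mbo_org_1",
    },
    {
        "id": "mbo_ds_2",
        "metadataDescribedForActionId": "mbo_action_9",
        "metadataPublisherId": "mbo_person_1",
    },
]

# A publisher may be either a Person or an Organization.
for problem in check_dataset_references(datasets, actions, people + organizations):
    print(problem)
```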
Required Cross-Table Dependencies
Table | Depends on Table | Field | Multivalued |
---|---|---|---|
Action | Action | metadataDescribedForActionId | No |
Action | PersonOrOrganization | agentId | No |
Action | PersonOrOrganization | metadataPublisherId | No |
Audience | Action | metadataDescribedForActionId | No |
Audience | PersonOrOrganization | metadataPublisherId | No |
ContactPoint | Action | metadataDescribedForActionId | No |
ContactPoint | PersonOrOrganization | metadataPublisherId | No |
DataDownload | Action | metadataDescribedForActionId | No |
DataDownload | Dataset | datasetMboId | No |
DataDownload | PersonOrOrganization | metadataPublisherId | No |
Dataset | Action | metadataDescribedForActionId | No |
Dataset | PersonOrOrganization | metadataPublisherId | No |
Dataset | PropertyValue | containsVariablesMboIds | Yes |
DatasetComment | Action | metadataDescribedForActionId | No |
DatasetComment | Dataset | commentAboutDatasetMboId | No |
DatasetComment | PersonOrOrganization | metadataPublisherId | No |
DefinedTerm | Action | metadataDescribedForActionId | No |
DefinedTerm | PersonOrOrganization | metadataPublisherId | No |
EmbargoStatement | Action | metadataDescribedForActionId | No |
EmbargoStatement | Dataset | embargoedDatasetMboId | No |
EmbargoStatement | PersonOrOrganization | metadataPublisherId | No |
GeoShape | Action | metadataDescribedForActionId | No |
GeoShape | PersonOrOrganization | metadataPublisherId | No |
HowTo | Action | metadataDescribedForActionId | No |
HowTo | HowToStep | howToStepMboId | No |
HowTo | PersonOrOrganization | metadataPublisherId | No |
HowToStep | Action | metadataDescribedForActionId | No |
HowToStep | PersonOrOrganization | metadataPublisherId | No |
HowToTip | Action | metadataDescribedForActionId | No |
HowToTip | PersonOrOrganization | metadataPublisherId | No |
License | Action | metadataDescribedForActionId | No |
License | PersonOrOrganization | metadataPublisherId | No |
MonetaryGrant | Action | metadataDescribedForActionId | No |
MonetaryGrant | PersonOrOrganization | metadataPublisherId | No |
Organization | Action | metadataDescribedForActionId | No |
Organization | PersonOrOrganization | metadataPublisherId | No |
Person | Action | metadataDescribedForActionId | No |
Person | PersonOrOrganization | metadataPublisherId | No |
Place | Action | metadataDescribedForActionId | No |
Place | PersonOrOrganization | metadataPublisherId | No |
PropertyValue | Action | metadataDescribedForActionId | No |
PropertyValue | PersonOrOrganization | metadataPublisherId | No |
Service | Organization | serviceProviderOrganizationMboId | No |
Service | PersonOrOrganization | metadataPublisherId | No |
SoftwareApplication | Action | metadataDescribedForActionId | No |
SoftwareApplication | PersonOrOrganization | metadataPublisherId | No |
SoftwareSourceCode | Action | metadataDescribedForActionId | No |
SoftwareSourceCode | PersonOrOrganization | metadataPublisherId | No |
Taxon | Action | metadataDescribedForActionId | No |
Taxon | PersonOrOrganization | metadataPublisherId | No |
Optional Cross-Table Dependencies
Table | Depends on Table | Field | Multivalued |
---|---|---|---|
Action | Action | childActionMboIds | Yes |
Action | Dataset | resultingDatasetMboIds | Yes |
Action | HowTo | howToPerformActionMboId | No |
Action | PersonOrOrganization | participantIds | Yes |
DataDownload | Audience | audienceMboIds | Yes |
DataDownload | License | licenseMboId | No |
DataDownload | PersonOrOrganization | authorId | No |
DataDownload | PersonOrOrganization | contributorIds | Yes |
DataDownload | PersonOrOrganization | maintainerId | No |
DataDownload | PersonOrOrganization | ownerId | No |
DataDownload | PersonOrOrganization | publisherId | No |
DataDownload | PublishingStatusDefinedTerm | publishingStatusMboId | No |
Dataset | Audience | audienceMboIds | Yes |
Dataset | DataDownload | dataDownloadMboIds | Yes |
Dataset | EmbargoStatement | embargoStatementMboId | No |
Dataset | License | licenseMboId | No |
Dataset | PersonOrOrganization | authorId | No |
Dataset | PersonOrOrganization | contributorIds | Yes |
Dataset | PersonOrOrganization | maintainerId | No |
Dataset | PersonOrOrganization | ownerId | No |
Dataset | PersonOrOrganization | publisherId | No |
Dataset | Place | spatialCoveragePlaceMboId | No |
Dataset | PublishingStatusDefinedTerm | publishingStatusMboId | No |
Dataset | Taxon | aboutTaxonMboIds | Yes |
DatasetComment | PersonOrOrganization | authorId | No |
HowToStep | Audience | audienceMboIds | Yes |
HowToStep | HowToStep | childStepMboIds | Yes |
HowToStep | HowToTip | howToImplementTipMboIds | Yes |
HowToStep | PersonOrOrganization | contributorIds | Yes |
HowToStep | PersonOrOrganization | providerId | No |
HowToStep | Service | citeServiceMboIds | Yes |
HowToStep | SoftwareApplication | citeSoftwareApplicationMboIds | Yes |
HowToStep | SoftwareSourceCode | citeSourceCodeMboIds | Yes |
HowToTip | Audience | audienceMboIds | Yes |
MonetaryGrant | Organization | funderOrganizationMboIds | Yes |
MonetaryGrant | Organization | sponsorOrganizationMboIds | Yes |
Organization | ContactPoint | contactPointMboIds | Yes |
Organization | MonetaryGrant | fundingGrantMboIds | Yes |
Organization | Organization | departmentMboIds | Yes |
Organization | Organization | memberOfOrganizationMboIds | Yes |
Organization | Organization | parentOrganizationMboId | No |
Person | ContactPoint | contactPointMboIds | Yes |
Person | Organization | affiliatedOrganizationMboIds | Yes |
Person | Organization | worksForOrganizationMboIds | Yes |
Place | GeoShape | geoShapeMboId | No |
PropertyValue | PropertyValue | isTypeOfPropertyValueMboId | Yes |
Service | Audience | audienceMboIds | Yes |
Service | Place | placesServedMboIds | Yes |
SoftwareApplication | PersonOrOrganization | authorId | No |
SoftwareApplication | PersonOrOrganization | contributorIds | Yes |
SoftwareApplication | PersonOrOrganization | maintainerId | No |
SoftwareApplication | PersonOrOrganization | ownerId | No |
SoftwareApplication | PersonOrOrganization | providerId | No |
SoftwareApplication | PersonOrOrganization | publisherId | No |
SoftwareApplication | PublishingStatusDefinedTerm | publishingStatusMboId | No |
SoftwareSourceCode | PersonOrOrganization | authorId | No |
SoftwareSourceCode | PersonOrOrganization | contributorIds | Yes |
SoftwareSourceCode | PersonOrOrganization | maintainerId | No |
SoftwareSourceCode | PersonOrOrganization | ownerId | No |
SoftwareSourceCode | PersonOrOrganization | publisherId | No |
SoftwareSourceCode | PublishingStatusDefinedTerm | publishingStatusMboId | No |
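
Fields marked "Yes" under Multivalued hold several IDs in one cell. The sketch below shows one way to split such a cell and check each ID. The `|` separator is an assumption; the separator actually used for a column is declared in its CSV-W metadata file. `split_ids` and `unknown_ids` are hypothetical helpers, and the IDs are invented.

```python
# Assumption: multivalued cells separate IDs with "|". Check the CSV-W
# metadata file for the separator your template actually uses.
SEPARATOR = "|"


def split_ids(cell: str, separator: str = SEPARATOR) -> list[str]:
    """Split a multivalued cell into a list of trimmed, non-empty IDs."""
    return [part.strip() for part in cell.split(separator) if part.strip()]


def unknown_ids(cell: str, known_ids: set[str]) -> list[str]:
    """Return the IDs in a multivalued cell that are not defined anywhere."""
    return [value for value in split_ids(cell) if value not in known_ids]


# Hypothetical IDs for illustration, e.g. a contributorIds cell.
known = {"mbo_person_1", "mbo_person_2"}
print(split_ids("mbo_person_1 | mbo_person_2"))
print(unknown_ids("mbo_person_1|mbo_person_3", known))
```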