Skip to content

Required controlledVocabulary metadata marked as valid while empty#12345

Open
stevenwinship wants to merge 5 commits into
developfrom
11900-improved-cvoc-value-validation2
Open

Required controlledVocabulary metadata marked as valid while empty#12345
stevenwinship wants to merge 5 commits into
developfrom
11900-improved-cvoc-value-validation2

Conversation

@stevenwinship
Copy link
Copy Markdown
Contributor

What this PR does / why we need it: Previous PR caused post merge test failures in Jenkins. This PR includes the code from the previous PR plus the fixes needed to pass the tests

Which issue(s) this PR closes:#11900

Special notes for your reviewer: Original PR #11950

Suggestions on how to test this:

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

@stevenwinship stevenwinship self-assigned this Apr 21, 2026
@github-actions github-actions Bot added the Type: Bug a defect label Apr 21, 2026
@stevenwinship stevenwinship moved this to In Progress 💻 in IQSS Dataverse Project Apr 21, 2026
@stevenwinship stevenwinship added FY26 Sprint 21 FY26 Sprint 21 (2026-04-08 - 2026-04-22) Size: 10 A percentage of a sprint. 7 hours. labels Apr 21, 2026
@coveralls
Copy link
Copy Markdown

coveralls commented Apr 21, 2026

Coverage Status

coverage: 24.95% (-0.001%) from 24.951% — 11900-improved-cvoc-value-validation2 into develop

@github-actions

This comment has been minimized.

@stevenwinship stevenwinship force-pushed the 11900-improved-cvoc-value-validation2 branch from 0d3fd19 to 4880feb Compare April 22, 2026 18:55
@cmbz cmbz assigned ErykKul and unassigned stevenwinship Apr 22, 2026
@cmbz cmbz moved this from In Progress 💻 to SPRINT READY in IQSS Dataverse Project Apr 22, 2026
@cmbz
Copy link
Copy Markdown

cmbz commented Apr 22, 2026

2026-04-22: @ErykKul we're returning this PR back to you for further fixes.

@github-actions

This comment has been minimized.

@stevenwinship stevenwinship moved this from SPRINT READY to This Sprint 🏃‍♀️ 🏃 in IQSS Dataverse Project Apr 22, 2026
@stevenwinship stevenwinship force-pushed the 11900-improved-cvoc-value-validation2 branch from 4880feb to 1528236 Compare April 22, 2026 21:29
@github-actions

This comment has been minimized.

3 similar comments
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@ErykKul
Copy link
Copy Markdown
Collaborator

ErykKul commented Apr 28, 2026

Root cause: allowHarvestingMissingCVV was gating the sanitizer, but it should only affect the parser.

The parser uses the flag to decide whether to throw on an unmappable controlled vocabulary string. The sanitizer handles a different case: required controlled vocabulary fields that come up empty after initDatasetFields() because the source format (e.g. oai_dc) does not carry them. Those fields must always be filled with a proper ControlledVocabularyValue wrapping "N/A", not a plain "N/A" string, otherwise the second validation pass still fails.

The fix removes the gate and routes by field type instead. Controlled vocabulary fields get a ControlledVocabularyValue("N/A"), other fields get setSingleValue("N/A"). With that, allowHarvestingMissingCVV=false correctly imports 7 of 8 records: the parser rejects the 1 record with a bad controlled vocabulary string, and the sanitizer rescues the 7 records with empty controlled vocabulary fields.

@ErykKul ErykKul moved this from This Sprint 🏃‍♀️ 🏃 to Ready for Review ⏩ in IQSS Dataverse Project Apr 28, 2026
@cmbz cmbz added FY26 Sprint 22 FY26 Sprint 22 (2026-04-22 - 2026-05-06) FY26 Sprint 23 FY26 Sprint 23 (2026-05-06 - 2026-05-20) labels May 6, 2026
@stevenwinship stevenwinship self-assigned this May 15, 2026
@stevenwinship stevenwinship moved this from Ready for Review ⏩ to In Review 🔎 in IQSS Dataverse Project May 15, 2026
@stevenwinship stevenwinship force-pushed the 11900-improved-cvoc-value-validation2 branch from 0af6740 to ccbf5f0 Compare May 15, 2026 15:22
@github-actions

This comment has been minimized.

@rtreacy rtreacy self-assigned this May 15, 2026
Copy link
Copy Markdown
Contributor

@rtreacy rtreacy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good generally. Passes tests. On coding style, is it really necessary/desirable to use import edu.harvard.iq.dataverse.*; ?

@github-project-automation github-project-automation Bot moved this from In Review 🔎 to Ready for QA ⏩ in IQSS Dataverse Project May 15, 2026
@stevenwinship stevenwinship moved this from Ready for QA ⏩ to QA ✅ in IQSS Dataverse Project May 15, 2026
@stevenwinship stevenwinship self-assigned this May 15, 2026
@stevenwinship
Copy link
Copy Markdown
Contributor Author

stevenwinship commented May 15, 2026

QAing this pr I found the following issue:

Setup: JVM setting -Ddataverse.api.allow-incomplete-metadata=true (same result if not set)
Run HarvestingClientsIT

Create a new dataset in UI and observe the "Subject" drop down. All the 'N/A' values that were assigned in the harvesting (ImportServiceBean sanitize) were also added to the drop down list.

image

@stevenwinship stevenwinship added Waiting Status: Needs Input Applied to issues in need of input from someone currently unavailable and removed Waiting labels May 15, 2026
@jp-tosca
Copy link
Copy Markdown
Contributor

Let us know when you can look at this @ErykKul :) thank you!

@cmbz cmbz added the FY26 Sprint 24 FY26 Sprint 24 (2026-05-20 - 2026-06-03) label May 21, 2026
@ErykKul
Copy link
Copy Markdown
Collaborator

ErykKul commented May 22, 2026

The previous fix created new ControlledVocabularyValue(null, "N/A", f.getDatasetFieldType()), a transient JPA entity pointing to the actual field type (e.g. Subject). Because DatasetField.controlledVocabularyValues has @ManyToMany(cascade = {CascadeType.MERGE}), merging the DatasetField on save cascaded to this new entity and inserted it as a permanent row in controlledvocabularyvalue with the Subject's datasetfieldtype_id. After running HarvestingClientsIT, every CV field that was sanitized with N/A added a new "N/A" entry to its dropdown. Running the test repeatedly multiplied the rows.

The fix uses datasetfieldService.findNAControlledVocabularyValue() instead, which returns the existing managed singleton (the same entity SWORD uses) that has datasetFieldType = null. Merging a managed entity only updates the join table — no new row is ever inserted into controlledvocabularyvalue, and the Subject dropdown stays clean. The if (naValue != null) guard handles the rare case where the singleton was never loaded via loadNAControlledVocabularyValue; in that case the field is left empty and the second-pass validation will throw ImportException, which is correct. If the N/A placeholder doesn't exist, the record cannot be sanitized and should be skipped.

@github-actions

This comment has been minimized.

1 similar comment
@github-actions
Copy link
Copy Markdown

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:11900-improved-cvoc-value-validation2
ghcr.io/gdcc/configbaker:11900-improved-cvoc-value-validation2

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

FY26 Sprint 21 FY26 Sprint 21 (2026-04-08 - 2026-04-22) FY26 Sprint 22 FY26 Sprint 22 (2026-04-22 - 2026-05-06) FY26 Sprint 23 FY26 Sprint 23 (2026-05-06 - 2026-05-20) FY26 Sprint 24 FY26 Sprint 24 (2026-05-20 - 2026-06-03) Size: 10 A percentage of a sprint. 7 hours. Status: Needs Input Applied to issues in need of input from someone currently unavailable Type: Bug a defect

Projects

Status: QA ✅

Development

Successfully merging this pull request may close these issues.

Required controlledVocabulary metadata marked as valid while empty

6 participants