feat(datastore): Decorate and log errors thrown during DataStore entrypoint initialization#27327
feat(datastore): Decorate and log errors thrown during DataStore entrypoint initialization#27327markfields wants to merge 6 commits into
Conversation
|
Hi! Thank you for opening this PR. Want me to review it? Based on the diff (252 lines, 7 files), I've queued these reviewers:
How this works
|
There was a problem hiding this comment.
Pull request overview
This PR adds telemetry for DataStore entry point initialization failures so component load issues include Fluid-specific context like DataStore identity and package path.
Changes:
- Adds a shared helper for code-artifact-tagged DataStore telemetry properties.
- Logs and annotates errors thrown while resolving
FluidDataStoreRuntime.entryPoint. - Reuses the shared telemetry helper for DataStore realization errors and adds tests for entry point failure logging.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
packages/utils/telemetry-utils/src/error.ts |
Aligns DataProcessingError wrapping factory usage. |
packages/runtime/runtime-utils/src/index.ts |
Exports the new DataStore telemetry helper. |
packages/runtime/runtime-utils/src/dataStoreHelpers.ts |
Adds shared DataStore identity telemetry property helper. |
packages/runtime/datastore/src/test/dataStoreRuntime.spec.ts |
Adds coverage for lazy entry point initialization and failure telemetry. |
packages/runtime/datastore/src/dataStoreRuntime.ts |
Wraps, annotates, logs, and rethrows entry point initialization failures. |
packages/runtime/container-runtime/src/dataStoreContext.ts |
Reuses the shared DataStore telemetry helper for realization errors. |
Fleet Review — CleanNo issues found across the reviewer fleet for this run. |
| error, | ||
| "entryPointInitialization", | ||
| ); | ||
| errorWrapped.addTelemetryProperties( |
There was a problem hiding this comment.
Note: In testing I found that these telemetry props get stripped off on the way up by roundtripping through exceptionToResponse and responseToException. A bit curious to look into that flow and if we could cut out those tranforms.
There was a problem hiding this comment.
Does this have to do with privacy concerns? I know there can sometimes be policies to strip all error information away in case it contains user data or identifiable information.
There was a problem hiding this comment.
Nope, it has to do with legacy architectural direction (URI-based routing within and across containers)
There was a problem hiding this comment.
These properties are correctly tagged as "code artifacts", so hosts can decide if they're safe to log under different circumstances
ChumpChief
left a comment
There was a problem hiding this comment.
Love wrapping at the exit to custom code! Few organizational/API suggestions.
| fluidDataStoreId: this.id, | ||
| }), | ||
| ); | ||
| errorWrapped.addTelemetryProperties(dataStoreLoadTelemetryProps(this)); |
There was a problem hiding this comment.
Passing this tends to be a code smell, probably prefer either passing the specific members needed or else make it a member method if it truly needs access to this.
There was a problem hiding this comment.
Didn't want to muck up the legacy-beta IFluidDataStoreContext interface. I can consider passing in just the props we need off the context.
There was a problem hiding this comment.
That would be my preference too, passing big/powerful objects for small tasks is also usually a code smell :-D
| new LazyPromise(async () => provideEntryPoint(this)), | ||
| new LazyPromise(async () => | ||
| provideEntryPoint(this).catch((error) => { | ||
| const errorWrapped = DataProcessingError.wrapIfUnrecognized( |
| new LazyPromise(async () => provideEntryPoint(this)), | ||
| new LazyPromise(async () => | ||
| provideEntryPoint(this).catch((error) => { | ||
| const errorWrapped = DataProcessingError.wrapIfUnrecognized( |
There was a problem hiding this comment.
I am wondering if DataProcessingError is appropriate here. This runs application entry point logic and has nothing to do with Fluid right? So, maybe this is a UsageError or just a generic error?
There was a problem hiding this comment.
Thinking more about this - we should separate out the path that loads the data store / DDS data from the path that loads the application logic. And we should be able to tell these apart from telemetry. I would suggest to not use data process error for the latter because that should not cause any problems with the data (it does fail summaries today but that is an unexpected scenario and will be fixed soon)
| summarizeNow(createSummarizerResult.summarizer, "test"), | ||
| "Summarizing should not throw", | ||
| ); | ||
| // Create a summarizer for the container and do a summary shouldn't throw. |
There was a problem hiding this comment.
Interesting test. I know you are not changing this but this comment isn't clear. I would suggest updating this to say that the data store is realized / loaded during summarization but is not initialized and hence summarization succeeds.
| "Non interactive/summarizer client's data object should not be initialized", | ||
| "Loading data store in summarizer did not throw as it should, or threw an unexpected error.", | ||
| ); | ||
| // In summarizer, load the data store should fail. |
There was a problem hiding this comment.
Update this comment to say that this throws an error and also logs the error in itExpects
| "Initial creation of container and data store should succeed.", | ||
| ); | ||
| }); | ||
| // Load second container, load the data store will also call initializingFromExisting and succeed. |
There was a problem hiding this comment.
An observation, don't need to change this I guess - This step seems unnecessary.
Description
Component load failures are one of the top scenarios our partners need to be able to monitor and diagnose in production. This PR closes a long-standing gap where errors thrown during entryPoint initialization (e.g. for
PureDataObject, that includes the initalization hooks) are not logged to FF tables.While host applications can and do log these errors, there is context that's only readily available within FF that is really important for investigation - All the common ContainerRuntime props (notably clientType - is it the Summarizer?), as well as the DataStore's packagePath (indicates the component type) and id.
This PR logs an error event with all these props, and additionally annotates the error object* with the dataStore package/id info.
* Note: In testing I found that these telemetry props get stripped off on the way up by roundtripping through
exceptionToResponseandresponseToException. A bit curious to look into that flow and if we could cut out those tranforms.