@rombert rombert commented Dec 22, 2025

No description provided.

@rombert rombert marked this pull request as draft December 22, 2025 13:19

rombert commented Dec 22, 2025

Marking draft as this seems to break full text indexing of PDFs.

rombert commented Dec 22, 2025

There is a class loading issue between tika-core and tika-parsers; until now we did not explicitly configure any class from tika-parsers:

java.lang.ClassNotFoundException: org.apache.tika.parser.pdf.PDFParser not found by org.apache.tika.core [81]
	at org.apache.felix.framework.BundleWiringImpl.findClassOrResourceByDelegation(BundleWiringImpl.java:1591)
	at org.apache.felix.framework.BundleWiringImpl.access$300(BundleWiringImpl.java:79)
	at org.apache.felix.framework.BundleWiringImpl$BundleClassLoader.loadClass(BundleWiringImpl.java:1976)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:490)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:547)
	at org.apache.tika.config.ServiceLoader.getServiceClass(ServiceLoader.java:235)
	at org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:628)
	at org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:589)
	at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:198)
	at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:177)
	at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:170)
	at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:166)
	at org.apache.jackrabbit.oak.plugins.index.search.spi.binary.FulltextBinaryTextExtractor.initializeTikaConfig(FulltextBinaryTextExtractor.java:304)
	at org.apache.jackrabbit.oak.plugins.index.search.spi.binary.FulltextBinaryTextExtractor.createDefaultParser(FulltextBinaryTextExtractor.java:332)
	at org.apache.jackrabbit.oak.plugins.index.search.spi.binary.FulltextBinaryTextExtractor.<clinit>(FulltextBinaryTextExtractor.java:69)
	at org.apache.jackrabbit.oak.plugins.index.search.spi.editor.FulltextIndexEditorContext.createBinaryTextExtractor(FulltextIndexEditorContext.java:126)
	at org.apache.jackrabbit.oak.plugins.index.search.spi.editor.FulltextIndexEditorContext.getTextExtractor(FulltextIndexEditorContext.java:238)
	at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorContext.newDocumentMaker(LuceneIndexEditorContext.java:62)
	at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorContext.newDocumentMaker(LuceneIndexEditorContext.java:36)
	at org.apache.jackrabbit.oak.plugins.index.search.spi.editor.FulltextIndexEditor.makeDocument(FulltextIndexEditor.java:277)
	at org.apache.jackrabbit.oak.plugins.index.search.spi.editor.FulltextIndexEditor.addOrUpdate(FulltextIndexEditor.java:254)
	at org.apache.jackrabbit.oak.plugins.index.search.spi.editor.FulltextIndexEditor.leave(FulltextIndexEditor.java:132)
	at org.apache.jackrabbit.oak.spi.commit.CompositeEditor.leave(CompositeEditor.java:67)
	at org.apache.jackrabbit.oak.plugins.index.progress.ProgressTrackingEditor.leave(ProgressTrackingEditor.java:72)
	at org.apache.jackrabbit.oak.spi.commit.VisibleEditor.leave(VisibleEditor.java:59)
	at org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeAdded(EditorDiff.java:129)
	at org.apache.jackrabbit.oak.plugins.memory.EmptyNodeState.compareAgainstEmptyState(EmptyNodeState.java:160)
	at org.apache.jackrabbit.oak.segment.SegmentNodeState.compareAgainstBaseState(SegmentNodeState.java:502)
	at org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeAdded(EditorDiff.java:124)
	at org.apache.jackrabbit.oak.plugins.memory.EmptyNodeState.compareAgainstEmptyState(EmptyNodeState.java:160)
	at org.apache.jackrabbit.oak.segment.SegmentNodeState.compareAgainstBaseState(SegmentNodeState.java:502)
	at org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeAdded(EditorDiff.java:124)
	at org.apache.jackrabbit.oak.plugins.memory.EmptyNodeState.compareAgainstEmptyState(EmptyNodeState.java:160)
	at org.apache.jackrabbit.oak.segment.SegmentNodeState.compareAgainstBaseState(SegmentNodeState.java:502)
	at org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeAdded(EditorDiff.java:124)
	at org.apache.jackrabbit.oak.plugins.memory.EmptyNodeState.compareAgainstEmptyState(EmptyNodeState.java:160)
	at org.apache.jackrabbit.oak.segment.SegmentNodeState.compareAgainstBaseState(SegmentNodeState.java:502)
	at org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeAdded(EditorDiff.java:124)
	at org.apache.jackrabbit.oak.plugins.memory.EmptyNodeState.compareAgainstEmptyState(EmptyNodeState.java:160)
	at org.apache.jackrabbit.oak.segment.SegmentNodeState.compareAgainstBaseState(SegmentNodeState.java:502)
	at org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeAdded(EditorDiff.java:124)
	at org.apache.jackrabbit.oak.plugins.memory.EmptyNodeState.compareAgainstEmptyState(EmptyNodeState.java:160)
	at org.apache.jackrabbit.oak.segment.SegmentNodeState.compareAgainstBaseState(SegmentNodeState.java:502)
	at org.apache.jackrabbit.oak.spi.commit.EditorDiff.process(EditorDiff.java:51)
	at org.apache.jackrabbit.oak.plugins.index.IndexUpdate.enter(IndexUpdate.java:180)
	at org.apache.jackrabbit.oak.spi.commit.VisibleEditor.enter(VisibleEditor.java:53)
	at org.apache.jackrabbit.oak.spi.commit.EditorDiff.process(EditorDiff.java:48)
	at org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.updateIndex(AsyncIndexUpdate.java:814)
	at org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.runWhenPermitted(AsyncIndexUpdate.java:592)
	at org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.run(AsyncIndexUpdate.java:444)
	at org.apache.sling.commons.scheduler.impl.QuartzJobExecutor.execute(QuartzJobExecutor.java:349)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
	at org.apache.sling.commons.scheduler.impl.QuartzThreadPool.lambda$runInThread$0(QuartzThreadPool.java:83)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1090)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:614)
	at java.base/java.lang.Thread.run(Thread.java:1474)

Adjust the class loader used for loading Tika configurations to allow configuring the PDFParser.
By default, Tika does not use the context class loader, so we plug it into the existing abstraction.

This effectively substitutes the oak-lucene class loader for the tika-core class loader, given that
the FulltextBinaryTextExtractor ends up being embedded in oak-lucene.
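
To illustrate the idea in the commit message, here is a minimal, standalone sketch of why the choice of class loader decides whether a class named in a configuration can be resolved. It does not use the Tika or OSGi APIs and is not the code from this pull request; `ContextClassLoaderSketch`, `canLoad` and `withContextClassLoader` are hypothetical names, and the "isolated" loader only stands in for a bundle class loader that cannot see the requested class.

```java
import java.util.concurrent.Callable;

public class ContextClassLoaderSketch {

    // Hypothetical helper: reports whether 'loader' can resolve 'className'.
    static boolean canLoad(String className, ClassLoader loader) {
        try {
            // 'false' skips static initialization; only resolution matters here.
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    // Hypothetical helper: runs 'task' with 'loader' installed as the thread
    // context class loader and restores the previous loader afterwards.
    static <T> T withContextClassLoader(ClassLoader loader, Callable<T> task) throws Exception {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        current.setContextClassLoader(loader);
        try {
            return task.call();
        } finally {
            current.setContextClassLoader(previous);
        }
    }

    public static void main(String[] args) throws Exception {
        String name = ContextClassLoaderSketch.class.getName();

        // Loader that can see this class (plays the role of the embedding bundle).
        ClassLoader visible = ContextClassLoaderSketch.class.getClassLoader();
        // Loader with no parent: sees only bootstrap classes (plays the role of
        // a bundle that does not import the requested package).
        ClassLoader isolated = new ClassLoader(null) { };

        System.out.println("isolated loader resolves class: " + canLoad(name, isolated));
        System.out.println("visible loader resolves class:  " + canLoad(name, visible));

        // Code that looks classes up through the context class loader succeeds
        // once a loader with the right visibility is plugged in.
        boolean viaContext = withContextClassLoader(visible,
                () -> canLoad(name, Thread.currentThread().getContextClassLoader()));
        System.out.println("context loader resolves class:  " + viaContext);
    }
}
```

The same lookup fails or succeeds depending only on which loader is asked, which mirrors the `ClassNotFoundException` raised when tika-core's own loader is asked for a class that lives in tika-parsers.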
@rombert rombert marked this pull request as ready for review December 22, 2025 15:23

@thomasmueller thomasmueller left a comment

This looks reasonable to me. I'm afraid I'm not an expert on Tika or class loading, but I don't see any obvious errors.

@thomasmueller thomasmueller self-requested a review December 22, 2025 16:06
@rombert rombert marked this pull request as draft December 22, 2025 18:57

rombert commented Dec 22, 2025

There are still some class loading issues to clarify.
