Added tika text extraction support for lucene indexing #1277
mitjale wants to merge 12 commits into gitblit-org:master
Default query search operator set to AND. Added properties settings and maintained compatibility with the default settings from the previous version.
flaix
left a comment
I am pretty sure that not every line of the class LuceneService changed. It looks like you may have changed indentation. This makes it impossible to review changes.
Never change indentation, line endings, whitespace and such when making functional changes. Always keep functional changes separate and reviewable.
TomaszSzt
left a comment
I can see you noticed that passing byte[] data can crash the server when a large file is processed. Good.
Yet you still return a String from extractText, and since you don't limit the size of the input file, users can still abuse it to crash the server by supplying files large enough to cause an OutOfMemoryError. Supplying a PDF of 100 MB or more is no problem.
Either use streaming Tika or restrict the number of bytes passed to Tika by using an input stream wrapper. Make the limit configurable if you decide on it.
I miss this functionality too, because its absence is what prevents Gitblit from being a knowledge management system for non-coders, but this implementation will crash the server with an OutOfMemoryError in a very unpredictable way.
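To illustrate the input stream wrapper idea, here is a minimal sketch using only the JDK. The class name `LimitedInputStream` and the limit value are hypothetical and not part of this PR. The wrapper reports end of stream once the configured number of bytes has been read, so whatever consumes it never sees more than that.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical wrapper (not from this PR): exposes at most maxBytes of the
 * underlying stream and then reports end of stream.
 */
public class LimitedInputStream extends FilterInputStream {
    private long remaining;

    public LimitedInputStream(InputStream in, long maxBytes) {
        super(in);
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must be >= 0");
        }
        this.remaining = maxBytes;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1; // limit reached: pretend the stream has ended
        }
        int b = super.read();
        if (b >= 0) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (remaining <= 0) {
            return -1; // limit reached
        }
        // Never ask the underlying stream for more than the remaining budget.
        int n = super.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(Math.min(n, remaining));
        remaining -= skipped;
        return skipped;
    }

    @Override
    public int available() throws IOException {
        return (int) Math.min(super.available(), remaining);
    }

    // mark/reset are disabled so the byte budget cannot be rewound.
    @Override
    public boolean markSupported() {
        return false;
    }

    @Override
    public synchronized void mark(int readLimit) {
        // intentionally a no-op
    }

    @Override
    public synchronized void reset() throws IOException {
        throw new IOException("mark/reset not supported");
    }

    public static void main(String[] args) throws IOException {
        byte[] blob = new byte[1000]; // stands in for a large document
        int total = 0;
        try (InputStream in = new LimitedInputStream(new ByteArrayInputStream(blob), 100)) {
            byte[] buf = new byte[64];
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        }
        System.out.println("bytes read: " + total);
    }
}
```

The stream handed to Tika would be wrapped this way, with the limit taken from a setting. Note that silently truncating a binary document may make the parser fail, so skipping files above the limit could be the safer choice. This only bounds the input. The extracted text may need its own cap, for example via `Tika.setMaxStringLength` if the `Tika` facade is used.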
I've added text extraction support for Lucene indexing using Apache Tika parsers, which enables search on PDF and office documents.