PR #17229 applied an implementation change to the BufferedTokenizer. It switched from RubyArray and JRuby classes to pure Java, so the `extract` method now returns an `Iterable` instead of a `RubyArray`.
The implementation is split in two parts:
- an inner iterator that is responsible for accumulating the fragments coming from the
`extract` method invocation
- an outer iterator that decorates the first and simply applies the
`sizeLimit` verification, throwing an exception when it receives a token longer than `sizeLimit`.
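To make the two-part structure concrete, here is a minimal sketch of that design. The class and method names are assumptions for illustration, not the actual Logstash classes; only the shape (accumulator that dices by separator, wrapped by a decorating iterator that enforces `sizeLimit`) reflects the description above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the two-part design; not the real BufferedTokenizer.
class SimpleTokenizer {
    private final StringBuilder accumulator = new StringBuilder();
    private final String separator;
    private final int sizeLimit;

    SimpleTokenizer(String separator, int sizeLimit) {
        this.separator = separator;
        this.sizeLimit = sizeLimit;
    }

    // Inner part: append the fragment and dice complete tokens by separator.
    private List<String> dice(String fragment) {
        accumulator.append(fragment);
        List<String> tokens = new ArrayList<>();
        int idx;
        while ((idx = accumulator.indexOf(separator)) >= 0) {
            tokens.add(accumulator.substring(0, idx));
            accumulator.delete(0, idx + separator.length());
        }
        return tokens;
    }

    // Outer part: decorate the token stream with the sizeLimit check,
    // throwing only when an oversized token is actually consumed.
    Iterable<String> extract(String fragment) {
        Iterator<String> inner = dice(fragment).iterator();
        return () -> new Iterator<String>() {
            @Override public boolean hasNext() { return inner.hasNext(); }
            @Override public String next() {
                String token = inner.next();
                if (token.length() > sizeLimit) {
                    throw new IllegalStateException("input buffer full");
                }
                return token;
            }
        };
    }
}
```

Note that in this shape the `sizeLimit` check only runs on *complete* tokens emitted by the inner part, which is exactly where the problem described below comes from.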
Although the implementation seems more linear, it reintroduces a problem that surfaced in logstash-plugins/logstash-codec-json_lines#43.
The inner iterator uses a StringBuilder as the accumulator, and on every call of `extract` it dices the tokens by separator.
Now, if a very big payload is pushed to a codec that uses this BufferedTokenizer, and the payload requires multiple invocations of `extract` before a token results (which could also never happen, e.g. an offending client that never sends the separator), memory fills with data that is useless and would be rejected by the next iterator, but it never reaches that iterator because the process goes OOM first.
In the previous implementation, when the tokenization part first detected an overrun of the `sizeLimit`, it dropped every fragment until the next separator appeared, protecting against OOM conditions like this.
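The drop-until-next-separator protection described above can be sketched as follows. This is an illustrative approximation, not the actual previous BufferedTokenizerExt code (for simplicity it drops silently, without the exception the real implementation raised); the key point is that the accumulator is bounded even when no separator ever arrives.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the previous protection: once sizeLimit is overrun,
// discard fragments until the next separator, so memory stays bounded.
class DroppingTokenizer {
    private final StringBuilder accumulator = new StringBuilder();
    private final String separator;
    private final int sizeLimit;
    private boolean dropping = false;

    DroppingTokenizer(String separator, int sizeLimit) {
        this.separator = separator;
        this.sizeLimit = sizeLimit;
    }

    List<String> extract(String fragment) {
        if (dropping) {
            int sep = fragment.indexOf(separator);
            if (sep < 0) {
                return List.of();  // still mid-overrun: discard the fragment
            }
            dropping = false;      // separator seen: resume after it
            fragment = fragment.substring(sep + separator.length());
        }
        accumulator.append(fragment);
        List<String> tokens = new ArrayList<>();
        int idx;
        while ((idx = accumulator.indexOf(separator)) >= 0) {
            String token = accumulator.substring(0, idx);
            accumulator.delete(0, idx + separator.length());
            if (token.length() <= sizeLimit) {
                tokens.add(token);
            }
        }
        if (accumulator.length() > sizeLimit) {
            accumulator.setLength(0);  // overrun with no separator: free memory
            dropping = true;
        }
        return tokens;
    }
}
```

An offending client that never sends the separator only ever costs at most `sizeLimit` bytes of accumulator here, whereas the new two-iterator design keeps appending indefinitely.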
Now, my questions are: