Provide access to a wrapper that validates that bytes being read are valid UTF-8 #947
dralley wants to merge 10 commits into tafia:master from
Conversation
    fn from(error: IoError) -> Error {
-       Self::Io(Arc::new(error))
+       match error.kind() {
+           IoErrorKind::InvalidData => Self::Encoding(error.downcast::<EncodingError>().expect(
Apparently stabilized in 1.79 https://doc.rust-lang.org/std/io/struct.Error.html#method.downcast
So we would need to bump MSRV.
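For illustration, a minimal sketch of the downcast pattern under discussion, which requires Rust 1.79+. `EncodingError` here is a hypothetical stand-in for quick-xml's real error type, and `classify` is purely illustrative:

```rust
use std::fmt;
use std::io::{Error as IoError, ErrorKind as IoErrorKind};

// Hypothetical stand-in for quick-xml's EncodingError, just for illustration.
#[derive(Debug)]
struct EncodingError(String);

impl fmt::Display for EncodingError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "encoding error: {}", self.0)
    }
}

impl std::error::Error for EncodingError {}

fn classify(error: IoError) -> String {
    match error.kind() {
        // `io::Error::downcast` was stabilized in Rust 1.79; it recovers the
        // concrete error type that was wrapped with `io::Error::new`.
        IoErrorKind::InvalidData => match error.downcast::<EncodingError>() {
            Ok(enc) => format!("encoding: {}", enc.0),
            Err(other) => format!("io: {other}"),
        },
        _ => format!("io: {error}"),
    }
}

fn main() {
    let wrapped = IoError::new(IoErrorKind::InvalidData, EncodingError("not UTF-8".into()));
    assert_eq!(classify(wrapped), "encoding: not UTF-8");
}
```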
If you rebase your PR for whatever reason, it would be good to fold this into the commit that bumps the MSRV, so we can quickly find the reason later if required.
src/encoding.rs
Outdated
        }
    }

    impl<R: Read> Read for Utf8ValidatingReader<R> {
I need to do more testing and review on this implementation myself as well; I don't trust it yet.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #947 +/- ##
==========================================
+ Coverage 55.00% 57.64% +2.64%
==========================================
Files 44 45 +1
Lines 16816 18258 +1442
==========================================
+ Hits 9249 10525 +1276
- Misses 7567 7733 +166
    ///
    /// [`Utf8BytesReader`]: crate::encoding::Utf8BytesReader
    #[cfg(not(feature = "encoding"))]
    pub fn from_reader_validating(reader: R) -> Self {
New constructors vs. opt-in feature is an open question, I have no particularly strong feelings. I leaned towards this route because Reader::from_reader() is frequently used with slices, and in that case we would be throwing a buffer into the mix which would require signatures changing, etc... I wanted to avoid changing public API as much as possible for now.
I think that, in the choice between new constructors and an opt-in feature, constructors are better. However, it seems they are unnecessary? Users can construct a Utf8BytesReader and pass it to the existing constructors, right? So probably just documenting that approach in the Reader documentation would be enough.
Also, because Utf8BytesReader itself is not guarded by the encoding feature, and in the future, as I understand it, you plan to add transparent decoding to it (when the encoding feature is enabled), we may expect this constructor to return a reader that always yields UTF-8 encoded events, even when the source XML is not in a UTF-8 encoding. So I think from_reader_validating is not the right name for what it will eventually do (of course, with the current feature guard its name reflects its purpose). Something like from_reader_transcoding would be better? Not a great name either, though, which is why I suggest just not adding those constructors at all.
Mingun
left a comment
It seems to me that, rather than implementing our own decoding stream, we could use encoding_rs_io or a similar library to do the conversion.
I also just created #948 to show where we could change the reader from one that guesses the encoding to one that transparently decodes. It seems to me that honest decoding should be introduced at the level of the slightly higher-level XmlReader.
src/encoding.rs
Outdated
    // Read more data from the underlying reader
    let read_size = buf.len().max(64); // Read at least 64 bytes for efficiency
    let mut temp = vec![0u8; read_size];
Creating a new Vec in a loop on the hot path will ruin performance. You already have self.buffer, why not read into it? Or at least read into a fixed-size buffer. Or, even better, require BufRead and get buffers from it.
Yes, that is correct. As mentioned, this is really a proof-of-concept level implementation; it needs some iteration before being production-ready. But I can address this in the next push.
src/reader/mod.rs
Outdated
    if let Some(encoding) = $reader.detect_encoding() $(.$await)? ? {
        if $self.state.encoding.can_be_refined() {
-           $self.state.encoding = crate::reader::EncodingRef::BomDetected(encoding);
+           $self.state.encoding = crate::reader::EncodingRef::BomDetected(encoding.encoder());
I'm not sure this was ever strictly correct: detect_encoding() would return Some((encoder, bom_len)) even when there was no BOM per se. So for, e.g., an encoding similar to or compatible with UTF-8 that is not UTF-8, I would think this is providing false certainty about BOM presence.
edit: to be clear, if it was wrong before, it's still wrong.
    /// |`3C 3F 78 6D`|UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS, EUC, or any other 7-bit, 8-bit, or mixed-width encoding which ensures that the characters of ASCII have their normal positions, width, and values; the actual encoding declaration must be read to detect which of these applies, but since all of these encodings use the same bit patterns for the relevant ASCII characters, the encoding declaration itself may be read reliably
    #[cfg(feature = "encoding")]
-   pub fn detect_encoding(bytes: &[u8]) -> Option<(&'static Encoding, usize)> {
+   pub fn detect_encoding(bytes: &[u8]) -> Option<DetectedEncoding> {
Maybe extract this commit into a separate PR? It seems pretty isolated
Force-pushed from c2dcb4b to a6fe9e5
    let event = BytesDecl::from_start(BytesStart::wrap(content, 3, self.decoder()));

    // TODO: once we can assume that the parser is operating on UTF-8, then we can throw
    // an error here if we see a non-UTF-8 encoding... if encoding/decoding is not enabled.
We can't assume this yet because it's still possible to create non-validating readers under `[cfg(not(feature = "encoding"))]`.
Force-pushed from f302282 to 2a1bbb0
@Mingun Ignore for the time being the exact details of the Architecturally, do you have any objections to the direction of this work? The general approach, how it's layered, etc. Or the (long-term) goal of taking all decoding, encoding detection and/or validation, and BOM detection and/or stripping (and maybe EOL normalization) OUT of the parser itself and into a pre-processor, provided doing so either improves performance or has minimal cost?
Yes, I agree, that's the path I was going down on the initial implementation. But assuming we want to continue keeping a separate edit: probably this path could also be implemented with
I still think the better approach is to just push all preprocessing down underneath the parsing code, and take advantage of the simplifications that makes possible. And the ability to be able to parse UTF-16. Yes, maybe you can skip over performing some validation that way, but there is also a cost to running the validation (and maybe allocations) many times over small buffers instead of once over a large buffer. Not to mention the additional internal complexity of
Feature-wise, this PR is now complete. It's just a matter of improving the quality of Global decoding, improving APIs, etc. -- all of that is for future PRs. It should be able to be bolted onto this infrastructure though.
src/encoding.rs
Outdated
    /// A reader wrapper that ensures only valid UTF-8 bytes are read.
    ///
    /// This reader uses [`str::from_utf8()`] and [`Utf8Error::valid_up_to()`] to validate
I'd like to also try simdutf8
Probably this could also be done with encoding_rs. If you are open to always requiring encoding_rs, then we could ditch this implementation completely. I just figured that we probably wouldn't want to do that.
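For reference, the valid_up_to()-based splitting that the quoted doc comment describes can be sketched as follows. `split_valid` is an illustrative name, not part of the PR; the key detail is that `error_len() == None` distinguishes an incomplete trailing character from genuinely invalid bytes:

```rust
// Sketch of the chunk-validation idea: `str::from_utf8` either accepts the
// whole slice or reports via `Utf8Error::valid_up_to` how many leading bytes
// were valid. `error_len() == None` means the error is only an incomplete
// trailing sequence, not truly invalid data.
fn split_valid(chunk: &[u8]) -> Result<(&str, &[u8]), usize> {
    match std::str::from_utf8(chunk) {
        Ok(s) => Ok((s, &[])),
        Err(e) => {
            let valid = e.valid_up_to();
            match e.error_len() {
                // Incomplete multi-byte char at the end: carry the tail over.
                None => {
                    let s = std::str::from_utf8(&chunk[..valid]).unwrap();
                    Ok((s, &chunk[valid..]))
                }
                // Genuinely invalid byte at offset `valid`.
                Some(_) => Err(valid),
            }
        }
    }
}

fn main() {
    // "é" is 0xC3 0xA9; cut after the lead byte to simulate a chunk boundary.
    let (valid, rest) = split_valid(b"abc\xC3").unwrap();
    assert_eq!(valid, "abc");
    assert_eq!(rest, b"\xC3");
    // 0xFF can never appear in UTF-8.
    assert_eq!(split_valid(b"ab\xFFcd"), Err(2));
}
```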
Force-pushed from eb13980 to 65aae52
I'm working on a more efficient implementation that assumes
(not used yet)
It does not currently do any decoding, but this provides a place where the validating and decoding functionality can be abstracted.
The goal is to adopt this functionality into the standard constructors, but backwards compatibility is tricky - this gives more room to experiment first.
Reader::from_reader_validating()
Reader::from_file_validating()
NsReader::from_reader_validating()
NsReader::from_file_validating()
(when the "encoding" feature is not enabled)
Also, add some tests for validation errors being bubbled up through the reader.
Do all of the plumbing necessary to return EncodingError directly from Utf8ValidatingReader using IoError::InvalidData + error downcasting. The Utf8 variant of EncodingError now holds an error enum, as we cannot create instances of Utf8Error ourselves.
Have it return a generic enum so that it can be used with or without the encoding feature.
In cases where the input is sufficiently short and doesn't contain invalid sequences, Utf8ValidatingReader was unable to detect the input as being not-UTF-8. We now call detect_encoding() during the first read() so that it can more effectively raise the appropriate errors. Doing this (and BOM stripping) upstream of the parser makes it possible to eliminate this responsibility from the parser, once it can be relied upon on all code paths.
Force-pushed from 826a7e5 to f193173
It simplifies the implementation to just require BufRead.
@Mingun I left the original implementation commit intact, but the new implementation in "Make Utf8ValidatingReader use BufRead" validates UTF-8 at 30 GB per second (as per the included benchmarks), which is more than 15x faster than the original implementation. Total memory bandwidth through BufReader is otherwise about 90 GBps. Since parsing benchmarks rarely breach 500 MBps, I would say that is insignificant overhead. Though, I'm now working on the decoding variant using encoding_rs, and the implementation is simpler and still performs well. So maybe just using the one implementation w/ encoding_rs is better and more maintainable, but it would be at the expense of always carrying that dependency.
@Mingun Please take a look at my comments when you get a moment, I'd like to move this work along. As mentioned, I remain pretty flexible on the implementation (e.g. if you'd prefer to just use
I will try to find time to review this and the other latest issues / PRs at the end of the week.
Mingun
left a comment
Architecturally, do you have any objections to the direction of this work? The general approach, how it's layered, etc.
Architecturally, I suggest that the parser internals still work with [u8], since they do not require UTF-8 guarantees to work. We only need the guarantee of an XML-compatible encoding^1, and these are all single-byte encodings that encoding_rs supports. On the other hand, users usually work with UTF-8 text and would like an API that operates on str / String rather than [u8] / Vec<u8>. I think this is quite achievable, but we would have to introduce duplicate structures for event types or somehow parametrize the existing ones.
We can achieve correctness now^2 by substituting an R into Reader<R> which transcodes the readable data to UTF-8 on the fly or, as in the case of this PR, checks that the data is correctly encoded UTF-8 (which is also a kind of zero-cost transcoding).
In this sense, I agree with the Utf8ValidatingReader approach implemented here. To further develop and provide the str-API, I assume one should move towards implementing new methods in

    impl<R> Reader<Utf8ValidatingReader<R>> {
        // str-returning counterparts of the existing methods
    }

which would call the existing [u8]-based methods and convert the result to str (using unsafe methods, since we will have checked that the input was UTF-8).
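A hedged sketch of that str-API idea, with `read_event_bytes` / `read_event_str` as hypothetical stand-ins for the real methods:

```rust
// Hypothetical sketch: once a validating reader guarantees every byte it
// hands out is valid UTF-8, a str-returning wrapper can skip re-validation.
// `read_event_bytes` stands in for the existing `[u8]`-based method.
fn read_event_bytes() -> Vec<u8> {
    b"<root/>".to_vec()
}

fn read_event_str() -> String {
    let bytes = read_event_bytes();
    // SAFETY: bytes came through a UTF-8-validating reader, so they are
    // guaranteed to be valid UTF-8 and the check can be elided.
    unsafe { String::from_utf8_unchecked(bytes) }
}

fn main() {
    assert_eq!(read_event_str(), "<root/>");
}
```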
Or the (long-term) goal of taking all decoding, encoding detection and/or validation, and BOM detection and/or stripping (and maybe EOL normalization) OUT of the parser itself and into a pre-processor, provided doing so either improves performance or has minimal cost?
This is probably closer to my vision, but more research is needed here. For example, we want to report offsets in units of the original byte source, not in units of the source after normalization/decoding, which is hidden from the user.
I still think the better approach is to just push all preprocessing down underneath the parsing code, and take advantage of the simplifications that makes possible. And the ability to be able to parse UTF-16.
Yes, #948 allows just that. If the first event is an XML declaration (and the XML standard says the XML declaration must be the very first thing; even empty lines before it are not allowed, which, by the way, allows detecting UTF-16-like encodings without a BOM), then we can extract the encoding from it and replace the reader in EntityReader with another one that decodes in large chunks before parsing (since a Box<dyn BufRead> is stored there, the type will not change).
edit: probably this path could also be implemented with encoding_rs, it would just require carrying that dependency always. I am flexible on the details.
In theory, we could use encoding_rs_io or encoding_rs_rw if we submit a PR there making their dependency on encoding_rs optional; the only supported operation in the absence of encoding_rs would be checking that the input is UTF-8 using standard library methods.
I chose the "Request changes" variant because I feel that at least the following can be changed:
- use `BufReader::with_capacity` instead of `ChunkedReader`
- use `BufRead` in `Utf8BytesReader`
- `detect_encoding` could be extracted to a separate PR
- I think the new constructors are redundant now. We can always add them later if required
Footnotes

1. That is, those in which XML markup characters have the same codes as in ASCII / UTF-8 ↩
2. Of course, currently we should know the encoding in advance. If we want to change it after reading the XML declaration or BOM, then we will need to change the `Reader` from which we read. I believe it will be easier to do after the introduction of a new XML reader in #948 ↩
    /// Helper reader that returns data in fixed-size chunks
    struct ChunkedReader<'a> {
        data: &'a [u8],
        pos: usize,
        chunk_size: usize,
    }

    impl<'a> ChunkedReader<'a> {
        fn new(data: &'a [u8], chunk_size: usize) -> Self {
            Self {
                data,
                pos: 0,
                chunk_size,
            }
        }
    }

    impl<'a> Read for ChunkedReader<'a> {
        fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
            if self.pos >= self.data.len() {
                return Ok(0);
            }
            let len = self
                .chunk_size
                .min(buf.len())
                .min(self.data.len() - self.pos);
            buf[..len].copy_from_slice(&self.data[self.pos..self.pos + len]);
            self.pos += len;
            Ok(len)
        }
    }
That is not necessary; you may use BufReader::with_capacity, the effect will be the same.
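To illustrate the reviewer's point, `fill_buf()` on a `BufReader::with_capacity` already yields fixed-size chunks over an in-memory slice. The helper name below is purely illustrative:

```rust
use std::io::{BufRead, BufReader};

// Collect the chunk sizes that fill_buf() exposes for a given capacity.
fn chunk_sizes(data: &[u8], capacity: usize) -> Vec<usize> {
    let mut reader = BufReader::with_capacity(capacity, data);
    let mut sizes = Vec::new();
    loop {
        let n = reader.fill_buf().unwrap().len();
        if n == 0 {
            break;
        }
        sizes.push(n);
        reader.consume(n);
    }
    sizes
}

fn main() {
    // 25 bytes through a 10-byte BufReader arrive as 10, 10, 5: the same
    // behaviour the hand-rolled ChunkedReader provided.
    assert_eq!(chunk_sizes(b"abcdefghijklmnopqrstuvwxy", 10), vec![10, 10, 5]);
}
```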
    loop {
        let mut buf = [0u8; 10];
        let n = reader.read(&mut buf).unwrap();
        if n == 0 {
            break;
        }
        result.extend_from_slice(&buf[..n]);
    }
You could use a stricter check here: we expect that only two calls to read are needed to get the data, and the third call should return 0.
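The stricter shape the reviewer suggests might look like this, using a plain slice reader for illustration (the validating wrapper should behave the same on such an input):

```rust
use std::io::Read;

fn main() {
    // With a 15-byte input and a 10-byte destination buffer, exactly two
    // reads return data and the third must return 0.
    let data = b"0123456789abcde";
    let mut reader = &data[..];
    let mut buf = [0u8; 10];

    assert_eq!(reader.read(&mut buf).unwrap(), 10);
    assert_eq!(reader.read(&mut buf).unwrap(), 5);
    assert_eq!(reader.read(&mut buf).unwrap(), 0);
}
```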
    if !self.encoding_checked {
        self.encoding_checked = true;

        let available = self.inner.fill_buf()?;
        // detect_encoding uses starts_with, so patterns longer than the
        // available data simply won't match — no length guard needed.
        if let Some(detected) = detect_encoding(available) {
Isn't encoding_checked set too early? If the first returned buffer does not give us enough bytes to detect the encoding, we will never detect it at all. That is possible even for a BufReader with a large buffer, because fill_buf may continuously return a small buffer until it is consumed. That was the reason for #939.
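A small demonstration of the concern, using an illustrative BufRead whose fill_buf() never exposes more than one byte: the two-byte UTF-16LE BOM is invisible to a single detection attempt.

```rust
use std::io::{self, BufRead, Read};

// Illustrative BufRead whose fill_buf never exposes more than one byte,
// mimicking the pathological case from the review (and #939).
struct OneByte<'a> {
    data: &'a [u8],
}

impl<'a> Read for OneByte<'a> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.fill_buf()?.len().min(buf.len());
        buf[..n].copy_from_slice(&self.data[..n]);
        self.consume(n);
        Ok(n)
    }
}

impl<'a> BufRead for OneByte<'a> {
    fn fill_buf(&mut self) -> io::Result<&[u8]> {
        Ok(&self.data[..self.data.len().min(1)])
    }
    fn consume(&mut self, amt: usize) {
        self.data = &self.data[amt..];
    }
}

fn main() {
    // The UTF-16LE BOM is FF FE, but the first fill_buf only shows FF, so a
    // detector that checks once and sets `encoding_checked = true` would
    // never see the full BOM.
    let mut reader = OneByte { data: b"\xFF\xFEh\0i\0" };
    assert_eq!(reader.fill_buf().unwrap(), &[0xFFu8][..]);
}
```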
    /// Returns the expected total number of bytes in a UTF-8 character given its first byte
    /// (2, 3, or 4). Used to determine how many continuation bytes are needed to complete a
    /// pending incomplete sequence.
    fn utf8_char_width(first_byte: u8) -> usize {
May be const?
-   fn utf8_char_width(first_byte: u8) -> usize {
+   const fn utf8_char_width(first_byte: u8) -> usize {
Also, it seems that using comparisons will be faster (spied in utf8_width):
https://godbolt.org/z/oxsjY8nja
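A possible comparison-based const variant, assuming (per the quoted doc comment) the function is only called with lead bytes of multi-byte sequences:

```rust
// Comparison-based, const variant of the width helper from the review.
// Assumes the caller only passes lead bytes of multi-byte sequences
// (roughly 0xC2..=0xF4), as the original doc comment states (returns 2, 3, or 4).
const fn utf8_char_width(first_byte: u8) -> usize {
    if first_byte < 0xE0 {
        2 // 110xxxxx
    } else if first_byte < 0xF0 {
        3 // 1110xxxx
    } else {
        4 // 11110xxx
    }
}

fn main() {
    assert_eq!(utf8_char_width(0xC3), 2); // lead byte of "é"
    assert_eq!(utf8_char_width(0xE2), 3); // lead byte of "€"
    assert_eq!(utf8_char_width(0xF0), 4); // lead byte of most emoji
    // Usable in const context:
    const W: usize = utf8_char_width(0xF4);
    assert_eq!(W, 4);
}
```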
    /// misreported as invalid UTF-8.
    ///
    /// The caller must ensure that `bytes[..index]` contains valid UTF-8 data.
    fn floor_char_boundary(bytes: &[u8], index: usize) -> usize {
If it is possible to make it const, add const.
    match std::str::from_utf8(available) {
        Ok(_) => {
            // All available bytes are valid UTF-8. Copy as many complete characters as
            // fit in buf. We must land on a character boundary to avoid consuming a
            // partial character from the BufRead — otherwise the next fill_buf() would
            // start with orphaned continuation bytes, causing a false validation error.
            let len = floor_char_boundary(available, buf.len());
            if len == 0 {
                return Ok(0);
            }
            buf[..len].copy_from_slice(&available[..len]);
            self.inner.consume(len);
            return Ok(len);
        }
You validate some input multiple times here. Imagine that available is 8Kb but buf is only 4 bytes. You will validate 8192 bytes but consume only 1..=4. Then you validate 8188..=8191 bytes (which were already checked) and consume only 1..=4, and so on.
Limit available to buf.len() and validate only that part. That may be slower, because you will validate chunks whose size depends on the size of the output buffer. Maybe a better approach would be to:
- maintain the length of already-validated data
- return those bytes without checking when no more than that length is requested
- when new data comes in, validate only the new part
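The suggested watermark approach might be sketched like this, simplified to an in-memory buffer. `ValidatedBuf` and all names are hypothetical; a real implementation would also handle a multi-byte character straddling the requested end (via `error_len()`):

```rust
// Hedged sketch of the "remember what's already validated" idea. `validated`
// counts bytes known to be valid UTF-8; each call only validates bytes past
// that watermark, so repeated small reads don't re-validate old data.
struct ValidatedBuf<'a> {
    data: &'a [u8],
    pos: usize,         // bytes already handed out
    validated: usize,   // bytes already checked (>= pos)
    validations: usize, // how many times the validator actually ran
}

impl<'a> ValidatedBuf<'a> {
    fn new(data: &'a [u8]) -> Self {
        Self { data, pos: 0, validated: 0, validations: 0 }
    }

    fn read(&mut self, want: usize) -> Result<&'a [u8], usize> {
        let end = (self.pos + want).min(self.data.len());
        if end > self.validated {
            // Only the not-yet-validated suffix is checked.
            self.validations += 1;
            match std::str::from_utf8(&self.data[self.validated..end]) {
                Ok(_) => self.validated = end,
                Err(e) => return Err(self.validated + e.valid_up_to()),
            }
        }
        let out = &self.data[self.pos..end];
        self.pos = end;
        Ok(out)
    }
}

fn main() {
    let mut buf = ValidatedBuf::new("hello world".as_bytes());
    // Validate once up front by asking for a large prefix...
    assert_eq!(buf.read(11).unwrap(), b"hello world");
    assert_eq!(buf.validations, 1);
    // ...subsequent reads within the watermark cost no extra validation.
    assert_eq!(buf.read(5).unwrap(), b"");
    assert_eq!(buf.validations, 1);
}
```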
    /// Small buffer for incomplete UTF-8 sequences at BufRead boundaries.
    /// At most 3 bytes (the start of a 2, 3, or 4-byte sequence).
    pending: [u8; 3],
    pending_len: u8,
It seems that if you pull this code into a separate auxiliary struct, the algorithm will become clearer. The same approach is used in encoding_rs_rw.
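One possible shape for such an auxiliary struct (hypothetical, loosely modeled on the idea rather than on encoding_rs_rw's actual types):

```rust
// Hypothetical auxiliary struct for the 0-3 "pending" bytes of an
// incomplete UTF-8 sequence left at a BufRead boundary.
#[derive(Default)]
struct PendingUtf8 {
    bytes: [u8; 3],
    len: u8,
}

impl PendingUtf8 {
    /// Stash the (at most 3) leading bytes of an incomplete sequence.
    fn stash(&mut self, tail: &[u8]) {
        debug_assert!(tail.len() <= 3);
        self.bytes[..tail.len()].copy_from_slice(tail);
        self.len = tail.len() as u8;
    }

    /// Prepend the stashed bytes to the next chunk, clearing the stash.
    fn take_into(&mut self, out: &mut Vec<u8>) {
        out.extend_from_slice(&self.bytes[..self.len as usize]);
        self.len = 0;
    }
}

fn main() {
    let mut pending = PendingUtf8::default();
    // "é" = C3 A9, split across two chunks at the BufRead boundary.
    pending.stash(b"\xC3");
    let mut chunk = Vec::new();
    pending.take_into(&mut chunk);
    chunk.extend_from_slice(b"\xA9");
    assert_eq!(std::str::from_utf8(&chunk).unwrap(), "é");
}
```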
This is a small stepping stone towards #158
This adds 4 new constructors when `[cfg(not(feature = "encoding"))]`:

- `Reader::from_reader_validating()`
- `Reader::from_file_validating()`
- `NsReader::from_reader_validating()`
- `NsReader::from_file_validating()`

which wrap the reader with `Utf8BytesReader`. `Utf8BytesReader` is currently only implemented for `[cfg(not(feature = "encoding"))]`, as transparent decoding is currently out of scope for this PR. If `encoding` is enabled, it's just a pass-through shim implementation; the goal would be to eventually implement that side with decoding functionality instead of validation.

This PR does not and will not make any additional public API changes - those will only take place once the underpinnings are in place and confirmed sound.
Obviously the long-term plan is to not need these separate constructors, but given the scope of the work I thought it prudent to touch the current API surface as little as possible.