Add UTF-7 to replacement encoding list? / Encoding sniffing #68
Comments
If that is compatible, sounds fine. I know @hsivonen wants to greatly extend the list of replacement labels with all kinds of legacy encodings that might cause problems. The main issue is doing the research to see whether it is safe.
Whoa! Do you mean Blink now uses more unspecified heuristics than before even for non-Japanese locales? Why more heuristics? Why without a spec? ಠ_ಠ
It seems to me that it's bad for Blink to adopt a library that doesn't do what Blink needs to do as a black box and then tweak the output of the black box, as opposed to writing code that does what's needed in the first place. As for the issue of mapping the label "utf-7" to the replacement encoding generally, I think it's a matter of determining if real Web content relies on the label "utf-7" being unrecognized so as to fall back on an ASCII-compatible encoding. If Blink is willing to experiment with shipping a mapping from "utf-7" to replacement and assessing breakage, that would be cool.
Also, does existing Web content actually require ISO-2022-JP to be heuristically detected (as opposed to heuristically detecting between Shift_JIS and EUC-JP)? Wouldn't it be a security improvement not to detect ISO-2022-JP and not to let it be chosen from the menu (i.e. support it only when labeled)?
Also, letting a detector decide that input should be decoded as the replacement encoding is most likely a bad idea. Consider an unlabeled site whose markup template and site-provided content are all ASCII and that accepts comments. A user could cause denial of service by submitting a comment that looks like a UTF-7 or HZ byte sequence.
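To make the collision concrete, here is a minimal sketch (not Blink or detector code; the comment string is a made-up example) of how a byte sequence that is plain ASCII to the site author is simultaneously valid UTF-7 that decodes to something else entirely:

```python
# An all-ASCII comment that is also a valid UTF-7 byte sequence.
# "+ADw-" and "+AD4-" are UTF-7 base64 runs for "<" and ">".
ascii_comment = b"Nice article! +ADw-script+AD4-alert(1)+ADw-/script+AD4-"

print(ascii_comment.decode("ascii"))  # what the author sees: harmless ASCII
print(ascii_comment.decode("utf-7"))  # what a UTF-7 decoder sees: <script>alert(1)</script>
```

If a detector may classify such a page as UTF-7 (or map it to the replacement encoding), an attacker-supplied comment alone decides how every visitor's copy of the page gets decoded.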
Chromium got rid of the encoding menu completely, so the encoding cannot be chosen via UI any more. Blink did not add a new heuristic, but just replaced the ICU encoding detector with the Compact Encoding Detector. It only kicks in when there is no encoding label (meta / HTTP Content-Type charset).
@jungshik if the web requires such a detector, we should work towards standardizing it. Would Google be willing to work with us on that?
FYI, CED is open-sourced and available on GitHub. See https://github.com/google/compact_enc_det BTW, Blink was changed recently to treat auto-detected (unlabelled) 7-bit encodings (other than ISO-2022-JP) as ASCII. We came across @hsivonen's scenario ( #68 (comment) )
Both are good questions. I've just filed https://bugs.chromium.org/p/chromium/issues/detail?id=647582
And by "as ASCII" I guess you mean windows-1252? |
@jungshik, I have some more questions:
(Also, "is open-sourced and available in github" doesn't really answer @annevk's question about willingness to work towards standardizing.) |
I realized these questions are better answered by the person who ported the new detector to Blink (which is me):
The TLD is used to set the default encoding, which is chosen as the document's encoding if auto-detection fails or doesn't get triggered for whatever reason. So it is involved in the overall decoding process, but not in the detection step itself.
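For illustration, a minimal sketch of a TLD fallback of this kind (the table entries and function names are assumptions for the example, not Chromium's actual mapping):

```python
# Hypothetical TLD -> default-encoding table; values are illustrative only.
TLD_DEFAULTS = {
    "jp": "Shift_JIS",
    "ru": "windows-1251",
    "hu": "ISO-8859-2",
    "gr": "ISO-8859-7",
}

def default_encoding_for(host):
    tld = host.rsplit(".", 1)[-1].lower()
    return TLD_DEFAULTS.get(tld, "windows-1252")

def decide_encoding(host, detected=None):
    # The detector's result wins; the TLD default is only the fallback
    # used when detection fails or never runs.
    return detected if detected else default_encoding_for(host)

print(decide_encoding("example.jp"))            # Shift_JIS (TLD fallback)
print(decide_encoding("example.jp", "EUC-JP"))  # EUC-JP (detection succeeded)
```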
To be precise, that's out of the scope of the detector itself and rather a matter of how the Blink decoder is designed. It feeds the detector the first chunk of the document received from the network, which in almost all cases is big enough for the detector to come up with the right guess.
As answered above, Blink is designed to feed the detector the first chunk of the data only. After that, no more data is fed to it, so its guess is never revised either. It is possible to artificially force the first chunk to be small, so that it carries no significant part of the document content but only tags and scripts that are plain ASCII, which will trick the detector into thinking the document is plain ASCII. But that would rarely happen in real life.
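As a sketch of that behaviour (with a deliberately naive stand-in for the real detector; the names and logic are assumptions for illustration):

```python
def detect_from_first_chunk(chunks, detect):
    # Only the first network chunk is ever shown to the detector;
    # the guess is never revised when later chunks arrive.
    first = next(iter(chunks), b"")
    return detect(first)

def naive_detect(data):
    # Stand-in for a real detector: all-ASCII input yields an "ASCII" guess.
    return "ASCII" if all(b < 0x80 for b in data) else "non-ASCII (needs a real detector)"

chunks = [
    b"<html><head><script>var x = 1;</script>",   # first chunk: pure ASCII markup
    "café résumé".encode("windows-1252"),          # later chunk: non-ASCII content
]
print(detect_from_first_chunk(chunks, naive_detect))  # "ASCII", despite the second chunk
```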
Not sure if I got the question right - AFAICT, Blink respects the detected encoding and uses it for decoding documents regardless of the localization default.
Yes, it can also detect UTF-8. I wouldn't view it as 'inadvertently encouraging' non-labelling, though. Most modern web sites go with UTF-8 and label their documents well. I don't have stats on how much the detector actually contributes to them neglecting to label properly. The detector is more for legacy web sites.
Thank you for the answer.
Have you estimated the success rate of looking at the TLD only versus looking at the first network buffer and falling back to looking at the TLD? How significant a win is it to look inside the first network buffer when you have the capability to look at the TLD? Do you guess windows-1252 for the new TLDs, like Gecko does?
I'm quite unhappy about Blink reintroducing a dependency on network buffer boundaries in text/html handling without discussion at the WHATWG prior to making the change. For historical context, prior to Firefox 4, text/html handling in Firefox was dependent on where in the input stream the buffer boundaries fell. This was bad, and getting rid of this, thanks to the insight from WebKit, was one of the accomplishments of the HTML parsing standardization effort at the WHATWG.
Unfortunately, the Web is vast enough for rare cases to be real.
When the guess is "ASCII", does it result in a windows-1252 decoder being instantiated? Or the decoder being chosen based on the TLD? Or something else?
Even if the HTML parser has only seen ASCII by the time it revises its encoding detection guess, the encoding may already have been inherited by CSS, JS and iframes, so revising the guess without causing a full reload seems like a bad idea.
I meant that there is a set of encodings that were the default in some localization and a set of encodings that were not the default in any localization. When the guess is from the first set, the behavior that ensues matches the behavior of some pre-existing default configuration. When the guess is from the second set, a new Web-exposed behavior has been introduced. Apart from Japanese encodings, does Blink ever guess an encoding from the second set (e.g. ISO-8859-13)?
The vast majority of Web authors have not read your comment here and don't know how Blink works. Since Web authors do what seems to work, they don't know that UTF-8 sniffing is unreliable until it fails them. As noted in your comment about the case where the first network buffer is all ASCII, UTF-8 sniffing in Blink is unreliable. Some Web authors are going to feel free not to label UTF-8 as long as they haven't experienced the failure yet (i.e. they only test in Blink and e.g. use a language that makes it probable for (Detecting UTF-8 for file:-URL HTML by looking at the entire file is fine, IMO, and I want to implement that in Gecko in due course.)
Please note that I'm not working on Blink. I ported CED as a drop-in replacement for the ICU encoding detector so that any Blink-based Web browser can benefit from it. I'm not entirely familiar with all the terms used in your questions.
What are 'new' TLDs? What I can say is that Blink makes use of a typical set of TLD-to-default-encoding mappings, which can be found in the Chromium/Android resources. Mappings for countries whose legacy default encoding is not obvious (e.g. .ge for Georgian, .id for Indonesian) are not included.
Yes, it goes with windows-1252.
I don't know what belongs to the second set. Maybe you can see for yourself what the detector can return here: https://cs.chromium.org/chromium/src/third_party/ced/src/compact_enc_det/compact_enc_det_unittest.cc?l=4733
I can see your point. What would be your suggestion?
I missed this:
Unfortunately I don't have a good estimate of that.
I mean TLDs that are neither the ancient ones (.com, .org, .net, .edu, .gov, .mil) nor country TLDs. That is, TLDs like .mobi, .horse, .xyz, .goog, etc.
Where can I find these mappings? (I searched the Chromium codebase for Punycode country TLDs but didn't find this kind of mapping for those.)
What kind of situation is the TLD used in, then?
The second set is:
(In Gecko, the Japanese and Cyrillic encodings are potential detector outcomes, but my current belief (educated guess, not based on proper data) is that the Cyrillic detectors do more harm than good in Gecko, and I want to remove them. Also, ISO-2022-JP has a structure that's known-dangerous on a general level and shown-dangerous for other encodings with similar structure, and ISO-2022-JP is just waiting for someone to demonstrate an attack. In Gecko, the Japanese detector is enabled for the Japanese locale, the Russian detector is enabled for the Russian locale and the Ukrainian detector is enabled for the Ukrainian locale. Other localizations, including Cyrillic ones, ship with detector off.)
So the answer is that it can return non-Japanese and even non-Cyrillic guesses from the second set. :-(
Never guessing UTF-8.
That seems like the fundamental question when it comes to assessing whether the detector is necessary complexity (given the absence of the manual override menu) or whether the detector adds unnecessary complexity to the Web Platform.
BTW, ISO-2022-JP doesn't need to be "heuristically" detected. ISO 2022 generally uses escape sequences that denote the encoding in use.
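A minimal sketch of that point: rather than statistical guessing, ISO-2022-JP can be recognised by looking for the escape sequences that the Encoding Standard's ISO-2022-JP decoder accepts.

```python
# Escape sequences recognised by the Encoding Standard's ISO-2022-JP decoder.
ISO_2022_JP_ESCAPES = (
    b"\x1b(B",   # ASCII
    b"\x1b(J",   # JIS X 0201 Roman
    b"\x1b(I",   # half-width katakana
    b"\x1b$@",   # JIS X 0208-1978
    b"\x1b$B",   # JIS X 0208-1983
)

def looks_like_iso_2022_jp(data):
    # Deterministic check: the encoding announces itself via escape sequences.
    return any(esc in data for esc in ISO_2022_JP_ESCAPES)

sample = "こんにちは".encode("iso2022_jp")          # b'\x1b$B$3$s$K$A$O\x1b(B'
print(looks_like_iso_2022_jp(sample))               # True
print(looks_like_iso_2022_jp(b"plain ASCII text"))  # False
```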
We've made some progress on this issue in https://bugs.chromium.org/p/chromium/issues/detail?id=691985. We haven't quite reached agreement though on what level of sniffing and rules are necessary for everyone to implement.
FWIW, this is so dependent on IO implementation details that the sniffing limit is different for file: URLs on Ubuntu/ext4 vs. Windows/NTFS. (64 KB vs. something less than 64 KB, respectively, in my testing.)
Automatic update from web-platform-tests: "Encoding: make it clear sniffing for UTF-8 is not acceptable", a=testonly. See whatwg/encoding#68 for background. (wpt-commits: 305ff11b190e7e2bec120a27083f4ba43d6d976e, wpt-pr: 14455)
I did a brief evaluation of https://github.com/google/compact_enc_det to form an opinion on whether we should adopt it for Gecko. I feel very uneasy about adopting it, mainly because a) it seems bad for the Web Platform to depend on a bunch of mystery C++ as opposed to having a spec with multiple independent interoperable implementations, b) it indeed is mystery C++ in the sense that it doesn't have sufficient comments explaining its design or the rationale behind its design, and c) it has a bunch of non-browser-motivated code that would be dead code among the mystery C++ code.

In case we want to develop a spec for sniffing an encoding from content, here are some notes.

It's clear that compact_enc_det has not been developed for the purpose it is used for in Chromium. It looks like it's been developed for the Google search engine and possibly also for Gmail. As seen from this issue, this results in compact_enc_det doing things that don't really make sense in a Web browser and then having to correct toward something that makes more sense in the browser context.

compact_enc_det does not make use of data tables and algorithms that an implementation of the Encoding Standard would necessarily include. If I were designing a detector, I'd eliminate encodings whenever decoding according to the Encoding Standard with a candidate encoding would result in errors or in C1 controls (also probably half-width katakana, as that's a great way to decide that the encoding is not Shift_JIS).

compact_enc_det analyzes byte bigrams both for single-byte encodings and for legacy CJK encodings. In an implementation that assumes the presence of the Encoding Standard data and algorithms, it should be easy to distinguish between single-byte encodings and legacy CJK encodings by the process of elimination described in the previous paragraph: if a legacy CJK encoding remains on the candidate list after seeing a couple of hundred non-ASCII bytes, chances are that the input is in a legacy CJK encoding, because it's so likely that single-byte-encoding content results in errors when decoded as a legacy CJK encoding. (Big5 and Shift_JIS are clearly structurally different from EUC-JP, EUC-KR and GBK. The decision among the EUC-based encodings, if the input is valid according to all of them, could be made as follows: if there's substantial kana according to the EUC-JP interpretation, it's EUC-JP; if the two-byte sequences stay mainly in the original KS X 1001 Hangul area, it's EUC-KR; otherwise, it's GBK.)

compact_enc_det does not assign probabilities to bigrams directly. Instead, it takes the score for the first byte of an "aligned bigram" with its high bit XORed with the high bit of the second byte, and the score for the second byte of the "aligned bigram". If the combination of the highest four bits of the first byte and the highest four bits of the second byte yields a marker for a more precise lookup, then a score for the low 5 bits of the first byte combined with the low 5 bits of the second byte is taken instead. It's unclear why that score isn't used unconditionally. "Aligned" appears to mean aligned to the start of the non-ASCII run, which lines up with character boundaries for legacy CJK, so the next bigram consists of the third and fourth byte of the run instead of the second and third byte of the run; that makes sense for legacy CJK, but at least superficially looks like an odd choice for analyzing non-Latin single-byte encodings.
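As a rough sketch of the elimination idea above (not compact_enc_det; Python codec names stand in for Encoding Standard decoders, which is an approximation, and the half-width-katakana rule is omitted):

```python
C1_CONTROLS = set(map(chr, range(0x80, 0xA0)))

def survives(data, encoding):
    # Drop a candidate as soon as decoding errors out or yields C1 controls.
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return not any(ch in C1_CONTROLS for ch in text)

CANDIDATES = ("utf-8", "shift_jis", "euc_jp", "euc_kr", "gbk", "big5", "windows-1252")

def surviving_candidates(data):
    return [enc for enc in CANDIDATES if survives(data, enc)]

sample = "こんにちは、世界".encode("euc_jp")
# utf-8 is eliminated immediately; several legacy CJK candidates (and
# windows-1252) survive, so a second pass (e.g. kana vs. KS X 1001 Hangul-range
# frequency) is still needed to pick among them.
print(surviving_candidates(sample))
```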
I am guessing that "compact" in the name comes from not having a table of ((256 * 256) / 4) * 3 entries (all byte pairs where at least one byte is non-ASCII) per encoding, and instead having a 256-entry table indexed by the combination of the four high bits of each byte to decide whether a table of 1024 entries indexed by the five low bits of each byte should be used. There seems to be a capability for distinguishing between Western, Central European, and Baltic encodings using trigrams, but it's unclear if that code is actually turned on.
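If that reading is right, the lookup might be shaped roughly like this (a sketch with dummy tables, not compact_enc_det's actual data or code):

```python
USE_FINE = object()   # sentinel meaning "consult the finer table for this pair"

coarse = [0] * 256    # indexed by the four high bits of each byte of the pair
fine = [0] * 1024     # indexed by the five low bits of each byte of the pair

coarse[0xAA] = USE_FINE   # e.g. pairs whose high nibbles are both 0xA

def bigram_score(b1, b2):
    hi_index = ((b1 >> 4) << 4) | (b2 >> 4)            # 4 + 4 bits -> 0..255
    entry = coarse[hi_index]
    if entry is USE_FINE:
        lo_index = ((b1 & 0x1F) << 5) | (b2 & 0x1F)    # 5 + 5 bits -> 0..1023
        return fine[lo_index]
    return entry

print(bigram_score(0xA4, 0xB3))  # coarse score (dummy 0 here)
print(bigram_score(0xA4, 0xA2))  # routed through the fine table (dummy 0 here)
```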
Maybe this was discussed before, but I couldn't find a bug on this. What do you think of treating UTF-7 the same way as ISO-2022-{KR,CN}, HZ-GB, etc?
When decoding, the whole input is replaced by U+FFFD. When encoding, use UTF-8.
Background: Blink began to use the Compact Encoding Detector ( google/compact_enc_det ) when no encoding label is found (http, meta). When 7-bit encoding detection is on, it detects ISO-2022-{KR,CN}, HZ-GB and UTF-7 in addition to ISO-2022-JP. 7-bit encoding detection is on for ISO-2022-JP, but we want to suppress the other 7-bit encodings. I think the best way to 'suppress' (unsupport) them is to turn the whole input into U+FFFD.
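A sketch of the proposed behaviour, following the Encoding Standard's replacement encoding (where decoding any non-empty stream yields a single U+FFFD and the encoder falls through to UTF-8):

```python
def replacement_decode(data):
    # Per the Encoding Standard's replacement decoder: any non-empty input
    # produces a single U+FFFD, so none of its bytes are ever interpreted.
    return "\ufffd" if data else ""

def replacement_encode(text):
    # The replacement encoding has no encoder of its own; UTF-8 is used.
    return text.encode("utf-8")

print(replacement_decode(b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"))  # '\ufffd'
print(replacement_encode("héllo"))  # b'h\xc3\xa9llo'
```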