In my case the large majority of files I work with are either UTF-8 with no BOM, or
UTF-16LE with no BOM. As mentioned in other bug reports and feature requests, files
of this sort aren't exactly rare.
As it stands, jEdit can't auto-detect UTF-16LE without a BOM, so I can't conveniently
use it on all of these files: however I configure it, one of the two file types will
consistently load incorrectly and require a manual reload.
Worse, it also means that I can't run a directory/files search that covers both
file types correctly.
I do recognize that UTF-16 without a BOM can't be detected with 100% reliability.
But in my case, and I expect in the large majority of cases, most of the characters
in these files come from the ASCII subset. So a heuristic that looks at the start
of the file for a pattern of alternating \x00 and non-\x00 bytes would have a very
high success rate at correctly guessing whether a file is UTF-16LE or UTF-16BE
without a BOM.
This pattern is also unlikely to be anything else, at least in the majority of cases
where jEdit isn't routinely used for binary files. So the risk of wrong behavior
is minimal. (Of course it's also possible to add this heuristic as a configurable
option, but my point is that this additional complexity can probably be avoided.)
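
To make the idea concrete, here is a rough sketch of the kind of check I have in mind. It is only an illustration under my own assumptions: the class name, the 1024-byte sample size and the 90% threshold are made up for the example, and it is not wired into jEdit's detector mechanism in any way.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of the alternating-zero-byte heuristic; names and limits are assumptions. */
final class Utf16NoBomSketch {
    // Assumed tuning values, chosen only for illustration.
    static final int SAMPLE_BYTES = 1024;
    static final double THRESHOLD = 0.9;

    /** Returns "UTF-16LE", "UTF-16BE", or null when the sample does not look like BOM-less UTF-16. */
    static String guess(byte[] data) {
        // Look only at the start of the file, in whole 16-bit units.
        int len = Math.min(data.length, SAMPLE_BYTES) & ~1;
        if (len < 4) {
            return null; // too little data to judge
        }
        int pairs = len / 2;
        int le = 0; // pairs shaped (non-zero, 0x00): an ASCII character in little-endian order
        int be = 0; // pairs shaped (0x00, non-zero): an ASCII character in big-endian order
        for (int i = 0; i < len; i += 2) {
            boolean firstZero = data[i] == 0;
            boolean secondZero = data[i + 1] == 0;
            if (!firstZero && secondZero) {
                le++;
            } else if (firstZero && !secondZero) {
                be++;
            }
        }
        if (le >= pairs * THRESHOLD) {
            return "UTF-16LE";
        }
        if (be >= pairs * THRESHOLD) {
            return "UTF-16BE";
        }
        return null; // let the other detectors or the fallback encodings decide
    }

    public static void main(String[] args) {
        String text = "Hello, jEdit!";
        System.out.println(guess(text.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println(guess(text.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println(guess(text.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Returning null for anything that doesn't match keeps the check conservative: plain UTF-8 or ASCII text contains no \x00 bytes, so it never reaches the threshold and is left to whatever detection already exists.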
Submitted | yarondav - 2011-07-19 14:09:27 | Assigned | k_satoda |
---|---|---|---|
Priority | 5 | Labels | Encodings |
Status | open | Group | v4.3 |
Resolution | None | | |
2011-07-19 15:05:49 k_satoda |
- **assigned_to**: nobody --> k_satoda |
---|---|
2011-07-19 15:05:49 k_satoda |
Did you know about \[Global Options\] > \[Encodings\] > \[List of fallback encodings\]?
|
2011-07-19 15:43:36 yarondav |
No, I noticed and tried this already. |
2011-07-19 16:30:31 k_satoda |
- **status**: pending-works-for-me --> open |
2011-07-19 16:30:31 k_satoda |
Sorry, I tried the fallback encodings with some Japanese text files,
|
2011-07-19 18:08:35 yarondav |
Thanks, I didn't notice that these were plugins and not integral to the core. |
2011-07-20 15:56:51 yarondav |
This may work as a detector: UTF16NoBOMDetector.java (2.3 KiB) |
2011-07-20 15:59:54 yarondav |
I've tried making something. |
2011-07-20 17:40:36 k_satoda |
Thank you very much for trying it yourself.
|
2011-07-22 17:04:42 yarondav |
First, I want to say that I really do think something like this should be part of
the core, just as the other detectors are. It's less clear-cut, and the default parameters
could probably be better, but it's not an unusual need, and it is something
that quite a few other editors do at some level without requiring the user to install
plugins/extensions. |
2011-07-22 17:05:06 yarondav |
UTF16NoBOM-Source.zip (3.8 KiB) |
2011-07-22 17:05:25 yarondav |
UTF16NoBOM.jar (4.8 KiB) |
2013-12-09 19:16:08.330000 ezust |
- **labels**: core --> Encodings |