Previous Next Tracker index See it online !

(25/26) 187 - Support <?xml-model?> link to RelaxNG schema in XML Plugin

A patch to add support for `<?xml-model>` processing instructions as a way to associate a RelaxNG schema

Submitted	conal_tuohy - 2019-12-03 04:44:57.435000	Assigned
Priority	5	Labels	XML RelaxNG
Status	open	Group	None
Resolution	None

Comments

2019-12-03 04:46:36.109000 conal_tuohy	Adding the XML plugin artifact as an attachment XML.jar (2.4Mio)
2019-12-08 18:10:19.480000 kerik-sf	Thanks for the patch. Regarding documentation (good that you updated it): 1. tei hrefs (http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_tite.rng) end in 404 errors. Is it expected? 2. a pointer to the validation section would be great, for another "external" way to specify RNG schemas. 3. the new version in CHANGES.txt is 3.0.7 Regarding the code 1. processingInstruction() 1.1 more cases should be supported based on type and schematypens: rnc, xsd, RelaxNG, as in https://www.w3.org/TR/xml-model/#d0e689 1.2 schematron is not actually supported, is it? 1.3 you should get the doc's systemId using `xml.PathUtilities.pathToURL(systemId)` instead of using it directly 2. getPseudoAttributes 2.1 the value part of the regex is incorrect: it should match until the quote, not until '=' and support both single and double quotes, as in the spec (dt-parsing) 2.2 what about & and <? Can you verify if it's unescaped by the parser? If not we have to document the limitation or do it ourself...
2019-12-09 04:40:58.470000 conal_tuohy	Documentation: 1) the 404 on the TEI schemas is expected (the TEI-C website has a configuration problem). The error has been reported, and a redirect is going to be put in place. These HTTP URLs are the officially published ones at the moment though. 2) I don't quite understand what you're suggesting. But I did read the "validation" section and noticed the outdated text there (which said that a RelaxNG schema could not be specified in a document), which I've now corrected. 3) fixed I'll update the patch shortly
2019-12-09 09:42:12.035000 conal_tuohy	Regarding the code: 1 processingInstruction method 1.1 Yes I have a plan to also support schematron, but I prioritised RelaxNG because the RelaxNG schema language and the `<?xml-model?>` PI are the most commonly used in the Text Encoding community, and they were not supported. 1.2 I am working on Schematron next, but of course the validation itself is very different. 1.3 Done, thank you. 2 getPseudoAttributes 2.1 I have reviewed the pseudo-attribute parsing with more care, and dealt with both the double- and single-quote cases. 2.2 I believe I have now correctly parsed the PI including, including dealing with encoded chars. Thanks for your review! See new attached diff XML.diff (10.6Kio)
2019-12-09 10:15:55.015000 kerik-sf	1.1 I would also prioritize RelaxNG. In fact, I don't ask you to implement Schematron at all if you don't need it ;-) What I don't understand is that you match on "http://purl.oclc.org/dsdl/schematron" not "http://relaxng.org/ns/structure/1.0". 2. Yes, PI parsing looks perfect! Only a style issue: I prefer final static fields upper-cased. Also, I would store the compiled Pattern in the static field (`private static final Pattern PSEUDO_ATTRIBUTE = Pattern.compile("...")`) when applicable. Same for the pattern in `getPseudoAttributeValue()`. Thanks,
2019-12-09 11:13:25.687000 ezust	I just want to say, reading this thread and seeing you help each other, that makes me very happy! Especially for working on the XML plugin, which is my favorite plugin. alternate (231B)
2019-12-09 11:56:41.981000 conal_tuohy	Thanks very much for spotting the namespace confusion, Eric! I was wondering how I could have made that mistake and still had the schema actually work! It turns out that the sample TEI XML files I've been working with all have `?xml-model?` PIs for both Schematron and RelaxNG,, and the `href` pseudo-attribute value is the same for both the schema types (the RelaxNG schema file actually has Schematron rules embedded in it). I had made a stupid copy-and-paste error, but everything had worked for me, because my "schematron" schema was actually also a RelaxNG schema. I've fixed that now, and I've made the stylistic change to the static field names, and I've made the Pattern objects final and static too. Once more, with the diff: XML.diff (10.8Kio)
2019-12-09 12:21:33.671000 kerik-sf	Cool! Now I've only one request: put curly braces around cases in getPseudoAttributeValue because it's such a common error to add another indented line to the last case and think that it only applies in it, but it doesn't : it always applies. Or for short code you can put it on the same line as the `if`. Then braces are not necessary. ``` if (matcher.group(1) != null) unescapedValue.append("&"); // & else if (matcher.group(2) != null) unescapedValue.append("<"); // < else if (matcher.group(3) != null) unescapedValue.append(">"); // > else if (matcher.group(4) != null) unescapedValue.append("\""); // " else if (matcher.group(5) != null) unescapedValue.append("'"); // ' else if (matcher.group(6) != null) { unescapedValue.appendCodePoint(Integer.parseInt(matcher.group(6))); // decimal unicode codepoint } else if (matcher.group(7) != null) { unescapedValue.appendCodePoint(Integer.parseInt(matcher.group(7),16)); // hex unicode codepoint } else if (matcher.group(8) != null) unescapedValue.append(matcher.group(8)); // any other character is copied unchanged } ```
2019-12-09 13:38:20.815000 conal_tuohy	OK I have chosen to use curly braces in all cases, even where the `append` call is short enough that it would have fitted on one line; I think the consistency makes the method more readable. ~~~ if (matcher.group(1) != null) {// & unescapedValue.append("&"); } else if (matcher.group(2) != null) {// < unescapedValue.append("<"); } else if (matcher.group(3) != null) {// > unescapedValue.append(">"); } else if (matcher.group(4) != null) {// " unescapedValue.append("\""); } else if (matcher.group(5) != null) {// ' unescapedValue.append("'"); } else if (matcher.group(6) != null) {// decimal unicode codepoint unescapedValue.appendCodePoint(Integer.parseInt(matcher.group(6))); } else if (matcher.group(7) != null) { // hex unicode codepoint unescapedValue.appendCodePoint(Integer.parseInt(matcher.group(7),16)); } else if (matcher.group(8) != null) { // any other character copied unchanged unescapedValue.append(matcher.group(8)); } ~~~ XML.diff (10.8Kio)
2019-12-09 13:48:30.341000 kerik-sf	That's fine :-)