PreviousNextTracker indexSee it online !

(25/26) 187 - Support <?xml-model?> link to RelaxNG schema in XML Plugin

A patch to add support for `<?xml-model>` processing instructions as a way to associate a RelaxNG schema

Submitted conal_tuohy - 2019-12-03 04:44:57.435000 Assigned
Priority 5 Labels XML RelaxNG
Status open Group None
Resolution None

Comments

2019-12-03 04:46:36.109000
conal_tuohy

Adding the XML plugin artifact as an attachment

XML.jar (2.4Mio)

2019-12-08 18:10:19.480000
kerik-sf

Thanks for the patch.

Regarding documentation (good that you updated it):
1. tei hrefs (http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_tite.rng) end in 404 errors. Is it expected?
2. a pointer to the validation section would be great, for another "external" way to specify RNG schemas.
3. the new version in CHANGES.txt is 3.0.7

Regarding the code
1. processingInstruction()
1.1 more cases should be supported based on type and schematypens: rnc, xsd, RelaxNG, as in https://www.w3.org/TR/xml-model/#d0e689
1.2 schematron is not actually supported, is it?
1.3 you should get the doc's systemId using `xml.PathUtilities.pathToURL(systemId)` instead of using it directly
2. getPseudoAttributes
2.1 the value part of the regex is incorrect: it should match until the quote, not until '=' and support both single and double quotes, as in the spec (dt-parsing)
2.2 what about & and <? Can you verify if it's unescaped by the parser? If not we have to document the limitation or do it ourself...

2019-12-09 04:40:58.470000
conal_tuohy

Documentation:
1) the 404 on the TEI schemas is expected (the TEI-C website has a configuration problem). The error has been reported, and a redirect is going to be put in place. These HTTP URLs are the officially published ones at the moment though.
2) I don't quite understand what you're suggesting. But I did read the "validation" section and noticed the outdated text there (which said that a RelaxNG schema could not be specified in a document), which I've now corrected.
3) fixed
I'll update the patch shortly

2019-12-09 09:42:12.035000
conal_tuohy

Regarding the code:

1 processingInstruction method
1.1 Yes I have a plan to also support schematron, but I prioritised RelaxNG because the RelaxNG schema language and the `<?xml-model?>` PI are the most commonly used in the Text Encoding community, and they were not supported.
1.2 I am working on Schematron next, but of course the validation itself is very different.
1.3 Done, thank you.
2 getPseudoAttributes
2.1 I have reviewed the pseudo-attribute parsing with more care, and dealt with both the double- and single-quote cases.
2.2 I believe I have now correctly parsed the PI including, including dealing with encoded chars.

Thanks for your review!

See new attached diff

XML.diff (10.6Kio)

2019-12-09 10:15:55.015000
kerik-sf

1.1 I would also prioritize RelaxNG. In fact, I don't ask you to implement Schematron at all if you don't need it ;-) What I don't understand is that you match on "http://purl.oclc.org/dsdl/schematron" not "http://relaxng.org/ns/structure/1.0".

2. Yes, PI parsing looks perfect! Only a style issue: I prefer final static fields upper-cased.
Also, I would store the compiled Pattern in the static field (`private static final Pattern PSEUDO_ATTRIBUTE = Pattern.compile("...")`) when applicable. Same for the pattern in `getPseudoAttributeValue()`.

Thanks,

2019-12-09 11:13:25.687000
ezust

I just want to say, reading this thread and seeing you help each other,
that makes me very happy!
Especially for working on the XML plugin, which is my favorite plugin.

alternate (231B)

2019-12-09 11:56:41.981000
conal_tuohy

Thanks very much for spotting the namespace confusion, Eric!

I was wondering how I could have made that mistake and still had the schema actually work!

It turns out that the sample TEI XML files I've been working with all have `?xml-model?` PIs for both Schematron *and* RelaxNG,, and the `href` pseudo-attribute value is the same for both the schema types (the RelaxNG schema file actually has Schematron rules embedded in it). I had made a stupid copy-and-paste error, but everything had worked for me, because my "schematron" schema was actually also a RelaxNG schema. I've fixed that now, and I've made the stylistic change to the static field names, and I've made the Pattern objects final and static too.

Once more, with the diff:

XML.diff (10.8Kio)

2019-12-09 12:21:33.671000
kerik-sf

Cool! Now I've only one request: put curly braces around cases in getPseudoAttributeValue because it's such a common error to add another indented line to the last case and think that it only applies in it, but it doesn't : it always applies. Or for short code you can put it on the same line as the `if`. Then braces are not necessary.

```
if (matcher.group(1) != null) unescapedValue.append("&"); // &
else if (matcher.group(2) != null) unescapedValue.append("<"); // <
else if (matcher.group(3) != null) unescapedValue.append(">"); // >
else if (matcher.group(4) != null) unescapedValue.append("\""); // &quot;
else if (matcher.group(5) != null) unescapedValue.append("'"); // &apos;
else if (matcher.group(6) != null) {
unescapedValue.appendCodePoint(Integer.parseInt(matcher.group(6))); // decimal unicode codepoint
} else if (matcher.group(7) != null) {
unescapedValue.appendCodePoint(Integer.parseInt(matcher.group(7),16)); // hex unicode codepoint
} else if (matcher.group(8) != null)
unescapedValue.append(matcher.group(8)); // any other character is copied unchanged
}
```

2019-12-09 13:38:20.815000
conal_tuohy

OK I have chosen to use curly braces in all cases, even where the `append` call is short enough that it would have fitted on one line; I think the consistency makes the method more readable.
~~~
if (matcher.group(1) != null) {// &
unescapedValue.append("&");
} else if (matcher.group(2) != null) {// <
unescapedValue.append("<");
} else if (matcher.group(3) != null) {// >
unescapedValue.append(">");
} else if (matcher.group(4) != null) {// &quot;
unescapedValue.append("\"");
} else if (matcher.group(5) != null) {// &apos;
unescapedValue.append("'");
} else if (matcher.group(6) != null) {// decimal unicode codepoint
unescapedValue.appendCodePoint(Integer.parseInt(matcher.group(6)));
} else if (matcher.group(7) != null) { // hex unicode codepoint
unescapedValue.appendCodePoint(Integer.parseInt(matcher.group(7),16));
} else if (matcher.group(8) != null) { // any other character copied unchanged
unescapedValue.append(matcher.group(8));
}
~~~

XML.diff (10.8Kio)

2019-12-09 13:48:30.341000
kerik-sf

That's fine :-)