I am trying to create a regex pattern to recognise and highlight source code comments
that look like:
*this is a comment;
The language is free-format, so comments can occur anywhere on a line, multiple times
on a line, or over multiple lines. Other statements almost always end with a semi-colon,
with a few exceptions that I can work on later. For now I am trying to get a
regex zero-width positive lookbehind assertion to work so that several *; comments in
a row can be identified. The mode rule I am working on looks like:
<SPAN_REGEXP TYPE="COMMENT2" AT_WHITESPACE_END="FALSE" MATCH_TYPE="RULE">
<BEGIN><![CDATA[(?<=[;])[\s]*[*][^;]*]]></BEGIN>
<END>;</END>
</SPAN_REGEXP>
I have also tried <BEGIN>(?<=[;])[\s]*[*][^;]*</BEGIN> without the CDATA wrapper, which
produces a complaint, and rightly so (the < is unescaped), and
<BEGIN>(?&lt;=[;])[\s]*[*][^;]*</BEGIN>, which fails to highlight just like
the first example.
An example of the code I am trying to highlight (SAS for the curious):
/* see how an assignment statement with * works */
data _null_;
new_var = old_var * 100;
/* this comment should work fine, how about the other type */
no_comment; *comment; *comment;
*comment; *comment; *comment; *comment;
*This is also a valid comment; /*as is this*/ *and this; *and this;
run;
The regex pattern doesn't seem to work at all when used in a mode file, and works only
partially in the search dialog: there the semi-colon of the previous statement is
captured, so a comment cannot be captured if the comment directly before it was
captured. (?<=[;]) means a semi-colon must be present immediately before the captured
area but must not be captured itself. This way assignment statements, SQL and other
code containing * are not captured. The first *; comment after the /* */ comment
fails to capture, as expected; that should only require a small regex change, so ignore
it. To see how the pattern should capture the example, append a semi-colon to the
regex and paste both the regex and the code into:
http://www.myregextester.com/
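
A minimal sketch of that check, using Python's `re` module as a stand-in for the online tester. This is an assumption made for illustration: Python's engine is not the one jEdit uses, so the result says nothing about how the mode file behaves. The names `COMMENT`, `SAMPLE` and `find_star_comments` are hypothetical.

```python
import re

# The pattern from the report with the closing semi-colon appended.
# (?<=;) requires a semi-colon directly before the match without consuming it.
COMMENT = re.compile(r"(?<=;)\s*\*[^;]*;")

# The SAS example from the report.
SAMPLE = """\
/* see how an assignment statement with * works */
data _null_;
new_var = old_var * 100;
/* this comment should work fine, how about the other type */
no_comment; *comment; *comment;
*comment; *comment; *comment; *comment;
*This is also a valid comment; /*as is this*/ *and this; *and this;
run;
"""


def find_star_comments(source):
    """Return every *...; comment that directly follows a semi-colon."""
    return [m.group().strip() for m in COMMENT.finditer(source)]


if __name__ == "__main__":
    for comment in find_star_comments(SAMPLE):
        print(comment)
```

In this engine the lookbehind can still see the semi-colon consumed by the previous match, so consecutive comments are all found. The `*` in the assignment statement is skipped, and the first `*and this;` after the /* */ comment is missed, as described above.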
Submitted | Anonymous - 2010-01-11 - 01:22:46z | Assigned | nobody |
---|---|---|---|
Priority | 5 | Category | text area and syntax packages |
Status | Open | Group | None |
Resolution | None | Visibility | No |
2010-01-18 - 16:15:19z goebbe |
Please take a look at the file sas.xml in your jEdit home directory. Currently sas.xml supports a maximum of two comments per line, but it is easy (though computationally expensive) to extend the regexp rules used there. In practice you have to add two additional regexp expressions to support three comments on a line. The main problem when implementing things like *comment1; *comment2; is that a simple regexp will always ignore all the characters of the previously matched regexp. In my example that means that you cannot write a regexp that "conditions" on the existence of the ";" that closes comment 1. The current implementation of sas.xml has a workaround for this: simply use a regexp that matches two comments in a row (of course this is more complicated). Another limitation of syntax highlighting in jEdit is that a regular expression cannot span several lines. Please see the following patch with test files and a discussion of the issue: https://sourceforge.net/tracker/index.php?func=detail&aid=2793540&group_id=588&atid=300588 |
---|---|
2010-01-19 - 09:36:48z goebbe |
See also https://sourceforge.net/tracker/index.php?func=detail&aid=2926121&group_id=588&atid=300588 for an updated sas.xml that fixes the case of /*comment1*/ *comment2; when both are on one line. |
2013-02-12 - 11:21:53z muntjac |
The lack of support for zero-width positive lookbehind assertions also makes it impossible
to correctly highlight strings in Matlab/Octave source code, as strings can begin
and end with a single quote ' , but in certain contexts that character can also signify
matrix/vector transposition. E.g. in the following line: a=[1 2 3]';'a is a column vector' you can only tell that ';' is not a string because the first ' follows a ] character. Similarly, you can only tell that 'a is a column vector' is a string because the first ' follows a semicolon. Even if there is a workaround for the case the original poster asked about, zero-width positive lookbehind is still a useful feature for other cases, like the one I've illustrated. |
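
A minimal sketch of the Matlab/Octave case, again using Python's `re` module purely as a stand-in (an assumption; it is not jEdit's engine). It covers only the single context named in the example, a quote directly after a semi-colon; a real highlighter would need more contexts. The names `NAIVE`, `AFTER_SEMICOLON` and the two helper functions are hypothetical.

```python
import re

# The example line from the comment above.
LINE = "a=[1 2 3]';'a is a column vector'"

# Without lookbehind: any pair of single quotes is taken as a string.
NAIVE = re.compile(r"'[^']*'")

# With positive lookbehind: the opening quote must directly follow a semi-colon.
AFTER_SEMICOLON = re.compile(r"(?<=;)'[^']*'")


def strings_naive(line):
    """Return quoted spans found with no knowledge of context."""
    return NAIVE.findall(line)


def strings_with_lookbehind(line):
    """Return quoted spans whose opening quote follows a semi-colon."""
    return AFTER_SEMICOLON.findall(line)


if __name__ == "__main__":
    print(strings_naive(LINE))
    print(strings_with_lookbehind(LINE))
```

The naive pattern pairs the transpose quote with the next quote and reports `';'` as a string, while the lookbehind version rejects the quote after `]` and picks up `'a is a column vector'` instead.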