
(50/211) 3373 - zero-width positive lookbehind assertion

I am trying to create a regex pattern to recognise and highlight source code comments that look like:

\*this is a comment;

The language is free-format, so comments can occur anywhere on a line, several times on a line, or across multiple lines. Other statements almost always end with a semi-colon, apart from a few exceptions that I can work on later. For now I am trying to get a zero-width positive lookbehind assertion to work so that several \*; comments in a row can be identified. The mode rule I am working on looks like:

<SPAN_REGEXP TYPE="COMMENT2" AT_WHITESPACE_END="FALSE" MATCH_TYPE="RULE">
<BEGIN><\!\[CDATA\[(?<=\[;\])\[\s\]\*\[\*\]\[^;\]\*\]\]></BEGIN>
<END>;</END>
</SPAN_REGEXP>

I have also tried <BEGIN>(?<=\[;\])\[\s\]\*\[\*\]\[^;\]\*</BEGIN>, which the XML parser complains about (rightly so, since a bare < is not valid there), and <BEGIN>(?&lt;=\[;\])\[\s\]\*\[\*\]\[^;\]\*</BEGIN>, which fails to highlight just like the first example.
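To rule out XML escaping as the cause, the three spellings can be compared with any XML parser. The sketch below uses Python's `xml.etree.ElementTree` as a stand-in (an assumption: jEdit uses its own Java XML loader, which may report errors differently); the helper `parses` is a made-up name for illustration.

```python
import xml.etree.ElementTree as ET

# The regex the mode rule is meant to contain.
REGEX = r"(?<=[;])[\s]*[*][^;]*"

# CDATA form and &lt; form: both are well-formed XML.
cdata = ET.fromstring(r"<BEGIN><![CDATA[(?<=[;])[\s]*[*][^;]*]]></BEGIN>").text
escaped = ET.fromstring(r"<BEGIN>(?&lt;=[;])[\s]*[*][^;]*</BEGIN>").text


def parses(xml_text):
    """Return True if xml_text is well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Bare '<' inside element content: rejected by the parser.
bare_ok = parses(r"<BEGIN>(?<=[;])[\s]*[*][^;]*</BEGIN>")

print(cdata == escaped == REGEX, bare_ok)
```

Both well-formed spellings hand the same regex text to the application, so under this assumption the missing highlighting is not an escaping problem.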

An example of the code I am trying to highlight (SAS for the curious):

/\* see how an assignment statement with \* works \*/
data _null_;
new_var = old_var \* 100;

/\* this comment should work fine, how about the other type \*/
no_comment; \*comment; \*comment;
\*comment; \*comment; \*comment; \*comment;
\*This is also a valid comment; /\*as is this\*/ \*and this; \*and this;
run;

The pattern doesn't seem to work at all when used in a mode file, and works only partially in the search dialog: there it captures the semi-colon from the previous statement, so the next comment cannot be captured if the comment before it was captured. (?<=\[;\]) means that a semi-colon must be present just in front of the captured area but must not be captured itself. This way assignment statements, SQL, etc. containing \* are not captured. The first \*; comment after a /\* \*/ comment fails to capture, as expected; that should only require a small regex change, so ignore it. To see how the pattern should capture the example, append a semi-colon to the regex and paste both the regex and the code into:

http://www.myregextester.com/
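The same check can be run locally in any engine that supports lookbehind. The sketch below uses Python's `re` module as a stand-in (an assumption: jEdit relies on a Java regex engine, and its behaviour inside mode files may differ) with the closing semi-colon appended as described above. The helper name `find_comments` is made up for illustration.

```python
import re

# The pattern from the report, with the closing semi-colon appended.
# (?<=[;]) requires a ';' just before the match without consuming it.
COMMENT = re.compile(r"(?<=[;])[\s]*[*][^;]*;")


def find_comments(line):
    """Return every *...; comment that directly follows a semi-colon."""
    return [m.group(0).strip() for m in COMMENT.finditer(line)]


print(find_comments("no_comment; *comment; *comment;"))
print(find_comments("new_var = old_var * 100;"))
```

The first call finds both comments, because the semi-colon that closes one comment is still available to the lookbehind of the next; the assignment statement yields no match.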

Submitted: *anonymous - 2010-01-11 01:22:46
Assigned:
Priority: 5
Labels: text area and syntax packages
Status: open
Group: None
Resolution: None

Comments

2010-01-18 16:15:19
goebbe

Please take a look at the file sas.xml in your jEdit home directory.
Currently sas.xml only supports a maximum of two comments per line, but it is easy (though computationally expensive) to extend the regexp rules used there.
In practice you have to add two additional regexp rules to support three comments on a line as well.

The main problem when implementing things like \*comment1; \*comment2; is that
a simple regexp will always ignore all the characters consumed by the previously matched regexp.
In my example that means that you cannot write a regexp that "conditions" on the existence of the ";" that closes comment 1.

In the current implementation of sas.xml there is a workaround for this:
Simply use a regexp that matches two comments in a row. (of course this is more complicated)
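A small sketch of the problem and of the "two comments in a row" idea, using Python's `re` module as a stand-in engine. The patterns here are illustrative assumptions and are not the actual rules from sas.xml.

```python
import re

# Without lookbehind the leading ';' has to be consumed by the match,
# so it is no longer available to introduce the following comment.
CONSUMING = re.compile(r";\s*[*][^;]*;")

# Workaround in the spirit of the description above: one pattern that
# matches two comments in a row. This is NOT the rule from sas.xml.
PAIR = re.compile(r";\s*[*][^;]*;\s*[*][^;]*;")

line = "no_comment; *comment1; *comment2;"

single = [m.group(0) for m in CONSUMING.finditer(line)]
pair = [m.group(0) for m in PAIR.finditer(line)]

print(single)
print(pair)
```

The consuming pattern finds only the first comment, while the pair pattern covers both in one match. Each additional comment on a line needs a longer pattern, which is why the approach becomes expensive.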

Another limitation of syntax highlighting in jEdit is that a regular expression cannot span several lines.
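To illustrate what line-by-line matching would mean for a lookbehind rule, the sketch below compares matching a whole text with matching each line in isolation. It uses Python's `re` module and assumes a simplified model of a line-based highlighter; it does not reproduce jEdit's tokenizer.

```python
import re

COMMENT = re.compile(r"(?<=;)\s*[*][^;]*;")

source = "no_comment; *comment;\n*comment; *comment;"

# Whole text: the ';' that ends line 1 is visible to the lookbehind,
# so the comment at the start of line 2 is found.
whole = len(COMMENT.findall(source))

# Each line in isolation (assumed model of a line-based highlighter):
# the comment at the start of line 2 has no ';' in front of it.
per_line = sum(len(COMMENT.findall(line)) for line in source.splitlines())

print(whole, per_line)
```

Under this model a comment that starts a line is missed even with lookbehind support, so a rule for that position would still be needed separately.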

Please see the following patch with test-files and a discussion of the issue:
https://sourceforge.net/tracker/index.php?func=detail&aid=2793540&group_id=588&atid=300588

2010-01-19 09:36:48
goebbe

See also
https://sourceforge.net/tracker/index.php?func=detail&aid=2926121&group_id=588&atid=300588
for an updated sas.xml that fixes the case of
/\*comment1\*/ \*comment2;
when both are on one line.

2013-02-12 11:21:53
muntjac

The lack of support for zero-width positive lookbehind assertions also makes it impossible to correctly highlight strings in Matlab/Octave source code, as strings can begin and end with a single quote ' , but in certain contexts that character can also signify matrix/vector transposition.
E.g. in the following line:

a=\[1 2 3\]';'a is a column vector'

you can only tell that ';' is not a string because the first ' follows a \] character. Similarly, you can only tell that 'a is a column vector' is a string because the first ' follows a semicolon.
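The distinction can be sketched with a lookbehind in Python's `re` module (a stand-in engine). The set of characters allowed before an opening quote is a simplified assumption for illustration and does not cover the full Matlab/Octave grammar.

```python
import re

# Naive rule: any pair of single quotes is a string.
NAIVE = re.compile(r"'[^']*'")

# Simplified assumption: a quote opens a string only at the start of the
# line or right after one of ; , = ( or whitespace.
STRING = re.compile(r"(?:^|(?<=[;,=(\s]))'[^']*'")

line = "a=[1 2 3]';'a is a column vector'"

naive = NAIVE.findall(line)
strings = STRING.findall(line)

print(naive)
print(strings)
```

The naive rule picks up ';' as a string and then misses the real one; the lookbehind version skips the quote after \] and matches only 'a is a column vector'.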

Even if there is a workaround for the case the original poster asked about, zero-width positive lookbehind is still a useful feature for other cases, like the one I've illustrated.