Monday, June 16, 2008

Java Syntax Highlighting with JEditorPane

I've been playing around with implementing a syntax highlighting (coloring) editor or text control in Java Swing. Just for fun. It would be part of TranScope, to edit scripts, mostly Groovy, and to view some TAL / DDL and XML. It was hard and time consuming, but also a hell of a lot of fun and a rewarding experience. I'll summarize what I did and, if there is demand, publish my final code as a project. Most of these topics were new to me before I started.

You may also want to check out the JEdit Syntax Package. The project seems dead, but it works.

XML EditorKit:

I started out by reading, and then getting the source for the Batik XML Editor, more here.
It was okay, but it only handles XML, and it seemed quite tightly tied to XML, so modifying it for other languages was not easy. But it is a very good library to use by itself, and I used it until I wrote my own XML Lexer.

Take One - Dynamic Regex:

I came across this link, which was really helpful in understanding what all these Views, Documents and EditorKits are about. Please read that link, as it is very helpful and to the point. There is also the Sun documentation about the Swing Text API here.

The code I created based on Kees was very simple. I used some regular expressions to get tokens from each line, and whenever a match is found, the Color for that regex is used to color the match. All that is needed is to put the regexes and their associated colors in a Map, loaded from a Properties file, and voila! Dynamic highlighting, without any code change and for any language.

It worked perfectly... Almost. There is no need to keep any extra data about the Document, and highlighting does not need to parse anything except the single line being drawn by the View's drawUnselectedText method. This means it is very fast and needs no extra memory. The only problem is that multi-line constructs will NOT work. So multi-line comments are not handled.

This is no big limitation at all in many cases.
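
A minimal sketch of the per-line regex idea described above. It is not the original code: the class name RegexLineHighlighter, the Span type and the addRule method are made up for illustration, and the rules are added in code rather than loaded from a Properties file. Each rule is matched against one line at a time, which is exactly why a comment that opens on one line and closes on another is never seen.

```java
import java.awt.Color;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Per-line regex highlighting sketch; all names here are illustrative. */
class RegexLineHighlighter {

    /** A colored span within one line: [start, end) offsets into that line. */
    static final class Span {
        final int start;
        final int end;
        final Color color;

        Span(int start, int end, Color color) {
            this.start = start;
            this.end = end;
            this.color = color;
        }
    }

    // Insertion order is kept, so spans are reported rule by rule.
    private final Map<Pattern, Color> rules = new LinkedHashMap<>();

    /** Registers one regex and the color used for its matches. */
    void addRule(String regex, Color color) {
        rules.put(Pattern.compile(regex), color);
    }

    /** Finds every match of every rule in a single line of text. */
    List<Span> highlight(String line) {
        List<Span> spans = new ArrayList<>();
        for (Map.Entry<Pattern, Color> rule : rules.entrySet()) {
            Matcher m = rule.getKey().matcher(line);
            while (m.find()) {
                spans.add(new Span(m.start(), m.end(), rule.getValue()));
            }
        }
        return spans;
    }
}
```

A view would call highlight once for each line it draws and paint each Span in its color. A line containing only "/* open" produces no comment span, because the closing "*/" sits on a later line that this method never sees.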

Take Two - Lexing + StyledDocument:

Here is where the fun begins. To properly handle multi-line constructs, simple regex matches are not really usable, and are very slow. What is needed is a parser or lexer. Java has many of these, including Antlr, JavaCC and JFlex.

I did some research and found JFlex to be the easiest to use for Lexing. Remember I only need to get Tokens and not create a compiler. JFlex was also very easy to use for in-memory characters (from the Document), and very fast. I did some benchmarks on my work PC (2 GHz, 1 GB RAM, with lots of programs running, including NetBeans). Parsing a 200K Document still takes less than 15 ms in most cases, and no slowdown is noticeable while typing.

I created my Lexer to return a Token object of this form:

import java.io.Serializable;

public class Token implements Serializable, Comparable<Token> {
    public TokenType type;
    public int start;
    public int length;

    // other boilerplate code....

    // Order tokens by start offset, then by length, then by type.
    @Override
    public int compareTo(Token t) {
        if (this.start != t.start) {
            return (this.start - t.start);
        } else if (this.length != t.length) {
            return (this.length - t.length);
        } else {
            return this.type.compareTo(t.type);
        }
    }
}

TokenType is an enum with all possible token types (OPER, IDENT, KEYWORD, STRING, COMMENT etc.)
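
To make the Token output concrete, here is a small hand-written lexer sketch. It is only a stand-in for a JFlex-generated lexer, not the lexer used in the project: the MiniLexer class, its tiny keyword list and the reduced TokenType enum are assumptions for illustration. Because it scans the whole text, a comment that spans several lines comes out as a single COMMENT token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hand-written stand-in for a JFlex-generated lexer; illustrative only. */
class MiniLexer {

    enum TokenType { IDENT, KEYWORD, COMMENT, OPER }

    /** Same shape as the Token above: a type plus start offset and length. */
    static final class Token {
        final TokenType type;
        final int start;
        final int length;

        Token(TokenType type, int start, int length) {
            this.type = type;
            this.start = start;
            this.length = length;
        }
    }

    // Deliberately tiny keyword list, just enough for the example.
    private static final List<String> KEYWORDS = Arrays.asList("public", "class", "int");

    /** Lexes the whole text in one pass and returns tokens in document order. */
    static List<Token> lex(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            int start = i;
            if (Character.isWhitespace(c)) {
                i++; // whitespace produces no token
            } else if (text.startsWith("/*", i)) {
                int close = text.indexOf("*/", i + 2);
                // An unterminated comment runs to the end of the text.
                i = (close < 0) ? text.length() : close + 2;
                tokens.add(new Token(TokenType.COMMENT, start, i - start));
            } else if (Character.isJavaIdentifierStart(c)) {
                while (i < text.length() && Character.isJavaIdentifierPart(text.charAt(i))) {
                    i++;
                }
                String word = text.substring(start, i);
                TokenType type = KEYWORDS.contains(word) ? TokenType.KEYWORD : TokenType.IDENT;
                tokens.add(new Token(type, start, i - start));
            } else {
                i++; // any other single character is treated as an operator
                tokens.add(new Token(TokenType.OPER, start, 1));
            }
        }
        return tokens;
    }
}
```

Strings, numbers and line comments are left out to keep the sketch short. A real lexer file run through JFlex would cover those cases.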

So, what I initially did is create a DocumentListener that updates a matching List of Tokens for the Document, whenever the Document is updated.

Whenever the Document is updated, I just call setCharacterAttributes for all the tokens, depending on their type.
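
Roughly, the styling step looks like the sketch below. This is not the original listener: StyledTokenPainter, styleFor and applyStyles are hypothetical names, and the two-value TokenType and the colors are assumptions. The only Swing calls used are StyledDocument.setCharacterAttributes and the StyleConstants setters.

```java
import java.awt.Color;
import java.util.List;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.StyleConstants;
import javax.swing.text.StyledDocument;

/** Applies one attribute set per token to a StyledDocument; illustrative only. */
class StyledTokenPainter {

    enum TokenType { IDENT, KEYWORD }

    static final class Token {
        final TokenType type;
        final int start;
        final int length;

        Token(TokenType type, int start, int length) {
            this.type = type;
            this.start = start;
            this.length = length;
        }
    }

    /** Assumed styling: keywords are bold blue, everything else plain black. */
    static SimpleAttributeSet styleFor(TokenType type) {
        boolean keyword = (type == TokenType.KEYWORD);
        SimpleAttributeSet attrs = new SimpleAttributeSet();
        StyleConstants.setForeground(attrs, keyword ? Color.BLUE : Color.BLACK);
        StyleConstants.setBold(attrs, keyword);
        return attrs;
    }

    /** Restyles every token; this is the step that gets slow on big documents. */
    static void applyStyles(StyledDocument doc, List<Token> tokens) {
        for (Token t : tokens) {
            // replace = true discards whatever attributes the range had before
            doc.setCharacterAttributes(t.start, t.length, styleFor(t.type), true);
        }
    }
}
```

One caveat if this is driven from a DocumentListener: Swing documents do not allow mutation from inside a change notification, so the call to applyStyles would have to be deferred, for example with SwingUtilities.invokeLater.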

That worked perfectly. If you have just a few lines. It quickly became very slow for any document with more than about 100 lines. It also consumed a LOT of memory. The main problem is that a StyledDocument was not designed to have its styles updated this way.

When you write code, say you are writing the keyword "public":
  1. type "p", and parse the whole document. p is lexed as an identifier and those attributes are set to it, and everything else.
  2. type "u", same thing, "pu" is still lexed as identifier.
  3. type "b"...
  4. type "l"...
  5. type "i"...
  6. type "c" and now you have a keyword, so you change the char attributes for the whole "public".

Changing attributes is VERY slow in such cases. Lots of events are fired, and the StyledDocument keeps track of a lot of data about the styles of each character. For a script, or program, you will have a separate style for almost every single word. So you will have a lot of data for even the shortest of scripts. The StyledDocument was not designed for this. It was designed for normal "English" text, where most of it is the same style, except for a header here or a bold word there.

I initially changed the implementation to only call setCharacterAttributes for the modified parts of the Document. This was done by calculating a delta between the old and new Tokens, and then only the changes were used to update the styles. But the memory use was still too much. And when a big file was opened or pasted into the JEditorPane, it took a while to set all the attributes.
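
One way such a delta could be computed is sketched below. The original delta code is not shown in this post, so this is an assumption about its general shape: TokenDelta and changed are hypothetical names, and the edit is described by an offset plus the number of characters inserted (negative for a removal). Old tokens at or after the edit are shifted first, so that tokens which merely moved do not count as changed.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

/** Computes which tokens need restyling after an edit; illustrative only. */
class TokenDelta {

    enum TokenType { IDENT, KEYWORD }

    static final class Token {
        final TokenType type;
        final int start;
        final int length;

        Token(TokenType type, int start, int length) {
            this.type = type;
            this.start = start;
            this.length = length;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Token)) {
                return false;
            }
            Token t = (Token) o;
            return type == t.type && start == t.start && length == t.length;
        }

        @Override
        public int hashCode() {
            return Objects.hash(type, start, length);
        }
    }

    /**
     * Returns the new tokens that have no identical counterpart among the old
     * ones, after moving old tokens at or after editOffset by shift characters.
     */
    static List<Token> changed(List<Token> oldTokens, List<Token> newTokens,
                               int editOffset, int shift) {
        Set<Token> old = new HashSet<>();
        for (Token t : oldTokens) {
            old.add(t.start >= editOffset ? new Token(t.type, t.start + shift, t.length) : t);
        }
        List<Token> result = new ArrayList<>();
        for (Token t : newTokens) {
            if (!old.contains(t)) {
                result.add(t); // new or modified token: needs its attributes set
            }
        }
        return result;
    }
}
```

For the "public" example above, typing the final "c" at offset 5 turns IDENT(0,5) into KEYWORD(0,6). Only that token is reported, while the identifier after it just shifts by one and is left alone.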

It worked, but I could do better... And I am still having fun, so why stop there?

Take Three - Lexing + PlainDocument:

The final solution is to Lex the entire document whenever it changes (which is very fast) and use a PlainView and PlainDocument implementation to render the text using the drawUnselectedText method.

The code now is structured like this:
class SyntaxKit extends StyledEditorKit implements ViewFactory:
This class is used by the JEditorPane to set the type of text it will show. In NetBeans, I change the EditorKit property to use an instance of this class. The create method of this class returns an instance of the SyntaxView class below.

class SyntaxView extends PlainView implements DocumentListener
This is the heart of the code. This class maintains a List of Tokens that match the contents of the Document it is to render. It keeps itself in-sync with all document changes by registering itself as a DocumentListener. The insertUpdate and removeUpdate methods are implemented to re-parse, or Lex, the entire Document and put the Tokens in the tokens List member of this class. I removed the logic of maintaining a delta: re-lexing everything is fast and leaves less code to maintain. As I said, lexing was not a performance issue at all.

The drawUnselectedText method of this class is called to draw lines of text. This method looks at the tokens and draws them in the proper Fonts and Colors.

One more thing done in this class is to override the updateDamage method. This is needed so that something like closing a multi-line comment updates not just the last line, but all lines in the view.
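
The structure described above can be sketched as follows. This is not the project's code: the classes are named SketchSyntaxKit and SketchSyntaxView to make that clear, the Tok type and the word-based relex method are hypothetical stand-ins for the Token class and the JFlex lexer, and only colors (not fonts) are handled. It overrides the int-based drawUnselectedText, which newer JDKs mark as deprecated in favor of a Graphics2D variant.

```java
import java.awt.Color;
import java.awt.Component;
import java.awt.Graphics;
import java.awt.Shape;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.swing.event.DocumentEvent;
import javax.swing.event.DocumentListener;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.Element;
import javax.swing.text.PlainView;
import javax.swing.text.Segment;
import javax.swing.text.StyledEditorKit;
import javax.swing.text.Utilities;
import javax.swing.text.View;
import javax.swing.text.ViewFactory;

/** EditorKit that hands out the token-aware view below. */
class SketchSyntaxKit extends StyledEditorKit implements ViewFactory {
    @Override public ViewFactory getViewFactory() { return this; }
    @Override public View create(Element elem) { return new SketchSyntaxView(elem); }
}

/** PlainView that re-lexes on every change and paints each token in its color. */
class SketchSyntaxView extends PlainView implements DocumentListener {
    /** Hypothetical colored token covering [start, start + length). */
    static final class Tok {
        final int start, length;
        final Color color;
        Tok(int start, int length, Color color) {
            this.start = start; this.length = length; this.color = color;
        }
    }

    private static final Pattern WORD = Pattern.compile("\\w+");
    final List<Tok> tokens = new ArrayList<>();

    SketchSyntaxView(Element elem) {
        super(elem);
        getDocument().addDocumentListener(this);
        relex();
    }

    /** Stand-in lexer: every word is a token and "public" counts as a keyword. */
    private void relex() {
        tokens.clear();
        Document doc = getDocument();
        try {
            Matcher m = WORD.matcher(doc.getText(0, doc.getLength()));
            while (m.find()) {
                Color c = m.group().equals("public") ? Color.BLUE : Color.BLACK;
                tokens.add(new Tok(m.start(), m.end() - m.start(), c));
            }
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }
    }

    @Override public void insertUpdate(DocumentEvent e) { relex(); }
    @Override public void removeUpdate(DocumentEvent e) { relex(); }
    @Override public void changedUpdate(DocumentEvent e) { /* attributes only */ }

    @Override
    protected int drawUnselectedText(Graphics g, int x, int y, int p0, int p1)
            throws BadLocationException {
        int pos = p0;
        for (Tok t : tokens) {
            int start = Math.max(t.start, p0);
            int end = Math.min(t.start + t.length, p1);
            if (start >= end) continue;                       // token outside this range
            if (start > pos) x = draw(g, Color.BLACK, x, y, pos, start); // gap before token
            x = draw(g, t.color, x, y, start, end);
            pos = end;
        }
        return pos < p1 ? draw(g, Color.BLACK, x, y, pos, p1) : x;
    }

    private int draw(Graphics g, Color c, int x, int y, int from, int to)
            throws BadLocationException {
        Segment seg = new Segment();
        g.setColor(c);
        getDocument().getText(from, to - from, seg);
        return Utilities.drawTabbedText(seg, x, y, g, this, from);
    }

    @Override
    protected void updateDamage(DocumentEvent changes, Shape a, ViewFactory f) {
        super.updateDamage(changes, a, f);
        Component host = getContainer();
        if (host != null) host.repaint(); // repaint every visible line, not just the edited one
    }
}
```

With a JEditorPane, calling setEditorKit(new SketchSyntaxKit()) would be enough to try it out. Scanning the whole token list for every drawn range keeps the sketch short. A real view would look up only the tokens that overlap the range.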

If anybody is interested, I'll either put the code on a Google Project or show parts of it here. The project is now tightly integrated with TranScope, but I can spin it off as a separate project and remove the dependencies. There are currently Lexers for Java, Groovy, JavaScript, XML and Tandem / HP NSK TAL. To create your own, you only need to create the Lexer file and run it through JFlex.


Anonymous said...

Please post a version of this code that is not embedded. I would be very interested in seeing it.

Ayman said...

The home of the project is now on Google Code here

Greg said...

This is very exciting, I have great need for this in my open source EHR project as it has a fair amount of scripting available to customize the system. It is javascript but the java version looks nice anyway. I will look into the code and see how easy it is to syntax highlight my javascript keywords, methods and functions. e.g. in this script BaseData, ServerUtility and the method names would be highlighted

Ayman said...

@Greg, I'll post some wiki pages on how to do this on the main project site here.
Customization should be very easy with just a bit of work.

Aziz's Corner said...

Hello, I checked out the project from SVN. The lexer package is missing.
Maven can't build the code.
I want to build the code for JDK 1.5 in order to use it on a Mac.
Could you please check in the lexer package or provide a jar for JDK 1.5?

Thanks in Advance.

B said...

Yes, I've met the same problems :-( Could you please help us?

Ayman said...

please refer to the issue list in the project home:

ognivo777 said...

Great work!

Julián said...

Nice job !
