Design of clojure-ts-mode

Note: This document is still a work in progress.

Clojure-ts-mode is based on the tree-sitter-clojure grammar.

If you want to contribute to clojure-ts-mode, it is recommended that you familiarize yourself with how Tree-sitter works. The official documentation is a great place to start: https://tree-sitter.github.io/tree-sitter/

These guides for Emacs Tree-sitter development are also useful:

In short:

  • Tree-sitter is a tool that generates parser libraries for programming languages, and provides an API for interacting with those parsers.
  • The generated parsers can create syntax trees from source code text.
  • The nodes of those trees are defined by the grammar.
  • Emacs can use these generated parsers to provide major modes with things like syntax highlighting, indentation, navigation, structural editing, and many other things.

Important Definitions

  • Parser: A dynamic library compiled from C source code that is generated by the Tree-sitter tool. A parser reads source code for a particular language and produces a syntax tree.
  • Grammar: The rules that define how a parser will create the syntax tree for a language. The grammar is written in JavaScript. Tree-sitter tooling consumes the grammar as input and outputs C source (which can be compiled into a parser).
  • Syntax Tree: A tree data structure composed of syntax nodes that represents some source code text.
    • Concrete Syntax Tree: Syntax trees that contain nodes for every token in the source code, including things like brackets and parentheses. Tree-sitter creates Concrete Syntax Trees.
    • Abstract Syntax Tree: A syntax tree with less important details removed. An AST may contain a node for a list, but not individual parentheses. Tree-sitter does not create Abstract Syntax Trees.
  • Syntax Node: A node in a syntax tree. It represents some subset of a source code text. Each node has a type, defined by the grammar used to produce it. Some common node types represent language constructs like strings, integers, and operators.
    • Named Syntax Node: A node that can be identified by a name given to it in the Tree-sitter Grammar. In clojure-ts-mode, list_lit is a named node for lists.
    • Anonymous Syntax Node: A node that cannot be identified by a name. In the Grammar these are identified by simple strings, not by complex Grammar rules. In clojure-ts-mode, "(" and ")" are anonymous nodes.
  • Font Locking: The Emacs terminology for "syntax highlighting".
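
The difference between named and anonymous nodes can be observed directly from Emacs Lisp. The following is a minimal sketch, assuming Emacs 29 or later built with Tree-sitter support and an installed clojure grammar. my/clojure-node-kinds is a hypothetical helper used only for illustration; it is not part of clojure-ts-mode.

```elisp
(require 'treesit)

(defun my/clojure-node-kinds (source)
  "Return (TYPE . NAMED-P) pairs for the children of the first form in SOURCE.
Hypothetical helper; assumes the `clojure' grammar is installed."
  (with-temp-buffer
    (insert source)
    (let* ((parser (treesit-parser-create 'clojure))
           (root (treesit-parser-root-node parser))
           ;; The first child of the root node is the first top-level form.
           (form (treesit-node-child root 0)))
      (mapcar (lambda (child)
                (cons (treesit-node-type child)
                      ;; Non-nil for named nodes, nil for anonymous ones.
                      (treesit-node-check child 'named)))
              (treesit-node-children form)))))
```

Evaluating (my/clojure-node-kinds "(+ 1 2)") should flag nodes such as sym_lit and num_lit as named, while the "(" and ")" tokens should be reported as anonymous.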

tree-sitter-clojure

clojure-ts-mode uses an experimental version of the tree-sitter-clojure grammar, which can be found at https://github.com/sogaiu/tree-sitter-clojure/tree/unstable-20250526. The grammar provides very basic, low-level nodes that try to match Clojure's very light syntax.

There are nodes to represent:

  • Symbols (sym_lit)
    • Contain (sym_ns) and (sym_name) nodes
  • Keywords (kwd_lit)
    • Contain (kwd_ns) and (kwd_name) nodes
  • Strings (str_lit)
    • Contains (str_content) node
  • Chars (char_lit)
  • Nil (nil_lit)
  • Booleans (bool_lit)
  • Numbers (num_lit)
  • Comments (comment, dis_expr)
    • dis_expr is the #_ discard expression
  • Lists (list_lit)
  • Vectors (vec_lit)
  • Maps (map_lit)
  • Sets (set_lit)
  • Metadata nodes (meta_lit, old_meta_lit)
  • Regex content (regex_content)
  • Function literals (anon_fn_lit)

The best place to learn more about the tree-sitter-clojure grammar is to read the grammar.js file from the tree-sitter-clojure repository.

Differences between the stable and experimental grammars

Standalone metadata nodes

Metadata nodes in the stable grammar appear as child nodes of the nodes the metadata is defined on. For example, a simple vector with metadata defined on it like so:

^:has-metadata [1]

will produce a parse tree like so

(vec_lit
  meta: (meta_lit
          value: (kwd_lit name: (kwd_name)))
  value: (num_lit))

Although it's somewhat closer to how Clojure treats metadata itself, in the context of a text editor it creates some problems, which were discussed here. To name a few:

  • The forward-sexp command would skip both the metadata and the node it's attached to. Called from an opening paren, it would signal the error "No more sexp to move across".
  • The kill-sexp command would kill both the metadata and the node it's attached to.
  • backward-up-list called from inside a list with metadata would move point to the beginning of the metadata node.
  • Internally we had to introduce some workarounds to skip metadata nodes or figure out where the actual node starts.

Special nodes for string content and regex content

To parse the content of certain strings with a separate grammar, it is necessary to extract the string's content, excluding its opening and closing quotes. To achieve this, Emacs 31 allows specifying offsets for treesit-range-settings. However, in Emacs 30.1, this feature is broken due to bug #77848 (a fix is anticipated in Emacs 30.2). The presence of str_content and regex_content nodes allows us to support this feature across all Emacs versions without relying on offsets.

Clojure Syntax, not Clojure Semantics

An important observation that anyone familiar with popular Tree-sitter grammars may have picked up on is that there are no nodes representing things like functions, macros, types, and other semantic concepts. Representing the semantics of Clojure in a Tree-sitter grammar is much more difficult than for traditional languages that do not use macros as heavily as Clojure and other Lisps do.

Understanding what an expression represents in Clojure source code requires macro-expansion of the source code. Macro-expansion requires a runtime, and Tree-sitter does not have access to a Clojure runtime and will never have access to one. Additionally, Tree-sitter never looks back on what it has parsed, only forward, considering what is directly ahead of it. So even if it could identify a macro like myspecialdef, it would forget about it as soon as it moved past the declaring defmacro node. Another way to think about this: Tree-sitter is designed to be fast and good enough for tooling to implement syntax highlighting, indentation, and other editing conveniences. It is not meant for interpretation and execution.

Example 1: False Negative Function Classification

Consider the following macro

(defmacro defn2 [sym args & body]
  `(defn ~sym ~args ~@body))

(defn2 dog [] "bark")

This macro lets the caller define a function, but a hypothetical tree-sitter-clojure semantic grammar might just see a function call where a variable dog is passed as an argument. How should Tree-sitter know that dog should be highlighted like a function? It would have to evaluate the defn2 macro to understand that.

Example 2: False Positive Function Classification

(defmacro no-defn [body]
  (if (= 'defn (first body))
    (rest body)
    body))
(defn foo [& rest] 1)
(no-defn (defn foo [] 2))

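The last form expands to (foo [] 2), a call to the existing variadic foo, so it evaluates to 1. Because foo is never redefined, the following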

(foo)

evaluates to 1.

How is Tree-sitter supposed to understand that (defn foo [] 2) of the expression (no-defn (defn foo [] 2)) is not a function declaration? It would have to evaluate the no-defn macro.

Syntax and Semantics: Conclusions

While these examples are silly, they illustrate the issue with encoding semantics into the tree-sitter-clojure grammar. If we tried to make the grammar understand functions, macros, types, and other semantic elements, it would end up giving false positives and negatives in the parse tree. While this is inevitable for simple static analysis of Clojure code, tree-sitter-clojure chooses to avoid making these kinds of mistakes altogether. Instead, it is up to the Emacs Lisp code and other consumers of the tree-sitter-clojure grammar to make decisions about the semantic meaning of Clojure code.
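
To make this concrete, here is a minimal sketch of how a consumer might layer a semantic guess on top of the syntax-only tree. It assumes the experimental grammar described above (so metadata is not part of the symbol node) and Emacs 29 or later. my/defn-form-p is a hypothetical helper and not the way clojure-ts-mode actually classifies forms.

```elisp
(require 'treesit)

(defun my/defn-form-p (node)
  "Return non-nil if NODE is a list whose first named child is the symbol `defn'.
Hypothetical helper.  This is a purely textual check: it knows nothing
about macros that wrap or strip `defn'."
  (and (equal (treesit-node-type node) "list_lit")
       ;; Passing t as the third argument skips anonymous nodes such as \"(\".
       (let ((head (treesit-node-child node 0 t)))
         (and head
              (equal (treesit-node-type head) "sym_lit")
              (equal (treesit-node-text head t) "defn")))))
```

Such a check would miss (defn2 dog [] "bark") from Example 1 and would wrongly accept the inner (defn foo [] 2) from Example 2. The point is that this trade-off is made in the consumer, where it is cheap to change, rather than in the grammar.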

There are some pros and cons of this decision for tree-sitter-clojure to only consider syntax and not semantics. Some of the (non-exhaustive) upsides:

  • No semantic false positives or negatives in the parse tree.
  • Simple grammar to maintain, with fewer nodes and rules.
  • Small, fast grammar (with a small set of grammar rules, tree-sitter-clojure has one of the smallest binaries and fastest grammars in widespread use)
  • Stability: the grammar changes infrequently and is very stable for downstream consumers

And the primary downside: Semantics must be (re)implemented in tools that consume the grammar. While this results in more work for tooling authors, the tools that use the grammar are easier to change than the grammar itself. The inaccurate nature of statically interpreting Clojure semantics means that not every decision made for the grammar would meet the needs of the various grammar consumers. This would lead to bugs and feature requests. Nearly all changes to the grammar will result in some sort of breakage for its consumers, so changes are best avoided once the grammar has stabilized. Therefore, avoiding these semantic interpretations in the grammar is one of the best ways to minimize changes in the grammar.

Further Reading

Syntax Highlighting

To set up Tree-sitter fontification, clojure-ts-mode sets the treesit-font-lock-settings variable to the output of clojure-ts--font-lock-settings, and then calls treesit-major-mode-setup.

clojure-ts--font-lock-settings returns a list of compiled queries. Each query must have at least one capture name (names that start with @). If a capture name matches an existing face name (e.g., font-lock-keyword-face), the captured node will be fontified with that face.

A capture name can also be arbitrary and used to check the text of the captured node. It can also be used for both fontification and text checking. For example in the following query:

`((list_lit :anchor [(comment) (meta_lit) (old_meta_lit)] :*
            :anchor (sym_lit !namespace name: (sym_name) @font-lock-keyword-face))
  (:match ,clojure-ts--builtin-symbol-regexp @font-lock-keyword-face))

We match any list whose first symbol (skipping any number of comments and metadata nodes) does not have a namespace and matches a regex stored in the clojure-ts--builtin-symbol-regexp variable. The matched symbol is fontified using font-lock-keyword-face.

Important

Compiling queries at runtime is very expensive; therefore, it should be avoided as much as possible. Ideally, all queries should be pre-compiled and stored as defconst constants.
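
As a rough illustration of this advice, the sketch below compiles a query once and reuses it on every call. The names my/clojure-keyword-query and my/clojure-keywords-in-buffer are hypothetical and the query is deliberately trivial; it assumes the clojure grammar is installed and that the current buffer has a clojure parser.

```elisp
(require 'treesit)

;; Hypothetical constant: the query is compiled once, when this file is loaded,
;; instead of on every call to the function below.
(defconst my/clojure-keyword-query
  (treesit-query-compile 'clojure '((kwd_lit) @keyword))
  "Compiled query capturing every keyword literal.")

(defun my/clojure-keywords-in-buffer ()
  "Return the text of every keyword literal in the current buffer."
  (mapcar (lambda (capture)
            ;; Each capture is a cons cell of the form (CAPTURE-NAME . NODE).
            (treesit-node-text (cdr capture) t))
          (treesit-query-capture (treesit-buffer-root-node 'clojure)
                                 my/clojure-keyword-query)))
```

Passing the raw query list to treesit-query-capture would also work, but the query would then have to be compiled again on each call.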

Embedded parsers

The Clojure grammar in clojure-ts-mode is a main or "host" grammar. Emacs also supports the use of any number of "embedded" grammars. clojure-ts-mode currently uses the markdown-inline grammar to highlight Markdown constructs in docstrings and the regex grammar to highlight regular expression syntax.

To use an embedded parser, clojure-ts-mode must set an appropriate value for the treesit-range-settings variable. The Clojure grammar provides convenient nodes to capture only the content of strings and regexes, which makes defining range settings for regexes quite simple:

(treesit-range-rules
 :embed 'regex
 :host 'clojure
 :local t
 '((regex_content) @capture))

For docstrings, the query is a bit more complex. Therefore, we have the function clojure-ts--docstring-query, which is used for syntax highlighting, indentation rules, and range settings for the embedded Markdown parser:

(treesit-range-rules
 :embed 'markdown-inline
 :host 'clojure
 :local t
 (clojure-ts--docstring-query '@capture))

It is important to use the :local option for embedded parsers; otherwise, the range will not be restricted to the captured node, which will lead to broken fontification (see bug #77733).
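
Embedded parsers can only be used when the corresponding grammars are installed. The following sketch, which uses a hypothetical helper named my/missing-grammars, shows one way to check this from Emacs Lisp before relying on the range settings above. It is not taken from clojure-ts-mode.

```elisp
(require 'seq)
(require 'treesit)

(defun my/missing-grammars (languages)
  "Return the subset of LANGUAGES for which no Tree-sitter grammar can be loaded.
Hypothetical helper for illustration only."
  (seq-remove #'treesit-language-available-p languages))

;; Example call; an empty result means all three grammars are usable.
(my/missing-grammars '(clojure markdown-inline regex))
```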

Additional information

To find more details one can evaluate the following expression in Emacs:

(info "(elisp) Parser-based Font Lock")

Indentation

To enable the parser-based indentation engine, clojure-ts-mode sets treesit-simple-indent-rules to the output of clojure-ts--configured-indent-rules, and then calls treesit-major-mode-setup.

According to the documentation of the treesit-simple-indent-rules variable, its value is:

A list of indent rule settings. Each indent rule setting should be (LANGUAGE RULE...), where LANGUAGE is a language symbol, and each RULE is of the form

(MATCHER ANCHOR OFFSET)

MATCHER determines whether this rule applies, ANCHOR and OFFSET together determines which column to indent to.

For example, a set of rules like this:

'((clojure
   ((parent-is "^vec_lit$") parent 1)
   ((parent-is "^map_lit$") parent 1)
   ((parent-is "^set_lit$") parent 2)))

will indent any node whose parent node is a vec_lit or map_lit with 1 space, starting from the beginning of the parent node. For set_lit, it will add two spaces because sets have two opening characters: # and {.

In the example above, the parent-is matcher and parent anchor are built-in presets. There are many predefined presets provided by Emacs. The list of all available presets can be found in the documentation for the treesit-simple-indent-presets variable.

Sometimes, more complex behavior than predefined built-in presets is required. In such cases, you can write your own matchers and anchors. One good example is the clojure-ts--match-form-body matcher. It attempts to match a node at point using the combined value of clojure-ts--semantic-indent-rules-defaults and clojure-ts-semantic-indent-rules. These rules have a similar format to cljfmt indentation rules. clojure-ts-semantic-indent-rules is a customization option that users can tweak.

clojure-ts--match-form-body traverses the syntax tree, starting from the node at point, towards the top of the tree in order to find a match. In addition to clojure-ts--semantic-indent-rules-defaults and clojure-ts-semantic-indent-rules, it may also use clojure-ts-get-indent-function if it is not nil. This function provides an API for dynamic indentation and must return a value compatible with cider-nrepl.

Searching for an indentation rule across all these variables is slow; therefore, clojure-ts--semantic-indent-rules-cache was introduced. It is set when clojure-ts-mode is activated in a Clojure source buffer and refreshed every time clojure-ts-semantic-indent-rules is updated (using setopt or the customization interface) or when a .dir-locals.el file is updated.
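
For readers who have not written a custom matcher before, here is a deliberately simplistic sketch of one. A matcher is a function called with the node, its parent, and the beginning-of-line position. The names my/match-when-body and my/example-indent-rules are hypothetical, and the logic is far simpler than clojure-ts--match-form-body: it does not consult any of the rule variables mentioned above.

```elisp
(require 'treesit)

(defun my/match-when-body (_node parent _bol &rest _)
  "Matcher: return non-nil when PARENT is a list whose first symbol is `when'.
Hypothetical example; it ignores namespaces, comments, and metadata."
  (and parent
       (equal (treesit-node-type parent) "list_lit")
       ;; First named child of the list, i.e. the symbol in head position.
       (let ((head (treesit-node-child parent 0 t)))
         (and head
              (equal (treesit-node-type head) "sym_lit")
              (equal (treesit-node-text head t) "when")))))

(defvar my/example-indent-rules
  '((clojure
     ;; Custom matcher first: children of a (when ...) form get 2 columns.
     (my/match-when-body parent 2)
     ;; Fallback using a built-in preset: other list children get 1 column.
     ((parent-is "^list_lit$") parent 1)))
  "Illustrative value in the shape expected by `treesit-simple-indent-rules'.")
```

Rules are tried in order, so the more specific custom matcher is placed before the generic parent-is rule.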

Additional information

To find more details one can evaluate the following expression in Emacs:

(info "(elisp) Parser-based Indentation")