Reimplement markdown escaping in C++ #1838

Open
hadley wants to merge 3 commits into main from md-rd-escaping-cpp

hadley commented Mar 20, 2026

x <- r"(See \code{foo()} and \link{bar})"
y <- strrep("x", 1e3)

bench::mark(
  escape_rd_for_md_c(x),
  escape_rd_for_md_c(y),
  check = FALSE
)[1:5]

This branch:

# A tibble: 2 × 5
  expression                 min   median `itr/sec` mem_alloc
  <bch:expr>            <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 escape_rd_for_md_c(x)   2.34µs   3.32µs   248920.        0B
2 escape_rd_for_md_c(y)   4.39µs   4.67µs   206356.        0B

Main branch:

# A tibble: 2 × 5
  expression               min   median `itr/sec` mem_alloc
  <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 escape_rd_for_md(x)    154µs    170µs     5688.    3.56KB
2 escape_rd_for_md(y)    109µs    121µs     8112.   64.07KB

hadley added 3 commits March 20, 2026 08:42
This substantially improves performance of a common parsing bottleneck.

hadley commented Mar 21, 2026

@gaborcsardi could you take a look at this? I'm not 100% convinced that it's worth reviewing this code, but it is a lot faster, and I think the underlying ideas are actually a bit easier to see in C++. This function is called on just about every tag component, so it is a reasonable place to optimise.

gaborcsardi left a comment

Was this a bottleneck in devtools::document()?

According to my not very sophisticated timings, this PR does make devtools::document() very slightly (3-4%) faster for some packages (ps, processx, rlang), and has essentially no effect for others (testthat). E.g. this is testthat:
Before:

> system.time(devtools::document())
ℹ Updating testthat documentation
ℹ Loading testthat
   user  system elapsed
  1.520   0.171   1.845

After:

> system.time(devtools::document())
ℹ Updating testthat documentation
ℹ Loading testthat
   user  system elapsed
  1.501   0.173   1.836

(These are the fastest of several runs for each case.)

The C++ code seems mostly straightforward. The risk I see is that there might be edge cases we don't anticipate. I tried it on a bunch of packages and it seems to be OK, but people might be doing weird things. Should that happen, they can still use an older roxygen2 until we fix up the edge cases. So if you think that this code is better and easier to maintain, then we should merge it.

Btw. to speed this up further for repeated runs, we could memoize this function.
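
A minimal sketch of the memoization idea, assuming the escaping step is a pure function of its input string. `MemoizedEscaper` and `escape_stub` are hypothetical names, and the stub only stands in for the real escaping routine; none of this is code from the PR.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical wrapper that caches results of a pure string -> string function.
class MemoizedEscaper {
public:
  explicit MemoizedEscaper(std::function<std::string(const std::string&)> fn)
      : fn_(std::move(fn)) {}

  // Return the cached result if this input was seen before, else compute it.
  // References into std::unordered_map stay valid across later insertions.
  const std::string& operator()(const std::string& input) {
    auto it = cache_.find(input);
    if (it == cache_.end()) {
      ++misses_;
      it = cache_.emplace(input, fn_(input)).first;
    }
    return it->second;
  }

  // Number of times the wrapped function actually had to run.
  std::size_t misses() const { return misses_; }

private:
  std::function<std::string(const std::string&)> fn_;
  std::unordered_map<std::string, std::string> cache_;
  std::size_t misses_ = 0;
};

// Stand-in for the real escaping routine (assumption): wraps input in brackets.
inline std::string escape_stub(const std::string& text) {
  return "[" + text + "]";
}
```

This only pays off when the same tag text recurs across runs, and the cache grows with the number of distinct inputs, so it trades memory for time.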

This is a regression test for Markdown escaping.
}
\details{

Member

Are these changes expected?

\code{escape_rd_for_md()} replaces fragile Rd tags with placeholders, to avoid
interpreting them as markdown. \code{unescape_rd_for_md()} puts the original
text back in place of the placeholders after the markdown parsing is done.
The fragile tags are listed in \code{escaped_for_md}.
Member

Is this change expected?

### Some Rd tags can't contain markdown

When mixing `Rd` and Markdown notation, most `Rd` tags may contain Markdown markup, the ones that can *not* are: `r paste0("\x60", roxygen2:::escaped_for_md, "\x60", collapse = ", ")`.
When mixing `Rd` and Markdown notation, most `Rd` tags may contain Markdown markup, the ones that can *not* are: `\acronym`, `\code`, `\command`, `\CRANpkg`, `\deqn`, `\doi`, `\dontrun`, `\dontshow`, `\donttest`, `\email`, `\env`, `\eqn`, `\figure`, `\file`, `\if`, `\ifelse`, `\kbd`, `\link`, `\linkS4class`, `\method`, `\mjeqn`, `\mjdeqn`, `\mjseqn`, `\mjsdeqn`, `\mjteqn`, `\mjtdeqn`, `\newcommand`, `\option`, `\out`, `\packageAuthor`, `\packageDescription`, `\packageDESCRIPTION`, `\packageIndices`, `\packageMaintainer`, `\packageTitle`, `\pkg`, `\PR`, `\preformatted`, `\renewcommand`, `\S3method`, `\S4method`, `\samp`, `\special`, `\testonly`, `\url`, `\var`, `\verb`.
Member

I think it would make sense to generate this list programmatically instead of repeating it.

Member Author

It's now in a C++ vector, so it's not easy to pull. But given that it changes rarely and there's a reminder comment in the C++ code (which Claude is likely to read), I think it's low risk.

Member

We can add a C++ function that just returns that vector as an R character vector, no?
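
A rough sketch of that idea, assuming the tag list lives behind a single C++ accessor. `escaped_tags()` and `is_escaped_tag()` are hypothetical names, the list is abbreviated to a few of the tags quoted above, and the step that exports it to R as a character vector is left out because it depends on the binding layer the package uses.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical single source of truth for the fragile tags.
// Abbreviated for illustration; the real list is much longer.
inline const std::vector<std::string>& escaped_tags() {
  static const std::vector<std::string> tags = {
      "\\code", "\\link", "\\url", "\\verb"};
  return tags;
}

// The escaping code consults the same list that an exported wrapper
// would hand back to R, so the two cannot drift apart.
inline bool is_escaped_tag(const std::string& tag) {
  const auto& tags = escaped_tags();
  return std::find(tags.begin(), tags.end(), tag) != tags.end();
}
```

An exported wrapper could then copy `escaped_tags()` into an R character vector, which would let the documentation build the list with inline R code rather than a hard-coded copy.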

i = j;

// Check if the tag has arguments (next char must be '{')
if (i >= n || text[i] != '{') {
Member

I think the next character can also be a [, e.g. \link[=dest]{name}.
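
A minimal sketch of how the check could allow an optional bracketed argument before the brace. `skip_optional_bracket` and `tag_has_args` are hypothetical helpers, not the functions in this PR, and the sketch assumes the optional argument contains no nested or escaped `]`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: if text[i] starts an optional "[...]" argument,
// return the index just past the closing ']'; otherwise return i unchanged.
// Assumes no nested or escaped ']' inside the brackets.
inline std::size_t skip_optional_bracket(const std::string& text, std::size_t i) {
  if (i < text.size() && text[i] == '[') {
    std::size_t close = text.find(']', i + 1);
    if (close != std::string::npos) {
      return close + 1;
    }
  }
  return i;
}

// A tag has arguments if, after any optional "[...]", the next char is '{'.
// i is the index just past the tag name.
inline bool tag_has_args(const std::string& text, std::size_t i) {
  i = skip_optional_bracket(text, i);
  return i < text.size() && text[i] == '{';
}
```

With this shape, `\link[=dest]{name}` and `\link{bar}` are both treated as tags with arguments, while an unterminated `\link[=dest` is not.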
