Fix a number of em/strong issues (#641, #642, #643) by Crozzers · Pull Request #644 · trentm/python-markdown2

Crozzers · 2025-10-05T16:05:08Z

This PR fixes #641, fixes #642 and fixes #643.

I put all these fixes in the same PR because I wanted to make sure they were compatible with each other, but that does mean it's a bit messy.

Middle word em breaking `(*em*)` (#641)

The middle word em regex checked for `*`/`_` chars that aren't preceded by another em char or whitespace. The text `(*em*)` matches that, because the leading em char is preceded by `(` rather than a space. The regex therefore treated it as a middle-word em and prevented the text from being processed as a valid em.
I've updated the regex to look for em chars that aren't preceded by a non-word char (instead of only whitespace), which fixed the issue.
The result is that we process this as expected:

(<em>em</em>)
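To illustrate the lookbehind change described above, here is a minimal sketch. `OLD_MIDDLE_EM`, `NEW_MIDDLE_EM` and `middle_word_em_positions` are hypothetical names, and the patterns are simplified stand-ins written for this example; they are not the regexes from the PR.

```python
import re

# Hypothetical stand-ins for the lookbehind change; NOT the patterns used in markdown2.
# Old idea: an em char counts as "middle word" when the previous char is
# neither whitespace nor another em char.
OLD_MIDDLE_EM = re.compile(r'(?<=[^*_\s])[*_](?=\w)')
# New idea: the previous char must be a word char (letter or digit), so
# punctuation such as "(" no longer qualifies.
NEW_MIDDLE_EM = re.compile(r'(?<=[^\W_])[*_](?=\w)')


def middle_word_em_positions(pattern, text):
    """Return the indexes of em chars the pattern treats as middle-word."""
    return [m.start() for m in pattern.finditer(text)]


print(middle_word_em_positions(OLD_MIDDLE_EM, "(*em*)"))       # [1]
print(middle_word_em_positions(NEW_MIDDLE_EM, "(*em*)"))       # []
print(middle_word_em_positions(NEW_MIDDLE_EM, "mid*word*em"))  # [3, 8]
```

With the old-style lookbehind the `*` after `(` is flagged as a middle-word em; with the new-style one it is not, while genuine middle-word em chars are still found.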

Improve handling for leading underscores (#642)

In this issue we had what looked like an `<em>` span, but it was straddling two other `<strong>` spans:

**_confusing** ident is **_confusing**

This is not a valid em. Spans can be nested but they shouldn't stay open after the parent span closes.

I added some additional logic in the italics and bold stage that will check to see if the matched strong/em has any nested spans and that those spans are balanced and closed. If not, the strong/em is deemed invalid.

The result is that we process the strongs here, but not the em:

<p><strong>_confusing</strong> ident is <strong>_confusing</strong></p>
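The balance check can be sketched roughly as follows. This assumes the strong spans have already been converted to tags by the time the em candidate is inspected, which may not match the real processing order. `TAG_RE` and `nested_spans_balanced` are hypothetical names, not functions from markdown2.

```python
import re

# Hypothetical helper: matches opening/closing <strong> and <em> tags.
TAG_RE = re.compile(r'<(/?)(strong|em)>')


def nested_spans_balanced(inner):
    """Return True if every strong/em tag inside `inner` is opened and
    closed within `inner` itself, in properly nested order."""
    stack = []
    for match in TAG_RE.finditer(inner):
        is_closing = match.group(1) == '/'
        name = match.group(2)
        if not is_closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            # A close with no matching open: the span started outside `inner`.
            return False
    # Anything left on the stack is still open when the candidate span ends.
    return not stack


# Contents of the would-be em span from #642, after the strong pass:
print(nested_spans_balanced('confusing</strong> ident is <strong>'))  # False
print(nested_spans_balanced('a <strong>b</strong> c'))                # True
```

In the #642 example the candidate em contains a `</strong>` whose opener lies outside it, so the check rejects the em and only the strongs are kept.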

Consecutive strong/em can overlap (#643)

The strong/em regexes were starting their matches as early as possible and including as much text in the span as possible. This led to the following text being processed like so:

**strong***em***strong**
strong*em*strong
strongemstrong

This renders fine in most browsers, but is invalid HTML.

To fix this, I modified the strong regex to ignore as many leading `*`/`_` chars as possible, so that the opening `<strong>` tag sits as close to the actual contents as possible, and to close the `<strong>` as soon as possible.

Previously the strong regex would process ***abc*** as *abc* but now it will do *abc

The effect of this is when we have consecutive strong and ems, they won't overlap anymore.
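As a rough sketch of the "open late, close early" idea, the two passes below use simplified patterns written for this example. `STRONG_RE`, `EM_RE` and `render_spans` are hypothetical names and do not reproduce the regexes in the PR.

```python
import re

# Hypothetical simplified patterns; NOT the regexes from markdown2.
# The negative lookahead skips leading em chars so the opening delimiter sits
# right next to the content, and the lazy `.+?` closes at the first opportunity.
STRONG_RE = re.compile(r'(\*\*|__)(?=\S)(?![*_])(.+?)(?<=\S)\1')
EM_RE = re.compile(r'(\*|_)(?=\S)(.+?)(?<=\S)\1')


def render_spans(text):
    """Run a strong pass and then an em pass over `text`."""
    text = STRONG_RE.sub(r'<strong>\2</strong>', text)
    return EM_RE.sub(r'<em>\2</em>', text)


print(render_spans('**strong***em***strong**'))
# <strong>strong</strong><em>em</em><strong>strong</strong>
print(render_spans('***abc***'))
# <em><strong>abc</strong></em>
```

Under these assumptions the consecutive strong and em spans come out as siblings, and `***abc***` nests the strong inside the em instead of overlapping it.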

The unfortunate side effect is that GitHub will process ***abc** as *abc, but we will output *abc instead, omitting that first em char from the span.

nicholasserra · 2025-10-06T00:15:12Z

This all looks reasonable, thank you for taking on all these edge cases!

justanotheranonymoususer · 2025-10-06T08:06:56Z

Thanks!

> The unfortunate side effect is Github will process ***abc** as *abc, but we will output *abc instead, omitting that first em char from the span.

Actually, GitHub seems to output *abc as well, so looks like a fix too.

justanotheranonymoususer · 2025-10-06T08:15:09Z

Quick report for gaps/regressions, I'll create issues later unless you ninja-fix it:

**one*two***

A_**B **text **c** d

x A_**B** y

Crozzers added 5 commits October 4, 2025 10:51

Fix trentm#641

64af599

Fix trentm#642

4fb2fa1

Fix trentm#643

9a48294

Fix ReDoS regression

3173942

update changelog

40bd17f

nicholasserra merged commit 9a88ce1 into trentm:master Oct 6, 2025
15 checks passed

nicholasserra merged 5 commits into trentm:master from Crozzers:fix-em-strong-issues
