Add ancestral state reconstruction (coev_ancestral_states) by ErikRingen · Pull Request #110 · ScottClaessens/coevolve

ErikRingen · 2026-04-27T10:02:37Z

Summary

Add coev_ancestral_states() for extracting posterior estimates of ancestral trait values at phylogenetic nodes. Long-format output (node, variable, category, estimate, lower, upper); supports latent and response scales, specific node/variable selection, raw draws, and multi-tree models. Ordinal-response output expands to one row per category and uses the posterior mean so per-node category estimates sum exactly to 1 (matching the convention of predict.brmsfit).
Add coev_plot_node_labels() for plotting internal node IDs on the tree (auto-scaling label size, customizable appearance).
Add coev_identify_nodes() — wraps ape::identify.phylo() in a loop so that multiple node clicks accumulate before returning.
Add vignette "Ancestral State Reconstruction with coevolve" covering latent/response-scale extraction, colored-node and pie-chart tree plots, root density, node identification, and node-specific queries. Includes executable .Rmd.orig source wired into vignettes/precompile.R.
NEWS.md entries; bump dev version to 1.0.0.9001; add articles: block to _pkgdown.yml so the vignette appears on the package website.

Test plan

lintr::lint_package() — 0 lints
R CMD check --as-cran — 0 ERRORs, only pre-existing WARNING/NOTEs
Tests for coev_ancestral_states() covering input validation, latent/response scales (normal, bernoulli, poisson, gamma, ordered logistic), raw draws, variable subsetting, CI width, multi-tree, mixed ordinal/non-ordinal output, and per-node category estimates summing to 1
Tests for coev_plot_node_labels() covering input validation, return type, custom parameters, and multiPhylo trees
Tests for coev_identify_nodes() covering input validation
codecov: 100% patch coverage

Closes #86

Implements the core extraction function for ancestral state reconstruction. Extracts posterior draws of eta at internal nodes, maps them to tree node IDs, and returns either a summary tibble (median + CIs, ggtree-compatible) or raw draws. Single-tree path only for now. Response scale and multi-tree support to follow. Tests cover: input validation, output structure, node selection (internal/all/specific), variable subsetting, CI width, summary vs draws mode, and all single-tree fixture models. https://claude.ai/code/session_01J6wGw5nJjfG1Eavh4pq9kR

Response scale support for all distributions: - normal: identity - bernoulli_logit: plogis(eta) -> P(y=1) - ordered_logistic: category probabilities via cutpoints - poisson_softplus: mean(obs) * softplus(eta) - negative_binomial_softplus: mean(obs) * softplus(eta) - gamma_log: exp(eta) For ordered_logistic on response scale, output has prob_1, prob_2, ... columns instead of estimate/lower/upper. Multi-tree tests cover tree_id selection and default behavior. https://claude.ai/code/session_01J6wGw5nJjfG1Eavh4pq9kR

Add coev_plot_node_labels() for visualizing internal node IDs on the phylogenetic tree, with auto-scaling label size and customizable appearance. Add ancestral_states.qmd vignette demonstrating the full ASR workflow: latent/response scale extraction, tree visualization, interactive node selection via identify(), and node-specific queries. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Match the format of the existing coevolve.Rmd vignette: static code blocks with pre-rendered output, pre-generated PNG figures, and proper R vignette metadata (VignetteIndexEntry, rmarkdown::html_vignette, pkgdown: as_is). This avoids requiring Quarto or expensive model fitting during package build. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Fix all lintr violations: indentation, commented code, explicit returns, line lengths - Fix multiPhylo coercion: use list() instead of c() to preserve phylo class when wrapping a single tree - Fix ordinal prob sum test: median of category probs need not sum to exactly 1 (median doesn't distribute over sums) - Fix expect_no_error() calls that incorrectly passed label arg - Fix Rd cross-reference warning for bracket syntax in docs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

codecov · 2026-04-27T10:19:04Z

Codecov Report

❌ Patch coverage is 99.58159% with 1 line in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
R/coev_identify_nodes.R	83.33%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Add tests for single-node response-scale extraction (ordinal and non-ordinal models) to exercise the matrix-reshape branches in apply_inverse_link(), the non-numeric nodes input validation path, and the multiPhylo branch in coev_plot_node_labels(). Mark two genuinely unreachable defensive lines (1-cutpoint reshape, switch identity fallback) with # nocov. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ScottClaessens

I've left some minor comments, but overall I think this is awesome. Thanks so much for doing that!

ScottClaessens · 2026-04-29T14:08:52Z

@@ -0,0 +1,398 @@
+#' Extract ancestral state estimates from a fitted coevolutionary model
+#'
+#' Extracts posterior estimates of latent trait values at internal nodes


Suggested change

#' Extracts posterior estimates of latent trait values at internal nodes

#' Extracts posterior estimates of trait values at internal nodes

Not necessarily latent if user specifies scale = "response"

Though I suppose the posterior estimates are on the latent scale by default. I'll let you decide.

ScottClaessens · 2026-04-29T14:10:54Z

+#'     \item{clade_pp}{Clade posterior probability (1.0 for single-tree models)}
+#'   }
+#'   The data frame has attributes \code{ref_tree} (the phylo object used),
+#'   \code{prob}, and \code{scale}. The format is compatible with ggtree's


Good choice to not include ggtree as a dependency.

ScottClaessens · 2026-04-29T14:15:47Z

+#' @param tree_id (optional) An integer indicating which tree to use when the
+#'   model was fit with a \code{multiPhylo} object. If \code{NULL} (default),
+#'   uses tree 1 when only one tree is present, or integrates over trees when
+#'   multiple trees are present.


or integrates over trees when multiple trees are present.

Is this correct? It looks to me like it will just pick the first tree for multiPhylo objects (line 116).

ScottClaessens · 2026-04-29T14:24:51Z

+#'     \item{estimate}{Posterior median}
+#'     \item{lower}{Lower bound of the credible interval}
+#'     \item{upper}{Upper bound of the credible interval}
+#'     \item{clade_pp}{Clade posterior probability (1.0 for single-tree models)}


Is clade_pp redundant now? It looks like it is always 1.

ScottClaessens · 2026-04-29T15:07:56Z

+        if (scale == "response" && dist == "ordered_logistic") {
+          cat_draws <- draws_subset[[v_i]][, n_i, , drop = FALSE]
+          n_cats <- dim(cat_draws)[3]
+          row <- data.frame(
+            node = selected_nodes[n_i],
+            variable = var_names[v_i],
+            stringsAsFactors = FALSE
+          )
+          for (k in seq_len(n_cats)) {
+            cat_k <- cat_draws[, 1, k]
+            row[[paste0("prob_", k)]] <- stats::median(cat_k)
+          }
+          row$clade_pp <- 1.0
+          rows[[length(rows) + 1]] <- row


When there is a mix of ordinal and non-ordinal variables, the output looks a little messy:

# one variable is binary set.seed(1) d <- authority$data d$political_authority <- sample(0:1, size = nrow(d), replace = TRUE) # fit model fit <- coev_fit( data = d, variables = list( political_authority = "bernoulli_logit", religious_authority = "ordered_logistic" ), id = "language", tree = authority$phylogeny ) # print ancestral states coev_ancestral_states(fit, scale = "response")

# A tibble: 192 × 10 node variable estimate lower upper clade_pp prob_1 prob_2 prob_3 prob_4 <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 98 political_authority 0.487 0.150 0.856 1 NA NA NA NA 2 98 religious_authority NA NA NA 1 0.115 0.0605 0.362 0.419 3 99 political_authority 0.485 0.156 0.842 1 NA NA NA NA 4 99 religious_authority NA NA NA 1 0.112 0.0605 0.361 0.418 5 100 political_authority 0.485 0.194 0.771 1 NA NA NA NA 6 100 religious_authority NA NA NA 1 0.116 0.0618 0.363 0.425 7 101 political_authority 0.492 0.172 0.822 1 NA NA NA NA 8 101 religious_authority NA NA NA 1 0.109 0.0592 0.361 0.432 9 102 political_authority 0.496 0.211 0.759 1 NA NA NA NA 10 102 religious_authority NA NA NA 1 0.102 0.0565 0.362 0.454 # ℹ 182 more rows

Should we instead include different rows for the different category levels? We could do this by including a column called category and setting it to NA for non-ordinal variables. Then we could include lower and upper percentiles for each category level too.

If we decide to stick with this approach, could we ensure that the summarised probabilities add up to 1? They do with predict.brmsfit in the brms package, but don't seem to here.

ScottClaessens · 2026-04-29T15:50:03Z

+#'
+#' # interactive node selection (in interactive R sessions)
+#' tree <- coev_plot_node_labels(fit)
+#' selected <- identify(tree)


This functionality is very cool! I'm only able to choose one node at a time though, rather than lots of nodes with multiple clicks. Is this the intended usage? Either way, it's awesome.

ScottClaessens · 2026-04-29T15:57:41Z

+tree <- attr(asr_latent, "ref_tree")
+n_tips <- length(tree$tip.label)
+asr_all <- coev_ancestral_states(fit, nodes = "all")
+
+pal <- colorRampPalette(c("#2166AC", "#F7F7F7", "#B2182B"))(100)
+global_range <- range(asr_all$estimate)
+
+par(mfrow = c(1, 2), mar = c(2, 0, 2, 0))
+
+for (var in unique(asr_all$variable)) {
+  df_var <- asr_all |> filter(variable == var)
+  vals <- setNames(df_var$estimate, df_var$node)
+
+  plot(tree, show.tip.label = FALSE, edge.width = 1.5,
+       edge.color = "grey70", main = gsub("_", " ", var),
+       no.margin = FALSE)
+
+  internal_ids <- (n_tips + 1):(n_tips + tree$Nnode)
+  internal_vals <- vals[as.character(internal_ids)]
+  bins <- as.integer(
+    cut(internal_vals,
+        breaks = seq(global_range[1], global_range[2], length.out = 101))
+  )
+  bins[is.na(bins)] <- 50
+  bins <- pmax(1, pmin(100, bins))
+  nodelabels(pch = 21, bg = pal[bins], col = NA, cex = 1.2)
+}
+
+# add shared color bar legend
+par(fig = c(0.35, 0.65, 0.0, 0.05), new = TRUE, mar = c(0, 0, 0, 0))
+image(
+  x = seq(global_range[1], global_range[2], length.out = 100),
+  y = 1,
+  z = matrix(1:100, nrow = 100),
+  col = pal,
+  axes = FALSE, xlab = "", ylab = ""
+)
+axis(1, at = round(seq(global_range[1], global_range[2], length.out = 5), 1),
+     cex.axis = 0.8, padj = -1.5, tck = -0.3)
+mtext("Latent value (eta)", side = 1, line = 1.2, cex = 0.7)


This is nice to show, but pedagogically this code is a little dense. Is it possible to simplify the code somewhat?

ScottClaessens · 2026-04-29T15:58:38Z

+```
+
+Notice the output now has `prob_1`, `prob_2`, etc. columns instead of
+`estimate`/`lower`/`upper`. Each row sums to 1 across probability columns.


As per my previous comment, this isn't strictly true the way we've coded it.

ScottClaessens · 2026-04-29T16:00:03Z

+
+In an interactive R session you can click directly on the tree to identify nodes.
+After plotting, call `identify()` on the returned tree object -- click on nodes
+of interest, then press Escape to finish:


then press Escape to finish:

This didn't work for me. I was only able to choose one node, and then it automatically finished.

ScottClaessens · 2026-04-29T16:05:43Z

For the main coevolve vignette that currently exists, I have used the .Rmd.orig approach to pre-compute the vignette (see more details here: https://ropensci.org/blog/2019/12/08/precompute-vignettes/). This is because that particular vignette fits many models that take a long time to run. Will this vignette also take a long time to run when running R CMD check or building pkgdown sites?

Looking at this again, it seems like the vignette is precomputed, which is great. If there's an original .Rmd.orig version, could you include that too? Just for re-running the vignette in the future.

ScottClaessens · 2026-04-29T16:09:51Z

Once we're happy, we will need to increment the development version and add to NEWS.

ScottClaessens · 2026-04-29T16:32:22Z

+Variables: political_authority = ordered_logistic
+           religious_authority = ordered_logistic
+     Data: authority$data (Number of observations: 97)
+Phylogeny: authority$phylogeny (Number of trees: 1)
+    Draws: 4 chains, each with iter = 500; warmup = 500; thin = 1
+           total post-warmup draws = 2000
+
+Autoregressive selection effects:
+                    Estimate Est.Error  2.5% 97.5% Rhat Bulk_ESS Tail_ESS
+political_authority    -0.53      0.44 -1.64 -0.02 1.00     1420     1270
+religious_authority    -0.60      0.48 -1.82 -0.03 1.00     1549     1137
+
+Cross selection effects:
+                                          Estimate Est.Error  2.5% 97.5% Rhat
+political_authority ⟶ religious_authority     1.50      0.74 -0.04  2.87 1.00
+religious_authority ⟶ political_authority     1.05      0.83 -0.62  2.64 1.00
+                                          Bulk_ESS Tail_ESS
+political_authority ⟶ religious_authority      842     1055
+religious_authority ⟶ political_authority      701     1384
+
+Drift parameters:
+                                             Estimate Est.Error  2.5% 97.5%
+sd(political_authority)                          2.15      0.82  0.37  3.70
+sd(religious_authority)                          1.46      0.82  0.13  3.10
+cor(political_authority,religious_authority)     0.37      0.30 -0.35  0.85
+                                             Rhat Bulk_ESS Tail_ESS
+sd(political_authority)                      1.00      453      406
+sd(religious_authority)                      1.01      416      887
+cor(political_authority,religious_authority) 1.00     1116     1360
+
+Continuous time intercept parameters:
+                    Estimate Est.Error  2.5% 97.5% Rhat Bulk_ESS Tail_ESS
+political_authority     0.31      0.91 -1.55  2.12 1.00     2426     1337
+religious_authority     0.31      0.91 -1.47  2.14 1.01     3135     1612
+
+Ordinal cutpoint parameters:
+                       Estimate Est.Error  2.5% 97.5% Rhat Bulk_ESS Tail_ESS
+political_authority[1]    -1.44      0.88 -3.13  0.23 1.00     1266     1010
+political_authority[2]    -0.70      0.86 -2.36  1.01 1.00     1418     1276
+political_authority[3]     1.48      0.86 -0.20  3.19 1.00     1661     1415
+religious_authority[1]    -1.59      0.92 -3.32  0.24 1.00     1424     1248
+religious_authority[2]    -0.92      0.90 -2.61  0.87 1.00     1699     1531
+religious_authority[3]     1.45      0.92 -0.24  3.28 1.00     1913     1488


I used options(width = 120) in vignettes/coevolve.Rmd.orig to ensure that the summary output fits the page. Could we do the same here?

ScottClaessens · 2026-04-29T17:06:32Z

Also, I think that in order for the ASR vignette to appear on the package website, we will need to add something like the following to _pkgdown.yml:

articles:
- title: Vignettes
  contents:
  - coevolve
  - ancestral_state

* Output redesign for coev_ancestral_states(): switch to a long-format tibble with columns node, variable, category, estimate, lower, upper. Ordinal-response output expands to one row per category with category populated; non-ordinal rows have category = NA. Use the posterior mean (not median) for ordinal-response point estimates so per-node category estimates sum exactly to 1, matching the convention of predict.brmsfit. * Drop the clade_pp column; it was always 1.0 and was a forward-looking placeholder for proper multi-tree integration that does not exist yet. Easier to add back when the integration lands. * Fix doc inaccuracy: tree_id with NULL picks tree 1 -- it does not integrate over trees. Soften the "latent trait values" framing in the function description since scale = "response" returns response- scale values. * Add coev_identify_nodes(): wraps ape::identify.phylo() in a loop so that multiple node clicks accumulate before returning. * Vignette: switch to long-format examples; simplify the latent-tree color-bar block; pivot the pie-chart helper to use the new long format; replace the broken single-click identify() workflow with coev_identify_nodes(); remove clade_pp from output snippets. * Vignette source: add ancestral_states.Rmd.orig matching the existing coevolve.Rmd.orig pattern, and wire it into precompile.R for future regeneration. * _pkgdown.yml: add articles: block so the ASR vignette appears on the package website alongside coevolve. * DESCRIPTION: bump dev version to 1.0.0.9001; NEWS.md: add entries for the new functions and vignette. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ErikRingen · 2026-04-30T10:29:16Z

Thanks for the careful review @ScottClaessens! Pushed 9b34e57 addressing everything:

Output format (your big comment on L179) — switched to long format with a category column (NA for non-ordinal, cat_1/cat_2/... for ordinal-response). One row per node × variable × category. Ordinal-response point estimate is now the posterior mean (not median), so per-node category estimates sum exactly to 1 — matching predict.brmsfit. Verified on m02: min(sums) = max(sums) = 1.

Dropped clade_pp (your L36 comment) — agreed it's misleading at always-1.0. Easier to add back when proper multi-tree integration lands than to keep a column whose only current value is degenerate.

Doc fixes

L3: softened "latent trait values" → "trait values ... by default on the latent scale".
L19 + vignette L276: tree_id = NULL picks tree 1 — does not integrate over trees. Doc now says so explicitly (no integration yet).

Multi-click identify (L70 + vignette L414) — added coev_identify_nodes(tree, n = NULL), a thin wrapper that loops ape::identify.phylo() until you click outside the tree (or hit Esc, depending on device). Vignette and ?coev_plot_node_labels updated to point to it.

Vignette polish

Added vignettes/ancestral_states.Rmd.orig as the source of truth and wired it into vignettes/precompile.R (your "include .Rmd.orig" follow-up).
Set options(width = 120) and the standard knitr::opts_chunk$set(...) block so the printed tibbles fit (your L124 comment).
Simplified the dense color-bar chunk (your L204) — replaced the manual image() + axis() + mtext() sequence with a small to_color() helper and a 4-line color bar. Also tidied the pie-chart loop now that we're working from long format.

Housekeeping (your two top-level conversation comments)

Bumped DESCRIPTION → 1.0.0.9001.
Added a # coevolve (development version) block in NEWS.md for coev_ancestral_states(), coev_plot_node_labels(), coev_identify_nodes(), and the new vignette.
Added the articles: block to _pkgdown.yml so the ASR vignette appears on the package website (note: filename is ancestral_states, not ancestral_state).

All 113 ASR/identify/plot-labels tests still pass; lintr::lint_package() is clean; codecov reports 100% patch coverage.

One thing I did not re-knit: the static vignettes/ancestral_states.Rmd. The .Rmd.orig is the new source of truth, but actually running precompile.R requires fitting the model (~2 min). I updated the static .Rmd by hand to reflect the new long-format output. Happy to do a full re-knit if you'd rather have a fresh run before merge — let me know.

ape registers identify.phylo as an S3 method (S3method(identify, phylo)) but does not export it as a plain function, so calling ape::identify.phylo directly fails the namespace check. Switch to graphics::identify(), which is the registered S3 generic; dispatch will route to ape::identify.phylo when the argument inherits "phylo". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ScottClaessens · 2026-04-30T11:10:49Z

+#'     \item{estimate}{Posterior point estimate. For ordinal-response
+#'       category probabilities this is the posterior \emph{mean} (so the
+#'       per-node category estimates sum to 1, matching the convention of
+#'       \code{predict.brmsfit}); for all other cases it is the posterior
+#'       median.}


The only thing with this approach is that it means that there could be a mix of posterior medians and means in the same summary output. Do we want this? I think we should probably be consistent, even if it means that the point esimates don't add up to 1. What do you reckon @ErikRingen?

Also, consider removing the reference to predict.brmsfit in the documentation.

ScottClaessens · 2026-04-30T11:14:22Z

+#'     \item{lower}{Lower bound of the credible interval (per-category
+#'       quantile for ordinal-response rows)}
+#'     \item{upper}{Upper bound of the credible interval (per-category
+#'       quantile for ordinal-response rows)}


For simplicity, maybe just:

#' \item{lower}{Lower bound of the credible interval} #' \item{upper}{Upper bound of the credible interval}

ScottClaessens · 2026-04-30T11:25:21Z

Unfortunately this function doesn't work for me. In an interactive session:

devtools::load_all() coev_identify_nodes(authority$phylogeny) # integer(0) coev_identify_nodes(authority$phylogeny, n = 2) # integer(0)

If it's helpful, I get the following error when trying the critical line in coev_identify_nodes():

graphics::identify(authority$phylogeny, quiet = TRUE)

Error in locator(1) : plot.new has not been called yet

ScottClaessens · 2026-04-30T11:29:26Z

How can we also test that this function does the right thing? Not sure how to do this given that tests aren't run in interactive sessions.

Could we at least add a test checking that the output is not integer(0)?

ScottClaessens · 2026-04-30T11:39:43Z

One additional thing: the "category" column will always be redundant when scale = "latent". Could we only include this column when scale = "response"?

ScottClaessens · 2026-04-30T11:44:02Z

Also, I think we will need to re-run the vignette once we're happy with all the package changes, just to make sure that everything is up-to-date.

claude and others added 5 commits April 22, 2026 15:20

ErikRingen marked this pull request as draft April 27, 2026 10:05

ErikRingen marked this pull request as ready for review April 29, 2026 13:54

ErikRingen requested a review from ScottClaessens April 29, 2026 13:54

ScottClaessens reviewed Apr 29, 2026

View reviewed changes

ScottClaessens mentioned this pull request Apr 29, 2026

Update vignettes #111

Open

ScottClaessens reviewed Apr 30, 2026

View reviewed changes

	#' Extracts posterior estimates of latent trait values at internal nodes
	#' Extracts posterior estimates of trait values at internal nodes

Conversation

ErikRingen commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

codecov Bot commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ScottClaessens left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ScottClaessens commented Apr 29, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ScottClaessens commented Apr 29, 2026

Uh oh!

ErikRingen commented Apr 30, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ScottClaessens commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ErikRingen commented Apr 27, 2026 •

edited

Loading

codecov Bot commented Apr 27, 2026 •

edited

Loading

ScottClaessens left a comment •

edited

Loading