Add ancestral state reconstruction (coev_ancestral_states)#110
Add ancestral state reconstruction (coev_ancestral_states)#110ErikRingen wants to merge 8 commits into
Conversation
Implements the core extraction function for ancestral state reconstruction. Extracts posterior draws of eta at internal nodes, maps them to tree node IDs, and returns either a summary tibble (median + CIs, ggtree-compatible) or raw draws. Single-tree path only for now. Response scale and multi-tree support to follow. Tests cover: input validation, output structure, node selection (internal/all/specific), variable subsetting, CI width, summary vs draws mode, and all single-tree fixture models. https://claude.ai/code/session_01J6wGw5nJjfG1Eavh4pq9kR
Response scale support for all distributions: - normal: identity - bernoulli_logit: plogis(eta) -> P(y=1) - ordered_logistic: category probabilities via cutpoints - poisson_softplus: mean(obs) * softplus(eta) - negative_binomial_softplus: mean(obs) * softplus(eta) - gamma_log: exp(eta) For ordered_logistic on response scale, output has prob_1, prob_2, ... columns instead of estimate/lower/upper. Multi-tree tests cover tree_id selection and default behavior. https://claude.ai/code/session_01J6wGw5nJjfG1Eavh4pq9kR
Add coev_plot_node_labels() for visualizing internal node IDs on the phylogenetic tree, with auto-scaling label size and customizable appearance. Add ancestral_states.qmd vignette demonstrating the full ASR workflow: latent/response scale extraction, tree visualization, interactive node selection via identify(), and node-specific queries. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Match the format of the existing coevolve.Rmd vignette: static code blocks with pre-rendered output, pre-generated PNG figures, and proper R vignette metadata (VignetteIndexEntry, rmarkdown::html_vignette, pkgdown: as_is). This avoids requiring Quarto or expensive model fitting during package build. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix all lintr violations: indentation, commented code, explicit returns, line lengths - Fix multiPhylo coercion: use list() instead of c() to preserve phylo class when wrapping a single tree - Fix ordinal prob sum test: median of category probs need not sum to exactly 1 (median doesn't distribute over sums) - Fix expect_no_error() calls that incorrectly passed label arg - Fix Rd cross-reference warning for bracket syntax in docs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Add tests for single-node response-scale extraction (ordinal and non-ordinal models) to exercise the matrix-reshape branches in apply_inverse_link(), the non-numeric nodes input validation path, and the multiPhylo branch in coev_plot_node_labels(). Mark two genuinely unreachable defensive lines (1-cutpoint reshape, switch identity fallback) with # nocov. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| @@ -0,0 +1,398 @@ | |||
| #' Extract ancestral state estimates from a fitted coevolutionary model | |||
| #' | |||
| #' Extracts posterior estimates of latent trait values at internal nodes | |||
There was a problem hiding this comment.
| #' Extracts posterior estimates of latent trait values at internal nodes | |
| #' Extracts posterior estimates of trait values at internal nodes |
Not necessarily latent if user specifies scale = "response"
There was a problem hiding this comment.
Though I suppose the posterior estimates are on the latent scale by default. I'll let you decide.
| #' \item{clade_pp}{Clade posterior probability (1.0 for single-tree models)} | ||
| #' } | ||
| #' The data frame has attributes \code{ref_tree} (the phylo object used), | ||
| #' \code{prob}, and \code{scale}. The format is compatible with ggtree's |
There was a problem hiding this comment.
Good choice to not include ggtree as a dependency.
| #' @param tree_id (optional) An integer indicating which tree to use when the | ||
| #' model was fit with a \code{multiPhylo} object. If \code{NULL} (default), | ||
| #' uses tree 1 when only one tree is present, or integrates over trees when | ||
| #' multiple trees are present. |
There was a problem hiding this comment.
or integrates over trees when multiple trees are present.
Is this correct? It looks to me like it will just pick the first tree for multiPhylo objects (line 116).
| #' \item{estimate}{Posterior median} | ||
| #' \item{lower}{Lower bound of the credible interval} | ||
| #' \item{upper}{Upper bound of the credible interval} | ||
| #' \item{clade_pp}{Clade posterior probability (1.0 for single-tree models)} |
There was a problem hiding this comment.
Is clade_pp redundant now? It looks like it is always 1.
| if (scale == "response" && dist == "ordered_logistic") { | ||
| cat_draws <- draws_subset[[v_i]][, n_i, , drop = FALSE] | ||
| n_cats <- dim(cat_draws)[3] | ||
| row <- data.frame( | ||
| node = selected_nodes[n_i], | ||
| variable = var_names[v_i], | ||
| stringsAsFactors = FALSE | ||
| ) | ||
| for (k in seq_len(n_cats)) { | ||
| cat_k <- cat_draws[, 1, k] | ||
| row[[paste0("prob_", k)]] <- stats::median(cat_k) | ||
| } | ||
| row$clade_pp <- 1.0 | ||
| rows[[length(rows) + 1]] <- row |
There was a problem hiding this comment.
When there is a mix of ordinal and non-ordinal variables, the output looks a little messy:
# one variable is binary
set.seed(1)
d <- authority$data
d$political_authority <- sample(0:1, size = nrow(d), replace = TRUE)
# fit model
fit <- coev_fit(
data = d,
variables = list(
political_authority = "bernoulli_logit",
religious_authority = "ordered_logistic"
),
id = "language",
tree = authority$phylogeny
)
# print ancestral states
coev_ancestral_states(fit, scale = "response")# A tibble: 192 × 10
node variable estimate lower upper clade_pp prob_1 prob_2 prob_3 prob_4
<int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 98 political_authority 0.487 0.150 0.856 1 NA NA NA NA
2 98 religious_authority NA NA NA 1 0.115 0.0605 0.362 0.419
3 99 political_authority 0.485 0.156 0.842 1 NA NA NA NA
4 99 religious_authority NA NA NA 1 0.112 0.0605 0.361 0.418
5 100 political_authority 0.485 0.194 0.771 1 NA NA NA NA
6 100 religious_authority NA NA NA 1 0.116 0.0618 0.363 0.425
7 101 political_authority 0.492 0.172 0.822 1 NA NA NA NA
8 101 religious_authority NA NA NA 1 0.109 0.0592 0.361 0.432
9 102 political_authority 0.496 0.211 0.759 1 NA NA NA NA
10 102 religious_authority NA NA NA 1 0.102 0.0565 0.362 0.454
# ℹ 182 more rows
Should we instead include different rows for the different category levels? We could do this by including a column called category and setting it to NA for non-ordinal variables. Then we could include lower and upper percentiles for each category level too.
If we decide to stick with this approach, could we ensure that the summarised probabilities add up to 1? They do with predict.brmsfit in the brms package, but don't seem to here.
| #' | ||
| #' # interactive node selection (in interactive R sessions) | ||
| #' tree <- coev_plot_node_labels(fit) | ||
| #' selected <- identify(tree) |
There was a problem hiding this comment.
This functionality is very cool! I'm only able to choose one node at a time though, rather than lots of nodes with multiple clicks. Is this the intended usage? Either way, it's awesome.
| tree <- attr(asr_latent, "ref_tree") | ||
| n_tips <- length(tree$tip.label) | ||
| asr_all <- coev_ancestral_states(fit, nodes = "all") | ||
|
|
||
| pal <- colorRampPalette(c("#2166AC", "#F7F7F7", "#B2182B"))(100) | ||
| global_range <- range(asr_all$estimate) | ||
|
|
||
| par(mfrow = c(1, 2), mar = c(2, 0, 2, 0)) | ||
|
|
||
| for (var in unique(asr_all$variable)) { | ||
| df_var <- asr_all |> filter(variable == var) | ||
| vals <- setNames(df_var$estimate, df_var$node) | ||
|
|
||
| plot(tree, show.tip.label = FALSE, edge.width = 1.5, | ||
| edge.color = "grey70", main = gsub("_", " ", var), | ||
| no.margin = FALSE) | ||
|
|
||
| internal_ids <- (n_tips + 1):(n_tips + tree$Nnode) | ||
| internal_vals <- vals[as.character(internal_ids)] | ||
| bins <- as.integer( | ||
| cut(internal_vals, | ||
| breaks = seq(global_range[1], global_range[2], length.out = 101)) | ||
| ) | ||
| bins[is.na(bins)] <- 50 | ||
| bins <- pmax(1, pmin(100, bins)) | ||
| nodelabels(pch = 21, bg = pal[bins], col = NA, cex = 1.2) | ||
| } | ||
|
|
||
| # add shared color bar legend | ||
| par(fig = c(0.35, 0.65, 0.0, 0.05), new = TRUE, mar = c(0, 0, 0, 0)) | ||
| image( | ||
| x = seq(global_range[1], global_range[2], length.out = 100), | ||
| y = 1, | ||
| z = matrix(1:100, nrow = 100), | ||
| col = pal, | ||
| axes = FALSE, xlab = "", ylab = "" | ||
| ) | ||
| axis(1, at = round(seq(global_range[1], global_range[2], length.out = 5), 1), | ||
| cex.axis = 0.8, padj = -1.5, tck = -0.3) | ||
| mtext("Latent value (eta)", side = 1, line = 1.2, cex = 0.7) |
There was a problem hiding this comment.
This is nice to show, but pedagogically this code is a little dense. Is it possible to simplify the code somewhat?
| ``` | ||
|
|
||
| Notice the output now has `prob_1`, `prob_2`, etc. columns instead of | ||
| `estimate`/`lower`/`upper`. Each row sums to 1 across probability columns. |
There was a problem hiding this comment.
As per my previous comment, this isn't strictly true the way we've coded it.
|
|
||
| In an interactive R session you can click directly on the tree to identify nodes. | ||
| After plotting, call `identify()` on the returned tree object -- click on nodes | ||
| of interest, then press Escape to finish: |
There was a problem hiding this comment.
then press Escape to finish:
This didn't work for me. I was only able to choose one node, and then it automatically finished.
There was a problem hiding this comment.
For the main coevolve vignette that currently exists, I have used the .Rmd.orig approach to pre-compute the vignette (see more details here: https://ropensci.org/blog/2019/12/08/precompute-vignettes/). This is because that particular vignette fits many models that take a long time to run. Will this vignette also take a long time to run when running R CMD check or building pkgdown sites?
There was a problem hiding this comment.
Looking at this again, it seems like the vignette is precomputed, which is great. If there's an original .Rmd.orig version, could you include that too? Just for re-running the vignette in the future.
|
Once we're happy, we will need to increment the development version and add to |
| Variables: political_authority = ordered_logistic | ||
| religious_authority = ordered_logistic | ||
| Data: authority$data (Number of observations: 97) | ||
| Phylogeny: authority$phylogeny (Number of trees: 1) | ||
| Draws: 4 chains, each with iter = 500; warmup = 500; thin = 1 | ||
| total post-warmup draws = 2000 | ||
|
|
||
| Autoregressive selection effects: | ||
| Estimate Est.Error 2.5% 97.5% Rhat Bulk_ESS Tail_ESS | ||
| political_authority -0.53 0.44 -1.64 -0.02 1.00 1420 1270 | ||
| religious_authority -0.60 0.48 -1.82 -0.03 1.00 1549 1137 | ||
|
|
||
| Cross selection effects: | ||
| Estimate Est.Error 2.5% 97.5% Rhat | ||
| political_authority ⟶ religious_authority 1.50 0.74 -0.04 2.87 1.00 | ||
| religious_authority ⟶ political_authority 1.05 0.83 -0.62 2.64 1.00 | ||
| Bulk_ESS Tail_ESS | ||
| political_authority ⟶ religious_authority 842 1055 | ||
| religious_authority ⟶ political_authority 701 1384 | ||
|
|
||
| Drift parameters: | ||
| Estimate Est.Error 2.5% 97.5% | ||
| sd(political_authority) 2.15 0.82 0.37 3.70 | ||
| sd(religious_authority) 1.46 0.82 0.13 3.10 | ||
| cor(political_authority,religious_authority) 0.37 0.30 -0.35 0.85 | ||
| Rhat Bulk_ESS Tail_ESS | ||
| sd(political_authority) 1.00 453 406 | ||
| sd(religious_authority) 1.01 416 887 | ||
| cor(political_authority,religious_authority) 1.00 1116 1360 | ||
|
|
||
| Continuous time intercept parameters: | ||
| Estimate Est.Error 2.5% 97.5% Rhat Bulk_ESS Tail_ESS | ||
| political_authority 0.31 0.91 -1.55 2.12 1.00 2426 1337 | ||
| religious_authority 0.31 0.91 -1.47 2.14 1.01 3135 1612 | ||
|
|
||
| Ordinal cutpoint parameters: | ||
| Estimate Est.Error 2.5% 97.5% Rhat Bulk_ESS Tail_ESS | ||
| political_authority[1] -1.44 0.88 -3.13 0.23 1.00 1266 1010 | ||
| political_authority[2] -0.70 0.86 -2.36 1.01 1.00 1418 1276 | ||
| political_authority[3] 1.48 0.86 -0.20 3.19 1.00 1661 1415 | ||
| religious_authority[1] -1.59 0.92 -3.32 0.24 1.00 1424 1248 | ||
| religious_authority[2] -0.92 0.90 -2.61 0.87 1.00 1699 1531 | ||
| religious_authority[3] 1.45 0.92 -0.24 3.28 1.00 1913 1488 |
There was a problem hiding this comment.
I used options(width = 120) in vignettes/coevolve.Rmd.orig to ensure that the summary output fits the page. Could we do the same here?
|
Also, I think that in order for the ASR vignette to appear on the package website, we will need to add something like the following to |
* Output redesign for coev_ancestral_states(): switch to a long-format tibble with columns node, variable, category, estimate, lower, upper. Ordinal-response output expands to one row per category with category populated; non-ordinal rows have category = NA. Use the posterior mean (not median) for ordinal-response point estimates so per-node category estimates sum exactly to 1, matching the convention of predict.brmsfit. * Drop the clade_pp column; it was always 1.0 and was a forward-looking placeholder for proper multi-tree integration that does not exist yet. Easier to add back when the integration lands. * Fix doc inaccuracy: tree_id with NULL picks tree 1 -- it does not integrate over trees. Soften the "latent trait values" framing in the function description since scale = "response" returns response- scale values. * Add coev_identify_nodes(): wraps ape::identify.phylo() in a loop so that multiple node clicks accumulate before returning. * Vignette: switch to long-format examples; simplify the latent-tree color-bar block; pivot the pie-chart helper to use the new long format; replace the broken single-click identify() workflow with coev_identify_nodes(); remove clade_pp from output snippets. * Vignette source: add ancestral_states.Rmd.orig matching the existing coevolve.Rmd.orig pattern, and wire it into precompile.R for future regeneration. * _pkgdown.yml: add articles: block so the ASR vignette appears on the package website alongside coevolve. * DESCRIPTION: bump dev version to 1.0.0.9001; NEWS.md: add entries for the new functions and vignette. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks for the careful review @ScottClaessens! Pushed Output format (your big comment on L179) — switched to long format with a Dropped Doc fixes
Multi-click identify (L70 + vignette L414) — added Vignette polish
Housekeeping (your two top-level conversation comments)
All 113 ASR/identify/plot-labels tests still pass; One thing I did not re-knit: the static |
ape registers identify.phylo as an S3 method (S3method(identify, phylo)) but does not export it as a plain function, so calling ape::identify.phylo directly fails the namespace check. Switch to graphics::identify(), which is the registered S3 generic; dispatch will route to ape::identify.phylo when the argument inherits "phylo". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| #' \item{estimate}{Posterior point estimate. For ordinal-response | ||
| #' category probabilities this is the posterior \emph{mean} (so the | ||
| #' per-node category estimates sum to 1, matching the convention of | ||
| #' \code{predict.brmsfit}); for all other cases it is the posterior | ||
| #' median.} |
There was a problem hiding this comment.
The only thing with this approach is that it means that there could be a mix of posterior medians and means in the same summary output. Do we want this? I think we should probably be consistent, even if it means that the point esimates don't add up to 1. What do you reckon @ErikRingen?
There was a problem hiding this comment.
Also, consider removing the reference to predict.brmsfit in the documentation.
| #' \item{lower}{Lower bound of the credible interval (per-category | ||
| #' quantile for ordinal-response rows)} | ||
| #' \item{upper}{Upper bound of the credible interval (per-category | ||
| #' quantile for ordinal-response rows)} |
There was a problem hiding this comment.
For simplicity, maybe just:
#' \item{lower}{Lower bound of the credible interval}
#' \item{upper}{Upper bound of the credible interval}
There was a problem hiding this comment.
Unfortunately this function doesn't work for me. In an interactive session:
devtools::load_all()
coev_identify_nodes(authority$phylogeny)
# integer(0)
coev_identify_nodes(authority$phylogeny, n = 2)
# integer(0)There was a problem hiding this comment.
If it's helpful, I get the following error when trying the critical line in coev_identify_nodes():
graphics::identify(authority$phylogeny, quiet = TRUE)Error in locator(1) : plot.new has not been called yet
There was a problem hiding this comment.
How can we also test that this function does the right thing? Not sure how to do this given that tests aren't run in interactive sessions.
There was a problem hiding this comment.
Could we at least add a test checking that the output is not integer(0)?
There was a problem hiding this comment.
One additional thing: the "category" column will always be redundant when scale = "latent". Could we only include this column when scale = "response"?
|
Also, I think we will need to re-run the vignette once we're happy with all the package changes, just to make sure that everything is up-to-date. |
Summary
coev_ancestral_states()for extracting posterior estimates of ancestral trait values at phylogenetic nodes. Long-format output (node, variable, category, estimate, lower, upper); supports latent and response scales, specific node/variable selection, raw draws, and multi-tree models. Ordinal-response output expands to one row per category and uses the posterior mean so per-node category estimates sum exactly to 1 (matching the convention ofpredict.brmsfit).coev_plot_node_labels()for plotting internal node IDs on the tree (auto-scaling label size, customizable appearance).coev_identify_nodes()— wrapsape::identify.phylo()in a loop so that multiple node clicks accumulate before returning..Rmd.origsource wired intovignettes/precompile.R.NEWS.mdentries; bump dev version to1.0.0.9001; addarticles:block to_pkgdown.ymlso the vignette appears on the package website.Test plan
lintr::lint_package()— 0 lintsR CMD check --as-cran— 0 ERRORs, only pre-existing WARNING/NOTEscoev_ancestral_states()covering input validation, latent/response scales (normal, bernoulli, poisson, gamma, ordered logistic), raw draws, variable subsetting, CI width, multi-tree, mixed ordinal/non-ordinal output, and per-node category estimates summing to 1coev_plot_node_labels()covering input validation, return type, custom parameters, andmultiPhylotreescoev_identify_nodes()covering input validationCloses #86