-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathmain.tex
More file actions
180 lines (150 loc) · 8.14 KB
/
main.tex
File metadata and controls
180 lines (150 loc) · 8.14 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
\documentclass[11pt]{article}
\usepackage{fullpage}
\usepackage{enumitem}
\usepackage{amsmath,amssymb}
\usepackage{epigraph}
\title{Debugging Probabilistic Programs}
\author{Vikash K. Mansinghka, Ben Zinberg}
\newcommand{\Xsp}{X_{\text{specific}}}
\begin{document}
\maketitle
\setlength{\epigraphwidth}{0.5\textwidth}
\epigraph{``When you've eliminated the plausible, whatever remains, however
impossible, must be probable.''}{{\itshape Leon Zhou}}
\section{Points}
Some points, once brought to your attention, are quite believable a priori:
\begin{itemize}
\item {\bf Probabilistic program development is not well, sometimes
incorrectly, understood.} It's a new field. Many aspects of conventional
programming are useful, but many of the basic intuitions from that area can
lead us astray here.
\item {\bf Debugging probabilistic programs is like debugging hardware.} While
probabilistic programs can produce conventional errors, debugging is
typically about chasing down other kinds of problems. The output of a
probabilistic model (or the result of inference on a model) is a probability
distribution, which can be inspected in a number of ways. In order to
ensure that the distribution achieved is acceptable, it is crucial to {\em
use} many or all of those ways. It's easy to forget to do this, since in
the finished product only a neat little corner of the model will be
inspected. But as in hardware debugging, it is best to proceed with
ruthless skepticism, as each component is not just liable but quite likely
to fail.
In conventional programming (partly due to your trained intuitions as a
programmer), you can reliably analyze most short chunks of code in your
head. You (and we) cannot do this for probabilistic code, both because you
are not trained and because it is harder. Also, it is typically much easier
to sanity-check a short conventional program; in order to sanity-check a
probabilistic program, you need to inspect a distribution, which requires
many probes.
\end{itemize}
\section{Steps}
Here is a series of steps proposed by VKM to debug probabilistic programs. Here
\begin{itemize}[noitemsep]
\item $X$ is latents
\item $\lambda$ is parameters
\item $Y$ is data
\item $X \mapsto T_Y(X)$ is the ``step'' operator of an inference program,
such that the true posterior $P(X|Y,\lambda)$ is invariant under $T_Y$, and
$T_Y^t(X)$ converges to the true posterior $P(X|Y,\lambda)$ as $t \to
\infty$.
\end{itemize}
Here are the steps:
\begin{enumerate}
\item ``Do I know the problem I'm asking the prob. prog. to solve? Do I know
it's solvable? Can I solve it?''
You should not expect v0.1 of your program to be better at solving the
problem than you---you have eyes and a brain. Your first test case should,
a fortiori, be one in which it is completely obvious to you what the
solution is, both because you need a much stupider program to work, and
because you need to be able to reliably tell whether the program has
succeeded. Along that vein, it is important at this early stage to set up
adequate {\em visualization} for answers $X$ produced by the program. Your
eyes are much more powerful signal processors than your brain's abstract
number sense, and your inspection plan (you have one, right?) should try to
leverage that. Sometimes it will be nice to have visualization of $X$
before you even have a prior. That way, each of the (probably many) times
you rewrite the prior, you can sanity-check by writing a quick loop to
sample the prior and generate several \texttt{png}s.
\item ``Do I really grok $P(X|\lambda)$?''
Cycling through summary \texttt{png}s of samples is a good first step, but
to go beyond what is visible at first glance, it will be useful to come up
with a small (or large) portfolio of statistics that can be plotted or
inspected to reveal more information about the distribution. Statistics for
a PCFG could include
\begin{itemize}[noitemsep]
\item number of variables in parse tree
\item number of times each production was used
\item yield at 0, i.e., the terminal reached by walking left maximally
\end{itemize}
\item\label{itm:grok-transition-operator}
``Do I grok $T_Y(X)$ at all?''
It is quite hard to analyze this abstractly by looking at text. Visual
methods are necessary:
\begin{itemize}[noitemsep]
\item Make a movie of $T_Y^t(X)$, $t=0,1,\ldots$.
\item Make a movie of Gibbs sampling: an animation of $X^t$, $t=0,1,\ldots$, where
\[
X^0 \sim P(X|\lambda), \quad
Y^t \sim P(Y|X^t,\lambda), \quad
X^{t+1} \sim T_{Y^t}(X^t).
\]
To better understand $T$ better, it can be helpful to also do the above
steps when $Y$ is empty, i.e., when inference is done on no data. The
true posterior in this case is trivial to write down, but you are not
trying to learn about the true posterior, you are trying to learn about
the inference algorithm.
\end{itemize}
\item ``Does $T_Y(X)$ respect Bayes' rule?''
One way to check this is to run step \ref{itm:grok-transition-operator} with
$Y$ empty, and compare the empirical distribution of $\{X^t\}$ to the prior
$P(X|\lambda)$, perhaps using a Q-Q plot on summary statistics. If $T$ is
meant to sample the posterior (rather than do MAP estimation or something
like that) then the distributions should match.
Another pair of distributions to compare is the history when $Y$ contains
data versus the distribution inferred by rejection sampling.
\item ``Now I:
\begin{enumerate}
\item Know the problem is solvable (i.e., the data is ok)
\item Grok my model (to some extent)
\item Grok my inference program (to some extent)
\item Grok their interaction (to some extent)
\item Have inspection infrastructure for model, data, and $T_Y$
\item Have quantitative evidence in support of intuition
\item Have (anecdotally) established asymptotic soundness of $T_Y$, so I
can get `Bayesian' behavior in some regime''
\end{enumerate}
\item Determine how much data and inference are needed to ``solve'' random
instances, and specific instances, and how inference performance compares to
baselines.
The idea is to get some measure of how inference quality depends on number
of steps $t$ and number of data points $N$. (Or, fixing a certain quality
goal, how large $t$ has to be as a function of $N$ to achieve that goal, or
vice versa.) If you can codify one or several concrete ways to measure
inference quality for your problem, that would be useful. One basic/rough
way to do this is to generate $X^*$ from the prior, $Y^* \sim
P(Y|X^*,\lambda)$, and double $t$ and $N$ until the inferred posterior
concentrates on $X^*$, though of course, this effect could be more
conspicuous or less conspicuous depending on the inference problem.
\item\label{itm:seatbelt}
Try to find ``seatbelt variables'' $\{\Xsp\}$ which capture the quintessence
of the sorts of inference dynamics that occur: something like an orthogonal
basis for the possible states $X$ which is orthogonal with respect to the
qualitative inference behaviors produced.
\item Use step \ref{itm:seatbelt} as a harness to perform full
diagnostics---i.e., histories/traces of ``score,'' probe statistics,
marginals, values of $\Xsp$'s, etc.--- on a few ``extreme'' instances (some
causing failure, either inductive or computational).
\item Try the real problem---or, add ``?? variable'' [not sure what this means
-- BZ] and repeat
\end{enumerate}
\section{Notes}
Notes about these steps:
\begin{itemize}
\item Each of these steps will be run many times. It is well worth it to set
up very clean, automatic machinery for the steps.
\item VKM points out, ``If the model and data generator are both parametrized
by the same scaling parameters (amount of evidence, number of random
variables) and if the inference algorithm is parametrized by $T$, then the
steps are essentially mechanical.''
\end{itemize}
\end{document}