home ::
syllabus ::
timetable ::
groups ::
moodle(591,
791) ::
video tbd ::
© 2021
Two reasons to explore all the above is that, after five years of software fairness testing, that line of research may have run its course:
- It is not... ideal... that all that work is based on so few data sets. If we broadenned the scope of that work, would we be exploring a more useful and interesting space?
- https://arxiv.org/pdf/1912.00761.pdf argues that what ComSci and ML people call fairness is NOT what lawyers call fairness. So it might be time to look at the bigger picture.
So can we leap on to a broader definition of fairness?
How did he do it
Hint: not by hyper-parameter optimization (which is like walking around a car and hitting it till it works).
Instead of build a bad model, then fixing it, why not build the right model in the first place?
(Invented about 2 weeks ago by yours truly.)
Take data with columns:
- P : protected attributes
- U : unprotected attributes
- T : target attributes (may be many: multi-objective).
Cluster the rows:
- Ps : using just the U attributed
- Us : using just the U attributed
- Ts : using just the U attributed
If a learner works, it finds what attributes select form what target values; i.e.
- At the very least: large overlap Us and Ts
If a leaner is fair, then protected attributes do not determine the target score:
- Small overlap Ps and Ts
Which means transitively:
- Small overlap Ps and Us
So a fair learner build from P is one that is very good for predicting for the target (i.e. be good at P2T) and very bad at predicting for the protected attributes (.e. be bad at P2U).
- In-processing:
- learners could be modified to maximize for P2T (which is common practice) while minimizing P2U (which is a new idea).
- Kewen's results (above) are an example of this analysis
- Pre-processing:
- The P2U connection could be determined looking at the data without access to any T.
- Pre-processors that weakened the P2U connection (e.g. Joy's rebalancing) would generate data that, theoretically, no data could then be unfair about
(no, its not called "FART"
Before going further, you need to know:
- Decision trees are recursive diversity reduction algorithms
- Find the split that most reduces diversity
- Recurse on each split.
- Stop when (e.g.) too few examples in each split.
BTW, RandomForests says "if one tree is good, why not build 100?"
- each time, grab (say) log(N) of the attributes and some percent of the rows
- build N trees
- make a conclusion by voting across the forest
Example1
- let us measure diversity using standard deviaton
- sqrt((∑ square(x - μ))/(n-1))
- standard deviation σ of 9,2,5,4,12,7 has μ = 6.5 and σ=3.619.
- Learners like CART and M5prime and Random Forest Regressorts used standard deviation.
- Why? Cause these learners predict for numeric class variables.
Example2:
- let us measure diversity using entropy; i.e. -1* ∑ p*log2(p)
e.g. 1 orange, 1 apple, 2 bananas, and 4 grapes - occur at probability 1/8, 1/8, 1/4, and 1/2
- 8 =2*2*2 so log2( 1/8 ) = -3
- 4 =2*2 so log2( 1/4 ) = -2
- 2 =2 so log2( 1/2 ) = -1
- what is entropy of (o,a,b,b,g,g,g,g)
- -1 * (1/8*-3 + 1/8*-3 + 1/4*-2 + 1/2*-1)
- = -1 * (1/8*-3 + 1/8*-3 + 2/8*-2 + 4/8*-1)
- = -1/8 * (-6 + -4 + -4)
- = 14/8
- = 1.75
- Learners like decision trees and random forests use entropy
- Why? Cause these learners predict for symbolic class variables.
Example3:
- What is the best split for this data?
- Here are the options (note that this is four different splits):
- Consider the outlook tree. We have three sub-branches so the expected value of the diversity after the split is
- 5/14 * entropy of sunny split
- 4/14 * entropy of overcast split
- 5/14 * entropy of rainy split
- The overcast split is easy:
- we only have yes so p(yes) = 1 and log2(p) = 0
- and 4/14 is zero
- The sunny and rainy split are symmetric
- sunny: 2 yes and 3 no = -1 * (2/5 * log2(2/5) + 3/5 * log2(3/5)) = 0.97
- raning: 3 yes and 2 no = same entropy as sunny
- So the expected value after the outlook split is
- (5/14 * 0.97) + (4/14 * 0) + (5/14 * 0.97) = 0.69
- (BTW, this is an improvement since before the split we have 9 yes, 5 no; ie. entropy was 0.94; i.e. more diversity).
If we repeat this calc over all splits, we get
- outlook split: 0.69
- temperate split: 0.91 = humidity split: 0.79
- windy split: 0.89
So we would split on outlook.
- https://github.com/txt/ase16/blob/master/src/ase.py#L989-L1039
- For an explanation of lines 10014 to 1017, see 3.3 of http://robotics.stanford.edu/users/sahami/papers-dir/disc.pdf#page=4
- Long story short, when recursively diving dividing numbers, they use information theory to guide termiantion
- Note that this is a top-down approach (whereas my preferred discretizer is bottom=-up... so simpler termination).
- See http://menzies.us/keys/range.html#section-15
- But to-down, botttom-up, the main point here is that there is n over-arching framework and smaller details withing
- and that framework can be used for fairness - by design
- e.g. what kewen did was to score each range by its ability to predict for the goal (minimize target entropy) with w and maximizing protected attribute entropy (with 1-w). Hence the above tabe
Consider the following Bayes net. This is a bi-directional structure. Each node's distribution is the product of its parents. If you adjust any distribution, then the upstream and downstream distributions adjust to be be consistent.
Note that if a node has more than one parent, it is non-determinate what happens when you change something downstream (which upstream variable do you fiddle)? That is:
- updates lead to choices
- choices could be assessed by fairness predicates.
E.g.
Just as we can adjust decision tree generate to enable fair-by-design, so too can we adjust other AI tools. What would it mean to tinker with out kodels, watch for the assumptions that tinkering might generate, then explore that assumptions space.
There are bayesian methods for that
There are ASP methods for that
There are prob.programming methods for that.
For me: old old song in new new new tools
- Abduction + optimization
Welcome back to abduction. When reasoning about symbols across rules, abduction tells us we can generate multiple interpretations. For example:
- Grass is wet if it rained.
- Grass is wet if the sprinkler was on.
- The sun was shining.
- false if it rained and the sun was shining.
(The last item is an integrity constraint.) Then given "the grass is wet" then the worlds of belief are "it rained" and "the sprinkler was on". But it we new the sun was shining then the integrity constraint would rule out raining.
Here's a more complex example:
feed(lactose) :- make(permease), make(galactosidase).
make(Enzyme) :- code(Gene, Enzyme), express(Gene).
express(lac(X)) :- amount(glucose, low), amount(lactose, hi).
express(lac(X)) :- amount(glucose, medium), amount(lactose, medium).
code(lac(y), permease).
code(lac(z), galactosidase).
temperature(low) :- amount(glucose, low).
Integrity constraints (IC)
false :- amount(S, V1), amount(S, V2), V1 ≠ V2.
Abducibles (A) (things we are allowed to guess).
abducible_predicate(amount).
The problem goal is feed(lactose). This can arise either as an observation to be explained or as a state of affairs to be achieved by finding a plan. This goal has two abductive explanations:
- amount(lactose, hi), amount(glucose, low)}\\Delta
- lactose, medium), amount(glucose, medium)
Here again, we are exploring some model and we find choices. Without any knowledge of local goals, we have to just throw a coin and make decisions that might be bad:
So what would Kewen's work mean for model-based logical reasoning?
- Welcome to [Answer set programming](hsolving.pdf}
- Slides 511,512
- SLides 513 (which, for me, is abduction + optimization)
- Slide 516: reasoning without keys (so I think we can supercharge ASP with keys)
- Slides 517
- Stan: variables don't have assignments, they have distributions
- Prob log: closer to ASP
- And a rich combination of both
- http://proceedings.mlr.press/v103/cozman19a/cozman19a.pdf (see Figure1).



