That's a lot of games, but the number of victories per colour is just a few dozen. So are these differences statistically significant? Let's math!

Our null hypothesis here is that the probability of winning the game is equal for all draconian colours. The binomial test is a nice simple place to start. It's a test that can answer questions like "how unlikely is it that a fair coin would land heads at least 15 times out of 20 tosses?", and answer them exactly. We can calculate the probability of exactly k heads in n tosses as C(n, k) pᵏ(1 − p)ⁿ⁻ᵏ. In general, the binomial test can answer any question of the form "how likely is it to get an outcome as extreme as k successes in n trials given a probability of success p?".

So let's imagine the true probability of winning for any colour of draconian is just the average win rate over all colours in our sample (1.6%). Assuming this is the case, what's the chance that, when sampling ~2200 games, we get as few as 24 wins, or as many as 53? Under our null hypothesis, the chances of randomly observing win rates as extreme as those of grey or black draconians come out small. If we choose a typical significance level of 5%, then these are good enough to call significant. The other colours, however, don't meet that significance threshold.

But there's a problem here. The chance of any given group having a significant p-value given the null hypothesis is 5%, but I tested nine groups. The more hypotheses I test, the greater the chance that one will produce a false positive (relevant xkcd). A really simple mitigation is the Bonferroni correction, which just cuts the significance threshold for each of n tests down to 1/nth of the desired overall significance level. At 0.05 / 9 ≈ 0.006, only grey draconians' p-value makes the cut.

While we were lucky enough to get a significant result here, a major problem with this method is that it's too conservative with respect to our null hypothesis: it's possible for several colours to genuinely differ from the overall mean while none of their individual p-values survives the harsher corrected threshold. What we'd really like is to take the integral of the binomial probability over all win rates p from 0 to 1, weighted by a suitable prior on p (a beta distribution with α = 321, β = 20047 is a good representation of where we would expect the overall win rate to fall, given the wins and losses in our sample). It turns out this kind of compound distribution is common enough to have a name and a Wikipedia article: it's a beta-binomial distribution. Neat.

But this still doesn't help us with the main problem of combining p-values per colour to get a single p-value for our null hypothesis. Let's move on to the conventional test for these kinds of situations, which does address this problem.

Pearson's chi-squared test takes as input a table of the frequencies of some outcomes (e.g. winning and losing) over some categories (e.g. draconian colours), and gives us a measure of the likelihood of these observations given a hypothetical distribution (e.g. our null hypothesis that all colours have the same win rate). It's not as intuitive as the binomial test, and unlike the binomial test, it is not exact. It relies on the data being normally distributed, which is asymptotically true for binomially/multinomially distributed data such as ours. There is an equivalent exact test, Fisher's exact test, but it's computationally expensive. (That Wikipedia section gives some rules of thumb for whether the normal approximation is appropriate, which our data conform to.)

On the other hand, a nice aspect of this test is that we don't need to worry about combining p-values from several tests, or raising our threshold to account for multiple comparisons. Whereas before we separately tested 9 hypotheses ("Do grey draconians win significantly more?", "Do purple draconians win significantly more?", and so on), now we're applying one test to all the data to test one hypothesis: are there any significant differences in win rate among colours?

If we wish to keep using a significance level of 5% (i.e. mistakenly reject the null hypothesis no more than 5% of the time), then we do not have sufficient evidence to conclude that grey draconians are significantly better than red, green, purple, or white draconians. Can we conclude that pale, yellow, mottled and black draconians are all worse than grey? Unfortunately the problem of multiple comparisons rears its head again, and even dividing our p-value threshold by 9 isn't enough to save us this time, because I've committed a new crime: testing hypotheses suggested by the data. Or grey draconians, I guess.
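The coin-toss version of the binomial test described above is easy to check numerically. Here is a minimal Python sketch (the helper names are mine, not from the post; `scipy.stats.binomtest` is the off-the-shelf equivalent if you have SciPy available):

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials, each succeeding with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_upper_tail(k, n, p):
    """Probability of at least k successes (one-sided upper tail)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

# "How unlikely is it that a fair coin lands heads at least 15 times out of 20 tosses?"
print(binom_upper_tail(15, 20, 0.5))  # ≈ 0.0207
```

Because everything here is exact integer arithmetic under the hood, this is the "exact" quality of the binomial test the post mentions: no normal approximation is involved.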
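The Bonferroni correction discussed above is a one-liner in practice. A sketch, using made-up per-colour p-values purely for illustration (these are not the post's actual results; only the 0.05/9 threshold comes from the text):

```python
alpha = 0.05
n_tests = 9  # nine draconian colours tested

# Hypothetical p-values, one per colour (illustrative only)
p_values = {"grey": 0.003, "black": 0.040, "red": 0.30, "white": 0.55}

threshold = alpha / n_tests  # 0.05 / 9 ≈ 0.0056
significant = [colour for colour, p in p_values.items() if p < threshold]
print(significant)  # only "grey" survives the corrected threshold
```

Note how "black" at p = 0.04 would pass an uncorrected 5% threshold but fails the corrected one, which is exactly the conservatism the post complains about.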
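The integral over win rates described above has a closed form: the beta-binomial probability mass function is C(n, k)·B(k + α, n − k + β)/B(α, β). A minimal sketch, computed in log space to avoid overflow (function names are mine; the α = 321, β = 20047 prior is the one from the text):

```python
from math import exp, lgamma

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binom_pmf(k, n, alpha, beta):
    """P(k wins in n games) with the win rate integrated out
    against a Beta(alpha, beta) prior."""
    log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_comb + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))

# Probability of seeing 53 wins in ~2200 games under the post's prior on the win rate
print(beta_binom_pmf(53, 2200, 321, 20047))
```

A sanity check on the formula: with a uniform Beta(1, 1) prior, every outcome from 0 to n wins is equally likely, so the pmf collapses to 1/(n + 1).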
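Pearson's chi-squared test, as applied above, can be sketched from scratch: build the expected count for each cell from the row and column totals, then sum (observed − expected)²/expected. The win/loss counts below are hypothetical (not the post's real data), and in practice `scipy.stats.chi2_contingency` does all of this for you; for a 2×9 table there are (2−1)(9−1) = 8 degrees of freedom, and the 5% critical value of χ² at 8 degrees of freedom is about 15.51.

```python
def chi_squared_stat(table):
    """Pearson's chi-squared statistic for a 2 x k contingency table.
    table: list of (wins, losses) pairs, one per category."""
    row_totals = [sum(col[i] for col in table) for i in range(2)]  # total wins, total losses
    grand = sum(row_totals)
    stat = 0.0
    for wins, losses in table:
        col_total = wins + losses
        for observed, row_total in zip((wins, losses), row_totals):
            expected = row_total * col_total / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical (wins, losses) per colour out of 2200 games each -- illustrative only
counts = [(55, 2145), (24, 2176), (35, 2165), (30, 2170), (40, 2160),
          (33, 2167), (36, 2164), (28, 2172), (38, 2162)]
stat = chi_squared_stat(counts)
print(stat > 15.51)  # compare against the 5% critical value for 8 degrees of freedom
```

This is the sense in which one test covers all nine colours at once: a single statistic, a single threshold, no per-colour correction needed.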