Scipy Multinomial Probability Mass Function is almost always 0.0


To calculate the probability for categorical features, I was told to use the Multinoulli distribution, which is supposedly a special case of the multinomial distribution, where the number of trials is 1.

I was given some probabilities for 10 different categories:

p = [0.14285714 0.11428571 0.14285714 0.08571429 0.11428571 0.05714286 0.14285714 0.02857143 0.11428571 0.05714286]

And told to calculate the pmf using the following as my input [0,1,2,3,4,5,6,7,8,9].

However, when I tried this using the pmf of SciPy's multinomial distribution, I always got a result of 0.0. Playing around with different values for n, I noticed that the only time I get a number other than 0.0 is when n happens to be the sum of the values of my outcomes in x. And when I checked the documentation, coincidentally in all examples n is indeed the sum of the values.

I think I am fundamentally misunderstanding something here. What is the point of having a parameter n when apparently there is only one sensible value for it? And why would it be the sum of the values of my categories? The way the multinomial distribution was presented to me, I thought these were just names, labels, like the 0th category, the 1st category, etc. Summing them up makes no sense to me.

SciPy documentation

Answer by Matt Haberland

> And when I checked the documentation, coincidentally in all examples n is indeed the sum of the values.

It is not a coincidence.

> I think I am fundamentally misunderstanding something here. What is the point of having a parameter n when apparently there is only one sensible value for it?

n, the number of trials, can have any non-negative value. And the sum of your observation vector can be any value - but you'll only get a nontrivial PMF if the sum of your observation vector equals n. (I'll try to address the point of having this parameter at the end.)
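A small case that can be checked by hand may make this concrete. The three-category probabilities below are hypothetical and chosen only to keep the arithmetic short; they are not from the question. With n=3 and counts [1, 1, 1], the PMF is 3!/(1!·1!·1!) × 0.2 × 0.3 × 0.5 = 0.18, while counts that sum to 2 get probability 0.

```python
from scipy import stats

# Hypothetical three-category example, picked only so the result is easy to verify by hand.
p_small = [0.2, 0.3, 0.5]
rv_small = stats.multinomial(n=3, p=p_small)

matching = rv_small.pmf([1, 1, 1])    # counts sum to 3 == n, so the PMF is nontrivial
mismatched = rv_small.pmf([1, 1, 0])  # counts sum to 2 != n, so the PMF is zero

print(matching)    # 6 * 0.2 * 0.3 * 0.5, about 0.18
print(mismatched)  # 0.0
```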

> The way the multinomial distribution was presented to me, I thought these were just names, labels, like the 0th category, the 1st category, etc.

That is correct; they are just categories.

> Summing them up makes no sense to me.

The categories themselves are not summed; the numbers of trials in which each category was observed are summed.
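For the Multinoulli case mentioned in the question (a single trial, n=1), this means the input is a vector of counts rather than a category label. Below is a minimal sketch, assuming you want the probability of one observation of category 3 (an arbitrary choice for illustration): the count vector is one-hot, and the PMF is simply the probability of that category. The rounded probabilities from the question sum to 0.99999999, so the sketch renormalizes them first; that step is my addition.

```python
import numpy as np
from scipy import stats

p = np.array([0.14285714, 0.11428571, 0.14285714, 0.08571429, 0.11428571,
              0.05714286, 0.14285714, 0.02857143, 0.11428571, 0.05714286])
p = p / p.sum()  # renormalize: the rounded inputs sum to 0.99999999, not 1

category = 3  # hypothetical observed label, chosen for illustration
x_onehot = np.zeros(len(p), dtype=int)
x_onehot[category] = 1  # one observation of category 3, zero of every other category

# With n=1 the count vector must sum to 1, which a one-hot vector does.
prob = stats.multinomial(n=1, p=p).pmf(x_onehot)
print(prob, p[category])  # the two values agree
```

Passing [0, 1, 2, ..., 9] with n=1 would return 0.0 instead, because those counts sum to 45.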


I think I can see the confusion. I'll assume it is not a misunderstanding of the probability distribution itself, just SciPy's need for the parameter n.

To be more concrete, let's take your example. When you ask for the probability mass of x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] (0 observations of category 0, 1 observation of category 1, and so on), you seem to want the multinomial distribution to assume that the number of trials is n = 45, the sum of those counts. It doesn't make that assumption; you have to pass n explicitly. If you pass n=100, say, it is prepared to answer questions about the probabilities involved when you perform 100 trials. The probability mass of observing [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] is then zero, because 100 trials always produce 100 total observations, never only 45.

from scipy import stats

p = [0.14285714, 0.11428571, 0.14285714, 0.08571429, 0.11428571,
     0.05714286, 0.14285714, 0.02857143, 0.11428571, 0.05714286]
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # observed counts per category; sum(x) == 45
rv = stats.multinomial(n=100, p=p)  # distribution for 100 trials
rv.pmf(x)  # 0.0

If you had passed n=45, you would have gotten the result you were looking for.

rv = stats.multinomial(n=sum(x), p=p)
rv.pmf(x)  # 2.485730789414698e-16
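To see where a number like this comes from, the value can be compared against the textbook multinomial formula, n!/(x_0!·...·x_9!) multiplied by the product of p_i^x_i. The sketch below is my own cross-check, not something taken from SciPy's internals. It works in log space with scipy.special.gammaln to avoid overflowing the factorials. It also renormalizes the rounded probabilities so that they sum to 1, so the result can differ from the figure above in the later digits.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

p = np.array([0.14285714, 0.11428571, 0.14285714, 0.08571429, 0.11428571,
              0.05714286, 0.14285714, 0.02857143, 0.11428571, 0.05714286])
p = p / p.sum()    # renormalize the rounded probabilities
x = np.arange(10)  # counts 0, 1, ..., 9
n = int(x.sum())   # 45 trials in total

# log of the multinomial coefficient n! / (x_0! * ... * x_9!), using gammaln(k + 1) == log(k!)
log_coeff = gammaln(n + 1) - gammaln(x + 1).sum()
# log of p_0**x_0 * ... * p_9**x_9
log_prob = (x * np.log(p)).sum()
manual = np.exp(log_coeff + log_prob)

library = stats.multinomial(n=n, p=p).pmf(x)
print(manual, library)  # the two values agree to floating-point precision
```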

It is parameterized this way because it is part of an infrastructure that does more than just evaluate the PMF. For instance, if you want to draw random variates using the rvs method, you can see why you would need to pass in the number of trials n: it needs to know how many total observations to draw.

import numpy as np

y = rv.rvs()  # one random draw of category counts for n=45 trials
np.sum(y)  # 45
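The same holds for several draws at once. As a self-contained sketch, the snippet below rebuilds the distribution with n=45 and uses the size and random_state arguments of rvs. The seed value and the number of draws are arbitrary choices of mine, and the probabilities are renormalized because the rounded inputs do not sum exactly to 1.

```python
import numpy as np
from scipy import stats

p = np.array([0.14285714, 0.11428571, 0.14285714, 0.08571429, 0.11428571,
              0.05714286, 0.14285714, 0.02857143, 0.11428571, 0.05714286])
p = p / p.sum()  # renormalize the rounded probabilities

rv = stats.multinomial(n=45, p=p)

# Draw 5 count vectors; the seed is arbitrary and only makes the draw repeatable.
samples = rv.rvs(size=5, random_state=0)

print(samples.shape)        # one row per draw, one column per category
print(samples.sum(axis=1))  # every row sums to n
```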

Besides, it's standard to consider n to be a parameter of the multinomial distribution. See Wikipedia's article on the multinomial distribution, for instance - n is listed as a parameter.

(The number of categories, k in Wikipedia's notation, is inferred from the length of the array p.)

If that makes sense, please file a documentation issue to let us know what we might have included in the documentation to make that more obvious.