I'm running Python code that's similar to:
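import numpy

def get_user_group(user, groups):
    # Assign a group only once; later calls reuse the stored assignment.
    if not user.group_id:
        user.group_id = assign(groups)
    return user.group_id

def assign(groups):
    ids = []
    percentages = []
    for group in groups:
        ids.append(group.id)
        percentages.append(group.percentage)  # e.g. .33
    # Weighted draw: each id is picked with its group's probability.
    assignment = numpy.random.choice(ids, p=percentages)
    return assignment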
We are running this in the wild against tens of thousands of users. I've noticed that the assignments do not respect the configured group percentages. For example, if our percentages are [.9, .1], we see a consistent hour-over-hour split of 80% and 20%. We've confirmed that the inputs to the choice function are correct and that they do not match the observed behavior.
Does anyone have a clue why this could be happening? Is it because we are using NumPy's global random state? Some sets of groups are split [.9, .1] while others are [.33, .34, .33], etc. Is it possible that different sets of groups are interfering with each other?
We are running this code in a Python Flask web application on a number of nodes.
Any recommendations on how to get reliable "random" weighted choice?
This exceeded the length limit for a comment, so I am posting it as an answer.
The fact that your team could not reproduce the problem, and got correct results instead, suggests that NumPy itself can most probably suit your needs. NumPy's efficiency may benefit you later; it does not appear to be your concern right now.
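An offline check of this kind can be sketched as follows. This is only a minimal sketch, not your production code: the group ids are hypothetical, the seed is fixed purely so the check is repeatable, and I use NumPy's Generator API (default_rng) rather than the global numpy.random.choice from the question.

```python
from collections import Counter

import numpy as np

ids = ["control", "variant"]   # hypothetical group ids
percentages = [0.9, 0.1]       # weights from the question's example
n_draws = 100_000

# Fixed seed only so that the check is repeatable; do not do this in production.
rng = np.random.default_rng(12345)

# Draw all assignments at once and count how often each id was chosen.
draws = rng.choice(ids, size=n_draws, p=percentages)
counts = Counter(draws.tolist())

# Observed share of each group; these should be close to `percentages`.
observed = {gid: counts[gid] / n_draws for gid in ids}
print(observed)
```

If the observed shares land close to the configured weights here but not in production, the choice function is unlikely to be the culprit, and the surrounding setup deserves a closer look.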
A more complete description of your code and of the infrastructure setup on your nodes would be helpful, though. How often do you restart your Flask server? Where do you initialize the NumPy random generator?
Consider a Flask app that serves a page /random, which can be customized with a size, e.g. localhost:5000/random?size=20.
In this example, the state is initialized once, after the Flask app is started. Whenever the /random page is requested, good random numbers are generated. If you put the state initialization with a fixed seed inside the function, it would surely cause unexpected distributions, because every request would get the same random numbers (and therefore the same choices).
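A minimal sketch of such an app, assuming Flask and NumPy are installed. The route /random and the size parameter follow the description above; the names app, rng and random_page, the default size and the upper bound on size are my own assumptions for illustration.

```python
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

# Created once, when the module is imported (i.e. when the app starts).
# No fixed seed is passed, so NumPy seeds the generator from OS entropy.
rng = np.random.default_rng()


@app.route("/random")
def random_page():
    # Read ?size=N from the query string; fall back to 10 values.
    size = request.args.get("size", default=10, type=int)
    # Keep the size within a sane range (the bounds are arbitrary assumptions).
    size = max(0, min(size, 1000))
    # Reuse the shared generator instead of re-creating it per request.
    values = rng.random(size)
    return jsonify(values.tolist())
```

Moving the default_rng call into random_page with a hard-coded seed would make every request return the same numbers, which is the failure mode described above.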
If you use multiple nodes and initialize them all with the same seed, the different nodes will produce the same choices. In this case, use the unique node IDs as seed values. If you restart the servers often, combine the restart ID or a timestamp with the unique node ID. It is also a good idea to make sure that the timestamp is logged.
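One way to combine a node ID with a restart timestamp is NumPy's SeedSequence, which mixes several integers into one seed. The sketch below is an assumption about how this could look, not a description of your setup: make_node_rng, the node IDs and the group ids are all hypothetical.

```python
import logging
import time

import numpy as np

logging.basicConfig(level=logging.INFO)


def make_node_rng(node_id: int, restart_stamp: int) -> np.random.Generator:
    """Build a generator whose seed depends on both the node and the restart."""
    # Log the seed inputs so a run can be reproduced later if needed.
    logging.info("seeding RNG: node_id=%d restart_stamp=%d", node_id, restart_stamp)
    seed = np.random.SeedSequence([node_id, restart_stamp])
    return np.random.default_rng(seed)


ids = ["control", "variant"]   # hypothetical group ids
percentages = [0.9, 0.1]
stamp = int(time.time())       # taken once, at (re)start

# Same node ID and same timestamp: both generators yield identical choices.
same_a = make_node_rng(1, stamp).choice(ids, size=10, p=percentages)
same_b = make_node_rng(1, stamp).choice(ids, size=10, p=percentages)

# Different node IDs: the streams are independent of each other.
node_1 = make_node_rng(1, stamp).random(5)
node_2 = make_node_rng(2, stamp).random(5)
```

The first pair shows why identical seeds across nodes are a problem, and the second pair shows that distinct node IDs are enough to separate the streams.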