How to select records in SQL based on probability distribution on one column while making sure that another column value is unique

47 views Asked by At

Context

I have the following Tables in Microsoft SQL Server:

MainTable

UserId
EmailType
Column 1
Column 2
..
Column N

EmailDistribution

EmailType
WeightageRequired

EmailType has about 10 distinct value posibility, say A-J, for both the tables. Also, the sum of WeightageRequired for the 10 values togather is 1.

In MainTable, there are multiple combinations of UserId & EmailType. However, UserId-EmailType Combination would have a unique value. Thus, following is possibility:

UserId - EmailType

1-A
1-B
1-D
2-A
2-C
2-G
3-A
3-B

etc. Also, one particular UserId need not have a combination with all the EmailType values. Thus, in the given example above, UserId 1 only has rows corresponding to A,B & D and not for the rest of the possible values of EmailType.

Now, the requirement:

Out of, say, 50000 rows in the MainTable, I want to select 1 row for each UserId such that the distribution of the EmailType is as close as possible to the EmailDistribution's WeightageRequired.

For example, if the 50000 rows have 12000 unique UserIds, the resultset has to have 12000 rows (and only 12000 rows). However, within the set of rows for a particular UserId, a random row should be picked so as to work towards the required weightage distribution.

There is no hard requirement for the weightage to be exactly what is required, but the closer it is to the requirement, the better is the model fit.

I hope I am able to explain my problem correctly.

Request the great minds on StackOverflow for assistance.

1

There are 1 answers

1
Isolated On

Here's a non-loop version, albeit not ideal, but you could create a bunch of CTEs to get distinct userids.

Let's pretend these are your weights by emailType:

EmailType   Weights
A           0.000976563
B           0.001953125
C           0.00390625
D           0.0078125
E           0.015625
F           0.03125
G           0.0625
H           0.125
I           0.25
J           0.5

And let's pretend you have 10,000 distinct UserIDs in the main table. So group A should represent ~10 rows (10,000 * 0.00097).

with group_a as (
select distinct top 10 userid, 'A' as email_type
  from mainTable
 where emailType = 'A'
),
group_b as (
select distinct top 20 userid, 'B' as email_type
  from mainTable
 where emailType = 'B'
   and userid not in (select userid from group_a)
),
group_c as (
select distinct top 40 userid, 'C' as email_type
  from mainTable
 where emailType = 'C'
   and userid not in (select userid from group_a)
   and userid not in (select userid from group_b)
),
...
group_j as (
select distinct top 5000 userid, 'J' as email_type
  from mainTable
 where emailType = 'J'
   and userid not in (select userid from group_a)
   and userid not in (select userid from group_b)
   and userid not in (select userid from group_c)
   and userid not in (select userid from group_d)
   and userid not in (select userid from group_e)
   and userid not in (select userid from group_f)
   and userid not in (select userid from group_g)
   and userid not in (select userid from group_h)
   and userid not in (select userid from group_i)
)
select userid, email_type from group_a union
select userid, email_type from group_b union 
select userid, email_type from group_c union
...
select userid, email_type from group_j 

This "works" to some degree but you're not guaranteed to get all 10,000 user ids. For example, the first 3 CTEs may grab some people you need for Group D. If you are looking for a non-loop solution, then perhaps start by first counting how many unique IDs you have by EmailType and then compare to your Weights. That may help determine how to order the CTEs (should you choose this method).

Again, not ideal but might get you close to what you need.