Select Distinct Count is really slow


I have a loop over about 7000 objects, and within the loop I need to get a distinct count of a list of structs. Currently I am using -

foreach (var product in productsToSearch)
{
    Console.WriteLine("Time elapsed: {0} start", stopwatch.Elapsed);
    var cumulativeCount = 0;
    productStore.Add(product);
    var orderLinesList = totalOrderLines
        .Where(myRows => productStore.Contains(myRows.Sku))
        .Select(myRows => new OrderLineStruct
        {
            OrderId = myRows.OrderId,
            Sku = myRows.Sku
        });
    var differences = totalOrderLines.Except(orderLinesList);
    cumulativeCount = totalOrderLinesCount - differences.Select(x => x.OrderId).Distinct().Count();
    cumulativeStoreTable.Rows.Add(product, cumulativeCount);      
    Console.WriteLine("Time elapsed: {0} end", stopwatch.Elapsed);
}

public struct OrderLineStruct
{
    public string OrderId { get; set; }
    public string Sku { get; set; }
}

This is extremely slow when getting the distinct count. Does anyone know of a more efficient way of doing this? I have tried MoreLinq, which has a DistinctBy method, but it is no more efficient (I have timed it). I have also played around with PLINQ, but I am unsure where to parallelize this query.

So each iteration of the loop is timed at -
Time elapsed: 00:00:37.1142047 start
Time elapsed: 00:00:37.8310148 end

= 0.7168101 secs * 7000 = 5017.6707 secs (83.627845 minutes)

It's the Distinct().Count() line that takes the most time to process (around 0.5 secs). The differences variable holds a few hundred thousand OrderLineStructs, so any LINQ query over it is slow.

UPDATE

I have modified the loop a bit and now it runs in around 10 minutes rather than over 1 hour:

foreach (var product in productsToSearch)
{
    var cumulativeCount = 0;
    productStore.Add(product);
    var orderLinesList = totalOrderLines
        .Join(productStore, myRows => myRows.Sku, p => p, (myRows, p) => myRows)
        .Select(myRows => new OrderLineStruct
        {
            OrderId = myRows.OrderId,
            Sku = myRows.Sku
        });
    totalOrderLines = totalOrderLines.Except(orderLinesList).ToList();
    cumulativeCount = totalOrderLinesCount - totalOrderLines.Select(x => x.OrderId).Distinct().Count();
    cumulativeStoreTable.Rows.Add(product, cumulativeCount);
}

Calling .ToList() on the Except result seems to make the difference, and I am now removing the already processed orders after each iteration, which speeds up every subsequent iteration.

There are 4 answers

Ivan Stoev (Best Answer)

You are looking for the problem in the wrong place.

The orderLinesList, differences and differences.Select(x => x.OrderId).Distinct() are just chained LINQ to Objects query methods with deferred execution; the Count() method is what actually executes them all.

Your processing algorithm is highly inefficient. The bottleneck is the orderLinesList query, which iterates the whole totalOrderLines list for each product, and it is chained (included) in the Except, Distinct etc. - again inside the loop, i.e. 7000+ times.
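
To illustrate the deferred execution point with the question's own variables (the local names query, first and second below are just for illustration): nothing is evaluated until Count() runs, and every Count() re-runs the whole chain.

var query = totalOrderLines
    .Where(l => productStore.Contains(l.Sku)) // nothing executes yet
    .Select(l => l.OrderId)                   // still nothing
    .Distinct();                              // still nothing

var first = query.Count();  // the whole pipeline runs here
var second = query.Count(); // ...and runs again from scratch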

Here is a sample efficient algorithm that IMO does the same:

Console.WriteLine("Time elapsed: {0} start", stopwatch.Elapsed);
var productInfo =
(
    from product in productsToSearch
    join line in totalOrderLines on product equals line.Sku into orderLines
    select new { Product = product, OrderLines = orderLines }
).ToList();
var lastIndexByOrderId = new Dictionary<string, int>();
for (int i = 0; i < productInfo.Count; i++)
{
    foreach (var line in productInfo[i].OrderLines)
        lastIndexByOrderId[line.OrderId] = i; // Last wins
}
int cumulativeCount = 0;
for (int i = 0; i < productInfo.Count; i++)
{
    var product = productInfo[i].Product;
    foreach (var line in productInfo[i].OrderLines)
    {
        int lastIndex;
        if (lastIndexByOrderId.TryGetValue(line.OrderId, out lastIndex) && lastIndex == i)
        {
            cumulativeCount++;
            lastIndexByOrderId.Remove(line.OrderId);
        }
    }
    cumulativeStoreTable.Rows.Add(product, cumulativeCount);
    // Remove the next line if it was just there to support your existing processing
    productStore.Add(product);
}
Console.WriteLine("Time elapsed: {0} end", stopwatch.Elapsed);
Eniola

I would recommend changing this portion of your LINQ query

totalOrderLines.Where(myRows => productStore.Contains(myRows.Sku))

to a Join to read thus:

totalOrderLines.Join(productStore, myRows => myRows.Sku, p => p, (myRows, p) => myRows)

This way you pay the cost once rather than having Contains traverse your product store list 7,000 times, which is very inefficient. Also, if it is possible to make your ids integral data types (int, long) rather than strings, you should get faster lookups and comparisons too. But I guess the structure of your model is pretty much set.
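
A related sketch (my own illustration, assuming productStore is a List<string> of SKUs, as the question suggests): if you prefer to keep the Where shape, wrapping productStore in a HashSet<string> also makes each Contains lookup O(1) instead of a linear scan.

var productStoreSet = new HashSet<string>(productStore);

var orderLinesList = totalOrderLines
    .Where(myRows => productStoreSet.Contains(myRows.Sku)) // O(1) lookup per row
    .Select(myRows => new OrderLineStruct
    {
        OrderId = myRows.OrderId,
        Sku = myRows.Sku
    });

If productStore grows inside the loop, the new product needs to be added to the set as well.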

Wicher Visser

Where does totalOrderLines originate from? An MSSQL database, maybe? If so, you will need an index on the OrderId column. Doing a Distinct() without an index on that column forces the DB engine to iterate through all rows to identify the distinct values.

Kote

In your case, as Jon Hanna mentioned, the bottleneck is the Except method; Distinct and Count are secondary.
You can verify this by forcing enumeration of each part of the method and putting a stopwatch around each part.

var olSw = new Stopwatch();
var exSw = new Stopwatch();
var dcSw = new Stopwatch();

foreach (var product in productsToSearch)
{
    var cumulativeCount = 0;
    productStore.Add(product);

    olSw.Start();
    var orderLinesList = totalOrderLines
        .Where(myRows => productStore.Contains(myRows.Sku))
        .Select(myRows => new OrderLineStruct
        {
            OrderId = myRows.OrderId,
            Sku = myRows.Sku
        }).ToList();
    olSw.Stop();

    exSw.Start();
    var differences = totalOrderLines.Except(orderLinesList).ToList();
    exSw.Stop();

    dcSw.Start();
    cumulativeCount = totalOrderLinesCount - differences.Select(x => x.OrderId).Distinct().Count();
    dcSw.Stop();
}

Measurements:
productsToSearch count: 100
totalOrderLines count: 300,000

Total olSw time: 00:00:01.3583340
Total exSw time: 00:00:14.3304959
Total dcSw time: 00:00:04.1986018

The exSw time can be reduced by an explicit implementation of GetHashCode on OrderLineStruct.
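
One possible shape for that (a sketch based on the struct in the question, not necessarily the exact implementation behind the numbers below); implementing IEquatable<OrderLineStruct> at the same time also avoids boxing when Except and Distinct compare values:

public struct OrderLineStruct : IEquatable<OrderLineStruct>
{
    public string OrderId { get; set; }
    public string Sku { get; set; }

    public bool Equals(OrderLineStruct other) =>
        OrderId == other.OrderId && Sku == other.Sku;

    public override bool Equals(object obj) =>
        obj is OrderLineStruct other && Equals(other);

    public override int GetHashCode()
    {
        unchecked
        {
            var hash = 17;
            hash = hash * 31 + (OrderId?.GetHashCode() ?? 0);
            hash = hash * 31 + (Sku?.GetHashCode() ?? 0);
            return hash;
        }
    }
}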

With explicit GetHashCode:

Total olSw time: 00:00:01.4045676
Total exSw time: 00:00:08.4691066
Total dcSw time: 00:00:03.9439711

Total time change without redundant enumeration:
Default GetHashCode Total time: 00:00:18.9649790
Explicit GetHashCode Total time: 00:00:12.7736320

Update:
You can also optimize this by changing the method's logic.

For example, you can create a HashSet from totalOrderLines and then just remove the already processed items from it (a HashSet sketch follows the snippet below).

var orderLinesList = totalOrderLines
    ... 
    .ToList();

orderLinesList.ForEach(item => totalOrderLines.Remove(item));

cumulativeCount = totalOrderLinesCount - totalOrderLines.Select(x => x.OrderId).Distinct().Count();
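
A minimal sketch of the HashSet idea (assuming OrderLineStruct has value equality, e.g. via the GetHashCode/Equals above): copy totalOrderLines into a HashSet once, so each removal is O(1) rather than List<T>.Remove's linear scan.

var remainingLines = new HashSet<OrderLineStruct>(totalOrderLines);

// inside the loop, after orderLinesList has been materialized:
foreach (var line in orderLinesList)
    remainingLines.Remove(line);

cumulativeCount = totalOrderLinesCount - remainingLines.Select(x => x.OrderId).Distinct().Count();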

In my case this change reduces the total time to 7 seconds.
Total time: 00:00:07.0851111

In this case the enumeration through totalOrderLines with Distinct is a bottleneck, but it takes O(N) time, which is OK.