I have a loop of about 7000 objects and within the loop I need to get a distinct count of a list of structs. Currently I am using -
foreach (var product in productsToSearch)
{
Console.WriteLine("Time elapsed: {0} start", stopwatch.Elapsed);
var cumulativeCount = 0;
productStore.Add(product);
var orderLinesList = totalOrderLines
.Where(myRows => productStore.Contains(myRows.Sku))
.Select(myRows => new OrderLineStruct
{
OrderId = myRows.OrderId,
Sku = myRows.Sku
});
var differences = totalOrderLines.Except(orderLinesList);
cumulativeCount = totalOrderLinsCount - differences.Select(x => x.OrderId).Distinct().Count();
cumulativeStoreTable.Rows.Add(product, cumulativeCount);
Console.WriteLine("Time elapsed: {0} end", stopwatch.Elapsed);
}
public struct OrderLineStruct
{
public string OrderId { get; set; }
public string Sku { get; set; }
}
This is extremely slow when getting the distinct count. Anyone know of a more efficient way of doing this? I have tried using MoreLinq which has a DisctintBy method for Linq but it's not more efficient as I have timed it. I have played around with PLinq but I am a little unsure of where to parallelize this query.
So each iteration of the loop is timed at -
Time elapsed: 00:00:37.1142047 start
Time elapsed: 00:00:37.8310148 end
= 0.7168101 secs * 7000 = 5017.6707 (83.627845 minutes)
Its the Distinct() Count() line which is taking the most time to process (around 0.5 secs). The variable differences has a few hundred thousand OrderLineStruct's so doing any linq queries on this is slow.
UPDATE
I have modified the loop a bit and now it runs in around 10 minutes rather that over 1 hour
foreach (var product in productsToSearch)
{
var cumulativeCount = 0;
productStore.Add(product);
var orderLinesList = totalOrderLines
.Join(productStore, myRows => myRows.Sku, p => p, (myRows, p) => myRows)
.Select(myRows => new OrderLineStruct
{
OrderId = myRows.OrderId,
Sku = myRows.Sku
});
totalOrderLines = totalOrderLines.Except(orderLinesList).ToList();
cumulativeCount = totalOrderLinesCount - totalOrderLines.Select(x => x.OrderId).Distinct().Count();
cumulativeStoreTable.Rows.Add(product, cumulativeCount);
}
Having a .ToList() on the Except seems to make a difference and now I am removing the already processed orders after each iteration which is increasing performance for each iteration.
You are seeking the problem at the wrong place.
The
orderLinesList
,differences
anddifferences.Select(x => x.OrderId).Distinct()
are just LINQ to Objects chained query methods with deferred execution, and theCount()
method is executing them all.Your processing algorithm is highly inefficient. The bottleneck is the
orderLinesList
query which iterates the wholetotalOrderLines
list for eachproduct
, and this is chained (included) in theExcept
,Distinct
etc. - again, inside the loop, i.e 7000+ times.Here is a sample efficient algorithm that IMO does the same: