Optimize insert of 1B+ rows

56 views Asked by user8015860 At 19 January 2023 at 00:20

I work in an organization that has over 75,000 employees. In our payroll system, each employee has 32 unique banks which store things like Sick Time, Vacation Time, Banked Overtime, etc.

Here are the existing tables

Employee
(
   Employee_key INT IDENTITY(1,1)
   Lastname,
   Firstname
)

Employee Key | Lastname | Firstname
-----------------------------------
100          | Smith    | John

Bank
(
   Bank_key INT IDENTITY(1,1),
   Bank_name VARCHAR(50)
)

Bank_key  | Bank_name
---------------------
100       | VACATION


Employee_balance
(
  Employee_key INT, --FK to Employee
  Bank_key INT, --FK to Bank
  Bank_balance NUMERIC(10,5) -- Aggregate value of bank including future dated entries
)

Employee_key | Bank_Key | Bank_Balance
--------------------------------------
100          | 100      | 0


Employee_balance_trans
(
  Employee_key INT, --FK to Employee
  Bank_key INT, --FK to Bank
  Trans_dt DATE -- transaction date that affects the bank
  Bank_delta NUMERIC(10,5)  
)

Employee_key | Bank_key | Trans_dt | Balance_delta
--------------------------------------------------
100          | 100      | 20230701 | -8.0
100          | 100      | 20230801 | -8.0
100          | 100      | 20230901 | -8.0
100          | 100      | 20231001 | -8.0
100          | 100      | 20231101 | -8.0

This employee has 5 vacation days booked into the future, for a total of 40 hours. As of January 1, the employee had 40 hours in their vacation bank, but because the employee_balance table is net of all future dated entries, I have to do some SQL processing to get the value for a current date.

SELECT  eb.employee_key, 
        eb.bank_key,
        '2023-01-01',
        eb.employee_balance - ISNULL(SUM(feb.balance_delta), 0)
FROM    employee a
INNER JOIN employee_balance eb on eb.employee_key = a.employee_key
LEFT OUTER JOIN wfms.employee_balance_trans ebt ON  ebt.balance_key = eb.balance_key
                                                AND ebt.employee_key = eb.employee_key
                                                AND ebt.trans_dt > '2023-01-01'
GROUP BY eb.employee_key, eb.balance_key, eb.employee_balance

Running this query using 2023-01-01 returns a bank value of 40 hours. Running the query on 2023-07-01 returns a value of 32 hours and so on. This query is fine for calculating a single employee balance. The problem starts when a manager of a department with 1000 employees wants to see a report showing the employee banks at the beginning and end of each month.

I created a new table as follows:

Employee_bank_history
(
   employee_key INT, --FK to employee
   bank_key INT, --FK to bank
   bank_date DATE,
   bank_balance NUMERIC (10,5) -- Contains the bank balance as of the bank date
)

The table has a unique clustered index consisting of employee_key, bank_key and bank_date. The table is also populated every evening with a date range from December 31 2021 to Current Date. The start date gets reset every year, so there will be a maximum of 730 days worth of data. This means that at the maximum date range of 730 days, there will be almost 2 billion rows. (75,000 employees X 32 banks X 730 days.)

Currently, I am loading 950 million rows, and the following INSERT statement takes 30-45 minutes.

DECLARE @StartDate  DATE = DATEFROMPARTS(DATEPART (YY, GETDATE())-2, 12, 31)   
DECLARE @EndDate    DATE = GETDATE()
;

WITH cte_bank_dates AS
(   
    SELECT  [date]
    FROM    dim_date
    WHERE   [date] BETWEEN @StartDate AND @EndDate
)

INSERT INTO employee_balance_history
SELECT  de.employee_key, 
        deb.balance_key,
        cte.[date],
        deb.bank_balance - ISNULL(SUM(feb.bank_delta), 0)
FROM    employee de
INNER JOIN employee_balance deb on deb.employee_key = de.employee_key
CROSS JOIN cte_bank_dates cte
LEFT OUTER JOIN employee_balance_trans feb  ON  feb.balance_key = deb.balance_key
                                                    AND feb.employee_key = deb.employee_key
                                                    AND feb.bank_date > cte.[date]
GROUP BY de.employee_key, deb.balance_key, cte.[date], deb.bank_balance
OPTION (MAXRECURSION 0)

I use the CTE to get only the dates in the correct range. I need to have each date in the range, so that I know which future dates to exclude from the aggregate option. The resulting query to get bank balances as of a given date is blazing fast.

Today, I had my hands slapped and was told that the CROSS JOIN to the CTE was not needed and to optimize the query because it was slowing everything else down when it runs.

Leaving aside the fact that it will run overnight once in production, I'm left to wonder if there's a better way to populate this table for every employee, every bank and every date. The number of rows is unavoidable, as is the calculation to strip out future dated transactions from the employee bank balance.

Does anyone have any idea how I might make this faster, and less resource intensive on the server?

Original Q&A

TechQA.

Optimize insert of 1B+ rows

There are 0 answers

Related Questions in SQL-SERVER

Related Questions in QUERY-OPTIMIZATION

Related Questions in COMMON-TABLE-EXPRESSION

Related Questions in LARGE-DATA-VOLUMES

Popular Questions

Trending Questions