I work in an organization that has over 75,000 employees. In our payroll system, each employee has 32 unique banks which store things like Sick Time, Vacation Time, Banked Overtime, etc.
Here are the existing tables
Employee
(
Employee_key INT IDENTITY(1,1)
Lastname,
Firstname
)
Employee Key | Lastname | Firstname
-----------------------------------
100 | Smith | John
Bank
(
Bank_key INT IDENTITY(1,1),
Bank_name VARCHAR(50)
)
Bank_key | Bank_name
---------------------
100 | VACATION
Employee_balance
(
Employee_key INT, --FK to Employee
Bank_key INT, --FK to Bank
Bank_balance NUMERIC(10,5) -- Aggregate value of bank including future dated entries
)
Employee_key | Bank_Key | Bank_Balance
--------------------------------------
100 | 100 | 0
Employee_balance_trans
(
Employee_key INT, --FK to Employee
Bank_key INT, --FK to Bank
Trans_dt DATE -- transaction date that affects the bank
Bank_delta NUMERIC(10,5)
)
Employee_key | Bank_key | Trans_dt | Balance_delta
--------------------------------------------------
100 | 100 | 20230701 | -8.0
100 | 100 | 20230801 | -8.0
100 | 100 | 20230901 | -8.0
100 | 100 | 20231001 | -8.0
100 | 100 | 20231101 | -8.0
This employee has 5 vacation days booked into the future, for a total of 40 hours. As of January 1, the employee had 40 hours in their vacation bank, but because the employee_balance table is net of all future dated entries, I have to do some SQL processing to get the value for a current date.
SELECT eb.employee_key,
eb.bank_key,
'2023-01-01',
eb.employee_balance - ISNULL(SUM(feb.balance_delta), 0)
FROM employee a
INNER JOIN employee_balance eb on eb.employee_key = a.employee_key
LEFT OUTER JOIN wfms.employee_balance_trans ebt ON ebt.balance_key = eb.balance_key
AND ebt.employee_key = eb.employee_key
AND ebt.trans_dt > '2023-01-01'
GROUP BY eb.employee_key, eb.balance_key, eb.employee_balance
Running this query using 2023-01-01 returns a bank value of 40 hours. Running the query on 2023-07-01 returns a value of 32 hours and so on. This query is fine for calculating a single employee balance. The problem starts when a manager of a department with 1000 employees wants to see a report showing the employee banks at the beginning and end of each month.
I created a new table as follows:
Employee_bank_history
(
employee_key INT, --FK to employee
bank_key INT, --FK to bank
bank_date DATE,
bank_balance NUMERIC (10,5) -- Contains the bank balance as of the bank date
)
The table has a unique clustered index consisting of employee_key, bank_key and bank_date. The table is also populated every evening with a date range from December 31 2021 to Current Date. The start date gets reset every year, so there will be a maximum of 730 days worth of data. This means that at the maximum date range of 730 days, there will be almost 2 billion rows. (75,000 employees X 32 banks X 730 days.)
Currently, I am loading 950 million rows, and the following INSERT statement takes 30-45 minutes.
DECLARE @StartDate DATE = DATEFROMPARTS(DATEPART (YY, GETDATE())-2, 12, 31)
DECLARE @EndDate DATE = GETDATE()
;
WITH cte_bank_dates AS
(
SELECT [date]
FROM dim_date
WHERE [date] BETWEEN @StartDate AND @EndDate
)
INSERT INTO employee_balance_history
SELECT de.employee_key,
deb.balance_key,
cte.[date],
deb.bank_balance - ISNULL(SUM(feb.bank_delta), 0)
FROM employee de
INNER JOIN employee_balance deb on deb.employee_key = de.employee_key
CROSS JOIN cte_bank_dates cte
LEFT OUTER JOIN employee_balance_trans feb ON feb.balance_key = deb.balance_key
AND feb.employee_key = deb.employee_key
AND feb.bank_date > cte.[date]
GROUP BY de.employee_key, deb.balance_key, cte.[date], deb.bank_balance
OPTION (MAXRECURSION 0)
I use the CTE to get only the dates in the correct range. I need to have each date in the range, so that I know which future dates to exclude from the aggregate option. The resulting query to get bank balances as of a given date is blazing fast.
Today, I had my hands slapped and was told that the CROSS JOIN to the CTE was not needed and to optimize the query because it was slowing everything else down when it runs.
Leaving aside the fact that it will run overnight once in production, I'm left to wonder if there's a better way to populate this table for every employee, every bank and every date. The number of rows is unavoidable, as is the calculation to strip out future dated transactions from the employee bank balance.
Does anyone have any idea how I might make this faster, and less resource intensive on the server?