I am using MariaDB sql. I am creating a statistical functionality in my website that needs to fetch the historical data for the user aggregated on each day basis. For this I have created a history table, that contains rows for only when a change occurred. And I also have my "main" table, which contains the current data as is today.
So I need my query to retrieve the status id for each user_id and organisation_id combination from the last row if this value was not null.
To illustrate, example this data, this is considered the "current" data (table name = organisation_user_link):
| id | user_id | organisation_id | status_id | stopped_reason_id | dossier_created |
|---|---|---|---|---|---|
| 1 | 3 | 73 | 2 | NULL | 2021-10-29 07:50:21 |
| 2 | 9 | 1199 | 4 | 5 | 2021-05-19 17:44:07 |
And next to this I have my history data, which looks very similar (table name = organisation_user_link_status_history):
| timestamp | user_id | organisation_id | status_id | stopped_reason_id |
|---|---|---|---|---|
| 2024-03-11 12:05:30 | 3 | 73 | 1 | NULL |
| 2024-03-08 11:15:35 | 3 | 73 | 3 | NULL |
| 2024-03-05 13:25:40 | 3 | 73 | 4 | 3 |
| 2024-03-13 02:07:10 | 9 | 1199 | 1 | NULL |
| 2024-03-11 02:07:10 | 9 | 1199 | 2 | NULL |
I want my result to include every day starting from today until a certain date. Where each day has the value from the previous row, in case there was no value for that day. The values are ordered DESC, so the "current" data is always first, since this is the data as on date TODAY.
This is the result I want to become:
| date | user_id | organisation_id | status_id | stopped_reason_id | dossier_created |
|---|---|---|---|---|---|
| 2024-03-14 | 3 | 73 | 2 | NULL | 2021-10-29 |
| 2024-03-14 | 9 | 1199 | 4 | 5 | 2021-05-19 |
| 2024-03-13 | 3 | 73 | 2 | NULL | 2021-10-29 |
| 2024-03-13 | 9 | 1199 | 1 | NULL | 2021-05-19 |
| 2024-03-12 | 3 | 73 | 2 | NULL | 2021-10-29 |
| 2024-03-12 | 9 | 1199 | 1 | NULL | 2021-05-19 |
| 2024-03-11 | 3 | 73 | 1 | NULL | 2021-10-29 |
| 2024-03-11 | 9 | 1199 | 2 | NULL | 2021-05-19 |
| 2024-03-10 | 3 | 73 | 1 | NULL | 2021-10-29 |
| 2024-03-10 | 9 | 1199 | 2 | NULL | 2021-05-19 |
| 2024-03-09 | 3 | 73 | 1 | NULL | 2021-10-29 |
| 2024-03-09 | 9 | 1199 | 2 | NULL | 2021-05-19 |
| 2024-03-08 | 3 | 73 | 3 | NULL | 2021-10-29 |
| 2024-03-08 | 9 | 1199 | 2 | NULL | 2021-05-19 |
| 2024-03-07 | 3 | 73 | 3 | NULL | 2021-10-29 |
| 2024-03-07 | 9 | 1199 | 2 | NULL | 2021-05-19 |
| 2024-03-06 | 3 | 73 | 3 | NULL | 2021-10-29 |
| 2024-03-06 | 9 | 1199 | 2 | NULL | 2021-05-19 |
| 2024-03-05 | 3 | 73 | 4 | 3 | 2021-10-29 |
| 2024-03-05 | 9 | 1199 | 2 | NULL | 2021-05-19 |
| 2024-03-04 | 3 | 73 | 4 | 3 | 2021-10-29 |
| 2024-03-04 | 9 | 1199 | 2 | NULL | 2021-05-19 |
| 2024-03-03 | 3 | 73 | 4 | 3 | 2021-10-29 |
| 2024-03-03 | 9 | 1199 | 2 | NULL | 2021-05-19 |
| 2024-03-02 | 3 | 73 | 4 | 3 | 2021-10-29 |
| 2024-03-02 | 9 | 1199 | 2 | NULL | 2021-05-19 |
| 2024-03-01 | 3 | 73 | 4 | 3 | 2021-10-29 |
| 2024-03-01 | 9 | 1199 | 2 | NULL | 2021-05-19 |
This is the query I have right now:
WITH RECURSIVE dates (
DATE
) AS (
-- SELECT MIN(DATE(created))
-- FROM organisation
SELECT DATE('2024-03-01')
UNION ALL
SELECT DATE(date) + INTERVAL 1 DAY
FROM dates
WHERE DATE(DATE) < (NOW() - INTERVAL 1 DAY)
),
current_history_data_query AS (
SELECT
current_history_data.*
FROM (
SELECT
DATE(timestamp) AS date,
user_id,
organisation_id,
status_id,
stopped_reason_id,
dossier_created,
'history-data' AS src
FROM (
SELECT
oulsh.user_id,
oulsh.organisation_id,
oulsh.timestamp,
oulsh.status_id,
oulsh.stopped_reason_id,
oul.dossier_created,
ROW_NUMBER() OVER (PARTITION BY oulsh.user_id, oulsh.organisation_id, DATE(oulsh.timestamp) ORDER BY oulsh.timestamp DESC) AS row_num
FROM organisation_user_link_status_history AS oulsh
INNER JOIN organisation_user_link AS oul ON oulsh.user_id = oul.user_id AND oulsh.organisation_id = oul.organisation_id
) AS numbered_rows
WHERE row_num = 1 AND DATE(timestamp) != DATE(NOW())
UNION ALL
SELECT DATE(NOW()) AS date, oul.user_id, oul.organisation_id, oul.status_id, oul.stopped_reason_id, oul.dossier_created, 'current-data' AS src
FROM organisation_user_link AS oul
) AS current_history_data
ORDER BY DATE DESC
)
SELECT
dates.date AS dates_date,
COALESCE(user_id, LAG(user_id) OVER (ORDER BY dates_date DESC)) AS user_id,
COALESCE(organisation_id, LAG(organisation_id) OVER (ORDER BY dates_date DESC)) AS organisation_id,
COALESCE(status_id, LAG(status_id) OVER (ORDER BY dates_date DESC)) AS status_id,
COALESCE(stopped_reason_id, LAG(stopped_reason_id) OVER (ORDER BY dates_date DESC)) AS stopped_reason_id,
COALESCE(dossier_created, LAG(dossier_created) OVER (ORDER BY dates_date DESC)) AS dossier_created
FROM dates
LEFT JOIN current_history_data_query AS chdq ON dates.date = chdq.date
GROUP BY DATE(dates.date)
ORDER BY dates.date DESC;
Using this query, there are multiple problems:
- For the first row after an original non-null value row: the data is correctly being copied. However, afterwards, the third row for example will still be NULL. Even though the second row was just filled by the GAP() window function.
- The code as provided above does not take into account that I also need it to be partitioned by user_id and organisation_id. If i add in the LAG() function : PARTITION BY user_id, organisation_id . Then the entire LAG function is not working anymore, and not even my 2nd row is getting the data filled in.
What am I missing here, how can I solve this query ?
As MariaDB does not support LATERAL or CROSS APPLY, you could use three separate correlated subqueries in the select list:
Here's a db<>fiddle.
There is probably a better solution but one way you could achieve your desired output is to use a join to a lateral derived table. The modified final query would be:
Here's a db<>fiddle.