Table Device_Status
| Id | Status | Timestamp |
|---|---|---|
| 1 | Active | 2023-01-13T18:00:01.0187528 |
| 2 | Active | 2023-01-13T18:00:01.0187529 |
| 1 | Failed | 2023-01-13T18:00:01.0187530 |
| 3 | Active | 2023-01-13T18:00:01.0187531 |
| 1 | Failed | 2023-01-13T18:00:01.0187532 |
| 1 | Active | 2023-01-13T18:00:01.0187533 |
| 3 | Active | 2023-01-13T18:00:01.0187534 |
| 1 | Failed | 2023-01-13T18:00:01.0187535 |
| 4 | Failed | 2023-01-13T18:00:01.0187536 |
| 1 | Active | 2023-01-13T18:00:01.0187537 |
Expected output (I need a SQL query that generates this):
| Id | Fail_Begin | Fail_End |
|---|---|---|
| 1 | 2023-01-13T18:00:01.0187530 | 2023-01-13T18:00:01.0187532 |
| 1 | 2023-01-13T18:00:01.0187535 | 2023-01-13T18:00:01.0187535 |
| 4 | 2023-01-13T18:00:01.0187536 | 2023-01-13T18:00:01.0187536 |
Basically, for each Id I want MIN(Timestamp) and MAX(Timestamp), but taken over runs of consecutive records for that Id. If a run has only one record, then min = max, as with the second and third rows of the sample result set.
I have tried this (and various subquery variants thereof):

```sql
SELECT Id, MIN(Timestamp) AS Fail_Begin, MAX(Timestamp) AS Fail_End
FROM Device_Status
GROUP BY Id
```
but I need to group only over consecutively occurring records. Perhaps adding a notion of a status change would help: start with zero as the initial change code and increment it by 1 every time a record's status differs from the previous record for the same Id, generating an intermediate result like this:
Table Device_Status_With_Change_Column
| Id | Status | Change | Timestamp |
|---|---|---|---|
| 1 | Active | 0 | 2023-01-13T18:00:01.0187528 |
| 2 | Active | 0 | 2023-01-13T18:00:01.0187529 |
| 1 | Failed | 1 | 2023-01-13T18:00:01.0187530 |
| 3 | Active | 0 | 2023-01-13T18:00:01.0187531 |
| 1 | Failed | 1 | 2023-01-13T18:00:01.0187532 |
| 1 | Active | 2 | 2023-01-13T18:00:01.0187533 |
| 3 | Active | 0 | 2023-01-13T18:00:01.0187534 |
| 1 | Failed | 3 | 2023-01-13T18:00:01.0187535 |
| 4 | Failed | 0 | 2023-01-13T18:00:01.0187536 |
| 1 | Active | 4 | 2023-01-13T18:00:01.0187537 |
And then running:

```sql
SELECT Id, MIN(Timestamp) AS Fail_Begin, MAX(Timestamp) AS Fail_End
FROM Device_Status_With_Change_Column
WHERE Status = 'Failed'  -- keep only the failure runs
GROUP BY Id, Change
```
Other than looping over the result set in a programming language, I do not yet see a direct SQL statement that would do this in one fell swoop, without the intermediate table, and I do not see how to compute the Change column in SQL.
The solution by user @Ajax1234 generates the expected result for the original sample data. However, when stress-tested with several tandem 'Failed' records, it generates pairs of rows for those entries instead of exactly one record per run, as required.
The DDL/schema and initial sample data, tested on PostgreSQL 16, with the timestamps changed/trimmed so they parse (PostgreSQL keeps at most six fractional-second digits; the samples above have seven) and several tandem 'Failed' records added:
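A minimal sketch of what that looks like (the fiddle has the exact version; the trimmed timestamps and the placement of the extra tandem 'Failed' rows for Id 4 are illustrative assumptions):

```sql
-- Assumed minimal schema; the fiddle has the exact version.
CREATE TABLE Device_Status (
    Id        integer     NOT NULL,
    Status    varchar(10) NOT NULL,
    Timestamp timestamp   NOT NULL
);

-- Sample data from the question, timestamps trimmed to microseconds;
-- the last two rows are the added tandem 'Failed' records.
INSERT INTO Device_Status (Id, Status, Timestamp) VALUES
    (1, 'Active', '2023-01-13 18:00:01.018752'),
    (2, 'Active', '2023-01-13 18:00:01.018753'),
    (1, 'Failed', '2023-01-13 18:00:01.018754'),
    (3, 'Active', '2023-01-13 18:00:01.018755'),
    (1, 'Failed', '2023-01-13 18:00:01.018756'),
    (1, 'Active', '2023-01-13 18:00:01.018757'),
    (3, 'Active', '2023-01-13 18:00:01.018758'),
    (1, 'Failed', '2023-01-13 18:00:01.018759'),
    (4, 'Failed', '2023-01-13 18:00:01.018760'),
    (1, 'Active', '2023-01-13 18:00:01.018761'),
    (4, 'Failed', '2023-01-13 18:00:01.018762'),
    (4, 'Failed', '2023-01-13 18:00:01.018763');
```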
Result (fiddle):
The innermost CTE, `lag`, uses the SQL LAG window function to carry each row's immediately preceding Id and Status (per device, ordered by timestamp; for Id the default is zero). This is projected into `flag`, which records a 1 whenever the status for the same Id flips from 'Active' to 'Failed' or vice versa. The first row of any Id gets an unconditional zero, as does any row whose status matches that device's previous log entry. To build the CTE `Device_Status_With_Change_Column`, the change flags are accumulated with a cumulative SUM, which numbers the status-change runs for each device Id when sorted by timestamp. The zeros ensure that SUM assigns the same integer to every record in a run of consecutive 'Failed' (or 'Active') rows. In other words, to get the 'Active' image of the result set we just need to change the WHERE clause; any number of further statuses, say 'Paused' or 'Offline', can be handled just by adding the corresponding word to the DDL and query. The full CTE `Device_Status_With_Change_Column` (fiddle):
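In outline, reconstructed from the description above (the fiddle has the exact, tested query):

```sql
WITH lag AS (
    -- carry each row's previous Id and Status per device, ordered by time
    SELECT Id,
           Status,
           Timestamp,
           LAG(Id, 1, 0) OVER (PARTITION BY Id ORDER BY Timestamp) AS Prev_Id,
           LAG(Status)   OVER (PARTITION BY Id ORDER BY Timestamp) AS Prev_Status
    FROM Device_Status
),
flag AS (
    -- 1 exactly when this device's status flipped relative to its previous entry
    SELECT Id,
           Status,
           Timestamp,
           CASE WHEN Prev_Id = Id AND Prev_Status <> Status
                THEN 1 ELSE 0 END AS Change_Flag
    FROM lag
),
Device_Status_With_Change_Column AS (
    -- the cumulative sum numbers each run of identical statuses per device
    SELECT Id,
           Status,
           Timestamp,
           SUM(Change_Flag) OVER (PARTITION BY Id ORDER BY Timestamp) AS Change
    FROM flag
)
SELECT Id,
       MIN(Timestamp) AS Fail_Begin,
       MAX(Timestamp) AS Fail_End
FROM Device_Status_With_Change_Column
WHERE Status = 'Failed'
GROUP BY Id, Change
ORDER BY Fail_Begin;
```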
I tested the join-based query by @Ajax1234 on a database with 13 million records, generating a result set of approximately 20,000 rows, and it is faster than my CTE-based solution (24 seconds versus 42 seconds).
SQL is far more powerful than looping! Look into CTEs using LAG, SUM, and the CASE expression. Join-based operations on products of relations can be more performant, if a bit harder to read and understand, and therefore need testing.
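For comparison, here is one join-style, window-function-free sketch of the same islands idea (illustrative only, not necessarily the query by @Ajax1234): key each 'Failed' row by the number of earlier non-'Failed' rows for the same Id, a count that stays constant within a run of consecutive failures.

```sql
SELECT Id,
       MIN(Timestamp) AS Fail_Begin,
       MAX(Timestamp) AS Fail_End
FROM (
    -- Island = how many non-'Failed' rows for this Id precede this row;
    -- it is identical for all members of one run of consecutive failures
    SELECT d.Id,
           d.Timestamp,
           (SELECT COUNT(*)
            FROM Device_Status x
            WHERE x.Id = d.Id
              AND x.Status <> 'Failed'
              AND x.Timestamp < d.Timestamp) AS Island
    FROM Device_Status d
    WHERE d.Status = 'Failed'
) f
GROUP BY Id, Island
ORDER BY Id, Fail_Begin;
```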