Let's consider the current state of the student table:
| ID | First Name | Last Name | Marks |
|---|---|---|---|
| 1 | John | Doe | 85 |
| 2 | Alice | Smith | 90 |
| 3 | Bob | Johnson | 78 |
| 4 | Emily | Brown | 95 |
| 5 | Michael | Davis | 88 |
Suppose I receive a list of records like this:
| First Name | Last Name | Marks |
|---|---|---|
| Sarah | Lee | 92 |
| Alice | Smith | 85 |
| Emma | Harris | 80 |
| Michael | Davis | 90 |
| Kevin | White | 85 |
I want to insert these records into the table without updating existing ones. If a row fails because it is a duplicate, the failure should be reported in a separate "Status" column, so the response looks like this:
| First Name | Last Name | Marks | Status |
|---|---|---|---|
| Sarah | Lee | 92 | Success (New Record Inserted) |
| Alice | Smith | 85 | Error: Duplicate entry for Alice Smith |
| Emma | Harris | 80 | Success (New Record Inserted) |
| Michael | Davis | 90 | Error: Duplicate entry for Michael Davis |
| Kevin | White | 85 | Success (New Record Inserted) |
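Duplicates here are determined by first and last name (Alice Smith is rejected even though her marks differ), so the table has a unique key on the name columns. For reference, the single-row insert looks roughly like this; it is a simplified sketch assuming MySQL via github.com/go-sql-driver/mysql, a unique key on (first_name, last_name), and a Student struct with FirstName, LastName, and Marks fields (the exact driver, schema, and struct shape are assumptions, not my production code):

```go
import (
	"context"
	"database/sql"
	"errors"
	"fmt"

	"github.com/go-sql-driver/mysql"
)

// Assumed shape of a record; Marks as int is an assumption.
type Student struct {
	FirstName string
	LastName  string
	Marks     int
}

// insertStudent inserts one row and converts a MySQL duplicate-key error
// (1062, ER_DUP_ENTRY) into the message shown in the Status column.
// The caller prefixes "Error: ", producing e.g. "Error: Duplicate entry for Alice Smith".
func insertStudent(ctx context.Context, db *sql.DB, s Student) error {
	_, err := db.ExecContext(ctx,
		"INSERT INTO student (first_name, last_name, marks) VALUES (?, ?, ?)",
		s.FirstName, s.LastName, s.Marks)
	var mysqlErr *mysql.MySQLError
	if errors.As(err, &mysqlErr) && mysqlErr.Number == 1062 {
		return fmt.Errorf("Duplicate entry for %s %s", s.FirstName, s.LastName)
	}
	return err
}
```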
I want to perform these inserts in bulk, but a failed bulk insert does not tell me which row caused the failure. So when a bulk insert fails, I split the chunk into smaller chunks and retry; when a chunk succeeds, every record in it was inserted, and I record that in the response list. My code for this is as follows:
```go
func createStudentsAndUpdateStatus(ctx context.Context, db *sql.DB, students []Student, outputCsvData *[][]string, start int, end int) int {
	if end < start {
		return 0
	}
	// Small chunk: insert row by row so each record gets its own status.
	if end-start < 12 {
		updated := 0
		for i := start; i <= end; i++ {
			err := insertStudent(ctx, db, students[i])
			if err == nil {
				(*outputCsvData)[i] = append((*outputCsvData)[i], "Success (New Record Inserted)")
				updated++
			} else {
				(*outputCsvData)[i] = append((*outputCsvData)[i], fmt.Sprintf("Error: %s", err.Error()))
			}
		}
		return updated
	}
	// Try the whole chunk as a single bulk insert.
	err := insertStudentsBatch(ctx, db, students[start:end+1])
	if err == nil {
		for i := start; i <= end; i++ {
			(*outputCsvData)[i] = append((*outputCsvData)[i], "Success (New Record Inserted)")
		}
		return end - start + 1
	}
	// The bulk insert failed somewhere in this chunk: bisect and retry each half.
	m := start + (end-start)/2
	updatedLeft := createStudentsAndUpdateStatus(ctx, db, students, outputCsvData, start, m)
	updatedRight := createStudentsAndUpdateStatus(ctx, db, students, outputCsvData, m+1, end)
	return updatedLeft + updatedRight
}
```
In this code, chunks of 12 or fewer records (end-start < 12) are inserted one at a time; anything larger is attempted as a single bulk insert and, if that fails, split into two halves and retried recursively.
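insertStudentsBatch is not shown above; it builds one multi-row INSERT, roughly like the sketch below (again a simplification, with the same assumed MySQL schema and Student struct as before). Because all rows share one statement, a single duplicate fails the whole batch, which is exactly what triggers the bisection:

```go
import (
	"context"
	"database/sql"
	"strings"
)

// insertStudentsBatch issues one statement of the form
//   INSERT INTO student (first_name, last_name, marks) VALUES (?, ?, ?), (?, ?, ?), ...
// One duplicate key anywhere in the batch fails the entire statement.
func insertStudentsBatch(ctx context.Context, db *sql.DB, students []Student) error {
	if len(students) == 0 {
		return nil
	}
	placeholders := make([]string, 0, len(students))
	args := make([]interface{}, 0, 3*len(students))
	for _, s := range students {
		placeholders = append(placeholders, "(?, ?, ?)")
		args = append(args, s.FirstName, s.LastName, s.Marks)
	}
	query := "INSERT INTO student (first_name, last_name, marks) VALUES " +
		strings.Join(placeholders, ", ")
	_, err := db.ExecContext(ctx, query, args...)
	return err
}
```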
In real-life scenarios, the database has approximately 10^8 entries, and I typically receive a list of records of size 10^5.
I've experimented with different approaches to improve performance:
- Inserting records one by one took approximately 60 minutes.
- Bulk inserts in batches of 1000 records, then retrying the records of each failed batch one by one, reduced the time to 30 minutes.
- Currently, I've managed to reduce the insertion time to 15 minutes using the approach described above.
Despite the improvement, I'm still looking for ways to optimize the insertion process further. Are there any alternative methods or techniques that could potentially reduce the insertion time even more?
I appreciate any suggestions or insights you can provide. Thank you!