I'd like to manipulate a set of data in an HDF5 file and be able to decide, just before closing the file, whether to discard all changes or keep them.
From the documentation on file drivers:
HDF5 ships with a variety of different low-level drivers, which map the logical HDF5 address space to different storage mechanisms. You can specify which driver you want to use when the file is opened:
f = h5py.File('myfile.hdf5', driver=<driver name>, <driver_kwds>)
For example, the HDF5 “core” driver can be used to create a purely in-memory HDF5 file, optionally written out to disk when it is closed. Here’s a list of supported drivers and their options:
‘core’:
Store and manipulate the data in memory, and optionally write it back out when the file is closed. Using this with an existing file and a reading mode will read the entire file into memory. Keywords:
backing_store:
If True (default), save changes to the real file at the specified path on close() or flush(). If False, any changes are discarded when the file is closed.
With the 'core' driver and backing_store=False, changes are always discarded, regardless of whether I call flush() (as expected). With the default driver, changes are always persisted to the file on close.
Based on the above, I've created a very simple example:
from h5py import File
# Create a dummy file from scratch
f = File('test.h5', 'w')
f.create_dataset("test_dataset", data=[1, 2, 3])
f.close()
# Open and modify the data
f = File('test.h5', 'r+') # In this case changes are always persisted
# f = File('test.h5', 'r+', driver='core', backing_store=False) # In this case changes are always discarded
ds = f["test_dataset"]
ds[...] = [3, 4, 5]
# f.flush() # Makes no difference here
f.close() # With the 'core' driver, changes are discarded here
# Read back `test_dataset`
f = File('test.h5', 'r')
print(f['test_dataset'][...])
f.close()
Is there a way to decide just before closing the file whether to save changes or not?
EDIT 1: The PyTables undo mechanism seems to work ONLY with newly created datasets, NOT with edits of pre-existing ones:
import tables as t
import numpy as np
# Create the file
with t.open_file(r'test.h5', 'w') as fr:
fr.create_carray('/', 'TestArray', obj=np.array([1, 2, 3], dtype='uint8'))
with t.open_file('test.h5', 'r+') as fr:
# This will remove any previously created marks
if fr.is_undo_enabled():
fr.disable_undo()
fr.enable_undo() # Re-enable undo
fr.mark('MyMark')
# Create new array from scratch, and it will be discarded
new_arr = fr.create_carray('/', 'NewCreatedArray', obj=np.array([10, 11, 12]))
# Modify a pre-existing array! --> THIS WILL NOT BE DISCARDED
arr = fr.root.TestArray
arr[...] = np.array([3, 4, 5])
# Move back to when I opened the file
fr.undo('MyMark')
with t.open_file('test.h5', 'r+') as fr:
print(fr)
print('Test Array: ', fr.root.TestArray[:])
Result is:
test.h5 (File) ''
Last modif.: '2023-05-18T07:26:13+00:00'
Object Tree:
/ (RootGroup) ''
/TestArray (CArray(3,)) ''
Test Array: [3 4 5]
I'm not convinced that this is the best approach, as it will drive up your memory requirements. Anyway, one thing you can do is work on a BytesIO buffer. This mimics the "core" driver but gives you full control over when to write the buffer back to the file.
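A minimal sketch of the BytesIO approach (assuming h5py >= 2.9, which can open Python file-like objects directly; the `save_changes` flag is just an illustrative name):

```python
import io

import h5py

# Create a dummy file on disk, as in the question
with h5py.File('test.h5', 'w') as f:
    f.create_dataset('test_dataset', data=[1, 2, 3])

# Read the whole file into an in-memory buffer
with open('test.h5', 'rb') as src:
    buf = io.BytesIO(src.read())

# Open and modify the in-memory copy; the file on disk is untouched
with h5py.File(buf, 'r+') as f:
    f['test_dataset'][...] = [3, 4, 5]

# Decide *now* whether to persist the changes
save_changes = True
if save_changes:
    with open('test.h5', 'wb') as dst:
        dst.write(buf.getbuffer())

# The file on disk reflects the decision
with h5py.File('test.h5', 'r') as f:
    print(f['test_dataset'][...])  # [3 4 5] if saved, [1 2 3] otherwise
```

If you never write the buffer back, the on-disk file is guaranteed to stay as it was, no matter what you did to the in-memory copy.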