How can I track and display continuous changes made to an HDF5 file with Python?


I have this function that appends a new element to a dataset named 'array' in an HDF5 file every second.

import h5py
import numpy as np
from time import time, sleep

i = 100

def update_array():

    hf = h5py.File('task1.h5', 'r+')
    old_rec = np.array(hf.get('array'))
    global i
    i = i+1
    new_rec = np.append(old_rec, i)

    # delete the old record and replace it with the updated record
    del hf['array']
    new_data = hf.create_dataset('array', data = new_rec)
    print(new_rec)
    
    hf.close()

while True:
    sleep(1 - time() % 1)
    update_array()

The output of the print line (it shows the updated array in memory, but we do not know whether it is actually being saved in the file):

[101.]
[101. 102.]
[101. 102. 103.]
[101. 102. 103. 104.]
[101. 102. 103. 104. 105.]
[101. 102. 103. 104. 105. 106.]
[101. 102. 103. 104. 105. 106. 107.]
[101. 102. 103. 104. 105. 106. 107. 108.]

I want to have a separate notebook that can track the changes made by the above function and display the updated contents of this dataset in the HDF5 file.

I want a separate function for this task because I want to make sure that the updated content gets saved in the HDF5 file, and to perform further on-the-fly operations on the values as they keep arriving.

Answer by kcw78 (best answer):

Here is a potential solution that attaches attributes to the 'array' dataset. Adding attributes to an HDF5 data object is easy with .attrs. It has a dictionary-like syntax: h5obj.attrs[attr_name] = attr_value. Attribute values can be ints, strings, floats, and arrays. You can add 2 attributes to your dataset with the following 2 lines:

hf['array'].attrs['Last Value'] = i
hf['array'].attrs['Time Added'] = ctime(time())
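
A second notebook could then poll these attributes to check that new data has actually reached the file. The following is only a minimal sketch and not part of the original code: the names read_snapshot and watch, the polling interval, and the choice to treat an OSError as "try again later" are all assumptions. It opens the file read-only on every poll, which relies on the writer closing the file between updates.

```python
import time
import h5py

def read_snapshot(path="task1.h5", name="array"):
    """Return (values, last_value, time_added), or None if nothing can be read yet."""
    try:
        with h5py.File(path, "r") as hf:
            dset = hf[name]
            return dset[:], dset.attrs.get("Last Value"), dset.attrs.get("Time Added")
    except (OSError, KeyError):
        # OSError: file missing or currently unavailable (e.g. held by the writer).
        # KeyError: the dataset has not been created yet.
        return None

def watch(path="task1.h5", name="array", interval=0.5, max_polls=None, on_change=print):
    """Poll the file and call on_change(snapshot) whenever the dataset length changes."""
    last_len = None
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        snap = read_snapshot(path, name)
        if snap is not None and len(snap[0]) != last_len:
            last_len = len(snap[0])
            on_change(snap)
        time.sleep(interval)
```

Calling watch() in the second notebook prints a (values, last value, time added) tuple each time the array grows. Passing max_polls bounds the loop, and on_change can be replaced with any function that processes the new values.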

To demonstrate, I added the two attribute lines to your code, along with several other modifications to address the following issues:

  1. Correct the errors noted in my comments. I added create_array() to initially create the file and dataset, and I created it as a resizable dataset to simplify the logic in update_array().
  2. I modified the update_array() code to enlarge the dataset and append the new value. This is much cleaner (and faster) than your 4-step process.
  3. I used Python's with / as: context manager to open the file. This eliminates the need to close it, and (more importantly) ensures it is closed cleanly if the program exits abnormally.
  4. I removed NumPy functions. There is no need to create an array if you are adding 1 scalar each time.
  5. My print statement shows the preferred method to create a NumPy array from a dataset. Use hf['array'][:] instead of np.array(hf.get('array')).
  6. I prefer to open files once (unless there is a compelling reason to open & close). That eliminates file setup/teardown overhead. I did not do this here. If you want to, move the with / as: lines into the main code and pass the resulting hf object to the create_array() and update_array() functions. If you do that, you can easily consolidate the 2 functions. (You will need logic to test whether the 'array' dataset exists.)

Code below:

import h5py
from time import time, sleep, ctime

def create_array():

    with h5py.File('task1.h5', 'w') as hf:
        global i 

        #create dataset and add new record
        new_data = hf.create_dataset('array', shape=(1,), maxshape=(None,),
                                      data = [i])
        # add attributes
        hf['array'].attrs['Last Value'] = i
        hf['array'].attrs['Time Added'] = ctime(time())

        print(hf['array'][:])

def update_array():

    with h5py.File('task1.h5', 'r+') as hf:
        global i 
        i += 1
      
        #resize dataset and add new record
        a0 = hf['array'].shape[0]
        hf['array'].resize(a0+1,axis=0)
        hf['array'][a0] = i
        
        # add attributes
        hf['array'].attrs['Last Value'] = i
        hf['array'].attrs['Time Added'] = ctime(time())
        
        print(hf['array'][:])
    
i = 100
create_array()

while i < 110:
    sleep(1 - time() % 1)
    update_array()

print('Done')
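
Item 6 above (opening the file once and consolidating the two functions) could look roughly like this. It is a sketch under assumptions, not the code I ran: append_value and run are hypothetical names, the dataset is created empty with an int64 dtype, and mode 'a' is used so that an existing file is appended to rather than overwritten. Keeping the file open in the writer may prevent a second process from opening it, depending on the HDF5 file-locking configuration, so this version trades reader access for lower overhead.

```python
import h5py
from time import ctime, sleep, time

def append_value(hf, value, name="array"):
    """Append one value to a resizable 1-D dataset, creating it if it is missing."""
    if name not in hf:
        # Start empty; maxshape=(None,) makes the first axis resizable.
        dset = hf.create_dataset(name, shape=(0,), maxshape=(None,), dtype="i8")
    else:
        dset = hf[name]
    n = dset.shape[0]
    dset.resize(n + 1, axis=0)
    dset[n] = value
    dset.attrs["Last Value"] = value
    dset.attrs["Time Added"] = ctime(time())
    hf.flush()  # ask HDF5 to write buffered data to disk
    return dset

def run(path="task1.h5", start=100, stop=110, pause=1.0):
    """Open the file once and append start..stop (inclusive), one value per pause."""
    with h5py.File(path, "a") as hf:
        for value in range(start, stop + 1):
            dset = append_value(hf, value)
            print(dset[:])
            sleep(pause)
```

Calling run() reproduces the loop above with a single open/close of the file.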