I've been stuck on a problem for a couple of days, and I'd appreciate any help with it.
I've trained a 1D CNN with TensorFlow for time-series classification, then quantized the network with TFLite in INT8 mode, and it works fine. For research purposes, I need to reimplement the exact quantized inference flow in Python.
So I started reading the TFLite source code and the main quantization paper, and watched the MIT HAN Lab's EfficientML course, which covers the quantized inference procedure in detail.
I started by extracting the scale and zero point for each layer, and I successfully quantized the weights and biases (they match the TFLite quantized weights and biases exactly). However, when I run my quantized inference code on a time series, the activations of my convolution layer differ slightly from the TFLite Conv layer activations (MAE = 1.21, MSE = 4.91). This error propagates through the layers, and the final result is completely wrong!
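For reference, the weight quantization I'm matching is the standard affine scheme from the paper, q = clamp(round(x / S) + Z, qmin, qmax). A minimal sketch of what I do (helper names are mine; note that np.round rounds half to even):

```python
import numpy as np

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Affine quantization: q = clamp(round(x / S) + Z, qmin, qmax)."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Inverse mapping: x ~= S * (q - Z)."""
    return scale * (q.astype(np.int32) - zero_point)

# Toy per-channel weights; TFLite weights are quantized symmetrically (Z = 0)
w = np.array([0.05, -0.12, 0.3], dtype=np.float32)
qw = quantize(w, scale=0.01, zero_point=0)
```

Comparing `qw` (and `dequantize(qw, ...)`) against the tensors extracted from the TFLite model is how I verified the weights and biases match.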
Here is the code I implemented:
Sy = all_tensors_details[10]["quantization_parameters"]["scales"][0]  # output scale of the Conv layer
Zy = np.int8(all_tensors_details[10]["quantization_parameters"]["zero_points"][0])  # output zero point
Qy = []
mmin, mmax = getQuantizedRange(QUANTIZATION_BITWIDTH)  # e.g. (-128, 127) for INT8

for i in range(FILTER_SIZE):
    kernel = np.int32(np.array(list(quantizedCnnKernels[i].values())))
    # Fold the input zero point into the bias: b_q - Zx * sum(w_q)
    Qbias = quantizedCnnBias[i] - np.int32(kernel.sum()) * inputZeroPoint
    # Integer convolution of the quantized input with the quantized kernel
    conv = np.convolve(np.int32(inputSequence), kernel, mode='same')
    summ = conv + Qbias
    # Per-channel requantization multiplier: M = (Sw * Sx) / Sy
    multiplier = (np.float32(cnnWeightScale[i]) * np.float32(inputScale)) / Sy
    term = multiplier * np.float32(summ)
    termAdded = addZy(term, Zy, mmin, mmax)  # shift by the output zero point
    Qy.append(np.int8(np.clip(termAdded, mmin, mmax)))
Qy holds the convolution result. I'd deeply appreciate it if anyone can point out what I'm missing.
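One detail I'm not sure about: np.convolve reverses the kernel, while a CNN "convolution" layer actually computes cross-correlation, so the two disagree for any non-symmetric kernel. A tiny check (array values are made up):

```python
import numpy as np

x = np.array([1, 2, 3, 4], dtype=np.int32)
k = np.array([1, 0, -1], dtype=np.int32)  # non-symmetric kernel

conv = np.convolve(x, k, mode='same')        # flips the kernel before sliding
xcorr = np.convolve(x, k[::-1], mode='same')  # cross-correlation (what Conv layers compute)

print(conv)   # [ 2  2  2 -3]
print(xcorr)  # [-2 -2 -2  3]
```

Could this, or the alignment that `mode='same'` picks for even/odd kernel lengths, account for the mismatch?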
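Edit: one more thing I checked, in case it matters. `np.int8(...)` truncates toward zero, whereas, as far as I can tell, TFLite's requantization rounds to nearest, so my final cast may already lose up to 1 LSB per activation (values are made up):

```python
import numpy as np

vals = np.array([1.7, -1.7, 2.5], dtype=np.float32)

truncated = vals.astype(np.int8)           # C-style cast: truncates toward zero
rounded = np.round(vals).astype(np.int8)   # round to nearest (ties to even)

print(truncated)  # [ 1 -1  2]
print(rounded)    # [ 2 -2  2]
```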