I am in the process of porting MFCC algorithms from Python libraries to C. So far I have fully ported the mfcc function from python_speech_features, and have moved on to librosa.
However, when I reached the framing step, I found the implementation hard to understand: why did the developers write it the way it currently is? It looks wrong to me.
The problem I see is that librosa ignores the custom win_length value passed in the function call when it frames the signal; the framing relies solely on the n_fft value.
Here is librosa's implementation:
def stft(
    y: np.ndarray,
    *,
    n_fft: int = 2048,
    hop_length: Optional[int] = None,
    win_length: Optional[int] = None,
    window: _WindowSpec = "hann",
    center: bool = True,
    dtype: Optional[DTypeLike] = None,
    pad_mode: _PadModeSTFT = "constant",
    out: Optional[np.ndarray] = None,
) -> np.ndarray:
    # By default, use the entire frame
    if win_length is None:
        win_length = n_fft
    # Set the default hop, if it's not already specified
    if hop_length is None:
        hop_length = int(win_length // 4)
    elif not util.is_positive_int(hop_length):
        raise ParameterError(f"hop_length={hop_length} must be a positive integer")
    # Check audio is valid
    util.valid_audio(y, mono=False)
    fft_window = get_window(window, win_length, fftbins=True)
    # Pad the window out to n_fft size
    fft_window = util.pad_center(fft_window, size=n_fft)
    # Reshape so that the window can be broadcast
    fft_window = util.expand_to(fft_window, ndim=1 + y.ndim, axes=-2)
    # ... (skipping middle part, assuming center = False)
    if n_fft > y.shape[-1]:
        raise ParameterError(
            f"n_fft={n_fft} is too large for uncentered analysis of input signal of length={y.shape[-1]}"
        )
    # "Middle" of the signal starts at sample 0
    start = 0
    # We have no extra frames
    extra = 0
    fft = get_fftlib()
    if dtype is None:
        dtype = util.dtype_r2c(y.dtype)
    # Window the time series.
    y_frames = util.frame(y[..., start:], frame_length=n_fft, hop_length=hop_length)
As far as I can see, nothing else modifies the y_frames data, so this is the framing librosa actually uses.
However, with win_len = 0.025 s and hop_len = 0.010 s (which seem to be the defaults in many cases), only the line fft_window = get_window(window, win_length, fftbins=True) uses the win_length value, and it is not used anywhere afterwards. (Assume win_len has already been multiplied by the sample rate, since librosa expects lengths in samples.)
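To make the numbers concrete, here is a minimal sketch of the seconds-to-samples conversion and of the window padding done by the get_window / pad_center lines above. It uses plain Python lists rather than NumPy so that it maps directly onto C loops. The 16 kHz sample rate and n_fft = 512 are assumptions chosen for illustration, and seconds_to_samples and padded_hann are hypothetical helpers, not librosa functions.

```python
import math


def seconds_to_samples(seconds, sample_rate):
    """Convert a duration in seconds to a whole number of samples."""
    return int(round(seconds * sample_rate))


def padded_hann(win_length, n_fft):
    """Periodic Hann window of win_length samples, zero-padded centrally to n_fft.

    Hypothetical helper: mirrors a periodic ("fftbins") Hann window
    followed by symmetric zero-padding out to n_fft samples.
    """
    if win_length > n_fft:
        raise ValueError("win_length must not exceed n_fft")
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / win_length)
              for n in range(win_length)]
    left = (n_fft - win_length) // 2
    right = n_fft - win_length - left
    return [0.0] * left + window + [0.0] * right


sample_rate = 16000  # assumed sample rate in Hz
n_fft = 512          # assumed FFT size

win_length = seconds_to_samples(0.025, sample_rate)
hop_length = seconds_to_samples(0.010, sample_rate)
fft_window = padded_hann(win_length, n_fft)

print(win_length, hop_length, len(fft_window))
```

Under these assumptions the window itself spans 400 samples, but the array that multiplies each frame is 512 samples long, with zeros on both sides.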
The following is what I expected the framing part to look like:
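temp_frames = util.frame(y[..., start:], frame_length=win_length, hop_length=hop_length)
# Pad each frame (axis -2 of util.frame's output) symmetrically up to n_fft samples
pad_left = (n_fft - win_length) // 2
pad_right = n_fft - win_length - pad_left
frame_pad = [(0, 0)] * (temp_frames.ndim - 2) + [(pad_left, pad_right), (0, 0)]
y_frames = np.pad(temp_frames, frame_pad, mode=pad_mode)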
Note that the 'suggested code' hasn't been tested, and is intended to be more of a conceptual example.
The problem here is that librosa's implementation takes frames of length n_fft instead of win_length, which in my opinion causes an unnecessary loss of data. The suggested implementation gives each frame exactly as many samples as the window function has, which seems to me like the 'correct' way to do this.
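The difference between the two framings can be sketched by counting frames and listing which input samples fall under the non-padded part of the window in each case. This is only an illustration with center = False: the signal length of 16000 samples, n_fft = 512, win_length = 400 and hop_length = 160 are assumptions, and frame_starts and weighted_span are hypothetical helpers, not part of librosa.

```python
def frame_starts(n_samples, frame_length, hop_length):
    """Start indices of all complete frames (no centering, no signal padding)."""
    if frame_length > n_samples:
        return []
    return list(range(0, n_samples - frame_length + 1, hop_length))


def weighted_span(start, frame_length, window_length):
    """Half-open sample range covered by a window centred inside the frame."""
    left = (frame_length - window_length) // 2
    return (start + left, start + left + window_length)


# Assumed example values
n_samples, n_fft, win_length, hop_length = 16000, 512, 400, 160

# Framing A (as in the librosa excerpt): frames of n_fft samples,
# multiplied by a window that was zero-padded out to n_fft.
starts_a = frame_starts(n_samples, n_fft, hop_length)
spans_a = [weighted_span(s, n_fft, win_length) for s in starts_a]

# Framing B (the suggestion): frames of win_length samples,
# zero-padded out to n_fft afterwards.
starts_b = frame_starts(n_samples, win_length, hop_length)
spans_b = [weighted_span(s, win_length, win_length) for s in starts_b]

print("A:", len(starts_a), "frames, first", spans_a[0], "last", spans_a[-1])
print("B:", len(starts_b), "frames, first", spans_b[0], "last", spans_b[-1])
```

With these assumed values, framing A yields 97 frames whose windowed samples run from index 56 to 15816, while framing B yields 98 frames running from index 0 to 15920.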
So my question is this: Why is librosa's framing coded the way it has been coded? Is there a reason that I'm missing?