Why does librosa.core.spectrum.stft divide the signal into frames without padding?

92 views Asked by FloopyBeep At 15 June 2023 at 03:00

I am in the process of porting MFCC algorithms from Python libraries to C. So far I have completely ported the mfcc function from python_speech_features, and have now moved on to librosa.

However, once I got to the framing step, I found the implementation difficult to understand: why did the developers write it the way it currently is? It just seems wrong to me.

The problem I see is that librosa ignores the custom win_length value passed in the function call when it frames the signal: the framing relies solely on the n_fft value.

Here is librosa's implementation:

def stft(
    y: np.ndarray,
    *,
    n_fft: int = 2048,
    hop_length: Optional[int] = None,
    win_length: Optional[int] = None,
    window: _WindowSpec = "hann",
    center: bool = True,
    dtype: Optional[DTypeLike] = None,
    pad_mode: _PadModeSTFT = "constant",
    out: Optional[np.ndarray] = None,
) -> np.ndarray:


    # By default, use the entire frame
    if win_length is None:
        win_length = n_fft

    # Set the default hop, if it's not already specified
    if hop_length is None:
        hop_length = int(win_length // 4)
    elif not util.is_positive_int(hop_length):
        raise ParameterError(f"hop_length={hop_length} must be a positive integer")

    # Check audio is valid
    util.valid_audio(y, mono=False)

    fft_window = get_window(window, win_length, fftbins=True)

    # Pad the window out to n_fft size
    fft_window = util.pad_center(fft_window, size=n_fft)

    # Reshape so that the window can be broadcast
    fft_window = util.expand_to(fft_window, ndim=1 + y.ndim, axes=-2)
 
    ... (skipping middle part, assuming center = False)

        if n_fft > y.shape[-1]:
            raise ParameterError(
                f"n_fft={n_fft} is too large for uncentered analysis of input signal of length={y.shape[-1]}"
            )

    # "Middle" of the signal starts at sample 0
    start = 0
    # We have no extra frames
    extra = 0

    fft = get_fftlib()

    if dtype is None:
        dtype = util.dtype_r2c(y.dtype)

    # Window the time series.
    y_frames = util.frame(y[..., start:], frame_length=n_fft, hop_length=hop_length)

As far as I can see, nothing else modifies the y_frames data, so this is the framing librosa is going with.

However, for win_len = 0.025 s and hop_len = 0.010 s (which seem to be the defaults in many cases), only fft_window = get_window(window, win_length, fftbins=True) uses the win_len value, and it is not used afterwards. (Assume win_len *= samplerate, to account for librosa's implementation, which works in samples.)
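For concreteness, here is a minimal sketch of the seconds-to-samples conversion mentioned above. The 16 kHz sample rate, the helper name seconds_to_samples, and the choice of n_fft as the next power of two are assumptions made for illustration; none of them come from librosa.

```python
def seconds_to_samples(duration_s: float, sr: int) -> int:
    """Convert a duration in seconds to a whole number of samples (hypothetical helper)."""
    return int(round(duration_s * sr))


sr = 16000  # assumed sample rate, not stated in the question

win_length = seconds_to_samples(0.025, sr)  # 25 ms window
hop_length = seconds_to_samples(0.010, sr)  # 10 ms hop

# Assumption: n_fft is chosen as the next power of two >= win_length.
n_fft = 1 << (win_length - 1).bit_length()

print(win_length, hop_length, n_fft)
```

With these assumed values win_length is smaller than n_fft, which is exactly the situation the question is about.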

The following is what I expected the framing part to look like:

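# Frame with win_length samples, then pad each frame (axis -2) up to n_fft
temp_frames = util.frame(y[..., start:], frame_length=win_length, hop_length=hop_length)
left = (n_fft - win_length) // 2  # integer padding, same split as util.pad_center
pad_width = [(0, 0)] * (temp_frames.ndim - 2) + [(left, n_fft - win_length - left), (0, 0)]
y_frames = np.pad(temp_frames, pad_width, mode=pad_mode)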

Note that the 'suggested code' hasn't been tested, and is intended to be more of a conceptual example.

The problem occurring here is that librosa's implementation takes frames of length n_fft instead of win_length, which in my opinion creates an unnecessary loss of data. The suggested implementation gives each frame the same amount of data as the window function has, which seems like the 'correct way' to do this to me.
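To make the comparison concrete, here is a NumPy-only sketch that frames a toy signal both ways. It is not librosa's code: frame and pad_center_1d are simplified 1-D stand-ins for util.frame and util.pad_center, np.hanning is a stand-in for get_window (which returns a periodic rather than symmetric window when fftbins=True), and the signal length and parameter values are assumptions.

```python
import numpy as np


def frame(y, frame_length, hop_length):
    """Slice a 1-D signal into overlapping frames, shape (frame_length, n_frames)."""
    n_frames = 1 + (len(y) - frame_length) // hop_length
    idx = np.arange(frame_length)[:, None] + hop_length * np.arange(n_frames)[None, :]
    return y[idx]


def pad_center_1d(w, size):
    """Zero-pad a 1-D array on both sides so it is centered in `size` samples."""
    lpad = (size - len(w)) // 2
    return np.pad(w, (lpad, size - len(w) - lpad))


rng = np.random.default_rng(0)
y = rng.standard_normal(4000)                 # toy signal (assumed length)
n_fft, win_length, hop_length = 512, 400, 160
window = np.hanning(win_length)               # stand-in window
lpad = (n_fft - win_length) // 2

# Approach A (as in the quoted excerpt): n_fft-long frames multiplied by
# a window that has been zero-padded to n_fft.
a = frame(y, n_fft, hop_length) * pad_center_1d(window, n_fft)[:, None]

# Approach B (as proposed): win_length-long frames, windowed, then each
# frame zero-padded to n_fft. The signal is shifted by lpad samples so
# that frame k lines up with the non-zero part of frame k in approach A.
b = frame(y[lpad:], win_length, hop_length) * window[:, None]
b = np.pad(b, ((lpad, n_fft - win_length - lpad), (0, 0)))

n_common = min(a.shape[1], b.shape[1])
print(a.shape, b.shape)
print(np.allclose(a[:, :n_common], b[:, :n_common]))
```

In this toy setup the windowed frames agree wherever both approaches produce a frame, because the samples outside the central win_length region are multiplied by the zeros of the padded window. The visible differences are the lpad-sample shift in where each frame starts and the number of frames available at the end of the signal.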

So my question is this: Why is librosa's framing coded the way it has been coded? Is there a reason that I'm missing?


There are 0 answers
