kaldi.segmentation

Classes

NnetSAD(model, transform, graph[, beam, …]) Neural network based speech activity detection (SAD).
SegmentationProcessor(target_labels[, …]) Segmentation post-processor.
Segmenter(graph[, beam, max_active, …]) Base class for speech segmenters.
class kaldi.segmentation.Segmenter(graph, beam=8, max_active=1000, acoustic_scale=0.1)[source]

Base class for speech segmenters.

Parameters:
  • graph (StdVectorFst) – Segmentation graph.
  • beam (float) – Logarithmic decoding beam.
  • max_active (int) – Maximum number of active states in decoding.
  • acoustic_scale (float) – Acoustic score scale.
segment(input)[source]

Segments input.

Output is a dictionary with the following (key, value) pairs:

key           value                        value type
“alignment”   Frame-level segmentation     List[int]
“best_path”   Best lattice path            CompactLattice
“likelihood”  Log-likelihood of best path  float
“weight”      Cost of best path            LatticeWeight

The “weight” output is a lattice weight consisting of (graph-score, acoustic-score).

Parameters: input (object) – Input to segment.
Returns: A dictionary representing segmentation output.
Raises: RuntimeError – If segmentation fails.
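The “alignment” output is typically handed to SegmentationProcessor, but its structure is simple enough to inspect directly. As an illustrative (hypothetical) helper, frame-level labels can be collapsed into (start, end, label) runs with exclusive end frames:

```python
def alignment_to_runs(alignment):
    """Collapse frame-level labels into (start, end, label) runs.

    Start is inclusive, end is exclusive, both measured in frames.
    """
    runs = []
    for i, label in enumerate(alignment):
        if runs and runs[-1][2] == label:
            # Same label as the previous frame: extend the current run.
            runs[-1] = (runs[-1][0], i + 1, label)
        else:
            # Label changed: start a new one-frame run.
            runs.append((i, i + 1, label))
    return runs

print(alignment_to_runs([1, 1, 2, 2, 2, 1]))
# → [(0, 2, 1), (2, 5, 2), (5, 6, 1)]
```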
class kaldi.segmentation.NnetSAD(model, transform, graph, beam=8, max_active=1000, decodable_opts=None)[source]

Neural network based speech activity detection (SAD).

Parameters:
  • model (Nnet) – SAD model. Model output should be log-posteriors for [silence, speech, garbage] labels.
  • transform (Matrix) – Transformation applied to SAD label posteriors. It should be a 3x2 matrix mapping [silence, speech, garbage] posteriors to [silence, speech] pseudo-likelihoods.
  • graph (StdVectorFst) – SAD graph. Silence and speech arcs should be labeled respectively with 1 and 2.
  • beam (float) – Logarithmic decoding beam.
  • max_active (int) – Maximum number of active states in decoding.
  • decodable_opts (NnetSimpleComputationOptions) – Configuration options for the SAD model.
static make_sad_graph(transition_scale=1.0, self_loop_scale=0.1, min_silence_duration=0.03, min_speech_duration=0.3, max_speech_duration=10.0, frame_shift=0.01, edge_silence_probability=0.5, transition_probability=0.1)[source]

Makes a decoding graph with a simple HMM topology suitable for SAD.

Output graph uses label 1 for ‘silence’ and label 2 for ‘speech’.

Parameters:
  • transition_scale (float) – Scale on transition log-probabilities relative to LM weights.
  • self_loop_scale (float) – Scale on self-loop log-probabilities relative to LM weights.
  • min_silence_duration (float) – Minimum duration for silence.
  • min_speech_duration (float) – Minimum duration for speech.
  • max_speech_duration (float) – Maximum duration for speech.
  • frame_shift (float) – Frame shift in seconds.
  • edge_silence_probability (float) – Probability of silence at the edges.
  • transition_probability (float) – Probability of transitioning from silence to speech, and vice versa.
Returns: A simple decoding graph suitable for SAD.
Return type: StdVectorFst
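The duration parameters are enforced through the HMM topology: a minimum duration of d seconds at a frame shift of s seconds requires a chain of roughly d/s emitting states, since each emitting state consumes one frame. A minimal sketch of that arithmetic (the actual topology construction in pykaldi is more involved):

```python
def duration_to_num_states(duration, frame_shift=0.01):
    # Each emitting state consumes one frame, so a duration constraint of
    # `duration` seconds translates to about duration / frame_shift states.
    return max(1, int(round(duration / frame_shift)))

# With the default frame shift of 10 ms:
print(duration_to_num_states(0.03))   # min_silence_duration → 3 states
print(duration_to_num_states(0.3))    # min_speech_duration → 30 states
print(duration_to_num_states(10.0))   # max_speech_duration → 1000 states
```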

static make_sad_transform(priors, sil_scale=1.0, sil_in_speech_weight=0.0, speech_in_sil_weight=0.0, garbage_in_speech_weight=0.0, garbage_in_sil_weight=0.0)[source]

Creates SAD posterior transformation matrix.

The 3x2 transformation matrix converts Nx3 posterior probability matrices (one row per frame) into Nx2 pseudo-likelihood matrices.

The “priors” vector can be a proper prior probability distribution over SAD labels or simply average SAD label posteriors. This vector is normalized to derive a prior probability distribution.

Parameters:
  • priors (Vector) – SAD label priors to remove from the neural network output posteriors to convert them to pseudo likelihoods.
  • sil_scale (float) – Scale on the silence probability. Make this more than one to encourage decoding silence.
  • sil_in_speech_weight (float) – The fraction of silence probability to add to speech probability.
  • speech_in_sil_weight (float) – The fraction of speech probability to add to silence probability.
  • garbage_in_speech_weight (float) – The fraction of garbage probability to add to speech probability.
  • garbage_in_sil_weight (float) – The fraction of garbage probability to add to silence probability.
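The parameter descriptions above suggest one plausible form of the matrix: each row is divided by the (normalized) prior for its label to turn posteriors into pseudo-likelihoods, the label's own column carries its full probability mass, and the weight parameters leak mass into the other column. A pure-Python sketch of that reading; the matrix pykaldi actually builds may differ in detail:

```python
def make_sad_transform(priors, sil_scale=1.0,
                       sil_in_speech_weight=0.0, speech_in_sil_weight=0.0,
                       garbage_in_speech_weight=0.0, garbage_in_sil_weight=0.0):
    # Normalize priors into a distribution over [silence, speech, garbage].
    total = sum(priors)
    p_sil, p_speech, p_garbage = (x / total for x in priors)
    # Rows correspond to [silence, speech, garbage] posteriors,
    # columns to [silence, speech] pseudo-likelihoods. Dividing by the
    # prior converts a posterior into a pseudo-likelihood.
    return [
        [sil_scale / p_sil,                 sil_in_speech_weight / p_sil],
        [speech_in_sil_weight / p_speech,   1.0 / p_speech],
        [garbage_in_sil_weight / p_garbage, garbage_in_speech_weight / p_garbage],
    ]

# Applying the transform: pseudo-likelihoods = posteriors (Nx3) x transform (3x2)
t = make_sad_transform([0.5, 0.3, 0.2])
post = [0.7, 0.2, 0.1]  # one frame of [silence, speech, garbage] posteriors
pseudo = [sum(post[i] * t[i][j] for i in range(3)) for j in range(2)]
```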
static read_average_posteriors(post_rxfilename)[source]

Reads average SAD label posteriors from an extended filename.

static read_model(model_rxfilename)[source]

Reads SAD model from an extended filename.

segment(input)

Segments input.

Output is a dictionary with the following (key, value) pairs:

key           value                        value type
“alignment”   Frame-level segmentation     List[int]
“best_path”   Best lattice path            CompactLattice
“likelihood”  Log-likelihood of best path  float
“weight”      Cost of best path            LatticeWeight

The “weight” output is a lattice weight consisting of (graph-score, acoustic-score).

Parameters: input (object) – Input to segment.
Returns: A dictionary representing segmentation output.
Raises: RuntimeError – If segmentation fails.
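Putting the pieces together, a typical SAD pipeline reads a model and average posteriors, builds the transform and graph, decodes features, and post-processes the alignment. A sketch along those lines, assuming Kaldi-style resource files (final.raw, post_output.vec, wav.scp, mfcc.conf) and the kaldi.util.table readers; file names and option values here are illustrative, not prescriptive:

```python
from kaldi.nnet3 import NnetSimpleComputationOptions
from kaldi.segmentation import NnetSAD, SegmentationProcessor
from kaldi.util.table import SequentialMatrixReader

# Construct the SAD components from Kaldi-style resource files.
model = NnetSAD.read_model("final.raw")
post = NnetSAD.read_average_posteriors("post_output.vec")
transform = NnetSAD.make_sad_transform(post)
graph = NnetSAD.make_sad_graph()

# Decoding options for the SAD model (values are illustrative).
decodable_opts = NnetSimpleComputationOptions()
decodable_opts.extra_left_context = 79
decodable_opts.extra_right_context = 21
decodable_opts.frames_per_chunk = 150
decodable_opts.acoustic_scale = 0.3

sad = NnetSAD(model, transform, graph, decodable_opts=decodable_opts)
seg = SegmentationProcessor(target_labels=[2])  # label 2 = speech

# Compute features on the fly and write speech segments to a file.
feats_rspec = "ark:compute-mfcc-feats --config=mfcc.conf scp:wav.scp ark:- |"
with SequentialMatrixReader(feats_rspec) as f, open("segments", "w") as s:
    for key, feats in f:
        out = sad.segment(feats)
        segments, stats = seg.process(out["alignment"])
        seg.write(key, segments, s)
```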
class kaldi.segmentation.SegmentationProcessor(target_labels, frame_shift=0.01, segment_padding=0.2, min_segment_dur=0, max_merged_segment_dur=0)[source]

Segmentation post-processor.

This class is used for converting segmentation labels to a list of segments. Output includes only those segments labeled with the target labels.

Post-processing operations include:
  • filtering out short segments
  • padding segments
  • merging consecutive segments
Parameters:
  • target_labels (List[int]) – Target labels. Typically the speech labels.
  • frame_shift (float) – Frame shift in seconds.
  • segment_padding (float) – Additional padding on target segments. Padding does not go beyond the adjacent segment. This is typically used for padding speech segments with silence. Must be an integral multiple of frame shift.
  • min_segment_dur (float) – Minimum duration (in seconds) required for a segment to be included. This is before any padding. Segments shorter than this duration will be removed.
  • max_merged_segment_dur (float) – Merge consecutive segments as long as the merged segment is no longer than this many seconds. Segments are only merged if their boundaries are touching, after padding by segment_padding seconds. 0 means do not merge. Use ‘inf’ to merge without a duration limit.
Variables: stats (SegmentationProcessor.Stats) – Global segmentation post-processing stats.
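The post-processing steps (initialize, filter, pad, merge) can be mimicked in plain Python. This is a simplified sketch of the semantics described above, not the class's actual implementation; stats collection is omitted and boundary handling in pykaldi may differ:

```python
def process(alignment, target_labels, frame_shift=0.01,
            segment_padding=0.2, min_segment_dur=0.0,
            max_merged_segment_dur=0.0):
    """Convert frame labels to (beg, end, label) segments (end exclusive)."""
    num_frames = len(alignment)
    pad = int(round(segment_padding / frame_shift))
    min_frames = int(round(min_segment_dur / frame_shift))
    max_frames = int(round(max_merged_segment_dur / frame_shift))

    # 1. Initialize: collapse frames into runs, keeping only target labels.
    segs = []
    for i, lab in enumerate(alignment):
        if lab not in target_labels:
            continue
        if segs and segs[-1][2] == lab and segs[-1][1] == i:
            segs[-1] = (segs[-1][0], i + 1, lab)
        else:
            segs.append((i, i + 1, lab))

    # 2. Filter out segments shorter than min_segment_dur (before padding).
    segs = [s for s in segs if s[1] - s[0] >= min_frames]

    # 3. Pad both sides, clipped to neighbours and utterance boundaries.
    padded = []
    for k, (b, e, lab) in enumerate(segs):
        lo = segs[k - 1][1] if k > 0 else 0
        hi = segs[k + 1][0] if k + 1 < len(segs) else num_frames
        padded.append((max(b - pad, lo), min(e + pad, hi), lab))
    segs = padded

    # 4. Merge touching segments with the same label, capped by
    #    max_merged_segment_dur (0 disables merging).
    merged = []
    for s in segs:
        if (merged and max_frames > 0 and merged[-1][2] == s[2]
                and merged[-1][1] == s[0]
                and s[1] - merged[-1][0] <= max_frames):
            merged[-1] = (merged[-1][0], s[1], s[2])
        else:
            merged.append(s)
    return merged
```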

class Stats[source]

Stores segmentation post-processing stats.

add(other)[source]

Adds stats from another Stats object.

filter_short_segments(segments, stats)[source]

Filters out short segments.

initialize_segments(alignment, stats)[source]

Initializes segments.

The alignment is frame-level segmentation labels. Output includes only those segments labeled with the target labels.

merge_consecutive_segments(segments, stats)[source]

Merges consecutive segments.

Done after padding. Consecutive segments that share a boundary are merged if they have the same label and the merged segment is no longer than ‘max_merged_segment_dur’.

pad_segments(segments, stats, num_utt_frames=None)[source]

Pads segments on both sides.

Ensures that the segments do not go beyond the neighboring segments or utterance boundaries.

process(alignment)[source]

Converts frame-level segmentation labels to a list of segments.

Parameters: alignment (List[int]) – Frame-level segmentation labels.
Returns: A list of segments, where each entry is a (segment-beg, segment-end, label) tuple, along with segmentation post-processing stats.
Return type: Tuple[List[Tuple[int, int, int]], SegmentationProcessor.Stats]
write(key, segments, file_handle)[source]

Writes segments to a file.