kaldi.alignment

Classes

Aligner(transition_model, tree, lexicon[, …]) Speech aligner.
GmmAligner(transition_model, acoustic_model, …) GMM based speech aligner.
MappedAligner(transition_model, tree, lexicon) Mapped speech aligner.
NnetAligner(transition_model, …[, …]) Neural network based speech aligner.
class kaldi.alignment.Aligner(transition_model, tree, lexicon, symbols=None, disambig_symbols=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, acoustic_scale=0.1)

Speech aligner.

This can be used to align transition-id log-likelihood matrices with reference texts.

Parameters:
  • transition_model (TransitionModel) – The transition model.
  • tree (ContextDependency) – The phonetic decision tree.
  • lexicon (StdFst) – The lexicon FST.
  • symbols (SymbolTable) – The symbol table. If provided, “text” input of align() should include symbols instead of integer indices.
  • disambig_symbols (List[int]) – Disambiguation symbols.
  • graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
  • beam (float) – Decoding beam used in alignment.
  • transition_scale (float) – The scale on non-self-loop transition probabilities.
  • self_loop_scale (float) – The scale on self-loop transition probabilities.
  • acoustic_scale (float) – Acoustic score scale.
align(input, text)

Aligns input with text.

Output is a dictionary with the following (key, value) pairs:

key            value                          value type
“alignment”    Frame-level alignment          List[int]
“best_path”    Best lattice path              CompactLattice
“likelihood”   Log-likelihood of best path    float
“weight”       Cost of best path              LatticeWeight

If symbols is None, the “text” input should be a string of space-separated integer indices. Otherwise, it should be a string of space-separated symbols. The “weight” output is a lattice weight consisting of (graph-score, acoustic-score).

Parameters:
  • input (object) – Input to align.
  • text (str) – Reference text to align.
Returns:

A dictionary representing alignment output.

Raises:

RuntimeError – If alignment fails.
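
The behaviour described above can be sketched as a small batch helper: it calls align() once per utterance and records the utterances for which RuntimeError is raised. The helper name align_all and the two dictionaries it takes are hypothetical; only align(), its output dictionary and the RuntimeError come from the description above.

```python
def align_all(aligner, inputs, texts):
    """Align every utterance in `inputs` against its reference in `texts`.

    `aligner` is any object exposing align(input, text) as documented above.
    `inputs` and `texts` are hypothetical dicts keyed by utterance id.
    Returns (results, failed): `results` maps utterance id to the output
    dictionary of align(); `failed` lists ids that could not be aligned.
    """
    results, failed = {}, []
    for key, data in inputs.items():
        if key not in texts:
            failed.append(key)  # no reference text for this utterance
            continue
        try:
            results[key] = aligner.align(data, texts[key])
        except RuntimeError:
            failed.append(key)  # align() raises RuntimeError on failure
    return results, failed
```

Each entry of results can then be read through the keys listed above, for example results[key]["alignment"].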

classmethod from_files(model_rxfilename, tree_rxfilename, lexicon_rxfilename, symbols_filename=None, disambig_rxfilename=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, acoustic_scale=0.1)

Constructs a new aligner from given files.

Parameters:
  • model_rxfilename (str) – Extended filename for reading the transition model.
  • tree_rxfilename (str) – Extended filename for reading the phonetic decision tree.
  • lexicon_rxfilename (str) – Extended filename for reading the lexicon FST.
  • symbols_filename (str) – The symbols file. If provided, “text” input of align() should include symbols instead of integer indices.
  • disambig_rxfilename (str) – Extended filename for reading the list of disambiguation symbols.
  • graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
  • beam (float) – Decoding beam used in alignment.
  • transition_scale (float) – The scale on non-self-loop transition probabilities.
  • self_loop_scale (float) – The scale on self-loop transition probabilities.
  • acoustic_scale (float) – Acoustic score scale.
Returns:

A new aligner object.
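
As a rough illustration of how from_files() might be called, the sketch below assembles its filename arguments from two directories. The helper name aligner_from_dirs and the file names (final.mdl, tree, L.fst, words.txt, phones/disambig.int) are assumptions based on common Kaldi recipe layouts, not something this module prescribes.

```python
import os


def aligner_from_dirs(aligner_cls, model_dir, lang_dir, **options):
    """Call aligner_cls.from_files() with paths under two directories.

    The file names used here are assumed, not required by the API;
    adjust them to match the actual layout of your model and lang dirs.
    """
    return aligner_cls.from_files(
        os.path.join(model_dir, "final.mdl"),   # model_rxfilename
        os.path.join(model_dir, "tree"),        # tree_rxfilename
        os.path.join(lang_dir, "L.fst"),        # lexicon_rxfilename
        symbols_filename=os.path.join(lang_dir, "words.txt"),
        disambig_rxfilename=os.path.join(lang_dir, "phones", "disambig.int"),
        **options  # e.g. beam=..., transition_scale=...
    )
```

Here aligner_cls would be one of the aligner classes in this module; any keyword options are passed through unchanged.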

static read_disambig_symbols(disambig_rxfilename)

Reads disambiguation symbols from an extended filename.

Returns: List of disambiguation symbols.
Return type: List[int]
static read_lexicon(lexicon_rxfilename)

Reads lexicon FST from an extended filename.

Returns: Lexicon FST.
Return type: StdFst
static read_model(model_rxfilename)

Reads transition model from an extended filename.

Returns: Transition model.
Return type: TransitionModel
static read_symbols(symbols_filename)

Reads symbol table from file.

Returns: Symbol table.
Return type: SymbolTable
static read_tree(tree_rxfilename)

Reads phonetic decision tree from an extended filename.

Returns: Phonetic decision tree.
Return type: ContextDependency
to_phone_alignment(alignment, phones=None)

Converts frame-level alignment to phone-level alignment.

Parameters:
  • alignment (List[int]) – Frame-level alignment.
  • phones (SymbolTable) – The phone symbol table. If provided, output includes symbols instead of integer indices.
Returns:

A list of triplets representing, for each phone in the input, the phone index/symbol, the begin time (in frames) and the duration (in frames).

Return type:

List[Tuple[int,int,int]]
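
Because begin times and durations are expressed in frames, a small conversion is usually needed to obtain times in seconds. The sketch below shows one way to do it; the helper name and the default frame shift of 0.01 seconds are assumptions, since the frame shift depends on how the features were extracted.

```python
def triplets_to_seconds(triplets, frame_shift=0.01):
    """Convert (label, begin_frame, duration_frames) triplets to
    (label, start_seconds, end_seconds) triplets.

    frame_shift is the number of seconds per frame. The default of 0.01
    is an assumed value; the aligner itself does not report it.
    """
    return [
        (label, begin * frame_shift, (begin + duration) * frame_shift)
        for label, begin, duration in triplets
    ]
```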

to_word_alignment(best_path, word_boundary_info)

Converts best alignment path to word-level alignment.

Parameters:
  • best_path (CompactLattice) – Best alignment path.
  • word_boundary_info (WordBoundaryInfo) – Word boundary information.
Returns:

A list of triplets representing, for each word in the input, the word index/symbol, the begin time (in frames) and the duration (in frames). The zero/epsilon words correspond to optional silences.

Return type:

List[Tuple[int,int,int]]
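
Since zero/epsilon entries stand for optional silences, they are often filtered out before further processing. A minimal sketch follows; the helper name is hypothetical, and treating "<eps>" as the symbol form of the epsilon label is an assumption that should be checked against the symbol table in use.

```python
def drop_optional_silences(word_alignment, epsilon_labels=(0, "<eps>")):
    """Remove zero/epsilon entries from a word-level alignment.

    word_alignment is a list of (word, begin_frame, duration_frames)
    triplets. "<eps>" as the epsilon symbol is an assumed convention.
    """
    return [entry for entry in word_alignment if entry[0] not in epsilon_labels]
```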

class kaldi.alignment.MappedAligner(transition_model, tree, lexicon, symbols=None, disambig_symbols=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, acoustic_scale=0.1)

Mapped speech aligner.

This can be used to align phone-id log-likelihood matrices with reference texts.

Parameters:
  • transition_model (TransitionModel) – The transition model.
  • tree (ContextDependency) – The phonetic decision tree.
  • lexicon (StdFst) – The lexicon FST.
  • symbols (SymbolTable) – The symbol table. If provided, “text” input of align() should include symbols instead of integer indices.
  • disambig_symbols (List[int]) – Disambiguation symbols.
  • graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
  • beam (float) – Decoding beam used in alignment.
  • transition_scale (float) – The scale on non-self-loop transition probabilities.
  • self_loop_scale (float) – The scale on self-loop transition probabilities.
  • acoustic_scale (float) – Acoustic score scale.
align(input, text)

Aligns input with text.

Output is a dictionary with the following (key, value) pairs:

key            value                          value type
“alignment”    Frame-level alignment          List[int]
“best_path”    Best lattice path              CompactLattice
“likelihood”   Log-likelihood of best path    float
“weight”       Cost of best path              LatticeWeight

If symbols is None, the “text” input should be a string of space-separated integer indices. Otherwise, it should be a string of space-separated symbols. The “weight” output is a lattice weight consisting of (graph-score, acoustic-score).

Parameters:
  • input (object) – Input to align.
  • text (str) – Reference text to align.
Returns:

A dictionary representing alignment output.

Raises:

RuntimeError – If alignment fails.
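
When no symbol table is given to the aligner, the reference text has to be converted to integer indices beforehand. The sketch below illustrates that conversion with a plain dictionary; the helper name, the word_to_id mapping and the "<unk>" out-of-vocabulary symbol are all hypothetical.

```python
def words_to_indices(text, word_to_id, oov_word="<unk>"):
    """Turn a space-separated transcript into space-separated integer indices.

    word_to_id is a hypothetical dict mapping word symbols to integer
    indices. Words missing from it are replaced by the index of oov_word
    ("<unk>" is an assumed name), which must itself be in the mapping.
    """
    oov_id = word_to_id[oov_word]
    ids = [word_to_id.get(word, oov_id) for word in text.split()]
    return " ".join(str(i) for i in ids)
```

The returned string has the form expected for the “text” input when symbols is None.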

from_files(model_rxfilename, tree_rxfilename, lexicon_rxfilename, symbols_filename=None, disambig_rxfilename=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, acoustic_scale=0.1)

Constructs a new aligner from given files.

Parameters:
  • model_rxfilename (str) – Extended filename for reading the transition model.
  • tree_rxfilename (str) – Extended filename for reading the phonetic decision tree.
  • lexicon_rxfilename (str) – Extended filename for reading the lexicon FST.
  • symbols_filename (str) – The symbols file. If provided, “text” input of align() should include symbols instead of integer indices.
  • disambig_rxfilename (str) – Extended filename for reading the list of disambiguation symbols.
  • graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
  • beam (float) – Decoding beam used in alignment.
  • transition_scale (float) – The scale on non-self-loop transition probabilities.
  • self_loop_scale (float) – The scale on self-loop transition probabilities.
  • acoustic_scale (float) – Acoustic score scale.
Returns:

A new aligner object.

read_disambig_symbols(disambig_rxfilename)

Reads disambiguation symbols from an extended filename.

Returns: List of disambiguation symbols.
Return type: List[int]
read_lexicon(lexicon_rxfilename)

Reads lexicon FST from an extended filename.

Returns: Lexicon FST.
Return type: StdFst
read_model(model_rxfilename)

Reads transition model from an extended filename.

Returns: Transition model.
Return type: TransitionModel
read_symbols(symbols_filename)

Reads symbol table from file.

Returns: Symbol table.
Return type: SymbolTable
read_tree(tree_rxfilename)

Reads phonetic decision tree from an extended filename.

Returns: Phonetic decision tree.
Return type: ContextDependency
to_phone_alignment(alignment, phones=None)

Converts frame-level alignment to phone-level alignment.

Parameters:
  • alignment (List[int]) – Frame-level alignment.
  • phones (SymbolTable) – The phone symbol table. If provided, output includes symbols instead of integer indices.
Returns:

A list of triplets representing, for each phone in the input, the phone index/symbol, the begin time (in frames) and the duration (in frames).

Return type:

List[Tuple[int,int,int]]

to_word_alignment(best_path, word_boundary_info)

Converts best alignment path to word-level alignment.

Parameters:
  • best_path (CompactLattice) – Best alignment path.
  • word_boundary_info (WordBoundaryInfo) – Word boundary information.
Returns:

A list of triplets representing, for each word in the input, the word index/symbol, the begin time (in frames) and the duration (in frames). The zero/epsilon words correspond to optional silences.

Return type:

List[Tuple[int,int,int]]

class kaldi.alignment.GmmAligner(transition_model, acoustic_model, tree, lexicon, symbols=None, disambig_symbols=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, acoustic_scale=0.1)

GMM based speech aligner.

This can be used to align feature matrices with reference texts.

Parameters:
  • transition_model (TransitionModel) – The transition model.
  • acoustic_model (AmDiagGmm) – The acoustic model.
  • tree (ContextDependency) – The phonetic decision tree.
  • lexicon (StdFst) – The lexicon FST.
  • symbols (SymbolTable) – The symbol table. If provided, “text” input of align() should include symbols instead of integer indices.
  • disambig_symbols (List[int]) – Disambiguation symbols.
  • graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
  • beam (float) – Decoding beam used in alignment.
  • transition_scale (float) – The scale on non-self-loop transition probabilities.
  • self_loop_scale (float) – The scale on self-loop transition probabilities.
  • acoustic_scale (float) – Acoustic score scale.
align(input, text)

Aligns input with text.

Output is a dictionary with the following (key, value) pairs:

key            value                          value type
“alignment”    Frame-level alignment          List[int]
“best_path”    Best lattice path              CompactLattice
“likelihood”   Log-likelihood of best path    float
“weight”       Cost of best path              LatticeWeight

If symbols is None, the “text” input should be a string of space-separated integer indices. Otherwise, it should be a string of space-separated symbols. The “weight” output is a lattice weight consisting of (graph-score, acoustic-score).

Parameters:
  • input (object) – Input to align.
  • text (str) – Reference text to align.
Returns:

A dictionary representing alignment output.

Raises:

RuntimeError – If alignment fails.
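
The relation between the “weight” and “likelihood” outputs can be illustrated with a stand-in type. This sketch assumes that the two components of the weight are costs (negated log scores), as is conventional for Kaldi lattice weights, and that the log-likelihood is the negated total cost. The Weight namedtuple and its value1/value2 field names are assumptions for illustration, not the actual LatticeWeight class.

```python
from collections import namedtuple

# Stand-in for a lattice weight: value1 holds the graph cost and value2
# the acoustic cost. The field names are an assumption for this sketch.
Weight = namedtuple("Weight", ["value1", "value2"])


def total_cost(weight):
    """Sum the graph and acoustic components of a lattice weight."""
    return weight.value1 + weight.value2


def log_likelihood(weight):
    """Negate the total cost, assuming costs are negated log scores."""
    return -total_cost(weight)
```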

classmethod from_files(model_rxfilename, tree_rxfilename, lexicon_rxfilename, symbols_filename=None, disambig_rxfilename=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, acoustic_scale=0.1)

Constructs a new GMM aligner from given files.

Parameters:
  • model_rxfilename (str) – Extended filename for reading the model.
  • tree_rxfilename (str) – Extended filename for reading the phonetic decision tree.
  • lexicon_rxfilename (str) – Extended filename for reading the lexicon FST.
  • symbols_filename (str) – The symbols file. If provided, “text” input of align() should include symbols instead of integer indices.
  • disambig_rxfilename (str) – Extended filename for reading the list of disambiguation symbols.
  • graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
  • beam (float) – Decoding beam used in alignment.
  • transition_scale (float) – The scale on non-self-loop transition probabilities.
  • self_loop_scale (float) – The scale on self-loop transition probabilities.
  • acoustic_scale (float) – Acoustic score scale.
Returns:

A new aligner object.

read_disambig_symbols(disambig_rxfilename)

Reads disambiguation symbols from an extended filename.

Returns: List of disambiguation symbols.
Return type: List[int]
read_lexicon(lexicon_rxfilename)

Reads lexicon FST from an extended filename.

Returns: Lexicon FST.
Return type: StdFst
static read_model(model_rxfilename)

Reads model from an extended filename.

Returns: A (transition model, acoustic model) pair.
Return type: Tuple[TransitionModel, AmDiagGmm]
read_symbols(symbols_filename)

Reads symbol table from file.

Returns: Symbol table.
Return type: SymbolTable
read_tree(tree_rxfilename)

Reads phonetic decision tree from an extended filename.

Returns: Phonetic decision tree.
Return type: ContextDependency
to_phone_alignment(alignment, phones=None)

Converts frame-level alignment to phone-level alignment.

Parameters:
  • alignment (List[int]) – Frame-level alignment.
  • phones (SymbolTable) – The phone symbol table. If provided, output includes symbols instead of integer indices.
Returns:

A list of triplets representing, for each phone in the input, the phone index/symbol, the begin time (in frames) and the duration (in frames).

Return type:

List[Tuple[int,int,int]]

to_word_alignment(best_path, word_boundary_info)

Converts best alignment path to word-level alignment.

Parameters:
  • best_path (CompactLattice) – Best alignment path.
  • word_boundary_info (WordBoundaryInfo) – Word boundary information.
Returns:

A list of triplets representing, for each word in the input, the word index/symbol, the begin time (in frames) and the duration (in frames). The zero/epsilon words correspond to optional silences.

Return type:

List[Tuple[int,int,int]]
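
The three methods above are typically chained: align() produces the frame-level alignment and best path, which to_phone_alignment() and to_word_alignment() then convert. The helper below is a sketch of that flow using only the documented method names and output keys; the helper name and the keys of the dictionary it returns are hypothetical.

```python
def align_at_all_levels(aligner, data, text, phones=None, word_boundary_info=None):
    """Run align() and derive phone- and word-level alignments from its output.

    `aligner` is any object exposing align(), to_phone_alignment() and
    to_word_alignment() as documented above.
    """
    out = aligner.align(data, text)
    result = {
        "frames": out["alignment"],
        "phones": aligner.to_phone_alignment(out["alignment"], phones),
    }
    # Word-level conversion needs word boundary information, so it is
    # skipped when none is supplied.
    if word_boundary_info is not None:
        result["words"] = aligner.to_word_alignment(
            out["best_path"], word_boundary_info
        )
    return result
```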

class kaldi.alignment.NnetAligner(transition_model, acoustic_model, tree, lexicon, symbols=None, disambig_symbols=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, decodable_opts=None, online_ivector_period=10)

Neural network based speech aligner.

This can be used to align feature matrices with reference texts.

Parameters:
  • transition_model (TransitionModel) – The transition model.
  • acoustic_model (AmNnetSimple) – The acoustic model.
  • tree (ContextDependency) – The phonetic decision tree.
  • lexicon (StdFst) – The lexicon FST.
  • symbols (SymbolTable) – The symbol table. If provided, “text” input of align() should include symbols instead of integer indices.
  • disambig_symbols (List[int]) – Disambiguation symbols.
  • graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
  • beam (float) – Decoding beam used in alignment.
  • transition_scale (float) – The scale on non-self-loop transition probabilities.
  • self_loop_scale (float) – The scale on self-loop transition probabilities.
  • decodable_opts (NnetSimpleComputationOptions) – Configuration options for simple nnet3 am decodable objects.
  • online_ivector_period (int) – Online ivector period. Relevant only if online ivectors are used.
align(input, text)

Aligns input with text.

Output is a dictionary with the following (key, value) pairs:

key            value                          value type
“alignment”    Frame-level alignment          List[int]
“best_path”    Best lattice path              CompactLattice
“likelihood”   Log-likelihood of best path    float
“weight”       Cost of best path              LatticeWeight

If symbols is None, the “text” input should be a string of space-separated integer indices. Otherwise, it should be a string of space-separated symbols. The “weight” output is a lattice weight consisting of (graph-score, acoustic-score).

Parameters:
  • input (object) – Input to align.
  • text (str) – Reference text to align.
Returns:

A dictionary representing alignment output.

Raises:

RuntimeError – If alignment fails.

classmethod from_files(model_rxfilename, tree_rxfilename, lexicon_rxfilename, symbols_filename=None, disambig_rxfilename=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, decodable_opts=None, online_ivector_period=10)

Constructs a new nnet3 aligner from given files.

Parameters:
  • model_rxfilename (str) – Extended filename for reading the model.
  • tree_rxfilename (str) – Extended filename for reading the phonetic decision tree.
  • lexicon_rxfilename (str) – Extended filename for reading the lexicon FST.
  • symbols_filename (str) – The symbols file. If provided, “text” input of align() should include symbols instead of integer indices.
  • disambig_rxfilename (str) – Extended filename for reading the list of disambiguation symbols.
  • graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
  • beam (float) – Decoding beam used in alignment.
  • transition_scale (float) – The scale on non-self-loop transition probabilities.
  • self_loop_scale (float) – The scale on self-loop transition probabilities.
  • decodable_opts (NnetSimpleComputationOptions) – Configuration options for simple nnet3 am decodable objects.
  • online_ivector_period (int) – Online ivector period. Relevant only if online ivectors are used.
Returns:

A new aligner object.

read_disambig_symbols(disambig_rxfilename)

Reads disambiguation symbols from an extended filename.

Returns: List of disambiguation symbols.
Return type: List[int]
read_lexicon(lexicon_rxfilename)

Reads lexicon FST from an extended filename.

Returns: Lexicon FST.
Return type: StdFst
static read_model(model_rxfilename)

Reads model from an extended filename.

Returns: A (transition model, acoustic model) pair.
Return type: Tuple[TransitionModel, AmNnetSimple]
read_symbols(symbols_filename)

Reads symbol table from file.

Returns: Symbol table.
Return type: SymbolTable
read_tree(tree_rxfilename)

Reads phonetic decision tree from an extended filename.

Returns: Phonetic decision tree.
Return type: ContextDependency
to_phone_alignment(alignment, phones=None)

Converts frame-level alignment to phone-level alignment.

Parameters:
  • alignment (List[int]) – Frame-level alignment.
  • phones (SymbolTable) – The phone symbol table. If provided, output includes symbols instead of integer indices.
Returns:

A list of triplets representing, for each phone in the input, the phone index/symbol, the begin time (in frames) and the duration (in frames).

Return type:

List[Tuple[int,int,int]]

to_word_alignment(best_path, word_boundary_info)

Converts best alignment path to word-level alignment.

Parameters:
  • best_path (CompactLattice) – Best alignment path.
  • word_boundary_info (WordBoundaryInfo) – Word boundary information.
Returns:

A list of triplets representing, for each word in the input, the word index/symbol, the begin time (in frames) and the duration (in frames). The zero/epsilon words correspond to optional silences.

Return type:

List[Tuple[int,int,int]]