kaldi.alignment

Classes

`Aligner`(transition_model, tree, lexicon[, …])	Speech aligner.
`GmmAligner`(transition_model, acoustic_model, …)	GMM based speech aligner.
`MappedAligner`(transition_model, tree, lexicon)	Mapped speech aligner.
`NnetAligner`(transition_model, …[, …])	Neural network based speech aligner.

class kaldi.alignment.Aligner(transition_model, tree, lexicon, symbols=None, disambig_symbols=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, acoustic_scale=0.1)

Speech aligner.

This can be used to align transition-id log-likelihood matrices with reference texts.

Parameters:

transition_model (TransitionModel) – The transition model.
tree (ContextDependency) – The phonetic decision tree.
lexicon (StdFst) – The lexicon FST.
symbols (SymbolTable) – The symbol table. If provided, “text” input of align() should include symbols instead of integer indices.
disambig_symbols (List[int]) – Disambiguation symbols.
graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
beam (float) – Decoding beam used in alignment.
transition_scale (float) – The scale on non-self-loop transition probabilities.
self_loop_scale (float) – The scale on self-loop transition probabilities.
acoustic_scale (float) – Acoustic score scale.

align(input, text)

Aligns input with text.

Output is a dictionary with the following (key, value) pairs:

key	value	value type
“alignment”	Frame-level alignment	`List[int]`
“best_path”	Best lattice path	`CompactLattice`
“likelihood”	Log-likelihood of best path	`float`
“weight”	Cost of best path	`LatticeWeight`

If symbols is None, the “text” input should be a string of space-separated integer indices. Otherwise it should be a string of space-separated symbols. The “weight” output is a lattice weight consisting of (graph-score, acoustic-score).

Parameters:	input (object) – Input to align. text (str) – Reference text to align.
Returns:	A dictionary representing alignment output.
Raises:	`RuntimeError` – If alignment fails.
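The text-format rule above can be illustrated with a minimal sketch. When no symbol table is passed to the aligner, the reference has to be mapped to integer indices before calling align(). The `word_to_id` dictionary below is a hypothetical stand-in for the contents of a symbols file, and `to_index_text` is an illustrative helper, not part of kaldi.alignment.

```python
def to_index_text(words, word_to_id):
    """Map a space-separated word string to a space-separated index string."""
    ids = []
    for word in words.split():
        if word not in word_to_id:
            # Fail early rather than passing an unknown word to the aligner.
            raise KeyError("word not in symbol table: " + word)
        ids.append(str(word_to_id[word]))
    return " ".join(ids)


# Hypothetical symbol table contents, for illustration only.
word_to_id = {"<eps>": 0, "hello": 1, "world": 2}

index_text = to_index_text("hello world", word_to_id)
print(index_text)  # 1 2
```

If a symbol table is supplied when the aligner is constructed, this step is unnecessary and the plain word string can be passed as `text`.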

classmethod from_files(model_rxfilename, tree_rxfilename, lexicon_rxfilename, symbols_filename=None, disambig_rxfilename=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, acoustic_scale=0.1)

Constructs a new aligner from given files.

Parameters:

model_rxfilename (str) – Extended filename for reading the transition model.
tree_rxfilename (str) – Extended filename for reading the phonetic decision tree.
lexicon_rxfilename (str) – Extended filename for reading the lexicon FST.
symbols_filename (str) – The symbols file. If provided, “text” input of align() should include symbols instead of integer indices.
disambig_rxfilename (str) – Extended filename for reading the list of disambiguation symbols.
graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
beam (float) – Decoding beam used in alignment.
transition_scale (float) – The scale on non-self-loop transition probabilities.
self_loop_scale (float) – The scale on self-loop transition probabilities.
acoustic_scale (float) – Acoustic score scale.

Returns:

A new aligner object.

static read_disambig_symbols(disambig_rxfilename)

Reads disambiguation symbols from an extended filename.

Returns:	List of disambiguation symbols.
Return type:	List[int]

static read_lexicon(lexicon_rxfilename)

Reads lexicon FST from an extended filename.

Returns:	Lexicon FST.
Return type:	StdFst

static read_model(model_rxfilename)

Reads transition model from an extended filename.

Returns:	Transition model.
Return type:	TransitionModel

static read_symbols(symbols_filename)

Reads symbol table from file.

Returns:	Symbol table.
Return type:	SymbolTable

static read_tree(tree_rxfilename)

Reads phonetic decision tree from an extended filename.

Returns:	Phonetic decision tree.
Return type:	ContextDependency

to_phone_alignment(alignment, phones=None)

Converts frame-level alignment to phone-level alignment.

Parameters:	alignment (List[int]) – Frame-level alignment. phones (SymbolTable) – The phone symbol table. If provided, output includes symbols instead of integer indices.
Returns:	A list of triplets representing, for each phone in the input, the phone index/symbol, the begin time (in frames) and the duration (in frames).
Return type:	List[Tuple[int,int,int]]
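The shape of the returned triplets can be shown with a simplified sketch. It assumes a per-frame sequence of phone labels is already available and collapses runs of identical labels into (label, begin, duration) triplets. This is not how to_phone_alignment() finds phone boundaries (it works on the frame-level alignment produced by align()), and unlike the real method this sketch merges two adjacent instances of the same phone. `runs_to_triplets` is a hypothetical helper.

```python
from itertools import groupby


def runs_to_triplets(frame_labels):
    """Collapse a per-frame label sequence into (label, begin, duration) triplets."""
    triplets = []
    begin = 0
    for label, run in groupby(frame_labels):
        duration = len(list(run))
        triplets.append((label, begin, duration))
        begin += duration
    return triplets


# Hypothetical per-frame phone ids for a 10-frame utterance.
frame_phones = [1, 1, 1, 7, 7, 4, 4, 4, 4, 1]
phone_alignment = runs_to_triplets(frame_phones)
print(phone_alignment)
# [(1, 0, 3), (7, 3, 2), (4, 5, 4), (1, 9, 1)]
```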

to_word_alignment(best_path, word_boundary_info)

Converts best alignment path to word-level alignment.

Parameters:	best_path (CompactLattice) – Best alignment path. word_boundary_info (WordBoundaryInfo) – Word boundary information.
Returns:	A list of triplets representing, for each word in the input, the word index/symbol, the begin time (in frames) and the duration (in frames). The zero/epsilon words correspond to optional silences.
Return type:	List[Tuple[int,int,int]]
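Since begin times and durations are reported in frames, a common follow-up step is converting them to seconds. The sketch below assumes a 10 ms frame shift, which depends on the feature configuration and on any frame subsampling, and drops the zero/epsilon entries that mark optional silences. `triplets_to_seconds` is a hypothetical helper, and the sample triplets are made up.

```python
def triplets_to_seconds(triplets, frame_shift=0.01, skip=(0, "<eps>")):
    """Convert (label, begin, duration) frame triplets to (label, start_s, end_s)."""
    segments = []
    for label, begin, duration in triplets:
        if label in skip:
            # Zero/epsilon words correspond to optional silences.
            continue
        start_s = round(begin * frame_shift, 3)
        end_s = round((begin + duration) * frame_shift, 3)
        segments.append((label, start_s, end_s))
    return segments


# Hypothetical word-level alignment: silence, word 15, silence, word 8.
word_alignment = [(0, 0, 12), (15, 12, 30), (0, 42, 5), (8, 47, 25)]
segments = triplets_to_seconds(word_alignment)
print(segments)
# [(15, 0.12, 0.42), (8, 0.47, 0.72)]
```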

class kaldi.alignment.MappedAligner(transition_model, tree, lexicon, symbols=None, disambig_symbols=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, acoustic_scale=0.1)

Mapped speech aligner.

This can be used to align phone-id log-likelihood matrices with reference texts.

Parameters:

transition_model (TransitionModel) – The transition model.
tree (ContextDependency) – The phonetic decision tree.
lexicon (StdFst) – The lexicon FST.
symbols (SymbolTable) – The symbol table. If provided, “text” input of align() should include symbols instead of integer indices.
disambig_symbols (List[int]) – Disambiguation symbols.
graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
beam (float) – Decoding beam used in alignment.
transition_scale (float) – The scale on non-self-loop transition probabilities.
self_loop_scale (float) – The scale on self-loop transition probabilities.
acoustic_scale (float) – Acoustic score scale.

align(input, text)

Aligns input with text.

Output is a dictionary with the following (key, value) pairs:

key	value	value type
“alignment”	Frame-level alignment	`List[int]`
“best_path”	Best lattice path	`CompactLattice`
“likelihood”	Log-likelihood of best path	`float`
“weight”	Cost of best path	`LatticeWeight`

If symbols is None, the “text” input should be a string of space-separated integer indices. Otherwise it should be a string of space-separated symbols. The “weight” output is a lattice weight consisting of (graph-score, acoustic-score).

Parameters:	input (object) – Input to align. text (str) – Reference text to align.
Returns:	A dictionary representing alignment output.
Raises:	`RuntimeError` – If alignment fails.
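As a small illustration of consuming the output dictionary, the sketch below computes the number of aligned frames and the average log-likelihood per frame from the “alignment” and “likelihood” entries. The `output` dictionary is a made-up stand-in for what align() returns, and `summarize_alignment` is a hypothetical helper.

```python
def summarize_alignment(output):
    """Return the frame count and the average log-likelihood per frame."""
    num_frames = len(output["alignment"])
    if num_frames == 0:
        return {"num_frames": 0, "avg_loglike": None}
    return {
        "num_frames": num_frames,
        "avg_loglike": output["likelihood"] / num_frames,
    }


# Hypothetical output; real values come from align().
output = {
    "alignment": [2, 2, 8, 8, 8, 14, 14, 14],
    "best_path": None,
    "likelihood": -42.0,
    "weight": None,
}
summary = summarize_alignment(output)
print(summary)  # {'num_frames': 8, 'avg_loglike': -5.25}
```

A per-frame average is easier to compare across utterances of different lengths than the raw “likelihood” value.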

from_files(model_rxfilename, tree_rxfilename, lexicon_rxfilename, symbols_filename=None, disambig_rxfilename=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, acoustic_scale=0.1)

Constructs a new aligner from given files.

Parameters:

model_rxfilename (str) – Extended filename for reading the transition model.
tree_rxfilename (str) – Extended filename for reading the phonetic decision tree.
lexicon_rxfilename (str) – Extended filename for reading the lexicon FST.
symbols_filename (str) – The symbols file. If provided, “text” input of align() should include symbols instead of integer indices.
disambig_rxfilename (str) – Extended filename for reading the list of disambiguation symbols.
graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
beam (float) – Decoding beam used in alignment.
transition_scale (float) – The scale on non-self-loop transition probabilities.
self_loop_scale (float) – The scale on self-loop transition probabilities.
acoustic_scale (float) – Acoustic score scale.

Returns:

A new aligner object.

read_disambig_symbols(disambig_rxfilename)

Reads disambiguation symbols from an extended filename.

Returns:	List of disambiguation symbols.
Return type:	List[int]

read_lexicon(lexicon_rxfilename)

Reads lexicon FST from an extended filename.

Returns:	Lexicon FST.
Return type:	StdFst

read_model(model_rxfilename)

Reads transition model from an extended filename.

Returns:	Transition model.
Return type:	TransitionModel

read_symbols(symbols_filename)

Reads symbol table from file.

Returns:	Symbol table.
Return type:	SymbolTable

read_tree(tree_rxfilename)

Reads phonetic decision tree from an extended filename.

Returns:	Phonetic decision tree.
Return type:	ContextDependency

to_phone_alignment(alignment, phones=None)

Converts frame-level alignment to phone-level alignment.

Parameters:	alignment (List[int]) – Frame-level alignment. phones (SymbolTable) – The phone symbol table. If provided, output includes symbols instead of integer indices.
Returns:	A list of triplets representing, for each phone in the input, the phone index/symbol, the begin time (in frames) and the duration (in frames).
Return type:	List[Tuple[int,int,int]]

to_word_alignment(best_path, word_boundary_info)

Converts best alignment path to word-level alignment.

Parameters:	best_path (CompactLattice) – Best alignment path. word_boundary_info (WordBoundaryInfo) – Word boundary information.
Returns:	A list of triplets representing, for each word in the input, the word index/symbol, the begin time (in frames) and the duration (in frames). The zero/epsilon words correspond to optional silences.
Return type:	List[Tuple[int,int,int]]

class kaldi.alignment.GmmAligner(transition_model, acoustic_model, tree, lexicon, symbols=None, disambig_symbols=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, acoustic_scale=0.1)

GMM based speech aligner.

This can be used to align feature matrices with reference texts.

Parameters:

transition_model (TransitionModel) – The transition model.
acoustic_model (AmDiagGmm) – The acoustic model.
tree (ContextDependency) – The phonetic decision tree.
lexicon (StdFst) – The lexicon FST.
symbols (SymbolTable) – The symbol table. If provided, “text” input of align() should include symbols instead of integer indices.
disambig_symbols (List[int]) – Disambiguation symbols.
graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
beam (float) – Decoding beam used in alignment.
transition_scale (float) – The scale on non-self-loop transition probabilities.
self_loop_scale (float) – The scale on self-loop transition probabilities.
acoustic_scale (float) – Acoustic score scale.

align(input, text)

Aligns input with text.

Output is a dictionary with the following (key, value) pairs:

key	value	value type
“alignment”	Frame-level alignment	`List[int]`
“best_path”	Best lattice path	`CompactLattice`
“likelihood”	Log-likelihood of best path	`float`
“weight”	Cost of best path	`LatticeWeight`

If symbols is None, the “text” input should be a string of space-separated integer indices. Otherwise it should be a string of space-separated symbols. The “weight” output is a lattice weight consisting of (graph-score, acoustic-score).

Parameters:	input (object) – Input to align. text (str) – Reference text to align.
Returns:	A dictionary representing alignment output.
Raises:	`RuntimeError` – If alignment fails.
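Because align() raises `RuntimeError` when an utterance cannot be aligned, batch processing usually wraps each call so that one failure does not stop the run. The sketch below relies only on the `align(input, text)` interface described above. `align_all` is a hypothetical helper, and `FakeAligner` is a stand-in used to exercise the loop without models; it is not a PyKaldi class.

```python
def align_all(aligner, feats_by_utt, text_by_utt):
    """Align every utterance; collect failures instead of aborting."""
    results, failed = {}, []
    for utt, feats in feats_by_utt.items():
        text = text_by_utt.get(utt)
        if text is None:
            # No reference text for this utterance.
            failed.append(utt)
            continue
        try:
            results[utt] = aligner.align(feats, text)
        except RuntimeError:
            failed.append(utt)
    return results, failed


class FakeAligner:
    """Stand-in aligner: fails on empty text, otherwise returns dummy output."""

    def align(self, input, text):
        if not text:
            raise RuntimeError("alignment failed")
        return {
            "alignment": [0] * len(input),
            "best_path": None,
            "likelihood": 0.0,
            "weight": None,
        }


feats_by_utt = {"utt1": [[0.0]] * 3, "utt2": [[0.0]] * 2}
text_by_utt = {"utt1": "hello world", "utt2": ""}
results, failed = align_all(FakeAligner(), feats_by_utt, text_by_utt)
print(sorted(results), failed)  # ['utt1'] ['utt2']
```

With a real aligner, `feats_by_utt` would hold feature matrices and the values in `results` would be the dictionaries described above.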

classmethod from_files(model_rxfilename, tree_rxfilename, lexicon_rxfilename, symbols_filename=None, disambig_rxfilename=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, acoustic_scale=0.1)

Constructs a new GMM aligner from given files.

Parameters:

model_rxfilename (str) – Extended filename for reading the model.
tree_rxfilename (str) – Extended filename for reading the phonetic decision tree.
lexicon_rxfilename (str) – Extended filename for reading the lexicon FST.
symbols_filename (str) – The symbols file. If provided, “text” input of align() should include symbols instead of integer indices.
disambig_rxfilename (str) – Extended filename for reading the list of disambiguation symbols.
graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
beam (float) – Decoding beam used in alignment.
transition_scale (float) – The scale on non-self-loop transition probabilities.
self_loop_scale (float) – The scale on self-loop transition probabilities.
acoustic_scale (float) – Acoustic score scale.

Returns:

A new aligner object.
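A minimal sketch of calling from_files() with the keyword arguments listed above follows. The directory layout and file names (`final.mdl`, `tree`, `L.fst`, `words.txt`, `phones/disambig.int`) are assumptions based on a conventional Kaldi recipe layout, and `aligner_paths` and `build_gmm_aligner` are hypothetical helpers. The PyKaldi import is deferred to call time so the path logic can be checked without PyKaldi installed.

```python
import os


def aligner_paths(model_dir, lang_dir):
    """Build from_files() keyword arguments from an assumed directory layout."""
    return {
        "model_rxfilename": os.path.join(model_dir, "final.mdl"),
        "tree_rxfilename": os.path.join(model_dir, "tree"),
        "lexicon_rxfilename": os.path.join(lang_dir, "L.fst"),
        "symbols_filename": os.path.join(lang_dir, "words.txt"),
        "disambig_rxfilename": os.path.join(lang_dir, "phones", "disambig.int"),
    }


def build_gmm_aligner(model_dir, lang_dir, **options):
    """Construct a GmmAligner; extra options (e.g. beam) are passed through."""
    # Imported lazily: requires PyKaldi at call time, not at import time.
    from kaldi.alignment import GmmAligner

    return GmmAligner.from_files(**aligner_paths(model_dir, lang_dir), **options)


# Hypothetical directories, for illustration only.
paths = aligner_paths("exp/tri3_ali", "data/lang")
print(paths["model_rxfilename"])
```

Because a symbols file is passed here, the `text` argument of align() would then be a string of space-separated symbols rather than integer indices.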

read_disambig_symbols(disambig_rxfilename)

Reads disambiguation symbols from an extended filename.

Returns:	List of disambiguation symbols.
Return type:	List[int]

read_lexicon(lexicon_rxfilename)

Reads lexicon FST from an extended filename.

Returns:	Lexicon FST.
Return type:	StdFst

static read_model(model_rxfilename)

Reads model from an extended filename.

Returns:	A (transition model, acoustic model) pair.
Return type:	Tuple[TransitionModel, AmDiagGmm]

read_symbols(symbols_filename)

Reads symbol table from file.

Returns:	Symbol table.
Return type:	SymbolTable

read_tree(tree_rxfilename)

Reads phonetic decision tree from an extended filename.

Returns:	Phonetic decision tree.
Return type:	ContextDependency

to_phone_alignment(alignment, phones=None)

Converts frame-level alignment to phone-level alignment.

Parameters:	alignment (List[int]) – Frame-level alignment. phones (SymbolTable) – The phone symbol table. If provided, output includes symbols instead of integer indices.
Returns:	A list of triplets representing, for each phone in the input, the phone index/symbol, the begin time (in frames) and the duration (in frames).
Return type:	List[Tuple[int,int,int]]

to_word_alignment(best_path, word_boundary_info)

Converts best alignment path to word-level alignment.

Parameters:	best_path (CompactLattice) – Best alignment path. word_boundary_info (WordBoundaryInfo) – Word boundary information.
Returns:	A list of triplets representing, for each word in the input, the word index/symbol, the begin time (in frames) and the duration (in frames). The zero/epsilon words correspond to optional silences.
Return type:	List[Tuple[int,int,int]]

class kaldi.alignment.NnetAligner(transition_model, acoustic_model, tree, lexicon, symbols=None, disambig_symbols=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, decodable_opts=None, online_ivector_period=10)

Neural network based speech aligner.

This can be used to align feature matrices with reference texts.

Parameters:

transition_model (TransitionModel) – The transition model.
acoustic_model (AmNnetSimple) – The acoustic model.
tree (ContextDependency) – The phonetic decision tree.
lexicon (StdFst) – The lexicon FST.
symbols (SymbolTable) – The symbol table. If provided, “text” input of align() should include symbols instead of integer indices.
disambig_symbols (List[int]) – Disambiguation symbols.
graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
beam (float) – Decoding beam used in alignment.
transition_scale (float) – The scale on non-self-loop transition probabilities.
self_loop_scale (float) – The scale on self-loop transition probabilities.
decodable_opts (NnetSimpleComputationOptions) – Configuration options for simple nnet3 acoustic model decodable objects.
online_ivector_period (int) – Online ivector period. Relevant only if online ivectors are used.

align(input, text)

Aligns input with text.

Output is a dictionary with the following (key, value) pairs:

key	value	value type
“alignment”	Frame-level alignment	`List[int]`
“best_path”	Best lattice path	`CompactLattice`
“likelihood”	Log-likelihood of best path	`float`
“weight”	Cost of best path	`LatticeWeight`

If symbols is None, the “text” input should be a string of space-separated integer indices. Otherwise it should be a string of space-separated symbols. The “weight” output is a lattice weight consisting of (graph-score, acoustic-score).

Parameters:	input (object) – Input to align. text (str) – Reference text to align.
Returns:	A dictionary representing alignment output.
Raises:	`RuntimeError` – If alignment fails.

classmethod from_files(model_rxfilename, tree_rxfilename, lexicon_rxfilename, symbols_filename=None, disambig_rxfilename=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, decodable_opts=None, online_ivector_period=10)

Constructs a new nnet3 aligner from given files.

Parameters:

model_rxfilename (str) – Extended filename for reading the model.
tree_rxfilename (str) – Extended filename for reading the phonetic decision tree.
lexicon_rxfilename (str) – Extended filename for reading the lexicon FST.
symbols_filename (str) – The symbols file. If provided, “text” input of align() should include symbols instead of integer indices.
disambig_rxfilename (str) – Extended filename for reading the list of disambiguation symbols.
graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
beam (float) – Decoding beam used in alignment.
transition_scale (float) – The scale on non-self-loop transition probabilities.
self_loop_scale (float) – The scale on self-loop transition probabilities.
decodable_opts (NnetSimpleComputationOptions) – Configuration options for simple nnet3 acoustic model decodable objects.
online_ivector_period (int) – Online ivector period. Relevant only if online ivectors are used.

Returns:

A new aligner object.

read_disambig_symbols(disambig_rxfilename)

Reads disambiguation symbols from an extended filename.

Returns:	List of disambiguation symbols.
Return type:	List[int]

read_lexicon(lexicon_rxfilename)

Reads lexicon FST from an extended filename.

Returns:	Lexicon FST.
Return type:	StdFst

static read_model(model_rxfilename)

Reads model from an extended filename.

Returns:	A (transition model, acoustic model) pair.
Return type:	Tuple[TransitionModel, AmNnetSimple]

read_symbols(symbols_filename)

Reads symbol table from file.

Returns:	Symbol table.
Return type:	SymbolTable

read_tree(tree_rxfilename)

Reads phonetic decision tree from an extended filename.

Returns:	Phonetic decision tree.
Return type:	ContextDependency

to_phone_alignment(alignment, phones=None)

Converts frame-level alignment to phone-level alignment.

Parameters:	alignment (List[int]) – Frame-level alignment. phones (SymbolTable) – The phone symbol table. If provided, output includes symbols instead of integer indices.
Returns:	A list of triplets representing, for each phone in the input, the phone index/symbol, the begin time (in frames) and the duration (in frames).
Return type:	List[Tuple[int,int,int]]

to_word_alignment(best_path, word_boundary_info)

Converts best alignment path to word-level alignment.

Parameters:	best_path (CompactLattice) – Best alignment path. word_boundary_info (WordBoundaryInfo) – Word boundary information.
Returns:	A list of triplets representing, for each word in the input, the word index/symbol, the begin time (in frames) and the duration (in frames). The zero/epsilon words correspond to optional silences.
Return type:	List[Tuple[int,int,int]]