kaldi.alignment
Classes
Aligner(transition_model, tree, lexicon[, …]) | Speech aligner.
GmmAligner(transition_model, acoustic_model, …) | GMM based speech aligner.
MappedAligner(transition_model, tree, lexicon) | Mapped speech aligner.
NnetAligner(transition_model, …[, …]) | Neural network based speech aligner.
-
class kaldi.alignment.Aligner(transition_model, tree, lexicon, symbols=None, disambig_symbols=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, acoustic_scale=0.1)
Speech aligner.
This can be used to align transition-id log-likelihood matrices with reference texts.
Parameters: - transition_model (TransitionModel) – The transition model.
- tree (ContextDependency) – The phonetic decision tree.
- lexicon (StdFst) – The lexicon FST.
- symbols (SymbolTable) – The symbol table. If provided, “text” input of align() should include symbols instead of integer indices.
- disambig_symbols (List[int]) – Disambiguation symbols.
- graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
- beam (float) – Decoding beam used in alignment.
- transition_scale (float) – The scale on non-self-loop transition probabilities.
- self_loop_scale (float) – The scale on self-loop transition probabilities.
- acoustic_scale (float) – Acoustic score scale.
-
align(input, text)
Aligns input with text.
Output is a dictionary with the following (key, value) pairs:
key | value | value type
“alignment” | Frame-level alignment | List[int]
“best_path” | Best lattice path | CompactLattice
“likelihood” | Log-likelihood of best path | float
“weight” | Cost of best path | LatticeWeight
If symbols is None, the “text” input should be a string of space-separated integer indices. Otherwise it should be a string of space-separated symbols. The “weight” output is a lattice weight consisting of (graph-score, acoustic-score).
Returns: A dictionary representing alignment output.
Raises: RuntimeError – If alignment fails.
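A minimal sketch of how the output dictionary and the RuntimeError contract described above might be used in a batch loop. The helper align_batch and the _StubAligner stand-in are hypothetical names introduced for illustration; any object exposing align(input, text) as documented here could be passed instead of the stub.

```python
def align_batch(aligner, utterances):
    """Align (key, input, text) triples, collecting failures instead of aborting."""
    alignments, failed = {}, []
    for key, feats, text in utterances:
        try:
            output = aligner.align(feats, text)
        except RuntimeError:
            # align() is documented to raise RuntimeError when alignment fails.
            failed.append(key)
            continue
        alignments[key] = output["alignment"]
    return alignments, failed


class _StubAligner:
    """Hypothetical stand-in that mimics the documented align() contract."""

    def align(self, input, text):
        if not text:
            raise RuntimeError("alignment failed")
        # One fake transition-id per input frame.
        return {"alignment": [1] * len(input), "likelihood": 0.0}


alignments, failed = align_batch(
    _StubAligner(),
    [("utt1", [[0.0], [0.0], [0.0]], "7 12"), ("utt2", [[0.0]], "")],
)
print(alignments, failed)
```

Collecting failed keys rather than re-raising keeps one bad utterance from stopping the whole batch; whether that is appropriate depends on the application.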
-
classmethod from_files(model_rxfilename, tree_rxfilename, lexicon_rxfilename, symbols_filename=None, disambig_rxfilename=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, acoustic_scale=0.1)
Constructs a new aligner from given files.
Parameters: - model_rxfilename (str) – Extended filename for reading the transition model.
- tree_rxfilename (str) – Extended filename for reading the phonetic decision tree.
- lexicon_rxfilename (str) – Extended filename for reading the lexicon FST.
- symbols_filename (str) – The symbols file. If provided, “text” input of align() should include symbols instead of integer indices.
- disambig_rxfilename (str) – Extended filename for reading the list of disambiguation symbols.
- graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
- beam (float) – Decoding beam used in alignment.
- transition_scale (float) – The scale on non-self-loop transition probabilities.
- self_loop_scale (float) – The scale on self-loop transition probabilities.
- acoustic_scale (float) – Acoustic score scale.
Returns: A new aligner object.
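The following sketch shows one way the from_files arguments could be assembled from two directories. The helper aligner_from_dir, the _RecordingAligner stand-in, and the file names (final.mdl, tree, L.fst, words.txt, phones/disambig.int) are assumptions based on a common Kaldi directory layout, not something this API requires.

```python
import os


def aligner_from_dir(aligner_cls, model_dir, lang_dir, **kwargs):
    """Call aligner_cls.from_files with paths built from two directories.

    The file names used here are assumed, not mandated by the API.
    """
    return aligner_cls.from_files(
        os.path.join(model_dir, "final.mdl"),
        os.path.join(model_dir, "tree"),
        os.path.join(lang_dir, "L.fst"),
        symbols_filename=os.path.join(lang_dir, "words.txt"),
        disambig_rxfilename=os.path.join(lang_dir, "phones", "disambig.int"),
        **kwargs,
    )


class _RecordingAligner:
    """Hypothetical stand-in that records the arguments it receives."""

    def __init__(self, args, kwargs):
        self.args, self.kwargs = args, kwargs

    @classmethod
    def from_files(cls, *args, **kwargs):
        return cls(args, kwargs)


stub = aligner_from_dir(_RecordingAligner, "exp/model", "data/lang", beam=100.0)
print(stub.args)
print(stub.kwargs)
```

Passing the class as an argument lets the same helper serve any aligner class whose from_files accepts these parameters.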
-
static read_disambig_symbols(disambig_rxfilename)
Reads disambiguation symbols from an extended filename.
Returns: List of disambiguation symbols.
Return type: List[int]
-
static read_lexicon(lexicon_rxfilename)
Reads lexicon FST from an extended filename.
Returns: Lexicon FST.
Return type: StdFst
-
static read_model(model_rxfilename)
Reads transition model from an extended filename.
Returns: Transition model.
Return type: TransitionModel
-
static read_symbols(symbols_filename)
Reads symbol table from file.
Returns: Symbol table.
Return type: SymbolTable
-
static read_tree(tree_rxfilename)
Reads phonetic decision tree from an extended filename.
Returns: Phonetic decision tree.
Return type: ContextDependency
-
to_phone_alignment(alignment, phones=None)
Converts frame-level alignment to phone-level alignment.
Parameters: - alignment (List[int]) – Frame-level alignment.
- phones (SymbolTable) – The phone symbol table. If provided, output includes symbols instead of integer indices.
Returns: A list of triplets representing, for each phone in the input, the phone index/symbol, the begin time (in frames) and the duration (in frames).
Return type:
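Since begin times and durations are reported in frames, a conversion to seconds is often the next step. This is a minimal sketch; the helper triplets_to_seconds is hypothetical, and the 0.01 s frame shift is an assumption that must match the feature extraction (and any frame subsampling) actually used.

```python
def triplets_to_seconds(triplets, frame_shift=0.01):
    """Convert (label, begin_frame, duration_frames) to (label, start_s, end_s)."""
    return [
        (
            label,
            round(begin * frame_shift, 3),
            round((begin + duration) * frame_shift, 3),
        )
        for label, begin, duration in triplets
    ]


# Example triplets in the documented (phone, begin, duration) layout.
phone_alignment = [("SIL", 0, 12), ("HH", 12, 5), ("AH", 17, 9)]
segments = triplets_to_seconds(phone_alignment)
print(segments)
```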
-
to_word_alignment(best_path, word_boundary_info)
Converts best alignment path to word-level alignment.
Parameters: - best_path (CompactLattice) – Best alignment path.
- word_boundary_info (WordBoundaryInfo) – Word boundary information.
Returns: A list of triplets representing, for each word in the input, the word index/symbol, the begin time (in frames) and the duration (in frames). The zero/epsilon words correspond to optional silences.
Return type:
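Because zero/epsilon entries mark optional silences, callers interested only in spoken words may want to filter them out. A minimal sketch, assuming the epsilon word appears either as index 0 or as the symbol "<eps>" (the usual convention in Kaldi word symbol tables, but an assumption here); drop_silences is a hypothetical helper.

```python
def drop_silences(word_alignment, epsilon_labels=(0, "<eps>")):
    """Keep only triplets whose word is not an epsilon (optional silence)."""
    return [t for t in word_alignment if t[0] not in epsilon_labels]


# Example triplets in the documented (word, begin, duration) layout.
word_alignment = [
    ("<eps>", 0, 10),
    ("hello", 10, 25),
    ("<eps>", 35, 3),
    ("world", 38, 30),
]
words_only = drop_silences(word_alignment)
print(words_only)
```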
-
class kaldi.alignment.MappedAligner(transition_model, tree, lexicon, symbols=None, disambig_symbols=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, acoustic_scale=0.1)
Mapped speech aligner.
This can be used to align phone-id log-likelihood matrices with reference texts.
Parameters: - transition_model (TransitionModel) – The transition model.
- tree (ContextDependency) – The phonetic decision tree.
- lexicon (StdFst) – The lexicon FST.
- symbols (SymbolTable) – The symbol table. If provided, “text” input of align() should include symbols instead of integer indices.
- disambig_symbols (List[int]) – Disambiguation symbols.
- graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
- beam (float) – Decoding beam used in alignment.
- transition_scale (float) – The scale on non-self-loop transition probabilities.
- self_loop_scale (float) – The scale on self-loop transition probabilities.
- acoustic_scale (float) – Acoustic score scale.
-
align(input, text)
Aligns input with text.
Output is a dictionary with the following (key, value) pairs:
key | value | value type
“alignment” | Frame-level alignment | List[int]
“best_path” | Best lattice path | CompactLattice
“likelihood” | Log-likelihood of best path | float
“weight” | Cost of best path | LatticeWeight
If symbols is None, the “text” input should be a string of space-separated integer indices. Otherwise it should be a string of space-separated symbols. The “weight” output is a lattice weight consisting of (graph-score, acoustic-score).
Returns: A dictionary representing alignment output.
Raises: RuntimeError – If alignment fails.
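When no symbol table is given, the reference text has to be supplied as space-separated integer indices. The sketch below shows one way to prepare such a string from a word transcript. The helper text_to_indices, the toy word_to_id mapping, and the "<unk>" fallback for out-of-vocabulary words are all assumptions made for illustration.

```python
def text_to_indices(text, word_to_id, oov_word="<unk>"):
    """Map a space-separated transcript to a space-separated index string."""
    oov_id = word_to_id[oov_word]
    return " ".join(str(word_to_id.get(word, oov_id)) for word in text.split())


# Toy vocabulary; a real one would come from the word symbol table file.
word_to_id = {"<eps>": 0, "<unk>": 1, "hello": 2, "world": 3}
indices = text_to_indices("hello brave world", word_to_id)
print(indices)
```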
-
from_files(model_rxfilename, tree_rxfilename, lexicon_rxfilename, symbols_filename=None, disambig_rxfilename=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, acoustic_scale=0.1)
Constructs a new aligner from given files.
Parameters: - model_rxfilename (str) – Extended filename for reading the transition model.
- tree_rxfilename (str) – Extended filename for reading the phonetic decision tree.
- lexicon_rxfilename (str) – Extended filename for reading the lexicon FST.
- symbols_filename (str) – The symbols file. If provided, “text” input of align() should include symbols instead of integer indices.
- disambig_rxfilename (str) – Extended filename for reading the list of disambiguation symbols.
- graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
- beam (float) – Decoding beam used in alignment.
- transition_scale (float) – The scale on non-self-loop transition probabilities.
- self_loop_scale (float) – The scale on self-loop transition probabilities.
- acoustic_scale (float) – Acoustic score scale.
Returns: A new aligner object.
-
read_disambig_symbols(disambig_rxfilename)
Reads disambiguation symbols from an extended filename.
Returns: List of disambiguation symbols.
Return type: List[int]
-
read_lexicon(lexicon_rxfilename)
Reads lexicon FST from an extended filename.
Returns: Lexicon FST.
Return type: StdFst
-
read_model(model_rxfilename)
Reads transition model from an extended filename.
Returns: Transition model.
Return type: TransitionModel
-
read_symbols(symbols_filename)
Reads symbol table from file.
Returns: Symbol table.
Return type: SymbolTable
-
read_tree(tree_rxfilename)
Reads phonetic decision tree from an extended filename.
Returns: Phonetic decision tree.
Return type: ContextDependency
-
to_phone_alignment(alignment, phones=None)
Converts frame-level alignment to phone-level alignment.
Parameters: - alignment (List[int]) – Frame-level alignment.
- phones (SymbolTable) – The phone symbol table. If provided, output includes symbols instead of integer indices.
Returns: A list of triplets representing, for each phone in the input, the phone index/symbol, the begin time (in frames) and the duration (in frames).
Return type:
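To illustrate the shape of the returned triplets, the sketch below collapses a per-frame label sequence into (label, begin, duration) runs. It is not the library's implementation: it assumes a hypothetical input in which each frame already carries a phone id, and plain run-length grouping cannot separate two adjacent instances of the same phone.

```python
from itertools import groupby


def collapse_runs(frame_labels):
    """Run-length encode per-frame labels into (label, begin, duration) triplets."""
    triplets, begin = [], 0
    for label, run in groupby(frame_labels):
        duration = len(list(run))
        triplets.append((label, begin, duration))
        begin += duration
    return triplets


# Hypothetical per-frame phone ids for a nine-frame utterance.
runs = collapse_runs([1, 1, 1, 5, 5, 9, 9, 9, 9])
print(runs)
```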
-
to_word_alignment(best_path, word_boundary_info)
Converts best alignment path to word-level alignment.
Parameters: - best_path (CompactLattice) – Best alignment path.
- word_boundary_info (WordBoundaryInfo) – Word boundary information.
Returns: A list of triplets representing, for each word in the input, the word index/symbol, the begin time (in frames) and the duration (in frames). The zero/epsilon words correspond to optional silences.
Return type:
-
class kaldi.alignment.GmmAligner(transition_model, acoustic_model, tree, lexicon, symbols=None, disambig_symbols=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, acoustic_scale=0.1)
GMM based speech aligner.
This can be used to align feature matrices with reference texts.
Parameters: - transition_model (TransitionModel) – The transition model.
- acoustic_model (AmDiagGmm) – The acoustic model.
- tree (ContextDependency) – The phonetic decision tree.
- lexicon (StdFst) – The lexicon FST.
- symbols (SymbolTable) – The symbol table. If provided, “text” input of align() should include symbols instead of integer indices.
- disambig_symbols (List[int]) – Disambiguation symbols.
- graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
- beam (float) – Decoding beam used in alignment.
- transition_scale (float) – The scale on non-self-loop transition probabilities.
- self_loop_scale (float) – The scale on self-loop transition probabilities.
- acoustic_scale (float) – Acoustic score scale.
-
align(input, text)
Aligns input with text.
Output is a dictionary with the following (key, value) pairs:
key | value | value type
“alignment” | Frame-level alignment | List[int]
“best_path” | Best lattice path | CompactLattice
“likelihood” | Log-likelihood of best path | float
“weight” | Cost of best path | LatticeWeight
If symbols is None, the “text” input should be a string of space-separated integer indices. Otherwise it should be a string of space-separated symbols. The “weight” output is a lattice weight consisting of (graph-score, acoustic-score).
Returns: A dictionary representing alignment output.
Raises: RuntimeError – If alignment fails.
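The “weight” and “likelihood” outputs are closely related. The sketch below assumes that both weight components are costs (negated log-probabilities, as is conventional for Kaldi lattice weights) and that the path log-likelihood is the negated sum of the two; that relation is an assumption here, not something stated above. Both helper names are hypothetical, and the costs are passed as plain floats rather than read from a LatticeWeight object.

```python
def path_log_likelihood(graph_cost, acoustic_cost):
    """Negated total cost, assuming both components are negated log-probabilities."""
    return -(graph_cost + acoustic_cost)


def per_frame_log_likelihood(graph_cost, acoustic_cost, alignment):
    """Average log-likelihood per frame, a common sanity check on alignments."""
    return path_log_likelihood(graph_cost, acoustic_cost) / len(alignment)


# Hypothetical costs and a ten-frame alignment.
alignment = [3, 3, 8, 8, 8, 14, 14, 20, 20, 20]
total = path_log_likelihood(12.5, 30.0)
average = per_frame_log_likelihood(12.5, 30.0, alignment)
print(total, average)
```

Normalizing by the number of frames makes values comparable across utterances of different lengths.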
-
classmethod from_files(model_rxfilename, tree_rxfilename, lexicon_rxfilename, symbols_filename=None, disambig_rxfilename=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, acoustic_scale=0.1)
Constructs a new GMM aligner from given files.
Parameters: - model_rxfilename (str) – Extended filename for reading the model.
- tree_rxfilename (str) – Extended filename for reading the phonetic decision tree.
- lexicon_rxfilename (str) – Extended filename for reading the lexicon FST.
- symbols_filename (str) – The symbols file. If provided, “text” input of align() should include symbols instead of integer indices.
- disambig_rxfilename (str) – Extended filename for reading the list of disambiguation symbols.
- graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
- beam (float) – Decoding beam used in alignment.
- transition_scale (float) – The scale on non-self-loop transition probabilities.
- self_loop_scale (float) – The scale on self-loop transition probabilities.
- acoustic_scale (float) – Acoustic score scale.
Returns: A new aligner object.
-
read_disambig_symbols(disambig_rxfilename)
Reads disambiguation symbols from an extended filename.
Returns: List of disambiguation symbols.
Return type: List[int]
-
read_lexicon(lexicon_rxfilename)
Reads lexicon FST from an extended filename.
Returns: Lexicon FST.
Return type: StdFst
-
static read_model(model_rxfilename)
Reads model from an extended filename.
Returns: A (transition model, acoustic model) pair.
Return type: Tuple[TransitionModel, AmDiagGmm]
-
read_symbols(symbols_filename)
Reads symbol table from file.
Returns: Symbol table.
Return type: SymbolTable
-
read_tree(tree_rxfilename)
Reads phonetic decision tree from an extended filename.
Returns: Phonetic decision tree.
Return type: ContextDependency
-
to_phone_alignment(alignment, phones=None)
Converts frame-level alignment to phone-level alignment.
Parameters: - alignment (List[int]) – Frame-level alignment.
- phones (SymbolTable) – The phone symbol table. If provided, output includes symbols instead of integer indices.
Returns: A list of triplets representing, for each phone in the input, the phone index/symbol, the begin time (in frames) and the duration (in frames).
Return type:
-
to_word_alignment(best_path, word_boundary_info)
Converts best alignment path to word-level alignment.
Parameters: - best_path (CompactLattice) – Best alignment path.
- word_boundary_info (WordBoundaryInfo) – Word boundary information.
Returns: A list of triplets representing, for each word in the input, the word index/symbol, the begin time (in frames) and the duration (in frames). The zero/epsilon words correspond to optional silences.
Return type:
-
class kaldi.alignment.NnetAligner(transition_model, acoustic_model, tree, lexicon, symbols=None, disambig_symbols=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, decodable_opts=None, online_ivector_period=10)
Neural network based speech aligner.
This can be used to align feature matrices with reference texts.
Parameters: - transition_model (TransitionModel) – The transition model.
- acoustic_model (AmNnetSimple) – The acoustic model.
- tree (ContextDependency) – The phonetic decision tree.
- lexicon (StdFst) – The lexicon FST.
- symbols (SymbolTable) – The symbol table. If provided, “text” input of align() should include symbols instead of integer indices.
- disambig_symbols (List[int]) – Disambiguation symbols.
- graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
- beam (float) – Decoding beam used in alignment.
- transition_scale (float) – The scale on non-self-loop transition probabilities.
- self_loop_scale (float) – The scale on self-loop transition probabilities.
- decodable_opts (NnetSimpleComputationOptions) – Configuration options for simple nnet3 am decodable objects.
- online_ivector_period (int) – Online ivector period. Relevant only if online ivectors are used.
-
align(input, text)
Aligns input with text.
Output is a dictionary with the following (key, value) pairs:
key | value | value type
“alignment” | Frame-level alignment | List[int]
“best_path” | Best lattice path | CompactLattice
“likelihood” | Log-likelihood of best path | float
“weight” | Cost of best path | LatticeWeight
If symbols is None, the “text” input should be a string of space-separated integer indices. Otherwise it should be a string of space-separated symbols. The “weight” output is a lattice weight consisting of (graph-score, acoustic-score).
Returns: A dictionary representing alignment output.
Raises: RuntimeError – If alignment fails.
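The frame-level “alignment” list is plain Python data, so it can be serialized without any library support. The sketch below formats one utterance per line as a key followed by space-separated integers, the layout Kaldi uses for text archives of integer vectors. The helper format_alignment and the utterance key are hypothetical.

```python
def format_alignment(key, alignment):
    """Format an utterance key and its frame-level alignment as one text line."""
    return "{} {}".format(key, " ".join(map(str, alignment)))


line = format_alignment("utt1", [2, 4, 4, 6])
print(line)
```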
-
classmethod from_files(model_rxfilename, tree_rxfilename, lexicon_rxfilename, symbols_filename=None, disambig_rxfilename=None, graph_compiler_opts=None, beam=200.0, transition_scale=1.0, self_loop_scale=1.0, decodable_opts=None, online_ivector_period=10)
Constructs a new nnet3 aligner from given files.
Parameters: - model_rxfilename (str) – Extended filename for reading the model.
- tree_rxfilename (str) – Extended filename for reading the phonetic decision tree.
- lexicon_rxfilename (str) – Extended filename for reading the lexicon FST.
- symbols_filename (str) – The symbols file. If provided, “text” input of align() should include symbols instead of integer indices.
- disambig_rxfilename (str) – Extended filename for reading the list of disambiguation symbols.
- graph_compiler_opts (TrainingGraphCompilerOptions) – Configuration options for graph compiler.
- beam (float) – Decoding beam used in alignment.
- transition_scale (float) – The scale on non-self-loop transition probabilities.
- self_loop_scale (float) – The scale on self-loop transition probabilities.
- decodable_opts (NnetSimpleComputationOptions) – Configuration options for simple nnet3 am decodable objects.
- online_ivector_period (int) – Online ivector period. Relevant only if online ivectors are used.
Returns: A new aligner object.
-
read_disambig_symbols(disambig_rxfilename)
Reads disambiguation symbols from an extended filename.
Returns: List of disambiguation symbols.
Return type: List[int]
-
read_lexicon(lexicon_rxfilename)
Reads lexicon FST from an extended filename.
Returns: Lexicon FST.
Return type: StdFst
-
static read_model(model_rxfilename)
Reads model from an extended filename.
Returns: A (transition model, acoustic model) pair.
Return type: Tuple[TransitionModel, AmNnetSimple]
-
read_symbols(symbols_filename)
Reads symbol table from file.
Returns: Symbol table.
Return type: SymbolTable
-
read_tree(tree_rxfilename)
Reads phonetic decision tree from an extended filename.
Returns: Phonetic decision tree.
Return type: ContextDependency
-
to_phone_alignment(alignment, phones=None)
Converts frame-level alignment to phone-level alignment.
Parameters: - alignment (List[int]) – Frame-level alignment.
- phones (SymbolTable) – The phone symbol table. If provided, output includes symbols instead of integer indices.
Returns: A list of triplets representing, for each phone in the input, the phone index/symbol, the begin time (in frames) and the duration (in frames).
Return type:
-
to_word_alignment(best_path, word_boundary_info)
Converts best alignment path to word-level alignment.
Parameters: - best_path (CompactLattice) – Best alignment path.
- word_boundary_info (WordBoundaryInfo) – Word boundary information.
Returns: A list of triplets representing, for each word in the input, the word index/symbol, the begin time (in frames) and the duration (in frames). The zero/epsilon words correspond to optional silences.
Return type: