kaldi.rnnlm
Functions
check_distribution              Validates a distribution.
get_rnnlm_computation_request   Creates a computation request for the given RNNLM example.
get_rnnlm_example_derived       Constructs a derived RNNLM example.
merge_distributions             Merges two distributions.
process_rnnlm_output            Processes the output of RNNLM computation.
read_sparse_word_features       Reads sparse word features from input stream.
renumber_rnnlm_example          Renumbers word-ids in a minibatch.
sample_without_replacement      Samples without replacement from a distribution.
total_of_distribution           Returns the sum of the elements of a distribution.
Classes
KaldiRnnlmDeterministicFst            Deterministic on demand RNNLM FST.
RnnlmComputeState                     RNNLM computation state.
RnnlmComputeStateComputationOptions   Options for RNNLM compute state.
RnnlmComputeStateInfo                 State information for RNNLM computation.
RnnlmCoreComputer                     Core RNNLM computer.
RnnlmCoreTrainer                      Core RNNLM trainer.
RnnlmCoreTrainerOptions               Options for core RNNLM training.
RnnlmEgsConfig                        RNNLM example configuration.
RnnlmEmbeddingTrainer                 RNNLM embedding trainer.
RnnlmEmbeddingTrainerOptions          Options for RNNLM embedding training.
RnnlmExample                          A single minibatch for training an RNNLM.
RnnlmExampleCreator                   RNNLM example creator.
RnnlmExampleDerived                   Various quantities/expressions derived from an RNNLM example.
RnnlmExampleSampler                   RNNLM example sampler.
RnnlmObjectiveOptions                 Options for RNNLM objective function.
RnnlmTrainer                          RNNLM trainer.
Sampler                               Word sampler.
SamplingLm                            Sampling LM.
SamplingLmEstimator                   Sampling LM estimator.
SamplingLmEstimatorOptions            Options for sampling LM estimator.
class kaldi.rnnlm.KaldiRnnlmDeterministicFst

Deterministic on demand RNNLM FST.

Parameters:
- max_ngram_order (int) – Maximum ngram order.
- info (RnnlmComputeStateInfo) – State information for RNNLM computation.

clear()
Clears the internal maps. This method is similar to the destructor, but we retain the 0-th entry in each map, which corresponds to the <bos> state.

final(state: int) → TropicalWeight
Returns the final weight of the given state.

get_arc(s: int, ilabel: int) -> (success: bool, oarc: StdArc)
Creates an on demand arc and returns it.
Parameters: s (int) – The source state. ilabel (int) – The input label.
Returns: Whether the arc was created, and the created arc.

start() → int
Returns the start state index.
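The following is a minimal usage sketch, not part of the original documentation: it walks the on-demand FST over a word sequence and accumulates the arc weights. It assumes info is an already constructed RnnlmComputeStateInfo, that word_ids use the RNNLM word list, and that arc.weight.value and arc.nextstate follow the usual PyKaldi StdArc attributes:

    from kaldi.rnnlm import KaldiRnnlmDeterministicFst

    def fst_sequence_cost(info, word_ids, max_ngram_order=4):
        """Total FST cost (negated log-prob) of a word sequence; hedged sketch."""
        fst = KaldiRnnlmDeterministicFst(max_ngram_order, info)
        state = fst.start()
        cost = 0.0
        for w in word_ids:
            success, arc = fst.get_arc(state, w)   # arcs are created on demand
            if not success:
                raise RuntimeError("no arc for word-id %d" % w)
            cost += arc.weight.value
            state = arc.nextstate
        return cost + fst.final(state).value       # add the end-of-sentence cost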
class kaldi.rnnlm.RnnlmComputeState

RNNLM computation state.

This class handles the neural net computation; it is mostly accessed via other wrapper classes. At every time step this class takes a new word, advances the nnet computation by one step, and works out the log-probs of words to be used in lattice rescoring.

Parameters:
- info (RnnlmComputeStateInfo) – State information for RNNLM computation.
- bos_index (int) – Index of the begin-of-sentence symbol.

add_word(word_index: int)
Updates the state of the RNNLM by appending a word.

from_other(other: RnnlmComputeState) → RnnlmComputeState
Creates a new instance from another.
Parameters: other (RnnlmComputeState) – The other RNNLM computation state.

get_log_prob_of_words(output: CuMatrixBase)
Computes the log probs of all words and outputs them as a matrix.
Note: output[0,0] corresponds to the <eps> symbol and should NEVER be used in any computation by the caller. To avoid causing unexpected issues, it is set to a very small number.

get_successor_state(next_word: int) → RnnlmComputeState
Generates another state by processing the next word.
Parameters: next_word (int) – The next word to process.

log_prob_of_word(word_index: int) → float
Gets the log-prob for the provided word.
Returns: The log-prob that the model predicts for the provided word index, given the current history.
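A minimal sketch (not from the original docs) of scoring a sentence with these methods; it assumes info is an RnnlmComputeStateInfo built from the trained RNNLM, and that bos_id, eos_id and word_ids come from the word list used in training:

    from kaldi.rnnlm import RnnlmComputeState

    def sentence_logprob(info, bos_id, eos_id, word_ids):
        state = RnnlmComputeState(info, bos_id)    # history starts at <s>
        total = 0.0
        for w in word_ids:
            total += state.log_prob_of_word(w)     # log p(w | history)
            state.add_word(w)                      # advance the nnet by one step
        return total + state.log_prob_of_word(eos_id)   # terminate with </s>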
class kaldi.rnnlm.RnnlmComputeStateComputationOptions

Options for RNNLM compute state.

bos_index
Index in wordlist representing the begin-of-sentence symbol. We need this when we initialize the RnnlmComputeState and pass the BOS history.

brk_index
Index in wordlist representing the break symbol. This is not needed for computation; included only for ease of scripting.

compute_config
Nnet compute options.

debug_computation
Whether to turn on debug for the actual computation (very verbose!).

eos_index
Index in wordlist representing the end-of-sentence symbol. We need this to compute the final cost of a state.

normalize_probs
Whether to normalize word probabilities exactly. If False, the sum-to-one normalization is approximate.

optimize_config
Nnet optimization options.

register(opts: OptionsItf)
Registers options with an object implementing the options interface.
Parameters: opts (OptionsItf) – An object implementing the options interface. Typically a command-line option parser.
class kaldi.rnnlm.RnnlmComputeStateInfo(opts, rnnlm, word_embedding_mat)

State information for RNNLM computation.

This class keeps references to the word embedding, the nnet3 part of the RNNLM, and the RnnlmComputeStateComputationOptions. It handles the computation of the nnet3 object.

Parameters:
- opts (RnnlmComputeStateComputationOptions) – Options for RNNLM compute state.
- rnnlm (Nnet) – The nnet part of the RNNLM.
- word_embedding_mat (CuMatrix) – The word embedding matrix.

computation
The compiled, 'looped' nnet computation.
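A hedged construction sketch (names like rnnlm and word_embedding_mat stand for an already loaded Nnet and CuMatrix; the option values are illustrative):

    from kaldi.rnnlm import (RnnlmComputeState, RnnlmComputeStateComputationOptions,
                             RnnlmComputeStateInfo)

    opts = RnnlmComputeStateComputationOptions()
    opts.bos_index = 1                  # must match the training word list
    opts.eos_index = 2
    opts.normalize_probs = True         # exact sum-to-one normalization

    info = RnnlmComputeStateInfo(opts, rnnlm, word_embedding_mat)
    state = RnnlmComputeState(info, opts.bos_index)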
class kaldi.rnnlm.RnnlmCoreComputer

Core RNNLM computer.

This class has a similar interface to RnnlmCoreTrainer, but it doesn't actually train the RNNLM; it is for computing likelihoods and (optionally) derivatives w.r.t. the embedding, in situations where you are not training the core part of the RNNLM. It reads egs; it is not for rescoring lattices and similar purposes.

Parameters:
- nnet (Nnet) – The neural network that is to be used to evaluate likelihoods (and possibly derivatives).
compute(minibatch: RnnlmExample, derived: RnnlmExampleDerived, word_embedding: CuMatrixBase, word_embedding_deriv: CuMatrixBase = default) -> (objf: float, weight: float)
Computes the objective on one minibatch. If word_embedding_deriv is provided, it also computes derivatives w.r.t. the embedding.

Parameters:
- minibatch (RnnlmExample) – The RNNLM minibatch to evaluate, containing a number of parallel word sequences. It will not necessarily contain words with the 'original' numbering; in most circumstances it will contain just the ones we used; see renumber_rnnlm_example().
- derived (RnnlmExampleDerived) – Derived quantities of the minibatch, pre-computed by calling get_rnnlm_example_derived() with suitable arguments.
- word_embedding (CuMatrixBase) – The matrix giving the embedding of words, of dimension minibatch.vocab_size by the embedding dimension. The numbering of the words does not have to be the 'real' numbering of words; it can consist of words renumbered by renumber_rnnlm_example(); it just has to be consistent with the word-ids present in 'minibatch'.
- word_embedding_deriv (CuMatrixBase) – If not None, the derivative of the objective function w.r.t. the word embedding will be added to this location; it must have the same dimension as 'word_embedding'.

Returns:
- objf – The total objective function for this minibatch; divide this by weight to normalize it (i.e. get the average log-prob per word).
- weight – The total weight of the words in the minibatch. This is just the sum of minibatch.output_weights.
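A hedged evaluation sketch (nnet, word_embedding and the source of minibatches are assumed to be prepared elsewhere; word_embedding must be minibatch.vocab_size by embedding-dim and consistent with any renumbering of the egs):

    from kaldi.rnnlm import RnnlmCoreComputer, get_rnnlm_example_derived

    computer = RnnlmCoreComputer(nnet)
    total_objf, total_weight = 0.0, 0.0
    for minibatch in minibatches:
        derived = get_rnnlm_example_derived(minibatch, False)  # no embedding deriv
        objf, weight = computer.compute(minibatch, derived, word_embedding)
        total_objf += objf
        total_weight += weight
    print("average log-prob per word:", total_objf / total_weight)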
class kaldi.rnnlm.RnnlmCoreTrainer

Core RNNLM trainer.

This class does the core part of the training of the RNNLM; the word embeddings are supplied to this class for each minibatch and, while this class can compute objective function derivatives w.r.t. these embeddings, it is not responsible for updating them.

Parameters:
- config (RnnlmCoreTrainerOptions) – Options for core RNNLM training.
- objective_config (RnnlmObjectiveOptions) – Options for RNNLM objective.
- nnet (Nnet) – The neural network that is to be trained. It will be modified each time you call train().
consolidate_memory()
Consolidates neural network memory.

print_max_change_stats()
Prints out the max-change stats (if nonzero). This is the percentage of time that per-component max-change and global max-change were enforced.

train(minibatch: RnnlmExample, derived: RnnlmExampleDerived, word_embedding: CuMatrixBase, word_embedding_deriv: CuMatrixBase)
Does training for one minibatch.

Parameters:
- minibatch (RnnlmExample) – The RNNLM minibatch to train on, containing a number of parallel word sequences. It will not necessarily contain words with the 'original' numbering; in most circumstances it will contain just the ones we used; see renumber_rnnlm_example().
- derived (RnnlmExampleDerived) – Derived quantities of the minibatch, pre-computed by calling get_rnnlm_example_derived() with suitable arguments.
- word_embedding (CuMatrixBase) – The matrix giving the embedding of words, of dimension minibatch.vocab_size by the embedding dimension. The numbering of the words does not have to be the 'real' numbering of words; it can consist of words renumbered by renumber_rnnlm_example(); it just has to be consistent with the word-ids present in 'minibatch'.
- word_embedding_deriv (CuMatrixBase) – If not None, the derivative of the objective function w.r.t. the word embedding will be added to this location; it must have the same dimension as 'word_embedding'.
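A hedged training-loop sketch (illustrative names; nnet, word_embedding and embedding_deriv are assumed to exist, with embedding_deriv zeroed before each minibatch):

    from kaldi.rnnlm import (RnnlmCoreTrainer, RnnlmCoreTrainerOptions,
                             RnnlmObjectiveOptions, get_rnnlm_example_derived)

    trainer = RnnlmCoreTrainer(RnnlmCoreTrainerOptions(), RnnlmObjectiveOptions(), nnet)
    for minibatch in minibatches:
        derived = get_rnnlm_example_derived(minibatch, True)  # we want d(objf)/d(embedding)
        trainer.train(minibatch, derived, word_embedding, embedding_deriv)
        # embedding_deriv now holds the embedding derivative; updating the embedding
        # matrix itself is the job of RnnlmEmbeddingTrainer (documented below).
    trainer.print_max_change_stats()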
train_backstitch(is_backstitch_step1: bool, minibatch: RnnlmExample, derived: RnnlmExampleDerived, word_embedding: CuMatrixBase, word_embedding_deriv: CuMatrixBase)
Does backstitch training for one minibatch. Depending on whether is_backstitch_step1 is true, this is either the first (backward) step or the second (forward) step of backstitch.

Parameters:
- is_backstitch_step1 (bool) – If true, update stats; otherwise do not.
- minibatch (RnnlmExample) – The RNNLM minibatch to train on, containing a number of parallel word sequences. It will not necessarily contain words with the 'original' numbering; in most circumstances it will contain just the ones we used; see renumber_rnnlm_example().
- derived (RnnlmExampleDerived) – Derived quantities of the minibatch, pre-computed by calling get_rnnlm_example_derived() with suitable arguments.
- word_embedding (CuMatrixBase) – The matrix giving the embedding of words, of dimension minibatch.vocab_size by the embedding dimension. The numbering of the words does not have to be the 'real' numbering of words; it can consist of words renumbered by renumber_rnnlm_example(); it just has to be consistent with the word-ids present in 'minibatch'.
- word_embedding_deriv (CuMatrixBase) – If not None, the derivative of the objective function w.r.t. the word embedding will be added to this location; it must have the same dimension as 'word_embedding'.
class kaldi.rnnlm.RnnlmCoreTrainerOptions

Options for core RNNLM training.

These relate to the core RNNLM training, i.e. training the actual neural net for the RNNLM (when the word embeddings are given).

backstitch_training_interval
Backstitch training interval (n). Do backstitch training with the specified interval of minibatches.

backstitch_training_scale
Backstitch training factor (alpha). If 0, the normal training mode is used.

l2_regularize_factor
Factor that affects the strength of l2 regularization on model parameters. It will be multiplied by the component-level l2-regularize values and can be used to correct for effects related to parallelization by model averaging.

max_param_change
The maximum change in model parameters allowed per minibatch, measured in Euclidean norm. The change will be clipped to this value.

momentum
Momentum constant (helps stabilize training updates), e.g. 0.9. We automatically multiply the learning rate by (1-momentum) so that the 'effective' learning rate is the same as before (because momentum would normally increase the effective learning rate by 1/(1-momentum)).

print_interval
The log printing interval (in terms of #minibatches).

register(opts: OptionsItf)
Registers options with an object implementing the options interface.
Parameters: opts (OptionsItf) – An object implementing the options interface. Typically a command-line option parser.
class
kaldi.rnnlm.
RnnlmEgsConfig
¶ RNNLM example configuration.
-
bos_symbol
¶ Beginning of sentence symbol.
It must be set.
-
brk_symbol
¶ Break symbol.
It must be set.
-
check
()¶ Validates the options.
Raises: RuntimeError
– If validation fails.
-
chunk_buffer_size
¶ The number of chunks that are buffered while processing the input.
Larger means more complete randomization but also more I/O before we produce any output, and more memory used.
-
chunk_length
¶ The length of each sequence in a minibatch.
The length of each sequence in a minibatch, including any terminating </s> symbols, which are included explicitly in the sequences. When </s> appears in the middle of sequences because we splice shorter word sequences together, we will replace it with <s> on the input side of the network. Sentences, or pieces of sentences, that were shorter than
chunk_length
, will be padded as needed.
-
eos_symbol
¶ End of sentence symbol.
It must be set.
-
min_split_context
¶ Min left-context supplied for each training sentence piece.
-
num_chunks_per_minibatch
¶ The number of parallel word sequences/chunks per minibatch.
-
num_samples
¶ The number of words we choose each time we do the sampling.
-
register
(opts:OptionsItf)¶ Registers options with an object implementing the options interface.
Parameters: opts (OptionsItf) – An object implementing the options interface. Typically a command-line option parser.
-
sample_group_size
¶ The sampling group size.
This is the number of consecutive time-steps which form a single unit for sampling purposes. This number will always divide
chunk_length
. Example: ifsample_group_size==2
, we’ll sample one set of words fort={0,1}
, another fort={2,3}
, and so on. We support merging time-steps in this way (but not splitting them smaller), due to considerations of computing time if you assume we also have a network that learns word representation from their character-level features.
-
special_symbol_prob
¶ Sampling probability for words that aren’t supposed to be predicted.
Sampling probability at the output for words that aren’t supposed to be predicted (<s>, <brk>)– this ensures that the model makes their output probs small, which avoids hassle when computing the normalizer in test time (if we didn’t sample them with some probability to ensure their probs are small, we’d have to exclude them from the denominator sum.
-
uniform_prob_mass
¶ The probability mass to uniformly distribute over all words.
This value should be < 1.0; it is the proportion of the unigram distribution used for sampling assigned to uniformly predicting all words. This may avoid certain pathologies during training, and ensure that all words’ probs are bounded away from zero, which might be necessary for the theory of importance sampling.
-
vocab_size
¶ The vocabulary size.
More specifically, the largest integer word-id plus one. Must be provided, as it gets included in each minibatch (mostly for checking purposes).
-
-
class kaldi.rnnlm.RnnlmEmbeddingTrainer

RNNLM embedding trainer.

This class is responsible for training the word embedding matrix or feature embedding matrix.

Parameters:
- config (RnnlmEmbeddingTrainerOptions) – Options for RNNLM embedding training.
- embedding_mat (CuMatrix) – The embedding matrix to be trained, of dimension (num-words or num-features) by embedding-dim (depending on whether we are using a feature representation of words, or not).
train(embedding_deriv: CuMatrixBase)
Does training for one minibatch. This version is used either when there is no subsampling, or when there is subsampling but we are using a feature representation, so the subsampling is handled outside of this code.
Parameters: embedding_deriv (CuMatrixBase) – The derivative of the objective function w.r.t. the word (or feature) embedding matrix.

train_backstitch(is_backstitch_step1: bool, embedding_deriv: CuMatrixBase)
Does backstitch training for one minibatch. This version is used either when there is no subsampling, or when there is subsampling but we are using a feature representation, so the subsampling is handled outside of this code. Depending on whether is_backstitch_step1 is true, this is either the first (backward) step or the second (forward) step of backstitch.
Parameters:
- is_backstitch_step1 (bool) – If true, update stats; otherwise do not.
- embedding_deriv (CuMatrixBase) – The derivative of the objective function w.r.t. the word (or feature) embedding matrix.

train_backstitch_with_subsampling(is_backstitch_step1: bool, active_words: CuArray, word_embedding_deriv: CuMatrixBase)
Does backstitch training for one minibatch. This version is for when there is subsampling, and the user is providing the derivative w.r.t. just the word-indexes that were used in this minibatch. active_words is a sorted, unique list of the word-indexes that were used in this minibatch, and word_embedding_deriv is the derivative w.r.t. the embedding of that list of words. Depending on whether is_backstitch_step1 is true, this is either the first (backward) step or the second (forward) step of backstitch.
Parameters:
- is_backstitch_step1 (bool) – If true, update stats; otherwise do not.
- active_words (CuArray) – A sorted, unique list of the word indexes used, with dimension equal to word_embedding_deriv.num_rows.
- word_embedding_deriv (CuMatrixBase) – The derivative of the objective function w.r.t. the word embedding matrix.

train_with_subsampling(active_words: CuArray, word_embedding_deriv: CuMatrixBase)
Does training for one minibatch. This version is for when there is subsampling, and the user is providing the derivative w.r.t. just the word-indexes that were used in this minibatch. active_words is a sorted, unique list of the word-indexes that were used in this minibatch, and word_embedding_deriv is the derivative w.r.t. the embedding of that list of words.
Parameters:
- active_words (CuArray) – A sorted, unique list of the word indexes used, with dimension equal to word_embedding_deriv.num_rows.
- word_embedding_deriv (CuMatrixBase) – The derivative of the objective function w.r.t. the word embedding matrix.
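A hedged sketch of the no-subsampling path (illustrative names; embedding_mat and embedding_deriv are CuMatrix objects of matching dimensions, and the core trainer is assumed to have filled embedding_deriv for each minibatch):

    from kaldi.rnnlm import RnnlmEmbeddingTrainer, RnnlmEmbeddingTrainerOptions

    emb_opts = RnnlmEmbeddingTrainerOptions()
    emb_opts.learning_rate = 0.01          # illustrative value
    emb_opts.check()                       # validates the options
    emb_trainer = RnnlmEmbeddingTrainer(emb_opts, embedding_mat)

    for minibatch in minibatches:
        # ... core computation fills embedding_deriv for this minibatch ...
        emb_trainer.train(embedding_deriv)
        # With sampling you would instead call train_with_subsampling(active_words,
        # word_embedding_deriv), where active_words comes from renumber_rnnlm_example().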
class kaldi.rnnlm.RnnlmEmbeddingTrainerOptions

Options for RNNLM embedding training.
backstitch_training_interval
Backstitch training interval (n). Do backstitch training with the specified interval of minibatches.

backstitch_training_scale
Backstitch training factor (alpha). If 0, the normal training mode is used.

check()
Validates RNNLM embedding training options.

l2_regularize
Factor that affects the strength of l2 regularization on embedding parameters.

learning_rate
The learning rate used in training the word-embedding matrix.

max_param_change
The maximum change in embedding parameters allowed per minibatch, measured in Euclidean norm. The embedding matrix has dimensions num-features by embedding-dim, or num-words by embedding-dim if we're not using a feature-based representation.

momentum
Momentum constant for training of embeddings (e.g. 0.5 or 0.9). We automatically multiply the learning rate by (1-momentum) so that the 'effective' learning rate is the same as before (because momentum would normally increase the effective learning rate by 1/(1-momentum)).

natural_gradient_alpha
Smoothing constant alpha to use for natural gradient.

natural_gradient_num_minibatches_history
Determines how quickly the Fisher estimate for the natural gradient is updated when training the word embedding.

natural_gradient_rank
Rank of the Fisher matrix in natural gradient. This is applied to learning the embedding matrix (this is in the embedding space, so the rank should probably be less than the embedding dimension).

natural_gradient_update_period
Determines how often the Fisher matrix is updated for natural gradient as applied to the embedding matrix.

print_interval
The log printing interval (in terms of #minibatches).

register(opts: OptionsItf)
Registers options with an object implementing the options interface.
Parameters: opts (OptionsItf) – An object implementing the options interface. Typically a command-line option parser.

use_natural_gradient
Whether to use natural gradient to update the embedding matrix.
class kaldi.rnnlm.RnnlmExample

A single minibatch for training an RNNLM.
chunk_length
The length of each sequence in a minibatch, including any terminating </s> symbols, which are included explicitly in the sequences. When </s> appears in the middle of sequences because we splice shorter word sequences together, we will replace it with <s> on the input side of the network. Sentences, or pieces of sentences, that were shorter than chunk_length will be padded as needed.

input_words
The input word labels. Contains the input word symbols 0 <= i < vocab_size for each position in each chunk; dimension == chunk_length * num_chunks, where 0 <= t < chunk_length has larger stride than 0 <= n < num_chunks. In the common case these will be the same as the previous output symbol.
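A small illustrative sketch of this layout (not from the original docs): because t has the larger stride, the word at time t in chunk n sits at flat index t * num_chunks + n:

    def chunk_input_words(eg, n):
        """Input word-ids of chunk n of an RnnlmExample eg (hedged sketch)."""
        assert 0 <= n < eg.num_chunks
        return [eg.input_words[t * eg.num_chunks + n] for t in range(eg.chunk_length)]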
num_chunks
The number of parallel word sequences/chunks. Some of the word sequences may actually be made up of smaller subsequences appended together.

num_samples
The number of samples. This is the number of words that we sample at the output of the nnet for each of the num_sample_groups groups. If we didn't do sampling because the user didn't provide the ARPA language model, this will be zero (in this case we'll do the summation over all words in the vocab).

output_weights
The output weights. Weights for each of the output_words, indexed the same way as output_words. These reflect any data-weighting we had in the original data, plus some zeros that relate to padding sequences of uneven length.

output_words
The output word labels. The output (predicted) word symbols for each position in each chunk; indexed in the same way as 'input_words'. What this contains is different from 'input_words' in the sampling case (i.e. if !sampled_words.empty()). In this case, instead of the word-index it contains the relative index 0 <= i < num_samples within the block of sampled words. In the not-sampled case it contains actual word indexes 0 <= i < vocab_size.

read(is: istream, binary: bool)
Reads the RNNLM example from an input stream.
Parameters:
- is (istream) – The input C++ stream.
- binary (bool) – Whether the stream is in binary mode.

sample_group_size
The sampling group size. This is the number of consecutive time-steps which form a single unit for sampling purposes. This number will always divide chunk_length. Example: if sample_group_size == 2, we'll sample one set of words for t={0,1}, another for t={2,3}, and so on. The sampling is for the denominator of the objective function.

sample_inv_probs
The inverse probabilities. This vector has the same dimension as 'sampled_words', and contains the inverses of the probabilities 0 < p <= 1 with which each word was included in the sampled set of words. These inverse probabilities appear in the objective function computation (this is related to importance sampling).

sampled_words
The sampled word labels. This list contains the word-indexes that we sampled for each position in the chunk and for each group of chunks. (It will be empty if the user didn't provide the ARPA language model.) Its dimension is num_sample_groups * num_samples, where num_sample_groups == (chunk_length / sample_group_size). The sample-group index has the largest stride (you can think of the sample group index as the number i = t / sample_group_size, in integer division, where 0 <= t < chunk_length is the position in the chunk). The sampled words within each block of size num_samples are sorted and unique.

swap(other: RnnlmExample)
Swaps contents with another RNNLM example.
Parameters: other (RnnlmExample) – The other RNNLM example.

vocab_size
The vocabulary size (defined as the largest integer word-id plus one) for which this example was obtained; mostly used in bounds checking.
class kaldi.rnnlm.RnnlmExampleCreator

RNNLM example creator.

This class takes care of all of the logic of creating minibatches for RNNLM training, including the sampling aspect.
accept_sequence(weight: float, words: list<int>)
Accepts a single sequence.
The user calls this to provide a single sequence (a sentence; or multiple sentences that are part of a continuous stream or dialogue, separated by </s>) to this class. This class will write out minibatches when it's ready. The sequence will normally be the result of reading a line of text with the format:
  <weight> <word1> <word2> ...
e.g.:
  1.0 Hello there
although the "Hello there" would have been converted to integers by the time it was read in, via sym2int.pl, so it would look like:
  1.0 7620 12309
We also allow:
  1.0 Hello there </s> Hi </s> My name is Bob
if you want to train the model to predict sentences given the history of the conversation.

flush()
Flushes out any pending minibatches.

process(is: istream)
Processes the lines from an input stream. Lines will be of the format:
  <weight> <possibly-empty-sequence-of-integers>
e.g.:
  1.0 2560 8991

without_sampling(config: RnnlmEgsConfig, writer: RnnlmExampleWriter) → RnnlmExampleCreator
Instantiates a new RNNLM example creator. This constructor is for when you are not using importance sampling, so no samples will be stored in the minibatch and the training code will presumably evaluate all the words each time. This is intended to be used for testing purposes.
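A hedged sketch of this testing path (the RnnlmExampleWriter passed as writer is assumed to come from the PyKaldi table utilities; its exact location is not specified on this page, and the ids/values are illustrative):

    from kaldi.rnnlm import RnnlmEgsConfig, RnnlmExampleCreator

    config = RnnlmEgsConfig()
    config.vocab_size = 10000                      # largest word-id plus one
    config.bos_symbol, config.eos_symbol, config.brk_symbol = 1, 2, 3
    config.check()                                 # raises RuntimeError if misconfigured

    creator = RnnlmExampleCreator.without_sampling(config, writer)
    creator.accept_sequence(1.0, [7620, 12309])    # "1.0 Hello there" after sym2int.pl
    creator.flush()                                # write out any pending minibatches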
class kaldi.rnnlm.RnnlmExampleDerived

Various quantities/expressions derived from an RNNLM example.

This class contains various quantities/expressions that are derived from the quantities found in RnnlmExample, and which are needed when training on that example, particularly by the function process_rnnlm_output().
cu_input_words
CUDA copy of minibatch.input_words.

cu_output_words
CUDA copy of minibatch.output_words. It's only used in the sampling case.

cu_sampled_words
CUDA copy of minibatch.sampled_words. It's only used in the sampling case (in the no-sampling case, minibatch.sampled_words would be empty anyway).

swap(other: RnnlmExampleDerived)
Swaps contents with another derived RNNLM example.
Parameters: other (RnnlmExampleDerived) – The other derived RNNLM example.
class kaldi.rnnlm.RnnlmExampleSampler

RNNLM example sampler.

This class encapsulates the logic for sampling words for a minibatch. The words at the output of the RNNLM are sampled and we train with an importance-sampling algorithm.

Parameters:
- config (RnnlmEgsConfig) – The RNNLM example configuration.
- arpa_sampling (SamplingLm) – The sampling LM.

sample_for_minibatch(minibatch: RnnlmExample)
Does the sampling for a minibatch.
Parameters: minibatch (RnnlmExample) – The minibatch. It is expected to already have all fields populated except for sampled_words and sample_probs. This method does the sampling and sets those fields.
class kaldi.rnnlm.RnnlmObjectiveOptions

Options for RNNLM objective function.

Configuration class relating to the objective function used for RNNLM training, more specifically for use by the function process_rnnlm_output().

den_term_limit
Modification to the with-sampling objective. This prevents instability early in training, but in the end makes no difference. We scale down the denominator part of the objective when the average denominator part of the objective, for this minibatch, is more negative than this value. Set this to 0.0 to use the unmodified objective function.

max_logprob_elements
Maximum number of elements when we allocate a matrix of size [minibatch-size, num-words] for computing logprobs of words. If the size is exceeded, we will break the matrix up along the minibatch axis and compute the logprobs separately.

register(opts: OptionsItf)
Registers options with an object implementing the options interface.
Parameters: opts (OptionsItf) – An object implementing the options interface. Typically a command-line option parser.
class kaldi.rnnlm.RnnlmTrainer

RNNLM trainer.

The class RnnlmTrainer is for training an RNNLM (one individual training job, not the top-level logic about learning rate schedules, parameter averaging, and the like).

Parameters:
- train_embedding (bool) – Whether to train the embedding matrix.
- core_config (RnnlmCoreTrainerOptions) – Options for training the core RNNLM.
- embedding_config (RnnlmEmbeddingTrainerOptions) – Options for training the embedding matrix (only relevant if train_embedding is True).
- objective_config (RnnlmObjectiveOptions) – Options relating to the objective function used for training.
- word_feature_mat (CuSparseMatrix) – Either None, or a sparse word-feature matrix of dimension vocab-size by feature-dim, where vocab-size is the highest-numbered word plus one.
- embedding_mat (CuMatrix) – The embedding matrix; this is trained if train_embedding is True. If word_feature_mat is None, this is the word-embedding matrix of dimension vocab-size by embedding-dim; otherwise it is the feature-embedding matrix of dimension feature-dim by embedding-dim, and we have to multiply it by word_feature_mat to get the word embedding matrix.
- rnnlm (Nnet) – The RNNLM to be trained.
num_minibatches_processed() → int
Returns the number of minibatches processed so far.

train(minibatch: RnnlmExample)
Trains on one example. The example is acquired destructively, via swapping contents.
Note: This function doesn't actually train on this example; what it does is to train on the previous example, and provide this example to the background thread that computes the derived parameters of the example.
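A hedged sketch of one training job (argument order follows the parameter list above; rnnlm, embedding_mat and the source of minibatches are assumed to be prepared elsewhere):

    from kaldi.rnnlm import (RnnlmCoreTrainerOptions, RnnlmEmbeddingTrainerOptions,
                             RnnlmObjectiveOptions, RnnlmTrainer)

    trainer = RnnlmTrainer(True,                           # train_embedding
                           RnnlmCoreTrainerOptions(),
                           RnnlmEmbeddingTrainerOptions(),
                           RnnlmObjectiveOptions(),
                           None,                           # word_feature_mat: no feature representation
                           embedding_mat,                  # vocab-size x embedding-dim CuMatrix
                           rnnlm)                          # the Nnet to be trained
    for minibatch in minibatches:
        trainer.train(minibatch)                           # minibatch is consumed destructively
    print("minibatches processed:", trainer.num_minibatches_processed())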
class kaldi.rnnlm.Sampler

Word sampler.

This class allows us to sample a set of words from a distribution over words, where the distribution (which ultimately comes from an ARPA-style language model) is given as a combination of a unigram distribution with a sparse component represented as a list of (word-index, probability) pairs.

Parameters:
- unigram_probs (List[float]) – The unigram probabilities for each word. Each element should be >= 0, and they should sum to a value close to 1.
sample_words(num_words_to_sample: int, unigram_weight: float, higher_order_probs: list<tuple<int, float>>) → list<tuple<int, float>>
Samples words from the supplied distribution, appropriately scaled.
Let the unnormalized distribution be
  p(i) = unigram_weight * u(i) + h(i)
where u(i) is the 'unigram_probs' list this class was constructed with, and h(i) is the probability that word i is given (if any) in the sparse vector that 'higher_order_probs' represents. Notice that we are adding to the unigram distribution, not backing off to it. Doing it this way makes a lot of things simpler.
We define the first-order inclusion probabilities
  q(i) = min(alpha * p(i), 1.0)
where alpha is chosen so that the sum of q(i) equals 'num_words_to_sample'. Then we generate a sample whose first-order inclusion probabilities are q(i). We do all this without explicitly iterating over the unigram distribution, so this is fairly fast.

Parameters:
- num_words_to_sample (int) – The number of words that we are directed to sample; must be > 0 and less than the number of nonzero elements of the 'unigram_probs' that this class was constructed with.
- unigram_weight (float) – Must be > 0.0. Search above for p(i) to see what effect it has.
- higher_order_probs (List[Tuple[int,float]]) – A list of pairs (i, p) where 0 <= i < unigram_probs.size() (referring to the unigram_probs list used in the constructor), and p > 0.0. This list must be sorted and unique w.r.t. i. Note: the probabilities here will be added to the unigram probabilities of the words concerned.

Returns: The sampled list of words, represented as pairs (i, p), where 0 <= i < unigram_probs.size() is the word index and 0 < p <= 1 is the probability with which that word was included in the set. The list will not be sorted, but it will be unique on the int. Its size will equal num_words_to_sample.
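A hedged sketch with illustrative numbers (index 0 is given zero mass, mirroring the <eps> convention used elsewhere on this page):

    from kaldi.rnnlm import Sampler

    unigram_probs = [0.0, 0.4, 0.3, 0.2, 0.1]     # sums to 1.0; four nonzero entries
    sampler = Sampler(unigram_probs)

    higher_order = [(2, 0.5), (4, 0.25)]          # sorted, unique on word-id
    samples = sampler.sample_words(3, 1.0, higher_order)
    for word_id, inclusion_prob in samples:       # 3 distinct words, 0 < p <= 1
        print(word_id, inclusion_prob)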
sample_words_with_requirements(num_words_to_sample: int, unigram_weight: float, higher_order_probs: list<tuple<int, float>>, words_we_must_sample: list<int>) → list<tuple<int, float>>
Samples words by specifying a list of words that must be sampled.
This is an alternative version of sample_words() which allows you to specify a list of words that must be sampled (i.e., after scaling, they must have probability 1.0). It does this by adding them to the distribution with sufficiently large probability and then calling sample_words().

Parameters:
- num_words_to_sample (int) – The number of words that we are directed to sample; must be > 0 and less than the number of nonzero elements of the 'unigram_probs' that this class was constructed with.
- unigram_weight (float) – Must be > 0.0. Search above for p(i) to see what effect it has.
- higher_order_probs (List[Tuple[int,float]]) – A list of pairs (i, p) where 0 <= i < unigram_probs.size() (referring to the unigram_probs list used in the constructor), and p > 0.0. This list must be sorted and unique w.r.t. i. Note: the probabilities here will be added to the unigram probabilities of the words concerned.
- words_we_must_sample (List[int]) – A list of words that must be sampled. It must be sorted and unique, and all elements i must satisfy 0 <= i < len(unigram_probs), where unigram_probs is the list supplied to the constructor.

Returns: The sampled list of words, represented as pairs (i, p), where 0 <= i < unigram_probs.size() is the word index and 0 < p <= 1 is the probability with which that word was included in the set. The list will not be sorted, but it will be unique on the int. Its size will equal num_words_to_sample.

See also: sample_words().
class kaldi.rnnlm.SamplingLm

Sampling LM.
from_estimator(estimator: SamplingLmEstimator) → SamplingLm
Creates a new sampling LM from the given estimator.
This constructor reads the object directly from a SamplingLmEstimator instance, which is much faster than dealing with the ARPA format. It also allows us to avoid having to add a bunch of unnecessary n-grams to satisfy the requirements of the ARPA file format. It assumes that you have already called estimator.estimate().
Parameters: estimator (SamplingLmEstimator) – The sampling LM estimator.

from_options(options: ArpaParseOptions, symbols: SymbolTable) → SamplingLm
Creates a new sampling LM with the given options.
The ARPA LM is read from the file specified in the options. Only text mode is supported.
Parameters:
- options (ArpaParseOptions) – The options for parsing ARPA LM files.
- symbols (SymbolTable) – The symbol table.

get_distribution(histories: list<tuple<list<int>, float>>) -> (unigram_prob: float, non_unigram_probs: dict<int, float>)
Gets word probabilities given a list of histories.
Parameters: histories (List[Tuple[List[int],float]]) – A list of histories with associated weights.
Returns: A scalar unigram_prob, which is computed by summing all history weights after scaling them with the corresponding backoff weights, and a dictionary mapping words to their corresponding probabilities given the list of histories.
Note: The sum of the returned unigram_prob plus the second elements of the output non_unigram_probs will not necessarily be equal to 1.0, but it will be equal to the total of the weights of the histories in histories.
See also: get_distribution_pairs().

get_distribution_pairs(histories: list<tuple<list<int>, float>>) -> (unigram_prob: float, non_unigram_probs: list<tuple<int, float>>)
Gets word probabilities given a list of histories.
Parameters: histories (List[Tuple[List[int],float]]) – A list of histories with associated weights.
Returns: A scalar unigram_prob, which is computed by summing all history weights after scaling them with the corresponding backoff weights, and a list of pairs (word-id, weight) that is sorted and unique on word-id, mapping words to their corresponding probabilities given the list of histories.
Note: The sum of the returned unigram_prob plus the second elements of the output non_unigram_probs will not necessarily be equal to 1.0, but it will be equal to the total of the weights of the histories in histories.
See also: get_distribution().

get_unigram_distribution() → list<float>
Gets unigram probabilities.
This method outputs the unigram distribution of all words represented by integers from 0 to the maximum symbol id.
Returns: A list of floats representing the unigram distribution of all words.
Note: There can be gaps between the integer ids of words in the ARPA LM; we set the probabilities of words that are not in the ARPA LM to 0.0, e.g., symbol id 0, which represents epsilon, has probability 0.0.
options() → ArpaParseOptions
Gets the ARPA parser options.
Returns: The ARPA parser options.

order() → int
Gets the n-gram order.
Returns: The n-gram order, e.g. 1 for a unigram LM, 2 for a bigram.
Return type: int

read(is: istream, binary: bool)
Reads the sampling LM from an input stream.
This method does not read the ARPA format; it reads the special-purpose format written by write().
Parameters:
- is (istream) – The input C++ stream.
- binary (bool) – Whether the stream is in binary mode.
See also: read_arpa().

read_arpa(is: istream)
Reads the sampling LM from a file in ARPA format.
Parameters: is (istream) – The input C++ stream.

swap(other: SamplingLm)
Swaps contents with another sampling LM.
Parameters: other (SamplingLm) – The other sampling LM.
class kaldi.rnnlm.SamplingLmEstimator

Sampling LM estimator.

This class is responsible for creating a backoff n-gram language model of a type that's suitable for use in the importance sampling algorithm we use for RNNLM training. It's the type of language model that could in principle be written in ARPA format, but it's created in a special way. There are a few characteristics of the importance sampling algorithm that make it desirable to write a special purpose language model instead of using a generic language model toolkit:

- When we sample, we sample from a distribution that is the average of a fairly large number of history states N (e.g., N=128), which can be treated as independently chosen for practical purposes (except that sometimes they'll all be the BOS history, which is a special case).
- The convergence of the sampling-based method won't be sensitive to small differences in the probabilities of the distribution we sample on.
- It's important not to have too many words that are specifically predicted from a typical history-state, or it makes the sampling process slow.

Parameters: config (SamplingLmEstimatorOptions) – Options for sampling LM estimator.

estimate(will_write_arpa: bool)
Estimates the language model (including the discounting).
Parameters: will_write_arpa (bool) – Whether to retain certain n-grams (required in the ARPA file format) that would otherwise have been pruned.

print_as_arpa(os: ostream, symbols: SymbolTable)
Prints the LM in ARPA format.
Parameters:
- os (ostream) – The output stream to write the model to.
- symbols (SymbolTable) – The symbol table to map integers to words.

process(is: istream)
Processes the lines read from the input stream. Lines will be of the format:
  <weight> <possibly-empty-sequence-of-integers>
e.g.:
  1.0 2560 8991
Parameters: is (istream) – The input stream.

process_line(corpus_weight: float, sentence: list<int>)
Processes one line of the input, adding it to the stored stats.
Parameters:
- corpus_weight (float) – Weight attached to the corpus from which this data came. (Note: you shouldn't repeat sentences when providing them to this class, although this is allowed during the actual RNNLM training; instead, you should make sure that the multiplicity that you use in the RNNLM training for this corpus is reflected in 'corpus_weight'.)
- sentence (List[int]) – The sentence we are processing. It is not expected to contain the BOS symbol, and should not be terminated by the EOS symbol, although the EOS symbol is allowed internally (where it can be used to separate a sequence of sentences from a dialogue or other sequence of text, if you want to do this).
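A hedged end-to-end sketch, feeding integerized sentences to the estimator and then building a SamplingLm from it (the ids and values are illustrative):

    from kaldi.rnnlm import (SamplingLm, SamplingLmEstimator,
                             SamplingLmEstimatorOptions)

    opts = SamplingLmEstimatorOptions()
    opts.ngram_order = 3
    opts.bos_symbol, opts.eos_symbol, opts.brk_symbol = 1, 2, 3
    opts.vocab_size = 10000
    opts.check()                                   # raises RuntimeError if inconsistent

    estimator = SamplingLmEstimator(opts)
    estimator.process_line(1.0, [7620, 12309])     # "1.0 Hello there" after sym2int.pl
    estimator.process_line(1.0, [45, 2056, 11])
    estimator.estimate(False)                      # False: no ARPA file will be written

    lm = SamplingLm.from_estimator(estimator)
    # Average distribution over two weighted histories (each history is a word-id list):
    unigram_prob, pairs = lm.get_distribution_pairs([([1], 0.5), ([7620], 0.5)])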
class kaldi.rnnlm.SamplingLmEstimatorOptions

Options for sampling LM estimator.
backoff_factor
The backoff factor. Factor by which p(w|h) for a higher-than-bigram history state h (with the backoff term excluded) has to be greater than p(w|backoff-state) for us to include it in the model (in addition to the unigram_factor constraint). Must be > 0.0 and < unigram_factor.

bos_factor
The beginning of sentence factor. Factor by which p(w|h) for h == the BOS history state (with the backoff term excluded) has to be higher than p(w|unigram-state) for us to include it in the model. Must be > 0.0 and <= unigram_factor.

bos_symbol
Integer id for the BOS word (<s>).

brk_symbol
Integer id for the break word (<brk>). Not needed but included for ease of scripting.

check()
Validates the options.
Raises: RuntimeError – If validation fails.

discounting_constant
Constant for absolute discounting. It should be in the range 0.8 to 1.0. Smaller values give a larger language model.

eos_symbol
Integer id for the EOS word (</s>).

ngram_order
Order for the n-gram model (must be >= 1), e.g. 3 means trigram.

register(opts: OptionsItf)
Registers options with an object implementing the options interface.
Parameters: opts (OptionsItf) – An object implementing the options interface. Typically a command-line option parser.

unigram_factor
The unigram factor. Factor by which p(w|h) for a non-unigram history state h (with the backoff term excluded) has to be greater than p(w|unigram-state) for us to include it in the model. Must be > 0.0, and will normally be > 1.0.

unigram_power
The unigram power scalar. This is an important configuration value. After all other stages of estimating the model, the unigram probabilities are taken to this power, e.g. 0.75, and then rescaled to sum to 1.0. There are both theoretical and practical reasons why we want to apply this power just to the unigram portion.

vocab_size
The vocabulary size. If set, it must be set to the highest-numbered vocabulary word plus one; otherwise it is worked out from the symbol table.
kaldi.rnnlm.check_distribution(d: list<tuple<int, float>>)
Validates a distribution. Checks that the distribution is sorted and unique on its first values, and that all of its second values are > 0.
Parameters: d (List[Tuple[int,float]]) – The input distribution.
Raises: RuntimeError – If validation fails.
kaldi.rnnlm.get_rnnlm_computation_request(minibatch: RnnlmExample, need_model_derivative: bool, need_input_derivative: bool, store_component_stats: bool) → ComputationRequest
Creates a computation request for the given RNNLM example.
This function takes an RnnlmExample (which should already have been frame-selected, if desired, and merged into a minibatch) and produces a ComputationRequest. It assumes you don't want the derivatives w.r.t. the inputs; if you do, you can create/modify the ComputationRequest manually. It assumes that if need_model_derivative is true, you will be supplying derivatives w.r.t. all outputs.
kaldi.rnnlm.get_rnnlm_example_derived(minibatch: RnnlmExample, need_embedding_deriv: bool) → RnnlmExampleDerived
Constructs a derived RNNLM example. Sets up the structure containing derived parameters used in training and objective function computation.
Parameters:
- minibatch (RnnlmExample) – The input minibatch for which we are computing the derived parameters.
- need_embedding_deriv (bool) – True if we are going to be computing derivatives w.r.t. the word embedding (e.g., needed in a typical training configuration); if this is True, the transpose of the input words will also be computed.
Returns: A derived RNNLM example structure for the input minibatch.
kaldi.rnnlm.merge_distributions(d1: list<tuple<int, float>>, d2: list<tuple<int, float>>) → list<tuple<int, float>>
Merges two distributions. Sums the probabilities of any elements that occur in both input distributions.
Parameters:
- d1 (List[Tuple[int,float]]) – The first input distribution.
- d2 (List[Tuple[int,float]]) – The second input distribution.
Returns: The output distribution.
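A small illustrative sketch of the distribution helpers on this page (a 'distribution' is a list of (word-id, prob) pairs, sorted and unique on the word-id, with positive probabilities):

    from kaldi.rnnlm import (check_distribution, merge_distributions,
                             total_of_distribution)

    d1 = [(2, 0.25), (7, 0.5)]
    d2 = [(2, 0.25), (9, 0.1)]
    check_distribution(d1)                 # raises RuntimeError if malformed
    check_distribution(d2)

    merged = merge_distributions(d1, d2)   # probabilities of shared word-ids are summed
    print(merged)                          # [(2, 0.5), (7, 0.5), (9, 0.1)]
    print(total_of_distribution(merged))   # 1.1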
kaldi.rnnlm.process_rnnlm_output(objective_opts: RnnlmObjectiveOptions, minibatch: RnnlmExample, derived: RnnlmExampleDerived, word_embedding: CuMatrixBase, nnet_output: CuMatrixBase, word_embedding_deriv: CuMatrixBase, nnet_output_deriv: CuMatrixBase) -> (weight: float, objf_num: float, objf_den: float, objf_den_exact: float)
Processes the output of the RNNLM computation.
This function processes the output of the RNNLM computation for a single minibatch; it outputs the objective-function contributions from the numerator and denominator terms, and [if requested] the derivatives of the objective function w.r.t. the data inputs.

In the explanation below, the index i encompasses both the time t and the member n within the minibatch. The objective function referred to here is of the form
  objf = sum_i weight(i) * ( num_term(i) + den_term(i) )
where num_term(i) is the log-prob of the 'correct' word, which equals the dot product of the neural-network output with the word embedding; we can write it as
  num_term(i) = l(i, minibatch.output_words(i))
where l(i, w) is the unnormalized log-prob of word w for position i, specifically
  l(i, w) = vec_vec(nnet_output.Row(i), word_embedding.Row(w)).
Without importance sampling (if len(minibatch.sampled_words) == 0), we get
  den_term(i) = 1.0 - (sum_w q(i,w))
This is a lower bound on the 'natural' normalizer term, which is of the form -log(sum_w p(i,w)), and its linearity in the p's allows importance sampling. Here,
  p(i, w) = exp(l(i, w))
  q(i, w) = exp(l(i, w)) if l(i, w) < 0, else 1 + l(i, w)
[the reason we use q(i, w) instead of p(i, w) is that it gives a closer bound to the natural normalizer term and helps avoid instability in early phases of training.]
With importance sampling (if minibatch.sampled_words.size() > 0), den_term equals
  den_term(i) = 1.0 - (sum_w q(i,w) * sample_inv_prob(i,w))
where sample_inv_prob(i, w) is zero if word w was not sampled for this t, and 1.0 / (the probability with which it was sampled) if it was sampled.

Parameters:
- objective_opts (RnnlmObjectiveOptions) – Options for RNNLM objective.
- minibatch (RnnlmExample) – The minibatch for which we are processing the output.
- derived (RnnlmExampleDerived) – This struct contains certain quantities which are precomputed from minibatch. It is to be generated by calling get_rnnlm_example_derived() prior to calling this function.
- word_embedding (CuMatrixBase) – The word embedding, of dimension num-words by embedding-dimension. This does not have to use 'real' word-indexes; it can use fake word-indexes renumbered to include only the required words if sampling is done; c.f. renumber_rnnlm_example().
- nnet_output (CuMatrixBase) – The neural net output. Num-rows is minibatch.chunk_length * minibatch.num_chunks, where the stride for the time 0 <= t < chunk_length is larger, so there is a block of rows for t=0, a block for t=1, and so on. Num-columns is the embedding dimension.
- word_embedding_deriv (CuMatrixBase) – If not None, the derivative of the objective function w.r.t. word_embedding is added to this location.
- nnet_output_deriv (CuMatrixBase) – If not None, the derivative of the objective function w.r.t. nnet_output is added to this location.

Returns:
- weight – The total weight over this minibatch. It is equal to minibatch.output_weights.sum().
- objf_num – The total numerator part of the objective function, i.e. the sum over i of weight(i) * num_term(i).
- objf_den – The total denominator part of the objective function, i.e. the sum over i of weight(i) * den_term(i). You add this to objf_num to get the total objective function.
- objf_den_exact – If we're not doing sampling (i.e. if len(minibatch.sampled_words) == 0), the 'exact' denominator part of the objective function, i.e. the weighted sum of exact_den_term(i) = -log(sum_w p(i,w)). If we are sampling, then there is no exact denominator part, and this will be set to zero. This is provided for diagnostic purposes. Derivatives will be computed w.r.t. the objective consisting of objf_num + objf_den, i.e. ignoring the 'exact' one.
kaldi.rnnlm.read_sparse_word_features(is: istream, feature_dim: int) → SparseMatrix
Reads sparse word features from an input stream.
Reads a text file (e.g. exp/rnnlm/word_feats.txt) which maps words to sparse combinations of features. The text file contains lines of the format:
  <word-index> <feat1-index> <feat1-value> <feat2-index> <feat2-value> ...
with the feature indexes in sorted order, for example:
  2056 11 3.0 25 1.0 1069 1.0
The word-indexes are expected to be in order 0, 1, 2, ..., so they don't really add any information; they are included for human readability.
Parameters:
- is (istream) – The stream we are reading.
- feature_dim (int) – The feature dimension, i.e. the highest-numbered possible feature plus one. We don't attempt to work this out from the input, in case for some reason this vocabulary does not use the highest-numbered feature.
Returns: A sparse matrix of dimension num-words by feature-dim, containing the word feature information in the file we read.
Raises: RuntimeError – If the input is not as expected.
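A hedged loading sketch; the kaldi.util.io.xopen helper and its stream() accessor are assumptions about the PyKaldi I/O utilities rather than something documented on this page, and the path and feature_dim are illustrative:

    from kaldi.rnnlm import read_sparse_word_features
    from kaldi.util.io import xopen   # assumption: standard PyKaldi I/O helper

    feature_dim = 2000                # highest-numbered possible feature plus one
    with xopen("exp/rnnlm/word_feats.txt") as ki:
        word_feats = read_sparse_word_features(ki.stream(), feature_dim)
    # word_feats is a SparseMatrix of dimension num-words by feature-dim.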
kaldi.rnnlm.renumber_rnnlm_example(minibatch: RnnlmExample) → list<int>
Renumbers word-ids in a minibatch.
This function renumbers the word-ids referred to in a minibatch, creating a numbering that covers exactly the words referred to in this minibatch. It is only to be called when sampling is used, i.e. when minibatch.sampled_words is not empty.
Parameters: minibatch (RnnlmExample) – The minibatch to be modified. On entry the word-indexes in the fields input_words and sampled_words will be in their canonical numbering. On exit the numbers present in those arrays will be indexes into the active_words list that this function returns. For instance, suppose minibatch.input_words[9] == 1034 at entry; at exit we might have minibatch.input_words[9] == 94, with active_words[94] == 1034. This function requires that minibatch.sampled_words is nonempty. If minibatch.sampled_words is empty, it means that sampling has not been done, so the negative part of the objf will use all the words; in that case the minibatch implicitly uses all words and there is no point in renumbering. On exit, minibatch.vocab_size will have been set to the same value as len(active_words).
Returns: The list of active words, i.e. the words that were present in the fields input_words and sampled_words in minibatch on entry. On exit, this list will be sorted and unique.
Note: It is not necessary for this function to renumber output_words because in the sampling case they are indexes into blocks of sampled_words (see the documentation for RnnlmExample).
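A hedged sketch of the renumbering contract, reusing the example numbers from the description above (minibatch is assumed to be a sampled RnnlmExample, i.e. sampled_words is non-empty):

    from kaldi.rnnlm import renumber_rnnlm_example

    active_words = renumber_rnnlm_example(minibatch)
    # Every renumbered id in input_words now indexes into active_words,
    # e.g. a former word-id 1034 may have become 94 with active_words[94] == 1034.
    original_ids = [active_words[i] for i in minibatch.input_words]
    assert minibatch.vocab_size == len(active_words)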
kaldi.rnnlm.sample_without_replacement(probs: list<float>) → list<int>
Samples without replacement from a distribution.
Samples without replacement from a distribution, with the provided first-order inclusion probabilities. For example, if probs[i] == 1.0, i will definitely be included in the output list, and if probs[i] == 0.0, i will definitely not be included.
Parameters: probs (List[float]) – The input list of inclusion probabilities, with 0.0 <= probs[i] <= 1.0, and the sum of probs should be close to an integer (specifically: within 1.0e-03 of a whole number; this should be easy to ensure in double precision). Let 'k' be this sum, rounded to the nearest integer.
Returns: An unsorted list of 'k' distinct samples with first-order inclusion probabilities given by probs.
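A small illustrative sketch: the inclusion probabilities below sum to 2.0, so exactly two distinct indexes come back, and index 3 (probability 1.0) is always one of them:

    from kaldi.rnnlm import sample_without_replacement

    probs = [0.5, 0.25, 0.25, 1.0]
    picked = sample_without_replacement(probs)
    print(sorted(picked))              # e.g. [0, 3], [1, 3] or [2, 3]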
kaldi.rnnlm.total_of_distribution(d: list<tuple<int, float>>) → float
Returns the sum of the elements of a distribution.
Parameters: d (List[Tuple[int,float]]) – The input distribution.
Returns: The sum of the elements of the distribution.