kaldi.rnnlm

Functions

check_distribution Validates a distribution.
get_rnnlm_computation_request Creates a computation request for the given RNNLM example.
get_rnnlm_example_derived Constructs a derived RNNLM example.
merge_distributions Merges two distributions.
process_rnnlm_output Processes the output of RNNLM computation.
read_sparse_word_features Reads sparse word features from input stream.
renumber_rnnlm_example Renumbers word-ids in a minibatch.
sample_without_replacement Samples without replacement from a distribution.
total_of_distribution Returns the sum of the elements of a distribution.

Classes

KaldiRnnlmDeterministicFst Deterministic on demand RNNLM FST.
RnnlmComputeState RNNLM computation state.
RnnlmComputeStateComputationOptions Options for RNNLM compute state.
RnnlmComputeStateInfo State information for RNNLM computation.
RnnlmCoreComputer Core RNNLM computer.
RnnlmCoreTrainer Core RNNLM trainer.
RnnlmCoreTrainerOptions Options for core RNNLM training.
RnnlmEgsConfig RNNLM example configuration.
RnnlmEmbeddingTrainer RNNLM embedding trainer.
RnnlmEmbeddingTrainerOptions Options for RNNLM embedding training.
RnnlmExample A single minibatch for training an RNNLM.
RnnlmExampleCreator RNNLM example creator.
RnnlmExampleDerived Various quantities/expressions derived from an RNNLM example.
RnnlmExampleSampler RNNLM example sampler.
RnnlmObjectiveOptions Options for RNNLM objective function.
RnnlmTrainer RNNLM trainer.
Sampler Word sampler.
SamplingLm Sampling LM.
SamplingLmEstimator Sampling LM estimator.
SamplingLmEstimatorOptions Options for sampling LM estimator.
class kaldi.rnnlm.KaldiRnnlmDeterministicFst

Deterministic on demand RNNLM FST.

Parameters:
  • max_ngram_order (int) – Maximum ngram order.
  • info (RnnlmComputeStateInfo) – State information for RNNLM computation.
clear()

Clears the internal maps.

This method is similar to the destructor, but we retain the 0-th entry in each map, which corresponds to the <bos> state.

final(state:int) → TropicalWeight

Returns the final weight of the given state.

get_arc(s:int, ilabel:int) -> (success:bool, oarc:StdArc)

Creates an on demand arc and returns it.

Parameters:
  • s (int) – State index.
  • ilabel (int) – Arc label.
Returns:

A success flag indicating whether the arc could be created, and the created arc.

start() → int

Returns the start state index.
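
Example (a minimal, hypothetical walk through the on-demand FST; rnnlm_fst is assumed to be an already-constructed KaldiRnnlmDeterministicFst and word_id a valid input label):

state = rnnlm_fst.start()                        # the <bos> state
success, arc = rnnlm_fst.get_arc(state, word_id)
if success:
    print(arc.weight.value, arc.nextstate)       # RNNLM cost and next state
    print(rnnlm_fst.final(arc.nextstate).value)  # end-of-sentence cost
rnnlm_fst.clear()                                # drop cached states, keep <bos>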

class kaldi.rnnlm.RnnlmComputeState

RNNLM computation state.

This class handles the neural net computation; it’s mostly accessed via other wrapper classes.

At every time step this class takes a new word, advances the nnet computation by one step, and works out the log-prob of words to be used in lattice rescoring.

Parameters:
  • info (RnnlmComputeStateInfo) – State information for RNNLM computation.
  • bos_index (int) – Index of the begin-of-sentence symbol.
add_word(word_index:int)

Updates the state of the RNNLM by appending a word.

from_other(other:RnnlmComputeState) → RnnlmComputeState

Creates a new instance from another.

Parameters:other (RnnlmComputeState) – The other RNNLM computation state.

get_log_prob_of_words(output:CuMatrixBase)

Computes log probs of all words.

This function computes log probs of all words and outputs them as a matrix.

Note

output[0,0] corresponds to the <eps> symbol and should NEVER be used in any computation by the caller. To avoid causing unexpected issues, it is set to a very small number.

get_successor_state(next_word:int) → RnnlmComputeState

Generates another state by processing the next-word.

Parameters:next_word (int) – The next word to process.
log_prob_of_word(word_index:int) → float

Gets the log-prob for the provided word.

Returns:The log-prob that the model predicts for the provided word-index, given the current history.
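
Example (a minimal sketch of scoring a word sequence; the info object and the symbol indices are assumed to have been set up elsewhere):

from kaldi.rnnlm import RnnlmComputeState

bos_index, eos_index = 1, 2                 # assumed indices of <s> and </s>
word_ids = [7620, 12309]                    # sentence already mapped to integers

state = RnnlmComputeState(info, bos_index)  # history starts as just <s>
total_logprob = 0.0
for word in word_ids + [eos_index]:
    total_logprob += state.log_prob_of_word(word)
    state.add_word(word)                    # advance the RNNLM by one word
print("sentence log-prob:", total_logprob)
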
class kaldi.rnnlm.RnnlmComputeStateComputationOptions

Options for RNNLM compute state.

bos_index

Index in wordlist representing the begin-of-sentence symbol.

We need this when we initialize the RnnlmComputeState and pass the BOS history.

brk_index

Index in wordlist representing the break symbol.

This is not needed for computation; included only for ease of scripting.

compute_config

Nnet compute options.

debug_computation

Whether to turn on debug for the actual computation (very verbose!).

eos_index

Index in wordlist representing the end-of-sentence symbol.

We need this to compute the final cost of a state.

normalize_probs

Whether to normalize word probabilities exactly.

If False, the sum-to-one normalization is approximate.

optimize_config

Nnet optimization options.

register(opts:OptionsItf)

Registers options with an object implementing the options interface.

Parameters:opts (OptionsItf) – An object implementing the options interface. Typically a command-line option parser.
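
Example (an illustrative configuration using the documented attributes; the index values are placeholders):

from kaldi.rnnlm import RnnlmComputeStateComputationOptions

opts = RnnlmComputeStateComputationOptions()
opts.bos_index = 1           # <s> in the word list (placeholder value)
opts.eos_index = 2           # </s>
opts.brk_index = 3           # <brk>, only needed for scripting convenience
opts.normalize_probs = True  # exact sum-to-one normalization
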
class kaldi.rnnlm.RnnlmComputeStateInfo(opts, rnnlm, word_embedding_mat)

State information for RNNLM computation.

This class keeps references to the word-embedding, nnet3 part of RNNLM and the RnnlmComputeStateComputationOptions. It handles the computation of the nnet3 object.

Parameters:
  • opts (RnnlmComputeStateComputationOptions) – Options for RNNLM compute state.
  • rnnlm (Nnet) – The nnet3 part of the RNNLM.
  • word_embedding_mat (CuMatrix) – The word embedding matrix.
computation

The compiled, ‘looped’ nnet computation.

class kaldi.rnnlm.RnnlmCoreComputer

Core RNNLM computer.

This class has a similar interface to RnnlmCoreTrainer, but it doesn't actually train the RNNLM; it's for computing likelihoods and (optionally) derivatives w.r.t. the embedding, in situations where you are not training the core part of the RNNLM. It reads egs -- it's not for rescoring lattices and similar purposes.

Parameters:nnet (Nnet) – The neural network that is to be used to evaluate likelihoods (and possibly derivatives).
compute(minibatch:RnnlmExample, derived:RnnlmExampleDerived, word_embedding:CuMatrixBase, word_embedding_deriv:CuMatrixBase=default) -> (objf:float, weight:float)

Computes the objective on one minibatch.

If word_embedding_deriv is provided, it also computes derivatives w.r.t. the embedding.

Parameters:
  • minibatch (RnnlmExample) – The RNNLM minibatch to evaluate, containing a number of parallel word sequences. It will not necessarily contain words with the ‘original’ numbering, it will in most circumstances contain just the ones we used; see renumber_rnnlm_example().
  • derived (RnnlmExampleDerived) – Derived quantities of the minibatch, pre-computed by calling get_rnnlm_example_derived() with suitable arguments.
  • word_embedding (CuMatrixBase) – The matrix giving the embedding of words, of dimension minibatch.vocab_size by the embedding dimension. The numbering of the words does not have to be the ‘real’ numbering of words, it can consist of words renumbered by renumber_rnnlm_example(); it just has to be consistent with the word-ids present in ‘minibatch’.
  • word_embedding_deriv (CuMatrixBase) – If not None, the derivative of the objective function w.r.t. the word embedding will be added to this location; it must have the same dimension as ‘word_embedding’.
Returns:

  • objf – The total objective function for this minibatch; divide this by weight to normalize it (i.e. get the average log-prob per word).
  • weight – The total weight of the words in the minibatch. This is just the sum of minibatch.output_weights.
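
Example (a hedged sketch; rnnlm, minibatch and word_embedding are assumed to have been prepared elsewhere, e.g. by an egs reader):

from kaldi.rnnlm import RnnlmCoreComputer, get_rnnlm_example_derived

computer = RnnlmCoreComputer(rnnlm)
derived = get_rnnlm_example_derived(minibatch, False)  # False: no embedding derivs
objf, weight = computer.compute(minibatch, derived, word_embedding)
print("average log-prob per word:", objf / weight)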

class kaldi.rnnlm.RnnlmCoreTrainer

Core RNNLM trainer.

This class does the core part of the training of the RNNLM; the word embeddings are supplied to this class for each minibatch and while this class can compute objective function derivatives w.r.t. these embeddings, it is not responsible for updating them.

Parameters:
consolidate_memory()

Consolidates neural network memory.

print_max_change_stats()

Prints out the max-change stats (if nonzero).

This is the percentage of time that per-component max-change and global max-change were enforced.

train(minibatch:RnnlmExample, derived:RnnlmExampleDerived, word_embedding:CuMatrixBase, word_embedding_deriv:CuMatrixBase)

Do training for one minibatch.

Parameters:
  • minibatch (RnnlmExample) – The RNNLM minibatch to train on, containing a number of parallel word sequences. It will not necessarily contain words with the ‘original’ numbering, it will in most circumstances contain just the ones we used; see renumber_rnnlm_example().
  • derived (RnnlmExampleDerived) – Derived quantities of the minibatch, pre-computed by calling get_rnnlm_example_derived() with suitable arguments.
  • word_embedding (CuMatrixBase) – The matrix giving the embedding of words, of dimension minibatch.vocab_size by the embedding dimension. The numbering of the words does not have to be the ‘real’ numbering of words, it can consist of words renumbered by renumber_rnnlm_example(); it just has to be consistent with the word-ids present in ‘minibatch’.
  • word_embedding_deriv (CuMatrixBase) – If not None, the derivative of the objective function w.r.t. the word embedding will be added to this location; it must have the same dimension as ‘word_embedding’.
train_backstitch(is_backstitch_step1:bool, minibatch:RnnlmExample, derived:RnnlmExampleDerived, word_embedding:CuMatrixBase, word_embedding_deriv:CuMatrixBase)

Do backstitch training for one minibatch.

Depending on whether is_backstitch_step1 is true, this is either the first (backward) step or the second (forward) step of backstitch.

Parameters:
  • is_backstitch_step1 (bool) – If true, update the stats; otherwise do not.
  • minibatch (RnnlmExample) – The RNNLM minibatch to train on, containing a number of parallel word sequences. It will not necessarily contain words with the ‘original’ numbering, it will in most circumstances contain just the ones we used; see renumber_rnnlm_example().
  • derived (RnnlmExampleDerived) – Derived quantities of the minibatch, pre-computed by calling get_rnnlm_example_derived() with suitable arguments.
  • word_embedding (CuMatrixBase) – The matrix giving the embedding of words, of dimension minibatch.vocab_size by the embedding dimension. The numbering of the words does not have to be the ‘real’ numbering of words, it can consist of words renumbered by renumber_rnnlm_example(); it just has to be consistent with the word-ids present in ‘minibatch’.
  • word_embedding_deriv (CuMatrixBase) – If not None, the derivative of the objective function w.r.t. the word embedding will be added to this location; it must have the same dimension as ‘word_embedding’.
class kaldi.rnnlm.RnnlmCoreTrainerOptions

Options for core RNNLM training.

These are related to the core RNNLM training, i.e. training the actual neural net for the RNNLM (when the word embeddings are given).

backstitch_training_interval

Backstitch training interval (n).

Do backstitch training with the specified interval of minibatches.

backstitch_training_scale

Backstitch training factor (alpha).

If 0 then in the normal training mode.

l2_regularize_factor

Factor that affects the strength of l2 regularization.

This affects the strength of l2 regularization on model parameters. It will be multiplied by the component-level l2-regularize values and can be used to correct for effects related to parallelization by model averaging.

max_param_change

The maximum change in model parameters allowed per minibatch.

This is measured in Euclidean norm. Change will be clipped to this value.

momentum

Momentum constant (help stabilize training updates), e.g. 0.9.

We automatically multiply the learning rate by (1-momentum) so that the ‘effective’ learning rate is the same as before (because momentum would normally increase the effective learning rate by 1/(1-momentum)).

print_interval

The log printing interval (in terms of #minibatches).

register(opts:OptionsItf)

Registers options with an object implementing the options interface.

Parameters:opts (OptionsItf) – An object implementing the options interface. Typically a command-line option parser.
class kaldi.rnnlm.RnnlmEgsConfig

RNNLM example configuration.

bos_symbol

Beginning of sentence symbol.

It must be set.

brk_symbol

Break symbol.

It must be set.

check()

Validates the options.

Raises:RuntimeError – If validation fails.
chunk_buffer_size

The number of chunks that are buffered while processing the input.

Larger means more complete randomization but also more I/O before we produce any output, and more memory used.

chunk_length

The length of each sequence in a minibatch.

The length of each sequence in a minibatch, including any terminating </s> symbols, which are included explicitly in the sequences. When </s> appears in the middle of sequences because we splice shorter word sequences together, we will replace it with <s> on the input side of the network. Sentences, or pieces of sentences, that were shorter than chunk_length, will be padded as needed.

eos_symbol

End of sentence symbol.

It must be set.

min_split_context

Min left-context supplied for each training sentence piece.

num_chunks_per_minibatch

The number of parallel word sequences/chunks per minibatch.

num_samples

The number of words we choose each time we do the sampling.

register(opts:OptionsItf)

Registers options with an object implementing the options interface.

Parameters:opts (OptionsItf) – An object implementing the options interface. Typically a command-line option parser.
sample_group_size

The sampling group size.

This is the number of consecutive time-steps which form a single unit for sampling purposes. This number will always divide chunk_length. Example: if sample_group_size==2, we'll sample one set of words for t={0,1}, another for t={2,3}, and so on. We support merging time-steps in this way (but not splitting them smaller) due to considerations of computing time, if you assume we also have a network that learns word representations from their character-level features.

special_symbol_prob

Sampling probability for words that aren’t supposed to be predicted.

Sampling probability at the output for words that aren't supposed to be predicted (<s>, <brk>); this ensures that the model makes their output probs small, which avoids hassle when computing the normalizer at test time (if we didn't sample them with some probability to ensure their probs are small, we'd have to exclude them from the denominator sum).

uniform_prob_mass

The probability mass to uniformly distribute over all words.

This value should be < 1.0; it is the proportion of the unigram distribution used for sampling that is assigned to predicting all words uniformly. This may avoid certain pathologies during training, and ensures that all words' probs are bounded away from zero, which might be necessary for the theory of importance sampling.

vocab_size

The vocabulary size.

More specifically, the largest integer word-id plus one. Must be provided, as it gets included in each minibatch (mostly for checking purposes).
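
Example (an illustrative configuration; all values are placeholders):

from kaldi.rnnlm import RnnlmEgsConfig

config = RnnlmEgsConfig()
config.vocab_size = 10000                 # largest word-id plus one
config.bos_symbol, config.eos_symbol, config.brk_symbol = 1, 2, 3
config.chunk_length = 32
config.num_chunks_per_minibatch = 128
config.check()                            # raises RuntimeError if inconsistent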

class kaldi.rnnlm.RnnlmEmbeddingTrainer

RNNLM embedding trainer.

This class is responsible for training the word embedding matrix or feature embedding matrix.

Parameters:
  • config (RnnlmEmbeddingTrainerOptions) – Options for RNNLM embedding training.
  • embedding_mat (CuMatrix) – The embedding matrix to be trained, of dimension (num-words or num-features) by embedding-dim (depending whether we are using a feature representation of words, or not).
train(embedding_deriv:CuMatrixBase)

Do training for one minibatch.

This version is used either when there is no subsampling, or when there is subsampling but we are using a feature representation so the subsampling is handled outside of this code.

Parameters:embedding_deriv (CuMatrixBase) – The derivative of the objective function w.r.t. the word (or feature) embedding matrix.
train_backstitch(is_backstitch_step1:bool, embedding_deriv:CuMatrixBase)

Do backstitch training for one minibatch.

This version is used either when there is no subsampling, or when there is subsampling but we are using a feature representation so the subsampling is handled outside of this code.

Depending on whether is_backstitch_step1 is true, this is either the first (backward) step or the second (forward) step of backstitch.

Parameters:
  • is_backstitch_step1 (bool) – If true, update the stats; otherwise do not.
  • embedding_deriv (CuMatrixBase) – The derivative of the objective function w.r.t. the word (or feature) embedding matrix.
train_backstitch_with_subsampling(is_backstitch_step1:bool, active_words:CuArray, word_embedding_deriv:CuMatrixBase)

Do backstitch training for one minibatch.

This version is for when there is subsampling, and the user is providing the derivative w.r.t. just the word-indexes that were used in this minibatch. active_words is a sorted, unique list of the word-indexes that were used in this minibatch, and word_embedding_deriv is the derivative w.r.t. the embedding of that list of words.

Depending on whether is_backstitch_step1 is true, this is either the first (backward) step or the second (forward) step of backstitch.

Parameters:
  • is_backstitch_step1 (bool) – If true, update the stats; otherwise do not.
  • active_words (CuArray) – A sorted, unique list of the word indexes used, with dimension equal to word_embedding_deriv.num_rows.
  • word_embedding_deriv (CuMatrixBase) – The derivative of the objective function w.r.t. the word embedding matrix.
train_with_subsampling(active_words:CuArray, word_embedding_deriv:CuMatrixBase)

Do training for one minibatch.

This version is for when there is subsampling, and the user is providing the derivative w.r.t. just the word-indexes that were used in this minibatch. active_words is a sorted, unique list of the word-indexes that were used in this minibatch, and word_embedding_deriv is the derivative w.r.t. the embedding of that list of words.

Parameters:
  • active_words (CuArray) – A sorted, unique list of the word indexes used, with dimension equal to word_embedding_deriv.num_rows.
  • word_embedding_deriv (CuMatrixBase) – The derivative of the objective function w.r.t. the word embedding matrix.
class kaldi.rnnlm.RnnlmEmbeddingTrainerOptions

Options for RNNLM embedding training.

backstitch_training_interval

Backstitch training interval (n).

Do backstitch training with the specified interval of minibatches.

backstitch_training_scale

Backstitch training factor (alpha).

If 0 then in the normal training mode.

check()

Validates RNNLM embedding training options.

l2_regularize

Factor that affects the strength of l2 regularization.

This affects the strength of l2 regularization on embedding parameters.

learning_rate

The learning rate used in training the word-embedding matrix.

max_param_change

The maximum change in embedding parameters allowed per minibatch.

This is measured in Euclidean norm. The embedding matrix has dimensions num-features by embedding-dim or num-words by embedding-dim if we’re not using a feature-based representation.

momentum

Momentum constant for training of embeddings (e.g. 0.5 or 0.9).

We automatically multiply the learning rate by (1-momentum) so that the ‘effective’ learning rate is the same as before (because momentum would normally increase the effective learning rate by 1/(1-momentum)).

natural_gradient_alpha

Smoothing constant alpha to use for natural gradient.

natural_gradient_num_minibatches_history

Determines how quickly the Fisher estimate for the natural gradient is updated, when training the word embedding.

natural_gradient_rank

Rank of the Fisher matrix in natural gradient.

This is applied to learning the embedding matrix (this is in the embedding space, so the rank should probably be less than the embedding dimension).

natural_gradient_update_period

Determines how often the Fisher matrix is updated for natural gradient as applied to the embedding matrix.

print_interval

The log printing interval (in terms of #minibatches).

register(opts:OptionsItf)

Registers options with an object implementing the options interface.

Parameters:opts (OptionsItf) – An object implementing the options interface. Typically a command-line option parser.
use_natural_gradient

Whether to use natural gradient to update the embedding matrix.

class kaldi.rnnlm.RnnlmExample

A single minibatch for training an RNNLM.

chunk_length

The length of each sequence in a minibatch.

The length of each sequence in a minibatch, including any terminating </s> symbols, which are included explicitly in the sequences. When </s> appears in the middle of sequences because we splice shorter word sequences together, we will replace it with <s> on the input side of the network. Sentences, or pieces of sentences, that were shorter than chunk_length, will be padded as needed.

input_words

The input word labels.

Contains the input word symbols 0 <= i < vocab_size for each position in each chunk; dimension == chunk_length * num_chunks, where 0 <= t < chunk_length has larger stride than 0 <= n < num_chunks. In the common case these will be the same as the previous output symbol.

num_chunks

The number of parallel word sequences/chunks.

Some of the word sequences may actually be made up of smaller subsequences appended together.

num_samples

The number of samples.

This is the number of words that we sample at the output of the nnet for each of the num_sample_groups groups. If we didn’t do sampling because the user didn’t provide the ARPA language model, this will be zero (in this case we’ll do the summation over all words in the vocab).

output_weights

The output weights.

Weights for each of the output_words, indexed the same way as output_words. These reflect any data-weighting we had in the original data, plus some zeros that relate to padding sequences of uneven length.

output_words

The output word labels.

The output (predicted) word symbols for each position in each chunk; indexed in the same way as ‘input_words’. What this contains is different from ‘input_words’ in the sampling case (i.e. if !sampled_words.empty()). In this case, instead of the word-index it contains the relative index 0 <= i < num_samples within the block of sampled words. In the not-sampled case it contains actual word indexes 0 <= i < vocab_size.

read(is:istream, binary:bool)

Reads the RNNLM example from input stream.

Parameters:
  • is (istream) – The input C++ stream.
  • binary (bool) – Whether the stream is binary.
sample_group_size

The sampling group size.

This is the number of consecutive time-steps which form a single unit for sampling purposes. This number will always divide chunk_length. Example: if sample_group_size==2, we’ll sample one set of words for t={0,1}, another for t={2,3}, and so on. The sampling is for the denominator of the objective function.

sample_inv_probs

The inverse probabilities.

This vector has the same dimension as 'sampled_words' and contains the inverses of the probabilities 0 < p <= 1 with which each word was included in the sampled set of words. These inverse probabilities appear in the objective function computation (it's related to importance sampling).

sampled_words

The sampled word labels.

This list contains the word-indexes that we sampled for each position in the chunk and for each group of chunks. (It will be empty if the user didn’t provide the ARPA language model). Its dimension is num_sample_groups * num_samples, where num_sample_groups == (chunk_length / sample_group_size). The sample-group index has the largest stride (you can think of the sample group index as the number i = t / sample_group_size, in integer division, where 0 <= t < chunk_length is the position in the chunk). The sampled words within each block of size num_samples are sorted and unique.

swap(other:RnnlmExample)

Swaps contents with another RNNLM example.

Parameters:other (RnnlmExample) – The other RNNLM example.
vocab_size

The vocabulary size.

The vocabulary size (defined as largest integer word-id plus one) for which this example was obtained; mostly used in bounds checking.

write(os:ostream, binary:bool)

Writes the RNNLM example to output stream.

Parameters:
  • os (ostream) – The output C++ stream.
  • binary (bool) – Whether the stream is binary.
class kaldi.rnnlm.RnnlmExampleCreator

RNNLM example creator.

This class takes care of all of the logic of creating minibatches for RNNLM training, including the sampling aspect.

accept_sequence(weight:float, words:list<int>)

Accepts a single sequence.

The user calls this to provide a single sequence (a sentence, or multiple sentences that are part of a continuous stream or dialogue, separated by </s>) to this class. This class will write out minibatches when it's ready. This will normally be the result of reading a line of text with the format:

<weight> <word1> <word2> ….
e.g.:
1.0 Hello there

although the “hello there” would have been converted to integers by the time it was read in, via sym2int.pl, so it would look like:

1.0 7620 12309
We also allow:
1.0 Hello there </s> Hi </s> My name is Bob

if you want to train the model to predict sentences given the history of the conversation.

flush()

Flushes out any pending minibatches.

process(is:istream)

Processes the lines from input stream.

Lines will be of the format:
<weight> <possibly-empty-sequence-of-integers>
e.g.:
1.0 2560 8991
without_sampling(config:RnnlmEgsConfig, writer:RnnlmExampleWriter) → RnnlmExampleCreator

Instantiates a new RNNLM example creator.

This constructor is for when you are not using importance sampling, so no samples will be stored in the minibatch and the training code will presumably evaluate all the words each time. This is intended to be used for testing purposes.
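
Example (a hedged sketch of writing minibatches without sampling; RnnlmExampleWriter is assumed to live in kaldi.util.table, and the wspecifier and word-ids are placeholders):

from kaldi.rnnlm import RnnlmEgsConfig, RnnlmExampleCreator
from kaldi.util.table import RnnlmExampleWriter  # assumed location

config = RnnlmEgsConfig()
config.vocab_size = 10000
config.bos_symbol, config.eos_symbol, config.brk_symbol = 1, 2, 3

with RnnlmExampleWriter("ark:egs.ark") as writer:
    creator = RnnlmExampleCreator.without_sampling(config, writer)
    creator.accept_sequence(1.0, [7620, 12309])  # weight, word-ids
    creator.flush()                              # write any pending minibatches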

class kaldi.rnnlm.RnnlmExampleDerived

Various quantities/expressions derived from an RNNLM example.

This class contains various quantities/expressions that are derived from the quantities found in RnnlmExample, and which are needed when training on that example, particularly by the function process_rnnlm_output().

cu_input_words

CUDA copy of minibatch.input_words.

cu_output_words

CUDA copy of minibatch.output_words.

It’s only used in the sampling case.

cu_sampled_words

CUDA copy of minibatch.sampled_words.

It’s only used in the sampling case (in the no-sampling case, minibatch.sampled_words would be empty anyway).

swap(other:RnnlmExampleDerived)

Swaps contents with another derived RNNLM example.

Parameters:other (RnnlmExampleDerived) – The other derived RNNLM example.
class kaldi.rnnlm.RnnlmExampleSampler

RNNLM example sampler.

This class encapsulates the logic for sampling words for a minibatch. The words at the output of the RNNLM are sampled and we train with an importance-sampling algorithm.

Parameters:
sample_for_minibatch(minibatch:RnnlmExample)

Does the sampling for a minibatch.

Parameters:minibatch (RnnlmExample) – The minibatch. It is expected to already have all fields populated except for sampled_words and sample_inv_probs. This method does the sampling and sets those fields.
vocab_size() → int

Gets vocabulary size.

Returns:The vocabulary size, i.e. the highest-numbered word plus one.
Return type:int
class kaldi.rnnlm.RnnlmObjectiveOptions

Options for RNNLM objective function.

Configuration class relating to the objective function used for RNNLM training, more specifically for use by the function process_rnnlm_output().

den_term_limit

Modification to the with-sampling objective.

This prevents instability early in training, but in the end makes no difference. We scale down the denominator part of the objective when the average denominator part of the objective, for this minibatch, is more negative than this value. Set this to 0.0 to use the unmodified objective function.

max_logprob_elements

Maximum number of elements in the logprob matrix.

Maximum number of elements when we allocate a matrix of size [minibatch-size, num-words] for computing logprobs of words. If the size is exceeded, we will break the matrix along the minibatch axis and compute them separately.

register(opts:OptionsItf)

Registers options with an object implementing the options interface.

Parameters:opts (OptionsItf) – An object implementing the options interface. Typically a command-line option parser.
class kaldi.rnnlm.RnnlmTrainer

RNNLM trainer.

The class RnnlmTrainer is for training an RNNLM (one individual training job, not the top-level logic about learning rate schedules, parameter averaging, and the like).

Parameters:
  • train_embedding (bool) – Whether to train the embedding matrix.
  • core_config (RnnlmCoreTrainerOptions) – Options for training the core RNNLM.
  • embedding_config (RnnlmEmbeddingTrainerOptions) – Options for training the embedding matrix (only relevant if train_embedding is True).
  • objective_config (RnnlmObjectiveOptions) – Options relating to the objective function used for training.
  • word_feature_mat (CuSparseMatrix) – Either None, or a sparse word-feature matrix of dimension vocab-size by feature-dim, where vocab-size is the highest-numbered word plus one.
  • embedding_mat (CuMatrix) – The embedding matrix; this is trained if train_embedding is True. If word_feature_mat is None, this is the word-embedding matrix of dimension vocab-size by embedding-dim; otherwise it is the feature-embedding matrix of dimension feature-dim by embedding-dim, and we have to multiply it by word_feature_mat to get the word embedding matrix.
  • rnnlm (Nnet) – The RNNLM to be trained.

num_minibatches_processed() → int

Returns the number of minibatches processed so far.

train(minibatch:RnnlmExample)

Train on one example.

The example is acquired destructively, via swapping contents.

Note

This function doesn’t actually train on this example; what it does is to train on the previous example, and provide this example to the background thread that computes the derived parameters of the example.
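
Example (a hedged sketch of the per-job training loop; trainer is assumed to be an already-constructed RnnlmTrainer, and the sequential egs reader name is an assumption):

from kaldi.util.table import SequentialRnnlmExampleReader  # assumed name

with SequentialRnnlmExampleReader("ark:egs.ark") as reader:
    for key, minibatch in reader:
        trainer.train(minibatch)   # contents are swapped out destructively
print("minibatches processed:", trainer.num_minibatches_processed())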

class kaldi.rnnlm.Sampler

Word sampler.

This class allows us to sample a set of words from a distribution over words, where the distribution (which ultimately comes from an ARPA-style language model) is given as a combination of a unigram distribution with a sparse component represented as a list of (word-index, probability) pairs.

Parameters:unigram_probs (List[float]) – The unigram probabilities for each word. Each element should be >= 0, and they should sum to a value close to 1.
sample_words(num_words_to_sample:int, unigram_weight:float, higher_order_probs:list<tuple<int, float>>) → list<tuple<int, float>>

Samples words from the supplied distribution, appropriately scaled.

Let the unnormalized distribution be as follows:
p(i) = unigram_weight * u(i) + h(i)

where u(i) is the ‘unigram_probs’ list this class was constructed with, and h(i) is the probability that word i is given (if any) in the sparse vector that ‘higher_order_probs’ represents. Notice that we are adding to the unigram distribution, we are not backing off to it. Doing it this way makes a lot of things simpler.

We define the first-order inclusion probabilities:
q(i) = min(alpha p(i), 1.0)

where alpha is chosen so that the sum of q(i) equals ‘num_words_to_sample’. Then we generate a sample whose first-order inclusion probabilities are q(i). We do all this without explicitly iterating over the unigram distribution, so this is fairly fast.

Parameters:
  • num_words_to_sample (int) – The number of words that we are directed to sample; must be > 0 and less than the number of nonzero elements of the 'unigram_probs' list that this class was constructed with.
  • unigram_weight (float) – Must be > 0.0. Search above for p(i) to see what effect it has.
  • higher_order_probs (List[Tuple[int,float]]) – A list of pairs (i, p) where 0 <= i < unigram_probs.size() (referring to the unigram_probs list used in the constructor), and p > 0.0. This list must be sorted and unique w.r.t. i. Note: the probabilities here will be added to the unigram probabilities of the words concerned.
Returns:

The sampled list of words, represented as pairs (i, p), where 0 <= i < unigram_probs.size() is the word index and 0 < p <= 1 is the probability with which that word was included in the set. The list will not be sorted, but it will be unique on the int. Its size will equal num_words_to_sample.
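
Example (a toy illustration with made-up probabilities):

from kaldi.rnnlm import Sampler

unigram_probs = [0.0, 0.4, 0.3, 0.2, 0.1]   # index 0 (<eps>) gets zero mass
sampler = Sampler(unigram_probs)

higher_order = [(2, 0.5), (4, 0.25)]        # sorted and unique on word-id
for word_id, inclusion_prob in sampler.sample_words(3, 1.0, higher_order):
    print(word_id, inclusion_prob)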

sample_words_with_requirements(num_words_to_sample:int, unigram_weight:float, higher_order_probs:list<tuple<int, float>>, words_we_must_sample:list<int>) → list<tuple<int, float>>

Sample words by specifying a list of words that must be sampled.

This is an alternative version of sample_words() which allows you to specify a list of words that must be sampled (i.e. after scaling, they must have probability 1.0). It does this by adding them to the distribution with sufficiently large probability and then calling sample_words().

Parameters:
  • num_words_to_sample (int) – The number of words that we are directed to sample; must be > 0 and less than the number of nonzero elements of the 'unigram_probs' list that this class was constructed with.
  • unigram_weight (float) – Must be > 0.0. Search above for p(i) to see what effect it has.
  • higher_order_probs (List[Tuple[int,float]]) – A list of pairs (i, p) where 0 <= i < unigram_probs.size() (referring to the unigram_probs list used in the constructor), and p > 0.0. This list must be sorted and unique w.r.t. i. Note: the probabilities here will be added to the unigram probabilities of the words concerned.
  • words_we_must_sample (List[int]) – A list of words that must be sampled. It must be sorted and unique, and all elements i must satisfy 0 <= i < len(unigram_probs), where unigram_probs is the list supplied to the constructor.
Returns:

The sampled list of words, represented as pairs (i, p), where 0 <= i < unigram_probs.size() is the word index and 0 < p <= 1 is the probability with which that word was included in the set. The list will not be sorted, but it will be unique on the int. Its size will equal num_words_to_sample.

See also

sample_words().

class kaldi.rnnlm.SamplingLm

Sampling LM.

from_estimator(estimator:SamplingLmEstimator) → SamplingLm

Creates a new sampling LM with the given estimator.

This constructor reads the object directly from a SamplingLmEstimator instance, which is much faster than dealing with the ARPA format. It also allows us to avoid having to add a bunch of unnecessary n-grams to satisfy the requirements of the ARPA file format. It assumes that you have already called estimator.estimate().

Parameters:estimator (SamplingLmEstimator) – The sampling LM estimator.
from_options(options:ArpaParseOptions, symbols:SymbolTable) → SamplingLm

Creates a new sampling LM with the given options.

ARPA LM is read from the file specified in the options. Only text mode is supported.

Parameters:
  • options (ArpaParseOptions) – Options for the ARPA LM parser.
  • symbols (SymbolTable) – The symbol table used to map words to integers.
get_distribution(histories:list<tuple<list<int>, float>>) -> (unigram_prob:float, non_unigram_probs:dict<int, float>)

Gets word probabilities given a list of histories.

Parameters:histories (List[Tuple[List[int],float]]) – A list of histories with associated weights.
Returns:A scalar unigram_prob which is computed by summing all history weights after scaling them with the corresponding backoff weights and a dictionary mapping words to their corresponding probabilities given the list of histories.

Note

The sum of the returned unigram_prob plus the values of the output non_unigram_probs will not necessarily be equal to 1.0, but it will be equal to the total of the weights of the histories in histories.

get_distribution_pairs(histories:list<tuple<list<int>, float>>) -> (unigram_prob:float, non_unigram_probs:list<tuple<int, float>>)

Gets word probabilities given a list of histories.

Parameters:histories (List[Tuple[List[int],float]]) – A list of histories with associated weights.
Returns:A scalar unigram_prob which is computed by summing all history weights after scaling them with the corresponding backoff weights and a list of pairs (word-id, weight), that’s sorted and unique on word-id, mapping words to their corresponding probabilities given the list of histories.

Note

The sum of the returned unigram_prob plus the second elements of the output non_unigram_probs will not necessarily be equal to 1.0, but it will be equal to the total of the weights of histories in histories.

See also

get_distribution().

get_unigram_distribution() → list<float>

Gets unigram probabilities.

This method outputs the unigram distribution of all words represented by integers from 0 to maximum symbol id.

Returns:A list of floats representing the unigram distribution of all words.

Note

There can be gaps of integers for words in the ARPA LM; we set the probabilities of words that are not in the ARPA LM to 0.0, e.g., symbol id 0, which represents epsilon, has probability 0.0.
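
Example (a hedged sketch tying SamplingLm to Sampler; lm is assumed to be a SamplingLm built elsewhere, and the histories are illustrative):

from kaldi.rnnlm import Sampler

sampler = Sampler(lm.get_unigram_distribution())
histories = [([7620], 0.5), ([7620, 12309], 0.5)]     # (history, weight) pairs
unigram_weight, pairs = lm.get_distribution_pairs(histories)
sampled = sampler.sample_words(64, unigram_weight, pairs)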

options() → ArpaParseOptions

Gets ARPA parser options.

Returns:The ARPA parser options.
order() → int

Gets n-gram order.

Returns:The n-gram order, e.g. 1 for a unigram LM, 2 for a bigram.
Return type:int
read(is:istream, binary:bool)

Reads the sampling LM from input stream.

This method does not read the ARPA format, it reads the special-purpose format written by write().

Parameters:
  • is (istream) – The input C++ stream.
  • binary (bool) – Whether the stream is binary.

See also

read_arpa().

read_arpa(is:istream)

Reads the sampling LM from a file in ARPA format.

Parameters:is (istream) – The input C++ stream.
swap(other:SamplingLm)

Swaps contents with another sampling LM.

Parameters:other (SamplingLm) – The other sampling LM.
vocab_size() → int

Gets vocabulary size.

Returns:The vocabulary size, i.e. the highest-numbered word plus one.
Return type:int
write(os:ostream, binary:bool)

Writes the sampling LM to output stream.

Parameters:
  • os (ostream) – The output C++ stream.
  • binary (bool) – Whether the stream is binary.
class kaldi.rnnlm.SamplingLmEstimator

Sampling LM estimator.

This class is responsible for creating a backoff n-gram language model of a type that’s suitable for use in the importance sampling algorithm we use for RNNLM training. It’s the type of language model that could in principle be written in ARPA format, but it’s created in a special way. There are a few characteristics of the importance sampling algorithm that make it desirable to write a special purpose language model instead of using a generic language model toolkit. These are:

  • When we sample, we sample from a distribution that is the average of a fairly large number of history states N (e.g., N=128), that can be treated as independently chosen for practical purposes (except that sometimes they’ll all be the BOS history, which is a special case).
  • The convergence of the sampling-based method won’t be sensitive to small differences in the probabilities of the distribution we sample on.
  • It’s important not to have too many words that are specifically predicted from a typical history-state, or it makes the sampling process slow.
Parameters:config (SamplingLmEstimatorOptions) – Options for sampling LM estimator.
estimate(will_write_arpa:bool)

Estimates the language model (including the discounting).

Parameters:will_write_arpa (bool) – Whether to retain certain n-grams (required in the ARPA file format) that would otherwise have been pruned.
print_as_arpa(os:ostream, symbols:SymbolTable)

Prints the LM in ARPA format.

Parameters:
  • os (ostream) – The output stream to write the model to.
  • symbols (SymbolTable) – The symbol table to map integers to words.
process(is:istream)

Processes the lines read from the input stream.

Lines will be of the format:
<weight> <possibly-empty-sequence-of-integers>
e.g.:
1.0 2560 8991
Parameters:is (istream) – The input stream.
process_line(corpus_weight:float, sentence:list<int>)

Processes one line of the input, adding it to the stored stats.

Parameters:
  • corpus_weight (float) – Weight attached to the corpus from which this data came. (Note: you shouldn’t repeat sentences when providing them to this class, although this is allowed during the actual RNNLM training; instead, you should make sure that the multiplicity that you use in the RNNLM for this corpus is reflected in ‘corpus_weight’.)
  • sentence (List[int]) – The sentence we are processing. It is not expected to contain the BOS symbol, and should not be terminated by the EOS symbol, although the EOS symbol is allowed internally (where it can be used to separate a sequence of sentences from a dialogue or other sequence of text, if you want to do this).
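
Example (a hedged end-to-end sketch; symbol ids and the sentence are placeholders):

from kaldi.rnnlm import (SamplingLm, SamplingLmEstimator,
                         SamplingLmEstimatorOptions)

opts = SamplingLmEstimatorOptions()
opts.vocab_size = 10000
opts.bos_symbol, opts.eos_symbol, opts.brk_symbol = 1, 2, 3
opts.check()

estimator = SamplingLmEstimator(opts)
estimator.process_line(1.0, [7620, 12309])   # corpus weight, sentence
estimator.estimate(False)                    # False: not writing ARPA
lm = SamplingLm.from_estimator(estimator)
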
class kaldi.rnnlm.SamplingLmEstimatorOptions

Options for sampling LM estimator.

backoff_factor

The backoff factor.

Factor by which p(w|h) for higher-than-bigram history state h (with the backoff term excluded) has to be greater than p(w|backoff-state) for us to include it in the model (in addition to the unigram_factor constraint). Must be >0.0 and < unigram-factor.

bos_factor

The beginning of sentence factor.

Factor by which p(w|h) for h == the BOS history state (with the backoff term excluded) has to be higher than p(w|unigram-state) for us to include it in the model. Must be >0.0 and <= unigram-factor.

bos_symbol

Integer id for the BOS word (<s>).

brk_symbol

Integer id for the Break word (<brk>).

Not needed but included for ease of scripting.

check()

Validates the options.

Raises:RuntimeError – If validation fails.
discounting_constant

Constant for absolute discounting.

It should be in the range 0.8 to 1.0. Smaller values give a larger language model.

eos_symbol

Integer id for the EOS word (</s>).

ngram_order

Order for the n-gram model (must be >= 1), e.g. 3 means trigram.

register(opts:OptionsItf)

Registers options with an object implementing the options interface.

Parameters:opts (OptionsItf) – An object implementing the options interface. Typically a command-line option parser.
unigram_factor

The unigram factor.

Factor by which p(w|h) for non-unigram history state h (with the backoff term excluded) has to be greater than p(w|unigram-state) for us to include it in the model. Must be >0.0, will normally be >1.0.

unigram_power

The unigram power scalar.

This is an important configuration value. After all other stages of estimating the model, the unigram probabilities are taken to this power, e.g. 0.75, and then rescaled to sum to 1.0. There are both theoretical and practical reasons why we want to apply this power just to the unigram portion.

vocab_size

The vocabulary size.

If set, must be set to the highest-numbered vocabulary word plus one; otherwise this is worked out from the symbol table.

kaldi.rnnlm.check_distribution(d:list<tuple<int, float>>)

Validates a distribution.

Checks if a distribution is sorted and unique on its first values, and if all of its second values are > 0.

Parameters:d (List[Tuple[int,float]]) – The input distribution.
Raises:RuntimeError – If validation fails.
kaldi.rnnlm.get_rnnlm_computation_request(minibatch:RnnlmExample, need_model_derivative:bool, need_input_derivative:bool, store_component_stats:bool) → ComputationRequest

Creates a computation request for the given RNNLM example.

This function takes an RnnlmExample (which should already have been frame-selected, if desired, and merged into a minibatch) and produces a ComputationRequest. It assumes you don’t want the derivatives w.r.t. the inputs; if you do, you can create/modify the ComputationRequest manually. Assumes that if need_model_derivative is true, you will be supplying derivatives w.r.t. all outputs.

kaldi.rnnlm.get_rnnlm_example_derived(minibatch:RnnlmExample, need_embedding_deriv:bool) → RnnlmExampleDerived

Constructs a derived RNNLM example.

Sets up the structure containing derived parameters used in training and objective function computation.

Parameters:
  • minibatch (RnnlmExample) – The input minibatch for which we are computing the derived parameters.
  • need_embedding_deriv (bool) – True if we are going to be computing derivatives w.r.t. the word embedding (e.g., needed in a typical training configuration); if this is True, it will compute input_words_transpose.
Returns:

A derived RNNLM example structure for the input minibatch.

kaldi.rnnlm.merge_distributions(d1:list<tuple<int, float>>, d2:list<tuple<int, float>>) → list<tuple<int, float>>

Merges two distributions.

Sums the probabilities of any elements that occur in both input distributions.

Parameters:
  • d1 (List[Tuple[int,float]]) – The first input distribution.
  • d2 (List[Tuple[int,float]]) – The second input distribution.
Returns:

The output distribution.
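
Example (a toy illustration; distributions are (word-id, probability) lists, sorted and unique on word-id):

from kaldi.rnnlm import (check_distribution, merge_distributions,
                         total_of_distribution)

d1 = [(2, 0.25), (5, 0.5)]
d2 = [(2, 0.1), (7, 0.15)]
check_distribution(d1)                  # raises RuntimeError if malformed
merged = merge_distributions(d1, d2)    # [(2, 0.35), (5, 0.5), (7, 0.15)]
print(total_of_distribution(merged))    # 1.0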

kaldi.rnnlm.process_rnnlm_output(objective_opts:RnnlmObjectiveOptions, minibatch:RnnlmExample, derived:RnnlmExampleDerived, word_embedding:CuMatrixBase, nnet_output:CuMatrixBase, word_embedding_deriv:CuMatrixBase, nnet_output_deriv:CuMatrixBase) -> (weight:float, objf_num:float, objf_den:float, objf_den_exact:float)

Processes the output of RNNLM computation.

This function processes the output of the RNNLM computation for a single minibatch; it outputs the objective-function contributions from the numerator and denominator terms, and [if requested] the derivatives of the objective function w.r.t. the data inputs.

In the explanation below, the index i encompasses both the time t and the member n within the minibatch. The objective function referred to here is of the form

objf = sum_i weight(i) * ( num_term(i) + den_term(i) )

where num_term(i) is the log-prob of the ‘correct’ word, which equals the dot product of the neural-network output with the word embedding, which we can write as follows

num_term(i) = l(i, minibatch.output_words(i))

where l(i, w) is the unnormalized log-prob of word w for position i, specifically

l(i, w) = vec_vec(nnet_output.Row(i), word_embedding.Row(w)).

Without importance sampling (if len(minibatch.sampled_words) == 0), we get

den_term(i) = 1.0 - (sum_w q(i,w))

This is a lower bound on the 'natural' normalizer term, which is of the form -log(sum_w p(i,w)) (and its linearity in the p's allows importance sampling). Here,

p(i, w) = exp(l(i, w))

q(i, w) = exp(l(i, w)) if l(i, w) < 0, else 1 + l(i, w)

[the reason we use q(i, w) instead of p(i, w) is that it gives a closer bound to the natural normalizer term and helps avoid instability in early phases of training.]

With importance sampling (if minibatch.sampled_words.size() > 0), den_term equals

den_term(i) = 1.0 - (sum_w q(i,w) * sample_inv_prob(i,w))

where sample_inv_prob(i, w) is zero if word w was not sampled for this t, and 1.0 / (the probability with which it was sampled) if it was sampled.

Parameters:
  • objective_opts (RnnlmObjectiveOptions) – Options for RNNLM objective.
  • minibatch (RnnlmExample) – The minibatch for which we are processing the output.
  • derived (RnnlmExampleDerived) – This struct contains certain quantities which are precomputed from minibatch. It’s to be generated by calling get_rnnlm_example_derived() prior to calling this function.
  • word_embedding (CuMatrixBase) – The word-embedding, dimension is num-words by embedding-dimension. This does not have to be ‘real’ word-indexes, it can be fake word-indexes renumbered to include only the required words if sampling is done; c.f. renumber_rnnlm_example().
  • nnet_output (CuMatrixBase) – The neural net output. Num-rows is minibatch.chunk_length * minibatch.num_chunks, where the stride for the time 0 <= t < chunk_length is larger, so there are a block of rows for t=0, a block for t=1, and so on. Num-columns is embedding-dimension.
  • word_embedding_deriv (CuMatrixBase) – If not None, the derivative of the objective function w.r.t. word_embedding is added to this location.
  • nnet_output_deriv (CuMatrixBase) – If not None, the derivative of the objective function w.r.t. nnet_output is added to this location.
Returns:

  • weight – The total weight over this minibatch. It is equal to minibatch.output_weights.sum().
  • objf_num – The total numerator part of the objective function, i.e. the sum over i of weight(i) * num_term(i).
  • objf_den – The total denominator part of the objective function, i.e. the sum over i of weight(i) * den_term(i). You add this to objf_num to get the total objective function.
  • objf_den_exact – If we’re not doing sampling (i.e. if len(minibatch.sampled_words) == 0), the ‘exact’ denominator part of the objective function, i.e. the weighted sum of exact_den_term(i) = -log(sum_w p(i,w)). If we are sampling, then there is no exact denominator part, and this will be set to zero. This is provided for diagnostic purposes. Derivatives will be computed w.r.t. the objective consisting of objf_num + objf_den, i.e. ignoring the ‘exact’ one.
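
Example (a hedged sketch; all the matrices and options are assumed to have been produced by the surrounding training code):

from kaldi.rnnlm import process_rnnlm_output

weight, objf_num, objf_den, objf_den_exact = process_rnnlm_output(
    objective_opts, minibatch, derived,
    word_embedding, nnet_output,
    word_embedding_deriv, nnet_output_deriv)
print("objective per word:", (objf_num + objf_den) / weight)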

kaldi.rnnlm.read_sparse_word_features(is:istream, feature_dim:int) → SparseMatrix

Reads sparse word features from input stream.

Reads a text file (e.g. exp/rnnlm/word_feats.txt) which maps words to sparse combinations of features. The text file contains lines of the format:

<word-index> <feat1-index> <feat1-value> <feat2-index> <feat2-value>…
with the feature-indexes in sorted order, for example:
2056 11 3.0 25 1.0 1069 1.0

The word-indexes are expected to be in order 0, 1, 2, …; so they don’t really add any information; they are included for human readability.

Parameters:
  • is (istream) – The input C++ stream we are reading.
  • feature_dim (int) – The feature dimension, i.e. the highest-numbered possible feature plus one. We don't attempt to work this out from the input, in case for some reason this vocabulary does not use the highest-numbered feature.
Returns:A sparse matrix of dimension num-words by feature-dim, containing the word feature information in the file we read.
Raises:RuntimeError – If the input is not as expected.
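
Example (a hedged sketch; kaldi.util.io.xopen is assumed to expose the underlying C++ stream via .stream(), and the file name and feature dimension are placeholders):

from kaldi.rnnlm import read_sparse_word_features
from kaldi.util.io import xopen   # assumed helper for opening Kaldi streams

with xopen("exp/rnnlm/word_feats.txt") as ki:
    word_feats = read_sparse_word_features(ki.stream(), 2000)
print(word_feats.num_rows, word_feats.num_cols)
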
kaldi.rnnlm.renumber_rnnlm_example(minibatch:RnnlmExample) → list<int>

Renumbers word-ids in a minibatch.

This function renumbers the word-ids referred to in a minibatch, creating a numbering that covers exactly the words referred to in this minibatch. It is only to be called when sampling is used, i.e. when minibatch.sampled_words is not empty.

Parameters:minibatch (RnnlmExample) – The minibatch to be modified. At entry the words-indexes in fields input_words, and sampled_words will be in their canonical numbering. At exit the numbers present in those arrays will be indexes into the active_words vector that this function outputs. For instance, suppose minibatch.input_words[9] == 1034 at entry; at exit we might have minibatch.input_words[9] == 94, with active_words[94] == 1034. This function requires that minibatch.sampled_words is nonempty. If minibatch.sampled_words is empty, it means that sampling has not been done, so the negative part of the objf will use all the words. In this case the minibatch implicitly uses all words, so there is no use in renumbering. At exit, minibatch.vocab_size will have been set to the same value as len(active_words).
Returns:The list of active words, i.e. the words that were present in the fields input_words, and sampled_words in minibatch on entry. At exit, this list will be sorted and unique.

Note

It is not necessary for this function to renumber output_words because in the sampling case they are indexes into blocks of sampled_words (see documentation for RnnlmExample).
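
Example (a hedged sketch; minibatch is assumed to be a sampled RnnlmExample):

from kaldi.rnnlm import renumber_rnnlm_example

active_words = renumber_rnnlm_example(minibatch)
# input_words and sampled_words now index into active_words, so the
# embedding matrix for this minibatch only needs len(active_words) rows.
assert minibatch.vocab_size == len(active_words)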

kaldi.rnnlm.sample_without_replacement(probs:list<float>) → list<int>

Samples without replacement from a distribution.

Samples without replacement from a distribution, with provided 1st order inclusion probabilities. For example, if probs[i] == 1.0, i will definitely be included in the output list, and if probs[i] == 0.0, i will definitely not be included.

Parameters:probs (List[float]) – The input list of inclusion probabilities, with 0.0 <= probs[i] <= 1.0, and the sum of probs should be close to an integer. (specifically: within 1.0e-03 of a whole number; this should be easy to ensure in double precision). Let ‘k’ be this sum, rounded to the nearest integer.
Returns:The output list is an unsorted list of ‘k’ distinct samples with first order inclusion probabilities given by probs.
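
Example (a toy illustration):

from kaldi.rnnlm import sample_without_replacement

probs = [1.0, 0.5, 0.25, 0.25]           # sums to 2.0, so k == 2
picked = sample_without_replacement(probs)
print(picked)                            # e.g. [0, 2]; index 0 is always picked
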
kaldi.rnnlm.total_of_distribution(d:list<tuple<int, float>>) → float

Returns the sum of the elements of a distribution.

Parameters:d (List[Tuple[int,float]]) – The input distribution.
Returns:The sum of the elements of the distribution.