Viterbi Algorithm, Part 2: Decoding
This is my second post describing the Viterbi algorithm. As before, our presentation follows Jurafsky and Martin closely, merely filling in some details omitted in the text. (See here for the first post.)
As a quick refresher, we observe a sequence of words or symbols $w_1, w_2, \ldots, w_m$, each drawn from a finite vocabulary $V$, and we posit a corresponding unobserved sequence of states $q_1, q_2, \ldots, q_m$, each taking one of $N$ values. A Hidden Markov Model (HMM) is characterized by a set of transition probabilities $a_{ij} = P(q_{t+1} = j \mid q_t = i)$, initial probabilities $a_{0j} = P(q_1 = j)$, and emission probabilities $b_j(w) = P(w_t = w \mid q_t = j)$.
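To make the notation concrete, here is one way a small, made-up HMM could be laid out as numpy arrays; the numbers are arbitrary, but the shapes match the a0, a, and b arguments of the implementation at the end of this post.

import numpy as np

# A hypothetical HMM with N = 2 states and a vocabulary of |V| = 3 words.
# The vector a0 and each row of a and b are probability distributions.
a0 = np.array([0.6, 0.4])          # a0[j]   = P(q_1 = j)
a = np.array([[0.7, 0.3],          # a[i, j] = P(q_{t+1} = j | q_t = i)
              [0.4, 0.6]])
b = np.array([[0.5, 0.4, 0.1],     # b[j, w] = P(w_t = w | q_t = j)
              [0.1, 0.3, 0.6]])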
Decoding
As in the last post, we will make the following assumptions:
- Word Independence: $P(w_t \mid q_1, \ldots, q_m, w_1, \ldots, w_{t-1}) = P(w_t \mid q_t)$. This assumption is two-fold: it says first that words $w_s$ and $w_t$ are conditionally independent given the states, and second that $w_t$ is conditionally independent of the other states given $q_t$.
- Markov Assumption: $P(q_t \mid q_1, \ldots, q_{t-1}) = P(q_t \mid q_{t-1})$. This says that a state is conditionally independent of earlier states given the preceding state.
- Known Transition and Emission Probabilities: the notation makes it clear, but it’s worth emphasizing that in this section we will assume the HMM parameters $a_0$, $a$, and $b$ are known. In practice this would not be the case! Our next post will address the learning problem.
The most likely sequence of states is
$$\hat{q}_1, \ldots, \hat{q}_m = \arg\max_{q_1, \ldots, q_m} P(q_1, \ldots, q_m \mid w_1, \ldots, w_m).$$
Applying the definition of conditional probability,
$$\arg\max_{q_1, \ldots, q_m} P(q_1, \ldots, q_m \mid w_1, \ldots, w_m) = \arg\max_{q_1, \ldots, q_m} \frac{P(q_1, \ldots, q_m, w_1, \ldots, w_m)}{P(w_1, \ldots, w_m)} = \arg\max_{q_1, \ldots, q_m} P(q_1, \ldots, q_m, w_1, \ldots, w_m),$$
since the denominator does not depend on the states.
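For later use, note that combining the assumptions above lets us write the joint probability in closed form:
$$P(q_1, \ldots, q_m, w_1, \ldots, w_m) = a_{0 q_1}\, b_{q_1}(w_1) \prod_{t=2}^{m} a_{q_{t-1} q_t}\, b_{q_t}(w_t).$$
Both the brute-force search described next and the Viterbi algorithm are ways of maximizing this quantity over the states.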
Solving this optimization problem is in principle straightforward: try all
possible sequences, calculate the probability of each one, and keep track of
which one is best. This involves examining all $N^m$ possible state sequences, however, which quickly becomes intractable as $m$ grows.
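As a rough sketch of what this brute-force search could look like in code (using the same array conventions as the implementation at the end of the post):

import itertools
import numpy as np

def brute_force_decode(w1m, a, a0, b):
    """Enumerate all N**m state sequences; only feasible for tiny m."""
    m, N = len(w1m), len(a0)
    best_seq, best_prob = None, -1.0
    for q in itertools.product(range(N), repeat=m):  # N**m candidates
        # Joint probability P(q_1, ..., q_m, w_1, ..., w_m)
        prob = a0[q[0]] * b[q[0], w1m[0]]
        for t in range(1, m):
            prob *= a[q[t-1], q[t]] * b[q[t], w1m[t]]
        if prob > best_prob:
            best_seq, best_prob = list(q), prob
    return best_seq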
The Value Function
Instead, we will use a dynamic programming algorithm. The central quantity is the value function
$$\nu_t(j) = \max_{q_1, \ldots, q_{t-1}} P(q_1, \ldots, q_{t-1}, q_t = j, w_1, \ldots, w_t),$$
the probability of the most likely state sequence that ends in state $j$ and accounts for the first $t$ words. We will derive a recursion that computes $\nu_t$ from $\nu_{t-1}$. Before doing that, we first note the base case:
$$\nu_1(j) = P(q_1 = j, w_1) = a_{0j}\, b_j(w_1).$$
Now we will show that
$$\nu_t(j) = b_j(w_t) \max_{1 \le i \le N} a_{ij}\, \nu_{t-1}(i) \qquad (t = 2, \ldots, m).$$
Starting from the definition, then applying the definition of conditional probability together with the Word Independence and Markov assumptions:
$$\nu_t(j) = \max_{q_1, \ldots, q_{t-1}} P(q_1, \ldots, q_{t-1}, q_t = j, w_1, \ldots, w_t) = \max_{q_1, \ldots, q_{t-1}} b_j(w_t)\, a_{q_{t-1} j}\, P(q_1, \ldots, q_{t-1}, w_1, \ldots, w_{t-1}).$$
Next, we’ll pull the word emission probability out of the max operator since it
does not depend on the optimization variables:
$$\nu_t(j) = b_j(w_t) \max_{q_1, \ldots, q_{t-1}} a_{q_{t-1} j}\, P(q_1, \ldots, q_{t-1}, w_1, \ldots, w_{t-1}).$$
Finally, we will use the property that optimization problems can be split up to
complete the derivation:
$$\nu_t(j) = b_j(w_t) \max_{1 \le i \le N} a_{ij} \max_{q_1, \ldots, q_{t-2}} P(q_1, \ldots, q_{t-2}, q_{t-1} = i, w_1, \ldots, w_{t-1}) = b_j(w_t) \max_{1 \le i \le N} a_{ij}\, \nu_{t-1}(i).$$
This recursive property allows us to calculate the value function for all $t = 1, \ldots, m$ and $j = 1, \ldots, N$, working forwards from the base case.
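As an aside, the forward recursion can be written quite compactly with numpy broadcasting; the sketch below computes only the (log) value function, while the full implementation at the end of the post spells out the same loops explicitly.

import numpy as np

def log_value_function(w1m, a, a0, b):
    """Sketch of the forward pass: log_nu[t, j] = log nu_t(j)."""
    m, N = len(w1m), len(a0)
    log_nu = np.empty((m, N))
    # Base case: nu_1(j) = a0[j] * b[j, w_1]
    log_nu[0] = np.log(a0) + np.log(b[:, w1m[0]])
    for t in range(1, m):
        # Recursion: log nu_t(j) = log b_j(w_t) + max_i [log a_ij + log nu_{t-1}(i)]
        log_nu[t] = np.log(b[:, w1m[t]]) + np.max(
            np.log(a) + log_nu[t-1][:, None], axis=0
        )
    return log_nu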
The Optimal State Sequence
In deriving the decoding algorithm, we will need the following property of
optimization problems. Let $f(x, y)$ be a function of two (possibly vector-valued) variables, and let $\hat{x}, \hat{y} = \arg\max_{x, y} f(x, y)$. Then
$$\hat{y} = \arg\max_{y} \max_{x} f(x, y) \qquad \text{and} \qquad \hat{x} = \arg\max_{x} f(x, \hat{y}).$$
Recall that
$$\hat{q}_1, \ldots, \hat{q}_m = \arg\max_{q_1, \ldots, q_m} P(q_1, \ldots, q_m, w_1, \ldots, w_m).$$
If we let $x = (q_1, \ldots, q_{m-1})$ and $y = q_m$, the first part of the property gives
$$\hat{q}_m = \arg\max_{j} \max_{q_1, \ldots, q_{m-1}} P(q_1, \ldots, q_{m-1}, q_m = j, w_1, \ldots, w_m) = \arg\max_{j} \nu_m(j).$$
Now suppose that we knew $\hat{q}_{t+1}, \ldots, \hat{q}_m$ and wanted to find $\hat{q}_t$.
First, let $x = (q_1, \ldots, q_t)$ and $y = (q_{t+1}, \ldots, q_m)$; the second part of the property says that $\hat{q}_1, \ldots, \hat{q}_t$ maximize $P(q_1, \ldots, q_t, \hat{q}_{t+1}, \ldots, \hat{q}_m, w_1, \ldots, w_m)$ over $q_1, \ldots, q_t$. Applying the first part of the property to this restricted problem then gives
$$\hat{q}_t = \arg\max_{j} \max_{q_1, \ldots, q_{t-1}} P(q_1, \ldots, q_{t-1}, q_t = j, \hat{q}_{t+1}, \ldots, \hat{q}_m, w_1, \ldots, w_m).$$
Next we apply the definition of conditional probability, followed by the
Markov Assumption, and the definition of the value function, finding:
$$\max_{q_1, \ldots, q_{t-1}} P(q_1, \ldots, q_{t-1}, q_t = j, \hat{q}_{t+1}, \ldots, \hat{q}_m, w_1, \ldots, w_m) = P(\hat{q}_{t+2}, \ldots, \hat{q}_m, w_{t+1}, \ldots, w_m \mid \hat{q}_{t+1})\, a_{j \hat{q}_{t+1}}\, \nu_t(j).$$
The first factor does not depend on $j$, so
$$\hat{q}_t = \arg\max_{j} a_{j \hat{q}_{t+1}}\, \nu_t(j).$$
Finally, recall that we know $\hat{q}_{t+1}$, so everything on the right-hand side can be computed.
Thus, once we have calculated the value function, we can compute the optimal state sequence, starting from the end and working our way backwards.
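Collecting the pieces, the full procedure is:
$$\nu_1(j) = a_{0j}\, b_j(w_1), \qquad \nu_t(j) = b_j(w_t) \max_{1 \le i \le N} a_{ij}\, \nu_{t-1}(i) \quad (t = 2, \ldots, m),$$
$$\hat{q}_m = \arg\max_{j} \nu_m(j), \qquad \hat{q}_t = \arg\max_{j} a_{j \hat{q}_{t+1}}\, \nu_t(j) \quad (t = m-1, \ldots, 1).$$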
Putting it all Together
The Python code below calculates the value function and uses it to determine the optimal state sequence. As an implementation note, we work with the logarithm of the value function to avoid numeric underflow.
import numpy as np


def viterbi(w1m, a, a0, b):
    """Compute the most likely state sequence for an observed word sequence.

    Parameters
    ----------
    w1m : list of length m
        Sequence of word indices; w1m[t] is the index of the t-th word
    a : N x N array
        State transition probabilities; a[i, j] = P(q_{t+1} = j | q_t = i)
    a0 : array of length N
        Initial state probabilities; a0[j] = P(q_1 = j)
    b : N x |V| array
        Word emission probabilities; b[j, w] = P(w_t = w | q_t = j)

    Returns
    -------
    q_hat : list of length m
        The optimal state sequence
    """
    m = len(w1m)
    N = len(a0)
    # Forward pass: calculate log(nu), working in log space to avoid
    # numeric underflow.
    log_nu = np.zeros((m, N))
    for j in range(N):
        log_nu[0, j] = np.log(b[j, w1m[0]]) + np.log(a0[j])
    for t in range(1, m):
        for j in range(N):
            for i in range(N):
                nu_candidate = (
                    np.log(b[j, w1m[t]])
                    + np.log(a[i, j])
                    + log_nu[t-1, i]
                )
                if i == 0 or nu_candidate > log_nu[t, j]:
                    log_nu[t, j] = nu_candidate
    # Backward pass: calculate q_hat, starting from the end of the sequence.
    q_hat = [None] * m
    best_qval = None
    for j in range(N):
        if j == 0 or log_nu[m-1, j] > best_qval:
            q_hat[m-1] = j
            best_qval = log_nu[m-1, j]
    for t in range(m-2, -1, -1):
        best_qval = None
        for j in range(N):
            qval_candidate = log_nu[t, j] + np.log(a[j, q_hat[t+1]])
            if j == 0 or qval_candidate > best_qval:
                q_hat[t] = j
                best_qval = qval_candidate
    return q_hat
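As a quick sanity check, here is how the function might be called on the made-up two-state HMM from the refresher above:

import numpy as np

# Toy parameters (same made-up numbers as earlier in the post).
a0 = np.array([0.6, 0.4])
a = np.array([[0.7, 0.3],
              [0.4, 0.6]])
b = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])

w1m = [0, 1, 2, 2]                 # observed word indices
print(viterbi(w1m, a, a0, b))      # -> [0, 0, 1, 1]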