The Longest Common Subsequence Problem
08 Jul 2018Subsequence
Assume the length of a sequence \(S = \{s_1, s_2, \ldots, a_K\}\) is \(K\). Denote the subscript set of \(S\) by \(SC_S\), which is defined as follows:
- Every element in $SC_S$ is unique.
- Every element in $SC_S$ is an integer in the range $[1, K]$.
For simplicity, let \(L = \lvert SC_S \rvert\). Sort the elements of \(SC_S\) ascendingly, which generates a new sequence \(U = \{u_1, u_2, \ldots, u_L\}\), where \(u_1 \leq u_2 \leq \cdots \leq u_L\).
A subsequence of \(S\), which can be denoted by \(S_U\), is defined as
\[S_U = \{ s_{u_1}, s_{u_2}, \ldots, s_{u_L} \}.\]It can be seen that every subscript set of \(S\) uniquely defines a subsequence of \(S\), vice versa. For \(S\), there are totally \(2^K\) subsequences.
Common Subsequence
Given two sequences \(A = \{a_1, a_2, \ldots, a_N\}\) and \(B = \{b_1, b_2, \ldots, b_M\}\), a common subsequence of \(A\) and \(B\), which is denoted by \(C\), is a subsequence of \(A\) and a subsequence of \(B\) at the same time. It can be seen that \(0 \leq \lvert C \rvert \leq \min(N, M)\).
A longest common subsequence of \(A\) and \(B\) is a common subsequence of \(A\) and \(B\) with the maximum length. Consider two sequences, namely \(\{1,3,5,4,2,6,8,7\}\) and \(\{ 1,4,8,6,7,5 \}\). The longest common sequence of them can either be \(\{1,4,6,7\}\) or \(\{1,4,8,7\}\). That is to say, while the length of the longest common subsequences is unique, the content of them might not necessarily be.
The Longest Common Subsequence Problem
Usually, the longest common subsequence problem asks one to find out the length of longest common subsequence of two sequences \(A\) and \(B\). If we are to solve the problem by brute force, the time compleixty is as high as \(O(2^{M+N})\), which can be a disaster. Using dynamic programming, we are able to solve the longest common subsequence problem in \(O(MN)\) time.
The dynamic programming function $C(i,j)$ is as follows:
\[\begin{equation*} C(i,j) = \begin{cases} 0, & i = 0 \text{ or } j = 0\\ C(i-1, j-1) + 1, & a_i = b_j\\ \max \left[ C(i-1,j), C(i, j-1) \right], & a_i \neq b_j \end{cases} , \end{equation*}\]where \(C(i,j)\) denotes the length of longest common sequence of the first \(i\) element of \(A\) and the first \(j\) element of \(B\). The entire dynamic programming table can be built using the bottom-top approach. The targeted value is \(C(N,M)\).
The C++ implementation of the most usual longest common sequence problem is as follows.
using namespace std;
using SizeT = int32_t;
// DType should be STL-style containers
template<typename DType>
SizeT LCS(const DType &seq1, const DType &seq2) {
if (seq1.empty() || seq2.empty()) {
return 0;
}
// using nested vector to store the dynamic
// programming table, which may be less efficient
vector<vector<SizeT>> dpTable;
dpTable.resize(seq1.size() + 1);
for (auto &row : dpTable) {
// initialze the dpTable with 0
row.resize(seq2.size() + 1, 0);
}
for (auto i = 0U; i <= seq1.size(); ++i) {
for (auto j = 0U; j <= seq2.size(); ++j) {
if (i == 0 || j == 0) {
continue;
}
else if (seq1[i - 1] == seq2[j - 1]) {
dpTable[i][j] = dpTable[i - 1][j - 1] + 1;
}
else {
dpTable[i][j] = max(dpTable[i - 1][j], dpTable[i][j - 1]);
}
}
}
return dpTable.back().back();
}
Variants
Outputing a Longest Common Subsequence
To ouput a longest common sequence, we can keep track of a jump table in the memory, which stores the change of i
and j
during the construction of the dpTable
. We should be able to restore a longest common subsequence by backtracking through the jump table.
using namespace std;
using SizeT = int32_t;
enum class JumpInfo {
VOID,
JMPSEQ1,
JMPSEQ2,
EQUAL
};
// DType should be STL-style containers
template<typename DType>
DType LCS_One(const DType &seq1, const DType &seq2) {
if (seq1.empty() || seq2.empty()) {
return 0;
}
// using nested vector to store the dynamic
// programming table, which may be less efficient
vector<vector<SizeT>> dpTable;
dpTable.resize(seq1.size() + 1);
for (auto &row : dpTable) {
// initialze the dpTable with 0
row.resize(seq2.size() + 1, 0);
}
// keep track of a jump table
vector<vector<JumpInfo>> jumpTable;
jumpTable.resize(seq1.size() + 1);
for (auto &row : jumpTable) {
row.resize(seq2.size() + 1, JumpInfo::VOID);
}
for (auto i = 0U; i <= seq1.size(); ++i) {
for (auto j = 0U; j <= seq2.size(); ++j) {
if (i == 0 || j == 0) {
continue;
}
else if (seq1[i - 1] == seq2[j - 1]) {
dpTable[i][j] = dpTable[i - 1][j - 1] + 1;
jumpTable[i][j] = JumpInfo::EQUAL;
}
else {
if (dpTable[i - 1][j] > dpTable[i][j - 1]) {
dpTable[i][j] = dpTable[i - 1][j];
jumpTable[i][j] = JumpInfo::JMPSEQ1;
}
else {
dpTable[i][j] = dpTable[i][j - 1];
jumpTable[i][j] = JumpInfo::JMPSEQ2;
}
}
}
}
DType result;
auto inserter = back_inserter(result);
auto i = seq1.size(), j = seq2.size();
bool stopLoop = false;
while (!stopLoop) {
switch (jumpTable[i][j])
{
case JumpInfo::VOID:
stopLoop = true;
break;
case JumpInfo::JMPSEQ1:
i--;
break;
case JumpInfo::JMPSEQ2:
j--;
break;
case JumpInfo::EQUAL:
*inserter = seq1[i - 1];
i--;
j--;
break;
}
}
reverse(result.begin(), result.end());
return result;
}
Outputing All Longest Common Subsequences
In this case, we can use dynamic programming to find out the length of the longest common subsequence and possible element at each position of the longest common subsequence. We then enumerate each possible subsequence and eliminate duplicates.
using namespace std;
using SizeT = int32_t;
// DType should be STL-style containers
template<typename DType>
vector<DType> LCS_All(const DType &seq1, const DType &seq2) {
if (seq1.empty() || seq2.empty()) {
return vector<DType>{};
}
// using nested vector to store the dynamic
// programming table, which may be less efficient
vector<vector<SizeT>> dpTable;
dpTable.resize(seq1.size() + 1);
for (auto &row : dpTable) {
// initialze the dpTable with 0
row.resize(seq2.size() + 1, 0);
}
using EleType = pair<SizeT, SizeT>;
// stores possible element at each position
vector<vector<EleType>> eleTable;
eleTable.resize(min(seq1.size(), seq2.size()));
for (auto i = 0U; i <= seq1.size(); ++i) {
for (auto j = 0U; j <= seq2.size(); ++j) {
if (i == 0 || j == 0) {
continue;
}
else if (seq1[i - 1] == seq2[j - 1]) {
dpTable[i][j] = dpTable[i - 1][j - 1] + 1;
eleTable[dpTable[i - 1][j - 1]].push_back(EleType(i - 1, j - 1));
}
else {
if (dpTable[i - 1][j] > dpTable[i][j - 1]) {
dpTable[i][j] = dpTable[i - 1][j];
}
else {
dpTable[i][j] = dpTable[i][j - 1];
}
}
}
}
SizeT lcsLength = dpTable.back().back();
if (lcsLength == 0) {
return vector<DType>{};
}
struct State {
SizeT seq2Index = 0;
vector<SizeT> seq1Path;
};
using ValueType = typename DType::value_type;
struct Node {
SizeT id;
ValueType value{};
map<ValueType, SizeT> child;
};
// using queue to enumerate
queue<State> myQueue;
vector<DType> result;
// using a tree to eliminate duplicates
// id of the root node
SizeT rootID;
// node storage
map<SizeT, Node> nodes;
auto AllocateNode = [&]() {
SizeT nodeID = nodes.size();
nodes.insert(pair<SizeT, Node>(nodeID, Node{}));
nodes[nodeID].id = nodeID;
return nodeID;
};
rootID = AllocateNode();
// fill the queue with sequences whose length are one
for (const auto &ele : eleTable[0]) {
State newState;
newState.seq1Path.push_back(ele.first);
newState.seq2Index = ele.second;
myQueue.push(move(newState));
}
// begin enumeration
while (!myQueue.empty()) {
State front = move(myQueue.front());
myQueue.pop();
// if a longest common subsequence is found
if (front.seq1Path.size() == lcsLength) {
DType seq;
// recreate the sequence
auto inserter = back_inserter(seq);
for (const auto &index : front.seq1Path) {
*inserter = seq1[index];
}
// make sure there is no duplicate
bool brandNew = false;
SizeT lastNode = rootID;
for (SizeT i = 0; i < (SizeT)seq.size(); ++i) {
auto &node = nodes[lastNode];
auto findResult = node.child.find(seq[i]);
if (findResult != node.child.end()) {
lastNode = findResult->second;
}
else {
brandNew = true;
auto newNodeID = AllocateNode();
nodes[newNodeID].value = seq[i];
node.child.insert(make_pair(seq[i], newNodeID));
lastNode = newNodeID;
}
}
if (brandNew) {
result.emplace_back(move(seq));
}
continue;
}
for (const auto &ele : eleTable[front.seq1Path.size()]) {
if (ele.first > front.seq1Path.back() && ele.second > front.seq2Index) {
State newState;
copy(front.seq1Path.begin(), front.seq1Path.end(), back_inserter(newState.seq1Path));
newState.seq1Path.push_back(ele.first);
newState.seq2Index = ele.second;
myQueue.push(move(newState));
}
}
}
return result;
}
The complexity of this algorithm might be high, due to the enumeration of all possible longest common subsequence combinations. Other approaches, e.g. suffix tree, might be able to solve this question more efficiently.
Appendix
Possible STL Headers Needed to Run The Code
// tested OK with Visual C++ 2017
#include <algorithm>
#include <iostream>
#include <vector>
#include <iterator>
#include <cstdint>
#include <string>
#include <queue>
#include <tuple>
#include <map>
A Test Sample for Finding All Longest Common Subsequences
string seq1 = "abcabcaa";
string seq2 = "acbacba";
There are totally 7 longest common subsequences, namely:
- “ababa”
- “abaca”
- “abcba”
- “acaba”
- “acaca”
- “acbaa”
- “acbca”