Wednesday, August 1, 2012

Suffix Tree and String matching

Just found out that there is a popular data structure to deal with substring kind of problems, called the "Suffix Tree". (constrast with "Prefix tree", which is used in longest prefix kind of problems)

Suffix tree can be constructed with the simple O($$n^2$$) algo (insert the "n" suffixes of a string with n chars into the tree creating a child wherever necessary) or with more advanced algorithms that can do it in O(n) (http://en.wikipedia.org/wiki/Ukkonen%27s_algorithm). The point is, once we have a suffix tree, we can execute several of these algorithms in linear time.
  1. Longest repeating substring: the node which is farthest away from the tree but which has degree > 1. 
  2. Longest substring which is also a palindrome:  to do this, we take te original string lets say abcac and concatenate it with itself reversed i.e. abcaccacba and run "longest repeating substring" 
  3. Longest common substring: where we concatenate 2 strings first e.g. abcd and bcd so as to get abcdbcd and now search for the node deepest down which has degree > 1. 
  4. Shortest substring that occurs only once: to answer this, we find the node with the shortest length that takes us to a leaf node. 
The intuition about the degree > 1 is that the fact that there is a branch at this node, means that there are at least 2 different suffixes of the string that have this node's value(i.e. the string formed by the characters as we traverse from root to this node) as its substring. This means that this node's prefix is part of both (or more) different suffixes, and thus is a "common substring".

Ref: http://www.csc.liv.ac.uk/~leszek/COMP526/week5/week5.pdf

No comments:

Post a Comment