The task of transforming a string from a source format into a target format
is encountered in many information-processing tasks. Consider the task of
transforming a list of names in the form “firstname lastname” (e.g. “Michael
Jackson”) into the target form “lastname first letter of the firstname” (e.g.
“Jackson M”). In many domains, identifying an appropriate set of operations
that transforms one string to another is challenging, as the space of possible
transformations is large. In this work, we investigate the problem of learning
string transformation rules from pairs of example strings. We propose a solid
way to design these rules based on only four string operations: permutations,
insertions, deletions and updates. Additionally, we propose an efficient algorithm to learn such rules, implemented as a combination of variations of
well-known string manipulation algorithms. The proposed algorithm has the
following desirable properties: it can express any string transformation; it
can produce transformation rules that correctly transform a large part of the
data, even when a limited number of training examples is provided; it is linear w.r.t the training sample size, which allows it to scale for large tasks; and
it easy to understand and implement. We demonstrate the ability of our algorithm over real-world transformation tasks. The results indicate that the
algorithm learns transformation rules that are generalizable for a broader
range of strings to be transformed, using very few examples. The algorithm
is especially useful for spreadsheets and data cleaning tool developers that
want to support their end-users on string transformation related tasks.