Tuesday, October 25, 2016

Dynamic Programming in 5 easy steps - Examples - Text Justification

Please find the updated version of this post here: https://piotr.westfalewicz.com/blog/2016/10/dynamic-programming-in-5-easy-steps---examples---text-justification/


Lately I spent a relatively great amount of time revising Algorithms and Data Structures. It's a broad subject, but for me personally I found the Dynamic Programming as the abandoned and unused method. It's also considered as one of the hardest methods to master, with few examples on the internet. Let's contribute a little with this post series. Today I will cover the first problem - text justification. The problem is from MIT lectures and I highly recommend well explained videos along with notes from course 6.006: Introduction to algorithms - fall 2011.

Text justification

Given a text, split it into multiple lines so it is “nicely” distributed between each line.
What does it mean? Take a look at the example:
We can clearly see that eager approach won't produce good results in all cases.
Mathematically, given a badness(i, j) function for line of words[i : j], we seek for such a splitting that the sum of all badness is minimum.
Let's assume the badness function is predefined as follows:
∞ if total length > page width, else (page width − total length)3
So in the example above, the bad splitting would have:
  • page width = 16
  • first line length = 4+4+4+2 = 16, badness of the first line = 0
  • second line length = 4, badness = (16-4)3=1728
  • third line length = 16, badness = 0
  • badness sum = 1728
The good splitting would have:
  • first line length = 4+4+1 = 9, badness of the first line = (16-9)3=343
  • second line length = 9, badness = (16-9)3=343
  • third line length = 16, badness = 8
  • badness sum = 694
The second splitting has smaller total badness sum, thus it's a better splitting.

Dynamic Programming in 5 easy steps

On the MIT course, there is presented an approach to DP in 5 easy steps. Those are:
  1. define subproblems, count number of subproblems
  2. guess (part of solution), count number of choices
  3. relate subproblem solutions, compute time per subproblem
  4. recurse + memoize OR build DP table bottom-up check subproblems acyclic/topological order, time = time per subproblem · number of subproblems
  5. solve original problem: = a subproblem OR by combining subproblem solutions ⇒ extra time 
Let's try to apply them on our problem.
Well, the steps don't have to be executed always in order. Let's try to figure out the second step. We know where our line begins, it's on ith position. Where is the end of the line? We don't know. So let's check all possibilities, that's from ith onward and take the smallest badness. So now we know what's our subproblems in first step are - suffixes of word[i:]. That's pretty much the algorithm. The third step will look like:
DP(i)=for j in i+1...n+1 take
min from (DP(j) + badness(i, j))
The great thing is about DP that DP(j) considered to be free (because we compute the subproblem once and memorize the answer). Therefore the total time needed by this algorithm is:
time = time per subproblem · number of subproblems, so
time = O(n) · O(n)
time = O(n^2)
The fourth point is about proving that this really works and the time needed is finite. We compute the answer from the end of the words, down to word number 0. The recursive DP function always accesses greater indexes, therefore time needed is finite. Lastly, we have to find our answer. Here, obviously it's DP[0]. Easy? Yup, let's code it!

The code

using System;
using System.Collections.Generic;
using System.Linq;

namespace Algos
{
    public class DP_TextJustification
    {
        static void Main(String[] args)
        {
            //please note: this is not a production-ready solution!
            PrintSolution("   ", 10);
            PrintSolution("oneword", 10);
            PrintSolution("one line", 10);
            PrintSolution("two lines", 6);
            PrintSolution("blah blah blah blah reallylongword", 14);
            PrintSolution("blah blah blah blah reallylongword 1", 14);
            var txt1 = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " +
                       "Praesent varius urna eget lacinia pharetra. " +
                       "Vivamus lacinia in dui vitae sodales. " +
                       "Aliquam auctor justo nec condimentum pretium. " +
                       "Curabitur egestas tellus dolor, vel tempus lacus blandit vitae. " +
                       "Cras dapibus scelerisque ex nec gravida.";
            PrintSolution(txt1, 60);
            PrintSolution(txt1, 70);
            PrintSolution(txt1, 80);
            PrintSolution(txt1, 90);
            PrintSolution(txt1, 100);
            Console.ReadLine();
        }

        public static void PrintSolution(string txt, int width)
        {
            Console.WriteLine($"Test length: {txt.Length}, line width: {width}");
            foreach (var line in new Justifier(txt, width).Solve())
            {
                Console.WriteLine("{0,-" + width + "}|", line);
            }
            Console.WriteLine();
        }

        private class Justifier
        {
            private const int Marker = -1;
            private readonly int _width;
            private readonly string[] _words;
            private readonly int[] _parentPointers;
            private readonly int[] _memo;
             
            public Justifier(string text, int width)
            {
                _width = width;
                _words = text.Split(new[] {" "}, StringSplitOptions.RemoveEmptyEntries);
                _parentPointers = new int[_words.Length];
                _memo = new int[_words.Length];
                for (var i = 0; i < _memo.Length; i++)
                {
                    _memo[i] = Marker;
                }
            }

            public List<string> Solve()
            {
                DP(0);

                var result = new List<string>();
                var from = 0;
                while (from != _words.Length)
                {
                    var next = _parentPointers[from];
                    result.Add(string.Join(" ", _words.Skip(from).Take(next - from)));
                    from = next;
                }
                return result;
            }

            private int DP(int i)
            {
                if (i == _words.Length) //no words in line is the end of justification
                    return 0;

                if (_memo[i] != Marker) //if we have solution calculated, return it
                    return _memo[i];

                var minBadness = int.MaxValue;

                for (var j = i + 1; j <= _words.Length; j++)
                {
                    var badness = Badness(i, j) + DP(j);
                    if (minBadness > badness)
                    {
                        _parentPointers[i] = j; //remember best choice
                        minBadness = badness;
                    }
                }

                _memo[i] = minBadness; //remember solution
                return minBadness;
            }

            private int Badness(int i, int j)
            {
                var lengthOfWords = _words.Skip(i).Take(j - i).Sum(s => s.Length);
                var numberOfSpaces = j - i - 1;
                if (lengthOfWords + numberOfSpaces > _width)
                    return 10*1000*1000;
                return (int)Math.Pow(_width - lengthOfWords - numberOfSpaces, 3);
            }
        }
    }
}
To the solution in 5 easy steps I've added parent pointers - a simple way to remember the best solution from DP. The result is as follows:
If we don't care about justifying the last line, we can easily modify the badness function to return no cost for the last line:
private int Badness(int i, int j)
{
    var lengthOfWords = _words.Skip(i).Take(j - i).Sum(s => s.Length);
    var numberOfSpaces = j - i - 1;
    if (lengthOfWords + numberOfSpaces > _width)
        return 10*1000*1000;
    if (j == _words.Length) //don't care about last line
        return 0;
    return (int)Math.Pow(_width - lengthOfWords - numberOfSpaces, 3);
}
In this case, the result will be much different in some cases:

No comments:

Post a Comment