Wednesday, December 21, 2016

Dynamic Programming and the hardest problem on HackerRank

Please find the updated version of this post here: https://piotr.westfalewicz.com/blog/2016/12/dynamic-programming-and-the-hardest-problem-on-hackerrank/


The hardest problem on HackerRank, sorted by Max Score and level "Expert" is Separate The Chocolate. It's worth 250 points and the level "Expert" is the highest one. How to solve it?

Problem Statement

Tom and Derpina have a rectangular shaped chocolate bar with chocolates labeled T, D and U. They want to split the bar into exactly two pieces such that:
  • Tom's piece can not contain any chocolate labeled D and similarly, Derpina's piece can not contain any chocolate labeled T and U can be used by either of the two.
  • All chocolates in each piece must be connected (two chocolates are connected if they share an edge), i.e. the chocolates should form one connected component
  • The absolute difference between the number of chocolates in pieces should be at most K
  • After dividing it into exactly two pieces, in any piece, there should not be 4 adjacent chocolates that form a square, i.e. there should not be a fragment like this: 
    XX
    XX
Input Format

The first line of the input contains 3 integers M, N and K separated by a single space.
M lines follow, each of which contains N characters. Each character is 'T','D' or 'U'.

Constraints

0≤ M, N ≤8
0≤ K ≤ M * N

Output Format

A single line containing the number of ways to divide the chocolate bar.

My approach

I didn't have any ideas how to solve it with DP. I've coded a solution which checks every possible way of chocolate division. I felt it wasn't the right solution because the complexity of that is exponential. The graph of possible permutations is checked wisely - at each "branch" I check the required conditions (connectivity, if K difference can be met, squares presence) and cancel the computation of subtrees with invalid state. The result? Quite nice: 16 out of 19 tests passed, 210.53 out of 250 points scored. I looked at the leaderboard, and noticed despite the winners many scored just like me. I've optimized some cases, however that wasn't enough. DP was indeed needed. I've tried to figure it out but finally gave up.

The solution

The whole editorial provided by HackerRank is very short:
This problem is a complicated dynamic programming problem (DP). The idea is simple but the implementation is difficult.

We can iterate the grids one by one. For each grid suppose the left and the upper one has been given to Tom or Derpina. (Color black or white.)

To decide the square [i][j], we need all the square’s state in square[i][0..j – 1] and square[i – 1][j..n -1] , all the (n + 1) squares in total can decide this square’s state. We can use these as a part of state and we also must keep whether a component is dead. (If it’s dead then add one more square of the same color is invalid.)
I can't wrap my head around the DP function though. It is strange for me that the vertical line will be the DP state - because the connectivity must be met for the whole matrix... anyone got an explanation?

Lessons learned

  1. If you can solve a part of a very hard problem - it still can be very profitable in terms of HackerRank score. Compare 210.53 points to 50 points from "Two Robots" medium problem.
  2. Looking at other's solutions, I've noticed this piece of code:
    Right after adding those three cases to my code, the result was 250 points (not counted towards HackerRank score, of course). Therefore: not everybody plays fair. 

Thursday, November 24, 2016

Dynamic Programming in 5 easy steps - Examples - Two Robots

Please find the updated version of this post here: https://piotr.westfalewicz.com/blog/2016/11/dynamic-programming-in-5-easy-steps---examples---two-robots/


This time we will solve a HackerRank problem, rated as a medium in difficulty. It's advised for you to go through a similar, but in my opinion easier problem described by me previously. The problem statement is as follows:
You have a warehouse with M containers filled with an infinite number of candies. The containers are arranged in a single row, equally spaced to be 1 meter apart. You also have 2 robots that can pick up 1 piece of candy and transport it between any two containers.

The robots take instructions in the form of queries consisting of two integers, Ma and Mb, respectively. To execute a query, a robot travels to container Ma, picks up 1 candy, transports it to container Mb, and then stops at Mb until it receives another query.

Calculate the minimum total distance the robots must travel to execute N queries in order.

Note: You choose which robot executes each query.
Reminder -  the 5 easy steps are:
  1. define subproblems, count number of subproblems
  2. guess (part of solution), count number of choices
  3. relate subproblem solutions, compute time per subproblem
  4. recurse + memoize OR build DP table bottom-up check subproblems acyclic/topological order, time = time per subproblem · number of subproblems
  5. solve original problem: = a subproblem OR by combining subproblem solutions ⇒ extra time 
Step 1. Let's try to figure out what our subproblems are. At a given point in solving the problem, what information do we want to have? First obvious thing: index of queries. Next obvious thing: we want to optimize over total distance made by robots, therefore our DP function will return the minimum total distance made by robots at some point in our problem. Having that, we also probably will need positions of two robots. The situation looks like this:

Step 2. Now, we basically have two choices, A: proceed with the first robot or B: with the second robot. When A - we add the distance of moving the r1 from it's last place Mb(r1) to Ma(INDEX), otherwise B: we add the distance of moving the r2 from it's last place Mb(r2) to Ma(INDEX). Additionally, in both cases we have to add the cost of moving Ma(INDEX) to Mb(INDEX).
Step 3. Writing the DP function is now straightforward:

DP(r1, r2, index) = min from:
   A: Ma(INDEX) - Mb(r1) + Mb(INDEX) - Ma(INDEX) + DP(INDEX, r2)
   B: Ma(INDEX) - Mb(r2) + Mb(INDEX) - Ma(INDEX) + DP(r1, INDEX)
However, having the index in DP function is redundant. We can decrease the time complexity of our solution by removing it - index is always the next item after max(r1, r2). Thus, DP function now looks like:

DP(r1, r2) = min from:
   A: Ma(max(r1, r2)+1) - Mb(r1) + Mb(max(r1, r2)+1- Ma(max(r1, r2)+1) + DP(max(r1, r2)+1, r2)
   B: Ma(max(r1, r2)+1) - Mb(r2) + Mb(max(r1, r2)+1- Ma(max(r1, r2)+1) DP(r1, max(r1, r2)+1)

The number of subproblems equals to M*M, because that's how many states which in the robots can be arranged. Time of computing one subproblem is constant, therefore the final complexity of this solution is O(M2).
Step 4. So is it finite algorithm? Yes, because the graph is traversed through two increasing values, r1 and r2.
Step 5. The solution is in DP(-1, -1), assuming -1 denotes that the robot wasn't used yet and the cost of moving it to Mx is 0 (because it starts at this location).

The code

using System;
using System.Collections.Generic;
using System.Linq;

namespace Algos
{
 public class Solver
    {
        private readonly int _m;
        private readonly int _n;
        private readonly List> _queries;
        private int[][] _memo;

        public Solver(int m, int n, List> queries)
        {
            _m = m;
            _n = n;
            _queries = queries;
        }

        public int SolveCase()
        {
            _memo = new int[_n + 1][];
            for (int i = 0; i < _n + 1; i++)
                _memo[i] = new int[_n + 1];
            for (int i = _n; i >= 0; i--)
            {
                for (int j = _n; j >= 0; j--)
                {
                    _memo[i][j] = -1;
                }
            }

            return Dp(-1, -1);
        }

        private int Dp(int r1, int r2)
        {
            if (r1 + 1 == _n || r2 + 1== _n)
                return 0;
            if (_memo[r1 + 1][r2 + 1] != -1)
                return _memo[r1 + 1][r2 + 1];

            var i = Math.Max(r1, r2) + 1;
            //r1 stays in place
            var r2Cover = 0;
            if (r2 != -1)
                r2Cover = Math.Abs(_queries[r2].Item2 - _queries[i].Item1);
            var d1 = Dp(r1, i);
            //r2 stays in place
            var r1Cover = 0;
            if (r1 != -1)
                r1Cover = Math.Abs(_queries[r1].Item2 - _queries[i].Item1);
            var d2 = Dp(i, r2);

            var queryCover = Math.Abs(_queries[i].Item1 - _queries[i].Item2);
            var min = Math.Min(r2Cover + d1 + queryCover, r1Cover + d2 + queryCover);

            _memo[r1 + 1][r2 + 1] = min;

            return min;
        }
    }
 
    class Solution
    {
        static void Main(string[] args)
        {
            //CaseSub0();
            //Case0();
            //Case1();
            //Case2();
            //RandomBigCase();
            //return;

            var T = int.Parse(Console.ReadLine());
            for (var i = 0; i < T; i++)
            {
                ProcessCase();
            }
        }

        private static void ProcessCase()
        {
            var @case = Console.ReadLine().Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
            var M = int.Parse(@case[0]);
            var N = int.Parse(@case[1]);
            var queries = new List>(N);
            for (int i = 0; i < N; i++)
            {
                var query = Console.ReadLine().Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
                queries.Add(Tuple.Create(int.Parse(query[0]), int.Parse(query[1])));
            }
            var solution = new Solver(M, N, queries).SolveCase();
            Console.WriteLine(solution);
        }
  
  private static void CaseSub0()
        {
            var M = 2;
            var N = 1;
            var queries = new List>();
            queries.Add(Tuple.Create(1, 2));
            var solution = new Solver(M, N, queries).SolveCase();
            Console.WriteLine("Casesub0 " + solution);
        }

        private static void Case0()
        {
            var M = 5;
            var N = 4;
            var queries = new List>();
            queries.Add(Tuple.Create(1, 5));
            queries.Add(Tuple.Create(3, 2));
            queries.Add(Tuple.Create(4, 1));
            queries.Add(Tuple.Create(2, 4));
            var solution = new Solver(M, N, queries).SolveCase();
            Console.WriteLine("Case0 " + solution);
        }

        private static void Case1()
        {
            var M = 4;
            var N = 2;
            var queries = new List>();
            queries.Add(Tuple.Create(1, 2));
            queries.Add(Tuple.Create(4, 3));
            var solution = new Solver(M, N, queries).SolveCase();
            Console.WriteLine("Case1 " + solution);
        }

        private static void Case2()
        {
            var M = 10;
            var N = 3;
            var queries = new List>();
            queries.Add(Tuple.Create(2, 4));
            queries.Add(Tuple.Create(5, 4));
            queries.Add(Tuple.Create(9, 8));
            var solution = new Solver(M, N, queries).SolveCase();
            Console.WriteLine("Case2 " + solution);
        }

        private static void RandomBigCase()
        {
            var M = 1000;
            var N = 1000;
            var queries = new List>();
            var r = new Random(666);
            while (queries.Count != N)
            {
                var from = r.Next(1, M + 1);
                var to = r.Next(1, M + 1);
                var t = Tuple.Create(from, to);
                if (!queries.Contains(t))
                    queries.Add(t);
            }
            var solution = new Solver(M, N, queries).SolveCase();
            Console.WriteLine("RandomBigCase " + solution);
        }
    }
}

Monday, October 31, 2016

Dynamic Programming in 5 easy steps - Examples - Set partitioning

Please find the updated version of this post here: https://piotr.westfalewicz.com/blog/2016/10/dynamic-programming-in-5-easy-steps---examples---set-partitioning/


Let's continue Dynamic Programming series. Last time we have covered Text Justification problem, this time we will do something harder. The problem statement is as follows:
Given a set S of positive integers, divide it into two sets S1 and S2 such that the absolute difference between their sums is minimum.

If there is a set S with n elements, then if we assume S1 has m elements, S2 must have n-m elements and the value of abs(sum(S1) – sum(S2)) should be minimum.
The 5 easy steps are:
  1. define subproblems, count number of subproblems
  2. guess (part of solution), count number of choices
  3. relate subproblem solutions, compute time per subproblem
  4. recurse + memoize OR build DP table bottom-up check subproblems acyclic/topological order, time = time per subproblem · number of subproblems
  5. solve original problem: = a subproblem OR by combining subproblem solutions ⇒ extra time 
Taking an example, assume a set S={ 1, 6, 11, 5}.
Step 1. As previously, we want to divide the problem to subproblems.
How this can be done? Suffixes, prefixes or the mix of those two are usually the answer. It's natural to imagine that in the middle of processing the set looks at some point as follows .., 11, 5. At this very point we already assumed we are working on suffixes. 
In that situation we can either include the 11 in the set S1 or in the set S2. I'll say that we take or leave the number, because we have only two choices.
Related, very important aspect - we have to track only one sum! The other one is just totalSum-chosenSetSum.
Getting back to our situation, what is the "context" of the subproblem? Does it matter how we got to this point? Not at all, what matters is only the chosenSetSum. Obviously, we have to also keep track of our current item index. Here we have it, the subproblems.

Step 2. How our guess will look like? We can increase the sum and progress with the problem,
or we can just progress with the problem, assuming we leave the item and it will be in the second set.

Step 3. Now we have to glue two pieces togeter. Our DP function will have two arguments as noted above, chosenSetSum and index.
We want to optimize for the minimum difference between the sets and that's what will be returned by the DP function. All in all, the equation is as follows:
DP(chosenSetSum, i) = min from:
  • DP(chosenSetSum, i+1) //when skipping the i'th item
  • DP(chosenSetSum + inputSet[i], i+1)) //when taking the i'th item
The corner case is DP(X, inputSet.Length) = totalSum-X, because that's the end of partitioning.
The time complexity of this approach is:
time = time per subproblem · number of subproblems, so
time = O(1) * O(sum * n) = O(sum * n)
Yey, we have pseudo-polynomial time solution.

Step 4. Is it actually an acyclic graph of DP function invocations? My reasoning here is that we have two dimensions (chosenSetSum and item index) where both increase and there is an end when the second variable reaches input set cardinality, thus it's finite algorithm.

Step 5. We start with chosenSetSum=0 and at the first index, therefore our solution is at DP(0,0).

The Code

Having the 5 steps figured out, it's relatively easy to code the solution.

Pure DP algorithm coded without memorization

private class Partitioner
{
    private readonly int[] _inputSet;
    private readonly int _inputSetSum;

    public Partitioner(int[] inputSet)
    {
        _inputSet = inputSet;
        _inputSetSum = _inputSet.Sum();
    }

    public void Solve()
    {
        DP(0, 0);
    }

    private int DP(int chosenSetSum, int i)
    {
        if (i == _inputSet.Length)
            //Note: returning the difference between chosenSetSum and skippedSetSum, 
            //not the skippedSetSum (which would be _inputSetSum - chosenSetSum)
            return Math.Abs(_inputSetSum - 2 * chosenSetSum);

        var chooseCurrentNumberScore = DP(chosenSetSum + _inputSet[i], i + 1);
        var skipCurrentNumberScore = DP(chosenSetSum, i + 1);
        var minDiff = Math.Min(skipCurrentNumberScore, chooseCurrentNumberScore) ;

        return minDiff;
    }
}
How cool is that? 6 lines of code for almost working solution!

Memoization added

private class Partitioner
{
    private const int Marker = -1;
    private readonly int[] _inputSet;
    private readonly int _inputSetSum;
    private readonly int[][] _memo;

    public Partitioner(int[] inputSet)
    {
        _inputSet = inputSet;
        _inputSetSum = _inputSet.Sum();
        _memo = new int[_inputSetSum][];
        for (var i = 0; i < _inputSetSum; i++)
        {
            _memo[i] = new int[_inputSet.Length];
            for (var j = 0; j < _inputSet.Length; j++)
                _memo[i][j] = Marker;
        }
    }

    public void Solve()
    {
        DP(0, 0);
    }

    private int DP(int chosenSetSum, int i)
    {
        if (i == _inputSet.Length)
            //Note: returning the difference between chosenSetSum and skippedSetSum, 
            //not the skippedSetSum (which would be _inputSetSum - chosenSetSum)
            return Math.Abs(_inputSetSum - 2 * chosenSetSum);

        if (_memo[chosenSetSum][i] != Marker)
            return _memo[chosenSetSum][i];

        var chooseCurrentNumberScore = DP(chosenSetSum + _inputSet[i], i + 1);
        var skipCurrentNumberScore = DP(chosenSetSum, i + 1);
        var minDiff = Math.Min(skipCurrentNumberScore, chooseCurrentNumberScore);

        _memo[chosenSetSum][i] = minDiff;
        return minDiff;
    }
}

Full solution with parent pointers and examples

using System;
using System.Collections.Generic;
using System.Linq;

public class DP_SetPartition
{
    static void Main(String[] args)
    {
        PrintSolution(new int[0]);
        PrintSolution(new[] { 3 });
        PrintSolution(new[] { 3, 3 });
        PrintSolution(new[] { 3, 2 });
        PrintSolution(new[] { 1, 4, 2 });
        PrintSolution(new[] { 1, 2, 4 });
        PrintSolution(new[] { 1, 1, 5, 1, 1, 1 });
        PrintSolution(new[] { 1, 6, 11, 5 });
        PrintSolution(new[] { 1, 6, 11, 5 });
        PrintSolution(new[] { 25, 21, 20, 17, 8 });
        PrintSolution(new[] { 1, 5, 6, 10 });
        PrintSolution(new[] { 3, 1, 4, 2, 2, 1 });
        PrintSolution(new[] { 200, 200, 300, 300, 400, 400, 1000, 1000 });
        PrintSolution(new[] { 100, 100, 200, 200, 300, 300, 400, 400, 1000, 1000 });
        Console.ReadLine();
    }

    private static void PrintSolution(int[] set)
    {
        Console.WriteLine($"Set: {string.Join(", ", set)}");
        int[] set1;
        int[] set2;
        new Partitioner(set).Solve(out set1, out set2);
        Console.WriteLine($"A: {string.Join(", ", set1)} \t\t B: {string.Join(", ", set2)}, sumdiff: {Math.Abs(set1.Sum() - set2.Sum())}");
        Console.WriteLine();
    }

    private class Partitioner
    {
        private const int Marker = -1;
        private readonly int[] _inputSet;
        private readonly bool[][] _parentPointers;
        private readonly int _inputSetSum;
        private readonly int[][] _memo;

        public Partitioner(int[] inputSet)
        {
            _inputSet = inputSet;
            _inputSetSum = _inputSet.Sum();
            _memo = new int[_inputSetSum][];
            _parentPointers = new bool[_inputSetSum][];
            for (var i = 0; i < _inputSetSum; i++)
            {
                _memo[i] = new int[_inputSet.Length];
                _parentPointers[i] = new bool[_inputSet.Length];
                for (var j = 0; j < _inputSet.Length; j++)
                {
                    _memo[i][j] = Marker;
                }
            }
        }

        public void Solve(out int[] set1, out int[] set2)
        {
            DP(0, 0);
            var chosenSet = new List<int>();
            var skippedSet = new List<int>();
            var sum = 0;
            var i = 0;
            while (i != _inputSet.Length)
            {
                var choose = _parentPointers[sum][i];
                if (choose)
                {
                    chosenSet.Add(_inputSet[i]);
                    sum += _inputSet[i];
                }
                else
                {
                    skippedSet.Add(_inputSet[i]);
                }
                i++;
            }
            set1 = chosenSet.ToArray();
            set2 = skippedSet.ToArray();
        }

        private int DP(int chosenSetSum, int i)
        {
            if (i == _inputSet.Length)
                return Math.Abs(_inputSetSum - 2 * chosenSetSum);

            if (_memo[chosenSetSum][i] != Marker)
                return _memo[chosenSetSum][i];

            var chooseCurrentNumberScore = DP(chosenSetSum + _inputSet[i], i + 1);
            var skipCurrentNumberScore = DP(chosenSetSum, i + 1);
            var minDiff = skipCurrentNumberScore;
            if (chooseCurrentNumberScore < skipCurrentNumberScore)
            {
                minDiff = chooseCurrentNumberScore;
                _parentPointers[chosenSetSum][i] = true;
            }

            _memo[chosenSetSum][i] = minDiff;
            return minDiff;
        }
    }
}
It's working!

Tuesday, October 25, 2016

Dynamic Programming in 5 easy steps - Examples - Text Justification

Please find the updated version of this post here: https://piotr.westfalewicz.com/blog/2016/10/dynamic-programming-in-5-easy-steps---examples---text-justification/


Lately I spent a relatively great amount of time revising Algorithms and Data Structures. It's a broad subject, but for me personally I found the Dynamic Programming as the abandoned and unused method. It's also considered as one of the hardest methods to master, with few examples on the internet. Let's contribute a little with this post series. Today I will cover the first problem - text justification. The problem is from MIT lectures and I highly recommend well explained videos along with notes from course 6.006: Introduction to algorithms - fall 2011.

Text justification

Given a text, split it into multiple lines so it is “nicely” distributed between each line.
What does it mean? Take a look at the example:
We can clearly see that eager approach won't produce good results in all cases.
Mathematically, given a badness(i, j) function for line of words[i : j], we seek for such a splitting that the sum of all badness is minimum.
Let's assume the badness function is predefined as follows:
∞ if total length > page width, else (page width − total length)3
So in the example above, the bad splitting would have:
  • page width = 16
  • first line length = 4+4+4+2 = 16, badness of the first line = 0
  • second line length = 4, badness = (16-4)3=1728
  • third line length = 16, badness = 0
  • badness sum = 1728
The good splitting would have:
  • first line length = 4+4+1 = 9, badness of the first line = (16-9)3=343
  • second line length = 9, badness = (16-9)3=343
  • third line length = 16, badness = 8
  • badness sum = 694
The second splitting has smaller total badness sum, thus it's a better splitting.

Dynamic Programming in 5 easy steps

On the MIT course, there is presented an approach to DP in 5 easy steps. Those are:
  1. define subproblems, count number of subproblems
  2. guess (part of solution), count number of choices
  3. relate subproblem solutions, compute time per subproblem
  4. recurse + memoize OR build DP table bottom-up check subproblems acyclic/topological order, time = time per subproblem · number of subproblems
  5. solve original problem: = a subproblem OR by combining subproblem solutions ⇒ extra time 
Let's try to apply them on our problem.
Well, the steps don't have to be executed always in order. Let's try to figure out the second step. We know where our line begins, it's on ith position. Where is the end of the line? We don't know. So let's check all possibilities, that's from ith onward and take the smallest badness. So now we know what's our subproblems in first step are - suffixes of word[i:]. That's pretty much the algorithm. The third step will look like:
DP(i)=for j in i+1...n+1 take
min from (DP(j) + badness(i, j))
The great thing is about DP that DP(j) considered to be free (because we compute the subproblem once and memorize the answer). Therefore the total time needed by this algorithm is:
time = time per subproblem · number of subproblems, so
time = O(n) · O(n)
time = O(n^2)
The fourth point is about proving that this really works and the time needed is finite. We compute the answer from the end of the words, down to word number 0. The recursive DP function always accesses greater indexes, therefore time needed is finite. Lastly, we have to find our answer. Here, obviously it's DP[0]. Easy? Yup, let's code it!

The code

using System;
using System.Collections.Generic;
using System.Linq;

namespace Algos
{
    public class DP_TextJustification
    {
        static void Main(String[] args)
        {
            //please note: this is not a production-ready solution!
            PrintSolution("   ", 10);
            PrintSolution("oneword", 10);
            PrintSolution("one line", 10);
            PrintSolution("two lines", 6);
            PrintSolution("blah blah blah blah reallylongword", 14);
            PrintSolution("blah blah blah blah reallylongword 1", 14);
            var txt1 = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " +
                       "Praesent varius urna eget lacinia pharetra. " +
                       "Vivamus lacinia in dui vitae sodales. " +
                       "Aliquam auctor justo nec condimentum pretium. " +
                       "Curabitur egestas tellus dolor, vel tempus lacus blandit vitae. " +
                       "Cras dapibus scelerisque ex nec gravida.";
            PrintSolution(txt1, 60);
            PrintSolution(txt1, 70);
            PrintSolution(txt1, 80);
            PrintSolution(txt1, 90);
            PrintSolution(txt1, 100);
            Console.ReadLine();
        }

        public static void PrintSolution(string txt, int width)
        {
            Console.WriteLine($"Test length: {txt.Length}, line width: {width}");
            foreach (var line in new Justifier(txt, width).Solve())
            {
                Console.WriteLine("{0,-" + width + "}|", line);
            }
            Console.WriteLine();
        }

        private class Justifier
        {
            private const int Marker = -1;
            private readonly int _width;
            private readonly string[] _words;
            private readonly int[] _parentPointers;
            private readonly int[] _memo;
             
            public Justifier(string text, int width)
            {
                _width = width;
                _words = text.Split(new[] {" "}, StringSplitOptions.RemoveEmptyEntries);
                _parentPointers = new int[_words.Length];
                _memo = new int[_words.Length];
                for (var i = 0; i < _memo.Length; i++)
                {
                    _memo[i] = Marker;
                }
            }

            public List<string> Solve()
            {
                DP(0);

                var result = new List<string>();
                var from = 0;
                while (from != _words.Length)
                {
                    var next = _parentPointers[from];
                    result.Add(string.Join(" ", _words.Skip(from).Take(next - from)));
                    from = next;
                }
                return result;
            }

            private int DP(int i)
            {
                if (i == _words.Length) //no words in line is the end of justification
                    return 0;

                if (_memo[i] != Marker) //if we have solution calculated, return it
                    return _memo[i];

                var minBadness = int.MaxValue;

                for (var j = i + 1; j <= _words.Length; j++)
                {
                    var badness = Badness(i, j) + DP(j);
                    if (minBadness > badness)
                    {
                        _parentPointers[i] = j; //remember best choice
                        minBadness = badness;
                    }
                }

                _memo[i] = minBadness; //remember solution
                return minBadness;
            }

            private int Badness(int i, int j)
            {
                var lengthOfWords = _words.Skip(i).Take(j - i).Sum(s => s.Length);
                var numberOfSpaces = j - i - 1;
                if (lengthOfWords + numberOfSpaces > _width)
                    return 10*1000*1000;
                return (int)Math.Pow(_width - lengthOfWords - numberOfSpaces, 3);
            }
        }
    }
}
To the solution in 5 easy steps I've added parent pointers - a simple way to remember the best solution from DP. The result is as follows:
If we don't care about justifying the last line, we can easily modify the badness function to return no cost for the last line:
private int Badness(int i, int j)
{
    var lengthOfWords = _words.Skip(i).Take(j - i).Sum(s => s.Length);
    var numberOfSpaces = j - i - 1;
    if (lengthOfWords + numberOfSpaces > _width)
        return 10*1000*1000;
    if (j == _words.Length) //don't care about last line
        return 0;
    return (int)Math.Pow(_width - lengthOfWords - numberOfSpaces, 3);
}
In this case, the result will be much different in some cases:

Tuesday, September 6, 2016

Cassandra Datastax C# Driver problems - solution

Please find the updated version of this post here: https://piotr.westfalewicz.com/blog/2016/09/cassandra-datastax-c-sharp-driver-problems---solution/


In the previous post I've described a strange problem related to Cassandra Datastax C# Driver which was happening once in the production environment. It's time to reveal the mystery.

Two root causes

1.

One of the hidden, but important metric, which you won't find usually in your logs is the CPU usage. What important is that the connection setup to the Cassandra cluster consists of many small steps. In production, when there was a very high CPU usage (around 100% - for reason known and already eliminated), the connection setup process was timing out in such a moment, that the final result was reported as NoHostAvailableException. This shows, how important is to track and prevent 100% CPU usage.

2.

But why, let me quote myself:
Things get back to normal after the client restart... and gets back to madness few hours later, at higher load. Incredible high number of NoHostAvailableExceptions, like almost any connection to the Cassandra fails.
The problem is here:
private readonly Cluster _cluster;
private readonly ConcurrentDictionary<string, Lazy<ISession>> _sessions; //lockless session cache

public ISession GetSession(string keyspaceName)
{
    if (!_sessions.ContainsKey(keyspaceName))
    {
        _sessions.GetOrAdd(keyspaceName, key => new Lazy<ISession>(() => _cluster.Connect(key)));
    }
    var result = _sessions[keyspaceName];
    return result.Value;
}
Do you see it? It turns out that when an exception is thrown in _cluster.Connect(key) method, the Lazy<T> will cache this exception. Therefore all invocations to Lazy<T>.Value will result in the same, cached exception instead of retrying the connection to the Cassandra cluster. If you are planning to use the Lazy<T> class, there are more "gotchas". Read the documentation on MSDN.

Lessons learned?


  1. CPU usage is very, very important and critical metric. Do not ignore it, as it may lead to numerous, strange errors.
  2. RTFM! Read the documentation when using any class for the first time. Especially when copying&pasting code from the StackOverflow.

Tuesday, August 30, 2016

Cassandra Datastax C# Driver problems - NoHostAvailableException

Please find the updated version of this post here: https://piotr.westfalewicz.com/blog/2016/08/cassandra-datastax-c-sharp-driver-problems---nohostavailableexception/


This post will be about my journey with fixing nasty Cassandra Datastax C# driver problem, which took me a lot more time than expected...
Credits: wikimedia

Once upon a time, I've been fixing following exception:
Cassandra.NoHostAvailableException: None of the hosts tried for query are available (tried: x.x.x.x:9042, x.x.x.x:9042, x.x.x.x:9042)
   at Cassandra.ControlConnection.Connect(Boolean firstTime)
   at Cassandra.Cluster.Connect(String keyspace)
   at Company.Code.CassandraSessionCache.GetSession(String keyspaceName)
   at Company...
The CassandraSessionCache looked like this:
public class CassandraSessionCache
{
    private readonly Cluster _cluster;
    private readonly ConcurrentDictionary<string, Lazy<ISession>> _sessions; //lockless session cache

    public CassandraSessionCache(Cluster cluster)
    {
        _cluster = cluster;
        _sessions = new ConcurrentDictionary<string, Lazy<ISession>>();
    }

    public ISession GetSession(string keyspaceName)
    {
        if (!_sessions.ContainsKey(keyspaceName))
        {
   _sessions.GetOrAdd(keyspaceName, key => new Lazy<ISession>(() => _cluster.Connect(key)));
        }
        var result = _sessions[keyspaceName];
        return result.Value;
    }
}
Nothing fancy, however let me give you an insight about the architecture and circumstances of the error:
  • Cassandra cluster is in Amazon
  • The client is Cassandra Datastax C# Driver 2.6.0, also on server in Amazon
  • Both the client and the Cassandra cluster is the same Amazon Region
  • Amazon Region had no availability issues during given period
  • The solution was working fine for over 1 month! The client process is being restarted ~every week for various reasons
  • The client follows Cassandra Datastax C# Driver Best Practices
  • Heartbeat is turned on, so the connection should be alive, all the times
  • Things get back to normal after the client restart... and gets back to madness few hours later, at higher load. Incredible high number of NoHostAvailableExceptions, like almost any connection to the Cassandra fails.
  • Of course, it works on my machine®

What didn't happen?

There are plenty of questions about Cassandra.NoHostAvailableException on StackOverflow. So let's get through some of them and exclude them:
  • [1][2] - no, because following C# Driver best practices excludes this
  • [3] - no, because we are using default retry strategy from driver version 2.6.0
  • [4][5][6] - no, because we are able to connect to the Cluster at the beginning
  • [7] - no, because we are not misusing batches

Debugging...

Logs on server revealed that the client closed the connection:
INFO  [SharedPool-Worker-3] yyyy-mm-dd 11:04:45,625 Message.java:605 - Unexpected exception during request; channel = [id: 0x9eaf52c5, /x.x.x.x:y :> /x.x.x.x:9042]
java.io.IOException: Error while read(...): Connection reset by peer
  at io.netty.channel.epoll.Native.readAddress(Native Method) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
  at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.doReadBytes(EpollSocketChannel.java:675) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
  at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.epollInReady(EpollSocketChannel.java:714) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
  at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.epollRdHupReady(EpollSocketChannel.java:689) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
 at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:329) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
 at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:264) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
 at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
 at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
 at java.lang.Thread.run(Thread.java:745) [na:1.7.0_80]
While the message on the client says that there are no hosts available, what is confirmed by debug logs on the client side. Pretty interesting, huh? Being confused, I've decided to give an update from 2.6.3 to 2.7 a try... but that didn't help.
Accoring to yet another issue regarding NoHostsAvailableException on StackOverflow I've started to log whole exception, with serialized errors property. This is what I've logged:
System.Exception: NoHostAvailableException happened. Errors: {
  "x.x.x.x:9042": {
    "NativeErrorCode": 10060,
    "ClassName": "System.Net.Sockets.SocketException",
    "Message": "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond",
    "Data": null,
    "InnerException": null,
    "HelpURL": null,
    "StackTraceString": "   
  at Cassandra.Connection.<Open>b__9(Task`1 t)
  at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
  at System.Threading.Tasks.Task.Execute()",
    "RemoteStackTraceString": null,
    "RemoteStackIndex": 0,
    "ExceptionMethod": "8\n<Open>b__9\nCassandra, Version=2.7.0.0, Culture=neutral, PublicKeyToken=10b231fbfc8c4b4d\nCassandra.Connection\nCassandra.AbstractResponse <Open>b__9(System.Threading.Tasks.Task`1[Cassandra.AbstractResponse])",
    "HResult": -2147467259,
    "Source": "Cassandra",
    "WatsonBuckets": null
  },
  "x.x.x.x:9042": {
    "NativeErrorCode": 10060,
    "ClassName": "System.Net.Sockets.SocketException",
    "Message": "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond",
    "Data": null,
    "InnerException": null,
    "HelpURL": null,
    "StackTraceString": "
  at Cassandra.Connection.<Open>b__9(Task`1 t)
  at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
  at System.Threading.Tasks.Task.Execute()",
    "RemoteStackTraceString": null,
    "RemoteStackIndex": 0,
    "ExceptionMethod": "8\n<Open>b__9\nCassandra, Version=2.7.0.0, Culture=neutral, PublicKeyToken=10b231fbfc8c4b4d\nCassandra.Connection\nCassandra.AbstractResponse <Open>b__9(System.Threading.Tasks.Task`1[Cassandra.AbstractResponse])",
    "HResult": -2147467259,
    "Source": "Cassandra",
    "WatsonBuckets": null
  }
}
Unfortunately, no interesting data is here.

So what could possibly go wrong?

Can you spot the error? I couldn't. Any guesses? Find the answer in next post.

Thursday, August 18, 2016

Algorithms and data structures - non-academic trees

Please find the updated version of this post here: https://piotr.westfalewicz.com/blog/2016/08/algorithms-and-data-structures---non-academic-trees/

Credits: Wikipedia Tree (data structure)
There are many types of trees which are covered on Computer Science lectures. Those usually include: Binary Search Tree, AVL Tree, B Tree, Splay Tree, Red-Black Tree, Trie Trees, Heap Trees.

Those are indeed very useful and practical trees with lots of applications. However, I've discovered few other trees while brushing up my knowledge about algorithms and data structures. Here they are, the most interesting, yet not so popular trees:
  • BK-Tree - do you want to find misspellings of a word in a dictionary? E.g. given word "dog" and dictionary { "cat", "fog", "dot", "cookie" }, naive approach is to compare the word "dog" to all of the entries in the dictionary. This leads to O(n) time. It can be solved in O(lg n) time, though. Burkhard-Keller tree is used in Apache Lucene, for example. Head to Xenopax's Blog for awesome post about BK-Trees.
  • Merkle Tree - probably you didn't know but that's the name of the tree of commits and blobs in a Git VCS. Another applications known to me personally include: Cassandra (during node repair) and Bitcoin blockchain.
  • Interval Tree - interesting idea of augmenting "normal" (single value) trees with additional data in order to solve windowing queries.
  • Lemon Tree - the most complicated type of tree. Many wondered what it really is, but few actually knew... Find the official statement here.

Tuesday, July 26, 2016

The performance of setting T[] vs. List by index

Source: wikimedia.org

Let's compare asymptotic time complexity of two following loops:
int[] _array = Enumerable.Range(0, n).ToArray();
List<int> _list = Enumerable.Range(0, n).ToList();

//ListChangeByIndex
for (var i = 0; i < n; i++)
{
    _list[i] = i + 1;
}

//ArrayChangeByIndex
for (var i = 0; i < n; i++)
{
    _array[i] = i + 1;
}
How do you think, which one is faster?
.
.
.
.
.
.
.
.
Many developers think the fist one will be slower, because in each loop computer is forced to visit all nodes from 0 to i to finally set the variable.
However, that's not the case. Both loops have O(n) complexity. That's because in the .NET source we can clearly see that's underlying data structure for a List<T> is an array: list.cs. Therefore, those two loops are essentially equal.

Tuesday, July 12, 2016

Presentation recommendation - Cloud-based Microservices powering BBC iPlayer

Today I recommend you following presentation: Cloud-based Microservices powering BBC iPlayer

Why? It's interesting (at least for me) how one of the most popular British broadcasting organisation make their's channels available online. A high-level architecture is presented. If you are new to AWS, you will also learn a thing or two about the Amazon Cloud.

Thursday, June 30, 2016

Git tips - replace all occurrences of a string in files


Git can be used from VisualStudio, however it's like saying you drive a car, when actually you play Need for Speed. Unleash the full power of Git, learn to use it. It's not that hard.

Doesn't matter if you are a beginner or an advanced user, you should know what an git alias is. If you don't know, go here immediately: Git Basics - Git Aliases.

Today, you will get a very useful git alias. It's for replacing all occurrences of one string with another. Suppose you want to replace EntityFramework with NHibernate in your project (which seems to be a pretty reasonable thing to do:) ). Here is the alias:
replaceall = "!f() { git grep -l \"$1\" | xargs sed -b -i \"s/$1/$2/g\"; }; f"
The first part of the alias lists all files containing first argument and passes it through pipe and xargs to sed, which performs the replacement. Use it like this:
git replaceall EntityFramework NHibernate
 Please note: it will replace all occurrences of "EntityFramework" with "NHibernate" in all tracked and untracked files.

Monday, June 13, 2016

Messing with C# types. Making type1.FullName==type2.FullName, but not type1==type2!

Please find the updated version of this post here: https://piotr.westfalewicz.com/blog/2016/07/the-performance-of-setting-t-vs.-list-by-index/


Given the following method:
private static void CompareTypes(Type type1, Type type2)
{
    Console.WriteLine($"type1.FullName = {type1.FullName}");
    Console.WriteLine($"type2.FullName = {type2.FullName}");
    Console.WriteLine($"type1.FullName {(type1.FullName == type2.FullName ? '=' : '!')}= type2.Fullname");
    Console.WriteLine($"type1.AssemblyQualifiedName = {type1.AssemblyQualifiedName}");
    Console.WriteLine($"type2.AssemblyQualifiedName = {type2.AssemblyQualifiedName}");
    Console.WriteLine($"type1.AssemblyQualifiedName {(type1.AssemblyQualifiedName == type2.AssemblyQualifiedName ? '=' : '!')}= type2.AssemblyQualifiedName");
    Console.WriteLine($"type1.GUID = {type1.GUID}");
    Console.WriteLine($"type2.GUID = {type2.GUID}");
    Console.WriteLine($"type1.GUID {(type1.GUID == type2.GUID ? '=' : '!')}= type2.GUID");

    Console.WriteLine("o1 = Activator.CreateInstance(type1)");
    Console.WriteLine("o2 = Activator.CreateInstance(type2)");
    var o1 = Activator.CreateInstance(type1);
    var o2 = Activator.CreateInstance(type2);
    Console.WriteLine($"o1 == {o1}");
    Console.WriteLine($"o2 == {o2}");

    Console.WriteLine();
    Console.WriteLine($"but... type1 {(type1 == type2 ? '=' : '!')}= type2");
}
Is it possible to get the following result?
type1.FullName = MyLibrary.MyPrecious
type2.FullName = MyLibrary.MyPrecious
type1.FullName == type2.Fullname
type1.AssemblyQualifiedName = MyLibrary.MyPrecious, MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
type2.AssemblyQualifiedName = MyLibrary.MyPrecious, MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
type1.AssemblyQualifiedName == type2.AssemblyQualifiedName
type1.GUID = cacf8c0d-b903-3da6-808f-024a3070ab9d
type2.GUID = cacf8c0d-b903-3da6-808f-024a3070ab9d
type1.GUID == type2.GUID
o1 = Activator.CreateInstance(type1)
o2 = Activator.CreateInstance(type2)
o1 == MyLibrary.MyPrecious
o2 == MyLibrary.MyPrecious

but... type1 != type2
As it turns out, it is. Doing such a hell is relatively easy:
private static Assembly LoadAssemblyByName(string name)
{
    var myPreciousAssemblyLocation = Path.Combine(Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location), name);
    using (var fs = new FileStream(myPreciousAssemblyLocation, FileMode.Open, FileAccess.Read))
    {
        var data = new byte[fs.Length];
        fs.Read(data, 0, data.Length);
        fs.Close();
        var assembly = Assembly.Load(data);
        return assembly;
    }
}

static void Main()
{
    var type1 = typeof (MyPrecious);
    var myLibraryAssembly = LoadAssemblyByName("MyLibrary.dll");
    var type2 = myLibraryAssembly.GetType("MyLibrary.MyPrecious", true);

    CompareTypes(type1, type2);
}
The code above compares type1 from referenced project to type2 from the same assembly, but loaded again through Assembly.Load(byte[]). That makes the library loaded twice in the AppDomain. Now when a call to AppDomain.CurrentDomain.GetAssemblies() is made, the assemblies are:
AppDomain.CurrentDomain.GetAssemblies:
mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089
Microsoft.VisualStudio.HostingProcess.Utilities, Version=14.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a
System.Windows.Forms, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089
System, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089
System.Drawing, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a
Microsoft.VisualStudio.HostingProcess.Utilities.Sync, Version=14.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a
Microsoft.VisualStudio.Debugger.Runtime, Version=14.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a
vshost32, Version=14.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a
System.Core, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089
ConsoleApplication1, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
Accessibility, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3
Even in such a small, console application it is quite confusing. So, let's make it more confusing... What's the output of the following code?
var myPreciousAssemblyLocation = Path.Combine(Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location), "MyLibrary.dll");
var myLibraryAssemblyLoadFrom = Assembly.LoadFrom(myPreciousAssemblyLocation);
var type3 = myLibraryAssemblyLoadFrom.GetType("MyLibrary.MyPrecious", true);
CompareTypes(type1, type3);
Now, surprisingly, its:
type1.FullName = MyLibrary.MyPrecious
type2.FullName = MyLibrary.MyPrecious
type1.FullName == type2.Fullname
type1.AssemblyQualifiedName = MyLibrary.MyPrecious, MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
type2.AssemblyQualifiedName = MyLibrary.MyPrecious, MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
type1.AssemblyQualifiedName == type2.AssemblyQualifiedName
type1.GUID = cacf8c0d-b903-3da6-808f-024a3070ab9d
type2.GUID = cacf8c0d-b903-3da6-808f-024a3070ab9d
type1.GUID == type2.GUID
o1 = Activator.CreateInstance(type1)
o2 = Activator.CreateInstance(type2)
o1 == MyLibrary.MyPrecious
o2 == MyLibrary.MyPrecious

but... type1 == type2

Hint

A nice hint is shown, when you try to execute the following code:
var o1 = Activator.CreateInstance(type1);
var o2 = Activator.CreateInstance(type2);
MyPrecious p1 = (MyPrecious) o1;
try
{
    MyPrecious p2 = (MyPrecious)o2;
}
catch (Exception e)
{
    Console.WriteLine(e);
}
System.InvalidCastException: [A]MyLibrary.MyPrecious cannot be cast to [B]MyLibrary.MyPrecious. Type A originates from 'MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null' in the context 'LoadNeither' in a byte array. Type B originates from 'MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null' in the context 'Default' at location 'c:\users\pwdev\documents\visual studio 2015\Projects\ConsoleApplication1\ConsoleApplication1\bin\Debug\MyLibrary.dll'. at ConsoleApplication1.Program.Main() in c:\users\pwdev\documents\visual studio 2015\Projects\ConsoleApplication1\ConsoleApplication1\Program.cs:line 49

Explanation

Yes, it's all about the load contexts. There are three different assembly load contexts: Load, LoadFrom, Neither. Usually there is no need to load the same library twice and get the strange behavior written above, but sometimes there might be. There are many advantages and disadvantages of using different Assembly.Load(From/File) methods. Take a look: Choosing a Binding Context. Furthermore, consider what's happening to assembly dependencies when you load an assembly. There are best practices described on MDSN for loading assemblies: Best Practices for Assembly Loading. I have to say, in my whole career I've been loading assemblies by hand twice, and from time perspective, both two cases were wrong.

TypeHandle

Instead of comparing the types in the examples above by == operator, there is a possibility to compare them by the TypeHandle:
TypeHandle encapsulates a pointer to an internal data structure that represents the type. This handle is unique during the process lifetime. The handle is valid only in the application domain in which it was obtained.
Source: MDSN. Well, I can't think of an interesting usage for the TypeHandles for now, but it's good to know.

Sunday, May 22, 2016

The five stages of coming to terms with Cassandra

From Wikimedia Commons, the free media repository
The five stages of coming to terms with JavaScript are:
  1. Denial: “I won’t need this language.”
  2. Anger: “Why does the web have to be so popular?”
  3. Bargaining: “OK, at least let me compile a reasonable language to JavaScript.”
  4. Depression: “Programming is not for me, I’ll pursue a career in masonry, like I always wanted.”
  5. Acceptance: “I can’t fight it, I may as well prepare for it.”

The same is with Cassandra - however, IMO in the opposite order:
  1. Acceptance: “I will use Cassandra. It's... AMAZING! Let me just quote Apache Cassandra landing page:"
       The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.
  2. Depression: “Damn, it's so well designed, but a complex piece of software and it doesn't work as expected.”
  3. Bargaining: "OK, at least let me try to tune it or report some bugs.”
  4. Anger: “Why is it so popular? Why it has so good PR?”
  5. Denial: “I won’t use it or recommend it ever again.”

The context

I've done the research, checked multiple websites - read about performance, architecture, hosting, maintenance, TCO, libraries, popularity... and Cassandra seemed to be a good database for time-series logs storage, with 95% writes (with SLA) and only 5% reads (without SLA). I've chosen prepared Cassandra Datastax virtual disk image on Amazon with bootstrap scripts, made a proof-of-concept solution and read a book or two about Cassandra. All seemed good. However, it's not post about the good. So ...fast forward...

The bad

Some stories which I remember:
  • Cassandra cluster is on production (along with pararell, old solution for this purpose). Phone rings at 2AM. C* cluster is down. Quick look at logs - OutOfMemoryException in random place in JVM. Outage time: 1h - let me just remind you "proven fault-tolerance". Cluster restart, it works again.
  • Next day at work, random hour, the same thing. Related bug: OutOfMemoryError after few hours from node restart 
  • After few days... firing repair - the standard C* operation, which you have to run at least every gc_grace_seconds, by default 10 days. Usually it worked, but then, unexpectedly the server died and later again and again, related issue: "Unknown type 0" Stream failure on Repair. Outage time: stopped counting.
  • Because of the failing servers in the cluster I decided to scale it out a little. Unfortunately, the issue above also made the scaling impossible.
  • After a while, I've encountered a second (thrid?) problem with the repair. Related bug: Repair session exception Validation failed

Fail

Let's get back to the landing page:
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.
Now, let's see at critical JIRA issue dates:
This means that for around one month at least few people could scale or repair their Cassandra clusters. I fully understand - it's free and Open-Sourced-Software. However, even if something it's free you expect it to work - that's the harsh reality. If it doesn't work just you look for something else. No offence Cassandra Datastax/Apache teams, you are doing truly amazing work, however in resilient software, stability is a TOP 1 requirement. 

Maybe it's me? Maybe I'm the only one having problems?

Fortunately (for me) not:
  1. Here is a presentation how guys at Adform switched from Cassandra to Aerospike: Big Data Strategy Minsk 2014 - Tadas Pivorius - Married to Cassandra
  2. My friend working at a different company also told me, that they used Cassandra and they abandoned it.
  3. Just looked at linked issues and the number of watchers.
In all cases the problems were similar to mine.

Thursday, May 5, 2016

Specifying requirements for live notification mechanism for systems integration purposes

Recently I've designed a mechanism to notify external systems (with which we cooperate) about changes in our system. This, obviously, can be done in multiple ways. Let's look at some considerations on a high level, some questions and how that affects our requirements.

Assumptions

  • we want to notify other, external systems, owned by someone else
  • allowed delay, between the change in our system and making the notification is around one minute
  • the change can carry multiple information and varies on the type of change
  • we expose an API which is currently used by those external systems - they fetch the changes periodically
  • the number of changes per second in our system is spiky in nature (assume 50-5000 notifications/second for now)
  • external systems will subscribe themselves for notifications
Those are real-life business assumptions, which are delivered to the designer/programmer/you/me.

Questions?

  1. How to notify external systems?
  2. What information should we pass? When is the notification delivered?
  3. How long should we wait for the response?
  4. When should we retry? 
Let's try to answer those questions.

Answers

  1. There are multiple external systems, made in multiple different technologies. The most popular and basic method of integration is just making HTTP(S) calls. Should it be GET, POST or X? Let's consider two most popular - the GETs and POSTs.
  2. We have to pass multiple values, depending on the notification type. For example, normal amount of information is: string (300 chars), 5 dates, 5 integers - therefore both GET (allowing ~2k chars on nearly all browsers and servers) and POST methods are viable. However, GET is very straightforward and simple. No issues with encoding, accepting compression or even reading the stream. What is more, GET put less pressure on your's servers as you do not have to send the body stream. Unfortunately GET query string is also visible for (nearly) everyone, therefore only-non sensitive information can be passed. What about concurrent notifications? How could one make "exactly-once" delivery model? Here is where we can use nicely one of our assumptions. Because of our API we can force external systems to fetch information through our API, after we will notify them. Such notification can be delivered in "at-least-once" model and we can provide non-sensitive, idempotent information about the change, which then can be used to get, full sensitive data from our API. One can even imagine an optimization - keep notifications to send in a buffer and delete duplicates in a small time bucket. 
  3. The obvious thing is that the longer we wait for responses the more resources are used. However, there is one more important thing. By specifying the request timeout, we can control how the architecture of the external system will look like. By saying "you have 30 seconds to process the notification" is like saying ~"you have a lot of time to get our notification, process it and synchronously ask our API then send us HTTP 200 status code". Compare it with "you have 3 seconds to store the notification for processing later or process it asynchronously". The implications are clear, short time = less required resources + better integration.
  4. We want to be sure that the notification reaches the external system and thanks to the design specified in second point we can use "at-least-once" delivery model. I see two options now: a) hit specified URL 3 times (for example), don't wait for the answer and don't send this notification ever again, b) hit specified URL, retry in X minutes if HTTP status code was different than 200. First option is very simple in implementation, however it assumes that external systems will develop a mechanism to avoid processing the same notifications multiple times - which will likely end in hitting our API three times for every single notification.

Conclusion

There we have it, answers which potentially should lead to a simple, sleek design which is relatively easy for implementation, completely fulfills the needs and requires a good design from external systems.

Monday, April 18, 2016

Presentation recommendation - The Microservices and DevOps Journey

Today I recommend you following presentation: The Microservices and DevOps Journey

Why? "Microservices" is a buzz-word, created around one year ago, still not popular in Google, but surprisingly popular on conferences: Google Trends: Microservices vs SOA. In my opinion, in this video, a sensible approach of transforming a monolith to a microservices system is presented. KISS architecture. LogStash, consul, Cassandra, Docker, Octopus are cool, however the question is: "Do you really need them?". Expect nothing super fancy though, I'm just sharing what I agree with.

Sunday, March 20, 2016

Little-known, useful, charming and beautiful algorithms - part 2

Please find the updated version of this post here: https://piotr.westfalewicz.com/blog/2016/03/little-known-useful-charming-and-beautiful-algorithms---part-2/

Image source: https://goo.gl/EGDuX7

This is the second post about charming algorithms, the first post is here: Little-known, useful, charming and beautiful algorithms - part 1

.NET Timers

How many timers there are in the .NET framework? It's not 2 or 3. It's 5:

  1. System.Timers.Timer
  2. System.Threading.Timer
  3. System.Windows.Forms.Timer
  4. System.Web.UI.Timer
  5. System.Windows.Threading.DispatcherTimer

Here is the comparison of 3 most important of them:
Comparing the Timer Classes in the .NET Framework Class Library.

Which timer would you use for implementing Speculative query execution (which, BTW. it is a nice idea too)? The answer should be: none of them.

.NET timers: internals&assumptions

As you can read in the .NET source code of the System.Threading.Timer, each AppDomain in .NET has one single native timer, supplied by the VM, to fire all managed timers in the AppDomain. The performance considerations are:
We assume that timers are created and destroyed frequently, but rarely actually fire. There are roughly two types of timer:
  - timeouts for operations.  These are created and destroyed very frequently, but almost never fire, because the whole point is that the timer only fires if something has gone wrong.
  - scheduled background tasks.  These typically do fire, but they usually have quite long durations. So the impact of spending a few extra cycles to fire these is negligible.
Because of this, we want to choose a data structure with very fast insert and delete times, but we can live with linear traversal times when firing timers.
The data structure we've chosen is an unordered doubly-linked list of active timers.  This gives O(1) insertion and removal, and O(N) traversal when finding expired timers.
Because of the nature of speculative query execution, we actually care about finding expired timers, and O(N) time is not acceptable. Imagine 10000 read requests per second per one single node to the Cassandra cluster in your web server (which isn't really a big deal, rather normal situation in high traffic scenarios). If you use speculative query execution, you get 10000 timers to manage. Stop, start and expiry processing operations aren't the big deal here, but the per-tick bookkeeping is. Fortunately, there is a better structure for keeping timers, than doubly-linked list.

Hashed and Hierarchical Timing Wheels

Adrian Colyer described all kinds of different timer structures in his blog post: Hashed and Hierarchical Timing Wheels: Data Structures for the Efficient Implementation of a Timer Facility. Go ahead and read it, as he done really good job explaining academic paper.

Real world applications

I've found two real world applications for those structures:
If you are looking for the C# implementation, the one and only implementation of this (which I've found) is here: HashedWheelTimer.cs (many thanks to Cassandra Datastax C# Driver team).

Friday, February 12, 2016

Little-known, useful, charming and beautiful algorithms - part 1

Please find the updated version of this post here: https://piotr.westfalewicz.com/blog/2016/02/little-known-useful-charming-and-beautiful-algorithms---part-1/

Image source: https://goo.gl/mJjN4S


Warning: this post won't be about "boring" or "typical" algorithms from Computer Science which we all have learned on studies (like quick sort, merge sort, xxx sort, A*, FFT). Instead, this will be about other little-known, especially USEFUL algorithms, which people working as professional developers should know or heard of.

Little-known

ID generation problems are usually overlooked. Database ID's I mean. Ask someone to name ID "types". Well, GUID, newsequentialid(), int/long (increased one by one, as in IDENTITY), NHibernate's Hi/Lo will be the answers. The first will make your SQL Server cry, when used as clustering key. Two next requires SQL database as a part of the generation process. That limits the performance as well as can be problematic in occasionally connected client scenarios. The last one is quite interesting - solves the problems related to database usage in the generation process, doesn't have the problem as the GUID has... however, it's not quite suitable for distributed (cloud) environment. Many servers can generate the same ID's, if they begin from the same initial ID.

To address those and other issues, listen guys from Twitter:
As we at Twitter move away from Mysql towards Cassandra, we've needed a new way to generate id numbers. There is no sequential id generation facility in Cassandra, nor should there be.
they designed a system for generating ID's with following requirements (source):
Performance
 - minimum 10k ids per second per process
 - response rate 2ms (plus network latency)
    Uncoordinated
       For high availability within and across data centers, machines generating ids should not have to coordinate with each other.

    (Roughly) Time Ordered
       We have a number of API resources that assume an ordering (they let you look things up "since this id").
       However, as a result of a large number of asynchronous operations, we already don't guarantee in-order delivery.
       We can guarantee, however, that the id numbers will be k-sorted (references: http://portal.acm.org/citation.cfm?id=70413.70419 and http://portal.acm.org/citation.cfm?id=110778.110783) within a reasonable bound (we're promising 1s, but shooting for 10's of ms).

    Directly Sortable
       The ids should be sortable without loading the full objects that the represent. This sorting should be the above ordering.

    Compact
       There are many otherwise reasonable solutions to this problem that require 128bit numbers. For various reasons, we need to keep our ids under 64bits.

    Highly Available
       The id generation scheme should be at least as available as our related services (like our storage services).
    Note this is a system for generating ID's. However, this system can be easily detached from the whole infrastructure and put into a nice algorithm or a program for generating uncoordinated, time ordered, k-sortable, compact IDs. In fact, there is already a .NET project for that: IdGen. I recommend you looking into the internals of the generation algorithm because the idea is very simple and sleek. I can't stress enough the usefulness of this algorithm in distributed systems. A must have!

    Concealed algorithm, which .NET developers use every day

    Whenever you use TCP protocol (by WebClient or HttpWebRequest), this algorithm kicks in. Normally it's good and useful (source):
       Nagle's document, Congestion Control in IP/TCP Internetworks (RFC 896) describes what he called the "small packet problem", where an application repeatedly emits data in small chunks, frequently only 1 byte in size. Since TCP packets have a 40 byte header (20 bytes for TCP, 20 bytes for IPv4), this results in a 41 byte packet for 1 byte of useful information, a huge overhead. (...)
       Nagle's algorithm works by combining a number of small outgoing messages, and sending them all at once. Specifically, as long as there is a sent packet for which the sender has received no acknowledgment, the sender should keep buffering its output until it has a full packet's worth of output, so that output can be sent all at once.
    However, sometimes it's not and it is good to know that such a thing even exists. Here is a performance test, which showed speed increase in Azure Queue PUT operations from ~210ms to ~28ms: Nagle’s Algorithm is Not Friendly towards Small Requests

    Little algorithms

    There are two "short" algorithms which I personally like. They aren't a big deal and by googling the problem, you probably will use one.

    What is your favorite algorithm/solution?

    In part two there will be one, special "algorithm".

    Thursday, January 7, 2016

    Checking "Star Wars - The Force Awakens" tickets availability with Azure WebJobs, scriptcs and SendGrid

    This user story is quite simple: there is a guy (me) who likes Star Wars. This guy wants to buy the best tickets available in an IMAX Cinema. The premiere was not so long ago, so a day after the showtimes are updated, the best seats are booked. This guy (also me) is quite lazy, so he doesn't like to check the showtimes manually.

    Hm... let's do it like pro developers using cutting-edge technologies!

    How the booking system works?

    There is this whole UI for selecting seats and so on, however there is one interesting request which I can use to check showtimes. It look like this:
    POST http://www.cinema-city.pl/scheduleInfoRows HTTP/1.1
    Host: www.cinema-city.pl
    Connection: keep-alive
    Content-Length: 52
    Accept: */*
    Origin: http://www.cinema-city.pl
    X-Requested-With: XMLHttpRequest
    User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36
    Content-Type: application/x-www-form-urlencoded
    Referer: http://www.cinema-city.pl/imax
    Accept-Encoding: gzip, deflate
    Accept-Language: en-US,en;q=0.8
    Cookie: bla bla bla
    
    locationId=1010304&date=09%2F01%2F2016&venueTypeId=2
    

    When there is a picture show on that date it returns an HTML table with links for booking, otherwise an empty HTML table.

    scriptcs to make the job done

    I've written a simple scriptcs which will make a POST with appropriate headers and check if a HTML link opening tag is in the response. If that's the case, I send an email using fresh, free SendGrid account.
    using System.Net;
    using System.Net.Mail;
    using SendGrid;
    
    public void SendMeEmail()
    {
     var myMail = new SendGridMessage();
     myMail.From = new MailAddress("Yoda@gmail.com");
     myMail.AddTo("me@gmail.com");
     myMail.Subject = "StarWars tickets are available!!";
     myMail.Text = "Go to CinemaCity IMAX to book them.";
    
     var credentials = new NetworkCredential("user-sendgrid@azure.com", "sendgrid-password");
     var transportWeb = new Web(credentials);
     transportWeb.DeliverAsync(myMail).Wait();
    }
    
    var httpClient = new HttpClient();
    httpClient.DefaultRequestHeaders.Add("Host", "www.cinema-city.pl");
    httpClient.DefaultRequestHeaders.Add("X-Requested-With", "XMLHttpRequest");
    httpClient.DefaultRequestHeaders.Add("Referer", "http://www.cinema-city.pl/imax");
    httpClient.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36");
    
    var content = new StringContent(@"locationId=1010304&date=09%2F01%2F2016&venueTypeId=2", Encoding.UTF8, @"application/x-www-form-urlencoded");
    var response = httpClient.PostAsync("http://www.cinema-city.pl/scheduleInfoRows", content).Result;
    var responseAsString = response.Content.ReadAsStringAsync().Result;
    
    var isMovieAvailable = responseAsString.Contains("<a");
    if(isMovieAvailable)
    {
     Console.WriteLine("Movie is available, sending email");
     SendMeEmail();
     Console.WriteLine("Movie is available, email sent");
    }
    else
    {
     Console.WriteLine("Movie is not available.");
    }
    

    Evironment setup is quite simple:
    • create a new SendGrid account
    • download scriptcs as zip (link) and unzip it to a folder StarWarsCheck
    • save the code as checkmovie.csx to folder StarWarsCheck
    • update checkmovie.csx with your SendGrid credentials
    • add reference to Sendgrid dll's by invoking scriptcs.exe -Install Sendgrid
    • now you can run the script locally. In a console write: scriptcs.exe checkmovie.csx
    If you are interested on other options how one can prepare a standalone, portable scriptcs scripts - check my question on StackOverflow: How to run scriptcs without installation? Make portable/standalone scriptcs (csx)

    Note: of course, once showtimes are updated, I will get email every hour. But that's good, isn't it? There is a chance I won't miss it.

    Create an Azure WebJob to run scriptcs file every hour

    You can use Azure WebJobs by simply uploading a .zip file with the job and configuring how it should be scheduled in Azure. The job entry point is based on naming convention. According to the documentation, the recommended script file to have in your job directory is: run.cmd. Therefore, my run.cmd look like this:
    call scriptcs.exe checkmovie.csx
    

    Next, pack whole StarWarsCheck folder as a zip file and upload it as Azure WebJob. Instructions are here.

    Effects

    It started with...

    But then...

    The email arrived without problems:

    However, most importantly, I've booked the best seats for my friends & me:

    Summary

    It was fun, easy and profitable to play with Azure WebJobs and scriptcs. I liked the scriptcs sleekness and Azure WebJobs simplicity. For sure I'll use them for something else in the future.