Thursday, February 23, 2017

Cassandra logs/time series storage design - optimal compaction strategy

Source: http://www.ccpixs.com/
Let's assume you are considering using Cassandra for logs storage or in general, for time series storage. Let's assume your usage pattern is that you store insane amounts of data for three months and you query them rarely, usually when something goes wrong and you need to investigate why. So, you have made business analysis. You have made technical analysis. You made performance benchmarks. You identified clustering and partition keys. You picked the right hardware. You tested how the cluster responds to peaks. You tested the cluster for minute, hour surges. You have been running the cluster in tests for weeks. You are now ready to deploy system to the production. You are now ready to fail in ~one month. Wait, what?!

Of course, all mentioned steps bring you closer to the success. It's good that there are plenty of great resources on how to design time series model in Cassandra. Just google "time series design Cassandra".

Surprisingly, the first five of them (dunno about others), don't mention setting the right compaction strategy.

The default compaction strategy (Size Tiered Compaction Strategy)

Source: https://www.slideshare.net/tomitakazutaka/cassandra-compaction
The default compaction strategy triggers a compaction when multiple SSTables of a similar size are present. To emphasize how evil is that for log storage scenario, let's assume that a week of production logs results in 1x 168GB SSTable, 3 x 42GB SSTables, 3 x 10GB SSTables, etc. After two weeks, the biggest SSTable will be probably still 168GB . That's fine. You have tested the cluster for two weeks of production load, right? The trap is, you want to store logs for three months. Nobody will spend that amount of time hitting the test cluster with production load. After three months, the biggest SSTable will be around 672GB and month after that probably 2688GB. The compaction isn't free. It takes your CPU, it takes your disc IOPS (yes, nice sequential writes, but still). It will take your life (or rather kill your cluster with pending compactions).

Solution: Date Tiered Compaction Strategy

Source: http://www.datastax.com/dev/blog/datetieredcompactionstrategy 
The date tiered compaction strategy will simply leave, after some time, old SSTables untouched. The same situation described above will translate to having (depending on your use case) for example 10x ~268GB SSTables, or perhaps 100x ~26GB SSTables. No killing compactions on old data! Read about the details here: DateTieredCompactionStrategy: Compaction for Time Series Data and here: Date-Tiered Compaction in Apache Cassandra. Yes, queries probably will be a little bit slower.

All in all, invoking this CQL (with carefully chosen consts), will save your cluster:
ALTER TABLE myprecouslogs.myprecioustable WITH compaction = { 'class' :  'DateTieredCompactionStrategy', 'base_time_seconds':'XXX', 'max_sstable_age_days':'YYY' }
and here is how to do it on a live organism: How to change Cassandra compaction strategy on a production cluster.

BTW. Do you know that there is also anticompaction?

Tuesday, February 21, 2017

[Backup] Why shouldn't you ever use ResilioSync? "Database Error" problem


As they say: there are two kinds of people in the World - those who pick up the ice cube that falls on the floor, and those who kick it under the fridge those who back up their files and those who haven't experienced losing all their files yet.

Which category do you fall in?

I decided to set up a backup system with ResilioSync - the heir apparent of the BitTorrent Sync software. Well, that wasn't good idea and I don't recommend anyone using this software.

Maybe it's me, however I prefer the backup software to be:
  • working and rock solid... dependable... stiff... hard... proven software
  • well documented
  • actively maintained (security patches, support)

Whereas my experience with ResilioSync turned out to be:
  • Working... kind of. That was until I've updated it. I don't remember precisely from which version to which version and you won't guess that from dates too, because the releases doesn't have dates. I believe that was from 2.4.3 to 2.4.4. Maybe from 2.4.X to 2.4.X. I'm sure the major number didn't change. It's so important, because the ResilioSync showed "database error" after upgrading. Bummer. This problem alone, caused that I exterminated that piece of software from all my devices. No, I didn't wan't to check why it happened, because it shouldn't happen at all. When my primary data would be gone, I would have lost my data.
  • The documentation is poor. You won't find it easy to follow. Sometimes you won't find what you are looking for at all.
  • There is very limited amount of activity on ResilioSync official forum. You also can't help yourself looking at the code, because it's closed source. My question about HTTPS still has no answer after 5 months.

This, of course, is just my opinion.

Wednesday, December 21, 2016

Dynamic Programming and the hardest problem on HackerRank

The hardest problem on HackerRank, sorted by Max Score and level "Expert" is Separate The Chocolate. It's worth 250 points and the level "Expert" is the highest one. How to solve it?

Problem Statement

Tom and Derpina have a rectangular shaped chocolate bar with chocolates labeled T, D and U. They want to split the bar into exactly two pieces such that:
  • Tom's piece can not contain any chocolate labeled D and similarly, Derpina's piece can not contain any chocolate labeled T and U can be used by either of the two.
  • All chocolates in each piece must be connected (two chocolates are connected if they share an edge), i.e. the chocolates should form one connected component
  • The absolute difference between the number of chocolates in pieces should be at most K
  • After dividing it into exactly two pieces, in any piece, there should not be 4 adjacent chocolates that form a square, i.e. there should not be a fragment like this: 
    XX
    XX
Input Format

The first line of the input contains 3 integers M, N and K separated by a single space.
M lines follow, each of which contains N characters. Each character is 'T','D' or 'U'.

Constraints

0≤ M, N ≤8
0≤ K ≤ M * N

Output Format

A single line containing the number of ways to divide the chocolate bar.

My approach

I didn't have any ideas how to solve it with DP. I've coded a solution which checks every possible way of chocolate division. I felt it wasn't the right solution because the complexity of that is exponential. The graph of possible permutations is checked wisely - at each "branch" I check the required conditions (connectivity, if K difference can be met, squares presence) and cancel the computation of subtrees with invalid state. The result? Quite nice: 16 out of 19 tests passed, 210.53 out of 250 points scored. I looked at the leaderboard, and noticed despite the winners many scored just like me. I've optimized some cases, however that wasn't enough. DP was indeed needed. I've tried to figure it out but finally gave up.

The solution

The whole editorial provided by HackerRank is very short:
This problem is a complicated dynamic programming problem (DP). The idea is simple but the implementation is difficult.

We can iterate the grids one by one. For each grid suppose the left and the upper one has been given to Tom or Derpina. (Color black or white.)

To decide the square [i][j], we need all the square’s state in square[i][0..j – 1] and square[i – 1][j..n -1] , all the (n + 1) squares in total can decide this square’s state. We can use these as a part of state and we also must keep whether a component is dead. (If it’s dead then add one more square of the same color is invalid.)
I can't wrap my head around the DP function though. It is strange for me that the vertical line will be the DP state - because the connectivity must be met for the whole matrix... anyone got an explanation?

Lessons learned

  1. If you can solve a part of a very hard problem - it still can be very profitable in terms of HackerRank score. Compare 210.53 points to 50 points from "Two Robots" medium problem.
  2. Looking at other's solutions, I've noticed this piece of code:
    Right after adding those three cases to my code, the result was 250 points (not counted towards HackerRank score, of course). Therefore: not everybody plays fair. 

Thursday, November 24, 2016

Dynamic Programming in 5 easy steps - Examples - Two Robots

This time we will solve a HackerRank problem, rated as a medium in difficulty. It's advised for you to go through a similar, but in my opinion easier problem described by me previously. The problem statement is as follows:
You have a warehouse with M containers filled with an infinite number of candies. The containers are arranged in a single row, equally spaced to be 1 meter apart. You also have 2 robots that can pick up 1 piece of candy and transport it between any two containers.

The robots take instructions in the form of queries consisting of two integers, Ma and Mb, respectively. To execute a query, a robot travels to container Ma, picks up 1 candy, transports it to container Mb, and then stops at Mb until it receives another query.

Calculate the minimum total distance the robots must travel to execute N queries in order.

Note: You choose which robot executes each query.
Reminder -  the 5 easy steps are:
  1. define subproblems, count number of subproblems
  2. guess (part of solution), count number of choices
  3. relate subproblem solutions, compute time per subproblem
  4. recurse + memoize OR build DP table bottom-up check subproblems acyclic/topological order, time = time per subproblem · number of subproblems
  5. solve original problem: = a subproblem OR by combining subproblem solutions ⇒ extra time 
Step 1. Let's try to figure out what our subproblems are. At a given point in solving the problem, what information do we want to have? First obvious thing: index of queries. Next obvious thing: we want to optimize over total distance made by robots, therefore our DP function will return the minimum total distance made by robots at some point in our problem. Having that, we also probably will need positions of two robots. The situation looks like this:

Step 2. Now, we basically have two choices, A: proceed with the first robot or B: with the second robot. When A - we add the distance of moving the r1 from it's last place Mb(r1) to Ma(INDEX), otherwise B: we add the distance of moving the r2 from it's last place Mb(r2) to Ma(INDEX). Additionally, in both cases we have to add the cost of moving Ma(INDEX) to Mb(INDEX).
Step 3. Writing the DP function is now straightforward:

DP(r1, r2, index) = min from:
   A: Ma(INDEX) - Mb(r1) + Mb(INDEX) - Ma(INDEX) + DP(INDEX, r2)
   B: Ma(INDEX) - Mb(r2) + Mb(INDEX) - Ma(INDEX) + DP(r1, INDEX)
However, having the index in DP function is redundant. We can decrease the time complexity of our solution by removing it - index is always the next item after max(r1, r2). Thus, DP function now looks like:

DP(r1, r2) = min from:
   A: Ma(max(r1, r2)+1) - Mb(r1) + Mb(max(r1, r2)+1- Ma(max(r1, r2)+1) + DP(max(r1, r2)+1, r2)
   B: Ma(max(r1, r2)+1) - Mb(r2) + Mb(max(r1, r2)+1- Ma(max(r1, r2)+1) DP(r1, max(r1, r2)+1)

The number of subproblems equals to M*M, because that's how many states which in the robots can be arranged. Time of computing one subproblem is constant, therefore the final complexity of this solution is O(M2).
Step 4. So is it finite algorithm? Yes, because the graph is traversed through two increasing values, r1 and r2.
Step 5. The solution is in DP(-1, -1), assuming -1 denotes that the robot wasn't used yet and the cost of moving it to Mx is 0 (because it starts at this location).

The code

using System;
using System.Collections.Generic;
using System.Linq;

namespace Algos
{
 public class Solver
    {
        private readonly int _m;
        private readonly int _n;
        private readonly List> _queries;
        private int[][] _memo;

        public Solver(int m, int n, List> queries)
        {
            _m = m;
            _n = n;
            _queries = queries;
        }

        public int SolveCase()
        {
            _memo = new int[_n + 1][];
            for (int i = 0; i < _n + 1; i++)
                _memo[i] = new int[_n + 1];
            for (int i = _n; i >= 0; i--)
            {
                for (int j = _n; j >= 0; j--)
                {
                    _memo[i][j] = -1;
                }
            }

            return Dp(-1, -1);
        }

        private int Dp(int r1, int r2)
        {
            if (r1 + 1 == _n || r2 + 1== _n)
                return 0;
            if (_memo[r1 + 1][r2 + 1] != -1)
                return _memo[r1 + 1][r2 + 1];

            var i = Math.Max(r1, r2) + 1;
            //r1 stays in place
            var r2Cover = 0;
            if (r2 != -1)
                r2Cover = Math.Abs(_queries[r2].Item2 - _queries[i].Item1);
            var d1 = Dp(r1, i);
            //r2 stays in place
            var r1Cover = 0;
            if (r1 != -1)
                r1Cover = Math.Abs(_queries[r1].Item2 - _queries[i].Item1);
            var d2 = Dp(i, r2);

            var queryCover = Math.Abs(_queries[i].Item1 - _queries[i].Item2);
            var min = Math.Min(r2Cover + d1 + queryCover, r1Cover + d2 + queryCover);

            _memo[r1 + 1][r2 + 1] = min;

            return min;
        }
    }
 
    class Solution
    {
        static void Main(string[] args)
        {
            //CaseSub0();
            //Case0();
            //Case1();
            //Case2();
            //RandomBigCase();
            //return;

            var T = int.Parse(Console.ReadLine());
            for (var i = 0; i < T; i++)
            {
                ProcessCase();
            }
        }

        private static void ProcessCase()
        {
            var @case = Console.ReadLine().Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
            var M = int.Parse(@case[0]);
            var N = int.Parse(@case[1]);
            var queries = new List>(N);
            for (int i = 0; i < N; i++)
            {
                var query = Console.ReadLine().Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
                queries.Add(Tuple.Create(int.Parse(query[0]), int.Parse(query[1])));
            }
            var solution = new Solver(M, N, queries).SolveCase();
            Console.WriteLine(solution);
        }
  
  private static void CaseSub0()
        {
            var M = 2;
            var N = 1;
            var queries = new List>();
            queries.Add(Tuple.Create(1, 2));
            var solution = new Solver(M, N, queries).SolveCase();
            Console.WriteLine("Casesub0 " + solution);
        }

        private static void Case0()
        {
            var M = 5;
            var N = 4;
            var queries = new List>();
            queries.Add(Tuple.Create(1, 5));
            queries.Add(Tuple.Create(3, 2));
            queries.Add(Tuple.Create(4, 1));
            queries.Add(Tuple.Create(2, 4));
            var solution = new Solver(M, N, queries).SolveCase();
            Console.WriteLine("Case0 " + solution);
        }

        private static void Case1()
        {
            var M = 4;
            var N = 2;
            var queries = new List>();
            queries.Add(Tuple.Create(1, 2));
            queries.Add(Tuple.Create(4, 3));
            var solution = new Solver(M, N, queries).SolveCase();
            Console.WriteLine("Case1 " + solution);
        }

        private static void Case2()
        {
            var M = 10;
            var N = 3;
            var queries = new List>();
            queries.Add(Tuple.Create(2, 4));
            queries.Add(Tuple.Create(5, 4));
            queries.Add(Tuple.Create(9, 8));
            var solution = new Solver(M, N, queries).SolveCase();
            Console.WriteLine("Case2 " + solution);
        }

        private static void RandomBigCase()
        {
            var M = 1000;
            var N = 1000;
            var queries = new List>();
            var r = new Random(666);
            while (queries.Count != N)
            {
                var from = r.Next(1, M + 1);
                var to = r.Next(1, M + 1);
                var t = Tuple.Create(from, to);
                if (!queries.Contains(t))
                    queries.Add(t);
            }
            var solution = new Solver(M, N, queries).SolveCase();
            Console.WriteLine("RandomBigCase " + solution);
        }
    }
}

Monday, October 31, 2016

Dynamic Programming in 5 easy steps - Examples - Set partitioning

Let's continue Dynamic Programming series. Last time we have covered Text Justification problem, this time we will do something harder. The problem statement is as follows:
Given a set S of positive integers, divide it into two sets S1 and S2 such that the absolute difference between their sums is minimum.

If there is a set S with n elements, then if we assume S1 has m elements, S2 must have n-m elements and the value of abs(sum(S1) – sum(S2)) should be minimum.
The 5 easy steps are:
  1. define subproblems, count number of subproblems
  2. guess (part of solution), count number of choices
  3. relate subproblem solutions, compute time per subproblem
  4. recurse + memoize OR build DP table bottom-up check subproblems acyclic/topological order, time = time per subproblem · number of subproblems
  5. solve original problem: = a subproblem OR by combining subproblem solutions ⇒ extra time 
Taking an example, assume a set S={ 1, 6, 11, 5}.
Step 1. As previously, we want to divide the problem to subproblems.
How this can be done? Suffixes, prefixes or the mix of those two are usually the answer. It's natural to imagine that in the middle of processing the set looks at some point as follows .., 11, 5. At this very point we already assumed we are working on suffixes. 
In that situation we can either include the 11 in the set S1 or in the set S2. I'll say that we take or leave the number, because we have only two choices.
Related, very important aspect - we have to track only one sum! The other one is just totalSum-chosenSetSum.
Getting back to our situation, what is the "context" of the subproblem? Does it matter how we got to this point? Not at all, what matters is only the chosenSetSum. Obviously, we have to also keep track of our current item index. Here we have it, the subproblems.

Step 2. How our guess will look like? We can increase the sum and progress with the problem,
or we can just progress with the problem, assuming we leave the item and it will be in the second set.

Step 3. Now we have to glue two pieces togeter. Our DP function will have two arguments as noted above, chosenSetSum and index.
We want to optimize for the minimum difference between the sets and that's what will be returned by the DP function. All in all, the equation is as follows:
DP(chosenSetSum, i) = min from:
  • DP(chosenSetSum, i+1) //when skipping the i'th item
  • DP(chosenSetSum + inputSet[i], i+1)) //when taking the i'th item
The corner case is DP(X, inputSet.Length) = totalSum-X, because that's the end of partitioning.
The time complexity of this approach is:
time = time per subproblem · number of subproblems, so
time = O(1) * O(sum * n) = O(sum * n)
Yey, we have pseudo-polynomial time solution.

Step 4. Is it actually an acyclic graph of DP function invocations? My reasoning here is that we have two dimensions (chosenSetSum and item index) where both increase and there is an end when the second variable reaches input set cardinality, thus it's finite algorithm.

Step 5. We start with chosenSetSum=0 and at the first index, therefore our solution is at DP(0,0).

The Code

Having the 5 steps figured out, it's relatively easy to code the solution.

Pure DP algorithm coded without memorization

private class Partitioner
{
    private readonly int[] _inputSet;
    private readonly int _inputSetSum;

    public Partitioner(int[] inputSet)
    {
        _inputSet = inputSet;
        _inputSetSum = _inputSet.Sum();
    }

    public void Solve()
    {
        DP(0, 0);
    }

    private int DP(int chosenSetSum, int i)
    {
        if (i == _inputSet.Length)
            //Note: returning the difference between chosenSetSum and skippedSetSum, 
            //not the skippedSetSum (which would be _inputSetSum - chosenSetSum)
            return Math.Abs(_inputSetSum - 2 * chosenSetSum);

        var chooseCurrentNumberScore = DP(chosenSetSum + _inputSet[i], i + 1);
        var skipCurrentNumberScore = DP(chosenSetSum, i + 1);
        var minDiff = Math.Min(skipCurrentNumberScore, chooseCurrentNumberScore) ;

        return minDiff;
    }
}
How cool is that? 6 lines of code for almost working solution!

Memoization added

private class Partitioner
{
    private const int Marker = -1;
    private readonly int[] _inputSet;
    private readonly int _inputSetSum;
    private readonly int[][] _memo;

    public Partitioner(int[] inputSet)
    {
        _inputSet = inputSet;
        _inputSetSum = _inputSet.Sum();
        _memo = new int[_inputSetSum][];
        for (var i = 0; i < _inputSetSum; i++)
        {
            _memo[i] = new int[_inputSet.Length];
            for (var j = 0; j < _inputSet.Length; j++)
                _memo[i][j] = Marker;
        }
    }

    public void Solve()
    {
        DP(0, 0);
    }

    private int DP(int chosenSetSum, int i)
    {
        if (i == _inputSet.Length)
            //Note: returning the difference between chosenSetSum and skippedSetSum, 
            //not the skippedSetSum (which would be _inputSetSum - chosenSetSum)
            return Math.Abs(_inputSetSum - 2 * chosenSetSum);

        if (_memo[chosenSetSum][i] != Marker)
            return _memo[chosenSetSum][i];

        var chooseCurrentNumberScore = DP(chosenSetSum + _inputSet[i], i + 1);
        var skipCurrentNumberScore = DP(chosenSetSum, i + 1);
        var minDiff = Math.Min(skipCurrentNumberScore, chooseCurrentNumberScore);

        _memo[chosenSetSum][i] = minDiff;
        return minDiff;
    }
}

Full solution with parent pointers and examples

using System;
using System.Collections.Generic;
using System.Linq;

public class DP_SetPartition
{
    static void Main(String[] args)
    {
        PrintSolution(new int[0]);
        PrintSolution(new[] { 3 });
        PrintSolution(new[] { 3, 3 });
        PrintSolution(new[] { 3, 2 });
        PrintSolution(new[] { 1, 4, 2 });
        PrintSolution(new[] { 1, 2, 4 });
        PrintSolution(new[] { 1, 1, 5, 1, 1, 1 });
        PrintSolution(new[] { 1, 6, 11, 5 });
        PrintSolution(new[] { 1, 6, 11, 5 });
        PrintSolution(new[] { 25, 21, 20, 17, 8 });
        PrintSolution(new[] { 1, 5, 6, 10 });
        PrintSolution(new[] { 3, 1, 4, 2, 2, 1 });
        PrintSolution(new[] { 200, 200, 300, 300, 400, 400, 1000, 1000 });
        PrintSolution(new[] { 100, 100, 200, 200, 300, 300, 400, 400, 1000, 1000 });
        Console.ReadLine();
    }

    private static void PrintSolution(int[] set)
    {
        Console.WriteLine($"Set: {string.Join(", ", set)}");
        int[] set1;
        int[] set2;
        new Partitioner(set).Solve(out set1, out set2);
        Console.WriteLine($"A: {string.Join(", ", set1)} \t\t B: {string.Join(", ", set2)}, sumdiff: {Math.Abs(set1.Sum() - set2.Sum())}");
        Console.WriteLine();
    }

    private class Partitioner
    {
        private const int Marker = -1;
        private readonly int[] _inputSet;
        private readonly bool[][] _parentPointers;
        private readonly int _inputSetSum;
        private readonly int[][] _memo;

        public Partitioner(int[] inputSet)
        {
            _inputSet = inputSet;
            _inputSetSum = _inputSet.Sum();
            _memo = new int[_inputSetSum][];
            _parentPointers = new bool[_inputSetSum][];
            for (var i = 0; i < _inputSetSum; i++)
            {
                _memo[i] = new int[_inputSet.Length];
                _parentPointers[i] = new bool[_inputSet.Length];
                for (var j = 0; j < _inputSet.Length; j++)
                {
                    _memo[i][j] = Marker;
                }
            }
        }

        public void Solve(out int[] set1, out int[] set2)
        {
            DP(0, 0);
            var chosenSet = new List<int>();
            var skippedSet = new List<int>();
            var sum = 0;
            var i = 0;
            while (i != _inputSet.Length)
            {
                var choose = _parentPointers[sum][i];
                if (choose)
                {
                    chosenSet.Add(_inputSet[i]);
                    sum += _inputSet[i];
                }
                else
                {
                    skippedSet.Add(_inputSet[i]);
                }
                i++;
            }
            set1 = chosenSet.ToArray();
            set2 = skippedSet.ToArray();
        }

        private int DP(int chosenSetSum, int i)
        {
            if (i == _inputSet.Length)
                return Math.Abs(_inputSetSum - 2 * chosenSetSum);

            if (_memo[chosenSetSum][i] != Marker)
                return _memo[chosenSetSum][i];

            var chooseCurrentNumberScore = DP(chosenSetSum + _inputSet[i], i + 1);
            var skipCurrentNumberScore = DP(chosenSetSum, i + 1);
            var minDiff = skipCurrentNumberScore;
            if (chooseCurrentNumberScore < skipCurrentNumberScore)
            {
                minDiff = chooseCurrentNumberScore;
                _parentPointers[chosenSetSum][i] = true;
            }

            _memo[chosenSetSum][i] = minDiff;
            return minDiff;
        }
    }
}
It's working!

Tuesday, October 25, 2016

Dynamic Programming in 5 easy steps - Examples - Text Justification

Lately I spent a relatively great amount of time revising Algorithms and Data Structures. It's a broad subject, but for me personally I found the Dynamic Programming as the abandoned and unused method. It's also considered as one of the hardest methods to master, with few examples on the internet. Let's contribute a little with this post series. Today I will cover the first problem - text justification. The problem is from MIT lectures and I highly recommend well explained videos along with notes from course 6.006: Introduction to algorithms - fall 2011.

Text justification

Given a text, split it into multiple lines so it is “nicely” distributed between each line.
What does it mean? Take a look at the example:
We can clearly see that eager approach won't produce good results in all cases.
Mathematically, given a badness(i, j) function for line of words[i : j], we seek for such a splitting that the sum of all badness is minimum.
Let's assume the badness function is predefined as follows:
∞ if total length > page width, else (page width − total length)3
So in the example above, the bad splitting would have:
  • page width = 16
  • first line length = 4+4+4+2 = 16, badness of the first line = 0
  • second line length = 4, badness = (16-4)3=1728
  • third line length = 16, badness = 0
  • badness sum = 1728
The good splitting would have:
  • first line length = 4+4+1 = 9, badness of the first line = (16-9)3=343
  • second line length = 9, badness = (16-9)3=343
  • third line length = 16, badness = 8
  • badness sum = 694
The second splitting has smaller total badness sum, thus it's a better splitting.

Dynamic Programming in 5 easy steps

On the MIT course, there is presented an approach to DP in 5 easy steps. Those are:
  1. define subproblems, count number of subproblems
  2. guess (part of solution), count number of choices
  3. relate subproblem solutions, compute time per subproblem
  4. recurse + memoize OR build DP table bottom-up check subproblems acyclic/topological order, time = time per subproblem · number of subproblems
  5. solve original problem: = a subproblem OR by combining subproblem solutions ⇒ extra time 
Let's try to apply them on our problem.
Well, the steps don't have to be executed always in order. Let's try to figure out the second step. We know where our line begins, it's on ith position. Where is the end of the line? We don't know. So let's check all possibilities, that's from ith onward and take the smallest badness. So now we know what's our subproblems in first step are - suffixes of word[i:]. That's pretty much the algorithm. The third step will look like:
DP(i)=for j in i+1...n+1 take
min from (DP(j) + badness(i, j))
The great thing is about DP that DP(j) considered to be free (because we compute the subproblem once and memorize the answer). Therefore the total time needed by this algorithm is:
time = time per subproblem · number of subproblems, so
time = O(n) · O(n)
time = O(n^2)
The fourth point is about proving that this really works and the time needed is finite. We compute the answer from the end of the words, down to word number 0. The recursive DP function always accesses greater indexes, therefore time needed is finite. Lastly, we have to find our answer. Here, obviously it's DP[0]. Easy? Yup, let's code it!

The code

using System;
using System.Collections.Generic;
using System.Linq;

namespace Algos
{
    public class DP_TextJustification
    {
        static void Main(String[] args)
        {
            //please note: this is not a production-ready solution!
            PrintSolution("   ", 10);
            PrintSolution("oneword", 10);
            PrintSolution("one line", 10);
            PrintSolution("two lines", 6);
            PrintSolution("blah blah blah blah reallylongword", 14);
            PrintSolution("blah blah blah blah reallylongword 1", 14);
            var txt1 = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " +
                       "Praesent varius urna eget lacinia pharetra. " +
                       "Vivamus lacinia in dui vitae sodales. " +
                       "Aliquam auctor justo nec condimentum pretium. " +
                       "Curabitur egestas tellus dolor, vel tempus lacus blandit vitae. " +
                       "Cras dapibus scelerisque ex nec gravida.";
            PrintSolution(txt1, 60);
            PrintSolution(txt1, 70);
            PrintSolution(txt1, 80);
            PrintSolution(txt1, 90);
            PrintSolution(txt1, 100);
            Console.ReadLine();
        }

        public static void PrintSolution(string txt, int width)
        {
            Console.WriteLine($"Test length: {txt.Length}, line width: {width}");
            foreach (var line in new Justifier(txt, width).Solve())
            {
                Console.WriteLine("{0,-" + width + "}|", line);
            }
            Console.WriteLine();
        }

        private class Justifier
        {
            private const int Marker = -1;
            private readonly int _width;
            private readonly string[] _words;
            private readonly int[] _parentPointers;
            private readonly int[] _memo;
             
            public Justifier(string text, int width)
            {
                _width = width;
                _words = text.Split(new[] {" "}, StringSplitOptions.RemoveEmptyEntries);
                _parentPointers = new int[_words.Length];
                _memo = new int[_words.Length];
                for (var i = 0; i < _memo.Length; i++)
                {
                    _memo[i] = Marker;
                }
            }

            public List<string> Solve()
            {
                DP(0);

                var result = new List<string>();
                var from = 0;
                while (from != _words.Length)
                {
                    var next = _parentPointers[from];
                    result.Add(string.Join(" ", _words.Skip(from).Take(next - from)));
                    from = next;
                }
                return result;
            }

            private int DP(int i)
            {
                if (i == _words.Length) //no words in line is the end of justification
                    return 0;

                if (_memo[i] != Marker) //if we have solution calculated, return it
                    return _memo[i];

                var minBadness = int.MaxValue;

                for (var j = i + 1; j <= _words.Length; j++)
                {
                    var badness = Badness(i, j) + DP(j);
                    if (minBadness > badness)
                    {
                        _parentPointers[i] = j; //remember best choice
                        minBadness = badness;
                    }
                }

                _memo[i] = minBadness; //remember solution
                return minBadness;
            }

            private int Badness(int i, int j)
            {
                var lengthOfWords = _words.Skip(i).Take(j - i).Sum(s => s.Length);
                var numberOfSpaces = j - i - 1;
                if (lengthOfWords + numberOfSpaces > _width)
                    return 10*1000*1000;
                return (int)Math.Pow(_width - lengthOfWords - numberOfSpaces, 3);
            }
        }
    }
}
To the solution in 5 easy steps I've added parent pointers - a simple way to remember the best solution from DP. The result is as follows:
If we don't care about justifying the last line, we can easily modify the badness function to return no cost for the last line:
private int Badness(int i, int j)
{
    var lengthOfWords = _words.Skip(i).Take(j - i).Sum(s => s.Length);
    var numberOfSpaces = j - i - 1;
    if (lengthOfWords + numberOfSpaces > _width)
        return 10*1000*1000;
    if (j == _words.Length) //don't care about last line
        return 0;
    return (int)Math.Pow(_width - lengthOfWords - numberOfSpaces, 3);
}
In this case, the result will be much different in some cases:

Tuesday, September 6, 2016

Cassandra Datastax C# Driver problems - solution

In the previous post I've described a strange problem related to Cassandra Datastax C# Driver which was happening once in the production environment. It's time to reveal the mystery.

Two root causes

1.

One of the hidden, but important metric, which you won't find usually in your logs is the CPU usage. What important is that the connection setup to the Cassandra cluster consists of many small steps. In production, when there was a very high CPU usage (around 100% - for reason known and already eliminated), the connection setup process was timing out in such a moment, that the final result was reported as NoHostAvailableException. This shows, how important is to track and prevent 100% CPU usage.

2.

But why, let me quote myself:
Things get back to normal after the client restart... and gets back to madness few hours later, at higher load. Incredible high number of NoHostAvailableExceptions, like almost any connection to the Cassandra fails.
The problem is here:
private readonly Cluster _cluster;
private readonly ConcurrentDictionary<string, Lazy<ISession>> _sessions; //lockless session cache

public ISession GetSession(string keyspaceName)
{
    if (!_sessions.ContainsKey(keyspaceName))
    {
        _sessions.GetOrAdd(keyspaceName, key => new Lazy<ISession>(() => _cluster.Connect(key)));
    }
    var result = _sessions[keyspaceName];
    return result.Value;
}
Do you see it? It turns out that when an exception is thrown in _cluster.Connect(key) method, the Lazy<T> will cache this exception. Therefore all invocations to Lazy<T>.Value will result in the same, cached exception instead of retrying the connection to the Cassandra cluster. If you are planning to use the Lazy<T> class, there are more "gotchas". Read the documentation on MSDN.

Lessons learned?


  1. CPU usage is very, very important and critical metric. Do not ignore it, as it may lead to numerous, strange errors.
  2. RTFM! Read the documentation when using any class for the first time. Especially when copying&pasting code from the StackOverflow.

Tuesday, August 30, 2016

Cassandra Datastax C# Driver problems - NoHostAvailableException

This post will be about my journey with fixing nasty Cassandra Datastax C# driver problem, which took me a lot more time than expected...
Credits: wikimedia

Once upon a time, I've been fixing following exception:
Cassandra.NoHostAvailableException: None of the hosts tried for query are available (tried: x.x.x.x:9042, x.x.x.x:9042, x.x.x.x:9042)
   at Cassandra.ControlConnection.Connect(Boolean firstTime)
   at Cassandra.Cluster.Connect(String keyspace)
   at Company.Code.CassandraSessionCache.GetSession(String keyspaceName)
   at Company...
The CassandraSessionCache looked like this:
public class CassandraSessionCache
{
    private readonly Cluster _cluster;
    private readonly ConcurrentDictionary<string, Lazy<ISession>> _sessions; //lockless session cache

    public CassandraSessionCache(Cluster cluster)
    {
        _cluster = cluster;
        _sessions = new ConcurrentDictionary<string, Lazy<ISession>>();
    }

    public ISession GetSession(string keyspaceName)
    {
        if (!_sessions.ContainsKey(keyspaceName))
        {
   _sessions.GetOrAdd(keyspaceName, key => new Lazy<ISession>(() => _cluster.Connect(key)));
        }
        var result = _sessions[keyspaceName];
        return result.Value;
    }
}
Nothing fancy, however let me give you an insight about the architecture and circumstances of the error:
  • Cassandra cluster is in Amazon
  • The client is Cassandra Datastax C# Driver 2.6.0, also on server in Amazon
  • Both the client and the Cassandra cluster is the same Amazon Region
  • Amazon Region had no availability issues during given period
  • The solution was working fine for over 1 month! The client process is being restarted ~every week for various reasons
  • The client follows Cassandra Datastax C# Driver Best Practices
  • Heartbeat is turned on, so the connection should be alive, all the times
  • Things get back to normal after the client restart... and gets back to madness few hours later, at higher load. Incredible high number of NoHostAvailableExceptions, like almost any connection to the Cassandra fails.
  • Of course, it works on my machine®

What didn't happen?

There are plenty of questions about Cassandra.NoHostAvailableException on StackOverflow. So let's get through some of them and exclude them:
  • [1][2] - no, because following C# Driver best practices excludes this
  • [3] - no, because we are using default retry strategy from driver version 2.6.0
  • [4][5][6] - no, because we are able to connect to the Cluster at the beginning
  • [7] - no, because we are not misusing batches

Debugging...

Logs on server revealed that the client closed the connection:
INFO  [SharedPool-Worker-3] yyyy-mm-dd 11:04:45,625 Message.java:605 - Unexpected exception during request; channel = [id: 0x9eaf52c5, /x.x.x.x:y :> /x.x.x.x:9042]
java.io.IOException: Error while read(...): Connection reset by peer
  at io.netty.channel.epoll.Native.readAddress(Native Method) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
  at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.doReadBytes(EpollSocketChannel.java:675) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
  at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.epollInReady(EpollSocketChannel.java:714) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
  at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.epollRdHupReady(EpollSocketChannel.java:689) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
 at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:329) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
 at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:264) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
 at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
 at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
 at java.lang.Thread.run(Thread.java:745) [na:1.7.0_80]
While the message on the client says that there are no hosts available, what is confirmed by debug logs on the client side. Pretty interesting, huh? Being confused, I've decided to give an update from 2.6.3 to 2.7 a try... but that didn't help.
Accoring to yet another issue regarding NoHostsAvailableException on StackOverflow I've started to log whole exception, with serialized errors property. This is what I've logged:
System.Exception: NoHostAvailableException happened. Errors: {
  "x.x.x.x:9042": {
    "NativeErrorCode": 10060,
    "ClassName": "System.Net.Sockets.SocketException",
    "Message": "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond",
    "Data": null,
    "InnerException": null,
    "HelpURL": null,
    "StackTraceString": "   
  at Cassandra.Connection.<Open>b__9(Task`1 t)
  at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
  at System.Threading.Tasks.Task.Execute()",
    "RemoteStackTraceString": null,
    "RemoteStackIndex": 0,
    "ExceptionMethod": "8\n<Open>b__9\nCassandra, Version=2.7.0.0, Culture=neutral, PublicKeyToken=10b231fbfc8c4b4d\nCassandra.Connection\nCassandra.AbstractResponse <Open>b__9(System.Threading.Tasks.Task`1[Cassandra.AbstractResponse])",
    "HResult": -2147467259,
    "Source": "Cassandra",
    "WatsonBuckets": null
  },
  "x.x.x.x:9042": {
    "NativeErrorCode": 10060,
    "ClassName": "System.Net.Sockets.SocketException",
    "Message": "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond",
    "Data": null,
    "InnerException": null,
    "HelpURL": null,
    "StackTraceString": "
  at Cassandra.Connection.<Open>b__9(Task`1 t)
  at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
  at System.Threading.Tasks.Task.Execute()",
    "RemoteStackTraceString": null,
    "RemoteStackIndex": 0,
    "ExceptionMethod": "8\n<Open>b__9\nCassandra, Version=2.7.0.0, Culture=neutral, PublicKeyToken=10b231fbfc8c4b4d\nCassandra.Connection\nCassandra.AbstractResponse <Open>b__9(System.Threading.Tasks.Task`1[Cassandra.AbstractResponse])",
    "HResult": -2147467259,
    "Source": "Cassandra",
    "WatsonBuckets": null
  }
}
Unfortunately, no interesting data is here.

So what could possibly go wrong?

Can you spot the error? I couldn't. Any guesses? Find the answer in next post.

Thursday, August 18, 2016

Algorithms and data structures - non-academic trees

Credits: Wikipedia Tree (data structure)
There are many types of trees which are covered on Computer Science lectures. Those usually include: Binary Search Tree, AVL Tree, B Tree, Splay Tree, Red-Black Tree, Trie Trees, Heap Trees.

Those are indeed very useful and practical trees with lots of applications. However, I've discovered few other trees while brushing up my knowledge about algorithms and data structures. Here they are, the most interesting, yet not so popular trees:
  • BK-Tree - do you want to find misspellings of a word in a dictionary? E.g. given word "dog" and dictionary { "cat", "fog", "dot", "cookie" }, naive approach is to compare the word "dog" to all of the entries in the dictionary. This leads to O(n) time. It can be solved in O(lg n) time, though. Burkhard-Keller tree is used in Apache Lucene, for example. Head to Xenopax's Blog for awesome post about BK-Trees.
  • Merkle Tree - probably you didn't know but that's the name of the tree of commits and blobs in a Git VCS. Another applications known to me personally include: Cassandra (during node repair) and Bitcoin blockchain.
  • Interval Tree - interesting idea of augmenting "normal" (single value) trees with additional data in order to solve windowing queries.
  • Lemon Tree - the most complicated type of tree. Many wondered what it really is, but few actually knew... Find the official statement here.

Tuesday, July 26, 2016

The performance of setting T[] vs. List by index

Source: wikimedia.org

Let's compare asymptotic time complexity of two following loops:
int[] _array = Enumerable.Range(0, n).ToArray();
List<int> _list = Enumerable.Range(0, n).ToList();

//ListChangeByIndex
for (var i = 0; i < n; i++)
{
    _list[i] = i + 1;
}

//ArrayChangeByIndex
for (var i = 0; i < n; i++)
{
    _array[i] = i + 1;
}
How do you think, which one is faster?
.
.
.
.
.
.
.
.
Many developers think the fist one will be slower, because in each loop computer is forced to visit all nodes from 0 to i to finally set the variable.
However, that's not the case. Both loops have O(n) complexity. That's because in the .NET source we can clearly see that's underlying data structure for a List<T> is an array: list.cs. Therefore, those two loops are essentially equal.