Tuesday, September 6, 2016

Cassandra Datastax C# Driver problems - solution

Please find the updated version of this post here: https://piotr.westfalewicz.com/blog/2016/09/cassandra-datastax-c-sharp-driver-problems---solution/


In the previous post I described a strange problem with the Cassandra Datastax C# Driver that once occurred in the production environment. It's time to solve the mystery.

Two root causes

1.

One hidden but important metric that you usually won't find in your logs is CPU usage. What matters here is that setting up a connection to the Cassandra cluster consists of many small steps. In production, CPU usage was very high (around 100%, for a reason that is known and already eliminated), and the connection setup timed out at such a point that the final result was reported as a NoHostAvailableException. This shows how important it is to track and prevent 100% CPU usage.
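Since CPU usage usually doesn't show up in application logs on its own, one option is to sample it yourself and write it next to your other log entries. The following is only a minimal sketch: `CpuSampler` and `SampleProcessCpuPercent` are hypothetical names, and it measures the current process only, not the whole machine, so it would miss load caused by other processes.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper, not part of any library.
public static class CpuSampler
{
    // Returns the CPU time this process consumed during the sampling window,
    // as a percentage of the capacity of all logical cores (roughly 0-100).
    public static double SampleProcessCpuPercent(TimeSpan window)
    {
        var process = Process.GetCurrentProcess();
        var cpuAtStart = process.TotalProcessorTime;
        var stopwatch = Stopwatch.StartNew();

        Thread.Sleep(window);

        process.Refresh(); // discard any cached process information
        var cpuUsed = process.TotalProcessorTime - cpuAtStart;
        var elapsed = stopwatch.Elapsed;

        return cpuUsed.TotalMilliseconds
               / (elapsed.TotalMilliseconds * Environment.ProcessorCount)
               * 100.0;
    }
}
```

Calling `CpuSampler.SampleProcessCpuPercent(TimeSpan.FromSeconds(1))` from a background timer and logging the result would at least make sustained spikes visible next to the exceptions they cause.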

2.

But why did the problem keep coming back? Let me quote myself:
"Things get back to normal after the client restart... and gets back to madness few hours later, at higher load. Incredible high number of NoHostAvailableExceptions, like almost any connection to the Cassandra fails."
The problem is here:
private readonly Cluster _cluster;
private readonly ConcurrentDictionary<string, Lazy<ISession>> _sessions; //lockless session cache

public ISession GetSession(string keyspaceName)
{
    if (!_sessions.ContainsKey(keyspaceName))
    {
        _sessions.GetOrAdd(keyspaceName, key => new Lazy<ISession>(() => _cluster.Connect(key)));
    }
    var result = _sessions[keyspaceName];
    return result.Value;
}
Do you see it? It turns out that when the _cluster.Connect(key) method throws an exception, the Lazy<T> caches that exception. This is how Lazy<T> behaves in its default thread-safety mode when it is given a factory delegate. Every later read of Lazy<T>.Value therefore rethrows the same cached exception instead of retrying the connection to the Cassandra cluster, and because the Lazy<T> instance stays in the dictionary, only a client restart clears it. If you are planning to use the Lazy<T> class, there are more "gotchas". Read the documentation on MSDN.
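To make the exception caching concrete, here is a minimal sketch that runs without a Cassandra cluster. `RetryingLazyCache` and `LazyCachingDemo` are hypothetical names introduced for illustration, and the string "sessions" stand in for ISession. This is one possible workaround (evict the faulted Lazy<T> so the next call retries), not necessarily the fix that went to production.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Hypothetical helper (not part of the Datastax driver): a keyed cache of
// lazily created values that forgets an entry whose factory threw.
public sealed class RetryingLazyCache<TKey, TValue>
{
    private readonly ConcurrentDictionary<TKey, Lazy<TValue>> _entries =
        new ConcurrentDictionary<TKey, Lazy<TValue>>();
    private readonly Func<TKey, TValue> _factory;

    public RetryingLazyCache(Func<TKey, TValue> factory)
    {
        if (factory == null) throw new ArgumentNullException("factory");
        _factory = factory;
    }

    public TValue Get(TKey key)
    {
        // ExecutionAndPublication: the factory runs at most once per Lazy
        // instance, and an exception it throws is cached by that instance.
        var lazy = _entries.GetOrAdd(key, k => new Lazy<TValue>(
            () => _factory(k), LazyThreadSafetyMode.ExecutionAndPublication));
        try
        {
            return lazy.Value;
        }
        catch
        {
            // Remove only this faulted instance (key and value must both match),
            // so a fresh Lazy installed by another thread is left alone.
            ((ICollection<KeyValuePair<TKey, Lazy<TValue>>>)_entries)
                .Remove(new KeyValuePair<TKey, Lazy<TValue>>(key, lazy));
            throw;
        }
    }
}

public static class LazyCachingDemo
{
    // Counts how often the factory runs across two reads of a plain Lazy<T>.
    public static int FactoryRunsWithPlainLazy()
    {
        int runs = 0;
        var lazy = new Lazy<string>(() =>
        {
            runs++;
            throw new InvalidOperationException("simulated connection failure");
        });
        for (int i = 0; i < 2; i++)
        {
            try { Console.WriteLine(lazy.Value); }
            catch (InvalidOperationException) { /* second read rethrows the cached exception */ }
        }
        return runs; // the factory is not retried on the second read
    }

    // The first Get fails and evicts the entry; the second Get retries.
    public static string SecondReadWithRetryingCache()
    {
        int runs = 0;
        var cache = new RetryingLazyCache<string, string>(keyspace =>
        {
            if (++runs == 1) throw new InvalidOperationException("simulated connection failure");
            return "session:" + keyspace;
        });
        try { cache.Get("ks"); } catch (InvalidOperationException) { }
        return cache.Get("ks");
    }
}
```

In the code from this post, the equivalent would be something like `new RetryingLazyCache<string, ISession>(keyspace => _cluster.Connect(keyspace))`. Another option is to construct the Lazy<T> with LazyThreadSafetyMode.PublicationOnly, which never caches exceptions, but then several threads may run the factory at the same time and open more than one session for the same keyspace.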

Lessons learned?


  1. CPU usage is a very important, critical metric. Do not ignore it, as very high CPU usage may lead to numerous strange errors.
  2. RTFM! Read the documentation when using any class for the first time, especially when copying and pasting code from Stack Overflow.
