[__DynamicallyInvokable]

Tuesday, September 6, 2016

Cassandra Datastax C# Driver problems - solution

Please find the updated version of this post here: https://piotr.westfalewicz.com/blog/2016/09/cassandra-datastax-c-sharp-driver-problems---solution/

In the previous post I described a strange problem related to the Cassandra Datastax C# Driver, which occurred in the production environment. It's time to reveal the mystery.

Two root causes

1.

One hidden but important metric, which you usually won't find in your logs, is CPU usage. What matters here is that the connection setup to the Cassandra cluster consists of many small steps. In production, when CPU usage was very high (around 100%, for a reason that is known and already eliminated), the connection setup was timing out at such a moment that the final result was reported as a NoHostAvailableException. This shows how important it is to track and prevent 100% CPU usage.

2.

But why, let me quote myself:

Things get back to normal after the client restart... and get back to madness a few hours later, at higher load. An incredibly high number of NoHostAvailableExceptions, as if almost every connection to Cassandra fails.

The problem is here:

private readonly Cluster _cluster;
private readonly ConcurrentDictionary<string, Lazy<ISession>> _sessions; //lockless session cache

public ISession GetSession(string keyspaceName)
{
    if (!_sessions.ContainsKey(keyspaceName))
    {
        _sessions.GetOrAdd(keyspaceName, key => new Lazy<ISession>(() => _cluster.Connect(key)));
    }
    var result = _sessions[keyspaceName];
    return result.Value;
}

Do you see it? It turns out that when an exception is thrown in the _cluster.Connect(key) method, the Lazy<T> (in its default thread-safety mode, as used here) will cache this exception. Therefore every later read of Lazy<T>.Value rethrows the same cached exception instead of retrying the connection to the Cassandra cluster. If you are planning to use the Lazy<T> class, there are more "gotchas". Read the documentation on MSDN.
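To make the caching behaviour concrete, here is a minimal sketch that does not need a Cassandra cluster. The class name LazyExceptionDemo, the method CountFactoryCalls and the simulated failure are made up for illustration; only Lazy<T> and LazyThreadSafetyMode come from the framework. The factory fails on its first run and would succeed on the second, and the method counts how often the factory is actually executed across three reads of Value.

```csharp
using System;
using System.Threading;

public static class LazyExceptionDemo
{
    // Reads Lazy<T>.Value three times and returns how many times the factory ran.
    // The factory throws on its first run only (a stand-in for a failed Connect).
    public static int CountFactoryCalls(LazyThreadSafetyMode mode)
    {
        var calls = 0;
        var lazy = new Lazy<string>(() =>
        {
            calls++;
            if (calls == 1)
            {
                throw new InvalidOperationException("simulated connection failure");
            }
            return "connected";
        }, mode);

        for (var attempt = 0; attempt < 3; attempt++)
        {
            try
            {
                _ = lazy.Value;
            }
            catch (InvalidOperationException)
            {
                // Swallowed on purpose: we only care about the number of factory runs.
            }
        }

        return calls;
    }
}
```

With LazyThreadSafetyMode.ExecutionAndPublication (the default for the constructor used in the cache above) the method returns 1: the factory is never retried. With LazyThreadSafetyMode.PublicationOnly it returns 2, because exceptions are not cached in that mode. The trade-off is that several threads may then run the factory concurrently.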

Lessons learned?

CPU usage is a very important, critical metric. Do not ignore it, as it may lead to numerous strange errors.
~~RTFM!~~ Read the documentation when using any class for the first time, especially when copying and pasting code from StackOverflow.
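One possible way to keep the lockless cache while still retrying after a failure is to evict the faulted Lazy<T> from the dictionary. The sketch below is only an illustration under that assumption, not the fix that was actually deployed. RetryingLazyCache is a hypothetical name, and the factory delegate stands in for _cluster.Connect(key).

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public sealed class RetryingLazyCache<TKey, TValue>
{
    private readonly ConcurrentDictionary<TKey, Lazy<TValue>> _items =
        new ConcurrentDictionary<TKey, Lazy<TValue>>();
    private readonly Func<TKey, TValue> _factory;

    public RetryingLazyCache(Func<TKey, TValue> factory)
    {
        _factory = factory;
    }

    public TValue Get(TKey key)
    {
        // GetOrAdd already covers the "is it there?" check, so no ContainsKey is needed.
        var lazy = _items.GetOrAdd(key, k => new Lazy<TValue>(() => _factory(k)));
        try
        {
            return lazy.Value;
        }
        catch
        {
            // Remove exactly this faulted entry (key AND value must match),
            // so a newer entry added by another thread is left alone.
            ICollection<KeyValuePair<TKey, Lazy<TValue>>> entries = _items;
            entries.Remove(new KeyValuePair<TKey, Lazy<TValue>>(key, lazy));
            throw;
        }
    }
}
```

The caller still sees the original exception, but the next call to Get creates a fresh Lazy<T> and runs the factory again.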

Tuesday, August 30, 2016

Cassandra Datastax C# Driver problems - NoHostAvailableException

Please find the updated version of this post here: https://piotr.westfalewicz.com/blog/2016/08/cassandra-datastax-c-sharp-driver-problems---nohostavailableexception/

This post is about my journey fixing a nasty Cassandra Datastax C# driver problem, which took me a lot more time than expected...

Credits: wikimedia

Once upon a time, I was fixing the following exception:

Cassandra.NoHostAvailableException: None of the hosts tried for query are available (tried: x.x.x.x:9042, x.x.x.x:9042, x.x.x.x:9042)
   at Cassandra.ControlConnection.Connect(Boolean firstTime)
   at Cassandra.Cluster.Connect(String keyspace)
   at Company.Code.CassandraSessionCache.GetSession(String keyspaceName)
   at Company...

The CassandraSessionCache looked like this:

public class CassandraSessionCache
{
    private readonly Cluster _cluster;
    private readonly ConcurrentDictionary<string, Lazy<ISession>> _sessions; //lockless session cache

    public CassandraSessionCache(Cluster cluster)
    {
        _cluster = cluster;
        _sessions = new ConcurrentDictionary<string, Lazy<ISession>>();
    }

    public ISession GetSession(string keyspaceName)
    {
        if (!_sessions.ContainsKey(keyspaceName))
        {
            _sessions.GetOrAdd(keyspaceName, key => new Lazy<ISession>(() => _cluster.Connect(key)));
        }
        var result = _sessions[keyspaceName];
        return result.Value;
    }
}

Nothing fancy. However, let me give you some insight into the architecture and the circumstances of the error:

The Cassandra cluster is in Amazon
The client is Cassandra Datastax C# Driver 2.6.0, also on a server in Amazon
Both the client and the Cassandra cluster are in the same Amazon Region
The Amazon Region had no availability issues during the given period
The solution was working fine for over 1 month! The client process is restarted ~every week for various reasons
The client follows the Cassandra Datastax C# Driver Best Practices
Heartbeat is turned on, so the connection should be alive at all times
Things get back to normal after the client restart... and get back to madness a few hours later, at higher load. An incredibly high number of NoHostAvailableExceptions, as if almost every connection to Cassandra fails.
Of course, it works on my machine®

What didn't happen?

There are plenty of questions about Cassandra.NoHostAvailableException on StackOverflow. Let's go through some of them and rule them out:

[1], [2] - no, because following the C# Driver best practices rules this out
[3] - no, because we are using the default retry strategy from driver version 2.6.0
[4], [5], [6] - no, because we are able to connect to the cluster at the beginning
[7] - no, because we are not misusing batches

Debugging...

Logs on the server revealed that the client closed the connection:

INFO  [SharedPool-Worker-3] yyyy-mm-dd 11:04:45,625 Message.java:605 - Unexpected exception during request; channel = [id: 0x9eaf52c5, /x.x.x.x:y :> /x.x.x.x:9042]
java.io.IOException: Error while read(...): Connection reset by peer
  at io.netty.channel.epoll.Native.readAddress(Native Method) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
  at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.doReadBytes(EpollSocketChannel.java:675) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
  at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.epollInReady(EpollSocketChannel.java:714) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
  at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.epollRdHupReady(EpollSocketChannel.java:689) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
  at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:329) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
  at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:264) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
  at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
  at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
  at java.lang.Thread.run(Thread.java:745) [na:1.7.0_80]

Meanwhile, the message on the client says that there are no hosts available, which is confirmed by the debug logs on the client side. Pretty interesting, huh? Being confused, I decided to try an update from 2.6.3 to 2.7... but that didn't help.
According to yet another question about NoHostAvailableException on StackOverflow, I started to log the whole exception, including the serialized Errors property. This is what I logged:

System.Exception: NoHostAvailableException happened. Errors: {
  "x.x.x.x:9042": {
    "NativeErrorCode": 10060,
    "ClassName": "System.Net.Sockets.SocketException",
    "Message": "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond",
    "Data": null,
    "InnerException": null,
    "HelpURL": null,
    "StackTraceString": "   
  at Cassandra.Connection.<Open>b__9(Task`1 t)
  at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
  at System.Threading.Tasks.Task.Execute()",
    "RemoteStackTraceString": null,
    "RemoteStackIndex": 0,
    "ExceptionMethod": "8\n<Open>b__9\nCassandra, Version=2.7.0.0, Culture=neutral, PublicKeyToken=10b231fbfc8c4b4d\nCassandra.Connection\nCassandra.AbstractResponse <Open>b__9(System.Threading.Tasks.Task`1[Cassandra.AbstractResponse])",
    "HResult": -2147467259,
    "Source": "Cassandra",
    "WatsonBuckets": null
  },
  "x.x.x.x:9042": {
    "NativeErrorCode": 10060,
    "ClassName": "System.Net.Sockets.SocketException",
    "Message": "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond",
    "Data": null,
    "InnerException": null,
    "HelpURL": null,
    "StackTraceString": "
  at Cassandra.Connection.<Open>b__9(Task`1 t)
  at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
  at System.Threading.Tasks.Task.Execute()",
    "RemoteStackTraceString": null,
    "RemoteStackIndex": 0,
    "ExceptionMethod": "8\n<Open>b__9\nCassandra, Version=2.7.0.0, Culture=neutral, PublicKeyToken=10b231fbfc8c4b4d\nCassandra.Connection\nCassandra.AbstractResponse <Open>b__9(System.Threading.Tasks.Task`1[Cassandra.AbstractResponse])",
    "HResult": -2147467259,
    "Source": "Cassandra",
    "WatsonBuckets": null
  }
}

Unfortunately, there is no interesting data here.
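For reference, the per-host logging step can be sketched roughly as below. This is a simplified illustration rather than the code that produced the log above: HostErrorFormatter and Describe are hypothetical names, and it assumes the per-host errors are available as a dictionary from endpoint to exception (to my knowledge, that is the shape of NoHostAvailableException.Errors in the driver). It prints one short line per host instead of the full serialized exception.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Sockets;
using System.Text;

public static class HostErrorFormatter
{
    // Builds one line per host: endpoint, exception type, message and,
    // for socket failures, the SocketError code (e.g. TimedOut).
    public static string Describe(IDictionary<IPEndPoint, Exception> errors)
    {
        var text = new StringBuilder();
        foreach (var pair in errors)
        {
            text.Append(pair.Key)
                .Append(": ")
                .Append(pair.Value.GetType().FullName)
                .Append(" - ")
                .Append(pair.Value.Message);

            var socketError = pair.Value as SocketException;
            if (socketError != null)
            {
                text.Append(" (").Append(socketError.SocketErrorCode).Append(")");
            }

            text.AppendLine();
        }
        return text.ToString();
    }
}
```

In a catch block for NoHostAvailableException, one would pass the exception's Errors dictionary to Describe and write the result to the log.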

So what could possibly go wrong?

Can you spot the error? I couldn't. Any guesses? Find the answer in the next post.

Thursday, August 18, 2016

Algorithms and data structures - non-academic trees

Please find the updated version of this post here: https://piotr.westfalewicz.com/blog/2016/08/algorithms-and-data-structures---non-academic-trees/

Credits: Wikipedia Tree (data structure)

There are many types of trees covered in Computer Science lectures. Those usually include: Binary Search Tree, AVL Tree, B Tree, Splay Tree, Red-Black Tree, Trie Trees, Heap Trees.

Those are indeed very useful and practical trees with lots of applications. However, I've discovered a few other trees while brushing up on my knowledge of algorithms and data structures. Here they are, the most interesting, yet not so popular trees:

BK-Tree - do you want to find misspellings of a word in a dictionary? E.g. given the word "dog" and the dictionary { "cat", "fog", "dot", "cookie" }, the naive approach is to compare the word "dog" to all of the entries in the dictionary. This leads to O(n) time. It can be solved in O(lg n) time, though. The Burkhard-Keller tree is used in Apache Lucene, for example. Head to Xenopax's Blog for an awesome post about BK-Trees.
Merkle Tree - you probably didn't know it, but that's the name of the tree of commits and blobs in the Git VCS. Other applications known to me personally include Cassandra (during node repair) and the Bitcoin blockchain.
Interval Tree - an interesting idea of augmenting "normal" (single value) trees with additional data in order to solve windowing queries.
Lemon Tree - the most complicated type of tree. Many wondered what it really is, but few actually knew... Find the official statement here.
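To illustrate the BK-Tree idea from the list above, here is a minimal sketch. It assumes Levenshtein edit distance as the metric. BkTree, Add and Search are names chosen for this example and are not taken from any library. Each child edge is labelled with the distance between parent and child, so a search can skip every subtree whose label falls outside [d - tolerance, d + tolerance].

```csharp
using System;
using System.Collections.Generic;

public sealed class BkTree
{
    private sealed class Node
    {
        public readonly string Word;
        // Children keyed by their edit distance to this node's word.
        public readonly Dictionary<int, Node> Children = new Dictionary<int, Node>();
        public Node(string word) { Word = word; }
    }

    private Node _root;

    public void Add(string word)
    {
        if (_root == null) { _root = new Node(word); return; }
        var current = _root;
        while (true)
        {
            var distance = Levenshtein(word, current.Word);
            if (distance == 0) return; // word is already in the tree
            Node child;
            if (!current.Children.TryGetValue(distance, out child))
            {
                current.Children[distance] = new Node(word);
                return;
            }
            current = child;
        }
    }

    // Returns all stored words within `tolerance` edits of `word` (order unspecified).
    public List<string> Search(string word, int tolerance)
    {
        var matches = new List<string>();
        if (_root == null) return matches;
        var pending = new Stack<Node>();
        pending.Push(_root);
        while (pending.Count > 0)
        {
            var node = pending.Pop();
            var distance = Levenshtein(word, node.Word);
            if (distance <= tolerance) matches.Add(node.Word);
            // Triangle inequality: only these subtrees can contain matches.
            foreach (var edge in node.Children)
            {
                if (edge.Key >= distance - tolerance && edge.Key <= distance + tolerance)
                {
                    pending.Push(edge.Value);
                }
            }
        }
        return matches;
    }

    // Classic two-row dynamic programming edit distance.
    public static int Levenshtein(string a, string b)
    {
        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];
        for (var j = 0; j <= b.Length; j++) previous[j] = j;
        for (var i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (var j = 1; j <= b.Length; j++)
            {
                var cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(Math.Min(current[j - 1] + 1, previous[j] + 1), previous[j - 1] + cost);
            }
            var swap = previous; previous = current; current = swap;
        }
        return previous[b.Length];
    }
}
```

After adding "cat", "fog", "dot" and "cookie", searching for "dog" with a tolerance of 1 finds "fog" and "dot", and the "cookie" subtree is never visited.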

Tuesday, July 26, 2016

The performance of setting T[] vs. List by index

Source: wikimedia.org

Let's compare the asymptotic time complexity of the following two loops:

int[] _array = Enumerable.Range(0, n).ToArray();
List<int> _list = Enumerable.Range(0, n).ToList();

//ListChangeByIndex
for (var i = 0; i < n; i++)
{
    _list[i] = i + 1;
}

//ArrayChangeByIndex
for (var i = 0; i < n; i++)
{
    _array[i] = i + 1;
}

What do you think, which one is faster?

Many developers think the first one will be slower, because in each iteration the computer is forced to visit all nodes from 0 to i to finally set the value.
However, that's not the case. Both loops have O(n) complexity. That's because in the .NET source we can clearly see that the underlying data structure of a List<T> is an array: list.cs. Therefore, those two loops are essentially equivalent.
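The idea can be sketched with a stripped-down list. This is a simplified illustration and not the actual list.cs source. ArrayBackedList is a hypothetical name, and version tracking, enumeration and most members are left out. The point is that the indexer is a bounds check plus a plain array access, so setting an element by index is O(1).

```csharp
using System;

public sealed class ArrayBackedList<T>
{
    private T[] _items = new T[4];
    private int _size;

    public int Count
    {
        get { return _size; }
    }

    public void Add(T item)
    {
        // Grow the backing array by doubling when it is full.
        if (_size == _items.Length)
        {
            Array.Resize(ref _items, _items.Length * 2);
        }
        _items[_size++] = item;
    }

    public T this[int index]
    {
        get
        {
            if ((uint)index >= (uint)_size) throw new ArgumentOutOfRangeException(nameof(index));
            return _items[index];
        }
        set
        {
            // No node traversal: one bounds check, then a direct array write.
            if ((uint)index >= (uint)_size) throw new ArgumentOutOfRangeException(nameof(index));
            _items[index] = value;
        }
    }
}
```

The extra bounds check (and, in the real List<T>, a little more bookkeeping) changes the constant factor, not the asymptotic complexity.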

Tuesday, July 12, 2016

Presentation recommendation - Cloud-based Microservices powering BBC iPlayer

Today I recommend the following presentation: Cloud-based Microservices powering BBC iPlayer

Why? It's interesting (at least for me) how one of the most popular British broadcasting organisations makes its channels available online. A high-level architecture is presented. If you are new to AWS, you will also learn a thing or two about the Amazon Cloud.

Thursday, June 30, 2016

Git tips - replace all occurrences of a string in files

Git can be used from Visual Studio, however that's like saying you drive a car when you actually play Need for Speed. Unleash the full power of Git and learn to use it. It's not that hard.

It doesn't matter if you are a beginner or an advanced user, you should know what a git alias is. If you don't know, go here immediately: Git Basics - Git Aliases.

Today, you will get a very useful git alias. It's for replacing all occurrences of one string with another. Suppose you want to replace EntityFramework with NHibernate in your project (which seems to be a pretty reasonable thing to do:) ). Here is the alias:

replaceall = "!f() { git grep -l \"$1\" | xargs sed -b -i \"s/$1/$2/g\"; }; f"

The first part of the alias lists all files containing the first argument and passes them through a pipe and xargs to sed, which performs the replacement. Use it like this:

git replaceall EntityFramework NHibernate

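Please note: it will replace all occurrences of "EntityFramework" with "NHibernate" in every tracked file that contains it. Untracked files are not touched, because git grep searches only tracked files unless you pass --untracked. Also keep in mind that both git grep and sed treat the first argument as a regular expression.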

Monday, June 13, 2016

Messing with C# types. Making type1.FullName==type2.FullName, but not type1==type2!

Please find the updated version of this post here: https://piotr.westfalewicz.com/blog/2016/07/the-performance-of-setting-t-vs.-list-by-index/

Given the following method:

private static void CompareTypes(Type type1, Type type2)
{
    Console.WriteLine($"type1.FullName = {type1.FullName}");
    Console.WriteLine($"type2.FullName = {type2.FullName}");
    Console.WriteLine($"type1.FullName {(type1.FullName == type2.FullName ? '=' : '!')}= type2.Fullname");
    Console.WriteLine($"type1.AssemblyQualifiedName = {type1.AssemblyQualifiedName}");
    Console.WriteLine($"type2.AssemblyQualifiedName = {type2.AssemblyQualifiedName}");
    Console.WriteLine($"type1.AssemblyQualifiedName {(type1.AssemblyQualifiedName == type2.AssemblyQualifiedName ? '=' : '!')}= type2.AssemblyQualifiedName");
    Console.WriteLine($"type1.GUID = {type1.GUID}");
    Console.WriteLine($"type2.GUID = {type2.GUID}");
    Console.WriteLine($"type1.GUID {(type1.GUID == type2.GUID ? '=' : '!')}= type2.GUID");

    Console.WriteLine("o1 = Activator.CreateInstance(type1)");
    Console.WriteLine("o2 = Activator.CreateInstance(type2)");
    var o1 = Activator.CreateInstance(type1);
    var o2 = Activator.CreateInstance(type2);
    Console.WriteLine($"o1 == {o1}");
    Console.WriteLine($"o2 == {o2}");

    Console.WriteLine();
    Console.WriteLine($"but... type1 {(type1 == type2 ? '=' : '!')}= type2");
}

Is it possible to get the following result?

type1.FullName = MyLibrary.MyPrecious
type2.FullName = MyLibrary.MyPrecious
type1.FullName == type2.Fullname
type1.AssemblyQualifiedName = MyLibrary.MyPrecious, MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
type2.AssemblyQualifiedName = MyLibrary.MyPrecious, MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
type1.AssemblyQualifiedName == type2.AssemblyQualifiedName
type1.GUID = cacf8c0d-b903-3da6-808f-024a3070ab9d
type2.GUID = cacf8c0d-b903-3da6-808f-024a3070ab9d
type1.GUID == type2.GUID
o1 = Activator.CreateInstance(type1)
o2 = Activator.CreateInstance(type2)
o1 == MyLibrary.MyPrecious
o2 == MyLibrary.MyPrecious

but... type1 != type2

As it turns out, it is. Creating such a mess is relatively easy:

private static Assembly LoadAssemblyByName(string name)
{
    var myPreciousAssemblyLocation = Path.Combine(Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location), name);
    // File.ReadAllBytes reads the whole file; a single Stream.Read call is not guaranteed to.
    var data = File.ReadAllBytes(myPreciousAssemblyLocation);
    // Loading from a byte array puts the assembly outside the default load context.
    return Assembly.Load(data);
}

static void Main()
{
    var type1 = typeof (MyPrecious);
    var myLibraryAssembly = LoadAssemblyByName("MyLibrary.dll");
    var type2 = myLibraryAssembly.GetType("MyLibrary.MyPrecious", true);

    CompareTypes(type1, type2);
}

The code above compares type1 from the referenced project to type2 from the same assembly, but loaded again through Assembly.Load(byte[]). That makes the library load twice into the AppDomain. Now, when a call to AppDomain.CurrentDomain.GetAssemblies() is made, the assemblies are:

AppDomain.CurrentDomain.GetAssemblies:
mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089
Microsoft.VisualStudio.HostingProcess.Utilities, Version=14.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a
System.Windows.Forms, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089
System, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089
System.Drawing, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a
Microsoft.VisualStudio.HostingProcess.Utilities.Sync, Version=14.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a
Microsoft.VisualStudio.Debugger.Runtime, Version=14.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a
vshost32, Version=14.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a
System.Core, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089
ConsoleApplication1, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
Accessibility, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3

Even in such a small, console application it is quite confusing. So, let's make it more confusing... What's the output of the following code?

var myPreciousAssemblyLocation = Path.Combine(Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location), "MyLibrary.dll");
var myLibraryAssemblyLoadFrom = Assembly.LoadFrom(myPreciousAssemblyLocation);
var type3 = myLibraryAssemblyLoadFrom.GetType("MyLibrary.MyPrecious", true);
CompareTypes(type1, type3);

Now, surprisingly, it's:

type1.FullName = MyLibrary.MyPrecious
type2.FullName = MyLibrary.MyPrecious
type1.FullName == type2.Fullname
type1.AssemblyQualifiedName = MyLibrary.MyPrecious, MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
type2.AssemblyQualifiedName = MyLibrary.MyPrecious, MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
type1.AssemblyQualifiedName == type2.AssemblyQualifiedName
type1.GUID = cacf8c0d-b903-3da6-808f-024a3070ab9d
type2.GUID = cacf8c0d-b903-3da6-808f-024a3070ab9d
type1.GUID == type2.GUID
o1 = Activator.CreateInstance(type1)
o2 = Activator.CreateInstance(type2)
o1 == MyLibrary.MyPrecious
o2 == MyLibrary.MyPrecious

but... type1 == type2

Hint

A nice hint is shown when you try to execute the following code:

var o1 = Activator.CreateInstance(type1);
var o2 = Activator.CreateInstance(type2);
MyPrecious p1 = (MyPrecious) o1;
try
{
    MyPrecious p2 = (MyPrecious)o2;
}
catch (Exception e)
{
    Console.WriteLine(e);
}

System.InvalidCastException: [A]MyLibrary.MyPrecious cannot be cast to [B]MyLibrary.MyPrecious. Type A originates from 'MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null' in the context 'LoadNeither' in a byte array. Type B originates from 'MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null' in the context 'Default' at location 'c:\users\pwdev\documents\visual studio 2015\Projects\ConsoleApplication1\ConsoleApplication1\bin\Debug\MyLibrary.dll'. at ConsoleApplication1.Program.Main() in c:\users\pwdev\documents\visual studio 2015\Projects\ConsoleApplication1\ConsoleApplication1\Program.cs:line 49

Explanation

Yes, it's all about the load contexts. There are three different assembly load contexts: Load (the default one), LoadFrom and Neither. Usually there is no need to load the same library twice and get the strange behavior described above, but sometimes there might be. Each of the Assembly.Load(From/File) methods has its advantages and disadvantages. Take a look: Choosing a Binding Context. Furthermore, consider what happens to assembly dependencies when you load an assembly. There are best practices described on MSDN for loading assemblies: Best Practices for Assembly Loading. I have to say, in my whole career I've loaded assemblies by hand twice and, in hindsight, both cases were wrong.

TypeHandle

Instead of comparing the types in the examples above with the == operator, it is also possible to compare them by their TypeHandle:

TypeHandle encapsulates a pointer to an internal data structure that represents the type. This handle is unique during the process lifetime. The handle is valid only in the application domain in which it was obtained.

Source: MSDN. Well, I can't think of an interesting use for TypeHandles for now, but it's good to know.

Sunday, May 22, 2016

The five stages of coming to terms with Cassandra

From Wikimedia Commons, the free media repository

The five stages of coming to terms with JavaScript are:

Denial: “I won’t need this language.”
Anger: “Why does the web have to be so popular?”
Bargaining: “OK, at least let me compile a reasonable language to JavaScript.”
Depression: “Programming is not for me, I’ll pursue a career in masonry, like I always wanted.”
Acceptance: “I can’t fight it, I may as well prepare for it.”

The same goes for Cassandra - however, IMO, in the opposite order:

Acceptance: “I will use Cassandra. It's... AMAZING! Let me just quote the Apache Cassandra landing page:”
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.
Depression: “Damn, it's so well designed, but it's a complex piece of software and it doesn't work as expected.”
Bargaining: “OK, at least let me try to tune it or report some bugs.”
Anger: “Why is it so popular? Why does it have such good PR?”
Denial: “I won’t use it or recommend it ever again.”

The context

I did the research and checked multiple websites - I read about performance, architecture, hosting, maintenance, TCO, libraries, popularity... and Cassandra seemed to be a good database for time-series log storage, with 95% writes (with SLA) and only 5% reads (without SLA). I chose a prepared Cassandra Datastax virtual disk image on Amazon with bootstrap scripts, made a proof-of-concept solution and read a book or two about Cassandra. All seemed good. However, this is not a post about the good. So ...fast forward...

The bad

Some stories I remember:

The Cassandra cluster is in production (alongside the old, parallel solution for the same purpose). The phone rings at 2 AM. The C* cluster is down. A quick look at the logs: OutOfMemoryError in a random place in the JVM. Outage time: 1h - let me just remind you of the "proven fault-tolerance". After a cluster restart, it works again.
Next day at work, at a random hour, the same thing. Related bug: OutOfMemoryError after few hours from node restart
After a few days... firing a repair - the standard C* operation, which you have to run at least every gc_grace_seconds, by default 10 days. Usually it worked, but then, unexpectedly, the server died, and later again and again. Related issue: "Unknown type 0" Stream failure on Repair. Outage time: stopped counting.
Because of the failing servers in the cluster, I decided to scale it out a little. Unfortunately, the issue above also made the scaling impossible.
After a while, I encountered a second (third?) problem with the repair. Related bug: Repair session exception Validation failed

Fail

Let's get back to the landing page:

The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.

Now, let's look at the dates of the critical JIRA issues:

This means that for around one month at least a few people couldn't scale or repair their Cassandra clusters. I fully understand - it's free, open-source software. However, even if something is free, you expect it to work - that's the harsh reality. If it doesn't work, you just look for something else. No offence, Cassandra Datastax/Apache teams, you are doing truly amazing work; however, in resilient software, stability is the number one requirement.

Maybe it's me? Maybe I'm the only one having problems?

Fortunately (for me) not:

Here is a presentation on how the guys at Adform switched from Cassandra to Aerospike: Big Data Strategy Minsk 2014 - Tadas Pivorius - Married to Cassandra
A friend of mine working at a different company also told me that they used Cassandra and abandoned it.
Just look at the linked issues and the number of watchers.

In all cases the problems were similar to mine.

Thursday, May 5, 2016

Specifying requirements for live notification mechanism for systems integration purposes

Recently I designed a mechanism to notify external systems (with which we cooperate) about changes in our system. This, obviously, can be done in multiple ways. Let's look at some high-level considerations, some questions and how they affect our requirements.

Assumptions

we want to notify other, external systems, owned by someone else
the allowed delay between a change in our system and the notification is around one minute
a change can carry multiple pieces of information, which vary with the type of change
we expose an API which is currently used by those external systems - they fetch the changes periodically
the number of changes per second in our system is spiky in nature (assume 50-5000 notifications/second for now)
external systems will subscribe themselves for notifications

Those are real-life business assumptions, which are delivered to the designer/programmer/you/me.

Questions?

How to notify external systems?
What information should we pass? When is the notification delivered?
How long should we wait for the response?
When should we retry?

Let's try to answer those questions.

Answers

There are multiple external systems, built with many different technologies. The most popular and basic method of integration is just making HTTP(S) calls. Should it be GET, POST or X? Let's consider the two most popular - GET and POST.
We have to pass multiple values, depending on the notification type. For example, a normal amount of information is: a string (300 chars), 5 dates, 5 integers - therefore both GET (allowing ~2k chars on nearly all browsers and servers) and POST are viable. However, GET is very straightforward and simple. No issues with encoding, accepting compression or even reading the stream. What is more, GET puts less pressure on your servers, as you do not have to send the body stream. Unfortunately, the GET query string is also visible to (nearly) everyone, therefore only non-sensitive information can be passed. What about concurrent notifications? How could one build an "exactly-once" delivery model? Here is where we can nicely use one of our assumptions. Because we already expose an API, we can force external systems to fetch the information through it after we notify them. Such a notification can be delivered in an "at-least-once" model, and we can provide non-sensitive, idempotent information about the change, which can then be used to get the full, sensitive data from our API. One can even imagine an optimization - keep the notifications to send in a buffer and delete duplicates within a small time bucket.
The obvious thing is that the longer we wait for responses, the more resources are used. However, there is one more important thing. By specifying the request timeout, we can control what the architecture of the external system will look like. Saying "you have 30 seconds to process the notification" is like saying ~"you have a lot of time to get our notification, process it and synchronously ask our API, then send us the HTTP 200 status code". Compare it with "you have 3 seconds to store the notification for processing later or process it asynchronously". The implications are clear: short time = fewer required resources + better integration.
We want to be sure that the notification reaches the external system, and thanks to the design specified in the second point we can use the "at-least-once" delivery model. I see two options now: a) hit the specified URL 3 times (for example), don't wait for the answer and never send this notification again, b) hit the specified URL and retry in X minutes if the HTTP status code was different than 200. The first option is very simple to implement, however it assumes that external systems will develop a mechanism to avoid processing the same notification multiple times - otherwise it will likely end in hitting our API three times for every single notification.
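The buffering optimization mentioned in the second answer could look roughly like the sketch below. It is only an illustration under simple assumptions: NotificationBuffer, Enqueue and Drain are hypothetical names, a notification is identified by a plain string id, the class is not thread-safe, and some timer outside the class (not shown) calls Drain once per time bucket and sends the batch.

```csharp
using System;
using System.Collections.Generic;

public sealed class NotificationBuffer
{
    // Ids waiting to be sent in the current time bucket, in arrival order.
    private readonly List<string> _pending = new List<string>();
    private readonly HashSet<string> _pendingIds = new HashSet<string>();

    // Returns false when the same id is already waiting in this bucket.
    // Dropping it is safe here because the receiver fetches the current
    // state from the API after being notified.
    public bool Enqueue(string changeId)
    {
        if (!_pendingIds.Add(changeId))
        {
            return false;
        }
        _pending.Add(changeId);
        return true;
    }

    // Called once per time bucket: hands out the batch and starts a new bucket.
    public IReadOnlyList<string> Drain()
    {
        var batch = _pending.ToArray();
        _pending.Clear();
        _pendingIds.Clear();
        return batch;
    }
}
```

Because the notification only carries an idempotent identifier, sending it once per bucket instead of once per change does not lose information. The same id can be queued again in the next bucket.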

Conclusion

There we have it: answers which should lead to a simple, sleek design that is relatively easy to implement, completely fulfills the needs and requires a good design from external systems.

Monday, April 18, 2016

Presentation recommendation - The Microservices and DevOps Journey

Today I recommend the following presentation: The Microservices and DevOps Journey

Why? "Microservices" is a buzzword, created around one year ago, still not popular on Google, but surprisingly popular at conferences: Google Trends: Microservices vs SOA. In my opinion, this video presents a sensible approach to transforming a monolith into a microservices system. KISS architecture. LogStash, consul, Cassandra, Docker, Octopus are cool, however the question is: "Do you really need them?". Expect nothing super fancy though, I'm just sharing what I agree with.