Friday, February 12, 2016

Little-known, useful, charming and beautiful algorithms - part 1

Please find the updated version of this post here: https://piotr.westfalewicz.com/blog/2016/02/little-known-useful-charming-and-beautiful-algorithms---part-1/

Image source: https://goo.gl/mJjN4S


Warning: this post won't be about "boring" or "typical" algorithms from Computer Science which we all have learned on studies (like quick sort, merge sort, xxx sort, A*, FFT). Instead, this will be about other little-known, especially USEFUL algorithms, which people working as professional developers should know or heard of.

Little-known

ID generation problems are usually overlooked. Database ID's I mean. Ask someone to name ID "types". Well, GUID, newsequentialid(), int/long (increased one by one, as in IDENTITY), NHibernate's Hi/Lo will be the answers. The first will make your SQL Server cry, when used as clustering key. Two next requires SQL database as a part of the generation process. That limits the performance as well as can be problematic in occasionally connected client scenarios. The last one is quite interesting - solves the problems related to database usage in the generation process, doesn't have the problem as the GUID has... however, it's not quite suitable for distributed (cloud) environment. Many servers can generate the same ID's, if they begin from the same initial ID.

To address those and other issues, listen guys from Twitter:
As we at Twitter move away from Mysql towards Cassandra, we've needed a new way to generate id numbers. There is no sequential id generation facility in Cassandra, nor should there be.
they designed a system for generating ID's with following requirements (source):
Performance
 - minimum 10k ids per second per process
 - response rate 2ms (plus network latency)
    Uncoordinated
       For high availability within and across data centers, machines generating ids should not have to coordinate with each other.

    (Roughly) Time Ordered
       We have a number of API resources that assume an ordering (they let you look things up "since this id").
       However, as a result of a large number of asynchronous operations, we already don't guarantee in-order delivery.
       We can guarantee, however, that the id numbers will be k-sorted (references: http://portal.acm.org/citation.cfm?id=70413.70419 and http://portal.acm.org/citation.cfm?id=110778.110783) within a reasonable bound (we're promising 1s, but shooting for 10's of ms).

    Directly Sortable
       The ids should be sortable without loading the full objects that the represent. This sorting should be the above ordering.

    Compact
       There are many otherwise reasonable solutions to this problem that require 128bit numbers. For various reasons, we need to keep our ids under 64bits.

    Highly Available
       The id generation scheme should be at least as available as our related services (like our storage services).
    Note this is a system for generating ID's. However, this system can be easily detached from the whole infrastructure and put into a nice algorithm or a program for generating uncoordinated, time ordered, k-sortable, compact IDs. In fact, there is already a .NET project for that: IdGen. I recommend you looking into the internals of the generation algorithm because the idea is very simple and sleek. I can't stress enough the usefulness of this algorithm in distributed systems. A must have!

    Concealed algorithm, which .NET developers use every day

    Whenever you use TCP protocol (by WebClient or HttpWebRequest), this algorithm kicks in. Normally it's good and useful (source):
       Nagle's document, Congestion Control in IP/TCP Internetworks (RFC 896) describes what he called the "small packet problem", where an application repeatedly emits data in small chunks, frequently only 1 byte in size. Since TCP packets have a 40 byte header (20 bytes for TCP, 20 bytes for IPv4), this results in a 41 byte packet for 1 byte of useful information, a huge overhead. (...)
       Nagle's algorithm works by combining a number of small outgoing messages, and sending them all at once. Specifically, as long as there is a sent packet for which the sender has received no acknowledgment, the sender should keep buffering its output until it has a full packet's worth of output, so that output can be sent all at once.
    However, sometimes it's not and it is good to know that such a thing even exists. Here is a performance test, which showed speed increase in Azure Queue PUT operations from ~210ms to ~28ms: Nagle’s Algorithm is Not Friendly towards Small Requests

    Little algorithms

    There are two "short" algorithms which I personally like. They aren't a big deal and by googling the problem, you probably will use one.

    What is your favorite algorithm/solution?

    In part two there will be one, special "algorithm".