tag:blogger.com,1999:blog-925513466820378762024-03-13T14:27:47.658+01:00Patterns C# Software Information Architecture Design Technology Big .NET Data Microsoft DevelopmentPWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.comBlogger28125tag:blogger.com,1999:blog-92551346682037876.post-36047876861290826012017-11-03T09:18:00.000+01:002017-11-03T09:23:24.518+01:00This blog is moving. No further content will be published here. If you still want to follow me, please update your subscriptions. You can follow the new blog via Facebook, Twitter, or RSS.<br />
The new address is: <a href="https://piotr.westfalewicz.com/blog/">https://piotr.westfalewicz.com/blog/</a>PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com0tag:blogger.com,1999:blog-92551346682037876.post-89713424970940544142017-10-12T08:30:00.000+02:002017-11-03T09:22:59.281+01:00Rock solid pipeline - how to deploy to production comfortably?<h2>
<span style="color: red;">Please find the updated version of this post here: <a href="https://piotr.westfalewicz.com/blog/2017/10/rock-solid-pipeline---how-to-deploy-to-production-comfortably/">https://piotr.westfalewicz.com/blog/2017/10/rock-solid-pipeline---how-to-deploy-to-production-comfortably/</a></span></h2>
<br />
First of all, this post isn’t for people who think a developer’s job ends with designing, writing, and testing code. It goes far beyond that. One of the important responsibilities is shipping your code to production. How do you do that safely?<br />
<h2>
Starting with the artifacts</h2>
<div>
Where do we begin? Assume you designed, wrote the code, tested it, reviewed it, wrote integration tests, added logging, metrics, created documentation, updated dependencies/libraries, pushed the code to some kind of a repository (it doesn’t matter which one), and your build system created a runnable version of your code with all the needed content – you have <b>the ARTIFACTS</b> for the deployments. So now what?</div>
<div>
For deployment and code validation purposes we use a pipeline. It’s a series of verification steps which ensure our code is working as required. How many stages should the pipeline have? What stages should it have?<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-1OE4Zl2lnYs/Wd7zmF-rfTI/AAAAAAAADJ4/T7iwWahzitsT_Mb-MBx6qGYc__9q4dgNgCLcBGAs/s1600/a%2Bpipeline.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="277" data-original-width="1141" height="95" src="https://3.bp.blogspot.com/-1OE4Zl2lnYs/Wd7zmF-rfTI/AAAAAAAADJ4/T7iwWahzitsT_Mb-MBx6qGYc__9q4dgNgCLcBGAs/s400/a%2Bpipeline.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A pipeline. <a href="https://www.draw.io/?lightbox=1&highlight=0000ff&edit=_blank&layers=1&nav=1&title=a%20pipeline#R5ZbBcoIwEIafhmNnIlFLj5WKvfTkoecIC2QMLBOjaJ%2B%2BQTcCQzv1UutYLiT%2FbjbJt%2F8hHg%2BL%2FUKLKn%2FDBJTns2Tv8RfP90f%2BmNlfoxycMp2elEzLhLRWWMoPIJEWZluZwKaXaBCVkVVfjLEsITY9TWiNdT8tRdXftRIZDIRlLNRQfZeJyUkdMdYGXkFmOW0dTCiwEvE607gtaT%2FP5%2BnxO4UL4WpR%2FiYXCdYdic89HmpEcxoV%2BxBUA9dhO62Lvomez62hNJcsmDymMBUriFmQsnG8eqAKO6G2xMLjEZ3VHBwfe%2ByqGW4MVB6f1bk0sKxE3Gi19YTVclMoOxvZYYqloR77QTOXSoWoUB%2BLccYCFkWU19Gj42f1jdG4BhcpsbSVZsObuqODNrDvSHTzBWABRh9sirOla2fd9pg7C%2Bbd9pImyFbZuVSL1g6I7oWk%2FQHpBHagsCqaG90ZcYqeSd5EB%2Fh%2F8DpFJ9MLyLNrkR8PyFcak21sJJZWn91pEwL%2Bh%2FYfur3H%2FPl2mQ8Af9GGn5lfg%2FEPvg7vnbHz9dPk15jbafs8OsY6j1A%2B%2FwQ%3D" target="_blank">Edit it on draw.io</a>.</td></tr>
</tbody></table>
<h2>
Understanding the tradeoffs</h2>
Of course, it depends on your use case. You have to find the balance between time to production, time invested in the pipeline (tests, monitoring, infrastructure…) and validation strictness. StackOverflow stated in one of their presentations that they test their software on their users. While that may work for them, imagine a bank testing its software on end users. In some cases, the trust is too important to lose. This post presents a rock-solid pipeline for one development environment and a multi-region production environment. Executed correctly, it’s a pipeline that will catch nasty things like memory leaks and minimize the blast radius in production.<br />
<h2>
Rock solid pipeline</h2>
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-ncgHHB7pGA0/Wd75uCQmunI/AAAAAAAADKI/CuluTj8KjKMJ92V0ooHtxS_7l-ay04v1QCLcBGAs/s1600/Rock%2Bsolid%2Bpipeline.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="530" data-original-width="1534" height="135" src="https://1.bp.blogspot.com/-ncgHHB7pGA0/Wd75uCQmunI/AAAAAAAADKI/CuluTj8KjKMJ92V0ooHtxS_7l-ay04v1QCLcBGAs/s400/Rock%2Bsolid%2Bpipeline.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Rock solid pipeline. <a href="https://www.draw.io/?lightbox=1&highlight=0000ff&edit=_blank&layers=1&nav=1&title=Rock%20solid%20pipeline#R7Vvfc6o4FP5rnN19uB0ggPCoVts7c7vt1HZ2XyNEzRSIA2m1%2B9dvAomGHyrtSGF68aGFQziB833nfEcMAzAJdzcx3KzviI%2BCgaH5uwG4HhiGrusW%2B8ct75nFst3MsIqxLwYdDHP8HxJGTVhfsY%2BS3EBKSEDxJm%2F0SBQhj%2BZsMI7JNj9sSYL8rBu4QiXD3IOBtF5ZB%2Fs%2F2KdreWfyRviBW4RXazG5Y9jZgQX0XlYxeY3EjAMDLNNPdjiE0pe41WQNfbJVTGA6AJOYEJpthbsJCnh0ZeDkefRdXu0AjNc0DNiOzjbTw7MjJ%2Bt1TmY3F6OIqtMd82drPnQW7tI0bNP0AfxhZB7eYPAqJyjOuF1jiuYb6PH9LSNQ%2FhLS2CFf7O3Do%2FEdGpOXPR4OsyxJRAV9DD4CBngVsR2PXT%2BK%2BQAcBBMSkDidG8ym14Y%2B3LtSj6SfyhCIoL2hmKKdYhIhuUEkRDR%2BZ0Pk0aEASdDftISL7YFKhiPGrBUWmUAYoSDwau%2F7AALbEDjUxMSswMQO2JRjH7%2BxzRVNbzszvQZFS4Cl5TnClKciSmgijy7i4nh2fYdTyk7mFFLs8XuMYPCeYHanow86O3WNHitGzHmM3jDa8hKxRt5LzUsT52Zc%2FdApHmHs4EXl1FnMWL7yY8YcNIUMYjSk%2BaTJ0zkiESpwX5hkfgRoyT1wSmNW9kbCHGLf55OMq3L0kJhaMTGVNEyzi%2B83lVuulcsty7TLuQUqcgtoTeSWfb7eBTgNfb54Fepuvdgaw3MlTsDcbHVz89XNAhXVzaxAQG8CAKcEwNPtlBlGj08%2FZ6PJ07yEB3PI%2BohjLK8HSh5MuyIFzinReDY2R%2B5JrJrMIh3kFcosY7jv4nIYSuNFQXTPZ5Fa7FQkfJis9xiVgn60wDEfG%2B453K14C3uVdY3GFZ8RJ8gf8X3ungdIu%2BIIp%2B2tyaeJCPV4kFgsGEqItXR8MjbMLLMmRxS9Ip%2FPlu6sgzQ87zLQs%2BjnoZctrQq9VpG%2BdhPpy3Kl2w3jbKJpZYW9cDqaVqcaRvkd43dSNXPYJVWTAVcQGP16uB11VMyAO7wejdoSM%2BB2SswM0KvZl6mZZXZJzYyqr9qdUrPxF6jZEHRLzcpfEb69mg2tTqlZub8fT5%2B6KmbTmQss0JaY2cNOiRmo0Qr2YnYhMXOMLomZLBmdFbOp62rayTy9ACa6pndKzUCNFuO7qZmugS7JGSg%2F9L0Z3d11Vc%2FGMxu4rX05c61u6VmN5029nl1Iz%2FS9WLUhaFbH5WvmaF8gX4bbpnz1D%2Fa%2FMNuA02K26b%2FhM2Md6MW%2BpOLH6Kb6krKSsavVxvf%2Fsr%2FT574ZqayGIN%2BNAMf8wm6k3Lo%2FPN5fP0%2Beft7%2F3RZoTvdBswpdBLCcMmiyrcyBJo2XbSHLzxNzMD53d8lBp2A03Qp5agzGMmiHatkjVq9aViLWWLXUq54%2FfWrd4gOM%2BbpFsuTeWaRZZ0gxiQaFpYxnlvqNYZKuW5w%2BPt4%2Fsv8BWXHOeDD61Kq%2Fb7zAr0SxCiKe%2BCG%2BzQV%2BunGUdKcYxiAOjpFLSxD99GrZXwT6wlXNMwxzPeDL0F%2F4ulSKQ6Ssr%2F3watoDx%2BtmyZpkN80jj710i0SYkhhHKzk6ed3I4briVjHXnAxuNgGjt4h3g1PWWbrb57Dy82OrOVy1AOM81%2FvE6ROn7cTRgW5eOewDTBtYpm4P
28yj4y%2BOnKL%2BPCRpCp3otGrKYM%2BTE7%2BItVphq157qF8vZJnhtHjfoOwNIC3E0Y8Q7gYGm96GIQ92tEg2Ch%2F2BFnuZ9uk3Xdq%2FEOdTrEr5r2DP0sud1XD%2FjpxMYu9%2F8ls4ADl9EVxZpXuFyfx%2BUfi3aXxsLB6z5HPRVUW2009sa5aF9aTuCfxhzVbK7K4ohZfiMVs9%2FBucHpMeQUbTP8H" target="_blank">Edit it on draw.io.</a></td></tr>
</tbody></table>
<div>
The orange cards are validation steps. Once all requirements in validation steps are completed, the change is promoted to the next environment.</div>
<div>
<br /></div>
<h3>
The environments</h3>
<div>
<ul>
<li><b>The artifacts</b> – not really an environment… but the graph looks nice with it :)</li>
<li><b>Alpha </b>– an environment for test purposes only. It doesn’t face any real traffic. Its main purpose is to keep the beta environment stable – to catch errors before they reach the development environment and cause cross-team failures.</li>
<li><b>Beta </b>– this is the development environment.</li>
<li><b>Gamma </b>– again, an environment which doesn’t face any real traffic. It’s very important though, because it is configured identically to the real production environment.</li>
<li><b>1 Box</b> – a small subset of production hosts. Surprise! Not really 1 host… if your service runs on 1000 hosts, you can have, e.g., 10 1-Box hosts.</li>
<li><b>Production</b>.</li>
</ul>
<h3>
Validation steps</h3>
</div>
<div>
<div>
First of all, before deploying the changes anywhere, rudimentary checks can be done. Unit tests can be run and static analysis performed – checking whether the code was reviewed by the right people, whether the change follows the code style, and whether the code is covered by unit tests. All checked? Proceed to <b>Alpha</b>.</div>
<div>
<br /></div>
<div>
After <b>Alpha </b>deployment, part of the integration tests can be executed. It may be impossible to execute all of the integration tests and keep a sensible execution time, so pick the most important ones. As written previously, Alpha exists to keep Beta (the development environment) stable. By integration tests, I mean all automated tests which interact with the environment in any way. While executing the tests, scan the logs for ERRORs. The number of errors has to stay within reasonable limits. (Poorly written code tends to treat the presence of errors as a normal situation.) No issues discovered? Proceed to <b>Beta</b>.</div>
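To make the ERROR-scan step concrete, here is a minimal sketch of such a gate. The log format and the threshold of 5 errors are illustrative assumptions, not taken from any real pipeline:

```python
# Minimal sketch of an ERROR-scan gate after the Alpha deployment.
# The log format and the threshold are assumptions for illustration.
def count_errors(log_lines):
    """Count lines that carry an ERROR level marker."""
    return sum(1 for line in log_lines if "ERROR" in line)

def alpha_gate(log_lines, max_errors=5):
    """Return True when the error count stays within the allowed limit."""
    return count_errors(log_lines) <= max_errors

logs = [
    "2017-10-12 08:30:01 INFO  request handled",
    "2017-10-12 08:30:02 ERROR downstream timeout",
    "2017-10-12 08:30:03 INFO  request handled",
]
print(alpha_gate(logs))  # one ERROR, still within the limit
```

In a real pipeline this check would of course run against the aggregated logs of the whole Alpha fleet, not a list in memory.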
<div>
<br /></div>
<div>
<b>Beta</b> is the development environment. It’s the environment used for demos and manual testing. It’s used heavily throughout the company, so any issue here will cost many teams time. The change will spend quite some time here and will get tested thoroughly. This is the time to run all integration tests and load tests. Load tests should aim at least at production peak volume (scaled per number of hosts). During all this time, while different tests are executed and people are using your service, monitor the logs as before. Validate that the logs are produced at all. Use different metrics:</div>
</div>
<div>
<ul>
<li>Host level (CPU usage, thread pool, memory, network I/O, disk I/O, disk usage, number of file handles/inodes).</li>
<li>Application level, that is specific to your business, for example:</li>
<ul>
<li>Number of invalid sign-ups, average size of sent messages.</li>
<li>Latency of the downstream APIs, number of requests to the downstream APIs, number of requests to your APIs, latency of your APIs, number of errors returned, number of exceptions thrown.</li>
</ul>
</ul>
<div>
<div>
Monitor those metrics with different kinds of alarms. The most basic one checks whether the value stays between min/max thresholds. However, there are more powerful and sophisticated types: first-derivative or σ-value checks. More on those in the next post. Rigorous tests passed? Proceed to <b>Gamma</b>.<br />
<br /></div>
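The three alarm types can be sketched roughly as follows; the metric series and all thresholds are made-up illustrations:

```python
# Sketch of the three alarm checks: static min/max band, first derivative,
# and a sigma (standard deviation) check. Thresholds are illustrative.
def minmax_alarm(values, lo, hi):
    """Fire when any data point leaves the [lo, hi] band."""
    return any(v < lo or v > hi for v in values)

def derivative_alarm(values, max_delta):
    """Fire when the metric changes faster than max_delta per sample.
    A first-derivative check catches sudden jumps inside the band."""
    return any(abs(b - a) > max_delta for a, b in zip(values, values[1:]))

def sigma_alarm(values, n_sigma=2.5):
    """Fire when the latest point lies more than n_sigma standard
    deviations away from the mean of the series."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5
    return std > 0 and abs(values[-1] - mean) > n_sigma * std

cpu = [41, 43, 42, 40, 44, 42, 43, 41, 42, 71]  # sudden spike at the end
print(minmax_alarm(cpu, lo=0, hi=90))       # spike stays inside the static band
print(derivative_alarm(cpu, max_delta=20))  # the jump 42 -> 71 exceeds 20
```

Note how the static band misses the spike while the derivative check catches it – that is exactly why the more sophisticated alarm types are worth the effort.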
<div>
<b>Gamma </b>is a special environment because it is the first one with the production configuration. The only validation here is smoke integration tests, which exercise different components of the service to check the configuration. The purpose of those tests is to catch, for example, a typo in the production connection string. Seems to be working? Go to <b>1-Box</b>.</div>
<div>
<br /></div>
<div>
<b>1-Box</b>, as written previously, is part of your production fleet. It should be a small percentage of production hosts, serving production traffic. Besides the obvious reduction of blast radius by host count, there is an additional benefit in some situations. Take processing messages from a queue as an example: if a faulty 1-Box host takes a message and fails, there is a high chance that a healthy host will pick the message up later and there will be no customer impact at all. To further reduce the blast radius, deploy to one 1-Box region at a time, preferably at off-peak hours. After the deployment, monitor what was monitored in Beta (logs, host metrics, application metrics), except now you are validating against real production traffic. What’s more, here you can add one more peculiar type of check – compare the Production metrics to the 1-Box metrics. This should hopefully reveal any anomaly missed before. After that, go for <b>Production</b>!</div>
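The Production-vs-1-Box comparison can be sketched roughly like this; the metric names and the 20% relative tolerance are assumptions for illustration:

```python
# Sketch of comparing 1-Box metrics against the production baseline.
# Metric names and the 20% tolerance are illustrative assumptions.
def compare_fleets(one_box, production, tolerance=0.20):
    """Return the metrics whose 1-Box value deviates from the
    production baseline by more than the given relative tolerance."""
    anomalies = {}
    for name, prod_value in production.items():
        box_value = one_box.get(name)
        if box_value is None or prod_value == 0:
            continue
        drift = abs(box_value - prod_value) / prod_value
        if drift > tolerance:
            anomalies[name] = drift
    return anomalies

production = {"p99_latency_ms": 120.0, "error_rate": 0.01, "cpu_pct": 55.0}
one_box    = {"p99_latency_ms": 190.0, "error_rate": 0.01, "cpu_pct": 58.0}
print(compare_fleets(one_box, production))  # latency drifted far beyond 20%
```

Both fleets serve the same live traffic, so a large drift in one metric is a strong signal that the new version, not the traffic, is the cause.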
<div>
<br /></div>
<div>
Finally, after ~2 days your change arrives in <b>production</b>. We are not perfect – what if a critical bug is introduced? Does that mean we have to wait 2 days for a fix? Nope – deploy a hotfix as you wish. You can, for example, skip the two “baking” steps and leave the other validation steps in place.</div>
</div>
</div>
<div>
<br /></div>
<div>
<b>Please note</b>: the views I express are mine alone and they do not necessarily reflect the views of Amazon.com.</div>
PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com0tag:blogger.com,1999:blog-92551346682037876.post-56923639673917719912017-07-05T08:46:00.001+02:002017-11-03T09:24:37.713+01:00The nightmare of large distributed systems<h2>
<span style="color: red;">Please find the updated version of this post here: <a href="https://piotr.westfalewicz.com/blog/2017/07/the-nightmare-of-large-distributed-systems/">https://piotr.westfalewicz.com/blog/2017/07/the-nightmare-of-large-distributed-systems/</a></span></h2>
<br />
There are certain classes of exciting problems which surface only in massively distributed systems. This post is about one of them. It's rare, it's real, and when it happens, it will take your system down. The root cause, however, is easy to overlook.<br />
<h2>
Assumptions</h2>
For this post's sake, let's make some assumptions:<br />
<ul>
<li>A critical service is a service which is crucial from your business perspective: if it goes down, you don't make money and your customers are furious.</li>
<li>Your business is a veterinary clinic - or rather an immense, world-wide, distributed, most popular, top-notch veterinary clinic network. You make money when clients sign up online for visits. A critical service would be any service which can prevent a client from signing up. Let's call the path of signing up the critical path.</li>
<li>The massive distributed system consists of, for example, 333 critical services, each of which has its own fleet of 1222 hosts (so in total, there are 406,926 hosts running critical services in your system).</li>
<li>Each critical service is managed by a two pizza team.</li>
</ul>
<div>
All in all, I'm thinking about something like this:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-JRnVw8izfog/WVr87BMYyII/AAAAAAAAC_Q/otqU-mam-w4203g3z9Zv4q4PfDMmR8qiQCLcBGAs/s1600/n%252B1%2Bin%2Bcloud.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img alt="Large distributed system - exponential failure problem - unhealthy host called 4 times in critical path" border="0" data-original-height="573" data-original-width="867" height="263" src="https://2.bp.blogspot.com/-JRnVw8izfog/WVr87BMYyII/AAAAAAAAC_Q/otqU-mam-w4203g3z9Zv4q4PfDMmR8qiQCLcBGAs/s400/n%252B1%2Bin%2Bcloud.png" title="" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Large distributed system - exponential failure problem</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div>
We have a customer who wants to sign up, the load balancer takes the request, there are 332 healthy services involved, and there is this one red-star service which is asked 4 times on the critical path. If it goes down, the customer can't sign up. I would compare it to a kind of SELECT N+1 problem.</div>
<div>
<br /></div>
<div>
In such a big system, it may happen that a particular red service will be called 10+ times on a single sign-up. The sign-up itself may have many steps, and each step may call many services, including our red service, many times. An architecture smell... a design smell, you may say... Yes and no. Think about the scale: 333 critical services times two-pizza teams = around 1333 people involved in just the critical path, and many more involved in non-critical services. I think this situation is inevitable at scale.</div>
<h2>
Problems</h2>
<h3>
Availability</h3>
<div>
When the red service is down - that's an obvious situation, easy to spot. But what happens if our service has some kind of issue that leads to 90% availability? What is the customer impact, assuming 10 calls made to the red service during sign-up? Unfortunately, only ~35% of customers will be able to get through the critical path. Why? The first call has a 90% chance of success, the second call also has a 90% chance of success, but both calls combined have only a 0.9*0.9 chance of success. For 10 calls, that is 0.9^10 = ~35%. Unfortunately, during a "system is down" event - with an enormous number of alarms going off, tens of services impacted, and many customers being lost every second - a 10% availability drop in one service is very, very easy to miss.</div>
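The compounding math above can be checked in a few lines:

```python
# The compounding effect described above: per-call availability
# raised to the number of calls on the critical path.
def path_availability(per_call, calls):
    return per_call ** calls

# One call at 90% still looks tolerable...
print(path_availability(0.90, 1))
# ...but 10 calls to the same degraded service leave only about a
# third of customers able to complete the sign-up.
print(round(path_availability(0.90, 10), 2))  # ~0.35
```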
<h3>
<br />
Timeouts</h3>
<div>
The second nasty problem is timeouts. The origin of timeouts can span network problems, host problems, code bugs and more. It may happen that a random 10% of requests will time out. Let's assume the red service normally has 20ms latency (on average). It's called 10 times, so the total wait time on a single sign-up is 200ms. However, those 10% timeouts - let's say 1.5 seconds each - change the average latency to 20ms * 0.9 + 1500ms * 0.1 = 168ms. Now the total wait time for the red service is 1680ms. The customer is probably getting timeouts at this point. It's nasty, because a 10% timeout issue can be well hidden. For example, it might be a core router which is misbehaving but still works, and our monitoring doesn't discover the packet loss.</div>
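The same arithmetic, as a tiny sketch:

```python
# The latency arithmetic from the paragraph above: a 10% timeout rate
# drags up the average call latency, multiplied by 10 calls per sign-up.
def expected_latency_ms(normal_ms, timeout_ms, timeout_rate, calls):
    per_call = normal_ms * (1 - timeout_rate) + timeout_ms * timeout_rate
    return per_call * calls

healthy = expected_latency_ms(20, 1500, 0.0, 10)
degraded = expected_latency_ms(20, 1500, 0.1, 10)
print(healthy, degraded)  # 200.0 vs 1680.0 ms of total wait per sign-up
```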
<div>
<br /></div>
<h3>
TL;DR;</h3>
<div>
Even tiny issues - like 10% timeouts or a 10% availability drop - in only one service out of hundreds on a critical path can take a large distributed system down.<br />
<br />
<b>Please note</b>: the views I express are mine alone and they do not necessarily reflect the views of Amazon.com.</div>
PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com0tag:blogger.com,1999:blog-92551346682037876.post-22780628592120443462017-06-05T07:30:00.000+02:002017-11-03T09:25:24.130+01:00What does it mean to design a highly scalable system?<h2>
<span style="color: red;">Please find the updated version of this post here: <a href="https://piotr.westfalewicz.com/blog/2017/06/what-does-it-mean-to-design-a-highly-scalable-system/">https://piotr.westfalewicz.com/blog/2017/06/what-does-it-mean-to-design-a-highly-scalable-system/</a></span></h2>
<br />
Please note: the views I express are mine alone and they do not necessarily reflect the views of Amazon.com.<br />
<br />
It's surprising how the volume of data is changing around the world and on the Internet. Who would have thought 10 years ago that a physics experiment would one day generate 25 petabytes (26,214,400 GB) of data yearly? Yes, I'm looking at you, LHC. Times are changing, companies are changing. Everyone designs for scale a tad differently. And that's good - it's important to design for the right scale.<br />
<br />
<h2>
Over scaling</h2>
Let's assume you are a startup building a brand new, unique CMS (unlike the 2000 other ones). Is it worth thinking about NoSQL/Cassandra/DynamoDB/Azure Blob Storage? Probably not. It's safe to assume that most of the data will fit into one small or medium SQL database. When performance problems appear, that's good news. It means your startup is working (or you just don't know SQL...). It means you have clients - paying clients. Also, at that point you will probably have a completely different idea about the system - you've gone from the "no clients and your imagination only" state to the "working product with customers and proven solutions" state. You can reiterate over your architecture now. Hopefully you have funds now. What I've heard of multiple times is a startup failing because someone created a complicated, scalable system for 321,000 clients. All the money is spent on IT, none on the business. Total failure.<br />
<h2>
No need for scaling</h2>
Now, some systems don't have to scale, or their requirements grow more slowly than new hardware progresses (so effectively they fall into the first category). Some ERP systems for medium-sized companies are probably a good example. A large SQL database, maybe an Azure Blob Storage/DynamoDB, and all scalability problems are solved.<br />
<h2>
Some scaling needed</h2>
As I mentioned before, sometimes throwing a NoSQL database into the "ecosystem" solves the problem. Unfortunately for us geeks, that's usually the case.<br />
<h2>
Scaling definitively needed</h2>
There are times when people say "Azure Blob is highly scalable". Well, that statement is a joke. Azure Storage isn't scalable at all: its 20,000 requests-per-second limit might sometimes be only a tiny part of what you need. Furthermore, there are other hard limits: <a href="https://docs.microsoft.com/en-us/azure/storage/storage-scalability-targets" target="_blank">Azure Storage Scalability and Performance Targets</a>. To be fair, <a href="http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html" target="_blank">DynamoDB has limits too</a>. However, you can contact support and request as much throughput as you need. There is one more catch - pricing. In Azure you pay for the number of requests (not throughput) + storage; in DynamoDB, for provisioned throughput + storage. Depending on your use case, one might be much cheaper than the other.<br />
<h2>
Unique scale</h2>
Finally, there are times when you need to build a new solution and may even have a dedicated team for the challenge. Let's imagine you work at a big company: it has hundreds of thousands of services, each service is called many thousands of times per second, and each call generates logs. You want a solution to store the logs for months and search within them. The scale is unusual, and the number of calls is expected to grow 30% year over year, with 4x more traffic on some days. This time, you can probably start thinking about your own, new storage engine, strictly coupled to your needs.<br />
<br />
<br />
TL;DR: we covered some scalability levels; it turns out everyone should scale differently. PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com0tag:blogger.com,1999:blog-92551346682037876.post-5115556593211699492017-02-23T09:53:00.001+01:002017-11-03T09:28:37.273+01:00Cassandra logs/time series storage design - optimal compaction strategy<h2>
<span style="color: red;">Please find the updated version of this post here: <a href="https://piotr.westfalewicz.com/blog/2017/02/cassandra-logs/time-series-storage-design---optimal-compaction-strategy/">https://piotr.westfalewicz.com/blog/2017/02/cassandra-logs/time-series-storage-design---optimal-compaction-strategy/</a></span></h2>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-8Z3Ess_u6yI/WK6KnbEfciI/AAAAAAAACeA/6VT0s_Xiv58g5x4oMTZGyzHdxbgPnGyIgCLcB/s1600/8226451812_88007f08df_z.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="213" src="https://2.bp.blogspot.com/-8Z3Ess_u6yI/WK6KnbEfciI/AAAAAAAACeA/6VT0s_Xiv58g5x4oMTZGyzHdxbgPnGyIgCLcB/s320/8226451812_88007f08df_z.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Source: <a href="http://www.ccpixs.com/">http://www.ccpixs.com/</a></td></tr>
</tbody></table>
Let's assume you are considering Cassandra for log storage or, in general, for time series storage. Let's assume your usage pattern is that you store insane amounts of data for three months and query them rarely - usually when something goes wrong and you need to investigate why. So, you have done the business analysis. You have done the technical analysis. You ran performance benchmarks. You identified clustering and partition keys. You picked the right hardware. You tested how the cluster responds to peaks. You tested the cluster against minute- and hour-long surges. You have been running the cluster in tests for weeks. You are now ready to deploy the system to production. <b>You are now ready to fail in ~one month. Wait, what?!</b><br />
<b><br /></b>
Of course, all the steps mentioned bring you closer to success. It's good that there are plenty of great resources on how to design a time series model in Cassandra. Just google "<i>time series design Cassandra</i>".<br />
<br />
Surprisingly, the first five of them (I don't know about the others) don't mention setting the right compaction strategy.<br />
<br />
<h2>
The default compaction strategy (Size Tiered Compaction Strategy)</h2>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://image.slidesharecdn.com/cassandracompaction-150206000410-conversion-gate02/95/cassandra-compaction-16-638.jpg?cb=1423181077" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="240" src="https://image.slidesharecdn.com/cassandracompaction-150206000410-conversion-gate02/95/cassandra-compaction-16-638.jpg?cb=1423181077" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Source: <a href="https://www.slideshare.net/tomitakazutaka/cassandra-compaction">https://www.slideshare.net/tomitakazutaka/cassandra-compaction</a></td></tr>
</tbody></table>
The default compaction strategy triggers a compaction when multiple SSTables of a similar size are present. To emphasize how evil that is for the log storage scenario, let's assume that a week of production logs results in 1x 168GB SSTable, 3x 42GB SSTables, 3x 10GB SSTables, etc. After two weeks, the biggest SSTable will probably still be 168GB. That's fine - you have tested the cluster under two weeks of production load, right? The trap is that you want to store logs for three months, and nobody will spend that amount of time hitting the test cluster with production load. After three months, the biggest SSTable will be around 672GB, and a month after that, probably 2688GB. Compaction isn't free. It takes your CPU, it takes your disk IOPS (yes, nice sequential writes, but still). <b>It will take your life (or rather kill your cluster with pending compactions).</b><br />
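A toy model reproduces those numbers. This is my simplification for illustration, not how Cassandra actually schedules merges: one 168GB SSTable is flushed per week, and whenever four tables of equal size accumulate, they are merged into one:

```python
# Toy model of size-tiered compaction on append-only log data:
# whenever `merge_threshold` SSTables of the same size accumulate,
# they are merged into one table of the combined size.
def simulate(weekly_gb, weeks, merge_threshold=4):
    tables = []
    for _ in range(weeks):
        tables.append(weekly_gb)       # weekly flush of one SSTable
        merged = True
        while merged:                  # keep merging until stable
            merged = False
            for size in set(tables):
                same = [t for t in tables if t == size]
                if len(same) >= merge_threshold:
                    for _ in range(merge_threshold):
                        tables.remove(size)
                    tables.append(size * merge_threshold)
                    merged = True
    return max(tables)                 # biggest SSTable on disk

print(simulate(168, 13))  # ~3 months: biggest SSTable is 672 GB...
print(simulate(168, 17))  # ...a month later compaction produced 2688 GB
```

Each merge rewrites ever-larger tables in full, which is exactly where the CPU and IOPS go.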
<h2>
Solution: Date Tiered Compaction Strategy</h2>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://www.datastax.com/wp-content/uploads/2014/11/dtcs_blog1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="151" src="https://www.datastax.com/wp-content/uploads/2014/11/dtcs_blog1.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Source: <a href="http://www.datastax.com/dev/blog/datetieredcompactionstrategy">http://www.datastax.com/dev/blog/datetieredcompactionstrategy</a> </td></tr>
</tbody></table>
The date tiered compaction strategy will, after some time, simply leave old SSTables untouched. The situation described above translates to having (depending on your use case), for example, 10x ~268GB SSTables, or perhaps 100x ~26GB SSTables. No killing compactions on old data! Read about the details here: <a href="http://www.datastax.com/dev/blog/datetieredcompactionstrategy" target="_blank">DateTieredCompactionStrategy: Compaction for Time Series Data</a> and here: <a href="https://labs.spotify.com/2014/12/18/date-tiered-compaction/" target="_blank">Date-Tiered Compaction in Apache Cassandra</a>. Yes, queries will probably be a little bit slower.<br />
<br />
All in all, invoking this CQL (with carefully chosen constants) will save your cluster:<br />
<blockquote class="tr_bq">
ALTER TABLE mypreciouslogs.myprecioustable WITH compaction = { 'class' : 'DateTieredCompactionStrategy', 'base_time_seconds':'XXX', 'max_sstable_age_days':'YYY' }</blockquote>
and here is how to do it on a live organism: <a href="http://blog.alteroot.org/articles/2015-04-20/change-cassandra-compaction-strategy-on-production-cluster.html" target="_blank">How to change Cassandra compaction strategy on a production cluster</a>.<br />
<br />
BTW, did you know that there is also <a href="http://www.datastax.com/dev/blog/anticompaction-in-cassandra-2-1" target="_blank">anticompaction</a>?PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com0tag:blogger.com,1999:blog-92551346682037876.post-87259215908475322032017-02-21T23:25:00.001+01:002017-11-03T09:30:09.445+01:00[Backup] Why shouldn't you ever use ResilioSync? "Database Error" problem<h2 style="clear: both; text-align: center;">
<span style="color: red;">Please find the updated version of this post here: <a href="https://piotr.westfalewicz.com/blog/2017/02/why-shouldnt-you-ever-use-resiliosync-database-error-problem/">https://piotr.westfalewicz.com/blog/2017/02/why-shouldnt-you-ever-use-resiliosync-database-error-problem/</a></span></h2>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-emcsfx7U8_Y/WKyxFey1nyI/AAAAAAAACdg/VKi-v7n-ssMQjKSEWlodQIKdqpeiwUMyACLcB/s1600/introResilioHeader-1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="183" src="https://4.bp.blogspot.com/-emcsfx7U8_Y/WKyxFey1nyI/AAAAAAAACdg/VKi-v7n-ssMQjKSEWlodQIKdqpeiwUMyACLcB/s320/introResilioHeader-1.png" width="320" /></a></div>
<br />
<blockquote class="tr_bq">
As they say: there are two kinds of people in the world - <strike>those who pick up the ice cube that falls on the floor, and those who kick it under the fridge</strike> those who back up their files and those who haven't experienced losing all their files yet.</blockquote>
<br />
Which category do you fall in?<br />
<br />
I decided to set up a backup system with ResilioSync - the heir apparent of the BitTorrent Sync software. Well, that wasn't a good idea, and I don't recommend this software to anyone.<br />
<div>
<br />
Maybe it's just me, but I prefer backup software to be:
<br />
<ul>
<li>working and rock solid... dependable... stiff... hard... proven software</li>
<li>well documented</li>
<li>actively maintained (security patches, support)</li>
</ul>
<br />
<div>
Whereas my experience with ResilioSync turned out to be:
<br />
<ul>
<li>Working... kind of. That was until I updated it. I don't remember precisely from which version to which, and you won't guess it from the dates either, <a href="https://help.getsync.com/hc/en-us/articles/206216855-Sync-2-x-change-log" target="_blank">because the releases don't have dates</a>. I believe it was from 2.4.3 to 2.4.4. Maybe from 2.4.X to 2.4.X. I'm sure the major number didn't change. That matters because ResilioSync showed a "database error" after upgrading. Bummer. This problem alone made me exterminate that piece of software from all my devices. No, I didn't want to check why it happened, because it shouldn't happen at all. Had my primary copy of the data been gone, I would have lost everything.</li>
<li>The documentation is poor. You won't find it easy to follow, and sometimes you won't find what you are looking for at all.</li>
<li>There is very little activity on the official ResilioSync forum. You also can't help yourself by looking at the code, because it's closed source. <a href="https://forum.resilio.com/topic/42764-force_https-not-working-on-linux/" target="_blank">My question about HTTPS still has no answer after 5 months</a>.</li>
</ul>
</div>
<div>
<br />
This, of course, is just my opinion.</div>
</div>
PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com0tag:blogger.com,1999:blog-92551346682037876.post-59308203557470457162016-12-21T22:41:00.002+01:002017-11-03T09:30:52.010+01:00Dynamic Programming and the hardest problem on HackerRank<h2>
<span style="color: red;">Please find the updated version of this post here: <a href="https://piotr.westfalewicz.com/blog/2016/12/dynamic-programming-and-the-hardest-problem-on-hackerrank/">https://piotr.westfalewicz.com/blog/2016/12/dynamic-programming-and-the-hardest-problem-on-hackerrank/</a></span></h2>
<br />
The hardest problem on HackerRank, when sorted by Max Score at the "Expert" difficulty level, is <a href="https://www.hackerrank.com/challenges/separate-the-chocolate" target="_blank">Separate The Chocolate</a>. It's worth 250 points, and "Expert" is the highest level there is. How can it be solved?<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-jlm4C261CZk/WFrVtw3721I/AAAAAAAACVE/Ae3wdcKxNXAQk8NC5r4P0QrZkTFjhZdiwCLcB/s1600/DP.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="148" src="https://3.bp.blogspot.com/-jlm4C261CZk/WFrVtw3721I/AAAAAAAACVE/Ae3wdcKxNXAQk8NC5r4P0QrZkTFjhZdiwCLcB/s400/DP.PNG" width="400" /></a></div>
<h2>
Problem Statement</h2>
<blockquote class="tr_bq">
Tom and Derpina have a rectangular shaped chocolate bar with chocolates labeled T, D and U. They want to split the bar into exactly two pieces such that:<br />
<ul>
<li>Tom's piece can not contain any chocolate labeled D and similarly, Derpina's piece can not contain any chocolate labeled T and U can be used by either of the two.</li>
<li>All chocolates in each piece must be connected (two chocolates are connected if they share an edge), i.e. the chocolates should form one connected component</li>
<li>The absolute difference between the number of chocolates in pieces should be at most K</li>
<li>After dividing it into exactly two pieces, in any piece, there should not be 4 adjacent chocolates that form a square, i.e. there should not be a fragment like this:
<br />
XX<br />
XX<br />
</li>
</ul>
<b>Input Format</b><br />
<br />
The first line of the input contains 3 integers M, N and K separated by a single space.<br />
M lines follow, each of which contains N characters. Each character is 'T','D' or 'U'.<br />
<br />
<b>Constraints</b><br />
<br />
0≤ M, N ≤8<br />
0≤ K ≤ M * N<br />
<br />
<b>Output Format</b><br />
<b><br /></b>
A single line containing the number of ways to divide the chocolate bar.</blockquote>
<h2>
My approach</h2>
<div>
I didn't have any idea how to solve it with DP. I coded a solution which checks every possible way of dividing the chocolate. I felt it wasn't the right solution, because its complexity is exponential. The graph of possible divisions is traversed wisely - at each "branch" I check the required conditions (connectivity, whether the difference K can still be met, presence of squares) and prune the subtrees with an invalid state. The result? Quite nice: 16 out of 19 tests passed, 210.53 out of 250 points scored. Looking at the leaderboard, I noticed that apart from the winners, many scored just like me. I optimized some cases, however that wasn't enough. DP was indeed needed. I tried to figure it out but finally gave up.</div>
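<div>
<br />
The square-presence check from the pruning above can be sketched like this - a minimal illustration only, assuming cells are assigned one by one in row-major order and <i>color</i> holds 0/1 for the two pieces (<i>FormsSquare</i> is a hypothetical helper, not code from my actual submission):</div>
<pre class="brush: csharp">public static class SquareCheck
{
    // Returns true if assigning value v to cell (r, c) would complete a
    // monochromatic 2x2 square with the already-assigned neighbors above
    // and to the left (cells are filled in row-major order, so those
    // three neighbors are decided before (r, c) is).
    public static bool FormsSquare(int[,] color, int r, int c, int v)
    {
        return r > 0 && c > 0
            && color[r - 1, c - 1] == v
            && color[r - 1, c] == v
            && color[r, c - 1] == v;
    }
}</pre>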
<h2>
The solution</h2>
<div>
The whole editorial provided by HackerRank is very short:</div>
<blockquote class="tr_bq">
<div>
<div>
This problem is a complicated dynamic programming problem (DP). The idea is simple but the implementation is difficult.</div>
<div>
<br /></div>
<div>
We can iterate the grids one by one. For each grid suppose the left and the upper one has been given to Tom or Derpina. (Color black or white.)</div>
<div>
<br /></div>
<div>
To decide the square [i][j], we need all the square’s state in square[i][0..j – 1] and square[i – 1][j..n -1] , all the (n + 1) squares in total can decide this square’s state. We can use these as a part of state and we also must keep whether a component is dead. (If it’s dead then add one more square of the same color is invalid.)</div>
</div>
</blockquote>
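<div>
To make the editorial's state a bit more concrete: the (n + 1) frontier squares, each with a color and a "component is dead" flag, can be packed into a single integer. This is only a hypothetical encoding sketch (the names <i>Encode</i>, <i>piece</i> and <i>dead</i> are mine, not the editorial's); with n ≤ 8 there are at most 9 frontier cells × 2 bits, which fits easily in an int:</div>
<pre class="brush: csharp">public static class Frontier
{
    // Pack the broken-profile frontier into an int: for each frontier
    // cell, bit 0 = which piece it belongs to (0/1), bit 1 = whether
    // the component it belongs to is "dead" (may not grow further).
    public static int Encode(int[] piece, bool[] dead)
    {
        var state = 0;
        for (var k = 0; k < piece.Length; k++)
            state |= (piece[k] | (dead[k] ? 2 : 0)) << (2 * k);
        return state;
    }
}</pre>
<div>
Such an integer state, together with the index of the current cell, is what the memoization would be keyed on.</div>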
<div>
I can't wrap my head around the DP function, though. It seems strange to me that the vertical line is the DP state, because connectivity must hold for the whole matrix... does anyone have an explanation?<br />
<h2>
Lessons learned</h2>
</div>
<div>
<ol>
<li>If you can solve part of a very hard problem, it can still be very profitable in terms of HackerRank score. Compare 210.53 points to the 50 points from the <a href="http://dynamicallyinvokable.blogspot.com/2016/11/dynamic-programming-in-5-easy-steps.html" target="_blank">"Two Robots" medium problem</a>.</li>
<li>Looking at other's solutions, I've noticed this piece of code:<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-kgkaDx-rcfo/WFrz3usjuNI/AAAAAAAACVU/rj2trVBWJaMbPD2zbGC5uczoi7fHclqmwCLcB/s1600/DP_HACK.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://1.bp.blogspot.com/-kgkaDx-rcfo/WFrz3usjuNI/AAAAAAAACVU/rj2trVBWJaMbPD2zbGC5uczoi7fHclqmwCLcB/s320/DP_HACK.PNG" width="280" /></a>
</div>
Right after adding those three cases to my code, the result was 250 points (not counted towards HackerRank score, of course). Therefore: not everybody plays fair. </li>
</ol>
</div>
PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com1tag:blogger.com,1999:blog-92551346682037876.post-14460756773199145142016-11-24T08:00:00.000+01:002017-11-03T09:31:35.190+01:00Dynamic Programming in 5 easy steps - Examples - Two Robots<h2>
<span style="color: red;">Please find the updated version of this post here: <a href="https://piotr.westfalewicz.com/blog/2016/11/dynamic-programming-in-5-easy-steps---examples---two-robots/">https://piotr.westfalewicz.com/blog/2016/11/dynamic-programming-in-5-easy-steps---examples---two-robots/</a></span></h2>
<br />
This time we will solve a <a href="https://www.hackerrank.com/challenges/two-robots" target="_blank">HackerRank problem</a> rated medium in difficulty. I'd advise you to first go through a similar, but in my opinion easier, problem <a href="http://dynamicallyinvokable.blogspot.com/2016/10/dynamic-programming-in-5-easy-steps_31.html" target="_blank">I described previously</a>. The problem statement is as follows:<br />
<blockquote>
You have a warehouse with <i>M </i>containers filled with an infinite number of candies. The containers are arranged in a single row, equally spaced to be 1 meter apart. You also have 2 robots that can pick up 1 piece of candy and transport it between any two containers.<br />
<br />
The robots take instructions in the form of <i>queries </i>consisting of two integers, <i>M<sub>a</sub></i> and <i>M<sub>b</sub></i>, respectively. To execute a query, a robot travels to container <i>M<sub>a</sub></i>, picks up 1 candy, transports it to container <i>M<sub>b</sub></i>, and then stops at <i>M<sub>b</sub></i> until it receives another query.<br />
<br />
Calculate the minimum total distance the robots must travel to execute <i>N</i> queries in order.<br />
<br />
Note: You choose which robot executes each query.</blockquote>
Reminder - the 5 easy steps are:<br />
<ol>
<li style="margin: 0px 0px 0.25em; padding: 0px;">define subproblems, count number of subproblems</li>
<li style="margin: 0px 0px 0.25em; padding: 0px;">guess (part of solution), count number of choices</li>
<li style="margin: 0px 0px 0.25em; padding: 0px;">relate subproblem solutions, compute time per subproblem</li>
<li style="margin: 0px 0px 0.25em; padding: 0px;">recurse + memoize OR build DP table bottom-up; check subproblems are acyclic/in topological order; time = time per subproblem · number of subproblems</li>
<li style="margin: 0px 0px 0.25em; padding: 0px;">solve original problem: = a subproblem OR by combining subproblem solutions ⇒ extra time </li>
</ol>
<b>Step 1. </b>Let's try to figure out what our subproblems are. At a given point in solving the problem, what information do we want to have? The first obvious thing: an index into <i>queries</i>. The next obvious thing: we want to optimize the total distance traveled by the robots, therefore our DP function will return the minimum total distance traveled up to some point in the problem. Having that, we will probably also need the positions of the two robots. The situation looks like this:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-AqGGWyyfXpg/WDR8KhC1_0I/AAAAAAAACQ8/NZG2PmPPMy05Z_iqPjUdj_qYUD7cnQs9wCLcB/s1600/TwoRobotsSketch.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="144" src="https://1.bp.blogspot.com/-AqGGWyyfXpg/WDR8KhC1_0I/AAAAAAAACQ8/NZG2PmPPMy05Z_iqPjUdj_qYUD7cnQs9wCLcB/s400/TwoRobotsSketch.png" width="400" /></a></div>
<br />
<b>Step 2. </b>Now, we basically have two choices: A, proceed with the first robot, or B, with the second robot. With <i>A</i> we add the distance of moving <i>r1</i> from its last position <i>M<sub>b(r1)</sub></i> to <i>M<sub>a(INDEX)</sub></i>; with <i>B</i> we add the distance of moving <i>r2</i> from its last position <i>M<sub>b(r2)</sub></i> to <i>M<sub>a(INDEX)</sub></i>. Additionally, in both cases we have to add the cost of moving from <i>M<sub>a(INDEX)</sub></i> to <i>M<sub>b(INDEX)</sub></i>.<br />
<b>Step 3. </b>Writing the DP function is now straightforward:<br />
<br />
<span style="font-family: "trebuchet ms" , sans-serif;">DP(r1, r2, index) = min from:</span><br />
<span style="font-family: "trebuchet ms" , sans-serif;"> A: <i>|M<sub>a(INDEX)</sub> - M<sub>b(r1)</sub>| + |M<sub>b(INDEX)</sub> - M<sub>a(INDEX)</sub>| + DP(INDEX, r2, INDEX+1)</i></span><br />
<span style="font-family: "trebuchet ms" , sans-serif;"> B: <i>|M<sub>a(INDEX)</sub> - M<sub>b(r2)</sub>| + |M<sub>b(INDEX)</sub> - M<sub>a(INDEX)</sub>| + DP(r1, INDEX, INDEX+1)</i></span><br />
<span style="font-family: "trebuchet ms" , sans-serif;"></span>
However, having the index in the DP function is redundant. We can decrease the time complexity of our solution by removing it - the index is always the next item after max(<i>r1</i>, <i>r2</i>). Thus, the DP function now looks like:<br />
<br />
<span style="font-family: "trebuchet ms" , sans-serif;">DP(r1, r2) = min from (where i = max(r1, r2)+1):</span><br />
<span style="font-family: "trebuchet ms" , sans-serif;"> A: <i>|M<sub>a(i)</sub> - M<sub>b(r1)</sub>| + |M<sub>b(i)</sub> - M<sub>a(i)</sub>| + DP(i, r2)</i></span><br />
<span style="font-family: "trebuchet ms" , sans-serif;"> B: <i>|M<sub>a(i)</sub> - M<sub>b(r2)</sub>| + |M<sub>b(i)</sub> - M<sub>a(i)</sub>| + DP(r1, i)</i></span><br />
<div>
<br /></div>
<div>
<span style="font-family: inherit;">The number of subproblems equals <i>N</i>*<i>N</i>, because that's how many states the robots can be in (each robot is described by the last query it executed). The time to compute one subproblem is constant, therefore the final complexity of this solution is O(N<sup>2</sup>).</span><br />
<span style="font-family: inherit;"><b>Step 4. </b>So is it a finite algorithm? Yes, because the graph is traversed through two increasing values, <i>r1</i> and <i>r2</i>.</span><br />
<span style="font-family: inherit;"><b>Step 5. </b>The solution is in DP(-1, -1), assuming -1 denotes that a robot hasn't been used yet and the cost of moving it to </span><i>M<sub>x</sub></i> is 0 (because it starts at that location).<br />
<h2>
The code</h2>
</div>
<pre class="brush: csharp">using System;
using System.Collections.Generic;
using System.Linq;
namespace Algos
{
public class Solver
{
private readonly int _m;
private readonly int _n;
private readonly List<Tuple<int, int>> _queries;
private int[][] _memo;
public Solver(int m, int n, List<Tuple<int, int>> queries)
{
_m = m;
_n = n;
_queries = queries;
}
public int SolveCase()
{
_memo = new int[_n + 1][];
for (int i = 0; i < _n + 1; i++)
_memo[i] = new int[_n + 1];
for (int i = _n; i >= 0; i--)
{
for (int j = _n; j >= 0; j--)
{
_memo[i][j] = -1;
}
}
return Dp(-1, -1);
}
private int Dp(int r1, int r2)
{
if (r1 + 1 == _n || r2 + 1 == _n)
return 0;
if (_memo[r1 + 1][r2 + 1] != -1)
return _memo[r1 + 1][r2 + 1];
var i = Math.Max(r1, r2) + 1;
//r1 stays in place
var r2Cover = 0;
if (r2 != -1)
r2Cover = Math.Abs(_queries[r2].Item2 - _queries[i].Item1);
var d1 = Dp(r1, i);
//r2 stays in place
var r1Cover = 0;
if (r1 != -1)
r1Cover = Math.Abs(_queries[r1].Item2 - _queries[i].Item1);
var d2 = Dp(i, r2);
var queryCover = Math.Abs(_queries[i].Item1 - _queries[i].Item2);
var min = Math.Min(r2Cover + d1 + queryCover, r1Cover + d2 + queryCover);
_memo[r1 + 1][r2 + 1] = min;
return min;
}
}
class Solution
{
static void Main(string[] args)
{
//CaseSub0();
//Case0();
//Case1();
//Case2();
//RandomBigCase();
//return;
var T = int.Parse(Console.ReadLine());
for (var i = 0; i < T; i++)
{
ProcessCase();
}
}
private static void ProcessCase()
{
var @case = Console.ReadLine().Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
var M = int.Parse(@case[0]);
var N = int.Parse(@case[1]);
var queries = new List<Tuple<int, int>>(N);
for (int i = 0; i < N; i++)
{
var query = Console.ReadLine().Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
queries.Add(Tuple.Create(int.Parse(query[0]), int.Parse(query[1])));
}
var solution = new Solver(M, N, queries).SolveCase();
Console.WriteLine(solution);
}
private static void CaseSub0()
{
var M = 2;
var N = 1;
var queries = new List<Tuple<int, int>>();
queries.Add(Tuple.Create(1, 2));
var solution = new Solver(M, N, queries).SolveCase();
Console.WriteLine("Casesub0 " + solution);
}
private static void Case0()
{
var M = 5;
var N = 4;
var queries = new List<Tuple<int, int>>();
queries.Add(Tuple.Create(1, 5));
queries.Add(Tuple.Create(3, 2));
queries.Add(Tuple.Create(4, 1));
queries.Add(Tuple.Create(2, 4));
var solution = new Solver(M, N, queries).SolveCase();
Console.WriteLine("Case0 " + solution);
}
private static void Case1()
{
var M = 4;
var N = 2;
var queries = new List<Tuple<int, int>>();
queries.Add(Tuple.Create(1, 2));
queries.Add(Tuple.Create(4, 3));
var solution = new Solver(M, N, queries).SolveCase();
Console.WriteLine("Case1 " + solution);
}
private static void Case2()
{
var M = 10;
var N = 3;
var queries = new List<Tuple<int, int>>();
queries.Add(Tuple.Create(2, 4));
queries.Add(Tuple.Create(5, 4));
queries.Add(Tuple.Create(9, 8));
var solution = new Solver(M, N, queries).SolveCase();
Console.WriteLine("Case2 " + solution);
}
private static void RandomBigCase()
{
var M = 1000;
var N = 1000;
var queries = new List<Tuple<int, int>>();
var r = new Random(666);
while (queries.Count != N)
{
var from = r.Next(1, M + 1);
var to = r.Next(1, M + 1);
var t = Tuple.Create(from, to);
if (!queries.Contains(t))
queries.Add(t);
}
var solution = new Solver(M, N, queries).SolveCase();
Console.WriteLine("RandomBigCase " + solution);
}
}
}
</pre>
PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com0tag:blogger.com,1999:blog-92551346682037876.post-74373309373254340742016-10-31T14:15:00.000+01:002017-11-03T09:32:17.518+01:00Dynamic Programming in 5 easy steps - Examples - Set partitioning<h2>
<span style="color: red;">Please find the updated version of this post here: <a href="https://piotr.westfalewicz.com/blog/2016/10/dynamic-programming-in-5-easy-steps---examples---set-partitioning/">https://piotr.westfalewicz.com/blog/2016/10/dynamic-programming-in-5-easy-steps---examples---set-partitioning/</a></span></h2>
<br />
Let's continue the Dynamic Programming series. Last time we covered the <a href="http://dynamicallyinvokable.blogspot.com/2016/10/dynamic-programming-in-5-easy-steps.html" target="_blank">Text Justification</a> problem; this time we will do something harder. The problem statement is as follows:<br />
<blockquote class="tr_bq">
Given a set S of positive integers, divide it into two sets S1 and S2 such that the absolute difference between their sums is minimum.<br />
<br />
If there is a set S with n elements, then if we assume S1 has m elements, S2 must have n-m elements and the value of abs(sum(S1) – sum(S2)) should be minimum.</blockquote>
The 5 easy steps are:<br />
<ol>
<li style="margin: 0px 0px 0.25em; padding: 0px;">define subproblems, count number of subproblems</li>
<li style="margin: 0px 0px 0.25em; padding: 0px;">guess (part of solution), count number of choices</li>
<li style="margin: 0px 0px 0.25em; padding: 0px;">relate subproblem solutions, compute time per subproblem</li>
<li style="margin: 0px 0px 0.25em; padding: 0px;">recurse + memoize OR build DP table bottom-up; check subproblems are acyclic/in topological order; time = time per subproblem · number of subproblems</li>
<li style="margin: 0px 0px 0.25em; padding: 0px;">solve original problem: = a subproblem OR by combining subproblem solutions ⇒ extra time </li>
</ol>
<div>
<div>
Taking an example, assume a set S={ 1, 6, 11, 5}.</div>
<div>
<b>Step 1. </b>As previously, we want to divide the problem into subproblems.</div>
<div>
How can this be done? Suffixes, prefixes or a mix of the two are usually the answer. It's natural to imagine that in the middle of processing, the remaining part of the set looks like this: ..., 11, 5. Notice that at this very point we have already assumed we are working on suffixes.</div>
<div>
In that situation we can either include the 11 in the set S1 or in the set S2. I'll say that we take or leave the number, because we have only two choices.</div>
<div>
A related, very important aspect - we have to track only one sum! The other one is just totalSum - chosenSetSum.</div>
<div>
Getting back to our situation, what is the "context" of the subproblem? Does it matter how we got to this point? Not at all - what matters is only the chosenSetSum. Obviously, we also have to keep track of our current item index. There we have it: the subproblems.<br />
<br /></div>
<div>
<b>Step 2.</b> What will our guess look like? We can increase the sum and progress with the problem,</div>
<div>
or we can just progress with the problem, assuming we leave the item and it ends up in the second set.</div>
<div>
<b><br /></b>
<b>Step 3.</b> Now we have to glue the two pieces together. Our DP function will have two arguments as noted above, chosenSetSum and index.</div>
<div>
We want to optimize for the minimum difference between the sets and that's what will be returned by the DP function. All in all, the equation is as follows:</div>
<blockquote class="tr_bq">
DP(chosenSetSum, i) = min from:
<br />
<ul>
<li>DP(chosenSetSum, i+1) //when skipping the i'th item</li>
<li>DP(chosenSetSum + inputSet[i], i+1)) //when taking the i'th item</li>
</ul>
</blockquote>
<div>
The corner case is DP(X, inputSet.Length) = abs(totalSum - 2*X), because that's the end of partitioning - the chosen set sums to X, the other to totalSum - X, and their absolute difference is what we return.</div>
<div>
The time complexity of this approach is:
<br />
<pre class="brush: text">time = time per subproblem · number of subproblems, so
time = O(1) * O(sum * n) = O(sum * n)</pre>
</div>
<div>
Yay, we have a pseudo-polynomial time solution.<br />
<b><br /></b>
<b>Step 4.</b> Is the graph of DP function invocations actually acyclic? My reasoning here is that we have two dimensions (chosenSetSum and item index) which both only increase, and the recursion ends when the index reaches the input set's cardinality; thus it's a finite algorithm.<br />
<b><br /></b>
<b>Step 5. </b>We start with chosenSetSum=0 and at the first index, therefore our solution is at DP(0,0).<br />
<h2>
The Code</h2>
</div>
</div>
<div>
Having the 5 steps figured out, it's relatively easy to code the solution.<br />
<br /></div>
<div>
<h3>
Pure DP algorithm coded without memoization</h3>
</div>
<pre class="brush: csharp">private class Partitioner
{
private readonly int[] _inputSet;
private readonly int _inputSetSum;
public Partitioner(int[] inputSet)
{
_inputSet = inputSet;
_inputSetSum = _inputSet.Sum();
}
public void Solve()
{
DP(0, 0);
}
private int DP(int chosenSetSum, int i)
{
if (i == _inputSet.Length)
//Note: returning the difference between chosenSetSum and skippedSetSum,
//not the skippedSetSum (which would be _inputSetSum - chosenSetSum)
return Math.Abs(_inputSetSum - 2 * chosenSetSum);
var chooseCurrentNumberScore = DP(chosenSetSum + _inputSet[i], i + 1);
var skipCurrentNumberScore = DP(chosenSetSum, i + 1);
var minDiff = Math.Min(skipCurrentNumberScore, chooseCurrentNumberScore) ;
return minDiff;
}
}</pre>
How cool is that? 6 lines of code for an almost working solution!<br />
<div>
<h3>
</h3>
<h3>
Memoization added</h3>
</div>
<pre class="brush: csharp">private class Partitioner
{
private const int Marker = -1;
private readonly int[] _inputSet;
private readonly int _inputSetSum;
private readonly int[][] _memo;
public Partitioner(int[] inputSet)
{
_inputSet = inputSet;
_inputSetSum = _inputSet.Sum();
_memo = new int[_inputSetSum][];
for (var i = 0; i < _inputSetSum; i++)
{
_memo[i] = new int[_inputSet.Length];
for (var j = 0; j < _inputSet.Length; j++)
_memo[i][j] = Marker;
}
}
public void Solve()
{
DP(0, 0);
}
private int DP(int chosenSetSum, int i)
{
if (i == _inputSet.Length)
//Note: returning the difference between chosenSetSum and skippedSetSum,
//not the skippedSetSum (which would be _inputSetSum - chosenSetSum)
return Math.Abs(_inputSetSum - 2 * chosenSetSum);
if (_memo[chosenSetSum][i] != Marker)
return _memo[chosenSetSum][i];
var chooseCurrentNumberScore = DP(chosenSetSum + _inputSet[i], i + 1);
var skipCurrentNumberScore = DP(chosenSetSum, i + 1);
var minDiff = Math.Min(skipCurrentNumberScore, chooseCurrentNumberScore);
_memo[chosenSetSum][i] = minDiff;
return minDiff;
}
}</pre>
<div>
<h3>
</h3>
<h3>
Full solution with parent pointers and examples</h3>
</div>
<pre class="brush: csharp">using System;
using System.Collections.Generic;
using System.Linq;
public class DP_SetPartition
{
static void Main(String[] args)
{
PrintSolution(new int[0]);
PrintSolution(new[] { 3 });
PrintSolution(new[] { 3, 3 });
PrintSolution(new[] { 3, 2 });
PrintSolution(new[] { 1, 4, 2 });
PrintSolution(new[] { 1, 2, 4 });
PrintSolution(new[] { 1, 1, 5, 1, 1, 1 });
PrintSolution(new[] { 1, 6, 11, 5 });
PrintSolution(new[] { 1, 6, 11, 5 });
PrintSolution(new[] { 25, 21, 20, 17, 8 });
PrintSolution(new[] { 1, 5, 6, 10 });
PrintSolution(new[] { 3, 1, 4, 2, 2, 1 });
PrintSolution(new[] { 200, 200, 300, 300, 400, 400, 1000, 1000 });
PrintSolution(new[] { 100, 100, 200, 200, 300, 300, 400, 400, 1000, 1000 });
Console.ReadLine();
}
private static void PrintSolution(int[] set)
{
Console.WriteLine($"Set: {string.Join(", ", set)}");
int[] set1;
int[] set2;
new Partitioner(set).Solve(out set1, out set2);
Console.WriteLine($"A: {string.Join(", ", set1)} \t\t B: {string.Join(", ", set2)}, sumdiff: {Math.Abs(set1.Sum() - set2.Sum())}");
Console.WriteLine();
}
private class Partitioner
{
private const int Marker = -1;
private readonly int[] _inputSet;
private readonly bool[][] _parentPointers;
private readonly int _inputSetSum;
private readonly int[][] _memo;
public Partitioner(int[] inputSet)
{
_inputSet = inputSet;
_inputSetSum = _inputSet.Sum();
_memo = new int[_inputSetSum][];
_parentPointers = new bool[_inputSetSum][];
for (var i = 0; i < _inputSetSum; i++)
{
_memo[i] = new int[_inputSet.Length];
_parentPointers[i] = new bool[_inputSet.Length];
for (var j = 0; j < _inputSet.Length; j++)
{
_memo[i][j] = Marker;
}
}
}
public void Solve(out int[] set1, out int[] set2)
{
DP(0, 0);
var chosenSet = new List<int>();
var skippedSet = new List<int>();
var sum = 0;
var i = 0;
while (i != _inputSet.Length)
{
var choose = _parentPointers[sum][i];
if (choose)
{
chosenSet.Add(_inputSet[i]);
sum += _inputSet[i];
}
else
{
skippedSet.Add(_inputSet[i]);
}
i++;
}
set1 = chosenSet.ToArray();
set2 = skippedSet.ToArray();
}
private int DP(int chosenSetSum, int i)
{
if (i == _inputSet.Length)
return Math.Abs(_inputSetSum - 2 * chosenSetSum);
if (_memo[chosenSetSum][i] != Marker)
return _memo[chosenSetSum][i];
var chooseCurrentNumberScore = DP(chosenSetSum + _inputSet[i], i + 1);
var skipCurrentNumberScore = DP(chosenSetSum, i + 1);
var minDiff = skipCurrentNumberScore;
if (chooseCurrentNumberScore < skipCurrentNumberScore)
{
minDiff = chooseCurrentNumberScore;
_parentPointers[chosenSetSum][i] = true;
}
_memo[chosenSetSum][i] = minDiff;
return minDiff;
}
}
}</pre>
It's working!
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-fUtUltSTHFg/WBT1VZnK9PI/AAAAAAAACOc/I03hTG4_Q5AFrzEJgSeaW-PotDw_Xk8sgCLcB/s1600/DP_SetPartitioning.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="356" src="https://2.bp.blogspot.com/-fUtUltSTHFg/WBT1VZnK9PI/AAAAAAAACOc/I03hTG4_Q5AFrzEJgSeaW-PotDw_Xk8sgCLcB/s400/DP_SetPartitioning.png" width="400" /></a></div>
<br />PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com0tag:blogger.com,1999:blog-92551346682037876.post-84720323567229962222016-10-25T08:00:00.000+02:002017-11-03T09:33:02.083+01:00Dynamic Programming in 5 easy steps - Examples - Text Justification<h2>
<span style="color: red;">Please find the updated version of this post here: <a href="https://piotr.westfalewicz.com/blog/2016/10/dynamic-programming-in-5-easy-steps---examples---text-justification/">https://piotr.westfalewicz.com/blog/2016/10/dynamic-programming-in-5-easy-steps---examples---text-justification/</a></span></h2>
<br />
Lately I have spent a relatively large amount of time revising Algorithms and Data Structures. It's a broad subject, but for me personally, Dynamic Programming was the abandoned and unused method. It's also considered one of the hardest methods to master, with few examples on the internet. Let's contribute a little with this post series. Today I will cover the first problem: text justification. The problem comes from MIT lectures, and I highly recommend the well-explained videos along with the notes from course 6.006: <a href="https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/" target="_blank">Introduction to algorithms - fall 2011</a>.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-HS4qKKuV7-8/WAkop1hBqiI/AAAAAAAACMw/FiZgn2-ZuK8IUC8eKrwzteMhloxvI73CwCLcB/s1600/text-alignment.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="200" src="https://1.bp.blogspot.com/-HS4qKKuV7-8/WAkop1hBqiI/AAAAAAAACMw/FiZgn2-ZuK8IUC8eKrwzteMhloxvI73CwCLcB/s200/text-alignment.png" width="176" /></a></div>
<h2>
Text justification</h2>
Given a text, split it into multiple lines so that the words are “nicely” distributed across the lines.<br />
What does it mean? Take a look at the example:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-3e9OP74Y578/WAko1w86lcI/AAAAAAAACM0/DTSxiD3QFcs9UK-bs3bQ9rRhEB7g6m64QCLcB/s1600/example.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="85" src="https://4.bp.blogspot.com/-3e9OP74Y578/WAko1w86lcI/AAAAAAAACM0/DTSxiD3QFcs9UK-bs3bQ9rRhEB7g6m64QCLcB/s400/example.PNG" width="400" /></a></div>
We can clearly see that an eager (greedy) approach won't produce good results in all cases.<br />
Mathematically, given a badness(i, j) function for a line of words[i : j], we seek a splitting such that the sum of all badness values is minimal.<br />
Let's assume the badness function is predefined as follows:<br />
<blockquote class="tr_bq">
∞ if total length > page width, else (page width − total length)<sup>3</sup></blockquote>
So in the example above, the bad splitting would have:<br />
<ul>
<li>page width = 16</li>
<li>first line length = 4+4+4+2 = 16, badness of the first line = 0</li>
<li>second line length = 4, badness = (16-4)<sup>3</sup>=1728</li>
<li>third line length = 16, badness = 0</li>
<li>badness sum = 1728</li>
</ul>
<div>
The good splitting would have:</div>
<div>
<ul>
<li>first line length = 4+4+1 = 9, badness of the first line = (16-9)<sup>3</sup>=343</li>
<li>second line length = 9, badness = (16-9)<sup>3</sup>=343</li>
<li>third line length = 14, badness = (16-14)<sup>3</sup>=8</li>
<li>badness sum = 694</li>
</ul>
<div>
The second splitting has smaller total badness sum, thus it's a better splitting.</div>
</div>
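<div>
The badness function above translates almost directly into code. Here is a minimal standalone sketch (my own version, assuming words are joined by single spaces, and using a large sentinel instead of ∞ so it can still safely be added to other badness values):</div>
<pre class="brush: csharp">using System.Linq;
public static class BadnessExample
{
    // badness(i, j) for a line holding words[i..j): a large sentinel
    // ("infinity") if the words don't fit, else (width - total length)^3.
    public static int Badness(string[] words, int i, int j, int width)
    {
        // total length = word lengths + one space between adjacent words
        var totalLength = words.Skip(i).Take(j - i).Sum(w => w.Length) + (j - i - 1);
        if (totalLength > width)
            return int.MaxValue / 2;
        var slack = width - totalLength;
        return slack * slack * slack;
    }
}</pre>
<div>
For instance, a line of two 4-letter words at width 16 has total length 4+4+1 = 9 and badness (16-9)<sup>3</sup> = 343, matching the first line of the good splitting above.</div>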
<div>
<br /></div>
<h2>
Dynamic Programming in 5 easy steps</h2>
<div>
The MIT course presents an approach to DP in 5 easy steps. Those are:</div>
<div>
<ol>
<li>define subproblems, count number of subproblems</li>
<li>guess (part of solution), count number of choices</li>
<li>relate subproblem solutions, compute time per subproblem</li>
<li>recurse + memoize OR build DP table bottom-up; check subproblems are acyclic/in topological order; time = time per subproblem · number of subproblems</li>
<li>solve original problem: = a subproblem
OR by combining subproblem solutions ⇒ extra time </li>
</ol>
</div>
Let's try to apply them to our problem.<br />
The steps don't always have to be executed in order. Let's try to figure out the <b>second</b> step first. We know where our line begins: at the ith position. Where does the line end? We don't know, so let's check all the possibilities from i onward and take the smallest badness. Now we also know what our subproblems from the <b>first</b> step are: the suffixes word[i:]. That's pretty much the algorithm. The <b>third </b>step will look like:<br />
<pre class="brush: text">DP(i)=for j in i+1...n+1 take
min from (DP(j) + badness(i, j))</pre>
The great thing about DP is that DP(j) is considered to be free (because we compute each subproblem once and memoize the answer). Therefore the total time needed by this algorithm is:<br />
<pre class="brush: text">time = time per subproblem · number of subproblems, so
time = O(n) · O(n)
time = O(n^2)</pre>
The <b>fourth </b>point is about proving that this really works and that the time needed is finite. We compute the answer from the end of the words down to word number 0. The recursive DP function always accesses greater indexes, therefore the time needed is finite. <b>Lastly</b>, we have to find our answer - here, obviously, it's DP(0). Easy? Yup, let's code it!<br />
<h2>
The code</h2>
<pre class="brush: csharp">using System;
using System.Collections.Generic;
using System.Linq;
namespace Algos
{
public class DP_TextJustification
{
static void Main(String[] args)
{
//please note: this is not a production-ready solution!
PrintSolution(" ", 10);
PrintSolution("oneword", 10);
PrintSolution("one line", 10);
PrintSolution("two lines", 6);
PrintSolution("blah blah blah blah reallylongword", 14);
PrintSolution("blah blah blah blah reallylongword 1", 14);
var txt1 = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " +
"Praesent varius urna eget lacinia pharetra. " +
"Vivamus lacinia in dui vitae sodales. " +
"Aliquam auctor justo nec condimentum pretium. " +
"Curabitur egestas tellus dolor, vel tempus lacus blandit vitae. " +
"Cras dapibus scelerisque ex nec gravida.";
PrintSolution(txt1, 60);
PrintSolution(txt1, 70);
PrintSolution(txt1, 80);
PrintSolution(txt1, 90);
PrintSolution(txt1, 100);
Console.ReadLine();
}
public static void PrintSolution(string txt, int width)
{
Console.WriteLine($"Test length: {txt.Length}, line width: {width}");
foreach (var line in new Justifier(txt, width).Solve())
{
Console.WriteLine("{0,-" + width + "}|", line);
}
Console.WriteLine();
}
private class Justifier
{
private const int Marker = -1;
private readonly int _width;
private readonly string[] _words;
private readonly int[] _parentPointers;
private readonly int[] _memo;
public Justifier(string text, int width)
{
_width = width;
_words = text.Split(new[] {" "}, StringSplitOptions.RemoveEmptyEntries);
_parentPointers = new int[_words.Length];
_memo = new int[_words.Length];
for (var i = 0; i < _memo.Length; i++)
{
_memo[i] = Marker;
}
}
public List<string> Solve()
{
DP(0);
var result = new List<string>();
var from = 0;
while (from != _words.Length)
{
var next = _parentPointers[from];
result.Add(string.Join(" ", _words.Skip(from).Take(next - from)));
from = next;
}
return result;
}
private int DP(int i)
{
if (i == _words.Length) //no words in line is the end of justification
return 0;
if (_memo[i] != Marker) //if we have solution calculated, return it
return _memo[i];
var minBadness = int.MaxValue;
for (var j = i + 1; j <= _words.Length; j++)
{
var badness = Badness(i, j) + DP(j);
if (minBadness > badness)
{
_parentPointers[i] = j; //remember best choice
minBadness = badness;
}
}
_memo[i] = minBadness; //remember solution
return minBadness;
}
private int Badness(int i, int j)
{
var lengthOfWords = _words.Skip(i).Take(j - i).Sum(s => s.Length);
var numberOfSpaces = j - i - 1;
if (lengthOfWords + numberOfSpaces > _width)
return 10*1000*1000;
return (int)Math.Pow(_width - lengthOfWords - numberOfSpaces, 3);
}
}
}
}
</pre>
To the solution in 5 easy steps I've added parent pointers - a simple way to remember the best choices and reconstruct the solution from the DP. The result is as follows:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-v_zJuyZsfgo/WA5rjKHazYI/AAAAAAAACNU/zyWKnbNM_6YUe5zB_iyYlM_5VDqcOLcOQCLcB/s1600/DP_TextJustification_Result1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="276" src="https://1.bp.blogspot.com/-v_zJuyZsfgo/WA5rjKHazYI/AAAAAAAACNU/zyWKnbNM_6YUe5zB_iyYlM_5VDqcOLcOQCLcB/s400/DP_TextJustification_Result1.PNG" width="400" /></a></div>
If we don't care about justifying the last line, we can easily modify the badness function to return no cost for the last line:
<br />
<pre class="brush: csharp">private int Badness(int i, int j)
{
var lengthOfWords = _words.Skip(i).Take(j - i).Sum(s => s.Length);
var numberOfSpaces = j - i - 1;
if (lengthOfWords + numberOfSpaces > _width)
return 10*1000*1000;
if (j == _words.Length) //don't care about last line
return 0;
return (int)Math.Pow(_width - lengthOfWords - numberOfSpaces, 3);
}</pre>
With the modified badness function, the result can be quite different:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-JP-p0xgiIyM/WA5tCG_hzcI/AAAAAAAACNg/UMfzL-7lDog38W59pF8PsaPhjQRhXQc0wCLcB/s1600/DP_TextJustification_Result2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="201" src="https://3.bp.blogspot.com/-JP-p0xgiIyM/WA5tCG_hzcI/AAAAAAAACNg/UMfzL-7lDog38W59pF8PsaPhjQRhXQc0wCLcB/s400/DP_TextJustification_Result2.PNG" width="400" /></a></div>
<br />PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com0tag:blogger.com,1999:blog-92551346682037876.post-20610600447676495002016-09-06T18:30:00.000+02:002017-11-03T09:33:46.197+01:00Cassandra Datastax C# Driver problems - solution<h2>
<span style="color: red;">Please find the updated version of this post here: <a href="https://piotr.westfalewicz.com/blog/2016/09/cassandra-datastax-c-sharp-driver-problems---solution/">https://piotr.westfalewicz.com/blog/2016/09/cassandra-datastax-c-sharp-driver-problems---solution/</a></span></h2>
<br />
<a href="http://dynamicallyinvokable.blogspot.com/2016/08/cassandra-datastax-c-driver-problems.html" target="_blank">In the previous post</a> I described a strange problem related to the Cassandra Datastax C# Driver which was happening in the production environment. It's time to reveal the mystery.<br />
<br />
<h2>
Two root causes</h2>
<h3>
1.</h3>
<div>
One of the hidden but important metrics, which you usually won't find in your logs, is CPU usage. What matters here is that the connection setup to the Cassandra cluster consists of many small steps. In production, when CPU usage was very high (around 100% - for a reason already known and eliminated), the connection setup process was timing out at such a moment that the final result was reported as a NoHostAvailableException. This shows how important it is to track and prevent 100% CPU usage.<br />
<br /></div>
<h3>
2.</h3>
<div>
But why? Let me quote myself:</div>
<blockquote class="tr_bq">
Things get back to normal after the client restart... and get back to madness a few hours later, at higher load. An incredibly high number of NoHostAvailableExceptions, as if almost every connection to Cassandra fails.</blockquote>
The problem is here:
<br />
<pre class="brush: csharp">private readonly Cluster _cluster;
private readonly ConcurrentDictionary<string, Lazy<ISession>> _sessions; //lockless session cache
public ISession GetSession(string keyspaceName)
{
if (!_sessions.ContainsKey(keyspaceName))
{
_sessions.GetOrAdd(keyspaceName, key => new Lazy<ISession>(() => _cluster.Connect(key)));
}
var result = _sessions[keyspaceName];
return result.Value;
}
</pre>
Do you see it? It turns out that when an exception is thrown in the <b>_cluster.Connect(key)</b> method, <b>Lazy<T> will cache this exception</b>. Therefore all invocations of <b>Lazy<T>.Value</b> will result in the same cached exception, instead of retrying the connection to the Cassandra cluster. If you are planning to use the Lazy<T> class, there are more "gotchas". <a href="https://msdn.microsoft.com/en-us/library/dd642331(v=vs.110).aspx" target="_blank">Read the documentation on MSDN</a>.<br />
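The behavior is easy to reproduce with a plain Lazy<int>. The sketch below is my own toy example (not driver code); it shows that in the default mode the first exception is cached forever, while LazyThreadSafetyMode.PublicationOnly never caches exceptions, so the factory is retried:

```csharp
using System;
using System.Threading;

class LazyExceptionDemo
{
    static int _attempts;

    // a factory that fails twice, then succeeds - simulating flaky Connect()
    static int Connect()
    {
        _attempts++;
        if (_attempts < 3)
            throw new InvalidOperationException("connection failed");
        return _attempts;
    }

    static void Main()
    {
        // Default mode (ExecutionAndPublication): the first exception is cached,
        // so every access to .Value rethrows it - the factory runs only once.
        var cached = new Lazy<int>(Connect);
        for (var i = 0; i < 3; i++)
        {
            try { var _ = cached.Value; }
            catch (InvalidOperationException) { }
        }
        Console.WriteLine(_attempts); // 1 - no retry ever happened

        // PublicationOnly: exceptions are never cached, so the factory
        // is re-run on every access until it succeeds.
        _attempts = 0;
        var retrying = new Lazy<int>(Connect, LazyThreadSafetyMode.PublicationOnly);
        for (var i = 0; i < 3; i++)
        {
            try { Console.WriteLine(retrying.Value); break; }
            catch (InvalidOperationException) { }
        }
        Console.WriteLine(_attempts); // 3 - succeeded on the third attempt
    }
}
```

In the session cache above, passing LazyThreadSafetyMode.PublicationOnly to the Lazy<ISession> constructor would make failed connection attempts retryable; the trade-off is that concurrent callers may race and each run the factory, with all but one result discarded (and not disposed).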
<br />
<h2>
Lessons learned?</h2>
<div>
<br />
<ol>
<li>CPU usage is a critical metric. Do not ignore it, as it may lead to numerous strange errors.</li>
<li><strike>RTFM!</strike> Read the documentation when using any class for the first time - especially when copying & pasting code from StackOverflow.</li>
</ol>
</div>
PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com0tag:blogger.com,1999:blog-92551346682037876.post-60999965825518757532016-08-30T17:55:00.001+02:002017-11-03T09:34:38.590+01:00Cassandra Datastax C# Driver problems - NoHostAvailableException<h2>
<span style="color: red;">Please find the updated version of this post here: <a href="https://piotr.westfalewicz.com/blog/2016/08/cassandra-datastax-c-sharp-driver-problems---nohostavailableexception/">https://piotr.westfalewicz.com/blog/2016/08/cassandra-datastax-c-sharp-driver-problems---nohostavailableexception/</a></span></h2>
<br />
This post is about my journey of fixing a nasty Cassandra Datastax C# driver problem, which took a lot more time than expected...
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-6JS92ywdBG8/V7yOACtVvjI/AAAAAAAACJk/DkSNsO1qUhEKBg-wpm90yP5fmTLej93LQCLcB/s1600/2000px-Bug_icon_-_Noun_project_198.svg.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="200" src="https://2.bp.blogspot.com/-6JS92ywdBG8/V7yOACtVvjI/AAAAAAAACJk/DkSNsO1qUhEKBg-wpm90yP5fmTLej93LQCLcB/s200/2000px-Bug_icon_-_Noun_project_198.svg.png" width="154" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Credits: <a href="https://upload.wikimedia.org/wikipedia/commons/thumb/5/55/Bug_icon_-_Noun_project_198.svg/2000px-Bug_icon_-_Noun_project_198.svg.png" target="_blank">wikimedia</a></td></tr>
</tbody></table>
<br />
Once upon a time, I was fixing the following exception:<br />
<pre class="brush: text">Cassandra.NoHostAvailableException: None of the hosts tried for query are available (tried: x.x.x.x:9042, x.x.x.x:9042, x.x.x.x:9042)
at Cassandra.ControlConnection.Connect(Boolean firstTime)
at Cassandra.Cluster.Connect(String keyspace)
at Company.Code.CassandraSessionCache.GetSession(String keyspaceName)
at Company...
</pre>
The CassandraSessionCache looked like this:
<br />
<pre class="brush: csharp">public class CassandraSessionCache
{
private readonly Cluster _cluster;
private readonly ConcurrentDictionary<string, Lazy<ISession>> _sessions; //lockless session cache
public CassandraSessionCache(Cluster cluster)
{
_cluster = cluster;
_sessions = new ConcurrentDictionary<string, Lazy<ISession>>();
}
public ISession GetSession(string keyspaceName)
{
if (!_sessions.ContainsKey(keyspaceName))
{
_sessions.GetOrAdd(keyspaceName, key => new Lazy<ISession>(() => _cluster.Connect(key)));
}
var result = _sessions[keyspaceName];
return result.Value;
}
}
</pre>
Nothing fancy. However, let me give you some insight into the architecture and the circumstances of the error:<br />
<ul>
<li>Cassandra cluster is in Amazon</li>
<li>The client is Cassandra Datastax C# Driver 2.6.0, also on a server in Amazon</li>
<li>Both the client and the Cassandra cluster are in the same Amazon Region</li>
<li>The Amazon Region had no availability issues during the given period</li>
<li>The solution had been working fine for over a month! The client process is restarted ~every week for various reasons</li>
<li>The client follows <a href="http://www.datastax.com/dev/blog/4-simple-rules-when-using-the-datastax-drivers-for-cassandra" target="_blank">Cassandra Datastax C# Driver Best Practices</a></li>
<li>Heartbeat is turned on, so the connection should be alive at all times</li>
<li><b>Things get back to normal after the client restart... and get back to madness a few hours later, at higher load. An incredibly high number of NoHostAvailableExceptions, as if almost every connection to Cassandra fails.</b></li>
<li>Of course, it works on my machine®</li>
</ul>
<h2>
What didn't happen?</h2>
There are plenty of questions about Cassandra.NoHostAvailableException on StackOverflow. Let's go through some of them and rule them out:<br />
<ul>
<li><a href="http://stackoverflow.com/questions/25145980/datastax-cassandra-java-driver-crashes-with-nohostavailableexception-after-a-few?rq=1" target="_blank">[1]</a>, <a href="http://stackoverflow.com/questions/24821966/nohostavailableexception-with-1000-concurrent-request-to-cassandra-with-datastax?rq=1" target="_blank">[2]</a> - no, because following C# Driver best practices excludes this</li>
<li><a href="http://stackoverflow.com/questions/20507904/cassandra-nohostavailableexception-while-there-is-still-alive-node?rq=1" target="_blank">[3]</a> - no, because we are using default retry strategy from driver version 2.6.0</li>
<li><a href="http://stackoverflow.com/questions/19419975/nohostavailableexception-all-hosts-tried-for-query-failed-exception-occurs?rq=1" target="_blank">[4]</a>, <a href="http://stackoverflow.com/questions/18724334/cant-connect-to-cassandra-nohostavailableexception?rq=1" target="_blank">[5]</a>, <a href="http://stackoverflow.com/questions/18643569/cassandra-nohostavailableexception?rq=1" target="_blank">[6]</a> - no, because we are able to connect to the Cluster at the beginning</li>
<li><a href="http://stackoverflow.com/questions/29076352/cassandra-datastax-driver-nohostavailableexception-during-large-batch-insert?rq=1" target="_blank">[7]</a> - no, because we are not misusing batches</li>
</ul>
<h2>
Debugging...</h2>
Logs on the server revealed that the client had closed the connection:
<br />
<pre class="brush: text">INFO [SharedPool-Worker-3] yyyy-mm-dd 11:04:45,625 Message.java:605 - Unexpected exception during request; channel = [id: 0x9eaf52c5, /x.x.x.x:y :> /x.x.x.x:9042]
java.io.IOException: Error while read(...): Connection reset by peer
at io.netty.channel.epoll.Native.readAddress(Native Method) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.doReadBytes(EpollSocketChannel.java:675) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.epollInReady(EpollSocketChannel.java:714) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.epollRdHupReady(EpollSocketChannel.java:689) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:329) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:264) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_80]
</pre>
Meanwhile, the message on the client says that there are no hosts available, which is confirmed by debug logs on the client side. Pretty interesting, huh? Confused, I decided to try updating the driver from 2.6.3 to 2.7... but that didn't help.
<br />
According to <a href="http://stackoverflow.com/questions/31705443/cassandra-nohostavailableexception-all-hosts-tried-for-query-failed-in-produc" target="_blank">yet another issue regarding NoHostAvailableException on StackOverflow</a>, I started to log the whole exception, with the serialized Errors property. This is what I logged:<br />
<pre class="brush: text">System.Exception: NoHostAvailableException happened. Errors: {
"x.x.x.x:9042": {
"NativeErrorCode": 10060,
"ClassName": "System.Net.Sockets.SocketException",
"Message": "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond",
"Data": null,
"InnerException": null,
"HelpURL": null,
"StackTraceString": "
at Cassandra.Connection.<Open>b__9(Task`1 t)
at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
at System.Threading.Tasks.Task.Execute()",
"RemoteStackTraceString": null,
"RemoteStackIndex": 0,
"ExceptionMethod": "8\n<Open>b__9\nCassandra, Version=2.7.0.0, Culture=neutral, PublicKeyToken=10b231fbfc8c4b4d\nCassandra.Connection\nCassandra.AbstractResponse <Open>b__9(System.Threading.Tasks.Task`1[Cassandra.AbstractResponse])",
"HResult": -2147467259,
"Source": "Cassandra",
"WatsonBuckets": null
},
"x.x.x.x:9042": {
"NativeErrorCode": 10060,
"ClassName": "System.Net.Sockets.SocketException",
"Message": "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond",
"Data": null,
"InnerException": null,
"HelpURL": null,
"StackTraceString": "
at Cassandra.Connection.<Open>b__9(Task`1 t)
at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
at System.Threading.Tasks.Task.Execute()",
"RemoteStackTraceString": null,
"RemoteStackIndex": 0,
"ExceptionMethod": "8\n<Open>b__9\nCassandra, Version=2.7.0.0, Culture=neutral, PublicKeyToken=10b231fbfc8c4b4d\nCassandra.Connection\nCassandra.AbstractResponse <Open>b__9(System.Threading.Tasks.Task`1[Cassandra.AbstractResponse])",
"HResult": -2147467259,
"Source": "Cassandra",
"WatsonBuckets": null
}
}
</pre>
Unfortunately, there is no interesting data here.<br />
<h2>
So what could possibly go wrong?</h2>
Can you spot the error? I couldn't. Any guesses? Find the answer in the next post.PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com2tag:blogger.com,1999:blog-92551346682037876.post-17054773049726146662016-08-18T08:00:00.000+02:002017-11-03T09:35:38.242+01:00Algorithms and data structures - non-academic trees<h2>
<span style="color: red;">Please find the updated version of this post here: <a href="https://piotr.westfalewicz.com/blog/2016/08/algorithms-and-data-structures---non-academic-trees/">https://piotr.westfalewicz.com/blog/2016/08/algorithms-and-data-structures---non-academic-trees/</a></span></h2>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-e5U84qje50s/V7N-4bLz_xI/AAAAAAAACI0/gjrcUtnCJQcf0Ba1wrdUcg6PLqKra47kACEw/s1600/300px-Binary_tree.svg.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="266" src="https://1.bp.blogspot.com/-e5U84qje50s/V7N-4bLz_xI/AAAAAAAACI0/gjrcUtnCJQcf0Ba1wrdUcg6PLqKra47kACEw/s320/300px-Binary_tree.svg.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Credits: <a href="https://en.wikipedia.org/wiki/Tree_(data_structure)#/media/File:Binary_tree.svg" target="_blank">Wikipedia Tree (data structure)</a></td></tr>
</tbody></table>
There are many types of trees covered in Computer Science lectures. These usually include: Binary Search Tree, AVL Tree, B-Tree, Splay Tree, Red-Black Tree, Trie Trees, Heap Trees.<br />
<br />
Those are indeed very useful and practical trees with lots of applications. However, I've discovered a few other trees while brushing up my knowledge of algorithms and data structures. Here they are - the most interesting, yet not so popular trees:<br />
<ul>
<li><a href="https://nullwords.wordpress.com/2013/03/13/the-bk-tree-a-data-structure-for-spell-checking/" target="_blank">BK-Tree</a> - do you want to find misspellings of a word in a dictionary? E.g. given the word "dog" and the dictionary { "cat", "fog", "dot", "cookie" }, the naive approach is to compare "dog" to all of the entries in the dictionary, which takes O(n) time. It can be done in O(lg n) time, though. The Burkhard-Keller tree is used in <a href="https://lucene.apache.org/core/" target="_blank">Apache Lucene</a>, for example. <a href="https://nullwords.wordpress.com/2013/03/13/the-bk-tree-a-data-structure-for-spell-checking/" target="_blank">Head to Xenopax's Blog for an awesome post about BK-Trees.</a></li>
<li><a href="https://en.wikipedia.org/wiki/Merkle_tree" target="_blank">Merkle Tree</a> - you probably didn't know, but that's the name of the tree of commits and blobs in the Git VCS. Other applications known to me personally include Cassandra (during node repair) and the Bitcoin blockchain.</li>
<li><a href="https://en.wikipedia.org/wiki/Interval_tree" target="_blank">Interval Tree</a> - interesting idea of augmenting "normal" (single value) trees with additional data in order to solve windowing queries.</li>
<li><a href="https://www.youtube.com/watch?v=Va0vs1fhhNI" target="_blank">Lemon Tree</a> - the most complicated type of tree. Many wondered what it really is, but few actually knew... <a href="http://forum.thefreedictionary.com/postst4230_Meaning-of--lemon-tree--in-the-same-name-song-of-Fool-garden.aspx" target="_blank">Find the official statement here</a>.</li>
</ul>
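To make the BK-tree idea from the first bullet concrete, here is a minimal sketch of my own (class and method names are mine, not from any library), using Levenshtein distance as the metric. Each child edge is labeled with a distance, and the triangle inequality lets the search prune whole subtrees:

```csharp
using System;
using System.Collections.Generic;

class BkTree
{
    private class Node
    {
        public string Word;
        public Dictionary<int, Node> Children = new Dictionary<int, Node>();
    }

    private Node _root;

    public void Add(string word)
    {
        if (_root == null) { _root = new Node { Word = word }; return; }
        var node = _root;
        while (true)
        {
            var d = Levenshtein(word, node.Word);
            if (d == 0) return; // word already present
            Node child;
            if (!node.Children.TryGetValue(d, out child))
            {
                node.Children[d] = new Node { Word = word };
                return;
            }
            node = child; // descend along the edge with the same distance
        }
    }

    public List<string> Search(string word, int tolerance)
    {
        var result = new List<string>();
        if (_root != null) Search(_root, word, tolerance, result);
        return result;
    }

    private void Search(Node node, string word, int tolerance, List<string> result)
    {
        var d = Levenshtein(word, node.Word);
        if (d <= tolerance) result.Add(node.Word);
        // triangle inequality: only subtrees with edge labels in [d-tol, d+tol] can match
        for (var k = d - tolerance; k <= d + tolerance; k++)
        {
            Node child;
            if (node.Children.TryGetValue(k, out child))
                Search(child, word, tolerance, result);
        }
    }

    // classic O(|a|·|b|) edit-distance DP
    private static int Levenshtein(string a, string b)
    {
        var dp = new int[a.Length + 1, b.Length + 1];
        for (var i = 0; i <= a.Length; i++) dp[i, 0] = i;
        for (var j = 0; j <= b.Length; j++) dp[0, j] = j;
        for (var i = 1; i <= a.Length; i++)
            for (var j = 1; j <= b.Length; j++)
                dp[i, j] = Math.Min(
                    Math.Min(dp[i - 1, j] + 1, dp[i, j - 1] + 1),
                    dp[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
        return dp[a.Length, b.Length];
    }

    static void Main()
    {
        var tree = new BkTree();
        foreach (var w in new[] { "cat", "fog", "dot", "cookie" }) tree.Add(w);
        Console.WriteLine(string.Join(", ", tree.Search("dog", 1))); // dot, fog
    }
}
```

With the dictionary from the bullet above, searching for "dog" with tolerance 1 finds "dot" and "fog" while pruning "cat" and "cookie".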
PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com0tag:blogger.com,1999:blog-92551346682037876.post-22343739591005751722016-07-26T08:00:00.000+02:002016-07-26T08:00:12.414+02:00The performance of setting T[] vs. List by index<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-IxyOabqli3w/V5ZniYrZbiI/AAAAAAAACHI/dkNdpl49pFULMK574jCPW3yfTg_4vElKgCLcB/s1600/Big-O-notation%2B-%2Bgradient%2Bdescent.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="163" src="https://3.bp.blogspot.com/-IxyOabqli3w/V5ZniYrZbiI/AAAAAAAACHI/dkNdpl49pFULMK574jCPW3yfTg_4vElKgCLcB/s200/Big-O-notation%2B-%2Bgradient%2Bdescent.png" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Source: <a href="https://upload.wikimedia.org/wikipedia/commons/6/68/Gradient_ascent_(surface).png" target="_blank">wikimedia.org</a></td></tr>
</tbody></table>
<br />
Let's compare the asymptotic time complexity of the two following loops:<br />
<pre class="brush: csharp">int[] _array = Enumerable.Range(0, n).ToArray();
List<int> _list = Enumerable.Range(0, n).ToList();
//ListChangeByIndex
for (var i = 0; i < n; i++)
{
_list[i] = i + 1;
}
//ArrayChangeByIndex
for (var i = 0; i < n; i++)
{
_array[i] = i + 1;
}</pre>
Which one do you think is faster?<br />
<div style="text-align: center;">
.</div>
<div style="text-align: center;">
.</div>
<div style="text-align: center;">
.</div>
<div style="text-align: center;">
.<br />
<div style="text-align: center;">
.</div>
<div style="text-align: center;">
.</div>
<div style="text-align: center;">
.</div>
<div style="text-align: center;">
.</div>
</div>
Many developers think the first one will be slower, because in each iteration the computer is forced to visit all nodes from 0 to i before finally setting the value.<br />
However, that's not the case. Both loops have O(n) complexity. That's because in the .NET source we can clearly see that the underlying data structure of a List<T> is an array: <a href="http://referencesource.microsoft.com/#mscorlib/system/collections/generic/list.cs,40" target="_blank">list.cs</a>. Therefore, those two loops are essentially equal.PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com2tag:blogger.com,1999:blog-92551346682037876.post-75621947136996204372016-07-12T08:00:00.000+02:002016-07-12T08:00:19.124+02:00Presentation recommendation - Cloud-based Microservices powering BBC iPlayerToday I recommend the following presentation: <a href="https://www.infoq.com/presentations/bbc-microservices-aws" target="_blank">Cloud-based Microservices powering BBC iPlayer</a><br />
<br />
Why? It's interesting (at least to me) how one of the most popular British broadcasting organisations makes its channels available online. A high-level architecture is presented. If you are new to AWS, you will also learn a thing or two about the Amazon Cloud.PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com0tag:blogger.com,1999:blog-92551346682037876.post-89333797440216667432016-06-30T23:59:00.000+02:002016-07-01T00:02:34.615+02:00Git tips - replace all occurrences of a string in files<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-ZeR-gaAY7Vs/V3WW3D3hbjI/AAAAAAAACFo/3vfoiPOm8hERdxWsjRlx9Zfrrt_qFflpgCLcB/s1600/gitlogo.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-ZeR-gaAY7Vs/V3WW3D3hbjI/AAAAAAAACFo/3vfoiPOm8hERdxWsjRlx9Zfrrt_qFflpgCLcB/s1600/gitlogo.png" /></a></div>
<br />
Git can be used from Visual Studio, but that's like saying you drive a car when you're actually playing Need for Speed. Unleash the full power of Git and learn to use it directly. It's not that hard.<br />
<div>
<br /></div>
<div>
It doesn't matter whether you are a beginner or an advanced user - you should know what a git alias is. If you don't, go here immediately: <a href="https://git-scm.com/book/en/v2/Git-Basics-Git-Aliases" target="_blank">Git Basics - Git Aliases</a>.</div>
<div>
<br /></div>
<div>
Today, you will get a very useful git alias - one that replaces all occurrences of one string with another. Suppose you want to replace EntityFramework with NHibernate in your project (which seems to be a pretty reasonable thing to do :) ). Here is the alias:</div>
<blockquote class="tr_bq">
replaceall = "!f() { git grep -l \"$1\" | xargs sed -b -i \"s/$1/$2/g\"; }; f"</blockquote>
<div>
The first part of the alias lists all files containing the first argument and passes them through a pipe and xargs to sed, which performs the replacement. Use it like this:<br />
<blockquote class="tr_bq">
git replaceall EntityFramework NHibernate</blockquote>
Please note: it will replace all occurrences of "EntityFramework" with "NHibernate" in all <i>tracked</i> files - by default, git grep does not search untracked files.</div>
PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com0tag:blogger.com,1999:blog-92551346682037876.post-45587557378654161692016-06-13T08:00:00.000+02:002017-11-03T09:36:29.697+01:00Messing with C# types. Making type1.FullName==type2.FullName, but not type1==type2!<h2>
<span style="color: red;">Please find the updated version of this post here: <a href="https://piotr.westfalewicz.com/blog/2016/07/the-performance-of-setting-t-vs.-list-by-index/">https://piotr.westfalewicz.com/blog/2016/07/the-performance-of-setting-t-vs.-list-by-index/</a></span></h2>
<br />
Given the following method:<br />
<pre class="brush: csharp; highlight: [21]">private static void CompareTypes(Type type1, Type type2)
{
Console.WriteLine($"type1.FullName = {type1.FullName}");
Console.WriteLine($"type2.FullName = {type2.FullName}");
Console.WriteLine($"type1.FullName {(type1.FullName == type2.FullName ? '=' : '!')}= type2.Fullname");
Console.WriteLine($"type1.AssemblyQualifiedName = {type1.AssemblyQualifiedName}");
Console.WriteLine($"type2.AssemblyQualifiedName = {type2.AssemblyQualifiedName}");
Console.WriteLine($"type1.AssemblyQualifiedName {(type1.AssemblyQualifiedName == type2.AssemblyQualifiedName ? '=' : '!')}= type2.AssemblyQualifiedName");
Console.WriteLine($"type1.GUID = {type1.GUID}");
Console.WriteLine($"type2.GUID = {type2.GUID}");
Console.WriteLine($"type1.GUID {(type1.GUID == type2.GUID ? '=' : '!')}= type2.GUID");
Console.WriteLine("o1 = Activator.CreateInstance(type1)");
Console.WriteLine("o2 = Activator.CreateInstance(type2)");
var o1 = Activator.CreateInstance(type1);
var o2 = Activator.CreateInstance(type2);
Console.WriteLine($"o1 == {o1}");
Console.WriteLine($"o2 == {o2}");
Console.WriteLine();
Console.WriteLine($"but... type1 {(type1 == type2 ? '=' : '!')}= type2");
}
</pre>
Is it possible to get the following result?<br />
<pre class="brush: text; highlight: [15]">type1.FullName = MyLibrary.MyPrecious
type2.FullName = MyLibrary.MyPrecious
type1.FullName == type2.Fullname
type1.AssemblyQualifiedName = MyLibrary.MyPrecious, MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
type2.AssemblyQualifiedName = MyLibrary.MyPrecious, MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
type1.AssemblyQualifiedName == type2.AssemblyQualifiedName
type1.GUID = cacf8c0d-b903-3da6-808f-024a3070ab9d
type2.GUID = cacf8c0d-b903-3da6-808f-024a3070ab9d
type1.GUID == type2.GUID
o1 = Activator.CreateInstance(type1)
o2 = Activator.CreateInstance(type2)
o1 == MyLibrary.MyPrecious
o2 == MyLibrary.MyPrecious
but... type1 != type2
</pre>
As it turns out, it is. Creating such a hell is relatively easy:
<br />
<pre class="brush: csharp">private static Assembly LoadAssemblyByName(string name)
{
var myPreciousAssemblyLocation = Path.Combine(Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location), name);
using (var fs = new FileStream(myPreciousAssemblyLocation, FileMode.Open, FileAccess.Read))
{
var data = new byte[fs.Length];
fs.Read(data, 0, data.Length);
fs.Close();
var assembly = Assembly.Load(data);
return assembly;
}
}
static void Main()
{
var type1 = typeof (MyPrecious);
var myLibraryAssembly = LoadAssemblyByName("MyLibrary.dll");
var type2 = myLibraryAssembly.GetType("MyLibrary.MyPrecious", true);
CompareTypes(type1, type2);
}
</pre>
The code above compares <b>type1</b>, from the referenced project, with <b>type2</b>, from the same assembly but loaded again through <b>Assembly.Load(byte[])</b>. That makes the library loaded twice in the AppDomain. Now, when a call to <b>AppDomain.CurrentDomain.GetAssemblies()</b> is made, the assemblies are:<br />
<pre class="brush: text; highlight: [12, 13]">AppDomain.CurrentDomain.GetAssemblies:
mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089
Microsoft.VisualStudio.HostingProcess.Utilities, Version=14.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a
System.Windows.Forms, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089
System, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089
System.Drawing, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a
Microsoft.VisualStudio.HostingProcess.Utilities.Sync, Version=14.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a
Microsoft.VisualStudio.Debugger.Runtime, Version=14.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a
vshost32, Version=14.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a
System.Core, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089
ConsoleApplication1, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
Accessibility, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3
</pre>
Even in such a small console application it is quite confusing. So let's make it even more confusing... What's the output of the following code?
<br />
<pre class="brush: csharp;">var myPreciousAssemblyLocation = Path.Combine(Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location), "MyLibrary.dll");
var myLibraryAssemblyLoadFrom = Assembly.LoadFrom(myPreciousAssemblyLocation);
var type3 = myLibraryAssemblyLoadFrom.GetType("MyLibrary.MyPrecious", true);
CompareTypes(type1, type3);
</pre>
Now, surprisingly, it's:
<br />
<pre class="brush: text;">type1.FullName = MyLibrary.MyPrecious
type2.FullName = MyLibrary.MyPrecious
type1.FullName == type2.Fullname
type1.AssemblyQualifiedName = MyLibrary.MyPrecious, MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
type2.AssemblyQualifiedName = MyLibrary.MyPrecious, MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
type1.AssemblyQualifiedName == type2.AssemblyQualifiedName
type1.GUID = cacf8c0d-b903-3da6-808f-024a3070ab9d
type2.GUID = cacf8c0d-b903-3da6-808f-024a3070ab9d
type1.GUID == type2.GUID
o1 = Activator.CreateInstance(type1)
o2 = Activator.CreateInstance(type2)
o1 == MyLibrary.MyPrecious
o2 == MyLibrary.MyPrecious
but... type1 == type2
</pre>
<h2>
Hint</h2>
<div>
A nice hint is shown, when you try to execute the following code:</div>
<pre class="brush: csharp;">var o1 = Activator.CreateInstance(type1);
var o2 = Activator.CreateInstance(type2);
MyPrecious p1 = (MyPrecious) o1;
try
{
MyPrecious p2 = (MyPrecious)o2;
}
catch (Exception e)
{
Console.WriteLine(e);
}
</pre>
<blockquote>
System.InvalidCastException: [A]MyLibrary.MyPrecious cannot be cast to [B]MyLibrary.MyPrecious. Type A originates from 'MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null' in the context 'LoadNeither' in a byte array. Type B originates from 'MyLibrary, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null' in the context 'Default' at location 'c:\users\pwdev\documents\visual studio 2015\Projects\ConsoleApplication1\ConsoleApplication1\bin\Debug\MyLibrary.dll'.
at ConsoleApplication1.Program.Main() in c:\users\pwdev\documents\visual studio 2015\Projects\ConsoleApplication1\ConsoleApplication1\Program.cs:line 49
</blockquote>
<br />
<h2>
Explanation</h2>
<div>
Yes, it's all about the load contexts. There are three different assembly load contexts: Load, LoadFrom and Neither. Usually there is no need to load the same library twice and run into the strange behavior shown above, but sometimes there might be. The different Assembly.Load(From/File) methods each have advantages and disadvantages. Take a look: <a href="https://blogs.msdn.microsoft.com/suzcook/2003/05/29/choosing-a-binding-context/" target="_blank">Choosing a Binding Context</a>. Furthermore, consider what happens to an assembly's dependencies when you load it. There are best practices described on MSDN for loading assemblies: <a href="https://msdn.microsoft.com/en-us/library/dd153782(v=vs.110).aspx" target="_blank">Best Practices for Assembly Loading</a>. I have to say, in my whole career I've loaded assemblies by hand twice, and in hindsight, both cases were a mistake.
</div>
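To see the default Load context unifying identities, here is a minimal, self-contained sketch (it uses the core library's assembly, since that one is guaranteed to be loaded already - the specific assembly is just an illustrative choice):

```csharp
using System;
using System.Reflection;

class LoadContextDemo
{
    static void Main()
    {
        // The assembly that defines System.Object is already in the default Load context.
        Assembly a1 = typeof(object).Assembly;

        // Loading by full display name goes through the Load context, so the
        // runtime recognizes the identity and hands back the same Assembly instance.
        Assembly a2 = Assembly.Load(a1.GetName());

        Console.WriteLine(ReferenceEquals(a1, a2)); // True

        // Consequently, types resolved through it are the very same Type objects:
        Console.WriteLine(a2.GetType("System.Object") == typeof(object)); // True
    }
}
```

Load the same file through LoadFrom instead (as in the example above) and this identity guarantee is exactly what you can lose.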
<h2>
TypeHandle</h2>
<div>
Instead of comparing the types in the examples above with the == operator, it is possible to compare them by their TypeHandle:
<br />
<blockquote>
TypeHandle encapsulates a pointer to an internal data structure that represents the type. This handle is unique during the process lifetime. The handle is valid only in the application domain in which it was obtained.
</blockquote>
Source: <a href="https://msdn.microsoft.com/en-us/library/system.type.typehandle(v=vs.110).aspx" target="_blank">MSDN</a>. Well, I can't think of an interesting use for TypeHandles right now, but it's good to know they exist.</div>
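For completeness, a minimal sketch of what comparing by handle looks like (using int, so it runs anywhere):

```csharp
using System;

class TypeHandleDemo
{
    static void Main()
    {
        object boxed = 42;

        // Type.GetTypeHandle(obj) returns the handle for the object's runtime type.
        RuntimeTypeHandle h1 = Type.GetTypeHandle(boxed);
        RuntimeTypeHandle h2 = typeof(int).TypeHandle;

        // Handles are unique per type within the process, so equal handles
        // mean the same runtime type.
        Console.WriteLine(h1.Equals(h2));                             // True
        Console.WriteLine(Type.GetTypeFromHandle(h1) == typeof(int)); // True
    }
}
```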
PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com0tag:blogger.com,1999:blog-92551346682037876.post-41061060145166353072016-05-22T17:00:00.000+02:002016-05-22T17:00:08.395+02:00The five stages of coming to terms with Cassandra<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-McR_jLPQ-8E/V0HBk9o9XHI/AAAAAAAACEY/5eZ9ZKUXQ4A7n3-GvIiLsOaa3YGKqPmagCLcB/s1600/Cassandra_logo.svg.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="214" src="https://2.bp.blogspot.com/-McR_jLPQ-8E/V0HBk9o9XHI/AAAAAAAACEY/5eZ9ZKUXQ4A7n3-GvIiLsOaa3YGKqPmagCLcB/s320/Cassandra_logo.svg.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="background-color: white; color: #252525; font-family: sans-serif; font-size: 12.88px; line-height: 20.608px;">From Wikimedia Commons, the free media repository</span></td></tr>
</tbody></table>
The five stages of coming to terms with JavaScript are:
<br />
<ol>
<li>Denial: “I won’t need this language.”</li>
<li>Anger: “Why does the web have to be so popular?”</li>
<li>Bargaining: “OK, at least let me compile a reasonable language to JavaScript.”</li>
<li>Depression: “Programming is not for me, I’ll pursue a career in masonry, like I always wanted.”</li>
<li>Acceptance: “I can’t fight it, I may as well prepare for it.”</li>
</ol>
<br />
The same goes for Cassandra - however, in my opinion, in the opposite order:
<br />
<ol>
<li>Acceptance: “I will use Cassandra. It's... AMAZING! Let me just quote <a href="http://cassandra.apache.org/" target="_blank">Apache Cassandra landing page</a>:"<blockquote class="tr_bq">
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.</blockquote>
</li>
<li>Depression: “Damn, it's so well designed, but it's a complex piece of software and it doesn't work as expected.”</li>
<li>Bargaining: "OK, at least let me try to tune it or report some bugs.”</li>
<li>Anger: “Why is it so popular? Why does it have such good PR?”</li>
<li>Denial: “I won’t use it or recommend it ever again.”</li>
</ol>
<h2>
The context</h2>
I did the research and checked multiple websites - read about performance, architecture, hosting, maintenance, TCO, libraries, popularity... and Cassandra seemed to be a good database for time-series log storage, with 95% writes (with an SLA) and only 5% reads (without an SLA). I chose the prepared Cassandra Datastax virtual disk image on Amazon with bootstrap scripts, built a proof-of-concept solution and read a book or two about Cassandra. All seemed good. However, this is not a post about the good. So ...fast forward...<br />
<h2>
The bad</h2>
Some stories which I remember:
<br />
<ul>
<li>The Cassandra cluster is on production (along with the parallel, old solution for this purpose). The phone rings at 2AM. The C* cluster is down. A quick look at the logs - OutOfMemoryException in a random place in the JVM. Outage time: 1h - let me just remind you of the <b>"proven fault-tolerance"</b>. Cluster restart, it works again.</li>
<li>Next day at work, random hour, the same thing. Related bug: <a href="https://issues.apache.org/jira/browse/CASSANDRA-10787" target="_blank">OutOfMemoryError after few hours from node restart</a> </li>
<li>After a few days... firing <b>repair</b> - the standard C* operation which you have to run at least every <b>gc_grace_seconds</b> (by default 10 days). Usually it worked, but then, unexpectedly, the server died - and later again and again. Related issue: <a href="https://issues.apache.org/jira/browse/CASSANDRA-10448" target="_blank">"Unknown type 0" Stream failure on Repair</a>. Outage time: stopped counting.</li>
<li>Because of the failing servers in the cluster I decided to scale it out a little. Unfortunately, the issue above also made the scaling impossible.</li>
<li>After a while, I encountered a second (third?) problem with the <b>repair</b>. Related bug: <a href="https://issues.apache.org/jira/browse/CASSANDRA-10389" target="_blank">Repair session exception Validation failed</a></li>
</ul>
<h2>
Fail</h2>
<div>
Let's get back to the landing page:</div>
<blockquote class="tr_bq">
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.</blockquote>
Now, let's look at the <a href="https://issues.apache.org/jira/browse/CASSANDRA-10448" target="_blank">critical JIRA issue</a> dates:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-QJCSM0zGAhg/V0G8W782X3I/AAAAAAAACEI/TjUGpVTHDmUnTv6Ijf0x-qhkhLvVhVA8ACLcB/s1600/CassandraFail.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://2.bp.blogspot.com/-QJCSM0zGAhg/V0G8W782X3I/AAAAAAAACEI/TjUGpVTHDmUnTv6Ijf0x-qhkhLvVhVA8ACLcB/s320/CassandraFail.PNG" width="229" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<b>This means that for around one month at least a few people could not scale or repair their Cassandra clusters.</b> I fully understand - it's free, open-source software. However, even if something is free, you expect it to work - that's the harsh reality. If it doesn't work, you just look for something else. No offence to the Cassandra Datastax/Apache teams - you are doing truly amazing work - however, in resilient software, stability is the TOP 1 requirement. </div>
<h2>
Maybe it's me? Maybe I'm the only one having problems?</h2>
Fortunately (for me) not:
<br />
<ol>
<li>Here is a presentation about how the team at Adform switched from Cassandra to Aerospike: <a href="https://vimeo.com/102812401" target="_blank">Big Data Strategy Minsk 2014 - Tadas Pivorius - Married to Cassandra</a></li>
<li>My friend working at a different company also told me that they used Cassandra and abandoned it.</li>
<li>Just look at the linked issues and the number of watchers.</li>
</ol>
<div>
In all cases the problems were similar to mine.</div>
PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com0tag:blogger.com,1999:blog-92551346682037876.post-41166232131579426342016-05-05T21:20:00.001+02:002016-05-05T21:20:22.498+02:00Specifying requirements for live notification mechanism for systems integration purposesRecently I've designed a mechanism to notify external systems (with which we cooperate) about changes in our system. This, obviously, can be done in multiple ways. Let's look at some considerations on a high level, some questions and how that affects our requirements.<br />
<h2>
Assumptions</h2>
<ul>
<li>we want to notify other, external systems, owned by someone else</li>
<li>the allowed delay between a change in our system and sending the notification is around one minute</li>
<li>the change can carry multiple information and varies on the type of change</li>
<li>we expose an API which is currently used by those external systems - they fetch the changes periodically</li>
<li>the number of changes per second in our system is spiky in nature (assume 50-5000 notifications/second for now)</li>
<li>external systems will subscribe themselves for notifications</li>
</ul>
<div>
Those are real-life business assumptions, delivered to the designer/programmer/you/me.</div>
<h2>
Questions?</h2>
<div>
<ol>
<li>How to notify external systems?</li>
<li>What information should we pass? When is the notification delivered?</li>
<li>How long should we wait for the response?</li>
<li>When should we retry? </li>
</ol>
<div>
Let's try to answer those questions.</div>
</div>
<h2>
Answers</h2>
<ol>
<li>There are multiple external systems, made in multiple different technologies. The most popular and basic method of integration is just making HTTP(S) calls. Should it be GET, POST or something else? Let's consider the two most popular - GET and POST.</li>
<li>We have to pass multiple values, depending on the notification type. A typical payload is: a string (300 chars), 5 dates and 5 integers - therefore both GET (allowing ~2k chars in nearly all browsers and servers) and POST are viable. However, GET is very straightforward and simple: no issues with encoding, accepting compression or even reading the request stream. What is more, GET puts less pressure on our servers, as we do not have to send a body stream. Unfortunately, the GET query string is also visible to (nearly) everyone, therefore only non-sensitive information can be passed. What about concurrent notifications? How could one build an "exactly-once" delivery model? Here is where one of our assumptions comes in handy. Since the external systems already use our API, we can have them fetch the details through it after we notify them. The notification itself can then be delivered in an "at-least-once" model, carrying only non-sensitive, idempotent information about the change, which is then used to fetch the full, sensitive data from our API. One can even imagine an optimization - keep pending notifications in a buffer and delete duplicates within a small time bucket. </li>
<li>The obvious thing is that the longer we wait for responses, the more resources are used. However, there is one more important aspect. By specifying the request timeout, we can influence how the architecture of the external system will look. Saying "you have 30 seconds to process the notification" is like saying ~"you have plenty of time to receive our notification, process it, synchronously ask our API and then send us an HTTP 200 status code". Compare that with "you have 3 seconds to store the notification for later processing or process it asynchronously". The implications are clear: a short timeout = fewer required resources + better integration.</li>
<li>We want to be sure that the notification reaches the external system, and thanks to the design specified in the second point we can use the "at-least-once" delivery model. I see two options: a) hit the specified URL 3 times (for example), don't wait for the answer and never send this notification again; b) hit the specified URL and retry in X minutes if the HTTP status code was different than 200. The first option is very simple to implement, however it assumes that external systems will develop a mechanism to avoid processing the same notification multiple times - which will likely end with them hitting our API three times for every single notification.</li>
</ol>
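As a sketch of how option b) could look, the sender below is hidden behind a delegate so the real HTTP call (or a test stub, as here) can be plugged in - the names, retry count and delay are illustrative, not part of the original design:

```csharp
using System;
using System.Threading.Tasks;

static class Notifier
{
    // Tries to deliver one idempotent notification at most maxAttempts times.
    // 'send' returns true on HTTP 200; anything else (or an exception) triggers a retry.
    public static async Task<bool> NotifyWithRetryAsync(
        Func<Task<bool>> send, int maxAttempts, TimeSpan delayBetweenAttempts)
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                if (await send()) return true;
            }
            catch
            {
                // Treat transport errors the same as a non-200 response.
            }
            if (attempt < maxAttempts) await Task.Delay(delayBetweenAttempts);
        }
        return false; // Gave up; the external system can still catch up via our API.
    }
}

class Program
{
    static void Main()
    {
        int calls = 0;
        // Stub: fails twice, then succeeds - simulating a flaky endpoint.
        Func<Task<bool>> flaky = () => Task.FromResult(++calls >= 3);

        bool delivered = Notifier.NotifyWithRetryAsync(flaky, 5, TimeSpan.Zero).Result;
        Console.WriteLine($"{delivered} after {calls} attempts"); // True after 3 attempts
    }
}
```

Because the notification only carries idempotent, non-sensitive data, delivering it more than once is harmless - the external system just fetches the same state from the API again.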
<h2>
Conclusion</h2>
<div>
There we have it: answers which should lead to a simple, sleek design that is relatively easy to implement, completely fulfills the needs and demands a good design from the external systems.</div>
PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com0tag:blogger.com,1999:blog-92551346682037876.post-48997385317989501392016-04-18T19:37:00.001+02:002016-04-18T19:37:14.135+02:00Presentation recommendation - The Microservices and DevOps JourneyToday I recommend you following presentation: <a href="http://www.infoq.com/presentations/wix-microservices-devops" target="_blank">The Microservices and DevOps Journey</a><br />
<br />
Why? "Microservices" is a buzz-word, created around one year ago, still not popular in Google, but surprisingly popular on conferences: <a href="https://www.google.com/trends/explore#q=microservices%2C%20%2Fm%2F0315s4&cmpt=q&tz=Etc%2FGMT-2" target="_blank">Google Trends: Microservices vs SOA</a>. In my opinion, in this video, a sensible approach of transforming a monolith to a microservices system is presented. KISS architecture. LogStash, consul, Cassandra, Docker, Octopus are cool, however the question is: "Do you really need them?". Expect nothing super fancy though, I'm just sharing what I agree with.PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com0tag:blogger.com,1999:blog-92551346682037876.post-67692378016950939832016-03-20T23:15:00.001+01:002017-11-03T09:39:19.671+01:00Little-known, useful, charming and beautiful algorithms - part 2<h2>
<span style="color: red;">Please find the updated version of this post here: <a href="https://piotr.westfalewicz.com/blog/2016/03/little-known-useful-charming-and-beautiful-algorithms---part-2/">https://piotr.westfalewicz.com/blog/2016/03/little-known-useful-charming-and-beautiful-algorithms---part-2/</a></span></h2>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-4bqKX2VZ-Fs/Vu1KWhqE9qI/AAAAAAAACCU/K7tF_HBJrBwg0kv9u1z0WsEyn2JfogJWA/s1600/startrails_timer.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="316" src="https://3.bp.blogspot.com/-4bqKX2VZ-Fs/Vu1KWhqE9qI/AAAAAAAACCU/K7tF_HBJrBwg0kv9u1z0WsEyn2JfogJWA/s320/startrails_timer.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Image source: <a href="https://goo.gl/EGDuX7" target="_blank">https://goo.gl/EGDuX7</a></td></tr>
</tbody></table>
<b><br /></b>This is the second post about charming algorithms, the first post is here: <a href="http://dynamicallyinvokable.blogspot.com/2016/02/little-known-useful-charming-and.html" target="_blank">Little-known, useful, charming and beautiful algorithms - part 1</a><br />
<h2>
.NET Timers</h2>
<div>
How many timers are there in the .NET Framework? It's not 2 or 3. It's 5:<br />
<br />
<ol>
<li>System.Timers.Timer</li>
<li>System.Threading.Timer</li>
<li>System.Windows.Forms.Timer</li>
<li>System.Web.UI.Timer</li>
<li>System.Windows.Threading.DispatcherTimer</li>
</ol>
<br />
Here is a comparison of the three most important ones:<br />
<a href="https://web.archive.org/web/20150329101415/https://msdn.microsoft.com/en-us/magazine/cc164015.aspx" target="_blank">Comparing the Timer Classes in the .NET Framework Class Library</a>.<br />
<br />
Which timer would you use to implement <a href="http://docs.datastax.com/en/latest-csharp-driver/csharp-driver/reference/speculativeQueryExecution.html" target="_blank">Speculative query execution</a> (which, BTW, is a nice idea too)? The answer should be: none of them.<br />
<h2>
.NET timers: internals &amp; assumptions</h2>
As you can read in the <a href="http://referencesource.microsoft.com/#mscorlib/system/threading/timer.cs" target="_blank">.NET source code of System.Threading.Timer</a>, each AppDomain in .NET has one single native timer, supplied by the VM, to fire all managed timers in the AppDomain. The performance considerations are:<br />
<blockquote class="tr_bq">
<blockquote class="tr_bq">
We assume that timers are created and destroyed frequently, but rarely actually fire. There are roughly two types of timer:</blockquote>
<blockquote class="tr_bq">
- timeouts for operations. These are created and destroyed very frequently, but almost never fire, because the whole point is that the timer only fires if something has gone wrong.</blockquote>
<blockquote class="tr_bq">
- scheduled background tasks. These typically do fire, but they usually have quite long durations. So the impact of spending a few extra cycles to fire these is negligible.</blockquote>
<blockquote class="tr_bq">
Because of this, we want to choose a data structure with very fast insert and delete times, but we can live with linear traversal times when firing timers.</blockquote>
<blockquote class="tr_bq">
The data structure we've chosen is an unordered doubly-linked list of active timers. This gives O(1) insertion and removal, and O(N) traversal when finding expired timers.</blockquote>
</blockquote>
Because of the nature of speculative query execution, we actually do care about finding expired timers, and O(N) time is not acceptable. Imagine 10000 read requests per second per single Cassandra node from your web server (which isn't really a big deal - rather a normal situation in high-traffic scenarios). If you use speculative query execution, you get 10000 timers to manage. The stop, start and expiry-processing operations aren't the big deal here, but the per-tick bookkeeping is. Fortunately, there is a better structure for keeping timers than a doubly-linked list.<br />
<h2>
Hashed and Hierarchical Timing Wheels</h2>
</div>
<div>
Adrian Colyer described all kinds of different timer structures in his blog post: <a href="http://blog.acolyer.org/2015/11/23/hashed-and-hierarchical-timing-wheels/" target="_blank">Hashed and Hierarchical Timing Wheels: Data Structures for the Efficient Implementation of a Timer Facility</a>. Go ahead and read it, as he did a really good job explaining the <a href="http://www.cs.columbia.edu/~nahum/w6998/papers/sosp87-timing-wheels.pdf" target="_blank">academic paper</a>.</div>
<h2>
Real world applications</h2>
<div>
I've found two real world applications for those structures:</div>
<div>
<ul>
<li>One is already mentioned <a href="http://docs.datastax.com/en/latest-csharp-driver/csharp-driver/reference/speculativeQueryExecution.html" target="_blank">Speculative Query Execution</a>, in Cassandra</li>
<li>Second is request purgatory in Kafka: <a href="http://www.confluent.io/blog/apache-kafka-purgatory-hierarchical-timing-wheels" target="_blank">Apache Kafka, Purgatory, and Hierarchical Timing Wheels</a></li>
</ul>
</div>
<div>
If you are looking for a C# implementation, the one and only implementation I've found is here: <a href="https://github.com/datastax/csharp-driver/blob/d48d2510a0c72f12559053b60919781d52ae94b9/src/Cassandra/Tasks/HashedWheelTimer.cs" target="_blank">HashedWheelTimer.cs</a> (many thanks to the Cassandra Datastax C# Driver team).</div>
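To get an intuition for why a wheel beats a linked list, here is a toy, single-level hashed wheel driven by explicit ticks instead of a real clock - the production HashedWheelTimer linked above is, of course, far more involved (threading, cancellation, a real tick duration):

```csharp
using System;
using System.Collections.Generic;

// Toy hashed timing wheel: an array of buckets, each an unordered list of
// timeouts. Insert is O(1), and each tick only inspects a single bucket.
class HashedWheelTimer
{
    private readonly List<(int rounds, Action callback)>[] _buckets;
    private long _tick;

    public HashedWheelTimer(int wheelSize)
    {
        _buckets = new List<(int rounds, Action callback)>[wheelSize];
        for (int i = 0; i < wheelSize; i++)
            _buckets[i] = new List<(int rounds, Action callback)>();
    }

    public void Schedule(int ticksFromNow, Action callback)
    {
        int bucket = (int)((_tick + ticksFromNow) % _buckets.Length);
        int rounds = ticksFromNow / _buckets.Length; // full revolutions to wait
        _buckets[bucket].Add((rounds, callback));
    }

    // Advances the wheel by one tick and fires the expired timeouts in that bucket.
    public void Tick()
    {
        var bucket = _buckets[(int)(_tick % _buckets.Length)];
        var remaining = new List<(int rounds, Action callback)>();
        foreach (var t in bucket)
        {
            if (t.rounds == 0) t.callback();
            else remaining.Add((t.rounds - 1, t.callback));
        }
        bucket.Clear();
        bucket.AddRange(remaining);
        _tick++;
    }
}

class Program
{
    static void Main()
    {
        var fired = new List<int>();
        var wheel = new HashedWheelTimer(8);
        wheel.Schedule(3, () => fired.Add(3));
        wheel.Schedule(11, () => fired.Add(11)); // wraps around the 8-slot wheel

        for (int i = 0; i < 12; i++) wheel.Tick();
        Console.WriteLine(string.Join(",", fired)); // 3,11
    }
}
```

With 10000 timers, a tick touches only one bucket instead of traversing one long list - exactly the trade the timing-wheel papers describe.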
PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com1tag:blogger.com,1999:blog-92551346682037876.post-12091617012613437312016-02-12T21:00:00.000+01:002017-11-03T09:37:50.810+01:00Little-known, useful, charming and beautiful algorithms - part 1<h2>
<span style="color: red;">Please find the updated version of this post here: <a href="https://piotr.westfalewicz.com/blog/2016/02/little-known-useful-charming-and-beautiful-algorithms---part-1/">https://piotr.westfalewicz.com/blog/2016/02/little-known-useful-charming-and-beautiful-algorithms---part-1/</a></span></h2>
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-tUKoddm1TuA/Vr22ldfueEI/AAAAAAAACBo/I-kEDDzFlpE/s1600/14999534034_ba01564b36_z.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="200" src="https://2.bp.blogspot.com/-tUKoddm1TuA/Vr22ldfueEI/AAAAAAAACBo/I-kEDDzFlpE/s320/14999534034_ba01564b36_z.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Image source: <a href="https://goo.gl/mJjN4S">https://goo.gl/mJjN4S</a></td></tr>
</tbody></table>
<b><br /></b>
<b><br /></b>
<b>Warning</b>: this post won't be about "boring" or "typical" algorithms from Computer Science which we all have learned on studies (like quick sort, merge sort, xxx sort, A*, FFT). Instead, this will be about other little-known, especially USEFUL algorithms, which people working as professional developers should know or heard of.<br />
<h2>
Little-known</h2>
<div>
ID generation problems are usually overlooked - database IDs, I mean. Ask someone to name ID "types". Well, GUID, newsequentialid(), int/long (increased one by one, as with IDENTITY) and NHibernate's Hi/Lo will be the answers. The first <a href="http://www.sqlskills.com/blogs/kimberly/guids-as-primary-keys-andor-the-clustering-key/" target="_blank">will make your SQL Server cry</a> when used as a clustering key. The next two require a SQL database as part of the generation process, which limits performance and can be problematic in occasionally-connected client scenarios. The last one is quite interesting - it solves the problems related to using a database in the generation process and doesn't have the GUID's problem... however, it's not quite suitable for a distributed (cloud) environment: many servers can generate the same IDs if they begin from the same initial ID.<br />
<br /></div>
<div>
To address those and other issues, listen to the guys from Twitter:<br />
<blockquote class="tr_bq">
As we at Twitter move away from Mysql towards Cassandra, we've needed a new way to generate id numbers. There is no sequential id generation facility in Cassandra, nor should there be.</blockquote>
they designed a system for generating IDs with the following requirements (<a href="https://github.com/twitter/snowflake/blob/b3f6a3c6ca8e1b6847baa6ff42bf72201e2c2231/README.mkd" target="_blank">source</a>):<br />
<blockquote class="tr_bq">
<b>Performance</b><br />
- minimum 10k ids per second per process<br />
- response rate 2ms (plus network latency)<br />
<ul></ul>
<b>Uncoordinated</b><br />
For high availability within and across data centers, machines generating ids should not have to coordinate with each other.<br />
<br />
<b>(Roughly) Time Ordered</b><br />
We have a number of API resources that assume an ordering (they let you look things up "since this id").<br />
However, as a result of a large number of asynchronous operations, we already don't guarantee in-order delivery.<br />
We can guarantee, however, that the id numbers will be k-sorted (references: http://portal.acm.org/citation.cfm?id=70413.70419 and http://portal.acm.org/citation.cfm?id=110778.110783) within a reasonable bound (we're promising 1s, but shooting for 10's of ms).<br />
<br />
<b>Directly Sortable</b><br />
The ids should be sortable without loading the full objects that the represent. This sorting should be the above ordering.<br />
<br />
<b>Compact</b><br />
There are many otherwise reasonable solutions to this problem that require 128bit numbers. For various reasons, we need to keep our ids under 64bits.<br />
<br />
<b>Highly Available</b><br />
The id generation scheme should be at least as available as our related services (like our storage services).</blockquote>
</div>
<div>
Note this is a system for generating IDs. However, this system can easily be detached from the whole infrastructure and turned into a nice algorithm or a program for generating uncoordinated, time-ordered, k-sortable, compact IDs. In fact, there is already a .NET project for that: <a href="https://github.com/RobThree/IdGen" target="_blank">IdGen</a>. I recommend looking into the internals of the generation algorithm because the idea is very simple and sleek. I can't stress enough the usefulness of this algorithm in distributed systems. A must have!<br />
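The bit layout behind such generators is simple - in the spirit of Snowflake: timestamp in the high bits, then a generator id, then a per-millisecond sequence. A minimal sketch (the epoch and bit widths here are illustrative choices, not IdGen's exact configuration):

```csharp
using System;

class SnowflakeSketch
{
    const int GeneratorBits = 10, SequenceBits = 12;

    // Packs (milliseconds since some fixed custom epoch, generator id, sequence)
    // into one signed 64-bit value. Because the timestamp occupies the high bits,
    // the resulting ids sort roughly by creation time.
    public static long MakeId(long timestampMs, int generatorId, int sequence) =>
        (timestampMs << (GeneratorBits + SequenceBits))
        | ((long)generatorId << SequenceBits)
        | (uint)sequence;

    static void Main()
    {
        long t = 50_000_000_000; // ~1.5 years of milliseconds past the custom epoch

        long earlier = MakeId(t, generatorId: 1, sequence: 0);
        long later   = MakeId(t + 1, generatorId: 1, sequence: 0);
        long sameMs  = MakeId(t, generatorId: 1, sequence: 1);

        Console.WriteLine(earlier < later);  // True - time-ordered
        Console.WriteLine(earlier < sameMs); // True - sequence breaks ties in a millisecond
    }
}
```

No coordination is needed: as long as every machine gets a unique generator id and never reuses a (timestamp, sequence) pair, the ids cannot collide.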
<br />
<h2>
Concealed algorithm, which .NET developers use every day</h2>
</div>
<div>
Whenever you use the TCP protocol (via WebClient or HttpWebRequest), this algorithm kicks in. Normally it's good and useful (<a href="https://en.wikipedia.org/wiki/Nagle%27s_algorithm" target="_blank">source</a>):</div>
<blockquote class="tr_bq">
Nagle's document, Congestion Control in IP/TCP Internetworks (RFC 896) describes what he called the "small packet problem", where an application repeatedly emits data in small chunks, frequently only 1 byte in size. Since TCP packets have a 40 byte header (20 bytes for TCP, 20 bytes for IPv4), this results in a 41 byte packet for 1 byte of useful information, a huge overhead. (...)<br />
Nagle's algorithm works by combining a number of small outgoing messages, and sending them all at once. Specifically, as long as there is a sent packet for which the sender has received no acknowledgment, the sender should keep buffering its output until it has a full packet's worth of output, so that output can be sent all at once.</blockquote>
<div>
However, sometimes it's not, and it is good to know that such a thing even exists. Here is a performance test which showed a speed increase in Azure Queue PUT operations from ~210ms to ~28ms: <a href="http://blogs.msdn.com/b/windowsazurestorage/archive/2010/06/25/nagle-s-algorithm-is-not-friendly-towards-small-requests.aspx" target="_blank">Nagle’s Algorithm is Not Friendly towards Small Requests</a></div>
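When small-request latency matters, as in the Azure Queue test above, .NET exposes two knobs for switching Nagle's algorithm off - globally for the HttpWebRequest/WebClient stack and per socket. A small sketch:

```csharp
using System;
using System.Net;
using System.Net.Sockets;

class NagleDemo
{
    static void Main()
    {
        // Global switch for the classic HttpWebRequest/WebClient stack: turn
        // Nagle off before any connection to the endpoint is created.
        ServicePointManager.UseNagleAlgorithm = false;

        // Per-socket equivalent: NoDelay = true sets the TCP_NODELAY option.
        var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        socket.NoDelay = true;

        Console.WriteLine(ServicePointManager.UseNagleAlgorithm); // False
        Console.WriteLine(socket.NoDelay);                        // True
        socket.Dispose();
    }
}
```

As always with such knobs: measure first - disabling Nagle trades bandwidth efficiency for latency, which only pays off for small, frequent requests.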
<div>
<br /></div>
<h2>
Little algorithms</h2>
There are two "short" algorithms which I personally like. They aren't a big deal and by googling the problem, you probably will use one.<br />
<ul>
<li><a href="http://mathworld.wolfram.com/Box-MullerTransformation.html" target="_blank">Box–Muller transform</a> which you probably want to use, when you want to turn standard C# <b>Random</b> class to Gaussian distributed random generator. C# code here: <a href="http://stackoverflow.com/questions/218060/random-gaussian-variables" target="_blank">Random Gaussian Variables</a></li>
<li><a href="https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle" target="_blank">Fisher–Yates shuffle</a> which is a solution for randomly shuffling an array. C# code here: <a href="http://stackoverflow.com/questions/108819/best-way-to-randomize-an-array-with-net" target="_blank">Best way to randomize an array with .NET</a></li>
</ul>
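For reference, minimal C# sketches of both, in the spirit of the linked answers (the basic Box-Muller form and an in-place Fisher-Yates):

```csharp
using System;

static class LittleAlgorithms
{
    // Box-Muller: turns two uniform (0,1] samples into one standard-normal sample.
    public static double NextGaussian(Random rng)
    {
        double u1 = 1.0 - rng.NextDouble(); // shift to (0,1] so log(0) cannot happen
        double u2 = rng.NextDouble();
        return Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);
    }

    // Fisher-Yates: unbiased in-place shuffle in O(n).
    public static void Shuffle<T>(T[] array, Random rng)
    {
        for (int i = array.Length - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1); // 0 <= j <= i
            (array[i], array[j]) = (array[j], array[i]);
        }
    }
}

class Program
{
    static void Main()
    {
        var rng = new Random(42);
        Console.WriteLine(LittleAlgorithms.NextGaussian(rng)); // one standard-normal sample

        var deck = new[] { 1, 2, 3, 4, 5 };
        LittleAlgorithms.Shuffle(deck, rng);
        Console.WriteLine(string.Join(",", deck)); // some permutation of 1..5
    }
}
```

For a mean/standard deviation other than 0/1, scale the Box-Muller output: mean + stdDev * sample.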
<h2>
What is your favorite algorithm/solution?</h2>
In part two there will be one special "algorithm".</div>
PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com0tag:blogger.com,1999:blog-92551346682037876.post-20840057406892478492016-01-07T21:00:00.000+01:002016-01-11T10:29:03.010+01:00Checking "Star Wars - The Force Awakens" tickets availability with Azure WebJobs, scriptcs and SendGridThis user story is quite simple: there is a guy (me) who likes Star Wars. This guy wants to buy the best tickets available in an <a href="https://en.wikipedia.org/wiki/IMAX" target="_blank">IMAX Cinema</a>. The premiere was not so long ago, so a day after the showtimes are updated, the best seats are booked. This guy (also me) is quite lazy, so he doesn't like to check the showtimes manually.<br />
<br />
Hm... let's do it like <b>pro</b> developers using cutting-edge technologies!<br />
<h2>
How does the booking system work?</h2>
There is this whole UI for selecting seats and so on, however there is one interesting request which I can use to check showtimes. It looks like this:<br />
<pre class="brush: text">POST http://www.cinema-city.pl/scheduleInfoRows HTTP/1.1
Host: www.cinema-city.pl
Connection: keep-alive
Content-Length: 52
Accept: */*
Origin: http://www.cinema-city.pl
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Referer: http://www.cinema-city.pl/imax
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.8
Cookie: bla bla bla
locationId=1010304&date=09%2F01%2F2016&venueTypeId=2
</pre>
<br />
When there is a show on that date, it returns an HTML table with booking links; otherwise, an empty HTML table.<br />
<h2>
scriptcs to make the job done</h2>
I've written a simple scriptcs script which makes a POST with the appropriate headers and checks whether an HTML link opening tag is in the response. If that's the case, I send an email using a fresh, free SendGrid account.<br />
<pre class="brush: csharp">using System.Net;
using System.Net.Http;
using System.Net.Mail;
using System.Text;
using SendGrid;

public void SendMeEmail()
{
    var myMail = new SendGridMessage();
    myMail.From = new MailAddress("Yoda@gmail.com");
    myMail.AddTo("me@gmail.com");
    myMail.Subject = "StarWars tickets are available!!";
    myMail.Text = "Go to CinemaCity IMAX to book them.";

    var credentials = new NetworkCredential("user-sendgrid@azure.com", "sendgrid-password");
    var transportWeb = new Web(credentials);
    transportWeb.DeliverAsync(myMail).Wait();
}

var httpClient = new HttpClient();
httpClient.DefaultRequestHeaders.Add("Host", "www.cinema-city.pl");
httpClient.DefaultRequestHeaders.Add("X-Requested-With", "XMLHttpRequest");
httpClient.DefaultRequestHeaders.Add("Referer", "http://www.cinema-city.pl/imax");
httpClient.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36");

var content = new StringContent(@"locationId=1010304&date=09%2F01%2F2016&venueTypeId=2", Encoding.UTF8, @"application/x-www-form-urlencoded");
var response = httpClient.PostAsync("http://www.cinema-city.pl/scheduleInfoRows", content).Result;
var responseAsString = response.Content.ReadAsStringAsync().Result;

// An opening link tag in the returned table means there is something to book.
var isMovieAvailable = responseAsString.Contains("<a");
if (isMovieAvailable)
{
    Console.WriteLine("Movie is available, sending email");
    SendMeEmail();
    Console.WriteLine("Movie is available, email sent");
}
else
{
    Console.WriteLine("Movie is not available.");
}
</pre>
<br />
Environment setup is quite simple:<br />
<ul>
<li>create a new SendGrid account</li>
<li>download scriptcs as zip (<a href="https://codeload.github.com/scriptcs/scriptcs/legacy.zip/master" target="_blank">link</a>) and unzip it to a folder <b>StarWarsCheck</b></li>
<li>save the code as <b>checkmovie.csx</b> to folder <b>StarWarsCheck</b></li>
<li>update <b>checkmovie.csx</b> with your SendGrid credentials</li>
<li>add reference to Sendgrid dll's by invoking <b>scriptcs.exe -Install Sendgrid</b></li>
<li>now you can run the script locally. In a console write: <b>scriptcs.exe checkmovie.csx</b></li>
</ul>
<div>
If you are interested in other options for preparing a standalone, portable scriptcs script - check my question on StackOverflow: <a href="http://stackoverflow.com/questions/34589667/how-to-run-scriptcs-without-installation-make-portable-standalone-scriptcs-csx" target="_blank">How to run scriptcs without installation? Make portable/standalone scriptcs (csx)</a><br />
<br />
Note: of course, once the showtimes are updated, I will get an email every hour. But that's good, isn't it? This way there is no chance I will miss it.</div>
<h2>
Create an Azure WebJob to run scriptcs file every hour</h2>
<div>
You can use Azure WebJobs by simply uploading a <b>.zip</b> file with the job and configuring in Azure how it should be scheduled. The job entry point is based on a naming convention. <a href="https://github.com/projectkudu/kudu/wiki/Web-jobs" target="_blank">According to the documentation</a>, <i>the recommended script file to have in your job directory is: run.cmd</i>. Therefore, my <b>run.cmd</b> looks like this:<br />
<pre class="brush: bash">call scriptcs.exe checkmovie.csx
</pre>
<br />
Next, pack the whole <b>StarWarsCheck</b> folder as a zip file and upload it as an Azure WebJob. <a href="https://azure.microsoft.com/en-us/documentation/articles/web-sites-create-web-jobs/" target="_blank">Instructions are here.</a><br />
<h2>
Effects</h2>
</div>
<div>
It started with...</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-pPshJffT_fg/VosJsYMDl8I/AAAAAAAACAU/xK1nHmdLXsE/s1600/CheckingStarWarsTheForceAwakens_noshowtimes.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="107" src="http://4.bp.blogspot.com/-pPshJffT_fg/VosJsYMDl8I/AAAAAAAACAU/xK1nHmdLXsE/s320/CheckingStarWarsTheForceAwakens_noshowtimes.PNG" width="320" /></a></div>
<div>
<br /></div>
<div>
But then...</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-e0ojcFvWhs4/VosJ3QCmvTI/AAAAAAAACAc/Tm-LxniBIm8/s1600/CheckingStarWarsTheForceAwakens_showtimes.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="51" src="http://4.bp.blogspot.com/-e0ojcFvWhs4/VosJ3QCmvTI/AAAAAAAACAc/Tm-LxniBIm8/s320/CheckingStarWarsTheForceAwakens_showtimes.PNG" width="320" /></a></div>
<div>
<br /></div>
<div>
The email arrived without problems:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-YiUEiyff6_o/VosJ-R-HJSI/AAAAAAAACAk/JJ3IIYy1CLY/s1600/CheckingStarWarsTheForceAwakens_email.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="122" src="http://4.bp.blogspot.com/-YiUEiyff6_o/VosJ-R-HJSI/AAAAAAAACAk/JJ3IIYy1CLY/s320/CheckingStarWarsTheForceAwakens_email.PNG" width="320" /></a></div>
<div>
<br /></div>
<div>
However, most importantly, I've booked the best seats for my friends and me:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-Q1lY7n3-dnM/VosKYDUbh_I/AAAAAAAACAs/GG3c4QTNzbE/s1600/CheckingStarWarsTheForceAwakens_bookedSeats.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="158" src="http://2.bp.blogspot.com/-Q1lY7n3-dnM/VosKYDUbh_I/AAAAAAAACAs/GG3c4QTNzbE/s320/CheckingStarWarsTheForceAwakens_bookedSeats.PNG" width="320" /></a></div>
<h2>
Summary</h2>
<div>
It was fun, easy and profitable to play with Azure WebJobs and scriptcs. I liked scriptcs' sleekness and Azure WebJobs' simplicity. I'll certainly use them for something else in the future.</div>
PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com3tag:blogger.com,1999:blog-92551346682037876.post-88056272257774383842016-01-03T01:17:00.000+01:002016-01-03T01:19:18.825+01:00Merging multiple git repositories into one and purging sensitive dataGit is a very powerful, distributed version control system. It's based on a simple concept: a directed acyclic graph pointing to four types of objects in its database. I love git and its brilliant design. Therefore, when I saw how misused it was at a company I joined, I had to fix it.<br />
<br />
<h2>
The state before</h2>
<div>
Due to multiple factors, there were 4 repositories which had to be cloned into one directory. Each repository was using or was being used by another repository. In other words, Visual Studio projects had dependencies on other projects, or worse, on compiled dlls in other repositories. Therefore, sometimes one change required 4 commits (including rebuilding the projects and adding the compiled dlls to the commit). In one month, around 2 man-days were lost on checking changes in and out of multiple repositories and on false (or true...) alarms that somebody forgot to check something in or out. What's more, this was with only one branch - master - because only one existed back then. To support multiple environments or development of fine-grained features/stories in the future, multiply those problems by the number of branches and the number of new developers, at least. As always in IT, there wasn't much time, so setting up an internal company NuGet server wasn't the best thing to do. It isn't that setting up a NuGet server takes long - it is training all the developers that requires a great amount of time. Instead, I've decided to create one repository.<br />
The state before was like this:</div>
The state before was like this:</div>
<pre class="brush: text">\Repo1
\src
\project1
project1.csproj with dll reference to project 2
\ExternalDlls
project2.dll
Solution1.sln
\Repo2
\project2
project2.csproj with project reference to project 3
\project3
project3.csproj with project reference to project 4
Solution2.sln
\Repo3
\project4
project4.csproj
Solution3.sln</pre>
<h2>
One to rule them all</h2>
<div>
In those 4 repositories, passwords and MachineKeys for the production environment were stored in plain text. Therefore I've decided to create a new repository. Side note: remember - passwords pushed to a git repository, there forever they will be, as Yoda would say. Therefore the new repository will have an entirely rewritten history (with the passwords removed). Naturally, all branches (masters in this case) from all repositories, with their history, must be included in the new repository. It will look like this:<br />
<br /></div>
<pre class="brush: text">new repo HEAD
|
M
| \
M \
| \ \
x' \ \
| y' z'
x' | |
| y' z'
x' | |
. y' z'
. . .
. . .
. .
Legend:
x' - commits from repository 1 with removed sensitive data
y' - commits from repository 2 with removed sensitive data
z' - commits from repository 3 with removed sensitive data
M - merges in the new repository
new repo HEAD - the brand new, future repo HEAD (master)</pre>
<div>
<h2>
Migration scripts</h2>
</div>
<div>
Migration must be done in an "atomic" way - at least it must be seen as an atomic operation from the developers' perspective: they commit to the old repos and, from some point in time, they commit to the new repo (note: stashes will have to be discarded). Therefore, I've decided to run the migration during the weekend, when the repositories are inactive. However, I don't like to work during weekends, so I wrote a script or two to automate the majority of the work. The <i>git filter-branch</i> command which I will be using is painfully slow, so additionally I've used a powerful Amazon EC2 instance to make things a little faster.<br />
<h2>
Step 1 - fetch all repos and form a nice repository structure</h2>
Note that in the state before, not all repos had their code in a <i>src</i> folder. To fix that, I'll use the <i>git filter-branch</i> command to entirely rewrite the history. Every commit in the history, blame etc. will look like it was committed to the right <i>src</i> folder. Additionally, I've noticed that someone was committing the <i>Packages</i> folder to git (possibly due to a poor <i>.gitignore</i> file), so now is the chance to remove that bloat permanently. Here is the bash script. Save it as <i>mergerepos.sh</i> and run it from the git bash console like a normal Linux script (<i>./mergerepos.sh</i>):</div>
<pre class="brush: bash">#!/bin/bash
FinalRepo="main"
echo $FinalRepo
mkdir $FinalRepo
cd $FinalRepo
git init
touch tmp
git add -A
git commit -m 'merge all repositories'
declare -a reponames=("repo1" "repo2" "repo3")
declare -a repourls=("https://user@bitbucket.org/Company/repo1.git" "https://user@bitbucket.org/Company/repo2.git" "https://user@bitbucket.org/Company/repo3.git")
numberofrepos=${#reponames[@]}
function rewriterepo {
git checkout $1/master
git checkout -b "$1master"
git filter-branch -f --tree-filter 'rm -rf packages
mkdir "src"
rm -rf src/packages
ls -A | grep -v ^[Ss]rc | grep -v \.git | while read filename
do
mv "$filename" "src/"
done' HEAD
}
for (( i=0; i<${numberofrepos}; i++ ));
do
echo $i " -> " ${reponames[$i]} $(date) "-" ${repourls[$i]} " STARTED"
git remote add ${reponames[$i]} ${repourls[$i]}
git fetch ${reponames[$i]}
rewriterepo ${reponames[$i]}
git remote rm ${reponames[$i]}
echo $i " -> " ${reponames[$i]} $(date) "-" ${repourls[$i]} " FINISHED"
done
</pre>
<div>
The script will:
<br />
<ul>
<li>set up a new repository</li>
<li>make a dummy commit</li>
<li>go through the list of given repositories and for each:</li>
<ul>
<li>add it as a remote, fetch it and check it out to a <i>repoXmaster</i> branch</li>
<li>clean each commit as follows:</li>
<ul>
<li>create the <i>src</i> folder, remove the <i>src/packages</i> folder</li>
<li>move every file/directory from the root, except the <i>src</i> and <i>.git</i> folders, into the <i>src</i> folder</li>
</ul>
<li>remove the added remote</li>
</ul>
</ul>
So far so good.
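<div>
Since <i>git filter-branch</i> is the only non-obvious part, here is a minimal, self-contained sketch of the tree-filter idea (a toy repository in a temp directory, with hypothetical file names). It shows that every historical commit - not just HEAD - ends up with its content under <i>src</i>:</div>

```shell
#!/bin/bash
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # newer git versions sleep 10s before filter-branch otherwise
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo
cd demo
git config user.email "you@example.com"
git config user.name "you"
echo a > a.txt && git add -A && git commit -qm "first"
echo b > b.txt && git add -A && git commit -qm "second"
# rewrite every commit on the branch, moving all content into src/
git filter-branch -f --tree-filter '
mkdir -p src
ls -A | grep -v "^src$" | grep -v "^\.git$" | while read -r filename
do
mv "$filename" src/
done' HEAD
git show HEAD~1:src/a.txt   # prints "a" - the first commit was rewritten too
```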
<br />
<br />
<h2>
Step 2 - merge all branches (repositories)</h2>
</div>
<div>
As I now have all repositories in the right structure, on separate branches of the one "chosen" repository, merging them is just a series of normal merge operations.</div>
<h2>
Step 3 - delete sensitive data (passwords etc)</h2>
<div>
This can be done by the painfully slow <i>git filter-branch</i> or... the fast and easy to use <a href="https://rtyley.github.io/bfg-repo-cleaner/" target="_blank">BFG Repo Cleaner</a>. Check the project website, it's self-explanatory. </div>
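<div>
For comparison, here is a toy sketch of the slow <i>git filter-branch</i> variant (hypothetical file name and secret; BFG does the equivalent with its <i>--replace-text</i> option, orders of magnitude faster):</div>

```shell
#!/bin/bash
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo
cd demo
git config user.email "you@example.com"
git config user.name "you"
echo "MachineKey=SECRET" > web.config
git add -A && git commit -qm "config with a leaked key"
echo code > app.txt && git add -A && git commit -qm "feature"
# rewrite every commit, scrubbing the secret from the config file
git filter-branch -f --tree-filter "sed -i 's/SECRET/REMOVED/g' web.config" HEAD
git show HEAD~1:web.config   # prints "MachineKey=REMOVED"
```

<div>
Note that after any history rewrite the old objects are still reachable via <i>refs/original</i> and the reflog - expire them and run <i>git gc</i> before pushing the cleaned repository anywhere.</div>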
<h2>
Step 4 - add a nice, root .gitignore</h2>
<div>
All my work of removing the redundant <i>Packages</i> folder could be undone by a single commit. Therefore I've merged all the existing <i>.gitignore</i> files and added their rules to the well-known <a href="https://github.com/github/gitignore/blob/master/VisualStudio.gitignore" target="_blank">github/gitignore for VS</a> file.</div>
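<div>
For reference, the decisive rules are the ones ignoring the NuGet packages folder; the merged file contains something along these lines (an excerpt - the exact wording depends on the version of the linked template):</div>

```
# NuGet Packages
*.nupkg
# The packages folder can be ignored because of Package Restore
**/packages/*
# except build/, which is used as an MSBuild target
!**/packages/build/
```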
<h2>
Further steps</h2>
<div>
I now have one repository with the right structure and a clean history. What are the further steps?</div>
<div>
Taking the chance, I've introduced one solution for all projects in the new VS 2015, <a href="https://github.com/owen2/AutomaticPackageRestoreMigrationScript" target="_blank">migrated to Automatic NuGet package restore</a> (check all those scripts - one also fixes project hint paths), changed all dll references to project references and upgraded projects to the new VS version. This is how I've done the csproj update:</div>
<pre class="brush: powershell">$listOfBadStuff = @(
"<Project DefaultTargets=""Build"" xmlns=""http://schemas.microsoft.com/developer/msbuild/2003"" ToolsVersion=""4.0"">",
"<OldToolsVersion>[0-9]\.0</OldToolsVersion>",
"<Project ToolsVersion=""12.0"""
)
$listOfGoodStuff = @(
"<Project DefaultTargets=""Build"" xmlns=""http://schemas.microsoft.com/developer/msbuild/2003"" ToolsVersion=""14.0"">",
"<OldToolsVersion>14.0</OldToolsVersion>",
"<Project ToolsVersion=""14.0"""
)
ls -Recurse -include *.csproj, *.sln, *.fsproj, *.vbproj, *.wixproj |
foreach {
$content = cat $_.FullName | Out-String
$origContent = $content
For ($i=0; $i -lt $listOfBadStuff.Length; $i++) {
$content = $content -replace $listOfBadStuff[$i], $listOfGoodStuff[$i]
}
if ($origContent -ne $content)
{
$content | out-file -encoding "UTF8" $_.FullName
write-host messed with $_.Name
}
}</pre>
<div>
<h2>
Summary</h2>
It was relatively easy to get from a nightmare to a reasonable repository environment. Those one or two days of merging repositories will pay off very quickly. Not to mention the removal of sensitive data from the repository - that can be priceless.</div>
PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com0tag:blogger.com,1999:blog-92551346682037876.post-80522933075247401942015-12-22T22:00:00.000+01:002015-12-23T08:48:23.797+01:00Presentation recommendation - LinkedIn's Active/Active EvolutionI like to watch IT-related presentations. They give you a lazy way of grasping the idea of "what's up" in a particular subject. It's only grasping, because by watching you won't learn actual skills, but you will know what exists, where the dragons are hidden and where to look deeper. What's more, if you are bored, you can always skip some parts or go to the next presentation - which is extremely important in our fast-paced business. And most importantly - always learn from the best.<br />
<br />
Today I recommend the following presentation: <a href="http://www.infoq.com/presentations/linkedin-scalability-arch" target="_blank">LinkedIn's Active/Active Evolution - Erran Berger</a><br />
Why? <a href="http://www.availabilitydigest.com/public_articles/0101/what_is_active-active.pdf" target="_blank">Active/active architecture</a> is probably one of the most popular solutions for scaling an existing monolith. Check out where you can expect problems and how the LinkedIn team solved them.PWhttp://www.blogger.com/profile/04772885730860606963noreply@blogger.com0