lessons learned from online gambling - predicting scalability

I work with someone who has spent a few years working for an online poker company who shall remain nameless. This company was responsible for a poker platform that supported both their own branded poker offering as well as being an engine for other companies who would layer on their branding. My colleague played an important role in taking their fairly well built existing system from thousands of users to tens of thousands of users, and in the process exposing a large handful of very deep bugs, some of which were core design issues.

Looking at this site http://www.pokerlistings.com/texas-holdem and seeing just this small sample of some of the top poker sites is a bit insane.  We're talking close to 100,000 concurrent players at peak hours JUST from the 16 top "texas hold-em" sites listed here. Who are these people? (I'm totally gonna waste some money online one of these days by the way) I'm sure this is just the tip of the iceberg too.  

It's a great domain for learning critical systems in my opinion. Real time, high concurrency, real money, third party integrations for credits and tournaments and the vast reporting that goes on for all that data being generated.

My colleague's experience in dealing with real time load and breaking the barriers of scalability are truly fascinating and a good source of learning for me. While I realize there are bigger puzzles out there, but it's not every day you have direct access to that experience where you work.  In any case, one such learning that I am in the process of trying to apply is a more forward looking approach to load modeling. That is, rather than to simply design and test for scalability; to actually drill down into the theoretical limit of what you are building in an attempt to predict failure.

This prediction can mean a lot of different things of course, being on a spectrum with something like a vague statement about being IO bound to much more complicated models of actual transactions and usage to enable extrapolating much richer information about those weak points in the system. In at least one case, my boss has taken this to the point where the model of load was expressed as differential equations prior to any code being written at all. Despite my agile leanings I have to say I'm extremely impressed by that. Definitely something I'd throw on my resume. So I'm simultaneously excited and intimidated at the prospect of delving into our relatively new platform that we're building in the hopes of producing something similar. I definitely see the value in at least the first few iterations of highlighting weak points and patterns and usage. How far I can go from there will be a big question mark.

For now I'll be starting at http://www.tpc.org/information/benchmarks.asp and then moving into as exhaustive list as I can of the riskiest elements of our system. From there I'll need to prioritize and find the dimensions that will impact our ability to scale. I expect with each there will be natural next steps to removing the barrier (caching, distributing load, eliminating work, etc) and I hope to be able to put a cost next to each of those.