Every now and then I’ll come across a math problem as part of my work… It’s not what I do it’s just something that happens. When it does happen, I either become obsessed with it until I solve it (or at least THINK I solve it, one way or another) or I eventually give up with the blunt realization that I just spent hours thinking about something that is likely to cause psychosis in old age. Fortunately, the problem I ran across today seems to have fallen into the first category…

Here’s the question, simplified… Imagine you have a collection of at least 10 items, of which you are going to choose 5 at random. One of those 5 is then taken away and replaced with a different NEW item (pulled from some other pile of items). Next you choose another 5, again randomly. The questions I asked myself in this scenario were:

- What are the chances that you WON’T see any of the same 5 that you originally picked?
- What if the collection is 20 items instead of 10 items?
- How many items have to be in the larger collection so that you have at least a 90% chance of NOT seeing any one of the originally chosen items if you are choosing 5 at a time?

##### The Math…

This actually is a fairly simple problem to solve… but I ran thru an interesting programming journey on the way to the answers, and it was the programming aspect that prompted this post. To reiterate the scenario, the collection starts with 10 items of which 5 are chosen and one is taken away; then another NEW item is added, again giving you 10 items from which to choose 5 of them, and out of the 10 from which you choose, 4 of them were ones you saw before.

When working problems of this kind (because I’m not so good at probabilities) I prefer to back up to a simpler case. So let’s consider what happens when you have 6 items in the collection, and you choose 3 at a time. This means that on the second draw, there will be 2 that must be avoided. If the items in the second draw are A, B, C, D, E, and F, and we presume that E and F must be avoided, then we have the following:

- “Good” choices: ABC, ABD, ACD, BCD
- “Bad” choices: ABE, ABF, ACE, ACF, ADE, ADF, AEF, BCE, BCF, BDE, BDF, BEF, CDE, CDF, CEF, DEF

So there are 20 possible choices, of which only 4 yield a good result. Each of these numbers is a binomial coefficient. The number 20 is the total number of 3-letter combinations, or _{6}C_{3}. The number 4 is the number of 3-letter combinations of only the first 4 “good” letters, or _{4}C_{3}. So it should be clear that in general, the number of total choices is _{N}C_{r} where *N* is the number of items in the collection, *r* is the number of items we are picking, and the number of “good” choices is _{M}C_{r} where *M* = *N* – *r* + 1.

So now the answer to the first question becomes easy… If you only have 10 items to choose from at a time, this means that 5 were chosen from the original 10, and then one of those 5 goes away and will be replaced by a new (unseen) item, so there are 4 out of 10 that need to be avoided. Thus in this case, N=10, r=5, and M=6. And thus, the likelihood that you WON’T see any of the original 5 (but really 4) items is:

Wow… the answer really is 42!!! (Sort of.)

Ok, so what about the second question… well the answer is still easy, because it is the same formula… except now the numbers get bigger… N=20, r=5, M=16:

Ok so what about the third problem? Well it’s just a matter of plugging numbers in, but calculating binomial coefficients over and over for such large numbers really gets tiring… and besides, like any good programmer, I wanted confirmation that I was doing my math right… And that’s where we move to the fun part of this story.

##### Where’s the LINQ?

Ok so at this point I have two of the three answers, but I wanted to make sure I wasn’t screwing up the “easy” part. I figured that this type of problem isn’t exactly custom-tailored for a LINQ query, but I figured I could have some fun and learn a bit more about Linq by using “Enumerable.Range” and some Linq extension methods. So it’s just a matter of getting the code working… But one thing that I really hate is using Visual Studio to create some dinky little project to do this sort of stuff – I’d much prefer a “snippet compiler” that will let me just type a bit of code and run it. There are some programs around that do this, such as… (wait for it…) SnippetCompiler. It is a neat program because it basically let you write fully functional program without having to open VS.net. But I found it’s IntelliSense somehow clunky and something about it always made me not like it too much, but it was still better than the alternative. However, recently I have found a FANTASTIC application called LinqPad. LinqPad can also compile snippets, but it is much more than just that… It can dynamically evaluate LINQ expressions (or entire .NET programs), translate between LINQ syntax and lambda syntax, show you the generated IL, and can access databases on the fly so you are working with real data. The author even touts it as a replacement for SSMS, and it certainly could, but I think that’s going a bit far since the results display isn’t quite tailored to that type of data, and I don’t think you can edit data in place (but I might be wrong). It includes a lot of examples and apparently you can download more. BUT… one of the ** really** nice things is that it can compile and execute single statements – they don’t even have to be full programs as they do in SnippetCompiler. For example, you can just type in “Enumerable.Range(10,10).ToList().ConvertAll(x => new { X=x, XSquared = x*x })” and here’s what you get out:

Amazing! And trust me… that’s only the tip of the iceberg. (Note that the shaded line is a “total sum” of the column – I don’t know why they do that, but I don’t see a way to turn it off.) LinqPad also provides IntelliSense, although that is a feature you have to purchase.

##### Now comes the crazy part…

OK so I now started writing up some some LINQ and I quickly realized that there was no Factorial function in Math or any of the other standard libraries, and there was no way I was getting by without it. So my first hunt was to find a Factorial function. Not just any factorial function, but one written in LINQ, of course. (Note that this was only because I was having fun; LinqPad is fully capable of compiling and executing non-LINQ C# code, so I could have just written a function to do it.) My hunt took an odd turn, though, because it seems that writing a recursive function in Linq isn’t as easy as it is in good old fashioned code. But with the magic and a bit of wizardry, it can be done. I won’t try to explain it (because I don’t even come close to understanding it) but you can find the solution here. They use something called a “YCombinator” to do it, but to keep things simple and clear, here is just the lambda that results from its use:

Func<int, int> factorial = Extensions.YCombinator<int, int>(fact => n => n < 2 ? 1 : n * fact(n - 1));

And a quick test of this shows that it works:

var N = Enumerable.Range(1,10).ToList().ConvertAll(x => new { x = x, factorial = factorial(x) });

Ok so now instead of building up a whole lot of factorial expressions, I’ll also define a lambda that handles the Combination function, like so:

Func<int, int, int> C = (n, r) => factorial(n) / factorial(r) / factorial(n - r);

var N = Enumerable.Range(3,10).ToList().ConvertAll(x => new { x = x, factorial = factorial(x), nC3 = C(x, 3) });

Coolio!!! Alright now we are talking, because now I can verify my answers. In order to do this, I’ll just calculate each of the values from N=10 to N=20 (why not!), along with the corresponding percentage calculation:

var N = Enumerable.Range(10,10).ToList(); int r = 5; var y = N.ConvertAll(x => new { M = x-r+1, mCr = C(x-r+1, r), N = x, nCr = C(x, r), Frac = ((decimal)C(x-r+1, r) / C(x, r)).ToString("P2") });

Oops!! Divide by zero??? How in the heck could that happen?? So I commented out the only place there could be a divide by zero, the Frac calculation:

var N = Enumerable.Range(10,11).ToList(); int r = 5; var y = N.ConvertAll(x => new { M = x-r+1, mCr = C(x-r+1, r), N = x, nCr = C(x, r), //Frac = ((decimal)C(x-r+1, r) / C(x, r)).ToString("P2") });

Sure enough – there’s the zeros… but WHY – there’s no way that a binomial coefficient can produce zero! …And the first several answers are all correct. Then I realized that the problem you always have when dealing with Factorials – the numbers get big really fast. Maybe int wasn’t the best choice… so I switched to long:

static Func<long, long> factorial = Extensions.YCombinator<long, long>(fact => n => n < 2 ? 1 : n * fact(n - 1)); static Func<long, long, long> C = (n, r) => factorial(n) / factorial(r) / factorial(n - r); var N = Enumerable.Range(10,11).ToList(); int r = 5; var y = N.ConvertAll(x => new { M = x-r+1, mCr = C(x-r+1, r), N = x, nCr = C(x, r), Frac = ((decimal)C(x-r+1, r) / C(x, r)).ToString("P2") });

Now that is more like it… and so now I had the answers to TWO of the questions… With 10 in the collection and choosing 5 at a time, you have only a 2.38% chance of not seeing one of the original items, and with 20 items you have a 28.17% chance of not seeing one of the original items. So what would it take to get a 90% chance of not seeing one of the original items? OK no big deal, I’ll just keep going… Uhhhh… wait a minute. The long type ALSO has a maximum value… I wondered how “long” would it be before I started getting mangled numbers? Not very long it turns out. In fact, the calculation goes bonkers on the very next value of N, N=21:

var N = Enumerable.Range(20,2).ToList();

So now what? Double?? At first it appeared that it would work… but RIGHT BEFORE I got to my answer of 90%, KABOOM!

static Func<double, double> factorial = Extensions.YCombinator<double, double>(fact => n => n < 2 ? 1 : n * fact(n - 1)); static Func<double, double, double> C = (n, r) => factorial(n) / factorial(r) / factorial(n - r); var N = Enumerable.Range(160,14).ToList(); int r = 5; var y = N.ConvertAll(x => new { M = x-r+1, mCr = C(x-r+1, r), N = x, nCr = C(x, r), Frac = ((double)C(x-r+1, r) / C(x, r)).ToString("P2") });

At this point, I want you to take notice of the fact that I’m already up to 160+ items in the collection… even with this many, when choosing 5, and replacing 1 of those 5 with a new item, and then picking 5 more from the full collection, you still don’t quite have a 90% chance of NOT seeing one of the original 5. REALLY?!?! ** Out of 160+ items I’m only picking 5, trying to avoid 4 of them, and I don’t have a 90% chance of doing that!?!?!** No wonder I always lost at Battleship. I could pick 5 items from the collection 10 times over and still not have picked one-third of the available items… odd… but ok, I trust the math.

Anyway… Still without an answer to the question, and coming SO close, I couldn’t drop it now. So what is out there that’s larger than a double and will operate on my measly 32 bit machine? Ahhh yes… BigInt! So that started my quest for a BigInt class for C#, and I was surprised how hard it was to find one. Apparently, there used to be a BigInt included in the System.Core library in the pre-release versions of .NET 3.5, but it was taken out before the official release. Too bad – that would have been the easiest way to go. I did eventually find a C# BigInt here.

So now I plugged in my BigInt, and after a few tweaks to handle the implementation of the BigInt class and some tweaks for efficiency and display, I arrived at my answer:

var N = Enumerable.Range(190,11).ToList(); int r = 5; var y = N.ConvertAll(x => new { M = x-r+1, mCr = C(x-r+1, r), N = x, nCr = C(x, r) }); var z = y.ConvertAll(yy => new { M = yy.M, mCr = yy.mCr.ToString(), N = yy.N, nCr = yy.nCr.ToString(), Frac = (double.Parse(yy.mCr.ToString()) / double.Parse(yy.nCr.ToString())).ToString("P2") });

Thus… I calculate that you would need a collection of 194 items from which to pull 5 items in order to have a >90% chance that you wouldn’t see one of the original 5 (really 4, since one is taken away). That seems totally counter-intuitive to me, but this apparently a good example of diminishing returns!

My next quest… to find Factorial/Combination classes (probably utilizing BigInt) that can efficiently handle the fact that calculations of this kind, although they involve HUGE numbers if you work them out directly as I have here, actually represent relatively small numbers since factors often cancel out and I probably could have calculated the answer as fast on paper faster as it took me to write the code (assming I knew what the answer was!). The above query didn’t take that long to run on my machine (almost 25 seconds) but it took me a while to figure out where the right value was. The point is, with specially coded Factorial and Combination classes, I think this could be much faster and much more efficient.

Finally, along the course of this, I also found a nice LaTeX Equation Image Genarator on the web that I used for the math images… actually I found 3 or 4, but this was the only one that seemed to be working, so thanks to Roger!

* *

~ Michael D. Bray