Fun With Chemistry and Statistics

by Rant on March 16, 2007

in Doping in Sports, Floyd Landis, Tour de France

Among the many questions raised in the Floyd Landis case is whether the T/E ratios for Landis’ A sample indicate a positive test. On the surface, one might think they do: the A sample screening tests show T/E ratios of 4.9 and 5.1, and the confirmation test shows 11.4, all of which exceed WADA’s threshold of 4. But all is not necessarily as it appears.

Investigating further, it turns out that LNDD states that the T measurements are within +/- 20 percent and the E measurements are within +/- 30 percent. In its certification of the A sample results, LNDD also states that its measurement of the T/E ratio is within +/- 30 percent.

This last part is not correct. Given the margins of error for T and E separately, the margin of error for the T/E screening tests works out to -38.5 percent to +71.4 percent. (If you’re mathematically inclined, it’s pretty easy to derive these values. If you want an explanation of how I did so, leave a comment or email me and I’ll give you the derivations.)
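For the curious, here is one way to reproduce those two percentages. It is a minimal sketch that assumes LNDD's stated margins are treated as worst-case bounds: the lowest possible ratio pairs the smallest T with the largest E, and the highest pairs the largest T with the smallest E. The function name is made up for illustration.

```python
def ratio_error_bounds(t_err, e_err):
    """Worst-case fractional error on T/E, given fractional errors on T and E.

    Assumption: each margin is a hard bound, so the extremes of the ratio
    come from combining the extremes of the numerator and denominator.
    """
    low = (1 - t_err) / (1 + e_err) - 1   # smallest T over largest E
    high = (1 + t_err) / (1 - e_err) - 1  # largest T over smallest E
    return low, high


low, high = ratio_error_bounds(t_err=0.20, e_err=0.30)
print(f"{low:+.1%} to {high:+.1%}")  # -38.5% to +71.4%
```

Note that the result is lopsided: dividing by an uncertain number stretches the high side more than the low side, which is why a symmetric +/- 30 percent can't describe the ratio under these assumptions.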

Before going further, perhaps it’s a good idea to define a few terms.

Measured value — The measured value is what you’ve determined by using some sort of measuring system or device. So if you crest a hill in your little sports car while traveling over the speed limit and a cop at the bottom of the other side shines his radar gun on you at the same time you hit the brakes, the speed his radar gun records is a measured value. Let’s just hope it’s not too big, so your speeding ticket isn’t too costly. [Full disclosure: This is a true story.] His reading is subject to a margin of error that is +/- some percentage of the measured value.

True value — The true value is the real value of whatever you’ve measured. Due to some uncertainty in your car’s speedometer, and in the police officer’s radar gun, the real value — the speed you were traveling when the cop focused that radar gun on your car — may not be known. But while the real value isn’t always knowable, there’s a range within which the real value is likely to fall. That range is encompassed by the margin of error, described below.

Margin of error — The margin of error is expressed as +/- a percentage or some value to indicate that there is a range within which the measurement is correct, to some degree of confidence which is not always defined. This is similar to a confidence interval.

Confidence interval — The confidence interval is a range of values, within which — to a certain degree of confidence — the true value may lie. So, for example, if the confidence interval is 95%, there’s a 95% chance that the true value is within that range. Put another way, there’s also a 5% (or 1 in 20) chance that the true value falls outside the range. The higher the degree of confidence, the more likely the range of numbers includes the true value. But what needs to be mentioned here is that if it’s a 95% confidence interval, that does not mean that the measured value has a 95% chance of being the true value. It simply means that there’s a 95% chance that the true value is within that range.

I’m going to skip over definitions of standard deviation and delta units, to keep from making this too complicated.

Going back to the range of values for the T/E ratio from above: To make the math simpler, consider a T/E ratio of 5 (which is, conveniently, the average of 4.9 and 5.1, two of Landis’ values). Given LNDD’s margins of error for determining T and determining E, the range within which the true T/E value lies in this case is between 3.07 and 8.57. But what you don’t know is what the true value is. Could be the low value. Could be the high value. Could be any value in-between. Also, we don’t know anything about how LNDD came up with their margins of error.
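The range for a measured ratio of 5 can be checked with a few lines of Python. This is a sketch under the same worst-case reading of LNDD's margins (20 percent on T, 30 percent on E); the helper name is hypothetical. It also runs the two actual screening values, 4.9 and 5.1, alongside the convenient average.

```python
T_ERR = 0.20  # LNDD's stated margin on T
E_ERR = 0.30  # LNDD's stated margin on E


def true_value_range(measured):
    """Lowest and highest true T/E consistent with a measured T/E,
    assuming the margins above are worst-case bounds."""
    low = measured * (1 - T_ERR) / (1 + E_ERR)
    high = measured * (1 + T_ERR) / (1 - E_ERR)
    return low, high


for measured in (4.9, 5.0, 5.1):
    low, high = true_value_range(measured)
    print(f"measured {measured}: true value between {low:.3f} and {high:.3f}")
```

For a measured value of 5 this gives roughly 3.077 to 8.571, in line with the figures quoted above.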

Were these margins of error determined experimentally or were these margins of error supplied by the instrument’s manufacturer? Or were they derived from some formula?

Why is that important? Because we don’t know what degree of confidence to assign to the range of possible values. So we don’t know how likely it is that the true value could fall outside the range either. (And note: Being outside the range could mean either higher than or lower than the endpoints.)

How about approaching this from a different angle? We know that the threshold established by WADA to investigate whether there’s been a doping infraction involving testosterone is a T/E ratio of 4. Given LNDD’s margins of error, how high a measured value does LNDD need to get in order to say (with whatever degree of confidence) that the true value exceeds the threshold value?

It turns out, the answer to that question is a T/E ratio of approximately 6.5. Once the measured T/E ratio exceeds this value, the low end of the range exceeds 4, and whatever degree of confidence the range represents is your likelihood that the true value exceeds WADA’s threshold. For the range determined earlier, notice that the low end does not exceed 4 (it’s 3.07). This means that among the possibilities for the true value, there are values less than the threshold for investigation. Which to me indicates that this is not a positive test. But what about that 11.4 value from the confirmation testing?
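The break-even figure comes from turning the low end of the range around: if the low end is the measured value times 0.8/1.3, then the measured value needed to push the low end up to the threshold is the threshold times 1.3/0.8. Here is a small sketch of that arithmetic, again assuming worst-case margins; the function name is invented for the example. With unrounded factors the answer works out to 6.5.

```python
T_ERR = 0.20          # LNDD's stated margin on T
E_ERR = 0.30          # LNDD's stated margin on E
WADA_THRESHOLD = 4.0  # T/E ratio that triggers further investigation


def min_measured_to_clear(threshold):
    """Measured T/E at which the low end of the range equals the threshold.

    Solves measured * (1 - T_ERR) / (1 + E_ERR) = threshold for measured.
    """
    return threshold * (1 + E_ERR) / (1 - T_ERR)


cutoff = min_measured_to_clear(WADA_THRESHOLD)
print(f"break-even measured T/E: {cutoff:.2f}")

# The two screening values from the A sample
for measured in (4.9, 5.1):
    print(f"{measured}: low end clears the threshold? {measured > cutoff}")
```

Both screening values fall short of the break-even point, which is the same conclusion as above, reached from the other direction.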

Well, it’s an interesting number, to be sure. How it came to be is subject to speculation. The most interesting part is that WADA protocols require confirmation testing in triplicate. On close inspection of the A sample documentation, one can see that this didn’t happen. This is yet another reason the A sample test cannot be considered a positive test.

But on to the 11.4 reading itself. It is higher than the upper end of the range of possibilities from the screening tests and is more than double the first two values. What could have happened here? Could the true value really be this much different than the screening value? That seems to be pretty incredible. How can we account for this? Well, a couple of possibilities come to mind.

First, consider the B sample. As Arnie Baker points out in his slide show presentation, there’s a test result for the B sample that exceeds a threshold established by WADA for the amount of free testosterone or epitestosterone in the sample. This is supposed to indicate either degradation or contamination, and because it exceeds the threshold, all other testing should have been halted. But it wasn’t. And what did the further testing find? Well, the T/E ratios are all close to the A sample confirmation test, and the CIR (carbon isotope ratio) results match those from Landis’ A sample.

Looks pretty damning, doesn’t it? But not so fast. Remember, WADA has established that threshold for degradation and contamination for a reason. It’s their belief that this level of free T or E would seriously affect the results of any testing performed and make those results unreliable. Now, others have argued that you can get valid results at concentrations even higher than the WADA threshold, and that these results are therefore still valid.

But those aren’t the rules of the game here. The rules say that no testing is supposed to go on after this determination. So, in this case, by WADA’s own rules this data should be thrown out. Now where it gets interesting is with the A sample confirmation test.

The A sample confirmation test has a similar T/E ratio, as discussed above. And the CIR studies are quite similar, too. Could it be that Landis’ sample had already degraded by the time these tests were performed? As it turns out, it could be the case.

One thing that the WADA protocols recognize is that in urine, testosterone and its metabolites aren’t completely stable. Under the right conditions, these chemicals can break down, and that breakdown can have an impact on the measured values for T, for E, and for their metabolites. What are the right conditions? Well, for starters, if the samples were stored at room temperature for an extended period of time.

Or if bacteria that might facilitate the breakdown of these chemicals were present in the original sample. Or if the solutions weren’t properly stabilized or buffered to prevent such breakdowns. Even if you take all the right precautions, it’s still possible that over time breakdown will occur.

This is why WADA established a standard that says, in effect, if you have too much of these breakdown products you can’t use the sample for further testing. Somewhere in the testing of the A sample, I believe you can find the determination of whether the threshold was exceeded when testing began, or the data to make that determination. We know that the value can be calculated for the B sample, because Arnie Baker cites it in his slide show, and he shows that where the value shouldn’t exceed 5%, the B sample value is (if I recall) 7.7%.

Somewhere between initial A sample testing and the B sample testing something happened. What it was isn’t clear and when it happened isn’t clear. Various parts of the A sample testing were performed on different days. And knowing that T, E, and their metabolites can break down, even though WADA might not require it, good lab technique would be to make certain at the beginning of each day that the sample hasn’t degraded since the last tests were performed. One could argue that if the buffering and stabilizing were done properly, this might not need to be checked as often, but given the stakes involved, I believe it would be better to err on the side of caution.

So here’s what I believe may (emphasis on the word may) have happened: Either Landis’ sample had already become degraded or contaminated when the A sample confirmation test was run, or, knowing what we now know about that sample number from Stage 19, perhaps someone else’s sample was tested instead.

What would be interesting to look at would be that other person’s test results from Stage 19. Assume that he had screening tests of around 11, and a confirmation test that was around 4 or 5 (the exact opposite of Landis’ screening and confirmation results). Given that information, I would conclude that there is a high likelihood that the data for the two samples were incorrectly reported. Landis’ real data may have been marked down for the other rider, and the other rider’s data may have been recorded for Landis.

Perhaps in a lab technician’s haste, the samples were inadvertently mixed together or inadvertently switched. It’s hard to know exactly what happened. But one thing is certain: The discrepancy between Landis’ screening tests and his confirmation tests on the A sample seems too big. If you go by LNDD’s assertion that they can measure T/E ratios to within +/- 30 percent of the true value, then the problem is worse still.

That’s because, while the confirmation test should be more accurate, as Arnie Baker says, one should expect the value it determines to be within +/- 30% (going by LNDD’s margins of error) of the screening test results. A result outside that range suggests some problems with the accuracy of the measuring instruments, the lab’s procedures, or the ability of the technicians to properly follow procedures.
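To put numbers on that discrepancy, here is a quick sketch comparing the 11.4 confirmation value against the range around each screening value, first using LNDD's stated +/- 30 percent on the ratio and then using the wider worst-case range derived from the separate T and E margins. The helper names are made up for the example, and both ranges rest on the assumption that the margins are hard bounds.

```python
SCREENING = (4.9, 5.1)  # A sample screening T/E ratios
CONFIRMATION = 11.4     # A sample confirmation T/E ratio


def stated_range(measured, err=0.30):
    """Range implied by LNDD's stated +/- 30 percent on the T/E ratio."""
    return measured * (1 - err), measured * (1 + err)


def derived_range(measured, t_err=0.20, e_err=0.30):
    """Worst-case range built from the separate margins on T and E."""
    return (measured * (1 - t_err) / (1 + e_err),
            measured * (1 + t_err) / (1 - e_err))


for s in SCREENING:
    for label, (low, high) in (("stated +/- 30%", stated_range(s)),
                               ("derived", derived_range(s))):
        inside = low <= CONFIRMATION <= high
        print(f"screening {s}, {label}: {low:.2f} to {high:.2f}, "
              f"11.4 inside: {inside}")
```

Under either reading, 11.4 lands above the top of the range for both screening values.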

Given the lab’s history, problems with equipment and/or lab procedure are not out of the realm of possibility. Every day this case seems to get curiouser and curiouser. Who knows what the next revelation may bring?
