Calculate Mode for CONTINUOUS DATA
Hi
I am trying to code a method to calculate the mode for continuous data. The challenge here is getting the 'right' bin size for the data. I know this can be done using something called the "Incomplete gamma function" but I have no idea how to do this.
Any help would be greatly appreciated.
-
Bio
Either this is beyond my scope of math or you are using terms which are too general. Please, elaborate.
parza at 2007-7-13 8:07:03 >

Hi Parz,
Apologies if I am simplifying it too much now and am being patronising.
Normally, for discrete data, we would simply take the integer value that occurs most frequently in the sample.
However, for a continuous distribution (as with the biological data samples I am interested in) we have to bin the data into some form of histogram, take the bin with the highest frequency, and then use the middle value of that bin's range as the mode.
However, the valid bin range to use is dependent on the distribution, and this can be calculated with something called the "incomplete gamma function".
While I think I can code the "incomplete gamma function" using 'Numerical Recipes in C', I have no idea how this is tied to calculating the mode. Any help regarding how to use this function to calculate the mode, or any other method that can be used to efficiently calculate the mode for continuous data, would be very much appreciated.
I hope I have explained it better - sorry if I haven't.
-
BioInfoMan
Which distribution does your data come from?
parza at 2007-7-13 8:07:03 >

Mostly Gaussian, often skewed. Have I misunderstood your question?
The mode of a normal distribution is the same as the
expected value (also called the mean), and you can estimate
it by x_m = (1/n) * sum(x_i). If you then want the density
f(x_m) at this point, you can estimate the variance
by sigma^2 = 1/(n-1) * sum((x_i - x_m)^2) and then
f(x_m) = 1/(sigma*sqrt(2*pi)).
Or am I missing something?
http://mathworld.wolfram.com/NormalDistribution.html
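For reference, the three formulas above could be sketched in Java roughly as follows. This is only an illustration that assumes the data really is normal; the class and method names (NormalPeak, mean, sampleVariance, peakDensity) are made up for this example and do not come from any library.

```java
final class NormalPeak {
    // Sample mean: x_m = (1/n) * sum(x_i).
    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    // Sample variance: 1/(n-1) * sum((x_i - x_m)^2). Needs at least two values.
    static double sampleVariance(double[] x) {
        if (x.length < 2) {
            throw new IllegalArgumentException("need at least two values");
        }
        double m = mean(x);
        double sumSquares = 0.0;
        for (double v : x) sumSquares += (v - m) * (v - m);
        return sumSquares / (x.length - 1);
    }

    // Normal density at its peak: f(x_m) = 1 / (sigma * sqrt(2*pi)).
    static double peakDensity(double[] x) {
        double sigma = Math.sqrt(sampleVariance(x));
        return 1.0 / (sigma * Math.sqrt(2.0 * Math.PI));
    }
}
```

Under the normality assumption, NormalPeak.mean(data) would then serve as the mode estimate; for skewed data it will not coincide with the mode.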
parza at 2007-7-13 8:07:03 >

> While I think I can code the "incomplete gamma
> function" using 'Numerical Recipes in C' - I have no
> idea how this is tied to calculating the mode.
I've never heard of the incomplete gamma function before, so the
following is just a guess:
So if the incomplete gamma function tells you the bin size, and
your data is in the range [min, max], then the number of bins you need
to create is (int) Math.ceil((max - min) / binSize);
Now you have to go through the values in the array, and add each
one to the appropriate bin. The index of the appropriate bin can be
computed as (int) ((value - min) / binSize); the maximum value can
land one past the last bin, so clamp it to the last index.
Now determine which bin has the most values. Sort the values in
that bin and take the middle value.
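A minimal Java sketch of the binning steps just described, assuming the bin size is already known and passed in by the caller (how to derive it is the open question in this thread). BinnedMode, estimate and binIndex are hypothetical names for this example, and breaking ties in favour of the lowest bin is an arbitrary choice.

```java
import java.util.Arrays;

final class BinnedMode {
    // Estimates the mode as the median of the values in the fullest bin.
    // Assumption: binSize is chosen by the caller and is > 0.
    static double estimate(double[] data, double binSize) {
        if (data.length == 0 || !(binSize > 0)) {
            throw new IllegalArgumentException("need data and a positive bin size");
        }
        double min = Arrays.stream(data).min().getAsDouble();
        double max = Arrays.stream(data).max().getAsDouble();
        // At least one bin, even when all values are equal.
        int numBins = Math.max(1, (int) Math.ceil((max - min) / binSize));

        // First pass: count how many values fall into each bin.
        int[] counts = new int[numBins];
        for (double v : data) counts[binIndex(v, min, binSize, numBins)]++;

        // Pick the fullest bin (the lowest one wins on ties).
        int best = 0;
        for (int i = 1; i < numBins; i++) {
            if (counts[i] > counts[best]) best = i;
        }

        // Second pass: collect the values of that bin, sort, take the middle.
        double[] inBest = new double[counts[best]];
        int k = 0;
        for (double v : data) {
            if (binIndex(v, min, binSize, numBins) == best) inBest[k++] = v;
        }
        Arrays.sort(inBest);
        int n = inBest.length;
        return n % 2 == 1 ? inBest[n / 2] : (inBest[n / 2 - 1] + inBest[n / 2]) / 2.0;
    }

    // Truncating division gives the bin; the maximum is clamped into the last bin.
    private static int binIndex(double v, double min, double binSize, int numBins) {
        return Math.min((int) ((v - min) / binSize), numBins - 1);
    }
}
```

A call would look like BinnedMode.estimate(values, 2.0), where 2.0 is just a placeholder bin size.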
Sorry, I should have made it more clear. The data is often skewed heavily enough for the mode to be significantly different from the mean and median (at least significantly different for the purposes I require them for), so I definitely need to calculate the mode.
I know that there is a connection between the "Incomplete Gamma Function" and the bin range, I just don't know what the connection is. Once I have the bin range, I agree, that getting the mode is very trivial.
Information on the "Incomplete Gamma Function" can be found at
http://mathworld.wolfram.com/IncompleteGammaFunction.html
and some relationship to the mode, which I don't really understand (I am a biologist and have no background in Maths or Comp. Sci):
http://www.itl.nist.gov/div898/handbook/eda/section3/eda366b.htm
To be honest, I am just looking for any algorithm that will calculate the mode for continuous data in general (with or without the use of the Incomplete Gamma Function).
Thanks in advance.
Hi Guys,
Thanks for all your help so far. I have found a crude way of calculating the mode. Basically, I take sqrt(n), for sample size n, as the number of required bins, and therefore (max-min)/sqrt(n) as the bin range (the width of each bin).
From this I can, of course, quite trivially find the most frequent bin-interval, and take the mid-point of this bin-interval as the mode.
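For anyone who wants to try it, here is a rough Java sketch of this square-root rule. Two details are assumptions of the sketch and not part of the description above: sqrt(n) is rounded up to a whole number of bins, and ties between bins go to the lowest one. SqrtRuleMode and estimate are made-up names.

```java
final class SqrtRuleMode {
    // Mode estimate: midpoint of the fullest bin, using about sqrt(n) bins.
    static double estimate(double[] data) {
        int n = data.length;
        if (n == 0) throw new IllegalArgumentException("need at least one value");

        double min = data[0];
        double max = data[0];
        for (double v : data) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }

        // Assumption: round sqrt(n) up so the bin count is a whole number.
        int numBins = Math.max(1, (int) Math.ceil(Math.sqrt(n)));
        double width = (max - min) / numBins;
        if (width == 0.0) return min; // all values are identical

        int[] counts = new int[numBins];
        for (double v : data) {
            // The maximum lands exactly on the top edge, so clamp it.
            int idx = Math.min((int) ((v - min) / width), numBins - 1);
            counts[idx]++;
        }

        // Fullest bin; the lowest one wins on ties.
        int best = 0;
        for (int i = 1; i < numBins; i++) {
            if (counts[i] > counts[best]) best = i;
        }
        return min + (best + 0.5) * width;
    }
}
```

The result can only ever be one of the bin midpoints, so its resolution is limited to the bin width.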
This seems like a very crude method, so please let me know if anyone finds a better method. I will be checking in on this thread from time to time.
Thanks.