Calculate Mode for CONTINUOUS DATA

Hi

I am trying to code a method to calculate the mode for continuous data. The challenge here is getting the 'right' bin size for the data. I know this can be done using something called the "Incomplete gamma function" but I have no idea how to do this.

Any help would be greatly appreciated.

-

Bio

[331 byte] By [BioInfoMana] at [2007-10-1 22:02:50]
# 1
Either this is beyond my scope of math or you are using terms which are too general. Please elaborate.
parza at 2007-7-13 8:07:03 > top of Java-index,Other Topics,Algorithms...
# 2

Hi Parz,

Apologies if I am simplifying it too much now and am being patronising.

Normally, for discrete data, we would simply take the integer value that occurs most frequently in the sample.

However in a continuous distribution (as is the case in the biological data samples that I am interested in) we have to bin the data into some form of histogram - take the bin range that has the highest frequency and then use the middle value of this bin range as the mode.

However, the valid bin range to use is dependent on the distribution, and this can be calculated with something called the "incomplete gamma function".

While I think I can code the "incomplete gamma function" using 'Numerical Recipes in C' - I have no idea how this is tied to calculating the mode. Any help regarding how to use this function to calculate the mode - or any other method that can be used to efficiently calculate the mode for continuous data - would be very much appreciated.

I hope I have explained it better - sorry if I haven't.

-

BioInfoMan

BioInfoMana at 2007-7-13 8:07:03
# 3
Which distribution does your data come from?
parza at 2007-7-13 8:07:03
# 4
Mostly Gaussian, often skewed. Have I misunderstood your question?
BioInfoMana at 2007-7-13 8:07:03
# 5

The mode of a normal distribution is the same as the expected value (also called the mean), and you can estimate it by x_m = 1/n * sum(x_i). If you then want the value f(x_m) at this point, you can approximate the variance by sigma^2 = 1/(n-1) * sum((x_i - x_m)^2), and then f(x_m) = 1/(sigma*sqrt(2*pi)).

Or am I missing something?

http://mathworld.wolfram.com/NormalDistribution.html
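
The three formulas in this post can be sketched in Java roughly as follows. This is only an illustration under the post's own assumption that the data are normally distributed (for skewed data the mean will not coincide with the mode); the class and method names are hypothetical and not part of any library.

```java
public final class NormalPeakSketch {

    /** Sample mean: x_m = (1/n) * sum(x_i). */
    public static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    /** Sample variance: sigma^2 = 1/(n-1) * sum((x_i - x_m)^2). Needs n >= 2. */
    public static double variance(double[] x) {
        double m = mean(x);
        double sumOfSquares = 0.0;
        for (double v : x) {
            sumOfSquares += (v - m) * (v - m);
        }
        return sumOfSquares / (x.length - 1);
    }

    /** Height of the fitted normal density at its mode: 1 / (sigma * sqrt(2*pi)). */
    public static double peakDensity(double[] x) {
        return 1.0 / Math.sqrt(2.0 * Math.PI * variance(x));
    }

    public static void main(String[] args) {
        // Made-up sample values, used only to exercise the methods.
        double[] sample = {2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0};
        System.out.println("mean (= mode if normal): " + mean(sample));
        System.out.println("variance: " + variance(sample));
        System.out.println("f(x_m): " + peakDensity(sample));
    }
}
```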

parza at 2007-7-13 8:07:03
# 6

> While I think I can code the "incomplete gamma function" using 'Numerical Recipes in C' - I have no idea how this is tied to calculating the mode.

I've never heard of the incomplete gamma function before, so the following is just a guess.

If the incomplete gamma function tells you the bin size, and your data is in the range [min, max], then the number of bins you need to create is (int) Math.ceil((max - min) / binSize).

Next, go through the values in the array and add each one to the appropriate bin. The bin index can be computed as (int) ((value - min) / binSize). A value equal to max can land one past the last bin when the range is an exact multiple of binSize, so clamp the index to the last bin.

Finally, determine which bin has the most values, sort the values in that bin, and take the middle value.
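
A minimal Java sketch of the procedure described in this post, assuming binSize is already known and positive and the array is non-empty. The class and method names are hypothetical. When two bins tie, the first one wins, and for a bin with an even number of values the upper of the two middle values is returned; both are arbitrary choices made for this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public final class BinnedModeSketch {

    /** Middle value of the fullest bin, for bins of width binSize starting at min. */
    public static double modeByBinning(double[] values, double binSize) {
        double min = values[0];
        double max = values[0];
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }

        // At least one bin, even when all values are identical.
        int binCount = Math.max(1, (int) Math.ceil((max - min) / binSize));
        List<List<Double>> bins = new ArrayList<>();
        for (int i = 0; i < binCount; i++) {
            bins.add(new ArrayList<>());
        }

        for (double v : values) {
            int index = (int) ((v - min) / binSize);
            if (index >= binCount) {
                index = binCount - 1; // the maximum may fall exactly on the upper edge
            }
            bins.get(index).add(v);
        }

        // Pick the bin holding the most values (first one wins on ties).
        List<Double> fullest = bins.get(0);
        for (List<Double> bin : bins) {
            if (bin.size() > fullest.size()) {
                fullest = bin;
            }
        }

        Collections.sort(fullest);
        return fullest.get(fullest.size() / 2);
    }

    public static void main(String[] args) {
        // Made-up sample values, used only to exercise the method.
        double[] sample = {1.0, 2.1, 2.2, 2.3, 3.5, 4.9};
        System.out.println(modeByBinning(sample, 1.0));
    }
}
```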

rkippena at 2007-7-13 8:07:03
# 7
Sorry, I should have made it clearer. The data is often skewed heavily enough for the mode to differ significantly from the mean and median (at least significantly for the purposes I require them for), so I definitely need to calculate the mode.
BioInfoMana at 2007-7-13 8:07:03
# 8

I know that there is a connection between the "Incomplete Gamma Function" and the bin range; I just don't know what the connection is. Once I have the bin range, I agree that getting the mode is trivial.

Information on the "Incomplete Gamma Function" can be found at

http://mathworld.wolfram.com/IncompleteGammaFunction.html

and some relationship to the mode, which I don't really understand (I am a biologist and have no background in Maths or Comp. Sci):

http://www.itl.nist.gov/div898/handbook/eda/section3/eda366b.htm

To be honest, I am just looking for any algorithm that will calculate the mode for continuous data in general (with or without the use of the Incomplete Gamma Function).

Thanks in advance.

BioInfoMana at 2007-7-13 8:07:03
# 9

Hi Guys,

Thanks for all your help so far. I have found a crude way of calculating the mode. Basically, take sqrt(n), for sample size n, as the number of required bins, and thus (max-min)/sqrt(n) as the bin range.

From this I can, of course, quite trivially find the most frequent bin-interval, and take the mid-point of this bin-interval as the mode.
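
The square-root rule above might look something like this in Java. It is a rough sketch, not a recommendation: rounding sqrt(n) up to a whole number of bins, clamping the maximum into the last bin, and letting the first bin win on ties are assumptions made here, and the class and method names are hypothetical.

```java
public final class SqrtRuleModeSketch {

    /** Midpoint of the most frequent bin when the data are split into about sqrt(n) bins. */
    public static double modeBySqrtRule(double[] values) {
        double min = values[0];
        double max = values[0];
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        if (max == min) {
            return min; // all values identical, nothing to bin
        }

        // Number of bins: sqrt(n), rounded up to a whole number (an assumption).
        int binCount = (int) Math.ceil(Math.sqrt(values.length));
        double binRange = (max - min) / binCount;

        int[] counts = new int[binCount];
        for (double v : values) {
            int index = (int) ((v - min) / binRange);
            if (index >= binCount) {
                index = binCount - 1; // the maximum falls on the upper edge
            }
            counts[index]++;
        }

        // Most frequent bin (first one wins on ties).
        int best = 0;
        for (int i = 1; i < binCount; i++) {
            if (counts[i] > counts[best]) {
                best = i;
            }
        }

        // Midpoint of the winning bin interval.
        return min + (best + 0.5) * binRange;
    }

    public static void main(String[] args) {
        // Made-up sample values, used only to exercise the method.
        double[] sample = {0.0, 1.0, 4.0, 4.5, 5.0, 5.5, 7.0, 8.0, 9.0};
        System.out.println(modeBySqrtRule(sample));
    }
}
```

Because the estimate is always a bin midpoint, its resolution is limited to the bin range, which is one reason the method feels crude for small samples.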

This seems like a very crude method, so please let me know if anyone finds a better method. I will be checking in on this thread from time to time.

Thanks.

BioInfoMana at 2007-7-13 8:07:03