general statistics problem: how to best characterize non-normal distributions

Even though it is not directly MATLAB-related, I figured I would pose this question to the MATLAB community because there are a bunch of smart and helpful people here :D
I have looked and looked but I cannot find a straightforward test or method to characterize a distribution that fails a normality test. I have read several peer-reviewed scientific journal articles where this does not stop authors from giving a mean and standard deviation (!) but I think that is a bad thing to do.
My current approach is to compute a kernel smoothing density estimate of the distribution using a function I wrote around the built-in ksdensity() function, and to adjust the smoothing bandwidth until the estimate portrays the data well (not too spiky, not too rounded). I then give the peak value of the kernel estimate as my "mean" (i.e. the one number people will look at and prematurely judge everything by). The only way I know to then characterize the width of the distribution is to give a full width at half maximum. Of course this is not good, because the distribution tends not to be symmetric around the peak, and the width is often of the same order of magnitude as the peak value itself.
So people I am working with want to see some kind of error bars, and I have no idea what to give them to make them happy.
This is a recurring theme in my current work and I am desperate to find a good solution, so any pointers would be greatly appreciated. I am sure I am not the only one who has to deal with non-Gaussian distributions.
If you want to see an example of one of these distributions, there are a couple in Figure 3 in the paper you can find here:
Thanks in advance, Rory

Accepted Answer

Andrew Newell on 10 Jun 2011
You should NOT use the peak of your distribution to estimate the mean, because it is not the mean. It is the mode.
Since your distribution is skewed, it might be better to use the geometric mean or harmonic mean (see Measures of central tendency). You could also estimate some measure of dispersion and shape.
For estimating the errors in these statistics, you could use the bootstrap or the jackknife (see Resampling Statistics).
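As a rough sketch of these suggestions (not a definitive recipe), the snippet below computes the geometric and harmonic means, bootstraps a 95% confidence interval for the geometric mean, and estimates its standard error with the jackknife. The sample x is synthetic gamma data standing in for real measurements, and the 2000 resamples are an arbitrary choice; both are assumptions made for illustration. The functions geomean, harmmean, bootci, and jackknife come from the Statistics Toolbox.

```matlab
% Synthetic, positively skewed sample standing in for real data (assumption)
rng(1);                          % fix the seed so the run is repeatable
x = gamrnd(2, 3, 500, 1);        % gamma with shape 2, scale 3

% Alternative measures of central tendency (both need positive data)
gm = geomean(x);
hm = harmmean(x);

% Bootstrap 95% confidence interval for the geometric mean
nboot = 2000;                    % number of resamples (arbitrary choice)
ci = bootci(nboot, @geomean, x); % ci(1) = lower bound, ci(2) = upper bound

% Jackknife estimate of the standard error of the geometric mean
jackstat = jackknife(@geomean, x);   % one leave-one-out value per observation
n  = numel(x);
se = sqrt((n - 1) / n * sum((jackstat - mean(jackstat)).^2));

fprintf('geometric mean = %.3f, 95%% CI [%.3f, %.3f], jackknife SE = %.3f\n', ...
        gm, ci(1), ci(2), se);
fprintf('harmonic mean  = %.3f\n', hm);
```

The interval from bootci could serve as the error bars on whichever statistic is reported; swapping @geomean for @median or another function handle works the same way.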
You could also explore MATLAB's collection of distributions to see if any look like your data (see Distribution Reference). For example, some of the curves look like the Gamma distribution. However, each distribution is a model of a particular kind of statistical process, so ideally you should understand what a distribution represents before using it.
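A minimal sketch of trying one candidate distribution, assuming the gamma is a plausible model: gamfit returns maximum likelihood estimates with 95% confidence intervals, and overlaying the fitted density on a normalized histogram gives a quick visual check. The data here are synthetic and the 30 histogram bins are an arbitrary choice.

```matlab
% Synthetic stand-in data (assumption); replace with real measurements
rng(1);
x = gamrnd(2, 3, 500, 1);

% Maximum likelihood fit: phat = [shape, scale]
% pci(1,:) holds the lower and pci(2,:) the upper 95% confidence bounds
[phat, pci] = gamfit(x);

% Normalize the histogram so that it integrates to 1, like a density
[counts, centers] = hist(x, 30);
binWidth = centers(2) - centers(1);
bar(centers, counts / (sum(counts) * binWidth), 1);

% Overlay the fitted gamma density
hold on
xx = linspace(0, max(x), 200);
plot(xx, gampdf(xx, phat(1), phat(2)), 'r', 'LineWidth', 2);
hold off
xlabel('x'); ylabel('probability density');
legend('data', 'gamma fit');
```

A good-looking overlay is only a visual check; it does not by itself show that the gamma describes the process that generated the data.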
  2 Comments
Rory Staunton on 11 Jun 2011
I shouldn't have written "mean", not even in scare quotes---I know the peak is not the mean and I have never actually conflated the two, until now apparently.
Thanks for your help and I will look into your suggestions, especially the resampling statistics, as I am unfamiliar with bootstrap and jackknife methods.
Andrew Newell on 11 Jun 2011
Sorry for overlooking the scare quotes. Notice, by the way, that all the links are MATLAB links. That makes this a MATLAB question!


More Answers (2)

Tom Lane on 12 Jun 2011
Some other things I might consider:
1. Look at distributions of the log(data).
2. Consider using the median and quartiles (it may be more intuitive to use the interquartile range) or other quantiles. It may be possible to find theoretical ways to compute confidence intervals for those quantities, but the bootstrap approach may be adequate. Also Google for "five number summary."
3. There are larger families of distributions that include the normal as a special case. Look into the Johnson and Pearson families. There are Statistics Toolbox functions johnsrnd and pearsrnd for generating random samples from these distributions, but the "fitting" step is simply computing quantiles or moments.
-- Tom
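The three suggestions above might look roughly like this in practice. This is a sketch under stated assumptions: x is synthetic lognormal data, the 2000 bootstrap resamples are arbitrary, and the percentile bootstrap is just one of the interval types bootci offers. All functions used are from the Statistics Toolbox.

```matlab
% Synthetic skewed data standing in for real measurements (assumption)
rng(1);
x = lognrnd(1, 0.5, 500, 1);

% 1. Look at the log of the data (requires strictly positive values)
logx = log(x);
normplot(logx);                  % a roughly straight line suggests lognormal data

% 2. Five-number summary: min, lower quartile, median, upper quartile, max
fiveNum = prctile(x, [0 25 50 75 100]);
spread  = iqr(x);                % interquartile range

%    Percentile-bootstrap 95% confidence intervals for the quartiles
quartiles = @(d) prctile(d, [25 50 75]);
ci = bootci(2000, {quartiles, x}, 'type', 'per');  % row 1 lower, row 2 upper

% 3. Pearson system: the "fit" is just the first four sample moments
[r, type] = pearsrnd(mean(x), std(x), skewness(x), kurtosis(x), 1000, 1);
% r is a random sample from the matching Pearson distribution,
% type is the Pearson type (0 through 7) selected by those moments
```

Comparing a histogram of r with a histogram of x gives a rough idea of whether the moment-matched Pearson distribution resembles the data.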
  1 Comment
Andrew Newell on 12 Jun 2011
See http://www.mathworks.com/help/toolbox/stats/br5k833-1.html#br5k833-2 for the Johnson and Pearson distributions.



bym on 10 Jun 2011
I think a good distribution would be the Weibull, which is available in the Statistics Toolbox. You could then use the distribution's parameters to compare datasets rather than the mean and standard deviation.
doc wblfit
You can get confidence intervals for the parameters. Would that suffice for error bars?
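For illustration, a minimal sketch of that fit, assuming the data really are Weibull-like (the sample here is synthetic):

```matlab
% Synthetic data: Weibull with scale 5 and shape 1.5 (assumption)
rng(1);
x = wblrnd(5, 1.5, 500, 1);

% parmhat = [scale, shape]
% parmci(1,:) holds the lower and parmci(2,:) the upper 95% confidence bounds
[parmhat, parmci] = wblfit(x);

fprintf('scale = %.3f  (95%% CI %.3f to %.3f)\n', ...
        parmhat(1), parmci(1,1), parmci(2,1));
fprintf('shape = %.3f  (95%% CI %.3f to %.3f)\n', ...
        parmhat(2), parmci(1,2), parmci(2,2));
```

Note that these intervals describe the uncertainty in the fitted parameters, not the spread of the data themselves.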
