Anderson Darling Goodness-of-the-fit test?

I want to assess the GOF of a given dataset to a variety of different distributions (i.e. exponential, lognormal, normal etc) using the Anderson-Darling test-statistics. Unfortunately, this doesn't seem to have been implemented in Matlab, so I was wonderung whether some of you might have a function that does exactly this?

3 Comments

I second this question. I am trying to use the Anderson-Darling test to determine GOF of a Generalized Pareto Distribution.
Greetings
I'm new to Matlab and to statistics.
I have a data table in the form of time series (column-oriented variables). I did the Anderson-Darling test with the following distributions: 'norm', 'exp', 'ev', 'logn', 'weibull'?
I read https://ch.mathworks.com/help/stats/adtest.html, I learned that I don't need to specify population parameters.
My problem starts when I try to do the Anderson-Darling test with Nakagami, Gamma, Beta and others that are not 'norm', 'exp', 'ev', 'logn', 'weibull'.
I noticed that there is a need to include parameters. I have no idea how I can do this.
Can anyone kindly help?
Attached is the type of error that matlab gives me
As I think you know, the AD test tests the null hypothesis that a set of numbers is drawn from a specified distribution, where specified means that all parameters are specified. Matlab's adtest will automatically use the max likelihood esitmate for the parameters, if you pass one of the standard distribution names without parameters. If you use a nonstandard distribution, such as Nakagami, you have to create a probabilty distribution object with specified parameters, and then pass the name of th PDO to adtest. Here's how it works:
x=rand(1,100); %generate 100 uniformly distributed random numbers on [0,1)
dist=makedist('Nakagami','mu',1,'omega',1); %create a Nakagami distribution with mu=1, omega=1
[h,p]=adtest(x,'Distribution',dist) %do the AD test
h = logical
1
p = 6.0000e-06
h=1 means that the null hypothesis, which is that the values in x come from distribution dist, is rejected. The p value is very small. This is what we expect, since the values in x were not from the specified Nakagami distribution.
Now let's make randoms that ARE from the specified distribution, and run the AD test.
dist2=makedist('Gamma','a',2,'b',3); %gamma distribution with a=shape=2, b=scale=3
y=random(dist2,[1,50]); %randoms with specified distribution
[h,p]=adtest(y,'Distribution',dist2) %do the AD test
h = logical
0
p = 0.3271
h=0 means the null hypothesis, that the values in y come from the specified distribution, is accepted. The p value is high, as expected.
In your case, you probably want to estimate the parameters, because you don't know them for your experimental wind data. You can do this as follows. We will create some data with a known distirbution and then pretend we don't know the parameters. Then we will fit the distribution to the data, and use the distribution with the fitted parameters for the AD test.
dist3=makedist('Beta','a',1.5,'b',0.5); %define a distribution
x=random(dist3,[100,1]); %create randoms with specified distribution
dist4=fitdist(x,'Beta'); %fit a beta distribution to the randoms
[h,p]=adtest(x,'Distribution',dist4) %do the AD test
h = logical
0
p = 0.9046
dist5=fitdist(x,'Nakagami'); %fit a Nakagami distribution to the randoms
[h,p]=adtest(x,'Distribution',dist5) %do the AD test
h = logical
1
p = 5.5220e-04
When we fit a beta distribution to the beta-distributed random numbers, the null hypothesis is accepted, and the p value is high, as expected.
When we fit a Nakagami distribution to the same random numbers, the null hypothesis (which is tha the data is from the fittd Nakagami distribution) is rejected, and the p value is low, as expected.

Sign in to comment.

 Accepted Answer

The statistics toolbox has an adtest function. It can test against a specific distribution with known parameters, or a more general test against any of the following distributions with unknown parameters: 'norm', 'exp', 'ev', 'logn', 'weibull'

4 Comments

@Mike Croucher, Thanks! Very nice!
@Debbie Green, @Ines, I did Monte Carlo testing of adtest.m and it is good, at least for normal and exponential distributions. By "good" I mean that 1% of the p-values returned are less than 0.01, 5% are less than 0.05, and 10% are less than 0.10, when the null hypothesis is true. This is what should happen. I did 10,000 simulations with sample size N=20, 10,000 with N=50, and 10,000 with N=100. For normal and for exponential. See results below. The test scripts are attached.
Testing versus normal distribution, when the null hypothesis is true:
>> adtestTestNormal
Number of p values in various ranges
n(p<.01) n(p<.05) n(p<.10)
-------- -------- --------
N= 20: 95 524 1048
N= 50: 104 511 1030
N=100: 108 478 988
Expected: 100 500 1000
Testing versus exponential distribution, when the null hypothesis is true:
>> adtestTestExpon
Number of p values in various ranges
n(p<.01) n(p<.05) n(p<.10)
-------- -------- --------
N= 20: 111 535 1012
N= 50: 72 504 950
N=100: 96 516 1020
Expected: 100 500 1000
@Debbie Green, you requested an adtest for generalized Pareto. That distribution is not testable with adtest.m. I recommend that you try the extreme value distribution ('ev') instead. If you must use the generalized Pareto, then you should use my code, and determine the critical value by Monte Carlo simulation, for the sample size and alpha level which you want. Let me know if you want help doing that.
@Mike Croucher: An "Accept this answer" button appears next to your answer when I view it. That's wierd, because I did not post the question. Why should I be able to accept an answer?
Here are the scripts I used to test adtest.m.
@William Rose You can accept answers to this question because of your high community reputation level. The list of priviledges, and the number of points needed to get them, is at https://in.mathworks.com/matlabcentral/answers/help?s_tid=al_priv
The OP has 7 days to accept an answer, after which anyone at the right level can do it
@Mike Croucher, thank you for explaining that, and thank you for the answer.

Sign in to comment.

More Answers (1)

Calculating the AD statistic is straightforward, and the attached function does it. Figuring out the p-value associated with a given AD statistic is not straightforward. The attached function does it, but beware. This 2018 paper gives a formula for the p-value (Table 3). They cite a textbook from 1986, which I don't find online. They say this formula is supposed to work no matter what distribution is being tested. However, I have tested their Table 3 formula with Monte Carlo simulation*. The tests show that the Table 3 formula gives very wrong answers if the data is uniform and the tested distribution is also uniform. The formula also gives very wrong answers if the data is exponential and the tested distribution is also exponential. The formula gives reasonable answers if the data is normal and the tested distribution is normal. By "very wrong answers", I mean that if you generate 10,000 sets of 100 uniformly distributed random numbers, and test versus a uniform distribution, about 1% of the p-values should be <.01, about 5% should be <.05, and about 10% should be <.10. But the actual number of p-values that are less, at each level, is much greater than expected. In other words, there are too many low p-values. Thus, when testing data that is actually uniformly (or exponentially) distributed, one would reject the hypothesis "this data came from a uniform (or exponential) distribution" far more often than one should. The p-values for nomal data tested versus a normal distribution appear to be approximately correct, at least at p=0.01, 0.05, and 0.10, with data sets of 20, 50, and 100 points per set.
The function that computes the A-D statistic and the associated p-value is attached: AndersonDarling.m. Note the warnings above. I am also attaching four scripts which test the function. See the comments in the function and in each script.
The 2018 paper linked above and here has references to other papers and books which may have formulas that give better results for non-normal distributions.
*I did the Monte Carlo testing with 10,000 data sets of normally distributed random numbers, with 20 data points in each data set. The I did it again with 50 and with 100 in each data set. Then I did it all again with uniformly and exponentially distributed random numbers, for 90,000 data sets in all.

5 Comments

In response to @Debbie Green's initial request and in response to experience, I have modified the function AndersonDarling.m.
  • Add 'GPD' option for generalized Pareto distribution.
  • Change the paraeter to select a distribution type, from integer to string. Now the user passes 'Normal' or 'GPD' or 'Exponential' or 'Uniform' to select a distribution, instead of passing a number.
  • if dtype='Exponential' or 'Uniform', return p=-9 (which is nonsensical). As noted in my previous post, the p value formula from Table 3 of doi:10.3390/math6060088 does not give accurate results when the values and gtype are Exponential. I have not found a good alternative. I tried using the critcal values from Table 2 of doi:10.3390/math6060088 to estimate p values for the UNiform case. The reuslts were better than before, but still not acceptable. AD is still returned as before, for both 'Exponential' and 'Uniform'.
  • Add more comments.
I tested the "GPD" option by generating 30,000 sample data sets, and computing the p value for each data set with the AndersonDarling() function. The number of p values less than 0.10, less than 0.05, and less than 0.01 was counted. We expect 10%, 5%, and 1% of the p values to be in each of the respective categories. The actual values are shown below, for two runs of 30,000 sets each. There are 10,000 set of size 20, 10K of size 50, and 10K of size 100. The number of p values in eight out of nine categories are slightly lower than expected, in both runs. Big N is the number of data points in each sample. Little n is the number of p values whose values are in the specified range.
First run
Number of p values in various ranges
n(p<.01) n(p<.05) n(p<.10)
-------- -------- --------
N=20: 107 484 979
N=50: 99 432 873
N=100: 79 407 853
Expected: 100 500 1000
Second run
Number of p values in various ranges
n(p<.01) n(p<.05) n(p<.10)
-------- -------- --------
N=20: 105 452 930
N=50: 76 439 879
N=100: 80 418 867
Expected: 100 500 1000
The results suggest that the computed p values are a bit on the high side. However, the numbers are not too different from the expected values. Therefore I would be comfortable using the p values generated by this function to accept (if p>=.05) or reject (if p<.05) the null hypothesis that the data have a generalized Pareto distribution.
I tested modified code for the uniform distribution p values with AndersonDarlingpValueTestUniform.m. It generates 30,000 data sets each time it runs. The results from two runs are below.
Run 1
Number of p values in various ranges
n(p<.01) n(p<.05) n(p<.10)
-------- -------- --------
N= 20: 167 912 1796
N= 50: 105 609 1213
N=100: 107 544 1114
Expected: 100 500 1000
Run 2
Number of p values in various ranges
n(p<.01) n(p<.05) n(p<.10)
-------- -------- --------
N= 20: 202 953 1916
N= 50: 120 619 1272
N=100: 119 568 1134
Expected: 100 500 1000
The results above indicate that the modified p values for the Uniform case give better results than before. However, the results are still not acceptable, because the number of p values in each range is significantly greater than expected, especially when the sample size is 20. This means there are too many low p values. This means the null hypothesis (which is that the values came from a uniform distribution) will be rejected more frequently than it should be, when the data is in fact uniformly distributed. Therefore I have decided to return p=-9 (which is non-sensical) when dtype='Uniform', to indicate that the correct p-value is not available.
The modified function and test script are attached. See the comments in the code for more info.
@William Rose, thanks very much for the function and the helpful references. I edited your function for my personal use so that I could calculate the AD statistic with my own (k,sigma) rather than with the estimated parameters from gpfit.
@Debbie Green, You're welcome.
Since the formula used to compute p in AndersonDarling.m is not supported (and is not contradicted) in the statistics literature, I recommend determining the critical value for generalized Pareto by Monte Carlo simulation. It sounds like you know the values of k and sigma which you want to use for your gen.Pareto distribution.
Example: I have N=20 values and I want to use the AD test to determine if they are consistent with a generalized Pareto distribution with k=2, sigma=1, theta=0. The null hypothesis is that the data ARE consistent with genPar(k=2,σ=1,θ=0). I would like to reject the null hypothesis if p<0.05. What should I use as the critical value for the AD test? Answer: Simulate 10,000 data sets with N=20, where the values are distributed X~genPar(k=2,σ=1,θ=0). Compute the AD statistic for each set. Sort the AD stats from low to high. The critical value for a p=.05 test is the average of the 9500th and 9501th values. Five percent of AD statistics are greater than this critical value, when the null hypothesis is true. Code to compute this follows.
>> M=10000; N=20; %M data sets, each with N samples
>> x=random('Generalized Pareto',2,1,0,[M,N]); %generate x~Gen.Par.(2,1,0)
>> AD=zeros(1,M); %allocate array for AD statistics
>> for i=1:M, [AD(i),~]=AndersonDarlingKnownPar(x(i,:),'GPD',2,1,0); end
>> ADsort=sort(AD);
>> ADcrit=(ADsort(9500)+ADsort(9501))/2; disp(ADcrit)
2.5043
>> pci=.05+[-1,1]*1.96*sqrt(.05*.95/M); disp(pci)
0.0457 0.0543
The critical value for rejecting the null hypothesis (p<.05) is ADcrit=2.5043. pci is the 95% confidence interval associated with ADcrit, computed with .
Let's test this critical value by generating 1000 data sets with N=20 that are NOT gen.Pareto(2,1,0). To make it a challenging test, let z~Uniform(0,3), which has the same lower limit and the same median as gen.Pareto(2,1,0). Let's see if the AD test with the specified critical value rejects the null hypothesis for most of these data sets.
>> z=3*rand(1000,N);
>> for i=1:1000, [AD2(i),~]=AndersonDarlingKnownPar(z(i,:),'GPD',2,1,0); end
>> disp(sum(AD2>ADcrit)) %count the AD stats that exceed ADcrit
946
946 out of 1000 data sets got rejected by the test, because their ADstat exceed the critical value. This was true even though the data sets had the same lower limit and same median as the hypothesized distribution. It is kind of a coincidence that the rejection rate is 95%. If the sample size were larger, the rejection rate would have been higher, I bet. (But if the sample size is larger, you should recalculate the critical value, ADcrit, because it may depend on sample size.) If the test data's distribution is more similar to gen.Pareto(2,1,0), the rejection rate will be lower. For example, if y~Expon(2.164), which also has a lower limit of 0 and a median of 1.5, only 15% of data sets get rejected by the AD test, with ADcrit=2.5043. This indicates that it is hard to distinguish an exponential distribution from genPar with shape=2, if you only have 20 data points.
The attached script computes the AD statistic (and the p value, but I don't trust the p value if the distribution is not normal) when the distribution parameters are specified by the user, rather than fitted. I assume your script is similar.
Good luck.
Thanks again. @William Rose, I understood the logic of using the Monte Carlo (MC) method to test the null hypothesis rather than comparing the function-generated p-value to 0.05. I will incorporate that critical value calculation routine into your "AndersonDarlingKnownPar.m" function.
In general, I'm more interested in observing the change in the Anderson-Darling statistic as I change the theta parameter in the GPD fit. But having the associated p-value is very helpful in my experiment.
@Debbie Green, you;re welocme.
Despite my bias in favor of my own scripts and functions :), I think Matlab's adtest() is superior. It is fast. It computes a critival value for the AD test, for a gen.Pareto distibution with known parameters, analytically. (I don't know how, but it does.) Here is an example of how you can use it and evaluate it. You can obviously modify this to address your interest: "observing the change in the Anderson-Darling statistic as I change the theta parameter in the GPD fit":
M=10000; N=20; %M=number of data sets, N=number of values in each set
x=random('Generalized Pareto',2,1,0,[M,N]); %generate randoms with specified distribution
pd=makedist('Generalized Pareto',2,1,0); %create a probability distribution object
h=zeros(M,1); p=h; adstat=h; cv=h; %allocate arrays
%next line runs the adtest M times on the M data sets
for i=1:M, [h(i),p(i),adstat(i),cv(i)]=adtest(x(i,:),'Distribution',pd); end
histogram(p)
figure; histogram(adstat)
disp([sum(h) sum(p<.05) sum(adstat>cv(1))]);
469 469 469
Inspection of vector cv (critical value) shows that all the values are identical. This confirms that adtest() computes the critical value analytically, rather than by Monte Carlo. If adtest() were computing the CV by Monte Carlo, the CV would be a litlle different every time.
The histogram of p values appears flat, as we expect it to be. That is reassuring.
The histogram of adstat does not look crazy.
The display of the sums of h, of p<.05, and of adstat<cv(1) show that it adtest() is working as expected, and that h is true if the null hypothesis is rejected. Since the null hypothesis is true in this case, and since alpha=.05 by default, we expect 500 rejections (since there were M=10000 tests). The variance (using the binomial approximation) is , so the observed value of 469 rejections is less than 1.5 sigma away from the expected value. Therefore the observed value is consistent with expectations.

Sign in to comment.

Tags

Asked:

on 16 May 2012

Commented:

on 16 Jun 2022

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!