
Working with the Kolmogorov test
hamidreza hamidi
on 13 Nov 2018
Commented: hamidreza hamidi
on 14 Nov 2018
Hi, I am trying to use the Kolmogorov test, which I am going to use in my article. I generated a data set A, then randomly drew a sample set from A. Then I wanted to compare these two sample sets with kstest, but it showed me that they don't have the same distribution.
here is my simple code:
clc
clear all
close all
n_s = 1000;
mother_random_variable = lognrnd(0.3,0.5,[1,100000]); %data lognormal
S = mother_random_variable(randi(numel(mother_random_variable),1,n_s)) %sample
S_y = [S]'; %selected data
S_mean=mean(S_y); %mean sample
S_var=std(S_y); %standard deviation of the sample (not the variance)
test_cdf = [S_y,cdf('Lognormal',S_y,S_var,S_mean)]; %make cdf
kstest(S_y,'CDF',test_cdf) %ktest
plot(sort(S_y),logncdf(sort(S_y)),'r--')
hold on
cdfplot(S_y)
They have the same distribution, so this is a strange result. I found an even stranger result when I compared my data set with itself: the result shows that they don't have the same distribution.
clc
clear all
close all
n_s = 1000;
mother_random_variable = lognrnd(0.3,0.5,[1,100000]); %data
S=mother_random_variable; % I named data with S for simpler code
S_y = [S]'; %selected data
S_mean=mean(S_y);
S_var=std(S_y);
test_cdf = [S_y,cdf('Lognormal',S_y,S_var,S_mean)];
kstest(S_y,'CDF',test_cdf)
plot(sort(S_y),logncdf(sort(S_y)),'r--')
hold on
cdfplot(S_y)
Do you have any idea? Thanks.
Accepted Answer
Adam Danz
on 13 Nov 2018
Edited: Adam Danz
on 13 Nov 2018
Having only looked at your 2nd block of code, I have some comments and suggestions.
1) The parameters for a lognormal distribution are mu and sigma (the mean and standard deviation of the logarithm of the data), in that order. In your code, you're entering them in reverse when you call the cdf() function, which creates a totally different distribution than you intend.
y = cdf('Lognormal', S_y, S_var, S_mean); % your code, incorrect
y = cdf('Lognormal', S_y, S_mean, S_var); % correct
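A quick way to see that the argument order matters is to evaluate the CDF at the median of a lognormal distribution, which is exp(mu). The sketch below assumes the illustrative values mu = 0.3 and sigma = 0.5 (the ones passed to lognrnd in the question); the variable names are hypothetical. With the arguments in the right order the result should be 0.5, and with them swapped it should not.

```matlab
% Assumed illustration values: mu and sigma describe log(X), not X itself.
mu = 0.3;
sigma = 0.5;
x_median = exp(mu);   % the median of a lognormal distribution is exp(mu)

p_ok      = cdf('Lognormal', x_median, mu, sigma);   % mu first, then sigma
p_swapped = cdf('Lognormal', x_median, sigma, mu);   % arguments reversed

fprintf('correct order: %.4f\n', p_ok);
fprintf('swapped order: %.4f\n', p_swapped);
```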
2) This is just a suggestion but it's a bit cleaner to use the makedist() function rather than entering the parameters manually into cdf().
doc cdf
pd = makedist('Lognormal', 'mu', S_mean, 'sigma', S_var);
y = cdf(pd, S_y); % instead of cdf('Lognormal', S_y, S_mean, S_var)
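As a minimal sketch of where this leads, a distribution object can also be handed directly to kstest through its 'CDF' option, which avoids building the two-column matrix by hand. The sample size, seed, and variable names below are assumptions made for illustration, and the hypothesized parameters are the generating ones (0.3, 0.5) rather than values estimated from the data.

```matlab
rng(0);                               % fixed seed, only for repeatability
x = lognrnd(0.3, 0.5, [1000, 1]);     % illustrative lognormal sample

% Hypothesized distribution, using the generating parameters (mu, sigma).
pd = makedist('Lognormal', 'mu', 0.3, 'sigma', 0.5);

% kstest accepts a probability distribution object for the 'CDF' option.
[h, p] = kstest(x, 'CDF', pd);
fprintf('h = %d, p = %.3f\n', h, p);
```

If the parameters are unknown, fitdist(x, 'Lognormal') would estimate mu and sigma from the sample. Keep in mind that the standard K-S p-value assumes the hypothesized parameters were not estimated from the same data being tested.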
3) " when I compare my data set with itself, Its result shows me they don't have same distribution." But you aren't comparing your data with itself. You're comparing the empirical distribution of your data with a hypothesized CDF that you computed from those (incorrect) parameters. The plot below shows the distribution of values from your data (top) and the distribution of the CDF values (bottom). If the hypothesized lognormal matched the data, the CDF values would be spread roughly uniformly over [0, 1]. They are not, and kstest() correctly rejects the null hypothesis.
figure
subplot(2,1,1)
histogram(S_y)
title('mother random variable')
subplot(2,1,2)
histogram(cdf('Lognormal', S_y, S_mean, S_var))
title('CDF distribution')

4) This may be irrelevant given the points above, but you are using different parameters to create the "mother_random_variable" and the cdf() data. For the random variables you are using (0.3, 0.5), which lognrnd() treats as the mean and std of the log of the data, but for the cdf you're using the mean and std of the data itself, which are ~(1.5, 0.8). To get comparable values, estimate the parameters from the log of the data, e.g. mean(log(S_y)) and std(log(S_y)).
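The gap between the two pairs of numbers can be reproduced with a short sketch. It assumes the same generating parameters as the question (0.3, 0.5) and a fixed seed chosen only for repeatability; the variable names are hypothetical. lognstat returns the theoretical mean and variance of the lognormal for comparison.

```matlab
rng(0);                                % fixed seed, only for repeatability
mu = 0.3;
sigma = 0.5;
x = lognrnd(mu, sigma, [100000, 1]);   % illustrative lognormal data

% Mean and std of the raw data: these are NOT the lognormal parameters.
data_mean = mean(x);
data_std  = std(x);

% Mean and std of log(x): these estimate mu and sigma.
mu_hat    = mean(log(x));
sigma_hat = std(log(x));

% Theoretical mean and variance of the lognormal distribution.
[m, v] = lognstat(mu, sigma);

fprintf('raw data:    mean = %.3f, std = %.3f\n', data_mean, data_std);
fprintf('log of data: mean = %.3f, std = %.3f\n', mu_hat, sigma_hat);
fprintf('theory:      mean = %.3f, std = %.3f\n', m, sqrt(v));
```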
3 Comments
Adam Danz
on 14 Nov 2018
" I wanted to use kstest lognormal distribution with itself. then I want to write a code to compare part of my generated data with whole data and find that they have the same distribution."
If I'm understanding you correctly, you want to create a log-normal set of data, then take a random subsample of that data set, and then use a K-S test to determine whether these two sets of data come from the same distribution. I suppose this is a sanity check, since it's obvious that the two data sets are (literally) from the same distribution.
Here's how:
1) Create your data set.
n_s = 1000;
mother_random_variable = lognrnd(0.3,0.5,[1,n_s]);
2) Create the sub-sampled data set.
m_s = 400; %number of random samples from your data
child_random_variable = datasample(mother_random_variable, m_s, 'Replace', false);
3) Use kstest2 to determine if those two vectors of data come from the same distribution.
[h, p] = kstest2(mother_random_variable, child_random_variable);
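The null hypothesis is that the two inputs are from the same distribution, so h = 0 means the test did not reject that hypothesis at the default 5% significance level. Strictly speaking, this is a failure to reject the null hypothesis rather than a confirmation of it.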
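Put together, the three steps above might look like the following sketch. The seed and the explicit 'Alpha' value are assumptions added for repeatability and clarity (0.05 is already the default); the third output of kstest2 is the test statistic.

```matlab
rng(0);                                % fixed seed, only for repeatability
n_s = 1000;                            % size of the full data set
m_s = 400;                             % size of the random subsample

mother_random_variable = lognrnd(0.3, 0.5, [1, n_s]);
child_random_variable  = datasample(mother_random_variable, m_s, 'Replace', false);

alpha = 0.05;                          % significance level (the default)
[h, p, ks_stat] = kstest2(mother_random_variable, child_random_variable, ...
                          'Alpha', alpha);

if h == 0
    fprintf('Not rejected at alpha = %.2f (p = %.3f, D = %.3f)\n', alpha, p, ks_stat);
else
    fprintf('Rejected at alpha = %.2f (p = %.3f, D = %.3f)\n', alpha, p, ks_stat);
end
```

Because the subsample is drawn from the full data set, a large p-value is the expected outcome here, though any single run can still reject by chance.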