How to improve accuracy on unseen data

This is a neural network pattern recognition problem. The entire dataset consists of 80 images to classify (10 users, 8 samples per user). I have divided it into two parts: 50 images for training (10 users x 5 samples per user) and 30 images kept aside as unseen images (10 users x 3 samples per user). A patternnet network is designed using nprtool; the code is as follows:
inputs = mapstd(train_data); %%train_data [I N ] = [ 60 50 ]
targets = mapstd(Targets); %%[ O N ] = [ 10 50 ]
% Create a Pattern Recognition Network
hiddenLayerSize = 10;
net = patternnet(hiddenLayerSize);
% Choose Input and Output Pre/Post-Processing Functions
% For a list of all processing functions type: help nnprocess
net.inputs{1}.processFcns = {'removeconstantrows','mapminmax'};
net.outputs{2}.processFcns = {'removeconstantrows','mapminmax'};
% Setup Division of Data for Training, Validation, Testing
% For a list of all data division functions type: help nndivide
net.divideFcn = 'dividerand'; % Divide data randomly
net.divideMode = 'sample'; % Divide up every sample
net.divideParam.trainRatio = 70/100;
net.divideParam.valRatio = 15/100;
net.divideParam.testRatio = 15/100;
% For help on training function 'trainlm' type: help trainlm
% For a list of all training functions type: help nntrain
net.trainFcn = 'trainscg'; % Scaled conjugate gradient backpropagation
% Choose a Performance Function
% For a list of all performance functions type: help nnperformance
net.performFcn = 'mse'; % Mean squared error
%net.trainParam.max_fail= 6; %Maximum validation failures
%net.trainParam.lr = 0.25;
% Choose Plot Functions
% For a list of all plot functions type: help nnplot
net.plotFcns = {'plotperform','plottrainstate','ploterrhist', ...
'plotregression', 'plotroc', 'plotconfusion'};
%net = configure(net,inputs,targets);
% Train the Network
[net,tr] = train(net,inputs,targets);
% Test the Network
outputs = net(inputs);
errors = gsubtract(targets,outputs);
performance = perform(net,targets,outputs)
% Recalculate Training, Validation and Test Performance
trainTargets = targets .* tr.trainMask{1};
valTargets = targets .* tr.valMask{1};
testTargets = targets .* tr.testMask{1};
trainPerformance = perform(net,trainTargets,outputs)
valPerformance = perform(net,valTargets,outputs)
testPerformance = perform(net,testTargets,outputs)
% View the Network
% view(net)
% Plots
% Uncomment these lines to enable various plots.
%figure, plotperform(tr)
%figure, plottrainstate(tr)
%figure, plotconfusion(targets,outputs)
%figure, ploterrhist(errors)
%figure, plotroc(targets,outputs)
save net net
disp('training completed')
For testing network performance on unseen data:
test_data = mapstd(test_data);
load net;
netoutput = sim(net,test_data) %%simulation with test data
% locate index of maximum value of output node
[y, ind] = max(netoutput);
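If the intent is to apply the training-set standardization to the unseen data, mapstd can return a settings structure that is reused on new columns; a minimal sketch (the settings variable PS is introduced here only for illustration):
% reuse the training standardization on the unseen data (sketch)
[inputs, PS] = mapstd(train_data);            % standardize training data, keep settings
test_inputs  = mapstd('apply', test_data, PS);% apply the SAME settings to unseen data
netoutput    = sim(net, test_inputs);         % simulate with consistently scaled inputs
[~, ind]     = max(netoutput);                % predicted class index per column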
Training stops on validation failure and the confusion matrix shows very poor accuracy. What should I do to get good classification accuracy? I referred to Greg Heath's posts on MATLAB Central; thanks to Greg Heath for providing insight into neural networks. As per his replies I have calculated H = 4 for my problem, but I am still not getting good accuracy.
[ I N ] = size(train_data) % [ 60 50 ]; each column contains real-valued (3-digit) features
[ O N ] = size(Targets)    % [ 10 50 ]
Ntrn   = N-2*round(0.15*N) % 34 training examples
Ntrneq = Ntrn*O            % 340 training equations
% For a robust design, desire Ntrneq >> Nw, i.e. H << Hub
Hub = -1+ceil( (Ntrneq-O) / (I+O+1)) % 4
H   = Hub;                 % candidate hidden layer size
Nw  = (I+1)*H+(H+1)*O      % number of unknown weights
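As a quick worked check of this rule, using the two hidden layer sizes that appear above (H = 10 in the code and H = Hub = 4):
% worked check of the sizing rule with I = 60, O = 10, Ntrneq = 340
Nw10 = (60+1)*10 + (10+1)*10   % 720 weights, more than the 340 training equations
Nw4  = (60+1)*4  + (4+1)*10    % 294 weights, still close to 340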
Please guide me on how to proceed further. Thank you in advance.
  2 Comments
Greg Heath on 21 May 2014
I wrote out a detailed response. However, my computer doesn't allow me to post or email. I will try to get it fixed tomorrow and post the response.
Chetana on 22 May 2014
Thank you, Greg sir. I am eager to see your reply. I tried to improve the accuracy by changing:
1) the number of hidden nodes (now 40)
2) Using,
MSEgoal = 0.1*mean(var(targets',1));
MinGrad = MSEgoal/100;
net.trainParam.goal = MSEgoal;      % MSE training goal
net.trainParam.min_grad = MinGrad;  % minimum gradient
3) Using,
net.performFcn = 'mse'; % Mean squared error
net.trainParam.max_fail= 100; %Maximum validation failures
net.trainParam.epochs = 1000;
net.trainParam.lr = 0.25;
Now the confusion matrix shows 90% classification accuracy.
But the network does not classify unseen images correctly: only 4 of the 30 unseen images (10 users x 3 samples per user) get classified correctly. I think my network learned the training data but fails to generalize. Please guide me. Thank you.


Accepted Answer

Greg Heath on 26 May 2014
0. please excuse no caps. originally had problems and now too lazy to change...
1. do the 60-dim inputs represent extracted features (e.g., plsregress (not pca!)) or a columnized image? if the latter, what size?
2. have you tried to reduce the input vector size via plsregress?
3. are the target columns from eye(10) so that
target = ind2vec(trueclassindices)
trueclassindices = vec2ind(target)
estimatedclassindices = vec2ind(output)
4. no need to explicitly separate 'unseen data': test data is in no way used for training or validation. therefore, it can be used to obtain "unbiased" estimates of unseen data performance. this holds true for averages of multiple designs obtained from
a. random or stratified (e.g., k-fold xval) data divisions
b. and/or random weight initializations
5. wise to use minmax to check for outliers after using mapstd
6. no need to explicitly include assignments of defaults.
7. syntax error: save net net
8. need numh*numw designs: numw sets of random initial weights and/or data divisions for each of numh different values for hidden layer size, h (a sketch of points 8-11 follows this answer).
9. explicitly calculate trn/val/tst classification error rates for
each design (unless you can figure out how to get them from the confusion
matrix).
10. for each value of h, rank designs w.r.t. performance on validation data.
11. select best val performers at minimum successful value of h. combine
corresponding test performances to obtain unbiased estimates of unseen
data error rates and confidence intervals.
12. search for my examples in the newsgroup and answers
greg patternnet ntrials
13. test on a matlab classification dataset so we can compare results
help nndatasets
doc nndatasets
hth
greg
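A minimal sketch of points 8-11, assuming the inputs/targets variables from the question (with targets as 0/1 class-indicator columns); the candidate H values, Ntrials, and the other new names are illustrative assumptions:
% multiple designs per candidate H, ranked by validation error (sketch)
Hcandidates = [2 4 6 8];          % candidate hidden layer sizes (assumption)
Ntrials     = 10;                 % random initializations/divisions per H
rng(0)                            % reproducible results
valErr = zeros(Ntrials, numel(Hcandidates));
tstErr = zeros(Ntrials, numel(Hcandidates));
trueclass = vec2ind(targets);
for j = 1:numel(Hcandidates)
    for i = 1:Ntrials
        net = patternnet(Hcandidates(j));       % defaults: dividerand, trainscg
        [net, tr] = train(net, inputs, targets);
        assigned  = vec2ind(net(inputs));
        err       = assigned ~= trueclass;
        valErr(i,j) = 100*mean(err(tr.valInd));  % validation error rate (%)
        tstErr(i,j) = 100*mean(err(tr.testInd)); % test error rate (%)
    end
end
% rank designs by validation error; the corresponding test errors of the best
% designs serve as unbiased estimates of unseen-data performance
[minValErr, best] = min(valErr)
bestTstErr = tstErr(sub2ind(size(tstErr), best, 1:numel(Hcandidates)))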

More Answers (5)

Greg Heath on 22 May 2014
The dramatic difference between your training and nontraining performance is a classic example of OVERTRAINING an OVERFIT net when H/Hub and max_fail are too large.
1. Why would you use H = 40 when Hub = 4???
2. Why would you use max_fail = 100 when H >> Hub AND the max_fail default is only 6???
Training performance tends to be irrelevant. Rank multiple designs using the validation performance. Obtain unbiased estimates of performance on unseen data from the test subset performance of the best ranked designs. You do not need a separate "unseen" test data set. Performance uncertainties are easy to estimate by training multiple designs with random data divisions and random initial weights.
3. Why do you have a 60-dim input when you only have 50-2*round(0.15*50) = 34 training examples? At most, they span a 33-D input space.
===========================================================================
You need to get a better feel for your problem. Start simple:
1. How many inputs do you really need? Considering linear models can be very helpful
a. Stepwisefit results tend to be useful for selecting original variables.
b. Plsregress results tend to be useful for selecting linear combinations (see the plsregress sketch after this list).
c. PCA is not guaranteed to be useful for classification.
2. Back to NNs
a. Use all data for training: divideFcn = 'dividetrain'
b. Otherwise USE ALL DEFAULTS except for a chosen RNG seed so that results can be duplicated.
c. Obtain 10 designs, explicitly calculate class error rates and compare with confusion matrix results
3. Repeat 2 but try to minimize H as much as possible.
4. Using the minimum acceptable value for H, try to reduce the number of inputs.
5. Finally, you can return to the original goal of estimating performance on nondesign test data. Confidence limits can be deduced from multiple designs with random data divisions and random initial weights.
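A rough sketch of step 1b, assuming inputs ([60 x N]) and targets ([10 x N]) as in the question; ncomp and the other new variable names are illustrative assumptions:
% reduce the 60-dim input with plsregress (sketch)
ncomp = 10;                                   % number of PLS components kept (assumption)
[XL, YL, XS, YS, BETA, PCTVAR, MSEP, stats] = plsregress(inputs', targets', ncomp); % rows = observations
plsinputs = XS';                              % [ncomp x N] reduced inputs for patternnet
% new columns (e.g. unseen data) can be projected the same way:
% XSnew = (Xnew' - mean(inputs',1)) * stats.W;
net = patternnet(4);                          % small H, per the sizing rule above
[net, tr] = train(net, plsinputs, targets);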
Hope this helps.
Greg
PS Sorry my original response is still unavailable. Maybe tomorrow.
  4 Comments
Greg Heath on 26 May 2014
I think there is a misunderstanding. With 'dividetrain' all of the data is used for training and max_fail is irrelevant. The purpose of this exercise is to
1. Make sure that the data is consistent
2. Find the minimum number of hidden nodes that are necessary.
For unbiased estimates of performance on nontraining data, design multiple nets with different random initial weights. The multiplicity allows the estimation of confidence intervals.
P.S. Am trying to deal with the computer virus myself. I will soon post my previous comments.
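A compact sketch of this exercise (all data used for training, several random weight initializations, and a simple spread summary), assuming inputs/targets from the question; Ntrials and h are illustrative assumptions:
% 'dividetrain' designs with random initial weights (sketch)
Ntrials = 10;
h = 4;
rng(0)
PctErr = zeros(Ntrials,1);
trueclass = vec2ind(targets);
for i = 1:Ntrials
    net = patternnet(h);
    net.divideFcn = 'dividetrain';          % all data used for training
    net = configure(net, inputs, targets);  % fresh random initial weights
    net = train(net, inputs, targets);
    PctErr(i) = 100*mean(vec2ind(net(inputs)) ~= trueclass);
end
summary = [mean(PctErr) std(PctErr)]        % rough spread over the designs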
Greg Heath on 31 May 2014
Data successfully received. However, currently, I don't have much time. Will be traveling until June 4 with malfunctioning laptop.
I STRONGLY suggest you look into reducing the input dimensions.
It would also be helpful to use more examples per class.
Currently, you are "defining" a class in 60 dimensions with, at most, 6 training examples that span, at most, 5 dimensions.
That would not be bad if the different 5-dimensional class subspaces did not intersect. However, that is like whistling in the dark to ward off demons. I tend to feel comfortable with classes defined with at least two independent examples per dimension.
Again: try to reduce the dimensionality. I made some recommendations in an earlier post. Also search the NEWSGROUP and ANSWERS with combinations of keywords like
image feature extraction
Hope this helps.



Greg Heath on 1 Jun 2014
Revelations from New Data:
clear all, close all, clc
tic
load P1.txt
whos
% P1 61x350 170800 double
inputs = P1(1:end-1,:);
trueclasses = P1(end,:);
minmaxindices = minmax(trueclasses) % [ 1 50 ]
Nclasses = numel(unique(trueclasses)) % 50
targets = ind2vec(trueclasses);
[ I N ] = size(inputs) % [ 60 350 ]
[ O N ] = size(targets) % [ 50 350 ]
%No val or test data
Ntrneq = N*O % 17500 training equations
% NAIVE CONSTANT MODEL
ynaive = mean(targets,2); % 0.02*ones(O,1)
Nw00 = numel(ynaive) % O = 50
Ndof00 = Ntrneq-Nw00 % 17450 DegsOfFreedom
y00 = repmat(ynaive,1,N); % [50 350]
SSE00 = sse(targets-y00) % 343
MSE00 = SSE00/Ntrneq % 0.0196 biased
MSE00a = SSE00/Ndof00 % 0.0197 DOF adjusted
% MSE00 = mean(var(targets',1)) % 0.0196
% MSE00a = mean(var(targets',0)) % 0.0197
% LINEAR MODEL y0 = W0*[ones(1,N); inputs];
W0 = targets/[ones(1,N); inputs];
Nw0 = numel(W0) % 3050
Ndof0 = Ntrneq-Nw0 % 14450 DegsOfFreedom
y0 = W0*[ones(1,N); inputs];
SSE0 = sse(targets-y0) % 278.3153
MSE0 = SSE0/Ntrneq % 0.0159 biased
MSE0a = SSE0/Ndof0 % 0.0193 DOFa
R20 = 1-MSE0/MSE00 % 0.1886 ~ 19%
R20a = 1-MSE0a/MSE00a % 0.0201 ~ 2%
Elapsedtime = toc % 180 sec
% When the degree of freedom adjustment is made to compensate for estimating performance with training data, a conjecture that ~19% of the target variance is "explained" (R20) must be modified to only ~2% (R20a). Therefore, the Linear Model doesn't appear to be significantly better than the Naive Constant Model that is based on apriori probabilities. Complicating the analysis via trn/val/tst data division will not improve results.
% I will let you calculate the resulting 50 class error rates
% Escalating to quadratic classifiers, one for each of the 50 classes might be feasible. The simplest model would use 50 hidden node radial basis functions centered at the class means. This can be constructed using NEWRB.
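As a rough sketch of the class error rate calculation left to the reader, using the variables defined in the code above (the commented newrb call at the end is illustrative; its goal, spread, and neuron cap are assumptions):
% per-class error rates for the linear model (sketch)
[~, assigned0] = max(y0);                     % predicted class per column
err0    = assigned0 ~= trueclasses;
PctErr0 = 100*sum(err0)/N                     % overall percent error, linear model
classerr0 = zeros(1, Nclasses);
for c = 1:Nclasses
    idx = (trueclasses == c);
    classerr0(c) = 100*sum(err0(idx))/sum(idx);  % percent error for class c
end
% radial basis alternative mentioned above (illustrative parameters):
% netrb = newrb(inputs, full(targets), 0.0, 1.0, 50, 5);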
Hope this helps.
Greg

Greg Heath on 2 Jun 2014
%A quick way to see if any of the variables or classes appear to be different from the others is to standardize the inputs to zero-mean/unit-variance and compare the rows and columns of W0
%STANDARDIZATION (To compare Linear Model coefficients)
zinputs = zscore(inputs')';
W0z = targets/[ones(1,N); zinputs];
minmaxW0z = minmax(W0z); % [ 50 2 ]
minmaxW0zp = minmax(W0z'); % [ 61 2 ]
whos
figure
subplot(2,1,1)
hold on
plot(1:50,minmaxW0z(:,1),'bo')
plot(1:50,minmaxW0z(:,2),'ro')
subplot(2,1,2)
hold on
plot(1:61,minmaxW0zp(:,1),'bo')
plot(1:61,minmaxW0zp(:,2),'ro')
%When the inputs are standardized, I see no significant differences between the weights associated with different classes or different variables

Greg Heath on 2 Jun 2014
Although previous results are not encouraging, I'm curious what a biased MLP design would yield.
BIAS: All of the data is used for training and regularization (e.g., TRAINBR ) is NOT used
% NEURAL NETWORK MODEL
Hub = -1+ceil( (Ntrneq-O) / (I+O+1)) % 157
Ntrials = 10
rng(0)
j=0
for h = round([Hub/10, Hub/2, Hub])
    j = j+1
    h = h
    Nw = (I+1)*h+(h+1)*O
    Ndof = Ntrneq-Nw
    net = patternnet(h);
    net.divideFcn = ''; % 'dividetrain'
    for i = 1:Ntrials
        net = configure(net,inputs,targets);
        [ net, tr, outputs, regerrors ] = train(net,inputs,targets);
        assignedclasses = vec2ind(outputs);
        classerr = assignedclasses~=trueclasses;
        Nerr(i,j) = sum(classerr);
        % FrErr = Fraction of Errors (Nerr/N)
        [FrErr(i,j),CM,IND,ROC] = confusion(targets,outputs);
        FN(i,j) = mean(ROC(:,1)); % Fraction of False Negatives
        TN(i,j) = mean(ROC(:,2)); % Fraction of True Negatives
        TP(i,j) = mean(ROC(:,3)); % Fraction of True Positives
    end
end
PctErr=100*Nerr/N
elapsedtime = toc %~412 sec
%%%%%% Percent Error
% ____________________________
% H = 16 79 157
% ____________________________
% 80.9 96.9 63.4
% 57.4 89.7 82.3
% 75.1 >19.1 19.4<
% 84.0 56.9 19.1<
% 60.3 75.4 96.3
% >41.7 83.7 77.1
% 94.9 84.3 21.1<
% 46.0 61.7 94.9
% 57.1 90.6 83.7
% 48.0 79.7 >17.8<
  1 Comment
Greg Heath on 4 Jun 2014
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% Repeat with Ntrials = 20
%
%%%%%%Percent Error
% ________________________
% H = 16 79 157
% _________________________
% 82.3 74.0 >19.7
% 56.0 >24.3 64.0
% 72.6 94.9 >21.7
% 83.1 65.4 86.3
% 62.3 >26.9 >20.0
% >40.9 30.6 30.0
% 96.0 86.3 88.3
% >46.6 32.6 72.6
% 55.4 62.3 >19.7
% >46.3 >20.6 74.6
% 77.7 95.4 86.3
% 70.0 63.1 >18.9
% 97.7 79.4 89.4
% 67.4 84.6 83.4
% 56.9 82.9 84.6
% 70.6 67.4 74.0
% 55.1 55.7 77.1
% 57.4 78.3 88.0
% 75.1 34.9 72.9
% 74.9 43.1 82.9



farzad on 21 Feb 2015
Hi all,
I tried to use this code, though not the updated version containing Prof. Heath's points, because I could not figure out to which part of the code I should add them. It is also important for me to use the network after training by giving it a new input and getting the desired answer, but I could not figure out how to use sim. You have used test_data, which MATLAB does not recognize and gives an error for. What should I use instead?
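A minimal sketch of using a trained, saved net on one new sample, assuming newsample is a 60x1 feature vector built the same way as the training columns and that the mapstd settings PS were also saved alongside the net (these names are illustrative):
% classify one new sample with the saved network (sketch)
load net                                        % loads the trained network 'net'
newsample_n = mapstd('apply', newsample, PS);   % reuse the training standardization
scores      = sim(net, newsample_n);            % equivalently: scores = net(newsample_n)
[~, userid] = max(scores)                       % predicted user index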
