Out-of-sample normalization problem

Jack on 3 Apr 2014
Commented: Greg Heath on 11 Apr 2014
Hi. I’m working on a binary classification system with 21 financial ratios and variables as inputs; the output is a financial criterion that is either 0 or 1. Before feeding the data to my classification model (MLP, SVM or ELM), I normalize it (min/max mapping or whitening). The financial ratios come from companies’ statements, so the data set contains companies of various sizes.
I’m also using 5-fold cross-validation to design the model. Now that the model is designed, I want to use it on new data, so I must normalize that data as well. I found that for min/max mapping I must use the maximum and minimum of the design-phase data set, and for whitening I must use its mean and variance.
Suppose that with (x - min)/(max - min), a sample in my new data set has a feature value x that is lower than the design-phase minimum, so the normalized feature for that sample is negative. Is this a problem? Is the output (1 or 0) still valid for this sample? The same problem can occur with whitening.
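To make the situation concrete, here is a minimal sketch in base MATLAB of min/max mapping with design-phase statistics applied to a new sample. The data values and the names Xdesign, xnew and mapToUnit are hypothetical and chosen only for illustration; rows are assumed to be features and columns samples.

```matlab
% Hypothetical design-phase data: 2 features (rows) x 4 samples (columns)
Xdesign = [2   4   6   10;
           0.1 0.5 0.3 0.9];

% The statistics come from the design data only, one value per feature (row)
xmin = min(Xdesign, [], 2);
xmax = max(Xdesign, [], 2);

% Map to [0,1] with (x - min)/(max - min), reusing the design-phase min and max
mapToUnit = @(X) bsxfun(@rdivide, bsxfun(@minus, X, xmin), xmax - xmin);
Ndesign = mapToUnit(Xdesign);   % every entry lies in [0,1]

% Hypothetical new sample whose first feature is below the design-phase minimum
xnew = [1; 0.5];
nnew = mapToUnit(xnew);         % first component is (1 - 2)/(10 - 2) = -0.125
```

The negative value is only the arithmetic consequence of the sample lying outside the design range for that feature. Whether the classifier output can be trusted for such a sample is a separate question that this sketch does not answer.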
Thanks.

Accepted Answer

Greg Heath on 3 Apr 2014
Edited: Greg Heath on 3 Apr 2014
Regardless of what you use in the model, I always standardize pre-modelling using zscore or mapstd to identify outliers for removal or modification.
Warning: Each dimension should be normalized separately.
P.S. If you use neural nets, the default normalization is mapminmax to [-1,1] and the default hidden-layer transfer function is tanh, which is an odd function.
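As a rough illustration of normalizing each dimension separately, the following base-MATLAB sketch standardizes every feature with its own mean and standard deviation. The data are hypothetical and rows are assumed to be features; mapstd follows the same row-wise convention, while zscore works along columns by default.

```matlab
% Hypothetical design data: 2 features (rows) x 5 samples (columns)
X = [ 1  2  3  4  5;
     10 20 30 40 50];

mu    = mean(X, 2);      % per-feature mean
sigma = std(X, 0, 2);    % per-feature standard deviation (N-1 normalization)

% Standardize each row with its own mean and standard deviation
Z = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma);
```

Although the two features differ in scale by a factor of 10, their standardized rows are identical here, which is the point of treating each dimension separately. New data would be standardized with the same mu and sigma rather than with statistics recomputed from the new samples.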
Hope this helps
Thank you for formally accepting my answer
Greg
  6 Comments
Image Analyst on 7 Apr 2014
Jack's second so-called "Answer" moved here:
Thank you again Greg.
I don’t use k-means clustering after employing other outlier detection techniques. K-means clustering is one outlier detection option in my system, alongside your proposed technique, so I can choose either of the two. With regard to the above discussion, what do you think about k-means clustering?
You mentioned that I can use ‘(x-meanx)/std > threshold of your choice’, so your proposed technique does not consider all inputs (in my case, 21 variables) simultaneously, and I can only analyze one feature at a time with it. Is this true?
Thanks.
Greg Heath on 11 Apr 2014
No. You consider all components at once using matrix coding. I consider a 21-dimensional vector an outlier if one or more of its components is an outlier.
All MATLAB code is matrix-based. So if you find one or more outlying components in a column of an input or target matrix, either modify or delete that column. Any target column corresponding to a deleted input column must also be deleted, and vice versa.
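The matrix-based approach described above might be sketched as follows in base MATLAB. The data, the threshold of 2.5 and the variable names are assumptions for illustration only; the sketch deletes flagged columns rather than modifying them.

```matlab
% Hypothetical inputs: 2 features (rows) x 10 samples (columns), plus targets
x = [1 2 3 2 1 2 3 2 1 50;
     5 6 5 6 5 6 5 6 5  6];
t = [0 1 0 1 0 1 0 1 0  1];

threshold = 2.5;         % assumed cutoff; the thread leaves this choice to the user

% Standardize every feature (row) at once
mu    = mean(x, 2);
sigma = std(x, 0, 2);
z = bsxfun(@rdivide, bsxfun(@minus, x, mu), sigma);

% A sample (column) is an outlier if one or more of its components is an outlier
isOutlier = any(abs(z) > threshold, 1);

% Delete the flagged columns from the inputs and the matching targets together
x(:, isOutlier) = [];
t(:, isOutlier) = [];
```

With these values only the last column is flagged, because of the 50 in the first feature, so x and t each lose one column and stay aligned.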
