# Maximum separation of certain points on a 2d plot

Asked by Vitaly on 5 Jul 2012
Latest activity Answered by Walter Roberson on 6 Jul 2012

Hello matlab users,

I'm trying to solve the following problem. Suppose we have a number of point on a 2d plot with X&Y coordinates. Each point has it's own label (e.g. 1 or 2). I want to draw a line, which will separate a region of plot with maximum number of points of certain label (e.g. 2). Here is a sample code for the first part of the problem, which plots points and assigns labels:

```X = [0.5588;0.3603;0.3309;0.3603;0.1985];
Y = [1.9234;0.3406;1.1295;0.8748;0.2416];
scatter(X,Y);
a = [1;2;2;2;1];
b = num2str(a); c = cellstr(b);
%displacement so the text does not overlay the data points
dx = 0.01; dy = 0.01;
text(X+dx,Y+dy,c);
```

As I see it the brut force approach would be to write a cycle which would iterate coefficients k and b of linear line equation (y=kx+b) and calculate number of points with different labels to each side of the line trying to maximize number of points with given label (note: the optimum kx+b line should separate points in such a manner so that there are no points with label 1 in region of points with label 2). But I am afraid that this approach will take lots of machine time, since I will be trying different combinations of X&Y and for each X&Y combination I'll have to find it's optimal separating line.

May be someone cane suggest more elegant and less time consuming approach? May be there is an analytical approach to solve this problem (instead of graphical).

## Products

No products are associated with this question.

Answer by Matt Kindig on 5 Jul 2012

You write that "the optimum kx+b line should separate points in such a manner so that there are no points with label 1 in region of points with label 2". Can you guarantee that such a perfect discrimination exists? It seems highly likely that the points will be distributed such that a single straight line is insufficient to completely isolate the label 1 points from the label 2 points.

That said, if you have the Optimization Toolbox, you should be able to code up your algorithm fairly easily. I suspect that given that you only have two variables, your algorithm will converge fairly quickly.

Vitaly on 5 Jul 2012

Suppose we have 100 points. 40 of them have label 1, another 60 - 2. In my case, even if it is possible to find a region, which will contain only 5-10 points with label 2, effectively separating it from the rest bunch of points, it should be good enough. In some cases region will contain more uniform points (i.e. having the same label), in others less. But the idea is to draw the line in such a manner so that it separates maximum number of uniform points. I am new to programming and Matlab, that's why I am here asking this question, if you could suggest some starting point on how to implement this algorithm, I'll appreciate that. I do have optimization toolbox. I also want to avoid a problem of lengthy calculations. I have a large data set, which consists of 200 time series. Each time series could be either X or Y, I'll have to test all possible combinations of 2 variables from these dataset and then select those, which yield regions with biggest numbers of separated uniform points.

Matt Kindig on 5 Jul 2012

So if I understand you correctly, you basically want a line that separates the largest cluster of uniform points. On one side of this line can be any number of points, as long as they are all labeled as 2.

Hmmm... this isn't really my field of expertise, but I still think this sounds like a clustering problem. I know there are some tools in the Statistics Toolbox to do clustering -- kmeans comes to mind. Maybe start there?

Answer by Walter Roberson on 6 Jul 2012

If you are looking at straight lines to divide clusters, then consider using svm.

Answer by Walter Roberson on 6 Jul 2012

The natural search space for an algorithm like this involves both slope and intercept. Both are important to get the best answer, but we found in practice that getting "close" to the best was somewhat insensitive to slope but was fairly sensitive to intercept.

Starting from "transvariation" techniques suggested by my co-workers, I found a way to optimize the search so that the calculation of best intercept for any one slope, became linear in the number of points. Latching on to the absolute best slope could still require a fairly fined-grained search, but in practice it was amenable to bracketing and refinement, that if you went more than a little way away from "close" to the peak, that the counts would fall fast enough that managing to dig out a tiny tiny angle of separation would not be enough to counter the grosser-level losses. For our purposes, worrying about angles that narrow implied that we were overtraining.

Unfortunately, the project that this was all implemented for was permanently shut down this week, so the code is no longer available.