Occlusion Resolving in Text Patterns

Question

shivasakthi on 4 Aug 2014

0
Link

Direct link to this question

https://www.mathworks.com/matlabcentral/answers/145317-occlusion-resolving-in-text-patterns

PROBLEM DESCRIPTION

I am working on occlusion resolving in text patterns. I have a dictionary of text strings containing six text patterns like this: {‘A R E A’, ‘P E R I’, ‘B O N U S’, ‘A S U M’, ‘M U L T I P L E’, ‘D I V I’}. I am going to manually occlude these text patterns and try to disentangle them at the receiving end. For instance, I am taking the first two text strings 'AREA' AND 'PERI' (YOU PLEASE CONSIDER TAKING TWO STRINGS OF DIFFERENT LENGTH). I am occluding these strings in different ways so that the occluded pattern sounds something like this: A R E E A I (Visualize the texts are lying one over the other so that 'P’ term in the first string and 'R' term in the second string are not visible. From the occluded pattern, I am randomly generating populations like S1=’A R E E’; S2=’R E E A’; S3=’E E A I’; S4=’E A I A’; S5=’A I A R’; S6=’E E A R’; (POPULATION SIZE IS 6 AS EQUAL TO THE NUMBER OF STANDARD TEXT PATTERNS IN THE DICTIONARY). I have to compute the fitness (which is nothing but the degree of similarity of these sample population members to each of the standard text patterns in the dictionary). I am computing a parameter called Similarity Quotient (SQ) (called the OBJECTIBVE FUNCTION) to compute the similarity of the text strings in the sample population to each of the standard text patterns in the Dictionary. The expression for ‘SQ’ between a pattern ‘a’ in the sample population and ‘b’ in the standard text pattern Dictionary is calculated as, SQ_ab = W_L/L*LD_ab + W_J*JD_ab + W_H*RS_ab + W_V*〖Rσ^2〗_ab +W_μ*〖Rμ〗_ab where ‘W’ is the weight associated with each component. LDab’ is the Levenshtein Distance between the two text strings ‘a’ and ‘b’, and ‘JDab’ is the Jaccard Distance between the two text strings ‘a’ and ‘b’. Levenshtein Distance (LD) is a string metric that measures the minimum numbers of insertions, deletions, and substitutions to be done to change a given text string to another. The value of ‘LD’ between the two text strings ‘a’ and ‘b’ is given by, 〖LD〗_(a,b) (i,j)=Max (i,j) if Min (i,j)=0 〖LD〗_(a,b) (i,j)=Min {〖LD〗_(a,b) (i-1,j)+1,〖LD〗_(a,b) (i,j-1)+1,〖LD〗_(a,b) (i-1,j-1)+[a_i≠b_j]} for all values of Min (i,j)≠0 The upper and lower bounds of Leva,b are ‘0’ and ‘L’, where ‘L’ is the length of the longest string. Jaccard Index (or) Jaccard Similarity Co-efficient is a string metric, used for measuring the similarity between two text strings, given by, J(a,b) = a∩b/|a∪b| where ‘|a∩b|’ signifies the number of elements common to ‘a’ and ‘b’ and ‘|a∪b|’ denotes the total number of elements in ‘a’ and ‘b’. A measure of dissimilarity between the two text string patterns can be computed with the Jaccard distance (JD) and is calculated by subtracting the Jaccard Index from ‘1’. JD (a,b) = 1 – J(a,b) Length (μ) is the length of the string pattern, Entropy (S) is found by dividing the frequency value of each character by the total length of the string pattern, Variance (σ^2) gives the variance of the individual characters in the given string. The term ‘R’ signifies the matching rank between the patterns ‘a’ and ‘b’ for every parameter. The higher the match, the higher will be the rank value that lies in the range [0, 1]. The Matching rank for Length Rμab, Entropy RSab, Variance Rσ^2ab for the pattern ‘a’ and ‘b’ are given as, Rμ_ab = 1-|(μ_a – μ_b)/ μ_a | RS_ab = 1-|(S_a –S_b)/ S_a 〖Rσ^2〗_ab = 1-(〖σ^2〗_a – 〖σ^2〗_b )/ 〖σ^2〗_a |

The upper and lower bounds of all the parameters involved in the measurement of ‘SQ’ are 0 and 1 respectively. The value of ‘SQ’ lies in the range [0, 1] depending on the degree of match of the sample in the population to the standard text pattern in the Dictionary. The maximum value of ‘SQ’ is always ‘1’, so that, W_L+W_J+W_H+W_V+W_μ= 1 (YOU ASSUME YOUR OWN VALUES FOR THESE WEIGHTS HOWEVER THE MOST OPTIMUNM VALUE WILL BE WL = 0.5, WJ = 0.2, WH = 0.1, WV = 0.1, Wμ = 0.1). The fitness value of an individual ‘a’ in the sample population is then given by, F_a =Max(SQ_a1,SQ_a2,SQ_a3,…,SQ_am), where ‘m’ is the number of standard text patterns in the Dictionary. This is a maximization problem that should be given to the meta-heuristic Micro-level, optimization algorithms and the performance needs to be evaluated. Let me ASSUME the initial fitness values I have obtained are something like this: S1=’A R E E’ = 0.6515 S2=’R E E A’ = 0.7512 S3=’E E A I’ = 0.6951 S4=’E A I A’ = 0.6384; S5=’A I A R’ = 0.7859 S6=’E E A R’ = 0.8254 (THESE ARE THE ASSUMED INITIAL FITNESS VALUES OF THE POPULATION MEMBERS AND THE VALUES ARE NOT NORMALIZED. WHEN YOU DO THE ABOVE COMPUTATIONS, YOU WILL GET THE EXACT FITNESS VALUES FOR EACH OF THE SAMPLE POPULATION MEMBERS). I have to do selection, crossover and mutation for GA, the respective iterations for PSO and that for the Firefly Algorithm (FA) so that, I get both the original patterns involved in occlusion, at the output (‘A R E A’ and ‘P E R I’) when I run the optimization program. That is, when I run my code, after several iterations, my output will contain the actual two patterns 'AREA' AND 'PERI' that were occluded. Also, I should show a graph with the number of generations on the x-axis and the distance between the actual and target values for every generation, in the Y-axis. I suggest doing the following to get the results at the end of the optimization program because the methodology has already been tried and results were obtained. Once u have evaluated the fitness of each member in the population by comparing the similarity with the standard text patterns, with that existing population u perform selection, xover & mutation. Perform once in a particular generation for GA. Repeat respectively for PSO and FA. 2) You perform number of generations, such that ur o/p has the given two patterns or else, u evaluate the fitness of the given two patterns and perform generations until the given fitness becomes nearer to that of the fitness of the given two patterns. Then stop. 3) For plot, take number of generations in x-axis, and the difference between the fitness of calculated pattern and actual pattern on y-axis for each generation and plot it. Repeat the same logic for PSO and FA so that we get the actual un-occluded patterns at the output. Since we have 6 population members, the output will resemble something like this: (‘AREA’,’PERI’, PERI’,’PERI’,’AREA’,’PERI’) Here ‘PERI’ is repeated more because its degree of visibility is more in the occluded pattern as compared to the other text string involved in the occlusion.