I have more than 10k text files that all share the same format but differ in size; some are bigger, some smaller.
Each file looks like this:
[code][{u'language': u'english', u'area': 3825.8953168044045, u'class': u'machine printed', u'utf8_string': u'troia', u'image_id': 428035, u'box': [426.42422762784093, 225.33333055900806, 75.15151515151516, 50.909090909090864], u'legibility': u'legible', u'id': 1056659}, {u'language': u'na', u'area': 24201.285583103767, u'id': 1056660, u'image_id': 428035, u'box': [223.99998520359847, 249.57575480143228, 172.12121212121215, 140.6060606060606], u'legibility': u'illegible', u'class': u'machine printed'}]
[/code]
I want to extract two variable fields from every file using regular expressions. The first output should look like this:
[code]box = [223.99998520359847, 249.57575480143228, 172.12121212121215, 140.6060606060606]
box1 = .. (sometimes there is more than one box)[/code]
And the second output:[code]word = troia
word1 = ... (sometimes there is more than one word)[/code]
Code 1, for the word extraction:[code]fid = fopen('text1.txt','r');
C = textscan(fid, '%s','Delimiter','');
fclose(fid);
C = C{:};
Lia = ~cellfun(@isempty, strfind(C,'utf8_string'));
output = [C{find(Lia)}];
expression = 'u''utf8_string'': u+'
matchStr = regexp(output, expression,'match');[/code]
Code 1 gives me only the[code]utf8_string[/code]part, not the word that follows it.
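For reference, here is a minimal sketch of the word extraction I am after. It uses a capture group with the 'tokens' option so that only the text between the quotes is returned. It assumes the word itself never contains a single quote, and the sample string is a shortened copy of the file content shown above rather than a real file.
[code]
% Shortened sample line (assumption: real files follow the same key layout).
txt = ['[{u''language'': u''english'', u''utf8_string'': u''troia'', ' ...
       'u''image_id'': 428035}, {u''language'': u''na'', u''id'': 1056660}]'];

% The parentheses capture the word; [^'']* matches up to the closing quote.
expression = 'u''utf8_string'': u''([^'']*)''';
tokens = regexp(txt, expression, 'tokens');   % one 1x1 cell per match

% Flatten the nested cells into a plain cell array of words.
words = cellfun(@(t) t{1}, tokens, 'UniformOutput', false);
[/code]
For a real file, txt could presumably be replaced by fileread('text1.txt'), which reads the whole file into one character vector. Entries without a utf8_string key are simply skipped.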
Code 2, for the box extraction:[code]s = sprintf('text_.txt');
fid = fopen(s);
tline = fgetl(fid);
C = regexp(tline,'u''box'': +\[([0-9\. ,]+)\]','tokens');
C = cellfun(@(x) x{1},C,'UniformOutput',false)';
M = cell2mat(cellfun(@(x) x', cat(1,C2{:}),'UniformOutput',false));[/code]
Code 2 runs, but not on every file; sometimes I get this error:[code]Error using cat Dimensions of matrices being concatenated are not consistent[/code]
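I am not sure what exactly triggers the error. One variant that might avoid it is to convert every match to numbers before stacking, since each box always has four values and the numeric rows then have the same length. The sketch below is only that, a sketch: the sample line is built from the two boxes shown above, and the same regular expression as in code 2 is reused.
[code]
% Sample line with two boxes (assumption: same layout as the real files).
tline = ['[{u''box'': [426.42422762784093, 225.33333055900806, ' ...
         '75.15151515151516, 50.909090909090864], u''id'': 1056659}, ' ...
         '{u''box'': [223.99998520359847, 249.57575480143228, ' ...
         '172.12121212121215, 140.6060606060606], u''id'': 1056660}]'];

% Capture the comma-separated numbers inside each box.
tokens = regexp(tline, 'u''box'': +\[([0-9\. ,]+)\]', 'tokens');

% Turn each matched string into a 1x4 numeric row, then stack the rows.
rows = cellfun(@(t) sscanf(t{1}, '%f,').', tokens, 'UniformOutput', false);
M = vertcat(rows{:});   % N-by-4 matrix, one row per box
[/code]
Each row of M is then one box, so M(1,:) would correspond to box and M(2,:) to box1 in the desired output.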