How to extract varying words and numbers using regular expressions in MATLAB

syedzainnasir

Question: 06-Mar-2017

In: MATLAB Projects

I have more than 10k text files that look like the one below. They all share the same format but differ in size; some are longer and some are shorter.

[code][{u'language': u'english', u'area': 3825.8953168044045, u'class': u'machine printed', u'utf8_string': u'troia', u'image_id': 428035, u'box': [426.42422762784093, 225.33333055900806, 75.15151515151516, 50.909090909090864], u'legibility': u'legible', u'id': 1056659}, {u'language': u'na', u'area': 24201.285583103767, u'id': 1056660, u'image_id': 428035, u'box': [223.99998520359847, 249.57575480143228, 172.12121212121215, 140.6060606060606], u'legibility': u'illegible', u'class': u'machine printed'}]
[/code]

I want to extract two varying fields from every file using regular expressions.

The output should look like this:

[code]box  = [223.99998520359847, 249.57575480143228, 172.12121212121215, 140.6060606060606]
box1 = ... (sometimes there is more than one box)[/code]

And the second output:

[code]word = troia
word1 = ... (sometimes there is more than one word)[/code]

My code 1, for the word extraction:

[code]fid = fopen('text1.txt','r');
C = textscan(fid, '%s','Delimiter','');
fclose(fid);

C = C{:};

Lia = ~cellfun(@isempty, strfind(C,'utf8_string'));

output = [C{find(Lia)}];
expression = 'u''utf8_string'': u+'
matchStr = regexp(output, expression,'match');[/code]

Code 1 gives me only [code]utf8_string[/code], not the word itself.

My code 2, for the box number extraction:

[code]s = sprintf('text_.txt'); 
fid = fopen(s);
tline = fgetl(fid);

C = regexp(tline,'u''box'': +\[([0-9\. ,]+)\]','tokens');
C = cellfun(@(x) x{1},C,'UniformOutput',false)';
M = cell2mat(cellfun(@(x) x', cat(1,C2{:}),'UniformOutput',false));[/code]

Code 2 runs, but not on every file; sometimes I get this error:
[code]Error using cat
Dimensions of matrices being concatenated are not consistent[/code]
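
For reference, here is a minimal sketch of one way both fields could be pulled out with capture groups. It is not a tested solution for all 10k files. It assumes each file fits in memory, that no word contains a single quote, and that box values are plain decimals (it reuses the character class from code 2). The names txt, words, boxRows and boxes are made up for illustration. The pattern in code 1 stops right after the key, which is probably why only utf8_string comes back; adding a parenthesised group and the 'tokens' option returns the value instead. The cat error may come from stacking pieces of unequal size, but that is a guess, since C2 is not defined in the snippet. Converting each match to a numeric row before stacking avoids that case.

[code]% Read the whole file as one char vector (file name taken from code 1)
txt = fileread('text1.txt');

% Words: capture the text between the quotes that follow u'utf8_string': u'
wordTok = regexp(txt, 'u''utf8_string'':\s*u''([^'']*)''', 'tokens');
words = cellfun(@(t) t{1}, wordTok, 'UniformOutput', false);  % 1-by-N cell of char

% Boxes: capture the comma-separated numbers inside the brackets after u'box':
boxTok = regexp(txt, 'u''box'':\s*\[([0-9\. ,]+)\]', 'tokens');

% Turn each captured string into a numeric row, then stack the rows
boxRows = cellfun(@(t) sscanf(t{1}, '%f,').', boxTok, 'UniformOutput', false);
boxes = vertcat(boxRows{:});  % one row per box, four columns each[/code]

On the sample file above this should yield a single word (troia) and a 2-by-4 boxes matrix. To process every file, the same lines could be wrapped in a loop over the result of dir('*.txt').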