I'm currently developing a tool that detects addresses (or any other pattern, such as a job, a sports team, etc.) in a text.

Here is what I'm currently doing:

1/ Splitting the text into words
2/ Stemming the words
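
To make that concrete, here is roughly what the preprocessing looks like (a minimal Python sketch; the regex tokenizer and NLTK's SnowballStemmer are just the choices I made for illustration, and the function name is a placeholder):

```python
import re
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def preprocess(sentence):
    """Split a sentence into lowercase words and stem each of them."""
    words = re.findall(r"\w+", sentence.lower())
    return [stemmer.stem(w) for w in words]

# preprocess("I live in Brown Street, in London")
# -> ['i', 'live', 'in', 'brown', 'street', 'in', 'london']
```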

Users can create categories (job, sports team, address...) and manually assign sentences to a category.

Each stemmed word of that sentence is stored in the DB, with its score incremented by 1.
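
As a sketch, the storage and the +1 update could look like this (an in-memory dict standing in for the MySQL table, keyed by (category, stem); it reuses preprocess() from the sketch above, and all the names are made up):

```python
from collections import defaultdict

# Stand-in for a MySQL table with columns (category, stem, score).
scores = defaultdict(int)

def train(sentence, category):
    """The user assigned `sentence` to `category`: +1 for each stemmed word."""
    for stem in preprocess(sentence):
        scores[(category, stem)] += 1
```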

When I browse a new document, I compute a score for each sentence based on all the words it contains.

Example:

I live in Brown Street, in London

=> (live +1, Brown +1, Street +1, London +1)

Then the next time I see

I live in Orange Street, in London

the score will be 3 (live +1, Street +1, London +1), so I can say "this sentence might be an address". If the user validates it, I update the words (live +1, orange +1, street +1, london +1). If they mark it as inaccurate, all of the words are downvoted.

I think that with more runs I will be able to detect addresses, since "Street" and "London" will end up with high scores (and the same goes for zip codes, etc.).
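
Put together, the scoring and the feedback loop look roughly like this (again just a sketch on top of the helpers above; the function names and the +1/-1 feedback are simply my own wording of the idea):

```python
def score(sentence, category):
    """Sum the stored scores of the sentence's stemmed words for one category."""
    return sum(scores[(category, stem)] for stem in preprocess(sentence))

def feedback(sentence, category, accurate):
    """+1 on every word if the user validates, -1 on every word if not."""
    delta = 1 if accurate else -1
    for stem in preprocess(sentence):
        scores[(category, stem)] += delta

# train("I live in Brown Street, in London", "address")
# score("I live in Orange Street, in London", "address")
#   -> high, because 'live', 'street' and 'london' have already been seen
# feedback("I live in Orange Street, in London", "address", accurate=True)
```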

My questions are:

First, what do you think of this approach? Second, context is simply ignored with this approach: a sentence containing both Street and London should get a higher score, because detecting Street and London together in the same sentence makes it much more likely to be an address.

How can I store that information in a database? I'm currently using a relational database (MySQL), but I'm afraid it will grow huge if I store a link between every pair of words.

Is this what's called a neural network? What is the best way to store it?

Do you have any tips for improving my detection algorithm?