machine learning - Extracting time, date, and flight # from an airplane e-ticket
Problem
Given the (x, y) positions of every word in a flight e-ticket, extract the flight numbers and their corresponding departure/arrival times and dates.
My first try
I use regexes to find flight numbers, dates, and times. I then match flights to the correct dates and times using their (x, y) positions. This is done with a set of rules I came up with. The issue is that these rules keep becoming longer and more complicated as I try to make them work on a variety of e-tickets.
- For example, "A320" could be "Aegean Airlines (A3) flight 20" or the irrelevant "Airbus A320".
- Another example: "320p" could be a time, 3:20pm, or part of an irrelevant code that appears in the e-ticket.
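A minimal sketch of the kind of first attempt described above, assuming the words arrive as (text, x, y) tuples. The two patterns, the vertical weight of 5, and the sample coordinates are made up for illustration and are not taken from any real e-ticket.

```python
import re
from math import hypot

# Hypothetical patterns; real e-tickets will need broader ones.
FLIGHT_RE = re.compile(r"^[A-Z][A-Z0-9]\d{1,4}$")   # e.g. A3601
TIME_RE = re.compile(r"^([01]?\d|2[0-3]):[0-5]\d$")  # e.g. 08:15


def find_candidates(words, pattern):
    """Return the (text, x, y) words whose text matches the pattern."""
    return [w for w in words if pattern.match(w[0])]


def nearest(anchor, candidates):
    """Pick the candidate closest to the anchor word.

    Vertical distance is weighted 5x (an arbitrary choice) so that
    words on the same row are preferred over words in the same column.
    """
    _, ax, ay = anchor
    return min(
        candidates,
        key=lambda c: hypot(c[1] - ax, 5 * (c[2] - ay)),
        default=None,
    )


# Made-up layout: two rows, each with a flight, an airport and a time.
words = [
    ("A3601", 40, 100), ("ATH", 120, 100), ("08:15", 200, 100),
    ("A3602", 40, 130), ("LHR", 120, 130), ("17:40", 200, 130),
]

flights = find_candidates(words, FLIGHT_RE)
times = find_candidates(words, TIME_RE)

# Map each flight number to the text of its nearest time.
pairs = {f[0]: nearest(f, times)[0] for f in flights}
print(pairs)
```

This works on a tidy layout like the one above, but the flight pattern would also accept "A320" in "Airbus A320", which is exactly the kind of ambiguity that forces more and more special-case rules.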
How should I approach this? What topics should I look into?
Handwritten rules are OK for such things, but you need better tools than simple regular expressions. Try out the GATE framework and its JAPE rule engine, which, unlike plain regexes, operates on so-called annotations (e.g. "Token", "Word", "Noun", "Number", "Sentence", etc.).
Here's the GATE manual, which will give you both a quick introduction and an in-depth feature description. Pay attention to the chapters on basic JAPE rules and on gazetteers, which are dictionaries with a large number of pre-included names, e.g. cities, airports, people's names, etc.
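The idea of writing rules over annotations and gazetteer lookups, rather than over raw characters, can be illustrated without GATE itself. The sketch below is plain Python, not JAPE and not any GATE API. The gazetteer entries, the label names "FlightShape" and "AircraftMaker", and the single rule are all hypothetical.

```python
import re

# Tiny hand-made gazetteers; the entries are illustrative assumptions.
AIRLINE_CODES = {"A3": "Aegean Airlines", "LH": "Lufthansa"}
AIRCRAFT_MAKERS = {"airbus", "boeing"}

# Two-character airline code followed by 1-4 digits.
FLIGHT_SHAPE = re.compile(r"^([A-Z][A-Z0-9])(\d{1,4})$")


def annotate(tokens):
    """Attach a set of annotation labels to every token."""
    annotated = []
    for tok in tokens:
        labels = set()
        m = FLIGHT_SHAPE.match(tok.upper())
        # Only tokens whose prefix is a known airline code qualify.
        if m and m.group(1) in AIRLINE_CODES:
            labels.add("FlightShape")
        if tok.lower() in AIRCRAFT_MAKERS:
            labels.add("AircraftMaker")
        annotated.append((tok, labels))
    return annotated


def flight_numbers(tokens):
    """Rule: a FlightShape token is a flight number unless the
    token right before it is annotated as an AircraftMaker."""
    annotated = annotate(tokens)
    result = []
    for i, (tok, labels) in enumerate(annotated):
        if "FlightShape" not in labels:
            continue
        prev_labels = annotated[i - 1][1] if i > 0 else set()
        if "AircraftMaker" in prev_labels:
            continue
        result.append(tok.upper())
    return result
```

For the token list `["Flight", "A320", "operated", "by", "Airbus", "A320"]`, `flight_numbers` returns `["A320"]`: the first "A320" is kept and the one following "Airbus" is rejected. The rule refers to labels rather than to character patterns, so extending the gazetteers does not require touching the rule.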