machine learning - Extracting time, date, flight # from airplane e-ticket


Problem

Given the (x, y) positions of every word in a flight e-ticket, extract the flight numbers and the corresponding departure/arrival times + dates.

My first try

Use regexes to find flight numbers, dates, and times. Match flights to the correct dates and times using the (x, y) positions. This is done with a set of rules I came up with. The issue is that these rules are becoming longer and more complicated as I try to make them work on a variety of e-tickets.

  • For example, "A320" could be "Aegean Airlines (A3) flight 20" or the irrelevant "Airbus A320".
  • Another example: "320P" could be a time, 3:20 PM, or part of an irrelevant code that appears in the e-ticket.
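
To make the ambiguity concrete, here is a minimal Python sketch of the regex-plus-position approach described above. The `Word` class, the two patterns, the sample coordinates, and the row-weighting factor in `nearest` are all assumptions made for illustration, not the rules actually used.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float
    y: float


# Assumed patterns: a letter, a letter or digit, then 1-4 digits for a flight
# number, and a compact 12-hour time such as "320P" or "3:20PM".
FLIGHT_RE = re.compile(r"^[A-Z][A-Z0-9]\d{1,4}$")
TIME_RE = re.compile(r"^(\d{1,2}):?(\d{2})([AP])M?$", re.IGNORECASE)


def candidates(words, pattern):
    """Return every word whose text matches the pattern."""
    return [w for w in words if pattern.match(w.text)]


def parse_time(text):
    """Convert a matched time to (hour, minute) on a 24-hour clock, or None."""
    m = TIME_RE.match(text)
    if m is None:
        return None
    hour, minute, half = int(m.group(1)), int(m.group(2)), m.group(3).upper()
    if not (1 <= hour <= 12 and minute < 60):
        return None
    return hour % 12 + (12 if half == "P" else 0), minute


def nearest(anchor, others):
    """Pick the closest word, weighting vertical distance to favour the same row."""
    return min(others, key=lambda w: abs(w.x - anchor.x) + 5 * abs(w.y - anchor.y))


# Made-up layout: a flight row, and an aircraft-type line further down.
words = [
    Word("A320", 40, 100),
    Word("320P", 200, 100),
    Word("Airbus", 40, 300),
    Word("A320", 90, 300),
]

flights = candidates(words, FLIGHT_RE)
times = candidates(words, TIME_RE)
```

With this sample, `flights` contains both occurrences of "A320", including the one that follows "Airbus". The pattern alone cannot tell them apart, which is why extra rules keep piling up.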

How should I approach this? What topics should I look into?

Handwritten rules are OK for such things, but you need better tools than simple regular expressions. Try out the GATE framework and its JAPE rule engine, which works like regexes but operates on so-called annotations (e.g. "Token", "Word", "Noun", "Number", "Sentence", etc.) rather than on raw characters.

Here's the GATE manual, which gives both a quick introduction and an in-depth feature description. Pay attention to the chapters on basic JAPE rules and on gazetteers, which are dictionaries with a large number of pre-included names, e.g. cities, airports, people's names, etc.
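
The idea of writing rules over annotations plus gazetteer lookups can be sketched in plain Python. This is not GATE's API or JAPE syntax; the `Annotation` class, the tiny `AIRLINES` and `AIRCRAFT_MAKERS` dictionaries, and the single rule are hypothetical stand-ins that only show the principle.

```python
from dataclasses import dataclass, field

# Hypothetical mini-gazetteers; a real one would hold far more entries.
AIRLINES = {"A3": "Aegean Airlines"}
AIRCRAFT_MAKERS = {"airbus", "boeing"}


@dataclass
class Annotation:
    kind: str
    text: str
    features: dict = field(default_factory=dict)


def annotate(tokens):
    """Turn raw tokens into typed annotations using the gazetteers."""
    result = []
    for token in tokens:
        code, number = token[:2], token[2:]
        if token.lower() in AIRCRAFT_MAKERS:
            result.append(Annotation("AircraftMaker", token))
        elif code in AIRLINES and number.isdigit():
            features = {"airline": AIRLINES[code], "number": int(number)}
            result.append(Annotation("FlightCandidate", token, features))
        else:
            result.append(Annotation("Token", token))
    return result


def flight_numbers(annotations):
    """Rule: keep a FlightCandidate unless an AircraftMaker directly precedes it."""
    flights = []
    previous = None
    for current in annotations:
        is_candidate = current.kind == "FlightCandidate"
        after_maker = previous is not None and previous.kind == "AircraftMaker"
        if is_candidate and not after_maker:
            flights.append(current)
        previous = current
    return flights


tokens = "Flight A320 operated with Airbus A320".split()
found = flight_numbers(annotate(tokens))
```

Here the rule refers to annotation types instead of character patterns, so the "A320" after "Airbus" is dropped while the first one is kept as Aegean Airlines flight 20. Covering a new ticket layout then mostly means extending the dictionaries or adding a rule, not rewriting a regex.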

