java - Regular expression to match strings in quotes with double-quotes inside -


i face challenge match input in following format:

  • the input consists of key=value pairs. key starts slash. value may number or string in quotes.
  • the value may optionally contain escaped quotes, quote following quote (""). such escaped quote should considered part of value. there no need check escaped quotes balanced (e.g. ends escaped quote).

the regular expression should match given key=value part of sequence , should not break long inputs (e.g. value 10000 characters).

first came solution:

/(\w+)=(\d+|"(?:""|[^"])+"(?!")) 

and performs not bad, fails in java6 stackoverflowerror long inputs (cashes regexplanet example). tried improve bit run faster:

/(\w+)=(\d+|"(?:""|[^"]+)+"(?!")) 

but if input not matching, enters endless loop in backtracking trying match it.

then came regex:

/(\w+)=(\d+|".+?(?<!")(?:"")*"(?!")) 

which performing slower, seems solve task.

can suggest better / faster regex?

sample input:

/mol_type="protein" /transl_table=11 /note="[cds] (""multi line)"  nn  /organism="""some"" sequence" nn  /organism="some ""sequence""" /translation="mhpsssriphiavvgvsaifpgsldahgfwrdilsgtdlitdvpsthwlve dyydpdpsapdktyakrgaflkdvpfdplewgvppsivpatdttqllalivakrvledaaqgqfe smsrermsvilgvtsaqellasmvsriqrpvwakalrdlgypedevkracdkiagnyvpwqessf pgllgnvvagrianrldlggtncvtdaacasslsamsmainelalgqsdlviaggcdtmndafmy mcfsktpalsksgdcrpfsdkadgtllgegiamvalkrlddaerdgdrvyavirgigsssdgrsk svyapvpegqakalrrtyaaagygpetvelmeahgtgtkagdaaefeglramfdesgredrqwca lgsvksqightkaaagaaglfkaimalhhkvlpptikvdkpnpkldiektafylntqarpwirpg dhprrasvssfgfggsnfhvaleeytgpapkawrvralpaelfllsadtpaaladraralakeae vpeilrflaresvlsfdasrparlglcatdeadlrkkleqvaahlearpeqalsaplvhcasgea pgrvaflfpgqgsqyvgmgadalmtfdparaawdaaagvaiadaplhevvfprpvfsdedraaqe arlretrwaqpaigatslahlallaalgvraeafaghsfgeitalhaagalsaadllrvarrrge lrtlgqvvdhlraslpaagpaasaspaaaasvpkastaavpavasvaapgaaevervvmavvaet tgypaemlglqmelesdlgidsikrveilsavrdrtpglsevdasalaqlrtlgqvvdhlraslp aasagpavaapaakapavaaptgvsgatpgaaevervvmavvaettgypaemlglqmelesdlgi dsikrveilsavrdrtpglaevdasalaqlrtlgqvvdhlraslgpaavtagaapaepaeepast plgrwtlveepapaaglampglfdagtlvitghdaigpalvaalaargiaaeyapavprgargav flgglrelatadaalavhreaflaaqaiaakpalfvtvqdtggdfglagsdrawvgglpglvkta alewpeascraidleragrsdgelaeaiasellsggveleiglradgrrttprsvrqdaqpgplp lgpsdvvvasggargvtaatlialarasharfallgrtaledepaacrgadgeaalkaalvkaat sagqrvtpaeigrsvakilanrevratldairaaggealyvpvdvndaravaaaldgvrgalgpv taivhgagvladklvaektveqfervfstkvdglrallgatagdplkaivlfssiaarggnkgqc dyamanevlnkvaaaeaarrpgcrvkslgwgpwqggmvnaaleahfaqlgvpliplaagakmlld elcdasgdrgargqggappgavelvlgaepkalaaqghggrvalavradrathpylgdhaingvp vvpvvialewfaraaracrpdlvvtelrdvrvlrgiklaayesggevfrvdcrevsnghgavlaa elrgpqgalhyaatiqmqqpegrvapkgpaapelgpwpaggelydgrtlfhgrdfqvirrldgvs rdgiagtvvglreagwvaqpwktdpaaldgglqlatlwtqhvlggaalpmsvgalhtfaegpsdg plravvrgqivardrtkadiafvddrgslvaelrdvqyvlrpdtargqa" /note="primer of  streptococcus pneumoniae 

expected output (from regexhero.net):

regex

in order fail in reasonable time need, indeed, avoid catastrophic backtracking. can done using atomic grouping (?>...):

/(\w+)=(\d+|"(?>(?>""|[^"]+)+)"(?!"))  # (?>(?>""|[^"]+)+) (?>               # throw away states created (...)+     (?>           # throw away states created [^"]+         ""|[^"]+     )+ ) 

your issue when using (?:""|[^"]+)+ on string never match, linked fact each time match new [^"] character regex engine can choose use inner or outer + quantifier.

this leads lot of possibilities backtracking, , before returning failure engine has try them all.

we know if haven't found match time engine reaches end, never will: need throw away backtracking positions avoid issue, , that's atomic grouping for.

see demo: 24 steps on failure, while preserving speed on successful cases (not real benchmarking tool, catastrophic backtracking pretty easy spot)


Comments

Popular posts from this blog

windows - Single EXE to Install Python Standalone Executable for Easy Distribution -

c# - Access objects in UserControl from MainWindow in WPF -

javascript - How to name a jQuery function to make a browser's back button work? -