openoffice.org - Dealing with a .doc file in Python and getting only a limited list of characters


I know .doc files cannot be read directly in Python. Still, when I read one in Python using os.open() and os.read(), I get the following result no matter how long the actual document is. I want to know what these characters are.

b'\xd0\xcf\x11\xe0\xa1\xb1' 

These bytes are the beginning of the signature of an OLE Compound File (OLECF):

http://www.forensicswiki.org/wiki/ole_compound_file#file_signature
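As a minimal sketch, the signature can be checked by opening the file in binary mode and comparing its first bytes. The full OLECF signature is eight bytes long (D0 CF 11 E0 A1 B1 1A E1); the helper name `is_olecf` is made up for this illustration. One possible explanation for the read stopping after six bytes is that the seventh byte is 0x1A, which Windows treats as end-of-file when a file is opened in text mode, so reading in binary mode should avoid that.

```python
# Full 8-byte OLE Compound File signature; the question shows its first 6 bytes.
OLECF_SIGNATURE = bytes.fromhex("d0cf11e0a1b11ae1")


def is_olecf(path):
    """Return True if the file starts with the OLECF signature.

    Hypothetical helper for illustration only.
    """
    # "rb" opens the file in binary mode, so no byte is interpreted as
    # an end-of-file marker or translated as a line ending.
    with open(path, "rb") as f:
        return f.read(len(OLECF_SIGNATURE)) == OLECF_SIGNATURE
```

Calling `is_olecf("report.doc")` (the file name is just an example) tells you whether the file is an OLE container, not whether it is actually a Word document, since the formats listed in this answer share the same signature.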

OLECF is used to store:

  • Microsoft Office 97-2003 documents:
    • Word document (doc)
    • Excel spreadsheet (xls)
    • PowerPoint presentation (ppt)
  • MSN (Toolbar) (C:\Documents and Settings\%USERNAME%\Local Settings\Application Data\Microsoft\MSNe\msninfo.dat)
  • Jump Lists
  • StickyNotes.snt
  • Thumbs.db
  • Windows Installer (.msi) and patch files (.msp)
  • Windows Search (srchadm.msc)

For more information, see the Compound Binary File specification.

That being said, reading .doc files directly is not an easy way to extract text from MS Word files. You may want to try the python-docx library if the files you are dealing with are .docx files.
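To show why .docx is easier to work with: a .docx file is a ZIP archive whose main text lives in `word/document.xml`. The sketch below uses only the standard library rather than python-docx, and the helper name `docx_paragraphs` is made up for this example. It ignores headers, footers, footnotes and formatting, so treat it as a rough illustration, not a replacement for a real library. With python-docx installed, `docx.Document(path).paragraphs` gives similar per-paragraph access through each paragraph's `.text`.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of each paragraph in the body of a .docx file.

    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's visible text is split across <w:t> elements.
        text = "".join(t.text or "" for t in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs
```

This will not work on .doc files: they are OLE containers, not ZIP archives, so `zipfile.ZipFile` raises `zipfile.BadZipFile` for them.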

