openoffice.org - Dealing with a .doc file in Python and getting a limited list of characters
I know .doc files cannot be read directly in Python. When I read one in Python using os.open() and os.read(), I get the following result no matter how long the actual document is. What are these characters?
b'\xd0\xcf\x11\xe0\xa1\xb1'
It is the signature of an OLECF (OLE Compound File) file:
http://www.forensicswiki.org/wiki/ole_compound_file#file_signature
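To check for this signature yourself, here is a minimal sketch. The full signature is eight bytes; the six bytes in the question are its first six. The name `is_olecf` is just an illustration, not part of any library. A likely reason the read stopped after six bytes, assuming it ran on Windows: os.open() without os.O_BINARY opens the file in text mode, where the byte 0x1A (the seventh byte of the signature) is treated as end-of-file. Opening in binary mode avoids that.

```python
# Full 8-byte OLE Compound File signature; the question shows its first six bytes.
OLECF_SIGNATURE = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def is_olecf(path):
    """Return True if the file at `path` starts with the OLECF signature.

    `is_olecf` is a hypothetical helper name used only for this sketch.
    """
    # "rb" opens the file in binary mode, so no byte is interpreted
    # as a text-mode end-of-file marker.
    with open(path, "rb") as f:
        header = f.read(len(OLECF_SIGNATURE))
    return header == OLECF_SIGNATURE
```

Calling `is_olecf("report.doc")` (with a file name of your own) tells you whether the file is an OLE compound file. It does not tell you which kind from the list below.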
The OLECF is used to store:
- Microsoft Office 97-2003 documents:
  - Word document (DOC)
  - Excel spreadsheet (XLS)
  - PowerPoint presentation (PPT)
- MSN (Toolbar) (C:\Documents and Settings\%USERNAME%\Local Settings\Application Data\Microsoft\MSNe\msninfo.dat)
- Jump Lists
- StickyNotes.snt
- Thumbs.db
- Windows Installer (.msi) and patch file (.msp)
- Windows Search (srchadm.msc)
For more information, see the Compound Binary File Specification.
That being said, reading .doc files directly is not an easy way to extract text from MS Word files. You may try the python-docx library if the files you are dealing with are .docx files.
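To show why .docx is the easier case, here is a dependency-free sketch using only the standard library instead of python-docx. A .docx file is a ZIP archive, and the sketch assumes the main text sits at the conventional location `word/document.xml`. The name `docx_paragraphs` is made up for this example. With python-docx installed, something like `[p.text for p in docx.Document(path).paragraphs]` should give roughly the same result for body paragraphs.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of each paragraph in a .docx file as a list of strings.

    `docx_paragraphs` is a hypothetical helper name used only for this sketch.
    """
    with zipfile.ZipFile(path) as archive:
        # Assumption: the main document part is at the conventional location.
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's visible text is split across <w:t> elements.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return paragraphs
```

This will not work on a .doc file: zipfile raises `BadZipFile` because an OLE compound file is not a ZIP archive.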