openoffice.org - Dealing with a .doc file in Python and getting a limited list of characters

- June 15, 2012

I know .doc files cannot be read directly in Python. Still, when I read one in Python using os.open() and os.read(), I get the following result no matter how long the actual document is. I want to know what these characters are.

b'\xd0\xcf\x11\xe0\xa1\xb1'

It is the signature of an OLECF (OLE Compound File) file:

http://www.forensicswiki.org/wiki/ole_compound_file#file_signature
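A minimal sketch of checking a file for this signature follows. The full OLECF signature is eight bytes, `D0 CF 11 E0 A1 B1 1A E1`, so the output above shows only the first six. The helper name `is_olecf` is hypothetical, not something from the question. The file is opened in binary mode on purpose: the seventh byte, 0x1A, is Ctrl-Z, which text-mode reads on Windows treat as end of file, and that is probably why the read in the question stops after six bytes regardless of document length.

```python
# The eight-byte signature that starts every OLE Compound File.
OLECF_SIGNATURE = bytes.fromhex("d0cf11e0a1b11ae1")


def is_olecf(path):
    """Return True if the file at `path` starts with the OLECF signature.

    `is_olecf` is an illustrative helper name, not a library function.
    """
    # "rb" returns the raw bytes unchanged on every platform, so the
    # 0x1A byte inside the signature is not interpreted as end of file.
    with open(path, "rb") as f:
        return f.read(len(OLECF_SIGNATURE)) == OLECF_SIGNATURE
```

With the lower-level API used in the question, the equivalent on Windows would be to add the `os.O_BINARY` flag to the `os.open()` call.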

The OLECF is used to store:

- Microsoft Office 97-2003 documents:
  - Word document (.doc)
  - Excel spreadsheet (.xls)
  - PowerPoint presentation (.ppt)
- MSN (toolbar) (C:\Documents and Settings\%USERNAME%\Local Settings\Application Data\Microsoft\MSNe\msninfo.dat)
- Jump Lists
- StickyNotes.snt
- Thumbs.db
- Windows Installer (.msi) and patch file (.msp)
- Windows Search (srchadm.msc)

For more information, see the Compound Binary File Specification.

That being said, reading the raw bytes of a .doc file is not an easy way to extract text from MS Word files. You may want to try the python-docx library if the files you are dealing with are .docx files.
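As a rough sketch of that suggestion, the code below first looks at the leading bytes to tell a legacy compound file from a ZIP container (a .docx file is a ZIP archive) and only then hands the file to python-docx. The function names `sniff_container` and `extract_docx_text` are hypothetical. python-docx is a third-party package (`pip install python-docx`), and it reads .docx only, not the legacy .doc format.

```python
OLECF_SIGNATURE = bytes.fromhex("d0cf11e0a1b11ae1")  # legacy .doc/.xls/.ppt
ZIP_SIGNATURE = b"PK\x03\x04"                        # ZIP container, e.g. .docx


def sniff_container(path):
    """Return "olecf", "zip" or "unknown" based on the file's first bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head == OLECF_SIGNATURE:
        return "olecf"
    if head.startswith(ZIP_SIGNATURE):
        return "zip"
    return "unknown"


def extract_docx_text(path):
    """Return the paragraph text of a .docx file, one paragraph per line."""
    # Imported here so that sniffing works even without python-docx installed.
    from docx import Document

    document = Document(path)
    return "\n".join(paragraph.text for paragraph in document.paragraphs)
```

A caller would check `sniff_container(path) == "zip"` before calling `extract_docx_text(path)`. A result of `"olecf"` means the file is in the old binary format and needs a different tool or a conversion step first.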
