I created a simple program that reads a file, asks the user to input a word, and then tells how many times that word is used. I want to improve it so that I don't have to type the exact directory each time. I imported tkinter and used the code filename = filedialog.askopenfilename() so that a box pops up and lets me choose the file. Every time I try to use it, though, I get the following error:
    Traceback (most recent call last):
      File "/users/ashleystallings/documents/school work/computer programming/side projects/how many? (python).py", line 24, in <module>
        for line in filescan.read().split():  #reads a line of the file and stores
      File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x8e in position 12: ordinal not in range(128)
Most of the time I don't seem to get the error when I try to open a .txt file, but I want to open .docx files as well. Thanks in advance :)
    # Name: Ashley Stallings
    # Program description: Asks the user to input a word to search for in a
    # specified file and then tells how many times it's used.
    from tkinter import filedialog

    print("Hello! Welcome to the 'How Many' program.")
    filename = filedialog.askopenfilename()  # gets the file name

    cont = "Yes"

    while cont == "Yes":
        word = input("Please enter the word to scan for. ")  # asks for the word
        capitalized = word.capitalize()
        lowercase = word.lower()
        accumulator = 0

        print("\n")
        print("\n")  # making it pretty
        print("Searching...")

        filescan = open(filename, 'r')  # opens the file

        for line in filescan.read().split():  # reads the file and splits it into words
            line = line.rstrip("\n")
            if line == capitalized or line == lowercase:
                accumulator += 1
        filescan.close()

        print("The word", word, "is in the file", accumulator, "times.")

        cont = input('Type "Yes" to check another word or '
                     '"No" to quit. ')  # deciding the next step
        cont = cont.capitalize()

        if cont != "No" and cont != "Yes":
            print("Invalid input!")

    print("\n")
    print("Thanks for using How Many!")  # ending
P.S. Not sure if it matters, but I'm running OS X.
> Most of the time I don't seem to get the error when I try to open a .txt file, but I want to open .docx files as well.
A docx file isn't a text file; it's an Office Open XML file: a zipfile containing an XML document and other supporting files. Trying to read it as a text file isn't going to work.
For example, the first 4 bytes of the file are going to be this:

    b'PK\x03\x04'
You can't interpret that as UTF-8, ASCII, or anything else without getting a bunch of garbage. You're certainly not going to find your words in it.
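If you want to see this for yourself, you can read the first four bytes in binary mode. This is only a rough sketch: looks_like_zip is a made-up helper name, not something from a library, and it only checks the signature shown above (the standard library also has zipfile.is_zipfile for this kind of check).

```python
# Signature that starts the first entry of an ordinary, non-empty zip archive.
ZIP_SIGNATURE = b'PK\x03\x04'


def looks_like_zip(path):
    """Return True if the file at `path` begins with the zip signature.

    Hypothetical helper for illustration; it does not validate the archive.
    """
    # Open in binary mode so no text decoding is attempted.
    with open(path, 'rb') as f:
        return f.read(4) == ZIP_SIGNATURE
```

A file that passes this check should be opened with zipfile rather than with open(filename, 'r').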
You can do the processing on your own: use zipfile to access document.xml inside the archive, then use an XML parser to get the text nodes, and rejoin them so you can split them on whitespace. For example:
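    import itertools
    import zipfile
    import xml.etree.ElementTree as ET

    # Open the archive and parse the main document part
    with zipfile.ZipFile('foo.docx') as z:
        document = z.open('word/document.xml')
        tree = ET.parse(document)

    # Every run of text lives in a <w:t> element
    textnodes = tree.findall('.//{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t')
    # An empty <w:t/> has text None, so fall back to an empty string
    text = itertools.chain.from_iterable((node.text or '').split() for node in textnodes)
    for word in text:
        ...  # compare each word against the search term here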
Of course it would be better to parse the xmlns declarations and register the w namespace properly, so you can just use 'w:t'. If you have any idea what that means, you already know how to do it; if you don't, this isn't the place for a tutorial on XML namespaces and ElementTree.
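For what it's worth, here is roughly what the 'w:t' version can look like. Note that this sketch takes a shortcut: instead of parsing the xmlns declarations out of the file, it hard-codes the prefix-to-URI mapping and passes it to findall, which ElementTree accepts as its namespaces argument. The names NS and docx_words are made up for the example.

```python
import zipfile
import xml.etree.ElementTree as ET

# Hard-coded mapping (an assumption of this sketch); 'w' is only a local
# label used in the search path, the URI is what actually matters.
NS = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}


def docx_words(docx):
    """Yield whitespace-separated words from word/document.xml.

    `docx` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as z:
        with z.open('word/document.xml') as document:
            tree = ET.parse(document)
    for node in tree.findall('.//w:t', NS):
        # node.text is None for an empty <w:t/> element
        for word in (node.text or '').split():
            yield word
```

Counting a word then comes down to something like sum(1 for w in docx_words(filename) if w.lower() == word.lower()). Words that Word happens to split across two text nodes will show up as two pieces, the same as in the example above.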
So, how are you supposed to know that it's a zipfile full of stuff, that the actual text is in the file word/document.xml, that the actual text within that file is in .//w:t nodes, that the namespace w maps to http://schemas.openxmlformats.org/wordprocessingml/2006/main, and so on? Well, you can read the relevant documentation and figure it out, using some sample files and a bit of exploration to guide you, if you already know enough about this stuff. If you don't, there's a major learning curve ahead of you.
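The "bit of exploration" part can be as simple as listing what is inside the archive and tallying which element tags appear in the main document. A rough sketch, where explore_docx is just a name invented for this example:

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def explore_docx(docx):
    """Return (member names, Counter of element tags in word/document.xml).

    `docx` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as z:
        names = z.namelist()  # every file stored in the archive
        with z.open('word/document.xml') as document:
            root = ET.parse(document).getroot()
    # ElementTree reports tags as '{namespace-uri}localname'
    tags = Counter(element.tag for element in root.iter())
    return names, tags
```

Printing names shows the supporting files, and tags.most_common() shows which elements, with their full namespace URIs, are worth reading about.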
And even if you do know what you're doing, it's probably a better idea to search PyPI for a docx parser module and just use that.