UnicodeDecodeError while trying to use Tkinter (Python)


I created a simple program that reads a file, asks the user to input a word, and then tells how many times that word is used. I want to improve it so that I don't have to type in the exact directory each time. I imported Tkinter and used the code filename = filedialog.askopenfilename() so that a box pops up and lets me choose the file. Every time I try to use it, though, I get the following error...

    Traceback (most recent call last):
      File "/users/ashleystallings/documents/school work/computer programming/side projects/how many? (python).py", line 24, in <module>
        for line in filescan.read().split():   #reads line of file , stores
      File "/library/frameworks/python.framework/versions/3.3/lib/python3.3/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x8e in position 12: ordinal not in range(128)

The only time I don't seem to get this error is when I try to open a .txt file, but I want to open .docx files as well. Thanks in advance :)

    # Name: Ashley Stallings
    # Program description: Asks the user to input a word to search for in a specified
    # file and then tells how many times it's used.
    from tkinter import filedialog

    print("Hello! Welcome to the 'How Many' program.")
    filename = filedialog.askopenfilename()  # gets the file name

    cont = "yes"

    while cont == "yes":
        word = input("Please enter the word to scan for. ")  # asks for the word
        capitalized = word.capitalize()
        lowercase = word.lower()
        accumulator = 0

        print("\n")
        print("\n")  # making it pretty
        print("Searching...")

        filescan = open(filename, 'r')  # opens the file

        for line in filescan.read().split():  # reads the file and splits it into words
            line = line.rstrip("\n")
            if line == capitalized or line == lowercase:
                accumulator += 1
        filescan.close()

        print("The word", word, "is in the file", accumulator, "times.")

        cont = input('Type "yes" to check for another word or "no" to quit. ')  # deciding next step
        cont = cont.lower()

        if cont != "no" and cont != "yes":
            print("Invalid input!")

    print("\n")
    print("Thanks for using How Many!")  # ending

P.S. Not sure if it matters, but I'm running OS X.

"The only time I don't seem to get this error is when I try to open a .txt file, but I want to open .docx files as well."

A docx file isn't a text file; it's an Office Open XML file: a zipfile containing an XML document and other supporting files. Trying to read it as a text file isn't going to work.

For example, the first 4 bytes of the file are going to be this:

    b'PK\x03\x04'

You can't interpret that as UTF-8, ASCII, or anything else without getting a bunch of garbage. You're certainly not going to find your words in this.
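To see this for yourself, here is a minimal sketch. It builds a tiny zip archive in memory as a stand-in for a real .docx (the archive contents and the helper name looks_like_zip are made up for illustration), checks its first four bytes, and shows that a non-ASCII byte like the 0x8e from the traceback can't be decoded as ASCII.

```python
import io
import zipfile

# Build a tiny zip archive in memory as a stand-in for a real .docx file.
# (A real .docx has many more members; this one is purely illustrative.)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as z:
    z.writestr('word/document.xml', '<doc/>')
raw = buffer.getvalue()

print(raw[:4])  # the zip local-file-header signature


def looks_like_zip(data):
    """Hypothetical helper: True if the bytes start with the zip signature."""
    return data[:4] == b'PK\x03\x04'


# Decoding binary data as ASCII fails as soon as a byte >= 0x80 shows up.
try:
    b'PK\x03\x04\x8e'.decode('ascii')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(looks_like_zip(raw), decoded_ok)
```

For a file on disk, the standard library's zipfile.is_zipfile(filename) does a more thorough version of the same check.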


You can do some of the processing on your own: use zipfile to access the document.xml inside the archive, then use an XML parser to get the text nodes, and then rejoin them so you can split them on whitespace. For example:

    import itertools
    import zipfile
    import xml.etree.ElementTree as ET

    # Open the .docx as a zip archive and parse the main document part.
    with zipfile.ZipFile('foo.docx') as z:
        document = z.open('word/document.xml')
        tree = ET.parse(document)

    # Text lives in w:t elements; the namespace is spelled out in full here.
    textnodes = tree.findall('.//{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t')
    text = itertools.chain.from_iterable((node.text or '').split() for node in textnodes)
    for word in text:
        pass  # ...

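Of course it would be better to actually parse the xmlns declarations and register the w namespace properly, so you can just use 'w:t'. But if you have any idea what that means, you already know that, and if you don't, this isn't the place for a tutorial on XML namespaces and ElementTree.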
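As a rough sketch of what the 'w:t' form looks like: ElementTree's findall accepts a prefix-to-URI mapping as its second argument. The mapping below is hard-coded rather than parsed from the xmlns declarations, and SAMPLE_XML and words_from_document_xml are made-up names standing in for a real word/document.xml, which is far more elaborate.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping, hard-coded here instead of parsed from the document.
NAMESPACES = {
    'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main',
}

# Illustrative stand-in for the contents of word/document.xml.
SAMPLE_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body><w:p>'
    '<w:r><w:t>hello big</w:t></w:r>'
    '<w:r><w:t>world</w:t></w:r>'
    '</w:p></w:body>'
    '</w:document>'
)


def words_from_document_xml(xml_text):
    """Hypothetical helper: collect whitespace-separated words from all w:t nodes."""
    root = ET.fromstring(xml_text)
    words = []
    for node in root.findall('.//w:t', NAMESPACES):
        # node.text is None for an empty element, so fall back to ''.
        words.extend((node.text or '').split())
    return words


words = words_from_document_xml(SAMPLE_XML)
print(words)
```

One caveat with this approach: Word can split a single visible word across several w:t runs, so a count based on per-node splitting may miss some matches.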


So, how are you supposed to know that it's a zipfile full of stuff, that the actual text is in the file word/document.xml, that the actual text within that file is in .//w:t nodes, that the namespace w maps to http://schemas.openxmlformats.org/wordprocessingml/2006/main, and so on? Well, you can read all the relevant documentation and figure it out, using some sample files and a bit of exploration to guide you, if you know enough about this stuff already. But if you don't, there's a major learning curve ahead of you.
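The "bit of exploration" can start with something as simple as listing what's inside the archive. This sketch uses an in-memory archive with two made-up members as a stand-in; list_members is a hypothetical helper, and with a real file you would pass its path instead of the buffer.

```python
import io
import zipfile


def list_members(path_or_file):
    """Hypothetical helper: return the member names of a zip archive, in archive order."""
    with zipfile.ZipFile(path_or_file) as z:
        return z.namelist()


# Stand-in archive; a real .docx contains more parts than these two.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as z:
    z.writestr('[Content_Types].xml', '<Types/>')
    z.writestr('word/document.xml', '<doc/>')
buffer.seek(0)

members = list_members(buffer)
for name in members:
    print(name)
```

From there you can open individual members with ZipFile.open and look at the XML to see which elements hold the text.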

And even if you do know what you're doing, it might be a better idea to just search PyPI for a docx parser module and use that.

