PDFTextStripper stripper = new PDFText2HTML(encoding);
String result = stripper.getText(document).trim();
The result contains something like
<!doctype html public "-//w3c//dtd html 4.01 transitional//en" "http://www.w3.org/tr/html4/loose.dtd"> <html><head><title>inserat sele ee rev</title> <meta http-equiv="content-type" content="text/html; charset=utf-8"> </head> <body> <div style="page-break-before:always; page-break-after:always"><div><p>�&#...
instead of
<!doctype html public "-//w3c//dtd html 4.01 transitional//en" "http://www.w3.org/tr/html4/loose.dtd"> <html><head><title>inserat sele ee rev</title> <meta http-equiv="content-type" content="text/html; charset=utf-8"> </head> <body> <div style="page-break-before:always; page-break-after:always"><div><p>any blablabla characters...
When I change the encoding to windows-1252 or UTF-8, the result does not change. The problematic PDF is at http://www.permaco.ch/fileadmin/user_upload/jobs/inserat_sele_ee_rev.pdf
How can I parse this PDF?
"How can I parse this PDF?"
Short of OCR'ing, you don't.
The PDF in question does not contain the information required to extract the text without doing at least some OCR (at the very least, OCR'ing each character of the font used to find a mapping from glyph to character), which would require additional libraries and code.
As a requirement for text extraction, the PDF specification ISO 32000-1:2008 correctly states in section 9.10.2 that the font used for the text to extract needs to
- either contain a ToUnicode CMap (the font used in your document doesn't),
- or be a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity-H and Identity-V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection (the font used in your document isn't),
- or be a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (the font used in your document neither uses one of the predefined encodings nor takes the character names in its Differences array from the selections mentioned: the names used are /0, /1, ..., /155).
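To make the three conditions easier to follow, here is a minimal sketch of them as a plain Java check. It is only an illustration: ExtractabilityCheck, FontInfo, and all of their fields are hypothetical names made up for this example and are not part of PDFBox, and SAMPLE_KNOWN_NAMES is a tiny hand-picked subset of the character names the specification lists in Annex D. The caller is assumed to have already determined whether a composite font's CMap is one from Table 118.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of the section 9.10.2 conditions; not a PDFBox API. */
class ExtractabilityCheck {

    /** Simplified, made-up view of the font properties that matter here. */
    static class FontInfo {
        boolean hasToUnicode;        // font dictionary contains a ToUnicode CMap
        boolean composite;           // true for a composite (Type 0) font
        String predefinedCMap;       // name of a Table 118 CMap, or null if none is used
        String characterCollection;  // e.g. "Adobe-Japan1", or null
        String baseEncoding;         // e.g. "WinAnsiEncoding", or null
        List<String> differences = Collections.emptyList(); // names from the Differences array
    }

    static final Set<String> SIMPLE_ENCODINGS = new HashSet<>(Arrays.asList(
            "MacRomanEncoding", "MacExpertEncoding", "WinAnsiEncoding"));
    static final Set<String> COLLECTIONS = new HashSet<>(Arrays.asList(
            "Adobe-GB1", "Adobe-CNS1", "Adobe-Japan1", "Adobe-Korea1"));
    // Assumption: illustrative subset only; the real sets are much larger.
    static final Set<String> SAMPLE_KNOWN_NAMES = new HashSet<>(Arrays.asList(
            "space", "A", "a", "adieresis", "germandbls", "alpha"));

    static boolean isExtractable(FontInfo f) {
        // Condition 1: a ToUnicode CMap is present.
        if (f.hasToUnicode) return true;
        // Condition 2: composite font with a usable predefined CMap or a known collection.
        if (f.composite) {
            boolean usableCMap = f.predefinedCMap != null
                    && !f.predefinedCMap.equals("Identity-H")
                    && !f.predefinedCMap.equals("Identity-V");
            return usableCMap || COLLECTIONS.contains(f.characterCollection);
        }
        // Condition 3: simple font with a predefined encoding or only well-known glyph names.
        if (SIMPLE_ENCODINGS.contains(f.baseEncoding)) return true;
        return !f.differences.isEmpty() && SAMPLE_KNOWN_NAMES.containsAll(f.differences);
    }

    public static void main(String[] args) {
        // Roughly the situation in the PDF from the question: names like /0, /1, ..., /155.
        FontInfo problem = new FontInfo();
        problem.differences = Arrays.asList("0", "1", "155");
        System.out.println(isExtractable(problem)); // false

        FontInfo winAnsi = new FontInfo();
        winAnsi.baseEncoding = "WinAnsiEncoding";
        System.out.println(isExtractable(winAnsi)); // true
    }
}
```

A font like the one in your document fails all three branches, which is why changing the output encoding of the stripper makes no difference.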
Generally, a good first test is to try to copy and paste the text using Adobe Reader, since a lot of text extraction experience has gone into Reader's code. When you try that with this document, you'll see garbage.