java - PDFTextStripper parsing with wrong encoding


PDFTextStripper stripper = new PDFText2HTML(encoding);
String result = stripper.getText(document).trim();

The result contains something like

<!doctype html public "-//w3c//dtd html 4.01 transitional//en"  "http://www.w3.org/tr/html4/loose.dtd"> <html><head><title>inserat  sele ee rev</title> <meta http-equiv="content-type"  content="text/html; charset=utf-8"> </head> <body> <div  style="page-break-before:always;  page-break-after:always"><div><p>&#0;&#1;&#2;&#3;&#4;&#5;&#6;&#7;&#... 

instead of

 <!doctype html public "-//w3c//dtd html 4.01 transitional//en"  "http://www.w3.org/tr/html4/loose.dtd"> <html><head><title>inserat  sele ee rev</title> <meta http-equiv="content-type"  content="text/html; charset=utf-8"> </head> <body> <div  style="page-break-before:always; page-break-after:always"><div><p>any  blablabla characters... 

When I change the encoding to windows-1252 or UTF-8, the result does not change. The problematic PDF is at http://www.permaco.ch/fileadmin/user_upload/jobs/inserat_sele_ee_rev.pdf

How can I parse this PDF?

"How can I parse this PDF?"

Short of OCR'ing it, you don't.

The PDF in question does not contain the information required to extract text without doing at least some OCR (at a minimum, OCR'ing each character of the font used to find the mapping from glyph to character), which would require additional libraries and code.

As a requirement for text extraction, the PDF specification ISO 32000-1:2008 correctly states in section 9.10.2 that a font used for text to be extracted needs to

  • either contain a ToUnicode CMap (the font used in your document doesn't),
  • or be a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity-H and Identity-V), or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection (the font used in your document isn't),
  • or be a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (the font used in your document neither uses one of the predefined encodings, nor are the character names in its Differences array from the selections mentioned: the names used are /0, /1, ..., /155).
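
The three conditions above can be written down as a small decision function. The following is only a sketch in plain Java, not PDFBox code: FontInfo, ExtractabilityCheck, and all their fields are hypothetical names invented for illustration. The caller is assumed to have read the relevant entries from the font dictionary already, and to supply the set of accepted glyph names (the demo uses a tiny subset, not the real Adobe lists).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical summary of one font's extraction-relevant entries; not a PDFBox class. */
final class FontInfo {
    boolean hasToUnicode;        // font dictionary contains a ToUnicode CMap
    boolean composite;           // composite (Type0) font
    boolean usesPredefinedCMap;  // a Table 118 CMap other than Identity-H/V (caller decides)
    String cidCollection;        // e.g. "Adobe-Japan1", or null
    String encodingName;         // e.g. "WinAnsiEncoding", or null
    List<String> differenceNames = new ArrayList<>(); // names without the leading slash
}

final class ExtractabilityCheck {
    static final Set<String> PREDEFINED_ENCODINGS = new HashSet<>(Arrays.asList(
            "MacRomanEncoding", "MacExpertEncoding", "WinAnsiEncoding"));
    static final Set<String> CID_COLLECTIONS = new HashSet<>(Arrays.asList(
            "Adobe-GB1", "Adobe-CNS1", "Adobe-Japan1", "Adobe-Korea1"));

    /** Returns true if the font meets one of the three conditions listed above. */
    static boolean isExtractable(FontInfo f, Set<String> knownGlyphNames) {
        if (f.hasToUnicode) {
            return true;
        }
        if (f.composite) {
            return f.usesPredefinedCMap || CID_COLLECTIONS.contains(f.cidCollection);
        }
        if (PREDEFINED_ENCODINGS.contains(f.encodingName)) {
            return true;
        }
        // Simple font with a custom encoding: every Differences name must be known.
        return !f.differenceNames.isEmpty()
                && knownGlyphNames.containsAll(f.differenceNames);
    }

    /** Models the font described above: no ToUnicode CMap, names /0 ... /155. */
    static FontInfo fontWithNumericNames() {
        FontInfo f = new FontInfo();
        for (int i = 0; i <= 155; i++) {
            f.differenceNames.add(Integer.toString(i));
        }
        return f;
    }

    public static void main(String[] args) {
        // Tiny illustrative subset only; the real name lists are much longer.
        Set<String> known = new HashSet<>(Arrays.asList("A", "B", "space", "one"));
        System.out.println(isExtractable(fontWithNumericNames(), known)); // prints false
    }
}
```

Names such as 0, 1, ..., 155 carry no character meaning, so the last check fails and the font is reported as not extractable, which matches what you see.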

Generally, a first test is to try to copy and paste the text using Adobe Reader, as there is a lot of text extraction experience in Reader's code. When you try that with this document, you'll see that you get garbage.
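
If you want a rough programmatic counterpart to that manual test, you could check how much of the extracted text consists of control characters, since output like &#0;&#1;&#2;... suggests raw character codes were passed through unmapped. This is only a heuristic sketch: the class GarbageCheck, its methods, and the threshold are assumptions made for illustration and are not part of PDFBox.

```java
/** Hypothetical helper; a heuristic, not a definitive test. */
final class GarbageCheck {

    /** Fraction of chars that are control characters other than tab, LF, and CR. */
    static double controlCharRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int control = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isISOControl(c) && c != '\t' && c != '\n' && c != '\r') {
                control++;
            }
        }
        return (double) control / text.length();
    }

    /** The threshold is an arbitrary, caller-chosen value (for example 0.3). */
    static boolean looksLikeGarbage(String text, double threshold) {
        return controlCharRatio(text) > threshold;
    }

    public static void main(String[] args) {
        String suspicious = "\u0000\u0001\u0002\u0003\u0004\u0005\u0006\u0007";
        System.out.println(looksLikeGarbage(suspicious, 0.3));          // prints true
        System.out.println(looksLikeGarbage("any readable text", 0.3)); // prints false
    }
}
```

A check like this would be applied to the plain string returned by the stripper before any HTML escaping. It can only tell you that extraction failed, not recover the text.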

