#1
#2
Can anyone recommend a tool to extract the data from PDFs (other than Acrobat)?
#3
Can anyone recommend a tool to extract the data from PDFs (other than Acrobat)?
#4
http://pdftohtml.sourceforge.net/
It works in Windows and Linux.
#5
Can anyone recommend a tool to extract the data from PDFs (other than Acrobat)?
#6
Marc Nadeau <marcnadoNOSPAMSVP (AT) yahoo (DOT) fr> wrote:
: http://pdftohtml.sourceforge.net/ It works in windows and linux.

For some values of "work". It generates an attempt at exact imitation of the visual appearance of the PDF document, hence trying to fight against the strengths of HTML. It creates _no_ structural markup, just div, span, nobr, b, etc., combined with the use of CSS positioning in a manner that relies on browsers' violations of explicit requirements in CSS specifications.

And instead of generating a single HTML document, it converts each page separately and makes them appear in a frame, inside a frameset with no noframes element and with frames named "links" and "rechts", which is _so_ informative e.g. to a blind person, is it not? And with <title>014-048_Man_282198_50</title>.

It's probably easier to use cut and paste to get the textual content of a PDF file (and grab the images separately) and add adequate HTML markup by hand. At least you wouldn't need to remove randomly generated markup first. But it's not a big difference really, so if you don't know how to cut and paste from your favorite PDF viewer, you might almost as well use pdftohtml.
#7
: It's probably easier to use cut and paste to get the textual content of a PDF file (and grab the images separately) and add adequate HTML markup by hand. At least you wouldn't need to remove randomly generated markup first. But it's not a big difference really, so if you don't know how to cut and paste from your favorite PDF viewer, you might almost as well use pdftohtml.

Agreed. I should have said it works BUT you have to do a *lot* of hand editing after the conversion.
#8
#9
: I have tried a similar utility called pdf2htm (it also produces dreadful
: code btw). Getting rid of the code isn't a major problem, but I want to
: extract images in their native format (assuming that images inside PDFs
: are in a standard image format (?)) without recompressing them.

Screen capture to BMP, then convert it...
#10
Marc Nadeau <marcnadoNOSPAMSVP (AT) yahoo (DOT) fr> wrote:
: : It's probably easier to use cut and paste to get the textual content of a PDF file (and grab the images separately) and add adequate HTML markup by hand. At least you wouldn't need to remove randomly generated markup first. But it's not a big difference really, so if you don't know how to cut and paste from your favorite PDF viewer, you might almost as well use pdftohtml.
: Agreed. I should have said it works BUT you have to do a *lot* of hand editing after the conversion.

I have tried a similar utility called pdf2htm (it also produces dreadful code, btw). Getting rid of the code isn't a major problem, but I want to extract images in their native format (assuming that images inside PDFs are in a standard image format?) without recompressing them. Pdf2htm converts all graphics to JPEGs, which is especially unwanted because the images in question (1-bit line drawings) are unsuitable for the JPEG format.

So I'm still looking for a utility that extracts the raw data: unformatted text and the native images.
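
On the "standard image format (?)" question, a rough sketch may help. Images in a PDF are stored as stream objects, and the /Filter entry says how the bytes are encoded. A stream filtered only with DCTDecode is JPEG data and one filtered with JPXDecode is JPEG 2000 data, so those can be copied out byte for byte without recompression. 1-bit line drawings are more typically stored with CCITTFaxDecode, JBIG2Decode or FlateDecode, which are not standalone image files and would need to be wrapped (e.g. in a TIFF header) or decoded by a proper tool. The Python below is only an illustration under loud assumptions: the function names `find_images` and `save_native` are made up for this example, the regex scan ignores /Length, object streams and encryption, and it will miss images in many real files.

```python
import re
from pathlib import Path

# Filters whose stream payload is already a complete image file.
NATIVE_EXT = {b"DCTDecode": ".jpg", b"JPXDecode": ".jp2"}

# Naive patterns (assumption: plain, unencrypted PDF without object streams).
OBJ_RE = re.compile(rb"\d+\s+\d+\s+obj\b(.*?)\bendobj", re.DOTALL)
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")
FILTER_RE = re.compile(rb"/Filter\s*\[?\s*/(\w+)")


def find_images(pdf_bytes):
    """Yield (first_filter_name, raw_stream_bytes) for each image object found."""
    for obj in OBJ_RE.finditer(pdf_bytes):
        body = obj.group(1)
        stream = STREAM_RE.search(body)
        if stream is None:
            continue  # object has no stream, so it cannot be an image
        dictionary = body[:stream.start()]
        if not IMAGE_RE.search(dictionary):
            continue  # some other kind of stream (page content, font, ...)
        filt = FILTER_RE.search(dictionary)
        yield (filt.group(1) if filt else b"(none)"), stream.group(1)


def save_native(pdf_bytes, out_dir):
    """Copy JPEG / JPEG 2000 payloads unchanged; return (written, skipped)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written, skipped = [], []
    for index, (filt, data) in enumerate(find_images(pdf_bytes)):
        ext = NATIVE_EXT.get(filt)
        if ext is None:
            # Not a standalone file format: report it instead of guessing.
            skipped.append(filt.decode("ascii"))
            continue
        target = out / f"image{index:03d}{ext}"
        target.write_bytes(data)  # no decoding, no recompression
        written.append(target)
    return written, skipped
```

Calling `save_native(Path("manual.pdf").read_bytes(), "images")` (the file name is a placeholder) writes whatever JPEG or JPEG 2000 streams the scan finds and returns the filter names of the images it left alone, which at least shows which encoding the line drawings actually use.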