HighDots Forums  

OT: extract data from PDF's

Discuss OT: extract data from PDF's in the alt.html forum.



#1
Spartanicus
OT: extract data from PDF's - 02-06-2004, 11:11 AM

Can anyone recommend a tool to extract the data from PDF's (other than
Acrobat)?

--
Spartanicus

#2
Paul Furman
Re: OT: extract data from PDF's - 02-06-2004, 11:19 AM

Automatically with PHP, manually with the select or text tools in
Acrobat. http://us3.php.net/manual/en/ref.pdf.php

Spartanicus wrote:

Quote:
Can anyone recommend a tool to extract the data from PDF's (other than
Acrobat)?



#3
Marc Nadeau
Re: OT: extract data from PDF's - 02-06-2004, 09:24 PM

Spartanicus wrote:

Quote:
Can anyone recommend a tool to extract the data from PDF's (other than
Acrobat)?

There is a command line utility called pdftohtml at

http://pdftohtml.sourceforge.net/

It works on Windows and Linux.

You may have to compile the source.

I tried it and it works.

Good luck!

--
The reason most women are little touched by friendship is that it is
insipid once one has felt love. La Rochefoucauld



#4
Jukka K. Korpela
Re: OT: extract data from PDF's - 02-07-2004, 03:51 AM

Marc Nadeau <marcnadoNOSPAMSVP (AT) yahoo (DOT) fr> wrote:

Quote:
http://pdftohtml.sourceforge.net/

It works in windows and linux.

For some values of "work". It generates an attempt at exact imitation
of the visual appearance of the PDF document, hence trying to fight
against the strengths of HTML. It creates _no_ structural markup, just
div, span, nobr, b, etc., combined with the use of CSS positioning in a
manner that relies on browsers' violations of explicit requirements in
CSS specifications. And instead of generating a single HTML document,
it converts each page separately and makes them appear in a frame,
inside a frameset with no noframes element and with frames named
"links" and "rechts", which is _so_ informative e.g. to a blind person,
is it not? And with <title>014-048_Man_282198_50</title>.

It's probably easier to use cut and paste to get the textual content of
a PDF file (and grab the images separately) and add adequate HTML
markup by hand. At least you wouldn't need to remove randomly generated
markup first. But it's not a big difference really, so if you don't
know how to cut and paste from your favorite PDF viewer, you might
almost as well use pdftohtml.
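
For anyone who does start from converter output, here is a minimal sketch of the "remove the generated markup first" step. It uses only Python's standard html.parser module to throw away the tags and keep the text, which then still needs real structural markup added by hand. The class and function names are made up for illustration, and the sample input is only a guess at what such output looks like.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes; ignore all tags and the contents of some elements."""

    # Elements whose contents are not body text (the generated <title>
    # is a meaningless file name, so it is dropped as well).
    SKIP = {"style", "script", "title"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is outside the skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string, one text run per line."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Calling html_to_text() on each generated page gives plain text to paste into a properly marked-up document. Images still have to be fetched separately.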

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html




#5
Toby A Inkster
Re: OT: extract data from PDF's - 02-07-2004, 05:39 AM

Spartanicus wrote:

Quote:
Can anyone recommend a tool to extract the data from PDF's (other than
Acrobat)?

Ghostscript contains a tool ps2ascii, which can handle PDF input.
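
A minimal sketch of calling ps2ascii from a script, assuming Ghostscript's ps2ascii is installed and on PATH and accepts "ps2ascii input.pdf output.txt". The function names and file names are hypothetical.

```python
import shutil
import subprocess


def build_ps2ascii_command(pdf_path, txt_path):
    """Return the argument list for: ps2ascii input.pdf output.txt"""
    return ["ps2ascii", pdf_path, txt_path]


def pdf_to_ascii(pdf_path, txt_path):
    """Run ps2ascii if it is available; return True on success."""
    if shutil.which("ps2ascii") is None:
        return False  # ps2ascii is not installed or not on PATH
    result = subprocess.run(build_ps2ascii_command(pdf_path, txt_path))
    return result.returncode == 0
```

For example, pdf_to_ascii("manual.pdf", "manual.txt") would write the extracted text to manual.txt. Images are not covered by this tool.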

--
Toby A Inkster BSc (Hons) ARCS
Contact Me - http://www.goddamn.co.uk/tobyink/?page=132



#6
Marc Nadeau
Re: OT: extract data from PDF's - 02-07-2004, 09:00 PM

Jukka K. Korpela wrote:

Quote:
Marc Nadeau <marcnadoNOSPAMSVP (AT) yahoo (DOT) fr> wrote:

http://pdftohtml.sourceforge.net/

It works in windows and linux.

For some values of "work". It generates an attempt at exact imitation
of the visual appearance of the PDF document, hence trying to fight
against the strengths of HTML. It creates _no_ structural markup, just
div, span, nobr, b, etc., combined with the use of CSS positioning in a
manner that relies on browsers' violations of explicit requirements in
CSS specifications. And instead of generating a single HTML document,
it converts each page separately and makes them appear in a frame,
inside a frameset with no noframes element and with frames named
"links" and "rechts", which is _so_ informative e.g. to a blind person,
is it not? And with <title>014-048_Man_282198_50</title>.

It's probably easier to use cut and paste to get the textual content of
a PDF file (and grab the images separately) and add adequate HTML
markup by hand. At least you wouldn't need to remove randomly generated
markup first. But it's not a big difference really, so if you don't
know how to cut and paste from your favorite PDF viewer, you might
almost as well use pdftohtml.

Agreed. I should have said it works BUT you have to do a *lot* of hand
editing after the conversion.

Some of my customers send me their documents as .doc files, and although these
can easily be exported as HTML files it is much better (and ultimately less
work) to just cut and paste the text and add markup by hand.


--
To pass for an idiot in the eyes of an imbecile is
a pleasure for the true connoisseur. Alphonse Allais.



#7
Spartanicus
Re: OT: extract data from PDF's - 02-08-2004, 03:15 AM

Marc Nadeau <marcnadoNOSPAMSVP (AT) yahoo (DOT) fr> wrote:

Quote:
It's probably easier to use cut and paste to get the textual content of
a PDF file (and grab the images separately) and add adequate HTML
markup by hand. At least you wouldn't need to remove randomly generated
markup first. But it's not a big difference really, so if you don't
know how to cut and paste from your favorite PDF viewer, you might
almost as well use pdftohtml.

Agreed. I should have said it works BUT you have to do a *lot* of hand
editing after the conversion.

I have tried a similar utility called pdf2htm (it also produces dreadful
code, btw). Getting rid of the code isn't a major problem, but I want to
extract images in their native format (assuming that images inside PDFs
are in a standard image format (?)) without recompressing them. Pdf2htm
converts all graphics to JPEGs, which is especially unwanted because the
images in question (1-bit line drawings) are unsuitable for the JPEG
format.

So I'm still looking for a utility that extracts the raw data,
unformatted text and the native images.

--
Spartanicus


#8
Sid Ismail
Re: OT: extract data from PDF's - 02-08-2004, 04:09 AM

On Sun, 08 Feb 2004 08:15:24 +0000, Spartanicus <me (AT) privacy (DOT) net> wrote:

: I have tried a similar utility called pdf2htm (it also produces dreadful
: code btw). Getting rid of the code isn't a major problem, but I want to
: extract images in their native format (assuming that images inside pdf's
: are in a standard image format (?)) without recompressing them.


Screen capture to BMP? Then convert it...

Sid




#9
Spartanicus
Re: OT: extract data from PDF's - 02-08-2004, 09:15 AM

Sid Ismail <elsid (AT) nospam (DOT) com> wrote:

Quote:
: I have tried a similar utility called pdf2htm (it also produces dreadful
: code btw). Getting rid of the code isn't a major problem, but I want to
: extract images in their native format (assuming that images inside pdf's
: are in a standard image format (?)) without recompressing them.

Screen capture to bmp ? then convert it...

That would prevent me from reusing the images in their native format,
and there are hundreds of images in the PDFs. All of them would need
manual cropping etc., so it's not a realistic option.

--
Spartanicus


#10
Zak McGregor
Re: OT: extract data from PDF's - 02-09-2004, 05:17 PM

On Sun, 08 Feb 2004 10:15:24 +0200, Spartanicus
<me (AT) privacy (DOT) net> wrote:

Quote:
Marc Nadeau <marcnadoNOSPAMSVP (AT) yahoo (DOT) fr> wrote:

It's probably easier to use cut and paste to get the textual content
of a PDF file (and grab the images separately) and add adequate HTML
markup by hand. At least you wouldn't need to remove randomly
generated markup first. But it's not a big difference really, so if
you don't know how to cut and paste from your favorite PDF viewer, you
might almost as well use pdftohtml.

Agreed. I should have said it works BUT you have to do a *lot* of hand
editing after the conversion.

I have tried a similar utility called pdf2htm (it also produces dreadful
code btw). Getting rid of the code isn't a major problem, but I want to
extract images in their native format (assuming that images inside pdf's
are in a standard image format (?)) without recompressing them. Pdf2htm
converts all graphics to jpg's, this is especially unwanted because the
images in question (1bit line drawings) are unsuitable for the jpeg
format.

So I'm still looking for a utility that extracts the raw data,
unformatted text and the native images.

There is a slew of pdf2* and pdfto* utilities:
pdf2dsc pdf2ps pdffonts pdfimages pdfinfo pdfopt
pdftopbm pdftops pdftotext

(just on my machine; I have made no real effort to get them there either).
Presumably pdfimages will do more or less what you want. It uses PPM
format by default, although if the PDF contains JPEGs you can specify that
they be left as JPEGs.
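
A minimal sketch of driving pdfimages and pdftotext from a script, which may help with hundreds of images. It assumes the xpdf (or Poppler) command-line tools are installed and on PATH; the function names and file names are made up for illustration. The -j option is the one that leaves embedded JPEGs as JPEG files. Without it, pdfimages writes PPM (or PBM for 1-bit images, as far as I know), so nothing gets recompressed as JPEG.

```python
import shutil
import subprocess


def build_pdfimages_command(pdf_path, image_root, keep_jpeg=True):
    """Argument list for pdfimages; -j keeps embedded JPEGs as .jpg files."""
    cmd = ["pdfimages"]
    if keep_jpeg:
        cmd.append("-j")
    cmd += [pdf_path, image_root]  # image_root is the output file name prefix
    return cmd


def build_pdftotext_command(pdf_path, txt_path):
    """Argument list for pdftotext, which writes unformatted text."""
    return ["pdftotext", pdf_path, txt_path]


def run_if_available(cmd):
    """Run cmd if its executable is on PATH; return True on success."""
    if shutil.which(cmd[0]) is None:
        return False  # tool not installed
    return subprocess.run(cmd).returncode == 0
```

For example, run_if_available(build_pdfimages_command("manual.pdf", "img")) should leave the extracted images next to the PDF with names starting with "img", and run_if_available(build_pdftotext_command("manual.pdf", "manual.txt")) the plain text.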

HTH

Ciao

Zak

--
========================================================================
http://www.carfolio.com/ Searchable database of 10 000+ car specs
========================================================================

