Full Version: automatic webpage screenshot via linux?
Seeker
Hi,

Is there a way to make automatic webpage screenshots via linux?
ie. some graphics lib/program etc. (not a linux box running X)

I would like to make thumbnails of webpages for an arbitrary number of randomly chosen domain names.

any ideas?

Thanks
smoker
You are probably going to need ImageMagick and either htmldoc, ghostscript or html2ps to achieve this.

Grab the page and convert it to pdf or ps, then use ImageMagick to convert that to jpeg or png and resize the image.

All hand carved really ;)

alan
ideasmultiples
Any sample or HOW-TO? ;)
smoker
Not to hand.
I have used htmldoc before but after reading a bit about ImageMagick, apparently there are problems converting pdf to image formats anyway, so html2ps is probably better.

My partner in crime has installed imagemagick on one of our servers but I haven't used it yet, but it should be easy to script the image conversion once you have acquired and converted the html page to postscript.

If you are looking to make this automatic, then you will need a fairly complex script to grab pages then run them through the process.

For example, using perl you would need the LWP modules to grab the pages then pass them to html2ps, and then on to ImageMagick.
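
The same chain can also be sketched as a shell function. This is only an illustration under assumptions: wget stands in for LWP, the function name page_to_jpeg_via_ps and the file names are made up, and html2ps is assumed to be on the PATH and to write its PostScript to standard output.

```shell
# Sketch of the html2ps route: fetch, convert to PostScript, rasterise.
# page_to_jpeg_via_ps and the file names are hypothetical; wget stands in for LWP.
page_to_jpeg_via_ps() {
    url=$1
    base=$2    # e.g. "page" gives page.html, page.ps and page.jpg

    # Keep a local copy of the page.
    wget -q -O "$base.html" "$url" || return 1

    # html2ps prints PostScript on stdout, so redirect it to a file.
    html2ps "$base.html" > "$base.ps" || return 1

    # ImageMagick hands .ps input to Ghostscript for rendering.
    convert "$base.ps" "$base.jpg"
}
```

It would be called as page_to_jpeg_via_ps http://www.abbey.com/index page. A page that spans several PostScript pages will still produce more than one image.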

If the pages are all on your server it makes life easier of course.

check out the ImageMagick mailing list archives for more details on that program.

http://studio.imagemagick.org/mailman/list...fo/magick-users

alan

[edit] You don't need LWP for perl, htmldoc can grab the pages by itself, LWP would create a local copy of the page [/edit]
smoker
Ok, I've been trying things out, and have managed to get the basic idea working.

Please note this is only to demonstrate the basics ;)

go to http://www.headru.sh/pdf

then enter a full url into the text box and hit submit

If you don't get an error then you should see a jpg appear.

Tip. Make sure the webpage you choose is only about 1 page of a pdf or you get more than 1 file.

So, if you enter a url, hit submit, and then get page not found, try adding .0 to the end of the url showing in your browser.

This is because imagemagick will create a jpeg for each page of the pdf file and number them. (ie, image.jpg.0 image.jpg.1 etc)
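
If only the first page is wanted, ImageMagick accepts a frame index in square brackets after the input file name, which avoids the numbered files altogether. A minimal sketch, assuming a PDF called output.pdf; the function name first_page_to_jpeg is only for illustration:

```shell
# Convert only the first page of a PDF to a JPEG.
# "file.pdf[0]" is ImageMagick's frame-index syntax; pages count from 0.
first_page_to_jpeg() {
    pdf=$1
    jpg=$2
    # Quote the argument so the shell does not treat [0] as a glob pattern.
    convert "${pdf}[0]" "$jpg"
}
```

Called as first_page_to_jpeg output.pdf image.jpg, this should write a single image.jpg however many pages the PDF has.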

This is really basic stuff and you can specify options to both htmldoc and imagemagick so it shouldn't be too hard to clean up the output.

good url to try is http://www.abbey.com/index
that fits nicely on 1 page !
the files are named as the exact time they are created at the moment (minutes & seconds).

see what you think.

You will need :

htmldoc
ImageMagick
Ghostscript (for use by ImageMagick)
gtk+-1.2.10-11:1.i386.rpm (libs for ghostscript)

alan
Seeker
Thanks for the info and the time.

[edited]

Is there a way to prevent htmldoc from creating a pdf file with 4 pages (in case of a long webpage)
smoker
Not that I've found, but you can make imagemagick append all the pages to the same image and then crop the image.

Here are my command lines for htmldoc and imagemagick to produce an image 400x350 at 50% resolution.

htmldoc :
CODE
htmldoc --continuous --browserwidth 800 --landscape --size A4 --header ... --left 1in --embedfonts -f output.pdf url


(This gives output with no page breaks, a pdf browser width of 800, in landscape format on A4 pages, a blank header, a 1 inch left hand margin, embedding all original fonts used, to file output.pdf using url as input.)

ImageMagick :

CODE
convert -scale 50% output.pdf -append image.jpg


This converts the pdf and scales the image to 50% while appending each page of the pdf to the same image


CODE
convert -crop 400x350-0-0 image.jpg image.jpg


This crops the image to 400x350 starting at 0 column and 0 row

CODE
convert -font spacetoa.ttf -fill red -pointsize 36 -gravity center -draw 'text 0,0 "Headrush Inc"' image.jpg image.jpg


This writes my logo on the image using an uploaded font (must be in the working directory), in red, 36 points type, centered on the image.


Note that all these commands are on one line each to be done sequentially.

I use them in a perl script so you can use variables instead of fixed names, urls etc.
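
For anyone without perl to hand, the same idea can be sketched as a shell function. The commands and flags are the ones shown above; the function name make_thumbnail, its arguments and the way the PDF name is derived are assumptions for illustration, and it assumes both tools return a non-zero status on failure. The logo step is left out because it depends on a font file being present.

```shell
# Sketch: run the htmldoc and ImageMagick steps from this post for one URL.
# make_thumbnail and its arguments are hypothetical names.
make_thumbnail() {
    url=$1                 # page to capture
    out=$2                 # final image, e.g. thumb.jpg
    pdf="${out%.*}.pdf"    # intermediate PDF, named after the image

    # Step 1: fetch the page and render it to a PDF.
    htmldoc --continuous --browserwidth 800 --landscape --size A4 \
        --header ... --left 1in --embedfonts -f "$pdf" "$url" || return 1

    # Step 2: scale to 50% and append all PDF pages into one image.
    convert -scale 50% "$pdf" -append "$out" || return 1

    # Step 3: crop to 400x350 starting at column 0, row 0.
    convert -crop 400x350-0-0 "$out" "$out" || return 1
}
```

Usage would look like make_thumbnail http://www.abbey.com/index thumb.jpg, which leaves thumb.pdf and thumb.jpg in the working directory.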

It is still buggy, as some sites use frames, so you only get the main frame, or some sites have javascript browser checkers so all you see is a warning page. Also, some sites use so many graphics or java that the pdf file is enormous and you can't process it.

There are many many options to both these tools though, and it appears that you can use ImageMagick by itself and lose htmldoc, but you need html2ps installed first.

links:
http://www.imagemagick.org/
http://www-106.ibm.com/developerworks/libr...raf/?ca=dnt-428
http://www.easysw.com/htmldoc/htmldoc.html


have fun

:D

alan

oh BTW, make sure you delete the pdf files after use, some of them are over 400 k !!!
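
A small sketch of that clean-up step, assuming the PDF should only go once the image has actually been written; cleanup_pdf and its arguments are made-up names:

```shell
# Remove the intermediate PDF once the image exists.
# cleanup_pdf and its arguments are hypothetical names.
cleanup_pdf() {
    pdf=$1
    img=$2
    # Only delete the PDF if the image file exists and is not empty,
    # so a failed conversion can still be inspected or retried.
    if [ -s "$img" ]; then
        rm -f -- "$pdf"
    fi
}
```

For example, cleanup_pdf output.pdf image.jpg after the convert commands have run.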
Seeker
Thanks again.

I also found out that htmldoc will download the webpage without the need to use some other application like wget.
smoker
No worries, now all I need is a use for this :D

Maybe good for checking content of sites on my servers.

It doesn't handle asp pages well though.

alan
newuser
htmldoc doesn't render most pages very well.

It's unfortunate too, since this is a highly desirable functionality.

If you want more true renderings of webpages, you will be better off investigating the use of mozilla on unix systems to render.