HighDots Forums  

How to find unreferenced web pages, images and files in a web page directory tree?

HTML Writing HTML for the Web (comp.infosystems.www.authoring.html)





#1
Patricia Mindanao

How to find unreferenced web pages, images and files in a web page directory tree? - 12-06-2007, 05:15 AM






I have a directory tree on my hard disk that mirrors all the web pages and linked files
on my web host's server.

All web pages and files are statically linked, so dynamically composed links (e.g.
built with JavaScript) do not matter here.

Now I want to find out which of these (many) files are unreferenced orphans,
starting from the main page index.html (or index.shtml).

In other words, a file such as aaa.log is an orphan if it cannot be reached by a chain like

index.html -> subpage8.html -> details2345.html -> aaa.log

Is there a tool that can help me find all these unreferenced web pages and files?
Of course without doing a manual code review :-)

Keep in mind that the static link URLs can be absolute (http://www.mywebpages.com/content/subpage8.html)
or relative (content/subpage8.html).

Pat

#2
GArlington

Re: How to find unreferenced web pages, images and files in a web page directory tree? - 12-06-2007, 06:26 AM






On Dec 6, 11:15 am, pat... (AT) hotmail (DOT) com (Patricia Mindanao) wrote:
Quote:
I have a directory tree on my hard disk that mirrors all the web pages and linked files
on my web host's server.

All web pages and files are statically linked, so dynamically composed links (e.g.
built with JavaScript) do not matter here.

Now I want to find out which of these (many) files are unreferenced orphans,
starting from the main page index.html (or index.shtml).

In other words, a file such as aaa.log is an orphan if it cannot be reached by a chain like

index.html -> subpage8.html -> details2345.html -> aaa.log

Is there a tool that can help me find all these unreferenced web pages and files?
Of course without doing a manual code review :-)

Keep in mind that the static link URLs can be absolute (http://www.mywebpages.com/content/subpage8.html)
or relative (content/subpage8.html).

Pat

You will have to write a small script, or look for one on the net,
that will:
1) Select your starting file (index.html if you want to check which
pages are reachable from your index page).
2) Read the selected file.
3) Scan the file, extract all links, work out which files they
refer to, check that those files exist, and flag them as used.
4) Take each file flagged in step 3 that has not been read yet and
go back to step 2 with it.
Continue until there are no more files to check. Any file that was
never flagged as used is an orphan.
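
The steps above could be sketched roughly as follows in Python. This is only a minimal sketch under some assumptions, not a finished tool: SITE_URL is assumed to be the public address that the local mirror root corresponds to, only href and src attributes are treated as links (so CSS url(...) references and server-side includes in .shtml files are not followed), and the names LinkCollector, local_path and find_orphans are made up for this example.

```python
import os
from html.parser import HTMLParser
from urllib.parse import quote, unquote, urljoin, urlsplit

SITE_URL = "http://www.mywebpages.com/"   # assumption: address the mirror root maps to
LINK_ATTRS = {"href", "src"}              # assumption: attributes that count as links
HTML_SUFFIXES = (".html", ".htm", ".shtml")
INDEX_NAMES = ("index.html", "index.shtml")


class LinkCollector(HTMLParser):
    """Collect the values of href/src attributes found in one HTML file."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in LINK_ATTRS and value:
                self.links.append(value)


def local_path(root, page_url, link):
    """Map a link found on page_url to a path under root, or None if off-site."""
    target = urlsplit(urljoin(page_url, link))
    if target.netloc != urlsplit(SITE_URL).netloc:
        return None  # external site, mailto:, javascript:, ...
    path = os.path.normpath(os.path.join(root, unquote(target.path).lstrip("/")))
    if os.path.isdir(path):
        # A link to a directory is served by its index page, if there is one.
        for name in INDEX_NAMES:
            if os.path.isfile(os.path.join(path, name)):
                return os.path.join(path, name)
        return None
    return path


def find_orphans(root, start="index.html"):
    """Return files under root (relative paths) that are not reachable from start."""
    root = os.path.abspath(root)
    used = set()
    queue = [os.path.join(root, start)]             # step 1: starting file
    while queue:                                    # step 4: repeat until done
        path = queue.pop()
        if path in used or not os.path.isfile(path):
            continue                                # already read, or broken link
        used.add(path)                              # step 3: flag as used
        if not path.lower().endswith(HTML_SUFFIXES):
            continue                                # images, logs etc. hold no links
        rel = os.path.relpath(path, root).replace(os.sep, "/")
        page_url = urljoin(SITE_URL, quote(rel))
        parser = LinkCollector()
        with open(path, encoding="utf-8", errors="replace") as fh:
            parser.feed(fh.read())                  # step 2: read the file
        for link in parser.links:
            target = local_path(root, page_url, link)
            if target is not None:
                queue.append(target)
    everything = {os.path.join(d, f) for d, _, files in os.walk(root) for f in files}
    return sorted(os.path.relpath(p, root) for p in everything - used)
```

Calling find_orphans("/path/to/mirror") would then list every file that no chain of links from index.html reaches. Resolving each link with urljoin against the page's own URL is what lets absolute and relative links be handled the same way.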


#3
William Hughes

Re: How to find unreferenced web pages, images and files in a web page directory tree? - 12-07-2007, 07:45 AM



On 06 Dec 2007 11:15:12 GMT, in comp.infosystems.www.authoring.html
patmin (AT) hotmail (DOT) com (Patricia Mindanao) wrote:

Quote:
Is there a tool that can help me find all these unreferenced web pages and files?
Of course without doing a manual code review :-)

Xenu - http://home.snafu.de/tilman/xenulink.html

It also checks external (off-site) links.
--
William Hughes, San Antonio, Texas: cvproj (AT) grandecom (DOT) net
The Carrier Project: http://home.grandecom.net/~cvproj/carrier.htm
Support Project Valour-IT: http://soldiersangels.org/valour/index.html


#4
salmobytes

Re: How to find unreferenced web pages, images and files in a web page directory tree? - 12-07-2007, 08:50 AM



On Dec 6, 4:15 am, pat... (AT) hotmail (DOT) com (Patricia Mindanao) wrote:
Quote:
I have a directory tree on my hard disk that mirrors all the web pages and linked files
on my web host's server.

All web pages and files are statically linked, so dynamically composed links (e.g.
built with JavaScript) do not matter here.

On any Linux server (most websites are served this way, no? Linux/Apache?)

############
Create this file and call it "ifgrep".
If you don't put it in your current working
directory, make sure it is in your execution path.
chmod +x ifgrep <enter> (do this to make it executable)

the ifgrep file:
#!/bin/sh
# Usage: ifgrep NAME FILE
# Print NAME if FILE mentions it anywhere (fixed-string match).

if grep -qF -- "$1" "$2"; then
    echo "$1"
fi
############
Create this file and call it "peeper":
chmod +x peeper
the peeper file:

#!/bin/sh
# For each HTML file in htmlfileExistsList, print its base name once
# if any HTML file below the current directory mentions that name.
# Note: the for loops split on whitespace, so file names must not contain spaces.

for x in `cat htmlfileExistsList`
do
    look=`basename "$x"`
    for lookee in `find . -type f -name "*html"`
    do
        /home/sandy/bin/ifgrep "$look" "$lookee"
    done
done | sort -u
###############

From the terminal prompt:
find . -type f -name "*html" > htmlfileExistsList
peeper > htmlfileReferencedInAnHtmlFileList

################
Now you have two text files:
htmlfileExistsList and htmlfileReferencedInAnHtmlFileList
Any file whose base name appears in htmlfileExistsList but
not in htmlfileReferencedInAnHtmlFileList is an orphaned HTML file.
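
The final comparison of the two lists can also be automated. Here is a small Python sketch, assuming the first list holds one path per line (as written by find) and the second list holds the names of the referenced files, one per line. The function name orphaned_names is made up for this example. Keep in mind that matching on base names alone treats two files with the same name in different directories as the same file, and that being mentioned in some HTML file is not the same as being reachable from index.html.

```python
import os


def orphaned_names(exists_lines, referenced_lines):
    """Return paths from the 'exists' list whose base name was never referenced."""
    referenced = set()
    for line in referenced_lines:
        line = line.strip()
        if line:
            referenced.add(os.path.basename(line))
    orphans = []
    for line in exists_lines:
        path = line.strip()
        if path and os.path.basename(path) not in referenced:
            orphans.append(path)
    return orphans
```

It could be used by opening htmlfileExistsList and htmlfileReferencedInAnHtmlFileList and passing the two file objects to orphaned_names, then printing each returned path.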



#5
salmobytes

Re: How to find unreferenced web pages, images and files in a web page directory tree? - 12-07-2007, 09:05 AM



On Dec 7, 7:50 am, salmobytes <Sandy.Pittendr... (AT) gmail (DOT) com> wrote:
Quote:
On any linux server (most website are served this way, no....linux/
apache?)

...Apache and/or Tomcat, that is.

code snipped:
Quote:
/home/sandy/bin/ifgrep $look $lookee;

You won't have installed ifgrep at this path on your server.
But it is important to use the full path to the ifgrep
file (wherever it is) because, on most systems,
the shell won't have that file in its execution path.
So use the full path to ifgrep in the peeper script.




#6
David E. Ross

Re: How to find unreferenced web pages, images and files in a web page directory tree? - 12-07-2007, 06:07 PM



On 12/6/2007 3:15 AM, Patricia Mindanao wrote:
Quote:
I have a directory tree on my hard disk that mirrors all the web pages and linked files
on my web host's server.

All web pages and files are statically linked, so dynamically composed links (e.g.
built with JavaScript) do not matter here.

Now I want to find out which of these (many) files are unreferenced orphans,
starting from the main page index.html (or index.shtml).

In other words, a file such as aaa.log is an orphan if it cannot be reached by a chain like

index.html -> subpage8.html -> details2345.html -> aaa.log

Is there a tool that can help me find all these unreferenced web pages and files?
Of course without doing a manual code review :-)

Keep in mind that the static link URLs can be absolute (http://www.mywebpages.com/content/subpage8.html)
or relative (content/subpage8.html).

Pat

Using your example, use a search tool (e.g., Search on Windows, grep on
UNIX) to search all files of the form *.html in the directory tree, first
for the string href="aaa.log" and second for the string
href="http://www.mywebpages.com/content/aaa.log". If neither string turns
up, nothing links to aaa.log. I do this often, but
not often enough to create a search script.
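
For anyone who does want a script for this search, here is a minimal Python sketch. It only does plain substring matching on the exact strings given, so it would miss links written with single quotes or with a different relative path such as content/aaa.log. The function name files_linking_to and the list of suffixes are assumptions made for this example.

```python
import os


def files_linking_to(root, targets, suffixes=(".html", ".htm", ".shtml")):
    """Return the HTML files under root that contain any of the target strings."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                text = fh.read()
            if any(target in text for target in targets):
                hits.append(path)
    return sorted(hits)
```

It would be called with the two strings from the example, href="aaa.log" and href="http://www.mywebpages.com/content/aaa.log", as the targets list. An empty result suggests that no page links to the file in either of those two forms.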

--
David E. Ross
<http://www.rossde.com/>

Natural foods can be harmful: Look at all the
people who die of natural causes.


#7
Klaus Johannes Rusch

Re: How to find unreferenced web pages, images and files in a web page directory tree? - 12-16-2007, 06:22 PM



Patricia Mindanao wrote:
Quote:
Is there a tool that can help me find all these unreferenced web pages and files?
Of course without doing a manual code review :-)

linklint <URL:http://www.linklint.org/> can find orphans and
supports both local-file and HTTP site checking.

--
Klaus Johannes Rusch
KlausRusch (AT) atmedia (DOT) net
http://www.atmedia.net/KlausRusch/

