wwwchecklinks

In this document:

Name
Synopsis
Description
Options
Examples
Windows and Output
Version and Limitations
Availability
Author

Name

wwwchecklinks - check web pages for broken links

Synopsis

wwwchecklinks [ -imagelinks yes|no ] [ -checkalllinks yes|no ] rooturl₁ ... rooturl_n [ -prune url₁ ... url_n ]

Description

wwwchecklinks looks for broken links in web page hierarchies. The roots of the hierarchies to be checked are given as one or more URLs on the command line. The result is displayed in an X window, where you can browse it even while the search is in progress. The result can also be saved to two files: a summary (CheckLinks.Summary) and a complete cross-reference listing for the checked documents (CheckLinks.Report).

Options

[ -imagelinks yes|no ]: yes means that links to inlined images are checked; no means that they are not. The default is yes.
[ -checkalllinks yes|no ]: yes means that all links are checked. no means that only links to documents on the same server as one of the root documents are checked. The default is no.
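[ -prune url₁ ... url_n ]: Normally, all reachable documents below the root documents are checked. With this option you can prune selected subhierarchies: links that point into a pruned subhierarchy are still checked, but the documents inside it are not.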

Examples

Some example usages of wwwchecklinks:

wwwchecklinks http://www.cs.chalmers.se/~hallgren/: The program will check that all links from my home page to other documents on the same server work. It will also follow links that lead to my other documents (i.e., documents with URLs that start with http://www.cs.chalmers.se/~hallgren/) and check them too.
wwwchecklinks -checkalllinks yes http://www.cs.chalmers.se/~hallgren/: As in the previous example, but the program will check all links, not just links to documents on the same server (here, www.cs.chalmers.se). This will probably take some time, since my bookmarks.html file contains over 400 links to various servers around the world.
wwwchecklinks http://www.cs.chalmers.se/~hallgren/ -prune http://www.cs.chalmers.se/~hallgren/naptv: In my www directory, I have two subdirectories, naptvb94 and naptvb95, with course-related information. If I only want to check my personal pages, I prune those away. The program still checks the links from my home page to documents in naptvb94 and naptvb95, but it doesn't descend into the naptv directories and check the documents there.
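
The rules illustrated by these examples can be summarized in a short Python sketch. This is not the program's actual code: the function names are hypothetical, and it assumes that "same server" means the same host (and port) part of the URL and that root and prune URLs are matched as plain string prefixes, as the naptv example suggests.

```python
from urllib.parse import urlsplit


def same_server(url: str, roots: list[str]) -> bool:
    """True if url has the same host part as at least one root URL (assumed rule)."""
    host = urlsplit(url).netloc.lower()
    return any(urlsplit(root).netloc.lower() == host for root in roots)


def should_check(url: str, roots: list[str], check_all_links: bool = False) -> bool:
    """Decide whether a link is tested at all."""
    if urlsplit(url).scheme.lower() != "http":
        return False  # only http links are checked (see Version and Limitations)
    return check_all_links or same_server(url, roots)


def should_descend(url: str, roots: list[str], prune: list[str]) -> bool:
    """Decide whether the document behind url is itself scanned for links."""
    below_root = any(url.startswith(root) for root in roots)
    pruned = any(url.startswith(prefix) for prefix in prune)
    return below_root and not pruned


roots = ["http://www.cs.chalmers.se/~hallgren/"]
prune = ["http://www.cs.chalmers.se/~hallgren/naptv"]

print(should_check("http://www.cs.chalmers.se/Fudgets/", roots))  # True
print(should_check("http://www.chalmers.se/", roots))             # False
print(should_descend("http://www.cs.chalmers.se/~hallgren/naptvb94/", roots, prune))  # False
```

Under these assumptions, a link into naptvb94 is still checked (same server), but the document it points to is not scanned for further links.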

Windows and Output

When you start the program it starts looking for broken links and opens a window which looks something like this:

[Window dump of wwwchecklinks while running]

The top part of the window shows a summary of the result, which is updated only when you press the Update button. You can press Update at any time to see how the search is progressing. You can also press the Save button at any time to save the information collected so far. (The files will be called CheckLinks.Summary and CheckLinks.Report.)

The bottom part of the window consists of three boxes showing the progress of the search. From top to bottom they show: which document is being checked at the moment, server connection status, which link is being checked at the moment.

When the search is complete (and you have pressed the Update button) the window will look something like this:

[Window dump of wwwchecklinks after pressing the Update button]

The summary window shows one line for each URL encountered during the search. The lines have the following general format:

reference_count -> information URL

where reference_count is the number of references to this URL, information is a brief description of the URL or the document it refers to, and URL is the URL in question.

The URLs encountered during the search are displayed in the following order:

Broken Links. In this case, the information field indicates what kind of error occurred when trying to fetch the document. Common errors are:
- BAD 404 Not Found. The web server replied that there is no document with the given URL.
- BAD 301 Moved Permanently. The web server replied that the document has been moved. The most common reason for this is that you forgot to put a / at the end of a URL that refers to a directory. You usually don't notice this error in ordinary Web browsers, since they automatically reissue the request with the correct URL. This slows down access and increases the load on the server, though.
The broken links are ordered by the error number and the number of references to them.
Unchecked Links. The information field simply says Not checked.
Working links to Checked Documents. The information field indicates the MIME type (e.g. text/html) of the document and the number of unchecked, broken and working links in the document. The documents are ordered by the number of broken links.
Working links to Unchecked Documents. The information field contains the MIME type of the document and ? ? ? (indicating that the number of working/broken links is not known).
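
For the 301 case described above, a small Python sketch can tell whether a redirect is only the missing-slash kind. The function name is hypothetical, and the sketch assumes you already have the requested URL and the target URL the server redirected to; it makes no network requests.

```python
def is_missing_slash_redirect(requested: str, location: str) -> bool:
    """True if the redirect target is the requested URL plus a trailing slash."""
    return not requested.endswith("/") and location == requested + "/"


print(is_missing_slash_redirect(
    "http://www.cs.chalmers.se/~hallgren",
    "http://www.cs.chalmers.se/~hallgren/",
))  # True
```

When this returns True, the fix is to add the trailing / to the link in the referring document.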

The list shown in the summary window is saved in CheckLinks.Summary when you press the Save button.
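
If you want to post-process the saved summary, the line format above can be read with a few lines of Python. This is a minimal sketch, not a tool that ships with wwwchecklinks: the names are hypothetical, and it assumes that each line of CheckLinks.Summary has exactly the format shown in the window and that the URL is the last whitespace-separated token on the line (so it contains no spaces).

```python
from typing import NamedTuple


class SummaryLine(NamedTuple):
    reference_count: int
    information: str
    url: str
    category: str  # "broken", "unchecked" or "working"


def parse_summary_line(line: str) -> SummaryLine:
    """Split 'reference_count -> information URL' into its fields."""
    count_text, rest = line.split("->", 1)
    # Assumption: the URL is the last whitespace-separated token.
    information, url = rest.strip().rsplit(None, 1)
    if information.startswith("BAD"):
        category = "broken"
    elif information == "Not checked":
        category = "unchecked"
    else:
        category = "working"  # MIME type followed by link counts or ? ? ?
    return SummaryLine(int(count_text), information, url, category)


entry = parse_summary_line("3 -> text/html 7 1 24 http://www.cs.chalmers.se/~hallgren/")
print(entry.reference_count)  # 3
print(entry.information)      # text/html 7 1 24
print(entry.category)         # working
```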

Clicking on a line in the summary window opens a window containing more detailed information on that link/document. For example, clicking on the line

3 -> text/html 7 1 24 http://www.cs.chalmers.se/~hallgren/

(which by the way says that there are three references to my home page among the documents checked and that my home page contains 7 unchecked links, one broken link and 24 working links) in the above window produces the following information:

Document http://www.cs.chalmers.se/~hallgren/
Type: text/html

References to this document from:
  http://www.cs.chalmers.se/~hallgren/lic-abstract.html
  http://www.cs.chalmers.se/~hallgren/videoband.html
  http://www.cs.chalmers.se/~hallgren/klockan.cgi

BAD links
  http://www.cs.chalmers.se/Fudgets/

Unchecked links
  http://lips.cs.chalmers.se:8888/trams
  gopher://sunic.sunet.se:43/0thomas-h.pp.se
  gopher://cs.chalmers.se:79/0/w hallgren
  http://slip-02.cs.chalmers.se/
  ftp://ftp.cs.chalmers.se/pub/users/hallgren
  http://www.chalmers.se/

Good links
  http://www.cs.chalmers.se/~hallgren/count.cgi
  http://www.cs.chalmers.se/~hallgren/klockan.cgi
  http://www.cs.chalmers.se/~hallgren/wget.cgi
  http://www.cs.chalmers.se/~hallgren/ibtelpre.html

(+ the remaining 20 good links)

This information (for all documents) is saved in CheckLinks.Report when you press the Save button.

Version and Limitations

This page describes version 1.0, released in the mid-1990s. Please note the following limitations:

The program checks links using the http protocol only. ftp, gopher, telnet, mailto and other links are not checked.
If the HTML parser fails, links from that document are silently ignored.
Performance and memory use were limited on the computers available when the program was first released (this is probably not a problem on computers with gigabytes of RAM). The program could handle document hierarchies of up to a couple of hundred documents efficiently, but performance degraded as the number of documents grew: checking a large hierarchy could take a very long time (about an hour for a thousand documents). The program also used an obscene amount of memory, so with too many documents it could run out of memory and die. Workaround: instead of checking a complete document hierarchy in one run, run the program separately on subhierarchies of the complete hierarchy, and prune away uninteresting subhierarchies.
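
The workaround can be planned mechanically. The following Python sketch builds one command line per subhierarchy, plus one for the root with those subhierarchies pruned. The function name and the subhierarchy URLs in the usage lines are illustrative assumptions; only the command-line syntax is taken from the Synopsis.

```python
def plan_runs(root: str, subhierarchies: list[str]) -> list[list[str]]:
    """Return one wwwchecklinks argument list per run."""
    # One separate run for each subhierarchy.
    runs = [["wwwchecklinks", sub] for sub in subhierarchies]
    # One run for the root, with the subhierarchies pruned away.
    top = ["wwwchecklinks", root]
    if subhierarchies:
        top += ["-prune", *subhierarchies]
    runs.append(top)
    return runs


for run in plan_runs(
    "http://www.cs.chalmers.se/~hallgren/",
    [
        "http://www.cs.chalmers.se/~hallgren/naptvb94/",
        "http://www.cs.chalmers.se/~hallgren/naptvb95/",
    ],
):
    print(" ".join(run))
```

Since the output files have fixed names, it is probably wise to save the result of each run in a separate directory; whether a later run overwrites earlier files is not something this page states.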

Availability

The program was installed on our local Sun4 computers in {cs,math,md,mdstud}.chalmers.se.

Author

Send any questions or comments to the author: Thomas Hallgren.

Last modified: Thu Feb 19 16:31:23 CET 2026
