Composing Good HTML

Note: This document is available as both a single document (suitable for printing) and a multi-part document (more appropriate to hypertext). There is also a PostScript version available via FTP, at jupiter.willamette.edu, as /outgoing/jtilton/strict-html.ps. These multiple views are automatically generated with a Perl script called "multiview".

What's New with this document

New (Nov 7, 1995): I finally fixed several bad links to the HTML 2.0 specification. How dare those pesky W3C folks change all those anchors?

New (June 2, 1995): The current edition of this document is available online at http://www.cs.cmu.edu/~tilt/cgh/. This is a new address, reflecting my current status as a graduate student in Computer Science at Carnegie Mellon University. This new URL will be stable until the year 2000, I expect. The old URL will remain available at least until December 1995, but is not guaranteed to be current.

New (April 26, 1995): This document will appear in substantially revised form this fall as a chapter in a new book from Addison-Wesley, Web Weaving (ISBN 0-201-48959-7), by Tilton, Steadman, and Jones; the book will address issues of creating and maintaining Webs as infostructures that are relevant, usable, and maintainable. Watch this space for upcoming details.

Introduction

As the Web continues to explode in its own inimitable fashion, it is becoming more and more important to write HTML that conforms to certain guidelines. Specifically, with the current diversity of clients for the Web (and we can only expect to see more!), it's become important to write HTML that will look good on any client, and not just on the specific client which the author may have access to.

To that end, there are a few solutions. One approach is this one -- documents which point out common errors one might make in the composition of HTML. The other approach is software based -- a "lint"-like program for catching semantic errors in HTML, and perhaps even correcting them (for this, you should examine either HalSoft's HTML Validation Service or WebLint, two services which I have failed to list here for far too long). Several astute observers have noted that "Composing Good HTML" contains some HTML errors -- although there's a reason for that.

The thing to bear in mind is that, if you follow these guidelines, your document may not look as good as it possibly can on a particular browser. However, it also will not look ugly on any browser, which is the risk you take by disregarding these recommendations and tweaking your HTML for, say, Mosaic. Unfortunately, Mosaic may render things differently from Lynx, which may render things differently from TkWWW, etc, etc, etc. These guidelines, in essence, should ensure the best fit across the space of all possible browsers, if you get my drift.

This document does not purport to be a style guide, or a beginner's manual to HTML. Fine documents already exist for these purposes.

(Note: This document is fairly stable, but still open to amendment. Please feel free to comment on that which is missing, wrong, right, or silly. Especially, please point out anywhere that I don't follow my own guidelines -- I'll slink back and fix it, I promise! Thanks to everyone who's already done so!)

Contents of this Document

What's New with this document
Introduction
Contents of This Document (Douglas R. Hofstadter, Please...)
Good Practices
- Signing Documents, and Time-Stamps
- (anything else?)
Common Errors
Things to Avoid
Deprecated and Obsolete Elements
For More Information
Acknowledgments

Good Practices

Things contained in this section are good practices for the generation of any HTML document. Specifically, this would include anything which should routinely be done in the creation of documents for the benefit of both reader and author.

Signing Documents, and Time-Stamps

It is a good idea to sign and date all documents served on the Web, so that people viewing the documents can form some impression of the authority of the document (i.e. how recent it is, and how reliable the information provider is). For example, this document has been signed.

Also, when dating a document, try to avoid ambiguous formats. For example, both the month/day/year and day/month/year format are used on the web -- so is "4/2/94" April 2 or February 4? A solution to this is to use the name of the month (or an abbreviation).
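
As a rough sketch of the dating advice above, a signature at the foot of a document might look like the following. The author name and address are placeholders, and using ADDRESS for the signature is one common convention rather than a requirement.

```html
<HR>
<!-- The month is spelled out, so the date cannot be misread -->
<P>Last modified: April 2, 1994
<ADDRESS>A. N. Author, author@some.site.org</ADDRESS>
```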

Finally, the best way to sign a document is to include a LINK element of type "made" in your HEAD element. For example:

<HEAD>
<TITLE>This is my Title</TITLE>
<LINK REV="made" HREF="mailto:author@some.site.org">
</HEAD>

For an example, look at the HTML source of this document. Notice the LINK line near the beginning, as well as the signature at the bottom.

Why the LINK element? The LINK element is equivalent to the A element; that is, it provides a link to some other document. However, since it is part of the HEAD information (which is information about the document, rather than part of the document itself), this is a link from the entire document to another object. (Anchors, on the other hand, are links from some small subset of the document, like a word or a phrase, to another document). This link, like most other HEAD information, is typically not displayed by a browser, or followable by a reader.

The fact that it is not displayed does not make it useless, however. Many browsers, such as Lynx, supply a "reply to author" function. The information about who the author is comes from using the LINK element. Other applications which can make use of the information include Web spiders and other maintenance tools, which can benefit from having authority information in machine readable format.

The format of the LINK element is the same as that of the A element. Notice the use of the REV attribute, which describes this relationship as a REVerse relationship of the type made. This means that this document was made by the object at the other end of the anchor (i.e. the person specified by the "mailto:" URL).

Common Errors

This section details common errors in HTML composition that may lead to documents which are not fully device-independent. The behavior of these errors is undefined, so certain browsers may render them as intended, but not all browsers are guaranteed to do so. Therefore, these mistakes should be avoided, even if your browser of choice renders your documents correctly.

Paragraph Element Errors
Character and Entity Reference Errors
URL Errors
Missing Quotes in Start Tags
Missed End Tags

Paragraph Element Errors

The use of the paragraph element (P) can be confusing. When HTML was first introduced, <P> served as a paragraph separator, not as an end-of-paragraph marker; a confusion which originally prompted this document. However, more recent versions of the specification (HTML 2.0 and later) have changed this behaviour.

The current recommended use of the P element is to place it at the beginning of paragraphs; for example:

<P> In this paragraph, our hero discovers that he really likes
baloney sandwiches. He also listens to some disco, and has a lovely
beverage. Ah, if only all paragraphs were this exciting!

This is in contrast to previous usage, where the <P> was usually placed at the end of the paragraph.

Still, in certain contexts, use of <P> should be avoided, such as directly before any other element which already implies a paragraph break. To wit, the <P> element should not be placed before the headings, HR, ADDRESS, BLOCKQUOTE, or PRE.

It should also not be placed immediately before a list element of any stripe. That is, a <P> should not be used to mark the end-of-text for <LI>, <DT> or <DD>. These elements already imply paragraph breaks.
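
To illustrate the rule above, here is a small sketch (the heading and list text are invented for the example). The first fragment uses <P> as a spacer around elements that already imply a break; the second leaves it out.

```html
<!-- Avoid: P used as a spacer before a heading and after list items -->
<P>
<H2>Sandwiches</H2>
<UL>
<LI>Baloney<P>
<LI>Cheese<P>
</UL>

<!-- Prefer: P only at the start of real paragraphs -->
<H2>Sandwiches</H2>
<P>Our hero likes two kinds:
<UL>
<LI>Baloney
<LI>Cheese
</UL>
```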

Caveats

Some clarifications on the above might be in order. One is the difficulty browsers have in rendering appropriate white space. While it is true that all of the elements mentioned above imply a paragraph break, this only occasionally means that they also imply white space between sections -- this depends on the browser. So, while you might feel inclined to add a <P> in order to fix white space problems, please think twice and avoid it if you can.

Also, when using the glossary list (DL), please try to avoid using multiple DDs (definitions of terms) in order to provide multiple entries for a term (DT). Instead, use a <P> marker between paragraphs in a definition.
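
A minimal sketch of the glossary advice, using an invented entry: one DD per term, with a <P> marking the second paragraph inside it.

```html
<DL>
<DT>Euphonium
<!-- One DD per term; the P marks a second paragraph within it -->
<DD>A brass instrument with a mellow tone.
<P>It is usually played with valves rather than a slide.
</DL>
```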

All clear now?

Character and Entity Reference Errors

Simply put, a character reference and an entity reference are ways to represent information that might otherwise be interpreted as a markup tag. For instance, in order to represent <P> in this text, I had to use &lt;P&gt; in my raw HTML. There are currently four entities for this purpose in HTML, as well as several entities which allow encoding of the ISO Latin-1 Character Set.

The most common error in the use of references is to leave off the trailing semicolon. Also, no additional spaces are needed before or after the entity/character reference.
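
The following fragment sketches the semicolon rule; the sentences themselves are invented for illustration. It also shows the same Latin-1 character written once as an entity reference and once as a character reference.

```html
<!-- Wrong: the trailing semicolons are missing -->
To start a paragraph, type &ltP&gt

<!-- Right: every reference ends with a semicolon -->
To start a paragraph, type &lt;P&gt;

<!-- Right: no extra spaces are needed around a reference -->
Caf&eacute; &amp; bar (entity reference)
Caf&#233; &amp; bar (character reference for the same letter)
```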

URL Errors

Another misunderstood aspect of HTML is in the composition of URLs.

Directory Reference Errors

One grey area involves references to directories. It is possible to request an index of a directory from an HTTP server. The typical response from the server is to either return a pregenerated index document (which is often the document "index.html" in the referenced directory), or to construct an HTML document on the fly which contains a listing of all files in the directory. However, when making such a directory reference, it is important to make sure to have a trailing slash on the URL. That is, if you were to request the index of the directory which this document resides in, you would want to refer to it as http://www.cs.cmu.edu/~tilt/, not as http://www.cs.cmu.edu/~tilt.

Some servers are able to catch these errors, and provide redirection to the proper URL, but it's best to get the URL right in the first place -- notably because not all browsers support transparent redirection.
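
Written out as anchors, the two forms of the directory reference look like this (the link text is borrowed from elsewhere in this document):

```html
<!-- Directory reference with the trailing slash: preferred -->
<A HREF="http://www.cs.cmu.edu/~tilt/">Eric's Hyplan</A>

<!-- Same reference without the slash: relies on the server to redirect -->
<A HREF="http://www.cs.cmu.edu/~tilt">Eric's Hyplan</A>
```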

Not Using Fully Qualified Domain Names

Problems can arise when the hostnames in URLs aren't fully qualified. In local networks, you can usually refer to your own machines simply by their names -- for instance, here we refer to our local WWW server as "www". However, the server's FQDN (fully qualified domain name) is "www.cs.cmu.edu". The FQDN provides enough information that any host, anywhere on the Internet, can find this particular machine. (It's like trying to find all the Vermeers in New York :).

What happens is that an HTML author might construct a link that looks like this:

<A HREF="http://www/~tilt/metanoia/">Metanoia -- A Change In Spirit</A>

which produces a link to Metanoia -- A Change In Spirit that will only work for people in the local network that the machine is on. A correct link would look like this, instead:

<A HREF="http://www.cs.cmu.edu/~tilt/metanoia/">Metanoia</A>

which would allow all of you who are interested in Metanoia to actually follow the link.

This leads almost directly into:

Improper Use of Relative URLs

Finally, a brief section on relative URLs. It is possible to construct a "relative" URL, which gives you the following advantages:

It's shorter.
It makes a collection of documents which are linked together more portable (easier to move from directory to directory, or server to server).

However, relative URLs can also break things.

A relative URL is a URL which doesn't contain all the necessary parts of a "full" URL (scheme, host, path information). There's a large number of things which might fit this description! The browser will try to assume the parts that have been "left out" by using the information from the URL of the document which contains the link. However, not all browsers will make these assumptions in the same way. Here's a short list of what's "safe" and "unsafe" (based on experience, and not on a specification anywhere -- unfortunately).

Safe: Same directory relative URLs: A reference to a document in the same logical directory (such as <A HREF="strict-html-gp.html">Good Practices</A>) is safe. This kind of reference, roughly speaking, contains no "/"'s.
Safe: Same server relative URLs: A reference to a document in the same server (such as <A HREF="/~tilt/">Eric's Hyplan</A>) is also safe. This kind of reference, roughly speaking, will begin with a "/". (It will also be semi-absolute, in that it starts at the top of that server's directory structure...)
Unclear: Most other kinds of relative URLs: References such as <A HREF="../~tilt/euphonium.html"></A> can be dangerous -- sometimes browsers will interpret that as meaning "go up one directory level, find the directory '~tilt', and then find 'euphonium.html' in it." And sometimes they won't.
Currently, I don't understand this problem well enough to speak about it. I will try and get a canonical answer when next I have the energy to update this document.
Unsafe: "file://localhost/...": It's also possible to have a reference to "file://localhost/some/file/pathname". Such a reference points at the named file on the local host of whoever is browsing the document, which is why a reference to <A HREF="file://localhost/etc/motd"></A> will display the message of the day on your machine, not the message of the day on my machine. Unless you know what you are doing, these references will really mess up your documents.
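
As an illustration of the two "safe" forms, suppose (purely as an assumption for this sketch) that the linking document lives at http://www.cs.cmu.edu/~tilt/cgh/index.html. The comments show how a browser would be expected to fill in the missing parts.

```html
<!-- Same directory: expected to resolve to
     http://www.cs.cmu.edu/~tilt/cgh/strict-html-gp.html -->
<A HREF="strict-html-gp.html">Good Practices</A>

<!-- Same server: expected to resolve to
     http://www.cs.cmu.edu/~tilt/ -->
<A HREF="/~tilt/">Eric's Hyplan</A>

<!-- Fully qualified: nothing is left for the browser to assume -->
<A HREF="http://www.cs.cmu.edu/~tilt/euphonium.html">The euphonium</A>
```

When in doubt, the fully qualified form trades portability for predictability.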

(This sub-section isn't written very well, I fear. If anyone has any better copy, I'll gladly put it here instead. -et/April 7, 1994)

Missing Quotes in Start Tags

One common error that I used to make all the time (I use Marc Andreessen's html-mode.el for Emacs these days -- I had to learn Emacs, but now it's so much easier to write HTML!) was to leave off a quote in my start tags. For example, this reference to the euphonium, king of instruments, should look like:

<A HREF="http://www.cs.cmu.edu/~tilt/euphonium.html">

but I would often use

<A HREF="http://www.cs.cmu.edu/~tilt/euphonium.html>

instead. I suppose by the end of that huge URL, I'd forgotten it was supposed to be quoted. The behaviour of browsers upon encountering this varies -- some display a proper link, but you can't follow it, while others actually eat up huge portions of the following text, thinking it to be part of the URL.

Missed End Tags

Many of the HTML elements contain information within them. For example, <EM>emphasized text</EM> would be rendered as emphasized text. There is a start tag (<EM>), some content (which may include text, and in some cases, other nested elements), and an end tag (</EM>, indicated by the </). A common mistake is to miss the / in the end tag. All elements (except empty elements, see next paragraph) must be terminated by an end tag -- otherwise, undefined behavior may occur.

Some HTML elements may be empty, such as <HR> (the HTML 2.0 specification provides more information about element content). If this is the case, there is no need for an end tag.
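
A short sketch of the end-tag mistake and its fix, using EM as the example element (the sentences are invented):

```html
<!-- Wrong: the slash is missing, so this is a second start tag -->
This is <EM>emphasized text<EM> and this should not be.

<!-- Right: the end tag closes the element -->
This is <EM>emphasized text</EM> and this is not.

<!-- Empty element: no content, so no end tag -->
<HR>
```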

Things to Avoid

This section concentrates on mistakes in HTML authoring that are more problems of aesthetics than problems of device-independence.

Mixing HEAD and BODY Elements
Using White Space Around Element Tags
Heading Usage
Meaningless Link Text
Physical vs. Logical Character Emphasis

Mixing HEAD and BODY Elements

HTML documents should not mix those elements which belong in the HEAD of a document with those which belong in the BODY. This is not an absolute requirement, but it does make a certain amount of common sense for readability of HTML code, and for conformance with possible future browsers which may not support the mixing of these elements. Essentially, it lacks serious style points >=).
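
One way to keep the two apart is a skeleton like the one below, which reuses the title and address from the signing example earlier in this document. It is only a sketch; the body text is a placeholder.

```html
<HTML>
<HEAD>
<!-- Information about the document -->
<TITLE>This is my Title</TITLE>
<LINK REV="made" HREF="mailto:author@some.site.org">
</HEAD>
<BODY>
<!-- The document itself -->
<H1>This is my Title</H1>
<P>The text of the document starts here.
</BODY>
</HTML>
```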

Using White Space Around Element Tags

In general, the use of white space around element tags should be avoided. If white space immediately follows a start tag, for example, the style changes implied by that element may be applied to the initial space as well. For instance, <A HREF="http://www.cs.cmu.edu/~tilt/"> CZeCh THIZ 0uT </A> may be rendered with the link highlighting extending over the spaces on either side of "CZeCh THIZ 0uT". On some browsers, there may be white space around the anchor, which adds unwanted unsightliness to the rendering, and may lessen the impact of the document. (This comment really applies to white space immediately following start tags, and immediately preceding end tags).
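
Side by side, the two spellings of an anchor look like this (the link text is borrowed from elsewhere in this document):

```html
<!-- Avoid: spaces just inside the tags become part of the link -->
See <A HREF="http://www.cs.cmu.edu/~tilt/"> Eric's Hyplan </A>.

<!-- Prefer: the tags hug the link text -->
See <A HREF="http://www.cs.cmu.edu/~tilt/">Eric's Hyplan</A>.
```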

Heading Usage

The HTML specification points out that a heading should not be more than one level below the heading which preceded it. That is, <H3> should not follow <H1>, etc.

Also, it is pointed out that "a heading element implies all the font changes, paragraph breaks before and after, and white space (for example) necessary to render the heading". Extra highlighting elements are discouraged, therefore.
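
Both points can be sketched with headings taken from this document's own outline:

```html
<!-- Avoid: H3 directly under H1, plus redundant highlighting -->
<H1>Composing Good HTML</H1>
<H3><B>Paragraph Element Errors</B></H3>

<!-- Prefer: step down one level at a time, no extra highlighting -->
<H1>Composing Good HTML</H1>
<H2>Common Errors</H2>
<H3>Paragraph Element Errors</H3>
```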

Meaningless Link Text

When creating documents, make sure that your links are meaningful -- that is, that they avoid online-specific references, and that they don't detract from readability. The text of your links should flow well in the context of the rest of your text (especially avoid the click here syndrome!), and your text should also be able to stand alone as a printable document.

In other words, avoid using sentences like, "You can find out more information about cows by clicking here". (This is also bad because it refers to "clicking", which assumes that everyone is using a mouse with their browser!) A much better alternative is "More information about cows is available."
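
In markup, the two versions of the cow sentence might be written as follows. The target "cows.html" is a made-up file name used only for this sketch.

```html
<!-- Avoid: the link text says nothing when read on its own -->
You can find out more information about cows by clicking
<A HREF="cows.html">here</A>.

<!-- Prefer: the link text describes its target and reads well in print -->
<A HREF="cows.html">More information about cows</A> is available.
```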

Physical vs. Logical Character Emphasis

Since HTML (and also SGML) is designed to be a device independent language for describing the content of documents, most of the elements within it aren't intended to give direct control to the author over how the final page layout will look. The major exceptions to this are in the character highlighting elements.

There are two types of character highlighting elements -- physical and logical. The physical styles involve things like "italic font", "boldface", etc; while the logical styles are things like "emphasis", "citation", "strong", etc. It is strongly recommended that you employ the logical styles rather than the physical styles in your documents. Using <I> to render text in italics will only be effective on those browsers which are capable of displaying italics -- which not all browsers are guaranteed to do. It is far better to encode semantic content -- to describe things in terms of logical styles -- and then allow the browser to display that semantic structure as best it can, given its display capabilities.

So, instead of

<I>italics</I>: you might use <EM>emphasized</EM>, or a <CITE>citation</CITE>, and instead of
<B>bold</B>: you might use <STRONG>strong</STRONG>.

This also leaves the possibilities open in the future for more sophisticated uses of these semantic renderings, which have much more inherent meaning than font styles like bold or italic.

(Unfortunately, the jury is still out to lunch on this one. One argument against logical character styles is that it turns out to be a bottomless pit, attempting to define logical styles for every possibility. Physical styles, combined with the context of the text in which they are placed, seem to provide a much richer set without a huge number of tags. Oh, well. Use logical styles when you can, though.)

Deprecated and Obsolete Elements

This section lists elements of HTML whose use should be avoided, whether because the element is now obsolete, or because the element is being deprecated (i.e. still supported, but its use is not recommended and the element may eventually become obsolete).

Obsolete Elements

Obsolete Elements

Several elements of HTML are obsolete, including PLAINTEXT, XMP, LISTING, HPx, and COMMENT. The first three should be replaced with PRE; HP (highlighted phrase) should be replaced with the character highlighting elements; and COMMENT should be replaced with <!-- -->, the SGML comment characters.
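
A rough before-and-after for two of these replacements (the sample text is invented):

```html
<!-- Obsolete: XMP (likewise PLAINTEXT and LISTING) -->
<XMP>
  first line
    second line, indented
</XMP>

<!-- Replacement: PRE keeps the spacing and line breaks -->
<PRE>
  first line
    second line, indented
</PRE>

<!-- A note to myself, written as an SGML comment instead of COMMENT -->
```

One difference to keep in mind: markup is still recognized inside PRE, so a literal < or & there should be written as an entity reference.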

For More Information

There already exist documents on the Web which address this same topic, and perhaps in more detail. For definitive reference information you may wish to check the HTML specification from the World Wide Web Consortium (W3C). For a more detailed discussion of HTML composition style, you should also check the Style Guide (especially the section on device-independent formatting), which is also from the W3C.

If you're looking for a good document for learning the basics of HTML, you will want to check out the Beginner's Guide to HTML, from NCSA.

Acknowledgments

I'd like to thank all of you who have visited this document and commented on it, suggesting fixes, clarification, and even new sections. You know who you are (even if I managed to lose your addresses in the flood of information)! It is, in some senses, still a work in progress and is always amenable to suggestion, modification, and repair. I appreciate your help!

Copyright © 1994, 1995 by Eric Tilton. Permission is granted for individual use and reproduction provided that this document remains intact, with this copyright message clearly visible. Commercial use and reproduction rights are held by Addison-Wesley, and this document may not be resold or redistributed for compensation of any kind without prior written permission from Addison-Wesley -- contact me for details. Parts of this document appear in a revised form in the upcoming book, Web Weaving (ISBN 0-201-48959-7), by Eric Tilton, Carl Steadman, and Tyler Jones, to be published by Addison-Wesley. Look for it in a bookstore near you!

The upshot is, this document has always been meant as a public service, and will remain a public service. I hope you've found it to be useful; I've had fun providing it for your use.

Last modified: Nov 7, 1995

James "Eric" Tilton, HTML Guru Wannabee, jtilton@willamette.edu