New (Nov 7, 1995): I finally fixed several bad links to the HTML 2.0 specification. How dare those pesky W3C folks change all those anchors?
New (June 2, 1995): The current edition of this document is available online at http://www.cs.cmu.edu/~tilt/cgh/. This is a new address, reflecting my current status as a graduate student in Computer Science at Carnegie Mellon University. This new URL will be stable until the year 2000, I expect. The old URL will remain available at least until December 1995, but is not guaranteed to be current.
New (April 26, 1995): This document will appear in substantially revised form this fall as a chapter in a new book from Addison-Wesley, Web Weaving (ISBN 0-201-48959-7), by Tilton, Steadman, and Jones; the book will address issues of creating and maintaining Webs as infostructures that are relevant, usable, and maintainable. Watch this space for upcoming details.
As the Web continues to explode in its own inimitable fashion, it is becoming more and more important to write HTML that conforms to certain guidelines. Specifically, with the current diversity of clients for the Web (and we can only expect to see more!), it's become important to write HTML that will look good on any client, and not just on the specific client which the author may have access to.
To that end, there are two approaches. One is this one -- documents which point out common errors one might make in the composition of HTML. The other is software based -- a "lint"-like program for catching semantic errors in HTML, and perhaps even correcting them (for this, you should examine either HalSoft's HTML Validation Service or WebLint, two services which I have failed to list here for far too long). Several astute observers have noted that "Composing Good HTML" contains some HTML errors -- although there's a reason for that.
The thing to bear in mind is that, if you follow these guidelines, your document may not look as good as it possibly could on a particular browser. However, it also will not look ugly on any browser, which is the risk you take by disregarding these recommendations and tweaking your HTML for, say, Mosaic. Unfortunately, Mosaic may render things differently from Lynx, which may render things differently from TkWWW, etc, etc, etc. These guidelines, in essence, should ensure the best fit across the space of all possible browsers, if you get my drift.
This document does not purport to be a style guide, or a beginner's manual to HTML. Fine documents already exist for these purposes.
(Note: This document is fairly stable, but still open to amendment. Please feel free to comment on that which is missing, wrong, right, or silly. Especially, please point out anywhere that I don't follow my own guidelines -- I'll slink back and fix it, I promise! Thanks to everyone who's already done so!)
It is a good idea to sign and date all documents served on the Web, so that people viewing the documents can form some impression of the authority of the document (i.e. how recent it is, and how reliable the information provider is). For example, this document has been signed.
Also, when dating a document, try to avoid ambiguous formats. For example, both the month/day/year and day/month/year format are used on the web -- so is "4/2/94" April 2 or February 4? A solution to this is to use the name of the month (or an abbreviation).
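As a rough sketch of a signature with an unambiguous date, something like the following would do. The author name, address, and date here are placeholders I made up for illustration, not a required format:

```html
<!-- A signature block at the bottom of the document (placeholder name and date) -->
<ADDRESS>
Jane Author (author@some.site.org)<BR>
Last modified: April 2, 1994
</ADDRESS>
```

Spelling out the month means nobody has to guess whether the document dates from April or February.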
Finally, the best way to sign a document is to include a LINK element of type "made" in your HEAD element. For example:
<HEAD>
<TITLE>This is my Title</TITLE>
<LINK REV="made" HREF="mailto:author@some.site.org">
</HEAD>
For an example, look at the HTML source of this document. Notice the LINK line near the beginning, as well as the signature at the bottom.
Why the LINK element? The LINK element is equivalent to the A element; that is, it provides a link to some other document. However, since it is part of the HEAD information (which is information about the document, rather than part of the document itself), this is a link from the entire document to another object. (Anchors, on the other hand, are links from some small subset of the document, like a word or a phrase, to another document). This link, like most other HEAD information, is typically not displayed by a browser, or followable by a reader.
The fact that it is not displayed does not make it useless, however. Many browsers, such as Lynx, supply a "reply to author" function. The information about who the author is comes from using the LINK element. Other applications which can make use of the information include Web spiders and other maintenance tools, which can benefit from having authority information in machine readable format.
The format of the LINK element is the same as that of the A element. Notice the use of the REV attribute, which describes this relationship as a REVerse relationship of the type made. This means that this document was made by the object at the other end of the anchor (i.e. the person specified by the "mailto:" URL).
This section details common errors in HTML composition that may lead to documents which are not fully device-independent. The behaviors of these errors are undefined, so certain browsers may render them as intended, but not all browsers are guaranteed to do so. Therefore, these mistakes should be avoided, even if your browser of choice renders your documents correctly.
The use of the paragraph element (P) can be confusing. When HTML was first introduced, <P> served as a paragraph separator, not as an end-of-paragraph marker -- a confusion which originally prompted this document. However, more recent versions of the HTML specification (2.0 and later) have changed this behaviour.
The currently recommended usage is to place the <P> at the beginning of each paragraph; for example:
<P> In this paragraph, our hero discovers that he really likes baloney sandwiches. He also listens to some disco, and has a lovely beverage. Ah, if only all paragraphs were this exciting!
This is in contrast to previous usage, where the <P> was usually placed at the end of the paragraph.
Still, in certain contexts, use of <P> should be avoided, such as directly before any other element which already implies a paragraph break. To wit, the <P> element should not be placed before the headings, HR, ADDRESS, BLOCKQUOTE, or PRE.
It should also not be placed immediately before a list element of any stripe. That is, a <P> should not be used to mark the end-of-text for <LI>, <DT> or <DD>. These elements already imply paragraph breaks.
Some clarifications on the above might be in order. One concerns the difficulty browsers have in rendering appropriate white space. While it is true that all of the elements mentioned above imply a paragraph break, this only occasionally means that they also imply white space between sections -- this depends on the browser. So, while you might feel inclined to add a <P> in order to fix white space problems, please think twice and avoid it if you can.
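Putting the paragraph rules above together, a minimal sketch might look like this. The heading and list text are invented for illustration:

```html
<!-- The heading already implies a paragraph break: no <P> before it -->
<H2>Lunch</H2>
<P>In this paragraph, our hero discovers that he really likes
baloney sandwiches.
<P>He also listens to some disco, and has a lovely beverage.
<!-- No <P> before the list, and none after the text of each item -->
<UL>
<LI>Baloney sandwiches
<LI>Disco
</UL>
<!-- No <P> before the rule, either -->
<HR>
```

Each <P> opens a paragraph of text; everything else in the sketch supplies its own break.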
Also, when using the glossary list (DL), please try to avoid using multiple DDs (definitions of terms) in order to provide multiple entries for a term (DT). Instead, use a <P> marker between paragraphs in a definition.
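A sketch of what this might look like, with made-up definition text:

```html
<DL>
<DT>Euphonium
<!-- One DD per term; further paragraphs go inside it -->
<DD>A brass instrument, and the king of instruments.
<P>This second paragraph belongs to the same definition,
so it is introduced with a paragraph marker rather than another DD.
</DL>
```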
All clear now?
Simply put, a character reference and an entity reference are ways to represent information that might otherwise be interpreted as a markup tag. For instance, in order to represent <P> in this text, I had to use &lt;P&gt; in my raw HTML. There are currently four entities for this purpose in HTML (&lt;, &gt;, &amp;, and &quot;), as well as several entities which allow encoding of the ISO Latin-1 Character Set.
The most common error in the use of references is to leave off the trailing semicolon. Also, no additional spaces are needed before or after the entity/character reference.
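To make the semicolon point concrete, here is a small sketch; the sentence itself is just filler I invented:

```html
<!-- Correct: every reference ends with a semicolon, with no padding spaces -->
AT&amp;T claims that 1 &lt; 2, and that caf&eacute; (or caf&#233;)
has an accent.

<!-- Incorrect: the trailing semicolons have been left off -->
AT&amp T claims that 1 &lt 2.
```

The first form uses two entity references and one character reference (&#233; is the ISO Latin-1 code for the accented e). How a browser treats the second form is anyone's guess.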
One grey area involves references to directories. It is possible to request an index of a directory from an HTTP server. The typical response from the server is to either return a pregenerated index document (which is often the document "index.html" in the referenced directory), or to construct an HTML document on the fly which contains a listing of all files in the directory. However, when making such a directory reference, it is important to make sure to have a trailing slash on the URL. That is, if you were to request the index of the directory which this document resides in, you would want to refer to it as http://www.cs.cmu.edu/~tilt/, not as http://www.cs.cmu.edu/~tilt.
Some servers are able to catch these errors, and provide redirection to the proper URL, but it's best to get the URL right in the first place -- notably because not all browsers support transparent redirection.
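In markup, the difference is a single character. The link text below is only an illustration:

```html
<!-- Preferred: the directory reference ends with a slash -->
<A HREF="http://www.cs.cmu.edu/~tilt/">My home directory</A>

<!-- Risky: this only works if the server redirects, and the browser follows -->
<A HREF="http://www.cs.cmu.edu/~tilt">My home directory</A>
```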
Problems can arise when the hostnames in URLs aren't fully qualified. In local networks, you can usually refer to your own machines simply by their names -- for instance, here at Willamette we refer to our local WWW server as "www". However, the server's FQDN (fully qualified domain name) is "www.cs.cmu.edu". The FQDN provides enough information that any host, anywhere on the Internet, can find this particular machine. (It's like trying to find all the Vermeers in New York :).
What happens is that an HTML author might construct a link that looks like this:
<A HREF="http://www/~tilt/metanoia/">Metanoia -- A
Change In Spirit</A>
which produces a link to Metanoia -- A Change In Spirit that will only work for people in the local network that that machine is on. A correct link would look like this, instead:
<A HREF="http://www.cs.cmu.edu/~tilt/metanoia/">Metanoia</A>
which would allow all of you who are interested in Metanoia to actually follow the link.
This leads almost directly into:
Finally, a brief section on relative URLs. It is possible to construct a "relative" URL, which gives you the following advantages:
However, relative URLs can also break things.
A relative URL is a URL which doesn't contain all the necessary parts of a "full" URL (scheme, host, path information). There's a large number of things which might fit this description! The browser will try to assume the parts that have been "left out" by using the information from the URL of the document which contains the link. However, not all browsers will make these assumptions in the same way. Here's a short list of what's "safe" and "unsafe" (based on experience, and not on a specification anywhere -- unfortunately).
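Just to show what "leaving parts out" looks like (and not as a ruling on which forms are safe), here is a sketch. It assumes the containing document lives at http://www.cs.cmu.edu/~tilt/cgh/index.html; the file name notes.html is hypothetical:

```html
<!-- Full URL: nothing is left for the browser to fill in -->
<A HREF="http://www.cs.cmu.edu/~tilt/cgh/notes.html">Notes</A>

<!-- Relative URL: the scheme, host, and directory are taken
     from the URL of the document containing the link -->
<A HREF="notes.html">Notes</A>
```

If the browser fills in the missing parts as expected, both anchors lead to the same place.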
Currently, I don't understand this problem well enough to speak about it. I will try and get a canonical answer when next I have the energy to update this document.
(This sub-section isn't written very well, I fear. If anyone has any better copy, I'll gladly put it here instead. -et/April 7, 1994)
One common error that I used to make all the time (I use Marc Andreessen's html-mode.el for Emacs these days -- I had to learn Emacs, but now it's so much easier to write HTML!) was to leave off a quote in my start tags. For example, this reference to the euphonium, king of instruments should look like:
<A
HREF="http://www.cs.cmu.edu/~tilt/euphonium.html">
but I would often use
<A
HREF="http://www.cs.cmu.edu/~tilt/euphonium.html>
instead. I suppose by the end of that huge URL, I'd forgotten it was supposed to be quoted. The behaviour of browsers upon encountering this varies -- some display a proper link, but you can't follow it, while others actually eat up huge portions of the following text, thinking it to be part of the URL.
Many of the HTML elements contain information within them. For example,
<em>emphasized text</em>
would be rendered as emphasized text. There is a start tag (<EM>), some content (which may include text, and in some cases, other nested elements), and an end tag (</EM>, indicated by the </). A common mistake is to miss the / in the end tag. All elements (except empty elements, see next paragraph) must be terminated by an end tag -- otherwise, undefined behavior may occur.
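Some HTML elements are empty, such as <HR> and <BR> (the HTML 2.0 specification provides more information about element content). An empty element has no content, so there is no need for an end tag. Note that <P> is not empty in HTML 2.0: it contains the text of the paragraph, but its end tag may be omitted.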
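A short sketch of the end tag mistake, using filler text:

```html
<!-- Correct: the end tag carries the slash -->
This is <EM>emphasized text</EM> in a sentence.

<!-- Incorrect: the slash is missing, so this is a second start tag
     and the emphasis never officially ends -->
This is <EM>emphasized text<EM> in a sentence.

<!-- An empty element: there is nothing to enclose, so no end tag -->
<HR>
```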
This section concentrates on mistakes in HTML authoring that are more problems of aesthetics than problems of device-independence.
HTML documents should not mix those elements which belong in the HEAD of a document with those which belong in the BODY. This is not an absolute requirement, but it does make a certain amount of common sense for readability of HTML code, and for conformance with possible future browsers which may not support the mixing of these elements. Essentially, it lacks serious style points >=).
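For reference, a minimal skeleton that keeps the two apart might look like the following. The title and body text are placeholders:

```html
<HTML>
<HEAD>
<!-- Information about the document -->
<TITLE>This is my Title</TITLE>
<LINK REV="made" HREF="mailto:author@some.site.org">
</HEAD>
<BODY>
<!-- The document itself -->
<H1>This is my Title</H1>
<P>Everything the reader actually sees goes here.
</BODY>
</HTML>
```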
In general, the use of white space around element tags should be avoided. If white space immediately follows a start tag, for example, the style changes implied by that element may be applied to the initial space, as well. For instance, <A HREF="http://www.cs.cmu.edu/~tilt/"> CZeCh THIZ 0uT </A> would be rendered as CZeCh THIZ 0uT . On some browsers, there may be white space around the anchor, which adds unwanted unsightliness to the rendering, and may lessen the impact of the document. (This comment really applies to white space immediately following start tags, and immediately preceding end tags).
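The tidier version of that anchor simply moves the spaces outside the tags:

```html
<!-- No white space just inside the start tag or the end tag -->
<A HREF="http://www.cs.cmu.edu/~tilt/">CZeCh THIZ 0uT</A>
```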
The HTML specification points out that a heading should not be more than one level below the heading which preceded it. That is, <H3> should not follow <H1>, etc.
Also, it is pointed out that "a heading element implies all the font changes, paragraph breaks before and after, and white space (for example) necessary to render the heading". Extra highlighting elements are discouraged, therefore.
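As a sketch of both points, with heading text chosen only for illustration:

```html
<!-- Recommended: each heading is at most one level below the last,
     and the heading element does its own highlighting -->
<H1>Composing Good HTML</H1>
<H2>Common Errors</H2>
<H3>Paragraphs</H3>

<!-- Discouraged: a skipped level, plus extra highlighting inside -->
<H1>Composing Good HTML</H1>
<H3><B>Paragraphs</B></H3>
```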
When creating documents, make sure that your links are meaningful -- that is, that they avoid online-specific references, and that they don't detract from readability. The text of your links should flow well in the context of the rest of your text (especially avoid the click here syndrome!), and your text should also be able to stand alone as a printable document.
In other words, avoid using sentences like, "You can find out more information about cows by clicking here". (This is also bad because it refers to "clicking", which assumes that everyone is using a mouse with their browser!) A much better alternative is "More information about cows is available."
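In markup, the two cow sentences might look like this. The file name cows.html is hypothetical:

```html
<!-- Avoid: the link text means nothing on its own -->
You can find out more information about cows by clicking
<A HREF="cows.html">here</A>.

<!-- Better: the sentence reads just as well on paper -->
<A HREF="cows.html">More information about cows</A> is available.
```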
Since HTML (and also SGML) is designed to be a device independent language for describing the content of documents, most of the elements within it aren't intended to give direct control to the author over how the final page layout will look. The major exceptions to this are in the character highlighting elements.
There are two types of character highlighting elements -- physical and logical. The physical styles involve things like "italic font", "boldface", etc; while the logical styles are things like "emphasis", "citation", "strong", etc. It is strongly recommended that you employ the logical styles rather than the physical styles in your documents. Using <I></I> to render text in italics will only be effective on those browsers which are capable of displaying italics -- which not all browsers are guaranteed to be. It is far better to encode semantic content -- to describe things in terms of logical styles -- and then allow the browser to display that semantic structure as best it can, given its display capabilities.
So, instead of physical elements such as <I> and <B>, use logical elements such as <EM> and <STRONG>.
This also leaves the possibilities open in the future for more sophisticated uses of these semantic renderings, which have much more inherent meaning than font styles like bold or italic.
(Unfortunately, the jury is still out to lunch on this one. One argument against logical character styles is that it turns out to be a bottomless pit, attempting to define logical styles for every possibility. Physical styles, combined with the context of the text in which they are placed, seem to provide a much richer set without a huge number of tags. Oh, well. Use logical styles when you can, though.)
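For what it's worth, here is a sketch of the same text marked up both ways. The sentences are invented, and a browser is free to render the logical version however it likes:

```html
<!-- Physical styles: says how the text should look -->
I <I>really</I> like baloney sandwiches. <B>Do not</B> skip lunch.
The book is <I>Web Weaving</I>.

<!-- Logical styles: says what the text is -->
I <EM>really</EM> like baloney sandwiches. <STRONG>Do not</STRONG> skip lunch.
The book is <CITE>Web Weaving</CITE>.
```

Notice that the two italic phrases in the first version turn out to mean different things (emphasis and a citation), which is exactly the information the physical markup throws away.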
This section lists elements of HTML whose use should be avoided, whether because the element is now obsolete, or because the element is being deprecated (i.e. still supported, but its use is not recommended and the element may eventually become obsolete).
Several elements of HTML are obsolete, including PLAINTEXT, XMP, LISTING, HPx, and COMMENT. The first three should be replaced with PRE; HP (highlighted phrase) should be replaced with the character highlighting elements; and COMMENT should be replaced with <!-- blah blah blah -->, the SGML comment characters.
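A rough before-and-after sketch of two of these replacements, with made-up content:

```html
<!-- Obsolete -->
<XMP>
Total:   42 cows
</XMP>
<COMMENT>Remember to recount the cows</COMMENT>

<!-- Replacement -->
<PRE>
Total:   42 cows
</PRE>
<!-- Remember to recount the cows -->
```

One thing to watch when converting: unlike XMP, the content of PRE is still parsed as HTML, so any < or & in the text needs to be written as an entity reference.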
There already exist documents on the Web which address this same topic, and perhaps in more detail. For definitive reference information you may wish to check the HTML specification from the World Wide Web Consortium (W3C). For a more detailed discussion of HTML composition style, you should also check the Style Guide (especially the section on device-independent formatting), which is also from the W3C.
If you're looking for a good document for learning the basics of HTML, you will want to check out the Beginner's Guide to HTML, from NCSA.
I'd like to thank all of you who have visited this document and commented on it, suggesting fixes, clarification, and even new sections. You know who you are (even if I managed to lose your addresses in the flood of information)! It is, in some senses, still a work in progress and is always amenable to suggestion, modification, and repair. I appreciate your help!
The upshot is, this document has always been meant as a public service, and will remain a public service. I hope you've found it to be useful; I've had fun providing it for your use.
James "Eric" Tilton, HTML Guru Wannabee, jtilton@willamette.edu