Linux Today: Linux News On Internet Time.

WebReference.com: Simplified DocBk XML on the Web

Mar 19, 2000, 17:11

By Jonathan Eisenzopf, WebReference.com


Introduction to DocBook


DocBook is an SGML format for writing structured documents. Until recently, it was maintained by the Davenport Group, hosted at O'Reilly; it has since moved into the care of the OASIS group at XML.org. Technical writers and publishers have used it extensively. The Linux Documentation Project (LDP) is one notable project that relies on DocBook, and O'Reilly uses it internally quite a bit. In fact, I'm writing my Perl XML book entirely in DocBook. If you've never used SGML or XML before and are really fond of WYSIWYG editors, you're going to hate DocBook at first. That's OK, because after you publish a few articles with it, you'll wonder why you've been using HTML this long. It's particularly useful when you need to make global changes, like copyrights :) DocBook is also widely supported by commercial and many non-commercial tools. Once you have articles in DocBook format, you can convert them to formats like RTF, PostScript, and HTML.

Simplified DocBk XML

Recently, Norman Walsh created an XML version of DocBook called DocBk XML. Fortunately for us, he also created a simplified subset for writing articles, the Simplified DocBk XML DTD (SDocBk). The SDocBk homepage is http://www.nwalsh.com/docbook/simple/, where the DTD and a CSS style sheet that works in IE 5 are available. Norman has also written a set of XSL style sheets that work with both DocBk XML and Simplified DocBk XML, available at http://www.nwalsh.com/docbook/xsl/dbx106.zip, and a set of DSSSL style sheets that can be used with Jade, written by James Clark, to convert DocBk to multiple formats. Swell! Jade can be downloaded for free at http://www.jclark.com.

Writing Articles with Simplified DocBk

Starting Out

Before you get to writing the body of your article, you must first supply some information at the top of the file that tells the XML parser that it's XML and which DTD it's associated with. A Document Type Definition (DTD) gives the XML parser the grammar that the document must follow: which elements we can use, where they are allowed to appear, and at what level of nesting.

The first thing at the top of each document is the XML and DTD declaration.


<?xml version="1.0"?>
<!DOCTYPE article
          PUBLIC "-//Norman Walsh//DTD Simplified DocBk XML V3.1.7.1//EN"

Next, we need to complete the article header, which contains important information about you (the author) and the article itself. Since I've recently begun to use the Simplified DocBk XML format to write all of my articles for WebReference, below is an example of an article header I might use.

<date>December 28, 1999</date>
<pubdate>January 21, 2000</pubdate>
<title>Unix Daemons in Perl</title>

Most of the elements above are self-explanatory. Some of them have special meaning for WebReference, though. The issuenum element is used to determine the path at the top left of each article page. It's also associated with the subdirectory that the article sits in. For example, the URL for Mother of Perl tutorials is http://www.webreference.com/perl/tutorial. When writing a new article, I usually assign it a number that corresponds to the subdirectory it's in, e.g. http://www.webreference.com/perl/tutorial/10. I could just as easily have used a string or sequence, as long as it doesn't contain spaces.

The productname element is also critical because it's used in my Perl script to figure out the URL to the article since different authors have different directory structures. It should be the same string that's used in your homepage URL. These are currently: 3d, dlab, dhtml, graphics, html, js, perl, and xml.

The keywords will (eventually) be used to create HTML meta keywords to get better search engine rankings. You can add as many keywords as you like.
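
Putting the elements described above together, a header might look roughly like the sketch below. The dates and title come from the example above, and 10 and perl are the issuenum and productname values implied by the Mother of Perl URL. The artheader and keywordset wrappers, the author markup, and the keywords themselves are assumptions based on the DocBook 3.1 element set, so validate against the Simplified DocBk DTD before relying on them.

```xml
<!-- Sketch only: wrapper elements, author markup, and keywords are assumptions -->
<artheader>
  <author>
    <firstname>Jonathan</firstname>
    <surname>Eisenzopf</surname>
  </author>
  <!-- Matches the article's subdirectory, e.g. /perl/tutorial/10 -->
  <issuenum>10</issuenum>
  <!-- Must match the directory name in the author's homepage URL -->
  <productname>perl</productname>
  <date>December 28, 1999</date>
  <pubdate>January 21, 2000</pubdate>
  <title>Unix Daemons in Perl</title>
  <!-- Hypothetical keywords; add as many keyword elements as you like -->
  <keywordset>
    <keyword>perl</keyword>
    <keyword>unix</keyword>
    <keyword>daemon</keyword>
  </keywordset>
</artheader>
```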

Article Body

The main article body consists of multiple sections whose element names carry a nesting level from 1 to 5 (sect1 through sect5). Each section must contain a title element and should contain one or more para elements, which contain the article text. sect1 is the top-level section; its start signifies the beginning of a new page. The sect1 title will show up centered at the top of each article page in an <H2> tag. The sect2 element signifies a new subsection. Subsections normally exist to let the reader know you are switching to a new topic or talking point within the context of the sect1 title. Their titles are converted to <H3> tags.
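
As a rough illustration of that nesting, a two-page article body could be laid out as follows. The titles and paragraph text are placeholders invented for this sketch, and the header is left out.

```xml
<!-- Sketch only: titles and text are placeholders -->
<article>
  <!-- article header omitted -->
  <sect1>
    <title>Introduction</title>  <!-- page heading, rendered in H2 -->
    <para>Opening paragraph of the first page.</para>
    <sect2>
      <title>Background</title>  <!-- subsection heading, rendered in H3 -->
      <para>A paragraph on a narrower topic.</para>
    </sect2>
  </sect1>
  <sect1>
    <title>Next Steps</title>    <!-- a new sect1 starts a new HTML page -->
    <para>First paragraph of the second page.</para>
  </sect1>
</article>
```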

The paragraph is the real meat of the article. A para element can contain all kinds of other elements: lists, tables, images, and programming-related identifiers. I recommend taking a look at one of my articles as an example: http://www.webreference.com/perl/tutorial/10/tutorial10.xml.

Some of the elements I've been using so far are: emphasis, function, programlisting, constant, varname, ulink, citation, literallayout, command, filename.
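
To give a feel for how several of these inline elements read in practice, here is a small made-up paragraph. The sentence, the variable name, and the file path are hypothetical; only the element names and the tutorial URL come from this article.

```xml
<!-- Sketch only: the prose, $pid, and the file path are invented examples -->
<para>
  The <function>fork</function> call returns the child's process ID to the
  parent, which the script stores in <varname>$pid</varname> and writes to
  <filename>/var/run/mydaemon.pid</filename>. Run <command>ps</command> to
  confirm that the daemon is <emphasis>still</emphasis> running. See
  <ulink url="http://www.webreference.com/perl/tutorial">the tutorial index</ulink>
  for more.
</para>
```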

XML Rules

Since you're writing an article in XML, you must be aware of a few basic rules. First, you must escape the characters &, <, and > as the entities &amp;, &lt;, and &gt; when they're not part of an element or entity. Second, every start tag must have an end tag. Third, when you have a large body of text with special characters, it's best to use a CDATA section. When you wrap text in a CDATA section, its contents are output verbatim. Look at the XML source of one of my articles to see what it looks like.
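
The three rules can be seen side by side in a short fragment like this one. The Perl snippet inside it is an invented placeholder, not code from any of the tutorials.

```xml
<!-- Rule 1: escape &, < and > in ordinary text -->
<para>Use the test $a &lt; $b &amp;&amp; $b &gt; 0 to compare the values.</para>

<!-- Rule 2: every start tag has a matching end tag -->
<para>This paragraph is <emphasis>closed</emphasis> properly.</para>

<!-- Rule 3: inside a CDATA section the same characters need no escaping -->
<programlisting><![CDATA[
if ($a < $b && $b > 0) {
    print "<b>in range</b>\n";
}
]]></programlisting>
```

The one sequence a CDATA section cannot contain is ]]>, since that is what ends it.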

Publishing Your Article


If you're brave, you can write your own script or style sheet to convert the Simplified DocBk XML document into HTML that's suitable to put on a Web site, but I wouldn't recommend it unless you really want to get into XML or you're already an expert. As I mentioned before, CSS, XSL, and DSSSL style sheets already exist, and you could modify those to do what my script does. In my script, any element that I don't have an explicit conversion function for simply becomes a span tag whose class is the name of the element. This makes it easy to apply a CSS style sheet to your HTML article. I have a very basic CSS style sheet that I use, which you can download at http://www.webreference.com/perl/tutorial/tutorial.css.
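
The span fallback described above would work out roughly as follows. This is a sketch of the idea rather than actual script output: the sample sentence is invented, it assumes filename has no dedicated conversion function, and the handling of para (shown here as a p tag) is also an assumption.

```xml
<!-- Simplified DocBk input -->
<para>Edit <filename>httpd.conf</filename> and restart the server.</para>

<!-- HTML that a fallback of this kind might emit -->
<p>Edit <span class="filename">httpd.conf</span> and restart the server.</p>
```

A CSS rule with the selector span.filename would then style every file name in the article.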

I wrote a Perl script that takes a Simplified DocBk XML article, splits it into multiple pages, converts it to HTML, and adds all the headers and footers I use for WebReference. You can download it at http://www.webreference.com/perl/tutorial/11/webref_generate.pl. Feel free to modify it for your own use and let me know how you like it. The benefit of using such a script is that it saves you from doing all that HTML stuff. I used to spend 1-2 hours on each article changing links and text. I'm sure others are faster, but I only type with 4 fingers :)

To get the script to work on your system, you must have a recent version of Perl and the XML::Parser module. It should work on most *nix and Windows systems. I documented the process of installing the module in a previous tutorial: http://www.webreference.com/perl/tutorial/8.

When you run the script, it will either create a bunch of nice HTML files, or it'll croak with an error message. If you get an error message, it means either that the XML is not well-formed or that you don't have the XML::Parser module or a current version of Perl installed.

Conversion via the Web

I've written a CGI script that will enable you to upload your Simplified DocBk XML article. Upon processing, it returns a zip file containing the HTML files. You can access the upload page at http://www.webreference.com/perl/upload.html. After the XML file has been processed, you should see a popup box prompting you to save the file. Make sure you give it a .zip extension when you save it so that it will be recognized as a zip file.

Current Limitations

Currently, I don't have the code in place to convert DocBook tables, images, or lists. There may be other unsupported elements too. For now, you'll have to add the HTML for these by hand after the pages are generated. But that's a lot easier than doing it from scratch. Plus, when I add features in the future, you'll be able to easily regenerate all your articles without having to do much HTML editing. Lastly, for some reason, when I tried uploading an XML file from a Mac, I got an error. I will try to track this problem down. If any of you Mac users have an answer, please send it my way.
