doc/wiki2docbook/html2db/index.src.html

   1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\r
   2 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [\r
   3 <!ENTITY html2db "<code>html2db.xsl</code>">\r
   4 ]>\r
   5 <html xmlns="http://www.w3.org/1999/xhtml"\r
   6       xmlns:db="urn:docbook">\r
   7 <head>\r
   8 <title>This title is ignored</title>\r
   9 </head>\r
  10 <body>\r
  11 \r
  12 <h1>html2db.xsl</h1>\r
  13 \r
  14 <!-- The xmlns attribute escapes into the Docbook namespace -->\r
  15 <articleinfo xmlns="urn:docbook">\r
  16   <author>\r
  17     <firstname>Oliver</firstname>\r
  18     <surname>Steele</surname>\r
  19   </author>\r
  20   <revhistory>\r
  21     <revision>\r
  22       <revnumber>1</revnumber>\r
  23       <date>2004-07-30</date>\r
  24     </revision>\r
  25     <revision>\r
  26       <revnumber>1.0.1</revnumber>\r
  27       <date>2004-08-01</date>\r
  28       <revdescription><para>Editorial changes to the\r
  29       readme.</para></revdescription>\r
  30     </revision>\r
  31   </revhistory>\r
  32   <date>2004-07-30</date>\r
  33 </articleinfo>\r
  34 \r
  35 <h2>Overview</h2>\r
  36 \r
  37 <p>&html2db; converts an XHTML source document into a Docbook output\r
  38 document.  It provides features for customizing the generation of the\r
  39 output, so that the output can be tuned by annotating\r
  40 the source, rather than hand-editing the output.  This makes it useful\r
  41 in a processing pipeline where the source documents are maintained in\r
  42 HTML, although it can be used as a one-time conversion tool\r
  43 too.</p>\r
  44 \r
  45 <p>This document is an example of &html2db; used in conjunction with\r
  46 the Docbook XSL stylesheets.  The <a href="index.src.html">source\r
  47 file</a> is an XHTML file with some embedded Docbook elements and\r
  48 processing instructions.  &html2db; compiles it into a <a\r
  49 href="index.xml">Docbook document</a>, which can be used to generate\r
  50 this output file (which includes a Table of Contents), a <a\r
  51 href="docs/index.html">chunked HTML file</a>, a <a\r
  52 href="html2db.pdf">PDF</a>, or other formats.</p>\r
  53 \r
  54 <h2>Features</h2>\r
  55 <dl>\r
  56 <dt>XSLT implementation</dt>\r
  57 <dd>This tool is designed to be embedded within an XSLT processing\r
  58 pipeline.  <code>html2db.xsl</code> can be used in a custom\r
  59 stylesheet or integrated into a larger system.  See <a\r
  60 href="#embedding">Overriding</a>.</dd>\r
  61 \r
  62 <dt>Customizable</dt>\r
  63 <dd>The output can be customized by means of additional markup in\r
  64 the XHTML source.  See the section on <a\r
  65 href="#customization">customization</a>.</dd>\r
  66 \r
  67 <dt>Creates outline structure</dt>\r
  68 <dd><code>h1</code>, <code>h2</code>, etc. are turned into nested\r
  69 <code>section</code> and <code>title</code> elements (as opposed to\r
  70 <code>bridgehead</code>s).</dd>\r
  71 \r
  72 <dt>Accepts a wide variety of XHTML</dt>\r
  73 <dd>In particular, &html2db; automatically wraps <dfn>naked item\r
  74 text</dfn> (text that is not enclosed in a <code>&lt;p&gt;</code>)\r
  75 inside a table cell or list item.  Naked text is a common property of\r
  76 XHTML documents, but needs to be clothed to create valid\r
  77 Docbook.<db:footnote><p>This feature is limited.  See <a\r
  78 href="#implicit-blocks">Implicit Blocks</a>.</p></db:footnote></dd>\r
  79 \r
  80 </dl>\r
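
<p>As a rough illustration of the outline feature, consider a source
body such as the one below.  The headings and text are placeholders,
and the exact whitespace and attributes of the real output may
differ.</p>

<pre class="example"><![CDATA[
<body>
  <h1>Title</h1>
  <p>Introduction.</p>
  <h2>First section</h2>
  <p>Section text.</p>
  <h3>A subsection</h3>
  <p>Subsection text.</p>
</body>
]]></pre>

<p>With the default document root, this would be expected to produce
nested sections along these lines:</p>

<pre class="example"><![CDATA[
<article>
  <title>Title</title>
  <para>Introduction.</para>
  <section>
    <title>First section</title>
    <para>Section text.</para>
    <section>
      <title>A subsection</title>
      <para>Subsection text.</para>
    </section>
  </section>
</article>
]]></pre>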
  81 \r
  82 <h2>Requirements</h2>\r
  83 <ul>\r
  84 <li>Java: JRE or JDK 1.3 or greater.</li>\r
  85 <li>Xalan 2.5.0.</li>\r
  86 <li>Familiarity with installing and running JAR files.</li>\r
  87 </ul>\r
  88 \r
  89 <p>&html2db; might work with earlier versions of Java and Xalan, and\r
  90 it might work with other XSLT processors such as Saxon and\r
  91 xsltproc.</p>\r
  92 \r
  93 <h2>License</h2>\r
  94 <p>This software is released under the Open Source <a href="http://www.opensource.org/licenses/artistic-license.php">Artistic License</a>.</p>\r
  95 \r
  96 <h2>Installation</h2>\r
  97 <ul>\r
  98 <li>Install JRE 1.3 or higher.</li>\r
  99 <li>Install Xalan, if necessary.</li>\r
 100 <li>Download <code>html2db-1.zip</code> from <a href="http://osteele.com/sources/html2db-1.zip">http://osteele.com/sources/html2db-1.zip</a>.</li>\r
 101 <li>Unzip <code>html2db-1.zip</code>.</li>\r
 102 </ul>\r
 103 \r
 104 <h2>Usage</h2>\r
 105 <p>Use Xalan to process an XHTML source file into a Docbook file:</p>\r
 106 \r
 107 <pre class="example">\r
 108 java org.apache.xalan.xslt.Process -XSL html2db.xsl -IN doc.html &gt; doc.xml\r
 109 </pre>\r
 110 \r
 111 <p>See <a href="index.src.html"><code>index.src.html</code></a> for an\r
 112 example of an input file.</p>\r
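
<p>For a quick first test, a much smaller input file should also do.
The sketch below is an assumption about what a minimal
<code>doc.html</code> might contain; the text is placeholder content.
The part that matters is the XHTML namespace declaration on the root
element, without which the output is empty (see the FAQ).</p>

<pre class="example"><![CDATA[
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <!-- The head title is not used for the output title -->
  <title>Ignored</title>
</head>
<body>
  <!-- The first h1 supplies the document title -->
  <h1>My Document</h1>
  <p>An introductory paragraph.</p>

  <h2>A Section</h2>
  <p>Some section text.</p>
</body>
</html>
]]></pre>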
 113 \r
 114 <p>If your source files are in HTML, not XHTML, you may find the <a\r
 115 href="http://tidy.sourceforge.net/">Tidy</a> tool useful.  This is a\r
 116 tool that converts from HTML to XHTML, and can be added to the front\r
 117 of your processing pipeline.</p>\r
 118 \r
 119 <p>(If you need to process HTML and you don't know or can't figure out\r
 120 from context what a processing pipeline is, &html2db; is probably not\r
 121 the right tool for you, and you should look for a local XML or Java\r
 122 guru or for a commercially supported product.)</p>\r
 123 \r
 124 <h2>Specification</h2>\r
 125 \r
 126 <h3>XHTML Elements</h3>\r
 127 <p><code>code/i</code> stands for "an <code>i</code> element\r
 128 immediately within a <code>code</code> element".  This notation is\r
 129 from XPath.</p>\r
 130 \r
 131 <p>XHTML elements must be in the XHTML namespace,\r
 132 <code>http://www.w3.org/1999/xhtml</code>.</p>\r
 133 \r
 134 <table>\r
 135 <tr>\r
 136 <th>XHTML</th>\r
 137 <th>Docbook</th>\r
 138 <th>Notes</th>\r
 139 </tr>\r
 140 \r
 141 <tr>\r
 142 <td><code>b</code>, <code>i</code>, <code>em</code>, <code>strong</code></td>\r
 143 <td><code>emphasis</code></td>\r
 144 <td>The <code>role</code> attribute is the original tag name</td>\r
 145 </tr>\r
 146 \r
 147 <tr>\r
 148 <td><code>dfn</code></td>\r
 149 <td><code>glossterm</code>, and also <code>primary</code> <code>indexterm</code></td>\r
 150 </tr>\r
 151 \r
 152 <tr>\r
 153 <td><code>code/i</code>, <code>tt/i</code>, <code>pre/i</code></td>\r
 154 <td><code>replaceable</code></td>\r
 155 <td>In practice, <code>i</code> within monospace content is usually used to mean replaceable text.  If you're using it for emphasis, use <code>em</code> instead.</td>\r
 156 </tr>\r
 157 \r
 158 <tr>\r
 159 <td><code>pre</code>, <code>body/code</code></td>\r
 160 <td><code>programlisting</code></td>\r
 161 </tr>\r
 162 \r
 163 <tr>\r
 164 <td><code>img</code></td>\r
 165 <td><code>inlinemediaobject/imageobject/imagedata</code></td>\r
 166 <td>In an inline context.</td>\r
 167 </tr>\r
 168 \r
 169 <tr>\r
 170 <td><code>img</code></td>\r
 171 <td><code>[informal]figure/mediaobject/imageobject/imagedata</code></td>\r
 172 <td>If it has a <code>title</code> attribute or <code>db:title</code> it's wrapped in a <code>figure</code>.  Otherwise it's wrapped in an <code>informalfigure</code>.</td>\r
 173 </tr>\r
 174 \r
 175 <tr>\r
 176 <td><code>table</code></td>\r
 177 <td><code>[informal]table</code></td>\r
 178 <td>XHTML <code>table</code> becomes Docbook <code>table</code> if it has a <code>summary</code> attribute; <code>informaltable</code> otherwise.</td>\r
 179 </tr>\r
 180 \r
 181 <tr>\r
 182 <td><code>ul</code></td>\r
 183 <td><code>itemizedlist</code></td>\r
 184 <td>But see the processing instruction <a href="#simplelist">below</a>.</td>\r
 185 </tr>\r
 186 </table>\r
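
<p>To make the table concrete, here is a small sketch of two of these
mappings.  The source text is a placeholder, and the output is what the
table implies rather than a captured run, so details such as whitespace
may differ.</p>

<pre class="example"><![CDATA[
<!-- XHTML source -->
<p>This is <em>important</em> and this is <b>bold</b>.</p>
<pre>cd <i>directory</i></pre>

<!-- Expected Docbook, approximately -->
<para>This is <emphasis role="em">important</emphasis> and this is
<emphasis role="b">bold</emphasis>.</para>
<programlisting>cd <replaceable>directory</replaceable></programlisting>
]]></pre>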
 187 \r
 188 \r
 189 \r
 190 <h3>Links</h3>\r
 191 <table summary="Link Translation">\r
 192 <tr>\r
 193 <th>XHTML</th>\r
 194 <th>Docbook</th>\r
 195 <th>Notes</th>\r
 196 </tr>\r
 197 \r
 198 <tr>\r
 199 <td><code>&lt;a name="<var>name</var>"&gt;</code></td>\r
 200 <td><code>&lt;anchor id="{$anchor-id-prefix}<var>name</var>"&gt;</code></td>\r
 201 <td>An anchor within a <code>h<var>n</var></code> element is attached to the enclosing <code>section</code> as an <code>id</code> attribute instead.</td>\r
 202 </tr>\r
 203 \r
 204 <tr>\r
 205 <td><code>&lt;a href="#<var>name</var>"&gt;</code></td>\r
 206 <td><code>&lt;link linkend="{$anchor-id-prefix}<var>name</var>"&gt;</code></td>\r
 207 </tr>\r
 208 \r
 209 <tr>\r
 210 <td><code>&lt;a href="<var>url</var>"&gt;</code></td>\r
 211 <td><code>&lt;ulink url="<var>url</var>"&gt;</code></td>\r
 212 </tr>\r
 213 \r
 214 <tr>\r
 215 <td><code>&lt;a href="mailto:<var>address</var>"&gt;</code></td>\r
 216 <td><code>&lt;email&gt;<var>address</var>&lt;/email&gt;</code></td>\r
 217 </tr>\r
 218 \r
 219 </table>\r
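
<p>The following sketch puts the link rules together.  It assumes that
<code>anchor-id-prefix</code> has been set to <code>ch1.</code> (a
made-up value), and that the <code>mailto:</code> link is written with
<code>href</code>, as is usual in XHTML.  The names and addresses are
placeholders, and the output is inferred from the table rather than
copied from a real run.</p>

<pre class="example"><![CDATA[
<!-- XHTML source -->
<h2><a name="install"/>Installation</h2>
<p><a name="tools"/>See <a href="#install">Installation</a>,
<a href="http://tidy.sourceforge.net/">Tidy</a>, or write to
<a href="mailto:someone@example.com">someone@example.com</a>.</p>

<!-- Expected Docbook, approximately -->
<section id="ch1.install">
  <title>Installation</title>
  <para><anchor id="ch1.tools"/>See <link linkend="ch1.install">Installation</link>,
  <ulink url="http://tidy.sourceforge.net/">Tidy</ulink>, or write to
  <email>someone@example.com</email>.</para>
</section>
]]></pre>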
 220 \r
 221 <h3 id="tables">Tables</h3>\r
 222 \r
 223 <p>XHTML <code>table</code> support is minimal.  &html2db; changes the\r
 224 element names and counts the columns (this is necessary to get table\r
 225 footnotes to span all the columns), but it does not attempt to deal\r
 226 with tables in their full generality.</p>\r
 227 \r
 228 <p>An XHTML <code>table</code> with a <code>summary</code> attribute\r
 229 generates a <code>table</code>, whose <code>title</code> is the value\r
 230 of that summary.  An XHTML <code>table</code> without a\r
 231 <code>summary</code> generates an <code>informaltable</code>.</p>\r
 232 \r
 233 <p>Any <code>tr</code>s that contain <code>th</code>s are pulled to\r
 234 the top of the table, and placed inside a <code>thead</code>.  Other\r
 235 <code>tr</code>s are placed inside a <code>tbody</code>.  This matches\r
 236 the common XHTML <code>table</code> pattern, where the first row is\r
 237 a header row.</p>\r
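
<p>As a sketch of these rules, here is a two-column table with a
<code>summary</code>.  The content is a placeholder, and the output
assumes Docbook's usual <code>tgroup</code>, <code>row</code>, and
<code>entry</code> element names.  Cell text may additionally be
wrapped in <code>para</code>, as described under Implicit Blocks; that
is omitted here for brevity.</p>

<pre class="example"><![CDATA[
<!-- XHTML source -->
<table summary="List elements">
  <tr><th>XHTML</th><th>Docbook</th></tr>
  <tr><td>ul</td><td>itemizedlist</td></tr>
</table>

<!-- Expected Docbook, approximately -->
<table>
  <title>List elements</title>
  <tgroup cols="2">
    <thead>
      <row><entry>XHTML</entry><entry>Docbook</entry></row>
    </thead>
    <tbody>
      <row><entry>ul</entry><entry>itemizedlist</entry></row>
    </tbody>
  </tgroup>
</table>
]]></pre>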
 238 \r
 239 <h3 id="implicit-blocks">Implicit Blocks</h3>\r
 240 <p>XHTML allows <code>li</code>, <code>dd</code>, and <code>td</code>\r
 241 elements to contain either inline text (for instance,\r
 242 <code>&lt;li&gt;a list item&lt;/li&gt;</code>) or block structure\r
 243 (<code>&lt;li&gt;&lt;p&gt;a block&lt;/p&gt;&lt;/li&gt;</code>).  The\r
 244 corresponding Docbook elements require block structure, such as\r
 245 <code>para</code>.</p>\r
 246 \r
 247 <p>&html2db; provides limited support for wrapping naked text in\r
 248 these positions in <code>para</code> elements.  If a list item or\r
 249 table cell item directly contains text, all text up to the position of\r
 250 the first element (or all text, if there is no element) is wrapped in\r
 251 <code>para</code>.  This handles the simple case of an item that\r
 252 directly contains text, and also the case of an item that contains\r
 253 text followed by blocks such as paragraphs.</p>\r
 254 \r
 255 <p>Note that this algorithm is easily confused.  It doesn't\r
 256 distinguish between block and inline XHTML elements, so it will only\r
 257 wrap the first word in <code>&lt;li&gt;some &lt;b&gt;bold&lt;/b&gt;\r
 258 text&lt;/li&gt;</code>, leading to badly formatted output.  The\r
 259 workaround is to wrap troublesome content in explicit\r
 260 <code>&lt;p&gt;</code> tags.</p>\r
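
<p>The three cases above can be sketched as follows.  The first two
items are handled by the wrapping rule; the third shows the workaround
for inline markup.  The item text is a placeholder and the output is
approximate.</p>

<pre class="example"><![CDATA[
<!-- XHTML source -->
<ul>
  <li>a list item</li>
  <li>intro text<p>a block</p></li>
  <li><p>some <b>bold</b> text</p></li>
</ul>

<!-- Expected Docbook, approximately -->
<itemizedlist>
  <listitem><para>a list item</para></listitem>
  <listitem><para>intro text</para><para>a block</para></listitem>
  <listitem><para>some <emphasis role="b">bold</emphasis> text</para></listitem>
</itemizedlist>
]]></pre>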
 261 \r
 262 <h3 id="docbook-elements">Docbook Elements</h3>\r
 263 \r
 264 <p>Elements from the Docbook namespace are passed through as is.\r
 265 There are two ways to include a Docbook element in your XHTML\r
 266 source:</p>\r
 267 \r
 268 <dl>\r
 269 <dt>Global prefix</dt>\r
 270 <dd><p>A <dfn>fake Docbook namespace</dfn><db:footnote><p>The fake\r
 271 Docbook namespace is <code>urn:docbook</code>.  Docbook doesn't really\r
 272 have a namespace, and if it did, it wouldn't be this one.  See <a\r
 273 href="#docbook-namespace">Docbook namespace</a> for a discussion of\r
 274 this issue.</p></db:footnote>\r
 275 \r
 276 declaration may be added to the document root element.  Anywhere in\r
 277 the document, the prefix from this namespace declaration may be used\r
 278 to include a Docbook element.  This is useful if a document contains\r
 279 many Docbook elements, such as <code>footnote</code> or\r
 280 <code>glossterm</code>, interspersed with XHTML.  (In this case it may\r
 281 be more convenient to allow these elements in the XHTML namespace and\r
 282 add a customization layer that translates them to Docbook elements,\r
 283 however.  See <a href="#customization">Customization</a>.)</p>\r
 284 \r
 285 <pre class="example"><![CDATA[\r
 286 <html xmlns="http://www.w3.org/1999/xhtml"\r
 287       xmlns:db="urn:docbook">\r
 288   ...\r
 289   <p>Some text<db:footnote>and a footnote</db:footnote>.</p>\r
 290 ]]></pre></dd>\r
 291 \r
 292 <dt>Local namespace</dt>\r
 293 <dd><p>A Docbook element may be introduced along with a prefix-less\r
 294 namespace declaration.  This is useful for embedding a Docbook\r
 295 document fragment (a hierarchy of elements that all use Docbook tags)\r
 296 within an XHTML document.</p>\r
 297 \r
 298 <pre class="example"><![CDATA[\r
 299   ...\r
 300   <articleinfo xmlns="urn:docbook">\r
 301     <author>\r
 302       <firstname>...</firstname>\r
 303   ...\r
 304 ]]></pre></dd>\r
 305 </dl>\r
 306 \r
 307 <p>The source to <a href="index.src.html">this document</a>\r
 308 illustrates both of these techniques.</p>\r
 309 \r
 310 <p class="note">Both these techniques will cause your document to be\r
 311 invalid as XHTML.  In order to validate an XHTML document that\r
 312 contains Docbook elements, you will need to create a custom schema.\r
 313 Technically, you then ought to place your document in a different\r
 314 namespace, but this will cause &html2db; not to recognize it!</p>\r
 315 \r
 316 \r
 317 <h3>Output Processing Instructions</h3>\r
 318 \r
 319 <p>&html2db; adds a few processing instructions to the output file.\r
 320 The Docbook XSL stylesheets ignore these, but if you write a\r
 321 customization layer for Docbook XSL, you can use the information in\r
 322 these processing instructions to customize the HTML output.  This can\r
 323 be used, for example, to set the <code>a</code> <code>onclick</code>\r
 324 and <code>target</code> attributes in the HTML files that Docbook XSL\r
 325 creates to the same values they had in the input document.</p>\r
 326 \r
 327 <dl>\r
 328 <dt><code>&lt;?html2db attribute="<var>name</var>" value="<var>value</var>"?&gt;</code></dt>\r
 329 <dd>Placed inside a link element to capture the value of the <code>a</code> <code>target</code> and <code>onclick</code> attributes.  <var>name</var> is the name of the attribute (<code>target</code> or <code>onclick</code>), and <var>value</var> is its value, with <code>"</code> and <code>\</code> replaced by <code>\"</code> and <code>\\</code>, respectively.</dd>\r
 330 \r
 331 <dt><code>&lt;?html2db element="br"?&gt;</code></dt>\r
 332 <dd>Represents the location of an XHTML <code>br</code> element in the\r
 333 source document.</dd>\r
 334 \r
 335 </dl>\r
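
<p>For instance, a link with a <code>target</code> attribute followed
by a line break might come out as shown below.  This is a sketch based
on the descriptions above; the URL and text are placeholders, and the
exact position of the processing instruction inside the link is an
assumption.</p>

<pre class="example"><![CDATA[
<!-- XHTML source -->
<p>Visit <a href="http://example.com/" target="_blank">the site</a>.<br/>
Next line.</p>

<!-- Expected Docbook, approximately -->
<para>Visit <ulink url="http://example.com/"><?html2db attribute="target" value="_blank"?>the site</ulink>.<?html2db element="br"?>
Next line.</para>
]]></pre>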
 336 \r
 337 <p>You can also include <code>&lt;?dbhtml?&gt;</code> processing\r
 338 instructions in the HTML source document, and they will be copied\r
 339 through to the Docbook output file unchanged (as will all other\r
 340 processing instructions).</p>\r
 341 \r
 342 \r
 343 <h2 id="customization">Customization</h2>\r
 344 <h3>XSLT Parameters</h3>\r
 345 <dl>\r
 346   <dt><code>&lt;xsl:param name="anchor-id-prefix" select="''"/&gt;</code></dt>\r
 347   <dd>Prefixed to every id generated from <code>&lt;a name=&gt;</code>\r
 348   and <code>&lt;a href="#"&gt;</code>.  This is useful to avoid\r
 349   collisions between multiple documents that are compiled into the\r
 350   same book.  For instance, if a number of XHTML sources are assembled\r
 351   into chapters of a book, you can style each source file with a prefix of\r
 352   <code><var>docid</var>.</code> where <var>docid</var> is a unique id\r
 353   for each source file.</dd>\r
 354   \r
 355   <dt><code>&lt;xsl:param name="document-root" select="'article'"/&gt;</code></dt>\r
 356   <dd>The default document root.  This can be overridden by\r
 357   <code>&lt;?html2db class="<var>name</var>"?&gt;</code> within the\r
 358   document itself, and defaults to <code>article</code>.</dd>\r
 359 </dl>\r
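
<p>One way to set these parameters is a small wrapper stylesheet that
imports <code>html2db.xsl</code> and redeclares them; the importing
stylesheet's declarations take precedence.  This is a minimal sketch:
the file name <code>chapter1.xsl</code> and the values
<code>ch1.</code> and <code>chapter</code> are made up for
illustration.</p>

<pre class="example"><![CDATA[
<?xml version="1.0"?>
<!-- chapter1.xsl (hypothetical): html2db.xsl with per-chapter settings -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- xsl:import must come before any other top-level element -->
  <xsl:import href="html2db.xsl"/>

  <!-- Prefix every generated id and linkend with "ch1." -->
  <xsl:param name="anchor-id-prefix" select="'ch1.'"/>

  <!-- Generate a chapter instead of the default article -->
  <xsl:param name="document-root" select="'chapter'"/>
</xsl:stylesheet>
]]></pre>

<p>Most XSLT processors also let you set top-level parameters from the
command line, which avoids the wrapper file.</p>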
 360 \r
 361 <h3 id="processing-instructions">Processing instructions</h3>\r
 362 <p>Use the <code>&lt;?html2db?&gt;</code> processing instruction to\r
 363 customize the transformation of the XHTML source to Docbook:</p>\r
 364 \r
 365 <table>\r
 366 <tr>\r
 367 <th>Processing instruction</th>\r
 368 <th>Content</th>\r
 369 <th>Effect</th>\r
 370 </tr>\r
 371 \r
 372 <tr>\r
 373 <td><code>&lt;?html2db class="<var>xxx</var>"?&gt;</code></td>\r
 374 <td><code>body</code></td>\r
 375 <td>Sets the output document root to <var>xxx</var>.  Useful for\r
 376 translating to <code>preface</code>, <code>appendix</code>, or <code>chapter</code>; the default is\r
 377 <var>$document-root</var>.</td>\r
 378 </tr>\r
 379 \r
 380 <tr id="simplelist">\r
 381 <td><code>&lt;?html2db class="simplelist"?&gt;</code></td>\r
 382 <td><code>ul</code></td>\r
 383 <td>Creates a vertical <code>simplelist</code>.<db:footnote><db:para>Note that the\r
 384 current implementation simply checks for the presence of <em>any</em>\r
 385 <code>html2db</code> processing instruction.</db:para></db:footnote></td>\r
 386 </tr>\r
 387 \r
 388 \r
 389 <tr>\r
 390 <td><code>&lt;?html2db rowsep="1"?&gt;</code></td>\r
 391 <td><code>[informal]table</code></td>\r
 392 <td>Sets the <code>rowsep</code> attribute on the generated <code>table</code>.<db:footnote><db:para>Note that the current implementation simply checks for the presence of <em>any</em> <code>html2db</code> processing instruction that begins with <code>rowsep</code>, and assumes the value is <code>1</code>.</db:para></db:footnote></td>\r
 393 </tr>\r
 394 </table>\r
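
<p>In the source, these might be used as in the sketch below.  It
assumes that each instruction is written as a child of the element
named in the Content column; the list and table content are
placeholders.  Write the instructions exactly as shown in the table,
since the matching is sensitive to spacing.</p>

<pre class="example"><![CDATA[
<body>
<?html2db class="chapter"?>
<h1>Chapter title</h1>

<ul>
<?html2db class="simplelist"?>
<li>alpha</li>
<li>beta</li>
</ul>

<table>
<?html2db rowsep="1"?>
<tr><th>Name</th><th>Value</th></tr>
<tr><td>a</td><td>1</td></tr>
</table>
</body>
]]></pre>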
 395 \r
 396 <h3 id="embedding">Overriding the built-in templates</h3>\r
 397 <p>For cases where the previous techniques don't allow for enough\r
 398 customization, you can override the built-in templates.  You will need\r
 399 to know XSLT in order to do this, and you will need to write a new\r
 400 stylesheet that uses the <code>xsl:import</code> element to import\r
 401 <code>html2db.xsl</code>.</p>\r
 402 \r
 403 <p>The <a href="example.xsl"><code>example.xsl</code></a> stylesheet\r
 404 is an example customization layer.  It recognizes the <code>&lt;div\r
 405 class="abstract"&gt;</code> and <code>&lt;p class="note"&gt;</code>\r
 406 classes in the <a href="index.src.html">source</a> for this document,\r
 407 and generates the corresponding Docbook elements.</p>\r
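
<p>A customization layer of this kind might look roughly like the
sketch below.  This is not the contents of the example stylesheet; it
is a guess at the general shape.  It assumes that the built-in
templates run in the default mode, and that <code>note</code> and
<code>abstract</code> are the Docbook elements you want.</p>

<pre class="example"><![CDATA[
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:h="http://www.w3.org/1999/xhtml"
                exclude-result-prefixes="h">
  <!-- Pull in the built-in templates; templates here take precedence -->
  <xsl:import href="html2db.xsl"/>

  <!-- <p class="note"> becomes a Docbook note holding one para -->
  <xsl:template match="h:p[@class='note']">
    <note><para><xsl:apply-templates/></para></note>
  </xsl:template>

  <!-- <div class="abstract"> becomes a Docbook abstract; its child
       paragraphs are left to the imported templates -->
  <xsl:template match="h:div[@class='abstract']">
    <abstract><xsl:apply-templates/></abstract>
  </xsl:template>
</xsl:stylesheet>
]]></pre>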
 408 \r
 409 \r
 410 <h2>FAQ</h2>\r
 411 <h3>Why generate Docbook?</h3>\r
 412 <p>The primary reason to use Docbook as an <em>output</em> format is\r
 413 to take advantage of the Docbook XSL stylesheets.  These are a\r
 414 well-designed, well-documented set of XSL stylesheets that provide a\r
 415 variety of publishing features that would be difficult to recreate\r
 416 from scratch for HTML:</p>\r
 417 \r
 418 <ul>\r
 419 <li>Automatic Table-of-Contents generation.</li>\r
 420 <li>Automatic part, chapter, and section numbering.</li>\r
 421 <li>Creation of single-page, multi-page, PDF, and WinHelp files from the same source document.</li>\r
 422 <li>Navigation headers, footers, and metadata for multi-page HTML\r
 423 documents.</li>\r
 424 <li>Link resolution and link target text insertion across multiple pages and numbered targets.</li>\r
 425 <li>Figure, example, and table numbering, and tables of these.</li>\r
 426 <li>Index and glossary tools.</li>\r
 427 </ul>\r
 428 \r
 429 <h3>Why write in XHTML?</h3>\r
 430 \r
 431 <p>Given that Docbook is so great, why not write in it?</p>\r
 432 \r
 433 <p>Where there are no legacy concerns, Docbook is probably a better\r
 434 choice for structured or technical documentation.</p>\r
 435 \r
 436 <p>Where the only legacy concern is the documents themselves, and not\r
 437 the tools and skill sets of documentation contributors, you should\r
 438 consider using an (X)HTML convertor to perform a one-time conversion\r
 439 of your documentation source into Docbook, and then switching\r
 440 development to the result files.  You can use this stylesheet to\r
 441 perform this conversion, or evaluate other tools, many of which are\r
 442 probably appropriate for this purpose.</p>\r
 443 \r
 444 <p>Often there are other legacy concerns: the availability of cheap\r
 445 (including free) and usable HTML editors and editing modes; and the\r
 446 fact that it's easier to teach people XHTML than Docbook.  If either\r
 447 of these is an issue in your organization, you may want to maintain\r
 448 documentation sources in XHTML instead of Docbook.</p>\r
 449 \r
 450 <p>For example, at <a href="http://www.laszlosystems.com/">Laszlo</a>,\r
 451 most developers contribute directly to the documentation.  Requiring\r
 452 that developers learn Docbook, or that they wait on the doc team to\r
 453 get content into the docs, would discourage this.</p>\r
 454 \r
 455 <h3>Why not use an existing convertor?</h3>\r
 456 \r
 457 <p>This isn't the first (X)HTML to Docbook convertor.  Why not use one\r
 458 of the existing ones?</p>\r
 459 \r
 460 <p>Each of the HTML to Docbook convertors that I could find had at least some\r
 461 of the following limitations, some of which stemmed from their\r
 462 intended use as one-time-only convertors for legacy documents:</p>\r
 463 \r
 464 <ul>\r
 465 <li>Many only operated on a subset of HTML, and relied upon hand\r
 466 editing of the output to clean up mistakes.  This made them impossible\r
 467 to use as part of a processing pipeline, where the source is\r
 468 <em>maintained</em> in XHTML.</li>\r
 469 \r
 470 <li>There was no way to customize the output, except by (1) hand\r
 471 editing, or (2) writing a post-processing stylesheet, which didn't\r
 472 have access to the information in the XHTML source document.</li>\r
 473 \r
 474 <li>Many of them were difficult or impossible to customize and\r
 475 extend. They were closed-source, or written in Java or Perl (which I\r
 476 find to be difficult languages to use for customizing this kind of\r
 477 thing) and embedded in a larger system.</li>\r
 478 \r
 479 <li>They didn't take full advantage of the Docbook tag set and content\r
 480 model to represent document structure.  For instance, they didn't\r
 481 generate nested <code>section</code> elements to represent\r
 482 <code>h1</code> <code>h2</code> sequences, or <code>table</code> to\r
 483 represent tables with <code>summary</code> attributes.</li>\r
 484 </ul>\r
 485 \r
 486 <h3>I got this error.  What does it mean?</h3>\r
 487 <dl>\r
 488 <dt>Q. <code>Fatal Error! The element type "br" must be terminated by the matching end-tag "&lt;/br&gt;".\r
 489 </code></dt>\r
 490 <dd>A. Your document is HTML, not <em>X</em>HTML.  You need to fix it, or run it through Tidy first.</dd>\r
 491 \r
 492 <dt>Q. My output document is empty except for the <code>&lt;?xml version="1.0" encoding="UTF-8"?&gt;</code> line.</dt>\r
 493 <dd>A. The document is missing a namespace declaration.  See the <a href="index.src.html">example</a> for an example.</dd>\r
 494 \r
 495 <dt>Q. Some of the headers and document sections are repeated multiple times.</dt>\r
 496 <dd>A. The document has out-of-sequence headers, such as <code>h1</code> followed by <code>h3</code> (instead of <code>h2</code>).  This won't work.</dd>\r
 497 \r
 498 <dt>Q. <code>Fatal Error! The prefix "db" for element "db:footnote" is not bound.</code></dt>\r
 499 <dd>A. You haven't declared the <code>db</code> namespace prefix.  See the <a href="index.src.html">example</a> for an example.</dd>\r
 500 \r
 501 </dl>\r
 502 \r
 503 \r
 504 <h2>Implementation Notes</h2>\r
 505 \r
 506 <h3>Bugs</h3>\r
 507 <ul>\r
 508 <li>Improperly sequenced <code>h<var>n</var></code> (for example\r
 509 <code>h1</code> followed by <code>h3</code>, instead of\r
 510 <code>h2</code>) will result in duplicate text.</li>\r
 511 </ul>\r
 512 \r
 513 \r
 514 <h3>Limitations</h3>\r
 515 <ul>\r
 516 <li>The <code>id</code> attribute is only preserved for certain\r
 517 elements (at least <code>h<var>n</var></code>, images, paragraphs, and\r
 518 tables).  It ought to be preserved for all of them.</li>\r
 519 <li>Only the <a href="#tables">very simplest</a> table format is\r
 520 implemented.</li>\r
 521 <li>Always uses compact lists.</li>\r
 522 <li>The string matching for <code>&lt;?html2db\r
 523 class="<var>classname</var>"?&gt;</code> requires an exact match\r
 524 (spaces and all).</li>\r
 525 <li>The <a href="#implicit-blocks">implicit blocks</a> code is easily\r
 526 confused, as documented in that section.  This is\r
 527 easy to fix now that I understand the difference between block and\r
 528 inline elements (I didn't when I was implementing this), but I\r
 529 probably won't do so until I run into the problem again.</li>\r
 530 \r
 531 </ul>\r
 532 \r
 533 \r
 534 \r
 535 \r
 536 <h3>Wishlist</h3>\r
 537 <ul>\r
 538 <li>Allow <code>&lt;?html2db attribute-name="<var>name</var>"\r
 539 value="<var>value</var>"?&gt;</code> at any position, to set arbitrary\r
 540 Docbook attributes on the generated element.</li>\r
 541 \r
 542 <li>Use a different technique from the <a href="#docbook-elements">fake\r
 543 namespace prefix</a> to name Docbook elements in the source, one that\r
 544 preserves the XHTML validity of the source file. For example, an\r
 545 option to transform <code>&lt;div class="db:footnote"&gt;</code> into\r
 546 <code>&lt;footnote&gt;</code>, or to use a processing instruction\r
 547 (<code>&lt;div&gt;&lt;?html2db classname="footnote"?&gt;</code>).</li>\r
 548 \r
 549 <li>Parse DC metadata from XHTML <code>html/head/meta</code>.</li>\r
 550 \r
 551 <li>Add an option to use <code>html/head/title</code> instead of\r
 552 <code>html/body/h1[1]</code> for top title.</li>\r
 553 \r
 554 <li>Allow an <code>id</code> on every element.</li>\r
 555 \r
 556 <li>Add an option to translate the XHTML <code>class</code> into a\r
 557 Docbook <code>role</code>.</li>\r
 558 \r
 559 <li>Preserve more of the whitespace from the source document &mdash; especially within lists and tables &mdash; in order to make it easier to debug the output document.</li>\r
 560 </ul>\r
 561 \r
 562 <h3>Support</h3>\r
 563 <p>This is a work in progress.  It serves my needs, but doesn't\r
 564 attempt to be much more general than that.  If you run into anything\r
 565 it can't handle, please send a note, or better yet, a patch, to <a\r
 566 href="mailto:steele@osteele.com">steele@osteele.com</a>.  I can't\r
 567 promise to address problems (I have a day job too), but knowing what\r
 568 people have run into will help me prioritize my work when I do have\r
 569 time to work on this.</p>\r
 570 \r
 571 \r
 572 \r
 573 \r
 574 <h3>Design Notes</h3>\r
 575 <h4 id="docbook-namespace">The Docbook Namespace</h4>\r
 576 <p>&html2db; accepts elements in the "Docbook namespace" in XHTML\r
 577 source.  This namespace is <code>urn:docbook</code>.</p>\r
 578 \r
 579 <p>This isn't technically correct.  Docbook doesn't really have a\r
 580 namespace, and if it did, it wouldn't be this one.  <a\r
 581 href="http://www.faqs.org/rfcs/rfc3151.html">RFC 3151</a> suggests\r
 582 <code>urn:publicid:-:OASIS:DTD+DocBook+XML+V4.1.2:EN</code> as the\r
 583 Docbook namespace.</p>\r
 584 \r
 585 <p>There are two problems with the RFC 3151 namespace.  First, it's long\r
 586 and hard to remember.  Second, it's limited to Docbook v4.1.2 &mdash;\r
 587 but &html2db; works with other versions of Docbook too, which would\r
 588 presumably have other namespaces.  I think it's more useful to\r
 589 <em>under</em>specify the Docbook version in the spec for this tool.\r
 590 Docbook itself underspecifies the version completely, by avoiding a\r
 591 namespace at all, but when mixing Docbook and XHTML elements I find it\r
 592 useful to be <em>more</em> specific than that.</p>\r
 593 \r
 594 <h3>History</h3>\r
 595 <p>The original version of &html2db; was written by <a\r
 596 href="http://osteele.com">Oliver Steele</a>, as part of the <a\r
 597 href="http://laszlosystems.com">Laszlo Systems, Inc.</a> documentation\r
 598 effort.  We had a set of custom stylesheets that formatted and added\r
 599 linking information to programming-language elements such as\r
 600 <code>classname</code> and <code>tagname</code>, and added\r
 601 a Table of Contents to chapter documentation and numbers to examples.</p>\r
 602 \r
 603 <p>As the documentation set grew, the doc team (John Sundman)\r
 604 requested features such as inter-chapter navigation, callouts, and\r
 605 index and glossary elements.  I was able to beat all of these back\r
 606 except for navigation, which seemed critical.  After a few days trying\r
 607 to implement this, I decided it would be simpler to convert the subset\r
 608 of XHTML that we used into a subset of Docbook, and use the latter to\r
 609 add navigation.  (Once this was done, the other features came for\r
 610 free.)</p>\r
 611 \r
 612 <p>During my August 2004 "sabbatical", I factored the general html2db\r
 613 code out from the Laszlo-specific code, refactored and otherwise\r
 614 cleaned it up, and wrote this documentation.</p>\r
 615 \r
 616 <h3>Credits</h3>\r
 617 <p>&html2db; was written by <a href="http://osteele.com">Oliver Steele</a>, as part of the <a href="http://laszlosystems.com">Laszlo Systems, Inc.</a> documentation effort.</p>\r
 618 \r
 619 </body>\r
 620 </html>