Inside the ePub format: Xhtml What?

In the inaugural post for this Inside the ePub format series, I demonstrated how to unlock the contents of an ePub file and have a peek inside. With those basics out of the way, I am going to walk you through the most important part of an ePub file: your content!

When writing a book, you use a text editing software application, like Microsoft Word, Apple’s Pages, Adobe’s Framemaker, Madcap Software’s Flare, the open source Sigil, or Literature & Latte’s Scrivener, to name a few.

Whatever application you use to write your book, it eventually needs to be output to a format that is compatible with ePub. The name of that format is…

XHTML (my attempt at a layperson’s definition)

When you open your browser and go to any site, the text you are reading is contained within one of two formats. The first is html, and the other is xhtml (okay maybe not every site but nearly every site).

Html stands for Hypertext Markup Language. In the early days, nearly every website was built from html pages. These pages define the content that appears on your screen. Here is an example based on the opening of Pride and Prejudice:

<!DOCTYPE html>
<title>Pride and Prejudice</title>

<h1>Chapter 1</h1>
<p><strong>It</strong> is a truth universally acknowledged, that...</p>

As you can see, an html document includes both text and markup. The markup is quite noticeable and is always placed inside angle brackets. In html, we call these pieces of markup tags. In the sample, there is an h1 tag that tells the browser to display the words Chapter 1 using the Heading 1 style (h1 is short for heading 1). Note how there is an opening tag, like <h1>, and a closing tag that looks nearly identical but has a slash in it, like </h1>.

When you are writing a book, most software applications use markup that looks very much like this behind the scenes, but they hide that detail from you.

So what is xhtml? Xhtml stands for Extensible Hypertext Markup Language. The biggest problem with html is that it can be a little loose in how you mark up the document. Standards bodies do not like that because it can confuse the browser and cause lots of problems.

Take a look at this example code snippet:

<h1>Chapter 1
<p><strong>It</strong> is a truth universally acknowledged, that...</p>

See how I did not close the <h1> tag with an </h1> tag? In html, the browser would try to display it anyway. The page may look weird, and every browser might treat this slightly differently, but the page will still load.
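To see this forgiving behavior outside a browser, here is a minimal sketch using Python's built-in html.parser module. The TagLogger class and the unclosed_tags helper are names I made up for illustration, and the tag counting is deliberately crude (it would also flag tags such as <br> that html never closes). It only shows that a lenient html parser reads the snippet without complaining.

```python
from collections import Counter
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: records every opening and closing tag it sees."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)


# The snippet from above, with the missing </h1> tag.
SNIPPET = (
    "<h1>Chapter 1\n"
    "<p><strong>It</strong> is a truth universally acknowledged, that...</p>"
)


def unclosed_tags(markup):
    """Return the names of tags opened more often than they were closed."""
    logger = TagLogger()
    logger.feed(markup)  # a lenient parser: no exception for the missing </h1>
    logger.close()
    leftover = Counter(logger.opened) - Counter(logger.closed)
    return sorted(leftover.elements())


print(unclosed_tags(SNIPPET))
```

The parser never raises an error here. It is up to whoever consumes the tags to notice that h1 was never closed, which is roughly the position a browser is in when it guesses how to display the page.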

Xhtml is nearly identical to html, but it requires strict adherence to the standards. Going back to my example where there is no closing </h1> tag, a browser that treats the file as xhtml would refuse to display the page and show an error instead. With xhtml, you can count on the content within the file being properly formed.
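The strictness can be sketched with Python's built-in XML parser, since xhtml follows XML rules. This is an illustration under a few assumptions: the is_well_formed helper is a name I invented, and I wrapped both snippets in a <body> tag because an XML parser needs a single outer element. It checks only that tags are properly opened and closed, not that the document follows every xhtml rule.

```python
import xml.etree.ElementTree as ET

# The earlier sample with every tag closed (wrapped in <body> for the parser).
GOOD = (
    "<body><h1>Chapter 1</h1>"
    "<p><strong>It</strong> is a truth universally acknowledged, that...</p>"
    "</body>"
)

# The same sample with the closing </h1> tag left out.
BAD = (
    "<body><h1>Chapter 1"
    "<p><strong>It</strong> is a truth universally acknowledged, that...</p>"
    "</body>"
)


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup, else False."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(GOOD))
print(is_well_formed(BAD))
```

The first snippet is accepted and the second is rejected outright, which mirrors the "error instead of a page" behavior described above.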

There are some other things that set xhtml apart as well. For example, xhtml encourages you to keep formatting out of the file and place that detail in a css file instead. More on css in another article.

How does this relate to ePub?

An eBook reader is essentially a browser that displays web pages. If you create a poorly formed document (like that missing </h1> tag), the eBook reader will not know how to react. That is why the IDPF’s ePub 3.x format requires documents be in the xhtml format.
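Because an ePub file can be opened like a zip archive (as the first post in this series showed), a rough pre-flight check is possible before the book ever reaches an eBook reader. The sketch below is an assumption-laden illustration, not a real ePub validator: the malformed_xhtml_files function name is hypothetical, it picks content files purely by file extension, and it only tests whether each file parses as well-formed XML.

```python
import xml.etree.ElementTree as ET
import zipfile


def malformed_xhtml_files(epub_path):
    """Return {file name: parser message} for content files that fail to parse.

    epub_path may be a path to an .epub file or an open binary file object.
    """
    problems = {}
    with zipfile.ZipFile(epub_path) as book:
        for name in book.namelist():
            # Assumption: content documents are identified by extension only.
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                try:
                    ET.fromstring(book.read(name))
                except ET.ParseError as err:
                    problems[name] = str(err)
    return problems
```

Calling malformed_xhtml_files("mybook.epub") (a made-up file name) returns an empty dictionary when every content file parses cleanly. Any entry that does appear names a file that deserves a closer look, such as one with a missing closing tag.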

How do I get my document into xhtml format?

Depending on the software application you are using, it may only output to html, or it might output to ePub format and save the content as xhtml for you. If your book is in html format, there is very little you have to do so long as matching opening and closing tags exist throughout your document. You might also have to remove some html formatting and move it into a css file.

How can I verify my xhtml document?

Whether you are using a software application to produce the xhtml or you are doing it yourself, you should always verify the xhtml document is properly formatted. The best way to test your document is to run it through the W3C Markup Validation Service. The W3C is the consortium that maintains the standards for html and xhtml, so the validator will be pretty accurate.

Important: I wrote the eBook Toolkit because I wanted to output all my Microsoft Word files to xhtml. I ran my content through the W3C validator, and it passed with flying colors. Later, when I built the ePub file, a different ePub validator found additional problems. The point I am making here is that the W3C validator is not perfect, but it will get you 80-90% of the way there.

What next?

To learn more about xhtml, see the Wikipedia article on XHTML.

In the next article, I will show you how xhtml works in conjunction with css files.

