Using XML Schemas and DTDs Together
By: Greg Watson
Oct. 29, 2003 12:00 AM
XML Schemas are quickly becoming the industry standard that Document Type Definitions (DTDs) used to be. Much has been written about the advantages of XML Schemas over DTDs. Indeed, Schemas do offer advantages. However, with all the focus on the need to transition from DTDs to Schemas, it seems that little attention has been paid to how XML Schemas and DTDs can be used together.
This article shows how to validate an XML document against an XML Schema and a DTD at the same time, and how to transition from using DTDs exclusively to using both XML Schemas and DTDs. This type of transition is especially important for organizations that have invested heavily in DTDs and now have large document inventories based on them. The XML Schema and DTD in this article are a small version of the DocBook DTD standard - a modular approach to building DTDs that has long been a standard for SGML developers. Although the example is presented in a recognizable DocBook format, it could easily be adapted to work with XML Schemas and DTDs not based on the DocBook standard. The example could also be adapted to work in a development environment in which a "full version" of DocBook might be used.
A first step to consider when moving to XML Schemas is whether to write the schema from scratch, generate it from an XML document, or convert it from an existing DTD. Of these options, the first is probably the least desirable unless development time is not an issue, which is rarely the case. A good alternative to writing a schema from scratch is to generate it automatically from an existing XML document. To generate a schema from an XML document using XMLSPY, open the XML document, select "DTD/Schema," and then select the "Generate DTD/Schema" option.
If an XML document on which a schema can be based is not available, consider generating one from an existing text file. One way to locate text-to-XML conversion programs is to search for "text to XML" on the www.oreilly.com site. Unidex also offers a free trial download of XML Convert - a Java-based program for converting text files to XML (www.unidex.com/download.htm). Also, the example program ConvertToHTML in Chapter 14 of K.N. King's book Java Programming: From the Beginning can easily be modified to convert text files to basic XML.
In many cases though, organizations that are considering moving to XML Schemas are likely to be migrating to them from existing DTDs. The question then is how to convert these existing DTDs to XML Schemas. There are at least two known methods for converting DTDs to XML Schemas. One method is to use the Perl script DTD2Schema, which is available at www.w3.org/2000/04/schema_hack. Unlike many XML conversion utilities written in Perl, this one does not require the installation of any Perl modules. To convert a DTD to a schema using DTD2Schema, use the following command: perl DTD2Schema.pl file.dtd > file.xsd. This script will work well for converting a DTD to a schema if the DTD is contained within a single file and is not overly complex. If a DTD comprises multiple files, then converting it to a schema can best be done by using a tool such as TIBCO's TurboXML (also known as XML Authority), which is available for trial download at www.tibco.com/solutions/products/extensibility/turbo_xml.jsp.
To convert a DTD to a schema using TurboXML, use the following
Note that after completing these steps, there is no "export complete" prompt to the user.
The converted schema will be available in the user-specified output location almost immediately after performing these steps. TurboXML works quite well in converting even the most complex DTDs to XML Schemas.
In transitioning from DTDs to Schemas, there are definitely advantages in using conversion tools such as TurboXML. However, in using these conversion tools, be aware of two issues. First, organizations want to transition to schemas primarily because schemas allow for restricting data based on data types - integer, float, and decimal, for example. The problem in using an automated conversion tool for converting DTDs to schemas is that data types are not automatically added to the converted schema. They have to be added manually after the fact. This often goes unmentioned in discussions of automatically converting DTDs to schemas. Second, if the DTD contains text or document entities, an automated DTD-to-schema conversion tool will not convert these. The conversion tool will simply skip these entities. If they no longer need to be used in XML authoring, then this is not an issue. If there is a need to continue using them, however, two options are available. One is to include the entities in an internal DTD located in the top portion of each XML document. The other choice is to include them in an external DTD. In either case, the XML document will need to be validated against both a DTD and an XML Schema simultaneously if it contains text or document entity references.
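The first issue can be illustrated with a minimal sketch of one declaration before and after manual typing. The element name Quantity is hypothetical, and the string-typed declaration shown in the comment is only what a converter can be expected to produce from #PCDATA content, not verified output from DTD2Schema or TurboXML.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- DTD source (hypothetical): <!ELEMENT Quantity (#PCDATA)> -->

  <!-- Expected converter result: a DTD carries no numeric types, so the
       content can only be mapped to a string:
       <xs:element name="Quantity" type="xs:string"/> -->

  <!-- After manual editing: the element now accepts only non-negative integers. -->
  <xs:element name="Quantity" type="xs:nonNegativeInteger"/>

</xs:schema>
```

Until a type like this is added by hand, the converted schema constrains the data no more tightly than the DTD did.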
If a task entails managing XML files with a relatively small number of entities that are used in a predictable way in XML authoring, using an internal DTD may be a good idea. Even if there are several entities to be used in document authoring, it may be advisable to use an internal DTD to define entities that may be unique to a particular XML document. To validate an XML document against an internal DTD and a schema, the following example code would need to be placed at the top of the XML file:
In this example, the internal DTD contains a single entity - a .tif image file defined as file01.tif. Any number of entities could be defined in this way within the brackets [ ] at the top of the XML file. The example also contains a reference to an XML Schema named MySchema.xsd.
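A prolog of the kind described above might look like the following minimal sketch. Only file01.tif and MySchema.xsd come from the article; the root element name WhitePaper, the entity name fig01, the TIFF notation, and the use of xsi:noNamespaceSchemaLocation (which assumes the schema has no target namespace) are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Internal DTD subset: declares one unparsed entity for a TIFF image.
     WhitePaper, fig01, and TIFF are illustrative names. -->
<!DOCTYPE WhitePaper [
  <!NOTATION TIFF SYSTEM "image/tiff">
  <!ENTITY fig01 SYSTEM "file01.tif" NDATA TIFF>
]>
<!-- The schema is referenced from the root element, not from the DOCTYPE. -->
<WhitePaper
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="MySchema.xsd">
  <!-- document content goes here -->
</WhitePaper>
```

How a DOCTYPE that declares only entities is treated depends on the parser: some report every element as undeclared under strict DTD validation, so the DTD may need to be loaded for entity resolution while structural validation is left to the schema.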
Probably the most common development scenario would be a need to validate an XML document against an internal DTD, an external DTD, and a schema. Example code for doing this is as follows:
In this example, there are really three separate validations taking place. The XML document validates against an external DTD (my.dtd), an internal DTD with an entity declaration (file01.tif), and an XML Schema (MySchema.xsd). The external DTD in this case would contain entity declarations for any text or document entities. The schema in this case would contain element and attribute declarations as well as any data type restrictions on those elements and attributes.
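The three-way combination described above could be sketched as follows. The file names my.dtd, file01.tif, and MySchema.xsd are the article's; the root element WhitePaper, the entity name fig01, and the TIFF notation are hypothetical, and the schema is assumed to have no target namespace.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- External subset (my.dtd) plus an internal subset in brackets.
     WhitePaper, fig01, and TIFF are illustrative names. -->
<!DOCTYPE WhitePaper SYSTEM "my.dtd" [
  <!NOTATION TIFF SYSTEM "image/tiff">
  <!ENTITY fig01 SYSTEM "file01.tif" NDATA TIFF>
]>
<WhitePaper
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="MySchema.xsd">
  <!-- document content, including references to entities declared in my.dtd -->
</WhitePaper>
```

If the same entity name is declared in both subsets, the declaration in the internal subset takes precedence, which is what makes the internal DTD suitable for document-specific entities.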
In looking at the example in the sample code (available below), consider the needs of an Internet-based company offering online access to technical white papers. In our fictional example, some of the white papers on the company site are offered free of charge, and other papers can be downloaded from the site for a fee. To better manage the data available on the site, the company has built XML files to store information about each white paper. To manage these XML files, the company has designed an XML Schema, an internal DTD, and an external DTD. Figure 1 is an overview of the schema and DTDs the company has designed.
The schema consists of the following components (which will be described in more detail momentarily): top-level "subject elements," DocBook "information pool" elements, the XML Exchange table model elements, and subject elements added to the XML Exchange table model.
In our example schema, there are four top-level subject elements - CopyrightNotice, Disclaimer, Abstract, and BiblioData. On our fictional site, the white papers offered free of charge contain a CopyrightNotice informing the reader that permission is granted to copy and distribute the paper to anyone, anywhere. The white papers on our site that can be downloaded for a fee contain a CopyrightNotice informing the reader that unauthorized distribution of the paper is punishable by a $1,000,000 fine and 10 years of hard labor in San Quentin. Each white paper contains a disclaimer protecting the company from damages resulting from loss of data or revenue as a result of using the ideas in the paper. Each white paper also contains an abstract summarizing the contents of the paper. Associated with each paper is a BiblioData element, which contains Author, Title, PaperDate, PaperNumber, PaperSubject, FileSize, and NumberOfPages elements.
The seven elements within the BiblioData element will be stored within the entry element of the XML Exchange table model - a subset of the CALS table model. By using subject elements within the entry element of the XML Exchange table model, we can more easily constrain the table data using a schema and more easily search the data using XPath queries. One additional advantage to using a CALS-based table is that the standard elements (such as entry and row) can be inserted automatically into a document using an XML editor such as Epic, which is available from Arbortext at www.arbortext.com.
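As a rough illustration of subject elements inside table entries, the fragment below places three of the seven elements in one row. The column layout, the nesting of the table directly inside BiblioData, and all sample values are assumptions, not the article's sample code.

```xml
<BiblioData>
  <!-- Hypothetical layout: one row per white paper, one entry per subject element. -->
  <table>
    <tgroup cols="3">
      <tbody>
        <row>
          <entry><Title>Sample Paper Title</Title></entry>
          <entry><PaperNumber>PNUM-0001</PaperNumber></entry>
          <entry><NumberOfPages>12</NumberOfPages></entry>
        </row>
      </tbody>
    </tgroup>
  </table>
</BiblioData>
<!-- Example XPath 1.0 queries against this structure:
       //entry/PaperNumber                 selects every paper number
       //row[entry/NumberOfPages > 10]     selects rows for papers longer than 10 pages -->
```

Because each value sits in its own named element rather than in a bare entry, the queries can address the data by meaning instead of by column position.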
To help constrain the data in the BiblioData element, we assign XML Schema data types to each of the seven elements. The Author, Title, and PaperSubject elements are assigned a text data type. The FileSize element has a decimal type and the NumberOfPages element has an integer type. The PaperDate must match a year, month, day pattern (i.e., YYYYMMDD) and the PaperNumber must match the pattern of PNUM- followed by four digits.
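These constraints might be declared along the following lines. This is a sketch rather than the article's schema: it assumes xs:string for the "text" type, declares the elements globally for brevity, and omits Author because that element also carries an attribute and therefore needs a complex type.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Plain text elements (xs:string is an assumed mapping for "text"). -->
  <xs:element name="Title" type="xs:string"/>
  <xs:element name="PaperSubject" type="xs:string"/>

  <!-- Numeric elements use built-in types. -->
  <xs:element name="FileSize" type="xs:decimal"/>
  <xs:element name="NumberOfPages" type="xs:integer"/>

  <!-- YYYYMMDD: four-digit year, month 01 to 12, day 01 to 31.
       The pattern does not check the number of days in each month. -->
  <xs:element name="PaperDate">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:pattern value="\d{4}(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>

  <!-- PNUM- followed by exactly four digits, for example PNUM-0042. -->
  <xs:element name="PaperNumber">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:pattern value="PNUM-\d{4}"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>

</xs:schema>
```

XML Schema patterns are implicitly anchored, so a value must match the whole expression; no leading ^ or trailing $ is needed.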
Since many of the white papers have multiple authors, we assign a numeric value of 100 or 700 to the "type" attribute of the Author element. The value 100 identifies an author as the primary author of a paper, and the value 700 identifies an author as a coauthor of a paper. The numeric attribute values of 100 and 700 are based on the Library of Congress cataloging record known as the MARC record. For many years, the MARC record has been a standard for cataloging library materials in machine-readable format.
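One way to express this restriction is an enumeration on the attribute, as in the sketch below. Making the attribute required and basing it on xs:integer are assumptions; the article states only that the value is numeric and is either 100 or 700.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Author holds text content plus a "type" attribute, so it needs a
       complex type with simple content. -->
  <xs:element name="Author">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <!-- use="required" is an assumption for this sketch. -->
          <xs:attribute name="type" use="required">
            <xs:simpleType>
              <xs:restriction base="xs:integer">
                <xs:enumeration value="100"/> <!-- primary author -->
                <xs:enumeration value="700"/> <!-- coauthor -->
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>

</xs:schema>
```

With this declaration, an instance such as an Author element with type="100" validates, while any other type value is rejected.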
Recently, the MARC record has been converted to XML format. The XML Schema for the MARC record can be downloaded from the Library of Congress site at www.loc.gov/standards/marcxml/schema/MARC21slim.xsd. For more information on the MARC record in general, see http://lcweb.loc.gov/marc/umb.
In addition to the XML Exchange table model, our example schema also contains elements - such as para - that are part of the information pool module of the DocBook DTD. The information pool elements are commonly used elements in many XML document types. Our example schema will contain a handful of these commonly used elements.
The external DTD in our example contains ISO text (or character) entities and document entities. The text entities allow a document author to insert special characters or symbols into an XML document. Our example DTD contains these entities "as is" from the DocBook DTD.
Added to the DocBook entities are modules for document entities. In our example we have three document entities, which reference text file information for CopyrightNotice and Disclaimer information associated with each white paper. Declaring document entities in an external DTD allows text files to be reused in multiple XML documents as a user has a need to reference the data.
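An external DTD of this kind might contain declarations like the following sketch. The public identifier follows the convention used in the DocBook XML distribution; the relative path ent/iso-num.ent, the three entity names, and the three text file names are hypothetical.

```xml
<!-- my.dtd: external DTD subset -->

<!-- ISO character entities, pulled in through a parameter entity in the
     DocBook manner. The system path is an assumed local location. -->
<!ENTITY % ISOnum PUBLIC
  "ISO 8879:1986//ENTITIES Numeric and Special Graphic//EN//XML"
  "ent/iso-num.ent">
%ISOnum;

<!-- Document entities: each one points at a reusable text file.
     Entity names and file names are illustrative. -->
<!ENTITY CopyrightFree  SYSTEM "copyright-free.txt">
<!ENTITY CopyrightPaid  SYSTEM "copyright-paid.txt">
<!ENTITY DisclaimerText SYSTEM "disclaimer.txt">

<!-- In a document, a reference such as &CopyrightFree; placed inside
     CopyrightNotice is replaced by the contents of the text file. -->
```

Because these are parsed entities, each text file must be well-formed as XML content, meaning any literal < or & characters in it must be escaped.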
Now that we've described our XML Schema and DTDs, we can view the results in XMLSPY and Internet Explorer. Figure 2 provides a view of validating a sample XML document in XMLSPY using our example schema and DTDs. Figure 3 provides a view of this same XML document in Internet Explorer.
I hope that the example in this article is helpful. Any questions about the sample code may be directed to me via e-mail; I'll be happy to help.