Academy of Web Specialists


Can XML be used to Describe META Content?

by Robin Nobles


During a CNET Web Builder Conference held in New Orleans, Daniel Austin addressed the limitations of HTML, taking a close look at XML and its ability to organize information. Following is a report on Austin's presentation and conclusions.

As with just about everything else on the Web, Austin's presentation began with a metaphor.

"The Web has been described as a pile of books laying on the floor in the middle of the library with no sense of organization and no means of finding the information that you need to find. There are endless numbers of competing standards and different ways of organizing information at the document level," Austin, CNET's Director of Research & Development, told those gathered to learn about Extensible Markup Language (XML), a language that has been creating quite a buzz among Web developers.

XML, the Extensible Markup Language, is a simplified subset of SGML, the Standard Generalized Markup Language. XML can describe objects such as web pages and email addresses and, most importantly, the relationships between pieces of information: META content.

"Information about information allows us to determine the value of that information. Without the META content, there's no way to determine the value of the information you're dealing with," Austin said.

XML itself is really a syntax language. It serves as a framework for defining more specific languages (often loosely called subsets) that describe particular kinds of data. For instance, if you want to describe an MSIE 4.0 channel, you'd use an XML-based language called CDF. If you want to describe the content rating of your site so that children won't be exposed to inappropriate content, you'd use a related W3C metadata format called PICS.

When we look at the family tree for XML, we find RDF, the Resource Description Framework, a META content language expressed in XML syntax that grew out of work at Apple Computer. It's a generic META content language, designed specifically to describe resources on the Internet and particularly on the Web.

CDF, the Channel Definition Format, was developed by PointCast and has been adopted by Microsoft for use in MSIE 4.0 to define channels for push technology on the Active Desktop.
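To make the idea of a channel description concrete, here is a minimal sketch in Python that reads a simplified, CDF-style file. The element names (CHANNEL, ITEM, TITLE, ABSTRACT) follow the general shape of CDF, but the sample is heavily trimmed, and every URL and title in it is made up for illustration. The helper name read_channel is hypothetical as well.

```python
import xml.etree.ElementTree as ET

# Simplified, CDF-style channel description.
# All URLs and titles are invented for this sketch.
CHANNEL_XML = """
<CHANNEL HREF="http://www.example.com/index.html">
  <TITLE>Example News</TITLE>
  <ABSTRACT>Daily headlines from an example site.</ABSTRACT>
  <ITEM HREF="http://www.example.com/story1.html">
    <TITLE>First story</TITLE>
  </ITEM>
  <ITEM HREF="http://www.example.com/story2.html">
    <TITLE>Second story</TITLE>
  </ITEM>
</CHANNEL>
"""


def read_channel(xml_text):
    """Return the channel title and a list of (item title, item URL) pairs."""
    root = ET.fromstring(xml_text.strip())
    items = [
        (item.findtext("TITLE"), item.get("HREF"))
        for item in root.findall("ITEM")
    ]
    return root.findtext("TITLE"), items


if __name__ == "__main__":
    title, items = read_channel(CHANNEL_XML)
    print(title)
    for item_title, href in items:
        print(" ", item_title, "->", href)
```

The point of the sketch is only that the same generic XML parser can read any such language; what makes the file a "channel" is the agreed-upon vocabulary of element names, not the syntax.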

Other subsets of XML in use include such things as XSL and VTML. XSL (Extensible Style Language) is similar to the cascading style sheets proposal. It uses the XML syntax and is actually based on DSSSL. XSL is a method of describing document styles in XML rather than in CSS syntax.

VTML (Visual Tool Markup Language) is used to design software and components for software.

Resource Description Framework

The purpose of RDF, the Resource Description Framework, is to describe web resources and the relationships between them. RDF is a new formulation based on MCF, the MetaContent Framework, which was developed at Apple's Research and Development Division.

RDF is intended to provide the same sort of META content description that the MCF did, only in the XML format. It's a generic language for the description of web sites and resources, or "META content for the masses."

The key point is that RDF is structured intelligently for information design. It uses directed labeled graphs to describe the relationships between data. In the simplest case, such a graph is nothing more than two objects with an arrow between them indicating the link.

RDF is intended to be read by browsers and used to describe web resources rather than software, although it could be adapted to describe software as well. It's more flexible and usable than CDF and is intended to be a superset of it, so all the functionality available in the CDF format is also available in RDF. Both are now competing proposals before the W3C, though it's unlikely that both will be approved.

RDF describes information sources as nodes, and then it draws arrows between those nodes to show the relationship between the information. It provides us with a description of objects at the site level rather than at the document level. One of the limitations that we've run into with HTML is that we can't describe things except at the document level. Neither the META tag nor the link tag provides a realistic framework for describing META content or information about the website. Thus, we have RDF.
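The node-and-arrow model can be sketched in a few lines of Python. This is not RDF syntax, only an illustration of the underlying graph: each arrow is stored as a (source, label, target) triple. The site, the labels (hasSection, hasPage, author, topic), and the function names are all hypothetical.

```python
from collections import defaultdict

# Each arrow in the graph is a (source node, label, target node) triple.
# The URLs, labels, and values below are invented for this sketch.
TRIPLES = [
    ("http://www.example.com/", "hasSection", "http://www.example.com/news/"),
    ("http://www.example.com/news/", "hasPage", "http://www.example.com/news/xml.html"),
    ("http://www.example.com/news/xml.html", "author", "Jane Doe"),
    ("http://www.example.com/news/xml.html", "topic", "XML"),
]


def build_graph(triples):
    """Group the outgoing arrows of every node under that node."""
    graph = defaultdict(list)
    for source, label, target in triples:
        graph[source].append((label, target))
    return graph


def describe(graph, node):
    """List the arrows that leave a single node."""
    return [f"{node} --{label}--> {target}" for label, target in graph.get(node, [])]


def reachable(graph, start):
    """Follow arrows from a start node to collect a site-level view."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(target for _, target in graph.get(node, []))
    return seen


if __name__ == "__main__":
    graph = build_graph(TRIPLES)
    for line in describe(graph, "http://www.example.com/news/xml.html"):
        print(line)
    print(len(reachable(graph, "http://www.example.com/")), "nodes in the site map")
```

Starting from the home page and following the arrows yields a description of the whole site, which is exactly the level that a per-document META tag cannot reach.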

Why is META Content Important?

Search engines essentially work by sending out crawlers that harvest information in a random way by following every link they come to, indexing the information, and sending it back to the search engine. This type of approach doesn't let the user effectively search for the information he's trying to find.

If every website had an up-to-date RDF file that described its organizational structure, the location of its files, and all recent changes, and if that file were submitted regularly to the search engines, the engines could easily combine, collate, index, and cross-reference the RDF files and generate a map of the Internet.
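As a rough sketch of that combining and cross-referencing step, assume each site submits its description as a list of (resource, label, value) triples. A search engine could then merge the lists into one index keyed by label and value. The two sites, their contents, and the build_index helper are hypothetical, and a real engine would of course need far more than this.

```python
from collections import defaultdict

# Hypothetical descriptions submitted by two different sites.
SITE_A = [
    ("http://a.example/xml-intro.html", "topic", "XML"),
    ("http://a.example/xml-intro.html", "modified", "1999-03-01"),
]
SITE_B = [
    ("http://b.example/rdf.html", "topic", "RDF"),
    ("http://b.example/rdf.html", "topic", "XML"),
]


def build_index(*site_descriptions):
    """Merge triples from several sites and index resources by (label, value)."""
    index = defaultdict(set)
    for triples in site_descriptions:
        for resource, label, value in triples:
            index[(label, value)].add(resource)
    return index


if __name__ == "__main__":
    index = build_index(SITE_A, SITE_B)
    # Every resource, on any site, that declares XML as a topic.
    for url in sorted(index[("topic", "XML")]):
        print(url)
```

Because the index is built from what the sites declare about themselves rather than from whatever a crawler happens to reach, a narrow query such as "topic is XML" returns only resources that were explicitly described that way.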

"This method would provide a more complete set of META information about the website to the search engines and would allow the user to define a narrow search category and actually get back meaningful results, something we have been missing," Austin explained. "To some extent, some of the search engines are actually going to attempt this in the very near future."

Personal Harvesting of Information

Another concept for using XML and RDF is for the personal harvesting of information. One can easily imagine an extension to the Aurora applet where it would go out, read the map of maps at the search engine, harvest only those web resources that are of interest to you, and bring them back on a regular basis.

Aurora, a prototype applet introduced by Netscape, provides RDF-based desktop integration for files and applications. The Aurora applet represents a user-customizable interface to information. The RDF file that controls the applet provides a map of the user's information space and can be customized to reflect the user's unique viewpoint. Through this applet, you can build a map of your bookmarks, an information space that is of interest to no one but you.

The structure of the information doesn't change, just the user's viewpoint, which means it's very flexible. You can easily add items to or remove them from the applet, giving you a degree of customization that you don't have with any other sort of information organizer.

Of course, this is essentially the aim of push technology that we're seeing from Netscape, PointCast, and others. The only difference is that instead of content providers actually providing push sites, the harvester would go out and harvest only the information that matches the user's criteria and bring it back on a regular basis.
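The harvesting idea can be sketched as a simple filter over such a map. The sketch below assumes the search engine exposes entries that carry a URL, a set of topics, and a last-modified date in year-month-day form; the entries, field names, and the harvest function are all hypothetical and do not reflect how Aurora itself works.

```python
# Hypothetical "map of maps" entries that a search engine might expose.
SITE_MAP = [
    {"url": "http://a.example/xml-intro.html", "topics": {"XML"}, "modified": "1999-03-01"},
    {"url": "http://b.example/rdf.html", "topics": {"RDF", "XML"}, "modified": "1999-04-15"},
    {"url": "http://c.example/sports.html", "topics": {"sports"}, "modified": "1999-04-20"},
]


def harvest(entries, interests, since):
    """Return URLs that match the user's interests and changed on or after `since`.

    Dates are year-month-day strings, so plain string comparison orders them.
    """
    return [
        entry["url"]
        for entry in entries
        if entry["topics"] & interests and entry["modified"] >= since
    ]


if __name__ == "__main__":
    # A user interested in RDF who last harvested on April 1, 1999.
    for url in harvest(SITE_MAP, {"RDF"}, "1999-04-01"):
        print(url)
```

Run on a schedule, a filter like this would bring back only the new material that matches the user's criteria, with the selection made on the user's side rather than by the content provider.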

Problems with the Concept?

Content providers such as CNET build sites that have a clear focus and purpose, and they want users to visit and view those sites in their entirety rather than harvesting them in bits and pieces and retrieving only the information they want.

In a lot of cases, these new technologies, especially XML and META content technologies, give the user a means of easily accessing the information, but they don't give the content provider any built-in method of receiving compensation for having gone to the trouble of providing that content.

Should web developers be coding in XML now?

Not according to Austin.

"What you need to know is that it's coming down the pipe, and to have a grasp of the basic ideas of META content, XML, and RDF so you can be aware of what's happening in the META content world. META content really makes your information a lot more valuable, even if you don't supply it through XML," he said.

Reference URLs

Netscape Aurora

Information Design with XML and RDF by Daniel Austin

Sample XML File Using CDF to Define a Channel

The SGML/XML Web Page

XML

RDF Model and Syntax

Published in Web-Vantage.

 ©1999-2000 Academy of Web SpecialistsTM. All rights reserved.
418 Main St. #9, Half Moon Bay, CA 94019