RDF Metadata for End-to-end Content

What do you want to know today?

Too much data, or not enough?

How are we to deal with the massive volumes of data generated by and available using Internet and Web technologies? There is much too much for a single person to use meaningfully. And much of it is biased, of dubious content, or just plain wrong.

Conventional wisdom: build better search engines

Many think the main problem is to have better tools to find knowledge needles in the data haystack. This would certainly be helpful, but this may be solving a problem and overlooking a bigger opportunity.

But, with enough processing, more data from more sources can:

yield information not otherwise available
increase the reliability/accuracy of what is discovered
uncover new relationships

Security

The traditional view of computer security is one of locks and keys. But, for e-services, the real issues are those more like “Would you buy a used car from this person?”.

Evaluating a range of information:

The need is not only for a signed assurance or guarantee, but also to evaluate a range of information to help manage risks. Also to assess trust, and to manage conflicts between different sources of information.

According to Claude Shannon, information is that which reduces uncertainty, and a goal is to find or deduce this from the raw data available.

Need to build trust

They say “security makes trust work”, but...

... without trust there can be no security

Sources of data and information

What are the various sources of data and information that we can bring together as part of a wider knowledge handling and development strategy?

Email and other messages

Both the content and the protocol elements of email transfers convey information.

Web (global hyperlinked documents and data)

Again, both content and protocol headers. In some cases, the identity of the requester may also be significant.

Files (locally stored data)

Local file stores contain valued data. Protocol-related metadata is not available, but the fcat it is stored suggests some value.

Applications

Applications create or modify data using user-supplied information. If accessible, locally created application data is likely to be very relevant.

Words: word processing, text editors, ...
Numbers: spreadsheet, finance, ...
Pictures: drawings, images, ...
Combinations: database, PIMs, ...
Multimedia ...
etc.

Building blocks

These are components that we might employ to combine information from thje various sources identified.

Protocol handlers:

MailSweeper, WebSweeper, ...

Content analysers:

MIMEsweeper, LEX, PornSweeper, ...

Storage and indexing:

Archivist, ...

Information Integration:

WTFi2

Some pieces are available, but we are still evolving the technologies

RDF for metadata

RDF, the Resource Descripion Framework, is a W3C recommendation for describing metadata, that is data about data. In W3C it is being used to describe web resources (e.g. PICS), and also the characteristics of participating parties in the WWW (e.g. CC/PP, P3P). It is also being developed as a basis for communication between software agents (DAML: DARPA Agent Markup Language), and developing ideas of general knowledge representation. These are parts of a vision known as the “Semantic Web”: one which contains machine processable (“understandable”) information as well as human-readable documents.

The underlying model of RDF is based on a directed labelled graph. Nodes and arcs are labelled using URIs.

RDF syntax is based on XML. At first approach, the syntax of RDF is complex and confusing, but this is largely due to the way it is presented in the RDF specification. In practice, many intuitive XML formats can be mapped to an RDF-compliant form very easily. But RDF is both more regular and more flexible than raw XML, making it easier to write general purpose applications that deal with the information content of data, as well as its syntax.

End-to-end architecture

"There's a freedom about the Internet: As long as we accept the rules of sending packets around, we can send packets containing anything to anywhere." [Berners-Lee]

The Internet has been the basis of an explosion in communication services -- why?

End-to-end architecture:

In the Internet, communication services are defined by the users, the parties who connect to the network, not the network provider

Service provision is driven by user needs and desires

The IP datagram is basis of all Internet comm services.

IP datagram sent between any pair of endpoints; hence end-to-end architecture. It is (in theory, and the absence of firewalls) communicated transparently between any pair of Internet-connected systems. (Netork address translation is evil because it breaks this important architectural model; hosts subjected to this are second class citizens on the Internet. Firewalls have a similar effect, but at least they are under some form of administrative control by the end user.)

This approach separates the infrastructure from service: service additions don’t need infrastructure changes.

End-to-end applied to content:

Look for separation of information infrastructure (information formats) from service definition (applications).

RDF provides basic building blocks of information formats, that can, in theory, be applied to any form of information, yet maintains a simple consistent underlying structure for which common support tools can be developed. It might be regarded as the IP datagram of information representation.

Users can decide what to do with that information, unfettered by the design of the application that created it. Alternative tools can be brought to bear, and an independent, multivendor market in such tools can be created.

RDF may underpin an end-to-end architecture for content

Based on the simple concept of a “triple”: a statement of the form “Subject --predicate--> Object” or “predicate(Subject,Object)”.

But how to get there?

Multiple data sources in common format

RSS web site summary

Dublin core bibliographic information

... and many more works in progress ...

All using RDF. Common RDF applications may access and combine this information.

Start with simple information integration from a few sources

This may not fully exploit the potential of RDF, but it costs little to design XML formats to be RDF compliant.

Expand the range of sources

Extend this idea across a family of applications that use XML as a data interchange format.

As the “knowledge base” grows, introduce smarter analysis (e.g. expert systems, etc.)

We are doing research and experimental prototyuping work for an RDF-driven expert system.

Initially: metadata; eventually: all data

Initially, the focus will be on descriptions applied to data, or other information about the way data is being used. But in due course, more and more data formats may be converted to use RDF as their primary representation. Then RDF applications can access the content without having to rely on an intermediary processor to extract the information to an RDF-based form.

References

http://public.research.mimesweeper.com/RDF/RDFBriefIntroduction.htm

http://www.w3.org/TR/REC-rdf-syntax

http://www.w3.org/TR/rdf-schema

http://public.research.mimesweeper.com/IETF/Messaging/draft-klyne-message-rfc822-xml-01c.txt

http://www.w3.org/DesignIssues/Notation3.html

http://www.w3.org/DesignIssues/

http://www.tmdenton.com/netheads.htm -- looks at the issues of centre-defined vs edge-defined communication networks.

ftp://ftp.isi.edu/in-notes/rfc1958.txt -- a description of Internet architecture

ftp://ftp.isi.edu/in-notes/rfc2775.txt -- discussion of the Internet architecture in recent times

For feedback please see: <http://public.research.mimesweeper.com/index.html#Contact>
Last updated: 25-Jun-2001, GK.