Web Science

Creating a Science of the Web

Tim Berners-Lee¹, Wendy Hall², James Hendler³, Nigel Shadbolt² and Daniel J. Weitzner¹

1 - Massachusetts Institute of Technology

2 - Southampton University

3 - University of Maryland (Corresponding author)

Since its inception the World Wide Web has changed the ways scientists communicate, collaborate, and educate. There is, however, a growing realization among researchers across a number of disciplines that a clear research agenda aimed at understanding the current, evolving and potential Web is needed to assure continued growth. If we want to model the Web, if we want to understand the architectural principles that have provided for its growth, and if we want to be sure that it supports the basic social values of trustworthiness, privacy, and respect for social boundaries, then we must chart out a research agenda that targets the Web as a primary focus of attention.

When we discuss an agenda for "Web Science," we use the term science in two ways. Physical science analyses the natural world and tries to find microscopic laws that, extrapolated to the macroscopic, would generate the behavior observed. Computer Science, by contrast, though partly analytic, is principally synthetic: it constructs new languages and algorithms that provide new, sought-for behaviors on the part of the computer. Web Science is a combination of these two. The Web is an engineered space created via formally specified languages and protocols. However, as humans are the creators of Web pages and the links between them, their interactions form emergent patterns in the Web at a macroscopic scale. These human interactions are, in turn, governed by social conventions and laws. Web Science, therefore, must be inherently interdisciplinary; its goal is both to understand the growth of the Web and to create approaches that allow new, more powerful, and more beneficial patterns to occur.

Unfortunately, such a research area does not yet exist in a coherent form. Within Computer Science, Web-related research has largely focused on information retrieval algorithms and on algorithms for routing information through the underlying Internet. Outside of computing, researchers grow ever more dependent on the Web, but they have no coherent agenda for exploring its emerging trends, nor are they fully engaged with the emerging Web research community in ways that would focus that community on scientists' needs.

A recent workshop invited a number of leading researchers[i] to discuss the scientific and engineering problems that form the core of Web Science. Participants discussed emerging trends on the Web and debated the specific types of research needed to exploit the opportunities that arise as new media types, data sources and knowledge bases become "webized," as Web access becomes increasingly mobile and ubiquitous, and as the need grows for privacy guarantees and control of information on the Web.

The workshop covered a wide range of topics across technical and legal disciplines. For example, there has been research on understanding the structure and topology of the Web (1,2) and the laws of connectivity and scaling to which it appears to conform (3-5). This work leads some to argue that the development of the Web has followed an evolutionary path, suggesting a view of the Web in terms of biological techniques that model it as a populated ecology. These analyses also showed the Web to have scale-free and small-world networking structures, areas that have largely been studied by physicists and mathematicians using the tools of complex dynamical systems analysis.
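As an illustration of how a scale-free structure can emerge from growth, the following is a minimal sketch of a preferential-attachment process in the spirit of the model in reference 3. It is not taken from the workshop or from the cited papers: the function names, the graph size, and the number of links per new node are assumptions chosen only for the example. New nodes link to existing nodes with probability proportional to the degree those nodes already have, and a few heavily linked hubs appear while most nodes keep a small degree.

```python
import random
from collections import Counter


def preferential_attachment(n_nodes, links_per_node, seed=0):
    """Grow a graph in which each new node links to existing nodes with
    probability proportional to their current degree (illustrative only)."""
    rng = random.Random(seed)
    # Start from a small, fully connected core.
    core = links_per_node + 1
    edges = [(i, j) for i in range(core) for j in range(i)]
    # Each node appears in this list once per edge endpoint, so a uniform
    # draw from the list is a degree-proportional draw over nodes.
    endpoints = [node for edge in edges for node in edge]
    for new in range(core, n_nodes):
        targets = set()
        while len(targets) < links_per_node:
            targets.add(rng.choice(endpoints))
        for target in targets:
            edges.append((new, target))
            endpoints.extend((new, target))
    return edges


def degree_counts(edges):
    """Count how many edges touch each node."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


# Hypothetical parameters: 2000 nodes, 2 links added per new node.
edges = preferential_attachment(n_nodes=2000, links_per_node=2)
degree = degree_counts(edges)
ranked = sorted(degree.values())
median_degree = ranked[len(ranked) // 2]
max_degree = ranked[-1]
print("median degree:", median_degree, "max degree:", max_degree)
```

Under these assumptions the largest degree is many times the median, which is the heavy-tailed signature that the scale-free analyses describe. The real Web graph is directed and far less regular than this toy process.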

The need for better mathematical modeling of the Web is clear. Take the simple problem of finding an authoritative page on a given topic. Conventional information retrieval techniques proved insufficient at the scale of the Web. However, just as quantum states turn out to be eigenvectors of a physical system, human topics of conversation on the Web turn out to be eigenvectors of the matrix of links (6). The mathematics of information retrieval and structure-based search will certainly continue to be a fertile area of research as the Web itself grows. However, approaches to developing a mathematical framework for modeling the Web vary widely, and any significant impact will, again, require a truly interdisciplinary approach. For example, the process-oriented methodologies of the formal systems community, the symbolic modeling methodologies of the AI and semantics researchers, and the mathematical methods used in network analyses are all relevant, but no current mathematical model can unify all of these.
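The eigenvector idea can be made concrete with a small sketch. The code below approximates the dominant eigenvector of a damped link matrix by power iteration, which is the general technique behind the link-based ranking described in reference 6. It is a simplified illustration rather than that system's implementation: the function name, the four-page toy graph, the damping value of 0.85, and the even redistribution of score from pages without outgoing links are all assumptions made for the example.

```python
def link_rank(links, damping=0.85, iterations=100):
    """Approximate the dominant eigenvector of a damped link matrix by
    power iteration. `links` maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline score.
        updated = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            outgoing = links.get(page, [])
            if outgoing:
                # A page divides its score equally among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    updated[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for target in pages:
                    updated[target] += share
        rank = updated
    return rank


# Hypothetical four-page Web: three pages point at "hub".
toy_web = {
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "a"],
    "hub": ["a"],
}
scores = link_rank(toy_web)
most_authoritative = max(scores, key=scores.get)
print(most_authoritative, round(scores[most_authoritative], 3))
```

The scores sum to one, and the page that collects the most links from the others ends up with the largest score. Nothing in the computation inspects page content; the ranking comes from link structure alone.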

One particular ongoing extension of the Web is in the direction of moving from text documents to data resources. In the Web of human-readable documents, the computer uses Natural Language Processing techniques to extract some form of meaning from the human-readable text of the pages. These approaches are based on "latent" semantics, that is, on the computer using heuristic techniques to recapitulate the intended meanings used in human communication. By contrast, in the "Semantic Web" of relational data and logical assertions, computer logic is in its element, and can do much more.

Research is exploring the use of new, logically based languages for question answering, hypothesis checking, and data modeling. Imagine being able to query the Web for a chemical in a specific cell biology pathway that has a certain regulatory status as a drug and is available at a certain price. The engineering challenge is to allow independent, internally consistent data systems to be connected locally without requiring global consistency. The statistically based methods that serve for the scaling of language resources in search and the data calculi that are used in scaling database queries are largely based on incompatible assumptions, and unifying these will be a significant challenge for future Web research.
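The chemical query can be sketched over a toy set of subject-predicate-object assertions. The example below is only an illustration under stated assumptions: the three data sources, the identifiers such as chem:A and pathway:X, the predicate names, and the price threshold are all hypothetical, and the small pattern matcher stands in for a real query language. It shows independently maintained sources being combined locally and queried together.

```python
def match(triples, pattern, binding):
    """Yield extended variable bindings for one (subject, predicate, object)
    pattern. Pattern terms that start with '?' are variables."""
    for triple in triples:
        new = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if isinstance(term, str) and term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield new


def query(triples, patterns):
    """Return every binding that satisfies all patterns at once."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b for old in bindings for b in match(triples, pattern, old)]
    return bindings


# Three hypothetical, independently maintained sources.
pathway_db = [
    ("chem:A", "participatesIn", "pathway:X"),
    ("chem:B", "participatesIn", "pathway:X"),
    ("chem:C", "participatesIn", "pathway:Y"),
]
regulatory_db = [
    ("chem:A", "regulatoryStatus", "approved"),
    ("chem:B", "regulatoryStatus", "experimental"),
    ("chem:C", "regulatoryStatus", "approved"),
]
price_db = [
    ("chem:A", "pricePerGram", 12.5),
    ("chem:B", "pricePerGram", 3.0),
    ("chem:C", "pricePerGram", 8.0),
]

# Sources are combined locally; no global schema is required.
merged = pathway_db + regulatory_db + price_db
results = query(merged, [
    ("?chem", "participatesIn", "pathway:X"),
    ("?chem", "regulatoryStatus", "approved"),
    ("?chem", "pricePerGram", "?price"),
])
affordable = [r["?chem"] for r in results if r["?price"] <= 20.0]
print(affordable)
```

The join works here only because the three sources happen to use the same identifier for each chemical. Aligning identifiers and data models across real sources is part of the research challenge.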

Despite excitement about the Semantic Web, the majority of the world's data is locked in large data stores and is not published as an open web of inter-referring resources. As a result, the reuse of information has been limited. Substantial research challenges arise in changing this situation: how to effectively query an unbounded Web of linked information repositories, how to align and map between different data models, and how to visualize and navigate the huge connected graph of information that results. In addition, a policy question arises as to how to control access to the data resources being shared on the Web. The latter has implications both for the underlying technologies that could provide greater protections and for issues of ownership in, for example, scientific data-sharing and Grid computing.

The scale, topology and power of decentralized information systems such as the Web also pose a unique set of social and public policy challenges. While computer and information science have generally concentrated on the representation and analysis of information, attention must also be paid to the social and legal relationships among this information (8). Transparency and control over the complex social and legal relationships behind this information are vital, but they require a much better developed set of models and tools that can represent these relationships. Early efforts to model these relationships in the areas of privacy and intellectual property have begun to establish the scientific and legal challenges associated with representing users' information and giving users control over it. Our aim is to be able to design "policy aware" systems that provide reasoning over these policies, enable agents to act on a user's behalf, make compliance easier, and provide accountability where rules are broken.

In summary, Web Science is about more than modeling the current Web. It is about engineering new infrastructure protocols, about understanding the human society that uses them and creates the Web, and about creating beneficial new systems. It has its own ethos: decentralization to avoid social and technical bottlenecks, openness to the reuse of information in unexpected ways, and fairness so that a just society can be built on its principles. It uses powerful scientific and mathematical techniques from many disciplines to consider at once microscopic Web properties, macroscopic Web phenomena, and the relationships between them. Web Science is about making powerful new tools for humanity, and doing it with our eyes open.

References

1. R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, Network Motifs: Simple Building Blocks of Complex Networks, Science 25 October 2002 298: 824-827

2. Ron Milo, Shalev Itzkovitz, Nadav Kashtan, Reuven Levitt, Shai Shen-Orr, Inbal Ayzenshtat, Michal Sheffer, and Uri Alon, Superfamilies of Evolved and Designed Networks, Science 5 March 2004 303: 1538-1542

3. Albert-László Barabási and Réka Albert, Emergence of Scaling in Random Networks, Science 15 October 1999 286: 509-512

4. Jon M. Kleinberg, Navigation in a small world, Nature 406, August 2000 845

5. Steven H. Strogatz, Exploring complex networks, Nature 410, March 2001 268-276

6. Sergey Brin and Lawrence Page, The anatomy of a large-scale hypertextual web search engine, In Proceedings of the 7th International World Wide Web Conference, pages 107-117, Brisbane, Australia, April 1998. Elsevier Science.

7. Zoltán N. Oltvai and Albert-László Barabási, Life's Complexity Pyramid, Science 25 October 2002 298: 763-764

8. Lawrence Lessig, Code and other Laws of Cyberspace, Basic Books, 1999.