Data Tags

 

Storing and later retrieving data on the Internet can be particularly difficult. Stored data is increasingly hard to navigate by traditional means. Witness the success of search engines like Google, with their simple syntax for searching for keywords on the World Wide Web.

We have begun to formalize efficient keywords, and standards for their use, as ‘Tags’. We can use these Tag values as efficient aliases and indices. Traditional approaches to the storage, retrieval and use of information suffice when a human being engineers a relational database. They become inadequate when entities, relationships, data sets, tuples, etc. grow to the size of the Internet. Tags have advantages over traditional indices. We can use them to store and retrieve information from enormous data sets. They also allow us to create and use sparse sets that are unbounded both in set size and in dimensional size. Done properly, Tags bring order to chaos.

 

·        A tag can be thought of as the name of a cell in a two-dimensional array (e.g. ‘B3’ for column 2, row 3).

·        This can be easily extended to a three-dimensional array by considering the Excel worksheet as a third dimension of ‘pages’ or ... ‘Sheets’.

·        To those with a sufficient background in mathematics, this is obviously an example of a ‘bounded’ (preferable terminology to ‘infinite’) EnDimensionalArray, which extends the concepts of...

o       An (Entity; Property) or (Name; Value) tuple, which is a two-element vector that can be one axis of a matrix (‘table’ or ‘spreadsheet’) produced by rendering two (orthogonal) vectors as a matrix.

 

In a simple matrix of M (rows) x N (columns), where the values of M and N are given by their positions in the vector of the 26 upper-case letters of the alphabet (positions 13 and 14 respectively), it is possible to ‘tag’ or name the cell with a base-ten notation as the computed value of:

 

13 x 26^1 + 14 x 26^0 = 352

 

This exactly equates a ‘Tag’ value of ‘MN’ to position number 352 in a bounded matrix with a maximum of 702 cells (the position of ‘ZZ’, 26 x 26^1 + 26 x 26^0).
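The letter-to-position arithmetic can be checked with a short sketch. It assumes tags consist only of upper-case letters with A = 1 through Z = 26, as in spreadsheet column names; the function names `tag_to_index` and `index_to_tag` are illustrative and not part of the original work.

```python
def tag_to_index(tag: str) -> int:
    """Convert a tag of upper-case letters (A=1 ... Z=26) to a base-ten index."""
    index = 0
    for letter in tag:
        if not ("A" <= letter <= "Z"):
            raise ValueError(f"unexpected character {letter!r} in tag {tag!r}")
        # Shift the running value one base-26 place and add this letter's position.
        index = index * 26 + (ord(letter) - ord("A") + 1)
    return index


def index_to_tag(index: int) -> str:
    """Inverse of tag_to_index: recover the letter tag from a positive index."""
    if index < 1:
        raise ValueError("index must be positive")
    letters = []
    while index > 0:
        # Subtract 1 because the letters run 1..26 rather than 0..25.
        index, remainder = divmod(index - 1, 26)
        letters.append(chr(ord("A") + remainder))
    return "".join(reversed(letters))


print(tag_to_index("MN"))  # 13 * 26 + 14
print(tag_to_index("ZZ"))  # largest two-letter tag
print(index_to_tag(352))
```

Because the mapping is reversible, the tag and the numeric index carry the same information, so either can serve as the key.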

 

It may be informative to note that the bases of the positional terms in this equation do not have to be the same. Consider, for example, the Canadian postal code’s ‘alpha numeric alpha - numeric alpha numeric’ structure. A specific index value of ‘M5C 1B5’ could be converted to a base-ten index by evaluating the polynomial:

13 x 26^5 + 5 x 10^4 + 3 x 10^3 + 1 x 10^2 + 2 x 10^1 + 5 x 10^0

 

Again, it is a mathematical fact that this is an index into a bounded array of cells whose largest tag value is:

 

Z9Z 9Z9

 

This clearly establishes the fact that a tag value can be an index into an EnDimensionalArray.
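One conventional way to turn a tag with a different alphabet at each position into a single number is a mixed-radix conversion, in which each position's place value is the product of the alphabet sizes to its right. The sketch below uses that standard technique, so it is not necessarily the exact polynomial given above. It also makes two assumptions of its own: symbols are numbered from zero (A = 0, so indices run from 0 to capacity - 1), and every letter and digit is allowed at its position. The names `POSTAL_PATTERN`, `mixed_radix_index` and `capacity` are hypothetical.

```python
import string

LETTERS = string.ascii_uppercase  # 26 symbols, A=0 ... Z=25 in this sketch
DIGITS = string.digits            # 10 symbols, 0 ... 9

# One alphabet per position of an 'alpha numeric alpha numeric alpha numeric' tag.
POSTAL_PATTERN = [LETTERS, DIGITS, LETTERS, DIGITS, LETTERS, DIGITS]


def mixed_radix_index(tag: str, pattern=POSTAL_PATTERN) -> int:
    """Map a tag to a unique base-ten index using a per-position radix."""
    symbols = tag.replace(" ", "")
    if len(symbols) != len(pattern):
        raise ValueError(f"expected {len(pattern)} symbols, got {len(symbols)}")
    index = 0
    for symbol, alphabet in zip(symbols, pattern):
        # str.index raises ValueError if the symbol is not valid at this position.
        index = index * len(alphabet) + alphabet.index(symbol)
    return index


def capacity(pattern=POSTAL_PATTERN) -> int:
    """Number of distinct tags (cells) the pattern can address."""
    total = 1
    for alphabet in pattern:
        total *= len(alphabet)
    return total


print(mixed_radix_index("M5C 1B5"))
print(capacity())  # 26 * 10 * 26 * 10 * 26 * 10
```

Under these assumptions the array is bounded at 17,576,000 cells, and ‘Z9Z 9Z9’ maps to the last of them.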

In simple practical terms, the analogy to a spreadsheet discussed during the second review meeting may illustrate this more clearly since it is quite common to use terminology like ‘B3’ to identify the cell at the intersection of the second column and the third row. In this example, the worksheet can be thought of as an extension to a third dimension of an array, which might extend its comparable Tag value to (S4) B3.
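As a rough illustration of the spreadsheet analogy, the sketch below parses a tag such as ‘(S4) B3’ into (sheet, column, row) coordinates and uses them as keys in a sparse store. The tag grammar (an optional ‘(S<number>)’ prefix followed by column letters and a row number) and the names `parse_tag` and `cells` are assumptions made for this example only.

```python
import re
from typing import Dict, Tuple

# Optional "(S<n>)" sheet prefix, then column letters, then a row number.
TAG_PATTERN = re.compile(r"^(?:\(S(\d+)\)\s*)?([A-Z]+)(\d+)$")


def column_number(letters: str) -> int:
    """Spreadsheet-style column letters to a number (A=1, B=2, ..., AA=27)."""
    number = 0
    for letter in letters:
        number = number * 26 + (ord(letter) - ord("A") + 1)
    return number


def parse_tag(tag: str) -> Tuple[int, int, int]:
    """Return (sheet, column, row); the sheet defaults to 1 when omitted."""
    match = TAG_PATTERN.match(tag.strip())
    if match is None:
        raise ValueError(f"unrecognised tag: {tag!r}")
    sheet, letters, row = match.groups()
    return (int(sheet) if sheet else 1, column_number(letters), int(row))


# A dictionary keyed by coordinates stores only the cells that are in use,
# so the array can be sparse and needs no pre-declared size.
cells: Dict[Tuple[int, int, int], str] = {}
cells[parse_tag("(S4) B3")] = "value stored on sheet 4"
cells[parse_tag("B3")] = "value stored on the default sheet"

print(parse_tag("(S4) B3"))
print(len(cells))
```

Adding a further dimension only lengthens the key tuple; the store itself does not change.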

 

Proving the validity of this ‘theoretical’ but fundamentally important realization was the motivation for much of the research into the frequency, scope, and acceptability of tags in RepresentativeSample(s) of very large GroupFormingNetworks of electronic ‘communities’ such as Flickr, Diigo and Delicious, documented in ‘section D. Description of work in the tax year’ of the submitted material.

 


Dispersed Data

 

Ubiquitous, Trusted and Distributed Information

 

As mentioned above, the Internet offers some unprecedented opportunities and challenges. It allows us to store nearly limitless amounts of data. It allows real-time, concurrent access to that data by literally hundreds of millions of users. In aggregate, the Internet serves billions of access points. We expect this to grow physically into the trillions in a very short time. Physical growth is already spectacular. However, the conceptual growth of the Internet is perhaps already beyond measurement with conventional numbers. This is due to the mathematics of Group Forming Networks as described in our larger submission. At a trillion nodes, the number of ‘virtual loci’ (two raised to the power of a trillion) is difficult to imagine.
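To give a sense of scale, the size of two raised to the power of a trillion can be estimated without constructing the number, since the count of decimal digits of 2^n is floor(n x log10(2)) + 1. The sketch below applies that identity; the function name is illustrative.

```python
import math


def decimal_digits_of_power_of_two(exponent: int) -> int:
    """Number of base-ten digits in 2**exponent, computed from logarithms."""
    return math.floor(exponent * math.log10(2)) + 1


nodes = 10**12  # one trillion nodes
print(decimal_digits_of_power_of_two(nodes))
```

The result is on the order of 300 billion digits, so the count of loci could not even be written out in full on ordinary storage media.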

 

Nobody has much experience with the dynamics of a phenomenon like the Internet. Witness the spectacular rise and fall of huge companies in the past decade. If the dynamics of this system were understood, we would not have seen individuals forming multi-billion dollar companies in times measured in months.

 

The current and projected size and distribution of the Internet clearly allow us to make data available everywhere in enormous quantities. Two things remain difficult research issues. The first is a technical issue: how to limit access to that data to entities with the authority to use it while still leaving it widely dispersed and accessible everywhere. The second is the key to the kingdom: how, in terms of social engineering, can we gain access to this global resource? We are well along in designing the first mechanism. The second is much more difficult and entirely an open research issue. In plain terms, we need individual people to grant us access to their systems and the data they contain. Other companies are doing this, but in ways that are resource intensive, inefficient and (we believe) doomed to failure due to privacy issues.

 

We need both the working technology to distribute data and the knowledge of how to influence resource owners to work with us, and with each other.

 

If we are successful in our efforts, we will have a data store that is accessible anywhere, any time, but only by those who have the legitimate authority to access it. Because of the widely dispersed nature of this data (it is not just ‘distributed’, it is dispersed physically and logically), it is effectively indestructible. As shown in the section on encryption, we can make our data store vulnerable only to the destruction of the entire network infrastructure.

 


PeerToPeer Technologies

 

We expect that the complexity of the Internet will persist for some time. However, our view of the system is as a network of peers. For our purposes, this simplifies the view tremendously. Our systems can use the entirety of the Internet as a data storage medium without regard to what others consider clients or servers. This has been non-trivial to develop in practice. As shown below, the Internet is extremely heterogeneous. It has been a challenge to determine which things can be depended upon and which cannot. During our claim period and in the time surrounding it, a significant effort was undertaken to arrive at a simple vision of the Internet cloud that is still rich enough to be useful in implementation. Below, we discuss the complexity of the world network and our simplification of a portion of it, and deal specifically with an example of how data can be stored securely and reliably in the cloud.

 

Choice of Network Application Systems

The drawing here demonstrates some of the complexity of the current environment on the Internet. It is still not entirely clear which protocols and which technologies will be used in the future.

We have attempted to take as agnostic a view of the global network as possible. TCP/IP is a given, but beyond that, things are still very much in a state of flux.

 

We chose social networking sites to examine due to the mathematics described in our comprehensive submission. These sites offer the most leverage to influence elements of the network and ultimately to collect, store and clean data.

 

Wikis were chosen due to their ubiquity (for instance, Wikipedia is one of the most trafficked websites on the net) and their simplicity, both for implementation and for training of users. It is easier to move people on to wikis than on to just about any other interface on the Internet. Various wikis offer extremely rich functionality that exceeds that of any other site. They can also store data in a very simply distributed fashion. Our wikis transparently show information from and send information to wikis all over the world.

 

Of particular note are the following two points. Wikis can be both client-side and server-side. A client-side wiki can also transparently connect to server-side wikis without user intervention. This offers the best of all worlds in that users can have rich GUI functionality, local access to data and security on their local machine (using, for instance, the JavaScript-driven TiddlyWiki). Users get all the benefits of both server-side and client-side functionality.

 

 

Distribution of Data in the Network Cloud

 

The diagram below shows the view that we take of the network above. From our point of view, the network is best used as a distributed data store. Our designs for data collection, storage, security, dependability and retrieval are based on the notion that documents are broken into packets and that multiple copies of each packet are encrypted and distributed across the network.

 



Data Security and Encryption

 

We have been working on conceptual designs that use the network cloud to create a high level of security, both in restricting access and in guaranteeing access to those with the appropriate rights.

The diagram here gives a simplified conceptual version of the system we are designing.

 

As can be seen, a data element (in this case a Word document) goes through a series of stages on its way to storage and retrieval.

 

The first stage is to compress the data. This reduces the storage required and makes the data more difficult to attack with cryptanalysis.
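A minimal sketch of the compression stage follows. The source does not say which compression method is used; zlib from the Python standard library is an assumption made here for illustration, as are the function names.

```python
import zlib


def compress_document(raw: bytes, level: int = 9) -> bytes:
    """Compress the document bytes before they are split and dispersed."""
    return zlib.compress(raw, level)


def decompress_document(blob: bytes) -> bytes:
    """Reverse the compression stage after the packets are reassembled."""
    return zlib.decompress(blob)


sample = b"Tags bring order to chaos. " * 200
packed = compress_document(sample)

# The round trip must be lossless, and repetitive input shrinks considerably.
assert decompress_document(packed) == sample
print(len(sample), "->", len(packed))
```

Compressed output has far less repetition than the original text, which is the property the paragraph above relies on.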

 

In the next stage, the data is split into packets for distribution. We then prepare it for dispersal (‘distribute’ here) by creating multiple redundant sets. The way we do this in practice is more complex and saves significantly on storage, but this gives the general idea. Since there are multiple copies, the destruction of one copy does not eliminate the data. It also means that we can have many more access points and that additional concurrent access is possible.
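The general idea of splitting and redundant dispersal can be sketched as follows. This is the simple form described above (whole copies of each packet), not the more storage-efficient method used in practice, and the encryption step is omitted. The node names, the round-robin placement rule and all function names are hypothetical.

```python
from typing import Dict, List


def split_into_packets(data: bytes, packet_size: int) -> List[bytes]:
    """Cut the data into fixed-size packets (the last one may be shorter)."""
    if packet_size <= 0:
        raise ValueError("packet_size must be positive")
    return [data[i:i + packet_size] for i in range(0, len(data), packet_size)]


def place_replicas(packets: List[bytes], nodes: List[str],
                   copies: int) -> Dict[str, Dict[int, bytes]]:
    """Store `copies` replicas of every packet on distinct nodes, round-robin."""
    if not 1 <= copies <= len(nodes):
        raise ValueError("copies must be between 1 and the number of nodes")
    stores: Dict[str, Dict[int, bytes]] = {node: {} for node in nodes}
    for seq, packet in enumerate(packets):
        for k in range(copies):
            stores[nodes[(seq + k) % len(nodes)]][seq] = packet
    return stores


def reassemble(stores: Dict[str, Dict[int, bytes]], packet_count: int) -> bytes:
    """Rebuild the data from whichever surviving node holds each packet."""
    parts = []
    for seq in range(packet_count):
        for store in stores.values():
            if seq in store:
                parts.append(store[seq])
                break
        else:
            raise LookupError(f"packet {seq} is lost on every node")
    return b"".join(parts)


document = bytes(range(256)) * 4
packets = split_into_packets(document, packet_size=100)
stores = place_replicas(packets, ["node-a", "node-b", "node-c", "node-d"], copies=2)

del stores["node-b"]  # simulate the destruction of one node
assert reassemble(stores, len(packets)) == document
```

With two copies of every packet on distinct nodes, losing any single node leaves at least one copy of each packet available.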

 

Critical data (such as the combination to a vault) requires some type of shared custody, so that the capture of a single key does not allow access to the data. The diagram shows a simple scheme that requires two keys to decrypt the data. Although it may be hard to see in the diagram, the scheme can also be extended to require more keys, and it can be arranged so that even legitimate key-holders cannot locate and destroy all copies, since their keys do not work on all copies. Finally, it shows how the system can be extended so that many people carry key sets and any two of them (in this case) suffice to decrypt the data.
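The ‘any two of many’ property can be illustrated with Shamir's secret sharing, a well-known threshold scheme. The source does not state which scheme the design uses, so this is a stand-in technique, not a description of the actual system. The secret here is assumed to be an integer key that protects the data; the prime modulus and the function names are choices made for this sketch, and it does not model the restriction that keys work only on some copies.

```python
import secrets
from typing import List, Tuple

PRIME = 2**127 - 1  # a Mersenne prime; the secret must be smaller than this


def make_shares(secret: int, holders: int, threshold: int = 2) -> List[Tuple[int, int]]:
    """Split `secret` so that any `threshold` of the `holders` can recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range")
    if not 1 < threshold <= holders:
        raise ValueError("need 1 < threshold <= holders")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coefficients = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, holders + 1):
        y = 0
        for coefficient in reversed(coefficients):  # Horner evaluation mod PRIME
            y = (y * x + coefficient) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares: List[Tuple[int, int]]) -> int:
    """Lagrange interpolation at x = 0 over the field of integers mod PRIME."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        numerator, denominator = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                numerator = (numerator * -xj) % PRIME
                denominator = (denominator * (xi - xj)) % PRIME
        # pow(..., -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * numerator * pow(denominator, -1, PRIME)) % PRIME
    return secret


vault_key = 123456789
shares = make_shares(vault_key, holders=5, threshold=2)
assert recover_secret([shares[0], shares[3]]) == vault_key
assert recover_secret([shares[4], shares[1]]) == vault_key
```

With a threshold of two, a single captured share reveals nothing about the key, while any pair of holders can reconstruct it.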