Data from Delicious was gathered in 2006 and currently consists of over 667 thousand users, nearly 2.5 million tags, and around 18.7 million resources.

This data set was used extensively in the project, for example in the analysis and modelling of the evolutionary behaviour and structural information of social resource sharing systems, in the analysis and modelling of the structure and dynamics of folksonomies, and in semantic user interest profiling and tag disambiguation analysis.

An anonymized version of this dataset is available in two formats: a comma separated version here, and an RDF representation. The RDF version consists of two files, a complete list of posts here, and a complete list of tags and their frequencies here. This data is modelled using the Tagora tagging ontology and expands to around 120GB of Turtle files (or 1 billion triples).


This data collection contains all descriptions of photos that were uploaded to Flickr between January 2004 and December 2005 and that were still available over the public API in the first half of 2007. The crawl was completed in July 2007.

An anonymized version of this dataset is available in two formats: a comma separated version here, and an RDF representation. The RDF version is made up of two files, a complete list of posts here, and a complete list of tags and their frequencies here. Like the Delicious dataset described above, this data is modelled using the Tagora tagging ontology and expands to around 100GB of Turtle files (or 0.8 billion triples).


To provide the Consortium with raw data for modeling and analyzing interactions in online social communities, we offer a benchmark dataset from our collaborative tagging system BibSonomy. The anonymized BibSonomy data is available as a MySQL dump, which will be updated every six months. Interested parties can obtain an account for access to our server here. Before starting the download, participants have to sign a license agreement that sets out the terms of use. The data set currently consists of over 2.6 thousand users, 181 thousand bookmarks, 219 thousand publications, and over 816 thousand tag assignments. The dumps can easily be loaded into a MySQL database.

To study music tags, we collected data about users, tags, artists, albums, tracks, and sound extracts. The data is currently stored in a MySQL database and consists of 10 users, 200 tags, 73 albums, 1500 artists, 65000 tracks, and 18000 sound extracts (26 seconds each). This dataset was generated in the summer of 2005 and is used in the demo version of the Ikoru application.

In addition to the above, we have also collected data about music charts. The data contains song titles, bands, and their chart positions on a weekly basis. This data is available here.

Ajax, Blog and XML co-occurrence streams

To analyse the semantics of tags, we created a subset of the larger Delicious crawl that contains complete co-occurrence streams for the tags “ajax”, “blog” and “xml”. The data set and more detailed information about it (e.g. its size and formatting) are available here.

MOTOBOY

The dataset from the canal*MOTOBOY project, which involves a small-scale community using tags to represent and communicate their daily life experiences, has been made available to the TAGora consortium. In canal*MOTOBOY, 15 motorcycle messengers in São Paulo, Brazil, transmit tagged images, videos and audio clips directly from their mobile phones to a web page. The dataset, which includes 13 months of activity, can be used to study the dynamics of tagging in a small, densely connected group. It contains over 8000 tag assignments, nearly 8000 resources, 712 tags and 15 users. This dataset can be downloaded from here.

NoiseTube

To apply new tagging usages in the real world, over the last year we set up a new extension called NoiseTube in an environmental context. It enables the general public to measure and annotate their exposure to noise pollution via their mobile phones. For noise pollution, measuring exposure is not enough, since the causes of the pollution must be identified in order to react to it. As people are excellent at recognizing noise sources, they can annotate the geolocated measurements in real time with the cause or context of their exposure via the mobile application before sending them to the platform. Such environmental tagging adds a semantic layer on top of the exposure map created by the public. The data are currently available directly from NoiseTube via the web API (CSV or JSON format). Data gathering started in April 2009. Because the project is recent, the dataset currently contains 10 users and 8000 measurements with 400 tag assignments.

Phenotypes/Limited Forms

Phenotypes/Limited Forms is an art installation that uses photos by the photographer Armin Linke and that has been on display at the Zentrum für Kunst und Medien (ZKM) in Karlsruhe, Germany, and at the Selective Knowledge exhibition in Athens, Greece. We collected data about 8000 users, 1000 photos, 8000 tags, and 70000 tag assignments. The data gathering started in November 2007. The photos are copyrighted, but the tag assignments are available.

Integrated IMDB and Netflix Dataset

To support the investigation of communal data structures, such as folksonomies, in the context of recommendation, we have created a large knowledge base about movies and how users rate movies. To achieve this, a large portion of the Internet Movie Database (IMDB) was downloaded to provide information about movies, actors and production personnel, as well as a large set of keywords that have been assigned by users to describe movies. The IMDB dataset contains 898,078 movie titles, 2,564,990 names (including actors, actresses, writers, directors and producers), and 32,247 keywords. To obtain information about the way users rate movies, we have collected a dataset from Netflix, a mail-based movie rental company in the US, which contains the movie ratings of 480,189 customers across 17,770 movie titles over the last five years.

Both the IMDB and Netflix datasets have been converted into a relational database, distributed as a 643MB compressed MySQL dump. To provide a single view over both datasets, for example to support querying information on movies from IMDB together with how users rate these movies on Netflix, we have correlated the 13,880 movie titles in the Netflix dataset with their IMDB counterparts. The result is a large knowledge base on movies and movie ratings that supports semantic querying (for example through SPARQL). The mappings between movie titles in Netflix and those in IMDB can be downloaded from here.
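As a rough illustration of how the title mappings give a single view over both sources, the sketch below averages Netflix ratings per mapped IMDB title. The record layout, the function name `average_rating_by_imdb_title`, and all sample values are assumptions made for this example; they do not reflect the actual schema of the MySQL dump or the format of the mapping file.

```python
from collections import defaultdict


def average_rating_by_imdb_title(ratings, netflix_to_imdb):
    """Average the star ratings of every Netflix movie that has an IMDB mapping.

    ratings: iterable of (netflix_movie_id, customer_id, stars) tuples (assumed layout).
    netflix_to_imdb: dict from Netflix movie id to IMDB title (assumed layout).
    """
    totals = defaultdict(lambda: [0, 0])  # title -> [sum of stars, number of ratings]
    for netflix_id, _customer_id, stars in ratings:
        title = netflix_to_imdb.get(netflix_id)
        if title is None:
            continue  # this Netflix title has no IMDB counterpart in the mapping
        totals[title][0] += stars
        totals[title][1] += 1
    return {title: total / count for title, (total, count) in totals.items()}


# Placeholder data, invented for the example.
ratings = [(1, "c1", 4), (1, "c2", 2), (2, "c1", 5), (3, "c3", 1)]
mapping = {1: "Movie A (2001)", 2: "Movie B (1999)"}
averages = average_rating_by_imdb_title(ratings, mapping)
```

Ratings for Netflix titles without a mapping (movie 3 above) are simply skipped, which mirrors the fact that only correlated titles can be queried across both datasets.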

TAGora DBpedia Sense Info

To facilitate investigation into the possibility of automatically grounding tags to Semantic Web concepts, we provide additional metadata about concepts in the form of a vector space model. By mining the Wikipedia pages that correspond to concepts, we are able to provide a list of terms (and their frequencies) that are associated with each concept, as well as the total number of terms and the length of the document. For example, the concept Apple (the fruit) has associated terms (apples, fruit, tree, etc.) whereas the concept Apple_Inc. (the computer company) has associated terms (mac, macintosh, ipod, iphone, etc.). This dataset is served as linked data using the Tagora DBpedia ontology. A full download is also provided here. The full dataset expands to 50GB of Turtle files (or around 0.45 billion triples).
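One way such term vectors could be used for grounding is to compare the tags that co-occur with an ambiguous tag against each candidate concept's term frequencies. The following is a minimal sketch using cosine similarity; the frequencies are invented for illustration, and cosine similarity is an assumed choice rather than the method used in the project.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Term frequencies below are made up; the real values come from the dataset.
concepts = {
    "Apple": Counter({"apples": 12, "fruit": 9, "tree": 5}),
    "Apple_Inc.": Counter({"mac": 10, "macintosh": 6, "ipod": 7, "iphone": 8}),
}

# Tags that appeared alongside the ambiguous tag "apple" on some resource.
context = Counter(["fruit", "tree", "recipe"])

best = max(concepts, key=lambda name: cosine(context, concepts[name]))
```

Here the context shares terms only with the fruit concept, so `best` is `"Apple"`.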


Tag Filtering

Using knowledge gathered from extensive investigation into user tagging habits, we provide a Tag Filtering Service here. This service consumes a list of raw tags and provides a set of mappings. These mappings are divided into two classes: tag filtering (covering stemming, singularisation, and normalisation), and “did you mean” (suggesting alternative forms when a tag is misspelled). The service can be tested on a web page, where details of how to access the service are also provided.
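The two classes of mapping can be sketched roughly as follows. This is not the service's implementation: the normalisation and singularisation rules are deliberately naive, stemming is omitted, the vocabulary of known tags is hypothetical, and the “did you mean” step uses `difflib.get_close_matches` from the Python standard library as a stand-in.

```python
import difflib
import re


def normalise(tag):
    """Lower-case the tag and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", tag.lower())


def singularise(tag):
    """Very naive English singularisation, for illustration only."""
    if tag.endswith("ies") and len(tag) > 4:
        return tag[:-3] + "y"
    if tag.endswith("s") and not tag.endswith("ss") and len(tag) > 3:
        return tag[:-1]
    return tag


def filter_tags(raw_tags, vocabulary):
    """Map each raw tag to a filtered form and an optional spelling suggestion."""
    mappings = {}
    for raw in raw_tags:
        cleaned = singularise(normalise(raw))
        suggestion = None
        if cleaned not in vocabulary:
            close = difflib.get_close_matches(cleaned, vocabulary, n=1, cutoff=0.75)
            suggestion = close[0] if close else None
        mappings[raw] = {"filtered": cleaned, "did_you_mean": suggestion}
    return mappings


# Hypothetical vocabulary of known tags.
known_tags = ["webdesign", "blog", "python"]
result = filter_tags(["Web-Design", "Blogs", "pyhton"], known_tags)
```

With this input, "Web-Design" and "Blogs" are reduced to known tags, while the misspelled "pyhton" receives "python" as a suggestion.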

Sense Matching

Many within the Semantic Web community have been investigating the possibility of grounding tags to concepts on the Semantic Web. For example, rather than annotating a resource using the string “apple”, a link to a well-defined concept would be used. This style of annotation would alleviate many of the problems associated with the free-form nature of tagging, such as ambiguous meanings and morphological variation.

We provide a Sense Matching Service here that consumes a tag string and suggests possible meanings from two large Semantic Web resources: DBpedia and W3C Wordnet. Since DBpedia often contains a large number of concepts that might be related to a particular tag, we provide a ranking (or score) for each sense by considering how many pages link to the concept and how similar the concept title string is to the original tag.
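The text names the two ranking signals but not how they are combined. The sketch below shows one possible combination, multiplying a string similarity by the logarithm of the inlink count. The formula, the function names, and the inlink counts are all assumptions made for illustration and should not be read as the service's actual scoring.

```python
import difflib
import math


def score_sense(tag, concept_title, inlinks):
    """Combine title similarity with a damped inlink count (assumed formula)."""
    title = concept_title.lower().replace("_", " ")
    similarity = difflib.SequenceMatcher(None, tag.lower(), title).ratio()
    return similarity * math.log1p(inlinks)


def rank_senses(tag, candidates):
    """Sort (concept_title, inlinks) pairs from best to worst match."""
    return sorted(
        candidates,
        key=lambda candidate: score_sense(tag, candidate[0], candidate[1]),
        reverse=True,
    )


# Inlink counts are invented for the example.
candidates = [("Apple_Records", 800), ("Apple_Inc.", 20000), ("Apple", 5000)]
ranked = rank_senses("apple", candidates)
```

Taking the logarithm keeps a heavily linked concept from overwhelming an exact title match, which is one plausible reason to combine the signals this way.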

We provide JSON and RDF interfaces to this service, as well as a simple web page for testing.

RDF Builder

To support the conversion of user tagging data to a Semantic Web representation, we provide the RDF Builder Service. This service currently supports both Delicious and Flickr and provides a Linked Data representation of user contact information and complete tagging history. Using screen scraping (or API access when available), this service uses the Tagora Tagging Ontology to represent posts made by the user, and automatically performs filtering on any tags used.


Epistemic dynamical tagging model

This generative tagging simulator has been developed by the Institute for Computer Science at the University of Koblenz-Landau. It integrates both background knowledge and the influence of previous tag assignments. It successfully reproduces characteristic properties of tag streams and even explains effects of the user interface on the tag stream. The simulator employs two widely accepted algorithms for producing artificial tag streams that are statistically similar to the real datasets found in collaborative tagging communities. It is written in Java and is available online via the institute’s website, where additional documentation can be found. Along with the software, an archive containing all generated tag streams, the software simulator and the technical report is provided. A README file describes how to start the software simulator and which files are contained in the archive with the artificial tag streams.
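The interplay of background knowledge and previous tag assignments can be pictured with a heavily simplified generative loop. The sketch below is not the simulator's algorithm: the mixing probability `p_background`, the uniform imitation of earlier tags, and the example vocabulary are assumptions chosen only to show the general idea.

```python
import random


def simulate_tag_stream(background, n_assignments, p_background=0.3, seed=0):
    """Generate an artificial tag stream.

    background: dict mapping each word to its weight in the background knowledge.
    With probability p_background a tag is drawn from the background knowledge;
    otherwise an earlier tag assignment from the stream is imitated.
    """
    rng = random.Random(seed)
    words = list(background)
    weights = [background[word] for word in words]
    stream = []
    for _ in range(n_assignments):
        if not stream or rng.random() < p_background:
            tag = rng.choices(words, weights=weights, k=1)[0]
        else:
            tag = rng.choice(stream)  # frequent tags are more likely to be copied
        stream.append(tag)
    return stream


# Hypothetical background vocabulary with made-up weights.
background = {"ajax": 5, "javascript": 4, "web": 3, "xml": 2, "tutorial": 1}
stream = simulate_tag_stream(background, n_assignments=1000)
```

Because imitation picks uniformly from all earlier assignments, tags that are already frequent tend to become more frequent, which produces the skewed frequency distributions typical of real tag streams.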


NET is a software package developed by Vito D.P. Servedio of the Physics Department at University “Sapienza” of Rome, Italy. It is intended to help researchers in the field of complex networks analyse the statistical properties of such networks. NET provides tools to both analyze and generate random graphs: it can run multiple realizations of network generation models and compute ensemble statistics over them; alternatively, it can compute single and multiple statistics on networks read from files.

It is written in C. NET generates networks according to the “rich-get-richer” model introduced by Barabási and Albert, or to the fitness model introduced by Caldarelli et al. As an analysis tool for directed, undirected and weighted networks, it performs several tasks, including graph reduction according to the minimum betweenness criterion, spectral analysis (requires the GSL and LAPACK libraries), and calculation of the degree distribution and correlations, clustering coefficient, site and edge betweenness, node pair distance and cluster dimension. A graphical frontend is available, which requires the Qt 3.0 libraries. A direct interface to the common XmGrace plotting software is provided. Networks generated by NET are displayed by the graph visualization software Graphviz and Grip.
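For readers unfamiliar with the “rich-get-richer” model, the sketch below generates a small preferential-attachment network and computes its degree distribution. It is written in Python for brevity and is unrelated to NET's C implementation; the function names, the initial clique, and the parameter values are assumptions for this example.

```python
import random
from collections import Counter


def barabasi_albert(n_nodes, m, seed=0):
    """Grow an undirected network in which each new node attaches to m
    existing nodes chosen with probability proportional to their degree."""
    rng = random.Random(seed)
    edges = []
    endpoints = []  # every node appears here once per incident edge
    # Start from a small complete graph on m + 1 nodes.
    for i in range(m + 1):
        for j in range(i):
            edges.append((i, j))
            endpoints += [i, j]
    for new in range(m + 1, n_nodes):
        chosen = set()
        while len(chosen) < m:
            # Sampling an edge endpoint is sampling proportionally to degree.
            chosen.add(rng.choice(endpoints))
        for target in sorted(chosen):
            edges.append((new, target))
            endpoints += [new, target]
    return edges


def degree_distribution(edges):
    """Return a Counter mapping degree -> number of nodes with that degree."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())


edges = barabasi_albert(n_nodes=200, m=2)
distribution = degree_distribution(edges)
```

Keeping a flat list of edge endpoints makes degree-proportional sampling a single uniform draw, at the cost of memory proportional to the number of edges.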

The software package and a user’s manual are available on the developer’s webpage.

TAGora project started on June 1st 2006
Sixth Framework Programme, Information Society Technologies, IST call 5, Contract N. 34721