Cultural Heritage Cluster
DeIC (Danish e-Infrastructure Cooperation) has been charged with spreading High-Performance Computing (HPC) to new research areas, such as the humanities and social sciences. In response, DeIC and the State and University Library have agreed to establish the DeIC National Cultural Heritage Cluster, State and University Library.
The cultural heritage cluster applies state-of-the-art technologies within data science, and for the first time ever facilitates quantitative research projects on the digital Danish cultural heritage – e.g. radio and TV programmes, websites and historical newspapers.
In recent years, the State and University Library has participated in national and international research and research infrastructure projects based on Danish digital cultural heritage. The library has built up both knowledge and competences in what it takes to offer, for instance, data mining, i.e. the search for structures and patterns in large data sets.
The agreement between DeIC and the State and University Library has a total financial framework of DKK 7.2 million over the next three years.
Collections available to research projects
The State and University Library and the Royal Library together are responsible for collecting and preserving Danish cultural heritage, including the digital cultural heritage. This digital cultural heritage is divided into numerous collections, each with its own properties, formats and possibilities. Examples of collections that are now made available to researchers include radio/TV, the Netarchive and the Danish Newspaper Collection.
The radio/TV collection contains more than 1 million hours of TV broadcasts and more than 1.5 million hours of radio programmes broadcast on Danish channels from the 1980s until today. The collection's data are made accessible as audio and video files. The collection also contains large amounts of metadata, such as programme titles, broadcast times and subtitles, depending on the period from which the material originates. Read more at mediestream.dk.
The Netarchive contains more than 600 TB of data, corresponding to more than 20 billion objects gathered from the Danish part of the Internet from 2005 until today. This archive also contains both data and metadata, and both are made available to research projects. The Netarchive is a joint national project between the Royal Library and the State and University Library, and you can read more at netarkivet.dk.
The digital newspaper collection contains 11 million newspaper pages from the 1700s until today. Once the digitisation project is complete, there will be 32 million pages in the collection. All of these pages are stored as image files along with a large amount of metadata and optical character recognition (OCR) data.
In addition to these large collections, the State and University Library also has smaller special collections.
All in all, more than 4 PB, corresponding to approx. 4,000,000 gigabytes, are made available to new and existing research projects.
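As a quick check on these figures, the sketch below converts the stated sizes using decimal (SI) prefixes. That convention is an assumption, since the text does not say whether decimal or binary units are meant. The average object size is only a rough estimate derived from the Netarchive figures, both of which are stated as "more than", and the function names are illustrative.

```python
# Assumption: decimal (SI) prefixes. The article does not state which
# convention its figures use.
GB = 10**9
TB = 10**12
PB = 10**15


def petabytes_to_gigabytes(pb: float) -> float:
    """Convert petabytes to gigabytes using decimal prefixes."""
    return pb * PB / GB


def average_object_size_bytes(total_bytes: float, object_count: int) -> float:
    """Rough average size per object, in bytes."""
    return total_bytes / object_count


total_gb = petabytes_to_gigabytes(4)
avg_netarchive_object = average_object_size_bytes(600 * TB, 20 * 10**9)

print(f"{total_gb:,.0f} GB")
print(f"{avg_netarchive_object / 1000:.0f} kB per object on average")
```

Under these assumptions, 4 PB is 4,000,000 GB, and the Netarchive averages roughly 30 kB per object.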
The Cultural Heritage Cluster is to support new research areas, particularly within digital humanities. The system was therefore designed to make well-established analyses easy to conduct without compromising on support for advanced and bespoke methods.
The Cultural Heritage Cluster is making IBM's BigInsights platform available to research projects. This platform consists of the Open Data Platform (ODPi) together with a set of advanced analysis tools developed by IBM.
The Open Data Platform is a new initiative from the largest Hadoop distributors, and it features many of the current Hadoop technologies. You can read about ODPi at odpi.org, where it is also possible to download a fully functional virtual ODPi server that runs on an ordinary desktop PC, so that the techniques can be tested in a small setup.
On top of ODPi, IBM has added a number of commercial applications: BigSheets, BigSQL, BigR and Text Analytics. Combined, these four systems form the basis for carrying out analyses by means of known techniques, but on very large data sets.
BigSheets uses the spreadsheet metaphor. If you are used to working in Excel, this is a natural place to start.
BigSQL is an ANSI SQL implementation that supports SQL queries on data sets too large for traditional relational databases to handle. If you already have an established workflow or knowledge of SQL, you can connect existing SQL client programs to BigSQL via the widely used JDBC interface.
BigR makes it possible to use R on data sets that exceed the resources of a single computer.
Text Analytics is a browser-based work area for text analysis. It comes with a number of ready-made modules for, e.g., named-entity recognition (NER) and sentiment analysis.
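To give a feel for the kind of query the SQL route described for BigSQL is meant to support, here is a minimal sketch that uses Python's built-in sqlite3 module as a local stand-in. It does not connect to BigSQL. The table `newspaper_pages`, its columns and the sample rows are hypothetical and not the cluster's actual schema or data. The query uses only common SQL constructs (GROUP BY, COUNT, ORDER BY), so a query of this general shape is the sort of thing an ANSI SQL engine would be expected to accept.

```python
import sqlite3


def pages_per_decade(rows):
    """Count pages per decade in an in-memory stand-in table.

    `rows` is a list of (title, year) tuples. The table and column
    names are hypothetical, chosen only for illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE newspaper_pages (title TEXT, year INTEGER)")
    conn.executemany("INSERT INTO newspaper_pages VALUES (?, ?)", rows)
    # Integer division truncates in SQLite; other engines may need an
    # explicit cast or FLOOR to get the same decade bucketing.
    query = """
        SELECT (year / 10) * 10 AS decade, COUNT(*) AS pages
        FROM newspaper_pages
        GROUP BY (year / 10) * 10
        ORDER BY decade
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Made-up sample rows, not real collection data.
sample = [("Avis A", 1849), ("Avis A", 1851), ("Avis B", 1855), ("Avis B", 1902)]
print(pages_per_decade(sample))
```

The point of the sketch is that the analysis is expressed entirely in SQL, so the same way of thinking carries over when the data no longer fits on one machine.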
Over the next six months, three pilot projects will utilise the system's new facilities. The State and University Library, in collaboration with the DeIC eScience center of competence, will make facilities available and offer training in the use of the system, free of charge, to the researchers working on these projects. In 2016 and 2017, DeIC and the State and University Library will offer further, fully financed pilot projects through open project invitations.
In the course of 2016, it will also be possible to buy computing time and consultancy assistance under a transparent price model, which will be developed in connection with the first pilot projects.
The three planned pilot projects are:
Probing a Nation's Web Domain, run by Professor Niels Brügger from Aarhus University and Senior Researcher Ditte Laursen from the State and University Library. The project will analyse the Danish part of the Internet as it has developed from 2005 until today. Its data source will primarily be metadata from the Netarchive.
Digital Footprints Research Group, run by Anja Bechmann, Aarhus University. This project will analyse data from social media. The data source will be both the project's own data and data from the Netarchive.
A project run by Sabine Kirchmeier-Andersen from the Danish Language Council's research institute. This project will analyse the development of Danes' language use on social media, and the data source will be the Netarchive.
Future project invitations will be distributed through national channels for all relevant fields. If you are interested in being notified directly, please contact Per Møldrup-Dalum.