One aim of our project was to extend our FinderApp WiTTFind, which is currently used for exploring and researching only Ludwig Wittgenstein’s Big Typescript TS-213 (BT), to the rest of the 5000 pages of Wittgenstein’s Nachlass that are made freely available by the Wittgenstein Archives at the University of Bergen and are available as Linked Data through the DM2E project. With the award money, we were able to engage three new members in our research group “Wittgenstein in Co-Text”: Roman Capsamun, Yuliya Kalasouskaya and Stefan Schweter.
To get a good insight into the current work of the Archive, the Bergen Electronic Edition (BEE) and the openly available parts of Wittgenstein’s Nachlass, two members of our research group, Angela Krey and Matthias Lindinger, travelled to the Wittgenstein Archives in Bergen. Together with Dr. Alois Pichler and Øyvind Liland Gjesdal (University of Bergen Library) they discussed the latest developments at the archive and transferred the rest of the 5000 pages of Ludwig Wittgenstein’s Nachlass to our institute. At the Bergen archive they also discussed the high-density (HD) scanning of the complete Wittgenstein Nachlass, which is being done in cooperation with Trinity College, Cambridge. During their visit they were able to attend lectures by Prof. Dr. Peter Hacker, a famous Wittgenstein researcher, who was visiting the archive and spoke about “Philosophy and Neuroscience” and “The Nature of Consciousness”. After the talks, they presented him with a demo of our FinderApp WiTTFind, which impressed him very much.
Milestones completed in our award project by September 2014 include:
Extending the Nachlass-data for our FinderApp WiTTFind
We transferred the remaining pages of the freely available part of Wittgenstein’s Nachlass (5000 pages in all) into the storage area at our institute. One problem with the XML-TEI-P5-compatible edition data in Bergen is that its XML tags carry a great deal of information that is not relevant to our FinderApp. We therefore defined a restricted XML-TEI-P5-compatible tagset that includes all the information our FinderApp needs; we call this tagset the “CISWAB tagset”. To reduce the Bergen tagset to our CISWAB tagset we wrote XSLT scripts together with our cooperation partner in Bergen. To validate the CISWAB data, we defined an XML DTD schema (CISWAB-DTD).
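The actual reduction is done with XSLT; as a language-neutral illustration of the same idea, here is a minimal Python sketch that unwraps any element outside a whitelist while preserving its text. The tag names in `CISWAB_TAGS` are hypothetical placeholders, not the project’s real tagset.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist -- the real CISWAB tagset is fixed by the CISWAB-DTD.
CISWAB_TAGS = {"text", "body", "div", "p", "hi", "ref"}

def reduce_to_ciswab(parent):
    """Unwrap every child element whose tag is outside the reduced tagset,
    splicing its text and children back into the parent."""
    idx = 0
    while idx < len(parent):
        child = parent[idx]
        reduce_to_ciswab(child)
        if child.tag in CISWAB_TAGS:
            idx += 1
            continue
        # Merge the unwanted element's text into the surrounding text.
        if child.text:
            if idx > 0:
                parent[idx - 1].tail = (parent[idx - 1].tail or "") + child.text
            else:
                parent.text = (parent.text or "") + child.text
        grandchildren = list(child)
        parent.remove(child)
        for off, g in enumerate(grandchildren):
            parent.insert(idx + off, g)
        # Re-attach the removed element's tail text.
        if child.tail:
            if grandchildren:
                grandchildren[-1].tail = (grandchildren[-1].tail or "") + child.tail
            elif idx > 0:
                parent[idx - 1].tail = (parent[idx - 1].tail or "") + child.tail
            else:
                parent.text = (parent.text or "") + child.tail

root = ET.fromstring('<text><p>Keep <seg rend="x">inner</seg> tail</p></text>')
reduce_to_ciswab(root)
print(ET.tostring(root, encoding="unicode"))  # <text><p>Keep inner tail</p></text>
```

The unwrapped output can then be validated against the CISWAB-DTD as usual.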
Extending the syntactic disambiguation of the Nachlass-Data
To extend syntactic disambiguation to the rest of the 5000 pages, we wrote new scripts that run the part-of-speech (POS) tagging stage with TreeTagger automatically. Every new incoming CISWAB-XML file is automatically tagged and inserted into the storage area of our FinderApp.
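A pipeline of this shape can be sketched as follows. This is a simplified assumption of how such a script might look: the wrapper-script name `tree-tagger-german` and the token-per-line extraction are illustrative, not the project’s actual code.

```python
import re
import subprocess
import xml.etree.ElementTree as ET

def extract_tokens(ciswab_xml: str):
    """Pull the plain text out of a CISWAB-XML string and split it into
    tokens -- TreeTagger expects its input one token per line."""
    root = ET.fromstring(ciswab_xml)
    text = " ".join(root.itertext())
    return re.findall(r"\w+|[^\w\s]", text)

def pos_tag(ciswab_xml: str, tagger_cmd="tree-tagger-german"):
    """Run TreeTagger over the extracted tokens (the wrapper-script name
    is an assumption; adjust it to the local TreeTagger installation)."""
    tokens = extract_tokens(ciswab_xml)
    proc = subprocess.run([tagger_cmd], input="\n".join(tokens),
                          capture_output=True, text=True, check=True)
    return proc.stdout  # typically: token, POS tag, lemma per line

print(extract_tokens("<p>Die Welt ist alles, was der Fall ist.</p>"))
```

In production, a watcher over the incoming directory would call `pos_tag` for each new file and write the result into the FinderApp storage area.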
Using “Tesseract” for OCR and switching to HD-scans for our WiTTReader
One central part of our FinderApp is the facsimile reader WiTTReader, which allows users to display, browse and highlight all hits found by the Finder within the original facsimiles. Until now we have used only single-density (SD) facsimiles to scroll through the Nachlass. In the next generation of our FinderApp we want to use high-density (HD) facsimiles, which are currently being produced at Trinity College, Cambridge.
As it is very important in our project to use only open-source tools, we will no longer use the OCR tool ABBYY FineReader (version 11). After some tests, we decided on “Tesseract”, which is also used by the Google Books project. We have transferred the first HD facsimiles of the Nachlass to our institute, and the first OCR quality tests with Tesseract are very promising.
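A batch run of this kind can be scripted around the Tesseract command line. The file names, directories and the choice of hOCR output below are assumptions for illustration; only the general `tesseract <image> <outputbase> -l <lang>` invocation is Tesseract’s documented interface.

```python
import subprocess
from pathlib import Path

def tesseract_cmd(image: Path, out_base: Path, lang: str = "deu"):
    """Build a Tesseract command line producing hOCR output
    (the German language pack is assumed to be installed)."""
    return ["tesseract", str(image), str(out_base), "-l", lang, "hocr"]

def ocr_batch(scan_dir: str, out_dir: str):
    """Run Tesseract over every HD scan in a directory (paths are assumptions)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for image in sorted(Path(scan_dir).glob("*.tif")):
        subprocess.run(tesseract_cmd(image, Path(out_dir) / image.stem), check=True)

print(tesseract_cmd(Path("ts213_0001.tif"), Path("out/ts213_0001")))
```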
Putting Linked Library Data to Work: the DM2E Showcase
Join us on Tuesday 18 November at the ONB Austrian National Library in Vienna to find out more about the DM2E project and the wider possibilities of scholarly and library (re-)use of Linked Open Data.
In this half-day seminar we will share information on how content has been used for mappings to Europeana and for publishing delivered metadata as Linked Open Data using the DM2E model, a specialised version of the Europeana Data Model (EDM) for the manuscript domain. In addition, Open Knowledge will be present to talk about the value of open data and the OpenGLAM network and we will show results of the work carried out by Digital Humanities scholars applying the semantic annotation tools developed in DM2E to a subset of the published content. The day will be concluded with a workshop based around the Pundit tool for semantic annotation from NET7.
Date and time: Tuesday, 18 November 2014, 13:00 – 18:00
Location: Oratorium, Austrian National Library, Josefsplatz 1, 1015 Vienna, Austria
Max Kaiser, ONB Austrian National Library: Welcome
Doron Goldfarb, ONB Austrian National Library: Introduction to the DM2E project
Marko Knepper, University Library Frankfurt am Main: From Library Data to Linked Open Data
Bernhard Haslhofer, Open Knowledge Austria and Lieke Ploeger, Open Knowledge: The value of open data and the OpenGLAM network
Kristin Dill, ONB Austrian National Library: DM2E and Scholarly Activities
The DM2E project has provided the inspiration for two of its partners ― Dr Kai Eckert of the University of Mannheim and Dov Winer of the European Association for Jewish Culture and Judaica Europeana ― to embark on an initiative to publish existing reference works on Jewish history and culture as Linked Data under the name JudaicaLink.
Reference works such as encyclopedias, glossaries, thesauri or catalogues function as guides to a scholarly domain as well as anchor points and manifestations of scholarly work. On the web of Linked Data, they can perform a key function of interlinking resources related to the described concepts. In effect, this means they can be enriched by creating new links between and within different encyclopedias. This function could revolutionize the work of digital humanists and become the bread and butter of their research diet.
To the best of our knowledge, JudaicaLink is the first such initiative and platform in the field of Jewish studies.
JudaicaLink: a platform for access to Linked Data versions of encyclopedias
As with many pioneering LOD publishing efforts, the first challenge was to persuade the publishers and maintainers of such reference works to give their permission to create a Linked Data version of their encyclopedia and publish it on JudaicaLink.org. Provided the work is already online, the minimal requirement is that the URLs of the articles in the encyclopedia remain stable. It is also possible to publish an LOD version of a given work on the publisher’s own website, provided they have the technical infrastructure and capacity to do so. In this case, JudaicaLink can provide information and a central search functionality.
The YIVO Encyclopedia of Jews in Eastern Europe
We have been fortunate that after some discussion the leaders of the YIVO Institute for Jewish Research in New York saw the potential of LOD for their extraordinary YIVO Encyclopedia of Jews in Eastern Europe and gave us the go ahead. From the point of view of a Linked Data enthusiast, the YIVO Encyclopedia is really a great resource. All articles are highly interlinked, often they even provide a hierarchy of sub-concepts described under a superordinate concept. Links to glossary terms provide further terminological control.
Encyclopedia of Russian Jewry
Recently, JudaicaLink also announced the first release of the Encyclopedia of Russian Jewry (Rujen), published in Moscow since 1994, as Linked Open Data. Rujen is not as interlinked as YIVO and the articles are much shorter on average, but it contains many more articles (about 20,000 compared to about 2,500 YIVO articles). The first obvious feature of Rujen for English-speaking people is the language: it’s Russian. The Cyrillic alphabet raises an important question regarding Linked Data: how to coin the URIs for the articles. We are still considering different solutions. Basically, there are three options based on the actual identifier, the Cyrillic title of the article:
Use an Internationalized Resource Identifier, an IRI. For example: http://data.judaicalink.org/data/rujen/гатов_шапсель_гиршевич. This is perfectly readable (at least by Russians) and probably the option to be preferred. However, it is not clear if all applications support IRIs correctly and we would like to have the data as easily accessible as possible. Therefore, and also because we wanted to try it, we decided on the next option:
Use a transliterated URI. For example: http://data.judaicalink.org/data/rujen/gatov_shapselj_girshevich. Again, this is perfectly readable, and since Rujen is mostly about persons and locations, people familiar with the Latin alphabet can make sense of it. However, there are drawbacks. We did not transliterate just because of our widely shared ignorance of the Cyrillic alphabet. We adopted this option because we wanted to have valid URIs, for the sake of backwards-compatibility and technical interoperability. This means using only the 26 common Latin letters, no diacritics, no special characters. And the transliteration should be simple, based on a lookup-table that translates every Cyrillic letter consistently to one or more Latin characters. This is obviously not an ideal way to transliterate and, according to our tests with our nice colleagues from Belarus and Russia, it quite often produces somewhat strange results for native speakers. However, they assured us that it is still readable and not insulting.
Actually, there is also a fourth option that is completely different: using some kind of numbering or code scheme (for example a hash value of the title), but despite leading to shorter URIs, this again has the effect that no one can make sense of it, similar to option 1. There are people who advocate this approach precisely for this reason: a URI should not contain possibly misleading semantics. And, of course, a number does not show an arbitrary preference for a language or an alphabet.
So, we settled for transliteration as our first attempt, but we are curious about your ideas and opinion. After reading this long and hopefully interesting digression, you are probably much more interested in the question: how can I access this LOD resource?
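The lookup-table transliteration described above can be sketched in a few lines of Python. The table below is a simplified assumption, chosen so that the published example URI comes out right; the actual table used by JudaicaLink may differ in details.

```python
# Simplified one-way lookup table (an assumption; note the unusual
# mapping of the soft sign to "j", matching the published example URI).
CYR2LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "j", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "j", "э": "e", "ю": "yu", "я": "ya", " ": "_",
}

def slugify(title: str) -> str:
    """Transliterate a Cyrillic article title into a plain-ASCII URI slug."""
    return "".join(CYR2LAT.get(ch, ch) for ch in title.lower())

print(slugify("Гатов Шапсель Гиршевич"))  # gatov_shapselj_girshevich
```

Because each Cyrillic letter always maps to the same Latin sequence, the slugs are reproducible, which is what keeps the URIs stable.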
The easiest way is the following: while browsing the YIVO Encyclopedia, you can access the data representation by simply replacing www.yivoencyclopedia.org/article.aspx/ with data.judaicalink.org/data/yivo/ in the URL field of the browser. For convenience, you can also use our bookmarklets. They are provided together with additional information for each encyclopedia here. Just drag and drop them to your bookmarks and when you click on this bookmark while on an encyclopedia article, you will be directed to the Linked Data version. For an even quicker look, you can also just start at the concept used above in option 3, or for YIVO, for example here: http://data.judaicalink.org/data/html/yivo/Abramovitsh_Sholem_Yankev.
All in all, JudaicaLink now provides access to 22,808 concepts in English (approx. 10%) and in Russian (approx. 90%), mostly locations and persons.
JudaicaLink gets links
From the beginning our vision was not only about the provision of stable URIs and data for concepts described in the encyclopedias. It was also about the generation of links between these resources and other linked data resources on the Web. In a first run, we used Silk to generate links between JudaicaLink and the following sources:
All the links have been created automatically and are primarily based on the labels of the resources, so some wrong links are to be expected. Nevertheless, this is an important first step. For the present, we provide the links directly together with the resource descriptions (as owl:sameAs links), but we will separate them out, with proper identification of their provenance, as soon as we are able to use more sophisticated linking approaches. One immediate benefit of this simple linking is that we could already generate links between the two encyclopedias. This works because several sources, like DBpedia, are multilingual, so links to both encyclopedias could be established. Whenever a single resource links to one resource in each encyclopedia, an additional link establishing the identity of those two resources can be inferred.
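The inference step described above amounts to a join on the shared external target. A minimal sketch, using toy data with assumed identifiers rather than real JudaicaLink URIs:

```python
from collections import defaultdict

def infer_cross_links(links_a, links_b):
    """Given owl:sameAs links from two encyclopedias to external resources,
    infer identity links between articles sharing an external target."""
    by_target = defaultdict(list)
    for res, ext in links_b:
        by_target[ext].append(res)
    return [(res, other)
            for res, ext in links_a
            for other in by_target[ext]]

# Toy data -- the identifiers are placeholders, not real JudaicaLink URIs.
yivo = [("yivo/Vilna", "dbpedia/Vilnius")]
rujen = [("rujen/viljnyus", "dbpedia/Vilnius")]
print(infer_cross_links(yivo, rujen))  # [('yivo/Vilna', 'rujen/viljnyus')]
```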
JudaicaLink arrives in the Cloud
With all these links, JudaicaLink is now also part of the famous LOD Cloud that was released recently in its latest version. You can find us at about 4 o’clock, close to the border and right beside our neighbour project DM2E.
We hope the readers of this blog will spread the word and help us to convince more publishers to work with us. And do let us know what you think of JudaicaLink and what additional ideas you have. We look forward to hearing from you!
The Digital Humanities Advisory Board of the DM2E project meets on a regular basis to ensure that the research direction and technical developments within the project meet the needs of digital humanities scholars. On 2 October the Board held their fifth meeting, with the following attendees:
Dirk Wintergrün (Max-Planck-Institut für Wissenschaftsgeschichte)
Sally Chambers (DARIAH, Göttingen Centre for the Digital Humanities)
Laurent Romary (INRIA)
Alois Pichler (University of Bergen)
Alastair Dunning (The European Library)
At the start of the meeting, Christian Morbidoni (Net7) presented the latest developments of Pundit 2 and its new features, which include an improved user interface and added flexibility, making it easier to adapt Pundit for specific needs such as user experiments.
Steffen Hennicke (Humboldt-Universität zu Berlin) then went on to present the user experiments that are currently ongoing with Pundit2 and the new version of Korbo. The first set of experiments focuses more generally on how humanists work with Pundit and Linked Data: students from fields such as Archival Sciences and (Educational) History formulated an original and relevant research question or interest, discussed the ontology in a first workshop and worked with Pundit on annotating specific material, with a second workshop to talk through their results.
The goal of the second set of experiments is to better understand the reasoning process of digital humanists working with Linked Data. Three different users (a philosopher, an art historian and a historian) were asked to formulate an original and relevant research question pertaining to their particular set of research objects and data and to find an answer to this question using a faceted browser containing and visualising their data. Throughout, they self-document their work and thought process.
Both sets of experiments are still work in progress: the results will be presented at the final meeting of DM2E on 11 December in Pisa, Italy, together with the final results of the DM2E project and those of the winners of the Open Humanities Awards.
Attendance at this event is free and registration will open soon. If you are interested in coming, please save the date and keep an eye on the DM2E blog.
The event is aimed at digital humanists and intended to target research-driven experimentation with existing humanities data sets. One of the most exciting recent developments in the digital humanities is the investigation and analysis of complex data sets, which requires close collaboration between humanities and computing researchers. The aim of the hack day is not to produce complete applications but to experiment with methods and technologies for investigating these data sets, so that at the end we have an understanding of the types of novel techniques that are emerging.
Possible themes include, but are not limited to:
Research in textual annotation has been a particular strength of digital humanities. Where are the next frontiers? How can we bring together insights from other fields and digital humanities?
How do we provide linking and sharing of humanities data in a way that makes sense of its complex structure, with its many internal relationships, both structural and semantic? In particular, distributed humanities research data often combines objects in multiple media, and in addition there is a diversity of standards for describing the data.
Visualisation. How do we develop reasonable visualisations that are practical and help build an overall intuition for the underlying humanities data set?
How can we advance the novel humanities technique of network analysis to describe complex relationships of ‘things’ in socio-historical systems: people, places, etc.?
With this hack day we seek to form groups of computing and humanities researchers that will work together to come up with small-scale prototypes that showcase new and novel ways of working with humanities data.
SEA CHANGE aims to facilitate the mass annotation of places in Historic Geospatial Documents (both geographic texts and maps) in two hackathon-style annotation workshops. We are pleased to announce that the first of these workshops will take place at the University of Heidelberg, on October 30-31. Our special thanks go to Lukas Loos from the Institute of Geography who has kindly offered to be our host for this event.
SEA CHANGE is part of the Pelagios project, an extensive network that is interlinking annotated online resources that document the past. The results from SEA CHANGE will feed back into Pelagios, making our interlinked resources even richer and more extensive. In preparation, those of us at Pelagios have already been annotating a range of Greek literary texts with a specific geographic focus, and have compiled a candidate list of documents from the European Middle Ages to work with next. On the technical side, we have extended our digital toolbox to allow us to annotate high-resolution imagery.
We’ve prepared a little sneak preview in the screencast above. The real thing will make its first live & hands-on appearance at the workshop. In case you happen to be around, and your curiosity is piqued, you are welcome to join us, subject to space & availability. If you are interested, then do drop us a line – first come, first served!
In the second part of the community session, we carried out some practical exercises. The Person Time Place (PTP) experiment allows users to annotate snippets of text expressing the concept that a person was in a certain place at a certain time. The Annomatic experiment allows users to create annotations semi-automatically. After the annotations were created, several visualization tools were used to display the “knowledge” produced during the experiment. The following website was used to facilitate the experiments: http://dev.thepund.it/dariah2014/
At the end of the experiments we asked the roughly 30 participants to complete a questionnaire to gather their feedback.
Since the project start in February 2012, the partners in the DM2E project have been working on opening up prominent manuscripts and developing open workflows for migration of this data to Europeana and the wider Linked Open Web. Before the project closes, in February 2015, the consortium is organising a final series of events to demonstrate the progress that has been achieved as well as to inspire future research in the area of Linked Open Data. Join us at one of the following events to hear more!
18 November 2014: Putting Linked Library Data to Work: the DM2E Showcase (ONB Austrian National Library, Vienna, Austria)
This half-day seminar (taking place from 13.00 – 18.00) will be divided into two parts. In the first part we focus on the work of the content providing partners from the DM2E project. We will share information on how the providers’ content has been used for mappings to Europeana and for publishing delivered metadata as Linked Open Data using the DM2E model, a specialised version of the Europeana Data Model (EDM) for the manuscript domain. In addition, Open Knowledge will be present to talk about the value of open data and the OpenGLAM network.
The second part will focus on possibilities of scholarly (re-)use of Linked Open Data. Among other topics, it will present the results of the work carried out by Digital Humanities scholars applying the semantic annotation tools developed in DM2E to a subset of the published content. The day will be concluded with a workshop based around the Pundit tool for semantic annotation from NET7.
1-3 December 2014: workshop ‘RDF Application Profiles in Cultural Heritage’ at the SWIB2014 Semantic Web in Libraries conference (Bonn, Germany)
The SWIB conference aims to provide substantial information on Linked Open Data developments relevant to the library world and to foster the exchange of ideas and experiences among practitioners. SWIB encourages thinking outside the box by involving participants and speakers from other domains, such as scholarly communications, museums and archives, or related industries.
DM2E is organising a workshop at this conference together with Europeana and the DCMI RDF Application Profiles Task Group (RDF-AP) on the definition, creation and use of application profiles, with the DM2E model as one of the case studies. Registration for the conference has already opened: the full programme and more information on this workshop are available here.
11 December 2014: DM2E final event (Pisa, Italy)
Towards the closing of our project, we invite all those interested to come to Pisa for the presentation of our final results. Speakers from the DM2E consortium will demonstrate what has been achieved throughout the course of the project and how the tools and communities created have helped to further humanities research in the area of manuscripts in the Linked Open Web.
The winners of the second round of the Open Humanities Awards will be participating in this event as well and show the results of their projects. Finally, there will be a keynote talk from a prominent researcher from the digital humanities field.
More information on the programme for each of these events, as well as the registration details, will be announced through the DM2E website in the near future.
“Europäische Friedensverträge der Vormoderne online” (“Early Modern European Peace Treaties Online”) is a comprehensive collection of about 1,800 bilateral and multilateral European peace treaties from the period of 1450 to 1789, published as an open access resource by the Leibniz Institute of European History (IEG). Currently the metadata is stored in a relational database with a Web front-end. This project has two primary goals:
the publication of the treaties metadata as Linked Open Data, and
the evaluation of nanopublications as a representation format for humanities data.
The project got off to a rocky start, as I had massive troubles finding someone to work on it. The IEG is a non-university research institute, so I do not have ready access to students—and in particular not to students with knowledge about Linked Open Data (LOD). I was about to give up, when Magnus Pfeffer of the Stuttgart Media University called to tell me he’d be interested to work on it with his team. He’s got lots of experience with LOD, so I’m very happy to have him work with me on the project.
We’ve now started to work on the first goal, the publication of the treaties metadata as LOD. This should be a relatively straightforward process, whereas the second goal, the evaluation of the nanopublications approach, will be more experimental—obviously, since nobody has used it in such a context yet.
The process for converting the content of the existing database into LOD basically consists of four steps:
1. Analyzing the data. The existing database consists of 11 tables and numerous fields. Some of the fields have telling names, but not all of them. Another question will be what the fields actually contain; it seems that sometimes creative solutions have been used. For example, the parties of a treaty are stored in a field declared as follows:
partners varchar(255) NOT NULL DEFAULT ''
This is a string field, but it doesn’t contain the names of the parties; rather, it holds their IDs, for example:

37,46,253
You can then look up the names in another table and find out that 37 is France, 46 is Genoa, and 253 is Naples-Sicily. This is a workaround for the problem of storing lists of variable length, which is quite tedious in a relational database. While this approach is clearly better than hardcoding the names of the parties in every record, it moves a part of the semantics into the application, which has to know that what looks like a string is actually a list of keys for a table.
Now, this example is not particularly complicated, but it illustrates that a thorough analysis of the database is necessary in order to accurately extract and convert the information it contains.
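Resolving such a field is then a small lookup step. In this sketch, the dictionary contents are read off the example above, and the comma separator is an assumption about the field’s format:

```python
# Names taken from the example above; the comma separator is an assumption.
PARTNERS = {37: "France", 46: "Genoa", 253: "Naples-Sicily"}

def resolve_partners(field: str):
    """Expand the 'partners' string field into a list of party names."""
    return [PARTNERS[int(token)] for token in field.split(",") if token.strip()]

print(resolve_partners("37,46,253"))  # ['France', 'Genoa', 'Naples-Sicily']
```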
2. Identifying and selecting pertinent ontologies. We don’t want to re-invent the wheel but rather want to build upon existing and proven ontologies for describing the treaties. One idea we’ve discussed is to model them as events; one could then use an ontology like LODE. However, we will first need to see what information we need to represent, i.e., what we find in the database.
3. Modelling the information in RDF. Once we know how to conceptually model the information, we need to define how to actually represent the information on a treaty in RDF.
4. Generating the data. Finally, we can then iterate over the database, extract the information, combine it into RDF statements, and output them in a form we can then import into a triple store.
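Step 4 might look roughly like the following sketch. The namespace is a placeholder, and modelling the parties with `dcterms:contributor` is one possible choice for illustration, not a project decision (step 2 is precisely about selecting the right ontology).

```python
# Placeholder namespace and property choices -- assumptions for illustration.
BASE = "http://example.org/treaty/"
DCT = "http://purl.org/dc/terms/"

def treaty_to_ntriples(row, partner_names):
    """Turn one database row plus its resolved party names into N-Triples lines."""
    subj = f"<{BASE}{row['id']}>"
    triples = [f'{subj} <{DCT}date> "{row["date"]}" .']
    for name in partner_names:
        triples.append(f'{subj} <{DCT}contributor> "{name}" .')
    return triples

for line in treaty_to_ntriples({"id": 1, "date": "1454-04-09"},
                               ["France", "Genoa"]):
    print(line)
```

Iterating this over all rows yields a file that can be imported into a triple store directly.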
At this point, the basic data on the treaties will be available as LOD. However, some very interesting information is only available as unstructured text, for example the references to secondary literature or the names of the signees. At this point, we’ll probably get back to the database to see what additional information could be extracted—with reasonable effort—for inclusion.
Getting out the basic information should be straightforward, but, as always when dealing with legacy data, we may be in for some surprises…
On 17-19 September the DARIAH-EU network organises its fourth General VCC (Virtual Competency Centres) meeting in Rome. The DARIAH-EU infrastructure is focused on bringing together individual state-of-the-art digital Arts and Humanities activities across Europe. Their annual meeting provides opportunities to work together on topics within their Virtual Competency Centres (VCC) and to share experiences from various fields within the digital humanities.
This year, the DARIAH-EU general meeting will also host specific community sessions alongside the general programme. We are happy to announce that DM2E has been selected to host a community session on Pundit, the semantic annotation tool that is being developed as part of work package 3.
This community session, entitled “Pundit, a semantic annotation tool for researchers”, will take place on Thursday 18 September. The session aims to illustrate Pundit’s main features and components (the client, Feed, Ask) as well as to show how it has been used by scholarly communities in the domains of Philosophy, History of Art, Philology and History of Thought. Moreover, attendees will get hands-on experience with Pundit through dedicated exercises designed to give them the basic skills to produce semantic annotations on digital libraries and generic websites.
Attendance at this event is free: registration is possible through Eventbrite.