I am a postdoctoral researcher at the University of Twente interested in various areas of information retrieval, information extraction and natural language processing.
About me
I am a postdoctoral researcher with the Human Media Interaction group of the University of Twente. I am interested in various areas of information retrieval, information extraction and natural language processing.
After receiving my MSc degree in computer science from the University of Twente, I started a PhD project in the area of biomedical information retrieval under the supervision of Franciska de Jong and Wessel Kraaij. During this project I worked at the European Bioinformatics Institute in Cambridge as a visiting researcher. In my PhD work I investigated how to incorporate domain knowledge into search engines for biomedical researchers, adapting conventional models for cross-lingual information retrieval to effectively incorporate knowledge from concept thesauri in the retrieval process.
In 2010 I joined the Database Group of the University of Twente as a postdoctoral researcher, where I worked on distributed information retrieval. Since 2012, I have worked for the Meertens Institute in Amsterdam and the Human Media Interaction group as part of the FACT (Folktales As Classifiable Texts) project. In this project I am researching techniques to (semi-)automatically assign metadata to Dutch folktales collected in the Dutch Folktale Database.
Publications
Also see UTwente Eprints and Google Scholar.
2014
R. Aly, D. Trieschnigg, K. McGuinness, N. E. O'Connor, and F. de Jong.
Average precision: Good guide or false friend to multimedia search
effectiveness?
In 20th International Conference on Multimedia Modeling
(Session on Multimedia Hyperlinking and Retrieval), MMM 2014. Dublin,
Ireland, 2014.
Accepted.
T. Demeester, R. Aly, D. Hiemstra, D. Nguyen, D. Trieschnigg, and C. Develder.
Exploiting user disagreement for search evaluation: an experimental
approach.
In Proceedings of the sixth ACM international conference on Web
Search and Data Mining, WSDM. 2014.
Accepted.
T. Demeester, D. Trieschnigg, D. Nguyen, and D. Hiemstra.
Overview of the TREC 2013 Federated Web Search Track.
In
Proceedings of the 22nd Text REtrieval Conference, TREC 2013. 2014.
Paper
Abstract
The TREC Federated Web Search track is intended to promote research related to federated search in a realistic web setting, and to this end provides a large data collection gathered from a series of online search engines. This overview paper discusses the results of the first edition of the track, FedWeb 2013. The focus was on basic challenges in federated search: (1) resource selection, and (2) results merging. After an overview of the provided data collection and the relevance judgments for the test topics, the participants' individual approaches and results on both tasks are discussed. Promising research directions and an outlook on the 2014 edition of the track are provided as well.
2013
M. Dadvar, D. Trieschnigg, and F. de Jong.
Expert knowledge for automatic detection of bullies in social
networks.
In
Proceedings of the 25th Benelux Conference on Artificial
Intelligence, BNAIC. Delft, 7-8 Nov 2013.
Paper
Abstract
Cyberbullying is a serious social problem in online environments and social networks. Current approaches to tackle this problem are still inadequate for detecting bullying incidents or flagging bullies. In this study we used a multi-criteria evaluation system to obtain a better understanding of YouTube users' behaviour and their characteristics through expert knowledge. Based on experts' knowledge, the system assigns a score to each user which represents their level of "bulliness" based on the history of their activities. The scores can be used to discriminate between users with a bullying history and those who were not engaged in hurtful acts. This preventive approach can provide information about users of social networks and can be used to build monitoring tools to aid finding and stopping potential bullies.
M. Dadvar, D. Trieschnigg, R. Ordelman, and F. de Jong.
Improving cyberbullying detection with user context.
In
Proceedings of the 35th European Conference on IR Research,
ECIR 2013, pages 693–696. 2013.
Paper
Abstract
The negative consequences of cyberbullying are becoming more alarming every day and technical solutions that allow for taking appropriate action by means of automated detection are still very limited. Up until now, studies on cyberbullying detection have focused on individual comments only, disregarding context such as users' characteristics and profile information. In this paper we show that taking user context into account improves the detection of cyberbullying.
T. Demeester, D. Nguyen, D. Trieschnigg, C. Develder, and D. Hiemstra.
Snippet-based relevance predictions for federated web search.
In
Proceedings of the 35th European Conference on IR Research,
ECIR 2013, pages 697–700. 2013.
Paper
Abstract
How well can the relevance of a page be predicted, purely based on snippets? This would be highly useful in a Federated Web Search setting where caching large amounts of result snippets is more feasible than caching entire pages. The experiments reported in this paper make use of result snippets and pages from a diverse set of actual Web search engines. A linear classifier is trained to predict the snippet-based user estimate of page relevance, but also, to predict the actual page relevance, again based on snippets alone. The presented results confirm the validity of the proposed approach and provide promising insights into future result merging strategies for a Federated Web Search setting.
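As a rough illustration of the idea (not the paper's feature set or data), a linear classifier over toy snippet texts could be trained with scikit-learn along these lines:

```python
# Minimal sketch: predict page relevance from result snippets with a linear
# classifier. The snippets and labels below are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

snippets = ["cheap flights to dublin compare airlines and fares",
            "dublin travel guide with top attractions and hotels",
            "forum thread about an unrelated programming error"]
page_relevant = [1, 1, 0]  # hypothetical page-level relevance labels

# A bag-of-words representation of the snippet stands in for richer
# snippet features such as query-title and query-summary overlap.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(snippets, page_relevant)

print(model.predict_proba(["book hotels in dublin city centre"])[0, 1])
```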
D. Nguyen, R. Gravel, D. Trieschnigg, and T. Meder.
“How Old Do You Think I Am?”: A Study of Language and Age in
Twitter.
In
Proceedings of the Seventh International AAAI Conference on
Weblogs and Social Media, ICWSM 2013. 2013.
Paper
Abstract
In this paper we focus on the connection between age and language use, exploring age prediction of Twitter users based on their tweets. We discuss the construction of a fine-grained annotation effort to assign ages and life stages to Twitter users. Using this dataset, we explore age prediction in three different ways: classifying users into age categories, by life stages, and predicting their exact age. We find that an automatic system achieves better performance than humans on these tasks and that both humans and the automatic systems have difficulties predicting the age of older people. Moreover, we present a detailed analysis of variables that change with age. We find strong patterns of change, and that most changes occur at young ages.
D. Nguyen, D. Trieschnigg, and M. Theune.
Folktale classification using learning to rank.
In
Proceedings of the 35th European Conference on IR Research,
ECIR 2013, pages 195–206. 2013.
Paper
Abstract
We present a learning to rank approach to classify folktales, such as fairy tales and urban legends, according to their story type, a concept that is widely used by folktale researchers to organize and classify folktales. A story type represents a collection of similar stories often with recurring plot and themes. Our work is guided by two frequently used story type classification schemes. Contrary to most information retrieval problems, the text similarity in this problem goes beyond topical similarity. We experiment with approaches inspired by distributed information retrieval and features that compare subject-verb-object triplets. Our system was found to be highly effective compared with a baseline system.
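A pointwise learning-to-rank sketch of this setup might look as follows; the tales, story types and the single similarity feature are invented here, and the paper's feature set (including distributed IR and subject-verb-object features) is considerably richer:

```python
# Pointwise learning-to-rank sketch for story-type assignment (illustration only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

tales = [("a girl loses her glass slipper at the royal ball", "ATU510A"),
         ("a wolf dresses up as grandmother to trick a girl", "ATU333"),
         ("a poor girl is helped by her dead mother before the ball", "ATU510A"),
         ("a girl with a red hood meets a wolf in the forest", "ATU333")]

vec = TfidfVectorizer()
X = vec.fit_transform([text for text, _ in tales])
types = sorted({t for _, t in tales})

def features(query_vec, type_id):
    """One feature per (tale, candidate type): max cosine to tales of that type."""
    rows = [i for i, (_, t) in enumerate(tales) if t == type_id]
    return [float(cosine_similarity(query_vec, X[rows]).max())]

# Pointwise training data: label 1 if the candidate type is the correct one.
train_X, train_y = [], []
for text, gold in tales:
    q = vec.transform([text])
    for t in types:
        train_X.append(features(q, t))
        train_y.append(int(t == gold))
ranker = LogisticRegression().fit(train_X, train_y)

new_tale = vec.transform(["a servant girl flees the ball and loses a shoe"])
scores = {t: ranker.decision_function([features(new_tale, t)])[0] for t in types}
print(sorted(types, key=scores.get, reverse=True))  # candidate story types, best first
```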
K. Tjin-Kam-Jet, D. Trieschnigg, and D. Hiemstra.
Using a stack decoder for structured search.
In
Proceedings of the 10th international conference on Flexible
Query Answering Systems (FQAS 2013), volume 8132 of
Lecture Notes in
Computer Science, pages 519–530. 2013.
Paper
Abstract
We describe a novel and flexible method that translates free-text queries to structured queries for filling out web forms. This can benefit searching in web databases which only allow access to their information through complex web forms. We introduce boosting and discounting heuristics, and use the constraints imposed by a web form to find a solution both efficiently and effectively. Our method is more efficient and shows improved performance over a baseline system.
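The gist of such a decoder can be sketched as below; the fields, lexicon and scoring function are hypothetical, and the actual method adds the boosting and discounting heuristics described in the paper:

```python
# Rough sketch of a stack-decoder-style search: assign consecutive query
# tokens (or spans) to web-form fields, keeping the best partial hypotheses
# per number of consumed tokens. Fields and scores are made up.
import heapq

FIELDS = {"from", "to", "time"}

def score(field, span):
    """Hypothetical scoring function: how well a token span fits a form field."""
    lexicon = {"from": {"amsterdam", "utrecht"},
               "to": {"amsterdam", "utrecht", "enschede"},
               "time": {"9:30", "morning"}}
    return 1.0 if span in lexicon[field] else 0.1

def decode(tokens, beam=5):
    # One "stack" per number of consumed tokens, holding (score, assignment).
    stacks = [[] for _ in range(len(tokens) + 1)]
    stacks[0] = [(0.0, [])]
    for i in range(len(tokens)):
        for s, assignment in stacks[i]:
            used = {f for f, _ in assignment}
            for field in FIELDS - used:              # each field filled at most once
                for j in range(i + 1, len(tokens) + 1):
                    span = " ".join(tokens[i:j])
                    stacks[j].append((s + score(field, span),
                                      assignment + [(field, span)]))
        for k in range(len(stacks)):                 # prune to the beam width
            stacks[k] = heapq.nlargest(beam, stacks[k], key=lambda h: h[0])
    return max(stacks[-1], key=lambda h: h[0])       # best full-query hypothesis

print(decode("utrecht enschede 9:30".split()))
```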
D. Trieschnigg, D. Nguyen, and T. Meder.
In search of Cinderella: A transaction log analysis of folktale
searchers.
In
Proceedings of the first ACM SIGIR Workshop on the
Exploration, Navigation and Retrieval of Information in Cultural Heritage,
ENRICH 2013. Dublin, Ireland, 1 August 2013.
Paper
Slides
Abstract
In this work we report on a transaction log analysis of the Dutch Folktale Database, an online repository of extensively annotated folktales ranging from old fairy tales to recent urban legends, written in (old) Dutch, Frisian and a variety of Dutch dialects. We observed that users have a preference for subgenres within folktales such as traditional legends and urban legends and prefer stories in standard Dutch over stories in Frisian. Searches are typically short and aim at large groups of stories (from the same subgenre or collector), or specific stories with the same main character. In contrast, search sessions are relatively long (median of around 2 minutes) and many result pages are viewed (average: 3.4 pages, median: 2 pages). Based on the observations we propose a number of improvements to the current search and browsing interface. Our findings offer insight into the search behavior of folktale searchers, but are also of interest to researchers and developers working on other e-humanities collections.
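Session statistics of this kind can be computed from a transaction log with a few lines of standard-library Python; the log format and values below are invented, not the actual Dutch Folktale Database log:

```python
# Toy sketch: median session duration and mean result pages viewed per session.
from datetime import datetime
from statistics import median

# Hypothetical log entries: (session_id, timestamp, action)
log = [
    ("s1", "2013-05-01 10:00:03", "query"),
    ("s1", "2013-05-01 10:00:40", "result_page"),
    ("s1", "2013-05-01 10:02:10", "result_page"),
    ("s2", "2013-05-01 11:30:00", "query"),
    ("s2", "2013-05-01 11:30:55", "result_page"),
]

sessions = {}
for sid, ts, action in log:
    sessions.setdefault(sid, []).append(
        (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), action))

durations = [(max(t for t, _ in ev) - min(t for t, _ in ev)).total_seconds()
             for ev in sessions.values()]
pages = [sum(1 for _, a in ev if a == "result_page") for ev in sessions.values()]

print("median session duration (s):", median(durations))
print("mean result pages viewed:", sum(pages) / len(pages))
```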
D. Trieschnigg, D. Nguyen, and M. Theune.
Learning to extract folktale keywords.
In
Proceedings of the 7th Workshop on Language Technology for
Cultural Heritage, Social Sciences, and Humanities, LaTeCH 2013, pages
65–73. Association for Computational Linguistics, Sofia, Bulgaria, August
2013.
Paper
Slides
Abstract
Manually assigned keywords provide a valuable means for accessing large document collections. They can serve as a shallow document summary and enable more efficient retrieval and aggregation of information. In this paper we investigate keywords in the context of the Dutch Folktale Database, a large collection of stories including fairy tales, jokes and urban legends. We carry out a quantitative and qualitative analysis of the keywords in the collection. Up to 80% of the assigned keywords (or a minor variation) appear in the text itself. Human annotators show moderate to substantial agreement in their judgment of keywords. Finally, we evaluate a learning to rank approach to extract and rank keyword candidates. We conclude that this is a promising approach to automate this time intensive task.
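A stripped-down version of ranking keyword candidates with a learned model could look like this; the features and documents are toy examples, and the paper uses a richer feature set and a proper learning-to-rank method:

```python
# Illustrative sketch: score keyword candidates with a learned model over
# simple features (term frequency, first position, presence in the title).
from collections import Counter
from sklearn.linear_model import LogisticRegression

def candidate_features(doc):
    words = doc["text"].lower().split()
    tf = Counter(words)
    return {w: [tf[w] / len(words),
                words.index(w) / len(words),
                float(w in doc["title"].lower().split())]
            for w in set(words)}

# Hypothetical training document with manually assigned keywords.
train_doc = {"title": "The wolf and the seven young goats",
             "text": "A wolf tricks seven young goats while their mother goat is away",
             "keywords": {"wolf", "goat", "goats"}}

X, y = [], []
for w, f in candidate_features(train_doc).items():
    X.append(f)
    y.append(int(w in train_doc["keywords"]))
model = LogisticRegression().fit(X, y)

test_doc = {"title": "Cinderella",
            "text": "Cinderella loses her slipper at the royal ball"}
feats = candidate_features(test_doc)
ranked = sorted(feats, key=lambda w: model.predict_proba([feats[w]])[0, 1], reverse=True)
print(ranked[:3])  # top keyword candidates
```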
D. Trieschnigg, K. Tjin-Kam-Jet, and D. Hiemstra.
SearchResultFinder: Federated search made easy (demo).
In
Proceedings of the 36th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, SIGIR
2013, pages 1113–1114. July 28–August 1, 2013, Dublin, Ireland, 2013.
Paper
Poster
Abstract
Building a federated search engine based on a large number of existing web search engines is a challenge: implementing the programming interface (API) for each search engine is an exacting and time-consuming job. In this demonstration we present SearchResultFinder, a browser plugin which speeds up determining reusable XPaths for extracting search result items from HTML search result pages. Based on a single search result page, the tool presents a ranked list of candidate extraction XPaths and allows highlighting to view the extraction result. An evaluation with 148 web search engines shows that in 90% of the cases a correct XPath is suggested.
2012
M. Dadvar, F. de Jong, R. Ordelman, and D. Trieschnigg.
Improved cyberbullying detection using gender information.
In
Proceedings of the Twelfth Dutch-Belgian Information
Retrieval Workshop, DIR 2012, pages 23–25. 2012.
Paper
Abstract
As a result of the invention of social networks, friendships, relationships and social communication are all undergoing changes and new definitions seem to be applicable. One may have hundreds of "friends" without even seeing their faces. Meanwhile, alongside this transition there is increasing evidence that online social applications are used by children and adolescents for bullying. State-of-the-art studies in cyberbullying detection have mainly focused on the content of the conversations while largely ignoring the characteristics of the actors involved in cyberbullying. Social studies on cyberbullying reveal that the written language used by a harasser varies with the author's features, including gender. In this study we used a support vector machine model to train a gender-specific text classifier. We demonstrated that taking gender-specific language features into account improves the discrimination capacity of a classifier to detect cyberbullying.
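For illustration only, gender-specific classifiers in this spirit could be set up as below; the comments, labels and features are invented and do not reflect the study's dataset or feature engineering:

```python
# Hedged sketch: train a separate text classifier per gender with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical training comments: (gender, text, is_bullying)
data = [("m", "you are such a loser get lost", 1),
        ("m", "nice goal in the match yesterday", 0),
        ("f", "nobody likes you just disappear", 1),
        ("f", "love your new profile picture", 0)]

models = {}
for gender in {"m", "f"}:
    texts = [t for g, t, _ in data if g == gender]
    labels = [y for g, _, y in data if g == gender]
    models[gender] = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(texts, labels)

print(models["f"].predict(["everyone thinks you are pathetic"]))
```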
M. Dadvar, R. Ordelman, F. de Jong, and D. Trieschnigg.
Towards user modelling in the combat against cyberbullying.
In
Proceedings of the 17th International Conference on
Applications of Natural Language to Information Systems, NLDB 2012, pages
277–283. 2012.
Paper
Abstract
Friendships, relationships and social communications have all gone to a new level with new definitions as a result of the invention of online social networks. Meanwhile, alongside this transition there is increasing evidence that online social applications have been used by children and adolescents for bullying. State-of-the-art studies in cyberbullying detection have mainly focused on the content of the conversations while largely ignoring the users involved in cyberbullying. We hypothesise that incorporation of the users' profile, their characteristics, and post-harassing behaviour, for instance, posting a new status in another social network as a reaction to their bullying experience, will improve the accuracy of cyberbullying detection. Cross-system analyses of the users' behaviour (monitoring users' reactions in different online environments) can facilitate this process and could lead to more accurate detection of cyberbullying. This paper outlines the framework for this faceted approach.
T. Demeester, D. Nguyen, D. Trieschnigg, C. Develder, and D. Hiemstra.
What snippets say about pages in federated web search.
In
Proceedings of the 8th Asia Information Retrieval Societies
Conference, AIRS 2012, pages 250–261. 2012.
Paper
Abstract
What is the likelihood that a Web page is considered relevant to a query, given the relevance assessment of the corresponding snippet? Using a new federated IR test collection that contains search results from over a hundred search engines on the internet, we are able to investigate such research questions from a global perspective. Our test collection covers the main Web search engines like Google, Yahoo!, and Bing, as well as a number of smaller search engines dedicated to multimedia, shopping, etc., and as such reflects a realistic Web environment. Using a large set of relevance assessments, we are able to investigate the connection between snippet quality and page relevance. The dataset is strongly inhomogeneous, and although the assessors' consistency is shown to be satisfying, care is required when comparing resources. To this end, a number of probabilistic quantities, based on snippet and page relevance, are introduced and evaluated.
F. Karsdorp, P. van Kranenburg, T. Meder, D. Trieschnigg, and A. van den Bosch.
In search of an appropriate abstraction level for motif annotations.
In
Computational Models of Narrative workshop (LREC 2012).
Istanbul, Turkey, 26–27 May 2012.
Paper
Abstract
We present ongoing research on the role of motifs in oral transmission of stories. We assume that motifs constitute the primary building blocks of stories. On the basis of a quantitative analysis we show that the level of motif annotation utilized in the Aarne-Thompson-Uther folktale type catalogue is well suited to analyze two genres of folktales in terms of motif sequences. However, for the other five genres in the catalogue the annotation level is not apt, because it is unable to bring to the fore the commonalities between stories.
D. Nguyen, T. Demeester, D. Trieschnigg, and D. Hiemstra.
Federated search in the wild: the combined power of over a hundred
search engines.
In
Proceedings of the 21st ACM International Conference on
Information and Knowledge Management, CIKM 2012, pages 1874–1878. 2012.
Paper
Abstract
Federated search has the potential of improving web search: the user becomes less dependent on a single search provider and parts of the deep web become available through a unified interface, leading to a wider variety in the retrieved search results. However, a publicly available dataset for federated search reflecting an actual web environment has been absent. As a result, it has been difficult to assess whether proposed systems are suitable for the web setting. We introduce a new test collection containing the results from more than a hundred actual search engines, ranging from large general web search engines such as Google and Bing to small domain-specific engines. We discuss the design and analyze the effect of several sampling methods. For a set of test queries, we collected relevance judgements for the top 10 results of each search engine. The dataset is publicly available and is useful for researchers interested in resource selection for web search collections, result merging and size estimation of uncooperative resources.
D. Nguyen, D. Trieschnigg, T. Meder, and M. Theune.
Automatic classification of folk narrative genres.
In
First International Workshop on Language Technology for
Historical Text(s) (KONVENS 2012), pages 378–382. 2012.
Paper
Abstract
Folk narratives are a valuable resource for humanities and social science researchers. This paper focuses on automatically recognizing folk narrative genres, such as urban legends, fairy tales, jokes and riddles. We explore the effectiveness of lexical, structural, stylistic and domain specific features. We find that it is possible to obtain a good performance using only shallow features. As dataset for our experiments we used the Dutch Folktale database, containing narratives from the 16th century until now.
A. Tigelaar, D. Hiemstra, and D. Trieschnigg.
Peer to peer information retrieval: An overview.
ACM Transactions on Information Systems, 30(2):9:1–9:34,
2012.
Paper
Abstract
Peer-to-peer technology is widely used for file sharing. In the past decade a number of prototype peer-to-peer information retrieval systems have been developed. Unfortunately, none of these has seen widespread real-world adoption and thus, in contrast with file sharing, information retrieval is still dominated by centralised solutions. In this article we provide an overview of the key challenges for peer-to-peer information retrieval and the work done so far. We want to stimulate and inspire further research to overcome these challenges. This will open the door to the development and large-scale deployment of real-world peer-to-peer information retrieval systems that rival existing centralised client-server solutions in terms of scalability, performance, user satisfaction and freedom.
K. Tjin-Kam-Jet, D. Trieschnigg, and D. Hiemstra.
An analysis of free-text queries for a multi-field web form.
In
Proceedings of the 4th Information Interaction in Context
Symposium, IIIX 2012, pages 82–89. 2012.
Paper
Abstract
We report how users interact with an experimental system that transforms single-field textual input into a multi-field query for an existing travel planner system. The experimental system was made publicly available and we collected over 30,000 queries from almost 12,000 users. From the free-text query log, we examined how users formulated structured information needs into free-text queries. The query log analysis shows that there is great variety in query formulation: over 400 query templates were found that occurred at least 4 times. Furthermore, with over 100 respondents to our questionnaire, we provide both quantitative and qualitative evidence indicating that end-users significantly prefer a single-field interface over a multi-field interface when performing structured search.
K. Tjin-Kam-Jet, D. Trieschnigg, and D. Hiemstra.
A probabilistic approach for mapping free-text queries to complex web
forms.
Technical Report TR-CTIT-12-33, Centre for Telematics and Information
Technology, University of Twente, Enschede, 2012.
Paper
Abstract
Web applications with complex interfaces consisting of multiple input fields should understand free-text queries. We propose a probabilistic approach to map parts of a free-text query to the fields of a complex web form. Our method uses token models rather than only static dictionaries to create this mapping, offering greater flexibility and requiring less domain knowledge than existing systems. We evaluate different implementations of our mapping model and show that our system effectively maps free-text queries without using a dictionary. If a dictionary is available, the performance increases and is significantly better than a rule-based baseline.
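A bare-bones version of scoring a query segment under per-field token models might look as follows; the fields, observed values and add-one smoothing are assumptions for illustration, and the report's models are more elaborate:

```python
# Minimal sketch: pick the form field whose unigram token model best explains
# a query segment. Training values per field are hypothetical.
from collections import Counter
import math

field_values = {
    "departure": ["amsterdam central", "utrecht", "rotterdam"],
    "arrival": ["enschede", "utrecht", "groningen"],
    "time": ["9:30", "17:00", "morning"],
}

models, vocab = {}, set()
for field, values in field_values.items():
    tokens = [tok for v in values for tok in v.split()]
    models[field] = Counter(tokens)
    vocab.update(tokens)

def log_prob(field, segment):
    counts, total = models[field], sum(models[field].values())
    return sum(math.log((counts[tok] + 1) / (total + len(vocab) + 1))  # add-one smoothing
               for tok in segment.split())

segment = "amsterdam central"
print(max(models, key=lambda f: log_prob(f, segment)))  # -> departure
```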
D. Trieschnigg, D. Hiemstra, M. Theune, F. de Jong, and T. Meder.
An exploration of language identification techniques for the Dutch
folktale database.
In
Adaptation of Language Resources and Tools for Processing
Cultural Heritage workshop (LREC 2012). Istanbul, Turkey, 26 May 2012.
Paper
Slides
Abstract
The Dutch Folktale Database contains fairy tales, traditional legends, urban legends, and jokes written in a large variety and combination of languages including (Middle and 17th century) Dutch, Frisian and a number of Dutch dialects. In this work we compare a number of approaches to automatic language identification for this collection. We show that in comparison to typical language identification tasks, classification performance for highly similar languages with little training data is low. The studied dataset consisting of over 39,000 documents in 16 languages and dialects is available on request for followup research.
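A common baseline for this kind of language identification task is a character n-gram classifier; a minimal scikit-learn sketch, with toy fragments standing in for the actual collection and not reproducing the paper's systems, is:

```python
# Character n-gram language identification sketch (toy training fragments).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["er was eens een koning met drie dochters",    # Dutch-like fragment
         "der wie ris in kening mei trije dochters",    # Frisian-like fragment
         "de koning had drie dochters en een kasteel",  # Dutch-like fragment
         "de kening wenne yn in great kastiel"]         # Frisian-like fragment
labels = ["dutch", "frisian", "dutch", "frisian"]

clf = make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
                    LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["it famke rûn nei it kastiel"]))
```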
D. Trieschnigg, K. Tjin-Kam-Jet, and D. Hiemstra.
Ranking XPaths for extracting search result records.
Technical Report TR-CTIT-12-08, Centre for Telematics and Information
Technology, University of Twente, Enschede, 2012.
Paper
Abstract
Extracting search result records (SRRs) from webpages is useful for building an aggregated search engine which combines search results from a variety of search engines. Most automatic approaches to search result extraction are not portable: the complete process has to be rerun on a new search result page. In this paper we describe an algorithm to automatically determine XPath expressions to extract SRRs from webpages. Based on a single search result page, an XPath expression is determined which can be reused to extract SRRs from pages based on the same template. The algorithm is evaluated on six datasets, including two new datasets containing a variety of web, image, video, shopping and news search results. The evaluation shows that for 85% of the tested search result pages, a useful XPath is determined. The algorithm is implemented as a browser plugin and as a standalone application which are available as open source software.
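The core idea of generating and ranking candidate XPaths can be sketched with lxml as below; this is a simplification with a toy page, not the released tool, and real result pages need the richer heuristics described in the report:

```python
# Simplified sketch: generalize each element's path (tag + class, no positional
# indices) and prefer XPaths that match many similar nodes on the page.
from collections import Counter
import lxml.html

html = """<html><body><div id="res">
  <div class="r"><a href="/1">First result</a></div>
  <div class="r"><a href="/2">Second result</a></div>
  <div class="r"><a href="/3">Third result</a></div>
  <div class="ad">Advertisement</div>
</div></body></html>"""
tree = lxml.html.fromstring(html)

def generalized_path(el):
    parts = []
    while el is not None and el.tag != "html":
        cls = el.get("class")
        parts.append(el.tag + (f'[@class="{cls}"]' if cls else ""))
        el = el.getparent()
    return "//" + "/".join(reversed(parts))

candidates = Counter()
for el in tree.iter():
    path = generalized_path(el)
    if path != "//":                    # skip the document root
        candidates[path] += 1

for xpath, n in candidates.most_common(3):
    texts = [e.text_content().strip() for e in tree.xpath(xpath)]
    print(n, xpath, texts[:2])
```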
2011
C. Hauff and D. Trieschnigg.
Adding emotions to pictures.
In G. Amati and F. Crestani, editors,
Advances in Information
Retrieval Theory - Third International Conference, ICTIR 2011, pages
364–367. 2011.
Paper
Poster
Abstract
A large number of out-of-copyright children's books are available online, but are not very attractive to children due to a lack of illustrations. Automatic text illustration may enhance the reading experience of these books, but inappropriate picture coloring may convey inappropriate emotions. Since children can map colors to certain emotions from a very early age, we propose an approach to automatically alter picture colors according to the emotion conveyed in the text.
A. S. Tigelaar, D. Trieschnigg, and D. Hiemstra.
Search result caching in peer-to-peer information retrieval networks.
In A. Hanbury, A. Rauber, and A. P. de Vries, editors,
Multidisciplinary Information Retrieval - Second Information Retrieval
Facility Conference, IRFC 2011, pages 134–148. 2011.
Paper
Abstract
For peer-to-peer web search engines it is important to quickly process queries and return search results. How to keep the perceived latency low is an open challenge. In this paper we explore the solution potential of search result caching in large-scale peer-to-peer information retrieval networks by simulating such networks with increasing levels of realism. We find that a small bounded cache offers performance comparable to an unbounded cache. Furthermore, we explore partially centralised and fully distributed scenarios, and find that in the most realistic distributed case caching can reduce the query load by thirty-three percent. With optimisations this can be boosted to nearly seventy percent.
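The flavour of such an experiment can be conveyed with a toy LRU cache over a synthetic, skewed query stream; nothing below reproduces the paper's simulator or numbers:

```python
# Toy simulation: hit rate of a small bounded LRU cache of search results
# under a Zipf-like (heavily skewed) query stream.
import random
from collections import OrderedDict

random.seed(0)
stream = [f"q{min(random.paretovariate(1.2), 500):.0f}" for _ in range(20000)]

cache, capacity, hits = OrderedDict(), 100, 0
for q in stream:
    if q in cache:
        hits += 1
        cache.move_to_end(q)           # mark as recently used
    else:
        cache[q] = "cached results"
        if len(cache) > capacity:
            cache.popitem(last=False)  # evict the least recently used entry
print(f"hit rate: {hits / len(stream):.2%}")
```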
K. Tjin-Kam-Jet, D. Trieschnigg, and D. Hiemstra.
Free-text search over complex web forms.
In A. Hanbury, A. Rauber, and A. P. de Vries, editors,
Multidisciplinary Information Retrieval - Second Information Retrieval
Facility Conference, IRFC 2011, pages 94–107. 2011.
Paper
Abstract
This paper investigates the problem of using free-text queries as an alternative means for searching `behind' web forms. We introduce a novel specification language for specifying free-text interfaces, and report the results of a user study where we evaluated our prototype in a travel planner scenario. Our results show that users prefer this free-text interface over the original web form and that they are about 9% faster on average at completing their search tasks.
K. Tjin-Kam-Jet, D. Trieschnigg, and D. Hiemstra.
Free-text search versus complex web forms.
In
33rd European Conference on IR Research, ECIR 2011, pages
670–674. 2011.
Paper
Abstract
We investigated the use of free-text queries as an alternative means for searching 'behind' web forms. We conducted a user study where we evaluated our prototype free-text interface in a travel planner scenario. Our results show that users prefer this free-text interface over the original web form and that they are about 9% faster on average at completing their search tasks.
D. Trieschnigg.
Proof of concept: concept-based biomedical information retrieval
(doctoral abstract).
SIGIR Forum, 44(2):89, January 2011.
Paper
Abstract In this thesis we investigate the possibility to integrate domain-specific knowledge into biomedical information retrieval (IR). Recent decades have shown a fast growing interest in biomedical research, reflected by an exponential growth in scientific literature. Biomedical IR is concerned with the disclosure of these vast amounts of written knowledge. Biomedical IR is not only important for end-users, such as biologists, biochemists, and bioinformaticians searching directly for relevant literature but also plays an important role in more sophisticated knowledge discovery. An important problem for biomedical IR is dealing with the complex and inconsistent terminology encountered in biomedical publications. Multiple synonymous terms can be used for single biomedical concepts, such as genes and diseases. Conversely, single terms can be ambiguous, and may refer to multiple concepts. Dealing with the terminology problem requires domain knowledge stored in terminological resources: controlled indexing vocabularies and thesauri. The integration of this knowledge in modern word-based information retrieval is, however, far from trivial. This thesis investigates the problem of handling biomedical terminology based on three research themes.
The first research theme deals with robust word-based retrieval. Effective retrieval models commonly use a word-based representation for retrieval. As so many spelling variations are present in biomedical text, the way in which these word-based representations are obtained affect retrieval effectiveness. We investigated the effect of choices in document preprocessing heuristics on retrieval effectiveness. This investigation included stop-word removal, stemming, different approaches to breakpoint identification and normalisation, and character n-gramming. In particular breakpoint identification and normalisation (that is determining word parts in biomedical compounds) showed a strong effect on retrieval performance. A combination of effective preprocessing heuristics was identified and used to obtain word-based representations from text for the remainder of this thesis.
The second research theme deals with concept-based retrieval. We investigated two representation vocabularies for concept-based indexing, one based on the Medical Subject Headings thesaurus, the other based on the Unified Medical Language System metathesaurus extended with a number of gene and protein dictionaries.
We investigated the following five topics.
- How documents are represented in a concept-based representation.
- To what extent such a document representation can be obtained automatically.
- To what extent a text-based query can be automatically mapped onto a concept-based representation and how this affects retrieval performance.
- To what extent a concept-based representation is effective in representing information needs.
- How the relationship between text and concepts can be used to determine the relatedness of concepts.
We compared different classification systems to obtain concept-based document and query representations automatically. We proposed two classification methods based on statistical language models, one based on K-Nearest Neighbours (KNN) and one based on Concept Language Models (CLM).
For a selection of classification systems we carried out a document classification experiment in which we investigated to what extent automatic classification could reproduce manual classification. The proposed KNN system performed well in comparison to the out-of-the-box systems. Manual analysis indicated the improved exhaustiveness of automatic classification over manual classification. Retrieval based on only concepts was demonstrated to be significantly less effective than word-based retrieval. This deteriorated performance could be explained by errors in the classification process, limitations of the concept vocabularies and limited exhaustiveness of the concept-based document representations. Retrieval based on a combination of word-based and automatically obtained concept-based query representations did significantly improve word-only retrieval. In an artificial setting, we compared the optimal retrieval performance which could be obtained with word-based and concept-based representations. Contrary to our intuition, on average a single word-based query performed better than a single concept-based representation, even when the best concept term precisely represented part of the information need.
We investigated to what extent the relatedness between pairs of concepts as indicated by human judgements could be automatically reproduced. Results on a small test set indicated that a method based on comparing concept language models performed particularly well in comparison to systems based on taxonomy structure, information content and (document) association.
In the third and last research theme of this thesis we propose a framework for concept-based retrieval. We approached the integration of domain knowledge in monolingual information retrieval as a cross-lingual information retrieval (CLIR) problem. Two languages were identified in this monolingual setting: a word-based representation language based on free text, and a concept-based representation language based on a terminological resource. Similar to what is common in traditional CLIR, queries and documents are translated into the same representation language and matched. The cross-lingual perspective gives us the opportunity to adopt a large set of established CLIR methods and techniques for this domain. In analogy to established CLIR practise, we investigated translation models based on a parallel corpus containing documents in multiple representations and translation models based on a thesaurus. Surprisingly, even the integration of very basic translation models showed improvements in retrieval effectiveness over word-only retrieval. A translation model based on pseudo-feedback translation was shown to perform particularly well. We proposed three extensions to a basic cross-lingual retrieval model which, similar to previous approaches in established CLIR, improved retrieval effectiveness by combining multiple translation models. Experimental results indicate that, even when using very basic translation models, monolingual biomedical IR can benefit from a cross-lingual approach to integrate domain knowledge.
Directions for future work are using these concepts for communication between user and retrieval system, extending upon the translation models and extending CLIR-enhanced concept-based retrieval outside the biomedical domain.
D. Trieschnigg and C. Hauff.
Classic children's literature - difficult to read?
In
33rd European Conference on IR Research, ECIR 2011, pages
691–694. 2011.
Paper
Poster
Abstract
Classic children's literature such as Alice in Wonderland is nowadays freely available thanks to initiatives such as Project Gutenberg. Due to diverging vocabularies and style, these texts are often not readily understandable to children in the present day. Our goal is to make such texts more accessible by aiding children in the reading process, in particular by automatically identifying the terms that result in low readability. As a first step, in this poster we report on a preliminary user study that investigates the extent of the vocabulary problem. We also propose and evaluate a basic approach to detect such difficult terminology.
2010
C. Hauff and D. Trieschnigg.
Enhancing access to classic children's literature.
In
BooksOnline'10 Workshop at CIKM 2010. Toronto, Canada,
2010.
Paper
Abstract
Project Gutenberg is a digital library that contains mostly public domain books, including a large number of works that belong to children's literature. Many of these classic books are offered in a text-only format, which does not make them appealing for children to read. Moreover, stories that were written for children one hundred or more years ago, might not be readily understandable by children today due to diverging vocabularies and experiences. In this poster, we describe ongoing work to enhance the access to this children's literature repository. Firstly, we attempt to automatically illustrate the children's literature. Secondly, we link the text to background information to increase understanding and ease of reading. The overall motivation of this work is to make such publicly available books more easily accessible to children by making them more entertaining and engaging.
M. Kalsbeek, J. de Wit, D. Trieschnigg, P. van der Vet, T. Huibers, and
D. Hiemstra.
Automatic reformulation of children's search queries.
Technical Report TR-CTIT-10-23, Centre for Telematics and Information
Technology, University of Twente, Enschede, 2010.
Paper
Abstract
The number of children that have access to an Internet connection (at home or at school) is large and growing fast. Many of these children search the web by using a search engine. These search engines, however, do not consider their skills and preferences, which makes searching difficult. This paper tries to uncover methods and techniques that can be used to automatically improve search results on queries formulated by children. In order to achieve this, a prototype of a query expander is built that implements several of these techniques. The paper concludes with an evaluation of the prototype and a discussion of the promising results.
E. Meij, D. Trieschnigg, M. de Rijke, and W. Kraaij.
Conceptual language models for domain-specific retrieval.
Information Processing and Management, 46(4):448–469, 2010.
Paper
Abstract
Over the years, various meta-languages have been used to manually enrich documents with conceptual knowledge of some kind. Examples include keyword assignment to citations or, more recently, tags to websites. In this paper we propose generative concept models as an extension to query modeling within the language modeling framework, which leverages these conceptual annotations to improve retrieval. By means of relevance feedback the original query is translated into a conceptual representation, which is subsequently used to update the query model. Extensive experimental work on five test collections in two domains shows that our approach gives significant improvements in terms of recall, initial precision and mean average precision with respect to a baseline without relevance feedback. On one test collection, it is also able to outperform a text-based pseudo-relevance feedback approach based on relevance models. On the other test collections it performs similarly to relevance models. Overall, conceptual language models have the added advantage of offering query and browsing suggestions in the form of conceptual annotations. In addition, the internal structure of the meta-language can be exploited to add related terms. Our contributions are threefold. First, an extensive study is conducted on how to effectively translate a textual query into a conceptual representation. Second, we propose a method for updating a textual query model using the concepts in conceptual representation. Finally, we provide an extensive analysis of when and how this conceptual feedback improves retrieval.
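At its core, the conceptual feedback step interpolates the original query model with a term distribution derived from concepts; a much-simplified sketch, with invented concepts, term distributions and weights rather than the paper's estimation method, is:

```python
# Bare-bones sketch of updating a query model with concept-derived feedback.
from collections import Counter

def normalize(counts):
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

query = "heart attack treatment"
query_model = normalize(Counter(query.split()))

# Hypothetical concepts with per-concept term distributions and query weights.
concept_models = {
    "Myocardial Infarction": {"myocardial": 0.4, "infarction": 0.4, "heart": 0.2},
    "Therapeutics": {"therapy": 0.5, "treatment": 0.5},
}
concept_weights = {"Myocardial Infarction": 0.7, "Therapeutics": 0.3}

# P(t | concepts) = sum over concepts of P(t | concept) * P(concept | query)
feedback_model = Counter()
for concept, weight in concept_weights.items():
    for term, p in concept_models[concept].items():
        feedback_model[term] += weight * p

lam = 0.5  # interpolation weight between the original and the conceptual model
updated = {t: lam * query_model.get(t, 0) + (1 - lam) * feedback_model.get(t, 0)
           for t in set(query_model) | set(feedback_model)}
print(sorted(updated.items(), key=lambda x: -x[1]))
```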
D. Trieschnigg.
Proof of Concept: Concept-based Biomedical Information
Retrieval.
Ph.D. thesis, University of Twente, Enschede, The Netherlands,
September 2010.
Paper
Abstract In this thesis we investigate the possibility to integrate domain-specific knowledge into biomedical information retrieval (IR). Recent decades have shown a fast growing interest in biomedical research, reflected by an exponential growth in scientific literature. Biomedical IR is concerned with the disclosure of these vast amounts of written knowledge. Biomedical IR is not only important for end-users, such as biologists, biochemists, and bioinformaticians searching directly for relevant literature but also plays an important role in more sophisticated knowledge discovery. An important problem for biomedical IR is dealing with the complex and inconsistent terminology encountered in biomedical publications. Multiple synonymous terms can be used for single biomedical concepts, such as genes and diseases. Conversely, single terms can be ambiguous, and may refer to multiple concepts. Dealing with the terminology problem requires domain knowledge stored in terminological resources: controlled indexing vocabularies and thesauri. The integration of this knowledge in modern word-based information retrieval is, however, far from trivial. This thesis investigates the problem of handling biomedical terminology based on three research themes.
The first research theme deals with robust word-based retrieval. Effective retrieval models commonly use a word-based representation for retrieval. As so many spelling variations are present in biomedical text, the way in which these word-based representations are obtained affect retrieval effectiveness. We investigated the effect of choices in document preprocessing heuristics on retrieval effectiveness. This investigation included stop-word removal, stemming, different approaches to breakpoint identification and normalisation, and character n-gramming. In particular breakpoint identification and normalisation (that is determining word parts in biomedical compounds) showed a strong effect on retrieval performance. A combination of effective preprocessing heuristics was identified and used to obtain word-based representations from text for the remainder of this thesis.
The second research theme deals with concept-based retrieval. We investigated two representation vocabularies for concept-based indexing, one based on the Medical Subject Headings thesaurus, the other based on the Unified Medical Language System metathesaurus extended with a number of gene and protein dictionaries.
We investigated the following five topics.
- How documents are represented in a concept-based representation.
- To what extent such a document representation can be obtained automatically.
- To what extent a text-based query can be automatically mapped onto a concept-based representation and how this affects retrieval performance.
- To what extent a concept-based representation is effective in representing information needs.
- How the relationship between text and concepts can be used to determine the relatedness of concepts.
We compared different classification systems to obtain concept-based document and query representations automatically. We proposed two classification methods based on statistical language models, one based on K-Nearest Neighbours (KNN) and one based on Concept Language Models (CLM).
For a selection of classification systems we carried out a document classification experiment in which we investigated to what extent automatic classification could reproduce manual classification. The proposed KNN system performed well in comparison to the out-of-the-box systems. Manual analysis indicated the improved exhaustiveness of automatic classification over manual classification. Retrieval based on only concepts was demonstrated to be significantly less effective than word-based retrieval. This deteriorated performance could be explained by errors in the classification process, limitations of the concept vocabularies and limited exhaustiveness of the concept-based document representations. Retrieval based on a combination of word-based and automatically obtained concept-based query representations did significantly improve word-only retrieval. In an artificial setting, we compared the optimal retrieval performance which could be obtained with word-based and concept-based representations. Contrary to our intuition, on average a single word-based query performed better than a single concept-based representation, even when the best concept term precisely represented part of the information need.
We investigated to what extent the relatedness between pairs of concepts as indicated by human judgements could be automatically reproduced. Results on a small test set indicated that a method based on comparing concept language models performed particularly well in comparison to systems based on taxonomy structure, information content and (document) association.
In the third and last research theme of this thesis we propose a framework for concept-based retrieval. We approached the integration of domain knowledge in monolingual information retrieval as a cross-lingual information retrieval (CLIR) problem. Two languages were identified in this monolingual setting: a word-based representation language based on free text, and a concept-based representation language based on a terminological resource. Similar to what is common in traditional CLIR, queries and documents are translated into the same representation language and matched. The cross-lingual perspective gives us the opportunity to adopt a large set of established CLIR methods and techniques for this domain. In analogy to established CLIR practise, we investigated translation models based on a parallel corpus containing documents in multiple representations and translation models based on a thesaurus. Surprisingly, even the integration of very basic translation models showed improvements in retrieval effectiveness over word-only retrieval. A translation model based on pseudo-feedback translation was shown to perform particularly well. We proposed three extensions to a basic cross-lingual retrieval model which, similar to previous approaches in established CLIR, improved retrieval effectiveness by combining multiple translation models. Experimental results indicate that, even when using very basic translation models, monolingual biomedical IR can benefit from a cross-lingual approach to integrate domain knowledge.
Directions for future work are using these concepts for communication between user and retrieval system, extending upon the translation models and extending CLIR-enhanced concept-based retrieval outside the biomedical domain.
D. Trieschnigg, D. Hiemstra, F. de Jong, and W. Kraaij.
A cross-lingual framework for monolingual biomedical information
retrieval.
In
Proceedings of the 19th ACM International Conference on
Information and Knowledge Management, CIKM '10, pages 169–178. ACM, New
York, NY, USA, 2010.
Paper
Abstract
An important challenge for biomedical information retrieval (IR) is dealing with the complex, inconsistent and ambiguous biomedical terminology. Frequently, a concept-based representation defined in terms of a domain-specific terminological resource is employed to deal with this challenge. In this paper, we approach the incorporation of a concept-based representation in monolingual biomedical IR from a cross-lingual perspective. In the proposed framework, this is realized by translating and matching between text and concept-based representations. The approach allows for deployment of a rich set of techniques proposed and evaluated in traditional cross-lingual IR. We compare six translation models and measure their effectiveness in the biomedical domain. We demonstrate that the approach can result in significant improvements in retrieval effectiveness over word-based retrieval. Moreover, we demonstrate increased effectiveness of a CLIR framework for monolingual biomedical IR if basic translations models are combined.
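Viewed this way, query translation reduces to applying a P(concept | word) table and matching in the concept language; a toy sketch, with invented translation probabilities and concept index rather than the paper's estimated models, is:

```python
# Sketch of the cross-lingual view: "translate" a word-based query into
# concepts and score concept-indexed documents against the translated query.
from collections import Counter

# Hypothetical translation model, e.g. estimated from a parallel corpus of
# documents carrying both free text and concept annotations.
p_concept_given_word = {
    "heart":  {"C:Heart": 0.8, "C:Myocardial_Infarction": 0.2},
    "attack": {"C:Myocardial_Infarction": 0.9, "C:Seizure": 0.1},
}

def translate(query):
    words = query.split()
    concept_query = Counter()
    for word in words:
        for concept, p in p_concept_given_word.get(word, {}).items():
            concept_query[concept] += p / len(words)
    return concept_query

# Concept-indexed documents (hypothetical).
docs = {"doc1": Counter({"C:Myocardial_Infarction": 3, "C:Heart": 1}),
        "doc2": Counter({"C:Seizure": 2})}

cq = translate("heart attack")
scores = {d: sum(w * index[c] for c, w in cq.items()) for d, index in docs.items()}
print(sorted(scores, key=scores.get, reverse=True))
```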
2009
D. Trieschnigg, P. Pezik, V. Lee, W. Kraaij, F. de Jong, and
D. Rebholz-Schuhmann.
MeSH Up: Effective MeSH Text Classification and Improved Document
Retrieval.
Bioinformatics, 25(11):1412–1418, 2009.
Paper
Abstract
Motivation: Controlled vocabularies such as the Medical Subject Headings (MeSH) thesaurus and the Gene Ontology (GO) provide an efficient way of accessing and organizing biomedical information by reducing the ambiguity inherent to free-text data. Different methods of automating the assignment of MeSH concepts have been proposed to replace manual annotation, but they are either limited to a small subset of MeSH or have only been compared to a limited number of other systems.
Results: We compare the performance of 6 MeSH classification systems (MetaMap, EAGL, a language and a vector space model based approach, a K-Nearest Neighbor approach and MTI) in terms of reproducing and complementing manual MeSH annotations. A K-Nearest Neighbor system clearly outperforms the other published approaches and scales well with large amounts of text using the full MeSH thesaurus. Our measurements demonstrate to what extent manual MeSH annotations can be reproduced and how they can be complemented by automatic annotations. We also show that a statistically significant improvement can be obtained in information retrieval (IR) when the text of a user's query is automatically annotated with MeSH concepts, compared to using the original textual query alone.
Conclusions: The annotation of biomedical texts using controlled vocabularies such as MeSH can be automated to improve text-only IR. Furthermore, the automatic MeSH annotation system we propose is highly scalable and it generates improvements in IR comparable to those observed for manual annotations.
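The K-Nearest Neighbor idea, transferring MeSH headings from the most similar already-annotated documents, can be sketched as follows; the documents and headings are toy examples, not MEDLINE-scale data or the published system:

```python
# Minimal nearest-neighbour label transfer sketch for MeSH-style annotation.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_docs = ["protein binding sites in the human genome",
              "clinical trial of a new hypertension drug",
              "gene expression in cancer cells"]
train_mesh = [{"Proteins", "Genome, Human"},
              {"Hypertension", "Clinical Trials as Topic"},
              {"Neoplasms", "Gene Expression"}]

vec = TfidfVectorizer()
X = vec.fit_transform(train_docs)

def suggest_mesh(abstract, k=2, top=3):
    sims = cosine_similarity(vec.transform([abstract]), X)[0]
    neighbours = sims.argsort()[::-1][:k]
    scores = Counter()
    for i in neighbours:                  # weight each neighbour's headings
        for heading in train_mesh[i]:     # by its similarity to the query text
            scores[heading] += sims[i]
    return [h for h, _ in scores.most_common(top)]

print(suggest_mesh("expression of tumour suppressor genes in breast cancer"))
```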
D. Trieschnigg, P. Pezik, V. Lee, W. Kraaij, F. de Jong, and
D. Rebholz-Schuhmann.
Response to comment on “MeSH-up: effective MeSH text classification
for improved document retrieval”.
Bioinformatics, 25(20):2772, October 2009.
Paper
Abstract
As developers and primary users of MTI and MetaMap, Névéol et al. made a number of interesting comments on our recent publication in Bioinformatics. However, some of the results and conclusions found in the reply seem premature and lack proper clarification.
2008
E. Meij, D. Trieschnigg, M. de Rijke, and W. Kraaij.
Parsimonious concept modeling.
In
Proceedings of the 31th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, SIGIR
2008, pages 815–816. ACM, New York, NY, USA, 2008.
Paper
D. Nguyen, A. Overwijk, C. Hauff, D. Trieschnigg, D. Hiemstra, and F. de Jong.
WikiTranslate: Query translation for cross-lingual information
retrieval using only Wikipedia.
In
Evaluating Systems for Multilingual and Multimodal
Information Access, volume 5706 of
Lecture Notes in Computer Science,
pages 58–65. Springer Verlag, Berlin, 2008.
Paper
Abstract
This paper presents WikiTranslate, a system which performs query translation for cross-lingual information retrieval (CLIR) using only Wikipedia to obtain translations. Queries are mapped to Wikipedia concepts and the corresponding translations of these concepts in the target language are used to create the final query. WikiTranslate is evaluated by searching with topics formulated in Dutch, French and Spanish in an English data collection. The system achieved a performance of 67% compared to the monolingual baseline.
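A rough sketch of the Wikipedia-based translation step, assuming the public MediaWiki API and its langlinks property; this is not the original WikiTranslate implementation, and the mapping of full queries to Wikipedia concepts is omitted:

```python
# Hedged sketch: look up a source-language Wikipedia article and return the
# title of the linked article in the target language, if one exists.
import requests

def translate_term(term, source="nl", target="en"):
    api = f"https://{source}.wikipedia.org/w/api.php"
    params = {"action": "query", "format": "json", "prop": "langlinks",
              "titles": term, "lllang": target, "redirects": 1}
    pages = requests.get(api, params=params, timeout=10).json()["query"]["pages"]
    for page in pages.values():
        for link in page.get("langlinks", []):
            return link["*"]          # target-language article title
    return None

print(translate_term("Taalkunde"))    # expected to yield something like "Linguistics"
```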
A. Overwijk, D. Nguyen, C. Hauff, D. Trieschnigg, D. Hiemstra, and F. de Jong.
On the evaluation of snippet selection for WebCLEF.
In
Evaluating Systems for Multilingual and Multimodal
Information Access, volume 5706 of
Lecture Notes in Computer Science,
pages 794–797. Springer Verlag, Berlin, 2008.
Paper
Abstract
WebCLEF is about supporting a user, an expert writing a survey article on a specific topic with a clear goal and audience, by generating a ranked list of relevant snippets. This paper focuses on the evaluation methodology of WebCLEF. We show that the evaluation method and test set used for WebCLEF 2007 cannot be used to evaluate new systems and give recommendations on how to improve the evaluation.
D. Trieschnigg.
Biomedical cross-language information retrieval.
In
Proceedings of the 31th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, SIGIR
2008, page 897. ACM, New York, NY, USA, 2008.
Paper
D. Trieschnigg, E. Meij, M. de Rijke, and W. Kraaij.
Measuring concept relatedness using language models.
In
Proceedings of the 31th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, SIGIR
2008, pages 823–824. ACM, New York, NY, USA, 2008.
Paper
2007
M. Schuemie, D. Trieschnigg, and W. Kraaij.
Cross language information retrieval for biomedical literature.
In
Proceedings of The Sixteenth Text REtrieval Conference,
TREC 2007, NIST Special Publication 500-274. Gaithersburg, MD, USA, 2007.
Paper
D. Trieschnigg, W. Kraaij, and F. de Jong.
The influence of basic tokenization on biomedical document retrieval.
In
Proceedings of the 30th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, SIGIR
2007, pages 803–804. ACM, New York, NY, USA, 2007.
Paper
2006
C. Hauff, D. Trieschnigg, and H. Rode.
University of Twente at GeoCLEF 2006: geofiltered document
retrieval.
In
Working Notes CLEF 2006, Alicante, Spain. 2006.
Paper
Abstract
In this report we describe the approach of the University of Twente to the 2006 GeoCLEF task. It is based on retrieval by content and the subsequent filtering by geographical relevance utilizing a gazetteer. The results do not show an improvement in retrieval performance when taking geographical information into account.
D. Trieschnigg, M. Schuemie, and W. Kraaij.
Concept based document retrieval for genomics literature.
In
Proceedings of the Fifteenth Text REtrieval Conference,
TREC 2006, NIST Special Publication 500-272. Gaithersburg, MD, USA, 2006.
Paper
2005
D. Trieschnigg.
Exploring news archives using hierarchical topic detection.
Master's thesis, University of Twente, Enschede, The Netherlands,
2005.
Paper
Abstract
The amount of available information in digital news archives is growing and scalable methods are sought for presenting this information in a user-friendly way. Grouping related news items in fuzzy hierarchies has this potential, but the fully automated construction of these structures is complex. Furthermore, there is the difficulty of evaluating automatically generated hierarchies of topics. This thesis investigates how hierarchical topic detection (HTD) can aid in the exploration and navigation of large news archives. The contribution of this work is twofold. First, a simple, scalable HTD system is presented for clustering a large collection of documents in a fuzzy hierarchical topic structure. The prototype system has been used in the trial HTD evaluation of the TDT 2004 evaluation program. The participation starts a discussion of the evaluation methodology of hierarchical topic structures in an experimental context. The second contribution is a set of indicators and a visualization method for evaluating a hierarchical topic structure given a set of flat "truth" topics. With these indicators a richer discussion can take place about the desired properties of cluster structures.
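As a generic and much simpler illustration of building a topic hierarchy over news items (not the thesis prototype), documents can be clustered agglomeratively on TF-IDF vectors and the hierarchy cut at a distance threshold; the snippets and threshold below are invented:

```python
# Tiny sketch: agglomerative clustering of TF-IDF vectors as a stand-in for
# hierarchical topic detection over a news collection.
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

news = ["election results announced in national parliament",
        "parliament debates new election law",
        "storm causes flooding along the coast",
        "heavy rain and flooding expected this weekend",
        "football team wins championship final"]

X = TfidfVectorizer(stop_words="english").fit_transform(news).toarray()
Z = linkage(X, method="average", metric="cosine")   # full topic hierarchy

labels = fcluster(Z, t=0.8, criterion="distance")   # cut into flat topics
for label, doc in zip(labels, news):
    print(label, doc)
```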
D. Trieschnigg and W. Kraaij.
Hierarchical topic detection in large digital news archives.
In
Proceedings of the Fifth Dutch-Belgian Workshop on
Information Retrieval, DIR '05. Center for Content and Knowledge
Engineering, Utrecht, The Netherlands, 2005.
Paper
Slides
D. Trieschnigg and W. Kraaij.
Hierarchical topic detection in large digital news archives:
Exploring a sample based approach.
Journal of Digital Information Management, 3(1):21–27, 2005.
D. Trieschnigg and W. Kraaij.
Scalable hierarchical topic detection: exploring a sample based
approach.
In
Proceedings of the 28th annual international ACM SIGIR
conference on Research and development in information retrieval, SIGIR 2005,
pages 655–656. ACM, New York, NY, USA, 2005.
Paper
Software
- SearchResultFinder is a Firefox plugin to (semi-)automatically determine XPaths for extracting search result items.