My research interests are in the knowledge representation of language. Coming generations of information and knowledge systems will require a closer understanding of the communication process between author and reader. My aim is to improve information access by designing systems based on an informed but practical analysis of usage, context, situation, and domain, and by formulating and implementing a flexible and scalable representation for what is essential for realistic volumes of linguistic data.
This includes understanding how language usage changes over time and over new modes of communication, such as new text genres or new modalities. I am currently (2021) very interested in understanding how new genres will emerge in the convergence of broadcasting, video clip sharing, and podcast publication, how they can be related to previous media, and made easily and handily accessible.
Most of what I have written can be found in various repositories on the net, e.g. via ORCID.
As my first research effort I attempted to formulate an Algebra for Recommendations. An Algebra for Recommendations (1990). This is what is now known as recommender systems, and I probably should have continued along this path in spite of an initial reviewer-number-two setback having to do with ethical issues involved with clustering .newsrc files. Newsgroup Clustering Based On User Behavior - A Recommendation Algebra (1994)
Stylostatistics and Studies of Genre
Since 1993 I have worked on computational stylistic analysis of text. Previous work on style and genre has been motivated from a primarily philological standpoint, even if sometimes computationally oriented. The 1994 Coling publication Recognizing Text Genres with Simple Metrics Using Discriminant Analysis (1994) by myself and Douglass Cutting marked the first attempt to apply general language technology methods for this purpose. I have held numerous talks, seminars, and international symposia on the topic, covering methodology, results, and applications. New Text (2006); Textual Stylistic Variation: Choices, Genres and Individuals (2010); Conventions and Mutual Expectations - understanding sources for web genres (2010); The Relation Between Author Mood and Affect to Sentiment in Text and Text Genre (2011)
Currently I am working on extending this work to the emerging landscape of podcasts and other recorded speech, e.g. to study how podcasts are systematically different from other previous collections of language. Lexical variation in English language podcasts, editorial media, and social media (2022)
Scalable, realistic, and useful semantic models
Since 1998 I have, together with Magnus Sahlgren, participated in work pioneered by Pentti Kanerva on computational models that are scalable and behaviouristically and neurophysiologically plausible, for processing large amounts of text efficiently and usefully. We have worked on building semantic spaces based on distributional analysis of linguistic items, originally using the random indexing processing model and memory model. From Words to Understanding (2001); Meaningful Models for Information Access Systems (2005); Filaments of Meaning in Word Space (2008)
This work is continuing and I am currently interested in exploring the interface between geometric high-dimensional models on the one hand and graph and topological models on the other. Counting Lumps in Word Space (2005); Semantic Topology (2014)
Parts of this work became the text analysis company Gavagai, which I co-founded with Magnus Sahlgren in 2008 and where I worked with him, Fredrik Olsson, Fredrik and Nicolas Espinoza, and Ola Hamfors until 2019. We started by building a lexical learning model, which we used for sentiment analysis in media monitoring and for analysis of customer feedback and questionnaires. The Gavagai Living Lexicon (2016); Usefulness of Sentiment Analysis (2014); Analysis of Open Answers to Survey Questions through Interactive Clustering and Theme Extraction (2018)
We even toyed with applying our model to non-human communication, but while the approach still seems reasonable we never managed to get the project properly afloat or in the air as it were. A proposal to use distributional models to analyse dolphin vocalisation (2017)
In 2017-2018 I visited Stanford, hosted by Martin Kay. Together with Pentti Kanerva, I worked on an application of random indexing for construction grammar: High-dimensional distributed semantic spaces for utterances (2019). A brief discussion with Dan Jurafsky in the coffee lounge after a seminar later became a paper on how the semantic representation of human language and high-dimensional models interact: Semantics in High-Dimensional Space (2021).
Interacting with information
At SICS I worked on various aspects of interaction with information systems in numerous projects. These papers range from understanding how to interact with virtual worlds to how to model and manage user expectations in human-computer dialogue. Interaction Models, Reference, and Interactivity in Speech Interfaces to Virtual Environments (1995); The Interaction of Discourse Modality and User Expectations in Human-Computer Dialog (1992); Inferring Complex Plans (1993); Transparent Natural Language Interaction through Multimodality (1993); A Glass Box Approach to Adaptive Hypermedia (1995). An especially entertaining application domain was helping home owners better understand their energy usage. Socially Intelligent Interfaces for Increased Energy Awareness in the Home (2008)
Evaluating information systems
The reason I moved in the direction of information retrieval from linguistics was to work with large amounts of language data and to process them to meet some purpose of interest to their audience and creators. This had to do with my interest in stylistics, and that in turn led me to think more about models for evaluating quality of information retrieval. The contributions have largely been channeled into my participation in the CLEF series of workshops and conferences where I usually go to talk with colleagues about shared tasks and innovative evaluation schemes. Especially interesting to me is how an intrinsic evaluation scheme could be built for learning models and how evaluation schemes could be moved from laboratories to operational settings. Evaluating Learning Language Representations (2015); Adopting Systematic Evaluation Benchmarks in Operational Settings (2019)
One of my favourite gripes is the need to make the distinction between benchmarking and validation clear: how to figure out whether a component that is solidly reliable on a laboratory bench will be useful in practical application, how a laboratory benchmark can be variously useful in real life, and how it might impact the activities the model is used for. How Lexical Gold Standards Have Effects On The Usefulness Of Text Analysis Tools For Digital Scholarship (2019)
(Also, research needs to be exciting and fun.) From Boxes and Arrows to Conversation and Negotiation or how Research should be Amusing, Awful, and Artificial (2006)