Mudra: A Unified Multimodal Interaction Framework
Lode Hoste, Bruno Dumas and Beat Signer
Web & Information Systems Engineering Lab
Vrije Universiteit Brussel
Pleinlaan 2, 1050 Brussels, Belgium
{lhoste,bdumas,bsigner}@vub.ac.be
ABSTRACT
In recent years, multimodal interfaces have gained momentum as
an alternative to traditional WIMP interaction styles. Existing multimodal fusion engines and frameworks range from low-level data
stream-oriented approaches to high-level semantic inference-based
solutions. However, there is a lack of multimodal interaction engines offering native fusion support across different levels of abstraction to fully exploit the power of multimodal interactions. We
present Mudra, a unified multimodal interaction framework supporting the integrated processing of low-level data streams as well
as high-level semantic inferences. Our solution is based on a central fact base in combination with a declarative rule-based language
to derive new facts at different abstraction levels. Our innovative
architecture for multimodal interaction encourages the use of software engineering principles such as modularisation and composition to support a growing set of input modalities as well as to enable
the integration of existing or novel multimodal fusion engines.
Keywords
multimodal interaction, multimodal fusion, rule language, declarative programming
Categories and Subject Descriptors
D.2.11 [Software Engineering]: Software Architectures; H.5.2
[Information Interfaces and Presentation]: User Interfaces
General Terms
Algorithms, Languages
1. INTRODUCTION
Multimodal interaction and interfaces have become a major research topic over the last two decades, representing a new class of
user-machine interfaces that are different from standard WIMP interfaces. As stated by Oviatt [13], multimodal interfaces have the
added capability to process multiple user input modes not only in
a parallel manner, but also by taking temporal and semantic combinations between different input modes into account. These interfaces tend to emphasise the use of richer and more natural ways
of communication, including speech or gesture, and more generally address all five senses. Hence, the objective of multimodal
interfaces is twofold: first, to support and accommodate a user’s
perceptual and communicative capabilities; and second, to embed
computational power in the real world, by offering more natural
ways of human-computer interaction [5].
However, the process of recovering the user intent through multiple different input sources and their potential combination, known
as “multimodal input fusion”, presents a number of challenges to be
overcome before multimodal interfaces can be experienced to their
fullest. First, the processing has to happen in real time, requiring architectures to efficiently manage parallel input streams
as well as to perform the recognition and fusion in the presence
of temporal constraints. Second, the type of data to be managed
by a multimodal system may originate from a variety of different
sources. For example, a multi-touch surface might deliver multiple streams of pointer positions along with identifiers for different
fingers, hands or even different users. On the other hand, a speech
recogniser may deliver a list of potential text results in combination with the corresponding recognition probabilities. Being able to
fuse user input from such different channels is one of the strengths
of multimodal interfaces, but in practice only a few tools have been
able to fully support data-agnostic input. In effect, most multimodal interaction tools either focus on extracting semantic interpretations out of the input data or offer low-level management of
input data streams. Semantic-level tools are typically high-level
frameworks that consume semantic events from multiple modalities to achieve a more intuitive and improved human-computer interaction. In these approaches, fusion is done on sets of individual
high-level interpretations, mostly coming from different recognisers. The second family of tools and frameworks focuses strongly on dataflow paradigms to process raw data at the data level. These approaches typically declare the flow of primitive events by chaining multiple boxes that serve as filters or perform event-specific fusion of multiple sources. The consequence is that we have, on the one hand, semantic tools which struggle with low-level and high-frequency data and, on the other hand, frameworks to manage low-level input data streams which have to resort to ad hoc, case-by-case implementations for higher-level information fusion.
Sharma et al. [17] identified three different levels of fusion of input data: data-level fusion, feature-level fusion and decision-level
fusion. These fusion levels are seen as distinct entities, working at
completely different stages. On the other hand, fusion as a whole is
supposed to be able to take into account data from any of these three
levels, be it x/y coordinates from a pointer or semantic information
coming from speech. In this paper, we address this conflict between
seemingly irreconcilable fusion levels and data-agnostic input. A
first open issue is the correlation of the high-throughput events processed by dataflow approaches with the low-throughput, high-level events produced by semantic inference-based fusion. A
second issue is how to retain access to low-level information when
dealing with interpreted high-level information. For example, a
gesture can be described by a set of x/y coordinates, by a sequence
of atomic time-stamped vectors or by a semantic interpretation denoting that the user has drawn an upward-pointing arrow. Only a
few tools keep track of the three different information levels and
consider that users might express either deictic pointing information, manipulative movements or iconic gestures. Finally, in order
to achieve multimodal fusion, some specific metadata has to be extracted regardless of the level of fusion. For instance, for raw-level
data provided by a simple deictic pointing gesture that is freely performed and captured by a 3D camera, the start and end times of the gesture are required in order to resolve temporal relationships with
other modalities.
To bridge this missing link between low-level and high-level events, we developed a unified multimodal fusion engine capable of reasoning over both primitive and high-level information based on time window constructs. We encourage the use of modularisation and composition to build reusable and easily understandable building blocks for multimodal fusion. These software principles are emphasised by Lalanne et al. [11], who state that the "engineering aspects of fusion engines must be further studied, including
the genericity (i.e., engine independent of the combined modalities), software tools for the fine-tuning of fusion by the designer or
by the end-users as well as tools for rapidly simulating and configuring fusion engines to a particular application by the designer or
by the end-users.”
We start by discussing related work in Section 2 and investigate
how existing approaches address the correlation between low-level
and high-level events. In Section 3, we present the architecture
of our multimodal interaction framework, called Mudra, and introduce the features that enable Mudra to deal with data-level, feature-level as well as decision-level fusion while retaining temporary fusion information. We further introduce the declarative rule-based
language forming part of the Mudra core. A discussion and comparison of our unified multimodal interaction framework with existing multimodal engines as well as potential future directions are
provided in Section 4. Concluding remarks are given in Section 5.
2. BACKGROUND
The fusion of multimodal input data can take place at different
levels of abstraction. In this section, we first present these different levels together with some classical use cases. We then discuss
existing solutions for the fusion of multimodal input at these abstraction levels and identify some of their limitations.
2.1 Multimodal Fusion Levels
As mentioned earlier, Sharma et al. [17] distinguish three levels
of abstraction to characterise multimodal input data fusion: data-level fusion, feature-level fusion and decision-level fusion.
• Data-level fusion focuses on the fusion of identical or tightly
linked types of multimodal data. The classical illustration of
data-level fusion is the fusion of two video streams coming
from two cameras filming the same scene at different angles
in order to extract the depth map of the scene. Data-level fusion rarely deals with the semantics of the data but tries to
enrich or correlate data that is potentially going to be processed by higher-level fusion processes. As data-level fusion works on the raw data, it has access to the detailed information but is also highly sensitive to noise or failures. Data-level fusion frequently entails some initial processing of raw
data including noise filtering or very basic recognition.
• Feature-level fusion is one step higher in abstraction than
data-level fusion. Typically, data has already been processed
by filters and fusion is applied on features extracted from the
data rather than on the raw data itself. Feature-level fusion
of modalities typically applies to closely coupled modalities
with possibly different representations. A classical example is speech and lip movement integration [14], where data
comes from a microphone that is recording speech as well
as from a camera filming the lip movements. The two data
streams are synchronised and in this case the goal of the data
fusion is to improve speech recognition by combining information from the two different modalities. Feature-level fusion is less sensitive to noise or failures than data-level fusion
and conveys a moderate level of information detail. Typical feature-level fusion algorithms include statistical analysis
tools such as Hidden Markov Models (HMM), Neural Networks (NN) or Dynamic Time Warping (DTW).
• Decision-level fusion is centered around deriving interpretations based on semantic information. It is the most versatile kind of multimodal fusion, as it can correlate information coming from loosely coupled modalities, such as speech
and gestures. Decision-level fusion includes the merging of
high-level information obtained by data- and feature-level fusion as well as the modelling of human-computer dialogues.
Additionally, partial semantic information originating from
the feature level can lead to mutual disambiguation [12].
Decision-level fusion is assumed to be highly resistant to noise and failures. However, since it relies on the quality of previous processing steps, the information that is available for decision-level fusion algorithms may be incomplete or
distorted. Typical classes of decision-level fusion algorithms
are meaning frames, unification-based or symbolic-statistical
fusion algorithms.
Note that a single modality can be processed on all three fusion
levels. For example, speech can be processed at the signal (data)
level, phonemes (features) level or utterances (decision) level. In
the case of speech, the higher fusion levels might use results from
lower-level fusion. Surprisingly, existing multimodal interaction
frameworks often excel at one specific fusion level but encounter
major difficulties at other levels. We argue that the reason for these
limitations lies at the architecture level and in particular in how the
initial data from different modalities is handled.
2.2 Data Stream-Oriented Architecture
One approach to build multimodal interaction architectures is to assume continuous streams of information coming from different modalities and to process them via a number of chained filters. This
is typically done to efficiently process streams of high frequency
data and to perform fusion on the data and/or feature level. Representatives of this strategy are OpenInterface [16] and Squidy [10],
employing a data stream-oriented architecture to process raw data
sources and fuse multiple sources on an event-per-event basis.
Although these data stream approaches advocate the use of composition boxes, they do not provide a fundamental solution to define
temporal relations between multiple input sources. All incoming
events are handled one by one and the programmer needs to manually take care of the intermediate results. This makes the management of complex semantic interpretations difficult. Data stream-oriented architectures show their limits when high-throughput information such as accelerometer data (i.e. more than 25 events
per second) should be linked with low-throughput semantic-level
information such as speech (i.e. less than one event per second).
When confronted with the fusion of information coming from different abstraction levels, these architectures tend to rely on a case-by-case approach, thereby losing their genericity. Furthermore,
the decision-level fusion of semantic information between multiple
modalities requires classes of algorithms, such as meaning frames,
which address temporal relationships and therefore need some kind
of intermediate storage (e.g. registers). These algorithms are not in
line with the stream-oriented architecture and developers have to
rely on ad hoc solutions.
2.3 Semantic Inference-Based Approach
A second type of architecture for multimodal interaction focuses
on supporting fusion of high-level information on the decision level.
These approaches offer constructs to specify sets of required information before an action is triggered. Information gathered from the
different input modalities is assumed to be classified correctly. Furthermore, these approaches work best with relatively low frequency
data and highly abstracted modalities.
Four classes of fusion algorithms are used to perform decision-level fusion:
• Meaning frame-based fusion [19] uses data structures called
frames for the representation of semantic-level data coming
from various sources or modalities. In these structures, objects are represented as attribute/value pairs.
• Unification-based fusion [9] is based on recursively merging
attribute/value structures to infer a high-level interpretation
of user input.
• Finite state machine-based approaches [8] model the flow of
input and output through a number of states, resulting in a better
integration with strongly temporal modalities such as speech.
• Symbolic/statistical fusion, such as the Member-Team-Committee (MTC) algorithm used in Quickset [21] or the probabilistic approach of Chai et al. [2], is an evolution of standard
symbolic unification-based approaches, which adds statistical processing techniques to the fusion techniques described
above. These hybrid fusion techniques have been demonstrated to achieve robust and reliable results.
The presented approaches work well for the fusion of semantic-level events. However, when confronted with lower-level data, such
as streams of 2D/3D coordinates or data coming from accelerometers, semantic inference-based approaches encounter difficulties
in managing the high frequency of input data. In order to show
their potential, these approaches assume that the different modalities have already been processed and that we are dealing with
semantic-level information.
However, even when confronted with semantic-level data, several issues can arise with existing approaches. First, they have to
fully rely on the results of the modality-level recognisers without
having the possibility to exploit the raw information at all. This
can lead to problems in interpretation, for example with continuous
gestures (e.g. pointing) in thin air.
Second, as decision-level fusion engines assume that the creation
of semantic events happens at a lower level, they have no or only
limited control over the refresh rate of these continuous gestures.
The typical solution for this scenario is to create a single pointing
event at the time the hand was steady. Unfortunately, this considerably slows down the interaction and introduces some usability
issues. Another approach is to match the pointing gesture for every
time step on the discrete time axis; for example once per second.
However, this conflicts with the occupied meaning frame slot and
demands ad hoc solutions.
A third issue that arises when employing meaning frames or similar fusion algorithms is related to the previously discussed problem. Suppose that a user aborts and restarts their interaction with
the computer by reissuing their commands. Recognised information from the first attempt, such as a "hello" speech utterance, is already occupying the corresponding slot in the meaning frame. A second triggering of "hello" will either be refused and possibly result in a misclassification due to an unexpected time span when matched with a newer pointing gesture, or it will overwrite the existing one, which introduces problems for partially overlapping fusion since meaningful scenarios might be dropped.
Finite state machine-based approaches such as [8] typically lack
the constructs to express advanced temporal conditions. The reason
is that a finite state machine (FSM) enforces the input of events in
predefined steps (i.e. event x triggers a transition from state a to b).
When fusing concurrent input, all possible combinations need to be
manually expressed.
The two major benefits of these approaches are the flexible semantic and temporal
relations between edges and the inherent support for probabilistic
input. However, the manual construction of complex graphs becomes extremely difficult to cope with as the number of cases to
be taken into account is growing. When such systems have to be
trained, the obvious problem of collecting training sets arises and
once again increases with the number of considered cases. Additionally, these approaches require a strict segmentation of the interaction. This implies a clear specification of the start and stop
conditions before the matching occurs. Hence, supporting overlapping matches introduces some serious issues and also has an impact
on the support for multiple users and the possible collaborative interaction between them.
2.4 Irreconcilable Approaches?
Finally, other issues, such as multi-user support, are currently
problematic in both data stream-oriented and semantic inference-based approaches. For instance, at the raw data level, potentially available user information is frequently lost since the data is treated in the same way as any other piece of data, and retaining it requires ad hoc implementations. When employing meaning frame-based fusion or
any other decision-level fusion, slots can be occupied by any user.
However, this means that events from one participant can be undesirably composed with events from another user. Note that a major
additional effort is required from the programmer to support multi-user scenarios, since every meaning frame has to be manually duplicated with a constant constraint on the user attribute, resulting
in an increased fusion description complexity.
In conclusion, data stream-oriented architectures are very efficient when handling data streams and semantic inference-based approaches process semantic-level information with ease. However,
none of the presented approaches is efficient in handling both high-frequency data streams at a low abstraction level and low-frequency semantic pieces of information at a high abstraction level, let alone the possibility to use data-level, feature-level and decision-level information of the very same data stream at the same time. In
the next section, we present our unified multimodal interaction architecture called Mudra, which reconciles the presented approaches
by supporting fusion across the different abstraction levels.
Figure 1: Mudra architecture
3. MUDRA
In order to build a fusion engine that is able to process information on the data, feature and decision level in real-time, we believe
that a novel software architecture is needed. We present Mudra,
our multimodal fusion framework which aims to extract meaningful information from raw data, fuse multiple feature streams and
infer semantic interpretation from high-level events.
The overall architecture of the Mudra framework is shown in
Figure 1. At the infrastructure level, we support the incorporation
of arbitrary input modalities. Mudra currently supports multiple modalities including skeleton tracking via Microsoft’s Xbox Kinect in combination with the NITE1 package, cross-device multi-touch information via TUIO and Midas [15], voice recognition via
CMU Sphinx2 and accelerometer data via SunSPOTs3 . These bindings are implemented in the infrastructure layer. On arrival, event
information from these modalities is converted into a uniform representation, called facts, and timestamped by the translator. A fact
is specified by a type (e.g. speech) and a list of attribute/value pairs,
called slots (e.g. word or confidence). For example, when a
user says "put", the fact shown in Listing 1 is inserted into a fact
base. Facts can address data coming from any level of abstraction
or even results from fusion processing. A fact base is a managed
collection of facts, similar to a traditional database.
Listing 1: "Put" event via Speech
1 (Speech (word "put") (confidence 0.81)
2
(user "Lode") (on 1305735400985))
However, instead of activating queries on demand, we use continuous rules to express conditions to which the interaction has to
adhere. A production rule consists of a number of prerequisites
(before the ⇒) and one or more actions that are executed whenever the rule is triggered. Such a prerequisite can either be a fact
match or a test function. A match is similar to an open slot in meaning frames but with the possibility to add additional constraints and
boolean features, which leads to more flexible expressions. Test
functions are user defined and typically reason over time, space or
other constraints (e.g. tSequence or tParallel for sequential
and parallel temporal constraints). Finally, when all prerequisites
are met, an action (after the ⇒) is triggered. Typical behaviour for
an action is the assertion of a new, more meaningful fact in the fact
base while bundling relevant information.
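As a minimal illustration of these constructs, the following sketch promotes a sufficiently confident Speech fact to a more meaningful fact; the Command fact type is hypothetical and the 0.5 threshold is merely illustrative.

(defrule promote-put-command
  ; prerequisite: a Speech fact whose word slot equals "put"
  ?s <- (Speech (word "put") (confidence ?c) (user ?u) (on ?t))
  ; test function reasoning over the matched fact
  (test (> ?c 0.5))
  =>
  ; action: assert a new, more meaningful fact into the fact base
  (assert (Command (action "put") (user ?u) (on ?t))))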
1 NITE and OpenNI: http://www.openni.org
2 CMU Sphinx: http://cmusphinx.sourceforge.net
3 Sun SunSPOT: http://www.sunspotworld.com
The encapsulation of data enables modularisation and composition when modelling complex interactions. This is inherently supported by our approach and allows developers to easily encode multimodal interaction. These constructs form the basis of our solution and allow developers to match the complete range from low-level to high-level events. The inference engine is based on CLIPS4
(C Language Integrated Production System), which is an expert
system tool developed by the Technology Branch of the NASA
Lyndon B. Johnson Space Center. We have substantially extended
this tool with an extensive infrastructure layer, the support for continuous evaluation, the inclusion of machine learning-based recognisers (e.g. DTW and HMM) and a network-based communication
bus to the application layer. The application layer provides flexible
handlers for end-user applications or fission frameworks, with the
possibility to feed application-level entities back to the core layer.
3.1 Unified Multimodal Fusion
3.1.1 Data-Level processing
Data-level processing is primarily used for two purposes in the
Mudra framework: noise filtering and recognition. Kalman filtering [20] typically allows for easier recognition of gestures in accelerometer data. This processing is achieved at the infrastructure
layer since filtering is tightly coupled with specific modalities. Employing rules at the data level has already been shown to be effective for the recognition of complex multi-touch gestures [15]. The
use of production rules to encode gestures based on vision datalevel input has also been exploited by Sowa et al. [18] who used
the following declarative encoding for a pointing gesture: “If the
index finger is stretched, and if all other fingers are rolled (pointing hand-shape) and if the hand simultaneously is far away from
the body, then we have a pointing gesture”. A similar approach
is used in Mudra in the form of production rules to deal with the
correlation of information at the data level.
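A sketch of such a data-level rule in our notation is given below. The HandPose fact, with per-finger flexion values and a distance of the hand from the body, is a hypothetical example of what a vision-based tracker could assert, and the thresholds are illustrative.

(defrule pointing-gesture
  ?h <- (HandPose (index-flexion ?i) (other-flexion ?o)
                  (body-distance ?d) (user ?user) (on ?t))
  (test (> ?i 0.9))   ; index finger stretched
  (test (< ?o 0.2))   ; all other fingers rolled
  (test (> ?d 0.4))   ; hand far away from the body (in metres)
  =>
  (assert (Point (user ?user) (on ?t))))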
3.1.2 Feature-Level processing
To improve recognition rates, fusion at the feature level can be
used to disambiguate certain cases where a single modality falls
short. For example, in multi-touch technology, every finger gets
assigned a unique identifier. However, this does not provide information on whether these fingers originate from the same hand or from different users. The fusion of existing techniques, for example shadow images [6] or the use of small amounts of electrical current to identify individual users [4], is possible at the feature level.
4 CLIPS: http://clipsrules.sourceforge.net
It is important to stress that we do not enforce a strict dataflow
from the data level to the feature level. This has the advantage that
data-level recognisers can benefit from information provided at the
feature level. If existing feature-level techniques are incorporated
in the framework, the data-level processing of multi-touch gestures
can, for example, immediately profit from their results.
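The following rule sketch illustrates how such feature-level information could be correlated with raw touch input; the Cursor, ShadowRegion and UserCursor fact types as well as the bounding-box test are hypothetical.

(defrule assign-cursor-to-user
  ?c <- (Cursor (id ?id) (x ?x) (y ?y) (on ?t))
  ; a hand shadow region attributed to a user, e.g. extracted from a camera image
  ?s <- (ShadowRegion (user ?u) (min-x ?x1) (max-x ?x2) (min-y ?y1) (max-y ?y2))
  ; the cursor position has to fall within the shadow region
  (test (and (>= ?x ?x1) (<= ?x ?x2) (>= ?y ?y1) (<= ?y ?y2)))
  =>
  (assert (UserCursor (id ?id) (user ?u) (x ?x) (y ?y) (on ?t))))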
3.1.3 Decision-Level processing
At the decision level, advanced multimodal fusion can be modelled in a very flexible manner, as developers have
access to events ranging from low to high level. External data-level,
feature-level or decision-level fusion algorithms can also be applied
to any facts available in the fact base. The underlying complexity
is hidden from developers, as illustrated in Listing 2 showing our
implementation of Bolt’s famous “put that there” example [1].
Listing 2: Bolt’s "Put that there"
1  (defrule bolt
2    (declare (global-slot-constraint (user ?user)))
3    ?put <- (Voice (word "put") {> confidence 0.7})
4    ?that <- (Voice (word "that"))
5    ?thatp <- (Point)
6    (test (tParallel ?that ?thatp))
7    ?there <- (Voice (word "there"))
8    ?therep <- (Point)
9    (test (tParallel ?there ?therep))
10   (test (tSequence ?put ?that ?there))
11 =>
12   (assert (BoltInteraction)))
In this fusion example, we assume some high-level events. For
example, lines 3 and 4 show a pattern match on a voice fact containing a "put" (see Listing 1) and a "that" string in the word slot.
Resulting fact matches are bound to variables, denoted by a question mark (i.e. ?put and ?that). Line 5 specifies a point event,
which could be issued by a touch interaction or a hand pose. The
point fact also contains a modality type slot, but if the developer
does not constrain the attribute information to a single or multiple
modalities, the rule will trigger for all cases. This example illustrates the abstraction level of our declarative rules, where the underlying complexity of the point event is hidden by one or multiple
rules or external recognisers.
Different temporal operators, such as tParallel (line 6) or
tSequence (line 10), are user defined rather than being fixed
and limited to engine-level constructs. Developers can introduce
their own operators at any time. It is important to note that the
voice events generated by recognisers in the infrastructure layer
are merged with point events extracted by data-level processing.
Although pointing is a continuous interaction which generates multiple events per second, our system is able to fuse both inputs. Fusion algorithms at different fusion levels may find patterns in the
fact base. In the future we plan to further exploit this feature by
dynamically analysing speech in fusion-aware speech grammars.
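As a sketch of how such user-defined operators can be realised, the following two functions simply compare the timestamps of two matched facts. Mudra's actual operators additionally accept more than two facts and reason over begin and end times, so the single on timestamp and the 500 ms window used here are simplifying assumptions.

(deffunction tSequence (?a ?b)
  ; true if fact ?a happened before fact ?b
  (<= (fact-slot-value ?a on) (fact-slot-value ?b on)))

(deffunction tParallel (?a ?b)
  ; true if the two facts occurred within a 500 ms window
  (<= (abs (- (fact-slot-value ?a on) (fact-slot-value ?b on))) 500))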
3.2 Fundamental Features of Mudra
Attribute Constraints Additional constraints can be enforced
by developers before a matched fact type is bound to a variable. A
first constraint is realised by assigning a constant value to an attribute. This is shown on line 3 of Listing 2 by stating the string "put" for the word slot. A boolean OR operator supports alternative constant values where required.
A second interesting constraint is available via inline function
calls (denoted by curly braces), which is outlined on line 3. This
construct not only supports boolean operators such as AND, OR
and NOT, but it can also be used to specify value ranges (e.g. for
sliders) or to call user-defined functions. In Listing 2, we applied
an inline function call to enforce a minimal probability for the correctness of the recognised word.
A third type of attribute constraint is to use variable bindings. In
production rules, a variable can only be bound once. Thus, if a single variable is used at multiple locations, it indicates that all these
instances should contain the same value. This is very flexible as developers are not forced to provide a constant value. Typically, this feature is applied in a multi-user context, where all matched events should be produced by the same user without referring to a particular username. In Listing 2, we applied this mechanism via a
macro (global-slot-constraint) which spans over all fact
matches that contain user information.
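The following sketch combines the three types of attribute constraints in a single rule: a constant value with an OR alternative on the word slot, inline range constraints in the style of Listing 2 and a shared ?user variable binding. The Slider and SelectObject fact types are hypothetical, and the use of multiple inline constraints within one pattern is an assumption about our extended notation.

(defrule select-object
  ; constant value with an alternative on the word slot
  ?cmd <- (Voice (word "put"|"select") (user ?user))
  ; inline constraints restricting the slider value to a range
  ?sl <- (Slider (user ?user) {>= value 0.2} {<= value 0.8})
  ; the shared ?user binding requires both events to come from the same user
  (test (tParallel ?cmd ?sl))
  =>
  (assert (SelectObject (user ?user))))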
Negation of Events A powerful feature is the use of negation
to denote that an event should not happen during the defined scenario. This construct can also be used to define priorities between
different modalities, for instance to express that pointing should be
active as long as there is no voice input. The rich expressiveness of
the negation feature is very handy when describing certain types of
multimodal interactions.
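A sketch of this priority pattern is shown below: the pointing interaction only drives the application as long as no Voice fact of the same user is present in the fact base. The PointerControl fact type is hypothetical.

(defrule silent-pointing
  ?p <- (Point (user ?user))
  ; negation: the rule only fires while no voice input of this user is present
  (not (Voice (user ?user)))
  =>
  (assert (PointerControl (user ?user))))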
Local Integration of Probability Probability information originating from external recognisers (e.g. speech recognition) can be
integrated as attribute values. Due to the advanced attribute constraint mechanism, a threshold can be set locally and is not required
to be system-wide. It is very interesting to exploit this feature to reduce false positives, as one can enforce a higher threshold for key
components of the fusion. For instance, line 3 in Listing 2 requires
a recognition probability higher than 0.7 for the speech recognition
of “put”, which is higher than the default threshold of 0.5. When
extending the system with additional but similar fusion rules like
“clone that there”, the possibility to refine these probabilities on a
per-event basis is a clear advantage.
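As a sketch, the "clone that there" variant mentioned above only differs from Listing 2 in its key word and its locally refined threshold; the 0.85 value is illustrative.

(defrule clone
  (declare (global-slot-constraint (user ?user)))
  ; a higher local threshold for the key "clone" event reduces false positives
  ?clone <- (Voice (word "clone") {> confidence 0.85})
  ?that <- (Voice (word "that"))
  ?thatp <- (Point)
  (test (tParallel ?that ?thatp))
  ?there <- (Voice (word "there"))
  ?therep <- (Point)
  (test (tParallel ?there ?therep))
  (test (tSequence ?clone ?that ?there))
  =>
  (assert (CloneInteraction)))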
Overlapping Matches Support for overlapping matches is an
important benefit of our approach and makes it possible to bridge the gap
between low-level and high-level fusion. New events that overlap
with partial matches are not thrown away but create new, additional
partial matches. The mechanism relates to an automated replication
strategy of meaning frames whenever a register is occupied. Developers do not have to decide between skipping new events or overwriting existing partial matches. Overlapping matches are handled
very efficiently by the Rete algorithm [7]. Since we inherently support the bookkeeping of partial matches, we provide an additional
delay construct to control the frequency of the rule triggering.
The delayed triggering is important for data-level processing since
many low-level events lead to similar conclusions and a reduction of events minimises the processing required by the inference
engine. Note that the delay construct can be applied to define the
refresh rate of continuous gestures (e.g. pointing). Decision-level
fusion greatly benefits from this control mechanism in our unified
multimodal framework.
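The following sketch indicates how the delay construct could be attached to a rule dealing with continuous pointing; the declaration name and the 250 ms value are assumptions about the concrete syntax rather than the actual Mudra notation.

(defrule continuous-pointing
  ; hypothetical delay declaration: fire this rule at most once every 250 ms
  (declare (delay 250))
  ?p <- (Point (user ?user))
  =>
  (assert (PointerUpdate (user ?user))))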
Sliding Window The fact base only contains facts that have not
yet outlived their time span. This time span parameter is necessary
for performance and memory reasons. A time span is specified
per fact type, which allows developers to keep high-level semantic
events with a low throughput longer in the fact base than low-level
events generated with a higher refresh rate. The result is a flexible time-windowing strategy where developers can choose between
performance and accessibility of older data for fusion.
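A sketch of such per-fact-type time spans is given below; the construct name and the millisecond values are assumptions, chosen such that low-throughput Speech facts survive considerably longer than high-frequency Point facts.

; hypothetical per-fact-type time spans (in milliseconds)
(set-fact-lifetime Speech 5000)
(set-fact-lifetime Point 500)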
Multi-User Support Multi-user support is achieved by specifying conditions on attributes. Whenever any user information is
available—either originating from hardware, extracted by a recogniser or fused from multiple modalities—it can be included as an
attribute in a fact. As mentioned earlier, a single variable binding in a rule can be used to enforce that events are generated by the same user. However, specifying this attribute for every conditional element introduces a lot of redundant program code, resulting
in more complex rules. We therefore introduced a new language
construct to declare constraints on all matches whenever the specified attribute is present. This is illustrated on line 2 of Listing 2
which enforces all events to be issued by the same user. Due to the
inherent support for overlapping matches, Listing 2 supports the
concurrent interaction of multiple users in the multimodal “put that
there” scenario.
Collaborative User Support To go one step further, we show
how rules can be employed to support collaborative interaction.
Hoccer5 is an example of a simple collaborative scenario where
users can share data by initiating a throw and catch gesture. The implementation of this scenario, which matches a throw and a catch
fact, is shown in Listing 3. The rule declares that the throw and
catch fact should originate from two different users (nequal test
on line 4) and that the former should happen before the latter (temporal constraint on line 5). Finally, the spatial constraint on line 6
tests whether the throw was performed in the direction of the catch.
Again, the recognition of multiple users concurrently throwing data
at each other is completely handled by the inference engine without
any additional programming effort.
Listing 3: Collaborative multimodal interaction
1  (defrule throwAndCatch
2    ?throw <- (Throw (user ?user1))
3    ?catch <- (Catch (user ?user2))
4    (test (nequal ?user1 ?user2))
5    (test (tSequence ?throw ?catch))
6    (test (sInDirectionOf ?throw ?catch))
7  =>
8    (assert (ThrowAndCatch
9      (user1 ?user1) (user2 ?user2)
10     (on:begin ?throw.on) (on:end ?catch.on)
11     (on ?catch.on))))
Compilation Rules are compiled to a Rete network to achieve soft real-time performance. Rete is a very efficient mechanism that compiles multiple rules to a dataflow graph and stores intermediate results to speed up pattern matching. Note that the engine itself takes care of storing intermediate results, a task that otherwise puts an additional burden on application developers. It is
also important to mention that the temporal and other constraints
are handled at the level of individual Rete nodes. This means that we are open to incorporating more advanced approximations (e.g. based on data obtained by training) at runtime, without running into architectural or
performance issues. This compilation step is provided by CLIPS
and allows us to process an average of 9615 events per second with
the two code samples (i.e. Listing 2 and Listing 3) active on an Intel
Core i7 with 4GB of RAM. The data consisted of 80% Points, 10%
Voice and 10% Throw/Catch facts with a successful fusion rate of
20%. The assumed data input in a realistic environment is around
25 (Point) + 1 (Voice) + 2 (Throw/Catch) events per second, which implies that our engine has no problems processing these
scenarios in real-time. Figure 2 shows an example of the compiled
Rete network for the rule defined in Listing 2. Note that the evaluation of dependent matches is postponed whenever possible and the system only spends very little time on each newly arriving event. However, this also implies that the ordering of the declared constraints in a rule can significantly influence the performance.
5 Hoccer: http://hoccer.com
Figure 2: Compiled directed acyclic graph of Bolt’s example
External Recognisers We also support the possibility to plug
external recognisers into Mudra, alongside the fusion algorithms. Such
external recognisers access data from the fact base and enrich it in
turn. Recognition algorithms such as Dynamic Time Warping and
Hidden Markov Models are examples of external recognisers. Flexible publish/subscribe handlers are provided for these recognisers.
In future work, we would like to exploit this external recogniser
feature even further via our concept of smart activations. As developers have access to low-level information in high-level fusion
scenarios, it is easy to initiate additional low-level recognition techniques when desired. This initiation can involve computationally intensive algorithms, which do not have to run continuously since their information is only interesting in certain scenarios. Typical applications are voice localisation using beamforming or the application of image processing for user identification.
3.3 Current Limitations
Our main focus is directed towards (1) extracting meaningful patterns from low- and high-level data, either via unimodal recognition or multimodal fusion, and (2) providing developers with high-level domain-specific language constructs to express advanced multimodal interaction with respect to the CARE properties [3]. Mudra
supports a wide range of recognition techniques (DTW, HMM, production rules at the data-, feature- and decision-level), but it does
not provide abstractions to set up a chain of raw data filters. We employ noise filtering for input data at the infrastructure layer (e.g. via
a Kalman filter for accelerometer data), but we do not offer a complete infrastructure to chain stream boxes as for example offered
by OpenInterface or Squidy. In case this is needed for future applications, it could be interesting to connect the output of these
frameworks to our infrastructure layer.
A second limitation of our current implementation is the lack of
advanced conflict resolution. A basic conflict resolution is offered
via a numeric salience indication per rule, which allows developers to prioritise rules. However, since this construct only works
when two rules trigger at the same time, we cannot always exploit
this functionality. We also argue that a numeric salience value is
insufficient to model all conflicting cases [15].
Dealing with probabilities at the attribute level permits a powerful control mechanism using high-level constructs. Although we
frequently exploit this explicit mechanism, we lack constructs to
automatically reason over the combination of multiple probabilities, as implemented by fuzzy logic approaches. It is an open question whether the increase in complexity at the performance
and programming level is worth the extra effort.
Developers using our unified multimodal framework have to be
aware of the ordering of their conditional elements. As mentioned
earlier, the order of conditions can significantly influence the performance. An initial, automated reordering of conditions is provided by CLIPS; however, it is limited to trivial situations. To improve the performance, we propose a simple guideline: position
events with a higher throughput at a later position in the rule and
put conditions as close as possible to the matches, allowing the engine to avoid unnecessary event processing.
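The guideline is illustrated by the following two sketches of the same hypothetical fusion rule. In the first variant, the high-throughput Point facts are matched first, so a partial match is stored for every pointing event even when no "put" was uttered; in the second variant, the rare Voice facts come first and most Point events cause little work in the Rete network.

; less efficient ordering: a partial match is stored for every Point fact
(defrule put-at-slow
  ?p <- (Point (user ?user))
  ?v <- (Voice (word "put") (user ?user))
  (test (tParallel ?v ?p))
  =>
  (assert (PutAt (user ?user))))

; preferred ordering according to the guideline
(defrule put-at-fast
  ?v <- (Voice (word "put") (user ?user))
  ?p <- (Point (user ?user))
  (test (tParallel ?v ?p))
  =>
  (assert (PutAt (user ?user))))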
4. DISCUSSION
In this section, we discuss how Mudra’s unified fusion relates
to other existing approaches. Frameworks positioned at the data
level, such as OpenInterface, Squidy and other data stream-based
approaches, rely on the linear chaining of processing components.
Although these boxes encapsulate the implementation complexity,
the internal implementation of such a box is far from trivial. Suppose that a high-throughput (vision) and a low-throughput (speech)
input stream have to be fused. The composition of such a box,
which handles all events one by one, requires a lot of bookkeeping. Key events have to be kept in local variables (state management) and all combinations of matches have to be manually explored (pattern matching). This ad hoc composition of boxes is of
course feasible but puts a burden on the application developer who
is only interested in expressing a simple correlation. This issue is
particularly present when other developers would like to extend the
internals of the box (e.g. to support multiple or collaborative users).
Most feature-level processing tools rely on preprocessed data,
such as noise filtering or multi-touch identification, before the fusion occurs. However, this means that data-level processing can
typically not benefit from recognised features, as existing architectures enforce a one-way propagation of events, as illustrated in Figure 3. In Mudra, we benefit from a single fact base with a garbage
collector, from which recognisers can access all available information at any time. Via this structure, data-level recognisers can incorporate optional feature data. It is worth mentioning that dealing
with “optional” data is fairly easy to accomplish via rules. For
instance, one rule is responsible for reasoning over raw data and
another rule augments this data whenever additional features are
found in the fact base. Due to the continuous evaluation of the inference engine, the second rule will automatically be triggered as
soon as the feature information becomes available.
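The following sketch illustrates this pattern with two rules over hypothetical fact types: the first rule reasons over raw Cursor facts alone, while the second one automatically augments the result as soon as feature-level HandAssignment information becomes available for the same cursor.

(defrule raw-cursor-move
  ?c <- (Cursor (id ?id) (x ?x) (y ?y))
  =>
  (assert (Move (id ?id) (x ?x) (y ?y))))

(defrule augment-with-hand
  ?m <- (Move (id ?id))
  ; optional feature-level information, matched only when present
  ?h <- (HandAssignment (cursor-id ?id) (hand ?hand) (user ?user))
  =>
  (assert (UserMove (id ?id) (hand ?hand) (user ?user))))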
Figure 3: Traditional chaining of fusion
Decision-level multimodal fusion assumes the existence of high-level semantic data. This type of fusion is also known as late fusion, where all high-level information is gathered and correlated. Despite
the introduced robustness, typical decision-level frameworks cannot recover from the loss of information which might occur at lower
levels. A secondary limitation of these frameworks is the lack of
support for overlapping matches. A commonly used implementation technique for high-level fusion is the incorporation of meaning
frames. As already mentioned in Section 2.3, apparent issues arise
when dealing with overlapping matches and continuous information. An important assumption of decision-level frameworks is the
atomic compilation of meaningful events by data- and feature-level
recognisers. However, for continuous gestures, such as pointing, it
is hard to control the frequency. The continuous pointing in Bolt’s
“put that there” scenario is a simple example to stress this issue.
The refresh rate of the pointing gesture is determined by the data-level processing. A high refresh rate introduces the problem of occupied slots in meaning frames, while a low refresh rate can lead to skipped decision-level integration since events are invalidated by the temporal constraints. It should be possible to circumvent these problems with an ad hoc solution; however, for more complex scenarios such as a multi-user environment, existing meaning frame-based solutions cannot be employed without inherent support for
overlapping matches.
We argue that current frameworks are bound to their implementation approach, which means they can only offer well-defined abstractions either at the data-, feature- or decision-level. There is
typically a one-way chaining from the lower to the higher level as
shown in Figure 3. We have incorporated existing techniques in
our unified approach and offer developers powerful language constructs to express their multimodal fusion requirements. One of
the important benefits is that developers are freed from the manual
bookkeeping of events. The declarative rules support the definition
of multimodal fusion in terms of conditions on one or more primitive events via composite high-level rules. All recognisers build
on top of each other, while they are still able to access low- or
high-level information to improve their recognition rates. We also
explicitly require every fact to be annotated with a timestamp for
fine-grained garbage collection.
Our unified approach solves a number of important issues. However, there is still a lot of room for improvement and future research. For instance, we plan to evaluate the use of multiple recognisers on the same data. A combination of rules with multiple machine learning techniques that reason over the same data could significantly improve recognition rates.
Another issue that we are currently investigating is the incorporation of user feedback via supervised gesture learning. Since all
low-level and high-level information is available in the fact base,
rules could be used to manage user feedback intentions. Whenever such a rule is triggered, we could delegate the training process of gesture recognisers using knowledge of previous handling.
Additionally, a batch learning approach can be used, in which the
training of a gesture is only triggered once a threshold number of positive and negative examples has been collected.
Finally, we would like to include the smart activation of input
modalities. For example, a low-level movement sensor could trigger the activation of a 3D camera, which in turn could activate the
speech recognition module whenever a user is close to the microphone to improve speech recognition rates. The same smart activation constructs could also be used to control the propagation of
information to machine learning techniques, as they are computationally too expensive for continuous evaluation.
5. CONCLUSION
Multimodal interfaces have become an important solution in the
domain of post-WIMP interfaces. However, significant challenges
still have to be overcome before multimodal interfaces can reveal
their true potential. We addressed the challenge of managing multimodal input data coming from different levels of abstraction. Our
investigation of related work shows that existing multimodal fusion approaches can be classified into two main categories: data
stream-oriented solutions and semantic inference-based solutions.
We further highlighted that there is a gap between these two categories and that most approaches trying to bridge this gap introduce some ad hoc solutions to overcome the limitations imposed by initial implementation choices. The fact that most multimodal interaction tools have to introduce these ad hoc solutions at some point confirms that there is a need for a unified software architecture with
fundamental support for fusion across low-level data streams and
high-level semantic inferences.
We presented Mudra, a unified multimodal interaction framework for the processing of low-level data streams as well as high-level semantic inferences. Our approach is centered around a fact
base that is populated with multimodal input from various devices
and recognisers. Different recognition and multimodal fusion algorithms can access the fact base and enrich it with their own interpretations. A declarative rule-based language is used to derive
low-level as well as high-level interpretations of information stored
in the fact base. By presenting a number of low-level and high-level input processing examples, we have demonstrated that Mudra
bridges the gap between data stream-oriented and semantic inference-based approaches and represents a promising direction for future unified multimodal interaction processing frameworks.
Acknowledgements
The work of Lode Hoste is funded by an IWT doctoral scholarship.
Bruno Dumas is supported by MobiCraNT, a project forming part
of the Strategic Platforms programme by the Brussels Institute for
Research and Innovation (Innoviris).
6. REFERENCES
[1] R. A. Bolt. “Put-That-There”: Voice and Gesture at the
Graphics Interface. In Proc. of SIGGRAPH 1980, 7th Annual
Conference on Computer Graphics and Interactive
Techniques, pages 262–270, Seattle, USA, 1980.
[2] J. Chai, P. Hong, and M. Zhou. A Probabilistic Approach to
Reference Resolution in Multimodal User Interfaces. In
Proc. of IUI 2004, 9th International Conference on
Intelligent User Interfaces, pages 70–77, Funchal, Madeira,
Portugal, 2004.
[3] J. Coutaz, L. Nigay, D. Salber, A. Blandford, J. May, and
R. Young. Four Easy Pieces for Assessing the Usability of
Multimodal Interaction: The CARE Properties. In Proc. of
Interact 1995, International Conference on
Human-Computer Interaction, pages 115–120, Lillehammer,
Norway, June 1995.
[4] P. Dietz and D. Leigh. DiamondTouch: A Multi-User Touch
Technology. In Proc. of UIST 2001, 14th Annual ACM
Symposium on User Interface Software and Technology,
pages 219–226, Orlando, USA, 2001.
[5] B. Dumas, D. Lalanne, and S. Oviatt. Multimodal Interfaces:
A Survey of Principles, Models and Frameworks. Human
Machine Interaction: Research Results of the MMI Program,
pages 3–26, March 2009.
[6] F. Echtler, M. Huber, and G. Klinker. Hand Tracking for
Enhanced Gesture Recognition on Interactive Multi-Touch
Surfaces. Technical Report TUM-I0721, Technische
Universität München, Department of Computer Science,
November 2007.
[7] C. L. Forgy. Rete: A Fast Algorithm for the Many
Pattern/Many Object Pattern Match Problem. Artificial
Intelligence, 19(1):17–37, 1982.
[8] M. Johnston and S. Bangalore. Finite-State Methods for
Multimodal Parsing and Integration. In Proc. of ESSLLI
2001, 13th European Summer School in Logic, Language
and Information, Helsinki, Finland, August 2001.
[9] M. Johnston, P. Cohen, D. McGee, S. Oviatt, J. Pittman, and
I. Smith. Unification-Based Multimodal Integration. In Proc.
of ACL 1997, 35th Annual Meeting of the Association for
Computational Linguistics, pages 281–288, Madrid, Spain,
July 1997.
[10] W. König, R. Rädle, and H. Reiterer. Squidy: A Zoomable
Design Environment for Natural User Interfaces. In Proc. of
CHI 2009, ACM Conference on Human Factors in
Computing Systems, pages 4561–4566, Boston, USA, 2009.
[11] D. Lalanne, L. Nigay, P. Palanque, P. Robinson,
J. Vanderdonckt, and J. Ladry. Fusion Engines for
Multimodal Input: A Survey. In Proc. of ICMI-MLMI 2009,
International Conference on Multimodal Interfaces, pages
153–160, Cambridge, Massachusetts, USA, September 2009.
[12] S. Oviatt. Advances in Robust Multimodal Interface Design.
IEEE Computer Graphics and Applications, 23(5):62–68,
September 2003.
[13] S. Oviatt. Multimodal Interfaces. In The Human-Computer
Interaction Handbook: Fundamentals, Evolving
Technologies and Emerging Applications, Second Edition,
pages 286–304. Lawrence Erlbaum Associates, 2007.
[14] E. Petajan, B. Bischoff, D. Bodoff, and N. Brooke. An
Improved Automatic Lipreading System to Enhance Speech
Recognition. In Proc. of CHI 1988, ACM Conference on
Human Factors in Computing Systems, pages 19–25,
Washington, USA, June 1988.
[15] C. Scholliers, L. Hoste, B. Signer, and W. D. Meuter. Midas:
A Declarative Multi-Touch Interaction Framework. In Proc.
of TEI 2011, 5th International Conference on Tangible,
Embedded and Embodied Interaction, pages 49–56, Funchal,
Portugal, January 2011.
[16] M. Serrano, L. Nigay, J. Lawson, A. Ramsay,
R. Murray-Smith, and S. Denef. The OpenInterface
Framework: A Tool for Multimodal Interaction. In Proc. of
CHI 2008, ACM Conference on Human Factors in
Computing Systems, Florence, Italy, April 2008.
[17] R. Sharma, V. Pavlovic, and T. Huang. Toward Multimodal
Human-Computer Interface. Proceedings of the IEEE,
86(5):853–869, 1998.
[18] T. Sowa, M. Fröhlich, and M. Latoschik. Temporal Symbolic
Integration Applied to a Multimodal System Using Gestures
and Speech. In Proc. of GW 1999, International Gesture
Workshop on Gesture-Based Communication in
Human-Computer Interaction, pages 291–302,
Gif-sur-Yvette, France, March 1999.
[19] M. Vo and C. Wood. Building an Application Framework for
Speech and Pen Input Integration in Multimodal Learning
Interfaces. In Proc. of ICASSP 1996, IEEE International
Conference on Acoustics, Speech, and Signal Processing,
pages 3545–3548, Atlanta, USA, May 1996.
[20] G. Welch and G. Bishop. An Introduction to the Kalman
Filter. Technical Report TR 95-041, Department of
Computer Science, University of North Carolina at Chapel
Hill, 2000.
[21] L. Wu, S. Oviatt, and P. Cohen. From Members to Teams to
Committee - A Robust Approach to Gestural and
Multimodal Recognition. IEEE Transactions on Neural
Networks, 13(4):972–982, 2002.