August 19, 2021

The Future of Human-AI Collaboration

Use case
Decision Augmentation


Recent technological advances, especially in the field of deep learning, have brought astonishing progress on the road towards artificial general intelligence (AGI) (Goertzel and Pennachin 2007; Kurzweil 2010). AI is progressively achieving (super-)human-level performance in various tasks, such as autonomous driving, cancer detection, or playing complex games (Mnih et al. 2015; Silver et al. 2016). As a result, more and more business applications based on AI technologies are emerging. Both research and practice are wondering when AI will be capable of solving complex tasks in real-world business applications, beyond laboratory settings in research.

However, those advances provide a rather one-sided picture of AI, ignoring the fact that although AI can solve certain tasks with quite impressive performance, AGI is far from being achieved. There are still many problems that machines cannot yet solve alone (Kamar 2016), such as applying expertise to decision-making, planning, or creative tasks, to name just a few. ML systems in the wild have major difficulties adapting to dynamic environments and self-adjusting (Müller-Schloer and Tomforde 2017), and they lack what humans call common sense. This makes them highly vulnerable to adversarial examples (Kurakin et al. 2016). Moreover, AI needs massive amounts of training data compared to humans, who can learn from only a few examples (Lake et al. 2017), and it fails to work with certain data types (e.g. soft data). Furthermore, a lack of control over the learning process might lead to unintended consequences (e.g. racial bias) and limit interpretability, which is crucial for critical domains such as medicine (Doshi-Velez and Kim 2017). Therefore, humans are still required at various positions in the loop of the ML process. While a lot of work has been done on creating training sets with human labellers, more recent research points towards end-user involvement (Amershi et al. 2014) and the teaching of such machines (Mnih et al. 2015), thus combining humans and machines in hybrid intelligence systems.

The main idea of hybrid intelligence systems is thus that socio-technical ensembles and their human and AI parts can co-evolve to improve over time. The purpose of this paper is to point towards such hybrid intelligence systems. To this end, I aim to conceptualize the idea of hybrid intelligence systems and provide an initial taxonomy of design knowledge for developing such socio-technical ensembles. Following a taxonomy development method (Nickerson et al. 2013), I reviewed literature from various interdisciplinary fields and combined those findings with an empirical examination of practical business applications in the context of hybrid intelligence.

The contribution of this paper is threefold. First, I provide a structured overview of interdisciplinary research on the role of humans in the ML pipeline. Second, I offer an initial conceptualization of the term hybrid intelligence systems and relevant dimensions for system design. Third, I intend to provide useful guidance for system developers during the implementation of hybrid intelligence systems in real-world applications. Towards this end, I propose an initial taxonomy of hybrid intelligence systems. 

ML and AI 

The subfield of intelligence that relates to machines is called AI. With this term I mean systems that perform “[…] activities that we associate with human thinking, activities such as decision-making, problem solving, learning […]” (Bellman 1978). Although various definitions of AI exist, the term generally covers the creation of machines that can accomplish complex goals. This includes facets such as natural language processing, perceiving objects, storing knowledge and applying it to solve problems, and ML to adapt to new circumstances and act in its environment (Russell and Norvig 2016).

A subset of the techniques required to achieve AI is machine learning (ML). Mitchell (1997) defines it the following way: “[…] A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E […].” A popular approach that drives current progress in both paradigms is deep learning (Kurakin et al. 2016). Deep learning is a representation learning method that includes multiple levels of representation, obtained by combining simple but non-linear models. Each of those models transforms the representation at one level (starting with the input data) into a representation at a more abstract level (LeCun et al. 2015). Deep learning is thus a special ML technique. Finally, human-in-the-loop learning describes ML approaches (both deep and other) that use the human in some part of the pipeline. Such approaches contrast with research on most knowledge-based systems in IS, which use rather static knowledge repositories. I will focus on human-in-the-loop learning in the following section.
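
Mitchell's definition can be made concrete with a toy sketch. Everything in it is an illustrative assumption rather than part of the original paper: the task T is classifying an integer as "below 5" or "at least 5", the experience E is a growing list of labelled examples, the performance measure P is accuracy on a fixed test set, and the learner is a plain 1-nearest-neighbour rule.

```python
# Toy illustration of Mitchell's E/T/P definition (all names are hypothetical).
# T: classify an integer x as 1 if x >= 5, else 0.
# E: a growing list of labelled examples.
# P: accuracy on a fixed test set.

def predict(examples, x):
    """1-nearest-neighbour: copy the label of the closest labelled example."""
    nearest = min(examples, key=lambda example: abs(example[0] - x))
    return nearest[1]

def accuracy(examples, test_set):
    """Performance measure P: share of test points predicted correctly."""
    correct = sum(predict(examples, x) == y for x, y in test_set)
    return correct / len(test_set)

experience = [(0, 0), (6, 1), (4, 0)]  # labelled examples, in arrival order
test_set = [(x, int(x >= 5)) for x in range(10)]

# Measure P after each additional example in E.
scores = [accuracy(experience[:n], test_set) for n in range(1, len(experience) + 1)]
print(scores)  # prints [0.5, 0.9, 1.0]
```

In this hand-picked example, P rises with every labelled example added to E, which is the sense in which the program "learns". Real data need not improve this smoothly.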

The Role of Humans-in-the-Loop of ML 

Although the terms AI and ML give the impression that humans become to some extent obsolete, the ML pipeline still requires a lot of human interaction, such as for feature engineering, parameter tuning, or training. While deep learning has decreased the effort for manual feature engineering and some automation approaches (e.g. AutoML (Feurer et al. 2015)) support human experts in tuning models, the human is still heavily in the loop for sense-making and training. For instance, unsupervised learning requires humans to make sense of clusters that are identified as patterns in data to create knowledge (Gomes et al. 2011). More obviously, human input is required to train models in supervised ML approaches, especially for creating training data, debugging models, or training algorithms such as in reinforcement learning (Mnih et al. 2015). This is especially relevant when real-life and ML problem formulations diverge. This is, for instance, the case when static (offline) training datasets are not perfectly representative of realistic and dynamic environments (Kleinberg et al. 2017). Moreover, human input is crucial when models need to learn from human preferences (e.g. recommender systems) and adapt to users, or when security concerns require both control and interpretability of the learning process and the output (Doshi-Velez and Kim 2017). Therefore, more recent research has focused on interactive forms of ML (Holzinger 2016) and machine teaching (Simard et al. 2017). Those approaches make active use of human input (Settles 2014) and thus learn from human intelligence. This allows machines to learn tasks that they cannot yet achieve alone (Kamar 2016), adapt to environmental dynamics, and deal with unknown situations (Attenberg et al. 2015).

Hybrid Intelligence 

Rather than using the human only in certain parts and at certain times during the process of creating ML models, applications that can deal with real-world problems require a continuously collaborating socio-technological ensemble integrating humans and machines, which is in contrast to previous research on decision support and expert systems (Holzinger 2016).

Therefore, I argue that the most likely paradigm for the division of labor between humans and machines in the next years, or probably decades, is hybrid intelligence. This concept aims at using the complementary strengths of human intelligence and AI to behave more intelligently than either of the two could in separation (Kamar 2016). The basic rationale is to combine the complementary strengths of heterogeneous intelligences (i.e., human and artificial agents) into a socio-technological ensemble.

I envision hybrid intelligence systems, which are defined as systems that can accomplish complex goals by combining human and artificial intelligence to collectively achieve results superior to those each of them could have achieved in separation, and that continuously improve by learning from each other.

The idea of hybrid intelligence systems is thus that socio-technical ensembles and their human and AI parts can co-evolve to improve over time. The central questions are, therefore, which design decisions should be made when implementing such hybrid systems, and how.

Taxonomy Development Method 

For developing my proposed taxonomy, I followed the methodological procedures of Nickerson et al. (2013). In general, a taxonomy is defined as a “fundamental mechanism for organizing knowledge” and the term is considered a synonym for “classification” and “typology” (Nickerson et al. 2013). The method follows an iterative process consisting of the following steps:

  1. defining a meta-characteristic;
  2. determining stopping conditions;
  3. selecting an empirical-to-conceptual or conceptual-to-empirical approach;
  4. iteratively following this approach, until the stopping conditions are met. 

The process of taxonomy development starts with defining a set of meta-characteristics. This step limits the odds of “naive empiricism”, where many characteristics are defined in search of random patterns, and reflects the expected application of the taxonomy (Nickerson et al. 2013). For this purpose, I define the meta-characteristics as generic design dimensions that are required for developing hybrid intelligence systems. Based on my classification from the literature, I chose four dimensions: task characteristics, learning paradigm, human-AI interaction, and AI-human interaction. In the second step, I selected both objective and subjective conditions to conclude the iterative process.

The following subjective conditions were considered: conciseness, robustness, comprehensiveness, extensibility, explanatory, and information availability. I included no unnecessary dimension or characteristic (conciseness), while there are enough dimensions and characteristics to differentiate (robustness). At this point, all design decisions can be classified in the taxonomy (comprehensiveness), while still allowing for new dimensions and characteristics to be subsequently added (extensibility). Furthermore, the information is valuable for guiding hybrid intelligence systems design decisions (explanatory) and is typically available or easily interpretable (information availability).

I have conducted a total of three iterations so far. The first iteration used a conceptual-to-empirical approach, in which I used extant theoretical knowledge from the literature in various fields such as computer science, HCI, information systems, and neuroscience to guide the initial dimensions and characteristics of the taxonomy.

Based on the identified dimensions of hybrid intelligence systems, I sampled seven real-world applications that make use of human and AI combinations. The second iteration used the empirical-to-conceptual approach, which focuses on creating characteristics and dimensions based on the identification of common characteristics from a sample of AI applications in practice. The third iteration then used the conceptual-to-empirical approach, based on an extended literature review including newly identified search terms.

Data Sources and Sample 

Literature Review

For conducting my literature review, I followed the methodological procedures of Webster and Watson (2002) and vom Brocke et al. (2009). The literature search was conducted from April to June 2018. A prior informal literature search revealed keywords for the database searches, resulting in the search string (“hybrid intelligence” OR “human-in-the-loop” OR “interactive machine learning” OR “machine teaching” OR “machine learning AND crowdsourcing” OR “human supervision” OR “human understandable machine learning” OR “human concept learning”).

During this initial phase I decided to exclude research on knowledge-based systems such as expert systems or DSSs in IS (Kayande et al. 2009; Gregor 2001), as the studies either do not focus on the continuous learning of the knowledge repository or do not use ML techniques at all. Moreover, the purpose of this study is to identify and classify relevant (socio-) technical design knowledge for hybrid intelligence systems, which is also not included in those studies. 

The database search was constrained to title, abstract, keywords and not limited to a certain publication. Databases include AISeL, IEEE Xplore, ACM DL, AAAI DL, and arXiv to identify relevant interdisciplinary literature from the fields of IS, HCI, bio-informatics, and computer science. The search resulted in a total of 2505 hits. Titles, abstracts, and keywords were screened for potential fit to the purpose of my study. Screening was conducted by three researchers independently and resulted in 85 articles that were reviewed in detail so far. A backward and forward search ensured the extensiveness of my results. 

Empirical Cases for Taxonomy Development

To extend my findings from the literature and provide empirical evidence from recent (business) applications of hybrid intelligence systems, I included an initial set of seven empirical applications that were analysed to enhance my taxonomy.

Design Knowledge on Hybrid Intelligence Systems 

My taxonomy of hybrid intelligence systems is organized along the four meta-dimensions task characteristics, learning paradigm, human-AI interaction, and AI-human interaction. Moreover, I identified 16 sub-dimensions and a total of 50 categories for the proposed taxonomy. For organizing the dimensions of the taxonomy, I followed a hierarchical approach following the sequence of the design decisions that are necessary to develop such systems. The design knowledge is displayed in Figures 32-35.

Task Characteristics

The goal of hybrid intelligence is to create superior results through collaboration between humans and machines. The central component that drives design decisions for hybrid intelligence systems is the task that humans and machines solve collaboratively. Task characteristics focus on how the task itself is carried out (Reynolds and Miller 2003). In the context of hybrid intelligence systems, I identify the following four important task characteristics.

Type of Task: The task to be solved is the first dimension that must be defined for developing hybrid intelligence systems. In this context, I identified four generic categories of tasks: recognition, prediction, reasoning, and action. First, recognition covers tasks that recognize, for instance, objects (LeCun et al. 2015), images, or natural language (Hinton et al. 2012). On an application level, such tasks are used for autonomous driving or smart assistants such as Alexa, Siri, or Duplex. Second, prediction tasks aim at predicting future events, such as stock prices or market dynamics, based on previous data (Choudhary et al. 2018). The third type of task, reasoning, focuses on understanding data by, for instance, inductively building (mental) models of a certain phenomenon, which makes it possible to solve complex problems with small amounts of data (Lake et al. 2017). Finally, action tasks are those that require an agent (human or machine) to conduct a certain kind of action (Mao et al. 2016).

Goals: The two involved agents, the human and the AI, may have a common goal, such as solving a problem through the combination of the knowledge and abilities of both. An example of such common goals is research on recommender systems (e.g. Netflix (Gomez-Uribe and Hunt 2016)), which learn a user’s decision model to offer suggestions. In other contexts, the agents’ goals may also be adversarial, for instance in settings where AIs try to beat humans in games, such as IBM’s Watson in the game of Jeopardy! (Ferrucci et al. 2010). In many other cases, the goals of the human and the AI may also be independent, for example when humans train image classifiers without being involved in the end solution.

Shared Data Representation: The shared data representation is the data that is shown to both the human and the machine before they execute their tasks. The data can be represented at different levels of granularity and abstraction to create a shared understanding between humans and machines (Feldman 2003; Simard et al. 2017). Features describe phenomena along different dimensions, like the height and weight of a human being. Instances are examples of a phenomenon that are specified by features. Concepts, on the other hand, are multiple instances that belong to one common theme, e.g. pictures of different humans. Finally, schemas illustrate relations between different concepts (Gentner and Smith 2012).

Timing in ML Pipeline: The last sub-dimension describes the point in the ML pipeline at which hybrid intelligence is applied. For this dimension I identified three characteristics: feature engineering, parameter tuning, and training. First, feature engineering allows the integration of domain knowledge in ML models. While more recent advances make it possible to learn features fully automatically (i.e. machine only) through deep learning, human input can be combined with this for creating and enlarging features, such as in the case of artist identification on images and quality classification of Wikipedia articles (Cheng and Bernstein 2015). Second, parameter tuning is applied to optimize models. Here, ML experts typically use their deep understanding of statistical models to tune hyper-parameters or select models. Such human-only parameter tuning can be augmented with approaches such as AutoML (Feurer et al. 2015) or neural architecture search (Real et al. 2018; Real et al. 2017), which automate the design of ML models, thus making it much more accessible for non-experts. Finally, human input is crucial for training ML models in many domains. For instance, large datasets such as ImageNet or the lung cancer dataset LUNA16 rely on human annotations. Moreover, recommender systems heavily rely on input from human usage behavior to adapt to specific preferences (e.g. Amershi et al. 2014), and robotic applications are trained by human examples (Mao et al. 2016).
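
The four task-characteristic sub-dimensions and their categories can be written down as a small data structure, which makes it easy to check whether a given system description stays within the taxonomy. The sketch below is only one possible encoding: the dictionary layout, the `validate` helper, and the example system are hypothetical and not part of the original paper, and the short labels for the goal and data-representation categories are my own shorthand for the categories named in the text.

```python
# Sub-dimensions and categories of "task characteristics" as named in the text.
# The dictionary layout and the validate() helper are illustrative assumptions.
TASK_CHARACTERISTICS = {
    "type_of_task": {"recognition", "prediction", "reasoning", "action"},
    "goals": {"common", "adversarial", "independent"},
    "shared_data_representation": {"features", "instances", "concepts", "schemas"},
    "timing_in_ml_pipeline": {"feature engineering", "parameter tuning", "training"},
}

def validate(system):
    """Return (sub_dimension, value) pairs that this encoding does not know."""
    unknown = []
    for sub_dimension, values in system.items():
        allowed = TASK_CHARACTERISTICS.get(sub_dimension, set())
        unknown.extend((sub_dimension, v) for v in values if v not in allowed)
    return unknown

# Hypothetical system: crowd workers label images for a classifier.
image_labelling = {
    "type_of_task": ["recognition"],
    "goals": ["independent"],
    "shared_data_representation": ["instances"],
    "timing_in_ml_pipeline": ["training"],
}
problems = validate(image_labelling)
print(problems)  # an empty list means every entry is a known category
```

Each sub-dimension accepts a list of categories here, on the assumption that a system may combine several of them (for example, human input during both feature engineering and training).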

Learning Paradigm 

Augmentation: In general, hybrid intelligence systems allow three different forms of augmentation: human, machine, and hybrid augmentation. The augmentation of human intelligence typically focuses on applications that enable humans to solve tasks through the predictions of an algorithm, such as in financial forecasting or solving complex problems (Doroudi et al. 2016). In contrast, most research in the field of ML focuses on leveraging human input for training to augment machines in solving tasks that they cannot yet solve alone (Kamar 2016). Finally, more recent work has identified the great potential of augmenting both simultaneously through hybrid augmentation (Milli et al. 2017; Carter and Nielsen 2017). One example is AlphaGo, which started by learning from human game moves (i.e. machine augmentation) and finally offered hybrid augmentation by inventing creative moves that taught even mature players novel strategies (Baker et al. 2009; Silver et al. 2016).

Machine Learning Paradigm: The ML paradigm that is applied in hybrid intelligence systems can be categorized into four relevant subfields: supervised, unsupervised, semi-supervised, and reinforcement learning (Murphy 2012). In supervised learning, the goal is to learn a function that maps the input data x to a certain output data y, given a labelled set of input-output pairs. In unsupervised learning, such an output y does not exist, and the learner tries to identify patterns in the input data x (Mitchell et al. 1997). Further forms of learning, such as reinforcement learning or semi-supervised learning, can be subsumed under those two paradigms. Semi-supervised learning describes a combination of both paradigms, which uses both a small set of labelled and a large set of unlabelled data to solve a certain task (Zhu 2006). Finally, in reinforcement learning, an agent interacts with an environment and thereby learns to solve a problem by receiving rewards and punishments for its actions (Mnih et al. 2015; Silver et al. 2016).

Human Learning Paradigm: Humans have a mental model of their environment, which gets updated through events. This update is done by finding an explanation for the event (Carter and Nielsen 2017; Milli et al. 2017; Lake et al. 2017). Human learning can therefore be achieved from experience and comparison with previous experiences (Kim et al. 2014; Gentner and Smith 2012) and from description and explanations (Hogarth 2011). 

Human-AI Interaction 

Machine Teaching: This sub-dimension defines how humans provide input. First, humans can demonstrate actions that the machine learns to imitate (Mao et al. 2016). Second, humans can annotate data for training a model, for instance through crowdsourcing (Snow et al. 2008; Raykar et al. 2010). I designate this as labelling. Third, human intelligence can be used to actively identify a misspecification of the learner and debug the model, which I define as troubleshooting (Nushi et al. 2017; Attenberg et al. 2015). Moreover, human teaching can take the form of verification, whereby humans verify or falsify machine output (Pei et al. 2017).

Teaching Interaction: The input provided through human teaching can be both explicit and implicit. While explicit teaching leverages active input from the user, for instance in labelling tasks such as image or text annotation (Li et al. 2017), implicit teaching learns from observing the actions of the user and thus adapts to their demands. For instance, Microsoft uses contextual bandit algorithms to suggest content to users, using the actions of the user as implicit teaching interaction.
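
The difference to explicit labelling can be sketched in a few lines: the learner below never asks the user for a label and only observes whether a shown item was clicked. This is a deliberately simplified stand-in, not a contextual bandit and not the system mentioned above. The class name, the smoothing rule, and the interaction log are all hypothetical.

```python
from collections import defaultdict

class ImplicitPreferenceLearner:
    """Learns item preferences from observed clicks (implicit teaching)."""

    def __init__(self):
        self.shown = defaultdict(int)    # how often each item was displayed
        self.clicked = defaultdict(int)  # how often it was clicked

    def observe(self, item, was_clicked):
        """Record one user action; the user is never asked for a label."""
        self.shown[item] += 1
        if was_clicked:
            self.clicked[item] += 1

    def click_rate(self, item):
        """Smoothed click-rate estimate; unseen items start at 0.5."""
        return (self.clicked[item] + 1) / (self.shown[item] + 2)

    def suggest(self, candidates):
        """Suggest the candidate with the highest estimated click rate."""
        return max(candidates, key=self.click_rate)

# Hypothetical interaction log: (item shown, did the user click it?)
log = [("news", False), ("sports", True), ("sports", True),
       ("news", True), ("music", False)]

learner = ImplicitPreferenceLearner()
for item, was_clicked in log:
    learner.observe(item, was_clicked)

suggestion = learner.suggest(["news", "sports", "music"])
print(suggestion)  # prints sports
```

A bandit algorithm would additionally balance exploring rarely shown items against exploiting the current best estimate, and a contextual one would condition on features of the user or situation.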

Expertise Requirements: Hybrid intelligence systems can have certain requirements for the expertise of the humans that provide input for the systems. By now, most research and practical applications focus on human input from ML experts (Chakarov et al. 2016; Kulesza et al. 2010; Patel et al. 2010), thus requiring deep expertise in the field of AI. Moreover, end users can provide the system with input for product recommendations and e-commerce, or input can come from human non-experts accessed through crowd work platforms (Chang et al. 2017; Chang et al. 2016; Nushi et al. 2017). More recent endeavours, however, focus on the integration of domain experts in hybrid intelligence architectures that leverage their profound understanding of the semantics of a problem domain to teach a machine, while not requiring any ML expertise (Simard et al. 2017).

Amount of Human Input: The amount of human input can vary between input from individual humans and aggregated input from several humans. Individual human input is, for instance, applied in recommender systems for individualization or for reasons of cost efficiency (Li et al. 2017). On the other hand, collective human input combines the input of several individual humans by leveraging mechanisms of human computation (Quinn and Bederson 2011). This approach makes it possible to reduce the errors and biases of individual humans and to aggregate heterogeneous knowledge (Cheng and Bernstein 2015; Cheng et al. 2015; Zou et al. 2015).

Aggregation: When human input is aggregated from a collective of individual humans, different aggregation mechanisms can be leveraged to maximize the quality of teaching. First, unweighted methods can be used that rely on averaging or majority voting to aggregate results (Li et al. 2017). Additionally, aggregation can be achieved by modelling the context of teaching through algorithmic approaches such as expectation maximization, graphical models, entropy minimization, or discriminative training. In that case, the aggregation can be human dependent, focusing on the characteristics of an individual human (Kamar et al. 2012; Dawid and Skene 1979), or human-task dependent, adjusting to the teaching task (Kosinski et al. 2014; Whitehill et al. 2009).
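
The contrast between unweighted and human-dependent aggregation can be illustrated with a minimal sketch. It is not any of the cited methods: the per-worker weights are simply given as fixed numbers here, whereas approaches such as expectation maximization would estimate worker reliability from the data. Worker names, labels, and weights are made up for illustration.

```python
from collections import Counter, defaultdict

def majority_vote(labels_by_worker):
    """Unweighted aggregation: every worker's label counts the same."""
    return Counter(labels_by_worker.values()).most_common(1)[0][0]

def weighted_vote(labels_by_worker, worker_weight):
    """Human-dependent aggregation: votes are scaled by a per-worker weight."""
    scores = defaultdict(float)
    for worker, label in labels_by_worker.items():
        # Workers without a known weight fall back to 1.0 (an assumption).
        scores[label] += worker_weight.get(worker, 1.0)
    return max(scores, key=scores.get)

# Hypothetical labels from three crowd workers for one image.
labels = {"ann": "cat", "bob": "dog", "cy": "dog"}
# Hypothetical reliability weights, e.g. accuracy on earlier gold questions.
weights = {"ann": 0.95, "bob": 0.4, "cy": 0.4}

unweighted = majority_vote(labels)
weighted = weighted_vote(labels, weights)
print(unweighted, weighted)  # prints dog cat
```

The two results differ because the single reliable worker outweighs the two less reliable ones (0.95 versus 0.4 + 0.4), which is the kind of correction human-dependent aggregation is meant to provide.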

Incentives: Humans need to be incentivized to provide input in hybrid intelligence systems. Incentives can be monetary rewards, such as in the case of crowd work on platforms (e.g. Amazon Mechanical Turk), or intrinsic rewards such as intellectual exchange in citizen science (Segal et al. 2018), fun in games with a purpose (Ahn and Dabbish 2008), or learning (Vaughan 2017). Another incentive for human input is customization, which makes it possible to increase individualized service quality for users who provide a higher amount of input to the learner (Bernardo et al. 2017; Amershi et al. 2014).

AI-Human Interaction 

This dimension describes the machine part of the interaction, the AI-human interaction. First, I describe which query strategy the algorithm uses to learn. Second, I describe the feedback of the machine to humans. Third, I give a short explanation of interpretability to show its influence on hybrid intelligence.

Query Strategy: Offline query strategies require the human to finish her task completely before her actions are applied as input to the AI (Lin et al. 2014; Sheng et al. 2008). In a typical labelling task, the human would first need to go through all the data and label each instance. Afterwards, the labelled data is fed to an ML algorithm to train a model. In contrast, online query strategies let the human complete subtasks whose results are directly fed to an algorithm, so that teaching and learning can be executed almost simultaneously (Chang et al. 2017; Nushi et al. 2017; Kamar et al. 2012). Another possibility is the use of active learning query strategies (Zhao et al. 2014; Settles 2014). In this case, the machine queries the human when it requires more input to give an accurate prediction.
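
One common way to decide which instances to send to the human in active learning is uncertainty sampling: the machine asks about the items whose predicted probability is closest to 0.5. The sketch below assumes a binary classifier whose predictions are already available as a dictionary; the item names, probabilities, and the `select_queries` helper are hypothetical.

```python
def select_queries(probabilities, budget):
    """Return up to `budget` item ids, most uncertain prediction first.

    `probabilities` maps an item id to the model's predicted probability
    of the positive class; values near 0.5 are the least certain.
    """
    ranked = sorted(probabilities, key=lambda item: abs(probabilities[item] - 0.5))
    return ranked[:budget]

# Hypothetical model output for four unlabelled images.
predicted = {"img_a": 0.97, "img_b": 0.52, "img_c": 0.10, "img_d": 0.35}

to_label = select_queries(predicted, budget=2)
print(to_label)  # prints ['img_b', 'img_d']
```

In an offline strategy the human would label all four images before training. Here only the two least certain ones are queried, and the model would be retrained before the next round of queries.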

Machine Feedback: Four categories describe the feedback that humans receive from the machine. First, humans can get direct suggestions from the machine, which makes explicit recommendations to the user on how to act. For instance, recommender systems such as Netflix or Spotify provide such suggestions for users. Furthermore, systems can make suggestions for describing images (Nushi et al. 2017). Second, predictions as machine feedback can support humans, e.g. to detect lies (Cheng and Bernstein 2015), predict worker behaviours (Kamar et al. 2012), or classify images. This form of feedback provides a probabilistic value for a certain outcome (e.g. the probability of some data x belonging to a certain class y). The third form of machine feedback is clustering data. Here, machines compare data points and put them in an order, for instance to prioritize items (Kou et al. 2014), or organize data according to identified patterns. Finally, another possibility of machine feedback is optimization. Machines support humans, for instance, in making more consistent decisions by optimizing their strategy (Chirkin and Koenig 2016).

Interpretability: For AI-human interaction in hybrid intelligence systems, interpretability is crucial to prevent biases (e.g. racism), achieve reliability and robustness, ensure causality of the learning, debug the learner if necessary, and create trust, especially in the context of AI safety (Doshi-Velez and Kim 2017). Interpretability in hybrid intelligence systems can be achieved through algorithm transparency, which opens the black box of the algorithm itself; global model interpretability, which focuses on the general interpretability of an ML model; and local prediction interpretability, which tries to make more complex models interpretable for a single prediction (Lipton 2018).


My proposed taxonomy for hybrid intelligence systems extracts interdisciplinary knowledge on human-in-the-loop mechanisms in ML and proposes initial descriptive design knowledge for the development of such systems that might guide developers. My findings reveal the manifold applications, mechanisms, and benefits of hybrid systems, which are likely to become of increasing interest in real-world applications in the future. My taxonomy of design knowledge offers insights into how to leverage the advantages of combining human and machine intelligence.

For instance, this makes it possible to integrate deep domain insights into ML algorithms, continuously adapt a learner to dynamic problems, and enhance trust through interpretability and human control. Vice versa, this approach offers the advantage of improving humans in solving problems by offering feedback on how the task was conducted or on the performance of a human during that task, as well as machine feedback to augment human intelligence. Moreover, I assume that the design of such systems might make it possible to move beyond the sole efficiency of solving tasks towards combined socio-technical ensembles that can achieve superior results that neither humans nor machines could have achieved so far. Promising fields for such systems are medicine, science, innovation, and creativity.


Within this paper I propose a taxonomy for design knowledge for hybrid intelligence systems, which presents descriptive knowledge structured along the four meta-dimensions task characteristics, learning paradigm, human-AI interaction, and AI-human interaction. Moreover, I identified 16 sub-dimensions and a total of 50 categories for the proposed taxonomy. By following a taxonomy development methodology (Nickerson et al. 2013), I extracted interdisciplinary knowledge on human-in-the-loop approaches in ML and the interaction between human and AI. I extended those findings with an examination of seven empirical applications of hybrid intelligence systems. 

Therefore, my contribution is threefold. First, the proposed taxonomy provides a structured overview of interdisciplinary research on the role of humans in the ML pipeline by reviewing this research and extracting relevant knowledge for system design. Second, I offer an initial conceptualization of the term hybrid intelligence systems and relevant dimensions for developing applications. Third, I intend to provide useful guidance for system developers during the implementation of hybrid intelligence systems in real-world applications.

Obviously, this paper is not without limitations and provides a first step towards a comprehensive taxonomy of design knowledge on hybrid intelligence systems. First, further research should extend the scope of this research to more practical applications in various domains. So far, my empirical case selection is slightly biased towards decision problem contexts. Second, as I proceed with my research, I will further condense the identified characteristics by aggregating potentially overlapping dimensions in subsequent iterations. Third, my results are overly descriptive so far. As I proceed with my research, I will therefore focus on providing prescriptive knowledge on which characteristics to choose in a certain situation and thereby propose more specific guidance for developers of hybrid intelligence systems that combine human and machine intelligence to achieve superior goals and drive the future progress of AI. For this purpose, I will identify interdependencies between dimensions and sub-dimensions and evaluate the usefulness of my artefact for designing real-world applications. Finally, further research might focus on integrating the largely design-oriented knowledge of this study with research on knowledge-based systems in IS to discuss the findings in the context of this class of systems.

See references

The original paper was published at HICSS 2018