Artificial Intelligence 97 (1997) 1-5

Editorial

The relevance of relevance 1

Devika Subramanian a,*, Russell Greiner b,2, Judea Pearl c

a Department of Computer Science, Rice University, Houston, TX 77005, USA
b Siemens Corporate Research, Inc., 755 College Road East, Princeton, NJ 08540-6632, USA
c Department of Computer Science, University of California, Los Angeles, CA, USA

* Corresponding author.
1 The editors thank AAAI for allowing us to organize the very successful Fall Symposium, which showed the community's interest in the topic and prompted us to organize this special issue. We also thank the dozens of reviewers who helped us to maintain high standards in producing this issue.
2 Current address: Department of Computing Science, 615 General Service Bldg., University of Alberta, Edmonton, Alberta, Canada T6G 2H1.

1. Introduction

With too little information, reasoning and learning systems cannot work effectively. Too much information can also cause the performance of these systems to degrade, in both accuracy and efficiency. It is therefore important to determine what information must be preserved, or, more generally, what information is "relevant".

Reasoning about what is and is not relevant to the goals at hand is ubiquitous in humans. Much of this meta-theoretic reasoning occurs outside the scope of our introspection; it is believed that over 90% of the neuronal connections in our brains are inhibitory and serve to discard or suppress inputs received through our senses. Moreover, when we engage in deliberate problem solving (e.g., writing this introduction, planning a trip, or learning a buyer's profile from credit data), we explicitly reason about the relevance of entities in our conceptualization of the problem. In addition, in the context of computing within a problem formulation, we reason about the relevance of steps in that computation; e.g., which move to explore in game tree search or which variable to ignore in data analysis. The object of relevance reasoning is to simplify problem formulations and to eliminate wasted effort in computation. Relevance reasoning is crucial for agents like us with limited memory and computational resources; the benefit obtained from being able to derive appropriate actions within given resource limits far outweighs the expense of such reasoning.

This special issue, motivated by the very successful AAAI Fall Symposium on this topic [1], addresses questions in automating relevance reasoning, focusing on autonomous systems with limited computational and sensory resources, embedded in dynamic environments. First, what is relevance reasoning? Is it different from other kinds of reasoning, such as default reasoning or common sense reasoning? Are there general definitions of relevance that transcend tasks? If relevance is task-specific, what is the relationship between a task and the kind of relevance reasoning appropriate for it? When is relevance reasoning necessary for autonomous systems? Can we provide a knowledge-level account of relevance reasoning independent of the syntactic details of how it is implemented in particular computational systems? What are the costs and benefits of relevance reasoning, and how do we quantify them?

A key premise underlying the study of relevance is that there are general relevance criteria that can be articulated explicitly, independent of specific tasks and formulations. A related premise is that the cost of relevance reasoning will be offset by the ability to compute good action policies within resource limits.
The 11 papers in this special issue put these premises to theoretical and empirical tests.

One general way to define the meta-theoretic notion of relevance is in terms of "perturbations": a specific "entity" (such as an action, training sample, attribute, background proposition, or inference step) is irrelevant to a task in some context if the appropriate response to the task does not change by an unacceptable amount when we change the entity in that context. Otherwise, we view that entity as (somewhat) relevant to the task. This view is explicitly stated in the paper by Galles and Pearl, which deals with causality and where a perturbation corresponds to a material change in the physical world; it is also implicit in most of the other papers appearing in this special issue. For instance, the learning papers of Greiner, Grove and Kogan; Kivinen, Warmuth and Auer; Kohavi and John; and Blum and Langley all use the fact that an attribute is irrelevant to a classification function if the class label of an instance does not change when we change the value of that attribute (a toy version of this perturbation test is sketched below). Baum and Smith define relevance of individual computational steps; in particular, they say that a leaf expansion in a game tree search computation is relevant if it has the potential to alter the choice of move. Darwiche; Lakemeyer; and Levy, Fikes and Sagiv provide similar definitions of when a perturbation to a proposition leaves the answers to a set of goal propositions unchanged in the context of some background theory.

While there is significant agreement on the general definition of relevance, there is considerable variety in how it is instantiated in the context of specific tasks or task classes. Baum and Smith demonstrate the subtleties involved in instantiating this general definition for a specific computation. A leaf expansion in a game tree search could affect the utility of the move under consideration as well as the choice of future leaf expansion decisions. The implementation of their relevance criterion forces them to move away from the traditional model of making point estimates of the value of leaf positions to using parametric value distributions that can be learned during the course of a game. They show that the distinction between important and unimportant lines of play in a game can be captured using relevance measures on distributions over evaluation scores.

The nature of the learning task necessitates different choices in the structure of criteria for determining attribute relevance. Thus, while Kivinen, Warmuth and Auer and Kohavi and John assume the same attributes are relevant over all instances, Greiner, Grove and Kogan and Baluja and Pomerleau consider tasks where different attributes are relevant for different instances, making attribute relevance a function of the specific instance rather than of the entire instance set. The task in Baluja and Pomerleau's paper is to detect lane markings in an image to help a vehicle follow a road. This task is difficult because the image is often cluttered with distracting extraneous objects (e.g., other vehicles, pedestrians, etc.) that can look superficially like the lane markings the vehicle is seeking. To avoid these distractions, the system focuses its search on the "relevant" part of the image; this subset is based on its expectation of the lane marking positions, which is in turn based on previous images in the sequence.
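To make the perturbation criterion concrete, here is the toy attribute-irrelevance test mentioned above. It is a minimal sketch of our own, not code from any of the papers in this issue; the classifier, attribute names, and value domains are invented for the example, and in practice the exhaustive enumeration would be replaced by statistical estimates, search heuristics, or analytical criteria.

```python
from itertools import product

def attribute_is_irrelevant(classify, instances, attribute, domain):
    """Perturbation test: an attribute is irrelevant if changing its value
    never changes the class label of any instance."""
    for x in instances:
        label = classify(x)
        for v in domain:
            perturbed = dict(x, **{attribute: v})   # perturb a single attribute
            if classify(perturbed) != label:
                return False                        # label moved: attribute is relevant
    return True

# Toy example: the label depends only on attribute 'a', so 'b' is judged irrelevant.
classify = lambda x: int(x["a"] == 1)
instances = [{"a": a, "b": b} for a, b in product((0, 1), repeat=2)]
print(attribute_is_irrelevant(classify, instances, "b", (0, 1)))  # True
print(attribute_is_irrelevant(classify, instances, "a", (0, 1)))  # False
```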
In contrast, the analysis by Kivinen, Warmuth and Auer does not require ever explicitly identifying which attributes are relevant; it only uses the observation that there is a set of k relevant attributes. Blum and Langley survey work in defining and using other relevance criteria in concept learning.

The difficulty in finding realizable implementations of relevance definitions in reasoning is illustrated by Galles and Pearl, Levy, Fikes and Sagiv, Darwiche, and Lakemeyer. Lakemeyer postulates semantic definitions of the relevance of one proposition to another in the context of a theory and shows that it is intractable to compute exact relevance relations even in propositional logic. This work motivates the need for approximations, as considered by Khardon and Roth, who demonstrate that reasoning can be efficient when restricted to relevant models (e.g., the vivid knowledge bases of Levesque) of the formulas under consideration. Darwiche recasts Pearl's definition of conditional independence (irrelevance) in probabilistic reasoning [4] in a logical context and provides several efficient graph-theoretic algorithms for decomposing complex logical calculations into simpler computations. Similarly, Galles and Pearl show that most aspects of causal relevance can be captured by directed graphs and that such graphs can be used as theorem provers, to test whether new relevance relations follow from, or are consistent with, a set of relevance premises.

Can a pure knowledge-level account of relevance reasoning be provided? Kohavi and John and Levy, Fikes and Sagiv provide evidence that such an account is either unlikely to be comprehensive or not very useful. Kohavi and John empirically demonstrate that the determination of attribute relevance in an induction task cannot be made independent of the induction algorithm (which therefore necessitates a symbol-level analysis). Levy, Fikes and Sagiv, building on earlier work by Subramanian [5], provide a proof-theoretic account of irrelevance in the context of efficient deductive reasoning with large Horn theories, in sharp contrast to Lakemeyer's purely semantic account.

Almost all of the papers in the special issue provide theoretical guarantees and/or experimental results on the utility of relevance reasoning. Baluja and Pomerleau demonstrate that by using relevance analysis to focus attention on lane pixels in an image, the accuracy of the choice of steering direction is improved by 20%, a significant performance improvement on the task. Baum and Smith conduct an extensive empirical evaluation of their policy of expanding relevant leaf nodes in the context of Othello. They show that their programs play a much stronger game than alpha-beta pruning and that the superiority of their approach increases rapidly with additional time resources. An interesting open empirical question is whether the relevance criterion of Baum and Smith would benefit the game of chess.

The following papers on logical reasoning demonstrate how logical inference of various forms can be made more efficient by explicit reasoning about relevance. The paper by Darwiche provides conditions under which reasoning about logical conditional independence simplifies complex computations (belief change, entailment and satisfiability) into small sets of much simpler "local" calculations. Levy, Fikes and Sagiv's paper demonstrates that reasoning with large Horn databases can be sped up significantly by the explicit elimination of irrelevant facts.
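As a rough illustration of this style of pruning (our own sketch, not the algorithm developed by Levy, Fikes and Sagiv), one simple irrelevance criterion discards every clause whose head predicate cannot be reached from the query predicate by following rule bodies; the surviving clauses suffice for answering the query. The rules and predicate names below are invented for the example.

```python
from collections import defaultdict

# Schematic Horn rules: (head predicate, [body predicates]); facts have empty bodies.
rules = [
    ("ancestor", ["parent"]),
    ("ancestor", ["parent", "ancestor"]),
    ("parent",   []),                      # stands in for the ground facts
    ("sibling",  ["parent"]),
    ("cousin",   ["sibling", "ancestor"]), # never needed for an 'ancestor' query
]

def reachable_predicates(rules, goal):
    """Predicates the goal can depend on; clauses whose heads are not
    reachable are irrelevant to the query and can be dropped before inference."""
    body_of = defaultdict(set)
    for head, body in rules:
        body_of[head].update(body)
    seen, frontier = set(), [goal]
    while frontier:
        p = frontier.pop()
        if p not in seen:
            seen.add(p)
            frontier.extend(body_of[p])
    return seen

keep = reachable_predicates(rules, "ancestor")
pruned_theory = [(h, b) for h, b in rules if h in keep]
print(sorted(keep))        # ['ancestor', 'parent']
print(len(pruned_theory))  # 3: the 'sibling' and 'cousin' clauses are eliminated
```

The paper itself develops considerably finer-grained criteria and characterizes when such pruning is guaranteed to preserve answers; the point of the sketch is only that the pruning can be cheap relative to the inference it saves.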
Khardon and Roth take a very different approach, showing that models which summarize the relevant information in a set of formulas can be the basis of very efficient schemes for logical as well as default reasoning.

The papers on learning also demonstrate that "less is better", i.e., they show that learning algorithms produce more accurate classifiers if trained on only the subset of "relevant" training examples, or if given only the values of the "relevant" attributes of the training examples. The paper by Blum and Langley uses insights from the machine learning and computational learning theory communities to summarize results, both characterizing the "state of the art" of implemented systems (e.g., dividing them into "embedded" versus "filter" versus "wrapper" approaches, contrasting feature selection with feature weighting schemes, and considering the selection of relevant examples as well as relevant attributes) and providing useful theoretical explanations. Kohavi and John explain why a learner, trying to produce an accurate classifier from a limited pool of training examples, can do better if it uses only the subset of relevant attributes. They then present an extensive empirical comparison between their approach, which involves "wrapping" the basic learner, as a black box, within an algorithm that searches for the optimal subset of features, and a "filtering" algorithm, which uses other, learner-independent principles to select the subset of attributes. Greiner, Grove and Kogan present a learning model where a helpful teacher, who uses the target decision tree to label instances, also tells the learner which attributes in the target tree were used to classify each instance. They show that this relevance information can be extremely useful: e.g., while there is no known algorithm that can "PAC-learn" arbitrary decision trees in the standard model (where the learner sees the complete set of attribute values for each instance) [6], this learning task becomes trivial if the learner is told exactly which features are relevant for each instance. Finally, Littlestone earlier presented a very clever algorithm for learning linear separators, called WINNOW, which has the intriguing property that its mistake bound scales linearly with the number of relevant attributes but only logarithmically with the total number of attributes [2]; by contrast, the typical PERCEPTRON algorithm scales linearly with the total number of attributes [3] (a schematic version of WINNOW's update is sketched just before the references). As most of these results were only upper bounds, it was not clear whether there really was a difference here. The paper by Kivinen, Warmuth and Auer answers this question by showing that an adversary can force PERCEPTRON to make exponentially more mistakes (in the context of on-line learning) than WINNOW.

We hope these papers awaken interest in relevance reasoning and inspire new answers to the open questions posed earlier in this introduction.
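For readers unfamiliar with WINNOW, here is the schematic sketch mentioned above. It is our own illustration in the spirit of Littlestone's algorithm [2]; the threshold, update factor, and toy data are illustrative choices, not the settings analyzed in [2] or in the paper by Kivinen, Warmuth and Auer. The multiplicative update on the weights of active attributes is what yields the mild, logarithmic dependence on the total number of attributes.

```python
import random

def winnow(stream, n, threshold=None, alpha=2.0):
    """On-line WINNOW-style learner over Boolean attribute vectors of length n.
    Predict 1 when the weighted sum reaches the threshold; on a mistake,
    multiplicatively promote or demote the weights of the active attributes."""
    theta = threshold if threshold is not None else float(n)
    w = [1.0] * n                              # one weight per attribute
    mistakes = 0
    for x, y in stream:                        # x: 0/1 vector, y: true label
        yhat = int(sum(wi * xi for wi, xi in zip(w, x)) >= theta)
        if yhat != y:
            mistakes += 1
            factor = alpha if y == 1 else 1.0 / alpha
            w = [wi * factor if xi else wi for wi, xi in zip(w, x)]
    return w, mistakes

# Toy target: a disjunction of k = 2 relevant attributes out of n = 100.
random.seed(0)
n, relevant = 100, (3, 17)
target = lambda x: int(any(x[i] for i in relevant))
stream = [([random.randint(0, 1) for _ in range(n)],) for _ in range(2000)]
stream = [(x, target(x)) for (x,) in stream]
w, mistakes = winnow(stream, n)
print(mistakes)   # small relative to the 2000 trials; the bound grows with k and only log n
```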
References

[1] R. Greiner and D. Subramanian, Proceedings of the AAAI Fall Symposium on "Relevance", Tech. Rept. FS-94-02 (AAAI, Menlo Park, CA, 1995).
[2] N. Littlestone, Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm, Machine Learning 2 (1988) 285-318.
[3] M. Minsky and S. Papert, Perceptrons: An Introduction to Computational Geometry (MIT Press, Cambridge, MA, expanded ed., 1988).
[4] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (Morgan Kaufmann, San Mateo, CA, 1988).
[5] D. Subramanian, Shift of vocabulary bias in speedup learning, Machine Learning 20 (1995) 155-191.
[6] L.G. Valiant, A theory of the learnable, Comm. ACM 27 (11) (1984) 1134-1142.