An Upper Bound for Random Measurement Error in Causal Discovery

Causal discovery algorithms infer causal relations from data based on several assumptions, including notably the absence of measurement error. However, this assumption is most likely violated in practical applications, which may result in erroneous, irreproducible results. In this work we show how to obtain an upper bound for the variance of random measurement error from the covariance matrix of measured variables and how to use this upper bound as a correction for constraint-based causal discovery. We demonstrate a practical application of our approach on both simulated data and real-world protein signaling data.

Constraint-based Causal Discovery for Non-Linear Structural Causal Models with Cycles and Latent Confounders

We address the problem of causal discovery from data, making use of the
recently proposed causal modeling framework of *modular structural causal
models (mSCM)* to handle cycles, latent confounders and non-linearities. We
introduce *σ-connection graphs (σ-CG)*, a new class of mixed
graphs (containing undirected, bidirected and directed edges) with additional
structure, and extend the concept of *σ-separation*, the
appropriate generalization of the well-known notion of d-separation in this
setting, to apply to σ-CGs. We prove the closedness of σ-separation
under marginalisation and conditioning and exploit this to implement a test of
σ-separation on a σ-CG. This then leads us to the first causal
discovery algorithm that can handle non-linear functional relations, latent
confounders, cyclic causal relationships, and data from different (stochastic)
perfect interventions. As a proof of concept, we show on synthetic data how
well the algorithm recovers features of the causal graph of modular structural
causal models.

From Deterministic ODEs to Dynamic Structural Causal Models

Structural Causal Models are widely used in causal modelling, but how they
relate to other modelling tools is poorly understood. In this paper we provide
a novel perspective on the relationship between Ordinary Differential Equations
and Structural Causal Models. We show how, under certain conditions, the
asymptotic behaviour of an Ordinary Differential Equation under non-constant
interventions can be modelled using Dynamic Structural Causal Models. In
contrast to earlier work, we study not only the effect of interventions on
equilibrium states; rather, we model asymptotic behaviour that is
*dynamic* under interventions that vary in time, and include as a
special case the study of static equilibria.

Algebraic Equivalence of Linear Structural Equation Models

Despite their popularity, many questions about the algebraic constraints imposed by linear structural equation models remain open problems. For causal discovery, two of these problems are especially important: the enumeration of the constraints imposed by a model, and deciding whether two graphs define the same statistical model. We show how the half-trek criterion can be used to make progress in both of these problems. We apply our theoretical results to a small-scale model selection problem, and find that taking the additional algebraic constraints into account may lead to significant improvements in model selection accuracy.

Causal Effect Inference with Deep Latent-Variable Models

Learning individual-level causal effects from observational data, such as inferring the most effective medication for a specific patient, is a problem of growing importance for policy makers. The most important aspect of inferring causal effects from observational data is the handling of confounders, factors that affect both an intervention and its outcome. A carefully designed observational study attempts to measure all important confounders. However, even if one does not have direct access to all confounders, there may exist noisy and uncertain measurement of proxies for confounders. We build on recent advances in latent variable modeling to simultaneously estimate the unknown latent space summarizing the confounders and the causal effect. Our method is based on Variational Autoencoders (VAE) which follow the causal structure of inference with proxies. We show our method is significantly more robust than existing methods, and matches the state-of-the-art on previous benchmarks focused on individual treatment effects.

Causal Consistency of Structural Equation Models

Complex systems can be modelled at various levels of detail. Ideally, causal models of the same system should be consistent with one another in the sense that they agree in their predictions of the effects of interventions. We formalise this notion of consistency in the case of Structural Equation Models (SEMs) by introducing exact transformations between SEMs. This provides a general language to consider, for instance, the different levels of description in the following three scenarios: (a) models with large numbers of variables versus models in which the 'irrelevant' or unobservable variables have been marginalised out; (b) micro-level models versus macro-level models in which the macro-variables are aggregate features of the micro-variables; (c) dynamical time series models versus models of their stationary behaviour. Our analysis stresses the importance of well specified interventions in the causal modelling process and sheds light on the interpretation of cyclic SEMs.

Ancestral Causal Inference

Constraint-based causal discovery from limited data is a notoriously difficult challenge due to the many borderline independence test decisions. Several approaches to improve the reliability of the predictions by exploiting redundancy in the independence information have been proposed recently. Though promising, existing approaches can still be greatly improved in terms of accuracy and scalability. We present a novel method that reduces the combinatorial explosion of the search space by using a more coarse-grained representation of causal information, drastically reducing computation time. Additionally, we propose a method to score causal predictions based on their confidence. Crucially, our implementation also allows one to easily combine observational and interventional data and to incorporate various types of available background knowledge. We prove soundness and asymptotic consistency of our method and demonstrate that it can outperform the state-of-the-art on synthetic data, achieving a speedup of several orders of magnitude. We illustrate its practical feasibility by applying it on a challenging protein data set.

Methods for causal inference from gene perturbation experiments and validation

Inferring causal effects from observational and interventional data is a highly
desirable but ambitious goal. Many of the computational and statistical methods
are plagued by fundamental identifiability issues, instability, and unreliable
performance, especially for large-scale systems with many measured variables.
We present software and provide some validation of a recently developed
methodology based on an invariance principle, called invariant causal
prediction (ICP). The ICP method quantifies confidence probabilities for
inferring causal structures and thus leads to more reliable and confirmatory
statements for causal relations and predictions of external intervention
effects. We validate the ICP method and some other procedures using large-scale
genome-wide gene perturbation experiments in *Saccharomyces cerevisiae*.
The results suggest that prediction and prioritization of future experimental
interventions, such as gene deletions, can be improved by using our statistical
inference techniques.

Distinguishing cause from effect using observational data: methods and benchmarks

The discovery of causal relationships from purely observational data is a fundamental problem in science. The most elementary form of such a causal discovery problem is to decide whether X causes Y or, alternatively, Y causes X, given joint observations of two variables X,Y. An example is to decide whether altitude causes temperature, or vice versa, given only joint measurements of both variables. Even under the simplifying assumptions of no confounding, no feedback loops, and no selection bias, such bivariate causal discovery problems are challenging. Nevertheless, several approaches for addressing those problems have been proposed in recent years. We review two families of such methods: methods based on Additive Noise Models (ANMs) and Information Geometric Causal Inference (IGCI). We present the benchmark CauseEffectPairs that consists of data for 100 different cause-effect pairs selected from 37 data sets from various domains (e.g., meteorology, biology, medicine, engineering, economy, etc.) and motivate our decisions regarding the "ground truth" causal directions of all pairs. We evaluate the performance of several bivariate causal discovery methods on these real-world benchmark data and in addition on artificially simulated data. Our empirical results on real-world data indicate that certain methods are indeed able to distinguish cause from effect using only purely observational data, although more benchmark data would be needed to obtain statistically significant conclusions. One of the best performing methods overall is the method based on Additive Noise Models that has originally been proposed by Hoyer et al. (2009), which obtains an accuracy of 63 ± 10 % and an AUC of 0.74 ± 0.05 on the real-world benchmark. As the main theoretical contribution of this work we prove the consistency of that method.

An empirical study of one of the simplest causal prediction algorithms

We study one of the simplest causal prediction algorithms that uses only conditional independences estimated from purely observational data. A specific pattern of four conditional independence relations amongst a quadruple of random variables already implies that one of these variables causes another one without any confounding. As a consequence, it is possible to predict what would happen under an intervention on that variable without actually performing the intervention. Although the method is asymptotically consistent and works well in settings with only a few (latent) variables, we find that its prediction accuracy can be worse than that of simple (inconsistent) baselines when many (latent) variables are present. Our findings illustrate that violations of strong faithfulness become increasingly likely in the presence of many latent variables, and this can significantly deteriorate the accuracy of constraint-based causal prediction algorithms that assume faithfulness.

MAGMA: Generalized Gene-Set Analysis of GWAS Data

By aggregating data for complex traits in a biologically meaningful way, gene and gene-set analysis constitute a valuable addition to single-marker analysis. However, although various methods for gene and gene-set analysis currently exist, they generally suffer from a number of issues. Statistical power for most methods is strongly affected by linkage disequilibrium between markers, multi-marker associations are often hard to detect, and the reliance on permutation to compute p-values tends to make the analysis computationally very expensive. To address these issues we have developed MAGMA, a novel tool for gene and gene-set analysis. The gene analysis is based on a multiple regression model, to provide better statistical performance. The gene-set analysis is built as a separate layer around the gene analysis for additional flexibility. This gene-set analysis also uses a regression structure to allow generalization to analysis of continuous properties of genes and simultaneous analysis of multiple gene sets and other gene properties. Simulations and an analysis of Crohn's Disease data are used to evaluate the performance of MAGMA and to compare it to a number of other gene and gene-set analysis tools. The results show that MAGMA has significantly more power than other tools for both the gene and the gene-set analysis, identifying more genes and gene sets associated with Crohn's Disease while maintaining a correct type 1 error rate. Moreover, the MAGMA analysis of the Crohn's Disease data was found to be considerably faster as well.

Type-II Errors of Independence Tests Can Lead to Arbitrarily Large Errors in Estimated Causal Effects: An Illustrative Example

Estimating the strength of causal effects from observational data is a common problem in scientific research. A popular approach is based on exploiting observed conditional independences between variables. It is well-known that this approach relies on the assumption of faithfulness. In our opinion, a more important practical limitation of this approach is that it relies on the ability to distinguish independences from (arbitrarily weak) dependences. We present a simple analysis, based on purely algebraic and geometrical arguments, of how the estimation of the causal effect strength, based on conditional independence tests and background knowledge, can have an arbitrarily large error due to the uncontrollable type II error of a single conditional independence test. The scenario we are studying here is related to the LCD algorithm by Cooper and to the instrumental variable setting that is popular in epidemiology and econometrics. It is one of the simplest settings in which causal discovery and prediction methods based on conditional independences arrive at non-trivial conclusions, yet for which the lack of uniform consistency can result in arbitrarily large prediction errors.

Causal Discovery with Continuous Additive Noise Models

We consider the problem of learning causal directed acyclic graphs from an observational joint distribution. One can use these graphs to predict the outcome of interventional experiments, from which data are often not available. We show that if the observational distribution follows a structural equation model with an additive noise structure, the directed acyclic graph becomes identifiable from the distribution under mild conditions. This constitutes an interesting alternative to traditional methods that assume faithfulness and identify only the Markov equivalence class of the graph, thus leaving some edges undirected. We provide practical algorithms for finitely many samples, RESIT (regression with subsequent independence test) and two methods based on an independence score. We prove that RESIT is correct in the population setting and provide an empirical evaluation.

Cyclic Causal Discovery from Continuous Equilibrium Data

We propose a method for learning cyclic causal models from a combination of observational and interventional equilibrium data. Novel aspects of the proposed method are its ability to work with continuous data (without assuming linearity) and to deal with feedback loops. Within the context of biochemical reactions, we also propose a novel way of modeling interventions that modify the activity of compounds instead of their abundance. For computational reasons, we approximate the nonlinear causal mechanisms by (coupled) local linearizations, one for each experimental condition. We apply the method to reconstruct a cellular signaling network from the flow cytometry data measured by Sachs et al. (2005). We show that our method finds evidence in the data for feedback loops and that it gives a more accurate quantitative description of the data at comparable model complexity.

From Ordinary Differential Equations to Structural Causal Models: the deterministic case

We show how, and under which conditions, the equilibrium states of a first-order Ordinary Differential Equation (ODE) system can be described with a deterministic Structural Causal Model (SCM). Our exposition sheds more light on the concept of causality as expressed within the framework of Structural Causal Models, especially for cyclic models.
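
As an illustration of the kind of correspondence described above, consider a hypothetical two-variable linear system (the system and the constants $b$, $c_1$, $c_2$, $\xi$ are assumptions chosen for this example, not taken from the paper). Setting the time derivatives to zero yields equations that can be read as structural equations, and a perfect intervention on $X_1$ replaces its equation while leaving the one for $X_2$ unchanged:

```latex
\begin{align*}
\text{dynamics:} \quad & \dot X_1 = -X_1 + c_1, & \dot X_2 &= -X_2 + b\,X_1 + c_2 \\
\text{equilibrium:} \quad & X_1^* = c_1, & X_2^* &= b\,X_1^* + c_2 \\
\text{under } \mathrm{do}(X_1 = \xi): \quad & X_1^* = \xi, & X_2^* &= b\,\xi + c_2
\end{align*}
```

Both the observational and the intervened system are stable here (each variable relaxes at rate 1), so the equilibrium is reached from any initial condition.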

Learning Sparse Causal Models is not NP-Hard

This paper shows that causal model discovery is not an NP-hard problem, in the sense that for sparse graphs bounded by node degree $k$ the sound and complete causal model can be obtained in worst case order $N^{2(k+2)}$ independence tests, even when latent variables and selection bias may be present. We present a modification of the well-known FCI algorithm that implements the method for an independence oracle, and suggest improvements for sample/real-world data versions. It does not contradict any known hardness results, and does not solve an NP-hard problem: it just proves that sparse causal discovery is perhaps more complicated, but not as hard as learning minimal Bayesian networks.

On causal and anticausal learning

We consider the problem of function estimation in the case where an underlying causal model can be inferred. This has implications for popular scenarios such as covariate shift, concept drift, transfer learning and semi-supervised learning. We argue that causal knowledge may facilitate some approaches for a given problem, and rule out others. In particular, we formulate a hypothesis for when semi-supervised learning can help, and corroborate it with empirical results.

Information-geometric approach to inferring causal directions

While conventional approaches to causal inference are mainly based on conditional (in)dependences, recent methods also account for the shape of (conditional) distributions. The idea is that the causal hypothesis "X causes Y" imposes that the marginal distribution P(X) and the conditional distribution P(Y|X) represent independent mechanisms of nature. Recently it has been postulated that the shortest description of the joint distribution P(X,Y) should therefore be given by separate descriptions of P(X) and P(Y|X). Since description length in the sense of Kolmogorov complexity is uncomputable, practical implementations rely on other notions of independence. Here we define independence via orthogonality in information space. This way, we can explicitly describe the kind of dependence that occurs between P(Y) and P(X|Y) making the causal hypothesis "Y causes X" implausible. Remarkably, this asymmetry between cause and effect becomes particularly simple if X and Y are deterministically related. We present an inference method that works in this case. We also discuss some theoretical results for the non-deterministic case although it is not clear how to employ them for a more general inference method.

On Causal Discovery with Cyclic Additive Noise Models

We study a particular class of cyclic causal models, where each variable is a (possibly nonlinear) function of its parents and additive noise. We prove that the causal graph of such models is generically identifiable in the bivariate, Gaussian-noise case. We also propose a method to learn such models from observational data. In the acyclic case, the method reduces to ordinary regression, but in the more challenging cyclic case, an additional term arises in the loss function, which makes it a special case of nonlinear independent component analysis. We illustrate the proposed method on synthetic data.

Efficient inference in matrix-variate Gaussian models with iid observation noise

Inference in matrix-variate Gaussian models has major applications for multi-output prediction and joint learning of row and column covariances from matrix-variate data. Here, we discuss an approach for efficient inference in such models that explicitly account for iid observation noise. Computational tractability can be retained by exploiting the Kronecker product between row and column covariance matrices. Using this framework, we show how to generalize the Graphical Lasso in order to learn a sparse inverse covariance between features while accounting for a low-rank confounding covariance between samples. We show practical utility on applications to biology, where we model covariances with more than 100,000 dimensions. We find greater accuracy in recovering biological network structures and are able to better reconstruct the confounders.
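
The Kronecker structure mentioned above can be exploited with a standard linear-algebra identity. The following is a minimal sketch, not necessarily the scheme used in the paper: it solves $(C \otimes R + \sigma^2 I)\,\mathrm{vec}(A) = \mathrm{vec}(Y)$ using only eigendecompositions of the row covariance $R$ and the column covariance $C$. The function name `kron_solve` and the column-major `vec` convention are assumptions made for this example.

```python
import numpy as np

def kron_solve(R, C, Y, noise_var):
    """Solve (kron(C, R) + noise_var * I) vec(A) = vec(Y) for A.

    R: (n, n) row covariance, C: (p, p) column covariance, Y: (n, p) data.
    vec() stacks columns (column-major). The full (n*p, n*p) matrix is
    never formed; only the two small eigendecompositions are needed.
    """
    s_r, U_r = np.linalg.eigh(R)   # R = U_r diag(s_r) U_r^T
    s_c, U_c = np.linalg.eigh(C)   # C = U_c diag(s_c) U_c^T
    # Rotate Y into the joint eigenbasis of kron(C, R).
    Z = U_r.T @ Y @ U_c
    # In that basis the system is diagonal with entries s_r[i] * s_c[j] + noise_var.
    Z = Z / (np.outer(s_r, s_c) + noise_var)
    # Rotate back to the original coordinates.
    return U_r @ Z @ U_c.T
```

For an n-by-p data matrix, the cost is dominated by the two eigendecompositions, roughly O(n^3 + p^3), instead of the O(n^3 p^3) of a dense solve on the full covariance.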

Learning of Causal Relations

To learn about causal relations between variables just by observing samples from them, particular assumptions must be made about those variables' distributions. This article gives a practical description of how such a learning task can be undertaken based on different possible assumptions. Two categories of assumptions lead to different methods, constraint-based and Bayesian learning, and in each case we review both the basic ideas and some recent extensions and alternatives to them.

Identifiability of Causal Graphs using Functional Models

This work addresses the following question: Under what assumptions on the data generating process can one infer the causal graph from the joint distribution? The approach taken by conditional independence-based causal discovery methods is based on two assumptions: the Markov condition and faithfulness. It has been shown that under these assumptions the causal graph can be identified up to Markov equivalence (some arrows remain undirected) using methods like the PC algorithm. In this work we propose an alternative by defining Identifiable Functional Model Classes (IFMOCs). As our main theorem we prove that if the data generating process belongs to an IFMOC, one can identify the complete causal graph. To the best of our knowledge this is the first identifiability result of this kind that is not limited to linear functional relationships. We discuss how the IFMOC assumption and the Markov and faithfulness assumptions relate to each other and explain why we believe that the IFMOC assumption can be tested more easily on given data. We further provide a practical algorithm that recovers the causal graph from finitely many data; experiments on simulated data support the theoretical findings.

A Graphical Model Framework for Decoding in the Visual ERP-Based BCI Speller

We present a graphical model framework for decoding in the visual ERP-based speller system. The proposed framework allows researchers to build generative models from which the decoding rules are obtained in a straightforward manner. We suggest two models for generating brain signals conditioned on the stimulus events. Both models incorporate letter frequency information but assume different dependencies between brain signals and stimulus events. For both models, we derive decoding rules and perform a discriminative training. We show on real visual speller data how decoding performance improves by incorporating letter frequency information and using a more realistic graphical model for the dependencies between the brain signals and the stimulus events. Furthermore, we discuss how the standard approach to decoding can be seen as a special case of the graphical model framework. The letter also gives more insight into the discriminative approach for decoding in the visual speller system.

Probabilistic latent variable models for distinguishing between cause and effect

We propose a novel method for inferring whether X causes Y or vice versa from joint observations of X and Y. The basic idea is to model the observed data using probabilistic latent variable models, which incorporate the effects of unobserved noise. To this end, we consider the hypothetical effect variable to be a function of the hypothetical cause variable and an independent noise term (not necessarily additive). An important novel aspect of our work is that we do not restrict the model class, but instead put general non-parametric priors on this function and on the distribution of the cause. The causal direction can then be inferred by using standard Bayesian model selection. We evaluate our approach on synthetic data and real-world data and report encouraging results.

libDAI: A Free and Open Source C++ Library for Discrete Approximate Inference in Graphical Models

This paper describes the software package libDAI, a free & open source C++ library that provides implementations of various exact and approximate inference methods for graphical models with discrete-valued variables. libDAI supports directed graphical models (Bayesian networks) as well as undirected ones (Markov random fields and factor graphs). It offers various approximations of the partition sum, marginal probability distributions and maximum probability states. Parameter learning is also supported. A feature comparison with other open source software packages for approximate inference is given. libDAI is licensed under the GPL v2+ license and is available at http://www.libdai.org.

Inferring deterministic causal relations

We consider two variables that are related to each other by an invertible function. While it has previously been shown that the dependence structure of the noise can provide hints to determine which of the two variables is the cause, we presently show that even in the deterministic (noise-free) case, there are asymmetries that can be exploited for causal inference. Our method is based on the idea that if the function and the probability density of the cause are chosen independently, then the distribution of the effect will, in a certain sense, depend on the function. We provide a theoretical analysis of this method, showing that it also works in the low noise regime, and link it to information geometry. We report strong empirical results on various real-world data sets from different domains.
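
The asymmetry described above can be illustrated with a slope-based score of the kind used in the information-geometric literature. This is a minimal sketch under stated assumptions: both variables are rescaled to [0, 1] (a uniform reference measure), the score for a candidate direction is the average log slope between neighbouring points, and the direction with the smaller score is preferred. The function names are hypothetical, and the paper's own estimator may differ in details.

```python
import numpy as np

def igci_slope_score(cause, effect):
    """Average log |slope| of effect vs. cause after rescaling both to [0, 1]."""
    c = (cause - cause.min()) / (cause.max() - cause.min())
    e = (effect - effect.min()) / (effect.max() - effect.min())
    order = np.argsort(c)                  # walk along the candidate cause
    dc = np.diff(c[order])
    de = np.diff(e[order])
    keep = (dc != 0) & (de != 0)           # skip ties, which give undefined logs
    return float(np.mean(np.log(np.abs(de[keep] / dc[keep]))))

def igci_direction(x, y):
    """Return the preferred direction and both scores (lower score wins)."""
    s_xy = igci_slope_score(x, y)
    s_yx = igci_slope_score(y, x)
    return ("X->Y" if s_xy < s_yx else "Y->X"), s_xy, s_yx
```

For example, with `x` drawn uniformly from [0, 1] and `y = x**3`, the score in the direction X->Y approximates the integral of log(3x^2) over [0, 1], which is log 3 - 2 and hence negative, so this sketch prefers X->Y.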

Remote Sensing Feature Selection by Kernel Dependence Measures

This letter introduces a nonlinear measure of independence between random variables for remote sensing supervised feature selection. The so-called Hilbert-Schmidt independence criterion (HSIC) is a kernel method for evaluating statistical dependence and it is based on computing the Hilbert-Schmidt norm of the cross-covariance operator of mapped samples in the corresponding Hilbert spaces. The HSIC empirical estimator is easy to compute and has good theoretical and practical properties. Rather than using this estimate for maximizing the dependence between the selected features and the class labels, we propose the more sensitive criterion of minimizing the associated HSIC p-value. Results in multispectral, hyperspectral, and SAR data feature selection for classification show the good performance of the proposed approach.
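
To make the criterion concrete, the sketch below computes the standard biased empirical HSIC estimate, trace(KHLH)/(n-1)^2, and an associated p-value. Several choices here are assumptions for illustration and are not taken from the letter: Gaussian kernels with a median-heuristic bandwidth on both variables, and a permutation test for the p-value (other approximations of the null distribution exist). All function names are hypothetical.

```python
import numpy as np

def gaussian_gram(v):
    """Gaussian-kernel Gram matrix with a median-heuristic bandwidth."""
    v = np.asarray(v, dtype=float).reshape(len(v), -1)
    d2 = ((v[:, None, :] - v[None, :, :]) ** 2).sum(axis=-1)
    positive = d2[d2 > 0]
    bw2 = np.median(positive) if positive.size else 1.0
    return np.exp(-d2 / (2.0 * bw2))

def centered_gram(v):
    """Return H K H, where H = I - (1/n) 1 1^T centers the Gram matrix."""
    n = len(v)
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ gaussian_gram(v) @ H

def hsic(a, b):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(a)
    # trace(K H L H) equals the elementwise sum of (H K H) * L for symmetric L.
    return float(np.sum(centered_gram(a) * gaussian_gram(b))) / (n - 1) ** 2

def hsic_pvalue(a, b, n_perm=200, seed=0):
    """Permutation p-value: shuffle b to simulate independence from a."""
    rng = np.random.default_rng(seed)
    Kc, L = centered_gram(a), gaussian_gram(b)
    observed = np.sum(Kc * L)
    exceed = 0
    for _ in range(n_perm):
        p = rng.permutation(len(a))
        if np.sum(Kc * L[np.ix_(p, p)]) >= observed:
            exceed += 1
    return (1 + exceed) / (1 + n_perm)
```

In a feature-selection loop, one would call `hsic_pvalue(feature, labels)` for each candidate feature and favour those with the smallest p-values; for discrete class labels a different label kernel may be more appropriate than the Gaussian one assumed here.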

Distinguishing between cause and effect

We describe eight data sets that together formed the CauseEffectPairs task in the Causality Challenge #2: Pot-Luck competition. Each set consists of a sample of a pair of statistically dependent random variables. One variable is known to cause the other one, but this information was hidden from the participants; the task was to identify which of the two variables was the cause and which one the effect, based upon the observed sample. The data sets were chosen such that we expect common agreement on the ground truth. Even though part of the statistical dependences may also be due to hidden common causes, common sense tells us that there is a significant cause-effect relation between the two variables in each pair. We also present baseline results using three different causal inference methods.

Identifying confounders using additive noise models

We propose a method for inferring the existence of a latent common cause ("confounder") of two observed random variables. The method assumes that the two effects of the confounder are (possibly nonlinear) functions of the confounder plus independent, additive noise. We discuss under which conditions the model is identifiable (up to an arbitrary reparameterization of the confounder) from the joint distribution of the effects. We state and prove a theoretical result that provides evidence for the conjecture that the model is generically identifiable under suitable technical conditions. In addition, we propose a practical method to estimate the confounder from a finite i.i.d. sample of the effects and illustrate that the method works well on both simulated and real-world data.

Regression by dependence minimization and its application to causal inference

Motivated by causal inference problems, we propose a novel method for regression that minimizes the statistical dependence between regressors and residuals. The key advantage of this approach to regression is that it does not assume a particular distribution of the noise, i.e., it is non-parametric with respect to the noise distribution. We argue that the proposed regression method is well suited to the task of causal inference in additive noise models. A practical disadvantage is that the resulting optimization problem is generally non-convex and can be difficult to solve. Nevertheless, we report good results on one of the tasks of the NIPS 2008 Causality Challenge, where the goal is to distinguish causes from effects in pairs of statistically dependent variables. In addition, we propose an algorithm for efficiently inferring causal models from observational data for more than two variables. The required number of regressions and independence tests is quadratic in the number of variables, which is a significant improvement over the simple method that tests all possible DAGs.

Nonlinear causal discovery with additive noise models

The discovery of causal relationships between a set of observed variables is a fundamental problem in science. For continuous-valued data linear acyclic causal models with additive noise are often used because these models are well understood and there are well-known methods to fit them to data. In reality, of course, many causal relationships are more or less nonlinear, raising some doubts as to the applicability and usefulness of purely linear methods. In this contribution we show that the basic linear framework can be generalized to nonlinear models. In this extended framework, nonlinearities in the data-generating process are in fact a blessing rather than a curse, as they typically provide information on the underlying causal system and allow more aspects of the true data-generating mechanisms to be identified. In addition to theoretical results we show simulations and some simple real data experiments illustrating the identification power provided by nonlinearities.
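
The identification idea can be sketched for two variables as follows: regress each variable on the other, and prefer the direction in which the residuals look independent of the regressor. This is a rough illustration only. The polynomial regression, the fixed-bandwidth Gaussian-kernel HSIC score, and the direct comparison of scores (instead of a proper independence test) are stand-ins chosen for brevity, and all function names are hypothetical.

```python
import numpy as np

def _gram(v):
    """Gaussian Gram matrix (bandwidth 1) on standardized 1-D data."""
    v = (v - v.mean()) / v.std()
    return np.exp(-0.5 * (v[:, None] - v[None, :]) ** 2)

def hsic(a, b):
    """Biased HSIC estimate; larger values indicate stronger dependence."""
    n = len(a)
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(_gram(a) @ H @ _gram(b) @ H)) / n ** 2

def residual_dependence(cause, effect, degree=3):
    """Fit effect = f(cause) + noise and score dependence of noise on cause."""
    coeffs = np.polyfit(cause, effect, degree)
    residuals = effect - np.polyval(coeffs, cause)
    return hsic(cause, residuals)

def anm_direction(x, y, degree=3):
    """Prefer the direction whose residuals depend less on the regressor."""
    fwd = residual_dependence(x, y, degree)   # model y = f(x) + noise
    bwd = residual_dependence(y, x, degree)   # model x = g(y) + noise
    return ("X->Y" if fwd < bwd else "Y->X"), fwd, bwd
```

With data generated as `y = x**3 + noise`, the forward residuals are close to the independent noise, whereas no additive-noise model fits well in the backward direction, so the backward residuals remain visibly dependent on `y`. In the purely linear Gaussian case both directions fit equally well, which is why the nonlinearity helps.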

Bounds on marginal probability distributions

We propose a novel bound on single-variable marginal probability distributions in factor graphs with discrete variables. The bound is obtained by propagating local bounds (convex sets of probability distributions) over a subtree of the factor graph, rooted in the variable of interest. By construction, the method not only bounds the exact marginal probability distribution of a variable, but also its approximate Belief Propagation marginal ("belief"). Thus, apart from providing a practical means to calculate bounds on marginals, our contribution also lies in providing a better understanding of the error made by Belief Propagation. We show that our bound outperforms the state-of-the-art on some inference problems arising in medical diagnosis.

Sufficient Conditions for Convergence of the Sum–Product Algorithm

Novel conditions are derived that guarantee convergence of the Sum-Product Algorithm (also known as Loopy Belief Propagation or simply Belief Propagation (BP)) to a unique fixed point, irrespective of the initial messages, for parallel (synchronous) updates. The computational complexity of the conditions is polynomial in the number of variables. In contrast with previously existing conditions, our results are directly applicable to arbitrary factor graphs (with discrete variables) and are shown to be valid also in the case of factors containing zeros, under some additional conditions. The conditions are compared with existing ones, numerically and, if possible, analytically. For binary variables with pairwise interactions, sufficient conditions are derived that take into account local evidence (i.e., single-variable factors) and the type of pair interactions (attractive or repulsive). It is shown empirically that this bound outperforms existing bounds.

Truncating the Loop Series Expansion for Belief Propagation

Recently, Chertkov and Chernyak (2006b) derived an exact expression for the partition sum (normalization constant) corresponding to a graphical model, which is an expansion around the belief propagation (BP) solution. By adding correction terms to the BP free energy, one for each "generalized loop" in the factor graph, the exact partition sum is obtained. However, the usually enormous number of generalized loops generally prohibits summation over all correction terms. In this article we introduce truncated loop series BP (TLSBP), a particular way of truncating the loop series of Chertkov and Chernyak by considering generalized loops as compositions of simple loops. We analyze the performance of TLSBP in different scenarios, including the Ising model on square grids and regular random graphs, and on PROMEDAS, a large probabilistic medical diagnostic system. We show that TLSBP often improves upon the accuracy of the BP solution, at the expense of increased computation time. We also show that the performance of TLSBP strongly depends on the degree of interaction between the variables. For weak interactions, truncating the series leads to significant improvements, whereas for strong interactions it can be ineffective, even if a high number of terms is considered.

Loop Corrections for Approximate Inference on Factor Graphs

We propose a method to improve approximate inference methods by correcting for the influence of loops in the graphical model. The method is a generalization and alternative implementation of a recent idea from Montanari and Rizzo (2005). It is applicable to arbitrary factor graphs, provided that the size of the Markov blankets is not too large. It consists of two steps: (i) an approximate inference method, for example, belief propagation, is used to approximate cavity distributions for each variable (i.e., probability distributions on the Markov blanket of a variable for a modified graphical model in which the factors involving that variable have been removed); (ii) all cavity distributions are improved by a message-passing algorithm that cancels out approximation errors by imposing certain consistency constraints. This loop correction (LC) method usually gives significantly better results than the original, uncorrected, approximate inference algorithm that is used to estimate the effect of loops. Indeed, we often observe that the loop-corrected error is approximately the square of the error of the uncorrected approximate inference method. In this article, we compare different variants of the loop correction method with other approximate inference methods on a variety of graphical models, including "real world" networks, and conclude that the LC method generally obtains the most accurate results.

Inference in the Promedas medical expert system

In the current paper, the Promedas model for internal medicine, developed by our team, is introduced. The model is based on up-to-date medical knowledge and consists of approximately 2000 diagnoses, 1000 findings and 8600 connections between diagnoses and findings, covering a large part of internal medicine. We show that Belief Propagation (BP) can be successfully applied as an approximate inference algorithm in the Promedas network. In some cases, however, we find errors that are too large for this application. We apply a recently developed method that improves the BP results by means of a loop expansion scheme. This method, termed Loop Corrected (LC) BP, is able to improve the marginal probabilities significantly, leaving a remaining error which is acceptable for the purpose of medical diagnosis.

Loop Corrected Belief Propagation

We propose a method for improving Belief Propagation (BP) that takes into account the influence of loops in the graphical model. The method is a variation on and generalization of the method recently introduced by Montanari and Rizzo (2005). It consists of two steps: (i) standard BP is used to calculate cavity distributions for each variable (i.e., probability distributions on the Markov blanket of a variable for a modified graphical model, in which the factors involving that variable have been removed); (ii) all cavity distributions are combined by a message-passing algorithm to obtain consistent single-node marginals. The method is exact if the graphical model contains a single loop. The complexity of the method is exponential in the size of the Markov blankets. The results are very accurate in general: the error is often several orders of magnitude smaller than that of standard BP, as illustrated by numerical experiments.

Sufficient conditions for convergence of Loopy Belief Propagation

We derive novel sufficient conditions for convergence of Loopy Belief Propagation (also known as the Sum-Product algorithm) to a unique fixed point. Our results improve upon previously known conditions. For binary variables with (anti-)ferromagnetic interactions, our conditions seem to be sharp.

On the properties of the Bethe approximation and Loopy Belief Propagation on binary networks

We analyse the local stability of the high-temperature fixed point of the loopy belief propagation (LBP) algorithm and how this relates to the properties of the Bethe free energy which LBP tries to minimize. We focus on the case of binary networks with pairwise interactions. In particular, we state sufficient conditions for convergence of LBP to a unique fixed point and show that these are sharp for purely ferromagnetic interactions. In contrast, in the purely antiferromagnetic case, the undamped parallel LBP algorithm is suboptimal in the sense that the stability of the fixed point breaks down much earlier than for damped or sequential LBP; we observe that the onset of instability for the latter algorithms is related to the properties of the Bethe free energy. For spin-glass interactions, damping LBP only helps slightly. We estimate analytically the temperature at which the high-temperature LBP fixed point becomes unstable for random graphs with arbitrary degree distributions and random interactions.

Validity Estimates for Loopy Belief Propagation on Binary Real-world Networks

We introduce a computationally efficient method to estimate the validity of the BP method as a function of graph topology, the connectivity strength, frustration and network size. We present numerical results that demonstrate the correctness of our estimates for the uniform random model and for a real-world network ("C. Elegans"). Although the method is restricted to pair-wise interactions, no local evidence (zero "biases") and binary variables, we believe that its predictions correctly capture the limitations of BP for inference and MAP estimation on arbitrary graphical models. Using this approach, we find that BP always performs better than MF. Especially for large networks with broad degree distributions (such as scale-free networks) BP turns out to significantly outperform MF.

Quantitative Imaging through a Spectrograph. 1. Principles and Theory

Laser-based optical diagnostics, such as planar laser-induced fluorescence and, especially, Raman imaging, often require selective spectral filtering. We advocate the use of an imaging spectrograph with a broad entrance slit as a spectral filter for two-dimensional imaging. A spectrograph in this mode of operation produces output that is a convolution of the spatial and spectral information that is present in the incident light. We describe an analytical deconvolution procedure, based on Bayesian statistics, that retrieves the spatial information while it avoids excessive noise blowup. The method permits direct imaging through a spectrograph, even under broadband illumination. We introduce the formalism and discuss the underlying assumptions. The performance of the procedure is demonstrated on an artificial but pathological example. In a companion paper [Appl. Opt. 43, 5682-5690 (2004)] the method is applied to the practical case of fuel equivalence ratio Raman imaging in a combustible methane-air mixture.

Theoretical Aspects of Cyclic Structural Causal Models

Structural causal models (SCMs), also known as (non-parametric) structural equation models (SEMs), are widely used for causal modeling purposes. A large body of theoretical results is available for the special case in which cycles are absent (i.e., acyclic SCMs, also known as recursive SEMs). However, in many application domains cycles are abundantly present, for example in the form of feedback loops. In this paper, we provide a general and rigorous theory of cyclic SCMs. The paper consists of two parts: the first part gives a rigorous treatment of structural causal models, dealing with measure-theoretic and other complications that arise in the presence of cycles. In contrast with the acyclic case, in cyclic SCMs solutions may no longer exist, or if they exist, they may no longer be unique, or even measurable in general. We give several sufficient and necessary conditions for the existence of (unique) measurable solutions. We show how causal reasoning proceeds in these models and how this differs from the acyclic case. Moreover, we give an overview of the Markov properties that hold for cyclic SCMs. In the second part, we address the question of how one can marginalize an SCM (possibly with cycles) to a subset of the endogenous variables. We show that under a certain condition, one can effectively remove a subset of the endogenous variables from the model, leading to a more parsimonious marginal SCM that preserves the causal and counterfactual semantics of the original SCM on the remaining variables. Moreover, we show how the marginalization relates to the latent projection and to latent confounders, i.e. latent common causes.

Generalized Structural Causal Models

Structural causal models are a popular tool to describe causal relations in systems in many fields such as economics, the social sciences, and biology. In this work, we show that these models are not flexible enough in general to give a complete causal representation of equilibrium states in dynamical systems that do not have a unique stable equilibrium independent of initial conditions. We prove that our proposed generalized structural causal models do capture the essential causal semantics that characterize these systems. We illustrate the power and flexibility of this extension on a dynamical system corresponding to a basic enzymatic reaction. We motivate our approach further by showing that it also efficiently describes the effects of interventions on functional laws such as the ideal gas law.

Joint Causal Inference from Multiple Contexts

The gold standard for discovering causal relations is by means of experimentation. Over the last decades, alternative methods have been proposed that can infer causal relations between variables from certain statistical patterns in purely observational data. We introduce Joint Causal Inference (JCI), a novel approach to causal discovery from multiple data sets that elegantly unifies both approaches. JCI is a causal modeling approach rather than a specific algorithm, and it can be used in combination with any causal discovery algorithm that can take into account certain background knowledge. The main idea is to reduce causal discovery from multiple datasets originating from different contexts (e.g., different experimental conditions) to causal discovery from a single pooled dataset by adding a set of auxiliary context variables. JCI offers the following features: it deals with several different types of interventions in a unified fashion, it can learn intervention targets, it pools data across different datasets which improves the statistical power of independence tests, and by exploiting differences in distribution between contexts it improves on the accuracy and identifiability of the predicted causal relations. We evaluate the approach on flow cytometry data.
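
The pooling step described above can be sketched as follows. This is only a minimal illustration and not the JCI implementation: the helper name `pool_with_context`, the `C_` column prefix, and the choice of one binary indicator per context are assumptions made here, and the data values are made up.

```python
def pool_with_context(datasets):
    """Pool per-context datasets into one table with auxiliary context columns.

    `datasets` maps a context label to a list of rows, each row a dict of
    system variables. Hypothetical helper: encoding each context as a binary
    indicator column is an assumption chosen for illustration.
    """
    contexts = list(datasets)
    pooled = []
    for label in contexts:
        for row in datasets[label]:
            new_row = dict(row)  # copy so the input rows are not modified
            for c in contexts:
                new_row["C_" + str(c)] = 1 if c == label else 0
            pooled.append(new_row)
    return pooled


# Two illustrative contexts: one observational, one with an intervention.
observational = [{"X": 0.1, "Y": 0.3}, {"X": 0.5, "Y": 0.9}]
interventional = [{"X": 2.0, "Y": 4.1}]
pooled = pool_with_context({"obs": observational, "do_X": interventional})
```

A causal discovery algorithm would then be run on `pooled`, treating the `C_` columns as ordinary variables together with whatever background knowledge about them is available.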

From Random Differential Equations to Structural Causal Models: the stochastic case

Random Differential Equations provide a natural extension of Ordinary Differential Equations to the stochastic setting. We show how, and under which conditions, every equilibrium state of a Random Differential Equation (RDE) can be described by a Structural Causal Model (SCM), while preserving the causal semantics. This provides an SCM that captures the stochastic and causal behavior of the RDE, which can model both cycles and confounders. This enables the study of the equilibrium states of the RDE by applying the theory and statistical tools available for SCMs, for example, marginalizations and Markov properties, as we illustrate by means of an example. Our work thus provides a direct connection between two fields that so far have been developing in isolation.

Markov Properties for Graphical Models with Cycles and Latent Variables

We investigate probabilistic graphical models that allow for both cycles and latent variables. For this we introduce directed graphs with hyperedges (HEDGes), generalizing and combining both marginalized directed acyclic graphs (mDAGs) that can model latent (dependent) variables, and directed mixed graphs (DMGs) that can model cycles. We define and analyse several different Markov properties that relate the graphical structure of a HEDG with a probability distribution on a corresponding product space over the set of nodes, for example factorization properties, structural equations properties, ordered/local/global Markov properties, and marginal versions of these. The various Markov properties for HEDGes are in general not equivalent to each other when cycles or hyperedges are present, in contrast with the simpler case of directed acyclic graphical (DAG) models (also known as Bayesian networks). We show how the Markov properties for HEDGes - and thus the corresponding graphical Markov models - are logically related to each other.

Domain Adaptation by Using Causal Inference to Predict Invariant Conditional Distributions

An important goal common to domain adaptation and causal inference is to make accurate predictions when the distributions for the source (or training) domain(s) and target (or test) domain(s) differ. In many cases, these different distributions can be modeled as different contexts of a single underlying system, in which each distribution corresponds to a different perturbation of the system, or in causal terms, an intervention. We focus on a class of such causal domain adaptation problems, where data for one or more source domains are given, and the task is to predict the distribution of a certain target variable from measurements of other variables in one or more target domains. We propose an approach for solving these problems that exploits causal inference and does not rely on prior knowledge of the causal graph, the type of interventions or the intervention targets. We demonstrate our approach by evaluating a possible implementation on simulated and real world data.

Joint Causal Inference from Observational and Experimental Datasets

We introduce Joint Causal Inference (JCI), a powerful formulation of causal discovery from multiple datasets that allows one to jointly learn both the causal structure and targets of interventions from statistical independences in pooled data. Compared with existing constraint-based approaches for causal discovery from multiple data sets, JCI offers several advantages: it allows for several different types of interventions in a unified fashion, it can learn intervention targets, it systematically pools data across different datasets which improves the statistical power of independence tests, and most importantly, it improves on the accuracy and identifiability of the predicted causal relations. A technical complication that arises in JCI is the occurrence of faithfulness violations due to deterministic relations. We propose a simple but effective strategy for dealing with this type of faithfulness violation. We implement it in ACID, a determinism-tolerant extension of Ancestral Causal Inference (ACI) (Magliacane et al., 2016), a recently proposed logic-based causal discovery method that improves reliability of the output by exploiting redundant information in the data. We illustrate the benefits of JCI with ACID with an evaluation on a simulated dataset.

Structural Causal Models: Cycles, Marginalizations, Exogenous Reparametrizations and Reductions

Structural causal models (SCMs), also known as non-parametric structural equation models (NP-SEMs), are widely used for causal modeling purposes. In this paper, we give a rigorous treatment of structural causal models, dealing with measure-theoretic complications that arise in the presence of cyclic relations. The central question studied in this paper is: given a (possibly cyclic) SCM defined on a large system (consisting of observable endogenous and latent exogenous variables), can we "project it down" to an SCM that describes a subsystem (consisting of a subset of the observed endogenous variables and possibly different latent exogenous variables) in order to obtain a more parsimonious but equivalent representation of the subsystem? We define a marginalization operation that effectively removes a subset of the endogenous variables from the model, and a class of mappings, exogenous reparameterizations, that can be used to reduce the space of exogenous variables. We show that both operations preserve the causal semantics of the model and that under mild conditions they can lead to a significant reduction of the model complexity, at least in terms of the number of variables in the model. We argue that for the task of estimating an SCM from data, the existence of "smooth" reductions would be desirable. We provide several conditions under which the existence of such reductions can be shown, but also provide a counterexample that shows that such reductions do not exist in general. The latter result implies that existing approaches to estimate linear or Markovian SCMs from data cannot be extended to general SCMs.

Ancestral Causal Inference

Constraint-based causal discovery from limited data is a notoriously difficult challenge due to the many borderline independence test decisions. Several approaches to improve the reliability of the predictions by exploiting redundancy in the independence information have been proposed recently. Though promising, existing approaches can still be greatly improved in terms of accuracy and scalability. We present a novel method that reduces the combinatorial explosion of the search space by using a more coarse-grained representation of causal information, drastically reducing computation time. Additionally, we propose a method to score causal predictions based on their confidence. Crucially, our implementation also allows one to easily combine observational and interventional data and to incorporate various types of available background knowledge. We prove soundness and asymptotic consistency of our method and demonstrate that it can outperform the state-of-the-art on synthetic data, achieving a speedup of several orders of magnitude. We illustrate its practical feasibility by applying it to a challenging protein data set.

Distinguishing cause from effect using observational data: methods and benchmarks

The discovery of causal relationships from purely observational data is a fundamental problem in science. The most elementary form of such a causal discovery problem is to decide whether X causes Y or, alternatively, Y causes X, given joint observations of two variables X, Y. An example is to decide whether altitude causes temperature, or vice versa, given only joint measurements of both variables. Even under the simplifying assumptions of no confounding, no feedback loops, and no selection bias, such bivariate causal discovery problems are challenging. Nevertheless, several approaches for addressing those problems have been proposed in recent years. We review two families of such methods: Additive Noise Methods (ANM) and Information Geometric Causal Inference (IGCI). We present the benchmark CauseEffectPairs that consists of data for 100 different cause-effect pairs selected from 37 datasets from various domains (e.g., meteorology, biology, medicine, engineering, economy, etc.) and motivate our decisions regarding the "ground truth" causal directions of all pairs. We evaluate the performance of several bivariate causal discovery methods on these real-world benchmark data and in addition on artificially simulated data. Our empirical results on real-world data indicate that certain methods are indeed able to distinguish cause from effect using only purely observational data, although more benchmark data would be needed to obtain statistically significant conclusions. One of the best performing methods overall is the additive-noise method originally proposed by Hoyer et al. (2009), which obtains an accuracy of 63 ± 10% and an AUC of 0.74 ± 0.05 on the real-world benchmark. As the main theoretical contribution of this work we prove the consistency of that method.

Proof Supplement - Learning Sparse Causal Models is not NP-hard (UAI2013)

This article contains detailed proofs and additional examples related to the UAI-2013 submission "Learning Sparse Causal Models is not NP-hard". It describes the FCI+ algorithm: a method for sound and complete causal model discovery in the presence of latent confounders and/or selection bias, that has worst case polynomial complexity of order N^(2(k+1)) in the number of independence tests, for sparse graphs over N nodes with node degree bounded by k. The algorithm is an adaptation of the well-known FCI algorithm by Spirtes et al. (2000) that is also sound and complete, but has worst case complexity exponential in N.

Novel Bounds on Marginal Probabilities

We derive two related novel bounds on single-variable marginal probability distributions in factor graphs with discrete variables. The first method propagates bounds over a subtree of the factor graph rooted in the variable, and the second method propagates bounds over the self-avoiding walk tree starting at the variable. By construction, both methods not only bound the exact marginal probability distribution of a variable, but also its approximate Belief Propagation marginal ("belief"). Thus, apart from providing a practical means to calculate bounds on marginals, our contribution also lies in an increased understanding of the error made by Belief Propagation. Empirically, we show that our bounds often outperform existing bounds in terms of accuracy and/or computation time. We also show that our bounds can yield nontrivial results for medical diagnosis inference problems.

Loop corrections for approximate inference

We propose a method for improving approximate inference methods that corrects for the influence of loops in the graphical model. The method is applicable to arbitrary factor graphs, provided that the size of the Markov blankets is not too large. It is an alternative implementation of an idea introduced recently by Montanari and Rizzo (2005). In its simplest form, which amounts to the assumption that no loops are present, the method reduces to the minimal Cluster Variation Method approximation (which uses maximal factors as outer clusters). On the other hand, using estimates of the effect of loops (obtained by some approximate inference algorithm) and applying the Loop Correcting (LC) method usually gives significantly better results than applying the approximate inference algorithm directly without loop corrections. Indeed, we often observe that the loop corrected error is approximately the square of the error of the approximate inference method used to estimate the effect of loops. We compare different variants of the Loop Correcting method with other approximate inference methods on a variety of graphical models, including "real world" networks, and conclude that the LC approach generally obtains the most accurate results.

Truncating the loop series expansion for Belief Propagation

Recently, M. Chertkov and V.Y. Chernyak derived an exact expression for the partition sum (normalization constant) corresponding to a graphical model, which is an expansion around the Belief Propagation (BP) solution. By adding correction terms to the BP free energy, one for each "generalized loop" in the factor graph, the exact partition sum is obtained. However, the usually enormous number of generalized loops generally prohibits summation over all correction terms. In this article we introduce Truncated Loop Series BP (TLSBP), a particular way of truncating the loop series of M. Chertkov and V.Y. Chernyak by considering generalized loops as compositions of simple loops. We analyze the performance of TLSBP in different scenarios, including the Ising model, regular random graphs and on Promedas, a large probabilistic medical diagnostic system. We show that TLSBP often improves upon the accuracy of the BP solution, at the expense of increased computation time. We also show that the performance of TLSBP strongly depends on the degree of interaction between the variables. For weak interactions, truncating the series leads to significant improvements, whereas for strong interactions it can be ineffective, even if a high number of terms is considered.

Sufficient conditions for convergence of the Sum-Product Algorithm

We derive novel conditions that guarantee convergence of the Sum-Product algorithm (also known as Loopy Belief Propagation or simply Belief Propagation) to a unique fixed point, irrespective of the initial messages. The computational complexity of the conditions is polynomial in the number of variables. In contrast with previously existing conditions, our results are directly applicable to arbitrary factor graphs (with discrete variables) and are shown to be valid also in the case of factors containing zeros, under some additional conditions. We compare our bounds with existing ones, numerically and, if possible, analytically. For binary variables with pairwise interactions, we derive sufficient conditions that take into account local evidence (i.e., single variable factors) and the type of pair interactions (attractive or repulsive). It is shown empirically that this bound outperforms existing bounds.

Spin-glass phase transitions on real-world graphs

We use the Bethe approximation to calculate the critical temperature for the transition from a paramagnetic to a glassy phase in spin-glass models on real-world graphs. Our criterion is based on the marginal stability of the minimum of the Bethe free energy. For uniform degree random graphs (equivalent to the Viana-Bray model) our numerical results, obtained by averaging single problem instances, are in agreement with the known critical temperature obtained by use of the replica method. Contrary to the replica method, our method immediately generalizes to arbitrary (random) graphs. We present new results for Barabasi-Albert scale-free random graphs, for which no analytical results are known. We investigate the scaling behavior of the critical temperature with graph size for both the finite and the infinite connectivity limit. We compare these with the naive Mean Field results. We observe that the Belief Propagation algorithm converges only in the paramagnetic regime.