By studying changes in gene expression, researchers learn how cells function at the molecular level, which can lead to a better understanding of how diseases develop. However, the human genome contains approximately 20,000 genes that interact in intricate ways, so even knowing which groups of genes to target is a significant challenge. Genes also work together in modules that regulate one another. To address this complexity, scientists at MIT have laid theoretical groundwork for methods that can sort genes into related groups, allowing efficient exploration of the underlying cause-and-effect relationships among them.
Importantly, the new method relies solely on observational data, removing the need for the costly and often impractical interventional experiments that traditionally supply the data needed to infer causal relationships. In the long term, this approach could help researchers identify potential gene targets that influence particular biological behaviors more accurately and efficiently, potentially leading to the development of precise treatments for patients.
In genomics, understanding the mechanisms that underlie cell states is essential. However, because cells have a multi-scale structure, the level at which data are summarized matters just as much. “By determining the most effective way to aggregate observed data, we can enhance the interpretability and utility of the information we derive from the system,” remarks graduate student Jiaqi Zhang, a fellow at the Eric and Wendy Schmidt Center and co-lead author of a paper detailing this innovation. Zhang is joined on the paper by co-lead author Ryan Welch, a master’s student in engineering, and senior author Caroline Uhler, a professor in Electrical Engineering and Computer Science (EECS) as well as the Institute for Data, Systems, and Society (IDSS). Uhler is also the director of the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard and is affiliated with MIT’s Laboratory for Information and Decision Systems (LIDS). Their findings will be presented at the Conference on Neural Information Processing Systems.
The research addresses the problem of identifying gene programs, which means understanding how sets of genes work together to regulate biological processes such as cell development and differentiation. Because it is impractical to study interactions among all 20,000 genes, the scientists use a technique called causal disentanglement to combine related genes into a representation that allows efficient exploration of cause-and-effect relationships.
Previously, the team demonstrated the effectiveness of this method when using interventional data—information acquired by manipulating variables in a gene network. However, conducting interventional experiments can be costly and, in some cases, unethical; additionally, technological limitations may hinder successful interventions.
With only observational data available, scientists cannot assess gene behaviors before and after interventions, making it challenging to understand how groups of genes interact. “Much of the research surrounding causal disentanglement presumes access to interventional data, leaving the potential of observational data largely unexplored,” Zhang notes.
The MIT researchers developed a more general approach that uses a machine-learning algorithm to identify and aggregate groups of observed variables, such as genes, using only observational data. The technique can be used to identify causal modules and reconstruct an accurate representation of the underlying causal mechanisms.
“Although this research stemmed from the need to clarify cellular programs, we first developed an innovative causal theory to discern what can and cannot be gleaned from observational data. With this theory in place, we can apply our understanding to genetic data in future efforts to identify gene modules and their regulatory functions,” Uhler explains.
Using statistical techniques, the researchers compute the variance of the Jacobian of each variable’s score. Causal variables that do not affect any subsequent variables should have a variance of zero. The representation is reconstructed layer by layer: the researchers first remove the bottom-layer variables, which have zero variance, and then work backward, one layer at a time, to reveal which variables, or gene groups, are connected.
Identifying these zero-variance variables quickly becomes a complex combinatorial problem, which makes designing an algorithm difficult. “Creating an efficient algorithm that could resolve this issue was a significant hurdle we faced,” Zhang admits.
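The zero-variance test and the layer-by-layer peeling described above can be illustrated with a small Python sketch. This is not the researchers' algorithm. It is a toy, under strong assumptions: three directly observed variables in a hypothetical chain (x0 influences x1, which influences x2), additive unit-variance Gaussian noise, and a log density that is known in closed form rather than estimated from data. The function names `log_density`, `score_jacobian_diag`, and `peel_leaves` are made up for this illustration. The sketch does not attempt the aggregation of observed variables into groups, which is the combinatorial part of the real problem.

```python
import numpy as np


def log_density(x, active):
    """Unnormalised log density of the toy chain x0 -> x1 -> x2.

    Assumption: additive unit-variance Gaussian noise. `active` lists the
    variables still in the model. Dropping a term gives the correct marginal
    only when the dropped variable is a leaf, which is how peel_leaves uses it.
    """
    terms = {
        0: -0.5 * x[:, 0] ** 2,
        1: -0.5 * (x[:, 1] - np.sin(x[:, 0])) ** 2,
        2: -0.5 * (x[:, 2] - x[:, 1] ** 2) ** 2,
    }
    return sum(terms[j] for j in active)


def score_jacobian_diag(x, j, active, h=1e-2):
    """Diagonal entry j of the Jacobian of the score at every sample.

    The score is the gradient of the log density, so this entry is the second
    derivative with respect to x_j, approximated by central differences.
    """
    step = np.zeros(x.shape[1])
    step[j] = h
    return (
        log_density(x + step, active)
        - 2.0 * log_density(x, active)
        + log_density(x - step, active)
    ) / h ** 2


def peel_leaves(x, tol=1e-6):
    """Repeatedly remove the variable whose Jacobian entry has ~zero variance."""
    active = list(range(x.shape[1]))
    order = []
    while active:
        variances = {j: np.var(score_jacobian_diag(x, j, active)) for j in active}
        leaf = min(variances, key=variances.get)
        if variances[leaf] > tol:
            raise ValueError("no zero-variance variable found")
        order.append(leaf)
        active.remove(leaf)
    return order


# Observational samples only: nothing is intervened on.
rng = np.random.default_rng(0)
n = 2000
x0 = rng.normal(size=n)
x1 = np.sin(x0) + rng.normal(size=n)
x2 = x1 ** 2 + rng.normal(size=n)
samples = np.column_stack([x0, x1, x2])

leaf_order = peel_leaves(samples)
print(leaf_order)  # bottom layer first: [2, 1, 0]
```

In this toy, x2 influences nothing else, so its entry in the Jacobian of the score is constant across samples and its variance is zero. Once x2 is removed, x1 becomes the new bottom layer, and so on up the chain. With real data the score would have to be estimated, which is where much of the practical difficulty lies.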
In the end, the method produces an abstracted representation of the observed data: layers of interconnected variables that concisely summarize the underlying causal structure. Each variable corresponds to an aggregated group of genes that function together, and a relationship between two variables represents how one group of genes regulates another. The method captures all the information used to determine each layer of variables.
After verifying the theoretical soundness of their approach, the researchers conducted simulations demonstrating that their algorithm can effectively disentangle meaningful causal representations using only observational data.
Looking ahead, the team aims to apply this technique to real-world genetic studies and to explore settings where some interventional data are available. They also hope to better understand how to design effective genetic interventions. Ultimately, this method could help researchers efficiently determine which genes work together in the same program, which may lead to the identification of drugs that target those genes to treat specific diseases.
This research was funded in part by the MIT-IBM Watson AI Lab and the U.S. Office of Naval Research.