Sreenivas R. Sukumar, Chris D. Rickett, Kristyn J. Maschhoff, Michael S. Woodacre
Hewlett Packard Enterprise
Corresponding data sponsor email: [email protected]
Abstract:
In this data challenge, we aim to use state-of-the-art artificial intelligence to reason and hypothesize on a multi-modal knowledge graph. More specifically, contenders will work on an expert-curated Life-Sciences knowledge graph of proteins, molecules, protein-protein interactions, and known bio-chemical pathways of disease mechanisms. We provide two datasets: a sampled dataset focused only on the COVID-19 virus (< 5 GB) and a comprehensive dataset (~30 TB). The comprehensive dataset includes over 150 billion medical facts/properties around ~4 million protein sequences and ~36 million drug compounds. The challenge builds on resources from prior successful use of this dataset by scientists (i.e., implementations of drug repurposing workflows in Jupyter notebooks and access to query engines for knowledge graphs).
Keywords: Scientific discovery, artificial intelligence, knowledge graph, generative models, knowledge-centric conversational AI, knowledge-centric large language models
1 Introduction
This challenge extends the proven value of knowledge graphs for the drug repurposing problem described in [1-5] by offering a Life-Sciences knowledge graph. The integrated Life-Sciences knowledge graph assembled to study potential drug repurposing candidates for COVID-19 was generated from a collection of publicly available databases commonly used in life sciences and systems biology research, bringing together over 150 billion facts (see Table 1).
The code examples provided with the dataset walk through how the dataset was used for (i) finding the potential cross-immunity between Tetanus and COVID-19 and (ii) shortlisting 66 drug molecules for prioritized clinical trials for COVID-19 and its variants. The sample code works on both the smaller COVID-19 dataset and the comprehensive dataset. While the datasets can be used in their raw form, we will also provide access to a query engine, packaged as a Singularity container, that renders the knowledge graph in memory for interactive traversal and exploratory investigation.
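To give a feel for the kind of interactive traversal mentioned above, the following is a minimal sketch of pattern matching over subject-predicate-object facts, in the spirit of a basic graph pattern query. It is not the packaged query engine or its API. The triples, predicate names, and the `match` and `query` helpers are all hypothetical and invented for illustration.

```python
from typing import Dict, Iterator, List, Tuple

Triple = Tuple[str, str, str]

# Toy facts invented for illustration; not taken from the challenge dataset.
TRIPLES: List[Triple] = [
    ("drug:A", "targets", "protein:P1"),
    ("drug:B", "targets", "protein:P2"),
    ("protein:P1", "similar_to", "protein:Viral1"),
    ("protein:P2", "similar_to", "protein:Other"),
    ("protein:Viral1", "part_of", "organism:VirusX"),
]


def match(pattern: Triple, binding: Dict[str, str]) -> Iterator[Dict[str, str]]:
    """Yield extended bindings for each stored triple that fits one pattern.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    for triple in TRIPLES:
        new = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield new


def query(patterns: List[Triple]) -> List[Dict[str, str]]:
    """Join the patterns left to right, carrying variable bindings along."""
    bindings: List[Dict[str, str]] = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match(pattern, b)]
    return bindings


# Which drugs target a protein similar to a protein of the (toy) virus?
results = query([
    ("?drug", "targets", "?protein"),
    ("?protein", "similar_to", "?viral"),
    ("?viral", "part_of", "organism:VirusX"),
])
candidate_drugs = sorted({r["?drug"] for r in results})
print(candidate_drugs)
```

A linear scan per pattern is only workable for a toy list like this one; at the scale of the comprehensive dataset, an indexed graph store or the provided query engine would be needed.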
2 Data Description
2.1 Acquisition
The dataset is the result of integrating several well-known scientific databases collected, curated, and made accessible by institutions such as the National Institutes of Health and the European Bioinformatics Institute. Descriptions of these datasets are available in references [6-13] below. Each of these multi-modality datasets can also be accessed individually via the links provided in the table below. The weblinks also provide data dictionaries and descriptions.
Table 1. List of open-source databases integrated into the knowledge graph.
Dataset | Source |
UniProt (Mar 2020) | [6] |
PubChemRDF (v1.6.3 beta) | [7] |
ChEMBL-RDF (27.0) | [8] |
Bio2RDF (Release 4) | [9] |
OrthoDB (v10) | [10] |
Biomodels (r31) | [11] |
Biosamples (v20191125) | [12] |
Reactome (r71) | [13] |
ClinicalTrials | [14] |
SMC2022 Github Page: <>
Dataset Download (DOI): <to be published soon>
Dataset Companion Github: <to be published soon>
Resource Companion: <to be shared soon>
2.2 Characterizations
These open-source datasets in the knowledge graph are accessed and used by thousands of researchers across the world for biomedical research and innovation. We encourage such creative use of the data for research purposes under the open-source license guidelines. In that open spirit, the challenge questions below have been formulated for focus, feasibility, and scope.
The Jupyter notebooks show examples of how researchers interact with the knowledge graph: some develop their own queries to learn about new developments and research, while others try to postulate ideas from “predicted” links. How can one use AI to traverse such knowledge graphs intuitively and intelligently? An ambitious example is described below.
Example 1: Context prompt
Context / Search terms: “Ritonavir” “COVID-19” “spike-protein”
Answer: “Ritonavir is a drug that was under clinical trials for COVID-19 treatment <reference to clinical trial>. There is research supporting that Ritonavir interacts with the spike-protein of the COVID-19 virus etc. <reference>. …. There is more information about the “spike protein” of COVID-19 in the knowledge graph. The spike protein has the following 3D structure <reference to file> and the following amino acid sequence <….>.”
This challenge is to improve pre-trained AI models such as ChatGPT, Llama, and Alpaca so that they avoid hallucinations and instead provide grounded answers based on facts from curated knowledge. This could be accomplished using smarter graph traversal algorithms and/or state-of-the-art graph neural networks that augment generative AI.
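One possible reading of the traversal-based grounding described above is sketched here: starting from the search terms, walk the graph breadth-first for a bounded number of hops, collect the facts along with their sources, and place them in the prompt handed to a language model. This is an illustrative sketch only. The facts, source labels, and the `grounded_facts` and `build_prompt` helpers are hypothetical, and no actual model is called.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple

Fact = Tuple[str, str, str, str]  # (subject, predicate, object, source)

# Toy facts invented for illustration; not taken from the challenge dataset.
FACTS: List[Fact] = [
    ("DrugX", "studied_in", "Trial-001", "source-A"),
    ("Trial-001", "investigates", "DiseaseY", "source-A"),
    ("DrugX", "interacts_with", "ProteinZ", "source-B"),
    ("ProteinZ", "encoded_by", "VirusV", "source-C"),
    ("DrugW", "treats", "DiseaseQ", "source-D"),
]


def build_index(facts: Iterable[Fact]) -> Dict[str, List[Fact]]:
    """Map every node to the facts in which it is the subject or the object."""
    index: Dict[str, List[Fact]] = {}
    for fact in facts:
        subject, _, obj, _ = fact
        index.setdefault(subject, []).append(fact)
        index.setdefault(obj, []).append(fact)
    return index


def grounded_facts(terms: Iterable[str], facts: Iterable[Fact],
                   max_hops: int = 1) -> List[Fact]:
    """Breadth-first walk from the search terms, keeping facts within max_hops."""
    index = build_index(facts)
    start = list(terms)
    seen_nodes = set(start)
    seen_facts = set()
    collected: List[Fact] = []
    queue = deque((term, 0) for term in start)
    while queue:
        node, depth = queue.popleft()
        if depth >= max_hops:
            continue  # do not expand beyond the hop budget
        for fact in index.get(node, []):
            if fact not in seen_facts:
                seen_facts.add(fact)
                collected.append(fact)
            subject, _, obj, _ = fact
            for neighbour in (subject, obj):
                if neighbour not in seen_nodes:
                    seen_nodes.add(neighbour)
                    queue.append((neighbour, depth + 1))
    return collected


def build_prompt(question: str, facts: Iterable[Fact]) -> str:
    """Format the retrieved facts, with sources, as context for a model."""
    lines = [f"- {s} {p} {o} [{src}]" for s, p, o, src in facts]
    return ("Answer using only the facts below and cite the bracketed source "
            "of each fact you use.\nFacts:\n" + "\n".join(lines)
            + f"\nQuestion: {question}")


context = grounded_facts(["DrugX"], FACTS, max_hops=2)
prompt = build_prompt("What is known about DrugX?", context)
print(prompt)
```

Bounding the number of hops keeps the context small and tied to the search terms; a learned ranking (for example from a graph neural network) could replace the plain breadth-first order.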
Example 2: Q&A prompt
Question: I am a researcher working on natural supplements. I want a list of natural products considered for COVID-19 treatment.
Answer: “Here is a list of 10 natural compounds studied in the context of COVID-19.”
1. Turmeric
2. Colchicine
3. ….
Question: Tell me more about Colchicine and COVID-19
Answer: Colchicine was part of the COVID-19 trials conducted by the Montreal Heart Institute. Prior research of colchicine for gout revealed its potential to mitigate or prevent inflammation-associated manifestations of the disease…..
This challenge aims at infusing generative AI into data-driven discovery workflows on knowledge graphs. Following the examples provided along with the data, implement all or part of the following workflow/query using open-source pre-trained AI models for drug discovery. The goal is the ability to synthesize new molecules and drugs and to collate in-silico evidence, using AI to reason and hypothesize.
Example 3: Workflow
Select <drug> from <DataSet> where
    <drug> is AI_generated (new_molecule)     // Molecular GAN model (e.g., MolGAN)
    <drug> has 3D structure                   // 3D structure prediction model (e.g., AlphaFold)
    <drug> docks with <protein> in 3D         // 3D-3D alignment model (e.g., 3DRegNet)
    <drug> interacts with <protein>           // Ligand-Protein prediction model (e.g., DeepBind)
    <protein> is like <viral protein = COVID> // Protein similarity model (e.g., hmmsearch, Smith-Waterman)
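The workflow above can be read as a chain of filters, each backed by a different model. The sketch below shows one way to wire such a chain in Python so that each stage can be swapped independently. It is a structural illustration under stated assumptions: the `Workflow` and `Candidate` names are hypothetical, and the lambda stand-ins are toy placeholders, not calls to MolGAN, AlphaFold, 3DRegNet, DeepBind, or hmmsearch.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Candidate:
    drug: str
    protein: str


@dataclass
class Workflow:
    """Each field is a pluggable stage; real models would be wrapped here."""
    generate: Callable[[], Iterable[str]]      # stand-in for a molecular generator
    has_structure: Callable[[str], bool]       # stand-in for 3D structure prediction
    docks: Callable[[str, str], bool]          # stand-in for 3D-3D alignment
    interacts: Callable[[str, str], bool]      # stand-in for ligand-protein prediction
    similar_to_target: Callable[[str], bool]   # stand-in for protein similarity search

    def run(self, proteins: Iterable[str]) -> List[Candidate]:
        # Keep only proteins that resemble the viral protein of interest.
        targets = [p for p in proteins if self.similar_to_target(p)]
        hits: List[Candidate] = []
        for drug in self.generate():
            if not self.has_structure(drug):
                continue
            for protein in targets:
                # A candidate must pass both the docking and interaction checks.
                if self.docks(drug, protein) and self.interacts(drug, protein):
                    hits.append(Candidate(drug, protein))
        return hits


# Toy stand-ins with hard-coded answers, invented for illustration only.
toy = Workflow(
    generate=lambda: ["mol-1", "mol-2", "mol-3"],
    has_structure=lambda d: d != "mol-3",
    docks=lambda d, p: (d, p) in {("mol-1", "prot-A"), ("mol-2", "prot-B")},
    interacts=lambda d, p: d == "mol-1",
    similar_to_target=lambda p: p in {"prot-A", "prot-B"},
)

hits = toy.run(["prot-A", "prot-B", "prot-C"])
print(hits)
```

Running the protein similarity filter first keeps the more expensive docking and interaction checks off proteins that are not of interest; with real models, the stage order would likely be tuned by cost.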
Resources
References