Challenge 5 – AI-Driven Discovery using Science Knowledge Graphs

Sreenivas R. Sukumar, Chris D. Rickett, Kristyn J. Maschhoff, Michael S. Woodacre

Hewlett Packard Enterprise
Corresponding data sponsor email:



In this data challenge, we aim to use state-of-the-art artificial intelligence to reason and hypothesize on a multi-modal knowledge graph. More specifically, contenders will work on an expert-curated Life-Sciences knowledge graph of proteins, molecules, protein-protein interactions, and known biochemical pathways of disease mechanisms. We provide two datasets: a sampled dataset focused only on the COVID-19 virus (<5 GB) and a comprehensive dataset (~30 TB). The comprehensive dataset includes over 150 billion medical facts/properties around ~4 million protein sequences and ~36 million drug compounds. Building on resources from scientists' prior success with this dataset (i.e., implementations of drug repurposing workflows in Jupyter notebooks and access to query engines for knowledge graphs), the challenge put forth is the following:

  1. Can we build a prompt-driven, fact-based Q&A engine (i.e., ChatGPT-style models and plug-ins) for such science knowledge graphs?
  2. Can we leverage open-source AI models (such as AlphaFold, MolGAN, DeepBind, and other graph neural network models) in a workflow to (i) hypothesize new drug compounds and (ii) predict the efficacy of drug-disease treatments?

Keywords: Scientific discovery, artificial intelligence, knowledge graph, generative models, knowledge-centric conversational AI, knowledge-centric large language models

1 Introduction

This challenge extends the proven value of knowledge graphs for the drug repurposing problem described in [1-5] by offering a Life-Sciences knowledge graph. The integrated Life-Sciences knowledge graph, assembled to study potential drug repurposing candidates for COVID-19, was generated from a collection of publicly available databases commonly used in life sciences and systems biology research, bringing together over 150 billion facts (see Table 1).


The code examples provided with the dataset walk through how the dataset was used for (i) finding the potential cross-immunity between Tetanus and COVID-19 and (ii) shortlisting 66 drug molecules for prioritized clinical trials for COVID-19 and its variants. The sample code works on both the smaller COVID-19 dataset and the comprehensive dataset. While the datasets can be used in their raw form, we will also provide access to a query engine, packaged as a Singularity container, that renders the knowledge graph in memory for interactive traversal and exploratory investigation.


2 Data Description

2.1 Acquisition

The dataset results from integrating several well-known scientific databases collected, curated, and made accessible by institutions such as the National Institutes of Health and the European Bioinformatics Institute. Data descriptions for these datasets are available in references [6-13] below. Each of these multi-modality datasets can also be individually accessed via the links provided in the table below. The weblinks also provide data dictionaries and descriptions.

Table 1. List of open-source databases integrated into the knowledge graph.

Dataset         Release        Reference
UniProt         Mar 2020       [6]
PubChemRDF      v1.6.3 beta    [7]
ChEMBL-RDF      27.0           [8]
Bio2RDF         Release 4      [9]
OrthoDB         v10            [10]
Biomodels       r31            [11]
Biosamples      v20191125      [12]
Reactome        r71            [13]
ClinicalTrials  -              [14]


SMC2022 Github Page: <>

Dataset Download (DOI): <to be published soon>
Dataset Companion Github: <to be published soon>
Resource Companion: <to be shared soon>

2.2 Characterizations

  • Dataset – COVID-19: The format provided will be a simple text file with three columns of the form [<subject> <predicate> <object>], following the W3C standards of the Resource Description Framework (RDF). The size of the dataset is < 5 GB, and it is intended as a sample for development, code-testing, and algorithm/model experiments, while capturing as much of the data relevant to COVID-19 as possible. The accompanying Jupyter notebook on drug repurposing for COVID-19 provides examples of how to traverse and navigate the dataset.
  • Dataset – KnowledgeGraph360: This comprehensive dataset follows the same format as above and incorporates dataset updates until the year 2022. We recommend that this dataset be staged/hosted on a high-performance filesystem and accessed via the tools/query engines provided.
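As a minimal sketch of reading the three-column format described above, the Python snippet below streams triples from any iterable of lines. It assumes an N-Triples-like layout (whitespace-separated terms, an optional trailing " ." terminator, and an object that may contain spaces); the exact serialization of the released files is not specified here, so treat these as assumptions. The names Triple and parse_triples and the example.org identifiers are illustrative only and do not come from the dataset.

```python
from typing import Iterable, Iterator, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def parse_triples(lines: Iterable[str]) -> Iterator[Triple]:
    """Yield one Triple per non-empty, non-comment line."""
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Assumption: statements may end with the N-Triples " ." terminator.
        if line.endswith(" ."):
            line = line[:-2].rstrip()
        # Subject and predicate contain no whitespace; the object may
        # (for example a quoted literal), so split at most twice.
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip malformed lines instead of aborting a long scan
        yield Triple(*parts)


# Illustrative input only; these identifiers are not taken from the dataset.
sample = [
    "# a comment line",
    "<http://example.org/drug/1> <http://example.org/targets> <http://example.org/protein/9> .",
    '<http://example.org/drug/1> <http://www.w3.org/2000/01/rdf-schema#label> "Example drug" .',
]
triples = list(parse_triples(sample))
for t in triples:
    print(t.subject, t.predicate, t.obj)
```

Because parse_triples is a generator, the same function can be passed an open file handle and will process one line at a time, which matters for a file that does not fit in memory.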
3 Challenge Questions


These open-source datasets in the knowledge graph are accessed and used by thousands of researchers across the world for biomedical research and innovation. We encourage such creative use of the data for research purposes under the open-source license guidelines. In that open spirit, the challenge questions below have been formulated for focus, feasibility, and scope.

  • Traversing and Querying the Multi-Modality Knowledge Graph

The Jupyter notebooks show examples of how researchers interact with the knowledge graph: some develop their own queries to learn about new developments and research, while others try to postulate ideas from “predicted” links. How can one use AI to traverse such knowledge graphs intuitively and intelligently? An ambitious example is described below.

  • Design a ChatGPT-like plug-in to traverse Science Knowledge Graphs
Example 1: Context prompt

Context / Search terms: “Ritonavir” “COVID-19” “spike-protein”

Answer: “Ritonavir is a drug that was under clinical trials for COVID-19 treatment <reference to clinical trial>. There is research supporting that Ritonavir interacts with the spike protein of the COVID-19 virus etc. <reference>. … There is more information about the spike protein of COVID-19 in the knowledge graph. The spike protein has the following 3D structure <reference to file> and the following amino acid sequence <….>.”
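One simple baseline for the context prompt in Example 1 is to rank facts by how many search terms they mention before handing them to a language model. The sketch below does this with plain case-insensitive substring matching over (subject, predicate, object) tuples. The function name rank_facts and the ex: identifiers are hypothetical, and the toy facts are illustrative rather than taken from the knowledge graph. A real system would more likely resolve terms to entity identifiers and use the provided query engine.

```python
from typing import Iterable, List, Tuple

Fact = Tuple[str, str, str]  # (subject, predicate, object)


def rank_facts(facts: Iterable[Fact], terms: List[str]) -> List[Tuple[int, Fact]]:
    """Return (hits, fact) pairs, facts matching the most search terms first."""
    lowered = [t.lower() for t in terms]
    scored = []
    for fact in facts:
        text = " ".join(fact).lower()
        hits = sum(1 for t in lowered if t in text)
        if hits:
            scored.append((hits, fact))
    # sorted() is stable, so facts with equal hit counts keep their input order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy facts for illustration only; not drawn from the dataset.
toy_facts = [
    ("ex:Ritonavir", "ex:studiedIn", "ex:Trial_COVID-19_A"),
    ("ex:Ritonavir", "ex:type", "ex:Drug"),
    ("ex:Aspirin", "ex:type", "ex:Drug"),
    ("ex:spike-protein", "ex:partOf", "ex:COVID-19_virus"),
]
ranked = rank_facts(toy_facts, ["Ritonavir", "COVID-19", "spike-protein"])
for hits, fact in ranked:
    print(hits, fact)
```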

  • Build a fact-checked Q&A Engine for Science Knowledge Graphs

This challenge is to improve pre-trained AI models such as ChatGPT, Llama, Alpaca, etc. so that they avoid hallucinations and instead provide answers grounded in facts from curated knowledge. This could be accomplished using smarter graph traversal algorithms and/or state-of-the-art graph neural networks that augment generative AI.

Example 2: Q&A prompt

Question: I am a researcher working on natural supplements. I want a list of natural products considered for COVID-19 treatment.

Answer: “Here is a list of 10 natural compounds studied in the context of COVID-19.”

1. Turmeric
2. Colchicine
3. …

Question: Tell me more about Colchicine and COVID-19

Answer: Colchicine was part of the COVID-19 trials conducted by the Montreal Heart Institute. Prior research of colchicine for gout revealed its potential to mitigate or prevent inflammation-associated manifestations of the disease …
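One common way to ground a generative model in curated facts is to place the retrieved facts in the prompt and ask the model to cite them. The sketch below only builds such a prompt and does not call any model. The function name build_grounded_prompt, the instruction wording, and the ex: identifiers are assumptions made for illustration, and the toy facts are not taken from the knowledge graph.

```python
from typing import List, Tuple

Fact = Tuple[str, str, str]  # (subject, predicate, object)


def build_grounded_prompt(question: str, facts: List[Fact]) -> str:
    """Compose a prompt that restricts the answer to numbered facts."""
    lines = [
        "Answer the question using only the numbered facts below.",
        "Cite fact numbers in square brackets; if the facts are insufficient, say so.",
        "",
        "Facts:",
    ]
    # Number the facts so that the answer can reference them explicitly.
    for i, (s, p, o) in enumerate(facts, start=1):
        lines.append(f"[{i}] {s} {p} {o}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


# Toy facts for illustration only; not drawn from the dataset.
toy_facts = [
    ("ex:Colchicine", "ex:studiedIn", "ex:COVID-19_trial"),
    ("ex:Colchicine", "ex:usedFor", "ex:Gout"),
]
prompt = build_grounded_prompt("Tell me more about Colchicine and COVID-19", toy_facts)
print(prompt)
```

Numbering the facts makes it possible to check afterwards that every citation in the model's answer points to a fact that was actually supplied.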

  • Implement an AI-driven drug hypothesizer

This challenge aims at infusing generative AI into data-driven discovery workflows on knowledge graphs. Following the examples provided along with the data, implement all or part of the following workflow/query using open-source pre-trained AI models in drug discovery. The goal is the ability to synthesize new molecules and drugs and to collate in-silico evidence, using AI to reason and hypothesize.

Example 3: Workflow

Select <drug> from <DataSet>
<drug> is AI_generated (new_molecule)        // Molecular GAN model (e.g., MolGAN)
<drug> has 3D structure                      // 3D structure prediction model (e.g., AlphaFold)
<drug> docks with <protein> in 3D            // 3D-3D alignment model (e.g., 3DRegNet)
<drug> interacts with <protein>              // Ligand-Protein prediction model (e.g., DeepBind)
<protein> is like <viral protein = COVID>    // Protein similarity model (e.g., hmmsearch, Smith-Waterman)
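The last step of the workflow (protein similarity) can be sketched without a pretrained model. The snippet below is a minimal Smith-Waterman local alignment score with a linear gap penalty, plus a filter that keeps proteins scoring above a threshold against a reference sequence. It uses a flat match/mismatch score as a simplifying assumption, where real protein comparisons would normally use a substitution matrix such as BLOSUM62. The function names, scoring parameters, threshold, and toy sequences are all illustrative. The generative, structure prediction, docking, and interaction steps require the models named above and are not shown.

```python
from typing import Dict, List


def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def filter_similar(candidates: Dict[str, str], reference: str,
                   min_score: int) -> List[str]:
    """Names of candidate proteins whose score against reference is high enough."""
    return [name for name, seq in candidates.items()
            if smith_waterman_score(seq, reference) >= min_score]


# Toy sequences for illustration only; not real proteins from the dataset.
reference = "MKTAYIAK"
candidates = {"p1": "GGMKTAYIAKGG", "p2": "WWWWWWWW"}
similar = filter_similar(candidates, reference, min_score=10)
print(similar)
```

Keeping only two rows of the table holds memory proportional to the shorter dimension, at the cost of not being able to recover the alignment itself, which this filter does not need.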



  • Singularity container of the query engine, launchable on a laptop, cloud VMs, or HPC resources
  • Slack channel communications with data sponsors for dataset-related questions
  • Data, tools and code walkthrough (½ day Workshop + video recording)
  • Data science mentoring and coaching



  1. S. Sadegh et al., “Exploring the SARS-CoV-2 virus-host-drug interactome for drug repurposing,” Nature Communications, Vol. 11, 2020.
  2. Zhou, Yadi, et al. “Network-based drug repurposing for novel coronavirus 2019-nCoV/SARS-CoV-2.” Cell Discovery 6.1 (2020): 1-18.
  3. Gysi, Deisy Morselli, et al. “Network medicine framework for identifying drug repurposing opportunities for COVID-19.” arXiv preprint arXiv:2004.07229 (2020).
  4. , last accessed 2021/06/01.
  5. Sukumar, S. R., Balma, J. A., Rickett, C. D., Maschhoff, K. J., Landman, J., Yates, C. R., … & Khan, I. A. (2022). The convergence of HPC, AI and Big Data in rapid-response to the COVID-19 pandemic. In Driving Scientific and Engineering Discoveries Through the Integration of Experiment, Big Data, and Modeling and Simulation: 21st Smoky Mountains Computational Sciences and Engineering, SMC 2021, Virtual Event, October 18-20, 2021, Revised Selected Papers (pp. 157-172). Cham: Springer International Publishing.
  6. UniProt Consortium. (2018). UniProt: the universal protein knowledgebase. Nucleic Acids Research, 46(5), 2699.
  7. Kim, S., Chen, J., Cheng, T., Gindulyte, A., He, J., He, S., … & Bolton, E. E. (2019). PubChem 2019 update: improved access to chemical data. Nucleic Acids Research, 47(D1), D1102-D1109.
  8. Mendez, D., Gaulton, A., Bento, A. P., Chambers, J., De Veij, M., Félix, E., … & Leach, A. R. (2019). ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Research, 47(D1), D930-D940.
  9. Belleau, F., Nolin, M. A., Tourigny, N., Rigault, P., & Morissette, J. (2008). Bio2RDF: towards a mashup to build bioinformatics knowledge systems. Journal of Biomedical Informatics, 41(5), 706-716.
  10. Kriventseva, E. V., Kuznetsov, D., Tegenfeldt, F., Manni, M., Dias, R., Simão, F. A., & Zdobnov, E. M. (2019). OrthoDB v10: sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs. Nucleic Acids Research, 47(D1), D807-D811.
  11. Malik-Sheriff, R. S., Glont, M., Nguyen, T. V., Tiwari, K., Roberts, M. G., Xavier, A., … & Hermjakob, H. (2020). BioModels—15 years of sharing computational models in life science. Nucleic Acids Research, 48(D1), D407-D415.
  12. Jupp, S., Malone, J., Bolleman, J., Brandizi, M., Davies, M., Garcia, L., … & Jenkinson, A. M. (2014). The EBI RDF platform: linked open data for the life sciences. Bioinformatics, 30(9), 1338-1339.
  13. Fabregat, A., Jupe, S., Matthews, L., Sidiropoulos, K., Gillespie, M., Garapati, P., … & D’Eustachio, P. (2018). The Reactome pathway knowledgebase. Nucleic Acids Research, 46(D1), D649-D655.
  14. “,” 2020, [Online]. Available:, last accessed 6/3/2020.