The 2020 Smoky Mountains Computational Sciences and Engineering Conference (SMC2020) is hosting its fourth annual Data Challenge. For this event, we have enlisted research scientists from across Oak Ridge National Laboratory (ORNL) to be data sponsors and help create data analytics challenges for eminent data sets at the laboratory. The role of our data sponsors is to provide a significant data set and formulate 3 to 5 challenge questions associated with it. The challenge questions for each data set cover multiple difficulty levels: the first question in each challenge is suitable for a novice, each subsequent question increases in difficulty, and the series ends with an advanced/expert-level question. These challenges are intended to draw participants ranging from scientists and researchers who are at the beginning stages of incorporating data analytics into their workflow to data analytics experts who are interested in applying novel techniques to data sets of national importance. This year there are 7 data challenges, and a team of up to 4 members may take on any of these challenges.
SMC2020 Data Challenge registration is now open!
The top teams from each challenge will be selected to present a virtual poster on August 26, 2020 at virtual SMC2020, where the overall winner will be chosen. Selections will be made on the basis of a peer review of the solution papers. Papers will be peer-reviewed and judged by the program committee based on how well they cover these aspects of the work. A selected set of papers will become finalists, and their authors will be invited to extend the work by incorporating reviewer feedback. Authors of the updated papers will be notified of final acceptance on September 1, and the final versions are due by the camera-ready deadline. Selected contributions are planned to be published in the SMC2020 proceedings in a CCIS Springer volume. Selected teams may be required to submit a 3-minute phone video describing their solution that will be hosted on YouTube.
The Data Challenge provided by Garrett Granroth, Pete Peterson, and Wenduo Zhou focuses on analyzing a temperature log against time-stamped neutron events collected on the Vulcan beamline at the Spallation Neutron Source. Research for the team’s dataset, with application to additive manufacturing of complex structures and similar processes, involved continuously recording neutron scattering from a sample during rapid heating and cooling cycles, in situ, on a diffractometer at the Spallation Neutron Source. The group’s dataset contains information about phase transformations and corresponding residual stresses within the sample as functions of temperature and time over the entire set of cycles. The group’s Data Challenge sprang from Wenduo’s observation that the temperature log data contained repetitive pulses in temperature that were difficult to separate into distinct occurrences because of background noise in the data.
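One common way to separate repeated pulses from a noisy log is to smooth the signal and then apply two thresholds (hysteresis), so that noise near a single threshold does not split one pulse into several. The following is a minimal Python sketch of that idea, not the sponsors' method: the synthetic temperature values, the window size, and the thresholds are all illustrative assumptions, and the helper names `moving_average` and `find_pulses` are hypothetical.

```python
import random


def moving_average(values, window):
    """Smooth a sequence with a centered moving average (edges use a shorter window)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def find_pulses(values, high, low):
    """Return (start, end) index pairs using hysteresis thresholds.

    A pulse starts when the signal rises above `high` and ends when it
    falls below `low`; the gap between the two thresholds keeps noise
    from splitting one pulse into several.
    """
    pulses = []
    start = None
    for i, v in enumerate(values):
        if start is None and v > high:
            start = i
        elif start is not None and v < low:
            pulses.append((start, i))
            start = None
    if start is not None:
        # The log ended while a pulse was still in progress.
        pulses.append((start, len(values)))
    return pulses


# Synthetic stand-in for a temperature log (assumed values, not real data):
# three heating pulses on a 300-unit baseline with Gaussian noise.
random.seed(0)
baseline, peak = 300.0, 900.0
log = []
for t in range(600):
    in_pulse = 50 <= (t % 200) < 120
    log.append((peak if in_pulse else baseline) + random.gauss(0.0, 40.0))

smoothed = moving_average(log, window=15)
pulses = find_pulses(smoothed, high=700.0, low=500.0)
print(len(pulses), pulses)
```

The detected index ranges could then be mapped to time stamps to group the neutron events that belong to each heating cycle. On real data, the window and thresholds would need to be tuned to the actual noise level and pulse shape.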
While the neutron science field has collected experimental data for a long time and Garrett, Pete, and Wenduo routinely apply data analysis in their work, they acknowledge that datasets are getting bigger, making it difficult to eyeball data for patterns and phenomena, and that they may have biases that affect their ability to recognize new, potentially beneficial methodologies. The three scientists are interested in approaches this year’s data challengers might use to examine how sample environments affect neutron event research and how researchers can get a clear view of data when there is noise in the sample environment. They believe that scientific computing can relieve the burden of eyeballing large datasets and that newer techniques such as machine learning, artificial intelligence, and pattern recognition can reduce work and improve the quality of information feeding back into experiments. The scientists hope their challenge will contribute to a better understanding of how stresses are retained in various engineering materials after repeated heating and cooling cycles.
Nouamane, Junqi, and Albina are collaborating on a 2019 INCITE (Innovative and Novel Computational Impact on Theory and Experiment) project aimed at using deep learning techniques to decode material properties. Albina and Nouamane have been working together for 2 years to fill in gaps in our understanding of solid-state crystalline structures with modeling and simulation, and Nouamane and Junqi have been working together on simulation projects on ORNL’s Summit supercomputer since it went online in 2018. The group’s simulations allow them to predict what electron diffraction patterns for any solid-state crystalline material will look like and what variations will exist in the actual materials.
The dataset associated with Albina, Junqi, and Nouamane’s Data Challenge contains multidimensional images of electron diffraction pattern simulations. Their challenge asks participants to build a machine learning algorithm that accurately predicts materials’ crystal structures; can be tweaked to account for imbalances in the dataset, such as too few data points for a certain type of structure; and can multitask to predict material structures while reducing the effects of imbalances in the data. The dataset addresses more than 60,000 types of materials, so insights culled from modeling and simulation would be impactful to the materials science field.
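One standard way to account for too few data points in some classes is to weight the training loss by inverse class frequency. The sketch below illustrates that idea in plain Python under stated assumptions: the class names, the toy predictions, and the helper names `balanced_class_weights` and `weighted_cross_entropy` are hypothetical, and this is not presented as the approach the sponsors expect.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by N / (K * n_c), so rarer classes get larger weights."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}


def weighted_cross_entropy(probabilities, labels, weights):
    """Weighted mean of -log p(true class), normalized by the sum of weights."""
    total_loss = 0.0
    total_weight = 0.0
    for probs, label in zip(probabilities, labels):
        w = weights[label]
        total_loss += -w * math.log(probs[label])
        total_weight += w
    return total_loss / total_weight


# Hypothetical labels: 8 "cubic" samples but only 2 "hexagonal" ones.
labels = ["cubic"] * 8 + ["hexagonal"] * 2
weights = balanced_class_weights(labels)

# Toy predictions from a model that always favors the majority class.
predictions = [{"cubic": 0.9, "hexagonal": 0.1} for _ in labels]

uniform = {cls: 1.0 for cls in weights}
uniform_loss = weighted_cross_entropy(predictions, labels, uniform)
balanced_loss = weighted_cross_entropy(predictions, labels, weights)
print(weights)
print(uniform_loss, balanced_loss)
```

With balanced weights, a model that ignores the rare class is penalized more heavily than under the unweighted loss, which is the effect the weighting is meant to achieve.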
Nouamane points out that machine learning techniques are developing quickly and continuously changing, so they are excited to see what fresh approaches data challengers will apply to their challenge. Albina believes that successful completion of the group’s data challenge will be an important step in automating material structure determination and classification during electron microscopy experiments, which will improve research efficiency and enable new discoveries. There are consortium efforts in industry to improve deep learning on popular datasets such as ImageNet, Junqi says, but this resource is lacking in the materials science community. Their dataset represents a great opportunity to create a deep learning benchmark for materials science research.
Melissa, Srinath, Kuldeep, Jibo, and Anne apply their interest and expertise in climate science and high-performance computing to questions that lie at the intersection of environment and urban infrastructure. The dataset was generated under a Laboratory Directed Research and Development project aimed at examining the impact of an area’s built environment on weather and energy use. The dataset contains a year of weather data taken at 15-minute intervals in a section of downtown Chicago; the latitude/longitude location for each building in the study area, its 2D footprint, and height; and a year of simulation data from an EnergyPlus building-by-building assessment of energy use. Joshua New of Oak Ridge National Laboratory’s (ORNL’s) Energy and Environmental Sciences Directorate participated in the simulations.
The group’s Data Challenge will allow participants to examine variations in weather and building energy use, seasonal influences, and the building types most impacted by external factors such as weather at daily, monthly, and yearly scales. They look forward to being presented with novel methods for interpreting and visualizing their data that draw on machine learning and other big data techniques, and they would welcome new collaborations to complement their work to understand climate, infrastructure, and energy use in urban areas from a systems perspective. The group hopes participants enjoy the interdisciplinary nature of their dataset and the challenges reflected in it.
The team participated as data sponsors in last year’s Data Challenge. In addition to Joshua New, the researchers wish to acknowledge Mark Adams of ORNL’s National Security Sciences Directorate for his contributions to the research that generated the data associated with their Data Challenge.
The data sponsors bring varied perspectives to the transportation analysis project underlying this Data Challenge. Anne, Srinath, Kuldeep, Melissa, and Jibo put their expertise in high-performance computing, transportation, climate science, machine learning, data science, and visualization to work understanding the interaction between transportation planning and climate, and enjoy learning from each other. Their cross-disciplinary work on this and other projects reflects the Computational Urban Sciences Group's interest in using new computational, visualization, and smart sensor technologies to examine how traffic flow can be made more energy efficient.
The team has donated a dataset that includes travel surveys, simulated traffic, simulated emissions, land use, buildings, and socioeconomic data. They encourage participants to explore the available data to answer questions on travel demand, climate impact, and travel patterns.
BP uses physics-based modeling to better understand the Earth’s subsurface as it works to locate new energy sources. BP engineers use the known locations of energy sources in the Earth’s subsurface as a starting point, moving out from those sources to areas that may contain undeveloped stores. Describing traditional seismic survey methods as analogous to taking a CAT scan of the Earth, Keith, Max, Anar, Madhav, and Xukai say the company would like to begin using machine learning techniques to augment the processing of the huge amounts of information the company amasses as it explores the subsurface. They believe that taking advantage of big data techniques will allow BP to more readily identify the most appropriate solutions to its business questions.
Uncertainties resulting from errors in human and instrument measurement, approximations of physical processes, and large computational expense make modeling the Earth’s subsurface difficult. BP’s Data Challenge asks participants to develop an uncertainty map from the company’s dataset. An uncertainty map lays out what is known and not known about a given area where BP chooses to explore for an energy source. In addition to creating the map, Keith, Max, Anar, Madhav, and Xukai would like data challengers to develop methods for gauging how accurate subsurface models are.
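One simple way to picture an uncertainty map is as the per-cell spread across an ensemble of candidate subsurface models: cells where the models agree have low uncertainty, and cells where they disagree have high uncertainty. The Python sketch below illustrates only that general idea. The tiny grids and their values are invented for illustration, the helper name `uncertainty_map` is hypothetical, and BP's dataset and preferred methods may differ.

```python
import statistics


def uncertainty_map(realizations):
    """Per-cell mean and standard deviation across an ensemble of 2D grids."""
    n_rows = len(realizations[0])
    n_cols = len(realizations[0][0])
    mean_map = [[0.0] * n_cols for _ in range(n_rows)]
    std_map = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            cell_values = [grid[r][c] for grid in realizations]
            mean_map[r][c] = statistics.fmean(cell_values)
            # Population standard deviation of the ensemble at this cell.
            std_map[r][c] = statistics.pstdev(cell_values)
    return mean_map, std_map


# Three hypothetical model realizations on a 2 x 3 grid (invented values).
ensemble = [
    [[2.0, 2.5, 3.0], [3.5, 4.0, 4.5]],
    [[2.0, 2.7, 3.0], [3.5, 4.4, 4.5]],
    [[2.0, 2.3, 3.0], [3.5, 3.6, 4.5]],
]
mean_map, std_map = uncertainty_map(ensemble)
for row in std_map:
    print([round(v, 3) for v in row])
```

In this toy ensemble, the models disagree only in the middle column, so that is where the standard deviation is nonzero. Comparing such a spread against held-out measurements is one possible route to gauging model accuracy.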
Keith, Max, Anar, Madhav, and Xukai are excited to offer a real-world problem with any number of potential solutions. They believe this open-ended quality will make their challenge interesting for students, experienced statisticians, and everyone in between. The application of machine learning methods is growing in their industry, and they look forward to the opportunity to brainstorm ideas with people coming from a range of disciplines and technical sciences. Collaborating with scientists at Oak Ridge National Laboratory through the Data Challenge will help the company prepare to harness the power of exascale computing in the future. Keith has attended the Smoky Mountains Computational Sciences and Engineering Conference for 4 years. Max attended the conference last year and participated in the Data Challenge as a data challenger. Anar, Madhav, and Xukai are new to the conference and Data Challenge.
Ioana and Gil collaborate on projects that apply Oak Ridge National Laboratory’s (ORNL’s) strengths in high-performance computing to questions related to clinical trials posed by the US Department of Veterans Affairs, the Presidential Innovation Fellow (PIF) program, the US Department of Health and Human Services (HHS), and the National Cancer Institute. Most recently, Gina and Ioana led the ORNL team that participated in the Health Tech Sprint, for which groups delivered artificial intelligence (AI) and data-driven solutions for challenges related to cancer and other diseases. They also have presented ORNL SMARTClinicalTrials at the White House and at the Industries of the Future conference, as part of the HHS/PIF booth at Tech Day in collaboration with the White House Office of Science and Technology Policy and the CIO Council at the Department of Labor.
Ioana and Gil have provided three datasets derived from the Health Tech Sprint: (1) annotated clinical trials inclusion criteria, (2) patient medical data, and (3) clinician-ranked clinical trials matched with patients. A second version of the third dataset, produced by oncology professionals, serves as a comparison dataset for the matches identified through the application of AI.
Ioana and Gil are excited to be data sponsors for this Data Challenge because the event aligns with their interest in engaging a broader swath of the biomedical, computing, and data science communities and in building a bridge between ORNL and interdisciplinary health sciences communities. Their Data Challenge reflects an unresolved cancer care delivery challenge they would like to help address, and they look forward to seeing the innovations data challengers suggest. They hope to engage experienced data analysts in one of the most challenging and high-priority problems in cancer research, and they hope budding biomedical and data scientists will grasp how satisfying it can be to pursue interdisciplinary research questions of such great societal impact.
As governments, policymakers, and scientists across the globe race to identify potential vaccines and drugs for SARS-CoV-2, many scientists hope the information needed to identify a vaccine lies in the millions of available research documents. To support mining information from research literature, the White House, along with leading industries, has made a dataset of research publications directly related to the outbreak available to the general public. Some of the most important questions pertaining to the outbreak, which were identified by the US NASEM and the WHO, were published as part of a public challenge along with the publication dataset on Kaggle.
We invite submissions describing complete or partial solutions to any of the 10 Kaggle COVID-19 Open Research Dataset Challenge (CORD-19) Tasks to SMCDC for consideration for a best solution paper award, poster presentation, and publication in the conference proceedings.
The Kaggle CORD-19 Challenge is a separate challenge from SMCDC and can be registered for at https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge.
The SMCDC poster session will give selected researchers working with the CORD-19 dataset a place to present their work and discuss it with other researchers. Submissions will still need to follow the SMC Data Challenge format of submitting a 6-8 page paper describing your partial or complete solution. Please see the call for papers for the full submission instructions. Selected submissions will be published in the conference proceedings.