The 2020 Smoky Mountains Computational Sciences and Engineering Conference (SMC2020) is hosting its fourth annual Data Challenge. For this event, we have enlisted research scientists from across Oak Ridge National Laboratory (ORNL) to serve as data sponsors and help create data analytics challenges for prominent data sets at the laboratory. The role of our data sponsors is to provide a significant data set and formulate 3 to 5 challenge questions associated with it. The challenge questions for each data set cover multiple difficulty levels: the first question in each challenge is suitable for a novice, each question thereafter increases in difficulty, and the series ends with an advanced/expert-level question. These challenges are intended to draw everyone from scientists and researchers at the beginning stages of incorporating data analytics into their workflows to data analytics experts interested in applying novel techniques to data sets of national importance. This year there are 6 data challenges, and a team of up to 4 members may take on any of them.
SMC2020 Data Challenge registration is now open!
The top teams from each challenge will be selected to present a poster on August 26, 2020, at the Meadowview Resort in Kingsport, Tennessee, at SMC2020, where the overall winner will be chosen. Need-based requests for domestic travel and lodging will be considered. Selections will be made on the basis of a peer review of the solution papers. The selected solution papers will be peer-reviewed by the program committee and published in the SMC2020 proceedings in a Springer CCIS volume. Selected teams will be required to submit a 2-minute phone video describing their solution that will be hosted on YouTube.
The Data Challenge provided by Garrett Granroth, Pete Peterson, and Wenduo Zhou focuses on analyzing a temperature log against time-stamped neutron events collected on the Vulcan beamline at the Spallation Neutron Source. Research for the team’s dataset, with application to additive manufacturing of complex structures and similar processes, involved continuously recording neutron scattering from a sample during rapid heating and cooling cycles, in situ, on a diffractometer at the Spallation Neutron Source. The group’s dataset contains information about phase transformations and corresponding residual stresses within the sample as functions of temperature and time over the entire set of cycles. The group’s Data Challenge sprang from Wenduo’s observation that the temperature log data contained repetitive pulses in temperature that were difficult to separate into distinct occurrences because of background noise in the data.
While the neutron science field has collected experimental data for a long time and Garrett, Pete, and Wenduo routinely apply data analysis in their work, they acknowledge that datasets are getting bigger, making it difficult to eyeball data for patterns and phenomena, and that they may have biases that affect their ability to recognize new, potentially beneficial methodologies. The three scientists are interested in the approaches this year's data challengers might use to examine how sample environments affect neutron events research and how researchers can get a clear view of data when there is noise in the sample environment. They believe that scientific computing can relieve researchers of the work of eyeballing large datasets and that newer techniques such as machine learning, artificial intelligence, and pattern recognition can reduce work and improve the quality of information feeding back into experiments. The scientists hope their challenge will contribute to a better understanding of how stresses are retained in various engineering materials after repeated heating and cooling cycles.
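As an illustration of the pulse-separation problem Wenduo describes, one conventional baseline is threshold-based peak detection. The sketch below applies it to a synthetic noisy temperature trace; the real log's format, units, and noise level are assumptions here, not properties of the actual Vulcan data.

```python
# A minimal sketch of separating repetitive temperature pulses from background
# noise with scipy's peak finder, on synthetic data standing in for the log.
import numpy as np
from scipy.signal import find_peaks

rng = np.random.default_rng(0)
t = np.linspace(0, 100, 2000)                       # time, arbitrary units
# Nine Gaussian "heating pulses" at regular intervals (centers 10, 20, ..., 90).
pulses = sum(np.exp(-((t - c) ** 2) / 2.0) for c in range(10, 100, 10))
signal = 300 + 50 * pulses + rng.normal(0, 2.0, t.size)  # noisy trace

# A minimum height and minimum separation keep noise wiggles from being counted.
peaks, _ = find_peaks(signal, height=330, distance=100)
print(len(peaks))
```

In practice the `height` and `distance` thresholds would have to be tuned to the actual pulse amplitude and cycle spacing, which is exactly where noisier data makes hand-tuned approaches fragile.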
Nouamane, Junqi, and Albina are collaborating on a 2019 INCITE (Innovative and Novel Computational Impact on Theory and Experiment) project aimed at using deep learning techniques to decode material properties. Albina and Nouamane have been working together for 2 years to fill in gaps in our understanding of solid-state crystalline structures with modeling and simulation, and Nouamane and Junqi have been working together on simulation projects on ORNL's Summit supercomputer since it went online in 2018. The group's simulations allow them to predict what electron diffraction patterns for any solid-state crystalline material will look like and what variations will exist in the actual materials.
The dataset associated with Albina, Junqi, and Nouamane’s Data Challenge contains multidimensional images of electron diffraction pattern simulations. Their challenge asks participants to build a machine learning algorithm that accurately predicts materials’ crystal structures; can be tweaked to account for imbalances in the dataset, such as too few data points for a certain type of structure; and can multitask to predict material structures while reducing the effects of imbalances in the data. The dataset addresses more than 60,000 types of materials, so insights culled from modeling and simulation would be impactful to the materials science field.
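One standard way to "tweak for imbalance," as the challenge puts it, is inverse-frequency class weighting, where underrepresented structure types get larger weight in the training loss. The sketch below computes such weights on synthetic labels; the class counts and label scheme are illustrative, not drawn from the actual dataset.

```python
# A minimal sketch of inverse-frequency ("balanced") class weights for an
# imbalanced crystal-structure classification problem, using synthetic labels.
import numpy as np

labels = np.array([0] * 900 + [1] * 90 + [2] * 10)  # heavily imbalanced classes

classes, counts = np.unique(labels, return_counts=True)
weights = counts.sum() / (len(classes) * counts)     # rare classes weigh more

for c, w in zip(classes, weights):
    print(c, round(w, 3))
```

These per-class weights can be passed to most training frameworks' loss functions, so that a misclassified rare structure costs the model proportionally more than a misclassified common one.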
Nouamane points out that machine learning techniques are developing quickly and continuously changing, so the group is excited to see what fresh approaches data challengers will apply to their challenge. Albina believes that successful completion of the group's data challenge will be an important step in automating material structure determination and classification during electron microscopy experiments, which will improve research efficiency and enable new discoveries. There are consortium efforts in industry to improve deep learning on popular datasets such as ImageNet, Junqi says, but this resource is lacking in the materials science community. Their dataset represents a great opportunity to create a deep learning benchmark for materials science research.
Melissa, Srinath, Kuldeep, Jibo, and Anne apply their interest and expertise in climate science and high-performance computing to questions that lie at the intersection of environment and urban infrastructure. The dataset was generated under a Laboratory Directed Research and Development project aimed at examining the impact of an area's built environment on weather and energy use. The dataset contains a year of weather data taken at 15-minute intervals in a section of downtown Chicago; the latitude/longitude location, 2D footprint, and height of each building in the study area; and a year of simulation data from an EnergyPlus building-by-building assessment of energy use. Joshua New of ORNL's Energy and Environmental Sciences Directorate participated in the simulations.
The group’s Data Challenge will allow participants to examine variations in weather and building energy use, seasonal influences, and the building types most impacted by external factors such as weather at daily, monthly, and yearly scales. They look forward to being presented with novel methods for interpreting and visualizing their data that draw on machine learning and other big data techniques, and they would welcome new collaborations to complement their work to understand climate, infrastructure, and energy use in urban areas from a systems perspective. The group hopes participants enjoy the interdisciplinary nature of their dataset and the challenges reflected in it.
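The multiscale analysis the group describes, daily, monthly, and yearly views built from 15-minute readings, can be sketched with pandas resampling. The series below is synthetic and stands in for the actual Chicago weather and energy data, whose schema is not assumed here.

```python
# A minimal sketch of aggregating 15-minute interval readings to daily and
# monthly scales with pandas resampling, on a synthetic one-year series.
import numpy as np
import pandas as pd

idx = pd.date_range("2019-01-01", periods=4 * 24 * 365, freq="15min")  # one year
rng = np.random.default_rng(0)
usage = pd.Series(rng.random(idx.size), index=idx, name="energy_kwh")

daily = usage.resample("D").sum()       # daily totals
monthly = usage.resample("MS").mean()   # monthly averages

print(len(daily), len(monthly))
```

The same resampled views could then be joined against the weather series to look for the seasonal influences and weather-sensitive building types the challenge asks about.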
The team participated as data sponsors in last year’s Data Challenge. In addition to Joshua New, the researchers wish to acknowledge Mark Adams of ORNL’s National Security Sciences Directorate for his contributions to the research that generated the data associated with their Data Challenge.
The data sponsors bring varied perspectives to the transportation analysis project underlying this Data Challenge. Anne, Srinath, Kuldeep, Melissa, and Jibo put their expertise in high-performance computing, transportation, machine learning, data science, and visualization to work understanding traffic flow in the Martin Luther King Boulevard corridor of Chattanooga, Tennessee, and enjoy learning from each other. Their cross-disciplinary work on this and other projects reflects the Computational Urban Sciences Group’s interest in using new computational, visualization, and smart sensor technologies to examine how traffic flow can be made more energy efficient. The researchers are collaborating with the Chattanooga Transportation Department, the Tennessee Department of Transportation, and the Georgia Department of Transportation to effect real-world improvements in the urban environment.
The team has donated a dataset containing 1 week's worth of data collected by GridSmart 360° video cameras placed at six intersections near the University of Tennessee at Chattanooga. Data points include each vehicle's length; the time at which it passed through an intersection, its direction of travel, and its speed; and the action it was taking, such as making a U-turn or going straight through the intersection. Given that the corridor includes the path taken by the MLK Day parade during the study's timeframe, the group's Data Challenge offers interesting opportunities for data challengers to spot anomalies and investigate events that contributed to patterns observed in the data. The challenge asks participants to perform basic statistics to describe the corridor's dynamic traffic behavior and put forward machine learning–based methodologies to enable short-term predictive modeling that could translate into innovative approaches to traffic infrastructure and policy.
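As a sketch of the basic statistics and anomaly spotting the challenge calls for, the following flags hours whose vehicle counts deviate strongly from typical flow using z-scores. The hourly counts and the injected spike are synthetic stand-ins for the GridSmart data; none of the real field names are assumed.

```python
# A minimal sketch of simple descriptive statistics plus z-score anomaly
# flagging on synthetic hourly vehicle counts for one week at one intersection.
import numpy as np

rng = np.random.default_rng(1)
counts = rng.poisson(40, size=7 * 24)      # hourly counts over one week
counts[60] = 160                           # an injected spike (e.g. parade traffic)

mean, std = counts.mean(), counts.std()
z = (counts - mean) / std                  # standardize each hour
anomalies = np.flatnonzero(np.abs(z) > 4)  # hours far from typical flow

print(anomalies)
```

A z-score cutoff is the crudest possible detector; it mainly serves to show how an event like the MLK Day parade would stand out against the corridor's baseline behavior before any machine learning is applied.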
The group wishes to acknowledge Phil Nugent of Oak Ridge National Laboratory’s National Security Sciences Directorate for his contributions to the Chattanooga project and the Chattanooga Department of Transportation for the GridSmart data provided for this challenge.
BP uses physics-based modeling to better understand the Earth's subsurface as it works to locate new energy sources. BP engineers use the known locations of energy sources in the Earth's subsurface as a starting point, moving out from those sources to areas that may contain undeveloped stores. Describing traditional seismic survey methods as analogous to taking a CAT scan of the Earth, Keith, Max, Anar, Madhav, and Xukai say the company would like to begin using machine learning techniques to augment the processing of the huge amounts of information the company amasses as it explores the subsurface. They believe that taking advantage of big data techniques will allow BP to more readily identify the most appropriate solutions to its business questions.
Uncertainties resulting from errors in human and instrument measurement, approximations of physical processes, and large computation expense make modeling the Earth’s subsurface difficult. BP’s Data Challenge asks participants to develop an uncertainty map from the company’s dataset. An uncertainty map lays out what is known and not known about a given area where BP chooses to explore for an energy source. In addition to creating the map, Keith, Max, Anar, Madhav, and Xukai would like data challengers to develop methods for gauging how accurate subsurface models are.
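One generic way to produce an uncertainty map of the kind described here is to run an ensemble of perturbed models and use the per-cell spread as the uncertainty estimate. Everything in the sketch below (the toy 2D grid, the velocity values, the perturbation scheme) is an assumption for illustration, not BP's actual modeling workflow.

```python
# A minimal sketch of an ensemble-based uncertainty map: per-cell standard
# deviation across perturbed toy "velocity models" of a 2D subsurface grid.
import numpy as np

rng = np.random.default_rng(0)
base = np.full((50, 50), 3000.0)               # flat background velocity, m/s

members = []
for _ in range(32):
    m = base + rng.normal(0, 10, base.shape)   # small error everywhere
    m[:, 30:] += rng.normal(0, 200)            # large shared error in a poorly
    members.append(m)                          # constrained region on the right

uncertainty = np.std(members, axis=0)          # per-cell spread = uncertainty map
print(uncertainty[:, :30].mean() < uncertainty[:, 30:].mean())
```

The resulting map is high exactly where the ensemble members disagree, which is one way of laying out "what is known and not known" about an exploration area; the second part of the challenge, gauging model accuracy, would require comparing such ensembles against held-out measurements.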
Keith, Max, Anar, Madhav, and Xukai are excited to offer a real-world problem with any number of potential solutions. They believe this open-ended quality will make their challenge interesting for students, experienced statisticians, and everyone in between. The application of machine learning methods is growing in their industry, and they look forward to the opportunity to brainstorm ideas with people coming from a range of disciplines and technical sciences. Collaborating with scientists at Oak Ridge National Laboratory through the Data Challenge will help the company prepare to harness the power of exascale computing in the future. Keith has attended the Smoky Mountains Computational Sciences and Engineering Conference for 4 years. Max attended the conference last year and participated in the Data Challenge as a data challenger. Anar, Madhav, and Xukai are new to the conference and Data Challenge.
Ioana and Gil collaborate on projects that apply ORNL's strengths in high-performance computing to questions related to clinical trials posed by the US Department of Veterans Affairs, the Presidential Innovation Fellows (PIF) program, the US Department of Health and Human Services (HHS), and the National Cancer Institute. Most recently, Gil and Ioana led the ORNL team that participated in the Health Tech Sprint, for which groups delivered artificial intelligence (AI) and data-driven solutions for challenges related to cancer and other diseases. They also have presented ORNL SMARTClinicalTrials at the White House and at the Industries of the Future conference, as part of the HHS/PIF booth at Tech Day, in collaboration with the White House Office of Science and Technology Policy and the CIO Council at the Department of Labor.
Ioana and Gil have provided three datasets derived from the Health Tech Sprint: (1) annotated clinical trials inclusion criteria, (2) patient medical data, and (3) clinician-ranked clinical trials matched with patients. A second version of the third dataset, produced by oncology professionals, serves as a comparison dataset for the matches identified through the application of AI.
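A simple way to compare AI-generated trial matches against the clinician-ranked comparison dataset is a rank-correlation statistic such as Kendall's tau. The trial identifiers and rankings below are illustrative stand-ins; the actual datasets' structure is not assumed.

```python
# A minimal sketch of scoring agreement between an AI ranking and a clinician
# ranking of the same candidate trials for one patient, via Kendall's tau.
from scipy.stats import kendalltau

trials = ["NCT-A", "NCT-B", "NCT-C", "NCT-D", "NCT-E"]  # hypothetical trial IDs
ai_rank = [1, 2, 3, 4, 5]          # the AI's ordering of the five trials
clinician_rank = [1, 3, 2, 4, 5]   # clinicians swap two adjacent trials

tau, _ = kendalltau(ai_rank, clinician_rank)
print(round(tau, 2))               # values near 1.0 indicate strong agreement
```

Aggregating such a statistic over many patients gives one quantitative view of how closely the AI matches track the oncology professionals' judgments in the comparison dataset.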
Ioana and Gil are excited to be data sponsors for this Data Challenge because the event aligns with their interest in engaging a broader swath of the biomedical, computing, and data science communities and in building a bridge between ORNL and interdisciplinary health sciences communities. Their Data Challenge reflects an unresolved cancer care delivery challenge they would like to help address, and they look forward to seeing the innovations data challengers suggest. They hope to engage experienced data analysts in one of the most challenging and high-priority problems in cancer research, and they hope budding biomedical and data scientists will grasp how satisfying it can be to pursue interdisciplinary research questions of such great societal impact.