The 2019 Smoky Mountains Computational Sciences and Engineering Conference (SMC2019) is hosting its third annual Data Challenge. For this event, we have enlisted research scientists from across Oak Ridge National Laboratory (ORNL) to serve as data sponsors and help create data analytics challenges for prominent data sets at the laboratory. Each data sponsor provides a significant data set and formulates 3 to 5 challenge questions associated with it. The questions for each data set span multiple difficulty levels: the first question in each challenge is suitable for a novice, each question thereafter increases in difficulty, and the series ends with an advanced/expert-level question. These challenges are intended to draw participants ranging from scientists and researchers at the beginning stages of incorporating data analytics into their workflow to data analytics experts interested in applying novel techniques to data sets of national importance. This year there are 7 data challenges, and a team of up to 4 members may take on any of them.
To participate in this year’s SMC19 Data Challenge, register your team and select a challenge! The top two teams from each challenge will be selected to present at SMC19 where the overall winner will be selected.
To answer a challenge, please submit a paper of no more than 5 pages in length (including pictures) describing your solution, as well as a 3-minute narrated video describing your solution. Detailed instructions for submissions can be found on the submissions page.
The Challenge will be open from May 15th to July 31st. Papers and videos are due by 5:00 PM EDT on July 31st. The top teams will be notified by August 7. Winning teams will be asked to present a poster at SMC2019.
The Data Challenge provided by Garrett Granroth, Pete Peterson, and Wenduo Zhou focuses on analyzing a temperature log against time-stamped neutron events collected on the Vulcan beamline at the Spallation Neutron Source. Research for the team’s dataset, with application to additive manufacturing of complex structures and similar processes, involved continuously recording neutron scattering from a sample during rapid heating and cooling cycles, in situ, on a diffractometer at the Spallation Neutron Source. The group’s dataset contains information about phase transformations and corresponding residual stresses within the sample as functions of temperature and time over the entire set of cycles. The group’s Data Challenge sprang from Wenduo’s observation that the temperature log data contained repetitive pulses in temperature that were difficult to separate into distinct occurrences because of background noise in the data.
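The core of the separation problem Wenduo describes — picking distinct temperature pulses out of a noisy log — can be sketched as a smooth-and-threshold pass. The following Python sketch uses synthetic data; the pulse shapes, noise level, and threshold are all assumptions for illustration, not the actual Vulcan data.

```python
import numpy as np

def find_pulse_starts(temps, window=5, threshold=5.0):
    """Locate the start of each pulse in a noisy temperature log.

    Smooths the log with a centered moving average, then returns the
    indices where the smoothed signal first crosses more than
    `threshold` degrees above the log's median baseline.
    """
    temps = np.asarray(temps, dtype=float)
    smoothed = np.convolve(temps, np.ones(window) / window, mode="same")
    baseline = np.median(smoothed)
    above = smoothed > baseline + threshold
    # A pulse starts wherever the signal crosses the threshold upward.
    return np.flatnonzero(above[1:] & ~above[:-1]) + 1

# Synthetic log: three heating pulses on a noisy 20-degree baseline.
rng = np.random.default_rng(0)
t = np.arange(300)
signal = 20 + rng.normal(0, 0.2, t.size)
for center in (50, 150, 250):
    signal += 10 * np.exp(-0.5 * ((t - center) / 5) ** 2)

pulse_starts = find_pulse_starts(signal)
```

Smoothing before thresholding keeps single noise spikes from registering as separate pulses; a real analysis would also have to handle drifting baselines and overlapping cycles.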
While the neutron science field has collected experimental data for a long time and Garrett, Pete, and Wenduo routinely apply data analysis in their work, they acknowledge that datasets are getting bigger, making it difficult to eyeball data for patterns and phenomena, and that their own biases may affect their ability to recognize new, potentially beneficial methodologies. The three scientists are interested in the approaches this year’s data challengers might use to examine how sample environments affect neutron event data and how researchers can get a clear view of data when there is noise in the sample environment. They believe that scientific computing can relieve researchers of the work of eyeballing large datasets and that newer techniques such as machine learning, artificial intelligence, and pattern recognition can reduce effort and improve the quality of information feeding back into experiments. The scientists hope their challenge will contribute to a better understanding of how stresses are retained in various engineering materials after repeated heating and cooling cycles.
Nouamane, Junqi, and Albina are collaborating on a 2019 INCITE (Innovative and Novel Computational Impact on Theory and Experiment) project aimed at using deep learning techniques to decode material properties. Albina and Nouamane have been working together for 2 years to fill in gaps in our understanding of solid-state crystalline structures with modeling and simulation, and Nouamane and Junqi have been working together on simulation projects on ORNL’s Summit supercomputer since it went online in 2018. The group’s simulations allow them to predict what electron diffraction patterns for any solid-state crystalline material will look like and what variations will exist in the actual materials.
The dataset associated with Albina, Junqi, and Nouamane’s Data Challenge contains multidimensional images of electron diffraction pattern simulations. Their challenge asks participants to build a machine learning algorithm that accurately predicts materials’ crystal structures; can be tweaked to account for imbalances in the dataset, such as too few data points for a certain type of structure; and can multitask to predict material structures while reducing the effects of imbalances in the data. The dataset addresses more than 60,000 types of materials, so insights culled from modeling and simulation would be impactful to the materials science field.
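One common way to address the dataset imbalance the challenge highlights — too few examples of certain structure types — is to weight each class inversely to its frequency when training. A minimal Python sketch follows; the structure labels and counts are hypothetical, not drawn from the group's dataset.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Per-class weights inversely proportional to class frequency,
    normalized so that the weights average to 1 across all samples.
    Rare classes receive large weights, common classes small ones."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    weights = counts.sum() / (len(classes) * counts)
    return dict(zip(classes, weights))

# Hypothetical labels: cubic structures dominate, triclinic is rare.
labels = ["cubic"] * 80 + ["hexagonal"] * 15 + ["triclinic"] * 5
class_weights = inverse_frequency_weights(labels)
```

These weights would typically be passed into a loss function (e.g., a weighted cross-entropy) so that the rare structure types contribute as much to training as the abundant ones.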
Nouamane points out that machine learning techniques are developing quickly and continuously changing, so they are excited to see what fresh approaches data challengers will apply to their challenge. Albina believes that successful completion of the group’s data challenge will be an important step in automating material structure determination and classification during electron microscopy experiments, which will improve research efficiency and enable new discoveries. There are consortium efforts in industry to improve deep learning on popular datasets such as ImageNet, Junqi says, but this resource is lacking in the materials science community. Their dataset represents a great opportunity to create a deep learning benchmark for materials science research.
Melissa and Jibo apply their interest and expertise in climate science and high-performance computing to questions that lie at the intersection of environment and urban infrastructure. The dataset was generated under a Laboratory Directed Research and Development project aimed at examining the impact of an area’s built environment on weather and energy use. The dataset contains a year of weather data taken at 15-minute intervals in a section of downtown Chicago; the latitude/longitude location, 2D footprint, and height of each building in the study area; and a year of simulation data from an EnergyPlus building-by-building assessment of energy use. Joshua New of Oak Ridge National Laboratory’s (ORNL’s) Energy and Environmental Sciences Directorate participated in the simulations.
The group’s Data Challenge will allow participants to examine variations in weather and building energy use, seasonal influences, and the building types most impacted by external factors such as weather at daily, monthly, and yearly scales. Melissa and Jibo look forward to being presented with novel methods for interpreting and visualizing their data that draw on machine learning and other big data techniques, and they would welcome new collaborations to complement their work to understand climate, infrastructure, and energy use in urban areas from a systems perspective. The group hopes participants enjoy the interdisciplinary nature of their dataset and the challenges reflected in it.
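Examining the data at daily, monthly, and yearly scales, as the challenge suggests, starts with aggregating the 15-minute readings into coarser buckets. A stdlib-only Python sketch on synthetic readings follows; the timestamps and energy values are invented, not taken from the Chicago dataset.

```python
import random
from collections import defaultdict
from datetime import datetime, timedelta

def aggregate(readings, key_fmt):
    """Average (timestamp, value) readings into buckets keyed by a
    strftime pattern: '%Y-%m-%d' for daily, '%Y-%m' for monthly."""
    sums, counts = defaultdict(float), defaultdict(int)
    for ts, value in readings:
        key = ts.strftime(key_fmt)
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Hypothetical 15-minute energy-use readings for two January days
# (96 intervals per day).
random.seed(0)
start = datetime(2018, 1, 1)
readings = [(start + timedelta(minutes=15 * i), 50 + random.uniform(-5, 5))
            for i in range(2 * 96)]

daily_means = aggregate(readings, "%Y-%m-%d")
```

The same function with a `"%Y-%m"` key gives monthly means, which is where seasonal influences on building energy use would begin to show up.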
Melissa and Jibo participated as data sponsors in last year’s Data Challenge. In addition to Joshua New, the researchers wish to acknowledge Mark Adams of ORNL’s National Security Sciences Directorate for his contributions to the research that generated the data associated with their Data Challenge.
The data sponsors bring varied perspectives to the transportation analysis project underlying this Data Challenge. Anne, Srinath, Kuldeep, and Jibo put their expertise in high-performance computing, transportation, machine learning, data science, and visualization to work understanding traffic flow in the Martin Luther King Boulevard corridor of Chattanooga, Tennessee, and enjoy learning from each other. Their cross-disciplinary work on this and other projects reflects the Computational Urban Sciences Group’s interest in using new computational, visualization, and smart sensor technologies to examine how traffic flow can be made more energy efficient. The researchers are collaborating with the Chattanooga Transportation Department, the Tennessee Department of Transportation, and the Georgia Department of Transportation to effect real-world improvements in the urban environment.
Anne, Srinath, Kuldeep, and Jibo have donated a dataset containing 1 week’s worth of data collected on GridSmart 360° video cameras placed at six intersections near the University of Tennessee–Chattanooga. Data points address a vehicle’s length; time at which a vehicle passed through an intersection, its direction of travel, and how fast it was going; and what action a vehicle was taking, such as a U-turn or going straight through the intersection. Given that the corridor includes the path taken by the MLK Day parade during the study’s timeframe, the group’s Data Challenge offers interesting opportunities for data challengers to spot anomalies and investigate events that contributed to patterns observed in the data. The challenge asks participants to perform basic statistics to describe the corridor’s dynamic traffic behavior and put forward machine learning–based methodologies to enable short-term predictive modeling that could translate into innovative approaches to traffic infrastructure and policy.
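The basic-statistics portion of the challenge, plus a naive baseline for the short-term predictive modeling it asks for, might start out like this Python sketch. The vehicle counts are invented for illustration, not GridSmart data, and a moving-average forecast is only a floor that any proposed machine learning model should beat.

```python
import statistics

def moving_average_forecast(counts, window=4):
    """Predict the next interval's vehicle count as the mean of the
    last `window` observations -- a simple short-term baseline."""
    if len(counts) < window:
        raise ValueError("need at least `window` observations")
    return statistics.fmean(counts[-window:])

# Hypothetical 15-minute vehicle counts at one intersection,
# ramping up toward a morning peak.
counts = [12, 15, 14, 18, 22, 30, 41, 38]

mean_count = statistics.fmean(counts)   # average flow over the period
peak_count = max(counts)                # busiest interval
prediction = moving_average_forecast(counts, window=4)
```

Anomalies such as the MLK Day parade would appear as intervals where observed counts diverge sharply from a baseline like this one.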
The group wishes to acknowledge Phil Nugent of Oak Ridge National Laboratory’s National Security Sciences Directorate for his contributions to the Chattanooga project and the Chattanooga Department of Transportation for the GridSmart data provided for this challenge.
GE Power’s Data Challenge results from a real-world problem that gas turbines at natural gas power plants can experience with combustion pulsations. Turbine combustion systems burn huge amounts of fuel every minute and are precisely tuned for low emissions. The combustion process and acoustic noise of the system are generally stable. GE Power engineers have noted occasional noise-pulsations, however, that reach operating limits for the turbines, necessitating undesirable actions to control the noise. Understanding why these infrequent events occur during operation is highly desirable as GE Power seeks to continuously improve the tuning process.
GE Power’s Data Challenge aims to explore what is causing the noise-pulsations with a view to finding the best way to control the cause(s). The dataset contains 1 month of operations data for three different gas turbines, examining control, environmental, and output parameters. Information in the dataset includes the magnitude and frequency of pulsations, ambient temperature, airflow, and distribution of fuel between fuel injection points.
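A natural first exploratory step on a dataset like this is to rank the recorded parameters by how strongly each correlates with pulsation magnitude. The Python sketch below does this on synthetic data; the assumed relationship between fuel distribution and pulsations is purely illustrative and not a claim about GE Power's turbines.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 720  # hypothetical hourly records for one month of operation

# Synthetic operations data: pulsation magnitude is made to rise with
# the fuel split and to be nearly independent of ambient temperature.
fuel_split = rng.uniform(0.1, 0.9, n)
ambient_temp = rng.uniform(-5.0, 35.0, n)
pulsation = 2.0 * fuel_split + rng.normal(0, 0.1, n)

# Rank candidate drivers by |Pearson correlation| with pulsation.
params = {"fuel_split": fuel_split, "ambient_temp": ambient_temp}
corrs = {name: float(np.corrcoef(values, pulsation)[0, 1])
         for name, values in params.items()}
ranked = sorted(corrs, key=lambda name: abs(corrs[name]), reverse=True)
```

Linear correlation only screens for simple relationships; the rare, intermittent events GE Power describes would likely require nonlinear or time-resolved methods on top of a screening pass like this.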
Joe believes that solutions for the specific set of turbines involved in the Data Challenge will provide insight into noise-pulsation phenomena in other gas turbines. He was a keynote speaker at the 2018 Smoky Mountains Computational Sciences and Engineering Conference and is excited to be a data sponsor this year. The Data Challenge, Joe says, is a great opportunity for students to have a positive impact on a real-world industrial problem. GE Power routinely uses data analytics to detect anomalies and act upon them in a predictive way.
BP uses physics-based modeling to better understand the Earth’s subsurface as it works to locate new energy sources. BP engineers use the known locations of energy sources in the Earth’s subsurface as a starting point, moving out from those sources to areas that may contain undeveloped stores. Describing traditional seismic survey methods as analogous to taking a CAT scan of the Earth, Keith, Max, Anar, Madhav, and Xukai say the company would like to begin using machine learning techniques to augment the processing of the huge amounts of information the company amasses as it explores the subsurface. They believe that taking advantage of big data techniques will allow BP to more readily identify the most appropriate solutions to its business questions.
Uncertainties resulting from errors in human and instrument measurement, approximations of physical processes, and large computation expense make modeling the Earth’s subsurface difficult. BP’s Data Challenge asks participants to develop an uncertainty map from the company’s dataset. An uncertainty map lays out what is known and not known about a given area where BP chooses to explore for an energy source. In addition to creating the map, Keith, Max, Anar, Madhav, and Xukai would like data challengers to develop methods for gauging how accurate subsurface models are.
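One simple way to build the kind of uncertainty map the challenge describes is to run an ensemble of model realizations and map the per-cell spread: low spread marks well-constrained regions, high spread marks poorly known ones. The Python sketch below uses a synthetic ensemble; the grid size, velocity values, and error structure are all assumptions for illustration.

```python
import numpy as np

def uncertainty_map(realizations):
    """Per-cell standard deviation across an ensemble of model
    realizations. High values flag regions where the realizations
    disagree, i.e., where the subsurface is poorly constrained."""
    return np.stack(realizations).std(axis=0)

# Hypothetical ensemble: 20 velocity models on a 10x10 grid that agree
# in the shallow half (rows 0-4) and diverge in the deep half (rows 5-9).
rng = np.random.default_rng(2)
base = np.full((10, 10), 3000.0)  # background velocity, m/s
ensemble = []
for _ in range(20):
    model = base + rng.normal(0, 10, (10, 10))    # small error everywhere
    model[5:, :] += rng.normal(0, 300, (5, 10))   # large error at depth
    ensemble.append(model)

umap = uncertainty_map(ensemble)
```

The same spread statistic doubles as a crude accuracy gauge for the models themselves: regions where independent realizations converge are the ones the modeling pipeline can claim to resolve.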
Keith, Max, Anar, Madhav, and Xukai are excited to offer a real-world problem with any number of potential solutions. They believe this open-ended quality will make their challenge interesting for students, experienced statisticians, and everyone in between. The application of machine learning methods is growing in their industry, and they look forward to the opportunity to brainstorm ideas with people coming from a range of disciplines and technical sciences. Collaborating with scientists at Oak Ridge National Laboratory through the Data Challenge will help the company prepare to harness the power of exascale computing in the future. Keith has attended the Smoky Mountains Computational Sciences and Engineering Conference for 4 years. Max attended the conference last year and participated in the Data Challenge as a data challenger. Anar, Madhav, and Xukai are new to the conference and Data Challenge.
Gina, Ioana, and Gil collaborate on projects that apply Oak Ridge National Laboratory’s (ORNL’s) strengths in high-performance computing to questions related to clinical trials posed by the US Department of Veterans Affairs, the Presidential Innovation Fellows (PIF) program, the US Department of Health and Human Services (HHS), and the National Cancer Institute. Most recently, Gina and Ioana led the ORNL team that participated in the Health Tech Sprint, for which groups delivered artificial intelligence (AI) and data-driven solutions for challenges related to cancer and other diseases. They also have presented ORNL SMARTClinicalTrials at the White House and at the Industries of the Future conference, as part of the HHS/PIF booth at Tech Day in collaboration with the White House Office of Science and Technology Policy and the CIO Council at the Department of Labor.
Gina, Ioana, and Gil have provided three datasets derived from the Health Tech Sprint: (1) annotated clinical trials inclusion criteria, (2) patient medical data, and (3) clinician-ranked clinical trials matched with patients. A second version of the third dataset, produced by oncology professionals, serves as a comparison dataset for the matches identified through the application of AI.
Gina, Ioana, and Gil are excited to be data sponsors for this Data Challenge because the event aligns with their interest in engaging a broader swath of the biomedical, computing, and data science communities and in building a bridge between ORNL and interdisciplinary health sciences communities. Their Data Challenge reflects an unresolved cancer care delivery challenge they would like to help address, and they look forward to seeing the innovations data challengers suggest. They hope to engage experienced data analysts in one of the most challenging and high-priority problems in cancer research, and they hope budding biomedical and data scientists will grasp how satisfying it can be to pursue interdisciplinary research questions of such great societal impact.