Analysis of the Summit Login Nodes Usage Data

Ketan Maheshwari, Sean Wilkinson, Rafael Ferreira da Silva

Oak Ridge National Laboratory

Introduction

Summit is the leadership-class supercomputer hosted by the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory. Summit has 5 login nodes that act as gateways for users to access the supercomputer. Since January 1, 2020, we have been running processes that collect login node usage data every hour. The data consists of the observed usage on Summit, such as running processes, logged-in users, scheduler jobs, state of the storage, system response time, and more.

The system usage variations and patterns are affected by several internal and external factors, such as the day of the week, the day of the month, changes in working patterns due to the pandemic, conference and proposal deadlines, and more.

It will be valuable for users as well as administrators of the system to understand such usage patterns so that a better overall design could be developed for load balancing, the number of login nodes, login node specifications, and more. The goal of such a design would be to offer users an optimal experience when accessing Summit through the login nodes. Additionally, it would be valuable for stakeholders if future usage trends could be predicted from current usage data so that appropriate resource allocations can be made for future system acquisitions.

In this challenge, we share the pseudonymized dataset and invite interested individuals and teams to solve 4 challenge questions that will shed light on critical usage patterns and future trends.

Data Organization and Size

All the data is in plain text. The data consists of one file per login node per day for 2 years. Some files are missing because Summit was under planned maintenance, was otherwise unavailable, or because the data collection process glitched. Considering there are 5 login nodes on Summit, there are approximately 2 X 365 X 5 = 3650 files.

The files are organized into directories named Monyyyy (e.g., May2021); there are 24 such directories, one per month for 24 months. Each file name carries the login node and the date. For example, the data for login3 on 22nd March 2021 is in the Mar2021 directory and is named login3.summit.olcf.ornl.gov.Mar22_2021.txt.
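
As an illustration of this layout, here is a minimal Python sketch that enumerates the expected file paths for the two-year period. The dataset root directory and the login node names (login1 through login5) are assumptions for the example and should be adjusted to match the downloaded data.

    from datetime import date, timedelta
    from pathlib import Path

    ROOT = Path("summit_login_data")            # assumed local download location
    NODES = [f"login{i}" for i in range(1, 6)]  # login1 .. login5 (assumed node names)

    def file_path(node, day):
        # e.g. Mar2021/login3.summit.olcf.ornl.gov.Mar22_2021.txt
        month_dir = day.strftime("%b%Y")        # "Mar2021"
        # NOTE: %d zero-pads the day ("Mar02_2021"); whether the real files pad is unverified.
        fname = f"{node}.summit.olcf.ornl.gov.{day.strftime('%b%d_%Y')}.txt"
        return ROOT / month_dir / fname

    day = date(2020, 1, 1)
    missing = 0
    while day <= date(2021, 12, 31):
        for node in NODES:
            if not file_path(node, day).exists():
                missing += 1                    # missing files are expected (maintenance, glitches)
        day += timedelta(days=1)
    print(f"missing files: {missing}")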

File sizes vary and are on the order of 1-15 MB each. The total size of the dataset is 20 GB uncompressed and 2.6 GB compressed. The dataset has approximately 51 million lines of text.

The data in each file is organized into sections and subsections. There are 24 hourly sections in each file, each marked by the hour and terminated by "endsnap" (with exceptions where the collection process was interrupted for some reason). Each hourly section has 10 subsections, as follows (a parsing sketch is given after the list):

  1. The output of the Unix w command, bounded by "w –" and "endw –".
  2. The contents of the /proc/meminfo file, bounded by "meminfo –" and "endmeminfo –".
  3. The contents of the /proc/vmstat file, bounded by "vmstat –" and "endvmstat –".
  4. The output of the "ps aux" command (excluding root-owned processes), bounded by "ps aux –" and "endps aux –".
  5. The output of the "top" command (excluding root processes), bounded by "top -n 1 -bc | awk '$2!~/root/' –" and "endtop -n 1 -bc | awk '$2!~/root/' –".
  6. Information on all the jobs currently active in the scheduler, bounded by "bjobs -a -u all –" and "endbjobs -a -u all –".
  7. The time it takes to run the unaliased ls command in $HOME.
  8. The time it takes to run a colored ls command in $HOME.
  9. The time it takes to create a 1 GB file in GPFS scratch.
  10. The output of the "df -h" command, excluding the tmpfs filesystems.
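
The marker-bounded structure above lends itself to straightforward splitting. The following is a minimal Python sketch of one way to cut a day file into hourly snapshots and pull out one labelled subsection; the exact marker strings (and their trailing dashes) are an assumption here and should be verified against the raw files.

    def hourly_snapshots(path):
        # Yield the text of each hourly section, using the "endsnap" line as the terminator.
        snap = []
        with open(path, errors="replace") as fh:
            for line in fh:
                if line.startswith("endsnap"):
                    yield "".join(snap)
                    snap = []
                else:
                    snap.append(line)

    def subsection(snapshot_text, label):
        # Return the lines between "<label>" and "end<label>" inside one snapshot, if present.
        lines = snapshot_text.splitlines()
        try:
            start = next(i for i, l in enumerate(lines) if l.startswith(label))
            end = next(i for i, l in enumerate(lines) if l.startswith("end" + label))
        except StopIteration:
            return []   # interrupted snapshots may lack some subsections
        return lines[start + 1:end]

    # Example: extract the /proc/meminfo subsection of every snapshot in one file.
    # for snap in hourly_snapshots("Mar2021/login3.summit.olcf.ornl.gov.Mar22_2021.txt"):
    #     mem_lines = subsection(snap, "meminfo")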

Data

Dataset Download (DOI): https://doi.ccs.ornl.gov/ui/doi/386  

Weekend and holiday data set: https://docs.google.com/spreadsheets/d/1nzgBdxXRk01CberZMtOssXmmn1iTgDypnbG9zi5DHxo/edit?usp=sharing

Instructions for data download: https://smc-datachallenge.ornl.gov/wp-content/uploads/2022/05/Constellation_Downloading_Guide_v02.pdf

The Challenge Questions

1. Capture the data and organize it into one or more easily usable CSV datasets. For instance, the dataset could be represented as a CSV with columns such as date-hour, login-node, logged-users, running-procs, cpu-load, memory-used, time_to_ls, running-jobs, total-jobs, time-to-create-1G, disk-util, …
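
As one possible starting point for question 1, the sketch below writes the suggested column layout with Python's csv module. The column names follow the example above, and the single row shown uses placeholder values only; in practice there would be one row per login node per hourly snapshot, filled from the parsed subsections.

    import csv

    FIELDS = ["date-hour", "login-node", "logged-users", "running-procs", "cpu-load",
              "memory-used", "time_to_ls", "running-jobs", "total-jobs",
              "time-to-create-1G", "disk-util"]

    with open("summit_usage.csv", "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        # Placeholder row (memory in GB, times in seconds, disk utilization in percent):
        writer.writerow({"date-hour": "2021-03-22T14:00", "login-node": "login3",
                         "logged-users": 41, "running-procs": 128, "cpu-load": 6.2,
                         "memory-used": 212, "time_to_ls": 0.04, "running-jobs": 180,
                         "total-jobs": 412, "time-to-create-1G": 3.1, "disk-util": 71})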

2. Correlate the usage patterns with external events such as weekly working and non-working days, for example weekends and public holidays (a list of such days is provided via the weekend and holiday data set above). For instance, how are the number of users, the system load, and the number of active processes affected during working vs. non-working hours? Are there any anomalies to this pattern, and if so, why? What other events might be affecting the usage (e.g., a COVID surge)?
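
A minimal pandas sketch of the working vs. non-working split is shown below. It assumes the CSV from question 1 and a hypothetical holidays.csv (one "date" column) derived from the weekend and holiday spreadsheet listed in the Data section.

    import pandas as pd

    df = pd.read_csv("summit_usage.csv", parse_dates=["date-hour"])
    holidays = set(pd.read_csv("holidays.csv", parse_dates=["date"])["date"].dt.date)

    # Flag weekends (Saturday=5, Sunday=6) and listed public holidays as non-working days.
    df["non_working"] = (df["date-hour"].dt.dayofweek >= 5) | df["date-hour"].dt.date.isin(holidays)
    print(df.groupby("non_working")[["logged-users", "cpu-load", "running-procs"]].mean())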

3. Understand and identify the relative system usage and any possible skews between the five login nodes. Derive a “state” defined by the various usage parameters and compare that state across the five login nodes. For instance, a “state” could be a tuple of (cpu-usage, memory-usage, users-logged, response-time-to-create-1G, …). Each element of this tuple could be a “dimension” of the state and could be disproportionately high or low with respect to the other dimensions. Does such a discrepancy indicate any specific behavior pattern?
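
One way to make such a comparison concrete is to normalize each dimension and compare per-node profiles. The sketch below assumes the CSV layout from question 1 (with numeric columns), z-scores each dimension across all nodes and hours, and averages the scores per login node; a dimension far from zero for one node suggests a skew worth explaining.

    import pandas as pd

    df = pd.read_csv("summit_usage.csv", parse_dates=["date-hour"])
    dims = ["cpu-load", "memory-used", "logged-users", "time-to-create-1G"]

    z = (df[dims] - df[dims].mean()) / df[dims].std()          # normalize each dimension
    profile = z.join(df["login-node"]).groupby("login-node").mean()
    print(profile)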

4. Predict future trends. Identify interesting usage trends and predict them with respect to each other with a given degree of confidence. Statistical methods such as correlation and regression may be used. We encourage the use of ML / DL techniques and the development of model-based prediction methods. For instance, what was the “state” during peak COVID, just after the announcements of vaccines, or during unusual weather events?
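
As a simple baseline for trend prediction, the sketch below fits a linear trend to the daily mean number of logged-in users (again assuming the CSV from question 1) and extrapolates it 30 days ahead; participants are of course expected to go well beyond this with the statistical and ML/DL methods mentioned above.

    import numpy as np
    import pandas as pd

    df = pd.read_csv("summit_usage.csv", parse_dates=["date-hour"])
    daily = df.set_index("date-hour")["logged-users"].resample("D").mean().dropna()

    x = np.arange(len(daily))
    slope, intercept = np.polyfit(x, daily.values, 1)   # least-squares line: users ~ day index
    forecast_30d = slope * (len(daily) + 30) + intercept
    print(f"fitted trend: {slope:+.3f} users/day, 30-day-ahead estimate: {forecast_30d:.1f}")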

5. Bonus Question: Analyze the data and identify any other possible trends, patterns, or information not covered by the questions above.