Paper | Title | Page |
---|---|---|
TUPPC129 | NIF Device Health Monitoring | 887 |
Funding: This work was performed under the auspices of Lawrence Livermore National Security, LLC (LLNS) under Contract No. DE-AC52-07NA27344. #LLNL-ABS-633794 The Integrated Computer Control System (ICCS) at the National Ignition Facility (NIF) uses Front-End Processors (FEPs) to control over 60,000 devices. Device faults are often not discovered until a device is needed during a shot, creating run-time errors that delay the laser shot. This paper discusses a new ICCS framework feature that lets FEPs monitor their devices and report their overall health, so that problem devices can be identified before they are needed. Each FEP has different devices and its own definition of healthy. The ICCS software takes an object-oriented approach, using polymorphism so that each FEP can determine its own health status and report it in a consistent way. This generic approach provides consistent GUI indication and displays detailed information about device problems. Operators are informed quickly of faults and given the information necessary to pinpoint and resolve issues. Operators now know before starting a shot whether the control system is ready, which reduces the time and material lost to failures and improves overall control system reliability and availability. |
Poster TUPPC129 [2.318 MB] | |
THPPC082 | Monitoring of the National Ignition Facility Integrated Computer Control System | 1266 |
Funding: This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. #LLNL-ABS-632812 The Integrated Computer Control System (ICCS), used by the National Ignition Facility (NIF), provides comprehensive status and control capabilities for operating approximately 100,000 devices through 2,600 processes located on 1,800 servers, front-end processors and embedded controllers. Understanding the behavior of complex, large-scale, operational control software and improving system reliability and availability are critical maintenance activities. In this paper we describe the ICCS diagnostic framework, with tunable detail levels and automatic rollovers, and its use in analyzing system behavior. ICCS recently added Splunk as a tool for improved archiving and analysis of these log files (about 20 GB, or 35 million logs, per day). Splunk now continuously captures all ICCS log files for both real-time examination and exploration of trends. Its powerful search query language and user interface allow interactive exploration of log data to visualize specific indicators of system performance, assist in problem analysis, and provide instantaneous notification of specific system behaviors. |
Poster THPPC082 [4.693 MB] | |
THPPC086 | Analyzing Off-normals in Large Distributed Control Systems using Deep Packet Inspection and Data Mining Techniques | 1278 |
Funding: This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. #LLNL-ABS-632814 Network packet inspection using port mirroring provides the ultimate tool for understanding complex behaviors in large distributed control systems. The timestamped captures of network packets embody the full spectrum of protocol layers and uncover intricate and surprising interactions. No other tool is capable of penetrating through the layers of software and hardware abstractions to allow the researcher to analyze an integrated system composed of various operating systems, closed-source embedded controllers, software libraries and middleware. Being completely passive, packet inspection does not modify timings or behaviors. The completeness and fine resolution of the network captures present an analysis challenge, due to the huge data volumes and the difficulty of determining what constitutes signal and what constitutes noise in each situation. We discuss the development of a deep packet inspection toolchain and the application of the R language for data mining and visualization. We present case studies demonstrating off-normal analysis in a distributed real-time control system. In each case, the toolkit pinpointed the root cause of the problem, which had escaped traditional software debugging techniques. |
Poster THPPC086 [2.353 MB] | |
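
The TUPPC129 abstract describes front-end processors that each define "healthy" differently yet report status in a consistent way through polymorphism. The following is a minimal Python sketch of that general pattern, not the ICCS implementation. Every name in it (`Health`, `HealthReport`, `FrontEndProcessor`, `MotorFEP`, `CameraFEP`, `min_fps`) is hypothetical and chosen only for illustration, as are the device checks themselves.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from enum import Enum


class Health(Enum):
    """Hypothetical severity levels; a higher value is worse."""
    OK = 0
    DEGRADED = 1
    FAULT = 2


@dataclass
class HealthReport:
    """Uniform report shape that a GUI could render for any FEP."""
    source: str
    status: Health
    details: list = field(default_factory=list)


class FrontEndProcessor(ABC):
    """Base class: subclasses decide what healthy means for their devices."""

    def __init__(self, name):
        self.name = name

    @abstractmethod
    def check_devices(self):
        """Return a list of (Health, message) pairs, one per device."""

    def report_health(self):
        # Shared logic: overall status is the worst individual finding,
        # and only non-OK findings are kept as details for the operator.
        findings = self.check_devices()
        worst = max((h for h, _ in findings),
                    key=lambda h: h.value, default=Health.OK)
        details = [msg for h, msg in findings if h is not Health.OK]
        return HealthReport(self.name, worst, details)


class MotorFEP(FrontEndProcessor):
    """Assumed rule: a motor is healthy if it responds."""

    def __init__(self, name, motors):
        super().__init__(name)
        self.motors = motors  # {motor_id: responding (bool)}

    def check_devices(self):
        return [(Health.OK, f"{m} responding") if ok
                else (Health.FAULT, f"{m} not responding")
                for m, ok in self.motors.items()]


class CameraFEP(FrontEndProcessor):
    """Assumed rule: a camera is degraded below a minimum frame rate."""

    def __init__(self, name, frame_rates, min_fps=10.0):
        super().__init__(name)
        self.frame_rates = frame_rates  # {camera_id: frames per second}
        self.min_fps = min_fps

    def check_devices(self):
        return [(Health.OK, f"{c} at {fps} fps") if fps >= self.min_fps
                else (Health.DEGRADED, f"{c} slow: {fps} fps")
                for c, fps in self.frame_rates.items()]


if __name__ == "__main__":
    feps = [MotorFEP("motor-fep", {"m1": True, "m2": False}),
            CameraFEP("camera-fep", {"c1": 30.0, "c2": 5.0})]
    for fep in feps:
        report = fep.report_health()
        print(report.source, report.status.name, report.details)
```

The caller only ever invokes `report_health()`, so a status display can treat every FEP identically while each subclass keeps its own health rules.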