The Second International Workshop on
Dependability and Security of System Operation
(DSSO 2015)

MONTREAL, QUEBEC, CANADA
Sept 28, 2015

In conjunction with 34TH INTERNATIONAL SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS (SRDS 2015)

GOAL

System operation is about setting up or changing a target system and/or its environment for purposes such as installation, upgrade, or reconfiguration. A system operation process may be executed by scripts, operations tools, code (as in “infrastructure as code”), or humans, usually based on some specification. A large amount of system downtime is caused by failures during planned system operations or by an operator responding incorrectly to a small initial error.

With the rise of the Development-Operations (DevOps) and Continuous Deployment (CD) movements, the speed and frequency of system operation processes, the automation of these processes, and the possibility of concurrent and conflicting execution of several operations processes are all increasing. Meanwhile, large-scale use of Infrastructure/Platform as a Service (IaaS/PaaS) and resource sharing in virtualisation introduce more uncertainties into the environment.

Dependability and security issues can come from anywhere in the process – the specification, the code, scripts, tools, or humans involved, the target system, or the environment. Failures need to be prevented, detected, diagnosed, recovered from, or tolerated in the context of system operation processes.

The goal of the workshop is to bring together researchers from academia and industry to discuss dependability and security issues of system operation and techniques to reduce the downtime caused by these issues. Topics include but are not limited to the following:
  • Architecture or system impact on operations
  • Best practices and patterns in system operation
  • Canary testing and production environment testing
  • Dependability/Security in configuration management
  • Dependability/Security in disaster recovery and business continuity
  • Dependability/Security in Infrastructure as Code, Software Defined Infrastructure, Software Defined Networks
  • Dependability/Security in operating HPC or big data processing (e.g. Hadoop/Spark) clusters
  • Dependability/Security in release engineering, continuous build, integration, delivery, and deployment
  • Development-Operations (DevOps) process interactions
  • Experience reports and data analysis of real-world system operation
  • Error diagnosis and root cause analysis during system operation
  • Failure/Fault detection/prevention/tolerance during system operation
  • Test driven system operations
  • Tolerance of variability
  • Operation-related machine data collection, storage and analytics

Program

This is a morning-only workshop.
  • 8:30-9:30 Opening and Keynote
  • 9:30-10:00 Paper session 1 (1 paper)
  • 10:00-10:30 Coffee Break
  • 10:30-12:30 Paper session 2 (4 papers)
  • 12:30 Wrap-up

Keynote: Dependability in a Connected World: Research in Action in the Operational Trenches (slides and recordings in Webex .wrf format)

Speaker: Professor Saurabh Bagchi, Purdue University

Abstract: Much of the computational infrastructure that we encounter today and that we increasingly rely on for critical applications is provided by a distributed system, be it the smart electric grid, cyber-physical systems that embed sentient sensors in our physical spaces, or loosely coupled clusters executing genomic similarity matching software. Dependability is the property that the system continues to provide its functionality on time despite the introduction of faults, either accidental faults (design defects, environmental effects, etc.) or maliciously introduced faults (security attacks, either external or internal). Distributed systems are increasing in scale, both in terms of the number of executing elements and the amount of data that they need to process. Another emerging trend in distributed systems is that they are being built out of heterogeneous components – different kinds of computer platforms and different software platforms. These two trends pose new challenges for designers of computer systems in ensuring high dependability. In this talk, I will give a high-level overview of mechanisms that are being devised to handle the dependability challenges in today’s large-scale computing systems in light of the two trends mentioned above. Then I will present details of solution directions that we have taken in two operational scenarios. The first is in analysis and improvements to Purdue’s supercomputing clusters, which are used by campus researchers to look at everything from the molecular machinery of viruses to the origins of the universe and myriad science, engineering and social science problems in between. Along with Purdue’s IT organization called ITaP, we are analyzing system usage and failure data with the goal of making the clusters work better for other researchers.
We have also built an openly available repository of usage and failure data from these supercomputers, analysis of which can be used to help researchers run their code on the machines more efficiently and reliably and get results faster. In the talk I will present some of our initial insights, how they are driving acquisition and operation of the infrastructure, and how the community can participate in the open source repository. The second domain is in building and running a cyberinfrastructure for earthquake scientists and engineers, called NEEShub. NEES, which operated from 2009 to 2015, was a shared network of civil engineering experimental facilities aimed at facilitating research on mitigating earthquake damage and loss of life. The NEEShub gateway was created in response to the NEES community’s needs, combining data, simulation, and analysis functionality with collaboration tools. I will share operational lessons learned while acting as the director of cybersecurity operations of NEEShub, which was a large-scale cyberinfrastructure, at least in academic circles (135K users, 6.5M files downloaded, 175K simulation runs). Our goal was to protect the cyberinfrastructure, including the expensively generated data, while balancing this against ease of use for scientists and engineers, who were not always computationally minded. We used a conglomeration of off-the-shelf commercial and open source tools and adapted them to our purposes, some of which were quite esoteric. I will share with you what worked and what did not, with some extrapolation about what it takes to run cyberinfrastructures.

Speaker's Bio: Saurabh Bagchi is a Professor in the School of Electrical and Computer Engineering and the Department of Computer Science at Purdue University in West Lafayette, Indiana. He is an ACM Distinguished Scientist (2013), a Senior Member of IEEE (2007) and of ACM (2009), a Distinguished Speaker for ACM (2012), an IMPACT Faculty Fellow at Purdue (2013-14), and an Assistant Director of the CERIAS security center at Purdue. He has also been a Visiting Scientist at IBM Research since 2011. Saurabh's research interest is in distributed systems and dependable computing. He is proudest of the 14 PhD students who have graduated from his research group and have gone on to wonderful careers in industry or academia. In his group, he and his students have far too much fun building real systems and transitioning them to practice. Saurabh received his MS and PhD degrees from the University of Illinois at Urbana-Champaign, in 1998 and 2001, respectively.

Session 1
  • Mamoru Ohara and Satoshi Fukumoto. A Client-based Replication Protocol for Multiversion Cloud File Storage
Session 2
  • Yan Liu. Modeling the Autoscaling Operations in Cloud with Time Series Data
  • Hong-Mei Chen, Rick Kazman, Serhiy Haziyev, Valentyn Kropov and Dmitri Chtchourov. Architectural Support for DevOps in a Neo-Metropolis BDaaS Platform
  • Wenjun Yang, Dianming Hu, Yuliang Liu and Shuhao Wang. Hard Drive Failure Prediction Using Big Data
  • Paul Rimba, Liming Zhu, Xiwei Xu and Daniel Sun. Building Secure Applications using Pattern-Based Design Fragments

PAPER SUBMISSION AND PUBLICATION

Workshop paper submissions: July 5, 2015
Notification to authors: July 19, 2015
Camera-ready: July 29, 2015

The submission and review process will be done using EasyChair (https://www.easychair.org/conferences/?conf=dsso2015).

Submissions must be no longer than 6 pages (including everything) and adhere to the IEEE Computer Society 8.5"x11" two-column camera-ready format. The manuscript templates for MS Word and LaTeX can be found at the following link:
http://www.ieee.org/conferences_events/conferences/publishing/templates.html

All papers will be peer-reviewed by at least 3 PC members and evaluated based on originality, technical quality and relevance to the workshop. All papers will be published in IEEE proceedings, and selected papers will be invited to the IEEE Software Special Issue on "Software Engineering for DevOps".


Organizing Committee

Ingo Weber
NICTA/University of New South Wales, Australia

Dong-Seong Kim
University of Canterbury, New Zealand

Wei Xu
Tsinghua University, China

Liming Zhu
NICTA/University of New South Wales, Australia


Program Committee (To Be Confirmed and More to Come)

  • Bram Adams, Polytechnique Montreal, Canada
  • Javier Alonso, Duke University, US
  • Giuliano Casale, Imperial College London, UK
  • Marc Chiarini, MarkLogic Corp, US
  • Marcello Cinque, University of Naples Federico II, Italy
  • Dianming Hu, Baidu Inc., China
  • Qiang Fu, Microsoft, US
  • Eben Haber, IBM Research, US
  • Andre van Hoorn, University of Stuttgart, Germany
  • Zbigniew Kalbarczyk, University of Illinois at Urbana-Champaign, US
  • Rick Kazman, SEI & Uni of Hawaii, US
  • Fumio Machida, NEC, Japan
  • Paulo Maciel, Federal University of Pernambuco, Brazil
  • Iulian Neamtiu, Univ. of California, Riverside, US
  • Eli Tilevich, Virginia Tech, US
  • Rajesh Vasa, Swinburne Uni., Australia
  • Michael Wahler, ABB Corporate Research, Switzerland
  • Eoin Woods, UBS, Canada
  • Xin Ye, DUT, China
  • Ding Yuan, Uni. of Toronto, Canada

CONTACT

dsso2015@easychair.org