Brian Clancy

SRE/DevOps Engineer

Education

BS in Computer Science

University of Iowa

Iowa City, IA

Interests

Photography
Running
Boating
Coding
Coaching

Certifications

ITIL v3

Organizations

Career Profile

Seasoned DevOps Engineer with a deep understanding of modern infrastructure and software architectures, now specializing in Site Reliability Engineering. Adept at leveraging SRE principles to optimize system performance and ensure high availability. Recent projects include developing strategic dashboards aligned with Google's golden signals for enhanced observability. Extensive experience with Kubernetes management across cloud and on-premise platforms, coupled with strong Python and Go development skills. Proven ability to collaborate effectively across teams and deliver impactful solutions.

Skills & Proficiency

APM

Python

AWS & Azure

Docker

Kubernetes

Unix & Linux Scripting

Go

CI/CD Processing

C/C++

PowerShell Scripting

Core Competencies

Dynatrace

Prometheus

AWS/Azure/GCP

Git

GitLab & GitHub

Terraform

Ansible

RESTful API

Atlassian Tooling

Jenkins

Helm

Kubespray

Cassandra

Elasticsearch

MongoDB

PostgreSQL

Kafka/TIBCO

SIEM & Log Management

Change Management

Requirements Planning

Data Domain

Data Deduplication

Visio Diagramming

Backup Administration

Disaster Recovery

Fabric Design & Zoning

Data Modeling

Automation

Cost Analysis

Cloud Data Protection

Experience

Site Reliability Engineer

06/2023 - Present

Concentra - Addison, TX (Hybrid)

Ensured the reliability and performance of critical medical service platforms, focusing on real-time services monitoring and KPI tracking. Developed and maintained monitoring, alerting, and incident response procedures to minimize downtime. Collaborated with engineering to improve system architecture and optimize performance. Automated infrastructure management and contributed to KPI development to measure system health.

Modernized application performance monitoring by migrating from AppDynamics to Dynatrace, leading to increased visibility and actionable insights.
Developed strategic KPI dashboards aligned with Google SRE golden signals, providing actionable insights into system health and performance.
Created comprehensive SRE documentation, standardizing monitoring and observability practices for applications and platforms.
Designed and implemented Azure dashboards and alerts, empowering application and DevOps teams with proactive monitoring capabilities.
Collaborated with application teams to refine and develop critical metrics, driving data-driven decision-making and business optimization.
Delivered critical observability and monitoring insights during major incidents, facilitating rapid resolution and minimizing business impact.

DevOps Engineer

08/2019 - 05/2023

Anodot - Home (Remote)

Member of a worldwide DevOps team with responsibilities spanning all elements of the business. These included SaaS-based and on-prem Kubernetes environments, CI/CD pipelines supporting development, creation and maintenance of AWS infrastructure, troubleshooting customer environments, and documentation of all critical components and designs.

Performed customer installations including Kubernetes, Anodot's application, and the monitoring application. Troubleshot issues in customer environments and validated data ingested into the platform.
Partnered with Sales and Customer Success to discover and analyze data from the point of origin, through pipelines or APIs, and into the platform.
Developed documentation defining core business KPIs for Anodot's platform, both SaaS and on-prem. This centered on monitoring dashboards, alerting, effective ranges, and impact analysis.
Performed load and scale testing on the platform within isolated AWS environments. This included scaling out a Kubernetes cluster and pushing ingest rate and payloads into the platform.

Principal Platform Engineer

2017 - 2019

CA Technologies - Plano, TX

Partnered with teams on several internal infrastructure agile projects utilizing daily scrums and Kanban techniques to deliver defined sprint requirements.

Developed a weighted analysis and metric form for the comparison of enterprise backup vendor products including ROI and cost analysis for a project with a $1-2M budget.
Created wiki-based documentation and architectural schematics for the corporate data protection environment, including RACI charts, an escalation matrix, and complex decision trees, leading to a 100% satisfaction finding from the compliance review board.

Sr. Platform Administrator

2010 - 2017

CA Technologies - Plano, TX

IT Engineering team member responsible for partnering with architecture to gather application and infrastructure requirements from the business units. Refined requirements to develop test plans, run books, compliance policies and deployment procedures.

Automated AWS cloud-based application protection using snapshots and lifecycle management techniques. Moved overall protection compliance rate from 20% to 99.9% within a 3-month span.
Designed Nutanix local and remote data protection domains to protect critical worldwide data from over 15 countries and 25 cities to global data centers within four global regions.
Developed an API script used for the protection and isolation of a tertiary copy of critical application backup data resolving a mandate instituted by executive leadership for a near-immutable layer of data protection.
Worked with architects on application and infrastructure design and documentation leading to improvements in regulatory compliance and reductions in operational oversights.
Assisted in the development of the first policies and procedures for protecting against malicious destruction of data.

Sr. Unix Systems Administrator

2000 - 2010

CA Technologies - Plano, TX

Collaborated with development engineers and team leads to oversee the critical infrastructure components of the Configuration, Quality, and Build Management Systems. Transitioned to an active role in the storage team with a primary goal of stabilizing and modernizing the corporate backup environment.

Redesigned entire corporate backup infrastructure using disk-based backups and asynchronous replication technologies.
Architected and implemented a complete overhaul of an isolated backup SAN fabric.
Oversaw NAS infrastructure redesign to include both DR and NDMP backup components.
Integrated monitoring and alerting into NAS and Backup infrastructure environments.
Developed Unix scripts providing the development division with independent control over production/development/QA life cycles for their configuration management environment.

Projects

Proactive System Health Monitoring Initiative

2023 - 2024

Observability and Monitoring

Developed a suite of strategic KPI dashboards aligned with Google's Site Reliability Engineering (SRE) principles, specifically focusing on the golden signals of latency, traffic, errors, and saturation. These dashboards provided real-time visibility into critical system metrics, empowering application and DevOps teams with proactive monitoring capabilities. By leveraging these insights, teams were able to identify and address potential issues before they impacted users, ultimately driving improved system reliability, performance, and overall service availability.

Trivy Vulnerability Scanning Interface

2021 - 2022

Design and Development

Implemented an automated Trivy scanning system for examining our nightly development builds across a master feature branch environment. The process emailed our development and security teams a summary of the vulnerability scan output, with the ability to drill into a detailed HTML report/table hosted on AWS S3. The functionality ran as a Kubernetes job and was written in Python with a Bash invocation.

Prometheus Monitoring System

2020 - 2021

Project Team Contributor

Assisted in the design and administration of a single-point installation process of a Prometheus monitoring environment for an on-prem version of our application. This included Elasticsearch, Cassandra, Kafka, MongoDB, our application's remote write, Grafana, and VictoriaMetrics. Grafana and VictoriaMetrics were used for isolated metric analysis and long-term metric retention, respectively. This was accomplished with a Helmfile and basic Bash scripting for the creation of the persistent storage volumes.

Utility Application Container

2019 - 2020

Designer and Developer

Created an advanced multipurpose utility for troubleshooting and diagnosing issues within our application's Kubernetes namespace, designed around a robust and optimized Dockerfile. It encompassed tools for network diagnostics, functions for database health checks, Python modules specific to our application's API framework for creating and testing customer configurations, and default dashboard metric analytics.

Backup Rearchitect & Redesign

2017 - 2018

Project Technical Lead

Created a requirements list based on function and process. Performed vendor POCs with several different solutions: EMC Avamar/NetWorker, Commvault, Rubrik, and Veeam. Developed a weighted analysis and metric form to cover requirements, performance data, reporting, and several other miscellaneous metrics and processes. Presented a business case to leadership with risks, budgeting criteria, and ROI analysis.

Malicious Destruction of Data

2015 - 2017

Project Team Lead

Gathered requirements from the business and IT leaders to develop a strategy to protect the company's core data in the event of a malicious attack. Implemented a tertiary isolated and protected copy of critical application data using specific storage API calls and ad hoc scripting via a bridge/jump/bastion server.

Backup Optimization

2012 - 2014

Analytical Observer / Contributing Team Member

Appointed as an analytical observer to develop a course of action. Redesigned backup infrastructure to utilize a disk-based backup strategy with reliance on replication to handle offsite backup compliance requirements. Transitioned backup environment to Cisco's UCS for automatic load balancing of network and fabric protocols. Reduced operational backup windows by up to 80%, RTO by up to 60%, and significantly improved RPO.