Career Profile

Seasoned DevOps Engineer with a deep understanding of modern infrastructure and software architectures, now specializing in Site Reliability Engineering. Adept at leveraging SRE principles to optimize system performance and ensure high availability. Recent projects include developing strategic dashboards aligned with Google's golden signals for enhanced observability. Extensive experience with Kubernetes management across cloud and on-premise platforms, coupled with strong Python and Go development skills. Proven ability to collaborate effectively across teams and deliver impactful solutions.

Skills & Proficiency

APM

Python

AWS & Azure

Docker

Kubernetes

Unix & Linux Scripting

Go

CI/CD Processing

C/C++

PowerShell Scripting

Core Competencies

  • Dynatrace
  • Prometheus
  • AWS/Azure/GCP
  • Git
  • GitLab & GitHub
  • Terraform
  • Ansible
  • RESTful API
  • Atlassian Tooling
  • Jenkins
  • Helm
  • Kubespray
  • Cassandra
  • Elasticsearch
  • MongoDB
  • PostgreSQL
  • Kafka/TIBCO
  • SIEM & Log Management
  • Change Management
  • Requirements Planning
  • Data Domain
  • Data Deduplication
  • Visio Diagramming
  • Backup Administration
  • Disaster Recovery
  • Fabric Design & Zoning
  • Data Modeling
  • Automation
  • Cost Analysis
  • Cloud Data Protection

Experience

    Site Reliability Engineer

    06/2023 - Present
    Concentra - Addison, TX (Hybrid)
    Ensured the reliability and performance of critical medical service platforms, focusing on real-time services monitoring and KPI tracking. Developed and maintained monitoring, alerting, and incident response procedures to minimize downtime. Collaborated with engineering to improve system architecture and optimize performance. Automated infrastructure management and contributed to KPI development to measure system health.
    • Modernized application performance monitoring by migrating from AppDynamics to Dynatrace, leading to increased visibility and actionable insights.
    • Developed strategic KPI dashboards aligned with Google SRE golden signals, providing actionable insights into system health and performance.
    • Created comprehensive SRE documentation, standardizing monitoring and observability practices for applications and platforms.
    • Designed and implemented Azure dashboards and alerts, empowering application and DevOps teams with proactive monitoring capabilities.
    • Collaborated with application teams to refine and develop critical metrics, driving data-driven decision-making and business optimization.
    • Delivered critical observability and monitoring insights during major incidents, facilitating rapid resolution and minimizing business impact.

    DevOps Engineer

    08/2019 - 05/2023
    Anodot - Home (Remote)
    Member of a worldwide DevOps team with responsibilities impacting all elements of the business. These included SaaS-based and on-prem Kubernetes environments, CI/CD pipelines supporting development, creation and maintenance of AWS infrastructure, troubleshooting of customer environments, and documentation of all critical components and designs.
    • Performed customer installations, including Kubernetes, Anodot's application, and the monitoring application. Troubleshot issues in customer environments and validated data ingested into the platform.
    • Partnered with Sales and Customer Success to discover and analyze data from its point of origin, through pipelines or APIs, and into the platform.
    • Developed documentation defining core business KPIs for Anodot's platform, both SaaS and on-prem, centered on monitoring dashboards, alerting, effective ranges, and impact analysis.
    • Performed load and scale testing on the platform within isolated AWS environments. This included scaling out a Kubernetes cluster and pushing ingest rate and payloads into the platform.

    Principal Platform Engineer

    2017 - 2019
    CA Technologies - Plano, TX
    Partnered with teams on several internal infrastructure agile projects utilizing daily scrums and Kanban techniques to deliver defined sprint requirements.
    • Developed a weighted analysis and metric form for the comparison of enterprise backup vendor products including ROI and cost analysis for a project with a $1-2M budget.
    • Created wiki-based documentation and architectural schematics for the corporate data protection environment, including RACI charts, an escalation matrix, and complex decision trees, leading to a 100% satisfaction finding from the compliance review board.

    Sr. Platform Administrator

    2010 - 2017
    CA Technologies - Plano, TX
    IT Engineering team member responsible for partnering with architecture to gather application and infrastructure requirements from the business units. Refined requirements to develop test plans, runbooks, compliance policies, and deployment procedures.
    • Automated AWS cloud-based application protection using snapshots and lifecycle management techniques. Moved overall protection compliance rate from 20% to 99.9% within a 3-month span.
    • Designed Nutanix local and remote data protection domains to protect critical worldwide data from over 15 countries and 25 cities to global data centers within four global regions.
    • Developed an API script used for the protection and isolation of a tertiary copy of critical application backup data resolving a mandate instituted by executive leadership for a near-immutable layer of data protection.
    • Worked with architects on application and infrastructure design and documentation leading to improvements in regulatory compliance and reductions in operational oversights.
    • Assisted in developing the first policies and procedures for protecting against the malicious destruction of data.

    Sr. Unix Systems Administrator

    2000 - 2010
    CA Technologies - Plano, TX
    Collaborated with development engineers and team leads to oversee the critical infrastructure components of the Configuration, Quality, and Build Management Systems. Transitioned to an active role in the storage team with a primary goal of stabilizing and modernizing the corporate backup environment.
    • Redesigned entire corporate backup infrastructure using disk-based backups and asynchronous replication technologies.
    • Architected and implemented a complete overhaul of an isolated backup SAN fabric.
    • Oversaw NAS infrastructure redesign to include both DR and NDMP backup components.
    • Integrated monitoring and alerting into NAS and Backup infrastructure environments.
    • Developed Unix scripts providing the development division with independent control over production/development/QA life cycles for their configuration management environment.

    Projects

    Proactive System Health Monitoring Initiative

    2023 - 2024
    Observability and Monitoring
    Developed a suite of strategic KPI dashboards aligned with Google's Site Reliability Engineering (SRE) principles, specifically focusing on the golden signals of latency, traffic, errors, and saturation. These dashboards provided real-time visibility into critical system metrics, empowering application and DevOps teams with proactive monitoring capabilities. By leveraging these insights, teams were able to identify and address potential issues before they impacted users, ultimately driving improved system reliability, performance, and overall service availability.

    Trivy Vulnerability Scanning Interface

    2021 - 2022
    Design and Development
    Implemented an automated Trivy scanning system for examining our nightly development-based builds across a master feature branch environment. The process emailed our development and security teams an embedded summary of the vulnerability scan output, with the ability to dive into a detailed HTML report/table hosted on AWS S3. The functionality was executed via a Kubernetes job and was written in Python with an invocation in Bash.

    Prometheus Monitoring System

    2020 - 2021
    Project Team Contributor
    Assisted in the design and administration of a single-point installation process of a Prometheus monitoring environment for an on-prem version of our application. This included Elasticsearch, Cassandra, Kafka, MongoDB, our application’s remote write, Grafana, and VictoriaMetrics. Grafana and VictoriaMetrics were used for isolated metric analysis and long-term metric retention, respectively. This was accomplished with a Helmfile and basic Bash scripting for the creation of the persistent storage volumes.

    Utility Application Container

    2019 - 2020
    Designer and Developer
    Created an advanced multipurpose utility for troubleshooting and diagnosing issues within our application's Kubernetes namespace, designed around a robust and optimized Dockerfile. It encompassed tools for network diagnostics, functions for health checks on databases, Python modules specific to our application's API framework for creating and testing customer configurations, and default dashboard metric analytics.

    Backup Rearchitect & Redesign

    2017 - 2018
    Project Technical Lead
    Created a requirements list based on function and process. Performed vendor POCs with several different solutions: EMC Avamar/Networker, Commvault, Rubrik, and Veeam. Developed a weighted analysis and metric form to cover requirements, performance data, reporting, and several other miscellaneous metrics and processes. Then presented a business case to leadership with risks, budgeting criteria, and ROI analysis.

    Malicious Destruction of Data

    2015 - 2017
    Project Team Lead
    Gathered requirements from the business and IT leaders to develop a strategy to protect the company's core data in the event of a malicious attack. Implemented a tertiary isolated and protected copy of critical application data using specific storage API calls and ad hoc scripting via a bridge/jump/bastion server.

    Backup Optimization

    2012 - 2014
    Analytical Observer / Contributing Team Member
    Appointed as an analytical observer to develop a course of action. Redesigned backup infrastructure to utilize a disk-based backup strategy with reliance on replication to handle offsite backup compliance requirements. Transitioned backup environment to Cisco's UCS for automatic load balancing of network and fabric protocols. Reduced operational backup windows by up to 80% and RTO by up to 60%, and significantly improved RPO.