## Adopting Monthly Operational Review Meetings as a Learning Exercise

- Terry Brady, Software Developer https://github.com/terrywbrady

---

## Presentation Purpose

- Development team needed to take on new operational responsibilities
- Our approach to this challenge
- What we learned

---

## Merritt Team

[Merritt](merritt.cdlib.org) is a Digital Preservation Service that is run by the [California Digital Library](cdlib.org)

----

## Merritt System

- 1.4 PB of Cloud Storage (3 distinct providers)
- 4.5M Objects
- 188M Files
- 7 microservices, 26 Servers
- MySQL, ZooKeeper
- https://github.com/cdluc3/mrt-doc

----

## Team Roles

- 1 Product Manager
- 3 Software Developers
- 1 DevOps engineer (supporting multiple teams within our department)

----

## Our Backgrounds

- Primarily software development (Java, Ruby, SQL)
- One developer has a Systems Admin background
- Our DevOps engineer has a Systems Admin background
- CDL has 3 full-time Systems Administrators supporting multiple departments

----

## DevOps Adoption at CDL

- We are adopting a DevOps approach
- Need to empower developers to perform operational functions
  - without turning them into full time Systems Administrators

----

## Immediate Needs for the Development Team to Take On

- Capacity Planning
- Software End of Life
- Software Vulnerabilities/Dependencies
- Proactive Error Discovery

----

## Not our Expertise

- Need to learn functions that have not traditionally been part of software development
- While still primarily focusing on software development

----

## Discussion

- Does this challenge sound familiar?
- How has your team addressed this challenge?

---

## Our Approach

----

## Let's Figure it Out Together!
- Hold each other accountable
- Learn together
- Hopefully make it more interesting
- Iteratively improve our approach

----

## 3 Areas of Focus

- Capacity Planning
  - Server Performance
- Software Dependency Review
  - Vulnerabilities
  - End of Life
- Proactive Error Log Tracking

----

## Our Process

- Schedule time once a month for each area of focus
- Limit to 30 minutes
- Accomplish as much as we can
- Figure it out together
- Document learnings
- Build a script to follow in next session

---

## Capacity Planning - Server Performance

----

## Our Meeting Script

- The actual script resides in a private repository.
- A [sanitized version](https://github.com/CDLUC3/uc3-present/blob/main/monthly_ops/routine_librato_checks.md) is provided here.

----

## Review Monthly Stats

- Identify peak processing

----

Bytes Processed over 30 days

----

## Sample Performance Charts

----

## Database

----

## Query Performance

- Our system administrators have enabled Query Performance Insights for the most recent 7 days
- Our query performance is quite stable at this time

----

RDS Performance Insights

----

## Store/Ingest - subject to peak processing

- Ingest service: download and validate
- Storage service: upload to S3 and validate

----

## Server Notes

- Link to performance charts
- Expected performance
- Key items to review

----

## Storage Service

- Look for sustained CPU near 100%
- Look for peaks in SAR NFS wait
- Look for any significant drops in available memory or high swap

----

Storage Server - CPU

----

Storage Server - IO Wait

----

Storage Server - NFS wait

----

Storage Server - Memory Headroom
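
----

The "sustained CPU near 100%" check from the Storage Service notes could be automated along these lines. This is a minimal sketch, not the team's actual tooling: it assumes CPU utilization has already been exported as a list of evenly spaced percentage samples, and the function name, the 95% threshold, and the minimum run length are all hypothetical.

```python
from typing import List, Tuple


def sustained_high_runs(cpu_pct: List[float],
                        threshold: float = 95.0,
                        min_samples: int = 6) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs (end inclusive) for runs where CPU
    stays at or above `threshold` for at least `min_samples` samples."""
    runs: List[Tuple[int, int]] = []
    start = None  # index where the current high-CPU run began, if any
    for i, value in enumerate(cpu_pct):
        if value >= threshold:
            if start is None:
                start = i
        else:
            # The run (if any) ended at the previous sample.
            if start is not None and i - start >= min_samples:
                runs.append((start, i - 1))
            start = None
    # Close out a run that lasts until the final sample.
    if start is not None and len(cpu_pct) - start >= min_samples:
        runs.append((start, len(cpu_pct) - 1))
    return runs
```

Short spikes are ignored on purpose: only runs that last `min_samples` or longer are reported, which is what separates "sustained" load from ordinary peaks.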

----

## Ingest boxes

- Look for high I/O wait
- Look for peaks in SAR NFS
- Look for any significant drops in available memory or high swap
- Note trends in memory before and after patching

----

Ingest Server - CPU

----

Ingest Server - IO wait

----

Ingest Server - NFS wait

----

Ingest Server - Memory Headroom

----

## Audit Service

- Continually processes content
- Performs fixity check on cloud content every 60 days

----

## Audit - constant load
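
----

For readers unfamiliar with the term, a fixity check recomputes a file's digest and compares it with the digest recorded when the file was stored. The sketch below illustrates the idea and the 60-day cycle mentioned for the Audit Service. It is not Merritt's implementation: the function names are hypothetical, and SHA-256 is an assumed algorithm.

```python
import hashlib
import io
from datetime import datetime, timedelta
from typing import BinaryIO


def compute_digest(stream: BinaryIO, algorithm: str = "sha256",
                   chunk_size: int = 1024 * 1024) -> str:
    """Hash a stream in chunks so large files never sit fully in memory."""
    digest = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(stream: BinaryIO, expected_hex: str,
              algorithm: str = "sha256") -> bool:
    """True when the recomputed digest matches the recorded one."""
    return compute_digest(stream, algorithm) == expected_hex.lower()


def fixity_due(last_checked: datetime, now: datetime,
               interval_days: int = 60) -> bool:
    """True when at least `interval_days` have passed since the last check."""
    return now - last_checked >= timedelta(days=interval_days)


# Usage: an in-memory stream stands in for content read from cloud storage.
recorded = hashlib.sha256(b"example content").hexdigest()
print(fixity_ok(io.BytesIO(b"example content"), recorded))
```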

----

## Learnings - Capacity Planning

----

## Learnings - What to be aware of in each review

- server patching
- software releases
- known service downtimes

----

## Learnings - Issues Resolved

- IO Wait revealed
  - need for IO-optimized instance types
  - replace AWS EFS with ZFS for better throughput

----

## Learnings - Continue to Watch

- RAM
- Available headroom
- Memory Leaks
- Unreleased database connections

----

## Discussion

- How do your teams review server performance?
- How do your teams perform capacity planning?

---

## Software Dependency Checks

- Vulnerabilities in Code Repos
- End of life for Code Frameworks
- Obsolete versions of 3rd Party software

----

## Software Dependency Review

- The actual script resides in a private repository.
- A [sanitized version](https://github.com/CDLUC3/uc3-present/blob/main/monthly_ops/dependency_scans.md) is provided here.
  - _some non-public links have been obscured_

----

## Code Repo Vulnerabilities

- Review
- Evaluate severity
- Assign tickets to resolve

----

## Review Feature Library Versions

- MySQL
- ZooKeeper
- Apache Tika

----

## Review Language Version and End of Life

- Ruby
- Java

----

## Review Framework Versions and End of Life

- Rails
- Tomcat
- jQuery

----

## Review Build Tool Versions

- Apache Maven

----

## Review Build Tool Plugin Versions

- Maven plugins

----

## Set Schedule for Next Review

----

## Identify General Purpose Libraries with New Versions

- we have identified a tool
- we have not yet implemented this
- we just completed a large migration

----

## Learnings - Dependency Review

----

## Learnings

- The information we needed was available online
  - reference sites to search
  - services/sites that push alerts to us
- Some end of life dates have surprised us

----

## Learnings

- Some checks are OK to do quarterly or biannually
  - we should note when the next review should occur

----

## Learnings

- We have been rewarded by every effort we have made to
  - normalize builds
  - consolidate dependency configuration
  - automate testing

----

## Discussion

- How do your teams review software vulnerabilities?
- How do your teams keep 3rd party software up to date?

---

## Error Logs

----

## Consolidated Error Logs

- We consolidated all of our logs in OpenSearch in 2023

----

## Consolidated Logs

- Expedites investigation of a user-reported problem

----

## Consolidated Logs

- Enables us to discover errors before they become user-reported problems

----

## Error Discovery

- We rely on
  - OpenSearch Saved Searches
  - OpenSearch Visualizations
  - OpenSearch Dashboards
- Constantly refining our focus

----

## Dashboard Review

- The actual script resides in a private repository.
- A [copy](https://github.com/CDLUC3/uc3-present/blob/main/monthly_ops/dashboard_review.md) is provided here.

----

## Web Application Firewall (WAF) Logs

- Make note of timeframes where errors may have been introduced by malicious activity
- Identify categories of malicious activity

----

WAF Logs

----

## Learnings - WAF

- Anticipate random site crawling each month
- Rate limit attempts to log in to the site
- Limit resource-intensive queries for unauthenticated users

----

## Application Logs - User Interface

----

## User Interface Errors by return code

----

Exploring User Interface Errors

----

## Learnings - UI Errors

- Expect 401/404 Errors
- Explore/Prevent 500 Errors

----

## UI 500 Errors

- Stale database connections
  - Retry logic
  - Reset connections

----

## Backend Service Errors

----

Storage Service Errors
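
----

Returning to the stale database connections listed under UI 500 Errors, the "retry logic" and "reset connections" remedies can be combined in one wrapper. The sketch below is illustrative only and is not the Merritt code: `StaleConnectionError`, `with_retry`, and the `reset` callback are hypothetical names standing in for whatever a real database driver provides.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class StaleConnectionError(Exception):
    """Hypothetical stand-in for a driver's stale-connection failure."""


def with_retry(operation: Callable[[], T],
               reset: Callable[[], None],
               attempts: int = 3,
               delay_seconds: float = 0.0) -> T:
    """Run `operation`; on a stale connection, call `reset` and try again.

    Re-raises the last error once `attempts` tries have been used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[StaleConnectionError] = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except StaleConnectionError as err:
            last_error = err
            reset()  # e.g. discard and reopen the pooled connection
            if attempt < attempts and delay_seconds > 0:
                time.sleep(delay_seconds)
    assert last_error is not None
    raise last_error
```

Only the stale-connection error is retried here. Other exceptions propagate immediately, so genuine application bugs still surface as 500 errors in the logs.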

----

## Capacity/Performance Analysis Using Log Data

----

Bytes processed, computed from application logs
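
----

A bytes-processed chart like this one is a per-day sum over log records. In our case OpenSearch does the aggregation, but the same computation can be sketched in plain Python. The sketch assumes JSON log lines with an ISO 8601 `timestamp` string and an integer `bytes` field; those field names and the function name are hypothetical, not the actual Merritt log schema.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def bytes_per_day(log_lines: Iterable[str]) -> Dict[str, int]:
    """Sum the `bytes` field of JSON log lines, grouped by calendar day."""
    totals: Dict[str, int] = defaultdict(int)
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict):
            continue
        timestamp = record.get("timestamp")
        size = record.get("bytes")
        # Require a string timestamp and a real integer (bool is excluded).
        if (isinstance(timestamp, str) and isinstance(size, int)
                and not isinstance(size, bool)):
            totals[timestamp[:10]] += size  # first 10 chars: YYYY-MM-DD
    return dict(totals)
```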

----

## Learnings - Capacity Analysis

- Partner organization was requesting **18 TB of data assembly per week**
- Surprise to us and our partner
- We began queueing "large" requests

----

## Discussion

- How do your teams handle error logs?
- Do you have a process to discover unreported errors?

---

## Our Review Meeting Process

- Review
- Document action items as tickets
- Refine script for the next meeting

----

## Meetings

- Initially, we ran out of time at each meeting
- Eventually, we were able to cover the script in half of the time
  - End early
  - OR identify how to go deeper

----

## Meetings

- An additional meeting is an interruption
- Generally, we are glad we did it by the end
- It is more fun when something went wrong during the month

----

## Team Member Feedback

- Connects team members to Security
- Connects team members to Cost Information
  - Cost review is handled by our DevOps engineer
- Greater appreciation of what our logs can do

----

## Beyond Our Team

- Our manager appreciates the ability to dive into details with the team
- Folks beyond our team have joined us to observe the process
  - complimentary responses
- This has given the initiative a nice boost

----

## Script

- Script contains collective learning
- Markdown is easy to edit
- Regular review builds confidence
- We challenge ourselves to go deeper as we have solved the immediate issues

----

## Discussion

- Do these ideas sound applicable to your environments?

---

## Where to go next

----

## Capacity Planning

- Covered pretty well at the server level
- Key metric computations could be interesting

----

## Software dependencies

- Library updates unrelated to vulnerabilities

----

## Error Logs

- What content is missing from our log files
- What index keys are needed
- What visualizations are needed

----

## Fun Questions to Answer

- What system functions require the most retries?
- Can we graph periods of time where retries increase?
- Can we graph retries required by cloud provider?
- Can we quantify throughput by cloud provider?

---

## Thank You

- https://github.com/terrywbrady
- UCTech Slack: Terry Brady (UCOP-CDL)