Standard maintenance operations
A “maintenance” service downtime typically occurs when application or operating system restarts are required across the entire cluster. The most common maintenance operations cover the following:
- Deployment of a new version of the software
- Deployment of a Quick Fix Engineering (QFE) update
- Replacement or repair of cluster nodes
- Daily backup of the system and application
A maintenance operation is a set of processes performed in sequence or in parallel by one or more Interana Engineers.
Here is a simple representation of the steps involved in a single Interana standard maintenance operation for one customer.
Important note: the following description is the Interana standard maintenance practice.
Each step can be customized to your needs: for example, you may want to post the maintenance window alert on your own digital board and disable the Interana alert banner, or require particular post-upgrade validations beyond the standard ones.
During the entire process, Operations Engineers, Test Engineers, QA Engineers, and Customer Managers communicate internally in various forms, most often in person or, during off-hours, via real-time channels including video, screen sharing, and instant messaging; third-party software is also used to manage issue tracking and the maintenance project. During the maintenance window, additional checks can be performed once the whole process has completed and time remains before the window closes. A standard maintenance window lasts 4 consecutive hours and is normally scheduled during Customer off-hours (for example, 6:00 PM - 10:00 PM PST/PDT).
Once the Customer (Single Point of Contact, or SPOC) agrees on the maintenance schedule, End Users are informed at least 36 hours before the maintenance starts via a banner displayed directly in the Interana Visual Explorer window.
Below is an example of the Interana Visual Explorer output with the alert message highlighted in yellow:
Backup and Validation
Data backups vary in format and procedure; during a maintenance window, Interana Engineers may perform several backups, covering raw content, operating system data, application binaries, and application data. Each node of the cluster is backed up according to its cluster role.
Validation checks are performed on each backup for consistency and data integrity.
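As an illustration, one common integrity check is comparing each backup file's checksum against a value recorded in a manifest at backup time. The sketch below is an assumption of a typical approach, not Interana's actual tooling:

```python
import hashlib

def sha256sum(path, chunk_size=1 << 20):
    """Stream a file in 1 MiB chunks and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def validate_backup(path, expected_digest):
    """Compare a backup file's digest against the manifest value
    recorded when the backup was taken."""
    return sha256sum(path) == expected_digest
```

Streaming the file keeps memory usage flat even for multi-gigabyte backup archives.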
Service Stop – Health check
Once the previous step is completed, the cluster is ready to be “serviced” and Interana Teams are alerted about the imminent service downtime. Interana Engineers run automated procedures to stop all running Interana services on the cluster until all health checks confirm that the cluster is down and ready to be updated. At this point all network requests will fail and the UI will respond with an error like this:
A maintenance page can still be served, however, provided at least one API node is left running.
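The "cluster is down" confirmation can be sketched as polling every node until all health requests fail. The `/health` endpoint and the node list below are hypothetical, not a documented Interana API:

```python
import time
import urllib.request
import urllib.error

def wait_until_down(nodes, timeout=600, interval=10):
    """Poll each node's (hypothetical) /health endpoint until every
    request fails, confirming the cluster's services are fully stopped."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        still_up = []
        for node in nodes:
            try:
                urllib.request.urlopen("http://%s/health" % node, timeout=5)
                still_up.append(node)  # request succeeded: node still serving
            except (urllib.error.URLError, OSError):
                pass  # connection refused or timed out: services are stopped
        if not still_up:
            return True
        time.sleep(interval)
    return False
```

In practice a check like this would be wrapped in the automated stop procedure, with the still-up nodes logged on timeout.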
Service Update
This step can cover one or more Interana service updates, including third-party and operating system software.
Automated procedures are performed and monitored by Interana Engineers to make sure each service is properly updated.
Temporary folders are created to hold a copy of the code that is about to be updated, so it can be restored if something goes wrong.
Once the update is completed, a log is posted and services are restarted.
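The snapshot-and-rollback pattern described above can be sketched as follows; the folder layout is an assumption, and `run_update` is a placeholder for the actual update procedure:

```python
import shutil
import tempfile
from pathlib import Path

def update_with_rollback(install_dir, run_update):
    """Copy the current install into a temporary folder, run the update,
    and restore the snapshot if the update raises an error."""
    install_dir = Path(install_dir)
    backup_dir = Path(tempfile.mkdtemp(prefix="rollback-"))
    snapshot = backup_dir / install_dir.name
    shutil.copytree(install_dir, snapshot)  # preserve the pre-update code
    try:
        run_update()
    except Exception:
        # Update failed: discard the partial result and restore the snapshot.
        shutil.rmtree(install_dir)
        shutil.copytree(snapshot, install_dir)
        raise
    return True
```

Re-raising after the restore lets the calling automation log the failure while leaving the cluster on the previous version.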
Service Start – Health check
Health checks are vital to determine whether the previous step succeeded. Automated procedures start all services within the cluster and then perform an initial “smoke test” validation. Interana Engineers check all available telemetry to evaluate the current cluster health and give the green light for the next step. The following snapshot shows just a few example counters; Interana Engineers observe multiple dashboards, collect system diagnostics in real time, and evaluate and compare performance.
Data validation includes consistency checks for the configuration database as well as the content data, via database queries, API queries, or other ad-hoc scripts.
Import pipelines are checked to make sure the system has resumed importing new data (if available). Key Performance Indicator (KPI) dashboards are monitored to confirm that each node of the cluster is behaving within required thresholds.
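A minimal data-consistency check along these lines compares per-table row counts captured before and after the maintenance; the table names and tolerance policy here are illustrative assumptions:

```python
def validate_counts(pre, post, tolerance=0.0):
    """Compare row counts captured before (`pre`) and after (`post`) the
    update, both mapping table name -> count.  Counts may grow if import
    resumed during validation, but must never shrink beyond the tolerance.
    Returns a list of (table, before, after) tuples that failed."""
    failures = []
    for table, before in pre.items():
        after = post.get(table, 0)
        if after < before * (1 - tolerance):
            failures.append((table, before, after))
    return failures
```

An empty return value means every table passed and functional testing can begin.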
Once data is validated, functional tests can start.
End User UI tests are performed to confirm maintenance completion. Interana Engineers perform Basic Validation Tests (BVTs), which include running basic queries such as counting events and counting unique shard keys, as well as combinations of group-by and filter-by queries.
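A BVT runner can be sketched as below; the query shapes are illustrative, and `query_fn` stands in for whatever wrapper is used around the actual query API:

```python
def run_bvts(query_fn):
    """Run basic validation queries through a caller-supplied query
    function and report pass/fail per check.  A check passes if the
    query returns a non-negative result without raising."""
    checks = {
        "count_events": {"measure": "count", "of": "events"},
        "count_unique_shard_keys": {"measure": "unique_count", "of": "shard_key"},
    }
    results = {}
    for name, query in checks.items():
        try:
            value = query_fn(query)
            results[name] = value is not None and value >= 0
        except Exception:
            results[name] = False  # query error counts as a BVT failure
    return results
```

Catching exceptions per check means one failing query does not mask the results of the others.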
Dashboard load times are measured and compared with earlier data to evaluate any potential impact of the update.
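The load-time comparison can be expressed as a simple regression gate; the 20% threshold below is an assumption for illustration, not an Interana standard:

```python
def load_time_regressions(baseline_ms, current_ms, max_slowdown=1.2):
    """Return the dashboards whose post-update load time exceeds the
    pre-update baseline by more than `max_slowdown` (20% by default).
    Both arguments map dashboard name -> load time in milliseconds."""
    return [
        name
        for name, ms in current_ms.items()
        if ms > baseline_ms.get(name, float("inf")) * max_slowdown
    ]
```

Dashboards absent from the baseline are skipped, since there is no prior figure to compare against.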
If the functional tests pass, an Interana Engineer logs the event and informs the Operations Team.
Internally, Interana Engineers update all open tickets involved in the maintenance, and Customer Success Managers are informed and advised to start specific validation tasks (for example, during standard QFE-based maintenance).
The Interana Explorer system banner is then disabled, and a check is performed to verify that the change is reflected in the UI.
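Verifying the banner change can be scripted by fetching the Explorer page and checking that the banner markup is gone; the `maintenance-banner` marker below is a hypothetical CSS class, not the actual Explorer markup:

```python
import urllib.request

BANNER_MARKER = "maintenance-banner"  # hypothetical CSS class of the alert banner

def banner_disabled(html, marker=BANNER_MARKER):
    """Return True if the maintenance banner markup no longer
    appears in the rendered page source."""
    return marker not in html

def check_explorer(url):
    """Fetch the Explorer page and confirm the banner is gone."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return banner_disabled(resp.read().decode("utf-8", "replace"))
```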
Depending on the case, the Customer SPOC is informed via email, phone, or any other means effective for the Customer SPOC.