DevOps has taken center stage in the cloud computing market, and more and more organizations are embracing DevOps practices. Enterprises are increasingly adopting the elastic cloud (especially since COVID-19), moving and deploying their applications to public, hybrid, and private cloud environments. The cloud provides flexible on-demand compute, enhanced mobility, and easy-to-use managed services, while removing the cost and complexity of maintaining static on-premises or bare-metal infrastructure.

However, increased cloud usage comes with inherent challenges: a higher risk of data breaches, SLA violations, compliance failures, performance bottlenecks, and so on. We regularly hear about companies getting hacked and paying dearly for it. Because the cloud is vulnerable, companies create positions such as CloudOpSec engineer and head of cloud security to identify and troubleshoot issues proactively rather than reactively, and to take remedial action that preserves a seamless customer experience. That's where cloud monitoring comes into play.

In this blog post, we will discuss cloud monitoring, its advantages, and best practices, and look at the next generation of cloud monitoring tools and services.

What is Cloud Monitoring?

Cloud monitoring is the process of observing, reviewing, evaluating, and managing the operational workflow of cloud-based applications, services, configurations, and infrastructure. It lets you drill down into observability data (KPI logs, traces, metrics, events, and metadata) collected using manual or automated tools, analyze it, and surface insights via dashboards, graphs, charts, and alerts.

Cloud monitoring continuously tracks security, resource utilization, site speed, server response times, health, availability, and operating speeds to predict issues and surface vulnerabilities before they cause outages. For applications running on containers or VMs, synthetic testing should cover the full stack: the application itself, the underlying server infrastructure, API health checks, and network security.

Using uptime detectors, we can track and review processes, applications, users, traffic, access requests, queries, data integrity and availability, available storage, real-time usage and consumption of cloud storage and database resources, third-party APIs, virtual network elements (servers, firewalls, routers, and load balancers), CPU and memory utilization, virtual machines and containers, orchestration tools like Kubernetes, and so on. Routine fixes include restarting services, clearing caches and temporary files, rebooting VMs, updating cron jobs, etc.
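An uptime detector at its simplest is just a scripted probe against an endpoint. The sketch below is a minimal, hypothetical example (the function name and return shape are my own, not from any particular tool) of an HTTP availability check of the kind these tools run continuously:

```python
import urllib.request

def check_endpoint(url: str, timeout: float = 5.0) -> dict:
    """Probe an HTTP endpoint and report whether it is reachable.

    Returns a small status record; a real uptime detector would also
    capture response time and feed the result into a metrics pipeline.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return {"url": url, "up": True, "status": resp.status}
    except OSError as exc:  # URLError, timeouts, and connection errors
        return {"url": url, "up": False, "error": str(exc)}
```

In practice you would run such probes on a schedule from several regions and alert when consecutive checks fail, rather than on a single failed request.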

Benefits of cloud monitoring:

⦁ Supports scalability and hybrid deployments without compromising performance.

⦁ Simple installation of cloud-native tools maintained by the service provider. Local issues don’t affect the performance of these tools. Out-of-the-box integrations with multiple cloud providers that require minimal supervision.

⦁ Enhanced mobility as the tools can be accessed and operated using a wide array of devices (desktops, tablets, and phones) from anywhere.

⦁ Reduced monitoring costs due to dynamic and flexible subscription-based pricing models.

Cloud Monitoring Best Practices

⦁ Identify key metrics and instances for monitoring service usage and expenditures.

⦁ Allow monitoring of serverless models.

⦁ Define threshold limits to trigger alerts by correlating events from different deployments and across different layers (application, infrastructure, etc.) for effective centralized data reporting.

⦁ Maintain historical records in time-series databases for long-term retention and on-demand pattern analysis. Store monitoring data separately from the apps and services it describes, and export it via APIs. Adhere to data-protection regulations such as GDPR.

⦁ Continuously test and audit tool performances to safeguard against failure.
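To make the time-series practice above concrete, here is a deliberately minimal, in-memory sketch of the data structure a monitoring store is built around (class and method names are illustrative only; production systems use a dedicated TSDB such as Prometheus or InfluxDB, kept separate from the monitored apps):

```python
from bisect import insort
from collections import defaultdict

class TimeSeriesStore:
    """Toy time-series store: metric name -> sorted (timestamp, value) pairs."""

    def __init__(self):
        self._series = defaultdict(list)

    def record(self, metric: str, ts: float, value: float) -> None:
        # Keep samples sorted by timestamp so range queries stay simple.
        insort(self._series[metric], (ts, value))

    def window(self, metric: str, start: float, end: float) -> list:
        """Return all samples with start <= timestamp <= end."""
        return [(t, v) for t, v in self._series[metric] if start <= t <= end]

    def average(self, metric: str, start: float, end: float) -> float:
        pts = self.window(metric, start, end)
        return sum(v for _, v in pts) / len(pts) if pts else 0.0
```

Range queries and windowed aggregates like `average` are exactly what pattern analysis and baseline-driven alerting are built on.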

Cloud Monitoring Tools

Some of the popular next-generation cloud monitoring tools are AppDynamics, Datadog, Nagios, New Relic, Sumo Logic, Sysdig, Dynatrace, PagerDuty, SolarWinds, Prometheus, Grafana, Amazon CloudWatch, Azure Monitor, and Google Cloud Operations.

What is Alerting?

Monitoring systems can be configured to send alert notifications over integrated communication channels (email, Slack, SMS, etc.) whenever metric values cross defined thresholds, anomalies are detected, or heartbeat checks fail. Notifications, whether reactive or proactive, should point to the root cause behind an error so that DevOps engineers and SREs can fix it quickly, before SLAs degrade significantly. False positives lead to alert fatigue and to alerts being ignored. Alerting should therefore be driven by a strong policy that uses baseline data as a benchmark, and should be holistic enough to give the full picture.
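One common way to cut false positives is to require that a threshold breach be *sustained* across several consecutive samples before firing, with the threshold derived from a baseline. This is a generic sketch of that idea (the function and parameter names are my own, not any vendor's API):

```python
def breaches(samples, baseline, tolerance=0.2, sustained=3):
    """Fire only if the metric exceeds baseline*(1+tolerance)
    for `sustained` consecutive samples, reducing alert noise."""
    threshold = baseline * (1 + tolerance)
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= sustained:
            return True
    return False
```

A brief spike above the threshold is ignored; only a sustained deviation from the baseline triggers the alert, which is the same principle behind "for" clauses in Prometheus alert rules or evaluation periods in CloudWatch alarms.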

Not all alerts carry the same urgency; they are typically classified as low, medium, or high severity. Low-severity alerts are regular and frequent, produced by continuous monitoring of performance deviations and unusual events, and call for preemptive action during routine troubleshooting. In contrast, medium- and high-severity alerts are critical (disk space full, high CPU usage, a hung process, network traffic overload, etc.) and require immediate intervention. It is very important that your established rules can determine the severity of alerts correctly.
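Severity rules of this kind can be expressed as an ordered list of predicates, checked from most to least severe. The thresholds and metric names below are illustrative assumptions, not recommendations:

```python
# Ordered most-severe first; the first matching rule wins.
SEVERITY_RULES = [
    ("high",   lambda e: e["metric"] == "disk_used_pct" and e["value"] >= 95),
    ("high",   lambda e: e["metric"] == "cpu_pct" and e["value"] >= 90),
    ("medium", lambda e: e["metric"] == "disk_used_pct" and e["value"] >= 80),
    ("low",    lambda e: True),  # everything else is routine
]

def classify(event: dict) -> str:
    """Return the severity of a monitoring event per SEVERITY_RULES."""
    for severity, rule in SEVERITY_RULES:
        if rule(event):
            return severity
    return "low"
```

Keeping the rules in one ordered table makes them easy to review and audit, which matters once routing (page vs. ticket vs. dashboard) hangs off the severity.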

Monitoring the cloud is necessary but what happens when you get 2,553 alerts a day?

Automated Remediation

Manual resolution tasks for routine incidents are time-consuming, error-prone, aggravate security risks, and can create system outages. Automated problem resolution improves application uptime, speeds up problem fixes, relieves IT Operations of redundant tasks, and ensures better audit compliance. 

A policy-driven, configurable automation capability in a cloud platform helps businesses govern the cloud easily and become self-healing, while reducing troubleshooting and maintenance costs. Next-generation monitoring tools that leverage AI/ML can be configured to execute programmatic remedial actions in response to alerts. Using scripts, you can embed custom actions as event-specific resolution instructions.
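The core of such a capability is a playbook: a mapping from alert types to remediation actions, with a fallback to human escalation when no policy matches. A minimal sketch (all names are hypothetical; the actions here are stubs standing in for real operational scripts):

```python
import logging

def restart_service(alert):
    # Stub: a real action would call systemd, an orchestrator API, etc.
    return f"restarted {alert['resource']}"

def clear_cache(alert):
    return f"cleared cache on {alert['resource']}"

PLAYBOOK = {
    "service_unresponsive": restart_service,
    "cache_bloat": clear_cache,
}

def remediate(alert, playbook=PLAYBOOK):
    """Dispatch an alert to its policy-defined action, or escalate."""
    action = playbook.get(alert["type"])
    if action is None:
        return ("escalate", None)  # no policy: hand off to an engineer
    result = action(alert)
    logging.info("auto-remediation: %s -> %s", alert["type"], result)
    return ("remediated", result)
```

Because the playbook is data, governance teams can review and extend it without touching the dispatch logic, and every execution leaves an audit trail via the log call.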

For example, a Security Information and Event Management (SIEM) solution collects, stores, and analyzes security data from network devices, servers, domain controllers, and so on, and can be configured to automatically trigger a threat-response workflow when a security incident is detected. Enterprises also use Cloud Security Posture Management (CSPM) solutions for automated remediation.

Whether remediation is fully automated or guided depends on the state of your cloud infrastructure, and it should be treated as the preferred alternative to manual intervention. Automated remediation should not start on day one of operations; organizations need to understand, learn, plan, and prototype their automation capabilities gradually as they move up the learning curve.

To start, during the initial rollout, notifications serve as a building block for discovering which resources (e.g. exposed SSH ports, public buckets) and processes should be automated.

You can then perform dry runs by simulating failures, logging and auditing the results of each automated remediation.
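A dry-run mode is easy to build into the remediation runner itself: the same code path runs, but the action is recorded instead of executed. This is an illustrative sketch (function names and the audit-record shape are assumptions of mine):

```python
def reboot_vm(alert):
    # Stub standing in for a real cloud API call.
    return f"rebooted {alert['resource']}"

def run_remediation(alert, action, dry_run=True):
    """Execute (or merely simulate) a remediation, returning an audit record."""
    if dry_run:
        # Record what *would* happen; nothing is actually touched.
        return {"mode": "dry-run", "planned": action.__name__, "alert": alert["type"]}
    return {"mode": "executed", "result": action(alert), "alert": alert["type"]}
```

Defaulting `dry_run` to `True` is a deliberate safety choice: automation only takes real action once you explicitly opt in, after the simulated audit records have been reviewed.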

Sample automation for AWS includes four levels of automated remediation:

Now that we understand the importance of cloud monitoring and remediation, who can handle 2,553 alerts per day? And that's before accounting for the resources needed for maintenance. Kaholo gives you a fast and reliable solution: integrate Kaholo with your monitoring tools (Lacework, Datadog, etc.) and, with a simple drag-and-drop interface, create auto-remediation in minutes instead of days. You can try it yourself using our free sandbox. Andreas from Lacework shows how simple it is in this blog post.

Auto remediation use-case (Image Source: Faun)

In some cases, instead of acting directly, the remediation workflow waits for approval. Rather than being a preemptive measure, auto-remediation reacts to event outcomes and should therefore be used in unison with proactive measures. Event-driven automation used for assisted or guided remediation creates alerts enriched with diagnostic root cause analysis (RCA) reports, monitoring metrics, and graphs, helping engineers find the root cause quickly and take the necessary actions.
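An approval-gated remediation step can be modeled as a small state machine: the action stays pending until a human approves it, and running it before approval is a no-op. A hypothetical sketch (class and state names are my own):

```python
from enum import Enum

class State(Enum):
    PENDING = "pending_approval"
    APPROVED = "approved"
    REJECTED = "rejected"
    DONE = "done"

class ApprovalGate:
    """Remediation step that waits for explicit human approval before acting."""

    def __init__(self, action):
        self.action = action
        self.state = State.PENDING

    def approve(self):
        self.state = State.APPROVED

    def reject(self):
        self.state = State.REJECTED

    def run(self, alert):
        if self.state is not State.APPROVED:
            return None  # still waiting (or rejected): take no action
        result = self.action(alert)
        self.state = State.DONE
        return result
```

In a real workflow engine, `approve` would be wired to a chat or ticketing integration, and pending gates would appear in the on-call queue alongside the RCA material the alert carries.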

Businesses can create a Cloud Center of Excellence (CCoE) for adopting reliable auto-remediation policies and best practices that should address financial, operational, and compliance goals. 

Some of the must-have features of auto-remediation tools are:

Conclusion

Every organization has unique requirements when it comes to implementing automation. For some, using automation to create alert notifications may be enough; for others, automated remediation is mandated by the frequency and complexity of incidents. Whatever the need, a strong policy and vision, plus a gradual path toward fully automated remediation, will realize the best value from your cloud monitoring journey.

 
