Mphasis | Container Security: Full Stack Telemetry & Actionable Observability is the key

March 26, 2019

Sriram Krishnamachari

Vice President, Cloud Portfolio Solutions, GTM Leader, Mphasis, and an advisor on the VMware Tanzu Partner Tech Advisory Board

Kubernetes, a great tool for supporting cloud-native and container migration goals, is well known for its efficient container orchestration. By some estimates, 50-60% of enterprises have adopted it at varying levels. However, 2018 was the year Kubernetes and Docker faced their first real security attacks, and backdoored malicious images were found on Docker Hub. These major holes, especially the recent runC root exposure, are a good reminder and reason to look deeper into the topic and to question the controls and guardrails that you have in place.

It is a vast topic, with more questions than answers. So, let’s touch upon four key takeaways that reflect the current state and the aspects you may want to consider:

DevSecOps through Automation Platforms
Telemetry and Actionable Observability
Rapidly Evolving Kubernetes Ecosystem
Security is everyone’s responsibility within the Enterprise

DevSecOps through Automation Platforms:

The runC container vulnerability (CVE-2019-5736) seemingly affected all container platforms that use runC, a standardized runtime for creating and running containers. It affected Docker and Kubernetes, and reportedly even Apache Mesos, which does not use runC. It is a case wherein a bad actor who can run commands as root inside a container can overwrite the host runC binary and gain root control of your host.

The impact potential of losing root control is indeed severe and substantial for business operations. Just the sheer impact potential warrants a harder, closer look at your container / cloud security posture:

What is your business exposure and impact potential with the widened/scaled up threat surface with containers & cloud – do you measure your impact minutes?
How resilient are you when it comes to springing back up? Do you measure your Mean Time to Recover & Resolve (MTTR²)?
Have you re-baselined your threat surface and exposure controls?
Hard labor and toil simply cannot scale to keep up with enterprise cloud-scale demands. Do you have the tools and platforms that can help you with resiliency today?
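To make the measurement questions above concrete, here is a minimal sketch of how impact minutes and mean time to recover might be computed from an incident log. The `Incident` record, its fields, and the user-weighted definition of impact minutes are assumptions made for illustration; your own definitions may well differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    detected_at: datetime  # when the incident was detected
    resolved_at: datetime  # when service was fully restored
    affected_users: int    # hypothetical proxy for business exposure


def mean_time_to_recover(incidents: list[Incident]) -> timedelta:
    """Average of (resolved_at - detected_at) across all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.resolved_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


def impact_minutes(incidents: list[Incident]) -> float:
    """Outage minutes weighted by affected users (one possible definition)."""
    return sum(
        (i.resolved_at - i.detected_at).total_seconds() / 60 * i.affected_users
        for i in incidents
    )


# Made-up incidents, purely to exercise the two functions.
incidents = [
    Incident(datetime(2019, 2, 11, 9, 0), datetime(2019, 2, 11, 9, 30), 200),
    Incident(datetime(2019, 2, 20, 14, 0), datetime(2019, 2, 20, 15, 30), 50),
]
mttr = mean_time_to_recover(incidents)
total_impact = impact_minutes(incidents)
```

Tracking both numbers over time, rather than either alone, shows whether automation is actually shrinking your exposure or just your recovery clock.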

The cloud providers who provide us with DevSecOps automation tools/frameworks may take only limited responsibility or accountability for the incident and recovery. It is ultimately the responsibility of the enterprise operator to ensure their security posture is current and updated.

Telemetry and Actionable Observability:

Containers run on a shared kernel, which greatly limits the level of isolation, and they require dynamic networking – both of which make it harder to have visibility and control over the runtime environment.

Think through the embedded observability that you need to correlate business transactions with the app transactions cutting across the monoliths, microservices, mesh networks and mainframe farms that you may have. Are you securing your container farms and runtimes with telemetry and the three key tenets of observability: logs, metrics and tracing? Can you take specific actions based on the alerts in real time?
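As a rough illustration of what correlating the three signals can look like, the sketch below groups logs, metrics and trace spans by a shared identifier and then reads off which services a business transaction crossed. The record shapes, the service names, and the assumption that every record (metrics included) carries a `trace_id` are all hypothetical; in practice metrics often do not carry one.

```python
from collections import defaultdict


def correlate_by_trace(logs, metrics, spans):
    """Group log, metric and span records that share the same trace_id."""
    grouped = defaultdict(lambda: {"logs": [], "metrics": [], "spans": []})
    for kind, records in (("logs", logs), ("metrics", metrics), ("spans", spans)):
        for record in records:
            grouped[record["trace_id"]][kind].append(record)
    return dict(grouped)


def services_touched(view):
    """List the services a transaction crossed, in span order, without repeats."""
    seen = []
    for span in view["spans"]:
        if span["service"] not in seen:
            seen.append(span["service"])
    return seen


# Hypothetical records for one business transaction ("t-1") and an unrelated one.
logs = [
    {"trace_id": "t-1", "service": "payments", "message": "card declined"},
    {"trace_id": "t-2", "service": "catalog", "message": "ok"},
]
metrics = [{"trace_id": "t-1", "service": "payments", "latency_ms": 840}]
spans = [
    {"trace_id": "t-1", "service": "gateway"},
    {"trace_id": "t-1", "service": "payments"},
    {"trace_id": "t-1", "service": "mainframe-adapter"},
]

correlated = correlate_by_trace(logs, metrics, spans)
path = services_touched(correlated["t-1"])
```

With a view like this, an alert on one signal (say, the slow payment metric) can be tied back to the log line and the full service path of the same transaction.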

Signature-based and network-perimeter-based intrusion detection is fast becoming a thing of the past, especially for cloud-scale enterprises, given the advanced intrusion patterns that enterprises are facing on their widened threat surface.

Have you embedded your observability right into your container to container (service to service) communications, and ensured your apps/services are secure?
Are you looking into anomalous behaviors in container usage patterns, and do you have the ability to trace them all the way back to the root cause? Signals from infrastructure alone are no longer enough; they need to cut across services and business transactions.
As your threat surface has widened, what is your Telemetry and Actionable Observability strategy? Do you have a framework that learns from patterns across the value chain, from detection to prevention, and indeed pushes the boundaries to predict incidents based on in-service and social contextual learning?
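One very simple way to flag anomalous container usage, sketched below, is to compare each new sample against a trailing window of recent samples and flag large deviations. This is a deliberately naive z-score check, assumed here only for illustration; the window size, threshold and CPU figures are made up, and a production detector would need to be far more robust.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate from the mean of the preceding
    `window` samples by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat baseline: treat any change at all as anomalous.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical CPU usage (percent) for one container, with a single spike.
cpu_percent = [10, 11, 10, 12, 11, 10, 11, 95, 11, 10]
spikes = find_anomalies(cpu_percent)
```

The index of a flagged sample is only the starting point: it is the timestamp you would then trace across services and business transactions to reach the root cause.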

Rapidly evolving architectures & ecosystem:

Stateless apps ... Sure! It is a near Disney ride with Kubernetes: fun and safe!

Think hard about how exactly you will run stateful apps with Kubernetes. With StatefulSets, you could surely discover the Pod, get it back up and running, and tackle the scheduling problem (i.e., presuming you are still in control of root). What about storage orchestration? Storage responsibilities were relegated to the underlying engine until recently, i.e., Kubernetes 1.8, so there are solutions like Portworx to address the white space. Kubernetes introduced the CSI (Container Storage Interface) quite recently, a couple of weeks ago; its adoption still needs to be tested in the real world, and it is still evolving.

With such an evolving community, how do you plan to secure your sidecars and ensure exposure controls within services? How mature is your SDN for enabling service-to-service communication and fine-grained security policies across clusters?

With open-core abstractions around Kubernetes, enterprises must consider that upgrading Kubernetes at runtime, without downtime, is a non-trivial activity, especially in a multi-cluster environment. To keep pace with the releases, you are probably looking at four major upgrades and 10+ minor updates a year, as experts point out. Cloud Foundry tooling like BOSH, or a lightweight CRI-O purpose-built for Kubernetes, offers a more hardened path for adoption. Abstractions like Pivotal Container Service (PKS) make it super-efficient and easy to enforce guardrails for your multi-cloud, as do any of the 'x'KS offerings from the respective public cloud providers, each with its limitations.

Security is everyone’s responsibility:

In digital delivery models, empowering developers is critical, as they are directly accountable for the experiences they deliver to the customer. They now equally own the responsibility to secure the apps/services and their customer data, and to ensure this is consistent and compliant with enterprise policies.

CSOs/Operators therefore need the capability to code and roll out security through their platform chassis, covering aspects such as recognizing attack vector patterns and alerts, and leveraging intelligent detection models that enable users and owners of the system to stop an attack pre-emptively.

How do CIOs/CSOs enforce guardrails for 1000+ developers spanning multiple teams and clusters, especially as app velocity increases in the cloud-native world? How much of it can you automate and patch through elegant platform abstractions? This indeed becomes a critical question to ask.
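One way to picture an automated guardrail is a small policy-as-code check that a platform runs against every Pod spec before it is admitted. The sketch below is an assumption-laden illustration, not a real admission controller: it takes a Pod spec as a plain dictionary, inspects only the container-level `securityContext` fields `privileged`, `allowPrivilegeEscalation` and `runAsNonRoot` (which do exist in Kubernetes), and, as a policy choice of this example, treats unset fields as violations.

```python
def guardrail_violations(pod_spec: dict) -> list[str]:
    """Return human-readable guardrail violations for a Pod spec dictionary."""
    violations = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext") or {}
        if ctx.get("privileged", False):
            violations.append(f"{name}: privileged container")
        # Policy choice of this sketch: escalation must be explicitly disabled.
        if ctx.get("allowPrivilegeEscalation", True):
            violations.append(f"{name}: privilege escalation not disabled")
        # Simplification: Pod-level securityContext is not consulted here.
        if not ctx.get("runAsNonRoot", False):
            violations.append(f"{name}: may run as root")
    return violations


# Hypothetical Pod spec: one hardened container, one risky debug container.
pod_spec = {
    "containers": [
        {
            "name": "web",
            "securityContext": {
                "privileged": False,
                "allowPrivilegeEscalation": False,
                "runAsNonRoot": True,
            },
        },
        {"name": "debug", "securityContext": {"privileged": True}},
    ]
}
found = guardrail_violations(pod_spec)
```

Run centrally in the deployment pipeline, a check of this kind applies the same rule to every team and cluster without asking each of the 1000+ developers to remember it.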

Platform Admins/Developers, by embedding full-stack observability, must bring security in much earlier in the software development cycle, for example by informing the larger community/teams in real time of the nature of attacks, when and where they are occurring, and the targets they are hitting. They must also try to shorten the mean time to detect and fix vulnerabilities.

In summary:

All of this points to the need for a well-defined and holistic container security strategy, leaning more and more on efficient automation platforms to execute it. I do see some enterprise teams preferring ‘upstream Kubernetes’ for all its commercial benefits, but we must think about Day 2 operations and our security posture holistically. As we make key decisions, we must think about our journey to the cloud, the implications of downtime, and the telemetry and actionable observability we need to scale en masse.

Security, undoubtedly, has now become a Board-level consideration, as the threat surface and exposure have widened substantially over the past few years and the impact of a breach can be equally substantial.

And the good news is that the 3Ps of security are not going away anytime soon … Patch, Patch, Patch!