The recent global outage caused by a CrowdStrike update resulting in BSOD (Blue Screen of Death) is a moment for security professionals and operations heads to learn the lessons and reduce the chances of a repeat. Otherwise, larger and more catastrophic failures are sure to follow.
Lesson #1 — Control/process effectiveness
We hold every certificate issued by the global providers, so the risk must be zero, right? Controls are present in the form of process (DevOps, QA, release strategy, etc.) and security (negative scenario testing, SAST, DAST, penetration testing, etc.), but are they effective? Are auditors and risk assessors going deep enough to understand the details? Are maker and checker roles taken seriously? Is the development team that does the actual work getting the basic checks done? Are they qualified enough?
Is secure-by-design a myth for product companies? This needs to be solved, or many more such events will follow sooner or later.
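To make this concrete, here is a minimal sketch of an effective control expressed as code rather than as a certificate: a release gate that refuses to ship unless the required scans actually ran, found nothing critical, and the maker and checker are different people. The names (ScanResult, release_gate) and the thresholds are illustrative assumptions, not the API of any specific CI/CD product.

```python
# Minimal sketch of a release gate that demands evidence for each control.
# ScanResult, release_gate, and the thresholds are illustrative, not from any
# specific CI/CD product.
from dataclasses import dataclass

@dataclass
class ScanResult:
    tool: str              # e.g. "SAST", "DAST", "PT"
    completed: bool        # the scan actually ran against this build
    critical_findings: int

def release_gate(scans: list[ScanResult], maker: str, checker: str) -> bool:
    """Allow a release only when every required control has real evidence behind it."""
    required = {"SAST", "DAST"}
    ran = {s.tool for s in scans if s.completed}
    if not required.issubset(ran):
        return False       # the control "exists" on paper but did not run
    if any(s.critical_findings > 0 for s in scans):
        return False       # unresolved critical findings block the release
    if maker == checker:
        return False       # maker-checker means two different people
    return True

if __name__ == "__main__":
    scans = [ScanResult("SAST", True, 0), ScanResult("DAST", True, 0)]
    print(release_gate(scans, maker="dev_a", checker="dev_a"))  # False: same person signed off
```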
Lesson #2 — Zero-day fear factor: the basic 3-tier landscape (Dev-QA-PROD) is missing
This is a fear factor created by security companies. In the name of zero-day protection, real-time patching was introduced. This violates the 3-tier deployment principle and is one of the root causes here: despite multiple tools, we still ended up with a multi-country outage. Have we over-trusted SaaS tools just because they were certified by other people working in another big firm?
Product companies – Are product companies testing against the basic 3-tier architecture, a practice established some 50 years ago and proven across countless product development scenarios? Are minor updates allowed to skip it?
Customers – Why a direct production deployment to so many customers globally? Where is the logic of local or regional updates? Why not deploy to a limited set of environments at the customer site first, as sketched below, and roll out to the rest only once the update has proven to work in the local ecosystem?
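Below is a minimal sketch of what such a staged rollout could look like; the ring names, targets, and the health_check placeholder are hypothetical and stand in for real deployment tooling and telemetry.

```python
# Minimal sketch of a ring-based rollout that respects the Dev-QA-PROD principle.
# Ring names, targets, and the health_check placeholder are hypothetical.
import time

ROLLOUT_RINGS = [
    ("dev",         ["dev-lab"]),
    ("qa",          ["qa-lab"]),
    ("canary-prod", ["customer-site-001"]),         # one limited production environment
    ("regional",    ["region-eu", "region-apac"]),  # staged regional expansion
    ("global",      ["all-remaining-customers"]),
]

def deploy(update_id: str, target: str) -> None:
    print(f"deploying {update_id} to {target}")

def health_check(target: str) -> bool:
    """Placeholder: in practice, query telemetry (boot success, agent heartbeat, error rates)."""
    return True

def staged_rollout(update_id: str, soak_minutes: int = 60) -> None:
    for ring_name, targets in ROLLOUT_RINGS:
        for target in targets:
            deploy(update_id, target)
        time.sleep(soak_minutes * 60)               # soak time before judging the ring
        if not all(health_check(t) for t in targets):
            print(f"halting {update_id}: ring '{ring_name}' failed health checks")
            return                                  # the blast radius stops at this ring
    print(f"{update_id} rolled out globally")

if __name__ == "__main__":
    staged_rollout("content-update-001", soak_minutes=0)  # zero soak time just for the demo
```

The point of the design is that a bad update halts at the ring where it fails, instead of reaching every customer in every country at once.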
Lesson #3 — Certification and third-party management (TPM) gaps: look for real-time, AI-driven risk analysis of products
A partner got certified 300 days ago against a template defined 700 days ago. What about today? 100+ major cloud products have been released since, the technology landscape has changed, and many products have become obsolete. How is that certification still effective? Isn't this a false narrative used to fool customers and buy their trust? Who is responsible for this?
Need of the hour – This is the place to use real AI/ML tools and stay effective and up to date.
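As one hedged illustration, the sketch below computes a continuously refreshed third-party risk score from live signals instead of relying on a point-in-time certificate. The signal names, weights, and thresholds are assumptions; in a real deployment a model would learn them from telemetry rather than have them hard-coded.

```python
# Minimal sketch of continuous third-party risk scoring. The signals, weights, and
# thresholds are hypothetical; in practice a model fed with live telemetry would
# learn them instead of using these hard-coded values.
from dataclasses import dataclass

@dataclass
class PartnerSignals:
    days_since_certification: int   # how stale the last point-in-time audit is
    releases_since_assessment: int  # how much the product has changed since then
    open_critical_cves: int
    incidents_last_90_days: int

def risk_score(s: PartnerSignals) -> float:
    """Return a 0-100 score; higher means the old certificate tells us less."""
    score = 0.0
    score += min(s.days_since_certification / 365, 1.0) * 40   # certification staleness
    score += min(s.releases_since_assessment / 20, 1.0) * 25   # drift from what was assessed
    score += min(s.open_critical_cves / 5, 1.0) * 20
    score += min(s.incidents_last_90_days / 3, 1.0) * 15
    return round(score, 1)

if __name__ == "__main__":
    partner = PartnerSignals(days_since_certification=300,
                             releases_since_assessment=12,
                             open_critical_cves=2,
                             incidents_last_90_days=1)
    print(risk_score(partner))   # a certificate from 300 days ago scores high, not zero
```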
Lesson #4 — Segregation of critical workloads: use Linux/Unix for critical systems
Windows originated by design as an end-user system, so it carries those flaws into the server landscape as well. Linux/Unix were designed for servers; they follow sound principles and keep the basics right. Yes, the command line is a little harder to use, but once you have realised its power, you can rely on it for your critical workloads.
Summary
Once again, security is everyone's responsibility, and we need to learn from our failures and avoid known risks in the future.
1. Products need effective controls, not just certificates.
2. Keep the basics right and don't trust blindly until you have tested things yourself.
3. Keep regular updates in check and define an update time window where possible (see the sketch after this list).
4. Choose the right system/architecture for the right workload.
5. Deploy real-time TPM tools to assess possible gaps at partners.
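For takeaway #3, here is a minimal sketch of an update time-window check, assuming a hypothetical 02:00-04:00 local maintenance window and an is_emergency flag supplied by the vendor's severity rating.

```python
# Minimal sketch of takeaway #3: hold non-emergency updates until an agreed maintenance
# window. The window boundaries and the is_emergency flag are illustrative assumptions.
from datetime import datetime, time

MAINTENANCE_WINDOW = (time(2, 0), time(4, 0))   # e.g. 02:00-04:00 local time

def may_apply_update(now: datetime, is_emergency: bool) -> bool:
    """Emergency fixes go out immediately; everything else waits for the window."""
    if is_emergency:
        return True
    start, end = MAINTENANCE_WINDOW
    return start <= now.time() <= end

if __name__ == "__main__":
    print(may_apply_update(datetime(2024, 7, 19, 14, 30), is_emergency=False))  # False: wait
    print(may_apply_update(datetime(2024, 7, 19, 2, 30), is_emergency=False))   # True
```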