We are not all perfect
In my 36th year of a career in Information Computer Technology, of which more than
two-thirds has been as a freelance consultant, even in Nigeria where some
thought it folly to abandon a salaried role for the uncharted waters of
treading the streets of Lagos and beyond for engagement, I have been quite
fortunate.
I would be the first to say every seeming subject matter expert has a history,
and that is in
keeping with the credo that every saint has a past and every sinner has a
future. My recognition of some of the stupidest things to do in any IT environment
allows me to reflect on the imperfections, errors, ignorance, overconfidence,
mistakes, and failings of experts. Critically, we should learn lessons more and
excoriate ourselves less.
CrowdStrike’s bird strike
However, on Friday I was hoping so much that my weekend would not be ruined by
an incident that we would need to remediate, especially when what caused it was
totally out of our control. I refer to the 2024 CrowdStrike
incident where a US cybersecurity company rolled out a faulty update to
their software which affected an estimated 8.5 million Microsoft Windows devices.
In the process, we
saw this affecting airlines, airports, hospitals, banks, hotels, payment
systems, government services, enterprise systems, and emergency services, among
the critical services the CrowdStrike software is widely deployed to protect.
The more technical
people have already analysed the issue and several solutions have been proffered
including an emergency recovery tool that Microsoft released late yesterday to
help fix devices that have been left in an unusable state. [Microsoft:
New Recovery Tool]
How ever did this happen?
In my view, it is shocking that this CrowdStrike update went out at all, and it
calls for a reassessment of how we measure impact, risk, and consequence when
what we do can be so far-reaching and the means to back out or roll back a
presumed solution requires extraordinary measures.
If anything, and I
have been involved in major deployments that could reach up to 250,000 users
globally, you do not roll out a major update on a Friday and I never do on a
Monday either. You need the presence of mind and personnel active during the
week if things go wrong.
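To make that rule of thumb concrete, here is a minimal sketch of a deployment-window check. The function name and the choice of allowed days are my own illustration of the Friday and Monday rule above, not anything taken from CrowdStrike or any particular change-management tool.

```python
from datetime import date

# Assumed policy for illustration: a major rollout may only start
# Tuesday to Thursday. Monday (0), Friday (4) and the weekend are excluded.
ALLOWED_WEEKDAYS = {1, 2, 3}  # date.weekday(): Monday is 0, Sunday is 6


def can_start_major_rollout(day: date) -> bool:
    """Return True if a major update may begin on the given day."""
    return day.weekday() in ALLOWED_WEEKDAYS
```

A check like this would sit at the front of a release pipeline, so that a rollout requested on a Friday, such as the 19th of July 2024, is held until people are around to deal with the fallout.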
Before that, all
techies are left asking, how did such a fundamentally flawed update make it out
of the gates at CrowdStrike without being caught in testing, review, change
management, risk management, impact assessment, and just the basic corporate
desire never to roll out a problem regardless of the situation?
Preoccupied with the stock market
I got one interesting insight looking through the Twitter (X) feed of the
President & CEO of CrowdStrike. His last tweet before pandemonium broke loose,
posted on the 8th of July, quoted another, an ululation about CrowdStrike
being the seventh best-performing stock in the Nasdaq 100 year-to-date and the
14th best in the S&P 500. In both indices, it is the highest returning
software stock of the first half of 2024, up over 50%.
CrowdStrike seemed to be in a celebratory mode, and last month they celebrated
5 years of being Nasdaq-listed. I hate to think that they had taken their eyes
off the ball and that, by some careless misadventure, a company that was
supposed to prevent cyberattacks presided over one of the largest outages ever
in the history of information technology.
It leads me to think of the nursery rhyme, Sing a song of sixpence, where the
king specifically had a counting house to count his money and the maid suffered
the mishap of having her nose pecked off by a blackbird, which would have been
one of the four-and-twenty birds that were baked in the pie and escaped when
the pie was opened and the birds began to sing. Things were not particularly
right in that kingdom.
Falconry gone to ground
That this outage affected airports and airlines globally is quite interesting
because the update was to CrowdStrike’s Falcon Sensor product. This is an
endpoint security agent, and the faulty update to it rendered devices totally
inoperable. Certain airports deploy falcons to
scare away birds that might interfere with the take-off and landing of aircraft.
They prevent catastrophic bird strikes that could incapacitate aircraft and
lead to accidents.
I find myself thinking CrowdStrike had become a bird strike of such
unimaginable consequence that the cost of the outages is yet to be computed, as
many devices might still be offline. CrowdStrike stock price fell almost 20% in
the 5 days to the Friday close of the market. The king of CrowdStrike, counting
his money just over a week before, has just taken a personal hit of $43 million,
and it might be up to $300 million according to Forbes. Not much for a
billionaire though.
Ticking boxes and flipping heck
Back to the fundamentals, the question about testing remains, as much as I am
left wondering which product, service, or project manager needed to tick boxes
to meet deadlines rather than roll out a patch later than planned. Siding with
the techies rather than management, could management have been given different
advice, only for the techie to be overruled for expediency?
I have had these
conversations too many times with project managers who have promised the world
to management long before they have engaged the input of resources and facilities
to get things done to the standard they have promised. The resource is then put
in a bind to meet unrealistic deadlines.
You need a force of
personality to push back and assert that your job is to deploy solutions that
work the first time, maybe with a few tweaks, but you would never roll out what
you can determine with all clarity still has issues and can constitute a
problem. I do not want to screw up anyone’s project, but I have a professional
responsibility to those I provide service and support to not to leave the state
of their corporate devices any worse than before my solution was deployed.
Push back and regulation too?
Better late than sorry is not a sin; it is understanding the impact and risk of
what you do. One
last thing, the update should have gone out in controlled tranches, not
globally in one fell swoop. I can see a situation where legislation might
require those who can impact critical services to submit a full assessment and
deployment plan to a regulator before deployment.
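For what controlled tranches might look like in practice, here is a minimal sketch of a staged rollout that halts as soon as a tranche shows too many unhealthy devices. Everything in it is an assumption for illustration: the function names, the cumulative fractions, the failure threshold, and the `deploy` and `is_healthy` callbacks are hypothetical and do not describe how CrowdStrike or any other vendor actually ships updates.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def plan_tranches(devices: Sequence[str],
                  fractions: Iterable[float]) -> List[List[str]]:
    """Split devices into successive tranches.

    fractions are cumulative shares of the fleet, e.g. 0.01, 0.1, 0.5, 1.0.
    """
    tranches: List[List[str]] = []
    start = 0
    for fraction in fractions:
        end = max(start, min(len(devices), round(len(devices) * fraction)))
        if end > start:
            tranches.append(list(devices[start:end]))
        start = end
    return tranches


def staged_rollout(devices: Sequence[str],
                   fractions: Iterable[float],
                   deploy: Callable[[str], None],
                   is_healthy: Callable[[str], bool],
                   max_failure_rate: float = 0.01) -> Tuple[bool, List[str]]:
    """Deploy tranche by tranche and stop at the first unhealthy tranche.

    Returns (completed, devices_updated_so_far).
    """
    updated: List[str] = []
    for tranche in plan_tranches(devices, fractions):
        for device in tranche:
            deploy(device)  # hypothetical callback that pushes the update
        updated.extend(tranche)
        failures = sum(1 for device in tranche if not is_healthy(device))
        if failures / len(tranche) > max_failure_rate:
            return False, updated  # halt: the rest of the fleet is untouched
    return True, updated
```

With a fleet of 100 devices and fractions of 0.01, 0.1, 0.5 and 1.0, a bad update that breaks the very first device stops the rollout with one machine affected rather than all of them. That is the whole point of tranches: the blast radius is bounded by the size of the tranche in which the fault is first observed.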
We might need a Federal Aviation Administration (FAA), Securities & Exchange
Commission (SEC), or Food and Drug Administration (FDA) type agency at
national, regional, and global levels to superintend services that can affect
global infrastructure, with teeth to regulate, sanction, or punish those who
handle their impactful responsibilities with levity.