Allegro. The e-commerce site went down after a sudden traffic spike caused by a marketing campaign. The outage was caused by a configuration error in cluster resource management that prevented additional service instances from starting even though hardware resources were available.
Cloudflare. A bad config (router rule) caused all of their edge routers to crash, taking down all of Cloudflare.
Cloudflare. During maintenance on their private backbone network, an engineer made a typo in the Atlanta data center's network configuration, causing all traffic from the US and Europe to be routed to that single data center and overwhelming it.
Cloudflare. Incorrect ordering of prefixes whose BGP advertisements were being disabled caused an outage across 19 data centers.
Cloudflare. A change to our tiered cache system caused some requests to fail for users with status code 530 for almost six hours in total. We estimate that about 5% of all requests failed at the peak of the incident. Due to the complexity of our systems and blind spots in our tests, we did not catch this when the change was released to our test environment.
Cloudflare. Due to a bug in a release of our service token functionality, several Cloudflare services were unavailable for 121 minutes on January 24, 2023. The incident degraded a wide range of Cloudflare products, including aspects of our Workers platform, our Zero Trust solution, and control-plane functions in our content delivery network (CDN).
Cloudflare. On October 4, 2023, Cloudflare experienced DNS resolution problems starting at 07:00 UTC and ending at 11:00 UTC. Some users of 1.1.1.1, or of products like WARP, Zero Trust, or third-party DNS resolvers that use 1.1.1.1, may have received SERVFAIL DNS responses to valid queries. We're very sorry for this outage. This outage was an internal software error and not the result of an attack. In this blog we're going to talk about what the failure was, why it occurred, and what we're doing to make sure it doesn't happen again.
Datadog. A bad service discovery configuration, surfaced at a single customer when a dependent client went down, took down service discovery globally.
Enom. On January 15, 2022, at 9:00 AM Eastern, Tucows' engineering team began planned maintenance work to migrate the Enom platform to new cloud infrastructure. Due to the complexity of the cutover, the team ran into numerous issues that caused delays. The maintenance window was extended several times to resolve problems related to data replication, network routing, and DNS resolution, which affected website accessibility and email delivery.
Etsy. Sending multicast traffic without properly configuring switches caused an Etsy global outage.
Facebook. A configuration change to Facebook's backbone routers caused a global outage of all Facebook properties and internal tools.
Facebook. A bad config took down both Facebook and Instagram.
Firefox. On January 13, 2022, a specific code path in Firefox's network stack triggered a problem in the HTTP/3 protocol implementation. This blocked network communication and left Firefox unresponsive, unable to load web content for nearly two hours.
GoCardless. A bad config combined with an uncommon set of failures led to an outage of a database cluster, taking the API and dashboard offline.
[Google](https://cloud.google.com/blog/products/infrastructure/details-of-google-cloud-gcve-incident). The original GCVE deployment was provisioned using a legacy option, which resulted in a "fixed term" contract with automatic deletion at the end of that term.
Google. A bad config (automation) removed all Google Compute Engine IP blocks from BGP announcements.
Google. A bad config (automation) took down most Google services.
Google. A bad config caused the quota service to fail, which caused multiple services to fail (including Gmail).
Google. A "/" was checked into the URL blacklist, causing every URL to show a warning.
Google. An error in a configuration rollout to load balancers caused elevated error rates for 22 minutes.
Google. A configuration change intended to address rising demand on a metadata store overloaded part of the blob lookup system, causing a cascading failure that affected Gmail, Google Photos, Google Drive, and other GCP services that rely on blob storage.
Google. Two misconfigurations and a software bug combined to cause a massive Google Cloud network outage on the US East Coast.
Google. Google's frontend load balancing service experienced failures, with impact to several downstream Google Cloud services in Europe. From preliminary analysis, the root cause of the issue was a new infrastructure feature that triggered a latent issue in the internal network load balancer code.
Google. Google Cloud Networking experienced issues with the Google Cloud Load Balancing (GCLB) service, with impact to several downstream Google Cloud services. Affected customers observed Google 404 errors on their websites. From preliminary analysis, the root cause of the issue was a latent bug in a network configuration service that was triggered during regular system operations.
Google. Google Cloud Networking experienced degraded capacity for batch, streaming, and transfer operations from Thursday, 14 July 2022 at 19:30 US/Pacific through Friday, 15 July 2022. This service disruption was caused by issues encountered during maintenance work and a routine network software upgrade rollout. Because of the nature of the disruption and the resilience capabilities of Google Cloud products, the affected regions and individual impact windows varied significantly.
Heroku. An automated remote configuration change didn't propagate fully. Web dynos were unable to start.
Heroku. An incorrect deploy process caused code to run without the new config variables it needed.
Keepthescore. An engineer accidentally deleted the production database. The database was a managed database from DigitalOcean, backed up once per day. It was back online 30 minutes after the disaster, but 7 hours of scoreboard data were gone forever.
Microsoft. A bad config took down Azure storage.
npm. A Fastly configuration change caused backend routing problems. To be precise, the problem was that we set req.backend in the vcl_fetch function, then called restart to re-run the rules. However, calling restart resets req.backend to the first backend in the list, which in this case happened to be Manta rather than the load-balanced CouchDB servers.
OWASA. An incorrectly pressed button caused the water treatment plant to shut down because of excessive fluoride levels.
PagerDuty. On December 15, 2021 at 00:17 UTC, we deployed a DNS configuration change to infrastructure at PagerDuty which impacted our container orchestration clusters. The change contained a defect that we did not detect in our test environments, and it immediately caused all services running in the container orchestration clusters to be unable to resolve DNS.
Razorpay. An RDS hardware failure exposed an incorrect MySQL configuration, leading to significant data loss in a financial system.
Rust. On Wednesday, 2023-01-25 at 09:15 UTC, we deployed a change to the production infrastructure of crates.io. During the deployment, the DNS record for static.crates.io failed to resolve for an estimated window of 10-15 minutes. This was due to both the certificate and the DNS record being recreated during the downtime.
Rust. Between 12:17 and 12:30 UTC on 2023-07-20, all crate downloads from crates.io failed because a deploy included a bug in download URL generation. During that window we served an average of 4.71k requests per second to crates.io, resulting in roughly 3.7 million failed requests, including Cargo's retry attempts.
Stack Overflow. A bad firewall config blocked Stack Exchange/Stack Overflow.
Sentry. Wrong Amazon S3 settings on backups resulted in a data leak.
Travis CI. A config issue (an incomplete password rotation) caused "leaking" VMs, resulting in elevated build queue times.
Travis CI. A config issue (an automated, age-based cleanup job for Google Compute Engine VM images) caused stable base VM images to be deleted.
Travis CI. A config change caused builds to start failing. The manual rollback was itself broken.
Travis CI. An unexpected environment variable caused tests to truncate the production database.
TUI. The booking system that produces load sheets had been upgraded before the incident flight. A bug in the system caused female passengers checked in with the title "Miss" to be processed as children. The system assigned them the standard child weight of 35 kg instead of the correct standard female weight of 69 kg. With 38 women incorrectly checked in as children, the takeoff mass of G-TAWG calculated from the load sheet was therefore 1,244 kg lower than the aircraft's actual mass.
Turso. A misconfigured DB backup identifier led to a data leak for free-tier customers, and the subsequent fix caused possible data loss.
Valve. Although there's no official postmortem, it looks like a bad BGP config on Valve's connections to Level 3, Telia, and Abovenet/Zayo caused a global Steam outage.
Amazon. An unknown event caused a transformer to fail. One of the PLCs that checks that generator power is in phase failed for an unknown reason, which prevented a set of backup generators from coming online. This affected EC2, EBS, and RDS in EU West.
Amazon. Bad weather caused power failures throughout AWS US East. When power switched over to backup and the generators picked up the load, a single backup generator failed to deliver stable power, even though it had passed a load test two months earlier and passed weekly power-on tests.
Amazon. At 10:25 PM PDT on June 4, the AWS Sydney facility lost power due to severe weather in the region, taking down a significant number of instances in an Availability Zone. Because of the signature of the power loss, the power isolation breakers did not engage, which caused the backup energy reserves to drain into the degraded power grid.
ARPANET. A malfunctioning IMP (Interface Message Processor) corrupted routing data, software that recomputed checksums propagated the bad data with good checksums, the bad sequence numbers caused buffers to fill, full buffers caused the loss of keepalive packets, and nodes took themselves off the network. From 1980.
Cloudflare. A partial switch failure caused a Byzantine failure mode, which affected the availability of the API and dashboard for six hours and 33 minutes.
Cloudflare. A power failure at a Flexential data center. This post outlines the events that led to this incident.
FirstEnergy / General Electric. FirstEnergy had a local failure when some transmission lines hit untrimmed foliage. The normal process is to trigger an alarm, which causes human operators to redistribute power. But the GE system monitoring this had a bug which prevented the alarm from being triggered, which eventually caused a cascading failure that ended up affecting 55 million people.
GitHub. On January 28, 2016, GitHub suffered a disruption in the power at their primary data center.
Google. Successive lightning strikes on their European data center (europe-west1-b) caused a loss of power to Google Compute Engine storage systems within that region. I/O errors were observed on a subset of Standard Persistent Disks (HDDs), and permanent data loss was observed on a small fraction of those.
Google. On Tuesday, 19 July 2022 at 06:33 US/Pacific, a simultaneous failure of multiple redundant cooling systems in a data center that hosts the europe-west2-a zone impacted multiple Google Cloud services. This left some customers unable to serve traffic from the affected products.
PythonAnywhere. A storage volume failure on one file server caused a number of outages, starting with the PythonAnywhere site itself and those of our users' programs (including websites) that depended on that volume, and then spreading to other hosted sites.
Sun. Sun famously didn't include ECC in several generations of server parts. This resulted in data corruption and crashes. In typical Sun fashion, they made customers who reported the bug sign NDAs before explaining the issue.
CCP Games. A typo plus a name conflict caused the installer to sometimes delete the boot.ini file when installing an EVE Online expansion, with consequences.
GitHub. A 43-second network partition during maintenance caused a MySQL master failover, but because of cross-continental latency the new master was missing a few seconds of writes. Preserving data integrity required more than 24 hours of recovery work.
GoCardless. All queries against a critical PostgreSQL table were blocked by the combination of an extremely fast database migration and long-running read queries, leading to 15 seconds of downtime.
Google. A large number of changes to a rarely modified load balancer were applied through a very slow code path. This froze all publicly visible changes for roughly 2 hours.
Google. A component failure on a fiber path to one of the US central gateway campuses in Google's production backbone reduced the network bandwidth available between that gateway and multiple edge locations, resulting in packet loss as the backbone automatically moved traffic onto the remaining paths.
Knight Capital. A combination of conflicting deployed versions and the reuse of a previously used bit caused a loss of $460 million. See also this longer article.
WebKit code repository. The WebKit repository, a Subversion repository configured to use deduplication, became unusable after two files with the same SHA-1 hash were committed as test data, with the intent of implementing a safety check against collisions. The two files have different MD5 sums, so checkouts would fail a consistency check. For context, the first public SHA-1 hash collision, along with its two colliding files, had recently been announced.
Azure. Certificates were created that were valid for one year. Instead of using an appropriate library, someone wrote code that computed "one year later" as the current date plus one year. On February 29, 2012, this produced certificates with an expiration date of February 29, 2013, which were rejected because the date is invalid. This caused a global Azure outage that lasted most of a day.
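As a rough illustration of this bug class (not Azure's actual code), the Go sketch below shows how computing an expiry as "same month and day, year plus one" turns a leap day into a date that validation rejects, while a calendar-aware API normalizes it instead. The date format and function names are invented for the example.

```go
// A minimal sketch of the "current date plus one year" bug class described
// above. Dates, formats, and function names are illustrative, not Azure's.
package main

import (
	"fmt"
	"time"
)

// naiveOneYearLater mimics the buggy approach: bump the year field and keep
// month/day as-is, producing "2013-02-29" from a leap day.
func naiveOneYearLater(t time.Time) string {
	return fmt.Sprintf("%04d-%02d-%02d", t.Year()+1, t.Month(), t.Day())
}

func main() {
	leapDay := time.Date(2012, time.February, 29, 0, 0, 0, 0, time.UTC)

	expiry := naiveOneYearLater(leapDay)
	fmt.Println("naive expiry:", expiry) // 2013-02-29: not a real date

	// Any consumer that validates the date rejects it.
	if _, err := time.Parse("2006-01-02", expiry); err != nil {
		fmt.Println("validation error:", err) // "day out of range"
	}

	// A calendar-aware API normalizes instead (2013-03-01).
	fmt.Println("AddDate expiry:", leapDay.AddDate(1, 0, 0).Format("2006-01-02"))
}
```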
Cloudflare. A backwards flow of time, from tracking the 27th leap second at 2016-12-31T23:59:60Z, caused RRDNS, the weighted DNS resolver, to panic in its weighted round-robin selection and fail some CNAME lookups. Go's time.Now() was incorrectly assumed to be monotonic; this injected a negative value into a call to rand.Int63n(), which panics in that case.
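A minimal Go sketch of this failure mode, not Cloudflare's RRDNS code: if a "later" wall-clock reading can be earlier than an older one (as around a leap second), a duration-derived weight can go negative, and rand.Int63n panics on a non-positive argument. The timestamps and the pickWeighted helper are invented for illustration.

```go
// A minimal sketch of the failure mode above: a wall-clock timestamp taken
// "later" can be earlier than one taken "before" when the clock steps back
// (e.g. around a leap second), and rand.Int63n panics on non-positive input.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// pickWeighted chooses an upstream using the elapsed time since the last
// health probe as a weight, the way a weighted round-robin selector might.
func pickWeighted(lastProbe, now time.Time) int64 {
	weight := now.Sub(lastProbe).Nanoseconds() // negative if the clock went back
	return rand.Int63n(weight)                 // panics if weight <= 0
}

func main() {
	lastProbe := time.Date(2017, 1, 1, 0, 0, 0, 0, time.UTC)
	// The wall clock has stepped backwards past the probe time.
	now := lastProbe.Add(-500 * time.Millisecond)

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("panic:", r) // "invalid argument to Int63n"
		}
	}()
	pickWeighted(lastProbe, now)
}
```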
Linux. The leap second code was called from the timer interrupt handler, which held xtime_lock. That code did a printk to log the leap second. printk wakes up klogd, which can sometimes try to get the time, which waits on xtime_lock, causing a deadlock.
Linux. When a leap second occurred, CLOCK_REALTIME was rewound by one second. This was not done through a mechanism that would update the hrtimer base.offset. This meant that when a timer interrupt happened, TIMER_ABSTIME CLOCK_REALTIME timers expired one second early, including timers set for less than a second. This caused applications that used sub-second sleeps in a loop to spin without sleeping, resulting in high load on many systems, and it took down a large number of web services in 2012.
Mozilla. Most Firefox add-ons stopped working on or around May 4, 2019. Firefox requires a valid certificate chain to protect against malicious add-ons, and an intermediate signing certificate had expired. About nine hours later, Mozilla pushed a privileged add-on that injected a valid certificate into Firefox's certificate store, creating a valid chain and re-enabling add-ons. This effectively restored all add-ons (roughly 15,000 of them), with resolution taking about 15-21 hours for most users. Some user data was lost. Mozilla has since published information about the technical details.
GitHub. The GitHub platform encountered a novel failure mode while processing a schema migration on a large MySQL table. Schema migrations are a common task at GitHub and often take weeks to complete. The final step of a migration is a rename to move the updated table into the correct place. During the final step of this migration, a significant portion of our MySQL read replicas entered a semaphore deadlock. Our MySQL clusters consist of a primary node for write traffic, multiple read replicas that serve production traffic, and several replicas that serve internal read traffic for backup and analytics purposes. The read replicas that hit the deadlock entered a crash-recovery state, which increased the load on the healthy read replicas. Because of the cascading nature of this scenario, there were not enough active read replicas to handle production requests, which impacted the availability of core GitHub services.
Heroku. At 15:05 UTC on June 8, 2023, a database error occurred in which a foreign key used a smaller data type than the primary key it referenced. When the primary key exceeded the largest value the foreign key could hold, the overflow made it impossible to create new authorizations within Heroku. The error also prevented customers from creating new deployments. On-call operations then declared a full outage of the Heroku API.
Allegro. A subsystem of the Allegro platform responsible for asynchronous, distributed task processing failed. The problem affected many areas; for example, features such as purchasing numerous offers via the shopping cart (including price list editing) did not work at all. In addition, sending the daily newsletter with new offers partially failed. Some parts of the internal admin panels were affected as well.
Amazon. Human error. At 9:37 AM PST on February 28, 2017, the Amazon S3 team was debugging a minor issue. Despite following an established playbook, one of the commands intended to remove a small number of servers was entered with a typo and unintentionally removed a much larger set of servers. Those servers supported critical S3 systems. As a result, the dependent systems required a full restart to operate correctly, and this caused a widespread outage in US-EAST-1 (Northern Virginia) until final resolution at 1:54 PM PST. Because Amazon's own services such as EC2 and EBS also rely on S3, this turned into a huge cascading failure that affected hundreds of companies.
Amazon. Message corruption caused the distributed server state function to overwhelm the resources of the S3 request processing fleet.
Amazon. Human error during a routine network upgrade led to a resource crunch, which was exacerbated by software bugs and ultimately resulted in an outage across all US East Availability Zones as well as the loss of 0.07% of volumes.
Amazon. The inability to contact a data collection server triggered a latent memory leak bug in the reporting agent on the storage servers. There was no graceful degradation to handle this, so the reporting agent continuously contacted the collection server in a way that slowly consumed system memory. On top of that, the monitoring system failed to alert on the memory leak on these EBS servers, and EBS servers normally make very dynamic use of all of their memory anyway. By Monday morning the rate of memory loss on the affected storage servers had become high enough that they could no longer keep up with request processing. The problem was compounded by an inability to fail over, which led to the downtime.
Amazon. Elastic Load Balancer ran into problems when "a maintenance process was inadvertently run against the production ELB state data."
Amazon. A "network disruption" caused the metadata service to experience enough load that response times exceeded timeout values, causing storage nodes to take themselves down. The nodes that took themselves down kept retrying, ensuring that the load on the metadata service couldn't drop.
Amazon. Scaling up the front-end cache fleet for Kinesis caused all of the servers in the fleet to exceed the maximum number of threads allowed by the operating system configuration. Multiple critical downstream services, from Cognito to Lambda to CloudWatch, were affected.
Amazon. At 7:30 AM PST, an automated activity to scale the capacity of one of the AWS services hosted in the main AWS network triggered unexpected behavior from a large number of clients inside the internal network. This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks. These delays increased latency and errors for services communicating between these networks, which triggered even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks.
AppNexus. A double free, revealed by a database update, caused all of the "impression bus" servers to crash simultaneously. This wasn't caught before production because triggering the bug requires a time delay, and there was no built-in delay during staging.
AT&T. A bad line of C code introduced a race condition that, given the right timing, crashed the phone network. After a planned outage, the rapid-fire recovery messages triggered the race, causing more restarts, which triggered the problem again. "The problem repeated iteratively throughout the 114 switches in the network, blocking over 50 million calls in the nine hours it took to stabilize the system." From 1990.
Atlassian. Starting at 7:38 UTC on Tuesday, April 5, 2022, 775 Atlassian customers lost access to their Atlassian products. A subset of these customers were down for up to 14 days, with the first group of customers restored on April 8 and all customer sites progressively restored by April 18.
Basecamp (see also). Basecamp's network was hit by a DDoS attack during a 100-minute window on March 24, 2014.
Basecamp (see also). In November 2018, a database hit its integer limit, putting the service into read-only mode.
BBC Online. In July 2014, BBC Online experienced a lengthy outage of several popular online services, including BBC iPlayer. When the database backend became overloaded, it started throttling requests from various services. Services that didn't handle throttled database responses gracefully began timing out and eventually failed completely.
Bintray. In July 2017, a few malicious packages mimicking existing ones were included in JCenter. The packages lived in JCenter for over a year and reportedly affected several Android applications, injecting malware through these JCenter dependencies.
Bitly. A hosted source code repository contained credentials granting access to Bitly backups, including passwords.
BrowserStack. An old prototype machine with the Shellshock vulnerability was still running and had secret keys on it, which ultimately led to a security breach of the production systems.
Buildkite. Database capacity was downgraded to minimize AWS spend, which left Buildkite unable to support customer load at peak and led to a crash of dependent servers.
Bungie. A side effect of a fix for an incorrect-timestamp bug caused data loss; in a following update, a server misconfiguration of the hotfix caused the data loss to reappear across multiple servers.
CCP Games. A problematic logging channel caused cluster nodes to die during the cluster start sequence when rolling out a new game patch.
CCP Games. A writeup of a long-unfixed Python memory reuse bug that took years to track down.
Chef. Supermarket, the Chef community cookbook site, crashed two hours after launch with intermittent unresponsiveness and increased latency. The postmortem found that one of the main causes of the failure was health check timeouts.
CircleCI. A GitHub outage and recovery caused an unexpectedly large incoming load. For unspecified reasons, the large load caused CircleCI's queueing system to slow down, in this case to one transaction per minute.
CircleCI. By January 4, 2023, our internal investigation had established the scope of the intrusion by the unauthorized third party and the entry path of the attack. To date, we have learned that an unauthorized third party leveraged malware deployed to a CircleCI engineer's laptop in order to steal a valid, 2FA-backed SSO session. This machine was compromised on December 16, 2022. The malware was not detected by our antivirus software. Our investigation indicates that the malware was able to execute session cookie theft, enabling them to impersonate the targeted employee in a remote location and then escalate access to a subset of our production systems.
Cloudflare. A parser bug caused Cloudflare edge servers to return memory containing private information such as HTTP cookies, authentication tokens, HTTP POST bodies, and other sensitive data.
Cloudflare. CPU exhaustion was caused by a single WAF rule that contained a poorly written regular expression that ended up creating excessive backtracking. The rule was rapidly deployed to production, and a series of events led to a global 27-minute outage of Cloudflare services.
Datadog. After an automatic upgrade, all network rules were removed, causing an outage of roughly 24 hours for all Cilium-protected Kubernetes clusters across all of its regions and cloud providers.
Discord. When a flapping service came back up, it triggered a thundering herd of reconnections. This led to cascading errors, with frontend services running out of memory as internal queues filled up.
Discord. At around 14:01, a Redis instance acting as the primary of a highly available cluster used by Discord's API services was automatically migrated by Google's Cloud Platform. The migration took the node offline in an unclean way, forcing the cluster to restart and triggering a known issue, which cascaded into further problems for Discord's real-time systems.
Dropbox. This postmortem is pretty thin and I'm not sure what happened. It sounds like maybe a scheduled OS upgrade somehow caused some machines to get wiped, which took out some databases.
Duo. A cascading failure caused by a request queue overloading the existing database capacity. Insufficient capacity planning and monitoring also contributed.
Epic Games. Extreme load (a new peak of 3.4 million concurrent users) caused both partial and total service outages.
European Space Agency. An overflow occurred when converting a 64-bit number to a 16-bit number in the Ariane 5 inertial guidance system, causing the rocket to crash. The actual overflow happened in code that wasn't necessary for operation but was running anyway. By one account, this caused a diagnostic error message to get printed, and the diagnostic error message was somehow interpreted as actual valid data. By another account, no trap handler was installed for the overflow.
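A rough Go illustration of this overflow class, not the Ariane 5 code (which was Ada, where the unhandled overflow raised an exception): narrowing a 64-bit value into a 16-bit one silently wraps unless the range is checked first. The value and variable name are made up.

```go
// A minimal sketch of the narrowing-conversion overflow described above.
// Values are illustrative only.
package main

import "fmt"

func main() {
	horizontalBias := int64(40000) // exceeds the int16 range of -32768..32767
	narrowed := int16(horizontalBias)
	fmt.Println(narrowed) // -25536: silently wrapped

	// A guarded conversion makes the out-of-range case explicit.
	if horizontalBias > 32767 || horizontalBias < -32768 {
		fmt.Println("value out of range for int16; refusing to convert")
	}
}
```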
Elastic. Elastic Cloud customers with deployments in the AWS eu-west-1 (Ireland) region experienced severely degraded access to their clusters for about 3 hours. Within the same timeframe, all deployments in the region were completely unavailable for roughly 20 minutes.
Elastic. Elastic Cloud customers with deployments in the AWS us-east-1 region experienced degraded access to their clusters.
ESLint. On July 12, 2018, an attacker compromised the npm account of an ESLint maintainer and published malicious packages to the npm registry.
Etsy. First, a deploy that was intended to be a small bugfix deploy also caused live databases to be upgraded on running production machines. To make sure this didn't cause any corruption, Etsy stopped serving traffic to run integrity checks. Second, an overflow in an ID (a signed 32-bit int) caused some database operations to fail. Etsy wasn't confident that this wouldn't cause data corruption and took the site down while upgrading.
Fastly. An undiscovered software bug surfaced on June 8 when it was triggered by a valid customer configuration change.
Flowdock. Flowdock instant messaging was unavailable for around 24 hours between April 21 and 22, 2020. The COVID-19 pandemic caused a sudden and sharp increase in working from home, which led to higher traffic, which led to higher CPU usage, which caused the application database to hang. Some user data was permanently lost.
Foursquare. MongoDB fell over under load when it ran out of memory. The failure was catastrophic rather than graceful because of a query pattern with poor locality (each user check-in read all of that user's historical check-ins, and records were 300 bytes with no spatial locality, which meant that most of the data in each page wasn't needed). A lack of monitoring on the MongoDB instances caused the high load to go undetected until it became catastrophic, resulting in 17 hours of downtime across two incidents in two days.
Gentoo. An entity gained access to the Gentoo GitHub organization, removed access for all developers, and started adding commits to various repositories.
GitHub. On February 28, 2018, GitHub experienced a DDoS attack that hit the website with 1.35 Tbps of traffic.
GitLab. After the primary locked up and was restarted, it was brought back up with the wrong filesystem, causing a global outage. See also the HN discussion.
GitLab. A surge of requests overloaded the database and caused replication lag; a tired admin deleted the wrong directory, losing six hours of data. See also the earlier report and the HN discussion.
Google. A mail system sent people email more than 20 times. This happened because the mail was sent by a batch cron job that mailed everyone marked as waiting for mail. This was a non-atomic operation, and the batch job didn't mark people as no longer waiting until all of the messages had been sent.
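A small Go sketch of the non-atomic pattern described above and an idempotent alternative; the recipient data and function names are invented, and this is not Google's mailer code. The point is only that clearing the "waiting" flag once at the end of the batch makes any crash-and-rerun resend everything, while clearing it per recipient does not.

```go
// Contrast between a batch job that clears its "waiting" flags only after the
// whole run (re-sends everything if re-run) and one that clears each flag as
// soon as that recipient is handled. All data here is made up.
package main

import "fmt"

type recipient struct {
	email   string
	waiting bool
}

func send(email string) { fmt.Println("sent to", email) }

// buggyBatch: shown only for contrast; a crash anywhere in the first loop
// leaves every flag set, so a re-run mails everyone again.
func buggyBatch(rs []recipient) {
	for _, r := range rs {
		if r.waiting {
			send(r.email)
		}
	}
	for i := range rs { // flags cleared only at the very end
		rs[i].waiting = false
	}
}

// saferBatch: each recipient is marked done immediately after their send,
// so re-running the job is (mostly) idempotent.
func saferBatch(rs []recipient) {
	for i := range rs {
		if rs[i].waiting {
			send(rs[i].email)
			rs[i].waiting = false
		}
	}
}

func main() {
	rs := []recipient{{"a@example.com", true}, {"b@example.com", true}}
	saferBatch(rs)
	saferBatch(rs) // second run sends nothing: flags were cleared per recipient
	_ = buggyBatch
}
```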
Google. Filestore applies a global throttle on API requests to limit the impact of overload scenarios. The outage was triggered when an internal Google service that manages a large number of GCP projects failed and overloaded the Filestore API with requests, leading to global throttling of the Filestore API. This persisted until the internal service was manually paused. Because of the throttling, read-only API access was unavailable for all customers. Since the quota applies to Filestore globally, this affected customers in all locations. Console, gcloud, and API access calls (list, GetOperation, etc.) failed for 3 hours and 12 minutes. Mutation operations (CreateInstance, UpdateInstance, CreateBackup, etc.) still succeeded, but customers were unable to check the progress of those operations.
Google. The Google Meet Livestream feature experienced disruptions that caused intermittent degraded quality of experience for a small subset of viewers, starting 25 October 2021 0400 PT and ending 26 October 2021 1000 PT. Quality was degraded for a total duration of 4 hours (3 hours on 25 October and 1 hour on 26 October). During this time, no more than 15% of livestream viewers experienced higher rebuffer rates and latency in livestream video playback. We sincerely apologize for the disruption that may have affected your business-critical events. We have identified the cause of the issue and have taken steps to improve our service.
Google. On 13 October 2022 23:30 US/Pacific, there was an unexpected increase of incoming and logging traffic combined with a bug in Google's internal streaming RPC library that triggered a deadlock and caused the Write API Streaming frontend to be overloaded. BigQuery Storage Write API observed elevated error rates in the US multi-region for a period of 5 hours.
GPS/GLONASS. A bad update that caused incorrect orbital mechanics calculations caused GPS satellites that use GLONASS to broadcast incorrect positions for 10 hours. The bug was noticed and rolled back almost immediately, but (?) this didn't fix the issue.
Healthcare.gov. A large organizational failure to build a website for United States healthcare.
Heroku. Having a system that requires scheduled manual updates resulted in an error which caused US customers to be unable to scale, stop, or restart dynos, or route HTTP traffic, and also prevented all customers from being able to deploy.
Heroku. An upgrade silently disabled a check that was meant to prevent filesystem corruption in running containers. A subsequent deploy caused filesystem corruption in running containers.
Heroku. An upstream apt update broke pinned packages, which led to customers experiencing write permission failures to /dev.
Heroku. Private tokens were leaked, which allowed attackers to retrieve data from internal databases, private repositories, and customer accounts.
Heroku. A change to the core application that manages the underlying infrastructure for the Common Runtime included a dependency upgrade that caused a timing lock issue that greatly reduced the throughput of our task workers. This dependency change, coupled with a failure to appropriately scale up due to increased workload scheduling, caused the application's work queue to build up. Contributing to the issue, the team was not alerted immediately that new router instances were not being initialized correctly on startup, largely because of incorrectly configured alerts. These router instances were serving live traffic already but were shown to be in the wrong boot state, and they were deleted via our normal processes due to failing readiness checks. The deletion caused a degradation of the associated runtime cluster while the autoscaling group was creating new instances. This reduced pool of router instances caused requests to fail as more requests were coming in faster than the limited number of routers could handle. This is when customers started noticing issues with the service.
Homebrew. A GitHub personal access token with recently elevated scopes was leaked from Homebrew's Jenkins that allowed access to git push on several Homebrew repositories.
Honeycomb. A tale of multiple incidents, happening mostly due to fast growth.
Honeycomb. Another story of multiple incidents that ended up impacting query performance and alerting via triggers and SLOs. These incidents were notable because of how challenging their investigation turned out to be.
Honeycomb. On September 8th, 2022, our ingest system went down repeatedly and caused interruptions for over eight hours. We will first cover the background behind the incident with a high-level view of the relevant architecture, how we tried to investigate and fix the system, and finally, we'll go over some meaningful elements that surfaced from our incident review process.
Honeycomb. On July 25th, 2023, we experienced a total Honeycomb outage. It impacted all user-facing components from 1:40 pm UTC to 2:48 pm UTC, during which no data could be processed or accessed. The full details of the incident triage process are covered here.
incident.io. A bad event (poison pill) in the async workers queue triggered unhandled panics that repeatedly crashed the app. This combined poorly with Heroku infrastructure, making it difficult to find the source of the problem. Applied mitigations that are generally interesting to people running web services, such as catching corner cases of Go panic recovery and splitting work by type/class to improve reliability.
Indian Electricity Grid. One night in July 2012, a skewed electricity supply-demand profile developed when the northern grid drew a tremendous amount of power from the western and eastern grids. Following a series of circuit breakers tripping by virtue of under-frequency protection, the entire NEW (northern-eastern-western) grid collapsed due to the absence of islanding mechanisms. While the grid was reactivated after over 8 hours, similar conditions in the following day caused the grid to fail again. However, the restoration effort concluded almost 24 hours after the occurrence of the latter incident.
Instapaper (see also). Limits were hit for a hosted database. It took many hours to migrate over to a new database.
Intel. A scripting bug caused the generation of the divider logic in the Pentium to very occasionally produce incorrect results. The bug wasn't caught in testing because of an incorrect assumption in a proof of correctness. (See the Wikipedia article on the 1994 FDIV bug for more information.)
Joyent. Operations on Manta were blocked because a lock couldn't be obtained on their PostgreSQL metadata servers. This was due to a combination of PostgreSQL's transaction wraparound maintenance taking a lock on something, and a Joyent query that unnecessarily tried to take a global lock.
Joyent. An operator used a tool with lax input validation to reboot a small number of servers undergoing maintenance but forgot to type -n and instead rebooted all servers in the datacenter. This caused an outage that lasted 2.5 hours, rebooted all customer instances, put tremendous load on DHCP/TFTP PXE boot systems, and left API systems requiring manual intervention. See also Bryan Cantrill's talk.
Kickstarter. The primary DB became inconsistent with all replicas, which wasn't detected until a query failed. This was caused by a MySQL bug which sometimes caused order by to be ignored.
Kings College London. A 3PAR storage array suffered a catastrophic outage, which highlighted a failure in internal processes.
LaunchDarkly. A rule attribute selector caused the flag targeting web interface to crash.
Mailgun. Secondary MongoDB servers became overloaded and while troubleshooting accidentally pushed a change that sent all secondary traffic to the primary MongoDB server, overloading it as well and exacerbating the problem.
Mandrill. Transaction ID wraparound in Postgres caused a partial outage lasting a day and a half.
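For readers who want to watch for this failure mode, a hedged Go sketch of a wraparound check follows; the connection string, alert threshold, and the choice of the lib/pq driver are assumptions, but the age(datfrozenxid) query itself is the standard way to see how close each database is to the wraparound horizon.

```go
// A minimal sketch of monitoring Postgres transaction ID age so wraparound is
// caught long before the emergency shutdown threshold. The DSN and alert
// threshold are placeholders.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // registers the "postgres" driver (assumed choice)
)

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/postgres?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// age(datfrozenxid) counts how far each database is from the ~2 billion
	// transaction wraparound horizon.
	rows, err := db.Query(`SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	const alertAt = 1_500_000_000 // well below the 2^31 hard limit
	for rows.Next() {
		var name string
		var xidAge int64
		if err := rows.Scan(&name, &xidAge); err != nil {
			log.Fatal(err)
		}
		if xidAge > alertAt {
			fmt.Printf("ALERT: %s xid age %d is approaching wraparound\n", name, xidAge)
		} else {
			fmt.Printf("ok: %s xid age %d\n", name, xidAge)
		}
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```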
Medium. Polish users were unable to use their "Ś" key on Medium.
Metrist. Azure published a breaking change that affected downstream systems like Metrist's service without warning them; the post covers how to identify the issue and how to recover from it.
NASA. A design flaw in the Apollo 11 rendezvous radar produced excess CPU load, causing the spacecraft computer to restart during lunar landing.
NASA. Use of different units of measurement (metric vs. English) caused Mars Climate Orbiter to fail. There were also organizational and procedural failures[ref] and defects in the navigation software[ref].
NASA. NASA's Mars Pathfinder spacecraft experienced system resets a few days after landing on Mars (1997). Debugging features were remotely enabled until the cause was found: a priority inversion problem in the VxWorks operating system. The OS software was remotely patched (all the way to Mars) to fix the problem by adding priority inheritance to the task scheduler.
Netflix. An EBS outage in one availability zone was mitigated by migrating to other availability zones.
North American Electric Power System. A power outage in Ohio around 1600h EDT cascaded up through a web of systemic vulnerabilities and process failures and resulted in an outage in the power grid affecting ~50,000,000 people for ~4 days in some areas, and caused rolling blackouts in Ontario for about a week thereafter.
Okta. A hacker group gained access to a third-party support engineer's laptop.
OpenAI. Queues for requests and responses in a Redis cache became corrupted and out of sequence, leading to some requests revealing other people's user data to some users, including app activity data and some billing info.
PagerDuty. In April 2013, PagerDuty, a cloud service providing application uptime monitoring and real-time notifications, suffered an outage when two of its three independent cloud deployments in different data centers began experiencing connectivity issues and high network latency. It was found later that the two independent deployments shared a common peering point which was experiencing network instability. While the third deployment was still operational, PagerDuty's applications failed to establish quorum due to the high network latency and hence failed in their ability to send notifications.
PagerDuty. A third party service for sending SMS and making voice calls experienced an outage due to AWS having issues in a region.
Parity. $30 million of cryptocurrency value was diverted (stolen), with another $150 million diverted to a safe place (rescued), after a 4000-line software change containing a security bug was mistakenly labeled as a UI change, inadequately reviewed, deployed, and used by various unsuspecting third parties. See also this analysis.
Platform.sh. An outage during a scheduled maintenance window because there was too much data for Zookeeper to boot.
Reddit. Experienced an outage for 1.5 hours, followed by another 1.5 hours of degraded performance, on Thursday, August 11, 2016. This was due to an error during a migration of a critical backend system.
Reddit. An outage of over 5 hours when a critical Kubernetes cluster upgrade failed. The failure was caused by node metadata that changed between versions, which brought down workload networking.
Roblox. Roblox's 73-hour outage at the end of October 2021. Issues with Consul streaming and BoltDB.
Salesforce. An initial disruption due to a power failure in one data center led to cascading failures with a database cluster and file discrepancies, resulting in cross-data-center failover issues.
Salesforce. On September 20, 2023, a service disruption affected a subset of customers across multiple services beginning at 14:48 Coordinated Universal Time (UTC). As a result, some customers were unable to log in and access their services. A policy change executed as part of our standard security controls review and update cycle was the trigger of this incident. This change inadvertently blocked access to resources beyond its intended scope.
Sentry. Transaction ID wraparound in Postgres caused Sentry to go down for most of a working day.
Shapeshift. Poor security practices enabled an employee to steal $200,000 in cryptocurrency in 3 separate hacks over a 1 month period. The company's CEO expanded upon the story in a blog post.
Skyliner. A memory leak in a third-party library led to Skyliner being unavailable on two occasions.
Slack. A combination of factors resulted in a large number of Slack's users being disconnected from the server. The subsequent massive disconnect-reconnect process exceeded the database capacity and caused cascading connection failures, leading to 5% of Slack's users not being able to connect to the server for up to 2 hours.
Slack. Network saturation in AWS's traffic gateways caused packet loss. An attempt to scale up caused more issues.
Slack. Removal of cache nodes caused high load on the Vitess cluster, which in turn caused a service outage.
Spotify. Lack of exponential backoff in a microservice caused a cascading failure, leading to notable service degradation.
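A small Go sketch of the mitigation this entry points at: retries with capped exponential backoff and full jitter, so failing clients back off instead of hammering a struggling dependency in lockstep. The operation, attempt limit, and delay bounds are illustrative.

```go
// A minimal sketch of capped exponential backoff with full jitter.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

func callDependency() error { return errors.New("dependency unavailable") }

func callWithBackoff(maxAttempts int) error {
	delay := 100 * time.Millisecond
	const maxDelay = 5 * time.Second

	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = callDependency(); err == nil {
			return nil
		}
		// Full jitter: sleep a random amount up to the current backoff window,
		// so synchronized clients don't retry in lockstep.
		sleep := time.Duration(rand.Int63n(int64(delay)))
		fmt.Printf("attempt %d failed (%v); retrying in %v\n", attempt, err, sleep)
		time.Sleep(sleep)

		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	if err := callWithBackoff(4); err != nil {
		fmt.Println(err)
	}
}
```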
Square. A cascading error from an adjacent service led to the merchant authentication service being overloaded. This impacted merchants for ~2 hours.
Stackdriver. In October 2013, Stackdriver experienced an outage when its Cassandra cluster crashed. Data published by various services onto a message bus was being ingested into the Cassandra cluster. When the cluster failed, the failure percolated to the various producers, which ended up blocking on queue insert operations, eventually leading to the failure of the entire application.
Stack Exchange. Enabling StackEgg for all users resulted in heavy load on load balancers and consequently, a DDoS.
Stack Exchange. Backtracking implementation in the underlying regex engine turned out to be very expensive for a particular post leading to health-check failures and eventual outage.
Stack Exchange. Porting old Careers 2.0 code to the new Developer Story caused a leak of users' information.
Stack Exchange. The primary SQL-Server triggered a bugcheck on the SQL Server process, causing the Stack Exchange sites to go into read only mode, and eventually a complete outage.
Strava. Hit the signed integer limit on a primary key, causing uploads to fail.
Stripe. Manual operations are regularly executed on production databases. A manual operation was done incorrectly (missing dependency), causing the Stripe API to go down for 90 minutes.
Sweden. Use of different rulers by builders caused the Vasa to be more heavily built on its port side, and the ship's designer, not having built a ship with two gun decks before, overbuilt the upper decks, leading to a design that was top-heavy. Twenty minutes into its maiden voyage in 1628, the ship heeled to port and sank.
Tarsnap. A batch job which scans for unused blocks in Amazon S3 and marks them to be freed encountered a condition where all retries for freeing certain blocks would fail. The batch job logs its actions to local disk and this log grew without bound. When the filesystem filled, this caused other filesystem writes to fail, and the Tarsnap service stopped. Manually removing the log file restored service.
Telstra. A fire in a datacenter caused SMS text messages to be sent to random destinations. Corrupt messages were also experienced by customers.
Therac-25. The Therac-25 was a radiation therapy machine involved in at least six accidents between 1985 and 1987 in which patients were given massive overdoses of radiation. Because of concurrent programming errors, it sometimes gave its patients radiation doses that were thousands of times greater than normal, resulting in death or serious injury.
Trivago. Due to a human error, all engineers lost access to the central source code management platform (a GitHub organization). An Azure Active Directory security group controls access to the GitHub organization, and this group was removed during the execution of a manual and repetitive task.
Twilio. In 2013, a temporary network partition in the Redis cluster used for billing operations caused a massive resynchronization from the slaves. The overloaded master crashed, and when it was restarted it came up in read-only mode. This resulted in failed transactions from Twilio's auto-recharge service, which unfortunately billed customers before updating their balance internally. The auto-recharge system then continued to retry the transaction again and again, resulting in multiple charges to customers' credit cards.
Twilio. Twilio's incident of elevated SMS filtering toward the AT&T network in the United States.
Valve. Steam's desktop client deleted all local files and directories. The thing I find most interesting about this is that, after this blew up on social media, there were widespread reports that it had been reported to Valve months earlier. But Valve doesn't triage most bugs, resulting in an extremely long time-to-mitigate, despite having multiple bug reports on this issue.
Yeller. A network partition in a cluster caused some messages to get delayed, up to 6-7 hours. For reasons that aren't clear, a rolling restart of the cluster healed the partition. There's some suspicion that it was due to cached routes, but there wasn't enough logging information to tell for sure.
Zerodha. The Order Management System (OMS) provided to Zerodha, a stock broker, collapsed when an order for 1M units of a penny stock was divided into more than 0.1M individual trades against the typical few hundred, triggering a collapse of the OMS, a scenario not previously encountered by its provider, Refinitiv (formerly Thomson Reuters), a subsidiary of the London Stock Exchange.
Zerodha. A failure of the primary leased line to a CTCL between a stock broker and a stock exchange led to the activation of a backup leased line that was operating sporadically over the following hour, affecting bracket and cover orders. Subsequently, the process of placing and validating orders had been modified to incorporate the unreliability of the CTCL's leased lines, but the reliability of the primary and the backup leased lines was not fundamentally improved by the providers.
Unfortunately, most of the interesting post-mortems I know about are locked inside confidential pages at Google and Microsoft. Please add more links if you know of any interesting public post mortems! The collections below are a pretty good resource; other links to collections of post mortems are also appreciated.
AWS Post-Event Summaries
Availability Digest website.
Postmortems community (with imported archive from the now-dead G+ community).
John Daily's list of postmortems (in json).
Jeff Hammerbacher's list of postmortems.
NASA lessons learned database.
Tim Freeman's list of postmortems
Wikimedia's postmortems.
Autopsy.io's list of Startup failures.
SRE Weekly usually has an Outages section at the end.
Lorin Hochstein's list of major incidents.
Awesome Tech Postmortems.
Nat Welch's parsed postmortems is an attempt to build a database out of this markdown file.
Postmortem Templates is a collection of postmortem templates from various sources.
How Complex Systems Fail
John Allspaw on Resilience Engineering