Stack under attack: what we learned about handling DDoS attacks

As a very popular website, Stack Overflow gets a lot of attention. Some of it is good, like the time we were nominated for a Webby Award. Other times, that attention is distinctly less good, as when we get targeted for distributed denial of service (DDoS) attacks.

For the past few months, we’ve been the target of ongoing DDoS attacks. These attacks have fallen into two types: our API has been hit by application layer attacks, while the main site has been subject to a volume-based attack. Both of these attacks take advantage of the surfaces that we expose to the internet.

Volume Based Attacks: Includes UDP floods, ICMP floods, and other spoofed-packet floods. The attack’s goal is to saturate the bandwidth of the attacked site, and magnitude is measured in bits per second (bps).

Protocol Attacks: Includes SYN floods, fragmented packet attacks, Ping of Death, Smurf DDoS and more. This type of attack consumes actual server resources, or those of intermediate communication equipment, such as firewalls and load balancers, and is measured in packets per second (Pps).

Application Layer Attacks: Includes low-and-slow attacks, GET/POST floods, attacks that target Apache, Windows or OpenBSD vulnerabilities and more. Comprised of seemingly legitimate and innocent requests, the goal of these attacks is to crash the web server, and the magnitude is measured in Requests per second (Rps).
Caption: DDoS attacks fall into three categories.

We’re still getting hit regularly, but thanks to our SRE and DBRE teams, along with some code changes made by our Public Platform Team, we’ve been able to minimize the impact that they have on our users’ experience. Some of these attacks are now only visible through our logs and dashboards.

We wanted to share some of the general tactics that we’ve used to dampen the effect of DDoS attacks so that others under the same kinds of attack can minimize them.

Botnet attacks on expensive SQL queries

In two application-layer attacks, an attacker leveraged a very large botnet to trigger a very expensive query. Some backend servers hit 100% CPU utilization during this attack. What made this extra challenging is that the attack was distributed over a huge pool of IP addresses; some IPs sent only two requests, so rate limiting by IP address would be ineffective.

We had to create a filter that separated the malicious requests from the legitimate ones so we could block those specific requests. Initially, the filter was a bit overzealous, but over time we refined it to identify only the malicious requests.

After we mitigated the attack, they regrouped and tried targeting user pages by requesting super high page counts. To avoid detection or bans, they incremented the page number their bots requested. This subverted our previous controls by attacking a different area of the site while still exploiting the same vulnerability. In response, we put a filter in place to identify and block the malicious traffic.
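As an illustration only (this is not the actual filter described above), one simple way to flag absurdly deep pagination is to cap the page number a request may ask for. The `page` query parameter, the `MAX_PAGE` value, and the function name are all assumptions made for this sketch.

```python
from urllib.parse import urlparse, parse_qs

MAX_PAGE = 500  # hypothetical cap; tune to how deep real users actually page


def is_suspicious_page_request(url: str, max_page: int = MAX_PAGE) -> bool:
    """Return True when the `page` query parameter is malformed or out of range."""
    query = parse_qs(urlparse(url).query)
    values = query.get("page")
    if not values:
        return False  # no page parameter: treat as the first page
    try:
        page = int(values[0])
    except ValueError:
        return True  # a non-numeric page number is never legitimate
    return page < 1 or page > max_page
```

A request such as `/users?page=999999` would be flagged, while `/users?page=3` would pass. A cap like this does not depend on the client IP, which matters when the botnet spreads requests over many addresses.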

These API routes, like any API that pulls data from a database, are essential to the day-to-day functioning of Stack Overflow. To protect routes like these from DDoS, here’s what you can do:

  • Insist that every API call be authenticated. This will help identify malicious users. If having only authenticated API calls is not possible, set stricter limits for anonymous / unauthenticated traffic.
  • Minimize the amount of data a single API call can return. When we build our front page question list, we don’t retrieve all of the data for every question. We paginate, lazy load only the data in the viewport, and request only the data that will be visible (that is, we don’t request the text for every answer until loading the question page itself).
  • Rate-limit all API calls. This goes hand-in-hand with minimizing data per call; to get large amounts of data, the attacker will need to call the API multiple times. Nobody needs to call your API 100 times per second.
  • Filter malicious traffic before it hits your application. HAProxy load balancers sit between all requests and our servers to balance the amount of traffic across our servers. But that doesn’t mean all traffic has to go to one of those servers. Implement thorough and easily queryable logs so malicious requests can be easily identified and blocked.
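To make the rate-limiting advice concrete, here is a minimal token-bucket sketch with a stricter allowance for anonymous callers. The `LIMITS` values, the `RateLimiter` class, and the idea of keying buckets by an API key or session identifier are assumptions for illustration, not a description of Stack Overflow's production setup.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical limits: (tokens refilled per second, maximum burst size).
LIMITS: Dict[str, Tuple[float, float]] = {
    "authenticated": (10.0, 20.0),
    "anonymous": (1.0, 5.0),  # stricter for unauthenticated traffic
}


class RateLimiter:
    """Token bucket per client key; the clock is injectable for testing."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._buckets: Dict[str, Tuple[float, float]] = {}  # key -> (tokens, last_seen)

    def allow(self, client_key: str, authenticated: bool) -> bool:
        rate, burst = LIMITS["authenticated" if authenticated else "anonymous"]
        now = self._clock()
        tokens, last_seen = self._buckets.get(client_key, (burst, now))
        # Refill in proportion to elapsed time, never above the burst size.
        tokens = min(burst, tokens + (now - last_seen) * rate)
        if tokens >= 1.0:
            self._buckets[client_key] = (tokens - 1.0, now)
            return True
        self._buckets[client_key] = (tokens, now)
        return False
```

With these example limits, an anonymous client can make five calls in a burst and then one per second, while an authenticated client gets a burst of twenty. A real deployment would also need to evict idle buckets and share state across servers.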

Whack-a-mole on malicious IPs

We were also subject to some volume-based attacks. A botnet sent a large number of `POST` requests to ``. This one was easy: since we don’t use a trailing slash on that URL, we blocked all traffic on that specific path.

The attacker figured it out, dropped the trailing slash, and came back at us. Instead of just reactively blocking every route the attacker hit, we collected the botnet IPs and blocked them through our CDN, Fastly. This attacker took three swings at us: the first two caused us some difficulties, but once we collected the IPs from the second attack, we could block the third attack instantly. The malicious traffic never even made it to our servers.

A new volume-based attack, possibly from the same attacker, took a different approach. Instead of throwing the entire botnet at us, they activated just enough bots to disrupt the site. We’d put those IPs on our CDN’s blocklist, and the attacker would send the next wave at us. It was like a game of Whack-a-mole, except not fun and we didn’t win any prizes at the end.

Instead of having our incident teams scramble and ban IPs as they came in, we automated it like good little SREs. We created a script that would check our traffic logs for IPs behaving a specific way and automatically add them to the ban list. Our response time improved on every attack. The attacker kept going until they got bored or ran out of IPs to throw at us.

Volume-based attacks can be more insidious. They look like regular traffic, just more of it. Even if a botnet is focusing on a single URL, you can’t always just block the URL. Legitimate traffic hits that page, too. Here are a few takeaways from our efforts:
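A rough sketch of this kind of path-based rule follows: block trailing-slash variants of routes that are never served with one, and block `POST` requests to paths that do not accept them. The route sets and the `should_block` helper are hypothetical and exist only for this example; in practice a rule like this would usually live in the load balancer or CDN configuration.

```python
# Hypothetical route tables; a real application would derive these from its router.
GET_ROUTES = {"/items", "/search"}
POST_ROUTES = {"/items/new"}
ALL_ROUTES = GET_ROUTES | POST_ROUTES


def should_block(method: str, path: str) -> bool:
    """Return True for requests that no legitimate client should be sending."""
    # Trailing-slash variant of a route that is never served with a slash.
    if len(path) > 1 and path.endswith("/") and path.rstrip("/") in ALL_ROUTES:
        return True
    # POST to a path that does not accept POST.
    if method.upper() == "POST" and path not in POST_ROUTES:
        return True
    return False
```

Under these assumptions, `should_block("POST", "/items/")` is blocked while `should_block("GET", "/items")` goes through.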
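A script along these lines could be sketched as follows. The log format (`ip method path` separated by whitespace), the `Request` tuple, the `find_offenders` helper, and the threshold are all assumptions; pushing the resulting IPs to a CDN blocklist is left out because it depends on the provider's API.

```python
from collections import Counter
from typing import Callable, Iterable, NamedTuple, Optional, Set


class Request(NamedTuple):
    ip: str
    method: str
    path: str


def parse_line(line: str) -> Optional[Request]:
    """Parse a hypothetical 'ip method path' log line; return None if malformed."""
    parts = line.split()
    if len(parts) < 3:
        return None
    return Request(parts[0], parts[1].upper(), parts[2])


def find_offenders(
    lines: Iterable[str],
    is_bot_like: Callable[[Request], bool],
    threshold: int = 3,
) -> Set[str]:
    """Return IPs with at least `threshold` requests matching the bot pattern."""
    hits: Counter = Counter()
    for line in lines:
        request = parse_line(line)
        if request is not None and is_bot_like(request):
            hits[request.ip] += 1
    return {ip for ip, count in hits.items() if count >= threshold}
```

The pattern is passed in as a function so it can change from one attack to the next, for example `lambda r: r.method == "POST" and r.path.endswith("/")`. The returned set is what would be handed to the blocklist.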

  • Block weird URLs. If you start seeing trailing slashes where you don’t use them or `POST` requests to invalid paths, flag and block those requests. If you have catch-all pages and start seeing strange URLs coming in, block them.
  • Block malicious IPs even if legitimate traffic can originate from them. This does cause some collateral damage, but it’s better to block some legitimate traffic than be down for all traffic.
  • Automate your blocklist. The problem with blocking a botnet manually is the toil involved in identifying a bot and sending the IPs to your blocklist. If you can recognize the patterns of a bot and then automate blocking based on that pattern, your response time will go down and your uptime will go up.
  • Tar pitting is a great way to slow down botnets and mitigate volume-based attacks. The idea is to reduce the number of requests being sent by the botnet by increasing the time between requests.
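One way to picture tar pitting is a delay that grows as a client exceeds a free allowance, so each further request costs the bot more waiting time. The sketch below assumes an exponential schedule with a cap; the function name, parameters, and default values are illustrative, and the server would still need to apply the delay (for example by sleeping before it responds).

```python
def tarpit_delay(
    recent_requests: int,
    free_requests: int = 10,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
) -> float:
    """Seconds to stall a client, given its request count in the current window."""
    excess = recent_requests - free_requests
    if excess <= 0:
        return 0.0  # well-behaved clients are not slowed down
    # Double the delay for each request over the allowance, up to max_delay.
    # The exponent is capped so very large counts cannot overflow.
    return min(max_delay, base_delay * 2 ** min(excess - 1, 30))
```

With these defaults, the first ten requests in a window are served immediately, the eleventh waits half a second, and the delay doubles from there until it reaches thirty seconds.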

Other things we learned

By having to deal with a lot of DDoS attacks back-to-back, we were able to learn and improve our overall infrastructure and resiliency. We’re not about to say thanks to the botnets, but nothing teaches better than a crisis. Here are a few of the big overall lessons we learned.

Invest in monitoring and alerting. We identified a few gaps in our monitoring protocols that could have alerted us to these attacks sooner. The application layer attacks in particular had telltale signs that we could add to our monitoring portfolio. In general, improving our tooling overall has helped us respond and maintain site uptime.

Automate all the things. Because we were dealing with multiple DDoS attacks in a row, we could spot the patterns in our workflow better. When an SRE sees a pattern, they automate it, which is exactly what we did. By letting our systems handle the repetitive work, we reduced our response time.

Write it all down. If you can’t automate it, record it for future firefighters. It can be hard to step back during a crisis and take notes. But we managed to take some time and create runbooks for future attacks. The next time a botnet floods us with traffic, we’ve got a head start on handling it.

Talk to your users. Tor exit nodes were the source of a significant amount of traffic during one of the volume attacks, so we blocked them. That didn’t sit well with legitimate users who happened to use the same IPs. Users started a bit of wild speculation, blaming Chinese Communists for preventing anonymous access to the site (to be fair, that’s half right: I’m Chinese). We had no intention of blocking Tor access permanently, but it was preventing other users from reaching the site, so we got on Meta to explain the situation before the pitchforks came out en masse. We’re now adding communication tasks and tooling into our incident response runbooks so we can be more proactive about informing users.

DDoS attacks can often come with success on the internet. We’ve gotten a lot of attention over the last 12 years, and some of it is bound to be negative. If you find yourselves on the receiving end of a botnet’s attention, we hope the lessons that we’ve learned can help you out as well.

Tags: DDoS, devops, security
