Kanibal

Forum Downtime: March 6th - March 14th


Kanibal   

Welcome back to the FurMorphed Forum.  I'm sure it's been a very long week for you all with the community offline.  Those of you who follow us on Twitter or frequent the IRC will have been kept up to date with what was going on while the website was down.

 

Now that it's over, here is a summary of what happened, why we suspect it happened, and what has been done to try to ensure it never happens again.

 

Before the incident...

The day before the forum was taken down, it received an excessively large number of hits, not dissimilar to a DoS attack in many ways.  Because of how it happened, it didn't make a lot of page calls; it just loaded up the forum's database significantly.  Our host and the forum software alike interpreted this as high site traffic and enabled caching, which is the process of generating fewer pages by handing out saved copies generated previously.

For those of you browsing the forum and posting at the time, this is what made your posts seem to disappear: when you refreshed, you were handed the same saved page again, without your post shown.  Your post was still submitted successfully.

 

What took the site down...

Put simply, we did.  When the activity settled down, the forum was taken out of caching mode and all of the information was refreshed.  However, our host was still caching pages (this was the point at which we found that out), and the second massive spike in server activity caused by the forum refreshing made them tighten the rope around our traffic even further.  This caused cached pages to be handed out without authentication checks, and many members noticed they were logged in as someone else.  To prevent any security issues that could have occurred, the site was taken down.  It's important to note that at this stage our forum software was not involved, because the pages were being handed out by the server without any interaction with the website code.
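
For the technically curious, here is a rough sketch of how a page cache can end up showing one member another member's logged-in page.  This is not our host's software; the class and function names are made up for illustration, and it simply assumes a cache that stores pages by address alone versus one that also takes the visitor into account.

```python
from typing import Dict, Optional, Tuple


def render_page(url: str, user: Optional[str]) -> str:
    """Hypothetical stand-in for the forum generating a page for one visitor."""
    who = user if user is not None else "guest"
    return f"{url} | logged in as {who}"


class UrlOnlyCache:
    """Stores pages by URL alone, so the first visitor's page is reused for everyone."""

    def __init__(self) -> None:
        self._pages: Dict[str, str] = {}

    def get(self, url: str, user: Optional[str]) -> str:
        if url not in self._pages:
            self._pages[url] = render_page(url, user)
        # No check of who is asking: whoever comes next gets the saved copy.
        return self._pages[url]


class PerUserCache:
    """Stores pages by (URL, user), so a saved page is only reused for the same visitor."""

    def __init__(self) -> None:
        self._pages: Dict[Tuple[str, Optional[str]], str] = {}

    def get(self, url: str, user: Optional[str]) -> str:
        key = (url, user)
        if key not in self._pages:
            self._pages[key] = render_page(url, user)
        return self._pages[key]


if __name__ == "__main__":
    shared = UrlOnlyCache()
    shared.get("/index", "alice")
    print(shared.get("/index", "bob"))    # bob receives the page saved for alice

    separate = PerUserCache()
    separate.get("/index", "alice")
    print(separate.get("/index", "bob"))  # bob receives his own page
```

The first cache reproduces the symptom members reported; the second shows one common way to avoid it, at the cost of storing more pages.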

 

Why did it happen and could it happen again?

FM resides on a shared hosting package, which means the server our site runs on also hosts dozens and dozens of other websites.  To keep everything running smoothly, the amount of server resources each website may use is limited.  We breached our limit by quite a large margin, and that is why we started getting cached.  In theory, yes, this could happen again.  All it takes is for the site traffic (or database traffic) to become sufficiently high, and the site could use more than the resources we are permitted, though steps have now been taken to try to prevent this.

 

Why did it take so long to get back?

Our host.  It took them nearly a week to respond to our ticket regarding the breach in security their software caused and to disable the caching.  Add to that a few days for me to run through the logs and the ACP to determine what happened and how to prevent it in the future.

 

Why was there nothing in place to prevent this? / What has changed to try and stop this happening again?

There were many measures in place to try to prevent this from happening when the site is under load.  However, as stated above, the way in which the incident occurred resembled a DoS attack, and these are almost impossible to prevent with our level of control.

What we have learned from this has made us change the way the forum calculates load and what it does when it reaches our imposed load limit.  We have also removed and curtailed certain features of the site.  Our load limit for FM is well below our host's enforced limit, which should hopefully ensure that even if something goes wrong we never breach the terms of our hosting contract again.

From now on, the forum caching will be handled differently and used more widely.  This may sound backward when caching caused the problem, but the server caching whole pages and the forum caching the results of database calls are very different things.  The load will be greatly reduced by this, and should the site load ever reach the load limit again, the site will now display a "Website Busy" page instead of trying to continue serving pages in different ways.
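
To illustrate the difference, here is a minimal sketch of caching database results while still building each page per visitor, with a "Website Busy" fallback once a load limit is reached.  It is not the forum software's actual code: the names (`QueryCache`, `serve`), the load numbers, and the query text are all assumptions made up for the example.

```python
from typing import Callable, Dict

BUSY_PAGE = "Website Busy"


class QueryCache:
    """Remembers the result of each database query so repeats skip the database."""

    def __init__(self, run_query: Callable[[str], str]) -> None:
        self._run_query = run_query        # hypothetical function that hits the database
        self._results: Dict[str, str] = {}
        self.db_calls = 0                  # counts how often the database was really used

    def fetch(self, sql: str) -> str:
        if sql not in self._results:
            self.db_calls += 1
            self._results[sql] = self._run_query(sql)
        return self._results[sql]


def serve(current_load: float, load_limit: float, cache: QueryCache, user: str) -> str:
    """Build a page for one visitor, or refuse outright when the site is too busy."""
    if current_load >= load_limit:
        return BUSY_PAGE
    # The shared data comes from the cache, but the page itself is built per visitor.
    topics = cache.fetch("SELECT title FROM topics")
    return f"Hello {user}: {topics}"


if __name__ == "__main__":
    cache = QueryCache(lambda sql: "Forum Downtime")
    print(serve(0.5, 0.8, cache, "alice"))
    print(serve(0.5, 0.8, cache, "bob"))
    print(cache.db_calls)                   # the database was only queried once
    print(serve(0.9, 0.8, cache, "alice"))  # over the limit, so the busy page is shown
```

Because only the query result is shared, each visitor still gets a page built for their own login, which is what separates this from the server-level page caching described earlier.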

 

We apologise for what happened and for the length of the downtime.  No data was lost or damaged during the downtime because, thankfully, the incident happened to peak during a staff meeting - there could not have been more staff present at any other time!

 

I'll now attempt to field any further questions that arise from this...

Kanibal   

We only know where the second spike came from. The second spike in CPU usage on the server wasn't traffic at all and was in fact me - when I saw the forum handing out cached pages I ran the admin command to have the software rebuild the cache. Rebuilding the entire cache apparently isn't done successively and is done all at once - my bad.

The initial spike which began the trouble, however, we have no idea about. We had an IP address at the time, but tracing it wouldn't tell us anything and it wasn't connected to a registered member's account.
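
As a rough illustration of "all at once" versus "successively", the sketch below rebuilds cache entries in small batches with a pause between them, rather than in one burst.  This is not the forum's admin command; `rebuild_in_batches`, the batch size, and the `pause` callback are hypothetical names chosen for the example.

```python
from typing import Callable, Iterator, List, Sequence


def batches(keys: Sequence[str], size: int) -> Iterator[List[str]]:
    """Split the list of cache keys into chunks of at most `size` entries."""
    for start in range(0, len(keys), size):
        yield list(keys[start:start + size])


def rebuild_in_batches(
    keys: Sequence[str],
    rebuild_one: Callable[[str], None],
    size: int,
    pause: Callable[[], None],
) -> int:
    """Rebuild every entry a few at a time, pausing after each batch.

    Returns the number of batches processed.
    """
    count = 0
    for batch in batches(keys, size):
        for key in batch:
            rebuild_one(key)
        count += 1
        pause()  # e.g. a short sleep, giving the server room to breathe
    return count


if __name__ == "__main__":
    import time

    rebuilt: List[str] = []
    keys = [f"topic-{n}" for n in range(10)]
    done = rebuild_in_batches(keys, rebuilt.append, size=4, pause=lambda: time.sleep(0.01))
    print(done, len(rebuilt))  # 3 batches covering all 10 entries
```

The total work is the same either way; spreading it out just keeps the load at any one moment lower.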
