Batch (bulk) Delete Causes Lag / Unresponsive Site

Imageforge

Specifications:

Script version: 3.6.9
Environment: Nginx 1.8 (also tested 1.4) / Linux (kernel 3.19) / MySQL 5.5.46 / PHP 5.5.9 as php-fpm, fpm-fcgi (Also tested on LAMP)
Front End: Cloudflare using round-robin DNS to our two load balancers (also tested with no load balancers, and no cloudflare)
External Storage: RunAbove via OpenStack

For a couple of weeks I've been trying to figure out what is causing massive freezes / hangs / lag in site responsiveness.

I have finally discovered the trigger, but not the specific underlying problem. As it turns out, when a user or admin executes a bulk-delete job, the site becomes unresponsive while that work is processed. Looking at the system resources, I see CPU usage rise about 300% and load averages climb as well, but not enough to explain the unresponsive condition.

If I select, for example, 100 images for deletion, the site hangs for a long time. So long, in fact, that users think it's under attack.

I've tested this on several different high-performance hardware and software configurations, including both Apache and nginx. The problem persists on a single 8-CPU dedicated server with 64 GB of RAM, as well as on a high-performance cluster of dedicated servers, each machine with 32 x Intel Xeon CPUs, 256 GB of RAM, and 1.4 TB of SSD in RAID 10 behind a MegaRAID controller. It's not a problem with the hardware, as each test platform comfortably exceeds the required specifications. The cluster is built for HA, using 4 application servers, 2 replicated DB servers, and 2 load balancers out front.

When the bulk-delete job is not causing the problem, the site is blazing fast.

I don't see a bottleneck at the MySQL database servers, nor does it seem to be a problem with local IO. The one constant across every test configuration is RunAbove.

We also tried to solve this by optimizing the MySQL database, with no improvement.

Any ideas?

Thanks for your time.
 
Oooh. Interesting. Have you tried any other sort of external or local storage to see if this still happens?
 
Try using this class.queue.php; note whether the files get deleted (from both Chevereto and OpenStack) and whether your website gets back to normal.
 

Attachments

  • class.queue.zip (2.4 KB)
The class changes seem to have drastically improved the situation. There's still a notable hang of roughly 8-10 seconds after starting the job, but that's much better than 3 minutes of the site being totally locked up while we watch our traffic plummet.

If this class modification is production-ready, I will roll it out and continue monitoring it. Also, could you tell me basically what is happening when a batch job is called? Obviously the queue is offloaded to a background worker.... I'm just curious where that is being done.

Thanks very much for your time, and for your prompt response, that's really awesome!

EDIT: Further, the images are being properly deleted from both the database, and the storage node.
 
When I first added php-opencloud (the library that handles OpenStack), I didn't notice that a bulk method was available, so all object deletes were sent to the external storage server one by one. Most likely the problem was the large number of HTTP requests to the RunAbove API.

The only storage methods that will still use single-object delete are FTP and Google Cloud. For FTP that is expected because it is a very outdated protocol (everybody should use SFTP), and as far as I know Google Cloud doesn't offer a bulk delete in its PHP SDK.
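
To illustrate the difference, here is a rough sketch (not the actual class.queue.php code; the dataObject() and bulkDelete() calls reflect how I recall the php-opencloud ObjectStore API, so verify the exact method names against the version you have installed):

    // Old behavior: one HTTP DELETE per object, one round trip to RunAbove each.
    // $container is the php-opencloud ObjectStore container instance.
    foreach ($filenames as $filename) {
        $container->dataObject($filename)->delete();
    }

    // New behavior: one bulk request for the whole batch. Paths are
    // "container/object" strings; method names are from memory, so check
    // them against your installed php-opencloud version.
    $paths = array();
    foreach ($filenames as $filename) {
        $paths[] = $containerName . '/' . $filename;
    }
    $service->bulkDelete($paths);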

Cheers,
Rodolfo.
 
Good, good.

I'll push this change to the application server cluster and will update this thread in a week or so, after some production usage.

Regards.
 
Unfortunately, this problem is not solved.

Same results as before; my previous testing simply wasn't thorough enough to provoke the issue when I presumed it fixed. My latest test confirmed the following:

Model: One user, 10 albums, 771 images.
Action: Delete user
Result: Totally unresponsive to HTTP requests for ~200 seconds (roughly 3 minutes) while the job is executed. System resource graphs show no obvious bottleneck, with CPU / RAM / NET / LOAD all nominal, and zero IO wait. When this issue isn't occurring, the cluster is really fast, even at peak times when traffic is heavy (we average 150 million requests monthly). The cluster has never gone over 10% for any resource, even when daily backups are underway.

I'm anxious to get this sorted out, so if you would like me to do something specific in order to help diagnose the issue, please let me know.
 
The only thing that you can do is debug where the bottleneck actually is; that can be found using a profiler like this one: http://www.xdebug.org/docs/profiler
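
Assuming Xdebug 2 on your PHP 5.5 setup, enabling the profiler is just a few php.ini directives (the paths below are only examples). Restart php-fpm afterwards, hit one bulk-delete request with ?XDEBUG_PROFILE=1 appended, and open the resulting cachegrind.out.* file in KCachegrind or QCachegrind:

    ; php.ini - Xdebug 2.x profiler, example values
    zend_extension = xdebug.so
    xdebug.profiler_enable = 0             ; keep the profiler off globally
    xdebug.profiler_enable_trigger = 1     ; profile only requests carrying XDEBUG_PROFILE
    xdebug.profiler_output_dir = /tmp/xdebug
    xdebug.profiler_output_name = cachegrind.out.%t.%p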

If the bottleneck is in Chevereto I can have a look; otherwise I believe it is either a server configuration issue or an OpenStack thing, because the library was made by Rackspace so I don't think it is rubbish. Also, Chevereto only handles a handful of queue jobs per request, so what is going on is very odd.
 
What server configurations would you suspect, keeping in mind that the problem is reproduced under both Apache (mod_php, suPHP) and nginx (php-fpm)? I will inspect the kernel logs to see if nginx is complaining about connection backlog or other performance-related errors.

I concur that php-opencloud isn't rubbish, nor am I assigning that label to Chevereto. Returning to my original analysis, I suspect the problem is a choke point somewhere in the OpenStack Swift API, or possibly in the Swift configuration itself.

Let's start by picking the lowest-hanging fruit next: server attributes. Where should I check?
 
I told you to profile the thing. Set up cachegrind, otherwise you will spend a month trying to guess what is going on.
 
A bulk delete operation should take seconds; it is just HTTP requests, so if the thing is being unresponsive it could simply be a network issue.

Chevereto processes the queues using a 1x1 pixel which is loaded in the front end, so it gets executed every time someone sends a request to your website. If we think about it, 700 images will be split into 3 pixel jobs, so it will be just 3 requests to the OpenStack server. That's almost nothing, and if it is hanging it could be that PHP is still waiting for the HTTP response.
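
Roughly, the pixel endpoint does something like this (an illustrative sketch only, not the real Chevereto code; the table, column and method names and the chunk size are invented here):

    <?php
    // Sketch of a pixel-driven queue worker. $db is assumed to be a PDO
    // connection and $storage an external-storage client exposing a
    // bulkDelete(array $paths) method; both names are placeholders.

    // 1. Grab a small chunk of pending delete jobs.
    $jobs = $db->query(
        "SELECT id, container, object_name FROM queue
         WHERE status = 'pending' ORDER BY id LIMIT 256"
    )->fetchAll(PDO::FETCH_ASSOC);

    if ($jobs) {
        // 2. Collapse the chunk into a single bulk-delete API call.
        $paths = array();
        $ids   = array();
        foreach ($jobs as $job) {
            $paths[] = $job['container'] . '/' . $job['object_name'];
            $ids[]   = (int) $job['id'];
        }
        $storage->bulkDelete($paths); // one HTTP request instead of count($paths)

        // 3. Mark the chunk done; the next page view processes the next chunk.
        $db->exec("UPDATE queue SET status = 'done' WHERE id IN (" . implode(',', $ids) . ")");
    }

    // 4. Return the 1x1 transparent GIF so the <img> tag on the page resolves.
    header('Content-Type: image/gif');
    readfile(__DIR__ . '/pixel.gif');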

If the logs don't show any resource issue... the only thing left is the network. It could be that the network becomes unresponsive and hangs the script, and therefore PHP stops responding for that connection, so you experience an unresponsive website while everything else is OK. You should check the network interface to see if it becomes unresponsive.

I asked you to profile the thing because that way you can easily see which function is taking too long to complete, and basically find any bottleneck. A bottleneck is not necessarily caused by high CPU use; even unresponsive network issues will cause a bottleneck, because the request just takes forever to complete something simple like a file_get_contents() call. That makes it very easy to locate the conflict in code terms, but if you want to debug it using something else, go ahead.
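
For example, a plain file_get_contents() against an unresponsive host will block for the full default_socket_timeout (60 seconds by default), which is exactly the kind of hidden wait a profiler makes obvious. Bounding the wait looks roughly like this (the URL and timeout value are just examples):

    // Unbounded: hangs for default_socket_timeout (60s) if the host stops responding.
    $body = file_get_contents('https://storage.example.com/v1/some-object');

    // Bounded: fail fast and log it instead of freezing the whole request.
    $context = stream_context_create(array(
        'http' => array('timeout' => 5), // seconds, example value
    ));
    $body = file_get_contents('https://storage.example.com/v1/some-object', false, $context);
    if ($body === false) {
        error_log('External storage request timed out or failed');
    }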
 