Imageforge
Chevereto Member
Specifications:
Script version: 3.6.9
Environment: Nginx 1.8 (also tested 1.4) / Linux (kernel 3.19) / MySQL 5.5.46 / PHP 5.5.9 as php-fpm, fpm-fcgi (also tested on LAMP)
Front End: Cloudflare using round-robin DNS to our two load balancers (also tested with no load balancers and without Cloudflare)
External Storage: RunAbove via OpenStack
For a couple of weeks I've been trying to figure out what is causing massive freezes, hangs, and lag in the site's responsiveness.
I've finally found what triggers it, but not the underlying problem. As it turns out, when a user or admin runs a bulk-delete job, the site becomes unresponsive for as long as that job is being processed. Looking at system resources, I see CPU usage rise to 300% and load averages go up, but not nearly enough to explain the unresponsiveness.
If I select, for example, 100 images for deletion, the site hangs for a long time; so long, in fact, that users think it's under attack.
I've tested this on several different high-performance hardware and software configurations, including both Apache and nginx. The problem persists on a single dedicated server with 8 CPUs and 64 GB of RAM, as well as on a high-performance cluster of dedicated servers, each with 32 Intel Xeon CPUs, 256 GB of RAM, and 1.4 TB of SSD storage in RAID 10 behind a MegaRAID controller. It's not a hardware problem: every test platform far exceeds the required specifications. The cluster is built for HA, with 4 application servers, 2 replicated DB servers, and 2 load balancers in front.
When no bulk-delete job is running, the site is blazing fast.
I don't see a bottleneck on the MySQL database servers, nor does it seem to be a local I/O problem. The one constant across every test setup is RunAbove.
We also tried optimizing the MySQL database, with no improvement.
Any ideas?
Thanks for your time.