I have a LAMP stack setup on Digital Ocean (Ubunu 12.04) that is pretty stable. The only time we have had a crash is when we sent out a mass email to about 30,000 people. We are not using the server to send the message, but a third-party email service (iContact). I watch the server with Top and can see it fill up with apache entries (each taking about 20MB) for a short while then drop back down after the mail has finished being sent.
I have successfully adjusted the apache settings to no longer crash - it just slows down for a bit. These are not hits to the pages, but something is making apache ramp up and spin off a ton of workers during the email send process.
My question is, where do I look to get some idea of what is happening? Unfortunately iContact has been no help and the log files I've looked at aren't telling me much, so I think I'm likely looking in the wrong place.
I used to send emails to over 200,000 people directly from a single machine. Trying to do it from a webpage is pretty crazy, so I wrote a command-line based script to first write it into a database, and then send ~50 at a time from the database, deleting as it went.
With Symfony/Swiftmailer it is pretty easy these days - the sending part is just a shell script that keeps running 'app/console swiftmailer:spool:send', sleeping and restarting till the database is empty.
Related
We are currently testing an upgrade from CF11 to CF2018 for my company's intranet. To give you an idea how long this site has been running, our first version of CF was 3.1! It is still using application.cfm, and there is code from 1998, when I started writing this thing. Yes, 21 years -- I'm astonished, too. It is a hodgepodge of all kinds of older frameworks, too, including Fusebox.
Anyway, we're running Win 2012 VM connected to a SQL 2016 farm. Everything looked OK initially, but in the Week I've been testing, the server has come to a slowdown once (a page took more than 5 seconds to run, something that usually takes 100ms, no DB involvement), and another time, the server came to a grinding halt. The only way I could restart CF App service was by connecting to the server with another server via Services, because doing it via Remote Desktop was so slow.
Now keep in mind -- it's just me testing. This is a site that doesn't have a ton of users, but still, having 5 concurrent connections is normal and there are upwards of 200-400 users hitting this thing every day.
I have FusionReactor running on this thing now, so the next time a lockup happens, I will be able to take a closer look, but what do you think is the best way I can test this? Our site is mostly transactional, users going and filling out forms to put internal orders through. We also connect to XML web services and REST services; we also provide REST services, too. Obviously there's no way to completely replicate a production server's requests onto a test server, but I need to do more thorough testing. Any advice would be hugely appreciated.
I realize your focus for now is trying to recreate the problem on test. That may not be as easy as hoped. Instead, you should be able to understand and resolve it in production. FusionReactor can help, but the answer may well be in the cf logs.
You don't mention assessing the logs at the time of the hangup. See especially the coldfusion-error log, for outofmemory conditions.
You mention raising the heap, but the problem may be with the metaspace instead. If so, consider simply removing the maxmetaspace setting in the jvm args. That may be the sole and likely cause of such new and unexpected outages.
Or if it's not, and there's nothing in the logs at the time, THEN do consider FR. Does IT show anything happening at the time?
If not then consider a need to tune the cf/web server connector. I assume you're using iis. How many sites do you have? And how many connectors (folders in the cf config/wsconfig folder)? What are the settings in their workers.properties file? Are they optimized for the number of sites using that connector?
Also, have you updated cf2018? Are there any errors in the update error log? Did you update the web server connector also?
Are you running the cf2018 pmt (performance monitoring tool set)? Have you updated it?
There could be still more to consider, but let's see how it goes with those. I have blog posts on these and many more topics that would elaborate on things, both at my site (carehart.org) and the Adobe cf portal (coldfusion.adobe.com).
But let's hear if any of this gets you going.
I have VPS on hetzner. Server is located in Germany.
It has 256GB RAM, 6CPUs (12 threads).
I have a file which since yesterday, is requested about 30 times in one second. File has 2 Select, 2 Update, 2 Insert queries, so I assumed (not sure how this works) from this file server has about 180 requests per second. So right after this requests started, all the websites on the server just started loading poorly. I made this file run just one select query and than die. This didn't help. In WHM load is aboiut 0.02.
I've checked for error logs and there is no max_user_connection or any error there.
I have enabled slow query log and checked log file. there is nothing (I've tested it with select sleep(10) and this query was logged).
This is visit statistics, please bring your attention to may 30th:
Bandwidth stats for last 24 hours:
There are many errors like this in ssl_log (diff IPs of course):
188.121.206.150 - - [30/May/2018:19:50:03 +0200] "-" 408 - "-" "-"
I've been searching web a lot and couldn't find any solution. Could anyone at least tell what should I monitor or where. I have full access to anything there is possible inside the server. Any help is appreciated.
UPDATE 1
I have subdomain: banners.analyticson.com (access allowed for now) and there I have all the images and html5 files that are requested.
Take one image for example : https://banners.analyticson.com/img/suy8G1S6RU.jpg
It needs too much time to load. As I noticed, this sub domain has some issue.
Script, that I mentioned earlier (with 6 queries) just tries to get one of those banners to the user, so result of that script is to return one banner from banners.analyticson.com.
UPDATE 2
I've checked my script, it is fine. It takes less than 1 second to complete.
I've also checked Top command and there is a result. I'm not sure if $MEM value is fine.
You're going to have to narrow the problem down...
There are multiple potential issues.
First thing to eliminate would be the performance of your new script on a development laptop - I assume you're using PHP, so use the profiling tools to work out what is going on. If it's a database query, you'll see which one by looking at the profiler.
If your PHP script and database queries are fine, the next thing to look at: it sounds like you've hit some bottleneck resource on your infrastructure. In these cases, scripts that run fine as a single request start queueing for the bottleneck resource, and every new request adds to the queue until the whole server starts to crawl. This can be a bit of a puzzle - start with top and keep digging.
Next, I'd look at configuration of Apache to make sure everything is squeaky clean - Apache used to have a default to do a reverse DNS lookup for every request, which slows the server down rather impressively on production. You may also want to look at your SSL configuration - the error you report is linked to a load balancer issue.
If it's not as simple as memory, CPU etc., you're into more esoteric issues. You may need to ramp up a load testing rig so you can experiment without affecting the live site - typically, I do this on a machine as similar to live as possible, using Apache JMeter to generate load, and find the "inflection point". Typically, you see response times increase linearly with the number of concurrent requests, until you hit the bottleneck resource, at which point the response time increases rapidly. As a simple example, if you have 10 database connections available, response time should increase linearly up to 10 concurrent connections, and then become much larger from 11 up.
Knowing where the inflection point is and being able to recreate it allows you to use PHP profiling tools under load. This is a lot of work.
UPDATE
You're using php-cgi; this is easily the most inefficient way of running PHP scripts. Your server is barely breaking a sweat - CPU and memory basically idle. Here's a comparison for how to run PHP; consider changing to mod_php.
Can anyone tell me what is holding us back.
I tried every different php script in front end to send emails. Interspire, Oempro, PHP list, PHPknode. But we are only able to send 5 emails every 2 seconds.
We upgraded our server, Our H/W configuration is good. We have used EXIM, We even tried PMTA. Even though our sending speed does not improved.
Our requirement is to send 200,000 - 300,000 emails a day But we need to send this in peak hours i.e. between 9am to 1pm. We are only able to send 15000 emails in 6-7 hours.
I don't know what is the problem, Why are we not able to send emails quickly. Is it because of the PHP script, MTA or the server h/w configuration.
Can anyone please help me with this problem? Any help will be appreciated.
I can tell you directly that Interspire Emailmarketer is not especially high-performing. I had a similar situation as you do. We had a high-end server machine, with SAS disks, 16 CPU cores and lots of RAM. We had a highly fine-tuned Postfix MTA and MySQL server (spent a few days configuring those). The performance you get matches our experience. The load in our case was entirely in the PHP script, not the database and not in the MTA.
I suspect that the Interspire software is meant for very low-traffic newsletters (where receivers can be counted in the hundreds).
Interspire by default uses a single php process to process the email queue, and thus it's unable to use multi-core machines. There is a paid multi-processing script called MSH addon which takes IEM processing queue and distributes it across several processor cores for massive speed bonus. From the addons website:
MSH is built around "multi processing library", a multi platform
,shared nothing, multi processing framework inspired by Python
multiprocessing module(but very different from api level). It uses
"proc family functions" for process spawning and "soq" for IPC.
Disclaimer: I am one of the developers of MSH addon.
i have a very simple server running WAMP on a windows machine, with a php code who is a simple API for my clients that returns an XML. The things is that the hardware is very modest, and if a user calls the link to the API and hits F5 many times (calls the link repeatedly) the server performance goes down a little (response time goes up). Is there a way to limit the queries on port 80?
I know how to limit this in the the php code, but i think it is not good practice because even if you limit the queries on the php code the query is already made and I'm consuming resource checking with php if the user is making many queries.
Well, if you want to catch it before it reaches PHP, an Apache module would be one approach, e.g. mod_cband. Other than that, your firewall might help you, but I don't know if the default Windows one is up for that.
Other than that, handling it in your PHP code wouldn't be that bad. Yes, checking a DB consumes time, but it's still faster than collecting and returning XML.
Implement access control to the resources, keep track of active sessions and don't initiate heavy tasks while that particular user has a task open...?
We are experiencing this issue approximately once a month. It is very hard to pinpoint the cause so any help would be appreciated. This causes the App pool to stop and brings the site down. We have gone through all log files and have concluded nothing. We are using the 2.0.3 version on IIS 6.
I've noticed IIS defaults web apps on a 29-hour recycle schedule, which can be troublesome since it may recycle at times your users do not expect it to.
For example: web app starts at 12 am, which means the next day it recycles at 5am, the day after that at 10am, the day after that at 3pm, etc. (this is assuming there is enough request activity against your app to keep it alive so it does not shutdown due to inactivity)
If your web app relies heavily on in-memory session state this is especially bad because the recycle will kill sessions and possibly force users to re-authenticate and lose any unsaved work. (if you don't design your app to work seamlessly with recycling)
Check the recycle schedule and make sure it recycles at a time that you expect. See this for screenshots: http://remy.supertext.ch/2010/08/iis7-worker-process-reached-its-allowed-processing-time-limit/
Not sure about the infinite loop suggestion... sounds like you just have a recycling configuration issue to resolve.
This likely indicates an infinite loop in your application code.
Basically, every time a request comes into the web server, IIS hands the request off to a worker process. You can configure in IIS how many of those workers there are, and what the timeout value is. The timeout is to keep things moving in case the application code hangs -- it gets killed so the thread can go back in the pool to keep servicing new requests.
So look through your code for likely infinite loops. Or alternatively, it could be an extremely long-running database query that could have eventually finished but exceeded the timeout value. Perhaps your web application offers the end user an opportunity to make too broad of a query that returns too much data or requires too much DB processing time.
It's hard to give a specific cause for you, of course, but try to think along these lines.
If you're experiencing a crash as a result (sounds like you are) then you might want to grab a copy of Debugging Tools for Windows and spend some time reading Tess Ferrandez' blog--she offers great advice on performing post mortem crash analysis and makes WinDbg a whole lot more approachable.