Script for lazy-rendering in PhantomJS that sometimes locks phantomjs forever

I'm using phantomjs for rendering AnyChart v6 charts (http://6.anychart.com/) in PDF format.
An AnyChart v6 chart consists of an HTML file that loads an XML definition file through a JavaScript library and renders it to SVG.
The XML definition file is the result of complex on-the-fly processing, so the server can take up to a few minutes to deliver the XML file to the AnyChart JavaScript library.
My problem was forcing phantomjs to wait for the XML file, which led me to this "twitter.js" script:
https://gist.github.com/cjoudrey/1341747
It works perfectly, except that sometimes it locks phantomjs forever and the only way to go on is to kill the Linux process.
It's random behaviour; if I try the same URL again, it works.
The server log shows that the XML file was correctly delivered, so it's a client problem, not a server problem.
Can you see a race condition or something in the "twitter.js" code that can lead to a phantomjs lock in some situations?

Not 100% sure, but it may have something to do with http://phantomjs.org/api/webpage/handler/on-resource-received.html
If the resource is large and sent by the server in multiple chunks, onResourceReceived will be invoked for every chunk received by PhantomJS.
Notice "stage : “start”, “end” (FIXME: other value for intermediate chunk?)"
Even so, forcedRenderTimeout should force the page to render anyway.
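Whatever the root cause turns out to be, a pragmatic safety net is to wrap the phantomjs invocation in a hard timeout on the calling side, so a hung render can never block the pipeline forever. A minimal sketch in Python; the script name and arguments are placeholders for whatever render script you actually use:

import subprocess

def render_chart(url, output_pdf, hard_timeout=300):
    # Run phantomjs with a hard timeout so a hung render never blocks forever.
    # "rasterize.js" stands in for whatever render script you actually use
    # (e.g. the twitter.js-based one); adjust the arguments to match.
    cmd = ["phantomjs", "rasterize.js", url, output_pdf]
    try:
        completed = subprocess.run(cmd, timeout=hard_timeout)
        return completed.returncode == 0
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before re-raising, so no orphan is left.
        print("phantomjs hung on %s, killed after %ds" % (url, hard_timeout))
        return False

Since retrying the same URL usually works, the caller can simply retry once or twice when this returns False.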

Related

Somehow send command-line commands on Windows externally and get back the response

Problem: I need to convert local HTML (with local images etc.) to PDF from an AIX box running Universe 11.2.5 with System Builder.
Current solution: FTP the HTML file over to a Windows server, which converts in batches and sends the e-mail to the destination.
Proposed solution: Do everything on the AIX box, from converting the HTML to PDF to sending the e-mail.
Current problem: I am unable to find a way to convert local HTML to PDF on the AIX box. I have tried many different approaches, including installing Python 3, but to no avail.
The only really difficult part of the process is getting the HTML to render into a format that will properly display your HTML as pages suitable for printing. There is a fair amount of magic between HTTP GET and clicking print in a browser window that needs to be accounted for.
I was trying to accomplish something similar many moons ago on AIX but ran into a skill-level/time wall, because I was going to have to essentially create a headless browser to render the HTML. It looks like there are now some utilities you might be able to leverage. I found this recently updated question on Super User that actually got me somewhat excited, especially since I don't use AIX anymore, so precompiled binaries and well-understood, easily attainable dependencies are something I can actually have in my life.
https://superuser.com/questions/280552/how-can-i-render-a-website-as-an-image-from-the-shell
Good Luck.
There seem to be several questions rolled into this one item.
Converting HTML to PDF is, strictly speaking, just data manipulation that you could do in BASIC, but writing such code would be a large task. The option you use, sending it to another system, is valid, but it puts more points of failure into the system. I would think you could find code to do it on the AIX box.
Rocket plans on getting MV Python to work on AIX; this will make converting HTML to PDF much easier, since there are a lot of open-source modules.
As for my suggestion of using sockets, that would be if you intend to send it to a service that will take the HTML and return the PDF document.
i.e. Is there a web service for converting HTML to PDF?
Once you have the PDF document, you can either store it in a UniVerse type-19 file, or base64-encode it and store it in a UniVerse hash file.
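A minimal sketch of that web-service route in Python, once a Python is available on the box; the conversion endpoint and its request format are entirely hypothetical, so substitute whichever service you end up using. It returns the PDF base64-encoded, ready to store in a hash file:

import base64
import urllib.request

def html_to_pdf_base64(html, service_url):
    # POST the raw HTML to a (hypothetical) conversion service and
    # return the resulting PDF base64-encoded for storage.
    req = urllib.request.Request(
        service_url,
        data=html.encode("utf-8"),
        headers={"Content-Type": "text/html"},
    )
    with urllib.request.urlopen(req) as resp:
        pdf_bytes = resp.read()
    return base64.b64encode(pdf_bytes).decode("ascii")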
Hope this helps,
Mike

How to write a script that interacts with web browser and print content as PDF?

I'm looking to write an automated script that:
1) Opens a browser instance with a specific URL.
2) Prints the page as PDF to a pre-defined location and document name.
3) Simulates a click event on the web page that goes to the next report.
4) Repeats steps 2 and 3 a fixed number of times.
I'm not sure how to start doing this. I thought of using JavaScript, but it won't be able to automate the printing step.
I have no control of the server, so I cannot query it for the collection of reports.
The reason for the script is that there are many such reports, and the server can be very slow at times, so it would be better to have them locally.
UPDATE: Forgot to mention that a login is required for the server.
I think scripting an off-the-shelf browser is very much the Hard Way to solve your problem. If you can at all predict the URLs for the individual reports, use a command-line tool such as wget or curl to download them, and then look at this community wiki for rendering the downloaded HTML as PDF.
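If the report URLs really are predictable, a minimal sketch of that approach in Python; the URL pattern, the session cookie, and the wkhtmltopdf step are all assumptions you would swap for your own details:

import subprocess
import urllib.request

BASE = "https://example.com/reports/report-%d"   # hypothetical URL pattern

def fetch_and_convert(n, cookie):
    # Download one report, reusing the logged-in session cookie,
    # then hand the saved HTML to an external renderer.
    req = urllib.request.Request(BASE % n, headers={"Cookie": cookie})
    html = urllib.request.urlopen(req).read()
    html_path = "report-%d.html" % n
    with open(html_path, "wb") as f:
        f.write(html)
    subprocess.run(["wkhtmltopdf", html_path, "report-%d.pdf" % n], check=True)

for i in range(1, 11):                            # the "fixed number of times"
    fetch_and_convert(i, cookie="session=PASTE_YOUR_SESSION_COOKIE_HERE")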
Or do you even need to go to PDF? If all you're interested in is having the reports available locally, why not keep them as HTML and view them in a browser (with a file: URL) rather than a PDF viewer?

How can I determine if files in a "drop folder" are completely transferred

Remote clients will upload images (and perhaps some instructional files in specially formatted text) to a "drop folder." Once the upload is complete we need to begin processing these images. It would be an easy, but flawed, solution to just have a script automatically begin processing any files in the folder every few seconds (the files can be moved out of the folder once processed); but problems would arise when attempting to process large images that are only partially transferred.
What are some tricks I can use to ensure the files are fully uploaded before processing them?
A few of my own thoughts:
The script can check the validity of the file; i.e., a partial JPEG would result in an error, and you could respond to that error in the script (a sketch of this follows below). This would be fairly CPU-intensive, though. Some files have special markers at the end, but I can't count on this; I'm not sure what formats I'll be dealing with.
I've heard of "file handles" but haven't really figured out the basics of what they are and how I can tell if there is a "file handle" on a particular file. Basically the FTP daemon (actually, I'm on Windows, so "service") would keep a "handle" on the file while it's being uploaded, and you would know not to process that file. These are just a few of my thoughts, but I'm not really sure if they will work or if there are better or more accepted ways of solving this problem.
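A minimal sketch of the validity-check idea in Python, assuming Pillow is available and the uploads are image formats it can decode (the helper name is made up):

from PIL import Image

def looks_complete(path):
    # A truncated upload will usually fail to decode all the way through.
    try:
        with Image.open(path) as im:
            im.load()            # force a full decode, not just the header
        return True
    except Exception:
        return False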
If you have a server-side upload script (PHP, ASP, JSP, whatever), you could have it call another script to process the files, or have it create a flag file indicating that the upload is done.
If your server is Linux-based, you can use lsof to check whether the file is open. Since your FTP daemon/script/CGI will close the file after the upload completes, lsof will not show the file in its list.
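A minimal sketch of that check from Python, assuming lsof is on the PATH (lsof exits non-zero when no process has the file open):

import subprocess

def file_is_open(path):
    # Returns True while some process (e.g. the FTP daemon) still has the
    # file open, i.e. the upload is probably still in progress.
    result = subprocess.run(
        ["lsof", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

Only hand a file to the processing step once file_is_open() has returned False, optionally checking twice a few seconds apart to be safe.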
If your server is Windows-based, you can use Process Explorer to list the open files.
By what method are your users uploading the images?

What to do if I have a CGI that runs for several minutes before outputting data, and Apache times it out?

I have a CGI script that takes a really long time to execute. Long story short, it needs to process a lot of data, run a bunch of slow commands, and make some slow web queries, during which time it doesn't output anything, and when it's done, it finally prints its results out in JSON format. It takes several minutes to run, which is longer than the Timeout directive set in my Apache web server's httpd.conf.
I am not at liberty to change that Timeout value globally for everyone on the entire server. I thought of maybe overriding that in a per-directory basis using a .htaccess file, but it looks like the Timeout directive is not in .htaccess context, so that cannot be done. From what I understand, my script must continually output data, and if it doesn't output data for the Timeout number of seconds, Apache gives up.
I am getting the following error in Apache: (70007)The timeout specified has expired: ap_content_length_filter: apr_bucket_read() failed
What can I do?
Well, to offer the stupidly simple solution, why not just make the script occasionally produce some output while it's working? You could just print "Processing..." every few steps, or if you want to be more creative, have it print some status updates to indicate what it's doing. Or if you're worried about getting bored, print out a funny poem a line at a time. (Kind of reminds me of http://pages.cs.wisc.edu/~veeve/404.html)
If you don't want to do that, the next thing that comes to my mind is to use asynchronous processing. Basically, you'll have to spawn a separate process from the CGI script, and do the lengthy processing in that separate process. The main CGI script itself just outputs a simple HTML page that says the process is working and then exits. That HTML page would also have to contain some logic for periodically checking to see whether the background process on the server has finished. It could be a <meta http-equiv="refresh" ...> HTML element, or you could use AJAX.
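A minimal sketch of that spawn-and-poll idea as a Python CGI script; the worker script path, the job directory, and check_job.cgi are all hypothetical names:

#!/usr/bin/env python3
import subprocess
import sys
import uuid

job_id = uuid.uuid4().hex

# Kick off the slow work in a separate process; the worker is expected to
# write its JSON result to /var/tmp/jobs/<job_id>.json when it finishes.
subprocess.Popen(
    ["/usr/local/bin/long_job.py", job_id],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
    start_new_session=True,   # detach from the CGI's process group
)

# Return immediately with a page that keeps polling a status script.
sys.stdout.write("Content-Type: text/html\r\n\r\n")
sys.stdout.write(
    '<html><head>'
    '<meta http-equiv="refresh" content="10;url=check_job.cgi?id=%s">'
    '</head><body>Processing... this page will refresh until the job is done.'
    '</body></html>' % job_id
)

Depending on how Apache reaps CGI children, the worker may need to fully daemonize to be sure it survives the parent exiting.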
I came up with a solution.
I start by outputting a dummy HTTP header, like Dummy: ...; I can put whatever data I want in the value of that header, and it doesn't affect the rest of the output. I output a character to that dummy value every minute or so, which prevents the timeout. When I am ready, I print a line break and continue with the rest of my (real) HTTP headers and the content of the document.
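A sketch of that trick as a Python CGI script, assuming the server passes unbuffered CGI output straight through (no mod_deflate or similar buffering in between); run_the_slow_job() is a stand-in for the real work:

#!/usr/bin/env python3
import json
import sys
import threading
import time

def run_the_slow_job():
    # Stand-in for the real multi-minute processing.
    time.sleep(180)
    return json.dumps({"status": "done"})

sys.stdout.write("Dummy: ")
sys.stdout.flush()

done = threading.Event()

def keep_alive():
    # One byte per minute on the dummy header line resets Apache's timeout.
    while not done.wait(60):
        sys.stdout.write("x")
        sys.stdout.flush()

t = threading.Thread(target=keep_alive, daemon=True)
t.start()

result = run_the_slow_job()
done.set()
t.join()                                                  # no stray writes after this

sys.stdout.write("\r\n")                                  # end the dummy header line
sys.stdout.write("Content-Type: application/json\r\n\r\n")
sys.stdout.write(result)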
A very pragmatic approach would be to start a background job and e-mail the response to the client. Ten to one they'd prefer that to having a browser window open all afternoon.

How can I speed up batch processing job in Coldfusion?

Every once in a while I am fed a large data file that my client uploads and that needs to be processed through CFML. The problem is that if I put the processing on a CF page, it runs into a timeout issue after 120 seconds. I was able to move the processing code to a CFC, where it seems not to have the timeout issue. However, at some point during the processing it causes ColdFusion to crash and it has to be restarted. There are a number of database queries (5 or more, a mixture of updates and selects) required for each of the 8,000+ lines of the file, as well as other logic provided by me in the form of CFML.
My question is: what would be the best way to go through this file? One caveat: I am not able to move the file to the database server and process it entirely with the DB. However, would it be more efficient to pass each line to a stored procedure that takes care of everything? It would still be a lot of calls to the database, but nothing compared to what I have now. Also, what would be the best way to provide feedback to the user about how much of the file has been processed?
Edit:
I'm running CF 6.1
I just did a similar thing and use CF often for data parsing.
1) Maintain a file upload table (Parent table). For every file you upload you should be able to keep a list of each file and what status it is in (uploaded, processed, unprocessed)
2) Temp table to store all the rows of the data file (child table). Import the entire data file into a temporary table; attempting to do it all in memory will inevitably lead to errors. Each row in this table links to a file upload table entry above.
3) Maintain a processing status - For each row of the datafile you bring in, set a "process/unprocessed" tag. This way if it breaks, you can start from where you left off. As you run through each line, set it to be "processed".
4) Transaction - use cftransaction if possible to commit all of it at once, or at least one line at a time (with your 5 queries). That way if something goes boom, you don't have one row of data that is half computed/processed/updated/tested.
5) Once you're done processing, set the file name entry in the table in step 1 to be "processed"
By using the approach above, if something fails you can restart from where it left off, or at least have a clearer idea of where to start investigating, or, at worst, clean up your data. You will also have a clear way of displaying to the user the status of the current upload: how far it has got, and where it stopped if there was an error.
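A rough sketch of that bookkeeping, written here in Python against SQLite purely to show the table layout and the status updates; in practice this would be CFML against your real database, with your five per-line queries inside the loop:

import sqlite3

db = sqlite3.connect("imports.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS file_upload (
    id       INTEGER PRIMARY KEY,
    filename TEXT,
    status   TEXT DEFAULT 'uploaded'        -- uploaded / processed
);
CREATE TABLE IF NOT EXISTS file_row (
    id        INTEGER PRIMARY KEY,
    upload_id INTEGER REFERENCES file_upload(id),
    line      TEXT,
    processed INTEGER DEFAULT 0
);
""")

def import_file(path):
    # Steps 1 and 2: register the file, then bulk-load its lines.
    cur = db.execute("INSERT INTO file_upload (filename) VALUES (?)", (path,))
    upload_id = cur.lastrowid
    with open(path) as f:
        db.executemany(
            "INSERT INTO file_row (upload_id, line) VALUES (?, ?)",
            ((upload_id, line.rstrip("\n")) for line in f))
    db.commit()
    return upload_id

def process(upload_id):
    # Steps 3-5: only touch unprocessed rows, one transaction per row.
    rows = db.execute(
        "SELECT id, line FROM file_row WHERE upload_id = ? AND processed = 0",
        (upload_id,)).fetchall()
    for row_id, line in rows:
        with db:
            # ... the real per-line queries/updates go here ...
            db.execute("UPDATE file_row SET processed = 1 WHERE id = ?", (row_id,))
    db.execute("UPDATE file_upload SET status = 'processed' WHERE id = ?", (upload_id,))
    db.commit()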
If you have any questions, let me know.
Other thoughts:
You can increase timeouts, give the VM more memory, or move to 64-bit, but all of those only increase the capacity of your system so much. It's a good idea to set these per call and to do it in conjunction with the above.
Java has some neat file-processing libraries that are available as CFCs. If you run into a lot of issues with speed, you can use one of those to read the file into a variable and then into the database.
If you are playing with XML, do not use ColdFusion's XML parsing. It works well for smaller files but has fits when things get bigger. There are several CFCs out there (check RIAForge, etc.) that wrap some excellent Java libraries for parsing XML data. You can then build a cfquery manually with that data if need be.
It's hard to tell without more info, but from what you have said I'll throw out three ideas.
The first thing is that, with so many database operations, it's possible you are generating too much debugging output. Make sure that the following settings are turned off under Debug Output Settings in the administrator:
Enable Robust Exception Information
Enable AJAX Debug Log Window
Request Debugging Output
The second thing I would do is look at those DB queries and make sure they are optimized. Make sure selects are using indices, etc.
The third thing I would suspect is that holding the whole file in memory is probably suboptimal.
I would try looping through the file using file looping:
<cfloop file="#VARIABLES.filePath#" index="VARIABLES.line">
<!--- Code to go here --->
</cfloop>
Have you tried an event gateway? I believe those threads are not subject to the same timeout settings as page request threads.
SQL Server Integration Services (SSIS) is the recommended tool for complex ETL (Extract, Transform, and Load) work, which is what this sounds like. (It can be configured to access files on other servers.) The question might be, can you work up an interface between Cold Fusion and SSIS?
If you can, upgrade to CF8 and take advantage of cfloop file="", which gives you greater speed and does not load the whole file into memory (which is probably the cause of the crashing).
Depending on the situation you are encountering you could also use cfthread to speed up processing.
Currently, an event gateway is the only way to get around the timeout limits of an HTTP request cycle. CF does not have a way to process CF pages offline; that is, there is no command-line invocation (one of my biggest gripes about CF: very little offline processing).
Your best bet is to use an Event Gateway or rewrite your parsing logic in straight Java.
I had to do the same thing. Ben Nadel has written a bunch of great articles on using Java file I/O to read and write files more quickly.
It really helped improve the performance of our CSV importing application.