How to save PDF from HTML in Azure Functions - selenium

I'm developing an application which will have a web crawler for some sites.
The application will trigger a Azure Function by URL where the crawler will start the work.
So far, so good, but, we'll have to save some evidence that the crawler passed though the site. We're thinking of save a PDF file with the screen that the crawler passed, but, as Azure Functions doesn't have GDI+, it won't work with Selenium or PhantomJS.
One different approach can be download the HTML content and somehow save this HTML string (with all the JS and CSS dependency) into a PDF file.
i'd like of some library which can work with Azure Functions to make the screenshot of some URL (or HTML string) and save to PDF.
Thanks.

Unfortunately the App Service Sandbox whose rules Azure Functions live by is going to block most GDI+ API calls. We have had success with one third party library (ByteScout) for some PDF generation needs but I think in your case that type of operation is explicitly blocked. You can find out more details here https://github.com/projectkudu/kudu/wiki/Azure-Web-App-sandbox#win32ksys-user32gdi32-restrictions
There is no workaround that I'm aware of because at the end of the day most of these solutions are relying on GDI+ in the underlying OS (directly or indirectly).
Your only real option is to offload that workload to virtual machine without the restriction on the API.That could take the form of a dedicated VM or something like an Azure Container Instance whose life-cycle you can manage more dynamically as needed. We do something similar today where we have a message queue being monitored on a VM and our azure function drops the request into the queue for processing.

Related

Asynchronous Pluggable Protocol for CID: (email), how to handle duplicate URLs

This is somewhat a duplicate of this question, but that question has no (valid) answer and is 1.5 years old so asking my own with hopes people have more info now.
If you are using multiple instances of a WebBrowser control, MSHTML, IHTMLDocument, or whatever... from inside the APP instance, mostly IInternetProtocol::Start, is there a way to know which instance is loading the resource? Or is there a way to use a different APP for each instance of the control, maybe by providing one via IDocHostUIHandler or ICustomDoc or otherwise? I'm currently using IInternetSession::RegisterNameSpace to make it process wide.
Optional reading below, don't feel you need to read it unless above isn't clear.
I'm working on a legacy (Win32 C++) email client that uses the MS ActiveX WebBrowser control (MSHTML or other names it goes by) to display HTML emails. It was saving everything to temp files, updating the cid: URLs, and then having the control load that. Now I want to do it the correct way, using APP. I've got it all working with some test code that just uses static variables/globals and loads one email.
My problem now is, the app might have several instances of the control all loading different emails (and other stuff) at the same time... not really multiple threads so much, just the asynchronous nature of the control. I can give each instance of the control a unique URL to load the email, say, cid:email-GUID, and then in my APP code I can use that URL to know which email to load. However, when it comes to loading any content inside the email, like attached images using src="cid:", those will not always be unique so I will not always know which image it is, for which email. I'd like to avoid having to modify the URLs of the HTML before displaying it (I'm doing that now for the temp file thing, but want to do it a better way).
IInternetBindInfo::GetBindString can return the referrer, BINDSTRING_XDR_ORIGIN, or the root URL, BINDSTRING_ROOTDOC_URL, but those require newer versions of IE and my legacy app must support older XP installs that might even have IE6 or IE7, so I'd rather not use these.
Tagged as TWebBrowser because that is actually what I'm using (Borland Builder 6 C++), but don't need answers specific to that platform.
As the Asynchronous Pluggable Protocol Handler us very low level, you cannot attach handlers individually to different rendering controls.
Here is a way to get the referrer:
Obtain BINDSTRING_HEADERS
Extract the referrer by parsing the line Referer: http://....
See also How can I add an extra http header using IHTTPNegotiate?
Here is another crazy way:
Create another Asynchronous Pluggable Protocol Handler by calling RegisterMimeFilter.
Monitor text/plain and text/html
Scan the incoming email source (content comes incrementally) and parse store all image links in a dictionary
In NameSpaceHandler you can use this dictionary to find the reference of any image resources.

Remove A URL Scheme Handler from Launch Services

I am developing a Cocoa Mac app which dynamically generates and registers itself for URL schemes. However, when the application registers itself to handle a newly generated URL scheme (e.g. myscheme1423://), I would like to prevent the application from responding to any previously registered URL schemes.
I am using LSSetDefaultHandlerForURLScheme() for the purpose of registering a URL scheme; in conjunction, the application automatically overwrites it's Info.plist to contain the new scheme. As you may know, the LSSetDefaultHandlerForURLScheme() function adds the given bundleID/scheme to a Launch Services database. However, I couldn't find an equivalent Launch Services function to remove the same bundleID/scheme pair from the database.
I know that I could simply ignore any external events which originated from a URL scheme other than the one for which the app is actively registered, but it feels to me that there should be a simple way to completely wipe out the system's knowledge of the previous scheme. If my application goes through the process of registering for a new scheme more than a few hundred times, a point will come where a significant amount of space (for a Plist, at least) is being taken up on disk by a plethora of pointless pieces of data (i.e. the old Launch Services entries).
I just fired up a playground and began playing. This is utterly undocumented but it appears to work.
Try passing ("None" as CFString) for the second parameter of
LSSetDefaultHandlerForURLScheme()

Tracking full browser activity in VB.Net

There's a website that contains some real time information. I want to have a VB.Net windows application that monitors the page, and when it detects certain events it triggers some actions based on the data in the page.
I've been searching like crazy for some mechanism to "hook" into the browser and hopefully inspect the messages transmitted for the application to know how to react.
I've seen the SHDocVw COM object, which comes very close. But when I use the BeforeNavigate2 event, it only seems to fire for GETs, and once I'm on the page where the information is displayed/refreshed the event is not raised.
Short of reverse engineering the page, or having to write some kind of proxy...is there a good way to do this in VB.Net?
Here's a method you can try:
Create a GreaseMonkey script embedded on the page to hook up events.
When changes occur, collect the changes in an array or display them on the screen.
Create a VB.Net webpage service to listen for POST requests.
Post using AJAX from your GreaseMonkey script to the webpage service, which then writes to a log file/database on your server.
Otherwise the proxy would probably be the next best method, providing the server isn't HTTPS.

How to efficiently deploy content types to a Content Type Hub

I have set up a Content Type Hub and tested the syndication is working correctly by creating a test content type and watching it be published to the client site.
Then I deployed the content types I am actually interested in publishing to the hub (by way of a feature) along with the site columns they depend on.
I get the error
Content type '...' cannot be published to this site because feature '...' is not enabled.
I want to deploy content types with features for upgradability and ease of porting between dev, qual and prod environments. But am left not understanding what the benefit of the Hub is.
If I have to activate the deploying feature, the content types will already be on the site before publishing will take place. If I have to manually create the content types on the Hub site with the web UI (yuck!), I have the issue of trying to keep three landscapes manually synchronized.
Is there a way to efficiently manage content type deployment to the Hub while still using the Hub to publish the content types?
The advantage of using the Content Type Hub, is that it allows you to use and reuse your Content Types over multiple site collections and Web Applications throughout your farm.
Because all of your site collections are now using instances of the same syndicated content types, if, in the future, you need to add/remove/rename columns within the content types, this is done as easily as updating the content type, and resubscribing (then allowing sharepoint to run its timer jobs, and double checking that the changes updated because you're a careful SharePoint administrator).
I am not sure which error you are receiving, there simply isn't enough context in your post. However, I think you may be slightly confused on how syndicated content types are published. First, you turn on the content syndication hub publishing feature on the site collection that holds all of the content types you are going to reuse throughout your farm. Next you configure the mixed metadata service, so that SharePoint loads each of your content types "into memory" more or less.
After this step, you get to choose which site collections you want to subscribe to the syndication hub. To do this, you need to turn on the content type publishing site collection feature. Note: If you use blank templates for your sites you may receive a feature error like you've described, due to a "flaw" with blank templates. See my post at: http://www.thesharepointblog.net/Lists/Posts/Post.aspx?ID=109
Only AFTER you've turned on the subscribing feature, And content Type Hub timer job has run, AND the subscribing timer job has run, will your site collection see the available content types.
As for manually creating content types on the hub site, the only OOB way of doing this is to use the UI. Personally, I wrote a utility that does everything I just described for me, from creating the initial content types, to creating the syndication hub, publishing them to all of the site collections, and most time consumingly, associating them with all of the lists and libraries on the subscribing site collections. I had intended for my employing company to sell it, but as they don't seem interested, I could open source it if there is enough interest.
Hope this was helpful.
This looks like a shortcoming of the hub, indeed.
I've witnessed it before.
If you've deployed your content type to the hub, please check if the INHERITS tag of the content type element is set to TRUE. Otherwise it won't work in a hub.
<ContentType ID="xxxxx"
Name="xxxx"
Group="xxxx"
Description="xxxx"
Inherits="TRUE"
Version="0">
</ContentType>
Don't forget that you can actually synchronize the content types BETWEEN farms -- this is especially valuable when you are developing on a separate farm and don't want to hassle with a PnP Framework for managing your content types... In some cases, the Content Type may already exist on the production farm and you need a copy of them on dev and/or test..

Google Gears hangs when capturing (unmanaged resource store)

Our code was written based on the example from Google Gears' own docs. We're using an unmanaged resource store. So we declare the files in an array, create the store, and capture all the files.
Trouble is, the capturing process hangs. It always hangs on a random file (no discernible pattern has emerged), and when you reload the page, it always successfully captures.
We're capturing 48 files. It seems to have nothing to do with the files themselves, as it hangs on every file type. I've seen it hang on the 6th file or the 47th. Windows and Mac. FF, IE, and Safari.
We are not using a WorkerPool, and I'm thinking this may be necessary. Any other ideas why it would hang?
I discovered that the problem was in the scope of the variable. The code we had used from Google's own example had the creation of the store, and the capture happening in separate functions, and since we were downloading so many files along the way, the object was getting destroyed by the browser's own trash collector.
That's why the callback was producing no errors, and it was instead just hanging.