Download Source Code of a website - pentaho

I would like to download the HTML source code of a web page. Can I do this with the HTTP Client step?
And in that case, do I have to use Generate Rows first?
I am using Pentaho Data Integration 6, thanks.

To download the HTML from a web page you should indeed use the HTTP Client step. From the documentation's FAQ:
Q: The HTTP client step doesn't do anything, how do I make it work?
A: The HTTP client step needs to be triggered. Use a Row generator step generating e.g. 1 empty row and link that with a hop to the HTTP client step.
So you need to have rows first. For instance, use Generate Rows or Data Grid with the URLs you want to fetch.
If you then set the URL of the web page you want the HTML for in the HTTP Client step, the HTML will be put in the result field.
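For intuition only: per input row, the HTTP Client step does the equivalent of a plain HTTP GET and stores the response body in the result field. A minimal Python sketch of that same operation outside PDI (the requests library and the URL are illustrative assumptions, not part of PDI):

import requests  # pip install requests

url = "http://www.example.com/"  # placeholder for the page whose source you want
html = requests.get(url, timeout=30).text  # the step stores this in the result field
print(html[:500])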

Related

How to run multiple HTTP requests with a single click in an IntelliJ .http file instead of running them individually

I have 10 HTTP requests in an .http file and want to run them with a single click instead of clicking each request and viewing its output individually. Subsequent requests use output from previous requests, so I want to run them serially in an automated way.
At the moment, IDEA's text-based HTTP client supports variable substitution and allows you to write some simple JavaScript to access the response.
All you have to do is create an .http file with all your requests defined in a sequence. Then, after each request, you can add a JavaScript block to fill the variables that the next request(s) require. Example:
### getting json and setting variable retrieved from response body
GET http://httpbin.org/json
> {%
client.global.set("title",response.body["slideshow"]["title"])
%}
### making request using previously set variable as a body
POST http://httpbin.org/anything
Content-Type: text/plain

{{title}}
The next step is to run all the requests at once using the "Run all requests in file" action at the top of the editor.
You could use "Run/Debug Configurations", create a few separated files and setup order in window. But you'll get a big task, you cannot wait and setup timeouts for requests.
I guess you should use Jmeter for real work.
You also could try setup simple curl requests in the Linux bases os.

Flowgear endpoint not working

Trying to test a sample endpoint for a workflow that I configured as follows:
POST https://mycompany.flowgear.net/bizrules/validation/gstCheck/?name={businessName}&number={businessNumber}&date={startDate}&canID={candidateID}&pID={placementID}
I tested that endpoint in a browser with the URL below, but it returns JSON saying "There is no service at this location":
https://mycompany.flowgear.net/bizrules/validation/gstCheck/?name=ZV Consulting Inc.&number=83848 5183&date=09/02/2014&canID=309731&pID=3835
What am I doing wrong?
You're binding to POST, so you can't open the URL in a browser, because the browser performs a GET. To test it you'd need to use the Postman plugin and set the method to POST.
One thing to note, though, is that Postman makes cross-origin requests, so you need to set Allowed Origins to * in your Flowgear site detail screen (the same place you set your vanity domain).
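If you prefer testing from code instead of Postman, here is a minimal sketch using Python's requests library (the endpoint and parameter values are copied from the question; requests also takes care of URL-encoding the spaces and dots for you):

import requests  # pip install requests

resp = requests.post(
    "https://mycompany.flowgear.net/bizrules/validation/gstCheck/",
    params={
        "name": "ZV Consulting Inc.",
        "number": "83848 5183",
        "date": "09/02/2014",
        "canID": "309731",
        "pID": "3835",
    },
    timeout=30,
)
print(resp.status_code, resp.text)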

Advanced Scrapy middleware use

I want to develop several middlewares to make sure websites get parsed.
This is the workflow I have in mind:
1. First, try with TOR + Polipo.
2. If 2 HTTP errors occur, try without TOR (so the website knows my IP).
3. If 2 HTTP errors occur, try with a proxy (using one of my other servers to make the HTTP request).
4. If 2 HTTP errors occur, try with a random proxy (from a list of 100). Repeat this 5 times.
5. If none of these works, save the information to an ElasticSearch database, to review in my control panel.
I'll create a custom middleware with a process_request function that contains all 5 of these methods. But I can't figure out how to save the type of connection (for example, if TOR doesn't work but a direct connection does, I want to use that setting for all my other scrapes of the same website). How can I save these settings?
One other thing: I have a pipeline that downloads the images of items. Is there a way to use this middleware (ideally with the saved settings) for it as well?
Thanks in advance for your help.
I think you could use Scrapy's built-in retry middleware as a starting point:
You could use request.meta["proxy_method"] to keep track of which connection method you are currently using.
You could reuse request.meta["retry_times"] to track how many times you have retried a given method, and reset it to zero whenever you change the proxy method.
You could use request.meta["proxy"] to route the request through the proxy server you want via the existing HTTP proxy middleware. You may want to tweak the middleware ordering so that the retry middleware runs before the proxy middleware. A sketch combining these ideas follows below.

Capture the URL of JavaScript-initiated HTTP GET/POST requests using Selenium

I'm using Selenium to automate a procedure I frequently perform on a site.
When I click a specific element on the site, it runs some complex JavaScript code that eventually downloads a CSV file using an HTTP GET request.
I can see that the URL of this request looks like: www.somesite.com/somepage.php?token=RAPO09834HROLQ340HGIE309W&....
My question is: how can I get the token in this URL with Selenium? (I need it to execute other HTTP GET requests that extract more data from the site.)
I am using the Firefox driver on Windows.
I tried searching all the HTML, JS, and cookies I get from this site, but the token is not there. (It's probably generated by the JavaScript code just before it makes the HTTP request.)
I understand this is some kind of session ID token, as all the JavaScript-generated HTTP requests use the same token during my session.
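One possible approach (an assumption on my part, not from the original thread) is to record the browser's network traffic with the third-party selenium-wire package, which wraps the regular Selenium driver and captures outgoing requests. A minimal sketch, assuming the token shows up as a token= query parameter on somepage.php:

from urllib.parse import parse_qs, urlparse

from seleniumwire import webdriver  # pip install selenium-wire

driver = webdriver.Firefox()
driver.get("https://www.somesite.com/")
# ... click the element that triggers the JavaScript-initiated download ...

token = None
for req in driver.requests:  # selenium-wire records every captured request
    if "somepage.php" in req.url and "token=" in req.url:
        token = parse_qs(urlparse(req.url).query)["token"][0]
        break
print(token)  # reuse the token in your own GET requests
driver.quit()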

Browser performs a request instead of showing data URI

My Apache server logged a request for a data URI in its access log.
/data:image/png%3bbase64,iVBORw0KGgoAAAANSUhEUgAAAAQAAAAECAMAAACeL25MAAAABlBMVEUzlME6qNuT3ZmEAAAAE0lEQVQI12NgZGRkYABiMAQzGQEAjAANk73rMwAAAABJRU5ErkJggg==
Apparently some browser did not understand the data URI and performed an HTTP request for it instead.
How to solve it?
Use a feature detector on the client side (for example, Modernizr) and check on document load whether data URIs are supported. If they are not, replace all such URLs with, for example, the path to a blank image.
In addition, you could simply block data URIs in your firewall or on your front-end server.
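As a sketch of that server-side fallback in Python (assuming a WSGI application; /static/blank.png is a hypothetical placeholder for a real blank image on your site):

def block_data_uris(app):
    # WSGI middleware: intercept mis-requested data-URI paths before the app.
    def middleware(environ, start_response):
        if environ.get("PATH_INFO", "").startswith("/data:"):
            # Send the misbehaving browser a real blank image instead of a 404.
            start_response("302 Found", [("Location", "/static/blank.png")])
            return [b""]
        return app(environ, start_response)
    return middleware

You would wrap your WSGI app with block_data_uris(app) in whatever server hosts it (gunicorn, mod_wsgi, etc.); the same idea can be expressed as a rewrite rule in Apache or nginx.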