I am trying to scrape data from this link https://www.flatstats.co.uk/racing-system-builder.php using scrapy.
I want to automate the AJAX call using Scrapy.
When I click the "Full SP" button (inspecting the request in Firebug), the POST parameters contain an SQL-like string which looks "strange":
race|2|eq|Ordinary|0|~tRIDER_TYPE
What dialect is this?
My code:
import scrapy
import urllib

class FlatStat(scrapy.Spider):
    name = "flatstat"
    allowed_domains = ["flatstats.co.uk"]
    start_urls = ["https://www.flatstats.co.uk/racing-system-builder.php"]

    def parse(self, response):
        query_lst = response.xpath('//table[@id="system"]//tr/td[last()]/text()').extract()
        query_str = ' '.join(query_lst)

        url = 'https://www.flatstats.co.uk/ajax/sb_report.php'
        body_dict = {'a_e_max': '9.99',
                     'a_e_min': '0',
                     'arch_min': '0',
                     'exp_min': '0',
                     'report_type': 'S',
                     # copied from the POST parameters by inspecting. Actually I tried everything.
                     'sqlFullString': u'''Type%20(Rider)%7C%3D%7COrdinary%20(Exclude%20Amatr%2C%20App%2C%20Lady%20Races
)%7CAND%7Crace%7C2%7C0%7COrdinary%7C0%7C~tRIDER_TYPE%7C-t%7Ceq''',
                     # I tried copying this from the POST parameters as well but no success.
                     # I also tried the sql from the table //td text(), which is "normal" sql, but no success.
                     'sqlString': query_str}

        # here I tried everything, FormRequest as well, though there is no form.
        return scrapy.Request(url, method="POST", body=urllib.urlencode(body_dict), callback=self.parse_page)

    def parse_page(self, response):
        with open("response.html", "w") as f:
            f.write(response.body)
So my questions are:
What is this SQL?
Why isn't it returning the required page? How can I run the right query?
I tried Selenium as well, to click the button and let it do the work itself, but that is another unsuccessful story. :(
It's not easy to say what the website creator is doing with the submitted sqlString. It probably means something very specific to how the data is processed by their backend.
This is an extract of the page JavaScript in-HTML code:
...
function system_report(type) {
    sqlString = '', sqlFullString = '', rowcount = 0;
    $('#system tr').each(function() {
        if (rowcount > 0) {
            var editdata = this.cells[6].innerHTML.split("|");
            sqlString += editdata[0] + '|' + editdata[1] + '|' + editdata[7] + '|' + editdata[3] + '|' + editdata[4] + '|' + editdata[5] + '^';
            sqlFullString += this.cells[0].innerHTML + '|' + encodeURIComponent(this.cells[1].innerHTML) + '|' + this.cells[2].innerHTML + '|' + this.cells[3].innerHTML + '|' + this.cells[6].innerHTML + '^';
        }
        rowcount++;
    });
    sqlString = sqlString.slice(0, -1)
...
It looks non-trivial to reverse-engineer.
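If you do want to rebuild sqlString inside your spider, here is a rough sketch of the same reordering in Python. It assumes the last <td> of each row holds the same pipe-separated edit data that the script reads from this.cells[6], which is what your //td[last()] XPath already extracts; that assumption is untested.

# A minimal sketch, assuming each row's last <td> contains the pipe-separated
# "edit data" that the page script reads from cells[6].
def build_sql_string(response):
    rows = response.xpath('//table[@id="system"]//tr/td[last()]/text()').extract()
    parts = []
    for row in rows:
        editdata = row.split('|')
        if len(editdata) < 8:
            continue  # skip the header row or anything unexpected
        # same field order as the JavaScript: 0, 1, 7, 3, 4, 5
        parts.append('|'.join([editdata[0], editdata[1], editdata[7],
                               editdata[3], editdata[4], editdata[5]]))
    # the JavaScript joins rows with '^' and strips the trailing separator
    return '^'.join(parts)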
Although it's not a solution to your "sql" question above, I suggest that you try using Splash (an alternative to Selenium in some cases).
You can launch it with docker (the easiest way):
$ sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash
With the following script:
function main(splash)
    local url = splash.args.url
    assert(splash:go(url))
    assert(splash:wait(0.5))
    -- this clicks the "Full SP" button
    assert(splash:runjs("$('#b-full-report').click()"))
    -- loading the report takes some time
    assert(splash:wait(5))
    return {
        html = splash:html()
    }
end
you can get the page HTML including the report popup.
You can integrate Splash with Scrapy using scrapyjs (a.k.a. scrapy-splash).
See https://stackoverflow.com/a/35851072/ for an example of how to do so with a custom script.
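A minimal sketch of what that integration could look like, assuming scrapy-splash is installed and configured in settings.py (SPLASH_URL plus the splash downloader middlewares from its README), with the Lua script above embedded as a string:

import scrapy
from scrapy_splash import SplashRequest

lua_script = """
function main(splash)
    local url = splash.args.url
    assert(splash:go(url))
    assert(splash:wait(0.5))
    assert(splash:runjs("$('#b-full-report').click()"))
    assert(splash:wait(5))
    return {html = splash:html()}
end
"""

class FlatStatSplashSpider(scrapy.Spider):
    name = "flatstat_splash"
    start_urls = ["https://www.flatstats.co.uk/racing-system-builder.php"]

    def start_requests(self):
        for url in self.start_urls:
            # the 'execute' endpoint runs the custom Lua script in Splash
            yield SplashRequest(url, self.parse_report,
                                endpoint='execute',
                                args={'lua_source': lua_script})

    def parse_report(self, response):
        # the table returned by the Lua script is exposed as response.data
        html = response.data['html']
        with open("response.html", "wb") as f:
            f.write(html.encode('utf-8'))

The spider name and the file handling here are placeholders; the important parts are the 'execute' endpoint and passing the script via lua_source.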
Related
I know: automatic token refreshing is not a new topic.
This is the use case that generates my problem: let's say we want to extract data from Dropbox. Below you can find the code. The first time it works perfectly: 1) the user goes to the generated link; 2) after allowing the app, they copy and paste the authorization code into the input box.
The problem arises when, some hours later, the user wants to do the same operation. How can I avoid or bypass generating a new authorization code and go straight to the operation?
As you can see in the code, within a short period it is possible to re-inject the auth code (see the commented-out line). But after an hour or more this is no longer possible.
Any help is welcome.
#!/usr/bin/env python3
import dropbox
import pandas as pd
from dropbox import DropboxOAuth2FlowNoRedirect

'''
Populate your app key in order to run this locally
'''
APP_KEY = ""

auth_flow = DropboxOAuth2FlowNoRedirect(APP_KEY, use_pkce=True, token_access_type='offline')

target = '/DVR/DVR/'

authorize_url = auth_flow.start()
print("1. Go to: " + authorize_url)
print("2. Click \"Allow\" (you might have to log in first).")
print("3. Copy the authorization code.")
auth_code = input("Enter the authorization code here: ").strip()
#auth_code="3NIcPps_UxAAAAAAAAAEin1sp5jUjrErQ6787_RUbJU"

try:
    oauth_result = auth_flow.finish(auth_code)
except Exception as e:
    print('Error: %s' % (e,))
    exit(1)

with dropbox.Dropbox(oauth2_refresh_token=oauth_result.refresh_token, app_key=APP_KEY) as dbx:
    dbx.users_get_current_account()
    print("Successfully set up client!")

    for entry in dbx.files_list_folder(target).entries:
        print(entry.name)

    def dropbox_list_files(path):
        try:
            files = dbx.files_list_folder(path).entries
            files_list = []
            for file in files:
                if isinstance(file, dropbox.files.FileMetadata):
                    metadata = {
                        'name': file.name,
                        'path_display': file.path_display,
                        'client_modified': file.client_modified,
                        'server_modified': file.server_modified
                    }
                    files_list.append(metadata)
            df = pd.DataFrame.from_records(files_list)
            return df.sort_values(by='server_modified', ascending=False)
        except Exception as e:
            print('Error getting list of files from Dropbox: ' + str(e))

    # function to list the files in a folder and create shared links for them
    def create_links(target, csvfile):
        filesList = []
        print("creating links for folder " + target)
        files = dbx.files_list_folder('/' + target)
        filesList.extend(files.entries)
        print(len(files.entries))
        while files.has_more == True:
            files = dbx.files_list_folder_continue(files.cursor)
            filesList.extend(files.entries)
            print(len(files.entries))
        for file in filesList:
            if isinstance(file, dropbox.files.FileMetadata):
                filename = file.name + ',' + file.path_display + ',' + str(file.size) + ','
                link_data = dbx.sharing_create_shared_link(file.path_lower)
                filename += link_data.url + '\n'
                csvfile.write(filename)
                print(file.name)
            else:
                create_links(target + '/' + file.name, csvfile)

    # create links for all files in the folder belgeler
    create_links(target, open('links.csv', 'w', encoding='utf-8'))

    listing = dbx.files_list_folder(target)
    # todo: add implementation for files_list_folder_continue
    for entry in listing.entries:
        if entry.name.endswith(".pdf"):
            # note: this simple implementation only works for files in the root of the folder
            res = dbx.sharing_get_shared_links(target + entry.name)
            #f.write(res.content)
            print('\r', res)
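A minimal sketch of one way to avoid re-entering the authorization code on every run: persist the refresh token from the first run and reuse it on later runs. The file name refresh_token.txt below is an arbitrary choice, not something from the Dropbox SDK; everything else reuses the same SDK calls as the code above.

import os
import dropbox
from dropbox import DropboxOAuth2FlowNoRedirect

APP_KEY = ""
TOKEN_FILE = "refresh_token.txt"  # hypothetical local cache for the refresh token

def get_dropbox_client():
    # reuse the stored refresh token if we already have one
    if os.path.exists(TOKEN_FILE):
        with open(TOKEN_FILE) as f:
            refresh_token = f.read().strip()
        return dropbox.Dropbox(oauth2_refresh_token=refresh_token, app_key=APP_KEY)

    # otherwise run the interactive flow once and store the refresh token
    auth_flow = DropboxOAuth2FlowNoRedirect(APP_KEY, use_pkce=True, token_access_type='offline')
    print("1. Go to: " + auth_flow.start())
    print("2. Click \"Allow\" and copy the authorization code.")
    auth_code = input("Enter the authorization code here: ").strip()
    oauth_result = auth_flow.finish(auth_code)
    with open(TOKEN_FILE, "w") as f:
        f.write(oauth_result.refresh_token)
    return dropbox.Dropbox(oauth2_refresh_token=oauth_result.refresh_token, app_key=APP_KEY)

dbx = get_dropbox_client()
dbx.users_get_current_account()
print("Successfully set up client!")

With a refresh token and the app key, the SDK obtains short-lived access tokens on its own, so the user only has to paste the authorization code the very first time.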
When using KarateDriver, I want to define and execute a JS function in the browser.
Is it possible?
I want to define it like:
* def someFn =
"""
function(param) {
  // DOM operation in the browser
  // Event handling in the browser
  return
}
"""
* assert someFn('param1') == '<span>param1</span>'
Edit 1:
I define and execute:
* def keyword = 'karate'
* def formSubmit =
"""
function(formId) {
  var formElem = document.getElementById(formId);
  formElem.submit();
}
"""
Given driver 'https://github.com/search'
And driver.input('input[name=q]', keyword)
When driver.eval(formSubmit('search_form'))
Then eval driver.waitUntil(driver.location == 'https://github.com/search?utf8=%E2%9C%93&q=' + keyword + '&ref=simplesearch')
but this feature fails with:
javascript evaluation failed: driver.eval(formSubmit('search_form')), ReferenceError: "document" is not defined in <eval> at line number 2
Can it use DOM operations?
Edit 2:
I can define and execute the JS function:
* def getSubmitFn =
"""
function(formId) {
  return "var formElem = document.getElementById('" + formId + "');"
    + "formElem.submit();"
}
"""
You can do driver.eval() where the argument is raw JavaScript code as a string. I think this is sufficient for your needs:
* match driver.eval("location.href") == webUrlBase + '/page-01'
* assert driver.eval('1 + 2') == 3
EDIT: the JS engine for Karate and the browser JS engine are different and there is no connection between them, so you have to pass the JS as raw strings to driver.eval(). Here is an example that works for submitting a form.
* def getSubmitFn =
"""
function(formId) {
  return "document.getElementById('" + formId + "').submit()"
}
"""
* def temp = getSubmitFn('eg02FormId')
* print temp
* driver.eval(temp)
EDIT: I just remembered, * driver.eval() is valid, no need to do * eval karate.eval()
Typically what you pass to driver.eval() can be simple, but it has to be a string, and you cannot use Karate variables (you have to hard-code them when creating the JS dynamically). You can use DOM objects and functions. You can have multiple statements of JS separated by ;.
UPDATE 1: I've created a gist with actual running code in a test jig to show exactly what I'm running up against. I've included working bot tokens (for a throw-away bot) and access to a Telegram chat that the bot is already in, in case anyone wants to take a quick peek. It's here:
https://gist.github.com/pleasantone/59efe5f9d7f0bf1259afa0c1ae5a05fe
UPDATE 2: I've looked at the following articles for answers already (and a ton more):
https://github.com/francois2metz/html5-formdata/blob/master/formdata.js
PhantomJS - Upload a file without submitting a form
https://groups.google.com/forum/#!topic/casperjs/CHq3ZndjV0k
How to instantiate a File object in JavaScript?
How to create a File object from binary data in JavaScript
I've got a program written in casperjs (phantomjs) that successfully sends messages to Telegram via the BOT API, but I'm pulling my hair out trying to figure out how to send up a photo.
I can access my photo either as a file, off the local filesystem, or I've already got it as a base64 encoded string (it's a casper screen capture).
I know my photo is good, because I can post it via CURL using:
curl -X POST "https://api.telegram.org/bot<token>/sendPhoto" -F chat_id=<id> -F photo=#/tmp/photo.png
I know my code for connecting to the Bot API from within CasperJS is working, as I can do a sendMessage, just not a sendPhoto.
function sendMultipartResponse(url, params) {
    var boundary = '-------------------' + Math.floor(Math.random() * Math.pow(10, 8));
    var content = [];

    for (var index in params) {
        content.push('--' + boundary + '\r\n');
        var mimeHeader = 'Content-Disposition: form-data; name="' + index + '";';
        if (params[index].filename)
            mimeHeader += ' filename="' + params[index].filename + '";';
        content.push(mimeHeader + '\r\n');
        if (params[index].type)
            content.push('Content-Type: ' + params[index].type + '\r\n');
        var data = params[index].content || params[index];
        // if (data.length !== undefined)
        //     content.push('Content-Length: ' + data.length + '\r\n');
        content.push('' + '\r\n');
        content.push(data + '\r\n');
    };
    content.push('--' + boundary + '--' + '\r\n');
    utils.dump(content);

    var xhr = new XMLHttpRequest();
    xhr.open("POST", url, false);
    if (true) {
        /*
         * Heck, try making the whole thing a Blob to avoid string conversions
         */
        body = new Blob(content, {type: "multipart/form-data; boundary=" + boundary});
        utils.dump(body);
    } else {
        /*
         * this didn't work either, but both work perfectly for sendMessage
         */
        body = content.join('');
        xhr.setRequestHeader("Content-Type", "multipart/form-data; boundary=" + boundary);
        // xhr.setRequestHeader("Content-Length", body.length);
    }
    xhr.send(body);
    casper.log(xhr.responseText, 'error');
};
Again, this is in a CASPERJS environment, not a nodejs environment, so I don't have things like fs.createReadableStream or the File() constructor.
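For comparison, outside of CasperJS the same multipart upload that the working curl command performs can be reproduced with a few lines of Python using the requests library (a sketch; the token, chat id and file path are placeholders, just like in the curl example):

import requests

BOT_TOKEN = "<token>"  # placeholder, as in the curl example
CHAT_ID = "<id>"       # placeholder

url = "https://api.telegram.org/bot%s/sendPhoto" % BOT_TOKEN
with open("/tmp/photo.png", "rb") as photo:
    # requests builds the multipart/form-data body and boundary itself
    resp = requests.post(url, data={"chat_id": CHAT_ID}, files={"photo": photo})
print(resp.status_code, resp.text)

Comparing the request this produces (for example against a local HTTP echo server) with the body built by sendMultipartResponse can help pin down what the hand-rolled multipart encoding is missing.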
I am using MailChimp's inline-css form at: http://beaker.mailchimp.com/inline-css. It does a great job of preparing an HTML file for sending as an email.
I have an API key. I prefer not to have to run their PHP app for only one API call. Is it possible to use curl to access their inlineCss API? If so, what is the syntax?
Here is the doc page: http://apidocs.mailchimp.com/api/1.2/inlinecss.func.php
See also line: 2096 of this gist: https://gist.github.com/740362
My key looks something like:
f1b46???????????????????f2d5-us2
Here is a start of what I would like to achieve:
curl post -d #input.html apiKey=xxxxxxxx "http://us1.api.mailchimp.com/1.2/"
Thank you
This is what I hacked up, for anyone looking for a similar solution. Comments and other options are welcome:
import os
import re
import sys
import urllib
import mechanize
import xml.sax.saxutils as saxutils
from xml.sax.saxutils import unescape

try:
    issueRoot = os.environ['newslettersroot'] + os.environ['currYear'] + '/' + os.environ['issueRoot'] + '/'
except KeyError:
    print "Please run init.bat"
    sys.exit(1)

srcEmailFilename = 'email.html'
dstEmailFilename = 'email_inline_css.html'

# retrieve <body> section only
html = open(issueRoot + srcEmailFilename, 'rb').read()
html = re.findall("(?si)<body.*?</body>", html)[0]

# use mailchimp inlineCss site to inject class rules into html tags
response = mechanize.urlopen("http://beaker.mailchimp.com/inline-css")

# retrieve form
form = mechanize.ParseResponse(response, backwards_compat=False)[0]
form["html"] = html
# form["strip"] = "checked"

# submit form and retrieve result
html = mechanize.urlopen(form.click()).read()
match = re.search('<textarea name="text" cols="100" rows="12">(.*?)</textarea>', html, re.DOTALL | re.IGNORECASE | re.MULTILINE)
if not match:
    print html
    exit("Expected to find output from mailchimp.")

# clean up output
html = match.group(1)
html = saxutils.unescape(html)
html = urllib.unquote_plus(html)
html = unescape(html, {"&apos;": "'", "&quot;": '"'})
html = html.replace('&amp;', '&').replace('%2F', '/').replace('%3A', ':')

# sed -r 's/ class="[a-zA-Z0-9-]+"//g' %newslettersroot%%currYear%\%issueRoot%\email_inlinedcss.html > %newslettersroot%%currYear%\%issueRoot%\email_removedstyle.html
# replace class tags
html = re.sub(r'(?sim)\s*class="[a-zA-Z0-9-]+"', "", html)

fh = open(issueRoot + dstEmailFilename, 'wb')
fh.write(html)
fh.close()
What is the most convenient way, using Selenium WebDriver, to check whether a URL GET returns successfully (HTTP 200)?
In this particular case I'm most interested in verifying that no images of the current page are broken.
Try this:
List<WebElement> allImages = driver.findElements(By.tagName("img"));

for (WebElement image : allImages) {
    boolean loaded = (Boolean) ((JavascriptExecutor) driver).executeScript(
            "return arguments[0].complete", image);
    if (!loaded) {
        // Your error handling here.
    }
}
You could use the getEval command to verify the value returned from the following JavaScript for each image on the page.
@Test
public void checkForBrokenImages() {
    selenium.open("http://www.example.com/");
    int imageCount = selenium.getXpathCount("//img").intValue();
    for (int i = 0; i < imageCount; i++) {
        String currentImage = "this.browserbot.getUserWindow().document.images[" + i + "]";
        assertEquals(selenium.getEval("(!" + currentImage + ".complete) ? false : !(typeof " + currentImage + ".naturalWidth != \"undefined\" && " + currentImage + ".naturalWidth == 0);"), "true", "Broken image: " + selenium.getEval(currentImage + ".src"));
    }
}
Updated:
Added tested TestNG/Java example.
I don't think that first response will work. When I src a misnamed image, it throws a 404 error as expected. However, when I check the javascript in firebug, that (broken) image has .complete set to true. So, it was a completed 404, but still a broken image.
The second response seems to be more accurate in that it checks that it's complete and then checks that there is some width to the image.
I made a python version of the second response that works for me. Could be cleaned up a bit, but hopefully it will help.
def checkForBrokenImages(self):
    sel = self.selenium
    imgCount = int(sel.get_xpath_count("//img"))
    for i in range(0, imgCount):
        isComplete = sel.get_eval("selenium.browserbot.getCurrentWindow().document.images[" + str(i) + "].complete")
        self.assertTrue(isComplete, "Bad Img (!complete): " + sel.get_eval("selenium.browserbot.getCurrentWindow().document.images[" + str(i) + "].src"))
        typeOf = sel.get_eval("typeof selenium.browserbot.getCurrentWindow().document.images[" + str(i) + "].naturalWidth")
        self.assertTrue(typeOf != 'undefined', "Bad Img (w=undef): " + sel.get_eval("selenium.browserbot.getCurrentWindow().document.images[" + str(i) + "].src"))
        natWidth = int(sel.get_eval("selenium.browserbot.getCurrentWindow().document.images[" + str(i) + "].naturalWidth"))
        self.assertTrue(natWidth > 0, "Bad Img (w=0): " + sel.get_eval("selenium.browserbot.getCurrentWindow().document.images[" + str(i) + "].src"))
Instead of traversing in Java, it may be faster to call javascript only once for all images.
boolean allImgLoaded = (Boolean) ((JavascriptExecutor) driver).executeScript(
        "return Array.prototype.slice.call(document.images).every("
        + "function (img) {return img.complete && img.naturalWidth > 0;});");
One of the alternative solutions is analyzing the web server logs after test execution. This approach lets you catch not only missing images, but also broken CSS, scripts and other resources.
A description of how to do it is here.
The basic approach for checking 404s:
Basically, 404s can be checked via the HTTP response code of the URL.
Step 1: Import the library for HTTPTestAPI.
Step 2: Create the HTTPRequest.
String URL = "www.abc.com";
HTTPRequest request = new HTTPRequest(URL);
// Get the response code of the URL
int response_code = request.getResponseCode();
// Check for 404:
if (response_code == 404)
    FAIL -- the URL is leading to 404
else
    PASS
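For completeness, here is a minimal self-contained sketch of the same idea using the Selenium Python bindings together with the requests library (the page URL is a placeholder; it collects every <img> src on the page and checks the HTTP status code directly):

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://www.example.com/")  # placeholder URL

broken = []
for img in driver.find_elements(By.TAG_NAME, "img"):
    src = img.get_attribute("src")
    if not src:
        continue
    # a HEAD request is usually enough to get the status code without downloading the image
    status = requests.head(src, allow_redirects=True).status_code
    if status != 200:
        broken.append((src, status))

driver.quit()
print("Broken images:", broken)

If a server does not allow HEAD, fall back to a GET request for that URL.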