how does scrapy-splash handle infinite scrolling? - scrapy

I want to reverse engineer the content that is generated when scrolling down a web page. The problem is the URL https://www.crowdfunder.com/user/following_page/80159?user_id=80159&limit=0&per_page=20&screwrand=933: the screwrand parameter doesn't seem to follow any pattern, so reconstructing the URLs doesn't work. I'm considering rendering the page automatically with Splash. How can I use Splash to scroll the way a browser does? Thanks a lot!
Here is the code for the two requests:
request1 = scrapy_splash.SplashRequest(
    'https://www.crowdfunder.com/user/following/{}'.format(user_id),
    self.parse_follow_relationship,
    args={'wait': 2},
    meta={'user_id': user_id, 'action': 'following'},
    endpoint='http://192.168.99.100:8050/render.html')
yield request1
request2 = scrapy_splash.SplashRequest(
    'https://www.crowdfunder.com/user/following_user/80159?user_id=80159&limit=0&per_page=20&screwrand=76',
    self.parse_tmp,
    meta={'user_id': user_id, 'action': 'following'},
    endpoint='http://192.168.99.100:8050/render.html')
yield request2
ajax request shown in browser console

To scroll a page you can write a custom rendering script (see http://splash.readthedocs.io/en/stable/scripting-tutorial.html), something like this:
function main(splash)
    local num_scrolls = 10
    local scroll_delay = 1.0
    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)
    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(scroll_delay)
    end
    return splash:html()
end
To run this script, use the 'execute' endpoint instead of the render.html endpoint:
script = """<Lua script> """
scrapy_splash.SplashRequest(url, self.parse,
    endpoint='execute',
    args={'wait': 2, 'lua_source': script}, ...)

Thanks Mikhail, I tried your scroll script and it worked, but I noticed that it scrolls too far in one step, so some JS has no time to render and is skipped. I made a small change, as follows:
function main(splash)
    local num_scrolls = 10
    local scroll_delay = 1
    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)
    for _ = 1, num_scrolls do
        local height = get_body_height()
        for i = 1, 10 do
            scroll_to(0, height * i/10)
            splash:wait(scroll_delay/10)
        end
    end
    return splash:html()
end

I do not think that hard-coding the number of scrolls is a good idea for infinite-scroll pages, so I modified the code above to keep scrolling until the page height stops growing:
function main(splash, args)
    -- declared local so the variables do not leak into the global scope
    local current_scroll = 0
    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    splash:wait(3)
    local height = get_body_height()
    -- keep scrolling while the last scroll made the page taller
    while current_scroll < height do
        scroll_to(0, get_body_height())
        splash:wait(5)
        current_scroll = height
        height = get_body_height()
    end
    splash:set_viewport_full()
    return splash:html()
end
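The termination logic of the loop above can be tried out without Splash. The following is only a sketch in Python: FakePage is a made-up stand-in for a real page, and max_scrolls is an extra safety cap that is not part of the Lua script (a page that really never stops growing would otherwise loop forever).

```python
def scroll_until_stable(get_height, scroll_to, max_scrolls=50):
    """Scroll to the bottom until the height stops growing; return the scroll count."""
    scrolls = 0
    previous = 0
    height = get_height()
    while previous < height and scrolls < max_scrolls:
        scroll_to(height)
        scrolls += 1
        previous = height
        height = get_height()
    return scrolls


class FakePage:
    """Hypothetical page whose height grows one step per scroll, then stops."""

    def __init__(self, heights):
        self.heights = heights
        self.index = 0

    def get_height(self):
        return self.heights[self.index]

    def scroll_to(self, y):
        # Each scroll "loads" the next chunk until the content runs out.
        if self.index < len(self.heights) - 1:
            self.index += 1


page = FakePage([1000, 2000, 3000])
scroll_count = scroll_until_stable(page.get_height, page.scroll_to)

endless = FakePage(list(range(1000, 100000, 1000)))
capped_count = scroll_until_stable(endless.get_height, endless.scroll_to, max_scrolls=5)
```

The loop always performs one final scroll that adds nothing, which is how it detects that the page has stopped growing.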


How to get more than 100 rows using Airtable API using offset?

I am really new to the Airtable API, and for some reason connecting to the API this way did not work:
at = airtable.Airtable('Base_Key', 'Airtable_Key')
But I got it working this way:
get_url = 'https://api.airtable.com/v0/BASE_ID/TABLE_NAME'
get_headers = {
    'Authorization': 'Bearer API_KEY' }
Response = requests.get(get_url, headers=get_headers)
Response_Table = Response.json()
However, this fetches only the first 100 records. I have been reading about offset and pagination, but I am unable to figure out how to incorporate them into this code.
Thank you for the time!
After a lot of issues, I found this solution. Posting it for anyone else facing the same problem.
import requests

url = "https://api.airtable.com/v0/BASE_ID/TABLE_NAME"
# same Bearer header as in the question
headers = {"Authorization": "Bearer API_KEY"}
params = {"view": "Published View"}
result = []
while True:
    try:
        response = requests.get(url, headers=headers, params=params)
        response.raise_for_status()
    except requests.RequestException as e:
        print(e)
        break
    response_table = response.json()
    # extend (not append) so result stays a flat list of records
    result.extend(response_table["records"])
    # "offset" is only present while more pages remain
    offset = response_table.get("offset")
    if offset is None:
        break
    # send the offset back to request the next page
    params["offset"] = offset
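The offset loop can also be wrapped in a generator so the paging logic is separate from the HTTP call. This is only a sketch: iter_records, fetch_page, and the fake pages are made-up names and data for illustration, and in real use fetch_page would wrap the requests.get call and return response.json().

```python
def iter_records(fetch_page):
    """Yield records from every page; fetch_page(offset) returns a decoded JSON dict."""
    offset = None  # no offset on the first request
    while True:
        page = fetch_page(offset)
        yield from page.get("records", [])
        offset = page.get("offset")
        if offset is None:  # last page: no "offset" key in the response
            break


# Hypothetical responses standing in for the real API.
FAKE_PAGES = {
    None: {"records": [{"id": "rec1"}, {"id": "rec2"}], "offset": "page2"},
    "page2": {"records": [{"id": "rec3"}]},
}


def fake_fetch(offset):
    return FAKE_PAGES[offset]


all_records = list(iter_records(fake_fetch))
record_ids = [record["id"] for record in all_records]
```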

AmCharts 4 : Can't customize (color, strokeWidth) my series

EDIT: OK, the cause was my CSS stylesheet, which had a rule on path because I use SVG a lot. I removed that rule and the problem was gone!
I'm facing something pretty annoying that I do not understand.
I'm using amCharts to make an XY chart with multiple series. Not that hard.
The thing is, I can't customize my series! Bullets and legend are OK, but not the series.
Here's a screenshot for better understanding:
MyWeirdChart (new OP can't embed images, sorry)
As you can see, my custom bullets are pushed onto my series and my legend is exactly what I want for my chart, BUT the series stay unchanged.
Here is my JS draw function:
function drawChart(dateArray, casesArray, deathsArray, healedArray, hospitalizationsArray, reanimationsArray) {
    am4core.useTheme(am4themes_animated);
    var chart = am4core.create("chartdiv", am4charts.XYChart);
    chart.data = generateChartData(dateArray, casesArray, deathsArray, healedArray, hospitalizationsArray, reanimationsArray);
    var dateAxis = chart.xAxes.push(new am4charts.DateAxis());
    var valueAxis = chart.yAxes.push(new am4charts.ValueAxis());
    function pushSeries(field, name, color) {
        let series = chart.series.push(new am4charts.LineSeries());
        series.dataFields.valueY = field;
        series.dataFields.dateX = "date";
        series.name = name;
        series.tooltipText = name + ": [b]{valueY}[/]";
        series.stroke = am4core.color(color);
        series.strokeWidth = 3;
        series.fill = am4core.color(color);
        series.fillOpacity = 0.5;
        let bullet = series.bullets.push(new am4charts.CircleBullet());
        bullet.circle.stroke = am4core.color(color);
        bullet.circle.strokeWidth = 2;
        bullet.circle.fill = am4core.color(color);
        bullet.circle.fillOpacity = 0.5;
        bullet.circle.radius = 3;
    }
    pushSeries("cases", "Cas confirmés", "#32B3E3");
    pushSeries("healed", "Guéris", "#00C750");
    pushSeries("hospitalizations", "Hospitalisations", "#FFBB33");
    pushSeries("reanimations", "Réanimations", "#FE3446");
    pushSeries("deaths", "Morts", "black");
    chart.cursor = new am4charts.XYCursor();
    chart.scrollbarX = new am4core.Scrollbar();
    chart.legend = new am4charts.Legend();
    chart.cursor.maxTooltipDistance = 0;
}
Did I miss something? I crawled forums and documentation and I'm now helpless.
My code is in my webpack app.js file, but I include amCharts with HTML script tags,
<script src="https://www.amcharts.com/lib/4/core.js"></script>
<script src="https://www.amcharts.com/lib/4/charts.js"></script>
<script src="https://www.amcharts.com/lib/4/themes/animated.js"></script>
not with a webpack import. But I guess that if this were the problem, I would not be able to draw a chart at all.
OK, the cause was my CSS stylesheet, which had a rule on path because I use SVG a lot. I removed that rule and the problem was gone!

how to handle multiple return values in scrapy from splash

I'm using Scrapy with Splash. My Splash script can return multiple values, but in my Scrapy code I cannot access all of them. For example,
this is my Splash script:
splash_script = """
function main(splash)
    local url = splash.args.url
    return {
        html = splash:html(),
        number = 1
    }
end
"""
The method that triggers Splash from Scrapy:
yield scrapy.Request(
    url=response.urljoin(url),
    callback=self.product_details,
    errback=self.error,
    dont_filter=True,
    meta={
        'splash': {
            'endpoint': 'render.html',
            'cache_args': ['lua_source'],
            'args': {
                'index': index,
                'http_method': 'GET',
                'lua_source': self.splash_script,
            }
        }
    },
)
The callback method:
def product_details(self, response):
    print(response.body)
This method receives only the HTML content; I can't see the number.
You are printing response.body, which only contains the HTML.
Use response.data to see the 1.
You can also access the elements individually:
response.data['html']
or
response.data['number']
Note that Splash only runs lua_source on the execute endpoint, so the request should use 'endpoint': 'execute' rather than 'render.html'.
When you return values, make sure you assign the keys in the return statement itself:
NOT
html = splash:html()
number = 1
return {number, html}
BUT
return {number = 1, html = splash:html()}
Basically, you have to assign the JSON keys in the return statement, even if you have already assigned the variables outside it.
This is extra info, but it really tripped me up, and you might run into the same problem.
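To make the difference between response.body and response.data concrete: when the Lua script returns a table with named keys, Splash sends it back as a JSON object, and response.data is that JSON decoded. Here is a small sketch using only the standard library; sample_body is a made-up payload for illustration, not captured output.

```python
import json

# Hypothetical JSON body for a script that returns
# {html = splash:html(), number = 1}
sample_body = b'{"html": "<html><body>Hello</body></html>", "number": 1}'

# Decoding the body gives a dict with one entry per key of the Lua table.
data = json.loads(sample_body)
html = data["html"]
number = data["number"]
```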

Possible to speed up this algorithm?

I am trying to speed up a Ruby algorithm. I have a Rails app that uses Active Record and Nokogiri to visit a list of URLs in a database, scrape the main image from each page, and save it under the image attribute associated with that URL.
This Rails task usually takes about 2:30 s to complete, and I am trying to speed it up as a learning exercise. Would it be possible to use C through RubyInline and raw SQL code to achieve the desired result? My only issue is that if I use C, I lose the database connection that Active Record had in Ruby, and I have no idea how to write SQL queries alongside the C code that will properly connect to my db.
Has anyone had experience with this, or does anyone even know if it's possible? I'm doing this primarily as a learning exercise. Here is the code that I want to translate into C and SQL, if you are interested:
task :getimg => :environment do
  stories = FeedEntry.all
  stories.each do |story|
    if story.image.nil?
      url = story.url
      doc = Nokogiri::HTML(open(url))
      if doc.at_css(".full-width img")
        img = doc.at_css(".full-width img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".body-width img")
        img = doc.at_css(".body-width img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".body-narrow-width img")
        img = doc.at_css(".body-narrow-width img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".caption img")
        img = doc.at_css(".caption img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".cnnArticleGalleryPhotoContainer img")
        img = doc.at_css(".cnnArticleGalleryPhotoContainer img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".cnn_strylftcntnt div img")
        img = doc.at_css(".cnn_strylftcntnt div img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".cnn_stryimg640captioned img")
        img = doc.at_css(".cnn_stryimg640captioned img")[:src]
        story.image = img
        story.save!
      end
    else
      #do nothing
    end
  end
end
I would appreciate any and all help and insights in this matter. Thank you in advance!!
Speed of DB Saving
I've written a web crawler in Ruby, and I found that one of the bottlenecks that can affect performance is the actual creation of the row in the database. It's faster to do a single mass insert after extracting all URLs than to do many individual inserts (at least for Postgres).
So instead of calling YourModel.save! for every URL you visit, push every image URL to an array that keeps track of what you need to save to the database. Then, once you've finished scraping all links, do a mass insert of all the image links through a single SQL command.
to_insert = []
stories.each do |story|
  url = story.url
  doc = Nokogiri::HTML(open(url))
  img_url = doc.at_css("img")[:src]
  # quote the value so it becomes a valid, escaped SQL string literal
  to_insert.push "(#{CONN.quote(img_url)})"
end
#notice the mass insert at the end
sql = "INSERT INTO your_table (img_url) VALUES #{to_insert.join(", ")}"
#CONN is a constant declared at the top of your file (CONN = ActiveRecord::Base.connection)
#that connects to the database
CONN.execute sql
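The collect-then-insert idea is not specific to Ruby. As a rough illustration, here is the same pattern in Python with the standard-library sqlite3 module. Note the swap: it uses executemany with placeholders instead of building one multi-row INSERT string, and the images table and the example URLs are made up for the sketch.

```python
import sqlite3


def bulk_insert(conn, img_urls):
    """Insert all collected URLs in one batch instead of one save per URL."""
    # Placeholders let the driver handle quoting and escaping.
    conn.executemany(
        "INSERT INTO images (img_url) VALUES (?)",
        [(url,) for url in img_urls],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (img_url TEXT)")

# In a real scraper this list would be filled while visiting the pages.
collected = ["http://example.com/a.jpg", "http://example.com/b.jpg"]
bulk_insert(conn, collected)

count = conn.execute("SELECT COUNT(*) FROM images").fetchone()[0]
```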
"Speed Up" Downloading
The downloading of links will also be a bottleneck. Thus, the best option would be to create a thread pool, where each thread is allocated a partition of URLs from the database to scrape. This way, you will never be stuck waiting for a single page to download before you do any real processing.
Some pseudo-ish Ruby code (urls and scrape are placeholders for your own URL list and scraping method):
number_of_workers = 10
# split the URLs into one partition per worker
slice_size = [(urls.size / number_of_workers.to_f).ceil, 1].max
threads = urls.each_slice(slice_size).each_with_index.map do |urls_for_this_thread, i|
  Thread.new do
    begin
      urls_for_this_thread.each { |url| scrape(url) }
    rescue => e
      puts "========================================"
      puts "Thread # #{i} error"
      puts e.message
      puts e.backtrace
      puts "========================================"
      raise e
    end
  end
end
# wait for all workers to finish
threads.each(&:join)
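For comparison, here is a sketch of the same thread-pool idea in Python with concurrent.futures. It differs from the partitioning described above in that the pool hands URLs to whichever worker is free rather than giving each thread a fixed slice. The scrape function is a placeholder that does no network access.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape(url):
    # Placeholder for the real download-and-parse step.
    return url, len(url)


def scrape_all(urls, number_of_workers=10):
    """Run scrape over all URLs on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=number_of_workers) as pool:
        # map returns results in the same order as the input URLs
        return list(pool.map(scrape, urls))


results = scrape_all(["http://example.com/a", "http://example.com/bb"])
```

Because the work is dominated by waiting on the network, threads help here even though only one thread runs Python code at a time.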
Are the URLs remote? If so, first benchmark the task to measure the network latency. If that's the bottleneck, there is little you can gain by changing your code or your choice of language.
How many FeedEntry records do you have in your database? I suggest using FeedEntry.find_each instead of FeedEntry.all.each, because the former loads 1000 entries into memory, processes them, and then loads the next 1000 entries, and so on, while the latter loads all entries into memory and then iterates over them, which requires more memory and increases GC cycles.
If the bottleneck is neither of the above, then maybe it's the DOM node search that is slow. You can find the (only one?) img node, then check its parent node or grandparent node if necessary, and update your entries accordingly.
image_node = doc.at_css('img')
story.update image: image_node['src'] if image_node && needed?(image_node)
def needed?(image_node)
  parent_node = image_node.parent
  parent_class = parent_node['class']
  return true if parent_class == 'full-width'
  return true if parent_class == 'body-width'
  return true if parent_class == 'body-narrow-width'
  return true if parent_class == 'caption'
  return true if parent_class == 'cnnArticleGalleryPhotoContainer'
  return true if parent_class == 'cnn_stryimg640captioned'
  # node_type is an integer constant in Nokogiri, so compare the tag name instead
  return false unless parent_node.name == 'div'
  return true if parent_node.parent['class'] == 'cnn_strylftcntnt'
  false
end

How to get an outline view in sublime texteditor?

How do I get an outline view in the Sublime Text editor for Windows?
The minimap is helpful, but I miss a traditional outline (a clickable list of all the functions in my code, in the order they appear, for quick navigation and orientation).
Maybe there is a plugin, add-on or similar? It would also be nice if you could briefly name the steps necessary to make it work.
There is a duplicate of this question on the sublime text forums.
Hit CTRL+R, or CMD+R for Mac, for the function list. This works in Sublime Text 1.3 or above.
A plugin named Outline is available in package control, try it!
https://packagecontrol.io/packages/Outline
Note: it does not work in multi rows/columns mode.
For multiple rows/columns layouts, use this fork:
https://github.com/vlad-wonderkidstudio/SublimeOutline
I use the fold all action. It will minimize everything to the declaration, I can see all the methods/functions, and then expand the one I'm interested in.
I briefly looked at the Sublime Text 3 API, and view.find_by_selector(selector) seems to be able to return a list of regions.
So I guess that a plugin that displays the outline/structure of your file is possible.
A plugin that would display something like this:
Note: the function name display plugin could be used as inspiration for extracting the class/method names, or ClassHierarchy for extracting the outline structure.
If you want to be able to print out or save the outline, Ctrl/Cmd+R is not very useful.
One can do a simple Find All with the regular expression ^[^\n]*function[^{]+{ or some variant of it to suit the language and situation you are working in.
Once you do the Find All, you can copy and paste the result into a new document; depending on the number of functions, it should not take long to tidy up.
The answer is far from perfect, particularly when comments contain the word function (or its equivalent), but I do think it's a helpful answer.
With a very quick edit this is the result I got on what I'm working on now.
PathMaker.prototype.start = PathMaker.prototype.initiate = function(point){};
PathMaker.prototype.path = function(thePath){};
PathMaker.prototype.add = function(point){};
PathMaker.prototype.addPath = function(path){};
PathMaker.prototype.go = function(distance, angle){};
PathMaker.prototype.goE = function(distance, angle){};
PathMaker.prototype.turn = function(angle, distance){};
PathMaker.prototype.continue = function(distance, a){};
PathMaker.prototype.curve = function(angle, radiusX, radiusY){};
PathMaker.prototype.up = PathMaker.prototype.north = function(distance){};
PathMaker.prototype.down = PathMaker.prototype.south = function(distance){};
PathMaker.prototype.east = function(distance){};
PathMaker.prototype.west = function(distance){};
PathMaker.prototype.getAngle = function(point){};
PathMaker.prototype.toBezierPoints = function(PathMakerPoints, toSource){};
PathMaker.prototype.extremities = function(points){};
PathMaker.prototype.bounds = function(path){};
PathMaker.prototype.tangent = function(t, points){};
PathMaker.prototype.roundErrors = function(n, acurracy){};
PathMaker.prototype.bezierTangent = function(path, t){};
PathMaker.prototype.splitBezier = function(points, t){};
PathMaker.prototype.arc = function(start, end){};
PathMaker.prototype.getKappa = function(angle, start){};
PathMaker.prototype.circle = function(radius, start, end, x, y, reverse){};
PathMaker.prototype.ellipse = function(radiusX, radiusY, start, end, x, y , reverse/*, anchorPoint, reverse*/ ){};
PathMaker.prototype.rotateArc = function(path /*array*/ , angle){};
PathMaker.prototype.rotatePoint = function(point, origin, r){};
PathMaker.prototype.roundErrors = function(n, acurracy){};
PathMaker.prototype.rotate = function(path /*object or array*/ , R){};
PathMaker.prototype.moveTo = function(path /*object or array*/ , x, y){};
PathMaker.prototype.scale = function(path, x, y /* number X scale i.e. 1.2 for 120% */ ){};
PathMaker.prototype.reverse = function(path){};
PathMaker.prototype.pathItemPath = function(pathItem, toSource){};
PathMaker.prototype.merge = function(path){};
PathMaker.prototype.draw = function(item, properties){};
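The Find All approach can also be scripted outside the editor. The sketch below applies the same regular expression in Python, with the closing brace escaped and multi-line mode enabled. The outline helper and the sample source are made up for illustration, and the same caveat applies: a comment containing the word function would also match.

```python
import re

# Same pattern as the editor search: a line containing "function", up to the first "{".
OUTLINE_RE = re.compile(r"^[^\n]*function[^{]+\{", re.MULTILINE)


def outline(source):
    """Return one entry per function declaration found in the source text."""
    return [m.group(0).rstrip("{").strip() for m in OUTLINE_RE.finditer(source)]


sample = """
PathMaker.prototype.path = function(thePath){
    return thePath;
};
function helper(a, b) {
    return a + b;
}
"""

entries = outline(sample)
```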