Let me describe the flow of my spider:
First, I provide around 300 URLs.
Scrapy starts crawling the first 10 URLs (is that 10 configurable?)
Then, for each URL, there are two actions:
First Action: The spider goes to all items listed on the page (48 items).
For each item, I crawl all of its pagination; there can be 50 or more pages of feedback per item, and I store them in Postgres.
Second Action: The spider fetches the next page and repeats the same routine.
The depth limit for my Scrapy crawl is 20, so if we do some calculations, the total number of pages crawled should be:
300 * 20 * 48 * 50 = 14,400,000 pages in one crawl.
Is this something Scrapy is capable of?
My server has 8 GB of RAM.
Now what happens is that Scrapy gets stuck on the first 10 URLs and never goes beyond them. Do you know why that would happen?
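For reference, both numbers mentioned above are controlled by standard settings in the project's settings.py. A minimal sketch, assuming a typical Scrapy project (the values shown are only illustrative, not the configuration actually in use here):

# settings.py -- a minimal sketch of the settings relevant to the flow above.
# The values are illustrative, not recommendations.

# How many requests Scrapy keeps in flight at once. This is what limits the
# crawl to roughly 10 of the start URLs at a time (the default is 16).
CONCURRENT_REQUESTS = 10

# A per-domain ceiling also applies and may be the effective limit.
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Maximum link depth the spider will follow; 0 means unlimited.
DEPTH_LIMIT = 20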
I have several Scrapy crawlers, each focused on one particular domain.
Now I want to start them from one script so that I can crawl particular products on all domains on request. For example, is there a way to start multiple crawlers from one crawler?
The idea is that I will pass the product ID to that central script, which will then start a crawl for this product on, let's say, all 100 domains.
Is that possible? If yes, what would be the right approach?
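One way to sketch this (the spider classes, module paths, and the product_id argument below are hypothetical placeholders, not an existing project) is to drive all the spiders from a single script with Scrapy's CrawlerProcess:

# run_all.py -- a sketch of starting several domain-specific spiders from one
# script. DomainASpider, DomainBSpider, the "myproject" module paths and the
# product_id argument are placeholders for your own spiders.
import sys

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.domain_a import DomainASpider
from myproject.spiders.domain_b import DomainBSpider


def main(product_id):
    process = CrawlerProcess(get_project_settings())
    # Schedule one crawl per domain-specific spider; keyword arguments are
    # passed through to each spider's constructor.
    for spider_cls in (DomainASpider, DomainBSpider):
        process.crawl(spider_cls, product_id=product_id)
    process.start()  # blocks until every scheduled crawl has finished


if __name__ == "__main__":
    main(sys.argv[1])

Running something like "python run_all.py 12345" would then kick off the crawl for that product ID on every spider registered in the script.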
Background: as some of you know, I publish a business deals listings magazine in PDF format. I am after Google Analytics/Matomo-style tracking of readership of the PDF. (I use Matomo for my websites and prefer it over Google.)
Where I want to get to, for example, is: user 1 spent 3 seconds on page 1, skipped page 2, read pages 3-10, then left.
User 2 read pages 20-30 and spent an average of 5 seconds on each page. And so on.
Is there a way to do this or is it even possible?
I already know whether the magazine has been downloaded, and where and when, and I have this data.
Many thanks
We’ve come across this question fairly often at Load Impact, so I thought I’d add it to the Stack Overflow community to make it easier to find:
I want my load test to be realistic. How do I create a Load Impact user scenario that emulates realistic user behaviour, accessing different pages and also accessing some pages more frequently (for example the home page), just like real users would?
If you have 3 pages on your site that users can visit, and you know how many times each page is visited by the users, you can calculate the “weight” of each page, and create a user scenario that simulates the same kind of visitor pattern on the site that real users exhibit. This is an example of how to do that.
First, we have to find out how popular each of the three pages is. This can be done by looking at statistics from e.g. Google Analytics to see how many times each page was visited over the last month or so. Let's say we have these figures:
==== Page ====        ==== Visits/day ====
/                     8453
/news.php             1843
/contacts.php         277
The total number of page visits is 10573 (8453+1843+277). If we divide each individual number by the total, we get the “weight” (percentage) for that particular page – i.e. how big the chance is that a random page load on the site happens to load that particular page:
==== Page ====        ==== Visits/day ====    =========== Weight ===========
/                     8453                    0.799 (79.9% of all page loads)
/news.php             1843                    0.174 (17.4% of all page loads)
/contacts.php         277                     0.026 (2.6% of all page loads)
Now we can create our user scenario that replicates real traffic on our site – i.e. that will exercise our web server in the same way real users do. Here is the code:
-- We create functions for each of the three pages. Calling one of these functions
-- will result in the simulated client loading all the resources necessary for rendering
-- the page, i.e. the client will perform one page load of that particular page.
--
-- Main/start page
local page1 = function()
    -- First load the HTML code
    http.request_batch({
        "http://test.loadimpact.com/"
    })
    -- When the HTML code is done loading, start loading other resources that are
    -- referred to in the HTML code, emulating the load order a real browser uses
    http.request_batch({
        "http://test.loadimpact.com/style.css",
        "http://test.loadimpact.com/images/logo.png"
    })
end
--
-- /news.php page
local page2 = function()
    -- This example page consists of only one resource - the main HTML code for the page
    http.request_batch({
        "http://test.loadimpact.com/news.php"
    })
end
--
-- /contacts.php page
local page3 = function()
    -- This example page consists of only one resource - the main HTML code for the page
    http.request_batch({
        "http://test.loadimpact.com/contacts.php"
    })
end
--
--
-- Get a random page to load, using the page weights we worked out earlier
--
-- Generate a value in the range 0-1
local randval = math.random()
-- Find out which page to load
if randval <= 0.799 then
    -- 79.9% chance that we load page1
    page1()
elseif randval <= (0.799 + 0.174) then
    -- 17.4% chance that page2 gets loaded
    page2()
else
    -- ...and the rest of the time (~2.7%), page3 gets loaded
    page3()
end
I suggest using an automated web testing tool.
One option is JMeter. Please see the Web Test Plan Manual for instructions on how to create a test plan for basic web site testing, including user actions, number of users, execution speed and frequency, and data collection.
Another option for basic web scripting is Selenium IDE.
Or, if you have programming experience, I would look at using Selenium WebDriver. This gives you the most flexibility and can integrate into an existing Java, C#, Python, etc. test project. It also scales nicely and can be integrated with CI services such as Sauce Labs.
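As a rough illustration of the WebDriver route, here is a minimal sketch using the Python bindings; the page URL and the "News" link text are placeholders only, not part of any real test plan:

# A minimal Selenium WebDriver sketch (Python bindings). The URL and the
# link text below are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # or webdriver.Firefox(), etc.
try:
    driver.get("http://test.loadimpact.com/")
    # Simulate a user action: follow a link by its visible text
    driver.find_element(By.LINK_TEXT, "News").click()
    print(driver.title)
finally:
    driver.quit()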
There's also the option of recording user behaviour to create scripts for Load Impact.
Reference and instructions are here: Simulating realistic load.
Once recorded and possibly adapted, user scenarios should go into a test configuration that specifies how to distribute users across locations and scenarios, to create a simulation as close to real-world usage as possible.
Finding out how many users you should have in your test is a slightly different question, and I'll hold off expanding on that until it's actually needed.
I have submitted 1000 pages to Google in my sitemap, but I did it by mistake. I didn't want to submit them all; I wanted to submit 800 and then release 1 page per day for 200 days, in order to add fresh content every day. That way Google would see it as a frequently updated website, which is good SEO practice.
I don't want Google to know about the existence of those 200 pages right now; I want Google to think they are fresh content when I release them, one per day.
Should I resubmit the sitemap.xml with only 800 links and hide those pages on the website?
If Google has already indexed the pages, is there any chance of making Google "forget" them and not recognize them in the future when I release them again?
Any suggestion about what to do?
Thank you guys.
I wouldn't do that. You're trying to cheat, and Google doesn't like that. Keep your site as it is now, create new content, and submit it to Google's index as frequently as you want. If you exclude previously submitted data from the index, there's a high probability it won't be indexed again.
I just started playing with the Google Maps API for static images, and after just an hour this image has appeared:
http://www.coon.it/drop/limit.png
Is that normal?
My page needs 6 static images; does that mean I have called the API something like 170 times?
I don't think that's possible, since the pictures are always the same, and the documentation says that requesting the same image again doesn't count.
What can I do?
Thank you
Although there is an absolute limit on the number of images, that limit may be calculated over a shorter period than 24 hours. 2400/day could be interpreted as 100/hour, so you can't use all 2400 in one go [I can't remember what the limit is for Static Maps, but you get the idea].
With most Google services there is also a rate limit, and it's possible that fetching six images almost simultaneously breaks the rate limit. Rate limits vary depending on server load, but spacing requests out by 200 ms should be OK.
Or it may look like an automated image-fetcher (especially if you have run it 30 times in the last hour). Google doesn't like automated crawlers either.
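As a rough sketch of that spacing advice (the map URLs and their parameters below are placeholders for your own six image requests):

# A sketch of fetching the six static map images roughly 200 ms apart instead
# of all at once. The URLs are placeholders for your own map requests.
import time
import urllib.request

MAP_URLS = [
    "https://maps.googleapis.com/maps/api/staticmap?center=placeholder&zoom=12&size=400x400",
    # ...the other five image URLs...
]

for url in MAP_URLS:
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    # Ideally cache `data` to disk so repeat page views don't hit the API at all.
    time.sleep(0.2)  # keep well under any per-second rate limit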
What should it display? For me it just displays a landscape with a row of trees (which I guess is what you want to see?).