Crawling over 1000 pages with a single crawler vs multiple small crawlers - beautifulsoup

I want to crawl a local news website, which has over 1000 pages for each category. I'm using Beautiful Soup and requests, but my crawler is super slow: even after running it from 10 am to 5:30 pm, it won't finish. So I had an idea: split it into several small crawlers (each covering 100 pages or so) and run them in parallel.
Now, the advice I seek is:
Which approach would be less time-consuming?
Thanks, and sorry for my bad English.
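Either way, most of the time is spent waiting on the network, so a single script with a thread pool usually gets the same speedup as several separate crawlers, with less bookkeeping. Below is a minimal sketch of that approach; the URL pattern, the .headline selector, and the page range are hypothetical placeholders, not the real site's structure.

import concurrent.futures

import requests
from bs4 import BeautifulSoup

# Hypothetical listing-page URL; replace with the real category URL pattern.
BASE_URL = "https://example-news-site.test/category/local?page={}"

def fetch_page(page_number):
    """Download one listing page and return the headline texts found on it."""
    response = requests.get(BASE_URL.format(page_number), timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # ".headline" is a placeholder selector; adjust it to the site's markup.
    return [tag.get_text(strip=True) for tag in soup.select(".headline")]

def crawl(pages, workers=10):
    """Fetch pages concurrently with a small pool of threads."""
    headlines = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        for page_headlines in pool.map(fetch_page, pages):
            headlines.extend(page_headlines)
    return headlines

if __name__ == "__main__":
    print(len(crawl(range(1, 1001))))

Start with a modest number of workers and add a small delay per request if needed; a thousand pages fetched too aggressively can get your IP blocked by the site.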

Related

Auto-optimize data in Apache Solr 5.4.1

I am indexing more than 1,500,000 items from MySQL with Apache Solr 5.4.1. Every day, when I open the Solr Admin page, I find that there are more than 5,000 deleted items that should be optimized away, so I click Optimize and everything is fine.
Is there a simple URL I can put in the crontab to automate the optimization of the indexes in Apache Solr 5.4.1?
Thank you.
Example from UpdateXMLMessages:
This example will cause the index to be optimized down to at most 10 segments, but won't wait around until it's done (waitFlush=false):
curl 'http://localhost:8983/solr/core/update?optimize=true&maxSegments=10&waitFlush=false'
... but in general, you don't have to optimize very often. It might not be worth the time spent on the actual optimize and the extra disk activity. If you're re-indexing everything each time anyway, indexing into a fresh collection and then swapping the collections afterwards is also a possible solution.
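If you still want to automate it, that same URL can be called from cron with curl. A minimal sketch of a crontab entry, assuming Solr runs on localhost and the core is literally named core (adjust both); the daily 03:00 schedule is arbitrary:

# m h dom mon dow  command
0 3 * * *  curl -s 'http://localhost:8983/solr/core/update?optimize=true&maxSegments=10&waitFlush=false' > /dev/null 2>&1

As noted above, though, running an optimize this often is usually unnecessary.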

Do multiple custom post loops on a single page affect pagespeed?

I have 5 custom post types on the homepage that fetch content, and to make that work I have 5 separate loops.
Now, my question is: does this affect pagespeed, and if it does, will it affect SEO?
It depends on your page content and speed.
If your loops are making your page slow, that will hurt SEO (speed is not the only factor).
The pagespeed score depends on the content: if it is optimized, it will not hurt the score; if it is not, it will.
To keep it simple, take the example below:
Suppose your current page has 2 posts, each post loads 2 images/resources, and every extra custom loop adds 2 more posts with 2 more images/resources each. Then there are two cases.
First, if your images/resources are not optimized, your pagespeed score will drop as the number of posts on the page grows.
Second, if your images/resources are optimized, the extra posts will barely move the pagespeed score.

Cost of HTTP requests vs file size, rule of thumb?

This sort of question has been asked before (HTTP Requests vs File Size?), but I'm hoping for a better answer. In that linked question, the answerer did a pretty good job with the nifty formula of latency + transfer time, using an estimated latency of 80 ms and a transfer speed of 5 Mb/s. But it seems flawed in at least one respect: don't multiple requests and transfers happen simultaneously in a normal browsing experience? That's what it looks like when I examine the Network tab in Chrome. Doesn't this mean that request latency isn't such a terrible thing?
Are there any other things to consider? Obviously latency and bandwidth will vary, but are 80 ms and 5 Mb/s a good rule of thumb? I thought of an analogy and I wonder if it is correct. Imagine a train station with only one track in and one track out (or maybe it is one track for both). HTTP requests are like sending an engine out to fetch a bunch of cars at another station. It returns pulling a long train of railway cars, which represents the requested file being downloaded. So you could send one engine out and have it bring back a massive load, or you could send multiple engines out and have each bring back a smaller load; of course they would all have to wait their turn coming back into the station, and some engines couldn't be sent out until others had come in. Is this a flawed analogy?
I guess the big question, then, is how you can predict how much overlap there will be among HTTP requests, so that you can know, for example, whether it is generally worth it to have two big PNG files on your page, or instead a webp image plus the webpjs JS and SWF files for incompatible browsers. That doubles the number of requests but more than halves the total file size (say 200 kB of savings).
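For what it's worth, here is a back-of-the-envelope sketch of that latency + transfer-time model in Python, using the question's 80 ms / 5 Mb/s figures and assuming roughly 6 requests in flight at once (a common per-host browser limit); the file sizes are made-up numbers for the two-PNGs-vs-webp scenario, not measurements.

# Crude model: each batch of up to PARALLEL requests pays one round-trip latency,
# and all responses share the same bandwidth. Every number here is an assumption.
LATENCY_S = 0.080               # 80 ms round trip per request
BANDWIDTH_BPS = 5_000_000 / 8   # 5 Mb/s expressed as bytes per second
PARALLEL = 6                    # concurrent connections per host

def page_time(file_sizes_kb):
    """Estimate load time for a list of file sizes given in kB."""
    total_bytes = sum(file_sizes_kb) * 1024
    batches = -(-len(file_sizes_kb) // PARALLEL)   # ceiling division
    return batches * LATENCY_S + total_bytes / BANDWIDTH_BPS

# Hypothetical sizes: two PNGs totalling 400 kB vs. webp + JS + SWF totalling 200 kB.
print(round(page_time([200, 200]), 2), "s for 2 PNGs")
print(round(page_time([100, 40, 30, 30]), 2), "s for webp + fallbacks")

On these made-up numbers the extra requests are nearly free because they all fit in one parallel batch, and halving the payload dominates, which is roughly what the measurements further down also suggest.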
Your analogy is not bad in general terms. Obviously, if you want to be really precise in all aspects, there are things that are oversimplified or incorrect (but that happens with almost all analogies).
Your estimate of 80 ms and 5 Mb/s might sound logical, but even though most of us like theory, you should approach this kind of problem in another way.
To make good estimates, you should measure, collect some data, and analyze it. Every estimate depends on a context, and you should not ignore it.
Estimating latency and bandwidth is not the same for a 3G connection, an ADSL connection in Japan, or an ADSL connection in a less technologically developed country. Are clients accessing from the other end of the world or from the same country? Like your good observation about simultaneous connections on the client, there are millions of possible questions to ask yourself and very few good-quality answers without doing some measuring.
I know I'm not answering your question exactly, because I think it is unanswerable without many details about the domain (plus constraints, and a huge etc.).
You seem to have some ideas about how to design your solution. My best advice is to implement each one of them and profile them. Make measurements, try to identify what your bottlenecks are, and see whether you have some control over them.
For some problems this kind of question might have an optimal solution, but the difference between optimal and suboptimal can be negligible in practice.
This is the kind of answer I'm looking for. I did some simplistic tests to get a feel for the speed of many small files vs. one large file.
I created html pages that loaded a bunch of random sized images from placekitten.com. I loaded them in Chrome with the Network tab open.
Here are some results:
# of imgs | Total size (KB) | Time (ms), one entry per reload
1         | 465             | 4000, 550
1         | 307             | 3000, 800, 350, 550, 400
30        | 192             | 1200, 900, 800, 900
30        | 529             | 7000, 5000, 6500, 7500
One major thing to note is that single files load much quicker after they have been loaded once (the comma-separated lists of times are page reloads). I did normal refreshes and also Empty Cache and Hard Reload; strangely, it didn't seem to make much difference which way I refreshed.
My connection had a latency (or return time, or whatever) of around 120-130 ms, and my download speed varied between 4 and 8 Mbps. Chrome seemed to do about 6 requests at a time.
Looking at these few tests, it seems that, at least in this range of file sizes, it is clearly better to have fewer requests when the file sizes are equal; but if you could cut the file size in half, even at the expense of 30 extra HTTP requests, it would be worth it, at least for a fresh page load.
Any comments or better answers would be appreciated.

Scaling CakePHP Version 2.3.0

I'm beginning a new project using CakePHP. I like the "auto-magic" features, and I think it's a good fit for the project. I'm wondering about the potential to scale CakePHP to several million IP hits a day and hundreds of thousands of database writes and reads a day, plus roughly 50,000 to 500,000 users, often with 3,000 using the site concurrently. I'm making heavy use of stored procedures to offset this, and I'm using several servers, including a load balancer.
I'm wondering about the computational cost of some of the auto-magic and how well Cake copes with session requests that make many DB hits. Has anyone had success with Cake running on a single server-array setup at this level of traffic? I'm not using the cloud or a distributed database (yet). I'm really worried about potential bottlenecks with this framework, and I'm interested in advice from anyone who has worked with Cake in production. I've researched, but I would love a second opinion. Thank you for your time.
This is not a problem, but the optimization is up to you.
There are different cache backends you can use - memcache, Redis, full-page caching... All of that is already supported by Cake; what you cache, and where, is up to you.
For searching, you could try Elasticsearch to speed things up.
There are also dispatcher filters that let you bypass controller instantiation (you might want to do that in special cases; check the asset filter for example).
Use nginx, not Apache.
Also, I would not start by over-optimizing and over-thinking this before any code is written. Start well, think about caching, but analyse and fix bottlenecks only when you actually run into them. Otherwise you'll waste a lot of time on over-optimization before you've even written anything that works.
Cake itself is very fast. Just to prove the bullshit factor of the fancy benchmarks some frameworks publish, we did one using a dispatcher filter to "optimize" it and even beat Yii, which seems pretty eager to show how fast it is. But benchmarks are pointless, especially in a huge project where so many human-made mistakes can be introduced.

Which SEO practices are likely to be responsible for SO questions appearing so quickly in Google searches?

Does anyone have some idea as to how questions posted here on SO show up so quickly on Google?
Sometimes submitted questions appear among the first 10 entries or so - on the first page - within 30 minutes of being posted. Pray tell, what sort of magic is being wielded here?
Anybody have some ideas or suggestions? My first thought is that they have info in their sitemap that tells Google's robots to crawl every N minutes or so - is that what's going on?
BTW, I am aware that simply instructing Googlebot to scan your site every N minutes will not work if you don't have quality information that is constantly being updated on your site.
I'd just like to know if there is something else that SO may be doing right (apart from the marvelous content, of course).
To put it simply, more popular websites with more quality content and more frequent changes are ranked higher with Google's algorithm, and are indexed and cached more frequently than sites that are less popular or change less frequently.
Broadly speaking, it's only content that does it. The size and quality of the content has reached Google's threshold for "spider as fast as the site will permit". SO has to actively throttle the Googlebot; Jeff has said on Coding Horror that they were getting more than 50,000 requests per day from Google, and that was over a year ago.
If you scan through non-news sites from the Alexa top 500 you will find virtually all of those have results in Google that are just minutes old. (i.e. type site:archive.org into Google and choose "Latest" in the menu on the left)
So there's nothing practical you can do to your own site to speed up spidering, except to increase the amount of traffic to your site...
It is really simple.
SO is a PageRank 6 site that gives the world new information.
Google has a strong bias toward new information. It will crawl the site many times a day and immediately add the pages to its index. It will favor a page (top 10) for, say, a specific query for a short period of time (a few days) and then stop favoring that page and rank it as normal.
This is standard G procedure and it happens with many many sites.
As you might guess, grayhat/blackhat SEO exploits that fact in many ways.
It's also helped by SO providing an RSS feed; I think Google likes feeds from reliable sources.