How to start multiple Scrapy crawlers programmatically? - scrapy

I have several Scrapy crawlers, each focused on one particular domain.
Now I want to start them from one script so that, on request, I can crawl particular products on all domains. Is there a way to start multiple crawlers from one crawler or script?
The idea is that I pass the product ID to that central script, which then starts the crawl of this product on all of, say, 100 domains.
Is that possible? If yes, what would be the right approach?
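A minimal sketch of one possible approach, using Scrapy's CrawlerProcess to run several spiders from a single script; the spider classes, module paths and the product_id argument are placeholders for illustration:

    # Run several Scrapy spiders from one script, passing the product ID
    # to each of them as a spider argument.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Hypothetical per-domain spiders; replace with your own spider classes.
    from myproject.spiders.domain_a import DomainASpider
    from myproject.spiders.domain_b import DomainBSpider

    def crawl_product(product_id):
        process = CrawlerProcess(get_project_settings())
        for spider_cls in (DomainASpider, DomainBSpider):
            # Each spider receives product_id and can build its start URLs from it.
            process.crawl(spider_cls, product_id=product_id)
        process.start()  # blocks until all crawls have finished

    if __name__ == "__main__":
        crawl_product("12345")

Inside each spider the argument arrives as a spider attribute (e.g. self.product_id), so every domain-specific spider can map the same product ID to its own URL scheme.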

Related

Django Dynamic Scraper: Automatically Scrape a HUGE amount of urls at scale #143

I am currently working on a project whose goal is to create scrapers dynamically and then be able to process a huge number of URLs at scale.
For example, I have two websites in DDS: www.xxx.it associated with Scraper IT and www.xxx.ca associated with Scraper CA.
I want to send an unlimited number of URLs to DDS, and I want all URLs of the form xxx.it/* to be scraped by Scraper IT and all URLs of the form xxx.ca/* to be scraped by Scraper CA. I want this to be inferred and handled automatically by DDS.
Is there a way to achieve this with the current implementation?
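The question is left open here, but the routing step it describes can be sketched independently of DDS: map each incoming URL to a scraper based on its host. The scraper names come from the question; how the chosen scraper is actually dispatched is left out:

    # Route URLs to the scraper responsible for their domain.
    from urllib.parse import urlparse

    DOMAIN_TO_SCRAPER = {
        "www.xxx.it": "Scraper IT",
        "www.xxx.ca": "Scraper CA",
    }

    def route_url(url):
        """Return the scraper responsible for this URL, or None if unknown."""
        host = urlparse(url).netloc
        return DOMAIN_TO_SCRAPER.get(host)

    print(route_url("http://www.xxx.it/some/product"))  # -> "Scraper IT"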

Multiple categories of item in sitemap.xml

On my site I have items that can belong to several categories.
Links to one item may look like (for example):
example.com/category_id_1/item_id_1
example.com/category_id_67/item_id_1
example.com/category_id_106/item_id_1
So I don't understand: do I need to list all the links for one item in sitemap.xml, or just one of them? If only one, which one?
Which way would be more correct for SEO?
If all three URLs serve the same content, it will create duplicate content issues.
You didn't mention whether you are using a dedicated product URL or not. A better approach is to use product URLs, something like:
yourdomain.com/products/xyx-product
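A small illustrative sketch of that advice: build the sitemap from the canonical product URLs only, rather than one entry per category variant (the domain and product slugs are placeholders):

    # Emit one sitemap entry per product, using the canonical product URL.
    products = ["xyx-product", "abc-product"]

    entries = "\n".join(
        f"  <url><loc>https://yourdomain.com/products/{slug}</loc></url>"
        for slug in products
    )
    sitemap = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</urlset>"
    )
    print(sitemap)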

Multipage Bootstrap and Google Analytics

I have a problem with how to use Google Analytics properly with Bootstrap.
My site has subpages three levels deep, and the last subpage has its own subdomain. In GA I see I can use a maximum of 50 tracking codes within one account. What if I need more than that?
You are limited to 50 properties, not 50 pages. Each property can track many pages (up to 10 million hits a month on the free version) and events.
Typically you would use the same property code on all pages of the same site so you can see all that data together (with the option to drill down).
You would only use a new property code for a new site (though your subdomain might qualify for that if you want to track it separately).
So the two questions you want to ask yourself are:
Do you want to be able to report on two pages together? For example, to see that your site gets 10,000 hits, with 20% for this page and 5% for that page, or that people start at this page, then go to that page, then on to this page. If so, they should be in the same analytics property.
Do different people need to see these page stats? And is it a problem if they do? If so, put them in separate properties so you can set permissions separately.
It sounds like these are part of the same site, so I'd be veering towards tracking them together in the same property.
On a different note, you should set one page as the main version (with a rel canonical tag) and redirect the other versions to it, to avoid confusing search engines into thinking you have duplicate content. Do you have a reason for serving the same content at two different addresses? It can cause SEO and other problems.

Elasticsearch URL prioritization in Searching

I am currently working on a search infrastructure which uses elasticsearch as the indexing engine. The requirement is to crawl and index 5 subdomains:
subdomain a is related to products
subdomain b is related to FAQs/Questions
subdomain c is related to internet plans
Now, when someone searches for anything related to products, searching must prioritize subdomain a -- that is, the top results must belong to subdomain a. If someone searches for questions, then the top results must primarily come from subdomain b, and so on.
My idea is to index each subdomain separately, then give each index some sort of priority using index.priority in Elasticsearch. However, that proved to be unstable and still does not produce the desired effect.
Any other possible approaches you can suggest? Thanks in advance!
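One thing worth noting: index.priority only controls the order in which indices are recovered, not how their documents score at search time, which may explain why it had no effect here. Below is a minimal sketch using per-index boosting at query time instead, assuming each subdomain lives in its own index ("products", "faqs", "plans" and the "content" field are placeholder names):

    # Boost the index that matches the detected query intent via indices_boost.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    def search(query_text, boost_index):
        # The intent-matching index gets a higher weight; the others keep 1.0.
        boosts = [{name: (2.0 if name == boost_index else 1.0)}
                  for name in ("products", "faqs", "plans")]
        body = {
            "indices_boost": boosts,
            "query": {"match": {"content": query_text}},
        }
        return es.search(index="products,faqs,plans", body=body)

    # A product-related query: hits from the "products" index rank higher.
    results = search("fibre internet router", boost_index="products")

How the query intent (products vs. questions vs. plans) is detected is a separate problem; the sketch just assumes it is already known.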

Search engines and duplicate content across two sites

I have a client who has presented the following situation:
A parent company works with two distributors of their products. Both distributors want a new website developed. They both sell the same products, so they will want to share content and basic page layouts. For example, product listings will be the same across both sites, as will the copy about the products they provide.
My concerns here are with SEO and duplicate content. Google defines duplicate content as:
Duplicate content generally refers to substantive blocks of content
within or across domains that either completely match other content or
are appreciably similar.
It seems that in this case, where two distributors are selling the same products, having two websites with duplicated content is legitimate. But I have a feeling either site could get penalised, so perhaps having two sites would be too damaging.
Any thoughts on this much appreciated.
Thanks
If it can't be helped, it's fine. Ideally you would want to write unique descriptions and content.
As recently as a couple of months ago, I had a staging site that was missing the noindex tag by mistake. Both the staging site and the actual site were ranking well for keywords.
While you are probably fine, you should still look into allotting time for content development.