import.io crawler for non-standard system of pagination - import.io

I am trying to build an import.io crawler for this site, http://theaccelblog.squarespace.com/, but when I click "next" to move to the next page during training, it takes me back to the first page because of the pagination system being used. I would appreciate any suggestions on how to get the import.io crawler to crawl through these pages. As suggested on the import.io website, I tried to find the pagination scheme in the packets being exchanged with the server, but did not succeed. Thanks if you can help. JRH

I used Bulk Extract to create an API.
https://import.io/data/mine/?id=bc7d67f2-24d3-4b5c-b134-01544430998a
If you use the offset pagination URLs below, you can input them into Bulk Extract and get the data you need.
http://theaccelblog.squarespace.com/?offset=1418833411427
http://theaccelblog.squarespace.com/?offset=1409932229141
http://theaccelblog.squarespace.com/?offset=1402342675828
http://theaccelblog.squarespace.com/?offset=1397601000000
http://theaccelblog.squarespace.com/?offset=1397511000000
http://theaccelblog.squarespace.com/?offset=1390543200000
http://theaccelblog.squarespace.com/?offset=1375383600000
http://theaccelblog.squarespace.com/?offset=1359748800000
http://theaccelblog.squarespace.com/?offset=1285959600000
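These offsets look like the millisecond timestamps Squarespace uses for its pagination, so if the post list changes you may need to regenerate them. One way to do that (a rough Python sketch, separate from import.io; the XPath for the older-posts link is an assumption about the blog's markup) is to follow the pagination link from page to page and collect the offset URLs:

import requests
from lxml import html
from urllib.parse import urljoin

BASE = "http://theaccelblog.squarespace.com/"

def collect_offset_urls(start_url=BASE, max_pages=20):
    # Follow the "older posts" link on each page and record every
    # ?offset=... URL it points to.
    urls, url = [], start_url
    for _ in range(max_pages):
        tree = html.fromstring(requests.get(url).content)
        next_hrefs = tree.xpath('//a[contains(@href, "offset=")]/@href')  # assumed markup
        if not next_hrefs:
            break
        url = urljoin(BASE, next_hrefs[0])
        if url in urls:
            break
        urls.append(url)
    return urls

if __name__ == "__main__":
    for u in collect_offset_urls():
        print(u)  # paste these into Bulk Extract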
Thanks,
Meg

Related

Can I use Selenium to get to a certain point and after that use requests + lxml (or another package like Scrapy) to make the process faster?

I have a web crawling task where the website requires the user to log in and pass email 2FA authentication before reaching the main page. From the main page, I have a lot of detail pages (all in the same format) to visit, collect data from, and store in a CSV file.
I have no problem getting to the main page. However, I'd like to optimise what happens after that; my current solution is a for loop that visits each page and collects the data:
for each in page_id_list:
    driver.get("https://...." + str(each))
    driver.find_elements_by_xpath("...")
    ....
Functionally it works fine, but I was wondering if I can make the process faster. Since Selenium loads the whole web page while packages like requests + lxml only retrieve the HTML, maybe I should use another way to visit those pages; I also wonder whether Scrapy would be useful in my case.
Thank you.
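One common pattern for this (a rough sketch only; the URL and XPath are placeholders, since the real ones aren't shown above) is to do the login and 2FA in Selenium, then copy the session cookies into a requests.Session and fetch the detail pages with requests + lxml, which avoids rendering each page in the browser:

import requests
from lxml import html

# driver is the Selenium driver that is already logged in and past 2FA,
# page_id_list is the same list used in the loop above.
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"])

rows = []
for each in page_id_list:
    resp = session.get("https://example.com/detail/" + str(each))  # placeholder URL
    tree = html.fromstring(resp.content)
    rows.append(tree.xpath("//h1/text()"))  # placeholder XPath

Whether this works depends on the site keeping you authenticated with cookies alone; if it also checks headers or tokens, you would have to copy those too. Scrapy can do the same thing (pass the cookies into its requests) and adds concurrency on top.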

Scrapy does not return results

I'm studying scrapy and am trying to crawl through this website - http://bananarepublic.gap.com/browse/category.do?cid=1055063&sop=true
However, my Scrapy code cannot find the product links listed on this website. Could anyone tell me why? The XPath I'm using is //a[@class="product-card--link"]/@href
Is this because of JS? I tried using scrapy-splash but still cannot find the product links. Could someone please help!
Thank you!
The items are generated via AJAX requests. When you connect to the page, a JavaScript script is executed that makes some extra HTTP requests to retrieve JSON data. However, Scrapy does not execute any JavaScript, so you need to manually find and call those AJAX requests.
See the related question "Can scrapy be used to scrape dynamic content from websites that are using AJAX?" for how to inspect network traffic and solve such cases.
In this particular case, you can see that the first XHR request being made returns a huge JSON file with all of the item data:
http://bananarepublic.gap.com/resources/productSearch/v1/search?cid=1055063&isFacetsEnabled=true&globalShippingCountryCode=&globalShippingCurrencyCode=&locale=en_US&
As you can see, the URL takes some arguments; most importantly, it takes cid, which stands for category ID. The other arguments are mostly for calculating shipping prices, so if you don't care about those, this works just as well:
http://bananarepublic.gap.com/resources/productSearch/v1/search?cid=1055063
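A rough spider along those lines might look like the sketch below. Note that the JSON field names ("products", "name", "url") are assumptions; inspect the actual response to find the real keys.

import json
import scrapy

class BananaRepublicSpider(scrapy.Spider):
    name = "bananarepublic"
    start_urls = [
        "http://bananarepublic.gap.com/resources/productSearch/v1/search?cid=1055063"
    ]

    def parse(self, response):
        data = json.loads(response.text)
        # Adjust this path to match the real structure of the JSON response.
        for product in data.get("products", []):
            yield {
                "name": product.get("name"),
                "url": product.get("url"),
            }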
An alternative that avoids digging deep into the AJAX requests would be using Splash (https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/) to scrape the page after the AJAX has been processed.
It can be a bit easier to implement, and your XPath expression should work fine with Splash, but the scraper will be slower as it has to render each page.
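For reference, a minimal scrapy-splash sketch of that approach (it assumes a Splash instance is running and that SPLASH_URL plus the scrapy-splash middlewares are configured in settings.py):

import scrapy
from scrapy_splash import SplashRequest

class BananaRepublicSplashSpider(scrapy.Spider):
    name = "bananarepublic_splash"

    def start_requests(self):
        url = "http://bananarepublic.gap.com/browse/category.do?cid=1055063&sop=true"
        # Let Splash render the page (and run the AJAX) before parsing.
        yield SplashRequest(url, self.parse, args={"wait": 2})

    def parse(self, response):
        for href in response.xpath('//a[@class="product-card--link"]/@href').extract():
            yield {"url": response.urljoin(href)}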

Which is the best way to use Scrapy to crawl 1000 sites?

I'd like to hear the differences between three approaches for using Scrapy to crawl 1000 sites.
For example, say I want to scrape 1000 photo sites that almost all have the same structure: there is one kind of photo list page and another kind of large photo page, but the HTML of these list and photo-description pages will not all be the same.
Another example: I want to scrape 1000 WordPress blogs, taking only each blog's articles.
The first is exploring all 1000 sites using one Scrapy project.
The second is having all 1000 sites under the same Scrapy project, with all items in items.py and each site having its own spider.
The third is similar to the second, but with one spider for all the sites instead of separating them.
What are the differences, and which do you think is the right approach? Is there another, better approach I've missed?
I had 90 sites to pull from, so creating one crawler per site wasn't a great option. The idea was to be able to run in parallel. I also split the work so that similar page formats were handled in one place.
So I ended up with 2 crawlers:
Crawler 1 - URL extractor. This would extract all detail-page URLs from the top-level listing pages into a file (or files).
Crawler 2 - Detail fetcher. This would read from the URL file and extract the item details.
This allowed me to fetch the URLs first and estimate the number of threads I might need for the second crawler.
Since each crawler was working on a specific page format, there were quite a few functions I could reuse.
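In outline, the split looked roughly like this (class names, URLs, and selectors below are illustrative placeholders, not the original code):

import scrapy

class UrlExtractorSpider(scrapy.Spider):
    # Crawler 1: collect detail-page URLs from the listing pages into a file,
    # e.g. run with: scrapy crawl url_extractor -o detail_urls.csv
    name = "url_extractor"
    start_urls = ["https://example-photo-site.com/list"]  # one entry per site

    def parse(self, response):
        for href in response.xpath('//a[@class="detail-link"]/@href').extract():
            yield {"url": response.urljoin(href)}

class DetailSpider(scrapy.Spider):
    # Crawler 2: read the URL file and extract the item details.
    name = "detail_fetcher"

    def start_requests(self):
        with open("detail_urls.csv") as f:
            next(f)  # skip the CSV header written by the feed exporter
            for line in f:
                yield scrapy.Request(line.strip(), callback=self.parse)

    def parse(self, response):
        yield {
            "title": response.xpath("//h1/text()").extract_first(),
            "url": response.url,
        }

Counting the lines in detail_urls.csv up front is what makes it possible to estimate how much concurrency the second crawler needs.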

How to force Google to show my first page from a paginated page set?

I have a website that contains, for example, a list of Audi models. Using Google Webmaster Tools, I saw that my website appears in Google search for the word "audi", but the target page was the 22nd page of my result set, not the first. I need my first page to appear, not my last (or a middle one), but I cannot tell Google that this is a parameter, because my URLs are rewritten using mod_rewrite. Any ideas?
By the way, I have read in an SEO forum that it's a bad idea to use a canonical tag. Is it really a bad idea in my case?
You can't force Google to do anything; however, they have made it easier to deal with pagination issues with a recent post on rel="next" and rel="prev".
But the primary problem you face is signalling to Google that your first (main) page is the starting point. This is achieved using internal-link and back-link "juice" focused on that page: you need to ensure that the first page of results is linked to properly from higher-value pages (like the home page).
Google also recently announced that you can use a "View All" page, which allows them to find and index entire articles that are normally broken up by pagination and display them as one result.

How to speed up Google Translate

I have a web page that has 70,000 characters. As you know, when translating through the Google API you can only send up to 5,000 characters at a time, which means I have to send data to Google 14 times (70,000/5,000). That takes a lot of time before my page is displayed. Is there a way to speed up the process?
Thanks
Have you tried caching the translation?
If you are using an AJAX framework (you don't mention what your web page is created with, e.g. C#), then you can make it faster by making the API calls via AJAX.
It would look something like this (pseudo-code, since we don't know what you are using):
Serve web page (almost instant)
Web page starts AJAX call:
    Break text into chunks
    For each chunk:
        Translate via API
        Append to the page
This way the user will see the page immediately, and will also see the translation appear piece by piece as it is processed, instead of having to wait until the end.
My best bet would be to generate the page in one language, then ask Google to translate it through HTTP and display the result as your own, to make it seamless for the user. I believe that is what Google Chrome does when translating web pages.
An example of a URL that makes Google translate a whole web page:
http://translate.google.com/translate?hl=en&sl=ru&tl=en&u=http%3A%2F%2Flinux.org.ru%2F
Of course, another option is to use the Google Translate API and cache the result if the page content does not change frequently.
Go to the JavaScript file Google serves; it will also lead you to the CSS file. Copy them into a file or two (you may be able to add the CSS to your own stylesheet) and host the JavaScript in its own directory on your site. Add a snippet of code that refreshes that JavaScript every so many seconds or minutes; this will make the transition much faster, since you are just refreshing the content they give. Have fun :) Ultimately, you can also send the request for the text after character 5,000 at the same time as the first one, which should be relatively easy to do.
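On that last point, here is a rough server-side sketch of splitting the text into chunks of at most 5,000 characters and translating several chunks at the same time; translate_chunk is a placeholder for whatever Translate API call you already make:

from concurrent.futures import ThreadPoolExecutor

LIMIT = 5000  # Google Translate per-request character limit

def split_into_chunks(text, limit=LIMIT):
    # Cut the text into pieces no longer than `limit`, breaking on a space
    # where possible so words are not split in half.
    chunks, start = [], 0
    while start < len(text):
        end = min(start + limit, len(text))
        cut = text.rfind(" ", start, end)
        if cut <= start or end == len(text):
            cut = end
        chunks.append(text[start:cut])
        start = cut
    return chunks

def translate_chunk(chunk):
    # Placeholder: call the Google Translate API here and return the result.
    raise NotImplementedError

def translate_page(text):
    # Translate the chunks concurrently instead of one after another.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return "".join(pool.map(translate_chunk, split_into_chunks(text)))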