Unable to scrape specific site - import.io

I am unable to use Magic, the Crawler, or a Connector on this site:
http://digitaltmuseum.se
When using the "Magic" option, import.io just freezes.
When using the "Crawler", I can create an API, but it is unable to crawl.
When using the "Connector", after the first recording, the pink "take me to the next step" button never shows up.
Any thoughts on why this is impossible, or any hints on how I could proceed?

It's not possible because of the JavaScript used on the site.
Found this solution: http://support.import.io/knowledgebase/articles/623235-infinite-scroll-and-javascript-prerender-beta

Related

Scrapy: how to navigate, select, and submit a form

I am trying to make a bot that simulates some human behaviors, and I found some instructions on using Scrapy to log in to a page like nike.com.br, but when I need to click buttons and submit forms I could not find out how.
Can anyone help me with it?
For example, after logging in, I need to choose the size of the product and click "add to cart". Is there some way to do that using Scrapy?
It's hard to answer your question because it's too generic, and it will probably have different solutions for different pages.
Generally speaking, you need to check what the page is doing when you click to submit the form. Most likely it sends a POST request, so you will need to mimic that POST request with Scrapy (check FormRequest; a short sketch follows this answer).
The same logic applies to adding an item to the cart.
I think the best way to approach this is to use the browser's network tool. The Scrapy docs have a few tips on using it for a similar purpose (here).
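Here is a minimal sketch of that FormRequest approach, assuming a hypothetical login URL, form field names, and cart endpoint; the real values for nike.com.br would have to be found with the browser's network tool:

import scrapy


class LoginSpider(scrapy.Spider):
    name = "login_example"
    start_urls = ["https://example.com/login"]  # hypothetical login page

    def parse(self, response):
        # from_response() copies hidden fields (e.g. CSRF tokens) from the
        # page's <form> and merges in the credentials supplied here
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"username": "me@example.com", "password": "secret"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # "adding to cart" is usually just another POST; replicate the
        # request the browser sends when the button is clicked
        yield scrapy.FormRequest(
            "https://example.com/cart/add",  # hypothetical endpoint
            formdata={"product_id": "123", "size": "42"},
            callback=self.parse_cart,
        )

    def parse_cart(self, response):
        self.logger.info("Cart response status: %s", response.status)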

Google Custom Search refinement redirect

So I'm using Google Custom Search (Google CSE) and I'm trying to use the refinement functionality to redirect search queries to Google Scholar.
Basically I'm following exactly the documentation found here. However, it turns out that despite there being documentation, this functionality doesn't exist, and it doesn't appear that Google has any plans to implement it in the near future (see the StackOverflow post here).
My question is, does anyone have a hack/workaround for this problem, so that I could use Google CSE to search Google Scholar?
Server Side
You can use something like https://github.com/ckreibich/scholar.py to parse the results from Google Scholar yourself and expose them as an API that you could consume and render any way you like.
It would use Scholar search under the hood. However, since this isn't an official API, it might break at any time. It also requires server-side resources to service the requests, but it gives you the nicest interface, one you have full control over.
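For illustration, a rough sketch of such a server-side proxy, using requests, BeautifulSoup, and Flask rather than scholar.py; the CSS classes ("gs_r", "gs_rt") reflect Scholar's markup at the time of writing, may change without notice, and Google may block automated requests, so treat this as a starting point only:

import requests
from bs4 import BeautifulSoup
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/scholar")
def scholar():
    query = request.args.get("q", "")
    resp = requests.get(
        "https://scholar.google.com/scholar",
        params={"q": query},
        headers={"User-Agent": "Mozilla/5.0"},  # Scholar tends to reject bare clients
        timeout=10,
    )
    soup = BeautifulSoup(resp.text, "html.parser")
    results = []
    for hit in soup.select(".gs_r"):
        title = hit.select_one(".gs_rt")
        link = title.find("a") if title else None
        results.append({
            "title": title.get_text(" ", strip=True) if title else None,
            "url": link["href"] if link else None,
        })
    return jsonify(results)


if __name__ == "__main__":
    app.run(port=8000)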
IFrame
You can open an iframe at the particular URL, and this can be embedded inside your page. It looks a bit clunkier, but it means you don't have to link externally and you can embed it locally:
<iframe src='http://scholar.google.com/scholar?q={query}'></iframe>
See the documentation here. It might be exactly what renders well for you.
External Link
Alternatively, you can just open a new tab/window with:
<a href='http://scholar.google.com/scholar?q={query}' target='_blank'> My Link </a>

How can I extract data behind a login page using import.io

I need to crawl some data that sits behind a login page. To be able to scrape it I need a tool that can log in and then crawl the pages behind it. Is it possible to do this with import.io?
Short version: yes, it is.
Longer version:
There are at least two ways; both require you to sign up and download the desktop app (all free).
Extractor version (simpler):
Point the browser to the login page. Log in normally, then train your API to extract the data you need. The downside of this method is that it will only work as long as you stay logged in. If you want import.io to log in for you, you'll need the...
Authenticated version:
As above, but create an authenticated API. This will record your login procedure and execute it for you every time you run the API.
Since the chosen answer doesn't work anymore :( I recommend Cloudscrape. You will get a free trial with 20 hours of crawling and/or scraping if you sign up. For data behind a login you will need a scraper.
Handy tutorials
Tutorial for logging in with a scraper.
Tutorial for pagination.

Getting The Results of a Google Search Programmatically via Custom Search

I am trying to send a request to Google.com via the Custom Search API and get the response in a proper format (preferably XML or HTML). From the Custom Search API website, I have seen that this is actually possible via "Retrieving the Code for the Search Results" (it is here). The thing is that I cannot get it working: every time, the broken-robot error page from Google shows up. I was wondering if anyone had any experience with it and could help me. I am trying to use the search results for a small project.
Here are the things I have done:
I can use the Google Search API in general (I have used it with the search text box).
I have set up my Custom Search API to search the entire web.
Here are a bunch of things that I am not going to do:
I'm not trying to have a Google search box in my site.
I'm not trying to grab what Google says by parsing the Google.com page.
Here is what I need to do:
I need the content of what Google returns as search results via whatever API Google has to offer.
I will be using PHP to write this program. If anyone has a better way to get these search results in a proper format, it would be very much appreciated.
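For reference, a minimal sketch of the programmatic route via the Custom Search JSON API (shown in Python for brevity; the same request is easy to make from PHP with cURL). "YOUR_API_KEY" and "YOUR_CX" are placeholders for your own API key and search-engine ID, and the API returns JSON rather than XML/HTML, so the results still need to be rendered into the desired format:

import requests

API_KEY = "YOUR_API_KEY"  # from the Google Cloud console
CX = "YOUR_CX"            # your Custom Search engine ID


def search(query, num=10):
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX, "q": query, "num": num},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])


for item in search("web scraping"):
    print(item["title"], item["link"])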

script to click a link at a certain time

I am interested in writing a script that goes to a website and clicks a link at a certain time. How do I go about doing something like this?
You should use Selenium: http://seleniumhq.org/
You can control it using any one of the languages you specified in the tags.
You can start browsing from
http://seleniumhq.org/projects/remote-control/
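A small sketch of that approach with the Selenium Python bindings, assuming a locally installed Chrome driver; the target time, URL, and link text below are placeholders:

import datetime
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

TARGET_TIME = datetime.datetime(2024, 1, 1, 9, 0, 0)  # when to click

# wait until the target time before acting
while datetime.datetime.now() < TARGET_TIME:
    time.sleep(1)

driver = webdriver.Chrome()
driver.get("https://example.com")                     # hypothetical page
driver.find_element(By.LINK_TEXT, "Sign up").click()  # hypothetical link text
time.sleep(5)  # give the next page a moment to load
driver.quit()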
"clicking a link" could have two meanings:
Actually clicking the link in a browser, or just making the HTTP GET request that clicking it would trigger. The former could be anything from software that runs on your desktop and simulates a click at a certain screen position, to something as full-featured as Selenium for automating website interactions.
If you just need to make the GET request that clicking the link would make, almost anything will do. Unix systems typically include wget and curl, which take a URL to request. If you want to process the data, you can do this in most programming languages; for example, in Python you could call urllib2.urlopen('http://stackoverflow.com') and then do whatever you want with the data. Perl has an equivalent.
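The Python 3 equivalent of that snippet uses urllib.request (urllib2 was folded into it); scheduling the request for a certain time can be left to cron or the OS task scheduler rather than the script itself:

import urllib.request

with urllib.request.urlopen('http://stackoverflow.com') as resp:
    data = resp.read()
    print(len(data), 'bytes fetched')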
Are you familiar with cURL?