I want to know how to add pagination to API requests for more efficient data retrieval, specifically with the YouTube API.
I haven't tried anything so far, as this is a new concept to me.
What I personally do is usually one of two things:
(my preferred way) I create more than one API key, and every X requests I dynamically switch which key executes the request; this avoids throttling.
Alternatively, when sending a large number of requests, you can pause dynamically every X requests or seconds.
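As a rough sketch of both ideas applied to the YouTube Data API v3 (which paginates search.list results with pageToken/nextPageToken), key rotation plus periodic pausing could look something like this in Python; the key list, channel ID, and pause values are placeholders, not a tested implementation:

```python
import time
from itertools import cycle

import requests

# Placeholders: supply your own API keys and channel ID.
YOUTUBE_API_KEYS = ["KEY_1", "KEY_2", "KEY_3"]
SEARCH_URL = "https://www.googleapis.com/youtube/v3/search"

def fetch_all_videos(channel_id, pause_every=50, pause_seconds=10):
    """Page through search results, rotating API keys and pausing periodically."""
    keys = cycle(YOUTUBE_API_KEYS)            # round-robin over the keys
    page_token = None
    request_count = 0
    items = []

    while True:
        params = {
            "part": "snippet",
            "channelId": channel_id,
            "maxResults": 50,
            "key": next(keys),                # switch the key on every request
        }
        if page_token:
            params["pageToken"] = page_token  # ask for the next page

        response = requests.get(SEARCH_URL, params=params)
        response.raise_for_status()
        data = response.json()
        items.extend(data.get("items", []))

        request_count += 1
        if request_count % pause_every == 0:
            time.sleep(pause_seconds)         # back off every X requests

        page_token = data.get("nextPageToken")
        if not page_token:                    # no more pages
            return items
```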
I've been using import.io to extract lots of data from hundreds of web pages. I've already created extractors for those URLs and am still adding more.
I've designed an automated process that sends an HTTP request to the import.io RESTful API for each of my extractors in turn.
Every time I create a new extractor, I have to manually insert the endpoint of the newly created extractor into my database. This approach is very time-consuming and error prone, since there is a chance of a copy/paste mistake.
Although import.io maintains the list of my extractors, I'd love to download all of them along with their RESTful request endpoints so that the data can be stored in my database.
Is there a way to download or extract all of this information into Excel or some other format?
At this time there is no way to bulk-download the API endpoints for all your extractors, I'm afraid.
It is possible, however, to get the GUIDs of your connectors using this method:
http://api.docs.import.io/legacy/#ConnectorMethods "search connectors"
You could write a small script, in Python for example, to parse the response and pull out the GUIDs.
Potentially you could add this to your automated process, as in the sketch below.
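Here is a minimal sketch of that idea. The endpoint path, the _apikey parameter, and the shape of the JSON response are assumptions based on the legacy store API docs linked above; verify them against the docs before relying on this.

```python
import requests

# Assumptions: the legacy "search connectors" method lives at
# /store/connector/_search and takes an _apikey parameter, and the response
# is Elasticsearch-style JSON (hits -> hits -> _id). Check the docs above.
API_KEY = "YOUR_IMPORT_IO_API_KEY"
SEARCH_URL = "https://api.import.io/store/connector/_search"

def list_connector_guids(query=""):
    """Return (GUID, name) pairs for all connectors matching the optional query."""
    response = requests.get(SEARCH_URL, params={"_apikey": API_KEY, "q": query})
    response.raise_for_status()
    hits = response.json().get("hits", {}).get("hits", [])
    return [(hit.get("_id"), hit.get("fields", {}).get("name")) for hit in hits]

if __name__ == "__main__":
    for guid, name in list_connector_guids():
        print(guid, name)
```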
I need to crawl some data that sits behind a login page. To be able to scrape it, I need a tool that can log in and then crawl the pages behind the login. Is it possible to do this with import.io?
Short version: yes, it is.
Longer version:
There are at least two ways; both require you to sign up and download the desktop app (all free).
Extractor version (simpler):
Point the browser to the page where the login is. Log in normally, then train your API to extract the data you need. The downside of this method is that it will only work for as long as you are logged in. If you want import.io to log in for you, you'll need the authenticated version below.
Authenticated version:
As above, but create an authenticated API. This will record your login procedure and execute it for you every time you run the API.
Since the chosen answer doesn't work anymore :( I recommend Cloudscrape. You will get a free trial with 20 hours of crawling and/or scraping if you sign up. For data behind a login you will need a scraper.
Handy tutorials
Tutorial for logging in with scraper.
Tutorial for pagination.
I've been looking and searching for quite a while through https://developers.google.com/doubleclick-advertisers/reporting/v1.3/. What I'm trying to find is a way to get the data directly via the API. Is this possible?
As far as I can see, you can create a report, run that report (which generates a file), and then get the download URL for the file with the actual data in it. Is it not possible to get the data directly through the DFA Reporting API?
No, I don't think it is. As far as I know, the API is more like a programmatic recreation of the web interface; it doesn't operate the way I'd expect an API to.
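For reference, the indirect flow the question describes looks roughly like this with the Google API Python client. The method names (reports().insert/run, files().get) follow the dfareporting client's general surface; treat the IDs, the auth object, the report body, and the status strings as assumptions rather than verified v1.3 details.

```python
import time

from googleapiclient.discovery import build

# Sketch only: `authorized_http` and `profile_id` are placeholders, and a real
# report body needs criteria (date range, dimensions, metrics).
service = build("dfareporting", "v1.3", http=authorized_http)

report = service.reports().insert(
    profileId=profile_id,
    body={"name": "example_report", "type": "STANDARD"},
).execute()

# Running the report only generates a file; it does not return the rows.
report_file = service.reports().run(
    profileId=profile_id, reportId=report["id"]
).execute()

# The file is produced asynchronously, so poll until it is finished.
while report_file["status"] not in ("REPORT_AVAILABLE", "FAILED", "CANCELLED"):
    time.sleep(30)
    report_file = service.files().get(
        reportId=report["id"], fileId=report_file["id"]
    ).execute()

# All you get is a download URL for the generated file, not the data itself.
download_url = report_file["urls"]["apiUrl"]
```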
This question has less to do with actual code, and more to do with the underlying methods.
My 'boss' at my pseudo-internship has requested that I write him a script that will scrape a list of links from a user's tweets (the list comes 'round once per week, and it's always the same user) and then publish said list to the company's Tumblr account.
Currently, I am thinking about this structure: The base will be a bash script that first calls some script that uses the Twitter API to find the post given a hashtag and parse the list (current candidates for languages being Perl, PHP and Ruby, in no particular order). Then, the script will store the parsed list (with some markup) into a text file, from where another script that uses the Tumblr API will format the list and then post it.
Is this a sensible way to go about doing this? So far in planning I'm only up to getting the Twitter post, but I'm already stuck between using the API to grab the post or just grabbing the feed they provide and attempting to parse it. I know it's not really a big project, but it's certainly the largest one I've ever started, so I'm paralyzed with fear when it comes to making decisions!
From your description, there's no reason you shouldn't be able to do it all in one script, which would simplify things unless there's a good reason to ferry the data between two scripts. And before you go opening connections manually, there are libraries written for many languages for both Tumblr and Twitter that can make your job much easier. You should definitely not try to parse the RSS feed - they provide an API for a reason.*
I'd personally go with Python, as it is quick to get up and running and has great libraries for such things. But if you're not familiar with that, there are libraries available for Ruby or Perl too (PHP less so). Just Google "{platform} library {language}" - a quick search gave me python-tumblr, WWW::Tumblr, and ruby-tumblr, as well as python-twitter, Net::Twitter, and a Ruby gem "twitter".
Any of these libraries should make it easy to connect to Twitter to pull down the tweets for a particular user or hashtag via the API. You can then step through them, parsing it as needed, and then use the Tumblr library to post them to Tumblr in whatever format you want.
You can do it manually - opening and reading connections or, even worse, screen scraping, but there's really no sense in doing that if you have a good library available - which you do - and it's more prone to problems, quirks, and bugs that go unnoticed. And as I said, unless there's a good reason to use the intermediate bash script, it would be much easier to just keep the data within one script, in an array or some other data structure. If you need it in a file too, you can just write it out when you're done, from the same script.
*The only possible complication here is if you need to authenticate to Twitter - which I don't think you do, if you're just getting a user timeline - they will be discontinuing basic authentication very soon, so you'll have to set up an OAuth account (see "What is OAuth" over at dev.twitter.com). This isn't really a problem, but it makes things a bit more complicated. The API should still be easier than parsing the RSS feed.
Your approach seems appropriate.
Use the user_timeline Twitter API to fetch all tweets posted by the user.
Parse the fetched list (perhaps with a regex) to extract the links from the tweets and store them in an external file.
Post those links to the Tumblr account using the Tumblr write API.
You may also want to track the last fetched tweet ID so that the next run can continue from that tweet.
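A minimal sketch of these steps in Python, using python-twitter for the timeline and pytumblr (a maintained alternative to the python-tumblr mentioned earlier) for posting; every credential, the screen name, and the blog name are placeholders:

```python
import re

import pytumblr          # Tumblr API client (pytumblr on PyPI)
import twitter           # Twitter API client (python-twitter on PyPI)

# Placeholder credentials: Twitter keys come from dev.twitter.com,
# Tumblr keys from the Tumblr OAuth apps page.
twitter_api = twitter.Api(
    consumer_key="...", consumer_secret="...",
    access_token_key="...", access_token_secret="...",
)
tumblr_client = pytumblr.TumblrRestClient(
    "consumer_key", "consumer_secret", "oauth_token", "oauth_secret"
)

def links_from_latest_tweets(screen_name, since_id=None):
    """Fetch the user's timeline and pull every link out of the tweet text."""
    tweets = twitter_api.GetUserTimeline(screen_name=screen_name, since_id=since_id)
    links = []
    for tweet in tweets:
        links.extend(re.findall(r"https?://\S+", tweet.text))
    # Remember the newest tweet id so the next run can start after it.
    newest_id = max((t.id for t in tweets), default=since_id)
    return links, newest_id

links, last_seen_id = links_from_latest_tweets("some_user")
if links:
    tumblr_client.create_text(
        "yourblog.tumblr.com",
        title="This week's links",
        body="\n".join(links),
    )
```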