Is scraping Google+ pages/comments/notifications via cURL legal? - google-plus

The limitations of the Google+ API have just put a hold on a little project I am working on.
I can achieve what I need with a basic cURL script (log in to the Google+ page, scrape the page, parse the data), but I was just wondering whether this is allowed?
(yes the script will break whenever they update G+, I can live with that)
A search on "are you allowed to login to google with curl" produces lots of results, so it seems lots of people are doing it. Does anyone know whether it is "really" allowed?

I am not an attorney; always seek advice from legal professionals.
However, my take on this is that it is legal. Nothing restricts you from crawling websites and performing data mining, as long as you do not waste the server's resources (as in a DDoS attack) and do not use any illegal means to obtain the information, such as exploiting the server software or a vulnerability that might expose user data.
If the information is publicly available online it belongs to the public domain, and as long as you are not selling it, it can be considered fair use.
On the other hand, you are violating an awful lot of user agreements.
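To keep things on the polite side, the usual baseline is to honour robots.txt and throttle your requests. A minimal sketch of that baseline in Python (the target site, user agent string and delay are placeholders, not anything specific to Google+):

    import time
    import urllib.robotparser
    import urllib.request

    USER_AGENT = "my-little-project/0.1"      # placeholder; identify yourself honestly
    BASE = "https://example.com"              # placeholder target site

    robots = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
    robots.read()

    def polite_get(path, delay=2.0):
        """Fetch one page only if robots.txt allows it, then pause between requests."""
        url = BASE + path
        if not robots.can_fetch(USER_AGENT, url):
            raise PermissionError("robots.txt disallows " + url)
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        time.sleep(delay)   # throttle so the server is never hammered
        return html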

Related

Instagram Automation without API allowed?

My two partners and I are about to create a piece of software that automates liking, commenting and following on Instagram using browser simulation (that is, we log into the user's account through a browser such as Google Chrome).
Is that kind of automation allowed by Instagram? And if not, is there a possibility of getting approved?
Yes, it's against their terms. I wouldn't bother, nor risk it. Instagram is actively suing bot services. Look at the biggest bot service, Instagress, which mysteriously shut down entirely.
They're also penalizing accounts that use bots. I run an agency and have seen my clients' engagement mysteriously drop by 50-90% for a seemingly endless amount of time after using bots.
I imagine the purpose of doing it with "browser simulation" like Chrome is to try to avoid detection? Good luck. Instagram is smart and of course has some of the best programmers in the world who know how to combat this type of stuff.
I would say that such an operation goes against Instagram's terms of use. Under "General Description", section 10:
We prohibit crawling, scraping, caching or otherwise accessing any content on the Service via automated means, including but not limited to, user profiles and photos (except as may be the result of standard search engine protocols or technologies used by a search engine with Instagram's express consent).
Since you will be accessing content (and performing actions) via automated means, I would interpret that as a violation of this section.

Rightmove API and scraping: technical and legal

I'm looking to build an app using property data. Nestoria has a free API with rules of use, and Zoopla has an API you register for. OnTheMarket and Rightmove have the same terms of use to the letter (bizarre for competitors?). Rightmove advertise an API for upload but not download; I can't find anything for OnTheMarket.
I've discovered that Rightmove does have an API, although the postcode search is obfuscated by their own outcode mappings...
https://api.rightmove.co.uk/api/sale/find?index=0&sortType=1&numberOfPropertiesRequested=2&locationIdentifier=OUTCODE%5E1&apiApplication=IPAD
I'm wary of using an API that's not promoted. The alternative is scraping, which is technically harder and legally questionable, although from what I read the data is in the public domain and so free to use.
I've contacted Rightmove but got no response.
Is anyone using the Rightmove API and has had it authorised by them? It seems most strange that it's open and available but barely mentioned when searching for it.
Can anyone clarify what rules/law/ethics are in place for scraping data?
Don't query their hidden API. But you can run a web crawler on the rightmove.co.uk website, and it is perfectly legal as outlined in their Terms of Service under section 3.3:
You must not use or attempt to use any automated program unless the automated program identifies itself uniquely in the User Agent field and is fully compliant with the Robots Exclusion Protocol
A web crawler like Apache Nutch follows the Robots Exclusion Protocol to the letter. From their robots.txt file I found that they have elaborate nested sitemap.xml files, so they rather encourage organized but polite crawling of their website. I wanted their data myself, so I am starting on my endeavour to crawl them with my own resources; do let me know if you need access to this data.
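A do-it-yourself version of what Nutch gives you, i.e. a crawler that identifies itself uniquely in the User-Agent field and honours the Robots Exclusion Protocol, might look roughly like this sketch in Python (the User-Agent string and bot URL are made up, and nothing here is endorsed by Rightmove):

    import time
    import urllib.request
    import urllib.robotparser
    import xml.etree.ElementTree as ET

    # A unique, honest User-Agent, as section 3.3 asks for (name and URL are invented).
    UA = "MyPropertyResearchBot/1.0 (+https://example.com/bot-info)"
    SITE = "https://www.rightmove.co.uk"

    rp = urllib.robotparser.RobotFileParser(SITE + "/robots.txt")
    rp.read()

    def fetch(url):
        """Fetch a URL only if robots.txt allows it, identifying ourselves, then wait."""
        if not rp.can_fetch(UA, url):
            return None
        req = urllib.request.Request(url, headers={"User-Agent": UA})
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
        time.sleep(5)          # be polite: only a handful of requests per minute
        return body

    def sitemap_urls(sitemap_url):
        """A sitemap (possibly nested) is just XML full of <loc> elements."""
        body = fetch(sitemap_url)
        if body is None:
            return []
        ns = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
        return [loc.text for loc in ET.fromstring(body).iter(ns + "loc")]

    # robots.txt can advertise sitemaps; Python 3.8+ exposes them via site_maps().
    # for sm in (rp.site_maps() or []):
    #     print(sitemap_urls(sm)[:10])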
You are not allowed to scrape their data; here is what their terms & conditions say about it:
"You must not use or attempt to use any automated program (including, without limitation, any spider or other web crawler) to access our system or this Site. You must not use any scraping technology on the Site. Any such use or attempted use of an automated program shall be a misuse of our system and this Site. Obtaining access to any part of our system or this Site by means of any such automated programs is strictly unauthorised."

Mailchimp API: Add emails to my app's users' email lists

Is it possible to use the Mailchimp API to subscribe emails to lists in MY USERS' Mailchimp accounts and not my own?
Basically I have a web app, and users collect emails of various subscribers through this app. I then want them to be able to click a button and subscribe all those emails to their lists.
I've looked at Mailchimp's API - particularly the /lists/subscribe and the /lists/batch-subscribe methods. However, so far it appears that these will only work for your own Mailchimp account and not for remote users' accounts.
Can someone please tell me whether what I'm trying to achieve is possible with Mailchimp's API?
You would need to execute the API calls with your users' API keys, which means that you execute the calls with their credentials.
There are three different ways to get their API keys, with different levels of practicality.
You guess. Keys look like GUIDs without dashes, plus some information about which datacenter they are associated with. Some easy (and somewhat rough) calculations indicate that there are 2^128 possible API keys in every datacenter, so this will consume both CPU and network resources, and invoke the rage of the Mailchimp mascot. He won't be as pleasant if you choose this alternative. Don't do this.
You ask, in an evil way, for their username/password. This is bad, since it gives you access to every account those credentials work with. It would also give you access to things that aren't available through API calls (like payment details). It won't work at all if your users are intelligent administrators using AlterEgo, the two-factor security option. This alternative is less bad than blindly guessing, but it still provides too much access, if it works at all.
You ask, in a user-friendly way (with perhaps some quick tutorials), for the user to generate an API key in Mailchimp and provide it to you. This is the Good Alternative (tm) - see the sketch below.
You may choose any implementation as long as you choose number three.
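A minimal sketch of option three, assuming the current Mailchimp Marketing API (v3) rather than the older /lists/subscribe endpoints; the key, list ID and address below are made up:

    import requests   # third-party: pip install requests

    def subscribe(user_api_key, list_id, email):
        """Add one subscriber to a list owned by the user who supplied the key."""
        # v3 keys end with the datacenter, e.g. "...-us6"
        dc = user_api_key.rsplit("-", 1)[-1]
        url = f"https://{dc}.api.mailchimp.com/3.0/lists/{list_id}/members"
        resp = requests.post(
            url,
            auth=("anystring", user_api_key),   # basic auth: any username plus the key
            json={"email_address": email, "status": "subscribed"},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()

    # Hypothetical usage - the key and list id belong to *your user's* account, not yours:
    # subscribe("0123456789abcdef0123456789abcdef-us6", "a1b2c3d4e5", "someone@example.com")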

Design an API for a web service without "selling the farm"?

I'm going to try to phrase this as a generic question.
A company runs a website that has a lot of valuable information on it. This information is queried from an internal private database. So technically, the information in the database is the valuable part.
If this company wished to develop an API that developers could use to access their database of valuable & useful information, what approach should the company take?
It's important to give developers what they need. But it is also important to keep competing websites from using the API to copy everything and essentially steal all traffic from the company's website.
Is there some way the API could be used that drives traffic back to the original company's website? Something that gives users a reason to keep going there.
This is a design consideration that my company is struggling with that I can imagine other web-based services have come across before.
Institute API keys - don't make the API publicly accessible. Maybe make the signup process more involved than "anyone with an e-mail address".
Rate limit the API based on keys. If you're running more than X requests a minute, you're likely mining the database (a minimal rate-limit sketch follows this list).
Don't provide a "fetch everything" API. Make users already know something before they can get information on it. Don't reveal everything you know.
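A rough sketch of that per-key rate limit in Python (in-memory only, with a fixed one-minute window; the threshold is a placeholder, and a real service would back this with something like Redis):

    import time
    from collections import defaultdict, deque

    MAX_REQUESTS_PER_MINUTE = 60   # the "X requests a minute" threshold; tune to taste

    # api key -> timestamps of its recent requests (in-memory; a sketch, not cluster-safe)
    _recent = defaultdict(deque)

    def allow_request(api_key):
        """Return True if this key is still under its one-minute budget."""
        now = time.time()
        window = _recent[api_key]
        while window and now - window[0] > 60:
            window.popleft()               # forget requests older than a minute
        if len(window) >= MAX_REQUESTS_PER_MINUTE:
            return False                   # likely mining the database; reject or throttle
        window.append(now)
        return True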
I've seen a lot of companies giving out API keys and stating a TOS that all developers must adhere to. For example, any page that uses data from the API must include your logo and a link back to your website. If any developer is found breaking the rules, the API key can be cancelled and your data is safe again.
Who is meant to use the API?
A good general method of solving this problem is to limit access to the data to end users (rather than allowing applications or developers at it). Provide applications and users each with their own identification, and make sure that accessing a subset of the data requires a combination of both a user key and an application key (a small sketch follows below).
Following this pattern, each user will have access to a very limited subset of the data (presumably, the data that they require for their own specific use), and you can put measures in place to enforce this. Any attempts at data-mining will become obvious.
This type of approach meshes well with capability-type security models on the server side.
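One way to sketch that user-plus-application check on the server side (the key stores and the per-user scoping rule here are hypothetical):

    # Hypothetical key stores; in practice these live in your database.
    APP_KEYS = {"app-123": "WeatherWidget"}
    USER_KEYS = {"user-abc": {"allowed_records": {"r1", "r2"}}}

    def authorize(app_key, user_key, record_id):
        """Serve a record only when a registered application AND a known end user ask,
        and only if that user is entitled to this particular record."""
        if app_key not in APP_KEYS or user_key not in USER_KEYS:
            return False
        # Each user only ever sees their own small slice of the data, so any attempt
        # at bulk mining through a single account stands out immediately.
        return record_id in USER_KEYS[user_key]["allowed_records"]

    # authorize("app-123", "user-abc", "r1")   -> True
    # authorize("app-123", "user-abc", "r999") -> False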

How would you go about making an application that automatically retrieves your bank account balance twice a day?

I'm building a utility that will hopefully keep my wife in tune with how much money we have available.
I need a simple secure way of logging into my bank account and retrieving the balance.
Something like mechanize is the only method I can think of. I'm not even sure whether that would work, given the properly authenticated HTTPS that banks use.
Any ideas?
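To make it concrete, the kind of thing I'm imagining is roughly this sketch in Python (the login URL, form field names and balance selector are made up and would differ for every bank, and it wouldn't cope with JavaScript-heavy logins):

    import requests                  # third-party: pip install requests
    from bs4 import BeautifulSoup    # third-party: pip install beautifulsoup4

    LOGIN_URL = "https://bank.example.com/login"        # made up
    BALANCE_URL = "https://bank.example.com/accounts"   # made up

    def fetch_balance(username, password):
        with requests.Session() as s:          # the session keeps the login cookies
            login = s.post(LOGIN_URL,
                           data={"username": username, "password": password},
                           timeout=30)
            login.raise_for_status()
            page = s.get(BALANCE_URL, timeout=30)
            page.raise_for_status()
            soup = BeautifulSoup(page.text, "html.parser")
            # Entirely invented selector; a real bank page needs inspecting by hand.
            return soup.select_one("#balance").get_text(strip=True)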
Write a Perl script using LWP::UserAgent. It supports HTTPS connections. The only issue might be if the site requires JavaScript.
Web Client Programming with Perl has a few examples to get you started if you're not too familiar with Perl.
If you really want to go there, get these extensions for Firefox: Live HTTP Headers, Firebug, FireCookie, and HttpFox. Also download cURL and a scripting language that can run cURL command-line tasks (or a scripting language like PHP or Perl that has access to cURL libraries directly).
I've started down this road for some idempotent GET tasks like getting PDFs of the S&P reports (for the stocks I track) from my online brokerage, and downloading the check images for my bank account. Both tasks are repetitive, slow ways of getting data onto my computer, and the financial institutions don't provide any easier way to do it.
Here's why you shouldn't: (as a shortcut I'm going to call the archetypal large bank, brokerage, or other financial institution "BloatBank")
BloatBank is not likely to make public their API for accessing this kind of information. So it can change any time and all your hard work will be for naught. Whenever they change their mechanism, you'll have to adapt.
If BloatBank finds out you've been using automatic scripting to try to access your account information, they may ban you because you've violated their terms of service.
You might screw up, and the interaction between the hodgepodge of scripts on BloatBank's server, and your scripts that access your account, might cause a Bad Thing like closing your account. Testing this kind of script is tremendously difficult because you don't have any documentation about how their online service works, and you don't have a test account you can mess with.
(a variant of the above) You think you're safe because you're issuing GET requests. But BloatBank is just a crazy bank that doesn't know anything about REST, so there are some GET requests that can mess up your account.
If someone else does use your script to maliciously sniff your online password or mess with your account, any liability coverage from BloatBank may disappear because you've opened a security hole.
Why don't you teach your wife how to login to the bank herself? Or use Quicken (or Mint, etc) and teach her how to use the auto-download feature?
Have you checked out Watir? It is fantastic for automating web-browser actions. And since it's written in Ruby, you can take the results and store them in a DB (or email them to yourself) if needed.
If you are open to AIR, I'd say build an AIR app. I have worked with mechanize and I think it's cool. AIR gives you similar features with a richer GUI (see HTMLLoader and DOM manipulation of web pages).
If I were you, I'd simply pull the page and manipulate the DOM to suit my visual needs.
Please, if you find this easy to do for your bank please post your bank's name. If I have the same one I'll be closing my account.
More to your question: the process of loading a web page inside your code rather than in a browser can be a black art, especially if there is any JavaScript involved. Your best bet would probably be embedding the IE Web Browser control in your app and then simulating key strokes and mouse clicks to arrive at your balance page. Then scrape the HTML for the balance.
I could try paying for Quicken and letting it do the balance downloading. Then I'd just need to find a way to get the number out of the software automatically.
This way I'm not violating any terms of service and I'm also reducing security risk since all "hacking" goes on locally.