Selenium-based malware (malvertising) checking - A few questions - api

We recently had an issue where an advertiser who purchased advertisements via a third party was distributing malware through those ads.
This led to Google blacklisting our web property for a short period of time.
This issue is now resolved.
After this happened, we decided that we will self-audit our advertisers.
After searching the web for services that do this, we found a few... Armorize (www.armorize.com), amongst others, provides this type of service, but after speaking with their sales team on the telephone we found that they charge approximately 10K-15K USD / year. Way out of our price range.
We don't have that kind of cake.
What we do have is a smart head on our (err, my) shoulders.
So, here is what I have developed.
A) Selenium running Firefox.
B) Firefox proxying all requests via a locally hosted Squid proxy.
The result?
Pipe in the advertiser's URL -> Selenium Firefox -> Squid access log -> a nice clean list of all URLs hit by the advertisement(s).
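In case it helps anyone, the Selenium piece looks roughly like this. A minimal sketch in Python, assuming the Selenium Firefox driver, geckodriver on the PATH, and a Squid instance on 127.0.0.1:3128; the proxy address and the advertiser URL are placeholders:

    # Minimal sketch: drive Firefox through a local Squid proxy so that every
    # request made by the ad page ends up in Squid's access log.
    # Assumes the Selenium Python bindings, geckodriver on the PATH, and Squid
    # listening on 127.0.0.1:3128 (all placeholders/assumptions).
    import time
    from selenium import webdriver

    SQUID_HOST = "127.0.0.1"
    SQUID_PORT = 3128

    options = webdriver.FirefoxOptions()
    options.set_preference("network.proxy.type", 1)           # manual proxy configuration
    options.set_preference("network.proxy.http", SQUID_HOST)
    options.set_preference("network.proxy.http_port", SQUID_PORT)
    options.set_preference("network.proxy.ssl", SQUID_HOST)
    options.set_preference("network.proxy.ssl_port", SQUID_PORT)

    driver = webdriver.Firefox(options=options)
    try:
        # Placeholder advertiser landing page; in practice this comes from the ad system.
        driver.get("http://advertiser.example.com/landing-page")
        time.sleep(15)   # crude: give the ad's scripts time to fire their requests
    finally:
        driver.quit()

    # Afterwards, parse Squid's access.log for every URL that was requested.

Whether you wait a fixed time or watch the access log settle is a judgement call; the fixed sleep is just the simplest thing that works.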
The next step was to test these against some sort of malware list. We are now testing them against Google's Safe Browsing API ( https://developers.google.com/safe-browsing/ ).
The result is exactly what we wanted: a way to test, via a real browser, each of the URLs hit by our advertisers.
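The check itself can be scripted too. A hedged sketch, assuming the v4 Lookup API (threatMatches:find) and an API key; the request shape below is from memory of the Safe Browsing docs, so verify it against the current reference before relying on it:

    # Hedged sketch: check a batch of URLs against the Google Safe Browsing
    # Lookup API (v4 threatMatches:find). The API key and client id are
    # placeholders; confirm threat types and request shape in the current docs.
    import requests

    API_KEY = "YOUR_API_KEY"  # placeholder
    ENDPOINT = "https://safebrowsing.googleapis.com/v4/threatMatches:find?key=" + API_KEY

    def check_urls(urls):
        body = {
            "client": {"clientId": "inhouse-ad-audit", "clientVersion": "1.0"},
            "threatInfo": {
                "threatTypes": ["MALWARE", "SOCIAL_ENGINEERING", "UNWANTED_SOFTWARE"],
                "platformTypes": ["ANY_PLATFORM"],
                "threatEntryTypes": ["URL"],
                "threatEntries": [{"url": u} for u in urls],
            },
        }
        resp = requests.post(ENDPOINT, json=body, timeout=30)
        resp.raise_for_status()
        # An empty JSON object means nothing was flagged; otherwise "matches"
        # lists each flagged URL together with its threat type.
        return resp.json().get("matches", [])

    # The URL list would normally come from the parsed Squid access log.
    for match in check_urls(["http://advertiser.example.com/landing-page"]):
        print(match["threat"]["url"], match["threatType"])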
So, the questions are as follows:
a) Is using their (Google's) API like this acceptable as far as Google is concerned? We will be keeping this 100% in-house and will not be reselling this service. It's 100% for us.
b) Does the Google Safe Browsing API allow checking of FULL URLs, or does it work only on a per-domain basis?
c) Does anyone know of any other APIs against which we can test these URLs? Free / low cost would be great :)
Thanks!

a. Reviewing the Safe Browsing API Terms of Service together with the Google APIs Terms of Service, I cannot find anything you are doing that falls outside of them.
b. The docs consistently refer to URLs rather than domains. Having performed some tests (e.g. liderlab.ru/absa/ vs. liderlab.ru/absa/page/1), the first is a phishing site and gives the appropriate warning, whereas the second doesn't.
c. PhishTank is good and free and seems to be a little more current than Google (from a brief investigation). BrightCloud is a reasonably priced paid service. URL Blacklist is a paid service that works on an honour system, so you can see their data.
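For completeness, PhishTank can also be queried per URL. A rough sketch, assuming their checkurl endpoint and an application key; the parameter and response field names are from memory, so check PhishTank's API docs:

    # Rough sketch: ask PhishTank about a single URL via its checkurl API.
    # The endpoint, parameters and response fields are from memory -- verify
    # against PhishTank's documentation. The app key is a placeholder.
    import requests

    PHISHTANK_ENDPOINT = "https://checkurl.phishtank.com/checkurl/"
    APP_KEY = "YOUR_PHISHTANK_APP_KEY"  # placeholder

    def phishtank_lookup(url):
        resp = requests.post(
            PHISHTANK_ENDPOINT,
            data={"url": url, "format": "json", "app_key": APP_KEY},
            timeout=30,
        )
        resp.raise_for_status()
        # Fields such as "in_database" and "verified" indicate whether the URL
        # is a known (and confirmed) phish.
        return resp.json().get("results", {})

    print(phishtank_lookup("http://liderlab.ru/absa/"))  # example URL from the tests above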


Workarounds for Safari ITP 2.3

I am very confused as to how Safari’s ITP 2.3 works in certain respects, and why sites can’t easily circumvent it. I don’t understand under what circumstances limits are applied, what the exact limits are, what they are applied to, and for how long.
To clarify my question I broke it down into several cases. I will be referring to Apple’s official blog post about ITP 2.3 [1] which you can quote from, but feel free to link to any other authoritative or factually correct sources in your answer.
For third-party sites loaded in iframes:
Why can’t they just use localStorage to store the values of cookies, and send this data along not as actual browser cookies 🍪, but as data in the body of the request? Similarly, they can parse the response to update localStorage. What limits does ITP actually place on localStorage in third-party iframes?
If the localStorage is frequently purged (see question 1), why can’t they simply use postMessage to tell a script on the enclosing website to store some information (perhaps encrypted) and then spit it back whenever it loads an iframe?
For sites that use link decoration:
I still don’t understand what the limits on localStorage are for third-party sites in iframes which did NOT get classified as link decorator sites. But let’s say they are link decorator sites. According to [1], Apple only starts limiting things further if there is a query string or fragment. But can’t a website rather trivially store this information in the URL path before the query string, i.e. /in/here without ?in=here … certainly large companies like Google could trivially choose to do that?
In the case where a site has been labeled as a tracking site, does that mean all its non-cookie data is limited to 7 days? What about cookies set by the server, aren’t they exempted? If so, then simply make a request to your server to set the cookie instead of using JavaScript. After all, the operator of the site is very likely to also have access to its HTTP server and app code.
For all sites:
Why can’t a service like Google Analytics or Facebook’s widgets simply convince a site to add an extra CNAME to their DNS and get Google’s and Facebook’s servers under a subdomain like gmail.mysite.com or analytics.mysite.com? And then boom, they can read and set cookies again, in some cases even on the top-level domain for website owners who don’t know better. Doesn’t this completely defeat the goals of Apple’s ITP, since Google and Facebook have now become a “second party” in some sense?
Here on StackOverflow, when we log out on iOS Safari, the StackOverflow network is able to log us out of multiple sites at once … how is that even accomplished if no one can track users across websites? I have heard it said that “second-party cookies” can still be stored, but what exactly makes a second-party cookie different from a third-party one?
My question is broken down into 6 cases but the overall theme is, in each case: how does Apple’s latest ITP work in that case, and how does it actually block all cases of potentially malicious tracking (to the point where a well-funded company can’t just do the workarounds above) while at the same time allowing legitimate use cases?
[1] https://webkit.org/blog/9521/intelligent-tracking-prevention-2-3/

Is Instagram automation without the API allowed?

My two partners and I are about to create software that automates liking, commenting and following on Instagram using browser simulation (that means we log into the user's account through a browser, like Google Chrome).
Is that kind of automation allowed by Instagram? And if not, is there a possibility of getting approved?
Yes, it's against their terms. I wouldn't bother, nor risk it. Instagram is actively suing bot services. Look at the biggest bot service, Instagress - it mysteriously shut down entirely.
They're also penalizing accounts that use bots. I run an agency and have seen my clients' engagement mysteriously drop by 50-90% for a seemingly endless amount of time after using bots.
I imagine the purpose of doing it with "browser simulation" like Chrome is to try to avoid detection? Good luck. Instagram is smart and of course has some of the best programmers in the world who know how to combat this type of stuff.
I would say that such an operation goes against Instagram's terms of use. Under "General Description", section 10:
We prohibit crawling, scraping, caching or otherwise accessing any content on the Service via automated means, including but not limited to, user profiles and photos (except as may be the result of standard search engine protocols or technologies used by a search engine with Instagram's express consent).
Since you will be accessing content (and performing actions) via automated means, I would interpret that as a violation of this section.

Can someone explain to me what an API is?

I've googled it, yet couldn't understand it properly. Not sure if it's a library or an intra-server communicator.
Can someone explain to me, at a high level / low level, what is meant by an API?
http://en.wikipedia.org/wiki/Application_programming_interface
Read it from here; it will hopefully clear up most of your doubts.
API stands for Application Programming Interface, which means taking an existing program or piece of code and accessing it with your own code.
===
Example, Search Engine:
Search engine 1: offers search and an API (if you want, this can be Google)
Search engine 2: uses Google's API to get results (this is yours)
To get results, you basically query the other search engine and pull its results into yours
====
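To make that concrete, here is a tiny hypothetical sketch in Python: the endpoint, query parameter and response shape are invented for illustration, but the pattern is exactly "search engine 2 calling search engine 1's API":

    # Hypothetical sketch of "search engine 2" reusing "search engine 1"'s API.
    # The endpoint, query parameter and JSON layout are made up for illustration.
    import requests

    def search_via_other_engine(query):
        # Ask the other search engine's API for results...
        resp = requests.get(
            "https://api.search-engine-one.example.com/search",  # made-up endpoint
            params={"q": query},
            timeout=10,
        )
        resp.raise_for_status()
        # ...and present its results as part of our own service.
        return resp.json().get("results", [])

    for hit in search_via_other_engine("what is an api"):
        print(hit)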
An API can be used in many ways, e.g. to access others' data or code.
An in-depth explanation can be found here: http://en.wikipedia.org/wiki/Application_programming_interface
An application programming interface (API) is a set of programming instructions and standards for accessing a Web-based software application or Web tool. A software company releases its API to the public so that other software developers can design products that are powered by its service.
For example, Amazon.com released its API so that Web site developers could more easily access Amazon's product information. Using the Amazon API, a third party Web site can post direct links to Amazon products with updated prices and an option to "buy now."
An API is a software-to-software interface, not a user interface. With APIs, applications talk to each other without any user knowledge or intervention. When you buy movie tickets online and enter your credit card information, the movie ticket Web site uses an API to send your credit card information to a remote application that verifies whether your information is correct. Once payment is confirmed, the remote application sends a response back to the movie ticket Web site saying it's OK to issue the tickets.
As a user, you only see one interface -- the movie ticket Web site -- but behind the scenes, many applications are working together using APIs. This type of integration is called seamless, since the user never notices when software functions are handed from one application to another.
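As a hypothetical illustration of that movie-ticket flow (every name, endpoint and field below is invented), the ticket site's backend might do something like this:

    # Hypothetical illustration of the movie-ticket example: one application
    # (the ticket site) talking to another (a payment verifier) through an API.
    # Endpoint and field names are invented for illustration only.
    import requests

    def card_is_valid(card_number, expiry, amount):
        resp = requests.post(
            "https://api.payment-verifier.example.com/verify",  # made-up endpoint
            json={"card_number": card_number, "expiry": expiry, "amount": amount},
            timeout=15,
        )
        resp.raise_for_status()
        return resp.json().get("approved", False)

    if card_is_valid("4111111111111111", "12/30", 24.00):
        print("OK to issue the tickets")   # the user only ever sees the ticket site
    else:
        print("Payment declined")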
This article shows an example
http://www.codeproject.com/Tips/127316/Integrate-FB-javascript-API-to-your-asp-net-app-to

Is it a bad idea to have a web browser query another API instead of my site providing it?

Here's my issue. I have a site that provides some investing services. I pay for end-of-day data, which is all I really need for my service, but I feel it's a bit odd when people check in during the day and it only displays yesterday's closing price. End of day is fine for my analytics, but I want to display delayed quotes on my site.
According to Yahoo's YQL FAQ, if you use IP-based authentication then you are limited to 1,000 calls/day/IP. If my site grows I may exceed that, but I was thinking of pushing this request to the people browsing my site themselves, since it's extremely unlikely that the same IP will visit my site 1,000 times a day (my site itself has no use for this info). I would call a URL from their browser, then parse the results so I can show them in the format of the site's template.
I'm new to web development, so I'm wondering: is it common practice, or a bad idea, to have the user's browser make the API call itself?
It is not a bad idea at all:
You stretch the rate limits this way;
Your server will respond faster (since it does not have to contact the API);
Your page will load faster because the initial response is smaller;
You can load the remaining data from the API asynchronously while your UI is already responsive.
Generally speaking, it is a great idea to talk to APIs from the client. It's more dynamic, you spread the traffic, you get more responsiveness, etc.
The biggest downside I can think of is depending on the availability of other services. On the other hand, your server(s) will be stressed less because the traffic is spread out.
Hope this helped a bit! Cheers!

How would you go about making an application that automatically retrieves your bank account balance twice a day?

I'm building a utility that will hopefully keep my wife in tune with how much money we have available.
I need a simple secure way of logging into my bank account and retrieving the balance.
Something like mechanize is the only method I can think of. I'm not even sure if that would work, given the properly authenticated HTTPS that banks use.
Any ideas?
Write a Perl script using LWP::UserAgent. It supports HTTPS connections. The only issue might be if the site requires JavaScript.
Web Client Programming with Perl has a few examples to get you started if you're not too familiar with Perl.
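Since the question mentions mechanize, here is the same idea as a hedged sketch in Python using the mechanize port, assuming a plain form-based login with no JavaScript; the URL, form index and field names are placeholders for whatever your bank actually uses:

    # Hedged sketch with Python's mechanize: log in through a simple HTML form
    # and fetch the account page. The URL, form index, field names and the
    # balance-parsing step are placeholders -- every bank is different, and this
    # breaks entirely if the login flow requires JavaScript.
    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)

    br.open("https://www.examplebank.example.com/login")   # placeholder URL
    br.select_form(nr=0)                 # assume the first form on the page is the login form
    br["username"] = "your_username"     # placeholder field names
    br["password"] = "your_password"
    br.submit()

    html = br.response().read().decode("utf-8", errors="replace")
    # From here, parse `html` (regex or an HTML parser) to pull out the balance,
    # then store it or email it to yourself.
    print(len(html), "bytes of account page fetched")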
If you really want to go there, get these extensions for Firefox: Live HTTP Headers, Firebug, FireCookie, and HttpFox. Also download cURL and a scripting language that can run cURL command-line tasks (or a scripting language like PHP or Perl that has access to cURL libraries directly).
I've started down this road for some idempotent GET tasks, like getting PDFs of the S&P reports (for the stocks I track) from my online brokerage, and downloading the check images for my bank account. Both are repetitive, slow ways of downloading data to my computer, and the financial institutions don't provide any easier way to do it.
Here's why you shouldn't: (as a shortcut I'm going to call the archetypal large bank, brokerage, or other financial institution "BloatBank")
BloatBank is not likely to make public their API for accessing this kind of information. So it can change any time and all your hard work will be for naught. Whenever they change their mechanism, you'll have to adapt.
If BloatBank finds out you've been using automatic scripting to try to access your account information, they may ban you because you've violated their terms of service.
You might screw up, and the interaction between the hodgepodge of scripts on BloatBank's server, and your scripts that access your account, might cause a Bad Thing like closing your account. Testing this kind of script is tremendously difficult because you don't have any documentation about how their online service works, and you don't have a test account you can mess with.
(a variant of the above) You think you're safe because you're issuing GET requests. But BloatBank is just a crazy bank that doesn't know anything about REST, so there are some GET requests that can mess up your account.
If someone else does use your script to maliciously sniff your online password or mess with your account, any liability coverage from BloatBank may disappear because you've opened a security hole.
Why don't you teach your wife how to login to the bank herself? Or use Quicken (or Mint, etc) and teach her how to use the auto-download feature?
Have you checked out Watir? It is fantastic for automating web-browser actions. And since it's written in Ruby, you can take the results and store them in a DB (or email them to yourself) if needed.
If you are open to AIR, I'd say build an AIR app. I have worked with mechanize and I think it's cool. AIR gives you similar features with a richer GUI (see HTMLLoader and DOM manipulation of the web page).
If I were you, I'd simply pull the page and manipulate the DOM to suit my visual needs.
Please, if you find this easy to do for your bank please post your bank's name. If I have the same one I'll be closing my account.
More to your question: the process of loading a web page inside your code rather than in a browser can be a black art, especially if there is any JavaScript involved. Your best bet would probably be embedding the IE Web Browser control in your app and then simulating keystrokes and mouse clicks to arrive at your balance page. Then scrape the HTML for the balance.
I could try paying for Quicken and letting it do the balance downloading. Then I'd just need to find a way to get the number out of the software automatically.
This way I'm not violating any terms of service, and I'm also reducing the security risk since all the "hacking" goes on locally.