What free/paid search APIs allow for programmatic querying and caching/storage of the resulting data?

If you've done any serious research into search APIs, you know that most of them have a huge slew of TOS/TOU restrictions that make them nearly impossible to use in anything but the most inane applications.
Bing's 2.0 API, Yahoo Search BOSS, Google Places, Google AJAX Search (dead), et al, are far too restrictive for us. I need to run a finite and relatively small number of queries (perhaps 500k) one time only, storing specific data from the results for use within our application.
For example, we need to match up business names with their target websites (we have written the algorithm to make a 'best guess' from a set of results if necessary; we just need a vanilla result set). Also, we need to match an address to the company in question.
Unfortunately, I can find ZERO search APIs that will allow us to fire off queries in a programmatic, non-user-initiated manner.
We're even quite eager to give someone cold, hard cash for access to this kind of data; Google, Bing, Yahoo, and others simply seem to not want our money (as evidenced by their TOSes)...
Any thoughts?

A freely accessible index of 5 billion web pages, their page rank, their link graphs and other metadata, hosted on Amazon EC2.
http://commoncrawl.org/
Their Terms of Service (or TOU) are pretty reasonable and unrestricted too:
http://commoncrawl.org/about/terms-of-use/
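If it's useful, the index can also be queried over HTTP; here's a minimal sketch in Python (the crawl label is a placeholder for whichever crawl you pick, and the exact endpoint details may differ from what's current, so treat this as an illustration rather than gospel):

```python
# Sketch: look up captures of a domain in the Common Crawl URL index.
# CRAWL is a placeholder label; substitute a real crawl ID from the index site.
import json
import urllib.parse
import urllib.request

CRAWL = "CC-MAIN-2024-10"  # placeholder crawl label (assumption)

def lookup(domain: str):
    """Return index records for pages captured under `domain`."""
    query = urllib.parse.urlencode({"url": f"{domain}/*", "output": "json"})
    url = f"https://index.commoncrawl.org/{CRAWL}-index?{query}"
    with urllib.request.urlopen(url) as resp:
        # The endpoint returns one JSON object per line.
        return [json.loads(line) for line in resp.read().splitlines() if line]

if __name__ == "__main__":
    for record in lookup("example.com")[:5]:
        print(record.get("url"), record.get("status"))
```

Since the data itself sits in public storage, you can pull the raw captures referenced by these records and do your business-name/website matching offline, with no per-query TOS issues.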

If you know some Visual Basic, I'd suggest playing around with Bing Ad Intelligence. It's a free Excel plugin, and all you need to use it is a free Microsoft account.
The query limit is 20,000 words per query. You can get information on Clicks, Impressions, CTR, CPC, Average Bid and Total Cost. The query limit is a little lower if you use the more advanced keyword research features.

Related

How to optimise Google Translate API calls to translate multiple words in a single request

Hi everyone. I recently integrated Google Translate into my project, where it translates product names, product descriptions, and product-related category names. But because there are plenty of products in my database (and the number grows quickly), the Google Translate API would cost considerable money.
I want to call Google Translate as little as possible. Many words are the same across many products, for example: 阿迪达斯 - Adidas, 苹果 - iPhone, 篮球 - Basketball, and so on. I want to do some tricks to reuse these translations, but I can't come up with an approach.
Has anyone encountered this problem?
Any help would be appreciated.
It sounds like what you need is actually the ability to reuse translation at the string or substring level (in other words, per database entry). You can't really do that with Google, that I know of. You've got a few options, as I see it:
You could switch over to Microsoft Translator and use their methods that allow you to place translations yourself, such as their Collaborative Translation feature, which lets you override the MT with a preferred translation and even vote translations up/down. Quality here will be broadly comparable to Google (I often find it better), and you have methods at your disposal that allow this override. Also, unlike Google, the Microsoft API is free up to a certain volume. Take a look:
http://www.microsoft.com/en-us/translator/developers.aspx
Microsoft also has a unique feature called the Microsoft Translator Hub, which can, for example, use your own terminology for translations. However, depending on how you implement any solution with Microsoft, you might still have the problem that you are making more calls out to Microsoft than you'd like, and, moreover, that "matching" only takes place at the level of a whole record or string, so it would not cover the case of shared linguistic elements being concatenated into one string.
There's a commercial offering called GeoFluent (full disclosure--I am the product manager for this product, so I'm clearly biased :)) that works with Microsoft Translator but provides pre- and post-translation processing that can deal with sub-segment matches and may therefore reduce the volume you put through translation each time. It could make sense if, as you mention, you are rapidly adding to your database. Of course, this is a commercial offering too, so you'd have to balance the costs.
Let me know if this helps, and happy to answer any other questions you have.
Marcus
There is a PHP sample here: http://weblite.ca/svn/dataface/modules/tm/trunk/lib/googleTranslatePlugin.php
It allows you to send an array and get an array back: its getTranslations() method translates all of the user-provided strings into the target language using the Google Translate API and returns an array of source => target strings.
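Whichever client you use, the trick that saves money is string-level reuse: keep a local translation memory keyed by source text and only send cache misses to the paid API, in one batched call. A rough sketch of that idea (the `call_translate_api` function is a stand-in for whatever Google Translate client you actually use, not a real API):

```python
# Sketch: a local translation memory so repeated product names/terms are
# only ever sent to the paid API once.
import sqlite3

def call_translate_api(texts, target="en"):
    # Placeholder: swap in a real batched Google Translate call here.
    raise NotImplementedError

class TranslationMemory:
    def __init__(self, path="tm.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS tm (source TEXT PRIMARY KEY, target TEXT)")

    def translate(self, texts, target="en"):
        if not texts:
            return []
        placeholders = ",".join("?" * len(texts))
        cached = dict(self.db.execute(
            f"SELECT source, target FROM tm WHERE source IN ({placeholders})", texts))
        # Deduplicate before calling out; only untranslated strings go to the API.
        misses = [t for t in dict.fromkeys(texts) if t not in cached]
        if misses:
            fresh = call_translate_api(misses, target)  # one batched API call
            self.db.executemany(
                "INSERT OR REPLACE INTO tm VALUES (?, ?)", list(zip(misses, fresh)))
            self.db.commit()
            cached.update(zip(misses, fresh))
        return [cached[t] for t in texts]
```

If you also split product strings into shared fragments (brand, category, model) before translating, the cache hit rate goes up further, since "阿迪达斯" only ever costs you one call no matter how many products mention it.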

Does changing query string in Destination URL via Adwords API trigger editorial?

This is an ongoing debate at my company. We run massive Adwords accounts. I need to automate adding or changing query string parameters on large numbers of keywords. Our SEM team is nervous that this could cause the Adwords editorial staff to disable large numbers of keywords while they review the changes (effectively killing traffic).
I can understand why changing the domain or page in the URL might cause this. But changing the query string should not, I would think.
Can anybody confirm/deny this possibility?
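For context, the change itself is just query-string rewriting, independent of whatever review AdWords does afterwards. A rough sketch of what we'd be applying to each destination URL (the parameter name and value here are made up for illustration):

```python
# Sketch: add or update a tracking parameter on a destination URL.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def set_query_param(url: str, key: str, value: str) -> str:
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params[key] = value  # add the parameter, or overwrite an existing one
    return urlunsplit(parts._replace(query=urlencode(params)))

print(set_query_param("http://example.com/landing?src=google", "kwid", "{keyword}"))
# -> http://example.com/landing?src=google&kwid=%7Bkeyword%7D
```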
UPDATE 2013-03-19
I'm not sure this qualifies as an "answer". I posted this same question on Google Groups and, as usual, Google was noncommittal. However, this is what I have gathered so far (I'm sharing for posterity):
First of all, "editorial" doesn't seem to be a term that is widely accepted at Google. I'm saying this based on our discussions with our account reps. The term didn't mean much to them and required clarification. So, to clarify, when I say "editorial" I mean Google's process of reviewing your keyword term, ad copy, landing page, and potentially resetting statistics (quality score). This is mostly automated in their system from what I can tell. However, there are times when it appears that human beings actually get involved in the process.
And now the answer: it seems that modifying keyword destination URLs does not cause a manual review or a reset of statistics. Possible exceptions are trademarked terms or pharmaceutical terms. I translate that more broadly as "any terms that have special rules".
NOTE: Folks in the Google Groups seem to all agree that modifying URLs at the Creative level DOES reset statistics. So tread carefully.
Here is the Google Group thread.
UPDATE 2013-07-11
I'm just talking to myself now. I feel so alone.
I just received the "tumbleweed" award for this post.
The Adwords team finally told us that modifying Destination URLs should not affect traffic.
So, we modified several hundred thousand Destination URLs and our SEM Team reported that some of their high volume keywords stopped getting traffic for about 10 days and then magically came back. No explanation. The Adwords team dug up some "expert" from the bowels of their staff and subsequently told us that modifying Destination URLs does affect traffic.
However, we were also transitioning these accounts to Enhanced Campaigns at the same time. So the results are inconclusive in my mind.
I don't think even the Adwords team knows how Adwords works.
It's cold here. Need to make a fire...

How long does it take to do a yodlee implementation?

I'm a non-technical (well, non-software; hardware background) founder who has hired a pretty good developer who has built a site with a Rails backend and a CSS/HTML frontend quite capably. Our next step is to develop a Yodlee integration, and we both want to know how long it takes to do this. He has an estimate which I think is reasonable, but I would like feedback from the community without biasing the responses.
Also, if anybody has done an implementation before, I would really appreciate your perspective and help!
I have implemented a complex Yodlee integration for an LA-based start-up over the last two years. They built a social game and money management platform on top of it. The short answer is that it's tough and dirty work.
The technical aspect of getting your application to communicate with the Yodlee API is not the hard part at all (it's pretty much a standard web service). The following are some aspects highlighting the difficulty:
The most difficult part is dealing with the unknowns and the variability in the client data.
There is effectively no documentation for the API
There are several ways to do each operation, and they will return different data
I've been designing and building systems for 15 years and have gotten pretty good at estimating projects. We were way off with Yodlee; in fact, we are still dealing with issues.
In order to understand why it's so tough, you really need to understand what Yodlee is: an aggregator of 10,000 different systems. Now, these systems might be big professional systems like Bank of America, Chase, and so on, but they are often small banks (Bob's Bank in Omaha).
When Yodlee communicates with the big companies (they are called content services), there is almost always an API that actually returns good data. But with the little ones, they are doing screen scraping, and you can imagine that breaks all the time. They have an entire team in India focused on just that.
The other issue is modelling the data: each of the content services has modeled its data differently at the source (different names, different elements, different relationships, ...), but Yodlee must combine all 10,000 models into one view. What this leaves you with is a very bloated model, where you can never count on getting a certain data element.
To give you an idea: there are extra fields about a credit account (APR, credit amount, last payment, ...) beyond the standard base-class fields (balance, ...). While it sounds great to have this data, in practice the number of content services that provide these extra data elements is so low that you can't really depend on them. I'd say the fidelity of those data elements is very low. All you can really count on is the base elements (account name, type, balance) and (transaction date, description, and type).
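To make that concrete, here is a rough sketch of treating everything beyond the base fields as optional (the field names here are made up for illustration, not Yodlee's actual schema):

```python
# Sketch: defensive parsing of an aggregated account record.
def parse_account(raw: dict) -> dict:
    return {
        # Base elements you can generally rely on.
        "name": raw["accountName"],
        "type": raw["accountType"],
        "balance": raw["balance"],
        # Extra credit-card fields: present for only a minority of
        # content services, so default them rather than assume them.
        "apr": raw.get("apr"),
        "credit_amount": raw.get("creditAmount"),
        "last_payment": raw.get("lastPayment"),
    }
```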
Speaking of transactions: their transaction categorization system is not that good. They have clearly taken a breadth-first approach to this rather than focusing on accuracy. We built an entire system for transaction categorization which is far more effective.
A couple of other things: the DAG test-account system is useless; it does not operate the same way real accounts do. You will be far better off opening 5-10 accounts at different content services and giving your developers the usernames/passwords for these for testing. The MFA (multi-factor auth) system for account security has been an endless headache. This isn't Yodlee's fault; it's the nature of the game. The banks are doing more and more crazy things that add security layers, and Yodlee has the MFA system in place to compensate for this. At any given time, about 20% of our accounts are in an error state for some reason. We have built an entire component just to manage this.
So what does this all mean? Double your estimate and get ready to get dirty. I don't want to put Yodlee down at all (except for the lack of documentation); they really are solving a hard problem, and there really aren't any better options.
I run the team responsible for sales and support of the Yodlee APIs, so my response may be a little biased.
I have seen clients get up and running in anywhere from 10 days to 3 months to 6 months. The time to implement depends on the number of fields in the data model you are using and how you are going to use the data or manipulate it before presenting it to your users.
While the most prevalent data fields such as account balance or transaction amount will always be available, Craig is right, as you get into the broader data model you will have to code for exceptions when the data is not there. Yodlee does provide documentation on how often the fields will be available to help with this process. But if you are only going to be using basic account and transactional information, you will not have to worry about these complexities and it will speed implementation.
How you use the data once you receive it from Yodlee will also play a big part in the time it takes to get integrated. If you are deriving additional data from the transaction descriptions or are doing something with categorization then there is more complexity and it will require more time. If you are using many of the fields as-is, then this will be easier.
The other item that Craig mentioned is the extra security questions (Multi-factor Authentication). While that section of the API does add some work, we have added documentation around this to make integration easier. Also, with any development issues that come up we give clients access to a developer forum that is monitored by our Technical Consulting team.

How does Google Know you are Cloaking?

I can't seem to find any information on how Google determines whether you are cloaking your content. How, from a technical standpoint, do you think they are determining this? Are they sending in crawlers other than Googlebot and comparing the results to the Googlebot results? Do they have a team of human beings comparing? Or can they somehow tell that you have checked the user agent and executed a different code path because you saw "googlebot" in the name?
It's in relation to this question on legitimate URL cloaking for SEO. If the textual content is exactly the same, but the rendering is different (1995-style HTML vs. AJAX vs. Flash), is there really a problem with cloaking?
Thanks for your input on this one.
As far as I know, how Google prepares search engine results is secret and constantly changing. Spoofing different user-agents is easy, so they might do that. They also might, in the case of Javascript, actually render partial or entire pages. "Do they have a team of human beings comparing?" This is doubtful. A lot has been written on Google's crawling strategies including this, but if humans are involved, they're only called in for specific cases. I even doubt this: any person-power spent is probably spent by tweaking the crawling engine.
Google looks at your site while presenting user agents other than Googlebot.
See page 11 of the Google Chrome comic book, where it describes (in better-than-layman's terms) how a Google tool can take a schematic of a web page. They could be using this or similar technology for Google search indexing and cloak detection; at least that would be another good use for it.
Google does hire contractors (indirectly, through an outside agency, for very low pay) to manually review documents returned as search results and judge their relevance to the search terms, quality of translations, etc. I highly doubt that this is their only tool for detecting cloaking, but it is one of them.
In reality, many of Google's algos are trivially reversed and are far from rocket science. In the case of so-called "cloaking detection", all of the previous guesses are on the money (apart from, somewhat ironically, John K lol). If you don't believe me, set up some test sites (inputs) and some 'cloaking test cases' (further inputs), submit your sites to uncle Google (processing), and test your non-assumptions via pseudo-advanced human-based cognitive correlationary quantum perceptions (<-- btw, I made that up for entertainment value (and now I'm nesting parentheses to really mess with your mind :)) AKA "checking Google results to see if you are banned yet" (outputs). Loop until enlightenment == True (noob!) lol
A very simple test would be to compare the file size of a webpage that Googlebot saw against the file size of the same page fetched by a Google alias that looks like a normal user.
This would flag most suspect candidates for closer examination.
They call your page using tools like curl and construct a hash of the page fetched without the Googlebot user agent, then they construct another hash of the page fetched with the Googlebot user agent. The two hashes must be similar; they have algorithms to compare the hashes and determine whether it's cloaking or not.
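As a toy version of that comparison idea (a sketch only; a real check would normalize dynamic content like ads and timestamps before comparing):

```python
# Sketch: fetch the same URL with two different User-Agent headers and
# compare a fingerprint (size + hash) of each response body.
import hashlib
import urllib.request

def fingerprint(url: str, user_agent: str):
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    body = urllib.request.urlopen(req).read()
    return len(body), hashlib.sha256(body).hexdigest()

bot = fingerprint("http://example.com/", "Googlebot/2.1 (+http://www.google.com/bot.html)")
human = fingerprint("http://example.com/", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
print("sizes:", bot[0], human[0], "| identical content:", bot[1] == human[1])
```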

SEO for product known by different names

If you're selling widgets, we all know that having "Bob's Widgets" in the title and the H1 gives you a better ranking in Google when people search for "widgets".
But what if, as someone explained to me the other day, their product is known by different names in different parts of the world?
In the US, it's called a Widget. In Canada, it's called a Flidget. In Australia, it's called a Zidget. There's really no official name for it, just informal names.
Meta-tags are no problem, but apart from that, what's the best way to cope with that situation? Just make separate pages? You can't have 3 H1s on the page. One H1 which says "Widgets, (aka Flidgets, Zidgets)"?
Or do I just trust that Google is smart enough and some magical taxonomy database groups those three words together as the same thing?
EDIT: This question got downvoted simply because it's about SEO? How bizarre. If you even bother to read the question, you can see I'm not trying to game the system or get away with anything. I have a genuinely interesting question and a valid client need.
Please note also, that I always use semantic HTML, I am well aware of how search engine rankings work, and I'm not trying to get away with anything shady.
If my client was selling beer, I would simply use semantic HTML to put the word "beer" first and foremost. If I was selling beer to French people, I would make another page in French and do the same with "biere". But imagine for a second that beer isn't called "beer" in other English-speaking nations. Imagine it's called "reeb". How do I correctly, semantically code an English-language page when different English-language users will be searching using a different string, but searching for the same thing.
HTML meta-tags were originally created for the purpose of embedding exactly such metadata into a webpage. But because of the SEO industry and the commercialization of the web, meta-tags like 'keywords' are no longer used by major search engines.
With all of the advances in page-ranking algorithms and intelligent search robots over the years, there's really not much to do in terms of active 'search engine optimization' for legitimate websites. In today's search environment, all you have to do is optimize your site for your visitors, and it will automatically be optimized for searching.
So you can passively optimize your site's ranking by doing any (or all) of the following:
Use good spelling and writing etiquette (like not writing your entire site in caps or text-message-speak)
Format your pages using proper markup. (Title your document, mark your headings with H1/H2/etc., delimit your paragraphs, and so on and so forth.)
Abide by established web standards and write well-formed code.
Weed out broken links and make sure your site works properly.
Don't use pop-ups, cover your site with banner ads, or otherwise bombard visitors with advertising
Don't link to disreputable websites
Simply put, make your site as user-friendly and as accessible as possible. If your site is useful to visitors and provides valuable content, most major search engines like Google or Yahoo! are smart enough to rank it fairly. Your ranking may be modest at first. But if you're genuinely supplying quality content then, as your site becomes better established on the web, other sites will start linking to you, increasing your search ranking.
And if other webpages linking to your site use the various names & nicknames your product is referred to by, then your site will also be associated with those names/keywords (that's how Google Bombing works). Google also tracks synonymous search terms and is even smart enough to recommend related/alternative search terms in some cases.
On the other hand, if you're creating a spam site or the 10 millionth affiliate marketing website with the same exact products and content as the other 9,999,999 sites of the same exact nature, then expect your search engine ranking to be reasonably poor.
It's generally only websites with no original content and that provide no legitimate value to visitors that require active (black hat) SEO techniques to gain a decent ranking--polluting search results in the process. Otherwise, if you're actually building a useful website, then just optimize it for your visitors and let Google/Yahoo! do their job.
The anchor text of your inbound links is a lot more important than the tags you use. So try getting links to your page with both "beer" and "reeb". As long as you'll get enough links with both terms, you'll do well in SERPs, no matter the keywords you use in it.
One option is to localize pages for the different target regions you are interested in.
If you use a local domain, Google will give it priority on default searches in that country. When I hit www.google.com, it redirects me to www.google.com.mx, and any search I do tends to rank results from Mexican domains highly. I actually have to change a couple of options when I don't want that behavior.
I also think Google has an option to map parts of the site to a region, so you can keep a single domain.
Update: Regarding the beer example, you can localize per country (which is what I mention above). Actually, it's not that unusual a need, since British English and US English have their differences.
The discussion has been language-agnostic, but consider how .NET handles resources. Let's say the current request is being processed for en-GB, and you look up a resource (i.e., a text, an image, etc.). It will first try to find the resource for the specific culture, en-GB; if it isn't found, it will look under the more general en (and then in the default resource file).
This allows you to selectively localize only what you really need in the more specific resource files. If you only need to localize the resource with the key beerName, you can just configure that for the specific languages and leave the rest alone.
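To illustrate the fallback order with the beer/reeb example, here's a rough sketch of the idea (not actual .NET resource code; the keys and strings are made up):

```python
# Sketch: culture-fallback resource lookup, most specific culture first.
RESOURCES = {
    "default": {"beerName": "beer", "greeting": "hello"},
    "en":      {"greeting": "hello"},
    "en-GB":   {"beerName": "reeb"},   # only override what actually differs
}

def get_resource(key: str, culture: str) -> str:
    # Try en-GB, then en, then the default resources.
    for scope in (culture, culture.split("-")[0], "default"):
        value = RESOURCES.get(scope, {}).get(key)
        if value is not None:
            return value
    raise KeyError(key)

print(get_resource("beerName", "en-GB"))  # -> "reeb"
print(get_resource("beerName", "en-US"))  # -> "beer" (falls back to default)
```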