How does a site like kayak.com aggregate content? [closed] - api

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
Greetings,
I've been toying with an idea for a new project and was wondering if anyone has any idea on how a service like Kayak.com is able to aggregate data from so many sources so quickly and accurately. More specifically, do you think Kayak.com is interacting with APIs or are they crawling/scraping airline and hotel websites in order to fulfill user requests? I know there isn't one right answer for this sort of thing but I'm curious to know what others think would be a good way to go about this. If it helps, pretend you are going to create kayak.com tomorrow ... where is your data coming from?

I'm working in travel industry as a software architect / project lead on the precisely kind of project you describe - in our region we work with suppliers directly, but for outgoing we connect to several aggregators.
To answer your question... some data you have, some you get in various ways, and some you have to torture and twist until it confesses.
What's your angle?
The questions you have to ask are... Do you want to sell advertising like Kayak or do you take a cut like Expedia? Are you into search or into selling travel services? Do you target niche (for example, just air travel) or everything (accommodation, airlines, rent-a-car, additional services like transport/sightseeing/conferences etc)? Do you target region (US or part of US) or the world? How deep do you go - do you just show several sites on a single screen, or do you bundle different services together and package them dynamically?
Getting the data
If you're going with Kayak business model, you technically don't need site's permission... but a lot of sites have affiliate programs with IFrames or other simple ways to direct the customer to their site. On the plus side, you don't have to deal with payments/complaints and travelers themselves. As for the cons... if you want to compare prices yourself and present the cheapest option to the user, you'll have to integrate on a deeper level, and that means APIs and web scraping.
As for web scraping... avoid it. It sucks. Really. Just don't do it. Trust me on this one. For example, some things like lowcosters you can't get without web scraping. Low cost airlines live from value added services. If the user doesn't see their website, they don't sell extra stuff, and they don't earn anything. Therefore, they don't have affiliates, they don't offer APIs, and they change their site layout almost constantly. However, there are companies which earn a living by web scraping lowcoster's sites and wrapping them into nice APIs. If you can afford them, you can give your users cost-comparison of low cost flights and that's huge.
On the other hand, there are "normal" carriers which offer APIs. It's not that big of a problem to get to airlines since they're all united under IATA; basically, you buy from IATA, and IATA distributes the money to carriers. However, you probably don't want to connect directly to carrier network. They have web services and SOAP these days, but believe me when I say that there are SOAP protocols which are just an insanely thin wrappers around a text prompt through which you can interact with a mainframe with an 80es-style protocol (think of a Unix prompt where you're billed per command; and it takes about 20 commands to do one search). That's why you probably want to connect to somebody a bit more down the food chain, with a better API.
Airlines are thus on both extremes of Gaussian curve; on one side are individual suppliers, and on the other highly centralized systems where you implement one API and you're able to fly anywhere in the world. Accommodation and the rest of travel products are in between. There are several big players which aggregate hotels, and a ton of small suppliers with a lot of aggregators which cover only part of a spectrum. For example, you can rent a lighthouse and it's even not that expensive - but you won't be able to compare the prices of different lighthouses in one place.
If you're into Kayak business model, you'll probably end up scraping websites. If you're into integrating different providers, you'll often work with APIs, some of which are pretty good, and most of which are tolerable. I haven't worked with RSS but there's not a lot of difference between RSS and web scraping. There is also a fourth option not mentioned in Jeff's answer... the one where you get your data nightly, for example .CSV files through FTP and similar.
Life sucks (mini-rant)
And then there's complexity. The more value you want to add, the more complexity you'll have to handle. Can you search accommodations which allow pets? For a hostel which is located less than 5 km from the town center? Are you combining flights, and are you able to guarantee that the traveler will have enough time to get from one airport to another... can you sell the transport in advance? A famous cellist doesn't want to part from his precious 18th century cello; can you sell him another seat for the cello (yep, not making this one up)?
Want to compare prices? Sure, the room is EUR 30 per night. But you can either get one double for 30 and one single for 20, or you can get one extra bed in a double and get 70% off for third person. But only if it's a child under 12 years of age; our extra beds are not for adults. And you don't get the price for extra bed in search results - only when you calculate the final price.
And don't even get me started on dynamic packaging. Want to sell accommodation + rent-a-car? No problem; integrate with two different providers, and off you go... manually updating list of locations in the city (from rent-a-car provider) to match with hotels (from accommodation provider, who gives you only the city for each hotel). Of course, provided that you've already matched the list of cities from the two, since there is no international standard for city codes.
Unlike a lot of other industries which have many products, travel industry has many very complex products. Amazon has it easy; selling books and selling potatoes, it's the same thing; you can even ship them in the same box. They combine easily and aren't assembled from many parts. :)
P.S. Linking to an interesting recent thread on Hacker News with some insider info regarding flights.
P.P.S. Recently stumbled on a great albeit rather old blogpost on IATA's NDC protocol with overview of how travel industry is connected and a history lesson how this came to be.

They use a software package like ITA Software, which is one of the companies Google is in the process of picking up.

Only 3 ways I know of to get data from websites.
RSS Feeds - We use rss feeds a lot at my company to integrate existing site's data with our apps. It's fast and most sites already have an RSS feed available. The problem with this is not all sites implement the RSS standard properly so if you're pulling data from many RSS feeds across many sites then make sure you write your code so that you can add exceptions and filters easily.
APIs - These are nice if they are designed well and have all the information you need, however that's not always the case, plus if the sites are not using a standard api format then you'll have to support multiple API's.
Web Scraping - This method would be the most unreliable as well as the most expensive to maintain. But if you're left with nothing else it can be done.

This article says that Kayak was asked to stop scrapping a certain airlines page. That leads me to believe that they probably do scraping on sites that they don't have a relationship with (and a data feed that comes with that relationship).

Travelport offer a product called "Universal API" which connects to flights and hotels and car rental companies and copes with package deals and all the various complexities to do with taxes and exchange rates:
https://developer.travelport.com/app/developer-network/resource-centre-uapi
I've just started using it and it seems fine so far. The queries are a little slow, but then so is every query on every OTA (Online travel agent)'s site.

There's two good APIs I've found from flight comparison websites recently
There's one from Wego, and one from Skyscanner. Both seem to have a good range and breadth of data from a number of airlines and good documentation too.
Wego pays each time a user clicks from your app to a booking website and Skyscanner pay affiliates 50% of 'revenue' (I assume that means the commission they make from airlines)

This is an old post but I thought I'd just add. I'm a data architect who works for a company that feeds these travel sites with content. This company enters into contracts with many hotel brands, individual hotels and other content providers. We aggregate this information then pass it onto the different channels. They then aggregate again in to their system.
The Large GDS systems are also content providers.
Aggregation is done by many methods... matching algorithms(in-house) and keys. Being an aggregation service, we need to communicate on the client level.
Hope this helps! cheers!

Related

Recommendations for auto formatting block of text from TLDR API?

I'm using an API (TLDR) to use AI to provide a Human-like written summary of articles. The API pushes out unformatted text with no line breaks or paragraphs.
I'm looking for an online service (API) that can prettify / beautify the text and add meaningful paragraphs and line breaks.
Does anything like this exist or should I create my own service? Example text output below.
Six years after Yahoo purchased Tumblr for north of $1 billion, its
parent corporation is selling the once-dominant blogging platform.
WordPress owner Automattic Inc. has agreed to take the service off of
Verizon’s hands. Terms of the deal are undisclosed, but the number is
“nominal,” compared to its original asking price, per an article in
The Wall Street Journal.Axios is reporting that the asking price for
the platform is “well below $20 million,” a fraction of a fraction of
its 2013 price tag.Once the hottest game in town, the intervening
half-decade has been tough on Tumblr, as sites like Facebook,
Instagram, Reddit and the like have since left the platform in the
dust. More recently, a decision to ban porn from the platform has had
a marked negative impact on the service’s traffic. According to Sensor
Tower, first-time users for Tumblr’s mobile app declined 33%
year-over-year last quarter.“Tumblr is one of the Web’s most iconic
brands,” Automattic CEO Matt Mullenweg said of the news. “It is an
essential venue to share new ideas, cultures and experiences, helping
millions create and build communities around their shared interests.
We are excited to add it to our lineup, which already includes
WordPress.com, WooCommerce, Jetpack, Simplenote, Longreads, and
more.”The news certainly isn’t surprising. In May, it was reported
that Verizon was looking for a new owner for the site it inherited
through its acquisition of Yahoo. Tumblr was Yahoo’s largest
acquisition at the time, as then-CEO Marissa Mayer “promise[d] not to
screw it up” in a statement made at the time.Tumblr proved not to be a
great fit for Yahoo — and even less so Verizon, which rolled the
platform into its short-lived Oath business and later the Verizon
Media Group (also TechCrunch’s umbrella company). On the face of it,
at least, Automattic seems a much better match. The company runs
WordPress.com, one of the internet’s most popular publishing tools,
along with Jetpack and Simplenote. As part of the deal, the company
will take on 200 Tumblr staffers.“We couldn’t be more excited to be
joining a team that has a similar mission. Many of you know
WordPress.com, Automattic’s flagship product. WordPress.com and Tumblr
were both early pioneers among blogging platforms,” Tumblr fittingly
wrote in a blog post. “Automattic shares our vision to build
passionate communities around shared interests and to democratize
publishing so that anyone with a story can tell it, especially when
they come from under-heard voices and marginalized
communities.”.“Today’s announcement is the culmination of a
thoughtful, thorough and strategic process,” Verizon Media CEO Guru
Gowrappan said in a statement. “Tumblr is a marquee brand that has
started movements, allowed for true identities to blossom and become
home to many creative communities and fandoms. We are proud of what
the team has accomplished and are happy to have found the perfect
partner in Automattic, whose expertise and track record will unlock
new and exciting possibilities for Tumblr and its users.
Thanks
Jonathan

What free/paid search API's allow for programmatic querying and caching/storage of the resulting data?

If you've done any serious research into search API's, you know that most of them have a huge slew of TOS/TOU restrictions that make them nearly impossible to use in anything but the most inane applications.
Bing's 2.0 API, Yahoo Search BOSS, Google Places, Google AJAX Search (dead), et al, are far too restrictive for us. I need to run a finite and relatively small number of queries (perhaps 500k) one time only, storing specific data from the results for use within our application.
For example, we need to match up business names with their target websites (we have written the algorithm to make a 'best guess' from a set of results if necessary; we just need a vanilla result set). Also, we need to match an address to this company in question.
Unfortunately, I can find ZERO search API's that will allow us to fire off queries in a programmatic, non-user-initiated manner.
We're even quite eager to give someone cold, hard cash for access to this kind of data; Google, Bing, Yahoo, and others simply seem to not want our money (as evidenced by their TOSes)...
Any thoughts?
A freely accessible index of 5 billion web pages, their page rank, their link graphs and other metadata, hosted on Amazon EC2.
http://commoncrawl.org/
Their Terms of Service (or TOU) are pretty reasonable and unrestricted too:
http://commoncrawl.org/about/terms-of-use/
If you know some visual basic I'd suggest playing around with Bing Ad Intelligence. It's a free Excel plugin and all you need to use it is a free Microsoft account.
The query limit is 20,000 words per query. You can get information on Clicks, Impressions, CTR, CPC, Average Bid and Total Cost. The query limit is a little lower if you use the more advanced keyword research features.

How long does it take to do a yodlee implementation?

I'm a non-technical (well, non-software. hardware background) founder who has hired a pretty good developer that has built a site with backend on Rails and frontend with CSS/HTML pretty capably. our next step is to develop a Yodlee integration, and we both want to know how long it takes to do this. He has an estimate which I think is reasonable, but would like feedback from the community without biasing the responses.
Also, if anybody has done an implementation before, I would really appreciate your perspective and help!
I have implemented a complex Yodlee integration for a LA based start-up over the last two years. They built a social game and money management platform on top of it. The short answer is that it's tough and dirty work.
The technical aspect of getting your application to communicate to the Yodlee API is not at all the hard part (its pretty much a standard web service). Following are some aspects highlighting the difficulty:
The most difficult part is dealing with the unknowns and the variability in the client data.
There is effectively no documentation for the API
There are several way to do each operation that will return different data
Ive been designing and building systems for 15 years and have gotten pretty good at estimating projects. We were way off with Yodlee; in fact we are still dealing with issues.
In order to understand why its so tough, you really need to understand what Yodlee is.. it is an aggregator of 10,000 different systems. Now these other systems might be big professional systems like Bank of America, Chase, ... but they are often small little banks (Bob's Bank in Omaha).
When Yodlee communicates with the big companies (they are called content services) there is most always an api that actually returns good data. But with the little ones, they are doing screen scraping. You can imagine that breaks all the time. They have an entire team in India which is just focused on that.
The other issue is about modelling the data; each of the content services at its source has modeled the data differentley (different names, different elements, different relationships,...) but Yodlee but combine all 10,000 models into one view. What this leaves you with is a very bloated model, where you can never know or count on getting a certain data element.
To give you an idea... there are extra fields about a credit account (apr, credit amount, last payment, ...) beyond the standard base'class fields (balance, ...). While this sounds great that you have this data, in practice the number of content services that provide these extra data elements is so low that you cant really depend on them. I'd say that the fidelity of those data elements is very low. All you can really count on is the base elements (account name, type, balance) and (transaction date, description and type).
Speaking of transactions... their transaction categorization system is not that good. They have clearly taken a breadth first approach to this, rather than focus on accuracy. We built an entire system for transaction categorization which is far more effective.
A couple other things: The DAG account test system is useless; it does not operate the same way real accounts do. You will be far better off opening 5-10 accounts at different content services and giving your developers the username/passwords for these for testing. The MFA (multifactor auth) system for account security has been an endless headache. This isnt Yodlee's fault, its the nature of the game. The banks are doing more and more crazy things that add security layers. Yodlee has the MFA system in place to compensate for this. At any given time about 20% of our accounts are in error for some reason. We have built an entire component just to manage this.
So what does this all mean? Double your estimate, get ready to get dirty. I dont want to put Yodlee down at all (except for the lack of documentation); they really are solving a hard problem. There really arent any other better options.
I run the team responsible for sales and support of the Yodlee APIs so the response may be a little biased.
I have seen clients get up and running in anywhere from 10 days to 3 months to 6 months. The time to implement depends on the number of fields in the data model you are using and how you are going to use the data or manipulate it before presenting it to your users.
While the most prevalent data fields such as account balance or transaction amount will always be available, Craig is right, as you get into the broader data model you will have to code for exceptions when the data is not there. Yodlee does provide documentation on how often the fields will be available to help with this process. But if you are only going to be using basic account and transactional information, you will not have to worry about these complexities and it will speed implementation.
How you use the data once you receive it from Yodlee will also play a big part in the time it takes to get integrated. If you are deriving additional data from the transaction descriptions or are doing something with categorization then there is more complexity and it will require more time. If you are using many of the fields as-is, then this will be easier.
The other item that Craig mentioned is the extra security questions (Multi-factor Authentication). While that section of the API does add some work, we have added documentation around this to make integration easier. Also, with any development issues that come up we give clients access to a developer forum that is monitored by our Technical Consulting team.

Need Help Planning Architecture for Categorization Connundrum [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 6 years ago.
Improve this question
I have more of a "what-would-you-do" question than an actual coding question.
It relates to a project I am currently working on. In this project,
we are tasked with combining several marketplace APIs into one interface. Each
API has its own unique way of categorizing products. The top-level
parent categories for all the APIs we are looking at are more-or-less
the same with some variations. But, the subcategories are wildly
different.
For example, one API requires long bread-crumb trails
to select a category, such as: Sports > Ball Sports > New England >
Football > Active Teams > Patriots > Memorabilia. While another API has two-level
categorization: Sports > Patriots Memorabilia. In many cases, there are sub-categories
that don't relate whatsoever to the subcategories of other APIs.
So, the question is - what is the best approach to take when designing
the interface? We are currently wrestling between two possibilities:
1) Design a custom category UI on the client and then build logic into
the server that is able to sort through the needs of the various APIs
based on user-selected choices.
2) Create the UI in such a way that the user has to walk through the
necessary steps for each individual API. Depending on user settings, this means that
he may need to fill out API - specific information 5,6,10, or more
times.
While I am told that option number one is a real programming nightmare (the
example I am given is changing API data fields) I feel strongly that option number two will piss off customers.
Any ideas out there??
This a very hard problem. If you search for "ontology product classification" you'll find many research papers and discussions on the topic. If one was simply a more detailed version of the other it would be quite feasible, but your description implies that isn't the case, and thus you'll need to construct your own classification scheme and map the others onto it.
Do you have a common key (UPC code? or other) that will allow you to verify your mapping between the different product categories? If so, you might be able to construct your own categorization scheme and then map the others onto it with some degree of success.
Clearly the first option is the best for the consumer but it could be very hard and very time consuming to construct such a mapping and it will need constant updates.
One approach would be to construct a simpler hierarchy than any of the ones provided. A simpler hierarchy will make the [mostly manual] effort of mapping categories into your hierarchy much easier as most will simply be inclusions. This might make the user experience worse but if you add great search capabilities and great "related products" / "people who bought this also bought this" tools around the product browsing experience you can probably make up for the lack of hierarchy.
Number One isn't that bad of a nightmare. Your users' experience is the number one priority; never forget that. If the user has an easier time navigating a shorter route, then give them that opportunity. Also, I would wrap the API with some abstraction so my code doesn't know about the API at all and only knows about the abstraction layer. This way I can change APIs and leave most of my code alone, only changing the abstraction layer.
Use a session to pass data from page to page and a factory to create the page's state on entry based on the session data; this will strengthen the context between page, state, and user data.
Keep first level objects (the ones the page directly talks to) in context to the page; this will help when diagnosing problems.
Most importantly, create a series of tests for your abstraction layer that test every object, function, and input output pair to make sure your application is rock solid.
You have to provide to your clients a consistent not changing interface.
I would like to see an example of the two different approaches you have in mind.
API.find( PRODUCT, CATEGORIES_LIST )

What are the differences between programming in an IT department and a Product Development department?

Our company has recently decided that a good section of our IT department is actually doing product development and not internal IT development and now has created a new department.
What are the types of changes that developers should be looking to make during this type of transition?
Is there really any difference between internal development and product development?
I don't know how quickly the differences will assert themselves in an existing group which is transitioning from one role to the other, but having worked as both an internal developer and a product developer, two huge differences leap to mind: requirements and and testing.
As a developer of internal tools, I was pretty much given free reign regarding interface, organization, and even scope. Specs were in the form of an email saying "can you write something that does X?" Similarly, testing was almost non-existent. After whatever testing I was able to do, the tools would be deployed directly to their target audience and bug reports, when they came at all, were directly from those end-users, again usually via email or even hall-tackle.
Now that I'm doing product development, the difference is dramatic. Specs are 30-100 page Word documents and we have a dedicated testing department which makes darn sure that what we produce matches those specs. I'm much better supported by the project managers and have clear channels for any feedback I have on requirements or design. It could be argued that product development offers less freedom to the individual developer, but in exchange for being part of a (hopefully) better-organized and better-supported team.
[I work with Jeff.]
As others have mentioned, a key difference stems from the nature of the user:
When developing internal applications, you're typically dealing users from a single department or group who use a limited number of apps, and they're mostly a captive userbase.
When developing external products, you're dealing with users who see the whole enchilada, and they're paying customers who can take their business elsewhere.
That difference matters because a coherent user experience across multiple applications becomes much more important for paying customers who see the entire set of apps.
In the IT case, neither the business nor the users themselves are typically concerned about the fact that their app doesn't look and work like some other app that some other department uses. And if for whatever reason they did care, their ability to do anything about it is more limited. While it's possible to make a business case for consistency across apps (branding to internal employees, usability for employees who use multiple apps, reduced development costs resulting from the reuse of common libraries, patterns, services, etc.), this kind of concern typically takes a backseat to the development of new business functionality.
In the product case, the users do care about coherence across the entire set of apps, and they can do something about it if the experience is weak.
So one major difference is the relative importance of a coherent, quality user experience. But that goal itself has significant organizational ramifications that we're already starting to see.
We'll see an increased emphasis on "horizontal" activities that seek to establish standards and improve communications across teams, since such activities directly support the goal of producing coherent products. Cross-cutting teams (like user experience) will become more influential than they formerly were, we may see new cross-cutting teams (e.g. teams that look at architecture across many systems rather than just a single system), we'll probably see more presentations about what different apps do, more cross-training, etc.
App development teams will have less autonomy than they formerly had in charting their own course. The user experience team (working with end users, business stakeholders and engineering teams) will specify standards around visual design, interaction design, etc. and the app teams will be expected to adopt those. We'll see test practices become more standardized. Builds and deployments across apps will be more uniform and coordinated. Monitoring, alerting, response time SLAs and other operational metrics will be more uniform across apps. The app teams won't be able to define these for themselves anymore (though they can certainly contribute to the larger discussion).
Management will increasingly allocate resources in a way that seeks to optimize globally across the entire product instead of optimizing locally for individual apps. So where in the past the composition of individual app teams has been relatively fixed (developers, SQA, etc.), going forward we should expect to see more fluidity.
A huge difference is in your customer. IT developers have the rest of the company (and sometimes partner / subsidiary companies) as their primary customers. The Product Development developers have customers (i.e. the people who buy the product that is the companies reason for existing) as the primary customer.
Yes, very much so. There's a huge difference between performing an activity that drives revenue and invoices and one that's perceived as overhead.
My experiance is that people working on product development side of the house have bigger budgets, better training, better travel and more skilled employees. Having worked in a product development company it always felt like the lower skilled employees were thrown over to IT department.
In-house development is very process-orientated. They just want to deliver xyz functionality and have it work based on a company-wide strategy. It's not the iterative make-the-product better cycle you get when your primary product is your code. As a result, in-house stuff is often just 'good enough' whereas software companies probably tend to try to improve their product long after it 'just works'.
Note, in-house dev teams can get away with being 'good enough' but software development companies can get away with it for a while but will ultimately lose out to the ones who strive to improve.
I think moving between both environments could be a shock to the system in either direction but that's not to say they both can't be run in the same way - just that in my experience they probably won't. For example, UI is usually given a lower priority when it is in-house software as the customers are generally getting paid to use it rather than paying to use it.
I have been equals amounts of time in both types of organizations. The particular organization where I worked where the software development was part of an IT department treated the development of software as a cost center, where as software development as part of an product was seen as a profit center.
The two are very different. The skill level of developers in my case was vastly different-- those working on a public product were overall better skilled and cared more about the quality of their work. As a developer of a real public product you are actually making the company money.
In my internal software job I typically had a set of known fixed requirements mostly worked out. I designed a solution, but was always given a deadline that was unreasonable if quality was a concern (including code quality), rushed the coding, and delivered the result. Any bugs found that passed any short QA process typically only got fixed if they made formal requests for fixes.
Product development in my experience is almost the reverse. All the requirements are not fixed (only what I was working on for that week was fixed), design is usually dictated by someone who has worked on the product the longest. I get decide how long something takes (but got to really explain why and justify the time), and coding is usually not rushed. Experimenting with ideas generally is more difficult in product development because a product that is meant for public use should use tried and tested approaches.
Therefore, I would say that if creativity is really important then product development may not be for you because a particular idea that you personally have is unlikely to ever make it into the product unless you can make a business case for it and somehow make it more important than what the business has already been planning.
Choosing a particular library is also more difficult depending on how the software is deployed. For example software used by the government typically has to pass Common Criteria certification, which can eliminate certain library choices.