I'm looking for a geocoding service that can handle 1 million queries per day.
I've already read about the Google and Yahoo APIs, but unfortunately neither of them offers that volume.
Any help is appreciated.
Google, Yahoo, MapQuest (as a licensed service) or Microsoft will be more than happy to let you use their APIs at this kind of volume under their premium plans.
If you want this for free, MapQuest Open runs Nominatim, a free geocoder based on OpenStreetMap data. This service is not, as of today, rate-limited.
Or, if you want more control, why not set up your own geocoder based on Nominatim?
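Either way, the query interface is plain HTTP. Here is a minimal sketch in Python, assuming the standard Nominatim search endpoint; the public openstreetmap.org URL below is rate-limited and only suitable for light testing, so for real volume you would point BASE_URL at your own server or at MapQuest Open's instance:

```python
import requests

# Point this at your own Nominatim instance (or MapQuest Open's) for high
# volume; the public endpoint below is only suitable for light testing.
BASE_URL = "https://nominatim.openstreetmap.org/search"

def geocode(address):
    """Return (lat, lon) for the best match, or None if nothing is found."""
    params = {"q": address, "format": "json", "limit": 1}
    headers = {"User-Agent": "my-geocoder-test"}  # Nominatim asks for an identifying User-Agent
    response = requests.get(BASE_URL, params=params, headers=headers, timeout=10)
    response.raise_for_status()
    results = response.json()
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])

if __name__ == "__main__":
    print(geocode("1600 Pennsylvania Ave NW, Washington, DC"))
```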
I work at SmartyStreets, where we specialize in address verification and geocoding. While I'm not sure yet (see my comment to your question) whether you are geocoding by address or by IP, I know of some avenues you could investigate. I'll start with some general principles and then offer a recommendation or two.
There are services that will perform either batch geocoding or geocoding en masse for such large quantities. Ultimately, to service upwards of a million requests daily from a single user, the API you settle on should have the following characteristics:
Geo-distributed. Latency can easily double the time of a request, and over a million queries in just one day (about 11 queries/sec) can seriously affect your app's performance.
Scalable. If one machine becomes overwhelmed servicing API requests, how will the system cope and service others pending?
SLA with guaranteed uptime. Especially for mission-critical operations, geocoding must not get in your way, and for such a large quantity you want to make sure the availability isn't affected arbitrarily.
Portable/lightweight. In other words, you want something that can output results in a universal format. XML is nice but often cumbersome and has its limitations. I've personally found JSON to be a great format for sending and receiving data (see the sketch after this list).
Affordable. The premium plans for the Google and Yahoo APIs are generally designed for corporate entities and carry a hefty cost. Your means may not allow that.
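To make the portability point concrete, here is a hypothetical JSON geocoding response and the few lines of Python it takes to consume it. The field names are illustrative only, not any particular vendor's actual schema:

```python
import json

# Hypothetical response body -- the field names are made up for illustration,
# not taken from any particular vendor's schema.
raw = """
{
  "query": "3785 Las Vegas Blvd S, Las Vegas, NV",
  "verified": true,
  "latitude": 36.10222,
  "longitude": -115.17242,
  "precision": "rooftop"
}
"""

result = json.loads(raw)
if result["verified"]:
    print(result["latitude"], result["longitude"])
```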
Also keep in mind that Google, OpenStreetMap (Nominatim), Yahoo, and others don't actually verify the locations they geocode. In other words, you can give Google or OSM an address that doesn't really exist and it will still give you coordinates, because they perform address approximation, not address verification. Their purpose is to help you search for and find things, but if you need accurate coordinates, you'd best make sure the address is correct.
Start looking around for APIs like this. I would suggest you start with LiveAddress and see how it meets your needs. We service millions of requests per day and can easily handle thousands of requests per second, and the addresses we return data for actually exist: no guessing. It comes with an SLA, is serviced from 3 data centers across the US, and has simple JSON output. Response times are generally around 100 ms or less (excluding external latencies out of our control).
And by the way, it's free to use for 250 addresses (queries) per month, which in your case should make it easy to get started.
If you have any further questions, I'll be happy to help you personally.
For free? If so, you won't find such a service from a commercial company like Google, Yahoo, Microsoft, MapQuest, etc. The only way is to pay for more daily queries or to use OpenStreetMap.org (OSM). However, OSM's API only offers 2,500 queries a day, but you can download the whole map data (or just parts of it, e.g. particular cities or countries) and put it on your own server. Note that OSM does not offer satellite or street viewing.
Related
I want to create a tokenomics database.
I want to build a bot that automates data collection and answers, for each token on the blockchain, some of the questions asked and answered in this article.
I am a total newbie when it comes to blockchain. So, I have some basic questions.
Where is this information located? How did the author discover the numbers he quotes? Is there an API one could use to collect this information? (See update.)
Update: In writing this question, I discovered this Etherscan API. How might one leverage that API to obtain the tokenomics data I want?
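For what it's worth, here is a minimal sketch of pulling one such figure (a token's total supply) from Etherscan's stats endpoint; I believe the free tier exposes this, but treat the details as assumptions, and the contract address and API key below are placeholders:

```python
import requests

ETHERSCAN_API = "https://api.etherscan.io/api"
API_KEY = "YourApiKeyToken"  # placeholder: substitute your own Etherscan key
TOKEN_CONTRACT = "0x0000000000000000000000000000000000000000"  # placeholder contract address

def token_total_supply(contract_address):
    """Return a token's total supply in its smallest unit, as reported by Etherscan."""
    params = {
        "module": "stats",
        "action": "tokensupply",
        "contractaddress": contract_address,
        "apikey": API_KEY,
    }
    payload = requests.get(ETHERSCAN_API, params=params, timeout=10).json()
    if payload.get("status") != "1":
        raise RuntimeError(payload.get("message", "Etherscan request failed"))
    return int(payload["result"])

# Divide by 10**decimals (read from the token contract) to get a human-readable number.
```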
Source [emphasis mine]
There will only ever be 21,000,000 bitcoin, and they’re released at a rate that gets cut in half every four years or so. Roughly 19,000,000 already exist, so there are only 2,000,000 more to be released over the next 120 years.
What about Ethereum? The circulating supply is around 118,000,000, and there’s no cap on how many Ether can exist. But Ethereum’s net emissions were recently adjusted via a burn mechanism so that it would reach a stable supply, or potentially even be deflationary, resulting in somewhere between 100-120m tokens total. Given that, we shouldn’t expect much inflationary pressure on Ether either. It could even be deflationary.
Dogecoin has no supply cap either, and it is currently inflating at around 5% per year. So of the three, we should expect inflationary tokenomics to erode the value of Doge more than Bitcoin or Ethereum.
The last thing you want to consider with supply is allocation. Do a few investors hold a ton of the tokens which are going to be unlocked soon? Did the protocol give most of its tokens to the community? How fair does the distribution seem? If a bunch of investors have 25% of the supply and those tokens will unlock in a month, you might hesitate before buying in.
The question is about using a chat-bot framework in a research study, where one would like to measure the improvement of a rule-based decision process over time.
For example, we would like to understand how to improve the process of medical condition identification (and treatment) using the minimal set of guided questions and patient interaction.
Medical conditions can be formulated into workflow rules by doctors. A possible technical approach for such a study would be to develop an app or website that patients can access, where they ask free-text questions that a predefined rule-based chatbot addresses. During the study a doctor will monitor the collected data and improve the rules and the possible responses (and also provide new responses when the workflow reaches a dead end). We do plan to collect the conversations and apply machine learning to generate an improved workflow tree (and questions) over time; however, the plan is to do all data analysis and processing offline, and there is no intention of building a full product.
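Independently of RASA or any other framework, the core of such a study can be prototyped as a plain decision tree plus a conversation log. The sketch below (all node names, questions and outcomes are made-up placeholders) shows the rule walk, the dead-end hand-off to the monitoring doctor, and the logging needed for offline analysis:

```python
# Framework-agnostic sketch of a doctor-authored guided-question workflow.
# Node names, questions and outcomes are illustrative placeholders only.
import json
from datetime import datetime

WORKFLOW = {
    "start": {"question": "Do you have a fever?", "yes": "fever", "no": "no_fever"},
    "fever": {"question": "Has it lasted more than three days?", "yes": "escalate", "no": "self_care"},
    "no_fever": {"question": "Do you have a persistent cough?", "yes": "escalate", "no": "self_care"},
    "self_care": {"outcome": "Self-care advice (placeholder)."},
    "escalate": {"outcome": "Dead end for the bot: hand over to the monitoring doctor."},
}

def run_session(answers, log_path="conversations.jsonl"):
    """Walk the workflow with a dict of answers; log each step for offline ML."""
    node, transcript = "start", []
    while "question" in WORKFLOW[node]:
        question = WORKFLOW[node]["question"]
        answer = answers.get(node, "no")  # simulated patient input for the sketch
        transcript.append({"node": node, "question": question, "answer": answer})
        node = WORKFLOW[node].get(answer, "escalate")  # unknown answers escalate
    transcript.append({"node": node, "outcome": WORKFLOW[node]["outcome"]})
    with open(log_path, "a") as log:
        log.write(json.dumps({"time": datetime.utcnow().isoformat(), "steps": transcript}) + "\n")
    return WORKFLOW[node]["outcome"]

print(run_session({"start": "yes", "fever": "yes"}))
```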
This is a low-budget academic study. The PhD student has good development skills and data-science knowledge (Python) and will be accompanied by a fellow student who will work on the engineering side. One of the conversational-AI options recommended for data scientists was RASA.
I have spent the last few days reading about and playing with several chatbot solutions (RASA, Botpress), also looked at Dialogflow, and read tons of comparison material, which only makes the choice harder.
From sources on the internet it seems that RASA might be a better fit for data-science projects; however, it would be great to get a sense of the real learning curve and how fast one can expect to have a working bot, especially one whose rules have to be updated continuously.
A few things to clarify: we do have data to generate the questions and are in touch with doctors to improve their quality. It seems we need a way to present participants with multiple choices and provide answers (not just free text). Being on the research side, there is no need to align with any specific big provider (i.e. Google, Amazon or Microsoft) unless it has a benefit; the important considerations are time, money and flexibility. We would like to have a working approach in a few weeks (and continuously improve it), and the whole experiment will run for no more than 3-4 months. We do need to be able to extract all the data. We are also not sure which channel is best for such a study (WhatsApp? A website? Something else?) and what complexities each involves.
Any thoughts about the challenges and considerations about dealing with chat-bots would be valuable.
The problem is the following. We gather some data in real time, let's say 100 entries per second. We want real-time reports. The reports should present data by the hour. All we want to do is create some sums of the incoming data and have some smart indexing so that we can easily serve queries like "give me value2 for featureA = x and featureB = y, for 2012-01-01 09:00 - 10:00".
To avoid too many I/O operations we aggregate the data in memory (which means we sum it up), then flush it to the database. Let us say this happens every 10 seconds or so, which is an acceptable latency for our real-time reports. (A sketch of this aggregation appears below, after the table layout.)
So basically, in SQL terms, we end up with 20 (or more) tables like this (OK, we could have a few less of them by combining sums, but it does not make a lot of difference):
Time, FeatureA, FeatureB, FeatureC, value1, value2, value3
Time, FeatureA, FeatureD, value4, value5
Time, FeatureC, FeatureE, value6, value7
etc.
(I do not say the solution has to be SQL; I only present this to explain the issue at hand.) The Time column is a timestamp (with hour precision), the Feature columns are IDs of system entities, and the values are integer counts.
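A minimal sketch of the aggregate-then-flush mechanism described above, for one of those tables; the flush target is left abstract (in the real system it would be an UPSERT into the aggregate table), and all names are illustrative:

```python
# Sums are accumulated in memory keyed by (hour, featureA, featureB) and
# written out roughly every FLUSH_INTERVAL seconds. Names are illustrative.
import time
from collections import defaultdict

FLUSH_INTERVAL = 10  # seconds
buffer = defaultdict(lambda: [0, 0])  # (hour, featureA, featureB) -> [value1, value2]
last_flush = time.monotonic()

def record(timestamp, feature_a, feature_b, value1, value2):
    hour = int(timestamp) - int(timestamp) % 3600  # truncate to hour precision
    key = (hour, feature_a, feature_b)
    buffer[key][0] += value1
    buffer[key][1] += value2
    maybe_flush()

def maybe_flush():
    global last_flush
    if time.monotonic() - last_flush < FLUSH_INTERVAL:
        return
    for (hour, fa, fb), (v1, v2) in buffer.items():
        # In the real system this would be an UPSERT into the aggregate table;
        # here we just print the row that would be written.
        print(hour, fa, fb, v1, v2)
    buffer.clear()
    last_flush = time.monotonic()

# After FLUSH_INTERVAL seconds of traffic, maybe_flush() writes and clears the buffer.
record(1325408400, "x", "y", 1, 5)  # 2012-01-01 09:00 UTC, illustrative values
```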
So now the problem arises. Because of the very nature of the data, even after aggregation there are still (too) many inserts into these aggregating tables. This is because some of the data are sparse, which means that for every 100 entries we get, say, 50 inserts into some of the aggregating tables. I understand that we could go forward by upgrading the hardware, but my feeling is that we could do better with a smarter storage mechanism. For example, we could use an SQL database, but we do not need most of its features (transactions, joins, etc.).
So given this scenario my question is the following. How do you guys deal with real-time reporting of high volume traffic? Google somehow does this for web analytics, so it is possible after all. Any secret weapon here? We are open to any solutions - be it Hadoop & Co, NoSQL, clustering or whatever else.
Aside from splitting the storage requirements for collection and reporting/analysis, one of the things we used to do is look at how often significant changes to a value occurred, and how the data would be used.
I have no idea what your data looks like, but reporting and analysis usually look for significant patterns: in-tolerance to out-of-tolerance transitions and vice versa, and particularly oscillation.
Now, while it might be laudable to collect an "infinite" amount of data just in case you want to analyse it, when you bump into the finite limits of an implementation, choices have to be made.
I did this sort of thing in a manufacturing environment. We had two levels of analysis: one for control, where the granularity was as high as we could afford, and then, as the data receded further into the past, we summarised it for reporting.
I ran into the issues you appear to be facing more than a few times, and while the loss of data was bemoaned, the complaints about how much it would cost were much louder.
So I wouldn't look at this issue from simply a technical point of view, but from a practical business one. Start from how much the business believes it can afford, and see how much you can give them for it.
Apart from graphical features, online games should have a fairly simple relational database structure. I am curious: what databases do online games like Farmville and MafiaWars use?
Is it practical to use SQL-based databases for such programs, with such frequent writes?
If not, how could one store the relational dependence of users in these games?
EDIT: As pointed out, they use NoSQL databases like Couchbase. NoSQL is fast with good concurrency (which is really needed here), but the storage size is much larger (due to the key/value structure).
1. Doesn't that slow down the system (as we need to read large database files from disk)?
2. Won't we be very limited, since we do not have SQL's JOIN to connect different sets of data?
These databases scale to about 500,000 operations per second, and they're massively distributed. Zynga still uses SQL for logs, but for game data, they presently use code that is substantially the same as Couchbase.
“Zynga’s objective was simple: we needed a database that could keep up with the challenging demands of our games while minimizing our average, fully-loaded cost per database operation – including capital equipment, management costs and developer productivity. We evaluated many NoSQL database technologies but all fell short of our stringent requirements. Our membase development efforts dovetailed with work being done at NorthScale and NHN and we’re delighted to contribute our code to the open source community and to sponsor continuing efforts to maintain and enhance the software.” - Cadir Lee, Chief Technology Officer, Zynga
To answer your edit:
You can decrease storage size by using non-key/value storage like MongoDB. That still has some overhead, but less than trying to maintain a key/value store.
It does not slow the system down terribly, since quite a few NoSQL products are memory-mapped, which means that, unlike SQL, a write does not go directly to disk but into an fsync queue that then writes to disk when it is convenient. NoSQL solutions that are not memory-mapped have extremely fast read/write speeds, and it is normally a case of trading off between the two.
As for JOINs, it is a case of arranging your schema so that you can avoid huge joins. Small joins, say joining a user with his score record, are fine, but aggregated joins will be a problem and you will need to find other ways around them; numerous solutions are suggested by the user groups of the various NoSQL products. A denormalised sketch of the idea follows below.
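Purely as an illustration of that last point, here is the same "user with his score record" modelled both ways; the dict stands in for a key/value store such as Membase/Couchbase, and the field names are made up:

```python
# With a document/key-value store you typically embed the data you would
# otherwise JOIN for, and read it back with a single key lookup.
import json

# Relational shape: two tables, joined on user_id at query time.
users = {42: {"user_id": 42, "name": "player_one"}}
scores = {42: {"user_id": 42, "level": 7, "coins": 1300}}
# Relational read path: users[42] plus scores[42], i.e. a join on user_id.

# Document shape: one value per user key, score embedded.
store = {}  # stands in for a key/value store such as Membase/Couchbase
store["user:42"] = json.dumps({"name": "player_one", "score": {"level": 7, "coins": 1300}})

doc = json.loads(store["user:42"])
print(doc["name"], doc["score"]["coins"])  # no join needed on the common read path
```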
The database they use has been reported to be Membase. It's open source, and one of the many NoSQL databases.
In January 2012, Membase became Couchbase, and you can download it here if you want to give it a try.
Here are the estimates the system should handle:
3000+ end users
150+ offices around the world
1500+ concurrent users at peak times
10,000+ daily updates
4-5 commits per second
50-70 transactions per second (reads/searches/updates)
This will be an internal-only business application, dedicated to helping a shipping company with worldwide shipment management.
What would be your technology choice, why that choice and roughly how long would it take to implement it? Thanks.
Note: I'm not recruiting. :-)
So, you asked how I would tackle such a project. In the Smalltalk world, people seem to agree that Gemstone makes things scale somewhat magically.
So, what I'd really do is this: I'd start developing in a simple Squeak image, using SandstoneDB. Then the moment would come when a single image starts being too slow, and that's when I'd move to GemStone.
GemStone then takes care of copying your public objects (those visible from a certain root) back and forth between all instances. You get sessions and enhanced query functionalities, plus quite a fast VM.
It shares data with C, Java and Ruby.
In fact, they have their own VM for ruby, which is also worth a look.
Wikipedia manages much more demanding requirements with MySQL.
Your volumes are significant but not likely to strain any credible RDBMS if programmed efficiently. If your team is sloppy (i.e., casually putting SQL queries directly into components which are then composed into larger components), you face the likelihood of a "multiplier" effect where one logical requirement (get the data necessary for this page) turns into a high number of physical database queries.
So, rather than focussing on the capacity of your RDBMS, you should focus on the capacity of your programmers and the degree to which your implementation language and environment facilitate profiling and refactoring.
The scenario you propose is clearly a 24x7x365 one, too, so you should also consider the need for monitoring / dashboard requirements.
There's no way to estimate development effort based on the needs you've presented; it's great that you've analyzed your transactions to this level of granularity, but the main determinant of development effort will be the domain and UI requirements.
Choose the technology your developers know and are familiar with. All major technologies out there will handle such requirements with ease.
Your daily update numbers and your commit rate do not add up: four commits per second is 14,400 per hour, or roughly 345,000 per day, far more than 10,000+ daily updates.
You did not mention anything about expected database size.
In any case, I would concentrate my efforts on choosing a robust back end like Oracle, Sybase, MS, etc.; this choice will make the most difference in performance. The front end could be either a desktop app or a web app, depending on needs. Since the system will be used in many offices around the world, a web app might make the most sense.
I'd go with MySQL or PostgreSQL. Not likely to have problems with either one for your requirements.
I love object-databases. In terms of commits-per-second and database-roundtrip, no relational database can hold up. Check out db4o. It's dead easy to learn, check out the examples!
As for the programming language and UI framework: well, take what your team is good at. Dynamic languages, with less ceremony to wade through, will probably save time.
There is not enough information provided here to give a proper recommendation. A little more due diligence is in order.
What is the IT culture like? Do they prefer lots of little servers or fewer bigger servers or big iron? What is their position on virtualization?
What is the corporate culture like? What is the political climate like? The open source offerings may very well handle the load but you may need to go with a proprietary vendor just because they are already used to navigating the political winds of a large company. Perception is important.
What is the maturity level of the organization? Do they already have an Enterprise Architecture team in place? Do they even know what EA is?
You've described the operational side but what about the analytical side? What OLAP technology are they expecting to use or already have in place?
Speaking of integration, what other systems will you need to integrate with?