What is a preferred datastore for fast aggregation of data?
I have data that I pull from other systems regularly, and the data store should support queries like:
What is the number of transactions done by a user in a time range.
What is the total sum of successful transactions done by a user in a time range.
Queries should support SQL constructs like GROUP BY, COUNT, and SUM over a large set of data.
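In SQL terms, the queries I need would look something like this (table and column names here are just illustrative):

SELECT COUNT(*) AS txn_count,
       SUM(CASE WHEN status = 'SUCCESS' THEN amount ELSE 0 END) AS success_total
FROM transactions
WHERE user_id = 'u123'
  AND txn_time BETWEEN '2015-01-01' AND '2015-01-31';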
Right now I'm using a custom data model in Redis: data is fetched into memory and then aggregates are run over it. The problem with this model is that it is closely tied to my pivots (columns); any additional pivot will cause my data to explode, leading to huge memory consumption on my Redis boxes.
I've explored Elasticsearch, but its queries with aggregations take longer than 200 ms for the kind of data I have.
Are there any other alternatives? I'm also looking at Aerospike now. Can someone shed some light on how Aerospike aggregations would work in this scenario?
Aerospike supports aggregations on top of secondary index queries. It seems most of your queries are pivoted on user, so you can build a secondary index on userid and query for all the data corresponding to a user. You can then apply the aggregation logic and filter the records down to the desired time range within the aggregation. You need to do this yourself because Aerospike does not yet support multiple WHERE clauses, i.e. querying on a user and a time range at the same time.
Your queries 1 and 2 can be done by writing an aggregation UDF on top of a secondary index query on userid, as above.
I am not very clear about your third requirement. Aerospike does not provide GROUP BY, SUM, COUNT, etc. as native queries, but you can always write an aggregation UDF to achieve it: http://www.aerospike.com/docs/guide/aggregation.html
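For illustration, in AQL (Aerospike's SQL-like shell) the flow would look roughly like this, assuming a Lua stream UDF module named txn_stats with a function summarize that filters on the time range and computes the count/sum (all names here are hypothetical):

CREATE INDEX user_idx ON test.transactions (userid) STRING
AGGREGATE txn_stats.summarize('2015-01-01', '2015-01-31') ON test.transactions WHERE userid = 'u123'

The secondary index query narrows the stream to one user's records, and the UDF does the time-range filtering and aggregation on that stream.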
I have big data events (TBs) that I need to query, and I am trying to partition them correctly.
I have clients, and each client has many games.
The problem is that there are fields we query on that might be null in some events, so they cannot be used as partitions (for example: segment).
I thought about 2 strategies:
1. partition by client/game/date (in S3)
2. a different table per client or game (in different buckets), partitioned only by date
Option 1 is simple, and I filter in the WHERE clause.
Option 2 will require unions.
What is the correct way to partition such data?
And by correct I mean the most efficient and most cost-effective?
Regards,
Ido
As described, the big data events have the following structure:
There are multiple clients, each client has multiple games, and each game has multiple events that can be partitioned on date.
Now, for different games the event schema may be different, and hence queries may return null values. There is no dependency on the client: for different clients and the same game, the event schema should be the same.
So, between client/game/date and game/client/date, it is better to partition by game/client/date, because after the first level of partitioning the event schema is the same. From a query perspective it makes no difference for a query that does not filter on the game partition field, but if the game partition field is used in the query, it results in higher efficiency.
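For illustration, a layout along these lines could be declared roughly as follows (a sketch in Hive/Athena-style SQL; table, column, and bucket names are hypothetical):

CREATE EXTERNAL TABLE events (
  event_id   string,
  event_time timestamp,
  segment    string          -- nullable fields stay as ordinary columns, not partitions
)
PARTITIONED BY (game string, client string, dt string)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/';

SELECT count(*)
FROM events
WHERE game = 'chess' AND client = 'acme' AND dt BETWEEN '2020-01-01' AND '2020-01-31';

A query that names the game (and client/date) partition columns only scans the matching prefixes in S3.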
I've been optimizing many cubes that take a long time to process: approximately 20 minutes per 10 million rows. I've created partitions and processing became short, about 4 minutes per 10 million. I've also created one aggregation design for all partitions, with full MOLAP processing and 100% aggregation (the cube is not that big). Is there any reason to create an aggregation design for each partition? Will it work faster when a user refreshes a pivot table based on the OLAP cube?
Thanks.
Typically you have one aggregation design shared by all partitions in a measure group. On very large measure groups you might have a second lightweight aggregation design for very old rarely used partitions.
Adding lots of aggregation designs (like a separate one per partition) will likely slow down queries a tiny bit because of all the extra time it takes internally to figure out which aggregation to read from.
If you used the Aggregation Wizard, don't bother: it knows nothing about how you query your cube and will create useless aggregations that waste processing time. Instead, deploy your cube, then go back in a few days after users have run some queries and do Usage-Based Optimization.
Creating partitions is a good way to improve the cube processing time.
Aggregations are useful if done on the correct fields. By correct I mean the filter selections used most frequently by the users. Usage-Based Optimization is a good approach to achieve this.
Also read through the article below to understand the approach to take when checking performance.
https://mytechconnect.wordpress.com/2013/08/27/ssas-performance-best-practices-and-performance-optimization/
In traditional data modeling, I create hourly and daily rollup tables to reduce data storage and improve query response time. However, attempts to create similar rollup tables in BigQuery easily run into the "Response too large to return" error. What is the recommended method of creating rollup tables with BigQuery? I need to reduce the data to cut the cost of storage and queries.
Thx!
A recently announced BigQuery feature allows large results!
Now you can specify a flag and a destination table. Results of arbitrary size will be stored in the designated table.
https://developers.google.com/bigquery/docs/queries#largequeryresults
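For example, a daily rollup can then be written as an ordinary query whose results land in a new table (the flag and the destination table are specified in the bq tool or API call, not in the SQL; table and column names below are illustrative):

SELECT DATE(event_time) AS day,
       user_id,
       COUNT(*) AS events,
       SUM(amount) AS total_amount
FROM mydataset.events
GROUP BY day, user_id;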
It sounds like you are appending all of your data to a single table, then want to create smaller tables to query over ... is that correct?
One option would be to load your data in hourly slices, then create the daily and 'all' tables by performing table copy operations with write_disposition=WRITE_APPEND. Alternatively, you can use multiple tables in your queries, for example: SELECT foo FROM table20130101, table20130102, table20130103. (Note this does not do a join; it does a UNION ALL. It is a quirk of the BigQuery query syntax.)
If it will be difficult to change the layout of your tables: there isn't currently support for larger query result sizes, but it is one of our most requested features and it is a high priority for us.
Also, creating smaller tables won't necessarily improve query performance, since BigQuery processes queries in parallel to the extent possible. It won't reduce storage costs, unless you're only going to store part of the table. It will, of course, reduce the cost of a query, since running queries against larger tables is more expensive.
If you describe your scenario a bit more I may be able to offer more concrete advice.
We are experimenting with BigQuery to analyze user data generated by our software application.
Our working table consists of hundreds of millions of rows, each representing a unique user "session". Each row contains a timestamp, a UUID, and other fields describing the user's interaction with our product during that session. We currently generate about 2 GB of data (~10M rows) per day.
Every so often we may run queries against the entire dataset (about 2 months' worth right now, and growing); however, typical queries will span just a single day, week, or month. We're finding that as our table grows, our single-day queries become more and more expensive (as we would expect given BigQuery's architecture).
What is the best way to query subsets of our data more efficiently? One approach I can think of is to "partition" the data into separate tables by day (or week, month, etc.) and then query them together in a union:
SELECT foo from
mytable_2012-09-01,
mytable_2012-09-02,
mytable_2012-09-03;
Is there a better way than this???
BigQuery now supports table partitions by date:
https://cloud.google.com/blog/big-data/2016/03/google-bigquery-cuts-historical-data-storage-cost-in-half-and-accelerates-many-queries-by-10x
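With a date-partitioned table, a day range can be selected from a single table by filtering on the _PARTITIONTIME pseudo column; a sketch in legacy SQL (table and column names are illustrative):

SELECT foo
FROM mydataset.mytable
WHERE _PARTITIONTIME >= TIMESTAMP('2012-09-01')
  AND _PARTITIONTIME < TIMESTAMP('2012-09-04');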
Hi David: The best way to handle this is to shard your data across many tables and run queries as you suggest in your example.
To be more clear, BigQuery does not have a concept of indexes (by design), so sharding data into separate tables is a useful strategy for keeping queries as economically efficient as possible.
On the flip side, another useful feature for people worried about having too many tables is to set an expirationTime for tables, after which tables will be deleted and their storage reclaimed - otherwise they will persist indefinitely.
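If the per-day tables follow a mytable_YYYYMMDD naming scheme (an assumption here; BigQuery table names cannot contain dashes), legacy SQL's TABLE_DATE_RANGE function can select a span of sharded tables without listing each one:

SELECT foo
FROM (TABLE_DATE_RANGE([mydataset.mytable_],
                       TIMESTAMP('2012-09-01'),
                       TIMESTAMP('2012-09-03')));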
I have a postgres database with several million rows, which drives a web app. The data is static: users don't write to it.
I would like to be able to offer users query-able aggregates (e.g. the sum of all rows with a certain foreign key value), but the size of the database now means it takes 10-15 minutes to calculate such aggregates.
Should I:
1. start pre-calculating aggregates in the database (since the data is static), or
2. move away from Postgres and use something else?
The only problem with 1. is that I don't necessarily know which aggregates users will want, and it will obviously increase the size of the database even further.
If there is a better solution than Postgres for such problems, I'd be very grateful for any suggestions.
You are trying to solve an OLAP (On-Line Analytical Processing) database structure problem with an OLTP (On-Line Transaction Processing) database structure.
You should build another set of tables that store just the aggregates and update these tables in the middle of the night. That way your customers can query the aggregate set of tables and it won't interfere with the on-line transaction processing system at all.
The only caveat is that the aggregate data will always be one day behind.
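A minimal sketch of the nightly refresh, assuming a transactions table with a foreign-key column and an amount column (all names here are hypothetical):

-- Rebuild the aggregate table overnight.
TRUNCATE TABLE daily_aggregates;
INSERT INTO daily_aggregates (fk_id, txn_count, txn_total)
SELECT fk_id, COUNT(*), SUM(amount)
FROM transactions
GROUP BY fk_id;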
1. Yes.
2. Possibly. Presumably there are a whole heap of things you would need to consider before changing your RDBMS. If you moved to SQL Server, you would use indexed views to accomplish this: Improving Performance with SQL Server 2008 Indexed Views
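A rough illustration of what an indexed view looks like in T-SQL (table and column names are assumed, and the SUM column must be non-nullable to be indexable):

-- The view must be schema-bound and include COUNT_BIG(*) when it has a GROUP BY.
CREATE VIEW dbo.v_totals_by_fk WITH SCHEMABINDING AS
SELECT fk_id, SUM(amount) AS total_amount, COUNT_BIG(*) AS row_count
FROM dbo.transactions
GROUP BY fk_id;
GO
-- The unique clustered index is what materializes the aggregate.
CREATE UNIQUE CLUSTERED INDEX ix_v_totals_by_fk ON dbo.v_totals_by_fk (fk_id);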
If you store the aggregates in an intermediate object (something like MyAggregatedResult), you could consider a caching proxy:
class ResultsProxy {
    // cache keyed by the query parameters; expensiveAggregation() stands in
    // for the real calculation and is assumed to exist elsewhere
    private final java.util.Map<String, MyAggregatedResult> cache =
            new java.util.concurrent.ConcurrentHashMap<>();

    MyAggregatedResult calculateResult(String param1, String param2) {
        // retrieve from cache; if not found, calculate and store in cache
        return cache.computeIfAbsent(param1 + "|" + param2,
                key -> expensiveAggregation(param1, param2));
    }
}
There are quite a few caching frameworks for Java, and most likely for other languages/environments such as .NET as well. These solutions can take care of invalidation (how long a result should be stored in memory) and memory management (removing old cache items when reaching a memory limit, etc.).
If you have a set of commonly-queried aggregates, it might be best to create an aggregate table that is maintained by triggers (or an observer pattern tied to your OR/M).
Example: say you're writing an accounting system. You keep all the debits and credits in a General Ledger table (GL). Such a table can quickly accumulate tens of millions of rows in a busy organization. To find the balance of a particular account on the balance sheet as of a given day, you would normally have to calculate the sum of all debits and credits to that account up to that date, a calculation that could take several seconds even with a properly indexed table. Calculating all figures of a balance sheet could take minutes.
Instead, you could define an account_balance table. For each account and dates or date ranges of interest (usually each month's end), you maintain a balance figure by using a trigger on the GL table to update balances by adding each delta individually to all applicable balances. This spreads the cost of aggregating these figures over each individual persistence to the database, which will likely reduce it to a negligible performance hit when saving, and will decrease the cost of getting the data from a massive linear operation to a near-constant one.
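A rough sketch of such a trigger (MySQL-style syntax for brevity; table and column names follow the example above but are assumed):

-- Add each new GL delta to every balance row it affects.
CREATE TRIGGER gl_after_insert AFTER INSERT ON gl
FOR EACH ROW
  UPDATE account_balance
  SET balance = balance + NEW.amount
  WHERE account_id = NEW.account_id
    AND balance_date >= NEW.posting_date;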
For that data volume you shouldn't have to move off Postgres.
I'd look to tuning first: 10-15 minutes seems pretty excessive for 'a few million rows'; this ought to take just a few seconds. Note that the out-of-the-box config settings for Postgres don't (or at least didn't) allocate much disk buffer memory. You might look at that as well.
More complex solutions involve implementing some sort of data mart or an OLAP front-end such as Mondrian over the database. The latter does pre-calculate aggregates and caches them.
If you have a set of common aggregates, you can calculate them beforehand (say, once a week) in a separate table and/or columns, and users will get them fast.
But I'd pursue the tuning route too: revise your indexing strategy. As your database is read-only, you don't need to worry about index update overhead.
Revise your database configuration as well; maybe you can squeeze some performance out of it. Default configurations are normally aimed at easing the life of first-time users and quickly fall short with large databases.
Maybe some denormalization can even speed things up after you've revised your indexing and database configuration, if you find you still need more performance, but try it as a last resort.
Oracle supports a concept called Query Rewrite. The idea is this:
When you want a lookup (WHERE ID = val) to go faster, you add an index. You don't have to tell the optimizer to use the index - it just does. You don't have to change the query to read FROM the index... you hit the same table as you always did but now instead of reading every block in the table, it reads a few index blocks and knows where to go in the table.
Imagine if you could add something like that for aggregation. Something that the optimizer would just 'use' without being told to change. Let's say you have a table called DAILY_SALES for the last ten years. Some sales managers want monthly sales, some want quarterly, some want yearly.
You could maintain a bunch of extra tables that hold those aggregations and then tell the users to change their queries to use a different table. In Oracle, you'd instead build those as materialized views. You do no work except defining the MV and an MV log on the source table. Then if a user queries DAILY_SALES for a sum by month, Oracle will rewrite the query to use an appropriate level of aggregation. The key is that it does so WITHOUT changing the query at all.
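A rough Oracle sketch of that setup (table and column names are assumed):

-- Change log on the source table, so the MV can be refreshed incrementally.
CREATE MATERIALIZED VIEW LOG ON daily_sales
  WITH ROWID (sale_date, amount) INCLUDING NEW VALUES;

-- Monthly rollup that the optimizer may use via query rewrite.
CREATE MATERIALIZED VIEW monthly_sales
  BUILD IMMEDIATE
  REFRESH FORCE ON DEMAND
  ENABLE QUERY REWRITE AS
SELECT TRUNC(sale_date, 'MM') AS sale_month, SUM(amount) AS total_amount
FROM daily_sales
GROUP BY TRUNC(sale_date, 'MM');

-- Users keep querying the base table; Oracle can rewrite this to read monthly_sales.
SELECT TRUNC(sale_date, 'MM'), SUM(amount)
FROM daily_sales
GROUP BY TRUNC(sale_date, 'MM');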
Maybe other DBs support that... but this is clearly what you are looking for.