What is generic name of this technique CouchDB uses to index aggregated data - indexing

CouchDB employs a cool pattern that can be used in a multitude of other scenarios. I'm talking about the persisted B-tree index of map/reduce results. The idea is to precalculate the aggregated data and store it at different levels of the B-tree index. The index can then be used to efficiently query the aggregate without having to reaggregate all the data all the time. Then, if any leaf-level value changes, only the ascending path through the tree has to get recalculated.
For example, if the data is price over time, the index could store the SUM and the COUNT of items at day, month, and year levels. Then, if anybody wants to query average price year-to-date all you had to do is sum up all the SUMs and COUNTs for all the full months since year start, plus all the days available for the last month, then divide total SUM by total COUNT. If a past price has to change, the change has to propagate through the index, but only corresponding day's and month's and year's values have to be updated, and even then the values for other days and other months within the year can be reused for the calculation.
What is generic name of this approach? Is anything similar exists in any of the popular RDBMSes? Any experience with using this in practice?

Materialized view
"A materialized view is a database object that contains the results of a query. They are local copies of data located remotely, or are used to create summary tables based on aggregations of a table's data. Materialized views, which store data based on remote tables, are also known as snapshots."
This is from a wikipedia article that mainly discusses storing of results in the context of a RDBMS.
Personally I prefer the term "indexed view". I actually found that wikipedia article by searching for "indexed view" on Google.

Related

Data Warehouse - Storing unique data over time

Basically we are building a reporting dashboard for our software. We are giving the Clients the ability to view basic reporting information.
Example: (I've removed 99% of the complexity of our actual system out of this example, as this should still get across what I'm trying to do)
One example metric would be...the number of unique products viewed over a certain time period. AKA, if 5 products were each viewed by customers 100 times over the course of a month. If you run the report for that month, it should just say 5 for number of products viewed.
Are there any recommendations on how to go about storing data in such a way where it can be queried for any time range, and return a unique count of products viewed. For the sake of this example...lets say there is a rule that the application cannot query the source tables directly, and we have to store summary data in a different database and query it from there.
As a side note, we have tons of other metrics we are storing, which we store aggregated by day. But this particular metric is different because of the uniqueness issue.
I personally don't think it's possible. And our current solution is that we offer 4 pre-computed time ranges where metrics affected by uniqueness are available. If you use a custom time range, then that metric is no longer available because we don't have the data pre-computed.
Your problem is that you're trying to change the grain of the fact table. This can't be done.
Your best option is what I think you are doing now - define aggregate fact tables at the grain of day, week and month to support your performance constraint.
You can address the custom time range simply by advising your users that this will be slower than the standard aggregations. For example, a user wanting to know the counts of unique products sold on Tuesdays can write a query like this, at the expense of some performance loss:
select distinct dim_prod.pcode
,count(*)
from fact_sale
join dim_prod on dim_prod.pkey = fact_sale.pkey
join dim_date on dim_date.dkey = fact_sale.dkey
where dim_date.day_name = 'Tuesday'
group by dim_prod.pcode
The query could also be written against a daily aggregate rather than a transactional fact, and as it would be scanning less data it would run faster, maybe even meeting your need
From the information that you have provided, I think you are trying to measure ' number of unique products viewed over a month (for example)'.
Not sure if you are using Kimball methodologies to design your fact tables. I believe in Kimball methodology, an Accumulating Snapshot Fact table will be recommended to meet such a requirement.
I might be preaching to the converted(apologies in that case), but if not then I would let you go through the following link where the experts have explained the concept in detail:
http://www.kimballgroup.com/2012/05/design-tip-145-time-stamping-accumulating-snapshot-fact-tables/
I have also included another link from Kimball, which explains different types of fact tables in detail:
http://www.kimballgroup.com/2014/06/design-tip-167-complementary-fact-table-types/
Hope that explains the concepts in detail. More than happy to answer any questions(to the best of my ability)
Cheers
Nithin

Are there downsides to nesting data in BigQuery?

We have data of different dimensions, for example:
Name by Company
Stock prices by Date, Company
Commodity prices by Date & Commodity
Production volumes by Date, Commodity & Company
We're thinking of the best way of storing these in BigQuery. One potential method is to put them all in the same table, and nest the extra dimensions.
That would mean:
Almost all the data would be nested - e.g. there would be a single 'row' for each Company, and then its prices would be nested by Date.
Data would have to share at least one dimension - I don't think there would be a way of representing Commodity prices in a table whose first column was the company's Name
Are there disadvantages? Are there performance implications? Is it sensible to nest 5000 dates + associated values within each company's row?
It's common to have nested/repeated columns in BigQuery schemas since it makes reasoning about the data easier. Firebase produces schemas with repetition at many levels, for instance. If you flatten everything, the downside is you need some kind of unique ID for each row in order to associate events with each other, and then you'll need aggregations (using the ID as a key) rather than simple filters if you want to do any kind of counting.
As for downsides of nested/repeated schemas, one is that you may find yourself performing complicated transformations of the structure with ARRAY subqueries or STRUCT operators, for instance. These are generally fast, but they do have some overhead relative to queries without any structure imposed on the result at all.
My best suggestion would be to load some data and run some experiments. Storage and querying both are relatively cheap, so you can try a few different schema shapes and see which works better for your purposes.
Updating in Bigquery is pretty new, but based on the public available info BigQuery DML it is currently limited to only 48 updates per table per day.
Quotas
DML statements are significantly more expensive to process than SELECT
statements.
Maximum UPDATE/DELETE statements per day per table: 48 Maximum
UPDATE/DELETE statements per day per project: 500 Maximum INSERT
statements per day per table: 1,000 Maximum INSERT statements per day
per project: 10,000
Processing nested data is also very expensive since all of the data from that column is loaded on every query. It is also slow if you are doing a lot of operations on nested data.

Slow calculated measure with tuple

I have one fact table that has all information regarding how much a company buys and sells. In order to create som calculation regarding f.ex margin I need to use the rows for a purchase together with rows for sales to get the correct calculations.
Now, I have created a calculated measure that gives me the correct result, but the more dimensions I add to my query the slower the query runs when using this calculated measure. It seems like it is using a lot of time to return the tuples I am using to find the rows regarding purchase.
I am using tuples to "store" the purcase row, but the tuple becomes quite large because I need to include all default members of dimensions used by the sales rows in order for them to be used. Basicly my tuples looks like this just with more dimension hierarchies:
(
[Dimension 1].[Hierarchy 1].&[member]
,[Dimension 1].[Hierarchy 2].&[member]
,[Dimension 2].[Hierarchy 1].&[member]
,[Dimension 3].[Hierarchy 1].&[member]
,[Dimension 4].[Hierarchy 1].&[member]
,[Measures].[Purchase Standard Cost]
)
I then multiply this tuple with a measure from the sales rows and I get my result.
Anyone have any tips on how to improve the query performance? The calculation works and if I slice by just a couple of dimensions it works just fine and performance is not to bad, but the more I add the slower it gets and the users will hit performance issues.
Since amount of used dimensions increased, Storage Engine has to scan additional files, it could be a reason of such performance degradation.
I have several suggestions based on their effectiveness from my point of view:
Implement partitioning (if it's not implemented yet) to scan lower amount of data.
"Materialize" some tuples into physical dimension (if there are no dynamics, late-binding functions etc. in MDX):
2.1. Add corresponding keys, which represents tuples, to your source tables.
2.2. Build appropriate dimensions on these keys.
2.3. Use calculated measures with these "ex-tuples".
Example:
You have a 100M rows table with columns: SomeDate, Customer, Product, Amount and a single-partitioned measure group.
You need to create tuples like (2015-01-01, Customer A, Product Z, Amount).
Server have to scan the entire data to take exact values.
Once you add partitions by SomeDate years (+slices), server will take only 2015 partition.
2.1. Add column Tuple_ID int to the table and map it during ETL.
E.g. Tuple_ID = 1 where Customer = 'Customer A' and Product = 'Product Z'
2.2. Create a dimension on this new field (or on additional table with list of combinations to be able to modify logic easily).
2.3. Use ([Tuple ID].[Tuple ID].&[1],[Measures].[Amount]) in calculation.
Advantage of such technique is that server takes only pre-calculated values, and queries speed as a result.

How can I improve performance of average method in SQL?

I'm having some performance problems where a SQL query calculating the average of a column is progressively getting slower as the number of records grows. Is there an index type that I can add to the column that will allow for faster average calculations?
The DB in question is PostgreSQL and I'm aware that particular index type might not be available, but I'm also interested in the theoretical answer, weather this is even possible without some sort of caching solution.
To be more specific, the data in question is essentially a log with this sort of definition:
table log {
int duration
date time
string event
}
I'm doing queries like
SELECT average(duration) FROM log WHERE event = 'finished'; # gets average time to completion
SELECT average(duration) FROM log WHERE event = 'finished' and date > $yesterday; # average today
The second one is always fairly fast since it has a more restrictive WHERE clause, but the total average duration one is the type of query that is causing the problem. I understand that I could cache the values, using OLAP or something, my question is weather there is a way I can do this entirely by DB side optimisations such as indices.
The performance of calculating an average will always get slower the more records you have, at it always has to use values from every record in the result.
An index can still help, if the index contains less data than the table itself. Creating an index for the field that you want the average for generally isn't helpful as you don't want to do a lookup, you just want to get to all the data as efficiently as possible. Typically you would add the field as an output field in an index that is already used by the query.
Depends what you are doing? If you aren't filtering the data then beyond having the clustered index in order, how else is the database to calculate an average of the column?
There are systems which perform online analytical processing (OLAP) which will do things like keeping running sums and averages down the information you wish to examine. It all depends one what you are doing and your definition of "slow".
If you have a web based program for instance, perhaps you can generate an average once a minute and then cache it, serving the cached value out to users over and over again.
Speeding up aggregates is usually done by keeping additional tables.
Assuming sizeable table detail(id, dimA, dimB, dimC, value) if you would like to make the performance of AVG (or other aggregate functions) be nearly constant time regardless of number of records you could introduce a new table
dimAavg(dimA, avgValue)
The size of this table will depend only on the number of distinct values of dimA (furthermore this table could make sense in your design as it can hold the domain of the values available for dimA in detail (and other attributes related to the domain values; you might/should already have such table)
This table is only helpful if you will anlayze by dimA only, once you'll need AVG(value) according to dimA and dimB it becomes useless. So, you need to know by which attributes you will want to do fast analysis on. The number of rows required for keeping aggregates on multiple attributes is n(dimA) x n(dimB) x n(dimC) x ... which may or may not grow pretty quickly.
Maintaining this table increases the costs of updates (incl. inserts and deletes), but there are further optimizations that you can employ...
For example let us assume that system predominantly does inserts and only occasionally updates and deletes.
Lets further assume that you want to analyze by dimA only and that ids are increasing. Then having structure such as
dimA_agg(dimA, Total, Count, LastID)
can help without a big impact on the system.
This is because you could have triggers that would not fire on every insert, but lets say on ever 100 inserts.
This way you can still get accurate aggregates from this table and the details table with
SELECT a.dimA, (SUM(d.value)+MAX(a.Total))/(COUNT(d.id)+MAX(a.Count)) as avgDimA
FROM details d INNER JOIN
dimA_agg a ON a.dimA = d.dimA AND d.id > a.LastID
GROUP BY a.dimA
The above query with proper indexes would get one row from dimA_agg and only less then 100 rows from detail - this would perform in near constant time (~logfanoutn) and would not require update to dimA_agg for every insert (reducing update penalties).
The value of 100 was just given as an example, you should find optimal value yourself (or even keep it variable, though triggers only will not be enough in that case).
Maintaining deletes and updates must fire on each operation but you can still inspect if the id of the record to be deleted or updated is in the stats already or not to avoid the unnecessary updates (will save some I/O).
Note: The analysis is done for the domain with discreet attributes; when dealing with time series the situation gets more complicated - you have to decide the granularity of the domain in which you want to keep the summary.
EDIT
There are also materialized views, 2, 3
Just a guess, but indexes won't help much since average must read all the record (in any order), indexes are usefull the find subsets of rows, ubt if you have to iterate on all rows with no special ordering indexes are not helping...
This might not be what you're looking for, but if your table has some way to order the data (e.g. by date), then you can just do incremental computations and store the results.
For example, if your data has a date column, you could compute the average for records 1 - Date1 then store the average for that batch along with Date1 and the #records you averaged. The next time you compute, you restrict your query to results Date1..Date2, and add the # of records, and update the last date queried. You have all the information you need to compute the new average.
When doing this, it would obviously be helpful to have an index on the date, or whatever column(s) you are using for the ordering.

Azure Table Storage - Calculate or Persist Totals

I'm looking into using Table Storage for storing some transactional data, however, I need to support some very high level reporting over it, basically totals per day / month.
Couple of options I have though of:
Use a partition / row key structure and dynamically perform sum
e.g. 20101101_ITEMID_XXXXXXXX (x = guid or time, to make unique)
then I would query for a months data using a portion of the row key (ITEMID_201011), and to a total on the "Cost" property in the type.
How would the query limit of 1000 records be managed by this though? (i.e. if there are more than 1000 transactions for the day, totaling would be hard)
Use another record to store the total for the day, and update this as new records are added
e.g. row key "20101101_ITEMID_TOTAL"
then query off this for the days totals, or months, or years totals.
What is the best way to do this? Is there a 'best practice' for this type of requirement using table storage?
I'm not sure what is the best practice but I can comment that we have a similar situation with AzureWatch and are definitely using pre-aggregated values in tables.
Mostly for performance reasons -- table storage is not instantaneous even if you query by single partition-key and a range in row-key. The time it takes to download the records is somewhat significant and depending on the records might spike the CPU up, because it needs to de-serialize the data into objects. If you get to travel to the table storage multiple times because of the 1000 record limit, you'll be paying more as well.
Some other thoughts to consider:
Will your aggregated totals ever change? If no, the this is another nudge toward pre-aggregation
Will you need to keep aggregated values after raw data is gone or will you ever need to purge raw data? If yes, then it is another nudge toward pre-aggregation