Calculating an average of a DISTINCTCOUNT efficiently in DAX? - powerpivot

I'm trying to calculate a business-logic in DAX which has turned out to be quite resource-heavy and complex. I have a very large PowerPivot model (call it "sales") with numerous dimensions and measures. A simplified view of the sales model:
+-------+--------+---------+------+---------+-------+
| State | City   | Store   | Week | Product | Sales |
+-------+--------+---------+------+---------+-------+
| NY    | NYC    | Charlie |    1 | A       | $5    |
| MA    | Boston | Bravo   |    2 | B       | $10   |
| -     | D.C.   | Delta   |    1 | A       | $20   |
+-------+--------+---------+------+---------+-------+
Essentially what I'm trying to do is calculate a DISTINCTCOUNT of product by store and week:
SUMMARIZE(Sales,[Store],[Week],"Distinct Products",DISTINCTCOUNT([Product]))
+---------+------+-------------------+
| Store   | Week | Distinct Products |
+---------+------+-------------------+
| Charlie |    1 |                15 |
| Charlie |    2 |                 7 |
| Charlie |    3 |                12 |
| Bravo   |    1 |                20 |
| Bravo   |    2 |                14 |
| Bravo   |    3 |                22 |
+---------+------+-------------------+
I then want to calculate the AVERAGE of these Distinct Products at the store level. The way I approached this was by taking the previous calculation, and running a SUMX on top of it and dividing it by distinct weeks:
SUMX(
    SUMMARIZE(Sales, [Store], [Week], "Distinct Products", DISTINCTCOUNT([Product])),
    [Distinct Products]
) / DISTINCTCOUNT([Week])
+---------+------------------+
| Store   | Average Products |
+---------+------------------+
| Charlie |             11.3 |
| Bravo   |             18.7 |
+---------+------------------+
I stored this calculation in a measure and it worked well when the dataset was smaller. But now the dataset is so huge that when I try to use the measure, it hangs until I have to cancel the process.
Is there a more efficient way to do this?

SUMX is appropriate in this case since you want the distinct product count calculated independently for each store & for each week, then summed together by store, and then divided by the number of weeks by store. There's no way around that. (If there was, I'd recommend it.)
However, SUMX is an iterator, and so is the likely cause of the slowdown. Since we can't eliminate the SUMX entirely, the biggest factor here is the number of combinations of stores/weeks that you have.
To confirm whether the number of store/week combinations is the source of the slowdown, try removing 50% of the data from a copy of your model and see if that speeds things up. If it no longer times out, add data back in to get a sense of how many combinations it takes to hit the failing point.
To make things faster with the full dataset:
You may be able to filter to a subset of stores/weeks in your pivot table, before dragging on the measure. This will typically get faster results than dragging on the measure first, then adding filters. (This isn't really a change to your measure, but more of a behaviour change for users of your model).
You might want to consider grouping at a higher level than week (e.g. month), to reduce the number of combinations the measure has to iterate over.
If you're running 32-bit Excel, or only have 4GB of RAM, consider 64-bit Excel and/or a more powerful machine (I doubt this is the issue, but I'm including it for completeness - Power Pivot can be a resource hog).
If you can move your model to Power BI Desktop (I don't believe Calculated Tables are supported in Power Pivot), you could extract out the SUMMARIZE into a calculated table, and then re-write your measure to reference that calculated table instead. This reduces the number of calculations the measure has to perform at run-time, as all the combinations of store/week plus the distinct count of products will be pre-calculated (leaving only the summing & division for your measure to do - a lot less work).
Calculated Table =
SUMMARIZE (
    Sales,
    [Store],
    [Week],
    "Distinct Products", DISTINCTCOUNT ( Sales[Product] )
)
Note: The calculated table code above is rudimentary and is mostly designed as a proof of concept. If this is the path you take, you'll want to make sure you have a separate store dimension to join the calculated table to, as it won't join to the source table directly.
Measure Using Calc Table =
SUMX (
    'Calculated Table',
    [Distinct Products] / DISTINCTCOUNT ( 'Calculated Table'[Week] )
)
Jason Thomas has a great post on calculated tables and when they can come in useful here: http://sqljason.com/2015/09/my-thoughts-on-calculated-tables-in.html.
If you can't use calculated tables, but your data is coming from a database of some form, then you could do the same logic in SQL and import a pre-prepared table of unique store/week (or store/month) combinations and their distinct product counts.
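For example, a rough sketch of that pre-aggregation in SQL, assuming a source Sales table whose Store, Week and Product columns match the simplified model above:

-- Pre-aggregate distinct products per store/week in the source database,
-- then import the result into the model as its own table.
SELECT
    Store,
    [Week],
    COUNT(DISTINCT Product) AS DistinctProducts
FROM Sales
GROUP BY Store, [Week];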
I hope some of this proves useful (or you've solved the problem another way).

Related

Time Series in Postgres

I have a huge database of eCommerce transactions on Redshift, running into about 900 million rows, with the headers being somewhat similar to this.
id  | date_stamp | location | item   | amount
----+------------+----------+--------+-------
001 | 2009-12-28 | A1       | Apples |      2
002 | 2009-12-28 | A2       | Juice  |      2
003 | 2009-12-28 | A1       | Apples |      1
004 | 2009-12-28 | A4       | Apples |      2
005 | 2009-12-29 | A1       | Juice  |      6
006 | 2009-12-29 | A4       | Apples |      2
007 | 2009-12-29 | A1       | Water  |      7
008 | 2009-12-28 | B7       | Juice  |     14
Is it possible to find trends within items? For example, if I wanted to see how "Apples" performed in terms of sales, between 2009-12-28 and 2011-12-28, at location A4, how would I go about it? Ideally I would like to generate a table with positive/negative trending, somewhat similar to the post here -
Aggregate function to detect trend in PostgreSQL
I have performed similar analysis on small data sets in R, and even visualizing it using ggplot isn't a big challenge, but the sheer size of the database is causing me some troubles, and extremely long querying times as well.
For example,
select *
from fruitstore.sales
where item = 'Apple' and location = 'A1'
order by date_stamp
limit 1000000;
takes about 2500 seconds to execute, and times out often.
I appreciate any help on this.
900M rows is quite a bit for stock Postgres to handle. One of the MPP variants (like Citus) would be able to handle it better.
Another option is to change how you're storing the data. A far more efficient structure would be to have 1 row for each month/item/location, and store an int array of amounts. That would cut things down to ~300M rows, which is much more manageable. I suspect most of your analysis tools will want to see the data as an array anyway.
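As a rough sketch of that restructuring (plain Postgres only, since Redshift doesn't support array types; table and column names follow the example above):

-- One row per month/item/location, daily amounts collapsed into an int array.
CREATE TABLE sales_by_month AS
SELECT location,
       item,
       date_trunc('month', date_stamp) AS month,
       array_agg(amount ORDER BY date_stamp) AS amounts
FROM fruitstore.sales
GROUP BY location, item, date_trunc('month', date_stamp);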
Take a look at window functions. They're great for this type of use case. They were a bit tough for me to get my head around but can save you some serious contortions with SQL.
This will show you how many apples were sold per day for the period you're interested in:
select date_trunc('day', date_stamp) as day, sum(amount) as sold
from fruitstore.sales
where item = 'Apple' and location = 'A4'
  and date_stamp::date >= '2009-12-28'::date and date_stamp::date <= '2011-12-28'::date
group by 1
order by 1 asc;
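And as a small illustration of the window functions mentioned above, this variant adds the day-over-day change, which is a first step towards the trend you're after (a sketch, assuming the same table and filters):

with daily as (
    select date_trunc('day', date_stamp) as day, sum(amount) as sold
    from fruitstore.sales
    where item = 'Apple' and location = 'A4'
      and date_stamp::date between '2009-12-28'::date and '2011-12-28'::date
    group by 1
)
select day,
       sold,
       sold - lag(sold) over (order by day) as change_vs_previous_day
from daily
order by day;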
Regarding performance, avoid using select * in Redshift. It's a columnar store where data for different columns is spread across nodes. Being explicit about the columns and only referencing the ones you use will save Redshift from moving a lot of unneeded data over the network.
Make sure you're picking good distkey and sortkeys for your tables. In a time series table the timestamp should definitely be one of the sortkeys. Enabling compression on your tables can help too.
Schedule regular VACUUM and ANALYZE runs on your tables.
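For example, a hypothetical rebuild along those lines (the key choices and the sales_tuned name are assumptions; pick keys based on your real query patterns):

-- Rebuild the table with an explicit distribution key and sort key,
-- then reclaim space and refresh planner statistics.
create table fruitstore.sales_tuned
  distkey (location)
  sortkey (date_stamp)
as
select id, date_stamp, location, item, amount
from fruitstore.sales;

vacuum fruitstore.sales_tuned;
analyze fruitstore.sales_tuned;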
Also if there's any way to restrict the range of data you're looking at by filtering possible records out in the where clause, it can help a lot. For example, if you know you only care about the trend for the last few days it can make a huge difference to limit on time like:
where date_stamp >= sysdate::date - '5 day'::interval
Here's a good article with performance tips.
To filter results in your SQL query, you can use a WHERE clause:
SELECT *
FROM myTable
WHERE
item='Apple' AND
date_stamp BETWEEN '2009-12-28' AND '2011-12-28' AND
location = 'A4'
Using Aggregate functions, you can summarize fruit sales between two dates at a location, for instance:
SELECT item as "fruit", sum(amount) as "total"
FROM myTable
WHERE
date_stamp BETWEEN '2009-12-28' AND '2011-12-28' AND
location = 'A4'
GROUP BY item
Your question asking how apples "fared" isn't terribly descriptive, but a WHERE clause and aggregate functions (don't forget your GROUP BY) are probably where you need to aim.

SQL payments matrix

I want to combine two tables into one:
The first table: Payments
id | 2010_01 | 2010_02 | 2010_03
---+---------+---------+--------
 1 |   3.000 |     500 |       0
 2 |   1.000 |     800 |       0
 3 |     200 |   2.000 |     300
 4 |     700 |   1.000 |     100
The second table is ID and some date (different for every ID)
id | date
---+------------
 1 | 2010-02-28
 2 | 2010-03-01
 3 | 2010-01-31
 4 | 2011-02-11
What I'm trying to achieve is to create table which contains all payments before the date in ID table to create something like this:
id | date       | T_00  | T_01  | T_02
---+------------+-------+-------+------
 1 | 2010-02-28 |   500 | 3.000 |
 2 | 2010-03-01 |     0 |   800 | 1.000
 3 | 2010-01-31 |   200 |       |
 4 | 2010-02-11 | 1.000 |   700 |
Where T_00 means payment in the same month as 'date' value, T_01 payment in previous month and so on.
Is there a way to do this?
EDIT:
I'm trying to achieve this in MS Access.
The problem is that I cannot connect the name of the first table's column with the date in the second table (the easiest way would be to treat it as a variable).
I added T_00 to T_24 columns to the second (ID) table and was trying to UPDATE those fields with
set T_00 =
iif(year([date]) & "_" & month([date]) = "2010_10",
but I realized that would be too much code for Access to handle if I wanted to do this for every payment period and every T_xx column.
Even if I wrote the code for T_00, I would have to repeat it for the next 23 periods.
Your Payments table is de-normalized. Those date columns are repeating groups, meaning you've violated First Normal Form (1NF). It's especially difficult because your field names are actually data. As you've found, repeating groups are a complete pain in the ass when you want to relate the table to something else. This is why 1NF is so important, but knowing that doesn't solve your problem.
You can normalize your data by creating a view that UNIONs your Payments table.
Like so:
CREATE VIEW NormalizedPayments (id, Year, Month, Amount) AS
SELECT id,
       2010 AS Year,
       1 AS Month,
       [2010_01] AS Amount
FROM Payments
UNION ALL
SELECT id,
       2010 AS Year,
       2 AS Month,
       [2010_02] AS Amount
FROM Payments
UNION ALL
SELECT id,
       2010 AS Year,
       3 AS Month,
       [2010_03] AS Amount
FROM Payments
And so on if you have more. This is how the Payments table should have been designed in the first place.
It may be easier to use a date field with the value '2010-01-01' instead of separate Year and Month fields; it depends on your data. You may also want to add WHERE Amount IS NOT NULL to each query in the UNION, or use Nz([2010_01], 0) AS Amount. Again, it depends on your data and other queries.
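For instance, one branch of the UNION written with a date field and Nz might look like this (Access date-literal syntax; a sketch only):

SELECT id,
       #2010-01-01# AS PayMonth,
       Nz([2010_01], 0) AS Amount
FROM Payments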
It's hard for me to understand how you're joining from here, particularly how the id fields relate because I don't see how they do with the small amount of data provided, so I'll provide some general ideas for what to do next.
Next you can join your second table with this normalized Payments table using a method similar to this or a method similar to this. To actually produce the result you want, include a calculated field in this view with the difference in months. Then, create an actual Pivot Table to format your results (like this or like this) which is the proper way to display data like your tables do.
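To sketch that next step: your second (id/date) table isn't named in the question, so IdDates below is a placeholder, and the month difference is computed with DateDiff against the NormalizedPayments view defined above:

SELECT p.id,
       d.[date],
       DateDiff("m", DateSerial(p.[Year], p.[Month], 1), d.[date]) AS MonthsBack,
       p.Amount
FROM NormalizedPayments AS p
INNER JOIN IdDates AS d ON p.id = d.id
WHERE DateDiff("m", DateSerial(p.[Year], p.[Month], 1), d.[date]) >= 0

MonthsBack = 0 then corresponds to T_00, 1 to T_01, and so on, which is what the pivot step can group on.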

Joining multiple tables in SQL Server

I have a table with currency exchange rates:
CREATE TABLE ExchangeRates
(
ID int IDENTITY,
SellingCurrency nvarchar(20),
BuyingCurrency nvarchar(20),
Rate float,
CONSTRAINT PK__ExchangeRates__ID PRIMARY KEY (ID)
)
For example, table contains this data:
INSERT INTO ExchangeRates (SellingCurrency, BuyingCurrency, Rate)
VALUES ('USD', 'RUB', 1.2),
       ('RUB', 'EUR', 0.5),
       ('SEK', 'RUB', 1.3)
I need to write a query which should return an exchange rate of two currencies even if there is no row in table with this two currencies (using a chain of exchanges).
How can I do it?
This problem is an excellent fit for solutions from graph theory (and for graph databases like Neo4j). That said, modelling a graph in a relational database isn't that hard, and implementing a path-finding algorithm like BFS/DFS or Dijkstra (for the shortest path) is doable too. It could be a viable solution for a small enough data set, which an exchange rate table would be, although given the iterative nature of those algorithms I'm not sure they would scale that well (it should be easy enough to implement the algorithm in a CLR proc for better performance).
Anyway, I like graph theory and found this problem interesting, so I went looking for, and found, a T-SQL stored procedure implementation of Dijkstra's algorithm (among several others) here. I adapted it to your data, with a slight and unnecessary change to the table structure: I put the currencies in a separate table so I wouldn't have to modify the procedure too much. The code isn't that hard to understand if you are familiar with how Dijkstra's algorithm works.
You can look at the implementation with examples in this SQL Fiddle.
The result from the example runs (five separate executions):
| STARTNODE | ID | NAME | DISTANCE | PATH  | NAMEPATH    |
|-----------|----|------|----------|-------|-------------|
| USD       |  2 | RUB  |      1.2 | 1,2   | USD,RUB     |
| USD       |  3 | EUR  |      1.7 | 1,2,3 | USD,RUB,EUR |
| RUB       |  3 | EUR  |      0.5 | 2,3   | RUB,EUR     |
| SEK       |  2 | RUB  |      1.3 | 4,2   | SEK,RUB     |
| SEK       |  3 | EUR  |      1.8 | 4,2,3 | SEK,RUB,EUR |
The test data uses these id numbers for currencies:
1 = USD
2 = RUB
3 = EUR
4 = SEK
Credit to the author of the original algorithm
On a side note, it's worth considering that even though it's clearly possible to use a relational database in this way, it's probably not a good idea; there are much better-suited tools for this.
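That said, if you'd rather stay in plain T-SQL without a stored procedure, a recursive CTE can walk the conversion chain for a table this small. The sketch below runs against the ExchangeRates table above (it is not the Dijkstra implementation from the fiddle), and it multiplies the rates along the chain, so its numbers won't match the DISTANCE column above, which appears to sum the edge weights:

WITH Paths AS (
    -- direct conversions
    SELECT SellingCurrency, BuyingCurrency, Rate,
           CAST(SellingCurrency + ',' + BuyingCurrency AS nvarchar(4000)) AS Path
    FROM ExchangeRates
    UNION ALL
    -- extend each path by one more hop, skipping currencies already visited
    SELECT p.SellingCurrency, e.BuyingCurrency, p.Rate * e.Rate,
           CAST(p.Path + ',' + e.BuyingCurrency AS nvarchar(4000))
    FROM Paths p
    JOIN ExchangeRates e ON e.SellingCurrency = p.BuyingCurrency
    WHERE CHARINDEX(',' + e.BuyingCurrency + ',', ',' + p.Path + ',') = 0
)
SELECT TOP (1) Rate, Path
FROM Paths
WHERE SellingCurrency = 'USD' AND BuyingCurrency = 'EUR'
ORDER BY LEN(Path);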
When you are sure you can solve it in two levels, you could use something like this:
select Rate
from ExchangeRates
where SellingCurrency = 'SEK'
and BuyingCurrency = 'EUR'
UNION
select er1.Rate*er2.Rate as Rate
from ExchangeRates er1
left join ExchangeRates er2 on er2.SellingCurrency = er1.BuyingCurrency
where er1.SellingCurrency = 'SEK'
and er2.BuyingCurrency = 'EUR'
If you know that USD/RUB = 1.2 and SEK/RUB = 1.3, then you can calculate that USD/SEK = 1.2 / 1.3, i.e. ('USD', 'SEK', 0.92).
So, if you need the rate between two currencies 'a' and 'b' that does not exist in the table, find two rows with a common currency, ('a', 'c', x) and ('b', 'c', y), and calculate ('a', 'b', x / y).
I hope that's clear enough.
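For example, that lookup as a single query against the ExchangeRates table above (here the common currency 'c' is RUB):

select a.SellingCurrency,
       b.SellingCurrency as BuyingCurrency,
       a.Rate / b.Rate   as Rate
from ExchangeRates a
join ExchangeRates b
  on a.BuyingCurrency = b.BuyingCurrency
where a.SellingCurrency = 'USD'
  and b.SellingCurrency = 'SEK';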

Row sieving statistics during WHERE clauses combined by AND

The website stores the full specifications of a large number of items and lets the user search the data by adding filtering conditions at the front end. At the back end, all the conditions are translated into clauses and combined with AND.
My aim is to give the user an idea of how many goods are thrown away or left after each filter. Exact numbers aren't very important for the initial sieving (approximations are fine, because the whole set is quite large), but at later stages, when there are only ten or so items left, the user should get the exact amount.
There's the obvious straightforward way of making as many SELECT COUNT queries as there are filters, but I feel there might be a technique to achieve this more elegantly and without abusing the DB as much.
There are many ways to achieve this with varying levels of difficulty and performance.
The first and most obvious way to me is to simply do a count on the filters which performs fairly well and is not that difficult to implement. An alternative but similar approach would be to group by the values and do a count.
Here's a fiddle as an example of both methods: http://sqlfiddle.com/#!15/0cdcb/26
select
count(product.id) total,
sum((v0.value = 'spam')::int) v0_is_spam,
sum((v0.value != 'spam')::int) v0_not_spam,
sum((v1.value = 'spam')::int) v1_is_spam,
sum((v1.value != 'spam')::int) v1_not_spam
from product
left join specification_value v0 on v0.product_id = product.id and v0.specification_id = 1
left join specification_value v1 on v1.product_id = product.id and v1.specification_id = 2;
select specification.id, value, count(*)
from specification
left join specification_value on specification.id = specification_value.specification_id
group by specification.id, value;
A slightly more difficult way to do something like that is using window functions, a lot more flexible but not as easy to grasp. Docs are here: http://www.postgresql.org/docs/9.3/static/tutorial-window.html
Example query and results:
SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname) FROM empsalary;
  depname  | empno | salary |          avg
-----------+-------+--------+------------------------
 develop   |    11 |   5200 | 5020.0000000000000000
 develop   |     7 |   4200 | 5020.0000000000000000
 develop   |     9 |   4500 | 5020.0000000000000000
 develop   |     8 |   6000 | 5020.0000000000000000
 develop   |    10 |   5200 | 5020.0000000000000000
 personnel |     5 |   3500 | 3700.0000000000000000
 personnel |     2 |   3900 | 3700.0000000000000000
 sales     |     3 |   4800 | 4866.6666666666666667
 sales     |     1 |   5000 | 4866.6666666666666667
 sales     |     4 |   4800 | 4866.6666666666666667
(10 rows)
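Applied to the filter-count problem, the same idea could look something like this (a sketch reusing the product / specification_value tables from the fiddle above):

select product.id,
       v0.value as v0_value,
       count(*) over ()                      as total_products,
       count(*) over (partition by v0.value) as products_with_same_v0
from product
left join specification_value v0
       on v0.product_id = product.id
      and v0.specification_id = 1;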
And lastly, by far the fastest but also the most inaccurate and difficult to implement: using the database statistics to estimate the number of rows. I would not opt for this unless you have millions of rows within a filter set and no way of reducing it further. Also, don't do this unless performance is really so bad that it's needed.
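For reference, a couple of ways to read such estimates in Postgres (rough numbers only, refreshed by ANALYZE; table names again taken from the fiddle):

-- planner's table-level row estimate, no scan involved
select reltuples::bigint as estimated_rows
from pg_class
where relname = 'product';

-- per-query estimate: read the "rows=..." figure from the plan instead of running the count
explain
select product.id
from product
left join specification_value v0
       on v0.product_id = product.id and v0.specification_id = 1
where v0.value = 'spam';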

Best way to join the two tables *including* duplicates from one table

Accounts (table)
+----+----------+----------+-------+
| id | account# | supplier | RepID |
+----+----------+----------+-------+
|  1 | 123xyz   | Boston   |     2 |
|  2 | 245xyz   | Chicago  |     2 |
|  3 | 425xyz   | Chicago  |     3 |
+----+----------+----------+-------+
PayOut (table)
+----+----------+----------+-------------+--------+
| id | account# | supplier | datecreated | Amount |
+----+----------+----------+-------------+--------+
|  5 | 245xyz   | Chicago  | 01-15-2009  |     25 |
|  6 | 123xyz   | Boston   | 10-15-2011  |     50 |
|  7 | 123xyz   | Boston   | 10-15-2011  |    -50 |
|  8 | 123xyz   | Boston   | 10-15-2011  |     50 |
|  9 | 425xyz   | Chicago  | 10-15-2011  |    100 |
+----+----------+----------+-------------+--------+
I have an Accounts table and a PayOut table. The PayOut table comes from abroad, so we have no control over it. This leaves us with a problem we can't solve: we can't join the two tables on the record ID field. We therefore join on Account# and Supplier (the 2nd and 3rd columns). This (possibly) creates a many-to-many relationship, but we filter accounts on whether they are active and filter the PayOut table on when the payout was created. Payouts are created month to month. There are two problems with this, in my view:
The query takes quite a bit of time to complete (it could be inefficient).
Certain duplicates are removed that should not be removed. Records 6 and 8 in the PayOut table are an example: we got a customer, the customer cancelled, and then we got them back, hence +50, -50 and +50. All of these values are valid and must show in the report for audit purposes, but currently only one +50 is shown; the other is lost. A couple of other problems come up in the report once in a while.
Our current query uses GROUP BY to remove the duplicates. I would like an improved query that performs better and that does not drop any record in the PayOut table as long as it falls within the month of the report.
Here is our current query
/* Supplied to Store Procedure */
-----------------------------------
#RepID // the person for whom the payout is calculated
#Month // of payment date
#year // year of payment date
-----------------------------------
select distinct
A.col1,
A.col2,
...
A.col10,
B.col2,
B.Col2,
B.Amount /* this is the important column, portion of which goes to Rep */
from records A
JOIN payout B
on A.Supplier = B.Supplier AND A.Account# = B.Account#
where datepart(mm, B.datecreated) = #Month /* parameter to stored procedure */
and datepart(yyyy, B.datecreated) = #Year
and A.[rep ID] = #RepID /* parameter to SP */
group by
col1,col2,col3,....col10
order by customerName
Is this query optimal? Can I improve it using CROSS APPLY or WHERE EXISTS so that it runs faster and also avoids the duplicate problem?
Note that this query is used to get the payout of a rep, hence every record has a RepID field identifying who it is assigned to. Ideally I would like to use a WHERE EXISTS query.
It's difficult to understand exactly what you want because in one place you say you 'want' the duplicates but then you say that you are using the group by to remove duplicates. So the first thought would be "Why not just get rid of the group by?". But I have to believe you are smart enough to have thought of that yourself, so I assume it's got to be there for a reason.
I think someone here could help you pretty easily if you could post the actual query, but since you say you can't I will just try to give you some direction in solving the problem...
Instead of trying to do everything in one statement, use temporary tables or views to split it up. It may be easier for you to think about how to get rid of the duplicates you don't want and keep the ones you do first and put those into a temporary table, and then join the tables together and work with that.
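For example, a rough sketch of that shape in T-SQL (table and column names are placeholders taken from your description, so adjust to your real schema):

-- Step 1: keep only the month's payouts; nothing is de-duplicated here,
-- so the +50 / -50 / +50 rows all survive.
select B.[account#], B.supplier, B.datecreated, B.Amount
into #month_payouts
from payout B
where datepart(mm, B.datecreated) = @Month
  and datepart(yyyy, B.datecreated) = @Year;

-- Step 2: join the pre-filtered payouts to the rep's accounts.
select A.*, P.datecreated, P.Amount
from records A
join #month_payouts P
  on A.Supplier = P.supplier
 and A.[Account#] = P.[account#]
where A.[rep ID] = @RepID
order by A.customerName;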