Don't return batch in the results - kql

Can I use a batch for temporal calculation, but eventually not include it in the results?
For example, I want to use 'temporal' tabular for calculation of other queries, but I do not want to get it later in the results (because it's too large). I.e. the results should contain only Tables X and Y.
requests
| take 1000
| as temporal;
temporal | summarize count() | as X;
temporal | summarize avg(duration) | as Y;
P.s. using 'let' is impossible in my scenario

With this solution you get an extra result set, however it is empty.
P.S.
You might want to use hint.materialized=true to prevent from the temporal data set to be computed twice.
// Sample data generation. Not part of the solution.
let requests = materialize(range i from 1 to 1000000 step 1 | extend duration = 1d * rand());
// Solution starts here.
requests
| take 1000
| as hint.materialized=true temporal
| take 0
;
temporal | summarize count() | as X;
temporal | summarize avg(duration) | as Y;
Fiddle

With this solution you get just the results set you are interested in.
You might consider the code less "clean".
// Sample data generation. Not part of the solution.
let requests = materialize(range i from 1 to 1000000 step 1 | extend duration = 1d * rand());
// Solution starts here.
requests
| take 1000
| as hint.materialized=true temporal
| summarize count() | as X;
temporal | summarize avg(duration) | as Y;
Fiddle

Related

Calculating an average of a DISTINCTCOUNT efficiently in Dax?

I'm trying to calculate a business-logic in DAX which has turned out to be quite resource-heavy and complex. I have a very large PowerPivot model (call it "sales") with numerous dimensions and measures. A simplified view of the sales model:
+-------+--------+---------+------+---------+-------+
| State | City | Store | Week | Product | Sales |
+-------+--------+---------+------+---------+-------+
| NY | NYC | Charlie | 1 | A | $5 |
| MA | Boston | Bravo | 2 | B | $10 |
| - | D.C. | Delta | 1 | A | $20 |
+-------+--------+---------+------+---------+-------+
Essentially what I'm trying to do is calculate a DISTINCTCOUNT of product by store and week:
SUMMARIZE(Sales,[Store],[Week],"Distinct Products",DISTINCTCOUNT([Product]))
+---------+------+-------------------+
| Store | Week | Distinct Products |
+---------+------+-------------------+
| Charlie | 1 | 15 |
| Charlie | 2 | 7 |
| Charlie | 3 | 12 |
| Bravo | 1 | 20 |
| Bravo | 2 | 14 |
| Bravo | 3 | 22 |
+---------+------+-------------------+
I then want to calculate the AVERAGE of these Distinct Products at the store level. The way I approached this was by taking the previous calculation, and running a SUMX on top of it and dividing it by distinct weeks:
SUMX(
SUMMARIZE(Sales,[Store],[Week],"Distinct Products",DISTINCTCOUNT([Product]))
,[Distinct Products]
) / DISTINCTCOUNT([Week])
+---------+------------------+
| Store | Average Products |
+---------+------------------+
| Charlie | 11.3 |
| Bravo | 18.7 |
+---------+------------------+
I stored this calculation in a measure and it worked well when the dataset was smaller. But now the dataset is so huge that when I try to use the measure, it hangs until I have to cancel the process.
Is there a more efficient way to do this?
SUMX is appropriate in this case since you want the distinct product count calculated independently for each store & for each week, then summed together by store, and then divided by the number of weeks by store. There's no way around that. (If there was, I'd recommend it.)
However, SUMX is an iterator, and so is the likely cause of the slowdown. Since we can't eliminate the SUMX entirely, the biggest factor here is the number of combinations of stores/weeks that you have.
To confirm if the number of combinations of stores/weeks is the source of the slowdown, try filtering or removing 50% from a copy of your data model and see if that speeds things up. If that doesn't time out, add more back in to get a sense of how many combinations are the failing point.
To make things faster with the full dataset:
You may be able to filter to a subset of stores/weeks in your pivot table, before dragging on the measure. This will typically get faster results than dragging on the measure first, then adding filters. (This isn't really a change to your measure, but more of a behaviour change for users of your model).
You might want to consider grouping at a higher level than week (e.g. month), to reduce the number of combinations it has to iterate over
If you're running Excel 32-bit, or only have 4GB of RAM, consider 64-bit Excel and/or a more powerful machine (I doubt this is the case, but am including for comprehensiveness - Power Pivot can be a resource hog)
If you can move your model to Power BI Desktop (I don't believe Calculated Tables are supported in Power Pivot), you could extract out the SUMMARIZE into a calculated table, and then re-write your measure to reference that calculated table instead. This reduces the number of calculations the measure has to perform at run-time, as all the combinations of store/week plus the distinct count of products will be pre-calculated (leaving only the summing & division for your measure to do - a lot less work).
.
Calculated Table =
SUMMARIZE (
Sales,
[Store],
[Week],
"Distinct Products", DISTINCTCOUNT ( Sales[Product] )
)
Note: The calculated table code above is rudimentary and is mostly designed as a proof of concept. If this is the path you take, you'll want to make sure you have a separate store dimension to join the calculated table to, as this won't join to the source table directly
Measure Using Calc Table =
SUMX (
'Calculated Table',
[Distinct Products] / DISTINCTCOUNT ( 'Calculated Table'[Week] )
)
Jason Thomas has a great post on calculated tables and when they can come in useful here: http://sqljason.com/2015/09/my-thoughts-on-calculated-tables-in.html.
If you can't use calculated tables, but your data is coming from a database of some form, then you could do the same logic in SQL and then import a pre-prepared separate table of unique store/months and their distinct counts.
I hope some of this proves useful (or you've solved the problem another way).

Time Series in Postgres

I have a huge database of eCommerce transactions on Redshift, running into about 900 million rows, with the headers being somewhat similar to this.
id | date_stamp | location | item | amount
001 | 2009-12-28 | A1 | Apples | 2
002 | 2009-12-28 | A2 | Juice | 2
003 | 2009-12-28 | A1 | Apples | 1
004 | 2009-12-28 | A4 | Apples | 2
005 | 2009-12-29 | A1 | Juice | 6
006 | 2009-12-29 | A4 | Apples | 2
007 | 2009-12-29 | A1 | Water | 7
008 | 2009-12-28 | B7 | Juice | 14
Is it possible to find trends within items? For example, if I wanted to see how "Apples" performed in terms of sales, between 2009-12-28 and 2011-12-28, at location A4, how would I go about it? Ideally I would like to generate a table with positive/negative trending, somewhat similar to the post here -
Aggregate function to detect trend in PostgreSQL
I have performed similar analysis on small data sets in R, and even visualizing it using ggplot isn't a big challenge, but the sheer size of the database is causing me some troubles, and extremely long querying times as well.
For example,
select *
from fruitstore.sales
where item = 'Apple' and location = 'A1'
order by date_stamp
limit 1000000;
takes about 2500 seconds to execute, and times out often.
I appreciate any help on this.
900M rows is quite a bit for stock Postgres to handle. One of the MPP variants (like Citus) would be able to handle it better.
Another option is to change how you're storing the data. A far more efficient structure would be to have 1 row for each month/item/location, and store an int array of amounts. That would cut things down to ~300M rows, which is much more manageable. I suspect most of your analysis tools will want to see the data as an array anyway.
Take a look at window functions. They're great for this type of use case. They were a bit tough for me to get my head around but can save you some serious contortions with SQL.
This will show you how many apples were sold per day for the period you're interested in:
select date_trunc('day', date_stamp) as day, count(*) as sold
from fruitstore.sales
where item = 'Apple' and location = 'A4'
and date_stamp::date >= '2009-12-28'::date and date_stamp::date <= '2011-12-28'::date
group by 1 order by 1 asc
Regarding performance, avoid using select * in Redshift. It's a columnar store where data for different columns is spread across nodes. Being explicit about the columns and only referencing the ones you use will save Redshift from moving a lot of unneeded data over the network.
Make sure you're picking good distkey and sortkeys for your tables. In a time series table the timestamp should definitely be one of the sortkeys. Enabling compression on your tables can help too.
Schedule regular VACUUM and ANALYZE runs on your tables.
Also if there's any way to restrict the range of data you're looking at by filtering possible records out in the where clause, it can help a lot. For example, if you know you only care about the trend for the last few days it can make a huge difference to limit on time like:
where date_stamp >= sysdate::date - '5 day'::interval
Here's a good article with performance tips.
To filter results in your SQL query, you can use a WHERE clause:
SELECT *
FROM myTable
WHERE
item='Apple' AND
date_stamp BETWEEN '2009-12-28' AND '2011-12-28' AND
location = 'A4'
Using Aggregate functions, you can summarize fruit sales between two dates at a location, for instance:
SELECT item as "fruit", sum(amount) as "total"
FROM myTable
WHERE
date_stamp BETWEEN '2009-12-28' AND '2011-12-28' AND
location = 'A4'
GROUP BY item
Your question asking how apples "Fared" isn't terrible descriptive, but using a WHERE clause and aggregate functions (don't forget your group by) are probably where you need to aim.

Selecting adjacent rows in an SQL query

The following is a problem which is not well-suited to an RDBMS, I think, but that is what I've got deal with.
I am trying to write a tool to search through logs stored in a database.
Some rows might be:
Time | ID | Object | Description
2012-01-01 13:37 | 1 | 1 | Something happened
2012-01-01 13:39 | 2 | 2 | Something else happened
2012-01-01 13:50 | 3 | 2 | Bad
2012-01-01 14:08 | 4 | 1 | Good
2012-01-01 14:27 | 5 | 1 | Bad
2012-01-01 14:30 | 6 | 2 | Good
Object is a foreign key. In practice, Time will increase with ID but that is not an actual constraint. In reality there are more fields. It's a Postgres database - I'd like to be able to support SQLite as well but am aware this may well be impossible.
Now, I want to be able to run a query for, say, all Bad events that happened to Object 2:
SELECT * FROM table WHERE Object = 2 AND Description = 'Bad';
But it would often be useful to see some lines of context around the results - just as with the -C option to grep is very useful when searching through text logs.
For the above query, if we wanted one line of context either side, we would want rows 2 and 6 in addition to row 3.
If the original query returned multiple rows, more context would need to be retrieved.
Notice that the context is not retrieved from the events associated with Object 1; we eliminate only the restriction on the Description.
Also, the order involved, and hence what determines what is adjacent to what, is that induced by the Time field.
This specifies what I want to achieve, but the database concerned is fairly big, at least in comparison to the power of the machine it's running on.
The most often cited solution for getting adjacent rows requires you to run one extra query per result in what I'll call the base query; this is no good because that might be thousands of queries.
My current least bad solution is to run a query to retrieved the IDs of all possible rows that could be context - in the above example, that would be a search for all rows relating to Object 2. Then I get the IDs matching the base query, expand (using the list of all possible IDs) to a list of IDs of rows matching the base query or in context, then finally retrieve the data for those IDs.
This works, but is inelegant and slow.
It is especially slow when using the tool from a remote computer, as that initial list of IDs can be very large, and retrieving it and then just transmitting it over the internet can be inordinate.
Another solution I have tried is using a subquery or view that computes the "buffer sequence" of the rows.
Here's what the table looks like with this field added:
Time | ID | Sequence | Object | Description
2012-01-01 13:37 | 1 | 1 | 1 | Something happened
2012-01-01 13:39 | 2 | 1 | 2 | Something else happened
2012-01-01 13:50 | 3 | 2 | 2 | Bad
2012-01-01 14:08 | 4 | 2 | 1 | Good
2012-01-01 14:27 | 5 | 3 | 1 | Bad
2012-01-01 14:30 | 6 | 3 | 2 | Good
Running the base query on this table then allows you to generate the list of IDs you want by adding or subtracting from the Sequence value.
This eliminates the problem of transferring loads of rows over the wire, but now the database has to run this complicated subquery, and it's unacceptably slow, especially on the first run - given the use-case, queries are sporadic and caching is not very effective.
If I were in charge of the schema I'd probably just store this field there in the database, but I'm not, so any suggestions for improvements are welcome. Thanks!
You should use the ROW_NUMBER windowing function
http://www.postgresql.org/docs/current/static/functions-window.html
Adjacency is an abstract construct and relies on an explicit sort (or PARTITION OVER) ... do you mean the one with the preceeding time stamp?
Decide how you decide on what sort of "adjacent" you want, then get ROW_NUMBER over that criteria.
Once you have that you would just JOIN each row on the item having ROW_NUMBER +/- 1
You can try this with sqlite
SELECT DISTINCT t2.*
FROM (SELECT * FROM t WHERE object=2 AND description='Bad') t1
JOIN
(SELECT * FROM t WHERE object=2) t2
ON t1.id = t2.id OR
t2.id IN (SELECT id FROM t WHERE object=2 AND t.time<t1.time ORDER BY t.time DESC LIMIT 1) OR
t2.id IN (SELECT id FROM t WHERE object=2 AND t.time>t1.time ORDER BY t.time ASC LIMIT 1)
ORDER BY t2.time
;
Change the limit values ​​by more context

SELECT TOP 1 ...Some stuff... ORDER BY DES gives different result

SELECT TOP 1 Col1,col2
FROM table ... JOIN table2
...Some stuff...
ORDER BY DESC
gives different result. compared to
SELECT Col1,col2
FROM table ... JOIN table2
...Some stuff...
ORDER BY DESC
2nd query gives me some rows , When I want the Top 1 of this result I write the 1st query with TOP 1 clause. These both give different results.
why is this behavior different
This isn't very clear, but I guess you mean the row returned by the first query isn't the same as the first row returned by the second query. This could be because your order by has duplicate values in it.
Say, for example, you had a table called Test
+-----+------+
| Seq | Name |
+-----+------+
| 1 | A |
| 1 | B |
| 2 | C |
+-----+------+
If you did Select * From Test Order By Seq, either of these is valid
+-----+------+
| Seq | Name |
+-----+------+
| 1 | A |
| 1 | B |
| 2 | C |
+-----+------+
+-----+------+
| Seq | Name |
+-----+------+
| 1 | B |
| 1 | A |
| 2 | C |
+-----+------+
With the top, you could get either row.
Having the top 1 clause could mean the query optimizer uses a completely different approach to generate the results.
I'm going to assume that you're working in SQL Server, so Laurence's answer is probably accurate. But for completeness, this also depends on what database technology you are using.
Typically, index-based databases, like SQL Server, will return results that are sorted by the index, depending on how the execution plan is created. But not all databases utilize indices.
Netezza, for example, keeps track of where data lives in the system without the concept of an index (Netezza's system architecture is quite a bit different). As a result, selecting the 1st record of a query will result in a random record from the result set floating to the top. Executing the same query multiple times will likely result in a different order each time.
If you have a requirement to order data, then it is in your best interest to enforce the ordering yourself instead of relying on the arbitrary ordering that the database will use when creating its execution plan. This will make your results more predictable.
Your 1st query will get one table's top row and compare with another table with condition. So it will return different values compare to normal join.

How to use MDX to calculate the difference between dates on seperate rows?

I am quite new to MDX and I need to write a query that gives me the day difference between two dates. The problem is that the dates exist on two different rows in my data. For example:
My Fact table:
SEAL | STARTDATE | PROCESS | FK_DATE_KEY
1 | 2012-10-22| A | 20121022
1 | 2012-10-24| B | 20121024
2 | 2012-10-22| A | 20121022
2 | 2012-10-26| B | 20121026
What I need returned is :
SEAL | AGE_IN_DAYS
1 | 2
2 | 4
Please help.... I have a date dimension that relates to my FK_DATE_KEY
If you are new to MDX, you shouldn't try to do this problem using only MDX. This particular problem is much easier if you write it under SQL and use that data in your Analysis Services.
So, the easiest and nicest way to do this problem is to write a view which returns the same data you gave in your question. (SEAL | AGE_IN_DAYS)
Then you are able to insert these data in your Data Source View (if you choose 'new named query', you can fetch table-valued functions too, not only views and tables)
Hope it helps!