Efficient way to get NonEmpty members - ssas

Consider Dimension_A and Dimension_B and Measure_Amt
I require the first 100 non-empty members of the cross-join between Dimension_A and Dimension_B for Measure_Amt. The following query works but takes a lot of time, since these dimensions are large (from one million to 20 million members).
with set a as subset(Dimension_A.levels(1),0,100)
set x as SubSet(NONEMPTYCROSSJOIN(a,Dimension_B.levels(1)), 0, 100)
select [Measures].[Measure_Amt] on 0 ,
x on 1
from MY_CUBE
And with a WHERE clause:
with set a as subset(Dimension_A.levels(1),0,100)
set x as SubSet(NONEMPTYCROSSJOIN(a,Dimension_B.levels(1)), 0, 100)
select [Measures].[Measure_Amt] on 0 ,
x on 1
from MY_CUBE where Dimension_C.member_C1
Fetching the first 100 members of a single dimension is quick; the NonEmptyCrossJoin call accounts for most of the time.
Since I only require the first 100 non-empty members, and not any subsequent subset, is there a way to write a better query?

When you say, "I require first 100 non-empty members for cross-join between Dimension_A and Dimension_B for Measure_Amt", I assume you mean you want the top 100 results where Measure_Amt is not null for the two cross-joined sets.
If so, I believe this is another way to write your query which may improve performance.
WITH
SET [X] AS [Dimension_A].LEVELS(1) * [Dimension_B].LEVELS(1)
SELECT
[Measures].[Measure_Amt] ON 0,
    SUBSET(
        NONEMPTY(
            [X]
            , [Measures].[Measure_Amt]
        )
        , 0
        , 100
    ) ON 1
FROM
[MY_CUBE]
WHERE
[Dimension_C].[member_C1]


How to pull rows from a SQL table until quotas for multiple columns are met?

I've been able to find a few examples of questions similar to this one, but most only involve a single column being checked.
SQL Select until Quantity Met
Select rows until condition met
I have a large table representing facilities, with a column for each type of resource and the number of that resource available per facility. I want a stored procedure that takes multiple integer parameters (one per resource column) and a lat/lon. It should then iterate over the table sorted by distance, and return rows (facilities) until the required quantities of resources (specified by the parameters) are met.
Data source example:

Id | Lat    | Long | Resource1 | Resource2 | ...
---+--------+------+-----------+-----------+----
1  | 50.123 | 4.23 | 5         | 12        | ...
2  | 61.234 | 5.34 | 0         | 9         | ...
3  | 50.634 | 4.67 | 21        | 18        | ...

Result wanted, given these parameters:

@latQuery  = 50.634
@LongQuery = 4.67
@res1Query = 10
@res2Query = 20

Id | Lat    | Long | Resource1 | Resource2 | ...
---+--------+------+-----------+-----------+----
3  | 50.634 | 4.67 | 21        | 18        | ...
1  | 50.123 | 4.23 | 5         | 12        | ...
The result includes rows until the quota for each resource is met, and it is sorted by distance to the requested lat/lon.
I'm able to sort the results by distance and sum the running totals as suggested in other threads, but I'm having some trouble with the logic that compares the running values against the quotas provided in the params.
First I have some CTEs to get the most recent edits, order by distance, and then sum the running totals:
WITH cte1 AS (
    SELECT
        @origin.STDistance(geography::Point(Facility.Lat, Facility.Long, 4326)) AS distance,
        Facility.Resource1 AS res1,
        Facility.Resource2 AS res2
        -- ...etc
    FROM Facility
),
cte2 AS (
    SELECT
        distance,
        res1,
        SUM(res1) OVER (ORDER BY distance) AS totRes1,
        res2,
        SUM(res2) OVER (ORDER BY distance) AS totRes2
        -- ...etc, there's 15-20 columns here
    FROM cte1
)
Next, with the results of that CTE, I need to pull rows until all quotas are met. This is where I'm having issues: it works for one row, but my logic with all the ANDs isn't exactly right.
SELECT * FROM cte2 WHERE (
    (totRes1 <= @res1Query OR (totRes1 > @res1Query AND totRes1 - res1 <= @totRes1)) AND
    (totRes2 <= @res2Query OR (totRes2 > @res2Query AND totRes2 - res2 <= @totRes2)) AND
    -- ... I also feel like this method of pulling the next row once it's over may be convoluted as well?
)
As-is, it's mostly returning nothing, and I'm guessing that's because it's too strict. Essentially, I want to let the running totals go past the required values until they are all past the required values, and then return that list.
Has anyone come across a better method of searching using separate quotas for multiple columns?
See my update in the answers/comments
I think you are massively over-complicating this. This does not need any joins, just some running sum calculations, and the right OR logic.
The key to solving this is that you need all rows, where the running sum up to the previous row is less than the requirement for all requirements. This means that you include all rows where the requirement has not been met, and the first row for which the requirement has been met or exceeded.
To do this you can subtract the current row's value from the running sum.
You could utilize a ROWS specification of ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING. But then you need to deal with NULL on the first row.
In any event, even a regular running sum should always use ROWS UNBOUNDED PRECEDING, because the default is RANGE UNBOUNDED PRECEDING, which is subtly different and can cause incorrect results, as well as being slower.
You can also factor out the distance calculation into a CROSS APPLY (VALUES ...), avoiding the need for lots of CTEs or derived tables. You then only need one level of derivation.
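For illustration, here is a minimal sketch combining those two points: the 1 PRECEDING frame (with ISNULL handling the NULL on the first row) and the CROSS APPLY (VALUES ...) distance calculation. It assumes @origin is declared as in the full query below.

SELECT f.*,
    -- Running sum of earlier rows only; the frame yields NULL on the nearest row, so default it to 0.
    ISNULL(SUM(f.Resource1) OVER (
        ORDER BY v1.Distance
        ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS SumRes1
FROM Facility f
CROSS APPLY (VALUES(
    @origin.STDistance(geography::Point(f.Lat, f.Long, 4326))
)) v1(Distance);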
DECLARE @origin geography = geography::Point(@latQuery, @LongQuery, 4326);

SELECT
  f.Id,
  f.Lat,
  f.Long,
  f.Resource1,
  f.Resource2
FROM (
  SELECT f.*,
    SumRes1 = SUM(f.Resource1) OVER (ORDER BY v1.Distance ROWS UNBOUNDED PRECEDING) - f.Resource1,
    SumRes2 = SUM(f.Resource2) OVER (ORDER BY v1.Distance ROWS UNBOUNDED PRECEDING) - f.Resource2
  FROM Facility f
  CROSS APPLY (VALUES(
    @origin.STDistance(geography::Point(f.Lat, f.Long, 4326))
  )) v1(Distance)
) f
WHERE (
  f.SumRes1 < @res1Query
  OR f.SumRes2 < @res2Query
);
I was able to figure out the problem on my own here. The primary issue I was running into was that I was comparing 25 different columns' running totals against the 25 stored proc parameters (the quotas of resources required by the search).
Changing lines such as this
(totRes1 <= @res1Query OR (totRes1 > @res1Query AND totRes1 - res1 <= @totRes1)) AND --...
to
(totRes1 <= @res1Query OR (totRes1 > @res1Query AND totRes1 - res1 <= @totRes1) OR @res1Query = 0) AND --...
(adding in the OR @res1Query = 0) solved my issue.
In other words, the search is often only for one or two columns (types of resources), leaving the others as zero. The way my logic was set up caused it to skip over lots of rows, because it instantly marked them as having met the quota (a value less than or equal to the quota). Like @A Neon Tetra suggested, I was pretty close to it already.
Update:
My first attempt didn't exactly fix my issue. Here is the stripped-down version of my code that is now working for me.
DECLARE @Lat AS DECIMAL(12,6)
DECLARE @Lon AS DECIMAL(12,6)
DECLARE @res1Query AS INT
DECLARE @res2Query AS INT
-- repeat for Resource 3 through 25, etc...
DECLARE @origin geography = geography::Point(@Lat, @Lon, 4326);
-- CTE to be able to expose distance
WITH cte AS (SELECT TOP(99999) -- --> this is hacky, it won't let me order by distance unless I'm selecting TOP(x) or some other fn?
    dbo.Facility.FacilityID,
    dbo.Facility.Lat,
    dbo.Facility.Lon,
    @origin.STDistance(geography::Point(dbo.Facility.Lat, dbo.Facility.Lon, 4326)) AS distance,
    dbo.Facility.Resource1 AS res1,
    dbo.Facility.Resource2 AS res2
    -- repeat for Resource 3 through 25, etc...
    FROM dbo.Facility
    ORDER BY distance),
-- second CTE - has access to distance so we can keep track of a running total ordered by distance
-- --> have to separate into two since you can't reference the same alias (distance) again within the same SELECT
fullCTE AS (SELECT
    FacilityID,
    Lat,
    Lon,
    distance,
    res1,
    SUM(res1) OVER (ORDER BY distance) AS totRes1,
    res2,
    SUM(res2) OVER (ORDER BY distance) AS totRes2
    -- repeat for Resource 3 through 25, etc...
    FROM cte)
SELECT * -- Customize what you're pulling here for your output as needed
FROM dbo.Facility INNER JOIN fullCTE ON (fullCTE.FacilityID = dbo.Facility.FacilityID)
WHERE EXISTS
    (SELECT
        FacilityID
    FROM fullCTE WHERE (
        FacilityID = dbo.Facility.FacilityID AND
        -- Keep pulling rows until all conditions are met, as opposed to pulling rows while they're under the quota
        NOT (
            ((totRes1 - res1 >= @res1Query AND @res1Query <> 0) OR (@res1Query = 0)) AND
            ((totRes2 - res2 >= @res2Query AND @res2Query <> 0) OR (@res2Query = 0))
            -- repeat for Resource 3 through 25, etc...
        )
    )
)

subquery using DISTINCT and LIMIT

In SQLite, when I do
SELECT DISTINCT idvar
FROM myTable
LIMIT 100
OFFSET 0;
the data returned are 100 rows with (the first) 100 distinct values of idvar in myTable. That's exactly what I expected.
Now, when I do
SELECT *
FROM myTable
WHERE idvar IN (SELECT DISTINCT idvar
FROM myTable
LIMIT 100
OFFSET 0);
I would expect to get all the data from myTable corresponding to those 100 distinct values of idvar (so the result could have more than 100 rows if there is more than one row per idvar). What I get, however, is all the data for however many distinct values of idvar it takes to make up roughly 100 rows. I don't understand why.
Thoughts? How should I build a query that returns what I expected?
Context
I have a 50GB table, and I need to do some calculations using R. Since I can't possibly load that much data into R for memory reasons, I want to work in chunks. It is important, however, that each chunk contains all the rows for a given level of idvar. That's why I use OFFSET and LIMIT in the query, as well as trying to make sure that it returns all rows for each level of idvar.
I'm not sure about SQLite, but in other SQL variants the result of an un-ordered LIMIT query is not guaranteed to be the same every time, so you should also include an ORDER BY in there.
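For instance, a minimal sketch of the original query with an ORDER BY added (ordering by idvar itself is an assumption; any deterministic ordering would do):

SELECT *
FROM myTable
WHERE idvar IN (SELECT DISTINCT idvar
                FROM myTable
                ORDER BY idvar
                LIMIT 100
                OFFSET 0);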
But a better idea may be to do a separate query at the beginning to read all of the distinct IDs into R, then split those into batches of 100 and do a separate query for each batch. That should be clearer, faster, and easier to debug.
Edit: example R code. Let's say you have 100k distinct IDs in the variable ids.
for (i in 1:1000) {
  tmp.ids <- ids[((i - 1) * 100 + 1) : (i * 100)]
  query <- paste0("SELECT * FROM myTable WHERE idvar IN (",
                  paste0(tmp.ids, collapse = ", "),
                  ")")
  res <- dbSendQuery(con, query)
  # fetch the results with dbFetch(res), process the chunk, then dbClearResult(res), etc.
}

select outliers based on sigma and standard deviation in sql

The sample data is like this.
I want to select the outliers that fall outside of 4 sigma for each class.
I tried
select value,class,AVG(value) as mean, STDEV(value)as st, size from Data
having value<mean-2*st OR value>mean+2*st group by calss
It seems it does not work. Should I use a having or a where clause here?
The results I want are the whole 3rd row and the 8th row.
When the condition you are looking at is a property of an individual row, use where, e.g. where class = 1 (all rows with class 1) or where size > 2 (all rows with size > 2). When the condition is a property of a set of rows, use group by ... having, e.g. group by class having avg(value) > 2 (all classes whose average value is > 2).
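For instance (illustrative only, using the column names from this question):

-- WHERE filters individual rows
SELECT value, class, size
FROM Data
WHERE size > 2;

-- GROUP BY ... HAVING filters whole groups
SELECT class, AVG(value) AS mean
FROM Data
GROUP BY class
HAVING AVG(value) > 2;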
In this case you want where, but there is a complication: you don't have enough information in each row alone to write the necessary where clause, so you will have to get it through a subquery.
Ultimately you want something like SELECT value, class, size FROM Data WHERE value < mean - 2 * st OR value > mean + 2 * st; however, you need a subquery to get mean and st.
One way to do this is:
SELECT value, Data.class, size, mean, st FROM Data
INNER JOIN (
SELECT class, AVG(value) AS mean, STDEV(value) AS st
FROM Data GROUP BY class
) AS stats ON stats.class = Data.class
WHERE value < mean - 2 * st OR value > mean + 2 * st;
This creates a subquery which gets your means and standard deviations for each class, joins those numbers to the rows with matching classes, and then applies your outlier check.

Create a view on an integer?

I have a table named A. It has only one record with one field: an integer named number.
I want to create a view that has A.number records, each being one of the numbers less than A.number.
For example:
select A.number -----> 5
the view should show 5 records: 0, 1, 2, 3, 4
P.S.: This is a real problem that I simplified a lot. The real problem is like dividing a budget over a fixed period across each day.
This sounds a bit like it might be homework, so I'm wary of providing the code outright.
I can give a pointer for how to solve the question, though. You use a recursive CTE where each iteration adds one to the previous iteration. Just be sure to set the MAXRECURSION option if you'll be checking numbers > 101. You can use a scalar subquery to key the view to the original table:
WITH numbers ( n ) AS (
    SELECT 0
    UNION ALL
    SELECT 1 + n FROM numbers WHERE n < (SELECT number FROM A) - 1
)
SELECT n FROM numbers
OPTION ( MAXRECURSION 500 ) -- example
If the number in your table will be < 2048 and you are on SQL Server, this will work for you:
CREATE VIEW MyView AS
SELECT number
FROM master..spt_values
WHERE type = 'p'
AND number < (SELECT value FROM yourTable)
Alternatively, you could consider creating your own Numbers table with an appropriate size to suit your application, if you require a higher limit or are not on SQL Server, which provides this table for you. Here is a link to a blog post on the idea of having a "Numbers table" handy.
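As a rough sketch of that idea (the table and view names here are assumptions, and the fill size should be chosen to suit your application):

CREATE TABLE dbo.Numbers (n INT PRIMARY KEY);

-- Fill it once with 0 .. 9999; any sufficiently large row source will do.
INSERT INTO dbo.Numbers (n)
SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
FROM sys.all_objects a CROSS JOIN sys.all_objects b;
GO

-- The view then just filters the Numbers table by the single value stored in A.
CREATE VIEW dbo.NumbersBelowA AS
SELECT n AS number
FROM dbo.Numbers
WHERE n < (SELECT number FROM A);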

Distribution of table in time

I have a MySQL table with approximately 3000 rows per user. One of the columns is a datetime field, which is mutable, so the rows aren't in chronological order.
I'd like to visualize the time distribution in a chart, so I need a number of individual datapoints. 20 datapoints would be enough.
I could do this:
select timefield from entries where uid = ? order by timefield;
and look at every 150th row.
Or I could do 20 separate queries and use limit 1 and offset.
But there must be a more efficient solution...
Michal Sznajder almost had it, but you can't use column aliases in a WHERE clause in SQL. So you have to wrap it as a derived table. I tried this and it returns 20 rows:
SELECT * FROM (
    SELECT @rownum:=@rownum+1 AS rownum, e.*
    FROM (SELECT @rownum := 0) r, entries e) AS e2
WHERE uid = ? AND rownum % 150 = 0;
Something like this came to my mind
select @rownum:=@rownum+1 rownum, entries.*
from (select @rownum:=0) r, entries
where uid = ? and rownum % 150 = 0
I don't have MySQL at hand but maybe this will help ...
As far as visualization, I know this is not the periodic sampling you are talking about, but I would look at all the rows for a user and choose an interval bucket, SUM within the buckets and show on a bar graph or similar. This would show a real "distribution", since many occurrences within a time frame may be significant.
SELECT DATEADD(day, DATEDIFF(day, 0, timefield), 0) AS bucket -- choose an appropriate granularity (days used here)
,COUNT(*)
FROM entries
WHERE uid = ?
GROUP BY DATEADD(day, DATEDIFF(day, 0, timefield), 0)
ORDER BY DATEADD(day, DATEDIFF(day, 0, timefield), 0)
Or, if you don't like the way you have to repeat yourself - or if you are playing with different buckets and want to analyze across many users in 3-D (the measure on Z against x = uid, y = bucket):
SELECT uid
,bucket
,COUNT(*) AS measure
FROM (
SELECT uid
,DATEADD(day, DATEDIFF(day, 0, timefield), 0) AS bucket
FROM entries
) AS buckets
GROUP BY uid
,bucket
ORDER BY uid
,bucket
If I wanted to plot in 3-D, I would probably determine a way to order users according to some meaningful overall metric for the user.
@Michal
For whatever reason, your example only works when the WHERE clause compares rownum with a less-than operator. I think that when the WHERE filters out a row, rownum doesn't get incremented, so it can never match anything else.
If the original table has an auto incremented id column, and rows were inserted in chronological order, then this should work:
select timefield from entries
where uid = ? and id % 150 = 0 order by timefield;
Of course that doesn't work if there is no correlation between the id and the timefield, unless you don't actually care about getting evenly spaced timefields, just 20 random ones.
Do you really care about the individual data points? Or would using the statistical aggregate functions on the day number instead suffice to tell you what you wish to know? See the sketch after this list.
AVG
STDDEV_POP
VARIANCE
TO_DAYS
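For instance, a minimal sketch along those lines (which statistics you actually want is an assumption; the column names come from the question):

SELECT uid,
       AVG(TO_DAYS(timefield))        AS mean_day,
       STDDEV_POP(TO_DAYS(timefield)) AS stddev_days,
       VARIANCE(TO_DAYS(timefield))   AS variance_days
FROM entries
WHERE uid = ?
GROUP BY uid;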
select timefield
from entries
where rand() <= 0.01 -- will return roughly 1% of rows; adjust as needed
Not a mysql expert so I'm not sure how rand() operates in this environment.
For my reference - and for those using postgres - Postgres 9.4 will have ordered set aggregates that should solve this problem:
SELECT percentile_disc(0.95)
WITHIN GROUP (ORDER BY response_time)
FROM pageviews;
Source: http://www.craigkerstiens.com/2014/02/02/Examining-PostgreSQL-9.4/