Powerpivot - Only Show Minimum Value - min

Newbie to DAX/PowerPivot and struggling with a specific problem.
I have a table
Location  Category  Distance
1         A         1.244
2         A         2.111
3         B         5.113
4         C         0.124
etc
I need to identify the minimum Distance out of the selection and output it only for that record. So I'd have:
Location  Category  Distance  MinDist
1         A         1.244
2         A         2.111
3         B         5.113
4         C         0.124     0.124
etc
I've tried various measures but always end up with simply a repeat of the Distance column, whatever filters I try to apply.
Please help.

If your table were called 'Table1', then this would give you the overall minimum:
=CALCULATE(MIN(Table1[Distance]), ALL(Table1))
Depending on your requirements, you may have to specify columns in the ALL() to reduce how much of the filter is opened up (suggest you research ALL() as it is a very important DAX function).
To return zero for the non-matchers (returning blanks is trickier), you could package it in:
=IF (
    SUM ( Table1[Distance] )
        = CALCULATE ( MIN ( Table1[Distance] ), ALL ( Table1 ) ),
    CALCULATE ( MIN ( Table1[Distance] ), ALL ( Table1 ) ),
    0
)
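If you would rather show blanks than zeros, a minimal variant (under the same Table1 assumption as above) returns BLANK() in the else branch; blank measure results are typically suppressed in pivot tables:
=IF (
    SUM ( Table1[Distance] )
        = CALCULATE ( MIN ( Table1[Distance] ), ALL ( Table1 ) ),
    CALCULATE ( MIN ( Table1[Distance] ), ALL ( Table1 ) ),
    BLANK ()  -- returns blank instead of 0 for non-matching rows
)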

Related

Assign an age to a person based on known population average but no Date of birth

I would like to use Postgres SQL to assign an age category to a list of households, where we don't know the date of birth of any of the family members.
Dataset looks like:
household_id  household_size
x1            5
x2            1
x3            8
...           ...
I then have a set of percentages for each age group with that dataset looking like:
age_group  percentage
0-18       30
19-30      40
31-100     30
I want the query to calculate an assignment that makes the whole dataset as close to the percentages as possible and, if possible, similar at the household level (not as important). The dataset will end up looking like:
household_id  household_size  0-18  19-30  31-100
x1            5               2     2      1
x2            1               0     1      0
x3            8               3     3      2
...           ...             ...   ...    ...
I have looked at the ntile function, but any pointers as to how I could handle this with Postgres would be really helpful.
I didn't want to post an answer with just a link, so I figured I'd give it a shot and see if I can simplify depesz's weighted_random to plain SQL. The result is a slower, less readable, worse version of it, but in shorter, plain SQL:
CREATE FUNCTION weighted_random( IN p_choices ANYARRAY, IN p_weights float8[] )
RETURNS ANYELEMENT LANGUAGE sql AS $$
    SELECT choice
    FROM (
        SELECT CASE WHEN sum(weight) OVER (ROWS UNBOUNDED PRECEDING) >= hit
                    THEN choice
               END AS choice
        FROM ( SELECT unnest(p_choices) AS choice,
                      unnest(p_weights) AS weight ) inputs,
             ( SELECT sum(weight) * random() AS hit
               FROM unnest(p_weights) a(weight) ) AS random_hit
    ) chances
    WHERE choice IS NOT NULL
    LIMIT 1
$$;
It's not inlineable because of the aggregate and window function calls. It would be faster if you could assume the weights are probabilities that sum to 1.
The principle is that you provide any array of choices and an equal length array of weights (those can be percentages but don't have to, nor do they have to sum up to any specific number):
UPDATE test_area t
SET ("0-18", "19-30", "31-100") =
    ( WITH cte AS (
          SELECT weighted_random('{0-18,19-30,31-100}'::TEXT[], '{30,40,30}')
                 AS age_group
          FROM generate_series(1, household_size, 1) )
      SELECT count(*) FILTER (WHERE age_group = '0-18')   AS "0-18",
             count(*) FILTER (WHERE age_group = '19-30')  AS "19-30",
             count(*) FILTER (WHERE age_group = '31-100') AS "31-100"
      FROM cte )
RETURNING *;
There is an online demo showing that both his version and mine are statistically reliable.
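For a quick standalone sanity check (hypothetical choices and weights, not from the question), you can call the function directly; with weights 1, 2 and 7 you should see roughly 10% 'a', 20% 'b' and 70% 'c' over many calls:
SELECT weighted_random('{a,b,c}'::text[], '{1,2,7}'::float8[]) AS pick
FROM generate_series(1, 10);  -- one independent pick per generated row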
A minimum start could be:
SELECT
    household_id,
    MIN(household_size) AS size,
    ROUND(SUM(CASE WHEN agegroup_from = 0  THEN g ELSE 0 END), 1) AS g1,
    ROUND(SUM(CASE WHEN agegroup_from = 19 THEN g ELSE 0 END), 1) AS g2,
    ROUND(SUM(CASE WHEN agegroup_from = 31 THEN g ELSE 0 END), 1) AS g3
FROM (
    SELECT
        h.household_id,
        h.household_size,
        p.agegroup_from,
        p.percentage / 100.0 * h.household_size AS g
    FROM households h
    CROSS JOIN PercPerAge p ) x
GROUP BY household_id
ORDER BY household_id;
output:
household_id  size  g1   g2   g3
x1            5     1.5  2.0  1.5
x2            1     0.3  0.4  0.3
x3            8     2.4  3.2  2.4
see: DBFIDDLE
Notes:
Of course you should round the columns g to whole numbers, taking into account the complete split (g1 + g2 + g3 = total); see the sketch after these notes.
Because g1, g2 and g3 are based on percentages, their values can vary as long as the total split stays correct (see, for more info: Return all possible combinations of values on columns in SQL).
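One way to do that rounding is the largest-remainder method: floor every share, then hand the leftover members to the rows with the biggest fractional parts. A sketch under the same households/PercPerAge assumptions as the query above:
WITH shares AS (
    SELECT h.household_id,
           h.household_size,
           p.agegroup_from,
           p.percentage / 100.0 * h.household_size AS g
    FROM households h
    CROSS JOIN PercPerAge p
),
floored AS (
    SELECT *,
           floor(g)::int AS base,
           -- rank each household's age groups by fractional remainder
           row_number() OVER (PARTITION BY household_id
                              ORDER BY g - floor(g) DESC) AS rk
    FROM shares
)
SELECT household_id,
       agegroup_from,
       -- leftover = household_size minus the sum of floors; biggest remainders get +1
       base + CASE WHEN rk <= household_size
                            - SUM(base) OVER (PARTITION BY household_id)
                   THEN 1 ELSE 0 END AS members
FROM floored
ORDER BY household_id, agegroup_from;
This guarantees the per-household counts add up to household_size exactly while staying as close as possible to the target percentages.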

fetch aggregate value along with data

I have a table with the following fields:
ID    Content                                        QuestionMarks  TypeofQuestion
350   What is the symbol used to represent Bromine?  2              MCQ
758   What is the symbol used to represent Bromine?  2              MCQ
2425  What is the symbol used to represent Bromine?  3              Essay
2080  A quadrilateral has four sides, four angles    1              MCQ
2614  A circular cone has a curved surface area of   2              MCQ
2520  Two triangles have sides 5 cm, 11 cm, 2 cm     2              MCQ
2196  Life supporting process mediated by water?     2              Essay
I would like to get random questions where the total marks equal an input number.
For example, if I say 25, the result should be a set of random questions whose Sum(QuestionMarks) is 25 (+/- 1).
Is this really possible using SQL?
select content, id, questionmarks, sum(questionmarks)
from quiz_question
group by content, id, questionmarks;
Expected input: 25
Expected result: questions whose sum of QuestionMarks = 25
Update:
How do I ensure I get at least 2 Essay-type questions? (This is just an example; I would extend this to other conditions.) Thank you for all the help.
S-Man's cumulative sum is the right approach. For your logic, though, I think you want to get up to the first row that is 24 or more. That logic is:
where total - questionmark < 24
If you have enough questions, then you could get exactly 25 using:
with q25 as (
      select *
      from (select t.*,
                   sum(questionmark) over (order by random()) as running_questionmark
            from t
           ) t
      where running_questionmark < 25
     )
select q.ID, q.Content, q.QuestionMarks, q.TypeofQuestion
from q25 q
union all
(select t.ID, t.Content, t.QuestionMarks, t.TypeofQuestion
 from t cross join
      (select sum(questionmark) as questionmark_25 from q25) x
 where not exists (select 1 from q25 where q25.id = t.id)
 order by abs(questionmark - (25 - questionmark_25))
 limit 1
)
This selects questions up to 25 but not at 25. It then tries to find one more to make the total 25.
Supposing questionmark is of type integer, you want to get some records in random order whose questionmark sum is not more than 25.
You can use the consecutive SUM() window function. The order is random. The consecutive SUM() adds every current value to the previous sum. So, you could filter where SUM() <= <your value>:
demo:db<>fiddle
SELECT *
FROM (
    SELECT
        *,
        SUM(questionmark) OVER (ORDER BY random()) AS total
    FROM t
) s
WHERE total <= 25
Note:
This returns a list of records whose sum is no more than 25, but as close as possible to it, in a random order.
Finding an exact match for your value is a combinatorial problem which shouldn't be solved in a database, especially when there's a random factor. What if your current sum is 22 and the next randomly chosen value is 4? Would you retry, maybe until infinity, to randomly find a value = 3? Or would you try to remove an already counted record with value = 1?
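Regarding the update about requiring at least two Essay questions, a hedged sketch (reusing the quiz_question table and columns from the question) is to seed the pick with two random Essay questions and then top up with the running-sum trick, spending whatever mark budget remains:
WITH essay AS (               -- seed: two random Essay questions
    SELECT id, content, questionmarks, typeofquestion
    FROM quiz_question
    WHERE typeofquestion = 'Essay'
    ORDER BY random()
    LIMIT 2
),
rest AS (                     -- everything else, in random order, with a running sum
    SELECT q.id, q.content, q.questionmarks, q.typeofquestion,
           SUM(q.questionmarks) OVER (ORDER BY random()) AS running
    FROM quiz_question q
    WHERE NOT EXISTS (SELECT 1 FROM essay e WHERE e.id = q.id)
)
SELECT id, content, questionmarks, typeofquestion FROM essay
UNION ALL
SELECT id, content, questionmarks, typeofquestion
FROM rest
WHERE running <= 25 - (SELECT SUM(questionmarks) FROM essay);
The same caveat as above applies: the total lands near 25 rather than exactly on it.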

DAX - Create Dynamic Index Column

I am trying to create a calculated column in SSAS tabular model with DAX. I want a dynamic index column on a table. Meaning that the index starts at 0 when the table is filtered. Imagine you have a table like:
item      index
apple     0
banana    1
celery    2
broccoli  3
If I filter the table to just vegetables, normally the index would still be:
item      index
celery    2
broccoli  3
But I want it to be:
item      index
celery    0
broccoli  1
So far I create the index with this (I am indexing a date dimension table):
=CALCULATE(COUNT([Date])-1, ALL('DimDate'), FILTER(DimDate, [Date]<=EARLIER([Date])))
I have tried using ALLEXCEPT() and I have tried making an Offset column by getting the first value of the Index with FIRSTNONBLANK, but have not yet been successful.
Any ideas or help?
You cannot have a dynamic calculated column, because calculated columns are evaluated and stored at design time. So you would have to use a measure to do the dynamic indexing. Here is one approach using ALLSELECTED, with additional handling to display blank as zero.
=
IF (
ISBLANK (
CALCULATE (
COUNTROWS ( DimDate ),
FILTER ( ALLSELECTED ( DimDate ), DimDate[Date] < MAX ( DimDate[Date] ) )
)
),
0,
CALCULATE (
COUNTROWS ( DimDate ),
FILTER ( ALLSELECTED ( DimDate ), DimDate[Date] < MAX ( DimDate[Date] ) )
)
)
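On engines that support DAX variables (SSAS 2016 and later; an assumption about your model), a sketch of the same measure can avoid evaluating the CALCULATE twice:
Dynamic Index :=
VAR RowsBefore =
    // count the selected dates strictly before the current one
    CALCULATE (
        COUNTROWS ( DimDate ),
        FILTER ( ALLSELECTED ( DimDate ), DimDate[Date] < MAX ( DimDate[Date] ) )
    )
RETURN
    IF ( ISBLANK ( RowsBefore ), 0, RowsBefore )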

How to summarize on distinct values that group other values

I have this table:
Family_ID  Person_ID  Weight  Spent_amount  category
A          1          10      500           flight
A          2          10      500           flight
A          1          10      200           Hotel
A          2          10      200           Hotel
B          3          20      250           flight
B          3          20      300           Hotel
As we see here, every family has members, and the costs per category are calculated for the family, not the person. Families are not all equal either: each has a weight, so every spend should be multiplied by the family weight.
Now my goal is to write a measure for the spend of every family.
I wrote an expression, but I see it's very complex and I suspect it gives a wrong value. The DAX I wrote is:
cash_wieght:=SUMX (
SUMMARIZE (
s2g516_full,
s2g516_full[DIM_HOUSEHOLD_REF_ID],
"cash", CALCULATE ( MAXX ( SUMMARIZE (
s2g516_full,
s2g516_full[Persons],
s2g516_full[DIM_HOUSEHOLD_REF_ID] ,"cash1", CALCULATE ( Sum(s2g516_full[SPENT_PER_REASON])*MAX ( s2g516_full[Weight] ) )
/distinctcount(s2g516_full[person]) ),[Cash1] ) )
),
[cash]
)
When I took a sample by filtering on Family_ID, it gave me the correct numbers for the sample tests, but maybe not for all of them. So how can I tell whether I got the right results or not?
The result I want is weight * spent_amount, summed and filtered by Family_ID, as if Person_ID were not in the table, i.e. if I filtered on Family_ID:
A  700 * 10
B  550 * 20
First, DAX Formatter is your friend.
cash_wieght :=
SUMX (
SUMMARIZE (
s2g516_full,
s2g516_full[DIM_HOUSEHOLD_REF_ID],
"cash", CALCULATE (
MAXX (
SUMMARIZE (
s2g516_full,
s2g516_full[Persons],
s2g516_full[DIM_HOUSEHOLD_REF_ID],
"cash1", CALCULATE (
SUM ( s2g516_full[SPENT_PER_REASON] ) * MAX ( s2g516_full[Weight] )
)
/ DISTINCTCOUNT ( s2g516_full[person] )
),
[Cash1]
)
)
),
[cash]
)
After that, I think you've got the right intuition with SUMMARIZE(), but it can be much simpler than what you're doing. SUMMARIZE() groups by the fields in a table that you pass to it. Since members with the same [Family_ID] have the same values for [Spent_Amount] in a given [Category], all we need to do is group by everything from the table except [Person_ID].
cash_wieght:=
SUMX (
SUMMARIZE(
FactSpend
,FactSpend[Family_ID]
,FactSpend[Weight]
,FactSpend[Category]
,FactSpend[Spent_Amount]
)
,FactSpend[Weight] * FactSpend[Spent_Amount]
)
Here we use SUMMARIZE() to group by [Family_ID], [Weight], [Category], and [Spent_Amount]. Since multiplication distributes over addition, it doesn't matter whether we multiply [Weight] by each [Spent_Amount] and then sum, or sum first and then multiply.
So we use SUMX() to iterate over each row returned by our SUMMARIZE() and accumulate into a sum the values of [Weight] * [Spent_Amount] for each row in that grouped table.
Here's an image of my sample data and a pivot table based on this measure performing appropriately. It also works when adding any of the attributes from the table to filter context: if one [Person_ID] is in context, the measure returns the value for that person's family, and at the family level it returns the value for the family. We can alter the measure to give only a single [Person_ID]'s portion of the [Family_ID]'s whole if necessary, as sketched below.
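A hedged sketch of that per-person variant, using the same hypothetical FactSpend names: take the family-level total from the measure above and divide it by the family's head count (ALLEXCEPT removes the person filter so the count always covers the whole family).
cash_weight_per_person :=
DIVIDE (
    SUMX (
        SUMMARIZE (
            FactSpend,
            FactSpend[Family_ID],
            FactSpend[Weight],
            FactSpend[Category],
            FactSpend[Spent_Amount]
        ),
        FactSpend[Weight] * FactSpend[Spent_Amount]
    ),
    -- persons per family, regardless of any Person_ID filter in context
    CALCULATE (
        DISTINCTCOUNT ( FactSpend[Person_ID] ),
        ALLEXCEPT ( FactSpend, FactSpend[Family_ID] )
    )
)
For family A this yields 7000 / 2 = 3500 for each person, i.e. an even split of the family's weighted spend.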

Select random row from a PostgreSQL table with weighted row probabilities

Example input:
SELECT * FROM test;
id | percent
----+----------
1 | 50
2 | 35
3 | 15
(3 rows)
How would you write such a query that, on average, 50% of the time I get the row with id=1, 35% of the time the row with id=2, and 15% of the time the row with id=3?
I tried something like SELECT id FROM test ORDER BY percent * random() DESC LIMIT 1, but it gives wrong results. After 10,000 runs I get a distribution like {1=6293, 2=3302, 3=405}, but I expected it to be close to {1=5000, 2=3500, 3=1500}.
Any ideas?
This should do the trick:
WITH CTE AS (
SELECT random() * (SELECT SUM(percent) FROM YOUR_TABLE) R
)
SELECT *
FROM (
SELECT id, SUM(percent) OVER (ORDER BY id) S, R
FROM YOUR_TABLE CROSS JOIN CTE
) Q
WHERE S >= R
ORDER BY id
LIMIT 1;
The sub-query Q gives the following result:
 id |   S
----+-----
  1 |  50
  2 |  85
  3 | 100
We then simply generate a random number in the range [0, 100) and pick the first row that is at or beyond that number (the WHERE clause). We use a common table expression (WITH) to ensure the random number is calculated only once.
BTW, the SELECT SUM(percent) FROM YOUR_TABLE allows you to have any weights in percent; they don't strictly need to be percentages (i.e. add up to 100).
[SQL Fiddle]
ORDER BY random() ^ (1.0 / p)
from the algorithm described by Efraimidis and Spirakis.
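Applied to the question's test table, a sketch looks like the following; note that the Efraimidis-Spirakis key u^(1/w) picks the item with the largest key, so the sort should be descending:
-- one weighted pick: each row draws u = random() and is keyed by u^(1/percent);
-- the largest key wins, which favours larger weights
SELECT id
FROM test
ORDER BY random() ^ (1.0 / percent) DESC
LIMIT 1;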
Branko's accepted solution is great (thanks!). However, I'd like to contribute an alternative that is just as performant (according to my tests), and perhaps easier to visualize.
Let's recap. The original question can perhaps be generalized as follows:
Given a map of ids and relative weights, create a query that returns a random id in the map, but with a probability proportional to its relative weight.
Note the emphasis on relative weights, not percent. As Branko points out in his answer, using relative weights will work for anything, including percents.
Now, consider some test data, which we'll put in a temporary table:
CREATE TEMP TABLE test AS
SELECT * FROM (VALUES
(1, 25),
(2, 10),
(3, 10),
(4, 05)
) AS test(id, weight);
Note that I'm using a more complicated example than the one in the original question: it does not conveniently add up to 100, and the same weight (10) is used more than once (for ids 2 and 3), which is important to consider, as you'll see later.
The first thing we have to do is turn the weights into probabilities from 0 to 1, which is nothing more than a simple normalization (weight / sum(weights)):
WITH p AS ( -- probability
SELECT *,
weight::NUMERIC / sum(weight) OVER () AS probability
FROM test
),
cp AS ( -- cumulative probability
SELECT *,
sum(p.probability) OVER (
ORDER BY probability DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumprobability
FROM p
)
SELECT
cp.id,
cp.weight,
cp.probability,
cp.cumprobability - cp.probability AS startprobability,
cp.cumprobability AS endprobability
FROM cp
;
This will result in the following output:
id | weight | probability | startprobability | endprobability
----+--------+-------------+------------------+----------------
1 | 25 | 0.5 | 0.0 | 0.5
2 | 10 | 0.2 | 0.5 | 0.7
3 | 10 | 0.2 | 0.7 | 0.9
4 | 5 | 0.1 | 0.9 | 1.0
The query above is admittedly doing more work than strictly necessary for our needs, but I find it helpful to visualize the relative probabilities this way, and it does make the final step of choosing the id trivial:
SELECT id FROM (queryabove)
WHERE random() BETWEEN startprobability AND endprobability;
Now, let's put it all together with a test that ensures the query is returning data with the expected distribution. We'll use generate_series() to generate a random number a million times:
WITH p AS ( -- probability
SELECT *,
weight::NUMERIC / sum(weight) OVER () AS probability
FROM test
),
cp AS ( -- cumulative probability
SELECT *,
sum(p.probability) OVER (
ORDER BY probability DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumprobability
FROM p
),
fp AS ( -- final probability
SELECT
cp.id,
cp.weight,
cp.probability,
cp.cumprobability - cp.probability AS startprobability,
cp.cumprobability AS endprobability
FROM cp
)
SELECT fp.id, count(*) AS count
FROM fp
CROSS JOIN (SELECT random() FROM generate_series(1, 1000000)) AS random(val)
WHERE random.val BETWEEN fp.startprobability AND fp.endprobability
GROUP BY fp.id
ORDER BY count DESC;
This will result in output similar to the following:
 id |  count
----+--------
  1 | 499679
  3 | 200652
  2 | 199334
  4 | 100335
Which, as you can see, tracks the expected distribution perfectly.
Performance
The query above is quite performant. Even on my average machine, with PostgreSQL running in a WSL1 instance (the horror!), execution is relatively fast:
     count | time (ms)
-----------+-----------
     1,000 |         7
    10,000 |        25
   100,000 |       210
 1,000,000 |      1950
Adaptation to generate test data
I often use a variation of the query above when generating test data for unit/integration tests. The idea is to generate random data that approximates a probability distribution that tracks reality.
In that situation I find it useful to compute the start and end probabilities once and store the results in a table:
CREATE TEMP TABLE test AS
WITH test(id, weight) AS (VALUES
(1, 25),
(2, 10),
(3, 10),
(4, 05)
),
p AS ( -- probability
SELECT *, (weight::NUMERIC / sum(weight) OVER ()) AS probability
FROM test
),
cp AS ( -- cumulative probability
SELECT *,
sum(p.probability) OVER (
ORDER BY probability DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) cumprobability
FROM p
)
SELECT
cp.id,
cp.weight,
cp.probability,
cp.cumprobability - cp.probability AS startprobability,
cp.cumprobability AS endprobability
FROM cp
;
I can then use these precomputed probabilities repeatedly, which results in extra performance and simpler use.
I can even wrap it all in a function that I can call any time I want to get a random id:
CREATE OR REPLACE FUNCTION getrandomid(p_random FLOAT8 = random())
RETURNS INT AS
$$
SELECT id
FROM test
WHERE p_random BETWEEN startprobability AND endprobability
;
$$
LANGUAGE SQL STABLE STRICT;
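Example usage (hypothetical calls against the precomputed test table above):
SELECT getrandomid();      -- draws its own random() via the parameter default
SELECT getrandomid(0.95);  -- deterministic probe: falls in the 0.9-1.0 bucket, id 4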
Window function frames
It's worth noting that the technique above uses a window function with a non-default frame, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This is necessary to deal with the fact that some weights might be repeated: with the default RANGE frame, rows that tie on the ORDER BY value are peers and get the same running sum, so two ids with equal probability would collapse into a single bucket. That's why I chose test data with repeated weights in the first place!
Your proposed query appears to work; see this SQLFiddle demo. It creates the wrong distribution though; see below.
To prevent PostgreSQL from optimising the subquery away, I've wrapped it in a VOLATILE SQL function. PostgreSQL has no way to know that you intend the subquery to run once for every row of the outer query, so if you don't force it to be volatile it'll just execute it once. Another possibility, though one that the query planner might optimize out in future, is to make it appear to be a correlated subquery, like this hack that uses an always-true WHERE clause: http://sqlfiddle.com/#!12/3039b/9
At a guess (before you updated to explain why it didn't work), your testing methodology was at fault, or you were using this as a subquery in an outer query where PostgreSQL noticed it wasn't a correlated subquery and executed it just once, as in this example.
UPDATE: The distribution produced isn't what you're expecting. The issue here is that you're skewing the distribution by taking multiple samples of random(); you need a single sample.
This query produces the correct distribution (SQLFiddle):
WITH random_weight(rw) AS (SELECT random() * (SELECT sum(percent) FROM test))
SELECT id
FROM (
    SELECT
        id,
        sum(percent) OVER (ORDER BY id),
        coalesce(sum(prev_percent) OVER (ORDER BY id), 0)
    FROM (
        SELECT
            id,
            percent,
            lag(percent) OVER () AS prev_percent
        FROM test
    ) x
) weighted_ids(id, weight_upper, weight_lower)
CROSS JOIN random_weight
WHERE rw BETWEEN weight_lower AND weight_upper;
Performance is, needless to say, horrible. It's using two nested sets of windows. What I'm doing is:
1. Creating (id, percent, previous_percent), then using that to create two running sums of weights that are used as range brackets; then
2. Taking a random value, scaling it to the range of weights, and picking a value whose weights span the target bracket.
Here is something for you to play with:
select t1.id as id1
, case when t2.id is null then 0 else t2.id end as id2
, t1.percent as percent1
, case when t2.percent is null then 0 else t2.percent end as percent2
from "Test1" t1
left outer join "Test1" t2 on t1.id = t2.id + 1
where random() * 100 between t1.percent and
case when t2.percent is null then 0 else t2.percent end;
Essentially perform a left outer join so that you have two columns to apply a between clause.
Note that it will only work if you get your table ordered in the right way.
Based on Branko Dimitrijevic's answer, I wrote this query, which may or may not be faster, using the sum total of percent in tiered window functions (not unlike a ROLLUP).
WITH random AS (SELECT random() AS random)
SELECT id FROM (
    SELECT id, percent,
           SUM(percent) OVER (ORDER BY id) AS rank,
           SUM(percent) OVER () * random AS roll
    FROM test CROSS JOIN random
) t WHERE roll <= rank LIMIT 1
If the ordering isn't important, SUM(percent) OVER (ROWS UNBOUNDED PRECEDING) AS rank, may be preferable because it avoids having to sort the data first.
I also tried Mechanic Wei's answer (as described in this paper, apparently), which seems very promising in terms of performance, but after some testing the distribution appears to be off:
SELECT id
FROM test
ORDER BY random() ^ (1.0/percent)
LIMIT 1