I need to calculate the total average, as well as the average for each value of ColumnD, where the sum of ColumnB is divided by the number of unique values in ColumnA:
ColumnA ColumnB ColumnC ColumnD
A 10 xyz Ab
A 20 def Ab
A 5 mno Xy
B 10 pqr Ab
B 40 abc Xy
C 10 uvw Xy
Total Average (divided by unique ColumnA count):
(10 + 20 + 5 + 10 + 40 + 10) / 3 = 31.67
Now I need the average for ColumnD = 'Ab':
(10 + 20 + 10) / 2 = 20
Average for ColumnD = 'Xy':
(5 + 40 + 10) / 3 = 18.33
I created calculated columns in HANA:
Counter -> CA_Count on ColumnA (to get the distinct count)
CA_Avg -> ColumnB / CA_Count
For the average of 'Ab':
CA_AVG_Ab -> if(ColumnD = 'Ab', CA_Avg, 0)
but this value does not come out correctly.
To model different aggregation levels in calculation views you need to model separate data flows leading into separate aggregation nodes.
The outputs of these aggregation nodes can then be joined back together (with an outer join, obviously).
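Purely as an illustration of that "aggregate at two levels, then combine" logic (pandas here, not a calculation view), using the example data above:

import pandas as pd

df = pd.DataFrame({
    "ColumnA": ["A", "A", "A", "B", "B", "C"],
    "ColumnB": [10, 20, 5, 10, 40, 10],
    "ColumnD": ["Ab", "Ab", "Xy", "Ab", "Xy", "Xy"],
})

# Aggregation level 1: overall sum of ColumnB divided by the distinct count of ColumnA.
total_avg = df["ColumnB"].sum() / df["ColumnA"].nunique()    # 95 / 3 = 31.67

# Aggregation level 2: per ColumnD value, sum of ColumnB divided by the
# distinct count of ColumnA within that ColumnD group.
per_d = df.groupby("ColumnD").agg(
    sum_b=("ColumnB", "sum"),
    distinct_a=("ColumnA", "nunique"),
)
per_d["avg"] = per_d["sum_b"] / per_d["distinct_a"]          # Ab: 20.0, Xy: 18.33

print(total_avg)
print(per_d)

In a calculation view the equivalent is one aggregation node grouped by ColumnD, one not grouped at all, and a join node that brings their outputs back together; a single calculated column with a counter tries to collapse both levels into one pass, which is why it tends not to give the expected result here.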
Related
I have a table like this that captures the experiment data:
treatment metric values    control metric values
1                          6
2                          5
3                          7
...                        ...
I want to calculate the P-value for the experiment in Presto using SQL. I can take the average of the metric values for both the treatment and control groups to compare, but I need the P-value to see whether the results are statistically significant.
Given your data format, and assuming equal population sizes, that all users in the experiment are in the data set, etc.:
SELECT
  NORMAL_CDF(
    ABS(AVG("treatment metric values") - AVG("control metric values")),
    SQRT(VAR_SAMP("treatment metric values") + VAR_SAMP("control metric values")),
    0
  ) AS p_value
FROM experiment_data  -- placeholder: substitute your table name
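To sanity-check the query on a small extract, the same arithmetic can be reproduced in Python; NormalDist plays the role of Presto's NORMAL_CDF(mean, sd, value), and the sample values below are just the example rows from the table above:

from statistics import NormalDist, mean, variance

# Example values from the table above.
treatment = [1, 2, 3]
control = [6, 5, 7]

diff = abs(mean(treatment) - mean(control))
sd = (variance(treatment) + variance(control)) ** 0.5

# Same shape as NORMAL_CDF(mean, sd, value) in the Presto query:
# the probability of a value <= 0 under N(|diff|, sd).
p_value = NormalDist(mu=diff, sigma=sd).cdf(0)
print(p_value)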
I have a DataFrame with the following structure.
df = pd.DataFrame({'tenant_id': [1,1,1,2,2,2,3,3,7,7], 'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh', 'bcv', 'bcb', 'yth', 'ytn'],
'text':['apple', 'ball', 'card', 'toy', 'sleep', 'happy', 'sad', 'be', 'u', 'pop']})
df = df[['tenant_id', 'user_id', 'text']]
This gives the following output:
tenant_id user_id text
1 ab1 apple
1 avc1 ball
1 bc2 card
2 iuyt toy
2 fvg sleep
2 fbh happy
3 bcv sad
3 bcb be
7 yth u
7 ytn pop
I would like to groupby on tenant_id and create a new column which is a random selection of strings from the user_id column.
Thus, I would like my output to look like the following:
tenant_id user_id text new_column
1 ab1 apple [ab1, bc2]
1 avc1 ball [ab1]
1 bc2 card [avc1]
2 iuyt toy [fvg, fbh]
2 fvg sleep [fbh]
2 fbh happy [fvg]
3 bcv sad [bcb]
3 bcb be [bcv]
7 yth u [pop]
7 ytn pop [u]
Here, random ids from the user_id column have been selected; these ids can be repeated, as "fvg" is repeated for tenant_id = 2. I would like a threshold of not more than ten ids. This data is just a sample and has only 10 ids to start with, so in general the threshold is any number much smaller than the total number of user_ids; in this case, say, one less than the number of user_ids that belong to a tenant.
I first tried to figure out how to select a random subset of varying length with
df.sample
new_column = df.user_id.sample(n=np.random.randint(1, 10))
I am a bit lost after this; assigning it to my df results in NaNs, probably because the samples are of variable length. Please help.
Thanks.
Per my comment:
Your 'new column' is not a new column, it's a new cell for a single row.
If you want to assign the result to a new column, you need to create a new column, and apply the cell computation to it.
df['new column'] = df['user_id'].apply(lambda x: df.user_id.sample(n=np.random.randint(1, 10)).tolist())
It doesn't really matter which column you use for the apply, since the lambda's argument is not used in the computation.
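The snippet above samples from the whole user_id column. If the sampling should stay within each tenant, as in the desired output, a per-tenant variant could look like the following sketch; the cap of ten and the "one less than the tenant's user count" rule come from the question, everything else is an assumption:

import numpy as np
import pandas as pd

df = pd.DataFrame({'tenant_id': [1, 1, 1, 2, 2, 2, 3, 3, 7, 7],
                   'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh', 'bcv', 'bcb', 'yth', 'ytn'],
                   'text': ['apple', 'ball', 'card', 'toy', 'sleep', 'happy', 'sad', 'be', 'u', 'pop']})

def sample_ids(group):
    # Sample size: at least 1, at most 10, and at most one less than the
    # number of user_ids in this tenant (per the question's threshold).
    cap = min(10, max(len(group) - 1, 1))
    return group['user_id'].apply(
        lambda _: group['user_id'].sample(n=np.random.randint(1, cap + 1)).tolist()
    )

df['new_column'] = df.groupby('tenant_id', group_keys=False).apply(sample_ids)
print(df)

Each row gets its own sample drawn from the user_ids of its tenant, so repeats across rows of the same tenant are possible, matching the example output.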
This seems to be a two-step problem I'm trying to solve.
Let's say we have N records, and we are trying to distribute them as evenly as possible into K groups.
The second problem: each of the K groups can only accept up to M records.
For example, if we have 5 records and 3 groups, then we would distribute 2 into Group K1, 2 into Group K2 and 1 record into Group K3. However, say Group K1 only accepts at most 1 record. Then the arrangement would need to be 1 into Group K1, 2 into Group K2, and 2 into Group K3.
I'm not necessarily after the solution, but rather what algorithm I might need to use to solve this. Apparently for the distribution I need to use a greedy algorithm? But the second step seems to be a bit more complicated.
Edit:
The example I'm looking at is:
Number of records: 23
Groups: 10
Max records for each group
G1 = 4
G2 = 1
G3 = 0
G4 = 5
G5 = 0
G6 = 0
G7 = 2
G8 = 4
G9 = 2
G10 = 2
If N = 12 and K = 3, then in the normal situation you just split it as V = 12 / 3 = 4 for each group. But since you have the M limitation, and for example K3 can only accept 1, the distribution can end up as 6-5-1, which is not evenly distributed.
So I guess you need to sort the groups K based on their M limitation, so for the example above the order becomes K3-K1-K2.
Then, if the distributed value V is bigger than the accepted amount M for that group, you need to take the remainder and distribute it again over the remaining groups (K3 = 1, so the 4 - 1 = 3 left over must be distributed to K1 and K2).
The implementation might be complicated; I hope you can find a simpler solution for this.
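A minimal sketch of this sort-by-capacity idea, assuming groups with no stated limit are treated as unlimited:

import math

def distribute(n, capacities):
    """Distribute n records over groups as evenly as the capacities allow.

    `capacities` maps group name -> max records (math.inf for no limit).
    Groups are handled from smallest capacity to largest, recomputing the
    fair share over the groups that are still left each time.
    """
    result = {}
    remaining_records = n
    remaining = sorted(capacities, key=lambda g: capacities[g])
    while remaining:
        fair_share = math.ceil(remaining_records / len(remaining))
        group = remaining.pop(0)
        result[group] = min(fair_share, capacities[group], remaining_records)
        remaining_records -= result[group]
    return result

# The 12-records / 3-groups example above, with K3 capped at 1:
print(distribute(12, {"K1": math.inf, "K2": math.inf, "K3": 1}))
# {'K3': 1, 'K1': 6, 'K2': 5}

Because each allotment is also capped by the group's limit and the records still left, the sketch simply leaves records unplaced when the combined caps are too small; note that in the edited example the ten caps only add up to 20, so 3 of the 23 records cannot be placed no matter how they are spread.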
From what I understood, you need to handle all groups which allow only a small fixed number of records first, and then distribute the remaining records equally among the remaining groups. Let's take an example: say we have 15 records which need to be distributed among 5 groups (G1, G2, G3, G4 and G5), and assume that G2 and G4 allow a maximum of 2 and 4 records respectively. The algorithm then goes like this (a sketch of these steps follows below):
Get the average (ceiling integer) of records per group (in this example we get ceil(15 / 5) = 3).
Add up the max-allowed counts of all groups whose limit is smaller than this average (in this example only G2's limit of 2 is below the average, so the sum is 2).
Subtract the sum from step 2 from the total records, and subtract the number of groups involved in step 2 from the total groups (remaining records: 13, remaining groups: 4).
Get the new average (ceiling integer) from the remaining records and groups (new average: ceil(13 / 4) = 4).
Allot this new average to every remaining group except the last one (G1, G3 and G4 each get 4).
Allot whatever is left (13 - 3 × 4 = 1) to the last group (G5 gets 1).
Now what we finally have is:
G1(No limit): 4
G2(Limit 2): 2
G3(No limit): 4
G4(Limit 4): 4
G5(No limit): 1
Let me know if you think that this algo might fail for some scenarios.
Formula to get the ceiling integer average:
floor((#total_records + #total_groups - 1) / #total_groups)
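A sketch of these steps under the same assumptions (groups without a stated limit are treated as unlimited; as in the prose above, the caps of the remaining groups are not re-checked, which is one scenario where the algorithm could fail):

import math

def distribute_two_phase(total_records, limits):
    """`limits` maps group -> max records, or None for no limit."""
    avg = math.ceil(total_records / len(limits))

    # Steps 1-3: groups whose limit is below the average get exactly their limit.
    result = {g: cap for g, cap in limits.items() if cap is not None and cap < avg}
    remaining_records = total_records - sum(result.values())
    remaining_groups = [g for g in limits if g not in result]

    # Steps 4-6: new ceiling average for the rest; the last group takes the leftover.
    new_avg = math.ceil(remaining_records / len(remaining_groups))
    for g in remaining_groups[:-1]:
        result[g] = new_avg
        remaining_records -= new_avg
    result[remaining_groups[-1]] = remaining_records
    return result

# 15 records, with G2 capped at 2 and G4 at 4, as in the worked example:
print(distribute_two_phase(15, {"G1": None, "G2": 2, "G3": None, "G4": 4, "G5": None}))
# {'G2': 2, 'G1': 4, 'G3': 4, 'G4': 4, 'G5': 1}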
I tried to find solutions for this, and it is fairly easy to solve when the number of records is below a certain size. But...
I have an original list with 81,590 records.
Id Loc Sales LatLong
1 a 100 ...
2 b 110 ...
3 c 105 ...
4 d 125 ...
5 e 123 ...
6 f 35 ...
.
.
.
81,590 ... ... ...
I need to compare all items in the list against each other.
Id L1 L2 Dist
1 a a 0 --> Not needed. Self comparison.
2 a b 26
3 a c 150 --> Not needed. Distance >100.
4 a d 58
5 b a 26 --> Not needed. Repeated record.
6 b b 0 --> Not needed. Self comparison.
7 b c 15
8 b d 151 --> Not needed. Distance >100.
9 c a 150 --> Not needed. Repeated record.
10 c b 15 --> Not needed. Repeated record.
11 c c 0 --> Not needed. Self comparison.
12 c d 75
13 d a 58 --> Not needed. Repeated record.
14 d b 151 --> Not needed. Repeated record.
15 d c 75 --> Not needed. Repeated record.
16 d d 0 --> Not needed. Self comparison.
But as shown next to the records above, the end result needs to be a list that:
1) Compares records against each other ONLY when they are located at a certain distance, say <100 miles.
2) Does not contain duplicates in the sense that comparing Loc1 to Loc2 is the same as comparing Loc2 to Loc1.
3) And the obvious one, no need to compare Loc1 to itself.
The end result would be:
Id L1 L2 Dist
2 a b 26
4 a d 58
7 b c 15
12 c d 75
Approach:
In theory, the total number of records after comparing all items against themselves is 81,590 ^ 2 = 6,656,928,100 records.
Getting rid of self-comparisons (LocA-LocA) leaves 6,656,928,100 - 81,590 = 6,656,846,510, and removing repeated iterations (LocA-LocB = LocB-LocA) halves that to 3,328,423,255.
Then I could get rid of all records with distance > 100 miles.
This is highly inefficient: I'd be building a table with over 6 billion records, then deleting half of it, and so on.
Is there an approach to arrive at the end product in a much more efficient way (fewer steps, fewer select/delete/update operations)?
What is the SELECT statement needed to insert the final data set into the destination?
It sounds to me like a join of the table with itself plus some filtering on the key is needed, but this is where I am stuck.
What algorithm are you using to calculate the distance between two points? Simple “the world is flat” Cartesian math, or the trigonometry-laden “the world is an oblate spheroid” one? This can turn into serious CPU requirements.
It’s probably best to generate a table of “locations that are within distance X of this location” once and store it permanently; barring major events like earthquakes, it’s just not going to change.
Query-wise, the base join is trivial:
SELECT
t1.Loc L1
,t2.Loc L2
from MyTable t1
inner join MyTable t2
on t2.Loc > t1.Loc
If you have the distance formula in, say, a function named “distanceFunction”, it might look something like:
WITH cteCalc as (
select
t1.Loc L1
,t2.Loc L2
,dbo.distanceFunction(t1.LatLong, t2.LatLong) Dist
from MyTable t1
inner join MyTable t2
on t2.Loc > t1.Loc
where dbo.distanceFunction(t1.LatLong, t2.LatLong) < #MaxDistance)
INSERT TargetTable (L1, L2, Dist)
SELECT
L1
,L2
,Dist
from cteCalc
where Dist <= #MaxDistance
This, of course, may break your system, if only because the transaction log will grow too big while you’re writing a few billion rows to the target table. I'd say build a loop, processing each location in turn, with the final query like:
WITH cteCalc as (
select
t1.Loc L1
,t2.Loc L2
,dbo.distanceFunction(t1.LatLong, t2.LatLong) Dist
from MyTable t1
inner join MyTable t2
on t2.Loc > t1.Loc
where dbo.distanceFunction(t1.LatLong, t2.LatLong) < #MaxDistance
and t1.Loc = #ThisIterationLoc)
INSERT TargetTable (L1, L2, Dist)
SELECT
L1
,L2
,Dist
from cteCalc
where Dist <= #MaxDistance
The first pass returns 81,589 rows less whichever are too far away, the second pass has 81,588 to process, and so forth.
Here is an outline of how I would solve this problem (a sketch of the idea follows the outline):
Put indexes on latitude and longitude
Do the math to convert your distance into a change in latitude and longitude, giving a range (box). Then you know that anything within your distance (a box, not a circle) is contained in this delta, and anything outside the delta cannot be within the distance. This constrains the problem considerably.
For example, if the change in lat and long for your distance is 10, then for a location at (100, 100) your box would be defined by the corners (95, 95) and (105, 105).
Write a query that looks at each element (from the lowest id) and searches for other elements (with a greater id, to avoid duplicates) within the delta of lat and long, and save this to a temporary table.
Iterate over that table and do the full calculation to see whether each pair is within the circle (not just the box) of your distance.
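A sketch of that box-then-circle outline in Python rather than SQL, with the haversine formula standing in for the unspecified distance function and a made-up three-location list:

import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/long points."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def close_pairs(locations, max_miles=100):
    """locations: list of (id, lat, lon). Returns (id1, id2, dist) with id1 before id2."""
    # Rough bounding-box deltas: one degree of latitude is ~69 miles; the
    # longitude delta widens with latitude (a simplification, fine as a pre-filter).
    dlat = max_miles / 69.0
    pairs = []
    for i, (id1, lat1, lon1) in enumerate(locations):
        dlon = max_miles / (69.0 * max(math.cos(math.radians(lat1)), 0.01))
        for id2, lat2, lon2 in locations[i + 1:]:   # greater index: no duplicates, no self-pairs
            if abs(lat1 - lat2) > dlat or abs(lon1 - lon2) > dlon:
                continue                            # outside the box: skip the expensive check
            d = haversine_miles(lat1, lon1, lat2, lon2)
            if d < max_miles:
                pairs.append((id1, id2, d))
    return pairs

print(close_pairs([("a", 40.7, -74.0), ("b", 40.8, -73.9), ("c", 42.4, -71.1)]))

In the SQL version the box comparison is exactly what indexes on latitude and longitude can use; the exact circle check then runs only on the pairs that survive the box filter.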
I have a cube built on a fact which, amongst others, includes the Balance and Percentage columns. I have a calculation which multiplies the Balance by the Percentage to obtain an Adjusted Value. I now need to have this Adjusted Value divided by the sum of all balances, to get weighted values.
The problem is that this sum of all balances doesn't apply to the whole dataset. Rather, it should be calculated on a filtered subset of the whole data. This filtering is being done in Excel using a pivot table, so I do not know what conditions will be used to filter.
So, for example, this is the pivot I'd like to see:
ID Balance Percentage Adjusted Value Weighted Adjusted Value
1 100 1.5 115 0.38 (ie 115/300)
2 50 2 51 0.17 (ie 51/300)
3 150 1 150 0.50 (ie 150/300)
300 is obtained by summing the balance of the rows that show in the filtered pivot.
Can this calculation somehow be done in OLAP? Or is it impossible to compute this sum with what I know?
Yes, it should be possible; e.g., assuming IDs 1/2/3 are the children of a common parent, the following calculated measure should do the trick:
WAV = AV / ( id.parent, Balance )
If not, we would need more information about the actual data model and query.