Transpose some columns to rows - SQL

I know that similar questions have been asked before, but I haven't managed to do what I need, so I'm asking here.
I have a table with Client_ID and, for each client, the probability of purchasing each of several product categories.
Client_ID | Prob_CategoryA | Prob_CategoryB | Prob_CategoryC
1         | 0.2            | 0.3            | 0.2
2         | 0.4            | 0.6            | 0.7
3         | 0.3            | 0.7            | 0.4
Now what I would like to do is transform the above table into this.
Client_ID | Category Name | Probability
1         | A             | 0.2
1         | B             | 0.3
1         | C             | 0.2
2         | A             | 0.4
2         | B             | 0.6
2         | C             | 0.7
3         | A             | 0.3
3         | B             | 0.7
3         | C             | 0.4
Thank you very much

Simple UNPIVOT. The SUBSTRING(Cat, 14, 1) call extracts the 14th character of the unpivoted column name, i.e. the category letter that follows the 13-character prefix Prob_Category:
SELECT Client_Id, SUBSTRING(Cat, 14, 1) [Category Name], Probability
FROM Src
UNPIVOT (Probability FOR Cat IN (Prob_CategoryA, Prob_CategoryB, Prob_CategoryC)) UP
Result
Client_Id   Category Name Probability
----------- ------------- -----------
1           A             0.2
1           B             0.3
1           C             0.2
2           A             0.4
2           B             0.6
2           C             0.7
3           A             0.3
3           B             0.7
3           C             0.4

Use UNPIVOT:
SELECT Client_ID, Cats, Probability
FROM
(SELECT Client_ID, Prob_CategoryA, Prob_CategoryB, Prob_CategoryC
FROM yourTable) t
UNPIVOT
(Probability FOR Cats IN (Prob_CategoryA, Prob_CategoryB, Prob_CategoryC)
) AS c
Note that here the Cats column will contain the full original column names (e.g. Prob_CategoryA); apply the same SUBSTRING trick as in the first answer if you only want the category letter.


groupby shows unobserved values of non-categorical columns

I created this simple example to illustrate my issue:
import pandas as pd

x = pd.DataFrame({
    "int_var1": range(3),
    "int_var2": range(3, 6),
    "cat_var": pd.Categorical(["a", "b", "a"]),
    "value": [0.1, 0.2, 0.3],
})
It yields this DataFrame:
int_var1  int_var2  cat_var  value
0         3         a        0.1
1         4         b        0.2
2         5         a        0.3
where the first two columns are integers, the third column is categorical with two levels, and the fourth column is floats. The issue is that when I use groupby followed by agg, I seem to have only two options. Either I show no unobserved values, like so:
x.groupby(['int_var1', 'int_var2', 'cat_var'], observed=True).agg({"value": "sum"}).fillna(0)
int_var1  int_var2  cat_var  value
0         3         a        0.1
1         4         b        0.2
2         5         a        0.3
or I can show unobserved values for all grouping variables like so:
x.groupby(['int_var1', 'int_var2', 'cat_var'], observed=False).agg({"value": "sum"}).fillna(0)
int_var1  int_var2  cat_var  value
0         3         a        0.1
                    b        0.0
          4         a        0.0
                    b        0.0
          5         a        0.0
                    b        0.0
1         3         a        0.0
                    b        0.0
          4         a        0.0
                    b        0.2
          5         a        0.0
                    b        0.0
2         3         a        0.0
                    b        0.0
          4         a        0.0
                    b        0.0
          5         a        0.3
                    b        0.0
Is there a way to show unobserved values for the categorical variables only and not every possible permutation of all grouping variables?
You can unstack the level of interest, cat_var in this case:
(x.groupby(['int_var1', 'int_var2', 'cat_var'], observed=True)
  .agg({'value': 'sum'})
  .unstack('cat_var', fill_value=0)
)
Output:
                  value
cat_var               a    b
int_var1 int_var2
0        3          0.1  0.0
1        4          0.0  0.2
2        5          0.3  0.0
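If you need the result back in long format, with one row per observed (int_var1, int_var2) pair and category level, you can stack the unstacked level again; a minimal sketch building on the code above:
out = (x.groupby(['int_var1', 'int_var2', 'cat_var'], observed=True)
        .agg({'value': 'sum'})
        .unstack('cat_var', fill_value=0)   # expand only the categorical level
        .stack()                            # back to long format
        .reset_index())
This expands only cat_var to its full set of levels, leaving the integer grouping columns at their observed combinations.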

Calculate more than one column using a for loop based on multiple specific conditions in pandas

I have a dataframe as shown below.
B_ID  No_Show  Session  slot_num  Patient_count
1     0.2      S1       1         1
2     0.3      S1       2         1
3     0.8      S1       3         1
4     0.3      S1       3         2
5     0.6      S1       4         1
6     0.8      S1       5         1
7     0.9      S1       5         2
8     0.4      S1       5         3
9     0.6      S1       5         4
12    0.9      S2       1         1
13    0.5      S2       1         2
14    0.3      S2       2         1
15    0.7      S2       3         1
20    0.7      S2       4         1
16    0.6      S2       5         1
17    0.8      S2       5         2
19    0.3      S2       5         3
where
No_Show = probability of a no-show
Assume that
p = [0.2, 0.4] (p = threshold probability) and the duration of each slot = 30 minutes.
From the above I would like to calculate the data frame below.
Step 1
Sort the dataframe based on Session, slot_num and Patient_count:
df = df.sort_values(['Session', 'slot_num', 'Patient_count'], ascending=False)
Step 2
Calculate the cut-off using the conditions below.
If Patient_count = 1, divide No_Show by the threshold probability.
Example: for B_ID = 3, Patient_count = 1, cut_off = 0.8/0.2 = 4.
Else if Patient_count = 2, multiply the previous No_Show in the slot by the current No_Show and divide by the threshold.
Example: for B_ID = 4, Patient_count = 2, cut_off = (0.3*0.8)/0.2 = 1.2.
Else if Patient_count = 3, multiply the previous two No_Show values by the current No_Show and divide by the threshold.
Example: for B_ID = 8, Patient_count = 3, cut_off = (0.4*0.9*0.8)/0.2 = 1.44.
And so on.
The Expected Output:
B_ID  No_Show  Session  slot_num  Patient_count  Cut_off_0.2  Cut_off_0.4
1     0.2      S1       1         1              1            0.5
2     0.3      S1       2         1              1.5          0.75
3     0.8      S1       3         1              4            2
4     0.3      S1       3         2              1.2          0.6
5     0.6      S1       4         1              3            1.5
6     0.8      S1       5         1              4            2
7     0.9      S1       5         2              3.6          1.8
8     0.4      S1       5         3              1.44         0.72
9     0.6      S1       5         4              0.864        0.432
12    0.9      S2       1         1              4.5          2.25
13    0.5      S2       1         2              2.25         1.125
14    0.3      S2       2         1              1.5          0.75
15    0.7      S2       3         1              3.5          1.75
20    0.7      S2       4         1              3.5          1.75
16    0.6      S2       5         1              3            1.5
17    0.8      S2       5         2              2.4          1.2
19    0.3      S2       5         3              0.72         0.36
I tried the code below:
p = [0.2, 0.4]
for i in p:
    df['Cut_off_'+'i'] = df.groupby(['Session','slot_num'])['No_Show'].cumprod().div(i)
Your solution is possible here with f-strings, using {i} in the new column names:
p = [0.2, 0.4]
for i in p:
    df[f'Cut_off_{i}'] = df.groupby(['Session','slot_num'])['No_Show'].cumprod().div(i)
A solution with numpy is also possible: the cumulative product is converted to a numpy array and divided by p, then converted to a DataFrame and joined to the original.
import numpy as np

p = [0.2, 0.4]
arr = df.groupby(['Session','slot_num'])['No_Show'].cumprod().values[:, None] / np.array(p)
df = df.join(pd.DataFrame(arr, columns=p, index=df.index).add_prefix('Cut_off_'))
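As a quick sanity check of the f-string loop, here is a hypothetical three-row slice of the data (only the columns the computation needs):
import pandas as pd

df = pd.DataFrame({
    'Session':  ['S1', 'S1', 'S1'],
    'slot_num': [3, 3, 4],
    'No_Show':  [0.8, 0.3, 0.6],
})
p = [0.2, 0.4]
for i in p:
    # cumulative product of No_Show within each (Session, slot_num) group, divided by the threshold
    df[f'Cut_off_{i}'] = df.groupby(['Session', 'slot_num'])['No_Show'].cumprod().div(i)
print(df)
The second row yields Cut_off_0.2 = (0.8*0.3)/0.2 = 1.2, matching the expected output for B_ID = 4.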

Excel SUMPRODUCT function in pandas dataframes

OK, as a Python beginner I found that matrix multiplication in pandas dataframes is very difficult to carry out.
I have two tables that look like:
df1
   Id  lifetime    0    1    2    3    4    5  ...   30
0   1         4  0.1  0.2  0.1  0.4  0.5  0.4  ...  0.2
1   2         7  0.3  0.2  0.5  0.4  0.5  0.4  ...  0.2
2   3         8  0.5  0.2  0.1  0.4  0.5  0.4  ...  0.6
...
9   6        10  0.3  0.2  0.5  0.4  0.5  0.4  ...  0.2
df2
   Group  lifetime    0    1    2    3    4    5  ...   30
0      2         4  0.9  0.8  0.9  0.8  0.8  0.8  ...  0.9
1      2         7  0.8  0.9  0.9  0.9  0.8  0.8  ...  0.9
2      3         8  0.9  0.7  0.8  0.8  0.9  0.9  ...  0.9
...
9      5        10  0.8  0.9  0.7  0.7  0.9  0.7  ...  0.9
I want to perform Excel's SUMPRODUCT function in my code, where the number of columns to be multiplied and summed is given by the lifetime column of both dfs, e.g.:
for row 0 in df1 & df2, lifetime = 4:
sumproduct(df1 row 0 from column 0 to column 3,
           df2 row 0 from column 0 to column 3)
for row 1 in df1 & df2, lifetime = 7:
sumproduct(df1 row 1 from column 0 to column 6,
           df2 row 1 from column 0 to column 6)
...
How can I do this?
You can use .iloc to access rows and columns with integers.
Row 0 is where lifetime == 4. Counting column positions, the column labeled 0 sits at position 2 and the column labeled 3 sits at position 5, so to get that interval you slice 2:6.
Once you have the correct data in both data frames with .iloc[0, 2:6], you run np.dot.
See below:
import numpy as np
np.dot(df1.iloc[0, 2:6], df2.iloc[0, 2:6])
Just to make sure you have the right data, try just running
df1.iloc[0,2:6]
Then try the np.dot product. You can read up on "pandas iloc" and "slicing" for more info.
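To apply this to every row at once, you can loop over the rows and slice each one by its own lifetime. A sketch assuming, as above, that the per-period columns start at position 2 and that the lifetimes in df1 and df2 match row by row (the sumproduct column name is just an illustration):
import numpy as np

# one SUMPRODUCT per row, truncated to that row's lifetime
result = [
    np.dot(df1.iloc[i, 2:2 + n], df2.iloc[i, 2:2 + n])
    for i, n in enumerate(df1['lifetime'].astype(int))
]
df1['sumproduct'] = result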

Trying to group by but only specific rows based on their value

I am finding this issue quite complex:
I have the following df:
values_1  values_2  values_3  id  name
0.1       0.2       0.3       1   AAAA_living_thing
0.1       0.2       0.3       1   AAA_mammals
0.1       0.2       0.3       1   AA_dog
0.2       0.4       0.6       2   AAAA_living_thing
0.2       0.4       0.6       2   AAA_something
0.2       0.4       0.6       2   AA_dog
The output should be:
values_1  values_2  values_3  id  name
0.3       0.6       0.9       3   AAAA_living_thing
0.1       0.2       0.3       1   AAA_mammals
0.1       0.2       0.3       1   AA_dog
0.2       0.4       0.6       2   AAA_something
0.2       0.4       0.6       2   AA_dog
It would be like a groupby().sum(), but only for AAAA_living_thing, as the rows below it are children of AAAA_living_thing.
Separate the dataframe first by using query to get the rows with AAAA_living_thing and the rows without it. Then use groupby and finally concat them back together (depending on your setup, the .str methods inside query may require engine='python'):
temp = df.query('name.str.startswith("AAAA")').groupby('name', as_index=False).sum()
temp2 = df.query('~name.str.startswith("AAAA")')
final = pd.concat([temp, temp2])
Output
   id  name               values_1  values_2  values_3
0   3  AAAA_living_thing       0.3       0.6       0.9
1   1  AAA_mammals             0.1       0.2       0.3
2   1  AA_dog                  0.1       0.2       0.3
4   2  AAA_something           0.2       0.4       0.6
5   2  AA_dog                  0.2       0.4       0.6
Another way would be to make a unique identifier for the rows which are not AAAA_living_thing with np.where, then groupby on name plus that identifier:
import numpy as np

s = np.where(df['name'].str.startswith('AAAA'), 0, df.index)
final = df.groupby(['name', s], as_index=False).sum()
Output
   name               values_1  values_2  values_3  id
0  AAAA_living_thing       0.3       0.6       0.9   3
1  AAA_mammals             0.1       0.2       0.3   1
2  AAA_something           0.2       0.4       0.6   2
3  AA_dog                  0.1       0.2       0.3   1
4  AA_dog                  0.2       0.4       0.6   2

CTE not going through recursion

I am using a CTE for the first time and am running into some difficulty. I have looked online and am trying to piece together examples.
I want to insert rows between each two rows returned, to account for all the days in between. Row 1 has Date(A) and row 2 has Date(B); I want to insert a row for every day between A and B, where those rows all have the same values as Row 1.
If I run only my anchor definition on my test data, I get 341 rows. After running the CTE, I get 682, so it only runs once.
Any suggestions you can provide would be great. Thanks.
I have the following table schema:
field1 varchar(10)
field2 smalldatetime
field3 numeric(18,0)
field4 numeric(18,6)
field5 numeric(18,6)
field6 numeric(18,3)
An example of the input table is:
ABC 1-1-1990 0 0.1 0.1 0.125
ABC 1-5-1990 1 0.2 0.2 1.0
ABC 1-9-1990 0 0.3 0.3 0.750
ABC 1-13-1990 1 0.4 0.4 1.500
I want to turn that into this:
ABC 1-1-1990 0 0.1 0.1 0.125
ABC 1-2-1990 0 0.1 0.1 0.125
ABC 1-3-1990 0 0.1 0.1 0.125
ABC 1-4-1990 0 0.1 0.1 0.125
ABC 1-5-1990 1 0.2 0.2 1.0
ABC 1-6-1990 1 0.2 0.2 1.0
ABC 1-7-1990 1 0.2 0.2 1.0
ABC 1-8-1990 1 0.2 0.2 1.0
ABC 1-9-1990 0 0.3 0.3 0.750
ABC 1-10-1990 0 0.3 0.3 0.750
ABC 1-11-1990 0 0.3 0.3 0.750
ABC 1-12-1990 0 0.3 0.3 0.750
ABC 1-13-1990 1 0.4 0.4 1.500
Here is my current CTE:
WITH NewData (field1,field2,field3,field4,field5,field6) AS
(
SELECT m.field1,m.field2,m.field3,m.field4,m.field5,m.field6
FROM MyTable as m
WHERE m.field1 is not null
GROUP BY m.field1,m.field2,m.field3,m.field4,m.field5,m.field6
UNION ALL
SELECT m.field1, DATEADD(d, 1, m.field2), m.field3, m.field4, m.field5, m.field6
FROM MyTable as m
)
SELECT field1,field2,field3, field4, field5,field6
FROM NewData
order by field1, field2
OPTION(MAXRECURSION 0)
Current Output (it misses dates 1-3-1990, 1-4-1990, 1-7-1990, 1-8-1990, 1-11-1990, 1-12-1990):
ABC 1-1-1990 0 0.1 0.1 0.125
ABC 1-2-1990 0 0.1 0.1 0.125
ABC 1-5-1990 1 0.2 0.2 1.0
ABC 1-6-1990 1 0.2 0.2 1.0
ABC 1-9-1990 0 0.3 0.3 0.750
ABC 1-10-1990 0 0.3 0.3 0.750
ABC 1-13-1990 1 0.4 0.4 1.500
Your CTE is not actually recursive: the portion you think is recursive never references the CTE itself, so what you have is just a normal UNION ALL query. That is why the row count exactly doubles, which makes it look like one pass of recursion when it is only a union.
http://msdn.microsoft.com/en-us/library/ms186243.aspx
WITH NewData (field1,field2,field3,field4,field5,field6) AS
(
SELECT m.field1,m.field2,m.field3,m.field4,m.field5,m.field6
FROM MyTable as m
WHERE m.field1 is not null
GROUP BY m.field1,m.field2,m.field3,m.field4,m.field5,m.field6
UNION ALL
SELECT m.field1, DATEADD(d, 1, m.field2), m.field3, m.field4, m.field5, m.field6
FROM MyTable as m
INNER JOIN NewData n on n.field1 = m.field1
)
I am not entirely sure what join condition you want to recurse on, so I have just used field1 in the code example; use that join to define how the rows relate. For your day-filling scenario the recursive member will also need a termination condition (for example, stop incrementing once the generated date reaches the next row's date), otherwise the recursion will never end.