Sum weights based on other rows' values - SQL

I'm looking for a query that sums the weights (from the first row, q17) for every row (every question), based on the answers to the other questions. The idea is to leave a weight out of the sum when the respondent has no value for that question.
Example:
q17 5 5 4 3 5 5 4 5 = 36 for q17
q18 4 2 3 2 5 4 1 4 = 36 for q18
q19 5 2 4 2 5 4 1 4 = 36 for q19
q20 4 2 5 3 5 4 1 4 = 36 for q20
q21 4 5 3 5 5 1 = 26 for q21
q22 4 2 4 2 4 4 1 4 = 36 for q22
q23 4 1 4 3 5 4 1 4 = 36 for q23
q24 4 1 4 1 5 3 1 4 = 36 for q24
q25 5 2 4 3 5 4 1 4 = 36 for q25
q26 5 4 5 3 5 5 5 5 = 36 for q26
q27 5 4 4 1 5 4 1 4 = 36 for q27
q28 4 1 5 2 5 5 1 4 = 36 for q28
q29 5 5 5 4 5 4 5 5 = 36 for q29
q30 4 2 3 2 5 4 1 4 = 36 for q30
q31 4 3 4 4 5 4 1 5 = 36 for q31
q32 4 1 4 1 5 4 1 4 = 36 for q32
The weights are in q17, and I need to calculate the summed weight for every question; where a question isn't answered, its weight must not be added to the sum. Don't treat the q18-q32 values as numbers to sum: use them only as true/false flags for whether a question was answered, and sum the q17 weights accordingly for every question. For example, q21 was skipped by the second and eighth respondents, whose q17 weights are both 5, so its sum is 36 - 5 - 5 = 26.
The data is the following:
Q A
17 5
18 4
19 5
20 4
21 4
22 4
23 4
24 4
25 5
26 5
27 5
28 4
29 5
30 4
31 4
32 4
17 5
18 2
19 2
20 2
22 2
23 1
24 1
25 2
26 4
27 4
28 1
29 5
30 2
31 3
32 1
17 4
18 3
19 4
20 5
21 5
22 4
23 4
24 4
25 4
26 5
27 4
28 5
29 5
30 3
31 4
32 4
17 3
18 2
19 2
20 3
21 3
22 2
23 3
24 1
25 3
26 3
27 1
28 2
29 4
30 2
31 4
32 1
17 5
18 5
19 5
20 5
21 5
22 4
23 5
24 5
25 5
26 5
27 5
28 5
29 5
30 5
31 5
32 5
17 5
18 4
19 4
20 4
21 5
22 4
23 4
24 3
25 4
26 5
27 4
28 5
29 4
30 4
31 4
32 4
17 4
18 1
19 1
20 1
21 1
22 1
23 1
24 1
25 1
26 5
27 1
28 1
29 5
30 1
31 1
32 1
17 5
18 4
19 4
20 4
22 4
23 4
24 4
25 4
26 5
27 4
28 4
29 5
30 4
31 5
32 4

I think this is what you require:
SELECT 'q' + CAST(Q AS nvarchar(4)) AS Q,
       CAST(SUM(A) AS nvarchar(4)) + ' total for q' + CAST(Q AS nvarchar(4)) AS A
FROM tbl
GROUP BY Q
Output:
Q A
q17 36 total for q17
q18 25 total for q18
q19 27 total for q19
q20 28 total for q20
q21 23 total for q21
q22 25 total for q22
q23 26 total for q23
q24 23 total for q24
q25 28 total for q25
q26 37 total for q26
q27 28 total for q27
q28 27 total for q28
q29 38 total for q29
q30 25 total for q30
q31 30 total for q31
q32 24 total for q32

CREATE TABLE Table1
(`col0` varchar(3),
`col1` int,
`col2` int,
`col3` int,
`col4` int,
`col5` int,
`col6` int,
`col7` int,
`col8` int)
;
INSERT INTO Table1
(`col0`, `col1`, `col2`, `col3`, `col4`, `col5`, `col6`, `col7`, `col8`)
VALUES
('q17', 5, 5, 4, 3, 5, 5, 4, 5),
('q18', 4, 2, 3, 2, 5, 4, 1, 4),
('q19', 5, 2, 4, 2, 5, 4, 1, 4),
('q20', 4, 2, 5, 3, 5, 4, 1, 4),
('q21', 4, NULL, 5, 3, 5, 5, 1, NULL),
('q22', 4, 2, 4, 2, 4, 4, 1, 4),
('q23', 4, 1, 4, 3, 5, 4, 1, 4),
('q24', 4, 1, 4, 1, 5, 3, 1, 4),
('q25', 5, 2, 4, 3, 5, 4, 1, 4),
('q26', 5, 4, 5, 3, 5, 5, 5, 5),
('q27', 5, 4, 4, 1, 5, 4, 1, 4),
('q28', 4, 1, 5, 2, 5, 5, 1, 4),
('q29', 5, 5, 5, 4, 5, 4, 5, 5),
('q30', 4, 2, 3, 2, 5, 4, 1, 4),
('q31', 4, 3, 4, 4, 5, 4, 1, 5),
('q32', 4, 1, 4, 1, 5, 4, 1, 4)
;
SELECT *,
       CONCAT('= ', (IFNULL(`col1`, 0) +
                     IFNULL(`col2`, 0) +
                     IFNULL(`col3`, 0) +
                     IFNULL(`col4`, 0) +
                     IFNULL(`col5`, 0) +
                     IFNULL(`col6`, 0) +
                     IFNULL(`col7`, 0) +
                     IFNULL(`col8`, 0)), ' For ', `col0`) AS Result
FROM Table1
Live Demo
http://sqlfiddle.com/#!9/8dcff2/2
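For the weight-based total the question describes (summing the q17 weights only over respondents who answered a given question), here is a minimal pandas sketch of that logic on the pivoted layout above; the respondent column names r1-r8 are made up for the sketch, a subset of rows is enough to show the idea, and q21 comes out as the expected 26:

import pandas as pd

# Pivoted layout as in the answer above: one row per question, one (hypothetical)
# column per respondent; only three of the sixteen question rows are shown.
wide = pd.DataFrame(
    [["q17", 5, 5, 4, 3, 5, 5, 4, 5],
     ["q18", 4, 2, 3, 2, 5, 4, 1, 4],
     ["q21", 4, None, 5, 3, 5, 5, 1, None]],
    columns=["q", "r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8"],
).set_index("q")

weights = wide.loc["q17"]   # the q17 row holds the per-respondent weights
answered = wide.notna()     # True/False: did this respondent answer the question?
print(answered.mul(weights, axis=1).sum(axis=1))  # q17 36, q18 36, q21 26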

How can I plot two lines in one graph where values of the lines do not exist for the same x axis?

I would like to plot SupDem (variable) where e_boix_regime==1 and SupDem where e_boix_regime==0.
My data:
year  SupDem  e_boix_regime
1997  0.98    1
1998  0.75    0
My code:
dem = dem_aut[dem_aut["e_boix_regime"]==1].SupDem
aut = dem_aut[dem_aut["e_boix_regime"]==0].SupDem
year = dem_aut["year"]
plt.plot(year, dem, label="Support for Democracy in Democracies")
plt.plot(year, aut, label="Support for Democracy in Autocracies")
plt.show()
The error is the following: x and y must have same first dimension, but have shapes (53,) and (28,)
I just wanted to plot two lines together.
This can help you solve the problem. I hope you can reproduce the code with it:
two (or more) graphs in one plot with different x-axis AND y-axis scales in python
Issue
Your issue is with the shapes of x and y: to plot them against each other, the x-values and y-values must have the same number of data points.
Solution
Filter year with the same dem_aut["e_boix_regime"] condition you apply to SupDem (the demo below uses classes 1 and 2).
Source Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(
    {
        "SupDem": np.random.randint(1, 11, 30),
        "year": np.random.randint(10, 21, 30),
        "e_boix_regime": np.random.randint(1, 3, 30),
    }
)  # see DataFrame below
df["e_boix_regime"].value_counts() # 1 = 18, 2 = 12
df[df["e_boix_regime"] == 2][["SupDem", "year"]] # see below
# you need the same number of data points for both axes, i.e. `year` and `SupDem`
plt.plot(
    df[df["e_boix_regime"] == 1]["year"],
    df[df["e_boix_regime"] == 1]["SupDem"],
    marker="o",
    label="e_boix_regime==1",
)
# hence apply the same condition for grabbing year that is applied to SupDem
plt.plot(
    df[df["e_boix_regime"] == 2]["year"],
    df[df["e_boix_regime"] == 2]["SupDem"],
    marker="o",
    label="e_boix_regime==2",
)
plt.xlabel("Year")
plt.ylabel("SupDem")
plt.legend()
plt.show()
Output
PS: Ignore the shape of the plotted lines; the data is generated from random values
DataFrame Outputs
SupDem year e_boix_regime
0 1 12 2
1 10 10 1
2 5 19 2
3 4 14 2
4 8 14 2
5 4 17 2
6 2 15 2
7 10 11 1
8 8 11 2
9 6 19 2
10 5 15 1
11 8 17 1
12 9 10 2
13 1 14 2
14 8 18 1
15 3 13 2
16 6 16 2
17 1 16 1
18 7 13 1
19 8 15 2
20 2 17 2
21 5 10 2
22 1 19 2
23 5 20 2
24 7 16 1
25 10 14 1
26 2 11 2
27 1 18 1
28 5 16 1
29 10 18 2
df[df["e_boix_regime"] == 2][["SupDem", "year"]]
SupDem year
0 1 12
2 5 19
3 4 14
4 8 14
5 4 17
6 2 15
8 8 11
9 6 19
12 9 10
13 1 14
15 3 13
16 6 16
19 8 15
20 2 17
21 5 10
22 1 19
23 5 20
26 2 11
29 10 18
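Applied to the variables from the question, the same fix would look like this (a minimal sketch; dem_aut is rebuilt here from the two sample rows shown in the question):

import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for the asker's data, built from the two rows shown in the question.
dem_aut = pd.DataFrame({"year": [1997, 1998], "SupDem": [0.98, 0.75], "e_boix_regime": [1, 0]})

# Filter year with the same condition as SupDem, so x and y keep matching shapes.
dem = dem_aut[dem_aut["e_boix_regime"] == 1]
aut = dem_aut[dem_aut["e_boix_regime"] == 0]
plt.plot(dem["year"], dem["SupDem"], marker="o", label="Support for Democracy in Democracies")
plt.plot(aut["year"], aut["SupDem"], marker="o", label="Support for Democracy in Autocracies")
plt.legend()
plt.show()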

R : which.max + condition does not return the expected value

I made a reproducible example: a dataframe with two patient IDs (ID 1 and ID 2) and the value of a measurement (m_value) taken on different days (m_day).
df <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2),
                 m_value = c(10, 15, 12, 13, 18, 16, 19),
                 m_day = c(14, 143, 190, 402, 16, 55, 75))
ID m_value m_day
1 1 10 14
2 1 15 143
3 1 12 190
4 1 13 402
5 2 18 16
6 2 16 55
7 2 19 75
Now I want to obtain, for each patient, the best value of m before day 100 (period 1) and from day 100 onward (period 2), together with the dates of these best values, so that I obtain the following table:
ID m_value m_day best_m_period1 best_m_period2 date_best_m_period1 date_best_m_period2
1 1 10 14 10 15 14 143
2 1 15 143 10 15 14 143
3 1 12 190 10 15 14 143
4 1 13 402 10 15 14 143
5 2 18 16 19 NA 75 NA
6 2 16 55 19 NA 75 NA
7 2 19 75 19 NA 75 NA
I tried the following code:
df2 <- df %>%
  group_by(ID) %>%
  mutate(best_m_period1 = max(m_value[m_day < 100])) %>%
  mutate(best_m_period2 = max(m_value[m_day >= 100])) %>%
  mutate(date_best_m_period1 =
           ifelse(is.null(which.max(m_value[m_day < 100])), NA,
                  m_day[which.max(m_value[m_day < 100])])) %>%
  mutate(date_best_m_period2 =
           ifelse(is.null(which.max(m_value[m_day >= 100])), NA,
                  m_day[which.max(m_value[m_day >= 100])]))
But I obtain the following table:
ID m_value m_day best_m_period1 best_m_period2 date_best_m_period1 date_best_m_period2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 10 14 10 15 14 14
2 1 15 143 10 15 14 14
3 1 12 190 10 15 14 14
4 1 13 402 10 15 14 14
5 2 18 16 19 -Inf 75 NA
6 2 16 55 19 -Inf 75 NA
7 2 19 75 19 -Inf 75 NA
The date_best_m_period2 for ID 1 is not 143 as expected (the day of the period-2 maximum of 15 for ID 1), but 14, the day of the period-1 maximum. How can I resolve this problem? Thank you very much for your help.

Pandas: how to group on column change?

I am working with a log system, and I need to group data in a non-standard way. Alas, with my limited knowledge of Pandas I couldn't find any example, probably because I don't know the proper search terms.
This is a sample dataframe:
df = pd.DataFrame({
    "speed": [2, 4, 6, 8, 8, 9, 2, 3, 8, 9, 13, 18, 25, 27, 18, 8, 6, 8, 12, 20, 27, 34, 36, 41, 44, 54, 61, 60, 61, 40, 17, 12, 15, 24],
    "class": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 3, 1, 1, 1, 2]
})
df.groupby(by="class").groups returns the indices of each row, all grouped together by class value:
class indexes
1: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 15, 16, 17, 18, 30, 32],
2: [12, 13, 19, 20, 21, 22, 33],
3: [23, 24, 29],
4: [25],
5: [26, 27, 28]
I need instead to split every time column class changes:
speed class
0 2 1
1 4 1
2 6 1
3 8 1
4 8 1
5 9 1
6 2 1
7 3 1
8 8 1
9 9 1
10 13 1
11 18 1
12 25 2 <= split here
13 27 2
14 18 1 <= split here
15 8 1
16 6 1
17 8 1
18 12 1 <= split here
19 20 2
20 27 2
21 34 2
22 36 2 <= split here
23 41 3
24 44 3 <= split here
25 54 4 <= split here
26 61 5
27 60 5
28 61 5 <= split here
29 40 3 <= split here
30 17 1 <= split here
31 12 1
32 15 1
33 24 2 <= split here
The desired grouping should return something like:
class count mean
0 1 12 7.50
1 2 2 26.00
2 1 5 10.40
3 2 4 29.25
4 3 2 42.50
5 4 1 54.00
6 5 3 60.66
7 3 1 40.00
8 1 3 14.66
9 2 1 24.00
Is there any command to do it not iteratively?
Compare the column with its shifted values using Series.ne, take Series.cumsum of the result to label consecutive runs, and aggregate by GroupBy.agg:
g = df["class"].ne(df["class"].shift()).cumsum()
df = (df.groupby(['class', g], sort=False)['speed'].agg(['count', 'mean'])
        .reset_index(level=1, drop=True)
        .reset_index())
print (df)
class count mean
0 1 12 7.500000
1 2 2 26.000000
2 1 5 10.400000
3 2 4 29.250000
4 3 2 42.500000
5 4 1 54.000000
6 5 3 60.666667
7 3 1 40.000000
8 1 3 14.666667
9 2 1 24.000000
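To see what the grouper g contains, here is a quick standalone sketch of the ne/shift/cumsum idiom on a toy Series:

import pandas as pd

s = pd.Series([1, 1, 2, 2, 1, 1, 3])
g = s.ne(s.shift()).cumsum()  # True at every value change, so cumsum labels the runs
print(g.tolist())             # [1, 1, 2, 2, 3, 3, 4]: each consecutive run gets its own id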
You can group by the cumsum of a comparison of the class column with the value above it (diff is non-zero at every change):
df.groupby(df["class"].diff().ne(0).cumsum()).speed.agg(['size', 'mean'])
size mean
class
1 12 7.500000
2 2 26.000000
3 5 10.400000
4 4 29.250000
5 2 42.500000
6 1 54.000000
7 3 60.666667
8 1 40.000000
9 3 14.666667
10 1 24.000000
Update: I hadn't seen how you wanted the class column. You can group by the original class column as well as the cumsum above, and do a bit of index sorting and resetting (but at this point this answer just converges with @jezrael's answer :P)
result = (
    df.groupby(["class", df["class"].diff().ne(0).cumsum()])
    .speed.agg(["size", "mean"])
    .sort_index(level=1)
    .reset_index(level=0)
    .reset_index(drop=True)
)
class size mean
0 1 12 7.500000
1 2 2 26.000000
2 1 5 10.400000
3 2 4 29.250000
4 3 2 42.500000
5 4 1 54.000000
6 5 3 60.666667
7 3 1 40.000000
8 1 3 14.666667
9 2 1 24.000000

Get and Modify column in groups on rows that meet a condition

I have this DataFrame:
df = pd.DataFrame({'day': [1, 1, 1, 2, 2, 2, 3, 3, 3], 'hour': [10, 10, 10, 11, 11, 11, 12, 12, 12], 'sales': [0, 40, 30, 10, 80, 70, 0, 0, 20]})
day hour sales
0 1 10 0
1 1 10 40
2 1 10 30
3 2 11 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 12 20
And I would like to filter to get the first entry of each day that has sales greater than 0. As an additional thing, I would like to change the 'hour' column of these entries to 9.
So to get something like this:
day hour sales
0 1 10 0
1 1 9 40
2 1 10 30
3 2 9 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 9 20
I only came up with this iterative solution. Is there a way to do it in a more functional style?
# Group by day:
groups = df.groupby(by=['day'])
# Get all indices of first non-zero sales entry per day:
indices = []
for name, group in groups:
    group = group[group['sales'] > 0]
    indices.append(group.index.to_list()[0])
# Change their values:
df.iloc[indices, df.columns.get_loc('hour')] = 9
You can build a grouper over df['day'] on the boolean check sales > 0, then take idxmax to get the first non-zero row per day, filter out days that do not have any value greater than 0 using any, and assign with loc[]:
g = df['sales'].gt(0).groupby(df['day'])
idx = g.idxmax()
df.loc[idx[g.any()], 'hour'] = 9
print(df)
day hour sales
0 1 10 0
1 1 9 40
2 1 10 30
3 2 9 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 9 20
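This works because idxmax on a boolean Series returns the index label of the first True in each group, i.e. the first non-zero sale per day. A tiny sketch:

import pandas as pd

s = pd.Series([False, True, True], index=[6, 7, 8])
print(s.idxmax())  # 7: the label of the first True, not a positional index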
Create a mask m that groups by day as well as by whether the sales are non-zero, taking the first sales value in each group.
Then, use this mask together with df['sales'] > 0 to change those specific rows to 9 with np.where().
import numpy as np
import pandas as pd

df = pd.DataFrame({'day': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'hour': [10, 10, 10, 11, 11, 11, 12, 12, 12],
                   'sales': [0, 40, 30, 10, 80, 70, 0, 0, 20]})
m = df.groupby(['day', df['sales'].ne(0)])['sales'].transform('first')
df['hour'] = np.where((df['sales'] == m) & (df['sales'] > 0), 9, df['hour'])
df
Out[37]:
day hour sales
0 1 10 0
1 1 9 40
2 1 10 30
3 2 9 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 9 20
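The subtle piece here is transform('first'), which broadcasts the first sales value of each (day, sales != 0) group back onto every row of that group; a small sketch:

import pandas as pd

df = pd.DataFrame({"day": [1, 1, 1], "sales": [0, 40, 30]})
m = df.groupby(["day", df["sales"].ne(0)])["sales"].transform("first")
print(m.tolist())  # [0, 40, 40]: both non-zero rows of day 1 carry the first non-zero sale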

Pandas search in ascending index and match certain column value

I have a DF with thousands of rows. Column 'col1' repeatedly runs from 1 to 6; column 'target' holds the values:
diction = {'col1': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6], 'target': [34, 65, 23, 65, 12, 87, 36, 51, 26, 74, 34, 87]}
df1 = pd.DataFrame(diction, index = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
col1 target
0 1 34
1 2 65
2 3 23
3 4 65
4 5 12
5 6 87
6 1 36
7 2 51
8 3 26
9 4 74
10 5 34
11 6 87
I'm trying to create a new column (let's call it previous_col) that, for each row, holds the target from the nearest previous row with the same col1 value (for example, col1 value 2 first pairs with target 65, so the next row with col1 == 2 should refer back to that 65):
col1 previous_col target
0 1 0 34
1 2 0 65
2 3 0 23
3 4 0 65
4 5 0 12
5 6 0 87
6 1 34 36
7 2 65 51
8 3 23 26
9 4 65 74
10 5 12 34
11 6 87 87
Note that the first 6 rows have 0 in the previous column because no previous target values exist :D
The tricky part here is that each previous_col value must be taken from the nearest earlier row, in ascending index order, whose col1 matches. So with a DF of 10k rows, you can't just match the same col1 value from the top or from the middle and take its target. I know I can do it with shift, but sometimes col1 doesn't run strictly in order from 1 to 6, so I need to match the col1 value exactly.
df1['Per_col'] = df1.groupby('col1').target.shift(1).fillna(0)
df1
Out[1117]:
col1 target Per_col
0 1 34 0.0
1 2 65 0.0
2 3 23 0.0
3 4 65 0.0
4 5 12 0.0
5 6 87 0.0
6 1 36 34.0
7 2 51 65.0
8 3 26 23.0
9 4 74 65.0
10 5 34 12.0
11 6 87 87.0
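groupby(...).shift(1) already does the ascending-index matching: within each col1 group it takes the previous row's target in index order, however the values are interleaved. A quick sketch of that behavior on a hypothetical toy frame:

import pandas as pd

toy = pd.DataFrame({"k": [1, 2, 1, 2, 1], "v": [10, 20, 30, 40, 50]})
# For each row: the previous v among earlier rows with the same k, else 0.
print(toy.groupby("k")["v"].shift(1).fillna(0).tolist())  # [0.0, 0.0, 10.0, 20.0, 30.0]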