Create new column based on information in two other columns

I have a data frame with 2 columns of information that I want to compare in order to create a condition in a new column.
PPT 1 2
1   A 1
2   A 2
3   A 3
4   B 1
5   B 2
6   B 3
7   C 1
8   C 2
9   C 3
I want the new column to provide a categorisation based on columns 1 and 2 using the following criteria:
if column 1 is A and column 2 is 1, column 3 should be YES
if column 1 is B and column 2 is 2, column 3 should be YES
if column 1 is C and column 2 is 3, column 3 should be YES
in all other cases, column 3 should be NO
PPT 1 2 3
1   A 1 YES
2   A 2 NO
3   A 3 NO
4   B 1 NO
5   B 2 YES
6   B 3 NO
7   C 1 NO
8   C 2 NO
9   C 3 YES

Make a mask with all of your conditions; then you can create the new column set to 'NO' and change all matching rows to 'YES':
# Find rows where the conditions are True
mask = (((df['1'] == 'A') & (df['2'] == 1))
        | ((df['1'] == 'B') & (df['2'] == 2))
        | ((df['1'] == 'C') & (df['2'] == 3)))
# Default the new column to 'NO', then flip the matching rows to 'YES'
df['3'] = 'NO'
df.loc[mask, '3'] = 'YES'
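If the list of letter/value pairs grows, a mapping-based variant may scale better than chaining masks. This is a minimal sketch, assuming the columns are literally named '1' and '2' as in the example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'1': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                   '2': [1, 2, 3, 1, 2, 3, 1, 2, 3]})

# Map each letter to the value that should yield YES, then compare
# the mapped value against column '2' in one vectorised step.
match = {'A': 1, 'B': 2, 'C': 3}
df['3'] = np.where(df['1'].map(match) == df['2'], 'YES', 'NO')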

Related

How to check pair of string values in a column, after grouping the dataframe using ID column?

I have a dataframe containing 2 columns: ID and Code.
ID Code Flag
1 A 0
1 C 1
1 B 1
2 A 0
2 B 1
3 A 0
4 C 0
Within each ID, if Code 'A' exists with 'B' or 'C', then it should flag 1.
I tried groupby('ID') with filter(), but it is not giving the expected result. Could anyone please help?
You can do the following:
First use df.groupby('ID') and concatenate each group's codes with 'sum' to create a helper column. Then assign the value 1 where a row has B or C as its Code and the helper column contains an A:
df['s'] = df.groupby('ID').Code.transform('sum')
df['Flag'] = 0
df.loc[((df.Code == 'B') | (df.Code == 'C')) & df.s.str.contains('A'), 'Flag'] = 1
df = df.drop(columns='s')
Output:
ID Code Flag
0 1 A 0
1 1 C 1
2 1 B 1
3 2 A 0
4 2 B 1
5 3 A 0
6 4 C 0
You can use boolean masks: a direct one for B/C and a per-group one for A; then combine them and convert to integer:
# is the Code a B or C?
m1 = df['Code'].isin(['B', 'C'])
# is there also an A in the same group?
m2 = df['Code'].eq('A').groupby(df['ID']).transform('any')
# if both are True, flag 1
df['Flag'] = (m1 & m2).astype(int)
Output:
ID Code Flag
0 1 A 0
1 1 C 1
2 1 B 1
3 2 A 0
4 2 B 1
5 3 A 0
6 4 C 0
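For reference, here is a self-contained version of the mask approach, with the DataFrame rebuilt from the example above:
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3, 4],
                   'Code': ['A', 'C', 'B', 'A', 'B', 'A', 'C']})

# direct mask: is the Code a B or C?
m1 = df['Code'].isin(['B', 'C'])
# group mask: does the same ID also contain an A?
m2 = df['Code'].eq('A').groupby(df['ID']).transform('any')
df['Flag'] = (m1 & m2).astype(int)
print(df)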

Is there any way to display duplicate column value once in multiple rows in SQL?

I have been researching and Googling forever but cannot find an answer. I have repeated values in one column and would like to show a repeated value only on its first row, leaving it blank on the following rows. Is that even possible in SQL?
What I have:
A B C
A 2 3
A 2 4
B 4 4
B 3 4
C 3 9
What I would like:
A B C
A 2 3
A   4
B 4 4
B 3 4
C   9
Use this:
SELECT A,
       CASE WHEN LAG(B) OVER (ORDER BY A) = B THEN '' ELSE CONVERT(VARCHAR, B) END AS B,
       C
FROM TABLENAME;
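If you ever need the same presentation trick on the pandas side, the LAG comparison translates directly to shift. A minimal sketch, rebuilding the example table and assuming column B is the one to blank:
import pandas as pd

df = pd.DataFrame({'A': ['A', 'A', 'B', 'B', 'C'],
                   'B': [2, 2, 4, 3, 3],
                   'C': [3, 4, 4, 4, 9]})

# Blank out B wherever it repeats the previous row's value,
# mirroring LAG(B) OVER (ORDER BY A) = B.
dup = df['B'].eq(df['B'].shift())
df['B'] = df['B'].astype(str).mask(dup, '')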

How to make one column match duplicates in another column

This problem is out of my ability range and I can’t get anywhere with it beyond knowing I can probably use LEAD, LAG or maybe a cursor?
Here is a breakdown of the table and question:
row_id is always an IDENTITY(1, 1) column.
The set_id column always starts out in groups of 3s (two 0s for the first set_id, don't worry about why).
The letter column is alphabetic. There are varying counts of duplicates.
Here's the original table:
row_id set_id letter
1      0      A
2      0      A
3      1      A
4      1      B
5      1      B
6      2      B
7      2      B
8      2      C
9      3      C
10     3      C
11     3      D
12     4      D
13     4      D
14     4      D
What I need is code that, whenever the letter in the next row is a duplicate, gives that row the same set_id as the previous row, in a new alt_set_id column.
If that doesn't make sense, here is the result I want:
row_id set_id letter alt_set_id
1      0      A      0
2      0      A      0
3      1      A      0
4      1      B      1
5      1      B      1
6      2      B      1
7      2      B      1
8      2      C      2
9      3      C      2
10     3      C      2
11     3      D      3
12     4      D      3
13     4      D      3
14     4      D      3
Here's where I am with the code so far; I'm not really close, but I think I'm on the right path:
SELECT
    *,
    CASE
        WHEN letter = [letter in next row] THEN 'yes'
        ELSE 'no'
    END AS [next row a duplicate?],
    'tbd' AS alt_set_id
FROM
    (SELECT
         *,
         LEAD(letter) OVER (ORDER BY row_id) AS [letter in next row]
     FROM sort_test) AS dt
That query has the below result set, which is something I think I can work with, but it doesn't feel very efficient and I'm not yet getting the result needed in the alt_set_id column:
row_id set_id letter letter in next row next row a duplicate? alt_set_id
1      0      A      A                  yes                   tbd
2      0      A      A                  yes                   tbd
3      1      A      B                  no                    tbd
4      1      B      B                  yes                   tbd
5      1      B      B                  yes                   tbd
6      2      B      B                  yes                   tbd
7      2      B      C                  no                    tbd
8      2      C      C                  yes                   tbd
9      3      C      C                  yes                   tbd
10     3      C      D                  no                    tbd
11     3      D      D                  yes                   tbd
12     4      D      D                  yes                   tbd
13     4      D      D                  yes                   tbd
14     4      D      NULL               no                    tbd
Thanks for any help!
Based on your example data, you want the minimum set_id for each letter. If so, use window functions:
select t.*, min(set_id) over (partition by letter) as alt_set_id
from sort_test t;
If I understand correctly, a simple correlated subquery will give you the desired result:
select *, (select min(set_id) from t t2 where t2.letter = t.letter) as alt_set_id
from t
See working DB Fiddle
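For readers coming from the pandas questions above, the same "minimum set_id per letter" idea is a one-line groupby transform. A sketch, rebuilding the example table:
import pandas as pd

sort_test = pd.DataFrame({
    'row_id': range(1, 15),
    'set_id': [0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'letter': list('AAABBBBCCCDDDD')})

# min(set_id) OVER (PARTITION BY letter) becomes transform('min')
sort_test['alt_set_id'] = sort_test.groupby('letter')['set_id'].transform('min')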

Most efficient way to set a dataframe column by indexing into other columns

I have a large DataFrame. One of my columns contains the names of other columns. I want to evaluate this column and set each row to the value of the referenced column:
| A | B | C | Column |
|---|---|---|--------|
| 1 | 3 | 4 | B |
| 2 | 5 | 3 | A |
| 3 | 5 | 9 | C |
Desired output:
| A | B | C | Column |
|---|---|---|--------|
| 1 | 3 | 4 | 3 |
| 2 | 5 | 3 | 2 |
| 3 | 5 | 9 | 9 |
I am achieving this result using:
df.apply(lambda d: eval("d." + d['Column']), axis=1)
But it is very slow, even using swifter. Is there a more efficient way of performing this?
For better performance, use df.to_numpy():
In [365]: df['Column'] = df.to_numpy()[df.index, df.columns.get_indexer(df.Column)]
In [366]: df
Out[366]:
A B C Column
0 1 3 4 3
1 2 5 3 2
2 3 5 9 9
For Pandas < 1.2.0, use lookup:
df['Column'] = df.lookup(df.index, df['Column'])
From 1.2.0 onwards, lookup is deprecated; you can just use a for loop:
df['Column'] = [df.at[idx, r['Column']] for idx, r in df.iterrows()]
Output:
A B C Column
0 1 3 4 3
1 2 5 3 2
2 3 5 9 9
Since lookup is going to be deprecated, try the NumPy method with get_indexer:
df['new'] = df.values[df.index,df.columns.get_indexer(df.Column)]
df
Out[75]:
A B C Column new
0 1 3 4 B 3
1 2 5 3 A 2
2 3 5 9 C 9
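Both NumPy-based answers perform the same two steps, which may be easier to see separated. A minimal sketch, rebuilding the example frame:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 5, 5],
                   'C': [4, 3, 9], 'Column': ['B', 'A', 'C']})

# Step 1: turn each referenced column name into its integer position.
cols = df.columns.get_indexer(df['Column'])   # e.g. array([1, 0, 2])
# Step 2: pick one cell per row with row/column fancy indexing.
df['Column'] = df.to_numpy()[df.index, cols]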

Updating multiple columns based on multiple conditions

I have the below table with results for both morning and afternoon sessions (for different periods).
I would like to update the results based on a simple condition:
check whether there was a change between two consecutive morning sessions; if not, add 5 to the score.
Example: for ID=1, Mor2=C and Mor3=C, so Score_M3 = 5+5 = 10 (new value). All updated values are marked in the 'Wanted' table.
How can I write this in SQL? I will have a lot of columns and IDs.
My dataset:
ID Mor1 Aft1 Mor2 Aft2 Mor3 Aft3 Score_M1 Score_A1 Score_M2 Score_A2 Score_M3 Score_A3
1 A A C B C B 1 1 1 1 5 6
2 C C C B C B 1 1 1 1 4 5
3 A A A A A A 1 1 1 1 4 1
Wanted :
ID Mor1 Aft1 Mor2 Aft2 Mor3 Aft3 Score_M1 Score_A1 Score_M2 Score_A2 Score_M3 Score_A3
1 A A C B C B 1 1 1 1 *10 6
2 C C C B C B 1 1 *6 1 *9 5
3 A A A A A A 1 1 *6 1 *9 1
Here is some SQL to get you started; you can add more columns as you see fit.
We can restate the condition as SAME, rather than changed:
If Mor1 = Mor2 then add 5 to Score_M2.
If Mor2 = Mor3 then add 5 to Score_M3.
UPDATE [StackOver].[dbo].[UpdateMultiCols]
SET [Score_M2] = Score_M2 +
        CASE WHEN Mor1 = Mor2 THEN 5 ELSE 0 END,
    [Score_M3] = Score_M3 +
        CASE WHEN Mor2 = Mor3 THEN 5 ELSE 0 END
GO
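For comparison with the pandas questions above, the same conditional increment is a vectorised boolean multiply. A sketch covering only the morning columns shown:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'Mor1': ['A', 'C', 'A'], 'Mor2': ['C', 'C', 'A'],
                   'Mor3': ['C', 'C', 'A'],
                   'Score_M2': [1, 1, 1], 'Score_M3': [5, 4, 4]})

# True compares to 1, so the equality mask times 5 adds 5 exactly
# where two consecutive morning sessions are the same.
df['Score_M2'] += df['Mor1'].eq(df['Mor2']) * 5
df['Score_M3'] += df['Mor2'].eq(df['Mor3']) * 5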