I would like to create a new column based on various conditions
Let's say I have a df where column A can equal any of the following: ['Single', 'Multiple', 'Commercial', 'Domestic', 'Other'], column B has numeric values from 0-30.
I'm trying to get column C to be 'Moderate' if A = 'Single' or 'Multiple', and if it equals anything else, to consider the values in column B. If column A != 'Single' or 'Multiple', column C will equal 'Moderate' if 3 < B > 19 and 'High' if B >= 19.
I have tried various loop combinations but I can't seem to get it. Any help?
trial = []
for x in df['A']:
    if x == 'Single' or x == 'Multiple':
        trial.append('Moderate')
    elif x != 'Single' or x != 'Multiple':
        if df['B'] > 19:
            trial.append('Test')
df['trials'] = trial
Thank you kindly,
Denisse
It would be good if you provided some sample data, but with some that I created you can see how to apply a function to each row of your DataFrame.
Data
valuesA = ['Single', 'Multiple', 'Commercial', 'Domestic', 'Other',
'Single', 'Multiple', 'Commercial', 'Domestic', 'Other']
valuesB = [0, 10, 20, 25, 30, 25, 15, 10, 5, 3 ]
df = pd.DataFrame({'A': valuesA, 'B': valuesB})
| | A | B |
|---:|:-----------|----:|
| 0 | Single | 0 |
| 1 | Multiple | 10 |
| 2 | Commercial | 20 |
| 3 | Domestic | 25 |
| 4 | Other | 30 |
| 5 | Single | 25 |
| 6 | Multiple | 15 |
| 7 | Commercial | 10 |
| 8 | Domestic | 5 |
| 9 | Other | 3 |
Function to apply
You don't specify what happens if column B is less than or equal to 3, so I suppose that C will be 'Low'. Adapt the function as you need. Also, maybe there is a typo in your question where you say '3 < B > 19'; I changed it to '3 < B < 19'.
def my_function(x):
    if x['A'] in ['Single', 'Multiple']:
        return 'Moderate'
    else:
        if x['B'] <= 3:
            return 'Low'
        elif 3 < x['B'] < 19:
            return 'Moderate'
        else:
            return 'High'
New column
With the DataFrame and the new function you can apply it to each row with the method apply using the argument 'axis=1':
df['C'] = df.apply(my_function, axis=1)
| | A | B | C |
|---:|:-----------|----:|:---------|
| 0 | Single | 0 | Moderate |
| 1 | Multiple | 10 | Moderate |
| 2 | Commercial | 20 | High |
| 3 | Domestic | 25 | High |
| 4 | Other | 30 | High |
| 5 | Single | 25 | Moderate |
| 6 | Multiple | 15 | Moderate |
| 7 | Commercial | 10 | Moderate |
| 8 | Domestic | 5 | Moderate |
| 9 | Other | 3 | Low |
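If the DataFrame is large, a vectorized alternative may be faster than apply. Below is a minimal sketch of the same rules using numpy.select; the 'Low' bucket for B <= 3 is the same assumption as above.
import numpy as np

conditions = [
    df['A'].isin(['Single', 'Multiple']),  # always 'Moderate'
    df['B'] <= 3,                          # assumed 'Low', as above
    df['B'] < 19,                          # at this point 3 < B < 19
]
choices = ['Moderate', 'Low', 'Moderate']

# numpy.select takes the first matching condition per row; everything else is 'High'
df['C'] = np.select(conditions, choices, default='High')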
Suppose I have a dataframe like below:
+-------------+----------+-----+------+
| key | Time |value|value2|
+-------------+----------+-----+------+
| 1 | 1 | 1 | 1 |
| 1 | 2 | 2 | 2 |
| 1 | 4 | 3 | 3 |
| 2 | 2 | 4 | 4 |
| 2 | 3 | 5 | 5 |
+-------------+----------+-----+------+
For each key I want to select the row with the least Time. For this case, this is the dataframe I want.
+----------+----------+-----+------+
| Time | key |value|value2|
+----------+----------+-----+------+
| 1 | 1 | 1 | 1 |
| 2 | 2 | 4 | 4 |
+----------+----------+-----+------+
I tried using groupBy and agg, but these operations drop the value columns. Is there a way to keep the value and value2 columns?
You can use struct to create a tuple containing the time column and everything you want to keep (the s column in the code below). Then, you can use s.* to unfold the struct and retrieve your columns.
val result = df
.withColumn("s", struct('Time, 'value, 'value2))
.groupBy("key")
.agg(min('s) as "s")
.select("key", "s.*")
// to order the columns the way you want:
val new_result = result.select("Time", "key", "value", "value2")
// or
val new_result = result.select("Time", df.columns.filter(_ != "time") : _*)
I have 2 types of score [M, B] in column 3 (S_TYPE). If a type is M, then the score in column 6 (SCORE) is either S (scored) or SB (bonus scored). Every interval [FROM_HRS - TO_HRS] for a type B must have a corresponding SB interval for type M; thus, within an interval for a type B, the type M score cannot be S. I have several records that were unfortunately captured as seen in the table below.
CREATE TABLE SCORE_TBL
(
ID int IDENTITY(1,1) PRIMARY KEY,
PERSONID_FK int NOT NULL,
S_TYPE varchar(50) NULL,
FROM_HRS int NULL,
TO_HRS int NULL,
SCORE varchar(50) NULL
);
INSERT INTO SCORE_TBL(PERSONID_FK,S_TYPE,FROM_HRS,TO_HRS,SCORE)
VALUES
(1, 'M',  0, 20, 'S'),
(1, 'B',  6,  8, 'B'),
(2, 'B',  0,  2, 'B'),
(2, 'M',  0, 20, 'S'),
(2, 'B', 10, 13, 'B'),
(2, 'B', 18, 20, 'B'),
(2, 'M', 13, 18, 'S');
| ID | PERSONID_FK |S_TYPE| FROM_HRS | TO_HRS | SCORE |
|----|-------------|------|----------|--------|-------|
| 1 | 1 | M | 0 | 20 | S |
| 2 | 1 | B | 6 | 8 | B |
| 3 | 2 | B | 0 | 2 | B |
| 4 | 2 | M | 0 | 20 | S |
| 5 | 2 | B | 10 | 13 | B |
| 6 | 2 | B | 18 | 20 | B |
| 7 | 2 | M | 13 | 18 | S |
I want the data to look like this
| ID | PERSONID_FK |S_TYPE| FROM_HRS | TO_HRS | SCORE |
|----|-------------|------|----------|--------|-------|
| 1 | 1 | M | 0 | 6 | S |
| 2 | 1 | M | 6 | 8 | SB |
| 3 | 1 | B | 6 | 8 | B |
| 4 | 1 | M | 8 | 20 | S |
| 5 | 2 | B | 0 | 2 | B |
| 6 | 2 | M | 0 | 2 | SB |
| 7 | 2 | M | 2 | 10 | S |
| 8 | 2 | B | 10 | 13 | B |
| 9 | 2 | M | 10 | 13 | SB |
| 10 | 2 | M | 13 | 18 | S |
| 11 | 2 | B | 18 | 20 | B |
| 12 | 2 | S | 18 | 20 | SB |
Any ideas on how to generate this data in a SQL Server SELECT statement? Visually, this is what I am trying to get.
The tricky part here is that an interval might need to be split into several pieces, like 0..20 for person 2.
Window functions to the rescue. This query illustrates what you need to do:
WITH
deltas AS (
SELECT personid_fk, hrs, sum(delta_s) as delta_s, sum(delta_b) as delta_b
FROM (SELECT personid_fk, from_hrs as hrs,
case when score = 'S' then 1 else 0 end as delta_s,
case when score = 'B' then 1 else 0 end as delta_b
FROM score_tbl
UNION ALL
SELECT personid_fk, to_hrs as hrs,
case when score = 'S' then -1 else 0 end as delta_s,
case when score = 'B' then -1 else 0 end as delta_b
FROM score_tbl) _
GROUP BY personid_fk, hrs
),
running AS (
SELECT personid_fk, hrs as from_hrs,
lead(hrs) over (partition by personid_fk order by hrs) as to_hrs,
sum(delta_s) over (partition by personid_fk order by hrs) running_s,
sum(delta_b) over (partition by personid_fk order by hrs) running_b
FROM deltas
)
SELECT personid_fk, 'M' as s_type, from_hrs, to_hrs,
case when running_b > 0 then 'SB' else 'S' end as score
FROM running
WHERE running_s > 0
UNION ALL
SELECT personid_fk, s_type, from_hrs, to_hrs, score
FROM score_tbl
WHERE s_type = 'B'
ORDER BY personid_fk, from_hrs;
Step by step:
- deltas is a union of two passes over score_tbl, one for the start and one for the end of each score/bonus interval, creating a timeline of +1/-1 events
- running calculates the running total of the deltas over time, yielding the split intervals in which a score and/or bonus is active
- the final query converts the score codes and unions in the bonus intervals (which are passed through unchanged)

For example, for person 1 (M scored S from 0 to 20, B from 6 to 8) the deltas are +S at 0, +B at 6, -B at 8 and -S at 20; the running totals then produce the split intervals 0-6 (S), 6-8 (SB) and 8-20 (S), which together with the unchanged B row 6-8 matches the desired output above.
SQL Fiddle here.
I am trying to calculate an additional column in a results dataframe based on a filter operation from another dataframe that does not match in size.
So, I have my source dataframe source_df:
| id | date       |
|----|------------|
| 1  | 2100-01-01 |
| 2  | 2021-12-12 |
| 3  | 2018-09-01 |
| 4  | 2100-01-01 |
and the target dataframe target_df. The dataframe lengths and the sets of ids do not necessarily match:
| id  |
|-----|
| 1   |
| 2   |
| 3   |
| 4   |
| 5   |
| ... |
| 100 |
I actually want to find out which dates lie more than 30 days in the past.
To do so, I created a query
query = (pd.datetime.today() - pd.to_datetime(source_df["date"], errors="coerce")).dt.days > 30
ids = source_df[query]["id"]
--> ids = [2,3]
My intention is to calculate a column "date_in_past" that contains the values 0 and 1. If the date difference is greater than 30 days, a 1 is inserted, 0 otherwise.
The target_df should look like:
| id | date_in_past |
|----|--------------|
| 1  | 0            |
| 2  | 1            |
| 3  | 1            |
| 4  | 0            |
| 5  | 0            |
Neither the indices nor the dataframe lengths match.
I tried to create a lambda function
map_query = lambda x: 1 if x in ids.values else 0
When I try to pass the target frame with map_query(target_df["id"]), a ValueError is thrown: "lengths must match to compare".
How can I assign the new column "date_in_past" having the calculated values based on the source dataframe?
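One way to sidestep the length mismatch might be to test membership instead of comparing the frames directly. A minimal sketch using Series.isin (with ids as computed above):
target_df["date_in_past"] = target_df["id"].isin(ids).astype(int)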
I have a pandas data frame as follows
+---------+------------+------------+-------+--------+
| Product | Date | Adj_Date | Price | Factor |
+---------+------------+------------+-------+--------+
| A | 01-06-2020 | 01-07-2020 | 100 | 10 |
| A | 01-06-2020 | 01-08-2020 | 200 | 20 |
| B | 15-07-2020 | 01-07-2020 | 400 | 10 |
| B | 15-07-2020 | 01-08-2020 | 800 | 10 |
| C | 01-09-2020 | 01-07-2020 | 1000 | 10 |
| C | 01-09-2020 | 01-08-2020 | 1200 | 10 |
| D | 01-10-2020 | 01-11-2020 | 1400 | 10 |
| E | 01-10-2020 | 01-09-2020 | 1600 | 10 |
+---------+------------+------------+-------+--------+
Code to generate the data frame above:
data = {'Product': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'E'],
        'Date': ['01-06-2020', '01-06-2020', '15-07-2020', '15-07-2020', '01-09-2020', '01-09-2020', '01-10-2020', '01-10-2020'],
        'Adj_Date': ['01-07-2020', '01-08-2020', '01-07-2020', '01-08-2020', '01-07-2020', '01-08-2020', '01-11-2020', '01-09-2020'],
        'Price': [100, 200, 400, 800, 1000, 1200, 1400, 1600],
        'Factor': [10, 20, 10, 10, 10, 10, 10, 10]}
df = pd.DataFrame(data)
Desired Output:
+---------+------------+------------+-------+--------+--------------+
| Product | Date | Adj_Date | Price | Factor | Actual Price |
+---------+------------+------------+-------+--------+--------------+
| A | 01-06-2020 | 01-07-2020 | 100 | 10 | 10 |
| B | 15-07-2020 | 01-08-2020 | 800 | 10 | 80 |
| C | 01-09-2020 | 01-07-2020 | 1000 | 10 | 1000 |
| D | 01-10-2020 | 01-11-2020 | 1400 | 10 | 140 |
| E | 01-10-2020 | 01-09-2020 | 1600 | 10 | 1600 |
+---------+------------+------------+-------+--------+--------------+
The above result is based on comparing the two columns Date and Adj_Date. If there are 2 rows for a product, we choose the row in which Date is less than Adj_Date and the difference to Adj_Date is smallest. As we can see, for product A we have Date = 01-06-2020; this date is less than Adj_Date in both rows, but we choose Adj_Date = 01-07-2020 because the difference to this date is smallest. Using the same logic we opt for row 2 in the case of product B.
If Date is greater than Adj_Date in all rows, then we keep the first row.
The next part is to create the column Actual Price. Once we have a single row for each product, we divide Price by Factor to get Actual Price, but only if Date is less than Adj_Date. Otherwise Actual Price is equal to Price.
How can this result be achieved?
First, convert Date and Adj_Date to Timestamp. This will make your life a lot easier:
for col in ['Date', 'Adj_Date']:
    df[col] = pd.to_datetime(df[col], dayfirst=True)
Then:
# Pick one row for each product
def pick_one(group):
    if len(group) == 1:
        return group
    diff = (group['Date'] - group['Adj_Date']).dt.days
    if (diff < 0).any():
        # among rows where Date < Adj_Date, take the one closest to Adj_Date
        cond = group.index == diff[diff < 0].idxmax()
    else:
        cond = group.index == group.index[0]
    return group.loc[cond]
result = df.groupby('Product', as_index=False).apply(pick_one).droplevel(0)
# Calculate the Actual Price (uses numpy, imported as np)
import numpy as np

result['Actual Price'] = np.where(result['Date'] < result['Adj_Date'], result['Price'] / result['Factor'], result['Price'])
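For larger frames, a hypothetical alternative that avoids groupby.apply is sketched below (it assumes Date and Adj_Date have already been converted to Timestamps as above, and that pandas and numpy are imported as pd and np):
# keep the rows where Date < Adj_Date and, per product, the smallest gap
gaps = df.assign(gap=(df['Adj_Date'] - df['Date']).dt.days)
best = gaps[gaps['gap'] > 0].sort_values('gap').drop_duplicates('Product')

# products with no such row fall back to their first row
rest = gaps.drop_duplicates('Product')
rest = rest[~rest['Product'].isin(best['Product'])]

picked = pd.concat([best, rest]).sort_index().drop(columns='gap')
picked['Actual Price'] = np.where(picked['Date'] < picked['Adj_Date'],
                                  picked['Price'] / picked['Factor'],
                                  picked['Price'])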
You can use duplicated and groupby.idxmin to filter your dataframe, then apply a boolean condition to get your ActualPrice column.
import numpy as np
# ensure you have valid datetime objects.
#df[['Date','Adj_Date']] = df[['Date','Adj_Date']].apply(pd.to_datetime)
df1 = df.loc[
df.assign(
delta=np.where(
df.duplicated(subset=["Product"], keep=False),
(df["Date"] - df["Adj_Date"]).abs(),
0,
)
)
.groupby("Product")["delta"]
.idxmin()
]
df1['ActualPrice'] = np.where(
df1['Date'] <= df1['Adj_Date'],
df1['Price'].div(df1['Factor']),
df1['Price']
)
print(df1)
Product Date Adj_Date Price Factor ActualPrice
0 A 2020-01-06 2020-01-07 100 10 10.0
3 B 2020-07-15 2020-01-08 800 10 800.0
5 C 2020-01-09 2020-01-08 1200 10 1200.0
6 D 2020-01-10 2020-01-11 1400 10 140.0
7 E 2020-01-10 2020-01-09 1600 10 1600.0
I would like to cumulatively count unique values from a column in a pandas frame by week. For example, imagine that I have data like this:
df = pd.DataFrame({'user_id':[1,1,1,2,2,2],'week':[1,1,2,1,2,2],'module_id':['A','B','A','A','B','C']})
+---+---------+------+-----------+
| | user_id | week | module_id |
+---+---------+------+-----------+
| 0 | 1 | 1 | A |
| 1 | 1 | 1 | B |
| 2 | 1 | 2 | A |
| 3 | 2 | 1 | A |
| 4 | 2 | 2 | B |
| 5 | 2 | 2 | C |
+---+---------+------+-----------+
What I want is a running count of the number of unique module_ids up to each week, i.e. something like this:
+---+---------+------+-------------------------+
| | user_id | week | cumulative_module_count |
+---+---------+------+-------------------------+
| 0 | 1 | 1 | 2 |
| 1 | 1 | 2 | 2 |
| 2 | 2 | 1 | 1 |
| 3 | 2 | 2 | 3 |
+---+---------+------+-------------------------+
It is straightforward to do this as a loop, for example this works:
running_tally = {}
result = {}
for index, row in df.iterrows():
    if row['user_id'] not in running_tally:
        running_tally[row['user_id']] = set()
        result[row['user_id']] = {}
    running_tally[row['user_id']].add(row['module_id'])
    result[row['user_id']][row['week']] = len(running_tally[row['user_id']])
print(result)
{1: {1: 2, 2: 2}, 2: {1: 1, 2: 3}}
But my real data frame is enormous and so I would like a vectorised algorithm instead of a loop.
There's a similar-sounding question here, but looking at the accepted answer (here), the original poster does not want cumulative uniqueness across dates, as I do.
How would I do this vectorised in pandas?
The idea is to create lists per group from both columns, then use np.cumsum to build cumulative lists, and finally convert the values to sets and take their length:
import numpy as np

df1 = (df.groupby(['user_id','week'])['module_id']
.apply(list)
.groupby(level=0)
.apply(np.cumsum)
.apply(lambda x: len(set(x)))
.reset_index(name='cumulative_module_count'))
print (df1)
user_id week cumulative_module_count
0 1 1 2
1 1 2 2
2 2 1 1
3 2 2 3
Jezrael's answer can be slightly improved by using pipe instead of apply(list), which should be faster, and then using np.unique instead of the trick with np.cumsum:
df1 = (df.groupby(['user_id', 'week']).pipe(lambda x: x.apply(np.unique))
.groupby('user_id')
.apply(np.cumsum)
.apply(np.sum)
.apply(lambda x: len(set(x)))
.rename('cumulated_module_count')
.reset_index(drop=False))
print(df1)
user_id week cumulated_module_count
0 1 1 2
1 1 2 2
2 2 1 1
3 2 2 3
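For completeness, here is a shorter vectorized sketch of the same idea; it assumes the frame is ordered by week within each user, as in the example data. Marking the first occurrence of each (user_id, module_id) pair, cumulatively summing those flags per user and taking the maximum per week gives the same result:
# True only the first time a user sees a given module
first_seen = (~df.duplicated(['user_id', 'module_id'])).astype(int)

# running count of distinct modules per user, then the largest value in each week
running = first_seen.groupby(df['user_id']).cumsum()
out = (df.assign(cumulative_module_count=running)
         .groupby(['user_id', 'week'], as_index=False)['cumulative_module_count']
         .max())
print(out)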