Shift row left by leading NaN's without removing all NaN's
How can I remove leading NaN's in pandas when reading in a csv file?
Example code:
df = pd.DataFrame({
'c1': [ 20, 30, np.nan, np.nan, np.nan, 17, np.nan],
'c2': [np.nan, 74, 65, np.nan, np.nan, 74, 82],
'c3': [ 250, 290, 340, 325, 345, 315, 248],
'c4': [ 250, np.nan, 340, 325, 345, 315, 248],
'c5': [np.nan, np.nan, 340, np.nan, 345, np.nan, 248],
'c6': [np.nan, np.nan, np.nan, 325, 345, np.nan, np.nan]})
The code displays this
| | c1 | c2 | c3 | c4 | c5 | c6 |
|:-|-----:|-----:|----:|------:|------:|------:|
|0 | 20.0 | NaN | 250 | 250.0 | NaN | NaN |
|1 | 30.0 | 74.0 | 290 | NaN | NaN | NaN |
|2 | NaN | 65.0 | 340 | 340.0 | 340.0 | NaN |
|3 | NaN | NaN | 325 | 325.0 | NaN | 325.0 |
|4 | NaN | NaN | 345 | 345.0 | 345.0 | 345.0 |
|5 | 17.0 | 74.0 | 315 | 315.0 | NaN | NaN |
|6 | NaN | 82.0 | 248 | 248.0 | 248.0 | NaN |
I'd like to remove only the leading NaN's, so the result would look like this:
| | c1 | c2 | c3 | c4 | c5 | c6 |
|:-|-----:|-----:|----:|------:|------:|------:|
|0 | 20 | NaN | 250.0 | 250.0 | NaN | NaN |
|1 | 30 | 74.0 | 290.0 | NaN | NaN | NaN |
|2 | 65 | 340.0 | 340.0 | 340.0 | NaN | NaN |
|3 | 325 | 325.0 | NaN | 325.0 | NaN | NaN |
|4 | 345 | 345.0 | 345.0 | 345.0 | NaN | NaN |
|5 | 17 | 74.0 | 315.0 | 315.0 | NaN | NaN |
|6 | 82 | 248.0 | 248.0 | 248.0 | NaN | NaN |
I have tried the following, but that didn't work:
response = pd.read_csv (r'MonthlyPermitReport.csv')
df = pd.DataFrame(response)
df.loc[df.first_valid_index():]
Help please.
You can try this:
s = df.isna().cumprod(axis=1).sum(axis=1)
df.apply(lambda x: x.shift(-s[x.name]), axis=1)
Output:
c1 c2 c3 c4 c5 c6
0 20.0 NaN 250.0 250.0 NaN NaN
1 30.0 74.0 290.0 NaN NaN NaN
2 65.0 340.0 340.0 340.0 NaN NaN
3 325.0 325.0 NaN 325.0 NaN NaN
4 345.0 345.0 345.0 345.0 NaN NaN
5 17.0 74.0 315.0 315.0 NaN NaN
6 82.0 248.0 248.0 248.0 NaN NaN
Details:
s is a Series that counts the number of leading NaNs in each row. isna finds all the NaN values in the dataframe; then, using cumprod along the row axis, we zero out everything after the first non-NaN value by multiplying by zero. Finally, sum along the rows gives the number of places to shift each row.
When using DataFrame.apply with axis=1 (row-wise), the name of the pd.Series passed to the function is the row index of the dataframe, so we can look up the number of periods to shift from s defined above.
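For the sample frame above, s is just the per-row count of leading NaNs, which you can verify directly:
print(s)
0    0
1    0
2    1
3    2
4    2
5    0
6    1
dtype: int64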
Let us try apply to create the lists, then recreate the dataframe:
out = pd.DataFrame(df.apply(lambda x : [x[x.notna().cumsum()>0].tolist()],1).str[0].tolist(),
index=df.index,
columns=df.columns)
Out[102]:
c1 c2 c3 c4 c5 c6
0 20.0 NaN 250.0 250.0 NaN NaN
1 30.0 74.0 290.0 NaN NaN NaN
2 65.0 340.0 340.0 340.0 NaN NaN
3 325.0 325.0 NaN 325.0 NaN NaN
4 345.0 345.0 345.0 345.0 NaN NaN
5 17.0 74.0 315.0 315.0 NaN NaN
6 82.0 248.0 248.0 248.0 NaN NaN
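If apply over every row is too slow, the same left shift can be done without apply, purely in numpy. This is only a sketch and assumes all columns are numeric, so the frame can be viewed as one float array:
import numpy as np
import pandas as pd

vals = df.to_numpy(dtype=float)
n_rows, n_cols = vals.shape
# Count of leading NaNs per row (same cumprod trick as above).
lead = np.isnan(vals).cumprod(axis=1).sum(axis=1).astype(int)
# Column positions after shifting row i left by lead[i]; out-of-range slots become NaN.
idx = np.arange(n_cols) + lead[:, None]
shifted = np.where(idx < n_cols,
                   vals[np.arange(n_rows)[:, None], np.minimum(idx, n_cols - 1)],
                   np.nan)
out = pd.DataFrame(shifted, index=df.index, columns=df.columns)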
I have two tables below that I'm trying to join on id and on the closest available weekly_dt date relative to the ingest_date column.
In standard ANSI SQL I usually use a correlated subquery and limit it to one result per row so there is no aggregate error; however, doing this in standard Spark SQL gives me the following error:
AnalysisException: Correlated scalar subqueries must be aggregated: GlobalLimit 1
Setup
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
np.random.seed(25)
A1 = [('A1', i.date(), np.random.randint(0,50)) for i in pd.date_range('01 Jan 2021', '21 Jan 2021',freq='D')]
A2 = [('A2', i.date(), np.random.randint(0,50)) for i in pd.date_range('01 Jan 2021', '21 Jan 2021',freq='D')]
df_a = spark.createDataFrame(A1 + A2, ['id','ingest_date','amt'])
weekly_scores = [
('A1', pd.Timestamp('01 Jan 2021').date(), '0.5'),
('A1', pd.Timestamp('08 Jan 2021').date(), '0.3'),
('A1', pd.Timestamp('15 Jan 2021').date(), '0.8'),
('A1', pd.Timestamp('22 Jan 2021').date(), '0.6'),
('A2', pd.Timestamp('01 Jan 2021').date(), '0.6'),
('A2', pd.Timestamp('08 Jan 2021').date(), '0.1'),
('A2', pd.Timestamp('15 Jan 2021').date(), '0.9'),
('A2', pd.Timestamp('22 Jan 2021').date(), '0.3'),
]
df_b = spark.createDataFrame(weekly_scores, ['id','weekly_dt','score'])
Tables
df_a.show()
+---+-----------+---+
| id|ingest_date|amt|
+---+-----------+---+
| A1| 2021-01-01| 26|
| A1| 2021-01-02| 1|
| A1| 2021-01-03| 0|
| A1| 2021-01-04| 31|
| A1| 2021-01-05| 41|
| A1| 2021-01-06| 46|
| A1| 2021-01-07| 11|
| A1| 2021-01-08| 0|
| A1| 2021-01-09| 14|
| A1| 2021-01-10| 5|
| A1| 2021-01-11| 0|
| A1| 2021-01-12| 35|
| A1| 2021-01-13| 5|
| A1| 2021-01-14| 43|
| A1| 2021-01-15| 18|
| A1| 2021-01-16| 31|
| A1| 2021-01-17| 44|
| A1| 2021-01-18| 25|
| A1| 2021-01-19| 47|
| A1| 2021-01-20| 36|
+---+-----------+---+
df_b.show()
+---+----------+-----+
| id| weekly_dt|score|
+---+----------+-----+
| A1|2021-01-01| 0.5|
| A1|2021-01-08| 0.3|
| A1|2021-01-15| 0.8|
| A1|2021-01-22| 0.6|
| A2|2021-01-01| 0.6|
| A2|2021-01-08| 0.1|
| A2|2021-01-15| 0.9|
| A2|2021-01-22| 0.3|
+---+----------+-----+
Expected Output.
id ingest_date amt weekly_dt score
0 A1 2021-01-01 26 2021-01-01 0.5
4 A1 2021-01-02 1 2021-01-01 0.5
8 A1 2021-01-03 0 2021-01-01 0.5
12 A1 2021-01-04 31 2021-01-01 0.5
17 A1 2021-01-05 41 2021-01-08 0.3
21 A1 2021-01-06 46 2021-01-08 0.3
25 A1 2021-01-07 11 2021-01-08 0.3
29 A1 2021-01-08 0 2021-01-08 0.3
33 A1 2021-01-09 14 2021-01-08 0.3
37 A1 2021-01-10 5 2021-01-08 0.3
41 A1 2021-01-11 0 2021-01-08 0.3
46 A1 2021-01-12 35 2021-01-15 0.8
50 A1 2021-01-13 5 2021-01-15 0.8
54 A1 2021-01-14 43 2021-01-15 0.8
58 A1 2021-01-15 18 2021-01-15 0.8
62 A1 2021-01-16 31 2021-01-15 0.8
66 A1 2021-01-17 44 2021-01-15 0.8
70 A1 2021-01-18 25 2021-01-15 0.8
75 A1 2021-01-19 47 2021-01-22 0.6
79 A1 2021-01-20 36 2021-01-22 0.6
83 A1 2021-01-21 43 2021-01-22 0.6
84 A2 2021-01-01 32 2021-01-01 0.6
88 A2 2021-01-02 37 2021-01-01 0.6
92 A2 2021-01-03 11 2021-01-01 0.6
96 A2 2021-01-04 21 2021-01-01 0.6
101 A2 2021-01-05 29 2021-01-08 0.1
105 A2 2021-01-06 48 2021-01-08 0.1
109 A2 2021-01-07 12 2021-01-08 0.1
113 A2 2021-01-08 40 2021-01-08 0.1
117 A2 2021-01-09 30 2021-01-08 0.1
121 A2 2021-01-10 28 2021-01-08 0.1
125 A2 2021-01-11 41 2021-01-08 0.1
130 A2 2021-01-12 12 2021-01-15 0.9
134 A2 2021-01-13 10 2021-01-15 0.9
138 A2 2021-01-14 10 2021-01-15 0.9
142 A2 2021-01-15 31 2021-01-15 0.9
146 A2 2021-01-16 13 2021-01-15 0.9
150 A2 2021-01-17 31 2021-01-15 0.9
154 A2 2021-01-18 11 2021-01-15 0.9
159 A2 2021-01-19 15 2021-01-22 0.3
163 A2 2021-01-20 18 2021-01-22 0.3
167 A2 2021-01-21 49 2021-01-22 0.3
Spark Query
SELECT
a.id,
a.ingest_date,
a.amt,
b.weekly_dt,
b.score
FROM a
LEFT JOIN b
ON a.id = b.id
AND a.ingest_date =
(
SELECT weekly_dt
FROM b
WHERE id = a.id
ORDER BY DATEDIFF(a.ingest_date, weekly_dt) ASC
LIMIT 1
)
Edit:
I know I can create a window and use a dense_rank() to order the results but I wonder if this is the best method?
from pyspark.sql import functions as F
from pyspark.sql import Window
s = spark.sql("""
SELECT
a.id,
a.ingest_date,
a.amt,
b.weekly_dt,
b.score
FROM a
LEFT JOIN b
ON b.id = a.id
""").withColumn('delta',
F.abs(F.datediff(F.col('ingest_date'), F.col('weekly_dt')
)
)
)
s.withColumn('t',
F.dense_rank().over(
Window.partitionBy('id','ingest_date').orderBy(F.asc('delta')))
).filter('t == 1').drop('t','delta').show()
id ingest_date amt weekly_dt score
0 A2 2021-01-01 32 2021-01-01 0.6
1 A2 2021-01-02 37 2021-01-01 0.6
2 A2 2021-01-03 11 2021-01-01 0.6
3 A2 2021-01-04 21 2021-01-01 0.6
4 A2 2021-01-05 29 2021-01-08 0.1
5 A2 2021-01-06 48 2021-01-08 0.1
6 A2 2021-01-07 12 2021-01-08 0.1
7 A2 2021-01-08 40 2021-01-08 0.1
8 A2 2021-01-09 30 2021-01-08 0.1
9 A2 2021-01-10 28 2021-01-08 0.1
10 A2 2021-01-11 41 2021-01-08 0.1
11 A2 2021-01-12 12 2021-01-15 0.9
12 A2 2021-01-13 10 2021-01-15 0.9
13 A2 2021-01-14 10 2021-01-15 0.9
14 A2 2021-01-15 31 2021-01-15 0.9
I would replace the subquery with LIMIT by a window function:
from pyspark.sql import functions as F, Window as W

df = df_a.join(df_b, on="id")
df = (
df.withColumn(
"rnk",
F.row_number().over(
W.partitionBy("id", "ingest_date").orderBy(
F.abs(F.datediff("ingest_date", "weekly_dt"))
)
),
)
.where("rnk=1")
.drop("rnk")
)
df.show()
+---+-----------+---+----------+-----+
| id|ingest_date|amt| weekly_dt|score|
+---+-----------+---+----------+-----+
| A2| 2021-01-01| 31|2021-01-01| 0.6|
| A2| 2021-01-02| 48|2021-01-01| 0.6|
| A2| 2021-01-03| 47|2021-01-01| 0.6|
| A2| 2021-01-04| 9|2021-01-01| 0.6|
| A2| 2021-01-05| 16|2021-01-08| 0.1|
| A2| 2021-01-06| 44|2021-01-08| 0.1|
| A2| 2021-01-07| 45|2021-01-08| 0.1|
| A2| 2021-01-08| 21|2021-01-08| 0.1|
| A2| 2021-01-09| 36|2021-01-08| 0.1|
| A2| 2021-01-10| 9|2021-01-08| 0.1|
| A2| 2021-01-11| 32|2021-01-08| 0.1|
| A2| 2021-01-12| 10|2021-01-15| 0.9|
| A2| 2021-01-13| 47|2021-01-15| 0.9|
| A2| 2021-01-14| 42|2021-01-15| 0.9|
| A2| 2021-01-15| 1|2021-01-15| 0.9|
| A2| 2021-01-16| 22|2021-01-15| 0.9|
| A2| 2021-01-17| 27|2021-01-15| 0.9|
| A2| 2021-01-18| 49|2021-01-15| 0.9|
| A2| 2021-01-19| 18|2021-01-22| 0.3|
| A2| 2021-01-20| 28|2021-01-22| 0.3|
+---+-----------+---+----------+-----+
only showing top 20 rows
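If you would rather keep everything in Spark SQL, the same window trick replaces the correlated subquery directly. A sketch (it assumes df_a and df_b are registered as the temp views a and b; closest is just an illustrative name):
df_a.createOrReplaceTempView('a')
df_b.createOrReplaceTempView('b')

closest = spark.sql("""
    SELECT id, ingest_date, amt, weekly_dt, score
    FROM (
        SELECT a.id, a.ingest_date, a.amt, b.weekly_dt, b.score,
               ROW_NUMBER() OVER (
                   PARTITION BY a.id, a.ingest_date
                   ORDER BY ABS(DATEDIFF(a.ingest_date, b.weekly_dt))
               ) AS rnk
        FROM a
        LEFT JOIN b
          ON b.id = a.id
    ) t
    WHERE rnk = 1
""")
closest.show()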
I have a data frame of daily prices with missing values. I have then written a function which gives week-commencing Monday dates.
Day WC Monday Price 1 Price 2
1/1/12 1/1/12 44 34
2/1/13 1/1/12 55 34
3/1/12 1/1/12 44 34
4/1/13 1/1/12 NA NA
5/1/12 1/1/12 NA NA
6/1/13 1/1/12 34 NA
7/1/12 1/1/12 33 NA
8/1/13 8/1/12 12 NA
9/1/12 8/1/12 34 NA
10/1/13 8/1/12 23 NA
I want to say: if the price only has NAs left until the end of the column, then fill with the last value, but only to the end of the incomplete week.
So the expected output is:
Day WC Monday Price 1 Price 2
1/1/12 1/1/12 44 34
2/1/13 1/1/12 55 34
3/1/12 1/1/12 44 34
4/1/13 1/1/12 NA 34
5/1/12 1/1/12 NA 34
6/1/13 1/1/12 34 34
7/1/12 1/1/12 33 34
8/1/13 8/1/12 12 NA
9/1/12 8/1/12 34 NA
10/1/13 8/1/12 23 NA
The idea is to test whether the last row per group has missing values, using GroupBy.transform with 'last' on the boolean mask, and then replace the missing values with DataFrame.mask and GroupBy.ffill:
c = ['Price 1','Price 2']
m = df.isna().groupby(df['WC Monday'])[c].transform('last')
df[c] = df[c].mask(m, df.groupby('WC Monday')[c].ffill())
print (df)
Day WC Monday Price 1 Price 2
0 1/1/12 1/1/12 44.0 34.0
1 2/1/13 1/1/12 55.0 34.0
2 3/1/12 1/1/12 44.0 34.0
3 4/1/13 1/1/12 NaN 34.0
4 5/1/12 1/1/12 NaN 34.0
5 6/1/13 1/1/12 34.0 34.0
6 7/1/12 1/1/12 33.0 34.0
7 8/1/13 8/1/12 12.0 NaN
8 9/1/12 8/1/12 34.0 NaN
9 10/1/13 8/1/12 23.0 NaN
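If you want to inspect the intermediate mask m, here is a minimal reproduction of the sample data (values copied from the table above, dates kept as plain strings):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Day': ['1/1/12', '2/1/13', '3/1/12', '4/1/13', '5/1/12',
            '6/1/13', '7/1/12', '8/1/13', '9/1/12', '10/1/13'],
    'WC Monday': ['1/1/12'] * 7 + ['8/1/12'] * 3,
    'Price 1': [44, 55, 44, np.nan, np.nan, 34, 33, 12, 34, 23],
    'Price 2': [34, 34, 34, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
})

c = ['Price 1', 'Price 2']
# True for every row of a week whose last row is NaN in that column.
m = df.isna().groupby(df['WC Monday'])[c].transform('last')
print(m)
Here m is False everywhere for Price 1 (both weeks end with a value) and True everywhere for Price 2 (both weeks end with NaN), so only Price 2 gets the group-wise forward fill.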
I would like to group by the CurrentDate and Car fields and apply the following functions:
np.mean to the list of ['Attr1', ..., 'Attr5'] columns;
a random choice (np.random.choice) for the AttrFactory column.
An example of df is shown here:
Index Car Attr1 Attr2 Attr3 Attr4 Attr5 AttrFactory CurrentDate
0 Nissan 0.0 1.7 3.7 0.0 6.8 F1 01/07/18
1 Nissan 0.0 1.7 3.7 0.0 6.8 F2 01/07/18
2 Nissan 0.0 1.7 3.7 0.0 6.8 F3 03/08/18
3 Porsche 10.0 0.0 2.8 3.5 6.5 F2 05/08/18
4 Porsche 10.0 2.0 0.8 3.5 6.5 F1 05/08/18
5 Golf 0.0 1.7 3.0 2.0 6.3 F4 07/09/18
6 Tiguan 1.0 0.0 3.0 5.2 5.8 F5 10/09/18
7 Porsche 0.0 0.0 3.0 4.2 7.8 F4 12/09/18
8 Tiguan 0.0 0.0 0.0 7.2 9.0 F3 13/09/18
9 Golf 0.0 3.0 0.0 0.0 4.8 F5 25/09/18
10 Golf 0.0 3.0 0.0 0.0 4.8 F1 25/09/18
11 Golf 0.0 3.0 0.0 0.0 4.8 F3 25/09/18
I tried to do it with the following code:
metric_cols = df.filter(regex='^Attr',axis=1).columns #it's list of all Attr columns;
addt_col = list(df.filter(regex='^Attr',axis=1).columns).remove('AttrFactory')
df_gr = df.groupby(['CurrentDate', 'Car'], as_index=False)[metric_cols].agg({addt_col: np.mean, 'AttrFactory': lambda x: x.iloc[np.random.choice(range(0,len(x)))]})
As a result I received a df with an incorrect index:
CurrentDate Car NaN
CurrentDate Car Attr1 Attr2 Attr3 Attr4 Attr5 AttrFactory
01/07/18 Nissan 01/07/18 Nissan 0.0 1.7 3.7 0.0 6.8 F1
03/08/18 Nissan 03/08/18 Nissan 0.0 1.7 3.7 0.0 6.8 F3
05/08/18 Porsche 05/08/18 Porsche 10.0 1.0 1.8 3.5 6.5 F1
... ... ... ... ... ... ... ... ... ...
13/09/18 Tiguan 13/09/18 Tiguan 0.0 0.0 0.0 7.2 9.0 F3
25/09/18 Golf 25/09/18 Golf 0.0 1.0 0.0 0.0 4.8 F3
Expected output is df_gr:
Attr1 Attr2 Attr3 Attr4 Attr5 AttrFactory
01/07/18 Nissan 0.0 1.7 3.7 0.0 6.8 F1
03/08/18 Nissan 0.0 1.7 3.7 0.0 6.8 F3
05/08/18 Porsche 10.0 1.0 1.8 3.5 6.5 F1
... ... ... ... ... ... ... ...
13/09/18 Tiguan 0.0 0.0 0.0 7.2 9.0 F3
25/09/18 Golf 0.0 1.0 0.0 0.0 4.8 F3
How can I fix the incorrect CurrentDate Car NaN index at the top of the result?
I'd appreciate any ideas, thanks.
You can make a dictionary of your aggregations and pass it into agg:
IN:
metric_cols = df.filter(regex=r'^Attr\d', axis=1).columns
d = dict.fromkeys(metric_cols, ['mean'])
d['AttrFactory'] = lambda x: x.iloc[np.random.choice(range(0,len(x)))]
df = df.groupby(['CurrentDate', 'Car'], as_index=False).agg(d).droplevel(1, axis=1)
OUT:
| | CurrentDate | Car | Attr1 | Attr2 | Attr3 | Attr4 | Attr5 | AttrFactory |
|---|-------------|---------|-------|-------|--------------------|-------|-------|-------------|
| 0 | 01/07/18 | Nissan | 0.0 | 1.7 | 3.7 | 0.0 | 6.8 | F2 |
| 1 | 03/08/18 | Nissan | 0.0 | 1.7 | 3.7 | 0.0 | 6.8 | F3 |
| 2 | 05/08/18 | Porsche | 10.0 | 1.0 | 1.7999999999999998 | 3.5 | 6.5 | F1 |
| 3 | 07/09/18 | Golf | 0.0 | 1.7 | 3.0 | 2.0 | 6.3 | F4 |
| 4 | 10/09/18 | Tiguan | 1.0 | 0.0 | 3.0 | 5.2 | 5.8 | F5 |
| 5 | 12/09/18 | Porsche | 0.0 | 0.0 | 3.0 | 4.2 | 7.8 | F4 |
| 6 | 13/09/18 | Tiguan | 0.0 | 0.0 | 0.0 | 7.2 | 9.0 | F3 |
| 7 | 25/09/18 | Golf | 0.0 | 3.0 | 0.0 | 0.0 | 4.8 | F1 |
Your aggregators are applied column-wise and are therefore stored in a second column level, while the original column names stay in the first level (to prevent overwriting). This is especially useful when applying multiple aggregators per column.
A solution for this would be the following:
# Flatten the MultiIndex by merging the aggregator name into the column name
df_gr.columns = ['_'.join(col) for col in df_gr.columns]
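As a side note, if your pandas version is 0.25 or newer, named aggregation produces flat column names directly, so no droplevel or join step is needed. A sketch under that assumption (agg_spec is just an illustrative name):
import numpy as np

metric_cols = df.filter(regex=r'^Attr\d', axis=1).columns
# One (source column, aggregator) pair per output column.
agg_spec = {c: (c, 'mean') for c in metric_cols}
agg_spec['AttrFactory'] = ('AttrFactory',
                           lambda x: x.iloc[np.random.choice(len(x))])

df_gr = df.groupby(['CurrentDate', 'Car'], as_index=False).agg(**agg_spec)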
I have a dataframe which contains time series data for 30 consecutive days. Each day is supposed to contain data for 24 hours, from 0 to 23, so there should be 24*30 = 720 rows in the dataframe. However, some rows with missing records in the column "Fooo" have already been removed from the dataframe.
Index | DATE(YYYY/MM/DD) | Hour | Fooo
0 | 2015/01/01 | 0 | x
1 | 2015/01/01 | 1 | xy
2 | ... | ... | z
23 | 2015/01/01 | 23 | z
24 | 2015/01/02 | 0 | z
25 | 2015/01/02 | 2 | bz
... | ... | ... | z
46 | 2015/01/02 | 23 | zz
...
...
680 | 2015/01/30 | 1 | z
681 | 2015/01/30 | 3 | bz
... | ... | ... | z
701 | 2015/01/30 | 23 | zz
I would like to rewrite the dataframe so that it contains the full 720 rows, with the missing values in the column "Fooo" filled with "NA".
Index | DATE(YYYY/MM/DD) | Hour | Fooo
0 | 2015/01/01 | 0 | x
1 | 2015/01/01 | 1 | xy
2 | ... | ... | z
23 | 2015/01/01 | 23 | z
24 | 2015/01/02 | 0 | z
25 | 2015/01/02 | 1 | NA
26 | 2015/01/02 | 2 | bz
... | ... | ... | z
47 | 2015/01/02 | 23 | zz
...
...
690 | 2015/01/30 | 0 | NA
691 | 2015/01/30 | 1 | z
692 | 2015/01/30 | 2 | NA
693 | 2015/01/30 | 3 | bz
... | ... | ... | z
719 | 2015/01/30 | 23 | zz
How can I do that in pandas? I tried to create another dataframe with one column "Hour" like this:
Index | Hour |
0 | 0 |
1 | 1 |
2 | ... |
23 | 23 |
24 | 0 |
25 | 1 |
26 | 2 |
... | ...
47 | 23 |
...
...
690 | 0 |
691 | 1 |
692 | 2
693 | 3 |
... | |
719 | 23 |
then outer join it with the original one, but it did not work.
Create a helper DataFrame with itertools.product and use DataFrame.merge with a left join:
from itertools import product
df['DATE(YYYY/MM/DD)'] = pd.to_datetime(df['DATE(YYYY/MM/DD)'])
df1 = pd.DataFrame(list(product(df['DATE(YYYY/MM/DD)'].unique(), range(24))),
columns=['DATE(YYYY/MM/DD)','Hour'])
df = df1.merge(df, how='left')
print (df.head(10))
DATE(YYYY/MM/DD) Hour Fooo
0 2015-01-01 0 x
1 2015-01-01 1 xy
2 2015-01-01 2 NaN
3 2015-01-01 3 NaN
4 2015-01-01 4 NaN
5 2015-01-01 5 NaN
6 2015-01-01 6 NaN
7 2015-01-01 7 NaN
8 2015-01-01 8 NaN
9 2015-01-01 9 NaN
Or create a MultiIndex with MultiIndex.from_product and use DataFrame.reindex to append the missing rows:
df['DATE(YYYY/MM/DD)'] = pd.to_datetime(df['DATE(YYYY/MM/DD)'])
mux = pd.MultiIndex.from_product([df['DATE(YYYY/MM/DD)'].unique(), range(24)],
names=['DATE(YYYY/MM/DD)','Hour'])
df = df.set_index(['DATE(YYYY/MM/DD)','Hour']).reindex(mux).reset_index()
print (df.head(10))
DATE(YYYY/MM/DD) Hour Fooo
0 2015-01-01 0 x
1 2015-01-01 1 xy
2 2015-01-01 2 NaN
3 2015-01-01 3 NaN
4 2015-01-01 4 NaN
5 2015-01-01 5 NaN
6 2015-01-01 6 NaN
7 2015-01-01 7 NaN
8 2015-01-01 8 NaN
9 2015-01-01 9 NaN
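Either way, a quick sanity check on the reindexed frame (a sketch, assuming 30 distinct dates as in the question):
# 30 dates x 24 hours = 720 rows
assert len(df) == df['DATE(YYYY/MM/DD)'].nunique() * 24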
I have the following code. I need to add a column deaths_last_tuesday which shows the deaths from last Tuesday, for each day.
import pandas as pd
data = {'date': ['2014-05-01', '2014-05-02', '2014-05-03', '2014-05-04', '2014-05-05', '2014-05-06', '2014-05-07',
'2014-05-08', '2014-05-09', '2014-05-10', '2014-05-11', '2014-05-12', '2014-05-13', '2014-05-14',
'2014-05-15', '2014-05-16', '2014-05-17', '2014-05-18', '2014-05-19', '2014-05-20'],
'battle_deaths': [34, 25, 26, 15, 15, 14, 26, 25, 62, 41, 23, 56, 23, 34, 23, 67, 54, 34, 45, 12]}
df = pd.DataFrame(data, columns=['date', 'battle_deaths'])
df['date'] = pd.to_datetime(df['date'])
df['day_of_week'] = df['date'].dt.dayofweek
df = df.set_index('date')
df = df.sort_index()
battle_deaths day_of_week deaths_last_tuesday
date
2014-05-01 34 3
2014-05-02 25 4 24
2014-05-03 26 5 24
2014-05-04 15 6 24
2014-05-05 15 0 24
2014-05-06 14 1 24
2014-05-07 26 2 24
2014-05-08 25 3 24
2014-05-09 62 4 25
2014-05-10 41 5 25
2014-05-11 23 6 25
2014-05-12 56 0 25
I want to do this so that I can compare the deaths on each day with the deaths on the previous Tuesday.
Use:
df['deaths_last_tuesday'] = df['battle_deaths'].where(df['day_of_week'].eq(3)).ffill().shift()
print (df)
battle_deaths day_of_week deaths_last_tuesday
date
2014-05-01 34 3 NaN
2014-05-02 25 4 34.0
2014-05-03 26 5 34.0
2014-05-04 15 6 34.0
2014-05-05 15 0 34.0
2014-05-06 14 1 34.0
2014-05-07 26 2 34.0
2014-05-08 25 3 34.0
2014-05-09 62 4 25.0
2014-05-10 41 5 25.0
2014-05-11 23 6 25.0
2014-05-12 56 0 25.0
2014-05-13 23 1 25.0
2014-05-14 34 2 25.0
2014-05-15 23 3 25.0
2014-05-16 67 4 23.0
2014-05-17 54 5 23.0
2014-05-18 34 6 23.0
2014-05-19 45 0 23.0
2014-05-20 12 1 23.0
Explanation:
First compare by eq (==):
print (df['day_of_week'].eq(3))
date
2014-05-01 True
2014-05-02 False
2014-05-03 False
2014-05-04 False
2014-05-05 False
2014-05-06 False
2014-05-07 False
2014-05-08 True
2014-05-09 False
2014-05-10 False
2014-05-11 False
2014-05-12 False
2014-05-13 False
2014-05-14 False
2014-05-15 True
2014-05-16 False
2014-05-17 False
2014-05-18 False
2014-05-19 False
2014-05-20 False
Name: day_of_week, dtype: bool
Then create missing values for the non-matching rows with where:
print (df['battle_deaths'].where(df['day_of_week'].eq(3)))
date
2014-05-01 34.0
2014-05-02 NaN
2014-05-03 NaN
2014-05-04 NaN
2014-05-05 NaN
2014-05-06 NaN
2014-05-07 NaN
2014-05-08 25.0
2014-05-09 NaN
2014-05-10 NaN
2014-05-11 NaN
2014-05-12 NaN
2014-05-13 NaN
2014-05-14 NaN
2014-05-15 23.0
2014-05-16 NaN
2014-05-17 NaN
2014-05-18 NaN
2014-05-19 NaN
2014-05-20 NaN
Name: battle_deaths, dtype: float64
Forward fill the missing values:
print (df['battle_deaths'].where(df['day_of_week'].eq(3)).ffill())
date
2014-05-01 34.0
2014-05-02 34.0
2014-05-03 34.0
2014-05-04 34.0
2014-05-05 34.0
2014-05-06 34.0
2014-05-07 34.0
2014-05-08 25.0
2014-05-09 25.0
2014-05-10 25.0
2014-05-11 25.0
2014-05-12 25.0
2014-05-13 25.0
2014-05-14 25.0
2014-05-15 23.0
2014-05-16 23.0
2014-05-17 23.0
2014-05-18 23.0
2014-05-19 23.0
2014-05-20 23.0
Name: battle_deaths, dtype: float64
And finally shift:
print (df['battle_deaths'].where(df['day_of_week'].eq(3)).ffill().shift())
date
2014-05-01 NaN
2014-05-02 34.0
2014-05-03 34.0
2014-05-04 34.0
2014-05-05 34.0
2014-05-06 34.0
2014-05-07 34.0
2014-05-08 34.0
2014-05-09 25.0
2014-05-10 25.0
2014-05-11 25.0
2014-05-12 25.0
2014-05-13 25.0
2014-05-14 25.0
2014-05-15 25.0
2014-05-16 23.0
2014-05-17 23.0
2014-05-18 23.0
2014-05-19 23.0
2014-05-20 23.0
Name: battle_deaths, dtype: float64
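Note that pandas' dt.dayofweek numbers Monday as 0, so eq(3) marks Thursdays (2014-05-01 was a Thursday), which is exactly where the expected output above resets. If the column really should track the previous Tuesday, the same pattern works with eq(1):
df['deaths_last_tuesday'] = df['battle_deaths'].where(df['day_of_week'].eq(1)).ffill().shift()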