Pandas function issues - equation output incorrect

row['conus_days']>0 or row['conus_days1']>0:
    return row['conus_days']*8 + row['conus_days1']*12
elif (row['Country']=='Afghanistan' or row['Country']=='Iraq' or row['Country']=='Somalia' or row['Country']=='Yemen') and row['oconus_days']>0 or row['oconus_days1']>0:
    return row['oconus_days']*12 + row['oconus_days1']*8
elif (row['Country']=='Afghanistan' or row['Country']=='Iraq' or row['Country']=='Somalia' or row['Country']=='Yemen'):
    return row['days_in_month']*12
elif (row['Country']=='Germany') and row['conus_days']>0:
    return row['conus_days']*8 + row['conus_days1']*10
elif (row['Country']=='Germany'):
    return row['days_in_month']*10
elif row['Country']=='Conus':
    return row['working_days']*8
else:
    return row['working_days']*8
forecast['hours'] = forecast.apply(lambda row: get_hours(row), axis=1)
print(forecast.head())
This returns the following output:
Name EID Start Date End Date Country year Month \
0 xx 123456 2019-08-01 2020-01-03 Afghanistan 2020 1
1 XX 3456789 2019-09-22 2020-02-16 Conus 2020 1
2 xx. 456789 2019-12-05 2020-03-12 Conus 2020 1
3 DR. 789456 2019-09-11 2020-03-04 Iraq 2020 1
4 JR. 985756 2020-01-03 2020-05-06 Germany 2020 1
days_in_month start_month end_month working_days conus_mth oconus_mth \
0 31 2020-01-01 2020-01-31 21 8 1
1 31 2020-01-01 2020-01-31 21 9 2
2 31 2020-01-01 2020-01-31 21 12 3
3 31 2020-01-01 2020-01-31 21 9 3
4 31 2020-01-01 2020-01-31 21 1 5
conus_days conus_days1 oconus_days oconus_days1 hours
0 0 0 2 25 224
1 0 0 0 0 168
2 0 0 0 0 168
3 0 0 0 0 372
4 1 28 0 0 344
The output on row 4 is incorrect; it should be 288.

Wrapping each complete condition in parentheses, and grouping the or terms in their own inner parentheses, lets each if statement evaluate independently and correctly. In Python, and binds more tightly than or, so "A and B or C" is evaluated as "(A and B) or C".
def get_hours(row):
    # Listed countries with any conus days: 8 per conus_days, 12 per conus_days1
    if ((row['Country']=='Afghanistan' or row['Country']=='Iraq' or row['Country']=='Somalia' or row['Country']=='Yemen') and (row['conus_days']>0 or row['conus_days1']>0)):
        return row['conus_days']*8 + row['conus_days1']*12
    # Listed countries with any oconus days (the or terms are grouped so the country test always applies)
    if ((row['Country']=='Afghanistan' or row['Country']=='Iraq' or row['Country']=='Somalia' or row['Country']=='Yemen') and (row['oconus_days']>0 or row['oconus_days1']>0)):
        return row['oconus_days']*12 + row['oconus_days1']*8
    # Listed countries for the whole month
    if (row['Country']=='Afghanistan' or row['Country']=='Iraq' or row['Country']=='Somalia' or row['Country']=='Yemen'):
        return row['days_in_month']*12
    if ((row['Country']=='Germany') and row['conus_days']>0):
        return row['conus_days']*8 + row['conus_days1']*10
    if ((row['Country']=='Germany') and row['oconus_days']>0):
        return row['oconus_days']*10 + row['oconus_days1']*8
    if (row['Country']=='Germany'):
        return row['days_in_month']*10
    if (row['Country']=='Conus'):
        return row['working_days']*8
    else:
        return row['working_days']*8
forecast['hours'] = forecast.apply(lambda row: get_hours(row), axis=1)
print(forecast.head())
Name EID Start Date End Date Country year Month \
0 XX. 123456 2019-08-01 2020-01-03 Afghanistan 2020 1
1 xx 3456789 2019-09-22 2020-02-16 Conus 2020 1
2 Mh 456789 2019-12-05 2020-03-12 Conus 2020 1
3 DR 789456 2019-09-11 2020-03-04 Iraq 2020 1
4 JR 985756 2020-01-03 2020-05-06 Germany 2020 1
days_in_month start_month end_month working_days conus_mth oconus_mth \
0 31 2020-01-01 2020-01-31 21 8 1
1 31 2020-01-01 2020-01-31 21 9 2
2 31 2020-01-01 2020-01-31 21 12 3
3 31 2020-01-01 2020-01-31 21 9 3
4 31 2020-01-01 2020-01-31 21 1 5
conus_days conus_days1 oconus_days oconus_days1 hours
0 0 0 2 25 224
1 0 0 0 0 168
2 0 0 0 0 168
3 0 0 0 0 372
4 1 28 0 0 288
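The precedence issue behind the wrong value on row 4 can be reproduced in isolation. The following is a minimal sketch, assuming a plain dict stands in for the DataFrame row; the name LISTED_COUNTRIES is hypothetical and only shortens the country test.

```python
# Hypothetical helper set; the original code spells the four comparisons out.
LISTED_COUNTRIES = {"Afghanistan", "Iraq", "Somalia", "Yemen"}

# Values taken from row 4 of the sample data (Germany).
row = {"Country": "Germany", "conus_days": 1, "conus_days1": 28}

in_listed = row["Country"] in LISTED_COUNTRIES

# Without inner parentheses Python parses this as (A and B) or C,
# so the last comparison alone can make the whole condition true.
ungrouped = in_listed and row["conus_days"] > 0 or row["conus_days1"] > 0

# With explicit grouping the country test always applies: A and (B or C).
grouped = in_listed and (row["conus_days"] > 0 or row["conus_days1"] > 0)

print(ungrouped, grouped)  # True False
```

With the ungrouped form, the Germany row falls into the first branch and is multiplied by 12 (1*8 + 28*12 = 344) instead of reaching the Germany branch (1*8 + 28*10 = 288).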

Related

Create a column counting number of consecutive negative days

I have a huge (more than 3 million rows) pandas DataFrame containing the following data:
companyId dateBalance amount
1 2020-04-17 100
1 2020-04-18 40
1 2020-04-19 20
1 2020-04-20 -40
1 2020-04-21 30
2 2020-04-18 5
2 2020-04-19 1
2 2020-04-20 -6
2 2020-04-21 -60
2 2020-04-22 200
I would like to create a new column that counts the number of consecutive days the company has had a negative balance. For this data, the result would be:
companyId dateBalance amount negCount
1 2020-04-17 100 0
1 2020-04-18 40 0
1 2020-04-19 20 0
1 2020-04-20 -40 1
1 2020-04-21 30 0
2 2020-04-18 5 0
2 2020-04-19 1 0
2 2020-04-20 -6 1
2 2020-04-21 -60 2
2 2020-04-22 200 0
Is there a quick way of doing this (i.e., some way that does not require iteration over every line)? Note that the index must "reset" every sign change and for every different company as well.
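Use groupby().cumsum() on the negation of the criterion to identify the blocks, then group by the blocks again and use cumcount(). Each non-negative row starts a new block, so cumcount() within a block counts the negative rows that follow it: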
blocks = df['amount'].ge(0).groupby(df['companyId']).cumsum()
df['negCount'] = df.groupby([df['companyId'],blocks]).cumcount()
Output:
companyId dateBalance amount negCount
0 1 2020-04-17 100 0
1 1 2020-04-18 40 0
2 1 2020-04-19 20 0
3 1 2020-04-20 -40 1
4 1 2020-04-21 30 0
5 2 2020-04-18 5 0
6 2 2020-04-19 1 0
7 2 2020-04-20 -6 1
8 2 2020-04-21 -60 2
9 2 2020-04-22 200 0
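The cumcount() approach relies on every company's first row being non-negative. A variant that may be worth considering sums the negative flags inside each block instead, which also covers a company whose data starts with a negative balance. This is only a sketch; the function name count_negative_runs and the second test frame are made up for illustration.

```python
import pandas as pd


def count_negative_runs(df):
    """Return, per row, the length of the current run of negative amounts."""
    neg = df["amount"].lt(0)
    # Every non-negative row opens a new block within its company.
    blocks = (~neg).groupby(df["companyId"]).cumsum()
    # Summing the negative flags inside each (company, block) gives the run length.
    return neg.astype(int).groupby([df["companyId"], blocks]).cumsum()


df = pd.DataFrame({
    "companyId": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "amount": [100, 40, 20, -40, 30, 5, 1, -6, -60, 200],
})
df["negCount"] = count_negative_runs(df)
print(df["negCount"].tolist())  # [0, 0, 0, 1, 0, 0, 0, 1, 2, 0]

# Hypothetical company whose history starts with negative balances.
leading = pd.DataFrame({"companyId": [3, 3, 3], "amount": [-5, -1, 4]})
print(count_negative_runs(leading).tolist())  # [1, 2, 0]
```

Like the original, this stays vectorized and assumes the rows are already ordered by date within each company.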

Calculate day's difference between successive pandas dataframe rows with condition

I have a dataframe as following:
Company Date relTweet GaplastRel
XYZ 3/2/2020 1
XYZ 3/3/2020 1
XYZ 3/4/2020 1
XYZ 3/5/2020 1
XYZ 3/5/2020 0
XYZ 3/6/2020 1
XYZ 3/8/2020 1
ABC 3/9/2020 0
ABC 3/10/2020 1
ABC 3/11/2020 0
ABC 3/12/2020 1
The relTweet displays whether the tweet is relevant (1) or not (0).
I need to find the difference in days (GaplastRel) between successive rows for each company, with the condition that the previous tweet counted must be a relevant one (i.e. relTweet = 1). For example, for the first record GaplastRel should be 0; for the second record, GaplastRel should be 1, as the last relevant tweet was made one day ago.
Below is an example of the needed output:
Company Date relTweet GaplastRel
XYZ 3/2/2020 1 0
XYZ 3/3/2020 1 1
XYZ 3/4/2020 1 1
XYZ 3/5/2020 1 1
XYZ 3/5/2020 0 1
XYZ 3/6/2020 1 1
XYZ 3/8/2020 1 2
ABC 3/9/2020 0 0
ABC 3/10/2020 1 0
ABC 3/11/2020 0 1
ABC 3/12/2020 1 2
The following is my code:
dataDf['Date'] = pd.to_datetime(dataDf['Date'], format='%m/%d/%Y')
dataDf['GaplastRel'] = (dataDf.groupby('Company', group_keys=False)
                        .apply(lambda g: g['Date'].diff().replace(0, np.nan).ffill()))
This code gives the difference in days between successive rows for each company without considering the relTweet = 1 condition. I am not sure how to apply the condition.
The following is the output of the above code:
Company Date relTweet GaplastRel
XYZ 3/2/2020 1 NaT
XYZ 3/3/2020 1 1 days
XYZ 3/4/2020 1 1 days
XYZ 3/5/2020 1 1 days
XYZ 3/5/2020 0 0 days
XYZ 3/6/2020 1 1 days
XYZ 3/8/2020 1 2 days
ABC 3/9/2020 0 NaT
ABC 3/10/2020 1 1 days
ABC 3/11/2020 0 1 days
ABC 3/12/2020 1 1 days
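Think about the problem differently: sometimes merge_asof is a better fit than groupby. It matches each row to the most recent earlier relevant tweet from the same company (both frames must be sorted by Date):
df1 = df.loc[df['relTweet']==1, ['Company','Date']]
df = pd.merge_asof(df, df1.assign(Date1=df1.Date), by='Company', on='Date', allow_exact_matches=False)
df['GaplastRel'] = (df.Date - df.Date1).dt.days.fillna(0)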
df
Out[31]:
Company Date relTweet Date1 GaplastRel
0 XYZ 2020-03-02 1 NaT 0.0
1 XYZ 2020-03-03 1 2020-03-02 1.0
2 XYZ 2020-03-04 1 2020-03-03 1.0
3 XYZ 2020-03-05 1 2020-03-04 1.0
4 XYZ 2020-03-05 0 2020-03-04 1.0
5 XYZ 2020-03-06 1 2020-03-05 1.0
6 XYZ 2020-03-08 1 2020-03-06 2.0
7 ABC 2020-03-09 0 NaT 0.0
8 ABC 2020-03-10 1 NaT 0.0
9 ABC 2020-03-11 0 2020-03-10 1.0
10 ABC 2020-03-12 1 2020-03-10 2.0
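For anyone who wants to try the merge_asof idea end to end, here is a self-contained sketch built from the sample data in the question. The explicit sort and the final cast to int are additions for illustration, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["XYZ"] * 7 + ["ABC"] * 4,
    "Date": ["3/2/2020", "3/3/2020", "3/4/2020", "3/5/2020", "3/5/2020",
             "3/6/2020", "3/8/2020", "3/9/2020", "3/10/2020", "3/11/2020",
             "3/12/2020"],
    "relTweet": [1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1],
})
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# merge_asof needs both frames sorted by the "on" key; mergesort keeps ties in order.
df = df.sort_values("Date", kind="mergesort").reset_index(drop=True)

# Relevant tweets only, with a copy of the date to carry into the result.
relevant = df.loc[df["relTweet"] == 1, ["Company", "Date"]]
relevant = relevant.assign(Date1=relevant["Date"])

# For each row, find the latest relevant tweet of the same company strictly before it.
out = pd.merge_asof(df, relevant, on="Date", by="Company",
                    allow_exact_matches=False)
out["GaplastRel"] = (out["Date"] - out["Date1"]).dt.days.fillna(0).astype(int)

print(out["GaplastRel"].tolist())  # [0, 1, 1, 1, 1, 1, 2, 0, 0, 1, 2]
```

Setting allow_exact_matches=False is what keeps a tweet from matching itself or another tweet on the same date.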

expand year values to month in pandas

I have sales by year:
df = pd.DataFrame({'year':[2015,2016,2017],'value':['12','24','36']})
year value
0 2015 12
1 2016 24
2 2017 36
I want to extrapolate to months:
yyyymm value
201501 1 (ie 12/12, etc)
201502 1
...
201512 1
201601 2
...
201712 3
Any suggestions?
One idea is to use a cross join with a helper DataFrame, convert the columns to strings, and zero-pad the month with Series.str.zfill:
df1 = pd.DataFrame({'m': range(1, 13), 'a' : 1})
df = df.assign(a = 1).merge(df1).drop(columns='a')
df['year'] = df['year'].astype(str) + df.pop('m').astype(str).str.zfill(2)
df = df.rename(columns={'year':'yyyymm'})
Another solution is to create a MultiIndex and use DataFrame.reindex:
mux = pd.MultiIndex.from_product([df['year'], range(1, 13)], names=['yyyymm','m'])
df = df.set_index('year').reindex(mux, level=0).reset_index()
df['yyyymm'] = df['yyyymm'].astype(str) + df.pop('m').astype(str).str.zfill(2)
print (df.head(15))
yyyymm value
0 201501 12
1 201502 12
2 201503 12
3 201504 12
4 201505 12
5 201506 12
6 201507 12
7 201508 12
8 201509 12
9 201510 12
10 201511 12
11 201512 12
12 201601 24
13 201602 24
14 201603 24
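The output above repeats the yearly value for every month, whereas the question asks for the value spread over the months (12/12, and so on). One possible way to get there is sketched below, assuming the values are numeric rather than strings and a pandas version that supports merge(how="cross"); the helper frame name months is made up.

```python
import pandas as pd

# Assumption: numeric values (the question stores them as strings).
df = pd.DataFrame({"year": [2015, 2016, 2017], "value": [12, 24, 36]})

# Helper frame with one row per month.
months = pd.DataFrame({"m": range(1, 13)})

# Cross join: every year is paired with every month.
out = df.merge(months, how="cross")

# Build the yyyymm key and split the yearly value evenly over 12 months.
out["yyyymm"] = out["year"].astype(str) + out["m"].astype(str).str.zfill(2)
out["value"] = out["value"] / 12
out = out[["yyyymm", "value"]]

print(out.head(3))
```

If the values arrive as strings, converting them first with pd.to_numeric would be needed before dividing.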

Grouping into series based on days since

I need to create a new grouping every time I have a period of more than 60 days since my previous record.
Basically, I need to take the data I have here:
RowNo StartDate StopDate DaysBetween
1 3/21/2017 3/21/2017 14
2 4/4/2017 4/4/2017 14
3 4/18/2017 4/18/2017 14
4 6/23/2017 6/23/2017 66
5 7/5/2017 7/5/2017 12
6 7/19/2017 7/19/2017 14
7 9/27/2017 9/27/2017 70
8 10/24/2017 10/24/2017 27
9 10/31/2017 10/31/2017 7
10 11/14/2017 11/14/2017 14
And turn it into this:
RowNo StartDate StopDate DaysBetween Series
1 3/21/2017 3/21/2017 14 1
2 4/4/2017 4/4/2017 14 1
3 4/18/2017 4/18/2017 14 1
4 6/23/2017 6/23/2017 66 2
5 7/5/2017 7/5/2017 12 2
6 7/19/2017 7/19/2017 14 2
7 9/27/2017 9/27/2017 70 3
8 10/24/2017 10/24/2017 27 3
9 10/31/2017 10/31/2017 7 3
10 11/14/2017 11/14/2017 14 3
Once I have that I'll group by Series and get the min(StartDate) and max(StopDate) for individual durations.
I could do this using a cursor but I'm sure someone much smarter than me has figured out a more elegant solution. Thanks in advance!
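You can use the window function sum() over() with a conditional flag: every row with more than 60 days since the previous record adds 1 to a running total, which becomes the series number.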
Example
Select *
,Series= 1+sum(case when [DaysBetween]>60 then 1 else 0 end) over (Order by RowNo)
From YourTable
Returns
RowNo StartDate StopDate DaysBetween Series
1 2017-03-21 2017-03-21 14 1
2 2017-04-04 2017-04-04 14 1
3 2017-04-18 2017-04-18 14 1
4 2017-06-23 2017-06-23 66 2
5 2017-07-05 2017-07-05 12 2
6 2017-07-19 2017-07-19 14 2
7 2017-09-27 2017-09-27 70 3
8 2017-10-24 2017-10-24 27 3
9 2017-10-31 2017-10-31 7 3
10 2017-11-14 2017-11-14 14 3
EDIT - 2008 Version
Select A.*
,B.*
From YourTable A
Cross Apply (
Select Series=1+sum( case when [DaysBetween]>60 then 1 else 0 end)
From YourTable
Where RowNo <= A.RowNo
) B
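Since the rest of this page deals with pandas, the same running-sum idea can be sketched there as well. This is a pandas translation of the SQL above, not part of the original answer; the column names follow the question and the durations frame is a made-up name.

```python
import pandas as pd

start = pd.to_datetime(
    ["3/21/2017", "4/4/2017", "4/18/2017", "6/23/2017", "7/5/2017",
     "7/19/2017", "9/27/2017", "10/24/2017", "10/31/2017", "11/14/2017"],
    format="%m/%d/%Y",
)
df = pd.DataFrame({
    "RowNo": range(1, 11),
    "StartDate": start,
    "StopDate": start,
    "DaysBetween": [14, 14, 14, 66, 12, 14, 70, 27, 7, 14],
})

# Flag rows with a gap of more than 60 days and take a running total of the flags.
df = df.sort_values("RowNo")
df["Series"] = 1 + df["DaysBetween"].gt(60).cumsum()
print(df["Series"].tolist())  # [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

# Follow-up step from the question: first start and last stop per series.
durations = df.groupby("Series").agg(
    first_start=("StartDate", "min"),
    last_stop=("StopDate", "max"),
)
print(durations)
```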

Subtract day column from date column in pandas data frame

I have two columns in my data frame. One column is a date (df["Start_date"]) and the other is a number of days. I want to subtract the number-of-days column (df["days"]) from the date column.
I was trying something like this:
df["new_date"]=df["Start_date"]-datetime.timedelta(days=df["days"])
datetime.timedelta accepts only scalar values, not a Series, so I think you need to_timedelta:
df["new_date"]=df["Start_date"]-pd.to_timedelta(df["days"], unit='D')
Sample:
import numpy as np
import pandas as pd
np.random.seed(120)
start = pd.to_datetime('2015-02-24')
rng = pd.date_range(start, periods=10)
df = pd.DataFrame({'Start_date': rng, 'days': np.random.choice(np.arange(10), size=10)})
print (df)
Start_date days
0 2015-02-24 7
1 2015-02-25 0
2 2015-02-26 8
3 2015-02-27 4
4 2015-02-28 1
5 2015-03-01 7
6 2015-03-02 1
7 2015-03-03 3
8 2015-03-04 8
9 2015-03-05 9
df["new_date"]=df["Start_date"]-pd.to_timedelta(df["days"], unit='D')
print (df)
Start_date days new_date
0 2015-02-24 7 2015-02-17
1 2015-02-25 0 2015-02-25
2 2015-02-26 8 2015-02-18
3 2015-02-27 4 2015-02-23
4 2015-02-28 1 2015-02-27
5 2015-03-01 7 2015-02-22
6 2015-03-02 1 2015-03-01
7 2015-03-03 3 2015-02-28
8 2015-03-04 8 2015-02-24
9 2015-03-05 9 2015-02-24