Pandas df, detecting dates

I have a pandas df as follows:
Name Cust1 Cust2 Cust3 Cust4
ABC  Y     N     Y     2022-01-01
DEF  N     N     N     N
I am looking to detect whether any of Cust1, Cust2, Cust3 or Cust4 contains a date in a given row and, if so, create a column populated with that date.
So the output would look like:
Name Date
ABC  2022-01-01
DEF  na
Any ideas on how I can do this?
I am trying something like df.iloc[:,1:].apply(np.where<xxx> but am not sure how to approach this from here.
Thanks!
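For anyone following along, the sample frame from the question can be rebuilt like this (a sketch of the data exactly as posted, with the date stored as a plain string):
import pandas as pd

df = pd.DataFrame({'Name': ['ABC', 'DEF'],
                   'Cust1': ['Y', 'N'],
                   'Cust2': ['N', 'N'],
                   'Cust3': ['Y', 'N'],
                   'Cust4': ['2022-01-01', 'N']})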

You can flatten your dataframe then keep the most recent date per Name:
to_date = lambda x: pd.to_datetime(x['Date'], errors='coerce')
out = df.melt('Name', value_name='Date').assign(Date=to_date) \
      .groupby('Name', as_index=False)['Date'].max()
print(out)
# Output
Name Date
0 ABC 2022-01-01
1 DEF NaT

Try converting the values of all columns to datetimes with to_datetime and errors='coerce' (values that are not datetime-like become NaT), then take the maximal date per row:
f = lambda x: pd.to_datetime(x, errors='coerce')
df = df.set_index('Name').apply(f).max(axis=1).reset_index(name='Date')
print (df)
Name Date
0 ABC 2022-01-01
1 DEF NaT
Alternative solution:
f = lambda x: pd.to_datetime(x, errors='coerce')
df = df[['Name']].join(df.iloc[:,1:].apply(f).max(axis=1).rename('Date'))
print (df)
Name Date
0 ABC 2022-01-01
1 DEF NaT

Related

Create a row for each year between two dates

I have a dataframe with two date columns (format: YYYY-MM-DD). I want to create one row for each year between those two dates. The rows would be identical with a new column which specifies the year. For example, if the dates are 2018-01-01 and 2020-01-01 then there would be three rows with same data and a new column with values 2018, 2019, and 2020.
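For the first answer below, a minimal frame to experiment with could look like this (a sketch; the question does not name its columns, so date1/date2 are assumptions matching the answers):
import pandas as pd

df = pd.DataFrame({'date1': ['2018-01-01'], 'date2': ['2020-01-01']})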
You can use a custom function to compute the range then explode the column:
# Ensure to have datetime
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])
# Create the new column
date_range = lambda x: range(x['date1'].year, x['date2'].year+1)
df = df.assign(year=df.apply(date_range, axis=1)).explode('year', ignore_index=True)
Output:
>>> df
date1 date2 year
0 2018-01-01 2020-01-01 2018
1 2018-01-01 2020-01-01 2019
2 2018-01-01 2020-01-01 2020
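If the two dates can arrive out of order (as in the sample data of the next answer), you may want to take the min and max of the two years inside the lambda so the range is never empty (a sketch):
date_range = lambda x: range(min(x['date1'], x['date2']).year,
                             max(x['date1'], x['date2']).year + 1)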
This should work for you:
import pandas
# some sample data
df = pandas.DataFrame(data={
    'foo': ['bar', 'baz'],
    'date1': ['2018-01-01', '2022-01-01'],
    'date2': ['2020-01-01', '2017-01-01']
})
# cast date columns to datetime
for col in ['date1', 'date2']:
    df[col] = pandas.to_datetime(df[col])
# reset index to ensure that selection by length of index works
df = df.reset_index(drop=True)
# compute the range of years between the two dates, then iterate through the resulting
# series to unpack each range and append a new row with the original data and the year
for i, years in df.apply(
    lambda x: range(
        min(x.date1, x.date2).year,
        max(x.date1, x.date2).year + 1
    ),
    axis='columns'
).iteritems():
    for year in years:
        new_index = len(df.index)
        df.loc[new_index] = df.loc[i].values
        df.loc[new_index, 'year'] = int(year)
output:
>>> df
foo date1 date2 year
0 bar 2018-01-01 2020-01-01 NaN
1 baz 2022-01-01 2017-01-01 NaN
2 bar 2018-01-01 2020-01-01 2018.0
3 bar 2018-01-01 2020-01-01 2019.0
4 bar 2018-01-01 2020-01-01 2020.0
5 baz 2022-01-01 2017-01-01 2017.0
6 baz 2022-01-01 2017-01-01 2018.0
7 baz 2022-01-01 2017-01-01 2019.0
8 baz 2022-01-01 2017-01-01 2020.0
9 baz 2022-01-01 2017-01-01 2021.0
10 baz 2022-01-01 2017-01-01 2022.0
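If you only want the expanded rows (and an integer year), one option is to drop the originals afterwards; a minimal sketch building on the df above:
# original rows are the ones whose 'year' is still NaN
df = df.dropna(subset=['year']).reset_index(drop=True)
df['year'] = df['year'].astype(int)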

Pandas: drop out of sequence row

My Pandas df:
import pandas as pd
import io
data = """date value
"2015-09-01" 71.925000
"2015-09-06" 71.625000
"2015-09-11" 71.333333
"2015-09-12" 64.571429
"2015-09-21" 72.285714
"""
df = pd.read_table(io.StringIO(data), delim_whitespace=True)
df.date = pd.to_datetime(df.date)
I am given a user input date (01-09-2015).
I would like to keep only those dates where the difference between the date and the input date is a multiple of 5 days.
Expected output:
input = 01-09-2015
df:
date value
0 2015-09-01 71.925000
1 2015-09-06 71.625000
2 2015-09-11 71.333333
3 2015-09-21 72.285714
My Approach so far:
I am taking the delta between input_date and date in pandas and saving this delta in a separate column.
If delta % 5 == 0, keep the row, else drop it. Is this the best that can be done?
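For comparison with the answer below, that intermediate-column approach might look like this (a sketch, assuming the input date means 1 September 2015):
input_date = pd.to_datetime('2015-09-01')
df['delta'] = (df['date'] - input_date).dt.days
df = df[df['delta'] % 5 == 0].drop(columns='delta')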
Use boolean indexing to filter by a mask: convert the input value to a datetime, then convert the timedeltas to days with Series.dt.days:
input1 = '01-09-2015'
df = df[df.date.sub(pd.to_datetime(input1)).dt.days % 5 == 0]
print (df)
date value
0 2015-09-01 71.925000
1 2015-09-06 71.625000
2 2015-09-11 71.333333
4 2015-09-21 72.285714
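Note that '01-09-2015' is ambiguous: pd.to_datetime parses it month-first by default. If it is meant day-first (1 September 2015), it may be safer to parse it explicitly, for example (a sketch):
input_date = pd.to_datetime(input1, format='%d-%m-%Y')
df = df[df.date.sub(input_date).dt.days % 5 == 0]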

Substring column in Pandas based on another column

I'm trying to substring a column based on the length of another column, but the result is all NaN. What am I doing wrong?
import pandas as pd
df = pd.DataFrame([['abcdefghi','xyz'], ['abcdefghi', 'z']], columns=['col1', 'col2'])
df.col1.str[:df.col2.str.len()]
0 NaN
1 NaN
Name: col1, dtype: float64
Here is what I am expecting:
0 'abc'
1 'a'
I don't think string indexing would take a series. I would do a list comprehension:
df['extract'] = [r.col1[:len(r.col2)] for _,r in df.iterrows()]
Or
df['extract'] = [s1[:len(s2)] for s1,s2 in zip(df.col1, df.col2)]
Output:
col1 col2 extract
0 abcdefghi xyz abc
1 abcdefghi z a
Using numpy and converting the array to a pd.Series:
import numpy as np

def slicer(start=None, stop=None, step=1):
    return np.vectorize(lambda x: x[start:stop:step], otypes=[str])

df["new_str"] = pd.Series(
    [slicer(0, i)(c) for i, c in zip(df["col2"].apply(len), df["col1"].values)]
)
print(df)
col1 col2 new_str
0 abcdefghi xyz abc
1 abcdefghi z a
Here is a solution using lambda:
df['new'] = df.apply(lambda row: row['col1'][0:len(row['col2'])], axis=1)
Result:
col1 col2 new
0 abcdefghi xyz abc
1 abcdefghi z a

Pandas - groupby and count series string over column

I have a df like this:
import pandas as pd
df = pd.DataFrame(columns=['Concat','SearchTerm'])
df = df.append({'Concat':'abc','SearchTerm':'aa'}, ignore_index=True)
df = df.append({'Concat':'abc','SearchTerm':'aab'}, ignore_index=True)
df = df.append({'Concat':'abc','SearchTerm':'aac'}, ignore_index=True)
df = df.append({'Concat':'abc','SearchTerm':'ddd'}, ignore_index=True)
df = df.append({'Concat':'def','SearchTerm':'cef'}, ignore_index=True)
df = df.append({'Concat':'def','SearchTerm':'plo'}, ignore_index=True)
df = df.append({'Concat':'def','SearchTerm':'cefa'}, ignore_index=True)
print(df)
Concat SearchTerm
0 abc aa
1 abc aab
2 abc aac
3 abc ddd
4 def cef
5 def plo
6 def cefa
I want to group up the df by Concat, and count how many times each SearchTerm appears within the strings of that subset. So the final result should look like this:
Concat SearchTerm Count
0 abc aa 3
1 abc aab 1
2 abc aac 1
3 abc ddd 1
4 def cef 2
5 def plo 1
6 def cefa 1
For Concat abc, aa is found 3 times among the 4 SearchTerms. I can get the solution using a loop, but for my larger dataset, it is too slow.
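For reference, such a loop might look like the following (a sketch, not the asker's actual code; it rescans the group for every row, which is why it is slow):
counts = []
for _, row in df.iterrows():
    group = df.loc[df['Concat'] == row['Concat'], 'SearchTerm']
    counts.append(group.str.contains(row['SearchTerm'], regex=False).sum())
df['Count'] = counts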
I have tried two solutions from this thread and this thread.
df['Count'] = df['SearchTerm'].str.contains(df['SearchTerm']).groupby(df['Concat']).sum()
df['Count'] = df.groupby(['Concat'])['SearchTerm'].transform(lambda x: x[x.str.contains(x)].count())
In either case, there is a TypeError:
'Series' objects are mutable, thus they cannot be hashed
Any help would be appreciated.
Use transform and a list comprehension:
s = df.groupby('Concat').SearchTerm.transform('|'.join)
df['Count'] = [s[i].count(term) for i, term in enumerate(df.SearchTerm)]
Out[77]:
Concat SearchTerm Count
0 abc aa 3
1 abc aab 1
2 abc aac 1
3 abc ddd 1
4 def cef 2
5 def plo 1
6 def cefa 1

Pandas dataframe apply function

I have a dataframe which looks like this.
df.head()
Ship Date Cost Amount
0 2010-08-01 4257.23300
1 2010-08-01 9846.94540
2 2010-08-01 35.77764
3 2010-08-01 420.82920
4 2010-08-01 129.49638
I had to club the data week-wise, for which I did:
df['week_num'] = pd.DatetimeIndex(df['Ship Date']).week
x = df.groupby('week_num').sum()
it produces a dataframe which looks like this:
Cost Amount
week_num
30 3.273473e+06
31 9.715421e+07
32 9.914568e+07
33 9.843721e+07
34 1.065546e+08
35 1.087598e+08
36 8.050456e+07
Now I wanted to add a column with week and year information. To do this I did:
def my_conc(row):
    return str(row['week_num']) + str('2011')
and
x['year_week'] = x.apply(my_conc,axis= 1)
This gives me an error message:
KeyError: ('week_num', u'occurred at index 30')
Now my questions are:
1) Why does the groupby produce a dataframe which looks a little odd, without week_num as a column name?
2) Is there a better way of producing the dataframe with grouped data?
3) How do I use an apply function on the grouped dataframe above?
Here's one way to do it.
Use as_index=False in groupby so that week_num stays a regular column instead of becoming the index.
In [50]: df_grp = df.groupby('week_num', as_index=False).sum()
Then apply a lambda function.
In [51]: df_grp['year_week'] = df_grp.apply(lambda x: str(x['week_num']) + '2011',
                                            axis=1)
In [52]: df_grp
Out[52]:
week_num Cost year_week
0 30 3273473 302011
1 31 97154210 312011
2 32 99145680 322011
3 33 98437210 332011
4 34 106554600 342011
5 35 108759800 352011
6 36 80504560 362011
Or use df_grp.apply(lambda x: '%d2011' % x['week_num'], axis=1)
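A vectorized alternative (a sketch) avoids apply altogether:
df_grp['year_week'] = df_grp['week_num'].astype(str) + '2011'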
On your first question, I have no idea. When I try and replicate it, I just get an error.
On the other questions, use the .dt accessor for groupby() operations:
# get your data into a DataFrame
data = """Ship Date Cost Amount
0 2010-08-01 4257.23300
1 2010-08-01 9846.94540
2 2010-08-01 35.77764
3 2010-08-01 420.82920
4 2010-08-01 129.49638
"""
from StringIO import StringIO # import from io for Python 3
df = pd.read_csv(StringIO(data), header=0, index_col=0, sep=' ', skipinitialspace=True)
# make the dtype for the column datetime64[ns]
df['Ship Date'] = pd.to_datetime(df['Ship Date'])
# then you can use the .dt accessor to group on
x = df.groupby(df['Ship Date'].dt.dayofyear).sum()
y = df.groupby(df['Ship Date'].dt.weekofyear).sum()
There are a host more of these .dt accessors ... link
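For example (a sketch reusing the df built above), grouping by month or by a weekly period works the same way:
by_month = df.groupby(df['Ship Date'].dt.month).sum()
by_week = df.groupby(df['Ship Date'].dt.to_period('W')).sum()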