Reshape dataframe year variables - pandas

How do I reshape/transform this:
df = pd.DataFrame({'Year':[2014,2015,2014,2015],'KS4':[True, True, False, False], 'KS5':[False, False, True, False]})
KS4 KS5 Year
0 True False 2014
1 True False 2015
2 False True 2014
3 False False 2015
To get:
KS4 KS5
0 2014 2014
1 2015

There are a couple of simple approaches that reconstruct the DataFrame from Series.
df.iloc[:, :-1].apply(lambda x: pd.Series(df.Year.values[x]))
This does the same thing more explicitly with a dict comprehension.
pd.DataFrame({col: pd.Series(df['Year'].values[df[col]]) for col in df.columns[:-1]})
KS4 KS5
0 2014 2014.0
1 2015 NaN
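If blank cells are preferred to NaN, as in the desired output, the same comprehension can be followed by fillna (a small addition of mine, not part of the original answer):
out = pd.DataFrame({col: pd.Series(df['Year'].values[df[col]])
                    for col in df.columns[:-1]})
out.fillna('')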

It looks like you are only looking at where the values are True. If so:
dd = df.copy()  # assuming dd is a copy of the original frame
dd = dd.groupby(['Year'], as_index=False).sum()
dd.KS4 = dd.KS4 * dd.Year
dd.KS5 = dd.KS5 * dd.Year
dd.replace(0, '', inplace=True)

Try this
df.KS4=df.KS4.mul(df.Year)
df.KS5=df.KS5.mul(df.Year)
df.set_index('Year').stack().to_frame().replace({0:np.nan}).dropna()\
.unstack().fillna('').reset_index(drop=True)
Out[159]:
0
KS4 KS5
0 2014 2014
1 2015
EDIT: drop the extra column level by using df.columns = df.columns.droplevel()
Or
df=df.set_index('Year').stack().to_frame().replace({0:np.nan}).dropna()\
.unstack().fillna('')
df.mul(df.index.values, axis=0).reset_index(drop=True)  # axis=0 multiplies each row by its Year index
Out[183]:
0
KS4 KS5
0 2014 2014
1 2015

f = lambda d: d.mul(d.index.to_series().astype(str), 0)
df.groupby('Year').any().pipe(f).reset_index(drop=True)
KS4 KS5
0 2014 2014
1 2015
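For reference, a minimal end-to-end sketch of this last approach, assuming nothing beyond pandas and the question's data:
import pandas as pd

df = pd.DataFrame({'Year': [2014, 2015, 2014, 2015],
                   'KS4': [True, True, False, False],
                   'KS5': [False, False, True, False]})

# one row per Year with True where that stage occurs,
# then turn each True into the year string (True * '2014' -> '2014', False * '2014' -> '')
f = lambda d: d.mul(d.index.to_series().astype(str), 0)
result = df.groupby('Year').any().pipe(f).reset_index(drop=True)
print(result)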

Related

generate date feature column using pandas

I have a timeseries data frame that has columns like these:
Date temp_data holiday day
01.01.2000 10000 0 1
02.01.2000 0 1 2
03.01.2000 2000 0 3
..
..
..
26.01.2000 200 0 26
27.01.2000 0 1 27
28.01.2000 500 0 28
29.01.2000 0 1 29
30.01.2000 200 0 30
31.01.2000 0 1 31
01.02.2000 0 1 1
02.02.2000 2500 0 2
Here, holiday = 0 when there is data present - it indicates a working day.
holiday = 1 when there is no data present - it indicates a non-working day.
I am trying to extract three new columns from this data: second_last_working_day_of_month, third_last_working_day_of_month and fourth_last_wday.
The output data frame should look like this:
Date temp_data holiday day secondlast_wd thirdlast_wd fourthlast_wd
01.01.2000 10000 0 1 1 0 0
02.01.2000 0 1 2 0 0 0
03.01.2000 2000 0 3 0 0 0
..
..
25.01.2000 345 0 25 0 0 1
26.01.2000 200 0 26 0 1 0
27.01.2000 0 1 27 0 0 0
28.01.2000 500 0 28 1 0 0
29.01.2000 0 1 29 0 0 0
30.01.2000 200 0 30 0 0 0
31.01.2000 0 1 31 0 0 0
01.02.2000 0 1 1 0 0 0
02.02.2000 2500 0 2 0 0 0
Can anyone help me with this?
Example
data = [['26.01.2000', 200, 0, 26], ['27.01.2000', 0, 1, 27], ['28.01.2000', 500, 0, 28],
['29.01.2000', 0, 1, 29], ['30.01.2000', 200, 0, 30], ['31.01.2000', 0, 1, 31],
['26.02.2000', 200, 0, 26], ['27.02.2000', 0, 0, 27], ['28.02.2000', 500, 0, 28],['29.02.2000', 0, 1, 29]]
df = pd.DataFrame(data, columns=['Date', 'temp_data', 'holiday', 'day'])
df
Date temp_data holiday day
0 26.01.2000 200 0 26
1 27.01.2000 0 1 27
2 28.01.2000 500 0 28
3 29.01.2000 0 1 29
4 30.01.2000 200 0 30
5 31.01.2000 0 1 31
6 26.02.2000 200 0 26
7 27.02.2000 0 0 27
8 28.02.2000 500 0 28
9 29.02.2000 0 1 29
Code
For example, to make the secondlast_wd column (n=2):
n = 2
s = pd.to_datetime(df['Date'])
result = df['holiday'].eq(0) & df.iloc[::-1, 2].eq(0).groupby(s.dt.month).cumsum().eq(n)
result
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 True
8 False
9 False
Name: holiday, dtype: bool
Assign the result to the secondlast_wd column:
df.assign(secondlast_wd=result.astype('int'))
output:
Date temp_data holiday day secondlast_wd
0 26.01.2000 200 0 26 0
1 27.01.2000 0 1 27 0
2 28.01.2000 500 0 28 1
3 29.01.2000 0 1 29 0
4 30.01.2000 200 0 30 0
5 31.01.2000 0 1 31 0
6 26.02.2000 200 0 26 0
7 27.02.2000 0 0 27 1
8 28.02.2000 500 0 28 0
9 29.02.2000 0 1 29 0
You can change n to get the third-last, fourth-last working day, and so on.
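A sketch that builds all three requested columns by varying n (the loop and the rev_workday_count name are my additions, extrapolated from the answer's code; it assumes df is the example frame above and pandas is imported as pd):
s = pd.to_datetime(df['Date'])
# running count of working days per month, walking backwards from the month end
rev_workday_count = df['holiday'].iloc[::-1].eq(0).groupby(s.dt.month).cumsum()
for n, name in [(2, 'secondlast_wd'), (3, 'thirdlast_wd'), (4, 'fourthlast_wd')]:
    df[name] = (df['holiday'].eq(0) & rev_workday_count.eq(n)).astype('int')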
Update for comment
Check working days (reversed index):
df.iloc[::-1, 2].eq(0)  # 2 is the column position of 'holiday'; df.loc[::-1, "holiday"] also works
9 False
8 True
7 True
6 True
5 False
4 True
3 False
2 True
1 False
0 True
Name: holiday, dtype: bool
Reversed cumulative sum by group (month): within each month, each working day adds 1 to the running count, while each holiday keeps the same value as the row above (in reversed index order).
df.iloc[::-1, 2].eq(0).groupby(s.dt.month).cumsum()
9 0
8 1
7 2
6 3
5 0
4 1
3 1
2 2
1 2
0 3
Name: holiday, dtype: int64
Find rows where holiday == 0 and the cumulative count equals 2; those rows are secondlast_wd:
df['holiday'].eq(0) & df.iloc[::-1, 2].eq(0).groupby(s.dt.month).cumsum().eq(2)
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 True
8 False
9 False
Name: holiday, dtype: bool
This operation returns the index in its original (not reversed) order.
Other Way
More readable code would be:
s = pd.to_datetime(df['Date'])
idx1 = df[df['holiday'].eq(0)].groupby(s.dt.month, as_index=False).nth(-2).index
df.loc[idx1, 'lastsecondary_wd'] = 1
df['lastsecondary_wd'] = df['lastsecondary_wd'].fillna(0).astype('int')
This gives the same result.
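The same idea extends to the other two columns by changing the position passed to nth (a sketch I extrapolated from the answer; the column names follow the question):
s = pd.to_datetime(df['Date'])
workdays = df[df['holiday'].eq(0)]
for pos, name in [(-2, 'secondlast_wd'), (-3, 'thirdlast_wd'), (-4, 'fourthlast_wd')]:
    # index of the pos-th last working day in each month
    idx = workdays.groupby(s.dt.month, as_index=False).nth(pos).index
    df[name] = 0
    df.loc[idx, name] = 1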

Pandas Equivalent of Excel COUNTIFS

I've read through some previous questions and am having trouble implementing the answers. Here is my table:
Value Bool
abc TRUE
abc TRUE
bca TRUE
bca FALSE
asd FALSE
asd FALSE
I want this:
Value Bool Count
abc TRUE 2
abc TRUE 2
bca TRUE 1
bca FALSE 1
asd FALSE 0
asd FALSE 0
For each group of terms in Value, count the number of occurrences of TRUE, which is a boolean in my df.
In Excel you can do COUNTIFS to do this. Can someone please show me the way in Pandas?
Try with groupby transform:
df['Count']=df.groupby('Value')['Bool'].transform('sum')
print(df)
Value Bool Count
0 abc True 2.0
1 abc True 2.0
2 bca True 1.0
3 bca False 1.0
4 asd False 0.0
5 asd False 0.0
Or:
df['Count']=df.groupby('Value')['Bool'].transform(lambda x: x.sum())
print(df)
Value Bool Count
0 abc True 2
1 abc True 2
2 bca True 1
3 bca False 1
4 asd False 0
5 asd False 0
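If integer counts are preferred over the floats shown in the first output, the result can be cast (a small addition on top of the answer, not part of the original):
df['Count'] = df.groupby('Value')['Bool'].transform('sum').astype(int)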

Replace a string value with NaN in pandas data frame - Python

I have to replace the value '?' with NaN so I can invoke the .isnull() method. I have found several solutions, but errors are always returned. Suppose:
data = pd.DataFrame([[1,'?',5],['?','?',4],['?',32.1,1]])
and if I try:
pd.data.replace('?', np.nan)
I have:
0 1 2
0 1.0 NaN 5
1 NaN NaN 4
2 NaN 32.1 1
but data.isnull() returns:
0 1 2
0 False False False
1 False False False
2 False False False
Why?
I think you forgot to assign back:
data = pd.DataFrame([[1,'?',5],['?','?',4],['?',32.1,1]])
data = data.replace('?', np.nan)
#alternative
#data.replace('?', np.nan, inplace=True)
print (data)
0 1 2
0 1.0 NaN 5
1 NaN NaN 4
2 NaN 32.1 1
print (data.isnull())
0 1 2
0 False True False
1 True True False
2 True False False
# a dataframe with string values
dat = pd.DataFrame({'a':[1,'FG', 2, 4], 'b':[2, 5, 'NA', 7]})
Removing non-numerical elements from the dataframe:
"Method 1 - with regex"
# replace any cell that is a pure word-character string (letters, digits or underscores) with NaN
dat2 = dat.replace(r'^([A-Za-z]|[0-9]|_)+$', np.NaN, regex=True)
dat2
"Method 2 - with pd.to_numeric"
dat3 = pd.DataFrame()
for col in dat.columns:
    dat3[col] = pd.to_numeric(dat[col], errors='coerce')
dat3
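The loop can also be written in one line by applying the conversion to every column (a small sketch, not part of the original answer):
dat3 = dat.apply(pd.to_numeric, errors='coerce')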
'?' is not null, so you would expect to get False from the isnull test.
>>> data = pandas.DataFrame([[1,'?',5],['?','?',4],['?',32.1,1]])
>>> data.isnull()
0 1 2
0 False False False
1 False False False
2 False False False
After you replace '?' with NaN, the test looks much different:
>>> data = data.replace('?', np.nan)
>>> data.isnull()
0 1 2
0 False True False
1 True True False
2 True False False
I believe that when you do pd.data.replace('?', np.nan), the action is not done in place, so you must try:
data = data.replace('?', np.nan)
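If the '?' markers come from a CSV file, pandas can also be told to treat them as missing while reading (a sketch; the file name is hypothetical):
data = pd.read_csv('data.csv', na_values='?')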

Calculations within the same category

Small data frame example:
ID V1 V2 is
1 01 23569.5 0.138996 FALSE
2 01 23611.5 1.318343 TRUE
3 01 23636.0 0.071871 FALSE
4 01 23665.5 0.081087 FALSE
5 01 33417.5 0.102158 FALSE
6 01 33563.5 0.119645 FALSE
7 01 42929.5 0.175000 FALSE
8 01 44552.5 0.066056 FALSE
9 01 45539.5 0.227691 FALSE
10 01 46984.5 0.649687 FALSE
11 01 47018.0 0.932445 FALSE
12 02 23611.5 1.418377 TRUE
13 02 23667.5 0.474754 FALSE
14 02 46984.0 0.443233 FALSE
15 02 47018.0 0.847738 FALSE
16 02 47051.5 0.446792 FALSE
17 02 47096.5 3.602696 FALSE
18 03 23464.0 1.010199 FALSE
19 03 23523.5 0.150067 FALSE
20 03 23611.5 1.273281 TRUE
21 03 29608.0 0.071324 FALSE...
There is only one row within each ID category with is=TRUE. I would like to know a convenient way of calculating the ratio V2(is=FALSE)/V2(is=TRUE) within each ID and adding the result as a new column, like this:
ID V1 V2 is Ratio
1 1 23569.5 0.138996 FALSE 0.10543235
2 1 23611.5 1.318343 TRUE 1
3 1 23636 0.071871 FALSE 0.054516162
4 1 23665.5 0.081087 FALSE 0.061506755
5 1 33417.5 0.102158 FALSE 0.077489697
6 1 33563.5 0.119645 FALSE 0.090754075
7 1 42929.5 0.175000 FALSE 0.132742389
8 1 44552.5 0.066056 FALSE 0.050105322
9 1 45539.5 0.227691 FALSE 0.172709985
10 1 46984.5 0.649687 FALSE 0.492805742
11 1 47018 0.932445 FALSE 0.707285585
12 2 23611.5 1.418377 TRUE 1
13 2 23667.5 0.474754 FALSE 0.334716369
14 2 46984 0.443233 FALSE 0.312493082
15 2 47018 0.847738 FALSE 0.597681716
16 2 47051.5 0.446792 FALSE 0.315002288
17 2 47096.5 3.602696 FALSE 2.540012987
18 3 23464 1.010199 FALSE 0.793382608
19 3 23523.5 0.150067 FALSE 0.117858509
20 3 23611.5 1.273281 TRUE 1
21 3 29608 0.071324 FALSE 0.056015915...
I am sorry for the trivial question; however, my searching has not turned up the solution I am looking for.
I assume that your dataframe is called data and already sorted by ID.
Select records with is==TRUE:
data.true = data[data$is==TRUE,]
Obtain run length encoding of ID:
rle.id = rle(data$ID)
For each V2 with is==TRUE, copy it as many times as there are members in its group (note the capital V2; column names are case-sensitive):
v2.true = rep(data.true$V2, rle.id$lengths)
Make the division:
data$Ratio = data$V2/v2.true

Find string in multiple columns?

I have a dataframe with 3 columns: tel1, tel2, tel3.
I want to keep rows that contain a specific value in one or more columns.
For example, I want to keep rows where tel1, tel2 or tel3 starts with '06'.
How can I do that?
Thanks
Let's use this df as an example DataFrame:
In [54]: df = pd.DataFrame({'tel{}'.format(j): ['{:02d}'.format(i+j) for i in range(10)]
                            for j in range(3)})
In [71]: df
Out[71]:
tel0 tel1 tel2
0 00 01 02
1 01 02 03
2 02 03 04
3 03 04 05
4 04 05 06
5 05 06 07
6 06 07 08
7 07 08 09
8 08 09 10
9 09 10 11
You can find which values in df['tel0'] start with '06' using
StringMethods.startswith:
In [72]: df['tel0'].str.startswith('06')
Out[72]:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
7 False
8 False
9 False
Name: tel0, dtype: bool
To combine two boolean Series with logical-or, use |:
In [73]: df['tel0'].str.startswith('06') | df['tel1'].str.startswith('06')
Out[73]:
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 False
8 False
9 False
dtype: bool
Or, if you want to combine a list of boolean Series using logical-or, you could use reduce:
In [79]: import functools
In [80]: import numpy as np
In [80]: mask = functools.reduce(np.logical_or, [df['tel{}'.format(i)].str.startswith('06') for i in range(3)])
In [81]: mask
Out[81]:
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 False
8 False
9 False
Name: tel0, dtype: bool
Once you have the boolean mask, you can select the associated rows using df.loc:
In [75]: df.loc[mask]
Out[75]:
tel0 tel1 tel2
4 04 05 06
5 05 06 07
6 06 07 08
Note there are many other vectorized str methods besides startswith.
You might find str.contains useful for finding which rows contain a string. Note that str.contains interprets its argument as a regex pattern by default:
In [85]: df['tel0'].str.contains(r'6|7')
Out[85]:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
7 True
8 False
9 False
Name: tel0, dtype: bool
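To run the same startswith test over every column at once, the per-column masks can also be built with apply and combined with any (a compact sketch of mine, assuming all columns hold strings):
mask = df.apply(lambda col: col.str.startswith('06')).any(axis=1)
df.loc[mask]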
I like to use dataframe.apply in such situations:
#search the dataframe across multiple columns
#generate some random numbers
import random as r
rand_numbers = [[r.randint(100000, 9999999) for __ in range(3)] for _ in range(20)]
df = pd.DataFrame.from_records(rand_numbers, columns=['tel1', 'tel2', 'tel3'])
df.head()

#a really simple search function
#if you need speed, use Cython here ;-)
def searchfilter(row, search='5'):
    #df.apply passes each row (or column) as a Series
    for string in row:
        #string is a number here, so we must cast it
        if str(string).startswith(search):
            return True
    #no column matched
    return False

#apply the search function to each row
result_bool_array = df.apply(searchfilter, axis=1)  #axis=1 runs it row-wise
df[result_bool_array]

#other search with a lambda in apply
result_bool_array = df.apply(lambda row: searchfilter(row, search='6'), axis=1)