Slicing a dataframe from unstack() - pandas

This is a dataframe, df, that was created by unstack():
Index    0  7  21  22  23
June    89  0   3   5   2
July    30  0   2   5   4
August  20  8   5   5   5
I tried to slice a portion of the dataframe using
df2=df.loc[: , :'21']
But I have this error:
KeyError: '21'

The error means there is no string column '21'. Check it by printing the columns:
# integer columns
print (df.columns.tolist())
[0, 7, 21, 22, 23]

# string columns would look like this instead:
# print (df.columns.tolist())
# ['0', '7', '21', '22', '23']
For a Series, select with the integer 21:
df2 = df[21]
print (df2)
Index
June 3
July 2
August 5
Name: 21, dtype: int64
And for a one-column DataFrame, use double brackets:
df2 = df[[21]]
print (df2)
21
Index
June 3
July 2
August 5
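And to slice all columns up to 21, as in the original attempt, pass the integer rather than the string to .loc (a minimal sketch, assuming the integer columns shown above):
df2 = df.loc[:, :21]
print (df2)
         0  7  21
Index
June    89  0   3
July    30  0   2
August  20  8   5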
EDIT:
Another possible problem is trailing whitespace:
print (df.columns.tolist())
['0', '7', ' 21 ', '22', '23']
which can be removed with str.strip:
df.columns = df.columns.str.strip()
print (df.columns.tolist())
['0', '7', '21', '22', '23']
EDIT1:
To filter columns with values less than 24 even when some labels are bad data, use to_numeric with errors='coerce', which returns NaN for unparseable values:
df2 = df.loc[:, pd.to_numeric(df.columns, errors='coerce') < 24]
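A minimal sketch of how this behaves, assuming a hypothetical frame with one unparseable column label:
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5]],
                  columns=['0', '7', '21', 'foo', '24'])

# 'foo' coerces to NaN and NaN < 24 is False, so it is dropped;
# '24' parses fine but fails the < 24 test
df2 = df.loc[:, pd.to_numeric(df.columns, errors='coerce') < 24]
print (df2)
#    0  7  21
# 0  1  2   3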

Related

Sum specific columns of pandas dataframe based on if column name ends with string and begins with value in another column

I have a pandas dataframe like this
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([['X', 2, 3, 4, 5, 6, 7], ['Y', 8, 9, 10, 11, 12, 13], ['X', 14, 15, 16, 17, 18, 19]],
   ...:                   columns=['name', 'X 1_V1', 'X 1_V2', 'Y 1_V1', 'Y 1_V2', 'X 2_V1', 'X 2_V2'])
In [3]: print(df)
  name  X 1_V1  X 1_V2  Y 1_V1  Y 1_V2  X 2_V1  X 2_V2
0    X       2       3       4       5       6       7
1    Y       8       9      10      11      12      13
2    X      14      15      16      17      18      19
I want to sum the columns that begin with the value in the 'name' column and end with 'V1'. So the 1st and 3rd rows would sum the 'X 1_V1' and 'X 2_V1' columns, while the 2nd row would sum the 'Y 1_V1' column.
In [4]: df['sum']
Out[4]:
0 8
1 10
2 32
Name: sum, dtype: int64
I have tried
df["sum_Area"] = df[[x for x in df.columns if (x.split(' ')[0] == df['name']) and (x.endswith('peak_area'))]].sum(axis = "columns")
But receive the fault : ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). The column names are strings
The comparison x.split(' ')[0] == df['name'] matches each column name against the entire name column, which is why the truth value of a Series is ambiguous. Compare against each row's own name instead, for example with apply over rows:
df['sum'] = df.apply(lambda x: sum([x[c] for c in df.columns
                                    if c.split()[0] == x['name'] and c.endswith('V1')]), axis=1)
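If performance matters, a vectorized alternative is possible; this broadcasting sketch is my own illustration (not from the original answer) and assumes the sample frame above and pandas 0.24+ for .to_numpy():
import numpy as np

v1_cols = [c for c in df.columns if c.endswith('V1')]
prefixes = np.array([c.split()[0] for c in v1_cols])

# boolean mask, rows x V1-columns: does the column prefix match the row's name?
mask = df['name'].to_numpy()[:, None] == prefixes[None, :]
df['sum'] = (df[v1_cols].to_numpy() * mask).sum(axis=1)
# 0     8
# 1    10
# 2    32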

Pandas transform rows with specific character

I am working on feature transformation and ran into this issue. Let me know what you think. Thanks!
I have a table of number strings, and I want to create an output column derived from it (copy-and-pasteable sample data is below).
Some info:
All the outputs will be based on the numbers that are followed by a ':' (e.g. the 15 in '15:00')
I have 100M+ rows in this table, so performance needs to be considered.
Let me know if you have some good ideas. Thanks!
Here is some copy-and-pasteable sample data:
df = pd.DataFrame({'Number': {0: '1000', 1: '1000021', 2: '15:00', 3: '23424234',
                              4: '23423', 5: '3', 6: '9:00', 7: '3423', 8: '32', 9: '7:00'}})
Solution #1:
You can use .str.contains(':') with np.where() to take the part before the colon where a ':' is present, and np.nan otherwise. Then use ffill() to fill down over the NaN values:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Number': {0: '1000', 1: '1000021', 2: '15:00', 3: '23424234',
                              4: '23423', 5: '3', 6: '9:00', 7: '3423', 8: '32', 9: '7:00'}})
df['Output'] = np.where(df['Number'].str.contains(':'),
                        df['Number'].str.split(':').str[0],
                        np.nan)
df['Output'] = df['Output'].ffill()
df
Solution #2 - Even easier, and with potentially better performance: use a regex with str.extract() (expand=False to get a Series back) and then again ffill():
df['Output'] = df['Number'].str.extract(r'^(\d+):', expand=False).ffill()
df
Out[1]:
     Number Output
0      1000    NaN
1   1000021    NaN
2     15:00     15
3  23424234     15
4     23423     15
5         3     15
6      9:00      9
7      3423      9
8        32      9
9      7:00      7
I think this is what you are looking for:
import pandas as pd

c = ['Number']
d = ['1:00', 100, 1001, 1321, 3254, '15:00', 20, 60, 80, 90, '4:00', 26, 45, 90, 89]
df = pd.DataFrame(data=d, columns=c)

# .str.split returns NaN for the non-string entries, so only the 'h:00'
# strings produce a value in column 0, which is then filled down
temp = df['Number'].str.split(":", n=1, expand=True)
df['New_Val'] = temp[0].ffill()
print(df)
The output of this will be as follows:
Number New_Val
0 1:00 1
1 100 1
2 1001 1
3 1321 1
4 3254 1
5 15:00 15
6 20 15
7 60 15
8 80 15
9 90 15
10 4:00 4
11 26 4
12 45 4
13 90 4
14 89 4
The code above treats the values as a mix of numbers and strings, but your sample DataFrame has all string values. Here is the solution if df['Number'] contains only strings:
df1 = pd.DataFrame({'Number': {0: '1000', 1: '1000021', 2: '15:00', 3: '23424234',
                               4: '23423', 5: '3', 6: '9:00', 7: '3423', 8: '32', 9: '7:00'}})
temp = df1['Number'].str.split(":", n=1, expand=True)
# keep the part before the colon only where a second part exists
temp.loc[temp[1].notna(), 'New_val'] = temp[0]
df1['New_val'] = temp['New_val'].ffill()
print (df1)
The output of df1 will be:
Number New_val
0 1000 NaN
1 1000021 NaN
2 15:00 15
3 23424234 15
4 23423 15
5 3 15
6 9:00 9
7 3423 9
8 32 9
9 7:00 7

Data standardization of features having lt/gt values among absolute values

One of the datasets I am dealing with has a few features that contain less-than/greater-than values alongside absolute values. Please refer to the example below:
>>> df = pd.DataFrame(['<10', '23', '34', '22', '>90', '42'], columns=['foo'])
>>> df
foo
0 <10
1 23
2 34
3 22
4 >90
5 42
Note: foo is a percentage value, i.e. 0 <= foo <= 100.
How are such data transformed to run regression models on?
One thing you could do is, for values <10, impute the midpoint of the range (5). Similarly, for those >90, impute 95.
Then add two extra boolean columns:
df = pd.DataFrame(['<10', '23', '34', '22', '>90', '42'], columns=['foo'])
dummies = pd.get_dummies(df, columns=['foo'])[['foo_<10', 'foo_>90']]
df = df.replace('<10', 5).replace('>90', 95)
df = pd.concat([df, dummies], axis=1)
df
This will give you
foo foo_<10 foo_>90
0 5 1 0
1 23 0 0
2 34 0 0
3 22 0 0
4 95 0 1
5 42 0 0
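Note that after replace, the foo column still mixes strings ('23') with integers (5, 95), and a regression model needs it numeric. A small follow-up sketch (this conversion step is my addition, not part of the original answer):
# convert the mixed string/int column to numeric before fitting a model
df['foo'] = pd.to_numeric(df['foo'])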

Pandas Flatten a Complex Multi-level column dataframe

I initially had a dataframe with columns ID and Date, and I wanted to find the first and last Date entry for every ID. Therefore I applied an aggregation function:
df.groupby('ID').agg({'Date':['first','last']})
I have a dataframe in the following form:
print(df.columns)
>> MultiIndex(levels=[['Date', 'ID', 'difference'], ['first', 'last', '']],
labels=[[1, 0, 0, 2], [2, 0, 1, 2]])
I want to flatten this dataframe so that the columns form a single level.
I tried using df.reset_index(level=[0]) and also df.unstack(), but couldn't get the desired result. Any leads on how to solve this problem?
I think you need to change the aggregation to avoid a MultiIndex in the columns: specify the column to aggregate and pass a list of aggregating functions:
rng = pd.date_range('2017-04-03', periods=10)
df = pd.DataFrame({'Date': rng, 'id': [23] * 5 + [35] * 5})
print (df)
Date id
0 2017-04-03 23
1 2017-04-04 23
2 2017-04-05 23
3 2017-04-06 23
4 2017-04-07 23
5 2017-04-08 35
6 2017-04-09 35
7 2017-04-10 35
8 2017-04-11 35
9 2017-04-12 35
df1 = df.groupby('id')['Date'].agg(['first','last']).reset_index()
print (df1)
id first last
0 23 2017-04-03 2017-04-07
1 35 2017-04-08 2017-04-12
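If you already have the MultiIndex columns and want to flatten them in place instead, a common sketch (my addition, not from the original answer) joins the two levels into single strings:
# flatten ('Date', 'first') -> 'Date_first'; strip('_') handles empty second levels
df2 = df.groupby('id').agg({'Date': ['first', 'last']})
df2.columns = ['_'.join(col).strip('_') for col in df2.columns]
df2 = df2.reset_index()
# columns are now: id, Date_first, Date_last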

Apply rolling function to groupby over several columns

I'd like to apply rolling functions to a dataframe grouped by two columns with repeated date entries. Specifically, with both "freq" and "window" as datetime values, not simply ints.
In principle, I'm trying to combine the methods from How to apply rolling functions in a group by object in pandas and pandas rolling sum of last five minutes.
Input
Here is a sample of the data, with a single id=33, although we expect several ids.
X = [{'date': '2017-02-05', 'id': 33, 'item': 'A', 'points': 20},
     {'date': '2017-02-05', 'id': 33, 'item': 'B', 'points': 10},
     {'date': '2017-02-06', 'id': 33, 'item': 'B', 'points': 10},
     {'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
     {'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
     {'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
     {'date': '2017-02-13', 'id': 33, 'item': 'A', 'points': 4}]
# df = pd.DataFrame(X), with the index set to pd.to_datetime(df['date'])
df
id item points
date
2017-02-05 33 A 20
2017-02-05 33 B 10
2017-02-06 33 B 10
2017-02-11 33 A 1
2017-02-11 33 A 1
2017-02-11 33 A 1
2017-02-13 33 A 4
Goal
Sample each 'id' every 2 days (freq='2d') and return the sum of total points for each item over the previous three days (window='3D'), end-date inclusive
Desired Output
            id   A   B
date
2017-02-05  33  20  10
2017-02-07  33  20  20
2017-02-09  33   0   0
2017-02-11  33   3   0
2017-02-13  33   7   0
E.g. on the right-inclusive end-date 2017-02-13, we sample the 3-day period 2017-02-11 to 2017-02-13. In this period, id=33 had a sum of A points equal to 1+1+1+4 = 7
Attempts
An attempt at groupby with pd.rolling_sum as follows didn't work, due to repeated dates:
df.groupby(['id', 'item'])['points'].apply(pd.rolling_sum, freq='4D', window=3)
ValueError: cannot reindex from a duplicate axis
Also note that, per the documentation http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.rolling_apply.html, 'window' is an int giving the number of observations in the window, not the number of days to sample.
We can also try resampling and using last; however, the desired look-back of 3 days doesn't seem to be applied:
df.groupby(['id', 'item'])['points'].resample('2D', label='right', closed='right').\
apply(lambda x: x.last('3D').sum())
id item date
33 A 2017-02-05 20
2017-02-07 0
2017-02-09 0
2017-02-11 3
2017-02-13 4
B 2017-02-05 10
2017-02-07 10
Of course, setting up a loop over unique ids ID, selecting df_id = df[df['id'] == ID], and summing over the periods does work, but it is computationally intensive and doesn't exploit groupby's vectorization.
Thanks to @jezrael for good suggestions so far.
Notes
Pandas version = 0.20.1
I'm a little confused as to why the documentation on rolling() here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html
suggests that the "window" parameter can be an int or an offset, but on attempting df.rolling(window='3D', ...) I get: ValueError: window must be an integer
It appears that the above documentation is not consistent with the latest code for rolling's window from ./core/window.py :
https://github.com/pandas-dev/pandas/blob/master/pandas/core/window.py
elif not is_integer(self.window):
raise ValueError("window must be an integer")
It's easiest to handle resample and rolling with date frequencies when we have a single-level datetime index.
1. However, I can't pivot/unstack appropriately without dealing with duplicate A/Bs, so I groupby and sum first.
2. I unstack the date level so I can use fill_value=0. Currently, I can't use fill_value=0 when unstacking more than one level at a time, so I make up for it with a transpose T.
3. Now that I've got a single level in the index, I reindex with a date range from the min to the max values in the index.
4. Then I do a rolling 3-day sum and sample that result every 2 days with resample.
5. Finally, I clean this up with a bit of renaming of indices and one more pivot.
s = df.set_index(['id', 'item'], append=True).points
s = s.groupby(level=['date', 'id', 'item']).sum()
d = s.unstack('date', fill_value=0).T
tidx = pd.date_range(d.index.min(), d.index.max())
d = d.reindex(tidx, fill_value=0)
d1 = d.rolling('3D').sum().resample('2D').first().astype(d.dtypes).stack(0)
d1 = d1.rename_axis(['date', 'id']).rename_axis(None, 1)
print(d1)
A B
date id
2017-02-05 33 20 10
2017-02-07 33 20 20
2017-02-09 33 0 0
2017-02-11 33 3 0
2017-02-13 33 7 0
df = pd.DataFrame(X)
# group sum by day, so each (date, id, item) appears once
df = df.groupby(['date', 'id', 'item'])['points'].sum().reset_index().sort_values(['date', 'id', 'item'])
# convert index to a DatetimeIndex
df = df.set_index('date')
df.index = pd.DatetimeIndex(df.index)
# rolling sum over a 3-day window within each (id, item) group
df['pointsum'] = df.groupby(['id', 'item'])['points'].transform(lambda x: x.rolling(window='3D').sum())
# reshape dataframe
df = df.reset_index().set_index(['date', 'id', 'item'])['pointsum'].unstack().reset_index().set_index('date').fillna(0)
df
Note that this computes the rolling sums only at the observed dates; to match the goal exactly, the result would still need to be sampled every 2 days.