Create new nested column within dataframe - pandas

I have the following
import pandas as pd

df1 = pd.DataFrame({'data': [1, 2, 3]})
df2 = pd.DataFrame({'data': [4, 5, 6]})
df = pd.concat([df1, df2], keys=['hello', 'world'], axis=1)
What is the "proper" way of creating a new nested column (say, df['world']['data']*2) within the hello column? I have tried df['hello']['new_col'] = df['world']['data']*2 but this does not seem to work.

Use tuples for select and set MultiIndex:
df[('hello','new_col')] = df[('world','data')]*2
print (df)
  hello world   hello
   data  data new_col
0     1     4       8
1     2     5      10
2     3     6      12
Selecting like df['world']['data'] is not recommended (see the pandas docs on returning a view versus a copy), because it is chained indexing: assigning through it may silently fail to propagate back to the original DataFrame.
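If you want the new column grouped under hello next to the existing one rather than appended at the end, one option (a small follow-up sketch, not part of the original answer) is to sort the MultiIndex columns:
df = df.sort_index(axis=1)
print (df)
  hello         world
   data new_col  data
0     1       8     4
1     2      10     5
2     3      12     6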

Related

GroupBy-Apply even for empty DataFrame

I am using groupby-apply to create a new DataFrame from a given DataFrame. But if the given DataFrame is empty, the result looks like the given DataFrame (plus the group keys), not like the target new DataFrame. So to get the shape of the target DataFrame I have to use an if..else with a length check, and if the given DataFrame is empty, manually create a DataFrame with the specified columns and index.
That rather breaks the flow of the code. Also, if the structure of the target DataFrame changes in the future, I will have to fix the code in two places instead of one.
Is there a way to get the shape of the target DataFrame even if the given DataFrame is empty, with GroupBy only (or at least without the if..else)?
Simplified example:
def some_func(df: pd.DataFrame):
    return df.values.sum() + pd.DataFrame([[1, 1, 1], [2, 2, 2], [3, 3, 3]],
                                          columns=['new_col1', 'new_col2', 'new_col3'])
df1 = pd.DataFrame([[1,1], [1,2], [2,1], [2,2]], columns=['col1', 'col2'])
df2 = pd.DataFrame(columns=['col1', 'col2'])
df1_grouped = df1.groupby(['col1'], group_keys=False).apply(lambda df: some_func(df))
df2_grouped = df2.groupby(['col1'], group_keys=False).apply(lambda df: some_func(df))
Result for df1 is ok:
   new_col1  new_col2  new_col3
0         6         6         6
1         7         7         7
2         8         8         8
0         8         8         8
1         9         9         9
2        10        10        10
And not ok for df2:
Empty DataFrame
Columns: [col1, col2]
Index: []
If..else to get expected result for df2:
df = df2
if df.empty:
    df_grouped = pd.DataFrame(columns=['new_col1', 'new_col2', 'new_col3'])
else:
    df_grouped = df.groupby(['col1'], group_keys=False).apply(lambda df: some_func(df))
Gives what I need:
Empty DataFrame
Columns: [new_col1, new_col2, new_col3]
Index: []
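One way to at least keep the target schema in a single place (a sketch; grouped and TARGET_COLS are my hypothetical names, not from the original post) is to wrap the length check in a helper, so a future schema change only touches one definition:
TARGET_COLS = ['new_col1', 'new_col2', 'new_col3']

def grouped(df: pd.DataFrame) -> pd.DataFrame:
    # empty input: return an empty frame with the target schema
    if df.empty:
        return pd.DataFrame(columns=TARGET_COLS)
    return df.groupby(['col1'], group_keys=False).apply(some_func)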

Multiplying two data frames in pandas

I have two data frames, df1 and df2, as shown below. I want to create a third dataframe df, also shown below. What would be the appropriate way?
df1 = {'id': ['a', 'b', 'c'],
       'val': [1, 2, 3]}
df1 = pd.DataFrame(df1)
df1
  id  val
0  a    1
1  b    2
2  c    3
df2 = {'yr': ['2010', '2011', '2012'],
       'val': [4, 5, 6]}
df2 = pd.DataFrame(df2)
df2
     yr  val
0  2010    4
1  2011    5
2  2012    6
df = {'id': ['a', 'b', 'c'],
      'val': [1, 2, 3],
      '2010': [4, 8, 12],
      '2011': [5, 10, 15],
      '2012': [6, 12, 18]}
df = pd.DataFrame(df)
df
  id  val  2010  2011  2012
0  a    1     4     5     6
1  b    2     8    10    12
2  c    3    12    15    18
I can basically treat df1.val and df2.val as n-by-1 and 1-by-n matrices, take their n-by-n product, and assign the result back to df1. But is there an easy pandas way?
TL;DR
We can do it in one line like this:
df1.join(df1.val.apply(lambda x: x * df2.set_index('yr').val))
or like this:
df1.join(df1.set_index('id') @ df2.set_index('yr').T, on='id')
Done.
The long story
Let's see what's going on here.
To multiply each value of df1.val by the values in df2.val, we use apply:
df1['val'].apply(lambda x: x * df2.val)
The function inside receives the values of df1.val one by one and multiplies each by df2.val element-wise (see broadcasting for details if needed). Since df2.val is a pandas Series, the output is a data frame with index df1.val.index and columns df2.val.index. With df2.set_index('yr') we make the years the index before the multiplication, so that they become the column names in the output.
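For the sample data, that intermediate result looks roughly like this (reconstructed output, not from the original post):
yr  2010  2011  2012
0      4     5     6
1      8    10    12
2     12    15    18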
DataFrame.join joins frames index-on-index by default. So, because the indexes of df1 and of the multiplication output are identical, we can apply df1.join( <the output of multiplication> ) as is.
In the end we get the desired matrix with index df1.index and columns id, val, *df2['yr'].
The second variant, with the @ operator, is essentially the same. The main difference is that we multiply 2-dimensional frames instead of Series: these are the vertical and horizontal vectors, respectively. So the matrix multiplication produces a frame with index df1.id, columns df2.yr, and the outer product as values. Finally we join df1 with the output, matching df1's id column against the output's index.
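Again for the sample data, the matrix product alone looks roughly like this (reconstructed output):
df1.set_index('id') @ df2.set_index('yr').T
yr  2010  2011  2012
id
a      4     5     6
b      8    10    12
c     12    15    18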
This works for me:
import numpy as np

df2 = df2.T
new_df = pd.DataFrame(np.outer(df1['val'], df2.iloc[1:]))
df = pd.concat([df1, new_df], axis=1)
df.columns = ['id', 'val', '2010', '2011', '2012']
df
The output I get:
  id  val  2010  2011  2012
0  a    1     4     5     6
1  b    2     8    10    12
2  c    3    12    15    18
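A small variant (my sketch, not from the original answer) starts from the original df2, avoiding both the transpose and the hardcoded column names by deriving them from df2['yr']:
new_df = pd.DataFrame(np.outer(df1['val'], df2['val']), columns=df2['yr'])
df = pd.concat([df1, new_df], axis=1)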
Your question is a bit vague, but I suppose you want to do something like this:
df = pd.concat([df1, df2], axis=1)
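For reference, that simply places the two frames side by side rather than multiplying them (reconstructed output):
  id  val    yr  val
0  a    1  2010    4
1  b    2  2011    5
2  c    3  2012    6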

Adding file name to column name pandas dataframe

I have a pandas dataframe created from several csv files. The csv files are all structured the same way, so I have the same column names over and over again. I want the column names to be expanded with the names of the files (which I have as a list) they come from.
From an earlier question I know how to add a count to duplicate column names, and I know how to rename columns. But I am failing at attaching the right file name to the right columns.
That should be the relevant part of the code:
for i in range(len(file_list)):
    data = pd.read_table(file_list[i], encoding='unicode_escape')
    df = pd.DataFrame(data)
    df = df.drop(droplist, axis=1)
    main_dataframe = pd.concat([main_dataframe, df], axis=1)
You can use a dictionary in concat to generate a MultiIndex:
list_of_files = ['f1.csv', 'f2.csv']
pd.concat({f: pd.read_table(f, encoding='unicode_escape', sep=',')
           for f in list_of_files}, axis=1)
example:
# f1.csv
a,b
1,2
3,4
# f2.csv
a,b
5,6
7,8
output:
  f1.csv    f2.csv
       a  b      a  b
0      1  2      5  6
1      3  4      7  8
Alternative using add_prefix in the list comprehension:
pd.concat([pd.read_table(f, encoding='unicode_escape', sep=',')
             .add_prefix(f[:-3])  # add the file name (minus the "csv" extension) as prefix
           for f in list_of_files], axis=1)
output:
   f1.a  f1.b  f2.a  f2.b
0     1     2     5     6
1     3     4     7     8
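If the file names vary, a slightly more robust variant (my sketch) strips the extension with pathlib instead of slicing, giving prefixes like f1 and f2:
from pathlib import Path

pd.concat({Path(f).stem: pd.read_table(f, encoding='unicode_escape', sep=',')
           for f in list_of_files}, axis=1)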

Parsing python list of dates into a pandas DataFrame

I need some help/advice on how to wrangle dates into a pandas DataFrame. I have a Python list looking like this:
['',
'20180715:1700-20180716:1600',
'20180716:1700-20180717:1600',
'20180717:1700-20180718:1600',
'20180718:1700-20180719:1600',
'20180719:1700-20180720:1600',
'20180721:CLOSED',
'20180722:1700-20180723:1600',
'20180723:1700-20180724:1600',
'20180724:1700-20180725:1600',
'20180725:1700-20180726:1600',
'20180726:1700-20180727:1600',
'20180728:CLOSED']
Is there an easy way to transform this into a Pandas DataFrame with two columns (start time and end time)?
Sample:
L = ['',
'20180715:1700-20180716:1600',
'20180716:1700-20180717:1600',
'20180717:1700-20180718:1600',
'20180718:1700-20180719:1600',
'20180719:1700-20180720:1600',
'20180721:CLOSED',
'20180722:1700-20180723:1600',
'20180723:1700-20180724:1600',
'20180724:1700-20180725:1600',
'20180725:1700-20180726:1600',
'20180726:1700-20180727:1600',
'20180728:CLOSED']
I think the best approach here is a list comprehension that splits each value on the separator and filters out the values that contain no separator:
df = pd.DataFrame([x.split('-') for x in L if '-' in x], columns=['start','end'])
print (df)
           start            end
0  20180715:1700  20180716:1600
1  20180716:1700  20180717:1600
2  20180717:1700  20180718:1600
3  20180718:1700  20180719:1600
4  20180719:1700  20180720:1600
5  20180722:1700  20180723:1600
6  20180723:1700  20180724:1600
7  20180724:1700  20180725:1600
8  20180725:1700  20180726:1600
9  20180726:1700  20180727:1600
A pure pandas solution is also possible, especially if you need to process a Series; here str.split and dropna are used:
s = pd.Series(L)
df = s.str.split('-', expand=True).dropna(subset=[1])
df.columns = ['start','end']
print (df)
            start            end
1   20180715:1700  20180716:1600
2   20180716:1700  20180717:1600
3   20180717:1700  20180718:1600
4   20180718:1700  20180719:1600
5   20180719:1700  20180720:1600
7   20180722:1700  20180723:1600
8   20180723:1700  20180724:1600
9   20180724:1700  20180725:1600
10  20180725:1700  20180726:1600
11  20180726:1700  20180727:1600
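If you then need real timestamps rather than strings, a possible follow-up (the format string is assumed from the sample data, YYYYMMDD:HHMM):
df['start'] = pd.to_datetime(df['start'], format='%Y%m%d:%H%M')
df['end'] = pd.to_datetime(df['end'], format='%Y%m%d:%H%M')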

Pandas - Trying to create a list or Series in a data frame cell

I have the following data frame
df = pd.DataFrame({'A':[74.75, 91.71, 145.66], 'B':[4, 3, 3], 'C':[25.34, 33.52, 54.70]})
        A  B      C
0   74.75  4  25.34
1   91.71  3  33.52
2  145.66  3  54.70
I would like to create another column df['D'] that would hold a list or Series built from the first 3 columns, suitable for use with the np.irr function, and that would look like this:
                                       D
0  [-74.75, 25.34, 25.34, 25.34, 25.34]
1         [-91.71, 33.52, 33.52, 33.52]
2        [-145.66, 54.70, 54.70, 54.70]
so I could ultimately do something like this
df['E'] = np.irr(df['D'])
I did get as far as this
[-df.A[0]]+[df.C[0]]*df.B[0]
but it is not quite there.
Do you really need the column 'D'?
Either way, you can easily add it as:
df['D'] = [[-df.A[i]] + [df.C[i]] * df.B[i] for i in range(len(df))]
df['E'] = df['D'].map(np.irr)
If you don't need it, you can set E directly:
df['E'] = [np.irr([-df.A[i]] + [df.C[i]] * df.B[i]) for i in range(len(df))]
or:
df['E'] = df.apply(lambda x: np.irr([-x.A] + [x.C] * x.B), axis=1)
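Note that np.irr has since been removed from NumPy itself and now lives in the separate numpy-financial package as npf.irr. A minimal end-to-end sketch under that assumption:
import numpy_financial as npf
import pandas as pd

df = pd.DataFrame({'A': [74.75, 91.71, 145.66], 'B': [4, 3, 3], 'C': [25.34, 33.52, 54.70]})
# cash flows per row: initial outlay -A followed by B payments of C
df['D'] = [[-a] + [c] * b for a, b, c in zip(df.A, df.B, df.C)]
df['E'] = df['D'].map(npf.irr)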