Append a tuple to a dataframe as a row - pandas

I am looking for a solution to add rows to a dataframe. Here is the data I have :
A grouped object ( obtained by grouping a dataframe on month and year i.e in this grouped object key is [month,year] and value is all the rows / dates in that month and year).
I want to extract all the month, year combinations and put them in a new dataframe. Issue: when I iterate over the grouped object, the key (month, year) is a tuple, so I converted the tuple into a list and added it to a dataframe using the append command. Instead of getting added as rows:
1 2014
2 2014
3 2014
it got added in one column
0 1
1 2014
0 2
1 2014
0 3
1 2014
...
I want to store these values in a new dataframe. Here is how I want the new dataframe to be :
month year
1 2014
2 2014
3 2014
I tried converting the tuple to list and then I tried various other things like pivoting. Inputs would be really helpful.
Here is the sample code :
df=df.groupby(['month','year'])
df = pd.DataFrame()
for key, value in df:
    print "type of key is:",type(key)
    print "type of list(key) is:",type(list(key))
    df = df.append(list(key))
print df

When you do the groupby the resulting MultiIndex is available as:
In [11]: df = pd.DataFrame([[1, 2014, 42], [1, 2014, 44], [2, 2014, 23]], columns=['month', 'year', 'val'])
In [12]: df
Out[12]:
month year val
0 1 2014 42
1 1 2014 44
2 2 2014 23
In [13]: g = df.groupby(['month', 'year'])
In [14]: g.grouper.result_index
Out[14]:
MultiIndex(levels=[[1, 2], [2014]],
labels=[[0, 1], [0, 0]],
names=['month', 'year'])
Often this will be sufficient, and you won't need a DataFrame. If you do, one way is the following:
In [21]: pd.DataFrame(index=g.grouper.result_index).reset_index()
Out[21]:
month year
0 1 2014
1 2 2014
I thought there was a method to get this, but can't recall it.
If you really want the tuples you can use .values or to_series:
In [31]: g.grouper.result_index.values
Out[31]: array([(1, 2014), (2, 2014)], dtype=object)
In [32]: g.grouper.result_index.to_series()
Out[32]:
month year
1 2014 (1, 2014)
2 2014 (2, 2014)
dtype: object
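As a hedged alternative that does not go through the `grouper` attribute (which is more of an internal detail), the same month/year frame can be built from the index of `g.size()`. This is only a sketch using the sample data from above; `keys_df` is a name chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2014, 42], [1, 2014, 44], [2, 2014, 23]],
    columns=["month", "year", "val"],
)
g = df.groupby(["month", "year"])

# g.size() is a Series indexed by the (month, year) group keys;
# MultiIndex.to_frame(index=False) turns those keys into ordinary columns.
keys_df = g.size().index.to_frame(index=False)
print(keys_df)
```

The result has one row per group, with `month` and `year` as regular columns and a fresh integer index.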

You had initially declared both the groupby and empty dataframe as df. Here's a modified version of your code that allows you to append a tuple as a dataframe row.
g = df.groupby(['month', 'year'])
df = pd.DataFrame()
for (key1, key2), value in g:
    row_series = pd.Series((key1, key2), index=['month', 'year'])
    # DataFrame.append was removed in pandas 2.0; this line needs an older pandas
    df = df.append(row_series, ignore_index=True)
print(df)
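Since `DataFrame.append` is no longer available in pandas 2.0 and later, here is a minimal sketch of the same idea that collects the key tuples first and builds the frame once. The sample data is the one used earlier in this thread; `rows` and `keys_df` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2014, 42], [1, 2014, 44], [2, 2014, 23]],
    columns=["month", "year", "val"],
)
g = df.groupby(["month", "year"])

# Iterating a groupby on two keys yields ((month, year), sub_frame) pairs.
rows = [key for key, _ in g]

# A list of tuples becomes one row per tuple.
keys_df = pd.DataFrame(rows, columns=["month", "year"])
print(keys_df)
```

Building the frame in one call also avoids copying the data on every loop iteration, which is what repeated appends do.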

If all you want are the unique values, you could use drop_duplicates
In [29]: df[['month','year']].drop_duplicates()
Out[29]:
month year
0 1 2014
2 2 2014

Related

Multiplying two data frames in pandas

I have two data frames as shown below df1 and df2. I want to create a third dataframe i.e. df as shown below. What would be the appropriate way?
df1={'id':['a','b','c'],
'val':[1,2,3]}
df1=pd.DataFrame(df1)
df1
id val
0 a 1
1 b 2
2 c 3
df2={'yr':['2010','2011','2012'],
'val':[4,5,6]}
df2=pd.DataFrame(df2)
df2
yr val
0 2010 4
1 2011 5
2 2012 6
df={'id':['a','b','c'],
'val':[1,2,3],
'2010':[4,8,12],
'2011':[5,10,15],
'2012':[6,12,18]}
df=pd.DataFrame(df)
df
id val 2010 2011 2012
0 a 1 4 5 6
1 b 2 8 10 12
2 c 3 12 15 18
I can basically convert df1 and df2 as 1 by n matrices and get n by n result and assign it back to the df1. But is there any easy pandas way?
TL;DR
We can do it in one line like this:
df1.join(df1.val.apply(lambda x: x * df2.set_index('yr').val))
or like this:
df1.join(df1.set_index('id') @ df2.set_index('yr').T, on='id')
Done.
The long story
Let's see what's going on here.
To find the output of multiplication of each df1.val by values in df2.val we use apply:
df1['val'].apply(lambda x: x * df2.val)
The function inside receives the values of df1.val one by one and multiplies each by df2.val element-wise (see broadcasting for details if needed). Because df2.val is a pandas Series, the output is a data frame with index df1.val.index and columns df2.val.index. With df2.set_index('yr') we make the years the index before multiplication, so they become the column names in the output.
DataFrame.join is joining frames index-on-index by default. So due to identical indexes of df1 and the multiplication output, we can apply df1.join( <the output of multiplication> ) as is.
At the end we get the desired matrix with indexes df1.index and columns id, val, *df2['yr'].
The second variant with the @ operator is actually the same. The main difference is that we multiply 2-dimensional frames instead of series. These are the vertical and horizontal vectors, respectively. So the matrix multiplication will produce a frame with index df1.id and columns df2.yr and the element-wise products as values. At the end we connect df1 with the output on the identical id column and index, respectively.
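To make the explanation above concrete, here is a small runnable sketch of both variants using the frames from the question. The names `years`, `via_apply` and `via_matmul` are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"id": ["a", "b", "c"], "val": [1, 2, 3]})
df2 = pd.DataFrame({"yr": ["2010", "2011", "2012"], "val": [4, 5, 6]})

# Series of df2 values indexed by year, so the years become column names.
years = df2.set_index("yr")["val"]

# Variant 1: apply returns one Series per value of df1.val, giving a frame
# that shares df1's index, so a plain index-on-index join works.
via_apply = df1.join(df1["val"].apply(lambda x: x * years))

# Variant 2: (3 x 1) @ (1 x 3) matrix product, indexed by id; join on the id column.
via_matmul = df1.join(df1.set_index("id") @ df2.set_index("yr").T, on="id")

print(via_apply)
print(via_matmul)
```

Both frames should contain the columns id, val, 2010, 2011 and 2012 with the same values.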
This works for me:
df2 = df2.T
new_df = pd.DataFrame(np.outer(df1['val'],df2.iloc[1:]))
df = pd.concat([df1, new_df], axis=1)
df.columns = ['id', 'val', '2010', '2011', '2012']
df
The output I get:
id val 2010 2011 2012
0 a 1 4 5 6
1 b 2 8 10 12
2 c 3 12 15 18
Your question is a bit vague. But I suppose you want to do something like that:
df = pd.concat([df1, df2], axis=1)

DataFrame Index Created From Columns

I have a dataframe that I am using TIA to populate data from Bloomberg. When I look at df.index I see that the data that I intended to be columns is presented to me as what appears to be a multi-index. The output for df.columns is like this:
Index([u'column1', u'column2'])
I have tried various iterations of reset_index but have not been able to remedy this situation.
1) what about the TIA manager causes the dataframe columns to be read in as an index?
2) How can I properly identify these columns as columns instead of a multi-index?
The ultimate problem that I'm trying to fix is that when I try to add this column to df2, the values for that column in df2 come out as NaT. Like below:
df2['column3'] = df1['column1']
Produces:
df2
column1 column2 column3
1135 32 NaT
1351 43 NaT
35 13 NaT
135 13 NaT
From the comments it appears df1 and df2 have completely different indexes
In [396]: df1.index
Out[400]: Index(['Jan', 'Feb', 'Mar', 'Apr', 'May'], dtype='object')
In [401]: df2.index
Out[401]: Index(['One', 'Two', 'Three', 'Four', 'Five'], dtype='object')
but we wish to assign values from df1 to df2, preserving order.
Usually, Pandas operations try to automatically align values based on index (and/or column) labels.
In this case, we wish to ignore the labels. To do that, use
df2['columns3'] = df1['column1'].values
df1['column1'].values is a NumPy array. Since it doesn't have an Index, Pandas simply assigns the values in the array into df2['columns3'] in order.
The assignment would behave the same way if the right-hand side were a list or a tuple.
Note that this also relies on len(df1) equaling len(df2).
For example,
import pandas as pd
df1 = pd.DataFrame(
{"column1": [1135, 1351, 35, 135, 0], "column2": [32, 43, 13, 13, 0]},
index=[u"Jan", u"Feb", u"Mar", u"Apr", u"May"],
)
df2 = pd.DataFrame(
{"column1": range(len(df1))}, index=[u"One", u"Two", u"Three", u"Four", u"Five"]
)
df2["columns3"] = df1["column1"].values
print(df2)
yields
column1 columns3
One 0 1135
Two 1 1351
Three 2 35
Four 3 135
Five 4 0
Alternatively, you could make the two Indexes the same, and then df2["columns3"] = df1["column1"] would produce the same result (but now because the index labels are being aligned):
df1.index = df2.index
df2["columns3"] = df1["column1"]
Another way to make the Indexes match is to reset the index on both DataFrames:
df1 = df1.reset_index()
df2 = df2.reset_index()
df2["columns3"] = df1["column1"]
reset_index moves the old index into a column named index by default (if index.name was None). Integers (starting with 0) are assigned as the new index labels:
In [402]: df1.reset_index()
Out[410]:
index column1 column2
0 Jan 1135 32
1 Feb 1351 43
2 Mar 35 13
3 Apr 135 13
4 May 0 0
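The difference between label-aligned and positional assignment can be seen side by side in a short sketch. The column names `aligned` and `positional` are made up for this illustration, and `to_numpy()` is used here as the newer spelling of `.values`.

```python
import pandas as pd

df1 = pd.DataFrame({"column1": [1135, 1351, 35]}, index=["Jan", "Feb", "Mar"])
df2 = pd.DataFrame({"column1": [0, 1, 2]}, index=["One", "Two", "Three"])

# Assigning a Series aligns on index labels; none match, so every value is missing.
df2["aligned"] = df1["column1"]

# Assigning a NumPy array ignores labels and fills by position (lengths must match).
df2["positional"] = df1["column1"].to_numpy()

print(df2)
```

With datetime data the missing values show up as NaT instead of NaN, which matches what the question reports.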

Create datetime from columns in a DataFrame

I got a DataFrame with these columns :
year month day gender births
I'd like to create a new column type "Date" based on the column year, month and day as : "yyyy-mm-dd"
I'm just beginning in Python and I just can't figure out how to proceed...
Assuming you are using pandas to create your dataframe, you can try:
>>> import pandas as pd
>>> df = pd.DataFrame({'year':[2015,2016],'month':[2,3],'day':[4,5],'gender':['m','f'],'births':[0,2]})
>>> df['dates'] = pd.to_datetime(df.iloc[:,0:3])
>>> df
year month day gender births dates
0 2015 2 4 m 0 2015-02-04
1 2016 3 5 f 2 2016-03-05
Taken from the example here and the slicing (iloc use) "Selection" section of "10 minutes to pandas" here.
You can use .assign
For example:
df2 = df.assign(ColumnDate=df.Column1.astype(str) + '-' + df.Column2.astype(str) + '-' + df.Column3.astype(str))
It is simple and it is much faster than lambda if you have tonnes of data.
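For the columns named in the question, a minimal sketch that selects year, month and day by name (rather than by position) and also produces a zero-padded "yyyy-mm-dd" string could look like this. The column names `date` and `date_str` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"year": [2015, 2016], "month": [2, 3], "day": [4, 5],
     "gender": ["m", "f"], "births": [0, 2]}
)

# pd.to_datetime can assemble dates from columns named year, month and day.
df["date"] = pd.to_datetime(df[["year", "month", "day"]])

# strftime gives a zero-padded string, which plain string concatenation does not.
df["date_str"] = df["date"].dt.strftime("%Y-%m-%d")

print(df)
```

Keeping a real datetime column is usually more useful than a string, since it supports sorting, resampling and date arithmetic.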

monthly frequency time series data frame, fill NaNs with specific values

How do I assign values to the months from April to September?
I would like the April value to equal 42000, May = 41000, June = 61200, July = 71000, August = 71000
df.index
RangeIndex(start=0, stop=60, step=1)
For a mapping like this, you would typically define a dictionary and map the values. Use .split to get the month part of the date and fillna to fill only the missing values.
Data:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': ['2018-Jan', '2018-Feb', '2018-Mar', '2018-Apr', '2018-May',
'2018-Jun', '2018-Jul', '2018-Aug', '2018-Sep'],
'Value': [75267.169, 42258.868, 43793]+[np.NaN]*6})
Code:
d = {'Apr': 42000, 'May': 41000, 'Jun': 61200, 'Jul': 71000, 'Aug': 71000}
df['Value'] = df.Value.fillna(df.Date.str.split('-').str[1].map(d))
Output:
Date Value
0 2018-Jan 75267.169
1 2018-Feb 42258.868
2 2018-Mar 43793.000
3 2018-Apr 42000.000
4 2018-May 41000.000
5 2018-Jun 61200.000
6 2018-Jul 71000.000
7 2018-Aug 71000.000
8 2018-Sep NaN
A super simple and ugly way to do it, using pd.DataFrame.iloc:
to_fill = [42000,41000,61200,71000,71000]
df.iloc[54:59,1] = to_fill

Pandas groupby on one column and then filter based on quantile value of another column

I am trying to filter my data down to only those rows in the bottom decile of the data for any given date. Thus, I need to groupby the date first to get the sub-universe of data and then from there filter that same sub-universe down to only those values falling in the bottom decile. I then need to aggregate all of the different dates back together to make one large dataframe.
For example, I want to take the following df:
df = pd.DataFrame([['2017-01-01', 1], ['2017-01-01', 5], ['2017-01-01', 10], ['2018-01-01', 5], ['2018-01-01', 10]], columns=['date', 'value'])
and keep only those rows where the value is in the bottom decile for that date (below 1.8 and 5.5, respectively):
date value
0 '2017-01-01' 1
1 '2018-01-01' 5
I can get a series of the bottom decile using df.groupby(['date'])['value'].quantile(.1), but this would then require me to iterate through the entire df and compare the value to the quantile value in the series, which I'm trying to avoid due to performance issues.
Something like this?
df.groupby('date').value.apply(lambda x: x[x < x.quantile(.1)]).reset_index(1,drop = True).reset_index()
date value
0 2017-01-01 1
1 2018-01-01 5
Edit:
df.loc[df['value'] < df.groupby('date').value.transform(lambda x: x.quantile(.1))]
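Here is the transform-based version as a self-contained sketch on the sample data from the question. The names `threshold` and `result` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    [["2017-01-01", 1], ["2017-01-01", 5], ["2017-01-01", 10],
     ["2018-01-01", 5], ["2018-01-01", 10]],
    columns=["date", "value"],
)

# transform broadcasts each group's 10% quantile back to every row of that group.
threshold = df.groupby("date")["value"].transform(lambda x: x.quantile(0.1))

# A single vectorized comparison then keeps the rows below their group's threshold.
result = df.loc[df["value"] < threshold]
print(result)
```

Because `threshold` has the same index as `df`, no explicit loop or merge is needed, and the surviving rows keep their original index labels.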