Stuck merging two dataframes, getting unexpected results - pandas

Thank you. I am only 3 weeks into learning Pandas and I am getting unexpected results, so any guidance would be appreciated.
I would like to merge two DataFrames together and retain my set_index.
I have a simple DataFrame
import pandas as pd
data = {
    'part_number': [123, 123, 123],
    'part_name': ['some name in 11', 'some name in 12', 'some name in 13'],
    'part_size': [11, 12, 13]
}
df = pd.DataFrame(data=data)
df.set_index('part_name', inplace=True)
I group by part_number, aggregate the part_sizes into tuples, and merge.
This is where my knowledge breaks down, I lose my index which is the part_name.
I see there are joins and concats, am I using the wrong syntax?
part_size_merge = df.groupby(['part_number'], dropna=False)['part_size'].agg(tuple).to_frame()
merged = df.merge(part_size_merge, on=['part_number'])
display(merged.head())
I tried concat, however, it looks like it stacks the two df's together, which isn't how I'd like it.
x = pd.concat([df, part_size_merge], axis=0, join='inner')
x.head()

Yes, that is normal: merging on columns discards the existing index. Reset the index before merging, then restore it afterwards:
out = df.reset_index().merge(part_size_merge, on=['part_number']).set_index('part_name')
Out[334]:
part_number part_size_x part_size_y
part_name
some name in 11 123 11 (11, 12, 13)
some name in 12 123 12 (11, 12, 13)
some name in 13 123 13 (11, 12, 13)
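An alternative worth noting (a sketch, not part of the original answer): merging the part_number column of df against the index of the aggregated frame with right_index=True preserves the left frame's index, so the reset/set round trip can be skipped entirely.

```python
import pandas as pd

data = {
    'part_number': [123, 123, 123],
    'part_name': ['some name in 11', 'some name in 12', 'some name in 13'],
    'part_size': [11, 12, 13],
}
df = pd.DataFrame(data=data).set_index('part_name')

# aggregate the sizes per part_number, as in the question
part_size_merge = df.groupby(['part_number'], dropna=False)['part_size'].agg(tuple).to_frame()

# joining the part_number *column* against part_size_merge's *index*
# keeps df's own index (part_name) intact
merged = df.merge(part_size_merge, left_on='part_number', right_index=True)
print(merged)
```

When one side of the merge is joined on its index, pandas passes the other side's index through to the result, which is exactly the behavior wanted here.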

DataFrame Index Created From Columns

I have a dataframe that I am populating with data from Bloomberg using TIA. When I look at df.index I see that the data that I intended to be columns is presented to me as what appears to be a multi-index. The output of df.columns looks like this:
Index([u'column1', u'column2'])
I have tried various iterations of reset_index but have not been able to remedy this situation.
1) What about the TIA manager causes the dataframe columns to be read in as an index?
2) How can I properly identify these columns as columns instead of a multi-index?
The ultimate problem that I'm trying to fix is that when I try to add this column to df2, the values for that column in df2 come out as NaT. Like below:
df2['column3'] = df1['column1']
Produces:
df2
column1 column2 column3
1135 32 NaT
1351 43 NaT
35 13 NaT
135 13 NaT
From the comments it appears df1 and df2 have completely different indexes
In [396]: df1.index
Out[400]: Index(['Jan', 'Feb', 'Mar', 'Apr', 'May'], dtype='object')
In [401]: df2.index
Out[401]: Index(['One', 'Two', 'Three', 'Four', 'Five'], dtype='object')
but we wish to assign values from df1 to df2, preserving order.
Usually, Pandas operations try to automatically align values based on index (and/or column) labels.
In this case, we wish to ignore the labels. To do that, use
df2['columns3'] = df1['column1'].values
df1['column1'].values is a NumPy array. Since it doesn't have an Index, Pandas simply assigns the values in the array into df2['columns3'] in order.
The assignment would behave the same way if the right-hand side were a list or a tuple.
Note that this also relies on len(df1) equaling len(df2).
For example,
import pandas as pd
df1 = pd.DataFrame(
{"column1": [1135, 1351, 35, 135, 0], "column2": [32, 43, 13, 13, 0]},
index=[u"Jan", u"Feb", u"Mar", u"Apr", u"May"],
)
df2 = pd.DataFrame(
{"column1": range(len(df1))}, index=[u"One", u"Two", u"Three", u"Four", u"Five"]
)
df2["columns3"] = df1["column1"].values
print(df2)
yields
column1 columns3
One 0 1135
Two 1 1351
Three 2 35
Four 3 135
Five 4 0
Alternatively, you could make the two Indexes the same, and then df2["columns3"] = df1["column1"] would produce the same result (but now because the index labels are being aligned):
df1.index = df2.index
df2["columns3"] = df1["column1"]
Another way to make the Indexes match is to reset the index on both DataFrames:
df1 = df1.reset_index()
df2 = df2.reset_index()
df2["columns3"] = df1["column1"]
reset_index moves the old index into a column named index by default (if index.name was None). Integers (starting with 0) are assigned as the new index labels:
In [402]: df1.reset_index()
Out[410]:
index column1 column2
0 Jan 1135 32
1 Feb 1351 43
2 Mar 35 13
3 Apr 135 13
4 May 0 0
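If the old labels are not needed at all, reset_index(drop=True) discards them instead of moving them into an index column (a small illustrative sketch):

```python
import pandas as pd

df1 = pd.DataFrame({"column1": [1135, 1351, 35]}, index=["Jan", "Feb", "Mar"])

# drop=True throws the old labels away rather than storing them in a column
out = df1.reset_index(drop=True)
print(out)
```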

How do I merge two pandas dataframes by time series index?

Currently I have two dataframes, FSample and GMSample (shown as screenshots in the original post).
What I want is ideally the two combined side by side, aligned on Date.
I attempted to do something similar to
result = pd.concat([FSample,GMSample],axis=1)
result
But my result has the data stacked on top of each other.
Then I attempted to use the merge command like this
result = pd.merge(FSample,GMSample,how='inner',on='Date')
result
From that I got a KeyError on 'Date'
So I feel like I am missing both an understanding of how I should be trying to combine these dataframes (i.e. multi-index?) and the syntax to do so properly.
You get a KeyError because Date is part of the index, whereas the "on" keyword in merge expects a column. Alternatively, you could remove Symbol from the indexes and then join the dataframes on their Date indexes.
FSample.reset_index("Symbol").join(GMSample.reset_index("Symbol"), lsuffix="_x", rsuffix="_y")
Working with MultiIndexes in pandas usually requires you to constantly set/reset the index. That is probably going to be the easiest thing to do in this case as well, as pd.merge does not immediately support merging on specific levels of a MultiIndex.
df_f = pd.DataFrame(
    data={
        'Symbol': ['F'] * 5,
        'Date': pd.to_datetime(['2012-01-03', '2012-01-04', '2012-01-05', '2012-01-06', '2012-01-09']),
        'Close': [11.13, 11.30, 11.59, 11.71, 11.80],
    },
).set_index(['Symbol', 'Date']).sort_index()
df_gm = pd.DataFrame(
    data={
        'Symbol': ['GM'] * 5,
        'Date': pd.to_datetime(['2012-01-03', '2012-01-04', '2012-01-05', '2012-01-06', '2012-01-09']),
        'Close': [21.05, 21.15, 22.17, 22.92, 22.84],
    },
).set_index(['Symbol', 'Date']).sort_index()
pd.merge(df_f.reset_index(level='Date'),
         df_gm.reset_index(level='Date'),
         how='inner',
         on='Date',
         suffixes=('_F', '_GM')
         ).set_index('Date')
The result:
Close_F Close_GM
Date
2012-01-03 11.13 21.05
2012-01-04 11.30 21.15
2012-01-05 11.59 22.17
2012-01-06 11.71 22.92
2012-01-09 11.80 22.84
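In newer pandas versions, DataFrame.droplevel offers a slightly shorter route to the same result. This is a sketch of that alternative, not part of the original answer:

```python
import pandas as pd

df_f = pd.DataFrame({
    'Symbol': ['F'] * 2,
    'Date': pd.to_datetime(['2012-01-03', '2012-01-04']),
    'Close': [11.13, 11.30],
}).set_index(['Symbol', 'Date'])
df_gm = pd.DataFrame({
    'Symbol': ['GM'] * 2,
    'Date': pd.to_datetime(['2012-01-03', '2012-01-04']),
    'Close': [21.05, 21.15],
}).set_index(['Symbol', 'Date'])

# drop the Symbol level entirely, then join on the shared Date index
result = df_f.droplevel('Symbol').join(df_gm.droplevel('Symbol'),
                                       lsuffix='_F', rsuffix='_GM')
print(result)
```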

How to implement where clause in python

I want to replicate what a WHERE clause does in SQL, using Python. The conditions in a WHERE clause can often be complex and involve multiple criteria. I am able to do it in the following way, but I think there should be a smarter way to achieve this. I have the following data and code.
My requirement is: I want to select all columns only when the first letter in the address is 'N'. This is the initial data frame.
d = {'name': ['john', 'tom', 'bob', 'rock', 'dick'], 'Age': [23, 32, 45, 42, 28], 'YrsOfEducation': [10, 15, 8, 12, 10], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
import pandas as pd
df = pd.DataFrame(data = d)
df['col1'] = df['Address'].str[0:1] #creating a new column which will have only the first letter from address column
n = df['col1'] == 'N' #creating a filtering criteria where the letter will be equal to N
newdata = df[n] # filtering the dataframe
newdata1 = newdata.drop('col1', axis = 1) # finally dropping the extra column 'col1'
So after 7 lines of code I am getting this output:
My question is how can I do it more efficiently or is there any smarter way to do that ?
A new column is not necessary:
newdata = df[df['Address'].str[0] == 'N'] # filtering the dataframe
print (newdata)
Address Age YrsOfEducation name
0 NY 23 10 john
1 NJ 32 15 tom
3 NY 42 12 rock

Sum of data entry with the given index in pandas dataframe

I am trying to get the sums of all possible combinations of the data in a pandas dataframe. To do this I use itertools.combinations to generate all possible index combinations, then sum each of them in a loop.
Is there any way to do this without using the loop?
Please check the following script that I created to show what I want.
import pandas as pd
import itertools as it
A = pd.Series([50, 20, 75], index = list(range(1, 4)))
df = pd.DataFrame({'A': A})
listNew = []
for i in range(1, len(df.A)+1):
    Temp = it.combinations(df.index.values, i)
    for data in Temp:
        listNew.append(data)
print(listNew)
for data in listNew:
    print(df.A[list(data)].sum())
The output of this script is:
[(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
50
20
75
70
125
95
145
Thank you in advance.
IIUC, using reindex:
# convert your list of tuples to a dataframe and use stack to flatten it
s = pd.DataFrame([(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]).stack().to_frame('index')
# then reindex df by those index values, pulling the matching values from df.A
s['Value'] = df.reindex(s['index']).A.values
# you could use groupby here, but since the combination number is already the outer index level, sum with level is simpler
s = s.Value.sum(level=0)
s
Out[796]:
0 50
1 20
2 75
3 70
4 125
5 95
6 145
Name: Value, dtype: int64
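Note that Series.sum(level=0) was deprecated and later removed in pandas 2.0; the equivalent groupby form (a sketch of the same idea) is:

```python
import pandas as pd

df = pd.DataFrame({'A': pd.Series([50, 20, 75], index=range(1, 4))})

combos = [(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
s = pd.DataFrame(combos).stack().to_frame('index')
s['Value'] = df.reindex(s['index']).A.values

# grouping on the first (outer) level replaces the removed sum(level=0)
totals = s['Value'].groupby(level=0).sum()
print(totals)
```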

Pandas dataframe values not changing outside of function

I have a pandas dataframe inside a for loop where I change a value in pandas dataframe like this:
df[item].ix[(e1,e2)] = 1
However when I access the df, the values are still unchanged. Do you know where exactly am I going wrong?
Any suggestions?
You are using chained indexing, which usually causes problems. In your code, df[item] returns a Series, and then .ix[(e1,e2)] = 1 modifies that Series, leaving the original dataframe untouched. (Note that .ix has since been deprecated and removed from pandas; .loc is the modern equivalent.) You need to modify the original dataframe in a single indexing operation instead, like this:
import pandas as pd
df = pd.DataFrame({'colA': [5, 6, 1, 2, 3],
                   'colB': ['a', 'b', 'c', 'd', 'e']})
print(df)
df.loc[[1, 2], 'colA'] = 111
print(df)
That code sets rows 1 and 2 of colA to 111, which I believe is the kind of thing you were looking to do. 1 and 2 could be replaced with variables of course.
colA colB
0 5 a
1 6 b
2 1 c
3 2 d
4 3 e
colA colB
0 5 a
1 111 b
2 111 c
3 2 d
4 3 e
For more information on chained indexing, see the documentation:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Side note: you may also want to rethink your code in general since you mentioned modifying a dataframe in a loop. When using pandas, you usually can and should avoid looping and leverage set-based operations instead. It takes some getting used to, but it's the way to unlock the full power of the library.
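As a concrete illustration of that advice (a sketch, assuming a simple condition rather than the original e1/e2 logic), a boolean mask can replace a row-by-row loop entirely:

```python
import pandas as pd

df = pd.DataFrame({'colA': [5, 6, 1, 2, 3],
                   'colB': ['a', 'b', 'c', 'd', 'e']})

# instead of looping over rows and assigning one at a time,
# select all matching rows with a mask and assign in one operation
df.loc[df['colA'] > 4, 'colA'] = 1
print(df)
```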