Thank you. I am only 3 weeks into learning pandas and I am getting unexpected results; any guidance would be appreciated.
I would like to merge two DataFrames together and retain my set_index.
I have a simple DataFrame
import pandas as pd
data = {
'part_number': [123,123,123],
'part_name': ['some name in 11', 'some name in 12', 'some name in 13'],
'part_size': [11,12,13]
}
df = pd.DataFrame(data=data)
df.set_index('part_name', inplace=True)
I group by part_number, collect the part_size values, and merge the result back.
This is where my knowledge breaks down: I lose my index, which is part_name.
I see there are joins and concats; am I using the wrong syntax?
part_size_merge = df.groupby(['part_number'], dropna=False)['part_size'].agg(tuple).to_frame()
merged = df.merge(part_size_merge, on=['part_number'])
display(merged.head())
I tried concat; however, it looks like it stacks the two DataFrames on top of each other, which isn't what I want.
x = pd.concat([df, part_size_merge], axis=0, join='inner')
x.head()
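Yes, that is how merge normally behaves: when you merge on columns, the existing indexes are ignored and dropped. Reset the index first, merge, and then set it again: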
out = df.reset_index().merge(part_size_merge, on=['part_number']).set_index('part_name')
Out[334]:
part_number part_size_x part_size_y
part_name
some name in 11 123 11 (11, 12, 13)
some name in 12 123 12 (11, 12, 13)
some name in 13 123 13 (11, 12, 13)
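As an alternative sketch, DataFrame.join with the on argument matches a column of the calling frame against the index of the other frame and keeps the caller's index, so no reset/set round trip is needed. The rsuffix value '_all' is an arbitrary name chosen here for illustration; it is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'part_number': [123, 123, 123],
    'part_name': ['some name in 11', 'some name in 12', 'some name in 13'],
    'part_size': [11, 12, 13],
}).set_index('part_name')

# One row per part_number, holding a tuple of all its sizes
part_size_merge = (
    df.groupby('part_number', dropna=False)['part_size'].agg(tuple).to_frame()
)

# Match df['part_number'] against the index of part_size_merge;
# the part_name index of df is preserved in the result.
# '_all' is a hypothetical suffix for the overlapping part_size column.
joined = df.join(part_size_merge, on='part_number', rsuffix='_all')
print(joined)
```

This works here because part_size_merge is indexed by part_number after the groupby, which is exactly what join expects on the right-hand side.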
Related
I have a dataframe that I am using TIA to populate data from Bloomberg. When I look at df.index I see that the data that I intended to be columns is presented to me as what appears to be a multi-index. The output for df.columns is like this:
Index([u'column1', u'column2'])
I have tried various iterations of reset_index but have not been able to remedy this situation.
1) what about the TIA manager causes the dataframe columns to be read in as an index?
2) How can I properly identify these columns as columns instead of a multi-index?
The ultimate problem that I'm trying to fix is that when I try to add this column to df2, the values for that column in df2 come out as NaT. Like below:
df2['column3'] = df1['column1']
Produces:
df2
column1 column2 column3
1135 32 NaT
1351 43 NaT
35 13 NaT
135 13 NaT
From the comments, it appears df1 and df2 have completely different indexes:
In [396]: df1.index
Out[400]: Index(['Jan', 'Feb', 'Mar', 'Apr', 'May'], dtype='object')
In [401]: df2.index
Out[401]: Index(['One', 'Two', 'Three', 'Four', 'Five'], dtype='object')
but we wish to assign values from df1 to df2, preserving order.
Usually, Pandas operations try to automatically align values based on index (and/or column) labels.
In this case, we wish to ignore the labels. To do that, use
df2['columns3'] = df1['column1'].values
df1['column1'].values is a NumPy array. Since it doesn't have an Index, Pandas simply assigns the values in the array into df2['columns3'] in order.
The assignment would behave the same way if the right-hand side were a list or a tuple.
Note that this also relies on len(df1) equaling len(df2).
For example,
import pandas as pd
df1 = pd.DataFrame(
{"column1": [1135, 1351, 35, 135, 0], "column2": [32, 43, 13, 13, 0]},
index=[u"Jan", u"Feb", u"Mar", u"Apr", u"May"],
)
df2 = pd.DataFrame(
{"column1": range(len(df1))}, index=[u"One", u"Two", u"Three", u"Four", u"Five"]
)
df2["columns3"] = df1["column1"].values
print(df2)
yields
column1 columns3
One 0 1135
Two 1 1351
Three 2 35
Four 3 135
Five 4 0
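The difference between label-aligned and positional assignment can be seen side by side in a small sketch. The three-row frames below are made-up data for illustration, and assign is used only so that df2 itself is not modified.

```python
import pandas as pd

df1 = pd.DataFrame({"column1": [1135, 1351, 35]}, index=["Jan", "Feb", "Mar"])
df2 = pd.DataFrame({"column1": [0, 1, 2]}, index=["One", "Two", "Three"])

# Assigning a Series aligns on index labels; no labels are shared,
# so every value in the new column is missing.
aligned = df2.assign(columns3=df1["column1"])

# Assigning a NumPy array ignores labels and fills by position.
positional = df2.assign(columns3=df1["column1"].to_numpy())

print(aligned)
print(positional)
```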
Alternatively, you could make the two Indexes the same, and then df2["columns3"] = df1["column1"] would produce the same result (but now because the index labels are being aligned):
df1.index = df2.index
df2["columns3"] = df1["column1"]
Another way to make the Indexes match is to reset the index on both DataFrames:
df1 = df1.reset_index()
df2 = df2.reset_index()
df2["columns3"] = df1["column1"]
reset_index moves the old index into a column named index by default (if index.name was None). Integers (starting with 0) are assigned as the new index labels:
In [402]: df1.reset_index()
Out[410]:
index column1 column2
0 Jan 1135 32
1 Feb 1351 43
2 Mar 35 13
3 Apr 135 13
4 May 0 0
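If the old labels are not needed, or should land in a column with a more meaningful name than index, two small variations may help. This is a sketch on made-up data; the name "month" is a hypothetical choice.

```python
import pandas as pd

df1 = pd.DataFrame({"column1": [1135, 1351, 35]}, index=["Jan", "Feb", "Mar"])

# drop=True discards the old index instead of turning it into a column
flat = df1.reset_index(drop=True)

# Naming the index first controls the name of the column it becomes
named = df1.rename_axis("month").reset_index()

print(flat)
print(named)
```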
Currently I have two dataframes that look like this:
FSample
GMSample
What I want is something that ideally looks like this:
I attempted to do something similar to
result = pd.concat([FSample,GMSample],axis=1)
result
But my result has the data stacked on top of each other.
Then I attempted to use the merge command like this
result = pd.merge(FSample,GMSample,how='inner',on='Date')
result
From that I got a KeyError on 'Date'
So I feel like I am missing both an understanding of how I should be trying to combine these dataframes (i.e. multi-index?) and the syntax to do so properly.
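You get a KeyError because Date is an index level, not a column, and in the pandas version used here the "on" keyword in merge only accepts column names (pandas 0.23 and later also accept index level names). Alternatively, you could remove Symbol from the indexes and then join the dataframes on their Date indexes.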
FSample.reset_index("Symbol").join(GMSample.reset_index("Symbol"), lsuffix="_x", rsuffix="_y")
Working with MultiIndexes in pandas often requires setting and resetting the index. That is probably the easiest thing to do in this case as well, since older versions of pd.merge do not directly support merging on specific levels of a MultiIndex.
df_f = pd.DataFrame(
data = {
'Symbol': ['F'] * 5,
'Date': pd.to_datetime(['2012-01-03', '2012-01-04', '2012-01-05', '2012-01-06', '2012-01-09']),
'Close': [11.13, 11.30, 11.59, 11.71, 11.80],
},
).set_index(['Symbol', 'Date']).sort_index()
df_gm = pd.DataFrame(
data = {
'Symbol': ['GM'] * 5,
'Date': pd.to_datetime(['2012-01-03', '2012-01-04', '2012-01-05', '2012-01-06', '2012-01-09']),
'Close': [21.05, 21.15, 22.17, 22.92, 22.84],
},
).set_index(['Symbol', 'Date']).sort_index()
pd.merge(df_f.reset_index(level='Date'),
df_gm.reset_index(level='Date'),
how='inner',
on='Date',
suffixes=('_F', '_GM')
).set_index('Date')
The result:
Close_F Close_GM
Date
2012-01-03 11.13 21.05
2012-01-04 11.30 21.15
2012-01-05 11.59 22.17
2012-01-06 11.71 22.92
2012-01-09 11.80 22.84
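Another possible route, sketched here under the assumption that both frames share the same (Symbol, Date) MultiIndex layout as above, is to stack the frames with concat and then pivot the Symbol level into columns with unstack. The resulting columns are named F and GM rather than Close_F and Close_GM.

```python
import pandas as pd

dates = pd.to_datetime(
    ['2012-01-03', '2012-01-04', '2012-01-05', '2012-01-06', '2012-01-09']
)
df_f = pd.DataFrame(
    {'Symbol': ['F'] * 5, 'Date': dates,
     'Close': [11.13, 11.30, 11.59, 11.71, 11.80]}
).set_index(['Symbol', 'Date'])
df_gm = pd.DataFrame(
    {'Symbol': ['GM'] * 5, 'Date': dates,
     'Close': [21.05, 21.15, 22.17, 22.92, 22.84]}
).set_index(['Symbol', 'Date'])

# Stack the rows of both frames, then move the Symbol level into columns
wide = pd.concat([df_f, df_gm])['Close'].unstack('Symbol')
print(wide)
```

This scales to more than two symbols without chaining merges, at the cost of handling one value column at a time.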
I want to replicate what a WHERE clause does in SQL, using Python. The conditions in a WHERE clause are often complex and combine multiple conditions. I am able to do it in the following way, but I think there should be a smarter way to achieve this. I have the following data and code.
My requirement is: I want to select all columns, but only for rows where the first letter of the address is 'N'. This is the initial data frame.
d = {'name': ['john', 'tom', 'bob', 'rock', 'dick'], 'Age': [23, 32, 45, 42, 28], 'YrsOfEducation': [10, 15, 8, 12, 10], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
import pandas as pd
df = pd.DataFrame(data = d)
df['col1'] = df['Address'].str[0:1] #creating a new column which will have only the first letter from address column
n = df['col1'] == 'N' #creating a filtering criteria where the letter will be equal to N
newdata = df[n] # filtering the dataframe
newdata1 = newdata.drop('col1', axis = 1) # finally dropping the extra column 'col1'
So after 7 lines of code I am getting this output:
My question is how can I do it more efficiently or is there any smarter way to do that ?
A new column is not necessary:
newdata = df[df['Address'].str[0] == 'N'] # filtering the dataframe
print (newdata)
Address Age YrsOfEducation name
0 NY 23 10 john
1 NJ 32 15 tom
3 NY 42 12 rock
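Since the question mentions WHERE clauses with several conditions, here is a minimal sketch of combining boolean masks with & (and), | (or) and ~ (not). The extra condition Age > 30 is made up for illustration and is not part of the original question.

```python
import pandas as pd

d = {'name': ['john', 'tom', 'bob', 'rock', 'dick'],
     'Age': [23, 32, 45, 42, 28],
     'YrsOfEducation': [10, 15, 8, 12, 10],
     'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
df = pd.DataFrame(data=d)

# Each condition is a boolean Series; wrap comparisons in parentheses
# because & and | bind more tightly than == or >.
starts_with_n = df['Address'].str.startswith('N')
older_than_30 = df['Age'] > 30   # hypothetical second condition

result = df[starts_with_n & older_than_30]
print(result)
```

Naming the individual masks keeps long conditions readable, much like splitting a long WHERE clause across lines.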
I am trying to get the sum of every possible combination of the values in a pandas dataframe. To do this I use itertools.combinations to get all possible combinations, and then I sum each of them in a loop.
Is there any way to do this without using the loop?
Please check the following script, which I created to show what I want.
import pandas as pd
import itertools as it
A = pd.Series([50, 20, 75], index = list(range(1, 4)))
df = pd.DataFrame({'A': A})
listNew = []
for i in range(1, len(df.A)+1):
    Temp=it.combinations(df.index.values, i)
    for data in Temp:
        listNew.append(data)
print(listNew)
for data in listNew:
    print(df.A[list(data)].sum())
Output of these scripts are:
[(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
50
20
75
70
125
95
145
thank you in advance.
IIUC, you can use reindex:
# convert your list of tuples to a DataFrame and use stack to flatten it
s=pd.DataFrame([(1,), (2,), (3,), (1, 2),(1, 3),(2, 3), (1, 2, 3)]).stack().to_frame('index')
# then look up each label in df.A, in order, with reindex
s['Value']=df.reindex(s['index']).A.values
# sum per original tuple by grouping on the first index level
# (Series.sum(level=0) was removed in pandas 2.0)
s=s.Value.groupby(level=0).sum()
s
Out[796]:
0 50
1 20
2 75
3 70
4 125
5 95
6 145
Name: Value, dtype: int64
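A fully loop-free variant can also be sketched with NumPy alone: every non-empty subset of n items corresponds to the binary digits of an integer from 1 to 2**n - 1, so a 0/1 membership matrix times the value vector yields all subset sums at once. Note that this is a different technique from the reindex approach above, and the subsets come out in binary-counting order rather than the itertools order.

```python
import numpy as np
import pandas as pd

A = pd.Series([50, 20, 75], index=list(range(1, 4)))
n = len(A)

# Row k-1 holds the binary digits of k: a 1 in column j means
# "item j is in this subset". Shape: (2**n - 1, n).
mask = (np.arange(1, 2**n)[:, None] >> np.arange(n)) & 1

# One matrix-vector product gives the sum of every subset
sums = mask @ A.to_numpy()
print(sums)
```

The matrix has 2**n - 1 rows, so this is only practical for small n, which is equally true of enumerating the combinations explicitly.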
I have a pandas dataframe inside a for loop where I change a value in pandas dataframe like this:
df[item].ix[(e1,e2)] = 1
However when I access the df, the values are still unchanged. Do you know where exactly am I going wrong?
Any suggestions?
You are using chained indexing, which usually causes problems. In your code, df[item] returns a Series, and .ix[(e1,e2)] = 1 then modifies that Series, which may be a copy, so the original dataframe can be left untouched. You need to modify the original dataframe in a single indexing step instead, like this:
import pandas as pd
df = pd.DataFrame({'colA': [5, 6, 1, 2, 3],
                   'colB': ['a', 'b', 'c', 'd', 'e']})
print(df)
# .ix was removed in pandas 1.0; .loc performs the same label-based assignment
df.loc[[1, 2], 'colA'] = 111
print(df)
That code sets rows 1 and 2 of colA to 111, which I believe is the kind of thing you were looking to do. 1 and 2 could be replaced with variables of course.
colA colB
0 5 a
1 6 b
2 1 c
3 2 d
4 3 e
colA colB
0 5 a
1 111 b
2 111 c
3 2 d
4 3 e
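Applied to the shape of the statement in the question, a single .loc call might look like the sketch below. The question does not say what item, e1 and e2 are, so this assumes item is a column label and (e1, e2) is a row key in a two-level MultiIndex; the frame and values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with a two-level row index
index = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)])
df = pd.DataFrame({'x': [0, 0, 0], 'y': [0, 0, 0]}, index=index)

item, e1, e2 = 'x', 'a', 2   # assumed meanings, see the note above

# Row key and column label go into one .loc call, so the assignment
# is applied to df itself rather than to an intermediate Series.
df.loc[(e1, e2), item] = 1
print(df)
```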
For more information on chained indexing, see the documentation:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Side note: you may also want to rethink your code in general since you mentioned modifying a dataframe in a loop. When using pandas, you usually can and should avoid looping and leverage set-based operations instead. It takes some getting used to, but it's the way to unlock the full power of the library.
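As a small sketch of what a set-based operation can look like, a boolean mask combined with .loc updates every matching row in one statement instead of one row per loop iteration. The condition colA < 5 is an arbitrary example.

```python
import pandas as pd

df = pd.DataFrame({'colA': [5, 6, 1, 2, 3],
                   'colB': ['a', 'b', 'c', 'd', 'e']})

# Select all rows where colA is below 5 and overwrite them at once
df.loc[df['colA'] < 5, 'colA'] = 111
print(df)
```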