How do I subset a dataframe based on index matches to the column name of another dataframe? - pandas

I want to keep the columns of df whose column names match the index of df2.
My code below only returns df.index, but I want to return the entire subset of the pandas dataframe.
import pandas as pd
df = df[df.columns.intersection(df2.index)]

From my understanding, you want the data from both dataframes that matches the index of df2. Correct?
You can use merge to join the dataframes on their indexes.
df = pd.merge(df1, df2, how='inner', left_index=True, right_index=True)
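For what it's worth, the intersection line from the question already returns the full column subset, not just the index, as long as the result is assigned back. A minimal sketch with hypothetical toy data:
import pandas as pd

# Toy frames (made up for illustration): df has columns 'a', 'b', 'c';
# df2's index contains 'a' and 'c'.
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
df2 = pd.DataFrame({'val': [10, 20]}, index=['a', 'c'])

# Keep only the columns of df whose names appear in df2's index.
subset = df[df.columns.intersection(df2.index)]
print(subset)
#    a  c
# 0  1  5
# 1  2  6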

Related

pandas - lookup a value in another DF without merging data from the other dataframe

I have 2 DataFrames. DF1 and DF2.
(Please note that DF2 has more entries than DF1)
What I want to do is add the nationality column to DF1 (The script should look up the name in DF1 and find the corresponding nationality in DF2).
I am currently using the below code
final_DF = df1.merge(df2[['PPS','Nationality']], on=['PPS'], how='left')
Although the nationality column is being added to DF1, the code is duplicating entries and also adding additional data from DF2 that I do not want.
Is there a method to get the nationality from DF2 while only keeping the DF1 data?
Thanks
There are two points you need to address.
First, check whether there are any duplicated keys in DF2; duplicates there are what produce the duplicated rows after the merge.
Second, you can set 'how' in the merge statement, so it will look like
final_DF = DF1.merge(DF2, on=['Name'], how='left')
Since you want to keep only the DF1 rows, 'left' is the ideal option for you.
For more info, refer to the pandas documentation on merge.
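Putting the two points together, a minimal sketch (the 'PPS' and 'Nationality' column names come from the question; the data itself is made up): drop duplicate keys from DF2 and keep only the lookup columns before the left merge, so each DF1 row picks up exactly one nationality and nothing else.
import pandas as pd

DF1 = pd.DataFrame({'Name': ['Ann', 'Bob'], 'PPS': [1, 2]})
DF2 = pd.DataFrame({'PPS': [1, 1, 2, 3],
                    'Nationality': ['IE', 'IE', 'FR', 'DE'],
                    'Extra': ['x', 'y', 'z', 'w']})

# Deduplicate the key and keep only the columns needed for the lookup.
lookup = DF2[['PPS', 'Nationality']].drop_duplicates(subset='PPS')

# A left merge keeps every DF1 row and adds exactly one Nationality each.
final_DF = DF1.merge(lookup, on='PPS', how='left')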

How to iterate through rows of a DataFrame and add those rows to a blank DataFrame?

I have two populated DataFrames, df1 and df2. I also have an empty Dataframe (test):
df1 = pd.read_excel(xlpath1, sheetname='Sheet1')
df2 = pd.read_excel(xlpath2, sheetname='Sheet1')
test = pd.DataFrame()
I'd like to iterate through the rows of df1 and add those rows to the empty test Dataframe. When I try the following, I don't get any sort of error, but nothing is added to the test DataFrame:
for i, j in df1.iterrows():
    test.append(j)
Any ideas? Do I need to add the proper columns to the test DataFrame first? My total end goal is to iterate through multiple DataFrames and add only unique items to the empty DataFrame (e.g. adding items that appear in one of the many DataFrames).
If you are trying to append dataframe df1 to the empty dataframe test, you can use the concat function of pandas. Note that DataFrame.append returns a new dataframe rather than modifying test in place, which is why your loop appears to do nothing.
test = pd.concat([df1, test], axis=0)
axis=0 appends the two dataframes row-wise.
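For the stated end goal of collecting only the unique rows from several DataFrames, a minimal sketch (assuming the frames share the same columns; the data is made up) avoids iterrows entirely:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
df2 = pd.DataFrame({'a': [2, 3], 'b': ['y', 'z']})

# Stack the frames row-wise, then keep only the unique rows.
test = pd.concat([df1, df2], axis=0, ignore_index=True).drop_duplicates()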

Combine a list of pandas dataframes that do not have the same columns to one pandas dataframe

I have three Dataframes: df1, df2, df3 with the same number of "rows" but a different number of "columns" and different "column" labels. I want to "merge" them into one single dataframe in the order df1, df2, df3, keeping the original column labels.
I've read in Combine a list of pandas dataframes to one pandas dataframe that this can be done by:
df = pd.DataFrame.from_dict(map(dict,df_list))
But I cannot fully understand the code. I assume df_list is:
df_list = [df1,df2,df3]
But what is dict? A dictionary of df_list? How to get it?
I solved this by:
df = pd.concat([df1, df2], axis=1, sort=False)
df = pd.concat([df, df3], axis=1, sort=False)
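The two calls can also be collapsed into one, since concat accepts the whole list at once; a minimal sketch with made-up frames of equal length:
import pandas as pd

# Hypothetical frames: same number of rows, different column labels.
df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'b': [3, 4], 'c': [5, 6]})
df3 = pd.DataFrame({'d': [7, 8]})

# One concat call places them side by side in order, keeping the labels.
df = pd.concat([df1, df2, df3], axis=1, sort=False)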

Does a DataFrame with a single row have all the attributes of a DataFrame?

I am slicing a DataFrame from a large DataFrame, and the daughter df has only one row. Does a daughter df with a single row have the same attributes as the parent df?
import numpy as np
import pandas as pd
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,2),index=dates,columns=['col1','col2'])
df1=df.iloc[1]
type(df1)
>> pandas.core.series.Series
df1.columns
>>'Series' object has no attribute 'columns'
Is there a way I can use all the attributes of pd.DataFrame on a pd.Series?
Possibly what you are looking for is a dataframe with one row:
>>> pd.DataFrame(df1).T # T -> transpose
col1 col2
2013-01-02 -0.428913 1.265936
What happens when you do df.iloc[1] is that pandas converts that row to a Series, which is one-dimensional, so the columns become the index. You can still do df1['col1'], but you can't do df1.columns, because a Series is basically a single column, and hence the old column labels are now the new index.
As a result, you can retrieve the former columns like this:
>>> df1.index.tolist()
['col1', 'col2']
This used to confuse me quite a bit. I also expected df.iloc[1] to be a dataframe with one row, but it has always been the default behavior of pandas to automatically convert any one-dimensional dataframe slice (whether row or column) to a series. That is pretty natural for a column, but less so for a row (since the columns become the index), but it really is not a problem once you understand what is happening.
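If you want to avoid the Series conversion in the first place, passing a list of positions to iloc keeps the one-row result as a DataFrame; a short sketch reusing the setup from the question:
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 2), index=dates, columns=['col1', 'col2'])

# A list selector returns a one-row DataFrame instead of a Series.
df1 = df.iloc[[1]]
type(df1)      # pandas.core.frame.DataFrame
df1.columns    # Index(['col1', 'col2'], dtype='object')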

when reading an html (pandas.read_html), how to select a dataframe and set_index in one line

I'm reading an HTML page, which brings back a list of dataframes. I want to be able to choose a dataframe from the list and set my index (index_col) in the fewest lines possible.
Here is what I have right now:
import pandas as pd
df =pd.read_html('http://finviz.com/insidertrading.ashx?or=-10&tv=100000&tc=1&o=-transactionvalue', header = 0)
df2 =df[4] #here I'm assigning df2 to dataframe#4 from the list of dataframes I read
df2.set_index('Date', inplace =True)
Is it possible to do all this in one line? Do I need to create another dataframe (df2) to hold one dataframe from the list, or is it possible to assign the dataframe as soon as I read the list of dataframes (df)?
Thanks.
You can chain everything in one line:
import pandas as pd
df = pd.read_html('http://finviz.com/insidertrading.ashx?or=-10&tv=100000&tc=1&o=-transactionvalue', header = 0)[4].set_index('Date')
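This works because read_html returns a list of DataFrames, so [4] picks the wanted table straight out of the list, and set_index('Date') (without inplace=True) returns a new DataFrame, which is what lets the whole thing chain into a single expression.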