Pandas - How best to combine dataframes based on specific column values

I have my main data frame (df) with the six columns defined in 'columns_main'.
The needed data comes from two much larger df's; let's call them df1 and df2.
df1 and df2 do not have the same column labels, but they both include the required df columns.
The df just needs the few pieces taken from each of the two bigger ones, and by bigger I mean many times the number of columns.
Since it is all going into a DB, I want to get rid of all the unwanted columns.
How do I combine/merge/join/mask the needed data from the large data frames into the main (smaller) data frame? Or maybe just drop the columns not covered by 'columns_main'?
df = pd.DataFrame(columns = columns_main)
The other two df's are coming from Excel workbooks with a lot of unwanted trash.
from openpyxl import load_workbook
import pandas as pd

wb = load_workbook(filename=filename)
ws = wb[_sheets[0]]
df1 = pd.DataFrame(ws.values)
ws = wb[_sheets[1]]
df2 = pd.DataFrame(ws.values)
How can I do this without some sort of crazy looping?
Thank you.

You can select the needed columns in the other DataFrames by subsetting them with the main df's column labels:
df1[df.columns]
df2[df.columns]
If it is possible that some columns do not match, use Index.intersection:
cols = df.columns
df1[df1.columns.intersection(cols)]
df2[df2.columns.intersection(cols)]
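
If the goal is to end up with one frame that holds only the wanted columns from both sheets, a minimal sketch could look like the following (the column names and data here are made up just to make the example runnable; in the real code columns_main, df1 and df2 are the objects from the question):

import pandas as pd

# hypothetical stand-ins for the question's objects
columns_main = ["id", "name", "date", "qty", "price", "site"]
df1 = pd.DataFrame({c: [1, 2] for c in columns_main + ["junk1", "junk2"]})
df2 = pd.DataFrame({c: [3, 4] for c in columns_main + ["other_junk"]})

cols = pd.Index(columns_main)

# keep only the required columns; intersection guards against missing labels
df1_small = df1[df1.columns.intersection(cols)]
df2_small = df2[df2.columns.intersection(cols)]

# stack the two trimmed frames into the single main df destined for the DB
df = pd.concat([df1_small, df2_small], ignore_index=True)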

Related

Compile a count of similar rows in a Pandas Dataframe based on multiple column values

I have two Dataframes, one containing my data read in from a CSV file and another that has the data grouped by all of the columns but the last and reindexed to contain a column for the count of the size of the groups.
df_k1 = pd.read_csv(filename, sep=';')
columns_for_groups = list(df_k1.columns)[:-1]
k1_grouped = df_k1.groupby(columns_for_groups).size().reset_index(name="Count")
I need to create a series such that every row (i) in the series corresponds to row (i) in my original DataFrame, but the contents of the series need to be the size of the group that the row belongs to in the grouped DataFrame. I currently have this, and it works for my purposes, but I was wondering if anyone knew of a faster or more elegant solution.
size_by_row = []
for row in df_k1.itertuples():
    for group in k1_grouped.itertuples():
        if row[1:-1] == group[1:-1]:
            size_by_row.append(group[-1])
            break
group_size = pd.Series(size_by_row)
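
For reference, the same per-row group size can usually be computed without the nested loops by using groupby with transform('size'). A small self-contained sketch (the data below is made up, in the same shape as the question's CSV: grouping columns plus one final value column):

import pandas as pd

# made-up data standing in for df_k1
df_k1 = pd.DataFrame({"a": [1, 1, 2, 2, 2],
                      "b": ["x", "x", "y", "y", "z"],
                      "val": [10, 20, 30, 40, 50]})
columns_for_groups = list(df_k1.columns)[:-1]

# size of the group each row belongs to, aligned with df_k1's original index
group_size = df_k1.groupby(columns_for_groups)[df_k1.columns[-1]].transform("size")
print(group_size)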

How can I iterate/transpose/append a data frame to another one?

for row in range(1, len(df)):
    try:
        df_out, orthogroup, len_group = HOG_get_group_stats(df.loc[row, "HOG"])
        temp_df = pd.DataFrame()
        for id in range(len(df_out)):
            print(" ")
            temp_df = pd.concat([df, pd.DataFrame(df_out.iloc[id, :]).T], axis=1)
            temp_df["HOG"] = orthogroup
            temp_df["len_group"] = len_group
            print(temp_df)
    except:
        print(row, "no")
Here I have a script that does the following:
Iterate over df, apply the HOG_get_group_stats function to the HOG column, and get 3 variables as outputs. (Basically, the function creates some stats as a data frame called df_out, and extracts some information as two more columns called orthogroup and len_group.)
Create an empty template called temp_df.
Transpose the df_out data frame so it becomes one single row, and then concatenate it as columns with the df we used in the beginning.
Add the orthogroup and len_group columns to the end of temp_df.
Problem:
It prints out the data; however, when I look at temp_df as a data frame it shows only a single row (probably the last one), which means that my concatenation of several data frames doesn't work.
Questions:
How can I iterate and then append a data frame as columns?
Is there an easier way to iterate over a data frame? (e.g. iterrows)
Is there a better way to transpose rows to columns in a data frame? (e.g. pivot, melt)
Any help would be appreciated!!
You can find the sample files for df, df_out, temp_df and the expected output sample table here:
Sample_files
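
One common fix for the "only the last row survives" symptom is to collect each per-row result in a list and call pd.concat once at the end, rather than overwriting temp_df inside the loop. A rough sketch follows; the df and HOG_get_group_stats below are dummy stand-ins so the example runs, not the real objects from the question:

import pandas as pd

# dummy stand-ins for the question's df and function, just so the sketch runs
df = pd.DataFrame({"HOG": ["HOG1", "HOG2", "HOG3"]})

def HOG_get_group_stats(hog):
    # returns a small stats frame plus two extra values, like the real function
    return pd.DataFrame({"stat1": [1], "stat2": [2]}), hog, 2

pieces = []
for row in range(len(df)):
    try:
        df_out, orthogroup, len_group = HOG_get_group_stats(df.loc[row, "HOG"])
        # flatten df_out into a single wide row
        flat = pd.DataFrame(df_out.values.reshape(1, -1))
        flat["HOG"] = orthogroup
        flat["len_group"] = len_group
        pieces.append(flat)
    except Exception:
        print(row, "no")

# one concat at the end keeps every iteration's row instead of only the last
temp_df = pd.concat(pieces, ignore_index=True)
print(temp_df)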

How to concat 3 dataframes with each into sequential columns

I'm trying to understand how to concat three individual dataframes (i.e df1, df2, df3) into a new dataframe say df4 whereby each individual dataframe has its own column left to right order.
I've tried using concat with axis = 1 to do this, but it appears not possible to automate this with a single action.
Table1_updated = pd.DataFrame(columns=['3P','2PG-3Io','3Io'])
Table1_updated=pd.concat([get_table1_3P,get_table1_2P_max_3Io,get_table1_3Io])
Note that with the exception of get_table1_2P_max_3Io, which has two columns, all other dataframes have one column
For example, sample contents for get_table1_3P, get_table1_2P_max_3Io and get_table1_3Io, along with the desired combined output, were shown as tables in the original post.
I believe you need to concat first and then change the order with a list of column names:
Table1_updated=pd.concat([get_table1_3P,get_table1_2P_max_3Io,get_table1_3Io], axis=1)
Table1_updated = Table1_updated[['3P','2PG-3Io','3Io']]
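
For illustration, a self-contained toy version of that answer (the actual tables in the post were shown as images, so the numbers and the 'extra' column here are invented):

import pandas as pd

# invented stand-ins for the three tables
get_table1_3P = pd.DataFrame({"3P": [10, 20, 30]})
get_table1_2P_max_3Io = pd.DataFrame({"2PG-3Io": [1, 2, 3], "extra": [0, 0, 0]})
get_table1_3Io = pd.DataFrame({"3Io": [7, 8, 9]})

# column-wise concat, then keep/reorder only the wanted columns
Table1_updated = pd.concat([get_table1_3P, get_table1_2P_max_3Io, get_table1_3Io], axis=1)
Table1_updated = Table1_updated[["3P", "2PG-3Io", "3Io"]]
print(Table1_updated)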

Remove rows from multiple dataframes that contain bad data

Say I have n dataframes, df1, df2...dfn.
Finding rows that contain "bad" values in a given dataframe is done with, e.g.,
index1 = df1[df1.isin([np.nan, np.inf, -np.inf])]
index2 = df2[df2.isin([np.nan, np.inf, -np.inf])]
Now, dropping these bad rows in the offending dataframe is done with:
df1 = df1.replace([np.inf, -np.inf], np.nan).dropna()
df2 = df2.replace([np.inf, -np.inf], np.nan).dropna()
The problem is that any function that expects the two (or n) dataframes' columns to be of the same length may give an error if there is bad data in one df but not the other.
How do I drop not just the bad row from the offending dataframe, but the same row from a list of dataframes?
So in the two dataframe case, if in df1 date index 2009-10-09 contains a "bad" value, that same row in df2 will be dropped.
[Possible "ugly"? solution?]
I suspect that one way to do it is to merge the two (n) dataframes on date; then applying the cleanup function to drop "bad" values is automatic, since the entire row gets dropped. But what happens if a date is missing from one dataframe and not the other [and they still happen to be the same length]?
First do your replace:
df1 = df1.replace([np.inf, -np.inf], np.nan)
df2 = df2.replace([np.inf, -np.inf], np.nan)
Then concat the two, here using an inner join:
newdf=pd.concat([df1,df2],axis=1,keys=[1,2], join='inner').dropna()
And split it back into two dfs, here using combine_first with dropna on the original df:
df1,df2=[s[1].loc[:,s[0]].combine_first(x.dropna()) for x,s in zip([df1,df2],newdf.groupby(level=0,axis=1))]
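
For the general n-dataframe case, a simpler alternative (not from the answer above) is to compute the set of index labels that are clean in every dataframe and align each one to it. A small sketch with made-up data:

import numpy as np
import pandas as pd

# made-up frames sharing a date index, each with a bad value in a different row
idx = pd.date_range("2009-10-07", periods=4)
df1 = pd.DataFrame({"a": [1.0, np.inf, 3.0, 4.0]}, index=idx)
df2 = pd.DataFrame({"b": [5.0, 6.0, np.nan, 8.0]}, index=idx)

# replace infs with NaN, then keep only the index labels that are clean everywhere
cleaned = [d.replace([np.inf, -np.inf], np.nan) for d in (df1, df2)]
good_index = cleaned[0].dropna().index
for d in cleaned[1:]:
    good_index = good_index.intersection(d.dropna().index)

df1, df2 = [d.loc[good_index] for d in cleaned]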

Merge two csv files that have a similar row structure but no common index between them

I have two csv files that I want to merge by adding the column information from one csv to the other. They have no common index between them, but they do have the same number of rows (they are in order). I have seen many examples of joining csv files based on an index or on matching values, however my csv files have no common index, but they are in order. I've tried a few different examples with no luck.
mycsvfile1
"a","1","mike"
"b","2","sally"
"c","3","derek"
mycsvfile2
"boy","63","retired"
"girl","55","employed"
"boy","22","student"
Desired outcome for outcsvfile3
"a","1","mike","boy","63","retired"
"b","2","sally","girl","55","employed"
"c","3","derek","boy","22","student"
Code:
import csv
import pandas as pd
df2 = pd.read_csv("mycsvfile1.csv",header=None)
df1 = pd.read_csv("mycsvfile2.csv", header=None)
df3 = pd.merge(df1,df2)
Using
df3 = pd.merge([df1,df2])
adds the data as new rows, which doesn't help me. Any assistance is greatly appreciated.
If both dataframes have numbered indexes (i.e. starting at 0 and increasing by 1 - which is the default behaviour of pd.read_csv), and assuming that both DataFrames are already sorted in the correct order so that the rows match up, then this should do it:
df3 = pd.merge(df1,df2, left_index=True, right_index=True)
You do not have any common columns between df1 and df2, besides the index. So we can use concat:
pd.concat([df1,df2],axis=1)
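
Put together for the two files in the question, a minimal end-to-end sketch (file names as given above; quoting=csv.QUOTE_ALL just reproduces the quoted style of the desired output) would be:

import csv
import pandas as pd

# both files have no header row, and their rows are already in matching order
df1 = pd.read_csv("mycsvfile1.csv", header=None)
df2 = pd.read_csv("mycsvfile2.csv", header=None)

# side-by-side concat on the default integer index
df3 = pd.concat([df1, df2], axis=1)
df3.to_csv("outcsvfile3.csv", header=False, index=False, quoting=csv.QUOTE_ALL)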