Merge two csv files that have a similar row structure but no common index between them - pandas

I have two csv files that I want to merge by adding the column information from one csv to the other. They have no common index between them, but they do have the same number of rows (they are in order). I have seen many examples of joining csv files based on an index or on matching values, but my csv files have no shared index; they are simply in order. I've tried a few different examples with no luck.
mycsvfile1
"a","1","mike"
"b","2","sally"
"c","3","derek"
mycsvfile2
"boy","63","retired"
"girl","55","employed"
"boy","22","student"
Desired outcome for outcsvfile3
"a","1","mike","boy","63","retired"
"b","2","sally","girl","55","employed"
"c","3","derek","boy","22","student"
Code:
import csv
import pandas as pd
df1 = pd.read_csv("mycsvfile1.csv", header=None)
df2 = pd.read_csv("mycsvfile2.csv", header=None)
df3 = pd.merge(df1, df2)
Using
df3 = pd.merge([df1,df2])
adds the data as new rows, which doesn't help me. Any assistance is greatly appreciated.

If both dataframes have numbered indexes (i.e. starting at 0 and increasing by 1 - which is the default behaviour of pd.read_csv), and assuming that both DataFrames are already sorted in the correct order so that the rows match up, then this should do it:
df3 = pd.merge(df1,df2, left_index=True, right_index=True)
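For reference, a minimal end-to-end sketch under those assumptions, using the file names from the question:
import pandas as pd
df1 = pd.read_csv("mycsvfile1.csv", header=None)  # default RangeIndex: 0, 1, 2, ...
df2 = pd.read_csv("mycsvfile2.csv", header=None)
# align row 0 with row 0, row 1 with row 1, and so on
df3 = pd.merge(df1, df2, left_index=True, right_index=True)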

You do not have any common columns between df1 and df2 besides the index, so we can use concat:
pd.concat([df1,df2],axis=1)
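For completeness, a sketch of the full round trip, again assuming the file names from the question; quoting=csv.QUOTE_ALL is only there to reproduce the fully quoted style of the sample output:
import csv
import pandas as pd
df1 = pd.read_csv("mycsvfile1.csv", header=None)
df2 = pd.read_csv("mycsvfile2.csv", header=None)
# axis=1 glues the columns of df2 onto the right of df1, row by row
df3 = pd.concat([df1, df2], axis=1)
df3.to_csv("outcsvfile3.csv", header=False, index=False, quoting=csv.QUOTE_ALL)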

Related

Generate diff of two CSV files based on a single column using Pandas

I am working with CSV files that are each hundreds of megabytes (800k+ rows), use pipe delimiters, and have 90 columns. What I need to do is compare two files at a time, generating a CSV file of any differences (i.e. rows in File2 that do not exist in File1, as well as rows in File1 that do not exist in File2), but performing the comparison using only a single column.
For instance, a highly simplified version would be:
File1
claim_date|status|first_name|last_name|phone|claim_number
20200501|active|John|Doe|5555551212|ABC123
20200505|active|Jane|Doe|5555551212|ABC321
File2
claim_date|status|first_name|last_name|phone|claim_number
20200501|active|Someone|New|5555551212|ABC123
20200510|active|Another|Person|5555551212|ABC000
In this example, the output file should look like this:
claim_date|status|first_name|last_name|phone|claim_number
20200505|active|Jane|Doe|5555551212|ABC321
20200510|active|Another|Person|5555551212|ABC000
As this example shows, both input files contained the row with claim_number ABC123 and although the fields first_name and last_name changed between the files I do not care as the claim_number was the same in both files. The other rows contained unique claim_number values and so both were included in the output file.
I have been told that Pandas is the way to do this, so I have set up a Jupyter environment and am able to load the files correctly, but am banging my head against the wall at this point. Any suggestions are highly appreciated!
My code so far:
import os
import pandas as pd
df1 = pd.read_table("/Users/X/Claims_20210607.txt", sep='|', low_memory=False)
df2 = pd.read_table("/Users/X/Claims_20210618.txt", sep='|', low_memory=False)
Everything else I've written is basically irrelevant at this point as it's just copypasta from the web that doesn't execute for one reason or another.
EDIT: Solution!
import os
import pandas as pd
df1 = pd.read_table("Claims_20210607.txt", sep='|', low_memory=False)
df2 = pd.read_table("Claims_20210618.txt", sep='|', low_memory=False)
# astype returns a new DataFrame, so assign the result back
df1 = df1.astype({'claim_number': 'str'})
df2 = df2.astype({'claim_number': 'str'})
df = pd.concat([df1,df2])
(
    df.drop_duplicates(
        subset=['claim_number'],
        keep=False,
        ignore_index=True)
    .to_csv('diff.csv')
)
I still need to figure out how to kill off the first/added column before writing the file, but this is fantastic! Thanks!
IIUC, you can try:
If you want to drop duplicates based on all columns except ['first_name', 'last_name']:
df = pd.concat([df1, df2])
(
    df.drop_duplicates(
        subset=df.columns.difference(['first_name', 'last_name']),
        keep=False)
    .to_csv('file3.csv')
)
If you want to drop duplicates based on the claim_number column only:
df = pd.concat([df1, df2])
(
    df.drop_duplicates(
        subset=['claim_number'],
        keep=False)
    .to_csv('file3.csv')
)
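On the follow-up in the edit (the unwanted first column in diff.csv): that column is the DataFrame index, which to_csv writes out by default; passing index=False omits it:
df.drop_duplicates(subset=['claim_number'], keep=False).to_csv('file3.csv', index=False)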

combine two Dataframes through pandas

The question is a little confusing... maybe my English is not good.
1. The data file for the second school is ms_data.csv. Load the data like you loaded the first data file. Then, create a new DataFrame that is a combination of the first two.
df = pd.read_csv('gp_data.csv',sep= ";",index_col= "student_id")
df1 = pd.read_csv('ms_data.csv')
2. Using concat, combine the first and last five rows of the DataFrame. Print the result.
df_row_concat = pd.concat([df, df1])
print(df_row_concat.head(5))
print(df_row_concat.tail(5))
My question is: what does "the first two" mean?
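One plausible reading: "the first two" refers to the two DataFrames loaded in step 1 (the gp and ms data), and step 2 then asks for the first and last five rows of the combined result. A sketch under that interpretation:
import pandas as pd
df = pd.read_csv('gp_data.csv', sep=';', index_col='student_id')
df1 = pd.read_csv('ms_data.csv')
df_row_concat = pd.concat([df, df1])  # combination of the first two DataFrames
# first and last five rows, combined with concat as the exercise asks
print(pd.concat([df_row_concat.head(5), df_row_concat.tail(5)]))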

Pandas - How best to combine dataframes based on specific column values

I have my main data frame (df) with the six columns defined in 'columns_main'.
The needed data comes from two much larger df's; let's call them df1 and df2.
df1 and df2 do not have the same column labels, but they both include the required df columns.
df only needs the few pieces taken from each of the two bigger ones, and by bigger I mean many times the number of columns.
Since it is all going into a DB, I want to get rid of all the unwanted columns.
How do I combine/merge/join the needed data from the large data frames into the main (smaller) data frame, or maybe drop the columns not covered by 'columns_main'?
df = pd.DataFrame(columns = columns_main)
The other two df's are coming from excel workbooks with a lot of unwanted trash.
wb = load_workbook(filename = filename )
ws = wb[_sheets[0]]
df1 = pd.DataFrame(ws.values)
ws = wb[_sheets[1]]
df2 = pd.DataFrame(ws.values)
How can I do this without some sort of crazy looping?
Thank you.
You can select just the needed columns from the other DataFrames by subsetting with the columns_main list:
df1[columns_main]
df2[columns_main]
If some of the columns might not be present, use Index.intersection:
cols = columns_main
df1[df1.columns.intersection(cols)]
df2[df2.columns.intersection(cols)]
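A hedged end-to-end sketch, with one extra assumption: the first worksheet row holds the column labels, so it is promoted to the header before selecting (the same steps apply to df2):
import pandas as pd
from openpyxl import load_workbook

wb = load_workbook(filename=filename)
df1 = pd.DataFrame(wb[_sheets[0]].values)
df1.columns = df1.iloc[0]  # assumption: row 0 of the sheet is the header
df1 = df1.iloc[1:]
# keep only the columns the DB schema needs
df1_needed = df1[df1.columns.intersection(columns_main)]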

Merging Pandas Dataframes with unequal rows

I have two dataframes, dframe and dframep. dframe has 301497 rows and dframep has 6080 rows. I want to merge the two such that when dframep is added to dframe, the new dataframe has NaNs where dframep does not have any values for that date. I have tried this:
dfall = dframe.merge(dframep, on=['ID','date','instrument','close'], how='outer')
The two merge together, but the result is 307577 rows; e.g. for the dates that are not in dframep there are no NaNs.
I'm pulling my hair out, so any help would be appreciated. I'm guessing it has something to do with indexing and selecting the columns correctly.
Thanks
I can't replicate your problem (nor fully understand it from your description), but try something like this:
dfall = pd.merge(dframe, dframep, how='left', left_on=['ID', 'date', 'instrument', 'close'], right_on=['ID', 'date', 'instrument', 'close'])
This will keep the rows of dframe and bring in the matching info from dframep.
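A tiny self-contained illustration of that behaviour (made-up data, not the asker's): left rows with no match on the join keys come through with NaN in the right-only columns:
import pandas as pd

dframe = pd.DataFrame({'ID': [1, 2, 3], 'date': ['d1', 'd2', 'd3'],
                       'instrument': ['x', 'y', 'z'], 'close': [10, 20, 30]})
dframep = pd.DataFrame({'ID': [1], 'date': ['d1'], 'instrument': ['x'],
                        'close': [10], 'extra': ['only in dframep']})
dfall = pd.merge(dframe, dframep, how='left', on=['ID', 'date', 'instrument', 'close'])
print(dfall)  # the d2 and d3 rows get NaN in the 'extra' column
If the merge still produces more rows than expected, it is worth checking that the join columns have identical dtypes in both frames; a column read as int in one file and as str in the other will never match.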

Making Many Empty Columns in PySpark

I have a list of many dataframes, each with a subset schema of a master schema. In order to union these dataframes, I need to construct a common schema across all of them. My thought is that I need to create empty columns for all of the missing columns in each of the dataframes. I have on average about 80 missing features and 100s of dataframes.
This is somewhat of a duplicate or inspired by Concatenate two PySpark dataframes
I am currently implementing things this way:
from pyspark.sql.functions import lit
for df in dfs:  # list of dataframes
    for feature in missing_features:  # list of strings
        df = df.withColumn(feature, lit(None).cast("string"))
This seems to be taking a significant amount of time. Is there a faster way to concat these dataframes with null in place of missing features?
You might be able to cut time a little by replacing your code with:
cols = ["*"] + [lit(None).cast("string").alias(f) for f in missing_features]
dfs_new = [df.select(cols) for df in dfs]
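A hedged usage sketch of that approach (the DataFrame contents and feature names below are invented for illustration), followed by the union the question is building toward. Note that on Spark 3.1+, unionByName(..., allowMissingColumns=True) can also fill absent columns with nulls directly:
from functools import reduce
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
dfs = [spark.createDataFrame([(1,)], ['a']), spark.createDataFrame([(2,)], ['a'])]
missing_features = ['f1', 'f2']  # hypothetical feature names

# one select per DataFrame instead of a chain of withColumn calls
cols = ['*'] + [lit(None).cast('string').alias(f) for f in missing_features]
dfs_new = [df.select(cols) for df in dfs]

# union now that every DataFrame exposes the same columns
combined = reduce(DataFrame.unionByName, dfs_new)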