Generate diff of two CSV files based on a single column using Pandas

I am working with CSV files that are each hundreds of megabytes (800k+ rows), use pipe delimiters, and have 90 columns. What I need to do is compare two files at a time, generating a CSV file of any differences (i.e. rows in File2 that do not exist in File1, as well as rows in File1 that do not exist in File2), but performing the comparison using only a single column.
For instance, a highly simplified version would be:
File1
claim_date|status|first_name|last_name|phone|claim_number
20200501|active|John|Doe|5555551212|ABC123
20200505|active|Jane|Doe|5555551212|ABC321
File2
claim_date|status|first_name|last_name|phone|claim_number
20200501|active|Someone|New|5555551212|ABC123
20200510|active|Another|Person|5555551212|ABC000
In this example, the output file should look like this:
claim_date|status|first_name|last_name|phone|claim_number
20200505|active|Jane|Doe|5555551212|ABC321
20200510|active|Another|Person|5555551212|ABC000
As this example shows, both input files contain a row with claim_number ABC123; although the first_name and last_name fields changed between the files, I do not care, because the claim_number is the same in both. The other rows contain unique claim_number values, so both are included in the output file.
I have been told that Pandas is the way to do this, so I have set up a Jupyter environment and am able to load the files correctly, but am banging my head against the wall at this point. Any suggestions are highly appreciated!
My code so far:
import os
import pandas as pd
df1 = pd.read_table("/Users/X/Claims_20210607.txt", sep='|', low_memory=False)
df2 = pd.read_table("/Users/X/Claims_20210618.txt", sep='|', low_memory=False)
Everything else I've written is basically irrelevant at this point as it's just copypasta from the web that doesn't execute for one reason or another.
EDIT: Solution!
import os
import pandas as pd
df1 = pd.read_table("Claims_20210607.txt", sep='|', low_memory=False)
df2 = pd.read_table("Claims_20210618.txt", sep='|', low_memory=False)
# astype returns a new DataFrame rather than modifying in place, so assign the result back
df1 = df1.astype({'claim_number': 'str'})
df2 = df2.astype({'claim_number': 'str'})
df = pd.concat([df1, df2])
(
df.drop_duplicates(
subset=['claim_number'],
keep=False,
ignore_index=True)
.to_csv('diff.csv')
)
I still need to figure out how to kill off the first / added column before writing the file but this is fantastic! Thanks!

IIUC, you can try:
If you want to drop duplicates based on all columns except ['first_name', 'last_name']:
df = pd.concat([df1, df2])
(
df.drop_duplicates(
subset=df.columns.difference(['first_name', 'last_name']),
keep=False)
.to_csv('file3.csv')
)
If you want to drop duplicates based on the claim_number column only:
df = pd.concat([df1, df2])
(
df.drop_duplicates(
subset=['claim_number'],
keep=False)
.to_csv('file3.csv')
)
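To sanity-check either variant, the sample data from the question can be loaded inline (StringIO stands in for the real pipe-delimited files; claim_number is read as str so the values aren't mangled by type inference):

```python
import pandas as pd
from io import StringIO

file1 = """claim_date|status|first_name|last_name|phone|claim_number
20200501|active|John|Doe|5555551212|ABC123
20200505|active|Jane|Doe|5555551212|ABC321"""
file2 = """claim_date|status|first_name|last_name|phone|claim_number
20200501|active|Someone|New|5555551212|ABC123
20200510|active|Another|Person|5555551212|ABC000"""

df1 = pd.read_csv(StringIO(file1), sep='|', dtype={'claim_number': str})
df2 = pd.read_csv(StringIO(file2), sep='|', dtype={'claim_number': str})

# keep=False drops *all* copies of a duplicated claim_number,
# leaving only the rows unique to one file.
diff = pd.concat([df1, df2]).drop_duplicates(subset=['claim_number'],
                                             keep=False,
                                             ignore_index=True)
print(diff['claim_number'].tolist())  # ['ABC321', 'ABC000']
```

This matches the expected output in the question: the shared ABC123 rows disappear, while ABC321 and ABC000 survive.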

Related

python - if-else in a for loop processing one column

I want to loop through a column and convert it into a processed series.
Below is an example of a two-row, four-column data frame:
import pandas as pd
from rapidfuzz import process as process_rapid
from rapidfuzz import utils as rapid_utils
data = [['r/o ac. nephritis. /. nephrotic syndrome', ' ac. nephritis. /. nephrotic syndrome',1,'ac nephritis nephrotic syndrome'], [ 'sternocleidomastoid contracture','sternocleidomastoid contracture',0,"NA"]]
# Create the pandas DataFrame
df_diagnosis = pd.DataFrame(data, columns = ['diagnosis_name', 'diagnosis_name_edited','is_spell_corrected','spell_corrected_value'])
I want to use the spell_corrected_value column if the is_spell_corrected column is greater than 1; otherwise, use diagnosis_name_edited.
At the moment, I have the following code that directly uses the diagnosis_name_edited column. How do I turn this into an if-else/lambda check on the is_spell_corrected column?
unmapped_diag_series = (rapid_utils.default_process(d) for d in df_diagnosis['diagnosis_name_edited'].astype(str))  # generator of processed strings
unmapped_processed_diagnosis = pd.Series(unmapped_diag_series)
Thank you.
If I get you right, try out this fast vectorized solution using numpy.where (this requires import numpy as np):
df_diagnosis['new_column'] = np.where(df_diagnosis['is_spell_corrected'] > 1, df_diagnosis['spell_corrected_value'], df_diagnosis['diagnosis_name_edited'])
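A minimal, runnable check of that np.where pattern, using a trimmed copy of the question's frame (the shortened diagnosis strings here are placeholders, not the original data). Note that with the sample flag values 0 and 1, the > 1 condition is never true, so every row falls back to diagnosis_name_edited:

```python
import numpy as np
import pandas as pd

data = [['r/o ac. nephritis', 'ac. nephritis', 1, 'ac nephritis'],
        ['sternocleidomastoid contracture', 'sternocleidomastoid contracture', 0, 'NA']]
df_diagnosis = pd.DataFrame(data, columns=['diagnosis_name', 'diagnosis_name_edited',
                                           'is_spell_corrected', 'spell_corrected_value'])

# Vectorised if/else: pick spell_corrected_value where the flag exceeds 1,
# otherwise fall back to diagnosis_name_edited.
df_diagnosis['new_column'] = np.where(df_diagnosis['is_spell_corrected'] > 1,
                                      df_diagnosis['spell_corrected_value'],
                                      df_diagnosis['diagnosis_name_edited'])
print(df_diagnosis['new_column'].tolist())
```

If the intent was actually "use the corrected value whenever the flag is set", the condition should probably be `== 1` rather than `> 1`.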

allowing python to import csv with duplicate column names

I have a data frame with 109 columns in total.
When I import the data using read_csv, it adds ".1", ".2" to duplicate column names.
Is there any way to get around this?
I have tried this:
df = pd.read_csv(r'C:\Users\agns1\Downloads\treatment1.csv',encoding = "ISO-8859-1",
sep='|', header=None)
df = df.rename(columns=df.iloc[0], copy=False).iloc[1:].reset_index(drop=True)
but it changed the data frame and wasn't helpful.
Remove header=None, because it is what stops the first row of the file from becoming df.columns. Then strip the . plus digits from the column names:
df = pd.read_csv(r'C:\Users\agns1\Downloads\treatment1.csv', encoding="ISO-8859-1", sep=',')
df.columns = df.columns.str.replace(r'\.\d+$', '', regex=True)
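To see the suffix-stripping step in isolation, the mangled columns can be built by hand to mimic what read_csv produces for duplicate headers (note that newer pandas versions require regex=True, since str.replace now defaults to literal matching):

```python
import pandas as pd

# read_csv mangles duplicate headers into 'x', 'x.1', 'x.2', ...
df = pd.DataFrame([[1, 2, 3]], columns=['x', 'x.1', 'x.2'])

# Stripping the trailing '.<digits>' restores the original (duplicate) names.
df.columns = df.columns.str.replace(r'\.\d+$', '', regex=True)
print(df.columns.tolist())  # ['x', 'x', 'x']
```

Be aware that duplicate column labels make subsequent selection ambiguous (df['x'] returns all three columns), so this is usually only worth doing right before writing the file back out.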

How do I swap two (or more) columns in two different data tables in pandas?

New here and I am new to programming.
So... as the title says, I am trying to swap two full columns between two different files (the columns have the same name but different data). I started with this:
import numpy as np
import pandas as pd
from pandas import DataFrame
df = pd.read_csv('table1.csv')
df1 = pd.read_csv('table2.csv')
df1.COL1 = df.COL1
But now I am stuck.. how do I select a whole column, and how can I write the new combined table to a new file (i.e. table3)?
You could perform the swap by copying one column into a temporary one and deleting it afterwards, as follows:
df1['temp'] = df1['COL1']
df1['COL1'] = df['COL1']
df['COL1'] = df1['temp']
del df1['temp']
and then write the result to a third CSV via to_csv:
df1.to_csv('table3.csv')
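If you'd rather skip the temporary column, a tuple assignment with explicit copies does the same swap in one step. This is an alternative sketch on hand-made frames, with table1.csv and table2.csv stood in by inline data:

```python
import pandas as pd

df = pd.DataFrame({'COL1': [1, 2, 3]})   # stands in for table1.csv
df1 = pd.DataFrame({'COL1': [4, 5, 6]})  # stands in for table2.csv

# Both right-hand sides are copied before either assignment happens,
# so neither side sees an already-overwritten column.
df['COL1'], df1['COL1'] = df1['COL1'].copy(), df['COL1'].copy()

print(df['COL1'].tolist())   # [4, 5, 6]
print(df1['COL1'].tolist())  # [1, 2, 3]
```

The .copy() calls matter: without them, the second assignment could read data the first assignment already replaced.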

Merge two csv files that have a similar row structure but no common index between them

I have two csv files that I want to merge by adding the column information from one csv to the other. They have no common index, but they do have the same number of rows (and they are in order). I have seen many examples of joining csv files based on an index or matching values, but my csv files share no index; they are simply in the same order. I've tried a few different examples with no luck.
mycsvfile1
"a","1","mike"
"b","2","sally"
"c","3","derek"
mycsvfile2
"boy","63","retired"
"girl","55","employed"
"boy","22","student"
Desired outcome for outcsvfile3
"a","1","mike","boy","63","retired"
"b","2","sally","girl","55","employed"
"c","3","derek","boy","22","student"
Code:
import csv
import pandas as pd
df2 = pd.read_csv("mycsvfile1.csv",header=None)
df1 = pd.read_csv("mycsvfile2.csv", header=None)
df3 = pd.merge(df1,df2)
Using
df3 = pd.merge([df1, df2])
adds the data as new rows, which doesn't help me. Any assistance is greatly appreciated.
If both dataframes have numbered indexes (i.e. starting at 0 and increasing by 1 - which is the default behaviour of pd.read_csv), and assuming that both DataFrames are already sorted in the correct order so that the rows match up, then this should do it:
df3 = pd.merge(df1,df2, left_index=True, right_index=True)
You do not have any common columns between df1 and df2, besides the index, so we can use concat:
pd.concat([df1,df2],axis=1)
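A runnable sketch of the axis=1 concat on the question's sample rows (StringIO stands in for the two csv files; header=None keeps the first row as data, and both frames share the default 0-based index that concat aligns on):

```python
import pandas as pd
from io import StringIO

csv1 = '"a","1","mike"\n"b","2","sally"\n"c","3","derek"'
csv2 = '"boy","63","retired"\n"girl","55","employed"\n"boy","22","student"'

df1 = pd.read_csv(StringIO(csv1), header=None)
df2 = pd.read_csv(StringIO(csv2), header=None)

# axis=1 glues the frames side by side, row for row, via the shared 0..2 index.
df3 = pd.concat([df1, df2], axis=1)
print(df3.shape)  # (3, 6)

# Write without the index or the numeric column labels to match the desired output.
df3.to_csv('outcsvfile3.csv', index=False, header=False)
```

This only lines up correctly because both files really are in the same order; any reordering during reading would silently pair the wrong rows.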

Concatenate a pandas dataframe to CSV file without reading the entire file

I have a quite large CSV file and a pandas dataframe that has exactly the same columns as the CSV file.
I checked on Stack Overflow, and several answers suggest using read_csv, concatenating the read dataframe with the current one, and then writing back to a CSV file.
But for a large file I think that is not the best way.
Can I concatenate a pandas dataframe to an existing CSV file without reading the whole file?
Update: Example
import pandas as pd
df1 = pd.DataFrame({'a': 1, 'b': 2}, index=[0])
df1.to_csv('my.csv')
df2 = pd.DataFrame({'a': 3, 'b': 4}, index=[1])
# what to do here? I would like to concatenate df2 to my.csv
The expected my.csv
a b
0 1 2
1 3 4
Look at using mode='a' in to_csv:
MCVE:
df1 = pd.DataFrame({'a': 1, 'b': 2}, index=[0])
df1.to_csv('my.csv')
df2 = pd.DataFrame({'a': 3, 'b': 4}, index=[1])
df2.to_csv('my.csv', mode='a', header=False)
!type my.csv #Windows machine use 'type' command or on unix use 'cat'
Output:
,a,b
0,1,2
1,3,4