Allowing Python to import a CSV with duplicate column names - pandas

I have a data frame with 109 columns in total.
When I import the data using read_csv, pandas adds ".1", ".2" suffixes to duplicate column names.
Is there any way to get around this?
i have tried this :
df = pd.read_csv(r'C:\Users\agns1\Downloads\treatment1.csv', encoding="ISO-8859-1",
                 sep='|', header=None)
df = df.rename(columns=df.iloc[0], copy=False).iloc[1:].reset_index(drop=True)
but it changed the data frame and wasn't helpful.
This is what it did to my data (screenshots in the original post compared the Python output with the Excel source).

Remove header=None, because it prevents the first row of the file from being converted to df.columns. Then strip the . followed by digits from the column names:
df = pd.read_csv(r'C:\Users\agns1\Downloads\treatment1.csv', encoding="ISO-8859-1", sep=',')
df.columns = df.columns.str.replace(r'\.\d+$', '', regex=True)
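For illustration, here is a minimal, self-contained sketch of the round trip, using a tiny inline CSV with a duplicated value column (the column names here are hypothetical stand-ins, not from the original file):
import io
import pandas as pd

# hypothetical stand-in for the real file: a duplicated 'value' column
csv_text = "id,value,value\n1,10,20\n2,30,40\n"

df = pd.read_csv(io.StringIO(csv_text))
print(list(df.columns))   # ['id', 'value', 'value.1']

df.columns = df.columns.str.replace(r'\.\d+$', '', regex=True)
print(list(df.columns))   # ['id', 'value', 'value']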

Related

Concatenate row values in Pandas DataFrame

I have a problem with Pandas' DataFrame object.
I have read the first excel file and I have a DataFrame like this:
First DataFrame
Then I read the second excel file like this:
Second DataFrame
I need to concatenate the rows so it looks like this:
Third DataFrame
I have code like this:
import pandas as pd
import numpy as np
x1 = pd.ExcelFile("x1.xlsx")
df1 = pd.read_excel(x1, "Sheet1")
x2 = pd.ExcelFile("x2.xlsx")
df2 = pd.read_excel(x2, "Sheet1")
result = pd.merge(df1, df2, how="outer")
The second df just follows the first df. How can I get a dataframe laid out like the third one?
merge does not concatenate the dfs as you want; use append instead (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions prefer the concat form below).
ndf = df1.append(df2).sort_values('name')
You can also use concat:
ndf = pd.concat([df1, df2]).sort_values('name')
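For example, with two small made-up frames standing in for the excel files (the real frames are only shown as images in the question), concat stacks the rows and sort_values interleaves them:
import pandas as pd

# made-up data standing in for the two excel files
df1 = pd.DataFrame({'name': ['a', 'c'], 'val1': [1, 2]})
df2 = pd.DataFrame({'name': ['b', 'd'], 'val2': [3, 4]})

ndf = pd.concat([df1, df2]).sort_values('name').reset_index(drop=True)
print(ndf)
#   name  val1  val2
# 0    a   1.0   NaN
# 1    b   NaN   3.0
# 2    c   2.0   NaN
# 3    d   NaN   4.0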

Read json files in pandas dataframe

I have a large pandas dataframe (17,000 rows) where each row holds a filepath to a specific json file. For each row I want to read the json file's content and extract it into a new dataframe.
The dataframe looks something like this:
0 /home/user/processed/config1.json
1 /home/user/processed/config2.json
2 /home/user/processed/config3.json
3 /home/user/processed/config4.json
4 /home/user/processed/config5.json
... ...
16995 /home/user/processed/config16995.json
16996 /home/user/processed/config16996.json
16997 /home/user/processed/config16997.json
16998 /home/user/processed/config16998.json
16999 /home/user/processed/config16999.json
What is the most efficient way to do this?
I believe a simple for-loop might be best suited here?
import json
import pandas as pd

json_content = []
for path in df.iloc[:, 0]:  # iterate over the file paths in the first column
    with open(path) as file:
        json_content.append(json.load(file))
result = pd.DataFrame(json_content)
Generally, I'd try the iterrows() function (as a first attempt at improving efficiency).
The implementation could look like this:
import json
import pandas as pd

json_content = []
for idx, row in df.iterrows():  # iterrows() yields (index, Series) pairs
    with open(row.iloc[0]) as file:  # the path sits in the first column
        json_content.append(json.load(file))
result = pd.Series(json_content)
A possible solution is the following:
# pip install pandas
import pandas as pd

# convert the column with paths to a list (.iloc[:, 0] selects all rows of the first column)
paths = df.iloc[:, 0].tolist()

all_dfs = []
for path in paths:
    json_df = pd.read_json(path, encoding='utf-8')
    all_dfs.append(json_df)
Each df in all_dfs can be accessed individually or in a loop by index, like all_dfs[0], all_dfs[1], etc.
If you wish, you can merge all_dfs into a single dataframe:
dfs = pd.concat(all_dfs, axis=1)
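If instead every JSON file contributes rows with the same columns (an assumption, since the file layout isn't shown in the question), stacking along the row axis is the more common pattern:
# stack row-wise instead of side by side (assumes the frames share columns)
combined = pd.concat(all_dfs, axis=0, ignore_index=True)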

Generate diff of two CSV files based on a single column using Pandas

I am working with CSV files that are each hundreds of megabytes (800k+ rows), use pipe delimiters, and have 90 columns. What I need to do is compare two files at a time, generating a CSV file of any differences (i.e. rows in File2 that do not exist in File1, as well as rows in File1 that do not exist in File2), but performing the comparison using only a single column.
For instance, a highly simplified version would be:
File1
claim_date|status|first_name|last_name|phone|claim_number
20200501|active|John|Doe|5555551212|ABC123
20200505|active|Jane|Doe|5555551212|ABC321
File2
claim_date|status|first_name|last_name|phone|claim_number
20200501|active|Someone|New|5555551212|ABC123
20200510|active|Another|Person|5555551212|ABC000
In this example, the output file should look like this:
claim_date|status|first_name|last_name|phone|claim_number
20200505|active|Jane|Doe|5555551212|ABC321
20200510|active|Another|Person|5555551212|ABC000
As this example shows, both input files contained the row with claim_number ABC123, and although the fields first_name and last_name changed between the files, I do not care, since the claim_number was the same in both. The other rows contained unique claim_number values, so both were included in the output file.
I have been told that Pandas is the way to do this, so I have set up a Jupyter environment and am able to load the files correctly, but I am banging my head against the wall at this point. Any suggestions are highly appreciated!
My code so far:
import os
import pandas as pd
df1 = pd.read_table("/Users/X/Claims_20210607.txt", sep='|', low_memory=False)
df2 = pd.read_table("/Users/X/Claims_20210618.txt", sep='|', low_memory=False)
Everything else I've written is basically irrelevant at this point as it's just copypasta from the web that doesn't execute for one reason or another.
EDIT: Solution!
import os
import pandas as pd
df1 = pd.read_table("Claims_20210607.txt", sep='|', low_memory=False)
df2 = pd.read_table("Claims_20210618.txt", sep='|', low_memory=False)
# astype returns a new frame, so assign the result back
df1 = df1.astype({'claim_number': 'str'})
df2 = df2.astype({'claim_number': 'str'})
df = pd.concat([df1, df2])
(
    df.drop_duplicates(
        subset=['claim_number'],
        keep=False,
        ignore_index=True)
    .to_csv('diff.csv')
)
I still need to figure out how to kill off the first / added column before writing the file, but this is fantastic! Thanks!
IIUC, you can try:
If you want to drop duplicates based on all columns except ['first_name', 'last_name']:
df = pd.concat([df1, df2])
(
    df.drop_duplicates(
        subset=df.columns.difference(['first_name', 'last_name']),
        keep=False)
    .to_csv('file3.csv')
)
If you want to drop duplicates based on the duplicate claim_number column only:
df = pd.concat([df1, df2])
(
    df.drop_duplicates(
        subset=['claim_number'],
        keep=False)
    .to_csv('file3.csv')
)
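As a self-contained check against the sample data from the question (using io.StringIO in place of the real file paths):
import io
import pandas as pd

file1 = """claim_date|status|first_name|last_name|phone|claim_number
20200501|active|John|Doe|5555551212|ABC123
20200505|active|Jane|Doe|5555551212|ABC321"""

file2 = """claim_date|status|first_name|last_name|phone|claim_number
20200501|active|Someone|New|5555551212|ABC123
20200510|active|Another|Person|5555551212|ABC000"""

df1 = pd.read_csv(io.StringIO(file1), sep='|')
df2 = pd.read_csv(io.StringIO(file2), sep='|')

df = pd.concat([df1, df2], ignore_index=True)
diff = df.drop_duplicates(subset=['claim_number'], keep=False)
print(diff)
#    claim_date  status first_name last_name       phone claim_number
# 1    20200505  active       Jane       Doe  5555551212       ABC321
# 3    20200510  active    Another    Person  5555551212       ABC000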

pandas remove spaces from Series

The question is how to gain access to the strings inside the first column so that string manipulations can be performed on each value, for example removing the spaces in front of each string.
import pandas as pd
data = pd.read_csv("adult.csv", sep='\t', index_col=0)
series = data['workclass'].value_counts()
print(series)
Here is the file:
Zipped csv file
The values are in the index, so use str.strip on series.index:
series.index = series.index.str.strip()
But if you need to convert the series into a two-column DataFrame, use:
df = series.rename_axis('a').reset_index(name='b')
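A small sketch of the strip step with made-up values (the real column comes from adult.csv, whose categories carry a leading space):
import pandas as pd

data = pd.DataFrame({'workclass': [' Private', ' Private', ' Self-emp']})
series = data['workclass'].value_counts()
print(series.index.tolist())    # [' Private', ' Self-emp']

series.index = series.index.str.strip()
print(series.index.tolist())    # ['Private', 'Self-emp']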

Concatenate a pandas dataframe to CSV file without reading the entire file

I have a quite large CSV file, and a pandas dataframe that has exactly the same columns as the CSV file.
I checked on Stack Overflow, and several answers suggest read_csv, then concatenating the read dataframe with the current one, then writing back to a CSV file.
But for a large file I think that is not the best way.
Can I concatenate a pandas dataframe to an existing CSV file without reading the whole file?
Update: Example
import pandas as pd
df1 = pd.DataFrame({'a': 1, 'b': 2}, index=[0])
df1.to_csv('my.csv')
df2 = pd.DataFrame({'a': 3, 'b': 4}, index=[1])
# what to do here? I would like to concatenate df2 to my.csv
The expected my.csv
a b
0 1 2
1 3 4
Look at using mode='a' in to_csv:
MCVE:
import pandas as pd

df1 = pd.DataFrame({'a': 1, 'b': 2}, index=[0])
df1.to_csv('my.csv')
df2 = pd.DataFrame({'a': 3, 'b': 4}, index=[1])
df2.to_csv('my.csv', mode='a', header=False)
!type my.csv  # on a Windows machine use the 'type' command; on unix use 'cat'
Output:
,a,b
0,1,2
1,3,4
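One caveat: with mode='a' the header row would be appended again on every call, which is why header=False is passed above. If the file may not exist yet, a small defensive variant (a sketch, assuming the same my.csv) is to write the header only when creating the file:
import os
import pandas as pd

df2 = pd.DataFrame({'a': 3, 'b': 4}, index=[1])
# write the header only if the file doesn't exist yet
df2.to_csv('my.csv', mode='a', header=not os.path.exists('my.csv'))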