I have a process that reads information from a Microsoft SQL Server database using:
df = psql.read_sql(sql, con=connection)
print(df)
This function is used in many processes, so the sql variable doesn't always produce the same columns (the structure varies).
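For context, the imports this snippet presumably relies on (an assumption; psql is commonly an alias for pandas' SQL helpers):
import csv                    # for csv.QUOTE_ALL used below
import pandas.io.sql as psql  # assumed alias behind psql.read_sql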
Then I get, for example, the following data:
STORE EMAIL_CONTACT VALUE
10 a#mail.com 2.2100
23 b#mail.com 0.7990
Everything is fine up to this point.
When extracting to CSV using:
file = r"Test.csv"
df.to_csv(file, sep=";", index=False, quoting=csv.QUOTE_ALL)
The output is the following:
"STORE";"EMAIL_CONTACT";"VALUE"
"10.0";"a#mail.com";"2.2100"
"23.0";"b#mail.com";"0.7990"
The column STORE now has ".0"...
Is there a way to configure to_csv so it outputs the values exactly as print shows them? Thanks in advance. The desired output would be:
"STORE";"EMAIL_CONTACT";"VALUE"
"10";"a#mail.com";"2.2100"
"23";"b#mail.com";"0.7990"
Solved: The problem was with the decimal option:
df.to_csv(file, sep=";", index=False, quoting=csv.QUOTE_ALL, decimal=",")
"STORE";"EMAIL_CONTACT";"VALUE"
"10";"a#mail.com";"2.2100"
"23";"b#mail.com";"0.7990"
Thanks everyone for the support!
STORE is probably a float; check it with:
print(df.STORE.dtype)
If so, do:
df.STORE = df.STORE.astype(int)
then:
df.to_csv("Test.csv", sep=";", index=False)
output:
STORE;EMAIL_CONTACT;VALUE
10;a#mail.com;2.2100
23;b#mail.com;0.7990
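If STORE became float because the column contains missing values, astype(int) will fail on the NaNs; pandas' nullable integer dtype is an alternative (a sketch, assuming pandas >= 0.24):
# "Int64" (capital I) is the nullable integer dtype: it keeps NaNs
# but still prints whole numbers without the trailing ".0"
df.STORE = df.STORE.astype("Int64")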
EDIT:
For tab-separated output use:
df.to_csv("Test.csv", sep="\t", index=False)
This will output a file in this format:
STORE EMAIL_CONTACT VALUE
1 a#mail.com 2.2100
2 b#mail.com 0.7990
I tried different Stack Overflow solutions using pd.read_csv for this file.
When I use Excel's Text to Columns feature with ";" as the delimiter, it gives exactly the output I need.
data:
'Balance Sheet;"'Package / Number";"Package Type";"Instrument";"Counterparty";"Opening Date";"Value Date";"Maturity Date";"'Nominal Amount";"'Interest Rate";"CCy";"'Funding Type";"Nominal Amount Local";"Interest Rate Local";"'Maturity Year";"'Maturity Quarter";"Tenor";"Tenor Range";"Date Basis"
Asset Finance;"2.915.239";;"IRS-FIX-TO-FLOAT";"X_SEL";"03/27/2019";"03/29/2019";"08/29/2023";"-20.000.000.000";"1
Asset Finance;"2.915.239";;"IRS-FIX-TO-FLOAT";"X_SEL";"03/27/2019";"03/29/2019";"08/29/2023";"20.000.000.000";"2
Asset Finance;;;"IRS-FIX-TO-FLOAT";;"03/27/2019";"03/29/2019";"08/29/2023";;;;"Payer Swap";"20.000.000.000";"-1
Code:
df = pd.read_csv(path2, sep='";"', engine='python')
df = df.apply(lambda x: x.replace('"', ''))  # --> doesn't seem to be working
The output columns are not split correctly. Per the data above they should be column 0: Balance Sheet, 1: Package / Number, 2: Package Type, etc., 19 columns in total.
pandas output: (screenshot omitted)
If there are any other workaround solutions, please share. Thanks!
Use only sep=";" to correctly split the columns, and add quotechar='"' to tell pandas that " is a quote character and should not be part of the values.
df = pd.read_csv(path2, sep=';', quotechar='"', engine='python')
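To illustrate, a self-contained sketch of that call on a trimmed version of the data above (io.StringIO stands in for the real file):
import io
import pandas as pd

raw = (
    'Balance Sheet;"Package / Number";"Package Type";"Instrument"\n'
    'Asset Finance;"2.915.239";;"IRS-FIX-TO-FLOAT"\n'
)

df = pd.read_csv(io.StringIO(raw), sep=';', quotechar='"', engine='python')
print(df.columns.tolist())
# ['Balance Sheet', 'Package / Number', 'Package Type', 'Instrument']
print(df.iloc[0].tolist())
# ['Asset Finance', '2.915.239', nan, 'IRS-FIX-TO-FLOAT']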
My code to save the df is:
fdi_out_vdem.to_csv("fdi_out_vdem.csv")
To read the df back into Python:
fdi_out_vdem = pd.read_csv("C:/Users/asus/Desktop/classen/fdi_out_vdem.csv")
The df:
Unnamed: 0  country_name  value
1           Spain         190
2           Spain         311
Your df has two columns, but also an index with "0" and "1". When writing it to csv it looks like this:
,country_name,value
0,Spain,190
1,Spain,311
When importing it with pandas, it is considered a df with 3 columns (and the first has no name).
You have two possibilities here:
Save it without the index column:
df.to_csv("fdi_out_vdem.csv", index=False)
df = pd.read_csv("C:/Users/asus/Desktop/classen/fdi_out_vdem.csv")
Or save it with the index column and define an index col when reading it with pd.read_csv:
df.to_csv("fdi_out_vdem.csv")
df = pd.read_csv("C:/Users/asus/Desktop/classen/fdi_out_vdem.csv", index_col=[0])
UPDATE
As recommended by #ouroboros1 in the comments, you could also name your index before saving it to CSV, so you can define the index column by using that name:
df.index.name = "index"
df.to_csv("fdi_out_vdem.csv")
df = pd.read_csv("C:/Users/asus/Desktop/classen/fdi_out_vdem.csv", index_col="index")
You can either pass the parameter index_col=[0] to pandas.read_csv:
fdi_out_vdem = pd.read_csv("C:/Users/asus/Desktop/classen/fdi_out_vdem.csv", index_col=[0])
Or even better, get rid of the index at the beginning when calling pandas.DataFrame.to_csv:
fdi_out_vdem.to_csv("fdi_out_vdem.csv", index=False)
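A quick round-trip sketch of the second option (the tiny frame here is a stand-in for the real data):
import pandas as pd

# Stand-in for the real fdi_out_vdem frame
fdi_out_vdem = pd.DataFrame({"country_name": ["Spain", "Spain"],
                             "value": [190, 311]})

fdi_out_vdem.to_csv("fdi_out_vdem.csv", index=False)
df = pd.read_csv("fdi_out_vdem.csv")
print(df)
#   country_name  value
# 0        Spain    190
# 1        Spain    311   <- no "Unnamed: 0" column this time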
I have a data frame that looks like this (screenshot omitted); there are 109 columns in total.
When I import the data using read_csv it adds ".1", ".2" to the duplicate names.
Is there any way to get around it?
I have tried this:
df = pd.read_csv(r'C:\Users\agns1\Downloads\treatment1.csv', encoding="ISO-8859-1",
                 sep='|', header=None)
df = df.rename(columns=df.iloc[0], copy=False).iloc[1:].reset_index(drop=True)
but it changed the data frame and wasn't helpful.
This is what it did to my data (screenshots of the Python and Excel views omitted).
Remove header=None, because it is there to avoid converting the first row of the file into df.columns. Then strip the . plus digits from the column names:
df = pd.read_csv(r'C:\Users\agns1\Downloads\treatment1.csv',encoding="ISO-8859-1", sep=',')
df.columns = df.columns.str.replace(r'\.\d+$', '', regex=True)
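A small sketch of both steps on toy data with a duplicated header (io.StringIO stands in for the real file):
import io
import pandas as pd

raw = "a,a,b\n1,2,3\n"                # two columns share the name "a"
df = pd.read_csv(io.StringIO(raw))    # first row becomes df.columns
print(df.columns.tolist())            # ['a', 'a.1', 'b']  <- mangled duplicate

df.columns = df.columns.str.replace(r'\.\d+$', '', regex=True)
print(df.columns.tolist())            # ['a', 'a', 'b']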
I am working with CSV files that are each hundreds of megabytes (800k+ rows), use pipe delimiters, and have 90 columns. What I need to do is compare two files at a time, generating a CSV file of any differences (i.e. rows in File2 that do not exist in File1, as well as rows in File1 that do not exist in File2), but performing the comparison using only a single column.
For instance, a highly simplified version would be:
File1
claim_date|status|first_name|last_name|phone|claim_number
20200501|active|John|Doe|5555551212|ABC123
20200505|active|Jane|Doe|5555551212|ABC321
File2
claim_date|status|first_name|last_name|phone|claim_number
20200501|active|Someone|New|5555551212|ABC123
20200510|active|Another|Person|5555551212|ABC000
In this example, the output file should look like this:
claim_date|status|first_name|last_name|phone|claim_number
20200505|active|Jane|Doe|5555551212|ABC321
20200510|active|Another|Person|5555551212|ABC000
As this example shows, both input files contained the row with claim_number ABC123, and although the fields first_name and last_name changed between the files, I do not care, as the claim_number was the same in both files. The other rows contained unique claim_number values and so both were included in the output file.
I have been told that Pandas is the way to do this, so I have set up a Jupyter environment and am able to load the files correctly, but am banging my head against the wall at this point. Any suggestions are highly appreciated!
My code so far:
import os
import pandas as pd
df1 = pd.read_table("/Users/X/Claims_20210607.txt", sep='|', low_memory=False)
df2 = pd.read_table("/Users/X/Claims_20210618.txt", sep='|', low_memory=False)
Everything else I've written is basically irrelevant at this point as it's just copypasta from the web that doesn't execute for one reason or another.
EDIT: Solution!
import os
import pandas as pd
df1 = pd.read_table("Claims_20210607.txt", sep='|', low_memory=False)
df2 = pd.read_table("Claims_20210618.txt", sep='|', low_memory=False)
# astype returns a new DataFrame, so assign the result back
df1 = df1.astype({'claim_number': 'str'})
df2 = df2.astype({'claim_number': 'str'})
df = pd.concat([df1,df2])
(
df.drop_duplicates(
subset=['claim_number'],
keep=False,
ignore_index=True)
.to_csv('diff.csv')
)
I still need to figure out how to kill off the first / added column before writing the file but this is fantastic! Thanks!
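Regarding that extra first column: it is the DataFrame index, which to_csv writes by default, so passing index=False should drop it, e.g.:
(
    df.drop_duplicates(
        subset=['claim_number'],
        keep=False,
        ignore_index=True)
    .to_csv('diff.csv', index=False)  # index=False suppresses the added column
)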
IIUC, you can try:
If you wanna drop duplicates based on all columns except ['first_name', 'last_name']:
df = pd.concat([df1, df2])
(
df.drop_duplicates(
subset=df.columns.difference(['first_name', 'last_name']),
keep=False)
.to_csv('file3.csv')
)
If you wanna drop duplicates based on the claim_number column only:
df = pd.concat([df1,df2])
(
df.drop_duplicates(
subset=['claim_number'],
keep=False)
.to_csv('file3.csv')
)
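For reference, a self-contained sketch of the claim_number variant on the sample files from the question (io.StringIO stands in for the real paths):
import io
import pandas as pd

file1 = """claim_date|status|first_name|last_name|phone|claim_number
20200501|active|John|Doe|5555551212|ABC123
20200505|active|Jane|Doe|5555551212|ABC321"""

file2 = """claim_date|status|first_name|last_name|phone|claim_number
20200501|active|Someone|New|5555551212|ABC123
20200510|active|Another|Person|5555551212|ABC000"""

df1 = pd.read_csv(io.StringIO(file1), sep='|')
df2 = pd.read_csv(io.StringIO(file2), sep='|')

df = pd.concat([df1, df2])
out = df.drop_duplicates(subset=['claim_number'], keep=False)
print(out)
# only the ABC321 and ABC000 rows survive, matching the expected output above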
I have a column of data in a pandas dataframe in Bxxxx-xx-xx-xx.y format. Only the first part (Bxxxx) is all I require. How do I split the data? In addition, I also have data in BSxxxx-xx-xx-xx format in the same column, which I would like to remove using a regex='^BS' command (for some reason, it's not working). Any help in this regard will be appreciated. BTW, I am using the df.filter command.
This should work.
df[df.col1.apply(lambda x: x.split("-")[0][0:2] != "BS")].col1.apply(lambda x: x.split("-")[0])
Consider the example below:
df = pd.DataFrame({
    'col': ['B123-34-gd-op', 'BS01010-9090-00s00', 'B000003-3frdef4-gdi-ortp',
            'B1263423-304-gdcd-op', 'Bfoo3-poo-plld-opo', 'BSfewf-sfdsd-cvc']
})
print(df)
Output:
col
0 B123-34-gd-op
1 BS01010-9090-00s00
2 B000003-3frdef4-gdi-ortp
3 B1263423-304-gdcd-op
4 Bfoo3-poo-plld-opo
5 BSfewf-sfdsd-cvc
Now let's do two tasks:
Extract the Bxxxx part from Bxxxx-xx-xx-xx.
Remove the BSxxxx-formatted strings.
Consider the below code, which uses startswith():
df[~df.col.str.startswith('BS')].col.str.split('-').str[0]
Output:
0 B123
2 B000003
3 B1263423
4 Bfoo3
Name: col, dtype: object
Breakdown:
df[~df.col.str.startswith('BS')] gives us all the strings which do not start with BS. Next, we split those strings on - and take the first part with .col.str.split('-').str[0].
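An equivalent one-liner using a regex instead of split (a sketch; str.extract captures everything before the first -):
# Keep rows not starting with "BS", then capture the part before the first "-"
df.loc[~df.col.str.startswith('BS'), 'col'].str.extract(r'^([^-]+)', expand=False)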
You can define a function wherein you treat Bxxxx-xx-xx-xx.y as a string and just extract the first 5 characters:
>>> def edit_entry(x):
...     return str(x)[:5]
>>> df['Column_name'].apply(edit_entry)
A one-liner solution would be:
df["column_name"] = df["column_name"].apply(lambda x: x[:5])