pd.read_csv with sep=";" not working on a tricky dataset - pandas

I tried different Stack Overflow solutions using pd.read_csv for this file.
When I use Excel's Text to Columns with ";" as the delimiter, it gives exactly the output I need.
data:
'Balance Sheet;"'Package / Number";"Package Type";"Instrument";"Counterparty";"Opening Date";"Value Date";"Maturity Date";"'Nominal Amount";"'Interest Rate";"CCy";"'Funding Type";"Nominal Amount Local";"Interest Rate Local";"'Maturity Year";"'Maturity Quarter";"Tenor";"Tenor Range";"Date Basis"
Asset Finance;"2.915.239";;"IRS-FIX-TO-FLOAT";"X_SEL";"03/27/2019";"03/29/2019";"08/29/2023";"-20.000.000.000";"1
Asset Finance;"2.915.239";;"IRS-FIX-TO-FLOAT";"X_SEL";"03/27/2019";"03/29/2019";"08/29/2023";"20.000.000.000";"2
Asset Finance;;;"IRS-FIX-TO-FLOAT";;"03/27/2019";"03/29/2019";"08/29/2023";;;;"Payer Swap";"20.000.000.000";"-1
Code:
df = pd.read_csv(path2, sep='";"', engine='python')
df = df.apply(lambda x: x.replace('"', ''))  # -> doesn't seem to be working
The output columns are not split correctly. Per the header above, it should be column 0: Balance Sheet, 1: Package / Number, 2: Package Type, and so on, 19 columns in total.
If there are any other workaround solutions, please share. Thanks!

Use only sep=';' to split the columns correctly, and add quotechar='"' to tell pandas that " is a quote character and should not be part of the value.
df = pd.read_csv(path2, sep=';', quotechar='"', engine='python')
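For this particular file, a fuller sketch (the path shown is hypothetical) that also handles the European-style thousands separators like "20.000.000.000" and the stray leading apostrophes in some headers:

import pandas as pd

path2 = "balance_sheet.csv"  # hypothetical path; substitute your own

# quotechar strips the double quotes; thousands='.' parses values like
# "20.000.000.000" as numbers instead of strings.
df = pd.read_csv(path2, sep=";", quotechar='"', thousands=".", engine="python")

# Some headers carry a stray leading apostrophe (e.g. 'Package / Number);
# strip it so the column names are clean.
df.columns = df.columns.str.lstrip("'")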

Related

Losing rows when renaming columns in pyspark (Azure databricks)

I have a line of pyspark that I am running in databricks:
df = df.toDF(*[format_column(c) for c in df.columns])
where format_column is a Python function that upper-cases, strips, and removes the full stop (.) and backtick (`) characters from the column names.
Before and after this line of code, the dataframe randomly loses a bunch of rows: if I do a count before and after the line, the number of rows drops.
I did some more digging with this and found the same behaviour if I tried the following:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name).alias(column_name) for column_name in df.columns])
although the following is ok without the aliasing:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name) for column_name in df.columns])
and it is also ok if I don't rename all of the columns, such as:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name).alias(column_name) for column_name in df.columns[:-1]])
And finally, there were some pipe (|) characters in the column names, which, when removed manually beforehand, resulted in no issue.
As far as I know, pipe is not actually a special character in spark sql column names (unlike full stop and backtick).
Has anyone seen this kind of behaviour before and know of a solution aside from removing the pipe character manually beforehand?
Running on Databricks Runtime 10.4LTS.
Edit
format_column is defined as follows:
import re

def format_column(column: str) -> str:
    column = column.strip().upper()        # case and leading/trailing white space
    column = re.sub(r"\s+", " ", column)   # collapse multiple white spaces
    column = re.sub(r"\.|`", "_", column)  # replace full stops and backticks with _
    return column
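For example, format_column("  my.column|name ") returns "MY_COLUMN|NAME": the full stop is replaced but the pipe is untouched.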
I reproduced this in my environment and there is no loss of rows in my dataframe. Using the same format_column function, the count of my dataframe before and after the rename is unchanged.
Please recheck whether something other than this function is changing your dataframe.
If you still get the same behaviour, try the following and check whether it loses any rows:
print("before replacing : "+str(df.count()))
df1=df.toDF(*[re.sub('[^\w]', '_', c) for c in df.columns])
df1.printSchema()
print("before replacing : "+str(df1.count()))
If this also loses rows, then the issue is with something else in your dataframe or code; please recheck that.
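If the pipe characters do turn out to be the trigger, one workaround (a sketch, not from the answer above) is to rename the columns one at a time with withColumnRenamed instead of rebuilding the frame with toDF:

def rename_all(df):
    # Rename each column individually instead of passing the whole renamed
    # header list through toDF, which is where the row loss was observed.
    for old in df.columns:
        new = format_column(old)  # the question's formatting function
        if new != old:
            df = df.withColumnRenamed(old, new)
    return df

df = rename_all(df)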

Pandas splitting a column with new line separator

I am extracting tables from pdf using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this.
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True), and that splits it into two columns; however, I want the new column names to be A and B, not 0 and 1. Also, I need to pass a generalized column label instead of the actual column name, since I need to implement this for several docs which may have different column names. I can determine such a column name in my dataframe using
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However, when I pass colNew to the split function, it throws an attribute error:
df[colNew].str.split('\n', 2, expand=True)
AttributeError: 'DataFrame' object has no attribute 'str'
You can take advantage of the Pandas split function.
import pandas as pd

# recreate your pandas series from above
df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4']})

# first: make sure the column is of str type
# second: split the column on the separator \n
# third: pass expand=True so the split produces two new columns
test = df['A\nB'].astype('str').str.split('\n', expand=True)

# rename the new columns
test.columns = ['A', 'B']
I hope this is helpful.
I reproduced the error on my side... I guess the issue is that df[colNew] is still a DataFrame, since colNew is an Index of labels rather than a single label.
But .str.split() only works on a Series, so, taking your code as an example, I would convert the DataFrame to a Series using iloc[:, 0].
Then another line to split the column headers:
df2 = df[colNew].iloc[:, 0].str.split('\n', 2, expand=True)
df2.columns = 'A\nB'.split('\n')
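To stay fully generalized, a sketch that reuses the detected label for the new column names instead of hard-coding 'A\nB' (it assumes exactly one header contains a newline):

import pandas as pd

df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4']})

# Pick the single column whose label contains a newline.
col = df.columns[df.columns.str.contains('\n')][0]

# Split the values, then reuse the label itself for the new names.
out = df[col].str.split('\n', expand=True)
out.columns = col.split('\n')
print(out)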

How to index a column with two values pandas

I have two dataframes:
Dataframe #1
Reads the values; I will only be interested in NodeID and GSE.
sta = pd.read_csv(filename)
Dataframe #2
Reads the file, uses pivot, and gets the following result:
sim = pd.read_csv(headout,index_col=0)
sim['Layer'] = sim.groupby('date').cumcount() + 1
sim['Layer'] = 'L' + sim['Layer'].astype(str)
sim = sim.pivot(index = None , columns = 'Layer').T
This gives me an index with two levels (the name is blank for the first one and 'Layer' for the second), e.g. (1, L1).
What I need help on is:
I cannot find a way to rename that first blank level in the index to 'NodeID'.
I want to name it that so that I can do a lookup using NodeID in both dataframes and bring the 'GSE' values from the first dataframe into the second.
I have been googling ways to rename that first level in the second dataframe and I cannot seem to find a solution. Any ideas help at this point. I think my pivot call might be wrong...
In dataframe #2 before the pivot, the numbers 1-4 are the Node IDs.
Try
df = df.rename(columns={"Index": "your preferred name"})
If it is your index, then do:
df = df.reset_index()
df = df.rename(columns={"index": "your preferred name"})

Reading csv file with pandas

I have a csv file with 2 columns, text and a boolean (y/n), and I am trying to put all the positive values in one file and the negative ones in another. Here is what I tried:
df = pd.read_csv('text_trait_with_binary_EXT.csv','rb',delimiter=',',quotechar='"')
#print(df)
df.columns = ["STATUS", "xEXT"]
positive = []
negative = []
for line in df:
    text = line[0].strip()
    if line[1].strip() == "y":
        positive.append(text)
    elif line[1].strip() == "n":
        negative.append(text)
print(positive)
print(negative)
And when I run this, it just gives empty lists!
I am new to pandas, so if any of you can help, that would be great.
As others have commented, there is almost always a better approach than using iteration in Pandas. It has a lot of built-in functions to help avoid loops.
If I understand your intentions correctly, you want to take the values from column 1 (named 'STATUS'), split them according to whether the corresponding value in column 2 (named 'xEXT') is 'y' or 'n', and generate two lists containing the column 1 values. The following should work (to be used after the first two lines of code you posted):
positive = df.loc[df['xEXT'].str.strip() == 'y', 'STATUS'].tolist()
negative = df.loc[df['xEXT'].str.strip() == 'n', 'STATUS'].tolist()
The documentation on loc is useful for problems like this.
The above solution assumes that your data has been read in correctly. If it does not work for you, please do as others have commented and add a small sample of your data so that we can try out our proposed solutions.
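A self-contained sketch of that approach on toy data (stand-in values, since the question's csv isn't available):

import pandas as pd

# Toy stand-in for the csv described in the question.
df = pd.DataFrame({
    "STATUS": ["great day", "bad day", "ok day"],
    "xEXT": ["y", "n", "y"],
})

positive = df.loc[df["xEXT"].str.strip() == "y", "STATUS"].tolist()
negative = df.loc[df["xEXT"].str.strip() == "n", "STATUS"].tolist()

print(positive)  # ['great day', 'ok day']
print(negative)  # ['bad day']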

Number format column with pandas.DataFrame.to_csv()?

I have a process that reads information from a Microsoft SQL database using:
df = psql.read_sql(sql, con=connection)
print(df)
This function is used in many processes, so the sql variable doesn't always have the same columns (the structure varies).
Then I get, for example, the following data:
STORE EMAIL_CONTACT VALUE
10 a#mail.com 2.2100
23 b#mail.com 0.7990
Everything is fine to this point.
When extracting to csv using:
file = r"Test.csv"
df.to_csv(file, sep=";", index=False, quoting=csv.QUOTE_ALL)
The output is the following:
"STORE";"EMAIL_CONTACT";"VALUE"
"10.0";"a#mail.com";"2.2100"
"23.0";"b#mail.com";"0.7990"
The column STORE now has ".0" appended.
Is there a way to configure to_csv to output the values exactly as shown by print, i.e. the desired output below? Thanks in advance.
"STORE";"EMAIL_CONTACT";"VALUE"
"10";"a#mail.com";"2.2100"
"23";"b#mail.com";"0.7990"
Solved: The problem was with the decimal option:
df.to_csv(file, sep=";", index=False, quoting=csv.QUOTE_ALL, decimal=",")
"STORE";"EMAIL_CONTACT";"VALUE"
"10";"a#mail.com";"2.2100"
"23";"b#mail.com";"0.7990"
Thanks everyone for the support!
STORE is probably a float; check it with:
print(df.STORE.dtype)
if so, do:
df.STORE = df.STORE.astype(int)
then:
df.to_csv("Test.csv", sep=";", index=False)
output:
STORE;EMAIL_CONTACT;VALUE
10;a#mail.com;2.2100
23;b#mail.com;0.7990
EDIT:
For tab separation, use:
df.to_csv("Test.csv", sep="\t", index=False)
this will output a csv with this format:
STORE EMAIL_CONTACT VALUE
10 a#mail.com 2.2100
23 b#mail.com 0.7990
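Note that if STORE came back as float because the database column can contain NULLs, a sketch of an alternative using pandas' nullable integer dtype, which keeps the missing values while dropping the ".0":

import pandas as pd

# Toy stand-in for the query result; the None forces STORE to float64.
df = pd.DataFrame({"STORE": [10, 23, None], "VALUE": [2.21, 0.799, None]})

# 'Int64' (capital I) is pandas' nullable integer dtype: 10.0 becomes 10,
# and the missing value is written as an empty field by to_csv.
df["STORE"] = df["STORE"].astype("Int64")
df.to_csv("Test.csv", sep=";", index=False)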