Help me, please.
I have this dataset:
https://drive.google.com/file/d/1i9QwMZ63qYVlxxde1kB9PufeST4xByVQ/view
I can't replace commas (',') with dots ('.').
When i load this dataset with:
df = pd.read_csv('/content/drive/MyDrive/data.csv', sep=',', decimal=',')
it still contains commas, for example in the value '0,20'.
when i try this code:
df = df.replace(',', '.')
it runs without errors, but the commas still remain, although other values in the dataset can be changed this way...
You can do it like this:
df = df.replace(',', '.', regex=True)
But keep in mind that you need to convert the affected columns to a numeric type (for example float), because for now they are of type object.
You can check for those cases with the below command:
df.dtypes
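A minimal sketch of the whole round trip, using a hypothetical column named value that contains strings like '0,20':
import pandas as pd
# Hypothetical frame standing in for the CSV data that keeps its commas.
df = pd.DataFrame({"value": ["0,20", "1,75", "3,00"]})
# regex=True makes replace work on substrings inside each cell,
# not only on cells that match ',' exactly.
df = df.replace(",", ".", regex=True)
# The column is still of dtype object, so convert it to a numeric type.
df["value"] = pd.to_numeric(df["value"])
print(df.dtypes)  # value    float64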
I have a line of pyspark that I am running in databricks:
df = df.toDF(*[format_column(c) for c in df.columns])
where format_column is a python function that upper cases, strips and removes the characters full stop . and backtick ` from the column names.
Before and after this line of code, the dataframe randomly loses a bunch of rows. If I do a count before and after the line, then the number of rows drops.
I did some more digging with this and found the same behaviour if I tried the following:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name).alias(column_name) for column_name in df.columns])
although the following is ok without the aliasing:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name) for column_name in df.columns])
and it is also ok if I don't rename all columns such as:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name).alias(column_name) for column_name in df.columns[:-1]])
And finally, there were some pipe (|) characters in the column names, which when removed manually beforehand then resulted in no issue.
As far as I know, pipe is not actually a special character in spark sql column names (unlike full stop and backtick).
Has anyone seen this kind of behaviour before and know of a solution aside from removing the pipe character manually beforehand?
Running on Databricks Runtime 10.4LTS.
Edit
format_column is defined as follows:
import re

def format_column(column: str) -> str:
    column = column.strip().upper()         # Case and leading / trailing white spaces
    column = re.sub(r"\s+", " ", column)    # Multiple white spaces
    column = re.sub(r"\.|`", "_", column)   # Full stops and backticks
    return column
I reproduced this in my environment and there is no loss of any rows in my dataframe.
format_column function and my dataframe:
When I used the same format_column, the count of the dataframe was the same before and after the rename.
Please recheck whether something other than this function is changing your dataframe.
If you still get the same behaviour, you can try the following and check whether it loses any rows:
print("before replacing : "+str(df.count()))
df1=df.toDF(*[re.sub('[^\w]', '_', c) for c in df.columns])
df1.printSchema()
print("before replacing : "+str(df1.count()))
If this also loses rows, then the issue is with something else in your dataframe or code; please recheck that.
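If the behaviour persists, one more thing you could try (a sketch only, not verified against the row loss you describe) is renaming the columns one at a time with withColumnRenamed instead of rebuilding the whole frame with toDF:
import re
from pyspark.sql import DataFrame

def format_column(column: str) -> str:
    column = column.strip().upper()         # Case and leading / trailing white spaces
    column = re.sub(r"\s+", " ", column)    # Multiple white spaces
    column = re.sub(r"\.|`", "_", column)   # Full stops and backticks
    return column

def rename_columns(df: DataFrame) -> DataFrame:
    # Rename one column at a time instead of passing all new names to toDF.
    for old_name in df.columns:
        df = df.withColumnRenamed(old_name, format_column(old_name))
    return df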
I am extracting tables from a PDF using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this.
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True) and that splits it into two columns; however, I want the new column names to be A and B, not 0 and 1. Also, I need to pass a generalized column label instead of the actual column name, since I need to implement this for several docs which may have different column names. I can determine such a column name in my dataframe using
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However when I pass colNew in split function, it throws an attribute error
df[colNew].str.split('\n', 2, expand=True)
AttributeError: DataFrame object has no attribute 'str'
You can take advantage of the Pandas split function.
import pandas as pd
# Recreate your dataframe from above.
df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4']})
# First: make sure the column is of string type.
# Second: split the column on the separator '\n'.
# Third: pass expand=True so the split produces two new columns.
test = df['A\nB'].astype('str').str.split('\n', expand=True)
# Rename the resulting columns.
test.columns = ['A', 'B']
I hope this is helpful.
I reproduced the error on my side... The issue is that df[colNew] is still a DataFrame, because colNew is an Index (a list-like selection) rather than a single label.
But .str.split() only works on a Series, so, taking your code as an example, I would convert the DataFrame to a Series using iloc[:, 0].
Then another line to split the column headers:
df2=df[colNew].iloc[:,0].str.split('\n', 2, expand=True)
df2.columns = 'A\nB'.split('\n')
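Putting both ideas together, a generalized sketch (assuming there is exactly one merged column and that its header holds the two desired names separated by '\n') could look like this:
import pandas as pd
# Hypothetical frame standing in for the Camelot extraction.
df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4'], 'C': ['x', 'y', 'z']})
# Find the merged column by looking for a newline in its header.
merged = df.columns[df.columns.str.contains('\n')][0]
# Split the values into two columns and name them from the header.
split_cols = df[merged].str.split('\n', expand=True)
split_cols.columns = merged.split('\n')
# Replace the merged column with the two new ones.
df = pd.concat([df.drop(columns=[merged]), split_cols], axis=1)
print(df)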
I am trying to create a fixed width file output in Pandas. When using a pandas DataFrame's to_string, all the data has whitespace separating the values. How do I remove the whitespace between the data columns?
sql = """SELECT
FIELD_1,
FIELD_2,
.........
FROM
VIEW"""
db_connection_string = "your connection string"
df = pd.read_sql_query(sql=sql, con=db_connection_string)
df['field_1'] = df['field_1'].str.pad(width=10, side='right', fillchar='-')
df['field_2'] = df['field_2'].str.pad(width=10, side='right', fillchar='-')
print(df.to_string(header=False, index=False))
I expected the following:
field1----field2----
What I got was:
field1---- field2----
Please note the spaces between the columns. This is what I am trying to remove. The fields should be flush against one another and not have a whitespace separator.
I think the problem is that to_string adds a default separator. A possible solution is to join all the columns together:
print(df.astype(str).apply(''.join, 1).to_string(header=False, index=False))
field1----field_2---
Or only some columns:
print((df['field_1'] + df['field_2']).to_string(header=False, index=False))
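For a fuller sketch of the fixed-width output, you can pad every column and write the joined rows to a file (the column names, widths and file name below are hypothetical):
import pandas as pd
# Hypothetical data standing in for the result of read_sql_query.
df = pd.DataFrame({'field_1': ['a', 'bb'], 'field_2': ['ccc', 'd']})
# Pad each column to its fixed width.
df['field_1'] = df['field_1'].str.pad(width=10, side='right', fillchar='-')
df['field_2'] = df['field_2'].str.pad(width=10, side='right', fillchar='-')
# Join the columns with no separator and write one record per line.
records = df.astype(str).apply(''.join, axis=1)
with open('fixed_width.txt', 'w') as fh:
    fh.write('\n'.join(records))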
For some reason I need to output a csv in this format, with quotation marks around each column name; my desired output looks like:
"date" "ret"
2018-09-24 0.00013123989025119056
I am trying with
import csv
import os
import pandas as pd

Y_pred.index.name = "\"date\""
Y_pred.name = "\'ret\'"
Y_pred = Y_pred.to_frame()
path = "prediction/Q1/"
try:
    os.makedirs(path)
except:
    pass
Y_pred.to_csv(path + instrument_tmp + "_ret.txt", sep=' ')
and got outputs like:
"""date""" 'ret'
2018-09-24 0.00013123989025119056
I can't seem to find a way to wrap the column names in quotation marks. Does anyone know how? Thanks.
My solution:
using quoting=csv.QUOTE_NONE together with Y_pred.index.name = "\"date\"", Y_pred.name = "\"ret\""
import csv
import os

Y_pred.index.name = "\"date\""
Y_pred.name = "\"ret\""
Y_pred = Y_pred.to_frame()
path = "prediction/Q1/"
try:
    os.makedirs(path)
except:
    pass
Y_pred.to_csv(path + instrument_tmp + "_ret.txt", sep=' ', quoting=csv.QUOTE_NONE)
and then I get
"date" "ret"
2018-09-24 0.00013123989025119056
This is called quoted output.
Instead of manually hacking quotes into your column names (which will mess with other dataframe functionality), use the quoting option:
import csv
import pandas as pd

df = pd.DataFrame({"date": ["2018-09-24"], "ret": [0.00013123989025119056]})
df.to_csv("out_q_esc.txt", sep=' ', escapechar='\\', quoting=csv.QUOTE_ALL, index=None)
"date" "ret"
"2018-09-24" "0.00013123989025119056"
The 'correct' way is to use quoting=csv.QUOTE_ALL (and optionally escapechar='\\'), but note that QUOTE_ALL will force all columns to be quoted, even obviously numeric ones like the index; if we hadn't specified index=None, we would get:
"" "date" "ret"
"0" "2018-09-24" "0.00013123989025119056"
csv.QUOTE_MINIMAL refuses to quote these fields because they don't strictly need quotes (they are neither multiline nor do they contain internal quote or separator characters).
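For comparison, a minimal sketch of what QUOTE_MINIMAL would produce on the same frame (the output file name is hypothetical; expected contents shown in the trailing comments):
import csv
import pandas as pd

df = pd.DataFrame({"date": ["2018-09-24"], "ret": [0.00013123989025119056]})
# QUOTE_MINIMAL only quotes fields containing the separator, a quote
# character or a line break, so these plain values stay unquoted.
df.to_csv("out_q_min.txt", sep=' ', quoting=csv.QUOTE_MINIMAL, index=None)
# out_q_min.txt:
# date ret
# 2018-09-24 0.00013123989025119056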
IIUC, you can use the quoting argument with csv.QUOTE_NONE
import csv
df.to_csv('test.csv',sep=' ',quoting=csv.QUOTE_NONE)
And your resulting csv will look like:
"date" "ret"
0 2018-09-24 0.00013123989025119056
Side Note: To facilitate the adding of quotations to your columns, you can use add_prefix and add_suffix. If your starting dataframe looks like:
>>> df
date ret
0 2018-09-24 0.000131
Then do:
df = df.add_suffix('"').add_prefix('"')
df.to_csv('test.csv',sep=' ',quoting=csv.QUOTE_NONE)
I am having trouble converting a df column into a tuple that I can iterate through. I started with a simple code that works like this:
set = ('pare-10040137', 'pare-10034330', 'pare-00022936', 'pare-10025987', 'pare-10036617')
for i in set:
    ref_data = req_data[req_data['REQ_NUM'] == i]
This works fine, but now I want my set to come from a df. The df looks like this:
open_reqs
Out[233]:
REQ_NUM
4825 pare-00023728
4826 pare-00023773
.... ..............
I want all of those REQ_NUM values thrown into a tuple, so I tried open_reqs.apply(tuple, axis=1) and tuple(zip(open_reqs.columns, open_reqs.T.values.tolist())), but I'm not able to iterate through either of these.
My old set looks like this, so this is the format I need to match in order to iterate through it like I was before. I'm not sure if the Unicode is also an issue (when I print the above I get (u'pare-10052173',)).
In[236]: set
Out[236]:
('pare-10040137',
'pare-10034330',
'pare-00022936',
'pare-10025987',
'pare-10036617')
So basically I need the magic code to get a nice simple set like that from the REQ_NUM column of my open_reqs table. Thank you!
The following statement makes a list out of the specified column and then converts it to a tuple:
open_req_list = tuple(list(open_reqs['REQ_NUM']))
You can use the tolist() function to convert the column to a list and then tuple() to convert the whole list:
req_num = tuple(open_reqs['REQ_NUM'].tolist())
#type(req_num)
req_num
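With either approach, the resulting tuple can then be iterated exactly like the hard-coded one from the question (open_reqs and req_data are assumed to be the dataframes described above):
# Build the tuple from the REQ_NUM column.
req_num = tuple(open_reqs['REQ_NUM'].tolist())
# Iterate over it the same way as with the hard-coded tuple.
for i in req_num:
    ref_data = req_data[req_data['REQ_NUM'] == i]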
df.columns has the datatype object. To convert it into a tuple of all the column names, use:
df = pd.DataFrame(data)
columns_tuple = tuple(df.columns)
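For example, on a small hypothetical frame:
import pandas as pd

df = pd.DataFrame({'REQ_NUM': ['pare-00023728'], 'STATUS': ['open']})
columns_tuple = tuple(df.columns)
print(columns_tuple)  # ('REQ_NUM', 'STATUS')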