Help me plz.
I have this dataset:
https://drive.google.com/file/d/1i9QwMZ63qYVlxxde1kB9PufeST4xByVQ/view
i cant replace commas (',') with dots ('.')
When i load this dataset with:
df = pd.read_csv('/content/drive/MyDrive/data.csv', sep=',', decimal=',')
it still contains commas, for example in the value ''0,20'
when i try this code:
df = df.replace(',', '.')
it runs without errors, but the commas still remain, although other values ββββin the dataset can be changed this way...
You can do it like this:
df = df.replace(',', '.', regex=True)
But keep in mind that you need to convert the columns to integer type (the ones that have the issues) because as for now they are as of type object.
You can check for those cases with the below command:
df.dtypes
I have a line of pyspark that I am running in databricks:
df = df.toDF(*[format_column(c) for c in df.columns])
where format_column is a python function that upper cases, strips and removes the characters full stop . and backtick ` from the column names.
Before and after this line of code, the dataframe randomly loses a bunch of rows. If I do a count before and after the line, then the number of rows drops.
I did some more digging with this and found the same behaviour if I tried the following:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name).alias(column_name) for column_name in df.columns])
although the following is ok without the aliasing:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name) for column_name in df.columns])
and it is also ok if I don't rename all columns such as:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name).alias(column_name) for column_name in df.columns[:-1]])
And finally, there were some pipe (|) characters in the column names, which when removed manually beforehand then resulted in no issue.
As far as I know, pipe is not actually a special character in spark sql column names (unlike full stop and backtick).
Has anyone seen this kind of behaviour before and know of a solution aside from removing the pipe character manually beforehand?
Running on Databricks Runtime 10.4LTS.
Edit
format_column is defined as follows:
def format_column(column: str) -> str:
column = column.strip().upper() # Case and leading / trailing white spaces
column = re.sub(r"\s+", " ", column) # Multiple white spaces
column = re.sub(r"\.|`", "_", column)
return column
I reproduced this in my environment and there is no loss of any rows in my dataframe.
format_column function and my dataframe:
When I used the format_column as same, you can see the count of dataframe before and after replacing.
Please recheck your dataframe if something other than this function is changing your dataframe.
If you still getting the same, you can try and check if the following results losing any rows or not.
print("before replacing : "+str(df.count()))
df1=df.toDF(*[re.sub('[^\w]', '_', c) for c in df.columns])
df1.printSchema()
print("before replacing : "+str(df1.count()))
If this also results losing rows, then the issue is with something else in your dataframe or code. please recheck on that.
I have a column in my dataframe like this:
range
"(2,30)"
"(50,290)"
"(400,1000)"
...
and I want to replace the , comma with - dash. I'm currently using this method but nothing is changed.
org_info_exc['range'].replace(',', '-', inplace=True)
Can anybody help?
Use the vectorised str method replace:
df['range'] = df['range'].str.replace(',','-')
df
range
0 (2-30)
1 (50-290)
EDIT: so if we look at what you tried and why it didn't work:
df['range'].replace(',','-',inplace=True)
from the docs we see this description:
str or regex: str: string exactly matching to_replace will be replaced
with value
So because the str values do not match, no replacement occurs, compare with the following:
df = pd.DataFrame({'range':['(2,30)',',']})
df['range'].replace(',','-', inplace=True)
df['range']
0 (2,30)
1 -
Name: range, dtype: object
here we get an exact match on the second row and the replacement occurs.
For anyone else arriving here from Google search on how to do a string replacement on all columns (for example, if one has multiple columns like the OP's 'range' column):
Pandas has a built in replace method available on a dataframe object.
df.replace(',', '-', regex=True)
Source: Docs
If you only need to replace characters in one specific column, somehow regex=True and in place=True all failed, I think this way will work:
data["column_name"] = data["column_name"].apply(lambda x: x.replace("characters_need_to_replace", "new_characters"))
lambda is more like a function that works like a for loop in this scenario.
x here represents every one of the entries in the current column.
The only thing you need to do is to change the "column_name", "characters_need_to_replace" and "new_characters".
Replace all commas with underscore in the column names
data.columns= data.columns.str.replace(' ','_',regex=True)
In addition, for those looking to replace more than one character in a column, you can do it using regular expressions:
import re
chars_to_remove = ['.', '-', '(', ')', '']
regular_expression = '[' + re.escape (''. join (chars_to_remove)) + ']'
df['string_col'].str.replace(regular_expression, '', regex=True)
Almost similar to the answer by Nancy K, this works for me:
data["column_name"] = data["column_name"].apply(lambda x: x.str.replace("characters_need_to_replace", "new_characters"))
If you want to remove two or more elements from a string, example the characters '$' and ',' :
Column_Name
===========
$100,000
$1,100,000
... then use:
data.Column_Name.str.replace("[$,]", "", regex=True)
=> [ 100000, 1100000 ]
I have a some data that was pre-populated from another system whose DataFrame looks as below:
id;value
101;Product_1,,,,,,,,,,,,,,,,,,,,,,,Product_2,,,,,,,,,,,,,,,,,,,,,,, Product_3,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan, Product_4,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None
102;,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None
I am trying to clean this up such that I remove all values that have 2 or more commas (,)continuously that are blanks.
Expected Output:
id; value
101; Product_1, Product_2, Product_3, Product_4
102;
Using semi-colon (;) to identify separators
First, import the data while specifying the separator as a semicolon. Then you can run str.replace() to collapse the commas. There are actually three kinds of replacements you want to perform.
Replace the null values (and blank spaces) with ', '
Replace sequences of commas with single ', '
To deal with empty cells, add a final replace. I've specified it as leaving a blank '', but for many purposes it would more useful to replace it with numpy.nan instead.
import pandas as pd
df = pd.read_csv(path, sep=';')
df['value'].str.replace(r'nan|None| ', '').str.replace(r'\,+', ', ').replace(', ', '')
You might find it useful to have lists instead of strings, in which case you can use:
df['value'].str.split(', ')
For some reason I need to output to a csv in this format with quotations around each columns names, my desired output looks like:
"date" "ret"
2018-09-24 0.00013123989025119056
I am trying with
import csv
import pandas as pd
Y_pred.index.name = "\"date\""
Y_pred.name = "\'ret\'"
Y_pred = Y_pred.to_frame()
path = "prediction/Q1/"
try:
os.makedirs(path)
except:
pass
Y_pred.to_csv(path+instrument_tmp+"_ret.txt",sep=' ')
and got outputs like:
"""date""" 'ret'
2018-09-24 0.00013123989025119056
I can't seem to find a way to use quotation to wrap at the columns. Does anyone know how to? Thanks.
My solution:
using quoting=csv.QUOTE_NONE together with Y_pred.index.name = "\"date\"", Y_pred.name = "\"ret\""
Y_pred.index.name = "\"date\""
Y_pred.name = "\"ret\""
Y_pred = Y_pred.to_frame()
path = "prediction/Q1/"
try:
os.makedirs(path)
except:
pass
Y_pred.to_csv(path+instrument_tmp+"_ret.txt",sep=' ',quoting=csv.QUOTE_NONE)
and then I get
"date" "ret"
2018-09-24 0.00013123989025119056
This is called quoted output.
Instead of manually hacking in quotes into your column names (which will mess with other dataframe functionality), use the quoting option:
df = pd.DataFrame({"date": ["2018-09-24"], "ret": [0.00013123989025119056]})
df.to_csv("out_q_esc.txt", sep=' ', escapechar='\\', quoting=csv.QUOTE_ALL, index=None)
"date" "ret"
"2018-09-24" "0.00013123989025119056"
The 'correct' way is to use quoting=csv.QUOTE_ALL (and optionally escapechar='\\'), but note however that QUOTE_ALL will force all columns to be quoted, even obviously numeric ones like the index; if we hadn't specified index=None, we would get:
"" "date" "ret"
"0" "2018-09-24" "0.00013123989025119056"
csv.QUOTE_MINIMAL refuses to quote these fields because they don't strictly need quotes (they're neither multiline nor do they contain internal quote or separator chars)
IIUC, you can use the quoting argument with csv.QUOTE_NONE
import csv
df.to_csv('test.csv',sep=' ',quoting=csv.QUOTE_NONE)
And your resulting csv will look like:
"date" "ret"
0 2018-09-24 0.00013123989025119056
Side Note: To facilitate the adding of quotations to your columns, you can use add_prefix and add_suffix. If your starting dataframe looks like:
>>> df
date ret
0 2018-09-24 0.000131
Then do:
df = df.add_suffix('"').add_prefix('"')
df.to_csv('test.csv',sep=' ',quoting=csv.QUOTE_NONE)