Issue in pyspark dataframe after toPandas csv - pandas

I have a DataFrame of type pyspark.sql.dataframe.DataFrame. I converted it into a pandas DataFrame and then saved it as a CSV file. When I open the CSV, the fields that had empty values have become "".
When I go back to the result of the Spark dataframe.toPandas() and check one of these column values, I see an empty string containing a blank space.
dfpandas.colX[2] gives this result: ' '.
I used this kind of CSV saving:
df_sparksql.repartition(1).write.format('com.databricks.spark.csv').save("/data/rep//CLT_20200729csv", header='true')
I also used this kind of saving method, but it leads to running out of memory:
df = df_per_mix.toPandas()
df.to_csv("/data/rep//CLT_20200729.csv",sep=";", index=False)
What is the issue, and how can I remove the blank space that gets converted to ""?
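One way to approach this (a sketch, not the asker's exact pipeline) is to turn whitespace-only strings into real nulls on the Spark side before exporting, so they are not written out as quoted empty strings. The column selection and the output path below are assumptions for illustration:
from pyspark.sql import functions as F

# Null out cells that contain only whitespace, in every string column.
string_cols = [c for c, t in df_sparksql.dtypes if t == "string"]
cleaned = df_sparksql
for c in string_cols:
    cleaned = cleaned.withColumn(
        c, F.when(F.trim(F.col(c)) == "", None).otherwise(F.col(c))
    )

# Hypothetical output path; adjust to the real destination.
cleaned.repartition(1).write.csv("/data/rep/CLT_20200729_clean", header=True)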

Related

How to save a tricky dataset

I have a dataset (DataFrame) which contains numbers and lists. When I save it in CSV format and then read it back, the list cells are converted to strings.
Before saving : df.to_csv("data.csv")
After reading : pd.read_csv("data.csv")
After reading : pd.read_csv("data.csv", converters={"C2_ACP": lambda x: x.strip("[]").split(",")})
df.to_csv("data.csv", index=False, sep=",")
I need to retrieve the original dataset when I read the file.
Have you tried changing the sep argument in to_csv? Maybe the default sep=',' conflicts with your list separator, which is also a comma.
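As a sketch of another workaround (the column name C2_ACP is taken from the question; the file name is illustrative), you could serialize the list cells to JSON when saving and parse them back with a converter when reading, which makes the round trip lossless:
import json
import pandas as pd

df = pd.DataFrame({"C1": [1, 2], "C2_ACP": [[0.1, 0.2], [0.3]]})

# Save: turn each list into a JSON string such as "[0.1, 0.2]".
df.assign(C2_ACP=df["C2_ACP"].apply(json.dumps)).to_csv("data.csv", index=False)

# Read: parse each JSON string back into a Python list.
df_back = pd.read_csv("data.csv", converters={"C2_ACP": json.loads})
print(df_back["C2_ACP"][0])  # [0.1, 0.2]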

How to remove empty space in list and also adjust the updated list back to csv

I'm trying to use Python to clean up data in a CSV file.
data = ['Code', 'Name',' Size ',' Sector',' Industry ']
I tried the following:
for x in data:
    print(x.strip())
This lets me print the data in the format I want, but the problem is that it doesn't change the data in the CSV.
If you want to strip whitespace from strings stored in a list, you can do it with a list comprehension like this:
data = [item.strip() for item in data]
If you would like to do this on a pandas DataFrame column:
df['col'] = df['col'].str.strip()
Reassign the cleansed entries back to the data variable before saving it back to CSV:
data = [x.strip() for x in data]
then save data to csv.
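For the CSV round trip itself, here is a minimal sketch using pandas (the file names are illustrative) that strips whitespace from both the header row and every string cell before writing the cleaned file back out:
import pandas as pd

df = pd.read_csv("input.csv")
df.columns = df.columns.str.strip()            # ' Size ' -> 'Size'
for col in df.select_dtypes(include="object"):
    df[col] = df[col].str.strip()              # clean string cells too
df.to_csv("input_clean.csv", index=False)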

Pandas to_csv adds new rows when data has special characters

My data has multiple columns, including a text column:
id text date
8950026 Make your EMI payments only through ABC 01-04-2021 07:43:54
8950969 Pay from your Bank Account linked on XXL \r 01-04-2021 02:16:48
8953627 Do not share it with anyone. -\r 01-04-2021 08:04:57
I used pandas to_csv to export my data. That works well for my first row, but for the next two rows it creates a new line, moves the date onto that line, and adds to the total row count. Basically my output CSV will have 5 rows instead of 3.
df_in.to_csv("data.csv", index = False)
What is the best way to handle the special character "\r" here? I tried converting the text variable to string in pandas (its dtype is object now), but that doesn't help. I can remove every \r at the end of text in my dataframe before exporting, but is there a way to modify to_csv to export this in the right format?
EDIT:
The question below is similar, and I can solve the problem by replacing all instances of \r in my dataframe, but how can this be solved without replacing? Does to_csv have options to handle these?
Pandas to_csv with escape characters and other junk causing return to next line
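A minimal sketch of the replacement approach, assuming the column is named text as in the sample: strip carriage returns and newlines from the text column before exporting, so no field contains a physical line break for a line-oriented reader to trip over.
import pandas as pd

df_in["text"] = df_in["text"].astype(str).str.replace(r"[\r\n]+", " ", regex=True)
df_in.to_csv("data.csv", index=False)
Depending on the quoting settings, to_csv may keep the file technically valid by quoting such fields, but tools that split rows on raw newlines can still miscount them, so removing the characters before export is usually the safer route.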

Julia CSV skipping rows

I have a csv file as shown below. I basically want to add the last two rows (24 & 25) to a dataframe. Unfortunately, with the program generating this file (NetLogo), it's not possible to export it as an xlsx file, so using the xlsx package gives me an error.
I am wondering how to skip rows and get a dataframe. I've tried this piece of code, but it gives me a 2x1 DataFrame with X and 0 as values (basically column A and rows 24-25). What I am after is rows 24-25 and columns A to AC.
using DataFrames
using CSV
df = CSV.File(
    joinpath("D:/ABM/Simulation Runs/Output Files/run_1.csv"),
    skipto = 24
)
You can use the following (you could do the same with CSV.File if you do not want a DataFrame as a sink):
CSV.read("run_1.csv", DataFrame, header=24, limit=1, threaded=false)
Explanation:
header: the line in which the header is stored
limit: the number of rows of data to read (omit it if below the header you only have data)
threaded: use this to ensure that limit is respected exactly (in general CSV.jl might use multiple threads to read your data and try to read more rows than asked)

AWS SageMaker Batch Transform CSV Error: Bare " in non quoted field

AWS SageMaker Batch Transform errors with the following:
bare " in non quoted field found near: "['R627' 'Q2739' 'D509' 'S37009A' 'E860' 'D72829' 'R9431' 'J90' 'R7989'
In a SageMaker Studio notebook, I use Pandas to output data to csv:
data.to_csv(my_file, index=False, header=False)
My Pandas dataframe has columns with string values like the following:
['ABC123', 'DEF456']
Pandas is adding line breaks within these fields, e.g. this is one row (spanning two lines) that contains a line break. Note that the double-quoted field now spans two lines. Sometimes it spans 3 or more lines.
False,ABC123,7,1,3412,['I509'],,"['R627' 'Q2739' 'D509' 'S37009A' 'E860' 'D72829' 'R9431' 'J90' 'R7989'
'R5383' 'J9621']",['R51' 'R05' 'R0981'],['X58XXXA'],M,,A,48
The CSV is valid and I can successfully read it back into a Pandas dataframe.
Why would Batch Transform fail to read this CSV format?
I've converted arrays to strings (space separated) e.g.
From:
['ABC123', 'DEF456']
To:
ABC123 DEF456
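A sketch of that conversion (the list-valued column names here are hypothetical): join each array into a single space-separated string before export, so the CSV contains no embedded newlines or bare quote characters for Batch Transform to trip over.
def to_flat_string(cell):
    # Join any list-like cell (list, tuple, numpy array) into one string.
    if isinstance(cell, str):
        return cell
    try:
        return " ".join(str(x) for x in cell)
    except TypeError:          # not iterable, e.g. a number
        return str(cell)

for col in ["diag_codes", "symptom_codes"]:   # hypothetical list-valued columns
    data[col] = data[col].apply(to_flat_string)

data.to_csv(my_file, index=False, header=False)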