writing pandas dataframe into csv file - pandas

I am trying to write a pandas DataFrame containing German text to a CSV file. Here is the relevant snippet:
data = p.DataFrame(Inform)          # p is the pandas import alias used here
data = data.fillna("NA")            # replace missing values with the string "NA"
data = data.transpose()
data.to_csv("./Info.csv", encoding='utf-8')
The text was obtained through soup = BeautifulSoup(r, from_encoding='utf-8'). When I print the text in the console it is decoded properly; however, in the CSV the text is not (e.g., "Gesamtfläche" appears garbled). I tried some other encodings, but they don't seem to work either.
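A minimal sketch of one common workaround, assuming the garbling only shows up when the CSV is opened in a spreadsheet application that does not detect UTF-8 on its own (the Inform dict below is a hypothetical stand-in for the scraped data): write with the utf-8-sig codec, which prepends a byte-order mark.
import pandas as pd

# Hypothetical stand-in for the scraped data in the question.
Inform = {"Grundstück": ["Gesamtfläche", "Wohnfläche"]}

data = pd.DataFrame(Inform).fillna("NA").transpose()

# 'utf-8-sig' writes a UTF-8 byte-order mark, which helps spreadsheet
# applications recognize the encoding and render "ä" correctly.
data.to_csv("./Info.csv", encoding='utf-8-sig')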

Related

How to read $ character while reading a csv using pandas dataframe

I want to ignore the $ sign while reading the CSV file. I have tried multiple encoding options such as latin-1, utf-8, utf-16, utf-32, ascii, utf-8-sig, unicode_escape, and rot_13,
as well as encoding_errors='replace', but nothing seems to work.
Below is a dummy data set showing how the '$' is read: the text between '$' signs is rendered in bold-italic.
This is how the original data set looks:
Code:
import pandas as pd

df = pd.read_csv("C:\\Users\\nitin2.bhatt\\Downloads\\CCL\\dummy.csv")
df.head()
Please help; I have referred to multiple blogs but couldn't find a solution to this.
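The $ signs are ordinary characters stored in the file, so encoding options won't remove them; the bold-italic effect is most likely the notebook rendering $...$ as math, not something in the CSV itself. A minimal sketch (with hypothetical column names, since the real ones aren't shown) that strips the $ after reading:
import pandas as pd
from io import StringIO

# Hypothetical stand-in for dummy.csv; the real column names are not shown.
raw = StringIO("id,comment\n1,$approved$ by manager\n2,pending $review$\n")

df = pd.read_csv(raw)

# Remove the literal $ characters from the text column after reading.
df["comment"] = df["comment"].str.replace("$", "", regex=False)
print(df)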

AWS SageMaker Batch Transform CSV Error: Bare " in non quoted field

AWS SageMaker Batch Transform errors with the following:
bare " in non quoted field found near: "['R627' 'Q2739' 'D509' 'S37009A' 'E860' 'D72829' 'R9431' 'J90' 'R7989'
In a SageMaker Studio notebook, I use Pandas to output data to csv:
data.to_csv(my_file, index=False, header=False)
My Pandas dataframe has columns with string values like the following:
['ABC123', 'DEF456']
Pandas is writing line breaks inside these fields. For example, the following is one row that spans two lines because it contains an embedded line break; note that the double-quoted field now spans two lines. Sometimes a field spans three or more lines.
False,ABC123,7,1,3412,['I509'],,"['R627' 'Q2739' 'D509' 'S37009A' 'E860' 'D72829' 'R9431' 'J90' 'R7989'
'R5383' 'J9621']",['R51' 'R05' 'R0981'],['X58XXXA'],M,,A,48
The CSV is valid and I can successfully read it back into a Pandas dataframe.
Why would Batch Transform fail to read this CSV format?
I've converted the arrays to space-separated strings, e.g.
From:
['ABC123', 'DEF456']
To:
ABC123 DEF456
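One likely reason, with Line splitting, is that Batch Transform treats each line as one record, so a quoted field containing embedded newlines breaks parsing even though the CSV itself is valid. A minimal sketch of the flattening workaround described above, using hypothetical column names:
import pandas as pd

# Hypothetical example; the real column names are not shown in the question.
data = pd.DataFrame({
    "patient_id": ["ABC123"],
    "diagnosis_codes": [["R627", "Q2739", "D509", "S37009A", "E860"]],
})

# Join list-like values into a single space-separated string so that no field
# contains embedded newlines when written out.
data["diagnosis_codes"] = data["diagnosis_codes"].apply(
    lambda codes: " ".join(codes) if isinstance(codes, (list, tuple)) else codes
)

data.to_csv("batch_input.csv", index=False, header=False)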

Issue in pyspark dataframe after toPandas csv

I have a DataFrame of type pyspark.sql.dataframe.DataFrame. I converted it into a pandas DataFrame and then saved it as a CSV file. When the CSV is opened, I discovered that the columns which have empty values in a field become \"\".
Going back to the result of the Spark dataframe.toPandas(), when I check one of these column values I see an empty string containing a blank space.
dfpandas.colX[2] gives this result: ' '.
I used this kind of CSV saving:
df_sparksql.repartition(1).write.format('com.databricks.spark.csv') \
    .save("/data/rep//CLT_20200729csv", header='true')
I also used this saving method, but it led to an out-of-memory error:
df = df_per_mix.toPandas()
df.to_csv("/data/rep//CLT_20200729.csv", sep=";", index=False)
What is the issue, and how do I remove the blank space that gets converted to \"\"?
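The whitespace-only values are already present in the data, so the writer simply quotes them; one approach is to normalize them to missing values on the pandas side before saving. A sketch with a hypothetical column:
import numpy as np
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by toPandas().
df = pd.DataFrame({"colX": ["A", " ", "B", "   "]})

# Turn whitespace-only strings into NaN so to_csv writes a truly empty field
# instead of a quoted blank.
df = df.replace(r"^\s*$", np.nan, regex=True)

df.to_csv("/tmp/CLT_clean.csv", sep=";", index=False)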

Encoding Error of Reading .dta Files with Chinese Characters

I am trying to read .dta files with pandas:
import pandas as pd
my_data = pd.read_stata('filename', encoding='utf-8')
the error message is:
ValueError: Unknown encoding. Only latin-1 and ascii supported.
Other encodings didn't work either, such as gb18030 or gb2312 for dealing with Chinese characters. If I remove the encoding parameter, the DataFrame is full of garbled values.
Simply read the original data with the default encoding, then convert to the expected encoding. Suppose the column with the garbled text is column1:
import pandas as pd

dta = pd.read_stata('filename.dta')
# Re-encode the mis-decoded text back to bytes as latin-1, then decode with the real codec.
print(dta['column1'][0].encode('latin-1').decode('gb18030'))
The print result will show normal Chinese characters; gb2312 can also work.
Looking at the source code of pandas (version 0.22.0), the supported encodings for read_stata are ('ascii', 'us-ascii', 'latin-1', 'latin_1', 'iso-8859-1', 'iso8859-1', '8859', 'cp819', 'latin', 'latin1', 'L1'). So you can only choose from this list.
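To apply the same round-trip to the whole column rather than a single value, a short sketch (assuming every garbled entry in column1 is a string that was mis-decoded as latin-1):
import pandas as pd

dta = pd.read_stata('filename.dta')

# Re-decode every value in the garbled column; non-string values are left untouched.
dta['column1'] = dta['column1'].apply(
    lambda v: v.encode('latin-1').decode('gb18030') if isinstance(v, str) else v
)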

Converting a massive JSON file to CSV

I have a 48 MB JSON file (a collection of tweets I data mined). I need to convert the JSON file to CSV so I can import it into a SQL database and cleanse it.
I've tried every JSON-to-CSV converter, but they all come back with the same result: "file exceeds limits" / the file is too large. Is there a good method for converting such a massive JSON file to CSV within a short period of time?
Thank you!
A 48 MB JSON file is pretty small. You should be able to load the data into memory using something like this:
import json

with open('data.json') as data_file:
    data = json.load(data_file)
Depending on how you wrote the JSON file, data may be a list which contains many dictionaries. Try running:
type(data)
If the type is a list, then iterate over each element and inspect it. For instance:
for row in data:
    print(type(row))
    # print(row.keys())
If row is a dict instance, then inspect its keys and, within the loop, start building up what each row of the CSV should contain. Then you can use pandas, the csv module, or just open a file and write line by line with commas yourself.
So maybe something like:
import json

with open('data.json') as data_file:
    data = json.load(data_file)

with open('some_file.txt', 'w') as f:
    for row in data:
        user = row['username']
        text = row['tweet_text']
        created = row['timestamp']
        joined = ",".join([user, text, created])
        f.write(joined + "\n")  # newline so each tweet lands on its own row
You may still run into issues with Unicode characters, commas within your data, etc., but this is a general guide.
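Because tweet text frequently contains commas, quotes, and line breaks, a variant of the same loop using the csv module (with the same hypothetical keys as above) handles the quoting automatically; this is a sketch, not a drop-in solution:
import csv
import json

with open('data.json') as data_file:
    data = json.load(data_file)

with open('tweets.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['username', 'tweet_text', 'timestamp'])  # header row
    for row in data:
        # csv.writer quotes fields that contain commas, quotes, or newlines.
        writer.writerow([row['username'], row['tweet_text'], row['timestamp']])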