I want to ignore the $ sign while reading a CSV file. I have tried multiple encoding options, such as latin-1, utf-8, utf-16, utf-32, ascii, utf-8-sig, unicode_escape, and rot_13,
as well as encoding_errors='replace', but nothing seems to work.
Below is a dummy data set in which the '$' is misread: the text between two '$' signs is rendered in bold-italic font.
This is what the original data set looks like:
Code:
import pandas as pd
df = pd.read_csv("C:\\Users\\nitin2.bhatt\\Downloads\\CCL\\dummy.csv")
df.head()
Please help: I have read multiple blogs but couldn't find a solution to this.
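The bold-italic rendering suggests that the file is read correctly and that the display layer treats the text between two '$' signs as math. Assuming the DataFrame is viewed in a Jupyter notebook (the question does not say so), pandas has a display option that switches off MathJax processing of table contents. A minimal sketch with made-up sample values:

```python
import pandas as pd

# Assumption: the frame is displayed in a Jupyter notebook, where text
# between two '$' signs is typeset by MathJax.
pd.set_option("display.html.use_mathjax", False)

# Hypothetical sample data; the '$' characters are stored unchanged.
df = pd.DataFrame({"price": ["$10 to $20", "$5"]})
values = df["price"].tolist()
print(values)
```

The option only affects how the table is rendered; the strings in the DataFrame are the same either way, which is why changing the encoding makes no difference.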
Related
I have a dataset (DataFrame) that contains numbers and lists. When I save it in CSV format and then read it back, the list cells are converted to strings.
Before saving : df.to_csv("data.csv")
After reading : pd.read_csv("data.csv")
After reading : pd.read_csv("data.csv", converters={"C2_ACP": lambda x: x.strip("[]").split(",")})
df.to_csv("data.csv", index=False, sep=",")
I need to retrieve the original dataset when I read the file.
Have you tried changing the sep argument of DataFrame.to_csv? Maybe the default sep=',' conflicts with your list separator, which is also a comma.
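Another option, sketched here as a suggestion rather than the accepted fix, is to keep the default separator and parse the list column back with ast.literal_eval as the converter. The column name C2_ACP comes from the question; the column C1 and all values are hypothetical, and an in-memory buffer stands in for data.csv:

```python
import ast
import io

import pandas as pd

# Hypothetical frame: "C1" holds numbers, "C2_ACP" holds lists.
df = pd.DataFrame({"C1": [1, 2], "C2_ACP": [[0.1, 0.2], [0.3]]})

buffer = io.StringIO()          # stands in for "data.csv"
df.to_csv(buffer, index=False)  # lists are written as their string form
buffer.seek(0)

# literal_eval parses "[0.1, 0.2]" back into a real Python list.
restored = pd.read_csv(buffer, converters={"C2_ACP": ast.literal_eval})
print(restored["C2_ACP"].tolist())
```

Unlike the strip/split lambda, which returns a list of strings, literal_eval also restores the element types.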
I'm using pandas==1.1.5 to read a CSV file. I'm running the following code:
import pandas as pd
import csv
csv_kwargs = dict(
    delimiter="\t",
    lineterminator="\r\n",
    quoting=csv.QUOTE_MINIMAL,
    escapechar="!",
)
pd.read_csv("...", **csv_kwargs)
It raises the following error: ValueError: Only length-1 line terminators supported.
Pandas documentation confirms that line terminators should be length-1 (I suppose single character).
Is there any way to read this CSV with Pandas or should I read it some other way?
Note that the docs mention the length-1 limit only for the C parser, so maybe I can plug in some other parser?
EDIT: Not specifying the line terminator raises a parse error in the middle of the file. Specifically ParserError: Error tokenizing data., it expects the correct number of fields but gets too many.
EDIT2: I'm confident the kwargs above were used to create the CSV file I'm trying to read.
The problem might be the escapechar, since ! is a common text character.
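For reference, "\r\n" line endings on their own should not need a lineterminator argument, since the parser recognises them by default. A minimal sketch with a hypothetical two-column sample and no escapechar involved:

```python
import io

import pandas as pd

# Hypothetical tab-separated sample with "\r\n" line endings.
raw = "a\tb\r\n1\t2\r\n3\t4\r\n"

# No lineterminator argument: the parser recognises "\r\n" on its own.
df = pd.read_csv(io.StringIO(raw), delimiter="\t")
print(df)
```

If a file like this parses cleanly, the tokenizing error is more likely caused by something else in the data than by the line endings.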
Python's csv module defines a very strict use of escapechar:
A one-character string used by the writer to escape the delimiter if quoting is set to QUOTE_NONE and the quotechar if doublequote is False.
but it's possible that pandas interprets it differently:
One-character string used to escape other characters.
It's possible that you have a row that contains something like:
...\t"some important text!"\t...
which would escape the quote character and continue parsing text into that column.
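The mechanism can be sketched with Python's own csv.reader, used here only as a stand-in: the pandas parser may differ in the details, for example in how it reports the problem. The row is hypothetical and modelled on the line above:

```python
import csv

# Hypothetical row: the middle field ends with "!" just before its closing quote.
line = '1\t"some important text!"\t3\r\n'

# "!" treated as ordinary text.
plain = next(csv.reader([line], delimiter="\t"))
# "!" treated as an escape character, so it escapes the closing quote.
escaped = next(csv.reader([line], delimiter="\t", escapechar="!"))

print(len(plain), plain)
print(len(escaped), escaped)
```

Without the escapechar the row splits into three fields. With it, the closing quote is taken literally and the rest of the line is swallowed into the second field, so the field count no longer matches the header.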
I have a csv-file with a list of keywords that I want to use for some filtering of texts.
I saved the csv-file, and tried to open it in my notebook using pd.from_csv('file.csv', encoding = 'UTF-8')
This didn't work, even though I explicitly specified that encoding.
After some searching, I found several other encodings and decided to go with:
keywords = pd.read_csv('file.csv', encoding = 'latin1')
This gets me the actual keywords, but when I inspect the words, the spaces come through as follows:
['falsification\xa0',
'détournement\xa0de\xa0subsides\xa0',
'parachutes\xa0dorés\xa0',...]
About the CSV file: it has two columns of keywords, one in Dutch and the other in French. The issue with the spaces persists even when I use other encodings like
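'\xa0' is the non-breaking space (U+00A0), so these characters are probably present in the file itself rather than produced by the decoding step. A minimal sketch of cleaning them up after reading, using the sample values from the output above in a hypothetical Series:

```python
import pandas as pd

# Sample values copied from the output above; "\xa0" is a non-breaking space.
keywords = pd.Series([
    "falsification\xa0",
    "détournement\xa0de\xa0subsides\xa0",
    "parachutes\xa0dorés\xa0",
])

# Swap non-breaking spaces for regular ones, then trim the ends.
cleaned = keywords.str.replace("\xa0", " ", regex=False).str.strip()
print(cleaned.tolist())
```

The same two calls can be applied to each keyword column of the DataFrame returned by read_csv.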
I am trying to read .dta files with pandas:
import pandas as pd
my_data = pd.read_stata('filename', encoding='utf-8')
The error message is:
ValueError: Unknown encoding. Only latin-1 and ascii supported.
Other encodings, such as gb18030 or gb2312 for Chinese characters, didn't work either. If I remove the encoding parameter, the DataFrame is full of garbage values.
Simply read the original data with the default encoding, then convert it to the expected encoding. Suppose the column with garbled text is column1:
import pandas as pd
dta = pd.read_stata('filename.dta')
print(dta['column1'][0].encode('latin-1').decode('gb18030'))
The printed result shows normal Chinese characters; gb2312 works as well.
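Why the round trip works can be shown without a .dta file: latin-1 maps every byte to exactly one character, so encoding back to latin-1 recovers the original bytes. The helper name fix_mojibake and the sample text are hypothetical:

```python
def fix_mojibake(text: str, real_encoding: str = "gb18030") -> str:
    """Undo a wrong latin-1 decode by re-encoding and decoding again."""
    return text.encode("latin-1").decode(real_encoding)


# Simulate the problem: gb18030 bytes read as if they were latin-1.
original = "中文"
garbled = original.encode("gb18030").decode("latin-1")

recovered = fix_mojibake(garbled)
print(recovered == original)
```

Assuming every cell in the column is a string, the whole column could be fixed with dta['column1'].map(fix_mojibake).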
Looking at the source code of pandas (version 0.22.0), the supported encodings for read_stata are ('ascii', 'us-ascii', 'latin-1', 'latin_1', 'iso-8859-1', 'iso8859-1', '8859', 'cp819', 'latin', 'latin1', 'L1'). So you can only choose from this list.
I am trying to write a pandas DataFrame that contains German text to a CSV file. Here is the relevant snippet:
data = p.DataFrame(Inform)
data = data.fillna("NA")
data = data.transpose()
data.to_csv("./Info.csv", encoding='utf-8')
The text was obtained through soup = BeautifulSoup(r, from_encoding='utf-8'). When I print the text in the console, it is decoded properly; in the CSV file, however, it is not (e.g., "Gesamtfläche"). I tried some other encodings, but they don't seem to work either.
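One common cause of this symptom, offered here as an assumption because the question does not say how the CSV file is opened, is a viewer such as a spreadsheet program that does not detect UTF-8 without a byte order mark. A sketch that writes the file with the utf-8-sig codec, using a hypothetical one-cell frame and a temporary directory:

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-cell frame with a German word containing an umlaut.
data = pd.DataFrame({"label": ["Gesamtfläche"]})

path = os.path.join(tempfile.mkdtemp(), "Info.csv")
# "utf-8-sig" is UTF-8 with a leading byte order mark (BOM).
data.to_csv(path, index=False, encoding="utf-8-sig")

with open(path, "rb") as handle:
    raw = handle.read()

print(raw[:3])  # the BOM bytes at the start of the file
round_trip = pd.read_csv(path, encoding="utf-8-sig")
```

If the file is consumed by another Python script instead, plain utf-8 on both the writing and the reading side should be enough.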