I'm using a CSV file from Kaggle for fake-news analysis, and it contains Unicode characters of all types. Whenever I ran the code with utf-8 encoding, I got an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 83735: invalid continuation byte
The resolution is included below as well.
import pandas as pd

url = "https://raw.githubusercontent.com/akdubey2k/NLP/main/Fake_News_Classifier/train.csv"
try:
    df = pd.read_csv(url, encoding="utf-8")
except UnicodeDecodeError:
    # latin-1 maps every possible byte to a character, so it never raises
    df = pd.read_csv(url, encoding="latin-1")
df.head()
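The same fallback can be exercised without the network, using an in-memory byte stream; the sample bytes below are made up, but the stray 0xe2 byte is the same kind of invalid UTF-8 sequence the error message reports:

```python
import io
import pandas as pd

# 0xe2 starts a 3-byte UTF-8 sequence, but 'w' is not a valid
# continuation byte -- exactly the error the question reports
raw = b"text\nhello\xe2world\n"

try:
    df = pd.read_csv(io.BytesIO(raw), encoding="utf-8")
except UnicodeDecodeError:
    # latin-1 maps every byte to a character, so it always succeeds,
    # at the cost of possible mojibake for non-latin-1 data
    df = pd.read_csv(io.BytesIO(raw), encoding="latin-1")

print(df.iloc[0, 0])  # the 0xe2 byte decodes to 'â' under latin-1
```

On pandas 1.3+, passing encoding_errors="replace" is another option, if substituting the offending characters is acceptable.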
I have a large .TXT file delimited by ";". Unfortunately, some of my values contain ";" as well, where it is not a delimiter but is still recognized as one by pandas. Because of this, I have difficulty reading the .txt files into pandas, since some lines have more columns than others. Background: I am trying to combine several .txt files into one dataframe and get the following error: ParserError: Error tokenizing data. C error: Expected 21 fields in line 443, saw 22.
When checking line 443, I saw that the line indeed had one extra ";" because it was part of one of the values.
Reproduction:
Text file 1:
1;2;3;4
23123213;23123213;23123213;23123213
123;123;123;123
123;123;123;123
1;1;1;1
123;123;123;123
12;12;12;12
3;3;3;3
Text file 2:
1;2;3;4
23123213;23123213;23123213;23123213
123;123;123;123
123;123;12;3;123
1;1;1;1
123;123;123;123
12;12;12;12
3;3;3;3
Code:
import pandas as pd
import glob
import os
path = r'C:\Users\file'
all_files = glob.glob(path + "/*.txt")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0, delimiter=';')
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
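One possible workaround, assuming pandas ≥ 1.4: with engine="python", on_bad_lines accepts a callable that receives the split fields of each offending row, so you can repair the row yourself. Which field the stray ";" belongs to is ambiguous; the sketch below (with made-up data mirroring text file 2) assumes it belongs to the third column:

```python
import io
import pandas as pd

data = "a;b;c;d\n123;123;123;123\n123;123;12;3;123\n1;1;1;1\n"

def merge_extra(fields):
    # Called only for rows with too many fields; glue the surplus
    # back into the third column (an assumption about the data)
    return fields[:2] + [";".join(fields[2:-1])] + [fields[-1]]

df = pd.read_csv(io.StringIO(data), sep=";", engine="python",
                 on_bad_lines=merge_extra)
print(df)
```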
I would like to upload a CSV as a parquet file to an S3 bucket. Below is the code snippet.
from io import BytesIO

df = pd.read_csv('right_csv.csv')
csv_buffer = BytesIO()
df.to_parquet(csv_buffer, compression='gzip', engine='fastparquet')
csv_buffer.seek(0)
Above is giving me an error: TypeError: expected str, bytes or os.PathLike object, not _io.BytesIO
How to make it work?
As per the documentation, io.BytesIO cannot be used when fastparquet is the engine; auto or pyarrow has to be used instead. Quoting from the documentation:
The engine fastparquet does not accept file-like objects.
The code below works without any issues.
import io
f = io.BytesIO()
df.to_parquet(f, compression='gzip', engine='pyarrow')
f.seek(0)
As mentioned in the other answer, this is not supported. One work around would be to save as parquet to a NamedTemporaryFile. Then copy the content to a BytesIO buffer:
import io
import tempfile

with tempfile.NamedTemporaryFile() as tmp:
    df.to_parquet(tmp.name, compression='gzip', engine='fastparquet')

    with open(tmp.name, 'rb') as fh:
        buf = io.BytesIO(fh.read())
I made a pickled utf-8 dataframe on my local machine.
I can read this pickled data with read_pickle on my local machine.
However, I cannot read it on Google Colaboratory:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Postscript01
My code is very simple
import pandas as pd
DF_OBJ = open('/content/drive/DF_OBJ')
DF = pd.read_pickle(DF_OBJ)
The first two lines run fine, but the last line fails with the error above.
Postscript02
I was able to solve it myself:
import pandas as pd
import pickle5

# Opening the file in binary mode ('rb') is the key difference
DF_OBJ = open('OBJ', 'rb')
DF = pickle5.load(DF_OBJ)
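The decisive change here is opening the file in binary mode ('rb'); the default text mode tries to decode the pickle bytes as UTF-8, which is exactly what produces the "invalid start byte" error. (pickle5 is only needed for protocol-5 pickles on Python < 3.8.) A self-contained sketch with a temporary file:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

tmp = tempfile.NamedTemporaryFile(suffix=".pkl", delete=False)
tmp.close()
df.to_pickle(tmp.name)

# Binary mode is mandatory: pickle data is bytes, not UTF-8 text
with open(tmp.name, "rb") as fh:
    restored = pd.read_pickle(fh)

os.remove(tmp.name)
print(restored.equals(df))  # True
```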
I am trying to load a CSV file:
import pandas as pd
dfc = pd.read_csv('data/Vehicles0515.csv', sep =',')
but I get the following error:
ParserError: Error tokenizing data. C error: Expected 22 fields in line 3004427, saw 23
I have read that including error_bad_lines=False should help, but it doesn't solve the problem.
Thanks a lot.
Sometimes the parser gets confused by the head of the CSV file. Try this:
dfc = pd.read_csv('data/Vehicles0515.csv', header=None)
or
dfc = pd.read_csv('data/Vehicles0515.csv', skiprows=2)
Also, you don't need to provide a comma separator, since comma is the default value in pandas' read_csv method.
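A side note on the question itself: in recent pandas versions, error_bad_lines was deprecated (1.3) and removed (2.0); the replacement is on_bad_lines. A minimal sketch with made-up data:

```python
import io
import pandas as pd

data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# "skip" silently drops rows with the wrong number of fields;
# "warn" does the same but emits a warning for each dropped row
df = pd.read_csv(io.StringIO(data), on_bad_lines="skip")
print(df)
```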
I'm trying to load a CSV (from an API response) into pandas, but keep getting errors:
"ValueError: stat: path too long for Windows" and "FileNotFoundError: [Errno 2] File b'"fwefwe","fwef..."
indicating that pandas interprets my string as a file path, not as CSV content.
The code below causes the errors above.
fake_csv='"fwefwe","fwefw","fwefew";"2","5","7"'
df = pd.read_csv(fake_csv, encoding='utf8')
df
How do I force pandas to interpret my argument as a csv string?
You can do that using StringIO:
import io
fake_csv='"fwefwe","fwefw","fwefew";"2","5","7"'
df = pd.read_csv(io.StringIO(fake_csv), encoding='utf8', sep=',', lineterminator=';')
df
Result:
Out[30]:
fwefwe fwefw fwefew
0 2 5 7