I'm using a CSV file from Kaggle for fake-news analysis, and it contains Unicode characters of all types. Whenever I ran the code with utf-8 encoding, I got an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 83735: invalid continuation byte
The resolution is included below as well.
import pandas as pd

url = "https://raw.githubusercontent.com/akdubey2k/NLP/main/Fake_News_Classifier/train.csv"
try:
    df = pd.read_csv(url, encoding="utf-8")
except UnicodeDecodeError:
    # latin-1 maps every possible byte to a character, so it never raises
    df = pd.read_csv(url, encoding="latin-1")
df.head()
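The same fallback can be exercised without the network, using an in-memory byte stream; the sample bytes below are made up, but the stray 0xe2 byte is the same kind of invalid UTF-8 sequence the error message reports:

```python
import io
import pandas as pd

# 0xe2 starts a 3-byte UTF-8 sequence, but 'w' is not a valid
# continuation byte -- exactly the error the question reports
raw = b"text\nhello\xe2world\n"

try:
    df = pd.read_csv(io.BytesIO(raw), encoding="utf-8")
except UnicodeDecodeError:
    # latin-1 maps every byte to a character, so it always succeeds,
    # at the cost of possible mojibake for non-latin-1 data
    df = pd.read_csv(io.BytesIO(raw), encoding="latin-1")

print(df.iloc[0, 0])  # the 0xe2 byte decodes to 'â' under latin-1
```

On pandas 1.3+, passing encoding_errors="replace" is another option, if substituting the offending characters is acceptable.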
I have a large .TXT file delimited by ";". Unfortunately, some of my values contain ";" as well, where it is not a delimiter but is still recognized as one by pandas. Because of this, I have difficulty reading the .txt files into pandas, since some lines have more columns than others. Background: I am trying to combine several .txt files into one dataframe and get the following error: ParserError: Error tokenizing data. C error: Expected 21 fields in line 443, saw 22.
When checking line 443, I saw that the line indeed had one extra ";" because it was part of one of the values.
Reproduction:
Text file 1:
1;2;3;4
23123213;23123213;23123213;23123213
123;123;123;123
123;123;123;123
1;1;1;1
123;123;123;123
12;12;12;12
3;3;3;3
Text file 2:
1;2;3;4
23123213;23123213;23123213;23123213
123;123;123;123
123;123;12;3;123
1;1;1;1
123;123;123;123
12;12;12;12
3;3;3;3
Code:
import pandas as pd
import glob
import os
path = r'C:\Users\file'
all_files = glob.glob(path + "/*.txt")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0, delimiter=';')
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
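One possible workaround, assuming pandas ≥ 1.4: with engine="python", on_bad_lines accepts a callable that receives the split fields of each offending row, so you can repair the row yourself. Which field the stray ";" belongs to is ambiguous; the sketch below (with made-up data mirroring text file 2) assumes it belongs to the third column:

```python
import io
import pandas as pd

data = "a;b;c;d\n123;123;123;123\n123;123;12;3;123\n1;1;1;1\n"

def merge_extra(fields):
    # Called only for rows with too many fields; glue the surplus
    # back into the third column (an assumption about the data)
    return fields[:2] + [";".join(fields[2:-1])] + [fields[-1]]

df = pd.read_csv(io.StringIO(data), sep=";", engine="python",
                 on_bad_lines=merge_extra)
print(df)
```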
I would like to upload a CSV as a parquet file to an S3 bucket. Below is the code snippet.
from io import BytesIO

df = pd.read_csv('right_csv.csv')
csv_buffer = BytesIO()
df.to_parquet(csv_buffer, compression='gzip', engine='fastparquet')
csv_buffer.seek(0)
Above is giving me an error: TypeError: expected str, bytes or os.PathLike object, not _io.BytesIO
How to make it work?
As per the documentation, io.BytesIO cannot be used when fastparquet is the engine; auto or pyarrow has to be used instead. Quoting from the documentation:
The engine fastparquet does not accept file-like objects.
The code below works without any issues.
import io
f = io.BytesIO()
df.to_parquet(f, compression='gzip', engine='pyarrow')
f.seek(0)
As mentioned in the other answer, this is not supported. One work around would be to save as parquet to a NamedTemporaryFile. Then copy the content to a BytesIO buffer:
import io
import tempfile

with tempfile.NamedTemporaryFile() as tmp:
    df.to_parquet(tmp.name, compression='gzip', engine='fastparquet')

    with open(tmp.name, 'rb') as fh:
        buf = io.BytesIO(fh.read())
I made a pickled utf-8 dataframe on my local machine.
I can read this pickled data with read_pickle on my local machine.
However, I cannot read it on Google Colaboratory:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Postscript01
My code is very simple
import pandas as pd
DF_OBJ = open('/content/drive/DF_OBJ')
DF = pd.read_pickle(DF_OBJ)
The first two lines run fine, but the last line fails with the error above.
Postscript02
I was able to solve it myself:
import pandas as pd
import pickle5

# Opening the file in binary mode ('rb') is the key difference
DF_OBJ = open('OBJ', 'rb')
DF = pickle5.load(DF_OBJ)
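The decisive change here is opening the file in binary mode ('rb'); the default text mode tries to decode the pickle bytes as UTF-8, which is exactly what produces the "invalid start byte" error. (pickle5 is only needed for protocol-5 pickles on Python < 3.8.) A self-contained sketch with a temporary file:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

tmp = tempfile.NamedTemporaryFile(suffix=".pkl", delete=False)
tmp.close()
df.to_pickle(tmp.name)

# Binary mode is mandatory: pickle data is bytes, not UTF-8 text
with open(tmp.name, "rb") as fh:
    restored = pd.read_pickle(fh)

os.remove(tmp.name)
print(restored.equals(df))  # True
```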
I am trying to load a CSV file:
import pandas as pd
dfc = pd.read_csv('data/Vehicles0515.csv', sep =',')
but I get the following error:
ParserError: Error tokenizing data. C error: Expected 22 fields in line 3004427, saw 23
I have read that including error_bad_lines=False should help, but it doesn't solve the problem.
Thanks a lot.
Sometimes the parser gets confused by the head of the CSV file. Try this:
dfc = pd.read_csv('data/Vehicles0515.csv', header=None)
or
dfc = pd.read_csv('data/Vehicles0515.csv', skiprows=2)
Also, you don't need to provide a comma separator, since comma is the default value in pandas' read_csv method.
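A side note on the question itself: in recent pandas versions, error_bad_lines was deprecated (1.3) and removed (2.0); the replacement is on_bad_lines. A minimal sketch with made-up data:

```python
import io
import pandas as pd

data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# "skip" silently drops rows with the wrong number of fields;
# "warn" does the same but emits a warning for each dropped row
df = pd.read_csv(io.StringIO(data), on_bad_lines="skip")
print(df)
```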
I'm trying to load a CSV (from an API response) into pandas, but keep getting errors:
"ValueError: stat: path too long for Windows" and "FileNotFoundError: [Errno 2] File b'"fwefwe","fwef..."
indicating that pandas interprets my string as a file path, not as CSV content.
The code below causes the errors above.
fake_csv='"fwefwe","fwefw","fwefew";"2","5","7"'
df = pd.read_csv(fake_csv, encoding='utf8')
df
How do I force pandas to interpret my argument as a csv string?
You can do that using StringIO:
import io
fake_csv='"fwefwe","fwefw","fwefew";"2","5","7"'
df = pd.read_csv(io.StringIO(fake_csv), encoding='utf8', sep=',', lineterminator=';')
df
Result:
Out[30]:
fwefwe fwefw fwefew
0 2 5 7