How to read pickled utf-8 dataframe on google-colaboratory - pandas

I made a pickled utf-8 dataframe on my local machine.
I can read this pickled data with read_pickle on my local machine.
However, I cannot read it on google-colaboratory.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Postscript01
My code is very simple:
import pandas as pd
DF_OBJ = open('/content/drive/DF_OBJ')
DF = pd.read_pickle(DF_OBJ)
The first two lines run fine; the last line fails with the error above.
Postscript02
I was able to solve it myself:
import pandas as pd
import pickle5
DF_OBJ = open('OBJ','rb')
DF = pickle5.load(DF_OBJ)
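For reference, a likely root cause is that open() without 'rb' returns a text-mode handle, so pandas tries to decode the binary pickle as UTF-8. A minimal sketch (with a hypothetical local file name) showing that passing the path, and letting pandas open the file in binary mode itself, avoids the error:

```python
import pandas as pd

# Hypothetical stand-in for the original file: write a small pickled
# DataFrame, then read it back.
df = pd.DataFrame({"col": ["a", "b"]})
df.to_pickle("df.pkl")

# Passing the *path* lets pandas open the file in binary mode itself.
# open('df.pkl') without 'rb' yields a text-mode handle, and decoding
# the binary pickle as UTF-8 is what raises the 0x80 error above.
restored = pd.read_pickle("df.pkl")
print(restored.equals(df))  # True
```

If the pickle was instead written with protocol 5 on Python 3.8+ and read on an older interpreter, the pickle5 backport (as in Postscript02) is the usual workaround.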

Related

Google Colab Python Unicode Character decode Pandas Dataframe

I'm using a CSV file from Kaggle for fake news analysis, and it contains Unicode characters of all types. Whenever I ran the code with utf-8 encoding it raised: "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 83735: invalid continuation byte"
The resolution is shown below.
import pandas as pd
url = "https://raw.githubusercontent.com/akdubey2k/NLP/main/Fake_News_Classifier/train.csv"
try:
    df = pd.read_csv(url, encoding="utf8")
except UnicodeDecodeError:
    df = pd.read_csv(url, encoding="latin1")
df.head()
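The fallback works because latin1 assigns a character to every one of the 256 byte values, so decoding can never raise, although non-latin1 text may be silently mis-rendered. A tiny sketch with made-up bytes containing the offending 0xe2:

```python
# 0xe2 starts a multi-byte UTF-8 sequence and is invalid on its own,
# but latin1 maps every byte value to a character, so it never fails.
raw = b"caf\xe2"  # made-up bytes containing the offending 0xe2
try:
    text = raw.decode("utf-8")
except UnicodeDecodeError:
    text = raw.decode("latin1")
print(text)  # 'cafâ'  (0xe2 -> 'â' in latin1)
```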

Google cloud blob: XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

I want to import several XML files from a bucket on GCS and then parse them into a pandas DataFrame. I found the pandas.read_xml function to do this, which is great. Unfortunately, I keep getting the error:
XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
I checked the xml files and they look fine.
This is the code:
from google.cloud import storage
import pandas as pd

# importing the data
client = storage.Client()
bucket = client.get_bucket('bucketname')
df = pd.DataFrame()

# parsing the data into a pandas df
for blob in bucket.list_blobs():
    print(blob)
    split = str(blob.name).split("/")
    country = split[0]
    data = pd.read_xml(blob.open(mode='rt', encoding='iso-8859-1', errors='ignore'), compression='gzip')
    df["country"] = country
    print(country)
    df.append(data)
When I print out the blob it gives me:
<Blob: textkernel, DE/daily/2020/2020-12-19/jobs.0.xml.gz, 1612169959288959>
Maybe it has something to do with the pandas function trying to read the filename and not the content? Does someone have an idea why this could be happening?
Thank you!

Write CSV to HDFS from stream with pyarrow upload

I am trying to save a pandas DataFrame to HDFS in CSV format using pyarrow's upload method, but the saved CSV file is empty. The code example is below.
import io
import pandas as pd
import pyarrow as pa
df = pd.DataFrame({"x": [1, 2, 3]})
buf = io.StringIO()
df.to_csv(buf)
hdfs = pa.hdfs.connect()
hdfs.upload("path/to/hdfs/test.csv", buf)
When I check the contents of test.csv on HDFS it is empty. What did I do wrong? Thanks.
You need to call buf.seek(0) before uploading.
Basically you need to rewind to the beginning of the buffer; otherwise hdfs.upload sees an exhausted stream and thinks there's nothing to upload:
>>> buf.read()
''
>>> buf.seek(0)
0
>>> buf.read()
',x\n0,1\n1,2\n2,3\n'
>>> buf.read()
''
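The same rewind behaviour can be reproduced locally, without an HDFS cluster:

```python
import io
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
buf = io.StringIO()
df.to_csv(buf)

# After to_csv the cursor sits at the end of the buffer, so a read
# from here returns '' -- which is exactly what hdfs.upload saw.
assert buf.read() == ""

# Rewinding makes the full CSV payload readable again.
buf.seek(0)
assert buf.read() == ",x\n0,1\n1,2\n2,3\n"
```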

How to make pandas read csv input as string, not as an url

I'm trying to load a csv (from an API response) into pandas, but keep getting an error
"ValueError: stat: path too long for Windows" and "FileNotFoundError: [Errno 2] File b'"fwefwe","fwef..."
indicating that pandas interprets it as a URL or file path, not as the CSV content itself.
The code below causes the errors above.
fake_csv='"fwefwe","fwefw","fwefew";"2","5","7"'
df = pd.read_csv(fake_csv, encoding='utf8')
df
How do I force pandas to interpret my argument as a csv string?
You can do that using StringIO:
import io
fake_csv='"fwefwe","fwefw","fwefew";"2","5","7"'
df = pd.read_csv(io.StringIO(fake_csv), encoding='utf8', sep=',', lineterminator=';')
df
Result:
Out[30]:
  fwefwe fwefw fwefew
0      2     5      7

Generating a NetCDF from a text file

Using Python, can I open a text file, read it into an array, and then save it as a NetCDF file?
The following script I wrote was not successful.
import os
import pandas as pd
import numpy as np
import PIL.Image as im

path = r'C:\path\to\data'
grb = [[]]
for fn in os.listdir(path):
    file = os.path.join(path, fn)
    if os.path.isfile(file):
        df = pd.read_table(file, skiprows=6)
        grb.append(df)
df2 = np.array(grb)
#imarray = im.fromarray(df2) ##cannot handle this data type
#imarray.save('Save_Array_as_TIFF.tif')
I once used xray (since renamed to xarray) to get a NetCDF file into an ASCII dataframe. I just googled it, and apparently it has a to_netcdf function: import xarray, and it lets you work with datasets much like pandas DataFrames.
So give this a try (a pandas DataFrame needs converting with to_xarray() first):
df.to_xarray().to_netcdf(file_path)
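A minimal sketch, assuming xarray plus a netCDF backend (scipy or netCDF4) are installed. Note that a pandas DataFrame has no to_netcdf method of its own, so it is converted to an xarray.Dataset first; the column, index, and file names here are made up:

```python
import pandas as pd

# Made-up data standing in for the parsed text files.
df = pd.DataFrame({"temp": [1.5, 2.5]},
                  index=pd.Index([0, 1], name="point"))

# DataFrame -> xarray.Dataset -> NetCDF file on disk.
ds = df.to_xarray()
ds.to_netcdf("data.nc")
```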