Google cloud blob: XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1 - pandas

I want to import several XML files from a bucket on GCS and then parse them into a pandas DataFrame. I found the pandas.read_xml function to do this, which is great. Unfortunately, I keep getting the error:
XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
I checked the xml files and they look fine.
This is the code:
from google.cloud import storage
import pandas as pd
#importing the data
client = storage.Client()
bucket = client.get_bucket('bucketname')
df = pd.DataFrame()
#parsing the data into pandas df
for blob in bucket.list_blobs():
    print(blob)
    split = str(blob.name).split("/")
    country = split[0]
    data = pd.read_xml(blob.open(mode='rt', encoding='iso-8859-1', errors='ignore'), compression='gzip')
    df["country"] = country
    print(country)
    df.append(data)
When I print out the blob it gives me:
<Blob: textkernel, DE/daily/2020/2020-12-19/jobs.0.xml.gz, 1612169959288959>
Maybe it has something to do with the pandas function trying to read the filename and not the content? Does anyone have an idea why this could be happening?
Thank you!
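For reference, a minimal sketch of one way a blob like this could be read, assuming the files really are gzip-compressed XML (the bucket and blob names are taken from the question and are placeholders): download the raw bytes and decompress them explicitly before handing the XML to pandas, instead of opening the compressed blob in text mode.
import gzip
import io
from google.cloud import storage
import pandas as pd

client = storage.Client()
bucket = client.get_bucket('bucketname')  # placeholder bucket name
blob = bucket.blob('DE/daily/2020/2020-12-19/jobs.0.xml.gz')  # path from the question

# Download the compressed bytes and decompress them ourselves,
# so the XML parser never sees gzip bytes decoded as text.
raw = blob.download_as_bytes()
xml_bytes = gzip.decompress(raw)
data = pd.read_xml(io.BytesIO(xml_bytes), encoding='iso-8859-1')
print(data.head())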

Related

reading csv file into python pandas

I want to read a CSV file into a pandas DataFrame, but I get an error when executing the code below:
filepath = "https://drive.google.com/file/d/1bUTjF-iM4WW7g_Iii62Zx56XNTkF2-I1/view"
df = pd.read_csv(filepath)
df.head(5)
To retrieve information or data from Google Drive, you first need to identify the file id.
import pandas as pd
url='https://drive.google.com/file/d/0B6GhBwm5vaB2ekdlZW5WZnppb28/view?usp=sharing'
file_id=url.split('/')[-2]
dwn_url='https://drive.google.com/uc?id=' + file_id
df = pd.read_csv(dwn_url)
print(df.head())
Try the following code snippet to read the CSV from Google Drive into a pandas DataFrame:
import pandas as pd
url = "https://drive.google.com/uc?id=1bUTjF-iM4WW7g_Iii62Zx56XNTkF2-I1"
df = pd.read_csv(url)
df.head(5)

Write CSV to HDFS from stream with pyarrow upload

I am trying to save a pandas DataFrame to HDFS in CSV format using the pyarrow upload method, but the CSV file saved is empty. The code example can be found below.
import io
import pandas as pd
import pyarrow as pa
df = pd.DataFrame({"x": [1, 2, 3]})
buf = io.StringIO()
df.to_csv(buf)
hdfs = pa.hdfs.connect()
hdfs.upload("path/to/hdfs/test.csv", buf)
When I check the contents of test.csv on HDFS it is empty. What did I do wrong? Thanks.
You need to call buf.seek(0) before uploading.
Basically, you need to rewind to the beginning of the buffer; otherwise HDFS thinks there is nothing to upload:
>>> buf.read()
''
>>> buf.seek(0)
0
>>> buf.read()
',x\n0,1\n1,2\n2,3\n'
>>> buf.read()
''
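Putting the fix together, a sketch of the full snippet with the rewind added (same placeholder HDFS path as in the question):
import io
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": [1, 2, 3]})

buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)  # rewind to the beginning so the upload sees the CSV content

hdfs = pa.hdfs.connect()
hdfs.upload("path/to/hdfs/test.csv", buf)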

Reading CSV files from Google Cloud Storage using pandas

I am trying to read a bunch of CSV files from Google Cloud Storage into pandas dataframes, as explained in Read csv from Google Cloud storage to pandas dataframe.
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs(prefix=prefix)
list_temp_raw = []
for file in blobs:
    filename = file.name
    temp = pd.read_csv('gs://'+bucket_name+'/'+filename+'.csv', encoding='utf-8')
    list_temp_raw.append(temp)
df = pd.concat(list_temp_raw)
It shows the following error message while importing gcsfs. The packages 'dask' and 'gcsfs' have already been installed on my machine; however, I cannot get rid of the following error.
File "C:\Program Files\Anaconda3\lib\site-packages\gcsfs\dask_link.py", line
121, in register
dask.bytes.core._filesystems['gcs'] = DaskGCSFileSystem
AttributeError: module 'dask.bytes.core' has no attribute '_filesystems'
It seems there is some error or conflict between the gcsfs and dask packages. In fact, the dask library is not needed for your code to work. The minimal configuration for your code to run is to install these libraries (I am posting their latest versions):
google-cloud-storage==1.14.0
gcsfs==0.2.1
pandas==0.24.1
Also, the filename already contains the .csv extension. So change the 9th line to this:
temp = pd.read_csv('gs://' + bucket_name + '/' + filename, encoding='utf-8')
With these changes I ran your code and it works. I suggest you create a virtual env, install the libraries there, and run the code in it.
This has been tested and seen to work elsewhere, whether reading directly from GCS or via Dask. You may wish to try importing gcsfs and dask, and check whether you can see _filesystems and inspect its contents:
In [1]: import dask.bytes.core
In [2]: dask.bytes.core._filesystems
Out[2]: {'file': dask.bytes.local.LocalFileSystem}
In [3]: import gcsfs
In [4]: dask.bytes.core._filesystems
Out[4]:
{'file': dask.bytes.local.LocalFileSystem,
'gcs': gcsfs.dask_link.DaskGCSFileSystem,
'gs': gcsfs.dask_link.DaskGCSFileSystem}
As of https://github.com/dask/gcsfs/pull/129 , gcsfs behaves better if it is unable to register itself with Dask, so updating may solve your problem.
A few things to point out in the text above:
bucket_name and prefix need to be defined.
The iteration over the filenames should append each dataframe on every pass; otherwise only the last one gets concatenated.
from google.cloud import storage
import pandas as pd
storage_client = storage.Client()
buckets_list = list(storage_client.list_buckets())
bucket_name='my_bucket'
bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs()
list_temp_raw = []
for file in blobs:
    filename = file.name
    temp = pd.read_csv('gs://'+bucket_name+'/'+filename, encoding='utf-8')
    print(filename, temp.head())
    list_temp_raw.append(temp)
df = pd.concat(list_temp_raw)

Convert IEX Finance API data to pandas dataframe

I want to pull data from the IEX finance API and put it into a pandas dataframe, but I don't know the correct code. Can someone help?
URL call for the API:
https://api.iextrading.com/1.0/stock/aapl/chart/1d?chartInterval=5
I tried the below, but it doesn't work:
import pandas as pd
api_call = 'https://api.iextrading.com/1.0/stock/aapl/chart/1d?chartInterval=5'
price = pd.read_csv(api_call)
The data is in JSON format. To load it into a dataframe you have to call the read_json function.
import pandas as pd
df = pd.read_json("https://api.iextrading.com/1.0/stock/aapl/chart/1d?chartInterval=5")

Generating a NetCDF from a text file

Using Python, can I open a text file, read it into an array, and then save it as a NetCDF file?
The following script I wrote was not successful.
import os
import pandas as pd
import numpy as np
import PIL.Image as im
path = 'C:\path\to\data'
grb = [[]]
for fn in os.listdir(path):
    file = os.path.join(path, fn)
    if os.path.isfile(file):
        df = pd.read_table(file, skiprows=6)
        grb.append(df)
df2 = pd.np.array(grb)
#imarray = im.fromarray(df2) ##cannot handle this data type
#imarray.save('Save_Array_as_TIFF.tif')
I once used xray or xarray (they renamed themselves) to get a NetCDF file into an ASCII dataframe. I just googled, and apparently they have a to_netcdf function.
Import xarray; it lets you treat dataframes much like pandas does.
So give this a try:
df.to_netcdf(file_path)
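A minimal sketch of that route, assuming the parsed data ends up in a pandas DataFrame (note that to_netcdf is a method on xarray objects, so the DataFrame is converted first; the toy data and the output filename are placeholders):
import pandas as pd
import xarray as xr

# Toy stand-in for the table parsed from the text files in the question.
df = pd.DataFrame({"value": [1.0, 2.0, 3.0]})

# Convert the DataFrame to an xarray Dataset (the index becomes a dimension),
# then write it out as NetCDF.
ds = xr.Dataset.from_dataframe(df)
ds.to_netcdf("output.nc")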