I cloned a repo from GitHub to Google Cloud Workbench, but I haven't been able to read my data into the Jupyter notebook. It seems unable to locate the file. I have checked the file's spelling and location, and everything seems to be in place. I tried to read it in as
PATH = "data/countypres_2000-2020.csv"
df = pd.read_csv(PATH)
or as
PATH = "eco395m-homework-6/data/countypres_2000-2020.csv"
df = pd.read_csv(PATH)
Try this:
PATH = r"/data/countypres_2000-2020.csv"
df = pd.read_csv(PATH)
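A relative path such as data/countypres_2000-2020.csv is resolved against the notebook kernel's current working directory, which on a hosted environment is not necessarily the root of the cloned repo, while a path with a leading slash is absolute from the filesystem root. A minimal sketch for checking where the kernel is running and locating the file by walking up the parent folders; the helper name find_upwards is hypothetical, and it assumes the notebook lives somewhere inside the cloned repo:

```python
from pathlib import Path


def find_upwards(relative_path, start=None):
    """Return the first existing match of relative_path in start or its parents."""
    start = Path(start) if start is not None else Path.cwd()
    start = start.resolve()
    for folder in [start, *start.parents]:
        candidate = folder / relative_path
        if candidate.exists():
            return candidate
    return None  # nothing found anywhere above the starting folder


# Show where the kernel is actually running, then try to locate the file.
print("Working directory:", Path.cwd())
print("Resolved path:", find_upwards("data/countypres_2000-2020.csv"))
```

If a path is printed, passing it to pd.read_csv should work regardless of which folder the notebook was opened from.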
Please don't mark this as a duplicate. I have tried reading the other solutions, but they seem to address a different issue.
I concatenated 2 CSV files using pandas. I can read them easily using Jupyter Notebook on my laptop.
p1 = path + "asm/eng_1.csv"
p2 = path + "asm/eng_2.csv"
csv1 = pd.read_csv(p1, header = None, encoding = 'utf16')
csv2 = pd.read_csv(p2, header = None, encoding = 'utf16')
concat_csv = pd.concat([csv1, csv2], ignore_index=True)
concat_csv.to_csv(destination_path + '/combined.csv', index=False)
When I upload the above 3 CSVs on Google Colab (mount using GDrive), I can read csv1 and csv2 easily, but concat_csv gives me the error:
UnicodeError: UTF-16 stream does not start with BOM after concatenating CSVs
Surprisingly, if I run the following code on Google Colab:
csv1 = pd.read_csv(p1, header = None, encoding = 'utf16')
csv2 = pd.read_csv(p2, header = None, encoding = 'utf16')
concat_csv = pd.concat([csv1, csv2], ignore_index=True)
print(concat_csv)
concat_csv.to_csv('data.csv', index=False)
from google.colab import files
files.download("data.csv")
It easily prints concat_csv, but when I re-upload the downloaded file on G-Drive and try to read it again, I get the exact same error.
Why is this happening? Please help.
Thank you.
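One likely cause, offered as an assumption because the question does not show how the combined file is read back: DataFrame.to_csv writes UTF-8 unless an encoding is passed, so combined.csv is no longer UTF-16, and decoding it as UTF-16 fails because the file does not start with a byte order mark. The sketch below reproduces that mismatch with the standard library only (the file names are hypothetical). With pandas, the corresponding fix would be either to pass encoding='utf16' to to_csv when writing, or to read the combined file with encoding='utf-8'.

```python
import tempfile
from pathlib import Path


def can_decode(path, encoding):
    """Return True if the whole file decodes with the given encoding."""
    try:
        with open(path, encoding=encoding) as handle:
            handle.read()
        return True
    except UnicodeError:
        return False


workdir = Path(tempfile.mkdtemp())
utf8_file = workdir / "combined_utf8.csv"
utf16_file = workdir / "combined_utf16.csv"

rows = "a,b\n1,2\n"
utf8_file.write_text(rows, encoding="utf-8")    # like a CSV written with the default encoding
utf16_file.write_text(rows, encoding="utf-16")  # the utf-16 codec writes a BOM first

print("UTF-8 file read as UTF-16:", can_decode(utf8_file, "utf-16"))
print("UTF-8 file read as UTF-8:", can_decode(utf8_file, "utf-8"))
print("UTF-16 file read as UTF-16:", can_decode(utf16_file, "utf-16"))
```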
I have an Excel file uploaded to my ML workspace.
I can access the file as an Azure FileDataset object. However, I don't know how to get it into a pandas DataFrame, since the 'FileDataset' object has no attribute 'to_dataframe'.
Azure ML notebooks seem to make a point of avoiding pandas for some reason.
Does anyone know how to get blob files into pandas dataframes from within Azure ML notebooks?
To explore and manipulate a dataset, it must first be downloaded from the blob source to a local file, which can then be loaded in a pandas DataFrame.
Here are the steps to follow for this procedure:
Download the data from Azure blob storage with the following Python code sample, which uses the Blob service client. Replace the placeholder values in the code with your own:
import time

import pandas as pd
from azure.storage.blob import BlobServiceClient

# Replace each placeholder string with your own value.
STORAGEACCOUNTURL = "<storage_account_url>"
STORAGEACCOUNTKEY = "<storage_account_key>"
LOCALFILENAME = "<local_file_name>"
CONTAINERNAME = "<container_name>"
BLOBNAME = "<blob_name>"

# Download from blob, timing how long it takes
t1 = time.time()
blob_service_client_instance = BlobServiceClient(
    account_url=STORAGEACCOUNTURL, credential=STORAGEACCOUNTKEY
)
blob_client_instance = blob_service_client_instance.get_blob_client(
    CONTAINERNAME, BLOBNAME, snapshot=None
)
# Stream the blob contents into a local file
with open(LOCALFILENAME, "wb") as my_blob:
    blob_data = blob_client_instance.download_blob()
    blob_data.readinto(my_blob)
t2 = time.time()
print(("It takes %s seconds to download " + BLOBNAME) % (t2 - t1))
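Read the data into a pandas DataFrame from the downloaded file. The line below is for a CSV file; for an Excel file, as in the question, use pd.read_excel instead.
# LOCALFILENAME is the path of the downloaded file
dataframe_blobdata = pd.read_csv(LOCALFILENAME)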
I have put my CSV file in the same folder as the running Jupyter notebook, but I still can't import it.
You need to read it into a DataFrame first:
df = pd.read_csv('name.csv') # (the file name of your csv)
df
I am trying to read a bunch of CSV files from Google Cloud Storage into pandas dataframes, as explained in "Read csv from Google Cloud storage to pandas dataframe":
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs(prefix=prefix)
list_temp_raw = []
for file in blobs:
    filename = file.name
    temp = pd.read_csv('gs://'+bucket_name+'/'+filename+'.csv', encoding='utf-8')
    list_temp_raw.append(temp)
df = pd.concat(list_temp_raw)
It shows the following error message while importing gcsfs. The packages 'dask' and 'gcsfs' have already been installed on my machine; however, I cannot get rid of the following error.
  File "C:\Program Files\Anaconda3\lib\site-packages\gcsfs\dask_link.py", line 121, in register
    dask.bytes.core._filesystems['gcs'] = DaskGCSFileSystem
AttributeError: module 'dask.bytes.core' has no attribute '_filesystems'
It seems there is a conflict between the gcsfs and dask packages. In fact, the dask library is not needed for your code to work. The minimal configuration for your code to run is to install the following libraries (I am listing their latest versions):
google-cloud-storage==1.14.0
gcsfs==0.2.1
pandas==0.24.1
Also, the filename already contains the .csv extension, so change the pd.read_csv line to this:
temp = pd.read_csv('gs://' + bucket_name + '/' + filename, encoding='utf-8')
With these changes, I ran your code and it works. I suggest creating a virtual env, installing the libraries there, and running the code in it.
This has been tested and seen to work elsewhere, whether reading directly from GCS or via Dask. You may wish to import dask and gcsfs yourself and check whether _filesystems exists and what it contains:
In [1]: import dask.bytes.core
In [2]: dask.bytes.core._filesystems
Out[2]: {'file': dask.bytes.local.LocalFileSystem}
In [3]: import gcsfs
In [4]: dask.bytes.core._filesystems
Out[4]:
{'file': dask.bytes.local.LocalFileSystem,
'gcs': gcsfs.dask_link.DaskGCSFileSystem,
'gs': gcsfs.dask_link.DaskGCSFileSystem}
As of https://github.com/dask/gcsfs/pull/129 , gcsfs behaves better if it is unable to register itself with Dask, so updating may solve your problem.
A few things to point out about the code above:
bucket_name and prefix need to be defined.
The loop over the filenames should append each dataframe to the list on every iteration. Otherwise, only the last one gets concatenated.
from google.cloud import storage
import pandas as pd
storage_client = storage.Client()
buckets_list = list(storage_client.list_buckets())
bucket_name='my_bucket'
bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs()
list_temp_raw = []
for file in blobs:
    filename = file.name
    temp = pd.read_csv('gs://'+bucket_name+'/'+filename, encoding='utf-8')
    print(filename, temp.head())
    list_temp_raw.append(temp)
df = pd.concat(list_temp_raw)
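Since each blob name already includes its extension, and a bucket listing may also return objects that are not CSV files (for example, zero-byte folder placeholders whose names end in a slash), it can help to build the gs:// URIs in a separate step and filter them. A minimal sketch; the helper csv_uris is hypothetical and takes plain name strings, so it can be checked without access to GCS:

```python
def csv_uris(bucket_name, blob_names):
    """Build gs:// URIs for the names that end in .csv (case-insensitive)."""
    return [
        f"gs://{bucket_name}/{name}"
        for name in blob_names
        if name.lower().endswith(".csv")
    ]


# Example names, as they might come from [blob.name for blob in blobs]
names = ["raw/", "raw/2019.csv", "raw/2020.CSV", "raw/readme.txt"]
print(csv_uris("my_bucket", names))
```

In the loop above, the names would come from the listed blobs, and each resulting URI would be passed to pd.read_csv.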
I tried all the possible options, like:
import pandas as pd
df = pd.read_csv('AD_Data')
data = pd.ExcelFile("AD_Data")
xl_file = pd.ExcelFile(AD_Data)
dfs = {sheet_name: xl_file.parse(AD_Data) for sheet_name in xl_file.AD_Data}
dfs = pd.read_excel(AD_Data, sheetname=None)
None of them help.
The error I am getting is:
FileNotFoundError: File b'adData' does not exist
The notebook and the data are in the same folder.
I tried keeping them in different folders too, but that did not help.
I can import any other file, such as a text file, convert it to a DataFrame, and work on it in the same notebook and from the same data folder.
pd.read_excel (Python 3.6.4) works fine with .xlsx files on Windows.
Add the file extension .xlsx to the file name, or make sure the file is in the same folder as the script.
dfs = pd.read_excel(r'C:\users\ilja\Desktop\Mappe1.xlsx', sheet_name=None)
print(dfs)
# OrderedDict([('Tabelle1', 1 5
# 0 2 6
# 1 3 7)])
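When the error says the file does not exist, listing what is actually in the folder often reveals the missing extension. A small sketch, assuming the workbook's name starts with AD_Data as in the question; the helper name candidates is hypothetical:

```python
from pathlib import Path


def candidates(stem, folder="."):
    """List files in folder whose name starts with stem, whatever the extension."""
    return sorted(p.name for p in Path(folder).glob(f"{stem}*") if p.is_file())


# Run from the notebook to see the exact file names next to it.
print(candidates("AD_Data"))
```

The full name it prints, extension included, is what should be passed to pd.read_excel.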