If I am running a Jupyter notebook server that an analyst can reach by browsing to the server URL and running IPython (typical IPython, but hosted remotely), can the analyst also export data in CSV form to their local machine? For example, can they download the CSV file they are working on from a Pandas data-wrangling process?
For example, suppose a Pandas DataFrame is created on the remotely hosted IPython instance from an SQL query, something like the method below:
# Example Python program to read data from a PostgreSQL table
# and load it into a pandas DataFrame
import psycopg2
import pandas as pds
from sqlalchemy import create_engine
# Create an engine instance (user "test" with an empty password on the local server)
alchemyEngine = create_engine('postgresql+psycopg2://test:@127.0.0.1', pool_recycle=3600)
# Connect to the PostgreSQL server
dbConnection = alchemyEngine.connect()
# Read data from the PostgreSQL table and load it into a DataFrame instance
dataFrame = pds.read_sql("select * from \"StudentScores\"", dbConnection)
pds.set_option('display.expand_frame_repr', False)
# Print the DataFrame
print(dataFrame)
# Close the database connection
dbConnection.close()
Does JupyterLab have something similar to serving a static file the way a web server can? For example, a Flask server can serve a static file (sorry, I am NOT a web developer, but I have experimented with this in Flask). The Flask code below uses send_file to serve a static file:
from flask import Flask, request, send_file
import io
import os
import csv
app = Flask(__name__)

@app.route('/get_csv')
def get_csv():
    """
    Returns the monthly weather csv file (Montreal, year=2019)
    corresponding to the month passed as parameter.
    """
    # Check that the month parameter has been supplied
    if "month" not in request.args:
        return "ERROR: value for 'month' is missing"
    # Also make sure that the value provided is numeric
    try:
        month = int(request.args["month"])
    except ValueError:
        return "ERROR: value for 'month' should be between 1 and 12"
    csv_dir = "./static"
    csv_file = "2019_%02d_weather.csv" % month
    csv_path = os.path.join(csv_dir, csv_file)
    # Also make sure the requested csv file does exist
    if not os.path.isfile(csv_path):
        return "ERROR: file %s was not found on the server" % csv_file
    # Send the file back to the client
    # (newer Flask versions use download_name instead of attachment_filename)
    return send_file(csv_path, as_attachment=True, attachment_filename=csv_file)
Can IPython or pandas serve a static file if IPython is hosted on a remote machine?
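A minimal sketch of one possible approach (not a definitive answer; it assumes the analyst is working inside the remote notebook, and scores.csv is a hypothetical output name): write the wrangled DataFrame to a CSV next to the notebook and render a download link with IPython.display.FileLink, since the Jupyter server exposes files in the notebook's working directory through its file browser.
from IPython.display import FileLink
# Write the wrangled frame into the notebook's working directory
dataFrame.to_csv('scores.csv', index=False)
# As the last line of a cell, this renders a clickable link in the output;
# the analyst can also right-click the file in the Jupyter file browser and choose Download
FileLink('scores.csv')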
Related
I have an Excel file uploaded to my ML workspace.
I can access the file as an Azure FileDataset object. However, I don't know how to get it into a pandas DataFrame, since 'FileDataset' object has no attribute 'to_dataframe'.
Azure ML notebooks seem to make a point of avoiding pandas for some reason.
Does anyone know how to get blob files into pandas dataframes from within Azure ML notebooks?
To explore and manipulate a dataset, it must first be downloaded from the blob source to a local file, which can then be loaded in a pandas DataFrame.
Here are the steps to follow for this procedure:
Download the data from Azure Blob Storage with the following Python code sample using the Blob service. Replace the variables in the following code with your specific values:
from azure.storage.blob import BlobServiceClient
import pandas as pd
import time

STORAGEACCOUNTURL = <storage_account_url>
STORAGEACCOUNTKEY = <storage_account_key>
LOCALFILENAME = <local_file_name>
CONTAINERNAME = <container_name>
BLOBNAME = <blob_name>

# Download from blob
t1 = time.time()
blob_service_client_instance = BlobServiceClient(
    account_url=STORAGEACCOUNTURL, credential=STORAGEACCOUNTKEY)
blob_client_instance = blob_service_client_instance.get_blob_client(
    CONTAINERNAME, BLOBNAME, snapshot=None)
with open(LOCALFILENAME, "wb") as my_blob:
    blob_data = blob_client_instance.download_blob()
    blob_data.readinto(my_blob)
t2 = time.time()
print(("It takes %s seconds to download " + BLOBNAME) % (t2 - t1))
Read the data into a pandas DataFrame from the downloaded file.
# LOCALFILENAME is the local file path
dataframe_blobdata = pd.read_csv(LOCALFILENAME)
For more details you can follow this link
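If you would rather skip the intermediate file, a variant of the same idea (only a sketch, assuming the azure-storage-blob v12 client used above and the same placeholder values) streams the blob into memory and hands the bytes straight to pandas:
import io
import pandas as pd
from azure.storage.blob import BlobServiceClient
# Reuse the placeholder account URL, key, container and blob names from above
service = BlobServiceClient(account_url=STORAGEACCOUNTURL, credential=STORAGEACCOUNTKEY)
blob_client = service.get_blob_client(CONTAINERNAME, BLOBNAME)
# readall() returns the blob contents as bytes; BytesIO makes them file-like for pandas
blob_bytes = blob_client.download_blob().readall()
dataframe_blobdata = pd.read_csv(io.BytesIO(blob_bytes))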
Trying to understand if I can use pickle for storing the model in a file system.
from neuralprophet import NeuralProphet
import pandas as pd
import pickle
df = pd.read_csv('data.csv')
pipe = NeuralProphet()
pipe.fit(df, freq="D")
pickle.dump(pipe, open('model/pipe_model.pkl', 'wb'))
Question: loading multiple CSV files. I have multiple CSV files. How can I dump multiple CSV files into the same pickle file and load them later for prediction?
I think the right answer here is sqlite. SQLite acts like a database but it is stored as a single self-contained file on disk.
The benefit for your use case is that you can append new data as received into a table on the file, then read it as required. The code to do this is as simple as:
import pandas as pd
import sqlite3

# Create a SQL connection to our SQLite database
# This will create the file if not already existing
con = sqlite3.connect("my_table.sqlite")

# Replace this with read_csv
df = pd.DataFrame(index=[1, 2, 3], data=[1, 2, 3], columns=['some_data'])

# Simply continue appending onto 'My Table' each time you read a file
df.to_sql(
    name='My Table',
    con=con,
    if_exists='append'
)
Please be aware that SQLite performance drops with very large numbers of rows, in which case caching the data as Parquet files (or another fast, compressed format) and then reading them all in at training time may be more appropriate.
When you need the data, just read everything from the table:
pd.read_sql('SELECT * from [My Table]', con=con)
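Following up on the Parquet suggestion above, a minimal sketch of that caching pattern (assuming pyarrow or fastparquet is installed, and hypothetical incoming/ and cache/ directories):
import glob
import pandas as pd
# Cache each incoming CSV as a compressed Parquet file
for i, csv_path in enumerate(sorted(glob.glob('incoming/*.csv'))):
    pd.read_csv(csv_path).to_parquet('cache/part_%03d.parquet' % i)
# At training time, read every cached part back and concatenate into one frame
df = pd.concat(
    (pd.read_parquet(p) for p in sorted(glob.glob('cache/part_*.parquet'))),
    ignore_index=True
)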
I am trying to read a bunch of CSV files from Google Cloud Storage into pandas dataframes as explained in Read csv from Google Cloud storage to pandas dataframe
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs(prefix=prefix)
list_temp_raw = []
for file in blobs:
    filename = file.name
    temp = pd.read_csv('gs://'+bucket_name+'/'+filename+'.csv', encoding='utf-8')
    list_temp_raw.append(temp)
df = pd.concat(list_temp_raw)
It shows the following error message while importing gcsfs. The packages 'dask' and 'gcsfs' have already been installed on my machine; however, I cannot get rid of the following error.
File "C:\Program Files\Anaconda3\lib\site-packages\gcsfs\dask_link.py", line
121, in register
dask.bytes.core._filesystems['gcs'] = DaskGCSFileSystem
AttributeError: module 'dask.bytes.core' has no attribute '_filesystems'
It seems there is some error or conflict between the gcsfs and dask packages. In fact, the dask library is not needed for your code to work. The minimal configuration for your code to run is to install these libraries (I am posting their latest versions at the time of writing):
google-cloud-storage==1.14.0
gcsfs==0.2.1
pandas==0.24.1
Also, the filename already contains the .csv extension, so change the read_csv line to this:
temp = pd.read_csv('gs://' + bucket_name + '/' + filename, encoding='utf-8')
With these changes I ran your code and it works. I suggest you create a virtual env, install the libraries there, and run the code in it.
This has been tested and seen to work from elsewhere, whether reading directly from GCS or via Dask. You may wish to try importing gcsfs and dask, and check whether _filesystems is visible and what it contains:
In [1]: import dask.bytes.core
In [2]: dask.bytes.core._filesystems
Out[2]: {'file': dask.bytes.local.LocalFileSystem}
In [3]: import gcsfs
In [4]: dask.bytes.core._filesystems
Out[4]:
{'file': dask.bytes.local.LocalFileSystem,
'gcs': gcsfs.dask_link.DaskGCSFileSystem,
'gs': gcsfs.dask_link.DaskGCSFileSystem}
As of https://github.com/dask/gcsfs/pull/129, gcsfs behaves better if it is unable to register itself with Dask, so updating may solve your problem.
A few things to point out in the code above:
bucket_name and prefix need to be defined.
The iteration over the filenames should append each dataframe inside the loop; otherwise only the last one gets concatenated.
from google.cloud import storage
import pandas as pd

storage_client = storage.Client()
buckets_list = list(storage_client.list_buckets())
bucket_name = 'my_bucket'
bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs()
list_temp_raw = []
for file in blobs:
    filename = file.name
    temp = pd.read_csv('gs://'+bucket_name+'/'+filename, encoding='utf-8')
    print(filename, temp.head())
    list_temp_raw.append(temp)
df = pd.concat(list_temp_raw)
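If you prefer not to depend on gcsfs at all, a variant of the loop (just a sketch, assuming a reasonably recent google-cloud-storage client) downloads each blob's bytes directly and feeds them to pandas:
import io
from google.cloud import storage
import pandas as pd

storage_client = storage.Client()
bucket = storage_client.bucket('my_bucket')  # same hypothetical bucket name as above
frames = []
for blob in bucket.list_blobs():
    # download_as_bytes() pulls the object into memory; BytesIO makes it file-like
    frames.append(pd.read_csv(io.BytesIO(blob.download_as_bytes()), encoding='utf-8'))
df = pd.concat(frames, ignore_index=True)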
Let me prefix this by saying I'm very new to tensorflow and even newer to AWS Sagemaker.
I have some tensorflow/keras code that I wrote and tested on a local dockerized Jupyter notebook and it runs fine. In it, I import a csv file as my input.
I use SageMaker to spin up a Jupyter notebook instance with conda_tensorflow_p36, and I modified the pandas.read_csv() call to point to my input file, now hosted in an S3 bucket.
So I changed this line of code from
import pandas as pd
data = pd.read_csv("/input.csv", encoding="latin1")
to this
import pandas as pd
data = pd.read_csv("https://s3.amazonaws.com/my-sagemaker-bucket/input.csv", encoding="latin1")
and I get this error
AttributeError: module 'pandas' has no attribute 'core'
I'm not sure if it's a permissions issue. I read that as long as my bucket name contains the string "sagemaker", the notebook should have access to it.
Pull your data from S3, for example:
import boto3
import io
import pandas as pd
# Set below parameters
bucket = '<bucket name>'
key = 'data/training/iris.csv'
endpointName = 'decision-trees'
# Pull our data from S3
s3 = boto3.client('s3')
f = s3.get_object(Bucket=bucket, Key=key)
# Make a dataframe
shape = pd.read_csv(io.BytesIO(f['Body'].read()), header=None)
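As a side note (a sketch rather than a diagnosis of the "pandas has no attribute 'core'" error), pandas can also read straight from S3 when the s3fs package is installed in the notebook environment, which keeps the original one-liner style:
import pandas as pd
# Requires s3fs in the conda_tensorflow_p36 environment;
# the bucket and key below are the question's names
data = pd.read_csv("s3://my-sagemaker-bucket/input.csv", encoding="latin1")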
I have been trying to write a function that loads multiple files from a Google Cloud Storage bucket into a single Pandas Dataframe, however I cannot seem to make it work.
import pandas as pd
from google.datalab import storage
from io import BytesIO

def gcs_loader(bucket_name, prefix):
    bucket = storage.Bucket(bucket_name)
    df = pd.DataFrame()
    for shard in bucket.objects(prefix=prefix):
        fp = shard.uri
        %gcs read -o $fp -v tmp
        df = df.append(pd.read_csv(BytesIO(tmp)))
    return df
When I try to run it says:
undefined variable referenced in command line: $fp
Sure, here's an example:
https://colab.research.google.com/notebook#fileId=0B7I8C_4vGdF6Ynl1X25iTHE4MGc
This notebook shows the following:
Create two random CSVs
Upload both CSV files to a GCS bucket
Use the GCS Python API to iterate over the files in the bucket, and
Merge each file into a single Pandas DataFrame.
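A sketch of the iterate-and-merge steps without the %gcs magic (assuming the google-cloud-storage and gcsfs packages are installed; this is not the notebook's exact code):
import pandas as pd
from google.cloud import storage

def gcs_loader(bucket_name, prefix):
    # List every object under the prefix and read each one into a DataFrame
    client = storage.Client()
    frames = [
        pd.read_csv('gs://%s/%s' % (bucket_name, blob.name))
        for blob in client.bucket(bucket_name).list_blobs(prefix=prefix)
    ]
    # Concatenate all shards into a single DataFrame
    return pd.concat(frames, ignore_index=True)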