I would like to post a pickled dataframe file to a FastAPI route, but I keep getting an error. Could anyone suggest how to fix my scripts?
This matters to me because I want to preserve the dataframe's column types and other details; posting the dataframe directly to a route is not an option, since those types and details would be lost in the JSON conversion.
main.py
import io

import pandas as pd
from fastapi import FastAPI, File

app = FastAPI(debug=True)

@app.post("/luminex_file/")
async def handle_luminex_data(file: bytes = File(...)):
    # rebuild the dataframe from the pickled upload
    df = pd.read_pickle(io.BytesIO(file))
    return {"file_size": len(file), "df": df}
request.py
import requests

files = {'file': open('dataframe.pickle', 'rb')}
response = requests.post("http://127.0.0.1:8000/luminex_file/", files=files)
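One thing to watch out for: a pandas DataFrame is not JSON-serializable, so returning df directly from the route will typically fail FastAPI's response encoding. A minimal sketch of a variant that keeps the pickle upload but returns only JSON-safe details (the response fields here are just illustrative, and file uploads require the python-multipart package to be installed):
import io

import pandas as pd
from fastapi import FastAPI, File

app = FastAPI(debug=True)

@app.post("/luminex_file/")
async def handle_luminex_data(file: bytes = File(...)):
    # dtypes survive the pickle round trip
    df = pd.read_pickle(io.BytesIO(file))
    # return only JSON-serializable information about the dataframe
    return {
        "file_size": len(file),
        "rows": len(df),
        "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
    }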
Related
I'm trying to write a data frame as CSV directly to an S3 bucket.
I've tried the StringIO method, but the problem is that I run into a "KeyTooLong" error.
import boto3
client = boto3.client('s3')
client.create_bucket(Bucket = 'poolpo-rent-a-car-bucket')
# checking if the bucket was created
response = client.list_buckets()
response['Buckets']
bucket_name = 'poolpo-rent-a-car-bucket'
car_costs.to_csv(f"s3://{bucket_name}/{car_costs}.csv")
This is the StringIO one
from io import StringIO
bucket_name = 'poolpo-rent-a-car-bucket'
csv_buffer = StringIO()
branch_locations.to_csv(csv_buffer)
s3_resource = boto3.resource('s3')
s3_resource.Object(bucket_name, f'{branch_locations}.csv').put(Body=csv_buffer.getvalue())
And the error
ClientError: An error occurred (KeyTooLongError) when calling the PutObject operation: Your key is too long
These are medium-sized dataframes, around 5000 rows and 3-5 columns.
For an unrelated reason, I had to reinstall Anaconda and the problems went away.
Ended up using a much simpler approach.
import boto3

client = boto3.client('s3')
client.create_bucket(Bucket='poolpo-rent-a-car-bucket')

# checking that the bucket exists
response = client.list_buckets()
response['Buckets']

bucket_name = 'poolpo-rent-a-car-bucket'
# pandas writes straight to S3 (via s3fs) when given an s3:// path and a plain key
car_costs.to_csv(f"s3://{bucket_name}/car_costs.csv")
One other thing I noticed in S3: when I used the f-string to build the key from the dataframe itself, I was effectively using the dataframe's entire string representation as the object name, which is why I was getting the KeyTooLongError.
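For reference, the StringIO approach from the question also works once the object key is a plain string rather than an interpolated dataframe; a minimal sketch reusing the names from above:
import boto3
from io import StringIO

bucket_name = 'poolpo-rent-a-car-bucket'

# serialize the dataframe to an in-memory CSV buffer
csv_buffer = StringIO()
branch_locations.to_csv(csv_buffer)

# upload under a short, explicit key rather than f'{branch_locations}.csv'
s3_resource = boto3.resource('s3')
s3_resource.Object(bucket_name, 'branch_locations.csv').put(Body=csv_buffer.getvalue())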
I'm writing an airflow job to read a gzipped file from s3.
First I get the key for the object, which works fine
obj = self.s3_hook.get_key(key, bucket_name=self.s3_bucket)
obj looks fine, something like this:
path/to/file/data_1.csv.gz
Now I want to read the contents into a pandas dataframe. I've tried a number of things but this is my current iteration:
import pandas as pd
df = pd.read_csv(obj['Body'], compression='gzip')
This returns the following error:
TypeError: 's3.Object' object is not subscriptable
What am I doing wrong? I feel like I need to do something with StringIO or BytesIO... I was able to read it in as bytes, but thought there was a more straightforward way to get to a dataframe.
Just in case it matters, one row of the data looks like this when I unzip and open in CSV:
9671211|ddc9979d5ff90a4714fec7290657c90f|2138|2018-01-30 00:00:12|2018-01-30 00:00:16.069048|42b32863522dbe52e963034bb0aa68b6|1909705|8803795|collect|\\N|0||0||0|
Figured it out:
obj = self.s3_hook.get_key(key, bucket_name=self.s3_bucket)
df = pd.read_csv(obj.get()['Body'], compression='gzip', header=None, sep='|')
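If you prefer the BytesIO route hinted at in the question, materializing the body first also works; a rough sketch using the same hook and key:
import io
import pandas as pd

# download the object body as raw bytes, then let pandas handle the gzip decompression
obj = self.s3_hook.get_key(key, bucket_name=self.s3_bucket)
body = obj.get()['Body'].read()
df = pd.read_csv(io.BytesIO(body), compression='gzip', header=None, sep='|')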
I am trying to read a bunch of CSV files from Google Cloud Storage into pandas dataframes as explained in Read csv from Google Cloud storage to pandas dataframe
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs(prefix=prefix)

list_temp_raw = []
for file in blobs:
    filename = file.name
    temp = pd.read_csv('gs://'+bucket_name+'/'+filename+'.csv', encoding='utf-8')
    list_temp_raw.append(temp)

df = pd.concat(list_temp_raw)
It shows the following error message while importing gcsfs. The packages 'dask' and 'gcsfs' have already been installed on my machine; however, I cannot get rid of the following error.
File "C:\Program Files\Anaconda3\lib\site-packages\gcsfs\dask_link.py", line
121, in register
dask.bytes.core._filesystems['gcs'] = DaskGCSFileSystem
AttributeError: module 'dask.bytes.core' has no attribute '_filesystems'
It seems there is some error or conflict between the gcsfs and dask packages. In fact, the dask library is not needed for your code to work. The minimal configuration for your code to run is to install these libraries (I am posting their latest versions):
google-cloud-storage==1.14.0
gcsfs==0.2.1
pandas==0.24.1
Also, filename already contains the .csv extension, so change the pd.read_csv line to this:
temp = pd.read_csv('gs://' + bucket_name + '/' + filename, encoding='utf-8')
With these changes I ran your code and it works. I suggest you create a virtual environment, install the libraries there, and run the code in it.
This has been tested and seen to work elsewhere, whether reading directly from GCS or via Dask. You may wish to import gcsfs and dask yourself and check whether _filesystems is there and what it contains:
In [1]: import dask.bytes.core
In [2]: dask.bytes.core._filesystems
Out[2]: {'file': dask.bytes.local.LocalFileSystem}
In [3]: import gcsfs
In [4]: dask.bytes.core._filesystems
Out[4]:
{'file': dask.bytes.local.LocalFileSystem,
'gcs': gcsfs.dask_link.DaskGCSFileSystem,
'gs': gcsfs.dask_link.DaskGCSFileSystem}
As of https://github.com/dask/gcsfs/pull/129, gcsfs behaves better when it is unable to register itself with Dask, so updating may solve your problem.
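A quick way to check which version of gcsfs is installed before updating:
import gcsfs
# print the installed gcsfs version
print(gcsfs.__version__)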
A few things to point out in the code above:
bucket_name and the prefix needed to be defined.
The iteration over the filenames should append each dataframe inside the loop; otherwise only the last one gets concatenated.
from google.cloud import storage
import pandas as pd

storage_client = storage.Client()
buckets_list = list(storage_client.list_buckets())

bucket_name = 'my_bucket'
bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs()

list_temp_raw = []
for file in blobs:
    filename = file.name
    temp = pd.read_csv('gs://'+bucket_name+'/'+filename, encoding='utf-8')
    print(filename, temp.head())
    list_temp_raw.append(temp)

df = pd.concat(list_temp_raw)
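If you would rather avoid the gcsfs dependency entirely, each blob can also be read through the google-cloud-storage client itself; a rough sketch of that alternative, reusing the bucket above (download_as_bytes is called download_as_string in older client versions):
import io
import pandas as pd
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket('my_bucket')

list_temp_raw = []
for blob in bucket.list_blobs():
    # download the CSV contents into memory and parse them directly
    data = blob.download_as_bytes()
    list_temp_raw.append(pd.read_csv(io.BytesIO(data), encoding='utf-8'))

df = pd.concat(list_temp_raw)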
I have developed a Spark Streaming app where I have a data stream of JSON strings.
import json

from pyspark import SparkContext, sql
from pyspark.sql import Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.mqtt import MQTTUtils  # requires the external spark-streaming-mqtt package

sc = SparkContext("local[*]", "appname")
sc.setLogLevel("WARN")
sqlContext = sql.SQLContext(sc)

def doSomething(time, rdd):
    data = rdd.toDF().toPandas()

# batch width in time
stream = StreamingContext(sc, 5)
stream.checkpoint("checkpoint")

# mqtt setup
brokerUrl = "tcp://localhost:1883"
topic = "test"

# mqtt stream
DS = MQTTUtils.createStream(stream, brokerUrl, topic)

# transform DStream to be able to read json as a dict
jsonDS = DS.map(lambda v: json.loads(v))

# create SQL-like rows from the json
sqlDS = jsonDS.map(lambda x: Row(a=x["a"], b=x["b"], c=x["c"], d=x["d"]))

# in each batch do something
sqlDS.foreachRDD(doSomething)

# run
stream.start()
stream.awaitTermination()
The code above works as expected: I receive the JSON strings, and in each batch I can convert the RDD to a DataFrame and from there to a pandas DataFrame.
So far so good.
The problem comes when I want to apply a different schema to the DataFrame.
The method toDF() assumes schema=None in the underlying call sqlContext.createDataFrame(rdd, schema).
If I try to access sqlContext from inside doSomething(), it is obviously not defined. If I try to make it available there via a global variable, I get the typical error that it cannot be serialized.
I have also read that the sqlContext can only be used in the Spark driver and not in the workers.
So the question is: how does toDF() work in the first place, given that it needs the sqlContext? And how can I add a schema to it (hopefully without changing the source)?
Creating the DataFrame in the driver doesn't seem to be an option because I cannot serialize it to the workers.
Maybe I am not seeing this properly.
Thanks a lot in advance!
Answering my own question...
define the following:
from pyspark.sql import SparkSession

def getSparkSessionInstance(sparkConf):
    if "sparkSessionSingletonInstance" not in globals():
        globals()["sparkSessionSingletonInstance"] = SparkSession \
            .builder \
            .config(conf=sparkConf) \
            .getOrCreate()
    return globals()["sparkSessionSingletonInstance"]
and then from the worker just call:
spark = getSparkSessionInstance(rdd.context.getConf())
taken from DataFrame and SQL Operations
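Putting it together, doSomething() can then build the DataFrame with an explicit schema instead of relying on toDF(); a rough sketch, with the field types chosen purely for illustration to match the a/b/c/d Row used in the question:
from pyspark.sql.types import StructType, StructField, StringType

# illustrative schema; adjust the types to your actual data
schema = StructType([
    StructField("a", StringType(), True),
    StructField("b", StringType(), True),
    StructField("c", StringType(), True),
    StructField("d", StringType(), True),
])

def doSomething(time, rdd):
    if rdd.isEmpty():
        return
    # get (or lazily create) the SparkSession on whichever node runs this
    spark = getSparkSessionInstance(rdd.context.getConf())
    df = spark.createDataFrame(rdd, schema)
    data = df.toPandas()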
Once I have a TF Serving server serving multiple models, is there a way to query the server to find out which models are being served?
Would it then be possible to get information about each of those models, like name and interface and, even more importantly, which versions of a model are present on the server and could potentially be served?
It is really hard to find information about this, but it is possible to get some model metadata.
import grpc
from tensorflow_serving.apis import get_model_metadata_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

# connect to the serving gRPC port (8500 by default)
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = get_model_metadata_pb2.GetModelMetadataRequest()
request.model_spec.name = 'your_model_name'
request.metadata_field.append("signature_def")
response = stub.GetModelMetadata(request, 10)  # 10 secs timeout
print(response.model_spec.version.value)
print(response.metadata['signature_def'])
Hope it helps.
Update
It is possible to get this information from the REST API. Just issue a GET to:
http://{serving_url}:8501/v1/models/{your_model_name}/metadata
The result is JSON, where you can easily find the model specification and the signature definition.
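For example, the same endpoint can be queried from Python (the host, port, and model name are placeholders, and the exact JSON keys may vary slightly between versions):
import requests

url = "http://localhost:8501/v1/models/your_model_name/metadata"
metadata = requests.get(url).json()
# the signature definitions live under the "metadata" entry of the response
print(metadata.get("metadata", {}).get("signature_def"))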
It is possible to get the model status as well as the model metadata. In the other answer only the metadata is requested, and the response, response.metadata['signature_def'], still needs to be decoded.
I found that the solution is to use the built-in protobuf method MessageToJson() to convert the message to a JSON string, which can then be converted to a Python dictionary with json.loads().
import grpc
import json
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
from tensorflow_serving.apis import model_service_pb2_grpc
from tensorflow_serving.apis import get_model_status_pb2
from tensorflow_serving.apis import get_model_metadata_pb2
from google.protobuf.json_format import MessageToJson

PORT = 8500
model = "your_model_name"
channel = grpc.insecure_channel('localhost:{}'.format(PORT))

# model status is served by the ModelService
stub = model_service_pb2_grpc.ModelServiceStub(channel)
request = get_model_status_pb2.GetModelStatusRequest()
request.model_spec.name = model
result = stub.GetModelStatus(request, 5)  # 5 secs timeout
print("Model status:")
print(result)

# model metadata is served by the PredictionService
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
request = get_model_metadata_pb2.GetModelMetadataRequest()
request.model_spec.name = model
request.metadata_field.append("signature_def")
result = stub.GetModelMetadata(request, 5)  # 5 secs timeout
result = json.loads(MessageToJson(result))
print("Model metadata:")
print(result)
To continue the decoding process, either follow Tyler's approach and convert the message to JSON, or, more natively, Unpack it into a SignatureDefMap and take it from there:
signature_def_map = get_model_metadata_pb2.SignatureDefMap()
response.metadata['signature_def'].Unpack(signature_def_map)
print(signature_def_map.signature_def.keys())
To request additional data about a particular served model using the REST API, you can issue (via curl, Postman, etc.):
GET http://host:port/v1/models/${MODEL_NAME}
GET http://host:port/v1/models/${MODEL_NAME}/metadata
For more information, please check https://www.tensorflow.org/tfx/serving/api_rest
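A quick sketch of issuing both calls from Python with requests (host, port, and model name are placeholders):
import requests

base = "http://localhost:8501/v1/models/my_model"

# status: which versions are loaded and their current state
print(requests.get(base).json())

# metadata: signature definitions, inputs and outputs
print(requests.get(base + "/metadata").json())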