Pandasql table not found error on AWS Lambda Function which works on local machine - pandas

I am reading a pandas DataFrame from AWS S3, trying to run some pre-processing SQL on it and save it as a CSV again, using the pandasql library. The challenge is that on my local machine it works perfectly fine, but on AWS Lambda it fails with the following error:
"An error occured: (sqlite3.OperationalError) no such table: TblV\n[SQL: SELECT * from TblV;]\n(Background on this error at: http://sqlalche.me/e/e3q8)"
Note: I've built a deployment package of pandas and pandasql on an Amazon Linux EC2 instance, zipped it together with the lambda_function code, pushed it to AWS S3 and linked it in the Lambda function by passing the path.
My local code, which works perfectly fine:
import pandas as pd
from pandasql import sqldf
from time import time
t1 = time()
TblV = pd.read_csv(r"C:\Users\ab\Documents\test.csv")
query = """SELECT * from TblV"""
df = sqldf(query, globals())
print(df.columns)
print(df.shape)
print(df.head(5))
t2 = time()
print('Time taken: ', t2 - t1)
My code in the AWS Lambda function, which throws the above error no matter what I do:
import json
import boto3
import datetime
import pandas as pd
from pandasql import sqldf
import sys
from io import StringIO
def lambda_handler(event, context):
    try:
        client = boto3.client('s3')
        bucket_name = 'bucket'
        object_key = 'test/Vol/test.csv'
        csv_obj = client.get_object(Bucket=bucket_name, Key=object_key)
        body = csv_obj['Body']
        csv_string = body.read().decode('utf-8')
        TblV = pd.read_csv(StringIO(csv_string))
        print(TblV.head(5))  # This print works perfectly
        query = """SELECT * from TblV;"""
        df = sqldf(query, globals())
        print(df.columns)
        print(df.shape)
        print(df.head(5))
    except Exception as e:
        err = "An error occured: " + str(e)
        return err

You need to download the file to the /tmp folder; AWS Lambda has 512 MB of temporary storage there.
client.download_file(bucket_name, object_key,'/tmp/file_name.extension')
In order to read the data:
TblV = pd.read_csv(r"/tmp/file_name.extension")
query = """SELECT * from TblV"""

Try using 'df = sqldf(query, locals())' instead of 'df = sqldf(query, globals())'.
The variable 'TblV' is defined inside a function, hence it can't be referred to as a global variable.
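A minimal sketch of the handler with that change applied; only the sqldf call differs from the question's code, and the bucket name and key are the placeholders from the question:
import boto3
import pandas as pd
from pandasql import sqldf
from io import StringIO

def lambda_handler(event, context):
    client = boto3.client('s3')
    csv_obj = client.get_object(Bucket='bucket', Key='test/Vol/test.csv')
    csv_string = csv_obj['Body'].read().decode('utf-8')
    TblV = pd.read_csv(StringIO(csv_string))
    query = """SELECT * from TblV;"""
    # TblV is a local variable of the handler, so pass locals() to sqldf
    df = sqldf(query, locals())
    return {'columns': list(df.columns), 'rows': df.shape[0]}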

Related

airflow ingest data from json api doesn't work

I use Google Cloud to create a pipeline for ingesting data from an API, using Google Cloud Composer:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago
from datetime import timedelta
import pymysql.cursors
import pandas as pd
import requests

def get_data_from_api():
    url = "https://de-training-2020-7au6fmnprq-de.a.run.app/currency_gbp/all"
    response = requests.get(url)
    result_conversion_rate = response.json()
    conversion_rate = pd.DataFrame.from_dict(result_conversion_rate)
    conversion_rate = conversion_rate.reset_index().rename(columns={"index": "date"})
    conversion_rate['date'] = pd.to_datetime(conversion_rate['date']).dt.date
    conversion_rate.to_csv("/home/airflow/gcs/data/conversion_rate_from_api.csv", index=False)

def covid_api():
    url = "https://covid19.th-stat.com/json/covid19v2/getTimeline.json"
    response = requests.get(url)
    df = response.json()
    df = pd.DataFrame.from_dict(df['Data'])
    df['Date'] = pd.to_datetime(df['Date']).dt.date
    df.to_csv("/home/airflow/gcs/data/result.csv", index=False)

default_args = {
    'owner': 'datath',
    'depends_on_past': False,
    'start_date': days_ago(2),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'schedule_interval': '@once'
}

dag = DAG(
    'Retail_pipeline',
    default_args=default_args,
    description='Pipeline for ETL online_retail data',
    schedule_interval=timedelta(days=1),
)

t1 = PythonOperator(
    task_id='api_call',
    python_callable=get_data_from_api,
    dag=dag,
)

t2 = PythonOperator(
    task_id='api_covid',
    python_callable=covid_api,
    dag=dag,
)

t1 >> t2
The first task works fine, but the second task fails. When I try the second task in Jupyter it works fine. Please help, I don't know what to do.
It appears it is failing on the response.json() step.
There are a couple of things you can do to troubleshoot:
Output the raw result of the response, e.g. with r.text. I think this will show you where the error is.
If you are still uncertain where the error is after step 1, load the result and try to deserialise it using Python's native json module.
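A minimal sketch of those two troubleshooting steps, using the covid_api URL from the question; the prints are purely illustrative:
import json
import requests

url = "https://covid19.th-stat.com/json/covid19v2/getTimeline.json"
r = requests.get(url)

# Step 1: look at the status code and the raw body to see what the API actually returned
print(r.status_code)
print(r.text[:500])

# Step 2: deserialise with the standard-library json module, which usually
# gives a clearer error message than r.json()
data = json.loads(r.text)
print(type(data))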

Importing a gzipped csv into pandas

I have a URL: https://api.kite.trade/instruments
And this is my code to fetch the data from the URL and write it to Excel:
import pandas as pd
url = "https://api.kite.trade/instruments"
df = pd.read_json(url)
df.to_excel("file.xlsx")
print("Program executed successfully")
but when I run this program I'm getting an error like this:
AttributeError: partially initialized module 'pandas' has no attribute 'read_json' (most likely due to a circular import)
It's not JSON, it's CSV, so you need to use read_csv. Can you please try this?
import pandas as pd
url = "https://api.kite.trade/instruments"
df = pd.read_csv(url)
df.to_excel("file.xlsx",index=False)
print("Program excuted successfully")
I added an example of how I converted the text to a CSV dump on your local drive.
import requests

url = "https://api.kite.trade/instruments"
filename = 'test.csv'
response = requests.get(url)
# write the response body to a local CSV file; the with-block closes it automatically
with open(filename, 'w') as f:
    f.write(response.text)
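A possible follow-up, assuming the test.csv dump written above: read it back with pandas and export it to Excel, which was the original goal (pandas needs an Excel writer such as openpyxl installed for to_excel):
import pandas as pd

# load the CSV dump written above and export it to Excel
df = pd.read_csv("test.csv")
df.to_excel("file.xlsx", index=False)
print("Program executed successfully")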

Upload and show with flask a dataframe | AttributeError: 'builtin_function_or_method' object has no attribute 'replace'

I am trying to upload and show a dataframe with Flask, and when I want to show it, it says
AttributeError: 'builtin_function_or_method' object has no attribute 'replace'.
I found this code on YouTube and I don't know if it is correct. Can somebody help me?
from flask import Flask, render_template, request
from werkzeug.utils import secure_filename
import pandas as pd
import csv

def reencode(file):
    for line in file:
        yield line.decode('windows-1250').encode('utf-8')

@app.route("/data")
def data():
    df = pd.read_csv("Sistema_de_Stock.csv", encoding='latin-1')
    df = df.drop(df.loc['stock al cargar':].columns, axis=1)
    df.to_html('data.html')
    with open("data.html", 'r', encoding='latin-1') as file:
        file = file.read
        file = file.replace("<table", "<table class='rwd-table'")  # <-- the error is raised here
    with open("data.html", "w") as file_write:
        file_write.write(html + file)
    data = os.startfile("data.html")
    return data
file.read is a method, so you should call the method. Furthermore, you might want to rename the variable to make it clear that it no longer refers to the file handle:
with open('data.html', 'r', encoding='latin-1') as file:
    # call the method ↓
    file_data = file.read().replace('<table', "<table class='rwd-table'")
with open('data.html', 'w') as file_write:
    file_write.write(html + file_data)
data = os.startfile('data.html')

Invoke Sagemaker Endpoint using Spark (EMR Cluster)

I am developing a Spark application on an EMR cluster. The flow of the project goes like this:
The dataframe is repartitioned based on an ID.
The SageMaker endpoint needs to be invoked on each partition to get the result.
But when doing that I am getting this error:
cPickle.PicklingError: Could not serialize object: TypeError: can't pickle thread.lock objects
The code is as follows:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark import SparkConf
import itertools
import json
import boto3
import time
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
from pyspark.sql import functions as F
from pyspark.sql.functions import lit
from io import BytesIO as StringIO

client = boto3.client('sagemaker-runtime')

def invoke_endpoint(json_data):
    ansJson = json.dumps(json_data)
    response = client.invoke_endpoint(EndpointName="<EndpointName>", Body=ansJson, ContentType='text/csv', Accept='Accept')
    resultJson = json.loads(str(response['Body'].read().decode('ascii')))
    return resultJson

def execute(list_of_url):
    final_iterator = []
    urlist = []
    json_data = {}
    for url in list_of_url:
        final_iterator.append((url.ID, url.Prediction))
        urlist.append(url.ID)
    json_data['URL'] = urlist
    ressultjson = invoke_endpoint(json_data)
    return iter(final_iterator)

### Attributes to be added to Spark Conf
conf = (SparkConf().set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true").set("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true"))
scT = SparkContext(conf=conf)
scT.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")
hadoopConf = scT._jsc.hadoopConfiguration()
hadoopConf.set("f3.s3a.awsAccessKeyId", "<AccessKeyId>")
hadoopConf.set("f3.s3a.awsSecretAccessKeyId", "<SecretAccessKeyId>")
hadoopConf.set("f3.s3a.endpoint", "s3-us-east-1.amazonaws.com")
hadoopConf.set("com.amazonaws.services.s3.enableV4", "true")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

sql = SparkSession(scT)
csv_df = sql.read.csv('s3 path to my csv file', header=True)
# print('Total count is', csv_df.count())
csv_dup_df = csv_df.dropDuplicates(['ID'])
print('Total count is', csv_dup_df.count())

windowSpec = Window.orderBy("ID")
result_df = csv_dup_df.withColumn("ImageID", F.row_number().over(windowSpec) % 80)
final_df = result_df.withColumn("Prediction", lit(str("UNKOWN")))
df2 = final_df.repartition("ImageID")
df3 = df2.rdd.mapPartitions(lambda url: execute(url)).toDF()
df3.coalesce(1).write.mode("overwrite").save("s3 path to save the results in csv format", format="csv")
print(df3.rdd.glom().collect())
##Ok
print("Work is Done")
Can you tell me how to rectify this issue?
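One common cause of this error, offered as a hedged suggestion rather than a confirmed diagnosis: the boto3 client is created at module level and captured by the function passed to mapPartitions, and boto3 clients hold thread locks that cannot be pickled. A sketch that instead creates the client inside the function running on the executors:
import json
import boto3

def invoke_endpoint(client, json_data):
    ans_json = json.dumps(json_data)
    response = client.invoke_endpoint(
        EndpointName="<EndpointName>",
        Body=ans_json,
        ContentType='text/csv',
        Accept='Accept',
    )
    return json.loads(response['Body'].read().decode('ascii'))

def execute(list_of_url):
    # build the client here, on the executor, so it is never serialized by Spark
    client = boto3.client('sagemaker-runtime')
    final_iterator = []
    urlist = []
    for url in list_of_url:
        final_iterator.append((url.ID, url.Prediction))
        urlist.append(url.ID)
    resultjson = invoke_endpoint(client, {'URL': urlist})
    return iter(final_iterator)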

Cannot store an array using dask

I am using the following code to create an array and store the results sequentially in HDF5 format. I was checking out the dask documentation, and it suggested using dask.store to store the arrays generated in a function like mine. However I receive an error: dask has no attribute store.
My code:
import os
import numpy as np
import time
import concurrent.futures
import multiprocessing
from itertools import product
import h5py
import dask as da

def mean_py(array):
    start_time = time.time()
    x = array.shape[1]
    y = array.shape[2]
    values = np.empty((x, y), type(array[0][0][0]))
    for i in range(x):
        for j in range(y):
            values[i][j] = np.mean(array[:, i, j])
    end_time = time.time()
    hours, rem = divmod(end_time - start_time, 3600)
    minutes, seconds = divmod(rem, 60)
    print("{:0>2}:{:0>2}:{:05.2f}".format(int(hours), int(minutes), int(seconds)))
    print(f"{'.'*80}")
    return values

def generate_random_array():
    a = np.random.randn(120560400).reshape(10980, 10980)
    return a

def generate_array(nums):
    for num in range(nums):
        a = generate_random_array()
        f = h5py.File('test_db.hdf5')
        d = f.require_dataset('/data', shape=a.shape, dtype=a.dtype)
        da.store(a, d)

start = time.time()
generate_array(8)
end = time.time()
print(f'\nTime complete: {end-start:.2f}s\n')
Should I use dask for such a task, or do you recommend storing the results using h5py directly?
Please ignore the mean_py(array) function; it's for something I want to try out once the data has been produced.
As suggested in the comments, you're currently doing this:
import dask as da
when you probably meant to do this:
import dask.array as da
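A minimal sketch of how the corrected import could be used with the h5py dataset from the question; the chunk size is an arbitrary illustrative choice:
import dask.array as da
import h5py
import numpy as np

a = np.random.randn(10980, 10980)

with h5py.File('test_db.hdf5', 'a') as f:
    d = f.require_dataset('/data', shape=a.shape, dtype=a.dtype)
    # dask.array.store expects dask arrays as sources, so wrap the numpy array first
    x = da.from_array(a, chunks=(1098, 1098))
    da.store(x, d)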