Count recurring variable within AWS S3 bucket using S3 Select query - sql

I'm running a Python script that queries an AWS S3 bucket using S3 Select. I read a variable from a text file and want to pass it into the S3 Select query. I also want to count all occurrences of that variable (within a specified column) across the entire S3 directory rather than just a single file.
This is what I have so far:
import boto3
from boto3.session import Session

with open('txtfile.txt', 'r') as myfile:
    variable = myfile.read()

ACCESS_KEY = 'accessKey'
SECRET_KEY = 'secredtKey'

session = Session(aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY)
s3b = session.client('s3')

r = s3b.select_object_content(
    Bucket='s3BucketName',
    Key='directory/fileName',
    ExpressionType='SQL',
    Expression="'select count(*)from S3Object s where s.columnName = %s;', [variable]",
    InputSerialization={'CSV': {"FileHeaderInfo": "Use"}},
    OutputSerialization={'CSV': {}},
)
for event in r['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)
    elif 'Stats' in event:
        statsDetails = event['Stats']['Details']
        print("Stats details bytesScanned: ")
When I run this script I get back the following error:
Traceback (most recent call last):
File "s3_query.py", line 20, in <module>
OutputSerialization={'CSV': {}},
File "/root/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 314, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/root/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 612, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (ParseUnexpectedToken) when calling the SelectObjectContent operation: Unexpected token found COMMA:',' at line 1, column 67.

This line looks quite strange:
Expression="'select count(*)from S3Object s where s.columnName = %s;', [variable]"
Everything here, including the comma and [variable], sits inside a single Python string, so S3 Select receives that literal text and stops at the comma. It is not valid SQL, and it is not how Python string substitution works either.
You should probably build the SQL string in Python first, with the value quoted for SQL, for example:
Expression="select count(*) from S3Object s where s.columnName = '%s'" % variable
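For a fuller picture, here is a hedged sketch of how the corrected call might look, extended to run the count over every object under the prefix, since the question asks about the whole "directory". The bucket, prefix, and column names are the placeholders from the question; the single-quote escaping and the per-object summing loop are my assumptions, not something tested against the original data:

import boto3
from boto3.session import Session

ACCESS_KEY = 'accessKey'   # placeholders, as in the question
SECRET_KEY = 'secretKey'

with open('txtfile.txt', 'r') as myfile:
    variable = myfile.read().strip()  # strip a trailing newline (assumption)

session = Session(aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY)
s3b = session.client('s3')

# Build the SQL in Python first; escape single quotes so the value cannot break the literal.
safe_value = variable.replace("'", "''")
expression = "select count(*) from S3Object s where s.columnName = '%s'" % safe_value

# S3 Select works on one object at a time, so loop over every key under the prefix
# and add the per-object counts together.
total = 0
paginator = s3b.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='s3BucketName', Prefix='directory/'):
    for obj in page.get('Contents', []):
        r = s3b.select_object_content(
            Bucket='s3BucketName',
            Key=obj['Key'],
            ExpressionType='SQL',
            Expression=expression,
            InputSerialization={'CSV': {'FileHeaderInfo': 'Use'}},
            OutputSerialization={'CSV': {}},
        )
        for event in r['Payload']:
            if 'Records' in event:
                # Each object returns its own count line; accumulate them.
                total += int(event['Records']['Payload'].decode('utf-8').strip())

print(total)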


Using pandas.read_csv, how can one process all errors and receive all non-error data?

Data which, for me, generates an exception instead of invoking the 'on_bad_lines' handler is at:
https://opencalaccess.org/misc/NAMES_CD.TSV
I have this:
bad_lines = list()

def bad_line_finder(x):
    bad_lines.append(str(x))
    return None

for file in os.listdir(dir):
    bad_lines = list()
    try:
        for df in pd.read_csv(f"{dir}/{file}",
                              sep='\t',
                              on_bad_lines=bad_line_finder,
                              engine='python',
                              chunksize=1000):
            print(f"\n{target}")
            df.info()
            print(f"Bad Lines: {bad_lines}")
            bad_lines = list()
    except:
        print("EXCEPTION:")
        traceback.print_exc()
and this works great. There are errors in the files, and the handler deals with them so that I can keep track of them. Except, why do I still see this:
EXCEPTION:
Traceback (most recent call last):
File "/home/ray/Projects/opencalaccess-data/import.py", line 41, in <module>
for df in pd.read_csv(f"{dir}/{file}",
File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1698, in __next__
return self.get_chunk()
File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1810, in get_chunk
return self.read(nrows=size)
File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1778, in read
) = self._engine.read( # type: ignore[attr-defined]
File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/python_parser.py", line 250, in read
content = self._get_lines(rows)
File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/python_parser.py", line 1114, in _get_lines
new_rows.append(next(self.data))
_csv.Error: ' ' expected after '"'
What is the "on_bad_lines" option doing if it does not handle all of the bad lines? Which of them will it handle and which will it not?
This is a government data source. There are format errors in the data that cannot be corrected by the agency, because the files constitute the official record. So I must fix them myself. But which errors throw exceptions and which do not?
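For what it's worth, my understanding is that with engine='python' the callable passed to on_bad_lines is only consulted for rows that tokenize cleanly but contain an unexpected number of fields; errors raised by the underlying csv module while tokenizing (such as the unbalanced quote behind "' ' expected after '\"'") surface as _csv.Error before the handler ever runs. A minimal sketch of one possible workaround for tab-separated data, reusing the bad_line_finder from above and assuming stray double quotes can safely be treated as ordinary characters:

import csv
import pandas as pd

# quoting=csv.QUOTE_NONE makes the parser treat '"' as plain data, so malformed
# quoting no longer aborts tokenization; on_bad_lines then still collects rows
# with the wrong number of fields.
for df in pd.read_csv("NAMES_CD.TSV",
                      sep='\t',
                      engine='python',
                      quoting=csv.QUOTE_NONE,
                      on_bad_lines=bad_line_finder,
                      chunksize=1000):
    df.info()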

Error using Python BigQuery API with user authentication

I'm getting an error when querying BigQuery from Python using end-user authentication.
It works successfully with service-account authentication, but fails with end-user authentication.
I am essentially following these instructions https://cloud.google.com/docs/authentication/end-user
The error message is:
ProjectId and DatasetId must be non-empty
I am stumped. Using service account authentication returns the expected data, so it appears to be an authentication related issue, but the authentication step appears to be successful.
Details
from google_auth_oauthlib import flow
from google.cloud import bigquery
appflow = flow.InstalledAppFlow.from_client_secrets_file(
    "client_secrets.json", scopes=["https://www.googleapis.com/auth/bigquery"])
appflow.run_local_server()
credentials = appflow.credentials

client = bigquery.Client(project='MyProject', credentials=credentials)

query_string = """SELECT name, SUM(number) as total
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE name = 'William'
GROUP BY name;
"""

query_job = client.query(query_string)
for row in query_job.result():
    print("{}: {}".format(row["name"], row["total"]))
gives the following errors:
Please visit this URL to authorize this application: https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=...
Traceback (most recent call last):
File "C:\python\test\bqtest3.py", line 15, in <module>
query_job = client.query(query_string)
File "C:\Users\me\AppData\Local\Programs\Python\Python310\lib\site-packages\google\cloud\bigquery\client.py", line 3331, in query
return _job_helpers.query_jobs_insert(
File "C:\Users\me\AppData\Local\Programs\Python\Python310\lib\site-packages\google\cloud\bigquery\_job_helpers.py", line 114, in query_jobs_insert
future = do_query()
File "C:\Users\me\AppData\Local\Programs\Python\Python310\lib\site-packages\google\cloud\bigquery\_job_helpers.py", line 91, in do_query
query_job._begin(retry=retry, timeout=timeout)
File "C:\Users\me\AppData\Local\Programs\Python\Python310\lib\site-packages\google\cloud\bigquery\job\query.py", line 1298, in _begin
super(QueryJob, self)._begin(client=client, retry=retry, timeout=timeout)
File "C:\Users\me\AppData\Local\Programs\Python\Python310\lib\site-packages\google\cloud\bigquery\job\base.py", line 510, in _begin
api_response = client._call_api(
File "C:\Users\me\AppData\Local\Programs\Python\Python310\lib\site-packages\google\cloud\bigquery\client.py", line 756, in _call_api
return call()
File "C:\Users\me\AppData\Local\Programs\Python\Python310\lib\site-packages\google\api_core\retry.py", line 283, in retry_wrapped_func
return retry_target(
File "C:\Users\me\AppData\Local\Programs\Python\Python310\lib\site-packages\google\api_core\retry.py", line 190, in retry_target
return target()
File "C:\Users\me\AppData\Local\Programs\Python\Python310\lib\site-packages\google\cloud\_http\__init__.py", line 494, in api_request
raise exceptions.from_http_response(response)
google.api_core.exceptions.BadRequest: 400 POST https://bigquery.googleapis.com/bigquery/v2/projects/MyProject/jobs?prettyPrint=false: ProjectId and DatasetId must be non-empty
Location: None
Job ID: 9eba1ce9-971a-4495-825a-728aed28fc98
Please add the Python env line (the shebang) to the script as shown below; it is the first line of the snippet.
#!/usr/bin/env python
from google_auth_oauthlib import flow
from google.cloud import bigquery

appflow = flow.InstalledAppFlow.from_client_secrets_file(
    "client_secrets.json", scopes=["https://www.googleapis.com/auth/bigquery"])
appflow.run_local_server()
credentials = appflow.credentials

client = bigquery.Client(project='MyProject', credentials=credentials)
query_job = client.query(
    """
    SELECT name, SUM(number) as total
    FROM `bigquery-public-data.usa_names.usa_1910_current`
    WHERE name = 'William'
    GROUP BY name;
    """
)
for row in query_job.result():
    print("{}: {}".format(row["name"], row["total"]))

Export Scrapy JSON Feed - Fails for Dynamic FEED_URI for AWS S3 using ScrapingHub

I have written a Scrapy scraper that writes data out using the JsonItemExporter, and I have worked out how to export this data to my AWS S3 bucket using the following spider settings in ScrapingHub:
AWS_ACCESS_KEY_ID = AAAAAAAAAAAAAAAAAAAA
AWS_SECRET_ACCESS_KEY = Abababababababababababababababababababab
FEED_FORMAT = json
FEED_URI = s3://scraper-dexi/my-folder/jobs-001.json
What I need to do is dynamically set the date/time in the output file name, ideally in a format like jobs-20171215-1000.json, but I don't know how to set a dynamic FEED_URI with ScrapingHub.
There is not much information online, and the only example I can find is in the ScrapingHub documentation, but unfortunately it does not work.
When I apply these settings based on the example in the documentation:
AWS_ACCESS_KEY_ID = AAAAAAAAAAAAAAAAAAAA
AWS_SECRET_ACCESS_KEY = Abababababababababababababababababababab
FEED_FORMAT = json
FEED_URI = s3://scraper-dexi/my-folder/jobs-%(time).json
Note the %(time) in my URI
The scraping fails with the following errors
[scrapy.utils.signal] Error caught on signal handler: <bound method ?.open_spider of <scrapy.extensions.feedexport.FeedExporter object at 0x7fd11625d410>>
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
result = f(*args, **kw)
File "/usr/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "/usr/local/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 190, in open_spider
uri = self.urifmt % self._get_uri_params(spider)
ValueError: unsupported format character 'j' (0x6a) at index 53
[scrapy.utils.signal] Error caught on signal handler: <bound method ?.item_scraped of <scrapy.extensions.feedexport.FeedExporter object at 0x7fd11625d410>>
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
result = f(*args, **kw)
File "/usr/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "/usr/local/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 220, in item_scraped
slot = self.slot
AttributeError: 'FeedExporter' object has no attribute 'slot'
I misunderstood the importance of the trailing s in the documentation and did not realize that it is part of the token itself.
I altered
FEED_URI = s3://scraper-dexi/my-folder/jobs-%(time).json
to
FEED_URI = s3://scraper-dexi/my-folder/jobs-%(time)s.json
as per the documentation, which solved the problem:
%(time)
changed to
%(time)s
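To see why the trailing s matters, here is a small sketch of the same old-style % substitution that Scrapy's FeedExporter performs internally (uri = self.urifmt % self._get_uri_params(spider), as in the traceback above); the params dict below is a made-up stand-in for what Scrapy would actually supply:

params = {'time': '2017-12-15T10-00-00', 'name': 'jobs'}

# With the trailing "s" the placeholder is a complete %(key)s conversion:
good = "s3://scraper-dexi/my-folder/jobs-%(time)s.json" % params
print(good)  # s3://scraper-dexi/my-folder/jobs-2017-12-15T10-00-00.json

# Without it, Python reads "%(time).j" and stops at 'j', which is exactly the
# "unsupported format character 'j'" ValueError in the error log:
bad = "s3://scraper-dexi/my-folder/jobs-%(time).json" % params  # raises ValueError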

pypyodbc execute returns list index out of range error

I have a function that runs 3 queries and returns the result of the last one (using the previous results to build the last query). When I get to the 3rd query, I get a list index out of range error. I have run this exact query as the first query (with manually entered variables) and it worked fine.
This is my code:
import pypyodbc


def sql_conn():
    conn = pypyodbc.connect(r'Driver={SQL Server};'
                            r'Server=HPSQL31\ni1;'
                            r'Database=tq_hp_prod;'
                            r'Trusted_Connection=yes;')
    cursor = conn.cursor()
    return conn, cursor


def get_number_of_jobs(ticket):
    # Get Connection
    conn, cursor = sql_conn()
    # Get asset number
    sqlcommand = "select top 1 item from deltickitem where dticket = {} and cat_code = 'Trq sub'".format(ticket)
    cursor.execute(sqlcommand)
    asset = cursor.fetchone()[0]
    print(asset)
    # Get last MPI date
    sqlcommand = "select last_test from prevent where item = {} and description like '%mpi'".format(asset)
    cursor.execute(sqlcommand)
    last_recal = cursor.fetchone()[0]
    print(last_recal)
    # Get number of jobs since last recalibration
    sqlcommand = """select count(i.item)
                    from deltickhdr as d
                    join deltickitem as i
                    on d.dticket = i.dticket
                    where i.start_rent >= '2017-03-03 00:00:00'
                    and i.meterstart <> i.meterstop
                    and i.item = '002600395'"""  # .format(last_recal, asset)
    cursor.execute(sqlcommand)
    num_jobs = cursor.fetchone()[0]
    print(num_jobs)
    cursor.close()
    conn.close()
    return num_jobs


ticketnumber = 14195  # int(input("Ticket: "))
get_number_of_jobs(ticketnumber)
Below is the error I get when I reach the third cursor.execute(sqlcommand):
Traceback (most recent call last):
File "C:\Program Files\JetBrains\PyCharm Community Edition 2016.3.2\helpers\pydev\pydevd.py", line 1596, in <module>
globals = debugger.run(setup['file'], None, None, is_module)
File "C:\Program Files\JetBrains\PyCharm Community Edition 2016.3.2\helpers\pydev\pydevd.py", line 974, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "C:\Program Files\JetBrains\PyCharm Community Edition 2016.3.2\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/bdrillin/PycharmProjects/Torque_Turn_Data/tt_sub_ui.py", line 56, in <module>
get_number_of_jobs(ticketnumber)
File "C:/Users/bdrillin/PycharmProjects/Torque_Turn_Data/tt_sub_ui.py", line 45, in get_number_of_jobs
cursor.execute(sqlcommand)
File "C:\ProgramData\Anaconda3\lib\site-packages\pypyodbc.py", line 1470, in execute
self._free_stmt(SQL_CLOSE)
File "C:\ProgramData\Anaconda3\lib\site-packages\pypyodbc.py", line 1994, in _free_stmt
check_success(self, ret)
File "C:\ProgramData\Anaconda3\lib\site-packages\pypyodbc.py", line 1007, in check_success
ctrl_err(SQL_HANDLE_STMT, ODBC_obj.stmt_h, ret, ODBC_obj.ansi)
File "C:\ProgramData\Anaconda3\lib\site-packages\pypyodbc.py", line 972, in ctrl_err
state = err_list[0][0]
IndexError: list index out of range
Any help would be great
I've had the same error.
Even though I haven't come to a definite conclusion about what this error means, I thought my guess might help anyone else ending up here.
In my case, the problem was a conflict between data type lengths (NVARCHAR(24) and CHAR(10)).
So I guess this IndexError in the ctrl_err function just means there is an error in your SQL code that pypyodbc does not know how to handle.
I know this is not much of an answer, but it would have saved me a couple of hours had I known this was not some bug in pypyodbc but an inconsistency in the data I was inserting.
Kind regards,
Luka
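As a hedged side note on the question's code rather than on Luka's answer: building the SQL with .format() means the driver receives literal values it may not be able to convert, and the real failure then hides behind the IndexError above. A sketch of the third query using ODBC-style ? parameters instead (same tables and variables as in the question; whether this avoids the original failure is an assumption):

sqlcommand = """select count(i.item)
                from deltickhdr as d
                join deltickitem as i on d.dticket = i.dticket
                where i.start_rent >= ?
                  and i.meterstart <> i.meterstop
                  and i.item = ?"""
# pypyodbc binds the values itself, handling quoting and data types.
cursor.execute(sqlcommand, (last_recal, asset))
num_jobs = cursor.fetchone()[0]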

Socket error while connecting to hive through: hiveserver2-client.py

Has anyone experienced this error before when trying to connect to Hive?
Sample code used (https://github.com/telefonicaid/fiware-cygnus/blob/master/cygnus-ngsi/resources/hiveclients/python/hiveserver2-client.py):
import sys
import pyhs2
from pyhs2.error import Pyhs2Exception

# get the input parameters
if len(sys.argv) != 6:
    print 'Usage: python hiveserver2-client.py <hive_host> <hive_port> <db_name> <hadoop_user> <hadoop_password>'
    sys.exit()

hiveHost = sys.argv[1]
hivePort = sys.argv[2]
dbName = sys.argv[3]
hadoopUser = sys.argv[4]
hadoopPassword = sys.argv[5]

# do the connection
with pyhs2.connect(host=hiveHost,
                   port=hivePort,
                   authMechanism="PLAIN",
                   user=hadoopUser,
                   password=hadoopPassword,
                   database=dbName) as conn:
    # get a client
    with conn.cursor() as client:
        # create a loop attending HiveQL queries
        while (1):
            query = raw_input('remotehive> ')
            try:
                if not query:
                    continue
                if query == 'exit':
                    sys.exit()
                # execute the query
                client.execute(query)
                # get the content
                for row in client.fetch():
                    print row
            except Pyhs2Exception, ex:
                print ex.errorMessage
Error displayed:
[centos#test]$ sudo python hiveserver2-client.py computing.cosmos.lab.fiware.org 10000 default USERNAME TOKEN
Traceback (most recent call last):
File "hiveserver2-client.py", line 42, in <module>
database=dbName) as conn:
File "/usr/lib/python2.7/site-packages/pyhs2/__init__.py", line 7, in connect
return Connection(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/pyhs2/connections.py", line 46, in __init__
transport.open()
File "/usr/lib/python2.7/site-packages/pyhs2/cloudera/thrift_sasl.py", line 74, in open
status, payload = self._recv_sasl_message()
File "/usr/lib/python2.7/site-packages/pyhs2/cloudera/thrift_sasl.py", line 92, in _recv_sasl_message
header = self._trans.readAll(5)
File "/usr/lib/python2.7/site-packages/thrift/transport/TTransport.py", line 60, in readAll
chunk = self.read(sz - have)
File "/usr/lib/python2.7/site-packages/thrift/transport/TSocket.py", line 132, in read
message='TSocket read 0 bytes')
thrift.transport.TTransport.TTransportException: TSocket read 0 bytes
Can you post your piece of code? This looks like the auth mechanism or the credentials being sent are not valid.
authMechanism can be "PLAIN" or "KERBEROS", depending on your setup.
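As a hedged aside: "TSocket read 0 bytes" generally just means the server closed the connection during the handshake, which often points to a mismatch between the client's auth mechanism and what HiveServer2 is configured for (e.g. PLAIN vs KERBEROS vs NOSASL). Before digging into auth, a quick standard-library check that the host and port from the question are reachable at all:

import socket

# If this fails, the problem is network/firewall; if it succeeds but the Thrift
# handshake still dies with "TSocket read 0 bytes", an auth mechanism mismatch
# is the likelier cause.
sock = socket.create_connection(("computing.cosmos.lab.fiware.org", 10000), timeout=10)
print("TCP connection OK")
sock.close()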