I'm running a Python script that queries an AWS S3 bucket using S3 Select. I'm importing a variable from a txt file and want to pass it into the S3 Select query. I also want to count all occurrences of the imported variable (within a specified column) by querying the entire S3 directory instead of just a single file.
This is what I have so far:
import boto3
from boto3.session import Session

with open('txtfile.txt', 'r') as myfile:
    variable = myfile.read()

ACCESS_KEY = 'accessKey'
SECRET_KEY = 'secretKey'

session = Session(aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY)
s3b = session.client('s3')

r = s3b.select_object_content(
    Bucket='s3BucketName',
    Key='directory/fileName',
    ExpressionType='SQL',
    Expression="'select count(*)from S3Object s where s.columnName = %s;', [variable]",
    InputSerialization={'CSV': {"FileHeaderInfo": "Use"}},
    OutputSerialization={'CSV': {}},
)

for event in r['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)
    elif 'Stats' in event:
        statsDetails = event['Stats']['Details']
        print("Stats details bytesScanned: ")
When I run this script I get back the following error:
Traceback (most recent call last):
File "s3_query.py", line 20, in <module>
OutputSerialization={'CSV': {}},
File "/root/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 314, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/root/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 612, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (ParseUnexpectedToken) when calling the SelectObjectContent operation: Unexpected token found COMMA:',' at line 1, column 67.
This line looks quite strange:
Expression="'select count(*)from S3Object s where s.columnName = %s;', [variable]"
That is not valid SQL, and it is not normal Python string substitution either.
You should probably build the expression with ordinary string formatting, quoting the value as a SQL string literal and passing it as a tuple:
Expression="select count(*) from S3Object s where s.columnName = '%s'" % (variable,)
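For the second part of the question (counting across the whole "directory"): S3 Select only runs against one object per call, so one approach is to list every key under the prefix and sum the per-object counts. This is only a sketch under that assumption; the bucket name, prefix, and column name are the placeholders from the question.

import boto3

session = boto3.session.Session(aws_access_key_id='accessKey',
                                aws_secret_access_key='secretKey')
s3 = session.client('s3')

with open('txtfile.txt', 'r') as myfile:
    variable = myfile.read().strip()

# Quote the value as a SQL string literal; this assumes it contains no single quotes.
expression = "select count(*) from S3Object s where s.columnName = '%s'" % (variable,)

total = 0
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='s3BucketName', Prefix='directory/'):
    for obj in page.get('Contents', []):
        r = s3.select_object_content(
            Bucket='s3BucketName',
            Key=obj['Key'],
            ExpressionType='SQL',
            Expression=expression,
            InputSerialization={'CSV': {'FileHeaderInfo': 'Use'}},
            OutputSerialization={'CSV': {}},
        )
        for event in r['Payload']:
            if 'Records' in event:
                # count(*) comes back as a single CSV record, e.g. b"42\n"
                total += int(event['Records']['Payload'].decode('utf-8').strip())

print(total)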
Here is data that, for me, generates an exception instead of invoking the 'on_bad_lines' handler:
https://opencalaccess.org/misc/NAMES_CD.TSV
I have this:

import os
import traceback

import pandas as pd

bad_lines = list()

def bad_line_finder(x):
    bad_lines.append(str(x))
    return None

for file in os.listdir(dir):
    bad_lines = list()
    try:
        for df in pd.read_csv(f"{dir}/{file}",
                              sep='\t',
                              on_bad_lines=bad_line_finder,
                              engine='python',
                              chunksize=1000):
            print(f"\n{target}")
            df.info()
            print(f"Bad Lines: {bad_lines}")
            bad_lines = list()
    except:
        print("EXCEPTION:")
        traceback.print_exc()
and this works great. There are errors in the files, and the handler catches them so that I can keep track of them. Except, why do I still see this:
EXCEPTION:
Traceback (most recent call last):
File "/home/ray/Projects/opencalaccess-data/import.py", line 41, in <module>
for df in pd.read_csv(f"{dir}/{file}",
File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1698, in __next__
return self.get_chunk()
File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1810, in get_chunk
return self.read(nrows=size)
File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1778, in read
) = self._engine.read( # type: ignore[attr-defined]
File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/python_parser.py", line 250, in read
content = self._get_lines(rows)
File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/python_parser.py", line 1114, in _get_lines
new_rows.append(next(self.data))
_csv.Error: ' ' expected after '"'
What is the "on_bad_lines" option doing if it does not handle all of the bad lines? Which of them will it handle and which will it not?
This is a government data source. There are format errors in the data that cannot be corrected by the agency, because they constitute the official record. So I must fix them myself. But which of them throw exceptions and which do not?
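For what it's worth, the traceback suggests the _csv.Error is raised by the underlying csv tokenizer while it is reading a malformed quoted field, i.e. before pandas ever has a complete row to hand to on_bad_lines (which, with the python engine, only sees rows that tokenize but have the wrong number of fields). Two possible workarounds, as a sketch rather than a definitive fix (the file path below is a placeholder):

import csv
import traceback

import pandas as pd

path = "NAMES_CD.TSV"  # placeholder path

# Option 1: if the quotes in the data are not meaningful, turn quote parsing off
# so a stray '"' cannot derail the tokenizer; malformed rows then reach on_bad_lines.
bad_lines = []
for df in pd.read_csv(path,
                      sep='\t',
                      engine='python',
                      quoting=csv.QUOTE_NONE,
                      on_bad_lines=lambda row: bad_lines.append(str(row)),
                      chunksize=1000):
    df.info()

# Option 2: keep quote parsing but catch the tokenizer error per chunk,
# so you at least see how far the parser got before the bad quoting.
reader = pd.read_csv(path, sep='\t', engine='python', chunksize=1000)
while True:
    try:
        df = next(reader)
        df.info()
    except StopIteration:
        break
    except csv.Error:
        traceback.print_exc()
        break  # the python engine cannot reliably resume mid-file after this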
I'm getting an error when querying BigQuery from Python using end-user authentication
It works successfully with service account authentication, but fails with end-user authentication.
I am essentially following these instructions https://cloud.google.com/docs/authentication/end-user
The error message is:
ProjectId and DatasetId must be non-empty
I am stumped. Using service account authentication returns the expected data, so it appears to be an authentication-related issue, but the authentication step itself appears to be successful.
Details
from google_auth_oauthlib import flow
from google.cloud import bigquery

appflow = flow.InstalledAppFlow.from_client_secrets_file(
    "client_secrets.json", scopes=["https://www.googleapis.com/auth/bigquery"])
appflow.run_local_server()
credentials = appflow.credentials

client = bigquery.Client(project='MyProject', credentials=credentials)

query_string = """SELECT name, SUM(number) as total
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE name = 'William'
GROUP BY name;
"""

query_job = client.query(query_string)

for row in query_job.result():
    print("{}: {}".format(row["name"], row["total"]))
gives the following errors:
Please visit this URL to authorize this application: https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=...
Traceback (most recent call last):
File "C:\python\test\bqtest3.py", line 15, in <module>
query_job = client.query(query_string)
File "C:\Users\me\AppData\Local\Programs\Python\Python310\lib\site-packages\google\cloud\bigquery\client.py", line 3331, in query
return _job_helpers.query_jobs_insert(
File "C:\Users\me\AppData\Local\Programs\Python\Python310\lib\site-packages\google\cloud\bigquery\_job_helpers.py", line 114, in query_jobs_insert
future = do_query()
File "C:\Users\me\AppData\Local\Programs\Python\Python310\lib\site-packages\google\cloud\bigquery\_job_helpers.py", line 91, in do_query
query_job._begin(retry=retry, timeout=timeout)
File "C:\Users\me\AppData\Local\Programs\Python\Python310\lib\site-packages\google\cloud\bigquery\job\query.py", line 1298, in _begin
super(QueryJob, self)._begin(client=client, retry=retry, timeout=timeout)
File "C:\Users\me\AppData\Local\Programs\Python\Python310\lib\site-packages\google\cloud\bigquery\job\base.py", line 510, in _begin
api_response = client._call_api(
File "C:\Users\me\AppData\Local\Programs\Python\Python310\lib\site-packages\google\cloud\bigquery\client.py", line 756, in _call_api
return call()
File "C:\Users\me\AppData\Local\Programs\Python\Python310\lib\site-packages\google\api_core\retry.py", line 283, in retry_wrapped_func
return retry_target(
File "C:\Users\me\AppData\Local\Programs\Python\Python310\lib\site-packages\google\api_core\retry.py", line 190, in retry_target
return target()
File "C:\Users\me\AppData\Local\Programs\Python\Python310\lib\site-packages\google\cloud\_http\__init__.py", line 494, in api_request
raise exceptions.from_http_response(response)
google.api_core.exceptions.BadRequest: 400 POST https://bigquery.googleapis.com/bigquery/v2/projects/MyProject/jobs?prettyPrint=false: ProjectId and DatasetId must be non-empty
Location: None
Job ID: 9eba1ce9-971a-4495-825a-728aed28fc98
Please add the Python env line to the script, as shown below (the added #!/usr/bin/env python line at the top):
#!/usr/bin/env python
from google_auth_oauthlib import flow
from google.cloud import bigquery

appflow = flow.InstalledAppFlow.from_client_secrets_file(
    "client_secrets.json", scopes=["https://www.googleapis.com/auth/bigquery"])
appflow.run_local_server()
credentials = appflow.credentials
client = bigquery.Client(project='MyProject', credentials=credentials)
query_job = client.query(
    """
    SELECT name, SUM(number) as total
    FROM `bigquery-public-data.usa_names.usa_1910_current`
    WHERE name = 'William'
    GROUP BY name;
    """
)
for row in query_job.result():
    print("{}: {}".format(row["name"], row["total"]))
I have written a Scrapy scraper that writes data out using the JsonItemExporter, and I have worked out how to export this data to my AWS S3 bucket using the following spider settings in ScrapingHub:
AWS_ACCESS_KEY_ID = AAAAAAAAAAAAAAAAAAAA
AWS_SECRET_ACCESS_KEY = Abababababababababababababababababababab
FEED_FORMAT = json
FEED_URI = s3://scraper-dexi/my-folder/jobs-001.json
What I need to do is set the date/time on the output file name dynamically. I would love a format like jobs-20171215-1000.json, but I don't know how to set a dynamic FEED_URI with ScrapingHub.
There is not much information online, and the only example I can find is on the ScrapingHub site, but unfortunately it does not work.
When I apply these settings, based on the example in the documentation:
AWS_ACCESS_KEY_ID = AAAAAAAAAAAAAAAAAAAA
AWS_SECRET_ACCESS_KEY = Abababababababababababababababababababab
FEED_FORMAT = json
FEED_URI = s3://scraper-dexi/my-folder/jobs-%(time).json
Note the %(time) in my URI
The scraping fails with the following errors:
[scrapy.utils.signal] Error caught on signal handler: <bound method ?.open_spider of <scrapy.extensions.feedexport.FeedExporter object at 0x7fd11625d410>>
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
result = f(*args, **kw)
File "/usr/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "/usr/local/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 190, in open_spider
uri = self.urifmt % self._get_uri_params(spider)
ValueError: unsupported format character 'j' (0x6a) at index 53
[scrapy.utils.signal] Error caught on signal handler: <bound method ?.item_scraped of <scrapy.extensions.feedexport.FeedExporter object at 0x7fd11625d410>>
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
result = f(*args, **kw)
File "/usr/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "/usr/local/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 220, in item_scraped
slot = self.slot
AttributeError: 'FeedExporter' object has no attribute 'slot'
I misunderstood the importance of the trailing s in the documentation and did not realize that it is part of the placeholder token itself.
I altered
FEED_URI = s3://scraper-dexi/my-folder/jobs-%(time).json
to
FEED_URI = s3://scraper-dexi/my-folder/jobs-%(time)s.json
as per the documentation, and that solved the problem: %(time) changed to %(time)s.
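The reason the trailing s matters is visible in the traceback: Scrapy builds the feed URI with plain Python %-style string formatting (uri = self.urifmt % self._get_uri_params(spider)), and in that syntax every %(name) placeholder needs a conversion character such as s. A quick illustration:

# Why the trailing "s" matters: the URI template is formatted with %-style substitution.
params = {"time": "20171215-1000"}

print("s3://scraper-dexi/my-folder/jobs-%(time)s.json" % params)
# s3://scraper-dexi/my-folder/jobs-20171215-1000.json

try:
    "s3://scraper-dexi/my-folder/jobs-%(time).json" % params
except ValueError as e:
    print(e)  # unsupported format character 'j' -- the error from the log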
I have a function that runs 3 queries and returns the result of the last one (using the previous ones to build the last). When I get to the 3rd query, I get a "list index out of range" error. I have run this exact query as the first query (with manually entered variables) and it worked fine.
This is my code:
import pypyodbc

def sql_conn():
    conn = pypyodbc.connect(r'Driver={SQL Server};'
                            r'Server=HPSQL31\ni1;'
                            r'Database=tq_hp_prod;'
                            r'Trusted_Connection=yes;')
    cursor = conn.cursor()
    return conn, cursor

def get_number_of_jobs(ticket):
    # Get Connection
    conn, cursor = sql_conn()

    # Get asset number
    sqlcommand = "select top 1 item from deltickitem where dticket = {} and cat_code = 'Trq sub'".format(ticket)
    cursor.execute(sqlcommand)
    asset = cursor.fetchone()[0]
    print(asset)

    # Get last MPI date
    sqlcommand = "select last_test from prevent where item = {} and description like '%mpi'".format(asset)
    cursor.execute(sqlcommand)
    last_recal = cursor.fetchone()[0]
    print(last_recal)

    # Get number of jobs since last recalibration
    sqlcommand = """select count(i.item)
                    from deltickhdr as d
                    join deltickitem as i
                      on d.dticket = i.dticket
                    where i.start_rent >= '2017-03-03 00:00:00'
                      and i.meterstart <> i.meterstop
                      and i.item = '002600395'"""  # .format(last_recal, asset)
    cursor.execute(sqlcommand)
    num_jobs = cursor.fetchone()[0]
    print(num_jobs)

    cursor.close()
    conn.close()

    return num_jobs

ticketnumber = 14195  # int(input("Ticket: "))
get_number_of_jobs(ticketnumber)
Below is the error I get when I reach the third cursor.execute(sqlcommand):
Traceback (most recent call last):
File "C:\Program Files\JetBrains\PyCharm Community Edition 2016.3.2\helpers\pydev\pydevd.py", line 1596, in <module>
globals = debugger.run(setup['file'], None, None, is_module)
File "C:\Program Files\JetBrains\PyCharm Community Edition 2016.3.2\helpers\pydev\pydevd.py", line 974, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "C:\Program Files\JetBrains\PyCharm Community Edition 2016.3.2\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/bdrillin/PycharmProjects/Torque_Turn_Data/tt_sub_ui.py", line 56, in <module>
get_number_of_jobs(ticketnumber)
File "C:/Users/bdrillin/PycharmProjects/Torque_Turn_Data/tt_sub_ui.py", line 45, in get_number_of_jobs
cursor.execute(sqlcommand)
File "C:\ProgramData\Anaconda3\lib\site-packages\pypyodbc.py", line 1470, in execute
self._free_stmt(SQL_CLOSE)
File "C:\ProgramData\Anaconda3\lib\site-packages\pypyodbc.py", line 1994, in _free_stmt
check_success(self, ret)
File "C:\ProgramData\Anaconda3\lib\site-packages\pypyodbc.py", line 1007, in check_success
ctrl_err(SQL_HANDLE_STMT, ODBC_obj.stmt_h, ret, ODBC_obj.ansi)
File "C:\ProgramData\Anaconda3\lib\site-packages\pypyodbc.py", line 972, in ctrl_err
state = err_list[0][0]
IndexError: list index out of range
Any help would be great
I've had the same error.
Even though I haven't come to a definite conclusion about what this error means, I thought my guess might help anyone else ending up here.
In my case, the problem was a conflict with a datatype length (NVARCHAR(24) and CHAR(10)).
So I guess this IndexError in the ctrl_err function just means there is an error in your SQL code that pypyodbc does not know how to handle.
I know this is not much of an answer, but I know it would have saved me a couple of hours had I known this was not some bug in pypyodbc but an inconsistency in the data I was inserting.
Kind regards,
Luka
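Building on that answer: since the underlying problem is usually a malformed or type-mismatched literal baked into the SQL string, one way to sidestep it is to let pypyodbc bind the values via DB-API qmark parameters instead of .format(). This is just a sketch, reusing the cursor, last_recal, and asset from the question's function:

# Replacement for the third query: the driver binds the values, so quoting and
# datatype conversion are handled for you instead of being pasted into the SQL text.
sqlcommand = """select count(i.item)
                from deltickhdr as d
                join deltickitem as i
                  on d.dticket = i.dticket
                where i.start_rent >= ?
                  and i.meterstart <> i.meterstop
                  and i.item = ?"""
cursor.execute(sqlcommand, (last_recal, asset))
num_jobs = cursor.fetchone()[0]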
Has anyone experienced this error before when trying to connect to Hive?
Sample code used (https://github.com/telefonicaid/fiware-cygnus/blob/master/cygnus-ngsi/resources/hiveclients/python/hiveserver2-client.py):
import sys
import pyhs2
from pyhs2.error import Pyhs2Exception

# get the input parameters
if len(sys.argv) != 6:
    print 'Usage: python hiveserver2-client.py <hive_host> <hive_port> <db_name> <hadoop_user> <hadoop_password>'
    sys.exit()

hiveHost = sys.argv[1]
hivePort = sys.argv[2]
dbName = sys.argv[3]
hadoopUser = sys.argv[4]
hadoopPassword = sys.argv[5]

# do the connection
with pyhs2.connect(host=hiveHost,
                   port=hivePort,
                   authMechanism="PLAIN",
                   user=hadoopUser,
                   password=hadoopPassword,
                   database=dbName) as conn:
    # get a client
    with conn.cursor() as client:
        # create a loop attending HiveQL queries
        while (1):
            query = raw_input('remotehive> ')
            try:
                if not query:
                    continue
                if query == 'exit':
                    sys.exit()
                # execute the query
                client.execute(query)
                # get the content
                for row in client.fetch():
                    print row
            except Pyhs2Exception, ex:
                print ex.errorMessage
Error displayed:
[centos#test]$ sudo python hiveserver2-client.py computing.cosmos.lab.fiware.org 10000 default USERNAME TOKEN
Traceback (most recent call last):
File "hiveserver2-client.py", line 42, in <module>
database=dbName) as conn:
File "/usr/lib/python2.7/site-packages/pyhs2/__init__.py", line 7, in connect
return Connection(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/pyhs2/connections.py", line 46, in __init__
transport.open()
File "/usr/lib/python2.7/site-packages/pyhs2/cloudera/thrift_sasl.py", line 74, in open
status, payload = self._recv_sasl_message()
File "/usr/lib/python2.7/site-packages/pyhs2/cloudera/thrift_sasl.py", line 92, in _recv_sasl_message
header = self._trans.readAll(5)
File "/usr/lib/python2.7/site-packages/thrift/transport/TTransport.py", line 60, in readAll
chunk = self.read(sz - have)
File "/usr/lib/python2.7/site-packages/thrift/transport/TSocket.py", line 132, in read
message='TSocket read 0 bytes')
thrift.transport.TTransport.TTransportException: TSocket read 0 bytes
Can you post your piece of code? This looks like the auth mechanism or the credentials being sent are not valid.
authMechanism can be "PLAIN" or "KERBEROS", depending on your setup.
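Along the same lines, a couple of things worth ruling out (assumptions on my side, not confirmed by the logs): pass the port as an int, and make sure authMechanism matches the server's hive.server2.authentication setting, since a mismatch there often ends with the server closing the socket and exactly this "TSocket read 0 bytes" error. A minimal sketch:

import sys
import pyhs2

hiveHost = sys.argv[1]
hivePort = int(sys.argv[2])  # sys.argv yields a string; an int avoids any ambiguity
dbName = sys.argv[3]
hadoopUser, hadoopPassword = sys.argv[4], sys.argv[5]

with pyhs2.connect(host=hiveHost,
                   port=hivePort,
                   authMechanism="PLAIN",   # or "KERBEROS", matching the server config
                   user=hadoopUser,
                   password=hadoopPassword,
                   database=dbName) as conn:
    with conn.cursor() as client:
        # a simple smoke test instead of the interactive loop
        client.execute("show tables")
        for row in client.fetch():
            print(row)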