I have a dataframe in pandas containing the following information
Looping over each entry in the TRANSACTION_ID column, I call the following function:
def checkForImages(TransNum):
    """pass function a transaction number and get the string with image-found
    information, then store that string into the same row in a new column"""
    try:
        cursor.execute('select CAMERA_TYPE from VEHICLE_IMAGE where TRANSACTION_ID=' + str(TransNum))
        result = ''
        for img_type in cursor:
            result = result + img_type[0]
        if result == '':
            result = 'No image available'
        print 'Images found: ' + str(TransNum) + " " + result
        resultSort = result.split()
        resultSort.sort()
        result = ''
        for i in range(len(resultSort)):
            result = result + " " + resultSort[i]
        cursor.close()
        return result
    except Exception as e:
        # print 'Error occurred while getting image references: ', e
        pass
This function returns a string which is either 'No image available' or contains the image information if found. I have to create a new column in the dataframe populated with this result, so my final dataframe should look like this:
My question is: how can I speed up this process? Using a for loop on 100k+ rows is extremely slow and painful. I have looked into functions like dataframe.map and dataframe.apply but haven't been able to get them working. The other options I see are using Cython or multiple threads. Which option should I invest my time in? Any help is appreciated.
You query Oracle for each transaction and then additionally aggregate the fetched data for each transaction in a loop - that's very inefficient.
First I would create a "mapping" DataFrame like the following:
transaction_id images
111 No image available
112 FRONT REAR
113 OVERVIEW
This can be done using Oracle's LISTAGG function:
qry = """
select
transaction_id,
NVL(listagg(camera_type, ' ') within group (order by camera_type), 'No image available') as images
from vehicle_image group by transaction_id
"""
# `engine` is a SQLAlchemy engine connection
cam = pd.read_sql(qry, con=engine, index_col=['transaction_id'])
After that we can use the Series.map() method:
df['Image_Found'] = df.transaction_id.map(cam.images)
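One caveat with this approach: transactions that have no rows at all in VEHICLE_IMAGE will not appear in the aggregated result, so map() yields NaN for them. A minimal sketch of filling those in with the same fallback text the query uses:
# rows missing from `cam` come back as NaN from .map()
df['Image_Found'] = df.transaction_id.map(cam.images).fillna('No image available')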
I have a dataframe that contains 391 columns and a number of rows. I am trying to push this to a database via pyodbc, using the following command:
cursor = conn.cursor()
cursor.fast_executemany = True
cursor.executemany(
    f"INSERT INTO db.tble({', '.join(df.columns.tolist())}) VALUES ({('?,' * len(df.columns))[:-1]})",
    list(df.itertuples(index=False, name=None))
)
cursor.commit()
I would have thought this method would be dynamic for a dataframe of any size, yet I get the following error:
ProgrammingError: ('Expected 0 parameters, supplied 391', 'HY000')
I am struggling to understand this, as the syntax looks correct and ? has been used instead of %s as in other answers. Can someone please help?
Thanks
I once wrote a piece of code where I wanted to create the insert statement dynamically, based on the number of columns in the data frame.
Here is how the insert query would be passed to the database:
INSERT INTO dbo.Table (column1,columns2,column3) VALUES (?,?,?)
Again, the number of columns and the '?' value placeholders have to be created dynamically at runtime, based on the number of columns in the data frame.
I wrote the piece below to build the string of placeholders (?,?,?) and concatenate it with the insert query. Here:
df is the dataframe,
symbol_counter holds the number of placeholders still to add (the column count minus the initial '?'),
sym_string is the final string, i.e. (?,?,?,?...n), based on the number of columns:
symbol = ['?']
sym_string = ''
symbol_counter = int(df.shape[1]) - 1
word = 0
for word in range(symbol_counter):
    # sym_string += str(symbol)
    symbol.insert(word, "?")
    word += 1
sym_string = ','.join(symbol)
# and then use this variable, concatenated with the rest of the query as shown below
query = Variable_holding_first_partofthequery + " VALUES (" + sym_string + ")"
I know it's the long way around, but that's how I got it to work. Good luck!
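For reference, the same placeholder string can also be built in one line; this sketch assumes the same df and query variables as above:
# one '?' per dataframe column, joined with commas: '?,?,...,?'
sym_string = ','.join(['?'] * df.shape[1])
query = Variable_holding_first_partofthequery + " VALUES (" + sym_string + ")"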
I am using the boto3 library with executeStatement to get data from an RDS cluster using the Data API.
The query works fine if I select 1 or 2 columns, but as soon as I add another column to the query, it returns an error: (BadRequestException) permission denied for relation table_name
I have checked with pgAdmin that the permissions for the user I am using are intact to query the whole db.
The function included in the call:
def execute_query(self, sql_query, sql_parameters=[]):
    """
    Aurora DataAPI execute query. Generally used for select statements.
    :param sql_query: Query
    :param sql_parameters: parameters in sql query
    :return: DataApi response
    """
    client = self.api_access()
    response = client.execute_statement(
        resourceArn=RESOURCE_ARN,
        secretArn=SECRET_ARN,
        database='db_name',
        sql=sql_query,
        includeResultMetadata=True,
        parameters=sql_parameters)
    return response
Function call that works with no errors:
query = '''
SELECT id
FROM schema_name.table_name
limit 1
'''
print(query)
result = conn.execute_query(query)
print(result)
Function call that fails with the above error:
query = '''
SELECT id,name,event
FROM schema_name.table_name
limit 1
'''
print(query)
result = conn.execute_query(query)
print(result)
Is there a horizontal limit on what we can get from the Data API using boto3? I know there is a 1MB limit, but per the documentation it should still return something if the result exceeds the limit.
Backend is Postgres RDS
UPDATE:
I can select the same column 10 times and it's not a problem:
query = '''
SELECT id,event,event,event,event,event
FROM schema_name.table_name
limit 1
'''
print(query)
result = conn.execute_query(query)
print(result)
So this means there are some columns that I cannot select.
I didn't know there was column-level security on some tables. If column-level privileges are set in Postgres for the user you are using, then obviously you cannot select those columns.
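If you want to confirm this from the database side, explicit column-level grants can be inspected through information_schema. A sketch reusing the execute_query() helper above; the schema, table, and grantee names are placeholders:
query = '''
SELECT column_name, privilege_type
FROM information_schema.column_privileges
WHERE table_schema = 'schema_name'
  AND table_name = 'table_name'
  AND grantee = 'your_api_user'
'''
# columns missing from this result have no explicit column-level SELECT grant
result = conn.execute_query(query)
print(result)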
I have a query like this:
SELECT prodId, prod_name , prod_type FROM mytable WHERE prod_type in (:list_prod_names)
I want to get the information for a product; the possible types are: "day", "week", "weekend", "month". Depending on the date, it might be at least one of those options, or a combination of all of them.
This info (a list) is returned by the function prod_names(search_date).
I am using cx_Oracle bindings with code like:
def get_prod_by_type(search_date: datetime):
    query_path = r'./queries/prod_by_name.sql'
    raw_query = open(query_path).read().strip().replace('\n', ' ').replace('\t', ' ').replace('  ', ' ')
    print(raw_query)
    # Depending on the date the product types may be different
    prod_names = prod_names(search_date)  # this returns a list with the possible names
    qry_params = {"list_prod_names": prod_names}  # see attempts below
    try:
        db = DB(username='username', password='pss', hostname="localhost")
        df = db.get(raw_query, qry_params)
    except Exception:
        exception_error = traceback.format_exc()
        exception_error = 'Exception on DB.get_short_cov_op2() : %s' % exception_error
        print(exception_error)
    return df
For this: qry_params = {"list_prod_names": prod_names} I have tried multiple different things such as:
prod_names = ''.join(prod_names)
prod_names = str(prod_names)
prod_names = " \'" + ''.join(prod_names) + "\'"
The only thing I have managed to get to work is:
new_query = raw_query.format(list_prod_names=prodnames_for_date(search_date)).replace('[', '').replace(']','')
df = db.query(new_query)
I am trying not to use .format(), because applying .format() to an SQL query is bad practice and opens the door to injection attacks.
db.py contains, among other functions:
def get(self, sql, params={}):
    cur = self.con.cursor()
    cur.prepare(sql)
    try:
        cur.execute(sql, **params)
        df = pd.DataFrame(cur.fetchall(), columns=[c[0] for c in cur.description])
    except Exception:
        exception_error = traceback.format_exc()
        exception_error = 'Exception on DB.get() : %s' % exception_error
        print(exception_error)
        self.con.rollback()
    cur.close()
    df.columns = df.columns.map(lambda x: x.upper())
    return df
I would like to be able to bind the list properly.
I am using:
Python 3.6
cx_Oracle 6.3.1
I have read the following articles but I am still unable to find a solution:
Python cx_Oracle bind variables
Python cx_Oracle SQL with bind string variable
Search for name in cx_Oracle
Unfortunately you cannot bind an array directly unless you convert it to a SQL type and use a subquery -- which is fairly complex. So instead you need to do something like this:
inClauseParts = []
for i, inValue in enumerate(ARRAY_VALUE):
    argName = "arg_" + str(i + 1)
    inClauseParts.append(":" + argName)
clause = "%s in (%s)" % (columnName, ",".join(inClauseParts))
This works fine, but be aware that if the number of elements in the array changes regularly, this technique will create a separate statement that must be parsed for each distinct number of elements. If you know that in general you won't have more than (for example) 10 elements in the array, it is better to append None to the incoming array so that the number of elements is always 10.
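Putting it together, a sketch of how the generated clause pairs with a matching bind-variable dictionary (cursor, ARRAY_VALUE, and columnName as above; the table and column list are illustrative):
inClauseParts = []
bindValues = {}
for i, inValue in enumerate(ARRAY_VALUE):
    argName = "arg_" + str(i + 1)
    inClauseParts.append(":" + argName)
    bindValues[argName] = inValue  # bind name -> value
clause = "%s in (%s)" % (columnName, ",".join(inClauseParts))
cursor.execute("select prodId, prod_name, prod_type from mytable where " + clause, bindValues)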
Hopefully that is clear enough!
I have finally managed to do it. It might not be pretty, but it works.
I have modified my SQL query to include an extra select which returns the values in my list of descriptors:
inner join (
    SELECT regexp_substr(:my_list_of_items, '[^,]+', 1, LEVEL) as mylist
    FROM dual
    CONNECT BY LEVEL <= length(:my_list_of_items) - length(REPLACE(:my_list_of_items, ',', '')) + 1
) d
on d.mylist = a.corresponding_columns
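On the Python side, the list then only needs to be joined into a single comma-separated string and bound as one ordinary parameter; a sketch reusing the db.get() helper above (the repeated :my_list_of_items bind only has to be supplied once):
# the CONNECT BY subquery splits the string back into rows
qry_params = {"my_list_of_items": ",".join(prod_names(search_date))}
df = db.get(raw_query, qry_params)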
I've looked at the documentation in various places to see how to do this, but I haven't had any success. I want to pass the name of a column into an SQL query. I'm using psycopg2, and my most recent attempt was based on this doc page: http://initd.org/psycopg/docs/sql.html#module-psycopg2.sql
Here is my latest attempt, but I get the error IndexError: tuple index out of range, pointing to the format() where I'm passing in the parameter.
def parse_files(cursor):
    for name in column_names:
        cursor.execute(
            sql.SQL(
                "select planet_osm_point.{}, count(*) from planet_osm_point group by planet_osm_point.{}"
            ).format(sql.Identifier(name)))
        for row in cursor:
            print(str(row[0]) + str(row[1]))
It's not clear from the documentation, but it looks like I need to pass a value inside the {}, specifying which argument I want to use. In this case it's {0}:
from psycopg2 import sql

column_names = ['col1', 'col2']
for column in column_names:
    query = sql.SQL('''
        select {0}, count(*)
        from planet_osm_point pop
        group by {0}
    ''').format(sql.Identifier('pop', column))  # psycopg2 >= 2.8: separate args render a dotted name, "pop"."col1"
    cursor.execute(query)
    for row in cursor.fetchall():
        print(str(row[0]) + str(row[1]))
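If you ever need several dynamic columns in one statement, the same module can compose them too; a sketch using sql.SQL(', ').join, with the column list assumed as above:
from psycopg2 import sql

# join the identifiers into a safely quoted, comma-separated column list
cols = sql.SQL(', ').join(sql.Identifier(c) for c in column_names)
query = sql.SQL('select {} from planet_osm_point').format(cols)
cursor.execute(query)
for row in cursor.fetchall():
    print(row)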
I'm trying to use .extra() where the query returns more than one result, like:
'SELECT "books_books"."*" FROM "books_books" WHERE "books_books"."owner_id" = %s' % request.user.id
I got an error : only a single result allowed for a SELECT that is part of an expression
I tried it on the dev server using sqlite3. Does anybody know how to fix this? Or is my query wrong?
EDIT:
I'm using django-simple-ratings, and my model looks like this:
class Thread(models.Model):
    #
    #
    ratings = Ratings()
I want to display each Thread's ratings and whether a user has already rated it or not. For 2 items, it will hit the db 6 times: 1 for the actual Thread and 2 for accessing the ratings. The query:
threads = Thread.ratings.order_by_rating().filter(section = section)\
    .select_related('creator')\
    .prefetch_related('replies')
threads = threads.extra(select = dict(myratings = "SELECT SUM('section_threadrating'.'score') AS 'agg' FROM 'section_threadrating' WHERE 'section_threadrating'.'content_object_id' = 'section_thread'.'id' ",))
Then I can print each Thread's ratings without hitting the db again. For the 2nd query, I add:
# continuing from the extra above
blahblah.extra(select = dict(myratings = '#####code above####',
    voter_id = "SELECT 'section_threadrating'.'user_id' FROM 'section_threadrating' WHERE ('section_threadrating'.'content_object_id' = 'section_thread'.'id' AND 'section_threadrating'.'user_id' = '3') "))
I hard-coded the user_id. Then I use it in a template like this:
{% ifequal threads.voter_id user.id %}
#the rest of the code
I got an error : only a single result allowed for a SELECT that is part of an expression
Let me know if it's not clear enough.
The problem is in the query. Generally, when you write subqueries, they must return only one result. So a subquery like the voter_id one:
select ..., (select section_threadrating.user_id from ...) as voter_id from ....
is invalid, because it can return more than one result. If you are sure it will always return exactly one result, you can use the max() or min() aggregate function:
blahblah.extra(select = dict(myratings = '#####code above####',
    voter_id = "SELECT max('section_threadrating'.'user_id') FROM 'section_threadrating' WHERE ('section_threadrating'.'content_object_id' = 'section_thread'.'id' AND 'section_threadrating'.'user_id' = '3') "))
This will make the subquery always return one result.
Setting aside that hard-coded value, which user_id are you expecting to retrieve here? Maybe you just can't reduce it to one user using only SQL.