Scrapy: MySQL Pipeline -- Unexpected Errors Encountered - scrapy

I'm getting a number of errors, depending upon what is being inserted/updated.
Here is the code for processing the item:
def process_item(self, item, spider):
    try:
        if 'producer' in item:
            self.cursor.execute("""INSERT INTO Producers (title, producer) VALUES (%s, %s)""", (item['title'], item['producer']))
        elif 'actor' in item:
            self.cursor.execute("""INSERT INTO Actors (title, actor) VALUES (%s, %s)""", (item['title'], item['actor']))
        elif 'director' in item:
            self.cursor.execute("""INSERT INTO Directors (title, director) VALUES (%s, %s)""", (item['title'], item['director']))
        else:
            self.cursor.execute("""UPDATE example_movie SET distributor=%S, rating=%s, genre=%s, budget=%s WHERE title=%s""", (item['distributor'], item['rating'], item['genre'], item['budget'], item['title']))
        self.conn.commit()
    except MySQLdb.Error, e:
        print "Error %d: %s" % (e.args[0], e.args[1])
    return item
Here is an example of the items returned from the scraper:
[{'budget': [u'N/A'], 'distributor': [u'Lorimar'], 'genre': [u'Action'], 'rating': [u'R'],'title': [u'Action Jackson']}, {'actor': u'Craig T. Nelson', 'title': [u'Action Jackson']}, {'actor': u'Sharon Stone', 'title': [u'Action Jackson']}, {'actor': u'Carl Weathers', 'title': [u'Action Jackson']}, {'producer': u'Joel Silver', 'title': [u'Action Jackson']}, {'director': u'Craig R. Baxley', 'title': [u'Action Jackson']}]
Here are the errors returned:
2013-08-25 23:04:57-0500 [ActorSpider] ERROR: Error processing {'budget': [u'N/A'],
'distributor': [u'Lorimar'],
'genre': [u'Action'],
'rating': [u'R'],
'title': [u'Action Jackson']}
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/scrapy/middleware.py", line 62, in _process_chain
return process_chain(self.methods[methodname], obj, *args)
File "/Library/Python/2.7/site-packages/scrapy/utils/defer.py", line 65, in process_chain
d.callback(input)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 361, in callback
self._startRunCallbacks(result)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 455, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 542, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Users/fortylashes/Documents/Management_Work/BoxOfficeMojo/BoxOfficeMojo/pipelines.py", line 53, in process_item
self.cursor.execute("""UPDATE example_movie SET distributor=%S, rating=%s, genre=%s, budget=%s WHERE title=%s""", (item['distributor'], item['rating'], item['genre'], item['budget'], item['title']))
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/MySQLdb/cursors.py", line 159, in execute
query = query % db.literal(args)
exceptions.ValueError: unsupported format character 'S' (0x53) at index 38
Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '), 'Craig T. Nelson')' at line 1
2013-08-25 23:04:57-0500 [ActorSpider] DEBUG: Scraped from <200 http://www.boxofficemojo.com/movies/?id=actionjackson.htm>
{'actor': u'Craig T. Nelson', 'title': [u'Action Jackson']}
Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '), 'Sharon Stone')' at line 1
2013-08-25 23:04:57-0500 [ActorSpider] DEBUG: Scraped from <200 http://www.boxofficemojo.com/movies/?id=actionjackson.htm>
{'actor': u'Sharon Stone', 'title': [u'Action Jackson']}
Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '), 'Carl Weathers')' at line 1
2013-08-25 23:04:57-0500 [ActorSpider] DEBUG: Scraped from <200 http://www.boxofficemojo.com/movies/?id=actionjackson.htm>
{'actor': u'Carl Weathers', 'title': [u'Action Jackson']}
Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '), 'Joel Silver')' at line 1
2013-08-25 23:04:57-0500 [ActorSpider] DEBUG: Scraped from <200 http://www.boxofficemojo.com/movies/?id=actionjackson.htm>
{'producer': u'Joel Silver', 'title': [u'Action Jackson']}
Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '), 'Craig R. Baxley')' at line 1
2013-08-25 23:04:57-0500 [ActorSpider] DEBUG: Scraped from <200 http://www.boxofficemojo.com/movies/?id=actionjackson.htm>
{'director': u'Craig R. Baxley', 'title': [u'Action Jackson']}
Apparently there are a lot of issues. Thank you for reading! Any and all suggestions or ideas are greatly appreciated!
::::UPDATE/MORE INFO::::
There appear to be three movies, out of the test set of 52, which are being inserted into the Actors, Producers and Directors tables. Note: the UPDATE statement isn't working at all.
These movies are: Abraham Lincoln: Vampire Hunter, Ace Ventura: Pet Detective and Ace Ventura: When Nature Calls
Interestingly, these are all of the movies that have a colon (:) in the title. I'm not sure what this means, but if anyone has an idea please share it!
:::::INSERT SOLVED:::::
Turns out the problem was caused by the scraper putting individual items in a list: {'actor': [u'this one guy']} as opposed to {'actor': u'this one guy'}.

You used the wrong format specifier for the string value on line 53 of your code. It should be a lowercase 's', not a capital 'S'. This is the line in question:
self.cursor.execute("""UPDATE example_movie SET distributor=%S, rating=%s, genre=%s, budget=%s WHERE title=%s""", (item['distributor'], item['rating'], item['genre'], item['budget'], item['title']))
It should be like this:
self.cursor.execute("""UPDATE example_movie SET distributor=%s, rating=%s, genre=%s, budget=%s WHERE title=%s""", (item['distributor'], item['rating'], item['genre'], item['budget'], item['title']))
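Putting the two fixes from this thread together (the lowercase %s, and unwrapping the single-element lists mentioned in the question's update), the statement selection could be sketched roughly as below. This is only a Python 3 sketch under assumptions, not the asker's actual pipeline: `unwrap` and `build_statement` are hypothetical helper names, and the table and column names are simply copied from the question.

```python
def unwrap(value):
    """Return the sole element of a one-item list or tuple, else the value unchanged."""
    if isinstance(value, (list, tuple)) and len(value) == 1:
        return value[0]
    return value


def build_statement(item):
    """Pick the SQL text and the parameter tuple for one scraped item."""
    title = unwrap(item["title"])
    # Table and column names come from this fixed mapping, never from scraped data.
    for key, table in (("producer", "Producers"),
                       ("actor", "Actors"),
                       ("director", "Directors")):
        if key in item:
            sql = "INSERT INTO {0} (title, {1}) VALUES (%s, %s)".format(table, key)
            return sql, (title, unwrap(item[key]))
    # Anything else is treated as a movie record, as in the original else branch.
    sql = ("UPDATE example_movie SET distributor=%s, rating=%s, "
           "genre=%s, budget=%s WHERE title=%s")
    fields = ("distributor", "rating", "genre", "budget")
    return sql, tuple(unwrap(item[f]) for f in fields) + (title,)
```

Inside process_item this would presumably be used as self.cursor.execute(*build_statement(item)) followed by self.conn.commit(). Passing plain strings instead of one-item lists is what keeps the driver from rendering the title as a parenthesised tuple, which matches the "near '), ...'" fragment in the 1064 errors.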

Related

Execution failed on sql - Error - Pandas.to.sql()

I want to save a data frame in a Database table. What I did :
Connect to azure Sql server DB
import pyodbc
# Create
server = 'XXXXXXXXXXXXXXXXXXXX'
database = 'XXXXXXXXXXXXXXXXXXX'
username = 'XXXXXXXXXXXXXXXX'
password = 'XXXXXXXXXXXX'
cnxn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)
cursor = cnxn.cursor()
Create Table
create_table = """
CREATE TABLE forecast_data (
    CompanyEndDate text,
    Retailer text,
    Store_Name text,
    Category text,
    Description text,
    QtySold int);
"""
cursor.execute(create_table)
cnxn.commit()
Use pandas to_sql
data.to_sql('forecast_data', con=cnxn)
I get this error:
ProgrammingError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
1680 try:
-> 1681 cur.execute(*args, **kwargs)
1682 return cur
ProgrammingError: ('42S02', "[42S02] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Invalid object name 'sqlite_master'. (208) (SQLExecDirectW)")
The above exception was the direct cause of the following exception:
DatabaseError Traceback (most recent call last)
7 frames
/usr/local/lib/python3.7/dist-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
1691
1692 ex = DatabaseError(f"Execution failed on sql '{args[0]}': {exc}")
-> 1693 raise ex from exc
1694
1695 @staticmethod
DatabaseError: Execution failed on sql 'SELECT name FROM sqlite_master WHERE type='table' AND name=?;': ('42S02', "[42S02] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Invalid object name 'sqlite_master'. (208) (SQLExecDirectW)")
Any one have an idea what is going on ?
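to_sql works once you import sqlalchemy and pass it an engine. With a raw pyodbc connection, pandas falls back to its SQLite code path, which is why the failing query references sqlite_master.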
Related Post:
pandas to sql server
import sqlalchemy
...
engine = sqlalchemy.create_engine(
    "mssql+pyodbc://user:pwd@server/database?driver=ODBC+Driver+17+for+SQL+Server",
    echo=False)
data.to_sql('forecast_data', con=engine, if_exists='replace')
With a raw pyodbc connection alone, to_sql does not work.
Your code should look like the example above. You can also read my answer in the post below.
Logon failed (pyodbc.InterfaceError) ('28000', "[28000] [Microsoft][ODBC SQL Server Driver][SQL Server]Login failed for user 'xxxx'
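If you would rather keep the ODBC connection string from the question, SQLAlchemy also accepts it URL-encoded via the odbc_connect parameter. The following is a hedged sketch: `mssql_url` and `save_frame` are hypothetical helper names, the credentials are placeholders, and `save_frame` is untested here because it needs sqlalchemy, pandas, pyodbc and a reachable server.

```python
from urllib.parse import quote_plus, unquote_plus


def mssql_url(server, database, username, password,
              driver="ODBC Driver 17 for SQL Server"):
    """Build a SQLAlchemy URL that wraps a raw ODBC connection string."""
    odbc = "DRIVER={{{0}}};SERVER={1};DATABASE={2};UID={3};PWD={4}".format(
        driver, server, database, username, password)
    # quote_plus escapes braces, semicolons, '=' and spaces for use in a URL.
    return "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc)


def save_frame(data, url, table="forecast_data"):
    """Write a DataFrame through a SQLAlchemy engine instead of a raw connection."""
    import sqlalchemy  # imported lazily so mssql_url works without it installed
    engine = sqlalchemy.create_engine(url)
    # if_exists="append" keeps the existing table; index=False skips the
    # DataFrame index, which has no matching column in the CREATE TABLE above.
    data.to_sql(table, con=engine, if_exists="append", index=False)
```

With if_exists='replace', as in the answer above, pandas would drop and recreate the table instead, so the manual CREATE TABLE step becomes unnecessary.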

Exception: [IBM][CLI Driver][DB2/LINUXX8664] SQL0104N unexpected token Error

I have the following data frame
CALL_DISPOSITION CITY END INCIDENT_NUMBER
0 ADV-Advised Waterloo Fri, 23 Mar 2018 01:13:27 GMT 6478983
1 AST-Assist Waterloo Sat, 18 Mar 2017 12:41:47 GMT 724030
2 AST-Assist Waterloo Sat, 18 Mar 2017 12:41:47 GMT 999000
I am trying to push this to an IBM DB2 Database.
I have the following code:
# IBM DB2 imports
import ibm_db
# instantiate db2 connection
connection_id = ibm_db.connect(
    conn_string,
    '',
    '',
    conn_option,
    ibm_db.QUOTED_LITERAL_REPLACEMENT_OFF)
# create list of tuples from df
records = [tuple(row) for _, row in df.iterrows()]
# Define sql statement structure to replace data into WATERLOO_911_CALLS table
column_names = df.columns
df_sql = "VALUES({}{})".format("?," * (len(column_names) - 1), "?")
sql_command = "REPLACE INTO WATERLOO_911_CALLS {} ".format(df_sql)
# Prepare SQL statement
try:
    sql_command = ibm_db.prepare(connection_id, sql_command)
except Exception as e:
    print(e)
# Execute query
try:
    ibm_db.execute_many(sql_command, tuple(records))
except Exception as e:
    print('Data pushing error {}'.format(e))
However, I keep getting the following error:
Exception: [IBM][CLI Driver][DB2/LINUXX8664] SQL0104N An unexpected token "REPLACE INTO WATERLOO_911_CALLS" was found following "BEGIN-OF-STATEMENT". Expected tokens may include: "<space>". SQLSTATE=42601 SQLCODE=-104
I don't understand why that is the case. I followed the steps outlined in this repo but I can't seem to get this to work. What am I doing wrong? Please let me know if there are any clarifications I can make.
The error hints at a missing space, so maybe it needs one between the parameter markers in the VALUES() string.
For example, df_sql = "VALUES({}{})".format("?, " * (len(column_names) - 1), "?")
instead of df_sql = "VALUES({}{})".format("?," * (len(column_names) - 1), "?")
Just a hunch.
I find that printing sql_command before executing it could also help troubleshooting.
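To make the "print it first" advice concrete, here is a small sketch that builds the statement and prints it before anything is sent to the database. Note that it swaps in a plain INSERT: as far as I know, REPLACE INTO is MySQL syntax that DB2 does not accept (an upsert there would need MERGE), which would fit the parser rejecting the very first token. Treat that as an assumption to verify against the DB2 documentation. `build_insert` is a hypothetical helper name.

```python
def build_insert(table, n_columns):
    """Return an INSERT statement with one '?' parameter marker per column."""
    markers = ", ".join(["?"] * n_columns)
    return "INSERT INTO {0} VALUES ({1})".format(table, markers)


# Four columns, matching the data frame shown in the question.
sql_command = build_insert("WATERLOO_911_CALLS", 4)

# Inspect the exact text before handing it to ibm_db.prepare().
print(sql_command)
```

Seeing the literal statement makes it easier to paste it into a DB2 console and check whether the syntax itself is the problem, independent of the Python code.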

SQL syntax error using PgAdmin3

I'm trying to insert the following data, but I'm getting a syntax error for the line "2, docking station ..."
This is the error I'm getting, even though I entered line 1 in exactly the same way.
ERROR: syntax error at or near "Docking"
LINE 4: (2,'Docking Station','AS-0201764-RT','275-4197991-QC-53','Eq...
^
********** Error **********
ERROR: syntax error at or near "Docking"
SQL state: 42601
Character: 348
INSERT INTO assets (id, name, asset_tag, asset_serial_number, asset_type, product_descrption, asset_status, warranty_date, warranty_length, employee_id)
VALUES
(1,'Monitor','AS-8153000-DP','358-7190353-JD-44', 'Equipment', 'A display screen used to provide visual output from a computer', 'Assigned', '2018-01-08', '1 year', 'ZO-9440’ ),
(2,'Docking Station','AS-0201764-RT','275-4197991-QC-53','Equipment', 'A device in which a laptop computer, smartphone, or other mobile device may be placed for charging','Assigned','2017-07-12','1 year','ZO-9440’),

Python & SnakeSQL - raise lock.LockError('Lock no longer valid.') ERROR

I am trying to run a Python script (createdb.py) that contains DB operations from my main Python script (app.py), but I get the error below.
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\web\application.py", line 236, in process
return self.handle()
File "C:\Python27\lib\site-packages\web\application.py", line 227, in handle
return self._delegate(fn, self.fvars, args)
File "C:\Python27\lib\site-packages\web\application.py", line 409, in _delegate
return handle_class(cls)
File "C:\Python27\lib\site-packages\web\application.py", line 384, in handle_class
return tocall(*args)
File "D:\Python\virtualenvs\new4\textweb\bin\app.py", line 16, in GET
createdb.createTables()
File "D:\Python\virtualenvs\new4\textweb\bin\createdb.py", line 9, in createTables
cursor.execute("CREATE TABLE table (dateColumn Date, numberColumn Integer)")
File "D:\Python\virtualenvs\new4\textweb\bin\SnakeSQL\driver\base.py", line 1548, in execute
self.info = self.connection._create(parsedSQL['table'], parsedSQL['columns'], parameters)
File "D:\Python\virtualenvs\new4\textweb\bin\SnakeSQL\driver\base.py", line 993, in _create
self._insertRowInColTypes(table)
File "D:\Python\virtualenvs\new4\textweb\bin\SnakeSQL\driver\base.py", line 632, in _insertRowInColTypes
], types= ['String','String','String','Bool','Bool','Bool','Text','Text','Integer']
File "D:\Python\virtualenvs\new4\textweb\bin\SnakeSQL\driver\dbm.py", line 61, in _insertRow
self.tables[table].file[str(primaryKey)] = str(values)
File "D:\Python\virtualenvs\new4\textweb\bin\SnakeSQL\external\lockdbm.py", line 50, in __setitem__
raise lock.LockError('Lock no longer valid.')
LockError: Lock no longer valid.
Here is my createdb.py code:
import SnakeSQL
connection = SnakeSQL.connect(database='test', autoCreate=True)
connection = SnakeSQL.connect(database='test')
cursor = connection.cursor()

def createTables():
    cursor.execute("CREATE TABLE table (dateColumn Date, numberColumn Integer)")
    cursor.execute("INSERT INTO table (dateColumn, numberColumn) VALUES ('2003-11-8', 3)")
    cursor.execute("INSERT INTO table (dateColumn, numberColumn) VALUES ('2004-11-8', 4)")
    cursor.execute("INSERT INTO table (dateColumn, numberColumn) VALUES ('2005-11-8', 5)")
    cursor.execute("INSERT INTO table (dateColumn, numberColumn) VALUES ('2006-11-8', 6)")

def select():
    selectResult = cursor.execute("SELECT dateColumn FROM table WHERE numberColumn = 3")
    return selectResult

if __name__ == "__main__":
    createTables()
And here is my app.py code:
import web
import SnakeSQL
import createdb

render = web.template.render('templates/')
connection = SnakeSQL.connect(database='test')
cursor = connection.cursor()
urls = (
    '/', 'index'
)

class index:
    def GET(self):
        createdb.createTables()
        result = createdb.select()
        return render.index(result)

if __name__ == "__main__":
    app = web.application(urls, globals())
    app.run()
I couldn't find out why I am having this error. Can you please share your knowledge for solving this problem?
First off, the SnakeSQL docs appear to be from 2004, the actual code was last updated in 2009, and the author states that the project is no longer maintained. You may want to consider using something still actively maintained instead.
The docs also mention:
In theory, one of the processes accessing the database could get stuck in an infinite loop and not release the lock on the database to allow other users to access it. After a period of 2 seconds, if the process with the current lock on the database doesn't access it, the lock will be released and another process can obtain a lock. The first process will itself have to wait to obtain a lock.
Looking at your traceback, I'll make an educated guess that since you put the cursor at module level (which again, you probably don't want to do), it created the cursor when the module was first imported, then by the time your program actually ran the createTables function, more than 2 seconds had elapsed, and it has given up the lock.
Try moving the line to create your cursor inside your methods:
def createTables():
    cursor = connection.cursor()
    cursor.execute("CREATE TABLE table (dateColumn Date, numberColumn Integer)")
    cursor.execute("INSERT INTO table (dateColumn, numberColumn) VALUES ('2003-11-8', 3)")
    cursor.execute("INSERT INTO table (dateColumn, numberColumn) VALUES ('2004-11-8', 4)")
    cursor.execute("INSERT INTO table (dateColumn, numberColumn) VALUES ('2005-11-8', 5)")
    cursor.execute("INSERT INTO table (dateColumn, numberColumn) VALUES ('2006-11-8', 6)")

def select():
    cursor = connection.cursor()
    selectResult = cursor.execute("SELECT dateColumn FROM table WHERE numberColumn = 3")
    return selectResult
(and do the same in your app.py code).
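If switching to something actively maintained is an option, the same "acquire the database handle inside the function" idea could look roughly like this with the standard library's sqlite3 module. This is a stand-in sketch, not SnakeSQL code: the `readings` table name and the `path` parameter are assumptions made for illustration (`table` itself is a reserved word in SQLite).

```python
import sqlite3


def create_tables(path):
    """Open a connection only for the duration of this call, then close it."""
    conn = sqlite3.connect(path)
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "CREATE TABLE IF NOT EXISTS readings "
                "(dateColumn TEXT, numberColumn INTEGER)")
            conn.executemany(
                "INSERT INTO readings (dateColumn, numberColumn) VALUES (?, ?)",
                [("2003-11-08", 3), ("2004-11-08", 4),
                 ("2005-11-08", 5), ("2006-11-08", 6)])
    finally:
        conn.close()


def select(path):
    """Return the dates stored for numberColumn = 3 as a list of row tuples."""
    conn = sqlite3.connect(path)
    try:
        cursor = conn.execute(
            "SELECT dateColumn FROM readings WHERE numberColumn = ?", (3,))
        return cursor.fetchall()
    finally:
        conn.close()
```

Nothing is held open at module level here, so no lock or handle can go stale between import time and the first request.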

Passing lists or tuples as arguments in django raw sql

I have a list and want to pass it as a parameter to a Django raw SQL query.
Here is my list
region = ['US','CA','UK']
I am pasting a part of raw sql here.
results = MMCode.objects.raw('select assigner, assignee from mm_code where date between %s and %s and country_code in %s',[fromdate,todate,region])
It gives the error below when I execute it in the Django Python shell:
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/usr/local/lib/python2.6/dist-packages/django/db/models/query.py", line 1412, in __iter__
query = iter(self.query)
File "/usr/local/lib/python2.6/dist-packages/django/db/models/sql/query.py", line 73, in __iter__
self._execute_query()
File "/usr/local/lib/python2.6/dist-packages/django/db/models/sql/query.py", line 87, in _execute_query
self.cursor.execute(self.sql, self.params)
File "/usr/local/lib/python2.6/dist-packages/django/db/backends/util.py", line 15, in execute
return self.cursor.execute(sql, params)
File "/usr/local/lib/python2.6/dist-packages/django/db/backends/mysql/base.py", line 86, in execute
return self.cursor.execute(query, args)
File "/usr/lib/pymodules/python2.6/MySQLdb/cursors.py", line 166, in execute
self.errorhandler(self, exc, value)
File "/usr/lib/pymodules/python2.6/MySQLdb/connections.py", line 35, in defaulterrorhandler
raise errorclass, errorvalue
DatabaseError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ')' at line 1")
I have also tried passing a tuple, but it makes no difference. Can someone help me?
Thanks
Vikram
For PostgreSQL at least, a list/tuple parameter is converted into an array in SQL, e.g.
ARRAY['US', 'CA', 'UK']
When this is inserted into the given query, it results in invalid SQL -
SELECT assigner, assignee FROM mm_code
WHERE date BETWEEN '2014-02-01' AND '2014-02-05'
AND country_code IN ARRAY['US', 'CA', 'UK']
However, the 'in' clause in SQL is logically equivalent to -
SELECT assigner, assignee FROM mm_code
WHERE date BETWEEN %s AND %s
AND country_code = ANY(%s)
... and when this query is filled with the parameters, the resulting SQL is valid and works -
SELECT assigner, assignee FROM mm_code
WHERE date BETWEEN '2014-02-01' AND '2014-02-05'
AND country_code = ANY(ARRAY['US', 'CA', 'UK'])
I'm not sure if this works in the other databases though, and whether or not this changes how the query is planned.
Casting the list to a tuple does work in Postgres, although the same code fails under sqlite3 with DatabaseError: near "?": syntax error so it seems this is backend-specific. Your line of code would become:
results = MMCode.objects.raw('select assigner, assignee from mm_code where date between %s and %s and country_code in %s',[fromdate,todate,tuple(region)])
I tested this on a clean Django 1.5.1 project with the following in bar/models.py:
from django.db import models
class MMCode(models.Model):
assigner = models.CharField(max_length=100)
assignee = models.CharField(max_length=100)
date = models.DateField()
country_code = models.CharField(max_length=2)
then at the shell:
>>> from datetime import date
>>> from bar.models import MMCode
>>>
>>> regions = ['US', 'CA', 'UK']
>>> fromdate = date.today()
>>> todate = date.today()
>>>
>>> results = MMCode.objects.raw('select id, assigner, assignee from bar_mmcode where date between %s and %s and country_code in %s',[fromdate,todate,tuple(regions)])
>>> list(results)
[]
(note that the query line is changed slightly here, to use the default table name created by Django, and to include the id column in the output so that the ORM doesn't complain)
This is not a great solution, because you must make sure your "region" values are correctly escaped for SQL. However, this is the only thing I could get to work with Sqlite:
sql = ('select assigner, assignee from mm_code '
'where date between %%s and %%s and country_code in %s' % (tuple(region),))
results = MMCode.objects.raw(sql, [fromdate,todate])
I ran into exactly this problem today. Django has changed (we now have RawSQL() and friends!), but the general solution is still the same.
According to https://stackoverflow.com/a/283801/532513 the general idea is to explicitly add the same numbers of placeholders to your SQL string as there are elements in your region array.
Your code would then look like this:
sql = 'select assigner, assignee from mm_code where date between %s and %s and country_code in ({0})'\
    .format(', '.join(['%s'] * len(region)))
results = MMCode.objects.raw(sql, [fromdate, todate] + region)
Your sql string would then first become ... between %s and %s and country_code in (%s, %s, %s) ... and your params would be effectively [fromdate, todate, 'US', 'CA', 'UK']. This way, you allow the database backend to correctly escape and potentially encode each of the country codes.
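The placeholder expansion described above can be wrapped in a small helper. This is a minimal sketch assuming the table and column names from the question; `expand_in` is a hypothetical name, and the `marker` argument is only there because some drivers (sqlite3, for instance) use ? instead of %s.

```python
def expand_in(sql_template, values, marker="%s"):
    """Fill the {0} slot of sql_template with one placeholder per value."""
    if not values:
        # "IN ()" is not valid SQL, so fail early instead of sending it.
        raise ValueError("need at least one value for the IN clause")
    return sql_template.format(", ".join([marker] * len(values)))


region = ["US", "CA", "UK"]
template = ("select assigner, assignee from mm_code "
            "where date between %s and %s and country_code in ({0})")
sql = expand_in(template, region)

# Hypothetical Django usage, parameters passed separately so the backend escapes them:
# results = MMCode.objects.raw(sql, [fromdate, todate] + region)
```

Only the number of placeholders is formatted into the string. The country codes themselves still travel as query parameters, so no manual escaping is involved.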
I'm not against raw SQL, but you can use the ORM here:
MMCode.objects.filter(country_code__in=region, date__range=[fromdate,todate])
hope this helps.