Can I use start_urls to scrape a URL list? - scrapy

I have a list of URLs that I want to scrape data from. The list comes from a database that I want to update, but I'm unsure how to proceed.
import scrapy
import sqlite3
from datetime import datetime, timedelta


class A1hrlaterSpider(scrapy.Spider):
    name = 'onehrlater'
    allowed_domains = ['donedeal.ie']

    timenow = datetime.now()
    delta = timedelta(minutes=0)
    delta2 = timedelta(minutes=1)
    past_time = timenow - delta
    past_time2 = timenow - delta2

    conn = sqlite3.connect('ddother.db')
    c = conn.cursor()
    c.execute("SELECT adUrl FROM database WHERE timestamp BETWEEN ? AND ?", (past_time2, past_time))
    all_urls = c.fetchall()
    urllist = [item[0] for item in all_urls]
    print(urllist)
    conn.commit()
    conn.close()
urllist is the list of URLs I want to scrape, but I'm not sure how to use start_urls to follow the links, or whether this is the right way to go about it. Can I say start_urls = urllist, or is this wrong?
Any help would be appreciated. Thanks
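Yes: start_urls can be any iterable of URL strings, so start_urls = urllist works, since the list is built while the class body executes. That said, a common and cleaner pattern is to override start_requests() so the query runs when the crawl actually starts. A minimal sketch (untested; the database file, table, and column names are taken from the question):

import sqlite3
from datetime import datetime, timedelta

import scrapy


class UrlListSpider(scrapy.Spider):
    name = 'urllistspider'
    allowed_domains = ['donedeal.ie']

    def start_requests(self):
        # Query the URLs at crawl time rather than at class-definition time.
        now = datetime.now()
        conn = sqlite3.connect('ddother.db')
        c = conn.cursor()
        c.execute("SELECT adUrl FROM database WHERE timestamp BETWEEN ? AND ?",
                  (now - timedelta(minutes=1), now))
        urls = [row[0] for row in c.fetchall()]
        conn.close()
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Extract whatever fields you need from each ad page here.
        pass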

Related

selenium webdriver send keys pycharm

I have data in an Excel sheet: the first column has a number and the second column has text. My program works with the text but not with the numbers.
import xlrd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

PATH = "C:/Program Files (x86)/chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://www.youtube.com")
print(driver.title)
search = driver.find_element_by_name("search_query")

workbook = xlrd.open_workbook("mohammed2.xls")
sheet = workbook.sheet_by_name("sheet3")
rowCount = sheet.nrows
colCount = sheet.ncols
print(rowCount)
print(colCount)

for curr_row in range(1, rowCount):
    numpValue = sheet.cell_value(curr_row, 0)
    # name = sheet.cell_value(curr_row, 1)
    search.send_keys(numpValue)
    time.sleep(3)
    search.send_keys(Keys.RETURN)
    search.clear()
    time.sleep(3)
    search.clear()
    search.send_keys(str(numpValue))
It seems send_keys doesn't accept a float, and the value from the number column comes back in float format.
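If so, converting the value to text before typing should fix it. xlrd returns numeric cells as Python floats (so 5 is read as 5.0); here is a small sketch of a helper (cell_to_text is a made-up name) that drops the trailing .0 for whole numbers:

def cell_to_text(value):
    # xlrd reads numeric cells as floats, so 5 becomes 5.0;
    # cast whole numbers back to int before converting to text.
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)

print(cell_to_text(5.0))    # "5"
print(cell_to_text(3.14))   # "3.14"
print(cell_to_text("abc"))  # "abc"

In the loop, search.send_keys(cell_to_text(numpValue)) would then type the digits without the .0.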

How to wait 30 seconds after 20 requests (Selenium scraping)

Hello, I have a CSV file with 300 rows.
After 10 requests, the website stops giving me results.
How can I pause my script for 3 minutes after every 10 requests?
Thank you.
My code:
import csv
import time

import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

societelist = []

with open('1.csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        browser = webdriver.Firefox(options=options)
        browser.get("myurl".format(row[0]))
        time.sleep(20)
        try:
            societe = browser.find_element_by_xpath('/html/body/div[3]/div[2]/div/div[1]/div[2]/div[1]/span[2]').text
        except NoSuchElementException:
            societe = 'Element not found'
        societelist.append(societe)
        print(row[0])
        browser.quit()

df = pd.DataFrame(list(zip(societelist)), columns=['societe'])
df.to_csv('X7878.csv', index=False)
Use:
import csv

societelist = []

with open('1.csv') as csvfile:
    reader = csv.reader(csvfile)
    for i, row in enumerate(reader):  # i gives the index of the row
        browser = webdriver.Firefox(options=options)
        browser.get("myurl".format(row[0]))
        time.sleep(20)
        try:
            societe = browser.find_element_by_xpath('/html/body/div[3]/div[2]/div/div[1]/div[2]/div[1]/span[2]').text
        except NoSuchElementException:
            societe = 'Element not found'
        societelist.append(societe)
        print(row[0])
        browser.quit()
        if not ((i + 1) % 10):  # after every 10th request, pause for 3 minutes
            time.sleep(180)

df = pd.DataFrame(list(zip(societelist)), columns=['societe'])
df.to_csv('X7878.csv', index=False)
Alternative solution: write each line to Excel as it is scraped, instead of writing everything at once at the end.
import csv
import win32com.client as win32

# Launch Excel
excel = win32.Dispatch('Excel.Application')
excel.Visible = 1
wb = excel.Workbooks.Add()
ws = wb.Sheets(1)

# Read the csv and scrape the webpage
with open('1.csv') as csvfile:
    reader = csv.reader(csvfile)
    for i, row in enumerate(reader):  # i gives the index of the row
        browser = webdriver.Firefox(options=options)
        browser.get("myurl".format(row[0]))
        time.sleep(20)
        try:
            societe = browser.find_element_by_xpath('/html/body/div[3]/div[2]/div/div[1]/div[2]/div[1]/span[2]').text
        except NoSuchElementException:
            societe = 'Element not found'
        # It may make sense to write the input text and the scraped value side by side.
        ws.Cells(i + 1, 1).Value = row[0]
        ws.Cells(i + 1, 2).Value = societe
        print(row[0], societe)
        browser.quit()
        if not ((i + 1) % 10):
            time.sleep(180)

# If you want to save the file programmatically and close Excel:
path = r'C:\Users\jarodfrance\Documents\X7878.xlsx'
wb.SaveAs(path)
wb.Close()
excel.Quit()

isnull() and dropna() not working for pandas 0.22 when using xlwings to get dataframe

I'm desperate about this mystery. I just upgraded pandas from 0.18 to 0.22 and, mysteriously, when using xlwings, dropna() and isnull() no longer work. myTemp still gives me the correct True and False values, yet unwindDF gives me all of df_raw with everything filled in as NaN and NaT. noPx has a similar issue. This happens even if I manually assign np.nan to a cell. Yet, surprisingly, when I create a simple DataFrame towards the end of the same file, myTest1 works fine. Why? Is there something special about xlwings with pandas 0.22?
My code is below and my xlsx file is in the image.
import pythoncom
import pandas as pd
import xlwings as xw
import numpy as np

folder_path = 'S:/Order/all PNL files/'
excel_name = 'pnlTest.xlsx'
pnl_excel_path = folder_path + excel_name
sheetName = 'Sheet1'

pythoncom.CoInitialize()
app = None
bk = None
app_count = xw.apps.count
for i in range(app_count):
    try:
        app = xw.apps[i]
        temp = app.books[excel_name]
        bk = temp
        print()
        print("Using Opened File")
    except:
        print()
if bk == None:
    print("Open New Excel App")
    app = xw.App()
    bk = xw.Book(pnl_excel_path)
    bk.app.calculation = 'manual'
    bk.app.screen_updating = False

sht = bk.sheets[sheetName]
last_row_index = sht.range('A1').end('down').row
df_raw = sht.range('A1:M' + str(last_row_index)).options(pd.DataFrame, header=1, index=0).value

myTemp = df_raw['UNWD_DT'].isnull()
unwindDF = df_raw[df_raw['UNWD_DT'].isnull()]

df_raw.loc[10, 'Curr_Px'] = np.nan
df_raw.iloc[10, 11] = np.nan
noPx = df_raw[df_raw['Curr_Px'].isnull()]

df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1], 'c': [np.nan, 1, 0, np.nan]})
myTemp1 = df['c'].isnull()
myTest1 = df[df['c'].isnull()]

df_raw.dropna(thresh=2, inplace=True)
df_raw2 = df_raw.dropna(thresh=2)
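One thing worth checking (a diagnostic sketch, not a confirmed fix, reusing df_raw from the code above): whether the xlwings-built frame contains real NaN/NaT values, or strings that merely print like them. If a column arrives with dtype object holding strings, isnull() is False everywhere even though the cells look empty in Excel; coercing with pd.to_datetime turns empty or invalid entries into real NaT values, after which isnull() and dropna() behave as expected:

# Inspect what xlwings actually handed back.
print(df_raw.dtypes)
print(df_raw['UNWD_DT'].head(10).tolist())
print(df_raw['UNWD_DT'].apply(type).value_counts())

# Coerce to real datetimes: empty/invalid cells become true NaT.
df_raw['UNWD_DT'] = pd.to_datetime(df_raw['UNWD_DT'], errors='coerce')
print(df_raw['UNWD_DT'].isnull().sum())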

Replace before save to CSV

I'm using Scrapy's CSV export, but sometimes the content I'm scraping contains quotes and commas which I don't want.
How can I replace those characters with an empty string '' before outputting to CSV?
Here's my CSV, containing the unwanted characters in the strTitle column:
strTitle,strLink,strPrice,strPicture
"TOYWATCH 'Metallic Stones' Bracelet Watch, 35mm",http://shop.nordstrom.com/s/toywatch-metallic-stones-bracelet-watch-35mm/3662824?origin=category,0,http://g.nordstromimage.com/imagegallery/store/product/Medium/11/_8412991.jpg
Here's my code, which errors on the replace line:
def parse(self, response):
    hxs = Selector(response)
    titles = hxs.xpath("//div[@class='fashion-item']")
    items = []
    for titles in titles[:1]:
        item = watch2Item()
        item["strTitle"] = titles.xpath(".//a[@class='title']/text()").extract()
        item["strTitle"] = item["strTitle"].replace("'", '').replace(",", '')
        item["strLink"] = urlparse.urljoin(response.url, titles.xpath("div[2]/a[1]/@href").extract()[0])
        item["strPrice"] = "0"
        item["strPicture"] = titles.xpath(".//img/@data-original").extract()
        items.append(item)
    return items
EDIT
Try adding this line before the replace; extract() returns a list, and replace() only works on strings, so join the list into a single string first:
item["strTitle"] = ''.join(item["strTitle"])
Once strTitle is a plain string, the replacements behave as expected:
strTitle = "TOYWATCH 'Metallic Stones' Bracelet Watch, 35mm"
strTitle = strTitle.replace("'", '').replace(",", '')
strTitle == "TOYWATCH Metallic Stones Bracelet Watch 35mm"
In the end the solution was:
item["strTitle"] = [titles.xpath(".//a[@class='title']/text()").extract()[0].replace("'", '').replace(",", '')]

vb.Net code to use AX 2009 ReturnOrderInService web service

I need to use the create method of the AX 2009 ReturnOrderInService web service in a VB.NET aspx page to create an RMA in AX.
The code I've written below creates the RMA in AX, but doesn't show the line details in the AX RMA form, even though the records are in SalesTable and SalesLine.
Is a record needed in InventTrans, or is there a missing InventRefId value somewhere?
Dim rmaClient As ReturnOrderInServiceClient = New ReturnOrderInServiceClient("WSHttpBinding_ReturnOrderInService1")
Dim roi As AxdReturnOrderIn = New AxdReturnOrderIn

Dim st As AxdEntity_SalesTable = New AxdEntity_SalesTable
st.CustAccount = "123"
st.ReturnReasonCodeId = "RRC1"
st.DlvMode = "01"
st.SalesType = 4 ' return item
st.ReturnDeadline = DateAdd(DateInterval.Day, 15, Now())

Dim sl As AxdEntity_SalesLine = New AxdEntity_SalesLine
sl.ItemId = "ITEM 123"
sl.ExpectedRetQty = -2
sl.LineAmount = 0
sl.InventTransIdReturn = ""

st.SalesLine = New AxdEntity_SalesLine() {sl}
roi.SalesTable = New AxdEntity_SalesTable() {st}

txtFeedback.Text = ""
Try
    Dim returnedSalesOrderEntityKey As EntityKey() = rmaClient.create(roi)
    Dim returnedSalesOrder As EntityKey = CType(returnedSalesOrderEntityKey.GetValue(0), EntityKey)
    txtFeedback.Text = GetRMANo(returnedSalesOrder.KeyData(0).Value)
Catch ex As Exception
    txtFeedback.Text = ex.Message
End Try
rmaClient.Close()
Did you generate the proxy classes as stated in http://msdn.microsoft.com/en-us/library/cc652581(v=ax.50).aspx?
This should create the AxdEntity classes needed.
First I would try to translate the example to VB. I cannot help you with the specific syntax, but there is nothing fancy here, so it should be rather simple.
Regarding the use of web services in AX, see also:
http://www.axaptapedia.com/Webservice
http://blogs.msdn.com/b/aif/archive/2011/06/15/microsoft-dynamics-ax-2012-services-and-aif-white-papers.aspx
http://blogs.msdn.com/b/aif/archive/2007/11/28/consuming-the-customer-service-from-a-c-client.aspx (last part)
http://channel9.msdn.com/blogs/sanjayjain/microsoft-dynamics-ax-2009-aif-web-services-screencast