scrapy single spider to pass multiple item classes to pipeline - scrapy

I am new to Scrapy. In items.py, I declare two item classes, ItemClass1 and ItemClass2. A spider method, parseUrl, fetches the HTML, scrapes the data, and puts it into lists for the respective item classes.
e.g.:
C1Items = []
C1Item = ItemClass1()
#scrape data
C1Items.append(C1Item)
...
C2Items = []
C2Item = ItemClass2()
#scrape data
C2Items.append(C2Item)
...
Finally, C1Items and C2Items contain the required data.
return C1Items #will pass ItemClass1 data to pipeline
return C2Items #will pass ItemClass2 data to pipeline
Could you please advise on the best way to pass both C1Items and C2Items to the pipeline?

Either combine all the items of the different classes into one list and return that list, or use the yield statement:
C1Item = ItemClass1()
#scrape data
yield C1Item
...
C2Item = ItemClass2()
#scrape data
yield C2Item

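Just combine the lists into one big list and return that:
return C1Items + C2Items
or alternatively you could turn parseUrl into a generator function that yields each item in turn (yielding the lists themselves would not work, since Scrapy expects individual items or requests from a callback):
for item in C1Items + C2Items:
    yield item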
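
Both answers come down to the callback handing Scrapy one item at a time, whatever its class. The following is a minimal sketch of that idea without importing Scrapy itself. The two dataclasses stand in for ItemClass1 and ItemClass2 (recent Scrapy versions accept dataclass objects as items), the field names and the `rows` argument are made up for illustration, and `DispatchPipeline` is a hypothetical pipeline that tells the two classes apart with isinstance.

```python
from dataclasses import dataclass


@dataclass
class ItemClass1:
    # Hypothetical field, for illustration only.
    title: str


@dataclass
class ItemClass2:
    # Hypothetical field, for illustration only.
    price: float


def parse_url(rows):
    """Stand-in for the spider callback: yields items of both classes.

    `rows` is an assumed list of (title, price) tuples replacing the
    real HTML scraping step.
    """
    for title, price in rows:
        yield ItemClass1(title=title)
        yield ItemClass2(price=price)


class DispatchPipeline:
    """Hypothetical pipeline that handles each item class separately."""

    def __init__(self):
        self.titles = []
        self.prices = []

    def process_item(self, item, spider):
        # Scrapy calls process_item(item, spider) once per yielded item.
        if isinstance(item, ItemClass1):
            self.titles.append(item.title)
        elif isinstance(item, ItemClass2):
            self.prices.append(item.price)
        return item


# Simulate what Scrapy does: feed every yielded item to the pipeline.
pipeline = DispatchPipeline()
for scraped in parse_url([("a", 1.0), ("b", 2.5)]):
    pipeline.process_item(scraped, spider=None)
```

In a real project the pipeline class would also have to be listed in ITEM_PIPELINES in settings.py before Scrapy calls it.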

Related

How to input via gradio dropdowns using variable information stored in a data frame?

I'm trying to input info from an easily editable CSV file where the first row contains the variable names. I am doing this by iterating over the list of variables so each variable is a new dropdown. Using this method everything appears correctly in the interface but something is going wrong when it comes to outputting the selected values. I'm thinking I need another approach but I can't think how to do it short of generating a new gradio interface for each dropdown.
The following code hangs when you click submit:
import gradio as gr
import pandas as pd
df = pd.read_csv(file_path)
# first row of the CSV as the variable names
variables = list(df.columns)
# filter out null values
df = df.dropna()
# rest of the rows as the options for each variable
options = {var: list(df[var]) for var in variables}
def assign_values(**kwargs):
    global save_variables
    save_variables.update(kwargs)
    return kwargs
save_variables = {}
inputs = [gr.inputs.Dropdown(label=var, choices=options[var], type="value") for var in variables]
outputs = ["text"]
gr.Interface(fn=assign_values, inputs=inputs, outputs=outputs, title="Assign Values").launch()
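
One likely cause of the hang, offered as an assumption rather than a confirmed diagnosis: gr.Interface passes the component values to fn as positional arguments, one per input, so a function that only accepts **kwargs raises a TypeError when Submit is clicked. The sketch below accepts *selected instead and pairs the values with the variable names by position. The `options` dictionary is made-up sample data standing in for the CSV, and `build_interface` is a hypothetical helper that is defined but not called here. It uses the newer gr.Dropdown component rather than gr.inputs.Dropdown.

```python
# Made-up sample data standing in for the CSV columns and their values.
options = {
    "colour": ["red", "green", "blue"],
    "size": ["S", "M", "L"],
}
variables = list(options)

save_variables = {}


def assign_values(*selected):
    """Receive one positional value per dropdown, in `variables` order."""
    chosen = dict(zip(variables, selected))
    save_variables.update(chosen)
    # A single "text" output expects one string back.
    return ", ".join(f"{name}={value}" for name, value in chosen.items())


def build_interface():
    """Hypothetical wiring; requires gradio to be installed."""
    import gradio as gr

    inputs = [gr.Dropdown(choices=options[var], label=var) for var in variables]
    return gr.Interface(fn=assign_values, inputs=inputs, outputs="text",
                        title="Assign Values")


# Simulate a submit with one value per dropdown.
summary = assign_values("red", "M")
```

Calling build_interface().launch() would then start the UI with one dropdown per variable.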

How to get row data from web table in Karate framework

The web table has a combination of textboxes, spans and checkboxes. I need to get all the data of the first row in a single def and verify it against the DB in order.
Ex: The first row of the table has columns like below.
OrderID(span), EmpName(input), IsHeEligible(checkbox), Address(span)
Using the following,
def tebleFirstData = scriptAll("table/tbody/tr/td",'_.textContent')
I am able to get only the span text, not the input tag values.
I tried the following:
data = attribute("table/tbody/tr/td[5]/input",'value')
But this gets only a single input tag's attribute value.
How can I get all the data in a single def, i.e. both span data and input data?
Below is a solution to get all row data from the table:
* def UiFirstRowElements = locateAll("Row xpath")
* print UiFirstRowElements
* def tableData = []
* def RowHtml = UiFirstRowElements[0].html
* print RowHtml
* eval
"""
for (var i = 0; i < UiFirstRowElements.length; i++) {
  if (UiFirstRowElements[i].html.contains("input") && UiFirstRowElements[i].html.contains("date")) {
    tableData.push(locate("//table/tbody/tr/td[" + (i + 1) + "]/div/input").property('value'))
  }
  else if (UiFirstRowElements[i].html.contains("input") && UiFirstRowElements[i].html.contains("checkbox")) {
    tableData.push(locate("//table/tbody/tr/td[" + (i + 1) + "]/div/input").property('checked'))
  }
  else {
    tableData.push(locate("//table/tbody/tr/td[" + (i + 1) + "]").property('textContent'))
  }
}
"""
* print 'TableName-->', TableName
* print tableData

How do I loop through each DB field to see if the range is correct

I have this response in soapUI:
<pointsCriteria>
<calculatorLabel>Have you registered for inContact, signed up for marketing news from FNB/RMB Private Bank, updated your contact details and chosen to receive your statements</calculatorLabel>
<description>Be registered for inContact, allow us to communicate with you (i.e. update your marketing consent to 'Yes'), receive your statements via email and keep your contact information up to date</description>
<grades>
<points>0</points>
<value>No</value>
</grades>
<grades>
<points>1000</points>
<value>Yes</value>
</grades>
<label>Marketing consent given and Online Contact details updated in last 12 months</label>
<name>c21_mrktng_cnsnt_cntct_cmb_point</name>
</pointsCriteria>
There are many pointsCriteria elements, and I use the XQuery below to get the DB field name and the range of values that field is meant to have:
<return>
{
for $x in //pointsCriteria
return <DBRange>
<db>{data($x/name/text())}</db>
<points>{data($x//points/text())}</points>
</DBRange>
}
</return>
And I get the response below:
<return><DBRange><db>c21_mrktng_cnsnt_cntct_cmb_point</db><points>0 1000</points></DBRange>
That last bit sits in a property transfer. I need SQL to bring back all rows where that DB field is not in that points range (the field can only be 0 or 1000 in this case). My problem is that I don't know how to loop through each DBRange in this manner. Please help.
I'm not sure that I fully understand your question, but I think you want to query a specific table in your DB, using the column name given in the <db> element of your XML and the values given in the <points> element of the same XML.
So you can try using a Groovy TestStep: first parse your XML to get the column name and the points. If the point values are separated by a blank space, you can call split(" ") to get a list and then use each() to iterate over it. Then, using groovy.sql.Sql, you can run the queries against your DB.
One more thing: you need to put the JDBC driver for your DB vendor in $SOAPUI_HOME/bin/ext and then restart SoapUI so that it can load the necessary driver classes.
The following code shows the approach:
import groovy.sql.Sql
import groovy.util.XmlSlurper
// a SoapUI Groovy TestStep requires that you first register your
// DB vendor driver; as an example I use the Oracle driver...
com.eviware.soapui.support.GroovyUtils.registerJdbcDriver( "oracle.jdbc.driver.OracleDriver")
// DB connection properties (example for an Oracle database)
def db = [
    url : 'jdbc:oracle:thin:@db_host:db_port/db_name',
    username : 'yourUser',
    password : '********',
    driver : 'oracle.jdbc.driver.OracleDriver'
]
// create the db instance
def sql = Sql.newInstance("${db.url}", "${db.username}", "${db.password}","${db.driver}")
def result = '''<return>
<DBRange>
<db>c21_mrktng_cnsnt_cntct_cmb_point</db>
<points>0 1000</points>
</DBRange>
</return>'''
def resXml = new XmlSlurper().parseText(result)
// get the field
def field = resXml.DBRange.db.text()
// get the points
def points = resXml.DBRange.points.text()
// points are separated by blank space,
// so split to get an array with the points
def pointList = points.split(" ")
// for each point make your query
pointList.each {
    def sqlResult = sql.rows "select * from your_table where ${field} = ?",[it]
    log.info sqlResult
}
sql.close();
Hope this helps,
Thanks again for your help, @albciff. I had to put this into a multidimensional array (I renamed field to column, and result is the large return from the XQuery above):
def resXml = new XmlSlurper().parseText(result)
// get the columns and points ranges
def Column = resXml.DBRange.db*.text()
def Points = resXml.DBRange.points*.text()
// pair each column with its points range (index per index)
count = 0
bigList = Column.collect {
    [it, Points[count++]]
}
// iterate through the pairs
bigList.each {
    // split each pair into two variables to keep the SQL part readable
    def column = it[0]
    def points = it[1]
    // further split the points to test each one
    pointList = points.split(" ")
    pointList.each {
        // test each point value per column
        def sqlResult = sql.rows "select * from my_table where ${column} <> ?",[it]
        log.info sqlResult
    }
}
sql.close();
return;
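
The original question asked for rows where the field is not in its allowed range, for every DBRange in the response. The looping and query-building logic can be sketched in Python (not Groovy, so this illustrates the approach rather than being a drop-in TestStep). The table name `your_table`, the second DBRange entry, and the helper `build_out_of_range_queries` are assumptions. Column names cannot be bound as parameters, so they are checked against a simple identifier pattern before being placed in the SQL text.

```python
import re
import xml.etree.ElementTree as ET

# Sample XQuery result; the second DBRange is invented for illustration.
RESULT = """<return>
  <DBRange><db>c21_mrktng_cnsnt_cntct_cmb_point</db><points>0 1000</points></DBRange>
  <DBRange><db>some_other_point</db><points>0 500 1500</points></DBRange>
</return>"""

IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_out_of_range_queries(xml_text, table="your_table"):
    """Return one (sql, params) pair per DBRange element."""
    queries = []
    for db_range in ET.fromstring(xml_text).iter("DBRange"):
        column = db_range.findtext("db").strip()
        if not IDENTIFIER.match(column):
            raise ValueError(f"unexpected column name: {column!r}")
        # Points are separated by whitespace, e.g. "0 1000".
        points = db_range.findtext("points").split()
        placeholders = ", ".join("?" for _ in points)
        sql = f"SELECT * FROM {table} WHERE {column} NOT IN ({placeholders})"
        queries.append((sql, points))
    return queries


queries = build_out_of_range_queries(RESULT)
```

Each (sql, params) pair can then be handed to whatever DB client is in use, provided it accepts ? placeholders. A single NOT IN query per column avoids running one query per point value.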

Groovy dynamic method invocation with nested function

I need to evaluate a string with nested function calls. Is there an easy way to do this with groovy?
Edit: Code made more realistic. The context is nonacademic; my function needs to combine and evaluate a bunch of arbitrary strings and values from json files.
JSON file 1 will have strings like:
"biggerThan(isList,0)"
"smallerThan(isList,3)"
"biggerThan(isList,1)"
JSON file 2 will have values like
[4,1]
[1,2,1]
[1,5,6,2,98]
[]
def biggerThan = { func, val ->
    { v -> return func(v) && (v.size() > val) }
}
def isList = { n ->
    return n instanceof List
}
def a = biggerThan(isList, 1)
a([4,1])
// -> returns true in groovy console because [4,1] is a list with size>1
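
No answer was posted for this one. As an illustration of one possible approach, sketched in Python rather than Groovy: parse the string into a syntax tree and evaluate only calls to a whitelist of known functions, instead of handing arbitrary text to an eval. The snake_case functions mirror biggerThan, smallerThan and isList from the question. `smaller_than`, `evaluate` and `compile_rule` are assumptions, since the question does not define them.

```python
import ast


def is_list(value):
    return isinstance(value, list)


def bigger_than(func, limit):
    return lambda v: func(v) and len(v) > limit


def smaller_than(func, limit):
    # Assumed counterpart of biggerThan; not defined in the question.
    return lambda v: func(v) and len(v) < limit


# Only these names may appear in an expression string.
ALLOWED = {"isList": is_list, "biggerThan": bigger_than, "smallerThan": smaller_than}


def evaluate(node):
    """Recursively evaluate a whitelisted call, name, or constant node."""
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        func = ALLOWED[node.func.id]
        return func(*[evaluate(arg) for arg in node.args])
    if isinstance(node, ast.Name):
        return ALLOWED[node.id]
    if isinstance(node, ast.Constant):
        return node.value
    raise ValueError(f"unsupported expression: {ast.dump(node)}")


def compile_rule(text):
    """Turn a string such as "biggerThan(isList,1)" into a predicate."""
    return evaluate(ast.parse(text, mode="eval").body)


rule = compile_rule("biggerThan(isList,1)")
results = [rule(values) for values in ([4, 1], [1, 2, 1], [])]
```

Because arguments are evaluated recursively, nested strings such as "biggerThan(smallerThan(isList,3),0)" go through the same path. Any name outside ALLOWED raises a KeyError instead of being executed.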

Add extra field in Django QuerySet as timedelta type

I have the following model:
class UptimeManager(models.Manager):
    def with_length(self):
        """Get querySet of uptimes sorted by length including the current one. """
        extra_length = Uptime.objects.extra(select={'length':
            """
            SELECT
                IF (end is null,
                    timestampdiff(second,begin,now()),
                    timestampdiff(second,begin,end))
            FROM content_uptime c
            WHERE content_uptime.id = c.id
            """
        })
        return extra_length

class Uptime(models.Model):
    begin = models.DateTimeField('beginning')
    end = models.DateTimeField('end', null=True)
    host = models.ForeignKey("Host")
    objects = UptimeManager()
    ...
Then I call Uptime.objects.with_length().order_by('-length')[:10] to get the list of the longest uptimes.
But the length in the template is an integer. How can I modify my code so that the length of the objects returned by the manager is accessible in the template as a timedelta object?
I could almost do it by returning a list and converting the number of seconds to timedelta objects, but then I would have to do the sorting, filtering, etc. in my Python code, which is rather inefficient compared to one well-written SQL query.
Add a property to the model that looks at the actual field and converts it to the appropriate type.
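
A minimal sketch of that suggestion, kept free of Django imports so it runs on its own: the value selected through extra() is set on each instance as the attribute `length`, and a property converts it on access. `LengthMixin`, `length_delta`, and `FakeUptime` are hypothetical names. In the real project the property would sit directly on the Uptime model.

```python
import datetime


class LengthMixin:
    """Hypothetical mixin exposing an integer `length` as a timedelta."""

    @property
    def length_delta(self):
        # `length` is only present on rows fetched via with_length().
        seconds = getattr(self, "length", None)
        if seconds is None:
            return None
        return datetime.timedelta(seconds=int(seconds))


class FakeUptime(LengthMixin):
    """Stand-in for the Uptime model, used only to exercise the property."""

    def __init__(self, length=None):
        if length is not None:
            self.length = length


delta = FakeUptime(length=3725).length_delta
```

In a template this would be read as {{ uptime.length_delta }}, while ordering by '-length' still happens in SQL.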
My solution is to create a filter that checks the type of the length variable and returns a timedelta if it is an integer:
from django import template
import datetime
register = template.Library()
def timedelta(value):
    # integers are treated as a number of seconds
    # (on Python 2, check against (int, long) instead)
    if isinstance(value, int):
        return datetime.timedelta(seconds=value)
    elif isinstance(value, datetime.timedelta):
        return value
    else:
        raise TypeError("expected int or timedelta, got %r" % type(value))
register.filter('timedelta',timedelta)
Using it in a template is then trivial:
{{ uptime.length|timedelta }}