Scrapy output only the last incrementally updated item

Can someone help me with this, please? I have been searching for this information for two days with no luck.
I have an item with one field that holds a list of other items. The spider works fine, but the output file contains one line for every incremental update of this item instead of only the final, complete one.
For example, I need the JSON to be printed as:
{"id": "AAAA", "details": [
    {"date": "2013-01-10", "type": "A"},
    {"date": "2013-02-10", "type": "B"},
    {"date": "2013-03-10", "type": "C"},
    {"date": "2013-04-10", "type": "D"}]}
but I get:
{"id": "AAAA", "details": [
    {"date": "2013-01-10", "type": "A"}]}
{"id": "AAAA", "details": [
    {"date": "2013-01-10", "type": "A"},
    {"date": "2013-02-10", "type": "B"}]}
{"id": "AAAA", "details": [
    {"date": "2013-01-10", "type": "A"},
    {"date": "2013-02-10", "type": "B"},
    {"date": "2013-03-10", "type": "C"}]}
{"id": "AAAA", "details": [
    {"date": "2013-01-10", "type": "A"},
    {"date": "2013-02-10", "type": "B"},
    {"date": "2013-03-10", "type": "C"},
    {"date": "2013-04-10", "type": "D"}]}
I use a function to update my parent item:
def rePackIt(parent, item):
    if 'details' in parent:
        items = parent.get('details')
    else:
        items = []
    items.append(dict(item))
    parent['details'] = items
    return parent
In the parse function I do:
parent = ParentItem()
parent['id'] = self.param  # actually I parse a text file with many IDs
parent['details'] = []
yield FormRequest.from_response(response,
    formname='...',
    formdata={'...': '...', '...': parent['id'],
              '...': ''},
    meta={'parent': parent, 'dont_merge_cookies': True},
    callback=self.parse1)
def parse1(self, response):
    parent = response.meta['parent']
    sel = HtmlXPathSelector(response)
    records = sel.select('//ul[@class="...."]')
    for record in records:
        item = DetailItem()
        item['type'] = record.select('child...')
        doc_link = record.select('child.../a/@href').extract()
        yield Request(doc_link,
            callback=self.parse2,
            method='GET',
            headers={...},
            meta={'dont_merge_cookies': True, 'cookiejar': cookieJar, 'item': item, 'parent': parent}
        )
def parse2(self, response):
    item = response.meta['item']
    parent = response.meta['parent']
    sel = HtmlXPathSelector(response)
    # some other parsing code
    item['date'] = cell.select('span[1]/text()[1]').extract()
    rePackIt(parent, item)
    return parent

The page you are trying to scrape and output as JSON has this structure:
Main Item 1 {some information}
    Detail Item 1
    Detail Item 2
Main Item 2
    Detail Item 1
    Detail Item 2
You are returning the parent object once for each detail item scraped, while your intention is to return it only once, after it is "complete", meaning it has been populated with all of its detail items 1..n. The problem is that you have no clean way to tell when you have finished building the parent item.
One way to handle this is to write an item pipeline (http://doc.scrapy.org/en/latest/topics/item-pipeline.html). This might sound complicated, but it is not.
Basically, there are three steps in the pipeline:
open_spider
    Create your global object, of the form
    itemlist = []
process_item
    if the item is a parent:
        add the item to the list
    if the item is a child:
        find the parent item in itemlist
        parentitem["details"].append(childitem)
close_spider
    Serialise the collected items to JSON and write them to the desired file. One caveat: if you are scraping a large amount of data, every scraped item stays in memory until you write the file in this method, because you can no longer stream the JSON items as they arrive.
Let me know whether this works, or whether you find a better solution.
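The three steps above can be sketched as a small pipeline class. This is a minimal sketch under assumptions, not the asker's code: items are treated as dict-like, child items are assumed to carry a hypothetical parent_id field naming their parent, and the output path is a hypothetical constructor argument. Scrapy pipelines are plain classes, so nothing from Scrapy needs to be imported here; the method signatures follow the classic open_spider / process_item / close_spider form.

```python
import json


class GroupDetailsPipeline:
    """Collect detail items under their parent and write each parent once."""

    def __init__(self, path="items.json"):
        # `path` is a hypothetical output location, not a Scrapy setting.
        self.path = path

    def open_spider(self, spider):
        # Parent records keyed by id, kept in memory for the whole crawl.
        self.parents = {}

    def _parent(self, parent_id):
        # Fetch the parent record, creating an empty one on first sight.
        return self.parents.setdefault(parent_id, {"id": parent_id, "details": []})

    def process_item(self, item, spider):
        data = dict(item)
        if "parent_id" in data:
            # Child item: attach it to its parent (assumed field name).
            parent_id = data.pop("parent_id")
            self._parent(parent_id)["details"].append(data)
        else:
            # Parent item: register it, keeping details that arrived earlier.
            extra = {k: v for k, v in data.items() if k != "details"}
            self._parent(data["id"]).update(extra)
        return item

    def close_spider(self, spider):
        # Every parent is complete now, so serialise each one exactly once.
        with open(self.path, "w", encoding="utf-8") as fh:
            for parent in self.parents.values():
                fh.write(json.dumps(parent) + "\n")
```

With this approach the spider would yield the parent and each detail item separately instead of repacking and returning the parent from every callback, and the class would be enabled through the ITEM_PIPELINES setting.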

Related

Create a JSON Object based on input parameter in MuleSoft DataWeave 2.0

So I have a REST API which can take three input params: START, STOP, RESTART. START and STOP are distinct operations for the REST API, but RESTART essentially means STOP followed by START. Hence I want to create a dynamic JSON of either one node or two nodes based on the operation chosen. For example, for START/STOP the JSON will be:
{ "appID": "1234",
"operation": "START"}
OR
{ "appID": "1234",
"operation": "STOP"}
While for RESTART it would be:
[{ "appID": "1234","operation": "STOP"},{ "appID": "1234","operation": "START"}]
I can then loop through this array and call my API once or twice. However, I am at a loss as to how to create this JSON dynamically in DataWeave based on the operation param passed as an input to the REST API call. I have tried to create a variable with two JSON nodes and then loop over it, but that doesn't seem to work.
I tried something like this:
var count = 0
var appID = "1234567890"
var op = "START"
---
(operation map ((item, index) -> {
    "appID": appID,
    "operation": if (op == 'START' and index == 0) "START"
        else if (op == 'STOP' and index == 0) "STOP"
        else if (op == 'RESTART' and index == 0) "STOP"
        else if (op == 'RESTART' and index == 1) "START"
        else ''
})) [ 0 to operation.totalcount - count ]
where the value of count is either 0 or 1 based on the operation

If I understand correctly: if the input operation is "RESTART", the script should return an array with a stop element followed by a start element; if it is "START", just a start element; and if it is "STOP", just a stop element. I assume the output is always an array.
In this script the variable op represents the input. If op is not "RESTART", I return it as the operation. If it is, I return "STOP" first and append a second element for "START".
%dw 2.0
output application/json
var appID = "1234567890"
var op = "RESTART"
---
[
    {
        "appID": appID,
        "operation": if (op != 'RESTART') op
            else "STOP"
    }
] ++ if (op == 'RESTART') [{
    "appID": appID,
    "operation": "START"
}] else []
Output:
[
{
"appID": "1234567890",
"operation": "STOP"
},
{
"appID": "1234567890",
"operation": "START"
}
]
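
For readers who want to check the branching outside Mule, the same logic can be written as a small Python function. This is only an illustrative translation, not DataWeave; the function name expand_operation is hypothetical.

```python
def expand_operation(app_id, op):
    """Return the list of API calls implied by one requested operation."""
    if op == "RESTART":
        # RESTART is STOP followed by START, in that order.
        steps = ["STOP", "START"]
    else:
        # START and STOP map to a single call each.
        steps = [op]
    return [{"appID": app_id, "operation": step} for step in steps]


print(expand_operation("1234567890", "RESTART"))
```

The result is always a list, so the caller can loop over it and issue one or two API calls without special-casing RESTART.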

How to validate Nested JSON Response

I am facing an issue while validating a nested JSON response in API testing using the Karate framework.
JSON Response:
{
  "Feed": [
    {
      "item_type": "Cake",
      "title": "Birthday Cake",
      "Services": [
        {
          "id": "1",
          "name": {
            "first_name": "Rahul",
            "last_name": "Goyal"
          }
        },
        {
          "id": "2",
          "name": {
            "first_name": "Hitendra",
            "last_name": "garg"
          }
        }
      ]
    },
    {
      "item_type": "Cycle",
      "title": "used by"
    },
    {
      "item_type": "College",
      "dept": [
        {"branch": "EC"},
        {"branch": "CSE"},
        {"branch": "CIVIL"}
      ]
    }
  ]
}
Now I need to validate the response based on item type. As we can see, the nested JSON is different for each item_type.
I have tried the solution below.
Schema design for item_type Cake:
def Feed_Cake_Service_name = { first_name: '#string', last_name: '#string' }
def Feed_Cake_Services = { id: '#string', name: '#(Feed_Cake_Service_name)' }
def Feed_Cake = { item_type: '#string', title: '#string', Services: '#[] Feed_Cake_Services' }
def Feed_Cake_Response = { Feed: '#[] Feed_Cake' }
Schema design for item_type Cycle:
def Feed_Cycle = { item_type: '#string', title: '#string' }
Schema design for item_type College:
def Feed_College_Dept_Branch = { branch: '#string' }
def Feed_College = { item_type: '#string', dept: '#[] Feed_College_Dept_Branch' }
Now, if I want to verify only item type Cake, I write a match like this:
match response contains Feed_Cake_Response
But my test case fails, because the match is compared against every item type.
So I have two questions:
1.) How can we compare against the schema of one particular item type?
2.) How can we include all item types in one match expression, since any item type can appear in the JSON response and I want to validate all of them?
Thanks

I'll just give you one hint. For the rest, read the documentation please:
* def item = { item_type: '#string', title: '##string', dept: '##[]', Services: '##[]' }
* match each response == item
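
As a rough analogy for what the hint checks, the Python sketch below treats a single-# marker as a required field and a double-## marker as a field that may be absent or null. It is not how Karate is implemented; ITEM_SCHEMA and matches_schema are hypothetical names used only for illustration.

```python
# Hypothetical schema: field name -> (expected type, required?)
ITEM_SCHEMA = {
    "item_type": (str, True),    # like '#string'
    "title": (str, False),       # like '##string'
    "dept": (list, False),       # like '##[]'
    "Services": (list, False),   # like '##[]'
}


def matches_schema(item, schema=ITEM_SCHEMA):
    """Check one item: no unknown keys, required keys present, types correct."""
    if any(key not in schema for key in item):
        return False
    for key, (expected_type, required) in schema.items():
        value = item.get(key)
        if value is None:
            # Absent or null is acceptable only for optional fields.
            if required:
                return False
            continue
        if not isinstance(value, expected_type):
            return False
    return True


feed = [
    {"item_type": "Cake", "title": "Birthday Cake", "Services": []},
    {"item_type": "Cycle", "title": "used by"},
    {"item_type": "College", "dept": [{"branch": "EC"}]},
]
print(all(matches_schema(item) for item in feed))
```

One schema with optional fields covers every item type at once, which is the idea behind answering both questions with a single match.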

Karate API framework - Validate randomly displayed items in response

I am using the Karate API framework for API automation and came across the following scenario: when I make a POST call, it returns a JSON response in which a few of the items have tags while others show tags as blank. To get all the tags, I use this line in the feature file:
* def getTags = get response.items[*].resource.tags
It is giving me response as
[
[
],
[
],
[
{
"tags" : "Entertainment"
}
],
[
],
[
{
"tags" : "Family"
}
],
As you can see, out of 5 or 6 tags only 2 have a value, so I want to check whether any tag value is present. What would be the logic for the assertion, considering that the tags can all come back empty and sometimes come with a string value? In the case above, the values are "Family" and "Entertainment".
Thanks in advance !

* match each response.items[*].resource.tags == "##string"
This will validate that tags either doesn't exist or is a string.

I think you can use a second variable to strip out the empties, or maybe your original JsonPath should use .., you can experiment:
* def allowed = ['Music', 'Entertainment', 'Documentaries', 'Family']
* def response =
"""
[
[
],
[
],
[
{
"tags":"Entertainment"
}
],
[
],
[
{
"tags":"Family"
}
]
]
"""
* def temp = get response..tags
* print temp
* match each temp == "#? allowed.contains(_)"
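
The same idea, stripping out the empty groups and then checking each remaining tag against an allowed list, can be sketched in Python. This is an illustration of the logic only, not Karate; collect_tags is a hypothetical helper.

```python
ALLOWED = {"Music", "Entertainment", "Documentaries", "Family"}


def collect_tags(groups):
    """Flatten a list of lists of {"tags": ...} objects, skipping empty lists."""
    return [entry["tags"] for group in groups for entry in group if "tags" in entry]


groups = [[], [], [{"tags": "Entertainment"}], [], [{"tags": "Family"}]]
tags = collect_tags(groups)
print(tags)

# Passes when every tag that is present belongs to the allowed set.
assert all(tag in ALLOWED for tag in tags)
```

Because the empty groups contribute nothing, the check passes whether all tags are blank or some carry a value.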

how to query embedded document using mongodb

Need help constructing this mongo query.
So far I can query on the first level, but I am unable to do so at the next embedded level ("labels" > "2").
For example, the document structure looks like this:
> db.versions_20170420.findOne();
{
    "_id" : ObjectId("54bf146b77ac503bbf0f0130"),
    "account" : "foo",
    "labels" : {
        "1" : {
            "name" : "one",
            "color" : "color1"
        },
        "2" : {
            "name" : "two",
            "color" : "color2"
        },
        "3" : {
            "name" : "three",
            "color" : "color3"
        }
    },
    "profile" : "bar",
    "version" : NumberLong("201412192106")
}
With this query I can filter at the first level (account, profile):
db.profile_versions_20170420.find({"account":"foo", "profile": "bar"}).pretty()
However, given this structure, I'm looking for documents where the "label" > "2". It doesn't look like "2" is a number, but a string. Is there a way to construct the mongo query to do that? Do I need to do some conversion?

If I understand you and your data structure correctly, "label" > "2" means that the labels object must contain the property labels.3, which is easy to check with the following code:
db.profile_versions_20170420.find(
{"account": "foo", "profile": "bar", "labels.3": {$exists: true}}
).pretty();
But this does not guarantee that the object contains at least 3 properties. The $size operator counts the elements of an array, and we cannot use it here because labels is an object, not an array. So in this case we only know that labels has a property 3, even if that is the only property labels contains.
You can improve find criteria:
db.profile_versions_20170420.find({
"account": "foo",
"profile": "bar",
"labels.1": {$exists: true},
"labels.2": {$exists: true},
"labels.3": {$exists: true}
}).pretty();
and ensure that labels contains elements 1, 2 and 3, but in this case you have to maintain the object structure at the application level whenever you insert, update or delete data in the document.
As another option, you can update your database to add an extra field labelsCount, after which you will be able to run a query like this:
db.profile_versions_20170420.find(
    {"account": "foo", "profile": "bar", "labelsCount": {$gt: 2}}
).pretty();
By the way, it will also work faster...
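
If the list of required label keys varies, the $exists filter can be built programmatically. The sketch below only constructs the filter document as a Python dict; labels_filter is a hypothetical helper, and the result is assumed to be passed to a driver call such as pymongo's collection.find().

```python
def labels_filter(account, profile, required_labels):
    """Build a find() filter requiring every listed key to exist under labels."""
    query = {"account": account, "profile": profile}
    for key in required_labels:
        # Dot notation reaches into the embedded labels document.
        query[f"labels.{key}"] = {"$exists": True}
    return query


print(labels_filter("foo", "bar", ["1", "2", "3"]))
```

Keeping the keys as strings matches the stored document, where "1", "2" and "3" are field names rather than numbers.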

Mule:Dataweave Iteration not working

I am trying to take output from Salesforce and transform it to JSON. Here is my code:
%dw 1.0
%output application/json
---
payload map {
    headerandlines: {
        id : $.Id,
        agreementLineID : $.LineItems__r.Id,
        netPrice : $.LineItems__r.Price__c,
        volume : $.Volume__c,
        name : $.Name,
        StartDate : $.Start_Date__c,
        EndDate : $.End_Date__c,
        poField : $.PO_Field__c,
        ConsoleNumber : $.Console_Number__c,
        Term : $.Term__c,
        ownerID : $.OwnerId,
        Unit : $.Unit__c,
        siteNumber : $.Site_Num__c,
        customerNumber : $.Customer_Num__c
    }
}
The input payload looks like this; it is a collection of objects. Somehow, after the transformation, only the first object is sent and the rest are lost.
[
    {
        "id": "DA0YAAW",
        "LineID": [
            "jGEAU",
            "jBEAU",
            "j6EAE"
        ],
        "Price": [
            "50000.0",
            "12000.0",
            "45000.0"
        ],
        "netPrice": null,
        "volume": null,
        "name": " Test 2.24",
        "StartDate": "2017-02-17",
        "EndDate": "2018-02-17",
        "poField": "123456",
        "ConsoleNumber": "8888888",
        "PaymentTerm": "thirty (30)",
        "ownerID": "abcd",
        "OperatingUnit": " International Company",
        "siteNumber": null,
        "customerNumber": null
    },
    {
        "id": "a37n0000000DAMAAA4",
        "LineID": [
            "JunEAE",
            "JuiEAE",
            "KdMEAU",
            "JuYEAU"
        ],
        "Price": [
            "5000.0",
            "8000.0",
            "5000.0",
            "5000.0"
        ],
        "netPrice": null,
        "volume": null,
        "name": " Test 3.6",
        "StartDate": "2017-03-06",
        "EndDate": "2018-03-16",
        "poField": "12345",
        "ConsoleNumber": "123456-",
        "PaymentTerm": "30 NET",
        "ownerID": "dfgh",
        "OperatingUnit": ", inc.",
        "siteNumber": null,
        "customerNumber": null
    },
    ….
]
When I call this code from the browser (using API testing) I get the complete payload with multiple objects. When I call it from another API I get only one object, which suggests it is not looping through. I can confirm that the payload has multiple objects. Is there anything I am missing in terms of looping through this code to extract multiple objects? I assume that the '$' notation is good enough for iteration.

@insaneyogi, either your input or your DataWeave is incorrect.
In the input you have specified id in lowercase, but in the DataWeave it is capitalised (Id).

I think the problem here is with your LineItems and Price elements. They are collections within an element. In your data mapping, $. takes care of the outer object. However, I think a mapping like LineItems__r.Price__c is not correct. It should have a proper index, probably LineItems__r.Price__c[0]. Please try that and it should work. First change the input to a single element for price or line item and test.

It looks like agreementLineID and netPrice are arrays, and you need to loop through them with a map operator inside the outer map to get all the line items. That should work.
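
The nested-loop idea can be illustrated in Python using the field names from the sample payload in the question. This is a sketch of the logic only, not DataWeave; flatten_agreements and the lines key are hypothetical, and it assumes LineID and Price are parallel arrays of equal length.

```python
def flatten_agreements(agreements):
    """Pair each agreement's line IDs with their prices, one record per line."""
    result = []
    for agreement in agreements:
        # Inner loop: one entry per line item of this agreement.
        lines = [
            {"agreementLineID": line_id, "netPrice": price}
            for line_id, price in zip(agreement["LineID"], agreement["Price"])
        ]
        # Outer loop: one output object per agreement.
        result.append({"id": agreement["id"], "lines": lines})
    return result


sample = [
    {"id": "DA0YAAW", "LineID": ["jGEAU", "jBEAU"], "Price": ["50000.0", "12000.0"]},
    {"id": "a37n0000000DAMAAA4", "LineID": ["JunEAE"], "Price": ["5000.0"]},
]
print(flatten_agreements(sample))
```

In DataWeave the inner comprehension would correspond to a second map over the line items nested inside the outer payload map.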
