How to merge items with same identifier from different urls - scrapy

I am currently crawling multiple product-sites from an Excel-document.
The document looks basically like this:
ID URL1 URL2 URL3
01 abc.com/1 def.com/1 ghi.com/1
02 abc.com/2 def.com/2 ghi.com/2
03 abc.com/3 def.com/3 ghi.com/3
My spider now takes the ID and the url and, since the sites are different, yields a request to the corresponding parse-function and passes the ID to each function.
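Roughly like this, simplified (the field and callback names here are placeholders, not my exact code):

import scrapy

class PricesSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for row in self.rows:  # rows read from the Excel document
            yield scrapy.Request(row['URL1'], callback=self.parse_url1,
                                 meta={'id': row['ID']})
            yield scrapy.Request(row['URL2'], callback=self.parse_url2,
                                 meta={'id': row['ID']})
            yield scrapy.Request(row['URL3'], callback=self.parse_url3,
                                 meta={'id': row['ID']})

    def parse_url1(self, response):
        # each parse_urlX extracts the price from its site's markup
        yield {'id': response.meta['id'], 'price': '...'}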
The problem now is that my csv-output lists each ID multiple times, like this:
ID PriceURL1 PriceURL2 PriceURL3
01 xx.xx
01 xx.xx
01 xx.xx
02 xx.xx
02 xx.xx
...
Is there a way to merge the items, so that each ID has the prices gathered from all the different parse-functions?
So far I think my options would be:
1. Implement a scraping order, so that the spider crawls URL1 first, URL2 second and so on, yielding a request in each parse_urlX function. I don't know if this is really an elegant and resource-friendly way, though, as I'd have to read the URLs from the document each time.
2. Crawl it as it is, and merge the scraped items in the output CSV file after the spider is done scraping.
3. What I think might be the best way: work with an item pipeline that adds each ID to a set and checks whether the currently scraped ID has already been seen; if yes, it adds the current price to the existing item.
I tried 3., but so far I can't figure out how to access previously scraped items and update them.
What would be the best way to achieve this?

In my opinion the best approach is to process the items afterwards. The item pipeline is not designed to hold items back and merge them; you can filter items out, but there is no elegant way to merge them there.
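For comparison, filtering duplicates in a pipeline only takes a few lines (a sketch in the spirit of the DuplicatesPipeline example from the Scrapy docs), whereas merging would force you to hold items back until you know all of their siblings have arrived:

from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        # drop an item whose ID was already seen; there is no clean hook
        # for waiting until the other prices for this ID have arrived
        if item['id'] in self.ids_seen:
            raise DropItem('Duplicate item found: %r' % item)
        self.ids_seen.add(item['id'])
        return item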
Aggregating the output is straightforward:
import json
from collections import defaultdict

groups = defaultdict(list)
for line in open("items.jl"):
    item = json.loads(line)
    key = get_key(item)  # a function that returns a key for an item
    groups[key].append(item)

for items in groups.values():
    merged = merge_items(items)  # a function that returns a single item
    # store the merged item somewhere
    ...
If your data is too large to fit in memory, I'd suggest posting a new question and you will get plenty of answers.
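For completeness, a minimal sketch of what get_key and merge_items could look like here, assuming each item carries id and price fields (adjust the names to your items):

import csv

def get_key(item):
    # group by the product ID taken from the Excel document
    return item['id']

def merge_items(items):
    # one price column per scraped source
    merged = {'id': items[0]['id']}
    for i, item in enumerate(items, start=1):
        merged['price_url%d' % i] = item['price']
    return merged

rows = [merge_items(items) for items in groups.values()]
fieldnames = sorted({key for row in rows for key in row})
with open('merged.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(rows)

Note that the column order here follows the order in which the items were scraped; if you need each price pinned to a specific source column, also store the URL on the item and map it to the right column in merge_items.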

Related

How to get the contributors you've coincided with the most when editing Wikipedia

I'm doing a gamification web app to help Wikimedia's community health.
I want to find which editors have edited the same pages as 'Jake' the most, in the last week or the last 100 edits or something like that.
I know my query, but I can't figure out what tables I need because the Wikimedia DB layout is a mess.
So, I want to obtain something like:
Username  Occurrences  Pages
Mikey     13           Obama, ...
So the query would be something like this (I'm accepting suggestions):
1. Get the pages that the user 'Jake' has edited in the last week.
2. Get the contributors of those pages in the last week.
3. For each of these contributors, get the pages they have edited in the last week, see which ones match the pages 'Jake' has edited, and count them.
I've tried something simpler in Pywikibot, but it's very, very slow (20 seconds for the last 500 contributions of Jake): I only get the edited pages, get the contributors of each page and count them, and even that is very slow.
My pywikibot code is:
site = Site(langcode, 'wikipedia')
user = User(site, username)
contributed_pages = set()
for page, oldid, ts, comment in user.contributions(total=100, namespaces=[0]):
    contributed_pages.add(page)
return get_contributor_ocurrences(contributed_pages, site, username)
And the function
def get_contributor_ocurrences(contributed_pages, site, username):
    contributors = []
    for page in contributed_pages:
        for editor in page.contributors():
            if APISite.isBot(self=site, username=editor) or editor == username:
                continue
            contributors.append(editor)
    return Counter(contributors)
PS: I have access to DB replicas, which I guess are way faster than Wikimedia API or Pywikibot
You can restrict the data to be retrieved with timestamp parameters, which reduces the runtime a lot; refer to the documentation for their usage. Here is a code snippet that gets the data with Pywikibot using timestamps:
from collections import Counter
from datetime import timedelta

import pywikibot
from pywikibot.tools import filter_unique

site = pywikibot.Site()
user = pywikibot.User(site, username)  # username must be a string

# Set up the generator for the last 7 days.
# Do not care about the timestamp format if using pywikibot.Timestamp.
stamp = pywikibot.Timestamp.now() - timedelta(days=7)
contribs = user.contributions(end=stamp)

contributors = []
# filter_unique is used to remove duplicates.
# The key uses the page title.
for page, *_ in filter_unique(contribs, key=lambda x: str(x[0])):
    # note: editors is a Counter
    editors = page.contributors(endtime=stamp)
    print('{:<35}: {}'.format(page.title(), editors))
    contributors.extend(editors.elements())

total = Counter(contributors)
This prints the list of pages and, for each page, shows the editors and their contribution counts within the given time range. Finally, total should have the same content as your get_contributor_ocurrences function above. It requires some additional work to get the table you mentioned above; one way is sketched below.
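To build that table you also need to remember which pages each editor touched. A minimal sketch under the same assumptions (it replaces the loop above so the contribs generator is only consumed once; pages_by_editor is a helper introduced here, and the bot/self-edit filtering from your original function is left out for brevity):

from collections import defaultdict

contributors = []
pages_by_editor = defaultdict(set)  # editor -> set of page titles

for page, *_ in filter_unique(contribs, key=lambda x: str(x[0])):
    editors = page.contributors(endtime=stamp)
    contributors.extend(editors.elements())
    for editor in editors:
        pages_by_editor[editor].add(page.title())

total = Counter(contributors)
for editor, occurrences in total.most_common():
    pages = ', '.join(sorted(pages_by_editor[editor]))
    print('{:<20} {:>5}  {}'.format(editor, occurrences, pages))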

Twitter Premium API Profile location operators profile_country: and profile_region: not working

I am using a premium account (not sandbox) for data collection.
I want to collect:
All tweets in English that contain ‘china’ or ‘chinese’ that are user geolocated to US and not geolocated at tweet level, excluding all retweets
All tweets in English that contain ‘china’ or ‘chinese’ that are user geolocated to ‘Minnesota’ and not geolocated at tweet level, excluding all retweets
The code is as follows:
premium_search_args = load_credentials('twitter_API.yaml',
                                       yaml_key='search_tweets_premium_api',
                                       env_overwrite=False)

# keywords for the search
# key word 1
keywords = '(China OR Chinese) lang:en profile_country:US -place_country:US -is:retweet'
# key word 2
keywords = '(China OR Chinese) lang:en -place_country:US profile_region:"Minnesota" -is:retweet'

# define search rule
rule = gen_rule_payload(keywords, from_date='2019-12-01',
                        to_date='2019-12-10', results_per_call=500)

# create result stream and print before start
rs = ResultStream(rule_payload=rule, max_results=1250000,
                  **premium_search_args)
My problems are:
1. For the first query, a large portion of the results don't satisfy it. First, some don't have the Profile Geo enrichment, i.e. the user.derived.locations attribute is not in the user object. Second, when it is present, many results don't have country code US, i.e. they are resolved to other countries.
2. For the second query, the result I get is a smaller subset of the results I can get from 1). That is, when I filter all tweets user-geolocated to Minnesota (by user.derived.locations.region) out of the profile_country:US pull, I get a larger sample than when using profile_region:"Minnesota" directly. A considerable amount of data is missing with this method; the client-side check I use for this comparison is sketched below.
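For reference, the check looks roughly like this (a sketch, assuming each result from rs.stream() is the parsed tweet payload dict and that the enrichment keys are country_code and region):

def profile_geo_matches(tweet, key, value):
    # True if the Profile Geo enrichment is present and the given field matches
    locations = tweet.get('user', {}).get('derived', {}).get('locations', [])
    return any(loc.get(key) == value for loc in locations)

tweets = list(rs.stream())
us_tweets = [t for t in tweets if profile_geo_matches(t, 'country_code', 'US')]
mn_tweets = [t for t in tweets if profile_geo_matches(t, 'region', 'Minnesota')]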
I have tried several times, but it seems the user geolocation operators don't do exactly what I want. Does anyone have any idea why this is the case? I would very much appreciate any answers/suggestions/comments.
Thank you!

Is there a way to speed up this webscraping iteration? Pandas

So I'm collecting data on a list of stocks and putting all that info into a dataframe. The list has about 700 stocks.
import pandas as pd
stock =['adma','aapl','fb'] # list has about 700 stocks which I extracted from a pickled dataframe that was storing the info.
#The site I'm visiting is below, with the name of the stock added to the end of the link
##http://finviz.com/quote.ashx?t=adma
##http://finviz.com/quote.ashx?t=aapl
I'm just extracting one portion of that site, as indicated by the [-2] in the code below.
df2 = pd.DataFrame()
for i in stock:
    df = pd.read_html('http://finviz.com/quote.ashx?t={}'.format(i), header=0)[-2].set_index('SEC Form 4')
    df['Stock'] = i.upper()  # creating a column which has the name of the stock, so I can differentiate between stocks
    df2 = df2.append(df)
Each iteration feels like it takes a few seconds, and I have around 700 to go through at the moment. It's not terribly slow, but I was just curious whether there is a more efficient method. Thanks.
Your current code is blocking: you don't start retrieving the next URL until you are done with the current one. Instead, you could switch to, for example, Scrapy, which is based on Twisted and works asynchronously, processing multiple pages at the same time.
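If you would rather stay with pandas than move to Scrapy, a thread pool gives a similar benefit, since most of the time is spent waiting on the network. A rough sketch keeping the original parsing logic (the worker count of 20 is an arbitrary choice):

from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def fetch(ticker):
    # same extraction as the original loop, for a single ticker
    url = 'http://finviz.com/quote.ashx?t={}'.format(ticker)
    df = pd.read_html(url, header=0)[-2].set_index('SEC Form 4')
    df['Stock'] = ticker.upper()
    return df

with ThreadPoolExecutor(max_workers=20) as pool:
    frames = list(pool.map(fetch, stock))

df2 = pd.concat(frames)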

How to use Bioproject ID, for example, PRJNA12997, in biopython?

I have an Excel file listing more than 2000 organisms, each with an associated BioProject ID (like PRJNA12997). The idea is to use these IDs to get the sequences for a later multiple alignment with five other sequences that I have in a text file.
Can anyone help me understand how I can do this using biopython? At least the part with the bioproject ID.
You can first get the info using Bio.Entrez:
from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
# This call to efetch fails sometimes with a 400 error.
handle = Entrez.efetch(db="bioproject", id="PRJNA12997")
I've been trying, and Entrez.read(handle) doesn't seem to work. But if you do record_xml = handle.read() you'll get the XML entry for this record. In this XML you can get the ID for the organism, in this case 12997.
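If you only need that numeric ID for the next step, a shortcut (a sketch, assuming the ID always matches the digits in the accession, as it does for PRJNA12997) is:

import re

accession = "PRJNA12997"
uid = re.search(r"\d+", accession).group()  # "12997"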
handle = Entrez.esearch(db="nuccore", term="12997[BioProject]")
search_results = Entrez.read(handle)
Now you can efetch from your search results. At this point you should use Biopython to parse whatever you get in the efetch step, playing with the rettype: http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/
for result in search_results["IdList"]:
    entry = Entrez.efetch(db="nuccore", id=result, rettype="fasta")
    this_seq_in_fasta = entry.read()
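Putting it together for the Excel file, a rough sketch (assuming the accessions are in a column named 'BioProject' of a file named organisms.xlsx; both names are placeholders to adjust):

import re
import pandas as pd
from Bio import Entrez

Entrez.email = "Your.Name.Here@example.org"

organisms = pd.read_excel("organisms.xlsx")
with open("project_sequences.fasta", "w") as out:
    for accession in organisms["BioProject"]:
        uid = re.search(r"\d+", accession).group()
        handle = Entrez.esearch(db="nuccore", term="{}[BioProject]".format(uid))
        id_list = Entrez.read(handle)["IdList"]
        for seq_id in id_list:
            entry = Entrez.efetch(db="nuccore", id=seq_id, rettype="fasta", retmode="text")
            out.write(entry.read())

This writes everything into one FASTA file that you can then align together with your five other sequences.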

Magento Bulk update attributes

I am missing the SQL piece of this needed to bulk-update attributes by SKU/UPC.
Running EE1.10 FYI
I have all the rest of the code working, but I'm not sure of the who/what/why of actually updating our attributes, and I haven't been able to find it. My logic is:
Open a CSV and grab all skus and associated attrib into a 2d array
Parse the SKU into an entity_id
Take the entity_id and the attribute and run updates until finished
Take the rest of the day off since it's Friday
Here's my (almost finished) code, I would GREATLY appreciate some help.
/**
* FUNCTION: updateAttrib
*
* REQS: $db_magento
* Session resource
*
* REQS: entity_id
* Product entity value
*
* REQS: $attrib
* Attribute to alter
*
*/
See my response for working production code. Hope this helps someone in the Magento community.
While this may technically work, the code you have written is just about the last way you should do this.
In Magento, you really should be using the models provided by the code and not write database queries on your own.
In your case, if you need to update attributes for 1 or many products, there is a way for you to do that very quickly (and pretty safely).
If you look in: /app/code/core/Mage/Adminhtml/controllers/Catalog/Product/Action/AttributeController.php you will find that this controller is dedicated to updating multiple products quickly.
If you look in the saveAction() function you will find the following line of code:
Mage::getSingleton('catalog/product_action')
    ->updateAttributes($this->_getHelper()->getProductIds(), $attributesData, $storeId);
This code updates all the product IDs you want, with only the changed attributes, for a single store at a time.
The first parameter is basically an array of Product IDs. If you only want to update a single product, just put it in an array.
The second parameter is an array that contains the attributes you want to update for the given products. For example if you wanted to update price to $10 and weight to 5, you would pass the following array:
array('price' => 10.00, 'weight' => 5)
Finally, the third parameter is the ID of the store you want these updates to apply to. Most likely this number will be either 1 or 0.
I would play around with this function call and use this instead of writing and maintaining your own database queries.
The general update query will look like:
UPDATE
  catalog_product_entity_[backend_type] cpex
SET
  cpex.value = ?
WHERE cpex.attribute_id = ?
  AND cpex.entity_id = ?
In order to find the [backend_type] associated with the attribute:
SELECT
  backend_type
FROM
  eav_attribute
WHERE entity_type_id =
  (SELECT
    entity_type_id
  FROM
    eav_entity_type
  WHERE entity_type_code = 'catalog_product')
AND attribute_id = ?
You can get more info from the following blog article:
http://www.blog.magepsycho.com/magento-eav-structure-role-of-eav_attributes-backend_type-field/
Hope this helps you.