Python: Accessing indices of iterable passed into `Pool.map(function(), iterable)` - indexing

I have code that looks something like this:
def downloadImages(self, path):
for link in self.images: #self.images=list of links opened via requests & written to file
if stuff in link:
#parse link a certain way to get `name`
else:
#parse link a different way to get `name`
r = requests.get(link)
with open(path+name,'wb') as f:
f.write(r.content)
pool = Pool(2)
scraper.prepPage(url)
scraper.downloadImages('path/to/directory')
I want to change downloadImages to parallelize the function. Here's my attempt:
def downloadImage(self, path, index of self.images currently being processed):
if stuff in link: #because of multiprocessing, there's no longer a loop providing a reference to each index...the index is necessary to parse and complete the function.
#parse link a certain way to get `name`
else:
#parse link a different way to get `name`
r = requests.get(link)
with open(path+name,'wb') as f:
f.write(r.content)
pool = Pool(2)
scraper.prepPage(url)
pool.map(scraper.downloadImages('path/to/directory', ***some way to grab index of self.images****), self.images)
How can I refer to the index currently being worked on of the iterable passed into pool.map()?
I'm completely new to multiprocessing & couldn't find what I was looking for in the documentation...I also couldn't find a similar question via google or stackoverflow.

Related

How can I access value in sequence type?

There are the following attributes in client_output
weights_delta = attr.ib()
client_weight = attr.ib()
model_output = attr.ib()
client_loss = attr.ib()
After that, I made the client_output in the form of a sequence through
a = tff.federated_collect(client_output) and round_model_delta = tff.federated_map(selecting_fn,a)in here . and I declared
`
#tff.tf_computation() # append
def selecting_fn(a):
#TODO
return round_model_delta
in here. In the process of averaging on the server, I want to average the weights_delta by selecting some of the clients with a small loss value. So I try to access it via a.weights_delta but it doesn't work.
The tff.federated_collect returns a tff.SequenceType placed at tff.SERVER which you can manipulate the same way as for example client dataset is usually handled in a method decorated by tff.tf_computation.
Note that you have to use the tff.federated_collect operator in the scope of a tff.federated_computation. What you probably want to do[*] is pass it into a tff.tf_computation, using the tff.federated_map operator. Once inside the tff.tf_computation, you can think of it as a tf.data.Dataset object and everything in the tf.data module is available.
[*] I am guessing. More detailed explanation of what you would like to achieve would be helpful.

How do I find a specific tag's value (which could be anything) with beautifulsoup?

I am trying to get the job IDs from the tags of Indeed listings. So far, I have taken Indeed search results and put each job into its own "bs4.element.Tag" object, but I don't know how to extract the value of the tag (or is it a class?) "data-jk". Here is what I have so far:
import requests
import bs4
import re
# 1: scrape (5?) pages of search results for listing ID's
results = []
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=0"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=10"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=20"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=30"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=40"))
# each search page has a query "q", location "l", and a "start" = 10*int
# the search results are contained in a "td" with ID = "resultsCol"
justjobs = []
for eachResult in results:
soup_jobs = bs4.BeautifulSoup(eachResult.text, "lxml") # this is for IDs
justjobs.extend(soup_jobs.find_all(attrs={"data-jk":True})) # re.compile("data-jk")
# each "card" is a div object
# each has the class "jobsearch-SerpJobCard unifiedRow row result clickcard"
# as well as a specific tag "data-jk"
# "data-jk" seems to be the actual IDs used in each listing's URL
# Now, each div element has a data-jk. I will try to get data-jk from each one:
jobIDs = []
print(type(justjobs[0])) # DEBUG
for eachJob in justjobs:
jobIDs.append(eachJob.find("data-jk"))
print("Length: " + str(len(jobIDs))) # DEBUG
print("Example JobID: " + str(jobIDs[1])) # DEBUG
The examples I've seen online generally try to get the information contained between and , but I am not sure how to get the info from inside of the (first) tag itself. I've tried doing it by parsing it as a string instead:
print(justjobs[0])
for eachJob in justjobs:
jobIDs.append(str(eachJob)[115:131])
print(jobIDs)
but the website is also inconsistent with how the tags operate, and I think that using beautifulsoup would be more flexible than multiple cases and substrings.
Any pointers would be greatly appreciated!
Looks like you can regex them out from a script tag
import requests,re
html = requests.get('https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=0').text
p = re.compile(r"jk:'(.*?)'")
ids = p.findall(html)

Pymongo: insert_many() gives "TypeError: document must be instance of dict" for list of dicts

I haven't been able to find any relevant solutions to my problem when googling, so I thought I'd try here.
I have a program where I parse though folders for a certain kind of trace files, and then save these in a MongoDB database. Like so:
posts = function(source_path)
client = pymongo.MongoClient()
db = client.database
collection = db.collection
insert = collection.insert_many(posts)
def function(...):
....
post = parse(trace)
posts.append(post)
return posts
def parse(...):
....
post = {'Thing1': thing,
'Thing2': other_thing,
etc}
return post
However, when I get to "insert = collection.insert_many(posts)", it returns an error:
TypeError: document must be an instance of dict, bson.son.SON, bson.raw_bson.RawBSONDocument, or a type that inherits from collections.MutableMapping
According to the debugger, "posts" is a list of about 1000 dicts, which should be vaild input according to all of my research. If I construct a smaller list of dicts and insert_many(), it works flawlessly.
Does anyone know what the issue may be?
Some more debugging revealed the issue to be that the "parse" function sometimes returned None rather than a dict. Easily fixed.

Many inputs to one output, access wildcards in input files

Apologies if this is a straightforward question, I couldn't find anything in the docs.
currently my workflow looks something like this. I'm taking a number of input files created as part of this workflow, and summarizing them.
Is there a way to avoid this manual regex step to parse the wildcards in the filenames?
I thought about an "expand" of cross_ids and config["chromosomes"], but unsure to guarantee conistent order.
rule report:
output:
table="output/mendel_errors.txt"
input:
files=expand("output/{chrom}/{cross}.in", chrom=config["chromosomes"], cross=cross_ids)
params:
req="h_vmem=4G",
run:
df = pd.DataFrame(index=range(len(input.files), columns=["stat", "chrom", "cross"])
for i, fn in enumerate(input.files):
# open fn / make calculations etc // stat =
# manual regex of filename to get chrom cross // chrom, cross =
df.loc[i] = stat, chrom, choss
This seems a bit awkward when this information must be in the environment somewhere.
(via Johannes Köster on the google group)
To answer your question:
Expand uses functools.product from the standard library. Hence, you could write
from functools import product
product(config["chromosomes"], cross_ids)

Creating a .p file from an InvertedIndex

I am trying to construct a Python search engine that will pull queried tweets out of the TwitterAPI. I cannot use tweepy. I have constructed the class TwitterWrapper, class Engine, and have also created the (pickled) corpus from my query term of “raptor”, no problem there. I have created the class InvertedIndex but cannot seem to construct the index of lemmatized words and add it to my Jupyter Notebooks as a pickle file (.p) (which I must do) in order to use my TwittIR.py interface:
t = TwittIR.Engine(“raptor.p”, “index (to be named).p”)
results = t.query (“raptor”)
for result in results:
print(result)
So, my question is, how do I make an index (using my inverted index), and save it into my Jupyter Notebook as a .p file? I am of course, somewhat of a Python beginner.
def build_index(self, corpus):
index = {}
if self.lemmatizer == None:
self.lemmatizer = nltk.WordNetLemmatizer()
if self.stop_words is None:
self.stop_words = [self.wordtotoken(word) for word in
nltk.corpus.stopwords.words('english')]
for doc in corpus:
self.add_document(doc['text'], self.ndocs)
self.ndocs = self.ndocs + 1
def __getitem__(self,index):
return self.index[self.normalize_document(index)[0]]