Web scraping from a list of URLs using ThreadPoolExecutor and Selenium - selenium

I am trying to scrape flight prices using Selenium from a list of URLs. The list of URLs is very large, so my initial implementation, which simply grabbed an element from each URL in turn, would take over 24 hours to complete. So I decided to take a stab at speeding it up. I did some research and decided that threading might help. The goal of the code below is to divide the URLs between 3 threads; however, it is not working. I think the webpages might just not be loading? I am looking for advice on whether this is a feasible approach, and if it isn't, what a better strategy might be. Thank you!
# import libraries
from time import sleep
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from concurrent.futures import ThreadPoolExecutor
from threading import current_thread
from threading import get_ident
from threading import get_native_id

## initialize drivers (raw strings so the backslashes are not treated as escapes)
driver0 = webdriver.Chrome(executable_path=r'C:\Program Files (x86)\chromedriver.exe')
driver1 = webdriver.Chrome(executable_path=r'C:\Program Files (x86)\chromedriver.exe')
driver2 = webdriver.Chrome(executable_path=r'C:\Program Files (x86)\chromedriver.exe')
def get_cost(url):
    # use the driver that belongs to this worker thread
    # print('url: ', url)
    thread = current_thread()
    print(f'Worker thread: name={thread.name}, ident={get_ident()}, id={get_native_id()}')
    if thread.name == 'ThreadPoolExecutor-0_0':
        print('- Thread 0')
        driver = driver0
    elif thread.name == 'ThreadPoolExecutor-0_1':
        print('- Thread 1')
        driver = driver1
    elif thread.name == 'ThreadPoolExecutor-0_2':
        print('- Thread 2')
        driver = driver2
    else:
        print("error")
        return
    driver.get(url)
    sleep(20)  # maybe it doesn't have time to load?
    # find cost
    try:
        element = WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.XPATH, '/html/body/c-wiz[2]/div/div[2]/c-wiz/div/c-wiz/c-wiz/div[2]/div[2]/ul[1]/li[1]/div/div[2]/div/div[9]/div[2]/span'))
        )
        cost = element.get_attribute('textContent')
    except Exception:
        cost = "-"
    print('url: ', url)
    print('cost: ', cost)
urls = ['https://www.google.com/travel/flights?q=Flights%20to%20Paphos%20from%20Vienna%20on%202022-07-25%20one%20way',
'https://www.google.com/travel/flights?q=Flights%20to%20Paphos%20from%20Vienna%20on%202022-07-26%20one%20way',
'https://www.google.com/travel/flights?q=Flights%20to%20Paphos%20from%20Vienna%20on%202022-07-27%20one%20way',
'https://www.google.com/travel/flights?q=Flights%20to%20Paphos%20from%20Vienna%20on%202022-07-28%20one%20way',
'https://www.google.com/travel/flights?q=Flights%20to%20Paphos%20from%20Vienna%20on%202022-07-29%20one%20way',
'https://www.google.com/travel/flights?q=Flights%20to%20Paphos%20from%20Vienna%20on%202022-07-30%20one%20way',
'https://www.google.com/travel/flights?q=Flights%20to%20Paphos%20from%20Vienna%20on%202022-08-01%20one%20way',
'https://www.google.com/travel/flights?q=Flights%20to%20Paphos%20from%20Vienna%20on%202022-08-02%20one%20way',
'https://www.google.com/travel/flights?q=Flights%20to%20Paphos%20from%20Vienna%20on%202022-08-03%20one%20way']
## MAIN
with ThreadPoolExecutor(max_workers=3) as exe:
    exe.map(get_cost, urls)
## close drivers
driver0.quit()
driver1.quit()
driver2.quit()
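A common way to bind one driver to each worker thread without matching on pool thread names is threading.local(); a minimal sketch, assuming chromedriver is on your PATH and reusing the urls list above (drivers created this way still need to be quit once the pool shuts down, e.g. by collecting them in a list as they are created):

from concurrent.futures import ThreadPoolExecutor
from threading import local
from selenium import webdriver

tls = local()  # one attribute namespace per thread

def get_driver():
    # lazily create one Chrome instance per worker thread
    if not hasattr(tls, 'driver'):
        tls.driver = webdriver.Chrome()
    return tls.driver

def get_cost(url):
    driver = get_driver()
    driver.get(url)
    # ... locate and return the price element here ...

with ThreadPoolExecutor(max_workers=3) as exe:
    exe.map(get_cost, urls)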

I created a quick solution using SeleniumBase that runs with pytest, which already includes multithreading abilities:
pytest test_scrape_google_flights.py -n 4 --headless
from parameterized import parameterized
from seleniumbase import BaseCase

class GoogleTests(BaseCase):
    @parameterized.expand(
        [
            ['https://www.google.com/travel/flights?q=Flights%20to%20Paphos%20from%20Vienna%20on%202022-07-25%20one%20way'],
            ['https://www.google.com/travel/flights?q=Flights%20to%20Paphos%20from%20Vienna%20on%202022-07-26%20one%20way'],
            ['https://www.google.com/travel/flights?q=Flights%20to%20Paphos%20from%20Vienna%20on%202022-07-27%20one%20way'],
            ['https://www.google.com/travel/flights?q=Flights%20to%20Paphos%20from%20Vienna%20on%202022-07-28%20one%20way'],
        ]
    )
    def test_parameterized_google_search(self, url):
        self.open(url)
        self.wait_for_text("$", 'div[role="main"]')
        content = self.get_text('div[role="main"]')
        items = content.split("\n")
        for item in items:
            if "$" in item:
                self._print("%s - %s\n" % (url.split(r"%20")[-3], item))
This was the result of running that:
pytest test_scrape_google_flights.py -n 4 --headless
=================================== test session starts ===================================
platform darwin -- Python 3.10.1, pytest-7.1.2, pluggy-1.0.0
rootdir: /Users/michael/github/SeleniumBase/examples, configfile: pytest.ini
plugins: html-2.0.1, xdist-2.5.0, forked-1.4.0, metadata-2.0.1, rerunfailures-10.2, ordering-0.6, cov-3.0.0, seleniumbase-3.2.9
gw0 [5] / gw1 [5] / gw2 [5] / gw3 [5]
2022-07-27 - $80
2022-07-27 - $235
2022-07-27 - $267
2022-07-27 - $216
2022-07-27 - $471
2022-07-27 - $516
2022-07-27 - $601
2022-07-27 - $626
2022-07-27 - $827
.2022-07-26 - Travel on Jul 27 for $80
2022-07-26 - $453
2022-07-26 - $515
2022-07-26 - $572
2022-07-26 - $596
2022-07-26 - $601
2022-07-26 - $314
2022-07-26 - $316
2022-07-26 - $480
.2022-07-28 - Travel on Jul 29 for $71
2022-07-28 - $96
2022-07-28 - $301
2022-07-28 - $616
2022-07-28 - $340
2022-07-28 - $347
2022-07-28 - $412
2022-07-28 - $462
2022-07-28 - $478
2022-07-28 - $669
2022-07-28 - $1,165
.2022-07-25 - Travel on Jul 27 for $80
2022-07-25 - $130
2022-07-25 - $422
2022-07-25 - $573
2022-07-25 - $615
2022-07-25 - $340
2022-07-25 - $384
2022-07-25 - $501
2022-07-25 - $646
(Full disclosure: I built that automation framework)

Related

Nextflow: structured inputs with files

I have an array of structured data similar to:
- name: foobar
  sex: male
  fastqs:
    - r1: /path/to/foobar_R1.fastq.gz
      r2: /path/to/foobar_R2.fastq.gz
    - r1: /path/to/more/foobar_R1.fastq.gz
      r2: /path/to/more/foobar_R2.fastq.gz
- name: bazquux
  sex: female
  fastqs:
    - r1: /path/to/bazquux_R1.fastq.gz
      r2: /path/to/bazquux_R2.fastq.gz
Note that fastqs come in pairs, and the number of pairs per "sample" may be variable.
I want to write a process in nextflow that processes one sample at a time.
In order for the Nextflow executor to properly marshal the files, they must somehow be typed as path (or file). Thus typed, the executor will copy the files to the compute node for processing. Simply typing the file paths as val will treat the paths as strings, and no files will be copied.
A trivial example of a path input from the docs:
process foo {
  input:
  path x from '/some/data/file.txt'

  """
  your_command --in $x
  """
}
How should I go about declaring the process input so that the files are properly marshaled to the compute node? So far I haven't found any examples in the docs for how to handle structured inputs.
Your structured data looks a lot like YAML. If you can include a top-level object so that your file looks something like this:
samples:
  - name: foobar
    sex: male
    fastqs:
      - r1: ./path/to/foobar_R1.fastq.gz
        r2: ./path/to/foobar_R2.fastq.gz
      - r1: ./path/to/more/foobar_R1.fastq.gz
        r2: ./path/to/more/foobar_R2.fastq.gz
  - name: bazquux
    sex: female
    fastqs:
      - r1: ./path/to/bazquux_R1.fastq.gz
        r2: ./path/to/bazquux_R2.fastq.gz
Then, we can use Nextflow's -params-file option to load the params when we run our workflow. We can access the top-level object from the params, which gives us a list that we can use to create a Channel using the fromList factory method. The following example uses the new DSL 2:
process test_proc {
    tag { sample_name }
    debug true
    stageInMode 'rellink'

    input:
    tuple val(sample_name), val(sex), path(fastqs)

    """
    echo "${sample_name},${sex}:"
    ls -g *.fastq.gz
    """
}

workflow {
    Channel.fromList( params.samples )
        | flatMap { rec ->
            rec.fastqs.collect { rg ->
                readgroup = tuple( file(rg.r1), file(rg.r2) )
                tuple( rec.name, rec.sex, readgroup )
            }
        }
        | test_proc
}
Results:
$ mkdir -p ./path/to/more
$ touch ./path/to/foobar_R{1,2}.fastq.gz
$ touch ./path/to/more/foobar_R{1,2}.fastq.gz
$ touch ./path/to/bazquux_R{1,2}.fastq.gz
$ nextflow run main.nf -params-file file.yaml
N E X T F L O W ~ version 22.04.0
Launching `main.nf` [desperate_colden] DSL2 - revision: 391a9a3b3a
executor > local (3)
[ed/61c5c3] process > test_proc (foobar) [100%] 3 of 3 ✔
foobar,male:
lrwxrwxrwx 1 users 35 Oct 14 13:56 foobar_R1.fastq.gz -> ../../../path/to/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 35 Oct 14 13:56 foobar_R2.fastq.gz -> ../../../path/to/foobar_R2.fastq.gz
bazquux,female:
lrwxrwxrwx 1 users 36 Oct 14 13:56 bazquux_R1.fastq.gz -> ../../../path/to/bazquux_R1.fastq.gz
lrwxrwxrwx 1 users 36 Oct 14 13:56 bazquux_R2.fastq.gz -> ../../../path/to/bazquux_R2.fastq.gz
foobar,male:
lrwxrwxrwx 1 users 40 Oct 14 13:56 foobar_R1.fastq.gz -> ../../../path/to/more/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 40 Oct 14 13:56 foobar_R2.fastq.gz -> ../../../path/to/more/foobar_R2.fastq.gz
As requested, here's a solution that runs per sample. The problem we have is that we cannot simply feed in a list of lists using the path qualifier (since an ArrayList is not a valid path value). We could flatten() the list of file pairs, but this makes it difficult to access each of the file pairs if we need them. You may not necessarily need the file pair relationship but assuming you do, I think the right solution is to feed the R1 and R2 files in separately (i.e. using a path qualifier for R1 and another path qualifier for R2). The following example introspects the instance type to (re-)create the list of readgroups. We can use the stageAs option to localize the files into progressively indexed subdirectories, since some files in the YAML have identical names.
process test_proc {
    tag { sample_name }
    debug true
    stageInMode 'rellink'

    input:
    tuple val(sample_name), val(sex), path(r1, stageAs:'*/*'), path(r2, stageAs:'*/*')

    script:
    if( [r1, r2].every { it instanceof List } )
        readgroups = [r1, r2].transpose()
    else if( [r1, r2].every { it instanceof Path } )
        readgroups = [ [r1, r2] ]
    else
        error "Invalid readgroup configuration"

    read_pairs = readgroups.collect { r1, r2 -> "${r1},${r2}" }

    """
    echo "${sample_name},${sex}:"
    echo ${read_pairs.join(' ')}
    ls -g */*.fastq.gz
    """
}

workflow {
    Channel.fromList( params.samples )
        | map { rec ->
            def r1 = rec.fastqs.r1.collect { file(it) }
            def r2 = rec.fastqs.r2.collect { file(it) }
            tuple( rec.name, rec.sex, r1, r2 )
        }
        | test_proc
}
Results:
$ nextflow run main.nf -params-file file.yaml
N E X T F L O W ~ version 22.04.0
Launching `main.nf` [berserk_sanger] DSL2 - revision: 2f317a8cee
executor > local (2)
[93/6345c9] process > test_proc (bazquux) [100%] 2 of 2 ✔
foobar,male:
1/foobar_R1.fastq.gz,1/foobar_R2.fastq.gz 2/foobar_R1.fastq.gz,2/foobar_R2.fastq.gz
lrwxrwxrwx 1 users 38 Oct 19 13:43 1/foobar_R1.fastq.gz -> ../../../../path/to/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 38 Oct 19 13:43 1/foobar_R2.fastq.gz -> ../../../../path/to/foobar_R2.fastq.gz
lrwxrwxrwx 1 users 43 Oct 19 13:43 2/foobar_R1.fastq.gz -> ../../../../path/to/more/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 43 Oct 19 13:43 2/foobar_R2.fastq.gz -> ../../../../path/to/more/foobar_R2.fastq.gz
bazquux,female:
1/bazquux_R1.fastq.gz,1/bazquux_R2.fastq.gz
lrwxrwxrwx 1 users 39 Oct 19 13:43 1/bazquux_R1.fastq.gz -> ../../../../path/to/bazquux_R1.fastq.gz
lrwxrwxrwx 1 users 39 Oct 19 13:43 1/bazquux_R2.fastq.gz -> ../../../../path/to/bazquux_R2.fastq.gz

perigee and apogee calculations off by a few minutes

I'm trying to calculate perigee and apogee (or apsides in general, given a second body such as the Sun, a planet, etc.):
from skyfield import api, almanac
from scipy.signal import argrelextrema
import numpy as np

ts = api.load.timescale()
e = api.load('de430t.bsp')

def apsis(year=2019, body='moon'):
    apogees = dict()
    perigees = dict()
    planets = e
    earth, moon = planets['earth'], planets[body]
    t = ts.utc(year, 1, range(1, 367))
    dt = t.utc_datetime()
    astrometric = earth.at(t).observe(moon)
    _, _, distance = astrometric.radec()

    # find perigees (local minima of distance), at day precision
    localmins = argrelextrema(distance.km, np.less)[0]
    for i in localmins:
        # refine to minute precision
        t2 = ts.utc(dt[i].year, dt[i].month, dt[i].day - 1, 0, range(2881))
        dt2 = t2.utc_datetime()  # _and_leap_second()
        astrometric2 = earth.at(t2).observe(moon)
        _, _, distance2 = astrometric2.radec()
        m = min(distance2.km)
        daindex = list(distance2.km).index(m)
        perigees[dt2[daindex]] = m

    # find apogees (local maxima of distance), at day precision
    localmaxes = argrelextrema(distance.km, np.greater)[0]
    for i in localmaxes:
        # refine to minute precision
        t2 = ts.utc(dt[i].year, dt[i].month, dt[i].day - 1, 0, range(2881))
        dt2 = t2.utc_datetime()
        astrometric2 = earth.at(t2).observe(moon)
        _, _, distance2 = astrometric2.radec()
        m = max(distance2.km)
        daindex = list(distance2.km).index(m)
        apogees[dt2[daindex]] = m
    return apogees, perigees
When I run this for 2019, the next apogee comes out at 2019-09-13 13:16. This differs by several minutes from tables such as John Walker's (13:33), Fred Espenak's (13:32), and timeanddate.com's (13:32).
I'd expect a difference of a minute or so between sources, for reasons such as rounding vs. truncation of seconds, but more than 15 minutes of difference seems unusual. I've tried this with the de431t and de421 ephemerides with similar results.
What's the difference here? I'm calculating the distance between the centers of each body, right? What am I screwing up?
After a bit more research and comparing Skyfield's output to the output of JPL's HORIZONS, it appears that Skyfield is correct in its calculations, at least against the JPL ephemeris (no surprise there).
I switched the above code snippet to use the same (massive) de432t SPICE kernel used by HORIZONS. This lines up with the HORIZONS output (see below; the apogees reported by the various sources are marked): right after 13:16 the Moon stops receding and starts moving back toward Earth, i.e. deldot, the range-rate between the observer (geocentric Earth) and the target body (geocentric Moon), goes negative.
Ephemeris / WWW_USER Fri Sep 13 17:05:39 2019 Pasadena, USA / Horizons
*******************************************************************************
Target body name: Moon (301) {source: DE431mx}
Center body name: Earth (399) {source: DE431mx}
Center-site name: GEOCENTRIC
*******************************************************************************
Start time : A.D. 2019-Sep-13 13:10:00.0000 UT
Stop time : A.D. 2019-Sep-13 13:35:00.0000 UT
Step-size : 1 minutes
*******************************************************************************
Target pole/equ : IAU_MOON {East-longitude positive}
Target radii : 1737.4 x 1737.4 x 1737.4 km {Equator, meridian, pole}
Center geodetic : 0.00000000,0.00000000,0.0000000 {E-lon(deg),Lat(deg),Alt(km)}
Center cylindric: 0.00000000,0.00000000,0.0000000 {E-lon(deg),Dxy(km),Dz(km)}
Center pole/equ : High-precision EOP model {East-longitude positive}
Center radii : 6378.1 x 6378.1 x 6356.8 km {Equator, meridian, pole}
Target primary : Earth
Vis. interferer : MOON (R_eq= 1737.400) km {source: DE431mx}
Rel. light bend : Sun, EARTH {source: DE431mx}
Rel. lght bnd GM: 1.3271E+11, 3.9860E+05 km^3/s^2
Atmos refraction: NO (AIRLESS)
RA format : HMS
Time format : CAL
EOP file : eop.190912.p191204
EOP coverage : DATA-BASED 1962-JAN-20 TO 2019-SEP-12. PREDICTS-> 2019-DEC-03
Units conversion: 1 au= 149597870.700 km, c= 299792.458 km/s, 1 day= 86400.0 s
Table cut-offs 1: Elevation (-90.0deg=NO ),Airmass (>38.000=NO), Daylight (NO )
Table cut-offs 2: Solar elongation ( 0.0,180.0=NO ),Local Hour Angle( 0.0=NO )
Table cut-offs 3: RA/DEC angular rate ( 0.0=NO )
*******************************************************************************
Date__(UT)__HR:MN delta deldot
***************************************************
$$SOE
2019-Sep-13 13:10 0.00271650099697 0.0000340
2019-Sep-13 13:11 0.00271650100952 0.0000286
2019-Sep-13 13:12 0.00271650101990 0.0000232
2019-Sep-13 13:13 0.00271650102812 0.0000178
2019-Sep-13 13:14 0.00271650103417 0.0000124
2019-Sep-13 13:15 0.00271650103805 0.0000070
2019-Sep-13 13:16 0.00271650103977 0.0000016 <----- Skyfield, HORIZONS
2019-Sep-13 13:17 0.00271650103932 -0.0000038
2019-Sep-13 13:18 0.00271650103670 -0.0000092
2019-Sep-13 13:19 0.00271650103191 -0.0000146
2019-Sep-13 13:20 0.00271650102496 -0.0000200
2019-Sep-13 13:21 0.00271650101585 -0.0000254
2019-Sep-13 13:22 0.00271650100456 -0.0000308
2019-Sep-13 13:23 0.00271650099112 -0.0000362
2019-Sep-13 13:24 0.00271650097550 -0.0000416
2019-Sep-13 13:25 0.00271650095772 -0.0000470
2019-Sep-13 13:26 0.00271650093778 -0.0000524
2019-Sep-13 13:27 0.00271650091566 -0.0000578
2019-Sep-13 13:28 0.00271650089139 -0.0000632
2019-Sep-13 13:29 0.00271650086494 -0.0000686
2019-Sep-13 13:30 0.00271650083633 -0.0000740
2019-Sep-13 13:31 0.00271650080556 -0.0000794
2019-Sep-13 13:32 0.00271650077262 -0.0000848 <------ Espenak, T&D.com
2019-Sep-13 13:33 0.00271650073751 -0.0000902
2019-Sep-13 13:34 0.00271650070024 -0.0000956
2019-Sep-13 13:35 0.00271650066081 -0.0001010
$$EOE
Looking at Espenak's page a bit more, his calculations are based on Jean Meeus' Astronomical Algorithms book (a must-have for anyone who plays with this stuff). The lunar ephemeris in that book comes from Jean Chapront's ELP2000/82, which, while it has been fitted to DE430 (among others), remains a separate analytical theory.
Sure enough, when using that ELP2000 model to find the maximum lunar distance on Sept 13 2019, you get 2019-09-13 13:34. See the code below.
Meeus based his formulae on the 1982 version of the Ephemeride Lunaire Parisienne, and the source code below leverages the 2002 update by Chapront, but it is pretty much what those other sources are coming up with.
So I think my answer is, they are different answers because they are using different models. Skyfield is leveraging the models represented as numerical integrations by the JPL Development ephemeris while ELP is a more analytical approach.
In the end I realize it's a nit-pick; I just wanted to better understand the tools I'm using. But it raises the question: which approach is more accurate?
From what I've read, DE430 and its variants have been fit to observational data, namely Lunar Laser Ranging (LLR) measurements. If just for that LLR consideration, I think I'll stick with Skyfield for calculating lunar distance.
from elp_mpp02 import mpp02 as mpp
import julian
import pytz
import datetime

def main():
    mpp.dataDir = 'ELPmpp02'
    mode = 1  # Historical mode

    maxdist = 0
    apogee = None
    for x in range(10, 41):
        dt = datetime.datetime(2019, 9, 13, 13, x, tzinfo=pytz.timezone("UTC"))
        jd = julian.to_jd(dt, fmt='jd')
        lon, lat, dist = mpp.compute_lbr(jd, mode)
        if dist > maxdist:
            maxdist = dist
            apogee = dt
    print(f"{maxdist:.2f} {apogee}")

if __name__ == "__main__":
    main()

Queue of multiprocessing doesn't work well with gevent

It's a producer-and-worker workflow using multiprocessing and gevent. I want to share some data between processes with a multiprocessing Queue, and at the same time have gevent producers and workers get data from and put tasks onto the queues.
task1_producer generates some data and puts it into q1.
task1_worker consumes the data from q1 and puts generated data into q2 and q3.
Then task2 does the same.
But the problem here is that although data has been inserted into q2 and q3, nothing happens in task2. If you add some logging to task2, you will find that q3 is empty.
Why does this happen? What's the best way to share data between processes?
from multiprocessing import Value, Process, Queue
# from gevent.queue import Queue
from gevent import monkey, spawn, joinall
monkey.patch_all()  # Magic!

import requests
import json
import time
import logging
from logging.config import fileConfig

def configure():
    logging.basicConfig(level=logging.DEBUG,
                        format="%(asctime)s - %(module)s - line %(lineno)d - process-id %(process)d - (%(threadName)-5s)- %(levelname)s - %(message)s")
    # fileConfig(log_file_path)
    return logging

logger = configure().getLogger(__name__)

def task2(q2, q3):
    crawl = task2_class(q2, q3)
    crawl.run()

class task2_class:
    def __init__(self, q2, q3):
        self.q2 = q2
        self.q3 = q3

    def task2_producer(self):
        while not self.q2.empty():
            logger.debug("comment_weibo_id_queue not empty")
            task_q2 = self.q2.get()
            logger.debug("task_q2 is {}".format(task_q2))
            self.q4.put(task_q2)

    def worker(self):
        while not self.q3.empty():
            logger.debug("q3 not empty")
            data_q3 = self.q3.get()
            print(data_q3)

    def run(self):
        spawn(self.task2_producer).join()
        joinall([spawn(self.worker) for _ in range(40)])

def task1(user_id, q1, q2, q3):
    task = task1_class(user_id, q1, q2, q3)
    task.run()

class task1_class:
    def __init__(self, user_id, q1, q2, q3):
        self.user_id = user_id
        self.q1 = q1
        self.q2 = q2
        self.q3 = q3
        logger.debug(self.user_id)

    def task1_producer(self):
        for data in range(20):
            self.q1.put(data)
            logger.debug("{} has been put into q1".format(data))

    def task1_worker(self):
        while not self.q1.empty():
            data = self.q1.get()
            logger.debug("task1_worker data is {}".format(data))
            self.q2.put(data)
            logger.debug("{} has been inserted to q2".format(data))
            self.q3.put(data)
            logger.debug("{} has been inserted to q3".format(data))

    def run(self):
        spawn(self.task1_producer).join()
        joinall([spawn(self.task1_worker) for _ in range(40)])

if __name__ == "__main__":
    q1 = Queue()
    q2 = Queue()
    q3 = Queue()
    p2 = Process(target=task1, args=("user_id", q1, q2, q3,))
    p3 = Process(target=task2, args=(q2, q3))
    p2.start()
    p3.start()
    p2.join()
    p3.join()
some logs
2017-05-17 13:46:40,222 - demo - line 78 - process-id 13269 - (DummyThread-12)- DEBUG - 10 has been inserted to q3
2017-05-17 13:46:40,222 - demo - line 78 - process-id 13269 - (DummyThread-13)- DEBUG - 11 has been inserted to q3
2017-05-17 13:46:40,222 - demo - line 78 - process-id 13269 - (DummyThread-14)- DEBUG - 12 has been inserted to q3
2017-05-17 13:46:40,222 - demo - line 78 - process-id 13269 - (DummyThread-15)- DEBUG - 13 has been inserted to q3
2017-05-17 13:46:40,222 - demo - line 78 - process-id 13269 - (DummyThread-16)- DEBUG - 14 has been inserted to q3
2017-05-17 13:46:40,223 - demo - line 78 - process-id 13269 - (DummyThread-17)- DEBUG - 15 has been inserted to q3
2017-05-17 13:46:40,223 - demo - line 78 - process-id 13269 - (DummyThread-18)- DEBUG - 16 has been inserted to q3
2017-05-17 13:46:40,223 - demo - line 78 - process-id 13269 - (DummyThread-19)- DEBUG - 17 has been inserted to q3
2017-05-17 13:46:40,223 - demo - line 78 - process-id 13269 - (DummyThread-20)- DEBUG - 18 has been inserted to q3
2017-05-17 13:46:40,223 - demo - line 78 - process-id 13269 - (DummyThread-21)- DEBUG - 19 has been inserted to q3
[Finished in 0.4s]
gevent's patch_all is incompatible with multiprocessing.Queue. Specifically, patch_all calls patch_thread by default, and patch_thread is documented to have issues with multiprocessing.Queue.
If you want to use multiprocessing.Queue, you can pass thread=False as an argument to patch_all, or just use the specific patch functions that you need, e.g., patch_socket(). (This assumes that you don't need monkey-patched threads, of course, which your example doesn't use.)
Alternatively, you could consider an external queue like Redis, or directly passing data across (unix, probably) sockets (which is what multiprocessing.Queue does under the covers). Admittedly, both are more complex.
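A minimal sketch of that first suggestion, skipping the thread patch so multiprocessing.Queue keeps working (the producer/consumer shown here is an illustration, not the original code):

from gevent import monkey
monkey.patch_all(thread=False)  # leave real threads alone so multiprocessing.Queue works

from multiprocessing import Process, Queue
from gevent import spawn, joinall

def consumer(q):
    # runs in a child process; exits when the sentinel arrives
    while True:
        item = q.get()
        if item is None:
            break
        print('consumed', item)

if __name__ == '__main__':
    q = Queue()
    p = Process(target=consumer, args=(q,))
    p.start()
    # greenlets in the parent feed the cross-process queue
    joinall([spawn(q.put, i) for i in range(5)])
    q.put(None)  # sentinel
    p.join()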

How to pull EOD stock data from Yahoo Finance for exactly the last 20 WORKING days using Pandas in Python 2.7

Right now what I am doing is pulling data for the last 30 days, storing it in a dataframe, and then picking out the last 20 days to use. However, if one of the days in the last 20 is a holiday, Yahoo shows the Volume for that day as 0 and fills the OHLC columns (Open, High, Low, Close, Adj Close) with the previous day's Adj Close. In the example shown below, the data for 2016-01-26 is invalid and I don't want to retrieve it.
So how do I pull data from Yahoo for exactly the last 20 working days?
My present code is below:
from datetime import date, datetime, timedelta
import pandas_datareader.data as web
todays_date = date.today()
n = 30
date_n_days_ago = date.today() - timedelta(days=n)
yahoo_data = web.DataReader('ACC.NS', 'yahoo', date_n_days_ago, todays_date)
yahoo_data_20_day = yahoo_data.tail(20)
IIUC you can add a filter to keep only the rows where the Volume column is not 0:
from datetime import date, datetime, timedelta
import pandas_datareader.data as web
todays_date = date.today()
n = 30
date_n_days_ago = date.today() - timedelta(days=n)
yahoo_data = web.DataReader('ACC.NS', 'yahoo', date_n_days_ago, todays_date)
#add filter - get data, where column Volume is not 0
yahoo_data = yahoo_data[yahoo_data.Volume != 0]
yahoo_data_20_day = yahoo_data.tail(20)
print yahoo_data_20_day
Open High Low Close Volume Adj Close
Date
2016-01-20 1218.90 1229.00 1205.00 1212.25 156300 1206.32
2016-01-21 1225.00 1236.95 1211.25 1228.45 209200 1222.44
2016-01-22 1239.95 1256.65 1230.05 1241.00 123200 1234.93
2016-01-25 1250.00 1263.50 1241.05 1245.00 124500 1238.91
2016-01-27 1249.00 1250.00 1228.00 1230.35 112800 1224.33
2016-01-28 1232.40 1234.90 1208.00 1214.95 134500 1209.00
2016-01-29 1220.10 1253.50 1216.05 1240.05 254400 1233.98
2016-02-01 1245.00 1278.90 1240.30 1271.85 210900 1265.63
2016-02-02 1266.80 1283.00 1253.05 1261.35 204600 1255.18
2016-02-03 1244.00 1279.00 1241.45 1248.95 191000 1242.84
2016-02-04 1255.25 1277.40 1253.20 1270.40 205900 1264.18
2016-02-05 1267.05 1286.00 1259.05 1271.40 231300 1265.18
2016-02-08 1271.00 1309.75 1270.15 1280.60 218500 1274.33
2016-02-09 1271.00 1292.85 1270.00 1279.10 148600 1272.84
2016-02-10 1270.00 1278.25 1250.05 1265.85 256800 1259.66
2016-02-11 1250.00 1264.70 1225.50 1234.00 231500 1227.96
2016-02-12 1234.20 1242.65 1199.10 1221.05 212000 1215.07
2016-02-15 1230.00 1268.70 1228.35 1256.55 130800 1250.40
2016-02-16 1265.00 1273.10 1225.00 1227.80 144700 1221.79
2016-02-17 1222.80 1233.50 1204.00 1226.05 165000 1220.05
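If the 30-day window itself ever contains fewer than 20 valid sessions (e.g. around long holiday stretches), one option is to widen the lookback until enough rows survive the filter; a minimal sketch following the same filter as above (the function name and the 10-day increment are just illustrative choices):

from datetime import date, timedelta
import pandas_datareader.data as web

def last_n_trading_days(symbol, n=20, lookback=30):
    # widen the window until n non-holiday rows remain after filtering
    while True:
        start = date.today() - timedelta(days=lookback)
        df = web.DataReader(symbol, 'yahoo', start, date.today())
        df = df[df.Volume != 0]  # drop holiday placeholder rows
        if len(df) >= n:
            return df.tail(n)
        lookback += 10

print last_n_trading_days('ACC.NS')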

Need Help Parsing File for This Pattern "Feb 06 2010 15:49:00.017 MCO"

I need to parse a file for lines of data that start with this pattern "Feb 06 2010 15:49:00.017 MCO", where MCO could be any 3-letter ID, and return the entire record for the line. I think I can get the first part, but returning the rest of the line is where I get lost.
Here is some sample data.
Feb 06 2010 15:49:00.017 MCO -I -I -I -I 0.34 527 0.26 0.24 184 Tentative 0.00 0 Radar Only -RDR- - - - - No 282356N 0811758W - 3-3
Feb 06 2010 15:49:00.017 MLB -I -I -I -I 44.31 3175 -10.05 -10.05 216 Established 0.00 0 Radar Only -RDR- - - - - No 281336N 0812939W - 2-
Feb 06 2010 15:49:00.018 MLB -I -I -I -I 44.31 3175 -10.05 -10.05 216 Established 15.51 99 Radar Only -RDR- - - - - No 281336N 0812939W - 2-
Feb 06 2010 15:49:00.023 QML N856 7437-V -I 62-V 61-V 67.00 3420 -30.93 15.34 534 Established 328.53 129 Reinforced - - - - - - No 283900N 0815325W - -
Feb 06 2010 15:49:00.023 QML N516SP 0723-V -I 22-V 21-V 42.25 3460 -8.19 5.03 146 Established 243.93 83 Beacon Only - - - - - - No 282844N 0812734W - -
Feb 06 2010 15:49:00.023 QML 2247-V -I 145-V 144-V 78.88 3443 -39.68 23.68 676 Established 177.66 368 Reinforced - - - - - - No 284719N 0820325W - -
Feb 06 2010 15:49:00.023 MLB 1200-V -I 15-V 14-V 45.25 3015 -11.32 -20.97 475 Established 349.68 88 Beacon Only - - - - - - No 280239N 0813104W - -
Feb 06 2010 15:49:00.023 MLB 1011-V -I 91-V 90-V 94.50 3264 -56.77 10.21 698 Established 152.28 187 Beacon Only - - - - - - No 283341N 0822244W - -
It seems like your date plus the 3-character ID are always the first 5 fields (with space as the delimiter). Just go through the file, split each line on spaces, and take the first 5 fields:
s = Split(strLineOfFile, " ")
wscript.echo s(0), s(1), s(2), s(3), s(4)
No regex needed.
From your sample data it seems that you don't have to check for the presence of a three letter identifier following the date -- it's always there. Add a final three letters to the regex if that's not a valid assumption. Also, add more grouping as needed for regex groups to be useful to you. Anyway:
import re
dtre = re.compile(r'^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [0-9]{2} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3}')
[line for line in file if dtre.match(line)]
Wrap it in a with statement or whatever to open your file, then do any processing you need on the list this builds up.
Another possibility would be to use a generator expression instead of a list comprehension (replace the outer [ and ] with ( and ) to do so). This is useful if you're outputting results to somewhere as you go, the file is large and you don't need to have it all in memory for different purposes. Just be sure not to close the file before you consume the entire generator if you go with this approach!
Also, you could use datetime's built-in parsing facility:
import datetime

for line in file:
    try:
        # the line[:24] bit assumes you're always going to have a three-digit
        # µs part
        dt = datetime.datetime.strptime(line[:24], '%b %d %Y %H:%M:%S.%f')
    except ValueError:
        # a ValueError means the beginning of the line isn't parseable as a datetime
        continue
    # do something with the line; the datetime is already parsed and stored in dt
That's probably better if you're going to create the datetime.datetime object anyway.
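Putting the pieces together, a minimal sketch that also captures the 3-letter ID and keeps the whole matching record (the filename radar.log is just a placeholder):

import re

# anchor the timestamp at the start of the line and capture the 3-letter ID
pattern = re.compile(
    r'^(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) '
    r'[0-9]{2} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3} ([A-Z]{3})\b')

with open('radar.log') as f:
    for line in f:
        m = pattern.match(line)
        if m:
            print(m.group(1), line.rstrip())  # ID plus the entire record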