Queue of multiprocessing doesn't work well with gevent - process

It's a producer/worker workflow built with multiprocessing and gevent. I want to share some data between processes with a multiprocessing Queue, and at the same time have gevent producers and workers get data from and put tasks onto those queues.
task1_producer generates some data and puts it into q1.
task1_worker consumes the data from q1 and puts generated data into q2 and q3.
Then task2 does the same.
The problem is that data has been inserted into q2 and q3, but nothing happens in task2. If you add some logging in task2, you will find that q3 is empty.
Why does this happen? What's the best way to share data between processes?
from multiprocessing import Value, Process, Queue
# from gevent.queue import Queue
from gevent import monkey, spawn, joinall
monkey.patch_all()  # Magic!
import requests
import json
import time
import logging
from logging.config import fileConfig


def configure():
    logging.basicConfig(level=logging.DEBUG,
                        format="%(asctime)s - %(module)s - line %(lineno)d - process-id %(process)d - (%(threadName)-5s)- %(levelname)s - %(message)s")
    # fileConfig(log_file_path)
    return logging


logger = configure().getLogger(__name__)


def task2(q2, q3):
    crawl = task2_class(q2, q3)
    crawl.run()


class task2_class:
    def __init__(self, q2, q3):
        self.q2 = q2
        self.q3 = q3

    def task2_producer(self):
        while not self.q2.empty():
            logger.debug("comment_weibo_id_queue not empty")
            task_q2 = self.q2.get()
            logger.debug("task_q2 is {}".format(task_q2))
            self.q3.put(task_q2)

    def worker(self):
        while not self.q3.empty():
            logger.debug("q3 not empty")
            data_q3 = self.q3.get()
            print(data_q3)

    def run(self):
        spawn(self.task2_producer).join()
        joinall([spawn(self.worker) for _ in range(40)])


def task1(user_id, q1, q2, q3):
    task = task1_class(user_id, q1, q2, q3)
    task.run()


class task1_class:
    def __init__(self, user_id, q1, q2, q3):
        self.user_id = user_id
        self.q1 = q1
        self.q2 = q2
        self.q3 = q3
        logger.debug(self.user_id)

    def task1_producer(self):
        for data in range(20):
            self.q1.put(data)
            logger.debug("{} has been put into q1".format(data))

    def task1_worker(self):
        while not self.q1.empty():
            data = self.q1.get()
            logger.debug("task1_worker data is {}".format(data))
            self.q2.put(data)
            logger.debug("{} has been inserted to q2".format(data))
            self.q3.put(data)
            logger.debug("{} has been inserted to q3".format(data))

    def run(self):
        spawn(self.task1_producer).join()
        joinall([spawn(self.task1_worker) for _ in range(40)])


if __name__ == "__main__":
    q1 = Queue()
    q2 = Queue()
    q3 = Queue()
    p2 = Process(target=task1, args=("user_id", q1, q2, q3))
    p3 = Process(target=task2, args=(q2, q3))
    p2.start()
    p3.start()
    p2.join()
    p3.join()
Some logs:
2017-05-17 13:46:40,222 - demo - line 78 - process-id 13269 - (DummyThread-12)- DEBUG - 10 has been inserted to q3
2017-05-17 13:46:40,222 - demo - line 78 - process-id 13269 - (DummyThread-13)- DEBUG - 11 has been inserted to q3
2017-05-17 13:46:40,222 - demo - line 78 - process-id 13269 - (DummyThread-14)- DEBUG - 12 has been inserted to q3
2017-05-17 13:46:40,222 - demo - line 78 - process-id 13269 - (DummyThread-15)- DEBUG - 13 has been inserted to q3
2017-05-17 13:46:40,222 - demo - line 78 - process-id 13269 - (DummyThread-16)- DEBUG - 14 has been inserted to q3
2017-05-17 13:46:40,223 - demo - line 78 - process-id 13269 - (DummyThread-17)- DEBUG - 15 has been inserted to q3
2017-05-17 13:46:40,223 - demo - line 78 - process-id 13269 - (DummyThread-18)- DEBUG - 16 has been inserted to q3
2017-05-17 13:46:40,223 - demo - line 78 - process-id 13269 - (DummyThread-19)- DEBUG - 17 has been inserted to q3
2017-05-17 13:46:40,223 - demo - line 78 - process-id 13269 - (DummyThread-20)- DEBUG - 18 has been inserted to q3
2017-05-17 13:46:40,223 - demo - line 78 - process-id 13269 - (DummyThread-21)- DEBUG - 19 has been inserted to q3
[Finished in 0.4s]

gevent's patch_all is incompatible with multiprocessing.Queue. Specifically, patch_all calls patch_thread by default, and patch_thread is documented to have issues with multiprocessing.Queue.
If you want to use multiprocessing.Queue, you can pass thread=False as an argument to patch_all, or just use the specific patch functions that you need, e.g., patch_socket(). (This assumes that you don't need monkey-patched threads, of course, which your example doesn't use.)
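For example, a minimal sketch of the first option; only the import/patch lines of the question's script would change:

from multiprocessing import Process, Queue
from gevent import monkey, spawn, joinall
monkey.patch_all(thread=False)  # leave threading unpatched so multiprocessing.Queue keeps working

With threading left unpatched, the Queue objects shared between the two processes should deliver data normally.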
Alternatively, you could consider an external queue like Redis, or directly passing data across (unix, probably) sockets (which is what multiprocessing.Queue does under the covers). Admittedly, both are more complex.
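For instance, a minimal sketch of the Redis option using the redis-py package (the server location and the queue name here are assumptions, not part of the question's setup):

import redis

r = redis.Redis(host="localhost", port=6379)  # assumes a Redis server running locally
r.rpush("q3", "some task data")               # producer side: append to a Redis list
item = r.blpop("q3", timeout=5)               # worker side: blocking pop, works from any process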

Related

Is there a way to implement equations as Dymos path constraints?

For example, if I have a function h_max(mach) and I want the altitude to always respect this predefined altitude-Mach relationship throughout the flight envelope, how could I implement this?
I have tried calculating the limit quantity (in this case, h_max) as its own state, then calculating another state as h_max - h and constraining that through a path constraint to be greater than 0. This approach has worked, but it involved two explicit components, a group, and a lot of extra coding just to get one constraint working. I was wondering if there is a better way?
Thanks so much in advance.
The next version of Dymos, 1.7.0, will be released soon and will support this.
In the meantime, you can install the latest development version of Dymos directly from GitHub to get access to this capability:
python -m pip install git+https://github.com/OpenMDAO/dymos.git
Then, you can define boundary and path constraints with an equation. Note the equation must have an equals sign in it, and then lower, upper, or equals will apply to the result of the equation.
In reality, dymos is just inserting an OpenMDAO ExecComp for you under the hood, so the one caveat to this is that your expression must be compatible with complex-step differentiation.
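As an illustrative aside (this is a general property of complex-step differentiation, not something taken from the Dymos docs): smooth expressions are safe, while non-smooth functions such as abs() discard the imaginary perturbation and so break the derivative of the inserted ExecComp. For example:

phase.add_path_constraint('path_y = (y - 5.0)**2', lower=0.0)    # smooth: complex-step safe
# phase.add_path_constraint('path_y = abs(y - 5.0)', lower=0.0)  # abs() would break complex step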
Here's an example of the brachistochrone that uses constraint expressions to set the final y value to a specific value while satisfying a path constraint defined with a second equation.
import openmdao.api as om
import dymos as dm
from dymos.examples.plotting import plot_results
from dymos.examples.brachistochrone import BrachistochroneODE
import matplotlib.pyplot as plt

#
# Initialize the Problem and the optimization driver
#
p = om.Problem(model=om.Group())
p.driver = om.ScipyOptimizeDriver()
p.driver.declare_coloring()

#
# Create a trajectory and add a phase to it
#
traj = p.model.add_subsystem('traj', dm.Trajectory())

phase = traj.add_phase('phase0',
                       dm.Phase(ode_class=BrachistochroneODE,
                                transcription=dm.GaussLobatto(num_segments=10)))

#
# Set the variables
#
phase.set_time_options(fix_initial=True, duration_bounds=(.5, 10))

phase.add_state('x', fix_initial=True, fix_final=True)
phase.add_state('y', fix_initial=True, fix_final=False)
phase.add_state('v', fix_initial=True, fix_final=False)

phase.add_control('theta', continuity=True, rate_continuity=True,
                  units='deg', lower=0.01, upper=179.9)

phase.add_parameter('g', units='m/s**2', val=9.80665)

Y_FINAL = 5.0
Y_MIN = 5.0

phase.add_boundary_constraint(f'bcf_y = y - {Y_FINAL}', loc='final', equals=0.0)
phase.add_path_constraint(f'path_y = y - {Y_MIN}', lower=0.0)

#
# Minimize time at the end of the phase
#
phase.add_objective('time', loc='final', scaler=10)

p.model.linear_solver = om.DirectSolver()

#
# Setup the Problem
#
p.setup()

#
# Set the initial values
#
p['traj.phase0.t_initial'] = 0.0
p['traj.phase0.t_duration'] = 2.0

p.set_val('traj.phase0.states:x', phase.interp('x', ys=[0, 10]))
p.set_val('traj.phase0.states:y', phase.interp('y', ys=[10, 5]))
p.set_val('traj.phase0.states:v', phase.interp('v', ys=[0, 9.9]))
p.set_val('traj.phase0.controls:theta', phase.interp('theta', ys=[5, 100.5]))

#
# Solve for the optimal trajectory
#
dm.run_problem(p)

# Check the results
print('final time')
print(p.get_val('traj.phase0.timeseries.time')[-1])

p.list_problem_vars()
Note the constraints from the list_problem_vars() call that come from timeseries_exec_comp - this is the OpenMDAO ExecComp that Dymos automatically inserts for you.
--- Constraint Report [traj] ---
--- phase0 ---
[final] 0.0000e+00 == bcf_y [None]
[path] 0.0000e+00 <= path_y [None]
/usr/local/lib/python3.8/dist-packages/openmdao/recorders/sqlite_recorder.py:227: UserWarning:The existing case recorder file, dymos_solution.db, is being overwritten.
Model viewer data has already been recorded for Driver.
Full total jacobian was computed 3 times, taking 0.057485 seconds.
Total jacobian shape: (71, 51)
Jacobian shape: (71, 51) (12.51% nonzero)
FWD solves: 12 REV solves: 0
Total colors vs. total size: 12 vs 51 (76.5% improvement)
Sparsity computed using tolerance: 1e-25
Time to compute sparsity: 0.057485 sec.
Time to compute coloring: 0.054118 sec.
Memory to compute coloring: 0.000000 MB.
/usr/local/lib/python3.8/dist-packages/openmdao/core/total_jac.py:1585: DerivativesWarning:Constraints or objectives [('traj.phases.phase0.timeseries.timeseries_exec_comp.path_y', inds=[(0, 0)])] cannot be impacted by the design variables of the problem.
Optimization terminated successfully (Exit mode 0)
Current function value: [18.02999766]
Iterations: 14
Function evaluations: 14
Gradient evaluations: 14
Optimization Complete
-----------------------------------
final time
[1.80299977]
----------------
Design Variables
----------------
name val size indices
-------------------------- -------------- ---- ---------------------------------------------
traj.phase0.t_duration [1.80299977] 1 None
traj.phase0.states:x |12.14992234| 9 [1 2 3 4 5 6 7 8 9]
traj.phase0.states:y |22.69124774| 10 [ 1 2 3 4 5 6 7 8 9 10]
traj.phase0.states:v |24.46289861| 10 [ 1 2 3 4 5 6 7 8 9 10]
traj.phase0.controls:theta |266.48489386| 21 [ 0 1 2 3 4 5 ... 4 15 16 17 18 19 20]
-----------
Constraints
-----------
name val size indices alias
----------------------------------------------------------- ------------- ---- --------------------------------------------- ----------------------------------------------------
timeseries.timeseries_exec_comp.bcf_y [0.] 1 [29] traj.phases.phase0->final_boundary_constraint->bcf_y
timeseries.timeseries_exec_comp.path_y |15.73297378| 30 [ 0 1 2 3 4 5 ... 3 24 25 26 27 28 29] traj.phases.phase0->path_constraint->path_y
traj.phase0.collocation_constraint.defects:x |6e-08| 10 None None
traj.phase0.collocation_constraint.defects:y |7e-08| 10 None None
traj.phase0.collocation_constraint.defects:v |3e-08| 10 None None
traj.phase0.continuity_comp.defect_control_rates:theta_rate |0.0| 9 None None
----------
Objectives
----------
name val size indices
------------- ------------- ---- -------
traj.phase0.t [18.02999766] 1 -1

How do I add a line break to an external text log file from a pentaho transform?

I'm using Pentaho PDI (Spoon). I have a transformation that compares two database tables (from a query selecting year and quarters within those tables). It feeds a Merge Rows (diff) step into a Filter Rows step on whether the flagfield is identical; matches are logged on success and non-matches on failure, each through its own Text File Output step.
My issue is that the external log file gets appended to without separation and ends up looking like this:
412542 - 21 - 4 - deleted - DOMAIN1
461623 - 22 - 1 - deleted - DOMAIN1
^failuresDOMAIN1 - 238388 - 12 - 4 - identical
DOMAIN1- 223016 - 13 - 1 - identical
DOMAIN1- 171764 - 13 - 2 - identical
DOMAIN1- 185569 - 13 - 3 - identical
DOMAIN1- 232247 - 13 - 4 - identical
DOMAIN1- 260057 - 14 - 1 - identical
^successes
I want this output:
412542 - 21 - 4 - deleted - DOMAIN1
461623 - 22 - 1 - deleted - DOMAIN1
^failures
DOMAIN1 - 238388 - 12 - 4 - identical
DOMAIN1- 223016 - 13 - 1 - identical
DOMAIN1- 171764 - 13 - 2 - identical
DOMAIN1- 185569 - 13 - 3 - identical
DOMAIN1- 232247 - 13 - 4 - identical
DOMAIN1- 260057 - 14 - 1 - identical
^successes
notice the line breaks between the successes and failures
I've tried adding a Data Grid step with a "line_break" string column that's simply a newline, then passing it to each Text File Output step so the "line_break" value gets written out, which helps, but I can't seem to sequence the transformation steps because they run in parallel...

Does SimPy support optimized dynamic resource distribution across multiple nodes?

I have two nodes, 0 and 1, and in total there are 12 resources which will serve at nodes 0 and 1. Is there a method in SimPy to schedule the 12 resources across nodes 0 and 1 so that the average total processing time of an item through node 0 followed by node 1 is minimized? From time to time, resources can move from one node to another for serving. Below is the code where I have come up with a static distribution of 5 resources at node 0 and 7 resources at node 1. How can I make it dynamic over time?
import numpy as np
import simpy


def interarrival():
    return np.random.exponential(20)


def servicetime():
    return np.random.exponential(60)


def servicing(env, servers_array):
    i = 0
    while True:
        i = i + 1
        yield env.timeout(interarrival())
        print("Customer " + str(i) + " arrived in the process at " + str(env.now))
        state = 0
        env.process(items(env, i, servers_array, state))


def items(env, customer_id, servers_array, state):
    with servers_array[state].request() as request:
        yield request
        t_arrival = env.now
        print("Customer " + str(customer_id) + " arrived in " + str(state) + " at " + str(t_arrival))
        yield env.timeout(servicetime())
        t_depart = env.now
        print("Customer " + str(customer_id) + " departed from " + str(state) + " at " + str(t_depart))
    if state == 1:
        print("Customer exits")
    else:
        state = 1
        env.process(items(env, customer_id, servers_array, state))


env = simpy.Environment()
servers_array = []
servers_array.append(simpy.Resource(env, capacity=5))
servers_array.append(simpy.Resource(env, capacity=7))
env.process(servicing(env, servers_array))
env.run(until=2880)
If you stick with Resources, start each node with a capacity of 12 and use the delay technique from your last question to hold back some of the resources in each node so that the total number of active resources matches the total you want. Otherwise, you may want to look at Containers and Stores, which allow you to move a resource from one node to another.
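A minimal sketch of the Store idea (the rebalancing policy, the period, and the token names are assumptions, just to show the moving parts): each node holds its idle servers in a simpy.Store, service processes take and return tokens instead of requesting a Resource, and a separate process moves tokens between nodes.

import simpy

def serve_one(env, store, service_time):
    token = yield store.get()        # take an idle server token from this node
    yield env.timeout(service_time)  # hold it for the service duration
    yield store.put(token)           # return the server to the node's pool

def rebalance(env, stores, period=240):
    # Toy policy: periodically move one idle server from the node with
    # more idle servers to the one with fewer.
    while True:
        yield env.timeout(period)
        idle0, idle1 = len(stores[0].items), len(stores[1].items)
        if idle0 > idle1 + 1:
            token = yield stores[0].get()
            yield stores[1].put(token)
        elif idle1 > idle0 + 1:
            token = yield stores[1].get()
            yield stores[0].put(token)

env = simpy.Environment()
stores = [simpy.Store(env), simpy.Store(env)]
for k in range(12):
    stores[0 if k < 5 else 1].put("server-%d" % k)  # start from the static 5/7 split
env.process(rebalance(env, stores))

Each node's service would then yield store.get() / store.put() (as in serve_one above) instead of requesting a Resource. Note that SimPy only provides the mechanics here; finding the split that actually minimizes average total processing time is up to your rebalancing policy.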

Error matplotlib.lines.Line2D object at 0x025B8350

>>> import pylab as pl
>>> x = np.linspace(0,4*np.pi, 100)
>>> pl.plot(x, np.sin(x))
[<matplotlib.lines.Line2D object at 0x025B8350>]
after installing numpy, scipy, sympy, matplotlib, and ipython:
---------------------------------------------------------------------------
TypeError Python 2.7.3: C:\Python27\python.exe
Fri Sep 28 09:59:01 2012
A problem occured executing Python code. Here is the sequence of function
calls leading up to the error, with the most recent (innermost) call last.
C:\Python27\scripts\ipython.py in <module>()
13
14 [or simply IPython.Shell.IPShell().mainloop(1) ]
15
16 and IPython will be your working environment when you start python. The final
17 sys.exit() call will make python exit transparently when IPython finishes, so
18 you don't have an extra prompt to get out of.
19
20 This is probably useful to developers who manage multiple Python versions and
21 don't want to have correspondingly multiple IPython versions. Note that in
22 this mode, there is no way to pass IPython any command-line options, as those
23 are trapped first by Python itself.
24 """
25
26 import IPython.Shell
27
---> 28 IPython.Shell.start().mainloop()
global IPython.Shell.start.mainloop = undefined
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
C:\Python27\lib\site-packages\IPython\Shell.pyc in start(user_ns=None)
1244
1245 # New versions of pygtk don't need the brittle threaded support.
1246 th_mode = check_gtk(th_mode)
1247 return th_shell[th_mode]
1248
1249
1250 # This is the one which should be called by external code.
1251 def start(user_ns = None):
1252 """Return a running shell instance, dealing with threading options.
1253
1254 This is a factory function which will instantiate the proper IPython shell
1255 based on the user's threading choice. Such a selector is needed because
1256 different GUI toolkits require different thread handling details."""
1257
1258 shell = _select_shell(sys.argv)
-> 1259 return shell(user_ns = user_ns)
1260
1261 # Some aliases for backwards compatibility
1262 IPythonShell = IPShell
1263 IPythonShellEmbed = IPShellEmbed
1264 #************************ End of file <Shell.py> ***************************
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
C:\Python27\lib\site-packages\IPython\Shell.pyc in __init__(self=<IPython.Shell.IPShell instance>, argv=None, user_ns=None, user_global_ns=None, debug=1, shell_class=<class 'IPython.iplib.InteractiveShell'>)
58 # Default timeout for waiting for multithreaded shells (in seconds)
59 GUI_TIMEOUT = 10
60
61 #-----------------------------------------------------------------------------
62 # This class is trivial now, but I want to have it in to publish a clean
63 # interface. Later when the internals are reorganized, code that uses this
64 # shouldn't have to change.
65
66 class IPShell:
67 """Create an IPython instance."""
68
69 def __init__(self,argv=None,user_ns=None,user_global_ns=None,
70 debug=1,shell_class=InteractiveShell):
71 self.IP = make_IPython(argv,user_ns=user_ns,
72 user_global_ns=user_global_ns,
---> 73 debug=debug,shell_class=shell_class)
global For = undefined
global more = undefined
global details = undefined
global see = undefined
global the = undefined
global __call__ = undefined
global method = undefined
global below. = undefined
74
75 def mainloop(self,sys_exit=0,banner=None):
76 self.IP.mainloop(banner)
77 if sys_exit:
78 sys.exit()
79
80 #-----------------------------------------------------------------------------
81 def kill_embedded(self,parameter_s=''):
82 """%kill_embedded : deactivate for good the current embedded IPython.
83
84 This function (after asking for confirmation) sets an internal flag so that
85 an embedded IPython will never activate again. This is useful to
86 permanently disable a shell that is being called inside a loop: once you've
87 figured out what you needed from it, you may then kill it and the program
88 will then continue to run without the interactive shell interfering again.
C:\Python27\lib\site-packages\IPython\ipmaker.pyc in make_IPython(argv=[r'C:\Python27\scripts\ipython.py'], user_ns=None, user_global_ns=None, debug=1, rc_override=None, shell_class=<class 'IPython.iplib.InteractiveShell'>, embedded=False, **kw={})
506 # tweaks. Basically options which affect other options. I guess this
507 # should just be written so that options are fully orthogonal and we
508 # wouldn't worry about this stuff!
509
510 if IP_rc.classic:
511 IP_rc.quick = 1
512 IP_rc.cache_size = 0
513 IP_rc.pprint = 0
514 IP_rc.prompt_in1 = '>>> '
515 IP_rc.prompt_in2 = '... '
516 IP_rc.prompt_out = ''
517 IP_rc.separate_in = IP_rc.separate_out = IP_rc.separate_out2 = '0'
518 IP_rc.colors = 'NoColor'
519 IP_rc.xmode = 'Plain'
520
--> 521 IP.pre_config_initialization()
522 # configure readline
523
524 # update exception handlers with rc file status
525 otrap.trap_out() # I don't want these messages ever.
526 IP.magic_xmode(IP_rc.xmode)
527 otrap.release_out()
528
529 # activate logging if requested and not reloading a log
530 if IP_rc.logplay:
531 IP.magic_logstart(IP_rc.logplay + ' append')
532 elif IP_rc.logfile:
533 IP.magic_logstart(IP_rc.logfile)
534 elif IP_rc.log:
535 IP.magic_logstart()
536
C:\Python27\lib\site-packages\IPython\iplib.pyc in pre_config_initialization(self=<IPython.iplib.InteractiveShell object>)
820 self.user_ns, # globals
821 # Skip our own frame in searching for locals:
822 sys._getframe(depth+1).f_locals # locals
823 ))
824
825 def pre_config_initialization(self):
826 """Pre-configuration init method
827
828 This is called before the configuration files are processed to
829 prepare the services the config files might need.
830
831 self.rc already has reasonable default values at this point.
832 """
833 rc = self.rc
834 try:
--> 835 self.db = pickleshare.PickleShareDB(rc.ipythondir + "/db")
global Optional = undefined
global inputs = undefined
836 except exceptions.UnicodeDecodeError:
837 print "Your ipythondir can't be decoded to unicode!"
838 print "Please set HOME environment variable to something that"
839 print r"only has ASCII characters, e.g. c:\home"
840 print "Now it is",rc.ipythondir
841 sys.exit()
842 self.shadowhist = IPython.history.ShadowHist(self.db)
843
844 def post_config_initialization(self):
845 """Post configuration init method
846
847 This is called after the configuration files have been processed to
848 'finalize' the initialization."""
849
850 rc = self.rc
C:\Python27\lib\site-packages\IPython\Extensions\pickleshare.pyc in __init__(self=PickleShareDB('C:\Documents and Settings\martinhylee\_ipython\db'), root=u'C:\\Documents and Settings\\martinhylee\\_ipython/db')
38 import cPickle as pickle
39 import UserDict
40 import warnings
41 import glob
42
43 def gethashfile(key):
44 return ("%02x" % abs(hash(key) % 256))[-2:]
45
46 _sentinel = object()
47
48 class PickleShareDB(UserDict.DictMixin):
49 """ The main 'connection' object for PickleShare database """
50 def __init__(self,root):
51 """ Return a db object that will manage the specied directory"""
52 self.root = Path(root).expanduser().abspath()
---> 53 if not self.root.isdir():
54 self.root.makedirs()
55 # cache has { 'key' : (obj, orig_mod_time) }
56 self.cache = {}
57
58
59 def __getitem__(self,key):
60 """ db['key'] reading """
61 fil = self.root / key
62 try:
63 mtime = (fil.stat()[stat.ST_MTIME])
64 except OSError:
65 raise KeyError(key)
66
67 if fil in self.cache and mtime == self.cache[fil][1]:
68 return self.cache[fil][0]
TypeError: _isdir() takes exactly 1 argument (0 given)
**********************************************************************
Oops, IPython crashed. We do our best to make it stable, but...
A crash report was automatically generated with the following information:
- A verbatim copy of the crash traceback.
- A copy of your input history during this session.
- Data on your current IPython configuration.
It was left in the file named:
'C:\Documents and Settings\martinhylee\_ipython\IPython_crash_report.txt'
If you can email this file to the developers, the information in it will help
them in understanding and correcting the problem.
You can mail it to: Fernando Perez at fperez.net@gmail.com
with the subject 'IPython Crash Report'.
If you want to do it now, the following command will work (under Unix):
mail -s 'IPython Crash Report' fperez.net@gmail.com < C:\Documents and Settings\martinhylee\_ipython\IPython_crash_report.txt
To ensure accurate tracking of this issue, please file a report about it at:
https://bugs.launchpad.net/ipython/+filebug
Error in sys.excepthook:
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\IPython\CrashHandler.py", line 157, in __call__
report.write(self.make_report(traceback))
File "C:\Python27\lib\site-packages\IPython\CrashHandler.py", line 215, in make_report
rpt_add('BZR revision : %s \n\n' % Release.revision)
AttributeError: 'module' object has no attribute 'revision'
Original exception was:
Traceback (most recent call last):
File "C:\Python27\scripts\ipython.py", line 28, in <module>
IPython.Shell.start().mainloop()
File "C:\Python27\lib\site-packages\IPython\Shell.py", line 1259, in start
return shell(user_ns = user_ns)
File "C:\Python27\lib\site-packages\IPython\Shell.py", line 73, in __init__
debug=debug,shell_class=shell_class)
File "C:\Python27\lib\site-packages\IPython\ipmaker.py", line 521, in make_IPython
IP.pre_config_initialization()
File "C:\Python27\lib\site-packages\IPython\iplib.py", line 835, in pre_config_initialization
self.db = pickleshare.PickleShareDB(rc.ipythondir + "/db")
File "C:\Python27\lib\site-packages\IPython\Extensions\pickleshare.py", line 53, in __init__
if not self.root.isdir():
TypeError: _isdir() takes exactly 1 argument (0 given)
The [<matplotlib.lines.Line2D object at 0x025B8350>] line is not an error; it's just the repr of plot()'s return value. You need to call show() to display the figure. Try this:
from pylab import *
x = np.linspace(0, 4 * np.pi, 100)
plot(x, np.sin(x))
show()
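Equivalently, with explicit imports instead of the pylab star import, which is the generally preferred style:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 4 * np.pi, 100)
plt.plot(x, np.sin(x))
plt.show()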

Need Help Parsing File for This Pattern "Feb 06 2010 15:49:00.017 MCO"

I need to parse a file for lines of data that start with this pattern "Feb 06 2010 15:49:00.017 MCO", where MCO could be any 3-letter ID, and return the entire record for the line. I think I can get the first part, but returning the rest of the line is where I get lost.
Here is some sample data.
Feb 06 2010 15:49:00.017 MCO -I -I -I -I 0.34 527 0.26 0.24 184 Tentative 0.00 0 Radar Only -RDR- - - - - No 282356N 0811758W - 3-3
Feb 06 2010 15:49:00.017 MLB -I -I -I -I 44.31 3175 -10.05 -10.05 216 Established 0.00 0 Radar Only -RDR- - - - - No 281336N 0812939W - 2-
Feb 06 2010 15:49:00.018 MLB -I -I -I -I 44.31 3175 -10.05 -10.05 216 Established 15.51 99 Radar Only -RDR- - - - - No 281336N 0812939W - 2-
Feb 06 2010 15:49:00.023 QML N856 7437-V -I 62-V 61-V 67.00 3420 -30.93 15.34 534 Established 328.53 129 Reinforced - - - - - - No 283900N 0815325W - -
Feb 06 2010 15:49:00.023 QML N516SP 0723-V -I 22-V 21-V 42.25 3460 -8.19 5.03 146 Established 243.93 83 Beacon Only - - - - - - No 282844N 0812734W - -
Feb 06 2010 15:49:00.023 QML 2247-V -I 145-V 144-V 78.88 3443 -39.68 23.68 676 Established 177.66 368 Reinforced - - - - - - No 284719N 0820325W - -
Feb 06 2010 15:49:00.023 MLB 1200-V -I 15-V 14-V 45.25 3015 -11.32 -20.97 475 Established 349.68 88 Beacon Only - - - - - - No 280239N 0813104W - -
Feb 06 2010 15:49:00.023 MLB 1011-V -I 91-V 90-V 94.50 3264 -56.77 10.21 698 Established 152.28 187 Beacon Only - - - - - - No 283341N 0822244W - -
- - - - - -
It seems like your date + 3-character ID are always the first 5 fields (with space as the delimiter). Just go through the file, split each line on spaces, and take the first 5 fields:
s = Split(strLineOfFile, " ")
wscript.echo s(0), s(1), s(2), s(3), s(4)
No need for a regex.
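The same idea in Python, as a sketch ("radar.log" is a made-up filename):

with open("radar.log") as f:
    for line in f:
        fields = line.split()        # split on runs of whitespace
        print(" ".join(fields[:5]))  # date, time, and the 3-letter ID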
From your sample data it seems that you don't have to check for the presence of a three letter identifier following the date -- it's always there. Add a final three letters to the regex if that's not a valid assumption. Also, add more grouping as needed for regex groups to be useful to you. Anyway:
import re
dtre = re.compile(r'^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [0-9]{2} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3}')
[line for line in file if dtre.match(line)]
Wrap it in a with statement or whatever to open your file, then do any processing you need on the list this builds up.
Another possibility would be to use a generator expression instead of a list comprehension (replace the outer [ and ] with ( and ) to do so). This is useful if you're outputting results to somewhere as you go, the file is large and you don't need to have it all in memory for different purposes. Just be sure not to close the file before you consume the entire generator if you go with this approach!
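For example, the with-statement wrapping might look like this ("radar.log" is again a made-up filename):

import re

dtre = re.compile(r'^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) '
                  r'[0-9]{2} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3}')

with open("radar.log") as f:
    matching = [line for line in f if dtre.match(line)]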
Also, you could use datetime's built-in parsing facility:
import datetime
for line in file:
    try:
        # the line[:24] bit assumes you're always going to have a three-digit
        # µs part
        dt = datetime.datetime.strptime(line[:24], '%b %d %Y %H:%M:%S.%f')
    except ValueError:
        # a ValueError means the beginning of the line isn't parseable as datetime
        continue
    # do something with the line; the datetime is already parsed and stored in dt
That's probably better if you're going to create the datetime.datetime object anyway.