How to convert AIX timestamp of file to EPOCH

I want to get the epoch timestamp for the file below:
-rw-rw---- 1 informix informix 12247577 Jan 21 00:50 shop14_0_Log0001274968.Z
Using stat on the file I get the following date information:
stat shop14_0_Log0001274968.Z
File: shop14_0_Log0001274968.Z
Size: 12247577 Blocks: 23928 IO Block: 4096 regular file
Device: 800000640000000dh/9223372466351505421d Inode: 410 Links: 1
Access: (0660/-rw-rw----) Uid: (66001/informix) Gid: ( 3000/informix)
Access: 2020-01-21 00:50:07.000000000 +0200
Modify: 2020-01-21 00:50:06.000000000 +0200
Change: 2020-01-21 00:50:08.000000000 +0200
Birth: -
Is there a better command on AIX to pull the date information for a file and convert it to epoch? And once I have the date information, what command would I use to do the conversion?

You can always use Perl. An example:
#!/usr/bin/perl -w
use strict;

sub ftimestamp {
    my $fname = $_[0];
    my ($dev, $ino, $mode, $nlink, $uid, $gid, $rdev, $size,
        $atime, $mtime, $ctime, $blksize, $blocks) = stat($fname);
    printf("%-20s: atime=%d mtime=%d ctime=%d\n", $fname, $atime, $mtime, $ctime);
}

my $arg;
foreach $arg (@ARGV) {
    ftimestamp($arg);
}
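If Perl isn't an option, the same idea can be sketched in Python (a minimal sketch, assuming a stock Python install is present on the AIX box; os.stat already reports the times as epoch seconds):
#!/usr/bin/env python
import os
import sys

# Print epoch timestamps for each file given on the command line.
for fname in sys.argv[1:]:
    st = os.stat(fname)
    # st_atime, st_mtime and st_ctime are seconds since the epoch
    print("%-20s: atime=%d mtime=%d ctime=%d" % (fname, st.st_atime, st.st_mtime, st.st_ctime))
The mtime value is the one that corresponds to the "Modify" line in the stat output above.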

Related

Splunk bucket name conversion to epoch to human script

I'm facing the following issue: I have some frozen buckets in a Splunk environment whose names are saved in epoch format. More specifically, the template is:
db_1181756465_1162600547_1001
which, when converted, gives me the end date (the first number) and the start date (the second number). So, based on my example:
1181756465 = Wednesday 13 June 2007 17:41:05
1162600547 = Saturday 4 November 2006 00:35:47
Now, how to convert to human-readable is clear to me (otherwise I couldn't have put the translation here). My problem is that I have a file full of bucket names that must be converted, with hundreds of entries; so I'm asking if there is a script or some other way to automate this conversion and print the output to a file. The idea is to have a final output something like this:
db_1181756465_1162600547_1001 = Wednesday 13 June 2007 17:41:05 - Saturday 4 November 2006 00:35:47
You could use Splunk to view these values. They are output by the dbinspect command, which provides startEpoch & endEpoch times for each frozen bucket:
| dbinspect index=* state=frozen
| eval startDate=strftime(startEpoch,"%A %d %B %Y %H:%M:%S")
| eval endDate=strftime(endEpoch,"%A %d %B %Y %H:%M:%S")
| fields index, path, startDate, endDate
(The listing example uses hot buckets, since I don't have frozen buckets on this test system.)
If you just have the list of folder names, you can upload it to a Splunk instance as a CSV and do some processing to extract startDate & endDate:
| makeresults
| eval frozenbucket="db_1181756465_1162600547_1001"
| eval temp=split(frozenbucket,"_")
| eval sDate=mvindex(temp,2)
| eval eDate=mvindex(temp,1)
| eval startDate=strftime(sDate,"%A %d %B %Y %H:%M:%S")
| eval endDate=strftime(eDate,"%A %d %B %Y %H:%M:%S")
| fields frozenbucket,startDate,endDate
| fields - _time
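If you would rather do the conversion outside Splunk entirely, here is a minimal standalone sketch (the file names buckets.txt and buckets_converted.txt are assumptions, and the dates are rendered in UTC to match the translations in the question):
import time

fmt = "%A %d %B %Y %H:%M:%S"

# assumption: one bucket name per line, shaped like db_<endEpoch>_<startEpoch>_<id>
with open("buckets.txt") as infile, open("buckets_converted.txt", "w") as outfile:
    for line in infile:
        name = line.strip()
        if not name:
            continue
        parts = name.split("_")
        end_date = time.strftime(fmt, time.gmtime(int(parts[1])))
        start_date = time.strftime(fmt, time.gmtime(int(parts[2])))
        outfile.write("%s = %s - %s\n" % (name, end_date, start_date))
This writes one "db_... = end date - start date" line per bucket, in the format requested above.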

Nextflow: structured inputs with files

I have an array of structured data similar to:
- name: foobar
  sex: male
  fastqs:
    - r1: /path/to/foobar_R1.fastq.gz
      r2: /path/to/foobar_R2.fastq.gz
    - r1: /path/to/more/foobar_R1.fastq.gz
      r2: /path/to/more/foobar_R2.fastq.gz
- name: bazquux
  sex: female
  fastqs:
    - r1: /path/to/bazquux_R1.fastq.gz
      r2: /path/to/bazquux_R2.fastq.gz
Note that fastqs come in pairs, and the number of pairs per "sample" may be variable.
I want to write a process in nextflow that processes one sample at a time.
In order for the Nextflow executor to properly marshal the files, they must somehow be typed as path (or file). Thus typed, the executor will copy the files to the compute node for processing. Simply typing the file paths as val will treat the paths as strings, and no files will be copied.
A trivial example of a path input from the docs:
process foo {
    input:
    path x from '/some/data/file.txt'

    """
    your_command --in $x
    """
}
How should I go about declaring the process input so that the files are properly marshaled to the compute node? So far I haven't found any examples in the docs for how to handle structured inputs.
Your structured data looks a lot like YAML. If you can include a top-level object so that your file looks something like this:
samples:
  - name: foobar
    sex: male
    fastqs:
      - r1: ./path/to/foobar_R1.fastq.gz
        r2: ./path/to/foobar_R2.fastq.gz
      - r1: ./path/to/more/foobar_R1.fastq.gz
        r2: ./path/to/more/foobar_R2.fastq.gz
  - name: bazquux
    sex: female
    fastqs:
      - r1: ./path/to/bazquux_R1.fastq.gz
        r2: ./path/to/bazquux_R2.fastq.gz
Then, we can use Nextflow's -params-file option to load the params when we run our workflow. We can access the top-level object from the params, which gives us a list that we can use to create a Channel using the fromList factory method. The following example uses the new DSL 2:
process test_proc {
    tag { sample_name }
    debug true
    stageInMode 'rellink'

    input:
    tuple val(sample_name), val(sex), path(fastqs)

    """
    echo "${sample_name},${sex}:"
    ls -g *.fastq.gz
    """
}

workflow {
    Channel.fromList( params.samples )
        | flatMap { rec ->
            rec.fastqs.collect { rg ->
                readgroup = tuple( file(rg.r1), file(rg.r2) )
                tuple( rec.name, rec.sex, readgroup )
            }
          }
        | test_proc
}
Results:
$ mkdir -p ./path/to/more
$ touch ./path/to/foobar_R{1,2}.fastq.gz
$ touch ./path/to/more/foobar_R{1,2}.fastq.gz
$ touch ./path/to/bazquux_R{1,2}.fastq.gz
$ nextflow run main.nf -params-file file.yaml
N E X T F L O W ~ version 22.04.0
Launching `main.nf` [desperate_colden] DSL2 - revision: 391a9a3b3a
executor > local (3)
[ed/61c5c3] process > test_proc (foobar) [100%] 3 of 3 ✔
foobar,male:
lrwxrwxrwx 1 users 35 Oct 14 13:56 foobar_R1.fastq.gz -> ../../../path/to/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 35 Oct 14 13:56 foobar_R2.fastq.gz -> ../../../path/to/foobar_R2.fastq.gz
bazquux,female:
lrwxrwxrwx 1 users 36 Oct 14 13:56 bazquux_R1.fastq.gz -> ../../../path/to/bazquux_R1.fastq.gz
lrwxrwxrwx 1 users 36 Oct 14 13:56 bazquux_R2.fastq.gz -> ../../../path/to/bazquux_R2.fastq.gz
foobar,male:
lrwxrwxrwx 1 users 40 Oct 14 13:56 foobar_R1.fastq.gz -> ../../../path/to/more/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 40 Oct 14 13:56 foobar_R2.fastq.gz -> ../../../path/to/more/foobar_R2.fastq.gz
As requested, here's a solution that runs per sample. The problem we have is that we cannot simply feed in a list of lists using the path qualifier (since an ArrayList is not a valid path value). We could flatten() the list of file pairs, but this makes it difficult to access each of the file pairs if we need them. You may not necessarily need the file pair relationship but assuming you do, I think the right solution is to feed the R1 and R2 files in separately (i.e. using a path qualifier for R1 and another path qualifier for R2). The following example introspects the instance type to (re-)create the list of readgroups. We can use the stageAs option to localize the files into progressively indexed subdirectories, since some files in the YAML have identical names.
process test_proc {
    tag { sample_name }
    debug true
    stageInMode 'rellink'

    input:
    tuple val(sample_name), val(sex), path(r1, stageAs:'*/*'), path(r2, stageAs:'*/*')

    script:
    if( [r1, r2].every { it instanceof List } )
        readgroups = [r1, r2].transpose()
    else if( [r1, r2].every { it instanceof Path } )
        readgroups = [[r1, r2], ]
    else
        error "Invalid readgroup configuration"

    read_pairs = readgroups.collect { r1, r2 -> "${r1},${r2}" }

    """
    echo "${sample_name},${sex}:"
    echo ${read_pairs.join(' ')}
    ls -g */*.fastq.gz
    """
}

workflow {
    Channel.fromList( params.samples )
        | map { rec ->
            def r1 = rec.fastqs.r1.collect { file(it) }
            def r2 = rec.fastqs.r2.collect { file(it) }
            tuple( rec.name, rec.sex, r1, r2 )
          }
        | test_proc
}
Results:
$ nextflow run main.nf -params-file file.yaml
N E X T F L O W ~ version 22.04.0
Launching `main.nf` [berserk_sanger] DSL2 - revision: 2f317a8cee
executor > local (2)
[93/6345c9] process > test_proc (bazquux) [100%] 2 of 2 ✔
foobar,male:
1/foobar_R1.fastq.gz,1/foobar_R2.fastq.gz 2/foobar_R1.fastq.gz,2/foobar_R2.fastq.gz
lrwxrwxrwx 1 users 38 Oct 19 13:43 1/foobar_R1.fastq.gz -> ../../../../path/to/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 38 Oct 19 13:43 1/foobar_R2.fastq.gz -> ../../../../path/to/foobar_R2.fastq.gz
lrwxrwxrwx 1 users 43 Oct 19 13:43 2/foobar_R1.fastq.gz -> ../../../../path/to/more/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 43 Oct 19 13:43 2/foobar_R2.fastq.gz -> ../../../../path/to/more/foobar_R2.fastq.gz
bazquux,female:
1/bazquux_R1.fastq.gz,1/bazquux_R2.fastq.gz
lrwxrwxrwx 1 users 39 Oct 19 13:43 1/bazquux_R1.fastq.gz -> ../../../../path/to/bazquux_R1.fastq.gz
lrwxrwxrwx 1 users 39 Oct 19 13:43 1/bazquux_R2.fastq.gz -> ../../../../path/to/bazquux_R2.fastq.gz

Column names when exporting ORC files from hive server 2 using beeline

I am facing a problem where exporting results from Hive Server 2 to ORC files shows default column names (e.g. _col0, _col1, _col2) instead of the original ones created in Hive. We are using pretty much default components from HDP-2.6.3.0.
I am also wondering if the below issue is related:
https://issues.apache.org/jira/browse/HIVE-4243
Below are the steps we are taking:
Connecting:
export SPARK_HOME=/usr/hdp/current/spark2-client
beeline
!connect jdbc:hive2://HOST1:2181,HOST2:2181,HOST2:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
Creating test table and inserting sample values:
create table test(str string);
insert into test values ('1');
insert into test values ('2');
insert into test values ('3');
Running test query:
select * from test;
+-----------+--+
| test.str |
+-----------+--+
| 1 |
| 2 |
| 3 |
+-----------+--+
Exporting as ORC:
insert overwrite directory 'hdfs://HOST1:8020/tmp/test' stored as orc select * from test;
Getting the results:
hdfs dfs -get /tmp/test/000000_0 test.orc
Checking the results:
java -jar orc-tools-1.4.1-uber.jar data test.orc
Processing data file test.orc [length: 228]
{"_col0":"1"}
{"_col0":"2"}
{"_col0":"3"}
java -jar orc-tools-1.4.1-uber.jar meta test.orc
Processing data file test.orc [length: 228]
Structure for test.orc
File Version: 0.12 with HIVE_13083
Rows: 2
Compression: SNAPPY
Compression size: 262144
Type: struct<_col0:string>
Stripe Statistics:
Stripe 1:
Column 0: count: 2 hasNull: false
Column 1: count: 2 hasNull: false min: 1 max: 3 sum: 2
File Statistics:
Column 0: count: 2 hasNull: false
Column 1: count: 2 hasNull: false min: 1 max: 3 sum: 2
Stripes:
Stripe: offset: 3 data: 11 rows: 2 tail: 60 index: 39
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 28
Stream: column 1 section DATA start: 42 length 5
Stream: column 1 section LENGTH start: 47 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 228 bytes
Padding length: 0 bytes
Padding ratio: 0%
Looking at the results I can see _col0 as the column name while expecting the original str.
Any ideas on what I am missing?
Update
I noticed that the connection from beeline was going to hive 1.x, and not 2.x as wanted. I changed the connection to the Hive Server 2 Interactive URL:
Connected to: Apache Hive (version 2.1.0.2.6.3.0-235)
Driver: Hive JDBC (version 1.21.2.2.6.3.0-235)
Transaction isolation: TRANSACTION_REPEATABLE_READ
And tried again with the same sample. It even prints out the schema correctly:
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:test.str, type:string, comment:null)], properties:null)
But still no luck in getting it to the ORC file.
Solution
You need to enable Hive LLAP (Interactive SQL) in Ambari, then change the connection string you are using. For example, my connection became jdbc:hive2://.../;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-hive2
Note the additional "-hive2" at the end of the URL. There is a tutorial video from Hortonworks covering this.
"Proof"
After connecting to the updated Hive endpoint, I ran
create table t_orc(customer string, age int) stored as orc;
insert into t_orc values('bob', 12),('kate', 15);
Then
~$ hdfs dfs -copyToLocal /apps/hive/warehouse/t_orc/000000_0 ~/tmp/orc/hive2.orc
~$ orc-metadata tmp/orc/hive2.orc
{ "name": "tmp/orc/hive2.orc",
"type": "struct<customer:string,age:int>",
"rows": 2,
"stripe count": 1,
"format": "0.12", "writer version": "HIVE-13083",
"compression": "zlib", "compression block": 262144,
"file length": 305,
"content": 139, "stripe stats": 46, "footer": 96, "postscript": 23,
"row index stride": 10000,
"user metadata": {
},
"stripes": [
{ "stripe": 0, "rows": 2,
"offset": 3, "length": 136,
"index": 67, "data": 23, "footer": 46
}
]
}
Where orc-metadata is a tool distributed by the ORC repo on github.
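If you want to check the schema from Python rather than the Java or C++ tooling, a minimal sketch using the third-party pyorc package could work (an assumption; pyorc is not part of the HDP stack discussed above, and test.orc is the locally fetched file from the earlier step):
# pip install pyorc   (third-party package, assumed available)
import pyorc

with open("test.orc", "rb") as f:
    reader = pyorc.Reader(f)
    # Prints the ORC type description, e.g. struct<_col0:string> before the fix,
    # or the real column names (e.g. struct<customer:string,age:int>) once the
    # hiveserver2-hive2 endpoint writes them.
    print(reader.schema)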
You have to set this in the Hive script or hive shell; otherwise, put it in a .hiverc file in your home directory or in one of the other Hive user properties files:
set hive.cli.print.header=true;

Error while executing ForEach - Apache PIG

I have 3 logs: a Squid log, a login log, and a logoff log. I need to cross-reference these logs to find out which sites each user has accessed.
I'm using Apache Pig and created the following script to do it:
copyFromLocal /home/marcelo/Documentos/hadoop/squid.txt /tmp/squid.txt;
copyFromLocal /home/marcelo/Documentos/hadoop/samba.log_in /tmp/login.txt;
copyFromLocal /home/marcelo/Documentos/hadoop/samba.log_out /tmp/logout.txt;
squid = LOAD '/tmp/squid.txt' USING PigStorage AS (linha: chararray);
nsquid = FOREACH squid GENERATE FLATTEN (STRSPLIT(linha,'[ ]+'));
nsquid = FOREACH nsquid GENERATE $0 AS timeStamp:chararray, $2 AS ipCliente:chararray, $5 AS request:chararray, $6 AS url:chararray;
nsquid = FOREACH nsquid GENERATE FLATTEN (STRSPLIT(timeStamp,'[.]'))AS (timeStamp:int,resto:chararray),ipCliente,request,url;
nsquid = FOREACH nsquid GENERATE (int)$0 AS timeStamp:int, $2 AS ipCliente:chararray,$3 AS request:chararray, $4 AS url:chararray;
connect = FILTER nsquid BY (request=='CONNECT');
login = LOAD '/tmp/login.txt' USING PigStorage(' ') AS (serverAL: chararray, data: chararray, hora: chararray, netlogon: chararray, on: chararray, ip: chararray);
nlogin = FOREACH login GENERATE FLATTEN(STRSPLIT(serverAL,'[\\\\]')),data, hora,FLATTEN(STRSPLIT(ip,'[\\\\]'));
nlogin = FOREACH nlogin GENERATE $1 AS al:chararray, $2 AS data:chararray, $3 AS hora:chararray, $4 AS ipCliente:chararray;
logout = LOAD '/tmp/logout.txt' USING PigStorage(' ') AS (data: chararray, hora: chararray, logout: chararray, ipAl: chararray, disconec: chararray);
nlogout = FOREACH logout GENERATE data, hora, FLATTEN(STRSPLIT(ipAl,'[\\\\]'));
nlogout = FOREACH nlogout GENERATE $0 AS data:chararray,$1 AS hora:chararray,$2 AS ipCliente:chararray, $3 AS al:chararray;
data = JOIN nlogin BY (al,ipCliente,data), nlogout BY (al,ipCliente,data);
ndata = FOREACH data GENERATE nlogin::al,ToUnixTime(ToDate(CONCAT(nlogin::data, nlogin::hora),'dd/MM/yyyyHH:mm:ss', 'GMT')) AS tslogin:int,ToUnixTime(ToDate(CONCAT(nlogout::data, nlogout::hora),'dd/MM/yyyyHH:mm:ss', 'GMT')) AS tslogout:int,nlogout::ipCliente;
BB = FOREACH ndata GENERATE $0 AS al:chararray, (int)$1 AS tslogin:int, (int)$2 AS tslogout:int, $3 AS ipCliente:chararray;
CC = JOIN BB BY ipCliente, connect BY ipCliente;
DD = FOREACH CC GENERATE BB::al AS al:chararray, (int)BB::tslogin AS tslogin:int, (int)BB::tslogout AS tslogout:int,(int)connect::timeStamp AS timeStamp:int, connect::ipCliente AS ipCliente:chararray, connect::url AS url:chararray;
EE = FILTER DD BY (tslogin<=timeStamp) AND (timeStamp<=tslogout);
STORE EE INTO 'EEs';
But it is returning the following error:
2015-10-16 21:58:10,600 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2015-10-16 21:58:10,600 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201510162141_0008 has failed! Stop running all dependent jobs
2015-10-16 21:58:10,600 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-10-16 21:58:10,667 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 0: Error while executing ForEach at [DD[93,5]]
2015-10-16 21:58:10,667 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2015-10-16 21:58:10,667 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.2.1 0.12.1 root 2015-10-16 21:56:48 2015-10-16 21:58:10 HASH_JOIN,FILTER
Some jobs have failed! Stop running all dependent jobs
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_201510162141_0007 2 1 4 3 4 4 9 9 9 9 BB,data,login,logout,ndata,nlogin,nlogout HASH_JOIN
Failed Jobs:
JobId Alias Feature Message Outputs
job_201510162141_0008 CC,DD,EE,connect,nsquid,squid HASH_JOIN Message: Job failed! Error - # of failed Reduce Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201510162141_0008_r_000000 hdfs://localhost:9000/user/root/EEb,
Input(s):
Successfully read 7367 records from: "/tmp/login.txt"
Successfully read 7374 records from: "/tmp/logout.txt"
Failed to read data from "/tmp/squid.txt"
Output(s):
Failed to produce result in "hdfs://localhost:9000/user/root/EEb"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201510162141_0007 -> job_201510162141_0008,
job_201510162141_0008
2015-10-16 21:58:10,674 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 11 time(s).
2015-10-16 21:58:10,674 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Some jobs have failed! Stop running all dependent jobs
I created an alternative that worked by just replacing the penultimate line with:
STORE DD INTO 'DD';
newDD = LOAD 'hdfs://localhost:9000/user/root/DD' USING PigStorage AS (al:chararray, tslogin:int, tslogout:int, timeStamp:int, ipCliente:chararray, url:chararray);
EE = FILTER newDD BY (tslogin<=timeStamp) AND (timeStamp<=tslogout);
Does anyone have any idea how to fix it without the "store"?

Need Help Parsing File for This Pattern "Feb 06 2010 15:49:00.017 MCO"

Need to parse a file for lines of data that start with this pattern "Feb 06 2010 15:49:00.017 MCO", where MCO could be any 3-letter ID, and return the entire record for the line. I think I could get the first part, but returning the rest of the line is where I get lost.
Here is some sample data.
Feb 06 2010 15:49:00.017 MCO -I -I -I -I 0.34 527 0.26 0.24 184 Tentative 0.00 0 Radar Only -RDR- - - - - No 282356N 0811758W - 3-3
Feb 06 2010 15:49:00.017 MLB -I -I -I -I 44.31 3175 -10.05 -10.05 216 Established 0.00 0 Radar Only -RDR- - - - - No 281336N 0812939W - 2-
Feb 06 2010 15:49:00.018 MLB -I -I -I -I 44.31 3175 -10.05 -10.05 216 Established 15.51 99 Radar Only -RDR- - - - - No 281336N 0812939W - 2-
Feb 06 2010 15:49:00.023 QML N856 7437-V -I 62-V 61-V 67.00 3420 -30.93 15.34 534 Established 328.53 129 Reinforced - - - - - - No 283900N 0815325W - -
Feb 06 2010 15:49:00.023 QML N516SP 0723-V -I 22-V 21-V 42.25 3460 -8.19 5.03 146 Established 243.93 83 Beacon Only - - - - - - No 282844N 0812734W - -
Feb 06 2010 15:49:00.023 QML 2247-V -I 145-V 144-V 78.88 3443 -39.68 23.68 676 Established 177.66 368 Reinforced - - - - - - No 284719N 0820325W - -
Feb 06 2010 15:49:00.023 MLB 1200-V -I 15-V 14-V 45.25 3015 -11.32 -20.97 475 Established 349.68 88 Beacon Only - - - - - - No 280239N 0813104W - -
Feb 06 2010 15:49:00.023 MLB 1011-V -I 91-V 90-V 94.50 3264 -56.77 10.21 698 Established 152.28 187 Beacon Only - - - - - - No 283341N 0822244W - -
It seems like your date + 3-character ID are always the first 5 fields (with space as the delimiter). Just go through the file, do a split on space for each line, and then get the first 5 fields:
s=Split(strLineOfFile," ")
wscript.echo s(0),s(1),s(2),s(3),s(4)
No need for regex.
From your sample data it seems that you don't have to check for the presence of a three letter identifier following the date -- it's always there. Add a final three letters to the regex if that's not a valid assumption. Also, add more grouping as needed for regex groups to be useful to you. Anyway:
import re
dtre = re.compile(r'^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [0-9]{2} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3}')
[line for line in file if dtre.match(line)]
Wrap it in a with statement or whatever to open your file, then do any processing you need on the list this builds up.
Another possibility would be to use a generator expression instead of a list comprehension (replace the outer [ and ] with ( and ) to do so). This is useful if you're writing results out somewhere as you go, the file is large, and you don't need to hold it all in memory for other purposes. Just be sure not to close the file before you consume the entire generator if you go with this approach!
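For example, a minimal sketch of the generator variant (the file names radar.log and matches.out are made up for illustration):
import re

dtre = re.compile(r'^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [0-9]{2} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3}')

with open('radar.log') as infile, open('matches.out', 'w') as outfile:
    matches = (line for line in infile if dtre.match(line))  # lazy: lines are filtered as they are read
    for line in matches:
        outfile.write(line)  # consume the generator before the with block closes the file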
Also, you could use datetime's built-in parsing facility:
import datetime

for line in file:
    try:
        # the line[:24] bit assumes you're always going to have a three-digit
        # µs part
        dt = datetime.datetime.strptime(line[:24], '%b %d %Y %H:%M:%S.%f')
    except ValueError:
        # a ValueError means the beginning of the line isn't parseable as datetime
        continue
    # do something with the line; the datetime is already parsed and stored in dt
That's probably better if you're going to create the datetime.datetime object anyway.