Appending the datetime to the end of every line in a 600 million row file - awk

I have a 680-million-row (19 GB) file that I need the datetime appended to every line. I get this file every night and have to add the time I processed it to the end of each line. I have tried many ways to do this, including sed/awk and loading it into a SQL database with the last column defaulting to the current timestamp.
Is there a faster way to do this? My fastest way so far takes two hours, and that is just not fast enough given the urgency of the information in this file. It is a flat CSV file.
edit1:
Here's what I've done so far:
awk -v date="$(date +"%Y-%m-%d %r")" '{ print $0","date}' lrn.ae.txt > testoutput.txt
Time = 117 minutes
perl -ne 'chomp; printf "%s.pdf\n", $_' EXPORT.txt > testoutput.txt
Time = 135 minutes
mysql load data local infile '/tmp/input.txt' into table testoutput
Time = 211 minutes

You don't specify whether the timestamps have to be different for each line. Would a "start of processing" time be enough?
If so, a simple solution is to use the paste command with a pre-generated file of timestamps, exactly the same length as the file you're processing, and just paste the whole thing together. Also, if the whole process is I/O bound, as others are speculating, then running this on a box with an SSD drive might help speed things up.
I just tried it locally on a 6 million row file (roughly 1% of yours), and it actually does it in less than one second, on a MacBook Pro with an SSD drive.
~> date; time paste file1.txt timestamps.txt > final.txt; date
Mon Jun 5 10:57:49 MDT 2017
real 0m0.944s
user 0m0.680s
sys 0m0.222s
Mon Jun 5 10:57:49 MDT 2017
I'm going to now try a ~500 million row file, and see how that fares.
Updated:
OK, the results are in. paste is blazing fast compared to your solution: it took just over 90 seconds total to process the whole thing, 600M rows of simple data.
~> wc -l huge.txt
600000000 huge.txt
~> wc -l hugetimestamps.txt
600000000 hugetimestamps.txt
~> date; time paste huge.txt hugetimestamps.txt > final.txt; date
Mon Jun 5 11:09:11 MDT 2017
real 1m35.652s
user 1m8.352s
sys 0m22.643s
Mon Jun 5 11:10:47 MDT 2017
You still need to prepare the timestamps file ahead of time, but that's a trivial bash loop. I created mine in less than one minute.
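For reference, a minimal sketch of one way to pre-generate that file (assuming 600000000 lines and the hugetimestamps.txt name from above; adjust the count to match your input):
~> yes "$(date +"%Y-%m-%d %r")" | head -n 600000000 > hugetimestamps.txt
Repeating a single pre-formatted string like this avoids calling date once per line, which is what makes a naive loop slow.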

A solution that simplifies mjuarez' helpful approach:
yes "$(date +"%Y-%m-%d %r")" | paste -d',' file - | head -n "$(wc -l < file)" > out-file
Note that, as with the approach in the previous answer, you must know the number of input lines in advance - here I'm using wc -l to count them, but if the number is fixed, simply use that fixed number.
yes keeps repeating its argument indefinitely, each on its own output line, until it is terminated.
paste -d',' file - pastes a corresponding pair of lines from file and stdin (-) onto a single output line, separated with a comma (,).
Since yes produces "endless" output, head -n "$(wc -l < file)" ensures that processing stops once all input lines have been processed.
The use of a pipeline acts as a memory throttle, so running out of memory shouldn't be a concern.

Another alternative to test is
$ date +"%Y-%m-%d %r" > timestamp
$ join -t, -j9999 file timestamp | cut -d, -f2-
or the timestamp can be generated in place as well, via process substitution: <(date +"%Y-%m-%d %r")
join creates a cross product of the first and second files using the non-existent field (9999); since the second file is only one line, this effectively appends it to every line of the first file. The cut is needed to get rid of the empty key field that join generates.
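To make the mechanics concrete, here is a minimal sketch (the file names and two-line sample content are made up for illustration):
$ printf 'a,1\nb,2\n' > file
$ date +"%Y-%m-%d %r" > timestamp
$ join -t, -j9999 file timestamp | cut -d, -f2-
Each output line is the corresponding input line with ,<timestamp> appended; without the cut, every line would start with the empty join key, i.e. a leading comma.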

If you want to add the same (current) datetime to each row in the file, you might as well leave the file as it is, and put the datetime in the filename instead. Depending on the use later, the software that processes the file could then first get the datetime from the filename.
To put the same datetime at the end of each row, some simple code could be written:
Make a string containing a separator and the datetime.
Read the lines from the file, append the above string and write back to a new file.
This way a conversion from datetime to string is only done once, and converting the file should not take much longer than copying the file on disk.
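A minimal sketch of the filename variant, using the file name from the question (the timestamp format is just an example):
mv lrn.ae.txt "lrn.ae.$(date +"%Y-%m-%d_%H%M%S").txt"
The rename is instantaneous regardless of file size, and downstream software can parse the datetime back out of the name.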

Related

How can I delete a specific line (e.g. line 102,206,973) from a 30gb csv file?

What method can I use to delete a specific line from a csv/txt file that is too big to load into memory and edit manually?
Background
My question is actually an indirect solution to a problem related to importing CSV files into SQL databases.
I have a series of 10-30 GB CSV files that I want to import into an SQLite table from within R (since they are too large to import into R as whole data frames). I am using the 'RSQLite' package for this.
A couple of them fail because of an error related to one of the lines being badly formatted. The populating process is then cancelled. R returns the line number which caused the process to fail.
The error given is:
./csvfilename line 102206973 expected 9 columns of data but found 3)
So I know exactly the line which causes the error.
I see two potential 'indirect' solutions which I was hoping someone could help me with:
(i) Deleting the line causing the error in 20+ GB files, e.g. line 102,206,973 in the example above.
I am not concerned with 'losing' the data in line 102,206,973 by just skipping or deleting it. However, I have tried and failed to find a way to access the CSV file and remove the line.
(ii) Using SQLite directly (or anything else?) to import the CSV in a way that allows you to skip bad lines or errors.
Although not likely to be related directly to the solution, here is the R code used.
db <- dbConnect(SQLite(), dbname=name_of_table)
dbWriteTable(conn = db, name ="currentdata", value = csvfilename, row.names = FALSE, header = TRUE)
Thanks!
To delete a specific line you can use sed:
sed -e '102206973d' your_file
If you want the deletion to be done in-place, do
sed -i.bak -e '102206973d' your_file
This will create a backup named your_file.bak, and your_file will have the specified line removed.
Example
$ cat a
1
2
3
4
5
$ sed -i.bak -e '3d' a
$ cat a
1
2
4
5
$ cat a.bak
1
2
3
4
5
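If you want to double-check the offending line before deleting it, a quick sketch with GNU sed (using the line number from the question):
$ sed -n '102206973{p;q}' your_file
The q command makes sed quit right after printing that line, so it doesn't have to read the rest of the 30 GB file.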

How can I remove lines from a file with more than a certain number of entries

I've looked at the similar question about removing lines with more than a certain number of characters and my problem is similar but a bit trickier. I have a file that is generated after analyzing some data and each line is supposed to contain 29 numbers. For example:
53.0399 0.203827 7.28285 0.0139936 129.537 0.313907 11.3814 0.0137903 355.008 \
0.160464 12.2717 0.120802 55.7404 0.0875189 11.3311 0.0841887 536.66 0.256761 \
19.4495 0.197625 46.4401 2.38957 15.8914 17.1149 240.192 0.270649 19.348 0.230\
402 23001028 23800855
53.4843 0.198886 7.31329 0.0135975 129.215 0.335697 11.3673 0.014766 355.091 0\
.155786 11.9938 0.118147 55.567 0.368255 11.449 0.0842612 536.91 0.251735 18.9\
639 0.184361 47.2451 0.119655 18.6589 0.592563 240.477 0.298805 20.7409 0.2548\
56 23001585
50.7302 0.226066 7.12251 0.0158698 237.335 1.83226 15.4057 0.059467 -164.075 5\
.14639 146.619 1.37761 55.6474 0.289037 11.4864 0.0857042 536.34 0.252356 19.3\
91 0.198221 46.7011 0.139855 20.1464 0.668163 240.664 0.284125 20.3799 0.24696\
23002153
But every once in a while, a line like the first one appears that has an extra 8-digit number at the end, which comes from analyzing an empty file (so it just returns the file ID number, but not on a new line like it should). So I just want to find lines that have this extra 30th number and remove just that 30th entry. I figure I could do this with awk, but since I have little experience with it I'm not sure how. So if anyone can help, I'd appreciate it.
Thanks
Summary: I want to find lines in a text file that have an extra entry and remove that last extra entry so all rows have the same number of entries.
With awk, you can tell it how many fields there are per record; the extras are dropped:
awk '{NF = 29; print}' filename
If you want to save that back to the file, you have to do a little extra work
awk '{NF = 29; print}' filename > filename.new && mv filename.new filename
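As a quick sanity check of the truncation behaviour with gawk (the sample record below is made up; note that assigning to NF rebuilds the line using the output field separator, so runs of whitespace collapse to single spaces):
$ echo 'a b  c d' | gawk '{NF = 3; print}'
a b c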

ksh loop putting results of file into variable

I'm not sure of the best way to handle this; I'm guessing it's a while loop. I have a .txt file with a set of numbers (these numbers can change based on another script that runs):
ex:
0
36
41
53
60
Each number is on its own line. For each number I want to get that number and execute a script using it. So in this example I would call a script to stop database 0, and after that completes, call a script to stop database 36, and so on until all the numbers in the list have been processed.
1) Is a while loop the best way to handle this?
2) I'm having trouble trying to determine what the [[ condition ]] needs to be to get each number one at a time; where can I find some additional help on this?
while [[ condition ]] ; do
command1
done
For testing purposes the file that contains all the numbers is test.txt. The script that will execute is a Python script: "amgr.py stop (number from test.txt)".
Here's a simpler way:
cat test.txt | xargs -n 1 amgr.py stop
this will read each line of your file and pass it, one line at a time (that's what -n 1 does), as an extra parameter to your amgr.py:
amgr.py stop 0
amgr.py stop 36
and so on..
I ended up using this method to give me the results I was looking for:
while read -r line ; do
    amgr.py stop "$line"
done < test.txt

alternative to tail -F

I am monitoring a log file by doing "tail -n 0 -F filename". But this is taking up a lot of CPU as there are many messages being written to the logfile. Is there a way I can open the file, read the new entries, close it, and repeat every 5 seconds, so that I don't need to keep following the file? How can I remember the last line read so I can start from the next one in the next run? I am trying to do this in nawk by spawning a tail shell command.
You won't be able to magically use less resources to tail a file by writing your own implementation. If tail -f is using resources because the file is growing fast, a custom version won't help any if you still want to view all lines as they are being written. You are simply limited by your hardware I/O and/or CPU.
Try using --sleep-interval=S where "S" is a number of seconds (the default is 1.0 - you can specify decimals).
tail -n 0 --sleep-interval=.5 -F filename
If you have so many log entries that tail is bogging down the CPU, how are you able to monitor them?

Processing apache logs quickly

I'm currently running an awk script to process a large (8.1GB) access-log file, and it's taking forever to finish. In 20 minutes, it wrote 14MB of the (1000 +- 500)MB I expect it to write, and I wonder if I can process it much faster somehow.
Here is the awk script:
#!/bin/bash
awk '{t=$4" "$5; gsub("[\[\]\/]"," ",t); sub(":"," ",t);printf("%s,",$1);system("date -d \""t"\" +%s");}' $1
EDIT:
For non-awkers: the script reads each line, extracts the date information, reworks it into a format the date utility recognizes, and calls date to render it as the number of seconds since 1970, finally printing it as a line of a .csv file along with the IP.
Example input: 189.5.56.113 - - [22/Jan/2010:05:54:55 +0100] "GET (...)"
Returned output: 189.5.56.113,124237889
@OP, your script is slow mainly due to the excessive calls to the external date command, one for every line in the file, and it's a big file as well (in the GBs). If you have gawk, use its built-in mktime() function to do the date-to-epoch-seconds conversion:
awk 'BEGIN{
  m=split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",d,"|")
  for(o=1;o<=m;o++){
    date[d[o]]=sprintf("%02d",o)
  }
}
{
  gsub(/\[/,"",$4); gsub(":","/",$4); gsub(/\]/,"",$5)
  n=split($4, DATE,"/")
  day=DATE[1]
  mth=DATE[2]
  year=DATE[3]
  hr=DATE[4]
  min=DATE[5]
  sec=DATE[6]
  MKTIME= mktime(year" "date[mth]" "day" "hr" "min" "sec)
  print $1,MKTIME
}' file
output
$ more file
189.5.56.113 - - [22/Jan/2010:05:54:55 +0100] "GET (...)"
$ ./shell.sh
189.5.56.113 1264110895
If you really really need it to be faster, you can do what I did. I rewrote an Apache log file analyzer using Ragel. Ragel allows you to mix regular expressions with C code. The regular expressions get transformed into very efficient C code and then compiled. Unfortunately, this requires that you are very comfortable writing code in C. I no longer have this analyzer. It processed 1 GB of Apache access logs in 1 or 2 seconds.
You may have limited success removing unnecessary printfs from your awk statement and replacing them with something simpler.
If you are using gawk, you can massage your date and time into a format that mktime (a gawk function) understands. It will give you the same timestamp you're using now and save you the overhead of repeated system() calls.
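For example, a minimal sketch of the format gawk's mktime() expects ("YYYY MM DD HH MM SS", shown here with a hard-coded sample datetime):
gawk 'BEGIN { print mktime("2010 01 22 05 54 55") }'
The gawk answer above builds exactly that string from the log's day/month/year/time fields before passing it to mktime().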
This little Python script handles ~400 MB worth of copies of your example line in about 3 minutes on my machine, producing ~200 MB of output (keep in mind your sample line was quite short, so that's a handicap):
import time

src = open('x.log', 'r')
dest = open('x.csv', 'w')
for line in src:
    ip = line[:line.index(' ')]
    date = line[line.index('[') + 1:line.index(']') - 6]
    t = time.mktime(time.strptime(date, '%d/%b/%Y:%X'))
    dest.write(ip)
    dest.write(',')
    dest.write(str(int(t)))
    dest.write('\n')
src.close()
dest.close()
A minor problem is that it doesn't handle timezones (strptime() problem), but you could either hardcode that or add a little extra to take care of it.
But to be honest, something as simple as that should be just as easy to rewrite in C.
gawk '{
    dt = substr($4,2,20)    # e.g. 22/Jan/2010:05:54:55
    gsub(/\//, " ", dt)     # 22 Jan 2010:05:54:55
    sub(/:/, " ", dt)       # 22 Jan 2010 05:54:55, a form GNU date accepts
    cmd = "date -d \"" dt "\" +%s"
    cmd | getline ts
    close(cmd)              # close the pipe so file descriptors are not leaked
    print $1, ts
}' yourfile