I regularly have to reduce huge database SQL dumps (more than 100 GB) down to more manageable file sizes by removing unnecessary INSERT statements. I do that with the following script.
I'm concerned that my script involves iterating multiple times through the source file, which is obviously computationally expensive.
Is there a way to combine all my sed commands into one, so the source file only needs to be processed once, or can it be processed in a more efficient way?
sed '/INSERT INTO `attendance_log`/d' input.sql | \
sed '/INSERT INTO `analytics_models_log`/d' | \
sed '/INSERT INTO `backup_logs`/d' | \
sed '/INSERT INTO `config_log`/d' | \
sed '/INSERT INTO `course_completion_log`/d' | \
sed '/INSERT INTO `errorlog`/d' | \
sed '/INSERT INTO `log`/d' | \
sed '/INSERT INTO `logstore_standard_log`/d' | \
sed '/INSERT INTO `mnet_log`/d' | \
sed '/INSERT INTO `portfolio_log`/d' | \
sed '/INSERT INTO `portfolio_log`/d' | \
sed '/INSERT INTO `prog_completion_log`/d' | \
sed '/INSERT INTO `local_amosdatasend_log_entry`/d' | \
sed '/INSERT INTO `totara_sync_log`/d' | \
sed '/INSERT INTO `prog_messagelog`/d' | \
sed '/INSERT INTO `stats_daily`/d' | \
sed '/INSERT INTO `course_modules_completion`/d' | \
sed '/INSERT INTO `question_attempt_step_data`/d' | \
sed '/INSERT INTO `scorm_scoes_track`/d' | \
sed '/INSERT INTO `question_attempts`/d' | \
sed '/INSERT INTO `grade_grades_history`/d' | \
sed '/INSERT INTO `task_log`/d' > reduced.sql
Is this idea going in the right direction?
cat input.sql | sed '/INSERT INTO `analytics_models_log`/d' | sed '/INSERT INTO `backup_logs`/d' | sed '/INSERT INTO `config_log`/d' | sed '/INSERT INTO `course_completion_log`/d' | sed '/INSERT INTO `errorlog`/d' | sed '/INSERT INTO `log`/d' | sed '/INSERT INTO `logstore_standard_log`/d' | sed '/INSERT INTO `mnet_log`/d' | sed '/INSERT INTO `portfolio_log`/d' | sed '/INSERT INTO `portfolio_log`/d' | sed '/INSERT INTO `prog_completion_log`/d' | sed '/INSERT INTO `local_amosdatasend_log_entry`/d' | sed '/INSERT INTO `totara_sync_log`/d' | sed '/INSERT INTO `prog_messagelog`/d' | sed '/INSERT INTO `stats_daily`/d' | sed '/INSERT INTO `course_modules_completion`/d' | sed '/INSERT INTO `question_attempt_step_data`/d' | sed '/INSERT INTO `scorm_scoes_track`/d' | sed '/INSERT INTO `question_attempts`/d' | sed '/INSERT INTO `grade_grades_history`/d' | sed '/INSERT INTO `task_log`/d' > reduced.sql
If you have multiple sed ... | sed ... stages, you can combine them into a single invocation by writing sed -e ... -e ... or by separating the commands with ; as in sed '...;...'.
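For example, two of the original stages could be collapsed into one call like this (a minimal sketch using two of the tables from the question):
sed -e '/INSERT INTO `attendance_log`/d' -e '/INSERT INTO `backup_logs`/d' input.sql > reduced.sql
But in this case there is an even more efficient method: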
sed -E '/INSERT INTO `(attendance_log|analytics_models_log|...)`/d'
Alternatively, switch to grep which could be even faster:
grep -vE 'INSERT INTO `(attendance_log|analytics_models_log|...)`'
or
grep -vFf <(printf 'INSERT INTO `%s`\n' attendance_log analytics_models_log ...)
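The printf simply expands each table name into one fixed-string pattern per line, which grep -F then matches literally (so the backticks need no escaping), e.g.:
$ printf 'INSERT INTO `%s`\n' attendance_log backup_logs
INSERT INTO `attendance_log`
INSERT INTO `backup_logs`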
You could even replace all the ..._log and ..._logs tables with one regex, if that is what you want. With this, you only have to explicitly list the tables whose names don't end in log or logs:
INSERT INTO `([^`]*logs?|local_amosdatasend_log_entry|stats_daily|...)`
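Spelled out against the table list in the question, that could look like the following sketch (it assumes every table you want to drop either ends in log/logs or appears in the explicit list):
grep -vE 'INSERT INTO `([^`]*logs?|local_amosdatasend_log_entry|stats_daily|course_modules_completion|question_attempt_step_data|scorm_scoes_track|question_attempts|grade_grades_history)`' input.sql > reduced.sql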
For ease of maintenance it may make sense to have a list of tables (in a file) that awk can use to filter the SQL script.
List of the (database) tables to skip ...
$ cat table.list
attendance_log
analytics_models_log
backup_logs
config_log
course_completion_log
Sample SQL script:
$ cat sample.sql
INSERT INTO attendance_log ...
INSERT INTO bubblegum ...
INSERT INTO backup_logs ...
INSERT INTO more_nonsense ...
Let awk do the pruning for us:
$ awk 'FNR==NR {table[$1];next} /^INSERT INTO / && $3 in table{next}1' table.list sample.sql
INSERT INTO bubblegum ...
INSERT INTO more_nonsense ...
NOTES:
this is based solely on the fact that the question only mentions INSERT INTO statements
I'm assuming the lines (of interest) start with INSERT INTO (otherwise remove the ^)
this solution will need additional checks/coding to address other SQL statements the OP wants to remove; a variant that handles backtick-quoted table names is sketched below
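Since the dump in the question quotes table names with backticks (INSERT INTO `attendance_log` ...), a variant that strips the backticks before the lookup might look like this sketch (same plain-name table.list as above; input.sql is the dump):
$ awk 'FNR==NR {table[$1]; next} /^INSERT INTO / {t=$3; gsub(/`/, "", t); if (t in table) next} 1' table.list input.sql > reduced.sql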
For ease of maintenance it may make sense to have a file of /INSERT INTO <table>/d commands that sed can use to filter the SQL script.
Storing the sed commands in a file, eg:
$ cat sed.cmds
/INSERT INTO attendance_log/d
/INSERT INTO analytics_models_log/d
/INSERT INTO backup_logs/d
/INSERT INTO config_log/d
/INSERT INTO course_completion_log/d
Sample SQL script:
$ cat sample.sql
INSERT INTO attendance_log ...
INSERT INTO bubblegum ...
INSERT INTO backup_logs ...
INSERT INTO more_nonsense ...
Invoking the file of sed commands:
$ sed -f sed.cmds sample.sql
INSERT INTO bubblegum ...
INSERT INTO more_nonsense ...
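If you prefer to maintain just a plain list of table names, the sed.cmds file can also be generated on the fly (a sketch, assuming one table name per line in a table.list file as in the awk answer above):
$ printf '/INSERT INTO %s/d\n' $(cat table.list) > sed.cmds
$ sed -f sed.cmds sample.sql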
Related
I have a text file where fields are separated by a pipe character. Since it is human-readable text, spaces are used for column alignment.
Here is a sample input:
+------------------------------------------+----------------+------------------+
| Column1 | Column2 | Column3 | Column4 | Last Column |
+------------------------------------------+----------------+------------------+
| some_text | other_text | third_text | fourth_text | last_text |
<more such lines>
+------------------------------------------+----------------+------------------+
How can I use awk to extract the third field in this case?
I tried:
awk -F '[ |]' '{print $3}' file
awk -F '[\|| ]' '{print $3}' file
awk -F '[\| ]' '{print $3}' file
The expected result is:
<blank>
Column3
<more column 3 values>
<blank>
third_text
I am trying to achieve this with a single awk command. Isn't that possible?
The following post talks about using pipe as a delimiter in awk but it doesn't talk about the case of multiple delimiters where one of them is a pipe character:
Using pipe character as a field separator
Am I missing something?
Example input :
+------------------------------------------+----------------+------------------+
| Column1 | Column2 | Column3 | Column4 | Last Column |
+------------------------------------------+----------------+------------------+
| some_text | other_text | third_text | fourth_text | last_text |
| some_text2| other_text2 | third_text2 | fourth_text2 | last_text2 |
+------------------------------------------+----------------+------------------+
Command :
gawk -F '[| ]*' '{print $4}' <file>
Output :
<blank>
Column3
<blank>
third_text
third_text2
<blank>
Works for every column (you just need to use i+1 instead of i, because the first field holds either an empty value or the +----- border).
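For example, the first data column comes out of $2, because the leading | makes field 1 empty; with the same input:
gawk -F '[| ]*' '{print $2}' <file>
<blank>
Column1
<blank>
some_text
some_text2
<blank>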
Perl is better suited for this use case:
$ perl -F'\s*\|\s*' -lane 'print $F[3]' File
# -F takes a full regex as the field separator (like awk's -F, but more powerful)
First preprocess with sed (remove the first, third, and last lines, replace every run of spaces+|+spaces with a single |, and strip the leading |), then just split with awk on | (it could really just be cut -d'|' -f3).
sed '1d;3d;$d;s/ *| */|/g;s/^|//;' |
awk -F'|' '{print $3}'
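Applied to the example input above (call it file), the whole pipeline prints only the third column:
$ sed '1d;3d;$d;s/ *| */|/g;s/^|//' file | awk -F'|' '{print $3}'
Column3
third_text
third_text2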
Here is the correctly working sequential version:
tail -n +2 movieratings.csv | cut -d "|" -f 1 | sort | uniq | wc -l
tail -n +2 movieratings.csv | cut -d "|" -f 2 | sort | uniq | wc -l
tail -n +2 movieratings.csv | cut -d "|" -f 3 | sort | uniq | wc -l
and so on for many more -f values. The field number is the only thing that changes. Parallelism is required for an answer to my question to be acceptable.
I tried lots of things, as follows (and even more than shown, if you want to wade through them), but none of them work right. GNU parallel is seemingly not allowed on this host, otherwise I would use it and be done with it already. It is a host on Google Colaboratory. Perhaps there is a way to install GNU parallel on Colab, and then that answer would also be acceptable to me even if not to the rest of Stack Overflow. It's my question and I own it.
for i in {1..1}; do echo $i; done | xargs -L 30 -I x -n 1 -P 8 cut movieratings.csv -d "|" -f x | sort | uniq | wc -l
for i in {1..2}; do echo $i; done | xargs -I xxx -P 8 cut movieratings.csv -d "|" -f xxx | sort | uniq | wc -l
for i in {1..2}; do echo $i; done | xargs -d $'\n' -I xxx -P 8 cut movieratings.csv -d "|" -f xxx | sort | uniq | wc -l
for i in {1..2}; do echo $i; done | xargs -d $'\n' -n 1 -I xxx -P 8 cut -d "|" -f xxx movieratings.csv | sort | uniq | wc -l
for i in {1..2}; do echo $i; done | xargs -n 1 -I xxx -P 8 cut -d "|" -f xxx movieratings.csv | sort | uniq | wc -l
for i in {1..2}; do echo $i; done | xargs -I xxx -P 8 'cut -d "|" -f xxx movieratings.csv' | sort | uniq | wc -l
for i in {1..2}; do echo $i; done | xargs -I xxx -P 8 cut movieratings.csv -d "|" -f xxx | sort | uniq | wc -l
for i in {1..2}; do echo $i; done | xargs -n 1 -I xxx -P 8 cut movieratings.csv -d "|" -f xxx | sort | uniq | wc -l
for i in {1..2}; do echo $i; done | xargs -I xxx -P 8 "cut movieratings.csv -d '|'" -f xxx | sort | uniq | wc -l
Here is some sample data on which to run commands. Hope this helps.
userid|itemid|rating|rating_year|title|unknown|action|adventure|animation|childrens|comedy|crime|documentary|drama|fantasy|film_noir|horror|musical|mystery|romance|scifi|thriller|war|western|movie_year|movie_age|user_age|gender|job|zipcode
196|242|3.0|1997|Kolya (1996)|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1997|0|49|M|writer|55105
186|302|3.0|1998|L.A. Confidential (1997)|0|0|0|0|0|0|1|0|0|0|1|0|0|1|0|0|1|0|0|1997|1|39|F|executive|00000
22|377|1.0|1997|Heavyweights (1994)|0|0|0|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1994|3|25|M|writer|40206
244|51|2.0|1997|Legends of the Fall (1994)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|1|0|0|1|1|1994|3|28|M|technician|80525
166|346|1.0|1998|Jackie Brown (1997)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|0|0|0|1997|1|47|M|educator|55113
298|474|4.0|1998|Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1963|35|44|M|executive|01581
115|265|2.0|1997|Hunt for Red October, The (1990)|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0|1990|7|31|M|engineer|17110
253|465|5.0|1998|Jungle Book, The (1994)|0|0|1|0|1|0|0|0|0|0|0|0|0|0|1|0|0|0|0|1994|4|26|F|librarian|22903
305|451|3.0|1998|Grease (1978)|0|0|0|0|0|1|0|0|0|0|0|0|1|0|1|0|0|0|0|1978|20|23|M|programmer|94086
I do not know Colab, but did you read https://oletange.wordpress.com/2018/03/28/excuses-for-not-installing-gnu-parallel/ (especially parallel --embed)?
tail -n +2 movieratings.csv |
parallel --tee --pipe 'cut -d "|" -f {} | sort | uniq | wc -l' ::: 1 2 3
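If GNU parallel really cannot be installed, the same idea can be approximated with xargs by giving each field number its own shell, so that sort | uniq | wc -l runs inside the per-field command instead of on the combined output. A sketch (field numbers and the movieratings.csv name are taken from the question; result lines may arrive in any order, hence the label):
for i in {1..3}; do echo "$i"; done |
  xargs -P 8 -I xxx sh -c 'echo "field xxx: $(tail -n +2 movieratings.csv | cut -d "|" -f xxx | sort | uniq | wc -l)"'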
In CLI I type:
$ timedatectl | grep Time | awk '{print $3}'
Which gives me the correct output:
Country/City
How can I get just the city printed, without the country?
Use this command instead:
timedatectl | grep Time | awk '{print $3}' | cut -d/ -f2
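If you would rather keep it to a single awk, a sketch that assumes the usual two-part Region/City zone name is:
timedatectl | awk '/Time zone/ {split($3, a, "/"); print a[2]}'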
I have a sample result set from the Linux lvs command. I am trying to re-arrange the fields using an awk command, but I am not able to handle the empty values in the data.
LV VG Attr LSize Pool Origin Data% Meta% Move Log
root centos -wi-ao---- 45.62g root Online
swap centos -wi-ao---- root Offline
I tried the following command,
awk '{print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10}' lvs.txt
But Output is
LV VG Attr LSize Pool Origin Data% Meta% Move Log
root centos -wi-ao---- 45.62g root Online
swap centos -wi-ao---- root Offline
My expected result must be,
LV | VG | Attr | LSize | Pool | Origin | Data% | Meta% | Move | Log
root | centos | -wi-ao---- | 45.62g | | root | | | | Online
swap | centos | -wi-ao---- | | | root | | | | Offline
Please help me through this.
Any other possible ways are also welcome. Thanks in advance.
Solution: use lvs with --separator, and format the output with awk -F (field separator) and printf:
lvs --separator ',' | awk -F ',' '{printf "%-15s| %-10s| %10s| %10s| %10s| %10s| %10s| %10s| %10s|%10s| \n", $1,$2,$3,$4,$5,$6,$7,$8,$9,$10}'
Output:
LV | VG | Attr| LSize| Pool| Origin| Data%| Meta%| Move| Log|
logicalTest1 | testgroup | -wi-a-----| 1.00g| | | | | | |
logicalTest2 | testgroup | -wi-a-----| 1.00g| | | | | | |
Explanation
I installed LVM and created a couple of LVs to test.
The lvs command produces output without field separators, just a bunch of spaces. After redirecting the output to a file, I displayed the spaces:
root#florida:~# cat test2 | tr " " "*"
**Host*******Attr*******KMaj*LSize*OSize*Origin
**florida****-wi-a-----**254*1.00g*************
**florida****-wi-a-----**254*1.00g*************
Spaces aren't good delimiters, so after reading man lvs I found that it has several options for formatting the output. One of them is the --separator parameter, which lets you set a custom separator between columns; I used a comma (","), which is common for CSV.
Then, looking for a way to split fields in awk, I found its -F field-separator option and glued that together with the print formatting explained here.
I just had to search and read a little, but all the answers to this problem were on the internet.
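As a final shortcut, if the pipe-separated layout from the question is all that is needed, you may be able to hand the separator straight to lvs and skip the awk formatting entirely (a sketch, not verified against the exact columns in the question):
lvs --separator ' | '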
I would like to delete a column from CSV files containing some text as the column header.
I would like the output file to have the same name as the file found by the grep.
I did the following
grep -l "pattern" * | xargs -0 awk -F'\t' '{print $1"\t"$2}' > output_file
How to output the result to the same file found by the grep ?
Thank you.
Just do this:
grep -l "pattern" * | xargs awk -F'\t' '{print $1"\t"$2 > FILENAME}'
FILENAME is the awk built-in variable holding the name of the current input file.
Example :
$ cat file1
ABC zzz
EFG xxx
HIJ yyy
$ cat file2
123 aaa
456 bbb
789 ccc
grep -l "123" * | xargs awk '{print $2"\t"$1 > FILENAME}'
I switch columns 1 and 2 in the file containing "123" and overwrite file2.
$ cat file1
ABC zzz
EFG xxx
HIJ yyy
$ cat file2
aaa 123
bbb 456
ccc 789
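One caveat: print > FILENAME truncates the file awk is still reading. That is usually fine for small files like these, which fit in the input buffer, but it can lose data on big ones. A safer sketch (assuming file names without whitespace; the .tmp suffix is just illustrative) writes to a temporary file and moves it back:
for f in $(grep -l "pattern" *); do
    awk -F'\t' '{print $1 "\t" $2}' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done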