Get first word of every line with AWK or SED?

I have a text file like this:
aaaa bbbb cccc
dddd eeee ffff
gggg hhhh iiii
...
..
.
How can I create a text file containing only the first column of every line, using awk or sed, like this?
aaaa
dddd
gggg
...
..
.
I have seen similar topics but I could not resolve my problem!

If your input is
aaaa bbbb cccc
dddd eeee ffff
gggg hhhh iiii
and what you want is:
aaaa
dddd
gggg
then you can use any of:
awk NF=1 input.file
sed 's/ .*//' input.file
cut -d' ' -f1 input.file

Using awk, by setting the number of fields to 1:
awk '{NF=1}1' inputfile
Using grep with -o to print only the matched part of each line:
grep -o '^[^ ]*' inputfile
Using sed with a backreference:
sed -r 's/^([^ ]+).*/\1/' inputfile
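Note that the sed, grep, and cut variants above assume fields are separated by a single space. If the separator can be a tab or a run of spaces, or lines can start with blanks, awk's default whitespace splitting is the simplest robust option:
awk '{print $1}' input.file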

Related

More efficient sed on huge input files

I regularly have to reduce huge database SQL dumps (more than 100 GB) down to more manageable file sizes by removing unnecessary INSERT statements. I do that with the following script.
I'm concerned that my script involves iterating multiple times through the source file, which is obviously computationally expensive.
Is there a way to combine all my SED statements into one, so the source file only needs to be processed once, or can be processed in a more efficient way?
sed '/INSERT INTO `attendance_log`/d' input.sql | \
sed '/INSERT INTO `analytics_models_log`/d' | \
sed '/INSERT INTO `backup_logs`/d' | \
sed '/INSERT INTO `config_log`/d' | \
sed '/INSERT INTO `course_completion_log`/d' | \
sed '/INSERT INTO `errorlog`/d' | \
sed '/INSERT INTO `log`/d' | \
sed '/INSERT INTO `logstore_standard_log`/d' | \
sed '/INSERT INTO `mnet_log`/d' | \
sed '/INSERT INTO `portfolio_log`/d' | \
sed '/INSERT INTO `portfolio_log`/d' | \
sed '/INSERT INTO `prog_completion_log`/d' | \
sed '/INSERT INTO `local_amosdatasend_log_entry`/d' | \
sed '/INSERT INTO `totara_sync_log`/d' | \
sed '/INSERT INTO `prog_messagelog`/d' | \
sed '/INSERT INTO `stats_daily`/d' | \
sed '/INSERT INTO `course_modules_completion`/d' | \
sed '/INSERT INTO `question_attempt_step_data`/d' | \
sed '/INSERT INTO `scorm_scoes_track`/d' | \
sed '/INSERT INTO `question_attempts`/d' | \
sed '/INSERT INTO `grade_grades_history`/d' | \
sed '/INSERT INTO `task_log`/d' > reduced.sql
Is this idea going in the right direction?
cat input.sql | sed '/INSERT INTO `analytics_models_log`/d' | sed '/INSERT INTO `backup_logs`/d' | sed '/INSERT INTO `config_log`/d' | sed '/INSERT INTO `course_completion_log`/d' | sed '/INSERT INTO `errorlog`/d' | sed '/INSERT INTO `log`/d' | sed '/INSERT INTO `logstore_standard_log`/d' | sed '/INSERT INTO `mnet_log`/d' | sed '/INSERT INTO `portfolio_log`/d' | sed '/INSERT INTO `portfolio_log`/d' | sed '/INSERT INTO `prog_completion_log`/d' | sed '/INSERT INTO `local_amosdatasend_log_entry`/d' | sed '/INSERT INTO `totara_sync_log`/d' | sed '/INSERT INTO `prog_messagelog`/d' | sed '/INSERT INTO `stats_daily`/d' | sed '/INSERT INTO `course_modules_completion`/d' | sed '/INSERT INTO `question_attempt_step_data`/d' | sed '/INSERT INTO `scorm_scoes_track`/d' | sed '/INSERT INTO `question_attempts`/d' | sed '/INSERT INTO `grade_grades_history`/d' | sed '/INSERT INTO `task_log`/d' > reduced.sql
If you have multiple sed ... | sed ... stages, you can combine them by writing sed -e ... -e ... or sed '...;...'. But in this case there is an even more efficient method:
sed -E '/INSERT INTO `(attendance_log|analytics_models_log|...)`/d'
Alternatively, switch to grep, which could be even faster:
grep -vE 'INSERT INTO `(attendance_log|analytics_models_log|...)`'
or
grep -vFf <(printf 'INSERT INTO `%s`\n' attendance_log analytics_models_log ...)
You could even try to match all the ..._log and ..._logs tables with a single regex, if that is what you want. Then only the tables whose names don't end in log or logs have to be listed explicitly:
INSERT INTO `([^`]*logs?|local_amosdatasend_log_entry|stats_daily|...)`
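Spelled out with the table names from the script above (duplicate portfolio_log removed), the single-pass grep variant would be the following sketch:
grep -vE 'INSERT INTO `(attendance_log|analytics_models_log|backup_logs|config_log|course_completion_log|errorlog|log|logstore_standard_log|mnet_log|portfolio_log|prog_completion_log|local_amosdatasend_log_entry|totara_sync_log|prog_messagelog|stats_daily|course_modules_completion|question_attempt_step_data|scorm_scoes_track|question_attempts|grade_grades_history|task_log)`' input.sql > reduced.sql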
For ease of maintenance it may make sense to have a list of tables (in a file) that awk can use to filter the SQL script.
List of the (database) tables to skip ...
$ cat table.list
attendance_log
analytics_models_log
backup_logs
config_log
course_completion_log
Sample SQL script:
$ cat sample.sql
INSERT INTO attendance_log ...
INSERT INTO bubblegum ...
INSERT INTO backup_logs ...
INSERT INTO more_nonsense ...
Let awk do the pruning for us:
$ awk 'FNR==NR {table[$1];next} /^INSERT INTO / && $3 in table{next}1' table.list sample.sql
INSERT INTO bubblegum ...
INSERT INTO more_nonsense ...
NOTES:
this is based solely on the fact the question only mentions INSERT INTO commands
I'm assuming the lines (of interest) start with INSERT INTO (otherwise remove the ^)
this solution will need additional checks/coding to address other SQL statements OP wants to remove
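Note also that the real dump in the question quotes table names in backticks (INSERT INTO `attendance_log` ...), while sample.sql above does not; a sketch that strips the backticks before the lookup, so table.list can stay unquoted:
awk 'FNR==NR {table[$1]; next} /^INSERT INTO / {t=$3; gsub(/`/, "", t); if (t in table) next} 1' table.list input.sql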
For ease of maintenance it may make sense to have a list of /INSERT INTO <table>/d commands (in a file) that sed can use to filter the SQL script.
Storing the sed commands in a file, eg:
$ cat sed.cmds
/INSERT INTO attendance_log/d
/INSERT INTO analytics_models_log/d
/INSERT INTO backup_logs/d
/INSERT INTO config_log/d
/INSERT INTO course_completion_log/d
Sample SQL script:
$ cat sample.sql
INSERT INTO attendance_log ...
INSERT INTO bubblegum ...
INSERT INTO backup_logs ...
INSERT INTO more_nonsense ...
Invoking the file of sed commands:
$ sed -f sed.cmds sample.sql
INSERT INTO bubblegum ...
INSERT INTO more_nonsense ...
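If you keep the table names in a file like the table.list above, the sed command file need not be maintained by hand; it can be generated from the list, for example:
$ sed 's|.*|/INSERT INTO &/d|' table.list > sed.cmds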

Using multiple delimiters when one of them is a pipe character

I have a text file where fields are separated by a pipe character. Since it is human-readable text, spaces are used for column alignment.
Here is a sample input:
+------------------------------------------+----------------+------------------+
| Column1 | Column2 | Column3 | Column4 | Last Column |
+------------------------------------------+----------------+------------------+
| some_text | other_text | third_text | fourth_text | last_text |
<more such lines>
+------------------------------------------+----------------+------------------+
How can I use awk to extract the third field in this case?
I tried:
awk -F '[ |]' '{print $3}' file
awk -F '[\|| ]' '{print $3}' file
awk -F '[\| ]' '{print $3}' file
The expected result is:
<blank>
Column3
<blank>
third_text
<more column 3 values>
<blank>
I am trying to achieve this with a single awk command. Isn't that possible?
The following post talks about using pipe as a delimiter in awk but it doesn't talk about the case of multiple delimiters where one of them is a pipe character:
Using pipe character as a field separator
Am I missing something?
Example input:
+------------------------------------------+----------------+------------------+
| Column1 | Column2 | Column3 | Column4 | Last Column |
+------------------------------------------+----------------+------------------+
| some_text | other_text | third_text | fourth_text | last_text |
| some_text2| other_text2 | third_text2 | fourth_text2 | last_text2 |
+------------------------------------------+----------------+------------------+
Command:
gawk -F '[| ]*' '{print $4}' <file>
Output:
<blank>
Column3
<blank>
third_text
third_text2
<blank>
This works for every column (just use field i+1 instead of i, because the first field is empty: every line starts with | or +-----).
perl is better suited for this use case:
$ perl -F'\s*\|\s*' -lane 'print $F[3]' File
The -F switch takes a full regex as the field delimiter (like awk's, but with complete regex support).
First preprocess with sed: remove the first, third, and last lines, collapse each spaces-pipe-spaces run into a single |, and strip the leading |. Then just split with awk on | (this last step could as well be cut -d'|' -f3):
sed '1d;3d;$d;s/ *| */|/g;s/^|//' file |
awk -F'|' '{print $3}'
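The same idea also fits into a single awk call, with the border lines skipped inside awk instead of by sed; a sketch (field numbers are shifted by one because each kept line starts with a delimiter):
awk -F' *[|] *' '/^[|]/ {print $4}' file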

Separate two words in the same column in awk cli?

In CLI I type:
$ timedatectl | grep Time | awk '{print $3}'
Which gives me the correct output:
Country/City
How can I print just the city, without the country?
Use this command instead:
timedatectl | grep Time | awk '{print $3}' | cut -d/ -f2
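awk can also do the matching and the splitting itself, so neither grep nor cut is needed; a sketch assuming timedatectl prints a line like "Time zone: Country/City (CEST, +0200)":
timedatectl | awk '/Time zone/ {split($3, a, "/"); print a[2]}'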

Piping grep result to awk

I would like to delete a column from CSV files containing some text as the column header.
I would like the output file to have the same name as the file found by grep.
I did the following
grep -l "pattern" * | xargs -0 awk -F'\t' '{print $1"\t"$2}' > output_file
How to output the result to the same file found by the grep ?
Thank you.
Just do this :
grep -l "pattern" * | xargs awk -F'\t' '{print $1"\t"$2 > FILENAME}'
FILENAME is awk's built-in variable holding the name of the current input file.
Example:
$ cat file1
ABC zzz
EFG xxx
HIJ yyy
$ cat file2
123 aaa
456 bbb
789 ccc
grep -l "123" * | xargs awk '{print $2"\t"$1 > FILENAME}'
Here I switch columns 1 and 2 in the file containing "123", overwriting file2.
$ cat file1
ABC zzz
EFG xxx
HIJ yyy
$ cat file2
aaa 123
bbb 456
ccc 789
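One caveat: print ... > FILENAME makes awk truncate and rewrite a file while it is still reading it, which happens to work when the file fits in awk's read buffer but is not safe in general. With GNU awk 4.1+, the same idea can be written with the in-place extension instead; a sketch:
grep -l "pattern" * | xargs gawk -i inplace -F'\t' '{print $1 "\t" $2}'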

Using multicharacter field separator using AWK

I'm having problems with awk's field delimiter.
The input file appears as below:
1 | all | | synonym |
1 | root | | scientific name |
2 | Bacteria | Bacteria | scientific name |
2 | Monera | Monera | in-part |
2 | Procaryotae | Procaryotae | in-part |
2 | Prokaryota | Prokaryota | in-part |
2 | Prokaryotae | Prokaryotae | in-part |
2 | bacteria | bacteria | blast name |
The field delimiter here is tab, pipe, tab (\t|\t),
so in my attempt to print just the 1st and 2nd columns:
awk -F'\t|\t' '{print $1 "\t" $2}' nodes.dmp | less
Instead of the desired output, the output is the 1st column followed by the pipe character. I tried escaping the pipe (\t\|\t), but the output remains the same.
1 |
1 |
2 |
2 |
2 |
2 |
Printing the 1st and 3rd columns, however, gave me the intended output.
awk -F'\t|\t' '{print $1 "\t" $3}' nodes.dmp | less
but I'm puzzled as to why the first command is not working as intended.
I understand that the perl one-liner below will work, but what I really want is to use awk.
perl -aln -F"\t\|\t" -e 'print $F[0],"\t",$F[1]' nodes.dmp | less
The value of -F is interpreted as an extended regular expression, and | is the alternation operator there: \t|\t means "\t or \t", which is just a single tab. Tell awk to interpret the | literally by putting it in a bracket expression:
$ awk -F'\t[|]\t' '{print $1 "\t" $2}'
1 all
1 root
2 Bacteria
2 Monera
2 Procaryotae
2 Prokaryota
2 Prokaryotae
2 bacteria
From your posted input:
your lines can end in |, not |\t, and
you have cases (the first 2 lines) where the input contains |\t|, and
your lines start with a tab
So an FS of tab-pipe-tab is wrong, since it won't match any of the above cases: at the end of a line there is only tab-pipe to match; in the |\t| case, the single middle tab gets consumed by the preceding tab-pipe-tab separator, leaving only pipe-tab before the next field; and the leading tab would be left attached to the first field.
What you actually need is to set the FS to just tab-pipe and then strip off the leading tab from each field:
awk -F'\t[|]' -v OFS='\t' '{for (i=1; i<=NF; i++) sub(/^\t/, "", $i); print $1, $2}' file
That way you can handle all fields from 1 to NF-1 exactly the same as each other.
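A quick check of that approach on one line shaped like the posted input (leading tab, tab-pipe-tab between fields, trailing tab-pipe):
$ printf '\t1\t|\tall\t|\t\t|\tsynonym\t|\n' | awk -F'\t[|]' -v OFS='\t' '{for (i=1; i<=NF; i++) sub(/^\t/, "", $i); print $1, $2}'
1	all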
Using the cut command:
cut -f1,2 -d'|' file.txt
Without the pipe character in the output:
cut -f1,2 -d'|' file.txt | tr -d '|'