Linux remove space/column if matching string pattern

I would like to use awk, sed, or any other filter available in a native shell to remove the space between two strings (columns) on lines that match a pattern, leave the spacing untouched on lines that do not match, and then display the output.
My file:
# cat test2
col0 col1 col2 col3 col4 col5
ln1 abc def ghi jkl mno
ln2 abc defghi jkl mno pqr
The output I would like:
# <something to view/grep/filter with> test2
col0 col1 col2 col3 col4 col5
ln1 abc defghi jkl mno
ln2 abc defghi jkl mno pqr
I have tried multiple combinations of grep, awk, and cut, but was not able to do it. I am not good with sed, but I can try. I even tried an interim file, i.e. echoing the output to a file and then running grep on it, but that failed too.
Edit, with more of my requirements:
My biggest problem is that I can't predict where the space will be or what the contents of the row entries will be. So I would like sed to produce the output based on column numbers rather than on a specific string.
My file:
# cat test2
col0 col1 col2 col3 col4 col5
ln1 abc def ghi jkl mno
ln2 abc defghi jkl mno pqr
ln3 aaa bbb ccc ddd eee
ln4 aaa bbbccc ddd eee
Output File:
# <something to view/grep/filter with> test2
col0 col1 col2 col3 col4 col5
ln1 abc def ghi jkl mno
ln2 abc defghi jkl mno pqr
ln3 aaa bbbccc ddd eee
ln4 aaa bbbccc ddd eee

sed 's/def ghi/defghi/' file
If that's not what you wanted then edit your question to clarify your requirements and provide input/output that better demonstrates your problem.
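The sed command above only handles the literal string "def ghi". Since the edited question says the contents cannot be predicted, here is a minimal sketch of one possible reading of the requirement, written in Python rather than sed. It assumes that two adjacent fields should be joined whenever their concatenation already appears as a single token somewhere else in the file. That rule is a guess, not something the question states, and the function name join_split_fields is hypothetical.

```python
def join_split_fields(lines):
    """Join adjacent fields whose concatenation occurs as one token elsewhere.

    Assumption (not from the question): a pair such as "def ghi" is a split
    value if the joined form "defghi" is present as a token in the input.
    """
    rows = [line.split() for line in lines]
    # Every whitespace-separated token seen anywhere in the input.
    known = {tok for row in rows for tok in row}
    out = []
    for row in rows:
        merged = []
        i = 0
        while i < len(row):
            # Merge fields i and i+1 if the joined form is a known token.
            if i + 1 < len(row) and row[i] + row[i + 1] in known:
                merged.append(row[i] + row[i + 1])
                i += 2
            else:
                merged.append(row[i])
                i += 1
        out.append(" ".join(merged))
    return out


sample = [
    "col0 col1 col2 col3 col4 col5",
    "ln1 abc def ghi jkl mno",
    "ln2 abc defghi jkl mno pqr",
    "ln3 aaa bbb ccc ddd eee",
    "ln4 aaa bbbccc ddd eee",
]
for line in join_split_fields(sample):
    print(line)
```

On the sample rows this joins "def ghi" in ln1 and "bbb ccc" in ln3 and leaves the other lines alone. It also collapses runs of whitespace to single spaces, which may or may not be acceptable for a real file.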

Related

remove rows with today's date pandas

I have a df with several columns, and I want to remove rows that have today's date time.
col1 col2 col3 col4
ABC 2022-08-12 00:03:29.872 123 A1B2
BCD 2022-08-12 00:02:08.067 234 B1C2
CDE 2022-08-11 23:57:24.208 345 C1D2
DEF 2022-08-11 23:56:55.257 456 D1E2
Expected result (assuming today's date is 12 August 2022):
col1 col2 col3 col4
CDE 2022-08-11 23:57:24.208 345 C1D2
DEF 2022-08-11 23:56:55.257 456 D1E2
I tried the following:
df[pd.to_datetime(df.col2, errors='coerce') < pd.to_datetime('today')]
but it does not work; I still get rows from today. Can someone please help me with this?

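pd.to_datetime('today') returns the current timestamp, not midnight, so every row from earlier today still passes the < comparison. Compare calendar dates instead, using Series.dt.date with != Timestamp.date():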
df = df[pd.to_datetime(df.col2, errors='coerce').dt.date != pd.to_datetime('today').date()]
print (df)
col1 col2 col3 col4
2 CDE 2022-08-11 23:57:24.208 345 C1D2
3 DEF 2022-08-11 23:56:55.257 456 D1E2
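The difference between comparing timestamps and comparing dates can be seen without pandas. The following is a minimal sketch in plain Python using the question's rows; the value of "now" is pinned to a made-up time on 12 August 2022 so the result is reproducible, and the helper name drop_rows_dated is hypothetical.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

rows = [
    ("ABC", "2022-08-12 00:03:29.872", 123, "A1B2"),
    ("BCD", "2022-08-12 00:02:08.067", 234, "B1C2"),
    ("CDE", "2022-08-11 23:57:24.208", 345, "C1D2"),
    ("DEF", "2022-08-11 23:56:55.257", 456, "D1E2"),
]


def drop_rows_dated(rows, day):
    """Keep rows whose timestamp (second field) does not fall on `day`."""
    return [r for r in rows if datetime.strptime(r[1], FMT).date() != day]


# Assumed "current time" for the example; real code would call datetime.now().
now = datetime(2022, 8, 12, 9, 0, 0)

# Timestamp comparison: rows from earlier today are still "less than now".
by_timestamp = [r for r in rows if datetime.strptime(r[1], FMT) < now]

# Date comparison: rows from today are dropped.
by_date = drop_rows_dated(rows, now.date())

print(len(by_timestamp), len(by_date))
print([r[0] for r in by_date])
```

With the pinned time, the timestamp comparison keeps all four rows, while the date comparison keeps only CDE and DEF.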

How to merge two dataframes in PySpark where the output dataframe has alternate rows from each of the input dataframes?

I have two input dataframes like below:
ABC DEF GHI
PQR STU VWZ
SMT YUH SGR
SWI FYG LKU
and
HI HELLO HOW
ARE YOU FINE
ETC NO WORRY
SAY YOU ARE
Output:
ABC DEF GHI
HI HELLO HOW
PQR STU VWZ
ARE YOU FINE
SMT YUH SGR
ETC NO WORRY
SWI FYG LKU
SAY YOU ARE
How to achieve this in PySpark (Scala Spark)?
Dataframe creation scripts for convenience:
data1 = [('ABC', 'DEF', 'GHI'),
('PQR', 'STU', 'VWZ'),
('SMT', 'YUH', 'SGR'),
('SWI', 'FYG', 'LKU')]
df1 = spark.createDataFrame(data1, ["A", "B", "C"])
data2 = [('HI', 'HELLO', 'HOW'),
('ARE', 'YOU', 'FINE'),
('ETC', 'NO', 'WORRY'),
('SAY', 'YOU', 'ARE')]
df2 = spark.createDataFrame(data2, ["A1", "B1", "C1"])

Add a row number to each dataframe, union the two, and then order the result by row number; this interleaves the rows of the two dataframes.
With row numbers:
ABC DEF GHI 1
PQR STU VWZ 2
SMT YUH SGR 3
SWI FYG LKU 4
HI HELLO HOW 1
ARE YOU FINE 2
ETC NO WORRY 3
SAY YOU ARE 4
After union and order by:
ABC DEF GHI 1
HI HELLO HOW 1
PQR STU VWZ 2
ARE YOU FINE 2
SMT YUH SGR 3
ETC NO WORRY 3
SWI FYG LKU 4
SAY YOU ARE 4
Even so, the order of the two rows that share a row number is not guaranteed and may vary, like this:
ABC DEF GHI 1
HI HELLO HOW 1
ARE YOU FINE 2
PQR STU VWZ 2
SMT YUH SGR 3
ETC NO WORRY 3
SAY YOU ARE 4
SWI FYG LKU 4
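One way to remove that ambiguity is to sort on a second key that identifies which dataframe a row came from. The sketch below shows the idea in plain Python, with lists of tuples standing in for the dataframes; it is not PySpark code, and the function name interleave is hypothetical. In PySpark the same idea would presumably mean adding a constant source column to each dataframe (for example with lit) and ordering by row number and then by that column.

```python
def interleave(rows1, rows2):
    """Alternate rows from two tables, taking rows1 first within each pair."""
    # Tag every row with (row number, source id). The source id breaks the
    # tie between two rows that share a row number.
    tagged = [(n, 0, row) for n, row in enumerate(rows1, start=1)]
    tagged += [(n, 1, row) for n, row in enumerate(rows2, start=1)]
    tagged.sort(key=lambda t: (t[0], t[1]))
    return [row for _, _, row in tagged]


data1 = [("ABC", "DEF", "GHI"),
         ("PQR", "STU", "VWZ"),
         ("SMT", "YUH", "SGR"),
         ("SWI", "FYG", "LKU")]
data2 = [("HI", "HELLO", "HOW"),
         ("ARE", "YOU", "FINE"),
         ("ETC", "NO", "WORRY"),
         ("SAY", "YOU", "ARE")]

for row in interleave(data1, data2):
    print(*row)
```

Sorting on the pair (row number, source id) makes the result deterministic: the row from the first table always precedes the row from the second table with the same number.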

Remove multiple strings at a time in sql

I want to remove 'hi', 'by', and 'dy' from col2 in one shot in SQL. I'm very new to SQL Server; if anyone could outline how such problems are solved, that would be really helpful.
Col1 col2 col3
A hi!abcd 123
B bypython 678
C norm 888
D dupty dy 999
output:
Col1 col2 col3
A abcd 123
B python 678
C norm 888
D dupty 999
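A common way to remove several substrings in one expression is to chain replacements, each one working on the result of the previous one; in SQL Server that would typically be written as nested REPLACE calls. The following is only a sketch of that chaining idea in Python, not SQL, and the helper name remove_all is hypothetical.

```python
def remove_all(text, unwanted):
    """Remove every occurrence of each unwanted substring, one after another."""
    for piece in unwanted:
        text = text.replace(piece, "")
    # Trimming is an extra step added here so "dupty dy" becomes "dupty".
    return text.strip()


col2 = ["hi!abcd", "bypython", "norm", "dupty dy"]
cleaned = [remove_all(value, ["hi", "by", "dy"]) for value in col2]
print(cleaned)
```

Removing only 'hi' from "hi!abcd" leaves "!abcd", so the "!" would have to be added to the list of strings to remove in order to get the "abcd" shown in the expected output.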

Parse hive array<string> output into either json format or something like following

I have a Hive table with 3 columns; col3 is an array. The output from the table looks like this:
select col1, col2, col3 from testing;
col1 col2 col3
xyx 123 ["xyz","Good investing","123","abc","Bad investing","006","port123","future investing","008","flaf4","good research investing","01"]
xyx 789 ["xyz","Good investing","789","flag1","Bad investing","006","port123","future investing","008"]
I want to parse col3 so that the output looks like the following:
xyx 123 "xyz","Good investing","123"
xyx 123 "abc","Bad investing","006","port123"
xyx 123 "future investing","008","flaf4"
xyx 123 "good research investing","01"
xyx 789 "xyz","Good investing","789",
xyx 789 "flag1","Bad investing","006",
xyx 789 "port123","future investing","008"
Any help will be highly appreciated.
-kb

Awk for handling three files

Input:
File 1
col1 col2 col3 col4 col5 col6 col7
A 91 - E Abu 7 -
B 82 - f Anu 9 -
C 93 - G Aru 8 -
File 2
col1 col2 col3 col4 col5 col6 col7
A 91 - x Bob 7 -
B 82 - y Bag 9 -
C 93 - z Bui 8 -
File 3
col1 col2 col3 col4 col5 col6 col7
A 91 - T Can 7 -
B 82 - U Con 9 -
C 93 - V Cuu 8 -
Output Expected:
col1 col2 col3 col4 col5
A 91 Abu Bob Can
B 82 Anu Bag Con
C 93 Aru Bui Cuu
I have three files having same data at col1 and 2. I need to print the fifth column of every file along with the first two columns.
I am able to do this with two files. Can anyone help me do it with three or more files?

Here is one way using awk:
$ awk '
BEGIN {
SUBSEP = FS;
print "col1 col2 col3 col4 col5"
}
FNR>1 {
a[$1,$2] = (a[$1,$2]?a[$1,$2]FS$5:$5)
}
END {
for(x in a) print x, a[x]
}' file1 file2 file3
col1 col2 col3 col4 col5
C 93 Aru Bui Cuu
A 91 Abu Bob Can
B 82 Anu Bag Con
The data rows come out in arbitrary order because awk does not define the iteration order of for (x in a); pipe the output through sort if you need it sorted. The approach is not limited to three files: list as many file names as you like at the end, or use * to glob every file in a directory.
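The same grouping idea can be sketched in Python: key each row on the first two columns and append the fifth column from every file. The example below uses in-memory lists of lines in place of real files so that it is self-contained; the function name collect_fifth_column is hypothetical.

```python
def collect_fifth_column(files):
    """Group column 5 of every file by the (col1, col2) key.

    `files` is an iterable of iterables of lines, each starting with a header.
    """
    merged = {}
    for lines in files:
        for lineno, line in enumerate(lines):
            if lineno == 0:  # skip the header, like FNR>1 in the awk version
                continue
            fields = line.split()
            if len(fields) < 5:  # ignore blank or short lines
                continue
            merged.setdefault((fields[0], fields[1]), []).append(fields[4])
    return merged


header = "col1 col2 col3 col4 col5 col6 col7"
file1 = [header, "A 91 - E Abu 7 -", "B 82 - f Anu 9 -", "C 93 - G Aru 8 -"]
file2 = [header, "A 91 - x Bob 7 -", "B 82 - y Bag 9 -", "C 93 - z Bui 8 -"]
file3 = [header, "A 91 - T Can 7 -", "B 82 - U Con 9 -", "C 93 - V Cuu 8 -"]

print("col1 col2 col3 col4 col5")
for (k1, k2), values in collect_fifth_column([file1, file2, file3]).items():
    print(k1, k2, *values)
```

A Python dict keeps insertion order, so the rows come out in the order the keys were first seen, and no separate sort step is needed for this input.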

This assumes all three files have the same number of rows, in the same order, as this sentence from the question suggests:
I have three files having same data at col1 and 2.
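awk 'BEGIN { OFS = "\t"
getline < "file1"; getline < "file2"; getline < "file3"   # skip the headers
print "col1", "col2", "col3", "col4", "col5"
# Read one row per file per pass; stop at end of input or on a read error.
while ((getline < "file1") > 0) {
a = $1; b = $2; c = $5
if ((getline < "file2") <= 0) exit; d = $5
if ((getline < "file3") <= 0) exit; e = $5
print a, b, c, d, e
}
}'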
Output:
col1 col2 col3 col4 col5
A 91 Abu Bob Can
B 82 Anu Bag Con
C 93 Aru Bui Cuu
This discards the first line of each file, then reads the files line by line in lockstep, printing the desired fields separated by tabs (OFS).
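Reading the files in lockstep corresponds to zip in Python. The following is a minimal sketch under the same assumption as above (every file has the same rows in the same order), again with in-memory lists of lines standing in for the files; the function name paste_fifth_column is hypothetical.

```python
def paste_fifth_column(*files):
    """Read all files in lockstep; return col1, col2 and every file's col5."""
    iterators = [iter(f) for f in files]
    for it in iterators:
        next(it, None)  # discard the header line of each file
    out = []
    # zip stops as soon as the shortest file runs out of lines.
    for group in zip(*iterators):
        fields = [line.split() for line in group]
        out.append(fields[0][:2] + [f[4] for f in fields])
    return out


header = "col1 col2 col3 col4 col5 col6 col7"
file1 = [header, "A 91 - E Abu 7 -", "B 82 - f Anu 9 -", "C 93 - G Aru 8 -"]
file2 = [header, "A 91 - x Bob 7 -", "B 82 - y Bag 9 -", "C 93 - z Bui 8 -"]
file3 = [header, "A 91 - T Can 7 -", "B 82 - U Con 9 -", "C 93 - V Cuu 8 -"]

print("col1", "col2", "col3", "col4", "col5", sep="\t")
for row in paste_fifth_column(file1, file2, file3):
    print(*row, sep="\t")
```

Unlike the keyed approach, this version never checks that the first two columns actually match across files, so it is only safe when the row order is known to be identical.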
