Remove column taking last column as reference - awk

I am looking to remove the third-last and second-last columns and print the rest using bash, e.g.:
Line 1 ------ A B C D E F G H I J K
Line 2 ------ A B C D E F E F I G H I J M
Line 3 ------ A B C D E I J Y
Line 4 ------ A B C D A B C D F G J E F G H I J C
Now, taking the last column as reference ($NF), I need to remove the third-last and second-last columns.
The desired output should look like the lines below, where I J has been removed from each line.
Line 1 ------ A B C D E F G H K
Line 2 ------ A B C D E F E F I G H M
Line 3 ------ A B C D E Y
Line 4 ------ A B C D A B C D F G J E F G H C
Thanks

Depending on whether you want to keep or collapse the separators around the removed fields:
$ awk '{$(NF-2)=$(NF-1)=""}1' file
Line 1 ------ A B C D E F G H   K
Line 2 ------ A B C D E F E F I G H   M
Line 3 ------ A B C D E   Y
Line 4 ------ A B C D A B C D F G J E F G H   C
$ awk '{$(NF-2)=$(NF-1)=""; $0=$0; $1=$1}1' file
Line 1 ------ A B C D E F G H K
Line 2 ------ A B C D E F E F I G H M
Line 3 ------ A B C D E Y
Line 4 ------ A B C D A B C D F G J E F G H C
You said in a comment something about ...retains my tab delimiter. If your fields are tab-separated, state that in your question and add BEGIN{FS=OFS="\t"} at the start of the script.
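
For comparison, here is a minimal Python sketch of the same two behaviours, assuming whitespace-separated fields. The function names are made up for illustration. Blanking the two fields keeps their separators, which leaves three spaces before the last field, while deleting them collapses the gap.

```python
def blank_fields(line: str) -> str:
    """Like the first awk command: empty the two fields, keep separators."""
    fields = line.split()
    if len(fields) >= 3:
        fields[-3] = fields[-2] = ""
    return " ".join(fields)


def drop_fields(line: str) -> str:
    """Like the second awk command: remove the fields and their separators."""
    fields = line.split()
    if len(fields) >= 3:
        del fields[-3:-1]
    return " ".join(fields)


line = "Line 3 ------ A B C D E I J Y"
print(repr(blank_fields(line)))  # 'Line 3 ------ A B C D E   Y'
print(repr(drop_fields(line)))   # 'Line 3 ------ A B C D E Y'
```

Unlike the awk versions, this sketch leaves lines with fewer than three fields untouched.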

You can do this with a for loop inside of awk:
awk '{for(i=1;i<=NF;++i){if (i<NF-2||i==NF){printf i==NF?"%s\n":"%s ", $i}}}'
This loops over all of the columns and prints each one unless it is the 2nd or 3rd from last, appending a line feed instead of a space after the last column.
There may be a prettier way to do it in awk, but it works.

This might work for you (GNU sed):
sed -E 's/(\s+\S+){3}$/\1/' file
Replace the last 3 fields with the last field on each line.
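
The same substitution can be tried out in Python, whose `re` module also makes a repeated group capture its last iteration. This is only a sketch to show why `\1` ends up holding the final field; `keep_last_of_three` is a made-up name.

```python
import re

# Same pattern as the sed command: three trailing "whitespace + field" groups.
# After the match, group 1 holds the last repetition, i.e. the final field.
LAST_THREE = re.compile(r"(\s+\S+){3}$")


def keep_last_of_three(line: str) -> str:
    """Replace the last three fields with just the last one."""
    return LAST_THREE.sub(r"\1", line)


print(keep_last_of_three("Line 1 ------ A B C D E F G H I J K"))
# Line 1 ------ A B C D E F G H K
```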

How to modify groups of a grouped pandas dataframe

I have this dataframe:
s = pd.DataFrame({'A': [*'1112222'], 'B': [*'abcdefg'], 'C': [*'ABCDEFG']})
that is like this:
A B C
0 1 a A
1 1 b B
2 1 c C
3 2 d D
4 2 e E
5 2 f F
6 2 g G
I want to do a groupby like this:
groups = s.groupby("A")
for example, the group 2 is:
g2 = groups.get_group("2")
that looks like this:
A B C
3 2 d D
4 2 e E
5 2 f F
6 2 g G
Anyway, I want to do some operation in each group.
Let me show how my final result should be:
A B C D
1 1 b B a=b;A=B
2 1 c C a=c;A=C
4 2 e E d=e;D=E
5 2 f F d=f;D=F
6 2 g G d=g;D=G
Actually, I am dropping the first row in each group but combining it with the other rows of the group to create column D.
Any idea how to do this?
Summary of what I want to do in two lines:
I want to do a group by and in each group, I want to drop the first row. I also want to add a column to the whole dataframe that is based on the rows of the group
What I have tried:
In order to solve this, I am going to create a function:
def func(g):
    first_row_of_group = g.iloc[0]
    g = g.iloc[1:]
    g["C"] = g.apply(lambda row: ";".join([f'{a}={b}' for a, b in zip(row, first_row_of_group)]))
    return g
Then I am going to do this:
groups.apply(lambda g: func(g))
You can apply a custom function to each group where you add the elements from the first row to the remaining rows and remove it:
def remove_first(x):
    first = x.iloc[0]
    # copy so that adding column D does not write into a slice of the original
    x = x.iloc[1:].copy()
    x['D'] = first['B'] + '=' + x['B'] + ';' + first['C'] + '=' + x['C']
    # an equivalent operation
    # x['D'] = first.iloc[1] + '=' + x.iloc[:,1] + ';' + first.iloc[2] + '=' + x.iloc[:,2]
    return x
s = s.groupby('A').apply(remove_first).droplevel(0)
Output:
A B C D
1 1 b B a=b;A=B
2 1 c C a=c;A=C
4 2 e E d=e;D=E
5 2 f F d=f;D=F
6 2 g G d=g;D=G
Note: The dataframe shown in your question is constructed from
s = pd.DataFrame({'A': [*'1112222'], 'B': [*'abcdefg'], 'C': [*'ABCDEFG']})
but you give a different one as raw input.
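
The per-group logic can also be sketched without pandas, which may make the individual steps easier to follow. This version assumes the rows are already sorted by column A, as in the example, and works on plain tuples; `combine_with_first` is a made-up name.

```python
from itertools import groupby

rows = list(zip("1112222", "abcdefg", "ABCDEFG"))  # (A, B, C) tuples


def combine_with_first(rows):
    """Drop the first row of each A-group and add D built from that row."""
    out = []
    # groupby only merges adjacent rows, so the input must be sorted by A.
    for _, group in groupby(rows, key=lambda r: r[0]):
        first, *rest = group
        for a, b, c in rest:
            out.append((a, b, c, f"{first[1]}={b};{first[2]}={c}"))
    return out


for row in combine_with_first(rows):
    print(*row)
```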

Pandas Merging Data Frames Repeated Values and Values Missing

So I've created three data frames from 3 separate files (csv and xls). I want to combine the three of them into a single data frame that is 20 columns and 15 rows. I've managed to successfully do this using the code at the bottom (this is the final part of the code where I started to merge all of the existing data frames I created). However, an odd thing is happening, where the highest ranking country is duplicated 3 times, and there are two values from the 15 columns that should be there but that are missing, and I'm not exactly sure why.
I've set the index to be the same in each data frame!
So essentially my issue is that there are duplicate values showing up and other values being eliminated after I merge the data frames.
If someone could explain the mechanics of why this issue is occurring I'd really appreciate it :)
merged = pd.merge(pd.merge(df_ScimEn,df_energy[ListEnergy],left_index=True,right_index=True),df_GDP[ListOfGDP],left_index=True,right_index=True)
merged = merged[ListOfColumns]
merged = merged.sort_values('Rank')
merged = merged[merged['Rank']<16]
final = pd.DataFrame(merged)
Example: a shorter version of what is happening
expected:
A B C D J K L R
1 x y z j a e c d
2 b c d l a l c d
3 j k e k a m c d
4 d k c k a n h d
5 d k j l a h c d
generated after I run the code above: (the 1 is repeated and the 3 is missing)
A B C D J K L R
1 x y z j a b c d
1 x y z j a b c d
1 x y z j a b c d
4 d k c k a b h d
5 d k j l a h c d
Example Input
df1 = {[1:A,B,C],[2:A,B,C],[3:A,B,C],[4:A,B,C],[5:A,B,C]}
df2 = {[1:J,K,L,M],[2:J,K,L,M],[3:J,K,L,M],[4:J,K,L,M],[5:J,K,L,M]}
df3 = {[1:R,E,T],[2:R,E,T],[3:R,E,T],[4:R,E,T],[5:R,E,T]}
So the indexes are all the same for each data frame. Some frames have a
different number of rows and a different number of columns, but I've edited them
to form the final data frame. Each capital letter stands for a column
name, with different values in each column.
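
The frames themselves are not shown, so the cause can only be guessed at. One common mechanism, offered here as an assumption rather than a diagnosis: an index-on-index inner merge pairs every left row with every right row carrying the same label, so a label that occurs more than once is multiplied, and a label missing from either side is dropped. The plain-Python sketch below imitates that behaviour with made-up data; `inner_join_on_label` is a hypothetical helper, not a pandas function.

```python
def inner_join_on_label(left, right):
    """Pair each (label, values) row in left with every matching row in right."""
    joined = []
    for label, lvals in left:
        for rlabel, rvals in right:
            if label == rlabel:
                joined.append((label, lvals + rvals))
    return joined


# Made-up data: label 1 appears three times on the right, label 3 is absent.
left = [(1, ("x",)), (2, ("b",)), (3, ("j",))]
right = [(1, ("a",)), (1, ("a",)), (1, ("a",)), (2, ("a",))]

for label, values in inner_join_on_label(left, right):
    print(label, *values)
```

If something like this is going on, checking `df.index.is_unique` on each frame and comparing the index labels between frames (for example, differently spelled country names) would be a reasonable first step.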

Append two pandas dataframe with different shapes and in for loop using python or pandasql

I have two dataframes such as:
df1:
id A B C D
1 a b c d
1 e f g h
1 i j k l
df2:
id A C D
2 x y z
2 u v w
The final outcome should be:
id  A  B  C  D
1   a  b  c  d
1   e  f  g  h
1   i  j  k  l
2   x     y  z
2   u     v  w
These tables are generated in a for loop from JSON files, so I have to keep appending them one below another.
Note: the 'id' column always differs between the two dataframes.
My approach:
data is a dataframe in which column 'X' holds the JSON data; it also has an "id" column.
df1=pd.DataFrame()
for i, row1 in data.head(2).iterrows():
    df2= pd.io.json.json_normalize(row1["X"])
    df2.columns = df2.columns.map(lambda x: x.split(".")[-1])
    df2["id"]=[row1["id"] for i in range(df2.shape[0])]
    if len(df1)==0:
        df1=df2.copy()
    df1=pd.concat((df1,df2), ignore_index=True)
Error: AssertionError: Number of manager items must equal union of block items # manager items: 46, # tot_items: 49
How can I solve this using Python or pandasql?
You can use pd.concat to concatenate the two dataframes:
>>> pd.concat((df1,df2), ignore_index=True)
id A B C D
0 1 a b c d
1 1 e f g h
2 1 i j k l
3 2 x NaN y z
4 2 u NaN v w
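
As for the AssertionError: a plausible cause, stated as an assumption because the JSON is not shown, is duplicate column names. The renaming step in the question keeps only the last dotted component of each flattened name, so two different nested keys can collapse to the same column. The sketch below uses hypothetical column names to show how that happens and how to detect it before concatenating.

```python
from collections import Counter

# Hypothetical flattened column names, of the kind json_normalize produces.
columns = ["user.name", "user.id", "account.id", "account.plan"]

# Same renaming as in the question: keep only the last dotted component.
short = [name.split(".")[-1] for name in columns]

# Any name that now occurs more than once is a duplicate column.
duplicates = [name for name, count in Counter(short).items() if count > 1]

print(short)       # ['name', 'id', 'id', 'plan']
print(duplicates)  # ['id']
```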

Combine 2 files with AWK based on last columns [closed]

I have two files:
file1
-------------------------------
1 a t p b
2 b c f a
3 d y u b
2 b c f a
2 u g t c
2 b j h c
file2
--------------------------------
1 a b
2 p c
3 n a
4 4 a
I want to combine these 2 files based on their last columns (column 5 of file1 and column 3 of file2) using awk.
result
----------------------------------------------
1 a t p 1 a b
2 b c f 3 n a
2 b c f 4 4 a
3 d y u 1 a b
2 b c f 3 n a
2 b c f 4 4 a
2 u g t 2 p c
2 b j h 2 p c
At the very beginning I didn't see the duplicated "a" in file2, so I thought it could be solved with normal array matching. ... Now it works.
An awk one-liner:
awk 'NR==FNR{a[$3"_"NR]=$0;next;}{for(x in a){if(x~"^"$5) print $1,$2,$3,$4,a[x];}}' f2.txt f1.txt
test
kent$ head *.txt
==> f1.txt <==
1 a t p b
2 b c f a
3 d y u b
2 b c f a
2 u g t c
2 b j h c
==> f2.txt <==
1 a b
2 p c
3 n a
4 4 a
kent$ awk 'NR==FNR{a[$3"_"NR]=$0;next;}{for(x in a){if(x~"^"$5) print $1,$2,$3,$4,a[x];}}' f2.txt f1.txt
1 a t p 1 a b
2 b c f 3 n a
2 b c f 4 4 a
3 d y u 1 a b
2 b c f 3 n a
2 b c f 4 4 a
2 u g t 2 p c
2 b j h 2 p c
Note: the output format is not pretty, but it is acceptable if you pipe it to column -t.
Another way, assuming the files have no headers:
awk '
    FNR == NR {
        # first file (file2): collect its lines, keyed by last field
        f2[ $NF ] = f2[ $NF ] ? f2[ $NF ] SUBSEP $0 : $0;
        next;
    }
    FNR < NR {
        if ( $NF in f2 ) {
            len = split( f2[ $NF ], a, SUBSEP );
            for ( i = 1; i <= len; i++ ) {
                # replace the last field with each matching file2 line in turn
                $NF = a[ i ];
                printf "%s\n", $0;
            }
        }
    }
' file2 file1 | column -t
It yields:
1 a t p 1 a b
2 b c f 3 n a
2 b c f 4 4 a
3 d y u 1 a b
2 b c f 3 n a
2 b c f 4 4 a
2 u g t 2 p c
2 b j h 2 p c
A bit easier in a language that supports arbitrary data structures (list of lists). Here's Ruby:
# read "file2" and group by the last field
file2 = File.foreach('file2').map(&:split).group_by {|fields| fields[-1]}
# process file1
File.foreach('file1').map(&:split).each do |fields|
  file2[fields[-1]].each do |fields2|
    puts((fields[0..-2] + fields2).join(" "))
  end
end
outputs
1 a t p 1 a b
2 b c f 3 n a
2 b c f 4 4 a
3 d y u 1 a b
2 b c f 3 n a
2 b c f 4 4 a
2 u g t 2 p c
2 b j h 2 p c
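
The same join can be sketched in Python. This is only an illustration of the approach the answers share (group file2 by its last field, then look up each file1 line); the function name is made up, and the data is inlined rather than read from the files. It uses an exact key lookup, so a key such as "a" cannot accidentally match a longer key that merely starts with "a".

```python
from collections import defaultdict


def join_on_last_field(file1_lines, file2_lines):
    """For each file1 line, emit one row per file2 line with the same last field."""
    by_key = defaultdict(list)
    for line in file2_lines:
        fields = line.split()
        if fields:
            by_key[fields[-1]].append(fields)

    out = []
    for line in file1_lines:
        fields = line.split()
        if not fields:
            continue
        # Lines whose key has no match in file2 produce no output.
        for match in by_key.get(fields[-1], []):
            out.append(" ".join(fields[:-1] + match))
    return out


file1 = ["1 a t p b", "2 b c f a", "3 d y u b", "2 b c f a", "2 u g t c", "2 b j h c"]
file2 = ["1 a b", "2 p c", "3 n a", "4 4 a"]
print("\n".join(join_on_last_field(file1, file2)))
```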

strange html file returned by web server

While working on a web crawler, I ran across this strange occurrence; the following is a snippet of the page content returned by the web server for http://nexgen.ae :
< ! D O C T Y P E H T M L P U B L I C " - / / W 3 C / / D T D H T M L 4 . 0 T r a n s i t i o n a l / / E N " >
< H T M L > < H E A D > < T I T L E > N e x G e n T e c h n o l o g i e s L L C | F i n g e r p r i n t T i m e A t t e n d a n c e M a n a g e m e n t S y s t e m | A c c e s s C o n t r o l M a n a g e m e n t S y s t e m | F a c e R e c o g n i t i o n | D o o r A c c e s s C o n t r o l | E m p l o y e e s A t t e n d a n c e | S o l u t i o n P r o v i d e r | N e t w o r k S t r u c t u e d C a b l i n g | D u b a i | U A E ) < / T I T L E >
As you can see, the web server seems to have inserted a space character after every other character in the original HTML source. I checked the HTML source with "Page Source" in Firefox and there were no extra spaces there. I also checked other web pages from the same website, and I am obtaining the correct HTML file for those pages. So far the problem seems to only be happening with this website's default page when accessed through a web crawler.
I noticed the html file contains "google optimizer tracking script" at the very end. I wonder if the problem has anything to do with that...
Or could this just be the Website manager's way of keeping web crawlers away? If that's the case, a robots.txt file would do!
Those probably aren't spaces; they are null bytes. The page is encoded in UTF-16 (2 or 4 bytes per character), and because the website has not properly specified its encoding in its HTTP headers, you are trying to read it as ASCII (1 byte per character) or possibly UTF-8 (1 to 4 bytes per character).
To see what I mean, open it in your browser, change the encoding (somewhere in the browser's menus; you might have to right-click on the page) and choose the UTF-16LE option.
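
The effect is easy to reproduce. The sketch below assumes the crawler decodes the bytes with a single-byte encoding; the sample string is just the start of the doctype, not the real page.

```python
text = "<!DOCTYPE HTML>"

raw = text.encode("utf-16-le")     # two bytes per character for this text
misread = raw.decode("ascii")      # wrongly treat every byte as one character

# Each real character is now followed by a NUL, which many tools show as a gap.
print(repr(misread[:8]))           # '<\x00!\x00D\x00O\x00'

# Decoding with the right encoding recovers the original text.
print(raw.decode("utf-16-le") == text)  # True
```

In a crawler, the practical fix is to detect the encoding (for example from a byte order mark, or from the NUL pattern) before decoding, instead of assuming ASCII or UTF-8.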