strange html file returned by web server - tracking

While working on a web crawler, I ran across this strange occurrence; the following is a snippet of the page content returned by the web server for http://nexgen.ae :
< ! D O C T Y P E H T M L P U B L I C " - / / W 3 C / / D T D H T M L 4 . 0 T r a n s i t i o n a l / / E N " >
< H T M L > < H E A D > < T I T L E > N e x G e n T e c h n o l o g i e s L L C | F i n g e r p r i n t T i m e A t t e n d a n c e M a n a g e m e n t S y s t e m | A c c e s s C o n t r o l M a n a g e m e n t S y s t e m | F a c e R e c o g n i t i o n | D o o r A c c e s s C o n t r o l | E m p l o y e e s A t t e n d a n c e | S o l u t i o n P r o v i d e r | N e t w o r k S t r u c t u e d C a b l i n g | D u b a i | U A E ) < / T I T L E >
As you can see, the web server seems to have inserted a space character after every character of the original HTML source. I checked the HTML source with "Page Source" in Firefox and there were no extra spaces there. I also checked other pages from the same website, and I get the correct HTML for those. So far the problem seems to happen only with this website's default page when it is accessed through a web crawler.
I noticed the HTML file contains a "google optimizer tracking script" at the very end. I wonder if the problem has anything to do with that...
Or could this just be the Website manager's way of keeping web crawlers away? If that's the case, a robots.txt file would do!

Those probably aren't spaces; they are null bytes. The page is encoded in UTF-16 (at least 2 bytes per character, always a multiple of 2), and because the website has not properly specified its encoding in its HTTP headers, you are reading it as ASCII (1 byte per character) or possibly UTF-8 (1 byte or more per character). For plain ASCII text, every second byte of UTF-16 is a null byte, which is what shows up between the letters.
To see what I mean, open the page in your browser, change the encoding (somewhere in the browser's menus; you might have to right-click on the page) and choose the UTF-16LE option.
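For a crawler, the same effect can be reproduced and worked around in a few lines of Python. This is only a sketch: the `html` string is a stand-in for the real response body, and `decode_page` is a hypothetical helper with a crude null-byte heuristic, not something the site or any library provides.

```python
# Stand-in for the start of the page; the real bytes would come from the server.
html = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">'
raw = html.encode("utf-16-le")

# Decoding with a single-byte codec keeps the null bytes,
# which many tools then display as spaces.
wrong = raw.decode("latin-1")
print(repr(wrong[:8]))

# Decoding with the matching codec recovers the original markup.
right = raw.decode("utf-16-le")
print(right)


def decode_page(data):
    """Hypothetical helper: guess UTF-16 from a BOM or a null-heavy byte pattern."""
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return data.decode("utf-16")      # BOM present: the codec picks the byte order
    if len(data) >= 2 and data[1::2].count(0) > len(data) // 4:
        return data.decode("utf-16-le")   # nulls in the odd positions
    if len(data) >= 2 and data[0::2].count(0) > len(data) // 4:
        return data.decode("utf-16-be")   # nulls in the even positions
    return data.decode("utf-8", errors="replace")
```

A charset declared in the Content-Type header, when the server sends one, should take precedence over any guess like this.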

Related

Is there a way to rearrange two data sets into one in pandas?

I have data read as follows
K a b c
p1 x y z
p2 e r x
p3 v w q
...........
..............
p9 Y..........
K d e f
p9 x y z
p8 o p q
p7 h i j
..............
............
I would like to rearrange the data as follows
K a b c.......d e f
p1 x y z ..............
p2 e r x ..............
p3 v w q.......
*
*
*
p7............h i j
p8........... o p q
p9 Y..........x y z
The index name is K, and one of the index values is also 'K'.
Can I turn that index value into the index name and merge the two parts into one?
Thank you in advance.
I split the data into two DataFrames and then tried
pd.merge(df1, df2, how='outer', left_index=True, right_index=True)
but it didn't work.

Pandas Merging Data Frames Repeated Values and Values Missing

So I've created three data frames from 3 separate files (csv and xls). I want to combine the three of them into a single data frame that is 20 columns and 15 rows. I've managed to do this using the code at the bottom (this is the final part of the code, where I merge all of the existing data frames I created). However, something odd is happening: the highest ranking country is duplicated 3 times, and two values from the 15 columns that should be there are missing, and I'm not exactly sure why.
I've set the index to be the same in each data frame!
So essentially my issue is that duplicate values show up and other values are eliminated after I merge the data frames.
If someone could explain the mechanics of why this issue is occurring, I'd really appreciate it :)
merged = pd.merge(pd.merge(df_ScimEn,df_energy[ListEnergy],left_index=True,right_index=True),df_GDP[ListOfGDP],left_index=True,right_index=True)
merged = merged[ListOfColumns]
merged = merged.sort_values('Rank')
merged = merged[merged['Rank']<16]
final = pd.DataFrame(merged)
Example: a shorter version of what is happening
expected:
A B C D J K L R
1 x y z j a e c d
2 b c d l a l c d
3 j k e k a m c d
4 d k c k a n h d
5 d k j l a h c d
generated after I run the code above: (the 1 is repeated and the 3 is missing)
A B C D J K L R
1 x y z j a b c d
1 x y z j a b c d
1 x y z j a b c d
4 d k c k a b h d
5 d k j l a h c d
Example Input
df1 = {[1:A,B,C],[2:A,B,C],[3:A,B,C],[4:A,B,C],[5:A,B,C]}
df2 = {[1:J,K,L,M],[2:J,K,L,M],[3:J,K,L,M],[4:J,K,L,M],[5:J,K,L,M]}
df3 = {[1:R,E,T],[2:R,E,T],[3:R,E,T],[4:R,E,T],[5:R,E,T]}
So the indexes are all the same for each data frame. Some of the data frames
have a different number of rows and a different number of columns, but I've
edited them to form the final data frame. Each capital letter stands for a
column name, with different values in each column.

Trying to join two text files based on the first column in both files and want to keep all the columns of the matches from the second file

I'm trying to join two text files based on their first columns and where those columns are the same I want to keep all the columns from the second file.
List1.txt
action
adan
adap
adapka
adat
yen
List2.txt
action e KK SS # n
adham a d h a m
adidas a d i d a s
administration e d m i n i s t r e SS # n
administrative e d m i n i s t r e t i v
admiral e d m aj r # l
adnan a d n a n
ado a d o
adan a d # n
adap a d a p
adapka a d a p k a
adrenalin # d r e n # l i n
adrian a d r j a n
adat a d a t
adtec e d t e k
adult # d a l t
yen j e n
I'd like to get everything from list1.txt that matches list2.txt plus all the other columns in list2.txt. List3.txt should look like this.
List3.txt
action e KK SS # n
adan a d # n
adap a d a p
adapka a d a p k a
adat a d a t
yen j e n
I've tried the following command from here:
$ awk -F: 'FNR==NR{a[$1]=$0;next}{if($1 in a){print a[$1];} else {print;}}' List1.txt List2.txt > List3.txt
I've also tried this:
$ comm <(sort List2.txt) <(sort List1.txt)
I'm sure there are ways to do this in awk, but join is also relatively simple.
join -1 1 -2 1 List1.txt <(sort -k 1,1 List2.txt) > List3.txt
You are joining List1 on its first column and List2 on its first column as well. Both inputs must be sorted on the join field for join to work; List1.txt already is, so only List2.txt is sorted here.
This produces the columns you want, separated by a single space.
List3.txt
action e KK SS # n
adan a d # n
adap a d a p
adapka a d a p k a
adat a d a t
yen j e n
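If you would rather not depend on sort order, the same join on the first column can be sketched in Python. The function name `join_on_first_column` and the shortened sample data are only illustrative. Unlike join, this keeps the matching lines in List2 order and leaves their whitespace untouched.

```python
def join_on_first_column(keys_text, table_text):
    """Keep the table lines whose first whitespace-separated field is a key."""
    wanted = {line.split()[0] for line in keys_text.splitlines() if line.strip()}
    return [
        line
        for line in table_text.splitlines()
        if line.strip() and line.split()[0] in wanted
    ]


# Shortened copies of List1.txt and List2.txt from the question.
list1 = "action\nadan\nadap\nadapka\nadat\nyen\n"
list2 = """action e KK SS # n
adham a d h a m
adnan a d n a n
adan a d # n
adap a d a p
adapka a d a p k a
adrian a d r j a n
adat a d a t
yen j e n
"""

list3 = join_on_first_column(list1, list2)
print("\n".join(list3))
```

With the real files you would read the two texts from List1.txt and List2.txt and write the result to List3.txt.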
Another simple way to do this is with grep: use the values in List1.txt as fixed strings to match against the content of List2.txt, and redirect the result to List3.txt, e.g.
grep -Ff List1.txt List2.txt > List3.txt
If you are using GNU grep, or another grep with -w, --word-regexp available, adding -w ensures that only whole words match, e.g.
grep -Fwf List1.txt List2.txt > List3.txt
Resulting List3.txt
$ cat List3.txt
action e KK SS # n
adan a d # n
adap a d a p
adapka a d a p k a
adat a d a t
yen j e n
(note: all whitespace is preserved in List3.txt)
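One thing to keep in mind with the grep approach is that the match can happen anywhere on the line, not just in the first column. The sketch below only approximates the difference between -F and -Fw in Python (the lookarounds stand in for grep's word-boundary rule). The `adana` and `doyen` lines are hypothetical entries, not taken from List2.txt.

```python
import re


def grep_fixed(keys, lines, whole_word=False):
    """Rough stand-in for grep -F (or grep -Fw when whole_word is True)."""
    patterns = []
    for key in keys:
        body = re.escape(key)
        if whole_word:
            # Approximation of -w: no word character directly before or after.
            body = r"(?<!\w)" + body + r"(?!\w)"
        patterns.append(re.compile(body))
    return [line for line in lines if any(p.search(line) for p in patterns)]


keys = ["adan", "yen"]
lines = [
    "adan a d # n",
    "adana a d a n a",   # hypothetical entry, not in List2.txt
    "doyen d o j e n",   # hypothetical entry, not in List2.txt
    "yen j e n",
]

substring_hits = grep_fixed(keys, lines)
word_hits = grep_fixed(keys, lines, whole_word=True)
print(substring_hits)
print(word_hits)
```

Even with whole-word matching, a key that happens to appear as a word in a later column would still match, so the join approach is stricter if that matters for your data.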

Remove column taking last column as reference

I am looking to remove the third-last and second-last columns and print the rest using bash, e.g.
Line 1 ------ A B C D E F G H I J K
Line 2 ------ A B C D E F E F I G H I J M
Line 3 ------ A B C D E I J Y
Line 4 ------ A B C D A B C D F G J E F G H I J C
Taking the last column as the reference ($NF), I need to remove the third-last and second-last columns.
The desired output should look like the lines below, where I and J have been removed from each line.
Line 1 ------ A B C D E F G H K
Line 2 ------ A B C D E F E F I G H M
Line 3 ------ A B C D E Y
Line 4 ------ A B C D A B C D F G J E F G H C
Thanks
Depending on whether you want to keep or collapse the separators around the removed fields:
$ awk '{$(NF-2)=$(NF-1)=""}1' file
Line 1 ------ A B C D E F G H   K
Line 2 ------ A B C D E F E F I G H   M
Line 3 ------ A B C D E   Y
Line 4 ------ A B C D A B C D F G J E F G H   C
$ awk '{$(NF-2)=$(NF-1)=""; $0=$0; $1=$1}1' file
Line 1 ------ A B C D E F G H K
Line 2 ------ A B C D E F E F I G H M
Line 3 ------ A B C D E Y
Line 4 ------ A B C D A B C D F G J E F G H C
You said something in a comment about "...retains my tab delimiter". If your fields are tab-separated, state that in your question and add BEGIN{FS=OFS="\t"} at the start of the script.
You can do this with a for loop inside of awk:
awk '{for(i=1;i<=NF;++i){if (i<NF-2||i==NF){printf i==NF?"%s\n":"%s ", $i}}}'
That just loops through all of the columns. If a column isn't the 2nd or 3rd from last, it is printed, followed by a line feed if it's the last column and by a space otherwise.
There may be a prettier way to do it in awk, but it works.
This might work for you (GNU sed):
sed -E 's/(\s+\S+){3}$/\1/' file
Replace the last 3 fields with the last field on each line.
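For comparison, here is the same idea sketched in Python, in both the "collapse" and the "keep separators" flavours. The function names are made up for illustration, and returning lines with fewer than three fields unchanged is an assumption of this sketch, not something the awk or sed versions do.

```python
def drop_collapsed(line):
    """Drop the third-last and second-last fields, joining the rest with one space."""
    fields = line.split()
    if len(fields) < 3:
        return line  # assumption: leave short lines untouched
    return " ".join(fields[:-3] + fields[-1:])


def drop_keep_separators(line):
    """Blank the two fields but keep their separators, like assigning "" in awk."""
    fields = line.split()
    if len(fields) < 3:
        return line  # assumption: leave short lines untouched
    fields[-3] = fields[-2] = ""
    return " ".join(fields)


sample = "Line 3 ------ A B C D E I J Y"
print(drop_collapsed(sample))
print(drop_keep_separators(sample))
```

Both functions split on any whitespace, so tab-separated input would come out space-separated; pass an explicit separator to split and join if the tabs need to survive.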

awk using space in character class

Would anybody be able to tell me how to use awk to parse columns based on a character class with a space in it? I have some arbitrary input and I'm trying to parse it using:
somecommand | awk -F'[:space:]' '{print $2}'
I have also tried
somecommand | awk -F'[ ]' '{print $2}'
and
somecommand | awk -F'[[:space:]]' '{print $2}'
I know for sure that the command below works, although I can't use it because I am trying to parse upon multiple delimiters, space being one of them (colon being the other), which is why I'm trying to use a character class.
somecommand | awk -F' ' '{print $2}'
The problem is that I am getting spaces in the output of the first three examples, which I obviously would not expect. I am certain that I am actually getting spaces in the output because I am piping the output to "od -a" and can see that the character is actually a space. I am using GNU Awk 3.1.7. I appreciate your help :-)
]$ cat hi.txt | awk -F'[[:blank:]]' '{print $2,$3}'
*
*
*
*
*
]$ cat hi.txt
* PID: 4240 Sessions: 0 Processed: 1 Uptime: 3m 28, URL : http://127.0.0.1:56784, Password: F3N04418IlVe9230Vs2bnJ7lPk9KE2PIzhlSOrv173X
* PID: 4247 Sessions: 0 Processed: 0 Uptime: 3m 28, URL : http://127.0.0.1:59918, Password: g0jOTdEawcbFluPMGawbbeSo3u7mQvYlT5P136ymELh
* PID: 4254 Sessions: 0 Processed: 0 Uptime: 3m 28, URL : http://127.0.0.1:60870, Password: C9QHDrxfCPZjZd1AiIvbSdnaG3rI0QuvfUf4stvdqww
* PID: 4261 Sessions: 0 Processed: 0 Uptime: 3m 28, URL : http://127.0.0.1:56414, Password: cnPMFUXq6f6AtkcqOCqba7Bx76mC3K7uX7XxCfIDYYq
* PID: 4268 Sessions: 0 Processed: 0 Uptime: 3m 28, URL : http://127.0.0.1:53253, Password: yAff1NmrTz4B6kUKCsZ0HPvA75bWAvmOl8YHUi9aOgo
]$ awk --version
GNU Awk 3.1.7
Copyright (C) 1989, 1991-2009 Free Software Foundation.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/.
~]$ cat hi.txt | od -a
0000000 sp sp * sp P I D : sp 4 2 4 0 sp sp sp
0000020 sp S e s s i o n s : sp 0 sp sp sp sp
0000040 sp sp sp P r o c e s s e d : sp 1 sp
0000060 sp sp sp sp sp sp U p t i m e : sp 3 m
0000100 sp 2 8 , sp sp sp sp U R L sp sp sp sp sp
0000120 : sp h t t p : / / 1 2 7 . 0 . 0
0000140 . 1 : 5 6 7 8 4 , sp P a s s w o
0000160 r d : sp F 3 N 0 4 4 1 8 I l V e
0000200 9 2 3 0 V s 2 b n J 7 l P k 9 K
0000220 E 2 P I z h l S O r v 1 7 3 X nl
0000240 sp sp * sp P I D : sp 4 2 4 7 sp sp sp
0000260 sp S e s s i o n s : sp 0 sp sp sp sp
0000300 sp sp sp P r o c e s s e d : sp 0 sp
0000320 sp sp sp sp sp sp U p t i m e : sp 3 m
0000340 sp 2 8 , sp sp sp sp U R L sp sp sp sp sp
0000360 : sp h t t p : / / 1 2 7 . 0 . 0
0000400 . 1 : 5 9 9 1 8 , sp P a s s w o
0000420 r d : sp g 0 j O T d E a w c b F
0000440 l u P M G a w b b e S o 3 u 7 m
0000460 Q v Y l T 5 P 1 3 6 y m E L h nl
0000500 sp sp * sp P I D : sp 4 2 5 4 sp sp sp
0000520 sp S e s s i o n s : sp 0 sp sp sp sp
0000540 sp sp sp P r o c e s s e d : sp 0 sp
0000560 sp sp sp sp sp sp U p t i m e : sp 3 m
0000600 sp 2 8 , sp sp sp sp U R L sp sp sp sp sp
0000620 : sp h t t p : / / 1 2 7 . 0 . 0
0000640 . 1 : 6 0 8 7 0 , sp P a s s w o
0000660 r d : sp C 9 Q H D r x f C P Z j
0000700 Z d 1 A i I v b S d n a G 3 r I
0000720 0 Q u v f U f 4 s t v d q w w nl
0000740 sp sp * sp P I D : sp 4 2 6 1 sp sp sp
0000760 sp S e s s i o n s : sp 0 sp sp sp sp
0001000 sp sp sp P r o c e s s e d : sp 0 sp
0001020 sp sp sp sp sp sp U p t i m e : sp 3 m
0001040 sp 2 8 , sp sp sp sp U R L sp sp sp sp sp
0001060 : sp h t t p : / / 1 2 7 . 0 . 0
0001100 . 1 : 5 6 4 1 4 , sp P a s s w o
0001120 r d : sp c n P M F U X q 6 f 6 A
0001140 t k c q O C q b a 7 B x 7 6 m C
0001160 3 K 7 u X 7 X x C f I D Y Y q nl
0001200 sp sp * sp P I D : sp 4 2 6 8 sp sp sp
0001220 sp S e s s i o n s : sp 0 sp sp sp sp
0001240 sp sp sp P r o c e s s e d : sp 0 sp
0001260 sp sp sp sp sp sp U p t i m e : sp 3 m
0001300 sp 2 8 , sp sp sp sp U R L sp sp sp sp sp
0001320 : sp h t t p : / / 1 2 7 . 0 . 0
0001340 . 1 : 5 3 2 5 3 , sp P a s s w o
0001360 r d : sp y A f f 1 N m r T z 4 B
0001400 6 k U K C s Z 0 H P v A 7 5 b W
0001420 A v m O l 8 Y H U i 9 a O g o nl
0001440
[a.cri.dsullivan#aaa-arnor04 ~]$ cat hi.txt | awk -F'[[:blank:]]' '{print $4,$6}'
PID:
PID:
PID:
PID:
PID:
[a.cri.dsullivan#aaa-arnor04 ~]$ cat hi.txt | awk -F'[[:blank:]]' '{print $4,$7}'
PID:
PID:
PID:
PID:
PID:
[a.cri.dsullivan#aaa-arnor04 ~]$ cat hi.txt | awk -F'[[:space:]]' '{print $4,$7}'
PID:
PID:
PID:
PID:
PID:
Here's a space class with semicolon, commas and colons:
[[:space:];:,]
And some other variations:
[[:blank:][:cntrl:]]
[ \t,:;]
Update
Try this one:
somecommand | awk -F'[[:space:]]+' '{print $2}' ## + is the safer choice; * can also match the empty string
There are a couple of different Field Separator values that might do what you want:
FS="[[:space:]:]+"
or
FS="[[:space:]]+|:"
or
FS="[[:space:]]*:[[:space:]]+"
or something else. Given the input data you posted at https://gist.github.com/anonymous/a007032c9facd9e71dff, that last one actually looks like what you really want.
If you go with FS="[[:space:]:]+", it will break your URL value (e.g. http://127.0.0.1:56784) into 3 separate sections around the :s. The other alternatives will have similar negative impacts on various fields.
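To see how these separators behave without touching awk, here is a rough Python sketch using re.split. Python's re module has no POSIX [[:space:]] class, so \s stands in for it, and the sample line is rebuilt from the od -a dump in the question. Treat it as an approximation of awk's field splitting, not an exact model.

```python
import re

# One line modelled on the od -a dump in the question (runs of spaces preserved).
line = ("  * PID: 4240    Sessions: 0       Processed: 1       "
        "Uptime: 3m 28,    URL     : http://127.0.0.1:56784, "
        "Password: F3N04418IlVe9230Vs2bnJ7lPk9KE2PIzhlSOrv173X")

# \s is used here in place of awk's [[:space:]].
single = re.split(r"\s", line)          # like FS="[[:space:]]"
greedy = re.split(r"[\s:]+", line)      # like FS="[[:space:]:]+"
targeted = re.split(r"\s*:\s+", line)   # like FS="[[:space:]]*:[[:space:]]+"

# Every single space is a separator, so runs of spaces give empty fields.
print(single[:6])

# The colons inside the URL are separators too, so the URL is cut apart.
print(greedy)

# Only a colon followed by whitespace splits, so the URL stays whole.
print(targeted)
```

Note that list indexes start at 0 here while awk fields start at $1, so single[1] corresponds to $2.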