I'm learning some Python pandas and the course uses https://gist.githubusercontent.com/sh7ata/e075ff35b51ebb0d2d577fbe1d19ebc9/raw/b966d02c7c26bcca60703acb1390e938a65a35cb/drinks.csv
Clicking this link opens the actual .csv file contents in my browser and I can read the data into pandas straight away.
However, this doesn't work for https://www.spss-tutorials.com/downloads/browsers.csv. If I click this link, Google Chrome downloads the file instead of showing its contents.
Why is this and what can I do about it? I mean, they're both URLs for .csv files, right?
You can use the requests module with a custom User-Agent HTTP header to download it. For example:
import requests
import pandas as pd
from io import StringIO
url = "https://www.spss-tutorials.com/downloads/browsers.csv"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0"
}
req = requests.get(url, headers=headers)
df = pd.read_csv(StringIO(req.text))
print(df.to_markdown())
Prints:
|    | screen_resolution | sessions | perc_new_sessions | new_users | bounce_rate | pages_session | avg_session_duration | goal_conversion_rate | goal_completions | goal_value |
|----|-------------------|----------|-------------------|-----------|-------------|---------------|----------------------|----------------------|------------------|------------|
| 0  | 1366x768  | 2,284 | 79.60% | 1,818 | 69.40% | 1.93 | 00:02:14 | 0.00% | 0 | €0.00 |
| 1  | 1920x1080 | 2,013 | 72.28% | 1,455 | 71.93% | 2.02 | 00:02:18 | 0.00% | 0 | €0.00 |
| 2  | 1280x1024 | 1,217 | 72.14% | 878   | 74.53% | 1.9  | 00:02:05 | 0.00% | 0 | €0.00 |
| 3  | 1680x1050 | 1,052 | 68.16% | 717   | 74.62% | 1.93 | 00:01:46 | 0.00% | 0 | €0.00 |
| 4  | 1440x900  | 921   | 77.85% | 717   | 74.05% | 1.73 | 00:01:45 | 0.00% | 0 | €0.00 |
| 5  | 1280x800  | 865   | 80.00% | 692   | 71.91% | 1.76 | 00:01:37 | 0.00% | 0 | €0.00 |
| 6  | 1600x900  | 737   | 76.39% | 563   | 72.86% | 1.8  | 00:02:02 | 0.00% | 0 | €0.00 |
| 7  | 1920x1200 | 441   | 64.85% | 286   | 73.92% | 1.87 | 00:01:55 | 0.00% | 0 | €0.00 |
| 8  | 1024x768  | 192   | 88.02% | 169   | 73.96% | 2.07 | 00:01:32 | 0.00% | 0 | €0.00 |
| 9  | 2560x1440 | 137   | 67.15% | 92    | 61.31% | 1.86 | 00:02:02 | 0.00% | 0 | €0.00 |
| 10 | 1280x720  | 134   | 82.84% | 111   | 66.42% | 2.16 | 00:01:15 | 0.00% | 0 | €0.00 |
| 11 | 1536x864  | 118   | 78.81% | 93    | 72.03% | 1.78 | 00:01:45 | 0.00% | 0 | €0.00 |
| 12 | 320x568   | 104   | 84.62% | 88    | 75.00% | 1.89 | 00:01:18 | 0.00% | 0 | €0.00 |
| 13 | 768x1024  | 91    | 83.52% | 76    | 67.03% | 2.66 | 00:02:11 | 0.00% | 0 | €0.00 |
| 14 | 1360x768  | 70    | 77.14% | 54    | 74.29% | 1.69 | 00:01:08 | 0.00% | 0 | €0.00 |
| 15 | 360x640   | 70    | 71.43% | 50    | 77.14% | 2.06 | 00:02:06 | 0.00% | 0 | €0.00 |
| 16 | 1600x1200 | 62    | 80.65% | 50    | 82.26% | 1.32 | 00:01:22 | 0.00% | 0 | €0.00 |
| 17 | 1344x840  | 56    | 44.64% | 25    | 53.57% | 3.11 | 00:04:39 | 0.00% | 0 | €0.00 |
| 18 | 320x480   | 51    | 80.39% | 41    | 72.55% | 1.61 | 00:00:55 | 0.00% | 0 | €0.00 |
| 19 | 1093x614  | 41    | 80.49% | 33    | 78.05% | 1.76 | 00:01:42 | 0.00% | 0 | €0.00 |
| 20 | 1280x768  | 38    | 60.53% | 23    | 68.42% | 2.63 | 00:02:41 | 0.00% | 0 | €0.00 |
| 21 | 1024x600  | 35    | 94.29% | 33    | 85.71% | 1.37 | 00:01:23 | 0.00% | 0 | €0.00 |
...and so on
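On the "why" part of the question: whether a browser displays or downloads a URL is generally decided by the response headers (mainly Content-Type and Content-Disposition), not by the .csv extension. Below is a simplified sketch of that decision. The function name and the sample headers are made up for illustration, and I have not checked what the two servers in the question actually send.

```python
def likely_browser_action(headers):
    """Roughly guess what a browser does with a response, given its headers."""
    # HTTP header names are case-insensitive, so normalise the keys first.
    normalised = {name.lower(): value for name, value in headers.items()}
    disposition = normalised.get("content-disposition", "").strip().lower()
    # Drop parameters such as "; charset=utf-8" from the media type.
    media_type = normalised.get("content-type", "").split(";")[0].strip().lower()

    if disposition.startswith("attachment"):
        # The server explicitly asks for the response to be saved as a file.
        return "download"
    if media_type == "text/plain":
        # Plain text is normally rendered directly in the tab.
        return "display"
    # Other types (text/csv, application/octet-stream, ...) vary by browser.
    return "depends on the browser"


# Hypothetical header sets, not captured from the two sites in the question.
print(likely_browser_action({"Content-Type": "text/plain; charset=utf-8"}))
print(likely_browser_action({
    "Content-Type": "text/csv",
    "Content-Disposition": "attachment; filename=browsers.csv",
}))
```

To see the real headers for a URL, you can inspect `requests.get(url, headers=headers).headers`. Either way, pandas only needs the response body, so the difference in browser behaviour does not by itself stop you from reading the file.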
I am trying to extract the text from the site shown in the code below.
While I can print the text fine, I can't seem to turn it into a pandas DataFrame and write it out as a CSV.
The page contains only preformatted (pre) text.
Please let me know if there is a way to do this.
import requests
from bs4 import BeautifulSoup
#url list for the new stations
url1="https://www.kyoshin.bosai.go.jp/cgi-bin/kyoshin/db/sitedat.cgi?1+NIG010+knet"
tt1="C:/temp/"
page = requests.get(url1)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup)
N-Value P,S-Velocity Density Soil Column
(m/s) (g/cm^3)
----------------------------------------------------------------------------------
1m 13 1351 93 1.43 0m - 1m Fl
2m 9 1351 105 1.77 1m - 7.75m S
3m 11 1389 102 1.86 7.75m - 15.15m S
4m 7 1408 104 1.83 15.15m - 16.75m S
5m 20 1429 120 1.74 16.75m - 19.3m SF
6m 20 1481 121 1.89 19.3m - 22.75m SF
7m 24 1538 143 1.97 22.75m - 25.7m M
8m 53 1550 189 1.87 25.7m - 33.44m S
9m 52 1550 233 1.85
10m 47 1504 222 1.93
11m 43 1493 206 1.9
12m 38 1504 222 1.89
13m 27 1492 213 1.84
14m 44 1492 213 1.9
15m 62 1527 235 1.89
16m 46 1504 189 1.92
17m 22 1481 165 1.87
18m 26 1471 147 1.86
19m 24 1493 202 1.82
20m 21 1493 198 1.87
Not the most robust, but you can iterate line by line to parse the data you need:
import requests
import pandas as pd

# URL for the station's site data
url1 = "https://www.kyoshin.bosai.go.jp/cgi-bin/kyoshin/db/sitedat.cgi?1+NIG010+knet"
page = requests.get(url1)
s = str(page.content, 'utf-8')

rows = []
for lineNum, line in enumerate(s.splitlines()):
    if lineNum == 0:
        # Column names: N-Value, P,S-Velocity, Density, Soil, Column
        headers = line.split()
    elif lineNum == 1:
        # Units for the velocity and density columns
        unit1, unit2 = line.split()
    elif lineNum == 2:
        # Skip the dashed separator line
        continue
    else:
        row = line.split()
        if not row:
            # Ignore blank lines
            continue
        idx = row[0]
        nval = row[1]
        ps = row[2] + ' ' + row[3]
        den = row[4]
        if len(row) >= 9:
            soil = '%s %s %s' % (row[5], row[6], row[7])
            col = row[8]
        else:
            # Deeper rows have no soil column entry
            soil = ''
            col = ''
        rows.append([idx, nval, ps, den, soil, col])

# Build the frame once; DataFrame.append was removed in pandas 2.0
columns = ['Index', headers[0], headers[1] + ' ' + unit1,
           headers[2] + ' ' + unit2, headers[3], headers[4]]
df = pd.DataFrame(rows, columns=columns)
df.to_csv('file.csv', index=False)
Output:
print (df)
    Index  N-Value  P,S-Velocity (m/s)  Density (g/cm^3)             Soil  Column
 0     1m       13             1351 93              1.43          0m - 1m      Fl
 1     2m        9            1351 105              1.77       1m - 7.75m       S
 2     3m       11            1389 102              1.86   7.75m - 15.15m       S
 3     4m        7            1408 104              1.83  15.15m - 16.75m       S
 4     5m       20            1429 120              1.74   16.75m - 19.3m      SF
 5     6m       20            1481 121              1.89   19.3m - 22.75m      SF
 6     7m       24            1538 143              1.97   22.75m - 25.7m       M
 7     8m       53            1550 189              1.87   25.7m - 33.44m       S
 8     9m       52            1550 233              1.85
 9    10m       47            1504 222              1.93
10    11m       43            1493 206               1.9
11    12m       38            1504 222              1.89
12    13m       27            1492 213              1.84
13    14m       44            1492 213               1.9
14    15m       62            1527 235              1.89
15    16m       46            1504 189              1.92
16    17m       22            1481 165              1.87
17    18m       26            1471 147              1.86
18    19m       24            1493 202              1.82
19    20m       21            1493 198              1.87
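If the page layout stays stable, a regular expression is another way to split each line. Here is a minimal sketch that runs against a few sample lines copied from the page text in the question rather than the live site. The field names are my own labels, and reading the two velocity numbers as separate P and S values is an assumption based on the "P,S-Velocity" header.

```python
import csv
import io
import re

# A few lines copied from the page text shown in the question.
SAMPLE = """\
N-Value P,S-Velocity Density Soil Column
(m/s) (g/cm^3)
----------------------------------------------------------------------------------
1m 13 1351 93 1.43 0m - 1m Fl
2m 9 1351 105 1.77 1m - 7.75m S
9m 52 1550 233 1.85
"""

# Hypothetical column labels (not taken from the site).
FIELDS = ["depth", "n_value", "p_velocity", "s_velocity", "density",
          "soil_from", "soil_to", "soil_type"]

# The soil part is optional because the deeper rows do not have it.
ROW = re.compile(
    r"^\s*(?P<depth>[\d.]+m)\s+(?P<n_value>\d+)"
    r"\s+(?P<p_velocity>\d+)\s+(?P<s_velocity>\d+)\s+(?P<density>[\d.]+)"
    r"(?:\s+(?P<soil_from>[\d.]+m)\s+-\s+(?P<soil_to>[\d.]+m)\s+(?P<soil_type>\S+))?"
    r"\s*$"
)


def parse_rows(text):
    """Return one dict per data line; header and separator lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            # Groups missing on the shorter lines become empty strings.
            rows.append(match.groupdict(default=""))
    return rows


def rows_to_csv(rows):
    """Serialise the parsed rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = parse_rows(SAMPLE)
print(rows_to_csv(rows))
```

The list of dicts can also be passed straight to `pd.DataFrame(rows)` if you prefer to keep working in pandas. Lines that do not match the pattern are dropped silently, so it is worth comparing the row count against the page.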
Say I have this table T below, which stores a tree structure as parent/child pairs. These values are integers. Say in another table S, I have each ID/value mapped to a string.
So, let's say in S we have:
Table S
ID Name
90 "node 90"
301 "node 301"
etc. (even though the real names are different)
Is it possible to add a computed column here in T which gives for each child node, the textual representation of the path all the way up to the root of the tree in an appended form e.g.
"node 1 > node 2 > node 3" (for child/leaf node 3)
or
"node 10 > node 20" (for child/leaf node 20)
If it's not possible through a computed column, then can I do it with a regular column and a one-time update of that column? I was thinking of some recursive CTE but I cannot get my head around it (for now).
Table T
ParentEventID ChildEventID
90 301
90 302
90 303
90 304
90 305
90 306
90 307
301 401
301 402
302 403
302 404
302 405
302 406
302 407
303 408
304 409
304 410
304 411
304 412
304 413
304 414
305 415
305 416
305 417
305 418
306 419
306 420
306 421
306 422
307 423
307 424
307 425
307 426
307 427
403 501
403 502
403 503
403 504
403 505
404 506
404 507
404 508
404 509
404 510
405 511
405 512
405 513
405 514
405 515
406 516
406 517
406 518
406 519
406 520
407 521
407 522
407 523
407 524
407 525
415 526
415 527
415 528
415 529
415 530
416 531
416 532
416 533
416 534
416 535
417 536
417 537
417 538
417 539
417 540
418 541
418 542
418 543
418 544
418 545
420 546
420 547
420 548
420 549
420 550
421 551
421 552
421 553
421 554
421 555
422 556
422 557
422 558
422 559
422 560
Here's what I came up with:
WITH cte AS (
SELECT * FROM (VALUES
(90, 301),
(90, 302),
(90, 303),
(90, 304),
(90, 305),
(90, 306),
(90, 307),
(301,401),
(301,402),
(302,403),
(302,404),
(302,405),
(302,406),
(302,407),
(303,408),
(304,409),
(304,410),
(304,411),
(304,412),
(304,413),
(304,414),
(305,415),
(305,416),
(305,417),
(305,418),
(306,419),
(306,420),
(306,421),
(306,422),
(307,423),
(307,424),
(307,425),
(307,426),
(307,427),
(403,501),
(403,502),
(403,503),
(403,504),
(403,505),
(404,506),
(404,507),
(404,508),
(404,509),
(404,510),
(405,511),
(405,512),
(405,513),
(405,514),
(405,515),
(406,516),
(406,517),
(406,518),
(406,519),
(406,520),
(407,521),
(407,522),
(407,523),
(407,524),
(407,525),
(415,526),
(415,527),
(415,528),
(415,529),
(415,530),
(416,531),
(416,532),
(416,533),
(416,534),
(416,535),
(417,536),
(417,537),
(417,538),
(417,539),
(417,540),
(418,541),
(418,542),
(418,543),
(418,544),
(418,545),
(420,546),
(420,547),
(420,548),
(420,549),
(420,550),
(421,551),
(421,552),
(421,553),
(421,554),
(421,555),
(422,556),
(422,557),
(422,558),
(422,559),
(422,560)
) AS x(ParentEventID, ChildEventID)
), rcte AS (
    -- Anchor: root nodes, i.e. parents that never appear as a child
    SELECT DISTINCT NULL AS [ParentEventID], a.[ParentEventID] AS ChildEventID, CONCAT('/', CAST(a.[ParentEventID] AS NVARCHAR(MAX)), '/') AS h
    FROM cte AS a
    WHERE NOT EXISTS (
        SELECT *
        FROM [cte]
        WHERE [cte].[ChildEventID] = a.[ParentEventID]
    )
    UNION ALL
    -- Recursive step: append each child's ID to its parent's path
    SELECT child.[ParentEventID], child.[ChildEventID], CONCAT(parent.h, [child].[ChildEventID], '/')
    FROM [cte] AS child
    JOIN rcte AS parent
        ON child.[ParentEventID] = [parent].[ChildEventID]
)
SELECT * FROM rcte
The first CTE is just a quick way for me to expose your data; the real meat of the solution is in rcte. Note that the h column is directly convertible to a HierarchyID, if that is what you're looking for. It is worth considering, because a HierarchyID lets you answer questions such as "what are the children of this row?" or "which rows are in this row's lineage?" quite easily (i.e. without having to compute the entire hierarchy on the fly).
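To make the walk from child to root concrete outside of SQL, here is a small Python sketch of the same idea that produces the "node 1 > node 2 > node 3" style strings from the question. It is only an illustration, not a replacement for the query: the function name and the sample names are hypothetical, the edges are a handful of pairs taken from table T, and it assumes the data really is a tree (each child has one parent, no cycles).

```python
def build_paths(edges, names, sep=" > "):
    """Map every node to its root-to-node path, using names for display."""
    # Each child has exactly one parent in a tree.
    parent_of = {child: parent for parent, child in edges}
    cache = {}

    def path(node):
        # Memoise so that shared ancestors are only resolved once.
        if node not in cache:
            label = names.get(node, str(node))
            parent = parent_of.get(node)
            if parent is None:
                cache[node] = label  # root: the path is just its own name
            else:
                cache[node] = path(parent) + sep + label
        return cache[node]

    nodes = set(parent_of) | set(parent_of.values())
    return {node: path(node) for node in nodes}


# A few (ParentEventID, ChildEventID) pairs from table T.
edges = [(90, 301), (90, 302), (301, 401), (302, 403), (403, 501)]
# Placeholder names in the style of table S.
names = {n: "node %d" % n for n in (90, 301, 302, 401, 403, 501)}

paths = build_paths(edges, names)
print(paths[501])
```

The recursive CTE does the equivalent work top-down (roots first, then children), whereas this sketch resolves each node bottom-up and caches the result. The resulting strings depend on other rows, which is why a regular column filled by a one-time update looks like a more natural fit than a computed column.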
I have the following configuration:
apache:80 -> varnish:8080 -> apache:81 -> thin:9070
That worked fine with Apache 2.2, but with Apache 2.4 I keep getting 400 Bad Request.
The varnishlog follows:
➜ ~ varnishlog
0 CLI - Rd ping
0 CLI - Wr 200 19 PONG 1383139919 1.0
0 CLI - Rd ping
0 CLI - Wr 200 19 PONG 1383139922 1.0
0 CLI - Rd ping
0 CLI - Wr 200 19 PONG 1383139925 1.0
13 SessionOpen c 201.8.255.45 38752 :8080
13 ReqStart c 201.8.255.45 38752 1838475349
13 RxRequest c GET
13 RxURL c /
13 RxProtocol c HTTP/1.1
13 RxHeader c Host: escambo.org.br:8080
13 RxHeader c Connection: keep-alive
13 RxHeader c Cache-Control: max-age=0
13 RxHeader c Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
13 RxHeader c User-Agent: Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/29.0.1547.76 Chrome/29.0.1547.76 Safari/537.36
13 RxHeader c DNT: 1
13 RxHeader c Accept-Encoding: gzip,deflate,sdch
13 RxHeader c Accept-Language: pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4
13 RxHeader c Cookie: _noosfero_session=fdb69fb023c5b4f23578041a7e1ae390; active_organization=6
13 VCL_call c recv
13 VCL_return c pass
13 VCL_call c hash
13 VCL_return c hash
13 VCL_call c pass
13 VCL_return c pass
14 BackendOpen b default 127.0.0.1 34279 127.0.0.1 81
13 Backend c 14 default default
14 TxRequest b GET
14 TxURL b /
14 TxProtocol b HTTP/1.1
14 TxHeader b Host: escambo.org.br:8080
14 TxHeader b Cache-Control: max-age=0
14 TxHeader b Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
14 TxHeader b User-Agent: Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/29.0.1547.76 Chrome/29.0.1547.76 Safari/537.36
14 TxHeader b DNT: 1
14 TxHeader b Accept-Encoding: gzip,deflate,sdch
14 TxHeader b Accept-Language: pt-BR
14 TxHeader b Cookie: _noosfero_session=fdb69fb023c5b4f23578041a7e1ae390; active_organization=6
14 TxHeader b X-Varnish-Accept-Language: pt
14 TxHeader b X-Forwarded-For: 201.8.255.45
14 TxHeader b X-Varnish: 1838475349
14 RxProtocol b HTTP/1.1
14 RxStatus b 400
14 RxResponse b Bad Request
14 RxHeader b Date: Wed, 30 Oct 2013 13:32:06 GMT
14 RxHeader b Server: Apache/2.4.6 (Ubuntu)
14 RxHeader b Content-Length: 306
14 RxHeader b Connection: close
14 RxHeader b Content-Type: text/html; charset=iso-8859-1
13 TTL c 1838475349 RFC 120 1383139926 0 0 0 0
13 VCL_call c fetch
13 VCL_return c pass
13 ObjProtocol c HTTP/1.1
13 ObjStatus c 400
13 ObjResponse c Bad Request
13 ObjHeader c Date: Wed, 30 Oct 2013 13:32:06 GMT
13 ObjHeader c Server: Apache/2.4.6 (Ubuntu)
13 ObjHeader c Content-Length: 306
13 ObjHeader c Content-Type: text/html; charset=iso-8859-1
13 ObjHeader c Vary: X-Varnish-Accept-Language
14 Fetch_Body b 4 0 1
14 Length b 306
14 BackendClose b default
13 VCL_call c deliver
13 VCL_return c deliver
13 TxProtocol c HTTP/1.1
13 TxStatus c 400
13 TxResponse c Bad Request
13 TxHeader c Server: Apache/2.4.6 (Ubuntu)
13 TxHeader c Content-Type: text/html; charset=iso-8859-1
13 TxHeader c Vary: X-Varnish-Accept-Language
13 TxHeader c Content-Length: 306
13 TxHeader c Date: Wed, 30 Oct 2013 13:32:06 GMT
13 TxHeader c X-Varnish: 1838475349
13 TxHeader c Age: 0
13 TxHeader c Via: 1.1 varnish
13 TxHeader c Connection: keep-alive
13 Length c 306
13 ReqEnd c 1838475349 1383139926.780520439 1383139926.781727314 0.000143051 0.001113653 0.000093222
13 Debug c "herding"
13 ReqStart c 201.8.255.45 38752 1838475350
13 RxRequest c GET
13 RxURL c /favicon.ico
13 RxProtocol c HTTP/1.1
13 RxHeader c Host: escambo.org.br:8080
13 RxHeader c Connection: keep-alive
13 RxHeader c Accept: */*
13 RxHeader c DNT: 1
13 RxHeader c User-Agent: Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/29.0.1547.76 Chrome/29.0.1547.76 Safari/537.36
13 RxHeader c Accept-Encoding: gzip,deflate,sdch
13 RxHeader c Accept-Language: pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4
13 RxHeader c Cookie: _noosfero_session=fdb69fb023c5b4f23578041a7e1ae390; active_organization=6
13 VCL_call c recv
13 VCL_return c pass
13 VCL_call c hash
13 VCL_return c hash
13 VCL_call c pass
13 VCL_return c pass
14 BackendOpen b default 127.0.0.1 34280 127.0.0.1 81
13 Backend c 14 default default
14 TxRequest b GET
14 TxURL b /favicon.ico
14 TxProtocol b HTTP/1.1
14 TxHeader b Host: escambo.org.br:8080
14 TxHeader b Accept: */*
14 TxHeader b DNT: 1
14 TxHeader b User-Agent: Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/29.0.1547.76 Chrome/29.0.1547.76 Safari/537.36
14 TxHeader b Accept-Encoding: gzip,deflate,sdch
14 TxHeader b Accept-Language: pt-BR
14 TxHeader b Cookie: _noosfero_session=fdb69fb023c5b4f23578041a7e1ae390; active_organization=6
14 TxHeader b X-Varnish-Accept-Language: pt
14 TxHeader b X-Forwarded-For: 201.8.255.45
14 TxHeader b X-Varnish: 1838475350
14 RxProtocol b HTTP/1.1
14 RxStatus b 400
14 RxResponse b Bad Request
14 RxHeader b Date: Wed, 30 Oct 2013 13:32:07 GMT
14 RxHeader b Server: Apache/2.4.6 (Ubuntu)
14 RxHeader b Content-Length: 306
14 RxHeader b Connection: close
14 RxHeader b Content-Type: text/html; charset=iso-8859-1
13 TTL c 1838475350 RFC 120 1383139927 0 0 0 0
13 VCL_call c fetch
13 VCL_return c pass
13 ObjProtocol c HTTP/1.1
13 ObjStatus c 400
13 ObjResponse c Bad Request
13 ObjHeader c Date: Wed, 30 Oct 2013 13:32:07 GMT
13 ObjHeader c Server: Apache/2.4.6 (Ubuntu)
13 ObjHeader c Content-Length: 306
13 ObjHeader c Content-Type: text/html; charset=iso-8859-1
13 ObjHeader c Vary: X-Varnish-Accept-Language
14 Fetch_Body b 4 0 1
14 Length b 306
14 BackendClose b default
13 VCL_call c deliver
13 VCL_return c deliver
13 TxProtocol c HTTP/1.1
13 TxStatus c 400
13 TxResponse c Bad Request
13 TxHeader c Server: Apache/2.4.6 (Ubuntu)
13 TxHeader c Content-Type: text/html; charset=iso-8859-1
13 TxHeader c Vary: X-Varnish-Accept-Language
13 TxHeader c Content-Length: 306
13 TxHeader c Date: Wed, 30 Oct 2013 13:32:07 GMT
13 TxHeader c X-Varnish: 1838475350
13 TxHeader c Age: 0
13 TxHeader c Via: 1.1 varnish
13 TxHeader c Connection: keep-alive
13 Length c 306
13 ReqEnd c 1838475350 1383139927.131367445 1383139927.132475853 0.349640131 0.000998974 0.000109434
13 Debug c "herding"
0 CLI - Rd ping
0 CLI - Wr 200 19 PONG 1383139928 1.0
➜ ~
The access.log doesn't show any detail on the problem.
The problem was caused by an old version of https://github.com/cosimo/varnish-accept-language. Updating it solved the issue.