I am trying to scrape a website called WikiCFP and return the information in the table as a dataframe.
So far I have this code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
df = pd.DataFrame(columns=["abbreviation", "name", "dates", "place", "deadline"])
url = "http://www.wikicfp.com/cfp/call?conference=computer%20science&page=1"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
table = soup.find("table", align="center", cellpadding="3", cellspacing="1", width="100%")
for row in table.find_all("tr")[1:]:
    values = row.find_all("td")
    # split on the newline escape "\n" (the original "/n" never matches)
    print(values[0].text.split("\n")[0])
Specifically, I don't know how to convert the text in each row into a list (or some other structure) from which a dataframe can be built.
Thanks in advance.
You can use read_html directly:
dfs = pd.read_html(url, header=0)  # returns every table on the page as a list of DataFrames
df = dfs[4]  # 4 is the index of the DataFrame you want
Output:
>>> df
Event When Where Deadline Unnamed: 4
0 DSA 2021 2nd International Conference on Data Science a... 2nd International Conference on Data Science a... 2nd International Conference on Data Science a... NaN
1 DSA 2021 Nov 27, 2021 - Nov 28, 2021 Dubai, UAE Oct 24, 2021 NaN
2 NCWMC 2022 7th International Conference on Networks, Comm... 7th International Conference on Networks, Comm... 7th International Conference on Networks, Comm... NaN
3 NCWMC 2022 Jul 30, 2022 - Jul 31, 2022 London, United Kingdom Oct 24, 2021 NaN
4 ICAIT 2022 11th International Conference on Advanced Comp... 11th International Conference on Advanced Comp... 11th International Conference on Advanced Comp... NaN
5 ICAIT 2022 Jul 23, 2022 - Jul 24, 2022 Toronto, Canada Oct 24, 2021 NaN
6 CNDC 2021 8th International Conference on Computer Netwo... 8th International Conference on Computer Netwo... 8th International Conference on Computer Netwo... NaN
7 CNDC 2021 Nov 27, 2021 - Nov 28, 2021 Dubai, UAE Oct 24, 2021 NaN
8 CAIML 2022 3rd International Conference on Artificial Int... 3rd International Conference on Artificial Int... 3rd International Conference on Artificial Int... NaN
9 CAIML 2022 Jul 23, 2022 - Jul 24, 2022 Toronto, Canada Oct 24, 2021 NaN
10 ITCSE 2022 11th International Conference on Information T... 11th International Conference on Information T... 11th International Conference on Information T... NaN
11 ITCSE 2022 Jul 23, 2022 - Jul 24, 2022 Toronto, Canada Oct 24, 2021 NaN
12 CCSIT 2022 12th International Conference on Computer Scie... 12th International Conference on Computer Scie... 12th International Conference on Computer Scie... NaN
13 CCSIT 2022 Jul 30, 2022 - Jul 31, 2022 London, United Kingdom Oct 24, 2021 NaN
14 SOEN 2022 7th International Conference on Software Engin... 7th International Conference on Software Engin... 7th International Conference on Software Engin... NaN
15 SOEN 2022 Jul 30, 2022 - Jul 31, 2022 London, United Kingdom Oct 24, 2021 NaN
16 AIAA 2021 11th International Conference on Artificial In... 11th International Conference on Artificial In... 11th International Conference on Artificial In... NaN
17 AIAA 2021 Nov 27, 2021 - Nov 28, 2021 Dubai, UAE Oct 24, 2021 NaN
18 NLPTA 2021 2nd International Conference on NLP Techniques... 2nd International Conference on NLP Techniques... 2nd International Conference on NLP Techniques... NaN
19 NLPTA 2021 Nov 27, 2021 - Nov 28, 2021 Dubai, UAE Oct 24, 2021 NaN
20 CSTY 2021 7th International Conference on Computer Scien... 7th International Conference on Computer Scien... 7th International Conference on Computer Scien... NaN
21 CSTY 2021 Dec 18, 2021 - Dec 19, 2021 Dubai, UAE Oct 24, 2021 NaN
22 KG#SAC 2022 ACM SAC 2022 Track on Knowledge Graphs ACM SAC 2022 Track on Knowledge Graphs ACM SAC 2022 Track on Knowledge Graphs NaN
23 KG#SAC 2022 Apr 25, 2022 - Apr 29, 2022 Brno, Czech Republic Oct 24, 2021 NaN
24 E&C 2021 5th International Conference on Electrical & C... 5th International Conference on Electrical & C... 5th International Conference on Electrical & C... NaN
25 E&C 2021 Nov 27, 2021 - Nov 28, 2021 Dubai, UAE Oct 24, 2021 NaN
26 IoTE 2021 2nd International Conference on Internet of Th... 2nd International Conference on Internet of Th... 2nd International Conference on Internet of Th... NaN
27 IoTE 2021 Nov 27, 2021 - Nov 28, 2021 Dubai, UAE Oct 24, 2021 NaN
28 EEEN 2021 5th International Conference on Electrical and... 5th International Conference on Electrical and... 5th International Conference on Electrical and... NaN
29 EEEN 2021 Nov 27, 2021 - Nov 28, 2021 Dubai, UAE Oct 24, 2021 NaN
30 CVIE--EI, Scopus 2022 2022 2nd International Conference on Computer ... 2022 2nd International Conference on Computer ... 2022 2nd International Conference on Computer ... NaN
31 CVIE--EI, Scopus 2022 Feb 18, 2022 - Feb 20, 2022 Sanya, China Oct 25, 2021 NaN
32 ICCDA--Ei, Scopus 2022 2022 The 6th International Conference on Compu... 2022 The 6th International Conference on Compu... 2022 The 6th International Conference on Compu... NaN
33 ICCDA--Ei, Scopus 2022 Feb 18, 2022 - Feb 20, 2022 Sanya, China Oct 25, 2021 NaN
34 ACM--ICMLC--Ei and Scopus 2022 ACM--2022 14th International Conference on Mac... ACM--2022 14th International Conference on Mac... ACM--2022 14th International Conference on Mac... NaN
35 ACM--ICMLC--Ei and Scopus 2022 Feb 18, 2022 - Feb 20, 2022 Guangzhou, China Oct 25, 2021 NaN
36 IEEE CSP--EI Compendex, Scopus 2022 2022 IEEE 6th International Conference on Cryp... 2022 IEEE 6th International Conference on Cryp... 2022 IEEE 6th International Conference on Cryp... NaN
37 IEEE CSP--EI Compendex, Scopus 2022 Jan 14, 2022 - Jan 16, 2022 Tianjin, China Oct 25, 2021 NaN
38 ACM--ICMIP--Ei and Scopus 2022 ACM--2022 7th International Conference on Mult... ACM--2022 7th International Conference on Mult... ACM--2022 7th International Conference on Mult... NaN
39 ACM--ICMIP--Ei and Scopus 2022 Jan 14, 2022 - Jan 16, 2022 Tianjin, China Oct 25, 2021 NaN
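In the output above each event is spread over two consecutive rows: the first carries the full conference name (repeated across the columns), the second carries the dates, place and deadline. A minimal sketch of collapsing each pair into one row, assuming that layout holds for the whole table, could look like this. The function name `merge_event_rows` and the small sample frame are hypothetical; the target column names are the ones from the question.

```python
import pandas as pd

def merge_event_rows(df):
    """Collapse the two-rows-per-event layout into one row per event."""
    names = df.iloc[0::2].reset_index(drop=True)    # rows 0, 2, 4, ... hold the full name
    details = df.iloc[1::2].reset_index(drop=True)  # rows 1, 3, 5, ... hold dates/place/deadline
    return pd.DataFrame({
        "abbreviation": names["Event"],
        "name": names["When"],
        "dates": details["When"],
        "place": details["Where"],
        "deadline": details["Deadline"],
    })

# Hypothetical sample built from rows 22 and 23 of the output above.
sample = pd.DataFrame({
    "Event": ["KG#SAC 2022", "KG#SAC 2022"],
    "When": ["ACM SAC 2022 Track on Knowledge Graphs", "Apr 25, 2022 - Apr 29, 2022"],
    "Where": ["ACM SAC 2022 Track on Knowledge Graphs", "Brno, Czech Republic"],
    "Deadline": ["ACM SAC 2022 Track on Knowledge Graphs", "Oct 24, 2021"],
})
merged = merge_event_rows(sample)
```

If the page ever contains a row that breaks the two-row pattern, the even/odd slicing will misalign, so it is worth checking that the frame has an even number of rows first.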
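Try calling getText() on each cell rather than on the list that find_all returns:
[td.getText() for td in values]
The getText() function returns the textual content of a bs4 HTML element.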
Related
I'm trying to extract the names of space organisations from a table, but the closest I can get is the number of times each name appears next to the name of the organisation. I just want the name of each organisation, not the number of times it appears in the table.
If you can help me, please leave a comment on my Google Colab.
https://colab.research.google.com/drive/1m4zI4YGguQ5aWdDVyc7Bdpr-78KHdxhR?usp=sharing
What I get:
variable number  organisation  time of launch
0                SpaceX        Fri Aug 07, 2020 05:12 UTC
1                CASC          Thu Aug 06, 2020 04:01 UTC
2                SpaceX        Tue Aug 04, 2020 23:57 UTC
3                Roscosmos     Thu Jul 30, 2020 21:25 UTC
4                ULA           Thu Jul 30, 2020 11:50 UTC
...              ...           ...
4319             US Navy       Wed Feb 05, 1958 07:33 UTC
4320             AMBA          Sat Feb 01, 1958 03:48 UTC
4321             US Navy       Fri Dec 06, 1957 16:44 UTC
4322             RVSN USSR     Sun Nov 03, 1957 02:30 UTC
4323             RVSN USSR     Fri Oct 04, 1957 19:28 UTC
etc              etc           etc
What I want:
organisation
RVSN USSR
Arianespace
CASC
General Dynamics
NASA
VKS RF
US Air Force
ULA
Boeing
Martin Marietta
etc
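One possible way to get just the names is sketched below, assuming the data sits in a pandas DataFrame with an "organisation" column as in the table above. The variable names and the five-row sample are hypothetical. The sketch contrasts `value_counts()`, which returns each name together with its count, with `drop_duplicates()`, which keeps only the names.

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of the "organisation" column.
launches = pd.DataFrame(
    {"organisation": ["SpaceX", "CASC", "SpaceX", "Roscosmos", "ULA"]}
)

# value_counts() pairs every name with the number of times it occurs.
counts = launches["organisation"].value_counts()

# drop_duplicates() keeps each name once, in order of first appearance.
names = launches["organisation"].drop_duplicates().reset_index(drop=True)
```

`names.to_frame()` turns the result back into a one-column DataFrame if that is the shape you need.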
As you can see, a Date column and a Time column are saved in this CSV file. The problem is that the date and time are in a format like 30-1-2022 and 20:08:00,
but I want them to look like 30th Jan 22 and 8:08 PM.
Is there any code for that?
import requests
import pandas as pd
from datetime import datetime
from datetime import date
currentd = date.today()
s = requests.Session()
headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
url = 'https://www.nseindia.com/'
step = s.get(url,headers=headers)
today = datetime.now().strftime('%d-%m-%Y')
api_url = f'https://www.nseindia.com/api/corporate-announcements?index=equities&from_date={today}&to_date={today}'
resp = s.get(api_url,headers=headers).json()
result = pd.DataFrame(resp)
result.drop(['difference', 'dt','exchdisstime','csvName','old_new','orgid','seq_id','sm_isin','bflag','symbol','sort_date'], axis = 1, inplace = True)
result.rename(columns = {'an_dt':'DateandTime', 'attchmntFile':'Source','attchmntText':'Topic','desc':'Type','smIndustry':'Sector','sm_name':'Company Name'}, inplace = True)
result[['Date','Time']] = result.DateandTime.str.split(expand=True)
result.drop(['DateandTime'], axis = 1, inplace = True)
result.to_csv( ( str(currentd.day) +'-'+str(currentd.month) +'-'+'CA.csv'), index=True)
print('Saved the CSV File')
Try creating a temporary column:
result['Full_date'] = pd.to_datetime(result['Date'] + ' ' + result['Time'])
Then format 'Date' and 'Time':
result['Date'] = result['Full_date'].dt.strftime('%b %d, %Y')
# %I is the 12-hour clock, which is what %p (AM/PM) needs; %R would give 24-hour HH:MM
result['Time'] = result['Full_date'].dt.strftime('%I:%M %p')
Try this:
# Remove comment if needed
# import locale
# locale.setlocale(locale.LC_TIME, 'C')
# https://stackoverflow.com/a/16671271
def ordinal(n):  # renamed from ord so the built-in ord() is not shadowed
    return str(n)+("th" if 4<=n%100<=20 else {1:"st",2:"nd",3:"rd"}.get(n%10, "th"))
result['Date'] = pd.to_datetime(result['Date'], format='%d-%b-%Y')
result['Date'] = result['Date'].dt.day.map(ordinal) + result['Date'].dt.strftime(' %b %Y')
# %-I is the 12-hour clock without zero padding (Linux/macOS; use %#I on Windows)
result['Time'] = pd.to_datetime(result['Time']).dt.strftime('%-I:%M %p')
# Now you can export
Output:
>>> result[['Date', 'Time']]
Date Time
0 30th Jan 2022 9:07 PM
1 30th Jan 2022 8:57 PM
2 30th Jan 2022 7:40 PM
3 30th Jan 2022 6:55 PM
4 30th Jan 2022 6:53 PM
5 30th Jan 2022 6:09 PM
6 30th Jan 2022 5:44 PM
7 30th Jan 2022 4:01 PM
8 30th Jan 2022 3:21 PM
9 30th Jan 2022 3:16 PM
10 30th Jan 2022 3:10 PM
11 30th Jan 2022 3:06 PM
12 30th Jan 2022 2:29 PM
13 30th Jan 2022 2:15 PM
14 30th Jan 2022 1:41 PM
15 30th Jan 2022 12:20 PM
16 30th Jan 2022 12:09 PM
17 30th Jan 2022 12:07 PM
18 30th Jan 2022 10:58 AM
19 30th Jan 2022 10:42 AM
20 30th Jan 2022 10:40 AM
21 30th Jan 2022 10:39 AM
22 30th Jan 2022 10:06 AM
23 30th Jan 2022 9:39 AM
24 30th Jan 2022 9:36 AM
25 30th Jan 2022 9:25 AM
26 30th Jan 2022 8:43 AM
27 30th Jan 2022 1:00 AM
28 30th Jan 2022 12:59 AM
29 30th Jan 2022 12:13 AM
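For a single value outside pandas, the same formatting can be sketched with the standard library only. This is a minimal sketch, assuming the input looks like "30-1-2022 20:08:00" as described in the question and that the process runs under an English (or C) locale so that %b and %p yield "Jan" and "PM". The helper names `pretty_date` and `pretty_time` are hypothetical. It avoids the platform-specific %-I / %#I flags by stripping the leading zero by hand.

```python
from datetime import datetime

def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 11 <= n % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

def pretty_date(dt):
    """Format as e.g. '30th Jan 22' (two-digit year, as asked in the question)."""
    return f"{ordinal(dt.day)} {dt.strftime('%b %y')}"

def pretty_time(dt):
    """Format as e.g. '8:08 PM'; %I is zero-padded, so drop a leading zero."""
    return dt.strftime("%I:%M %p").lstrip("0")

stamp = datetime.strptime("30-1-2022 20:08:00", "%d-%m-%Y %H:%M:%S")
date_text = pretty_date(stamp)
time_text = pretty_time(stamp)
```

Both helpers can also be applied to a pandas datetime column with `.map(pretty_date)` and `.map(pretty_time)`.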
I have a table that contains the surcharge rates of many carriers over many years. Each carrier has one row per month, 12 per year.
CARRIER_ID YEAR MONTH RATE
DHL 2021 April 16.5
DHL 2021 August 18.5
DHL 2021 December 0
DHL 2021 February 14
DHL 2021 January 12.5
DHL 2021 July 17.75
DHL 2021 June 17
DHL 2021 March 15
DHL 2021 May 17
DHL 2021 November 0
DHL 2021 October 0
DHL 2021 September 0
FedEx 2021 April 16.5
FedEx 2021 August 17.5
FedEx 2021 December 0
FedEx 2021 February 14.5
FedEx 2021 January 13.5
FedEx 2021 July 17.5
FedEx 2021 June 17
FedEx 2021 March 16
FedEx 2021 May 16.5
FedEx 2021 November 0
FedEx 2021 October 0
FedEx 2021 September 0
I want to write a SQL Server query that returns the data like this.
Please note that the data needs to be grouped by year (e.g. 2021).
Month      DHL     FedEx
January    12.50%  13.50%
February   14.00%  14.50%
March      15.00%  16.00%
April      16.50%  16.50%
May        17.00%  16.50%
June       17.00%  17.00%
July       17.75%  17.50%
August     18.50%  17.50%
September  0       0
October    0       0
November   0       0
December   0       0
I searched on Google but could not find a solution.
Please show me how to do it.
Thank you so much.
If you know your list of carriers, you can do it like this with standard SQL:
select
    t.YEAR,
    t.MONTH,
    max(case when t.CARRIER_ID = 'DHL' then t.RATE else NULL end) as DHL,
    max(case when t.CARRIER_ID = 'FedEx' then t.RATE else NULL end) as FedEx
from your_table t
group by t.YEAR, t.MONTH
order by t.YEAR, t.MONTH
YEAR and MONTH are usually reserved words, so it's not recommended to use them as column names.
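The conditional-aggregation pattern can be tried out quickly from Python with the standard-library sqlite3 module. This is only a sketch: SQLite stands in for SQL Server here, the table name `surcharge` is hypothetical, and only the first three months from the question's data are loaded. Because MONTH holds month names, a plain ORDER BY on it sorts alphabetically, so the sketch orders by an explicit CASE expression instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE surcharge (CARRIER_ID TEXT, YEAR INTEGER, MONTH TEXT, RATE REAL)"
)
rows = [
    ("DHL", 2021, "January", 12.5), ("DHL", 2021, "February", 14.0),
    ("DHL", 2021, "March", 15.0),
    ("FedEx", 2021, "January", 13.5), ("FedEx", 2021, "February", 14.5),
    ("FedEx", 2021, "March", 16.0),
]
conn.executemany("INSERT INTO surcharge VALUES (?, ?, ?, ?)", rows)

# One MAX(CASE ...) per carrier turns carrier rows into columns.
query = """
SELECT t.MONTH,
       MAX(CASE WHEN t.CARRIER_ID = 'DHL'   THEN t.RATE END) AS DHL,
       MAX(CASE WHEN t.CARRIER_ID = 'FedEx' THEN t.RATE END) AS FedEx
FROM surcharge t
WHERE t.YEAR = 2021
GROUP BY t.MONTH
ORDER BY CASE t.MONTH
             WHEN 'January' THEN 1 WHEN 'February' THEN 2 WHEN 'March' THEN 3
         END
"""
pivot = conn.execute(query).fetchall()
conn.close()
```

For the full table the ORDER BY expression would need one WHEN branch per month, or a numeric month column to sort on.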
I tried many methods recommended in other threads, but could not make my code work.
I want to load a CSV file arranged like the one below into a dataframe.
year, 2021
month, march
date, 28
here, are, values
42.1, 28.7, 27.0, 9.54, 12.23, 22.25
I had a hard time dealing with this CSV file (this is just a concise example of my data) because of its irregular row lengths, its mix of letters and numbers, and its delimiters that combine a comma with a space.
I want this dataset to be placed left-aligned in the dataframe, like this:
year 2021 NaN NaN NaN NaN
month march NaN NaN NaN NaN
date 28 NaN NaN NaN NaN
here are values NaN NaN NaN
42.1 28.7 27.0 9.54 12.23 22.25
Sorry that I cannot show you what I have done so far, because I have a bunch of versions of code from the methods I searched.
If all values refer to the same year, month and date, you need a DataFrame in which each row is one observation of a value, i.e.
import numpy as np
import pandas as pd

year = 2021
month = 'march'
date = 28
values = [42.1, 28.7, 27.0, 9.54, 12.23, 22.25]
# repeat the scalar fields once per value so every column has the same length
df = pd.DataFrame({
    'year': np.repeat(year, len(values)),
    'month': np.repeat(month, len(values)),
    'date': np.repeat(date, len(values)),
    'value': values
})
yielding
year month date value
0 2021 march 28 42.10
1 2021 march 28 28.70
2 2021 march 28 27.00
3 2021 march 28 9.54
4 2021 march 28 12.23
5 2021 march 28 22.25
If you want it transposed, you can do
df = df.T
that gives
0 1 2 3 4 5
year 2021 2021 2021 2021 2021 2021
month march march march march march march
date 28 28 28 28 28 28
value 42.1 28.7 27.0 9.54 12.23 22.25
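If the left-aligned layout from the question is what is needed, one option is to read the rows with the standard csv module and hand the ragged list of rows to pandas, which pads the shorter rows on the right with missing values. This is a minimal sketch under the assumption that the file looks exactly like the sample in the question. The text is inlined here through io.StringIO; with a real file, open it and pass the file object to csv.reader instead.

```python
import csv
import io

import pandas as pd

# Sample text copied from the question; stands in for the real file.
raw = """year, 2021
month, march
date, 28
here, are, values
42.1, 28.7, 27.0, 9.54, 12.23, 22.25
"""

# skipinitialspace=True drops the blank that follows each comma.
rows = list(csv.reader(io.StringIO(raw), skipinitialspace=True))

# Rows of unequal length are padded on the right with missing values.
df = pd.DataFrame(rows)
```

Every cell is read as a string here, so numeric rows need an explicit conversion (for example with pd.to_numeric) before doing arithmetic on them.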
I have strings containing dates formatted like this:
Nov, 04 1983
May 10th, 1988
July 17 1979
July 08, 1978
January 03rd, 1990
Jan 5th 1985
Dec 8, 1988.
August 5, 1969
Aug., 28, 1983
9th May,1978
9th April 1976
7th February 1983
7th February 1983
7july 1986
6th Oct. 1986
5th July 1982
5th July 1973
5th Jan, 1985
5th Dec 1982.
5th August 1987
5th Aug, 1990
3rd November 1982.
3rd February,1982
3rd December 1986
31th May 1981
31st of August 19876
31st August 1990
31st AUGUST 1987
31st August 1978
31'DEC 1978
30th October 1986
30th December 1978.
30-06-1987
30/07/1982
2nd Sep. 1987
2nd Sep 1989
2nd July 1974
2nd Dec. 1990.
2nd Dec. 1983.
2nd Dec. 1983.
29-07-1986
28-march-1987
28/07/1986
28 April, 1981
27-07-1985
27/01/1993
26th May, 1988
26th June 1981.
26-DEC-1974
25th NOV 1985
25th June, 1976
25th Dec 1985
25-05-1985
25/07/85
25 Year & 28.09.1985.
24th June 1987.
24th July 1977
-24th Dec 1977
24th April 1989
23rd March 1980
23rd December, 1990
22nd April 1984
22nd- Apr-1989
22.11.1989
22 FEB 1990
22 April 1988.
21st September 1972
21st June 1990
21st Jan 1983
21 August 1985
20/08/1988
20/08/1987
20/02/1985
19TH JUNE. 1986
19th June, 1987
19-08-1988
18th June 1987
18/03/1980
17th September, 1975
17th April 1985
16th, March 1983
16th May 1987
16-October-1988
16/11/1989
16 / 06 / 1981
15th June, 1979
15-02-1989
I want to convert them to MM-DD-YYYY format. I am looking for a solution in VB.NET.
Here is the function I am using:
Public Function ParseDate(ByVal txt As String)
    If txt.Length > 20 Then
        Dim a() As String
        a = txt.Split("")
        If a.Length = 0 Then
            a = txt.Split(vbTab)
        End If
        If a.Length = 1 Then
            a = txt.Split("&")
            If a.Length > 1 Then
                txt = a(1).Trim()
                GoTo ok1
            End If
        End If
        If a.Length > 1 Then
            For Each ad As String In a
                If Len(ad.Trim()) >= 8 Then
                    txt = ad.Trim()
                    GoTo ok1
                End If
            Next
        End If
    End If
ok1:
    txt = txt.Replace(":", " ").Trim()
    txt = txt.Replace(vbLf, " ").Trim()
    txt = txt.Replace(",", " ").Trim()
    txt = txt.Replace(vbTab, " ").Trim()
    Dim result As String = ""
    Dim mydate As New Date
    txt = Regex.Replace(txt.ToLower, "[\s+,.'`-]|(ust)|(st)|(rd)|(nd)|(th)", " ")
    Date.TryParse(txt, mydate)
    result = mydate.ToString("MM-dd-yyyy")
    Return result
End Function
I really appreciate all of your help. Thanks in advance.
It's a bit complicated if your date strings come in different formats, but you can try this:
' Requires: Imports System.Text.RegularExpressions
Function FormatDateString(ByVal str As String) As String
    Dim result As String = ""
    Dim mydate As New Date
    ' Replace separators and ordinal suffixes with spaces ("ust" turns "august" into "aug")
    str = Regex.Replace(str.ToLower, "[\s+,.'`-]|(ust)|(st)|(rd)|(nd)|(th)", " ")
    Date.TryParse(str, mydate)
    result = mydate.ToString("MM-dd-yyyy")
    Return result
End Function
Example:
Console.WriteLine(FormatDateString("January 03rd, 1990"))
Console.WriteLine(FormatDateString("August., 21st, 1983"))
Console.WriteLine(FormatDateString(" - 13/08/1984-"))
Output:
01-03-1990
08-21-1983
08-13-1984
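The same idea (strip ordinal suffixes and punctuation, then parse) can be sketched in Python with the standard library. This is an illustration of the approach, not a port of the VB.NET function: the name `normalize_date` and the list of accepted formats are assumptions, numeric dates are read day-first, month names are assumed to be English, and anything that matches none of the formats (for example a two-digit year such as 25/07/85) returns None instead of a default date.

```python
import re
from datetime import datetime

# Formats tried in order after cleaning; numeric dates are assumed day-first.
FORMATS = ("%d %m %Y", "%d %B %Y", "%d %b %Y", "%B %d %Y", "%b %d %Y")

def normalize_date(text):
    """Return the date as MM-DD-YYYY, or None if no format matches."""
    cleaned = text.lower()
    # Drop ordinal suffixes that directly follow a day number (1st, 03rd, 31th).
    cleaned = re.sub(r"(\d)(st|nd|rd|th)\b", r"\1", cleaned)
    # Separate a digit glued to a month name ("7july" -> "7 july").
    cleaned = re.sub(r"(\d)([a-z])", r"\1 \2", cleaned)
    # Collapse every run of punctuation or whitespace into one space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", cleaned).strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%m-%d-%Y")
        except ValueError:
            continue
    return None
```

Returning None for unparseable input makes the failures visible, so the remaining odd cases (such as "25 Year & 28.09.1985.") can be collected and handled separately.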