How to get section heading of tables in wikipedia through API - wikipedia-api

How do I get the section headings for the individual tables (Xia dynasty (夏朝) (2070–1600 BC), Shang dynasty (商朝) (1600–1046 BC), Zhou dynasty (周朝) (1046–256 BC), etc.) of the List of Chinese monarchs on Wikipedia via the API? I use the code below to connect:
from pprint import pprint
import requests, wikitextparser
r = requests.get(
    'https://en.wikipedia.org/w/api.php',
    params={
        'action': 'query',
        'titles': 'List_of_Chinese_monarchs',
        'prop': 'revisions',
        'rvprop': 'content',
        'format': 'json',
    },
)
r.raise_for_status()
pages = r.json()['query']['pages']
body = next(iter(pages.values()))['revisions'][0]['*']
doc = wikitextparser.parse(body)
print(f'{len(doc.tables)} tables retrieved')
han = doc.tables[5].data()
doc.tables[6].data()
doc.tables[i].data() only returns the table values, without the <h2> section heading above each table. I would like to end up with a list of title strings, one for each of the 83 tables returned.
Original website:
https://en.wikipedia.org/wiki/List_of_Chinese_monarchs

I'm not sure why you are using doc.tables when it is the sections you are interested in. This works for me:
for i in range(1, 94):
    print(doc.sections[i].title.replace('[[', '').replace(']]', ''))
I get 94 sections rather than 83, though, and while you can use len(doc.sections), that count will include See also and other non-table sections. There must be a more elegant way of removing the wikilinks.
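A minimal sketch building on that idea, assuming the doc produced by wikitextparser.parse() in the question: each Section object also exposes its own .tables, so you can pair every table with the heading of the section it sits in. Note that sections nest, so a table can show up under both a parent and a child heading; filtering on section.level (or keeping only the deepest match) is one way to deduplicate.
table_titles = []
for section in doc.sections:
    # the lead section has no title, hence the `or ''`
    title = (section.title or '').strip().replace('[[', '').replace(']]', '')
    for table in section.tables:
        table_titles.append((title, table))

for title, table in table_titles[:5]:
    print(title, '->', len(table.data()), 'rows')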

Related

Scrapy Wikipedia: Yield does not show all rows

I am trying to get the GDP Estimate (Under IMF) from the following page:
https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)
However, I am only getting the first row (93,863,851). Here's the Scrapy Spider code:
def parse(self, response):
    title = response.xpath("(//tbody)[3]")
    for country in title:
        yield {'GDP': country.xpath(".//td[3]/text()").get()}
On the other hand, I can use the getall() method to get all the data, but that puts every data point into a single cell when I export to CSV/XLSX, so it is not a solution for me.
How can I get all the data points via the loop? Please help.
Your selector is not correct. You should loop through the table rows and yield the data that you need. See sample below.
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)']

    def parse(self, response):
        for row in response.xpath("//caption/parent::table/tbody/tr"):
            yield {
                "country": row.xpath("./td[1]/a/text()").get(),
                "region": row.xpath("./td[2]/a/text()").get(),
                "imf_est": row.xpath("./td[3]/text()").get(),
                "imf_est_year": row.xpath("./td[4]/text()").get(),
                "un_est": row.xpath("./td[5]/text()").get(),
                "un_est_year": row.xpath("./td[6]/text()").get(),
                "worldbank_est": row.xpath("./td[7]/text()").get(),
                "worldbank_est_year": row.xpath("./td[8]/text()").get(),
            }
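Because one dict is yielded per table row, Scrapy's built-in feed export then writes one CSV line per country, which avoids the single-cell problem you hit with getall(). A usage sketch, assuming the spider above is saved as test_spider.py (the file name is just a placeholder):
scrapy runspider test_spider.py -o gdp_nominal.csv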

Not able to set tag column value on monday.com item

I am trying to use Python to automate common monday.com tasks. I am able to create an item on the board, but the column (type=tag) is not updating.
I used this tutorial:
https://support.monday.com/hc/en-us/articles/360013483119-API-Quickstart-Tutorial-Python#
Here is my graphql code that I am executing:
query = 'mutation ($device: String!, $columnVals: JSON!) { create_item (board_id: <myboardid>, item_name: $device, column_values: $columnVals) { id } }'
vars = {
    'device': device,
    'columnVals': json.dumps({
        'cloud_name6': {'text': cloudname}  # this is where I want to add a tag; cloud_name6 is the id of the column
    })
}
data = {'query': query, 'variables': vars}
r = requests.post(url=apiUrl, json=data, headers=headers)
print(r.json())
I have tried changing the key in the JSON string from id to title, but no luck. I fetched an existing item and sent back its exact JSON string, still no luck. I also tried the JSON data below, without any luck:
'columnVals': json.dumps({
    'cloud_name6': cloudname
})
Any idea what's wrong with the query?
When creating or mutating tag columns via item queries, you need to send an array of ids of the tags ("tag_ids") that relate to this item. You don't set or alter tag names via an item query.
Corrected Code
'columnVals': json.dumps({
    'cloud_name6': {'tag_ids': [295026, 295064]}
})
https://developer.monday.com/api-reference/docs/tags
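For completeness, a minimal sketch of the whole request with the corrected column value. api.monday.com/v2 is the standard endpoint, but the token, board id, item name, and the tag ids 295026/295064 are placeholders; use the ids returned for your own account (tag ids can be looked up with a tags query).
import json
import requests

apiUrl = 'https://api.monday.com/v2'
headers = {'Authorization': 'YOUR_API_TOKEN'}  # placeholder token

query = ('mutation ($device: String!, $columnVals: JSON!) { '
         'create_item (board_id: 1234567890, item_name: $device, '  # placeholder board id
         'column_values: $columnVals) { id } }')

vars = {
    'device': 'my-device',  # placeholder item name
    'columnVals': json.dumps({
        'cloud_name6': {'tag_ids': [295026, 295064]}  # ids of existing tags, not tag names
    })
}

r = requests.post(url=apiUrl, json={'query': query, 'variables': vars}, headers=headers)
print(r.json())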

List of dictionaries to dataframe

I'm trying to make a dataframe out of a list of dictionaries. I am quite new at this whole programming thing, and Google just makes me more confused. That is why I am turning to you guys, hoping for some assistance.
The first two list values ('YV01', '3nP3RFgGnBrOfILK4DF2Tp') I would like to have under columns called Name and GlobalId. I would like to drop Pset_WallCommon, AC_Pset_RenovationAndPhasing, and BaseQuantities, and use the rest of the keys (if that is what they are called) as column names.
It would be great if someone could give me the right push :)
For the record: I am parsing an IFC file with the IfcOpenShell package.
The data:
['YV01', '3nP3RFgGnBrOfILK4DF2Tp', {'Pset_WallCommon': {'Combustible': False, 'Compartmentation': False, 'ExtendToStructure': False, 'SurfaceSpreadOfFlame': '', 'ThermalTransmittance': 0.0, 'Reference': '', 'AcousticRating': '', 'FireRating': '', 'LoadBearing': False, 'IsExternal': False}, 'AC_Pset_RenovationAndPhasing': {'Renovation Status': 'New'}, 'BaseQuantities': {'Length': 13786.7314346, 'Height': 2700.0, 'Width': 276.0, 'GrossFootprintArea': 3.88131387595, 'NetFootprintArea': 3.88131387595, 'GrossSideArea': 37.9693748734, 'NetSideArea': 37.9693748734, 'GrossVolume': 10.4795474651, 'NetVolume': 10.4795474651}}, 'YV01', '1M4JyBJhXD5xt8fBFUcjUU', {'Pset_WallCommon': {'Combustible': False, 'Compartmentation': False, 'ExtendToStructure': False, 'SurfaceSpreadOfFlame': '', 'ThermalTransmittance': 0.0, 'Reference': '', 'AcousticRating': '', 'FireRating': '', 'LoadBearing': False, 'IsExternal': False}, 'AC_Pset_RenovationAndPhasing': {'Renovation Status': 'New'}, 'BaseQuantities': {'Length': 6166.67382573, 'Height': 2700.0, 'Width': 276.0, 'GrossFootprintArea': 1.6258259759, 'NetFootprintArea': 1.6258259759, 'GrossSideArea': 15.9048193295, 'NetSideArea': 15.9048193295, 'GrossVolume': 4.38973013494, 'NetVolume': 4.38973013494}}
all_walls = ifc_file.by_type('IfcWall')
wallList = []
for wall in all_walls:
    propertySets = ifcopenshell.util.element.get_psets(wall)
    wallList.append(wall.Name)
    wallList.append(wall.GlobalId)
    wallList.append(propertySets)
print(wallList)
wall_table = pd.DataFrame.from_records(wallList)
print(wall_table)
I have tried the basic pd.DataFrame.from_dict/records/arrays(data) constructors, but the output is not what I want.
UPDATE: Thank you so much for your help, I am learning a lot from this!
So I made a dictionary out of the wallList and flattened it, like this:
# list of walls
for wall in all_walls:
    propertySets = ifcopenshell.util.element.get_psets(wall)
    wallList.append(wall.Name)
    wallList.append(wall.GlobalId)
    wallList.append(propertySets)

# dict from list
wall_dict = {i: wallList[i] for i in range(0, len(wallList))}
new_dict = {}

# flattening dict
for key, value in wall_dict.items():
    if isinstance(value, dict):
        for key in value.keys():
            for key2 in value[key].keys():
                new_dict[key + '_' + key2] = value[key][key2]
    else:
        new_dict[key] = value

wall_table = pd.DataFrame.from_dict(new_dict, orient='index')
print(wall_table)
It seems to work pretty well; the only problem is that the dataframe contains all walls but only property-set data from the first wall in the list. I can't seem to understand how the dict-flattening loop works. I would also like the index names (Pset_WallCommon_Combustible, and so on) to be the columns in my dataframe. Is that possible?
EDIT: Simply flattening a list like that goes nowhere. Actually, I think you should drop the list altogether and try to load the DataFrame from a dictionary. We'd need to see what all_walls looks like to help you with that, though.
Have you tried directly loading the all_walls dictionary into a dataframe: df = pd.DataFrame.from_dict(all_walls)?
If that doesn't work, I think flattening the dictionaries in a fashion similar to the following should do the trick.
new_dict = {}
for key, value in all_walls.items():
    if isinstance(value, dict):
        for key in value.keys():
            for key2 in value[key].keys():
                new_dict[key + '_' + key2] = value[key][key2]
    else:
        new_dict[key] = value
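A minimal sketch of that per-wall approach, assuming all_walls is the ifc_file.by_type('IfcWall') result from the question: build one flat dict per wall instead of one long list, and pandas will turn the list of dicts into one row per wall, with the property names as columns (prefix them with the pset name if you want columns like Pset_WallCommon_Combustible instead).
import pandas as pd
import ifcopenshell.util.element

rows = []
for wall in all_walls:
    row = {'Name': wall.Name, 'GlobalId': wall.GlobalId}
    for pset_name, props in ifcopenshell.util.element.get_psets(wall).items():
        # drop the pset level and keep the inner property names as columns;
        # use {f'{pset_name}_{k}': v for k, v in props.items()} to prefix instead
        row.update(props)
    rows.append(row)

wall_table = pd.DataFrame(rows)
print(wall_table.head())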

amadeus API list of all possible hotel "amenities"

In the Amadeus hotels API there are amenities choices, and the search results contain different possibilities as well.
To make amenities more user-readable, I'd like a FULL list of ALL the different possible amenities, so that I can populate a database with each amenity code and its translations.
For a client searching for hotels, values like ACC_BATHS or SAFE_DEP_BOX are not exactly reader-friendly...
I'm referring to this
{
  "data": [
    {
      "type": "hotel-offers",
      "hotel": {
        "type": "hotel",
        "cityCode": "MIA",
        ...
        "amenities": [
          "HANDICAP_FAC",
          "ACC_BATHS",
          "ACC_WASHBASIN",
          "ACC_BATH_CTRLS",
          "ACC_LIGHT_
Where can I find a CSV of all amenities?
I contacted Amadeus tech support and this is what they answered:
(you can copy this list, it's in CSV format: NAME_OF_AMENITY,amenity_code)
226 codes
PHOTOCOPIER,BUS.2
PRINTER,BUS.28
AUDIO-VIS_EQT,BUS.37
WHITE/BLACKBOARD,BUS.38
BUSINESS_CENTER,BUS.39
CELLULAR_PHONE_RENTAL,BUS.40
COMPUTER_RENTAL,BUS.41
EXECUTIVE_DESK,BUS.42
LCD/PROJECTOR,BUS.45
MEETING_ROOMS,BUS.46
OVERHEAD_PROJECTOR,BUS.48
SECRETARIAL_SERVICES,BUS.49
CONFERENCE_SUITE,BUS.94
CONVENTION_CTR,BUS.95
MEETING_FACILITIES,BUS.96
24_HOUR_FRONT_DESK,HAC.1
DISABLED_FACILITIES,HAC.101
MULTILINGUAL_STAFF,HAC.103
WEDDING_SERVICES,HAC.104
BANQUETING_FACILITIES,HAC.105
PORTER/BELLBOY,HAC.106
BEAUTY_PARLOUR,HAC.107
WOMENS_GST_RMS,HAC.110
PHARMACY,HAC.111
120_AC,HAC.113
120_DC,HAC.114
220_AC,HAC.115
220_DC,HAC.117
BARBECUE,HAC.118
BUTLER_SERVICE,HAC.136
CAR_RENTAL,HAC.15
CASINO,HAC.16
BAR,HAC.165
LOUNGE,HAC.165
TRANSPORTATION,HAC.172
WIFI,HAC.178
WIRELESS_CONNECTIVITY,HAC.179
BALLROOM,HAC.191
BUS_PARKING,HAC.192
CHILDRENS_PLAY_AREA,HAC.193
NURSERY,HAC.194
DISCO,HAC.195
24_HOUR_ROOM_SERVICE,HAC.2
COFFEE_SHOP,HAC.20
BAGGAGE_STORAGE,HAC.201
NO_KID_ALLOWED,HAC.217
KIDS_WELCOME,HAC.218
COURTESY_CAR,HAC.219
CONCIERGE,HAC.22
NO_PORN_FILMS,HAC.220
INT_HOTSPOTS,HAC.221
FREE_INTERNET,HAC.222
INTERNET_SERVICES,HAC.223
PETS_ALLOWED,HAC.224
FREE_BREAKFAST,HAC.227
CONFERENCE_FACILITIES,HAC.24
HI_INTERNET,HAC.259
EXCHANGE_FAC,HAC.26
LOBBY,HAC.276
DOCTOR_ON_CALL,HAC.28
24H_COFFEE_SHOP,HAC.281
AIRPORT_SHUTTLE,HAC.282
LUGGAGE_SERVICE,HAC.283
PIANO_BAR,HAC.284
VIP_SECURITY,HAC.285
DRIVING_RANGE,HAC.30
DUTY_FREE_SHOP,HAC.32
ELEVATOR,HAC.33
EXECUTIVE_FLR,HAC.34
GYM,HAC.35
EXPRESS_CHECK_IN,HAC.36
EXPRESS_CHECK_OUT,HAC.37
FLORIST,HAC.39
CONNECTING_ROOMS,HAC.4
FREE_AIRPORT_SHUTTLE,HAC.41
FREE_PARKING,HAC.42
FREE_TRANSPORTATION,HAC.43
GAMES_ROOM,HAC.44
GIFT_SHOP,HAC.45
HAIRDRESSER,HAC.46
ICE_MACHINES,HAC.52
GARAGE_PARKING,HAC.53
JACUZZI,HAC.55
JOGGING_TRACK,HAC.56
KENNELS,HAC.57
LAUNDRY_SVC,HAC.58
AIRLINE_DESK,HAC.6
LIVE_ENTERTAINMENT,HAC.60
MASSAGE,HAC.61
NIGHT_CLUB,HAC.62
SWIMMING_POOL,HAC.66
PARKING,HAC.68
ATM/CASH_MACHINE,HAC.7
POOLSIDE_SNACK_BAR,HAC.72
RESTAURANT,HAC.76
ROOM_SERVICE,HAC.77
SAFE_DEP_BOX,HAC.78
SAUNA,HAC.79
BABY-SITTING,HAC.8
SOLARIUM,HAC.83
SPA,HAC.84
CONVENIENCE_STOR,HAC.88
PICNIC_AREA,HAC.9
THEATRE_DESK,HAC.90
TOUR_DESK,HAC.91
TRANSLATION_SERVICES,HAC.92
TRAVEL_AGENCY,HAC.93
VALET_PARKING,HAC.97
VENDING_MACHINES,HAC.98
TELECONFERENCE,MRC.121
VOLTAGE_AVAILABLE,MRC.123
NATURAL_DAYLIGHT,MRC.126
GROUP_RATES,MRC.141
INTERNET-HIGH_SPEED,MRC.17
VIDEO_CONF_FACILITIES,MRC.53
ACC_BATHS,PHY.102
BR/L_PRINT_LIT,PHY.103
ADAPT_RM_DOORS,PHY.104
ACC_RM_WCHAIR,PHY.105
SERV_SPEC_MENU,PHY.106
WIDE_ENTRANCE,PHY.107
WIDE_CORRIDORS,PHY.108
WIDE_REST_ENT,PHY.109
ACC_LIGHT_SW,PHY.15
ACC_WCHAIR,PHY.28
SERV_DOGS_ALWD,PHY.29
ACC_WASHBASIN,PHY.3
ACC_TOILETS,PHY.32
ADAPT_BATHROOM,PHY.38
HANDRAIL_BTHRM,PHY.38
ADAPTED_PHONES,PHY.39
ACC_ELEVATORS,PHY.42
TV_SUB/CAPTION,PHY.45
DIS_PARKG,PHY.50
EMERG_COD/BUT,PHY.57
HANDICAP_FAC,PHY.6
DIS_EMERG_PLAN,PHY.60
HEAR_IND_LOOPS,PHY.65
BR/L_PRNT_MENU,PHY.66
DIS_TRAIN_STAF,PHY.71
PIL_ALARMS_AVL,PHY.76
ACC_BATH_CTRLS,PHY.79
PUTTING_GREEN,REC.5
TROUSER_PRESS,RMA.111
VIDEO,RMA.116
GAMES_SYSTEM_IN_ROOM,RMA.117
VOICEMAIL_IN_ROOM,RMA.118
WAKEUP_SERVICE,RMA.119
WI-FI_IN_ROOM,RMA.123
CD_PLAYER,RMA.129
BATH,RMA.13
MOVIE_CHANNELS,RMA.139
SHOWER,RMA.142
OUTLET_ADAPTERS,RMA.159
BIDET,RMA.16
DVD_PLAYER,RMA.163
CABLE_TELEVISION,RMA.18
OVERSIZED_ROOMS,RMA.185
TEA/COFFEE_MK_FACILITIES,RMA.19
AIR_CONDITIONING,RMA.2
TELEVISION,RMA.20
ANNEX_ROOM,RMA.204
FREE_NEWSPAPER,RMA.205
HONEYMOON_SUITES,RMA.206
INTERNETFREE_HIGH_IN_RM,RMA.207
MAID_SERVICE,RMA.208
PC_HOOKUP_INRM,RMA.209
PC_IN_ROOM,RMA.21
SATELLITE_TV,RMA.210
VIP_ROOMS,RMA.211
CORDLESS_PHONE,RMA.25
CRIBS_AVAILABLE,RMA.26
ALARM_CLOCK,RMA.3
PHONE-DIR_DIAL,RMA.31
FAX_FAC_INROOM,RMA.38
FREE_LOCAL_CALLS,RMA.45
HAIR_DRYER,RMA.50
INTERNET-HI_SPEED_IN_RM,RMA.51
IRON/IRON_BOARD,RMA.55
KITCHEN,RMA.59
BABY_LISTENING_DEVICE,RMA.6
LAUNDRY_EQUIPMENT_IN_ROOM,RMA.66
MICROWAVE,RMA.68
MINIBAR,RMA.69
NONSMOKING_RMS,RMA.74
REFRIGERATOR,RMA.88
ROLLAWAY_BEDS,RMA.91
SAFE,RMA.92
WATER_SPORTS,RST.110
ANIMAL_WATCHING,RST.126
BIRD_WATCHING,RST.127
SIGHTSEEING,RST.142
BEACH_WITH_DIRECT_ACCESS,RST.155
SKI_IN/OUT,RST.156
TENNIS_PROFESSIONAL,RST.157
FISHING,RST.20
GOLF,RST.27
FITNESS_CENTER,RST.36
BEACH,RST.5
HORSE_RIDING,RST.61
INDOOR_TENNIS,RST.62
MINIATURE_GOLF,RST.67
BOATING,RST.7
TENNIS,RST.71
SCUBA_DIVING,RST.82
SKEET_SHOOTING,RST.85
SNOW_SKIING,RST.88
BOWLING,RST.9
VOLLEYBALL,RST.98
ELEC_GENERATOR,SEC.15
EMERG_LIGHTING,SEC.19
FIRE_DETECTORS,SEC.22
GUARDED_PARKG,SEC.34
RESTRIC_RM_ACC,SEC.39
EXT_ROOM_ENTRY,SEC.40
INT_ROOM_ENTRY,SEC.41
SMOKE_DETECTOR,SEC.50
ROOMS_WITH_BALCONIES,SEC.51
SPRINKLERS,SEC.54
FIRST_AID_STAF,SEC.57
SECURITY_GUARD,SEC.58
VIDEO_SURVEIL,SEC.62
EXTINGUISHERS,SEC.89
FIRE_SAFETY,SEC.9
FEMA_FIRE_SAFETY_COMPLIANT,SEC.93
FIRE_SAF_NOT_STANDARD,SEC.95
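A minimal sketch for turning that pasted list into a lookup table, assuming you save it as amenities.csv (each line being NAME_OF_AMENITY,amenity_code); the 'label' below is just a crude readable default you would replace with proper translations in your database.
import csv

amenity_labels = {}
with open('amenities.csv', newline='') as f:
    for row in csv.reader(f):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        name, code = row
        amenity_labels[name] = {
            'code': code,
            'label': name.replace('_', ' ').capitalize(),  # e.g. 'SAFE_DEP_BOX' -> 'Safe dep box'
        }

print(amenity_labels['SAFE_DEP_BOX'])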
According to the API, you can filter the offers by amenities:
https://developers.amadeus.com/self-service/category/hotel/api-doc/hotel-search/api-reference
I assume the multiple select list in the amenities property contains all the items you need.
EDIT: I noticed that, unfortunately, the response example contains additional values beyond those offered as input options, so the input list alone is not enough.
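A rough sketch of such a filtered search with plain requests, assuming the v2 hotel-offers endpoint that produced the response in the question and an OAuth access token you have already obtained; check the exact endpoint version and parameter names against the API reference linked above.
import requests

token = 'YOUR_ACCESS_TOKEN'  # obtained from the Amadeus OAuth token endpoint
r = requests.get(
    'https://test.api.amadeus.com/v2/shopping/hotel-offers',
    headers={'Authorization': f'Bearer {token}'},
    params={
        'cityCode': 'MIA',
        'amenities': 'SWIMMING_POOL,SPA',  # values taken from the list above
    },
)
r.raise_for_status()
print(r.json())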

Extracting href from attribute with BeautifulSoup

I use this method
allcity = dom.body.findAll(attrs={'id': re.compile(r"\d{1,2}")})
to return a list like this:
[<a onmousedown="return c({'fm':'as','F':'77B717EA','F1':'9D73F1E4','F2':'4CA6DE6B','F3':'54E5243F','T':'1279189248','title':this.innerHTML,'url':this.href,'p1':1,'y':'B2D76EFF'})" href="http://www.ylyd.com/showurl.asp?id=6182" target="_blank"><font size="3">掳虏驴碌路驴碌脴虏煤脨脜脧垄脥酶 隆煤 脢脦脝路脦露脕卢陆脫</font></a>,
掳脵露脠驴矛脮脮]
How do I extract this href?
http://www.ylyd.com/showurl.asp?id=6182
Thanks. :)
You can use:
for a in dom.body.findAll(attrs={'id': re.compile(r"\d{1,2}")}, href=True):
    print(a['href'])
In this example, there's no real need to use a regex; it can be as simple as calling the <a> tag and then the ['href'] attribute, like so:
get_me_url = soup.a['href'] # http://www.ylyd.com/showurl.asp?id=6182
# cached URL
get_me_cached_url = soup.find('a', class_='m')['href']
You can always use the prettify() method to better see the HTML code.
from bs4 import BeautifulSoup
string = '''
[
<a href="http://www.ylyd.com/showurl.asp?id=6182" onmousedown="return c({'fm':'as','F':'77B717EA','F1':'9D73F1E4','F2':'4CA6DE6B','F3':'54E5243F','T':'1279189248','title':this.innerHTML,'url':this.href,'p1':1,'y':'B2D76EFF'})" target="_blank">
<font size="3">
掳虏驴碌路驴碌脴虏煤脨脜脧垄脥酶 隆煤 脢脦脝路脦露脕卢陆脫
</font>
</a>
,
<a class="m" href="http://cache.baidu.com/c?m=9f65cb4a8c8507ed4fece763105392230e54f728629c86027fa3c215cc791a1b1a23a4fb7935107380843e7000db120afdf14076340920a3de95c81cd2ace52f38fb5023716c914b19c46ea8dc4755d650e34d99aa0ee6cae74596b9a1d6c85523dd58716df7f49c5b7003c065e76445&p=8b2a9403c0934eaf5abfc8385864&user=baidu" target="_blank">
掳脵露脠驴矛脮脮
</a>
]
'''
soup = BeautifulSoup(string, 'html.parser')
href = soup.a['href']
cache_href = soup.find('a', class_='m')['href']
print(f'{href}\n{cache_href}')
# output:
'''
http://www.ylyd.com/showurl.asp?id=6182
http://cache.baidu.com/c?m=9f65cb4a8c8507ed4fece763105392230e54f728629c86027fa3c215cc791a1b1a23a4fb7935107380843e7000db120afdf14076340920a3de95c81cd2ace52f38fb5023716c914b19c46ea8dc4755d650e34d99aa0ee6cae74596b9a1d6c85523dd58716df7f49c5b7003c065e76445&p=8b2a9403c0934eaf5abfc8385864&user=baidu
'''
Alternatively, you can do the same thing using the Baidu Organic Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.
Essentially, the main difference in this example is that you don't have to figure out how to grab certain elements since it's already done for the end-user with a JSON output.
Code to grab href/cached href from first page results:
from serpapi import BaiduSearch

params = {
    "api_key": "YOUR_API_KEY",
    "engine": "baidu",
    "q": "ylyd"
}

search = BaiduSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    # try/except used since sometimes there's no link/cached link
    try:
        link = result['link']
    except:
        link = None
    try:
        cached_link = result['cached_page_link']
    except:
        cached_link = None
    print(f'{link}\n{cached_link}\n')
# Part of the output:
'''
http://www.baidu.com/link?url=7VlSB5iaA1_llQKA3-0eiE8O9sXe4IoZzn0RogiBMCnJHcgoDDYxz2KimQcSDoxK
http://cache.baiducontent.com/c?m=LU3QMzVa1VhvBXthaoh17aUpq4KUpU8MCL3t1k8LqlKPUU9qqZgQInMNxAPNWQDY6pkr-tWwNiQ2O8xfItH5gtqxpmjXRj0m2vEHkxLmsCu&p=882a9646d5891ffc57efc63e57519d&newp=926a8416d9c10ef208e2977d0e4dcd231610db2151d6d5106b82c825d7331b001c3bbfb423291505d3c77e6305a54d5ceaf13673330923a3dda5c91d9fb4c57479c77a&s=c81e728d9d4c2f63&user=baidu&fm=sc&query=ylyd&qid=e42a54720006d857&p1=1
'''
Disclaimer, I work for SerpApi.