How to get all the substates from a country in OSMnx?

What would be the code to easily get all the states (second subdivisions) of a country?
The pattern from OSMNX is, more or less:
division       admin_level
country        2
region         3
state          4
city           8
neighborhood   10
For example, to get all the neighborhoods of a city:
import pandas as pd
import geopandas as gpd
import osmnx as ox
place = 'Rio de Janeiro'
tags = {'admin_level': '10'}
gdf = ox.geometries_from_place(place, tags)
Wouldn't the same apply if one wants the states of a country?
place = 'Brasil'
tags = {'admin_level': '4'}
gdf = ox.geometries_from_place(place, tags)
I'm not even sure whether this snippet works at all, because I let it run for four hours and it never finished. Maybe the package isn't made for downloading big chunks of data, or there's a more efficient solution than ox.geometries_from_place() for this task, or there's more information I could add to the tags. Help is appreciated.

OSMnx can potentially get all the states or provinces of some country, but this isn't a use case it's optimized for, and your specific use creates a few obstacles (you can see your query reproduced on Overpass Turbo):
- You're using the default query area size, so it's making thousands of requests.
- Brazil's bounding box intersects portions of overseas French territory, which in turn pulls in all of France (spanning the entire globe).
- OSMnx uses an r-tree to filter the final results, but globe-spanning results make this index perform very slowly.
OSMnx can acquire geometries either via the geometries module (as you're doing) or via the geocode_to_gdf function in the geocoder module. You may want to try the latter if it fits your use case, as it's far more efficient.
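For instance, if you already know (or can list) the state names, a sketch like the following should work with geocode_to_gdf (the state names below are only illustrative, not a complete list):

import osmnx as ox

# One Nominatim geocoding request per place; no Overpass download involved
states = ['Rio Grande do Sul, Brazil', 'Santa Catarina, Brazil', 'Paraná, Brazil']
gdf = ox.geocode_to_gdf(states)
gdf.plot()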
With that in mind, if you must use the geometries module, you can try a few things to improve performance. First off, adjust the query area size so you're downloading everything with one single API request. You're downloading relatively few entities, so the huge query area should still be OK within the timeout interval. The "intersecting overseas France" and "globe-spanning r-tree" problems are harder to solve. But as a demonstration, here's a simple example with Uruguay instead. It takes around 20 seconds to run everything on my machine:
import osmnx as ox

ox.settings.log_console = True
# Allow a huge query area so everything is downloaded in a single API request
ox.settings.max_query_area_size = 25e12

place = 'Uruguay'
tags = {'admin_level': '4'}
gdf = ox.geometries_from_place(place, tags)

# Keep only results tagged as belonging to the queried country
gdf = gdf[gdf["is_in:country"] == place]
gdf.plot()
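One caveat (an assumption about Overpass results in general, not something verified against this exact query): the returned GeoDataFrame can mix state polygons with boundary ways and label nodes that carry the same admin_level tag, so you may also want to keep only the polygonal geometries:

# Optionally keep only polygonal results (drop boundary ways and label nodes)
gdf = gdf[gdf.geometry.geom_type.isin(['Polygon', 'MultiPolygon'])]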

Related

How to extract XML tags with BeautifulSoup?

I am trying to extract the tags from this data:
[{"title":"Joshua Cohen","nid":"21706","type":"winner","changed":"1651960857","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"640"}]},"field_abbr_citation":{"und":[{"safe_value":"A mordant, linguistically deft historical novel about the ambiguities of the Jewish-American experience, presenting ideas and disputes as volatile as its tightly-wound plot."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Netanyahus: An Account of a Minor and Ultimately Even Negligible Episode in the History of a Very Famous Family"}]},"field_publisher":{"und":[{"safe_value":"New York Review Books"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/joshua-cohen"},{"title":"Louise Erdrich","nid":"21286","type":"winner","changed":"1623362816","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"632"}]},"field_abbr_citation":{"und":[{"safe_value":"A majestic, polyphonic novel about a community\u2019s efforts to halt the proposed displacement and elimination of several Native American tribes in the 1950s, rendered with dexterity and imagination."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Night Watchman"}]},"field_publisher":{"und":[{"safe_value":"Harper"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/louise-erdrich"},
But I cannot seem to get the tags; I am trying:
# Import BeautifulSoup
from bs4 import BeautifulSoup as bs

content = []
# Read the XML file
with open("file.xml", "r") as file:
    # Read each line in the file
    content = file.readlines()
    # Combine the lines in the list into a string
    content = "".join(content)

bs_content = bs(content, "lxml")
result = bs_content.find_all("title")
print(result)
But I only get an empty []
Appreciate any help!
It is not XML, it's a JSON-like structure, so simply iterate over the list of dicts:
l = [{"title":"Joshua Cohen","nid":"21706","type":"winner","changed":"1651960857","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"640"}]},"field_abbr_citation":{"und":[{"safe_value":"A mordant, linguistically deft historical novel about the ambiguities of the Jewish-American experience, presenting ideas and disputes as volatile as its tightly-wound plot."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Netanyahus: An Account of a Minor and Ultimately Even Negligible Episode in the History of a Very Famous Family"}]},"field_publisher":{"und":[{"safe_value":"New York Review Books"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/joshua-cohen"},{"title":"Louise Erdrich","nid":"21286","type":"winner","changed":"1623362816","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"632"}]},"field_abbr_citation":{"und":[{"safe_value":"A majestic, polyphonic novel about a community\u2019s efforts to halt the proposed displacement and elimination of several Native American tribes in the 1950s, rendered with dexterity and imagination."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Night Watchman"}]},"field_publisher":{"und":[{"safe_value":"Harper"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/louise-erdrich"},]
for d in l:
    print(d['title'])
Or, since you have it as a string, just convert it first via json.loads():
import json
l = '[{"title":"Joshua Cohen","nid":"21706","type":"winner","changed":"1651960857","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"640"}]},"field_abbr_citation":{"und":[{"safe_value":"A mordant, linguistically deft historical novel about the ambiguities of the Jewish-American experience, presenting ideas and disputes as volatile as its tightly-wound plot."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Netanyahus: An Account of a Minor and Ultimately Even Negligible Episode in the History of a Very Famous Family"}]},"field_publisher":{"und":[{"safe_value":"New York Review Books"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/joshua-cohen"},{"title":"Louise Erdrich","nid":"21286","type":"winner","changed":"1623362816","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"632"}]},"field_abbr_citation":{"und":[{"safe_value":"A majestic, polyphonic novel about a community\u2019s efforts to halt the proposed displacement and elimination of several Native American tribes in the 1950s, rendered with dexterity and imagination."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Night Watchman"}]},"field_publisher":{"und":[{"safe_value":"Harper"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/louise-erdrich"}]'
for d in json.loads(l):
    print(d['title'])
Output:
Joshua Cohen
Louise Erdrich
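If the data actually lives in a file, as in the question (it reads "file.xml"), you can also let the json module parse the file object directly instead of using readlines. This is a sketch under the assumption that the file contains the full, valid JSON array (the snippet above is truncated):

import json

# Parse the file directly; no BeautifulSoup needed since the content is JSON, not XML
with open("file.xml", "r") as file:
    data = json.load(file)

for d in data:
    print(d['title'])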

How to get the contributors you've coincided with the most when editing Wikipedia

I'm doing a gamification web app to help Wikimedia's community health.
I want to find which editors have edited the same pages as 'Jake' the most in the last week, or the last 100 edits, or something like that.
I know my query, but I can't figure out what tables I need because the Wikimedia DB layout is a mess.
So, I want to obtain something like
Username    Occurrences    Pages
Mikey       13             Obama, ...
So the query would be something like (I'm accepting suggestions):
1. Get the pages that the user 'Jake' has edited in the last week.
2. Get the contributors of those pages in the last week.
3. For each of these contributors, get the pages they have edited in the last week, see if they match the pages 'Jake' has edited, and count them.
I've tried doing something simpler in Pywikibot, but it's very, very slow (20 seconds for the last 500 contributions of Jake): I only get the edited pages, get the contributors of each page, and count them, and it's still very slow.
My pywikibot code is:
site = Site(langcode, 'wikipedia')
user = User(site, username)
contributed_pages = set()
for page, oldid, ts, comment in user.contributions(total=100, namespaces=[0]):
    contributed_pages.add(page)
return get_contributor_ocurrences(contributed_pages, site, username)
And the function
def get_contributor_ocurrences(contributed_pages, site, username):
    contributors = []
    for page in contributed_pages:
        for editor in page.contributors():
            if APISite.isBot(self=site, username=editor) or editor == username:
                continue
            contributors.append(editor)
    return Counter(contributors)
PS: I have access to the DB replicas, which I guess are way faster than the Wikimedia API or Pywikibot.
You can filter the data to be retrieved with timestamp parameters. This decreases the time needed a lot. Refer to the documentation for their usage. Here is a code snippet to get the data with Pywikibot using timestamps:
from collections import Counter
from datetime import timedelta

import pywikibot
from pywikibot.tools import filter_unique

site = pywikibot.Site()
user = pywikibot.User(site, username)  # username must be a string

# Set up the generator for the last 7 days.
# Do not care about the timestamp format if using pywikibot.Timestamp
stamp = pywikibot.Timestamp.now() - timedelta(days=7)
contribs = user.contributions(end=stamp)

contributors = []
# filter_unique is used to remove duplicates.
# The key uses the page title.
for page, *_ in filter_unique(contribs, key=lambda x: str(x[0])):
    # note: editors is a Counter
    editors = page.contributors(endtime=stamp)
    print('{:<35}: {}'.format(page.title(), editors))
    contributors.extend(editors.elements())

total = Counter(contributors)
This prints a list of pages and, for each page, shows the editors and their contribution counts for the given time range. Finally, total should have the same content as your get_contributor_ocurrences function above.
It requires some additional work to get the table you mentioned above.
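As a rough sketch of that additional work (building on the snippet above; shared_pages, the username filter, and the column widths are my own additions, not part of Pywikibot), you could record which pages you share with each editor while iterating and then print the table sorted by occurrences:

from collections import Counter, defaultdict
from datetime import timedelta

import pywikibot
from pywikibot.tools import filter_unique

site = pywikibot.Site()
user = pywikibot.User(site, username)  # username must be a string, as above
stamp = pywikibot.Timestamp.now() - timedelta(days=7)

shared_pages = defaultdict(set)  # editor -> titles of pages you both edited
contributors = []
for page, *_ in filter_unique(user.contributions(end=stamp),
                              key=lambda x: str(x[0])):
    editors = page.contributors(endtime=stamp)  # a Counter, as above
    for editor in editors:
        shared_pages[editor].add(page.title())
    contributors.extend(editors.elements())

total = Counter(contributors)
total.pop(username, None)  # do not count the user themselves; bot filtering omitted

# Print a rough version of the Username / Occurrences / Pages table
print('{:<20} {:>11}  {}'.format('Username', 'Occurrences', 'Pages'))
for editor, count in total.most_common():
    print('{:<20} {:>11}  {}'.format(editor, count,
                                     ', '.join(sorted(shared_pages[editor]))))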

Create subgraph query in Gremlin around single node with outgoing and incoming edges

I have a large JanusGraph database and I'd like to create a subgraph centered around one node type, including incoming and outgoing nodes of specific types.
In Cypher, the query would look like this:
MATCH (a:Journal)-[:PublishedIn]-(b:Paper{paperTitle:'My Paper Title'})<-[:AuthorOf]-(c:Author)
RETURN a,b,c
This is what I tried in Gremlin:
sg = g.V().outE('PublishedIn').subgraph('j_p_a').has('Paper','paperTitle', 'My Paper Title')
.inE('AuthorOf').subgraph('j_p_a')
.cap('j_p_a').next()
But I get a syntax error. 'AuthorOf' and 'PublishedIn' are not the only edge types ending at 'Paper' nodes.
Can someone show me how to correctly execute this query in Gremlin?
As written in your query, the outE step yields edges and the has step will check properties on those edges; after that, the query processor will expect an inV step, not another inE. Without your data model it is hard to know exactly what you need; however, looking at the Cypher, I think this is what you want.
sg = g.V().outE('PublishedIn').
       subgraph('j_p_a').
       inV().
       has('Paper', 'paperTitle', 'My Paper Title').
       inE('AuthorOf').
       subgraph('j_p_a').
       cap('j_p_a').
       next()
Edited to add:
As I do not have your data, I used my air-routes graph. I modeled this query on yours and used some select steps to limit the amount of data processed. This seems to work in my testing. Hopefully you can see the changes I made and try them in your query.
sg = g.V().outE('route').as('a').
       inV().
       has('code', 'AUS').as('b').
       select('a').
       subgraph('sg').
       select('b').
       inE('contains').
       subgraph('sg').
       cap('sg').
       next()

Quick Crosswalk For State Abbreviations & State Names

For the millionth time I had a dataset today that listed full state names, but I needed it to list state postal code abbreviations. Here is a code snippet I wrote that mapped the changes for me using data from a generic website.
1) Anyone know of or think of a better solution?
2a) Anyone know of a better web reference? Using USPS sites (such as the ones below) does not seem to work with pd.read_html()
2b) I also had a hard time isolating the correct table from pd.read_html() and the wiki page at: https://en.wikipedia.org/wiki/List_of_U.S._state_abbreviations
import pandas as pd
# Make Generic Data For Demonstration Purpose
data = {'StName':['Wisconsin','Minnesota','Minnesota',
'Wisconsin','Florida','New York']}
df = pd.DataFrame(data)
# Get State Crosswalk From Generic Website
crosswalk = 'http://app02.clerk.org/menu/ccis/Help/CCIS%20Codes/state_codes.html'
states = pd.read_html(crosswalk)[0]
# Demo Crosswalking State Name to State Abbreviation
df['StAbbr'] = df['StName'].map(dict(zip(states['Description'],
states['Code'])))
# Demo Reverse Crosswalking Back to State Name
df['StNameAgain'] = df['StAbbr'].map(dict(zip(states['Code'],
                                              states['Description'])))

Options for questions in the Watson Conversation API

I need to get the available options for a certain question in the Watson Conversation API.
For example, I have a conversation app and in some cases I need to give the users a list to select an option from.
So I am searching for a way to get the available reply options for a certain question.
I can't answer the NPM part, but you can get a list of the top 10 possible answers by setting alternate_intents to true. For example:
{
  "context": {
    "conversation_id": "cbbea7b5-6971-4437-99e0-a82927607079",
    "system": {
      "dialog_stack": ["root"],
      "dialog_turn_counter": 1,
      "dialog_request_counter": 1
    }
  },
  "alternate_intents": true,
  "input": {
    "text": "Is it hot outside?"
  }
}
This will return at most the top ten answers. If there is a limited number of intents it will only show them.
Part of your JSON response will have something like this:
"intents":[{
"intent":"temperature",
"confidence":0.9822100598134365
},
{
"intent":"conditions",
"confidence":0.017789940186563623
}
This won't get you the output text from the node, though, so you will need to have your answers stored elsewhere to cross-reference.
Also be aware that just because an intent is in the list doesn't mean it's a valid answer to give the end user. The confidence level needs to be taken into account.
The confidence level also does not work like a normal confidence. You need to determine your upper and lower bounds. I detail this briefly here.
Unlike earlier versions of WEA, the confidence is relative to the number of intents you have. So the quickest way to find the lowest confidence is to send a really ambiguous word.
These are the results I get for determining temperature or conditions.
treehouse = conditions / 0.5940327076534431
goldfish = conditions / 0.5940327076534431
music = conditions / 0.5940327076534431
See a pattern?🙂 So I will set the low confidence level at 0.6. Next is to determine the higher confidence range. You can do this by mixing intents within the same question text. It may take a few goes to get a reasonable result.
These are results from trying this (C = Conditions, T = Temperature).
hot rain = T/0.7710267712183176, C/0.22897322878168241
windy desert = C/0.8597747113239446, T/0.14022528867605547
ice wind = C/0.5940327076534431, T/0.405967292346557
I purposely left out high-confidence ones. In this case I am going to go with 0.8 as the high confidence level.
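To make the thresholding above concrete, here is a minimal sketch that applies the empirically chosen bounds (0.6 and 0.8 from the experiments above) to an alternate-intents response like the JSON shown earlier; the function name and structure are my own illustration, not part of the Watson API:

LOW_CONFIDENCE = 0.6   # below or at this, treat the intent as noise
HIGH_CONFIDENCE = 0.8  # at or above this, treat the top intent as a confident answer

def usable_intents(response):
    """Return the intents worth acting on, most confident first."""
    intents = response.get("intents", [])
    candidates = [i for i in intents if i["confidence"] > LOW_CONFIDENCE]
    if candidates and candidates[0]["confidence"] >= HIGH_CONFIDENCE:
        return candidates[:1]   # one clear winner
    return candidates           # ambiguous: offer these as options to the user

example = {"intents": [{"intent": "temperature", "confidence": 0.98},
                       {"intent": "conditions", "confidence": 0.02}]}
print(usable_intents(example))  # -> [{'intent': 'temperature', 'confidence': 0.98}]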