Quick Crosswalk For State Abbreviations & State Names - pandas

For the millionth time, I had a dataset today that listed full state names, but I needed state postal-code abbreviations. Here is a code snippet I wrote that mapped the conversion for me using data from a generic website.
1) Does anyone know of, or can anyone think of, a better solution?
2a) Does anyone know of a better web reference? USPS sites (such as the ones below) do not seem to work with pd.read_html().
2b) I also had a hard time isolating the correct table with pd.read_html() from the wiki page at: https://en.wikipedia.org/wiki/List_of_U.S._state_abbreviations
import pandas as pd
# Make Generic Data For Demonstration Purpose
data = {'StName':['Wisconsin','Minnesota','Minnesota',
'Wisconsin','Florida','New York']}
df = pd.DataFrame(data)
# Get State Crosswalk From Generic Website
crosswalk = 'http://app02.clerk.org/menu/ccis/Help/CCIS%20Codes/state_codes.html'
states = pd.read_html(crosswalk)[0]
# Demo Crosswalking State Name to State Abbreviation
df['StAbbr'] = df['StName'].map(dict(zip(states['Description'],
                                         states['Code'])))
# Demo Reverse Crosswalking Back to State Name
df['StNameAgain'] = df['StAbbr'].map(dict(zip(states['Code'],
                                              states['Description'])))
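Regarding 2b), pandas.read_html() accepts a match argument that returns only tables containing the given text, which may help isolate the right table on that Wikipedia page. Here is a minimal sketch, assuming 'Alabama' appears only in the table of interest (the column layout still has to be inspected):
import pandas as pd
# Hypothetical sketch: `match` keeps only tables whose text matches the given
# string/regex, which should narrow the Wikipedia page down to the
# state-abbreviation table. Inspect the result before relying on column names.
url = 'https://en.wikipedia.org/wiki/List_of_U.S._state_abbreviations'
tables = pd.read_html(url, match='Alabama')
print(len(tables))      # how many tables matched
states = tables[0]
print(states.head())    # confirm which columns hold names and codes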


How to get all the substates from a country in OSMNX?

What would be the code to easily get all the states (second subdivisions) of a country?
The pattern from OSMNX is, more or less:
division       admin_level
country        2
region         3
state          4
city           8
neighborhood   10
For example, to get all the neighborhoods of a city:
import pandas as pd
import geopandas as gpd
import osmnx as ox
place = 'Rio de Janeiro'
tags = {'admin_level': '10'}
gdf = ox.geometries_from_place(place, tags)
Wouldn't the same approach apply if one wants the states of a country?
place = 'Brasil'
tags = {'admin_level': '4'}
gdf = ox.geometries_from_place(place, tags)
I'm not even sure whether this snippet works, because I let it run for 4 hours and it never finished. Maybe the package isn't made for downloading big chunks of data, or there's a more efficient solution than ox.geometries_from_place() for this task, or there's more information I could add to the tags. Help is appreciated.
OSMnx can potentially get all the states or provinces from some country, but this isn't a use case it's optimized for, and your specific use creates a few obstacles. You can see your query reproduced on Overpass Turbo.
You're using the default query area size, so it's making thousands of requests
Brazil's bounding box intersects portions of overseas French territory, which in turn pulls in all of France (spanning the entire globe)
OSMnx uses an r-tree to filter the final results, but globe-spanning results make this index perform very slowly
OSMnx can acquire geometries either via the geometries module (as you're doing) or via the geocode_to_gdf function in the geocoder module. You may want to try the latter if it fits your use case, as it's far more efficient.
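For instance, a minimal sketch of the geocode_to_gdf route, assuming you already have (or can list) the state names you need; the two queries below are just placeholders:
import osmnx as ox
# Hypothetical sketch: geocode_to_gdf accepts a list of queries and returns
# one row (with its boundary polygon) per query. Replace the placeholder
# names with the full list of states you need.
queries = ['Paraná, Brazil', 'São Paulo, Brazil']
gdf = ox.geocode_to_gdf(queries)
gdf.plot()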
With that in mind, if you must use the geometries module, you can try a few things to improve performance. First off, adjust the query area size so you're downloading everything in a single API request. You're downloading relatively few entities, so the huge query area should still be OK within the timeout interval. The "intersecting overseas France" and "globe-spanning r-tree" problems are harder to solve. But as a demonstration, here's a simple example with Uruguay instead. It takes twenty-some seconds to run everything on my machine:
import osmnx as ox
ox.settings.log_console = True
ox.settings.max_query_area_size = 25e12  # let the whole query run as a single API request
place = 'Uruguay'
tags = {'admin_level': '4'}
gdf = ox.geometries_from_place(place, tags)
gdf = gdf[gdf["is_in:country"] == place]  # drop results spilling over from neighboring countries
gdf.plot()

SpaCy: Set entity information for a token which is included in more than one span

I am trying to use SpaCy for entity context recognition in the world of ontologies. I'm a novice at using SpaCy and just playing around for starters.
I am using the ENVO Ontology as my 'patterns' list for creating a dictionary for entity recognition. In simple terms the data is an ID (CURIE) and the name of the entity it corresponds to along with its category.
Screenshot of my sample data:
The following is the workflow of my initial code:
Creating patterns and terms
# Imports used across the snippets below
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token

# Set terms and patterns
terms = {}
patterns = []
for curie, name, category in envoTerms.to_records(index=False):
    if name is not None:
        terms[name.lower()] = {'id': curie, 'category': category}
        patterns.append(nlp(name))
Setup a custom pipeline
@Language.component('envo_extractor')
def envo_extractor(doc):
    matches = matcher(doc)
    spans = [Span(doc, start, end, label='ENVO') for matchId, start, end in matches]
    doc.ents = spans
    for i, span in enumerate(spans):
        span._.set("has_envo_ids", True)
        for token in span:
            token._.set("is_envo_term", True)
            token._.set("envo_id", terms[span.text.lower()]["id"])
            token._.set("category", terms[span.text.lower()]["category"])
    return doc
# Setter function for doc level
def has_envo_ids(self, tokens):
    return any([t._.get("is_envo_term") for t in tokens])
##EDIT: #################################################################
def resolve_substrings(matcher, doc, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    entity = Span(doc, start, end, label="ENVO")
    doc.ents += (entity,)
    print(entity.text)
#########################################################################
Implement the custom pipeline
nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
#### EDIT: Added 'on_match' rule ################################
matcher.add("ENVO", None, *patterns, on_match=resolve_substrings)
nlp.add_pipe('envo_extractor', after='ner')
and the pipeline looks like this
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7fac00c03bd0>),
('tagger', <spacy.pipeline.tagger.Tagger at 0x7fac0303fcc0>),
('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7fac02fe7460>),
('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7fac02f234c0>),
('envo_extractor', <function __main__.envo_extractor(doc)>),
('attribute_ruler',
<spacy.pipeline.attributeruler.AttributeRuler at 0x7fac0304a940>),
('lemmatizer',
<spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7fac03068c40>)]
Set extensions
# Set extensions to tokens, spans and docs
Token.set_extension('is_envo_term', default=False, force=True)
Token.set_extension("envo_id", default=False, force=True)
Token.set_extension("category", default=False, force=True)
Doc.set_extension("has_envo_ids", getter=has_envo_ids, force=True)
Doc.set_extension("envo_ids", default=[], force=True)
Span.set_extension("has_envo_ids", getter=has_envo_ids, force=True)
Now when I run the text 'tissue culture', it throws me an error:
nlp('tissue culture')
ValueError: [E1010] Unable to set entity information for token 0 which is included in more than one span in entities, blocked, missing or outside.
I know why the error occurred. It is because there are 2 entries for the 'tissue culture' phrase in the ENVO database as shown below:
Ideally I'd expect the appropriate CURIE to be tagged depending on the phrase that was present in the text. How do I address this error?
My SpaCy Info:
============================== Info about spaCy ==============================
spaCy version 3.0.5
Location *irrelevant*
Platform macOS-10.15.7-x86_64-i386-64bit
Python version 3.9.2
Pipelines en_core_web_sm (3.0.0)
It might be a little late by now but, to complement Sofie VL's answer a bit for anyone still interested, here is what I (another spaCy newbie, lol) did to get rid of overlapping spans:
import spacy
from spacy.util import filter_spans
# [Code to obtain the overlapping spans]...
# 'ents' should be a list of Span objects, e.g. spans covering:
# ["Carolina", "North Carolina"]
pat_orig = len(ents)
filtered = filter_spans(ents)  # THIS DOES THE TRICK
pat_filt = len(filtered)
doc.ents = filtered
print("\nCONVERSION REPORT:")
print("Original number of patterns:", pat_orig)
print("Number of patterns after overlapping removal:", pat_filt)
It is important to mention that I am using the most recent version of spaCy at this date, v3.1.1. Additionally, this will only work if you do not actually mind overlapping spans being removed; if you do, then you might want to give this thread a look. More info regarding 'filter_spans' here.
Best regards.
Since spaCy v3, you can use doc.spans to store entities that may be overlapping. This is not supported by doc.ents.
So you have two options:
Implement an on_match callback that filters out the results of the matcher before you use them to set doc.ents. From a quick glance at your code (and the later edits), I don't think resolve_substrings is actually resolving conflicts? Ideally, the on_match function should check whether there are conflicts with existing ents and decide which of them to keep.
Use doc.spans instead of doc.ents if that works for your use case (see the sketch below).
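As a rough illustration of the second option (a minimal sketch, not the asker's pipeline; the span key 'envo' and the example spans are made up):
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("growth of tissue culture cells")

# Two overlapping candidate entities; the key 'envo' is an arbitrary choice.
overlapping = [Span(doc, 2, 4, label="ENVO"),   # "tissue culture"
               Span(doc, 3, 5, label="ENVO")]   # "culture cells"

# doc.spans accepts overlapping spans, whereas doc.ents would raise an error.
doc.spans["envo"] = overlapping
print([(s.text, s.label_) for s in doc.spans["envo"]])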

Twitter Premium API Profile location operators profile_country: and profile_region: not working

I am using a premium account (not sandbox) for data collection.
I want to collect:
All tweets in English that contain ‘china’ or ‘chinese’ that are user geolocated to US and not geolocated at tweet level, excluding all retweets
All tweets in English that contain ‘china’ or ‘chinese’ that are user geolocated to ‘Minnesota’ and not geolocated at tweet level, excluding all retweets
The code is as follows:
from searchtweets import load_credentials, gen_rule_payload, ResultStream

premium_search_args = load_credentials('twitter_API.yaml',
                                       yaml_key='search_tweets_premium_api',
                                       env_overwrite=False)
# keywords for the search
# key word 1
keywords = '(China OR Chinese) lang:en profile_country:US -place_country:US -is:retweet'
# key word 2
keywords = '(China OR Chinese) lang:en -place_country:US profile_region:"Minnesota" -is:retweet'
# define search rule
rule = gen_rule_payload(keywords, from_date='2019-12-01',
                        to_date='2019-12-10', results_per_call=500)
# create result stream and print before start
rs = ResultStream(rule_payload=rule, max_results=1250000,
                  **premium_search_args)
My problems are that:
For the first one, a large portion of the results I get didn't satisfy the query. First, some don't have the Profile Geo enrichment, i.e. the user.derived.locations attribute is not in the user object. Second, when it is present, many don't have country code US, i.e. they are attributed to other countries.
For the second one, the results I get from this method are a smaller subset of the results I can get from 1). That is, when I filter all tweets user-geolocated to Minnesota (by user.derived.locations.region) out of profile_country:US, I get a larger sample than when using profile_region:"Minnesota". A considerable amount of data is missing with this method.
I have tried several times, but it seems that the user geolocation operators don't work the way I expect. Does anyone have any idea why this is the case? I would very much appreciate any answers/suggestions/comments.
Thank you!

Many inputs to one output, access wildcards in input files

Apologies if this is a straightforward question; I couldn't find anything in the docs.
Currently my workflow looks something like this: I'm taking a number of input files created as part of this workflow and summarizing them.
Is there a way to avoid the manual regex step to parse the wildcards out of the filenames?
I thought about an "expand" of cross_ids and config["chromosomes"], but I'm unsure how to guarantee consistent order.
rule report:
    output:
        table="output/mendel_errors.txt"
    input:
        files=expand("output/{chrom}/{cross}.in", chrom=config["chromosomes"], cross=cross_ids)
    params:
        req="h_vmem=4G",
    run:
        df = pd.DataFrame(index=range(len(input.files)), columns=["stat", "chrom", "cross"])
        for i, fn in enumerate(input.files):
            # open fn / make calculations etc // stat =
            # manual regex of filename to get chrom, cross // chrom, cross =
            df.loc[i] = stat, chrom, cross
This seems a bit awkward, since this information must be available in the environment somewhere.
(via Johannes Köster on the google group)
To answer your question:
expand() uses itertools.product from the standard library. Hence, you could write
from itertools import product
product(config["chromosomes"], cross_ids)
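Since expand() enumerates the wildcards in that same order, one way to avoid the regex step (a minimal sketch for the run block, not tested against the actual Snakefile; compute_stat is a placeholder for the per-file calculation) is to zip the product with input.files:
from itertools import product
import pandas as pd

# Hypothetical sketch: expand() lists chrom/cross combinations in the same
# order as product(), so zipping the two recovers the wildcards per file
# without any filename parsing.
rows = []
for (chrom, cross), fn in zip(product(config["chromosomes"], cross_ids), input.files):
    stat = compute_stat(fn)          # placeholder for the per-file calculation
    rows.append((stat, chrom, cross))
df = pd.DataFrame(rows, columns=["stat", "chrom", "cross"])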

Pull info from datapool, increment a value and store the datapool

The application I test has some areas where it requires unique data. Specifically, the application will generate a request number that can only be used once. After my test runs I must manually update my datapool reference for this number. Is there any way, using Java, that I can get the information stored in my datapool, increase the value by one, and then save the data back to the datapool? This way I can keep RFT in sync with my application in regard to this number.
Here is an example of how to read a value from the datapool, increment it by 1, and save it back to the datapool. It is adapted from the book Software Test Engineering with IBM Rational Functional Tester; the original source code is from chapter 5 (and can be downloaded from the book's homepage).
// some imports
import org.eclipse.hyades.edit.datapool.IDatapoolCell;
import org.eclipse.hyades.edit.datapool.IDatapoolEquivalenceClass;
import org.eclipse.hyades.execution.runtime.datapool.IDatapool;
import org.eclipse.hyades.execution.runtime.datapool.IDatapoolRecord;

// read the current value from the datapool and increment it
int value = dpInt("value");
value++;

// load the datapool file from the project's datastore for editing
java.io.File dpFile = new java.io.File((String) getOption(IOptionName.DATASTORE), "SomeDatapool.rftdp");
IDatapool dp = dpFactory().load(dpFile, true);

// navigate to the first record's first cell and write the new value
IDatapoolEquivalenceClass equivalenceClass = (IDatapoolEquivalenceClass) dp.getEquivalenceClass(dp.getDefaultEquivalenceClassIndex());
IDatapoolRecord record = equivalenceClass.getRecord(0);
IDatapoolCell cell = (IDatapoolCell) record.getCell(0);
cell.setCellValue(value);

// save the modified datapool back to disk
DatapoolFactory factory = DatapoolFactory.get();
factory.save((org.eclipse.hyades.edit.datapool.IDatapool) dp);
I think it is quite a lot of code to simply change one value; maybe it is easier to use some other method, like writing the value to a normal text file.