Twitter Premium API Profile location operators profile_country: and profile_region: not working

I am using a premium account (not sandbox) for data collection.
I want to collect:
All tweets in English that contain 'china' or 'chinese', that are user-geolocated to the US but not geolocated at the tweet level, excluding all retweets.
All tweets in English that contain 'china' or 'chinese', that are user-geolocated to Minnesota but not geolocated at the tweet level, excluding all retweets.
The code is as follows:
from searchtweets import load_credentials, gen_rule_payload, ResultStream

premium_search_args = load_credentials('twitter_API.yaml',
                                       yaml_key='search_tweets_premium_api',
                                       env_overwrite=False)
# keywords for the search
# keyword set 1: user profile geolocated to the US, no tweet-level US place
keywords = '(China OR Chinese) lang:en profile_country:US -place_country:US -is:retweet'
# keyword set 2: user profile geolocated to Minnesota, no tweet-level US place
keywords = '(China OR Chinese) lang:en -place_country:US profile_region:"Minnesota" -is:retweet'
# define the search rule
rule = gen_rule_payload(keywords, from_date='2019-12-01',
                        to_date='2019-12-10', results_per_call=500)
# create the result stream and print it before starting
rs = ResultStream(rule_payload=rule, max_results=1250000,
                  **premium_search_args)
# collect the tweets
tweets = list(rs.stream())
My problems are that:
For the first one, a large portion of the results I get don't satisfy the query. First, some don't have the Profile Geo enrichment, i.e. the user.derived.locations attribute is not present in the user object. Second, even when it is present, many don't have country code US, i.e. they are resolved to other countries.
For the second one, the results I get with this method are a smaller subset of the results I can get from 1). That is, when I filter all tweets user-geolocated to Minnesota (by user.derived.locations.region) out of the profile_country:US results, I get a larger sample than when using profile_region:"Minnesota". A considerable amount of data is missing with this method.
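For reference, this is roughly the client-side filter I mean; a minimal sketch that assumes the tweets have been collected into a list of parsed payloads (tweets) with the Profile Geo enrichment under user.derived.locations:

def profile_region(tweet):
    """Return the derived profile region, or None if the enrichment is missing."""
    derived = tweet.get('user', {}).get('derived', {})
    locations = derived.get('locations') or []
    return locations[0].get('region') if locations else None

# tweets is the list collected from the ResultStream above
minnesota_tweets = [t for t in tweets if profile_region(t) == 'Minnesota']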
I have tried several times, but it seems that the user geolocation operators don't work exactly as I expect. Does anyone have any idea why this is the case? I would very much appreciate any answers/suggestions/comments.
Thank you!

Related

Extract text from documents like PAN and Aadhaar

I am using the Google Cloud Vision API to extract text from Aadhaar and PAN cards. How can I get the exact user details, like name, father's name, and address?
Raw Data
ଭାରତ ସରକାର
Government of India
ଜିତ୍ୟାନନ୍ଦ ଖେମୁକୁ
NITYANANDA KHEMUDU
ପିତା : ସୀତାରାମ ଖେମୁକୁ
Father: Sitaram Khemudu
ଜନ୍ମ ତାରିଖ / DOB : 01.07.1999
ପୁରୁଷ / Male
ମୋ ଆଧାର, ମୋ ପରିଚୟ
I have built 5-6 OCR pipelines to date (Aadhaar, PAN, ITR, driving licence, etc.) using the Google Cloud Vision API. I think you are looking for a response like:
{"pan_card_no": "ECXXXXXX123",
 "name": "fshksj"}
To get such a response you need to build your own logic. Here are some basic rules I can share with you:
Perform OCR on your document using the Google Cloud Vision API and store the response in one list (Google returns the text roughly line by line).
As in the case above, if you want to grab the DOB you can build logic like: if "DOB" appears in an item, grab the numeric value from it.
To get the name, drop the unnecessary items from the list using conditions like (if "India" in i) or (if i.isdigit()); likewise you can drop the other unneeded items from the main list until the name remains.
To grab the address: about 95% of the time the address ends with a pincode, so treat the pincode as the last element of the address, look for an "Address"-like keyword, and add all the elements from the "Address" keyword index to the pincode index (this is easy to do with a list). To validate the pincode you can use a library like Pyzipin.
There are many more conditions you can use; the ones above are just the basics. If you need any specific logic, you can ask me. A rough sketch of these rules is shown below.
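This is a rough, illustrative sketch only; the lines argument, the keyword list and the regular expressions are assumptions, not taken from the answer above:

import re

def parse_id_card(lines):
    """Very rough extraction following the rules above; `lines` is the
    line-by-line OCR output from the Vision API (illustrative only)."""
    details = {}

    # Rule: if "DOB" appears in a line, grab the date value from it
    for line in lines:
        if 'DOB' in line:
            match = re.search(r'\d{2}[./-]\d{2}[./-]\d{4}', line)
            if match:
                details['dob'] = match.group()

    # Rule: drop obviously irrelevant lines; whatever survives is a name candidate
    drop_words = ('India', 'Government', 'Father', 'Male', 'Female', 'DOB')
    candidates = [l for l in lines
                  if not any(w in l for w in drop_words)
                  and not any(ch.isdigit() for ch in l)
                  and l.isascii()]          # keep only English (ASCII) lines
    if candidates:
        details['name'] = candidates[0]

    # Rule: the address runs from an "Address"-like keyword to the 6-digit pincode
    start = next((i for i, l in enumerate(lines) if 'Address' in l), None)
    end = next((i for i, l in enumerate(lines) if re.search(r'\b\d{6}\b', l)), None)
    if start is not None and end is not None and end >= start:
        details['address'] = ' '.join(lines[start:end + 1])

    return details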

How to get the contributors you've coincided with the most when editing Wikipedia

I'm doing a gamification web app to help Wikimedia's community health.
I want to find which editors have edited the same pages as 'Jake' the most in the last week, or in the last 100 edits, or something like that.
I know my query, but I can't figure out what tables I need because the Wikimedia DB layout is a mess.
So, I want to obtain something like
Username | Occurrences | Pages
Mikey    | 13          | Obama, ...
So the query would be something like (I'm accepting suggestions):
Get the pages that the user 'Jake' has edited in the last week.
Get the contributors of that page in last week.
For each of these contributors, get the pages they have edited in the last week and see if they match with the pages 'Jake' has edited and count them.
I've tried doing something simpler in Pywikibot, but it's very, very slow (about 20 seconds for the last 500 contributions of Jake).
I only get the edited pages, then get the contributors of each of those pages and count them, and even that is very slow.
My pywikibot code is:
from collections import Counter
from pywikibot import Site, User
from pywikibot.site import APISite

site = Site(langcode, 'wikipedia')
user = User(site, username)
contributed_pages = set()
for page, oldid, ts, comment in user.contributions(total=100, namespaces=[0]):
    contributed_pages.add(page)
return get_contributor_ocurrences(contributed_pages, site, username)  # (this snippet runs inside a larger function)
And the function
def get_contributor_ocurrences(contributed_pages, site, username):
    contributors = []
    for page in contributed_pages:
        for editor in page.contributors():
            if APISite.isBot(self=site, username=editor) or editor == username:
                continue
            contributors.append(editor)
    return Counter(contributors)
PS: I have access to DB replicas, which I guess are way faster than Wikimedia API or Pywikibot
You can filter the data to be retrieved with the timestamp parameters. This decreases the time needed a lot. Refer to the documentation for their usage. Here is a code snippet to get the data with Pywikibot using timestamps:
from collections import Counter
from datetime import timedelta

import pywikibot
from pywikibot.tools import filter_unique

site = pywikibot.Site()
user = pywikibot.User(site, username)  # username must be a string

# Set up the generator for the last 7 days.
# Do not care about the timestamp format if using pywikibot.Timestamp.
stamp = pywikibot.Timestamp.now() - timedelta(days=7)
contribs = user.contributions(end=stamp)

contributors = []
# filter_unique is used to remove duplicates.
# The key uses the page title.
for page, *_ in filter_unique(contribs, key=lambda x: str(x[0])):
    # note: editors is a Counter
    editors = page.contributors(endtime=stamp)
    print('{:<35}: {}'.format(page.title(), editors))
    contributors.extend(editors.elements())

total = Counter(contributors)
This prints a list of pages and, for each page, shows the editors and their contribution counts within the given time range. Finally, total should have the same content as your get_contributor_ocurrences function above.
It requires some additional work to get the table you mentioned above; a rough sketch of that step follows.
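One rough way to get the table (a sketch, not part of the original answer) is to also record which shared pages each editor appears on while looping, and then print total together with those page sets; pages_by_editor is a hypothetical addition to the loop above:

from collections import defaultdict

# Hypothetical: fill this inside the loop shown above, e.g.
#     for editor in editors:
#         pages_by_editor[editor].add(page.title())
pages_by_editor = defaultdict(set)

print('{:<20}{:>12}   {}'.format('Username', 'Occurrences', 'Pages'))
for editor, count in total.most_common(10):
    print('{:<20}{:>12}   {}'.format(
        editor, count, ', '.join(sorted(pages_by_editor[editor]))))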

Cypher Query - conditionally return a relationship

I have been trying to figure out how to perform a specific query for quite a while, and have been through several attempts to no avail. Below is an example to illustrate the problem:
There are two types of nodes, users and documents. Users can have a collaborates_with relationship between them, and can be related to documents as can_edit, created, or have no relationship at all.
Now what I would like to do is perform a query that returns all documents that fit a set of search criteria (say, created within the last week), AND, if a document was created by a collaborator of a specific user, also returns that relationship.
To fetch the documents and the creator of each document, the query is pretty straightforward:
MATCH (doc:document)<-[rel:created]-(u1:user)
WHERE doc.createddate > TIMESTAMP_FOR_ONE_WEEK_AGO
RETURN doc, u1
where TIMESTAMP_FOR_ONE_WEEK_AGO is just the unix timestamp corresponding to right now minus 7*24*60*60*1000.
The difficulty comes when trying to conditionally return the relationship with the current user.
I have played with CASE statements and OPTIONAL MATCH, but nothing seems to get what I'm looking for. One example of my attempts:
MATCH (doc:document)<-[rel:created]-(u1:user)
WHERE doc.createddate > TIMESTAMP_FOR_ONE_WEEK_AGO
WITH doc, u1
MATCH (u1)-[rel:collaborates_with]-(me:user)
WHERE me.username = MY_USERNAME
RETURN doc, rel
This, however, only returns the documents that have been created by one of my collaborators. Instead, I'd like it to return ALL of the documents fitting the search, and only return the relationship if it exists.
Has anyone been able to perform something like this?
NOTE: This question is similar, but not quite what I'm running into.
OPTIONAL MATCH should do it:
MATCH (doc:document)<-[rel:created]-(u1:user)
WHERE doc.createddate > TIMESTAMP_FOR_ONE_WEEK_AGO
WITH doc, u1 // find all docs that satisfy the search conditions
OPTIONAL MATCH (u1)-[rel:collaborates_with]-(me:user) // optionally see if the creator collaborates with me
WHERE me.username = MY_USERNAME
RETURN doc, rel
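A minimal sketch of running this query with a parameter from Python, assuming the official neo4j driver; the URI, credentials, timestamp computation and username value are placeholders, not from the original answer:

import time
from neo4j import GraphDatabase

# Placeholder connection details; adjust to your setup.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (doc:document)<-[:created]-(u1:user)
WHERE doc.createddate > $weekAgo
OPTIONAL MATCH (u1)-[rel:collaborates_with]-(me:user)
WHERE me.username = $username
RETURN doc, rel
"""

# Unix timestamp in milliseconds for one week ago, as in the question.
week_ago = int(time.time() * 1000) - 7 * 24 * 60 * 60 * 1000

with driver.session() as session:
    for record in session.run(query, weekAgo=week_ago, username="MY_USERNAME"):
        print(record["doc"], record["rel"])  # rel is None when there is no collaboration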

Foursquare Venue API & Number of Results, in a more efficient way?

I'd like to ask if there is a more efficient way to get more than 50 results besides these options?
How do I get more locations?
Foursquare Venue API & Number of Results
and this one, which is for the old API: Foursquare API nearByVenue service issue.
I'm using the current Foursquare API for venue search: https://developer.foursquare.com/docs/venues/search .
What I'd like is something like an offset option, in order to get more results, but it seems there is no such option.
Is there an alternative solution?
Thank you in advance.
You should use venues/explore with offset and limit as parameters.
venues/explore returns totalResults, and you can use that value to calculate how many pages you need to paginate through.
For example, assume totalResults is 90 (pay attention to the offset and limit parameter values).
In the first request:
https://api.foursquare.com/v2/venues/explore?client_id=client_id&client_secret=client_secret&v=20150825&near=city_name&categoryId=category_id&intent=browse&offset=0&limit=30
In the second request:
https://api.foursquare.com/v2/venues/explore?client_id=client_id&client_secret=client_secret&v=20150825&near=city_name&categoryId=category_id&intent=browse&offset=30&limit=30
In the third request:
https://api.foursquare.com/v2/venues/explore?client_id=client_id&client_secret=client_secret&v=20150825&near=city_name&categoryId=category_id&intent=browse&offset=60&limit=30
For 90 results you can get all the records with the above three requests. A sketch of this loop in Python follows.
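A minimal sketch of that pagination, assuming the same explore endpoint; the credentials and search parameters are the placeholders from the example URLs above:

import requests

url = 'https://api.foursquare.com/v2/venues/explore'
# Placeholder credentials and search parameters, as in the URLs above.
params = {
    'client_id': 'client_id',
    'client_secret': 'client_secret',
    'v': '20150825',
    'near': 'city_name',
    'categoryId': 'category_id',
    'intent': 'browse',
    'offset': 0,
    'limit': 30,
}

venues = []
# The first call tells us totalResults; keep paging until we have them all.
# (Assumes the response contains at least one group of items.)
response = requests.get(url, params=params).json()['response']
total = response['totalResults']
venues.extend(item['venue'] for item in response['groups'][0]['items'])

while params['offset'] + params['limit'] < total:
    params['offset'] += params['limit']
    response = requests.get(url, params=params).json()['response']
    venues.extend(item['venue'] for item in response['groups'][0]['items'])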
There is actually another option not mentioned here (not pagination though):
using the (experimental?) categoryId filter.
You can search the same point (ll) a few times with different category ids, which gives you more results (with some duplicates, since venues can have more than one category).
So you can search for 'Food' venues and 'Nightlife' venues at the same place, getting 100 results instead of 50. As said, it is 100 results, but not 100 unique results; there may be duplicates. I think that is more efficient than trying to play around with the browse radius.
Not pagination, but it will give a lot more results than a normal search, usually enough even in urban areas.
But yeah, extracting more than 50 results for a single point is simply not possible at the moment, although it would be nice :) A sketch of this approach follows.
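A minimal sketch of that approach against the venues/search endpoint, removing duplicates by keying on the venue id; the credentials, the point and the category ids are placeholders (look the real ids up in the Foursquare category tree):

import requests

url = 'https://api.foursquare.com/v2/venues/search'
base_params = {
    'client_id': 'client_id',          # placeholder
    'client_secret': 'client_secret',  # placeholder
    'v': '20150825',
    'll': '40.7,-74.0',                # placeholder point
    'intent': 'browse',
    'limit': 50,
}

# Placeholder ids for the 'Food' and 'Nightlife' top-level categories.
category_ids = ['FOOD_CATEGORY_ID', 'NIGHTLIFE_CATEGORY_ID']

unique_venues = {}
for category_id in category_ids:
    params = dict(base_params, categoryId=category_id)
    venues = requests.get(url, params=params).json()['response']['venues']
    for venue in venues:
        unique_venues[venue['id']] = venue  # keyed on id, so duplicates collapse

print(len(unique_venues), 'unique venues')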
Afraid not. Currently there is no pagination; in order to find more venues you need to move your search area around, as in the answers you highlighted. I agree, pagination would be handy though!
For the explore endpoint this worked for me: if the maximum number of results returned per call is, for instance, 100, just use offset=100 in the next call, which gives you the next 100 results starting from 100 (the offset). Iterate (e.g. using a while loop) and keep increasing the offset by 100 until you reach the total number of results (which the API returns as totalResults).
My first Stack Overflow post; I tried to answer as clearly as possible.
import requests
import pandas as pd

# Assumes CLIENT_ID, CLIENT_SECRET, VERSION and LIMIT are defined elsewhere.
def getNearbyVenues(neighborhoods, latitudes, longitudes, radius=500, ven_num=300):
    venues_list = []
    for name, lat, lng in zip(neighborhoods, latitudes, longitudes):
        i = 0
        while i < ven_num + 50:
            url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&offset={}&limit={}'.format(
                CLIENT_ID,
                CLIENT_SECRET,
                VERSION,
                lat,
                lng,
                radius,
                i,
                LIMIT)
            # make the GET request
            results = requests.get(url).json()['response']['groups'][0]['items']
            # keep only the relevant information for each nearby venue
            venues_list.append([(
                name,
                lat,
                lng,
                v['venue']['name'],
                v['venue']['location']['lat'],
                v['venue']['location']['lng'],
                v['venue']['categories'][0]['name']) for v in results])
            i = i + 50
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    print('Ok')
    return nearby_venues
The code above has worked perfectly for me, where the ven_num variable is the desired number of venues to fetch for a given neighborhood; a hypothetical call is shown below.
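A hypothetical call for a single neighborhood (the name and coordinates are placeholders, and CLIENT_ID, CLIENT_SECRET, VERSION and LIMIT must already be defined, as the function expects):

# Hypothetical usage; the neighborhood name and coordinates are placeholders.
nearby = getNearbyVenues(neighborhoods=['Downtown'],
                         latitudes=[40.7128],
                         longitudes=[-74.0060],
                         radius=500,
                         ven_num=150)
print(nearby.head())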

Need to extract information from free text, such as location, course, etc.

I need to write a text parser for the education domain that can extract information like institute, location, course, etc. from free text.
Currently I am doing it with Lucene; the steps are as follows:
Index all the data related to institutes, courses and locations.
Make shingles of the free text, search each shingle in the location, course and institute index directories, and then try to find out which part of the text represents a location, a course, etc.
With this approach I am missing a lot of cases, e.g. B.tech can be written as btech, b-tech or b.tech.
I want to know whether there is anything available that can handle all these kinds of cases. I have heard about LingPipe and GATE but don't know how efficient they are.
You definitely need GATE. GATE has two main, most frequently used features (among thousands of others): rules and dictionaries. Dictionaries (gazetteers in GATE's terms) allow you to put all possible variants like "B.tech", "btech" and so on into a single text file and let GATE find and mark them all. Rules (more precisely, JAPE rules) allow you to define patterns in text. For example, here's a pattern to catch MIT's postal address ("77 Massachusetts Ave., Building XX, Cambridge MA 02139"):
{Token.kind == number}(SP){Token.orth == uppercase}(SP){Lookup.majorType == avenue}(COMMA)(SP)
{Token.string == "Building"}(SP){Token.kind == number}(COMMA)(SP)
{Lookup.majorType == city}(SP){Lookup.majorType == USState}(SP){Token.kind == number}
where (SP) and (COMMA) are macros (just to make the text shorter), {Something} is an annotation, {Token.kind == number} is an annotation "Token" with feature "kind" equal to "number" (i.e. just a number in the text), and {Lookup} is an annotation that captures values from a dictionary (BTW, GATE already has dictionaries for things such as US cities). This is a quite simple example, but you should see how easily you can cover even very complicated cases.
I didn't use Lucene, but in your case I would leave the different forms of the same keyword as they are and just keep a link table or something similar. In this table I'd keep the relation between these different forms.
You may need to write a regular expression to cover each possible form of your vocabulary.
Be careful about your choice of analyzer/tokenizer, because a word like B.tech can easily be split into two different tokens (i.e. B and tech).
You may want to check out UIMA. Like LingPipe and GATE, this framework features text annotation, which is what you are trying to do. Here is a tutorial that will help you write an annotator for UIMA:
http://uima.apache.org/d/uimaj-2.3.1/tutorials_and_users_guides.html#ugr.tug.aae.developing_annotator_code
UIMA has add-ons, in particular one for Lucene integration.
You can try http://code.google.com/p/graph-expression/
Here is an example of address parsing rules:
GraphRegExp.Matcher Token = match("Token");
GraphRegExp.Matcher Country = GraphUtils.regexp("^USA$", Token);
GraphRegExp.Matcher Number = GraphUtils.regexp("^\\d+$", Token);
GraphRegExp.Matcher StateLike = GraphUtils.regexp("^([A-Z]{2})$", Token);
GraphRegExp.Matcher Postoffice = seq(match("BoxPrefix"), Number);
GraphRegExp.Matcher Postcode =
        mark("Postcode", seq(GraphUtils.regexp("^\\d{5}$", Token),
                             opt(GraphUtils.regexp("^\\d{4}$", Token))));
// mark(String, Matcher) -- means creating a chunk over the sub-matcher
GraphRegExp.Matcher streetAddress = mark("StreetAddress", seq(Number, times(Token, 2, 5).reluctant()));
// without new lines
streetAddress = regexpNot("\n", streetAddress);
GraphRegExp.Matcher City = mark("City", GraphUtils.regexp("^[A-Z]\\w+$", Token));

Chunker chunker = Chunkers.pipeline(
        Chunkers.regexp("Token", "\\w+"),
        Chunkers.regexp("BoxPrefix", "\\b(POB|PO BOX)\\b"),
        new GraphExpChunker("Address",
                seq(
                        opt(streetAddress),
                        opt(Postoffice),
                        City,
                        StateLike,
                        Postcode,
                        Country
                )
        ).setDebugString(true)
);
B.tech can be written as btech, b-tech or b.tech
Lucene will let you do fuzzy searches based on the Levenshtein Distance. A query for roam~ (note the ~) will find terms like foam and roams.
That might allow you to match the different cases.