How to get the contributors you've coincided editing the most in Wikipedia - sql

I'm doing a gamification web app to help Wikimedia's community health.
I want to find what editors have edited the same pages as 'Jake' the most in the last week or 100 last edits or something like that.
I know my query, but I can't figure out what tables I need because the Wikimedia DB layout is a mess.
So, I want to obtain something like
Username
Occurrences
Pages
Mikey
13
Obama,..
So the query would be something like (I'm accepting suggestions):
Get the pages that the user 'Jake' has edited in the last week.
Get the contributors of that page in last week.
For each of these contributors, get the pages they have edited in the last week and see if they match with the pages 'Jake' has edited and count them.
I've tried doing that something simpler in Pywikibot, but it's very, very slow (20secs for the last 500 contributions of Jake).
I only get the edited pages and get the contributors of that page and just count them and it's very slow.
My pywikibot code is:
site = Site(langcode, 'wikipedia')
user = User(site, username)
contributed_pages = set()
for page, oldid, ts, comment in user.contributions(total=100, namespaces=[0]):
contributed_pages.add(page)
return get_contributor_ocurrences(contributed_pages,site, username)
And the function
def get_contributor_ocurrences(contributed_pages, site,username):
contributors = []
for page in contributed_pages:
for editor in page.contributors():
if APISite.isBot(self= site,username=editor) or editor==username:
continue
contributors.append(editor)
return Counter(contributors)
PS: I have access to DB replicas, which I guess are way faster than Wikimedia API or Pywikibot

You can filter the data to be retrieved with timestamps parameters. This decreases the needed time a lot. Refer the documentation for their usage. Here is a code snippet to get the data with Pywikibot using time stamps:
from collections import Counter
from datetime import timedelta
import pywikibot
from pywikibot.tools import filter_unique
site = pywikibot.Site()
user = pywikibot.User(site, username) # username must be a string
# Setup the Generator for the last 7 days.
# Do not care about the timestamp format if using pywikibot.Timestamp
stamp = pywikibot.Timestamp.now() - timedelta(days=7)
contribs = user.contributions(end=stamp)
contributors= []
# filter_unique is used to remove duplicates.
# The key uses the page title
for page, *_ in filter_unique(contribs, key=lambda x: str(x[0])):
# note: editors is a Counter
editors = page.contributors(endtime=stamp)
print('{:<35}: {}'.format(page.title(), editors))
contributors.extend(editors.elements())
total = Counter(contributors)
This prints a list of pages and for each page it Shows the editors and their contribution counter of the given time range. Finally total should have the same content as your get_contributor_ocurrences functions above.
It requires some additional work to get the table you mentioned above.

Related

Twitter Premium API Profile location operators profile_country: and profile_region: not working

I am using premium account (not sandbox) for data collection.
I want to collect:
All tweets in English that contain ‘china’ or ‘chinese’ that are user geolocated to US and not geolocated at tweet level, excluding all retweets
All tweets in English that contain ‘china’ or ‘chinese’ that are user geolocated to ‘Minnesota’ and not geolocated at tweet level, excluding all retweets
The code is as follows:
premium_search_args = load_credentials('twitter_API.yaml',
yaml_key ='search_tweets_premium_api', env_overwrite=False)
# keywords for the search
# key word 1
keywords = '(China OR Chinese) lang:en profile_country:US -place_country:US -is:retweet'
# key word 2
keywords = '(China OR Chinese) lang:en -place_country:US profile_region:"Minnesota" -is:retweet'
# define search rule
rule = gen_rule_payload(keywords,from_date='2019-12-01',
to_date='2019-12-10',results_per_call=500)
# create result stream and print before start
rs = ResultStream(rule_payload=rule, max_results=1250000,
**premium_search_args)
My problems are that:
For the first one, a large portion of the results I get didn’t satisfy the query. First, some don’t have Profile Geo enrichment, i.e. user.derived.locations attribute is not in the user object. Second, if it is, a lot don’t have country code US, i.e. they are identified to other countries.
For the second one, the result I get from this method is a smaller subset of the results I can get from 1). That is, when I filter all tweets user geolocated to Minnesota (by user.derived.locations.region) from profile_country:US, it gives a larger sample than using profile_region:“Minnesota”. A considerable amount of data is missing using this method.
I have tried several times but it seems that user geolocation operators don’t work exactly what I want. Does anyone has any idea why this is the case? I would very much appreciate any answers/suggestions/comments.
Thank you!

Use API to gather statistics on my followers

I am very new to this and would like to know how to start gathering statistics on my followers as I am currently growing my follower base. I am subscribed to several statistic tracking apps but none are really good.
I wish to track things such as:
Follower count by Location
Frequency distribution of followers and tags
Follower growth rate by Hour, Day, week, etc..
Follower Loss
Is this at all possible using APIs? Can anyone tell me how to get started?
There is no direct API call to get follower growth by hour and week, you have to get all followers every hour and store it in database and analyze for growth or loss every hour compared to previous hour and save it on the server.
You cannot get location of followers from API, you can may be estimate the location by checking for location in bio or analyzing all user posts and finding most posted location (this is expensive on API side and will have to make a lot of API calls to get analyze)
Yes, all this is possible to do using API, but it is a lot of work on backend, so if some service does this, it will cost you money cause they cannot do it for free, my guess is that you have checked all free or cheap services and they cannot do all this analysis for cheap.
You can get a broad breakdown of follower count in Google Sheets. This doesn't require API access so you won't get all of the data you are looking for, such as GEO. But, if you would like to see your follower increase by the hour, do this -
Open up a new Google Sheet
Go to Tools > Script Editor
Name your script 'IGFollowers'
In the code box, copy and paste this code below, but make sure to write this replace 'AccountName' with your username
var sheetName = "IGFollowers";
var instagramAccountName = "AccountName";
function insertFollowerCount() {
var ss = SpreadsheetApp.getActiveSpreadsheet();
var sheet = ss.getSheetByName("IGFollowers");
sheet.appendRow([Utilities.formatDate(new Date(), "PST", "yyyy-MM-dd"),
Utilities.formatDate(new Date(), "PST", "hh:mm"),
getInstagramFollowerCount(this.instagramAccountName)]);
};
function getInstagramFollowerCount(username) {
var url = "https://www.instagram.com/" + username + "/?__a=1";
var response = UrlFetchApp.fetch(url).getContentText();
return JSON.parse(response).user.followed_by.count;
}
Go to Run > InsertFollowerCount
NOTE: You may need to do a bit of formatting with the main Google Sheet, but this will get you some very long columns showing an increase in followers by the hour.

How to use Bioproject ID, for example, PRJNA12997, in biopython?

I have an Excel file in which are given more then 2000 organisms, where each one of them has a Bioproject ID associated (like PRJNA12997). The idea is to use these IDs to get the sequence for a later multiple alignment with other five sequences that I have in a text file.
Can anyone help me understand how I can do this using biopython? At least the part with the bioproject ID.
You can first get the info using Bio.Entrez:
from Bio import Entrez
Entrez.email = "Your.Name.Here#example.org"
# This call to efetch fails sometimes with a 400 error.
handle = Entrez.efetch(db="bioproject", id="PRJNA12997")
I've been trying, and Entrez.read(handle) doesn't seems to work. But if you do record_xml = handle.read() you'll get the XML entry for this record. In this XML you can get the ID for the organism, in this case 12997.
handle = Entrez.esearch(db="nuccore", term="12997[BioProject]")
search_results = Entrez.read(handle)
Now you can efecth from your search results. At this point you should use Biopython to parse whatever you will get in the efetch step, playing with the rettype http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/
for result in search_results["IdList"]:
entry = Entrez.efetch(db="nuccore", id=result, rettype="fasta")
this_seq_in_fasta = entry.read()

Get ALL tweets, not just recent ones via twitter API (Using twitter4j - Java)

I've built an app using twitter4j which pulls in a bunch of tweets when I enter a keyword, takes the geolocation out of the tweet (or falls back to profile location) then maps them using ammaps. The problem is I'm only getting a small portion of tweets, is there some kind of limit here? I've got a DB going collecting the tweet data so soon enough it will have a decent amount, but I'm curious as to why I'm only getting tweets within the last 12 hours or so?
For example if I search by my username I only get one tweet, that I sent today.
Thanks for any info!
EDIT: I understand twitter doesn't allow public access to the firehose.. more of why am I limited to only finding tweets of recent?
You need to keep redoing the query, resetting the maxId every time, until you get nothing back. You can also use setSince and setUntil.
An example:
Query query = new Query();
query.setCount(DEFAULT_QUERY_COUNT);
query.setLang("en");
// set the bounding dates
query.setSince(sdf.format(startDate));
query.setUntil(sdf.format(endDate));
QueryResult result = searchWithRetry(twitter, query); // searchWithRetry is my function that deals with rate limits
while (result.getTweets().size() != 0) {
List<Status> tweets = result.getTweets();
System.out.print("# Tweets:\t" + tweets.size());
Long minId = Long.MAX_VALUE;
for (Status tweet : tweets) {
// do stuff here
if (tweet.getId() < minId)
minId = tweet.getId();
}
query.setMaxId(minId-1);
result = searchWithRetry(twitter, query);
}
Really it depend on which API system you are using. I mean Streaming or Search API. In the search API there is a parameter (result_type) that is an optional parameter. The values of this parameter might be followings:
* mixed: Include both popular and real time results in the response.
* recent: return only the most recent results in the response
* popular: return only the most popular results in the response.
The default one is the mixed one.
As far as I understand, you are using the recent one, that is why; you are getting the recent set of tweets. Another issue is getting low volume of tweets that have the geological information. Because there are very few users added the geological information to their profile, you are getting very few tweets.

Rails show different object every day

I want to match my user to a different user in his/her community every day. Currently, I use code like this:
#matched_user = User.near(#user).order("RANDOM()").first
But I want to have a different #matched_user on a daily basis. I haven't been able to find anything in Stack or in the APIs that has given me insight on how to do it. I feel it should be simpler than having to resort to a rake task with cron. (I'm on postgres.)
Whenever I find myself hankering for shared 'memory' or transient state, I think to myself "this is what (distributed) caches were invented for".
#matched_user = Rails.cache.fetch(#user.cache_key + '/daily_match', expires_in: 1.day) {
User.near(#user).order("RANDOM()").first
}
NOTE: While specifying a TTL for cache entry tells Rails/the cache system to try and keep that value for the given timeframe, there's NO guarantee that it will. In particular, a cache that aggressively tries to reclaim memory may expire an entry well before its desired expires_in time.
For this particular use case, it shouldn't be a big deal but in cases where the business/domain logic demands periodically generated values that are durable then you really have to factor that into your database.
How about using PostgreSQL's SETSEED function? I used the date to seed so that every day the seed will change, but within a day, the seed will be consistent.:
User.connection.execute "SELECT SETSEED(#{Date.today.strftime("%y%d%m").to_i/1000000.0})"
#matched_user = User.near(#user).order("RANDOM()").first
You may want to seed a random value after using this so that any future calls to random aren't biased:
random = User.connection.execute("SELECT RANDOM()").to_a.first["random"]
# Same code as above:
User.connection.execute "SELECT SETSEED(#{Date.today.strftime("%y%d%m").to_i/1000000.0})"
#matched_user = User.near(#user).order("RANDOM()").first
# Use random value before seed to make new seed:
User.connection.execute "SELECT SETSEED(#{random})"
I have split these steps in different sections just for readability. you can optimise query later.
1) Find all user records till today morning. so that the count will freeze.
usrs_till_today_morning = User.where("created_at <?", DateTime.now.in_time_zone(Time.zone).beginning_of_day)
2) Pluck all ID's
user_ids = usr_till_today_morning.pluck(:id)
3) Today date it will be a range (1..30) but will remain constant throughout the day.
day_today = Time.now.day
4) Select the same ID for the day
todays_user_id = user_ids[day_today % user_ids.count]
#matched_user = User.find(todays_user_id)
So it will give you random user records by maintaining same record throughout the day!!