Get ALL tweets, not just recent ones, via the Twitter API (using twitter4j - Java)

I've built an app using twitter4j that pulls in a bunch of tweets when I enter a keyword, takes the geolocation out of each tweet (or falls back to the profile location), then maps them using ammaps. The problem is that I'm only getting a small portion of the tweets. Is there some kind of limit here? I've got a DB collecting the tweet data, so soon enough it will have a decent amount, but I'm curious why I'm only getting tweets from the last 12 hours or so.
For example, if I search by my username I only get one tweet, which I sent today.
Thanks for any info!
EDIT: I understand Twitter doesn't allow public access to the firehose. My question is more: why am I limited to finding only recent tweets?

You need to keep redoing the query, setting maxId each time to one less than the lowest tweet id you have seen so far, until you get nothing back. You can also bound the search by date with setSince and setUntil.
An example:
Query query = new Query();
query.setCount(DEFAULT_QUERY_COUNT);
query.setLang("en");
// set the bounding dates (sdf is assumed to be a SimpleDateFormat using "yyyy-MM-dd")
query.setSince(sdf.format(startDate));
query.setUntil(sdf.format(endDate));
QueryResult result = searchWithRetry(twitter, query); // searchWithRetry is my function that deals with rate limits
while (result.getTweets().size() != 0) {
    List<Status> tweets = result.getTweets();
    System.out.print("# Tweets:\t" + tweets.size());
    // find the lowest id in this batch
    long minId = Long.MAX_VALUE;
    for (Status tweet : tweets) {
        // do stuff here
        if (tweet.getId() < minId)
            minId = tweet.getId();
    }
    // ask only for tweets older than everything seen so far
    query.setMaxId(minId - 1);
    result = searchWithRetry(twitter, query);
}

It really depends on which API you are using: the Streaming API or the Search API. The Search API has an optional parameter, result_type, which can take the following values:
* mixed: include both popular and real-time results in the response.
* recent: return only the most recent results in the response.
* popular: return only the most popular results in the response.
The default is mixed.
As far as I understand, you are using recent, which is why you are getting only the most recent tweets. Another issue is the low volume of tweets that carry geolocation information. Because very few users add location information to their profile, you are getting very few tweets.
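To make the parameter concrete, here is a minimal sketch that only builds the request URL for the v1.1 search endpoint with an explicit result_type. The helper name build_search_url is made up for illustration, and authentication and the HTTP call itself are left out.

```python
from urllib.parse import urlencode

# v1.1 search endpoint; OAuth signing is deliberately not shown here.
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"
VALID_RESULT_TYPES = {"mixed", "recent", "popular"}


def build_search_url(keyword, result_type="mixed", count=100):
    """Return the search URL for a keyword (hypothetical helper)."""
    if result_type not in VALID_RESULT_TYPES:
        raise ValueError("result_type must be mixed, recent or popular")
    params = {"q": keyword, "result_type": result_type, "count": count}
    return SEARCH_URL + "?" + urlencode(params)


print(build_search_url("#java", result_type="recent"))
```

Switching the second argument between the three values is enough to compare what each result type returns for the same keyword.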


How to get just the most recent of all documents

In Sanity Studio you get a nice list of the most recent version of all your documents: if there is a draft you get that, and if not you get the published one.
I need the same list for a few filters and scripts. The following GROQ does the job, but it is not very fast and it does not work in the new API (v2021-03-25).
*[
  _type == $type &&
  !defined(*[_id == "drafts." + ^._id])
]._id
A way around the breaking changes in the API is to use length() = 0 in place of !defined(), but that makes an already slow query 10-20x slower.
Does anyone know a way of making filters that consider only the latest version?
Edit: An example where I need this is a list of all documents without any categories. A normal filter shows a document if either the published version or the draft has no categories. So if you add categories but don't publish immediately, the document confusingly stays in the no-categories list. ,'-)
100 X improvement on API v2021-03-25 🥳
The only way I was able to solve this with speed was to first make a projection of the sub-query so it doesn't run once for every non-draft. Then I thought, why not project both sets and then figure out the overlap? That was even faster! It runs more than 10x faster than anything possible on API v1 and 100x faster than any of the suggestions for the new API.
{
  'drafts': *[ _type == $type && _id in path("drafts.**") ]._id,
  'published': *[ _type == $type && !(_id in path("drafts.**")) ]._id,
}
{
  'current': published[ !("drafts." + @ in ^.drafts) ] + drafts
}
First I get both drafts and non-drafts and "store" them in this projection, like a variable-😉-ish.
Then I start with my non-drafts (published).
And filter out any that have a counterpart in my drafts "variable".
Lastly I add all drafts to my list of filtered non-drafts.
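For readers who find the set logic easier to follow outside GROQ, the same steps can be sketched in plain Python. This is only an illustration of the overlap idea on made-up id lists, not something that runs against Sanity; the function name current_ids is hypothetical.

```python
def current_ids(drafts, published):
    """Published ids that have no draft counterpart, followed by all draft ids."""
    draft_set = set(drafts)  # set lookup keeps the overlap check cheap
    kept = [pid for pid in published if "drafts." + pid not in draft_set]
    return kept + list(drafts)


# Made-up ids: "a" has both versions, "b" is published only, "c" is draft only.
drafts = ["drafts.a", "drafts.c"]
published = ["a", "b"]
print(current_ids(drafts, published))  # ['b', 'drafts.a', 'drafts.c']
```

The published copy of "a" is dropped because its draft exists, which is exactly the "latest version only" list the question asks for.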
Overall I think you're on the right track. Some ideas to help you out:
* Drafts are always fresher and newer than published documents, so if a given doc's id is in path("drafts.**"), that's already the last updated one.
* Knowing the above allows you to skip the defined(*[_id == ...]) part of the query for drafts, speeding up your execution.
* As drafts are already included, we can exclude published documents that have a draft (defined(*[_id == "drafts." + ^._id][0])).
* Notice I added a [0] to the end of the sub-query to pick only the first element that matches. This will improve performance slightly.
* For getting only documents that have no categories, use count(categoriesField) < 1.
* Order documents with | order(_updatedAt desc) to get the freshest documents first.
* And paginate your request to reduce the payload and speed things up.
Here's a sample query applying these principles (I haven't run it, so you may have to make some adjustments):
*[
  _type == $type &&
  // Assuming you only want those without categories:
  count(categories) < 1 &&
  (
    // Is either a draft -> drafts are always fresher
    _id in path("drafts.**") ||
    // Or a published document with no draft
    !defined(*[_id == "drafts." + ^._id][0])
    // 👆 with the check above we're ensuring only
    // published documents run the expensive defined query
  )
]
// Order by last updated
| order(_updatedAt desc)
// Paginate for faster queries
[$paginationStart..$paginationEnd]
// Get only the _id, assuming that's what you want
._id
Hope this helps 🙌

Use API to gather statistics on my followers

I am very new to this and would like to know how to start gathering statistics on my followers as I am currently growing my follower base. I am subscribed to several statistic tracking apps but none are really good.
I wish to track things such as:
* Follower count by location
* Frequency distribution of followers and tags
* Follower growth rate by hour, day, week, etc.
* Follower loss
Is this at all possible using APIs? Can anyone tell me how to get started?
There is no direct API call to get follower growth by hour or week. You have to fetch all followers every hour, store them in a database, and compare each snapshot with the previous one to work out the growth or loss for that hour.
You cannot get the location of followers from the API. You can maybe estimate it by checking for a location in the bio, or by analyzing all of a user's posts and finding the most frequently posted location (this is expensive on the API side and takes a lot of API calls).
So yes, all of this is possible using the API, but it is a lot of work on the backend. If some service does this, it will cost you money because they cannot do it for free. My guess is that you have checked all the free or cheap services and they cannot do all this analysis cheaply.
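The snapshot comparison described above could look roughly like this. It is a sketch on made-up follower ids held in memory; fetching the ids from the API and storing them in a database are assumed to happen elsewhere, and diff_snapshots is a hypothetical name.

```python
from datetime import datetime


def diff_snapshots(previous_ids, current_ids):
    """Compare two follower-id snapshots and report gains and losses."""
    previous, current = set(previous_ids), set(current_ids)
    return {
        "gained": sorted(current - previous),
        "lost": sorted(previous - current),
        "net_change": len(current) - len(previous),
    }


# Made-up snapshots taken one hour apart.
snapshots = {
    datetime(2024, 1, 1, 9): [1, 2, 3],
    datetime(2024, 1, 1, 10): [2, 3, 4, 5],
}
times = sorted(snapshots)
print(diff_snapshots(snapshots[times[0]], snapshots[times[1]]))
```

Keeping the ids rather than just the count is what makes follower loss visible: a net change of zero can still hide one follower gained and one lost.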
You can get a broad breakdown of follower count in Google Sheets. This doesn't require API access, so you won't get all of the data you are looking for, such as GEO. But if you would like to see your follower increase by the hour, do this:
Open up a new Google Sheet
Go to Tools > Script Editor
Name your script 'IGFollowers' (the code looks up a sheet tab with this name, so rename the tab to match as well)
In the code box, paste the code below, making sure to replace 'AccountName' with your username
var sheetName = "IGFollowers";
var instagramAccountName = "AccountName";

// Append one row: date, time, current follower count
function insertFollowerCount() {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getSheetByName(sheetName);
  var now = new Date();
  sheet.appendRow([
    Utilities.formatDate(now, "PST", "yyyy-MM-dd"),
    Utilities.formatDate(now, "PST", "HH:mm"), // 24-hour clock, so morning and evening rows differ
    getInstagramFollowerCount(instagramAccountName)
  ]);
}

// Read the follower count from the profile's JSON representation
function getInstagramFollowerCount(username) {
  var url = "https://www.instagram.com/" + username + "/?__a=1";
  var response = UrlFetchApp.fetch(url).getContentText();
  return JSON.parse(response).user.followed_by.count;
}
Go to Run > insertFollowerCount
NOTE: You may need to do a bit of formatting with the main Google Sheet, but this will get you some very long columns showing an increase in followers by the hour.

Gmail API : resultSizeEstimate gives odd numbers

https://developers.google.com/gmail/api/v1/reference/users/messages/list
The Gmail API lets you retrieve an estimate of the number of messages for a given query (from:send1@gmail.com is:unread). Somehow the number that the API returns is very different from the one shown in the webmail.
Any idea how to get the actual number?
resultSizeEstimate is only an estimate and isn't guaranteed to be accurate for general queries. It should give more reasonable (still estimated) numbers for queries on specific labels ("label:MYLABEL" or "label:MYLABEL is:unread").
Unfortunately, there currently isn't a method of getting the actual numbers, other than retrieving all of them and looking at the size of the list returned.
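Retrieving everything and counting could be sketched as below. It assumes service is an already authorized Gmail client from google-api-python-client (for example from googleapiclient.discovery.build("gmail", "v1", credentials=creds)); the function name count_messages is made up for illustration.

```python
def count_messages(service, user_id="me", query=""):
    """Count messages matching a query by walking every result page."""
    total = 0
    page_token = None
    while True:
        response = service.users().messages().list(
            userId=user_id,
            q=query,
            pageToken=page_token,  # None on the first request
            maxResults=500,        # larger pages mean fewer round trips
        ).execute()
        # A page without matches has no "messages" key at all.
        total += len(response.get("messages", []))
        page_token = response.get("nextPageToken")
        if not page_token:
            return total


# Usage, assuming `service` was built and authorized elsewhere:
# print(count_messages(service, query="is:unread"))
```

This costs one request per page, so for very large mailboxes it is slow, but the result is an actual count rather than resultSizeEstimate.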
Partial answer - I was hitting the 103 limit as well.
I've noticed that by playing with the maxResults parameter of the Google API I get different, yet still invalid, results.
Without maxResults:
request = service.users().messages().list(userId=user_id, labelIds=formatted_labels)
--> resultSizeEstimate = 103
With maxResults=100:
request = service.users().messages().list(userId=user_id, labelIds=formatted_labels, maxResults=100)
--> resultSizeEstimate = 103
With maxResults=1000:
request = service.users().messages().list(userId=user_id, labelIds=formatted_labels, maxResults=1000)
--> resultSizeEstimate = 511
With maxResults=50:
request = service.users().messages().list(userId=user_id, labelIds=formatted_labels, maxResults=50)
--> resultSizeEstimate = 52
This is not a solution for every case, but it may help. You can get the exact number of emails under a given label using Users.labels:get
https://developers.google.com/gmail/api/v1/reference/users/labels/get
The messagesTotal value gives you that number.
Unfortunately, as others have noted, listing messages with Users.messages:list or doing a search does not return accurate results.
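A minimal sketch of the label lookup, again assuming service is an authorized Gmail client from google-api-python-client. The helper name label_counts is hypothetical; messagesTotal and messagesUnread are fields of the label resource.

```python
def label_counts(service, label_id, user_id="me"):
    """Return exact total and unread message counts for one label."""
    label = service.users().labels().get(userId=user_id, id=label_id).execute()
    return {
        "total": label.get("messagesTotal", 0),
        "unread": label.get("messagesUnread", 0),
    }


# Usage, assuming `service` was built and authorized elsewhere:
# print(label_counts(service, "INBOX"))
```

This only answers per-label questions; a free-form query such as from:... is:unread still needs the full listing approach.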

twitter4j when to STOP when no more tweets available?

So, I've figured out how to be able to get more than 100 tweets, thanks to How to retrieve more than 100 results using Twitter4j
However, how do I make the script stop, and print that it has stopped, once the maximum number of results has been reached? For example, I set
int numberOfTweets = 512;
And it finds just 82 tweets matching my query.
However, because of:
while (tweets.size () < numberOfTweets)
it keeps on querying over and over until I max out my rate limit of 180 requests per 15 minutes.
I'm really a novice at Java, so I would really appreciate it if you could show me how to resolve this by modifying the script in the first answer at How to retrieve more than 100 results using Twitter4j
Thanks in advance!
You only need to modify things in the try{} block. One solution is to check whether the lowest tweet ID found up to the previous iteration of the while loop (previousLastID) is the same as the lowest ID (lastID) after processing the new batch (newTweets). If it is, the new batch contains nothing older than what we already have, which means we have reached the end of the available tweets for this hashtag.
try {
    QueryResult result = twitter.search(query);
    List<Status> newTweets = result.getTweets();
    long previousLastID = lastID;
    // track the lowest id seen so far
    for (Status t : newTweets)
        if (t.getId() < lastID) lastID = t.getId();
    // no lower id in this batch: nothing new came back, so stop
    if (previousLastID == lastID) {
        println("Last batch (" + tweets.size() + " tweets) was the same as first. Stopping the Gathering process");
        break;
    }
    // ... remainder of the try block as in the linked answer
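The stopping logic is independent of twitter4j, so here is a small Python sketch of it. fetch_page is a hypothetical stand-in for the real search call; it is assumed to return tweets newest first with ids no higher than max_id. Because each request asks only for ids below the lowest one already seen, a batch can never repeat, and an empty batch is the signal that nothing is left.

```python
def collect_tweets(fetch_page, target):
    """Collect up to `target` tweets, stopping early when results run out."""
    tweets = []
    max_id = None  # no upper bound on the first request
    while len(tweets) < target:
        batch = fetch_page(max_id)
        if not batch:
            break  # nothing older is left, so stop instead of burning requests
        tweets.extend(batch)
        # Next request: only tweets strictly older than this batch.
        max_id = min(t["id"] for t in batch) - 1
    return tweets[:target]


# Fake data source: pretend only 82 tweets match the query.
ALL_IDS = list(range(82, 0, -1))


def fake_fetch_page(max_id, page_size=100):
    ids = [i for i in ALL_IDS if max_id is None or i <= max_id]
    return [{"id": i} for i in ids[:page_size]]


print(len(collect_tweets(fake_fetch_page, target=512)))  # 82
```

With the fake source the loop makes two requests (one full of results, one empty) and then stops, instead of repeating until the rate limit is hit.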

Foursquare Venue API & Number of Results, in a more efficient way?

I'd like to ask whether there is a more efficient way to get more than 50 results than the options in these questions:
* How do I get more locations?
* Foursquare Venue API & Number of Results
* Foursquare API nearByVenue service issue (this one is for the old API)
I'm using the current foursquare api for the venue search https://developer.foursquare.com/docs/venues/search .
What I'd like is something like an offset option, in order to get more results, but it seems that there isn't such an option.
Is there an alternative solution?
Thank you in advance.
You should use venues/explore with offset and limit as parameters.
venues/explore gives you totalResults, and you can use that value to calculate the number of pages you need when paginating.
For example, assume totalResults is 90 (pay attention to the offset and limit parameter values):
in first request:
https://api.foursquare.com/v2/venues/explore?client_id=client_id&client_secret=client_secret&v=20150825&near=city_name&categoryId=category_id&intent=browse&offset=0&limit=30
in second request:
https://api.foursquare.com/v2/venues/explore?client_id=client_id&client_secret=client_secret&v=20150825&near=city_name&categoryId=category_id&intent=browse&offset=30&limit=30
in third request:
https://api.foursquare.com/v2/venues/explore?client_id=client_id&client_secret=client_secret&v=20150825&near=city_name&categoryId=category_id&intent=browse&offset=60&limit=30
For 90 results, the three requests above retrieve all the records.
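The page arithmetic can be sketched as follows. The helper names page_offsets and explore_urls are made up, and base_params stands for whatever credentials and filters you already send (client_id, client_secret, v, near and so on).

```python
import math
from urllib.parse import urlencode

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def page_offsets(total_results, limit):
    """Offsets needed to cover total_results items at limit items per page."""
    pages = math.ceil(total_results / limit)
    return [page * limit for page in range(pages)]


def explore_urls(base_params, total_results, limit=30):
    """Build one explore URL per page from the shared base parameters."""
    return [
        EXPLORE_URL + "?" + urlencode({**base_params, "offset": offset, "limit": limit})
        for offset in page_offsets(total_results, limit)
    ]


print(page_offsets(90, 30))  # [0, 30, 60]
```

In practice the first request is made with offset=0, totalResults is read from its response, and the remaining offsets are then requested.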
There is actually another option not mentioned here (not pagination, though):
using the (experimental?) categoryId filter.
You can search for a single point (ll) a few times with different category ids, giving you many results (with some duplicates, as venues can have more than one category).
So you can search for 'Food' venues and 'Nightlife' venues at the same place, getting 100 results instead of 50. As said, these are 100 results but not necessarily unique ones; there could be duplicates. I think that is more efficient than trying to play around with the browse radius.
It's not pagination, but it will give a lot more results than a normal search, usually enough even in urban areas.
But yeah, extracting more than 50 results for a single point in one request is not possible, though it would be nice :)
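Since the per-category searches can return the same venue twice, the merged list needs de-duplicating. A minimal sketch, assuming each venue is a dict carrying the id field that the API returns; merge_unique is a hypothetical name and the sample data is made up.

```python
def merge_unique(*result_lists):
    """Merge venue lists from several searches, keeping the first copy of each id."""
    seen = set()
    merged = []
    for venues in result_lists:
        for venue in venues:
            if venue["id"] not in seen:
                seen.add(venue["id"])
                merged.append(venue)
    return merged


# Made-up results of a 'Food' search and a 'Nightlife' search at the same point.
food = [{"id": "v1", "name": "Diner"}, {"id": "v2", "name": "Gastro Pub"}]
nightlife = [{"id": "v2", "name": "Gastro Pub"}, {"id": "v3", "name": "Club"}]
print([v["id"] for v in merge_unique(food, nightlife)])  # ['v1', 'v2', 'v3']
```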
Afraid not. Currently there is no pagination, in order to find more venues you need to move your search area around as in the answers you highlighted. I agree, pagination would be handy though!
For the explore endpoint this worked for me: if the maximum number of results returned per call is, for instance, 100, just use offset=100 in the next call, which gives you the next 100 results starting from 100 (the offset). Iterate (e.g. using a while loop) and keep increasing offset by 100 until you reach the total number of results (which the API returns as totalResults).
My first stack overflow post, tried to answer as clearly as possible
import pandas as pd
import requests

# CLIENT_ID, CLIENT_SECRET, VERSION and LIMIT are assumed to be defined earlier
# (LIMIT should be 50 to match the offset step used below).

def getNearbyVenues(neighborhoods, latitudes, longitudes, radius=500, ven_num=300):
    venues_list = []
    for name, lat, lng in zip(neighborhoods, latitudes, longitudes):
        i = 0  # offset of the next page
        # stop once ven_num venues have been requested for this neighborhood
        while i < ven_num:
            url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&offset={}&limit={}'.format(
                CLIENT_ID,
                CLIENT_SECRET,
                VERSION,
                lat,
                lng,
                radius,
                i,
                LIMIT)
            # make the GET request
            results = requests.get(url).json()['response']['groups'][0]['items']
            # return only relevant information for each nearby venue
            venues_list.append([(
                name,
                lat,
                lng,
                v['venue']['name'],
                v['venue']['location']['lat'],
                v['venue']['location']['lng'],
                v['venue']['categories'][0]['name']) for v in results])
            i = i + 50
    # flatten the per-page lists into one row per venue
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    print('Ok')
    return nearby_venues
The code above has worked perfectly for me. The ven_num variable is the desired maximum number of venues to request for each neighborhood.