So, I've figured out how to get more than 100 tweets, thanks to How to retrieve more than 100 results using Twitter4j.
However, how do I make the script stop (and print that it has stopped) once the maximum available results have been reached? For example, I set
int numberOfTweets = 512;
And, it finds just 82 tweets matching my query.
However, because of:
while (tweets.size () < numberOfTweets)
it still keeps querying over and over until I max out my rate limit of 180 requests per 15 minutes.
I'm really a novice at Java, so I would appreciate it if you could show me how to resolve this by modifying the script from the first answer at How to retrieve more than 100 results using Twitter4j.
Thanks in advance!
You only need to modify things in the try{} block. One solution is to check whether the lowest tweet ID found on the previous iteration of the while loop (previousLastID) is the same as the lowest ID (lastID) in the newly collected batch (newTweets). If it is, the new batch's elements already exist in the previous results, and we have reached the end of the available tweets for this hashtag.
try {
    QueryResult result = twitter.search(query);
    List<Status> newTweets = result.getTweets();
    long previousLastID = lastID;
    // Track the lowest (oldest) tweet ID seen in this batch
    for (Status t : newTweets) {
        if (t.getId() < lastID) lastID = t.getId();
    }
    // If the lowest ID didn't change, this batch contained nothing new
    if (previousLastID == lastID) {
        println("Last batch (" + tweets.size() + " tweets) was the same as the previous one. Stopping the gathering process.");
        break;
    }
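For context, here is roughly how that check sits inside the gathering loop from the linked answer. This is only a sketch (not run), assuming tweets, query, numberOfTweets and lastID are set up as in that answer:
while (tweets.size() < numberOfTweets) {
    try {
        QueryResult result = twitter.search(query);
        List<Status> newTweets = result.getTweets();
        long previousLastID = lastID;
        for (Status t : newTweets) {
            if (t.getId() < lastID) lastID = t.getId();
        }
        if (previousLastID == lastID) {
            println("Last batch (" + tweets.size() + " tweets) was the same as the previous one. Stopping the gathering process.");
            break; // stop instead of burning more rate-limited requests
        }
        tweets.addAll(newTweets);
        // Ask only for tweets older than the oldest one collected so far
        query.setMaxId(lastID - 1);
    } catch (TwitterException te) {
        println("Couldn't connect: " + te);
        break;
    }
}
println("Collected " + tweets.size() + " tweets in total");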
In Sanity Studio you get a nice list of the most recent versions of all your documents: if there is a draft you get that; if not, you get the published one.
I need the same list for a few filters and scripts. The following GROQ does the job, but it is not very fast and does not work in the new API (v2021-03-25).
*[
_type == $type &&
!defined(*[_id == "drafts." + ^._id])
]._id
A way around the breaking changes in the API is to use length() == 0 in place of !defined(), but that makes an already slow query 10-20× slower.
Does anyone know a way of making filters that consider only the latest version?
Edit: An example of where I need this is when I want to see all documents without any categories. Regardless of whether it is the published document or the draft that has no categories, it shows up in a normal filter. So if you add categories but don't immediately want to publish, the no-categories list becomes confusing. ;-)
100× improvement on API v2021-03-25 🥳
The only way I was able to solve this with speed was to first make a projection of the sub-query so it doesn't run once for every non-draft. Then I thought: why not project both sets and then figure out the overlap? That was even faster! It runs more than 10× faster than what was possible on API v1 and 100× faster than any of the suggestions for the new API.
{
'drafts': *[ _type == $type && _id in path("drafts.**") ]._id,
'published': *[ _type == $type && !(_id in path("drafts.**"))]._id,
}
{
'current': published[ !("drafts." + @ in ^.drafts) ] + drafts
}
First I get both drafts and non-drafts and "store" them in this projection, a bit like variables 😉
Then I start with my non-drafts (published)
And filter out any that have a counterpart in my drafts "variable"
Lastly I add all the drafts to my list of filtered non-drafts
Overall I think you're on the right track. Some ideas to help you out:
Drafts are always fresher and newer than published documents, so if a given document's _id is in path("drafts.**"), that's already the most recently updated version.
Knowing the above allows you to skip the defined(*[_id == ...]) part of the query for drafts, speeding up your execution.
As drafts are already included, we can exclude published documents with a draft (defined(*[_id == "drafts." + ^._id][0]))
Notice I added a [0] to the end of the query to pick only the first element that matches. This will improve performance slightly.
For getting only documents that have no categories, use count(categoriesField) < 1
Order documents with | order(_updatedAt desc) to get the freshest documents first
And paginate your request to reduce the payload and speed things up.
Here's a sample query applying these principles (I haven't run it, so you may have to make some adjustments):
*[
_type == $type &&
// Assuming you only want those without categories:
count(categories) < 1 &&
(
// Is either a draft -> drafts are always fresher
_id in path("drafts.**") ||
// Or a published document with no draft
!defined(*[_id == "drafts." + ^._id][0])
// 👆 with the check above we're ensuring only
// published documents run the expensive defined query
)
]
// Order by last updated
| order(_updatedAt desc)
// Paginate for faster queries
[$paginationStart..$paginationEnd]
// Get only the _id, assuming that's what you want
._id
Hope this helps 🙌
I use Spring Boot with Spring Data JPA, Hibernate and Oracle.
In my table I have around 10 million records. For each record I need to do some processing, write info to a file, and then delete the record.
The selection is a basic SQL query:
select * from zzz where status = 2;
I did a test without doing the processing or deleting the records:
long start = System.nanoTime();
int page = 0;
Pageable pageable = PageRequest.of(page, LIMIT);
Page<Billing> pageBilling = billingRepository.findAllByStatus(pageable);
while (true) {
    for (Billing billing : pageBilling.getContent()) {
        // process
        // write to file
        // delete element
    }
    if (!pageBilling.hasNext()) {
        break;
    }
    pageable = pageBilling.nextPageable();
    pageBilling = billingRepository.findAllByStatus(pageable);
}
long end = System.nanoTime();
long microseconds = (end - start) / 1000;
System.out.println(microseconds + " to write");
The result is bad: with a page size (LIMIT) of 10,000 it took 157 minutes; with 100,000, 28 minutes; with a size in the millions, 19 minutes.
Is there a better solution to improve performance?
The following are likely to improve the performance significantly:
You should not iterate past the first page. Instead, delete the processed data and select the first page again. Actually, you don't need a Pageable for that; you can encode the limit in the method name. Selecting late pages is rather inefficient.
The process of loading, processing and deleting one batch of items should run in a separate transaction. Otherwise the EntityManager will hold every entity ever loaded, which will make things really slow. See the sketch below.
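A minimal sketch of what those two points could look like together. The class and method names (BillingBatchService, findFirst1000ByStatus) are assumptions, not code from the question, and BillingRepository would have to declare List<Billing> findFirst1000ByStatus(int status):
import java.util.List;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class BillingBatchService {

    private final BillingRepository billingRepository;

    public BillingBatchService(BillingRepository billingRepository) {
        this.billingRepository = billingRepository;
    }

    // One batch per transaction, so the persistence context never
    // accumulates millions of managed entities.
    @Transactional
    public int processOneBatch() {
        // Limit encoded in the derived query name instead of a Pageable;
        // we always re-read "the first page" because processed rows are deleted.
        List<Billing> batch = billingRepository.findFirst1000ByStatus(2);
        for (Billing billing : batch) {
            // process and write to file here
            billingRepository.delete(billing);
        }
        return batch.size();
    }
}

// Caller (from another bean, so the @Transactional proxy is used):
// while (billingBatchService.processOneBatch() > 0) { }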
If that still isn't sufficient, you may look into the following:
Inspect the SQL being executed. Does it look sensible? If not, consider switching to JdbcTemplate or NamedParameterJdbcTemplate. With a query method that takes a RowCallbackHandler you should be able to load and process all rows with a single select statement, and at the end execute one delete statement to remove all the processed rows (see the sketch after this list). This requires that the status you use for filtering does not change in the meantime.
What do the execution plans look like? If they seem off, inspect your indices.
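A rough sketch of that RowCallbackHandler approach, assuming a JdbcTemplate bean and the table/column from the question's SQL:
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowCallbackHandler;

public class BillingExporter {

    private final JdbcTemplate jdbcTemplate;

    public BillingExporter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public void exportAndDelete() {
        // Stream over a single result set instead of paging.
        RowCallbackHandler handler = rs -> {
            // read the columns you need from rs, process, write to the file
        };
        jdbcTemplate.query("select * from zzz where status = ?", handler, 2);

        // Afterwards remove everything that was processed in one statement.
        // Requires that no new rows get status = 2 in the meantime.
        jdbcTemplate.update("delete from zzz where status = ?", 2);
    }
}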
I'm using Late Acceptance as the local search algorithm, and here is how it actually picks moves:
If my forager limit is 5, it will pick 5 moves and then take 1 of them at random to be applied for every step.
At every step it only picks moves that improve the score, i.e. greedy picking across steps.
Forager.pickMove()
public LocalSearchMoveScope pickMove(LocalSearchStepScope stepScope) {
    stepScope.setSelectedMoveCount(selectedMoveCount);
    stepScope.setAcceptedMoveCount(acceptedMoveCount);
    if (earlyPickedMoveScope != null) {
        return earlyPickedMoveScope;
    }
    List<LocalSearchMoveScope> finalistList = finalistPodium.getFinalistList();
    if (finalistList.isEmpty()) {
        return null;
    }
    if (finalistList.size() == 1 || !breakTieRandomly) {
        return finalistList.get(0);
    }
    int randomIndex = stepScope.getWorkingRandom().nextInt(finalistList.size()); // should have checked for best here
    return finalistList.get(randomIndex);
}
I have two questions:
First, can we make the forager pick the best of the 5 instead of picking 1 randomly?
Second, can we allow it to pick a move that degrades the score but may lead to a better score later (there is no way to know this in advance)?
Look for acceptedCountLimit and selectedCountLimit in the docs. Those do exactly that; see the config sketch below.
That's already the case (especially with Late Acceptance and Simulated Annealing). In the DEBUG log, just look at the step score vs the best score. Or ask for the step score statistic in optaplanner-benchmark.
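For reference, a minimal sketch of setting the forager limit through the programmatic config API. The class names assume a recent OptaPlanner version; the equivalent XML is an <acceptedCountLimit> element inside <forager>:
import org.optaplanner.core.config.localsearch.LocalSearchPhaseConfig;
import org.optaplanner.core.config.localsearch.decider.forager.LocalSearchForagerConfig;

public class ForagerConfigExample {

    public static LocalSearchPhaseConfig localSearchWithForagerLimit() {
        LocalSearchForagerConfig foragerConfig = new LocalSearchForagerConfig();
        // Evaluate 5 accepted moves per step; the finalist podium keeps the best
        // of them and randomness is only used to break ties (see pickMove above).
        foragerConfig.setAcceptedCountLimit(5);

        LocalSearchPhaseConfig localSearchConfig = new LocalSearchPhaseConfig();
        localSearchConfig.setForagerConfig(foragerConfig);
        return localSearchConfig;
    }
}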
I have a task where I am checking an ArrayList of "jobs" against an ArrayList of "machines" (all of different types). If the first three letters of the job code match the first three letters of the machine code (e.g. a PRT job code can only be assigned to a machine with code PRT), I want the machine to accept the job; if no machine matches, I would like it to print a message saying that there is no available machine. I have only been learning Java for a couple of weeks, so this might not be the best way:
public void assignJob() {
    for (Job j : jobs) {
        String jobCode = j.getCode().substring(0, 3);
        for (Machine m : machines) {
            String machineCode = m.getCode().substring(0, 3);
            if (jobCode.equals(machineCode)) {
                m.acceptJob(j);
                System.out.println("The job " + j.getCode() + " has been assigned to a machine.");
                break;
            }
            else {
                System.out.println("Sorry there is no machine available to accept the type of job: " + j.getCode());
            }
        }
    }
}
The issue I am getting is that it prints the message every time it goes around the loop, so it says there is no machine available 3 times before it finds the correct machine on the fourth pass and then says the job has been accepted. I only want the message once, and only after it has searched everything and not found a match.
Any help would be appreciated.
Use a boolean found = false; flag for each job: declare it before the loop over machines, move the else part of the if statement below that loop, and make it an if statement. If you find your match, set found = true; and break out of the machine loop. Then, after the loop, check the flag and display the message if nothing was found.
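A minimal sketch of that applied to the method from the question; the flag is reset for every job, so each job gets its own message:
public void assignJob() {
    for (Job j : jobs) {
        String jobCode = j.getCode().substring(0, 3);
        boolean found = false;
        for (Machine m : machines) {
            String machineCode = m.getCode().substring(0, 3);
            if (jobCode.equals(machineCode)) {
                m.acceptJob(j);
                System.out.println("The job " + j.getCode() + " has been assigned to a machine.");
                found = true;
                break;
            }
        }
        if (!found) {
            System.out.println("Sorry, there is no machine available to accept the type of job: " + j.getCode());
        }
    }
}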
I've built an app using twitter4j which pulls in a bunch of tweets when I enter a keyword, takes the geolocation out of the tweet (or falls back to the profile location) and then maps them using ammaps. The problem is I'm only getting a small portion of tweets; is there some kind of limit here? I've got a DB collecting the tweet data, so soon enough it will have a decent amount, but I'm curious as to why I'm only getting tweets from within the last 12 hours or so.
For example, if I search by my username I only get one tweet, which I sent today.
Thanks for any info!
EDIT: I understand Twitter doesn't allow public access to the firehose... my question is more about why I am limited to finding only recent tweets.
You need to keep redoing the query, resetting the maxId every time, until you get nothing back. You can also use setSince and setUntil.
An example:
Query query = new Query();
query.setCount(DEFAULT_QUERY_COUNT);
query.setLang("en");
// set the bounding dates (sdf formats them as yyyy-MM-dd, which since/until expect)
query.setSince(sdf.format(startDate));
query.setUntil(sdf.format(endDate));
QueryResult result = searchWithRetry(twitter, query); // searchWithRetry is my function that deals with rate limits
while (result.getTweets().size() != 0) {
    List<Status> tweets = result.getTweets();
    System.out.print("# Tweets:\t" + tweets.size());
    Long minId = Long.MAX_VALUE;
    for (Status tweet : tweets) {
        // do stuff here
        if (tweet.getId() < minId)
            minId = tweet.getId();
    }
    // ask only for tweets older than the oldest one in this batch
    query.setMaxId(minId - 1);
    result = searchWithRetry(twitter, query);
}
It really depends on which API you are using, i.e. the Streaming API or the Search API. In the Search API there is an optional parameter (result_type). Its possible values are the following:
* mixed: include both popular and real-time results in the response.
* recent: return only the most recent results in the response.
* popular: return only the most popular results in the response.
The default is mixed.
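For example, with Twitter4j 4.x the result type can be set on the Query like this (older Twitter4j versions take a String constant such as Query.MIXED instead of the enum); the keyword is just a placeholder:
Query query = new Query("your keyword");
// Ask for both popular and real-time tweets rather than only the most recent ones
query.setResultType(Query.ResultType.mixed);
QueryResult result = twitter.search(query);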
As far as I understand, you are using the recent one; that is why you are getting only the recent set of tweets. Another issue is the low volume of tweets that carry geolocation information: because very few users add location information to their tweets or profile, you are getting very few geotagged tweets.