When using the python tweepy library to pull tweets from twitter's streaming API is it possible to exclude retweets?
For instance, if I want only the tweets posted by a particular user ex: twitterStream.filter(follow = ["20264932"]) but this returns retweets and I would like to exclude them. How can I do this?
Thank you in advance.
Just checking a tweet's text to see if it starts with 'RT' is not really a robust solution. You need to make a decision about what you will consider a retweet, since it isn't exactly clear-cut. The Twitter API docs explain that tweets with 'RT' in the tweet text aren't officially retweets.
Sometimes people type RT at the beginning of a Tweet to indicate that they are re-posting someone else's content. This isn't an official Twitter command or feature, but signifies that they are quoting another user's Tweet.
If you're going by the 'official' definition, then you want to filter tweets out if they have a True value for their retweeted attribute, like this:
if not tweet['retweeted']:
# do something with standard tweets
And if you want to be more inclusive, including 'unofficial' re-tweets, you should check the string for the substring 'RT #' and not merely if it starts with 'RT' because that the former is cleaner, faster and eliminates more edge cases where a tweet starts with 'RT' but isn't a retweet (lots of data out there, I'm sure this is a possibility). Here's some code for that:
if not tweet['retweeted'] and 'RT #' not in tweet['text']:
# do something with standard tweets
The latter conditional takes the subset of tweets in your collection that are regular tweets and does an intersection with the subset of tweets in your collection that do not have 'RT #' in the tweet text, leaving you with tweets that are supposedly regular tweets.
Yes there are possible ways of doing this, One of them is to check if the text of the tweet, starts with RT, For this we can easily use .startswith() method on strings and for this you need to change the code of the on_data() method in your streaming class, which can be done as:
class TwitterStreamListener(tweepy.StreamListener):
def on_data(self, data):
# Twitter returns data in JSON format - we need to decode it first
decoded = json.loads(data)
if not decoded[`text`].startswith('RT'):
#Do processing here
print decoded['text'].encode('ascii', 'ignore')
return True
Related
I am using the Reddit API (Pushshift) : https://github.com/pushshift/api
Using the documentation, I understand how I can use this to extract every comment containing the word "covid" that was left in a certain time period:
https://api.pushshift.io/reddit/search/comment?q=covid&after=3h&before=2h&size=1
The output looks something like this:
{"data":[{"subreddit_id":"t5_2qh6p","author_is_blocked":false,"comment_type":null,"edited":false,"author_flair_type":"richtext","total_awards_received":0,"subreddit":"Conservative","author_flair_template_id":null,"id":"j98zf27","gilded":0,"archived":false,"collapsed_reason_code":null,"no_follow":false,"author":"VamboRoolOkay","send_replies":true,"parent_id":41917615743,"score":1,"author_fullname":"t2_7uxkru5f","all_awardings":[],"body":"I will never believe that election fraud wasn't a significant factor. Go ahead - call it a conspiracy theory. But I also maintained that Covid was lab-created. Truth is the Daughter of Time.","top_awarded_type":null,"author_flair_css_class":null,"author_patreon_flair":false,"collapsed":false,"author_flair_richtext":[{"e":"text","t":"Conservative"}],"is_submitter":false,"gildings":{},"collapsed_reason":null,"associated_award":null,"stickied":false,"author_premium":false,"can_gild":true,"link_id":"t3_116l7ct","unrepliable_reason":null,"author_flair_text_color":"dark","score_hidden":true,"permalink":"/r/Conservative/comments/116l7ct/kamala_harris_plans_on_running_with_biden_in_2024/j98zf27/","subreddit_type":"public","locked":false,"author_flair_text":"Conservative","treatment_tags":[],"created_utc":1676866031,"subreddit_name_prefixed":"r/Conservative","controversiality":0,"author_flair_background_color":"","collapsed_because_crowd_control":null,"distinguished":null,"retrieved_utc":1676866047,"updated_utc":1676866048,"body_sha1":"328df3784d15f77b98a84418c4ce720822227cfe","utc_datetime_str":"2023-02-20 04:07:11"}],"error":null,"metadata":{"es":{"took":98,"timed_out":false,"_shards":{"total":828,"successful":828,"skipped":824,"failed":0},"hits":{"total":{"value":573,"relation":"eq"},"max_score":null}},"es_query":{"size":1,"query":{"bool":{"must":[{"bool":{"must":[{"simple_query_string":{"fields":["body"],"query":"covid","default_operator":"and"}},{"range":{"created_utc":{"gte":1676862433000}}},{"range":{"created_utc":{"lt":1676866033000}}}]}}]}},"aggs":{},"sort":{"created_utc":"desc"}},"es_query2":"{\"size\":1,\"query\":{\"bool\":{\"must\":[{\"bool\":{\"must\":[{\"simple_query_string\":{\"fields\":[\"body\"],\"query\":\"covid\",\"default_operator\":\"and\"}},{\"range\":{\"created_utc\":{\"gte\":1676862433000}}},{\"range\":{\"created_utc\":{\"lt\":1676866033000}}}]}}]}},\"aggs\":{},\"sort\":{\"created_utc\":\"desc\"}}","api_launch_time":1673017478.254743,"api_request_start":1676873233.6143198,"api_request_end":1676873233.7406816,"api_total_time":0.12636184692382812}}
My Question: Suppose I identify a post that contains the word "covid" - now, I want to retrieve every comment on this post : Is this possible to do?
For instance, based on the output of these results, I see that :
link_id: t3_116l7ct
parent_id:41917615743
Can I somehow use this information to write an API query to retrieve all comments from this post?
I tried the following query but got an empty result: https://api.pushshift.io/reddit/comment/search/?link_id=t3_116cjib
Thanks!
Using the Reddit API, is it possible to return a list of Reddit comments if the submission title includes a specific keyword? For example, if the keyword is "Lime Sparkling Water", I want to return all the comments under submissions that have "Lime Sparkline Water" in the title.
I've tried using the Pushshift API for Reddit but looks like we can only isolate the submission data or the comment data and not isolate the comments data based on the submissions data.
Please help :)
Yes, this is possible with PRAW.
You can use PRAW's stream function, that page also has examples how to use PRAW.
An example being:
subreddit = reddit.subreddit("AskReddit")
for submission in subreddit.stream.submissions():
# do something with submission
...
This will return all submissions within "AskReddit". From there you could check the post title:
if 'Lime Sparking Water' in submission:
# do something with the submission
Although, I know this is a hypothetical phrase to search, you'd be better off searching lowercase phrases/words and .lower()
This is how retrieve a particular pull-request's comments according to bitbucket's documentation:
While I do have the pull-request ID and format a correct URL I still get a 400 response error. I am able to make a POST request to comment but I cannot make a GET. After further reading I noticed the six parameters listed for this endpoint do not say 'optional'. It looks like these need to be supplied in order to retrieve all the comments.
But what exactly are these parameters? I don't find their descriptions to be helpful in the slightest. Any and all help would be greatly appreciated!
fromHash and toHash are only required if diffType is'nt set to EFFECTIVE. state also seems optional to me (didn't give me an error when not including it), and anchorState specifies which kind of comments to fetch - you'd probably want ALL there. As far as I understand it, path contains the path of the file to read comments from. (ex: src/a.py and src/b.py were changed -> specify which of them to fetch comments for)
However, that's probably not what you want. I'm assuming you want to fetch all comments.
You can do that via /rest/api/1.0/projects/{projectKey}/repos/{repositorySlug}/pull-requests/{pullRequestId}/activities which also includes other activities like reviews, so you'll have to do some filtering.
I won't paste example data from the documentation or the bitbucket instance I tested this once since the json response is quite long. As I've said, there is an example response on the linked page. I also think you'll figure out how to get to the data you want once downloaded since this is a Q&A forum and not a "program this for me" page :b
As a small quickstart: you can use curl like this
curl -u <your_username>:<your_password> https://<bitbucket-url>/rest/api/1.0/projects/<project-key>/repos/<repo-name>/pull-requests/<pr-id>/activities
which will print the response json.
Python version of that curl snippet using the requests module:
import requests
url = "<your-url>" # see above on how to assemble your url
r = requests.get(
url,
params={}, # you'll need this later
auth=requests.auth.HTTPBasicAuth("your-username", "your-password")
)
Note that the result is paginated according to the api documentation, so you'll have to do some extra work to build a full list: Either set an obnoxiously high limit (dirty) or keep making requests until you've fetched everything. I stronly recommend the latter.
You can control which data you get using the start and limit parameters which you can either append to the url directly (e.g. https://bla/asdasdasd/activity?start=25) or - more cleanly - add to the params dict like so:
requests.get(
url,
params={
"start": 25,
"limit": 123
}
)
Putting it all together:
def get_all_pr_activity(url):
start = 0
values = []
while True:
r = requests.get(url, params={
"limit": 10, # adjust this limit to you liking - 10 is probably too low
"start": start
}, auth=requests.auth.HTTPBasicAuth("your-username", "your-password"))
values.extend(r.json()["values"])
if r.json()["isLastPage"]:
return values
start = r.json()["nextPageStart"]
print([x["id"] for x in get_all_pr_activity("my-bitbucket-url")])
will print a list of activity ids, e.g. [77190, 77188, 77123, 77136] and so on. Of course, you should probably not hardcode your username and password there - it's just meant as an example, not production-ready code.
Finally, to filter by action inside the function, you can replace the return values with something like
return [activity for activity in values if activity["action"] == "COMMENTED"]
Ok, this one has become a little tricky for me and I really need some assistance to work through it.
Problem
I have a GSpreadsheet which has a list of data, in this case Twitter usernames. Using the API of a service provider (in this case the Klout API), I would like to retrieve information about that user to populate a cell within a spreadsheet.
Based on what I can work out so far, I would need to write a custom function to do this but I have no idea where to start, how I might construct it, or if there are any examples of doing this.
Scenario
The Klout API can return either an XML or JSON response (see http://developer.klout.com/docs/read/api/API), based on the string passed. For example, the URL:
http://api.klout.com/1/users/show.xml?key=SECRET&users=thewinchesterau
would return the following XML response:
<users>
<user>
<twitter_id>17439480</twitter_id>
<twitter_screen_name>thewinchesterau</twitter_screen_name>
<score>
<kscore>56.63</kscore>
<slope>0</slope>
<description>creates content that is spread throughout their network and drives discussions.</description>
<kclass_id>10</kclass_id>
<kclass>Socializer</kclass>
<kclass_description>You are the hub of social scenes and people count on you to find out what's happening. You are quick to connect people and readily share your social savvy. Your followers appreciate your network and generosity.</kclass_description>
<kscore_description>thewinchesterau has a low level ofinfluence.</kscore_description>
<network_score>58.06</network_score>
<amplification_score>29.16</amplification_score>
<true_reach>90</true_reach>
<delta_1day>0.3</delta_1day>
<delta_5day>0.5</delta_5day>
</score>
</user>
</users>
Based on this response, I would like to be able to populate different cells with the values returned within the XML (or JSON if easier) packet.
So, for example, I would have a spreadsheet like the following which would have custom functions to go out and retrieve the value of the relevant XML element response to populate the cell:
Cell A B C D E
1 Username kscore Network score Amplification score True reach
2 thewinchester =kscore(A2) =nscore(A2) =ascore(A2) =tscore(A2)
Questions
Are there any gSpreadsheet examples you know of that use an API to pull data in from an external source?
How would one write a custom function to fetch the result from the API and populate a cell with a result of a specific element?
Any information, examples or helpers you have are greatly appreciated.
You want the importXML function, documented here. The formula you want will look something like this:
=importXML("http://api.klout.com/1/users/show.xml?key=SECRET&users=" + A1, "//users/user/score/kscore")
You could write a custom script with Google AppScript, but there's a simple solution to this similar to what Nick Johnson posted. I've tested this against the score function, but it could be easily adapted to the show endpoint with different XPath.
=importXML("http://api.klout.com/1/klout.xml?users="&A1&"&key=YOUR_API_KEY", "//users/user/kscore")
This presumes your Twitter IDs are in the A column.
Note, Google Docs limits the number of such importXML functions to 50 per spreadsheet. You could concatenate groups of 5 userids for each importXML call, effectively putting your limit to 250 a sheet.
This could also be adapted to a similar call in Excel that doesn't have that limit. Keep in mind the Klout ToS, though, using proper attribution and rate limits.
How can I use Twitter Search API (or other) to get a list of tweets which have the "geo" param?
--EDIT--
By example: I wont get list of geotagged tweets, by #apple tag. Without location filter, worldwide.
Looks like the latest API supports that; simply use a large enough geo region for your query:
-180,-90,180,90
See more from the API link for filter and location
The streaming API allowed you to filter by a location and the search API allows you to search by geocode. You can find more information on these services on our developer resources site.
Streaming API: http://dev.twitter.com/pages/streaming_api
Example: Create a file called ‘locations’ that
contains, excluding the quotation
marks, the phrase:
“locations=-122.75,36.8,-121.75,37.8,-74,40,-73,41” then execute:
curl -d #locations
http://stream.twitter.com/1/statuses/filter.json
-uAnyTwitterUser:Password.
You will receive all geo tagged tweets
from the San Francisco and New York
City area.
Search API: http://dev.twitter.com/doc/get/search
Example: http://search.twitter.com/search.json?geocode=37.781157,-122.398720,1mi
From the Twitter API Documentation, this should be the format of your search query:
http://search.twitter.com/search.json?geocode=37.781157,-122.398720,1mi
Where 37.781157 is the latitude, -122.398720 is the longitude and 1mi is the radius to search within.
You can look for every tweet but save only the geotaged ones.
I know it dont make a lot of sense, but works quite well.
if you call you search results, you can state
for result in results:
if result.geo != None:
print result.text.encode('utf-8', errors='ignore') # or do anything you want with the tweets
Use -180,-90,180,90 to get any geotagged tweet.