So I am calling the Twitter API:
openurl = urllib.urlopen("https://api.twitter.com/1/statuses/user_timeline.json?include_entities=true&contributor_details&include_rts=true&screen_name="+user+"&count=3600")
and it returns some long file like:
[{"entities":{"hashtags":[],"user_mentions":[],"urls":[{"url":"http:\/\/t.co\/Hd1ubDVX","indices":[115,135],"display_url":"amzn.to\/tPSKgf","expanded_url":"http:\/\/amzn.to\/tPSKgf"}]},"coordinates":null,"truncated":false,"place":null,"geo":null,"in_reply_to_user_id":null,"retweet_count":2,"favorited":false,"in_reply_to_status_id_str":null,"user":{"contributors_enabled":false,"lang":"en","profile_background_image_url_https":"https:\/\/si0.twimg.com\/profile_background_images\/151701304\/theme14.gif","favourites_count":0,"profile_text_color":"333333","protected":false,"location":"North America","is_translator":false,"profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/151701304\/theme14.gif","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/1642783876\/idB005XNC8Z4_normal.png","name":"User Interface Books","profile_link_color":"009999","url":"http:\/\/twitter.com\/ReleasedBooks\/genres","utc_offset":-28800,"description":"All new user interface and graphic design book releases posted on their publication day","listed_count":11,"profile_background_color":"131516","statuses_count":1189,"following":false,"profile_background_tile":true,"followers_count":732,"profile_image_url":"http:\/\/a2.twimg.com\/profile_images\/1642783876\/idB005XNC8Z4_normal.png","default_profile":false,"geo_enabled":false,"created_at":"Mon Sep 20 21:28:15 +0000 2010","profile_sidebar_fill_color":"efefef","show_all_inline_media":false,"follow_request_sent":false,"notifications":false,"friends_count":1,"profile_sidebar_border_color":"eeeeee","screen_name":"User","id_str":"193056806","verified":false,"id":193056806,"default_profile_image":false,"profile_use_background_image":true,"time_zone":"Pacific Time (US & Canada)"},"possibly_sensitive":false,"in_reply_to_screen_name":null,"created_at":"Thu Nov 17 00:01:45 +0000 2011","in_reply_to_user_id_str":null,"retweeted":false,"source":"\u003Ca href=\"http:\/\/twitter.com\/ReleasedBooks\/genres\" rel=\"nofollow\"\u003EBook Releases\u003C\/a\u003E","id_str":"136957158075011072","in_reply_to_status_id":null,"id":136957158075011072,"contributors":null,"text":"Digital Media: Technological and Social Challenges of the Interactive World - by William Aspray - Scarecrow Press. http:\/\/t.co\/Hd1ubDVX"},{"entities":{"hashtags":[],"user_mentions":[],"urls":[{"url":"http:\/\/t.co\/GMCzTija","indices":[119,139],"display_u
Well, the different objects are split into lists and dictionaries, and I want to extract the different parts, but to do this I have to know how many objects the file has:
example:
[{1:info , 2:info}][{1:info , 2:info}][{1:info , 2:info}][{1:info , 2:info}]
so to extract the info from 1 in the first table I would:
[0]['1']
info
But to extract it from the last object in the table I need to know how many objects the table has.
This is what my code looks like:
table_timeline = json.loads(twitter_timeline)
table_timeline_inner = table_timeline[x]
lines = 0
while lines < linesmax:
    in_reply_to_user_id = table_timeline_inner['in_reply_to_status_id_str']
    lines += 1
So how do I find the value of the last object in this table?
thanks
I'm not entirely sure this is what you're looking for, but to get the last item in a python list, use an index of -1. For example,
>>> alist = [{'position': 'first'}, {'position': 'second'}, {'position': 'third'}]
>>> print alist[-1]
{'position': 'third'}
>>> print alist[-1]['position']
third
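Applied to your own code, a minimal sketch (reusing the twitter_timeline variable from your snippet, i.e. the JSON string returned by your urlopen call) would be:

import json

table_timeline = json.loads(twitter_timeline)             # a list of tweet dictionaries

print(len(table_timeline))                                # how many tweet objects the list holds
print(table_timeline[-1]['in_reply_to_status_id_str'])    # the same field, taken from the last tweet

# or simply loop over every tweet without needing the count at all
for tweet in table_timeline:
    print(tweet['in_reply_to_status_id_str'])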
I'm trying to compute average movie ratings for the following four time intervals during which the movies were released: (a) 1970 to 1979, (b) 1980 to 1989, etc. I wonder what I did wrong here, since I'm new to DS.
EDIT
Since the dataset has no year column, I extract the release year embedded in the title column and assign it as a new column of the dataset:
year = df['title'].str.findall('\((\d{4})\)').str.get(0)
year_df = df.assign(year = year.values)
Because the column contains some strings, I convert the entire "year" column to int. Then I use groupby to group the years into 10-year intervals.
year_df['year'] = year_df['year'].astype(int)
year_df = year_df.groupby(year_df.year // 10 * 10)
After that, I want to assign each year group a 10-year interval label:
year_desc = { 1910: "1910 – 1919", 1920: "1920 – 1929", 1930: "1930 – 1939", 1940: "1940 – 1949", 1950: "1950 – 1959", 1960: "1960 – 1969", 1970: "1970 – 1979", 1980: "1980 – 1989", 1990: "1990 – 1999", 2000: "2000 – 2009"}
year_df['year'] = [year_desc[x] for x in year_df['year']]
When I run my code after trying to assign year group, I get an error stated that:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
UPDATES:
I tried to follow @ozacha's suggestion and I am still getting an error, but this time it is
'SeriesGroupBy' object has no attribute 'map'
Ad 1) Your year_df already has a year column, so there is no need to recreate it using df.assign(). .assign() is an alternative way of (re)defining columns in a dataframe.
Ad 2) Not sure what your test_group is, so it is difficult to tell what the source of the error is. However, I believe this is what you want – using pd.Series.map:
year_df = ...
year_df['year'] = year_df['year'].astype(int)
year_desc = {...}
year_df['year_group'] = year_df['year'].map(year_desc)
Alternatively, you can also generate year groups dynamically:
year_df['year_group'] = year_df['year'].map(lambda year: f"{year} – {year + 9}")
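Putting the pieces together, a rough end-to-end sketch for the original question might look like the following; note that "movies.csv" and the "rating" column name are my assumptions, since the actual file and ratings column were not shown:

import pandas as pd

df = pd.read_csv("movies.csv")   # hypothetical file name

# pull the release year out of titles like "Toy Story (1995)"
# (assumes every title actually contains a year)
df['year'] = df['title'].str.extract(r'\((\d{4})\)', expand=False).astype(int)

# label each row with its decade, then average the ratings per decade
df['year_group'] = df['year'].map(lambda y: f"{y // 10 * 10} – {y // 10 * 10 + 9}")
print(df.groupby('year_group')['rating'].mean())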
I am trying to get my head around the Pandas module and started learning about the Series data structure.
I have created the following Series in Spyder :-
songs = pd.Series(data = [145,142,38,13], name = "Count")
I can obtain information about the Series index using the code:-
songs.index
The output of the above code is as follows:-
RangeIndex(start=0, stop=4, step=1)
My question is where it states Start = 0 and Stop = 4, what are these referring to?
I have interpreted start = 0 as meaning that the first element in the Series is in row 0.
But I am not sure what the Stop value refers to, as there is no element at row 4 of the Series.
Can someone explain?
Thank you.
This concept, as already explained adequately in the comments (the last valid index is one less than the count of items), is prevalent in many places.
For instance, take the list data structure:
z = songs.to_list()
print(z)        # [145, 142, 38, 13]
print(len(z))   # 4 -- the length is four

# however, indexing stops at position i-1, 'i' being the length/count of items in the list
z[4]            # this will raise an IndexError

# you have to start at index 0 and can only go up to index 3 (i.e. 4 items)
z[0], z[1], z[2], z[-1]   # notice how -1 can be used to directly access the last element
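In other words, a RangeIndex follows the same rule as Python's built-in range: start is inclusive and stop is exclusive. A quick sketch to see this:

import pandas as pd

songs = pd.Series(data=[145, 142, 38, 13], name="Count")
print(songs.index)         # RangeIndex(start=0, stop=4, step=1)
print(list(songs.index))   # [0, 1, 2, 3] -- the stop value 4 is never used as a label
print(list(range(0, 4)))   # [0, 1, 2, 3] -- same rule as the built-in range
print(songs[3])            # 13, the last element; songs[4] would raise a KeyError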
I'm sorry, I know this is basic but I've tried to figure it out myself for 2 days by sifting through documentation to no avail.
My code:
import numpy as np
import pandas as pd
name = ["bob","bobby","bombastic"]
age = [10,20,30]
price = [111,222,333]
share = [3,6,9]
list = [name,age,price,share]
list2 = np.transpose(list)
dftest = pd.DataFrame(list2, columns = ["name","age","price","share"])
print(dftest)
name age price share
0 bob 10 111 3
1 bobby 20 222 6
2 bombastic 30 333 9
Want to divide all elements in 'price' column with all elements in 'share' column. I've tried:
print(dftest[['price']/['share']]) - Failed
dftest['price']/dftest['share'] - Failed, unsupported operand type
dftest.loc[:,'price']/dftest.loc[:,'share'] - Failed
Wondering if I could just change everything to int or float, I tried:
dftest.astype(float) - cant convert from str to float
I've tried the iter and items methods but could not understand the printouts...
My only suspicion is to use something called iterate, which I am unable to wrap my head around despite reading other old posts...
Please help me T_T
Apologies in advance for the somewhat protracted answer, but the question is a bit unclear with regard to what exactly you're attempting to accomplish.
If you simply want price[0]/share[0], price[1]/share[1], etc. you can just do:
dftest['price_div_share'] = dftest['price'] / dftest['share']
The issue with the operand types can be solved by:
dftest['price_div_share'] = dftest['price'].astype(float) / dftest['share'].astype(float)
You're getting the can't convert from str to float error because you're trying to call astype(float) on the ENTIRE dataframe, which contains string columns.
If you want to divide each item by each item, i.e. price[0] / share[0], price[1] / share[0], price[2] / share[0], price[0] / share[1], etc. You would need to iterate through each item and append the result to a new list. You can do that pretty easily with a for loop, although it may take some time if you're working with a large dataset. It would look something like this if you simply want the result:
new_list = []
for p in dftest['price'].astype(float):
    for s in dftest['share'].astype(float):
        new_list.append(p/s)
If you want to get this in a new dataframe you can simply save it to a new dataframe using the pd.DataFrame() constructor:
new_df = pd.DataFrame(new_list, columns=['price_divided_by_share'])
This new dataframe would only have one column (the result, as mentioned above). If you want the information from the original dataframe as well, then you would do something like the following:
new_list = []
for n, a, p in zip(dftest['name'], dftest['age'], dftest['price'].astype(float)):
    for s in dftest['share'].astype(float):
        new_list.append([n, a, p, s, p/s])
new_df = pd.DataFrame(new_list, columns=['name', 'age', 'price', 'share', 'price_div_by_share'])
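If the pairwise version (every price against every share) is really what you need and the nested loop ever becomes slow, a vectorised alternative using NumPy broadcasting is possible. This is only a sketch, rebuilding the example frame from the question:

import numpy as np
import pandas as pd

dftest = pd.DataFrame({'name': ["bob", "bobby", "bombastic"],
                       'age': [10, 20, 30],
                       'price': [111, 222, 333],
                       'share': [3, 6, 9]})

price = dftest['price'].astype(float).values
share = dftest['share'].astype(float).values

# outer division via broadcasting: entry [i, j] is price[i] / share[j]
ratio = price[:, None] / share[None, :]
ratio_df = pd.DataFrame(ratio, index=dftest['name'], columns=dftest['share'])
print(ratio_df)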
If you check the data types of your dataframe, you will realise that they are all strings/object type :
dftest.dtypes
name object
age object
price object
share object
dtype: object
first step will be to change the relevant columns to numbers - this is one way:
dftest = dftest.set_index("name").astype(float)
dftest.dtypes
age float64
price float64
share float64
dtype: object
This way you make the names a useful index, and separate them from the numeric data. This is just a suggestion; you may have other reasons to leave name as a column - in that case, you have to individually change the data types of each column (see the short sketch at the end of this answer).
Once that is done, you can safely execute your code :
dftest.div(dftest.share,axis=0)
age price share
name
bob 3.333333 37.0 1.0
bobby 3.333333 37.0 1.0
bombastic 3.333333 37.0 1.0
I assume this is what you expect as your outcome. If not, you can tweak it. The main part is to get your data types as numbers before any computation/division can occur.
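For completeness, a small sketch of the other option mentioned above (keeping name as an ordinary column and converting only the numeric columns one by one; price_div_share is just a made-up name for the result column):

import pandas as pd

# the frame from the question -- every column arrives as strings
dftest = pd.DataFrame({'name': ["bob", "bobby", "bombastic"],
                       'age': ["10", "20", "30"],
                       'price': ["111", "222", "333"],
                       'share': ["3", "6", "9"]})

# convert only the numeric columns, leaving 'name' untouched
for col in ['age', 'price', 'share']:
    dftest[col] = dftest[col].astype(float)

dftest['price_div_share'] = dftest['price'] / dftest['share']
print(dftest)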
I've used NLTK to pos_tag sentences in a pandas dataframe from an old Yelp competition. This returns a list of tuples (word, POS). I'd like to count the number of parts of speech for each instance. How would I, say, create a function to count the number of being verbs in each review? I know how to apply functions to features - no problem there. I just can't wrap my head around how to count things inside tuples inside lists inside a pd feature.
The head is here, as a tsv: https://pastebin.com/FnnBq9rf
Thank you @zhangyulin for your help. After two days, I learned some incredibly important things (as a novice programmer!). Here's the solution!
def NounCounter(x):
    nouns = []
    for (word, pos) in x:
        if pos.startswith("NN"):
            nouns.append(word)
    return nouns

df["nouns"] = df["pos_tag"].apply(NounCounter)
df["noun_count"] = df["nouns"].str.len()
As an example, for dataframe df, noun count of the column "reviews" can be saved to a new column "noun_count" using this code.
from nltk import pos_tag, word_tokenize

def NounCount(x):
    nounCount = sum(1 for word, pos in pos_tag(word_tokenize(x)) if pos.startswith('NN'))
    return nounCount

df["noun_count"] = df["reviews"].apply(NounCount)
df.to_csv('./dataset.csv')
There are a number of ways you can do that, and one very straightforward way is to map the list (or pandas Series) of tuples to an indicator of whether the word is a verb, and count the number of 1's you have.
Assume you have something like this (please correct me if it's not, as you didn't provide an example):
a = pd.Series([("run", "verb"), ("apple", "noun"), ("play", "verb")])
You can do something like this to map the Series and sum the count:
a.map(lambda x: 1 if x[1]== "verb" else 0).sum()
This will return you 2.
I grabbed a sentence from the link you shared:
import nltk
import pandas as pd

text = nltk.word_tokenize("My wife took me here on my birthday for breakfast and it was excellent.")
tag = nltk.pos_tag(text)
a = pd.Series(tag)
a.map(lambda x: 1 if x[1] == "VBD" else 0).sum()
# this returns 2
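To get the per-review "being verb" count you asked about, a hedged sketch along the same lines (it assumes your tagged column is called pos_tag, as in your own answer, and checks the word itself because the Penn Treebank tagset has no dedicated tag for forms of "to be"):

import pandas as pd

BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}

def count_being_verbs(tagged):
    # count forms of 'to be' in a list of (word, POS) tuples
    return sum(1 for word, pos in tagged if word.lower() in BE_FORMS and pos.startswith("VB"))

# toy frame standing in for your already pos_tagged reviews
df = pd.DataFrame({"pos_tag": [[("it", "PRP"), ("was", "VBD"), ("excellent", "JJ")],
                               [("service", "NN"), ("is", "VBZ"), ("slow", "JJ")]]})
df["being_verb_count"] = df["pos_tag"].apply(count_being_verbs)
print(df["being_verb_count"].tolist())   # [1, 1]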
I am writing code for a Naive Bayes model (I know there's a standard implementation in Sklearn, but I want to code it anyway) - for this I have, say, upwards of 30 features, against all of which I have the corresponding click & impression counts (treat them as True/False flags).
What I need, then, is to calculate
P(Click | F1, F2, ..., F30) = (P(Click) * P(F1|Click) * P(F2|Click) * ... * P(F30|Click)) / P(F1, F2, ..., F30), and
P(NoClick | F1, F2, ..., F30) = (P(NoClick) * P(F1|NoClick) * P(F2|NoClick) * ... * P(F30|NoClick)) / P(F1, F2, ..., F30)
where I will disregard the denominator, as it affects both the Click & NoClick behaviour similarly.
Example, for two features, day_custom & is_tablet_phone, I have
is_tablet_phone click impression
FALSE 375417 28291280
TRUE 17743 4220980
day_custom click impression
Fri 77592 7029703
Mon 43576 3773571
Sat 65950 5447976
Sun 66460 5031271
Thu 74329 6971541
Tue 55282 4575114
Wed 51555 4737712
My approach to the problem: assuming I read the individual files into a data frame, one after another, I want the ability to calculate & store the corresponding probabilities back in a file, which I will then use for real-time prediction of the probability of click vs no click.
One possible structure of "processed file" thus would be -:
Here's my entire code -:
In the full-blown example, I am traversing the entire directory structure (of 30 txt files, one at a time, from the base path) - which is why I need the ability to create "names" at runtime.
import os

file_paths = []
for base_path in base_paths:
    for root, dirs, files in os.walk(base_path):
        for file in files:
            file_paths.append(os.path.join(root, file))
For reasons of tractability, follow along from here by taking these 2 txt files as sample input:
import collections
import pandas as pd

file_paths = ['/home/ekta/Desktop/NB/day_custom.txt', '/home/ekta/Desktop/NB/is_tablet_phone.txt']
flag = 0
for filehandle in file_paths:
    feature_name = filehandle.split("/")[-1].split(".")[0]
    df = pd.read_csv(filehandle, skiprows=0, encoding='utf-8', sep='\t', index_col=False,
                     dtype={feature_name: object, 'click': int, 'impression': int})
    df2 = df[(df.impression - df.click > 0) & (df.click > 0)]
    if flag == 0:
        MySumC, MySumNC, Mydict = 0, 0, collections.defaultdict(dict)
        MySumC = sum(df2['click'])
        MySumNC = sum(df2['impression'])
        P_C = float(MySumC) / float(MySumC + MySumNC)
        P_NC = 1 - P_C
    for feature_value in df2[feature_name]:
        Mydict[feature_name + '_' + feature_value] = {
            'P_' + feature_name + '_' + feature_value + '_C': (df2[df2[feature_name] == feature_value]['click'] * float(P_C)) / MySumC,
            'P_' + feature_name + '_' + feature_value + '_NC': (df2[df2[feature_name] == feature_value]['impression'] * float(P_NC)) / MySumNC}
    flag = 1  # set the flag to 1 because we don't need to compute MySumC, MySumNC, P_C & P_NC again
Question :
It looks like THIS loop is the killer here. Also, intuitively, looping over a dataframe is BAD practice. How can I rewrite this, perhaps using map/apply?
for feature_value in df2[feature_name]:
    Mydict[feature_name + '_' + feature_value] = {
        'P_' + feature_name + '_' + feature_value + '_C': (df2[df2[feature_name] == feature_value]['click'] * float(P_C)) / MySumC,
        'P_' + feature_name + '_' + feature_value + '_NC': (df2[df2[feature_name] == feature_value]['impression'] * float(P_NC)) / MySumNC}
What I need in Mydict, which is a hash storing each feature name and each feature value, is something like:
{'day_custom_Mon': {'P_day_custom_Mon_C': 0.787, 'P_day_custom_Mon_NC': 0.556},
 'day_custom_Tue': {'P_day_custom_Tue_C': 0.887, 'P_day_custom_Tue_NC': 0.156},
 'day_custom_Wed': {'P_day_custom_Wed_C': 0.087, 'P_day_custom_Wed_NC': 0.167},
 'day_custom_Thu': {'P_day_custom_Thu_C': 0.947, 'P_day_custom_Thu_NC': 0.196},
 'is_tablet_phone_True': {'P_is_tablet_phone_True_C': 0.787, 'P_is_tablet_phone_True_NC': 0.066},
 'is_tablet_phone_False': {'P_is_tablet_phone_False_C': 0.787, 'P_is_tablet_phone_False_NC': 0.077},
 .. and so on ..}
PS: I just made up those float numbers, but you get the point.
Also, because I will later serialize this file & pass it to Redis directly, for other systems to feed on it in a cron-job manner, I need to preserve some sort of dynamic naming.
What I tried -:
Since I am reading feature_name as
feature_name = filehandle.split("/")[-1].split(".")[0]  # thereby abstracting & creating variables dynamically
def funct1(row):
    return row[feature_name]

def funct2(row):
    return row['click']

def funct3(row):
    return row['impression']
then..
(df2.apply(funct2, axis=1) * float(P_C)) / MySumC and (df2.apply(funct3, axis=1) * float(P_NC)) / MySumNC give me both the values I need for a feature_value (say Mon, Tue, Wed, and so on) of a feature_name (say, day_custom).
I also know that df2.apply(funct1, axis=1) contains part of my custom "names" (i.e. the feature values); how would I then build these names using map/apply?
I.e. I will have the values, but how would I create the key 'P_'+feature_name+'_'+feature_value+'_C', since the feature value after apply is returned as a Series object?
Check out the following recipe, which does exactly what you want using only data frame manipulations. I also simplified the actual frequency calculation a bit ;)
#set the feature name values as the index of the data frame
df2.set_index(feature_name, inplace=True)
#This is what df2.set_index() looks like:
# click impression
#day_custom
#Fri 9917 3163
#Mon 2566 3818
#Sat 8725 7753
#Sun 6938 8642
#Thu 6136 2556
#Tue 5234 2356
#Wed 9463 9433
#rename the index of your data frame
df2.rename(index=lambda x:"%s_%s"%('day_custom', x), inplace=True)
#compute the total sum of your data frame entries
totsum = float(df2.values.sum())
#use apply to multiply every data frame element by the total sum
df2 = df2.applymap(lambda x:x/totsum)
#transpose the data frame to have the following shape
#day_custom day_custom_Fri day_custom_Mon ...
#click 0.102019 0.037468 ...
#impression 0.087661 0.045886 ...
#
#
dftranspose = df2.T
# template kw for formatting
templatekw = {'click':"P_%s_C", 'impression':"P_%s_NC"}
# build a list of small data frames with correct index names P_%s_NC etc
dflist = [dftranspose[[col]].rename(lambda x:templatekw[x]%col) for col in dftranspose]
#use the concatenate function to produce a sparse dictionary
MyDict= pd.concat(dflist).to_dict()
Instead of assigning to MyDict at the end, you can use the update-method during the loop.
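A tiny self-contained sketch of that variant; the two frames below are only toy stand-ins for the per-file dflist built by the recipe above:

import pandas as pd

Mydict = {}

dflist_day = [pd.DataFrame({'day_custom_Fri': [0.0020, 0.1850]},
                           index=['P_day_custom_Fri_C', 'P_day_custom_Fri_NC'])]
dflist_tablet = [pd.DataFrame({'is_tablet_phone_TRUE': [0.0005, 0.1110]},
                              index=['P_is_tablet_phone_TRUE_C', 'P_is_tablet_phone_TRUE_NC'])]

# update Mydict inside the loop instead of assigning once at the end
for dflist in (dflist_day, dflist_tablet):
    Mydict.update(pd.concat(dflist).to_dict())

print(sorted(Mydict))   # ['day_custom_Fri', 'is_tablet_phone_TRUE']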
For understanding the comments below, here is my original answer:
Try to use a pivot_table:
def clickfunc(x):
    return np.sum(x) * P_C / MySumC

def impressionfunc(x):
    return np.sum(x) * P_NC / MySumNC

newtable = df2.pivot_table(['click', 'impression'], feature_name,
                           aggfunc=[clickfunc, impressionfunc])
#transpose the table for the dictionary to have the right form
newtable = newtable.T
#to_dict functionality already gives the correct result
MyDict = newtable.to_dict()
#rename by copying
copydict = {}
for feature_value, subdict in MyDict.items():
    word = feature_name + "_" + feature_value
    copydict[word] = {'P_' + word + '_C': subdict['click'],
                      'P_' + word + '_NC': subdict['impression']}
This gives you the result you want in copydict
itertuples() is what worked for me (worked at lightspeed) - though it is still not using the map/apply approach that I so much wanted to see. itertuples on a pandas dataframe returns the whole row, so I no longer have to do df2[df2[feature_name]==feature_value]['click'] - be aware that this matching by value is not only expensive, but also undesired, since it may return a series if there are duplicate rows. itertuples solves that problem very elegantly, though I then need to access the individual objects/columns by integer indexes, which means less re-usable code. I could abstract this, but it won't be like accessing by column names, the status quo.
for row in df2.itertuples():
    Mydict[feature_name + '_' + str(row[1])] = {
        'P_' + feature_name + '_' + str(row[1]) + '_C': (row[2] * float(P_C)) / MySumC,
        'P_' + feature_name + '_' + str(row[1]) + '_NC': (row[3] * float(P_NC)) / MySumNC}
Note that I am accessing each column in the row by row[1], row[2] and so on. For example, row holds (0, u'Fri', 77592, 7029703).
Post this I get
dict(Mydict)
{'day_custom_Thu': {'P_day_custom_Thu_NC': 0.18345372640838162, 'P_day_custom_Thu_C': 0.0019559423132143377}, 'day_custom_Mon': {'P_day_custom_Mon_C': 0.0011466875948906617, 'P_day_custom_Mon_NC': 0.099300235316209587}, 'day_custom_Sat': {'P_day_custom_Sat_NC': 0.14336163246883712, 'P_day_custom_Sat_C': 0.0017354517827023852}, 'day_custom_Tue': {'P_day_custom_Tue_C': 0.001454726996987919, 'P_day_custom_Tue_NC': 0.1203925662982053}, 'day_custom_Sun': {'P_day_custom_Sun_NC': 0.13239618235343156, 'P_day_custom_Sun_C': 0.0017488722589598259}, 'is_tablet_phone_TRUE': {'P_is_tablet_phone_TRUE_NC': 0.11107365073163174, 'P_is_tablet_phone_TRUE_C': 0.00046690100046229593}, 'day_custom_Wed': {'P_day_custom_Wed_NC': 0.12467127727567069, 'P_day_custom_Wed_C': 0.0013566522616712882}, 'day_custom_Fri': {'P_day_custom_Fri_NC': 0.1849842396242351, 'P_day_custom_Fri_C': 0.0020418070466026303}, 'is_tablet_phone_FALSE': {'P_is_tablet_phone_FALSE_NC': 0.74447539516197614, 'P_is_tablet_phone_FALSE_C': 0.0098789704610580936}}
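As a side note, in recent pandas versions (0.19 and later) itertuples() returns namedtuples, so you can still reach the columns by name rather than by position; a hedged variant of the loop above, using the same variables as in the question, would be:

for row in df2.itertuples():
    fv = str(getattr(row, feature_name))   # e.g. 'Fri'; getattr because the column name is held in a variable
    Mydict[feature_name + '_' + fv] = {
        'P_' + feature_name + '_' + fv + '_C': (row.click * float(P_C)) / MySumC,
        'P_' + feature_name + '_' + fv + '_NC': (row.impression * float(P_NC)) / MySumNC}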