Dataframe Column is not Read as List in Lambda Function - pandas

I have a dataframe which contains list value, let us call it df1:
Text
-------
["good", "job", "we", "are", "so", "proud"]
["it", "was", "his", "honor", "as", "well", "as", "guilty"]
And also another dataframe, df2:
Word Value
-------------
good 7.47
proud 8.03
honor 7.66
guilty 2.63
I want to use apply with a lambda to create df1['score'], where each value is the average of the df2 values for the words in that row's list that also appear in df2's Word column. Currently, this is my code:
def score(list_word):
    sum = count = mean = sd = 0
    for word in list_word:
        if word in df2['Word']:
            sum = sum + df2.loc[df2['Word'] == word, 'Value'].iloc[0]
            count = count + 1
    if count != 0:
        return sum/count
    else:
        return 0
df['score'] = df.apply(lambda x: score(x['words']), axis=1)
This is what I envision:
Score
-------
7.75 #average of good (7.47) and proud (8.03)
5.145 #average of honor (7.66) and guilty (2.63)
However, it seems x['words'] is not passed as a list object, and I do not know how to modify the score function to handle the object type. I tried converting it with the tolist() method, but to no avail. Any help appreciated.

Given df1 and df2, use explode and map. Note that explode requires pandas 0.25 or later.
#import ast
#df1.Text=df1.Text.apply(ast.literal_eval)
#If the lists are stored as strings, convert them back to lists with ast first
s=df1.Text.explode().map(dict(zip(df2.Word,df2.Value))).groupby(level=0).mean()
0 7.750
1 5.145
Name: Text, dtype: float64
Update
df1.Text.explode().to_frame('Word').reset_index().merge(df2,how='left').groupby('index')[['Value']].mean()
Value
index
0 7.750
1 5.145
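As a side note on why the membership test in the question never matches: the in operator on a Series checks the index labels, not the values. The sketch below rebuilds the two frames from the question (how they were constructed is an assumption) and shows both the pitfall and the explode/map approach, averaging per original row with groupby(level=0).

```python
import pandas as pd

df1 = pd.DataFrame({"Text": [
    ["good", "job", "we", "are", "so", "proud"],
    ["it", "was", "his", "honor", "as", "well", "as", "guilty"],
]})
df2 = pd.DataFrame({"Word": ["good", "proud", "honor", "guilty"],
                    "Value": [7.47, 8.03, 7.66, 2.63]})

# `in` on a Series tests the index labels (0..3 here), not the values
label_hit = "good" in df2["Word"]            # False
value_hit = "good" in df2["Word"].tolist()   # True

# One row per word, look up each word's value (NaN if absent),
# then average per original row; mean() skips NaN by default
lookup = dict(zip(df2["Word"], df2["Value"]))
df1["score"] = df1["Text"].explode().map(lookup).groupby(level=0).mean()
print(df1["score"])
```

Rows whose lists contain no known word would come out as NaN rather than 0; chaining .fillna(0) would reproduce the question's fallback.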

Related

Flightradar24 pandas groupby and vectorize. A no looping solution

I am looking to perform a fast operation on flightradar data to see if the speed in distance matches the speed reported. I have multiple flights and was told not to run double loops on pandas dataframes. Here is a sample dataframe:
import pandas as pd
from datetime import datetime
from shapely.geometry import Point
from geopy.distance import distance
dates = ['2020-12-26 15:13:01', '2020-12-26 15:13:07','2020-12-26 15:13:19','2020-12-26 15:13:32','2020-12-26 15:13:38']
datetimes = [datetime.fromisoformat(date) for date in dates]
data = {'UTC': datetimes,
'Callsign': ["1", "1","2","2","2"],
'Position':[Point(30.542175,-91.13999200000001), Point(30.546204,-91.14020499999999),Point(30.551443,-91.14417299999999),Point(30.553909,-91.15136699999999),Point(30.554489,-91.155075)]
}
df = pd.DataFrame(data)
What I want to do is add a new column called "dist". This column will be 0 if it is the first element of a new callsign but if not it will be the distance between a point and the previous point.
The resulting df should look like this:
df1 = df
dist = [0,0.27783309075379214,0,0.46131362750613436,0.22464461718704595]
df1['dist'] = dist
What I have tried is to first assign a group index:
df['group_index'] = df.groupby('Callsign').cumcount()
Then groupby
Then try and apply the function:
df['dist'] = df.groupby('Callsign').apply(lambda g: 0 if g.group_index == 0 else distance((g.Position.x , g.Position.y),
(g.Position.shift().x , g.Position.shift().y)).miles)
I was hoping this would give me the 0 for the first index of each group and then run the distance function on all others and return a value in miles. However it does not work.
The code errors out for at least one reason which is because the .x and .y attributes of the shapely object are being called on the series rather than the object.
Any ideas on how to fix this would be much appreciated.
Sort df by callsign then timestamp
Compute distances between adjacent rows using a temporary column of shifted points
For the first row of each new callsign, set distance to 0
Drop temporary column
df = df.sort_values(by=['Callsign', 'UTC'])
df['Position_prev'] = df['Position'].shift().bfill()
def get_dist(row):
    return distance((row['Position'].x, row['Position'].y),
                    (row['Position_prev'].x, row['Position_prev'].y)).miles
df['dist'] = df.apply(get_dist, axis=1)
# Flag row if callsign is different from previous row callsign
new_callsign_rows = df['Callsign'] != df['Callsign'].shift()
# Zero out the first distance of each callsign group
df.loc[new_callsign_rows, 'dist'] = 0.0
# Drop shifted column
df = df.drop(columns='Position_prev')
print(df)
UTC Callsign Position dist
0 2020-12-26 15:13:01 1 POINT (30.542175 -91.13999200000001) 0.000000
1 2020-12-26 15:13:07 1 POINT (30.546204 -91.14020499999999) 0.277833
2 2020-12-26 15:13:19 2 POINT (30.551443 -91.14417299999999) 0.000000
3 2020-12-26 15:13:32 2 POINT (30.553909 -91.15136699999999) 0.461314
4 2020-12-26 15:13:38 2 POINT (30.554489 -91.155075) 0.224645
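If a row-wise apply is too slow, here is a rough sketch of a fully vectorized variant. It makes two assumptions that are not in the original: the coordinates are stored in plain lat and lon columns instead of shapely Points, and geopy's geodesic distance is swapped for the haversine formula on a sphere, so the numbers differ slightly from the ones above (well under one percent here).

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, spherical approximation

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between arrays of points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

flights = pd.DataFrame({
    "UTC": pd.to_datetime(["2020-12-26 15:13:01", "2020-12-26 15:13:07",
                           "2020-12-26 15:13:19", "2020-12-26 15:13:32",
                           "2020-12-26 15:13:38"]),
    "Callsign": ["1", "1", "2", "2", "2"],
    "lat": [30.542175, 30.546204, 30.551443, 30.553909, 30.554489],
    "lon": [-91.139992, -91.140205, -91.144173, -91.151367, -91.155075],
}).sort_values(["Callsign", "UTC"])

# Shifting inside each callsign group leaves NaN on the first row of every group
prev = flights.groupby("Callsign")[["lat", "lon"]].shift()
flights["dist"] = haversine_miles(prev["lat"], prev["lon"],
                                  flights["lat"], flights["lon"]).fillna(0.0)
print(flights)
```

Because the shift happens per group, no temporary flag for "new callsign" is needed; the NaN rows are exactly the rows that should be 0.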

Indexing lists in a Pandas dataframe column based on variable length

I've got a column in a Pandas dataframe comprised of variable-length lists and I'm trying to find an efficient way of extracting elements conditional on list length. Consider this minimal reproducible example:
t = pd.DataFrame({'a':[['1234','abc','444'],
['5678'],
['2468','def']]})
Say I want to extract the 2nd element (where relevant) into a new column, and use NaN otherwise. I was able to get it in a very inefficient way:
_ = []
for index,row in t.iterrows():
    if (len(row['a']) > 1):
        _.append(row['a'][1])
    else:
        _.append(np.nan)
t['element_two'] = _
And I gave an attempt using np.where(), but I'm not specifying the 'if' argument correctly:
np.where(t['a'].str.len() > 1, lambda x: x['a'][1], np.nan)
Corrections and tips to other solutions would be greatly appreciated! I'm coming from R where I take vectorization for granted.
I'm on pandas 0.25.3 and numpy 1.18.1.
Use the str accessor:
n = 2
t['second'] = t['a'].str[n-1]
print(t)
a second
0 [1234, abc, 444] abc
1 [5678] NaN
2 [2468, def] def
While not incredibly efficient, apply is at least clean:
t['a'].apply(lambda _: np.nan if len(_)<2 else _[1])
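On the np.where attempt from the question: np.where expects array-like values for both branches, not a function, so the candidate values have to be computed first. A small sketch of both options, reusing the example frame:

```python
import numpy as np
import pandas as pd

t = pd.DataFrame({"a": [["1234", "abc", "444"], ["5678"], ["2468", "def"]]})

# .str[1] indexes into each list and yields NaN when the list is too short
second = t["a"].str[1]

# np.where needs array-like branches, so pass the already-extracted values
second_np = np.where(t["a"].str.len() > 1, t["a"].str[1], np.nan)

print(second.tolist())
print(second_np)
```

Since .str[1] already returns NaN for short lists, the np.where wrapper is only useful when the fallback should be something other than NaN.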

Replacing Specific Values in a Pandas Column [duplicate]

I'm trying to replace the values in one column of a dataframe. The column ('female') only contains the values 'female' and 'male'.
I have tried the following:
w['female']['female']='1'
w['female']['male']='0'
But receive the exact same copy of the previous results.
I would ideally like to get some output which resembles the following loop element-wise.
if w['female'] =='female':
    w['female'] = '1';
else:
    w['female'] = '0';
I've looked through the gotchas documentation (http://pandas.pydata.org/pandas-docs/stable/gotchas.html) but cannot figure out why nothing happens.
Any help will be appreciated.
If I understand right, you want something like this:
w['female'] = w['female'].map({'female': 1, 'male': 0})
(Here I convert the values to numbers instead of strings containing numbers. You can convert them to "1" and "0", if you really want, but I'm not sure why you'd want that.)
The reason your code doesn't work is because using ['female'] on a column (the second 'female' in your w['female']['female']) doesn't mean "select rows where the value is 'female'". It means to select rows where the index is 'female', of which there may not be any in your DataFrame.
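To make the difference concrete, here is a small sketch with a made-up string index (the labels "a", "b", "c" are an assumption for illustration only). Square brackets on a column look up index labels; a comparison produces a boolean mask over the values.

```python
import pandas as pd

w = pd.DataFrame({"female": ["female", "male", "female"]}, index=["a", "b", "c"])

# Indexing a column with a label returns the value stored at that index label
by_label = w["female"]["a"]

# Comparing the column to a value gives one True/False per row
mask = w["female"] == "female"

# Assign the whole column at once instead of chaining two indexers
w["female"] = mask.astype(int)
print(w)
```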
You can edit a subset of a dataframe by using loc:
df.loc[<row selection>, <column selection>]
In this case:
w.loc[w.female != 'female', 'female'] = 0
w.loc[w.female == 'female', 'female'] = 1
w.female.replace(to_replace=dict(female=1, male=0), inplace=True)
See pandas.DataFrame.replace() docs.
Slight variation:
w.female.replace(['male', 'female'], [0, 1], inplace=True)
This should also work:
w.female[w.female == 'female'] = 1
w.female[w.female == 'male'] = 0
This is very compact:
w['female'][w['female'] == 'female']=1
w['female'][w['female'] == 'male']=0
Another good one:
w['female'] = w['female'].replace(regex='female', value=1)
w['female'] = w['female'].replace(regex='male', value=0)
You can also use apply with the dictionary's .get method, i.e.
w['female'] = w['female'].apply({'male':0, 'female':1}.get)
Full example:
w = pd.DataFrame({'female':['female','male','female']})
print(w)
Dataframe w:
female
0 female
1 male
2 female
Using apply to replace values from the dictionary:
w['female'] = w['female'].apply({'male':0, 'female':1}.get)
print(w)
Result:
female
0 1
1 0
2 1
Note: apply with a dictionary's .get should only be used if every possible value in the column is defined in the dictionary; otherwise the values that are not defined come back as None.
Using Series.map with Series.fillna
If your column contains more strings than only female and male, Series.map on its own is not enough, since it returns NaN for every value that is missing from the dictionary.
That's why we have to chain it with fillna:
Example why .map fails:
df = pd.DataFrame({'female':['male', 'female', 'female', 'male', 'other', 'other']})
female
0 male
1 female
2 female
3 male
4 other
5 other
df['female'].map({'female': '1', 'male': '0'})
0 0
1 1
2 1
3 0
4 NaN
5 NaN
Name: female, dtype: object
For the correct method, we chain map with fillna, so we fill the NaN with values from the original column:
df['female'].map({'female': '1', 'male': '0'}).fillna(df['female'])
0 0
1 1
2 1
3 0
4 other
5 other
Name: female, dtype: object
Alternatively there is the built-in function pd.get_dummies for these kinds of assignments:
w['female'] = pd.get_dummies(w['female'],drop_first = True)
This gives you a data frame with two columns, one for each value that occurs in w['female'], of which you drop the first (because you can infer it from the one that is left). The new column is automatically named as the string that you replaced.
This is especially useful if you have categorical variables with more than two possible values. This function creates as many dummy variables needed to distinguish between all cases. Be careful then that you don't assign the entire data frame to a single column, but instead, if w['female'] could be 'male', 'female' or 'neutral', do something like this:
w = pd.concat([w, pd.get_dummies(w['female'], drop_first = True)], axis = 1)
w.drop('female', axis = 1, inplace = True)
Then you are left with two new columns giving you the dummy coding of 'female' and you got rid of the column with the strings.
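A minimal sketch of the three-category case described above, using a made-up column with 'male', 'female' and 'neutral'. The dummy columns come out in sorted order, so drop_first removes 'female' here.

```python
import pandas as pd

w = pd.DataFrame({"female": ["male", "female", "neutral", "female"]})

# One indicator column per category; the first (sorted) category is dropped
dummies = pd.get_dummies(w["female"], drop_first=True)

# Replace the string column with the indicator columns
w = pd.concat([w.drop(columns="female"), dummies], axis=1)
print(w)
```

Depending on the pandas version the indicator columns are integers or booleans; .astype(int) makes that explicit if it matters.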
w.replace({'female':{'female':1, 'male':0}}, inplace = True)
The above code will replace 'female' with 1 and 'male' with 0, only in the column 'female'
There is also a function in pandas called factorize which you can use to automatically do this type of work. It converts labels to numbers: ['male', 'female', 'male'] -> [0, 1, 0]. See this answer for more information.
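A quick sketch of what factorize returns for that example. Note that the codes follow the order of first appearance, so which label gets 0 depends on the data, not on a mapping you choose.

```python
import pandas as pd

codes, uniques = pd.factorize(["male", "female", "male"])

print(codes)    # integer code for every element
print(uniques)  # the distinct labels, in order of first appearance
```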
w.female = np.where(w.female=='female', 1, 0)
The line above is for anyone looking for a numpy solution. It is useful for replacing values based on a condition, since both the if and the else branches are built into np.where(). The solutions that use df.replace() may not be feasible if the column included many unique values in addition to 'male', all of which should be replaced with 0.
Another solution is to use df.where() and df.mask() in succession. This is because neither of them implements an else condition.
w.female.where(w.female=='female', 0, inplace=True) # replace where condition is False
w.female.mask(w.female=='female', 1, inplace=True) # replace where condition is True
dic = {'female':1, 'male':0}
w['female'] = w['female'].replace(dic)
.replace accepts a dictionary that maps each old value to its new value, so you can adjust the mapping as needed.
It is worth pointing out which type of object each of the methods suggested above works on: a Series or a DataFrame.
When you select a column with w.female or w['female'] you get back a Series; w[['female']] (a list of column names) gives you a DataFrame.
Both Series and DataFrame have a .replace method, so it can be used in either case.
When you use .loc or .iloc to select a single column or a single row you get back a Series, which also supports methods like apply and map.
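A short check of which selection returns which type can be sketched as follows (the frame is a made-up three-row example; the string replacements mirror the '1' and '0' from the question):

```python
import pandas as pd

w = pd.DataFrame({"female": ["female", "male", "female"]})

col_as_series = w["female"]    # same object type as w.female
col_as_frame = w[["female"]]   # a list of column names selects a DataFrame
row_as_series = w.loc[0]       # a single row label selects a Series

# .replace is available on both object types
mapping = {"female": "1", "male": "0"}
from_series = col_as_series.replace(mapping)
from_frame = col_as_frame.replace(mapping)

print(type(col_as_series).__name__, type(col_as_frame).__name__)
```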
To answer the question more generically so it applies to more use cases than just what the OP asked, consider this solution. I used jfs's solution to help me. Here, we create two functions that feed each other and can be used whether you know the exact replacements or not.
import numpy as np
import pandas as pd
class Utility:
    @staticmethod
    def rename_values_in_column(column: pd.Series, name_changes: dict = None) -> pd.Series:
        """
        Renames the distinct names in a column. If no dictionary is provided for the exact name changes, the column
        is returned unchanged; use create_unique_values_for_column to build <column_name>_count names.
        Ex. female_1, female_2, etc.
        :param column: The column in your dataframe you would like to alter.
        :param name_changes: A dictionary of the old values to the new values you would like to change.
        Ex. {1234: "User A"} This would change all occurrences of 1234 to the string "User A" and leave the other values as they were.
        By default, this is an empty dictionary.
        :return: The same column with the replaced values
        """
        name_changes = name_changes if name_changes else {}
        new_column = column.replace(to_replace=name_changes)
        return new_column

    @staticmethod
    def create_unique_values_for_column(column: pd.Series, except_values: list = None) -> dict:
        """
        Creates a dictionary where the key is the existing column item and the value is the new item to replace it.
        The returned dictionary can then be passed to rename_values_in_column to rename all the distinct values in a
        column.
        Ex. column ["statement"]["I", "am", "old"] would return
        {"I": "statement_1", "am": "statement_2", "old": "statement_3"}
        If you would like a value to remain the same, enter the values you would like to stay in the except_values.
        Ex. except_values = ["I", "am"]
        column ["statement"]["I", "am", "old"] would return
        {"old": "statement_3"}
        :param column: A pandas Series for the column with the values to replace.
        :param except_values: A list of values you do not want to have changed.
        :return: A dictionary that maps the old values to their respective new values.
        """
        except_values = except_values if except_values else []
        column_name = column.name
        distinct_values = np.unique(column)
        name_mappings = {}
        count = 1
        for value in distinct_values:
            if value not in except_values:
                name_mappings[value] = f"{column_name}_{count}"
            # Count every distinct value, so excluded values still use up a number
            count += 1
        return name_mappings
For the OP's use case, it is simple enough to just use
w["female"] = Utility.rename_values_in_column(w["female"], name_changes = {"female": 0, "male":1})
However, it is not always so easy to know all of the different unique values within a data frame that you may want to rename. In my case, the string values for a column are hashed values so they hurt the readability. What I do instead is replace those hashed values with more readable strings thanks to the create_unique_values_for_column function.
df["user"] = Utility.rename_values_in_column(
df["user"],
Utility.create_unique_values_for_column(df["user"])
)
This will change my user column values from ["1a2b3c", "a12b3c","1a2b3c"] to ["user_1", "user_2", "user_1"]. Much easier to compare, right?
If you have only two classes you can use equality operator. For example:
df = pd.DataFrame({'col1':['a', 'a', 'a', 'b']})
df['col1'].eq('a').astype(int)
# (df['col1'] == 'a').astype(int)
Output:
0 1
1 1
2 1
3 0
Name: col1, dtype: int64

Find rows in dataframe column containing questions

I have a TSV file that I loaded into a pandas dataframe to do some preprocessing and I want to find out which rows have a question in it, and output 1 or 0 in a new column. Since it is a TSV, this is how I'm loading it:
import pandas as pd
df = pd.read_csv('queries-10k-txt-backup', sep='\t')
Here's a sample of what it looks like:
QUERY FREQ
0 hindi movies for adults 595
1 are panda dogs real 383
2 asuedraw winning numbers 478
3 sentry replacement keys 608
4 rebuilding nicad battery packs 541
After dropping empty rows, duplicates, and the FREQ column(not needed for this), I wrote a simple function to check the QUERY column to see if it contains any words that make the string a question:
df_test = df.drop_duplicates()
df_test = df_test.dropna()
df_test = df_test.drop(['FREQ'], axis = 1)
def questions(row):
    questions_list = ["what","when","where","which","who","whom","whose","why","why don't",
                      "how","how far","how long","how many","how much","how old","how come","?"]
    if row['QUERY'] in questions_list:
        return 1
    else:
        return 0
df_test['QUESTIONS'] = df_test.apply(questions, axis=1)
But once I check the new dataframe, even though it creates the new column, all the values are 0. I'm not sure if my logic is wrong in the function, I've used something similar with dataframe columns which just have one word and if it matches, it'll output a 1 or 0. However, that same logic doesn't seem to be working when the column contains a phrase/sentence like this use case. Any input is really appreciated!
If you want to check whether a query contains any of the strings in questions_list as a substring, use the str.contains method. Escape each entry with re.escape first, because "?" is a regex metacharacter and the unescaped pattern would fail to compile:
import re
questions_list = ["what","when","where","which","who","whom","whose","why",
"why don't", "how","how far","how long","how many",
"how much","how old","how come","?"]
pattern = "|".join(map(re.escape, questions_list)) # generate an escaped regex from your list
df_test['QUESTIONS'] = df_test['QUERY'].str.contains(pattern)
Simplified example:
df = pd.DataFrame({
'QUERY': ['how do you like it', 'what\'s going on?', 'quick brown fox'],
'ID': [0, 1, 2]})
Create a pattern:
pattern = '|'.join(['what', 'how'])
pattern
Out: 'what|how'
Use it:
df['QUERY'].str.contains(pattern)
Out[12]:
0 True
1 True
2 False
Name: QUERY, dtype: bool
If you're not familiar with regexes, there's a quick python re reference. For the symbol '|', the explanation is
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way
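One thing to watch with plain alternation is that it matches substrings, so "how" also matches inside "show". A hedged sketch of one way around that, using word boundaries for the words and a separately escaped "?" (the word list and queries here are made up for illustration):

```python
import re
import pandas as pd

question_words = ["what", "how", "why don't"]
queries = pd.Series(["how do you like it", "is it real?",
                     "show me keys", "quick brown fox"])

# Escape each entry so regex metacharacters are treated literally
escaped = [re.escape(word) for word in question_words]

# \b keeps "how" from matching inside "show"; the literal "?" is handled separately
pattern = r"\b(?:" + "|".join(escaped) + r")\b|\?"

flags = queries.str.contains(pattern).astype(int)
print(flags.tolist())
```

The non-capturing group (?:...) is used on purpose: str.contains warns when the pattern contains capturing groups.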
IIUC, you need to find out whether the first word of the string is in the question list: if yes return 1, else 0. In your function, rather than checking whether the entire string is in the question list, split the string and check whether the first element is in the question list.
def questions(row):
    questions_list = ["are","what","when","where","which","who","whom","whose","why","why don't","how","how far","how long","how many","how much","how old","how come","?"]
    if row['QUERY'].split()[0] in questions_list:
        return 1
    else:
        return 0
df['QUESTIONS'] = df.apply(questions, axis=1)
You get
QUERY FREQ QUESTIONS
0 hindi movies for adults 595 0
1 are panda dogs real 383 1
2 asuedraw winning numbers 478 0
3 sentry replacement keys 608 0
4 rebuilding nicad battery packs 541 0
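The same first-word check can also be written without a row-wise apply. This is only a sketch with a shortened word list and three of the sample queries:

```python
import pandas as pd

df = pd.DataFrame({"QUERY": ["hindi movies for adults",
                             "are panda dogs real",
                             "asuedraw winning numbers"]})
first_words = {"are", "what", "when", "where", "which",
               "who", "whom", "whose", "why", "how"}

# Split on whitespace, take the first token of each row, then test membership
df["QUESTIONS"] = df["QUERY"].str.split().str[0].isin(first_words).astype(int)
print(df)
```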

Correct way of iterating over pandas dataframe by date

I want to iterate over a dataframe's major axis date by date.
Example:
tdf = df.ix[date]
The issue I am having is that the type returned by df.ix changes, leaving me with 3 possible situations
If the date does not exist in tdf an error is thrown: KeyError: 1394755200000000000
If there is only one item in tdf: print type(tdf) returns
<class 'pandas.core.series.Series'>
If there is more than one item in tdf: print type(tdf) returns
<class 'pandas.core.frame.DataFrame'>
To avoid the first case I can simply wrap this in a try catch block or thanks to jxstanford, I can avoid the try catch block by using if date in df.index:
I run into the issue afterwards with an inconsistent API with a pandas series and a pandas data frame. I could solve this by checking for types but it seems I shouldn't have to do that. I would ideally like to keep the types the same. Is there a better way of doing this?
I'm running pandas 0.13.1 and I am currently loading my data from a CSV using
Here's a full example demonstrating the problem.
from pandas import DataFrame
import datetime
path_to_csv = '/home/n/Documents/port/test.csv'
df = DataFrame.from_csv(path_to_csv, index_col=3, header=0, parse_dates=True, sep=',')
start_dt = df.index.min()
end_dt = df.index.max()
dt_step = datetime.timedelta(days=1)
df.sort_index(inplace=True)
cur_dt = start_dt
while cur_dt != end_dt:
    if cur_dt in df.index:
        print type(df.ix[cur_dt])
        #run some other steps using cur_dt
    cur_dt += dt_step
An example CSV that demonstrates the problem is as follows:
value1,value2,value3,Date,type
1,2,4,03/13/14,a
2,3,3,03/21/14,b
3,4,2,03/21/14,a
4,5,1,03/27/14,b
The above code prints out
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
Is it possible to get the value of value1 from tdf in a consistent manner? Or am I stuck writing an if statement to handle each case separately?
if type(df.ix[cur_dt]) == DataFrame:
....
if type(df.ix[cur_dt]) == Series:
....
Not sure what you're trying to do with the dataframe, but this might be better than a try/except:
df = DataFrame.from_csv(path_to_csv, index_col=3, header=0, parse_dates=True, sep=',')
while cur_dt != end_dt:
    if cur_dt in df.index:
        pass  # do your thing
    cur_dt += dt_step
This toy code will return DataFrames consistently.
def framer(rows):
    if rows.ndim == 1:
        return rows.to_frame().T
    else:
        return rows
for cur_date in df.index:
    print type(framer(df.ix[cur_date]))
And this will give you the missing days:
df.resample(rule='D')
Have a look at the resample method docstring. It has its own options to fill up the missing data. And if you decide to make your multiple dates into a single one, the method you're looking at is groupby (if you want to combine values across rows) and drop_duplicates (if you want to ignore them). There is no need to reinvent the wheel.
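On current pandas versions, where .ix is no longer available, the type problem can be sidestepped with .loc and a list of labels, which selects a DataFrame even when only one row matches. The sketch below rebuilds part of the example CSV by hand (dropping value3 is an arbitrary simplification) and also shows groupby on the index as a way to visit each existing date once.

```python
import pandas as pd

df = pd.DataFrame(
    {"value1": [1, 2, 3, 4], "value2": [2, 3, 4, 5], "type": ["a", "b", "a", "b"]},
    index=pd.to_datetime(["2014-03-13", "2014-03-21", "2014-03-21", "2014-03-27"]),
)

# A list of labels always selects a DataFrame, even for a single matching row
single = df.loc[[pd.Timestamp("2014-03-13")]]
double = df.loc[[pd.Timestamp("2014-03-21")]]

# groupby on the index visits only the dates that exist, one DataFrame per date
sizes = {date: len(group) for date, group in df.groupby(level=0)}
print(type(single).__name__, type(double).__name__, list(sizes.values()))
```

With groupby there is no need to step through the calendar day by day or to test for membership first, since missing dates simply never show up.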
You can use the apply method of the DataFrame, using axis = 1 to work on each row of the DataFrame to build a Series with the same Index.
e.g.
import pandas as pd
def calculate_value(row):
    if row.date == pd.Timestamp(2014, 3, 21):
        return 0
    elif row.type == 'a':
        return row.value1 + row.value2 + row.value3
    else:
        return row.value1 * row.value2 * row.value3
df['date'] = df.index
df['NewValue'] = df.apply(calculate_value, axis=1)
modifies your example input as follows
value1 value2 value3 type NewValue date
Date
2014-03-13 1 2 4 a 7 2014-03-13
2014-03-21 2 3 3 b 0 2014-03-21
2014-03-21 3 4 2 a 0 2014-03-21
2014-03-27 4 5 1 b 20 2014-03-27
[4 rows x 6 columns]