TypeError: unhashable type: 'list'

user is Angelica and distance is 0.91
userRatings = {'Blues Traveler': 3.5, 'Broken Bells': 2.0, 'Norah Jones': 4.5, 'Phoenix': 5.0, 'Slightly Stoopid': 1.5, 'The Strokes': 2.5, 'Vampire Weekend': 2.0} and userX Ratings = {'Blues Traveler': 3.0, 'Norah Jones': 5.0, 'Phoenix': 4.0, 'Slightly Stoopid': 2.5, 'The Strokes': 3.0}
itemsToRecommendFromCurrentUser = ['Vampire Weekend', 'Broken Bells']
user is Jordyn and distance is 0.87
userRatings = {'Broken Bells': 4.5, 'Deadmau5': 4.0, 'Norah Jones': 5.0, 'Phoenix': 5.0, 'Slightly Stoopid': 4.5, 'The Strokes': 4.0, 'Vampire Weekend': 4.0} and userX Ratings = {'Blues Traveler': 3.0, 'Norah Jones': 5.0, 'Phoenix': 4.0, 'Slightly Stoopid': 2.5, 'The Strokes': 3.0}
itemsToRecommendFromCurrentUser = ['Deadmau5', 'Vampire Weekend', 'Broken Bells']
user is Chan and distance is 0.64
userRatings = {'Blues Traveler': 5.0, 'Broken Bells': 1.0, 'Deadmau5': 1.0, 'Norah Jones': 3.0, 'Phoenix': 5, 'Slightly Stoopid': 1.0} and userX Ratings = {'Blues Traveler': 3.0, 'Norah Jones': 5.0, 'Phoenix': 4.0, 'Slightly Stoopid': 2.5, 'The Strokes': 3.0}
itemsToRecommendFromCurrentUser = ['Deadmau5', 'Broken Bells']
Potential items (before removing duplicates) for recommendation are [['Vampire Weekend', 'Broken Bells'], ['Deadmau5', 'Vampire Weekend', 'Broken Bells'], ['Deadmau5', 'Broken Bells']]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-85-10dc11a19cfb> in <module>
19
20 print(f"Potential items (before removing duplicates) for recommendation are {itemsToRecommend}")
---> 21 itemsToRecommend = set (itemsToRecommend)
22 # <<<<< (5.1) YOUR CODE ENDS HERE >>>>>
23
TypeError: unhashable type: 'list'
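The traceback shows why the set() call fails: itemsToRecommend is a list of lists (one inner list per neighbouring user, as the printout above shows), and lists are unhashable, so they cannot be set elements. A minimal sketch of one possible fix, reusing the variable name from the printout, is to flatten the nested list before deduplicating:

# itemsToRecommend is currently a list of lists, e.g. [['Vampire Weekend', 'Broken Bells'], ...]
# flatten it into a single list of item names, then deduplicate with set()
flattened = [item for sublist in itemsToRecommend for item in sublist]
itemsToRecommend = set(flattened)
# -> {'Broken Bells', 'Deadmau5', 'Vampire Weekend'}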

Related

Reindex pandas DataFrame to match index with another DataFrame

I have two pandas DataFrames with different (float) indices.
I want to update the second dataframe to match the first dataframe's index, updating its values to be interpolated using the index.
This is the code I have:
from pandas import DataFrame
df1 = DataFrame([
{'time': 0.2, 'v': 1},
{'time': 0.4, 'v': 2},
{'time': 0.6, 'v': 3},
{'time': 0.8, 'v': 4},
{'time': 1.0, 'v': 5},
{'time': 1.2, 'v': 6},
{'time': 1.4, 'v': 7},
{'time': 1.6, 'v': 8},
{'time': 1.8, 'v': 9},
{'time': 2.0, 'v': 10}
]).set_index('time')
df2 = DataFrame([
{'time': 0.25, 'v': 1},
{'time': 0.5, 'v': 2},
{'time': 0.75, 'v': 3},
{'time': 1.0, 'v': 4},
{'time': 1.25, 'v': 5},
{'time': 1.5, 'v': 6},
{'time': 1.75, 'v': 7},
{'time': 2.0, 'v': 8},
{'time': 2.25, 'v': 9}
]).set_index('time')
df2 = df2.reindex(df1.index.union(df2.index)).interpolate(method='index').reindex(df1.index)
print(df2)
Output:
v
time
0.2 NaN
0.4 1.6
0.6 2.4
0.8 3.2
1.0 4.0
1.2 4.8
1.4 5.6
1.6 6.4
1.8 7.2
2.0 8.0
That's correct and what I need - however it seems a more complicated statement than it needs to be.
Is there a more concise way to do the same, requiring fewer intermediate steps?
Also, is there a way to both interpolate and extrapolate? For example, in the example data above, the linearly extrapolated value for index 0.2 could be 0.8 instead of NaN. I know I could use curve_fit, but again I feel that's more complicated than it needs to be.
One idea is numpy.interp, if the values in both indices are increasing and only one column, v, needs processing:
import numpy as np

df1['v1'] = np.interp(df1.index, df2.index, df2['v'])
print(df1)
v v1
time
0.2 1 1.0
0.4 2 1.6
0.6 3 2.4
0.8 4 3.2
1.0 5 4.0
1.2 6 4.8
1.4 7 5.6
1.6 8 6.4
1.8 9 7.2
2.0 10 8.0
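The question also asks about extrapolation, which np.interp does not do (it clamps to the edge values, which is why v1 at time 0.2 above is 1.0 rather than 0.8). One option, not part of the original answer, is scipy.interpolate.interp1d with fill_value='extrapolate'; a sketch under the same assumption of a single column v:

from scipy.interpolate import interp1d

# linear interpolator over df2, with linear extrapolation outside its index range
f = interp1d(df2.index, df2['v'], kind='linear', fill_value='extrapolate')
df1['v1'] = f(df1.index)
# the value at time 0.2 now extrapolates to 0.8 instead of clamping to 1.0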

Handling queries in pandas when a CSV input contains multiple duplicate columns?

I have a fairly simple CSV (shown below as food.csv).
When I use pandas to read the CSV, columns that have the same name automatically get renamed with a ".n" notation, as follows:
>>> import pandas as pd
>>> food = pd.read_csv("food.csv")
>>> food
Order Number Item Description Item Cost Item Description.1 Item Cost.1 Item Description.2 Item Cost.2
0 110 Chow Mein 5.00 NaN NaN NaN NaN
1 111 Cake 1.50 Chocolate 13.10 Noodle 3.75
2 112 Chocolate 11.00 Chips 5.75 NaN NaN
3 113 Sandwich 6.25 Milk 2.00 Ice 0.50
4 114 Chocolate 13.10 Water 0.25 NaN NaN
5 115 Tea 1.00 Milkshake 2.80 Chocolate 13.10
6 116 Green Tea 1.25 NaN NaN NaN NaN
7 117 Burger 2.00 Fries 3.50 NaN NaN
8 118 Chocolate 5.00 Green Tea 1.50 NaN NaN
9 119 Tonic 3.00 Burger 3.75 Milk 2.00
10 120 Orange 1.50 Milkshake 4.20 NaN NaN
>>>
food.csv:
Order Number,Item Description,Item Cost,Item Description,Item Cost,Item Description,Item Cost
110,Chow Mein,5,,,,
111,Cake,1.5,Chocolate,13.1,Noodle,3.75
112,Chocolate,11,Chips,5.75,,
113,Sandwich,6.25,Milk,2,Ice,0.5
114,Chocolate,13.1,Water,0.25,,
115,Tea,1,Milkshake,2.8,Chocolate,13.1
116,Green Tea,1.25,,,,
117,Burger,2,Fries,3.5,,
118,Chocolate,5,Green Tea,1.5,,
119,Tonic,3,Burger,3.75,Milk,2
120,Orange,1.5,Milkshake,4.2,,
As such, queries that rely on the column names will only work if they match the first column (e.g.):
>>> print(food[(food['Item Description'] == "Chocolate") & (food['Item Cost'] == 13.10)]['Order Number'].to_string(index=False))
114
While I can technically lengthen the masks to include the .1 and .2 columns, this seems inefficient, especially when the number of duplicated columns is large (in this example there are only 3 sets of duplicated columns, but other datasets have many more, so constructing a mask for each column does not scale well).
I am not sure if I am approaching this the right way, whether I am missing something simple (like in loading the CSV), or whether there is a groupby that can answer the same question (i.e. find the order numbers where the order contains an item listed as chocolate that costs $13.10).
Would the problem be different if it's something like: average all the costs of chocolates paid for all the orders?
Thanks in advance.
Here's a bit of a simpler approach with pandas' wide_to_long function (I will use the df provided by @mitoRibo in another answer).
Documentation: https://pandas.pydata.org/docs/reference/api/pandas.wide_to_long.html
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Order Number': ['Order_01', 'Order_02', 'Order_03', 'Order_04', 'Order_05', 'Order_06', 'Order_07', 'Order_08', 'Order_09', 'Order_10'],
'Item Description': ['Burger', 'Cake', 'Cake', 'Tonic', 'Green Tea', 'Sandwich', 'Orange', 'Burger', 'Cake', 'Chow Mein'],
'Item Cost': [7, 10, 4, 1, 10, 7, 9, 9, 6, 3],
'Item Description.1': ['Tonic', 'Burger', 'Green Tea', 'Sandwich', 'Orange', None, 'Chocolate', None, 'Chocolate', 'Tea'],
'Item Cost.1': [4.0, 1.0, 7.0, 7.0, 8.0, np.nan, 6.0, np.nan, 8.0, 3.0],
'Item Description.2': [None, 'Chow Mein', 'Chow Mein', 'Chocolate', 'Tea', None, 'Burger', None, 'Tea', 'Green Tea'],
'Item Cost.2': [np.nan, 8.0, 1.0, 9.0, 9.0, np.nan, 2.0, np.nan, 1.0, 9.0],
'Item Description.3': [None, 'Sandwich', 'Orange', 'Cake', 'Tonic', None, None, None, 'Sandwich', 'Burger'],
'Item Cost.3': [np.nan, 5.0, 9.0, 2.0, 7.0, np.nan, np.nan, np.nan, 8.0, 4.0],
'Item Description.4': [None, 'Green Tea', 'Burger', 'Green Tea', 'Cake', None, None, None, None, 'Orange'],
'Item Cost.4': [np.nan, 4.0, 4.0, 3.0, 10.0, np.nan, np.nan, np.nan, np.nan, 1.0],
'Item Description.5': [None, None, 'Tea', 'Burger', 'Chocolate', None, None, None, None, 'Sandwich'],
'Item Cost.5': [np.nan, np.nan, 8.0, 5.0, 1.0, np.nan, np.nan, np.nan, np.nan, 4.0],
'Item Description.6': [None, None, 'Tonic', 'Tea', 'Burger', None, None, None, None, 'Chocolate'],
'Item Cost.6': [np.nan, np.nan, 8.0, 2.0, 8.0, np.nan, np.nan, np.nan, np.nan, 9.0],
})
df.rename(columns={'Item Description': 'Item Description.0', 'Item Cost': 'Item Cost.0'}, inplace=True)
long = pd.wide_to_long(df, stubnames=['Item Description', 'Item Cost'], i="Order Number", j="num_after_col_name", sep='.')
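As a follow-up (not part of the original answer), the reshaped long table can be queried directly. A sketch using a price value that does appear in the synthetic df above (6.0):

# orders containing a Chocolate item at a given price
mask = long['Item Description'].eq('Chocolate') & long['Item Cost'].eq(6.0)
orders = long.loc[mask].index.get_level_values('Order Number').unique()
print(orders)  # only Order_07 has a 6.0 Chocolate in this synthetic data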
It's often easier to operate on a table in "long" form instead of the "wide" form you currently have.
The example code below converts an example wide df to a long df version.
In the long_df version each row is a unique Order/Item, and we no longer have to store any null values. Pandas also makes it easy to perform grouping operations on tables in long form; the group_info table at the end of the code below is one example.
You can also easily make your query of finding orders where a chocolate cost $13.10 with long_df[long_df['Description'].eq('Chocolate') & long_df['Cost'].eq(13.10)]['Order Number'].unique()
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Order Number': ['Order_01', 'Order_02', 'Order_03', 'Order_04', 'Order_05', 'Order_06', 'Order_07', 'Order_08', 'Order_09', 'Order_10'],
'Item Description': ['Burger', 'Cake', 'Cake', 'Tonic', 'Green Tea', 'Sandwich', 'Orange', 'Burger', 'Cake', 'Chow Mein'],
'Item Cost': [7, 10, 4, 1, 10, 7, 9, 9, 6, 3],
'Item Description.1': ['Tonic', 'Burger', 'Green Tea', 'Sandwich', 'Orange', None, 'Chocolate', None, 'Chocolate', 'Tea'],
'Item Cost.1': [4.0, 1.0, 7.0, 7.0, 8.0, np.nan, 6.0, np.nan, 8.0, 3.0],
'Item Description.2': [None, 'Chow Mein', 'Chow Mein', 'Chocolate', 'Tea', None, 'Burger', None, 'Tea', 'Green Tea'],
'Item Cost.2': [np.nan, 8.0, 1.0, 9.0, 9.0, np.nan, 2.0, np.nan, 1.0, 9.0],
'Item Description.3': [None, 'Sandwich', 'Orange', 'Cake', 'Tonic', None, None, None, 'Sandwich', 'Burger'],
'Item Cost.3': [np.nan, 5.0, 9.0, 2.0, 7.0, np.nan, np.nan, np.nan, 8.0, 4.0],
'Item Description.4': [None, 'Green Tea', 'Burger', 'Green Tea', 'Cake', None, None, None, None, 'Orange'],
'Item Cost.4': [np.nan, 4.0, 4.0, 3.0, 10.0, np.nan, np.nan, np.nan, np.nan, 1.0],
'Item Description.5': [None, None, 'Tea', 'Burger', 'Chocolate', None, None, None, None, 'Sandwich'],
'Item Cost.5': [np.nan, np.nan, 8.0, 5.0, 1.0, np.nan, np.nan, np.nan, np.nan, 4.0],
'Item Description.6': [None, None, 'Tonic', 'Tea', 'Burger', None, None, None, None, 'Chocolate'],
'Item Cost.6': [np.nan, np.nan, 8.0, 2.0, 8.0, np.nan, np.nan, np.nan, np.nan, 9.0],
})
# Convert table to long form
desc_cols = [c for c in df.columns if 'Desc' in c]
cost_cols = [c for c in df.columns if 'Cost' in c]
desc_df = df.melt(id_vars='Order Number', value_vars=desc_cols, value_name='Description')
cost_df = df.melt(id_vars='Order Number', value_vars=cost_cols, value_name='Cost')
long_df = pd.concat((desc_df[['Order Number','Description']], cost_df[['Cost']]), axis=1).dropna()
long_df.insert(1,'Item Number',long_df.groupby('Order Number').cumcount().add(1))
long_df = long_df.sort_values(['Order Number','Item Number'])
# Calculate group info
group_info = long_df.groupby('Order Number').agg(
ordered_chocolate = ('Description', lambda d: d.eq('Chocolate').any()),
total_cost = ('Cost','sum'),
)
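The follow-up question about averaging (the average cost of all chocolates across all orders) also becomes a one-liner on the long table; a small sketch using the long_df built above:

# mean cost across every Chocolate line item
avg_chocolate_cost = long_df.loc[long_df['Description'].eq('Chocolate'), 'Cost'].mean()
print(avg_chocolate_cost)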

First 'Group by' then plot/save as png from pandas

First I need to filter the data, then plot each group separately and save the files to a directory:
from os import path
import matplotlib.pyplot as plt
import seaborn as sns

for id in df["set"].unique():
    df2 = df.loc[df["set"] == id]
    outpath = "path/of/your/folder/"
    sns.set_style("whitegrid", {'grid.linestyle': '-'})
    plt.figure(figsize=(12, 8))
    ax1 = sns.scatterplot(data=df2, x="x", y="y", hue="result", markers=['x'], s=1000)
    ax1.get_legend().remove()
    ax1.set_yticks((0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5), minor=False)
    ax1.set_xticks([0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5, 11.5, 12.6], minor=False)
    fig = ax1.get_figure()
    fig.savefig(path.join(outpath, "{0}.png".format(id)), dpi=300)
This worked for me but it is very slow:
groups = df.groupby("set")
for name, group in groups:
    sns.set_style("whitegrid", {'grid.linestyle': '-'})
    plt.figure(figsize=(12, 8))
    ax1 = sns.scatterplot(data=group, x="x", y="y", hue="result", markers=['x'], s=1000)
    ax1.get_legend().remove()
    ax1.set_yticks((0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5), minor=False)
    ax1.set_xticks([0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5, 11.5, 12.6], minor=False)
    fig = ax1.get_figure()
    fig.savefig("directory/{0}.png".format(name), dpi=300)

fast_executemany=True throwing DBAPIError: Function sequence error in sqlalchemy version 1.3.5

Since SQLAlchemy 1.3.0, released 2019-03-04,
SQLAlchemy now supports
engine = create_engine(sqlalchemy_url, fast_executemany=True)
for the mssql+pyodbc dialect. I.e.,
it is no longer necessary to define a function and use
@event.listens_for(engine, 'before_cursor_execute').
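For reference, the pre-1.3 workaround mentioned above looked roughly like this (a sketch of the event-listener pattern, not code taken from this question):

from sqlalchemy import event

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    # enable pyodbc's fast_executemany only for executemany-style statements
    if executemany:
        cursor.fast_executemany = True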
However, when I am trying to write a simple test DataFrame to MSSQL, it returns an error:
DBAPIError: (pyodbc.Error) ('HY010', '[HY010] [Microsoft][ODBC Driver 17 for SQL Server]Function sequence error (0) (SQLParamData)')
[SQL: INSERT INTO fast_executemany_test ([Date], [A], [B], [C], [D]) VALUES (?, ?, ?, ?, ?)][parameters: ((datetime.datetime(2018, 1, 3, 0, 0), 2.0, 1.0, 1.0, 'Joe'), (datetime.datetime(2018, 1, 4, 0, 0), 2.0, 1.0, 2.0, 'Joe'), (datetime.datetime(2018, 1, 5, 0, 0), 2.0, 3.0, 1.0, 'Pete'), (datetime.datetime(2018, 1, 6, 0, 0), 2.0, 1.0, 5.0, 'Mary'))]
(Background on this error at: http://sqlalche.me/e/dbapi)
I have gone through the documentation but could not find what am I doing wrong.
import sqlalchemy
import pandas as pd
from datetime import datetime

The DataFrame contains datetime, float, float, float, string:
test_columns = ['Date', 'A', 'B', 'C', 'D']
test_data = [
    [datetime(2018, 1, 3), 2.0, 1.0, 1.0, 'Joe'],
    [datetime(2018, 1, 4), 2.0, 1.0, 2.0, 'Joe'],
    [datetime(2018, 1, 5), 2.0, 3.0, 1.0, 'Pete'],
    [datetime(2018, 1, 6), 2.0, 1.0, 5.0, 'Mary'],
]
df = pd.DataFrame(test_data, columns=test_columns)
I am establishing the connection as:
sqlUrl = 'mssql+pyodbc://ID:PASSWORD@' + 'SERVER_ADDRESS' + '/' + 'DBName' + '?driver=ODBC+Driver+17+for+SQL+Server'
sqlcon = sqlalchemy.create_engine(sqlUrl, fast_executemany=True)
if sqlcon:
    df.to_sql('FastTable_test', sqlcon, if_exists='replace', index=False)
    print('Successfully written!')
It creates the table but, due to the error, does not write any data into it.

Retain values in a Pandas dataframe

Consider the following Pandas Dataframe:
import pandas as pd

df = pd.DataFrame([
[4.0, "Diastolic Blood Pressure", 1.0, "2017-01-15", 68],
[4.0, "Diastolic Blood Pressure", 5.0, "2017-04-15", 60],
[4.0, "Diastolic Blood Pressure", 8.0, "2017-06-18", 68],
[4.0, "Heart Rate", 1.0, "2017-01-15", 85],
[4.0, "Heart Rate", 5.0, "2017-04-15", 72],
[4.0, "Heart Rate", 8.0, "2017-06-18", 81],
[6.0, "Diastolic Blood Pressure", 1.0, "2017-01-18", 114],
[6.0, "Diastolic Blood Pressure", 6.0, "2017-02-18", 104],
[6.0, "Diastolic Blood Pressure", 9.0, "2017-03-18", 124]
], columns = ['ID', 'VSname', 'Visit', 'VSdate', 'VSres'])
I'd like to create a 'Flag' variable in this df: for each ID and VSname, show the difference from baseline (visit 1) at each visit.
I tried different approaches and I'm stuck.
I come from a background of SAS programming, and it would be very easy in SAS to retain values from one row to the next and then subtract. I'm sure my mind is polluted by SAS (and the title is clearly wrong), but this has to be doable with Pandas, one way or another. Any idea?
Thanks a lot for your help.
Kind regards,
Nicolas
Assuming the DataFrame is ordered by ID and visit within each group (i.e. visits 5 and 8 come directly after the 1), you can build a grouping key with cumsum, starting a new group at every baseline row:
c = df.Visit.eq(1).cumsum()
You can then subtract the first VSres entry of each group from VSres:
df.VSres - df.groupby(c).VSres.transform("first")
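A small usage sketch of the same idea, storing the result in the Flag column the question asks for (assuming the df defined above):

c = df.Visit.eq(1).cumsum()
df["Flag"] = df.VSres - df.groupby(c).VSres.transform("first")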
I tried the answers kindly given; none worked, and I got errors I couldn't fix. Not sure why... I managed to produce something close using the following:
baseline = df[df["Visit"] == 1.0]
baseline = baseline.rename(columns={'VSres': 'baseline'})
df = pd.merge(df, baseline, on = ["ID", "VSname"], how='left')
df["chg"] = df["VSres"] - df["baseline"]
That's not very beautiful, I know...
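A slightly tidier variant of the same merge idea (a sketch; keeping only the needed columns avoids the duplicated Visit and VSdate columns from the merge):

baseline = (df.loc[df["Visit"].eq(1), ["ID", "VSname", "VSres"]]
              .rename(columns={"VSres": "baseline"}))
df = df.merge(baseline, on=["ID", "VSname"], how="left")
df["chg"] = df["VSres"] - df["baseline"]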