First 'Group by' then plot/save as png from pandas - pandas

First I need to filter the data, then plot each group separately and save the files to a directory:
import seaborn as sns
import matplotlib.pyplot as plt
from os import path

outpath = "path/of/your/folder/"
for id in df["set"].unique():
    df2 = df.loc[df["set"] == id]
    sns.set_style("whitegrid", {'grid.linestyle': '-'})
    plt.figure(figsize=(12, 8))
    ax1 = sns.scatterplot(data=df2, x="x", y="y", hue="result", markers=['x'], s=1000)
    ax1.get_legend().remove()
    ax1.set_yticks((0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5), minor=False)
    ax1.set_xticks([0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5, 11.5, 12.6], minor=False)
    fig = ax1.get_figure()
    fig.savefig(path.join(outpath, "{0}.png".format(id)), dpi=300)

The following worked for me, but it is very slow:
groups = df.groupby("set")
for name, group in groups:
    sns.set_style("whitegrid", {'grid.linestyle': '-'})
    plt.figure(figsize=(12, 8))
    ax1 = sns.scatterplot(data=group, x="x", y="y", hue="result", markers=['x'], s=1000)
    ax1.get_legend().remove()
    ax1.set_yticks((0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5), minor=False)
    ax1.set_xticks([0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5, 11.5, 12.6], minor=False)
    fig = ax1.get_figure()
    fig.savefig("directory/{0}.png".format(name), dpi=300)

TypeError: unhashable type: 'list'

user is Angelica and distance is 0.91
userRatings = {'Blues Traveler': 3.5, 'Broken Bells': 2.0, 'Norah Jones': 4.5, 'Phoenix': 5.0, 'Slightly Stoopid': 1.5, 'The Strokes': 2.5, 'Vampire Weekend': 2.0} and userX Ratings = {'Blues Traveler': 3.0, 'Norah Jones': 5.0, 'Phoenix': 4.0, 'Slightly Stoopid': 2.5, 'The Strokes': 3.0}
itemsToRecommendFromCurrentUser = ['Vampire Weekend', 'Broken Bells']
user is Jordyn and distance is 0.87
userRatings = {'Broken Bells': 4.5, 'Deadmau5': 4.0, 'Norah Jones': 5.0, 'Phoenix': 5.0, 'Slightly Stoopid': 4.5, 'The Strokes': 4.0, 'Vampire Weekend': 4.0} and userX Ratings = {'Blues Traveler': 3.0, 'Norah Jones': 5.0, 'Phoenix': 4.0, 'Slightly Stoopid': 2.5, 'The Strokes': 3.0}
itemsToRecommendFromCurrentUser = ['Deadmau5', 'Vampire Weekend', 'Broken Bells']
user is Chan and distance is 0.64
userRatings = {'Blues Traveler': 5.0, 'Broken Bells': 1.0, 'Deadmau5': 1.0, 'Norah Jones': 3.0, 'Phoenix': 5, 'Slightly Stoopid': 1.0} and userX Ratings = {'Blues Traveler': 3.0, 'Norah Jones': 5.0, 'Phoenix': 4.0, 'Slightly Stoopid': 2.5, 'The Strokes': 3.0}
itemsToRecommendFromCurrentUser = ['Deadmau5', 'Broken Bells']
Potential items (before removing duplicates) for recommendation are [['Vampire Weekend', 'Broken Bells'], ['Deadmau5', 'Vampire Weekend', 'Broken Bells'], ['Deadmau5', 'Broken Bells']]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-85-10dc11a19cfb> in <module>
19
20 print(f"Potential items (before removing duplicates) for recommendation are {itemsToRecommend}")
---> 21 itemsToRecommend = set (itemsToRecommend)
22 # <<<<< (5.1) YOUR CODE ENDS HERE >>>>>
23
TypeError: unhashable type: 'list'
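The traceback shows set() being called on a list of lists, and lists are unhashable, so they cannot go into a set. A minimal sketch of one way around this, flattening the nested list before deduplicating (the variable below is a hypothetical stand-in mirroring the printed output above):
# Nested list as printed above; calling set() on it directly raises the TypeError.
potential = [['Vampire Weekend', 'Broken Bells'],
             ['Deadmau5', 'Vampire Weekend', 'Broken Bells'],
             ['Deadmau5', 'Broken Bells']]

# Flatten first, then deduplicate.
items_to_recommend = set(item for sublist in potential for item in sublist)
print(items_to_recommend)  # {'Broken Bells', 'Deadmau5', 'Vampire Weekend'}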

Reindex pandas DataFrame to match index with another DataFrame

I have two pandas DataFrames with different (float) indices.
I want to reindex the second DataFrame to match the first DataFrame's index, interpolating its values using the index.
This is the code I have:
from pandas import DataFrame
df1 = DataFrame([
    {'time': 0.2, 'v': 1},
    {'time': 0.4, 'v': 2},
    {'time': 0.6, 'v': 3},
    {'time': 0.8, 'v': 4},
    {'time': 1.0, 'v': 5},
    {'time': 1.2, 'v': 6},
    {'time': 1.4, 'v': 7},
    {'time': 1.6, 'v': 8},
    {'time': 1.8, 'v': 9},
    {'time': 2.0, 'v': 10},
]).set_index('time')
df2 = DataFrame([
    {'time': 0.25, 'v': 1},
    {'time': 0.5, 'v': 2},
    {'time': 0.75, 'v': 3},
    {'time': 1.0, 'v': 4},
    {'time': 1.25, 'v': 5},
    {'time': 1.5, 'v': 6},
    {'time': 1.75, 'v': 7},
    {'time': 2.0, 'v': 8},
    {'time': 2.25, 'v': 9},
]).set_index('time')
df2 = df2.reindex(df1.index.union(df2.index)).interpolate(method='index').reindex(df1.index)
print(df2)
Output:
v
time
0.2 NaN
0.4 1.6
0.6 2.4
0.8 3.2
1.0 4.0
1.2 4.8
1.4 5.6
1.6 6.4
1.8 7.2
2.0 8.0
That's correct and just what I need; however, it seems a more complicated statement than it needs to be.
Is there a more concise way to do the same, requiring fewer intermediate steps?
Also, is there a way to both interpolate and extrapolate? For example, in the example data above, the linearly extrapolated value for index 0.2 could be 0.8 instead of NaN. I know I could use curve_fit, but again I feel that's more complicated than it may need to be.
One idea with numpy.interp, if the values in both indices are increasing and only one column v needs processing:
import numpy as np

df1['v1'] = np.interp(df1.index, df2.index, df2['v'])
print(df1)
v v1
time
0.2 1 1.0
0.4 2 1.6
0.6 3 2.4
0.8 4 3.2
1.0 5 4.0
1.2 6 4.8
1.4 7 5.6
1.6 8 6.4
1.8 9 7.2
2.0 10 8.0
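For the extrapolation part of the question: np.interp clamps to the endpoint values outside the data range, so index 0.2 would come back as 1.0 rather than 0.8. A minimal sketch using scipy.interpolate.interp1d with fill_value='extrapolate' (assuming the original df1/df2 definitions above, before the reindex), which extrapolates linearly beyond the known points:
from scipy.interpolate import interp1d

# Linear interpolator over df2 that also extrapolates outside its index range.
f = interp1d(df2.index, df2['v'], kind='linear', fill_value='extrapolate')
df1['v1'] = f(df1.index)
print(df1.head(3))
#       v   v1
# time
# 0.2   1  0.8   <- extrapolated instead of NaN
# 0.4   2  1.6
# 0.6   3  2.4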

Handling queries in pandas when a CSV input contains multiple duplicate columns?

I have a fairly simple CSV (the raw food.csv is shown further below).
When I use pandas to read the CSV, columns that have the same name automatically get renamed with a ".n" suffix, as follows:
>>> import pandas as pd
>>> food = pd.read_csv("food.csv")
>>> food
Order Number Item Description Item Cost Item Description.1 Item Cost.1 Item Description.2 Item Cost.2
0 110 Chow Mein 5.00 NaN NaN NaN NaN
1 111 Cake 1.50 Chocolate 13.10 Noodle 3.75
2 112 Chocolate 11.00 Chips 5.75 NaN NaN
3 113 Sandwich 6.25 Milk 2.00 Ice 0.50
4 114 Chocolate 13.10 Water 0.25 NaN NaN
5 115 Tea 1.00 Milkshake 2.80 Chocolate 13.10
6 116 Green Tea 1.25 NaN NaN NaN NaN
7 117 Burger 2.00 Fries 3.50 NaN NaN
8 118 Chocolate 5.00 Green Tea 1.50 NaN NaN
9 119 Tonic 3.00 Burger 3.75 Milk 2.00
10 120 Orange 1.50 Milkshake 4.20 NaN NaN
>>>
food.csv:
Order Number,Item Description,Item Cost,Item Description,Item Cost,Item Description,Item Cost
110,Chow Mein,5,,,,
111,Cake,1.5,Chocolate,13.1,Noodle,3.75
112,Chocolate,11,Chips,5.75,,
113,Sandwich,6.25,Milk,2,Ice,0.5
114,Chocolate,13.1,Water,0.25,,
115,Tea,1,Milkshake,2.8,Chocolate,13.1
116,Green Tea,1.25,,,,
117,Burger,2,Fries,3.5,,
118,Chocolate,5,Green Tea,1.5,,
119,Tonic,3,Burger,3.75,Milk,2
120,Orange,1.5,Milkshake,4.2,,
As such, queries that rely on the column names will only work if they match the first column (e.g.):
>>> print(food[(food['Item Description'] == "Chocolate") & (food['Item Cost'] == 13.10)]['Order Number'].to_string(index=False))
114
While I can technically lengthen the masks to include the .1 and .2 columns, this seems relatively inefficient, especially when the number of duplicated columns is large (in this example there are only 3 sets of duplicated columns, but other datasets have many more, and constructing a mask for each column would not scale well).
I am not sure whether I am approaching this the right way, whether I am missing something simple (like in loading the CSV), or whether there is some groupby that can answer the same question (i.e. find the order numbers where the order contains an item listed as chocolate that costs $13.10).
Would the problem be different if it's something like: average all the costs of chocolates paid for all the orders?
Thanks in advance.
Here's a bit of a simpler approach with pandas' wide_to_long function
(I will use the df provided by @mitoRibo in another answer)
documentation: https://pandas.pydata.org/docs/reference/api/pandas.wide_to_long.html
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Order Number': ['Order_01', 'Order_02', 'Order_03', 'Order_04', 'Order_05', 'Order_06', 'Order_07', 'Order_08', 'Order_09', 'Order_10'],
'Item Description': ['Burger', 'Cake', 'Cake', 'Tonic', 'Green Tea', 'Sandwich', 'Orange', 'Burger', 'Cake', 'Chow Mein'],
'Item Cost': [7, 10, 4, 1, 10, 7, 9, 9, 6, 3],
'Item Description.1': ['Tonic', 'Burger', 'Green Tea', 'Sandwich', 'Orange', None, 'Chocolate', None, 'Chocolate', 'Tea'],
'Item Cost.1': [4.0, 1.0, 7.0, 7.0, 8.0, np.nan, 6.0, np.nan, 8.0, 3.0],
'Item Description.2': [None, 'Chow Mein', 'Chow Mein', 'Chocolate', 'Tea', None, 'Burger', None, 'Tea', 'Green Tea'],
'Item Cost.2': [np.nan, 8.0, 1.0, 9.0, 9.0, np.nan, 2.0, np.nan, 1.0, 9.0],
'Item Description.3': [None, 'Sandwich', 'Orange', 'Cake', 'Tonic', None, None, None, 'Sandwich', 'Burger'],
'Item Cost.3': [np.nan, 5.0, 9.0, 2.0, 7.0, np.nan, np.nan, np.nan, 8.0, 4.0],
'Item Description.4': [None, 'Green Tea', 'Burger', 'Green Tea', 'Cake', None, None, None, None, 'Orange'],
'Item Cost.4': [np.nan, 4.0, 4.0, 3.0, 10.0, np.nan, np.nan, np.nan, np.nan, 1.0],
'Item Description.5': [None, None, 'Tea', 'Burger', 'Chocolate', None, None, None, None, 'Sandwich'],
'Item Cost.5': [np.nan, np.nan, 8.0, 5.0, 1.0, np.nan, np.nan, np.nan, np.nan, 4.0],
'Item Description.6': [None, None, 'Tonic', 'Tea', 'Burger', None, None, None, None, 'Chocolate'],
'Item Cost.6': [np.nan, np.nan, 8.0, 2.0, 8.0, np.nan, np.nan, np.nan, np.nan, 9.0],
})
df.rename(columns={'Item Description': 'Item Description.0', 'Item Cost': 'Item Cost.0'}, inplace=True)
long = pd.wide_to_long(df, stubnames=['Item Description', 'Item Cost'], i="Order Number", j="num_after_col_name", sep='.')
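A short usage sketch of the resulting long table (hedged: the column names follow the wide_to_long call above, and the 13.10 price refers back to the original food.csv example rather than to this sample df):
# Drop the item slots that were empty in the wide layout, then query normally.
long = long.dropna(subset=['Item Description']).reset_index()
chocolate_orders = long.loc[
    (long['Item Description'] == 'Chocolate') & (long['Item Cost'] == 13.10),
    'Order Number'
].unique()
print(chocolate_orders)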
It's often easier to operate on a table in "long" form instead of the "wide" form you currently have.
There's example code below to convert an example wide df into a long version.
In the long_df version each row is a unique Order/Item, and we no longer have to store any null values. pandas also makes it easy to perform grouping operations on tables in long form; the code below builds an agg table with one row per order.
You can also easily make your query of finding orders where a chocolate cost $13.10 with long_df[long_df['Description'].eq('Chocolate') & long_df['Cost'].eq(13.10)]['Order Number'].unique()
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Order Number': ['Order_01', 'Order_02', 'Order_03', 'Order_04', 'Order_05', 'Order_06', 'Order_07', 'Order_08', 'Order_09', 'Order_10'],
'Item Description': ['Burger', 'Cake', 'Cake', 'Tonic', 'Green Tea', 'Sandwich', 'Orange', 'Burger', 'Cake', 'Chow Mein'],
'Item Cost': [7, 10, 4, 1, 10, 7, 9, 9, 6, 3],
'Item Description.1': ['Tonic', 'Burger', 'Green Tea', 'Sandwich', 'Orange', None, 'Chocolate', None, 'Chocolate', 'Tea'],
'Item Cost.1': [4.0, 1.0, 7.0, 7.0, 8.0, np.nan, 6.0, np.nan, 8.0, 3.0],
'Item Description.2': [None, 'Chow Mein', 'Chow Mein', 'Chocolate', 'Tea', None, 'Burger', None, 'Tea', 'Green Tea'],
'Item Cost.2': [np.nan, 8.0, 1.0, 9.0, 9.0, np.nan, 2.0, np.nan, 1.0, 9.0],
'Item Description.3': [None, 'Sandwich', 'Orange', 'Cake', 'Tonic', None, None, None, 'Sandwich', 'Burger'],
'Item Cost.3': [np.nan, 5.0, 9.0, 2.0, 7.0, np.nan, np.nan, np.nan, 8.0, 4.0],
'Item Description.4': [None, 'Green Tea', 'Burger', 'Green Tea', 'Cake', None, None, None, None, 'Orange'],
'Item Cost.4': [np.nan, 4.0, 4.0, 3.0, 10.0, np.nan, np.nan, np.nan, np.nan, 1.0],
'Item Description.5': [None, None, 'Tea', 'Burger', 'Chocolate', None, None, None, None, 'Sandwich'],
'Item Cost.5': [np.nan, np.nan, 8.0, 5.0, 1.0, np.nan, np.nan, np.nan, np.nan, 4.0],
'Item Description.6': [None, None, 'Tonic', 'Tea', 'Burger', None, None, None, None, 'Chocolate'],
'Item Cost.6': [np.nan, np.nan, 8.0, 2.0, 8.0, np.nan, np.nan, np.nan, np.nan, 9.0],
})
# Convert table to long form
desc_cols = [c for c in df.columns if 'Desc' in c]
cost_cols = [c for c in df.columns if 'Cost' in c]
desc_df = df.melt(id_vars='Order Number', value_vars=desc_cols, value_name='Description')
cost_df = df.melt(id_vars='Order Number', value_vars=cost_cols, value_name='Cost')
long_df = pd.concat((desc_df[['Order Number','Description']], cost_df[['Cost']]), axis=1).dropna()
long_df.insert(1,'Item Number',long_df.groupby('Order Number').cumcount().add(1))
long_df = long_df.sort_values(['Order Number','Item Number'])
# Calculate group info
group_info = long_df.groupby('Order Number').agg(
    ordered_chocolate=('Description', lambda d: d.eq('Chocolate').any()),
    total_cost=('Cost', 'sum'),
)
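As a hedged sketch for the follow-up question about averaging the cost of chocolates across all orders, the same long table makes that a one-liner (assuming the long_df built above):
# Mean cost of every chocolate line item across all orders.
avg_chocolate_cost = long_df.loc[long_df['Description'].eq('Chocolate'), 'Cost'].mean()
print(avg_chocolate_cost)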

Ggpredict over fitting causing trend line to begin in an odd place

I am not entirely sure what I am doing wrong/what to look up. My objective is to use ggpredict and ggplot to display the relationship between time and the proportion of years burnt. I'm guessing it is something to do with the time variable being log transformed?
library(lme4);library(ggplot2);library(ggeffects);library(dplyr)
data = read.csv('FtmpAllyrs10kmC.csv')
This is what the data looks like:
structure(list(Observ = c(5208, 2828, 1664, 578, 18, 1644, 4741,
751, 689, 3813, 1464, 438, 1553, 4752, 4960, 376, 2482, 1811,
5682, 5441, 4505, 2281, 2103, 2993, 562, 4297, 3592, 5148, 3793,
1621, 1912, 1627, 1737, 4976, 2173, 5132, 5758, 2756, 1789, 5666,
2628, 2593, 794, 5779, 5158, 3123, 4986, 676, 4200, 2442, 2751,
4330, 1802, 2020, 2500, 1056, 959, 3290, 4303, 247, 5586, 922,
1049, 2432, 2076, 2560, 1369, 3636, 3722, 4137, 1561, 4915, 2515,
3034, 5547, 1491, 1247, 4116, 455, 4687, 1697, 5329, 21, 5724,
3701, 5697, 2938, 1721, 61, 998, 4304, 5798, 651, 910, 2689,
3986, 2908, 5753, 2574, 2345, 1940, 4317, 4588, 2179, 665, 4133,
749, 3977, 3134, 4190, 3985, 4937, 2473, 3238, 4987, 3915, 4261,
3521, 2736, 3665, 1797, 5692, 5578, 4087, 2011, 903, 889, 1523,
3396, 2291, 5269, 3644, 3403, 4814, 4618, 16, 77, 5385, 2842,
5816, 2015, 1443, 3183, 3331, 4977, 5380, 989, 4918, 740, 4637,
887, 1557, 4295, 4673, 1918, 5662, 4167, 1384, 3441, 614, 2360,
780, 661, 1267, 2018, 1906, 3402, 677, 5218, 2830, 4979, 3984,
4924, 1125, 2640, 986, 1885, 2573, 5300, 2398, 4832, 4816, 3738,
3276, 3830, 2425, 2054, 4273, 5607, 1678, 378, 1158, 510, 2210,
2399, 1952, 2909, 4945, 2659, 2642), yrblock15 = c(2015, 2010,
2007, 2005, 2004, 2007, 2014, 2005, 2005, 2012, 2007, 2004, 2007,
2014, 2015, 2004, 2009, 2008, 2016, 2016, 2014, 2009, 2008, 2010,
2005, 2013, 2011, 2015, 2012, 2007, 2008, 2007, 2007, 2015, 2008,
2015, 2016, 2010, 2007, 2016, 2009, 2009, 2005, 2016, 2015, 2010,
2015, 2005, 2013, 2009, 2010, 2013, 2008, 2008, 2009, 2006, 2006,
2011, 2013, 2004, 2016, 2006, 2006, 2009, 2008, 2009, 2007, 2012,
2012, 2013, 2007, 2014, 2009, 2010, 2016, 2007, 2006, 2013, 2005,
2014, 2007, 2015, 2004, 2016, 2012, 2016, 2010, 2007, 2004, 2006,
2013, 2016, 2005, 2006, 2009, 2012, 2010, 2016, 2009, 2009, 2008,
2013, 2014, 2008, 2005, 2013, 2005, 2012, 2010, 2013, 2012, 2014,
2009, 2011, 2015, 2012, 2013, 2011, 2010, 2012, 2007, 2016, 2016,
2013, 2008, 2006, 2005, 2007, 2011, 2009, 2015, 2012, 2011, 2014,
2014, 2004, 2004, 2015, 2010, 2016, 2008, 2007, 2011, 2011, 2015,
2015, 2006, 2014, 2005, 2014, 2005, 2007, 2013, 2014, 2008, 2016,
2013, 2007, 2011, 2005, 2009, 2005, 2005, 2006, 2008, 2008, 2011,
2005, 2015, 2010, 2015, 2012, 2014, 2006, 2009, 2006, 2008, 2009,
2015, 2009, 2014, 2014, 2012, 2011, 2012, 2009, 2008, 2013, 2016,
2007, 2004, 2006, 2005, 2008, 2009, 2008, 2010, 2014, 2009, 2009
), circleID = c(258, 128, 314, 128, 18, 294, 241, 301, 239, 213,
114, 438, 203, 252, 10, 376, 232, 11, 282, 41, 5, 31, 303, 293,
112, 247, 442, 198, 193, 271, 112, 277, 387, 26, 373, 182, 358,
56, 439, 266, 378, 343, 344, 379, 208, 423, 36, 226, 150, 192,
51, 280, 2, 220, 250, 156, 59, 140, 253, 247, 186, 22, 149, 182,
276, 310, 19, 36, 122, 87, 211, 415, 265, 334, 147, 141, 347,
66, 5, 187, 347, 379, 21, 324, 101, 297, 238, 371, 61, 98, 254,
398, 201, 10, 439, 386, 208, 353, 324, 95, 140, 267, 88, 379,
215, 83, 299, 377, 434, 140, 385, 437, 223, 88, 37, 315, 211,
371, 36, 65, 447, 292, 178, 37, 211, 3, 439, 173, 246, 41, 319,
44, 253, 314, 118, 16, 77, 435, 142, 416, 215, 93, 33, 181, 27,
430, 89, 418, 290, 137, 437, 207, 245, 173, 118, 262, 117, 34,
291, 164, 110, 330, 211, 367, 218, 106, 252, 227, 268, 130, 29,
384, 424, 225, 390, 86, 85, 323, 350, 148, 332, 316, 138, 126,
230, 175, 254, 223, 207, 328, 378, 258, 60, 410, 149, 152, 209,
445, 409, 392), rain15 = c(347.83, 394.12, 382.2, 382.41, 395.7,
386.08, 383.79, 352.65, 354.31, 366.48, 416.79, 335.17, 409.24,
373, 390.76, 341.35, 387.25, 452.18, 329.14, 365.74, 432.58,
443.36, 375.57, 359.75, 379.14, 386.41, 361.47, 366.1, 382.57,
383.32, 409.56, 390.92, 380.38, 394.94, 366.72, 347.44, 336.88,
410.94, 370.83, 335.88, 368.53, 370.42, 344.56, 323.41, 348.34,
351.07, 382.75, 362.64, 402.7, 396.11, 418.01, 389.14, 462.76,
391.05, 369.47, 399.78, 419.32, 392.97, 389.15, 345.37, 336.22,
405.73, 378.45, 394.7, 388.29, 379.56, 437.29, 415.95, 388.91,
402.43, 397.09, 368.84, 378.54, 361.92, 355.22, 416.46, 361.24,
417.12, 420.92, 386.48, 375.04, 335.03, 385.23, 342.51, 401.27,
341.21, 362.81, 372.85, 396.48, 390.72, 385.06, 343.64, 365.25,
440.76, 364.68, 354.45, 368.7, 324.44, 366.4, 408.43, 405.71,
390.8, 401.09, 364.07, 360.68, 399.39, 348.38, 344.2, 345.23,
401.29, 356.48, 364.21, 376.12, 403.37, 384.1, 355.71, 389.53,
363.28, 417.76, 403.16, 362.28, 333.91, 337.46, 419.51, 389.22,
448.08, 338.46, 397.52, 372.25, 424.25, 349.25, 408.19, 376.68,
375.87, 403.78, 398.73, 386.92, 340.39, 391.58, 335.03, 390.25,
422.05, 423.79, 386.49, 392.97, 334.07, 403.85, 369.54, 348.84,
392.33, 336.68, 399.56, 386.84, 395.97, 409.93, 337.08, 410.27,
450.48, 364.93, 369.08, 413.31, 341.93, 360.06, 362.28, 395.8,
423.56, 376.67, 366.19, 358.88, 390.74, 390.84, 362.84, 370.21,
360.84, 371.9, 410.36, 421.59, 367.48, 355.62, 389.61, 370.81,
374.37, 382.61, 401.78, 373.7, 382.72, 387.56, 388.53, 329.06,
383.78, 336.97, 376.68, 398.57, 370.46, 388.88, 421.66, 369.29,
371.58, 369.01, 369.22), YearsBurnt = c(6, 6, 3.5, 5, 3, 2, 3.5,
2.5, 2, 1.5, 10.5, 3.5, 2.5, 3.5, 4.5, 3, 2, 2.5, 1.5, 3.5, 3.5,
4, 4, 3, 3.5, 2.5, 6, 4.5, 4, 2.5, 3.5, 2, 7, 3, 2.5, 3.5, 13,
3, 3.5, 3.5, 4.5, 3, 1.5, 2, 4, 2, 4.5, 4, 3.5, 2.5, 2, 2, 3,
1, 5, 2.5, 4, 12.5, 2.5, 1.5, 3.5, 1.5, 2.5, 4, 4.5, 10, 3, 3.5,
4.5, 10.5, 1, 4.5, 2, 13.5, 8.5, 10, 1, 4, 3, 3.5, 1.5, 3, 2.5,
2.5, 2.5, 4.5, 4, 1.5, 3, 3.5, 4.5, 1.5, 3, 2.5, 3.5, 8.5, 4,
7, 2.5, 5, 11, 3.5, 11.5, 3, 1.5, 3, 0.5, 4.5, 3.5, 13.5, 7.5,
3.5, 2, 12, 4, 5, 2, 1.5, 3.5, 4.5, 2, 3.5, 3, 4, 1.5, 2, 2.5,
6, 2, 5, 3.5, 4.5, 2, 3.5, 5, 4.5, 3, 4, 14, 3, 1.5, 3.5, 5.5,
3, 4, 3, 7, 4.5, 2.5, 3, 3, 3.5, 3, 9, 5, 6.5, 5, 4, 4, 3.5,
3, 8.5, 1, 4.5, 1.5, 5.5, 3, 2, 2.5, 2.5, 3, 8.5, 2.5, 1, 3.5,
5.5, 5, 1.5, 2, 4.5, 5, 4, 1.5, 3.5, 4.5, 6, 4.5, 3.5, 3, 6.5,
3, 6.5, 3.5, 4.5, 2.5, 2.5, 4, 4, 4, 4.5), YearsNotBurnt = c(9,
9, 11.5, 10, 12, 13, 11.5, 12.5, 13, 13.5, 4.5, 11.5, 12.5, 11.5,
10.5, 12, 13, 12.5, 13.5, 11.5, 11.5, 11, 11, 12, 11.5, 12.5,
9, 10.5, 11, 12.5, 11.5, 13, 8, 12, 12.5, 11.5, 2, 12, 11.5,
11.5, 10.5, 12, 13.5, 13, 11, 13, 10.5, 11, 11.5, 12.5, 13, 13,
12, 14, 10, 12.5, 11, 2.5, 12.5, 13.5, 11.5, 13.5, 12.5, 11,
10.5, 5, 12, 11.5, 10.5, 4.5, 14, 10.5, 13, 1.5, 6.5, 5, 14,
11, 12, 11.5, 13.5, 12, 12.5, 12.5, 12.5, 10.5, 11, 13.5, 12,
11.5, 10.5, 13.5, 12, 12.5, 11.5, 6.5, 11, 8, 12.5, 10, 4, 11.5,
3.5, 12, 13.5, 12, 14.5, 10.5, 11.5, 1.5, 7.5, 11.5, 13, 3, 11,
10, 13, 13.5, 11.5, 10.5, 13, 11.5, 12, 11, 13.5, 13, 12.5, 9,
13, 10, 11.5, 10.5, 13, 11.5, 10, 10.5, 12, 11, 1, 12, 13.5,
11.5, 9.5, 12, 11, 12, 8, 10.5, 12.5, 12, 12, 11.5, 12, 6, 10,
8.5, 10, 11, 11, 11.5, 12, 6.5, 14, 10.5, 13.5, 9.5, 12, 13,
12.5, 12.5, 12, 6.5, 12.5, 14, 11.5, 9.5, 10, 13.5, 13, 10.5,
10, 11, 13.5, 11.5, 10.5, 9, 10.5, 11.5, 12, 8.5, 12, 8.5, 11.5,
10.5, 12.5, 12.5, 11, 11, 11, 10.5), time = c(1.96, 4.94, 3.46,
4.94, 2.73, 6.22, 4.5, 2.67, 4.66, 3.83, 0.38, 2.6, 3.97, 4.18,
3.77, 3.44, 2.9, 3.93, 2.16, 3.51, 2.91, 3.19, 2.73, 6.36, 1.74,
4.39, 4.1, 2.26, 2.36, 5.32, 1.74, 3.66, 1.26, 5.61, 9.04, 4.61,
0.46, 3.98, 2.63, 5.5, 2.56, 5.92, 6.39, 2.26, 3.27, 7.95, 2.93,
4.93, 2.97, 2.43, 5.91, 3.07, 4.27, 3.21, 4.12, 4.72, 1.93, 0.69,
3.51, 4.39, 4.02, 3.18, 2.61, 4.61, 3.67, 0.54, 2.33, 2.93, 2.12,
1.06, 3.95, 2.31, 5.44, 0.17, 1.42, 0.55, 8.35, 2.53, 2.91, 3.26,
8.35, 2.26, 2.23, 7.18, 6.59, 6.36, 4.38, 7.67, 1.93, 3.34, 2.91,
8.54, 5.75, 3.77, 2.63, 0.97, 3.27, 1.58, 7.18, 2.08, 0.69, 5.43,
0.85, 2.26, 3.69, 3.18, 6.18, 2.93, 2.68, 0.69, 0.92, 2.34, 3.26,
0.85, 2.91, 4.3, 3.95, 7.67, 2.93, 2.1, 6.54, 6.31, 3.87, 2.91,
3.95, 3.35, 2.63, 1.49, 4.32, 3.51, 7.06, 2.67, 3.51, 3.46, 1.56,
4.33, 5.64, 2.73, 0.57, 2.87, 3.69, 2.56, 2.33, 4.27, 4.73, 4.02,
0.82, 4.11, 4.88, 2.29, 2.34, 3.72, 4.21, 1.49, 1.56, 3.03, 1.24,
2.65, 5.71, 1.67, 2.71, 1.49, 3.95, 4.51, 3.36, 5.21, 4.18, 4.54,
5.36, 4.25, 3.71, 0.95, 8.92, 3.12, 2.73, 1.36, 1.85, 7.24, 8.11,
2.2, 0.95, 5.16, 1.3, 6.54, 3.01, 1.97, 2.91, 3.26, 3.72, 1.79,
2.56, 1.96, 1.89, 1.89, 2.61, 5.25, 3.25, 5.26, 1.74, 3.73),
claylake = c(0, 0, 0, 0, 0, 17.53, 0.1, 0.59, 0, 9.13, 36.93,
12.75, 0, 0, 0, 0, 0, 0, 0, 0.09, 0.01, 0, 0, 9.43, 74.71,
26.42, 0.23, 0, 0, 35.27, 74.71, 0, 0, 0, 0, 0, 0, 0, 20.81,
9.46, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1.14, 0, 26.42, 3.62, 0, 0, 0, 0.21, 0, 0, 0, 0.03, 10.43,
0.99, 3.6, 5.32, 0, 0.36, 0, 0, 0.25, 0.01, 0.22, 0, 0, 6.45,
0, 0, 0, 0, 0, 1.71, 0, 0, 0, 0, 0, 20.81, 0, 0, 0.18, 0,
0, 1.14, 0.03, 1.2, 0, 8.97, 0, 0, 0, 0, 1.14, 0, 1.56, 0.22,
1.2, 0, 0, 0.99, 0, 0, 0, 0, 4.14, 0, 0, 0.99, 0, 20.81,
0, 33.61, 0.09, 14.94, 0, 0, 0, 0, 0.41, 0, 2.7, 0, 0.61,
8.97, 0, 0, 0, 0, 1.7, 2.67, 7.71, 0.2, 8.63, 1.56, 0, 0.49,
0, 0, 0, 0, 0, 11.9, 33.08, 0, 0, 0.99, 2.13, 0, 0, 0, 0,
0.03, 0, 0, 0, 0, 0, 0, 2.86, 1.65, 0, 0, 0, 0, 0, 60.14,
0, 0, 0, 0, 0.22, 0, 0, 0, 0, 0.3, 0, 0, 0, 0, 0, 0, 5.57
), spinsandplain = c(81.94, 34.29, 89.55, 34.29, 80.86, 75.92,
81.55, 43.53, 97.3, 87.84, 60.62, 80.81, 11.73, 5.11, 98.67,
79.52, 60.73, 91.65, 2.82, 97.31, 73.65, 72.78, 96.51, 74.02,
25.09, 50.74, 96.62, 88.77, 98.8, 54.04, 25.09, 95.1, 69.85,
99.4, 78.79, 78.77, 48.16, 80.68, 75.79, 66.33, 68.3, 79.11,
91.89, 82.49, 98.33, 90.82, 91.24, 65.01, 69.24, 99.94, 99.75,
18.57, 90.39, 95.56, 71.07, 67.85, 92.37, 85.85, 17.89, 50.74,
79.65, 68.82, 74.05, 78.77, 87.67, 41.11, 91.74, 91.24, 44.8,
86.24, 97.7, 94.17, 85.59, 33.53, 85.23, 94.55, 78.52, 95.49,
73.65, 95.04, 78.52, 82.49, 77.26, 83.4, 98.29, 85.24, 98.78,
87.09, 81.36, 96.62, 3.4, 94.65, 28.6, 98.67, 75.79, 73.34,
98.33, 74.88, 83.4, 88.24, 85.85, 52.44, 95.84, 82.49, 62.11,
98.74, 70.32, 86.18, 95.67, 85.85, 11.42, 85.96, 75.53, 95.84,
95.46, 93.68, 97.7, 87.09, 91.24, 80.03, 87.77, 68.71, 17.51,
95.46, 97.7, 50.7, 75.79, 70.43, 61.06, 97.31, 74.63, 99,
17.89, 89.55, 99.25, 98.08, 97.61, 93.36, 99.03, 38.1, 62.11,
96.9, 88.87, 40.48, 90.21, 73.79, 95.2, 66.53, 96.67, 82.89,
85.96, 97.08, 75.74, 70.43, 99.25, 96.4, 98.88, 98.13, 85.32,
54.19, 99.2, 81.42, 97.7, 82.25, 97.42, 98.1, 5.11, 12.06,
66.14, 52.39, 52.72, 12.32, 87.32, 98.95, 71.55, 90.58, 97.9,
80.62, 93.32, 76, 86.48, 86.42, 39.54, 68.65, 6.05, 86.02,
3.4, 75.53, 97.08, 32.47, 68.3, 81.94, 89.64, 57.4, 74.05,
0.47, 96.76, 86.7, 78.46, 84.81)), row.names = c(5208L, 2828L,
1664L, 578L, 18L, 1644L, 4741L, 751L, 689L, 3813L, 1464L, 438L,
1553L, 4752L, 4960L, 376L, 2482L, 1811L, 5682L, 5441L, 4505L,
2281L, 2103L, 2993L, 562L, 4297L, 3592L, 5148L, 3793L, 1621L,
1912L, 1627L, 1737L, 4976L, 2173L, 5132L, 5758L, 2756L, 1789L,
5666L, 2628L, 2593L, 794L, 5779L, 5158L, 3123L, 4986L, 676L,
4200L, 2442L, 2751L, 4330L, 1802L, 2020L, 2500L, 1056L, 959L,
3290L, 4303L, 247L, 5586L, 922L, 1049L, 2432L, 2076L, 2560L,
1369L, 3636L, 3722L, 4137L, 1561L, 4915L, 2515L, 3034L, 5547L,
1491L, 1247L, 4116L, 455L, 4687L, 1697L, 5329L, 21L, 5724L, 3701L,
5697L, 2938L, 1721L, 61L, 998L, 4304L, 5798L, 651L, 910L, 2689L,
3986L, 2908L, 5753L, 2574L, 2345L, 1940L, 4317L, 4588L, 2179L,
665L, 4133L, 749L, 3977L, 3134L, 4190L, 3985L, 4937L, 2473L,
3238L, 4987L, 3915L, 4261L, 3521L, 2736L, 3665L, 1797L, 5692L,
5578L, 4087L, 2011L, 903L, 889L, 1523L, 3396L, 2291L, 5269L,
3644L, 3403L, 4814L, 4618L, 16L, 77L, 5385L, 2842L, 5816L, 2015L,
1443L, 3183L, 3331L, 4977L, 5380L, 989L, 4918L, 740L, 4637L,
887L, 1557L, 4295L, 4673L, 1918L, 5662L, 4167L, 1384L, 3441L,
614L, 2360L, 780L, 661L, 1267L, 2018L, 1906L, 3402L, 677L, 5218L,
2830L, 4979L, 3984L, 4924L, 1125L, 2640L, 986L, 1885L, 2573L,
5300L, 2398L, 4832L, 4816L, 3738L, 3276L, 3830L, 2425L, 2054L,
4273L, 5607L, 1678L, 378L, 1158L, 510L, 2210L, 2399L, 1952L,
2909L, 4945L, 2659L, 2642L), class = "data.frame")
I create a new variable for the proportion of years burnt out of 15 years (i.e., binomial):
data$fireprop = cbind(data$YearsBurnt,data$YearsNotBurnt)
Model:
mfireprop = glmer(fireprop~log(time)+spinsandplain+rain15+claylake+rain15*log(time)+(1|circleID),na.action=na.fail, family=binomial, data=data)
Trend line code:
d = ggpredict(mfireprop, terms = "time[exp]")
d = rename(d, "time" = x, "fireprop" = predicted)
ggplot(d, aes(time, fireprop)) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .1) +
  geom_line(size = 2, colour = "black") +
  theme_bw()
And the trend line comes out looking like this:
Why is the x axis not stopping at 10 hours, where the data stops? Why is it going to 20,000? And why does the y axis only go to 0.4 when some of the proportions are 1?
When I limit the x and y axes it ends up looking like this:
But when I overlay the raw data on top of that, it seems like the trend line starts in a really odd place.
I am unsure of what I am doing wrong.
Okay, so I've figured out the main problem here. In the documentation of the ggpredict() function there is an argument called back.transform that defaults to TRUE, meaning log-transformed data are automatically transformed back to the original response scale. This is why, if you examine the ggpredict object d, you will see that the time variable actually goes to over 8000 in that object. Because you did not set back.transform=FALSE but did specify time[exp], the function automatically exponentiated your values and then you exponentiated them again.
If we look at the logged values:
summary(log(data$time))
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.7720 0.8154 1.1802 1.0904 1.4793 2.2017
Then we exponentiate the max value, we get the previous max:
exp(2.2017) # Exponentiated to get back to years
[1] 9.040369
summary(data$time) # The original variable
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.170 2.260 3.255 3.519 4.390 9.040
If we exponentiate it again, we end up with the max time being over 8000.
exp(9.040369)
[1] 8436.89
So, to get the plot you want, you just need to leave out the [exp] after calling time in ggpredict():
d = ggpredict(mfireprop, terms = "time")
d = rename(d, "time" = x, "fireprop" = predicted)
ggplot(d, aes(time, fireprop)) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .1) +
  geom_line(size = 2, colour = "black") +
  theme_bw()
The time is being cut off because at time 0 there is no variation. YearsNotBurnt is always 0. Therefore, if you look at the object d from ggpredict, you will see NaN in all the columns for time 0. If you simplify the model to the following:
mfireprop2 = glmer(fireprop ~
                     log(time) +
                     (1 | circleID),
                   na.action = na.fail,
                   family = binomial,
                   data = data)
You will be able to get the plot, but because there is very little variation, the confidence interval will span from zero to one. I believe this is an issue related to separation; essentially, binomial models can't be reliably fit with frequentist methods if there is no variation, or if something perfectly predicts the outcome.
The only other thing I wanted to mention is that you had a question in the comments about the "non-integer counts in a binomial glm!" warning. This appears because the model expects the dependent variable to be counts of successes and failures, which should not have decimals, and your data seem to be recorded in half-year intervals. I'm not familiar enough with your data to say for sure what a better alternative would be, but creating a proportion and supplying the number of observations in the weights= argument might be an option.

fast_executemany=True throwing DBAPIError: Function sequence error in sqlalchemy version 1.3.5

Since SQLAlchemy 1.3.0, released 2019-03-04, SQLAlchemy supports
engine = create_engine(sqlalchemy_url, fast_executemany=True)
for the mssql+pyodbc dialect, i.e. it is no longer necessary to define a function and use
@event.listens_for(engine, 'before_cursor_execute').
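For reference, a minimal sketch of the pre-1.3 event-listener recipe that this flag replaces (assuming a pyodbc engine named engine):
from sqlalchemy import event

@event.listens_for(engine, "before_cursor_execute")
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    # Enable pyodbc's fast_executemany only for executemany() calls.
    if executemany:
        cursor.fast_executemany = True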
However, when I try to write a simple test DataFrame to MSSQL it returns this error:
DBAPIError: (pyodbc.Error) ('HY010', '[HY010] [Microsoft][ODBC Driver 17 for SQL Server]Function sequence error (0) (SQLParamData)')
[SQL: INSERT INTO fast_executemany_test ([Date], [A], [B], [C], [D]) VALUES (?, ?, ?, ?, ?)][parameters: ((datetime.datetime(2018, 1, 3, 0, 0), 2.0, 1.0, 1.0, 'Joe'), (datetime.datetime(2018, 1, 4, 0, 0), 2.0, 1.0, 2.0, 'Joe'), (datetime.datetime(2018, 1, 5, 0, 0), 2.0, 3.0, 1.0, 'Pete'), (datetime.datetime(2018, 1, 6, 0, 0), 2.0, 1.0, 5.0, 'Mary'))]
(Background on this error at: http://sqlalche.me/e/dbapi)
I have gone through the documentation but could not find what I am doing wrong.
import sqlalchemy
import pandas as pd
from datetime import datetime
The DataFrame contains datetime, float, float, float, and string columns:
test_columns = ['Date', 'A', 'B', 'C', 'D']
test_data = [
    [datetime(2018, 1, 3), 2.0, 1.0, 1.0, 'Joe'],
    [datetime(2018, 1, 4), 2.0, 1.0, 2.0, 'Joe'],
    [datetime(2018, 1, 5), 2.0, 3.0, 1.0, 'Pete'],
    [datetime(2018, 1, 6), 2.0, 1.0, 5.0, 'Mary'],
]
df = pd.DataFrame(test_data, columns=test_columns)
I am establishing the connection as:
sqlUrl = 'mssql+pyodbc://ID:PASSWORD@' + 'SERVER_ADDRESS' + '/' + 'DBName' + '?driver=ODBC+Driver+17+for+SQL+Server'
sqlcon = sqlalchemy.create_engine(sqlUrl, fast_executemany=True)
if sqlcon:
    df.to_sql('FastTable_test', sqlcon, if_exists='replace', index=False)
    print('Successfully written!')
It creates the table but due to error does not write any data into it.