How to use the LAG function and get the numbers as percentages in SQL

I have a table with the columns below:
Employee(linked_lylty_card_nbr, prod_nbr, tot_amt_incld_gst, start_txn_date, main_total_size, total_size_uom)
Below is the respective data:
linked_lylty_card_nbr prod_nbr tot_amt_incld_gst start_txn_date main_total_size total_size_uom
1100000000006296409 83563-EA 3.1600 2021-11-10 500.0000 ML
1100000000006296409 83563-EA 2.6800 2021-11-20 500.0000 ML
1100000000001959800 83563-EA 2.6900 2021-12-21 500.0000 ML
1100000000006296409 83563-EA 3.1600 2021-12-30 500.0000 ML
1100000000001959800 83563-EA 5.3700 2022-01-14 500.0000 ML
1100000000006296409 83563-EA 2.6800 2022-01-16 500.0000 ML
1100000000001959800 83563-EA 2.4900 2022-01-19 500.0000 ML
1100000000006296409 83563-EA 3.4600 2022-02-26 500.0000 ML
1100000000006296409 607577-EA 3.9800 2022-05-26 500.0000 ML
1100000000006296409 607577-EA 3.9800 2022-06-11 500.0000 ML
1100000000001959800 83563-EA 3.9800 2022-06-14 500.0000 ML
1100000000001959800 83563-EA 3.9800 2022-06-24 500.0000 ML
1100000000006296409 607577-EA 4.4600 2022-07-30 500.0000 ML
1100000000001959800 83563-EA 4.0100 2022-08-02 500.0000 ML
1100000000001959800 83563-EA 4.0100 2022-09-01 500.0000 ML
1100000000006296409 607577-EA 3.9800 2022-09-08 500.0000 ML
I'm trying to get the change in volume per visit as a percentage. For example, if linked_lylty_card_nbr is 1100000000006296409, then for start_txn_date 2021-11-10 the main_total_size is 500, and for the same customer on 2021-11-20 the main_total_size is again 500, so there is no difference. I also want the number of days it takes for each linked_lylty_card_nbr to return and buy the product, expressed as a percentage. Below is the SQL query I've written:
SELECT
    linked_lylty_card_nbr,
    prod_nbr,
    start_txn_date,
    main_total_size,
    total_size_uom,
    (main_total_size - LAG(main_total_size, 1) OVER (
        PARTITION BY linked_lylty_card_nbr
        ORDER BY start_txn_date
    )) / main_total_size AS change_in_volume_per_visit,
    (start_txn_date - LAG(start_txn_date, 1) OVER (
        PARTITION BY linked_lylty_card_nbr
        ORDER BY start_txn_date
    )) / main_total_size AS change_in_days_per_visit
FROM
    Employee
ORDER BY
    linked_lylty_card_nbr,
    start_txn_date
The output is below
linked_lylty_card_nbr prod_nbr start_txn_date main_total_size total_size_uom change_in_volume_per_visit change_in_days_per_visit
1100000000001959800 83563-EA 2021-12-21 500.0 ML
1100000000001959800 83563-EA 2022-01-14 1000.0 ML 0.5 0.024
1100000000001959800 83563-EA 2022-01-19 500.0 ML -1.0 0.01
1100000000001959800 83563-EA 2022-06-14 500.0 ML 0.0 0.292
1100000000001959800 83563-EA 2022-06-24 500.0 ML 0.0 0.02
1100000000001959800 83563-EA 2022-08-02 500.0 ML 0.0 0.078
1100000000001959800 83563-EA 2022-09-01 500.0 ML 0.0 0.06
1100000000006296409 83563-EA 2021-11-10 500.0 ML
1100000000006296409 83563-EA 2021-11-20 500.0 ML 0.0 0.02
1100000000006296409 83563-EA 2021-12-30 500.0 ML 0.0 0.08
1100000000006296409 83563-EA 2022-01-16 500.0 ML 0.0 0.034
1100000000006296409 83563-EA 2022-02-26 500.0 ML 0.0 0.082
1100000000006296409 607577-EA 2022-05-26 500.0 ML 0.0 0.178
1100000000006296409 607577-EA 2022-06-11 500.0 ML 0.0 0.032
1100000000006296409 607577-EA 2022-07-30 500.0 ML 0.0 0.098
1100000000006296409 607577-EA 2022-09-08 500.0 ML 0.0 0.08
From the above output, the 2nd row of the change_in_volume_per_visit column has the value 0.5, but it should be 1 if main_total_size jumps from 500 (1st row) to 1000 (2nd row). Can anyone also please confirm whether the change_in_days_per_visit values are correct?
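For what it's worth, the 0.5 comes from dividing by the current row's main_total_size instead of the previous row's. Below is a minimal sketch of a corrected query (the column aliases are only illustrative, and it assumes a dialect with DATEDIFF such as SQL Server or Snowflake; in other databases replace it with the appropriate date subtraction). It divides by the LAG value to get a true percentage change and reports the gap between visits as a plain day count:
SELECT
    linked_lylty_card_nbr,
    prod_nbr,
    start_txn_date,
    main_total_size,
    total_size_uom,
    -- % change relative to the PREVIOUS visit's size: (current - previous) / previous * 100
    100.0 * (main_total_size - LAG(main_total_size) OVER (
        PARTITION BY linked_lylty_card_nbr ORDER BY start_txn_date
    )) / LAG(main_total_size) OVER (
        PARTITION BY linked_lylty_card_nbr ORDER BY start_txn_date
    ) AS pct_change_in_volume_per_visit,
    -- days elapsed since the previous visit (a count, not a percentage)
    DATEDIFF(day, LAG(start_txn_date) OVER (
        PARTITION BY linked_lylty_card_nbr ORDER BY start_txn_date
    ), start_txn_date) AS days_since_previous_visit
FROM Employee
ORDER BY linked_lylty_card_nbr, start_txn_date;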

Related

Make a mean of several years of dataframes, hour by hour

I have several dataframes of some value taken every hour, over several years, like this:
df1
Out[6]:
time P G(i) H_sun T2m WS10m Int
0 2005-01-01 00:10:00 0.0 0.0 0.0 0.68 2.11 0.0
1 2005-01-01 01:10:00 0.0 0.0 0.0 0.38 2.11 0.0
2 2005-01-01 02:10:00 0.0 0.0 0.0 0.08 2.11 0.0
3 2005-01-01 03:10:00 0.0 0.0 0.0 -0.22 2.11 0.0
4 2005-01-01 04:10:00 0.0 0.0 0.0 0.06 2.21 0.0
... ... ... ... ... ... ...
8755 2005-12-31 19:10:00 0.0 0.0 0.0 1.75 1.71 0.0
8756 2005-12-31 20:10:00 0.0 0.0 0.0 1.49 1.71 0.0
8757 2005-12-31 21:10:00 0.0 0.0 0.0 1.23 1.70 0.0
8758 2005-12-31 22:10:00 0.0 0.0 0.0 0.95 1.65 0.0
8759 2005-12-31 23:10:00 0.0 0.0 0.0 0.67 1.60 0.0
[8760 rows x 7 columns]
df2
Out[7]:
time P G(i) H_sun T2m WS10m Int
8760 2006-01-01 00:10:00 0.0 0.0 0.0 0.39 1.56 0.0
8761 2006-01-01 01:10:00 0.0 0.0 0.0 0.26 1.52 0.0
8762 2006-01-01 02:10:00 0.0 0.0 0.0 0.13 1.49 0.0
8763 2006-01-01 03:10:00 0.0 0.0 0.0 0.01 1.45 0.0
8764 2006-01-01 04:10:00 0.0 0.0 0.0 -0.45 1.65 0.0
... ... ... ... ... ... ...
17515 2006-12-31 19:10:00 0.0 0.0 0.0 4.24 1.32 0.0
17516 2006-12-31 20:10:00 0.0 0.0 0.0 4.00 1.32 0.0
17517 2006-12-31 21:10:00 0.0 0.0 0.0 3.75 1.32 0.0
17518 2006-12-31 22:10:00 0.0 0.0 0.0 4.34 1.54 0.0
17519 2006-12-31 23:10:00 0.0 0.0 0.0 4.92 1.76 0.0
[8760 rows x 7 columns]
and this for 10 years.
I'm trying to take the mean of the values for "20XX-01-01 00:10:00" across all years, to obtain something like "the mean of all values on 01 January at 00:10". Ideally the time column would be merged so it reads just "01-01 00:10:00".
Is it possible?
For now I only know the df.mean() function, which takes all the values of a column and returns a single result, and that's not what I want.
Join all the DataFrames together with concat:
df = pd.concat([df1, df2, df3, ..., df10])
Then aggregate the mean after mapping every timestamp onto the same year - e.g. 2005:
df['time'] = pd.to_datetime(df['time'])
# to remove 29 Feb, uncomment:
# df = df[(df['time'].dt.month != 2) | (df['time'].dt.day != 29)]
df1 = df.groupby(pd.to_datetime(df['time'].dt.strftime('2005-%m-%d %H:%M:%S'))).mean()
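Equivalently, here is a small sketch (assuming the same concatenated df as above) that groups on the calendar components directly and skips the strftime round-trip:
import pandas as pd

df['time'] = pd.to_datetime(df['time'])
t = df['time'].dt
# one group per (month, day, hour, minute), averaged across all years
out = df.groupby([t.month, t.day, t.hour, t.minute]).mean(numeric_only=True)
out.index.names = ['month', 'day', 'hour', 'minute']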

Write & Apply Python Function with Grouped Pandas Data

I have data that is grouped by a column 'plant_name', and I need to write and apply a function to test for a trend on one of the columns, e.g. the one named '10%' or '90%'.
My data looks like this:
plant_name year count mean std min 10% 50% 90% max
0 ARIZONA I 2005 8760.0 8.25 2.21 1.08 5.55 8.19 11.09 15.71
1 ARIZONA I 2006 8760.0 7.87 2.33 0.15 4.84 7.82 10.74 16.75
2 ARIZONA I 2007 8760.0 8.31 2.25 0.03 5.52 8.27 11.23 16.64
3 ARIZONA I 2008 8784.0 7.67 2.46 0.21 4.22 7.72 10.78 15.73
4 ARIZONA I 2009 8760.0 6.92 2.33 0.23 3.79 6.95 9.96 14.64
5 ARIZONA I 2010 8760.0 8.07 2.21 0.68 5.51 7.85 11.14 17.31
6 ARIZONA I 2011 8760.0 7.54 2.38 0.33 4.44 7.45 10.54 17.77
7 ARIZONA I 2012 8784.0 8.61 1.92 0.33 6.37 8.48 11.07 15.84
8 ARIZONA I 2015 8760.0 8.21 2.13 0.60 5.58 8.24 10.88 16.74
9 ARIZONA I 2016 8784.0 8.39 2.27 0.46 5.55 8.32 11.34 16.09
10 ARIZONA I 2017 8760.0 8.32 2.11 0.85 5.70 8.25 11.12 17.96
11 ARIZONA I 2018 8760.0 7.94 2.28 0.07 5.17 7.72 11.04 16.31
12 ARIZONA I 2019 8760.0 7.71 2.49 0.38 4.28 7.75 10.87 15.79
13 ARIZONA I 2020 8784.0 7.57 2.43 0.50 4.36 7.47 10.78 15.69
14 CAETITE I 2005 8760.0 8.11 3.15 0.45 3.76 8.38 12.08 18.89
15 CAETITE I 2006 8760.0 7.70 3.21 0.05 3.50 7.66 12.05 19.08
16 CAETITE I 2007 8760.0 8.64 3.18 0.01 4.05 8.83 12.63 18.57
17 CAETITE I 2008 8784.0 7.87 3.09 0.28 3.75 7.80 11.92 18.54
18 CAETITE I 2009 8760.0 7.31 3.02 0.17 3.46 7.21 11.40 19.46
19 CAETITE I 2010 8760.0 8.00 3.24 0.34 3.63 8.03 12.29 17.27
I'm using this function from the pymannkendall package -
import pymannkendall as mk
and you apply the function like this:
mk.original_test(dataframe)
I need the final dataframe to look like this which is the result of the series columns returned by the function (mk.original_test):
trend, h, p, z, Tau, s, var_s, slope, intercept = mk.original_test(data)
plant_name trend h p z Tau s var_s slope intercept
0 ARIZONA I no trend False 0.416 0.812 xxx x x x x
1 CAETITE I increasing True 0.002 3.6 xxx x x x x
I'm just not sure how to use groupby to group by the plant_name column and then apply the mk function per plant_name to either of those columns in the data shown. Thank you.
For a given column, you can run the test in a GroupBy.apply() and return the result as a Series indexed by result._fields:
def mktest(x):
    result = mk.original_test(x)
    return pd.Series(result, index=result._fields)

column = '10%'
df.groupby('plant_name', as_index=False)[column].apply(mktest)
plant_name  trend     h      p         z          Tau        s     var_s       slope      intercept
ARIZONA I   no trend  False  0.956276  -0.054827  -0.021978  -2.0  332.666667  -0.003333  5.361667
CAETITE I   no trend  False  0.452370  -0.751469  -0.333333  -5.0  28.333333   -0.026000  3.755000
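To run the test on several columns at once, a sketch (assuming the same df and mktest defined above; the column list is only illustrative) that builds one result frame per column:
cols = ['10%', '90%']
results = {c: df.groupby('plant_name')[c].apply(mktest).unstack() for c in cols}
results['10%']  # rows indexed by plant_name; columns trend, h, p, z, Tau, s, var_s, slope, intercept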

Merge old and new table and fill values by date

I have df1:
Date        Symbol  Time      Quantity  Price
2020-09-04  AAPL    09:54:48  11.0      115.97
2020-09-16  AAPL    09:30:02  -11.0     115.33
2020-02-24  AMBA    09:30:02  22.0      64.24
2020-02-25  AMBA    14:01:28  -22.0     62.64
2020-07-14  AMGN    09:30:01  5.0       243.90
...         ...     ...       ...       ...
2020-12-08  YUMC    09:30:00  -22.0     56.89
2020-11-18  Z       14:20:01  12.0      100.68
2020-11-20  Z       09:30:01  -12.0     109.25
2020-09-04  ZS      09:45:24  9.0       135.94
2020-09-14  ZS      09:38:23  -9.0      126.41
and df2:
     Date        USD
2    2020-02-01  22.702
3    2020-03-01  22.753
4    2020-06-01  22.601
5    2020-07-01  22.626
6    2020-08-01  22.739
..   ...         ...
248  2020-12-23  21.681
249  2020-12-28  21.482
250  2020-12-29  21.462
251  2020-12-30  21.372
252  2020-12-31  21.387
I want to add a new column "USD" to df1, taken from df2 and matched by date.
I'm trying:
new_df = (dane5.reset_index()
.merge(kurz2,how='outer')
.fillna(0)
.set_index('Date'))
new_df.sort_index(inplace=True)
new_df= new_df[new_df['Symbol'] != 0]
print(new_df.head(50))
But some rows come back with a zero value:
Date        Symbol  Time      Quantity  Price       USD
2020-01-02  GL      10:31:14  13.0      104.550000  0.000
2020-01-02  ATEC    13:35:04  211.0     6.860000    0.000
2020-01-03  IOVA    14:02:32  56.0      25.790000   0.000
2020-01-03  TGNA    09:30:00  90.0      16.080000   0.000
2020-01-03  SCS     09:30:01  -70.0     20.100000   0.000
2020-01-03  SKX     09:30:09  34.0      41.940000   0.000
2020-01-06  IOVA    09:45:19  -56.0     24.490000   24.163
2020-01-06  GL      09:30:02  -13.0     103.430000  24.163
2020-01-06  SKX     15:55:15  -34.0     43.900000   24.163
2020-01-07  TGNA    15:55:16  -90.0     16.945000   23.810
2020-01-07  MRTX    09:46:18  -13.0     101.290000  23.810
2020-01-07  MRTX    09:34:10  13.0      109.430000  23.810
2020-01-08  ITCI    09:30:01  49.0      27.640000   0.000
Could you help me, please?
Sorry for my bad English.
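The zeros appear because the plain merge only matches exact dates: any trade date missing from df2 gets NaN, which fillna(0) then turns into 0. A minimal sketch of one possible fix (assuming Date can be parsed as datetime in both frames) that carries the most recent available rate forward with merge_asof:
import pandas as pd

df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])

# both sides must be sorted by the key column for merge_asof
new_df = pd.merge_asof(df1.sort_values('Date'),
                       df2.sort_values('Date'),
                       on='Date',
                       direction='backward')  # latest USD rate on or before each trade date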

Group by Index of Row in Pandas

I want to group and sum every 7 rows together (hence getting a total for each week). There are currently two columns, one for the date and the other for a float.
1/22/2020 NaN
1/23/2020 0.0
1/24/2020 1.0
1/25/2020 0.0
1/26/2020 3.0
1/27/2020 0.0
1/28/2020 0.0
1/29/2020 0.0
1/30/2020 0.0
1/31/2020 2.0
2/1/2020 1.0
2/2/2020 0.0
2/3/2020 3.0
2/4/2020 0.0
2/5/2020 0.0
2/6/2020 0.0
2/7/2020 0.0
2/8/2020 0.0
2/9/2020 0.0
2/10/2020 0.0
2/11/2020 1.0
2/12/2020 0.0
2/13/2020 1.0
2/14/2020 0.0
2/15/2020 0.0
2/16/2020 0.0
2/17/2020 0.0
2/18/2020 0.0
2/19/2020 0.0
2/20/2020 0.0
... ...
2/28/2020 0.0
2/29/2020 8.0
3/1/2020 6.0
3/2/2020 23.0
3/3/2020 20.0
3/4/2020 31.0
3/5/2020 68.0
3/6/2020 45.0
3/7/2020 119.0
3/8/2020 114.0
3/9/2020 64.0
3/10/2020 194.0
3/11/2020 397.0
3/12/2020 452.0
3/13/2020 590.0
3/14/2020 710.0
3/15/2020 61.0
3/16/2020 1389.0
3/17/2020 1789.0
3/18/2020 906.0
3/19/2020 3068.0
3/20/2020 4009.0
3/21/2020 4017.0
3/23/2020 25568.0
3/24/2020 10074.0
3/25/2020 12043.0
3/26/2020 18058.0
3/27/2020 17822.0
3/28/2020 19825.0
3/29/2020 19408.0
Assuming your date column is called dt and your value column is val:
import numpy as np
import pandas as pd

# in case it's not already in datetime format:
df["dt"] = pd.to_datetime(df["dt"])
# your data looks sorted, but in case it's not - sorted order is a prerequisite here:
df = df.sort_values("dt")
# every consecutive block of 7 rows becomes one group
df = df.groupby(np.arange(len(df)) // 7).agg({"dt": ["min", "max"], "val": "sum"})
The aggregation for dt is done only so you can explicitly see each aggregated interval - it might be enough to just take the min, for instance, or to ignore it entirely...
Set the date column as the index and use resample
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
df.resample('1W').sum()
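Note that the two answers bin differently: the groupby(np.arange(len(df)) // 7) version makes groups of exactly 7 rows regardless of gaps in the dates, while resample('1W') bins by calendar week, so weeks with missing dates (e.g. 3/22/2020 above) simply contribute fewer rows. A tiny sketch (assuming the columns are named dt and val, as in the first answer) that labels each calendar week by its end date:
import pandas as pd

df['dt'] = pd.to_datetime(df['dt'])
# weekly totals indexed by the Sunday that ends each week (pandas' default week label)
weekly = df.set_index('dt')['val'].resample('1W').sum()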

Pandas - ufunc 'subtract' did not contain a loop with signature matching type

I have this piece of code:
self.value = 0.8
for col in df.ix[:, 'value1':'value3']:
    df = df.iloc[abs(df[col] - self.value).argsort()]
which works perfectly as part of the main() function. At return, it prints:
artist track pos neg neu
4 Sufjan Stevens Casimir Pulaski Day 0.09 0.91 0.0
9 Sufjan Stevens The Only Thing 0.09 0.91 0.0
5 Radiohead Desert Island Disk 0.08 0.92 0.0
0 Sufjan Stevens Should Have Known Better 0.07 0.93 0.0
1 Sufjan Stevens To Be Alone With You 0.05 0.95 0.0
8 Radiohead Daydreaming 0.05 0.95 0.0
3 Sufjan Stevens Death with Dignity 0.03 0.97 0.0
11 Elliott Smith Between the Bars 0.03 0.97 0.0
2 Jeff Buckley Hallelujah 0.39 0.61 0.0
6 Radiohead Codex 0.00 1.00 0.0
7 Aphex Twin Avril 14th 0.00 1.00 0.0
10 Radiohead You And Whose Army? 0.00 1.00 0.0
However, when I import this function as part of a module, even though I'm passing and printing the same 0.8 self.value, I get the following error:
df = df.iloc[(df[col] - self.flavor).argsort()]
File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/ops.py", line 721, in wrapper
result = wrap_results(safe_na_op(lvalues, rvalues))
File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/ops.py", line 682, in safe_na_op
return na_op(lvalues, rvalues)
File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/ops.py", line 668, in na_op
result[mask] = op(x[mask], y)
TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('S32') dtype('S32') dtype('S32')
Why is it so? What is going on?
pd.DataFrame.ix has been deprecated. You should stop using it.
Your use of 'value1':'value3' is dangerous, as it can include columns you didn't expect if your columns aren't positioned in the order you thought.
df = pd.DataFrame(
    [['a', 'b', 1, 2, 3]],
    columns='artist track v1 v2 v3'.split()
)
list(df.loc[:, 'v1':'v3'])

['v1', 'v2', 'v3']
But rearrange the columns and
list(df.loc[:, ['v1', 'v2', 'artist', 'v3', 'track']].loc[:, 'v1':'v3'])
['v1', 'v2', 'artist', 'v3']
You got 'artist' in the list, and column 'artist' is of type string, so it can't be subtracted from or by an integer or float.
df['artist'] - df['v1']
> TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('<U21') dtype('<U21') dtype('<U21')
Setup
Shuffle df
df = df.sample(frac=1)
df
artist track pos neg neu
0 Sufjan Stevens Should Have Known Better 0.07 0.93 0.0
8 Radiohead Daydreaming 0.05 0.95 0.0
1 Sufjan Stevens To Be Alone With You 0.05 0.95 0.0
5 Radiohead Desert Island Disk 0.08 0.92 0.0
11 Elliott Smith Between the Bars 0.03 0.97 0.0
7 Aphex Twin Avril 14th 0.00 1.00 0.0
2 Jeff Buckley Hallelujah 0.39 0.61 0.0
4 Sufjan Stevens Casimir Pulaski Day 0.09 0.91 0.0
9 Sufjan Stevens The Only Thing 0.09 0.91 0.0
3 Sufjan Stevens Death with Dignity 0.03 0.97 0.0
6 Radiohead Codex 0.00 1.00 0.0
10 Radiohead You And Whose Army? 0.00 1.00 0.0
Solution
Use np.lexsort:
import numpy as np

value = 0.8
v = df[['pos', 'neg', 'neu']].values
# sort rows by absolute distance from value; lexsort treats the last key as the primary sort key
df.iloc[np.lexsort(np.abs(v - value).T)]
artist track pos neg neu
4 Sufjan Stevens Casimir Pulaski Day 0.09 0.91 0.0
9 Sufjan Stevens The Only Thing 0.09 0.91 0.0
5 Radiohead Desert Island Disk 0.08 0.92 0.0
0 Sufjan Stevens Should Have Known Better 0.07 0.93 0.0
8 Radiohead Daydreaming 0.05 0.95 0.0
1 Sufjan Stevens To Be Alone With You 0.05 0.95 0.0
11 Elliott Smith Between the Bars 0.03 0.97 0.0
3 Sufjan Stevens Death with Dignity 0.03 0.97 0.0
2 Jeff Buckley Hallelujah 0.39 0.61 0.0
7 Aphex Twin Avril 14th 0.00 1.00 0.0
6 Radiohead Codex 0.00 1.00 0.0
10 Radiohead You And Whose Army? 0.00 1.00 0.0
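A slightly more defensive variant (just a sketch, assuming the same df and value as above) that avoids hard-coding the column names and guarantees only numeric columns enter the subtraction:
import numpy as np

value = 0.8
num = df.select_dtypes('number')  # only numeric columns, so no string subtraction can occur
df.iloc[np.lexsort(np.abs(num.to_numpy() - value).T)]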