Output last row of each year - pandas

My dataframe looks like this:
             Close   Volume  Dividends
Date
2014-08-07   14.21  4848000       0.00
2014-08-08   13.95  5334000       0.00
2014-08-11   14.07  4057000       0.00
2014-08-12   14.13  2611000       0.00
2014-08-13   14.15  3743000       0.28
...            ...      ...        ...
2020-08-03   19.45  7352600       0.00
2020-08-04   19.69  4250500       0.00
2020-08-05   19.83  3414080       0.00
2020-08-06   20.40  6128100       0.00
2020-08-07   20.60  8295000       0.00
I'd like to output the closing price for the last day of each year. I tried the following:
df = df.groupby(df.index.year)['Close'].tail(1)
Date
2014-12-31 16.39
2015-12-31 13.67
2016-12-30 14.78
2017-12-29 21.83
2018-12-31 21.64
2019-12-31 25.00
2020-08-07 20.60
I want the output to be:
Date
2014 16.39
2015 13.67
2016 14.78
2017 21.83
...
Any help would be very much appreciated. Many Thanks!

Try with last():
df = df.groupby(df.index.year)['Close'].last()
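The difference between the two calls is the index: tail(1) keeps each group's original row labels (the dates), while last() re-indexes the result by the group key, so the index becomes the plain year. A minimal sketch (the tiny stand-in frame is an assumption for illustration, not the asker's data):

```python
import pandas as pd

# Tiny stand-in for the asker's frame -- values are made up for illustration.
idx = pd.to_datetime(['2014-08-07', '2014-12-31', '2015-06-01', '2015-12-31'])
df = pd.DataFrame({'Close': [14.21, 16.39, 13.00, 13.67]}, index=idx)

# .last() indexes the result by the group key, i.e. the year itself.
out = df.groupby(df.index.year)['Close'].last()
print(out)  # index is 2014, 2015 -- the shape the asker wants
```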

Related

Filter a DataFrame based on groupby method and a column

I have the following DF:
symbol cml_units number_of_shares price time gain_loss cml_cost cash_flow avg_price
1 BP.L 2 2 504.8275 2022-10-04 14:14:11 0.00 1009.65 -1009.65 504.83
3 BP.L 0 -2 504.2625 2022-10-04 14:43:18 -1.13 -0.01 1008.52 0.00
4 AAPL 0 -3 142.4500 2022-10-04 15:28:33 0.00 284.93 427.35 0.00
5 AAPL 6 3 146.4000 2022-10-06 10:13:53 0.00 1151.51 -439.20 191.92
8 AAPL 47 47 171.5200 2022-08-18 13:45:02 0.00 8061.44 -8061.44 171.52
15 AAPL 0 -47 149.8400 2022-09-25 19:18:42 -1018.96 0.00 7042.48 0.00
20 AAPL 10 7 140.0900 2022-10-09 13:53:05 0.00 1692.94 -980.63 169.29
22 AAPL 3 3 142.4500 2022-10-04 09:06:15 0.00 712.31 -427.35 142.46
23 AAPL 0 3 138.3400 2022-10-13 09:38:23 -24.18 0.00 415.02 0.00
29 AAPL 0 7 138.3400 2022-10-13 09:38:26 -12.25 0.00 968.38 0.00
31 AAPL 5 5 138.3400 2022-10-13 09:46:32 0.00 691.70 -691.70 138.34
38 AAPL 0 5 150.3200 2022-11-01 18:42:08 59.90 0.00 751.60 0.00
44 AAPL 1 1 150.2700 2022-11-01 18:42:47 0.00 150.27 -150.27 150.27
55 AAPL 0 1 149.7000 2022-11-14 12:41:36 -0.57 0.00 149.70 0.00
66 BP.L 2 2 562.4942 2022-10-14 12:42:48 0.00 1124.98 -1124.99 562.49
68 AAPL 2 2 149.7000 2022-11-14 14:39:57 0.00 299.40 -299.40 149.70
70 AAPL 0 -2 148.2800 2022-11-15 09:07:41 -2.84 0.00 296.56 0.00
73 BP.L 1 -1 562.1850 2022-11-15 09:12:41 -0.31 562.49 562.18 562.49
74 AAPL 3 3 148.2800 2022-11-15 13:14:36 0.00 444.84 -444.84 148.28
I need to filter out all the rows that are previous to the last time cml_units was 0 for each symbol.
For example on the above DF the result should be:
symbol cml_units number_of_shares price time gain_loss cml_cost cash_flow avg_price
66 BP.L 2 2 562.4942 2022-10-14 12:42:48 0.00 1124.98 -1124.99 562.49
73 BP.L 1 -1 562.1850 2022-11-15 09:12:41 -0.31 562.49 562.18 562.49
74 AAPL 3 3 148.2800 2022-11-15 13:14:36 0.00 444.84 -444.84 148.28
This is because the BP.L row at 2022-10-14 12:42:48 was the first purchase after cml_units reached 0 at 2022-10-04 14:43:18, and the AAPL row at 2022-11-15 13:14:36 was the first purchase after cml_units reached 0 at 2022-11-15 09:07:41.
This DF can be in any shape, so I am trying to find a general way to achieve this, even if the DF contains other stocks.
First, sort your df by time. Then you can group and concatenate based on the condition:
df = df.sort_values('time')
df_out = pd.DataFrame()
for sym, sub_df in df.groupby('symbol'):
    zero_dates = sub_df[sub_df['cml_units'] == 0]['time']
    if not zero_dates.empty:
        last_zero_date = zero_dates.values[-1]
    else:
        last_zero_date = pd.to_datetime(0)
    df_out = pd.concat([df_out, sub_df[sub_df['time'] > last_zero_date]])
print(df_out)
Edit: added handling of the case where cml_units is always > 0 for a symbol.
Output:
symbol cml_units number_of_shares price time gain_loss cml_cost cash_flow avg_price
id
74 AAPL 3 3 148.2800 2022-11-15 13:14:36 0.00 444.84 -444.84 148.28
66 BP.L 2 2 562.4942 2022-10-14 12:42:48 0.00 1124.98 -1124.99 562.49
73 BP.L 1 -1 562.1850 2022-11-15 09:12:41 -0.31 562.49 562.18 562.49
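For larger frames, a loop-free variant of the same idea: compute each symbol's last zero timestamp once, map it back onto the rows, and keep everything after it. This sketch assumes the time column is already datetime; symbols that never hit zero fall back to the epoch, so all their rows are kept. The miniature frame is hypothetical:

```python
import pandas as pd

def rows_after_last_zero(df):
    """Keep, per symbol, only the rows strictly after the last time
    cml_units was 0; symbols with no zero keep all their rows."""
    last_zero = (df[df['cml_units'].eq(0)]
                 .groupby('symbol')['time'].max())
    cutoff = df['symbol'].map(last_zero).fillna(pd.Timestamp(0))
    return df[df['time'] > cutoff]

# Hypothetical miniature frame to show the behaviour:
demo = pd.DataFrame({
    'symbol':    ['AAPL', 'AAPL', 'AAPL', 'BP.L', 'BP.L'],
    'cml_units': [2, 0, 3, 1, 2],
    'time': pd.to_datetime(['2022-11-14 14:39', '2022-11-15 09:07',
                            '2022-11-15 13:14', '2022-10-14 12:42',
                            '2022-11-15 09:12']),
})
print(rows_after_last_zero(demo))  # 1 AAPL row, both BP.L rows
```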

Awk: Formatting the order of table data in awk script

I have the following output file. Please note that this data is dynamic, so there could be more or fewer years, and many more categories A, B, C, D...:
2015 2016 2017
EX
FE
B 0.00 -2.00 -1.00
D 0.00 -1.00 0.00
sumFE 0.00 -3.00 -1.00
VE
B 0.00 -3.00 0.00
C -4.00 0.00 0.00
D 0.00 -5.00 0.00
sumVE -4.00 -8.00 0.00
sumE -4.00 -11.00 -1.00
IN
FI
A 8.00 0.00 0.00
C 0.00 0.00 8.00
sumFI 8.00 0.00 8.00
VI
A 0.00 0.00 5.00
B 4.00 0.00 0.00
sumVI 4.00 0.00 5.00
sumI 12.00 0.00 13.00
net 8.00 -11.00 12.00
I am trying to format it like this.
2015 2016 2017
IN
VI
A 0.00 0.00 5.00
B 4.00 0.00 0.00
sumVI 4.00 0.00 5.00
FI
A 8.00 0.00 0.00
C 0.00 0.00 8.00
sumFI 8.00 0.00 8.00
sumI 12.00 0.00 13.00
EX
VE
B 0.00 -3.00 0.00
C -4.00 0.00 0.00
D 0.00 -5.00 0.00
sumVE -4.00 -8.00 0.00
FE
B 0.00 -2.00 -1.00
D 0.00 -1.00 0.00
sumFE 0.00 -3.00 -1.00
sumE -4.00 -11.00 -1.00
net 8.00 -11.00 12.00
I have tried the following script as a start:
#!/usr/bin/env bash
awk '
BEGIN { FS="\t" }
NR < 3 { print "D" $0 }
$1 ~ /^I$/, $1 ~ /^sumI$/ { print }
$1 ~ /^E$/, $1 ~ /^sumE$/ { print }
$1 ~ /net/ { print ORS $0 }
' "${@:--}"
The script gets most of the way toward swapping the E and I blocks, but the output order follows the input order, so the I block is still printed last. Can someone please help with this?
It will probably be easier to modify the originating code to use GNU awk's predefined array scanning orders. The key is to switch the scanning order (PROCINFO["sorted_in"]) just before the associated for (index in array) loop.
Adding four lines of code (see the # comments) to what I'm guessing is the originating code:
...
END {
    for (year = minYear; year <= maxYear; year++) {
        printf "%s%s", OFS, year
    }
    print ORS

    PROCINFO["sorted_in"] = "@ind_str_desc"            # sort cat == { I | E } in descending order

    for (cat in ctiys2amounts) {
        printf "%s\n\n", (cat == "I") ? "IN" : "EX"    # print { IN | EX }
        delete catSum

        PROCINFO["sorted_in"] = "@ind_str_desc"        # sort type == { VI | FI } || { VE | FE } in descending order

        for (type in ctiys2amounts[cat]) {
            print type
            delete typeSum

            PROCINFO["sorted_in"] = "@ind_str_asc"     # sort item == { A | B | C | D } in ascending order

            for (item in ctiys2amounts[cat][type]) {
                printf "%s", item
                for (year = minYear; year <= maxYear; year++) {
                    amount = ctiys2amounts[cat][type][item][year]
                    printf "%s%0.2f", OFS, amount
                    typeSum[year] += amount
                }
                print ""
            }
....
This generates:
2015 2016 2017
IN
VI
A 0.00 0.00 5.00
B 4.00 0.00 0.00
sumVI 4.00 0.00 5.00
FI
A 8.00 0.00 0.00
C 0.00 0.00 8.00
sumFI 8.00 0.00 8.00
sumI 12.00 0.00 13.00
EX
VE
B 0.00 -3.00 0.00
C -4.00 0.00 0.00
D 0.00 -5.00 0.00
sumVE -4.00 -8.00 0.00
FE
B 0.00 -2.00 -1.00
D 0.00 -1.00 0.00
sumFE 0.00 -3.00 -1.00
sumE -4.00 -11.00 -1.00
net 8.00 -11.00 12.00

I want to create a dynamic pivot table which I can publish as a view in SQL Server. Please see more details below [closed]

Closed. This question needs to be more focused. It is not currently accepting answers. Closed 3 years ago.
There are 2 tables: table one has acctrefno and date_purchased, and table two has the date paid and payment amount.
Here is sample data for table 1
acctrefno FirstPayDate
5 2009-11-05
22 2012-04-15
28 2017-08-15
29 2018-09-15
Here is the sample data for table 2
acctrefno FirstPayDate date_paid payment_amount
5 2009-11-05 2009-11-13 77.86
5 2009-11-05 2009-12-07 77.86
5 2009-11-05 2010-01-05 77.86
5 2009-11-05 2010-02-05 77.86
5 2009-11-05 2010-03-05 77.86
5 2009-11-05 2010-04-05 77.86
5 2009-11-05 2010-05-05 77.86
5 2009-11-05 2010-06-07 77.86
5 2009-11-05 2010-07-06 77.86
5 2009-11-05 2010-08-05 77.86
5 2009-11-05 2010-09-07 77.86
22 2012-04-15 2012-05-31 173.48
22 2012-04-15 2012-06-11 168.48
22 2012-04-15 2012-06-25 173.48
22 2012-04-15 2012-07-02 168.48
22 2012-04-15 2012-08-13 125.00
22 2012-04-15 2012-08-31 48.48
22 2012-04-15 2012-09-17 125.00
22 2012-04-15 2012-10-10 48.48
22 2012-04-15 2012-10-22 125.00
22 2012-04-15 2012-11-05 48.48
22 2012-04-15 2012-11-13 125.00
28 2017-08-15 2017-08-14 136.00
28 2017-08-15 2017-09-11 170.00
28 2017-08-15 2017-10-17 136.00
28 2017-08-15 2017-11-15 136.00
28 2017-08-15 2017-12-13 170.00
28 2017-08-15 2018-04-16 142.78
28 2017-08-15 2018-05-04 135.98
28 2017-08-15 2018-05-21 102.60
28 2017-08-15 2018-11-20 4.00
28 2017-08-15 2018-11-20 132.00
28 2017-08-15 2018-12-19 8.00
28 2017-08-15 2018-12-19 135.98
28 2017-08-15 2018-12-19 26.02
28 2017-08-15 2019-01-17 4.00
28 2017-08-15 2019-01-17 109.96
28 2017-08-15 2019-01-17 22.04
28 2017-08-15 2019-02-14 4.00
29 2018-09-15 2018-09-17 155.48
I am looking to get an output something like this
loan_number Month -4 Month -3 Month -2 Month -1 Month 0 Month 1 Month 2 Month 3 Month 4 Month 5 Month 6 Month 7 Month 8 Month 9 Month 10 Month 11 Month 12 Month 13
203026 0.00 0.00 0.00 0.00 77.86 77.86 77.86 77.86 77.86 77.86 77.86 77.86 77.86 77.86 77.86 77.86 77.86 77.86
259796 0.00 0.00 0.00 0.00 0.00 173.48 341.96 168.48 173.48 125.00 173.48 173.48 173.48 216.96 168.48 125.00 221.96 125.00
428086 0.00 0.00 0.00 0.00 136.00 170.00 136.00 136.00 170.00 0.00 0.00 0.00 142.78 238.58 0.00 0.00 0.00 0.00
550343 0.00 0.00 0.00 0.00 155.48 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Here, Month 0 means the month of the date paid equals the month of the first pay date, Month 1 is one month after the first pay date, and Month -1 is one month before the first pay date.
The reason I want a dynamic query is that there are more than 100 thousand accounts, all started at different times within the last 6 years and with different payment dates. There is already a working solution for this query, but it is not dynamic in the month columns: you have to specify the month columns manually.
Assuming you don't need dynamic SQL: the only trick is converting the strings into dates, after which it is a small matter to calculate the datediff(MONTH, ...).
Example dbFiddle
Select ID
      ,[Month 1] = IsNull([1], 0)
      ,[Month 2] = IsNull([2], 0)
      ,[Month 3] = IsNull([3], 0)
      ,[Month 4] = IsNull([4], 0)
      ,[Month 5] = IsNull([5], 0)
 From (
       Select ID
             ,Item  = DateDiff(MONTH, try_convert(date, replace([Date Purchase], ' ', ' 1, ')), try_convert(date, replace([Payment Date], ',', ' ')))
             ,Value = [Payment Amount]
        From YourTable
      ) src
 Pivot ( sum(Value) for Item in ([1],[2],[3],[4],[5]) ) pvt
Returns
ID Month 0 Month 1 Month 2 Month 3 Month 4 Month 5
1550 0.00 120.00 120.00 0.00 0.00 0.00
1551 0.00 130.00 135.00 0.00 90.00 0.00
1552 0.00 0.00 102.00 0.00 900.00 0.00
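Since the question explicitly asks for a dynamic set of month columns, here is a hedged sketch of the dynamic-SQL variant of the answer above. The table and column names (YourTable, FirstPayDate, date_paid, payment_amount, acctrefno) are assumptions based on the sample data; STRING_AGG needs SQL Server 2017+, and note that a view cannot contain dynamic SQL, so this would have to live in a stored procedure instead:

```sql
-- Sketch only: build the pivot column list at runtime, then execute it.
DECLARE @cols nvarchar(max), @sql nvarchar(max);

-- One bracketed column per distinct month offset, e.g. [-1],[0],[1],...
SELECT @cols = STRING_AGG(QUOTENAME(CONVERT(varchar(11), Item)), ',')
                   WITHIN GROUP (ORDER BY Item)
FROM (SELECT DISTINCT DATEDIFF(MONTH, FirstPayDate, date_paid) AS Item
      FROM YourTable) d;

SET @sql = N'
SELECT acctrefno, ' + @cols + N'
FROM (SELECT acctrefno,
             Item  = DATEDIFF(MONTH, FirstPayDate, date_paid),
             Value = payment_amount
      FROM YourTable) src
PIVOT (SUM(Value) FOR Item IN (' + @cols + N')) pvt;';

EXEC sp_executesql @sql;
```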

Deleting entire columns of from a text file using CUT command or AWK program

I have a text file in the form below. Could someone tell me how to delete columns 2, 3, 4, 5, 6 and 7? I want to keep only columns 1, 8 and 9.
37.55 6.00 24.98 0.00 -2.80 -3.90 26.675 './gold_soln_CB_FragLib_Controls_m1_9.mol2' 'ethyl'
38.45 1.39 27.36 0.00 -0.56 -2.48 22.724 './gold_soln_CB_FragLib_Controls_m2_6.mol2' 'pyridin-2-yl(pyridin-3-yl)methanone'
38.47 0.00 28.44 0.00 -0.64 -2.42 20.387 './gold_soln_CB_FragLib_Controls_m3_3.mol2' 'pyridin-2-yl(pyridin-4-yl)methanone'
42.49 0.07 30.87 0.00 -0.03 -3.24 22.903 './gold_soln_CB_FragLib_Controls_m4_5.mol2' '(3-chlorophenyl)(pyridin-3-yl)methanone'
38.20 1.47 27.53 0.00 -1.13 -3.28 22.858 './gold_soln_CB_FragLib_Controls_m5_2.mol2' 'dipyridin-4-ylmethanone'
41.87 0.57 30.53 0.00 -0.67 -3.16 22.829 './gold_soln_CB_FragLib_Controls_m6_9.mol2' '(3-chlorophenyl)(pyridin-4-yl)methanone'
38.18 1.49 27.09 0.00 -0.56 -1.63 7.782 './gold_soln_CB_FragLib_Controls_m7_1.mol2' '3-hydrazino-6-phenylpyridazine'
39.45 1.50 27.71 0.00 -0.15 -4.17 17.130 './gold_soln_CB_FragLib_Controls_m8_6.mol2' '3-hydrazino-6-phenylpyridazine'
41.54 4.10 27.71 0.00 -0.65 -4.44 9.702 './gold_soln_CB_FragLib_Controls_m9_4.mol2' '3-hydrazino-6-phenylpyridazine'
41.05 1.08 29.30 0.00 -0.31 -2.44 28.590 './gold_soln_CB_FragLib_Controls_m10_3.mol2' '3-hydrazino-6-(4-methylphenyl)pyridazine'
Try:
awk '{print $1"\t"$8"\t"$9}' yourfile.tsv > only189.tsv
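Since the title also mentions cut: cut only splits on a single delimiter character, and this file is aligned with runs of spaces, so squeeze them into single tabs first. A sketch (the filenames are placeholders):

```shell
# Squeeze runs of spaces down to single tabs, then keep fields 1, 8 and 9.
tr -s ' ' '\t' < yourfile.txt | cut -f1,8,9 > only189.tsv
```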

SQL: Calculating Turn-Around-Time with Overlapping Consideration

I have a table (parts) that stores when an item was requested and when it was issued. From this I can easily compute each item's turn-around-time ("TAT"). What I'd like is another column ("Computed") in which overlapping request-to-issue date ranges are counted only once.
RecID Requested Issued TAT Computed
MD0001 11/28/2012 12/04/2012 6.00 0.00
MD0002 11/28/2012 11/28/2012 0.00 0.00
MD0003 11/28/2012 12/04/2012 6.00 0.00
MD0004 11/28/2012 11/28/2012 0.00 0.00
MD0005 11/28/2012 12/10/2012 12.00 0.00
MD0006 11/28/2012 01/21/2013 54.00 54.00
MD0007 11/28/2012 11/28/2012 0.00 0.00
MD0008 11/28/2012 12/04/2012 6.00 0.00
MD0009 01/29/2013 01/30/2013 1.00 1.00
MD0010 01/29/2013 01/30/2013 1.00 0.00
MD0011 02/05/2013 02/06/2013 1.00 1.00
MD0012 02/07/2013 03/04/2013 25.00 25.00
MD0013 03/07/2013 03/14/2013 7.00 7.00
MD0014 03/07/2013 03/08/2013 1.00 0.00
MD0015 03/13/2013 03/25/2013 12.00 11.00
MD0016 03/20/2013 03/21/2013 1.00 0.00
Totals 133.00 99.00 <- waiting for parts TAT summary
In the above, I manually filled in the ("Computed") column so that there is an example of what I'm trying to accomplish.
NOTE: Notice how MD0013 affects the computed time for MD0015 because MD0013 was "computed" first. It could equally have been that MD0015 was computed first and MD0013 adjusted accordingly; either way the net result is one day less.
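The "Computed" column is effectively the length of the union of the request-to-issue intervals, counting overlapping days only once. The core of that calculation is easier to see outside SQL; here is a minimal sketch in Python (the function name is mine, not from the question):

```python
from datetime import date

def merged_days(intervals):
    """Total days covered by (requested, issued) intervals,
    counting overlapping stretches only once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:          # close out the previous run
                total += (cur_end - cur_start).days
            cur_start, cur_end = start, end
        else:                                # overlap: extend the current run
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += (cur_end - cur_start).days
    return total

# The four March rows (MD0013..MD0016) collapse to 03/07-03/25 = 18 days,
# matching the question's 7 + 0 + 11 + 0.
march = [(date(2013, 3, 7),  date(2013, 3, 14)),
         (date(2013, 3, 7),  date(2013, 3, 8)),
         (date(2013, 3, 13), date(2013, 3, 25)),
         (date(2013, 3, 20), date(2013, 3, 21))]
print(merged_days(march))  # 18
```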