Add column for percentages - pandas

I have a df that looks like this:
   Total  Initial  Follow  Sched  Supp  Any
0   5525     3663     968    296    65  533
I transposed the df because I have to add a column with the percentages based on 'Total'.
Now my df looks like this:
            0
Total    5525
Initial  3663
Follow    968
Sched     296
Supp       65
Any       533
So, how can I add this percentage column?
The expected output looks like this:
            0  Percentage
Total    5525        100
Initial  3663       66.3
Follow    968       17.5
Sched     296        5.4
Supp       65        1.2
Any       533        9.6
Do you know how I can add this new column?
I'm working in JupyterLab with pandas and numpy.

Divide column 0 by the scalar from the 'Total' row with Series.div, then multiply by 100 with Series.mul, and finally round with Series.round:
df['Percentage'] = df[0].div(df.loc['Total', 0]).mul(100).round(1)
print(df)
            0  Percentage
Total    5525       100.0
Initial  3663        66.3
Follow    968        17.5
Sched     296         5.4
Supp       65         1.2
Any       533         9.6
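For reference, a self-contained sketch of the whole flow, assuming the one-row wide frame from the question (names and values mirror the example above):
import pandas as pd

# Rebuild the one-row frame from the question, then transpose it so the
# categories become the index and the counts land in column 0.
df = pd.DataFrame(
    {"Total": [5525], "Initial": [3663], "Follow": [968],
     "Sched": [296], "Supp": [65], "Any": [533]}
).T

# Percentage of each row relative to the 'Total' row.
df["Percentage"] = df[0].div(df.loc["Total", 0]).mul(100).round(1)
print(df)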

Consider below df:
In [1328]: df
Out[1328]:
            b
a
Total    5525
Initial  3663
Follow    968
Sched     296
Supp       65
Any       533
In [1327]: df['Perc'] = round(df.b.div(df.loc['Total', 'b']) * 100, 1)
In [1330]: df
Out[1330]:
            b   Perc
a
Total    5525  100.0
Initial  3663   66.3
Follow    968   17.5
Sched     296    5.4
Supp       65    1.2
Any       533    9.6


How to arrange df in ascending order and reset the index numbering

My question is about stock data.
Open Price High Price Low Price Close Price WAP No.of Shares No. of Trades Total Turnover (Rs.) Deliverable Quantity % Deli. Qty to Traded Qty Spread High-Low Spread Close-Open Pert Rank Year
Date
2022-12-30 419.75 421.55 415.55 418.95 417.841704 1573 183 657265.0 954 60.65 6.00 -0.80 0.131558 2022
2022-12-29 412.15 418.40 411.85 415.90 413.236152 1029 117 425220.0 766 74.44 6.55 3.75 0.086360 2022
2022-12-28 411.90 422.05 411.30 415.35 417.917534 2401 217 1003420.0 949 39.53 10.75 3.45 0.128329 2022
2022-12-27 409.60 414.70 407.60 412.70 411.436312 1052 136 432831.0 687 65.30 7.10 3.10 0.066182 2022
2022-12-26 392.00 409.55 389.60 406.35 400.942300 2461 244 986719.0 1550 62.98 19.95 14.35 0.240920 2022
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2018-01-05 338.75 358.70 338.75 355.65 351.255581 31802 896 11170630.0 15781 49.62 19.95 16.90 0.949153 2018
The date column is in descending order and has to be converted to ascending order. At the same time, the index has to be converted to ascending order, i.e. 1, 2, 3, 4; it should not be in descending order.
I tried sort_values, but it returns a NoneType object. I am expecting a DataFrame. I also tried groupby. Is there any other way?
Sorting date values with sort_values works for me:
df = pd.DataFrame({'Dates': ['2022-12-30', '2022-12-29','2022-12-28'],'Prices':[100,101,99]})
df
Out[142]:
Dates Prices
0 2022-12-30 100
1 2022-12-29 101
2 2022-12-28 99
df.sort_values('Dates', ascending=True, inplace=True)
df
Out[144]:
Dates Prices
2 2022-12-28 99
1 2022-12-29 101
0 2022-12-30 100
You need to sort_values and reset_index:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(
    {
        "Dates": pd.Series(pd.date_range("2022-12-24", "2022-12-29")),
        "Prices": pd.Series(np.random.randint(0, 100, size=(6,)))
    })
>>> df
Dates Prices
0 2022-12-24 31
1 2022-12-25 2
2 2022-12-26 27
3 2022-12-27 90
4 2022-12-28 87
5 2022-12-29 49
>>> df = df.sort_values(by="Dates", ascending=True).reset_index(drop=True)
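This also explains the NoneType result mentioned in the question: combining method chaining with inplace=True operates on a temporary object and returns None. A minimal sketch of the pitfall, assuming the same df as above:
# Pitfall: inplace=True makes the call return None, so assigning (or
# further chaining) its result throws the sorted frame away.
result = df.sort_values(by="Dates", inplace=True)
print(result)  # None

# Fix: let each step return a new frame and assign the final result.
df = df.sort_values(by="Dates").reset_index(drop=True)
Note that in the stock data from the question, Date is the index rather than a column, so df = df.sort_index().reset_index() would sort by date and produce the ascending 0, 1, 2, ... numbering in one step.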

Creating a Lookup Matrix in Microsoft Access

I have the matrix below in Excel and want to import it into Access (2016) to then use in queries. The aim is to be able to look up values based on the row and column; e.g., lookup criteria of 10 and 117 should return 98.1.
Is this possible? I'm an Access novice and don't know where to start.
        10     9     8     7     6     5     4     3     2    1    0
120  100.0  96.8  92.6  86.7  78.8  68.2  54.4  37.5  21.3  8.3  0.0
119   99.4  96.2  92.0  86.2  78.5  67.9  54.3  37.5  21.3  8.3  0.0
118   98.7  95.6  91.5  85.8  78.1  67.7  54.1  37.4  21.2  8.3  0.0
117   98.1  95.1  90.9  85.3  77.8  67.4  54.0  37.4  21.2  8.3  0.0
116   97.4  94.5  90.3  84.8  77.4  67.1  53.8  37.4  21.1  8.3  0.0
115   96.8  93.9  89.8  84.4  77.1  66.9  53.7  37.3  21.1  8.3  0.0
Consider creating a table with 3 columns to store this data:
Value1 - numeric
Value2 - numeric
LookupValue - currency
You can then use DLookup to get the value required:
?DLookup("LookupValue","LookupData","Value1=117 AND Value2=10")
If you have the values stored in variables, then you need to concatenate them in:
lngValue1 = 117
lngValue2 = 10
Debug.Print DLookup("LookupValue", "LookupData", "Value1=" & lngValue1 & " AND Value2=" & lngValue2)
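Since the data arrives as a matrix, it first needs to be unpivoted into one row per (Value1, Value2) pair to fit the three-column table. A minimal pandas sketch of that reshaping, assuming the matrix has been exported from Excel to a hypothetical lookup_matrix.csv with the 120..115 labels in the first column:
import pandas as pd

# Read the matrix: first column (120..115) as the index, 10..0 as headers.
wide = pd.read_csv("lookup_matrix.csv", index_col=0)

# Unpivot into one row per (Value1, Value2) pair so Access can store it
# as a plain three-column table.
long = (
    wide.rename_axis(index="Value1", columns="Value2")
        .stack()
        .rename("LookupValue")
        .reset_index()
)
long.to_csv("lookup_table.csv", index=False)  # import this file into Access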

how to select a value based on multiple criteria

I'm trying to select some values based on some proprietary data, and I just changed the variables to reference house prices.
I am trying to get the total offers for houses where they were sold at the bid or at the ask price, with offers under 15 and offers * sale price less than 5,000,000.
I then want to get the total number of offers for each neighborhood on each day, but instead I'm getting the total offers across each neighborhood (n1 + n2 + n3 + n4 + n5) across all dates and the total offers in the dataset across all dates.
My current query is this:
SELECT DISTINCT(neighborhood),
DATE(date_of_sale),
(SELECT SUM(offers)
FROM `big_query.a_table_name.houseprices`
WHERE ((offers * accepted_sale_price < 5000000)
AND (offers < 15)
AND (house_bid = sale_price OR
house_ask = sale_price))) as bid_ask_off,
(SELECT SUM(offers)
FROM `big_query.a_table_name.houseprices`) as
total_offers,
FROM `big_query.a_table_name.houseprices`
GROUP BY neighborhood, DATE(date_of_sale) LIMIT 100
I am expecting a result with the date being repeated throughout as d1, d2, d3, etc., but am instead receiving the same totals repeated on every row.
I'm aware that there are some inherent problems with what I'm trying to select / group, but I'm not sure what to google or what tutorials to look at in order to perform this operation.
It's querying quite a bit of data, and I want to keep costs down, as I've already racked up a smallish bill on queries.
Any help or advice would be greatly appreciated, and I hope I've provided enough information.
Here is a sample of the data:
neighborhood date_of_sale offers accepted_sale_price house_bid house_ask
bronx 4/1/2022 3 323 320 323
manhattan 4/1/2022 4 244 230 244
manhattan 4/1/2022 8 856 856 900
queens 4/1/2022 15 110 110 135
brooklyn 4/2/2022 12 115 100 115
manhattan 4/2/2022 9 255 255 275
bronx 4/2/2022 6 330 300 330
queens 4/2/2022 10 405 395 405
brooklyn 4/2/2022 4 254 254 265
staten_island 4/3/2022 2 442 430 442
staten_island 4/3/2022 13 195 195 225
bronx 4/3/2022 4 650 650 690
manhattan 4/3/2022 2 286 266 286
manhattan 4/3/2022 6 356 356 400
staten_island 4/4/2022 4 361 361 401
staten_island 4/4/2022 5 348 348 399
bronx 4/4/2022 8 397 340 397
manhattan 4/4/2022 9 333 333 394
manhattan 4/4/2022 11 392 325 392
I think this is what you need.
As we group by neighborhood, we do not need DISTINCT.
We take SUM(offers) for total_offers directly from the table, and the bid/ask totals from a sub-query that is grouped by neighborhood and joined on.
SELECT
  h.neighborhood,
  DATE(h.date_of_sale) AS date_,
  s.bids AS bid_ask_off,
  SUM(h.offers) AS total_offers
FROM
  `big_query.a_table_name.houseprices` h
LEFT JOIN
  -- per-neighborhood bid/ask totals (the sample data names the sale
  -- column accepted_sale_price, so that is used here instead of sale_price)
  (SELECT
     neighborhood,
     SUM(offers) AS bids
   FROM
     `big_query.a_table_name.houseprices`
   WHERE offers * accepted_sale_price < 5000000
     AND offers < 15
     AND (house_bid = accepted_sale_price OR
          house_ask = accepted_sale_price)
   GROUP BY neighborhood) s
ON h.neighborhood = s.neighborhood
GROUP BY
  h.neighborhood,
  DATE(h.date_of_sale),
  s.bids
LIMIT 100;
Or the following, which changes the initial query more, but may be closer to what you need since the sub-query is grouped by date as well:
SELECT
  h.neighborhood,
  DATE(h.date_of_sale) AS date_,
  s.bids AS bid_ask_off,
  SUM(h.offers) AS total_offers
FROM
  `big_query.a_table_name.houseprices` h
LEFT JOIN
  (SELECT
     date_of_sale AS dos,
     neighborhood,
     SUM(offers) AS bids
   FROM
     `big_query.a_table_name.houseprices`
   WHERE offers * accepted_sale_price < 5000000
     AND offers < 15
     AND (house_bid = accepted_sale_price OR
          house_ask = accepted_sale_price)
   GROUP BY
     neighborhood,
     date_of_sale) s
ON h.neighborhood = s.neighborhood
AND h.date_of_sale = s.dos
GROUP BY
  h.neighborhood,
  DATE(h.date_of_sale),
  s.bids
LIMIT 100;

Changing value of column in pandas chaining

I have a dataset like this:
year artist track time date.entered wk1 wk2
2000 Pac Baby 4:22 2000-02-26 87 82
2000 Geher The 3:15 2000-09-02 91 87
2000 three_DoorsDown Kryptonite 3:53 2000-04-08 81 70
2000 ATeens Dancing_Queen 3:44 2000-07-08 97 97
2000 Aaliyah I_Dont_Wanna 4:15 2000-01-29 84 62
2000 Aaliyah Try_Again 4:03 2000-03-18 59 53
2000 Yolanda Open_My_Heart 5:30 2000-08-26 76 76
My desired output is like this:
year artist track time date week rank
0 2000 Pac Baby 4:22 2000-02-26 1 87
1 2000 Pac Baby 4:22 2000-03-04 2 82
6 2000 ATeens Dancing_Queen 3:44 2000-07-08 1 97
7 2000 ATeens Dancing_Queen 3:44 2000-07-15 2 97
8 2000 Aaliyah I_Dont_Wanna 4:15 2000-01-29 1 84
Basically, I am tidying up the given billboard data.
Without pandas chaining I could do this easily like this:
df = pd.read_clipboard()
df1 = (pd.wide_to_long(df, 'wk', i=df.columns.values[:5], j='week')
.reset_index()
.rename(columns={'date.entered': 'date', 'wk': 'rank'}))
df1['date'] = pd.to_datetime(df1['date']) + pd.to_timedelta((df1['week'] - 1) * 7, 'd')
df1 = df1.sort_values(by=['track', 'date'])
print(df1.head())
Question
Is there a way to chain the df1['date'] = pd.to_datetime(...) part, so that the whole operation fits into a single chain?
Use assign:
df1 = (pd.wide_to_long(df, 'wk', i=df.columns.values[:5], j='week')
.reset_index()
.rename(columns={'date.entered': 'date', 'wk': 'rank'})
.assign(date = lambda x: pd.to_datetime(x['date']) +
pd.to_timedelta((x['week'] - 1) * 7, 'd'))
.sort_values(by=['track', 'date'])
)
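As a design note, assign hands its lambda the intermediate DataFrame produced by the preceding rename step, so x['date'] and x['week'] already refer to the renamed columns; that is what lets the conversion live inside the chain without the named intermediate df1.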

Resample and interpolate pandas df

I have a df that looks like the following:
TotalSpend       Date
       100 2001-04-26
       230 2001-05-12
       340 2001-06-16
       610 2001-07-31
       770 2001-08-31
I'm trying to interpolate the data so I can see how much was spent during each month, like so:
TotalSpend       Date  MonthlySpend
       110 2001-04-30
       310 2001-05-31           200
       400 2001-06-30            90
       610 2001-07-31           210
       770 2001-08-31           160
I set the date column as the index and tried to upsample the data (below) so that I have every day of the year and can then interpolate the missing values and select the month ends; however, this is proving troublesome.
resample = realdf.resample('d').mean()
Any help would be much appreciated.
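A minimal sketch of that approach, assuming Date is already a DatetimeIndex (the interpolated month-end figures will differ slightly from the hand-written expected values, since they come from straight linear interpolation):
import pandas as pd

# Rebuild the example frame with Date as the index.
df = pd.DataFrame(
    {"TotalSpend": [100, 230, 340, 610, 770]},
    index=pd.to_datetime(
        ["2001-04-26", "2001-05-12", "2001-06-16", "2001-07-31", "2001-08-31"]
    ),
).rename_axis("Date")

# Upsample to daily frequency (missing days become NaN), fill them with
# linear interpolation, then keep only the last day of each month.
monthly = (
    df.resample("D").mean()
      .interpolate(method="linear")
      .resample("M").last()   # 'M' = calendar month end
)

# Monthly spend is the month-over-month difference of the running total.
monthly["MonthlySpend"] = monthly["TotalSpend"].diff()
print(monthly)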