Creating multiindex pivot table Pandas - pandas

I have created a Pandas pivot table that has the data I need based on a df that looks like this:
Short Date PercentileF R_E Status
0 Denz 2022-09-19 18:57:08 55 Hispanic Active
1 Denz 2022-09-19 18:44:10 14 African_American Active
2 Denz 2022-09-19 18:39:32 33 African_American Active
3 Denz 2022-09-19 19:02:51 52 Hispanic Active
4 Denz 2022-09-19 18:46:12 13 Hispanic Active
5 Denz 2022-09-19 18:56:08 72 Mult Active
6 Denz 2022-09-19 18:52:14 Hispanic Active
7 Denz 2022-09-19 19:03:28 43 Hispanic Active
8 Denz 2022-09-19 18:55:40 90 Filipino Active
9 Denz 2022-09-19 19:07:03 Hispanic Active
10 Dezn 2022-09-19 18:51:54 75 African_American Active
11 Cart 2022-09-19 19:06:50 99 Filipino Active
12 Cart 2022-09-19 18:58:05 52 African_American Active
13 Cart 2022-09-19 18:41:16 5 African_American Active
14 Cart 2022-09-19 18:52:29 46 Hispanic Active
15 Cart 2022-09-19 18:46:43 18 Mult Active
16 Cart 2022-09-19 19:05:42 68 Filipino Active
17 Cart 2022-09-19 18:42:15 18 Hispanic Active
18 Cart 2022-09-19 18:46:32 26 Mult Active
19 Cart 2022-09-19 18:33:54 1 Hispanic Active
This is my code currently:
vs['PercentileF']=pd.to_numeric(vs['PercentileF'])
vs=vs.dropna()
vs['PercentileF'] = vs['PercentileF'].astype(int)
vs['Typ_High'] = vs['PercentileF'].apply(lambda x:1 if x >=35 else 0)
vs['Low'] = vs['PercentileF'].apply(lambda x:1 if x <35 else 0)
vs['LowHigh']=vs['Low']+vs['Typ_High']
tframe = pd.pivot_table(vs, values=['Typ_High','LowHigh'], index=['Short', 'Race_Ethn'], aggfunc=np.sum)
tframe['Percentage']= tframe['Typ_High']/tframe['LowHigh'] * 100
tframe['Percentage'] = tframe['Percentage'].map('{:,.2f}'.format)
tframe['Percentage'] = np.where((tframe['LowHigh']) <= 10,'-1',tframe['Percentage'])
tframe.reset_index(inplace=True)
tableframe2 = tframe.pivot(index='Short',
columns= 'Race_Ethn',
values='Percentage')
These are the results from my whole data set, not just the example above:
Race_Ethn African_American American_Indian Asian Filipino Hispanic Mult Pac_Islander
Short
Cart 65 -1 67.78 63.64 59.02 -1 -1
Cord 52.17 -1 58.33 -1
Denz 58.1 -1 69.51 60 59.01 57.89 -1
I would like suggestions to make the code cleaner, i.e. it shouldn't take so many lines to get what I need, but it took me a while to get to this point. Any suggestions to shorten the code would be great. Thank you.
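One possible way to shorten this is a single groupby followed by unstack. The sketch below is only an illustration under some assumptions: the function name, the cutoff and min_count parameters, and the use of the R_E column from the sample (rather than Race_Ethn from the code) are my own choices, and the result stays numeric instead of being formatted as strings.

```python
import pandas as pd

def percent_typical_or_high(vs, cutoff=35, min_count=10):
    """Percent of scores >= cutoff per (Short, R_E); -1 where the group is too small."""
    # Blank or non-numeric scores become NaN and are dropped below.
    scores = pd.to_numeric(vs["PercentileF"], errors="coerce")
    clean = vs.assign(high=scores.ge(cutoff)).loc[scores.notna()]
    # The mean of a boolean column is the share of True values; size is the group count.
    grouped = clean.groupby(["Short", "R_E"])["high"].agg(["mean", "size"])
    pct = (grouped["mean"] * 100).round(2).where(grouped["size"] > min_count, -1)
    # Move the R_E level of the index into columns, one row per Short.
    return pct.unstack("R_E")
```

Calling percent_typical_or_high(vs) should give one row per Short and one column per R_E value, with NaN where a combination never occurs.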

Related

How do I count a range in SQL?

I have a data that looks like this:
$ Time : int 0 1 5 8 10 11 15 17 18 20 ...
$ NumOfFlights: int 1 6 144 91 504 15 1256 1 1 578 ...
The Time column is just 24-hour time, from 0 all the way up to 2400.
What I hope to get is:
hour | number of flight
-------------------------------------
1st | 240
2nd | 223
... | ...
24th | 122
Where 1st hour is from midnight to 1am, and 2nd is 1am to 2am, and so on until finally 24th which is from 11pm to midnight. And number of flights is just the total of the NumOfFlights within the range.
I've tried:
dbGetQuery(conn,"
SELECT
flights.CRSDepTime AS Time,
COUNT(flights.CRSDepTime) AS NumOnTimeFlights
FROM flights
GROUP BY CRSDepTime/60
")
But I realise it can't be done this way. The results that I get will have 40 values for time.
> head
Time NumOnTimeFlights
1 50 6055
2 105 2383
3 133 674
4 200 446
5 245 266
6 310 34
> tail
Time NumOnTimeFlights
35 2045 48136
36 2120 103229
37 2215 15737
38 2245 36416
39 2300 15322
40 2355 8018
If your CRSDepTime column is an integer encoded time like HHmm then CRSDepTime/100 will extract the hour.
SELECT
CRSDepTime/100 AS hh,
COUNT(flights.CRSDepTime) AS NumOnTimeFlights
FROM flights
GROUP BY CRSDepTime/100
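To see the integer division at work, here is a small sketch that runs the same grouping against an in-memory SQLite database from Python. It assumes SQLite semantics (integer / integer truncates), and the helper name flights_per_hour is made up for the illustration.

```python
import sqlite3

def flights_per_hour(dep_times):
    """Count flights per hour bucket for HHmm-encoded integer departure times."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE flights (CRSDepTime INTEGER)")
    conn.executemany("INSERT INTO flights VALUES (?)", [(t,) for t in dep_times])
    # In SQLite, dividing two integers truncates, so 133 / 100 is 1.
    rows = conn.execute(
        "SELECT CRSDepTime / 100 AS hh, COUNT(*) AS n "
        "FROM flights GROUP BY CRSDepTime / 100 ORDER BY hh"
    ).fetchall()
    conn.close()
    return rows

print(flights_per_hour([50, 105, 133, 200, 245, 2355]))
```

On a database where / does not truncate integers, the expression would need an explicit floor or cast.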

How to group merge columns based on one row identifier with pandas?

I have a dataset that has a lot of entries for a single location. I am trying to find a way to sum up all of those entries without affecting any of the other columns. So, just in case I'm not explaining it well enough, I want to use a dataset like this:
Locations Cyclists maleRunners femaleRunners maleCyclists femaleCyclists
Bedford 10 12 14 17 27
Bedford 11 40 34 9 1
Bedford 7 1 2 3 3
Leeds 1 1 2 0 0
Leeds 20 13 6 1 1
Bath 101 20 33 41 3
Bath 11 2 3 1 0
And turn it into something like this:
Locations Cyclists maleRunners femaleRunners maleCyclists femaleCyclists
Bedford 28 53 50 29 31
Leeds 21 33 39 1 1
Bath 111 22 36 42 3
Now, I have read up that a groupby should work in a way, but from my understanding a group by will change it into 2 columns and I don't particularly want to make hundreds of 2 columns and then merge it all. Surely there's a much simpler way to do this?
IIUC, groupby+sum will work for you:
df.groupby('Locations',as_index=False,sort=False).sum()
Output:
Locations Cyclists maleRunners femaleRunners maleCyclists femaleCyclists
0 Bedford 28 53 50 29 31
1 Leeds 21 14 8 1 1
2 Bath 112 22 36 42 3
Pivot table should work for you.
new_df = pd.pivot_table(df, values=['Cyclists', 'maleRunners', 'femaleRunners',
'maleCyclists','femaleCyclists'],index='Locations', aggfunc=np.sum)
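As a quick check, the sketch below rebuilds the sample data from the question and runs both approaches. It passes aggfunc as the string "sum" rather than np.sum, which is my own substitution. The main difference is that pivot_table sorts the locations and the column names, while groupby with sort=False keeps the original order.

```python
import pandas as pd

df = pd.DataFrame({
    "Locations": ["Bedford", "Bedford", "Bedford", "Leeds", "Leeds", "Bath", "Bath"],
    "Cyclists": [10, 11, 7, 1, 20, 101, 11],
    "maleRunners": [12, 40, 1, 1, 13, 20, 2],
    "femaleRunners": [14, 34, 2, 2, 6, 33, 3],
    "maleCyclists": [17, 9, 3, 0, 1, 41, 1],
    "femaleCyclists": [27, 1, 3, 0, 1, 3, 0],
})

# One row per location, every numeric column summed, original order kept.
by_group = df.groupby("Locations", sort=False).sum()

# Same totals, but the index and the columns come back sorted alphabetically.
by_pivot = pd.pivot_table(df, index="Locations", aggfunc="sum")

print(by_group)
print(by_pivot)
```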

SQL query between and equals

There are three tables. The first, baseline, contains all beneficiary information and has a column named PPI_SCORE. The second, PPI_SCORE_TOOKUP, contains the six columns shown below. The third, endline, contains the beneficiaries' endline assessment data and also has a PPI_SCORE column. I want to join these tables, but PPI_SCORE_TOOKUP has no foreign key to baseline or endline; the only thing the three tables share is the PPI score. The query should show some baseline data together with the PPI lookup result where the baseline PPI score is between or equal to PPI_SCORE_START and PPI_SCORE_END. It should also show the endline data of the same member, along with the six lookup columns matched on the endline PPI score in the same way, all in one row.
Note: I have not tried any query yet because I have no idea how to do this. The expected result is at the bottom of this question.
Tables are as follows
baseline table
ID NAME LAST_NAME DISTRICT PPI_SCORE
1 A A A 10
2 B B B 23
3 C C C 90
4 D D D 47
endline table
baseline_ID Enterprise Market PPI_SCORE
3 Bee Keeping Yes
2 Poultry No 74
1 Agriculture Yes 80
PPI_SCORE_TOOKUP table
ppi_start ppi_end national national_150 national_200 usaid
0 4 100 100 100 100
10 14 66.1 89.5 96.5 39.2
5 9 68.8 90.2 96.7 44.4
15 19 59.5 89.1 97.2 35.2
20 24 51.3 85.5 96.4 28.8
25 29 43.5 81.1 93.2 20
30 34 31.9 74.5 90.4 13.6
35 39 24.6 66.9 87.3 7.9
40 44 15.2 58 82.8 4.5
45 49 11.4 47.9 73.4 4.2
50 54 6 37.2 68.4 2.6
55 59 2.7 26.1 61.3 0.5
60 64 0.9 21 50.4 0.5
65 69 0 14.3 37.1 0
70 74 3 14.3 29.2 0
75 79 0 1.4 5.1 0
80 84 0 0 9.5 0
85 89 0 0 15.2 0
90 94 0 0 0 0
95 100 0 0 0 0
Expected Result
Your query can be made in the following way:
SELECT *
FROM baseline b
LEFT JOIN endline e ON b.id = e.baseline_ID
LEFT JOIN PPI_SCORE_TOOKUP ppi ON b.PPI_SCORE BETWEEN ppi.ppi_start AND ppi.ppi_end
LEFT JOIN PPI_SCORE_TOOKUP ppi2 ON e.PPI_SCORE BETWEEN ppi2.ppi_start AND ppi2.ppi_end
This matches the IDs from the baseline table with the baseline_IDs from the endline table, keeping baseline rows that have no endline match. It then matches PPI_SCORE from baseline against ppi_start and ppi_end from PPI_SCORE_TOOKUP, and finally matches PPI_SCORE from endline against ppi_start and ppi_end using a second alias of the lookup table.
Replace * with whichever fields you want to return.
See fiddle for a working example
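For anyone who wants to try the range join without a database server, here is a rough sketch using an in-memory SQLite database from Python. It is an illustration only: it loads just a few lookup rows and only the national column, and it assumes the missing endline score for baseline_ID 3 is stored as NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE baseline (ID INTEGER, NAME TEXT, PPI_SCORE INTEGER);
CREATE TABLE endline (baseline_ID INTEGER, Enterprise TEXT, PPI_SCORE INTEGER);
CREATE TABLE PPI_SCORE_TOOKUP (ppi_start INTEGER, ppi_end INTEGER, national REAL);
INSERT INTO baseline VALUES (1,'A',10),(2,'B',23),(3,'C',90),(4,'D',47);
-- Assumption: the blank endline score in the question is NULL.
INSERT INTO endline VALUES (3,'Bee Keeping',NULL),(2,'Poultry',74),(1,'Agriculture',80);
INSERT INTO PPI_SCORE_TOOKUP VALUES
  (10,14,66.1),(20,24,51.3),(45,49,11.4),(70,74,3),(80,84,0),(90,94,0);
""")

# BETWEEN is inclusive on both ends; a NULL score matches no range,
# so the LEFT JOIN leaves the lookup columns NULL for that row.
rows = conn.execute("""
SELECT b.ID, b.PPI_SCORE, ppi.national, e.PPI_SCORE, ppi2.national
FROM baseline b
LEFT JOIN endline e ON b.ID = e.baseline_ID
LEFT JOIN PPI_SCORE_TOOKUP ppi ON b.PPI_SCORE BETWEEN ppi.ppi_start AND ppi.ppi_end
LEFT JOIN PPI_SCORE_TOOKUP ppi2 ON e.PPI_SCORE BETWEEN ppi2.ppi_start AND ppi2.ppi_end
ORDER BY b.ID
""").fetchall()
conn.close()

for row in rows:
    print(row)
```

Each baseline member comes back as one row, with None in the endline columns where there is no endline record or no endline score.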

Getting average of product sales each day and calculate number of days that have positive sales

I have this table TARGETSALE that have the following columns
SELECT DATE, WEEK, BRANCH, PROD, TARGETREACH
FROM TARGETSALE
WHERE BRANCH = 1
AND WEEK BETWEEN 52 AND 53;
DATE WEEK BRANCH PROD TARGETREACH
-------------------------------------------------------------------
01/09/2014 52 1 1 50
02/09/2014 52 1 1 -10
03/09/2014 52 1 1 50
04/09/2014 52 1 1 50
05/09/2014 52 1 1 40
06/09/2014 52 1 1 -10
07/09/2014 53 1 1 -5
08/09/2014 53 1 1 0
09/09/2014 53 1 1 10
10/09/2014 53 1 1 20
11/09/2014 53 1 1 30
12/09/2014 53 1 1 40
13/09/2014 53 1 1 0
01/09/2014 52 1 2 20
02/09/2014 52 1 2 0
03/09/2014 52 1 2 0
04/09/2014 52 1 2 10
05/09/2014 52 1 2 20
06/09/2014 52 1 2 10
07/09/2014 53 1 2 -10
08/09/2014 53 1 2 10
09/09/2014 53 1 2 -10
10/09/2014 53 1 2 20
11/09/2014 53 1 2 20
12/09/2014 53 1 2 40
13/09/2014 53 1 2 0
01/09/2014 52 1 3 30
02/09/2014 52 1 3 30
03/09/2014 52 1 3 5
04/09/2014 52 1 3 0
05/09/2014 52 1 3 10
06/09/2014 52 1 3 -10
07/09/2014 53 1 3 -10
08/09/2014 53 1 3 -10
09/09/2014 53 1 3 20
10/09/2014 53 1 3 10
11/09/2014 53 1 3 40
12/09/2014 53 1 3 10
13/09/2014 53 1 3 10
"targetsales" shows how much over the target the sales is, where negative means how far below the target the sales was. How can I do the following:
1. I need to get the average across all products for each day. Something like this:
DATE BRANCH AVERAGE_SALES_OF_ALL_PRODUCT
01/09/2014 1 33.33
02/09/2014 1 -1.67
...and so on
And then I need to have another query that shows how many days within those two weeks that there's positive average sales. Something like this:
BRANCH 2WEEKS_SINCE DAYS_WITH_POSITIVE_AVERAGE_SALES
1 53 9
The above is just an example, not a real result.
Sorry, hope this not too confusing. Thank you so much.
In Oracle, the date type might still have a time component. If you do not know if this is there, then use trunc() to remove it:
select trunc(date), branch, avg(targetreach)
from targetsale
group by trunc(date), branch
order by 1, 2;
For the second query, you want to use case:
select branch, count(distinct case when targetreach > 0 then date end) as DaysWithPositiveSales
from targetsale
group by branch;
If you know there is one row per date per branch -- and the time component of the date is empty -- then the distinct is not necessary.
1)
SELECT TRUNC(DATE, 'DD'), BRANCH, AVG(TARGETREACH)
FROM TARGETSALE WHERE BRANCH = 1 AND WEEK BETWEEN 52 AND 53
GROUP BY TRUNC(DATE, 'DD'), BRANCH;
2)
SELECT BRANCH, SUM(DECODE(SIGN(TARGETREACH), 1, 1, 0))
FROM TARGETSALE WHERE BRANCH = 1 AND WEEK BETWEEN 52 AND 53
GROUP BY BRANCH;
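To sanity-check the numbers outside the database, the same two steps can be sketched in pandas. This is only an illustration: the function names are made up, the data is a four-day subset of the table in the question, and the second step counts days whose average across products is positive, which is how I read the question (the queries above count positive individual rows instead).

```python
import pandas as pd

def daily_average(sales):
    """Average TARGETREACH across all products for each date and branch."""
    return sales.groupby(["DATE", "BRANCH"], as_index=False)["TARGETREACH"].mean()

def positive_days(sales):
    """Number of distinct dates per branch whose daily average is above zero."""
    daily = daily_average(sales)
    return daily[daily["TARGETREACH"] > 0].groupby("BRANCH")["DATE"].nunique()

# Subset of the question's rows: four dates, products 1 to 3, branch 1.
sales = pd.DataFrame({
    "DATE": ["01/09/2014", "02/09/2014", "03/09/2014", "07/09/2014"] * 3,
    "BRANCH": [1] * 12,
    "PROD": [1] * 4 + [2] * 4 + [3] * 4,
    "TARGETREACH": [50, -10, 50, -5, 20, 0, 0, -10, 30, 30, 5, -10],
})

print(daily_average(sales))
print(positive_days(sales))
```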

How to verify whether records exist for the last x days (calendar days) in SQL not using the between key word

I want to verify whether my table has records for each of the last 6 consecutive days in SQL.
SNO FLIGHT_DATE LANDINGS
45 9/1/2013 1
31 10/1/2013 1
32 11/1/2013 1
30 11/24/2013 1
27 11/25/2013 1
28 11/26/2013 1
29 11/26/2013 1
33 11/26/2013 1
26 11/30/2013 1
25 12/1/2013 1
34 12/1/2013 1
24 12/2/2013 1
35 12/3/2013 1
36 12/3/2013 1
44 12/4/2013 1
46 12/6/2013 1
47 12/6/2013 1
Is this what you want?
SELECT
*
FROM
Table1
WHERE
FLIGHT_DATE > dateadd(day,-6,datediff(day,0,getdate()))
AND
FLIGHT_DATE < GETDATE();
SQL FIDDLE
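The query above returns the rows that fall inside the window, but it does not by itself tell you whether every day in that window is covered. A minimal sketch of that check in Python follows, assuming the flight dates have already been fetched as date objects. The helper name missing_recent_days and the choice of a six-day window ending on today (inclusive) are my own assumptions.

```python
from datetime import date, timedelta

def missing_recent_days(flight_dates, today, days=6):
    """Return the calendar days in the window ending on `today` that have no record."""
    present = set(flight_dates)
    # Window is today, today - 1, ..., today - (days - 1).
    wanted = [today - timedelta(days=offset) for offset in range(days)]
    return sorted(d for d in wanted if d not in present)

# Dates taken from the tail of the sample table; duplicates are harmless.
flights = [
    date(2013, 11, 30), date(2013, 12, 1), date(2013, 12, 1), date(2013, 12, 2),
    date(2013, 12, 3), date(2013, 12, 3), date(2013, 12, 4), date(2013, 12, 6),
]
print(missing_recent_days(flights, today=date(2013, 12, 6)))
```

An empty list means every day in the window has at least one record.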