How to find the sum of count, grouped by id? - pandas

I would like to get the output for the following problem.
I have the following DataFrame:
id start end count Time Train
001 Paris London 01 05:00 Yes
001 Paris London 01 05:00 Yes
002 Prague Vienna 15 15:00 No
003 Frankfurt London 01 17:00 Yes
015 Paris London 08 21:00 No
019 Barcelona Vienna 15 15:00 No
003 Frankfurt London 01 07:00 Yes
002 Prague Vienna 15 05:00 No
I would like to find the sum of count, grouped by the id, while ignoring rows that have the same id, start and end. Also, I have about 4 GB of data and would like to find the start and end city of the top 5 counts. Thank you.
I would like to get output similar to this:
Prague -> Vienna Count : 15
Barcelona -> Vienna Count : 15
Paris -> London Count : 09
Frankfurt -> London Count : 02
.....

You can use drop_duplicates + groupby, aggregating with sum:
df['count'] = df['count'].astype(int)
df = df.drop_duplicates(['id','start','end'])
print (df)
id start end count Time Train
0 001 Paris London 1 05:00 Yes
2 002 Prague Vienna 15 15:00 No
3 003 Frankfurt London 1 17:00 Yes
4 015 Paris London 8 21:00 No
5 019 Barcelona Vienna 15 15:00 No
df1 = df.groupby('id', as_index=False)['count'].sum()
print (df1)
id count
0 001 1
1 002 15
2 003 1
3 015 8
4 019 15
df11 = df.groupby(['id', 'start', 'end'], as_index=False)['count'].sum()
print (df11)
id start end count
0 001 Paris London 1
1 002 Prague Vienna 15
2 003 Frankfurt London 1
3 015 Paris London 8
4 019 Barcelona Vienna 15
df12 = df.groupby(['start', 'end'], as_index=False)['count'].sum()
print (df12)
start end count
0 Barcelona Vienna 15
1 Frankfurt London 1
2 Paris London 9
3 Prague Vienna 15
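If you want this in the arrow format from the question, a minimal formatting sketch over df12, sorted by summed count, descending:

for _, row in df12.sort_values('count', ascending=False).iterrows():
    print(f"{row['start']} -> {row['end']} Count : {row['count']}")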
For top values use nlargest:
df2 = df.nlargest(5, 'count')[['start','end']]
print (df2)
start end
2 Prague Vienna
5 Barcelona Vienna
4 Paris London
0 Paris London
3 Frankfurt London
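Note that nlargest above runs on the deduplicated rows, so the two Paris -> London ids are not combined; for the top 5 routes by summed count you would apply nlargest to the aggregated df12 instead. Since the question mentions roughly 4 GB of data, here is a rough chunked sketch, assuming the data sits in a CSV file named data.csv (a hypothetical name) and that the deduplicated rows fit in memory:

import pandas as pd

parts = []
# Read the large file in chunks; dtype=str keeps ids like '001' intact.
for chunk in pd.read_csv('data.csv', dtype={'id': str}, chunksize=1_000_000):
    # Dedupe within each chunk first to shrink what is kept around.
    parts.append(chunk.drop_duplicates(['id', 'start', 'end']))

# Dedupe once more across chunk boundaries, then aggregate per route.
dedup = pd.concat(parts, ignore_index=True).drop_duplicates(['id', 'start', 'end'])
dedup['count'] = dedup['count'].astype(int)

# Sum per (start, end) and take the top 5 routes by total count.
result = (dedup.groupby(['start', 'end'], as_index=False)['count']
               .sum()
               .nlargest(5, 'count'))
print(result)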

An SQL equivalent: deduplicate on (id, start, end) in a subquery, then sum per route and take the top 5:
SELECT start, end, SUM(count) AS total
FROM
(
    SELECT id, start, end, MIN(count) AS count
    FROM TABLE1
    GROUP BY id, start, end
) T
GROUP BY start, end
ORDER BY total DESC
LIMIT 0, 5

Related

Pandas iterate over rows and conditional count

I am trying to iterate over rows in a pandas DataFrame with a conditional count in a new column called Stage. For each name the stage should start at 1, and if the name is the same between rows, then after a "Healthy" status a new stage should start. A "Healthy" event stays in the same stage as the preceding "Sick" events, if they exist. I've done this in Excel before but am not sure how to do it in Python.
What I have now is:
Date        Name  Status
2020-01-02  Mary  Healthy
2020-01-05  Mary  Sick
2020-01-15  Mary  Sick
2020-01-20  Mary  Healthy
2020-02-03  Mary  Healthy
2020-02-06  Mary  Sick
2020-02-10  Mary  Sick
2020-02-15  Mary  Healthy
2020-01-02  Bob   Healthy
2020-01-05  Bob   Healthy
2020-01-15  Bob   Healthy
2020-01-20  Bob   Sick
2020-02-03  Bob   Sick
2020-02-06  Bob   Sick
2020-02-10  Bob   Sick
2020-02-15  Bob   Healthy
What I would like to have:
Date        Name  Status   Stage
2020-01-02  Mary  Healthy  1
2020-01-05  Mary  Sick     2
2020-01-15  Mary  Sick     2
2020-01-20  Mary  Healthy  2
2020-02-03  Mary  Healthy  3
2020-02-06  Mary  Sick     4
2020-02-10  Mary  Sick     4
2020-02-15  Mary  Healthy  4
2020-01-02  Bob   Healthy  1
2020-01-05  Bob   Healthy  2
2020-01-15  Bob   Healthy  3
2020-01-20  Bob   Sick     4
2020-02-03  Bob   Sick     4
2020-02-06  Bob   Sick     4
2020-02-10  Bob   Sick     4
2020-02-15  Bob   Healthy  4
You don't need an explicit loop. You need the following:
- group by the Name column
- apply to each group:
  - shift the Status column to look at the previous value
  - take the cumulative sum of the following series:
    - if the previous value is null and the current value is Healthy, we're at the first row of a group, so call it one
    - if the previous value is Healthy, call it one
    - otherwise, call it zero
from io import StringIO
import numpy
import pandas

df = pandas.read_csv(StringIO("""\
|Date|Name|Status|
|2020-01-02|Mary|Healthy|
|2020-01-05|Mary|Sick|
|2020-01-15|Mary|Sick|
|2020-01-20|Mary|Healthy|
|2020-02-03|Mary|Healthy|
|2020-02-06|Mary|Sick|
|2020-02-10|Mary|Sick|
|2020-02-15|Mary|Healthy|
|2020-01-02|Bob|Healthy|
|2020-01-05|Bob|Healthy|
|2020-01-15|Bob|Healthy|
|2020-01-20|Bob|Sick|
|2020-02-03|Bob|Sick|
|2020-02-06|Bob|Sick|
|2020-02-10|Bob|Sick|
|2020-02-15|Bob|Healthy|
"""), sep='|').loc[:, ['Date', 'Name', 'Status']]

output = (
    df.assign(Stage=lambda df: df.groupby('Name', group_keys=False)['Status'].apply(lambda g:
        numpy.bitwise_or(  # True if either of the two conditions is met
            g.shift().eq('Healthy'),              # general case: previous row was Healthy
            g.shift().isnull() & g.eq('Healthy')  # handles the first row of a group
        ).cumsum()
    ))
)
print(output.to_string())
And I get:
          Date  Name   Status  Stage
0   2020-01-02  Mary  Healthy      1
1   2020-01-05  Mary     Sick      2
2   2020-01-15  Mary     Sick      2
3   2020-01-20  Mary  Healthy      2
4   2020-02-03  Mary  Healthy      3
5   2020-02-06  Mary     Sick      4
6   2020-02-10  Mary     Sick      4
7   2020-02-15  Mary  Healthy      4
8   2020-01-02   Bob  Healthy      1
9   2020-01-05   Bob  Healthy      2
10  2020-01-15   Bob  Healthy      3
11  2020-01-20   Bob     Sick      4
12  2020-02-03   Bob     Sick      4
13  2020-02-06   Bob     Sick      4
14  2020-02-10   Bob     Sick      4
15  2020-02-15   Bob  Healthy      4
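As a side note, the same logic can be written without apply, which avoids pandas version differences in how groupby.apply aligns its result. A minimal sketch, assuming df already holds the Date, Name and Status columns:

# Previous Status within each Name group (NaN at each group's first row).
prev = df.groupby('Name')['Status'].shift()

# A new stage starts right after a Healthy row, or on a group's first row
# when that row is itself Healthy.
new_stage = prev.eq('Healthy') | (prev.isnull() & df['Status'].eq('Healthy'))

# Cumulative sum within each Name group numbers the stages 1, 2, 3, ...
df['Stage'] = new_stage.astype(int).groupby(df['Name']).cumsum()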

Groupby year-wise plot for different categories of another column in pandas and seaborn or matplotlib

I have a data frame as shown below.
Place Bldng_Id Num_Bed_Rooms Contract_date Rental_value
Bangalore 1 4 2016-02-16 100
Bangalore 1 4 2016-05-16 150
Bangalore 1 4 2017-01-18 450
Bangalore 1 4 2017-02-26 550
Bangalore 5 4 2015-02-26 120
Bangalore 5 4 2016-05-18 180
Bangalore 2 3 2015-03-06 150
Bangalore 2 3 2016-05-14 150
Bangalore 2 3 2017-07-26 220
Bangalore 2 3 2017-09-19 200
Chennai 3 4 2016-02-16 100
Chennai 3 4 2016-05-16 150
Chennai 3 4 2017-01-18 450
Chennai 3 4 2017-02-26 550
Chennai 4 3 2015-03-06 150
Chennai 4 3 2016-05-14 150
Chennai 4 3 2017-07-26 220
Chennai 4 3 2017-09-19 200
Chennai 6 3 2018-07-26 250
Chennai 6 3 2019-09-19 280
From the above I would like to prepare the below dataframe.
Expected output:
Place Num_Bed_Rooms Year Avg_Rental_value
Bangalore 3 2015 150
Bangalore 3 2016 150
Bangalore 3 2017 210
Bangalore 4 2015 120
Bangalore 4 2016 143.3
Bangalore 4 2017 500
Chennai 3 2015 150
Chennai 3 2016 150
Chennai 3 2017 210
Chennai 3 2018 250
Chennai 3 2019 280
Chennai 4 2016 150
Chennai 4 2017 210
I tried the following code to achieve this:
df.groupby(['Place', 'Year', 'Num_Bed_Rooms']).Rental_value.mean()
But the above does not work properly, because the DataFrame has no Year column yet.
From the above expected output I would like to write time series code to forecast the next year's Rental_value for each case separately.
If necessary, first convert the values to datetimes:
df['Contract_date'] = pd.to_datetime(df['Contract_date'])
Then create a new column and pass it to groupby:
df['Year'] = df['Contract_date'].dt.year
df1 = df.groupby(['Place', 'Num_Bed_Rooms','Year'], as_index=False).Rental_value.mean()
Or pass a Series:
y = df['Contract_date'].dt.year.rename('Year')
df1 = df.groupby(['Place', 'Num_Bed_Rooms', y], as_index=False).Rental_value.mean()
print (df1)
Place Num_Bed_Rooms Year Rental_value
0 Bangalore 3 2015 150.000000
1 Bangalore 3 2016 150.000000
2 Bangalore 3 2017 210.000000
3 Bangalore 4 2015 120.000000
4 Bangalore 4 2016 143.333333
5 Bangalore 4 2017 500.000000
6 Chennai 3 2015 150.000000
7 Chennai 3 2016 150.000000
8 Chennai 3 2017 210.000000
9 Chennai 3 2018 250.000000
10 Chennai 3 2019 280.000000
11 Chennai 4 2016 125.000000
12 Chennai 4 2017 500.000000
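For the plotting part of the title, a minimal seaborn sketch, assuming df1 from above: one panel per Place and one line per Num_Bed_Rooms category:

import seaborn as sns
import matplotlib.pyplot as plt

# Yearly average rental value, one line per bedroom count, one panel per city.
g = sns.relplot(data=df1, x='Year', y='Rental_value',
                hue='Num_Bed_Rooms', col='Place',
                kind='line', marker='o')
g.set_axis_labels('Year', 'Avg_Rental_value')
plt.show()

For the forecasting request, a very rough sketch that fits a linear trend per (Place, Num_Bed_Rooms) group and extrapolates one year ahead; a real forecast would need more history and a proper time-series model:

import numpy as np

def next_year_forecast(g):
    # Fit Rental_value = slope * Year + intercept, then evaluate at Year + 1.
    slope, intercept = np.polyfit(g['Year'], g['Rental_value'], 1)
    return slope * (g['Year'].max() + 1) + intercept

forecast = df1.groupby(['Place', 'Num_Bed_Rooms']).apply(next_year_forecast)
print(forecast)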

Groupby year and calculate the average and count the size in pandas

I have a dataframe as shown below.
Contract_ID Place Contract_Date Price
1 Bangalore 2018-10-25 100
2 Bangalore 2018-08-25 200
3 Bangalore 2019-10-25 300
4 Bangalore 2019-11-25 200
5 Bangalore 2019-10-25 400
6 Chennai 2018-10-25 100
7 Chennai 2018-10-25 200
8 Chennai 2018-10-25 100
9 Chennai 2018-10-25 300
10 Chennai 2019-10-25 400
11 Chennai 2019-10-25 600
From the above I would like to generate the below table using pandas.
Expected Output:
Place Year Number_of_Contracts Average_Price
Bangalore 2018 2 150
Bangalore 2019 3 300
Chennai 2018 4 175
Chennai 2019 2 500
Use GroupBy.agg with years created by Series.dt.year and tuples for the new column names:
df['Contract_Date'] = pd.to_datetime(df['Contract_Date'])
df1 = (df.groupby(['Place', df['Contract_Date'].dt.year.rename('Year')])['Price']
.agg([('Number_of_Contracts','size'),('Average_Price','mean')])
.reset_index())
print (df1)
Place Year Number_of_Contracts Average_Price
0 Bangalore 2018 2 150
1 Bangalore 2019 3 300
2 Chennai 2018 4 175
3 Chennai 2019 2 500
An alternative solution uses named aggregation, which requires pandas 0.25+:
df['Contract_Date'] = pd.to_datetime(df['Contract_Date'])
df1 = (df.groupby(['Place', df['Contract_Date'].dt.year.rename('Year')])
.agg(Number_of_Contracts=('Contract_ID','size'),
Average_Price=('Price','mean'))
.reset_index())
print (df1)
Place Year Number_of_Contracts Average_Price
0 Bangalore 2018 2 150
1 Bangalore 2019 3 300
2 Chennai 2018 4 175
3 Chennai 2019 2 500

SQL Server select from 3 tables

I have three tables in my database Books, Borrowers and Movement:
Books
BookID Title Author Category Published
----------- ------------------------------ ------------------------- --------------- ----------
101 Ulysses James Joyce Fiction 1922-06-16
102 Huckleberry Finn Mark Twain Fiction 1884-03-24
103 The Great Gatsby F. Scott Fitzgerald Fiction 1925-06-17
104 1984 George Orwell Fiction 1949-04-19
105 War and Peace Leo Tolstoy Fiction 1869-08-01
106 Gullivers Travels Jonathan Swift Fiction 1726-07-01
107 Moby Dick Herman Melville Fiction 1851-08-01
108 Pride and Prejudice Jane Austen Fiction 1813-08-13
110 The Second World War Winston Churchill NonFiction 1953-09-01
111 Relativity Albert Einstein NonFiction 1917-01-09
112 The Right Stuff Tom Wolfe NonFiction 1979-09-07
121 Hitchhikers Guide to Galaxy Douglas Adams Humour 1975-10-27
122 Dad Is Fat Jim Gaffigan Humour 2013-03-01
131 Kick-Ass 2 Mark Millar Comic 2012-03-03
133 Beautiful Creatures: The Manga Kami Garcia Comic 2014-07-01
Borrowers
BorrowerID Name Birthday
----------- ------------------------- ----------
2 Bugs Bunny 1938-09-08
3 Homer Simpson 1992-09-09
5 Mickey Mouse 1928-02-08
7 Fred Flintstone 1960-06-09
11 Charlie Brown 1965-06-05
13 Popeye 1933-03-03
17 Donald Duck 1937-07-27
19 Mr. Magoo 1949-09-14
23 George Jetson 1948-04-08
29 SpongeBob SquarePants 1984-08-04
31 Stewie Griffin 1971-11-17
Movement
MoveID BookID BorrowerID DateOut DateIn ReturnCondition
----------- ----------- ----------- ---------- ---------- ---------------
1 131 31 2012-06-01 2013-05-24 good
2 101 23 2012-02-10 2012-03-24 good
3 102 29 2012-02-01 2012-04-01 good
4 105 7 2012-03-23 2012-05-11 good
5 103 7 2012-03-22 2012-04-22 good
6 108 7 2012-01-23 2012-02-12 good
7 112 19 2012-01-12 2012-02-10 good
8 122 11 2012-04-14 2013-05-01 poor
9 106 17 2013-01-24 2013-02-01 good
10 104 2 2013-02-24 2013-03-10 bitten
11 121 3 2013-03-01 2013-04-01 good
12 131 19 2013-04-11 2013-05-23 good
13 111 5 2013-05-22 2013-06-22 poor
14 131 2 2013-06-12 2013-07-23 bitten
15 122 23 2013-07-10 2013-08-12 good
16 107 29 2014-01-01 2014-02-14 good
17 110 7 2014-01-11 2014-02-01 good
18 105 2 2014-02-22 2014-03-02 bitten
What is a query I can use to find out which book was borrowed by the oldest borrower?
I am new to SQL and am using Microsoft SQL Server 2014.
Here are two different solutions:
First, using two subqueries and one equi-join:
select Title
from Books b , Movement m
where b.BookID = m.BookID and m.BorrowerID = (select BorrowerID
from Borrowers
where Birthday = (select MIN(Birthday)
from Borrowers))
Second, using two equi-joins and one subquery:
select Title
from Books b, Borrowers r, Movement m
where b.BookID = m.BookID
and m.BorrowerID = r.BorrowerID
and Birthday = (select MIN(Birthday) from Borrowers)
Both of the above queries give the following answer:
Title
------------------------------
Relativity

SQL: Grouping by 2 columns

I have a table points:
event_time | name | points |
------------------------------------
2014-07-16 11:40 Bob 10
2014-07-16 10:00 Jim 20
2014-07-16 09:20 Jim 30
2014-07-15 11:20 Bob 5
2014-07-15 10:20 Anna 10
2014-07-15 09:40 Bob 30
2014-07-15 09:00 Anna 10
Is it possible to write a query that results in:
event_date | name | total_points |
------------------------------------
2014-07-16 Bob 10
2014-07-16 Jim 50
2014-07-15 Bob 35
2014-07-15 Anna 20
Where total_points is a sum of all points for the given name during the day?
select date(event_time) as event_date,
       name,
       sum(points) as total_points
from points
group by date(event_time), name