Pandas removing rows with incomplete time series in panel data - pandas

I have a dataframe along the lines of the below:
Country1 Country2 Year
1 Italy Greece 2000
2 Italy Greece 2001
3 Italy Greece 2002
4 Germany Italy 2000
5 Germany Italy 2002
6 Mexico Canada 2000
7 Mexico Canada 2001
8 Mexico Canada 2002
9 US France 2000
10 US France 2001
11 Greece Italy 2000
12 Greece Italy 2001
I want to keep only the rows in which there are observations for the entire time series (2000-2002). So, the end result would be:
Country1 Country2 Year
1 Italy Greece 2000
2 Italy Greece 2001
3 Italy Greece 2002
4 Mexico Canada 2000
5 Mexico Canada 2001
6 Mexico Canada 2002

One idea: reshape with pd.crosstab, test which rows have no 0 values using DataFrame.ne with DataFrame.all, convert the index to a DataFrame with MultiIndex.to_frame, and finally get the filtered rows with DataFrame.merge:
df1 = pd.crosstab([df['Country1'], df['Country2']], df['Year'])
df = df.merge(df1.index[df1.ne(0).all(axis=1)].to_frame(index=False))
print (df)
Country1 Country2 Year
0 Italy Greece 2000
1 Italy Greece 2001
2 Italy Greece 2002
3 Mexico Canada 2000
4 Mexico Canada 2001
5 Mexico Canada 2002
Or, if you need to test against a specific range, you can compare sets in GroupBy.transform:
r = set(range(2000, 2003))
df = df[df.groupby(['Country1', 'Country2'])['Year'].transform(lambda x: set(x) == r)]
print (df)
Country1 Country2 Year
1 Italy Greece 2000
2 Italy Greece 2001
3 Italy Greece 2002
6 Mexico Canada 2000
7 Mexico Canada 2001
8 Mexico Canada 2002

One option is to pivot the data, drop null rows and reshape back; this only works if the combination of Country* and Year is unique (in the sample data it is):
(df.assign(dummy=1)
   .pivot(index=['Country1', 'Country2'], columns='Year')
   .dropna()
   .stack()
   .drop(columns='dummy')
   .reset_index()
)
Country1 Country2 Year
0 Italy Greece 2000
1 Italy Greece 2001
2 Italy Greece 2002
3 Mexico Canada 2000
4 Mexico Canada 2001
5 Mexico Canada 2002

Related

sql command to find out how many players score how much

I have a table like this:
country  gender  player   score  year  ID
Germany  male    Michael  14     1990  1
Austria  male    Simon    13     1990  2
Germany  female  Mila     16     1990  3
Austria  female  Simona   15     1990  4
This is a table in the database. It covers 70 countries around the world, with player names and gender, and shows how many goals each player scored in which year. The years run from 1990 to 2015, so the table is large. Now I would like to know how many goals all female players and how many male players from Germany scored from 2010 to 2015. That is, I would like the total score of German male players and the total score of German female players for every year from 2010 to 2015, using SQLite.
I am expecting this output:
country  gender  score  year
Germany  male    114    2010
Germany  female  113    2010
Germany  male    110    2011
Germany  female  111    2011
Germany  male    119    2012
Germany  female  114    2012
Germany  male    119    2013
Germany  female  114    2013
Germany  male    129    2014
Germany  female  103    2014
Germany  male    109    2015
Germany  female  104    2015
SELECT
    country,
    gender,
    year,
    SUM(score) AS score
FROM
    <table_name>
WHERE
    country = 'Germany'
    AND year BETWEEN 2010 AND 2015
GROUP BY
    1, 2, 3
This filters on the country and the years you are interested in, then sums the total score for each (country, gender, year) group using GROUP BY.
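For reference, the query above can be run end-to-end with Python's built-in sqlite3 module. The table name scores and the sample rows in the 2010-2015 range below are made up for the demo:

```python
import sqlite3

# In-memory database with a few made-up rows in the 2010-2015 range
# (table name "scores" is illustrative; substitute your own).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE scores "
    "(country TEXT, gender TEXT, player TEXT, score INTEGER, year INTEGER, ID INTEGER)"
)
rows = [
    ("Germany", "male",   "Michael", 14, 2010, 1),
    ("Germany", "male",   "Stefan",  10, 2010, 2),
    ("Germany", "female", "Mila",    16, 2010, 3),
    ("Austria", "male",   "Simon",   13, 2010, 4),  # filtered out by WHERE
    ("Germany", "male",   "Michael", 11, 2011, 5),
]
con.executemany("INSERT INTO scores VALUES (?, ?, ?, ?, ?, ?)", rows)

result = con.execute("""
    SELECT country, gender, year, SUM(score) AS score
    FROM scores
    WHERE country = 'Germany' AND year BETWEEN 2010 AND 2015
    GROUP BY 1, 2, 3
    ORDER BY year, gender
""").fetchall()
print(result)
# → [('Germany', 'female', 2010, 16), ('Germany', 'male', 2010, 24), ('Germany', 'male', 2011, 11)]
```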

How can I have the same index for two different columns where the columns do not have unique values?

df1:
name_d   name_o  year
Turkiye  Italy   1990
Turkiye  Italy   1991
Turkiye  Italy   1993
Spain    Italy   1990
Spain    Italy   1991
Spain    Japan   1990
df2:
country_name  year  v2x_regime
Spain         1990  0
Turkiye       1990  0
Italy         1990  1
Turkiye       1991  1
Spain         1991  1
Italy         1991  1
Expected result:
name_o  v2x_regime_name_o  name_d   v2x_regime_name_d  year
Italy   1                  Turkiye  1                  1990
Basically, I would like to know the regime type of each country for each year. Since this is bilateral data, there are two columns that contain country names. For example, for each year I would like to have the regime index for both the name_o column and the name_d column.
Does this work for you?:
df = df1.merge(df2, left_on=['name_o','year'], right_on=['country_name','year'], how='left')
df = df.merge(df2, left_on=['name_d','year'], right_on=['country_name','year'], how='left')
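To show how the two merges fit together on a cut-down version of the sample data, here is a self-contained sketch. The suffixes argument and the final drop of the helper country_name columns are additions to the answer above, needed because both merges bring in a v2x_regime and a country_name column:

```python
import pandas as pd

# Cut-down reproduction of the sample frames
df1 = pd.DataFrame({
    "name_d": ["Turkiye", "Turkiye", "Spain"],
    "name_o": ["Italy", "Italy", "Italy"],
    "year":   [1990, 1991, 1990],
})
df2 = pd.DataFrame({
    "country_name": ["Spain", "Turkiye", "Italy", "Turkiye", "Italy"],
    "year":         [1990, 1990, 1990, 1991, 1991],
    "v2x_regime":   [0, 0, 1, 1, 1],
})

# Merge once per country column; the suffixes keep the two
# v2x_regime columns apart.
out = (
    df1.merge(df2, left_on=["name_o", "year"],
              right_on=["country_name", "year"], how="left")
       .merge(df2, left_on=["name_d", "year"],
              right_on=["country_name", "year"], how="left",
              suffixes=("_name_o", "_name_d"))
)
# Drop the helper country_name columns brought in by the merges.
out = out.loc[:, ~out.columns.str.startswith("country_name")]
print(out)
```

The first row then carries Italy's 1990 regime in v2x_regime_name_o and Turkiye's 1990 regime in v2x_regime_name_d.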

Iterating over two dfs and adding values

I have a dataframe along the lines of the below (df1):
Country Crisis
1 Italy 2008
2 Germany 2008, 2009
3 Mexico
4 US 2007
5 Greece 2010, 2007
I have another dataframe (df2) in panel data format:
Country Year
1 Italy 2007
2 Italy 2008
3 Italy 2009
4 Italy 2010
5 Germany 2007
6 Germany 2008
7 Germany 2009
8 Germany 2010
9 Mexico 2007
10 Mexico 2008
11 Mexico 2009
12 Mexico 2010
13 US 2007
14 US 2008
15 US 2009
16 US 2010
17 Greece 2007
18 Greece 2008
19 Greece 2009
20 Greece 2010
I wish to add a column to df2, "crisis", in which 1 will indicate a crisis, like so:
Country Year crisis
1 Italy 2007 0
2 Italy 2008 1
3 Italy 2009 0
4 Italy 2010 0
5 Germany 2007 0
6 Germany 2008 1
7 Germany 2009 1
8 Germany 2010 0
9 Mexico 2007 0
10 Mexico 2008 0
11 Mexico 2009 0
12 Mexico 2010 0
13 US 2007 1
14 US 2008 0
15 US 2009 0
16 US 2010 0
17 Greece 2007 1
18 Greece 2008 0
19 Greece 2009 0
20 Greece 2010 1
Any ideas?
Although not pretty, this works (assuming the Crisis column holds lists of years, e.g. [2008, 2009], with an empty list where there is no crisis):
df2['crisis'] = (
    df2.groupby('Country', sort=False, as_index=False)
       .apply(lambda x: x['Year'].isin(df1[df1['Country'] == x['Country'].iloc[0]]['Crisis'].iloc[0]))
       .droplevel(0)
)
Output:
>>> df2
Country Year crisis
1 Italy 2007 False
2 Italy 2008 True
3 Italy 2009 False
4 Italy 2010 False
5 Germany 2007 False
6 Germany 2008 True
7 Germany 2009 True
8 Germany 2010 False
9 Mexico 2007 False
10 Mexico 2008 False
11 Mexico 2009 False
12 Mexico 2010 False
13 US 2007 True
14 US 2008 False
15 US 2009 False
16 US 2010 False
17 Greece 2007 True
18 Greece 2008 False
19 Greece 2009 False
20 Greece 2010 True
>>> df2.assign(crisis=df2['crisis'].astype(int))
Country Year crisis
1 Italy 2007 0
2 Italy 2008 1
3 Italy 2009 0
4 Italy 2010 0
5 Germany 2007 0
6 Germany 2008 1
7 Germany 2009 1
8 Germany 2010 0
9 Mexico 2007 0
10 Mexico 2008 0
11 Mexico 2009 0
12 Mexico 2010 0
13 US 2007 1
14 US 2008 0
15 US 2009 0
16 US 2010 0
17 Greece 2007 1
18 Greece 2008 0
19 Greece 2009 0
20 Greece 2010 1
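The sample display of df1 is ambiguous about whether Crisis holds lists or strings. If it actually holds comma-separated strings, an alternative sketch (the split/explode/merge approach below is an assumption-driven rework, not the answer above) is:

```python
import pandas as pd

# Made-up reproduction of df1/df2; here Crisis is stored as a
# comma-separated string, with "" where there is no crisis.
df1 = pd.DataFrame({
    "Country": ["Italy", "Germany", "Mexico", "US", "Greece"],
    "Crisis":  ["2008", "2008, 2009", "", "2007", "2010, 2007"],
})
df2 = pd.DataFrame({
    "Country": [c for c in df1["Country"] for _ in range(4)],
    "Year":    [2007, 2008, 2009, 2010] * 5,
})

# One row per (Country, crisis year); empty strings drop out after the split.
crises = (
    df1.assign(Year=df1["Crisis"].str.split(","))
       .explode("Year")
       .assign(Year=lambda d: d["Year"].str.strip())
       .loc[lambda d: d["Year"].ne(""), ["Country", "Year"]]
       .astype({"Year": int})
)

# Left-merge with indicator: '_merge' == 'both' marks a crisis year.
df2["crisis"] = (
    df2.merge(crises, on=["Country", "Year"], how="left", indicator=True)
       ["_merge"].eq("both").astype(int).to_numpy()
)
print(df2)
```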

Python Pandas: add missing row of two dataframe and keep the extra columns [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Error: pandas hashtable keyerror
(3 answers)
Closed 2 years ago.
I would like to add the missing rows to dataframe df1 and keep the extra columns' information.
In [183]: df1
Out[183]:
City Country Region
0 Chicago US N.America
1 San Franciso US N.America
2 Boston US N.America
3 London UK Europe
4 Beijing China Asia
5 Omaha US N.America
In [183]: df2
Out[183]:
City
0 Chicago
1 San Franciso
2 Sao Paulo
3 Boston
4 London
5 Beijing
6 Tokyo
7 Omaha
The desired result after the merge is
City Country Region
0 Chicago US N.America
1 San Franciso US N.America
2 Sao Paulo nan nan
3 Boston US N.America
4 London UK Europe
5 Beijing China Asia
6 Tokyo nan nan
7 Omaha US N.America
I am trying pd.merge(df2, df1, on='City', how='outer') but it returns a KeyError.
Try the code below, using pd.merge with a left join, to get your desired output:
merged = pd.merge(df2,df1,how='left',on='City')
print(merged)
City Country Region
0 Chicago US N.America
1 San Fransicsco NaN NaN
2 Sao Paolo NaN NaN
3 Boston US N.America
4 London UK Europe
5 Beijing China Asia
6 Tokyo NaN NaN
7 Omaha US N.America
If you want to use an outer join, you can get this result using the below code:
merged_outer = pd.merge(df2, df1, on='City', how='outer')
print(merged_outer)
City Country Region
0 Chicago US N.America
1 San Fransicsco NaN NaN
2 Sao Paolo NaN NaN
3 Boston US N.America
4 London UK Europe
5 Beijing China Asia
6 Tokyo NaN NaN
7 Omaha US N.America
8 San Franciso US N.America
DF1 & DF2 respectively:
df1
City Country Region
0 Chicago US N.America
1 San Franciso US N.America
2 Boston US N.America
3 London UK Europe
4 Beijing China Asia
5 Omaha US N.America
df2
City
0 Chicago
1 San Fransicsco
2 Sao Paolo
3 Boston
4 London
5 Beijing
6 Tokyo
7 Omaha

Joining a Table with Itself with multiple WHERE statements

Long time reader, first time poster.
I'm trying to consolidate a table I have to compute the rate of sold goods getting lost in transit. In this table, we have four kinds of products, three countries of origin, three transit countries (where the goods are first shipped to before being passed to customers) and three destination countries. The table is as follows.
Status Product Count Origin Transit Destination
--------------------------------------------------------------------
Delivered Shoes 100 Germany France USA
Delivered Books 50 Germany France USA
Delivered Jackets 75 Germany France USA
Delivered DVDS 30 Germany France USA
Not Delivered Shoes 7 Germany France USA
Not Delivered Books 3 Germany France USA
Not Delivered Jackets 5 Germany France USA
Not Delivered DVDS 1 Germany France USA
Delivered Shoes 300 Poland Netherlands Canada
Delivered Books 80 Poland Netherlands Canada
Delivered Jackets 25 Poland Netherlands Canada
Delivered DVDS 90 Poland Netherlands Canada
Not Delivered Shoes 17 Poland Netherlands Canada
Not Delivered Books 13 Poland Netherlands Canada
Not Delivered Jackets 1 Poland Netherlands Canada
Delivered Shoes 250 Spain Ireland UK
Delivered Books 20 Spain Ireland UK
Delivered Jackets 150 Spain Ireland UK
Delivered DVDS 60 Spain Ireland UK
Not Delivered Shoes 19 Spain Ireland UK
Not Delivered Books 8 Spain Ireland UK
Not Delivered Jackets 8 Spain Ireland UK
Not Delivered DVDS 10 Spain Ireland UK
I would like to create a new table that shows the count of goods delivered and not delivered in one row, like this.
Product Delivered Not_Delivered Origin Transit Destination
Shoes 100 7 Germany France USA
Books 50 3 Germany France USA
Jackets 75 5 Germany France USA
DVDS 30 1 Germany France USA
Shoes 300 17 Poland Netherlands Canada
Books 80 13 Poland Netherlands Canada
Jackets 25 1 Poland Netherlands Canada
DVDS 90 0 Poland Netherlands Canada
Shoes 250 19 Spain Ireland UK
Books 20 8 Spain Ireland UK
Jackets 150 8 Spain Ireland UK
DVDS 60 10 Spain Ireland UK
I've had a look at some other posts and so far I haven't found exactly what I'm looking for. Perhaps the issue here is that there will be multiple WHERE conditions in the code to ensure that I don't group all shoes together, or all country groups.
Is this possible with SQL?
Something like this?
select
product
,sum(case when status = 'Delivered' then count else 0 end) as delivered
,sum(case when status = 'Not Delivered' then count else 0 end) as not_delivered
,origin
,transit
,destination
from table
group by
product
,origin
,transit
,destination
This is rather easy; instead of one line per Product, Origin, Transit, Destination and Status, you want one result line per Product, Origin, Transit and Destination only. So group by these four columns and aggregate conditionally:
select
product, origin, transit, destination,
sum(case when status = 'Delivered' then "count" else 0 end) as delivered,
sum(case when status = 'Not Delivered' then "count" else 0 end) as not_delivered
from mytable
group by product, origin, transit, destination;
BTW: It is not a good idea to use a keyword for a column name. I used double quotes around your column count, which is standard SQL, but I don't know if it works in Google BigQuery. Maybe it must be "Count" rather than "count", or something else entirely.
SELECT
product, origin, transit, destination,
SUM([count] * (status = 'Delivered')) AS delivered,
SUM([count] * (status = 'Not Delivered')) AS not_delivered
FROM mytable
GROUP BY 1, 2, 3, 4
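The conditional-aggregation approach above can be sanity-checked with Python's built-in sqlite3 module. The table name shipments and the subset of sample rows are made up for the demo, and "count" is quoted because it is an SQL keyword:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# "count" is an SQL keyword, so the column is quoted throughout
# (table name "shipments" is illustrative; substitute your own).
con.execute(
    'CREATE TABLE shipments '
    '(status TEXT, product TEXT, "count" INTEGER, origin TEXT, transit TEXT, destination TEXT)'
)
rows = [
    ("Delivered",     "Shoes", 100, "Germany", "France", "USA"),
    ("Not Delivered", "Shoes",   7, "Germany", "France", "USA"),
    ("Delivered",     "Books",  50, "Germany", "France", "USA"),
    ("Not Delivered", "Books",   3, "Germany", "France", "USA"),
    ("Delivered",     "DVDS",   90, "Poland", "Netherlands", "Canada"),  # no Not Delivered row
]
con.executemany("INSERT INTO shipments VALUES (?, ?, ?, ?, ?, ?)", rows)

result = con.execute('''
    SELECT product, origin, transit, destination,
           SUM(CASE WHEN status = 'Delivered' THEN "count" ELSE 0 END) AS delivered,
           SUM(CASE WHEN status = 'Not Delivered' THEN "count" ELSE 0 END) AS not_delivered
    FROM shipments
    GROUP BY product, origin, transit, destination
    ORDER BY origin, product
''').fetchall()
for row in result:
    print(row)
```

Note that the missing "Not Delivered" row for Poland's DVDS correctly comes out as 0, since the ELSE 0 branch covers the absent status.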