Replacing with NaN - pandas

I am trying to replace the placeholder '.' string with NaN in the total revenue column. This is the code used to create the df.
import pandas as pd

raw_data = {'Rank': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Company': ['Microsoft', 'Oracle', "IBM", 'SAP', 'Symantec', 'EMC', 'VMware', 'HP', 'Salesforce.com', 'Intuit'],
'Company_HQ': ['USA', 'USA', 'USA', 'Germany', 'USA', 'USA', 'USA', 'USA', 'USA', 'USA'],
'Software_revenue': ['$62,014', '$29,881', '$29,286', '$18,777', '$6,138', '$5,844', '$5,520', '$5,082', '$4,820', '$4,324'],
'Total_revenue': ['93,456', '38,828', '92,793', '23,289', '6,615', ".", '6,035', '110,577', '5,274', '4,573'],
'Percent_revenue_total': ['66.36%', '76.96%', '31.56%', '80.63%', '92.79%', '23.91%', '91.47%', '4.60%', '91.40%', '94.55%']}
df = pd.DataFrame(raw_data, columns = ['Rank', 'Company', 'Company_HQ', 'Software_revenue', 'Total_revenue', 'Percent_revenue_total'])
df
I have tried using:
import numpy as np
df['Total_revenue'] = df['Total_revenue'].replace('.', np.nan, regex=True)
df
However, this replaces the entire column with NaN instead of just the placeholder '.' value.

You only need to change regex=True to regex=False. With regex=True the pattern is treated as a regular expression, and in a regex '.' matches any character; with regex=False the pattern is treated as a literal string, which is what you want here:
import pandas as pd
raw_data = {'Rank': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Company': ['Microsoft', 'Oracle', "IBM", 'SAP', 'Symantec', 'EMC', 'VMware', 'HP', 'Salesforce.com', 'Intuit'],
'Company_HQ': ['USA', 'USA', 'USA', 'Germany', 'USA', 'USA', 'USA', 'USA', 'USA', 'USA'],
'Software_revenue': ['$62,014', '$29,881', '$29,286', '$18,777', '$6,138', '$5,844', '$5,520', '$5,082', '$4,820', '$4,324'],
'Total_revenue': ['93,456', '38,828', '92,793', '23,289', '6,615', ".", '6,035', '110,577', '5,274', '4,573'],
'Percent_revenue_total': ['66.36%', '76.96%', '31.56%', '80.63%', '92.79%', '23.91%', '91.47%', '4.60%', '91.40%', '94.55%']}
df = pd.DataFrame(raw_data, columns = ['Rank', 'Company', 'Company_HQ', 'Software_revenue', 'Total_revenue', 'Percent_revenue_total'])
import numpy as np
df['Total_revenue'] = df['Total_revenue'].replace('.', np.nan, regex=False)
print(df)
Output:
Rank Company Company_HQ Software_revenue Total_revenue Percent_revenue_total
0 1 Microsoft USA $62,014 93,456 66.36%
1 2 Oracle USA $29,881 38,828 76.96%
2 3 IBM USA $29,286 92,793 31.56%
3 4 SAP Germany $18,777 23,289 80.63%
4 5 Symantec USA $6,138 6,615 92.79%
5 6 EMC USA $5,844 NaN 23.91%
6 7 VMware USA $5,520 6,035 91.47%
7 8 HP USA $5,082 110,577 4.60%
8 9 Salesforce.com USA $4,820 5,274 91.40%
9 10 Intuit USA $4,324 4,573 94.55%
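To see why regex=True wiped out the whole column: as a regular expression, '.' matches any single character, so every non-empty value matched and was replaced. A minimal sketch on a throwaway Series (the Series here is only for illustration):
import numpy as np
import pandas as pd

s = pd.Series(['93,456', '.', '6,035'])

# regex=True: '.' is a regular expression that matches any character,
# so every value matches and the whole Series becomes NaN
print(s.replace('.', np.nan, regex=True))

# regex=False: '.' is compared as a literal string,
# so only the placeholder row becomes NaN
print(s.replace('.', np.nan, regex=False))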

'.' is a special character in regex that matches any character. You need to escape it so the regex treats it as a literal character:
df['Total_revenue'].replace(r'\.', np.nan, regex=True)
Out[52]:
0 93,456
1 38,828
2 92,793
3 23,289
4 6,615
5 NaN
6 6,035
7 110,577
8 5,274
9 4,573
Name: Total_revenue, dtype: object
In your case, you should use mask
df['Total_revenue'].mask(df['Total_revenue'].eq('.'))
Out[58]:
0 93,456
1 38,828
2 92,793
3 23,289
4 6,615
5 NaN
6 6,035
7 110,577
8 5,274
9 4,573
Name: Total_revenue, dtype: object
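For reference, mask replaces values where the condition is True (with NaN by default), and Series.where is its mirror image, keeping values where the condition is True. A small sketch, reusing df as defined in the question; both lines should produce the same result:
# mask: set entries equal to '.' to NaN (NaN is the default replacement value)
cleaned = df['Total_revenue'].mask(df['Total_revenue'].eq('.'))

# where is the inverse: keep entries that are not '.', replace the rest with NaN
same_result = df['Total_revenue'].where(df['Total_revenue'].ne('.'))

print(cleaned.equals(same_result))  # True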

I went one step further here and changed the column type to numeric, so you can also use it for calculations.
df.Total_revenue = pd.to_numeric(df.Total_revenue.str.replace(',',''),errors='coerce').astype('float')
df.Total_revenue
0 93456.0
1 38828.0
2 92793.0
3 23289.0
4 6615.0
5 NaN
6 6035.0
7 110577.0
8 5274.0
9 4573.0
Name: Total_revenue, dtype: float64
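Once Total_revenue is numeric, the usual aggregations work and NaN values are skipped by default. A quick sketch, assuming the conversion above has already been applied; cleaning Software_revenue the same way is shown only as an illustration:
# NaN-aware aggregations: skipna=True is the default
print(df['Total_revenue'].mean())
print(df['Total_revenue'].sum())

# Software_revenue can be cleaned the same way by also stripping the '$'
software = pd.to_numeric(
    df['Software_revenue'].str.replace('[$,]', '', regex=True),
    errors='coerce'
)
print(software)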

In my opinion "replace" is not required as user wanted to change "." Whole to nan. Inistead this will also work. It finds rows with "." And assign nan to it
df.loc[df['Total_revenue']==".", 'Total_revenue'] = np.nan

You can try the code below to apply the replacement to the whole DataFrame:
df.replace('.', np.nan)
or, if you want to apply it to a specific column, use df['Total_revenue'] instead of df (see the sketch after the output below).
The output is:
Rank Company Company_HQ Software_revenue Total_revenue Percent_revenue_total
0 1 Microsoft USA $62,014 93,456 66.36%
1 2 Oracle USA $29,881 38,828 76.96%
2 3 IBM USA $29,286 92,793 31.56%
3 4 SAP Germany $18,777 23,289 80.63%
4 5 Symantec USA $6,138 6,615 92.79%
5 6 EMC USA $5,844 NaN 23.91%
6 7 VMware USA $5,520 6,035 91.47%
7 8 HP USA $5,082 110,577 4.60%
8 9 Salesforce.com USA $4,820 5,274 91.40%
9 10 Intuit USA $4,324 4,573 94.55%
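For the column-only version mentioned above, a minimal sketch (regex=False is the default, so '.' is matched as a literal string here):
df['Total_revenue'] = df['Total_revenue'].replace('.', np.nan)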

Related

Pandas Decile Rank

I just used the pandas qcut function to create a decile ranking, but how do I look at the bounds of each ranking? Basically, how do I know what numbers fall in the range of the ranking of 1 or 2 or 3, etc.?
I hope the following Python code with two short examples can help you. For the second example I used the isin method.
import numpy as np
import pandas as pd
data = {'Name': ['Mike', 'Anton', 'Simon', 'Amy',
                 'Claudia', 'Peter', 'David', 'Tom'],
        'Score': [42, 63, 75, 97, 61, 30, 80, 13]}
df = pd.DataFrame(data, columns=['Name', 'Score'])
df['decile_rank'] = pd.qcut(df['Score'], 10, labels=False)
print(df)
Output:
Name Score decile_rank
0 Mike 42 2
1 Anton 63 5
2 Simon 75 7
3 Amy 97 9
4 Claudia 61 4
5 Peter 30 1
6 David 80 8
7 Tom 13 0
rank_1 = df[df['decile_rank']==1]
print(rank_1)
Output:
Name Score decile_rank
5 Peter 30 1
rank_1_and_2 = df[df['decile_rank'].isin([1,2])]
print(rank_1_and_2)
Output:
Name Score decile_rank
0 Mike 42 2
5 Peter 30 1
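To get at the bounds asked about in the question, qcut can also return the bin edges via retbins=True. A short sketch using the same Score column (the loop is only for display):
# retbins=True returns the computed quantile edges alongside the decile codes
codes, bins = pd.qcut(df['Score'], 10, labels=False, retbins=True)

# bins has 11 edges; decile i covers roughly (bins[i], bins[i + 1]]
for i in range(len(bins) - 1):
    print(f"decile {i}: ({bins[i]:.1f}, {bins[i + 1]:.1f}]")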

Row filtering with respect to intersection of 2 columns

I have a data frame named data_2010 with 3 columns CountryName, IndicatorName and Value.
For example:
data_2010
CountryName IndicatorName Value
4839015 Arab World Access to electricity (% of population) 8.434222e+01
4839016 Arab World Access to electricity, rural (% of rural popul... 7.196990e+01
4839017 Arab World Access to electricity, urban (% of urban popul... 9.382846e+01
4839018 Arab World Access to non-solid fuel (% of population) 8.600367e+01
4839019 Arab World Access to non-solid fuel, rural (% of rural po... 7.455260e+01
... ... ... ...
5026216 Zimbabwe Urban population (% of total) 3.319600e+01
5026217 Zimbabwe Urban population growth (annual %) 1.279630e+00
5026218 Zimbabwe Use of IMF credit (DOD, current US$) 5.287290e+08
5026219 Zimbabwe Vitamin A supplementation coverage rate (% of ... 4.930002e+01
5026220 Zimbabwe Women's share of population ages 15+ living wi... 5.898546e+01
The problem is that there are 247 unique countries and 1299 unique IndicatorNames, and not every country has data for all the indicators. I want the set of countries and indicator names such that every country has data for the same indicator names and vice versa.
(Edit)
df:
df = pd.DataFrame({'CountryName': ['USA', 'USA','USA','UAE', 'UAE', 'Saudi', 'Saudi', 'Zimbabwe', 'Zimbabwe', 'Zimbabwe'],
'IndicatorName': ['elec', 'area', 'fuel', 'elec','fuel','fuel', 'population', 'elec', 'fuel', 'co2 emission'],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
Expected output for df:
CountryName IndicatorName value
USA elec 1
USA fuel 3
UAE elec 4
UAE fuel 5
Zimbabwe elec 8
Zimbabwe fuel 9
Solution not working for this case:
df = pd.DataFrame(
{'CountryName': ['USA', 'USA', 'USA', 'UAE', 'UAE', 'Saudi', 'Saudi', 'Zimbabwe', 'Zimbabwe', 'Zimbabwe', 'Spain'],
'IndicatorName': ['elec', 'area', 'fuel', 'elec','fuel','fuel', 'population', 'elec', 'fuel', 'co2 emission','population'],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
})
Output I got:
CountryName IndicatorName value
0 Saudi fuel 6
1 Saudi population 7
2 UAE elec 4
3 UAE fuel 5
4 USA elec 1
5 USA fuel 3
6 Zimbabwe elec 8
7 Zimbabwe fuel 9
Output expected:
CountryName IndicatorName value
0 UAE elec 4
1 UAE fuel 5
2 USA elec 1
3 USA fuel 3
4 Zimbabwe elec 8
5 Zimbabwe fuel 9
Though Saudi has 2 indicators, they're not common to the rest.
For example, if Saudi had 3 indicators like ['elec', 'fuel', 'credit'], then Saudi would be added to the final df with elec and fuel.
You can groupby IndicatorName, get the number of unique countries that have the indicator name, then filter your df to keep only the rows that have that indicator for > 1 country.
Nit: your CountryName column is missing a comma between 'USA' 'UAE', fixed below.
df = pd.DataFrame(
{'CountryName': ['USA', 'USA', 'USA', 'UAE', 'UAE', 'Saudi', 'Saudi', 'Zimbabwe', 'Zimbabwe', 'Zimbabwe'],
'IndicatorName': ['elec', 'area', 'fuel', 'elec','fuel','fuel', 'population', 'elec', 'fuel', 'co2 emission'],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})
df_indicators = df.groupby('IndicatorName', as_index=False)['CountryName'].nunique()
df_indicators = df_indicators.rename(columns={'CountryName': 'CountryCount'})
df_indicators = df_indicators[df_indicators['CountryCount'] > 1]
# merge on only the indicator column, how='inner' - which is the default so no need to specify
# to keep only those indicators that have a value for > 1 country
df2use = df.merge(df_indicators[['IndicatorName']], on=['IndicatorName'])
df2use = df2use.sort_values(by=['CountryName', 'IndicatorName'])
to get
CountryName IndicatorName value
5 Saudi fuel 6
1 UAE elec 4
4 UAE fuel 5
0 USA elec 1
3 USA fuel 3
2 Zimbabwe elec 8
6 Zimbabwe fuel 9
Looks like you also want to exclude Saudi because, although it has fuel, it has only 1 common IndicatorName. If so, you can use a similar process for countries rather than indicators, starting with only the countries and indicators that survived the first round of filtering, so after the code above use:
df_countries = df2use.groupby('CountryName', as_index=False)['IndicatorName'].nunique()
df_countries = df_countries.rename(columns={'IndicatorName': 'IndicatorCount'})
df_countries = df_countries[df_countries['IndicatorCount'] > 1]
df2use = df2use.merge(df_countries[['CountryName']], on=['CountryName'])
df2use = df2use.sort_values(by=['CountryName', 'IndicatorName'])
to get
CountryName IndicatorName value
0 UAE elec 4
1 UAE fuel 5
2 USA elec 1
3 USA fuel 3
4 Zimbabwe elec 8
5 Zimbabwe fuel 9
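If dropping a country can make another indicator rare again, the same two filtering steps may need to be repeated until nothing changes. A rough sketch of that idea wrapped in a loop; the helper name and the min_count threshold are my own choices, not part of the original answer:
def keep_common(df, min_count=2):
    # Repeat the indicator/country filtering until the frame stops shrinking
    out = df.copy()
    while True:
        before = len(out)
        # keep indicators present in at least min_count countries
        ind_counts = out.groupby('IndicatorName')['CountryName'].transform('nunique')
        out = out[ind_counts >= min_count]
        # keep countries that still have at least min_count of those indicators
        ctry_counts = out.groupby('CountryName')['IndicatorName'].transform('nunique')
        out = out[ctry_counts >= min_count]
        if len(out) == before:
            return out.sort_values(['CountryName', 'IndicatorName']).reset_index(drop=True)

print(keep_common(df))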

How to add data in separate columns in a Pandas DataFrame?

question:
goldmedal = pd.DataFrame({'Country': ['india', 'japan', 'korea'],
'Medals': [5, 3, 4]}
)
silvermedal = pd.DataFrame({'Country': ['india', 'china', 'korea'],
'Medals': [9, 0, 6]}
)
bronzemedal = pd.DataFrame({'Country': ['japan', 'india', 'vietnam'],
'Medals': [4, 2, 2]}
)
I need to find the cumulative medals earned by the mentioned countries.
I tried this
add function: goldmedal.add(silvermedal, fill_value=0), which outputs:
Country Medals
0 indiaindia 14
1 japanchina 3
2 koreakorea 10
merge function: pd.merge(goldmedal, silvermedal, how='inner', on='Country'), which outputs:
Country Medalsx Medalsy
0 india 5 9
1 korea 4 6
How do I get the following output?
Country Medals
0 india 16
1 china 0
2 korea 10
3 vietnam 2
4 japan 7
pd.concat([goldmedal, silvermedal, bronzemedal]).groupby('Country').sum().reset_index()
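For reference, the .add attempt in the question concatenated the country names because add aligns on the index and Country was still an ordinary column, so the string columns were added element-wise. A minimal sketch of the same idea with Country moved into the index (fill_value=0 covers countries missing from a frame); the result comes back as float because of the alignment, so cast with astype(int) if needed:
totals = (
    goldmedal.set_index('Country')
    .add(silvermedal.set_index('Country'), fill_value=0)
    .add(bronzemedal.set_index('Country'), fill_value=0)
    .reset_index()
)
print(totals)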

How do I aggregate a pandas DataFrame while retaining all original data?

My goal is to aggregate a pandas DataFrame, grouping rows by an identity field. Notably, rather than just gathering summary statistics of the group, I want to retain all the information in the DataFrame in addition to summary statistics like mean, std, etc. I have performed this transformation via a lot of iteration, but I am looking for a cleaner/more pythonic approach. Notably, there may be more or less than 2 replicates per group, but all groups will always have the same number of replicates.
Example: I would like to translate the below format
df = pd.DataFrame([
["group1", 4, 10],
["group1", 8, 20],
["group2", 6, 30],
["group2", 12, 40],
["group3", 1, 50],
["group3", 3, 60]],
columns=['group','timeA', 'timeB'])
print(df)
group timeA timeB
0 group1 4 10
1 group1 8 20
2 group2 6 30
3 group2 12 40
4 group3 1 50
5 group3 3 60
into a df of the following format:
target = pd.DataFrame([
["group1", 4, 8, 6, 10, 20, 15],
["group2", 6, 12, 9, 30, 45, 35],
["group3", 1, 3, 2, 50, 60, 55]
], columns = ["group", "timeA.1", "timeA.2", "timeA.mean", "timeB.1", "timeB.2", "timeB.mean"])
print(target)
group timeA.1 timeA.2 timeA.mean timeB.1 timeB.2 timeB.mean
0 group1 4 8 6 10 20 15
1 group2 6 12 9 30 45 35
2 group3 1 3 2 50 60 55
Finally, it doesn't really matter what the column names are, these ones are just to make the example more clear. Thanks!
EDIT: As suggested by a user in the comments, I tried the solution from the linked Q/A without success:
df.insert(0, 'count', df.groupby('group').cumcount())
df.pivot(*df)
TypeError: pivot() takes from 1 to 4 positional arguments but 5 were given
Try with pivot_table:
out = (df.assign(col=df.groupby('group').cumcount() + 1)
         .pivot_table(index='group', columns='col',
                      margins=True, margins_name='mean')
         .drop('mean'))
out.columns = [f'{x}.{y}' for x,y in out.columns]
Output:
timeA.1 timeA.2 timeA.mean timeB.1 timeB.2 timeB.mean
group
group1 4.0 8.0 6.0 10 20 15
group2 6.0 12.0 9.0 30 40 35
group3 1.0 3.0 2.0 50 60 55
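An equivalent route, closer to the pivot the question originally attempted, is to pivot the numbered replicates into wide form and join per-group means computed separately. A sketch under the same column names, not necessarily better than pivot_table:
# number the replicates within each group, then pivot to wide form
tmp = df.assign(rep=df.groupby('group').cumcount() + 1)
wide = tmp.pivot(index='group', columns='rep', values=['timeA', 'timeB'])
wide.columns = [f'{col}.{rep}' for col, rep in wide.columns]

# per-group means with matching '.mean' column names
means = df.groupby('group')[['timeA', 'timeB']].mean().add_suffix('.mean')

out = wide.join(means).sort_index(axis=1).reset_index()
print(out)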

Finding duplicate entries

I am working with the 515k Hotel Reviews dataset from Kaggle. There are 1492 unique hotel names and 1493 unique addresses. So at first it would appear that one (or possibly more) hotel has more than one address. But, if I do a groupby.count on the data, I get 1494 whether I groupby HotelName followed by Address or if I reverse the order.
In order to make this reproducible, hopefully this simplification will suffice:
data = {
'HotelName': ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B', 'B', 'C', 'C'],
'Address': [1, 2, 3, 4, 1, 2, 3, 4, 2, 2, 2, 3, 5]
}
df = pd.DataFrame(data, columns = ['HotelName', 'Address'])
df['HotelName'].unique().shape[0] # Returns 4
df['Address'].unique().shape[0] # Returns 5
df.groupby(['Address', 'HotelName']).count().shape[0] # Returns 6
df.groupby(['HotelName', 'Address']).count().shape[0] # Returns 6
I would like to find the hotel names that have different addresses. So in my example, I would like to find the A and C along with their addresses (1,2 and 3,5 respectively). That code should be enough for me to also find the addresses that have duplicate hotel names.
Use the nunique groupby aggregator:
>>> n_uniq = df.groupby('HotelName')['Address'].nunique()
>>> n_uniq
HotelName
A 2
B 1
C 2
D 1
Name: Address, dtype: int64
If you want to look at the distinct hotels with more than one address in the original dataframe,
>>> hotels_with_mult_addr = n_uniq.index[n_uniq > 1]
>>> df[df['HotelName'].isin(hotels_with_mult_addr)].drop_duplicates()
HotelName Address
0 A 1
2 C 3
8 A 2
12 C 5
If I understand you correctly, we can check which hotel has more than 1 unique address with groupby.transform('nunique'):
m = df.groupby('HotelName')['Address'].transform('nunique').ne(1)
print(df.loc[m])
HotelName Address
0 A 1
2 C 3
4 A 1
6 C 3
8 A 2
11 C 3
12 C 5
If you want to get a more concise view of what the duplicates are, use groupby.agg(set):
df.loc[m].groupby('HotelName')['Address'].agg(set).reset_index(name='addresses')
HotelName addresses
0 A {1, 2}
1 C {3, 5}
Step by step:
transform('nunique') gives us the number of unique addresses next to each row:
df.groupby('HotelName')['Address'].transform('nunique')
0 2
1 1
2 2
3 1
4 2
5 1
6 2
7 1
8 2
9 1
10 1
11 2
12 2
Name: Address, dtype: int64
Then we check which rows are not equal (ne) to 1 and filter those:
df.groupby('HotelName')['Address'].transform('nunique').ne(1)
0 True
1 False
2 True
3 False
4 True
5 False
6 True
7 False
8 True
9 False
10 False
11 True
12 True
Name: Address, dtype: bool
Groupby didn't do what you expected. After you did the groupby, here is what you got:
HotelName Address
0 A 1
4 A 1
HotelName Address
8 A 2
HotelName Address
1 B 2
5 B 2
9 B 2
10 B 2
HotelName Address
2 C 3
6 C 3
11 C 3
HotelName Address
3 D 4
7 D 4
HotelName Address
12 C 5
There are indeed 6 combinations!
If you want to know the duplication in each group, you should check the row index.
Here is the long way to do it, where dfnew['count'] == 1 means the (HotelName, Address) combination is unique:
df = pd.DataFrame(data, columns=['HotelName', 'Address'])
df = df.sort_values(by=['HotelName', 'Address']).reset_index(drop=True)
count = df.groupby(['HotelName', 'Address'])['Address'].count().reset_index(drop=True)
df['rownum'] = df.groupby(['HotelName', 'Address']).cumcount() + 1
dfnew = df[df['rownum'] == 1].reset_index(drop=True).drop(columns='rownum')
dfnew['count'] = count
dfnew
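A shorter route to the same information is to de-duplicate the (HotelName, Address) pairs first and then keep the hotel names that still appear more than once. A sketch built on the df above:
# unique hotel/address pairs
pairs = df[['HotelName', 'Address']].drop_duplicates()

# hotels that still occur more than once have multiple addresses
multi = pairs[pairs.duplicated('HotelName', keep=False)].sort_values(['HotelName', 'Address'])
print(multi)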