I am attempting to create a new df that shows all columns and their unique values. I have the following code, but I think I am referencing the column of the df in the loop incorrectly.
#Create empty df
df_unique = pd.DataFrame()
#Loop to take unique values from each column and append to df
for col in df:
    list = df(col).unique().tolist()
    df_unique.loc[len(df_unique)] = list
To visualize what I am hoping to achieve, I've included a before and after example below.
Before
ID Name Zip Type
01 Bennett 10115 House
02 Sally 10119 Apt
03 Ben 11001 House
04 Bennett 10119 House
After
Column List_of_unique
ID 01, 02, 03, 04
Name Bennett, Sally, Ben
Zip 10115, 10119, 11001
Type House, Apt
You can use (note that np.unique returns the values sorted, while Series.unique preserves the order of appearance):
>>> import numpy as np
>>> df.apply(np.unique)
ID [1, 2, 3, 4]
Name [Ben, Bennett, Sally]
Zip [10115, 10119, 11001]
Type [Apt, House]
dtype: object
# OR
>>> (df.apply(lambda x: ', '.join(x.unique().astype(str)))
.rename_axis('Column').rename('List_of_unique').reset_index())
Column List_of_unique
0 ID 1, 2, 3, 4
1 Name Bennett, Sally, Ben
2 Zip 10115, 10119, 11001
3 Type House, Apt
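For reference, the bug in the original loop is df(col), which calls the DataFrame like a function instead of indexing it with df[col] (list also shadows the built-in). A minimal corrected sketch of that row-appending approach:
# Corrected loop: index with df[col] and avoid shadowing the built-in list
df_unique = pd.DataFrame(columns=['Column', 'List_of_unique'])
for col in df:
    uniques = df[col].unique().tolist()
    df_unique.loc[len(df_unique)] = [col, ', '.join(map(str, uniques))]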
I have not used pandas explode before. I get the gist of pd.explode, but for value lists where selected columns have nested lists, I heard that pd.Series.explode is useful. However, I keep getting: KeyError: "None of ['city'] are in the columns". Yet 'city' is defined in the keys:
keys = ["city", "temp"]
values = [["chicago","london","berlin"], [[32,30,28],[39,40,25],[33,34,35]]]
df = pd.DataFrame({"keys":keys,"values":values})
df2 = df.set_index(['city']).apply(pd.Series.explode).reset_index()
The desired output is:
city / temp
chicago / 32
chicago / 30
chicago / 28
etc.
I would appreciate an expert weighing in on why this throws an error, and a fix. Thank you.
The problem comes from how you define df:
df = pd.DataFrame({"keys":keys,"values":values})
This actually gives you the following dataframe:
keys values
0 city [chicago, london, berlin]
1 temp [[32, 30, 28], [39, 40, 25], [33, 34, 35]]
You probably meant:
df = pd.DataFrame(dict(zip(keys, values)))
Which gives you:
city temp
0 chicago [32, 30, 28]
1 london [39, 40, 25]
2 berlin [33, 34, 35]
You can then use explode:
print(df.explode('temp'))
Output:
city temp
0 chicago 32
0 chicago 30
0 chicago 28
1 london 39
1 london 40
1 london 25
2 berlin 33
2 berlin 34
2 berlin 35
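Once df is defined correctly, the set_index pattern from the question also works, and becomes useful when several columns hold lists of equal length. A sketch, assuming a second, hypothetical list column humidity:
# Multi-column explode: keep the scalar column as the index, explode each list column
df['humidity'] = [[55, 60, 58], [70, 72, 75], [40, 42, 45]]  # hypothetical extra column
df2 = df.set_index('city').apply(pd.Series.explode).reset_index()
print(df2)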
I have a data frame named data_2010 with 3 columns: CountryName, IndicatorName, and Value.
For example:
data_2010
CountryName IndicatorName Value
4839015 Arab World Access to electricity (% of population) 8.434222e+01
4839016 Arab World Access to electricity, rural (% of rural popul... 7.196990e+01
4839017 Arab World Access to electricity, urban (% of urban popul... 9.382846e+01
4839018 Arab World Access to non-solid fuel (% of population) 8.600367e+01
4839019 Arab World Access to non-solid fuel, rural (% of rural po... 7.455260e+01
... ... ... ...
5026216 Zimbabwe Urban population (% of total) 3.319600e+01
5026217 Zimbabwe Urban population growth (annual %) 1.279630e+00
5026218 Zimbabwe Use of IMF credit (DOD, current US$) 5.287290e+08
5026219 Zimbabwe Vitamin A supplementation coverage rate (% of ... 4.930002e+01
5026220 Zimbabwe Women's share of population ages 15+ living wi... 5.898546e+01
The problem is that there are 247 unique countries and 1299 unique IndicatorNames, and not every country has data for all the indicators. I want the set of countries and indicator names such that every country has data for the same indicator names and vice versa.
(Edit)
df:
df = pd.DataFrame({'CountryName': ['USA', 'USA','USA','UAE', 'UAE', 'Saudi', 'Saudi', 'Zimbabwe', 'Zimbabwe', 'Zimbabwe'],
'IndicatorName': ['elec', 'area', 'fuel', 'elec','fuel','fuel', 'population', 'elec', 'fuel', 'co2 emission'],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
Expected output for df:
CountryName IndicatorName value
USA elec 1
USA fuel 3
UAE elec 4
UAE fuel 5
Zimbabwe elec 8
Zimbabwe fuel 9
Solution not working for this case:
df = pd.DataFrame(
{'CountryName': ['USA', 'USA', 'USA', 'UAE', 'UAE', 'Saudi', 'Saudi', 'Zimbabwe', 'Zimbabwe', 'Zimbabwe', 'Spain'],
'IndicatorName': ['elec', 'area', 'fuel', 'elec','fuel','fuel', 'population', 'elec', 'fuel', 'co2 emission','population'],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
})
Output I got:
CountryName IndicatorName value
0 Saudi fuel 6
1 Saudi population 7
2 UAE elec 4
3 UAE fuel 5
4 USA elec 1
5 USA fuel 3
6 Zimbabwe elec 8
7 Zimbabwe fuel 9
Output expected:
CountryName IndicatorName value
0 UAE elec 4
1 UAE fuel 5
2 USA elec 1
3 USA fuel 3
4 Zimbabwe elec 8
5 Zimbabwe fuel 9
Though Saudi has 2 indicators, they're not common to the rest.
For example, if Saudi had 3 indicators like ['elec', 'fuel', 'credit'], then Saudi would be added to the final df with elec and fuel.
You can groupby IndicatorName, get the number of unique countries that have the indicator name, then filter your df to keep only the rows that have that indicator for > 1 country.
Nit: your CountryName column is missing a comma between 'USA' 'UAE', fixed below.
df = pd.DataFrame(
{'CountryName': ['USA', 'USA', 'USA', 'UAE', 'UAE', 'Saudi', 'Saudi', 'Zimbabwe', 'Zimbabwe', 'Zimbabwe'],
'IndicatorName': ['elec', 'area', 'fuel', 'elec','fuel','fuel', 'population', 'elec', 'fuel', 'co2 emission'],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})
df_indicators = df.groupby('IndicatorName', as_index=False)['CountryName'].nunique()
df_indicators = df_indicators.rename(columns={'CountryName': 'CountryCount'})
df_indicators = df_indicators[df_indicators['CountryCount'] > 1]
# merge on only the indicator column, how='inner' - which is the default so no need to specify
# to keep only those indicators that have a value for > 1 country
df2use = df.merge(df_indicators[['IndicatorName']], on=['IndicatorName'])
df2use = df2use.sort_values(by=['CountryName', 'IndicatorName'])
to get
CountryName IndicatorName value
5 Saudi fuel 6
1 UAE elec 4
4 UAE fuel 5
0 USA elec 1
3 USA fuel 3
2 Zimbabwe elec 8
6 Zimbabwe fuel 9
Looks like you also want to exclude Saudi because, although it has fuel, it has only 1 common IndicatorName. If so, you can use a similar process for countries rather than indicators, starting with only the countries and indicators that survived the first round of filtering. So after the code above, use:
df_countries = df2use.groupby('CountryName', as_index=False)['IndicatorName'].nunique()
df_countries = df_countries.rename(columns={'IndicatorName': 'IndicatorCount'})
df_countries = df_countries[df_countries['IndicatorCount'] > 1]
df2use = df2use.merge(df_countries[['CountryName']], on=['CountryName'])
df2use = df2use.sort_values(by=['CountryName', 'IndicatorName'])
to get
CountryName IndicatorName value
0 UAE elec 4
1 UAE fuel 5
2 USA elec 1
3 USA fuel 3
4 Zimbabwe elec 8
5 Zimbabwe fuel 9
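Note that dropping countries can in turn lower the country counts of some indicators, so on larger data the two passes may need repeating until nothing changes. A sketch of that fixed-point loop using the same > 1 thresholds (an extension, not part of the answer above):
prev_len = -1
df2use = df.copy()
while len(df2use) != prev_len:
    prev_len = len(df2use)
    # keep indicators present in more than one country
    n_countries = df2use.groupby('IndicatorName')['CountryName'].transform('nunique')
    df2use = df2use[n_countries > 1]
    # keep countries with more than one surviving indicator
    n_indicators = df2use.groupby('CountryName')['IndicatorName'].transform('nunique')
    df2use = df2use[n_indicators > 1]
df2use = df2use.sort_values(by=['CountryName', 'IndicatorName'])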
I am trying to replace the placeholder '.' string with NaN in the total revenue column. This is the code used to create the df.
raw_data = {'Rank': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Company': ['Microsoft', 'Oracle', "IBM", 'SAP', 'Symantec', 'EMC', 'VMware', 'HP', 'Salesforce.com', 'Intuit'],
'Company_HQ': ['USA', 'USA', 'USA', 'Germany', 'USA', 'USA', 'USA', 'USA', 'USA', 'USA'],
'Software_revenue': ['$62,014', '$29,881', '$29,286', '$18,777', '$6,138', '$5,844', '$5,520', '$5,082', '$4,820', '$4,324'],
'Total_revenue': ['93,456', '38,828', '92,793', '23,289', '6,615', ".", '6,035', '110,577', '5,274', '4,573'],
'Percent_revenue_total': ['66.36%', '76.96%', '31.56%', '80.63%', '92.79%', '23.91%', '91.47%', '4.60%', '91.40%', '94.55%']}
df = pd.DataFrame(raw_data, columns = ['Rank', 'Company', 'Company_HQ', 'Software_revenue', 'Total_revenue', 'Percent_revenue_total'])
df
I have tried using:
import numpy as np
df['Total_revenue'] = df['Total_revenue'].replace('.', np.nan, regex=True)
df
However, this replaces the entire column with NaN instead of just the placeholder '.' value.
You only need to set regex=False. When regex=True, the pattern is treated as a regular expression, where '.' matches any character; setting it to False treats the pattern as a literal string (which is what I believe you want):
import pandas as pd
raw_data = {'Rank': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Company': ['Microsoft', 'Oracle', "IBM", 'SAP', 'Symantec', 'EMC', 'VMware', 'HP', 'Salesforce.com', 'Intuit'],
'Company_HQ': ['USA', 'USA', 'USA', 'Germany', 'USA', 'USA', 'USA', 'USA', 'USA', 'USA'],
'Software_revenue': ['$62,014', '$29,881', '$29,286', '$18,777', '$6,138', '$5,844', '$5,520', '$5,082', '$4,820', '$4,324'],
'Total_revenue': ['93,456', '38,828', '92,793', '23,289', '6,615', ".", '6,035', '110,577', '5,274', '4,573'],
'Percent_revenue_total': ['66.36%', '76.96%', '31.56%', '80.63%', '92.79%', '23.91%', '91.47%', '4.60%', '91.40%', '94.55%']}
df = pd.DataFrame(raw_data, columns = ['Rank', 'Company', 'Company_HQ', 'Software_revenue', 'Total_revenue', 'Percent_revenue_total'])
import numpy as np
df['Total_revenue'] = df['Total_revenue'].replace('.', np.nan, regex=False)
print(df)
Output:
Rank Company Company_HQ Software_revenue Total_revenue Percent_revenue_total
0 1 Microsoft USA $62,014 93,456 66.36%
1 2 Oracle USA $29,881 38,828 76.96%
2 3 IBM USA $29,286 92,793 31.56%
3 4 SAP Germany $18,777 23,289 80.63%
4 5 Symantec USA $6,138 6,615 92.79%
5 6 EMC USA $5,844 NaN 23.91%
6 7 VMware USA $5,520 6,035 91.47%
7 8 HP USA $5,082 110,577 4.60%
8 9 Salesforce.com USA $4,820 5,274 91.40%
9 10 Intuit USA $4,324 4,573 94.55%
. is a special character in regex that represents any character. You need to escape it so the regex treats it as a literal character:
df['Total_revenue'].replace(r'\.', np.nan, regex=True)
Out[52]:
0 93,456
1 38,828
2 92,793
3 23,289
4 6,615
5 NaN
6 6,035
7 110,577
8 5,274
9 4,573
Name: Total_revenue, dtype: object
In your case, you should use mask:
df['Total_revenue'].mask(df['Total_revenue'].eq('.'))
Out[58]:
0 93,456
1 38,828
2 92,793
3 23,289
4 6,615
5 NaN
6 6,035
7 110,577
8 5,274
9 4,573
Name: Total_revenue, dtype: object
I went one step further here and changed the column type to numeric, so you can also use it for calculations.
df.Total_revenue = pd.to_numeric(df.Total_revenue.str.replace(',', ''), errors='coerce').astype('float')
df.Total_revenue
0 93456.0
1 38828.0
2 92793.0
3 23289.0
4 6615.0
5 NaN
6 6035.0
7 110577.0
8 5274.0
9 4573.0
Name: Total_revenue, dtype: float64
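Once the column is numeric it can be used in arithmetic directly; for example, recomputing the revenue share (a sketch that applies the same cleanup to Software_revenue):
# Strip '$' and ',' from Software_revenue, then compute the software share of total revenue
software = pd.to_numeric(df.Software_revenue.str.replace(r'[$,]', '', regex=True), errors='coerce')
print((software / df.Total_revenue * 100).round(2))  # NaN where Total_revenue is NaN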
In my opinion, replace is not required, since the user wanted to change the whole "." value to NaN. Instead, this will also work: it finds the rows equal to "." and assigns NaN to them.
df.loc[df['Total_revenue'] == ".", 'Total_revenue'] = np.nan
You can try the following to apply your requirement to the whole DataFrame:
df.replace('.', np.nan)
Or, if you want to do it for a specific column, use df['Total_revenue'] instead of df.
Below is the output:
Rank Company Company_HQ Software_revenue Total_revenue Percent_revenue_total
0 1 Microsoft USA $62,014 93,456 66.36%
1 2 Oracle USA $29,881 38,828 76.96%
2 3 IBM USA $29,286 92,793 31.56%
3 4 SAP Germany $18,777 23,289 80.63%
4 5 Symantec USA $6,138 6,615 92.79%
5 6 EMC USA $5,844 NaN 23.91%
6 7 VMware USA $5,520 6,035 91.47%
7 8 HP USA $5,082 110,577 4.60%
8 9 Salesforce.com USA $4,820 5,274 91.40%
9 10 Intuit USA $4,324 4,573 94.55%
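For the column-specific variant mentioned above, a minimal sketch (Series.replace defaults to regex=False, so '.' is matched as a literal, whole value):
df['Total_revenue'] = df['Total_revenue'].replace('.', np.nan)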
I am new to Python and I have a dataframe that needs a bit of complicated reshaping. It is best described with an example using dummy data.
The original dataframe is:
testdata = [('State', ['CA', 'FL', 'ON']),
('Country', ['US', 'US', 'CAN']),
('a1', [0.059485629, 0.968962817, 0.645435903]),
('b2', [0.336665658, 0.404398227, 0.333113735]),
('Test', ['Test1', 'Test2', 'Test3']),
('d', [20, 18, 24]),
('e', [21, 16, 25]),
]
df = pd.DataFrame(dict(testdata))  # pd.DataFrame.from_items was removed in recent pandas
The dataframe I am after is:
testdata2 = [('State', ['CA', 'CA', 'FL', 'FL', 'ON', 'ON']),
('Country', ['US', 'US', 'US', 'US', 'CAN', 'CAN']),
('Test', ['Test1', 'Test1', 'Test2', 'Test2', 'Test3', 'Test3']),
('Measurements', ['a1', 'b2', 'a1', 'b2', 'a1', 'b2']),
('Values', [0.059485629, 0.336665658, 0.968962817, 0.404398227, 0.645435903, 0.333113735]),
('Steps', [20, 21, 18, 16, 24, 25]),
]
dfn = pd.DataFrame(dict(testdata2))
It looks like the solution likely requires melt, stack, and a MultiIndex, but I am not sure how to bring those together.
Any suggested solutions will be greatly appreciated.
Thank you.
Let's try:
df1 = df.melt(id_vars=['State', 'Country', 'Test'], value_vars=['a1', 'b2'], value_name='Values', var_name='Measurements')
df2 = df.melt(id_vars=['State', 'Country', 'Test'], value_vars=['d', 'e'], value_name='Steps').drop('variable', axis=1)
# both melts preserve row order, so the indexes line up and a join on index suffices
df1.join(df2['Steps'])
Output:
State Country Test Measurements Values Steps
0 CA US Test1 a1 0.059486 20
1 FL US Test2 a1 0.968963 18
2 ON CAN Test3 a1 0.645436 24
3 CA US Test1 b2 0.336666 21
4 FL US Test2 b2 0.404398 16
5 ON CAN Test3 b2 0.333114 25
Or use @JohnGalt's solution (note the concat keeps both melts' id and variable/value columns side by side, so some column cleanup is still needed):
pd.concat([pd.melt(df, id_vars=['State', 'Country', 'Test'], value_vars=x) for x in [['d', 'e'], ['a1', 'b2']]], axis=1)
There is a way to do this using pd.wide_to_long, but you must rename your columns so that the Measurements column contains the correct values:
df1 = df.rename(columns={'a1': 'Values_a1', 'b2': 'Values_b2', 'd': 'Steps_a1', 'e': 'Steps_b2'})
pd.wide_to_long(df1,
                stubnames=['Values', 'Steps'],
                i=['State', 'Country', 'Test'],
                j='Measurements',
                sep='_',
                suffix='.+').reset_index()  # suffix is a regex; '.+' matches the two-character 'a1'/'b2'
State Country Test Measurements Values Steps
0 CA US Test1 a1 0.059486 20
1 CA US Test1 b2 0.336666 21
2 FL US Test2 a1 0.968963 18
3 FL US Test2 b2 0.404398 16
4 ON CAN Test3 a1 0.645436 24
5 ON CAN Test3 b2 0.333114 25
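Since the question also mentions stack and a MultiIndex, here is an equivalent sketch using a column MultiIndex (the tuple labels mirror the rename above):
# Same reshape via a column MultiIndex and stack
df1 = df.set_index(['State', 'Country', 'Test'])
df1.columns = pd.MultiIndex.from_tuples(
    [('Values', 'a1'), ('Values', 'b2'), ('Steps', 'a1'), ('Steps', 'b2')],
    names=[None, 'Measurements'])
print(df1.stack('Measurements').reset_index())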