How to combine parts of different columns in a DataFrame? - pandas

Say I have a DataFrame:
foo bar
0 1998 abc
1 1999 xyz
2 2000 123
What is the best way to combine the first two (or any n) characters of foo with the last two of bar to make a third column (19bc, 19yz, 2023)? Normally, to combine columns I'd simply do
df['foobar'] = df['foo'] + df['bar']
but I believe I can't do slicing on these objects.

If you convert your column to string with astype, you can use the str accessor and slice the values:
df['foobar'] = df['foo'].astype(str).str[:2] + df['bar'].str[-2:]
print(df)
# Output
foo bar foobar
0 1998 abc 19bc
1 1999 xyz 19yz
2 2000 123 2023
Note that astype(str) isn't needed when the column already has object (string) dtype, as the bar column does here.
You can read: Working with text data
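For reference, a minimal self-contained sketch of this str-accessor approach, using the sample data from the question:
import pandas as pd

df = pd.DataFrame({"foo": [1998, 1999, 2000], "bar": ["abc", "xyz", "123"]})
# First two characters of foo (cast to str) plus the last two characters of bar
df["foobar"] = df["foo"].astype(str).str[:2] + df["bar"].str[-2:]
print(df)
#     foo  bar foobar
# 0  1998  abc   19bc
# 1  1999  xyz   19yz
# 2  2000  123   2023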

Example:
import pandas as pd
df = pd.DataFrame({"foo": [1998, 1999, 2000], "bar": ["abc", "xyz", "123"]})
df["foobar"] = df.apply(lambda x: str(x["foo"])[:2] + x["bar"][-2:], axis=1)
print(df)
# foo bar foobar
# 0 1998 abc 19bc
# 1 1999 xyz 19yz
# 2 2000 123 2023

Related

Transform a dataframe with comma thousand separator to space separator Pandas

I have an issue with formatting in Pandas. I have a column in a DataFrame which contains numbers with a comma separator, like (200,000), and I would like to transform this into (200 000).
The easy way is to use the replace function, but I also want to convert the type to integer, which doesn't work because of the spaces in between.
In the end, I just want to do a ranking with descending sorted values like this:
Id  Villas  Price_nospace
 3  Peace        35000000
 3  Peace        35000000
 2  Rosa         27000000
 1  Beach        25000000
 0  Palm         22000000
As you can see, it's not easy to read the price without a separator, so I would like to make the price more readable. But when I have a space separator I can't convert to int, and if I don't convert to integer I can't sort the values properly with the sort_values function. So I am stuck.
Thank you for your help.
I modified the sample input a bit so there is something to sort (descending) in the output.
The solution below sorts the dataframe by Price_nospace in descending order and replaces the comma with a space, but Price_nospace will be of object dtype in the output.
Sample Input
Id Villas Price_nospace
0 3 Peace 220,000
1 3 Peace 350,000
2 2 Rosa 270,000
3 1 Beach 250,000
4 0 Palm 230,000
Code
df['Price_new'] = df['Price_nospace'].str.replace(',','',regex=True).astype(int)
df = df.sort_values(by='Price_new', ascending=False)
df['Price_nospace'] = df['Price_nospace'].str.replace(',',' ',regex=True)
df = df.drop(columns='Price_new').reset_index(drop=True) ## reset_index, if required
df
Output
Id Villas Price_nospace
0 3 Peace 350 000
1 2 Rosa 270 000
2 1 Beach 250 000
3 0 Palm 230 000
4 3 Peace 220 000
Explanation
Introduced a new column Price_new to convert the Price_nospace values to int and sort by them.
Once df is sorted, replaced the comma with a space in Price_nospace and dropped the temporary column Price_new.
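As a side note, on pandas 1.1+ the helper column can be avoided because sort_values accepts a key callable; a sketch of the same idea:
# Sort by the numeric value of Price_nospace without a temporary column
df = df.sort_values(by='Price_nospace',
                    key=lambda s: s.str.replace(',', '', regex=True).astype(int),
                    ascending=False).reset_index(drop=True)
df['Price_nospace'] = df['Price_nospace'].str.replace(',', ' ', regex=True)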
Another option is to alter how the data is displayed without affecting the underlying type.
Use pd.options.display.float_format after converting the str prices to float prices:
import pandas as pd
def my_float_format(x):
    '''
    Number formatting with custom thousands separator
    '''
    return f'{x:,.0f}'.replace(',', ' ')
# set display float_format
pd.options.display.float_format = my_float_format
df = pd.DataFrame({
    'Id': [3, 3, 2, 1, 0],
    'Villas': ['Peace', 'Peace', 'Rosa', 'Beach', 'Palm'],
    'Price_nospace': ['35,000,000', '35,000,000', '27,000,000',
                      '25,000,000', '22,000,000']
})
# Convert str prices to float
df['Price_nospace'] = (
    df['Price_nospace'].str.replace(',', '', regex=True).astype(float)
)
Output:
print(df)
Id Villas Price_nospace
0 3 Peace 35 000 000
1 3 Peace 35 000 000
2 2 Rosa 27 000 000
3 1 Beach 25 000 000
4 0 Palm 22 000 000
print(df.dtypes)
Id int64
Villas object
Price_nospace float64
dtype: object
Since the type is float64 any numeric operations will function as normal.
The same my_float_format function can be used on export as well:
df.to_csv(float_format=my_float_format)
,Id,Villas,Price_nospace
0,3,Peace,35 000 000
1,3,Peace,35 000 000
2,2,Rosa,27 000 000
3,1,Beach,25 000 000
4,0,Palm,22 000 000

Pandas Create Variability DF with Multiple Row Averages in Different DF

I've been trying to create a column of variability given the mean of the column data values for 'A' and 'B' below. I don't understand how to divide each data value in the pandas column, element-wise, by the corresponding long-term average. For example, imagine I have data that looks like this in pandas df1:
Year Name Data
1999 A 2
2000 A 4
1999 B 6
2000 B 8
And I have a DF with the long-term means, called "LTmean", which in this case are 3 and 7.
mean_df =
  Name  Data mean
0    A          3
1    B          7
So, the result of something like dfnew['Var'] = df1['Data'] / mean_df[???] - 1 would look like this for a new df:
Year Name Var
1999 A -0.3
2000 A 0.3
1999 B -0.14
2000 B 0.14
Thank you for any suggestions on this! Would a loop be the best idea, looping through each column by the 'Name' in each DF somehow?
df['Var'] = df1['Data']/LTmean - 1
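For the division to line up per Name, the long-term mean has to be aligned to each row first. A minimal sketch, assuming mean_df has the Name and Data mean columns shown in the question:
import pandas as pd

df1 = pd.DataFrame({'Year': [1999, 2000, 1999, 2000],
                    'Name': ['A', 'A', 'B', 'B'],
                    'Data': [2, 4, 6, 8]})
mean_df = pd.DataFrame({'Name': ['A', 'B'], 'Data mean': [3, 7]})

# Map each row's Name to its long-term mean, then apply the formula
LTmean = df1['Name'].map(mean_df.set_index('Name')['Data mean'])
df1['Var'] = df1['Data'] / LTmean - 1
print(df1)
#    Year Name  Data       Var
# 0  1999    A     2 -0.333333
# 1  2000    A     4  0.333333
# 2  1999    B     6 -0.142857
# 3  2000    B     8  0.142857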

Parse Excel file with multi-index in Pandas

What would be the best way to parse the above Excel file in a Pandas dataframe? The idea would be to be able to update data easily, adding columns, dropping lines. For example, for every origin, I would like to keep only output3. Then for every column (2000, ....,2013) divide it by 2 given a condition (say value > 6000) .
Below is what I tried: first to parse and drop the unnecessary lines, but it's not satisfactory, as I had to rename columns manually. So this doesn't look very optimal as a solution. Any better idea?
df = pd.read_excel("myExcel.xlsx", skiprows=2, sheet_name='1')
cols1 = list(df.columns)
cols1 = [str(x)[:4] for x in cols1]
cols2 = list(df.iloc[0,:])
cols2 = [str(x) for x in cols2]
cols = [x + "_" + y for x,y in zip(cols1,cols2)]
df.columns = cols
df = df.drop(["Unna_nan"], axis =1).rename(columns ={'Time_Origine':'Country','Unna_Output' : 'Series','Unna_Ccy' : 'Unit','2000_nan' : '2000','2001_nan': '2001','2002_nan':'2002','2003_nan' : '2003','2004_nan': '2004','2005_nan' : '2005','2006_nan' : '2006','2007_nan' : '2007','2008_nan' : '2008','2009_nan' : '2009','2010_nan' : '2010','2011_nan': '2011','2012_nan' : '2012','2013_nan':'2013','2014_nan':'2014','2015_nan':'2015','2016_nan':'2016','2017_nan':'2017'})
df.drop(0,inplace=True)
df.drop(df.tail(1).index, inplace=True)
idx = ['Country', 'Series', 'Unit']
df = df.set_index(idx)
df = df.query('Series == "Output3"')
Without having an Excel file like this to test on, I think something like the following might work.
In order to only get the rows from output3, you can use the following:
df = pd.read_excel("myExcel.xlsx", skiprows=2, sheet_name='1')
df = df.loc[df['Output'] == 'output3']
Now dividing every cell by 2 if the cell value is greater than 6000 by using pandas apply:
def foo(bar):
    if bar > 6000:
        return bar / 2
    return bar

for col in df.columns:
    try:
        int(col)  # to check if this column is a year
        df[col] = df[col].apply(foo)
    except ValueError:
        pass
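A vectorized alternative (just a sketch, assuming the year columns already hold numeric values) avoids apply entirely:
# Pick the columns whose names are years
year_cols = [col for col in df.columns if str(col).isdigit()]
# Halve only the cells greater than 6000, keep the rest unchanged
df[year_cols] = df[year_cols].mask(df[year_cols] > 6000, df[year_cols] / 2)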
import numpy as np
import pandas as pd

#read first 2 rows to MultiIndex and remove last one
df = pd.read_excel("Excel1.xlsx", skiprows=2, header=[0,1], skipfooter=1)
print (df)
#create helper DataFrame
cols = df.columns.to_frame().reset_index(drop=True)
cols.columns=['a','b']
cols['a'] = pd.to_numeric(cols['a'], errors='ignore')
cols['b'] = cols['b'].replace('Unit.1','tmp', regex=False)
#create new column by condition
cols['c'] = np.where(cols['b'].str.startswith('Unnamed'), cols['a'], cols['b'])
print (cols)
a b c
0 Time Country Country
1 Time Series Series
2 Time Unit Unit
3 Time tmp tmp
4 2000 Unnamed: 4_level_1 2000
5 2001 Unnamed: 5_level_1 2001
6 2002 Unnamed: 6_level_1 2002
7 2003 Unnamed: 7_level_1 2003
8 2004 Unnamed: 8_level_1 2004
9 2005 Unnamed: 9_level_1 2005
10 2006 Unnamed: 10_level_1 2006
11 2007 Unnamed: 11_level_1 2007
12 2008 Unnamed: 12_level_1 2008
13 2009 Unnamed: 13_level_1 2009
14 2010 Unnamed: 14_level_1 2010
15 2011 Unnamed: 15_level_1 2011
16 2012 Unnamed: 16_level_1 2012
17 2013 Unnamed: 17_level_1 2013
18 2014 Unnamed: 18_level_1 2014
19 2015 Unnamed: 19_level_1 2015
20 2016 Unnamed: 20_level_1 2016
21 2017 Unnamed: 21_level_1 2017
#overwrite columns by column c
df.columns = cols['c'].tolist()
#forward filling missing values
df['Country'] = df['Country'].ffill()
df = df.drop('tmp', axis=1).set_index(['Country','Series','Unit'])
print (df)

Issue looping through dataframes in Pandas

I have a dict 'd' set up which is a collection of dataframes, e.g.:
d["DataFrame1"]
Will return that dataframe with all its columns:
ID Name
0 123 John
1 548 Eric
2 184 Sam
3 175 Andy
Each dataframe has a column in it called 'Names'. I want to extract this column from each dataframe in the dict and to create a new dataframe consisting of these columns.
df_All_Names = pd.DataFrame()
for df in d:
    df_All_Names[df] = df['Names']
Returns the error:
TypeError: string indices must be integers
Unsure where I'm going wrong here.
For example, say you have a df as follows:
df=pd.DataFrame({'Name':['X', 'Y']})
df1=pd.DataFrame({'Name':['X1', 'Y1']})
And we create a dict
d=dict()
d['df']=df
d['df1']=df1
Then preset an empty data frame:
yourdf=pd.DataFrame()
Using items with a for loop:
for key, val in d.items():
    yourdf[key] = val['Name']
yields:
yourdf
Out[98]:
df df1
0 X X1
1 Y Y1
You can use reduce to concatenate all of the columns named 'Name' in your dictionary of dataframes.
Sample Data
import pandas as pd
from functools import reduce
d = {'df1':pd.DataFrame({'ID':[0,1,2],'Name':['John','Sam','Andy']}),'df2':pd.DataFrame({'ID':[3,4,5],'Name':['Jen','Cara','Jess']})}
You can stack the data side by side using axis=1
reduce(lambda x,y:pd.concat([x.Name,y.Name],axis=1),d.values())
Name Name
0 John Jen
1 Sam Cara
2 Andy Jess
Or on top of one another using axis=0:
reduce(lambda x,y:pd.concat([x.Name,y.Name],axis=0),d.values())
0 John
1 Sam
2 Andy
0 Jen
1 Cara
2 Jess
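A shorter alternative (a sketch, assuming every dataframe in d has the Name column) is to pass a dict comprehension straight to pd.concat, which labels each column with its key:
yourdf = pd.concat({key: val['Name'] for key, val in d.items()}, axis=1)
print(yourdf)
#   df df1
# 0  X  X1
# 1  Y  Y1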

Using Pandas groupby to calculate many slopes

Some illustrative data in a DataFrame (MultiIndex) format:
|entity| year |value|
+------+------+-----+
|  a   | 1999 |  2  |
|      | 2004 |  5  |
|  b   | 2003 |  3  |
|      | 2007 |  2  |
|      | 2014 |  7  |
I would like to calculate the slope using scipy.stats.linregress for each entity a and b in the above example. I tried using groupby on the first column, following the split-apply-combine advice, but it seems problematic since it's expecting one Series of values (a and b), whereas I need to operate on the two columns on the right.
This is easily done in R via plyr, not sure how to approach it in pandas.
A function can be applied to a groupby with the apply method; the passed function in this case calls linregress and extracts the slope. Please see below:
In [4]: x = pd.DataFrame({'entity':['a','a','b','b','b'],
   ...:                   'year':[1999,2004,2003,2007,2014],
   ...:                   'value':[2,5,3,2,7]})
In [5]: x
Out[5]:
entity value year
0 a 2 1999
1 a 5 2004
2 b 3 2003
3 b 2 2007
4 b 7 2014
In [6]: from scipy.stats import linregress
In [7]: x.groupby('entity').apply(lambda v: linregress(v.year, v.value)[0])
Out[7]:
entity
a 0.600000
b 0.403226
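If the intercept is wanted as well, the lambda can return a pd.Series so that apply expands it into columns; a sketch on the same data:
x.groupby('entity').apply(
    lambda v: pd.Series(linregress(v.year, v.value)[:2],
                        index=['slope', 'intercept']))
# -> DataFrame with one row per entity and columns 'slope' and 'intercept'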
You can do this via the iterator ability of the groupby object. It seems easier to do it by dropping the current index and then grouping by 'entity'.
A list comprehension is then an easy way to quickly work through all the groups in the iterator. Or use a dict comprehension to get the labels in the same place (you can then stick the dict into a pd.DataFrame easily).
import pandas as pd
import scipy.stats
#This is your data
test = pd.DataFrame({'entity':['a','a','b','b','b'],'year':[1999,2004,2003,2007,2014],'value':[2,5,3,2,7]}).set_index(['entity','year'])
#This creates the groups
groupby = test.reset_index().groupby(['entity'])
#Process groups by list comprehension
slopes = [scipy.stats.linregress(group.year, group.value)[0] for name, group in groupby]
#Process groups by dict comprehension
slopes = {name:[scipy.stats.linregress(group.year, group.value)[0]] for name, group in groupby}
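As noted above, the dict can then be dropped into a DataFrame, for example (a sketch):
#Dict of one-element lists -> one row per entity
slopes_df = pd.DataFrame(slopes, index=['slope']).T
print(slopes_df)
#       slope
# a  0.600000
# b  0.403226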