pandas groupby and agg

Here are my data:
df1 = pd.read_csv('gp_data.csv',sep= ";")
df2 = pd.read_csv('ms_data.csv',sep= ";")
#2.7.2
student_data = pd.concat([df1, df2])
#2.7.3
df3 = pd.read_csv('gp_grades.csv',sep= ";")
df4 = pd.read_csv('ms_grades.csv',sep= ";")
#2.7.4
student_grades = pd.concat([df3, df4])
#2.7.5
student = pd.merge(left=student_grades, right=student_data, left_on='student_id', right_on='student_id')
Q: Based on the address and guardian, do the following:
Define a function and use agg to apply the function to the traveltime column.
Find out the sum of the travel time for each group.
Do not use the pandas-provided function for the answer, but you may use it to check.
Here is my code:
#1
def travel(series):
    return sum(series == True)
print(student.groupby(['address','guardian']).agg({'traveltime' : travel}))
#2
print(student.agg({'traveltime':'sum'}))
In fact, I don't know what kind of function it needs, and I'm not sure whether the second requirement is satisfied by print(student.agg({'traveltime':'sum'})).
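One way to satisfy the "do not use the pandas-provided function" constraint is a hand-rolled accumulator passed to agg. This is a minimal sketch, assuming traveltime is a numeric column in the merged student frame built above (travel_sum is just an illustrative name):
def travel_sum(series):
    # accumulate manually instead of using the built-in 'sum' aggregation
    total = 0
    for value in series:
        total += value
    return total
manual = student.groupby(['address', 'guardian']).agg({'traveltime': travel_sum})
check = student.groupby(['address', 'guardian']).agg({'traveltime': 'sum'})  # pandas version, for checking only
print(manual)
print(manual.equals(check))  # True if the manual aggregation matches
Note that the grouping has to stay in the groupby call; student.agg({'traveltime':'sum'}) from #2 sums the whole column without grouping, so it does not answer "for each group" on its own.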

Dataframe loc with multiple string value conditions

Hi, given this dataframe, is it possible to fetch the Number value associated with certain conditions using df.loc? This is what I came up with so far.
if df.loc[(df["Tags"]=="Brunei") & (df["Type"]=="Host"),"Number"]:
I want the output to be 1. Is this the correct way to do it?
You're on the right track, but you have to append .values[0] to the end of the .loc statement to extract the single value from the resulting pandas Series.
import pandas as pd
df = pd.DataFrame({
    'Tags': ['Brunei', 'China'],
    'Type': ['Host', 'Address'],
    'Number': [1, 1192]
})
display(df)  # display() is available in Jupyter/IPython; use print(df) elsewhere
# .loc with a boolean mask returns a Series, even when only one row matches
series = df.loc[(df["Tags"]=="Brunei") & (df["Type"]=="Host"), "Number"]
print(type(series))  # <class 'pandas.core.series.Series'>
# .values[0] pulls the single underlying value out of that Series
value = df.loc[(df["Tags"]=="Brunei") & (df["Type"]=="Host"), "Number"].values[0]
print(type(value))  # <class 'numpy.int64'>
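One caveat worth adding: .values[0] raises an IndexError when the filter matches no rows, so if the mask might come up empty, guard with .empty first. A small sketch:
match = df.loc[(df["Tags"]=="Brunei") & (df["Type"]=="Host"), "Number"]
if not match.empty:
    value = match.values[0]
else:
    value = None  # or whatever default fits your case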

How to merge 4 dataframes into one

I have created functions that each return a dataframe. Now, I want to merge all the dataframes into one. First, I called all the functions and used reduce with the merge function. It did not work as expected. The error I am getting is "cannot combine function. It should be dataframe or series". I checked the type of my df; it is a dataframe, not a function. I don't know where the error is coming from.
def func1():
    return df1
def func2():
    return df2
def func3():
    return df3
def func4():
    return df4
def alldfs():
    df_1 = func1()
    df_2 = func2()
    df_3 = func3()
    df_4 = func4()
    result = reduce(lambda df_1, d_2, df_3, df_4: pd.merge(df_1, df_2, df_3, df_4, on="EMP_ID"), [df1, df2, df3, df4])
    print(result)
You could try something like this (assuming that EMP_ID is common across all dataframes and you want the intersection of them all):
result = pd.merge(df1, df2, on='EMP_ID').merge(df3, on='EMP_ID').merge(df4, on='EMP_ID')
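If you want to keep the reduce approach from the question, two things need fixing: the "cannot combine function" error usually means the list contained the function objects themselves (func1) rather than their return values (func1()), and the lambda passed to reduce must take exactly two arguments, because reduce folds the list pairwise. A sketch under those assumptions:
from functools import reduce
import pandas as pd
dfs = [func1(), func2(), func3(), func4()]  # call the functions to get DataFrames, not the functions themselves
result = reduce(lambda left, right: pd.merge(left, right, on="EMP_ID"), dfs)
print(result)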

df.groupby('columns').apply(''.join()), join all the cells to a string

This is for a junior data processor. In the past, I've tried many ways.
import pandas as pd
data = {'key': ['a','b','c','a','b','c','a'],
        'profit': [12,3,4,5,6,7,9],
        'income': ['j','d','d','g','d','t','d']}
df = pd.DataFrame(data)
df = df.set_index('key')
# df2 is the expected result
data2 = {'a': ['12j5g9d'], 'b': ['3d6d'], 'c': ['4d7t']}
df2 = pd.DataFrame(data2).T  # transpose so the keys form the index
df2.index.name = 'key'
Here's a simple solution: first convert the integers to strings and concatenate profit and income, then concatenate all the strings under the same key:
data = {'key': ['a','b','c','a','b','c','a'],
        'profit': [12,3,4,5,6,7,9],
        'income': ['j','d','d','g','d','t','d']}
df = pd.DataFrame(data)
df['profit_income'] = df['profit'].apply(str) + df['income']
res = df.groupby('key')['profit_income'].agg(''.join)
print(res)
output:
key
a 12j5g9d
b 3d6d
c 4d7t
Name: profit_income, dtype: object
This question can be solved in a couple of different ways:
First add an extra column by concatenating the profit and income columns.
import pandas as pd
data = {'key': ['a','b','c','a','b','c','a'],
        'profit': [12,3,4,5,6,7,9],
        'income': ['j','d','d','g','d','t','d']}
df = pd.DataFrame(data)
df = df.set_index('key')
df['profinc'] = df['profit'].astype(str) + df['income']
1) Using sum
df2 = df.groupby('key').profinc.sum()
2) Using apply and join
df2 = df.groupby('key').profinc.apply(''.join)
Results from both of the above would be the same:
key
a 12j5g9d
b 3d6d
c 4d7t
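If you'd rather skip the helper column entirely, the groupby/apply/join idea from the title can be applied to each group directly; a sketch, using the same df (with key as the index) built above:
res = df.groupby('key').apply(lambda g: ''.join(g['profit'].astype(str) + g['income']))
print(res)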

Quantile across rows and down columns using selected columns only [duplicate]

I have a dataframe with column names, and I want to find the one that contains a certain string, but does not exactly match it. I'm searching for 'spike' in column names like 'spike-2', 'hey spike', 'spiked-in' (the 'spike' part is always continuous).
I want the column name to be returned as a string or a variable, so I access the column later with df['name'] or df[name] as normal. I've tried to find ways to do this, to no avail. Any tips?
Just iterate over DataFrame.columns; here is an example in which you end up with a list of the column names that match:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
spike_cols = [col for col in df.columns if 'spike' in col]
print(list(df.columns))
print(spike_cols)
Output:
['spike-2', 'hey spke', 'spiked-in', 'no']
['spike-2', 'spiked-in']
Explanation:
df.columns returns an Index of column names
[col for col in df.columns if 'spike' in col] iterates over df.columns with the variable col and keeps col if it contains 'spike'. This syntax is a list comprehension.
If you only want the resulting data set with the columns that match you can do this:
df2 = df.filter(regex='spike')
print(df2)
Output:
spike-2 spiked-in
0 1 7
1 2 8
2 3 9
This answer uses the DataFrame.filter method to do this without a list comprehension:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6]}
df = pd.DataFrame(data)
print(df.filter(like='spike').columns)
This will output just 'spike-2'. You can also use a regex, as some people suggested in the comments above:
print(df.filter(regex='spike|spke').columns)
This will output both columns: ['spike-2', 'hey spke']
You can also use df.columns[df.columns.str.contains(pat = 'spike')]
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
colNames = df.columns[df.columns.str.contains(pat = 'spike')]
print(colNames)
This will output the column names: 'spike-2', 'spiked-in'
More about pandas.Series.str.contains.
# select columns containing 'spike'
df.filter(like='spike', axis=1)
You can also select by name or regular expression; refer to pandas.DataFrame.filter.
df.loc[:,df.columns.str.contains("spike")]
Another solution that returns a subset of the df with the desired columns:
df[df.columns[df.columns.str.contains("spike|spke")]]
You can also use this code:
spike_cols = [x for x in df.columns[df.columns.str.contains('spike')]]
Getting name and subsetting based on Start, Contains, and Ends:
# from: https://stackoverflow.com/questions/21285380/find-column-whose-name-contains-a-specific-string
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
# from: https://cmdlinetips.com/2019/04/how-to-select-columns-using-prefix-suffix-of-column-names-in-pandas/
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html
import pandas as pd
data = {'spike_starts': [1,2,3], 'ends_spike_starts': [4,5,6], 'ends_spike': [7,8,9], 'not': [10,11,12]}
df = pd.DataFrame(data)
print("\n")
print("----------------------------------------")
colNames_contains = df.columns[df.columns.str.contains(pat = 'spike')].tolist()
print("Contains")
print(colNames_contains)
print("\n")
print("----------------------------------------")
colNames_starts = df.columns[df.columns.str.contains(pat = '^spike')].tolist()
print("Starts")
print(colNames_starts)
print("\n")
print("----------------------------------------")
colNames_ends = df.columns[df.columns.str.contains(pat = 'spike$')].tolist()
print("Ends")
print(colNames_ends)
print("\n")
print("----------------------------------------")
df_subset_start = df.filter(regex='^spike',axis=1)
print("Starts")
print(df_subset_start)
print("\n")
print("----------------------------------------")
df_subset_contains = df.filter(regex='spike',axis=1)
print("Contains")
print(df_subset_contains)
print("\n")
print("----------------------------------------")
df_subset_ends = df.filter(regex='spike$',axis=1)
print("Ends")
print(df_subset_ends)

Pandas rename columns with wildcard

My df looks like this:
Datum Zeit Temperatur[°C] Luftdruck Windgeschwindigkeit[m/s] Windrichtung[Grad] Relative Luftfeuchtigkeit[%] Globalstrahlung[W/m²]
Now I want to rename the columns like this:
wetterdaten.rename(columns={'Temperatur%': 'Temperatur', 'Luftdruck[hPa]': 'Luftdruck'}, inplace=True)
Where % is a wildcard.
But of course it will not work like this.
The beginning of the column name is always the same in the log data,
but the ending changes over time.
You can filter the columns and fetch the name:
wetterdaten.rename(columns={wetterdaten.filter(regex='Temperatur.*').columns[0]: 'Temperatur',
                            wetterdaten.filter(regex='Luftdruck.*').columns[0]: 'Luftdruck'},
                   inplace=True)
You can use replace with a dict; for a wildcard use .*, and for the start of the string use ^. Note that in a regex, the brackets in Luftdruck[hPa] form a character class rather than matching literally, so use a wildcard pattern there too:
d = {'^Temperatur.*': 'Temperatur', '^Luftdruck.*': 'Luftdruck'}
df.columns = df.columns.to_series().replace(d, regex=True)
Sample:
cols = ['Datum', 'Zeit', 'Temperatur[°C]', 'Luftdruck', 'Windgeschwindigkeit[m/s]',
        'Windrichtung[Grad]', 'Relative Luftfeuchtigkeit[%]', 'Globalstrahlung[W/m²]']
df = pd.DataFrame(columns=cols)
print (df)
Empty DataFrame
Columns: [Datum, Zeit, Temperatur[°C], Luftdruck, Windgeschwindigkeit[m/s],
Windrichtung[Grad], Relative Luftfeuchtigkeit[%], Globalstrahlung[W/m²]]
Index: []
d = {'^Temperatur.*': 'Temperatur', 'Luftdruck.*': 'Luftdruck'}
df.columns = df.columns.to_series().replace(d, regex=True)
print (df)
Empty DataFrame
Columns: [Datum, Zeit, Temperatur, Luftdruck, Windgeschwindigkeit[m/s],
Windrichtung[Grad], Relative Luftfeuchtigkeit[%], Globalstrahlung[W/m²]]
Index: []
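An equivalent approach, for what it's worth, skips to_series and rewrites the Index directly with str.replace (regex=True); a sketch using the same patterns:
df.columns = df.columns.str.replace(r'^Temperatur.*', 'Temperatur', regex=True)
df.columns = df.columns.str.replace(r'^Luftdruck.*', 'Luftdruck', regex=True)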
You may prepare a function for renaming your columns:
def rename_columns(old_name):
    if old_name.startswith('Temperatur'):
        new_name = 'Temperatur'  # or whatever you want; may be another function call
    elif old_name.startswith('Luftdruck'):
        new_name = 'Luftdruck'
    else:
        new_name = old_name
    return new_name
and then use the .rename() method with that function as a parameter:
wetterdaten.rename(columns=rename_columns, inplace=True)
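A quick sanity check of that function on a made-up mini-frame (column names invented just for illustration):
df = pd.DataFrame(columns=['Datum', 'Temperatur[°C]', 'Luftdruck[hPa]'])
df.rename(columns=rename_columns, inplace=True)
print(list(df.columns))  # ['Datum', 'Temperatur', 'Luftdruck']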
Also, you can try the code below, which renames the #Item Code column to Item Name unconditionally. Note that rename is a DataFrame method, so it must be called on your dataframe, not on pd itself.
Code:
df.rename(columns={'#Item Code': 'Item Name'}, inplace=True)