Pandas -change id to string using map and lambda - pandas

with this dictionary:
teams_dict = {'Tottenham':262, 'Liverpool': 263, 'Leeds': 264}
And having a column 'team_id' in df1 and a column 'team_name' in another df2, how do I change id to name in df1, using map() and lambda in pandas?
I've tried:
df1['team_name'] = df2['team_name'].map(lambda x: teams_dict[x])
But it does not work.

Try using apply instead of map:
df1['team_name'] = df2['team_name'].apply(lambda x: teams_dict[x])
It's worth mentioning that this is index-agnostic and only works if df1 and df2 are the same size, and are sorted the same way.

Related

Convert column names to string in pandas dataframe

Can somebody explain how this works?
df.columns = list(map(str, df.columns))
Your code is not the best way to convert column names to string, use instead:
df.columns = df.columns.astype(str)
Your code:
df.columns = list(map(str, df.columns))
is equivalent to:
df.columns = [str(col) for col in df.columns]
map: for each item in df.columns (iterable), apply the function str on it but the map function returns an iterator, so you need to explicitly execute list to generate the list.

groupby with transform minmax

for every city , I want to create a new column which is minmax scalar of another columns (age).
I tried this an get Input contains infinity or a value too large for dtype('float64').
cols=['age']
def f(x):
scaler1=preprocessing.MinMaxScaler()
x[['age_minmax']] = scaler1.fit_transform(x[cols])
return x
df = df.groupby(['city']).apply(f)
From the comments:
df['age'].replace([np.inf, -np.inf], np.nan, inplace=True)
Or
df['age'] = df['age'].replace([np.inf, -np.inf], np.nan)

Transform list of pyspark rows into pandas data frame through a dictionary

I am trying to convert a list of PySpark sorted rows to a Pandas data frame using dictionary comprehension but only works when explicitly stating the key and value of the desired dictionary.
row_list = sorted(data, key=lambda row: row['date'])
future_df = {'key': int(key),
'date': map(lambda row: row["date"], row_list),
'col1': map(lambda row: row["col1"], row_list),
'col2': map(lambda row: row["col2"], row_list)}
And then converting it to Pandas with:
pd.DataFrame(future_df)
This operation is to be found inside the class ForecastByKey invoked by:
rdd = df.select('*')
.rdd \
.map(lambda row: ((row['key']), row)) \
.groupByKey() \
.map(lambda args: spark_ops.run(args[0], args[1]))
Up to this point, everything works fine; meaning explicitly indicating the columns inside the dictionary future_df.
The problem arises when trying to convert the whole set of columns (700+) with something like:
future_df = {'key': int(key),
'date': map(lambda row: row["date"], row_list)}
for col_ in columns:
future_df[col_] = map(lambda row: row[col_], row_list)
pd.DataFrame(future_df)
Where columns contains the name of each coumn passed to the ForecastByKey class.
The result of this operation is a data frame with empty or close-to-zero columns.
I am using Python 3.6.10 and PySpark 2.4.5
How is this iteration to be done in order to get a data frame with the right information?
After some research, I realized this can be solved with:
row_list = sorted(data, key=lambda row: row['date'])
def f(x):
return map(lambda row: row[x], row_list)
pre_df = {col_: col_ for col_ in self.sdf_cols}
future_df = toolz.valmap(f, pre_df)
future_df['key'] = int(key)

How to merge 4 dataframes into one

I have created a functions that returns a dataframe.Now, i want merge all dataframe into one. First, i called all the function and used reduce and merge function.It did not work as expected.The error i am getting is "cannot combine function.It should be dataframe or series.I checked the type of my df,it is dataframe not functions. I don't know where the error is coming from.
def func1():
return df1
def func2():
return df2
def func3():
return df3
def func4():
return df4
def alldfs():
df_1 = func1()
df_2 = func2()
df_3 = func3()
df_4 = func4()
result = reduce(lambda df_1,d_2,df_3,df_4: pd.merge(df_1,df_2,df_3,df_4,on ="EMP_ID"),[df1,df2,df3,df4)
print(result)
You could try something like this ( assuming that EMP_ID is common across all dataframes and you want the intersection of all dataframes ) -
result = pd.merge(df1, df2, on='EMP_ID').merge(df3, on='EMP_ID').merge(df4, on='EMP_ID')

Quantile across rows and down columns using selected columns only [duplicate]

I have a dataframe with column names, and I want to find the one that contains a certain string, but does not exactly match it. I'm searching for 'spike' in column names like 'spike-2', 'hey spike', 'spiked-in' (the 'spike' part is always continuous).
I want the column name to be returned as a string or a variable, so I access the column later with df['name'] or df[name] as normal. I've tried to find ways to do this, to no avail. Any tips?
Just iterate over DataFrame.columns, now this is an example in which you will end up with a list of column names that match:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
spike_cols = [col for col in df.columns if 'spike' in col]
print(list(df.columns))
print(spike_cols)
Output:
['hey spke', 'no', 'spike-2', 'spiked-in']
['spike-2', 'spiked-in']
Explanation:
df.columns returns a list of column names
[col for col in df.columns if 'spike' in col] iterates over the list df.columns with the variable col and adds it to the resulting list if col contains 'spike'. This syntax is list comprehension.
If you only want the resulting data set with the columns that match you can do this:
df2 = df.filter(regex='spike')
print(df2)
Output:
spike-2 spiked-in
0 1 7
1 2 8
2 3 9
This answer uses the DataFrame.filter method to do this without list comprehension:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6]}
df = pd.DataFrame(data)
print(df.filter(like='spike').columns)
Will output just 'spike-2'. You can also use regex, as some people suggested in comments above:
print(df.filter(regex='spike|spke').columns)
Will output both columns: ['spike-2', 'hey spke']
You can also use df.columns[df.columns.str.contains(pat = 'spike')]
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
colNames = df.columns[df.columns.str.contains(pat = 'spike')]
print(colNames)
This will output the column names: 'spike-2', 'spiked-in'
More about pandas.Series.str.contains.
# select columns containing 'spike'
df.filter(like='spike', axis=1)
You can also select by name, regular expression. Refer to: pandas.DataFrame.filter
df.loc[:,df.columns.str.contains("spike")]
Another solution that returns a subset of the df with the desired columns:
df[df.columns[df.columns.str.contains("spike|spke")]]
You also can use this code:
spike_cols =[x for x in df.columns[df.columns.str.contains('spike')]]
Getting name and subsetting based on Start, Contains, and Ends:
# from: https://stackoverflow.com/questions/21285380/find-column-whose-name-contains-a-specific-string
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
# from: https://cmdlinetips.com/2019/04/how-to-select-columns-using-prefix-suffix-of-column-names-in-pandas/
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html
import pandas as pd
data = {'spike_starts': [1,2,3], 'ends_spike_starts': [4,5,6], 'ends_spike': [7,8,9], 'not': [10,11,12]}
df = pd.DataFrame(data)
print("\n")
print("----------------------------------------")
colNames_contains = df.columns[df.columns.str.contains(pat = 'spike')].tolist()
print("Contains")
print(colNames_contains)
print("\n")
print("----------------------------------------")
colNames_starts = df.columns[df.columns.str.contains(pat = '^spike')].tolist()
print("Starts")
print(colNames_starts)
print("\n")
print("----------------------------------------")
colNames_ends = df.columns[df.columns.str.contains(pat = 'spike$')].tolist()
print("Ends")
print(colNames_ends)
print("\n")
print("----------------------------------------")
df_subset_start = df.filter(regex='^spike',axis=1)
print("Starts")
print(df_subset_start)
print("\n")
print("----------------------------------------")
df_subset_contains = df.filter(regex='spike',axis=1)
print("Contains")
print(df_subset_contains)
print("\n")
print("----------------------------------------")
df_subset_ends = df.filter(regex='spike$',axis=1)
print("Ends")
print(df_subset_ends)