How can I combine same-named columns into one in a pandas dataframe so all the columns are unique? - pandas

I have a dataframe that looks like this:
In [268]: dft.head()
Out[268]:
ticker BYND UBER UBER UBER ... ZM ZM BYND ZM
0 analyst worlds uber revenue ... company owning pet things
1 moskow apac note uber ... try things humanization users
2 growth anheuserbusch growth target ... postipo unicorn products revenue
3 stock kong analysts raised ... software revenue things million
4 target uberbeating stock rising ... earnings million pets direct
[5 rows x 500 columns]
In [269]: dft.columns.unique()
Out[269]: Index(['BYND', 'UBER', 'LYFT', 'SPY', 'WORK', 'CRWD', 'ZM'], dtype='object', name='ticker')
How do I combine the columns so there is only a single unique column name for each ticker?

Maybe you should try making a copy of the column you wish to join then extend the first column with the copy you have.

Code:
First convert all the column names to a single case (lower or upper) so that there is no mismatch in header case.
import pandas as pd

def merge_(df):
    '''Return the data-frame with columns that share the same lowercase name merged'''
    # Unique lowercase column names, in order of first appearance
    columns = list(dict.fromkeys(map(str.lower, df.columns)))
    # Start from empty strings so that '+' concatenates the words
    df1 = pd.DataFrame('', index=df.index, columns=columns)
    # Merge the matching columns; iterate by position because df[col]
    # returns a DataFrame when the column name is repeated
    for i, col in enumerate(df.columns):
        df1[col.lower()] += df.iloc[:, i]  # words are in str format so '+' will concatenate
    return df1
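A possible alternative for the question above, sketched under the assumption that every cell holds a string and that words from same-named columns should be joined with a space. It transposes the frame, groups the rows by ticker name, and joins each group. The helper name combine_duplicate_columns and the small sample frame are hypothetical, not part of the original post.

```python
import pandas as pd

def combine_duplicate_columns(df, sep=' '):
    """Join the values of same-named columns row by row (hypothetical helper)."""
    # After transposing, duplicate column names become duplicate index labels,
    # so groupby(level=0) collects all rows that share a ticker.
    return df.T.groupby(level=0).agg(sep.join).T

# Hypothetical sample with a repeated 'UBER' column
dft = pd.DataFrame([['analyst', 'worlds', 'uber'],
                    ['moskow', 'apac', 'note']],
                   columns=['BYND', 'UBER', 'UBER'])
combined = combine_duplicate_columns(dft)
print(combined)
```

The result has one column per unique ticker. Note that groupby sorts the ticker names, so the column order may differ from the original frame.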

Related

Selecting Rows Based On Specific Condition In Python Pandas Dataframe

So I am new to using Python Pandas dataframes.
I have a dataframe with one column representing customer ids and the other holding flavors and satisfaction scores that looks something like this.
Although each customer should have 6 rows dedicated to them, Customer 1 only has 5. How do I create a new dataframe that will only print out customers who have 6 rows?
I tried doing: df['Customer No'].value_counts() == 6 but it is not working.
Here is one way to do it.
If you post the data as code (preferably) or text, I will be able to share the result.
# create a temporary column 'c' by grouping on Customer No
# and assigning the count to it using transform;
# finally, use loc to select the rows that have a count equal to 6
(df.loc[df.assign(
    c=df.groupby(['Customer No'])['Customer No']
        .transform('count'))['c'].eq(6)]
)
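For illustration, here is a minimal self-contained sketch of the same idea. The sample data is hypothetical (the question's table is not available): customer 1 has 5 rows and customer 2 has 6. It also shows why df['Customer No'].value_counts() == 6 does not filter rows on its own: the comparison returns a boolean Series indexed by customer number, not by row, so it has to be mapped back to the rows with isin.

```python
import pandas as pd

# Hypothetical sample data: customer 1 has 5 rows, customer 2 has 6
df = pd.DataFrame({
    'Customer No': [1] * 5 + [2] * 6,
    'Score': range(11),
})

# Count the rows per customer and broadcast the count back to every row
counts = df.groupby('Customer No')['Customer No'].transform('count')
complete = df[counts.eq(6)]

# Equivalent selection built on value_counts
vc = df['Customer No'].value_counts()
complete_alt = df[df['Customer No'].isin(vc[vc == 6].index)]
```

Both selections keep only the rows of customers that have exactly 6 rows.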

Joining two data frames on column name and comparing result side by side

I have two data frames which look like df1 and df2 below and I want to create df3 as shown.
I could do this using a left join to have all the rows in one dataframe and then did a numpy.where to see if they are matching or not.
I could get what I want but I feel there should be an elegant way of doing this which will eliminate renaming columns, reshuffling columns in dataframe and then using np.where.
Is there a better way to do this?
code to reproduce dataframes:
import pandas as pd
df1=pd.DataFrame({'product':['apples','bananas','oranges','pineapples'],'price':[1,2,3,7],'quantity':[5,7,11,4]})
df2=pd.DataFrame({'product':['apples','bananas','oranges'],'price':[2,2,4],'quantity':[5,7,13]})
df3=pd.DataFrame({'product':['apples','bananas','oranges'],'price_df1':[1,2,3],'price_df2':[2,2,4],'price_match':['No','Yes','No'],'quantity':[5,7,11],'quantity_df2':[5,7,13],'quantity_match':['Yes','Yes','No']})
An elegant way to do your task is to:
generate "partial" DataFrames from each source column,
and then concatenate them.
The first step is to define a function to join 2 source columns and append "match" column:
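import numpy as np

def myJoin(s1, s2):
    # Inner-join the two columns on their shared index (product)
    rv = s1.to_frame().join(s2.to_frame(), how='inner',
                            lsuffix='_df1', rsuffix='_df2')
    # Flag the rows where both sources hold the same value
    rv[s1.name + '_match'] = np.where(rv.iloc[:, 0] == rv.iloc[:, 1], 'Yes', 'No')
    return rv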
Then, from df1 and df2, generate 2 auxiliary DataFrames setting product as the index:
wrk1 = df1.set_index('product')
wrk2 = df2.set_index('product')
And the final step is:
result = pd.concat([ myJoin(wrk1[col], wrk2[col]) for col in wrk1.columns ], axis=1)\
.reset_index()
Details:
for col in wrk1.columns - generates names of columns to join.
myJoin(wrk1[col], wrk2[col]) - generates the partial result for this column from
both source DataFrames.
[…] - a list comprehension, collecting the above partial results in a list.
pd.concat(…) - concatenates these partial results into the final result.
reset_index() - converts the index (product names) into a regular column.
For your source data, the result is:
   product  price_df1  price_df2  price_match  quantity_df1  quantity_df2  quantity_match
0   apples          1          2           No             5             5             Yes
1  bananas          2          2          Yes             7             7             Yes
2  oranges          3          4           No            11            13              No
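As a possible alternative to joining column by column, a single merge with the suffixes parameter avoids the manual renaming mentioned in the question. This is only a sketch: the match columns end up at the end of the frame rather than next to their source columns, so a final reordering step would still be needed if the exact layout of df3 matters.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'product': ['apples', 'bananas', 'oranges', 'pineapples'],
                    'price': [1, 2, 3, 7], 'quantity': [5, 7, 11, 4]})
df2 = pd.DataFrame({'product': ['apples', 'bananas', 'oranges'],
                    'price': [2, 2, 4], 'quantity': [5, 7, 13]})

# Inner merge on product; overlapping columns get the given suffixes
merged = df1.merge(df2, on='product', how='inner', suffixes=('_df1', '_df2'))

# Add one Yes/No column per compared field
for col in ['price', 'quantity']:
    merged[col + '_match'] = np.where(
        merged[col + '_df1'] == merged[col + '_df2'], 'Yes', 'No')
```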

How to create n DataFrames from a DataFrame of n columns?

Suppose you have a DataFrame with n columns and want to create n DataFrames. Each new DataFrame should contain all the values of one column and be named after that column.
Example:
df=pd.DataFrame(columns=['cities','games','jobs'])
df['cities']='Londres Paris'.split()
df['games']='Fornite mw2'.split()
df['jobs']='engineers programmers'.split()
df
Output:
cities games jobs
0 Londres Fornite engineers
1 Paris mw2 programmers
I am looking for an efficient approach that generalizes to DataFrames with a large number of columns whose names are not known in advance.
The name of each new DataFrame must therefore be derived from the name of the corresponding column.
Required outputs:
cities
Out:
cities
0 Londres
1 Paris
games
Out:
games
0 Fornite
1 mw2
jobs
Output:
jobs
0 engineers
1 programmers
I want to create new DataFrames whose names or references are the strings contained in df.columns.
The easiest way is to create a dictionary, where the keys are the column names and the values are the actual dataframes:
dfs = {f'{col}':df[col].to_frame() for col in df.columns}
Now we can access each dataframe:
dfs['jobs']
          jobs
0    engineers
1  programmers
dfs['games']
     games
0  Fornite
1      mw2

Merge two data frames based on common column values in Pandas

How can I merge two data frames on a common column so that the result contains only the rows whose value in that column appears in both data frames?
df1 has 5000 rows in this format:
   director_name   actor_1_name     actor_2_name      actor_3_name      movie_title
0  James Cameron   CCH Pounder      Joel David Moore  Wes Studi         Avatar
1  Gore Verbinski  Johnny Depp      Orlando Bloom     Jack Davenport    Pirates of the Caribbean: At World's End
2  Sam Mendes      Christoph Waltz  Rory Kinnear      Stephanie Sigman  Spectre
and df2 has 10000 rows in this format:
movieId  genres                                       movie_title
1        Adventure|Animation|Children|Comedy|Fantasy  Toy Story
2        Adventure|Children|Fantasy                   Jumanji
3        Comedy|Romance                               Grumpier Old Men
4        Comedy|Drama|Romance                         Waiting to Exhale
The common column 'movie_title' has values that occur in both data frames, and I want to keep all rows where 'movie_title' is the same. The other rows should be dropped.
Any help/suggestion would be appreciated.
Note: I already tried
pd.merge(dfinal, df1, on='movie_title')
and output comes like one row
director_name actor_1_name actor_2_name actor_3_name movie_title movieId title genres
and with how="outer", "left", or "right" (I tried all of them) I didn't get any rows after dropping NaN, although many common column values do exist.
You can use pd.merge:
import pandas as pd
pd.merge(df1, df2, on="movie_title")
Only rows are kept for which common keys are found in both data frames. In case you want to keep all rows from the left data frame and only add values from df2 where a matching key is available, you can use how="left":
pd.merge(df1, df2, on="movie_title", how="left")
We can merge two data frames in several ways. The most common way in Python is the merge operation in pandas.
import pandas
dfinal = df1.merge(df2, on="movie_title", how = 'inner')
If the key column has a different name in each data frame, you can specify the left and right column names separately. For example, if 'movie_title' in df1 is called 'movie_name' in df2:
dfinal = df1.merge(df2, how='inner', left_on='movie_title', right_on='movie_name')
If you want to be even more specific, you may read the documentation of pandas merge operation.
If you want a merged DataFrame in which only the values common to both data frames appear, do an inner merge.
import pandas as pd
merged_Frame = pd.merge(df1, df2, on='movie_title', how='inner')
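When an inner merge returns no rows even though the keys look identical, one possible cause (an assumption here, since the original data is not shown) is stray whitespace in the key column. A minimal sketch with hypothetical sample rows: normalize the keys with str.strip, then use the indicator parameter of merge to see which rows matched.

```python
import pandas as pd

# Hypothetical rows; the titles in df1 carry a trailing space
df1 = pd.DataFrame({'director_name': ['James Cameron', 'Sam Mendes'],
                    'movie_title': ['Avatar ', 'Spectre ']})
df2 = pd.DataFrame({'movieId': [1, 2],
                    'movie_title': ['Avatar', 'Toy Story']})

# Normalize the keys before merging
df1['movie_title'] = df1['movie_title'].str.strip()
df2['movie_title'] = df2['movie_title'].str.strip()

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
check = df1.merge(df2, on='movie_title', how='outer', indicator=True)
common = check[check['_merge'] == 'both'].drop(columns='_merge')
```

Inspecting check['_merge'] shows how many keys were found on each side, which helps confirm whether the keys really agree.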

Pandas: how do I group a Data Frame by a set of ordinal values?

I'm starting to learn about Python Pandas and want to generate a graph with the sum of arbitrary groupings of an ordinal value. It can be better explained with a simple example.
Suppose I have the following table of food consumption data:
And I have two groups of foods defined as two lists:
healthy = ['apple', 'brocolli']
junk = ['cheetos', 'coke']
Now I want to plot a graph with the evolution of consumption of junk and healthy food. I believe I must then process my data to get a DataFrame like:
Suppose the first table is already in a Dataframe called food, how do I transform it to get the second one?
I also welcome suggestions to reword my question to make it clearer, or for different approaches to generate the plot.
First create a dictionary from the lists and then swap the keys with the values.
Then group by the food column mapped through that dict and by year, aggregate with sum, and finally reshape with unstack:
healthy = ['apple', 'brocolli']
junk = ['cheetos', 'coke']
d1 = {'healthy':healthy, 'junk':junk}
##http://stackoverflow.com/a/31674731/2901002
d = {k: oldk for oldk, oldv in d1.items() for k in oldv}
print (d)
{'brocolli': 'healthy', 'cheetos': 'junk', 'apple': 'healthy', 'coke': 'junk'}
df1 = df.groupby([df.food.map(d), 'year'])['amount'].sum().unstack(0)
print (df1)
food healthy junk
year
2010 10 11
2011 17 10
2012 13 24
Another solution with pivot_table:
df1 = df.pivot_table(index='year', columns=df.food.map(d), values='amount', aggfunc='sum')
print (df1)
food healthy junk
year
2010 10 11
2011 17 10
2012 13 24
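Since the input table is not shown above, here is a self-contained sketch of the groupby approach on hypothetical data. The column names year, food and amount follow the answer's code; the values are made up for illustration and do not reproduce the output above.

```python
import pandas as pd

# Hypothetical long-format data: one row per (year, food)
food = pd.DataFrame({
    'year':   [2010, 2010, 2010, 2011, 2011, 2011],
    'food':   ['apple', 'cheetos', 'coke', 'brocolli', 'apple', 'coke'],
    'amount': [4, 3, 2, 5, 1, 6],
})

groups = {'healthy': ['apple', 'brocolli'], 'junk': ['cheetos', 'coke']}
# Invert the mapping: each food points to the name of its group
food_to_group = {item: name for name, items in groups.items() for item in items}

# Sum the amounts per (group, year), then move the groups into columns
totals = (food.groupby([food['food'].map(food_to_group), 'year'])['amount']
              .sum()
              .unstack(0))
print(totals)
```

Calling totals.plot() should then draw one line per group, assuming matplotlib is installed.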