How to create n DataFrames from a DataFrame of n columns? - pandas

Suppose you have a DataFrame of n columns and want to create n DataFrames. Each new DataFrame will contain all the values of one column and will be named after that column.
Example:
df=pd.DataFrame(columns=['cities','games','jobs'])
df['cities']='Londres Paris'.split()
df['games']='Fornite mw2'.split()
df['jobs']='engineers programmers'.split()
df
Output:
cities games jobs
0 Londres Fornite engineers
1 Paris mw2 programmers
An efficient, generalizable approach is sought for dataframes with a large number of columns whose names are not known in advance.
Therefore, the name of each new dataframe must be derived from the name of its column.
Required outputs:
cities
Out:
cities
0 Londres
1 Paris
games
Out:
games
0 Fornite
1 mw2
jobs
Out:
jobs
0 engineers
1 programmers
I want to create new DataFrames whose names or references are the strings contained in df.columns.

The easiest way is to create a dictionary, where the keys are the column names (serving as the names of the new dataframes) and the values are the actual dataframes:
dfs = {col: df[col].to_frame() for col in df.columns}
Now we can access each dataframe:
dfs['games']
games
0 Fornite
1 mw2
dfs['jobs']
jobs
0 engineers
1 programmers
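Since the frames live in a single dict, you can also loop over all of them without knowing the column names in advance, which is what the question asks for; a minimal sketch:
# Iterate over the per-column frames; 'name' is the original column name
for name, frame in dfs.items():
    print(name, frame.shape)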

Related

Most computationally efficient pandas method to group rows based on 1 column value, and other column values to list

I want to combine rows in a pandas dataframe where a particular column contains the same value, collecting the other columns' values into a single list.
Here's a method I came up with. First, create the sample df
df = pd.DataFrame({'Animal': ['Falcon', 'Parrot', 'Parrot', 'Falcon'],
                   'Max Speed': ['380.', '370.', '24.', '26.']})
df
output
Animal Max Speed
0 Falcon 380.
1 Parrot 370.
2 Parrot 24.
3 Falcon 26.
Here is the method
test = df.groupby(['Animal']).agg(lambda x: tuple(x)).applymap(list).reset_index()
test.head()
output
Animal Max Speed
0 Falcon [380., 26.]
1 Parrot [370., 24.]
Is there a more computationally efficient method for getting the same output?
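A variant that is usually faster (a sketch, not benchmarked here) aggregates each group straight to a list, building the lists in one pass instead of building tuples and then converting them with applymap:
test = df.groupby('Animal')['Max Speed'].agg(list).reset_index()
test.head()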

Iterate two dataframes, compare and change a value in pandas or pyspark

I am trying to do an exercise in pandas.
I have two dataframes. I need to compare a few columns between them and change the value of one column in the first dataframe if the comparison succeeds.
Dataframe 1:
Article Country Colour Buy
Pants Germany Red 0
Pull Poland Blue 0
Initially all my articles have the flag 'Buy' set to zero.
I have dataframe 2 that looks as:
Article Origin Colour
Pull Poland Blue
Dress Italy Red
I want to check whether the article, country/origin and colour columns match (that is, whether each article from dataframe 1 can be found in dataframe 2) and, if so, set the 'Buy' flag to 1.
I tried to iterate through both dataframes with pyspark, but pyspark dataframes are not iterable.
I thought about doing it in pandas, but apparently it is bad practice to change values during iteration.
Which code in pyspark or pandas would do what I need?
Thanks!
Merge with an indicator, then map the values. Make sure to drop_duplicates on the merge keys in the right frame so the merge result is always the same length as the original, and rename so we don't repeat the same information after the merge. There is no need for a pre-defined column of 0s.
df1 = df1.drop(columns='Buy')
df1 = df1.merge(df2.drop_duplicates().rename(columns={'Origin': 'Country'}),
                indicator='Buy', how='left')
df1['Buy'] = df1['Buy'].map({'left_only': 0, 'both': 1}).astype(int)
Article Country Colour Buy
0 Pants Germany Red 0
1 Pull Poland Blue 1
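Since the question also asks about pyspark: a rough equivalent of the same join-and-flag idea, assuming df1 and df2 are Spark DataFrames with the columns shown above (a sketch, untested here):
from pyspark.sql import functions as F

# Align df2's key columns with df1 and tag every row that exists in df2
keys = (df2.withColumnRenamed('Origin', 'Country')
           .dropDuplicates()
           .withColumn('Buy', F.lit(1)))
# Left join: matched rows get Buy=1, unmatched rows get null, then 0
result = (df1.drop('Buy')
             .join(keys, on=['Article', 'Country', 'Colour'], how='left')
             .fillna(0, subset=['Buy']))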

How can I combine same-named columns into one in a pandas dataframe so all the columns are unique?

I have a dataframe that looks like this:
In [268]: dft.head()
Out[268]:
ticker BYND UBER UBER UBER ... ZM ZM BYND ZM
0 analyst worlds uber revenue ... company owning pet things
1 moskow apac note uber ... try things humanization users
2 growth anheuserbusch growth target ... postipo unicorn products revenue
3 stock kong analysts raised ... software revenue things million
4 target uberbeating stock rising ... earnings million pets direct
[5 rows x 500 columns]
In [269]: dft.columns.unique()
Out[269]: Index(['BYND', 'UBER', 'LYFT', 'SPY', 'WORK', 'CRWD', 'ZM'], dtype='object', name='ticker')
How do I combine the columns so there is only a single unique column name for each ticker?
Maybe you should try making a copy of the columns you wish to join, then extending the first column with that copy.
Code:
First, convert all the column names to the same case (lower or upper) so that there is no mismatch in header casing.
import pandas as pd

def merge_(df):
    '''Return a dataframe in which same-named (lowercased) columns are combined.'''
    # Get the unique column names in lowercase
    columns = set(map(str.lower, df.columns))
    # Start from empty strings, since the cell values are strings
    df1 = pd.DataFrame('', index=df.index, columns=sorted(columns))
    # Merge the matching columns; iterate by position so duplicated
    # column names are picked up one at a time
    for i, col in enumerate(df.columns):
        df1[col.lower()] += df.iloc[:, i]  # values are str, so '+' concatenates
    return df1
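A minimal usage sketch (the tiny frame here is a made-up stand-in for dft):
dft = pd.DataFrame([['analyst', 'worlds', 'uber'],
                    ['moskow', 'apac', 'note']],
                   columns=['BYND', 'UBER', 'UBER'])
merged = merge_(dft)
# merged has one 'bynd' column and one 'uber' column whose values are
# the concatenation of the two original UBER columns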

Adding lists stored in dataframe

I have two dataframes as:
df1.ix[1:3]
DateTime
2018-01-02 [-0.0031537018416199097, 0.006451397621428631,...
2018-01-03 [-0.0028882814454597745, -0.005829869983964528...
df2.ix[1:3]
DateTime
2018-01-02 [-0.03285881500135208, -0.027806145786217932, ...
2018-01-03 [-0.0001314381449719178, -0.006278235444742629...
len(df1.ix['2018-01-02'][0])
500
len(df2.ix['2018-01-02'][0])
500
When I do df1 + df2 I get:
len((df1 + df2).ix['2018-01-02'][0])
1000
So instead of being summed element-wise, the lists are being concatenated.
How do I add the lists in the dataframes df1 and df2 element-wise?
When an operation is applied between two dataframes, it is broadcast at the element level. Each element here is a list, and the '+' operator applied to two lists concatenates them. That is why the resulting dataframe contains concatenated lists.
There are multiple approaches for actually summing the list elements instead of concatenating them.
One approach is to convert the list elements into columns, add the dataframes, and then merge the columns back into a single list (the other answer suggests this, but in a way that does not work).
Step 1: Converting list elements to columns
df1 = df1.apply(lambda row: pd.Series(row[0]), axis=1)
df2 = df2.apply(lambda row: pd.Series(row[0]), axis=1)
We need to pass row[0] instead of row to get rid of the column index associated with the series.
Step 2: Add dataframes
df = df1 + df2  # this dataframe will have 500 columns
Step 3: Merge columns back to lists
df = df.apply(lambda row: pd.Series({0: list(row)}), axis=1)
This is the interesting part. Why do we return a Series here? Why doesn't simply returning list(row) work, leaving the 500 columns in place?
The reason is: if the length of the returned list equals the number of columns, the list is expanded to fit the columns, and it looks as if nothing happened. If the length of the list differs from the number of columns, it is returned as a single list.
Let's look at an example. Suppose I have a dataframe with columns 0, 1 and 2.
df=pd.DataFrame({0:[1,2,3],1:[4,5,6],2:[7,8,9]})
0 1 2
0 1 4 7
1 2 5 8
2 3 6 9
The original dataframe has 3 columns. If I return a list of two elements, it works and a Series is returned:
df1 = df.apply(lambda row: [row[0], row[1]], axis=1)
0 [1, 4]
1 [2, 5]
2 [3, 6]
dtype: object
If instead I try to return a list of three numbers, it gets expanded to fit the columns:
df1 = df.apply(list, axis=1)
0 1 2
0 1 4 7
1 2 5 8
2 3 6 9
So if we want to return a list the same size as the number of columns, we have to return it in the form of a Series in which one row's value is the whole list.
Another approach is to introduce one dataframe's column into the other and then add the columns using an apply function:
df1[1] = df2[0]  # bring df2's list column in as a second column of df1
df = df1.apply(lambda r: list(np.array(r[0]) + np.array(r[1])), axis=1)
We can take advantage of numpy arrays here (with numpy imported as np): the '+' operator on numpy arrays sums corresponding values and gives back a single numpy array.
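The same numpy trick can also be written without the intermediate column by using Series.combine; a self-contained sketch on toy data with two-element lists:
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2018-01-02', '2018-01-03'])
df1 = pd.DataFrame({0: [[1.0, 2.0], [3.0, 4.0]]}, index=idx)
df2 = pd.DataFrame({0: [[10.0, 20.0], [30.0, 40.0]]}, index=idx)

# combine pairs up values by index and applies the function to each pair
summed = df1[0].combine(df2[0], lambda a, b: list(np.array(a) + np.array(b)))
# 2018-01-02    [11.0, 22.0]
# 2018-01-03    [33.0, 44.0]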
Cast them to series so that they become columns, then add your dfs:
df1 = df1.apply(pd.Series, axis=1)
df2 = df2.apply(pd.Series, axis=1)
df1 + df2
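As the longer answer above notes, this does not actually expand the lists when they sit inside a single column: each row passed to pd.Series is already a one-element Series, so the frame comes back unchanged. A variant that does expand them (assuming the lists live in column 0) is:
a = pd.DataFrame(df1[0].tolist(), index=df1.index)
b = pd.DataFrame(df2[0].tolist(), index=df2.index)
a + b  # element-wise sums, one column per list position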

Pandas: how do I group a Data Frame by a set of ordinal values?

I'm starting to learn about Python Pandas and want to generate a graph with the sum of arbitrary groupings of an ordinal value. It can be better explained with a simple example.
Suppose I have the following table of food consumption data, one row per year, food and amount:
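(The table itself is not reproduced above; a hypothetical stand-in with columns year, food and amount, with the amounts chosen so the grouped sums match the answers' output, is:)
df = pd.DataFrame({
    'year':   [2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011, 2012, 2012, 2012, 2012],
    'food':   ['apple', 'brocolli', 'cheetos', 'coke'] * 3,
    'amount': [6, 4, 5, 6, 9, 8, 4, 6, 5, 8, 12, 12],
})  # the question calls this frame food; the answers below use df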
And I have two groups of foods defined as two lists:
healthy = ['apple', 'brocolli']
junk = ['cheetos', 'coke']
Now I want to plot a graph with the evolution of consumption of junk and healthy food. I believe I must first process my data to get a DataFrame like the healthy/junk-by-year table shown in the answers' output below.
Suppose the first table is already in a DataFrame called food; how do I transform it to get the second one?
I also welcome suggestions to reword my question to make it clearer, or for different approaches to generate the plot.
First create a dictionary from the lists and then swap its keys with its values.
Then group by the food column mapped through the dict and by year, aggregate with sum, and finally reshape with unstack:
healthy = ['apple', 'brocolli']
junk = ['cheetos', 'coke']
d1 = {'healthy':healthy, 'junk':junk}
# http://stackoverflow.com/a/31674731/2901002
d = {k: oldk for oldk, oldv in d1.items() for k in oldv}
print (d)
{'brocolli': 'healthy', 'cheetos': 'junk', 'apple': 'healthy', 'coke': 'junk'}
df1 = df.groupby([df.food.map(d), 'year'])['amount'].sum().unstack(0)
print (df1)
food healthy junk
year
2010 10 11
2011 17 10
2012 13 24
Another solution with pivot_table:
df1 = df.pivot_table(index='year', columns=df.food.map(d), values='amount', aggfunc='sum')
print (df1)
food healthy junk
year
2010 10 11
2011 17 10
2012 13 24
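From here, the graph the question asks for is a one-liner on the reshaped frame (assuming matplotlib is available):
df1.plot()  # one line per food group, year on the x-axis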