I have a table like:
   col1 col2
0     1    a
1     2    b
2     2    c
3     3    c
4     4    d
I'd like rows to be grouped together if they have a matching value in col1 or col2. That is, I'd like something like this:
(df
 .groupby(set('col1', 'col2'))  # Made-up syntax
 .ngroup())
0 0
1 1
2 1
3 1
4 2
Is there a way to do this with pandas?
This is not easy to achieve with pandas alone: two otherwise distant groups in one column can become connected transitively when their rows share a value in the other column.
You can approach this with graph theory: build edges from the two (or more) grouping columns and find the connected components. A Python library for this is networkx:
import networkx as nx

# group ids per column; prefix one side so col1 group ids and col2 group ids
# cannot collide as graph nodes
g1 = df.groupby('col1').ngroup()
g2 = 'a' + df.groupby('col2').ngroup().astype(str)

# make graph and get connected components to form a mapping dictionary
G = nx.from_edgelist(zip(g1, g2))
d = {k: v for v, s in enumerate(nx.connected_components(G)) for k in s}

# find common group
group = g1.map(d)
df.groupby(group).ngroup()
output:
0 0
1 1
2 1
3 1
4 2
dtype: int64
graph: (visualization of the connected components omitted)
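If you want the labels on the original frame, you can assign the result back as a column (a small usage sketch; the column name group_id is my own choice):

# attach the merged group label to each row
df['group_id'] = df.groupby(group).ngroup()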
I am dealing with several data frames (DataFrames = [DataFrame_a, b, c, ..., z]) that have long descriptions as their headers. For example:
a = pd.DataFrame(data=[[1, 2, 7], ['ABC', 'BCD', 'CDE'], [5, 6, 0]], columns=['SuperSuperlong_name_columnA', 'SuperSuperlong_name_columnB', 'SuperSuperlong_name_columnC'])
   SuperSuperlong_name_columnA  SuperSuperlong_name_columnB  SuperSuperlong_name_columnC
0  1                            2                            7
1  ABC                          BCD                          CDE
2  5                            6                            0
I'd like it to be transformed to
   ABC                          BCD                          CDE
0  SuperSuperlong_name_columnA  SuperSuperlong_name_columnB  SuperSuperlong_name_columnC
1  1                            2                            7
2  5                            6                            0
What's the easiest way to do this?
I'd also like to apply the method to all the data frames I have. How should I do that?
Hope this helps.
# Pass the column names in as a new row of the DataFrame and reset the index
df.loc['new'] = df.columns
df.reset_index(inplace=True, drop=True)
# Pass the row you want as the column names, then drop that row
df.columns = df.iloc[1]
df = df.drop(1).reset_index(drop=True)
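To apply this to every DataFrame in your list, one option is to wrap the steps in a small helper and map it over the list (a sketch; swap_header is my own name, and it assumes a default RangeIndex and that the row holding the short names sits at position 1, as in your example):

def swap_header(df, row=1):
    df = df.copy()
    # append the current column names as a data row (assumes a default RangeIndex)
    df.loc[len(df)] = df.columns
    # promote the chosen row to be the header, then drop it
    df.columns = df.iloc[row]
    return df.drop(df.index[row]).reset_index(drop=True)

DataFrames = [swap_header(df) for df in DataFrames]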
Pretty new to this and am having trouble finding the right way to do this.
Say I have dataframe1 looking like this with column names and a bunch of numbers as data:
D L W S
1 2 3 4
4 3 2 1
1 2 3 4
and I have dataframe2 looking like this:
Name1 Name2 Name3 Name4
2 data data D
3 data data S
4 data data L
5 data data S
6 data data W
I would like a new dataframe produced by multiplying each row of the second dataframe against each row of the first dataframe, where the value of Name1 is multiplied by the values in the dataframe1 column whose name matches the Name4 value of that dataframe2 row.
Is there any nice way to do this? I was trying to look at using methods like where, condition, and apply but haven't been understanding things well enough to get something working.
EDIT: Use the following code to create fake data for the DataFrames:
d1 = {'D':[1,2,3,4,5,6],'W':[2,2,2,2,2,2],'L':[6,5,4,3,2,1],'S':[1,2,3,4,5,6]}
d2 = {'col1': [3,2,7,4,5,6], 'col2':[2,2,2,2,3,4], 'col3':['data', 'data', 'data','data', 'data', 'data' ], 'col4':['D','L','D','W','S','S']}
df1 = pd.DataFrame(data = d1)
df2 = pd.DataFrame(data = d2)
EDIT AGAIN FOR MORE INFO
First, I changed the data in df1 at this point so this new example works out better.
Okay, so from those two dataframes, the dataframe I'd like to create would come out like this if the multiplication went through for the first four rows of df2. You can see that col2 and col3 are unchanged, but depending on the letter in col4, col1 was multiplied by the corresponding factor from df1:
d3 = {'col1': [3, 6, 9, 12, 15, 18, 12, 10, 8, 6, 4, 2, 7, 14, 21, 28, 35, 42, 8, 8, 8, 8, 8, 8],
      'col2': [2] * 24,
      'col3': ['data'] * 24,
      'col4': ['D'] * 6 + ['L'] * 6 + ['D'] * 6 + ['W'] * 6}
df3 = pd.DataFrame(data = d3)
I think I understand what you are trying to achieve: you want to multiply each row r in df2 with the corresponding column c in df1, but the elements of c are only multiplied with the first element of r; the rest of the row doesn't change.
I was thinking there might be a way to join df1.transpose() and df2 but I didn't find one.
While not pretty, I think the code below solves your problem:
def stretch(row):
    # repeat the df2 row once for every row of df1
    repeated_rows = pd.concat([row] * len(df1), axis=1, ignore_index=True).transpose()
    factor = row['col1']
    label = row['col4']
    first_column = df1[label] * factor
    repeated_rows['col1'] = first_column
    return repeated_rows

pd.concat(stretch(r) for _, r in df2.iterrows())
#resulting in
col1 col2 col3 col4
0 3 2 data D
1 6 2 data D
2 9 2 data D
3 12 2 data D
4 15 2 data D
5 18 2 data D
0 12 2 data L
1 10 2 data L
2 8 2 data L
3 6 2 data L
4 4 2 data L
5 2 2 data L
0 7 2 data D
1 14 2 data D
2 21 2 data D
3 28 2 data D
4 35 2 data D
5 42 2 data D
0 8 2 data W
1 8 2 data W
2 8 2 data W
3 8 2 data W
4 8 2 data W
5 8 2 data W
...
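If you want to avoid the Python-level loop, an alternative sketch (using the df1/df2 from the edit above; not benchmarked) reshapes df1 into long form and merges on the column name. Note that level_0/level_1 are just the default names reset_index gives the unnamed index levels:

# one row per (df1 row, df1 column) pair, holding that cell's value
factors = df1.stack().rename('factor').reset_index()

# repeat each df2 row once per df1 row by matching col4 to df1's column name,
# keeping df2 order and, within each block, df1 row order
out = (df2.reset_index()
          .merge(factors, left_on='col4', right_on='level_1')
          .sort_values(['index', 'level_0']))

# scale col1 by the matching factor and keep only the original columns
out['col1'] = out['col1'] * out['factor']
out = out[df2.columns].reset_index(drop=True)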
I have two pandas dataframes
df1 = A B C
1 2 3
2 3 4
3 4 5
df2 = X Y Z
1 2 3
2 3 4
3 4 5
I need to map columns based on the data: if the data in two columns is the same, then map the column names.
Output = col1 col2
A X
B Y
C Z
I cannot find any built-in function to support this, hence simply loop over all columns:
import pandas as pd

pairs = []
for col1 in df1.columns:
    for col2 in df2.columns:
        if df1[col1].equals(df2[col2]):
            pairs.append((col1, col2))

output = pd.DataFrame(pairs, columns=['col1', 'col2'])
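For instance, with the two frames from the question built inline, here is a quick self-contained check of the loop above (the print at the end is only for inspection):

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 3, 4], 'C': [3, 4, 5]})
df2 = pd.DataFrame({'X': [1, 2, 3], 'Y': [2, 3, 4], 'Z': [3, 4, 5]})

pairs = []
for col1 in df1.columns:
    for col2 in df2.columns:
        # Series.equals compares values element-wise (dtypes must match)
        if df1[col1].equals(df2[col2]):
            pairs.append((col1, col2))

print(pd.DataFrame(pairs, columns=['col1', 'col2']))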
I have 2 files and I want to generate data using different columns from the different files. I want to do something like this:
Here is my problem with an example:
I have 2 files, abc.txt (col1, col2) and xyz.txt (col3, col4). The number of records in the two files differs: say abc.txt has 1000 records and xyz.txt has 100 records.
I want to store the output in a file such that I get col1 and col2 from abc.txt and col3 from xyz.txt (as xyz.txt has fewer records than abc.txt, I want the col3 values to be repeated, either randomly or in the same sequence as in the input file; either is ok).
Input
abc.txt          xyz.txt
col1  col2       col3  col4
1     A          4     X
2     B          5     Y
3     C          6     Z
4     D
5     D
6     F
7     A
A = LOAD '/user/abc.txt' Using PigStorage('|');
B = LOAD '/user/xyz.txt' Using PigStorage('|');
C = FOREACH A GENERATE A.$0,A.$1,B.$0;
Output
col1 col2 col3
1 A 4
2 B 5
3 C 6
4 D 5
5 D 4
6 F 4
7 A 6
Is it possible to do this using PIG?
GENERATE is not a standalone operator in Pig, so you cannot use it on its own to generate data. Pig provides FOREACH ... GENERATE for iterating over a relation, and it works on one relation only. To me it looks like you cannot generate the data as specified in the question unless you perform some sort of JOIN on the data.