I have two data frames, and I want to create new columns in frame 1 using properties from frame 2.
frame 1:

Name
alice
bob
carol

frame 2:

Name   Type   Value
alice  lower  1
alice  upper  2
bob    equal  42
carol  lower  0

desired result (frame 1):

Name   Lower  Upper
alice  1      2
bob    42     42
carol  0      NA
Hence, the common column of both frames is Name. You can use Name to look up the bounds (of a variable), which are specified in the second frame. Frame 1 lists each name exactly once. Frame 2 may have one or two entries per name, each specifying either a lower or an upper bound (or both at once if the type is equal). We do not need both bounds for each variable; one of the bounds can stay empty. I would like a frame that lists the range of each variable. I see how I could do that with for-loops over the columns, but that does not seem to be in the pandas spirit. Do you have any suggestions for a compact solution? :-)
Thanks in advance
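For reference, a minimal setup reproducing both frames (the names df1 and df2 are assumptions, matching the answer below):

import pandas as pd

df1 = pd.DataFrame({'Name': ['alice', 'bob', 'carol']})
df2 = pd.DataFrame({
    'Name':  ['alice', 'alice', 'bob', 'carol'],
    'Type':  ['lower', 'upper', 'equal', 'lower'],
    'Value': [1, 2, 42, 0],
})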
You're not looking for a merge, but rather a pivot.
(df2[df2['Name'].isin(df1['Name'])]
 .pivot(index='Name', columns='Type', values='Value')
 .reset_index()
)
But this doesn't handle the special 'equal' case.
For this, you can use a little trick. Replace 'equal' with a list of the other two values and explode to create the two rows.
(df2[df2['Name'].isin(df1['Name'])]
 .assign(Type=lambda d: d['Type'].map(lambda x: {'equal': ['lower', 'upper']}.get(x, x)))
 .explode('Type')                                    # 'equal' becomes two rows
 .pivot(index='Name', columns='Type', values='Value')
 .reset_index()
 .convert_dtypes()
)
Output:
Name lower upper
0 alice 1 2
1 bob 42 42
2 carol 0 <NA>
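If df1 could also contain names that never appear in df2, a left merge back onto df1 keeps them (a small follow-up sketch; pivoted is just a name for the result above):

pivoted = (df2[df2['Name'].isin(df1['Name'])]
           .pivot(index='Name', columns='Type', values='Value')
           .reset_index())
out = df1.merge(pivoted, on='Name', how='left')   # names missing from df2 get NA bounds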
I wonder if you can help me find a solution for the following problem. Given a data frame df1 like this:
d1={'L':['aaa','bbb','ccc','aaa','bbb','ddd'],
'w':[1,5,9,13,17,21],
'x':[2,6,10,14,18,22],
'y':[3,7,11,15,19,23],
'z':[4,8,12,16,20,24]}
df1=pd.DataFrame(d1)
and two dictionaries to define grouping over columns and rows
dctRowGroups={'aaa':'A','bbb':'B','ccc':'A','ddd':'B'}
dctColGroups={'w':'ALPHA','x':'BETA','y':'ALPHA','z':'BETA'}
I wanted to aggregate over columns as a first step. Applying
g2=df1.groupby(dctColGroups,axis=1)
g2.sum()
results in

   ALPHA  BETA
0      4     6
1     12    14
2     20    22
3     28    30
4     36    38
5     44    46
but I wanted to keep the 'L' column for the next step, row-wise aggregation, i.e. the result should be a dataframe df2 more like this:

     L  ALPHA  BETA
0  aaa      4     6
1  bbb     12    14
2  ccc     20    22
3  aaa     28    30
4  bbb     36    38
5  ddd     44    46
What do I need to code to make this happen?
As a next step, I want to aggregate df2 over the rows using the dctRowGroups dictionary
g3=df2.groupby(dctRowGroups,axis=0)
g3.sum()
to get a final result like this:

   ALPHA  BETA
A     52    58
B     92    98
How can I do all these steps in as few lines of code as possible?
Appreciate your advice on this.
Thanks a lot
Willfried.
You can do:
Firstly, create df2 and insert the 'L' column using the insert() method:
df2 = df1.groupby(dctColGroups, axis=1).sum()
df2.insert(0, 'L', df1['L'])   # use this only when the column order matters
# OR (use only one of the two methods, insert or assign)
df2 = df2.assign(L=df1['L'])   # otherwise use this
Finally, use the assign(), map() and groupby() methods:
result = df2.assign(L=df2['L'].map(dctRowGroups)).groupby('L').sum()
Outputs:
df2:
L ALPHA BETA
0 aaa 4 6
1 bbb 12 14
2 ccc 20 22
3 aaa 28 30
4 bbb 36 38
5 ddd 44 46
result:
ALPHA BETA
L
A 52 58
B 92 98
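Since the question asks for as few lines as possible, both aggregations can also be chained into one expression. A sketch of that idea (using .T.groupby(level=0) instead of groupby(..., axis=1), which is deprecated in recent pandas):

result = (df1.assign(L=df1['L'].map(dctRowGroups))
             .groupby('L').sum()                 # row-wise aggregation to A/B
             .rename(columns=dctColGroups)       # relabel w/x/y/z as ALPHA/BETA
             .T.groupby(level=0).sum().T)        # column-wise aggregation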
I have a list and a data frame. I want to find the count of each word in the list (some entries in the list are word pairs) for each emotion in the data frame.
Here is my list:
[(frozenset({'know'}), 16528),
(frozenset({'im'}), 39047),
(frozenset({'feeling'}), 99455),
(frozenset({'like'}), 49332),
(frozenset({'feel', 'im'}), 16602),
(frozenset({'feeling', 'im'}), 23488),
(frozenset({'feel'}), 202985),
(frozenset({'feel', 'like'}), 42162),
(frozenset({'time'}), 17203),
(frozenset({'really'}), 17247)]
and this is my data frame:
Unnamed: 0 id text emotions
0 0 27383 [feel, awful, job, get, position, succeed, hap... sadness
1 1 110083 [im, alone, feel, awful] sadness
2 2 140764 [ive, probably, mentioned, really, feel, proud... joy
3 3 100071 [feeling, little, low, day, back] sadness
4 4 2837 [beleive, much, sensitive, people, feeling, te... love
Here is the expected output:
six columns, one for each of the six emotions, plus a last column for the total count.
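A minimal sketch of one approach (the names itemsets and df are assumptions; a row is counted for a word pair only when it contains all of the pair's words):

import pandas as pd

# itemsets: the list of (frozenset, count) pairs shown above
counts = {
    fs: df.groupby('emotions')['text']
          .apply(lambda texts: sum(fs <= set(words) for words in texts))
    for fs, _ in itemsets
}

result = pd.DataFrame(counts).T                              # one row per itemset
result.index = [' '.join(sorted(fs)) for fs in result.index]
result['total'] = result.sum(axis=1)                         # count across emotions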
I need to compare the last row of a group to the row above it, see if changes occur in a few columns, and populate a new column with 1 if a change occurs. The data presented below will explain this better.
I also need to account for groups with only one row.
what we have:
Group  Name   Sport     DogName  Eligibility
1      Tom    BBALL     Toto     Yes
1      Tom    BBall     Toto     Yes
1      Tom    golf      spot     Yes
2      Nancy  vllyball  Jimmy    yes
2      Nancy  vllyball  rover    no
what we want:
Group  Name   Sport     DogName  Eligibility  N_change  S_change  D_change  E_change
1      Tom    BBALL     Toto     Yes          0         0         0         0
1      Tom    BBall     Toto     Yes          0         0         0         0
1      Tom    golf      spot     Yes          0         1         1         0
2      Nancy  vllyball  Jimmy    yes          0         0         0         0
2      Nancy  vllyball  rover    no           0         0         1         1
I only care about changes from row to row within a group. Thank you for any help in advance.
The rows are already ordered, so we only need the last two rows of each group. If it is easier to compare sequential rows in a group, that is just as good for my purposes.
I did expect this would involve arrays, and I struggle with those because I never use them in my typical SAS modeling. Wanted to keep things short and sweet.
Use a DATA step with BY-group processing. Ensure your data is sorted by group first, and that the rows within each group are sorted in the correct order. Using arrays will make your code much smaller. One caution: the LAG function looks like the natural tool here, but a single lag() call inside a DO loop over an array keeps one queue for all the elements, so it would interleave the variables' values; the code below carries the previous row in retained variables instead.
The logic below will compare each row with the previous row. A flag of 1 will be set only if:
It's not the first row of the group
The current value differs from the previous value.
The syntax var = (test logic); is a shortcut that generates the 0/1 dummy flags directly.
data want;
    set have;
    by group;
    array var[*] name sport dogname eligibility;
    array prevvar[*] $ 50 prev_name prev_sport prev_dogname prev_eligibility;
    array changeflag[*] N_change S_change D_change E_change;
    retain prev_name prev_sport prev_dogname prev_eligibility;

    do i = 1 to dim(var);
        /* flag a change only past the first row of the group */
        changeflag[i] = (var[i] NE prevvar[i] AND NOT first.group);
        prevvar[i] = var[i];   /* remember this row for the next one */
    end;

    drop prev_: i;
run;
It is not uncommon for procedural programmers to run into this kind of dilemma in SQL, which is predominantly a set language where rows have no position. If you write a procedure that reads the selected data (sorted in the desired order), it can keep variables that control the creation of the desired additional columns in the output, similar to the previous-row logic above.
Or you can put the data into a spreadsheet, which is happier detecting changes in formula-filled columns: =IF(A2<>A1,1,0). Just make sure nobody re-sorts the spreadsheet data into a new order!
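For comparison, the same change flags are compact in pandas as well. A minimal sketch built from the question's sample data (groupby plus shift does the per-group "previous row" lookup):

import pandas as pd

# the 'what we have' table from the question
df = pd.DataFrame({
    'Group':       [1, 1, 1, 2, 2],
    'Name':        ['Tom', 'Tom', 'Tom', 'Nancy', 'Nancy'],
    'Sport':       ['BBALL', 'BBall', 'golf', 'vllyball', 'vllyball'],
    'DogName':     ['Toto', 'Toto', 'spot', 'Jimmy', 'rover'],
    'Eligibility': ['Yes', 'Yes', 'Yes', 'yes', 'no'],
})

cols = ['Name', 'Sport', 'DogName', 'Eligibility']
prev = df.groupby('Group')[cols].shift()       # previous row within each group
flags = (df[cols] != prev) & prev.notna()      # NaN marks a group's first row
# note: the comparison is case-sensitive, so 'BBALL' -> 'BBall' counts as a change
df[['N_change', 'S_change', 'D_change', 'E_change']] = flags.astype(int).to_numpy()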
I have:
pd.DataFrame({'col':['one','fish','two','fish','left','foot','right','foot']})
col
0 one
1 fish
2 two
3 fish
4 left
5 foot
6 right
7 foot
I want to concatenate every n rows (here every 4) and form a new dataframe:
pd.DataFrame({'col':['one fish two fish','left foot right foot']})
col
0 one fish two fish
1 left foot right foot
I am using Python and pandas.
If there is a default RangeIndex, use integer division with an aggregated join:
print(df.groupby(df.index // 4).agg(' '.join))

# for a non-default index, create a helper array instead
# print(df.groupby(np.arange(len(df)) // 4).agg(' '.join))
col
0 one fish two fish
1 left foot right foot
If you want to specify the column col:
print(df.groupby(df.index // 4)['col'].agg(' '.join).to_frame())
Try groupby:
import numpy as np

df['col'].groupby(np.repeat(np.arange(len(df)), 4)[:len(df)]).agg(' '.join)
Output:
0 one fish two fish
1 left foot right foot
Name: col, dtype: object
I have a DataFrame with the following structure.
df = pd.DataFrame({
    'tenant_id': [1, 1, 1, 2, 2, 2, 3, 3, 7, 7],
    'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh', 'bcv', 'bcb', 'yth', 'ytn'],
    'text': ['apple', 'ball', 'card', 'toy', 'sleep', 'happy', 'sad', 'be', 'u', 'pop'],
})
This gives the following output:
df = df[['tenant_id', 'user_id', 'text']]
tenant_id user_id text
1 ab1 apple
1 avc1 ball
1 bc2 card
2 iuyt toy
2 fvg sleep
2 fbh happy
3 bcv sad
3 bcb be
7 yth u
7 ytn pop
I would like to group by tenant_id and create a new column which holds a random selection of strings from the user_id column.
Thus, I would like my output to look like the following:
tenant_id user_id text new_column
1 ab1 apple [ab1, bc2]
1 avc1 ball [ab1]
1 bc2 card [avc1]
2 iuyt toy [fvg, fbh]
2 fvg sleep [fbh]
2 fbh happy [fvg]
3 bcv sad [bcb]
3 bcb be [bcv]
7 yth u [pop]
7 ytn pop [u]
Here, random ids from the user_id column have been selected; these ids can be repeated, as "fvg" is repeated for tenant_id=2. I would like a threshold of not more than ten ids. This data is just a sample and has only 10 ids to start with, so in general any number much smaller than the total number of user_ids; in this case, say, one less than the total user_ids that belong to a tenant.
I first tried figuring out how to select a random subset of varying length with
df.sample
new_column = df.user_id.sample(n=np.random.randint(1, 10))
I am kind of lost after this; assigning it to my df results in NaNs, probably because the samples are of variable length. Please help.
Thanks.
Per my comment:
Your 'new column' is not a new column, it's a new cell for a single row.
If you want to assign the result to a new column, you need to create a new column, and apply the cell computation to it.
df['new column'] = df['user_id'].apply(lambda x: df['user_id'].sample(n=np.random.randint(1, 10)).tolist())
It doesn't really matter which column you use for the apply, since the lambda's argument is not used in the computation.
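A per-tenant variant closer to the example output (a sketch; sampling within each tenant and a size of at most min(10, tenant size - 1) ids are assumptions read from the question):

import numpy as np
import pandas as pd

def pick(ids):
    # at least 1 id, at most one fewer than the tenant's users, capped at 10
    cap = max(1, min(10, len(ids) - 1))
    return pd.Series(
        [ids.sample(n=np.random.randint(1, cap + 1)).tolist() for _ in ids],
        index=ids.index,
    )

df['new_column'] = df.groupby('tenant_id', group_keys=False)['user_id'].apply(pick)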