Python 3 pandas: group a data frame by a column (such as name), then extract a number of rows for each group

There is a data frame called df, as follows:
name id age text
a 1 1 very good, and I like him
b 2 2 I play basketball with his brother
c 3 3 I hope to get a offer
d 4 4 everything goes well, I think
a 1 1 I will visit china
b 2 2 no one can understand me, I will solve it
c 3 3 I like followers
d 4 4 maybe I will be good
a 1 1 I should work hard to finish my research
b 2 2 water is the source of earth, I agree it
c 3 3 I hope you can keep in touch with me
d 4 4 My baby is very cute, I like him
The data frame is grouped by name, and then I want to extract a number of rows from each group by row index (for example, the first 2) into a new data frame, df_new:
name id age text
a 1 1 very good, and I like him
a 1 1 I will visit china
b 2 2 I play basketball with his brother
b 2 2 no one can understand me, I will solve it
c 3 3 I hope to get a offer
c 3 3 I like followers
d 4 4 everything goes well, I think
d 4 4 maybe I will be good
df_new = df.groupby('name')[0:2]
But this raises an error:
hash(key)
TypeError: unhashable type: 'slice'

Try using head() instead.
import pandas as pd
from io import StringIO
buff = StringIO('''
name,id,age,text
a,1,1,"very good, and I like him"
b,2,2,I play basketball with his brother
c,3,3,I hope to get a offer
d,4,4,"everything goes well, I think"
a,1,1,I will visit china
b,2,2,"no one can understand me, I will solve it"
c,3,3,I like followers
d,4,4,maybe I will be good
a,1,1,I should work hard to finish my research
b,2,2,"water is the source of earth, I agree it"
c,3,3,I hope you can keep in touch with me
d,4,4,"My baby is very cute, I like him"
''')
df = pd.read_csv(buff)
Using head() instead of [0:2], then sorting by name:
df_new = df.groupby('name').head(2).sort_values('name')
print(df_new)
name id age text
0 a 1 1 very good, and I like him
4 a 1 1 I will visit china
1 b 2 2 I play basketball with his brother
5 b 2 2 no one can understand me, I will solve it
2 c 3 3 I hope to get a offer
6 c 3 3 I like followers
3 d 4 4 everything goes well, I think
7 d 4 4 maybe I will be good

Another solution with iloc:
df_new = df.groupby('name').apply(lambda x: x.iloc[:2]).reset_index(drop=True)
print(df_new)
name id age text
0 a 1 1 very good, and I like him
1 a 1 1 I will visit china
2 b 2 2 I play basketball with his brother
3 b 2 2 no one can understand me, I will solve it
4 c 3 3 I hope to get a offer
5 c 3 3 I like followers
6 d 4 4 everything goes well, I think
7 d 4 4 maybe I will be good
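Not from the answers above, but one more option as a sketch: groupby().nth accepts a list of positions, so the first two rows of each group can also be taken without apply(). How the grouping column and the index come back differs between pandas versions, so check the output before relying on it.
# sketch: take rows 0 and 1 of every group with nth()
# (recent pandas keeps the original index and the 'name' column;
#  older versions may move 'name' into the index)
df_new = df.groupby('name').nth([0, 1])
print(df_new)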

Related

Choosing companies from a dataframe with monthly returns based on a company list from another dataframe

I'm currently writing my master's thesis and would like to calculate portfolio returns for a list of companies. To do this, I want to select, from the dataframe containing the monthly returns, the companies that appear in another dataframe.
The dataframe with the returns (mrt) looks like this: [screenshot not reproduced]
The second dataframe, with the company names I want to select by, looks like this: [screenshot not reproduced]
I tried something like this:
mrt.loc[mrt['Company Name'] == SL1['Company Name']], which just gives me the error 'Company Name'. I checked, and the spelling should be correct.
I also tried this:
mrt.loc[mrt == SL0['Company Name']]
This gives me a list of companies, but I also need the monthly returns from the dataframe mrt.
To recap: I want the rows from mrt whose company names appear in the dataframe SL0, and afterwards I need to do the same with other dataframes like SL0 but of different lengths.
Could someone help? Thank you very much and have a nice day.
isin() would resolve this issue.
df
###
Company Name value1
0 Apple 1
1 Google 2
2 Microsoft 3
3 Facebook 4
4 Tesla 5
5 Amazon 6
6 Alphabet 7
7 Oracle 8
8 IBM 9
9 Facebook 10
df2
###
value2 value3 Company Name
0 51 11 Apple
1 52 22 Google
2 35 33 Microsoft
3 54 44 Facebook
Selecting
df[df['Company Name'].isin(df2['Company Name'])]
###
Company Name value1
0 Apple 1
1 Google 2
2 Microsoft 3
3 Facebook 4
9 Facebook 10
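Applied to the frames from the question, the same pattern keeps every row of mrt, including the monthly-return columns, for the listed companies. This is only a sketch, assuming mrt and SL0 both have a 'Company Name' column spelled identically:
# keep mrt rows whose company appears in SL0; works unchanged for the other SL frames
mrt_selected = mrt[mrt['Company Name'].isin(SL0['Company Name'])]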

Python: obtaining the first observation according to its date [duplicate]

I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
'B': [4, 5, 2, 7, 4, 6],
'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment, I am grouping by column A, then creating a value that tells me which rows to keep:
a = df.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
df['id'] = df['A'].astype(str) + df['B'].astype(str)
df[df['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
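To see why this works (a quick check, not part of the original answer): groupby('A').B.idxmin() returns, for each group, the index label of the row with the smallest B, and .loc then pulls those full rows.
print(df.groupby('A').B.idxmin())
# A
# 1    2
# 2    4
# Name: B, dtype: int64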
I had a similar situation, but with a more complex column heading (e.g. "B val"), in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head (DataFrame here being pandas.DataFrame):
df.sort_values('B').groupby('A').apply(DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select n rows with smallest values in specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(DataFrame.head, n=1).reset_index(drop=True)
I found an answer that is a little more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First we will get the min values on a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then we merge this Series back onto the original data frame:
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can use sort_values and drop_duplicates (drop_duplicates keeps the first occurrence by default, so after sorting by B the surviving row in each A group is the one with the minimum B):
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
The solution is, as written before:
df.loc[df.groupby('A')['B'].idxmin()]
But if you then get an error like:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
it is probably caused by missing values. In my case, there were NaN values in column B, so I used dropna() and then it worked:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also use boolean indexing to keep the rows where column B equals its group minimum:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4
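For illustration (not part of the original answer): transform('min') broadcasts each group's minimum back onto the original row index, which is what makes the elementwise comparison with column B possible.
print(df.groupby('A')['B'].transform('min'))
# 0    2
# 1    2
# 2    2
# 3    4
# 4    4
# 5    4
# Name: B, dtype: int64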

Calculate intersecting sums from flat DataFrame for a heatmap

I'm trying to wrangle some data to show how many items a range of people have in common. The goal is to show this data in a heatmap format via Seaborn to understand these overlaps visually.
Here's some sample data:
demo_df = pd.DataFrame([
    ("Get Back", 1, 0, 2),
    ("Help", 5, 2, 0),
    ("Let It Be", 0, 2, 2)
], columns=["Song", "John", "Paul", "Ringo"])
demo_df.set_index("Song")
John Paul Ringo
Song
Get Back 1 0 2
Help 5 2 0
Let It Be 0 2 2
I don't need a breakdown by song, just the total of shared items. The resulting data would show a sum of how many items they share like this:
Name   John  Paul  Ringo
John   -     7     3
Paul   7     -     4
Ringo  3     4     -
So far I've tried a few options with groupby and unstack, but haven't been able to work out how to cross-match the names into both the row labels and the column headers.
You can compute this with two dot products and then zero out the diagonal (df here is demo_df with Song set as the index):
import numpy as np

df = demo_df.set_index("Song")
out = df.T.dot(df.ne(0)) + df.T.ne(0).dot(df)
np.fill_diagonal(out.values, 0)
out
Out[176]:
John Paul Ringo
John 0 7 3
Paul 7 0 4
Ringo 3 4 0
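Since the stated goal is a Seaborn heatmap, the resulting matrix can then be rendered directly (a sketch, assuming seaborn and matplotlib are installed):
import seaborn as sns
import matplotlib.pyplot as plt

# 'out' is the symmetric shared-items matrix computed above
sns.heatmap(out, annot=True, cmap='Blues')
plt.show()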

Count instances of number in Row

I have a sheet formatted somewhat like this:
Thing 5 6 7 Person 1 Person 2 Person 3
Thing 1 1 2 7 7 6
Thing 2 5 5
Thing 3 7 6 6
Thing 4 6 6 5
I am trying to find a query formula that I can place in the columns labeled 5,6,7 that will count the number of people who have that amount of Thing 1. For example, I filled out the Thing 1 row, showing that 1 person has 6 of Thing 1 and 2 people have 7 of Thing 1.
You can use the COUNTIF function.
The formula to write in the cells will look like this:
=COUNTIF(E2:G2;"=5")
For more information regarding this function, check the documentation: https://support.google.com/docs/answer/3093480?hl=en

How to match already-calculated means to the original data set?

I am now learning R. I feel that there is a very easy succinct answer to my problem, but I am having trouble solving it myself.
I have a large data set. One column contains various 'categories'. I aggregated these categories to get the mean for each one. So, right now, my aggregated table looks like this:
Category   Average
A          a
B          b
C          c
etc...
I now want to take these averages and join them as another column onto my original data.
So, I want it to look something like this:
Categories   Averages
B            b
A            a
B            b
C            c
B            b
C            c
In other words, I want to match each category with its corresponding mean. I have tried variations of merge(), match(), and different apply functions. The fact that my aggregated table is so much smaller than my original data is causing some problems.
Is there a specific function I can use for this simple problem? Thanks in advance.
In base R:
data <- data.frame(Category=c(rep("A",3), rep("B",4), rep("C",2)), Value=1:9)
> data
Category Value
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 B 6
7 B 7
8 C 8
9 C 9
> avg <- lapply(split(data$Value, data$Category), mean)
> avg
$A
[1] 2
$B
[1] 5.5
$C
[1] 8.5
> data$Averages <- avg[data$Category]
> data
Category Value Averages
1 A 1 2
2 A 2 2
3 A 3 2
4 B 4 5.5
5 B 5 5.5
6 B 6 5.5
7 B 7 5.5
8 C 8 8.5
9 C 9 8.5
For larger datasets, you can do this more efficiently with plyr, data.table, and similar packages.