How to sum values of two columns by an ID column, keeping some columns with repeated values and excluding others? - dataframe

I need to organize a large df adding values of a column by a column ID (the ID is not sequencial), keeping some columns of the df that have repeated values by ID and excluding column that have different values by ID. Below I inserted a reproducible example and the output I need. I think there is a simple way to do that, but I am not soo familiar with R.
df=read.table(textConnection("
ID spp effort generalist specialist
1 a 10 1 0
1 b 10 1 0
1 c 10 0 1
1 d 10 0 1
2 a 16 1 0
2 b 16 1 0
2 e 16 0 1
"), header = TRUE)
The output I need:
ID effort generalist specialist
1 10 2 2
2 16 2 1

Related

How to indicate count of values in categorical column in Pandas, Python?

I have the following Pandas DataFrame:
ID CAT
1 A
1 B
1 A
2 A
2 B
2 A
1 B
1 A
I'd like to have a table that indicates the number of occurance per CAT values for each ID in different columns like this:
ID CAT_A_NUM CAT_B_NUM
1 3 2
2 2 1
I tried in many ways, like this one with pivot table, but unsuccessfully:
df.pivot_table(values='CAT', index='ID', columns='CAT', aggfunc='count')
you can use crosstab():
df=pd.DataFrame(data={'ID':[1,1,1,2,2,2,1,1],'CAT':['A','B','A','A','B','A','B','A']})
final = pd.crosstab(df['ID'], df['CAT'])
final.columns=['CAT_A_NUM','CAT_B_NUM']
final
ID CAT_A_NUM CAT_B_NUM
1 3 2
2 2 1
Probably you can use groupby + unstack
df.groupby(["ID","CAT"]).size().unstack()
which gives
CAT A B
ID
1 3 2
2 2 1

best way to generate rows based on other rows in pandas at a big file

I have a csv with around 8 million of rows, something like that:
a b c
0 2 3
and I wanted to generate from it new rows based on the second and the third value so I will get:
a b c
0 2 3
0 3 3
0 4 3
0 5 3
which is basically just itereating through every row(in this example one row), and then creating a new row with a value of b+i, where i is between 0 to the value of c including c itself.
c column is irelevant after the rows have been generated, problem is that it has million of rows, and doing that might generate many rows, so how can I do it efficenly? (loops are too slow for that amount of data).
thanks
You can reindex on the repeated index:
out = df.loc[df.index.repeat(df['c']+1)]
out['b'] += out.groupby(level=0).cumcount()
print(out)
Output (reset index if you want):
a b c
0 0 2 3
0 0 3 3
0 0 4 3
0 0 5 3
Note since you blow your data up by the c column and you already have 8 million rows, your new dataframe can be too big on its own.

How to keep only the last index in groups of rows where a condition is met in pandas?

I have the following dataframe:
d = {'value': [1,1,1,1,1,1,1,1,1,1], 'flag_1': [0,1,0,1,1,1,0,1,1,1],'flag_2':[1,0,1,1,1,1,1,0,1,1],'index':[1,2,3,4,5,6,7,8,9,10]}
df = pd.DataFrame(data=d)
I need to perform the following filter on it:
If flag 1 and flag 2 are equal keep the row with the maximum index from the consecutive indices. Below for rows 4,5,6 and rows 9,10 flag 1 and flag 2 are equal. From the group of consecutive indices 4,5,6 therefore I wish to keep only row 6 and drop rows 4 and 5. For the next group of rows 9 and 10 I wish to keep only row 10. The rows where flag 1 and 2 are not equal should all be retained. I want my final output to look as shown below:
I am really not sure how to achieve what is required so I would be grateful for any advice on how to do it.
IIUC, you can compare consecutive rows with shift. This solution requires a sorted index.
In [5]: df[~df[['flag_1', 'flag_2']].eq(df[['flag_1', 'flag_2']].shift(-1)).all(axis=1)]
Out[5]:
value flag_1 flag_2 index
0 1 0 1 1
1 1 1 0 2
2 1 0 1 3
5 1 1 1 6
6 1 0 1 7
7 1 1 0 8
9 1 1 1 10

Pandas: keep the first three rows containing a value for each unique value [duplicate]

Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it with numbering records within group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there more effective/elegant approach to do this? And also is there more elegant approach to number records within each group (like SQL window function row_number()).
Did you try
df.groupby('id').head(2)
Output generated:
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
Sometimes sorting the whole data ahead is very time consuming.
We can groupby first and doing topk for each group:
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk,['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x : x.sort_values(by = 'value', ascending = False).head(2).reset_index(drop = True))
Here sort values ascending false gives similar to nlargest and True gives similar to nsmallest.
The value inside the head is the same as the value we give inside nlargest to get the number of values to display for each group.
reset_index is optional and not necessary.
This works for duplicated values
If you have duplicated values in top-n values, and want only unique values, you can do like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, for the audit department we get top 3 salaries as 110k,100k and 100k.
If we want to have not-duplicated salaries per each department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than groupby().apply() and groupby().nlargest() calls as suggested in the other answers on here(1, 2, 3). On a sample with 100k rows and 8000 groups, a %timeit test showed that it was 24-150 times faster than those solutions.
Also, instead of slicing, you can also pass a list/tuple/range to a .nth() call:
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])

How to calculate the number of pairs in an Excel spreadsheet?

I have two columns of integers between 1 and 16 in an excel file. I'd like to count the number of pairs of integers in these columns. There are 256 cases and I'd like to have a column which tells me how many pairs exist for each case. For instance, I have a table like below:
1 2
1 1
1 3
1 4
1 1
1 8
1 1
16 16
1 2
...
And I'd like to calculate a column like this:
3 (number of 1 1s)
2 (number of 1 2s)
1 (number of 1 3s)
1 (number of 1 4s)
0 (number of 1 5s)
0 (number of 1 6s)
0 (number of 1 7s)
1 (number of 1 8s)
...
1 (number of 16 16s)
I'd appreciate if someone can help me with the calculation.
First you need to create two columns with all possible combinations:
1 1
1 2
1 3
...
2 1
2 2
...
16 16
Let's assume these are in columns C,D and your data are in columns A, B, in rows 1 to 1000. Then you can use an array formula:
=SUM(IF(($A$1:$A$1000=C1)*($B$1:$B$1000=D1);1;0))
You must press Shift+Ctrl+Enter when entering array formula.