I have the following dataframe with one column representing IDs (the same ID can appear several times in the column) and another one representing an occurrence of a category for this ID. Each category can have several occurrences per ID.
id category
1234 happy
4567 sad
8910 medium
...............
1234 happy
4567 medium
I would like to pivot this table to get the following
id happy sad medium
1234 2 0 0
4567 0 1 1
8910 0 0 1
I've tried the following
df.pivot_table(index= "id", columns = "category", aggfunc= 'count', fill_value = 0)
But it only returns the IDs as the index.
Could anyone help?
You can use pd.crosstab:
print (pd.crosstab(df["id"], df["category"]))
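For reference, a minimal reproducible sketch; the frame below is an assumed reconstruction of the sample rows shown in the question:
import pandas as pd

# sample data: 1234 is "happy" twice, 4567 is "sad" and "medium", 8910 is "medium"
df = pd.DataFrame({"id": [1234, 4567, 8910, 1234, 4567],
                   "category": ["happy", "sad", "medium", "happy", "medium"]})

print (pd.crosstab(df["id"], df["category"]))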
If you want to stick with pivot_table, you need to add an extra column as value:
print (df.assign(value=0).pivot_table(index="id", columns="category", values="value", aggfunc='count', fill_value=0))
category happy medium sad
id
1234 2 0 0
4567 0 1 1
8910 0 1 0
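Another equivalent sketch, if preferred, counts with groupby/size and reshapes with unstack (same df as above):
# one row per (id, category) count, reshaped so categories become columns
print (df.groupby(["id", "category"]).size().unstack(fill_value=0))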
Related
Expanding a bit on this question, I want to capture changes in values specifically when the previous column value is 0 or when the next column value is 0.
Given the following dataframe, it is possible to track value changes from one column to the next using diff and to aggregate these fluctuations into a new set of values.
Item Jan_20 Apr_20 Aug_20 Oct_20
Apple 3 4 4 4
Orange 5 5 1 2
Grapes 0 0 4 4
Berry 5 3 0 0
Banana 0 2 0 0
However, if I were to capture such differences only when the value changes from one column to the next specifically from 0 or to 0, and track those as new fruit or lost fruit respectively, how would I do that?
Desired outcome:
Type Jan_20 Apr_20 Aug_20 Oct_20
New Fruits 0 2 4 0
Lost Fruits 0 0 5 0
Put another way, in the example, since Grapes go from a value of 0 in Apr_20 to 4 in Aug_20, I want 4 to be captured and stored in New Fruits. Similarly, since Banana and Berry both go from a value higher than zero in Apr_20 to 0 in Aug_20, I want to aggregate those values in Lost Fruits.
How could this be achieved?
This can be achieved using masks to hide the non-relevant data, combined with diff and sum:
d = df.set_index('Item')
# mask to select values equal to zero
m = d.eq(0)
# difference from previous date
d = d.diff(axis=1)
out = pd.DataFrame({'New' : d.where(m.shift(axis=1)).sum(),
                    'Lost': -d.where(m).sum()}
                   ).T
Output:
Jan_20 Apr_20 Aug_20 Oct_20
New 0 2 4 0
Lost 0 0 5 0
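For reference, a runnable sketch of the same approach, with the sample data reconstructed from the question and the rows labelled as in the desired output:
import pandas as pd

df = pd.DataFrame({'Item': ['Apple', 'Orange', 'Grapes', 'Berry', 'Banana'],
                   'Jan_20': [3, 5, 0, 5, 0],
                   'Apr_20': [4, 5, 0, 3, 2],
                   'Aug_20': [4, 1, 4, 0, 0],
                   'Oct_20': [4, 2, 4, 0, 0]})

d = df.set_index('Item')
m = d.eq(0)            # True where a value is 0
d = d.diff(axis=1)     # month-over-month change

out = pd.DataFrame({'New Fruits': d.where(m.shift(axis=1)).sum(),   # increases that start from 0
                    'Lost Fruits': -d.where(m).sum()}               # decreases that end at 0
                   ).T.astype(int)
print(out)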
I want to calculate the frequency for each row of data. For instance:
column_nameA | column_nameB | column_nameC | title | content
AAA company | AAA | Ben Simons | AAA company has new product launch. | AAA company has released new product. AAA claims that the product X has significant changed than before. Ben Simons, who is AAA company CEO, also mentioned.......
BBB company | BBB | Alex Wong | AAA company has new product launch. | AAA company has released new product. BBB claims that the product X has significant changed than before, and BBB company has invested around 1 millions…....
Here, the result I expect is:
If AAA company appears once in the title, the count is 1; if AAA company appears twice in the title, it should count as 2.
The same idea applies to the content: if AAA company appears once, the count shows 1; if it appears twice, it should count as 2.
However, even if AAA company appears in the second row, that row should only consider BBB company or BBB, not AAA company or AAA.
So, the result would be like:
nameA_appear_in_title  nameB_appear_in_title  nameC_appear_in_title  nameA_appear_in_content  nameB_appear_in_content  nameC_appear_in_content
1                      1                      0                      2                        1                        1
0                      0                      0                      1                        1                        0
All the data is stored in a dataframe, and I hope this can be done with pandas.
One more thing to highlight: the title and content cannot be tokenized to count the frequency.
Use itertools.product for all combinations of the lists of column names and create new columns with the counts; last, remove the original columns if necessary:
from itertools import product

cols = df.columns
L1 = ['column_nameA', 'column_nameB', 'column_nameC']
L2 = ['title', 'content']

for a, b in product(L2, L1):
    df[f'{b}_{a}'] = df.apply(lambda x: x[a].count(x[b]), axis=1)

df = df.drop(cols, axis=1)
print (df)
column_nameA_title column_nameB_title column_nameC_title \
0 1 1 0
1 0 0 0
column_nameA_content column_nameB_content column_nameC_content
0 2 3 1
1 1 2 0
Last, if necessary, subtract the column_nameA counts from the column_nameB counts:
cola = df.columns.str.startswith('column_nameA')
colb = df.columns.str.startswith('column_nameB')
df.loc[:, colb] = df.loc[:, colb] - df.loc[:, cola].to_numpy()
print (df)
column_nameA_title column_nameB_title column_nameC_title \
0 1 0 0
1 0 0 0
column_nameA_content column_nameB_content column_nameC_content
0 2 1 1
1 1 1 0
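If DataFrame.apply feels slow on a larger frame, an equivalent sketch (same assumed column names, applied before the original columns are dropped) counts the pairs with a plain zip instead:
from itertools import product

L1 = ['column_nameA', 'column_nameB', 'column_nameC']
L2 = ['title', 'content']

for a, b in product(L2, L1):
    # count occurrences of the name in column b inside the text in column a, row by row
    df[f'{b}_{a}'] = [text.count(name) for text, name in zip(df[a], df[b])]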
I need to organize a large df, adding up the values of a column by an ID column (the ID is not sequential), keeping the columns of the df that have repeated values within an ID and excluding the columns that have different values within an ID. Below I inserted a reproducible example and the output I need. I think there is a simple way to do that, but I am not so familiar with R.
df=read.table(textConnection("
ID spp effort generalist specialist
1 a 10 1 0
1 b 10 1 0
1 c 10 0 1
1 d 10 0 1
2 a 16 1 0
2 b 16 1 0
2 e 16 0 1
"), header = TRUE)
The output I need:
ID effort generalist specialist
1 10 2 2
2 16 2 1
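A minimal sketch in base R (assuming, as in the example, that effort is constant within each ID; spp is dropped because it varies within an ID):
# sum the category indicators per ID while keeping the constant effort column
aggregate(cbind(generalist, specialist) ~ ID + effort, data = df, FUN = sum)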
Say I have a pandas dataframe df1 as follows:
OpDay Rid Tid Sid Dist
0 18Sep 1 1 1 10
1 18Sep 1 1 1 15
2 18Sep 1 1 1 20
3 18Sep 1 5 4 5
4 18Sep 1 5 4 50
and df2 like:
S_Day R_ID T_ID S_ID ABC XYZ
0 18Sep 1 1 1 100 60
1 18Sep 1 5 4 125 100
Number of rows in df2 is equal to total number of unique combinations of OpDay+Rid+Tid+Sid in df1.
Now, I want the values of columns ABC and XYZ from df2 corresponding to each unique combination. But I don't want to store these values in df1; I just need them for some computation, and then I want to store the result in df2 by creating a new column.
To summarize, let's say I want to do some computation using df1.Dist[3], for which I also need values from columns df2.ABC and df2.XYZ, so first find the row index in df2 where
S_Day = OpDay[3],
R_ID = Rid[3],
T_ID = Tid[3] and
S_ID = Sid[3]
(in this case it is row #1), then use df2.ABC[1] and df2.XYZ[1] and store the result in df2.RESULT[1].
So now df2 will look something like:
S_Day R_ID T_ID S_ID ABC XYZ RESULT
0 18Sep 1 1 1 100 60 NaN
1 18Sep 1 5 4 125 100 some computed value
Basically, I guess I need some kind of lookup function, but I don't know how to proceed further.
Please help as I am new to the world of python and programming. Many thanks in advance.
You can use .loc and Boolean indices to do what you want. Let's say that you're after the ith row of df1:
i = 3
Next, you can use Boolean indexing to find the corresponding rows in df2:
bool_index = ((df1.loc[i, 'OpDay'] == df2.loc[:, 'S_Day']) &
              (df1.loc[i, 'Rid'] == df2.loc[:, 'R_ID']) &
              (df1.loc[i, 'Tid'] == df2.loc[:, 'T_ID']) &
              (df1.loc[i, 'Sid'] == df2.loc[:, 'S_ID']))
You might want to include a check to verify that you found one and only one combination:
sum(bool_index) == 1
And finally, you can use the boolean index to call the right values from df2:
ABC_for_computation = df2.loc[bool_index, 'ABC']
XYZ_for_computation = df2.loc[bool_index, 'XYZ']
Note that I'm not too sure about the speed of this operation if you have large datasets. In my experience, if speed is affected you should switch to numpy arrays instead of dataframes, particularly when writing data into your dataframe.
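If the same lookup has to be done for many rows of df1, a merge-based sketch may scale better. The frames below are assumed reconstructions of the samples, and the formula for RESULT is only a placeholder:
import pandas as pd

df1 = pd.DataFrame({"OpDay": ["18Sep"] * 5,
                    "Rid": [1, 1, 1, 1, 1],
                    "Tid": [1, 1, 1, 5, 5],
                    "Sid": [1, 1, 1, 4, 4],
                    "Dist": [10, 15, 20, 5, 50]})
df2 = pd.DataFrame({"S_Day": ["18Sep", "18Sep"],
                    "R_ID": [1, 1],
                    "T_ID": [1, 5],
                    "S_ID": [1, 4],
                    "ABC": [100, 125],
                    "XYZ": [60, 100]})

# attach ABC and XYZ to every row of df1 by matching the four key columns
merged = df1.merge(df2,
                   left_on=["OpDay", "Rid", "Tid", "Sid"],
                   right_on=["S_Day", "R_ID", "T_ID", "S_ID"],
                   how="left")

# placeholder per-row computation (replace with the real formula)
merged["RESULT"] = merged["Dist"] * merged["ABC"] / merged["XYZ"]

# aggregate the per-row results back onto df2, one value per key combination
res = merged.groupby(["S_Day", "R_ID", "T_ID", "S_ID"], as_index=False)["RESULT"].sum()
df2 = df2.merge(res, on=["S_Day", "R_ID", "T_ID", "S_ID"], how="left")
print(df2)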
I have the below dataframe, which contains sample values like:
df = pd.DataFrame([["London", "Cambridge", 20], ["Cambridge", "London", 10], ["Liverpool", "London", 30]], columns= ["city_1", "city_2", "id"])
city_1 city_2 id
London Cambridge 20
Cambridge London 10
Liverpool London 30
I need the output dataframe as below, which is built by joining the 2 city columns together and applying one-hot encoding after that:
id London Cambridge Liverpool
20 1 1 0
10 1 1 0
30 1 0 1
Currently, I am using the below code, which works on one column at a time. Could you please advise if there is a more pythonic way to get the above output?
output_df = pd.get_dummies(df, columns=['city_1', 'city_2'])
which results in
id city_1_Cambridge city_1_London and so on columns
You can add the parameters prefix_sep and prefix to get_dummies and then use max if you want only 1 or 0 values (dummy/indicator columns), or sum if you need to count the 1 values:
output_df = (pd.get_dummies(df, columns=['city_1', 'city_2'], prefix_sep='', prefix='')
               .max(axis=1, level=0))
print (output_df)
id Cambridge Liverpool London
0 20 1 0 1
1 10 1 0 1
2 30 0 1 1
Or, if you want to process all columns except id, first convert the non-processed column(s) to the index with DataFrame.set_index, then use get_dummies with max, and last add DataFrame.reset_index:
output_df = (pd.get_dummies(df.set_index('id'), prefix_sep='', prefix='')
               .max(axis=1, level=0)
               .reset_index())
print (output_df)
id Cambridge Liverpool London
0 20 1 0 1
1 10 1 0 1
2 30 0 1 1
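Note that recent pandas versions removed the level argument of DataFrame.max, so the calls above may raise an error there. A sketch of the same collapse for newer pandas (assuming the same sample df), grouping the transposed dummies by column name instead:
dummies = pd.get_dummies(df.set_index('id'), prefix_sep='', prefix='')
# duplicate column names (one block per original city column) are collapsed
# by grouping the transposed frame on its index, i.e. on the city names
output_df = dummies.T.groupby(level=0).max().T.astype(int).reset_index()
print (output_df)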