Pivot table for non-numerical DataFrame in pandas

I am trying to create a chart from a large data set. The structure is as below.
Sample data frame:
df = pd.DataFrame({'climate':['hot','hot','hot','cold','cold'],0:['none','apple','apple','orange','grape'],1:['orange','none','grape','apple','banana'],2:['grape','kiwi','tomato','none','tomato']})
I need to plot how many of each fruit exist in each climate, as two separate charts: one for hot and one for cold.
A pivot table with aggregation is not possible because there are no numerical values.
What method do you recommend?

First melt, then use pd.crosstab:
s = df.melt('climate')
s = pd.crosstab(s.climate, s.value)
value    apple  banana  grape  kiwi  none  orange  tomato
climate
cold         1       1      1     0     1       1       1
hot          2       0      2     1     2       1       1
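From here each row of the crosstab is one climate, so one bar chart per climate is straightforward. A minimal sketch, assuming matplotlib is installed (and, as an assumption, that the 'none' placeholder should not be counted as a fruit):

import matplotlib.pyplot as plt

counts = s.drop(columns='none')  # 'none' marks a missing fruit, not a real one
for climate in counts.index:     # one chart for 'cold', one for 'hot'
    counts.loc[climate].plot.bar(title=climate)
    plt.show()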

incompatible index of inserted column with frame index with group by and count

I have data that looks like this:
  CHROM    POS REF ALT  ... is_sever_int is_sever_str is_sever_f encoding_str
0  chr1  14907   A   G  ...            1            1        one          one
1  chr1  14930   A   G  ...            1            1        one          one
These are the columns that I'm interested in performing calculations on (example):
is_severe  snp_id  encoding
        1       1       one
        1       1       two
        0       1       one
        1       2       two
        0       2       two
        0       2       one
What I want to do is count, for each snp_id / is_severe combination, how many ones and twos there are in the encoding column:
snp_id  is_severe  encoding_one  encoding_two
     1          1             1             1
     1          0             1             0
     2          1             0             1
     2          0             1             1
I tried this:
df.groupby(["snp_id","is_sever_f","encoding_str"])["encoding_str"].count()
but it gave the error:
incompatible index of inserted column with frame index
Then I tried this:
df["count"]=df.groupby(["snp_id","is_sever_f","encoding_str"],as_index=False)["encoding_str"].count()
and it returned:
Expected a 1D array, got an array with shape (2532831, 3)
How can I fix this? Thank you :)
Let's try groupby on all three columns, take the size of each group, then unstack the encoding level of the index:
out = (df.groupby(['is_severe', 'snp_id', 'encoding']).size()
         .unstack(fill_value=0)
         .add_prefix('encoding_')
         .reset_index())
print(out)
encoding  is_severe  snp_id  encoding_one  encoding_two
0                 0       1             1             0
1                 0       2             1             1
2                 1       1             1             1
3                 1       2             0             1
Try as follows:
- Use pd.get_dummies to convert the categorical data in column encoding into indicator variables.
- Chain df.groupby and sum to collapse the double rows per group into one row (i.e. [0,1] and [1,0] become [1,1] where df.snp_id == 2 and df.is_severe == 0).
res = pd.get_dummies(data=df, columns=['encoding'])\
    .groupby(['snp_id','is_severe'], as_index=False, sort=False).sum()
print(res)
   snp_id  is_severe  encoding_one  encoding_two
0       1          1             1             1
1       1          0             1             0
2       2          1             0             1
3       2          0             1             1
If your actual df has more columns, limit the assignment to the data parameter inside get_dummies. I.e. use:
res = pd.get_dummies(data=df[['is_severe', 'snp_id', 'encoding']],
                     columns=['encoding'])\
    .groupby(['snp_id','is_severe'], as_index=False, sort=False)\
    .sum()
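Incidentally, the "incompatible index of inserted column with frame index" error in the question comes from assigning a grouped (and therefore shorter) result back as a full-length column. If a per-row count is what's actually wanted, groupby(...).transform keeps the original index. A minimal sketch, assuming the simplified column names used above:

df['count'] = df.groupby(['snp_id', 'is_severe', 'encoding'])['encoding'].transform('size')
# transform('size') returns one value per original row, so the assignment aligns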

Pandas: Calculate value changes with diff based on condition

Expanding a bit on this question, I want to capture changes in values specifically when the previous column value is 0 or when the next column value is 0.
Given the following dataframe, tracking value changes from one column to the next using diff and aggregating these fluctuations in a new set of values is possible.
Item    Jan_20  Apr_20  Aug_20  Oct_20
Apple        3       4       4       4
Orange       5       5       1       2
Grapes       0       0       4       4
Berry        5       3       0       0
Banana       0       2       0       0
However, how would I capture such differences only when the value changes from one column to the next specifically from 0 or to 0, tracking those as new fruit or lost fruit, respectively?
Desired outcome:
Type         Jan_20  Apr_20  Aug_20  Oct_20
New Fruits        0       2       4       0
Lost Fruits       0       0       5       0
Put another way, in the example, since Grapes go from a value of 0 in Apr_20 to 4 in Aug_20, I want 4 to be captured and stored in New Fruits. Similarly, since Banana and Berry both go from a value higher than zero in Apr_20 to 0 in Aug_20, I want to aggregate those values in Lost Fruits.
How could this be achieved?
This can be achieved using masks to hide the non-relevant data, combined with diff and sum:
d = df.set_index('Item')
# mask to select values equal to zero
m = d.eq(0)
# difference from previous date
d = d.diff(axis=1)
out = pd.DataFrame({'New' : d.where(m.shift(axis=1)).sum(),
                    'Lost': -d.where(m).sum()}
                   ).T
Output:
      Jan_20  Apr_20  Aug_20  Oct_20
New        0       2       4       0
Lost       0       0       5       0
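To reproduce the row labels from the desired outcome exactly, the result can be relabelled afterwards; a small follow-up sketch, reusing out from above:

out = (out.rename(index={'New': 'New Fruits', 'Lost': 'Lost Fruits'})
          .rename_axis('Type')
          .reset_index())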

How to split a pandas dataframe into multiple dataframes (keeping rows together) based upon a column's value

My problem is similar to the "split a dataframe into chunks of N rows" problem, except that the number of rows in each chunk will be different. I have a dataframe as such:
A  B  C
1  2  0
1  2  1
1  2  2
1  2  0
1  2  1
1  2  2
1  2  3
1  2  4
1  2  0
A and B are just filler; don't pay attention to them. Column C, though, starts at 0 and increments with each row until it suddenly resets to 0. So in the dataframe above the first 3 rows form a new dataframe, the next 5 rows form a second new dataframe, and this continues as my dataframe gains more and more rows.
To finish off the question,
dfs = [x for _, x in df.groupby(df['C'].eq(0).cumsum())]
groups the rows (a new group starts every time C resets to 0), and with this groupby I can select each subgroup as a separate dataframe.
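On the sample frame above, this yields three chunks of 3, 5 and 1 rows; a quick check:

for i, chunk in enumerate(dfs):
    print(f'chunk {i}: {len(chunk)} rows')
# chunk 0: 3 rows
# chunk 1: 5 rows
# chunk 2: 1 rows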

Dataframe apply set is not removing duplicate values

My dataset can sometimes include duplicates in one concatenated column like this:
Total
0 Thriller,Satire,Thriller
1 Horror,Thriller,Horror
2 Mystery,Horror,Mystery
3 Adventure,Horror,Horror
When doing this
df['Total'].str.split(",").apply(set)
I get
Total
0  {Thriller, Satire}
1  {Horror, Thriller}
2  {Mystery, Horror}
3  {Adventure, Horror}
And after encoding it with
df['Total'].str.get_dummies(sep=",")
I get a header looking like this:
{'Horror  {'Mystery  {'Thriller ... Horror  Thriller'}
instead of
Horror  Mystery  Thriller
How do I get rid of the curly brackets when using a pandas DataFrame?
The method Series.str.get_dummies handles duplicates nicely too, so omit the code that builds sets of unique values:
df['Total'] = df['Total'].str.split(",").apply(set)
and use only:
df1 = df['Total'].str.get_dummies(sep=",")
print (df1)
   Adventure  Horror  Mystery  Satire  Thriller
0          0       0        0       1         1
1          0       1        0       0         1
2          0       1        1       0         0
3          1       1        0       0         0
But if you do need to remove the duplicates, add Series.str.join:
df1 = df['Total'].str.split(",").apply(set).str.join(',').str.get_dummies(sep=",")
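Both routes give the same indicator frame, since get_dummies only records presence or absence per row. A quick sanity check, assuming the sample column of strings above:

direct  = df['Total'].str.get_dummies(sep=',')
deduped = df['Total'].str.split(',').apply(set).str.join(',').str.get_dummies(sep=',')
print(direct.equals(deduped))  # True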

Assigning one column to another column between pandas DataFrames (like vector to vector assignment)

I have a super strange problem which I spent the last hour trying to solve, but with no success. It is even more strange since I can't replicate it on a small scale.
I have a large DataFrame (150,000 entries). I took out a subset of it and did some manipulation. The subset was saved as a different variable, x.
x is smaller than df, but its index falls within the same range as df's. I'm now trying to assign x back to the DataFrame, replacing values in the same column:
rep_Callers['true_vpID'] = x.true_vpID
This inserts all the values from x in the right places in df, but instead of keeping the df.true_vpID values that are not in x, it fills them with NaNs. So I tried a different approach:
df.ix[x.index,'true_vpID'] = x.true_vpID
But instead of filling x's values into the right places in df, df.true_vpID gets filled with the first value of x, and only that one! I changed the first value of x several times to make sure this is indeed what is happening, and it is. I tried to replicate it on a small scale, but the problem did not reproduce:
from random import random
from numpy import ones
from pandas import DataFrame, Series

df = DataFrame({'a':ones(5),'b':range(5)})
   a  b
0  1  0
1  1  1
2  1  2
3  1  3
4  1  4
z = Series([random() for i in range(5)], index=range(5))
0    0.812561
1    0.862109
2    0.031268
3    0.575634
4    0.760752
df.ix[z.index[[1,3]],'b'] = z[[1,3]]
   a         b
0  1  0.000000
1  1  0.862109
2  1  2.000000
3  1  0.575634
4  1  4.000000
I really tried it all, need some new suggestions...
Try using df.update(updated_df_or_series). It aligns on the index (and column names) and overwrites only the matching entries, leaving the rest of the column intact.
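A minimal sketch of that, with made-up values standing in for the real data:

import pandas as pd

df = pd.DataFrame({'true_vpID': [1.0, 2.0, 3.0, 4.0, 5.0]})
x = pd.Series([20.0, 40.0], index=[1, 3], name='true_vpID')

df.update(x)  # aligns on index; only rows 1 and 3 are overwritten
print(df)
#    true_vpID
# 0        1.0
# 1       20.0
# 2        3.0
# 3       40.0
# 4        5.0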
You can also modify a DataFrame by doing an index query and assigning to it in a single .loc step (assigning through a separately saved slice is unreliable, since the slice may be a copy of the original):
df_1
   a  b
0  1  0
1  1  1
2  1  2
3  1  3
4  1  4
df_1.loc[3:5, 'b'] = df_1.loc[3:5, 'b'] + 2
df_1
   a  b
0  1  0
1  1  1
2  1  2
3  1  5
4  1  6