Turn a MultiIndex Series into a pivot-table layout of unique value counts - pandas

Sample Data:
Date,code
06/01/2021,405
06/01/2021,405
06/01/2021,400
06/02/2021,200
06/02/2021,300
06/03/2021,500
06/02/2021,500
06/03/2021,300
06/05/2021,500
06/04/2021,500
06/03/2021,400
06/02/2021,400
06/04/2021,400
06/03/2021,400
06/01/2021,400
06/04/2021,200
06/05/2021,200
06/02/2021,200
06/06/2021,300
06/04/2021,300
06/06/2021,300
06/05/2021,400
06/03/2021,400
06/04/2021,400
06/04/2021,500
06/01/2021,200
06/02/2021,300
import pandas as pd
df = pd.read_csv("testfile.csv")
code_total = df.groupby(by="Date")['code'].value_counts()
print(code_total)
Date        code
06/01/2021  400     2
            405     2
            200     1
06/02/2021  200     2
            300     2
            400     1
            500     1
06/03/2021  400     3
            300     1
            500     1
06/04/2021  400     2
            500     2
            200     1
            300     1
06/05/2021  200     1
            400     1
            500     1
06/06/2021  300     2
dates = set([x[0] for x in code_total.index])
codes = set([x[1] for x in code_total.index])
test = pd.DataFrame(code_total, columns=sorted(codes), index=sorted(dates))
print(test)
Is there a way to transpose the second index level into columns and retain the counts as values? Ultimately I'm trying to plot the count of unique error codes on a line graph. I've tried many different approaches but am always missing something. Any help would be appreciated.

Use Series.unstack:
df = df.groupby(by="Date")['code'].value_counts().unstack(fill_value=0)
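For the sample data above, this should yield one column per code, with missing date/code combinations filled with 0 (expected result, reconstructed from the counts shown earlier):

code        200  300  400  405  500
Date
06/01/2021    1    0    2    2    0
06/02/2021    2    2    1    0    1
06/03/2021    0    1    3    0    1
06/04/2021    1    1    2    0    2
06/05/2021    1    0    1    0    1
06/06/2021    0    2    0    0    0

Since the goal is a line graph of counts per code, DataFrame.plot draws one line per column, so df.plot() on the unstacked frame should be enough (assuming matplotlib is installed).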

Related

How to iterate over an entire dataframe in pandas

I'm trying to do something similar to this...
My intention is to write a for loop in pandas that iterates over the whole dataframe and filters all the rows whose values are greater than four. When the condition is satisfied, it should give me a new column with the column name and the ID, something like this (the OUTPUT column):
I'm trying with this code, but it doesn't work:
list = []
for col in df.columns:
    for row in df[col]:
        if row > 4:
            list.append(df(row).index, col)
Could somebody help me? Thank you so much.
Here is an approach with pandas.DataFrame.loc and pandas.Series.ge:
collected_vals = []
for col in df.filter(like="X").columns:
    collected_vals.append(df.loc[df[col].ge(4), "ID"].astype(str).radd(f"{col}, "))

# if a list is needed
from itertools import chain
l = list(chain(*[ser.tolist() for ser in collected_vals]))

# if a Series is needed
ser = pd.concat(collected_vals, ignore_index=True)

# if a DataFrame is needed
out_df = pd.concat(collected_vals, ignore_index=True).to_frame("OUTPUT")
# Output
print(out_df)
      OUTPUT
0  X40, 1100
1  X40, 1200
2   X50, 700
3   X50, 800
4   X50, 900
Input used:
print(df)
   X40  X50    ID
0    1    5   700
1    2    6   800
2    1    8   900
3    3    2  1000
4    4    3  1100
5    6    1  1200
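For comparison, a vectorized sketch with DataFrame.melt (same assumed input, hypothetical variable names) builds the same strings without looping over the columns:

# long form: one row per (ID, column, value)
long = df.melt(id_vars="ID", value_vars=["X40", "X50"], var_name="col", value_name="val")
# keep rows whose value is >= 4, then build the "col, ID" strings
hits = long[long["val"].ge(4)]
out = (hits["col"] + ", " + hits["ID"].astype(str)).reset_index(drop=True)
print(out)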

How to calculate the leftovers of each balance top-up using the first-in-first-out technique?

Imagine we have user balances. There's a table with top-ups and withdrawals. Let's call it balance_updates.
transaction_id  user_id  current_balance  amount  created_at
1               1        100              100     ...
2               1        0                -100
3               2        400              400
4               2        300              -100
5               2        200              -200
6               2        300              100
7               2        50               -50
What I want to get out of this is a list of top-ups and their leftovers, using the first-in-first-out technique for each user.
So the result would look like this:
top_up  user_id  leftover
1       1        0
3       2        50
6       2        100
Honestly, I'm struggling to turn this into SQL, though I know how to do it on paper. Got any ideas?
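The question asks for SQL, but here is a minimal Python sketch of the first-in-first-out allocation (hypothetical data mirroring the table above; transactions are assumed to be ordered by created_at), just to make the paper logic concrete:

from collections import defaultdict

# (transaction_id, user_id, amount), ordered by created_at
transactions = [
    (1, 1, 100), (2, 1, -100),
    (3, 2, 400), (4, 2, -100), (5, 2, -200), (6, 2, 100), (7, 2, -50),
]

# per user, a FIFO queue of [top_up_id, leftover]
queues = defaultdict(list)
for tx_id, user_id, amount in transactions:
    q = queues[user_id]
    if amount > 0:
        q.append([tx_id, amount])   # top-up: enqueue with its full amount left
    else:
        remaining = -amount         # withdrawal: drain the oldest top-ups first
        for entry in q:
            used = min(entry[1], remaining)
            entry[1] -= used
            remaining -= used
            if remaining == 0:
                break

for user_id, q in queues.items():
    for top_up_id, leftover in q:
        print(top_up_id, user_id, leftover)   # 1 1 0 / 3 2 50 / 6 2 100

In SQL this usually becomes a window-function computation (running sums of top-ups and withdrawals per user), but the exact query depends on the dialect.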

Pandas - Pivot/stack/unstack/melt

I have a dataframe that looks like this:
name  value 1  value 2
A     100      101
A     100      102
A     100      103
B     200      201
B     200      202
B     200      203
C     300      301
C     300      302
C     300      303
And I'm trying to get to this:
name  value 1  value 2  value 3  value 4  value 5  value 6
A     100      101      100      102      100      103
B     200      201      200      202      200      203
C     300      301      300      302      300      303
Here is what I have tried so far:
dataframe.stack()
dataframe.unstack()
dataframe.melt(id_vars=['name'])
I need to transpose the data while ensuring that:
The first row for each name remains as it is, but every subsequent value associated with the same name is transposed into a new column.
The values for the second name, B (for example), should be transposed into the same value columns used for A; they should not form a separate set of columns altogether.
Try:

def fn(x):
    vals = x.values.ravel()
    return pd.DataFrame(
        [vals],
        columns=[f"value {i}" for i in range(1, vals.shape[0] + 1)],
    )

out = (
    df.set_index("name")
    .groupby(level=0)
    .apply(fn)
    .reset_index()
    .drop(columns="level_1")
)
print(out.to_markdown())
Prints:
   name  value 1  value 2  value 3  value 4  value 5  value 6
0  A         100      101      100      102      100      103
1  B         200      201      200      202      200      203
2  C         300      301      300      302      300      303
Flatten the values for each name:
(
    df.set_index('name')
    .groupby(level=0)
    .apply(lambda x: pd.Series(x.values.flat))
    .rename(columns=lambda x: f'value {x + 1}')
    .reset_index()
)
One option using melt, groupby, and pivot_wider (from pyjanitor):
# pip install pyjanitor
import pandas as pd
import janitor

(df
 .melt('name', ignore_index=False)
 .sort_index()
 .drop(columns='variable')
 .assign(header=lambda df: df.groupby('name').cumcount() + 1)
 .pivot_wider('name', 'header', names_sep=' ')
)
  name  value 1  value 2  value 3  value 4  value 5  value 6
0    A      100      101      100      102      100      103
1    B      200      201      200      202      200      203
2    C      300      301      300      302      300      303
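For readers without pyjanitor, a rough pure-pandas sketch of the same chain (hypothetical variable names) could be:

# long format: one row per (name, value), original row order preserved
long = df.melt('name', ignore_index=False).sort_index().drop(columns='variable')
# running position of each value within its name group
long['header'] = long.groupby('name').cumcount() + 1
# back to wide: one column per position
out = long.pivot(index='name', columns='header', values='value')
out.columns = [f'value {c}' for c in out.columns]
out = out.reset_index()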

Replacing -999 with a number, but I want all replaced numbers to be different

I have a pandas DataFrame named df, and in the df['salary'] column there are 400 values all represented by the same number, -999. I want to replace each -999 with a number between 200 and 500, and I want all 400 replaced values to be different from each other. So far I have written this code:
df['salary'] = df['salary'].replace(-999, random.randint(200, 500))
but it replaces every -999 with the same value, because random.randint is evaluated once before replace runs. How can I do this?
You can use Series.mask with np.random.randint:
import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [0, 1, 2, 3, 4, 5, -999, -999, -999, 1, 3, 5, -999]})
df['salary'] = df["salary"].mask(df["salary"].eq(-999), np.random.randint(200, 500, size=len(df)))
print (df)
    salary
0        0
1        1
2        2
3        3
4        4
5        5
6      413
7      497
8      234
9        1
10       3
11       5
12     341
If you want non-repeating numbers instead:
# shuffle the distinct values 200..499, then align with df by position
s = pd.Series(range(200, 500)).sample(frac=1).reset_index(drop=True)
df['salary'] = df["salary"].mask(df["salary"].eq(-999), s)
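Alternatively, a sketch that samples without replacement via numpy (assuming the number of -999 entries does not exceed the 300 available values):

mask = df["salary"].eq(-999)
# draw as many distinct values from 200..499 as there are -999 rows
df.loc[mask, "salary"] = np.random.choice(np.arange(200, 500), size=mask.sum(), replace=False)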

Creating a new column within a MultiIndex

I have a simple dataframe:
       A         B
       1    2    1    2
Foo  100  200  300  400
Bar  100  200  300  400
I want to add a new column which is (B,2) - (A,2)
What I've tried is:
df["Chg","Period"] = [df.loc[:, [("B",2)]] - df.loc[:, [("A", 2)]]]
But I'm told that:
Length of values does not match length of index
I'm a bit confused - I thought that by having two column headers for my new column it would work, but I'm now struggling. Any help would be most appreciated. Thanks!
Use tuples to select from the MultiIndex, and also a tuple for the new MultiIndex column:
df[("Chg","Period")] = df[("B",2)] - df[("A", 2)]
print (df)
       A         B          Chg
       1    2    1    2  Period
Foo  100  200  300  400     200
Bar  100  200  300  400     200
If you want to work with multiple columns at once, e.g. subtract all of A from B into new MultiIndex levels, you can use DataFrame.xs, then build a MultiIndex with MultiIndex.from_product and add it to the original with DataFrame.join:
df1 = df.xs('B', axis=1, level=0) - df.xs('A', axis=1, level=0)
df1.columns = pd.MultiIndex.from_product([['Diff'], df1.columns])
print (df1)
     Diff
        1     2
Foo   200   200
Bar   200   200
df = df.join(df1)
print (df)
       A         B       Diff
       1    2    1    2     1    2
Foo  100  200  300  400   200  200
Bar  100  200  300  400   200  200
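If the columns should stay grouped by their top level after the join, an optional follow-up step (a sketch) is:

# lexsort the MultiIndex columns so each top-level group is contiguous
df = df.sort_index(axis=1)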