Pandas: Combine two dataframe columns in a sorted column

Suppose that I have this dataframe:
import pandas as pd

def creatingDataFrame():
    raw_data = {'Region1': ['A', 'A', 'C', 'B', 'A', 'B'],
                'Region2': ['B', 'C', 'A', 'A', 'B', 'A'],
                'var-1': [20, 30, 40, 50, 10, 20],
                'var-2': [3, 4, 5, 1, 2, 3]}
    df = pd.DataFrame(raw_data, columns=['Region1', 'Region2', 'var-1', 'var-2'])
    return df
I want to generate this column:
df['segments']=['A-B','A-C','A-C','A-B','A-B','A-B']
Note that it uses the columns 'Region1' and 'Region2', but with each pair in sorted order. I don't know how to do that directly in pandas. The only solution I have in mind uses a list as an intermediate step:
import numpy as np
Regions = df[['Region1', 'Region2']].values.tolist()
segments = []
for i in range(np.shape(Regions)[0]):
    auxRegions = sorted(Regions[i][:])  # sort the two region labels of row i
    segments.append(auxRegions[0] + '-' + auxRegions[1])
df['segments'] = segments
To get:
>>> df['segments']
0 A-B
1 A-C
2 A-C
3 A-B
4 A-B
5 A-B

You need:
df['segments'] = ['-'.join(sorted(tup)) for tup in zip(df['Region1'], df['Region2'])]
Output:
Region1 Region2 var-1 var-2 segments
0 A B 20 3 A-B
1 A C 30 4 A-C
2 C A 40 5 A-C
3 B A 50 1 A-B
4 A B 10 2 A-B
5 B A 20 3 A-B
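As a self-contained sketch of the list-comprehension approach, the frame from the question can be rebuilt inline and checked end to end. Nothing beyond pandas is assumed here; the loop variable name pair is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Region1': ['A', 'A', 'C', 'B', 'A', 'B'],
    'Region2': ['B', 'C', 'A', 'A', 'B', 'A'],
    'var-1': [20, 30, 40, 50, 10, 20],
    'var-2': [3, 4, 5, 1, 2, 3],
})

# Sort each (Region1, Region2) pair, then join the two labels with '-'.
df['segments'] = ['-'.join(sorted(pair))
                  for pair in zip(df['Region1'], df['Region2'])]

print(df['segments'].tolist())
```

Because sorted() works on any iterable, the same line extends to three or more region columns by adding them to zip().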

np.sort
import numpy as np
v = np.sort(df.iloc[:, :2], axis=1).T  # sort the two labels within each row, then transpose
df['segments'] = [f'{i}-{j}' for i, j in zip(v[0], v[1])]  # or '{}-{}'.format(i, j) without f-strings
df
Region1 Region2 var-1 var-2 segments
0 A B 20 3 A-B
1 A C 30 4 A-C
2 C A 40 5 A-C
3 B A 50 1 A-B
4 A B 10 2 A-B
5 B A 20 3 A-B
DataFrame.agg + str.join
df['segments'] = pd.DataFrame(
np.sort(df.iloc[:, :2], axis=1)).agg('-'.join, axis=1)
df
Region1 Region2 var-1 var-2 segments
0 A B 20 3 A-B
1 A C 30 4 A-C
2 C A 40 5 A-C
3 B A 50 1 A-B
4 A B 10 2 A-B
5 B A 20 3 A-B
(The np.sort version with the list comprehension, shown first, is the faster of the two.)
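A minimal sketch of the np.sort idea that selects the two columns by name rather than by position, which may be safer if the column order can change. The variable name pairs is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Region1': ['A', 'A', 'C', 'B', 'A', 'B'],
    'Region2': ['B', 'C', 'A', 'A', 'B', 'A'],
})

# Sort within each row (axis=1); the result is a plain 2-D NumPy array.
pairs = np.sort(df[['Region1', 'Region2']].to_numpy(), axis=1)

# Each row of `pairs` holds the two labels in sorted order.
df['segments'] = [f'{a}-{b}' for a, b in pairs]

print(df)
```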

Related

Pandas variable rounding of column

>>> print(df)
item value1
0 a 1.121
1 a 1.510
2 a 0.110
3 b 3.322
4 b 4.811
5 c 5.841
This is my dummy pandas df.
Below is how I truncate/round my column value1.
decimals = 2
df['value1'] = df['value1'].apply(lambda x: round(x, decimals))
>>> print(df)
item value1
0 a 1.12
1 a 1.51
2 a 0.11
3 b 3.32
4 b 4.81
5 c 5.84
This rounds the whole column to two decimal places. Is it possible to have variable rounding with a dictionary? In the example below, 'a' gets two decimal places, 'b' gets three, and any value not covered falls back to the default of 2. My expected df is below. I am not sure if this is possible (more of a thought experiment).
dec_dict = {'a' : 2, 'b': 3, 'l':3, 'default': 2}
>>> print(df)
item value1
0 a 1.12
1 a 1.51
2 a 0.11
3 b 3.322
4 b 4.811
5 c 5.84
Given the fact that trailing zeros are not significant, the best approach should be:
dec_dict = {'a' : 2, 'b': 3, 'l':3, 'default': 2}
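# group_keys=False keeps the original index, so the result aligns with df on pandas 2.x
df['value1'] = (df.groupby('item', group_keys=False)['value1']
                  .apply(lambda g: g.round(dec_dict.get(g.name, dec_dict['default'])))
                )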
output:
item value1
0 a 1.120
1 a 1.510
2 a 0.110
3 b 3.322
4 b 4.811
5 c 5.840
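Another option maps each item to its number of decimals and rounds row by row (df1 is the original dataframe; note that value1 ends up as strings):
df1.assign(value1=df1.assign(col1=df1.item.map(dec_dict).fillna(dec_dict['default']).astype(int))
           .apply(lambda ss: str(round(ss.value1, ss.col1)), axis=1))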
item value1
0 a 1.12
1 a 1.51
2 a 0.11
3 b 3.322
4 b 4.811
5 c 5.84
You can set the index and then round by column with a dict. Before that, we need to update your dict with the missing values:
update_dict = {**dec_dict,**dict.fromkeys(df.item[~df.item.isin(dec_dict.keys())],2)}
update_dict
{'a': 2, 'b': 3, 'l': 3, 'default': 2, 'c': 2}
out = df.set_index('item').T.round(update_dict).astype(object).T.reset_index()
out
item value1
0 a 1.12
1 a 1.51
2 a 0.11
3 b 3.322
4 b 4.811
5 c 5.84
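The per-item rounding discussed above can also be sketched without groupby.apply or a transpose: map each item to its number of decimals, then round one decimals value at a time with a boolean mask. This is a sketch under the assumption that value1 should stay numeric; the names decimals, rounded and mask are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['a', 'a', 'a', 'b', 'b', 'c'],
    'value1': [1.121, 1.510, 0.110, 3.322, 4.811, 5.841],
})
dec_dict = {'a': 2, 'b': 3, 'l': 3, 'default': 2}

# Number of decimals per row; items missing from dec_dict use the default.
decimals = df['item'].map(dec_dict).fillna(dec_dict['default']).astype(int)

rounded = df['value1'].copy()
for n_dec in decimals.unique():
    mask = decimals == n_dec
    # Round only the rows that share this number of decimals.
    rounded[mask] = df.loc[mask, 'value1'].round(int(n_dec))

df['value1'] = rounded
print(df)
```

Since the column stays float, pandas prints every row with the same number of digits; only the stored values differ. Converting to strings, as in one of the answers above, is what produces the mixed display.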

Iterate over rows and subtract values in pandas df

I have the following table:
ID  Qty_1  Qty_2
A   1      10
A   2      0
A   3      0
B   3      29
B   2      0
B   1      0
I want to iterate based on the ID, and subtract Qty_2 - Qty_1 and update the next row with that result.
The result would be:
ID  Qty_1  Qty_2
A   1      10
A   2      8
A   3      5
B   3      29
B   2      27
B   1      26
Ideally, I would also like to start by subtracting in the first row each time a new ID appears, and only then start the loop:
ID  Qty_1  Qty_2
A   1      9
A   2      7
A   3      4
B   3      26
B   2      24
B   1      23
Either of the two results is OK! Thank you!
First compute the difference 'Qty_2' - 'Qty_1' row by row, then group by 'ID' and compute the cumulative sum:
df['Qty_2'] = df.assign(Qty_2=df['Qty_2'].sub(df['Qty_1'])) \
.groupby('ID')['Qty_2'].cumsum()
print(df)
# Output:
ID Qty_1 Qty_2
0 A 1 9
1 A 2 7
2 A 3 4
3 B 3 26
4 B 2 24
5 B 1 23
Setup:
data = {'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
'Qty_1': [1, 2, 3, 3, 2, 1],
'Qty_2': [10, 0, 0, 29, 0, 0]}
df = pd.DataFrame(data)
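A sketch covering both results asked for in the question. The second table (subtract in the first row too) is the grouped cumulative sum of Qty_2 - Qty_1; the first table (first row left as is) differs from it only by the first Qty_1 of each ID, so it can be derived by adding that value back. The names variant1, variant2 and first_qty1 are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'Qty_1': [1, 2, 3, 3, 2, 1],
                   'Qty_2': [10, 0, 0, 29, 0, 0]})

# Row-wise difference, then a running total within each ID.
diff = df['Qty_2'] - df['Qty_1']
variant2 = diff.groupby(df['ID']).cumsum()

# Leaving the first row of each ID untouched means adding back
# the first Qty_1 of that ID to every row of the group.
first_qty1 = df.groupby('ID')['Qty_1'].transform('first')
variant1 = variant2 + first_qty1

print(df.assign(first_row_kept=variant1, first_row_subtracted=variant2))
```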

Unstack a single column dataframe

I have a dataframe that looks like this:
statistics
0 2013-08
1 4
2 8
3 2013-09
4 7
5 13
6 2013-10
7 2
8 10
And I need it to look like this:
statistics X Y
0 2013-08 4 8
1 2013-09 7 13
2 2013-10 2 10
It would be useful to find a way that doesn't depend on the number of rows, as I want to use it in a loop and the number of original rows might change. However, the output should always have these 3 columns.
What you are doing is not an unstack operation, you are trying to do a reshape.
You can do this by using the reshape method of numpy. The variable n_cols is the number of columns you are looking for.
Here you have an example:
df = pd.DataFrame(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L'], columns=['col'])
df
col
0 A
1 B
2 C
3 D
4 E
5 F
6 G
7 H
8 I
9 J
10 K
11 L
n_cols = 3
pd.DataFrame(df.values.reshape(int(len(df)/n_cols), n_cols))
0 1 2
0 A B C
1 D E F
2 G H I
3 J K L
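Applied to the data from the question, the reshape idea might look like the sketch below. It assumes the column was read as strings (the actual dtypes in the asker's file are not shown) and that the row count is always a multiple of 3; otherwise reshape raises a ValueError. Passing -1 lets NumPy work out the number of rows.

```python
import pandas as pd

df = pd.DataFrame({'statistics': ['2013-08', '4', '8',
                                  '2013-09', '7', '13',
                                  '2013-10', '2', '10']})

n_cols = 3
# -1 infers the row count, so this works for any multiple of n_cols rows.
wide = pd.DataFrame(df['statistics'].to_numpy().reshape(-1, n_cols),
                    columns=['statistics', 'X', 'Y'])

# The reshaped values are still strings; convert the two numeric columns.
wide['X'] = pd.to_numeric(wide['X'])
wide['Y'] = pd.to_numeric(wide['Y'])

print(wide)
```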
import pandas as pd
data = pd.read_csv('data6.csv')
x = []
y = []
statistics = []
for i in range(0, len(data)):
    if i % 3 == 0:
        statistics.append(data['statistics'][i])
    elif i % 3 == 1:
        x.append(data['statistics'][i])
    elif i % 3 == 2:
        y.append(data['statistics'][i])
data1 = pd.DataFrame({'statistics': statistics, 'x': x, 'y': y})
data1

How to get column name based on multiple columns in pandas?

Goal:
Create columns
fst_imp: the name of the column that holds the smallest value in each row.
snd_imp: the name of the column that holds the second-smallest value in each row.
trd_imp: the name of the column that holds the third-smallest value in each row.
Example result:
A B C fst_imp snd_imp trd_imp
0 1 2 3 A B C
1 6 5 4 C B A
2 7 9 8 A C B
Here is one potential solution using numpy.argsort, the pandas.DataFrame constructor and DataFrame.join:
# Setup
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': {0: 1, 1: 6, 2: 7}, 'B': {0: 2, 1: 5, 2: 9}, 'C': {0: 3, 1: 4, 2: 8}})
df.join(pd.DataFrame([df.columns.values[x] for x in np.argsort(df.values)],
columns=['fst_imp', 'snd_imp', 'trd_imp']))
[out]
A B C fst_imp snd_imp trd_imp
0 1 2 3 A B C
1 6 5 4 C B A
2 7 9 8 A C B
Or a bit more scalable...
df.join(pd.DataFrame([df.columns.values[x] for x in np.argsort(df.values)]))
[out]
A B C 0 1 2
0 1 2 3 A B C
1 6 5 4 C B A
2 7 9 8 A C B
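A sketch of the same argsort idea that avoids the Python-level loop and generates the new column names for any number of input columns. The imp_1, imp_2, ... naming is a hypothetical choice made for this example, not something from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 6, 7], 'B': [2, 5, 9], 'C': [3, 4, 8]})

# Column positions that would sort each row in ascending order.
order = np.argsort(df.to_numpy(), axis=1)

# Index the column labels with the 2-D position array in one step.
ranked = pd.DataFrame(df.columns.to_numpy()[order],
                      index=df.index,
                      columns=[f'imp_{k + 1}' for k in range(df.shape[1])])

result = df.join(ranked)
print(result)
```

Passing index=df.index keeps the join aligned even when df does not have a default integer index.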

Separate aggregated data in different rows

I want to replicate rows in a Pandas Dataframe. Each row should be repeated n times, where n is a field of each row.
import pandas as pd
what_i_have = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n' : [ 1, 2, 3],
    'v' : [ 10, 13, 8]
})
what_i_want = pd.DataFrame(data={
    'id': ['A', 'B', 'B', 'C', 'C', 'C'],
    'v' : [ 10, 13, 13, 8, 8, 8]
})
Is this possible?
You can use Index.repeat to get repeated index values based on the column, then select from the DataFrame (here df is the what_i_have frame from the question):
df2 = df.loc[df.index.repeat(df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
Or you could use np.repeat to get the repeated indices and then use that to index into the frame:
df2 = df.loc[np.repeat(df.index.values, df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
After which there's only a bit of cleaning up to do:
df2 = df2.drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Note that if you might have duplicate indices to worry about, you could use .iloc instead:
df.iloc[np.repeat(np.arange(len(df)), df["n"])].drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
which uses the positions, and not the index labels.
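Put together as one runnable sketch, using the what_i_have frame from the question. The variable name result is illustrative.

```python
import pandas as pd

what_i_have = pd.DataFrame({
    'id': ['A', 'B', 'C'],
    'n': [1, 2, 3],
    'v': [10, 13, 8],
})

# Repeat each index label n times, select those rows,
# then drop the counter column and renumber the rows.
result = (what_i_have
          .loc[what_i_have.index.repeat(what_i_have['n'])]
          .drop(columns='n')
          .reset_index(drop=True))

print(result)
```

A row with n equal to 0 is repeated zero times, so it simply disappears from the result.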
You could use set_index and repeat
In [1057]: df.set_index(['id'])['v'].repeat(df['n']).reset_index()
Out[1057]:
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Details
In [1058]: df
Out[1058]:
id n v
0 A 1 10
1 B 2 13
2 C 3 8
It's something like the uncount in tidyr:
https://tidyr.tidyverse.org/reference/uncount.html
I wrote a package (https://github.com/pwwang/datar) that implements this API:
from datar import f
from datar.tibble import tribble
from datar.tidyr import uncount
what_i_have = tribble(
    f.id, f.n, f.v,
    'A',  1,   10,
    'B',  2,   13,
    'C',  3,   8
)
what_i_have >> uncount(f.n)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
Not the best solution, but I want to share this: you could also use DataFrame.reindex() together with Index.repeat():
df.reindex(df.index.repeat(df.n)).drop('n', axis=1)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
You can further append .reset_index(drop=True) to reset the .index.