Paired difference of columns in dataframe to generate dataframe with 1.3 million columns - pandas

I have a dataframe with 1600 columns.
The dataframe df looks like this, where the column names are 1, 3, 2:
Row Labels   1   3   2
41730Type1   9   6   5
41730Type2  14  12  20
41731Type1   2  15   5
41731Type2   3  20  12
41732Type1   8  10   5
41732Type2   8  18  16
I need to create the following dataframe df2 pythonically:
Row Labels  (1, 2)  (1, 3)  (2, 3)
41730Type1      -4      -3       1
41730Type2       6      -2      -8
41731Type1       3      13      10
41731Type2       9      17       8
41732Type1      -3       2       5
41732Type2       8      10       2
where e.g. column (1, 2) is created by df[2] - df[1].
The column names for df2 are created by pairing the column headers of df1 such that the second element of each name is greater than the first, e.g. (1, 2), (1, 3), (2, 3).
The second challenge: can a pandas DataFrame support 1.3 million columns?

We can take combinations of the columns, then create a dict and concat it back:
import itertools

import pandas as pd

pairs = itertools.combinations(df.columns, 2)
d = {'{0[0]}|{0[1]}'.format(x): df[x[0]] - df[x[1]] for x in pairs}
newdf = pd.concat(d, axis=1)
             1|3  1|2  3|2
Row Labels
41730Type1     3    4    1
41730Type2     2   -6   -8
41731Type1   -13   -3   10
41731Type2   -17   -9    8
41732Type1    -2    3    5
41732Type2   -10   -8    2
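If you want the pairs oriented exactly as asked in the question (second label greater than the first, with values df[b] - df[a]), a small variant of the same idea should work, assuming the column labels sort the way you expect; this is only a sketch, with the pair names formatted as strings:

import itertools

import pandas as pd

pairs = itertools.combinations(sorted(df.columns), 2)     # (a, b) with a < b
d = {f'({a}, {b})': df[b] - df[a] for a, b in pairs}      # e.g. column (1, 2) = df[2] - df[1]
newdf = pd.concat(d, axis=1)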

itertools.combinations seems the obvious choice, same as @YOBEN_S; here is a different route to the solution, using numpy arrays and a dictionary:
from itertools import combinations

import numpy as np
import pandas as pd

new_data = combinations(df.to_numpy().T, 2)
new_cols = combinations(df.columns, 2)
result = {key: np.subtract(arr1, arr2) if key[0] > key[1] else np.subtract(arr2, arr1)
          for (arr1, arr2), key in zip(new_data, new_cols)}
outcome = pd.DataFrame.from_dict(result, orient='index').sort_index().T
outcome
   (1, 2)  (1, 3)  (3, 2)
0      -4      -3       1
1       6      -2      -8
2       3      13      10
3       9      17       8
4      -3       2       5
5       8      10       2
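As for the second question: with 1600 source columns there are 1600 * 1599 / 2 = 1,279,200 pairs, and pandas itself does not impose a hard column limit, so the real question is whether roughly 1.3 million difference columns fit in memory (about 10 MB of float64 data per row). Building them one pair at a time in Python will be slow at that scale; below is a hedged sketch, not taken from either answer above, of a fully vectorized route where np.triu_indices produces every index pair i < j at once:

import numpy as np
import pandas as pd

sdf = df.sort_index(axis=1)                    # columns in ascending label order, e.g. 1, 2, 3
arr = sdf.to_numpy()
labels = sdf.columns.to_numpy()
i, j = np.triu_indices(arr.shape[1], k=1)      # every index pair with i < j
diffs = arr[:, j] - arr[:, i]                  # all pairwise differences in one shot
df2 = pd.DataFrame(diffs, index=df.index,
                   columns=pd.MultiIndex.from_arrays([labels[i], labels[j]]))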

Related

Find common values within groupby in pandas Dataframe based on two columns

I have following dataframe:
period symptoms recovery
1 4 2
1 5 2
1 6 2
2 3 1
2 5 2
2 8 4
2 12 6
3 4 2
3 5 2
3 6 3
3 8 5
4 5 2
4 8 4
4 12 6
I'm trying to find the common values of the df['period'] groups (1, 2, 3, 4) based on the values of two columns, 'symptoms' and 'recovery'.
Result should be:
symptoms  recovery  period
       5         2  [1, 2, 3, 4]
       8         4  [2, 4]
where each identical (symptoms, recovery) pair has its period occurrences collected in a list or column.
Am I approaching the problem in the wrong way? Appreciate your help.
I tried to turn each period into a dict and loop through to find values, but that didn't work for me. I also tried to use groupby().apply(), but I'm not getting a meaningful data frame.
I tried sorting values based on the 3 columns but couldn't get the common ones between each period section.
Last attempt:
df2 = df[['period', 'how_long', 'days_to_ex']].copy()
#s = df.groupby(["period", "symptoms", "recovery"]).size()
s = df.groupby(["symptoms", "recovery"]).size()
You were almost there:
from io import StringIO
import pandas as pd
# setup sample data
data = StringIO("""
period;symptoms;recovery
1;4;2
1;5;2
1;6;2
2;3;1
2;5;2
2;8;4
2;12;6
3;4;2
3;5;2
3;6;3
3;8;5
4;5;2
4;8;4
4;12;6
""")
df = pd.read_csv(data, sep=";")
# collect unique periods
df.groupby(['symptoms','recovery'])[['period']].agg(list).reset_index()
This gives
   symptoms  recovery        period
0         3         1           [2]
1         4         2        [1, 3]
2         5         2  [1, 2, 3, 4]
3         6         2           [1]
4         6         3           [3]
5         8         4        [2, 4]
6         8         5           [3]
7        12         6        [2, 4]
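If on top of that you only want combinations that occur in more than one period (the expected output in the question suggests some extra filtering of this kind, though the exact rule isn't spelled out), you can filter on the length of the collected lists; a small sketch, with the threshold left for you to adjust:

out = df.groupby(['symptoms', 'recovery'])['period'].agg(list).reset_index()
out[out['period'].apply(len) > 1]   # keep (symptoms, recovery) pairs seen in at least two periods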

Pick row with key GROUP_FILENAME and add a new column with column name

I have a data frame which looks like this
GROUP_FIELD_NAME:BKR_ID
GROUP_FIELD_VALUE:T80
GROUP_FIELD_NAME:GROUP_OFFSET
GROUP_FIELD_VALUE:0
GROUP_FIELD_NAME:GROUP_LENGTH
GROUP_FIELD_VALUE:0
GROUP_FIELD_NAME:FIRM_ID
GROUP_FIELD_VALUE:KIZEM
GROUP_FILENAME:000000018.pdf
GROUP_FIELD_NAME:BKR_ID
GROUP_FIELD_VALUE:T80
GROUP_FIELD_VALUE:P
GROUP_FIELD_NAME:FI_ID
GROUP_FIELD_VALUE:
GROUP_FIELD_NAME:RUN_DTE
GROUP_FIELD_VALUE:20220208
GROUP_FIELD_NAME:FIRM_ID
GROUP_FIELD_VALUE:KIZEM
GROUP_FILENAME:000000019.pdf
It has three keys: GROUP_FIELD_NAME, GROUP_FIELD_VALUE and GROUP_FILENAME. I want to create a dataframe from it like this:
I am expecting a data frame with three columns: group_field_name, group_field_value and group_file_name.
You can use:
(df['col'].str.extract('GROUP_FILENAME:(.*)|([^:]+):(.*)')        # split each line into filename / key / value
   .set_axis(['GROUP_FILENAME', 'var', 'val'], axis=1)
   .assign(GROUP_FILENAME=lambda d: d['GROUP_FILENAME'].bfill(),  # back-fill the filename onto its field rows
           n=lambda d: d.groupby(['GROUP_FILENAME', 'var']).cumcount())  # number repeated keys per file
   .dropna(subset=['var'])                                        # drop the GROUP_FILENAME rows themselves
   .pivot(index=['GROUP_FILENAME', 'n'], columns='var', values='val')    # one row per (file, occurrence)
   .droplevel(1).rename_axis(columns=None)
   .reset_index('GROUP_FILENAME')
)
Output:
  GROUP_FILENAME GROUP_FIELD_NAME GROUP_FIELD_VALUE
0  000000018.pdf           BKR_ID               T80
1  000000018.pdf     GROUP_OFFSET                 0
2  000000018.pdf     GROUP_LENGTH                 0
3  000000018.pdf          FIRM_ID             KIZEM
4  000000019.pdf           BKR_ID               T80
5  000000019.pdf            FI_ID                 P
6  000000019.pdf          RUN_DTE
7  000000019.pdf          FIRM_ID          20220208
8  000000019.pdf              NaN             KIZEM
Used input:
col
0 GROUP_FIELD_NAME:BKR_ID
1 GROUP_FIELD_VALUE:T80
2 GROUP_FIELD_NAME:GROUP_OFFSET
3 GROUP_FIELD_VALUE:0
4 GROUP_FIELD_NAME:GROUP_LENGTH
5 GROUP_FIELD_VALUE:0
6 GROUP_FIELD_NAME:FIRM_ID
7 GROUP_FIELD_VALUE:KIZEM
8 GROUP_FILENAME:000000018.pdf
9 GROUP_FIELD_NAME:BKR_ID
10 GROUP_FIELD_VALUE:T80
11 GROUP_FIELD_VALUE:P
12 GROUP_FIELD_NAME:FI_ID
13 GROUP_FIELD_VALUE:
14 GROUP_FIELD_NAME:RUN_DTE
15 GROUP_FIELD_VALUE:20220208
16 GROUP_FIELD_NAME:FIRM_ID
17 GROUP_FIELD_VALUE:KIZEM
18 GROUP_FILENAME:000000019.pdf
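For completeness, the single-column df used as input above can be built straight from the raw text; a minimal sketch, assuming the records live in a plain text file (hypothetically named records.txt) with one KEY:VALUE entry per line:

import pandas as pd

with open('records.txt') as fh:                        # hypothetical file name
    df = pd.DataFrame({'col': fh.read().splitlines()})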

renaming multiple cells below a specific cell with pandas

I am trying to merge two Excel tables, but the rows don't line up because in one column information is split over several rows whereas in the other table it is contained in a single cell.
Is there a way with pandas to rename the cells in Table A so that they line up with the rows in Table B?
df_jobs = pd.read_excel(r"jobs.xlsx", usecols="Jobs")
df_positions = pd.read_excel(r"orders.xlsx", usecols="Orders")
Sample files:
https://drive.google.com/file/d/1PEG3nZc0183Gh-8A2xbIs9kEZIWLzLSa/view?usp=sharing
https://drive.google.com/file/d/1HfQ4q7pjba0TKNJAHBqcGeoqdY3Yr3DB/view?usp=sharing
I suppose your input data looks like:
>>> df1
              A         i         j
0   O-20-003049       NaN       NaN
1             1  0.643284  0.834937
2             2  0.056463  0.394168
3             3  0.773379  0.057465
4             4  0.081585  0.178991
5             5  0.667667  0.004370
6             6  0.672313  0.587615
7   O-20-003104       NaN       NaN
8             1  0.916426  0.739700
9   O-20-003117       NaN       NaN
10            1  0.800776  0.614192
11            2  0.925186  0.980913
12            3  0.503419  0.775606
>>> df2
                 A         x         y
0   O-20-003049.01  0.593312  0.666600
1   O-20-003049.02  0.554129  0.435650
2   O-20-003049.03  0.900707  0.623963
3   O-20-003049.04  0.023075  0.445153
4   O-20-003049.05  0.307908  0.503038
5   O-20-003049.06  0.844624  0.710027
6   O-20-003104.01  0.026914  0.091458
7   O-20-003117.01  0.275906  0.398993
8   O-20-003117.02  0.101117  0.691897
9   O-20-003117.03  0.739183  0.213401
We start by renaming the rows in column A:
# create a boolean mask
mask = df1["A"].str.startswith("O-")
# rename all rows
df1["A"] = df1.loc[mask, "A"].reindex(df1.index).ffill() \
+ "." + df1["A"].str.pad(2, fillchar="0")
# remove unwanted rows (where mask==True)
df1 = df1[~mask].reset_index(drop=True)
>>> df1
                A         i         j
0  O-20-003049.01  0.643284  0.834937
1  O-20-003049.02  0.056463  0.394168
2  O-20-003049.03  0.773379  0.057465
3  O-20-003049.04  0.081585  0.178991
4  O-20-003049.05  0.667667  0.004370
5  O-20-003049.06  0.672313  0.587615
6  O-20-003104.01  0.916426  0.739700
7  O-20-003117.01  0.800776  0.614192
8  O-20-003117.02  0.925186  0.980913
9  O-20-003117.03  0.503419  0.775606
Now, we are able to merge data on column A:
>>> pd.merge(df1, df2, on="A")
                A         i         j         x         y
0  O-20-003049.01  0.643284  0.834937  0.593312  0.666600
1  O-20-003049.02  0.056463  0.394168  0.554129  0.435650
2  O-20-003049.03  0.773379  0.057465  0.900707  0.623963
3  O-20-003049.04  0.081585  0.178991  0.023075  0.445153
4  O-20-003049.05  0.667667  0.004370  0.307908  0.503038
5  O-20-003049.06  0.672313  0.587615  0.844624  0.710027
6  O-20-003104.01  0.916426  0.739700  0.026914  0.091458
7  O-20-003117.01  0.800776  0.614192  0.275906  0.398993
8  O-20-003117.02  0.925186  0.980913  0.101117  0.691897
9  O-20-003117.03  0.503419  0.775606  0.739183  0.213401
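One caveat: pd.merge defaults to an inner join, so any renamed row in df1 with no counterpart in df2 (or vice versa) silently drops out. If you want to keep unmatched rows and see where each row came from, pass how and indicator explicitly, for example:

merged = pd.merge(df1, df2, on="A", how="outer", indicator=True)  # keeps unmatched rows, adds a _merge column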

"'Series' objects are mutable, thus they cannot be hashed" when trying to sum columns (datatype is float)

I am trying to sum all values in a range of columns, from the third to the last of several thousand columns, using:
day3prep['D3counts'] = day3prep.sum(day3prep.iloc[:, 2:].sum(axis=1))
The dataframe is formatted as:
ID G1 Z1 Z2 ...ZN
0 50 13 12 ...62
1 51 62 23 ...19
dataframe with summed column:
ID G1 Z1 Z2 ...ZN D3counts
0 50 13 12 ...62 sum(Z1:ZN in row 0)
1 51 62 23 ...19 sum(Z1:ZN in row 1)
I've changed the NaNs to 0's. The datatype is float but I am getting the error:
'Series' objects are mutable, thus they cannot be hashed
The error comes from the outer call: day3prep.iloc[:, 2:].sum(axis=1) already produces the Series you want, but passing it as the first argument of day3prep.sum() makes pandas try to treat it as an axis label, and axis labels must be hashable. You only need this part:
day3prep['D3counts'] = day3prep.iloc[:, 2:].sum(axis=1)
With some random numbers:
import pandas as pd
import random
random.seed(42)
day3prep = pd.DataFrame({'ID': random.sample(range(10), 5),
                         'G1': random.sample(range(10), 5),
                         'Z1': random.sample(range(10), 5),
                         'Z2': random.sample(range(10), 5),
                         'Z3': random.sample(range(10), 5)})
day3prep['D3counts'] = day3prep.iloc[:, 2:].sum(axis=1)
Output:
> day3prep
ID G1 Z1 Z2 Z3 D3counts
0 1 2 0 8 8 16
1 0 1 9 0 6 15
2 4 8 1 3 3 7
3 9 4 7 5 7 19
4 6 3 6 6 4 16
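If you prefer to address the columns by label rather than by position (the question describes the range as Z1 through ZN), a label slice with .loc does the same job; a small sketch using the example frame above:

# .loc label slices include both endpoints, so this covers Z1, Z2 and Z3
day3prep['D3counts'] = day3prep.loc[:, 'Z1':'Z3'].sum(axis=1)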

How to set a pandas dataframe equal to a row?

I know how to set the pandas data frame equal to a column.
i.e.:
df = df['col1']
What is the equivalent for a row? Let's say taking it by index? And how would I eliminate one or more of them?
Many thanks.
If you want to take a copy of a row, then you can use either loc for label indexing or iloc for integer-based indexing:
In [104]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
df
Out[104]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
3 0.531293 -0.386598
4 -0.278565 1.224272
5 0.491417 -0.498816
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186
In [106]:
row = df.iloc[3]
row
Out[106]:
a 0.531293
b -0.386598
Name: 3, dtype: float64
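Note that df.iloc[3] gives you a Series. If you would rather keep a one-row DataFrame, pass a list of positions (or labels) instead:

row_df = df.iloc[[3]]          # one-row DataFrame, keeps index label 3
row_df_by_label = df.loc[[3]]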
If you want to remove that row then you can use drop:
In [107]:
df.drop(3)
Out[107]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
4 -0.278565 1.224272
5 0.491417 -0.498816
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186
You can also use a slice or pass a list of labels:
In [109]:
rows = df.loc[[3,5]]
row_slice = df.loc[3:5]
print(rows)
print(row_slice)
a b
3 0.531293 -0.386598
5 0.491417 -0.498816
a b
3 0.531293 -0.386598
4 -0.278565 1.224272
5 0.491417 -0.498816
Similarly you can pass a list to drop:
In [110]:
df.drop([3,5])
Out[110]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
4 -0.278565 1.224272
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186
If you wanted to drop a slice then you can slice your index and pass this to drop:
In [112]:
df.drop(df.index[3:5])
Out[112]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
5 0.491417 -0.498816
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186