Merge Dataframes and apply different math function by column - pandas

I have 3 DataFrames like below.
A =
ops lat
0 9,453 13,536
1 8,666 14,768
2 8,377 15,278
3 8,236 15,536
4 8,167 15,668
5 8,099 15,799
6 8,066 15,867
7 8,029 15,936
8 7,997 16,004
9 7,969 16,058
10 7,962 16,073
B =
ops lat
0 9,865 12,967
1 8,908 14,366
2 8,546 14,976
3 8,368 15,294
4 8,289 15,439
5 8,217 15,571
6 8,171 15,662
7 8,130 15,741
8 8,093 15,809
9 8,072 15,855
10 8,058 15,882
C =
ops lat
0 9,594 13,332
1 8,718 14,670
2 8,396 15,242
3 8,229 15,553
4 8,137 15,725
5 8,062 15,875
6 8,008 15,982
7 7,963 16,070
8 7,919 16,159
9 7,892 16,218
10 7,874 16,255
How do I merge them into a single DataFrame where the ops column is the sum and the lat column is the average of these three DataFrames?
pd.concat() only seems to append the DataFrames one after another.

There are likely many ways, but to stay on the same line of thinking as your pd.concat attempt, the following will work.
First, concat your DataFrames together, then calculate .sum() and .mean() on the newly created DataFrame and construct the final table from those two fields.
Dummy data and example below:
import pandas as pd

data = {'Name': ['node1', 'node1', 'node1', 'node2', 'node2', 'node3'],
        'Value': [1000, 20000, 40000, 30000, 589, 682],
        'Value2': [303, 2084, 494, 2028, 4049, 112]}
df1 = pd.DataFrame(data)

data2 = {'Name': ['node1', 'node1', 'node1', 'node2', 'node2', 'node3'],
         'Value': [1000, 20000, 40000, 30000, 589, 682],
         'Value2': [8, 234, 75, 123, 689, 1256]}
df2 = pd.DataFrame(data2)

# stack the two DataFrames on top of each other
joined = pd.concat([df1, df2])

# build the final one-row table from the sum of one column and the mean of the other
final = pd.DataFrame({'Sum_Col': [joined["Value"].sum()],
                      'Mean_Col': [joined["Value2"].mean()]})
display(final)  # display() works in Jupyter; use print(final) in a plain script
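Note that the snippet above collapses everything into a single row. If you instead want a row-wise combination of A, B and C (sum of ops and average of lat at each index position), here is a minimal sketch, assuming the three DataFrames share the same index and that ops and lat are numeric:
import pandas as pd

# A, B, C are the three DataFrames from the question, assumed to share
# the same index and the columns "ops" and "lat" (numeric dtype)
combined = pd.concat([A, B, C])
result = combined.groupby(level=0).agg(ops=("ops", "sum"), lat=("lat", "mean"))
print(result)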

Related

python - List of Lists into pandas dataframe including name of columns

I would like to transform a list of lists into a dataframe with one column per list in the list.
This part is still easy.
list = [[....],[....],[...]]
df = pd.DataFrame(list)
df = df.transpose()
The problem is: I would like to give the columns a column-name based on entries I have in another list:
list_two = [A,B,C,...]
This is the issue I'm still struggling with.
Is there any approach to solve this problem?
Thanks a lot in advance for your help.
Best regards
Sascha
Use zip with dict to build a dictionary of lists and pass it to DataFrame:
L = [[1,2,3,5],[4,8,9,8],[1,2,5,3]]
list_two = list('ABC')
df = pd.DataFrame(dict(zip(list_two, L)))
print(df)
A B C
0 1 4 1
1 2 8 2
2 3 9 5
3 5 8 3
Or pass the list as the index parameter and transpose, so the column names come from this list:
df = pd.DataFrame(L, index=list_two).T
print(df)
A B C
0 1 4 1
1 2 8 2
2 3 9 5
3 5 8 3

renaming multiple cells below a specific cell with pandas

I am trying to merge two Excel tables, but the rows don't line up because in one column information is split over several rows whereas in the other table it is contained in a single cell.
Is there a way with pandas to rename the cells in Table A so that they line up with the rows in Table B?
df_jobs = pd.read_excel(r"jobs.xlsx", usecols="Jobs")
df_positions = pd.read_excel(r"orders.xlsx", usecols="Orders")
Sample files:
https://drive.google.com/file/d/1PEG3nZc0183Gh-8A2xbIs9kEZIWLzLSa/view?usp=sharing
https://drive.google.com/file/d/1HfQ4q7pjba0TKNJAHBqcGeoqdY3Yr3DB/view?usp=sharing
I suppose your input data looks like:
>>> df1
A i j
0 O-20-003049 NaN NaN
1 1 0.643284 0.834937
2 2 0.056463 0.394168
3 3 0.773379 0.057465
4 4 0.081585 0.178991
5 5 0.667667 0.004370
6 6 0.672313 0.587615
7 O-20-003104 NaN NaN
8 1 0.916426 0.739700
9 O-20-003117 NaN NaN
10 1 0.800776 0.614192
11 2 0.925186 0.980913
12 3 0.503419 0.775606
>>> df2
A x y
0 O-20-003049.01 0.593312 0.666600
1 O-20-003049.02 0.554129 0.435650
2 O-20-003049.03 0.900707 0.623963
3 O-20-003049.04 0.023075 0.445153
4 O-20-003049.05 0.307908 0.503038
5 O-20-003049.06 0.844624 0.710027
6 O-20-003104.01 0.026914 0.091458
7 O-20-003117.01 0.275906 0.398993
8 O-20-003117.02 0.101117 0.691897
9 O-20-003117.03 0.739183 0.213401
We start by renaming the rows in column A:
# create a boolean mask marking the "O-..." header rows
mask = df1["A"].str.startswith("O-")
# rename all rows: forward-fill the order number and append the zero-padded
# position as a ".xx" suffix
df1["A"] = (df1.loc[mask, "A"].reindex(df1.index).ffill()
            + "." + df1["A"].str.pad(2, fillchar="0"))
# remove the unwanted header rows (where mask is True)
df1 = df1[~mask].reset_index(drop=True)
>>> df1
A i j
1 O-20-003049.01 0.000908 0.078590
2 O-20-003049.02 0.896207 0.406293
3 O-20-003049.03 0.120693 0.722355
4 O-20-003049.04 0.412412 0.447349
5 O-20-003049.05 0.369486 0.872241
6 O-20-003049.06 0.614941 0.907893
8 O-20-003104.01 0.519443 0.800131
10 O-20-003117.01 0.583067 0.760002
11 O-20-003117.02 0.133029 0.389461
12 O-20-003117.03 0.969289 0.397733
Now, we are able to merge data on column A:
>>> pd.merge(df1, df2, on="A")
A i j x y
0 O-20-003049.01 0.643284 0.834937 0.593312 0.666600
1 O-20-003049.02 0.056463 0.394168 0.554129 0.435650
2 O-20-003049.03 0.773379 0.057465 0.900707 0.623963
3 O-20-003049.04 0.081585 0.178991 0.023075 0.445153
4 O-20-003049.05 0.667667 0.004370 0.307908 0.503038
5 O-20-003049.06 0.672313 0.587615 0.844624 0.710027
6 O-20-003104.01 0.916426 0.739700 0.026914 0.091458
7 O-20-003117.01 0.800776 0.614192 0.275906 0.398993
8 O-20-003117.02 0.925186 0.980913 0.101117 0.691897
9 O-20-003117.03 0.503419 0.775606 0.739183 0.213401
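For reference, the str.pad call above is what turns the bare position numbers ("1", "2", ...) into the two-digit suffixes used in df2; a small illustration:
import pandas as pd

# str.pad(2, fillchar="0") left-pads each string to a width of 2
s = pd.Series(["1", "2", "10"])
print(s.str.pad(2, fillchar="0"))
# 0    01
# 1    02
# 2    10
# dtype: object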

Obtain a barplot from boolean values?

I'm facing a bit of a problem. This is my dataframe:
Students  Subject  Mark
1         M F      7 4 3 7
2         I        5 6
3         M F I S  2 3 0
4         M        2 2
5         F M I    5 1
6         I M F    6 2 3
7         I M      7
I want to plot a barplot with four "bars", for students meeting the following four conditions:
Have 3 or more letters in the "Subject" column
Have at least one 3 in the "Mark" column
Have both things
Have neither
At first I was stuck, but it was suggested that I proceed this way:
df["Subject"].str.count(r"\w+") >= 3
df["Mark"].str.count("3") >= 1
(df["Subject"].str.count(r"\w+") >= 3) & (df["Mark"].str.count("3") >= 1)
What I obtain are three boolean columns, but I don't know how to go from here to plot the barplot.
I was thinking about counting the values in each column, but I don't seem to find a way to do so, since it looks like I can't apply value_counts() to the boolean columns.
If you have any idea, please help!
I think you need to create a DataFrame with all 4 masks, then count the True values with sum, and finally plot:
m1 = df["Subject"].str.count(r"\w+") >= 3
m2 = df["Mark"].str.count("3") >= 1
df1 = pd.concat([m1, m2, m1 & m2, ~m1 & ~m2], axis=1, keys=('a','b','c','d'))
out = df1.sum()
If you need a seaborn solution:
import seaborn as sns
ax = sns.barplot(x="index", y="val", data=out.reset_index(name='val'))
For a pandas (matplotlib) solution:
out.plot.bar()
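Putting it together as a self-contained sketch (the DataFrame below is my reconstruction of the sample data, with Subject and Mark stored as whitespace-separated strings):
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical reconstruction of the question's sample data
df = pd.DataFrame({
    "Students": [1, 2, 3, 4, 5, 6, 7],
    "Subject": ["M F", "I", "M F I S", "M", "F M I", "I M F", "I M"],
    "Mark": ["7 4 3 7", "5 6", "2 3 0", "2 2", "5 1", "6 2 3", "7"],
})

m1 = df["Subject"].str.count(r"\w+") >= 3   # 3 or more subjects
m2 = df["Mark"].str.count("3") >= 1         # at least one 3 among the marks
out = pd.concat([m1, m2, m1 & m2, ~m1 & ~m2], axis=1,
                keys=("3+ subjects", "has a 3", "both", "neither")).sum()

out.plot.bar()
plt.tight_layout()
plt.show()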

Multi-indexed series into DataFrame and reformat

I have a correlation matrix of stock returns in a Pandas DataFrame and I want to extract the top/bottom 10 correlated pairs from the matrix.
Sample DataFrame:
import pandas as pd
import numpy as np
data = np.random.randint(5,30,size=500)
df = pd.DataFrame(data.reshape((50,10)))
corr = df.corr()
This is my function to get the top/bottom 10 correlated pairs by 1) first returning a multi-indexed series (high) for highest correlated pairs, and then 2) unstacking back into a DataFrame (high_df):
def get_rankings(corr_matrix):
    # the matrix is symmetric, so extract the upper triangle without the diagonal (k=1)
    ranked_corr = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                   .stack()
                   .sort_values(ascending=False))
    high = ranked_corr[:10]
    high_df = high.unstack().fillna("")
    return high_df
get_rankings(corr)
My current DF output looks something like this:
6 4 5 7 8 3 9
3 0.359 0.198
1 0.275
4 0.257
2 0.176 0.154
0 0.153 0.164
5 0.156
But I want it to look like this, in either 2 or 3 columns:
ID1 ID2 Corr
0 9 0.304471
2 8 0.271009
2 3 0.147702
7 9 0.146176
0 7 0.144549
7 8 0.111888
4 6 0.098619
1 7 0.092338
1 4 0.09091
3 6 0.079688
It needs to be in a DataFrame so I can pass it to a grid widget, which only accepts DataFrames. Can anyone help me reshape the unstacked DF?
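One way to get that 2/3-column layout (a sketch of my own, not from the original function) is to skip the unstack step and reset the index of the stacked Series instead, reusing the imports and corr defined above:
def get_rankings(corr_matrix, n=10):
    # upper triangle without the diagonal, stacked into a MultiIndex Series
    ranked_corr = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                   .stack()
                   .sort_values(ascending=False))
    # keep the top n pairs and turn the two index levels into ordinary columns
    return (ranked_corr.head(n)
            .rename_axis(["ID1", "ID2"])
            .reset_index(name="Corr"))

print(get_rankings(corr))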

How to set a pandas dataframe equal to a row?

I know how to set the pandas data frame equal to a column.
i.e.:
df = df['col1']
What is the equivalent for a row? Let's say selecting by index? And how would I eliminate one or more rows?
Many thanks.
If you want to take a copy of a row then you can either use loc for label-based indexing or iloc for integer-based indexing:
In [104]:
df = pd.DataFrame({'a':np.random.randn(10),'b':np.random.randn(10)})
df
Out[104]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
3 0.531293 -0.386598
4 -0.278565 1.224272
5 0.491417 -0.498816
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186
In [106]:
row = df.iloc[3]
row
Out[106]:
a 0.531293
b -0.386598
Name: 3, dtype: float64
If you want to remove that row then you can use drop:
In [107]:
df.drop(3)
Out[107]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
4 -0.278565 1.224272
5 0.491417 -0.498816
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186
You can also use a slice or pass a list of labels:
In [109]:
rows = df.loc[[3,5]]
row_slice = df.loc[3:5]
print(rows)
print(row_slice)
a b
3 0.531293 -0.386598
5 0.491417 -0.498816
a b
3 0.531293 -0.386598
4 -0.278565 1.224272
5 0.491417 -0.498816
Similarly you can pass a list to drop:
In [110]:
df.drop([3,5])
Out[110]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
4 -0.278565 1.224272
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186
If you wanted to drop a slice then you can slice your index and pass this to drop:
In [112]:
df.drop(df.index[3:5])
Out[112]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
5 0.491417 -0.498816
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186
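One extra point, not from the original answer: df['col1'] and df.iloc[3] both return a Series; if you want "setting the DataFrame equal to a row" to keep a DataFrame, pass a single-element list of labels:
row_df = df.iloc[[3]]    # one-row DataFrame rather than a Series
print(type(row_df))      # <class 'pandas.core.frame.DataFrame'>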