I have two DataFrames of unequal shape:
pack_df:
pack_id  item_size  temperature
pck1     M          7.3
pck2     S          10.0
pck3     L          5.2
pck4     S          15.3
pck5     M          3.3
pck6     L          9.3
pck7     L          20.3
pck8     M          8.1
pck9     M          21.3
pck10    S          9.7
temperature_range_df:
min_temp  max_temp  S  M  L
-4        10        2  1  1
10.1      20        4  3  2
20.1      30        6  4  2
I need to check whether pack_df['temperature'] falls within the range temperature_range_df['min_temp'] to temperature_range_df['max_temp']; when a match is found, I need to assign the packet count (temperature_range_df['S'] / temperature_range_df['M'] / temperature_range_df['L']) based on pack_df['item_size']. How can we achieve this without iterating over each row and comparing them (i.e. without DataFrame.iterrows()), as the DataFrame is likely to grow a lot over time?
My final DataFrame should look like:
final_dataframe:
pack_id  item_size  temperature  pack_count
pck1     M          7.3          1
pck2     S          10.0         2
pck3     L          5.2          1
pck4     S          15.3         4
pck5     M          3.3          1
pck6     L          9.3          1
pck7     L          20.3         2
pck8     M          8.1          1
pck9     M          21.3         4
pck10    S          9.7          2
Calculation: pck1's item size is 'M' and its temperature is 7.3; this falls under the temperature range -4 to 10, and based on item size 'M' we can assign only 1 packet.
Thanks for the help!
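One vectorized way to do the range lookup described above is with a pandas IntervalIndex: bucket each temperature into its range row, then pick the S/M/L column by item size. This is a sketch assuming exactly the tables shown in the question:

```python
import pandas as pd

pack_df = pd.DataFrame({
    "pack_id": [f"pck{i}" for i in range(1, 11)],
    "item_size": ["M", "S", "L", "S", "M", "L", "L", "M", "M", "S"],
    "temperature": [7.3, 10.0, 5.2, 15.3, 3.3, 9.3, 20.3, 8.1, 21.3, 9.7],
})
temperature_range_df = pd.DataFrame({
    "min_temp": [-4, 10.1, 20.1],
    "max_temp": [10, 20, 30],
    "S": [2, 4, 6],
    "M": [1, 3, 4],
    "L": [1, 2, 2],
})

# One closed interval per row of the range table.
intervals = pd.IntervalIndex.from_arrays(
    temperature_range_df["min_temp"],
    temperature_range_df["max_temp"],
    closed="both",
)
# Row index: which temperature range each pack falls into.
row_idx = intervals.get_indexer(pack_df["temperature"])
# Column index: which size column (S/M/L) to read for each pack.
col_idx = temperature_range_df.columns.get_indexer(pack_df["item_size"])
# Fancy-index the underlying array once instead of iterating rows.
pack_df["pack_count"] = temperature_range_df.to_numpy()[row_idx, col_idx].astype(int)
print(pack_df)
```

This reproduces the final_dataframe above. Note that a temperature outside every range would get indexer -1 and would need separate handling.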
I am trying to create a "total" column in my dataframe
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
My dataframe
Room 1 Room 2 Room 3
on off on off on off
0 1 4 3 6 5 15
1 3 2 1 5 1 7
For each room, I want to create a total column and then a on% column.
I have tried the following, however, it does not work.
df.loc[:, slice(None), "total" ] = df.xs('on', axis=1,level=1) + df.xs('off', axis=1,level=1)
Let us try something fancy ~
df.stack(0).eval('total=on + off \n on_pct=on / total').stack().unstack([1, 2])
Room 1 Room 2 Room 3
off on total on_pct off on total on_pct off on total on_pct
0 4.0 1.0 5.0 0.2 6.0 3.0 9.0 0.333333 15.0 5.0 20.0 0.250
1 2.0 3.0 5.0 0.6 5.0 1.0 6.0 0.166667 7.0 1.0 8.0 0.125
Oof, this was a tough one, but you can do it like this if you want to avoid loops. Worth noting that it redefines your df twice, because I need the total columns first. Sorry about that, but it's the best I could do. If you have any questions, just comment.
df = pd.concat([y.assign(**{'Total {0}'.format(x + 1): y.iloc[:, 0] + y.iloc[:, 1]})
                for x, y in df.groupby(np.arange(df.shape[1]) // 2, axis=1)], axis=1)
df = pd.concat([y.assign(**{'Percentage_Total{0}'.format(x + 1): (y.iloc[:, 0] / y.iloc[:, 2]) * 100})
                for x, y in df.groupby(np.arange(df.shape[1]) // 3, axis=1)], axis=1)
print(df)
This groups by the column's first index (rooms) and then loops through each group to add the total and percent on. The final step is to reindex using the unique rooms:
import pandas as pd
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
for room, group in df.groupby(level=0, axis=1):
    df[(room, 'total')] = group.sum(axis=1)
    df[(room, 'pct_on')] = group[(room, 'on')] / df[(room, 'total')]
result = df.reindex(columns=df.columns.get_level_values(0).unique(), level=0)
Output:
Room 1 Room 2 Room 3
on off total pct_on on off total pct_on on off total pct_on
0 1 4 5 0.2 3 6 9 0.333333 5 15 20 0.250
1 3 2 5 0.6 1 5 6 0.166667 1 7 8 0.125
I have these two tables
One table is called LineMax
OrigNode DestNode DivisionNum Prefix FromMP ToMP Suffix
7764 25961 3 AB 18 20.9
7764 50213 3 AB 18 17.3
7765 35444 3 AB 0 1.5
7841 35444 3 AB 6 1.5
15390 25961 3 AB 23.75 20.9
25961 7764 3 AB 20.9 18
25961 15390 3 AB 20.9 23.75
And I have another data set
OPER_MNT_DIV_CD TRK_CLS_NBR LN_PFX SEG_BGN_MP SEG_END_MP LN_SFX
4 1 362.7 362.71
4 1 362.71 362.83
4 1 362.83 362.98
4 1 362.98 363.35
4 1 363.35 363.4
4 1 363.4 363.54
4 1 363.54 363.67
4 1 363.67 363.81
4 1 363.81 363.95
4 1 363.95 364.1
4 1 364.1 364.15
4 1 364.15 364.5
4 1 364.5 364.55
I am trying to match my data. Basically, each row in my first table should be matched to rows in the second table that have the same Prefix, Suffix, and Division Number.
So basically:
Prefix = LN_PFX
Suffix = LN_SFX
DivisionNum = OPER_MNT_DIV_CD
I also want FromMP and ToMP to be contained within SEG_BGN_MP and SEG_END_MP, i.e. SEG_BGN_MP <= FromMP < ToMP <= SEG_END_MP.
But I cannot seem to get my code to work. My second data table had some whitespace, so I removed it, and I converted OPER_MNT_DIV_CD from a string to an int to make comparison easier.
I also removed all the whitespace and uppercased every string in my first table.
But I cannot seem to get the matches I want.
import numpy as np
import pandas as pd
import pyodbc

x = pyodbc.connect("DSN=DBP1")
table1 = pd.read_csv("LineMaxOrder.csv")
s2 = "select oper_mnt_div_cd, trk_cls_nbr, ln_pfx, seg_bgn_mp, seg_end_mp, ln_sfx, crvtr_mn, crvtr_deg, xstg_elev from dcmctrk.crv_seg where trk_cls_nbr = 1 order by oper_mnt_div_cd, ln_pfx, ln_sfx"
table1 = table1.drop(table1.columns[[0]], axis=1)
dChange = pd.read_sql_query(s2, x)
# Strip and uppercase in one chain so the stripped values are not overwritten.
dChange["LN_PFX"] = dChange["LN_PFX"].str.strip().str.upper()
dChange["LN_SFX"] = dChange["LN_SFX"].str.strip().str.upper()
dChange["OPER_MNT_DIV_CD"] = dChange["OPER_MNT_DIV_CD"].astype(int)
dfObj2 = table1.select_dtypes(["object"])
table1[dfObj2.columns] = dfObj2.apply(lambda x: x.str.strip().str.upper())
table1 = table1.fillna('')
w = []
for idx, row in table1.iterrows():
    a = row[3]
    b = row[4]
    c = row[7]
    agu1 = row[5]
    agu2 = row[6]
    big = max(agu1, agu2)
    small = min(agu1, agu2)
    result = dChange[(dChange["OPER_MNT_DIV_CD"] == a) & (dChange["LN_PFX"] == b) & (dChange["LN_SFX"] == c)]
    if result.empty:
        continue
    # The filtered frame must be assigned back, otherwise the emptiness
    # check below still sees the unfiltered result.
    result = result[(result["SEG_BGN_MP"] <= small) & (result["SEG_END_MP"] >= big)]
    if result.empty:
        continue
    print(result)
    w.append(result)
Use pandas merge to combine the tables (as an inner, left, or right join), linking them on the matching fields, and then filter on the milepost containment condition.
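The merge-then-filter idea can be sketched like this (toy data and column names taken from the tables in the question; the min/max normalization handles FromMP/ToMP arriving in either order, as they do in the first table):

```python
import pandas as pd

# Toy versions of the two tables, using the column names from the question.
line_max = pd.DataFrame({
    "OrigNode": [7764, 25961],
    "DestNode": [50213, 7764],
    "DivisionNum": [4, 3],
    "Prefix": ["AB", "AB"],
    "FromMP": [362.71, 20.9],
    "ToMP": [362.83, 18.0],
    "Suffix": ["", ""],
})
crv_seg = pd.DataFrame({
    "OPER_MNT_DIV_CD": [4],
    "LN_PFX": ["AB"],
    "LN_SFX": [""],
    "SEG_BGN_MP": [362.7],
    "SEG_END_MP": [362.98],
})

# Inner join on the three matching key fields.
merged = line_max.merge(
    crv_seg,
    left_on=["DivisionNum", "Prefix", "Suffix"],
    right_on=["OPER_MNT_DIV_CD", "LN_PFX", "LN_SFX"],
    how="inner",
)

# FromMP/ToMP are not always ordered, so normalize before the containment test.
lo = merged[["FromMP", "ToMP"]].min(axis=1)
hi = merged[["FromMP", "ToMP"]].max(axis=1)
matches = merged[(merged["SEG_BGN_MP"] <= lo) & (merged["SEG_END_MP"] >= hi)]
print(matches)
```

This replaces the per-row loop with one join plus one vectorized boolean filter.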
I have a data frame like this:
> head(a)
FID IID FLASER PLASER DIABDUR HBA1C ESRD pheno
1 fam1000-03 G1000 1 1 38 10.2 1 control
2 fam1001-03 G1001 1 1 15 7.3 1 control
3 fam1003-03 G1003 1 2 17 7.0 1 case
4 fam1005-03 G1005 1 1 36 7.7 1 control
5 fam1009-03 G1009 1 1 23 7.6 1 control
6 fam1052-03 G1052 1 1 32 7.3 1 control
My df has 1698 obs of which 828 who have "case" in pheno column and 836 who have "control" in pheno column.
I make a histogram via:
library(ggplot2)
ggplot(a, aes(x=HBA1C, fill=pheno)) +
geom_histogram(binwidth=.5, position="dodge")
I would like the y-axis to show the percentage of individuals that have either "case" or "control" in pheno instead of the count, so the percentage would be calculated within each group ("case" or "control"). I also have NAs in my data, and it would be good to exclude those from the plot.
I guess I can remove NAs from pheno with this:
ggplot(data=subset(a, !is.na(pheno)), aes(x=HBA1C, fill=pheno)) + geom_histogram(binwidth=.5, position="dodge")
This can be achieved like so:
Note: concerning the NAs, you were right. Simply subset for non-NA values, or use dplyr::filter, etc.
a <- read.table(text = "id FID IID FLASER PLASER DIABDUR HBA1C ESRD pheno
1 fam1000-03 G1000 1 1 38 10.2 1 control
2 fam1001-03 G1001 1 1 15 7.3 1 control
3 fam1003-03 G1003 1 2 17 7.0 1 case
4 fam1005-03 G1005 1 1 36 7.7 1 control
5 fam1009-03 G1009 1 1 23 7.6 1 control
6 fam1052-03 G1052 1 1 32 7.3 1 control
7 fam1052-03 G1052 1 1 32 7.3 1 NA", header = TRUE)
library(ggplot2)
ggplot(a, aes(x=HBA1C, fill=pheno)) +
geom_histogram(aes(y = ..count.. / tapply(..count.., ..group.., sum)[..group..]),
position='dodge', binwidth=0.5) +
scale_y_continuous(labels = scales::percent)
Created on 2020-05-23 by the reprex package (v0.3.0)
I would like to convert the following dataframe into JSON.
df:
A sector B sector C sector
TTM Ratio -- 35.99 12.70 20.63 14.75 23.06
RRM Sales -- 114.57 1.51 5.02 1.00 4594.13
MQR book 1.48 2.64 1.02 2.46 2.73 2.74
TTR cash -- 14.33 7.41 15.35 8.59 513854.86
In order to do so with the function df.to_json(), I would need unique column names and indices.
Therefore, what I am looking for is to convert the column names into a row and use default integer column numbers. In short, I would like the following output:
df:
0 1 2 3 4 5
A sector B sector C sector
TTM Ratio -- 35.99 12.70 20.63 14.75 23.06
RRM Sales -- 114.57 1.51 5.02 1.00 4594.13
MQR book 1.48 2.64 1.02 2.46 2.73 2.74
TTR cash -- 14.33 7.41 15.35 8.59 513854.86
Turning the column names into the first row would let me make the conversion correctly.
You could also use vstack in numpy:
>>> df
x y z
0 8 7 6
1 6 5 4
>>> pd.DataFrame(np.vstack([df.columns, df]))
0 1 2
0 x y z
1 8 7 6
2 6 5 4
The columns become the actual first row in this case.
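Since the end goal was JSON, the vstack result can then go straight into to_json once the columns are plain integers (a small sketch with made-up numbers):

```python
import json

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [8, 6], "y": [7, 5], "z": [6, 4]})

# Push the column names down into the first data row; the new columns
# are the default integers 0, 1, 2, so both axes are now unique.
flat = pd.DataFrame(np.vstack([df.columns, df]))
out = flat.to_json()
print(out)
```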
Use assign by list of range and original column names:
print (range(len(df.columns)))
range(0, 6)
# for Python 2 the list() call can be omitted
df.columns = [list(range(len(df.columns))), df.columns]
Or MultiIndex.from_arrays:
df.columns = pd.MultiIndex.from_arrays([range(len(df.columns)), df.columns])
Also is possible use RangeIndex:
print (pd.RangeIndex(len(df.columns)))
RangeIndex(start=0, stop=6, step=1)
df.columns = pd.MultiIndex.from_arrays([pd.RangeIndex(len(df.columns)), df.columns])
print (df)
0 1 2 3 4 5
A sector B sector C sector
TTM Ratio -- 35.99 12.70 20.63 14.75 23.06
RRM Sales -- 114.57 1.51 5.02 1.00 4594.13
MQR book 1.48 2.64 1.02 2.46 2.73 2.74
TTR cash -- 14.33 7.41 15.35 8.59 513854.86
I am interested in sorting a grouped dataframe by the number of entries for each group. As far as I can see, I can either sort by the group labels or not at all. Say I have 10 entries that belong to three groups. Group A has 6 members, group B has three members, and group C has 1 member. Now when I e.g. do a grouped.describe(), I would like the output to be ordered so that the group with the most entries is shown first.
I would unstack the statistics from describe(), then you can simply sort, so:
import pandas as pd
from io import StringIO

incsv = StringIO("""Group,Value
B,1
B,2
B,3
C,8
A,5
A,10
A,15
A,25
A,35
A,40""")
df = pd.read_csv(incsv)
dfstats = df.groupby('Group').describe().unstack()
Value
count mean std min 25% 50% 75% max
Group
A 6 21.666667 14.023789 5 11.25 20 32.5 40
B 3 2.000000 1.000000 1 1.50 2 2.5 3
C 1 8.000000 NaN 8 8.00 8 8.0 8
dfstats.xs('Value', axis=1).sort_values('count', ascending=True)
count mean std min 25% 50% 75% max
Group
C 1 8.000000 NaN 8 8.00 8 8.0 8
B 3 2.000000 1.000000 1 1.50 2 2.5 3
A 6 21.666667 14.023789 5 11.25 20 32.5 40
I reversed the sort just for illustration because it was already sorted by default, but you can sort anyway you want of course.
Bonus for anyone who can sort by count without dropping or stacking the 'Value' level. :)
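As a possible answer to the bonus: in recent pandas versions, groupby().describe() already returns the statistics under MultiIndex columns, so you can sort on the ('Value', 'count') column directly without dropping the 'Value' level (a sketch assuming a modern pandas):

```python
import pandas as pd
from io import StringIO

incsv = StringIO("""Group,Value
B,1
B,2
B,3
C,8
A,5
A,10
A,15
A,25
A,35
A,40""")
df = pd.read_csv(incsv)

# describe() on a groupby gives MultiIndex columns like ('Value', 'count'),
# which sort_values can target directly as a tuple.
dfstats = df.groupby("Group").describe()
by_count = dfstats.sort_values(("Value", "count"))
print(by_count)
```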