Compare and append two dataframes - pandas

I have two dataframes, df1 and df2, and both have a Header Request column.
For each Header Request value in df1, I want to check whether df2 has a row with the same value and, if so, append that row.
df1 = [{'GUID': 'login',
'sent': True,
'Header Request': '2671257824',
'count': 314},
{'GUID': 'login',
'sent': True,
'Header Request': '2700603520',
'count': 441}]
df2 = [{'GUID': 'Res',
'sent': False,
'Header Request': '2671257824',
'count': 318},
{'GUID': 'Res',
'sent': False,
'Header Request': '2700603520',
'count': 445}
]
df3 = [{'GUID': 'login',
'sent': True,
'Header Request': '2671257824',
'count': 314},
{'GUID': 'Res',
'sent': False,
'Header Request': '2671257824',
'count': 318},
{'GUID': 'login',
'sent': True,
'Header Request': '2700603520',
'count': 441},
{'GUID': 'Res',
'sent': False,
'Header Request': '2700603520',
'count': 445}
]
The output should look like df3 above. This is what I tried:
df3 = pd.merge(df1, df2, how='inner', left_on='Header Request')

You probably want a left join instead:
df3 = pd.merge(df1, df2, how='left', on='Header Request')
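Note that a merge places matching rows side by side (overlapping columns get _x and _y suffixes) rather than stacking them. If the goal is the row-wise layout shown in df3, one option is to filter df2 to the request IDs present in df1 and concatenate. This is a minimal sketch, assuming the lists of dicts are first loaded into DataFrames; the variable name matches is only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame([
    {'GUID': 'login', 'sent': True, 'Header Request': '2671257824', 'count': 314},
    {'GUID': 'login', 'sent': True, 'Header Request': '2700603520', 'count': 441},
])
df2 = pd.DataFrame([
    {'GUID': 'Res', 'sent': False, 'Header Request': '2671257824', 'count': 318},
    {'GUID': 'Res', 'sent': False, 'Header Request': '2700603520', 'count': 445},
])

# Keep only the df2 rows whose request ID also appears in df1
matches = df2[df2['Header Request'].isin(df1['Header Request'])]

# Stack the rows, then group them by request ID; mergesort is stable,
# so each df1 row stays ahead of its matching df2 row
df3 = (
    pd.concat([df1, matches])
    .sort_values('Header Request', kind='mergesort')
    .reset_index(drop=True)
)
print(df3)
```

The IDs are strings here, so the sort is lexicographic; that is fine for grouping equal IDs together.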

Related

How to combine two Pandas dataframes into a single one across the axis=2 (ie. so that the cell values are tuples)?

I have two (large) dataframes. They have the same index & columns, and I want to combine them so that they have tuple values in each cell.
The example explains it best:
df1 = pd.DataFrame({
'A':[True, True, False],
'B':[False, True, False],
})
df2 = pd.DataFrame({
'A':[1, 2, 3],
'B':[5, 6, 7],
})
# Desired output:
pd.DataFrame({
'A':[(True, 1), (True, 2), (False, 3)],
'B':[(False, 5), (True, 6), (False, 7)],
})
The DataFrames are large (1m rows+), so looking to do this somewhat efficiently.
I tried np.stack([df1.values, df2.values], axis=2) and that got me the right value array, but I could not convert it into a dataframe.
Any ideas?
I got your desired output with this solution:
import pandas as pd
df1 = pd.DataFrame({
'A':[True, True, False],
'B':[False, True, False],
})
df2 = pd.DataFrame({
'A':[1, 2, 3],
'B':[5, 6, 7],
})
# Pair each column of df1 with the matching column of df2 (this overwrites df1 in place)
for df_1k, df_2k in zip(df1.columns, df2.columns):
    df1[df_1k] = list(map(tuple, zip(df1[df_1k], df2[df_2k])))
print(df1)
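If overwriting df1 is not wanted, the same column-wise zip can build a new frame. This is a sketch under the assumption that both frames share the same columns and row order; the name combined is only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'A': [True, True, False],
    'B': [False, True, False],
})
df2 = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [5, 6, 7],
})

# Build one list of (df1 value, df2 value) tuples per column,
# leaving both input frames untouched
combined = pd.DataFrame(
    {col: list(zip(df1[col], df2[col])) for col in df1.columns},
    index=df1.index,
)
print(combined)
```

The resulting columns have object dtype, so vectorized operations on them will be slow for 1m+ rows regardless of how the frame is built.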

Repeat elements from one array based on another

Given
a = np.array([1,2,3,4,5,6,7,8])
b = np.array(['a','b','c','d','e','f','g','h'])
c = np.array([1,1,1,4,4,4,8,8])
where a & b 'correspond' to each other, how can I use c to slice b to get d which 'corresponds' to c:
d = np.array(['a','a','a','d','d','d','h','h'])
I know how to do this by looping
d = np.empty_like(b)
for n in range(a.shape[0]):
    d[n] = b[np.argmax(a==c[n])]
but want to know if I can do this without loops.
Thanks in advance!
Since a here is just position + 1, you can simply use
In [33]: b[c - 1]
Out[33]: array(['a', 'a', 'a', 'd', 'd', 'd', 'h', 'h'], dtype='<U1')
I'm tempted to leave it at that, since the a example isn't enough to distinguish it from the argmax approach.
But we can test all a against all c with:
In [36]: a[:,None]==c
Out[36]:
array([[ True, True, True, False, False, False, False, False],
[False, False, False, False, False, False, False, False],
[False, False, False, False, False, False, False, False],
[False, False, False, True, True, True, False, False],
[False, False, False, False, False, False, False, False],
[False, False, False, False, False, False, False, False],
[False, False, False, False, False, False, False, False],
[False, False, False, False, False, False, True, True]])
In [37]: (a[:,None]==c).argmax(axis=0)
Out[37]: array([0, 0, 0, 3, 3, 3, 7, 7])
In [38]: b[_]
Out[38]: array(['a', 'a', 'a', 'd', 'd', 'd', 'h', 'h'], dtype='<U1')
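The broadcasted comparison above builds a len(a) x len(c) boolean matrix, which can get large. As an alternative sketch, np.searchsorted can look up each value of c in a directly. This assumes every value in c actually occurs in a (otherwise the lookup silently returns a neighbouring position or an out-of-range index); the names order and idx are only for illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8])
b = np.array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
c = np.array([1, 1, 1, 4, 4, 4, 8, 8])

# argsort lets the lookup work even when a is not already sorted
order = np.argsort(a)

# Position of each c value within sorted a, mapped back to original positions
idx = order[np.searchsorted(a, c, sorter=order)]

d = b[idx]
print(d)
```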

Choropleth Plotly calculation & dropdown

I'm trying to make a simple choropleth map for a branch network in a country with Plotly Express. My aim is a map that shows the total fee amount by city and can be broken down by fee name. When I run the code, the map renders and is colored, but I can't see the sum and couldn't work out how to get it. I also wasn't able to break it down by fee type. Any suggestions?
I've searched the forums but couldn't find any answers. I'm also a beginner and built my code from exercises I found on the internet.
Thanks in advance.
from urllib.request import urlopen
import json
with urlopen("https://raw.githubusercontent.com/Babolius/project/62fef3b31fa9e34afb055e493de107d89a50a889/tr-cities-utf8.json") as response:
    id = json.load(response)
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Babolius/project/62fef3b31fa9e34afb055e493de107d89a50a889/komisyon5.csv",encoding ='utf8', dtype={"Fee": int})
import plotly.express as px
fig = px.choropleth_mapbox(df, geojson= id, locations='Id', color= "Fee",
color_continuous_scale="Viridis",
range_color=(0, 5000),
mapbox_style="carto-darkmatter",
zoom=3, center = {"lat": 41.0902, "lon": 28.7129},
opacity=0.5,
)
dropdown_buttons =[{'label': 'A', 'method' : 'restyle', 'args': [{'visible': [True, False, False]}, {'title': 'A'}]},
{'label': 'B', 'method' : 'restyle', 'args': [{'visible': [False, True, False]}, {'title': 'B'}]},
{'label': 'C', 'method' : 'restyle', 'args': [{'visible': [True, False, True]}, {'title': 'C'}]}]
fig.update_layout({'updatemenus':[{'type': 'dropdown', 'showactive': True, 'active': 0, 'buttons': dropdown_buttons}]})
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
I found the solution and am posting it for anyone who has the same problem. The problem was not in the code but in the data: if you are using Plotly Express, the CSV needs the information for each category on its own row, so you need to use long-format data.
I adjusted the CSV, updated the dropdown, and it worked just fine.
This is the post that helped me solve my problem: https://towardsdatascience.com/visualization-with-plotly-express-comprehensive-guide-eb5ee4b50b57
from urllib.request import urlopen
import json
with urlopen("https://raw.githubusercontent.com/Babolius/project/62fef3b31fa9e34afb055e493de107d89a50a889/tr-cities-utf8.json") as response:
    id = json.load(response)
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Babolius/project/main/komisyon5.csv",encoding ='utf8', dtype={"Toplam": int})
df.groupby(['ID']).sum()  # note: the result is not assigned, so df itself is unchanged
import plotly.express as px
fig = px.choropleth_mapbox(df, geojson= id, locations= 'ID', color= "Toplam",
color_continuous_scale="Viridis",
range_color=(0, 1000000),
mapbox_style="carto-darkmatter",
zoom=3, center = {"lat": 41.0902, "lon": 28.7129},
opacity=0.5,
)
dropdown_buttons =[{'label': 'A', 'method' : 'restyle', 'args': [{'z': [df["A"]]}, {'visible': [True, False, False, False, False, False]}, {'title': 'A'}]},
{'label': 'B', 'method' : 'restyle', 'args': [{'z': [df["B"]]}, {'visible': [False, True, False, False, False, False]}, {'title': 'B'}]},
{'label': 'C', 'method' : 'restyle', 'args': [{'z': [df["C"]]}, {'visible': [False, False, True, False, False, False]}, {'title': 'C'}]},
{'label': 'D', 'method' : 'restyle', 'args': [{'z': [df["D"]]}, {'visible': [False, False, False, True, False, False]}, {'title': 'D'}]},
{'label': 'E', 'method' : 'restyle', 'args': [{'z': [df["E"]]}, {'visible': [False, False, False, False, True, False]}, {'title': 'E'}]},
{'label': 'Toplam', 'method' : 'restyle', 'args': [{'z': [df["Toplam"]]}, {'visible': [False, False, False, False, False, True]}, {'title': 'Toplam'}]}]
fig.update_layout({'updatemenus':[{'type': 'dropdown', 'showactive': True, 'active': 0, 'buttons': dropdown_buttons}]})
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
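For readers unsure what long-format data means here, the reshaping can be sketched with pandas alone. The table and the column names ID, FeeName and Fee below are hypothetical and are not taken from the CSV used above; the sketch only shows how a wide table with one column per fee type becomes one row per city and fee type.

```python
import pandas as pd

# Hypothetical wide table: one row per city ID, one column per fee type
wide = pd.DataFrame({
    "ID": [1, 2],
    "A": [100, 200],
    "B": [10, 20],
})

# Long format: one row per (city, fee type) pair
long = wide.melt(id_vars="ID", var_name="FeeName", value_name="Fee")

# Total fee per city, computed from the long table
totals = long.groupby("ID", as_index=False)["Fee"].sum()

print(long)
print(totals)
```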

Set index for aggregated dataframe

I did some calculations on a list of dataframes. I'd like the resulting dataframe to use a RangeIndex. However, it uses one of the column names as the index, even though I set index=None.
d1 = {'id': [1, 2, 3, 4, 5], 'is_free': [True, False, False, True, True], 'level': ['Top', 'Mid', 'Top', 'Top', 'Low']}
d2 = {'id': [1, 3, 4, 5, 7], 'is_free': [True, True, False, False, False], 'level': ['Top', 'High', 'Top', 'Top', 'Low']}
d1 = pd.DataFrame(data=d1)
d2 = pd.DataFrame(data=d2)
df_list = [d1, d2]
dfs = []
for i, df in enumerate(df_list):
    df = df.groupby('is_free')['id'].count()
    dfs.append(df)
df = pd.DataFrame(data=dfs, index=None)
It returns
is_free  False  True
id           2     3
id           3     2
df.index returns
Index(['id', 'id'], dtype='object')
From your code:
df = pd.DataFrame(data=dfs, index=None).reset_index(drop=True)
However, in general, I would avoid appending iteratively. Try concat:
pd.concat({i: d.groupby('is_free')['id'].count()
           for i, d in enumerate(df_list)},
          axis=1).T
Or use pd.DataFrame:
pd.DataFrame({i: d.groupby('is_free')['id'].count()
              for i, d in enumerate(df_list)}).T
Output:
is_free  False  True
0            2     3
1            3     2
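The reason index=None has no effect is that each grouped result is a Series named 'id', and pd.DataFrame uses those Series names as row labels. A minimal end-to-end sketch of the reset_index fix, using the question's data (the names counts and result are only for illustration):

```python
import pandas as pd

d1 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'is_free': [True, False, False, True, True],
                   'level': ['Top', 'Mid', 'Top', 'Top', 'Low']})
d2 = pd.DataFrame({'id': [1, 3, 4, 5, 7],
                   'is_free': [True, True, False, False, False],
                   'level': ['Top', 'High', 'Top', 'Top', 'Low']})
df_list = [d1, d2]

# One Series of counts per input frame; each Series is named 'id'
counts = [d.groupby('is_free')['id'].count() for d in df_list]

# Each Series becomes a row labelled 'id'; dropping that index
# leaves a fresh RangeIndex
result = pd.DataFrame(counts).reset_index(drop=True)
print(result)
print(result.index)
```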

pandas same attribute comparison

I have the following dataframe:
df = pd.DataFrame([{'name': 'a', 'label': 'false', 'score': 10},
{'name': 'a', 'label': 'true', 'score': 8},
{'name': 'c', 'label': 'false', 'score': 10},
{'name': 'c', 'label': 'true', 'score': 4},
{'name': 'd', 'label': 'false', 'score': 10},
{'name': 'd', 'label': 'true', 'score': 6},
])
I want to return the names whose "false" label score is at least double the score of the "true" label. In my example, it should return only the name "c".
First, pivot the data, then compute the ratio and filter on it (ge(2) keeps ratios of at least 2):
new_df = df.pivot(index='name', columns='label', values='score')
new_df[new_df['false'].div(new_df['true']).ge(2)]
output:
label  false  true
name
c         10     4
If you only want the names, you can do:
new_df.index[new_df['false'].div(new_df['true']).ge(2)].values
which gives
array(['c'], dtype=object)
Update: Since your data is the result of orig_df.groupby().count(), you could instead do:
orig_df['label'].eq('true').groupby(orig_df['name']).mean()
and look at the rows with values <= 1/3.
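The update can be sketched end to end as follows. The orig_df here is hypothetical: it assumes one row per observation with name and label columns, built so that the per-label counts equal the scores in the question. The grouping key is passed as the name column itself, because the boolean Series has no 'name' key of its own to group by.

```python
import pandas as pd

# Hypothetical raw data: one row per observation, counts match the question's scores
rows = []
for name, n_false, n_true in [('a', 10, 8), ('c', 10, 4), ('d', 10, 6)]:
    rows += [{'name': name, 'label': 'false'}] * n_false
    rows += [{'name': name, 'label': 'true'}] * n_true
orig_df = pd.DataFrame(rows)

# Share of "true" rows per name; false >= 2 * true is the same as share <= 1/3
true_share = orig_df['label'].eq('true').groupby(orig_df['name']).mean()

selected = true_share.index[true_share.le(1 / 3)].tolist()
print(selected)
```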