Pandas Outerjoin New Rows - pandas

I have two dataframes df1 and df2.
df1 = pd.DataFrame({
'Col1': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
'Col2': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'Col3': [100, 120, 130, 200, 190, 210],})
df2 = pd.DataFrame({
'Col1': ['abc', 'qrt', 'xyz', 'mas', 'apc', 'ywt'],
'Col2': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'Col4': [120, 140, 120, 200, 190, 210],})
I do an outer join on the two dataframes:
df = pd.merge(df1, df2[['Col1', 'Col4']], on='Col1', how='outer')
I get a merged dataframe, but the Col2 entries for the rows coming from df2 are missing. I get
df = pd.DataFrame({
'Col1': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat', 'mas', 'apc', 'ywt'],
'Col2': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses', 'NaN', 'NaN', 'NaN'],
'Col3': [100, 120, 130, 200, 190, 210, 'NaN', 'NaN', 'NaN'],
'Col4': [120, 140, 120, 'NaN', 'NaN', 'NaN', 200, 190, 210]})
But what I want is:
df = pd.DataFrame({
'Col1': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat', 'mas', 'apc', 'ywt'],
'Col2': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'Col3': [100, 120, 130, 200, 190, 210, 'NaN', 'NaN', 'NaN'],
'Col4': [120, 140, 120, 'NaN', 'NaN', 'NaN', 200, 190, 210]})
I want the Col2 entries from df2 to appear in the new rows of the merged dataframe.
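One possible approach (a sketch, not part of the original post): since both frames carry the same Col2 labels, merging on both Col1 and Col2 lets the df2-only rows bring their Col2 values along instead of producing NaN:

```python
import pandas as pd

df1 = pd.DataFrame({
    'Col1': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
    'Col2': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    'Col3': [100, 120, 130, 200, 190, 210],
})
df2 = pd.DataFrame({
    'Col1': ['abc', 'qrt', 'xyz', 'mas', 'apc', 'ywt'],
    'Col2': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    'Col4': [120, 140, 120, 200, 190, 210],
})

# Merging on both key columns: shared (Col1, Col2) pairs join as before,
# while the df2-only rows ('mas', 'apc', 'ywt') keep their own Col2 labels.
df = pd.merge(df1, df2, on=['Col1', 'Col2'], how='outer')
```

Col3 is NaN for the df2-only rows and Col4 is NaN for the df1-only rows, matching the desired output.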


Problem with websocket output into dataframe with pandas

I have a websocket connection to Binance in my script. The websocket runs forever as usual. I get each pair's output as a separate message on my multi-stream connection.
for example here is the sample output:
{'stream': 'reefusdt#kline_1m', 'data': {'e': 'kline', 'E': 1651837066242, 's': 'REEFUSDT', 'k': {'t': 1651837020000, 'T': 1651837079999, 's': 'REEFUSDT', 'i': '1m', 'f': 95484416, 'L': 95484538, 'o': '0.006620', 'c': '0.006631', 'h': '0.006631', 'l': '0.006619', 'v': '1832391', 'n': 123, 'x': False, 'q': '12138.640083', 'V': '930395', 'Q': '6164.398584', 'B': '0'}}}
{'stream': 'ethusdt#kline_1m', 'data': {'e': 'kline', 'E': 1651837066253, 's': 'ETHUSDT', 'k': {'t': 1651837020000, 'T': 1651837079999, 's': 'ETHUSDT', 'i': '1m', 'f': 1613620941, 'L': 1613622573, 'o': '2671.86', 'c': '2675.79', 'h': '2675.80', 'l': '2671.81', 'v': '1018.530', 'n': 1633, 'x': False, 'q': '2723078.35891', 'V': '702.710', 'Q': '1878876.68612', 'B': '0'}}}
{'stream': 'ancusdt#kline_1m', 'data': {'e': 'kline', 'E': 1651837066257, 's': 'ANCUSDT', 'k': {'t': 1651837020000, 'T': 1651837079999, 's': 'ANCUSDT', 'i': '1m', 'f': 10991664, 'L': 10992230, 'o': '2.0750', 'c': '2.0810', 'h': '2.0820', 'l': '2.0740', 'v': '134474.7', 'n': 567, 'x': False, 'q': '279289.07500', 'V': '94837.8', 'Q': '197006.89950', 'B': '0'}}}
Is there a way to reshape this output as listed below? The main struggle is that each output ends up as a different dataframe, and I want to merge them into one single dataframe. Each output comes as a single nested dict with two keys: "stream" and "data". "data" has 4 fields in it, and the last field, "k", is another dict of 17 fields. I somehow managed to get only "k" out of it:
json_message = json.loads(message)
result = json_message["data"]["k"]
and sample output is:
{'t': 1651837560000, 'T': 1651837619999, 's': 'CTSIUSDT', 'i': '1m', 'f': 27238014, 'L': 27238039, 'o': '0.2612', 'c': '0.2606', 'h': '0.2613', 'l': '0.2605', 'v': '17057', 'n': 26, 'x': False, 'q': '4449.1499', 'V': '3185', 'Q': '831.2502', 'B': '0'}
{'t': 1651837560000, 'T': 1651837619999, 's': 'ETCUSDT', 'i': '1m', 'f': 421543741, 'L': 421543977, 'o': '27.420', 'c': '27.398', 'h': '27.430', 'l': '27.397', 'v': '2988.24', 'n': 237, 'x': False, 'q': '81936.97951', 'V': '1848.40', 'Q': '50688.14941', 'B': '0'}
{'t': 1651837560000, 'T': 1651837619999, 's': 'ETHUSDT', 'i': '1m', 'f': 1613645553, 'L': 1613647188, 'o': '2671.38', 'c': '2669.95', 'h': '2672.38', 'l': '2669.70', 'v': '777.746', 'n': 1636, 'x': False, 'q': '2077574.75281', 'V': '413.365', 'Q': '1104234.98707', 'B': '0'}
I want to merge these outputs into a single dataframe of 6 columns (and almost 144 rows), close to the screenshot provided below. The only difference is that my code creates a different dataframe for each output.
Create a list of your messages. Your messages list should be like below:
message_list = [message1, message2, message3]
Then build one dataframe from all of them. (DataFrame.append was deprecated and removed in pandas 2.0, so use pd.concat.)
df = pd.concat(
    [pd.DataFrame(m, index=[0]) for m in message_list],
    ignore_index=True,
)
print(df)
t T s i f L o c h l v n x q V Q B
0 1651837560000 1651837619999 CTSIUSDT 1m 27238014 27238039 0.2612 0.2606 0.2613 0.2605 17057 26 False 4449.1499 3185 831.2502 0
1 1651837560000 1651837619999 ETCUSDT 1m 421543741 421543977 27.420 27.398 27.430 27.397 2988.24 237 False 81936.97951 1848.40 50688.14941 0
2 1651837560000 1651837619999 ETHUSDT 1m 1613645553 1613647188 2671.38 2669.95 2672.38 2669.70 777.746 1636 False 2077574.75281 413.365 1104234.98707 0
You can manipulate the dataframe later as needed.
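One such manipulation (a hedged sketch, not from the original answer): the kline price and volume fields arrive as strings, so cast them to floats before doing any math:

```python
import pandas as pd

# One parsed kline dict, as produced by json_message["data"]["k"] above;
# prices and volumes are strings on the wire.
row = {'t': 1651837560000, 'T': 1651837619999, 's': 'CTSIUSDT', 'i': '1m',
       'f': 27238014, 'L': 27238039, 'o': '0.2612', 'c': '0.2606',
       'h': '0.2613', 'l': '0.2605', 'v': '17057', 'n': 26, 'x': False,
       'q': '4449.1499', 'V': '3185', 'Q': '831.2502', 'B': '0'}
df = pd.DataFrame([row])

# Cast the string-typed numeric columns so aggregations work on them.
num_cols = ['o', 'c', 'h', 'l', 'v', 'q', 'V', 'Q', 'B']
df[num_cols] = df[num_cols].astype(float)
```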

Make a color legend in Plotly graph objects

I have the two following codes with their output. One is done in graph objects and the other using plotly express. As you can see, the one in ‘go’ doesn’t have a legend, and the one in ‘px’ doesn’t have individual column width. So how can I either get a legend for the first one, or fix the width in the other?
import plotly.graph_objects as go
import pandas as pd
df = pd.DataFrame({'PHA': [451, 149, 174, 128, 181, 175, 184, 545, 131, 106, 1780, 131, 344, 624, 236, 224, 178, 277, 141, 171, 164, 410],
'PHA_cum': [451, 600, 774, 902, 1083, 1258, 1442, 1987, 2118, 2224, 4004, 4135, 4479, 5103, 5339, 5563, 5741, 6018, 6159, 6330, 6494, 6904],
'trans_cost_cum': [0.14, 0.36, 0.6, 0.99, 1.4, 2.07, 2.76, 3.56, 4.01, 4.5, 5.05, 5.82, 5.97, 6.13, 6.33, 6.53, 6.65, 6.77, 6.9, 7.03, 7.45, 7.9],
'Province': ['East', 'East', 'East', 'East', 'East', 'Lapland', 'Lapland', 'Lapland', 'Oulu', 'Oulu', 'Oulu', 'Oulu', 'South', 'South', 'South', 'South', 'West', 'West', 'West', 'West', 'West', 'West'],
})
col_list = {'South': 'rgb(222,203,228)',
'West': 'rgb(204,235,197)',
'East': 'rgb(255,255,204)',
'Oulu': 'rgb(179,205,227)',
'Lapland': 'rgb(254,217,166)'}
provs = df['Province'].to_list()
colors = [col_list.get(item, item) for item in provs]
fig = go.Figure(data=[go.Bar(
x=df['PHA_cum']-df['PHA']/2,
y=df['trans_cost_cum'],
width=df['PHA'],
marker_color=colors
)])
fig.show()
import plotly.express as px
import pandas as pd
df = pd.DataFrame({'PHA': [451, 149, 174, 128, 181, 175, 184, 545, 131, 106, 1780, 131, 344, 624, 236, 224, 178, 277, 141, 171, 164, 410],
'PHA_cum': [451, 600, 774, 902, 1083, 1258, 1442, 1987, 2118, 2224, 4004, 4135, 4479, 5103, 5339, 5563, 5741, 6018, 6159, 6330, 6494, 6904],
'trans_cost_cum': [0.14, 0.36, 0.6, 0.99, 1.4, 2.07, 2.76, 3.56, 4.01, 4.5, 5.05, 5.82, 5.97, 6.13, 6.33, 6.53, 6.65, 6.77, 6.9, 7.03, 7.45, 7.9],
'Province': ['East', 'East', 'East', 'East', 'East', 'Lapland', 'Lapland', 'Lapland', 'Oulu', 'Oulu', 'Oulu', 'Oulu', 'South', 'South', 'South', 'South', 'West', 'West', 'West', 'West', 'West', 'West'],
})
fig = px.bar(df,
x=df['PHA_cum']-df['PHA']/2,
y=df['trans_cost_cum'],
color="Province",
color_discrete_sequence=px.colors.qualitative.Pastel1
)
fig.show()
Using graph_objects you'll need to pass in each Province as a separate trace in order for the legend to populate. See below; the only real change is looping through the data per Province.
df = pd.DataFrame({'PHA': [451, 149, 174, 128, 181, 175, 184, 545, 131, 106, 1780, 131, 344, 624, 236, 224, 178, 277, 141, 171, 164, 410],
'PHA_cum': [451, 600, 774, 902, 1083, 1258, 1442, 1987, 2118, 2224, 4004, 4135, 4479, 5103, 5339, 5563, 5741, 6018, 6159, 6330, 6494, 6904],
'trans_cost_cum': [0.14, 0.36, 0.6, 0.99, 1.4, 2.07, 2.76, 3.56, 4.01, 4.5, 5.05, 5.82, 5.97, 6.13, 6.33, 6.53, 6.65, 6.77, 6.9, 7.03, 7.45, 7.9],
'Province': ['East', 'East', 'East', 'East', 'East', 'Lapland', 'Lapland', 'Lapland', 'Oulu', 'Oulu', 'Oulu', 'Oulu', 'South', 'South', 'South', 'South', 'West', 'West', 'West', 'West', 'West', 'West'],
})
col_list = {'South': 'rgb(222,203,228)',
'West': 'rgb(204,235,197)',
'East': 'rgb(255,255,204)',
'Oulu': 'rgb(179,205,227)',
'Lapland': 'rgb(254,217,166)'}
#provs = df['Province'].to_list()
#colors = [col_list.get(item, item) for item in provs]
fig = go.Figure()
for p in df['Province'].unique():
    dat = df[df.Province == p]
    fig.add_trace(go.Bar(
        name=p,
        x=dat['PHA_cum'] - dat['PHA'] / 2,
        y=dat['trans_cost_cum'],
        width=dat['PHA'],
        marker_color=col_list[p],
    ))
fig.show()

creating nested object with pandas

I have an input DataFrame that looks something like this
input_data = {
'url1': ['https://my-website.com/product1', 'https://my-website.com/product1', 'https://my-website.com/product2', 'https://my-website.com/product2'],
'url2': ['https://not-my-website.com/product1', 'https://not-my-website.com/product1', 'https://not-my-website.com/product2', 'https://not-my-website.com/product2'],
'size': ['S', 'L', 'S', 'L'],
'used_price': [100, 110, 210, 220],
'new_price': [1000, 1100, 2100, 2200],
}
input_df = pd.DataFrame(data=input_data)
And I want to turn it into something that would look like this
output_data = {
'url1': ['https://my-website.com/product1', 'https://my-website.com/product2'],
'url2': ['https://not-my-website.com/product1', 'https://not-my-website.com/product2'],
'target': [
{
'S': {'used_price': 100, 'new_price': 1000},
'L': {'used_price': 110, 'new_price': 1100}
},
{
'S': {'used_price': 210, 'new_price': 2100},
'L': {'used_price': 220, 'new_price': 2200}
}
]
}
output_df = pd.DataFrame(data=output_data)
You can use groupby and apply:
(input_df.groupby(['url1', 'url2'])[['size', 'used_price', 'new_price']]
.apply(lambda d: d.set_index('size').T.to_dict())
.rename('target')
.reset_index()
)
output:
url1 url2 target
0 https://my-website.com/product1 https://not-my-website.com/product1 {'S': {'used_price': 100, 'new_price': 1000}, 'L': {'used_price': 110, 'new_price': 1100}}
1 https://my-website.com/product2 https://not-my-website.com/product2 {'S': {'used_price': 210, 'new_price': 2100}, 'L': {'used_price': 220, 'new_price': 2200}}
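To sanity-check the nested column you can index straight into the dicts; a small self-contained sketch using the same groupby as above:

```python
import pandas as pd

input_df = pd.DataFrame({
    'url1': ['https://my-website.com/product1', 'https://my-website.com/product1',
             'https://my-website.com/product2', 'https://my-website.com/product2'],
    'url2': ['https://not-my-website.com/product1', 'https://not-my-website.com/product1',
             'https://not-my-website.com/product2', 'https://not-my-website.com/product2'],
    'size': ['S', 'L', 'S', 'L'],
    'used_price': [100, 110, 210, 220],
    'new_price': [1000, 1100, 2100, 2200],
})

output_df = (input_df.groupby(['url1', 'url2'])[['size', 'used_price', 'new_price']]
             .apply(lambda d: d.set_index('size').T.to_dict())
             .rename('target')
             .reset_index())

# Each 'target' cell is a plain dict keyed by size.
print(output_df.loc[0, 'target']['L']['new_price'])  # prints 1100
```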

How to stack a pd.DataFrame until it becomes pd.Series?

I have the following pd.DataFrame:
df = pd.DataFrame(
data=[['dog', 'kg', 100, 241], ['cat', 'lbs', 300, 1]],
columns=['animal', 'unit', 0, 1],
).set_index(['animal', 'unit'])
df.columns = pd.MultiIndex.from_tuples(list(zip(*[['2019', '2018'], ['Apr', 'Oct']])))
and I would like to convert it to a flat 2D frame, with no MultiIndex on either the rows or the columns:
pd.DataFrame(
data=[
['dog', 'kg', 100, '2019', 'Apr'],
['dog', 'kg', 241, '2018', 'Oct'],
['cat', 'lbs', 300, '2019', 'Apr'],
['cat', 'lbs', 1, '2018', 'Oct']
],
columns=['animal', 'unit', 'value', 'year', 'month']
)
To achieve this, I use df.stack().stack(), which yields a pd.Series, and then call .reset_index() on that Series to convert it back to a DataFrame.
My question is: how do I avoid the second (or further) stack() calls?
Is there a way to stack a pd.DataFrame until it becomes a pd.Series?
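One possible answer (a sketch, not part of the original thread): stack() accepts a list of levels, so passing every column level at once collapses the frame to a Series in a single call:

```python
import pandas as pd

df = pd.DataFrame(
    data=[['dog', 'kg', 100, 241], ['cat', 'lbs', 300, 1]],
    columns=['animal', 'unit', 0, 1],
).set_index(['animal', 'unit'])
df.columns = pd.MultiIndex.from_tuples(list(zip(*[['2019', '2018'], ['Apr', 'Oct']])))

# Stack all column levels in one call: the result is a Series no matter
# how many levels the columns have.
all_levels = list(range(df.columns.nlevels))
s = df.stack(all_levels)

out = s.rename('value').reset_index()
out.columns = ['animal', 'unit', 'year', 'month', 'value']
```

Recent pandas versions may emit a FutureWarning about the legacy stack implementation; the list-of-levels call itself is still supported.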

Filter pandas column based on frequency of occurrence

My df:
data = [
{'Part': 'A', 'Value': 10, 'Delivery': 10},
{'Part': 'B', 'Value': 12, 'Delivery': 8.5},
{'Part': 'C', 'Value': 10, 'Delivery': 10.1},
{'Part': 'D', 'Value': 10, 'Delivery': 10.3},
{'Part': 'E', 'Value': 11, 'Delivery': 9.2},
{'Part': 'F', 'Value': 15, 'Delivery': 7.3},
{'Part': 'G', 'Value': 10, 'Delivery': 10.1},
{'Part': 'H', 'Value': 12, 'Delivery': 8.1},
{'Part': 'I', 'Value': 12, 'Delivery': 8.0},
{'Part': 'J', 'Value': 10, 'Delivery': 10.2},
{'Part': 'K', 'Value': 8, 'Delivery': 12.5}
]
df = pd.DataFrame(data)
I wish to filter the given dataframe down to only the rows containing the most frequently occurring "Value".
Expected output:
data = [
{'Part': 'A', 'Value': 10, 'Delivery': 10},
{'Part': 'C', 'Value': 10, 'Delivery': 10.1},
{'Part': 'D', 'Value': 10, 'Delivery': 10.3},
{'Part': 'G', 'Value': 10, 'Delivery': 10.1},
{'Part': 'J', 'Value': 10, 'Delivery': 10.2}
]
df_output = pd.DataFrame(data)
is there any way to do this?
Use boolean indexing with Series.mode and select the first value by Series.iat:
df1 = df[df['Value'].eq(df['Value'].mode().iat[0])]
Or compare with the first index value of the Series created by Series.value_counts, because by default its values are sorted by count:
df1 = df[df['Value'].eq(df['Value'].value_counts().index[0])]
print(df1)
Part Value Delivery
0 A 10 10.0
2 C 10 10.1
3 D 10 10.3
6 G 10 10.1
9 J 10 10.2
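If several values could tie for most frequent, value_counts().index[0] keeps only one of them. A transform-based sketch (an addition, not part of the original answer) that keeps every row belonging to any maximally frequent value:

```python
import pandas as pd

df = pd.DataFrame([
    {'Part': 'A', 'Value': 10, 'Delivery': 10},
    {'Part': 'B', 'Value': 12, 'Delivery': 8.5},
    {'Part': 'C', 'Value': 10, 'Delivery': 10.1},
    {'Part': 'D', 'Value': 10, 'Delivery': 10.3},
    {'Part': 'E', 'Value': 11, 'Delivery': 9.2},
    {'Part': 'F', 'Value': 15, 'Delivery': 7.3},
    {'Part': 'G', 'Value': 10, 'Delivery': 10.1},
    {'Part': 'H', 'Value': 12, 'Delivery': 8.1},
    {'Part': 'I', 'Value': 12, 'Delivery': 8.0},
    {'Part': 'J', 'Value': 10, 'Delivery': 10.2},
    {'Part': 'K', 'Value': 8, 'Delivery': 12.5},
])

# Attach each row's group size, then keep rows whose count equals the max;
# if two values tie for most frequent, both groups survive.
counts = df.groupby('Value')['Value'].transform('size')
df1 = df[counts == counts.max()]
```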