Problem with websocket output into dataframe with pandas

I have a websocket connection to Binance in my script. The websocket runs forever, as usual. For my multiple-stream connection, I get each pair's output as a separate message.
For example, here is a sample output:
{'stream': 'reefusdt#kline_1m', 'data': {'e': 'kline', 'E': 1651837066242, 's': 'REEFUSDT', 'k': {'t': 1651837020000, 'T': 1651837079999, 's': 'REEFUSDT', 'i': '1m', 'f': 95484416, 'L': 95484538, 'o': '0.006620', 'c': '0.006631', 'h': '0.006631', 'l': '0.006619', 'v': '1832391', 'n': 123, 'x': False, 'q': '12138.640083', 'V': '930395', 'Q': '6164.398584', 'B': '0'}}}
{'stream': 'ethusdt#kline_1m', 'data': {'e': 'kline', 'E': 1651837066253, 's': 'ETHUSDT', 'k': {'t': 1651837020000, 'T': 1651837079999, 's': 'ETHUSDT', 'i': '1m', 'f': 1613620941, 'L': 1613622573, 'o': '2671.86', 'c': '2675.79', 'h': '2675.80', 'l': '2671.81', 'v': '1018.530', 'n': 1633, 'x': False, 'q': '2723078.35891', 'V': '702.710', 'Q': '1878876.68612', 'B': '0'}}}
{'stream': 'ancusdt#kline_1m', 'data': {'e': 'kline', 'E': 1651837066257, 's': 'ANCUSDT', 'k': {'t': 1651837020000, 'T': 1651837079999, 's': 'ANCUSDT', 'i': '1m', 'f': 10991664, 'L': 10992230, 'o': '2.0750', 'c': '2.0810', 'h': '2.0820', 'l': '2.0740', 'v': '134474.7', 'n': 567, 'x': False, 'q': '279289.07500', 'V': '94837.8', 'Q': '197006.89950', 'B': '0'}}}
Is there a way to reshape this output as shown below? The main struggle is that each output becomes a separate dataframe; I want to merge them into one single dataframe. Each message comes as a nested dict with two keys: "stream" and "data". "data" has 4 fields, and the last field "k" is another dict of 17 fields. I somehow managed to extract only "k":
json_message = json.loads(message)
result = json_message["data"]["k"]
and the sample output is:
{'t': 1651837560000, 'T': 1651837619999, 's': 'CTSIUSDT', 'i': '1m', 'f': 27238014, 'L': 27238039, 'o': '0.2612', 'c': '0.2606', 'h': '0.2613', 'l': '0.2605', 'v': '17057', 'n': 26, 'x': False, 'q': '4449.1499', 'V': '3185', 'Q': '831.2502', 'B': '0'}
{'t': 1651837560000, 'T': 1651837619999, 's': 'ETCUSDT', 'i': '1m', 'f': 421543741, 'L': 421543977, 'o': '27.420', 'c': '27.398', 'h': '27.430', 'l': '27.397', 'v': '2988.24', 'n': 237, 'x': False, 'q': '81936.97951', 'V': '1848.40', 'Q': '50688.14941', 'B': '0'}
{'t': 1651837560000, 'T': 1651837619999, 's': 'ETHUSDT', 'i': '1m', 'f': 1613645553, 'L': 1613647188, 'o': '2671.38', 'c': '2669.95', 'h': '2672.38', 'l': '2669.70', 'v': '777.746', 'n': 1636, 'x': False, 'q': '2077574.75281', 'V': '413.365', 'Q': '1104234.98707', 'B': '0'}
I want to merge these outputs into a single dataframe of 6 columns (and roughly 144 rows), closer to the screenshot provided below. The only difference is that my code creates a different dataframe for each output.

Create a list of your messages and build the dataframe from it in one step. Note that DataFrame.append is deprecated (and removed in pandas 2.0), so the list-of-dicts constructor (or pd.concat) is preferable:
message_list = [message1, message2, message3]
df = pd.DataFrame(message_list)  # one row per message dict
print(df)
t T s i f L o c h l v n x q V Q B
0 1651837560000 1651837619999 CTSIUSDT 1m 27238014 27238039 0.2612 0.2606 0.2613 0.2605 17057 26 False 4449.1499 3185 831.2502 0
1 1651837560000 1651837619999 ETCUSDT 1m 421543741 421543977 27.420 27.398 27.430 27.397 2988.24 237 False 81936.97951 1848.40 50688.14941 0
2 1651837560000 1651837619999 ETHUSDT 1m 1613645553 1613647188 2671.38 2669.95 2672.38 2669.70 777.746 1636 False 2077574.75281 413.365 1104234.98707 0
You can manipulate the dataframe later as needed.
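In the live setup from the question, messages arrive one at a time in the websocket callback, so a common pattern is to collect each parsed "k" dict into a plain list and build the dataframe once at the end. A minimal sketch, assuming an `on_message`-style callback (the callback name and the sample payloads are illustrative, not from the question):

```python
import json
import pandas as pd

rows = []  # one "k" dict per incoming stream message

def on_message(ws, message):
    # hypothetical websocket callback: parse the message and keep only "k"
    k = json.loads(message)["data"]["k"]
    rows.append(k)

# simulate two incoming messages (abbreviated payloads)
on_message(None, '{"stream": "x", "data": {"k": {"s": "CTSIUSDT", "o": "0.2612", "c": "0.2606"}}}')
on_message(None, '{"stream": "y", "data": {"k": {"s": "ETCUSDT", "o": "27.420", "c": "27.398"}}}')

df = pd.DataFrame(rows)  # one row per message, columns from the dict keys
print(df)
```

Building the dataframe from the accumulated list avoids growing a DataFrame row by row inside the callback, which is slow.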

Related

Nested Dict To DataFrame

Can anyone help?
msg = {'e': 'kline',
       'E': 1672157513375,
       's': 'BTCUSDT',
       'k': {
           't': 1672157460000,     # required; convert ms to datetime,
                                   # rename as Time, use as index
           'T': 1672157519999,
           's': 'BTCUSDT',
           'i': '1m',
           'f': 2388965371,
           'L': 2388969270,
           'o': '16787.32000000',  # required; rename as Open
           'c': '16783.23000000',  # required; rename as Close
           'h': '16789.41000000',  # required; rename as High
           'l': '16782.69000000',  # required; rename as Low
           'v': '149.27507000',    # required; rename as Volume
           'n': 3900,
           'x': False,
           'q': '2505669.98288240',
           'V': '59.70465000',
           'Q': '1002207.92308370',
           'B': '0'
       }}
Time   = k['t'], converted to datetime
Open   = k['o'], dtype float
High   = k['h'], dtype float
Low    = k['l'], dtype float
Close  = k['c'], dtype float
Volume = k['v'], dtype float
The index should be k['t'], converted from milliseconds to datetime.
Language: Python.
What I am trying:
def getdata(msg):
    frame = pd.DataFrame(msg)
    # this part I don't understand:
    frame = frame.loc[frame['k']['t'], frame['k']['t'], frame['k']['t'],
                      frame['k']['t'], frame['k']['t'], frame['k']['t']]
    # this part I somewhat understand:
    frame.columns = ["Time", "Open", "High", "Low", "Close", "Volume"]
    frame.set_index("Time", inplace=True)
    frame.index = pd.to_datetime(frame.index, unit='ms')
    frame = frame.astype(float)
    return frame

getdata(msg)
Required output:
Time Open High Low Close Volume
2022-12-27 16:11:00 16787.7 16789.4 16782.6 16783.2 149
<3
Using json_normalize():
df = (pd
.json_normalize(data=msg["k"])[["t", "o", "h", "l", "c", "v"]]
.rename(columns={"t": "Time", "o": "Open", "h": "High", "l": "Low", "c": "Close", "v": "Volume"})
)
df["Time"] = (
pd.to_datetime(df["Time"], unit="ms")
.dt.tz_localize("UTC")
.dt.tz_localize(None)
.dt.floor("S")
)
print(df)
Output:
Time Open High Low Close Volume
0 2022-12-27 16:11:00 16787.32000000 16789.41000000 16782.69000000 16783.23000000 149.27507000
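The numeric columns in this output are still strings. If the float dtype and the Time index from the requirements are also needed, a follow-up conversion along these lines should work (a sketch, rebuilding the frame from an abbreviated version of the question's "k" payload):

```python
import pandas as pd

# abbreviated "k" payload from the question
k = {"t": 1672157460000, "o": "16787.32000000", "h": "16789.41000000",
     "l": "16782.69000000", "c": "16783.23000000", "v": "149.27507000"}

df = (pd.json_normalize(data=k)[["t", "o", "h", "l", "c", "v"]]
      .rename(columns={"t": "Time", "o": "Open", "h": "High",
                       "l": "Low", "c": "Close", "v": "Volume"}))

df["Time"] = pd.to_datetime(df["Time"], unit="ms")
df = df.set_index("Time").astype(float)  # strings -> float, Time as index
print(df)
```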

creating nested object with pandas

I have an input DataFrame that looks something like this
input_data = {
'url1': ['https://my-website.com/product1', 'https://my-website.com/product1', 'https://my-website.com/product2', 'https://my-website.com/product2'],
'url2': ['https://not-my-website.com/product1', 'https://not-my-website.com/product1', 'https://not-my-website.com/product2', 'https://not-my-website.com/product2'],
'size': ['S', 'L', 'S', 'L'],
'used_price': [100, 110, 210, 220],
'new_price': [1000, 1100, 2100, 2200],
}
input_df = pd.DataFrame(data=input_data)
And I want to turn it into something that would look like this
output_data = {
'url1': ['https://my-website.com/product1', 'https://my-website.com/product2'],
'url2': ['https://not-my-website.com/product1', 'https://not-my-website.com/product2'],
'target': [
{
'S': {'used_price': 100, 'new_price': 1000},
'L': {'used_price': 110, 'new_price': 1100}
},
{
'S': {'used_price': 210, 'new_price': 2100},
'L': {'used_price': 220, 'new_price': 2200}
}
]
}
output_df = pd.DataFrame(data=output_data)
You can use groupby and apply:
(input_df.groupby(['url1', 'url2'])[['size', 'used_price', 'new_price']]
.apply(lambda d: d.set_index('size').T.to_dict())
.rename('target')
.reset_index()
)
output:
url1 url2 target
0 https://my-website.com/product1 https://not-my-website.com/product1 {'S': {'used_price': 100, 'new_price': 1000}, 'L': {'used_price': 110, 'new_price': 1100}}
1 https://my-website.com/product2 https://not-my-website.com/product2 {'S': {'used_price': 210, 'new_price': 2100}, 'L': {'used_price': 220, 'new_price': 2200}}
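As a quick sanity check, each target cell in the result is a plain nested dict, so individual prices can be read back with ordinary indexing (a self-contained sketch using one product from the question's data):

```python
import pandas as pd

input_df = pd.DataFrame({
    'url1': ['https://my-website.com/product1'] * 2,
    'url2': ['https://not-my-website.com/product1'] * 2,
    'size': ['S', 'L'],
    'used_price': [100, 110],
    'new_price': [1000, 1100],
})

out = (input_df.groupby(['url1', 'url2'])[['size', 'used_price', 'new_price']]
       .apply(lambda d: d.set_index('size').T.to_dict())
       .rename('target')
       .reset_index())

# each target cell is a nested dict keyed by size, then by price column
print(out.loc[0, 'target']['L']['new_price'])
```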

Pandas Outerjoin New Rows

I have two dataframes df1 and df2.
df1 = pd.DataFrame({
'Col1': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
'Col2': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'Col3': [100, 120, 130, 200, 190, 210],})
df2 = pd.DataFrame({
'Col1': ['abc', 'qrt', 'xyz', 'mas', 'apc', 'ywt'],
'Col2': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'Col4': [120, 140, 120, 200, 190, 210],})
I do an outer join on the two dataframes:
df = pd.merge(df1, df2[['Col1', 'Col4']], on= 'Col1', how='outer')
I get a new dataframe, but the Col2 entries coming from df2 are missing. I get:
df = pd.DataFrame({
'Col1': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat', 'mas', 'apc', 'ywt'],
'Col2': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses', 'NaN', 'NaN', 'NaN'],
'Col3': [100, 120, 130, 200, 190, 210, 'NaN', 'NaN', 'NaN'],
'Col4': [120, 140, 120, 'NaN', 'NaN', 'NaN', '200', '190', '210']})
But what I want is:
df = pd.DataFrame({
'Col1': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat', 'mas', 'apc', 'ywt'],
'Col2': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'Col3': [100, 120, 130, 200, 190, 210, 'NaN', 'NaN', 'NaN'],
'Col4': [120, 140, 120, 'NaN', 'NaN', 'NaN', '200', '190', '210']})
I want the Col2 entries from df2 to appear in the new rows of the merged dataframe.
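One way to get that result (a sketch, assuming Col2 carries the same kind of category labels in both frames) is to include Col2 in the join keys, so matched rows align on both columns and the rows unique to df2 keep their own Col2 values:

```python
import pandas as pd

df1 = pd.DataFrame({
    'Col1': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
    'Col2': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    'Col3': [100, 120, 130, 200, 190, 210],
})
df2 = pd.DataFrame({
    'Col1': ['abc', 'qrt', 'xyz', 'mas', 'apc', 'ywt'],
    'Col2': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    'Col4': [120, 140, 120, 200, 190, 210],
})

# joining on both Col1 and Col2 keeps Col2 populated for rows unique to df2
df = pd.merge(df1, df2, on=['Col1', 'Col2'], how='outer')
print(df)
```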

Pandas to dict conversion with condition

My dataframe:
data_part = [{'Part': 'A', 'Engine': True, 'TurboCharger': True, 'Restricted': True},
{'Part': 'B', 'Engine': False, 'TurboCharger': True, 'Restricted': False},]
My expect output is this:
{'A': {'Engine': 1, 'TurboCharger': 1, 'Restricted': 1},
'B': {'TurboCharger': 1}}
This is what I am doing:
df_part = pd.DataFrame(data_part).set_index('Part').astype(int).to_dict('index')
This is what it gives:
{'A': {'Engine': 1, 'TurboCharger': 1, 'Restricted': 1},
'B': {'Engine': 0, 'TurboCharger': 1, 'Restricted': 0}}
Is there anything that can be done to reach the expected output?
We can fix your output:
d = (pd.DataFrame(data_part)
     .set_index('Part')
     .astype(int)
     .stack()
     .loc[lambda x: x != 0]
     .reset_index('Part')
     .groupby('Part')
     .agg(dict)[0]
     .to_dict())
Out[192]:
{'A': {'Engine': 1, 'TurboCharger': 1, 'Restricted': 1},
'B': {'TurboCharger': 1}}
You can call agg before to_dict:
df_part = (pd.DataFrame(data_part).set_index('Part')
.agg(lambda x: dict(x[x].astype(int)), axis=1)
.to_dict())
Out[60]:
{'A': {'Engine': 1, 'Restricted': 1, 'TurboCharger': 1},
'B': {'TurboCharger': 1}}
Here's a way to convert the list to a dict without pandas:
from pprint import pprint

data_2 = dict()
for dp in data_part:
    ts = [(k, v) for k, v in dp.items()]
    key = ts[0][1]  # the 'Part' value, e.g. 'A' (relies on dict insertion order)
    values = {k: int(v) for k, v in ts[1:] if v}  # keep only truthy flags
    data_2[key] = values

pprint(data_2)
{'A': {'Engine': 1, 'Restricted': 1, 'TurboCharger': 1},
'B': {'TurboCharger': 1}}

pandas same attribute comparison

I have the following dataframe:
df = pd.DataFrame([{'name': 'a', 'label': 'false', 'score': 10},
{'name': 'a', 'label': 'true', 'score': 8},
{'name': 'c', 'label': 'false', 'score': 10},
{'name': 'c', 'label': 'true', 'score': 4},
{'name': 'd', 'label': 'false', 'score': 10},
{'name': 'd', 'label': 'true', 'score': 6},
])
I want to return the names whose "false"-label score is at least double the score of the "true" label. In my example, it should return only the name "c".
First you can pivot the data, then look at the ratio and filter what you want:
new_df = df.pivot(index='name',columns='label', values='score')
new_df[new_df['false'].div(new_df['true']).ge(2)]
output:
label false true
name
c 10 4
If you only want the names, you can do:
new_df.index[new_df['false'].div(new_df['true']).ge(2)].values
which gives
array(['c'], dtype=object)
Update: since your data is the result of orig_df.groupby(...).count(), you could instead go back to the original data and do:
orig_df['label'].eq('true').groupby(orig_df['name']).mean()
and look at the names with values <= 1/3 (a "true" share of at most 1/3 means false >= 2 * true).
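To illustrate the update, here is a sketch on hypothetical pre-aggregation data (one row per observation, so the per-name counts reproduce the scores in the question; the boolean Series is grouped by the name column explicitly):

```python
import pandas as pd

# hypothetical raw observations before groupby().count():
# per name, the number of 'false' rows is the false score, etc.
orig_df = pd.DataFrame({
    'name':  ['a'] * 18 + ['c'] * 14 + ['d'] * 16,
    'label': ['false'] * 10 + ['true'] * 8
           + ['false'] * 10 + ['true'] * 4
           + ['false'] * 10 + ['true'] * 6,
})

# fraction of 'true' rows per name; <= 1/3 means false >= 2 * true
frac = orig_df['label'].eq('true').groupby(orig_df['name']).mean()
print(frac[frac <= 1/3].index.tolist())  # prints ['c']
```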