How to extract the values from the json data frame with particular key - pandas

json_details
{'dob': '1981-06-30', 'name': 'T ', 'date': None, 'val': {'ENG': None, 'US': None}}
{'dob': '2001-09-27', 'name': 'A NGR', 'date': None}
{'dob': '2000-07-12', 'name': 'T B MV', 'date': None, 'val': {'ENG': None, 'US': None}}
{'dob': '1983-01-01', 'name': 'E K', 'date': None, 'val': {'ENG': None, 'US': 2034-11-18}}
{'dob': '1994-10-25', 'name': 'DF', 'date': None, 'val': {'ENG': '2034-11-18', 'US': None}}
Need to extract 2 keys from the json_details column. Some row have no val key if we apply it will throw key error and stop
df['json_details'][0]['ENG']
df['json_details'][0]['US']
Expected Out
df['json_details']['ENG']
None
No keys
None
None
2034-11-18
df['json_details']['US']
None
No keys
None
2034-11-18
None

solution is
df['new'] = df['json_details'].str['ENG']
df['new'] = df['json_details'].str['US']

Related

Problem with websocket output into dataframe with pandas

I have a websocket connection to binance in my script. The websocket runs forever as usual. I got each pair's output as seperate outputs for my multiple stream connection.
for example here is the sample output:
{'stream': 'reefusdt#kline_1m', 'data': {'e': 'kline', 'E': 1651837066242, 's': 'REEFUSDT', 'k': {'t': 1651837020000, 'T': 1651837079999, 's': 'REEFUSDT', 'i': '1m', 'f': 95484416, 'L': 95484538, 'o': '0.006620', 'c': '0.006631', 'h': '0.006631', 'l': '0.006619', 'v': '1832391', 'n': 123, 'x': False, 'q': '12138.640083', 'V': '930395', 'Q': '6164.398584', 'B': '0'}}}
{'stream': 'ethusdt#kline_1m', 'data': {'e': 'kline', 'E': 1651837066253, 's': 'ETHUSDT', 'k': {'t': 1651837020000, 'T': 1651837079999, 's': 'ETHUSDT', 'i': '1m', 'f': 1613620941, 'L': 1613622573, 'o': '2671.86', 'c': '2675.79', 'h': '2675.80', 'l': '2671.81', 'v': '1018.530', 'n': 1633, 'x': False, 'q': '2723078.35891', 'V': '702.710', 'Q': '1878876.68612', 'B': '0'}}}
{'stream': 'ancusdt#kline_1m', 'data': {'e': 'kline', 'E': 1651837066257, 's': 'ANCUSDT', 'k': {'t': 1651837020000, 'T': 1651837079999, 's': 'ANCUSDT', 'i': '1m', 'f': 10991664, 'L': 10992230, 'o': '2.0750', 'c': '2.0810', 'h': '2.0820', 'l': '2.0740', 'v': '134474.7', 'n': 567, 'x': False, 'q': '279289.07500', 'V': '94837.8', 'Q': '197006.89950', 'B': '0'}}}
is there a way to edit this output like listed below. Main struggle is each one of the outputs are different dataframes. I want to merge them into one single dataframe. Output comes as a single nested dict which has two columns: "stream" and "data". "Data" has 4 columns in it and the last column "k" is another dict of 17 columns. I somehow managed to get only "k" in it:
json_message = json.loads(message)
result = json_message["data"]["k"]
and sample output is:
{'t': 1651837560000, 'T': 1651837619999, 's': 'CTSIUSDT', 'i': '1m', 'f': 27238014, 'L': 27238039, 'o': '0.2612', 'c': '0.2606', 'h': '0.2613', 'l': '0.2605', 'v': '17057', 'n': 26, 'x': False, 'q': '4449.1499', 'V': '3185', 'Q': '831.2502', 'B': '0'}
{'t': 1651837560000, 'T': 1651837619999, 's': 'ETCUSDT', 'i': '1m', 'f': 421543741, 'L': 421543977, 'o': '27.420', 'c': '27.398', 'h': '27.430', 'l': '27.397', 'v': '2988.24', 'n': 237, 'x': False, 'q': '81936.97951', 'V': '1848.40', 'Q': '50688.14941', 'B': '0'}
{'t': 1651837560000, 'T': 1651837619999, 's': 'ETHUSDT', 'i': '1m', 'f': 1613645553, 'L': 1613647188, 'o': '2671.38', 'c': '2669.95', 'h': '2672.38', 'l': '2669.70', 'v': '777.746', 'n': 1636, 'x': False, 'q': '2077574.75281', 'V': '413.365', 'Q': '1104234.98707', 'B': '0'}
I want to merge these outputs into a single dataframe of 6 columns and (almost 144 rows) which is closer to ss provided below. The only difference is my code creates different dataframes for each output.
Create a list of your messages. Your messages list should be like below:
message_list = [message1,message2,message3]
df = pd.DataFrame()
for i in range(len(message_list)):
temp_df = pd.DataFrame(message_list[i], index=[i,])
df = df.append(temp_df, ignore_index = True)
print(df)
t T s i f L o c h l v n x q V Q B
0 1651837560000 1651837619999 CTSIUSDT 1m 27238014 27238039 0.2612 0.2606 0.2613 0.2605 17057 26 False 4449.1499 3185 831.2502 0
1 1651837560000 1651837619999 ETCUSDT 1m 421543741 421543977 27.420 27.398 27.430 27.397 2988.24 237 False 81936.97951 1848.40 50688.14941 0
2 1651837560000 1651837619999 ETHUSDT 1m 1613645553 1613647188 2671.38 2669.95 2672.38 2669.70 777.746 1636 False 2077574.75281 413.365 1104234.98707 0
You can manipulate the dataframe later as needed.

Converting dataframe to dictionary in pyspark without using pandas

Following up this question and dataframes, I am trying to convert a dataframe into a dictionary. In pandas I was using this:
dictionary = df_2.unstack().to_dict(orient='index')
However, I need to convert this code to pyspark. Can anyone help me with this? As I understand from previous questions such as this I would indeed need to use pandas, but the dataframe is way too big for me to be able to do this. How can I solve this?
EDIT:
I have now tried the following approach:
dictionary_list = map(lambda row: row.asDict(), df_2.collect())
dictionary = {age['age']: age for age in dictionary_list}
(reference) but it is not yielding what it is supposed to.
In pandas, what I was obtaining was the following:
df2 is the dataframe from the previous post. You can do a pivot first, and then convert to dictionary as described in your linked post.
import pyspark.sql.functions as F
df3 = df2.groupBy('age').pivot('siblings').agg(F.first('count'))
list_persons = [row.asDict() for row in df3.collect()]
dict_persons = {person['age']: person for person in list_persons}
{15: {'age': 15, '0': 1.0, '1': None, '3': None}, 10: {'age': 10, '0': None, '1': None, '3': 1.0}, 14: {'age': 14, '0': None, '1': 1.0, '3': None}}
Or another way:
df4 = df3.fillna(float('nan')).groupBy().pivot('age').agg(F.first(F.struct(*df3.columns[1:])))
result_dict = eval(df4.select(F.to_json(F.struct(*df4.columns))).head()[0])
{'10': {'0': 'NaN', '1': 'NaN', '3': 1.0}, '14': {'0': 'NaN', '1': 1.0, '3': 'NaN'}, '15': {'0': 1.0, '1': 'NaN', '3': 'NaN'}}

pandas same attribute comparison

I have the following dataframe:
df = pd.DataFrame([{'name': 'a', 'label': 'false', 'score': 10},
{'name': 'a', 'label': 'true', 'score': 8},
{'name': 'c', 'label': 'false', 'score': 10},
{'name': 'c', 'label': 'true', 'score': 4},
{'name': 'd', 'label': 'false', 'score': 10},
{'name': 'd', 'label': 'true', 'score': 6},
])
I want to return names that have the "false" label score value higher than the score value of the "true" label with at least the double. In my example, it should return only the "c" name.
First you can pivot the data, and look at the ratio, filter what you want:
new_df = df.pivot(index='name',columns='label', values='score')
new_df[new_df['false'].div(new_df['true']).gt(2)]
output:
label false true
name
c 10 4
If you only want the label, you can do:
new_df.index[new_df['false'].div(new_df['true']).gt(2)].values
which gives
array(['c'], dtype=object)
Update: Since your data is result of orig_df.groupby().count(), you could instead do:
orig_df['label'].eq('true').groupby('name').mean()
and look at the rows with values <= 1/3.

Jupyter .save_to_html function does not store config

I'm trying to use the .save_to_html() function for a kepler.gl jupyter notebook map.
It all works great inside jupyter, and I can re-load the same map with a defined config.
Where it goes wrong is when I use the save_to_html() function. The map will save to an html, but the configuration reverts to the basic configuration, before I customized it.
Please help! I love kepler, when I solve this little thing, it will be our absolute go-to tool
Thanks
Have tried to change the filters, colours, and point sizes. None of this works.
map_1 = KeplerGl(height=500, data={'data': df},config=config)
map_1
config = map_1.config
config
map_1.save_to_html(data={'data_1': df},
file_name='privateers.html',config=config)
Config
{'version': 'v1',
'config': {'visState': {'filters': [{'dataId': 'data',
'id': 'x8t9c53mf',
'name': 'time_update',
'type': 'timeRange',
'value': [1565687902187.5417, 1565775465282],
'enlarged': True,
'plotType': 'histogram',
'yAxis': None},
{'dataId': 'data',
'id': 'biysqlu36',
'name': 'user_id',
'type': 'multiSelect',
'value': ['HNc0SI3WsQfhOFRF2THnUEfmqJC3'],
'enlarged': False,
'plotType': 'histogram',
'yAxis': None}],
'layers': [{'id': 'ud6168',
'type': 'point',
'config': {'dataId': 'data',
'label': 'Point',
'color': [18, 147, 154],
'columns': {'lat': 'lat', 'lng': 'lng', 'altitude': None},
'isVisible': True,
'visConfig': {'radius': 5,
'fixedRadius': False,
'opacity': 0.8,
'outline': False,
'thickness': 2,
'strokeColor': None,
'colorRange': {'name': 'Uber Viz Qualitative 1.2',
'type': 'qualitative',
'category': 'Uber',
'colors': ['#12939A',
'#DDB27C',
'#88572C',
'#FF991F',
'#F15C17',
'#223F9A'],
'reversed': False},
'strokeColorRange': {'name': 'Global Warming',
'type': 'sequential',
'category': 'Uber',
'colors': ['#5A1846',
'#900C3F',
'#C70039',
'#E3611C',
'#F1920E',
'#FFC300']},
'radiusRange': [0, 50],
'filled': True},
'textLabel': [{'field': None,
'color': [255, 255, 255],
'size': 18,
'offset': [0, 0],
'anchor': 'start',
'alignment': 'center'}]},
'visualChannels': {'colorField': {'name': 'ride_id', 'type': 'string'},
'colorScale': 'ordinal',
'strokeColorField': None,
'strokeColorScale': 'quantile',
'sizeField': None,
'sizeScale': 'linear'}},
{'id': 'an8tbef',
'type': 'point',
'config': {'dataId': 'data',
'label': 'previous',
'color': [221, 178, 124],
'columns': {'lat': 'previous_lat',
'lng': 'previous_lng',
'altitude': None},
'isVisible': False,
'visConfig': {'radius': 10,
'fixedRadius': False,
'opacity': 0.8,
'outline': False,
'thickness': 2,
'strokeColor': None,
'colorRange': {'name': 'Global Warming',
'type': 'sequential',
'category': 'Uber',
'colors': ['#5A1846',
'#900C3F',
'#C70039',
'#E3611C',
'#F1920E',
'#FFC300']},
'strokeColorRange': {'name': 'Global Warming',
'type': 'sequential',
'category': 'Uber',
'colors': ['#5A1846',
'#900C3F',
'#C70039',
'#E3611C',
'#F1920E',
'#FFC300']},
'radiusRange': [0, 50],
'filled': True},
'textLabel': [{'field': None,
'color': [255, 255, 255],
'size': 18,
'offset': [0, 0],
'anchor': 'start',
'alignment': 'center'}]},
'visualChannels': {'colorField': None,
'colorScale': 'quantile',
'strokeColorField': None,
'strokeColorScale': 'quantile',
'sizeField': None,
'sizeScale': 'linear'}},
{'id': 'ilpixu9',
'type': 'arc',
'config': {'dataId': 'data',
'label': ' -> previous arc',
'color': [146, 38, 198],
'columns': {'lat0': 'lat',
'lng0': 'lng',
'lat1': 'previous_lat',
'lng1': 'previous_lng'},
'isVisible': True,
'visConfig': {'opacity': 0.8,
'thickness': 2,
'colorRange': {'name': 'Global Warming',
'type': 'sequential',
'category': 'Uber',
'colors': ['#5A1846',
'#900C3F',
'#C70039',
'#E3611C',
'#F1920E',
'#FFC300']},
'sizeRange': [0, 10],
'targetColor': None},
'textLabel': [{'field': None,
'color': [255, 255, 255],
'size': 18,
'offset': [0, 0],
'anchor': 'start',
'alignment': 'center'}]},
'visualChannels': {'colorField': None,
'colorScale': 'quantile',
'sizeField': None,
'sizeScale': 'linear'}},
{'id': 'inv52pp',
'type': 'line',
'config': {'dataId': 'data',
'label': ' -> previous line',
'color': [136, 87, 44],
'columns': {'lat0': 'lat',
'lng0': 'lng',
'lat1': 'previous_lat',
'lng1': 'previous_lng'},
'isVisible': False,
'visConfig': {'opacity': 0.8,
'thickness': 2,
'colorRange': {'name': 'Global Warming',
'type': 'sequential',
'category': 'Uber',
'colors': ['#5A1846',
'#900C3F',
'#C70039',
'#E3611C',
'#F1920E',
'#FFC300']},
'sizeRange': [0, 10],
'targetColor': None},
'textLabel': [{'field': None,
'color': [255, 255, 255],
'size': 18,
'offset': [0, 0],
'anchor': 'start',
'alignment': 'center'}]},
'visualChannels': {'colorField': None,
'colorScale': 'quantile',
'sizeField': None,
'sizeScale': 'linear'}}],
'interactionConfig': {'tooltip': {'fieldsToShow': {'data': ['time_ride_start',
'user_id',
'ride_id']},
'enabled': True},
'brush': {'size': 0.5, 'enabled': False}},
'layerBlending': 'normal',
'splitMaps': []},
'mapState': {'bearing': 0,
'dragRotate': False,
'latitude': 49.52565611453996,
'longitude': 6.2730441822977845,
'pitch': 0,
'zoom': 9.244725880765998,
'isSplit': False},
'mapStyle': {'styleType': 'dark',
'topLayerGroups': {},
'visibleLayerGroups': {'label': True,
'road': True,
'border': False,
'building': True,
'water': True,
'land': True,
'3d building': False},
'threeDBuildingColor': [9.665468314072013,
17.18305478057247,
31.1442867897876],
'mapStyles': {}}}}
Expected:
Fully configurated map as in Jupyter widget
Actuals
Colors and filters are not configured. Size and position of map is sent along, so if I store it looking at an empty area, when I open the html file it looks at the same field
In the Jupyter user guide for kepler.gl under the save section
# this will save current map
map_1.save_to_html(file_name='first_map.html')
# this will save map with provided data and config
map_1.save_to_html(data={'data_1': df}, config=config, file_name='first_map.html')
# this will save map with the interaction panel disabled
map_1.save_to_html(file_name='first_map.html', read_only=True)
So it looks like its a bug if the configuration parameter doesn't work or you are making the changes to the map configure after you set it equal to config. This would be fixed if you set
map_1.save_to_html(data={'data_1': df},
file_name='privateers.html',config=map_1.config)
I think it is a bug (or feature?) happens when you use the same cell to save the map configuration or still not print the map out yet. Generally, the config only exists after you really print the map out.
the problem, as far as I see it and how I solved at a similar problem is, that you 1) named your 'data' key in instancing the map different than you told it to save in the HTML 2).
map_1 = KeplerGl(height=500, data={'data': df},config=config)
map_1.save_to_html(data={'data_1': df}, file_name='privateers.html',config=config)
Name both keys the same and your HTML file will use the correct configuration.
Had this issue as well. Solved it by converting all pandas column dtypes to those that are json serializable: i.e. converting 'datetime' column from dtype <m8[ns] to object.

populating nested dictionaries with rows from Pandas data frame

I'm trying to populate a dictionary of dictionaries with entries from a Pandas data frame in Python by iterating through the nested dictionary and populating the values of each sub-dictionary with entries from a row of a Pandas data frame.
Although there are as many sub-dictionaries as there are rows in the data frame, all dictionaries get populated with the data from the last row of the data frame, instead of using every row for every dictionary.
Here is a toy reproducible example.
import pandas as pd
# initialize an empty df
data = pd.DataFrame()
# populate data frame with entries
data['name'] = ['Joe Smith', 'Mary James', 'Charles Williams']
data['school'] = ["Jollywood Secondary", "Northgate Sixth From", "Brompton High"]
data['subjects'] = [['Maths', 'Art', 'Biology'], ['English', 'French', 'History'], ['Chemistry', 'Biology', 'English']]
# use dictionary comprehensions to set up main dictionary and sub-dictionary templates
# sub-dictionary
keys = ['name', 'school', 'subjects']
record = {key: None for key in keys}
# main dictionary
keys2 = ['cand1', 'cand2', 'cand3']
candidates = {key: record for key in keys2}
# as a result i get something like this
# {'cand1': {'name': None, 'school': None, 'subjects': None},
# 'cand2': {'name': None, 'school': None, 'subjects': None},
# 'cand3': {'name': None, 'school': None, 'subjects': None}}
# iterate through main dictionary and populate each sub-dict with row of df
for i, d in enumerate(candidates.items()):
d[1]['name'] = data['name'].iloc[i]
d[1]['school'] = data['school'].iloc[i]
d[1]['subjcts'] = data['subjects'].iloc[i]
# what i end up with is the last row entry in each sub-dictionary
#{'cand1': {'name': 'Charles Williams',
# 'school': 'Brompton High',
# 'subjects': None,
# 'subjcts': ['Chemistry', 'Biology', 'English']},
# 'cand2': {'name': 'Charles Williams',
# 'school': 'Brompton High',
# 'subjects': None,
# 'subjcts': ['Chemistry', 'Biology', 'English']},
# 'cand3': {'name': 'Charles Williams',
# 'school': 'Brompton High',
# 'subjects': None,
# 'subjcts': ['Chemistry', 'Biology', 'English']}}
How do I need to modify my code to get each dictionary populated with a different row from my data frame?
I did not work through your code to look for the bug, because the solution is a one-liner with the method to_dict.
Here is a minimal working example with your sample data.
import pandas as pd
# initialize an empty df
data = pd.DataFrame()
# populate data frame with entries
data['name'] = ['Joe Smith', 'Mary James', 'Charles Williams']
data['school'] = ["Jollywood Secondary", "Northgate Sixth From", "Brompton High"]
data['subjects'] = [['Maths', 'Art', 'Biology'], ['English', 'French', 'History'], ['Chemistry', 'Biology', 'English']]
# redefine index to match your keys
data.index = ['cand{}'.format(i) for i in range(1,len(data)+1)]
# convert to dict
data_dict = data.to_dict(orient='index')
print(data_dict)
This will look something like this
{'cand1': {
'name': 'Joe Smith',
'school': 'Jollywood Secondary',
'subjects': ['Maths', 'Art', 'Biology']},
'cand2': {
'name': 'Mary James',
'school': 'Northgate Sixth From',
'subjects': ['English', 'French', 'History']},
'cand3': {
'name': 'Charles Williams',
'school': 'Brompton High',
'subjects': ['Chemistry', 'Biology', 'English']}}
Consider avoiding the roundabout away of building dictionary as Pandas maintains various methods to render nested structures such as to_dict and to_json. Specifically, consider adding a new column, cand and set it as index for to_dict output:
data['cand'] = 'cand' + pd.Series((data.index.astype('int') + 1).astype('str'))
mydict = data.set_index('cand').to_dict(orient='index')
print(mydict)
{'cand1': {'name': 'Joe Smith', 'school': 'Jollywood Secondary',
'subjects': ['Maths', 'Art', 'Biology']},
'cand2': {'name': 'Mary James', 'school': 'Northgate Sixth From',
'subjects': ['English', 'French', 'History']},
'cand3': {'name': 'Charles Williams', 'school': 'Brompton High',
'subjects': ['Chemistry', 'Biology', 'English']}}