Convert a dictionary within a list to rows in pandas

I currently have a data frame where the "listing" column holds a list of dictionaries, and I would like to explode that column into rows, using the keys of the dictionaries as column names. Ideally the resulting data frame would look like this:
eventId listingId currentPrice
103337923 1307675567 ...
103337923 1307675567 ...
103337923 1307675567 ...
This is what I currently get from print(listing_df.head(3).to_dict()).

There is surely a better way to do this, but this works. :)
import pandas as pd

df1 = pd.DataFrame(
    {"a": [1, 2, 3, 4],
     "b": [5, 6, 7, 8],
     "c": [[{"x": 17, "y": 18, "z": 19}, {"x": 27, "y": 28, "z": 29}],
           [{"x": 37, "y": 38, "z": 39}, {"x": 47, "y": 48, "z": 49}],
           [{"x": 57, "y": 58, "z": 59}, {"x": 27, "y": 68, "z": 69}],
           [{"x": 77, "y": 78, "z": 79}, {"x": 27, "y": 88, "z": 89}]]})
Now you can create a new DataFrame from the above:
df2 = pd.DataFrame(columns=df1.columns)
df2_index = 0
for row in df1.iterrows():
    one_row = row[1]
    for list_value in row[1]["c"]:
        one_row["c"] = list_value
        df2.loc[df2_index] = one_row
        df2_index += 1
The output now has one row per list element.
Now that we have expanded the list into separate rows, you can further expand the dicts into columns with:
df2[list(df2["c"].head(1).tolist()[0].keys())] = df2["c"].apply(
    lambda x: pd.Series([x[key] for key in x.keys()]))
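As an aside, not part of the original answer: on newer pandas (1.1+ for the ignore_index argument), both steps can be sketched more directly with DataFrame.explode plus pd.json_normalize:

```python
import pandas as pd

df1 = pd.DataFrame(
    {"a": [1, 2],
     "c": [[{"x": 17, "y": 18}, {"x": 27, "y": 28}],
           [{"x": 37, "y": 38}, {"x": 47, "y": 48}]]})

# One row per list element, then one column per dict key.
exploded = df1.explode("c", ignore_index=True)
result = pd.concat(
    [exploded.drop(columns="c"), pd.json_normalize(exploded["c"].tolist())],
    axis=1)
print(result)
```

This avoids the explicit Python loop entirely, which matters on frames of any real size.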
Hope it helps!

Related

How to assert that sum of two series is equal to sum of another two series

Let's say I have 4 series objects:
import numpy as np
import pandas as pd

ser1 = pd.Series(data={'a': 1, 'b': 2, 'c': np.nan, 'd': 5, 'e': 50})
ser2 = pd.Series(data={'a': 4, 'b': np.nan, 'c': np.nan, 'd': 10, 'e': 100})
ser3 = pd.Series(data={'a': 0, 'b': np.nan, 'c': 7, 'd': 15, 'e': np.nan})
ser4 = pd.Series(data={'a': 5, 'b': 2, 'c': 10, 'd': np.nan, 'e': np.nan})
I would like to assert that ser1 + ser2 == ser3 + ser4, treating NaNs as zeros, except where both ser1 and ser2 are NaN: in that case I want to omit that index and treat the assertion as true. For example, when ser1 and ser2 are both NaN at 'c', the assertion should hold no matter what the values of ser3 and ser4 are. When only one of ser1 or ser2 is NaN, filling NaNs with zeros would work.
Here is one way to do it:
def assert_sum_equality(ser1, ser2, ser3, ser4):
    """Return True if ser1 + ser2 equals ser3 + ser4, treating NaNs as zeros."""
    if ser1.isna().all() and ser2.isna().all():
        return True
    _ = [ser.fillna(0, inplace=True) for ser in [ser1, ser2, ser3, ser4]]
    return all(ser1 + ser2 == ser3 + ser4)
import pandas as pd
# ser1 and ser2 are filled with pd.NA
ser1 = pd.Series({"a": pd.NA, "b": pd.NA, "c": pd.NA, "d": pd.NA, "e": pd.NA})
ser2 = pd.Series({"a": pd.NA, "b": pd.NA, "c": pd.NA, "d": pd.NA, "e": pd.NA})
ser3 = pd.Series({"a": 0, "b": pd.NA, "c": 7, "d": 15, "e": pd.NA})
ser4 = pd.Series({"a": 5, "b": 2, "c": 10, "d": pd.NA, "e": 125})
print(assert_sum_equality(ser1, ser2, ser3, ser4)) # True
# ser1 + ser2 == ser3 + ser4 on all rows
ser1 = pd.Series({"a": 1, "b": 2, "c": 13, "d": 5, "e": 50})
ser2 = pd.Series({"a": 4, "b": pd.NA, "c": 4, "d": 10, "e": 100})
ser3 = pd.Series({"a": 0, "b": pd.NA, "c": 7, "d": 15, "e": pd.NA})
ser4 = pd.Series({"a": 5, "b": 2, "c": 10, "d": pd.NA, "e": 150})
print(assert_sum_equality(ser1, ser2, ser3, ser4)) # True
# ser1 + ser2 != ser3 + ser4 on rows 'c' and 'e'
ser1 = pd.Series({"a": 1, "b": 2, "c": pd.NA, "d": 5, "e": 50})
ser2 = pd.Series({"a": 4, "b": pd.NA, "c": pd.NA, "d": 10, "e": 100})
ser3 = pd.Series({"a": 0, "b": pd.NA, "c": 7, "d": 15, "e": pd.NA})
ser4 = pd.Series({"a": 5, "b": 2, "c": 10, "d": pd.NA, "e": pd.NA})
print(assert_sum_equality(ser1, ser2, ser3, ser4)) # False
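Note that the function above only short-circuits when ser1 and ser2 are NaN everywhere. If the both-NaN rule has to be applied per index label instead, a masked variant along these lines may be closer to the stated requirement (the function name is mine, not from the original answer):

```python
import pandas as pd

def sums_equal_skipping_double_nan(ser1, ser2, ser3, ser4):
    """True when ser1 + ser2 == ser3 + ser4 with NaN treated as 0,
    ignoring indices where BOTH ser1 and ser2 are NaN."""
    skip = ser1.isna() & ser2.isna()
    left = ser1.fillna(0) + ser2.fillna(0)
    right = ser3.fillna(0) + ser4.fillna(0)
    return bool((left == right)[~skip].all())

ser1 = pd.Series({"a": 1, "b": 2, "c": float("nan"), "d": 5, "e": 50})
ser2 = pd.Series({"a": 4, "b": float("nan"), "c": float("nan"), "d": 10, "e": 100})
ser3 = pd.Series({"a": 0, "b": float("nan"), "c": 7, "d": 15, "e": 135})
ser4 = pd.Series({"a": 5, "b": 2, "c": 10, "d": float("nan"), "e": 15})
print(sums_equal_skipping_double_nan(ser1, ser2, ser3, ser4))  # True: row 'c' is skipped
```

Unlike the inplace fillna approach, this version also leaves the caller's series unmodified.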

pandas compare two data frames and highlight the differences

I'm trying to compare two data frames and highlight the cells that differ in the second one.
I have tried using concat and drop_duplicates, but I am not sure how to identify the specific cells that changed, nor how to highlight them at the end.
A possible solution is the following:
import numpy as np
import pandas as pd

# set test data
data1 = {"A": [10, 11, 23, 44], "B": [22, 23, 56, 55], "C": [31, 21, 34, 66], "D": [25, 45, 21, 45]}
data2 = {"A": [10, 11, 23, 44, 56, 23], "B": [44, 223, 56, 55, 73, 56], "C": [31, 21, 45, 66, 22, 22], "D": [25, 45, 26, 45, 34, 12]}

# create dataframes
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# define a function that flags cells of one frame that differ from the other
def highlight_diff(data, other, color='yellow'):
    attr = 'background-color: {}'.format(color)
    return pd.DataFrame(np.where(data.ne(other), attr, ''),
                        index=data.index, columns=data.columns)

# apply the style using the function
df2.style.apply(highlight_diff, axis=None, other=df1)
Returns the styled df2 with every differing cell highlighted in yellow.
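As a side note (not part of the original answer): when the two frames have identical shape and labels, pandas 1.1+ also offers DataFrame.compare, which tabulates the differing cells rather than styling them:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [10, 11, 23, 44], "B": [22, 23, 56, 55]})
df2 = pd.DataFrame({"A": [10, 11, 23, 44], "B": [44, 23, 56, 55]})

# Rows and columns with no difference are dropped;
# 'self' holds df1's value, 'other' holds df2's.
diff = df1.compare(df2)
print(diff)
```

This is handy for inspecting differences programmatically, whereas the Styler approach is for visual output.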

Add/Delete a property of every object inside JSONB column in PostgreSQL

TL;DR: I need two UPDATE scripts that would turn (1) into (2) and vice-versa (add/delete color property from every object in the JSONB column)
(1)
id | name | squares
1 | s1 | [{"x": 5, "y": 5, "width": 10, "height": 10}, {"x": 0, "y": 0, "width": 20, "height": 20}]
2 | s2 | [{"x": 0, "y": 3, "width": 13, "height": 11}, {"x": 2, "y": 3, "width": 20, "height": 20}]
(2)
id | name | squares
1 | s1 | [{"x": 5, "y": 5, "width": 10, "height": 10, "color": "#FFFFFF"}, {"x": 0, "y": 0, "width": 20, "height": 20, "color": "#FFFFFF"}]
2 | s2 | [{"x": 0, "y": 3, "width": 13, "height": 11, "color": "#FFFFFF"}, {"x": 2, "y": 3, "width": 20, "height": 20, "color": "#FFFFFF"}]
My schema
I have a table called scene with a squares column of type JSONB. Inside this column I store values like this: [{"x": 5, "y": 5, "width": 10, "height": 10}, {"x": 0, "y": 0, "width": 20, "height": 20}].
What I want to do
I want to now add color to my squares, which implies also adding some default color (like "#FFFFFF") to every square in every scene record in the existing production database, so I need a migration.
The problem
I need to write a migration that would add "color": "#FFFFFF" to every square in the production database. With a relational schema that would be as easy as writing ALTER TABLE square ADD color... for the forward migration and ALTER TABLE square DROP COLUMN color... for the rollback migration, but since square is not a separate table (the squares live in an array inside a JSONB column), I need two UPDATE queries on the scene table instead.
(1) Adding color:
update scene
set squares = (select jsonb_agg(jsonb_insert(v.value, '{color}', '"#FFFFFF"'))
               from jsonb_array_elements(squares) v);
select * from scene;
See demo.
(2) Removing color:
update scene
set squares = (select jsonb_agg(v.value - 'color')
               from jsonb_array_elements(squares) v);
select * from scene;
See demo.
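Outside the database, the shape of both transformations is easy to sanity-check in Python on the decoded JSON (a sketch for illustration, not part of the migration itself):

```python
import json

squares = json.loads(
    '[{"x": 5, "y": 5, "width": 10, "height": 10},'
    ' {"x": 0, "y": 0, "width": 20, "height": 20}]')

# Forward migration: add a default color to every square.
with_color = [{**sq, "color": "#FFFFFF"} for sq in squares]

# Rollback: drop the color key again, recovering the original objects.
without_color = [{k: v for k, v in sq.items() if k != "color"} for sq in with_color]
print(with_color[0])
```

The two list comprehensions mirror jsonb_insert and the `- 'color'` operator applied per array element.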

How can I select records that match a value in a json field array?

I'm trying to return records matching an array element that equals a specific value from a json field.
I found How to select all records containing certain values from a postgres json field containing an array, which is close to my question. However, I think the key difference is that I use json, not jsonb. For reasons, I need to stay on json at the moment. When I tried the steps from that other post, I got the same error as below.
I have this example data
{"name": "Bob", "scores": [64, 66]}
{"name": "Sally", "scores": [66, 65]}
{"name": "Kurt", "scores": [69, 71, 72, 67, 68]}
{"name": "Libby", "scores": [72, 73, 74, 75]}
{"name": "Frank", "scores": [80, 81, 82, 83]}
I'm trying to run this query:
SELECT data
FROM tests.results
where (data->>'scores') #> '[72]';
I expect two rows from the results:
{"name": "Kurt", "scores": [69, 71, 72, 67, 68]}
{"name": "Libby", "scores": [72, 73, 74, 75]}
but I get:
SQL Error [42883]: ERROR: operator does not exist: text #> integer
Hint: No operator matches the given name and argument type(s).
You might need to add explicit type casts.
I am currently using Postgres 10, but will likely upgrade to 12. Any help is appreciated.
You need to use -> instead of ->>: the former returns a json value, while the latter returns text. Beyond that, the operator you want is the containment operator @> (not #>, which extracts the value at a path rather than producing a boolean), and containment is only implemented for jsonb, so cast the extracted array:
SELECT data
FROM tests.results
WHERE (data->'scores')::jsonb @> '[72]';
Demo on DB Fiddle:
WITH results AS (
    SELECT '{"name": "Bob", "scores": [64, 66]}'::json AS mydata
    UNION ALL SELECT '{"name": "Sally", "scores": [66, 65]}'::json
    UNION ALL SELECT '{"name": "Kurt", "scores": [69, 71, 72, 67, 68]}'::json
    UNION ALL SELECT '{"name": "Libby", "scores": [72, 73, 74, 75]}'::json
    UNION ALL SELECT '{"name": "Frank", "scores": [80, 81, 82, 83]}'::json
)
SELECT mydata::text
FROM results
WHERE (mydata->'scores')::jsonb @> '[72]';
| mydata |
| ------------------------------------------------ |
| {"name": "Kurt", "scores": [69, 71, 72, 67, 68]} |
| {"name": "Libby", "scores": [72, 73, 74, 75]} |
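For intuition, the containment test the query performs is equivalent to this Python sketch over the decoded rows:

```python
import json

rows = [
    '{"name": "Kurt", "scores": [69, 71, 72, 67, 68]}',
    '{"name": "Libby", "scores": [72, 73, 74, 75]}',
    '{"name": "Frank", "scores": [80, 81, 82, 83]}',
]

# jsonb @> '[72]' against the scores array asks: does the array contain 72?
matches = [r for r in rows if 72 in json.loads(r)["scores"]]
print(matches)
```

The position of 72 in the array does not matter, only its membership, which is exactly the semantics of jsonb containment on arrays.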

Postgresql query dictionary of objects in JSONB field

I have a table in a PostgreSQL 9.5 database with a JSONB field that contains a dictionary in the following form:
{'1': {'id': 1,
'length': 24,
'date_started': '2015-08-25'},
'2': {'id': 2,
'length': 27,
'date_started': '2015-09-18'},
'3': {'id': 3,
'length': 27,
'date_started': '2015-10-15'},
}
The number of elements in the dictionary (the '1', '2', etc.) may vary between rows.
I would like to be able to get the average of length using a single SQL query. Any suggestions on how to achieve this?
Use jsonb_each:
[local] #= SELECT json, AVG((v->>'length')::int)
FROM j, jsonb_each(json) js(k, v)
GROUP BY json;
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────────────────┐
│ json │ avg │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────────────┤
│ {"1": {"id": 1, "length": 240, "date_started": "2015-08-25"}, "2": {"id": 2, "length": 27, "date_started": "2015-09-18"}, "3": {"id": 3, "length": 27, "date_started": "2015-10-15"}} │ 98.0000000000000000 │
│ {"1": {"id": 1, "length": 24, "date_started": "2015-08-25"}, "2": {"id": 2, "length": 27, "date_started": "2015-09-18"}, "3": {"id": 3, "length": 27, "date_started": "2015-10-15"}} │ 26.0000000000000000 │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────┘
(2 rows)
Time: 0,596 ms
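The per-row arithmetic can be double-checked with a quick Python sketch using the second row's document:

```python
import json

doc = json.loads("""
{"1": {"id": 1, "length": 24, "date_started": "2015-08-25"},
 "2": {"id": 2, "length": 27, "date_started": "2015-09-18"},
 "3": {"id": 3, "length": 27, "date_started": "2015-10-15"}}
""")

# jsonb_each yields one (key, value) pair per entry;
# the query then averages value->>'length' per grouped document.
avg = sum(v["length"] for v in doc.values()) / len(doc)
print(avg)  # 26.0, matching the second row of the psql output
```

Since the number of entries varies between rows, jsonb_each combined with GROUP BY is what makes the single-query average possible.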