Fetch data from an array of objects in BigQuery SQL - sql

I need to fetch the key-value pairs from the second object in an array and create new columns from the fetched data. I am only interested in the second object; some arrays have 3 objects, some have 4, and so on. The data looks like this:
[{'adUnitCode': ca-pub, 'id': 35, 'name': ca-pub}, {'adUnitCode': hmies, 'id': 49, 'name': HMIES}, {'adUnitCode': moda, 'id': 50, 'name': moda}, {'adUnitCode': nova, 'id': 55, 'name': nova}, {'adUnitCode': listicle, 'id': 11, 'name': listicle}]
[{'adUnitCode': ca-pub, 'id': 35, 'name': ca-pub-73}, {'adUnitCode': hmiuk-jam, 'id': 23, 'name': HM}, {'adUnitCode': recipes, 'id': 26, 'name': recipes}]
[{'adUnitCode': ca-pub, 'id': 35, 'name': ca-pub-733450927}, {'adUnitCode': digital, 'id': 48, 'name': Digital}, {'adUnitCode': movies, 'id': 50, 'name': movies}, {'adUnitCode': cannes-film-festival, 'id': 57, 'name': cannes-film-festival}, {'adUnitCode': article, 'id': 57, 'name': article}]
The desired output:
adUnitCode id name
hmies 49 HMIES
hmiuk-jam 23 HM
digital 48 Digital

Below is for BigQuery Standard SQL
#standardSQL
select
json_extract_scalar(second_object, "$.adUnitCode") as adUnitCode,
json_extract_scalar(second_object, "$.id") as id,
json_extract_scalar(second_object, "$.name") as name
from `project.dataset.table`, unnest(
[json_extract_array(regexp_replace(mapping, r"(: )([\w-]+)(,|})", "\\1'\\2'\\3"))[safe_offset(1)]]
) as second_object
When applied to the sample data from your question, the output matches the desired output above.
As you can see, the "trick" here is to use a proper regexp in the regexp_replace function so that the unquoted values get wrapped in quotes. The pattern above matches word characters (letters, digits, underscore) and -. You can include more characters as needed.
As an alternative, you can try regexp_replace(mapping, r"(: )([^,}]+)", "\\1'\\2'") as in the example below, which potentially covers more cases without changes in code.
#standardSQL
select
json_extract_scalar(second_object, "$.adUnitCode") as adUnitCode,
json_extract_scalar(second_object, "$.id") as id,
json_extract_scalar(second_object, "$.name") as name
from `project.dataset.table`, unnest(
[json_extract_array(regexp_replace(mapping, r"(: )([^,}]+)", "\\1'\\2'"))[safe_offset(1)]]
) as second_object
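To see what the two regexp_replace patterns do before the JSON functions run, here is a minimal sketch in plain Python using the standard re module. It is only an illustration under a few assumptions: Python's re and BigQuery's regex engine are not identical, the sample string is the second row from the question, ast.literal_eval merely stands in for json_extract_array so the result can be inspected, and quote_values is a made-up helper name.

```python
import ast
import re

# Second sample row from the question: the values are unquoted.
mapping = ("[{'adUnitCode': ca-pub, 'id': 35, 'name': ca-pub-73}, "
           "{'adUnitCode': hmiuk-jam, 'id': 23, 'name': HM}, "
           "{'adUnitCode': recipes, 'id': 26, 'name': recipes}]")


def quote_values(text, pattern, replacement):
    """Hypothetical helper: wrap each bare value in single quotes."""
    return re.sub(pattern, replacement, text)


# Pattern 1: word characters and '-' that are followed by ',' or '}'.
strict = quote_values(mapping, r"(: )([\w-]+)(,|})", r"\1'\2'\3")
# Pattern 2: anything up to the next ',' or '}'.
loose = quote_values(mapping, r"(: )([^,}]+)", r"\1'\2'")

# literal_eval stands in for json_extract_array (assumption for inspection only);
# index 1 mirrors safe_offset(1), i.e. the second object.
second_object = ast.literal_eval(strict)[1]
print(second_object)
```

Every value, including id, comes back as a string here, which is consistent with json_extract_scalar returning a STRING. Cast the id column if you need a number.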

Related

RowNumber Window Query for Hiscores Ranking - Django

I'm trying to build a game hiscore view with rankings for my Django site, and I'm having some issues.
The query I have is the following:
row_number_rank = Window(
expression=RowNumber(),
partition_by=[F('score_type')],
order_by=F('score').desc()
)
hiscores = Hiscore.objects.annotate(rank=row_number_rank).values()
The query above works perfectly, and properly assigns each row a rank according to how it compares to other scores within each score type.
The result of this is the following:
{ 'id': 2, 'username': 'Bob', 'score_type': 'wins', 'score': 12, 'rank': 1 }
{ 'id': 1, 'username': 'John', 'score_type': 'wins', 'score': 5, 'rank': 2 }
{ 'id': 4, 'username': 'John', 'score_type': 'kills', 'score': 37, 'rank': 1 }
{ 'id': 3, 'username': 'John', 'score_type': 'kills', 'score': 5, 'rank': 2 }
{ 'id': 5, 'username': 'Bob', 'score_type': 'kills', 'score': 2, 'rank': 3 }
The issue comes in when I want to retrieve only a specific user's scores from the above results. If I append a filter(username="Bob") the query is now:
row_number_rank = Window(
expression=RowNumber(),
partition_by=[F('score_type')],
order_by=F('score').desc()
)
hiscores = Hiscore.objects.annotate(rank=row_number_rank).filter(username='Bob').values()
Unexpectedly, adding this filter step has yielded the following incorrect results:
{ 'id': 2, 'username': 'Bob', 'score_type': 'wins', 'score': 12, 'rank': 1 }
{ 'id': 5, 'username': 'Bob', 'score_type': 'kills', 'score': 2, 'rank': 1 }
Randomly, the rank on the id=5 entry has decided to change to 1 instead of its correct value of 3.
Why would adding this filter step modify the values of the fields in the QuerySet, instead of just excluding the proper elements from it?
Thanks.
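One likely explanation, offered as an assumption since the generated SQL is not shown here: a filter() on a regular column ends up in the SQL WHERE clause, and SQL evaluates WHERE before window functions, so ROW_NUMBER() only ever sees Bob's rows. The plain-Python sketch below imitates both orders of evaluation on the five rows from the question. rank_rows is a hypothetical helper, not a Django API.

```python
from collections import defaultdict

rows = [
    {'id': 1, 'username': 'John', 'score_type': 'wins', 'score': 5},
    {'id': 2, 'username': 'Bob', 'score_type': 'wins', 'score': 12},
    {'id': 3, 'username': 'John', 'score_type': 'kills', 'score': 5},
    {'id': 4, 'username': 'John', 'score_type': 'kills', 'score': 37},
    {'id': 5, 'username': 'Bob', 'score_type': 'kills', 'score': 2},
]


def rank_rows(rows):
    """Imitate ROW_NUMBER() OVER (PARTITION BY score_type ORDER BY score DESC)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row['score_type']].append(row)
    ranked = []
    for group in partitions.values():
        ordered = sorted(group, key=lambda r: r['score'], reverse=True)
        for position, row in enumerate(ordered, start=1):
            ranked.append({**row, 'rank': position})
    return ranked


# Filter first, then rank: the order SQL uses for WHERE plus a window function.
filter_then_rank = rank_rows([r for r in rows if r['username'] == 'Bob'])
# Rank first, then filter: the order the question expects.
rank_then_filter = [r for r in rank_rows(rows) if r['username'] == 'Bob']

print({r['id']: r['rank'] for r in filter_then_rank})  # {2: 1, 5: 1}
print({r['id']: r['rank'] for r in rank_then_filter})  # {2: 1, 5: 3}
```

If that is what is happening, the ranking has to be computed over all rows before restricting to one user, for example in a subquery or in Python after fetching the ranked rows.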

Pandas - Extract value from Dataframe based on certain key value not in a sequence

I have a Dataframe in the below format:
id, ref
101, [{'id': '74947', 'type': {'id': '104', 'name': 'Sales', 'inward': 'Sales', 'outward': 'PO'}, 'inwardIssue': {'id': '76560', 'key': 'Prod-A'}}]
102, [{'id': '74948', 'type': {'id': '105', 'name': 'Return', 'inward': 'Return Order', 'outward': 'PO'}, 'inwardIssue': {'id': '76560', 'key': 'Prod-C'}},
{'id': '750001', 'type': {'id': '342', 'name': 'Sales', 'inward': 'Sales', 'outward': 'PO'}, 'inwardIssue': {'id': '76560', 'key': 'Prod-X'}}]
103, [{'id': '74949', 'type': {'id': '106', 'name': 'Sales', 'inward': 'Return Order', 'outward': 'PO'}, 'inwardIssue': {'id': '76560', 'key': 'Prod-B'}}]
104, [{'id': '67543', 'type': {'id': '106', 'name': 'Other', 'inward': 'Return Order', 'outward': 'PO'}, 'inwardIssue': {'id': '76560', 'key': 'Prod-BA'}}]
I am trying to extract rows that have name = Sales and return back the below output:
101, Prod-A
102, Prod-X
103, Prod-B
I am able to extract the required data if the key-value pair appears in the first element of the list, but not if it appears later, as in the case of id = 102.
df['names'] = df['ref'].str[0].str.get('type').str.get('name')
df['value'] = df['ref'].str[0].str.get('inwardIssue').str.get('key')
df['output'] = np.where(df['names'] == 'Sales', df['value'], 0)
Currently I am able to only get values for id = 101, 103
Let us use explode, keeping the exploded index so each dict stays aligned with its original row:
exploded = df['ref'].explode()
s = pd.DataFrame(exploded.tolist(), index=exploded.index)
s = s.loc[s['type'].str.get('name').eq('Sales'), 'inwardIssue'].str.get('key')
dfs = df.join(s, how='right')
id ref inwardIssue
0 101 [{'id': '74947', 'type': {'id': '104', 'name':... Prod-A
1 102 [{'id': '74948', 'type': {'id': '105', 'name':... Prod-X
2 103 [{'id': '74949', 'type': {'id': '106', 'name':... Prod-B
If you already have a dataframe in that format, you can convert it to a list of records and use pd.json_normalize to turn the original df into a flat dataframe, then slice/filter that flat dataframe.
df1 = pd.json_normalize(df.to_dict(orient='records'), 'ref')
The output of this flat dataframe df1
Out[83]:
id type.id type.name type.inward type.outward inwardIssue.id \
0 74947 104 Sales Sales PO 76560
1 74948 105 Return Return Order PO 76560
2 750001 342 Sales Sales PO 76560
3 74949 106 Sales Return Order PO 76560
4 67543 106 Other Return Order PO 76560
inwardIssue.key
0 Prod-A
1 Prod-C
2 Prod-X
3 Prod-B
4 Prod-BA
Finally, slicing on df1
df_final = df1.loc[df1['type.name'].eq('Sales'), ['type.id', 'inwardIssue.key']]
Out[88]:
type.id inwardIssue.key
0 104 Prod-A
2 342 Prod-X
3 106 Prod-B
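For comparison, the same selection can be sketched without pandas, which makes the logic explicit: walk every dict in each row's list, not just the first, and keep the outer id alongside the matching key. The records list below is retyped from the question and trimmed to the fields that are used (an assumption about which fields matter), and sales_keys is a hypothetical helper name.

```python
# (outer id, list of dicts) pairs, trimmed to the fields used below.
records = [
    (101, [{'type': {'name': 'Sales'}, 'inwardIssue': {'key': 'Prod-A'}}]),
    (102, [{'type': {'name': 'Return'}, 'inwardIssue': {'key': 'Prod-C'}},
           {'type': {'name': 'Sales'}, 'inwardIssue': {'key': 'Prod-X'}}]),
    (103, [{'type': {'name': 'Sales'}, 'inwardIssue': {'key': 'Prod-B'}}]),
    (104, [{'type': {'name': 'Other'}, 'inwardIssue': {'key': 'Prod-BA'}}]),
]


def sales_keys(records, wanted='Sales'):
    """Return (outer id, inwardIssue key) for every entry whose type name matches."""
    return [
        (row_id, entry['inwardIssue']['key'])
        for row_id, entries in records
        for entry in entries              # look at every position, not only [0]
        if entry.get('type', {}).get('name') == wanted
    ]


result = sales_keys(records)
print(result)  # [(101, 'Prod-A'), (102, 'Prod-X'), (103, 'Prod-B')]
```

Unlike the type.id column in the flattened frame, the first element of each pair is the outer row id that the question asks for.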

Pandas - Extract value from Dataframe based on certain key value

I have a Dataframe in the below format:
id, ref
101, [{'id': '74947', 'type': {'id': '104', 'name': 'Sales', 'inward': 'Sales', 'outward': 'PO'}, 'inwardIssue': {'id': '76560', 'key': 'Prod-A'}}]
102, [{'id': '74948', 'type': {'id': '105', 'name': 'Return', 'inward': 'Return Order', 'outward': 'PO'}, 'inwardIssue': {'id': '76560', 'key': 'Prod-C'}}]
103, [{'id': '74949', 'type': {'id': '106', 'name': 'Sales', 'inward': 'Return Order', 'outward': 'PO'}, 'inwardIssue': {'id': '76560', 'key': 'Prod-B'}}]
I am trying to extract rows that have name = Sales and return back the below output:
id, value
101, Prod-A
103, Prod-B
Use str[0] to take the first element of each list, then Series.str.get to look up values by dict key:
#if necessary convert list/dict repr to list/dict
import ast
df['ref'] = df['ref'].apply(ast.literal_eval)
df['names'] = df['ref'].str[0].str.get('type').str.get('name')
df['value'] = df['ref'].str[0].str.get('inwardIssue').str.get('key')
print (df)
id ref names value
0 101 [{'id': '74947', 'type': {'id': '104', 'name':... Sales Prod-A
1 102 [{'id': '74948', 'type': {'id': '105', 'name':... Return Prod-C
2 103 [{'id': '74949', 'type': {'id': '106', 'name':... Sales Prod-B
And then filter by boolean indexing:
df1 = df.loc[df['names'].eq('Sales'), ['id','value']]
print (df1)
id value
0 101 Prod-A
2 103 Prod-B

Pandas - Extracting value based on common key

I have a Dataframe in the below format:
id, key1, key2
101, {'key': 'key_1001', 'fields': {'type': {'subtask': False}, 'summary': 'Title_1' , 'id': '71150'}}, NaN
101, NaN,{'key': 'key_1002', 'fields': {'type': {'subtask': False}, 'summary': 'Title_2' , 'id': '71151'}}
102, {'key': 'key_2001', 'fields': {'type': {'subtask': False}, 'summary': 'Title_11' , 'id': '71160'}}, NaN
102, NaN,{'key': 'key_2002', 'fields': {'type': {'subtask': False}, 'summary': 'Title_12' , 'id': '71161'}}
I am trying to achieve the below output from the above Dataframe.
id, key_value_1, key_value_2
101, key_1001, key_1002
102, key_2001, key_2002
Output of df.to_dict()
{'id': {103: '101', 676: '101'}, 'key1' : {103: {'fields': {'type': {'subtask': False}, 'summary': 'Title_1' , 'id': '71150'},
676: nan}
You can use:
s=df.set_index('id').stack().str.get('key').unstack()
key1 key2
id
101 key_1001 key_1002
102 key_2001 key_2002
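The chain above reads as: stack leaves one (id, column) entry per dict, assuming it drops the NaN cells as its default has long done; str.get('key') pulls the key out of each dict; and unstack pivots the column names back out. A rough plain-Python equivalent is sketched below. It assumes the four rows from the question trimmed to the 'key' field, uses None in place of NaN, and collect_keys is a hypothetical helper name.

```python
# (id, key1 cell, key2 cell) rows; None stands in for NaN.
rows = [
    (101, {'key': 'key_1001'}, None),
    (101, None, {'key': 'key_1002'}),
    (102, {'key': 'key_2001'}, None),
    (102, None, {'key': 'key_2002'}),
]


def collect_keys(rows, columns=('key1', 'key2')):
    """Group by id and keep the 'key' of every non-missing dict per column."""
    result = {}
    for row_id, *cells in rows:
        per_id = result.setdefault(row_id, {})
        for column, cell in zip(columns, cells):
            if cell is not None:  # mirrors the missing cells being dropped
                per_id[column] = cell['key']
    return result


table = collect_keys(rows)
print(table)
```

Each id ends up with exactly one value per column only because every (id, column) pair has a single non-missing dict. Duplicate pairs would make unstack fail in pandas, while this sketch would silently keep the last one.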

Convert list of dictionaries in a dataframe to separate dataframe

I want to convert a list of dictionaries that is already present in the dataset to a dataframe.
The dataset looks something like this:
[{'id': 35, 'name': 'Comedy'}]
How do I convert this list of dictionaries to a dataframe?
Thank you for your time!
I want to retrieve:
Comedy
from the list of dictionaries.
Use:
df = pd.DataFrame({'col':[[{'id': 35, 'name': 'Comedy'}],[{'id': 35, 'name': 'Western'}]]})
print (df)
col
0 [{'id': 35, 'name': 'Comedy'}]
1 [{'id': 35, 'name': 'Western'}]
df['new'] = df['col'].apply(lambda x: x[0].get('name'))
print (df)
col new
0 [{'id': 35, 'name': 'Comedy'}] Comedy
1 [{'id': 35, 'name': 'Western'}] Western
If the list may contain multiple dicts:
df = pd.DataFrame({'col':[[{'id': 35, 'name': 'Comedy'}, {'id':4, 'name':'Horror'}],
[{'id': 35, 'name': 'Western'}]]})
print (df)
col
0 [{'id': 35, 'name': 'Comedy'}, {'id': 4, 'name...
1 [{'id': 35, 'name': 'Western'}]
df['new'] = df['col'].apply(lambda x: [y.get('name') for y in x])
print (df)
col new
0 [{'id': 35, 'name': 'Comedy'}, {'id': 4, 'name... [Comedy, Horror]
1 [{'id': 35, 'name': 'Western'}] [Western]
And if you want to extract all values:
df1 = pd.concat([pd.DataFrame(x) for x in df['col']], ignore_index=True)
print (df1)
id name
0 35 Comedy
1 4 Horror
2 35 Western