Is it possible to UNNEST an array in BigQuery so that the nested data is split into columns by a key value? - google-bigquery

Let's say I have some data in BigQuery which includes a nested array of objects like so:
{
  "name": "Bob",
  "age": "24",
  "customFields": [
    {
      "index": "1",
      "value": "1.98"
    },
    {
      "index": "2",
      "value": "Nintendo"
    },
    {
      "index": "3",
      "value": "Yellow"
    }
  ]
}
I've only been able to unnest this data so that the "index" and "value" fields are columns:
+------+-----+-------+----------+
| name | age | index | value |
+------+-----+-------+----------+
| Bob | 24 | 1 | 1.98 |
| Bob | 24 | 2 | Nintendo |
| Bob | 24 | 3 | Yellow |
+------+-----+-------+----------+
In most cases this would be the desired output, but as the data I'm using refers to Google Analytics custom dimensions I require something a bit more complex. I'm trying to get the index value to be used in the name of the column the data appears in, like so:
+------+-----+---------+----------+---------+
| name | age | index_1 | index_2 | index_3 |
+------+-----+---------+----------+---------+
| Bob | 24 | 1.98 | Nintendo | Yellow |
+------+-----+---------+----------+---------+
Is this possible? What would be the SQL query required to generate this output? It should use the "index" value in the column name, as the indexes won't always be in the order "1, 2, 3, ...".

What you are describing is often referred to as a pivot table - a transformation where values are used as column names. SQL generally doesn't support this, as SQL is designed around the concept of a fixed schema while a pivot table requires a dynamic schema.
However, if you have a fixed set of index values you can emulate it with something like:
SELECT
  name,
  age,
  ARRAY(SELECT value FROM UNNEST(customFields) WHERE index = "1")[SAFE_OFFSET(0)] AS index_1,
  ARRAY(SELECT value FROM UNNEST(customFields) WHERE index = "2")[SAFE_OFFSET(0)] AS index_2,
  ARRAY(SELECT value FROM UNNEST(customFields) WHERE index = "3")[SAFE_OFFSET(0)] AS index_3
FROM your_table;
This explicitly defines a column for each index and picks the matching value out of the customFields array.
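An equivalent way to emulate the same fixed-index pivot is to unnest once and use conditional aggregation. This is only a sketch: it assumes name and age together identify a row (group by a real row key if you have one), and the LEFT JOIN keeps people whose customFields array is empty:
SELECT
  name,
  age,
  MAX(IF(cf.index = "1", cf.value, NULL)) AS index_1,
  MAX(IF(cf.index = "2", cf.value, NULL)) AS index_2,
  MAX(IF(cf.index = "3", cf.value, NULL)) AS index_3
FROM your_table
LEFT JOIN UNNEST(customFields) AS cf
GROUP BY name, age;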

Related

How to aggregate or join two JSON datasets in Splunk?

I'm trying to build up an overview of some process steps of my application.
I generate two JSON documents:
{
  "requestID": "abc-123",
  "username": "ringo"
}
and
{
  "requestID": "abc-123",
  "favoriteCar": "Lada"
}
OK, now I also have other entries like these:
abc-456 / paul / Fiat
bcd-987 / george / Talbot
and so on ... linked by the requestID
Now I want to build a table that shows me:
ID | Username | Car
---------|--------------|---------------
abc-123 | ringo | Lada
abc-456 | paul | Fiat
bcd-987 | george | Talbot
So my question is: how can I do this aggregation?
Kind regards
Markus
Aggregations are done with the stats command. Once you have the fields extracted, they can be grouped using stats values(*) as * by requestID.

Transform several Dataframe rows into a single row

The following is an example Dataframe snippet:
+-------------------+--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|_lid |trace |message |
+-------------------+--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1103960793391132675|47c10fda9b40407c998c154dc71a9e8c|[app.py:208] Prediction label: {"id": 617, "name": "CENSORED"}, score=0.3874854505062103 |
|1103960793391132676|47c10fda9b40407c998c154dc71a9e8c|[app.py:224] Similarity values: [0.6530804801919593, 0.6359653379418201] |
|1103960793391132677|47c10fda9b40407c998c154dc71a9e8c|[app.py:317] Predict=s3://CENSORED/scan_4745/scan4745_t1_r0_c9_2019-07-15-10-32-43.jpg trait_id=112 result=InferenceResult(predictions=[Prediction(label_id='230', label_name='H3', probability=0.0), Prediction(label_id='231', label_name='Other', probability=1.0)], selected=Prediction(label_id='231', label_name='Other', probability=1.0)). Took 1.3637824058532715 seconds |
+-------------------+--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I have millions of these log-like structures, all of which can be grouped by trace, which is unique to a session.
I'm looking to transform these sets of rows into single rows, essentially mapping over them. For this example I would extract the "id": 617 from the first row, the values 0.6530804801919593, 0.6359653379418201 from the second row, and the Prediction(label_id='231', label_name='Other', probability=1.0) value from the third row.
Then I would compose a new table having the columns:
| trace | id | similarity | selected |
with the values:
| 47c10fda9b40407c998c154dc71a9e8c | 617 | 0.6530804801919593, 0.6359653379418201 | 231 |
How should I implement this group-map transform over several rows in pyspark?
I've written the below example in Scala for my own convenience, but it should translate readily to Pyspark.
1) Create the new columns in your dataframe via regexp_extract on the "message" field. This will produce the desired values if the regex matches, or empty strings if not:
scala> val dss = ds.select(
| 'trace,
| regexp_extract('message, "\"id\": (\\d+),", 1) as "id",
| regexp_extract('message, "Similarity values: \\[(\\-?[0-9\\.]+, \\-?[0-9\\.]+)\\]", 1) as "similarity",
| regexp_extract('message, "selected=Prediction\\(label_id='(\\d+)'", 1) as "selected"
| )
dss: org.apache.spark.sql.DataFrame = [trace: string, id: string ... 2 more fields]
scala> dss.show(false)
+--------------------------------+---+--------------------------------------+--------+
|trace |id |similarity |selected|
+--------------------------------+---+--------------------------------------+--------+
|47c10fda9b40407c998c154dc71a9e8c|617| | |
|47c10fda9b40407c998c154dc71a9e8c| |0.6530804801919593, 0.6359653379418201| |
|47c10fda9b40407c998c154dc71a9e8c| | |231 |
+--------------------------------+---+--------------------------------------+--------+
2) Group by "trace" and eliminate the cases where the regex didn't match. The quick and dirty way (shown below) is to select the max of each column, but you might need to do something more sophisticated if you expect to encounter more than one match per trace:
scala> val ds_final = dss.groupBy('trace).agg(max('id) as "id", max('similarity) as "similarity", max('selected) as "selected")
ds_final: org.apache.spark.sql.DataFrame = [trace: string, id: string ... 2 more fields]
scala> ds_final.show(false)
+--------------------------------+---+--------------------------------------+--------+
|trace |id |similarity |selected|
+--------------------------------+---+--------------------------------------+--------+
|47c10fda9b40407c998c154dc71a9e8c|617|0.6530804801919593, 0.6359653379418201|231 |
+--------------------------------+---+--------------------------------------+--------+
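If you'd rather stay in Pyspark without writing Scala, the same two steps can also be expressed in Spark SQL once the dataframe is registered as a temporary view. The sketch below assumes a view named logs (a placeholder name), and the backslash escaping in the regex literals may need adjusting depending on your Spark version and string-literal settings:
SELECT
  trace,
  MAX(regexp_extract(message, '"id": (\\d+),', 1)) AS id,
  MAX(regexp_extract(message, 'Similarity values: \\[(-?[0-9.]+, -?[0-9.]+)\\]', 1)) AS similarity,
  MAX(regexp_extract(message, "selected=Prediction\\(label_id='(\\d+)'", 1)) AS selected
FROM logs
GROUP BY trace;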
I ended up using something along the lines of
import re
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, TimestampType, StringType

expected_schema = StructType([
    StructField("event_timestamp", TimestampType(), False),
    StructField("trace", StringType(), False),
    ...
])

@F.pandas_udf(expected_schema, F.PandasUDFType.GROUPED_MAP)
# Input/output are both a pandas.DataFrame
def transform(pdf):
    output = {}
    for l in pdf.to_dict(orient='records'):
        x = re.findall(r'^(\[.*:\d+\]) (.*)', l['message'])[0][1]
        ...
    return pd.DataFrame(data=[output])

df.groupby('trace').apply(transform)

Group and split records in postgres into several new column series

I have data of the form
------------------------------
6031566779420 | 25 | 163698 |
6031566779420 | 50 | 98862 |
6031566779420 | 75 | 70326 |
6031566779420 | 95 | 51156 |
6031566779420 | 100 | 43788 |
6036994077620 | 25 | 41002 |
6036994077620 | 50 | 21666 |
6036994077620 | 75 | 14604 |
6036994077620 | 95 | 11184 |
6036994077620 | 100 | 10506 |
------------------------------
and would like to create a dynamic number of new columns by treating each series of (25, 50, 75, 95, 100) and its corresponding values as a new column. What I'm looking for as target output is:
--------------------------
| 25 | 163698 | 41002 |
| 50 | 98862 | 21666 |
| 75 | 70326 | 14604 |
| 95 | 51156 | 11184 |
| 100 | 43788 | 10506 |
--------------------------
I'm not sure what the sql / postgres operation I want is called, nor how to achieve it. In this case the data has 2 new columns, but I'm trying to formulate a solution that has as many new columns as there are groups of data in the output of the original query.
[Edit]
Thanks for the references to array_agg; that looks like it would be helpful! I should've mentioned this earlier, but I'm using Redshift, which reports this version of Postgres:
PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.1007
and it does not seem to support this function yet.
ERROR: function array_agg(numeric) does not exist
HINT: No function matches the given name and argument types. You may need to add explicit type casts.
Query failed
PostgreSQL said: function array_agg(numeric) does not exist
Hint: No function matches the given name and argument types. You may need to add explicit type casts.
Is crosstab the type of transformation I should be looking at? Or something else? Thanks again.
I've used array_agg() here:
select idx,array_agg(val)
from t
group by idx
This will produce a result like the one below:
idx array_agg
--- --------------
25 {163698,41002}
50 {98862,21666}
75 {70326,14604}
95 {11184,51156}
100 {43788,10506}
As you can see, the second column is an array of the two values (from column val) that correspond to each idx.
The following select queries will give you the result with two separate columns:
Method : 1
SELECT idx
,col [1] col1 --First value in the array
,col [2] col2 --Second value in the array
FROM (
SELECT idx
,array_agg(val) col
FROM t
GROUP BY idx
) s
Method : 2
SELECT idx
,(array_agg(val)) [1] col1 --First value in the array
,(array_agg(val)) [2] col2 --Second value in the array
FROM t
GROUP BY idx
Result:
idx col1 col2
--- ------ -----
25 163698 41002
50 98862 21666
75 70326 14604
95 11184 51156
100 43788 10506
You can use the array_agg function. Assuming your columns are named A, B, C:
SELECT B, array_agg(C)
FROM table_name
GROUP BY B
This will get you the output in array form. It is as close as you can get to variable columns in a simple query. If you really need variable columns, consider defining a PL/pgSQL procedure to convert the array into columns.
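Since Redshift does not provide array_agg, one workaround for a fixed set of series is conditional aggregation, which Redshift does support. The sketch below assumes the three unnamed columns are called series_id, idx and val (placeholder names, since the sample has no headers) and hard-codes the two series from the example:
SELECT
  idx,
  MAX(CASE WHEN series_id = 6031566779420 THEN val END) AS series_6031566779420,
  MAX(CASE WHEN series_id = 6036994077620 THEN val END) AS series_6036994077620
FROM t
GROUP BY idx
ORDER BY idx;
As with any SQL pivot, a truly dynamic number of columns requires generating the statement (for example from a query over DISTINCT series_id) rather than writing a single static query.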

SQLAlchemy getting label names out from columns

I want to use the same labels from a SQLAlchemy table to re-aggregate some data (e.g. I want to iterate through mytable.c to get the column names exactly).
I have some spending data that looks like the following:
| name | region | date | spending |
| John | A | .... | 123 |
| Jack | A | .... | 20 |
| Jill | B | .... | 240 |
I'm then passing it to an existing function we have that aggregates spending over 2 periods (using a case statement) and groups by region:
grouped table:
| Region | Total (this period) | Total (last period) |
| A | 3048 | 1034 |
| B | 2058 | 900 |
The function returns a SQLAlchemy query object that I can then use subquery() on to re-query e.g.:
subquery = get_aggregated_data(original_table)
region_A_results = session.query(subquery).filter(subquery.c.region == 'A')
I then want to re-aggregate this subquery (summing every column that can be summed, replacing the region column with the string 'other').
The problem is, if I iterate through subquery.c, I get labels that look like:
anon_1.region
anon_1.sum_this_period
anon_1.sum_last_period
Is there a way to get the textual label from a set of column objects, without the anon_1. prefix? Especially since I feel that the prefix may change depending on how SQLAlchemy decides to generate the query.
Split the name string and take the second part, and if you want to prepare for the chance that the name is not prefixed by the table name, put the code in a try-except block:
for col in subquery.c:
    try:
        print(col.name.split('.')[1])
    except IndexError:
        print(col.name)
Also, the result proxy (region_A_results) has a keys method which returns a list of column names. Again, if you don't need the table names, you can easily get rid of them.

How to represent and insert into an ordered list in SQL?

I want to represent the list "hi", "hello", "goodbye", "good day", "howdy" (in that order) in a SQL table:
pk | i | val
------------
1 | 0 | hi
0 | 2 | hello
2 | 3 | goodbye
3 | 4 | good day
5 | 6 | howdy
'pk' is the primary key column. Disregard its values.
'i' is the "index" that defines the order of the values in the 'val' column. It is only used to establish the order, and its values are otherwise unimportant.
The problem I'm having is with inserting values into the list while maintaining the order. For example, if I want to insert "hey" and I want it to appear between "hello" and "goodbye", then I have to shift the 'i' values of "goodbye" and "good day" (but preferably not "howdy") to make room for the new entry.
So, is there a standard SQL pattern to do the shift operation, but only shift the elements that are necessary? (Note that a simple "UPDATE table SET i=i+1 WHERE i>=3" doesn't work, because it violates the uniqueness constraint on 'i', and also it updates the "howdy" row unnecessarily.)
Or, is there a better way to represent the ordered list? I suppose you could make 'i' a floating point value and choose values between, but then you have to have a separate rebalancing operation when no such value exists.
Or, is there some standard algorithm for generating string values between arbitrary other strings, if I were to make 'i' a varchar?
Or should I just represent it as a linked list? I was avoiding that because I'd like to also be able to do a SELECT .. ORDER BY to get all the elements in order.
As I read your post, I kept thinking 'linked list', and at the end I still think that's the way to go.
If you are using Oracle, and the linked list is a separate table (or even the same table with a self-referencing id - which I would avoid), then you can use a CONNECT BY query and the pseudo-column LEVEL to determine sort order.
You can easily achieve this by using a cascading trigger that updates any 'index' entry equal to the new one on the insert/update operation to the index value +1. This will cascade through all rows until the first gap stops the cascade - see the second example in this blog entry for a PostgreSQL implementation.
This approach should work independent of the RDBMS used, provided it offers support for triggers to fire before an update/insert. It basically does what you'd do if you implemented your desired behavior in code (increase all following index values until you encounter a gap), but in a simpler and more effective way.
Alternatively, if you can live with a restriction to SQL Server, check the hierarchyid type. While mainly geared at defining nested hierarchies, you can use it for flat ordering as well. It somewhat resembles your approach using floats, as it allows insertion between two positions by assigning fractional values, thus avoiding the need to update other entries.
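If you'd rather avoid triggers, one alternative is to do the gap-limited shift directly. This is only a sketch, in PostgreSQL, and it assumes the unique constraint on i has been declared DEFERRABLE so the uniqueness check is postponed to commit time; it shifts only the contiguous block of indexes starting at the insert position and then inserts the new row:
-- Assumes: ALTER TABLE t ADD CONSTRAINT t_i_unique UNIQUE (i) DEFERRABLE INITIALLY DEFERRED;
BEGIN;

-- Shift only the contiguous block starting at position 3;
-- the subquery finds the first gap at or after 3 and stops the shift there.
UPDATE t
SET i = i + 1
WHERE i >= 3
  AND i < (
    SELECT MIN(a.i) + 1
    FROM t a
    WHERE a.i >= 3
      AND NOT EXISTS (SELECT 1 FROM t b WHERE b.i = a.i + 1)
  );

-- Position 3 is now free for the new entry.
INSERT INTO t (pk, i, val) VALUES (6, 3, 'hey');  -- pk chosen arbitrarily, as in the question

COMMIT;
With the sample data (i = 0, 2, 3, 4, 6) this moves "goodbye" and "good day" but leaves "howdy" untouched, which is exactly the behavior asked for.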
If you use strings instead of numbers, you may have a table like:
pk | i | val
------------
1 | a0 | hi
0 | a2 | hello
2 | a3 | goodbye
3 | b | good day
5 | b1 | howdy
You may insert a4 between a3 and b, a21 between a2 and a3, a1 between a0 and a2, and so on. You would need a clever function to generate an i for a new value v between p and n, and the indexes can become longer and longer, or you need a big rebalancing from time to time.
Another approach could be to implement a (doubly) linked list in the table, where you don't store indexes but links to the previous and next entries, which means you normally only have to update one or two elements:
pk | prev | val
------------
1 | 0 | hi
0 | 1 | hello
2 | 0 | goodbye
3 | 2 | good day
5 | 3 | howdy
To insert hey between hello and goodbye, hey gets pk 6:
pk | prev | val
------------
1 | 0 | hi
0 | 1 | hello
6 | 0 | hey <- ins
2 | 6 | goodbye <- upd
3 | 2 | good day
5 | 3 | howdy
The previous element would be hello with pk=0, and goodbye, which previously linked to hello, now has to link to hey.
But I don't know whether an 'order by' mechanism for such a linked list is available in many DB implementations.
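For what it's worth, databases with recursive CTEs can walk such a linked list in order. A minimal sketch in PostgreSQL syntax, assuming the table is named t and the head row is marked with prev IS NULL rather than reusing an existing pk:
WITH RECURSIVE ordered AS (
    SELECT pk, prev, val, 1 AS pos
    FROM t
    WHERE prev IS NULL                 -- the head of the list
    UNION ALL
    SELECT t.pk, t.prev, t.val, ordered.pos + 1
    FROM t
    JOIN ordered ON t.prev = ordered.pk
)
SELECT val FROM ordered ORDER BY pos;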
Since I had a similar problem, here is a very simple solution:
Make your i column a float, but insert integer values for the initial data:
pk | i | val
------------
1 | 0.0 | hi
0 | 2.0 | hello
2 | 3.0 | goodbye
3 | 4.0 | good day
5 | 6.0 | howdy
Then, if you want to insert something in between, just compute a float value in the middle between the two surrounding values:
pk | i | val
------------
1 | 0.0 | hi
0 | 2.0 | hello
2 | 3.0 | goodbye
3 | 4.0 | good day
5 | 6.0 | howdy
6 | 2.5 | hey
This way the number of inserts between the same two values is limited by the resolution of float values, but for almost all cases that should be more than sufficient.
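A minimal sketch of that midpoint insert, assuming the table is named ordered_list with the columns shown above (the name is a placeholder):
-- 'hey' goes midway between 'hello' (i = 2.0) and 'goodbye' (i = 3.0)
INSERT INTO ordered_list (pk, i, val)
VALUES (6, (2.0 + 3.0) / 2, 'hey');

-- reading the list back in order: hi, hello, hey, goodbye, good day, howdy
SELECT val FROM ordered_list ORDER BY i;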