How to map each parameter in Firebase Analytics SQL to a separate column?

We use Firebase Analytics and BigQuery to run SQL queries on collected data. This is turning out to be complex, as some fields like event_params are repeated records. I want to map each of these repeated fields to a separate column.
I want to write queries on the above dataset, like finding the difference between minIso and maxIso. How can I define a UDF or a view which can return the table in that column schema?

I want to map each of these repeated fields to separate column.
Going in the direction of pivoting parameters into columns is conceptually doable, but (in my strong opinion) it is a "dead end" in most practical cases.
There are many posts here on SO showing how to pivot/transpose rows to columns, and the patterns are: 1) you hardcode all possible keys in your query (and obviously no-one likes this), or 2) you create a utility query that extracts all keys and constructs the needed query for you, which you then execute - so either you do it manually in two steps, or you use a client of your choice to script those two steps to run in an automated way.
As I mentioned - there are plenty of examples of such here on SO.
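Just to illustrate option 2, below is a rough sketch of a utility query that generates the per-key pivot expressions for you (using the same public sample table as the examples further down); you would then paste or script its output into the final query. It assumes the parameters are int_value, that every key happens to be a valid column name, and that the generated expressions will live inside a query that uses the same dim alias as the examples below:
#standardSQL
SELECT STRING_AGG(DISTINCT FORMAT(
  "(SELECT value.int_value FROM UNNEST(dim.params) WHERE key = '%s') AS %s",
  param.key, param.key), ',\n') AS pivot_expressions
FROM `firebase-analytics-sample-data.ios_dataset.app_events_20160607`,
UNNEST(event_dim) dim, UNNEST(dim.params) param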
I want to write queries in the above dataset like finding the difference between minIso and maxIso
If all you need is to do some math with a few parameters in the record - see the examples below.
Dummy Example: for each app_instance_id, find the diff between coins_awarded and xp_awarded
#standardSQL
SELECT user_dim.app_info.app_instance_id, ARRAY(
SELECT AS STRUCT name,
(SELECT value.int_value FROM UNNEST(dim.params) param WHERE key = 'coins_awarded') -
(SELECT value.int_value FROM UNNEST(dim.params) param WHERE key = 'xp_awarded') diff_awarded
FROM UNNEST(event_dim) dim
WHERE dim.name = 'round_completed'
) AS event_dim
FROM `firebase-analytics-sample-data.ios_dataset.app_events_20160607`
WHERE 'round_completed' IN (SELECT name FROM UNNEST(event_dim))
with result as
Row app_instance_id event_dim.name event_dim.diff_awarded
1 02B6879DF2639C9E2244AD0783924CFC round_completed 226
2 02B6879DF2639C9E2244AD0783924CFC round_completed 171
3 0DE9DCDF2C407377AE3E779FB05864E7 round_completed 25
...
Dummy Example: leave whole user_dim intact but replace event_dim with just calculated values
#standardSQL
SELECT * REPLACE(ARRAY(
SELECT AS STRUCT name,
(SELECT value.int_value FROM UNNEST(dim.params) param WHERE key = 'coins_awarded') -
(SELECT value.int_value FROM UNNEST(dim.params) param WHERE key = 'xp_awarded') diff_awarded
FROM UNNEST(event_dim) dim
WHERE dim.name = 'round_completed'
) AS event_dim)
FROM `firebase-analytics-sample-data.ios_dataset.app_events_20160607`
WHERE 'round_completed' IN (SELECT name FROM UNNEST(event_dim))
This is turning out to be complex as some fields like event_params are repeated records. I want to map each of these repeated fields to separate column.
Hope, from the above examples, you can see how simple it really is to deal with repeated fields. I do recommend you learn / practice working with arrays to gain long-term benefits, rather than looking for what [wrongly] looks like a shortcut.
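For reference, applied to the minIso / maxIso parameters the question actually asks about, the same pattern might look like the sketch below; the table, the event name and the assumption that the values are int_value are placeholders, since the real schema wasn't shared:
#standardSQL
SELECT
  user_dim.app_info.app_instance_id,
  (SELECT value.int_value FROM UNNEST(dim.params) WHERE key = 'maxIso') -
  (SELECT value.int_value FROM UNNEST(dim.params) WHERE key = 'minIso') AS iso_diff
FROM `your_project.your_dataset.app_events_20160607`,  -- placeholder table
UNNEST(event_dim) dim
WHERE dim.name = 'camera_event'  -- placeholder event name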

Related

How can I assign pre-determined codes (1,2,3, etc,) to a JSON-type column in PostgreSQL?

I'm extracting a table of 2000+ rows which are park details. One of the columns is JSON type. Image of the table
We have about 15 attributes like this, and we also have documentation of pre-determined codes assigned to each attribute.
Each row in the extracted table has a different set of attributes that you can see in the image. Right now, I have cast(parks.services AS text) AS "details" to get all the attributes for a particular park or extract just one of them using the code below:
CASE
WHEN cast(parks.services AS text) LIKE '%uncovered%' THEN '2'
WHEN cast(parks.services AS text) LIKE '%{covered%' THEN '1' END AS "details"
This time around, I need to extract these attributes by assigning them the codes. As an example, let's just say
Park 1 - {covered, handicap_access, elevator} to be {1,3,7}
Park 2 - {uncovered, always_open, handicap_access} to be {2,5,3}
I have thought of using a subquery to pre-assign the codes, but I cannot wrap my head around JSON operators - in fact, I don't know how to extract them across 2000+ rows.
It would be helpful if someone could guide me in this topic. Thanks a lot!
You should really think about normalizing your tables. Don't store arrays. You should add a mapping table to map the parks and the attribute codes. This makes everything much easier and more performant.
step-by-step demo:db<>fiddle
SELECT
t.name,
array_agg(c.code ORDER BY elems.index) as codes -- 3
FROM mytable t,
unnest(attributes) WITH ORDINALITY as elems(value, index) -- 1
JOIN codes c ON c.name = elems.value -- 2
GROUP BY t.name
1. Extract the array elements into one record per element. Add WITH ORDINALITY to preserve the original order.
2. Join your codes table on the elements.
3. Create the code arrays. To ensure the correct order, use the index values created by the WITH ORDINALITY clause.
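For reference, a rough sketch of the normalized layout being recommended (table and column names are invented for illustration, they are not taken from the question):
CREATE TABLE park (
    park_id serial PRIMARY KEY,
    name    text NOT NULL
);

CREATE TABLE attribute_code (   -- the documented codes, e.g. 1 = covered, 2 = uncovered, 3 = handicap_access
    code int  PRIMARY KEY,
    name text UNIQUE NOT NULL
);

CREATE TABLE park_attribute (   -- mapping table: one row per park/attribute pair
    park_id int REFERENCES park,
    code    int REFERENCES attribute_code,
    PRIMARY KEY (park_id, code)
);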

Best practices for dealing with duplicate rows caused by unnested records in BigQuery?

Working with data coming from Facebook more often than not involves working with records where, in my case, all the “spicy” data is at. However, there is a downside, namely the huge amount of duplicate rows, which when not handled properly can cause over-reporting and/or data discrepancy.
Below is a use case which when joined with my primary data (coming from tables which do not involve any unnesting) causes a slight discrepancy in the final numbers.
Technologies used - Facebook Data -> Stitch -> BigQuery -> dbt -> Google Data Studio
I would usually create separate models where I'd unnest a record, transform the data and then join it into the rest of my models. An example of this is getting all website purchase conversions from the ads_insights table's actions record.
Here is the difference though:

Query:
SELECT count(*) AS row_count
FROM ads_insights
Result:
 row_count - 316

Query:
SELECT count(*) AS row_count
FROM ads_insights,
UNNEST(actions) AS actions
Result:
 row_count - 5612

After unnesting, I’d use the row data to create columns for each conversion like so:
CASE WHEN value.action_type = 'offsite_conversion.fb_pixel_purchase' THEN COALESCE(value._28d_click, 0) + COALESCE(value._1d_view, 0) ELSE 0 END AS website_purchase

And finally I would join this model to the rest of my models. The only problem is that those 5600 rows cause a slight discrepancy when joined with the rest, and since I’ve already used the row data to create the columns, I don’t care about the unnested record data anymore, and I can go back to my original 316 rows. The only question is how? What techniques are out there that will help me clean up my model?
Solution:
Even though at some point I'd aggregate and group all the fields in my query like dylanbaker suggested in his answer, the discrepancy would still persist. After doing a deep dive into my data I found that the unnested query returns 279 rows, whereas the nested one returns 314. This focused my attention on the unnesting itself, which removes 35 rows, and those 35 rows happened to have a null record value. After some Google searching I found this StackOverflow article which suggests using LEFT JOIN UNNEST to preserve all rows that have null record values, instead of CROSS JOIN UNNEST which removes them.
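A minimal sketch of that LEFT JOIN UNNEST fix (ad_id is just a placeholder column, since the full schema isn't shown):
-- keeps rows whose actions array is NULL or empty (a CROSS JOIN / comma join would drop them)
SELECT ai.ad_id, act.action_type   -- ad_id is a placeholder column
FROM ads_insights ai
LEFT JOIN UNNEST(ai.actions) AS act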
You would typically want to do a 'pivot' here. You're most of the way there, you just need to sum and group by the relevant columns in order to get this back to the grain that you originally had and want.
I believe you'll want something like this:
select
ads_insights.some_column,
ads_insights.some_other_column,
sum(case
when value.action_type = 'offsite_conversion.fb_pixel_purchase'
then coalesce(value._28d_click, 0) + coalesce(value._1d_view, 0)
else 0
end) AS website_purchase
from ads_insights,
unnest(actions) as actions
group by 1,2
The initial columns would be whatever you want from the original table. The 'sum case whens' would be to pivot and aggregate the unnested data.
You can actually do some magic with unnests inside the select statement
Does this work for you?
SELECT
some_column,
(SELECT coalesce(_28d_click, 0) + coalesce(_1d_view, 0) from unnest(actions) WHERE action_type = "offsite_conversion.fb_pixel_purchase") AS website_purchase
FROM ads_insights
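One caveat, hedged: if more than one matching action could ever exist per row, the scalar subquery above would error out; a variant that aggregates instead might look like this:
SELECT
some_column,
(SELECT SUM(coalesce(_28d_click, 0) + coalesce(_1d_view, 0))
 from unnest(actions)
 WHERE action_type = "offsite_conversion.fb_pixel_purchase") AS website_purchase
FROM ads_insights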

PostgreSQL - Query keyword patterns in columns in table

we all know in SQL we can query a column (let's say, column "breeds") for a certain word like "dog" via a query like this:
select breeds
from myStackOverflowDBTable
where breeds = 'dog'
However, say I had many more columns with much more data, say millions of records, and I did not want to find a word, but rather the most common keyword pattern or wildcard expression, a query like this:
SELECT *
FROM myStackOverflowDBTable
WHERE address LIKE '%alb%'
Is there an efficient way to find these 'patterns' inside the columns using SQL? I need to find the most common substring, so to speak. Per the query above, say the wildcard string "alb" appeared the most in a "location" column that had words like Albany, Albuquerque, Alabama: querying the words directly would obviously yield 0 results, but querying on that wildcard keyword pattern would yield many. I want to find the most repeating or most frequent wildcard/keyword pattern/regex expression/substring (however you want to define it) for a given column - is there an easy way to do this without running a million test queries and doing it manually?
Well, if you want to find three character patterns, you could extract all 3-character patterns, aggregate and count:
select substr(t.address, gs.i, 3) as ngram_3, count(*)
from t cross join lateral
generate_series(1, length(t.address) - 2, 1) gs(i)  -- the last 3-character substring starts at length - 2
group by ngram_3
order by count(*) desc
limit 100;
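And if you wanted to compare patterns of several lengths at once, a variation along these lines (same assumed table and column as above) should work:
select ng.n, substr(t.address, gs.i, ng.n) as ngram, count(*)
from t
cross join generate_series(3, 5) as ng(n)  -- pattern lengths to try
cross join lateral generate_series(1, length(t.address) - ng.n + 1, 1) as gs(i)
group by ng.n, ngram
order by count(*) desc
limit 100;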

Splitting text in SQL Server stored procedure

I'm working with a database, where one of the fields I extract is something like:
1-117 3-134 3-133
Each of these number sets represents a different set of data in another table. Taking 1-117 as an example, 1 = equipment ID, and 117 = equipment settings.
I have another table from which I need to extract data based on the previous field. It has two columns that split equipment ID and settings. Essentially, I need a way to go from the queried column 1-117 and run a query to extract data from another table where 1 and 117 are two separate corresponding columns.
So, is there any way to split this number to run this query?
Also, how would I split those three numbers (1-117 3-134 3-133) into three different query sets?
The tricky part here is that this column can have any number of sets here (such as 1-117 3-133 or 1-117 3-134 3-133 2-131).
I'm creating these queries in a stored procedure as part of a larger document to display the extracted data.
Thanks for any help.
Since you didn't provide the DB vendor, here are two posts that answer this question for SQL Server and Oracle respectively...
T-SQL: Opposite to string concatenation - how to split string into multiple records
Splitting comma separated string in a PL/SQL stored proc
And if you're using some other DBMS, go search for "splitting text ". I can almost guarantee you're not the first one to ask, and there's answers for every DBMS flavor out there.
Since you said the format is constant, though, you could also do something simpler using a SUBSTRING function.
EDIT in response to OP comment...
Since you're using SQL Server, and you said that these values are always in a consistent format, you can do something as simple as using SUBSTRING to get each part of the value and assign them to T-SQL variables, where you can then use them to do whatever you want, like using them in the predicate of a query.
Assuming that what you said is true about the format always being #-### (exactly 1 digit, a dash, and 3 digits) this is fairly easy.
WITH EquipmentSettings AS (
SELECT
S.*,
Convert(int, Substring(S.AwfulMultivalue, V.Value * 6 - 5, 1)) EquipmentID,
Convert(int, Substring(S.AwfulMultivalue, V.Value * 6 - 3, 3)) Settings
FROM
SourceTable S
INNER JOIN master.dbo.spt_values V
ON V.Value BETWEEN 1 AND (Len(S.AwfulMultivalue) + 1) / 6 -- each set is 6 characters including the separator; +1 covers the missing trailing space
WHERE
V.type = 'P'
)
SELECT
E.Whatever,
D.Whatever
FROM
EquipmentSettings E
INNER JOIN DestinationTable D
ON E.EquipmentID = D.EquipmentID
AND E.Settings = D.Settings
In SQL Server 2005+ this query will support 1365 values in the string.
If the length of the digits can vary, then it's a little harder. Let me know.
If the number of sets does not exceed 4, you can use PARSENAME to retrieve the result (PARSENAME splits a string into at most 4 dot-separated parts)
Declare @Num varchar(20)
Set @Num='1-117 3-134 3-133'
select parsename(replace(@Num,' ','.'),3)
Result :- 1-117
Now again use parsename on the same result set
Select parsename(replace(parsename(replace(@Num,' ','.'),3),'-','.'),1)
Result :- 117
If there are more than 4 values, then use split functions
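For what it's worth, on SQL Server 2016+ the built-in STRING_SPLIT function makes the variable-length case fairly painless; a rough sketch (older versions would need a user-defined split function instead):
Declare @Num varchar(100)
Set @Num = '1-117 3-134 3-133 2-131'

SELECT
CONVERT(int, LEFT(s.value, CHARINDEX('-', s.value) - 1)) AS EquipmentID,
CONVERT(int, SUBSTRING(s.value, CHARINDEX('-', s.value) + 1, 10)) AS Settings
FROM STRING_SPLIT(@Num, ' ') AS s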

Optimizing a strange MySQL Query

Hoping someone can help with this. I have a query that pulls data from a PHP application and turns it into a view for use in a Ruby on Rails application. The PHP app's table is an E-A-V style table, with the following business rules:
Given fields: First Name, Last Name, Email Address, Phone Number and Mobile Phone Carrier:
Each property has two custom fields defined: one being required, one being not required. Clients can use either one, and different clients use different ones based on their own rules (e.g. Client A may not care about First and Last Name, but client B might)
The RoR app must treat each "pair" of properties as only a single property.
Now, here is the query. The problem is it runs beautifully with around 11,000 records. However, the real database has over 40,000 and the query is extremely slow, taking roughly 125 seconds to run which is totally unacceptable from a business perspective. It's absolutely required that we pull this data, and we need to interface with the existing system.
The UserID part is to fake out a Rails-esque foreign key which relates to a Rails table. I'm a SQL Server guy, not a MySQL guy, so maybe someone can point out how to improve this query? They (the business) demand that it be sped up, but I'm not sure how, since the various group_concat and ifnull calls are required: I need every field for every client and then have to combine the data.
select `ls`.`subscriberid` AS `id`,left(`l`.`name`,(locate(_utf8'_',`l`.`name`) - 1)) AS `user_id`,
ifnull(min((case when (`s`.`fieldid` in (2,35)) then `s`.`data` else NULL end)),_utf8'') AS `first_name`,
ifnull(min((case when (`s`.`fieldid` in (3,36)) then `s`.`data` else NULL end)),_utf8'') AS `last_name`,
ifnull(`ls`.`emailaddress`,_utf8'') AS `email_address`,
ifnull(group_concat((case when (`s`.`fieldid` = 81) then `s`.`data` when (`s`.`fieldid` = 154) then `s`.`data` else NULL end) separator ''),_utf8'') AS `mobile_phone`,
ifnull(group_concat((case when (`s`.`fieldid` = 100) then `s`.`data` else NULL end) separator ','),_utf8'') AS `sms_only`,
ifnull(group_concat((case when (`s`.`fieldid` = 34) then `s`.`data` else NULL end) separator ','),_utf8'') AS `mobile_carrier`
from ((`list_subscribers` `ls`
join `lists` `l` on((`ls`.`listid` = `l`.`listid`)))
left join `subscribers_data` `s` on((`ls`.`subscriberid` = `s`.`subscriberid`)))
where (left(`l`.`name`,(locate(_utf8'_',`l`.`name`) - 1)) regexp _utf8'[[:digit:]]+')
group by `ls`.`subscriberid`,`l`.`name`,`ls`.`emailaddress`
EDIT
I removed the regexp and that sped the query up to about 20 seconds, instead of nearly 120 seconds. If I could remove the group by then it would be faster, but I cannot as removing this causes it to duplicate rows with blank data for each field, instead of aggregating them. For instance:
With group by
id user_id first_name last_name email_address mobile_phone sms_only mobile_carrier
1 1 John Doe jdoe@example.com 5551234567 0 Sprint
Without group by
id user_id first_name last_name email_address mobile_phone sms_only mobile_carrier
1 1 John jdoe@example.com
1 1 Doe jdoe@example.com
1 1 jdoe@example.com
1 1 jdoe@example.com 5551234567
And so on. What we need is the first result.
EDIT #2
The query still seems to take a long time, but earlier today it was running in only about 20 seconds on the production database. Without changing a thing, the same query is now once again taking over 60 seconds. This is still unacceptable. Any other ideas on how to improve this?
That is, without a doubt, the second most hideous SQL query I have ever laid my eyes on :-)
My advice is to trade storage requirements for speed. This is a common trick used when you find your queries have a lot of per-row functions (ifnull, case and so forth). These per-row functions never scale very well as the table becomes larger.
Create new fields in the table which will hold the values you want to extract and then calculate those values on insert/update (with a trigger) rather than select. This doesn't technically break 3NF since the triggers guarantee data consistency between columns.
The vast majority of database tables are read far more often than they're written so this will amortise the cost of the calculation across many selects. In addition, just about every reported problem with databases is one of speed, not storage.
An example of what I mean. You can replace:
case when (`s`.`fieldid` in (2,35)) then `s`.`data` else NULL end
with:
`s`.`data_2_35`
in your query if your insert/update trigger simply sets the data_2_35 column to data or NULL depending on the value of fieldid. Then you index data_2_35 and, voila, instant speed improvement at the cost of a little storage.
This trick can be done to the five case clauses, the left/regexp bit and the "naked" ifnull function as well (the ifnull functions containing min and group_concat may be harder to do).
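A rough sketch of what such a trigger might look like (the column, trigger and index names are invented for illustration, VARCHAR(255) is a guess at the data column's type, and you would want a matching BEFORE UPDATE trigger as well):
ALTER TABLE subscribers_data ADD COLUMN data_2_35 VARCHAR(255) NULL;

CREATE TRIGGER subscribers_data_bi BEFORE INSERT ON subscribers_data
FOR EACH ROW
SET NEW.data_2_35 = IF(NEW.fieldid IN (2,35), NEW.data, NULL);

CREATE INDEX idx_data_2_35 ON subscribers_data (data_2_35);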
The problem is most likely the WHERE condition:
where (left(`l`.`name`,(locate(_utf8'_',`l`.`name`) - 1)) regexp _utf8'[[:digit:]]+')
This looks like complex string comparison, so no index can be used, which results in a full table scan, possibly for every row in the result set. I am not a MySQL expert, but if you can simplify this into more simple column comparisons it will probably run much faster.
The first thing that jumps out at me as the source of all the trouble:
The PHP app's table is an E-A-V style table...
Trying to convert data in EAV format into conventional relational format on the fly using SQL is bound to be awkward and inefficient. So don't try to smash it into a conventional column-per-attribute format. The following query returns multiple rows per subscriber, one row per EAV attribute:
SELECT ls.subscriberid AS id,
SUBSTRING_INDEX(l.name, _utf8'_', 1) AS user_id,
COALESCE(ls.emailaddress, _utf8'') AS email_address,
s.fieldid, s.data
FROM list_subscribers ls JOIN lists l ON (ls.listid = l.listid)
LEFT JOIN subscribers_data s ON (ls.subscriberid = s.subscriberid
AND s.fieldid IN (2,3,34,35,36,81,100,154))
WHERE SUBSTRING_INDEX(l.name, _utf8'_', 1) REGEXP _utf8'[[:digit:]]+'
This eliminates the GROUP BY which is not optimized well in MySQL -- it usually incurs a temporary table which kills performance.
id user_id email_address fieldid data
1 1 jdoe#example.com 2 John
1 1 jdoe#example.com 3 Doe
1 1 jdoe#example.com 81 5551234567
But you'll have to sort out the EAV attributes in application code. That is, you can't seamlessly use ActiveRecord in this case. Sorry about that, but that's one of the disadvantages of using a non-relational design like EAV.
The next thing that I notice is the killer string manipulation (even after I've simplified it with SUBSTRING_INDEX()). When you're picking substrings out of a column, this tells me that you've overloaded one column with two distinct pieces of information. One is the name and the other is some kind of list-type attribute that you would use to filter the query. Store one piece of information in each column.
You should add a column for this attribute, and index it. Then the WHERE clause can utilize the index:
SELECT ls.subscriberid AS id,
SUBSTRING_INDEX(l.name, _utf8'_', 1) AS user_id,
COALESCE(ls.emailaddress, _utf8'') AS email_address,
s.fieldid, s.data
FROM list_subscribers ls JOIN lists l ON (ls.listid = l.listid)
LEFT JOIN subscribers_data s ON (ls.subscriberid = s.subscriberid
AND s.fieldid IN (2,3,34,35,36,81,100,154))
WHERE l.list_name_contains_digits = 1;
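A sketch of the one-time schema change this assumes - the flag column name is taken from the query above, and the backfill reuses the same digit test as the original WHERE clause:
ALTER TABLE lists ADD COLUMN list_name_contains_digits TINYINT(1) NOT NULL DEFAULT 0;

UPDATE lists
SET list_name_contains_digits = (SUBSTRING_INDEX(name, '_', 1) REGEXP '[[:digit:]]+');

CREATE INDEX idx_lists_name_digits ON lists (list_name_contains_digits);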
Also, you should always analyze an SQL query with EXPLAIN if it's important for it to have good performance. There's an analogous feature in MS SQL Server, so you should be accustomed to the concept, but the MySQL terminology may be different.
You'll have to read the documentation to learn how to interpret the EXPLAIN report in MySQL; there's too much info to describe here.
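For example, it's just a matter of prefixing the query - a minimal sketch:
EXPLAIN
SELECT ls.subscriberid, l.name
FROM list_subscribers ls
JOIN lists l ON ls.listid = l.listid;
-- each row of the report describes how one table in the join is accessed (key used, rows examined, etc.)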
Re your additional info: Yes, I understand you can't do away with the EAV table structure. Can you create an additional table? Then you can load the EAV data into it:
CREATE TABLE subscriber_mirror (
subscriberid INT PRIMARY KEY,
first_name VARCHAR(100),
last_name VARCHAR(100),
first_name2 VARCHAR(100),
last_name2 VARCHAR(100),
mobile_phone VARCHAR(100),
sms_only VARCHAR(100),
mobile_carrier VARCHAR(100)
);
INSERT INTO subscriber_mirror (subscriberid)
SELECT DISTINCT subscriberid FROM list_subscribers;
UPDATE subscriber_data s JOIN subscriber_mirror m USING (subscriberid)
SET m.first_name = IF(s.fieldid = 2, s.data, m.first_name),
m.last_name = IF(s.fieldid = 3, s.data, m.last_name),
m.first_name2 = IF(s.fieldid = 35, s.data, m.first_name2),
m.last_name2 = IF(s.fieldid = 36, s.data, m.last_name2),
m.mobile_phone = IF(s.fieldid = 81, s.data, m.mobile_phone),
m.sms_only = IF(s.fieldid = 100, s.data, m.sms_only),
m.mobile_carrier = IF(s.fieldid = 34, s.data, m.mobile_carrier);
This will take a while, but you only need to do it when you get a new data update from the vendor. Subsequently you can query subscriber_mirror in a much more conventional SQL query:
SELECT ls.subscriberid AS id, l.name+0 AS user_id,
COALESCE(s.first_name, s.first_name2) AS first_name,
COALESCE(s.last_name, s.last_name2) AS last_name,
COALESCE(ls.emailaddress, '') AS email_address,
COALESCE(s.mobile_phone, '') AS mobile_phone,
COALESCE(s.sms_only, '') AS sms_only,
COALESCE(s.mobile_carrier, '') AS mobile_carrier
FROM lists l JOIN list_subscribers ls USING (listid)
JOIN subscriber_mirror s USING (subscriberid)
WHERE l.name+0 > 0
As for the userid that's embedded in the l.name column, if the digits are the leading characters in the column value, MySQL allows you to convert to an integer value much more easily:
An expression like '123_bill'+0 yields an integer value of 123. An expression like 'bill_123'+0 has no digits at the beginning, so it yields an integer value of 0.