Doing a concat over a partition in SQL?

I have some data ordered like so:
date, uid, grouping
2018-01-01, 1, a
2018-01-02, 1, a
2018-01-03, 1, b
2018-02-01, 2, x
2018-02-05, 2, x
2018-02-01, 3, z
2018-03-01, 3, y
2018-03-02, 3, z
And I wanted a final form like:
uid, path
1, "a-a-b"
2, "x-x"
3, "z-y-z"
but running something like
select
a.uid
,concat(grouping) over (partition by date, uid) as path
from temp1 a
Doesn't seem to want to play well with SQL or Google BigQuery (which is the specific environment I'm working in). Is there an easy enough way to get the groupings concatenated that I'm missing? I imagine there's a way to brute force it by including a bunch of if-then statements as custom columns, then concatenating the result, but I'm sure that will be a lot messier. Any thoughts?

You are looking for string_agg():
select a.uid, string_agg(grouping, '-' order by date) as path
from temp1 a
group by a.uid;
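For reference, here is a minimal runnable sketch of that answer in BigQuery; the WITH clause just inlines the sample rows from the question, and grouping is backtick-quoted since BigQuery treats GROUPING as a reserved keyword:
-- self-contained sketch; temp1 is built from the sample data above
WITH temp1 AS (
  SELECT DATE '2018-01-01' AS date, 1 AS uid, 'a' AS `grouping` UNION ALL
  SELECT DATE '2018-01-02', 1, 'a' UNION ALL
  SELECT DATE '2018-01-03', 1, 'b' UNION ALL
  SELECT DATE '2018-02-01', 2, 'x' UNION ALL
  SELECT DATE '2018-02-05', 2, 'x' UNION ALL
  SELECT DATE '2018-02-01', 3, 'z' UNION ALL
  SELECT DATE '2018-03-01', 3, 'y' UNION ALL
  SELECT DATE '2018-03-02', 3, 'z'
)
SELECT uid, STRING_AGG(`grouping`, '-' ORDER BY date) AS path
FROM temp1
GROUP BY uid;
-- returns: (1, 'a-a-b'), (2, 'x-x'), (3, 'z-y-z')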

Related

Big query Pivoting with a specific requirement

I have used PIVOT in BigQuery, but here is a specific use case with data that I need to show in Looker. I am trying a similar option in Looker, but wanted to know if I can just do this in BigQuery.
This is how my data (sample) in the BigQuery table looks:
The output should be as below:
If you look at it, it's a pivot, but I need to assign the column names as shown (for each specific range), and for range 6 and above I need to add the pivot columns' data into one.
I don't see a pivot index or anything like it in BigQuery. Is there a way to sum up the column data after pivot index 6 or so? Any suggestions on how to achieve this?
Hope the approach below is helpful:
SELECT * FROM (
  SELECT Node, bucket, total_code
  FROM sample, UNNEST([RANGE_BUCKET(data1, [1, 2, 3, 4, 5, 6, 7])]) bucket
) PIVOT (SUM(total_code) `range` FOR bucket IN (1, 2, 3, 4, 5, 6, 7));
output:
RANGE_BUCKET - https://cloud.google.com/bigquery/docs/reference/standard-sql/mathematical_functions#range_bucket
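To illustrate how the bucketing behaves, here is a self-contained sketch with made-up rows; the table name sample and the columns Node, data1 and total_code are assumptions taken from the query above, not the asker's real data:
-- hypothetical rows; RANGE_BUCKET returns 7 for any data1 >= 7, so those
-- values all land in the single range_7 column after the PIVOT
WITH sample AS (
  SELECT 'A' AS Node, 1 AS data1, 10 AS total_code UNION ALL
  SELECT 'A', 2, 20 UNION ALL
  SELECT 'A', 9, 5 UNION ALL
  SELECT 'B', 3, 7 UNION ALL
  SELECT 'B', 12, 4
)
SELECT * FROM (
  SELECT Node, bucket, total_code
  FROM sample, UNNEST([RANGE_BUCKET(data1, [1, 2, 3, 4, 5, 6, 7])]) bucket
) PIVOT (SUM(total_code) `range` FOR bucket IN (1, 2, 3, 4, 5, 6, 7));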

Aggregating one bigquery table to another bigquery table

I am trying to aggregate a multi-PB (around 7 PB) BigQuery table into another BigQuery table.
I have (partition_key, clusterkey1, clusterkey2, col1, col2, val),
where partition_key is used for the BigQuery partition and the clusterkey columns are used for clustering.
For example
(timestamp1, timestamp2, 0, 1, 2, 1)
(timestamp3, timestamp4, 0, 1, 2, 7)
(timestamp31, timestamp22, 2, 1, 2, 2)
(timestamp11, timestamp12, 2, 1, 2, 3)
should result in
(0, 1, 2, 8)
(2, 1, 2, 5)
I want to aggregate val based on (clusterkey2, col1, col2), across all partition_key and clusterkey1 values.
What is a feasible way to do this?
Should I write a custom loader and just read all data from it line by line, or is there a native way to do this?
Depending on where/how you are executing this, you can do it by writing a simple SQL script and defining the target output, for example:
SELECT clusterkey2
, col1
, col2
, sum(val)
from table
group by clusterkey2, col1, col2
This will get you the desired results.
From here you can do a few things, but they are mostly all outlined here in the documentation:
https://cloud.google.com/bigquery/docs/writing-results#writing_query_results
Specifically, from the above, you are looking to set the destination table.
One thing to note: you may want to include a partition key in the WHERE clause to help narrow down your data if you do not want aggregate results for the whole table.
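As a concrete sketch of the destination-table step (project, dataset and table names below are placeholders, not the asker's real ones), the whole aggregation can also be written as a single CREATE TABLE ... AS SELECT:
-- placeholder names; adjust to your own project/dataset/tables
CREATE OR REPLACE TABLE `my-project.my_dataset.aggregated` AS
SELECT clusterkey2,
       col1,
       col2,
       SUM(val) AS val
FROM `my-project.my_dataset.source_table`
-- optionally narrow the scan to a partition range instead of the full 7 PB:
-- WHERE partition_key >= TIMESTAMP '2023-01-01'
GROUP BY clusterkey2, col1, col2;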

Possible to do subselect in Google Sheets

I have the following data in the movies data range:
I would like to do a subselect to get the highest-grossing movie for that director. I can do it by adding a new column like this:
Note that I've used the hacky 'nested-query' notation to remove the header row and just return a single scalar value:
=QUERY(QUERY(movies, "SELECT MAX(C) WHERE A='"&A2&"' GROUP BY A", 0), "SELECT * OFFSET 1", 0)
However, I was wondering if I could just do a single query on the director|movie|boxoffice columns with a subselect within the query statement. I suppose it would come out to something like:
=QUERY(movies, "SELECT A, B, C, (SELECT MAX(C) WHERE A='"&A2&"' GROUP BY A)", 0)
I believe the answer to this is a straight 'no', but I was curious if there's any sort of sub-query composability within the Google Sheets query language, or if I just need to figure out workarounds here?
https://developers.google.com/chart/interactive/docs/querylanguage
try:
=INDEX(IFNA(VLOOKUP(A2:A, SORT(A2:C, 3, ), 3, )))
or whole:
=INDEX({A1:C, {"highestgrossing"; IFNA(VLOOKUP(A2:A, SORT(A2:C, 3, ), 3, ))}})

How to get data in a column in order by using SQL in operator

There is a data set as shown below:
When the input for event_type is 4, 1, 2, 3, for example, I would like to get 3, 999, 3, 9 from cnt_stamp, in that order. I created the SQL below, but it seems like it always returns 999, 3, 9, 3 regardless of the order of the input.
How can I fix the SQL to achieve this? Thank you for taking your time, and please let me know if you have any questions.
SELECT `cnt_stamp` FROM `stm_events` WHERE `event_type` in (4,1,2,3)
Add ORDER BY FIELD(event_type, 4, 1, 2, 3) to your query. It should look like:
SELECT cnt_stamp FROM stm_events WHERE event_type in (4,1,2,3) ORDER BY FIELD(event_type, 4, 1, 2, 3);
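FIELD() is MySQL-specific; if the same ordering is ever needed on an engine without it, a CASE expression is an equivalent, if more verbose, alternative:
-- same custom ordering expressed with a portable CASE expression
SELECT cnt_stamp
FROM stm_events
WHERE event_type IN (4, 1, 2, 3)
ORDER BY CASE event_type
           WHEN 4 THEN 1
           WHEN 1 THEN 2
           WHEN 2 THEN 3
           WHEN 3 THEN 4
         END;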
It can't do that on its own because by default the data is sorted in ascending order; if you want that result, you're better off creating an extra column to use for the ordering.

Generate histogram from regex matches

Sorry if this is an obvious question. I'm quite new to SQL and couldn't manage to adapt other examples out there to my needs.
I have a table (Postgres 9.3) defined as:
CREATE TABLE scripts (
id SERIAL PRIMARY KEY,
name VARCHAR(256) NOT NULL,
content TEXT NOT NULL);
The content column contains the content of various scripts. I'm interested in counting how many times distinct function calls occur in these scripts.
I've managed to construct a query that runs a regex over the contents and pulls out all the function calls (as funcs):
SELECT id, name, regexp_matches(LOWER(content), '(\w+\.\w+)\(', 'g') AS funcs
FROM scripts
GROUP BY id, name, funcs;
The output looks something like
1, myscript, {class.m1}
2, otherscript, {class_b.method4}
2, otherscript, {class.m1}
3, last_script, {classname.method2}
3, last_script, {class.m1}
3, last_script, {class_b.method4}
I would really like to turn this into a table that shows a tally of each distinct function. Something like
class.m1, 3
class_b.method4, 2
classname.method2, 1
This is what I have so far:
SELECT COUNT(DISTINCT funcs) FROM (
SELECT tsr_id, name, regexp_matches(LOWER(content), '(\w+\.\w+)\(', 'g') AS funcs
FROM tsr_conf.rules
GROUP BY tsr_id, name, funcs
) x
But unfortunately it just gives me the total count of distinct functions. Any advice on how to count the occurrences of each distinct function would be most appreciated!
Given what your first query returns, a group by should do what you want:
SELECT funcs, COUNT(*)
FROM (SELECT tsr_id, name, regexp_matches(LOWER(content), '(\w+\.\w+)\(', 'g') AS funcs
FROM tsr_conf.rules
GROUP BY tsr_id, name, funcs
) x
GROUP BY funcs;
You could actually write this more simply as:
SELECT regexp_matches(LOWER(content), '(\w+\.\w+)\(', 'g') AS funcs, COUNT(DISTINCT (tsr_id, name))
FROM tsr_conf.rules
GROUP BY funcs;
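For completeness, a minimal sketch against the scripts table from the question's CREATE TABLE, counting the number of scripts each function appears in (this puts the set-returning regexp_matches in the FROM clause, which works on Postgres 9.3+):
-- funcs is the text[] returned by regexp_matches; funcs[1] is the captured name
SELECT funcs[1] AS func, COUNT(DISTINCT id) AS script_count
FROM scripts,
     regexp_matches(LOWER(content), '(\w+\.\w+)\(', 'g') AS funcs
GROUP BY funcs[1]
ORDER BY script_count DESC;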