How to create new count column based on adjacent combinations in existing table - sql

I have a simple table I've already built in BigQuery. All I want is what feels like a simple count of the number of times the combination of person_id and the specific activity in the activity column has appeared in that table, created as a new column, combination_count, holding that count for the adjacent combination on every row. There are thousands of rows in the table, so it's no good creating a filter or WHERE clause for each combination.
It feels really simple but it's driving me mad. I've tried using counts and partitions but I can't get it to work.
desired result:
person_id   activity     combination_count
-------------------------------------------
1234        activity_1   1
1234        activity_1   2
1234        activity_2   1
5678        activity_1   1
and so on...

You can use row_number():
select t.*,
       row_number() over (partition by person_id, activity order by person_id) as combination_count
from t;
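If what you actually want is the total number of times each combination appears, repeated on every row, rather than the running 1, 2, 3 sequence shown in the desired result, a minimal variant (a sketch against the same table t from the answer) swaps row_number() for a windowed count:
select t.*,
       count(*) over (partition by person_id, activity) as combination_count
from t;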

Related

How to create a new table that only keeps rows with more than 5 data records under the same id in Bigquery

I have a table like this:
Id   Date         Steps   Distance
-----------------------------------
1    2016-06-01   1000    1
There are over 1000 records and 50 Ids in this table; most Ids have about 20 records, and some Ids only have 1 or 2 records, which I think are useless.
I want to create a table that excludes those ids with less than 5 records.
I wrote this code to find the ids that I want to exclude:
SELECT
  Id,
  COUNT(Id) AS num_id
FROM `table`
GROUP BY Id
ORDER BY num_id
Since there are only two Ids I need to exclude, I used a WHERE clause:
CREATE TABLE `` AS
SELECT
*
FROM ``
WHERE
Id <> 2320127002
AND Id <> 7007744171
Although I can get the result I want, I think there are better ways to solve this kind of problem. For example, if there are over 20 ids with less than 5 records in this table, what shall I do? Thank you.
Consider this:
CREATE TABLE `filtered_table` AS
SELECT *
FROM `table`
WHERE TRUE
QUALIFY COUNT(*) OVER (PARTITION BY Id) >= 5
Note: the WHERE TRUE is a workaround for QUALIFY requiring a WHERE, GROUP BY, or HAVING clause in the query; you can remove it if the query runs successfully without it.
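If you would rather not rely on QUALIFY, a sketch of an equivalent approach (assuming the same `table`, with filtered_table as an illustrative output name) keeps only the Ids that have at least 5 rows via a grouped subquery:
CREATE TABLE `filtered_table` AS
SELECT *
FROM `table`
WHERE Id IN (
  SELECT Id
  FROM `table`
  GROUP BY Id
  HAVING COUNT(*) >= 5
);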

Select 1 record from each of 2 duplicate records

I have a messaging application which regularly inserts duplicate messages in BigQuery. The table name is 'metrics' and it has the following fields:
The Row column is a BigQuery ROW_NUMBER() which is not part of the metrics table. All the other columns except batch_id form 2 duplicate rows for each message_id. You can see that each message_id is repeated twice, and a different batch_id is created for each insertion.
I want the output like this: only 3 rows should be in the select result, with 3 different message_id values instead of the 6 rows I get here. It would be better if the row that had been inserted first among the duplicates for each message_id were selected (as the start_time and end_time are the same for the duplicates, I am not sure how to find that). I am new to BigQuery and have seen some examples in SQL but not in BigQuery, so any help is appreciated.
Thanks for your help.
This deduping process becomes part of your business logic, so pick one method and stay consistent. I would do something like this:
with data as (
select
*,
row_number() over(partition by message_id order by batch_id asc) as rn
from `project.dataset.table`
)
select * from data where rn = 1
This query selects the row that has the "minimum" batch_id for each message_id. Your batch_id values seem random/hashed (and not necessarily in a specific order), so this might or might not work for your purposes, but it should reproduce the same results every time (unless a 3rd record shows up, in which case it could begin to vary).
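For what it's worth, a more compact BigQuery variant of the same idea, sketched against the same `project.dataset.table`, pushes the row_number() filter into QUALIFY so no CTE is needed:
select *
from `project.dataset.table`
where true
qualify row_number() over (partition by message_id order by batch_id asc) = 1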

SQL Server Sum multiple rows into one - no temp table

I would like to see a most concise way to do what is outlined in this SO question: Sum values from multiple rows into one row
that is, combine multiple rows while summing a column.
But how to then delete the duplicates. In other words I have data like this:
Person Value
--------------
1 10
1 20
2 15
And I want to sum the values for any duplicates (on the Person col) into a single row and get rid of the other duplicates on the Person value. So my output would be:
Person Value
-------------
1 30
2 15
And I would like to do this without using a temp table. I think I'll need to use OVER (PARTITION BY) but I'm just not sure. I'm just trying to challenge myself by not doing it the temp-table way. Working with SQL Server 2008 R2.
Simply put, give me a concise statement that gets from my input to my output in the same table. So if my table name is People, a select * from People before the operation returns the first set above, and a select * from People after the operation returns the second set of data above.
Not sure why you want to avoid a temp table, but here's one way to do it (though IMHO this is overkill):
UPDATE MyTable SET VALUE = (SELECT SUM(Value) FROM MyTable MT WHERE MT.Person = MyTable.Person);
WITH DUP_TABLE AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY Person ORDER BY Person) AS ROW_NO
    FROM MyTable
)
DELETE FROM DUP_TABLE WHERE ROW_NO > 1;
First query updates every duplicate person to the summary value. Second query removes duplicate persons.
Demo: http://sqlfiddle.com/#!3/db7aa/11
All you're asking for is a simple SUM() aggregate function and a GROUP BY
SELECT Person, SUM(Value)
FROM myTable
GROUP BY Person
The SUM() by itself would sum up the values in a column, but when you add a secondary column and GROUP BY it, SQL returns one row per distinct value of that column and performs the aggregate function within each of those groups.
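Since the question mentions OVER PARTITION BY: a windowed SUM (sketched here against the People table named in the question, with PersonTotal as an illustrative alias) keeps every original row and just attaches the per-person total, which is why a plain GROUP BY, or the UPDATE/DELETE approach above, is what actually collapses the duplicates:
SELECT Person,
       Value,
       SUM(Value) OVER (PARTITION BY Person) AS PersonTotal
FROM People;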

How to index two columns automatically in sql

I have a table in SQL with two fields, 'JOB_NUMBER' and 'SRno'. The relation between the two is that each job number has many SRno values starting from 1, 2, 3 and so on, and every new job number has to have an SRno starting from 1,
so my ideal table should some what look like this:
JOB_NUMBER SRno
1 1
1 2
1 3
2 1
2 2
3 1 and so on.......
What I want to do is achieve this numbering in SQL itself. Can I do this, and if so, how?
If there is another column on the table that is something like a timestamp (e.g. time submitted), then you can do something like:
select job_number,
row_number() over (partition by job_number order by time_submitted asc) as SRno
from tbl
You could make that into a view and you're good to go. Keep in mind that this will be sensitive to data modifications (i.e. if someone inserts a row between two other rows, the rows after the inserted one will be "renumbered"). Also keep in mind that this won't store the SRno on the table; it has to be calculated dynamically.
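A minimal sketch of that view, assuming the tbl and time_submitted names from the query above (job_srno is just an illustrative view name):
CREATE VIEW job_srno AS
SELECT job_number,
       ROW_NUMBER() OVER (PARTITION BY job_number ORDER BY time_submitted ASC) AS SRno
FROM tbl;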
You mean auto-numbering (indexing is something different in DBMSs). It can be achieved using a trigger, but that's a DBMS-specific issue. Check whether your database supports triggers.
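As a rough sketch of the trigger route, assuming SQL Server and the table_jobs name used in the answer below (the trigger name is illustrative, the table is assumed to have only these two columns, and concurrency concerns are ignored):
CREATE TRIGGER trg_assign_srno ON table_jobs
INSTEAD OF INSERT
AS
BEGIN
    -- Assign the next SRno per JOB_NUMBER for every incoming row
    INSERT INTO table_jobs (JOB_NUMBER, SRno)
    SELECT i.JOB_NUMBER,
           COALESCE(x.max_srno, 0)
           + ROW_NUMBER() OVER (PARTITION BY i.JOB_NUMBER ORDER BY (SELECT NULL))
    FROM inserted AS i
    LEFT JOIN (
        SELECT JOB_NUMBER, MAX(SRno) AS max_srno
        FROM table_jobs
        GROUP BY JOB_NUMBER
    ) AS x
      ON x.JOB_NUMBER = i.JOB_NUMBER;
END;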
Are you looking for something like this..
select JOB_NUMBER,
       ROW_NUMBER() over (partition by JOB_NUMBER order by JOB_NUMBER) as SRno
from table_jobs

Converting Rows to Columns in SQL SERVER 2008

In SQL Server 2008,
I have a table for tracking the status history of actions (STATUS_HISTORY) that has three columns ([ACTION_ID],[STATUS],[STATUS_DATE]).
Each ACTION_ID can have a variable number of statuses and status dates.
I need to convert these rows into columns that preferably look something like this:
[ACTION_ID], [STATUS_1], [STATUS_2], [STATUS_3], [DATE_1], [DATE_2], [DATE_3]
Where the total number of status columns and date columns is unknown, and - of course - DATE_1 correlates to STATUS_1, etc. And I'd like for the status to be in chronological order (STATUS_1 has the earliest date, etc.)
My reason for doing this is so I can put the 10 most recent Statuses on a report in an Access ADP, along with other information for each action. Using a subreport with each status in a new row would cause the report to be far too large.
Is there a way to do this using PIVOT? I don't want to use the date or the status as a column heading.
Is it possible at all?
I have no idea where to even begin. It's making my head hurt.
Let us suppose for brevity that you only want the 3 most recent statuses for each action_id (like in your example).
Then this query using CTE should do the job:
WITH rownrs AS
(
SELECT
action_id
,status
,status_date
,ROW_NUMBER() OVER (PARTITION BY action_id ORDER BY status_date DESC) AS rownr
FROM
status_history
)
SELECT
s1.action_id AS action_id
,s1.status AS status_1
,s2.status AS status_2
,s3.status AS status_3
,s1.status_date AS date_1
,s2.status_date AS date_2
,s3.status_date AS date_3
FROM
(SELECT * FROM rownrs WHERE rownr=1) AS s1
LEFT JOIN
(SELECT * FROM rownrs WHERE rownr=2) AS s2
ON s1.action_id = s2.action_id
LEFT JOIN
(SELECT * FROM rownrs WHERE rownr=3) AS s3
ON s1.action_id = s3.action_id
NULL values will appear in the rows where the action_id has fewer than 3 statuses.
I haven't had to do it with two columns, but a PIVOT sounds like what you should try. I've done this in the past with dates in a result set where I needed the date in each row to be turned into the columns at the top.
http://msdn.microsoft.com/en-us/library/ms177410.aspx
I sympathize with the headache from trying to design and visualize it, but the best thing to do is try getting it working with one of the columns and then go from there. It helps once you start playing with it.
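As a rough sketch of that single-column starting point, assuming SQL Server's PIVOT and the STATUS_HISTORY columns from the question, you could pivot just the status values on a ROW_NUMBER; the dates would need a second PIVOT (or the self-join approach above):
SELECT action_id,
       [1] AS status_1, [2] AS status_2, [3] AS status_3
FROM (
    SELECT action_id, status,
           ROW_NUMBER() OVER (PARTITION BY action_id ORDER BY status_date ASC) AS rownr
    FROM status_history
) AS src
PIVOT (MAX(status) FOR rownr IN ([1], [2], [3])) AS p;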