Google BigQuery and SQL

I'm used to working with SQL Server databases and now I need to query data from BigQuery.
What is a good way to query data from a table like this, where one column contains several nested fields?

BigQuery supports unnest() for turning array elements into rows. So, you can convert all of this into rows as:
select t.user_id, t.user_pseudo_id, up.*
from t cross join
unnest(user_properties) up;
You want a field per property. There are several ways to do this. If you want exactly one value per row, you can use a subquery and aggregation:
select t.user_id, t.user_pseudo_id, p.*
from t cross join
(select max(case when up.key = 'age' then up.value.string_value end) as age,
max(case when up.key = 'gender' then up.value.string_value end) as gender
from unnest(user_properties) up
) p

Usually, correlated subqueries over the unnested array are used, like this:
SELECT
user_id,
user_pseudo_id,
(SELECT value.string_value FROM UNNEST(user_properties) WHERE key = "age") AS age,
(SELECT value.string_value FROM UNNEST(user_properties) WHERE key = "gender") AS gender,
FROM dataset.table

Related

Output of non-existent values when grouping in sql

For example, I have a table with the following data:
Screenshot
The table is named "table".
I have the SQL query:
select
kind,
count(kind)
from table
where region = 'eng'
group by kind
And I get the result:
Question: how do I write a query that returns all the values present in the kind field (or any other field that can appear in group by), even if the count for a value is 0? For the example above, the desired result is
Using group by in the query is mandatory.
I use PostgreSQL 10.

Use conditional aggregation:
select
kind,
count(case region when 'eng' then kind end) cnt
from table
group by kind
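As a rough illustration of why this works, here is a minimal sketch that runs the same query shape through Python's built-in sqlite3 module. The table name items and the sample rows are assumptions made for the example (the original screenshot is not available, and table is a reserved word in most databases).

```python
import sqlite3

# Hypothetical sample data; "items" stands in for the asker's table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (kind TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [("a", "eng"), ("a", "eng"), ("b", "fra"), ("c", "eng"), ("c", "fra")],
)

# The CASE expression yields NULL for rows outside 'eng', and COUNT skips
# NULLs, so a kind with no 'eng' rows still appears with a count of 0.
counts_by_kind = conn.execute(
    """
    SELECT kind,
           COUNT(CASE region WHEN 'eng' THEN kind END) AS cnt
    FROM items
    GROUP BY kind
    ORDER BY kind
    """
).fetchall()

print(counts_by_kind)
```

Because there is no where clause, every kind survives the grouping; the filter moves into the aggregate instead.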

select
t1.kind,
coalesce(t2.total, 0) total
from
(
select distinct kind from table
) t1
left join
(
select
kind,
count(kind) total
from table
where region = 'eng'
group by kind
)t2
on t1.kind = t2.kind
db fiddle

How to pivot table and combine rows based on a condition

I have the following table produced by the following SQL:
select userid, name, sirname, age from Users
I am wondering what the best way would be to convert this into something that looks like:
Would partition by + distinct be the most efficient way of doing this?

SELECT userid, [0] AS name, [1] AS sirname, age
FROM users
PIVOT
(MAX(name)
FOR sirname IN ([0],[1]))AS p

I would just suggest aggregation:
select userid, max(case when sirname = 0 then name end) as name,
max(case when sirname = 1 then name end) as sirname,
age
from t
group by userid, age;
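The following sketch tries the aggregation on made-up data using Python's sqlite3 module. It assumes, as the answers here do, that sirname is a 0/1 flag marking whether the row's name value is a first name (0) or a surname (1); the rows themselves are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (userid INTEGER, name TEXT, sirname INTEGER, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (1, "John", 0, 30), (1, "Smith", 1, 30),
        (2, "Ann", 0, 25), (2, "Lee", 1, 25),
        (3, "Bob", 0, 40),  # no surname row for this user
    ],
)

# Each CASE keeps the name only on the matching row type; MAX then picks
# the single non-NULL value per user (or NULL when the row is missing).
pivoted = conn.execute(
    """
    SELECT userid,
           MAX(CASE WHEN sirname = 0 THEN name END) AS name,
           MAX(CASE WHEN sirname = 1 THEN name END) AS sirname,
           age
    FROM users
    GROUP BY userid, age
    ORDER BY userid
    """
).fetchall()

for row in pivoted:
    print(row)
```

A user with only one of the two rows still appears in the output, with NULL (None in Python) in the missing position.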

I'd like to suggest that the most natural solution would be a self-join:
SELECT
u1.userid
,u1.name
,u2.name as sirname
,u1.age
FROM users u1
INNER JOIN users u2 ON u1.userid=u2.userid and u2.sirname=1
WHERE
u1.sirname=0
This is even more so because the approach can easily be adapted to cope with situations where one of the two names may be missing.

How do we find frequency of one column based off two other columns in SQL?

I'm relatively new to working with SQL and wasn't able to find any past threads to solve my question. I have three columns in a table, columns being name, customer, and location. I'd like to add an additional column determining which location is most frequent, based off name and customer (first two columns).
I have included a photo of an example in which, for name Jane and customer BEC, my new column would be "Texas", since that location has 2 occurrences as opposed to 1 for California. Is there any way to implement this?

If you want 'Texas' on all four rows:
select t.Name, t.Customer, t.Location,
(select t2.location
from table1 t2
where t2.name = t.name
group by name, location
order by count(*) desc
fetch first 1 row only
) as most_frequent_location
from table1 t ;
You can also do this with analytic functions:
select t.Name, t.Customer, t.Location,
max(location) keep (dense_rank first order by location_count desc) over (partition by name) most_frequent_location
from (select t.*,
count(*) over (partition by name, customer, location) as location_count
from table1 t
) t;
Here is a db<>fiddle.
Both of these versions put 'Texas' in all four rows. However, each can be tweaked with minimal effort to put 'California' in the row for ARC.

In Oracle, you can use the aggregate function stats_mode() to compute the most frequently occurring value in a group.
Unfortunately it is not implemented as a window function. So one option uses an aggregate subquery, and then a join with the original table:
select t.*, s.top_location
from mytable t
inner join (
select name, customer, stats_mode(location) top_location
from mytable
group by name, customer
) s on s.name = t.name and s.customer = t.customer
You could also use a correlated subquery:
select
t.*,
(
select stats_mode(t1.location)
from mytable t1
where t1.name = t.name and t1.customer = t.customer
) top_location
from mytable t

This is more a question about understanding the concepts of a relational database. If you want that information, you would not put it in an additional column. It is data calculated from other columns, so why would you store it in the table itself? It is complex to code, and it would also be very expensive for the database (imagine all the rows for which the value has to be recalculated if someone inserts a million rows).
Instead, you can do one of the following:
Calculate it at runtime, as shown in the other answers.
If you want to make it more persistent, embed that query in a view.
If you want to physically store the information, use a materialized view.
There is plenty of documentation on these three options in the official Oracle documentation.

Your first step is to construct a query that determines the most frequent location, which is as simple as:
select Name, Customer, Location, count(*)
from table1
group by Name, Customer, Location
This isn't immediately useful, but the logic can be used in row_number(), which assigns a sequential number to each row within its partition. In the query below, I'm ordering by count(*) in descending order so that the most frequent occurrence gets the value 1.
Note that row_number() assigns 1 to only one row per partition, so if two locations are tied, only one of them is picked.
So, now we have
select Name, Customer, Location, row_number() over (partition by Name, Customer order by count(*) desc) freq_name_cust
from table1 tb_
group by Name, Customer, Location
The final step puts it all together:
select tab.*, tb_.Location most_freq_location
from table1 tab
inner join
(select Name, Customer, Location, row_number() over (partition by Name, Customer order by count(*) desc) freq_name_cust
from table1
group by Name, Customer, Location) tb_
on tb_.Name = tab.Name
and tb_.Customer = tab.Customer
and freq_name_cust = 1
You can see how it all works in this Fiddle where I deliberately inserted rows with the same frequency for California and Texas for one of the customers for illustration purposes.
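For readers without access to the fiddle, the same idea can be tried locally. This is only a sketch: it uses Python's sqlite3 module (window functions need SQLite 3.25 or later), the four sample rows are reconstructed from the question's description rather than taken from the original data, and the counting is split into a separate CTE for readability instead of nesting count(*) inside the window's order by.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (name TEXT, customer TEXT, location TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [
        ("Jane", "BEC", "Texas"),
        ("Jane", "BEC", "Texas"),
        ("Jane", "BEC", "California"),
        ("Jane", "ARC", "California"),
    ],
)

most_frequent = conn.execute(
    """
    WITH counts AS (
        -- one row per (name, customer, location) with its frequency
        SELECT name, customer, location, COUNT(*) AS n
        FROM table1
        GROUP BY name, customer, location
    ),
    ranked AS (
        -- rank locations within each (name, customer); 1 = most frequent
        SELECT name, customer, location,
               ROW_NUMBER() OVER (PARTITION BY name, customer ORDER BY n DESC) AS rn
        FROM counts
    )
    SELECT t.name, t.customer, t.location, r.location AS most_freq_location
    FROM table1 t
    JOIN ranked r
      ON r.name = t.name AND r.customer = t.customer AND r.rn = 1
    ORDER BY t.customer, t.location
    """
).fetchall()

for row in most_frequent:
    print(row)
```

With this data the three BEC rows get Texas and the ARC row gets California. The sample contains no ties; with tied counts, which location receives row number 1 is not defined unless a tie-breaker is added to the order by.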

SQL query to efficiently select non-perfect duplicates

I have a database table in the entity-attribute-value format which looks like this:
I wish to select all rows that have the same values for the 'entity' and 'attribute' columns, but have different values for the 'value' column. Multiple rows with the same values for all three columns should be treated as a single row. The way I achieved this is by using SELECT DISTINCT.
SELECT entity_id, attribute_name, COUNT(attribute_name) AS NumOcc
FROM (SELECT DISTINCT * FROM radiology) x
GROUP BY entity_id,attribute_name
HAVING COUNT(attribute_name) > 1
Response for this query
However, I have read that using SELECT DISTINCT is quite costly. I plan to run this query on very large tables, so I am looking for a way to optimize it, perhaps without using SELECT DISTINCT.
I am using PostgreSQL 10.3.

select *
from radiology r
join (
select entity_id
, attribute_name
from radiology
group by
entity_id
, attribute_name
having count(distinct value) > 1
) dupe
on r.entity_id = dupe.entity_id
and r.attribute_name = dupe.attribute_name

This should work for you:
select a.* from radiology a join
(select entity, attribute, count(distinct value) cnt
from radiology
group by entity, attribute
having count(distinct value)>1)b
on a.entity=b.entity and a.attribute=b.attribute

I wish to select all rows that have the same values for the 'entity' and 'attribute' columns, but have different values for the 'value' column.
Your method does not do this. I would use exists:
select r.*
from radiology r
where exists (select 1
from radiology r2
where r2.entity = r.entity and r2.attribute = r.attribute and
r2.value <> r.value
);
If you just want the entity/attribute pairs that have more than one distinct value, use group by:
select entity, attribute
from radiology
group by entity, attribute
having min(value) <> max(value);
Note that you could use having count(distinct value) > 1, but count(distinct) incurs more overhead than min() and max().
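To check that the two having conditions select the same groups, here is a small sketch using Python's sqlite3 module. The column names entity_id and attribute_name follow the asker's query, and the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE radiology (entity_id INTEGER, attribute_name TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO radiology VALUES (?, ?, ?)",
    [
        (1, "size", "5"), (1, "size", "5"), (1, "size", "7"),  # conflicting values
        (2, "size", "3"), (2, "size", "3"),                    # exact duplicates only
        (2, "shape", "round"),
    ],
)

# If the smallest and largest value in a group differ, the group holds
# at least two distinct values; no DISTINCT pass is required.
by_min_max = conn.execute(
    """
    SELECT entity_id, attribute_name
    FROM radiology
    GROUP BY entity_id, attribute_name
    HAVING MIN(value) <> MAX(value)
    """
).fetchall()

# Equivalent formulation with COUNT(DISTINCT ...), for comparison.
by_count_distinct = conn.execute(
    """
    SELECT entity_id, attribute_name
    FROM radiology
    GROUP BY entity_id, attribute_name
    HAVING COUNT(DISTINCT value) > 1
    """
).fetchall()

print(by_min_max, by_count_distinct)
```

Both queries ignore NULL values, so a group whose only variation is NULL versus non-NULL is reported by neither.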

SQL - SELECTING multiple columns with duplicate information and some with unique

I need to produce a query that will pull all the records with:
Same First_Name
Same Last_Name
Same DOB
Same client_ID (Client_ID is given "1011")
Different Member_ID
Note: I have a huge database with millions of records, and as soon as I use more than one subquery it takes hours to return even a first sample of data (maybe my subqueries were incorrect, though).
I've tried building this query step by step, but it still fails to filter the way I need.
Select
ta.Member_ID,
ta.First_Name,
ta.LAST_NAME,
ta.date_of_birth,
ta.client_id
From TestTable ta
WHERE client_id = '1011'
AND
(SELECT COUNT(*)
FROM TestTable ta2
WHERE ta.date_of_birth=ta2.date_of_birth
AND ta.FIRST_NAME=ta2.FIRST_NAME
AND ta.LAST_NAME=ta2.LAST_NAME)>1
I haven't even got to the point of selecting different Member_IDs, and this query still pulls records that do not necessarily meet those criteria.
Please help.
Here is sample data, highlighted is the pair that I want to be able to get:
My Sample Table

Just use window functions:
SELECT ta.Member_ID, ta.First_Name, ta.LAST_NAME, ta.date_of_birth,
ta.client_id
FROM (SELECT ta.*,
COUNT(*) OVER (PARTITION BY FIRST_NAME, LAST_NAME, date_of_birth) as cnt
FROM TestTable ta
) ta
WHERE client_id = '1011' AND cnt > 1;

As a general note, avoid correlated subqueries unless you really need them. Performance can take a severe hit because the subquery may be evaluated for every row of the outer query. A simple self-join should work:
Select
ta.Member_ID,
ta.First_Name,
ta.LAST_NAME,
ta.date_of_birth,
ta.client_id
From TestTable ta JOIN TestTable ta2
ON ta.date_of_birth=ta2.date_of_birth
AND ta.FIRST_NAME=ta2.FIRST_NAME
AND ta.LAST_NAME=ta2.LAST_NAME
AND ta.client_id=ta2.client_id
WHERE ta.client_id = '1011' AND ta.Member_ID <> ta2.Member_ID

If your only intention is to find records with the same details but different Member_ID values, use a basic group by to filter the data. This is usually not as costly as joining the table to itself.
Select
ta.First_Name,
ta.LAST_NAME,
ta.date_of_birth,
ta.client_id
From TestTable ta
group by
ta.First_Name,
ta.LAST_NAME,
ta.date_of_birth,
ta.client_id
having count(distinct Member_ID) > 1
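Here is a minimal sketch of that grouping, run through Python's sqlite3 module on invented rows. Two things in it are additions not present in the answer above: the where client_id = '1011' filter (taken from the question's requirements) and the join back to the table to list the matching Member_ID values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TestTable (Member_ID INTEGER, First_Name TEXT, "
    "Last_Name TEXT, date_of_birth TEXT, client_id TEXT)"
)
conn.executemany(
    "INSERT INTO TestTable VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ann", "Lee", "1990-01-01", "1011"),
        (2, "Ann", "Lee", "1990-01-01", "1011"),  # same person, different member
        (3, "Bob", "Ray", "1985-05-05", "1011"),
        (3, "Bob", "Ray", "1985-05-05", "1011"),  # exact duplicate, same member
        (4, "Ann", "Lee", "1990-01-01", "2022"),  # other client
    ],
)

# Groups of identical personal details that map to more than one member.
grouped_sql = """
    SELECT First_Name, Last_Name, date_of_birth, client_id
    FROM TestTable
    WHERE client_id = '1011'
    GROUP BY First_Name, Last_Name, date_of_birth, client_id
    HAVING COUNT(DISTINCT Member_ID) > 1
"""
duplicate_groups = conn.execute(grouped_sql).fetchall()

# Join back to the table to see which Member_IDs are involved.
member_ids = conn.execute(
    f"""
    SELECT DISTINCT t.Member_ID
    FROM TestTable t
    JOIN ({grouped_sql}) d
      ON t.First_Name = d.First_Name
     AND t.Last_Name = d.Last_Name
     AND t.date_of_birth = d.date_of_birth
     AND t.client_id = d.client_id
    ORDER BY t.Member_ID
    """
).fetchall()

print(duplicate_groups, member_ids)
```

Counting distinct Member_ID values, rather than rows, is what keeps the exact-duplicate Bob rows out of the result.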

We Keep Coding
