How can I reduce Google BigQuery costs? - google-bigquery

I have been searching using Google BigQuery on the GDELT database of global news. I am repeating the same search 54 times, just changing the name of an African country.
Is it possible to include all 54 searches in the same query? As I understand the billing, the cost is based on the size of the database searched, not the number of query elements. Is that correct?
Here is an example of my queries for the country of Gabon, selecting themes appearing with ICT.
SELECT theme, COUNT(*) as count
FROM (
select UNIQUE(REGEXP_REPLACE(SPLIT(V2locations,';'), r',.*', '')) theme
from [gdelt-bq:gdeltv2.gkg]
where DATE>20150302000000 and DATE < 20200609000000 and V2locations like '%Gabon%'
AND V2themes like '%WB_133_INFORMATION_AND_COMMUNICATION_TECHNOLOGIES%'
)
group by theme
ORDER BY 2 DESC
LIMIT 300

The simplest way to do so without changing your query logic is to replace
V2locations like '%Gabon%'
with
REGEXP_MATCH(V2locations, r'Gabon|Angola|Zimbabwe')
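On billing: with on-demand pricing, BigQuery charges for the bytes read from the columns a query references, not for the number of conditions, so one combined query scans those columns once instead of 54 times.
Note: the query in question is written in BigQuery Legacy SQL, so I would recommend migrating to Standard SQL (where the equivalent function is REGEXP_CONTAINS).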
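With 54 country names, typing the alternation by hand is error-prone. A minimal Python sketch for generating it is shown here; `build_country_pattern` is a hypothetical helper (not part of any BigQuery client library), and the three names are just the ones used above.

```python
import re


def build_country_pattern(countries):
    """Join country names into a regex alternation, e.g. 'Gabon|Angola'."""
    # re.escape follows Python's regex rules; for names containing spaces or
    # punctuation, check the result against BigQuery's RE2 syntax before use.
    return "|".join(re.escape(name) for name in countries)


pattern = build_country_pattern(["Gabon", "Angola", "Zimbabwe"])
print(pattern)
```

The printed string can be pasted into the REGEXP_MATCH call as the pattern literal.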

Related

What is the best way to aggregate and also access specific data within a few tables?

I've been struggling with Postgres queries and I think the high level problem I'm struggling with is the best way to structure aggregation and non-aggregation within the same query.
For example, say I have a list of companies. I want to take IBM's revenue and divide it by the sum of its industry's entire revenue. Queries like that lead me to build complex logic, and I'm not sure it is the best approach.
For example, if I have this query:
select extract(year from fisy.date),
(c.carbon/ln(fisy.totalrevenue- fisy.grossprofit)) as emissionsPerCOGS
from company."financials_Income_Statement_yearly" fisy
join company.carbon c
on lower(c.ticker) = lower(fisy.ticker)
where lower(fisy.ticker) = 'ibm.us'
This works fine; it gives me the year and a formula result for each year.
But say I have another table, General, which holds details on IBM, and I want to see IBM's results above compared to its industry. I can get the list of tickers for the where clause via:
select lower(g4.ticker)
from "company"."General" g4
where industry = (
select industry
from "company"."General" g3
where lower(g3.ticker) = 'ibm.us'
)
But at this point I'm stuck. If I include this in the where clause, all the selected results are for the aggregated data, and I then have to filter IBM's data back out of it.
So my question is: is there a straightforward way to get aggregated data and individual (IBM-type) data together, so I can take individual sets of data in my DB and compare them against some aggregated view?
For example, say I have a list of companies. I want to take IBM's revenue and divide it against the sum of its industry's entire revenue.
This problem statement is pretty simple. I don't know what the queries in your question have to do with this problem. But you could just do:
select sum(revenue) filter (where company = 'IBM') / sum(revenue)
from t;
You can add a where clause if you want to limit this to a particular set of companies.
Or, if you wanted this for all companies:
select company, sum(revenue), sum(sum(revenue)) over ()
from t
group by company;
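To see the per-company share computed end to end, here is a small runnable sketch. It uses Python's built-in sqlite3 as a stand-in for Postgres (an assumption made only so the example is self-contained; it needs SQLite 3.25 or later for window functions), and the table `t` with its rows is made up. It groups in a subquery and then applies the window total, which is equivalent to the nested sum(sum(revenue)) over () form above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (company TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("IBM", 40.0), ("IBM", 20.0), ("ACME", 30.0), ("Initech", 10.0)],
)

# Group per company first, then divide by the window total over all groups.
rows = conn.execute(
    """
    SELECT company, rev, rev / SUM(rev) OVER () AS share
    FROM (SELECT company, SUM(revenue) AS rev FROM t GROUP BY company)
    ORDER BY company
    """
).fetchall()

shares = {company: share for company, rev, share in rows}
print(shares)
```

Adding a where clause on industry inside the subquery would restrict the total to one industry.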

Quick one on Big Query SQL-Ecommerce Data

I am trying to replicate Google Analytics data in BigQuery, but I can't get the numbers to match.
I am using Custom Dimension 40 (user subscription status),
but I am getting wrong numbers in BQ.
Can someone help me with this?
This is the query I am using, but I can't work out what is wrong with it.
SELECT
(SELECT value FROM hits.customDimensions where index=40) AS UserStatus,
COUNT(hits.transaction.transactionId) AS Unique_Purchases
FROM
`xxxxxxxxxxxxx.ga_sessions_2020*` AS GA, --new rollup
UNNEST(GA.hits) AS hits
WHERE
(SELECT value FROM hits.customDimensions where index=40) IN ("xx001","xxx002")
GROUP BY 1
The result I get from BigQuery is wrong.
I have checked the dates as well, but I don't know why it is wrong.
Your question is rather unclear. But because you want something to be unique and numbers are mysteriously not what you want, I would suggest using COUNT(DISTINCT):
COUNT(DISTINCT hits.transaction.transactionId) AS Unique_Purchases
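The difference between the two counts is easy to see on a toy table. This is only a sketch: it uses Python's sqlite3 rather than BigQuery, and the `hits` table and its rows are invented to mimic a transaction ID that appears on more than one hit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (user_status TEXT, transaction_id TEXT)")
conn.executemany(
    "INSERT INTO hits VALUES (?, ?)",
    [
        ("xx001", "T1"),
        ("xx001", "T1"),  # same transaction seen on a second hit
        ("xx001", "T2"),
        ("xx001", None),  # hit without a transaction
    ],
)

plain, distinct = conn.execute(
    "SELECT COUNT(transaction_id), COUNT(DISTINCT transaction_id) FROM hits"
).fetchone()
print(plain, distinct)
```

Here the plain count is 3 (NULLs are skipped, duplicates are not) while the distinct count is 2.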
As far as I understand, you imported Google Analytics data into BigQuery and you are trying to group by the custom dimension with index 40, restricted to the values "xx001" and "xxx002", in order to know how many hit transactions were made for each of these dimension values.
Replicating your scenario and trying to execute the query you posted, I got the following error.
However, I created a query that could help with your use case. First, it selects the transactionId and dimension value for hits where the transactionId is not null and the dimension index equals 40. Then it groups by the dimension value, filtered to the values "xx001" and "xxx002".
WITH tx AS (
SELECT
HIT.transaction.transactionId,
CD.value
FROM
`xxxxxxxxxxxxx.ga_sessions_2020*` AS GA,
UNNEST(GA.hits) AS HIT,
UNNEST(HIT.customDimensions) AS CD
WHERE
HIT.transaction.transactionId IS NOT NULL
AND
CD.index = 40
)
SELECT tx.value AS UserStatus, count(tx.transactionId) AS Unique_Purchases
FROM tx
WHERE tx.value IN ("xx001","xxx002")
GROUP BY tx.value
For further details about the format and schema of the data that is imported into BigQuery, I found this document.

SQL - Parse a field and SUM numbers at regular delimiter intervals

I request your help for an issue beyond my current skills...
I'm using Google Big Query to store analytics data about my website, and to calculate the revenue I have a quite difficult query to build.
We have the field %product%, which is formatted as follows:
;%productID%;%productQuantity%;%productRevenue%;;
If more than one product has been bought, the data for the different products is delimited by ",", which can give this:
;12345678;1;49.99;;,;45678912;1;54.99;;
;45678912;2;59.98;;,;14521452;2;139.98;;,;12345678;2;19.98;;
;14521452;1;54.99;;
The only way to calculate the revenue is to sum all the different %productRevenue% from a line and store this into a column.
I have no idea how to do it with just a SQL query... Maybe with a regex? Any ideas?
I'd like to create a view with that info so I can easily pull the data into Power BI. But maybe I should process it with M directly in Power BI?
Thanks a lot,
Alex
Below is for BigQuery Standard SQL
#standardSQL
SELECT
SPLIT(i, ';')[OFFSET(1)] productID,
SUM(CAST(SPLIT(i, ';')[OFFSET(2)] AS INT64)) productQuantity,
SUM(CAST(SPLIT(i, ';')[OFFSET(3)] AS FLOAT64)) productRevenue
FROM `project.dataset.table`,
UNNEST(SPLIT(product)) i
GROUP BY productID
Applied to the sample data from your question, the output is:
Row  productID  productQuantity  productRevenue
1    12345678   3                69.97
2    45678912   3                114.97
3    14521452   3                194.97
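The question also asks for the revenue per line (per `product` value) rather than per product. As a rough illustration of that parsing logic outside SQL, here is a Python sketch; `line_revenue` is a hypothetical helper, and it assumes every entry follows the ;id;quantity;revenue;; layout exactly.

```python
from decimal import Decimal


def line_revenue(product):
    """Sum the revenue field of every product entry in one `product` value."""
    total = Decimal("0")
    for entry in product.split(","):
        fields = entry.split(";")
        # Layout per entry: ['', productID, quantity, revenue, '', '']
        total += Decimal(fields[3])
    return total


print(line_revenue(";12345678;1;49.99;;,;45678912;1;54.99;;"))
```

Decimal is used instead of float so that currency amounts add up exactly; for the first sample line this prints 104.98.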

Group By Using Wildcards in Big Query

I have this query:
SELECT SomeTableA.*
FROM SomeTableB
LEFT JOIN SomeTableA USING (XYZ)
GROUP BY SomeTableA.*
I know that I cannot use wildcards in the GROUP BY part. At the same time, I don't really like listing all the columns (there can be up to 20) manually.
Could this be added as a new feature? Or is there an easy way to get the list of all 20 columns from SomeTableA for the GROUP BY part?
If you really have the exact query shown in your question, try the query below instead; no grouping is required.
#standardSQL
SELECT DISTINCT *
FROM `project.dataset.tableA`
WHERE xyz IN (SELECT xyz FROM `project.dataset.tableB`)
As for "Group By Using Wildcards in Big Query": this sounds more like grouping by a struct, which is not supported, so you can submit a feature request if you want - https://issuetracker.google.com/issues/new?component=187149&template=0

Vague count in sql select statements

I guess this has been asked in the site before but I can't find it.
I've seen on some sites that a search shows only a vague count of its results. For example, here on Stack Overflow, when you search for a question it (sometimes) says "+5000 results"; in Gmail, when you search by keywords, it says "hundreds"; and Google says "approx. X results". Is this just a way to show the user a huge number in an easy-to-understand form? Or is it actually a fast way to count results that can be used in a database (I'm learning Oracle at the moment, version 10g)? Something like "hey, if you get more than 1k results, just stop and tell me there are more than 1k".
Thanks
PS. I'm new to databases.
Usually this is just a nice way to display a number.
I don't believe there is a way to do what you are asking for in SQL - count does not have an option for counting up until some number.
I also would not assume this is coming from SQL in either gmail, or stackoverflow.
Most search engines will return a total number of matches to a search, and then let you page through results.
As for making an exact number more human readable, here is an example from Rails:
http://api.rubyonrails.org/classes/ActionView/Helpers/NumberHelper.html#method-i-number_to_human
With Oracle, you can always resort to analytical functions in order to calculate the exact number of rows about to be returned. This is an example of such a query:
SELECT inner.*, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... your own, sorted search query ...]
) inner
This will give you the total number of rows for your specific subquery. When you want to apply paging as well, you can further wrap these SQL parts as such:
SELECT outer.* FROM (
SELECT * FROM (
SELECT inner.*,ROWNUM as RNUM, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... your own, sorted search query ...]
) inner
)
WHERE ROWNUM < :max_row
) outer
WHERE outer.RNUM > :min_row
Replace min_row and max_row by meaningful values. But beware that calculating the exact number of rows can be expensive when you're not filtering using UNIQUE SCAN or relatively narrow RANGE SCAN operations on indexes. Read more about this here: Speed of paged queries in Oracle
As others have said, you can always put an absolute upper limit, such as 5000, on your query using a ROWNUM <= 5000 filter and then just indicate that there are more than 5000 results. Note that Oracle can be very good at optimising queries when you apply ROWNUM filtering. Find some info on that subject here:
http://www.dba-oracle.com/t_sql_tuning_rownum_equals_one.htm
A vague count acts as a buffer that can be displayed promptly. If the user wants to see more results, they can request more.
It's a performance measure: after displaying the first results, sites like Google keep searching for more.
I don't know how fast this will run, but you can try:
SELECT NULL FROM your_tables WHERE your_condition AND ROWNUM <= 1001
If the number of rows returned equals 1001, then the total number of matching records is greater than 1000.
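The same "stop after 1001 rows" idea can be tried out locally. This sketch uses Python's sqlite3, where LIMIT plays the role that the ROWNUM filter plays in Oracle (an assumption for runnability, not Oracle syntax); the `items` table, the `score > 0` condition, and `capped_count` are all hypothetical.

```python
import sqlite3


def capped_count(conn, cap):
    """Return (count, more), where count never exceeds cap."""
    # The inner LIMIT lets the database stop after cap + 1 matching rows.
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM (SELECT 1 FROM items WHERE score > 0 LIMIT ?)",
        (cap + 1,),
    ).fetchone()
    return min(n, cap), n > cap


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (score INTEGER)")
conn.executemany("INSERT INTO items VALUES (?)", [(1,)] * 5000)
print(capped_count(conn, 1000))
```

When the second element is true, the interface can show "1000+" instead of an exact total.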
this question gives some pretty good information
When you run an SQL query, you can set a
LIMIT 0, 100
for example, and you will only get the first hundred rows. You can then tell the user that there are 100+ results for their request.
For Google, I can't say whether they really know that there are more than 27'000'000'000 results for a request, but I believe they do. Some standard requests have their results stored, with the update done in the background.