BigQuery query performance when using STARTS_WITH() on a table of 12 million rows

I have a table company_totals with the following schema:
column_name        column_data_type
company            STRING
link               STRING
full_count         FLOAT
starts_with_count  FLOAT
Number of rows = 12,000,000. Table size = 1.6 GB. CLUSTERED BY = company, link. SEARCH INDEX created on column = link.
I have the following SELECT statement, which runs for more than 6 hours and then times out with the error "Operation timed out after 6.0 hours. Consider reducing the amount of work performed by your operation so that it can complete within this limit."
SELECT first_table.company, first_table.link, null as full_count, SUM(second_table.full_count) AS starts_with_count
FROM company_totals first_table, company_totals second_table
WHERE STARTS_WITH(second_table.link, first_table.link)
group by first_table.company, first_table.link
The above query calculates the column starts_with_count, which is the sum of another column, full_count, based on a STARTS_WITH() condition. In the company_totals table, starts_with_count is the column I want to fill. I have added the expected values for this column manually to show what I expect; the other column values are already present in the table. For each row, starts_with_count is the sum of full_count over all rows whose link starts with that row's link (including the row itself).
company  link                               full_count  starts_with_count (expected)
abc      http://www.abc.net1                1           15 (= sum (full_count) where link like 'http://www.abc.net1%')
abc      http://www.abc.net1/page1          2           9 (= sum (full_count) where link like 'http://www.abc.net1/page1%')
abc      http://www.abc.net1/page1/folder1  3           3 (= sum (full_count) where link like 'http://www.abc.net1/page1/folder1%')
abc      http://www.abc.net1/page1/folder2  4           4
abc      http://www.abc.net1/page2          5           5
xyz      http://www.xyz.net1/               6           21
xyz      http://www.xyz.net1/page1/         7           15
xyz      http://www.xyz.net1/page1/file1    8           8
Highly appreciate any help in this issue.
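For reference, the expected values in the table above can be reproduced on the eight sample rows with a small sketch. It runs the same self-join in an in-memory SQLite database from Python; SQLite has no STARTS_WITH(), so the prefix test is written with substr(), and the helper name starts_with_counts is only illustrative. This checks the intended semantics on toy data and says nothing about how to make the 12-million-row query finish in BigQuery.

```python
import sqlite3

# Sample rows copied from the question: (company, link, full_count).
SAMPLE = [
    ("abc", "http://www.abc.net1", 1.0),
    ("abc", "http://www.abc.net1/page1", 2.0),
    ("abc", "http://www.abc.net1/page1/folder1", 3.0),
    ("abc", "http://www.abc.net1/page1/folder2", 4.0),
    ("abc", "http://www.abc.net1/page2", 5.0),
    ("xyz", "http://www.xyz.net1/", 6.0),
    ("xyz", "http://www.xyz.net1/page1/", 7.0),
    ("xyz", "http://www.xyz.net1/page1/file1", 8.0),
]


def starts_with_counts(rows):
    """Return (company, link, starts_with_count) per row, ordered by link."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE company_totals (company TEXT, link TEXT, full_count REAL)"
    )
    conn.executemany("INSERT INTO company_totals VALUES (?, ?, ?)", rows)
    # Stand-in for STARTS_WITH(s.link, f.link): compare the leading
    # length(f.link) characters of s.link with f.link.
    query = """
        SELECT f.company, f.link, SUM(s.full_count) AS starts_with_count
        FROM company_totals AS f
        JOIN company_totals AS s
          ON substr(s.link, 1, length(f.link)) = f.link
        GROUP BY f.company, f.link
        ORDER BY f.link
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


results = starts_with_counts(SAMPLE)
for company, link, total in results:
    print(company, link, total)
```

On the sample rows the sums come out as 15, 9, 3, 4, 5, 21, 15 and 8, matching the expected column.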

Related

BigQuery INSERT SELECT results in random order of records?

I used standard SQL to insert data from one table into another in BigQuery using Jupyter Notebook.
For example I have two tables:
table1
ID Product
0 1 book1
1 2 book2
2 3 book3
table2
ID Product Price
0 5 book5 8.0
1 6 book6 9.0
2 4 book4 3.0
I used the following code:
INSERT test_data.table1
SELECT *
FROM test_data.table2
ORDER BY Price;
SELECT *
FROM test_data.table1
I got
ID Product
0 1 book1
1 3 book3
2 2 book2
3 5 book5
4 6 book6
5 4 book4
I expected the rows to appear in the order of ID 1, 2, 3, 4, 5, 6, with 4, 5 and 6 ordered by Price.
It also seems that INSERT and/or SELECT return the records in a different, random order on each run.
How do I control the order of the SELECT output without including the 'Price' column in the output table to sort by?
The same thing happens when I import a CSV file to create a new table: the record order is random when I display the rows with SELECT.
The ORDER BY clause specifies a column or expression as the sort criterion for the result set.
If an ORDER BY clause is not present, the order of the results of a query is not defined.
Column aliases from a FROM clause or SELECT list are allowed. If a query contains aliases in the SELECT clause, those aliases override names in the corresponding FROM clause.
So you most likely want something like the following:
SELECT *
FROM test_data.table1
ORDER BY Price DESC
LIMIT 100
Note the use of LIMIT, which is important: if you are sorting a very large number of values, use a LIMIT clause to avoid a "resources exceeded" type of error.
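As a minimal sketch of the point that row order is only defined by an ORDER BY on the reading query, the snippet below rebuilds the two sample tables in an in-memory SQLite database (an assumption for illustration; BigQuery is not involved) and sorts at SELECT time. The second query orders the copied rows by Price without storing or returning Price, by joining back to table2; the join key and the sort expression are choices made for this example, not something the original answer prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (ID INTEGER, Product TEXT);
    CREATE TABLE table2 (ID INTEGER, Product TEXT, Price REAL);
    INSERT INTO table1 VALUES (1, 'book1'), (2, 'book2'), (3, 'book3');
    INSERT INTO table2 VALUES (5, 'book5', 8.0), (6, 'book6', 9.0), (4, 'book4', 3.0);
    -- Copy only the columns table1 has; stored order is not guaranteed anyway.
    INSERT INTO table1 (ID, Product) SELECT ID, Product FROM table2;
""")

# Option 1: sort on a column that table1 already has.
ids_by_id = [row[0] for row in conn.execute(
    "SELECT ID, Product FROM table1 ORDER BY ID"
)]

# Option 2: sort by Price without selecting it. Rows that have no price
# (the original table1 rows) come first, then the rest by ascending Price.
ids_by_price = [row[0] for row in conn.execute("""
    SELECT t1.ID, t1.Product
    FROM table1 AS t1
    LEFT JOIN table2 AS t2 ON t2.ID = t1.ID
    ORDER BY t2.Price IS NOT NULL, t2.Price, t1.ID
""")]

print(ids_by_id)
print(ids_by_price)
conn.close()
```

Both queries return the IDs in the order 1 to 6 for this sample, because the explicit ORDER BY, not the insert order, decides the output.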

Extract only values that are greater than a threshold in another table in InfluxDB

I am using InfluxDB and would like to extract the values that are greater than a certain threshold stored in another table.
For example, I have two tables as shown below.
Table A
Time value
1 15
2 25
3 9
4 22
Table B
Time threshold
1 16
2 12
3 13
4 15
Given the above two tables, I would like to extract the values that are greater than the first row in Table B. So what I want is the following:
Time value
2 25
4 22
I tried the SQL query below, but it didn't give the correct result.
select * from data1 where value > (select spec from spec1 limit1);
Look forward to your feedback.
Thanks.
Integrate the condition in an inner join:
select * from tableA as a
inner join tableB as b on a.time = b.time and a.value > b.threshold
When your time column doesn't only include integer values, you have to format the time and join on a time range. Here is an example:
SQL join on time range
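To make the join idea concrete, here is a minimal sketch on the sample data from the question. It uses an in-memory SQLite database from Python purely as a stand-in for a SQL engine that supports joins and subqueries (an assumption; it does not show InfluxDB's own query language), with the table names tableA and tableB taken from the answer. It shows both readings of the question: comparing each value with the threshold at the same time, and comparing every value with the first threshold only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (time INTEGER, value REAL);
    CREATE TABLE tableB (time INTEGER, threshold REAL);
    INSERT INTO tableA VALUES (1, 15), (2, 25), (3, 9), (4, 22);
    INSERT INTO tableB VALUES (1, 16), (2, 12), (3, 13), (4, 15);
""")

# Per-row comparison: each value against the threshold with the same time.
above_matching = conn.execute("""
    SELECT a.time, a.value
    FROM tableA AS a
    INNER JOIN tableB AS b
      ON a.time = b.time AND a.value > b.threshold
    ORDER BY a.time
""").fetchall()

# Single threshold: every value against the first row of tableB (threshold 16).
above_first = conn.execute("""
    SELECT time, value
    FROM tableA
    WHERE value > (SELECT threshold FROM tableB ORDER BY time LIMIT 1)
    ORDER BY time
""").fetchall()

print(above_matching)
print(above_first)
conn.close()
```

For this particular sample both queries return the rows at time 2 and time 4, which is the output the question asks for; on other data the two readings can differ.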

Excel Powerpivot measure conundrum- Average (of average?)

I have a powerpivot table that shows work_tickets and timestamps for each step taken towards resolution:
Ticket | Step | Time | TicketDuration
--------------------------------------
1        1      5:30   15
1        2      5:33   15
1        3      5:45   15
2        1      6:00   10
2        2      6:05   10
2        3      6:10   10
[TicketDuration] is a calculated column I added on my own. Now I'm trying to create a measure [AverageTicketDuration] that returns 12.5 minutes for the table above, i.e. (15+10)/2. I haven't got a clue how to use DAX to produce that result. Please help!
What you are looking for is the AVERAGEX function, which has the following definition: AVERAGEX(<table>,<expression>)
The idea is that it iterates through each row of the given table, applies your calculation, and then averages the results.
In the example below, I use Table1 as the table name.
To iterate over tickets, we start with VALUES( Table1[Ticket]), which returns the unique values in the Ticket column.
Then, assuming that the ticket duration is always the same within a ticket ID, the aggregation used in the expression would be AVERAGE(Table1[TicketDuration]), since for ticket 1, for example, (15 + 15 + 15)/3 = 15. Wrapping it in CALCULATE makes the average evaluate for the current ticket only, rather than over the whole table.
Put together, the measure would look like this:
measure:=AVERAGEX( VALUES( Table1[Ticket]), CALCULATE( AVERAGE( Table1[TicketDuration])))
The result when dropped into a pivot using your sample data.
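The "average of per-ticket averages" that the measure is meant to compute can be checked outside Power Pivot with a few lines of Python. This is only a sketch of the arithmetic on the sample rows, with the hypothetical helper name average_ticket_duration; it is not DAX and does not model filter context.

```python
from collections import defaultdict
from statistics import mean

# Sample rows from the question: (ticket, step, ticket_duration_minutes).
ROWS = [
    (1, 1, 15), (1, 2, 15), (1, 3, 15),
    (2, 1, 10), (2, 2, 10), (2, 3, 10),
]


def average_ticket_duration(rows):
    """Average the per-ticket average durations, one value per distinct ticket."""
    durations_by_ticket = defaultdict(list)
    for ticket, _step, duration in rows:
        durations_by_ticket[ticket].append(duration)
    # Inner mean: one duration per ticket. Outer mean: across tickets.
    return mean(mean(values) for values in durations_by_ticket.values())


average = average_ticket_duration(ROWS)
print(average)
```

Grouping by ticket first means a ticket with many steps does not weigh more than a ticket with few steps, which is the reason for iterating over the distinct ticket values.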

SQL percentage usage calculation using 2 columns

Trying to get the percentage usage for a report based on the following columns:
Dept Ext Sec1 Sec2 StartDate EndDate
---------------------------------------------------------------
1 1234 5 5 2017-05-01:08:00:00 2017-05-04:08:00:10
2 1230 8 8 2017-05-01:09:10:00 2017-05-04:09:10:11
1 1234 15 15 2017-05-02:08:01:00 2017-05-04:08:01:20
I need to display the percentage of time the user spent on the phone, based on the total seconds in Sec1, for the time period. If need be, I can create a third column with the percentage total as part of the creation job (the final table is generated from a join query of two other tables). Thanks
I had to add these lines to my creatDB query to get the right results:
alter table compinfo.dbo.pabxreport add TotalSec Int
alter table compinfo.dbo.pabxreport add TotalPer Decimal(14,8)
update compinfo.dbo.pabxreport
set TotalSec = (select sum(billsec1) from pabxreport)
update compinfo.dbo.pabxreport
set TotalPer = (billsec1 * 100.00 / TotalSec)
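The same percentage can also be computed at query time, without adding TotalSec and TotalPer columns. The sketch below assumes a simplified pabxreport table with only the dept, ext and billsec1 columns and uses an in-memory SQLite database from Python rather than SQL Server, so treat it as an illustration of the arithmetic and not as a drop-in replacement for the statements above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pabxreport (dept INTEGER, ext INTEGER, billsec1 INTEGER);
    INSERT INTO pabxreport VALUES (1, 1234, 5), (2, 1230, 8), (1, 1234, 15);
""")

# Each row's share of the grand total of billsec1, as a percentage.
# Multiplying by 100.0 forces floating-point division.
percentages = conn.execute("""
    SELECT ext,
           billsec1,
           billsec1 * 100.0 / (SELECT SUM(billsec1) FROM pabxreport) AS total_per
    FROM pabxreport
    ORDER BY billsec1
""").fetchall()

for ext, seconds, pct in percentages:
    print(ext, seconds, round(pct, 2))
conn.close()
```

With the three sample rows the total is 28 seconds, so the shares add up to 100 percent.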

access SQL count results using multiple sub queries against one table

I am using Access with a table having over 200k rows of data. I am looking for counts on a column which is broken down by job descriptions. For example, I want to return the total count (id) for a location where a person is status = "active" and position like "cook" [should equal 20] also another output where I get a count (id) for the same location where a person is status = "active" and position = "Lead Cook" [should equal 5]. So, one is a partial of the total population.
I have a few others to do just like this (# Bakers, # Lead Bakers...). How can I do this with one grand query/subquery, or with one query for each grouping?
My attempt is more like this:
SELECT
a.location,
Count(a.EMPLOYEE_NUMBER) AS [# Cook Total], --- should equal 20
(SELECT count(b.EMPLOYEE_ID) FROM Table_abc AS b where b.STATUS="Active Assignment" AND b.POSITION Like "*cook*" AND b.EMPLOYEE_ID=a.EMPLOYEE_ID) AS [# Lead Cook], --- should equal 5
FROM Table_abc AS a
ORDER BY a.location;
Results should be similar to:
Location Total Cooks Lead Cooks Total Bakers Lead Bakers
1 20 4 15 2
2 45 7 12 2
3 22 2 16 1
4 19 2 17 2
5 5 1 9 1
Try using conditional aggregation -- no need for sub queries.
Something like this should work (although I may not understand your desired results completely):
select location,
count(EMPLOYEE_NUMBER) as CookTotal,
sum(IIf(POSITION Like "*cook*",1,0)) as AllCooks,
sum(IIf(POSITION = "Lead Cook",1,0)) as LeadCooks
from Table_abc
where STATUS="Active Assignment"
group by location
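Conditional aggregation can be tried out on a few made-up rows. The sketch below uses an in-memory SQLite database from Python, so IIf(...) becomes CASE WHEN ... END and the Access wildcard * becomes %; the six employee rows and the snake_case column aliases are invented for the example. It extends the same pattern to the baker columns from the desired output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Table_abc "
    "(location INTEGER, employee_id INTEGER, status TEXT, position TEXT)"
)
# Hypothetical sample rows, not data from the question.
conn.executemany("INSERT INTO Table_abc VALUES (?, ?, ?, ?)", [
    (1, 101, "Active Assignment", "Cook"),
    (1, 102, "Active Assignment", "Lead Cook"),
    (1, 103, "Active Assignment", "Baker"),
    (1, 104, "Terminated", "Cook"),
    (2, 201, "Active Assignment", "Lead Cook"),
    (2, 202, "Active Assignment", "Lead Baker"),
])

# One pass over the table: each SUM counts the rows matching its condition.
# SQLite's LIKE is case-insensitive for ASCII, so '%cook%' matches 'Cook'.
counts = conn.execute("""
    SELECT location,
           SUM(CASE WHEN position LIKE '%cook%' THEN 1 ELSE 0 END) AS total_cooks,
           SUM(CASE WHEN position = 'Lead Cook' THEN 1 ELSE 0 END) AS lead_cooks,
           SUM(CASE WHEN position LIKE '%baker%' THEN 1 ELSE 0 END) AS total_bakers,
           SUM(CASE WHEN position = 'Lead Baker' THEN 1 ELSE 0 END) AS lead_bakers
    FROM Table_abc
    WHERE status = 'Active Assignment'
    GROUP BY location
    ORDER BY location
""").fetchall()

for row in counts:
    print(row)
conn.close()
```

The terminated cook at location 1 is excluded by the WHERE clause, and each lead position is counted both in its own column and in the matching total.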