Counting substrings in a string - google-bigquery

How can I count the numbers of times a substring shows up in a string?
In this case, I'm looking for every time "connect.facebook.net/en_US/all.js" shows up in the HTML bodies of the top 300K internet sites (stored in httparchive).

You could use SPLIT() on the string, and count the number of records produced:
SELECT fb_times, COUNT(*) n_pages
FROM
(SELECT COUNT(splits)-1 WITHIN RECORD AS fb_times
FROM
(SELECT SPLIT(body, 'connect.facebook.net/en_US/all.js') splits
FROM [httparchive:runs.2014_08_15_requests_body]
WHERE body CONTAINS 'connect.facebook.net/en_US/all.js'
AND mimeType="text/html"
AND page=url))
GROUP BY 1
ORDER BY 1
Note the usage of WITHIN RECORD to count how many sub-records SPLIT() produced.
Results:
Fb_times N_pages
1 12,471
2 1,222
3 163
4 34
5 18
6 12
7 12
8 6
... ...

Related

Seperating a Oracle Query with 1.8 million rows into 40,000 row blocks

I have a project where I am taking Documents from one system and importing them into another.
The first system has the documents and associated keywords stored. I have a query that will return the results which will then be used as the index file to import them into the new system. There are about 1.8 million documents involved so this means 1.8 million rows (One per document).
I need to divide the returned results into blocks of 40,000 to make importing them in batches of 40,000 at a time, rather than one long import.
I have the query to return the results I need. Just need to know how to take that and break it up for easier import. My apologies if I have included to little information. This is my first time here asking for help.
Use the built-in function ORA_HASH to divide the rows into 45 buckets of roughly the same number of rows. For example:
select * from some_table where ora_hash(id, 44) = 0;
select * from some_table where ora_hash(id, 44) = 1;
...
select * from some_table where ora_hash(id, 44) = 44;
The function is deterministic and will always return the same result for the same input. The resulting number starts with 0 - which is normal for a hash, but unusual for Oracle, so the query may look off-by-one at first. The hash works better with more distinct values, so pass in the primary key or another unique value if possible. Don't use a low-cardinality column, like a status column, or the buckets will be lopsided.
This process is in some ways inefficient, since you're re-reading the same table 45 times. But since you're dealing with documents, I assume the table scanning won't be the bottleneck here.
A prefered way to bucketing the ID is to use the NTILE analytic function.
I'll demonstrate this on a simplified example with a table with 18 rows that should be divided in four chunks.
select listagg(id,',') within group (order by id) from tab;
1,2,3,7,8,9,10,15,16,17,18,19,20,21,23,24,25,26
Note, that the IDs are not consecutive, so no arithmetic can be used - the NTILE gets the parameter of the requested number of buckets (4) and calculates the chunk_id
select id,
ntile(4) over (order by ID) as chunk_id
from tab
order by id;
ID CHUNK_ID
---------- ----------
1 1
2 1
3 1
7 1
8 1
9 2
10 2
15 2
16 2
17 2
18 3
19 3
20 3
21 3
23 4
24 4
25 4
26 4
18 rows selected.
All but the last bucket are of the same size, the last one can be smaller.
If you want to calculate the ranges - use simple aggregation
with chunk as (
select id,
ntile(4) over (order by ID) as chunk_id
from tab)
select chunk_id, min(id) ID_from, max(id) id_to
from chunk
group by chunk_id
order by 1;
CHUNK_ID ID_FROM ID_TO
---------- ---------- ----------
1 1 8
2 9 17
3 18 21
4 23 26

Count Instances Of Occuring String With Unique IDs

I need to count the number of times that a specific string occurs but they when one ID has the same string more than once, only count it once. Basically, I need to count the number of occurrences of a string that occur uniquely to an ID. I believe this should be a simple thing to do but I don't know what I'm doing. Here is my current code:
SELECT
RXNAME as Name,
DUPERSID as ID,
COUNT(RXNAME) as Number
FROM
`OmniHealth.PrescriptionsMEPS`
GROUP BY
ID,
Name
ORDER BY
Number
When run, it says everything was counted as 1. Thanks for the help!
UPDATE:
Dataset: https://storage.googleapis.com/omnihealth/MepsPrescriptionData.csv
OUTPUT when run with code above:
Row Name ID Number
1 SUMATRIPTAN 68896102 1
2 IBUPROFEN 65063102 1
3 PENICILLN VK 66179101 1
4 FUROSEMIDE 63217102 1
5 HYSINGLA ER 70373101 1
6 FUROSEMIDE 76090101 1
7 SKELETAL MUSCLE RELAXANTS 78414101 1
8 AMOXICILLIN 69467103 1
9 TRAMADOL HCL 67667101 1
10 PANTOPRAZOLE 60737102 1
11 CARBAMIDE PEROXIDE 6.5% OTIC SOLN 63990104 1
12 PROMETH/COD 68433101 1
13 AZITHROMYCIN 79045102 1
14 METRONIDAZOL 75414101 1
15 DEXILANT 69625101 1
16 TRAMADOL HCL 66890203 1
17 AZITHROMYCIN 73838101 1
18 COLCRYS 63856102 1
19 PERMETHRIN 62103107 1
20 ACETAMINOPHEN TAB 500 MG 62456102 1
not sure if it is what you asked - but if you are looking for DISTINCT COUNT - go with below:
#standardSQL
SELECT
RXNAME AS Name,
COUNT(DISTINCT DUPERSID) AS Number
FROM `OmniHealth.PrescriptionsMEPS`
GROUP BY 1
ORDER BY Number DESC
Try this...You are grouping on a different field than you are counting. I think you are meaning to group by RXNAME.
SELECT
RXNAME as Name,
DUPERSID as ID,
COUNT(RXNAME) as Number
FROM
`OmniHealth.PrescriptionsMEPS`
GROUP BY
ID,
RXNAME
ORDER BY
Number
I think you want:
SELECT DUPERSID as ID, COUNT(DISTINCT RXNAME) as Number
FROM `OmniHealth.PrescriptionsMEPS`
GROUP BY ID
ORDER BY Number;
This assumes that "same string" means "same value for RXNAME".

Very simple BigQuery SQL script won't return "0" for Count rows with no results

I am trying to make this very simple SQL script work:
SELECT
DATE(SEC_TO_TIMESTAMP(created_utc)) date_submission,
COUNT(*) AS num_apples_oranges_submissions
FROM
[fh-bigquery:reddit_comments.2008]
WHERE
(LOWER(body) CONTAINS ('apples')
AND LOWER(body) CONTAINS ('oranges'))
GROUP BY
date_submission
ORDER BY
date_submission
The results look like this:
1 2008-01-07 3
2 2008-01-08 1
3 2008-01-09 2
4 2008-01-10 3
5 2008-01-11 2
6 2008-01-13 2
7 2008-01-15 2
8 2008-01-16 3
As you can see, for days where there were no submissions containing both "apples" and "oranges", instead of a value of 0 being returned, the entire row is simply missing (such as on the 12th and 14th).
How can I fix this? I'm at my wits end. Thank you.
Try below, it will return all submissions days
SELECT
DATE(SEC_TO_TIMESTAMP(created_utc)) date_submission,
SUM((LOWER(body) CONTAINS ('apples') AND LOWER(body) CONTAINS ('oranges'))) AS num_apples_oranges_submissions
FROM
[fh-bigquery:reddit_comments.2008]
GROUP BY
date_submission
ORDER BY
date_submission

SQL complex grouping "in column"

I have a table with 3 columns (sorted by the first two):
letter
number (sorted for each letter)
difference between current number and previous number of the same letter
I'd like to calculate (with vanlla SQL) a fourth new column RESULT to group these data when the third column (difference of number between contiguos record; i.e #2 --> 4 = 5-1) is greater than 30 marking all the records of this interval with letter-number of the first record (i.e A1 for #1,#2,#3).
Since the difference between contiguos numbers makes sense just for records with the same letter, for the first record of a new letter, the value of differnce is 31 (meaning that it's a new group; i.e. #6).
Here is what I'd like to get as result:
# Letter Number Difference RESULT (new column)
1 A 1 1 A1
2 A 5 4 A1
3 A 7 2 A1
4 A 40 33 A40 (*)
5 A 43 3 A40
6 B 1 31 B1 (*)
7 B 25 24 B1
8 B 27 2 B1
9 B 70 43 B70 (*)
10 B 75 5 B70
Now I can only find the "breaking values" (*) with this query where they get a value of 1:
select letter
,number
,cast(difference/30 as int) break
from table
where cast(difference/30 as int) = 1
Even though I'm able to find these breaking values I can't finish my task.
Can anyone help me finding a way to obtain the column RESULT?
Thanks in advance
FF
As I understand you need to construct the last result column. You can use concat to do that:
SELECT letter
,number
,concat(letter, cast(difference/30 as int)) result
FROM table
HAVING result = 'A1'
after some exercise and a little help from a friend of mine, I've found a possible solution to my sql prolblem.
The only requirment for the solution is that my first record must have a value of 31 in Difference field (since I need "breaks" when Difference > 30 than the previous record).
Here is the query to get the column RESULT I needed:
select alls.letter
,alls.number
,ints.letter||ints.number as result
from competition.lag alls
,(select letter
,number
,difference
,result
from (select letter
,number
,difference
,case when difference>30 then 1 else 2 end as result
from competition.lag
) temp
where result = 1
) ints
where ints.letter=alls.letter
and alls.number>=ints.number
and alls.number-30<=ints.number

Counting instances of an entry in one table from another table in SQL

I have two sql tables site and sms
The structure is ROUGHLY like
SENEGAL_SITE:
siteID Lon Lat
1 11.232 12.32
2 12.232 12.42
3 11.232 12.62
4 11.232 11.42
ATA_SMS_Apr:
out_going_site_id inSite no_sms
4 1 65
2 4 21
3 4 54
i want to query out a result somthing like this
site id SMS_Site_count
1 5
2 3
3 1
So basically I want to count the number of sms through each site tower
The query which I used to do this is this
select * ,
count((select *
from ATA_SMS_Apr a
where s.site_id=a.out_going_site_id))
from SENEGAL_SITE s
Doing this I get a error as Cannot perform an aggregate function on an expression containing an aggregate or a subquery.
you need a correlated subquery to get the sms count from ATA_SMS_Apr table for each site id
The count is moved inside the subquery from outside.
select * ,
(select count(*)
from ATA_SMS_Apr a
where s.site_id=a.out_going_site_id) as SMS_Site_Count
from SENEGAL_SITE s