Structuring BigQuery with large array of data as input - google-bigquery

I am interested in obtaining the most frequent word associations for a particular word using BigQuery's trigram data. For example, in Google's Ngram Viewer I can enter great *, which gives me the words that most frequently follow "great", such as "great deal", then "great and" and "great many". My goal is to do this for a large list of words, so that I can query word1 * all the way to word10000 *.
Following the discussion on this SO answer, I was led to BigQuery's publicly available trigram data. What I can't figure out is how to use this service with an array of words as input, either as a file or by pasting them in. Any assistance is much appreciated - thanks.

Here is how you would find the 10 most frequent words to follow "great":
SELECT second, SUM(cell.page_count) total
FROM [publicdata:samples.trigrams]
WHERE first = "great"
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
This results in:
second        total
---------------------
deal          3048832
and           1689911
,             1576341
a             1019511
number        984993
many          875974
importance    805215
part          739409
.             700694
as            628978
If you want to limit the results to specific years, say between 1820 and 1840, you can also restrict on cell.value (which is the year of publication):
SELECT second, SUM(cell.page_count) total
FROM [publicdata:samples.trigrams]
WHERE first = "great" AND cell.value BETWEEN '1820' AND '1840'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
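For the original goal of many input words, one possible approach is to load the word list into its own table and join it against the trigram data instead of filtering on a single literal. The sketch below illustrates the idea in Python, using an in-memory SQLite database as a stand-in for BigQuery. The table names trigrams and words, the column names, and the sample counts are all made up for illustration; the real BigQuery schema and syntax differ.

```python
import sqlite3
from itertools import groupby

# In-memory stand-in for the trigram data (hypothetical schema and numbers).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trigrams (first_word TEXT, second_word TEXT, page_count INTEGER)"
)
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO trigrams VALUES (?, ?, ?)",
    [
        ("great", "deal", 30), ("great", "and", 16), ("great", "deal", 5),
        ("great", "many", 8), ("small", "number", 12), ("small", "and", 7),
        ("other", "hand", 99),  # not in the word list, so the join drops it
    ],
)
# The input list lives in its own table; it could be loaded from a file.
conn.executemany("INSERT INTO words VALUES (?)", [("great",), ("small",)])

# One query for all input words: join, aggregate, sort by word and total.
rows = conn.execute(
    """
    SELECT t.first_word, t.second_word, SUM(t.page_count) AS total
    FROM trigrams AS t
    JOIN words AS w ON w.word = t.first_word
    GROUP BY t.first_word, t.second_word
    ORDER BY t.first_word, total DESC
    """
).fetchall()

# Keep only the TOP_N most frequent followers of each input word.
TOP_N = 2
top_followers = {
    first: [(second, total) for _, second, total in list(group)[:TOP_N]]
    for first, group in groupby(rows, key=lambda r: r[0])
}
print(top_followers)
```

The point of the join is that the list of 10000 words becomes data rather than part of the query text, so the query itself stays the same size no matter how long the list is.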

Related

Get filtered row count using dm_db_partition_stats

I'm using paging in my app, but I've noticed that paging has become very slow, and the line below is the culprit:
SELECT COUNT (*) FROM MyTable
On my table, which only has 9 million rows, it takes 43 seconds to return the row count. Another article states that returning the row count for 1.4 billion rows takes over 5 minutes. This obviously cannot be used with paging as it is far too slow, and the only reason I need the row count is to calculate the number of available pages.
After a bit of research I found out that I get the row count pretty much instantly (and accurately) using the following:
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTable')
AND (index_id=0 or index_id=1)
The above returns the count for the entire table, which is fine if no filters are applied. But how do I handle this if I need to apply filters such as a date range and/or a status?
For example, what is the row count for MyTable when the DateTime field is between 2013-04-05 and 2013-04-06 and status='warning'?
Thanks.
UPDATE-1
In case I wasn't clear, I require the total number of rows available so that I can determine the number of pages required that will match my query when using 'paging' feature. For example, if a page returns 20 records and my total number of records matching my query is 235, I know I'll need to display 12 buttons below my grid.
01 - (row 1 to 20) - 20 rows displayed in grid.
02 - (row 21 to 40) - 20 rows displayed in grid.
...
11 - (row 201 to 220) - 20 rows displayed in grid.
12 - (row 221 to 235) - 15 rows displayed in grid.
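The page arithmetic described above can be sketched as follows. This is a minimal illustration in Python, not part of the app; the function names page_count and page_bounds are hypothetical.

```python
def page_count(total_rows: int, page_size: int) -> int:
    """Number of pages needed to show total_rows (integer ceiling division)."""
    return (total_rows + page_size - 1) // page_size


def page_bounds(page_no: int, page_size: int, total_rows: int) -> tuple[int, int]:
    """1-based first and last row shown on page_no (pages are 1-based too)."""
    first = (page_no - 1) * page_size + 1
    last = min(page_no * page_size, total_rows)
    return first, last


print(page_count(235, 20))        # number of page buttons
print(page_bounds(12, 20, 235))   # rows on the last, partial page
```

Note that consecutive pages must not share a boundary row: each page starts one row after the previous page ends.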
There will be additional logic to handle a large number of pages, but that's a UI issue and out of scope for this topic.
My problem with using "Select count(*) from MyTable" is that it takes 40+ seconds on 9 million records (though it doesn't anymore, and I need to find out why!), but with this method I can add the same filter as my query to determine the row count. For example,
SELECT COUNT(*) FROM [MyTable]
WHERE [DateTime] BETWEEN '2018-04-05' AND '2018-04-06' AND
[Status] = 'Warning'
Once I have determined the page count, I would then run the same query, but with the fields instead of count(*) and with CurrentPageNo and PageSize, in order to filter my results by page number using the row ids and navigate to a specific page if needed.
SELECT RowId, DateTime, Status, Message FROM [MyTable]
WHERE [DateTime] BETWEEN '2018-04-05' AND '2018-04-06' AND
[Status] = 'Warning' AND
RowId BETWEEN (CurrentPageNo * PageSize) AND ((CurrentPageNo + 1) * PageSize)
Now, if I use the other mentioned method to get the row count i.e.
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTable')
AND (index_id=0 or index_id=1)
It returns the count instantly but how do I filter this so that I can include the same filters as if I was using the SELECT COUNT(*) method, so I could end up with something like:
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTable') AND
(index_id=0 or index_id=1) AND
([DateTime] BETWEEN '2018-04-05' AND '2018-04-06') AND
([Status] = 'Warning')
The above clearly won't work, as I'm querying dm_db_partition_stats, but I would like to know if I can somehow perform a join or something similar to get the total number of rows instantly, filtered rather than for the entire table.
Thanks.
Have you ever asked for directions to Alpha Centauri? No? Well, the answer is: you can't get there from here.
Adding indexes, re-orgs/re-builds, updating stats will only get you so far. You should consider changing your approach.
sp_spaceused typically returns the record count instantly. You may be able to use it; however, depending on what you are using the count for (you haven't given us quite enough information), it might not be adequate.
I am not sure if you are trying to use this count as a means to short circuit a larger operation or how you are using the count in your application. When you start to highlight 1.4 billion records and you're looking for a window in said set, it sounds like you might be a candidate for partitioned tables.
Partitioning lets you split the data into several smaller tables, typically separated by date (years or months), that act as a single table. When you then give a date range on 1.4+ billion records, SQL Server only has to look at the relevant partitions and can meet performance expectations. Table partitioning does depend on the SQL Server edition, but there are also partitioned views.
Kimberly Tripp has a blog and some videos out there, and Kendra Little also has some good content on how partitions are used and how to set them up. This would be a design change. It is a bit complex and not something you would want to implement on a whim.
Here is a link to Kimberly's Blog: https://www.sqlskills.com/blogs/kimberly/sqlskills-sql101-partitioning/
Dev banter:
Also, I hear you blaming SQL, are you using entity framework by chance?

oracle text definescore with accum and Query rewriting

I am using Oracle Text to search a corpus of sentences.
I want the score to count discrete occurrences only.
Example: my query is ( dog cat table ).
If it finds the term "dog" it must count 1, even if the sentence contains "dog" more than once. If it finds "dog" and "cat" it must count 2, and so on.
I used this query, but it gives me 51 if it finds the two terms. I need to accumulate the discrete occurrences, so I want to override the behaviour of Oracle Text's scoring algorithm.
select /*+ FIRST_ROWS(1)*/ sentence_id
,score(1) as sc
, isn
,sentence_length
from plag_docsentences
where contains(PROCESSED_TEXT,'DEFINESCORE(dog, DISCRETE*.01)
,DEFINESCORE(cat, DISCRETE*.01)'
,1)>0
order by score(1) desc
OK, I solved the issue.
Suppose 2 of the 3 query terms are found: the score will be 67,
that is, 2/3 expressed as a percentage and rounded. This is the default behavior of the Oracle Text scoring algorithm.
So I derived an equation for the number of occurrences (i.e. the number of query terms found in the corpus sentence)
as follows:
x / query_length = score / 100
then
x = query_length * score / 100
This gives the number of matching words between the query and the corpus sentence.
I hope this helps researchers in IR.
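As a small worked example of the equation above, here is a sketch in Python. The function name matched_terms is hypothetical, and rounding to the nearest integer is an assumption made here because the score is itself a rounded percentage.

```python
def matched_terms(score: int, query_length: int) -> int:
    """Recover x from x / query_length = score / 100.

    Rounds to the nearest integer (an assumption), since a score such as 67
    is a rounded percentage and 3 * 67 / 100 is 2.01 rather than exactly 2.
    """
    return round(query_length * score / 100)


print(matched_terms(67, 3))   # two of three query terms found
print(matched_terms(100, 3))  # all three query terms found
```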

Oracle 'Contains' / 'Group' function return incorrect value

I have this query:
SELECT last_name, SCORE(1)
FROM Employees
WHERE CONTAINS(last_name, '%sul%', 1) > 0
It produces output below:
The question is:
Why does SCORE(1) produce 9? As I recall, the CONTAINS function returns the number of occurrences of search_string (in this case '%sul%').
I expect the output should be:
Sullivan 1
Sully 1
But when I try this syntax:
SELECT last_name, SCORE(1)
FROM Employees
WHERE CONTAINS(last_name, 'sul', 1) >0;
It returns 0 rows selected.
And can someone please explain to me what the third parameter is for?
Thanks in advance :)
Your second query returns no rows because it searches for the word sul. CONTAINS does not do pattern matching unless you tell it to; it searches for the words you specify in the second parameter. To match patterns you have to use wildcards, as you did in your first example.
Now, the third parameter of CONTAINS: it is a label, used only to tie the CONTAINS clause to a SCORE operator. You should supply it when you use SCORE in your SELECT list. Its importance is clearer when there are multiple SCORE operators.
Quoting directly from the documentation:
label
Specify a number to identify the score produced by the query.
Use this number to identify the CONTAINS clause which returns this
score.
Example
Single CONTAINS
When the SCORE operator is called (for example, in a SELECT clause),
the CONTAINS clause must reference the score label value as in the
following example:
SELECT SCORE(1), title from newsindex
WHERE CONTAINS(text, 'oracle', 1) > 0 ORDER BY SCORE(1) DESC;
Multiple CONTAINS
Assume that a news database stores and indexes the title and body of
news articles separately. The following query returns all the
documents that include the words Oracle in their title and java in
their body. The articles are sorted by the scores for the first
CONTAINS (Oracle) and then by the scores for the second CONTAINS
(java).
SELECT title, body, SCORE(10), SCORE(20) FROM news WHERE CONTAINS
(news.title, 'Oracle', 10) > 0 OR CONTAINS (news.body, 'java', 20) > 0
ORDER BY SCORE(10), SCORE(20);
The Oracle Text Scoring Algorithm does not score by simply counting the number of occurrences. It uses an inverse frequency algorithm based on Salton's formula.
Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.
Think of a Google search. If you search for the term Oracle, you will not (directly) find any result that helps explain your score, so we can consider that term "noise" relative to what you are looking for. But if you search for Oracle Text Scoring Algorithm, you will find your answer in the first Google result.
As for your other questions, I think @Incognito has already answered them well.

MS Access - Query Need Assistance

I stumbled across this website and instantly fell in love. Let me be completely honest, I have little to NO knowledge of Access.. I told my manager this and he still insists that I "can figure it out" which is highly doubted. So here I am asking for help.. On to the question:
Where are the SQL code gurus? haha
I have 2 tables, "Found" & "Missing", both showing inventory adjustments for our building within the company. (Amazon)
I believe I have the process figured out but have no idea how it looks within Access..
Step 1: Group by ASIN (basically the numerical version of a barcode)
Step 2: Determine the +/- for the grouped ASINs in both lists
Step 3: Use TOP function to find the largest negative adjustments
There is a total of 3000+ records in both spreadsheets, but hopefully if I can figure out the process then the input/output wouldn't matter.
I thought maybe I needed a unique identifier? Bin(location) + ASIN(barcode) + Quantity
As you can see.. I have been thinking, organizing, and praying someone can help!
Here is a dummy example of the "Found" spreadsheet; the "Missing" spreadsheet has exactly the same format, the only difference being an "M" instead of an "F" under "Reason Code".
Hopefully this is enough information, I know it's a cluster.... thanks guys!
Date     | FC   | Application Name   | IOG ID | IOG Name  | Container Id | GL Product Group | ASIN       | Processed By | Reason Code | Quantity | Item Cost
1/5/2014 | RIC1 | FCICQACountService | 1234   | Doll Inc. | P-1-A101xxx  | Toy              | B000000001 | unknown1     | F           | -1       | 12.34
1/5/2014 | RIC1 | FCICQACountService | 1334   | Amazon    | P-1-A101xxx  | Drugstore        | B000000002 | unknown2     | F           | -1       | 10.36
1/5/2014 | RIC1 | FCICQACountService | 1432   | Amazon    | P-1-A102xxx  | Office Product   | B000000003 | unknown3     | F           | -13      | 50.50
1/5/2014 | RIC1 | FCICQACountService | 1442   | Amazon    | P-1-A102xxx  | Office Product   | B000000004 | unknown4     | F           | -2       | 223.62
1/5/2014 | RIC1 | FCICQACountService | 1337   | Hope Inc. | P-1-A102xxx  | Office Product   | B000000005 | unknown5     | F           | -1       | 100.99
I take it that by "spreadsheet", you actually mean "table". Might be a good idea to find a good primer on SQL and relational databases in general.
You've got a pretty good start, though. You've identified what you want. Note that in SQL, this is what you usually do; you think more about the result you want than you do the process of getting it. Each of your points suggests a keyword or function that will go into your query:
1) "Group by ASIN (basically the numerical version of a barcode)": You probably want to use the GROUP BY keyword.
2) "Determine the +/- for the grouped ASINs in both lists": Sounds like you want to SUM up a column here.
3) "Use TOP function to find the largest negative adjustments": Obviously, you already know you want TOP. The piece you're missing, though, is that the "largest negative" part suggests you want to use ORDER BY, and you want the smallest (largest magnitude negative) first. That will make sure that the right row is on top when it takes the top one.
So putting all that together, the only thing you need to figure out is the syntax. Your end query probably looks something like this:
SELECT TOP 1 ASIN, SUM(Quantity) AS TotalQuantity FROM Found GROUP BY ASIN ORDER BY SUM(Quantity);
This will calculate the sum of Quantity for each group of rows that has the same ASIN, and the result will be a set of rows that contain the ASIN and the total Quantity for that ASIN. Then it sorts the rows using the total quantity, with the smallest (most negative) row on top. The TOP then cuts off all the other rows. You could optionally leave out the TOP 1 if you want to see all the rows.
By the way, this SUM function is a little special. It's what we call an aggregate function. That's because it does something with a bunch of values across many rows. Not all functions are like that in SQL, but this one is.
If this isn't exactly what you're looking for, I hope it's enough to get you off the ground. Good luck.
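Since the question involves both tables, one way to extend the query above is to stack Found and Missing with UNION ALL before grouping. The sketch below tries this in Python against an in-memory SQLite database rather than Access, so it uses LIMIT where Access would use TOP; the sample rows are invented, and only the ASIN and Quantity columns are modelled.

```python
import sqlite3

# SQLite stand-in for the two Access tables (invented sample rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Found (ASIN TEXT, Quantity INTEGER)")
conn.execute("CREATE TABLE Missing (ASIN TEXT, Quantity INTEGER)")
conn.executemany(
    "INSERT INTO Found VALUES (?, ?)",
    [("B000000001", -1), ("B000000003", -13), ("B000000003", 2)],
)
conn.executemany(
    "INSERT INTO Missing VALUES (?, ?)",
    [("B000000001", -4), ("B000000002", -2), ("B000000003", -1)],
)

# Stack both tables, net the adjustments per ASIN, most negative first.
worst = conn.execute(
    """
    SELECT ASIN, SUM(Quantity) AS TotalQuantity
    FROM (
        SELECT ASIN, Quantity FROM Found
        UNION ALL
        SELECT ASIN, Quantity FROM Missing
    ) AS adjustments
    GROUP BY ASIN
    ORDER BY TotalQuantity
    LIMIT 2
    """
).fetchall()
print(worst)
```

UNION ALL is used rather than UNION so that identical adjustment rows in the two tables are all kept and summed instead of being collapsed into one.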

Vague count in sql select statements

I guess this has been asked in the site before but I can't find it.
I've seen on some sites that a vague count is shown for search results. For example, here on Stack Overflow, when you search for a question it (sometimes) says +5000 results; in Gmail, when you search by keywords, it says "hundreds"; and Google says approx. X results. Is this just a way to show the user a huge number in an easy-to-understand form, or is it actually a fast way to count results that can be used in a database (I'm learning Oracle at the moment, version 10g)? Something like "hey, if you get more than 1k results, just stop and tell me there are more than 1k".
Thanks
PS. I'm new to databases.
Usually this is just a nice way to display a number.
I don't believe there is a way to do what you are asking for in SQL - count does not have an option for counting up until some number.
I also would not assume this is coming from SQL in either gmail, or stackoverflow.
Most search engines will return a total number of matches to a search, and then let you page through results.
As for making an exact number more human readable, here is an example from Rails:
http://api.rubyonrails.org/classes/ActionView/Helpers/NumberHelper.html#method-i-number_to_human
With Oracle, you can always resort to analytical functions in order to calculate the exact number of rows about to be returned. This is an example of such a query:
SELECT inner.*, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... your own, sorted search query ...]
) inner
This will give you the total number of rows for your specific subquery. When you want to apply paging as well, you can further wrap these SQL parts as such:
SELECT outer.* FROM (
SELECT * FROM (
SELECT inner.*,ROWNUM as RNUM, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... your own, sorted search query ...]
) inner
)
WHERE ROWNUM < :max_row
) outer
WHERE outer.RNUM > :min_row
Replace min_row and max_row by meaningful values. But beware that calculating the exact number of rows can be expensive when you're not filtering using UNIQUE SCAN or relatively narrow RANGE SCAN operations on indexes. Read more about this here: Speed of paged queries in Oracle
As others have said, you can always put an absolute upper limit, such as 5000, on your query using a ROWNUM <= 5000 filter and then just indicate that there are more than 5000 results. Note that Oracle can be very good at optimising queries when you apply ROWNUM filtering. Find some info on that subject here:
http://www.dba-oracle.com/t_sql_tuning_rownum_equals_one.htm
A vague count acts as a buffer that can be displayed promptly. If the user wants to see more results, they can request more.
It's a performance measure: after displaying the first results, sites like Google keep searching for more.
I don't know how fast this will run, but you can try:
SELECT NULL FROM your_tables WHERE your_condition AND ROWNUM <= 1001
If the result contains 1001 rows, then the total number of records is greater than 1000.
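The same "stop counting after the cap" idea can be tried out with a small Python sketch against an in-memory SQLite database, where LIMIT plays the role of Oracle's ROWNUM filter. The table results, its columns, and the helper names capped_count and describe are all hypothetical.

```python
import sqlite3

# Hypothetical table with 2500 matching rows and 40 non-matching ones.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO results (status) VALUES (?)",
    [("warning",)] * 2500 + [("ok",)] * 40,
)


def capped_count(conn: sqlite3.Connection, status: str, cap: int = 1000) -> int:
    """Count matching rows, but never look at more than cap + 1 of them."""
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM (SELECT 1 FROM results WHERE status = ? LIMIT ?)",
        (status, cap + 1),
    ).fetchone()
    return n


def describe(n: int, cap: int = 1000) -> str:
    """Turn a capped count into the vague label shown to the user."""
    return f"more than {cap}" if n > cap else str(n)


print(describe(capped_count(conn, "warning")))
print(describe(capped_count(conn, "ok")))
```

Asking for cap + 1 rows rather than cap is what lets the caller tell "exactly 1000" apart from "more than 1000".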
this question gives some pretty good information
When you do an SQL query you can set a
LIMIT 0, 100
for example, and you will only get the first hundred rows, so you can then tell the user that there are 100+ results for their request.
As for Google, I couldn't say whether they really know that there are more than 27'000'000'000 results for a request, but I believe they do. Some common requests have their results stored, with updates done in the background.