Sort public trigram data in BigQuery - sql

What I'd like to do is recreate the bigram data from the publicly available trigram data on BigQuery, and trim the data down along the way. It's tricky because each row seems to hold a list of data; for example, cell.value is a column that contains all the years, and it can have 100 or so elements, all within a single row.
The columns I'd like are something like this:
ngram, first, second, third, cell.match_count*modified
where the modified last column is the sum of all match counts from 2000-2008 (ignoring all the older data). I suspect this would greatly reduce the size of the file (along with a few other tweaks).
The code I have so far is below. I have to run two separate queries for this, saving the result of the first into [one_syllable.test] before running the second:
SELECT ngram, cell.value, cell.match_count
FROM [publicdata:samples.trigrams]
WHERE ngram = "I said this"
AND cell.value in ("2000","2001","2002","2003","2004","2005","2006","2007","2008")

SELECT ngram, SUM(cell.match_count) as total
FROM [one_syllable.test]
GROUP BY ngram
The result is 2 columns with 1 row of data: I said this, 1181
But I'd like to get this for every ngram before I do some more trimming.
How can I combine the queries so it's all done at once, and also return the columns first, second, and third?
Thanks!
PS I've tried
SELECT ngram, cell.value, cell.match_count
FROM [publicdata:samples.trigrams]
WHERE cell.value in ("2000","2001","2002","2003","2004","2005","2006","2007","2008")
But I get an error "response too large to return"...

The error "response too large to return" means that you will have to write the results to a destination table, with "Allow Large Results" checked. BigQuery won't return more than 128MB directly without using a destination table.
You should be able to generate the table you want using some aggregation functions. Try "GROUP EACH BY ngram" to aggregate in parallel and use the FIRST function to pick a single value from the first, second and third columns. It would look something like this:
SELECT ngram, FIRST(first), FIRST(second), FIRST(third), SUM(cell.match_count)
FROM [publicdata:samples.trigrams]
WHERE cell.value in ("2000","2001","2002","2003","2004","2005","2006","2007","2008")
GROUP EACH BY ngram;
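If you run this through the bq command-line tool rather than the web UI, the destination table and large-results setting can be passed as flags; something along these lines should work (the destination dataset and table names here are just placeholders):
bq query --destination_table=one_syllable.trigram_sums --allow_large_results \
  'SELECT ngram, FIRST(first), FIRST(second), FIRST(third), SUM(cell.match_count)
   FROM [publicdata:samples.trigrams]
   WHERE cell.value IN ("2000","2001","2002","2003","2004","2005","2006","2007","2008")
   GROUP EACH BY ngram'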

Google BigQuery's public trigrams dataset now exposes cell as an array (when queried with standard SQL), so the original answer needs to be modified to flatten that array using the UNNEST function. Modified sample SQL code is below.
SELECT t1.ngram, t1.first, t1.second, t1.third, SUM(c.match_count)
FROM `bigquery-public-data.samples.trigrams` t1, UNNEST(cell) AS c
WHERE EXISTS (
  SELECT 1
  FROM UNNEST(c.value) AS v
  WHERE v IN ('2000','2001','2002','2003','2004','2005','2006','2007','2008')
)
GROUP BY 1, 2, 3, 4;
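As a quick sanity check, restricting the same query to the single ngram from the question should reproduce the "I said this, 1181" result obtained with the two-query approach above (assuming the schema is as described):
SELECT t1.ngram, SUM(c.match_count) AS total
FROM `bigquery-public-data.samples.trigrams` t1, UNNEST(cell) AS c
WHERE t1.ngram = 'I said this'
  AND EXISTS (
    SELECT 1
    FROM UNNEST(c.value) AS v
    WHERE v IN ('2000','2001','2002','2003','2004','2005','2006','2007','2008')
  )
GROUP BY 1;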

Related

Pentaho Adding summary rows

Any idea how to summarize data in a Pentaho transformation and then insert the summary row directly under the group being summarized?
I can use a Group By step and get a summarised result stream having one row per key field, but what I want is each sorted group written to the output and the summary row inserted underneath, thus preserving the input.
In the Group By, you can do 'Include all Rows', but this just appends the summary fields to the end of each existing row. It does not create new summary rows.
Thanks in advance
To get the summary rows to appear under the group-by blocks you have to use some tricks, such as introducing a numeric "order" field, setting its value to 1 for the original data and 2 for the subtotal rows.
Also, in the group-by/subtotals stream I am generating a sum field, say "subtotal". Make sure to also include this as a blank field in your regular stream, or else the metadata of the two streams will diverge and the final merge will not work.
Here is the best explanation I have found for this pattern:
https://www.packtpub.com/books/content/pentaho-data-integration-4-working-complex-data-flows
You will need to copy the rows to a different stream, and then merge or join them back in, so that the summary becomes a separate row.

BigQuery - duplicate rows in query with scoped aggregation

I'm trying to run a query that uses group_concat function with a scoped aggregation.
The following query returns 175 rows, all with the same values. The duplication seems to be a result of cell.value having 175 elements.
SELECT
ngram,
group_concat(cell.sample.id) within record con
FROM [publicdata:samples.trigrams]
where ngram = '! ! That'
When adding a new column to the query above (a count with scoped aggregation), the result is one row, as expected. The cnt column shows the value 175.
SELECT
ngram,
count(cell.value) within record cnt,
group_concat(cell.sample.id) within record con
FROM [publicdata:samples.trigrams]
where ngram = '! ! That'
It seems that the row duplication occurs because there are no values to group (all nulls). If I change the group_concat to:
group_concat(if(cell.sample.id is null,'',cell.sample.id)) within record con
then once again there is only one row.
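That is, with the workaround applied, the full query would look something like this and return a single row:
SELECT
ngram,
group_concat(if(cell.sample.id is null,'',cell.sample.id)) within record con
FROM [publicdata:samples.trigrams]
where ngram = '! ! That'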
What is the reason for this?
How can this be avoided without resorting to group by on all columns (which will also require a subquery since it's impossible to combine group by and scoped aggregation functions)?
This is a bug in the query engine; it should only return 1 row. We're tracking it internally, and should hopefully have a fix soon.

Return a SQL query where field doesn't contain specific text

I will set up a quick scenario and then ask my question. Let's say I have a DB for my warehouse with the following fields: StorageBinID, StorageReceivedDT, StorageItem, and StorageLocation.
Any single storage bin could have multiple records because of the multiple items in it. What I am trying to do is create a query that only returns storage bins that don't contain a certain item, but without the rest of their contents. For example, let's say I have 5000 storage bins in my warehouse and I know that there are a handful of bins that do not have "ItemX" listed in the StorageItem field. I would like to return that short list of StorageBinIDs without getting a full list of all of the bins without ItemX and their full contents. (I think that rules out IN, LIKE, and CONTAIN and their NOTs.)
My workaround right now is running two queries, usually scoped to a StorageReceivedDT: the first returns the bins received on that date, and the second the bins containing ItemX. I then import both .csv files into Excel and use an ISNA(MATCH) formula to compare the two columns.
Is this possible through a query? Thank you very much in advance for any advice.
You can do this as an aggregation query, with a having clause. Just count the number of rows where "ItemX" appears in each bin, and choose the bins where the count is 0:
select StorageBinID
from table t
group by StorageBinID
having sum(case when StorageItem = 'ItemX' then 1 else 0 end) = 0;
Note that this only returns bins that have some items in them. If you have completely empty bins, they will not appear in the results. You do not provide enough information to handle that situation (although I can speculate that you have a StorageBins table that would be needed to solve this problem).
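If you do have such a StorageBins table, a left-join version covers the empty bins as well; a sketch, assuming the bins table is literally named StorageBins and the detail table StorageContents:
select b.StorageBinID
from StorageBins b
left join StorageContents c on c.StorageBinID = b.StorageBinID
group by b.StorageBinID
having sum(case when c.StorageItem = 'ItemX' then 1 else 0 end) = 0;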
What flavour of SQL do you use?
From the info that you gave, you could use:
select distinct StorageBinID
from table_name
where StorageBinID not in (
    select StorageBinID
    from table_name
    where StorageItem like '%ItemX%'
)
You'll have to replace table_name with the name of your table.
If you want only exact matches (the StorageItem to be exactly "ItemX"), you should replace the condition
where StorageItem like '%ItemX%'
with
where StorageItem = 'ItemX'
Another option (should be faster):
select StorageBinID
from table_name
minus
select StorageBinID
from table_name
where StorageItem like '%ItemX%'
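Note that MINUS is Oracle syntax; on SQL Server (and in standard SQL, e.g. PostgreSQL) the equivalent set operator is EXCEPT:
select StorageBinID
from table_name
except
select StorageBinID
from table_name
where StorageItem like '%ItemX%'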

String ending in range of numbers

I have a column with data of the following structure:
aaa5644988
aaa4898494
aaa5642185
aaa5482312
aaa4648848
I have a range that can be anything, like 100-30000 for example. I want to get all values that end in numbers within that range.
I tried
like '%[100-30000]'
but this doesn't work apparently.
I have seen a lot of similar questions, but none of them solved my problem.
Edit: I'm using SQL Server 2008.
Example:
Value
aaa45645695
aaa28568720
aaa65818450
8789212
6566700
For the range 600-1200, I want to retrieve rows 1, 2 and 5, because they end with a number in that range.
In SQL Server, LIKE only supports simple wildcards: % and _, plus single-character sets such as [a-z]. A bracket expression matches exactly one character, so [100-30000] is effectively just the character set {0,1,2,3}, and the pattern matches any value ending in one of those digits. That's why like '%[100-30000]' doesn't do what you want.
Depending on your use case, there are two possible solutions to this problem:
If you only need to run this query a few times (and don't care how long it takes), or the dataset is not very big, you can select all the data from this column and then do the filtering in another programming language.
Taking Ruby as an example, you could do:
# keep only the trailing digits of each value, then filter to the range
column_data = @connection.execute("select your_column from your_table")
result = column_data.map { |x| x.gsub(/^.*[^\d]/, '').to_i }.select { |x| x > 100 && x < 30000 }
If you need to run this query regularly, I'd suggest adding a new column to the table that holds only the numeric part of the current column, which will give much better query performance:
SELECT *
FROM your_table
WHERE number_column BETWEEN 100 AND 30000
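Since you're on SQL Server 2008, one possible way to populate such a column (here hypothetically named number_column, and interpreting "the number" as the full run of trailing digits, the same way the Ruby snippet above does) would be something like:
alter table your_table add number_column bigint;

update your_table
set number_column = case
    -- only rows that actually end in a digit get a value
    when your_column like '%[0-9]'
    -- patindex on the reversed string finds how many trailing digits there are
    then cast(right(your_column,
         patindex('%[^0-9]%', reverse(your_column) + 'x') - 1) as bigint)
    else null
end;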

MS SQL 2000 - How to efficiently walk through a set of previous records and process them in groups. Large table

I'd like to ask about one thing. I have a table in the DB. It has 2 columns and looks like this:
Name   bilance
Jane   +3
Jane   -5
Jane    0
Jane   -8
Jane   -2
Paul   -1
Paul    2
Paul    9
Paul    1
...
I have to walk through this table, and when I find a record with a different "name" than the previous row, I process all the rows with the previous "name". (When I step onto the first Paul row, I process all the Jane rows.)
The processing goes like this:
Now I work only with the Jane records and walk through them one by one. On each record I stop and compare it with all the previous Jane rows, one by one.
The task is to summarize the "bilance" column (within the scope of the current person) where the values have different signs.
Summary:
I loop through this table on 3 levels (nested loops):
1st level = search for changes of the "name" column
2nd level = if a change was found, get all rows with the previous "name" and walk through them
3rd level = on each row, stop and walk through all previous rows with the current "name"
Can this be solved only using CURSOR and FETCHING, or is there some smoother solution?
My real table has 30,000 rows and 1,500 people, and if I do the logic in PHP it takes many minutes and then times out. So I would like to rewrite it in MS SQL 2000 (no other DB is allowed). Are cursors a fast solution, or is it better to use something else?
Thank you for your opinions.
UPDATE:
There are lots of questions about my "summarization". The problem is a little more difficult than I explained; I simplified it just to describe my algorithm.
Each row of my table contains many more columns. The most important is the month; that's why there are multiple rows for each person, one per month.
"Bilances" are the workers' overtime hours and arrear hours. I need to summarize the + and - bilances to neutralize them against values from previous months; I want to end up with as many zeroes as possible. The whole table must stay as it is, only the bilances change (to zero where possible).
Example:
Row (Jane -5) will be netted against row (Jane +3): instead of +3 I will get 0, and instead of -5 I will get -2, because the -5 was used to reduce the +3.
The next row (Jane 0) won't be affected.
The next row (Jane -8) cannot be used, because all the previous bilances are negative.
etc.
You can sum all the values per name using a single SQL statement:
select
name,
sum(bilance) as bilance_sum
from
my_table
group by
name
order by
name
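With the sample data above this would return a single net figure per person:
Jane: 3 + (-5) + 0 + (-8) + (-2) = -12
Paul: (-1) + 2 + 9 + 1 = 11
That collapses each person to one row, so it doesn't by itself do the month-by-month netting toward zero described in the update, but it gives the per-person totals in one pass.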
On the face of it, it sounds like this should do what you want:
select Name, sum(bilance)
from table
group by Name
order by Name
If not, you might need to elaborate on how the Names are sorted and what you mean by "summarize".
I'm not sure what you mean by this line: "The task is to summarize the "bilance" column (within the scope of the current person) where the values have different signs".
But it may be possible to use a group by query to get a lot of what you need.
select name,
       case when bilance < 0 then 'negative' when bilance >= 0 then 'positive' end as sign,
       count(*)
from my_table
group by name,
         case when bilance < 0 then 'negative' when bilance >= 0 then 'positive' end
That might not be perfect syntax for the case statement, but it should get you really close.
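For the sample data above, this would give something like:
name  sign      count(*)
Jane  negative  3
Jane  positive  2
Paul  negative  1
Paul  positive  3
(zero counts as 'positive' here), which shows how many rows of each sign each person has, though it still doesn't perform the month-by-month netting.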