Pivot Data in a BigQuery Standard SQL View Definition


I'm not sure whether this is possible with the new BigQuery scripting capabilities, UDFs, or array/string functions (or anything else!), but I simply can't figure it out.
I'm trying to write the SQL for a view in BigQuery that dynamically defines columns based on query results, similar to a pivot table in a spreadsheet/BI tool (or pivot in pandas). I can do this externally in Python or hard-code it using CASE statements, but I'm sure that a SQL solution would be incredibly useful to a huge number of people.
Essentially I'm trying to write a query which would transform a table like this:
year | name | number
-----------------------
1963 | Michael | 9246
1961 | Michael | 9055
1958 | Michael | 9203
1957 | Michael | 9116
1953 | Robert | 9061
1952 | Robert | 9205
1951 | Robert | 9054
1948 | Robert | 9015
1947 | Robert | 10025
1947 | John | 9634
1946 | Robert | 9295
----------------------
SQL to generate initial example table:
SELECT year, name, number
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE number > 9000
ORDER BY year DESC
Into a table with the following structure:
year | John | Michael | Robert
---------------------------------
1946 | | | 9,295
1947 | 9,634 | | 10,025
1948 | | | 9,015
...
This then needs to be connected to downstream tools, without requiring maintenance when the data changes. I know that this is not always a great idea and that tidy form data is more universally useful, but there are still some scenarios where this behaviour is desirable.
I have seen a few solutions on here, but they all seem to involve string generation and then manually pasting the query... I can do this via the BigQuery API but am desperate to find a dynamic solution using nothing but SQL so I don't have to maintain an external function.
Thanks in advance for any pointers!
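For reference, the string-generation approach mentioned above can be sketched as follows. This is only an illustration of the external, generated-SQL route, not a pure-SQL BigQuery view: it uses an in-memory SQLite database as a stand-in, and the table name usa_names and the helper build_pivot_sql are hypothetical names made up for this example.

```python
import sqlite3

# Sample rows copied from the question's example table.
rows = [
    (1963, "Michael", 9246), (1961, "Michael", 9055), (1958, "Michael", 9203),
    (1957, "Michael", 9116), (1953, "Robert", 9061), (1952, "Robert", 9205),
    (1951, "Robert", 9054), (1948, "Robert", 9015), (1947, "Robert", 10025),
    (1947, "John", 9634), (1946, "Robert", 9295),
]


def build_pivot_sql(names):
    """Build one SUM(CASE ...) column per name (hypothetical helper).

    The name values are bound as query parameters; only the column
    aliases are spliced into the SQL text, with double quotes escaped.
    """
    columns = ", ".join(
        'SUM(CASE WHEN name = ? THEN number END) AS "{}"'.format(n.replace('"', '""'))
        for n in names
    )
    return f"SELECT year, {columns} FROM usa_names GROUP BY year ORDER BY year"


# SQLite stands in for BigQuery here (an assumption made for runnability).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usa_names (year INTEGER, name TEXT, number INTEGER)")
conn.executemany("INSERT INTO usa_names VALUES (?, ?, ?)", rows)

# Step 1: discover the pivot columns from the data itself.
names = [r[0] for r in conn.execute("SELECT DISTINCT name FROM usa_names ORDER BY name")]

# Step 2: generate and run the conditional-aggregation query.
pivot = conn.execute(build_pivot_sql(names), names).fetchall()
for row in pivot:
    print(row)
```

The column list has to be regenerated whenever a new name appears in the data, which is exactly the maintenance burden the question is trying to avoid.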

Related

Updating to most current version of customer information

I apologize in advance if there is a similar question out there already. I haven't had any luck finding anything. Basically, I run a small business and have four separate locations. With the current POS system we use, customer data is added/updated independently at each location. This means that each customer has up to four customer IDs if they've visited all four locations. The information, of course, will mostly be constant across all four locations, except in the following two cases: (1) the customer is new, or (2) the customer updates their information at a particular location during their visit. I have already figured out how to handle new customers. However...
I'm trying to write a query in SQL Server that automates the following for existing customers:
1. Imports and appends each of the four .CSV files (customer lists) I export from my POS system.
I've already accomplished this step with a simple BULK INSERT
2. Compare this list to what is currently in the customer data and:
a. For customers WITH four records in the newly imported list, I want to update all four records in the existing SQL table to the most current version of the customer's information. See example below:
Month 1: DATA IN SQL TABLE
Location | Cust_ID | Name | Phone Number
----------------------------------------------
0000001 | 12345A | David | 7025551234
0000002 | 12345B | David | 7025551234
0000003 | 12345C | David | 7025551234
0000004 | 12345D | David | 7025551234
Month 2: DATA TO COMPARE TO SQL TABLE
Location | Cust_ID | Name | Phone Number
---------------------------------------------
0000001 | 12345A | David | 7025551234
0000002 | 12345B | David | 7025559999
0000003 | 12345C | David | 7025551234
0000004 | 12345D | David | 7025551234
DESIRED RESULT
Location | Cust_ID | Name | Phone Number
----------------------------------------------
0000001 | 12345A | David | 7025559999
0000002 | 12345B | David | 7025559999
0000003 | 12345C | David | 7025559999
0000004 | 12345D | David | 7025559999
You guys might be thinking, "this guy just needs a new POS software provider." You're probably right, and I will switch once my wife is on board with the idea of a big change. In the meantime, this is what I'm stuck with. Thank you guys in advance for your help.
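One possible reading of step 2a can be sketched like this. It is a rough illustration under two loud assumptions that the question does not confirm: that a customer's four IDs share a common base (the Cust_ID without its trailing location letter), and that the "most current" version is whichever imported record differs from what is already stored. The sketch uses SQLite from Python as a stand-in for SQL Server, and the table names customers and staging are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("customers", "staging"):
    conn.execute(
        f"CREATE TABLE {table} (location TEXT, cust_id TEXT PRIMARY KEY, name TEXT, phone TEXT)"
    )

# Month 1: data already in the SQL table (from the question).
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", [
    ("0000001", "12345A", "David", "7025551234"),
    ("0000002", "12345B", "David", "7025551234"),
    ("0000003", "12345C", "David", "7025551234"),
    ("0000004", "12345D", "David", "7025551234"),
])

# Month 2: freshly imported list (from the question).
conn.executemany("INSERT INTO staging VALUES (?, ?, ?, ?)", [
    ("0000001", "12345A", "David", "7025551234"),
    ("0000002", "12345B", "David", "7025559999"),
    ("0000003", "12345C", "David", "7025551234"),
    ("0000004", "12345D", "David", "7025551234"),
])

# ASSUMPTION: dropping the last character of cust_id gives a key shared by
# all of one customer's records. Find imported rows that differ from the
# stored row with the same cust_id.
changed = conn.execute("""
    SELECT substr(s.cust_id, 1, length(s.cust_id) - 1), s.name, s.phone
    FROM staging AS s
    JOIN customers AS c ON c.cust_id = s.cust_id
    WHERE s.name <> c.name OR s.phone <> c.phone
""").fetchall()

# Propagate each changed record to every stored row with the same base key.
for base_id, name, phone in changed:
    conn.execute(
        "UPDATE customers SET name = ?, phone = ? "
        "WHERE substr(cust_id, 1, length(cust_id) - 1) = ?",
        (name, phone, base_id),
    )

phones = [r[0] for r in conn.execute("SELECT phone FROM customers ORDER BY cust_id")]
print(phones)
```

If two locations change the same customer in the same month, this sketch simply lets the last processed change win; deciding which one is really newer would need something like a last-modified timestamp from the POS export.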

Flexibility of scenarios in Gherkin.

I am looking for a mechanism that allows building more flexible scenarios.
For example, consider these two very similar scenarios that test for the existence of records in a database:
Scenario Outline: Testing query with 1 attribute with these 2 record in and another 2 out of result
Given I'm connected to <db> database
When I select <query> from database
Then Result should contain fields:
| <row> |
| <yes1> |
| <yes2> |
And Result should not contain fields:
| <row> |
| <no1> |
| <no2> |
Examples:
| db | row | yes1 | yes2 | no1 | no2 | query |
| 1 | model | 1013 | 1006 | 1012 | 1007 | "SELECT model FROM pc WHERE speed >= 3.0;" |
| 1 | maker | E | A | C | H | "SELECT maker FROM product NATURAL JOIN laptop WHERE hd >= 100;" |
Scenario Outline: Testing query with 2 attributes with these 2 record in and another 2 out of result
Given I'm connected to <db> database
When I select <query> from database
Then Result should contain fields:
| <rowA> | <rowB> |
| <yes1A> | <yes1B> |
| <yes2A> | <yes2B> |
And Result should not contain fields:
| <rowA> | <rowB> |
| <no1A> | <no1B> |
| <no2A> | <no2B> |
Examples:
| db | rowA | rowB | yes1A | yes1B | yes2A | yes2B | no1A | no1B | no2A | no2B | query |
| 1 | model | price | 1004 | 649 | 2007 | 1429 | 2004 | 1150 | 3007 | 200 | "SELECT model,price FROM product" |
| 2 | name | country | Yamato | Japan | North | USA | Repulse | Brit | Cal | USA | "SELECT name, country FROM clases" |
I would like to be able to write one scenario that handles an arbitrary number of attributes. It would be great if the number of tested rows were not fixed either.
My dream is to write only one general scenario:
Testing query with N attribute with these M record in and another L out of result
How can I do this in Gherkin? Is it possible with any hacks?

The short answer is no. Gherkin is not about flexibility; Gherkin is about concrete examples. Concrete examples are anything but flexible.
The long answer is:
You are describing a usage of Gherkin as a test tool. The purpose of Gherkin is, however, not to test things. The purpose of Gherkin is to facilitate communication between development and the stakeholders that want a specific behaviour.
If you want to test something, there is other tooling that will support exactly what you want. Any test framework will be usable. My personal choice would be JUnit since I work mostly with Java.
The litmus test for deciding on the tooling is: who will have to be able to understand this?
If the answer is non-technical people, I would probably use Gherkin with very concrete examples. Concrete examples are most likely not comparing things in a database. Concrete examples tend to describe external, observable behaviour of the system.
If the answer is developers, then I would probably use a test framework where I have access to a programming language. This would allow for the flexibility you are asking for.
In your case, you are asking for a programming language. Gherkin and Cucumber are not the right tools in your case.
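To make the "test framework with a programming language" option concrete, here is a minimal sketch of one generic check that works for any number of columns and any number of expected or excluded rows. It is written in Python with an in-memory SQLite database rather than JUnit, purely as an illustration; the helper check_query is hypothetical, and the speed and price values in the sample pc table are made up.

```python
import sqlite3


def check_query(conn, query, must_contain, must_not_contain):
    """Return (missing, unexpected) rows for a query with any number of columns."""
    result = set(conn.execute(query).fetchall())
    missing = [row for row in must_contain if tuple(row) not in result]
    unexpected = [row for row in must_not_contain if tuple(row) in result]
    return missing, unexpected


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pc (model INTEGER, speed REAL, price INTEGER)")
# Hypothetical sample data; only the model numbers come from the question.
conn.executemany("INSERT INTO pc VALUES (?, ?, ?)", [
    (1013, 3.06, 529), (1006, 3.2, 1049), (1012, 2.8, 649), (1007, 2.2, 510),
])

# One attribute, two rows in, two rows out.
one_col = check_query(
    conn, "SELECT model FROM pc WHERE speed >= 3.0",
    must_contain=[(1013,), (1006,)],
    must_not_contain=[(1012,), (1007,)],
)

# Two attributes, one row in, two rows out: same helper, no new scenario.
two_col = check_query(
    conn, "SELECT model, price FROM pc WHERE speed >= 3.0",
    must_contain=[(1013, 529)],
    must_not_contain=[(1012, 649), (1007, 510)],
)

assert one_col == ([], [])
assert two_col == ([], [])
```

Because the rows are ordinary tuples, N, M and L are just the lengths of the lists passed in, which is the flexibility the question asks for.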

You can do it without any hacks, but I don't think you want to, at least not the entire scenario in a single line.
You will want to follow BDD structure, else why use BDD?
You should have and follow a structure like:
Given
When
Then
You need to keep a clear separation between the initial context, the action(s) and the result(s). It is bad practice not to separate these.
Also note that a clear separation will increase reusability and readability, and will help you a lot in debugging.
Please research what BDD means and how it helps. A checklist of BDD best practices may also help when reviewing the code of the automated scenarios.

Hide Hierarchy duplication in Powerpivot (Row Labels)

I am reporting on the performance of legal cases, from a SQL database of activities. I have a main table of cases, which has a parent/child hierarchy. I am looking for a way to report on case performance only once per parent/child group ('Family').
An example of relevant tables is:
Cases
ID | Client | MatterName | ClaimAmount | ParentID | NumberOfChildren |
1 | Mr. Smith | ABC Ltd | $40,000 | 0 | 2 |
2 | Mr. Smith | Jakob R | $40,000 | 1 | 0 |
3 | Mr. Smith | Jenny R | $40,000 | 1 | 0 |
4 | Mrs Bow | JQ Public | $7,000 | 0 | 0 |
Payments
ID | MatterID | DateReceived | Amount |
1 | 1 | 14/7/15 | $50 |
2 | 3 | 21/7/15 | $100 |
I'd like to be able to report back on a consolidated view that only shows the parent matter, with total received (and a lot of other similar related fact tables) - e.g.
Client | MatterName | ClaimAmount | TotalReceived |
Mr Smith | ABC Ltd | $40,000 | $150 |
Mrs Bow | JQ Public | $7,000 | $0 |
A key problem I'm having is hiding row labels for irrelevant rows (child matters). I believe I need to:
1. Determine whether the current row is a parent group
2. Consolidate all measures for that parent group
3. Filter on that being True? Place all measures inside IF checks?
Any help appreciated

How many levels does your hierarchy have? If it's just 2 levels (parents have children, children cannot be parents), then denormalize your model. You can add a single column for ParentMatterName and use that as the rowfilter in pivots. If there is a reasonable maximum number of levels in your data (we typically look at <=6 as reasonable) then denormalization is probably preferable, and certainly simpler/more performant, than trying to dynamically roll up the child measure values.
Edits to address comment below:
Denormalizing your data structure in this case just means going to the following table structure:
Cases
ID | Client | MatterName | ParentMatterName | ClaimAmount
1 | Mr. Smith | ABC Ltd | ABC Ltd | $40,000
2 | Mr. Smith | Jakob R | ABC Ltd | $0
3 | Mr. Smith | Jenny R | ABC Ltd | $0
4 | Mrs Bow | JQ Public | JQ Public | $7,000
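As a rough illustration of what that denormalization does, the sketch below derives the ParentMatterName column with a self-join and then rolls payments up to the parent matter. It is plain SQL run through Python's sqlite3 module, not DAX or Power Pivot, and the lowercase table and column names are hypothetical renderings of the tables in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cases (id INTEGER, client TEXT, matter_name TEXT, "
    "claim_amount INTEGER, parent_id INTEGER)"
)
conn.execute(
    "CREATE TABLE payments (id INTEGER, matter_id INTEGER, date_received TEXT, amount INTEGER)"
)
conn.executemany("INSERT INTO cases VALUES (?, ?, ?, ?, ?)", [
    (1, "Mr. Smith", "ABC Ltd", 40000, 0),
    (2, "Mr. Smith", "Jakob R", 40000, 1),
    (3, "Mr. Smith", "Jenny R", 40000, 1),
    (4, "Mrs Bow", "JQ Public", 7000, 0),
])
conn.executemany("INSERT INTO payments VALUES (?, ?, ?, ?)", [
    (1, 1, "2015-07-14", 50),
    (2, 3, "2015-07-21", 100),
])

# Denormalize: every case carries its parent's matter name (or its own name
# if it has no parent), and only the parent keeps the claim amount.
DENORM = """
    SELECT c.id, c.client, c.matter_name,
           COALESCE(p.matter_name, c.matter_name) AS parent_matter_name,
           CASE WHEN c.parent_id = 0 THEN c.claim_amount ELSE 0 END AS claim_amount
    FROM cases AS c
    LEFT JOIN cases AS p ON p.id = c.parent_id
"""
denormalized = conn.execute(DENORM + " ORDER BY c.id").fetchall()

# Roll up: one row per parent matter, with payments summed across the family.
report = conn.execute(f"""
    WITH denorm AS ({DENORM}),
         paid AS (SELECT matter_id, SUM(amount) AS received
                  FROM payments GROUP BY matter_id)
    SELECT d.client, d.parent_matter_name,
           SUM(d.claim_amount), COALESCE(SUM(paid.received), 0)
    FROM denorm AS d
    LEFT JOIN paid ON paid.matter_id = d.id
    GROUP BY d.client, d.parent_matter_name
    ORDER BY d.parent_matter_name
""").fetchall()
for row in report:
    print(row)
```

Grouping on the parent column is the SQL analogue of putting ParentMatterName on the pivot rows: child matters never appear as their own row labels.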
Regarding nomenclature - Excel is stupid, and so is DAX. Here is the way to think about these things to help minimize confusion - these are important concepts as you move forward in more complex DAX measures and queries.
Here are some absolutely truthful and accurate statements to show how stupid the nomenclature can get:
FILTER() is a table
Pivot table rows are filter context
FILTER() applies additional filter context when used as an argument to CALCULATE()
FILTER() creates row context internally in which to evaluate expressions
FILTER()'s arguments are affected by filter context from pivot table rows
FILTER()'s second argument evaluates an expression evaluated in the pivot table's rowfilter context in the row context of each row in the table in the first argument
And so on. Don't think of a pivot table as anything but filters. You have filters, slicers, rowfilters, columnfilters. Everything in a pivot table is filter context.
Links:
Denormalization in Power Pivot
Denormalizing Dimensions

Getting part of query defined by a parameterized function

Hello all :) I'm trying to do something like this in Oracle 10g:
SELECT
CAR_ID,
CAR_DATE,
get_some_other_info(CAR_TYPE)
FROM CARS
Where get_some_other_info(CAR_ID) would return several columns:
| CAR_ID | CAR_DATE | CAR_COLOR | CAR_CO2
| 001 | 01/01/2013 | BLUE | 100
| 002 | 02/01/2013 | RED | 120
| 003 | 03/01/2013 | BLUE | 100
I need to use a function for implementation reasons. I feel that I could use Table functions, but I cannot wrap my head around how to use them for my case.
Best regards,

You can also use dynamic SQL to accomplish that. Concatenate the result of your function with a variable containing your SQL then run it with execute immediate.
Google it and you'll find a lot of examples.

SQL Duplicated names in result

I've got a problem with SQL.
Here is my code:
SELECT Miss.Name, Miss.Surname, Master.Name, Master.Surname,
COUNT(Date.Id_date) AS [Dates_together]
FROM Miss, Master, Date
WHERE Date.Id_miss = Miss.Id_miss AND Date.Id_master = Master.Id_master
GROUP BY Miss.Name, Miss.Surname, Master.Name, Master.Surname
ORDER BY [Dates_together] DESC
and I've got the result:
Dorothy | Mills | James | Jackson | 28
Dorothy | Mills | Kayne | West | 28
Emily | Walters | James | Jackson | 13
Emily | Walters | Tom | Marvel | 12
Sunny | Sunday | Kayne | West | 9
and I really do not know what to change to have a result like this:
Dorothy | Mills | James | Jackson | 28
Emily | Walters | Tom | Marvel | 12
Sunny | Sunday | Kayne | West | 9
Because I don't want to have duplicated names of master or miss in the result... :(
Can anyone help me?

It looks like your result set is correct, as you are getting the appropriate distinct combinations.

The "duplicates" are accurate, because you are querying the combinations of the Miss and Master records, not the Miss and Master records themselves. For instance, in your second result set, it doesn't capture the fact that Dorothy Mills dated Kayne West 28 times.

You don't mention which database you're working with, but if I understand this correctly, you're trying to determine how many times a given couple have been on a date?
I think you need to ask yourself what happens if you have two people, of either sex, who share the same combination of first name and surname...
Start off with:
SELECT Id_master, Id_miss, COUNT(*) AS datecount FROM [Date] GROUP BY Id_master, Id_miss
From there, you simply need to add their names to the results...
That should get you started on the right track...
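A minimal sketch of that idea, counting per ID pair first and joining the names on afterwards, might look like this. It runs against an in-memory SQLite database from Python; the table name dates (instead of Date) and the sample rows, including the two different people who are both called Dorothy Mills, are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE miss (id_miss INTEGER PRIMARY KEY, name TEXT, surname TEXT)")
conn.execute("CREATE TABLE master (id_master INTEGER PRIMARY KEY, name TEXT, surname TEXT)")
conn.execute("CREATE TABLE dates (id_date INTEGER PRIMARY KEY, id_miss INTEGER, id_master INTEGER)")

# Hypothetical data: two distinct people share the name Dorothy Mills.
conn.executemany("INSERT INTO miss VALUES (?, ?, ?)",
                 [(1, "Dorothy", "Mills"), (2, "Dorothy", "Mills")])
conn.executemany("INSERT INTO master VALUES (?, ?, ?)", [(1, "James", "Jackson")])
conn.executemany("INSERT INTO dates VALUES (?, ?, ?)", [(1, 1, 1), (2, 1, 1), (3, 2, 1)])

# Count per ID pair first, then attach the names. Grouping by name instead
# would merge the two namesakes into one row with a count of 3.
couples = conn.execute("""
    SELECT mi.name, mi.surname, ma.name, ma.surname, d.date_count
    FROM (SELECT id_miss, id_master, COUNT(*) AS date_count
          FROM dates
          GROUP BY id_miss, id_master) AS d
    JOIN miss AS mi ON mi.id_miss = d.id_miss
    JOIN master AS ma ON ma.id_master = d.id_master
    ORDER BY d.date_count DESC, d.id_miss, d.id_master
""").fetchall()
for row in couples:
    print(row)
```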

We Keep Coding
