Creating a list of years using bins with SQL BigQuery - sql

I am working on a project that involves studying the impact of NFL executives. I am using ProFootball Reference tables for each team like this and I am trying to combine them with this table featuring the executives. As you can see, the two data sets are not matching, which has created problems. My first instinct was to isolate the executives from each team and use a CASE statement to manually enter them into each team's table, but since there are 32 teams, I want to avoid doing this manually if I can. I also figured this would not be possible if I did this as a career, so I wanted to do this with a SQL function.
I decided to create a bin where I would list out each year and then join, but first I needed to see if it would work with just the general managers.
I'm using BigQuery Sandbox and currently don't have access to anything with more permissions, so I don't have the permission to CREATE TABLE, unfortunately. I attempted to convert the executive table into one where each year is shown with each executive and attempted a pivot so that the split teams are converted into columns after I seperated the array. I opted out of the Pivot since there it seemed to want an aggregate value, but I am still running into a snag.
WITH bin AS(
SELECT *
FROM UNNEST(generate_array(1921, 2022, 1)) AS year),
gm_teams_cowboys AS(
SELECT
Name,
team,
came,
went,
FROM (
SELECT
Positions,
Name,
exec.From AS came,
exec.To AS went,
SPLIT(Teams, ',') AS team
FROM lyrical-star-357613.nfl_stats.nfl_executives AS exec)
WHERE Positions LIKE '%General Manager%' AND team LIKE '%Cowboys%')
SELECT
*
FROM gm_teams
LEFT JOIN bins
ON bins.year = gm_teams.came
This code also did not work in that it did not provide the columns I wanted, so I was wondering if there was a good way to do this that I was missing. Would I need to just suck it up and manually put the GMs and Owners in if I don't have the ability to create a table?

Related

How do I select attributes from another selection of elements

I am using Excel's Power Query in order to test a SQL query that I am eventually going to use in order to make a pivot table that stays updated with the database. The database is accessed through an ODBC.
The problem is not related to Power Query itself but simply the SQL request.
Here I am trying to select all bills from the "facturation" (French database) table that are from the current year (2021). I am naming this selected data FACTURES_ANEE_COURANTE.
Then I want to also select some attributes of those items from 2021 in order to display them in the pivot table, but only on the selection that I just made in order to only select (and show) bills from the current year.
select * as FACTURES_ANNEE_COURANTE
from facturation
where year(date_fact)=2021 limit 3, select date_fact from FACTURES_ANNEE_COURANTE
I only have very basic knowledge of SQL and therefore this does not seem to work, the second part of my request that is (the first one works). I'm trying to do this in order to be able to show these specific attributes in the pivot table. What's the proper way to select attributes only from my first selection of elements from my table facturation?
Thank you for your help.
A major advantage of Power Query is being able to generate complex logic without needing to be able to code in SQL. So I would abandon writing hand coded SQL - there's no need.
Before PQ came out I had 2 decades of experience writing complex SQL. After PQ came out I've written almost none - the SQL code generated by PQ is good enough, you can easily add complex transformations that are hard/impossible in SQL, and overall developing and debugging is 10x easier.
For your scenario, I would build a PQ query just using the navigation to select your facturation table. Then I would use the PQ UI to Filter (instead of a SQL where clause) and Choose Columns (to restrict the columns returned).
Whatever other transformations you need are likely met using a button in the PQ UI.

Tableau count values after a GROUP BY in SQL

I'm using Tableau to show some schools data.
My data structure gives a table that has all de school classes in the country. The thing is I need to count, for example, how many schools has Primary and Preschool (both).
A simplified version of my table should look like this:
In that table, if I want to know the number needed in the example, the result should be 1, because in only one school exists both Primary and Preschool.
I want to have a multiple filter in Tableau that gives me that information.
I was thinking in the SQL query that should be made and it needs a GROUP BY statement. An example of the consult is here in a fiddle: Database example query
In the SQL query I group by id all the schools that meet either one of the conditions inside de IN(...) and then count how many of them meet both (c=2).
Is there a way to do something like this in Tableau? Either using groups or sets, using advanced filters or programming a RAW SQL calculated fiel?
Thanks!
Dubafek
PS: I add a link to my question in Tableu's forum because you can download my testing workbook there: Tableu's forum question
I've solved the issue using LODs (specifically INCLUDE and EXCLUDE statements).
I created two calculated fields having the aggregation I needed:
Then I made a calculated field that leaves only the School IDs that matches the number of types they have (according with the filtering) with the number of types selected in the multiple filter (both of the fields shown above):
Finally, I used COUNTD([Condition]) to display the amounts of schools matching with at least the School types selected.
Hope this helps someone with similar issue.
PS: If someone wants the Workbook with the solution I've uploaded it in an answer in the Tableau Forum

Merge two CSV and collate data

I have two CSV files, the first like so:
Book1:
ID,TITLE,SUBJECT
0001,BLAH,OIL
0002,BLAH,HAMSTER
0003,BLAH,HAMSTER
0004,BLAH,PLANETS
0005,BLAH,JELLO
0006,BLAH,OIL
0007,BLAH,HAMSTER
0008,BLAH,JELLO
0009,BLAH,JELLO
0010,BLAH,HAMSTER
0011,BLAH,OIL
0012,BLAH,OIL
0013,BLAH,OIL
0014,BLAH,JELLO
0015,BLAH,JELLO
0016,BLAH,HAMSTER
0017,BLAH,PLANETS
0018,BLAH,PLANETS
0019,BLAH,HAMSTER
0020,BLAH,HAMSTER
And then a second CSV with items associated with the first list, with ID being the common attribute between the two.
Book2:
ID,ITEM
0001,PURSE
0001,STEAM
0001,SEASHELL
0002,TRUMPET
0002,TRAMPOLINE
0003,PURSE
0003,DOLPHIN
0003,ENVELOPE
0004,SEASHELL
0004,SERPENT
0004,TRUMPET
0005,CAR
0005,NOODLE
0006,CANNONBALL
0006,NOODLE
0006,ORANGE
0006,SEASHELL
0007,CREAM
0007,CANNONBALL
0007,GUM
0008,SERPENT
0008,NOODLE
0008,CAR
0009,CANNONBALL
0009,SERPENT
0009,GRAPE
0010,SERPENT
0010,CAR
0010,TAPE
0011,CANNONBALL
0011,GRAPE
0012,ORANGE
0012,GUM
0012,SEASHELL
0013,NOODLE
0013,CAR
0014,STICK
0014,ORANGE
0015,GUN
0015,GRAPE
0015,STICK
0016,BASEBALL
0016,SEASHELL
0017,CANNONBALL
0017,ORANGE
0017,TRUMPET
0018,GUM
0018,STICK
0018,GRAPE
0018,CAR
0019,CANNONBALL
0019,TRUMPET
0019,ORANGE
0020,TRUMPET
0020,CHERRY
0020,ORANGE
0020,GUM
The real datasets are millions of records, so I'm sorry in advance for my simple example.
The problem I need to solve is getting the data merged and collated in a way where I can see which item groupings most commonly appear together on the same ID. (e.g. GRAPE,GUM,SEASHELL appear together 340 times, ORANGE and STICK 89 times, etc...)
Then I need to see if there is any change/deviation to the general results in common appearance when grouped by SUBJECT.
Tools I'm familiar with are Excel and SQL, but I also have PowerBI and Alteryx at my disposal.
Full disclosure: Not homework, or work, but a volunteer project, thus my unfamiliarity with this kind of data manipulation.
Thanks in advance.
An Alteryx solution:
Drag the two .csv files onto your canvas (seen as book1.csv and book2.csv in my picture; Alteryx will create "Input" tools for you.
Drag a "Join" tool on and connect the two .csv files to its inputs; select "ID" as the join field; unselect the "Right_ID" as output since it's merely a duplicate of "ID"
Drag a "Summary" tool on and connect the Join tool's output to the Summary tool's input; select all three of the outputs and add as a "group by"... then add the ID column with a "count"
Drag a browse tool on and connect the summary's output to the browse tool's input.
run the workflow
After all that, click on the browse tool and you should see what is seen in my screenshot: (which is showing just the first ten rows of output):
+1 for taking on a volunteer project - I think anyone who knows data can have a big impact in support of their favourite group or cause.
I would just pull the 2 files into Power BI as 2 separate tables (Get Data / From File). Create a relationship between the 2 tables based on ID (it might get auto-generated). It should be one to many.
Then I would add a Calculated Column to the Book1 table to Concatenate the related ITEM values, eg.
Items =
CALCULATE (
CONCATENATEX (
DISTINCT ( 'Book2'[ITEM] ),
'Book2'[ITEM],
", ",
'Book2'[ITEM], ASC
)
)
Now you can use that Items field in visuals (e.g. a Table), along with Count of ID to get the frequency.
Adding Subject to a copy of the table (e.g. to the Columns well of a Matrix) will produce your grouped scenario, or you could add a Subject Slicer.
As you will be comparing subsets of varying size, I would change Count of ID to Show value as - % of grand total.
Little different solution using Alteryx.
With this dataset, there are very few repeating 3 or 4 item groups. You can do the two item affinity analysis and get a probability of 3 or 4 item groups, or you can count the 3 and 4 item groups individually. I believe what you want is the latter as your probability of getting grapes with oranges may be altered by whether you have bananas in the cart or not.
Anyway, I did not join in the subject until after finding all of my combinations. I found all the combinations by taking the Cartesian join of two, then three, then four of the original set. I then removed all duplicates by ensuring items were always in alphabetical order in each row. I then counted occurrences of each combination. More joins can be added in the same pattern to count groups of 5,6,7...
Once you have the counts of occurrences, then I would join back with the subjects and perform this analysis on each group and compare to the overall results.
I'm supposed to disclose that I work for Alteryx.
first of all if you are using windows
just navigate to the directory which contains the CSV and write the following command:
copy pattern newfileName.csv
#example
copy *.csv merged.csv
now you created one csv file, the file is too large now you can't process it once, depending on your programming language you can use appropriate way, for python you can use generators to process line by line, or pandas you can read chunk by chunk it will be easy.
I hope this help you.

SQL query / SQL Reporting Services

Been rattling my brain for a while and I could not get pass how to do the SQL query that will show the relationship/connections between my two tables.
I'm working on an IT equipment inventory program. I have two tables;
SELECT serial_number, model, ship_dat, status FROM items_list
SELECT item_serial, connected-to_serial FROM connections
All items like desktops, laptops, monitors, etc are on the items_list table. To track down the relationship/connections of the items, I created the connections table. IE, Monitor with serial_number=Screen#1 is connected to a Desktop with serial_number=Serial#1. It works ok with my Window Form application because I
used a datagridview control to list all devices simple SQL query.
However, when trying to show the relationship/connection on SQL Reports I've ran out of ideas how to do it. I'm aiming to get the report look like below or something along the lines. I just need to show the connections between the items.
Thank you
You should be able to do this with a table in SSRS if that is what you are using. The query you would need to drive the table of all related items would be:
SELECT item_serial, connected-to_serial, mainItem.*, connectedItem.*
FROM connections
INNER JOIN items_list mainItem ON connections.item_serial = items_list.serial_number
INNER JOIN items_list connectedItem ON connections.connected-to_serial = connectedItem.serial_number
You can of course tailor the SELECT statement to your needs, mainItem.* and connectedItem.* will not give you the most descriptive column names. Using column aliases (found under column_alias here) you can give a more descriptive name to each column.
From here you should be able to use a table and create a row group on the main item (either name or serial number) to get the type of look you are looking to achieve here. I believe the Report Wizard actually has most of the functionality you are looking for and should handle the bulk of this. You may have to move some of the cells around to get the look you are going for though.

Use a Query from the Destination db to limit OLE DB Source task in SSIS 2008

All,
I have a package that I'm building as a data importer so I can copy sets of data from my production environment and develop on another instance.
I have two tables that contain header and detail rows for service tickets. Those service tickets are tied back to orders.
I am pulling the service tickets from a certain time window, however, the originating orders fall outside of the date range that I'm pulling for the tickets.
I want to be able to take the following steps in an SSIS package:
Import the header and detail rows within the given date range from prod to dev
Select the relevant order numbers from dev tables
Use the list of order numbers to import only the relevant orders from prod
I poked through other answers and couldn't find answers that addressed this directly, so I apologize if there is an answer out there and I missed it. I may not have been asking the question correctly. I'm assuming that I would need to pull those order numbers into a temp table or variable in order to apply them as a filter.
As I write this, it just crossed my mind to use a join on the source system with the ticket to order tables and still use the date range to limit, but I'm still posting the question to see if anyone has dealt with this before.
Your steps are already fairly clear, are you asking how to actually implement them? It looks like you can do all three steps by using SELECT statements in your data sources:
Build a SELECT statement dynamically with the correct dates to use in your data source. The dates could be programmatically generated in a script task, or saved in a database table and populated into variables. Then you copy the data across to the dev system.
Run a SELECT statement in the dev system that returns the order numbers, and copy the results to a table in the prod database.
Run a SELECT statement in the prod database that joins on the table from step 2 and copy the results back to dev.
An alternative to the table in steps 2 and 3 would be a lookup transformation, but if you have a large number of rows then using a table will probably be faster.