Summing n numerical variables by a grouping level specific to each - SQL

I am working through a GROUP BY problem and could use some direction at this point. I want to summarize a number of variables by a grouping level which is different for each of the variables to be summed (but drawn from the same domain of values). In pseudo-pseudo code, this is my issue: for each empYEAR variable (there are 20 or so employment-by-year variables in wide format), I want to sum it by the county in which the business was located in that particular year.
The data is a bunch of tables representing business establishments over a 20-year period from Dun & Bradstreet/NETS.
Some more details on the database: it is a number of flat files, all with the same primary key.
The primary key is DUNSNUMBER, which is present in several tables. There are tables detailing, for each year:
employment
county
sales
credit rating (and others)
all organized as follows (this table shows employment, but the other variables are similarly structured, with a year postfix).
dunsnumber | emp1990 | emp1991 | emp1992 | ... | emp2011 |
a          | 12      | 32      | 31      | ... | 35      |
b          |         | 2       | 3       | ... | 5       |
c          | 1       | 1       |         | ... |         |
d          | 40      | 86      | 104     | ... | 350     |
...
I would ultimately like to have a table that is structured like this:
county | emp1990 | emp1991 | emp1992 | ... | emp2011 | sales1990 | sales1991 | sales1992 | ... | sales2011 | ...
A
B
C
...
My main challenge right now is this: how can I sum employment (or sales) by county by year, as in the example table above, given that the county grouping variable sometimes changes from year to year and is specified in another table?
It seems like something that would be fairly straightforward to do in, say, R with a long data format, but there are millions of records, so I prefer to keep the initial processing in Postgres.

As I understand your question, this sounds relatively straightforward. While I normally prefer normalized data to work with, I don't see that normalizing things beforehand would buy you anything specific here.
It seems to me you want something relatively simple like:
SELECT sum(emp1990), sum(emp1991), ....
FROM county c
JOIN emp e ON c.dunsnumber = e.dunsnumber
JOIN sales s ON c.dunsnumber = s.dunsnumber
JOIN ....
GROUP BY c.name, c.state;
I don't see a simpler way of doing this. Very likely you could query the system catalogs or information schema to generate the list of columns to sum up. The rest is a straight group-by-and-join process as far as I can tell.
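As a hedged sketch of that catalog idea (it assumes the employment table is literally named emp and that all its year columns start with emp, which may not match your actual names), information_schema.columns in Postgres can build the sum list for you; paste the result into the real query or feed it to EXECUTE in a PL/pgSQL function:
-- A sketch only: builds "sum(emp1990) AS emp1990, sum(emp1991) AS emp1991, ..."
SELECT string_agg(format('sum(%I) AS %I', column_name, column_name), ', '
                  ORDER BY column_name) AS select_list
FROM information_schema.columns
WHERE table_name = 'emp'
  AND column_name LIKE 'emp%';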
If the variable changes by name, the best thing to do in my experience is to put together a location view based on that union and join against it. This lets you hide the complexity from your main queries and, as long as you don't also join against the underlying tables, it should perform quite well.
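A minimal sketch of what such a union-based view and join could look like, assuming hypothetical wide tables emp(dunsnumber, emp1990, ..., emp2011) and county(dunsnumber, cty1990, ..., cty2011): unpivot both to long format with UNION ALL, join on dunsnumber and year, and sum by county and year.
-- Unpivot employment and location, then aggregate by the year-specific county.
WITH emp_long AS (
    SELECT dunsnumber, 1990 AS yr, emp1990 AS emp FROM emp
    UNION ALL
    SELECT dunsnumber, 1991, emp1991 FROM emp
    -- ... one branch per year through 2011 ...
),
cty_long AS (
    SELECT dunsnumber, 1990 AS yr, cty1990 AS county FROM county
    UNION ALL
    SELECT dunsnumber, 1991, cty1991 FROM county
    -- ... one branch per year through 2011 ...
)
SELECT c.county, e.yr, sum(e.emp) AS total_emp
FROM emp_long e
JOIN cty_long c USING (dunsnumber, yr)
GROUP BY c.county, e.yr
ORDER BY c.county, e.yr;
The long result can then be pivoted back to the wide county-by-year layout, or handed to R as a much smaller aggregate.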

How to make Power BI work with many-to-many?

I have the following model, where the relation tables create a many-to-many relationship, as usual.
|table car      |      |relation car worker     |      |table worker   |
|id car         |1----*|id car                  |      |id worker      |
|car things     |      |id worker               |*----1|worker things  |

|table worker   |      |relation worker building|      |table building  |
|id worker      |1----*|id worker               |      |id building     |
|worker things  |      |id building             |*----1|building things |
When I load this model in Power BI, it can build a table visualization (and others) containing one of the following:
Option 1:
Car things
Worker things
Option 2:
Worker things
Building things
But it totally fails when I try to put something from the edges of the model into the table visualization:
Car things
Building things
This is the error:
What's going on here? Why the error?
Basically I need to see which cars visit which buildings (and do one or two summarizations)
From what I can understand from your model, you need to set your cross-filter direction to Both rather than Single, so it can filter back up the tables.
Here is an example which may help you understand.
Take row 1 in TableA: it can relate to 1 or 2 in TableB, and since 1 in TableB can in turn relate to 1 or 2 in TableC, the engine cannot determine, for row 1 in TableA, which row in TableC should be referred to:
*-----------------------------*
| Table A | Table B | Table C |
*-----------------------------*
|    1    |    1    |  1, 2   |
|    1    |    2    |  1, 2   |
*-----------------------------*
You can create a new table with the formula
tableName = CROSSJOIN(tableA,tableB,tableC....tableZ)
I had to create a new table as if I were doing an inner join in SQL (which works just as expected).
Click "modeling", "create table".
For the formula:
NewTable = NATURALINNERJOIN(tb_cars,
NATURALINNERJOIN(tb_rel_car_worker,
NATURALINNERJOIN(tb_workers,
NATURALINNERJOIN(tb_rel_worker_building, tb_buildings))))
Then I start using everything from this table instead of the others.
In order to create metrics that include zero counts grouped by one of the edge tables, I had to leave this table separate:
Partial Table = NATURALINNERJOIN(tb_rel_car_worker,
                NATURALINNERJOIN(tb_workers,
                NATURALINNERJOIN(tb_rel_worker_building, tb_buildings)))
Then in the relationship editor, I added a relation between Partial Table[id_car] and tb_cars[id_car].
Now the metrics grouped by car can use the technique of adding +0 to the DAX formulas, so that cars whose metric sums to zero also show up in charts.

How do you 'join' multiple SQL data sets side by side (that don't link to each other)?

How would I go about joining results from multiple SQL queries so that they are side by side (but unrelated)?
The reason I am thinking of this is so that I can run one query in Google BigQuery and it will return a single table which I can import into Excel and use for some charts.
e.g. Query 1 looks at dataset TableA and returns:
**Metric:** Sales
**Value:** 3,402
And then Query 2 looks at dataset TableB and returns:
**Name:** John
**DOB:** 13 March
They would both use different tables and different filters, etc.
What would I do to make it look like:
---Sales----------John----
---3,402-------13 March----
Or alternatively:
-----Sales--------3,402-----
-----John-------13 March----
Or is there a totally different way to do this?
I can see the use case for the above; I've used something similar to create a single table from multiple tables with different metrics to query in Data Studio, for example, so that filters apply to all data in the dataset. However, in that case the data did share some dimensions that made it worthwhile doing.
If you are going to put those together with no relationship between the tables, I'd have 4 columns with TYPE describing the data in that row to make for easier filtering.
Type | Sales | Name | DOB
Use UNION ALL to put the rows together so you have something like
"Sales" | 3402 | null | null
"Customer Details" | null | John | 13 March
However, like the others said, make sure you have a good reason to do that; otherwise you're just creating a bigger table to query for no reason.

Access query, if two values exist in one column, omit one

I have a series of queries that generate reports that contain chemical data. There are two compounds A and B where A is the total amount and B is a speciated amount (like total iron and ferrous iron, for example).
There are about one hundred total compounds in the query result, and I need a criterion to filter the results such that if both Compounds A and B are present, only Compound B is displayed. So far I've tried adding a few IIf statements to the criteria section in the query builder, with no luck.
Here is what I have so far:
SELECT Table1.KEY_ANLT
FROM Table1
WHERE (((Table1.KEY_ANLT)=IIf([Table1].[KEY_ANLT]=1223 And [Table1].[KEY_ANLT]=70,70,1223)));
This filters out Compound A but does not include the rest of the compounds. How can I modify the query to also include the other compounds?
So, to clarify some of the comments above, the problem here is you don't have (or haven't specified above) a way to identify values that go together. You gave 70 and 1223 as an example, but if you gave us a list of all the numbers, how would we be able to identify which ones go together? You might say "chemistry expertise", but that's based on another column with the compounds' names, right? So really, your query should use that column. But then there's still the problem of how to connect associated names (e.g., "total iron" and "ferrous iron" might be connected because they both have the word "iron", but what about "permanganate" and "manganese"?). In short, you need another column to specify the thing in common between these separate rows, whether it's element, ion, charge, etc. You would also need a column identifying which row in each "group" you would want to include in your query (or, which ones to exclude). For example:
+----------+-----------------+---------+---------+
| KEY_ANLT | Compound        | Element | Primary |
+----------+-----------------+---------+---------+
| 70       | total iron      | Fe      | Y       |
| 1223     | ferrous iron    | Fe      |         |
| 1224     | ferric iron     | Fe      |         |
| 900      | total manganese | Mn      | Y       |
| 901      | permanganate    | Mn      |         |
+----------+-----------------+---------+---------+
Then, to get a query that shows just the "primary" rows, it's pretty trivial:
SELECT * FROM Table1 WHERE [Primary]='Y';
Without that [Primary] column, you'd have to decide how to choose each row. Perhaps you'd want the one with the smallest KEY_ANLT?
SELECT Table1.*
FROM
(SELECT Element, min(KEY_ANLT) AS MinKey FROM Table1 GROUP BY Element) AS Subquery
INNER JOIN Table1 ON
Subquery.Element=Table1.Element AND
Subquery.MinKey=Table1.KEY_ANLT
The reason your query doesn't work is that the WHERE clause operates row-by-row, and doesn't compare different rows to one another. So in your SQL:
IIf([Table1].[KEY_ANLT]=1223 And [Table1].[KEY_ANLT]=70,70,1223)
NONE of the rows will evaluate this as 70, because no single row has KEY_ANLT=1223 AND KEY_ANLT=70. Each row only has one value for KEY_ANLT. So then that IIF expression evaluates as 1223 for every row, and your condition will only return rows where KEY_ANLT=1223 (compound B).
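If adding that grouping column isn't an option, a hedged alternative sketch: keep a Compound A row (KEY_ANLT 70) only when the same sample has no Compound B row (KEY_ANLT 1223). This assumes a hypothetical SampleID column that ties the rows of one report together; substitute whatever column actually plays that role in your data.
SELECT t.*
FROM Table1 AS t
WHERE NOT (t.KEY_ANLT = 70
           AND EXISTS (SELECT * FROM Table1 AS b
                       WHERE b.SampleID = t.SampleID
                         AND b.KEY_ANLT = 1223));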

How to query huge MySQL databases?

I have 2 tables, a purchases table and a users table. Records in the purchases table looks like this:
purchase_id | product_ids | customer_id
---------------------------------------
1           | (99)(34)(2) | 3
2           | (45)(3)(74) | 75
Users table looks like this:
user_id | email              | password
----------------------------------------
3       | joeShmoe@gmail.com | password
75      | nolaHue@aol.com    | password
To get the purchase history of a user I use a query like this:
mysql_query(" SELECT * FROM purchases WHERE customer_id = '$users_id' ");
The problem is, what will happen when tens of thousands of records are inserted into the purchases table. I feel like this will take a performance toll.
So I was thinking about storing the purchases in an additional field directly in the user's row:
user_id | email              | password | purchases
------------------------------------------------------
1       | joeShmoe@gmail.com | password | (99)(34)(2)
2       | nolaHue@aol.com    | password | (45)(3)(74)
And when I query the user's table for things like username, etc. I can just as easily grab their purchase history using that one query.
Is this a good idea, will it help better performance or will the benefit be insignificant and not worth making the database look messier?
I really want to know what the pros do in these situations. For example, how does Amazon query its database for a user's purchase history, given that they have millions of customers? How come their queries don't take hours?
EDIT
Ok, so I guess keeping them separate is the way to go. Now the question is a design one:
Should I keep using the "purchases" table I illustrated earlier? In that design I am separating the product IDs of each purchase using parentheses and using these as the delimiters to tell the IDs apart when extracting them via PHP.
Or should I instead be storing each product ID separately in the "purchases" table, so it looks like this?
purchase_id | product_ids | customer_id
---------------------------------------
1           | 99          | 3
1           | 34          | 3
1           | 2           | 3
2           | 45          | 75
2           | 3           | 75
2           | 74          | 75
Nope, this is a very, very, very bad idea.
You're breaking first normal form because you don't know how to page through a large data set.
Amazon and Yahoo! and Google bring back (potentially) millions of records - but they only display them to you in chunks of 10 or 25 or 50 at a time.
They're also smart about guessing or calculating which ones are most likely to be of interest to you - they show you those first.
Which purchases in my history am I most likely to be interested in? The most recent ones, of course.
You should consider building these into your design before you violate relational database fundamentals.
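A hedged sketch of that kind of paging in MySQL, assuming the purchases table gains a purchase_date column (it isn't in the schema shown above): newest purchases first, 25 per page.
SELECT *
FROM purchases
WHERE customer_id = 3
ORDER BY purchase_date DESC
LIMIT 25 OFFSET 0;  -- page 1; use OFFSET 25 for page 2, and so on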
Your database already looks messy, since you are storing multiple product_ids in a single field instead of creating an "association" table like this:
_____product_purchases____
purchase_id | product_id |
--------------------------
          1 |         99 |
          1 |         34 |
          1 |          2 |
You can still fetch it in one query:
SELECT * FROM purchases p LEFT JOIN product_purchases pp USING (purchase_id)
WHERE p.customer_id = $user_id
But this also gives you more possibilities, like finding out how many units of product #99 were bought, getting a list of all customers who purchased product #34, etc.
And of course don't forget about indexes; they will make all of this much faster.
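As a sketch of what that normalized layout and its indexes might look like in MySQL (the names follow the examples above; adjust the types to your data):
CREATE TABLE purchases (
    purchase_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    customer_id INT UNSIGNED NOT NULL,
    INDEX idx_customer (customer_id)     -- speeds up "purchases for this user"
);

CREATE TABLE product_purchases (
    purchase_id INT UNSIGNED NOT NULL,
    product_id  INT UNSIGNED NOT NULL,
    PRIMARY KEY (purchase_id, product_id),
    INDEX idx_product (product_id)       -- speeds up "who bought product #34"
);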
By storing purchases in the users table like that, you will break the entity relationships of your database.
You might want to look into Memcached, NoSQL, and Redis.
These are all tools that will help you improve your query performances, mostly by storing data in the RAM.
For example: run the query once and store the result in Memcached; if the user refreshes the page, you get the data from Memcached, not from MySQL, which avoids querying your database a second time.
Hope this helps.
First off, tens of thousands of records is nothing. Unless you're running on a teensy weensy machine with limited RAM and hard drive space, a database won't even blink at 100,000 records.
As for storing purchase details in the users table... what happens if a user makes more than one purchase?
MySQL is hugely scalable, and don't let the fact that it's free convince you otherwise. Keeping the two tables separate is probably best, not only because it keeps the DB more normalized, but because having more indices will speed up queries. A 10,000-record database is relatively small compared to multi-hundred-million-record health record databases.
As far as Amazon and Google go, they hire hundreds of developers to write specialized query layers for their specific application needs... not something developers like us have the resources to fund.

Creating new table from data of other tables

I'm very new to SQL and I hope someone can help me with some SQL syntax. I have a database with these tables and fields:
DATA: data_id, person_id, attribute_id, date, value
PERSONS: person_id, parent_id, name
ATTRIBUTES: attribute_id, attribute_type
attribute_type can be "Height" or "Weight"
Question 1
Given a person's "Name", I would like to return a table of "Weight" measurements for each of their children. I.e., if John has 3 children named Alice, Bob and Carol, then I want a table like this:
| date | Alice | Bob | Carol |
I know how to get a long list of children's weights like this:
select d.date,
       d.value
from data d,
     persons child,
     persons parent,
     attributes a
where parent.name = 'John'
  and child.parent_id = parent.person_id
  and d.person_id = child.person_id
  and d.attribute_id = a.attribute_id
  and a.attribute_type = 'Weight';
but I don't know how to create a new table that looks like:
| date | Child 1 name | Child 2 name | ... | Child N name |
Question 2
Also, I would like to restrict the attribute values to a certain range.
Question 3
What happens if the dates are not consistent across the children? For example, suppose Alice is 3 years older than Bob, then there's no data for Bob during the first 3 years of Alice's life. How does the database handle this if we request all the data?
1) It might not be so easy. MS SQL Server can PIVOT a table on an axis, but dumping the resultset to an array and sorting there (assuming this is tied to some sort of program) might be the simpler way right now if you're new to SQL.
Even if you can manage it in SQL, the SELECT won't itself create a new table, just return the data you'd use to fill it in, so some sort of external manipulation will probably be required. But you can use INSERT INTO [new table] SELECT [...] to fill that new table from your select query, at least (there is a sketch of a PIVOT-free approach after point 3 below).
2) You can join the data and attributes tables once for each attribute you need:
SELECT [...] FROM persons AS p
JOIN data AS wd ON wd.person_id = p.person_id
JOIN attributes AS weight ON wd.attribute_id = weight.attribute_id
    AND weight.attribute_type = 'Weight'
JOIN data AS hd ON hd.person_id = p.person_id
JOIN attributes AS height ON hd.attribute_id = height.attribute_id
    AND height.attribute_type = 'Height'
[...]
(The comma-separated FROM list you used in the original query is just shorthand for [INNER] JOIN .. ON; it's the same thing, with the filter conditions going in the ON or WHERE clause.)
3) It depends on the type of JOIN you use to match parent/child relationships, and any dates you're filtering on in the WHERE, if I'm reading that right (entirely possible I'm not). I'm not sure quite what you're looking for, or what kind of database you're using, so no good answer. If you're new enough to SQL that you don't know the different kinds of JOINs and what they can do, it's very worthwhile to learn them - they put the R in RDBMS.
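Circling back to question 1: a sketch of a PIVOT-free approach using conditional aggregation. It assumes you already know the children's names (Alice, Bob, Carol), because a SQL column list has to be fixed when the query is written; dates where a child has no measurement simply come back as NULL, which also speaks to question 3 for this layout.
SELECT d.date,
       MAX(CASE WHEN child.name = 'Alice' THEN d.value END) AS Alice,
       MAX(CASE WHEN child.name = 'Bob'   THEN d.value END) AS Bob,
       MAX(CASE WHEN child.name = 'Carol' THEN d.value END) AS Carol
FROM data d
JOIN persons child  ON d.person_id = child.person_id
JOIN persons parent ON child.parent_id = parent.person_id
JOIN attributes a   ON d.attribute_id = a.attribute_id
WHERE parent.name = 'John'
  AND a.attribute_type = 'Weight'
GROUP BY d.date;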
When you do a SELECT, you need to specify the exact columns you want. In other words, you can't return the Nth child's name. I.e., this isn't possible:
1/2/2010 | Child_1_name | Child_2_name | Child_3_name
1/3/2010 | Child_1_name
1/4/2010 | Child_1_name | Child_2_name
Each record needs to have the same amount of columns. So you might be able to make a select that does this:
1/2/2010 | Child_1_name
1/2/2010 | Child_2_name
1/2/2010 | Child_3_name
1/3/2010 | Child_1_name
1/4/2010 | Child_1_name
1/4/2010 | Child_2_name
And then in a report, remap it to how you want it displayed.
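A sketch of the long-format query this answer describes, using the same schema as above; each row carries the child's name so the report layer can pivot it however it likes:
SELECT d.date, child.name, d.value
FROM data d
JOIN persons child  ON d.person_id = child.person_id
JOIN persons parent ON child.parent_id = parent.person_id
JOIN attributes a   ON d.attribute_id = a.attribute_id
WHERE parent.name = 'John'
  AND a.attribute_type = 'Weight'
ORDER BY d.date, child.name;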