How can I count duplicate values across columns? - sql

I have a fairly simple problem that I need to solve with SQL.
There is a database with several columns that contain duplicate values. I am trying to count how many times each value occurs across those columns, i.e. I need a count per category. For example, if one column holds the types of cookies one bakery sells (chocolate and raisin) and another column holds the types of cookies another bakery sells (also chocolate and raisin), I want to know how many bakeries in total sell chocolate cookies.
My query:
SELECT types_cookies_store01, COUNT(*) FROM cookies
GROUP BY types_cookies_store01
This query returns the count for only one column; how can I get the count across all the columns?
Thank you!

Your data model is poor!
Assuming it looks like this:
| types_cookies_store01 | types_cookies_store02 | types_cookies_store03 |
|-----------------------|-----------------------|-----------------------|
| A | | |
| C | | |
| | A | |
| A | | |
| B | | |
| | B | |
| | | B |
| C | | |
| | C | |
Don't use a new column for each store, use one column to specify the store:
| Cookietype | Store |
|------------|-------|
| A | 01 |
| A | 03 |
| B | 02 |
| B | 01 |
| A | 02 |
| C | 02 |
| D | 02 |
| B | 03 |
| B | 01 |
| A | 03 |
| C | 03 |
Now you can group by Cookietype to get sales of all stores.
Select Cookietype, Count(Cookietype) as CountOfCookieType
From SalesTable
Group By Cookietype
Or to get sales by store add the Store to the group by:
Select Cookietype, Store, Count(Cookietype) as CountOfCookieType
From SalesTable
Group By Cookietype, Store
With that data model you can add (almost) unlimited new stores.
With your data model you need a new column for each new store and tables are limited to 255 columns!
You can work around your poor model by using a union query:
SELECT UnionCols.Cookietype, COUNT(*)
FROM (
SELECT types_cookies_store01 As Cookietype, 'Store01' As Store
FROM cookies
WHERE Not types_cookies_store01 Is Null
UNION ALL
SELECT types_cookies_store02 As Cookietype, 'Store02' As Store
FROM cookies
WHERE Not types_cookies_store02 Is Null
UNION ALL
SELECT types_cookies_store03 As Cookietype, 'Store03' As Store
FROM cookies
WHERE Not types_cookies_store03 Is Null
) As UnionCols
Group By UnionCols.Cookietype
But performance will be poor on large data and new stores are hard to add.
It is strongly recommended to improve your data model instead of working around it!
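If you do decide to migrate, the union query above can also populate the normalized table in a single pass. A minimal sketch, assuming the table and column names used above (the SalesTable name and the TEXT column type are placeholders; adjust the types to your database):
CREATE TABLE SalesTable (Cookietype TEXT, Store TEXT);

INSERT INTO SalesTable (Cookietype, Store)
SELECT types_cookies_store01, 'Store01' FROM cookies WHERE types_cookies_store01 IS NOT NULL
UNION ALL
SELECT types_cookies_store02, 'Store02' FROM cookies WHERE types_cookies_store02 IS NOT NULL
UNION ALL
SELECT types_cookies_store03, 'Store03' FROM cookies WHERE types_cookies_store03 IS NOT NULL;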
Some links to read:
Database design basics
Access All In One

Related

SQL: How do I combine similar value rows into one, not affecting the rest

Is there a way to merge similar values in the same column without affecting the rest? For example:
I want to sum Amount by Company and by ID.
You cannot get exactly the output you want with SQL alone; the company name will be repeated on every row. If you want to display the data so that the company name is not repeated on subsequent rows, you have to use Excel or some other presentation-layer tool.
SELECT Company, ID, SUM(Amount)
FROM Table1
GROUP BY Company,ID
+---------+-----+--------+
| Company | ID | Amount |
+---------+-----+--------+
| ABC | 001 | 3 |
| ABC | 002 | 3 |
| DEF | 002 | 10 |
| DEF | 003 | 5 |
+---------+-----+--------+

SQL Server create an ID for each distinct value in a column

I have a table [DWH].[Division] which looks like this:
| Company ID | Division Code | Division Key |
|------------|---------------|--------------|
| 01 | PROD | 1 |
| 02 | PROD | 2 |
| 01 | MAN | 3 |
| 04 | MAN | 4 |
| 03 | PROD | 5 |
I am trying to add a new column called [Global Division Key], so that whenever a row with an existing [Division Code] is inserted, it will use the corresponding integer value from [Global Division Key], and whenever a row with a new [Division Code] is inserted, the next highest integer value for [Global Division Key] will be used. The existing [Division Key] column uses INT IDENTITY(1,1). Values are only inserted into this table when the combination of [Company ID] and [Division Code] does not yet exist in the table.
The result would look like this:
| Company ID | Division Code | Division Key | Global Division Key |
|------------|---------------|--------------|---------------------|
| 01 | PROD | 1 | 1 |
| 02 | PROD | 2 | 1 |
| 01 | MAN | 3 | 2 |
| 04 | MAN | 4 | 2 |
| 03 | PROD | 5 | 1 |
When I add a new row with [Division Code] 'FIN' the result would be:
| Company ID | Division Code | Division Key | Global Division Key |
|------------|---------------|--------------|---------------------|
| 01 | PROD | 1 | 1 |
| 02 | PROD | 2 | 1 |
| 01 | MAN | 3 | 2 |
| 04 | MAN | 4 | 2 |
| 03 | PROD | 5 | 1 |
| 01 | FIN | 6 | 3 |
The data for this table is inserted from a staging table via a MERGE statement.
I have tried to use a sequence for this, so that whenever the merge inserts new data it will check whether there is already a global division key for the row's division code. If there is, it was supposed to use that key; if not, grab the next value from the sequence. However, SQL Server will not let me use the NEXT VALUE FOR function in a subquery. There has to be a better solution out there than using a cursor to post-process the [DWH].[Division] table row by row.
Any help would be greatly appreciated!
You could do what you want with a trigger. But that is not a good idea.
There is a flaw in your data model. You should have a separate table for the global divisions, with one row per global division code and key. The key can be assigned as an identity column.
You can "hide" the underlying structure of the data by using indexed views, if you like. That is, you can have a view where the data looks like the results you want. Under the hood, though, there are two tables.

How should I create an SQL table with stock information so that I can add new stocks and new fields easily?

I want to create an SQL table where I can have any number of stocks (e.g. MSFT, GOOG, IBM) and any number of fields (e.g. Full Name, Sector, Country), but I want the flexibility to add new stocks and new fields as I go along. Say I want to add a new stock like AAPL, or a new boolean field for whether they pay dividends or not. I don't expect to store dynamic fields like CurrentStockPrice, but the information will have to change periodically, for instance when a company changes its dividend policy. How do I design the table so that I don't have to change its structure?
I had one idea where I could have a new table for each stock, and a master table that has all the stocks, and a pointer to each individual stock's table. That way, I can freely add new stocks, and new fields easily. But I'm not very familiar with SQL, and would like an expert opinion on how it should be implemented.
The simple answer is that your requirements are not a good fit for SQL. The most important concern is not how to store the data, but how you will retrieve it - what kind of query will you need to run?
EAV allows you to store data whose schema you don't know in advance, but it has lots of drawbacks when querying. Even moderately complex queries (find all stocks where a dividend was paid between 1 and 12 Jan, in the tech sector, whose CEO is female) run into a lot of complexity.
Creating a new table for each type of record very quickly gets crazy too - imagine the query above if you have to search dozens or hundreds of type-specific tables.
The relational model works best when you know the schema of the information in advance.
If you don't know the schema, consider using a NoSQL solution, or use SQL Server's support for XML or JSON. Store the fixed data in rows & columns, and the variable data in XML or JSON. Performance for searching is pretty good, and it's much less convoluted as a solution.
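For example, a hybrid layout along those lines might look like the sketch below (SQL Server 2016 or later; the dbo.Stock table and the attribute names are invented for illustration):
-- Fixed attributes as ordinary columns, variable attributes in one JSON column.
CREATE TABLE dbo.Stock (
    StockId  INT IDENTITY(1,1) PRIMARY KEY,
    Symbol   VARCHAR(10)   NOT NULL,
    FullName NVARCHAR(100) NOT NULL,
    Extra    NVARCHAR(MAX) NULL CHECK (ISJSON(Extra) = 1)  -- e.g. {"Sector":"Tech","PaysDividend":true}
);

-- Query a variable attribute without changing the schema.
SELECT Symbol, JSON_VALUE(Extra, '$.Sector') AS Sector
FROM dbo.Stock
WHERE JSON_VALUE(Extra, '$.PaysDividend') = 'true';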
Just to expand on my comment, because the question itself invites a couple of common schema anti-patterns. Some hybrid of EAV may actually be a good fit if you are willing to give up some flexibility and simplicity in your SQL and you aren't looking for fast queries.
EAV
EAV, or Entity-Attribute-Value is a design where, in your case, you would have a master table of stocks with some common attributes, or maybe even ticker info with a datetime. Something like:
+---------+--------+--------------+
| stockid | symbol | name |
+---------+--------+--------------+
| 1 | goog | Google |
| 2 | msft | Microsoft |
| 3 | gpro | GoPro |
| 4 | xom | Exxon Mobile |
+---------+--------+--------------+
And a second table (the EAV table) to store ever changing attributes:
+---------+-----------+------------+
| stockid | attribute | value |
+---------+-----------+------------+
| 1 | country | us |
| 1 | favorite | TRUE |
| 1 | startyear | 2004 |
| 3 | favorite | |
| 3 | bobspick | TRUE |
| 4 | country | us |
| 3 | country | us |
| 2 | startyear | 1986 |
| 2 | employees | 18000 |
| 3 | marketcap | 1850000000 |
+---------+-----------+------------+
And perhaps a third table to get that minute by minute ticker info stored:
+---------+----------------+--------+
| stockid | datetime | value |
+---------+----------------+--------+
| 1 | 9/21/2016 8:15 | 771.41 |
| 1 | 9/21/2016 8:14 | 771.39 |
| 1 | 9/21/2016 8:12 | 771.37 |
| 1 | 9/21/2016 8:10 | 771.35 |
| 1 | 9/21/2016 8:08 | 771.33 |
| 1 | 9/21/2016 8:06 | 771.31 |
| 1 | 9/21/2016 8:04 | 771.29 |
| 2 | 9/21/2016 8:15 | 56.81 |
| 2 | 9/21/2016 8:14 | 56.82 |
| 2 | 9/21/2016 8:12 | 56.83 |
| 2 | 9/21/2016 8:10 | 56.84 |
+---------+----------------+--------+
Generally this is considered not great design since stitching data back together in a format like:
+-------------+-----------+---------+-----------+----------+--------------+
| stocksymbol | stockname | country | startyear | bobspick | currentvalue |
+-------------+-----------+---------+-----------+----------+--------------+
causes you to write a query that is not fun to look at:
SELECT
    stocks.symbol AS stocksymbol,
    stocks.name,
    country.value,
    bobspick.value,
    startyear.value,
    stockvalue.stockvalue
FROM
    stocks
    LEFT OUTER JOIN (SELECT stockid, value FROM fieldsTable WHERE attribute = 'country') AS country
        ON stocks.stockid = country.stockid
    LEFT OUTER JOIN (SELECT stockid, value FROM fieldsTable WHERE attribute = 'bobspick') AS bobspick
        ON stocks.stockid = bobspick.stockid
    LEFT OUTER JOIN (SELECT stockid, value FROM fieldsTable WHERE attribute = 'startyear') AS startyear
        ON stocks.stockid = startyear.stockid
    LEFT OUTER JOIN (SELECT stockid, MAX(value) AS stockvalue FROM tickerTable GROUP BY stockid) AS stockvalue
        ON stocks.stockid = stockvalue.stockid
WHERE stocks.symbol IN ('goog', 'msft')
You can see that every "field" in the EAV table gets its own subquery, which means we read that table from storage three times. We gain the flexibility on the front end over the database design, but we lose flexibility when querying.
Imagine a more traditional schema:
+---------+--------+--------------+---------+----------+----------+-----------+------------+-----------+
| stockid | symbol | name | country | bobspick | favorite | startyear | marketcap | employees |
+---------+--------+--------------+---------+----------+----------+-----------+------------+-----------+
| 1 | goog | Google | us | | TRUE | 2004 | | |
| 2 | msft | Microsoft | | | | 1986 | | 18000 |
| 3 | gpro | GoPro | us | TRUE | | | 1850000000 | |
| 4 | xom | Exxon Mobile | us | | | | | |
+---------+--------+--------------+---------+----------+----------+-----------+------------+-----------+
and
+---------+----------------+--------+
| stockid | datetime | value |
+---------+----------------+--------+
| 1 | 9/21/2016 8:15 | 771.41 |
| 1 | 9/21/2016 8:14 | 771.39 |
| 1 | 9/21/2016 8:12 | 771.37 |
| 1 | 9/21/2016 8:10 | 771.35 |
| 1 | 9/21/2016 8:08 | 771.33 |
| 1 | 9/21/2016 8:06 | 771.31 |
| 1 | 9/21/2016 8:04 | 771.29 |
| 2 | 9/21/2016 8:15 | 56.81 |
| 2 | 9/21/2016 8:14 | 56.82 |
| 2 | 9/21/2016 8:12 | 56.83 |
| 2 | 9/21/2016 8:10 | 56.84 |
+---------+----------------+--------+
To get the same results:
SELECT
    stocks.symbol AS stocksymbol,
    stocks.name,
    stocks.country,
    stocks.bobspick,
    stocks.startyear,
    stockvalue.stockvalue
FROM
    stocks
    LEFT OUTER JOIN (SELECT stockid, MAX(value) AS stockvalue FROM tickerTable GROUP BY stockid) AS stockvalue
        ON stocks.stockid = stockvalue.stockid
WHERE stocks.symbol IN ('goog', 'msft')
Now we have the flexibility in the query where we can quickly change out fields without monkeying around in subqueries, but we have to hassle our DBA every time we want to add a field.
There is a further abstraction from EAV that is definitely something to avoid. I don't know if it has a name, but I call it "database in a database". Here you have a table of tables, a table of fields, and a table of values. The entire schema is kept as records, as are the values that would be stored in that schema. Ultimate flexibility is gained, but the SQL you will write to get at your data will be nightmarish, and your query speeds will degrade at a fast rate as you add to your data/schema/data/schema mess.
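Just to make that anti-pattern concrete, a sketch of what such a schema tends to look like (table and column names are illustrative only):
-- A table of "tables", a table of "fields", and a table of "values".
CREATE TABLE MetaTable (TableId INT PRIMARY KEY, TableName VARCHAR(50) NOT NULL);
CREATE TABLE MetaField (FieldId INT PRIMARY KEY, TableId INT REFERENCES MetaTable (TableId), FieldName VARCHAR(50) NOT NULL);
CREATE TABLE MetaValue (RowId INT NOT NULL, FieldId INT REFERENCES MetaField (FieldId), Value NVARCHAR(MAX));
-- Every read has to join through all three tables and then pivot the result,
-- which is exactly the nightmarish SQL described above.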
As for your last idea of adding a new table for each stock: if the fields you are going to track for each stock are different (startyear, employees, and marketcap for one stock and marketmax, country, address, yearsinbusiness for another) and you aren't planning on adding new stocks often, then it may be a good fit. I'm betting, though, that the attributes/fields you track on stock1 are also going to be tracked on stock2, and therefore suggest that you should have a single stock table with all those common attributes and maybe an EAV table to track attributes that are particular to each stock, so you can have the flexibility you need.
In each of these schemas I would also suggest that you put your ticker data in its own table. Whether you are capturing ticker data by the minute, hour, day, week, or month, because it's datetime-level data, it deserves its own table. (Unless you are only going to track the most current value, in which case it becomes a field.)
If you want to add fields dynamically, but without actually altering the schema of the table, then you should use a vertical schema for the table and retrieve the data via a PIVOT statement.
In this manner you can add as many Field/Value pairs as you wish for each stock/customer pairing.
The basic table would have 5 columns perhaps:
ID (Identity); StockName; AttributeName; Value; Timestamp;
If you take a look at how SQL Server organizes its table metadata in INFORMATION_SCHEMA.COLUMNS, it provides this very same vertical schema layout for you.
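A minimal sketch of that vertical table and a PIVOT over it; the StockAttribute table name and the attribute names in the IN list are examples only:
CREATE TABLE StockAttribute (
    ID            INT IDENTITY(1,1) PRIMARY KEY,
    StockName     VARCHAR(10)   NOT NULL,
    AttributeName VARCHAR(50)   NOT NULL,
    Value         NVARCHAR(200) NULL,
    [Timestamp]   DATETIME2     NOT NULL DEFAULT SYSUTCDATETIME()
);

-- Turn the Field/Value pairs back into columns for reporting.
SELECT StockName, [FullName], [Sector], [PaysDividend]
FROM (SELECT StockName, AttributeName, Value FROM StockAttribute) AS src
PIVOT (MAX(Value) FOR AttributeName IN ([FullName], [Sector], [PaysDividend])) AS pvt;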

Query to compare values across different tables?

I have a pair of models in my Rails app that I'm having trouble bridging.
These are the tables I'm working with:
states
+----+--------+------------+
| id | fips | name |
+----+--------+------------+
| 1 | 06 | California |
| 2 | 36 | New York |
| 3 | 48 | Texas |
| 4 | 12 | Florida |
| 5 | 17 | Illinois |
| … | … | … |
+----+--------+------------+
places
+----+--------+
| id | place |
+----+--------+
| 1 | Fl |
| 2 | Calif. |
| 3 | Texas |
| … | … |
+----+--------+
Not all places are represented in the states model, but I'm trying to perform a query where I can compare a place's place value against all state names, find the closest match, and return the corresponding fips.
So if my input is Calif., I want my output to be 06
I'm still very new to writing SQL queries, so if there's a way to do this using Ruby within my Rails (4.1.5) app, that would be ideal.
My other plan of attack was to add a fips column to the "places" table and write something that would run the above comparison and then populate fips, so my app doesn't have to run this query every time the page loads. But I'm very much a beginner, so that sounds... ambitious.
This is not an easy query in SQL. Your best bet is one of the fuzzy string matching routines, which are documented here.
For instance, soundex() or levenshtein() may be sufficient for what you want. Here is an example:
select distinct on (p.place) p.place, s.name, s.fips, levenshtein(p.place, s.name) as dist
from places p cross join
states s
order by p.place, dist asc;
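Note that in PostgreSQL both soundex() and levenshtein() come from the fuzzystrmatch extension, so it has to be enabled once per database before the query above will run:
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;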

PostgreSQL Inner Join on the same table + second table?

If this is a stupid question, forgive me, I'm not very familiar with PostgreSQL.
I've collected inventory data from used car dealerships in my area and stored it in a postgreSQL table. I've got a second table with particular details regarding certain makes and models. For example:
The dealership table is structured like so:
-----------------------------------------
| Dealership | Make | Model | Year | ID |
-----------------------------------------
| A | Ford | F250 | 2003 | 1 |
| A | Chevy| Cobalt| 2005 | 2 |
| B | Ford | F250 | 2003 | 1 |
| B | Dodge| Chrgr | 2012 | 3 |
-----------------------------------------
The details table is structured like so:
-----------------------------------------
| ID | DetailA| DetailB| DetailC|
-----------------------------------------
| 1 | data | data | data |
| 2 | data | data | data |
| 3 | data | data | data |
| 4 | data | data | data |
-----------------------------------------
My goal is to retrieve vehicle matches from multiple dealerships and display the appropriate details. In the above example, I would like to see:
-----------------------------------------------------
| Make | Model | Year | DetailA | DetailB | DetailC |
-----------------------------------------------------
| Ford | F250 | 2003 | data | data | data |
-----------------------------------------------------
With this result, I will know that both A and B have a 2003 Ford F250 for sale, and can view the related details of the vehicle.
I've tried many different queries, but most are variations on something like this:
SELECT DISTINCT
dealership_table.make,
dealership_table.model,
dealership_table.year,
details_table.detaila,
details_table.detailb,
details_table.detailc
FROM
dealership_table
INNER JOIN
details_table
ON
dealership_table.id = details_table.id
WHERE
dealership_table.dealership = 'A'
OR
dealership_table.dealership = 'B'
However, this returns all of the distinct matches from the table where the dealership is either A or B. I've tried multiple inner joins, but I get an error complaining that details_table is specified multiple times.
If I'm doing something really silly, I apologize. Like I said before, I'm pretty much an SQL noob.
What am I doing wrong? How should I go about retrieving the desired results? Any suggestions, solutions, or advice is greatly appreciated!
You can write:
SELECT dealership_table1.make,
dealership_table1.model,
dealership_table1.year,
details_table.detaila,
details_table.detailb,
details_table.detailc
FROM dealership_table dealership_table1
JOIN dealership_table dealership_table2
ON dealership_table1.make = dealership_table2.make
AND dealership_table1.model = dealership_table2.model
AND dealership_table1.year = dealership_table2.year
JOIN details_table
ON dealership_table1.id = details_table.id
WHERE dealership_table1.dealership = 'A'
AND dealership_table2.dealership = 'B'
;
(Note that the FROM dealership_table dealership_table1 and JOIN dealership_table dealership_table2 set up distinct "aliases", so you can use the same table multiple different times in the same query without getting name-conflicts.)
I may be misunderstanding your table layout, but I think you should consider changing to a different structure. Here's what I would propose:
Vehicle:
----------------------------
| ID | Make | Model | Year |
----------------------------
| 1 | Ford | F250 | 2003 |
| 2 | Chevy| Cobalt| 2005 |
| 3 | Dodge| Chrgr | 2012 |
----------------------------
Dealership:
----------------------------
| Dealership | ID | Detail |
----------------------------
| A | 1 | data |
| A | 2 | data |
| B | 1 | data |
| B | 3 | data |
----------------------------
This way you're not storing vehicle information (make/model/year) in more than one place.
Here's how you would write your desired query given the above schema:
SELECT Make, Model, Year, A.Detail, B.Detail, C.Detail
FROM Vehicle V
LEFT OUTER JOIN Dealership A on A.Dealership = 'A' and A.id = V.id
LEFT OUTER JOIN Dealership B on B.Dealership = 'B' and B.id = V.id
LEFT OUTER JOIN Dealership C on C.Dealership = 'C' and C.id = V.id
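If you only want vehicles that both A and B actually stock (as in the example output), a small variation of the same idea with inner joins would do it; this is just a sketch against the proposed schema:
SELECT V.Make, V.Model, V.Year, A.Detail, B.Detail
FROM Vehicle V
JOIN Dealership A ON A.Dealership = 'A' AND A.ID = V.ID
JOIN Dealership B ON B.Dealership = 'B' AND B.ID = V.ID;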