Finding Outliers In SQL

I am very new to SQL and have my data in an Access database (~50k rows) with the following structure
State  Year  Date    Price
CA     2012  1/2/13  5.00
NY     2013  1/2/13  6.00
NY     2013  1/7/13  7.00
A (State, Year) pair, though held in different columns here, represents a vintage (like a wine). So we talk about how the price of "CA 2012" moves throughout the year.
Because some of our data is entered manually into this database, there is opportunity for error. We would like to write a query that flags any suspicious entries for further review.
I have read many different questions and threads on the subject but have not found anything that addresses my main concern, which is how to find local outliers: the price can move up and down, so a price that is fine for one date range may be an outlier earlier in the year.
Update: I chunked my data into buckets of months so finding local outliers might be easier as a result of that. I'm still looking for good outlier detection methods I can implement in SQL.

Sometimes simple is best; there is no need for an intro to statistics yet. I would recommend starting with simple grouping. Within a grouped query you can get the average, the minimum, the maximum and other useful bits of data. Here are a couple of examples to get you started:
SELECT Table1.State, Table1.Yr, Count(Table1.Price) AS CountOfPrice, Min(Table1.Price) AS MinOfPrice, Max(Table1.Price) AS MaxOfPrice, Avg(Table1.Price) AS AvgOfPrice
FROM Table1
GROUP BY Table1.State, Table1.Yr;
Or (in case you want month data included)
SELECT Table1.State, Table1.Yr, Month([Dt]) AS Mnth, Count(Table1.Price) AS CountOfPrice, Min(Table1.Price) AS MinOfPrice, Max(Table1.Price) AS MaxOfPrice
FROM Table1
GROUP BY Table1.State, Table1.Yr, Month([Dt]);
Obviously you'll need to modify the table and field names. (Just so you know, 'Year' and 'Date' are both reserved words and are best not used as field names.)
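If the grouped min/max/average is not enough to spot local outliers, one possible next step is to compare each price with the median of its own (State, Yr, Month) bucket. The sketch below is in Python rather than Access SQL, because Access has no built-in median. It assumes the rows have already been exported as (state, yr, month, price) tuples, and the function name flag_local_outliers and the threshold k=5 are illustrative choices, not anything from the question.

```python
from collections import defaultdict
from statistics import median


def flag_local_outliers(rows, k=5.0):
    """Return rows whose price is far from the median of their bucket.

    rows: iterable of (state, yr, month, price) tuples (assumed layout).
    k:    how many median absolute deviations count as suspicious
          (an arbitrary illustrative threshold).
    """
    # Bucket prices by vintage and month so "normal" is judged locally.
    groups = defaultdict(list)
    for state, yr, month, price in rows:
        groups[(state, yr, month)].append(price)

    flagged = []
    for (state, yr, month), prices in sorted(groups.items()):
        center = median(prices)
        # Median absolute deviation: a spread measure that one bad
        # entry cannot inflate the way it inflates a standard deviation.
        mad = median([abs(p - center) for p in prices])
        if mad == 0:
            continue  # all prices (nearly) identical; nothing to compare
        for p in prices:
            if abs(p - center) > k * mad:
                flagged.append((state, yr, month, p))
    return flagged


sample = [
    ("CA", 2012, 1, 5.0), ("CA", 2012, 1, 5.1), ("CA", 2012, 1, 4.9),
    ("CA", 2012, 1, 5.2), ("CA", 2012, 1, 5.0), ("CA", 2012, 1, 50.0),
    ("NY", 2013, 1, 6.0), ("NY", 2013, 1, 7.0), ("NY", 2013, 1, 6.5),
    ("NY", 2013, 1, 6.8),
]
suspicious = flag_local_outliers(sample)
print(suspicious)
```

The median is used instead of the average because a single mistyped price (50.00 instead of 5.00) drags the average and standard deviation towards itself, which can hide the very entry you are looking for.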

Related

Power pivot ytd calculation

Ok, I have watched many videos and read all sorts and I think I am nearly there, but must be missing something. In the data model I am trying to add the ytd calc to my product_table. I don't have unique dates in the product_table in column a and also they are weekly dates. I have all data for 2018 for each week of this year in set rows of 20, incrementing by one week every 20 rows. E.g. rows 1-20 are 01/01/2018, rows 21-40 are 07/01/2018, and so on.
Whilst I say they are in set rows of 20, this is only an example. Some weeks have more or fewer than 20 rows, so I can't use the row count function.
Between columns c and h I have a bunch of other categories such as customer age, country etc. so there isn't a unique identifier. Do I need one for this to work? Column i is the sales column with the numbers. What I would like is a new column which gives me a ytd number for each row of data which all has unique criteria between a and h. Week 1 ytd is not going to be any different. For the next 20 rows I want it to add week1 sales to week2 sales, effectively giving me the ytd.
I could SUMPRODUCT this easily in the data set but I don't want to do that. I want to use DAX to save space etc.
I have a date_table which does have unique dates in the main_date column. All my date columns are formatted as date in the data model.
I have tried:
=calculate(products[sales],datesytd(date_table[main_date]))
This simply replicates the numbers in the sales column rather than giving me a YTD as required. I also tried
=calculate(sum(products[sales]) ,datesytd(date_table[main_date]))
I don't know if what I am trying to do is possible. All the youtube clips don't seem to have the same issues I am having but I think they have unique dates in their data sets.
I'd love to upload the data, but it's work stuff on a work computer so I can't really. I hope I've painted the picture clearly.

Resolved. After googling "sumif dax", I found a response from Mike Honey that I adapted to get what I need. I needed to add the FILTER and EARLIER functions to my equation, and it ended up like this:
Calculate (sum(products[sales]),
filter (sales, sales[we_date] <=earlier(sales[we_date])),
filter (sales, sales[year] =earlier(sales[year])),
filter (sales, sales[customer] =earlier(sales[customer])))
There are three other filter sections I had to add, but this now gives me the YTD I needed.
Hope this helps anyone else.
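For anyone who wants to check what the formula above should return, here is a rough Python sketch of the same logic: for each row, sum the sales of every row with the same year and customer and a week-ending date on or before the current one. The field names we_date, customer and sales follow the post; the function name add_ytd and the sample data are made up for illustration, and only one of the grouping columns is shown.

```python
from collections import defaultdict
from datetime import date


def add_ytd(rows):
    """Return copies of rows with a 'ytd' value added.

    rows: list of dicts with 'we_date' (date), 'customer' and 'sales'.
    """
    # Total sales per (year, customer, week-ending date).
    weekly = defaultdict(float)
    for r in rows:
        weekly[(r["we_date"].year, r["customer"], r["we_date"])] += r["sales"]

    # Walk the weeks in date order and keep a running total per
    # (year, customer), which restarts automatically each year.
    running = defaultdict(float)
    ytd = {}
    for year, customer, we_date in sorted(weekly):
        running[(year, customer)] += weekly[(year, customer, we_date)]
        ytd[(year, customer, we_date)] = running[(year, customer)]

    # Rows sharing the same week and customer all get the same YTD,
    # as they would with the <= EARLIER(...) filter.
    return [
        dict(r, ytd=ytd[(r["we_date"].year, r["customer"], r["we_date"])])
        for r in rows
    ]


data = [
    {"we_date": date(2018, 1, 1), "customer": "A", "sales": 10},
    {"we_date": date(2018, 1, 1), "customer": "B", "sales": 3},
    {"we_date": date(2018, 1, 8), "customer": "A", "sales": 5},
    {"we_date": date(2018, 1, 8), "customer": "B", "sales": 4},
    {"we_date": date(2018, 1, 15), "customer": "A", "sales": 7},
]
result = add_ytd(data)
for row in result:
    print(row["we_date"], row["customer"], row["sales"], row["ytd"])
```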

Writing equations in SQL using multiple variables

I'm trying to use data that is labeled by year (2012 - 2016) to calculate CAGR. The data was originally in one column indicating the total population, with another column indicating the year. I've isolated the 2012 and 2016 data into two separate columns and am trying to use SQL to calculate the CAGR: ((data from 2016)/(data from 2012))^(1/4) - 1.
Is this the correct way to calculate CAGR/cumulative growth? I've tried simply using the two columns of data, but because they are mismatched and contain nulls, it doesn't work. Please let me know if you have any ideas.

Compound Annual Growth Rate (CAGR) doesn't really lend itself to what you're trying to do.
Usually this is used when you, say, invest $1000 in a fund and calculate the annual growth based on the ending value.
Example - if you invest $1000 and in 5 years it's worth $5000:
(5,000 / 1,000)^(1/5) - 1 = 0.37973 = 37.97%
If I was to write that in SQL Server it would be:
SELECT POWER(5000.0/1000.0, 1.0/5.0) - 1.0
You can replace the 5000 and 1000 to be the specific columns you want to compare, or a range of data you need to compare.
If you elaborate your question I will update this answer.
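As a way to sanity-check the arithmetic and the null problem mentioned in the question, here is a small Python sketch. It assumes the 2012 and 2016 values have been pulled into two lookups keyed by some identifier; the names cagr, pop_2012 and pop_2016 and the sample figures are hypothetical. Rows with a missing or non-positive starting value return None instead of failing.

```python
def cagr(start_value, end_value, periods):
    """Compound annual growth rate, or None when it cannot be computed."""
    if start_value is None or end_value is None:
        return None  # one side of the pair is missing
    if start_value <= 0 or end_value < 0 or periods <= 0:
        return None  # growth rate is undefined for these inputs
    return (end_value / start_value) ** (1.0 / periods) - 1.0


# The worked example from the answer: $1000 growing to $5000 in 5 years.
example = cagr(1000.0, 5000.0, 5)
print(round(example, 5))

# Hypothetical 2012 and 2016 values keyed by an identifier; "B" has a
# null start value and "C" has no 2016 row at all.
pop_2012 = {"A": 1000.0, "B": None, "C": 200.0}
pop_2016 = {"A": 5000.0, "B": 300.0}

growth = {
    key: cagr(pop_2012.get(key), pop_2016.get(key), 4)  # 2012 -> 2016 is 4 periods
    for key in sorted(set(pop_2012) | set(pop_2016))
}
print(growth)
```

Matching the two years by key first (rather than by row position) is what avoids the "mismatched" columns problem; in SQL the equivalent would be a join on that key.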

MS Access query to convert values from one currency to another currency

Alright, I need a little assistance on a problem that I am facing.
I am working on a database project and have run into a problem regarding converting money values from a variety of different currencies into US Dollars.
The reason for my difficulty is that I need to maintain the original records in their original currency format but I also have to be able to convert these values to US Dollars, then perform a number of dynamic queries to sum up values of specific records and then output the final outcomes into a series of Reports.
I already have a table which contains all of my transactions (it includes a currency type field and several monetary value fields, 12 per record).
I have a second table which contains the reference list of currency types along with the necessary conversion rates over a 12-month period (so again 12 numeric fields), based on their relation to the US dollar (i.e. the US dollar entry is followed by 12 fields that all contain 1, for a 1-to-1 exchange rate).
I would like to be able to run a query which copies the records from my transactions table to a new table after converting them all to their US Dollar equivalent value. However, I am not an expert in writing such a query and would like some assistance. Is it possible to write a WHERE clause into an expression within a query so that it takes each record from transactions, finds the correct conversion rate for the correct month, does the math, and outputs that same record with the modified values to another table?
Or is there a way to perform this same function using a VBA script? If so what kind of recommendations would you make for that code?
UPDATE OF PROGRESS/SOLUTION
So after reviewing the solutions and comments here is the solution I came up with.
I built my exchange rates table (ExRates) in the format that I had intended: CurrencyName, followed by the conversion rate for each of the 12 months (this is due to having to work with existing database elements).
I then built the following two queries, Match and Convert:
SELECT ForcastTrans.*, ExRates.JanRate, ExRates.FebRate, ExRates.MarRate, ExRates.AprRate, ExRates.MayRate, ExRates.JunRate, ExRates.JulRate, ExRates.AugRate, ExRates.SepRate, ExRates.OctRate, ExRates.NovRate, ExRates.DecRate
FROM ForcastTrans, ExRates
WHERE ForcastTrans.Currency=ExRates.CurrencyName;
SELECT qryExRatematch.EntityID, qryExRatematch.Account, qryExRatematch.Currency, [qryExRatematch]![Month1]*[qryExRatematch]![JanRate] AS Jan, [qryExRatematch]![Month2]*[qryExRatematch]![FebRate] AS Feb, [qryExRatematch]![Month3]*[qryExRatematch]![MarRate] AS Mar, [qryExRatematch]![Month4]*[qryExRatematch]![AprRate] AS Apr, [qryExRatematch]![Month5]*[qryExRatematch]![MayRate] AS May, [qryExRatematch]![Month6]*[qryExRatematch]![JunRate] AS Jun, [qryExRatematch]![Month7]*[qryExRatematch]![JulRate] AS Jul, [qryExRatematch]![Month8]*[qryExRatematch]![AugRate] AS Aug, [qryExRatematch]![Month9]*[qryExRatematch]![SepRate] AS Sep, [qryExRatematch]![Month10]*[qryExRatematch]![OctRate] AS Oct, [qryExRatematch]![Month11]*[qryExRatematch]![NovRate] AS Nov, [qryExRatematch]![Month12]*[qryExRatematch]![DecRate] AS [Dec]
FROM qryExRatematch
ORDER BY qryExRatematch.EntityID, qryExRatematch.Account, qryExRatematch.Currency;
These got me the conversions that I needed and I can reconnect my reporting queries to these tables instead of the original ones I had done without the conversion.
Thank you everyone for your help, suggestions, and opinions. I credit Johnny Bones with this answer because his answer led me to the line of experimentation that helped me reach my solution.
Thanks again for all your help

Are your table layouts set in stone? The easiest way to do this is to set up your currency table with 3 fields:
CurrencyDate - The date of the new currency exchange rate
CurrencyName - The name of the currency (Yen, Pound, Franc, etc...)
CurrencyRate - The exchange rate on that day
Then you would set up a query called qryCurrentExchange where you would take the Max(CurrencyDate) for each currency. This will give you one query that holds the current exchange rate for each currency.
Create another query with your transaction table, and Inner Join the above query by the CurrencyName, and you should be able to pull in the exchange rate, which you would multiply by your currency field in your transaction table. You can either leave the query as-is or turn it into a Make Table query if you want to output the results to a table.
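To make the two steps concrete, here is a sketch of the "latest rate per currency, then inner join" idea. It runs against an in-memory SQLite database from Python purely so that it is self-contained; the SQL would need adapting for Access. The table name Txns, the Amount column and all sample rows are assumptions for illustration, and dates are stored as ISO text so that MAX picks the most recent one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ExRates (CurrencyDate TEXT, CurrencyName TEXT, CurrencyRate REAL);
CREATE TABLE Txns (EntityID INTEGER, Currency TEXT, Amount REAL);

INSERT INTO ExRates VALUES
  ('2013-01-01', 'Pound',     1.5),
  ('2013-02-01', 'Pound',     2.0),
  ('2013-01-01', 'Franc',     0.5),
  ('2013-01-01', 'US Dollar', 1.0);

INSERT INTO Txns VALUES
  (1, 'Pound', 100.0),
  (2, 'Franc', 40.0),
  (3, 'US Dollar', 7.0);
""")

# The inner subquery plays the role of qryCurrentExchange: one row per
# currency holding the date of its most recent rate.
converted = conn.execute("""
SELECT t.EntityID, t.Currency, t.Amount,
       t.Amount * r.CurrencyRate AS AmountUSD
FROM Txns AS t
INNER JOIN ExRates AS r
        ON r.CurrencyName = t.Currency
INNER JOIN (
        SELECT CurrencyName, MAX(CurrencyDate) AS LatestDate
        FROM ExRates
        GROUP BY CurrencyName
     ) AS cur
        ON cur.CurrencyName = r.CurrencyName
       AND cur.LatestDate = r.CurrencyDate
ORDER BY t.EntityID
""").fetchall()

for row in converted:
    print(row)
```

The original amounts stay untouched in the transaction table; the US Dollar value only exists in the query output, which is what lets you keep the records in their original currency.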

Structuring Databases for Financial Statements

I am looking for the best way to structure my database. I have quarterly financial statements for thousands of companies from 1997-2012. Each company has three different statements: an income statement, a balance sheet, and a cash flow statement.
I want to be able to perform calculations on the data such as adding up each quarter to get a yearly total for each line item on each statement.
I have tried two ways so far:
1) Storing each line item for each statement in its own table, i.e. Sales would be one table holding only sales data for all the companies I am tracking, with company as the primary key and each quarter's data as a separate column. This seems like the easiest way to work with the data, but updating the data each quarter is time-consuming as there are hundreds of tables.
Sales Table
Company   q32012  q22012  q12012
ABC Co.   500     100     202
XYZ Co.   230     302     202
2) The other option, which is a little easier to update but harder to work with, is to have a separate table for each company for each statement. For example, the income statement for Royal Bank would have its own table, with the primary column being the line item.
Income Statement for Royal Bank
Line_Item   q32012  q22012  q12012
Sales
Net Profit
The problem here is that when I try to annualize this data, I get a really ugly output due to the GROUP BY:
SELECT
(CASE WHEN Line_Item = 'Sales' THEN SUM(q4 + q3 + q2 + q1) ELSE '0' END) AS Sales2012,
(CASE WHEN Line_Item = 'NetProfit' THEN SUM(q4 + q3 + q2 + q1)
ELSE '0' END) AS Inventories2012
FROM dbo.[RoyalBankIncomeStatement]
GROUP BY Line_Item
Any help would be appreciated.

Whenever I've had to build a database for fiscal reports by fiscal quarter, month, or year or whatever, I've found it convenient to borrow a concept from star schema design and data warehousing, even if I'm not really building a DW.
The borrowed concept is to have a table, let's call it ALMANAC, that has one row for each date, keyed by the date. In this case a natural key works out well. Dependent attributes in here can be what fiscal month and quarter the date belongs to, whether the date was one where the enterprise was open for business (TRUE or FALSE), and whatever other quirks are in the company calendar.
Then, you need a computer program that just generates this table out of nothing. All the strange rules for the company calendar are embedded in this one program. The ALMANAC can cover a ten year period in a little over 3,650 rows. That's a tiny table.
Now every date in the operational data can be used like a foreign key into the ALMANAC table, provided you consistently use the Date datatype for dates. For example, each sale has a date of sale. Then aggregating by fiscal quarter, or fiscal year, or whatever you like is just a matter of joining operational data with the ALMANAC, and using GROUP BY and SUM() to get the aggregate you want.
It's simple, and it makes generating a whole raft of time period reports a breeze.
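Here is a rough sketch of the "program that generates the table out of nothing", written in Python. The calendar rules are assumptions chosen to keep it short: the fiscal year equals the calendar year, quarters are three calendar months, and the enterprise is open Monday to Friday. The names build_almanac and the attribute keys are illustrative only; a real company calendar would put its own quirks here.

```python
from collections import defaultdict
from datetime import date, timedelta


def build_almanac(start, end):
    """Return {date: attributes} with one entry per day from start to end."""
    almanac = {}
    day = start
    while day <= end:
        almanac[day] = {
            "fiscal_year": day.year,                     # assumption: FY = calendar year
            "fiscal_quarter": (day.month - 1) // 3 + 1,  # assumption: calendar quarters
            "open_for_business": day.weekday() < 5,      # assumption: Mon-Fri only
        }
        day += timedelta(days=1)
    return almanac


almanac = build_almanac(date(2012, 1, 1), date(2012, 12, 31))

# Operational data: each sale carries a date that acts as a key into the almanac.
sales = [
    (date(2012, 1, 15), 100.0),
    (date(2012, 3, 31), 50.0),
    (date(2012, 4, 1), 25.0),
    (date(2012, 11, 30), 10.0),
]

# The Python equivalent of joining to ALMANAC and doing GROUP BY / SUM().
by_quarter = defaultdict(float)
for sale_date, amount in sales:
    entry = almanac[sale_date]
    by_quarter[(entry["fiscal_year"], entry["fiscal_quarter"])] += amount

print(dict(by_quarter))
```

In a database the same rows would simply be inserted into the ALMANAC table once, and every report would join to it instead of recomputing quarter boundaries.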

My advice to you is to think about not using a SQL database to do this. Instead, think of using something like SQL Server Analysis Services (SSAS). If you want to get a quick start with SSAS, I recommend getting up to speed on PowerPivot for Excel. You can take the model you develop in PowerPivot and import it into SSAS when you're ready.
Why don't I recommend SQL? Because you're going to have a problem aggregating accounts in SQL Server. For example, your balance sheets aren't going to be something you're going to be able to aggregate easily in SQL -- Asking SQL Server for the "Cash" for 2010, for example means that you want to get the entry for the end of December 2010, not that you want to SUM all of the entries for Cash for that year (which would be a nonsense number). On the other hand, with income and expense accounts such as those which would appear on your income statements, you would want to SUM those values up. To make matters worse, some reports are going to have a mix of account types on them, which is going to make reporting quite difficult.
SSAS has provisions inside the product where it "knows" how to aggregate for your reports based on account type, and there are many tutorials out there which can show you how to set this up.
Either way, you're going to need to store your data somewhere before it goes into your reporting system or Analysis Services cube. In order to do that, you should structure your data something like this. Let's say you're storing your data in a table called Reports:
Reports
--------
[Effective Date]
[CompanyID]
[AccountID]
[Amount]
Your Account table would have the description of what you're trying to store (income, expenses, etc). Your [Effective Date] column would link back into a Dates table which describes to which year, quarter, etc., your data belongs. In essence, what I'm describing is a classic shape for reporting databases, called a star schema.
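To illustrate why the account type matters when aggregating, here is a small Python sketch over rows shaped like the Reports table above. The account types ("flow" for income and expense accounts, "balance" for balance sheet accounts), the function name annual_value and the sample figures are all assumptions made up for this example; SSAS handles this through its own account-type settings rather than code like this.

```python
from datetime import date

# Hypothetical account metadata: how each account should roll up over time.
ACCOUNT_TYPES = {"Sales": "flow", "Cash": "balance"}

# Rows shaped like the Reports table: (effective_date, company_id, account_id, amount).
reports = [
    (date(2010, 3, 31), 1, "Sales", 100.0),
    (date(2010, 6, 30), 1, "Sales", 120.0),
    (date(2010, 9, 30), 1, "Sales", 90.0),
    (date(2010, 12, 31), 1, "Sales", 110.0),
    (date(2010, 3, 31), 1, "Cash", 50.0),
    (date(2010, 6, 30), 1, "Cash", 60.0),
    (date(2010, 9, 30), 1, "Cash", 55.0),
    (date(2010, 12, 31), 1, "Cash", 70.0),
]


def annual_value(rows, company_id, account_id, year):
    """Yearly figure for one account, aggregated according to its type."""
    matching = sorted(
        (d, amount)
        for d, company, account, amount in rows
        if company == company_id and account == account_id and d.year == year
    )
    if not matching:
        return None
    if ACCOUNT_TYPES[account_id] == "balance":
        return matching[-1][1]  # balance sheet item: take the year-end entry
    return sum(amount for _, amount in matching)  # flow item: add the periods up


sales_2010 = annual_value(reports, 1, "Sales", 2010)
cash_2010 = annual_value(reports, 1, "Cash", 2010)
print(sales_2010, cash_2010)
```

Summing the four Cash entries would give 235, which is the "nonsense number" described above; the year-end balance of 70 is the meaningful figure.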

I would probably go with the following structure in one data table:
Company
StatementType
LineItem
FiscalYear
Q1, Q2, Q3, Q4
StatementType would be Income Statement, Balance Sheet or Cash Flow Statement. Line Item would be the coded/uncoded text of the item on the statement, Fiscal Year is 2012, 2011 and so on. You'd still need to make sure that Line Items are consistent across companies.
This structure would let you query for a flat statement:
select
LineItem, Q1, Q2, Q3, Q4
from Data
where
Company = 'Royal Bank'
and FiscalYear = 2012
and StatementType = 'Income Statement'
or a QoQ-style comparison of one line item across years:
select
FiscalYear,
Q1
from Data
where
Company = 'Royal Bank'
and StatementType = 'Income Statement'
and LineItem = 'Sales'
order by FiscalYear
in addition to aggregates. You'd probably also want another table for line items, with some kind of index reference, so that you can pull the statement back in the original order of line items.
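As an example of the aggregates mentioned, the annual total the question asks about becomes a plain column expression with this layout, with no CASE or GROUP BY needed. The sketch below runs the query against an in-memory SQLite database from Python so it is self-contained; the sample figures are invented, and the table and column names are the ones suggested in this answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Data (
    Company TEXT, StatementType TEXT, LineItem TEXT, FiscalYear INTEGER,
    Q1 REAL, Q2 REAL, Q3 REAL, Q4 REAL
);
INSERT INTO Data VALUES
  ('Royal Bank', 'Income Statement', 'Sales',      2012, 100, 110, 120, 130),
  ('Royal Bank', 'Income Statement', 'Net Profit', 2012,  10,  11,  12,  13),
  ('Royal Bank', 'Income Statement', 'Sales',      2011,  90,  95, 100, 105);
""")

# One output row per line item and fiscal year; the yearly total is just
# the sum of the four quarter columns on that row.
annual = conn.execute("""
SELECT LineItem, FiscalYear, Q1 + Q2 + Q3 + Q4 AS AnnualTotal
FROM Data
WHERE Company = 'Royal Bank'
  AND StatementType = 'Income Statement'
ORDER BY FiscalYear, LineItem
""").fetchall()

for row in annual:
    print(row)
```

Adding the quarters together only makes sense for income statement and cash flow items; for balance sheet items you would normally read the Q4 column instead.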

SQL YTD for previous years and this year

Wondering if anyone can help with the code for this.
I want to query the data and get two entries: one for the previous year's YTD and one for this year's YTD.
The only way I know how to do this is as two separate queries with WHERE clauses. I would prefer not to have to run the query twice.
Ideally there would be one column called DatePeriod, populated with 2011 YTD and 2012 YTD. It would be even better if I could get 2011 YTD, 2012 YTD, 2011 Total and 2012 Total, though I'm guessing this is four queries.
Thanks
EDIT:
In response to help clear a few things up:
This is being coded in MS SQL.
The data looks like so: (very basic example)
Date | Call_Volume
1/1/2012 | 4
What I would like is to have the Call_Volume summed up. I have queries that group it by week, and others that do it by month. I could pull all the dailies in and do this in Excel, but the table has millions of rows, so it is always best to reduce the size of my output.
I currently group by week, month and year and UNION ALL the results so it's one output. That means I have three queries accessing the same table, which is a pain and slow, but I can live with it. Now I also need a YTD, so it's either one more query or, ideally, a way to add it to the yearly query:
So
DatePeriod | Sum_Calls
2011 Total | 40
2011 YTD | 12
2012 Total | 45
2012 YTD | 15
Hope this makes any sense.

SQL is built to do operations on rows, not columns (you select columns, of course, but aggregate operations are all on rows).
The most standard approach to this is something like:
SELECT SUM(your_table.sales), YEAR(your_table.sale_date)
FROM your_table
GROUP BY YEAR(your_table.sale_date)
Now you'll get one row for each year on record, with no limit to how many years you can process. If you're already grouping by another field, that's fine; you'll then get one row for each year in each of those groups.
Your program can then iterate over the rows and organize/render them however you like.
If you absolutely, positively must have columns instead, you'll be stuck with something like this:
SELECT SUM(IF(YEAR(date) = 2011, sales, 0)) AS total_2011,
SUM(IF(YEAR(date) = 2012, sales, 0)) AS total_2012
FROM your_table
If you're building the query programmatically you can add as many of those column criteria as you need, but I wouldn't count on this running very efficiently.
(These examples are written with some MySQL-specific functions. Corresponding functions exist for other engines but the syntax would be a little different.)
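The same conditional-sum idea can give both the yearly total and the YTD figure in one pass over the table. The sketch below is only an illustration: it runs on an in-memory SQLite database from Python, so the date functions (strftime) are SQLite's and would need replacing with the equivalents in your engine, although SUM(CASE WHEN ... END) itself is standard SQL. The table name Calls, the column CallDate, the cutoff of 31 March and the sample rows are all assumptions, chosen so the totals match the figures in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Calls (CallDate TEXT, Call_Volume INTEGER);
INSERT INTO Calls VALUES
  ('2011-01-15', 12), ('2011-09-01', 28),
  ('2012-01-20', 15), ('2012-10-05', 30);
""")

# "Year to date" here means: on or before this month-day in each year.
cutoff = "03-31"

rows = conn.execute("""
SELECT strftime('%Y', CallDate) AS Yr,
       SUM(Call_Volume) AS Total,
       SUM(CASE WHEN strftime('%m-%d', CallDate) <= ?
                THEN Call_Volume ELSE 0 END) AS YTD
FROM Calls
GROUP BY Yr
ORDER BY Yr
""", (cutoff,)).fetchall()

# Reshape one row per year into the DatePeriod / Sum_Calls layout.
report = []
for yr, total, ytd in rows:
    report.append((f"{yr} Total", total))
    report.append((f"{yr} YTD", ytd))

for line in report:
    print(line)
```

Because both sums come from the same GROUP BY, the table is scanned once instead of once per figure.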
