Count instances within a date range in AWS Athena

Count instances within a date range in AWS Athena - sql

Trying to query my table to tell me the count of instances of one of my columns appearing over 2 different ETL runs. How I would ask it to give me everything in regards to its harmonized value is
SELECT clientname, mdm_cleanse_value_clientname FROM "tablename" WHERE mdm_cleansed_value_clientname IS NULL
That would give me what is left to harmonize. Im interested in duplicate values appearing across the next batch, we use metadata_job_start_time as our indicator, so I would need either a TOP or MAX function that includes the top 2 metadata_job_start_time
I am imagining towards the end it would read
SELECT metadata_job_run_id, metadata_job_start_time, clientname, COUNT(clientname)
FROM "tablename"
WHERE COUNT(clientname) > 1
GROUP BY metadata_job_start_time, metadata_job_run_id, clientname
How do I put in my date range

Related

How to calculate a bank's deposit growth from one call report to the next, as a percentage?

I downloaded the entire FDIC bank call reports dataset, and uploaded it to BigQuery.
The table I currently have looks like this:
What I am trying to accomplish is adding a column showing the deposit growth rate since the last quarter for each bank:
Note:The first reporting date for each bank (e.g. 19921231) will not have a "Quarterly Deposit Growth". Hence the two empty cells for the two banks.
I would like to know if a bank is increasing or decreasing its deposits each quarter/call report (viewed as a percentage).
e.g. "On their last call report (19921231)First National Bank had deposits of 456789 (in 1000's). In their next call report (19930331)First National bank had deposits of 567890 (in 1000's). What is the percentage increase (or decrease) in deposits"?
This "_%_Change_in_Deposits" column would be displayed as a new column.
This is the code I have written so far:
select
SFRNLL.repdte, SFRNLL.cert, SFRNLL.name, SFRNLL.city, SFRNLL.county, SFRNLL.stalp, SFRNLL.specgrp AS `Loan_Specialization`, SFRNLL.lnreres as `_1_to_4_Residential_Loans`, AL.dep as `Deposits`, AL.lnlsnet as `loans_and_leases`,
IEEE_DIVIDE(SFRNLL.lnreres, AL.lnlsnet) as SFR2TotalLoanRatio
FROM usa_fdic_call_reports_1992.All_Reports_19921231_1_4_Family_Residential_Net_Loans_and_Leases as SFRNLL
JOIN usa_fdic_call_reports_1992.All_Reports_19921231_Assets_and_Liabilities as AL
ON SFRNLL.cert = AL.cert
where SFRNLL.specgrp = 4 and IEEE_DIVIDE(SFRNLL.lnreres, AL.lnlsnet) <= 0.10
UNION ALL
select
SFRNLL.repdte, SFRNLL.cert, SFRNLL.name, SFRNLL.city, SFRNLL.county, SFRNLL.stalp, SFRNLL.specgrp AS `Loan_Specialization`, SFRNLL.lnreres as `_1_to_4_Residential_Loans`, AL.dep as `Deposits`, AL.lnlsnet as `loans_and_leases`,
IEEE_DIVIDE(SFRNLL.lnreres, AL.lnlsnet) as SFR2TotalLoanRatio
FROM usa_fdic_call_reports_1993.All_Reports_19930331_1_4_Family_Residential_Net_Loans_and_Leases as SFRNLL
JOIN usa_fdic_call_reports_1993.All_Reports_19930331_Assets_and_Liabilities as AL
ON SFRNLL.cert = AL.cert
where SFRNLL.specgrp = 4 and IEEE_DIVIDE(SFRNLL.lnreres, AL.lnlsnet) <= 0.10
The table looks like this:
Additional notes:
I would also like to view the last column (SFR2TotalLoansRatio) as a percentage.
This code runs correctly, however, previously I was getting a "division by zero" error when attempting to run 50,000 rows (1992 to the present).

Addressing each of your question individually.
First) Retrieving SFR2TotalLoanRatio as percentage, I assume you want to see 9.988% instead of 0.0988 in your results. Currently, in BigQuery you can achieve this by casting the field into a STRING then, concatenating the % sign. Below there is an example with sample data:
WITH data as (
SELECT 0.0123 as percentage UNION ALL
SELECT 0.0999 as percentage UNION ALL
SELECT 0.3456 as percentage
)
SELECT CONCAT(CAST(percentage*100 as String),"%") as formatted_percentage FROM data
And the output,
Row formatted_percentage
1 1.23%
2 9.99%
3 34.56%
Second) Regarding your question about the division by zero error. I am assuming IEEE_DIVIDE(arg1,arg2) is a function to perform the division, in which arg1 is the divisor and arg2 is the dividend. Therefore, I would adivse your to explore your data in order to figured out which records have divisor equals to zero. After gathering these results, you can determine what to do with them. In case you decide to discard them you can simply add within your WHERE statement in each of your JOINs: AL.lnlsnet = 0. On the other hand, you can also modify the records where lnlsnet = 0 using a CASE WHEN or IF statements.
UPDATE:
In order to add this piece of code your query, you u have to wrap your code within a temporary table. Then, I will make two adjustments, first a temporary function in order to calculate the percentage and format it with the % sign. Second, retrieving the previous number of deposits to calculate the desired percentage. I am also assuming that cert is the individual id for each of the bank's clients. The modifications will be as follows:
#the following function MUST be the first thing within your query
CREATE TEMP FUNCTION percent(dep INT64, prev_dep INT64) AS (
Concat(Cast((dep-prev_dep)/prev_dep*100 AS STRING), "%")
);
#followed by the query you have created so far as a temporary table, notice the the comma I added after the last parentheses
WITH data AS(
#your query
),
#within this second part you need to select all the columns from data, and LAG function will be used to retrieve the previous number of deposits for each client
data_2 as (
SELECT repdte, cert, name, city, county, stalp, Loan_Specialization, _1_to_4_Residential_Loans,Deposits, loans_and_leases, SFR2TotalLoanRatio,
CASE WHEN cert = lag(cert) OVER (PARTITION BY id ORDER BY d) THEN lag(Deposits) OVER (PARTITION BY id ORDER BY id) ELSE NULL END AS prev_dep FROM data
)
SELECT repdte, cert, name, city, county, stalp, Loan_Specialization, _1_to_4_Residential_Loans,Deposits, loans_and_leases, SFR2TotalLoanRatio, percent(Deposits,prev_dep) as dept_growth_rate FROM data_2
Note that the built-in function LAG is used together with CASE WHEN in order to retrieve the previous amount of deposits per client.

SQL Impala - Grouping by inconsistent variable type and finding min, max, and difference

See attached picture above for how the data is formatted and what my goal is.
I am looking to make a query to group all pairs of the CAB variables together and create new columns for High_BP and Low_BP then find the difference of the these two CABs FR.
I can do this currently when there are just two CAB for every unique ID but some of the unique ID on some DATE and Dest will have anywhere from 2-4 different CAB configurations.
Anyone have any idea on where to start with this?
Thank you so much!

You did not mention your tablename so I just used Yourtable
select table1.Dest
,table1.Date
,table1.ID
,concat(table1.cab, table2.cab) as cab_pair
,case when table1.BP > table2.BP then table2.BP else table1.BP end as Low_BP
,case when table1.BP > table2.BP then table1.BP else table2.BP end as High_BP
,abs(table1.FR-table2.FR) as Diff_FR
from Yourtable as table1
inner join Yourtable as table2
on table1.id=table2.id and table1.cab > table2.cab
The only problem with the query is that the pairing will be done alphabetically with the first letter being farther in the alphabet, so instead of CF it will be FC.

SAP BO - how to get 1/0 distinct values per week in each row

the problem I am trying to solve is having a SAP Business Objects query calculate a variable for me because calculating it in a large excel file crashes the process.
I am having a bunch of columns with daily/weekly data. I would like to get a "1" for the first instance of Name/Person/Certain Identificator within a single week and "0" for all the rest.
So for example if item "Glass" was sold 5 times in week 4 in this variable/column first sale will get "1" and next 4 sales will get "0". This will allow me to have the number of distinct items being sold in a particular week.
I am aware there are Count and Count distinct functions in Business Objects, however I would prefer to have this 1/0 system for the entire raw table of data because I am using it as a source for a whole dashboard and there are lots of metrics where distinct will be part/slicer for.
The way I doing it previously is with excel formula: =IF(SUMPRODUCT(($A$2:$A5000=$A2)*($G$2:$G5000=$G2))>1,0,1)
This does the trick and gives a "1" for the first instance of value in column G appearing in certain value range in column A ( column A is the week ) and gives "0" when the same value reappears for the same week value in column A. It will give "1" again when the week value change.
Since it is comparing 2 cells in each row for the entire columns of data as the data gets bigger this tends to crash.
I was so far unable to emulate this in Business Objects and I think I exhausted my abilities and googling.
Could anyone share their two cents on this please?

Assuming you have an object in the query that uniquely identifies a row, you can do this in a couple of simple steps.
Let's assume your query contains the following objects:
Sale ID
Name
Person
Sale Date
Week #
Price
etc.
You want to show a 1 for the first occurrence of each Name/Week #.
Step 1: Create a variable with the following definition. Let's call it [FirstOne]
=Min([Sale ID]) In ([Name];[Week #])
Step 2: In the report block, add a column with the following formula:
=If [FirstOne] = [Sale ID] Then 1 Else 0
This should produce a 1 in the row that represents the first occurrence of Name within a Week #. If you then wanted to show a 1 one the first occurrence of Name/Person/Week #, you could just modify the [FirstOne] variable accordingly:
=Min([Sale ID]) In ([Name];[Person];[Week #])

I think you want logic around row_number():
select t.*,
(case when 1 = row_number() over (partition by name, person, week, identifier
order by ??
)
then 1 else 0
end) as new_indicator
from t;
Note the ??. SQL tables represent unordered sets. There is no "first" row in a table or group of rows, unless a column specifies that ordering. The ?? is for such a column (perhaps a date/time column, perhaps an id).
If you only want one row to be marked, you can put anything there, such as order by (select null) or order by week.

Bringing back multiple max on a single column in sql

I have a spreadsheet with customer accounts and when we get a new account it gets added on using a reference account number i.e. Anderson Electrical would be AND01 etc. I'm trying to use sql to bring back the highest number from each variation of letterings e.g. if AND01 already existed and our highest company value was AND34 then it would just bring back AND34 rather than 1 to 34.
Each account has the first 3 letters of there company name followed by whatever the next number is.
Hope I have explained this well enouh for someone to understand :)

For a single reference account:
select max(AcctNum)
from Accounts
where left(AcctNum, 3) = <reference account>
If you want it for all at once:
select left(AcctNum, 3) as ReferenceAcct, max(AcctNum)
from Accounts
group by left(AcctNum, 3)

Not sure if that's what you're asking but if you need to find max value that is part of a string you can do it with substring. So if you need to find the highest number from a column that contains those values you can do it with:
;WITH tmp AS(
SELECT 'AND01' AS Tmp
UNION ALL
SELECT 'AND34'
) SELECT MAX(SUBSTRING(tmp, 4, 2)) FROM tmp GROUP BY SUBSTRING(tmp, 0, 3)
That's a little test query that returns 34 because I'm grouping by first 3 letters, you probably want to group it by some ID.

How do I set ORDER BY in SQL query to a value depending by the SQL query itself?

Imagine an auction (ebay auction, for example). You create an auction, set the start bidding value, let's say, 5 dollars. This gets stored as a minimal bid value to the auctions table.At this point, the current bid value of this auction is 5 dollars.
Now, if someone bids to your auction, let's say, 10 dollars, this gets stored to the bids table.At this point, the current bid value of this auction is 10 dollars.
Now let's imagine you want to retrieve 5 cheapest auctions. You will write a query like this:
SELECT
`auction_id`,
`auction_startPrice`,
MAX(bids.bid_price) as `bid_price`
FROM
`auctions`
LEFT JOIN `bids` ON `auctions`.`auction_id`=`bids`.`bid_belongs_to_auction`
GROUP BY `auction_id`
LIMIT 5
Pretty simple, and it works! But now you need to add an ORDER BY clause to the query. The problem is, however, that we want to ORDER BY either by auctions.auction_startPrice or by bid_price, depending on whichever of this is higher, as explained in the first paragraphs.
Can this be understood? I know how to do this using 2 queries, but I am hoping it can be done with 1 query.
Thanks!
EDIT: Just a further explanation to help you imagine the problem. If I set ORDER BY auction_startPrice ASC, then I will get 5 auctions with their lowest initial bid price, but what if there are already bids placed on those auctions? Then their current lowest price is equal to those bids, NOT to the start price, therefore my query is wrong.

SELECT
`auction_id`,
`auction_startPrice`,
`bid_price`
FROM
(
SELECT
`auction_id`,
`auction_startPrice`,
MAX(bids.bid_price) as `bid_price`,
IF(MAX(bids.bid_price)>`auction_startPrice`,
MAX(bids.bid_price),
`auction_startPrice`) higherPrice
FROM
`auctions`
LEFT JOIN `bids` ON `auctions`.`auction_id`=`bids`.`bid_belongs_to_auction`
GROUP BY `auction_id`
) X
order by higherPrice desc
LIMIT 5;
Note:
In the inner query, an extra column is created, named 'higherPrice'
The IF function compares the MAX(bid_price) column against the startprice, and only if the Max-bid is not null (implicitly required in comparison) and greater than start price, then the Max-bid becomes the value in the higherPrice column. Otherwise, it will contain the start price.
The outer query merely makes use of the columns from the inner query, ordering by the higherPrice

I'm not sure which database you're using but look at this example:
http://www.extremeexperts.com/sql/articles/CASEinORDER.aspx
SELECT
`auction_id`,
`auction_startPrice`,
MAX(bids.bid_price) as `bid_price`
FROM
`auctions`
LEFT JOIN `bids` ON `auctions`.`auction_id`=`bids`.`bid_belongs_to_auction`
GROUP BY `auction_id`
ORDER BY CASE WHEN `auction_startPrice` > isnull(MAX(bids.bid_price),0) then `auction_startPrice` else MAX(bids.bid_price) end
LIMIT 5

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas