hive - Duplicate counts check from one column against another - sql

I have a table and am trying to fetch counts of distinct values in one column by comparing it against another column. The data runs from millions to billions of rows per TMKEY partition. Sample data:
ID TNUM TMKEY
23455 ABCD 1001
23456 ABCD 1001
23455 ABCD 1001
112233 BCDE 1001
113322 BCDE 1001
9009 DDEE 1001
9009 DDEE 1001
1009 FFGG 1001
Looking for desired output:
total_distinct_TNUM_count count_of_TNUM_which_has_more_than_one_distinct_ID TMKEY
4 2 1001
Here, when TNUM is DDEE, the only ID is 9009, which appears twice; a TNUM like this shouldn't be picked up when counting the TNUMs that have more than one distinct ID. All I'm looking for here is these grouped counts. Any suggestions, please? Since the data runs to 3-4 billion rows my approach has to be different, and I'm stuck. My attempt so far:
select a.tnum, a.group_id, a.time_week
from (
    select time_week, tnum,
           count(*) as num_of_rows,
           concat_ws('|', collect_set(id)) as group_id
    from source_table_test1
    where time_week = 1001
    group by tnum, time_week
) as a
where length(a.group_id) > 16
  and num_of_rows > 1
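One possible direction (a sketch, untested at billions of rows): compute the distinct-ID count per TNUM first, then roll those per-TNUM results up to one summary row per partition, which avoids collecting ID sets. Column names follow the query above (time_week for the partition column shown as TMKEY in the sample); adjust to the real schema.
-- distinct IDs per TNUM, then one summary row per partition
select
    time_week,
    count(*)                        as total_distinct_tnum_count,
    sum(if(distinct_ids > 1, 1, 0)) as count_of_tnum_with_more_than_one_distinct_id
from (
    select time_week, tnum, count(distinct id) as distinct_ids
    from source_table_test1
    where time_week = 1001
    group by time_week, tnum
) per_tnum
group by time_week
On the sample data this gives 4, 2, 1001, matching the desired output.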

Related

OPENEDGE (ODBC) Joining without Duplicates on ID

I am trying to query 3 tables without duplicates on the Object_ID.
If that is not possible, ordering by both Object_IDs, so that the duplicates of each Object_ID sit next to each other, would also do the trick for me. But that isn't really working either, since I'm only able to order by one Object_ID at a time.
Tables:
S_Anl
Ktr Anl
4711 1234
4711 5678
4711 9000
AB_Erg
Anl AB_Erg_Obj Value
1234 c9d91f 1000
1234 696bfc 2000
1234 8c9915 3000
5678 141a65 4000
E_BP
Anl E_BP_Obj Value
1234 99f75ab 500
1234 720e573 100
9000 830614c 50
9000 958ac28 200
Query
SELECT B.AB_Erg_Obj, C.E_BP_Obj, A.Anl, B.Value, C.Value
FROM PUB.S_Anl AS A
LEFT JOIN PUB.AB_Erg AS B ON A.Anl = B.Anl
LEFT JOIN PUB.E_BP AS C ON A.Anl = C.Anl
WHERE A.Ktr = '4711'
ORDER BY A.Anl, B.AB_Erg_Obj, C.E_BP_Obj
with (nolock)
Expected Result
Anl AB_Erg_Obj E_BP_Obj Value Value
1234 c9d91f 99f75ab 1000 500
1234 696bfc 720e573 2000 100
1234 8c9915 NULL 3000 NULL
5678 141a65 NULL 4000 NULL
9000 830614c 830614c NULL 50
9000 958ac28 958ac28 NULL 200
Or, alternatively, ordering AB_Erg_Obj and E_BP_Obj so their duplicates sit next to each other. Is either of these possible?
//EDIT:
I know that ordering wouldn't remove duplicates from the result set, but it would make it easier to remove them afterwards.
Also, the data doesn't have to be matched exactly at row level. In the end I just need the overall sum of Value from E_BP and the sum of Value from AB_Erg, so exact row-level matching isn't needed, just no duplicates at the Object_ID level.
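If only the per-Anl totals are needed, one way to sidestep the duplication entirely is to aggregate each child table before joining (a sketch based on the table and column names in the question; the OpenEdge SQL dialect may need minor syntax adjustments):
-- aggregate AB_Erg and E_BP down to one row per Anl, then join
SELECT A.Anl, B.sum_ab_erg, C.sum_e_bp
FROM PUB.S_Anl AS A
LEFT JOIN (SELECT Anl, SUM(Value) AS sum_ab_erg
           FROM PUB.AB_Erg GROUP BY Anl) AS B ON A.Anl = B.Anl
LEFT JOIN (SELECT Anl, SUM(Value) AS sum_e_bp
           FROM PUB.E_BP GROUP BY Anl) AS C ON A.Anl = C.Anl
WHERE A.Ktr = '4711'
Because each derived table returns at most one row per Anl, no Object_ID is repeated in the result.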

How to check for duplicate values in a SQL table?

I am using SQL Server and importing data from Excel. I have the following columns:
Entity ExpenseTypeCode Amount Description APSupplierID ExpenseReportID
12 001 5 Dinner 7171 90
12 001 6 Dinner 7171 90
12 001 5 Dinner 7273 90
12 001 5 Dinner 7171 95
12 001 5 Dinner 7171 90
I added sample data above. Now I want to select the duplicate records, i.e. rows where all column values are the same; in the sample above, the fifth row is a duplicate. I have more than four thousand records. How can I select the duplicate records with a query?
If you want the values that are duplicated, then use group by:
select Entity, ExpenseTypeCode, Amount, Description, APSupplierID, ExpenseReportID, count(*) as numDuplicates
from t
group by Entity, ExpenseTypeCode, Amount, Description, APSupplierID, ExpenseReportID
having count(*) > 1;
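If you also need the duplicated rows themselves rather than just the grouped values, a window function is one option (a sketch; the table name t is carried over from the query above):
select *
from (
    select *,
           count(*) over (partition by Entity, ExpenseTypeCode, Amount,
                          Description, APSupplierID, ExpenseReportID) as numDuplicates
    from t
) d
where numDuplicates > 1;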

Last Invoice using Postgres

I have a Postgres 9.1 database with three tables - Customer, Invoice, and Line_Items
I want to create a customer list showing the customer and last invoice date for any customer with a specific item (specifically, all invoices that have a line_items.item_code beginning with 'L3').
First, I am trying to pull one transaction per customer (the last invoice with the 'L3' code), figuring I can JOIN the customer names once this list is created.
Tables are something like this:
Customers
cust_number last_name first_name
=========== ======== ====================
1 Smith John
2 Jones Paul
3 Jackson Mary
4 Brown Phil
Transactions
trans_number date cust_number
=========== =========== ====================
1001 2014-01-01 1
1002 2014-02-01 4
1003 2014-03-02 2
1004 2014-03-06 3
Line_Items
trans_number date item_code
=========== =========== ====================
1001 2014-01-01 L3000
1001 2014-01-01 M2420
1001 2014-01-01 L3500
1002 2014-02-01 M2420
1003 2014-03-02 M2420
1004 2014-03-06 L3000
So far, I have:
Select transactions.cust_number, transactions.trans_number
from transactions
where transactions.trans_number in
( SELECT Line_Items.trans_number
FROM Line_Items
WHERE Line_Items.item_code ilike 'L3%'
ORDER BY line_items.date DESC
)
order by transactions.cust_number
This pulls all the invoices for each customer with an 'L3' code on the invoice, but I can't figure out how to just have the last invoice.
Use DISTINCT ON:
SELECT DISTINCT ON (t.cust_number)
t.cust_number, t.trans_number
FROM line_items l
JOIN transactions t USING (trans_number)
WHERE l.item_code ILIKE 'L3%'
ORDER BY t.cust_number, l.date DESC;
This returns at most one row per cust_number - the one with the latest line_items.date. You can add more columns to the SELECT list freely.
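For example, to build the customer list described in the question (customer name plus the date of the last 'L3' invoice), the same pattern can be extended with a join to the customers table (a sketch using the column names from the sample tables):
SELECT DISTINCT ON (c.cust_number)
       c.cust_number, c.last_name, c.first_name,
       t.trans_number, l.date AS last_l3_date
FROM customers c
JOIN transactions t USING (cust_number)
JOIN line_items l USING (trans_number)
WHERE l.item_code ILIKE 'L3%'
ORDER BY c.cust_number, l.date DESC;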
Detailed explanation:
Select first row in each GROUP BY group?
You could use MIN or MAX:
SELECT Line_Items.trans_number, MAX(Line_Items.date) AS last_date
FROM Line_Items
GROUP BY Line_Items.trans_number

Execute an SQL UPDATE using GROUP BY and COUNT

I am working with SQL in an SQLite database. I have a table that looks something like this:
STORAGE
------------------------------
REC_ID SEQ_NO NAME
------------------------------
100 1 plastic jar
100 2 glass cup
100 fiber rug
101 1 steel fork
101 wool scarf
102 1 leather boots
102 2 paintbox
102 3 cast iron pan
102 toolbox
Keep in mind that this is a very small sample compared to what I actually have in the table. What I need to do is update the table so that every record with a null SEQ_NO gets the sequence number it should have within the group of records sharing its REC_ID.
Here is what I want the table to look like after the update:
STORAGE
------------------------------
REC_ID SEQ_NO NAME
------------------------------
100 1 plastic jar
100 2 glass cup
100 3 fiber rug
101 1 steel fork
101 2 wool scarf
102 1 leather boots
102 2 paintbox
102 3 cast iron pan
102 4 toolbox
So, for example, the toolbox record with REC_ID 102 should get a SEQ_NO of 4, because it is the fourth record with REC_ID 102.
If I do:
SELECT REC_ID, COUNT(*) FROM STORAGE GROUP BY REC_ID;
this returns all of the records by REC_ID and the number (count) of records matching each ID, which would also be the number I would want to assign to each of the records with a null SEQ_NO.
Now how would I go about actually updating all of these records with their count values?
This should work:
update storage
set seq_no = (select count(*) from storage s2 where storage.rec_id = s2.rec_id)
where seq_no is null
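Note that this works as long as each REC_ID has at most one row with a null SEQ_NO, as in the sample data. If a group can contain several null rows, one SQLite option is to rank the null rows by rowid (a sketch, assuming the existing SEQ_NO values run 1..n with no gaps and the table has a rowid):
-- count the numbered rows plus the null rows up to and including this one
update storage
set seq_no = (select count(*)
              from storage s2
              where s2.rec_id = storage.rec_id
                and (s2.seq_no is not null or s2.rowid <= storage.rowid))
where seq_no is null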

How can I create multiple rows from a single row (sql server 2008)

In our organisation we have a central purchasing company (CPC) who then sells to our retail company (Company_X) via an intercompany PO, who then sells on to the customer.
What I need to do is link our retail sale back to the original purchase order.
For example I have a table which contains the following (and a multitude of other columns):
Company_X_Sales:
InterCO_PO_no Sales_Order_No Part_No Qty
------------- -------------- ------- ---
12345 98765 ABCD 10
I then have a table which has the following:
CPC_Sales:
PO_Number InterCO_SO_No Part_No Qty
--------- ------------- ------- ---
00015 12345 ABCD 5
00012 12345 ABCD 2
00009 12345 ABCD 4
00007 12345 ABCD 3
So you can see that the final sale of 10 items was made up of parts which came from more than one external PO in the central company.
What I need to be able to do is replicate the rows in Company_X_Sales, include the Original PO Number and set the quantities as in CPC_Sales.
I need to end up with something like this:
Company_X_Sales_EXTD:
PO_Number InterCO_PO_no Sales_Order_No Part_No Qty
--------- ------------- -------------- ------- ---
00007 12345 98765 ABCD 3
00009 12345 98765 ABCD 4
00012 12345 98765 ABCD 2
00015 12345 98765 ABCD 1
I have to use the Company_X_Sales as my driving table - the CPC_Sales is simply as a lookup to derive the original PO Number.
Hoping you can help - I am working through the weekend on this, as it is part of a piece of work with a very aggressive timescale.
I do not mind if the solution requires more than one pass of the table or creation of views if needed. I am just really really struggling.
I'm a little confused by your question, but it sounds like you're trying to make your Company_X_Sales table have 4 rows instead of 1, just with varying quantities? If so, something like this should work:
SELECT S.PO_Number, C.InterCO_PO_no, C.Sales_Order_No, C.Part_No, S.Qty
FROM Company_X_Sales C
JOIN CPC_Sales S ON C.InterCO_PO_no = S.InterCO_SO_No
Here is the SQL Fiddle.
That will give you the 4 rows with the correct quantities. Then you can delete and reinsert accordingly.
To get those rows into the table, you have a few options, but something like this should work:
--Flag the rows for deletion
UPDATE Company_X_Sales SET Qty = -1 -- Or some arbitrary value that does not exist in the table
--Insert new correct rows
INSERT INTO Company_X_Sales
SELECT C.InterCO_PO_no, C.Sales_Order_No, C.Part_No, S.Qty
FROM Company_X_Sales C
JOIN CPC_Sales S ON C.InterCO_PO_no = S.InterCO_SO_No
--Cleanup flagged rows for deletion
DELETE FROM Company_X_Sales WHERE Qty = -1
Good luck.
select CPC_Sales.PO_Number, CPC_Sales.InterCO_SO_No, Company_X_Sales.Sales_Order_No,
       CPC_Sales.Part_No, CPC_Sales.Qty
from CPC_Sales
inner join Company_X_Sales
    on Company_X_Sales.InterCO_PO_no = CPC_Sales.InterCO_SO_No
Simple inner join on two tables will get you the required result IMHO.