Hive table validation

I have a Hive table with some records.
I validate the table using a script, and the valid records are stored in another table.
I need to get the row counts of both tables and find the difference.
The difference is then compared to a threshold value, and based on that result the table is either used for analysis or rejected.
The threshold is the maximum number of bad records that the table can contain.
Example:
Assume that table A has 100 records.
The records are validated and the valid records are stored in another table, B.
Let's say that table B has 70 records, so the difference is 100 - 70 = 30.
Let's assume that the threshold is 20 (meaning at most 20 bad records may be present).
In this case, the table will be rejected.
How can I do this?

If hive.stats.autogather is set to true, the table statistics are gathered automatically, including numRows.
All you have to do in that case is check the statistics and make whatever decision you want.
If this is not the case, then just execute two select count(*) queries, e.g.:
src_numRows=$(hive -e "select count(*) from src")
trg_numRows=$(hive -e "select count(*) from trg")
P.S.
It can be done in a single query, but I think it would be better to save the raw results rather than just the diff.
select s.cnt - t.cnt as diff
from (select count(*) as cnt from src) s
cross join (select count(*) as cnt from trg) t
;
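The decision step itself (compare the difference to the threshold) is not shown above. A minimal sketch in Python, assuming the two counts have already been captured as text from the `hive -e` calls; the helper names `parse_count` and `is_accepted` are hypothetical, not part of Hive:

```python
def parse_count(hive_output: str) -> int:
    """Convert the text printed by a count(*) query into an int."""
    # Take the last non-empty line in case other output precedes the number.
    lines = [line.strip() for line in hive_output.splitlines() if line.strip()]
    return int(lines[-1])


def is_accepted(src_count: int, trg_count: int, threshold: int) -> bool:
    """Accept the table when the bad-record count does not exceed the threshold."""
    bad_records = src_count - trg_count
    return bad_records <= threshold


src = parse_count("100\n")  # stand-in for the captured src count
trg = parse_count("70\n")   # stand-in for the captured trg count
print(src - trg, is_accepted(src, trg, threshold=20))  # prints: 30 False
```

With the numbers from the question (100 source rows, 70 valid rows, threshold 20) the table is rejected, since 30 bad records exceed the threshold.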
Demo
create source table
create table src
as
select i+1 as n
from (select 1) x lateral view posexplode(split(space(100-1),' ')) pe as i,x
;
create target table
create table trg
as
select *
from src
where n > 30
;
Check the tables' statistics
bash
src_numRows=$(hive -e "show tblproperties src ('numRows')")
trg_numRows=$(hive -e "show tblproperties trg ('numRows')")
printf "src:%g,trg:%g,diff:%g\n" ${src_numRows} ${trg_numRows} $((src_numRows-trg_numRows))
src:100,trg:70,diff:30
Another option is to check the statistics directly from the metastore
MySQL
select t.TBL_NAME
,tp.PARAM_VALUE
from DBS d
join TBLS t
on t.DB_ID = d.DB_ID
join TABLE_PARAMS tp
on tp.TBL_ID = t.TBL_ID
where d.NAME = 'default'
and t.TBL_NAME in ('src','trg')
and tp.PARAM_KEY = 'numRows'
;
+----------+-------------+
| TBL_NAME | PARAM_VALUE |
+----------+-------------+
| src | 100 |
| trg | 70 |
+----------+-------------+

Related

SQL: Selecting record where values in one field are unique based off of most recent date

I'm attempting to write an SQL statement to select records such that each record has a unique PartNo, and I want that record to be based off of the most recent ReceiveDate. I got an answer when I asked this question:
SELECT t.*
FROM Table as t
WHERE t.ReceiveDate = (SELECT MAX(t2.ReceiveDate)
FROM Table as t2
WHERE t2.PartNo = t.PartNo
);
However, this answer assumes that for each ReceiveDate, you would not have the same PartNo twice. In situations where there are multiple records with the same PartNo and ReceiveDate, it does not matter which is selected, but I only want one to be selected (PartNo must be unique).
Example:
PartNo | Vendor | Qty | ReceiveDate
100 | Bob | 2 | 2020/07/30
100 | Bob | 3 | 2020/07/30
Should only return one of these records.
I'm using Microsoft Access, which uses Jet SQL, which is very similar to T-SQL.

Use NOT EXISTS:
select distinct t.*
from tablename as t
where not exists (
select 1 from tablename
where partno = t.partno
and (
receivedate > t.receivedate
or (receivedate = t.receivedate and qty > t.qty)
or (receivedate = t.receivedate and qty = t.qty and vendor > t.vendor)
)
)
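To see the tie-breaking at work, here is a small sketch that runs the same NOT EXISTS query against an in-memory SQLite database from Python. SQLite stands in for Access here, and the table name `parts` and the rows for part 200 are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table parts (partno int, vendor text, qty int, receivedate text)")
conn.executemany(
    "insert into parts values (?, ?, ?, ?)",
    [
        (100, "Bob", 2, "2020/07/30"),  # tie on PartNo and ReceiveDate
        (100, "Bob", 3, "2020/07/30"),
        (200, "Al", 5, "2020/06/01"),   # hypothetical extra part, older row
        (200, "Al", 1, "2020/07/01"),   # hypothetical extra part, newest row
    ],
)

# Keep a row only if no other row for the same part "beats" it on
# receivedate, then qty, then vendor.
latest = conn.execute(
    """
    select distinct t.*
    from parts as t
    where not exists (
        select 1 from parts as p
        where p.partno = t.partno
          and (
            p.receivedate > t.receivedate
            or (p.receivedate = t.receivedate and p.qty > t.qty)
            or (p.receivedate = t.receivedate and p.qty = t.qty and p.vendor > t.vendor)
          )
    )
    order by t.partno
    """
).fetchall()
print(latest)  # [(100, 'Bob', 3, '2020/07/30'), (200, 'Al', 1, '2020/07/01')]
```

The date comparison works on text here only because the yyyy/mm/dd format sorts chronologically; with a real date column that concern disappears.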

Alternatively, manually set up a standard aggregate (totals) query (the sigma icon in the ribbon), grouped on PartNo and with the date field set to Max.
Run the query to check that it returns the values you seek. Then, while in design view, select SQL View, which will give you the SQL statement.

SQL simplifying an except query

I have a database with around 50 million entries showing the status of a device for a given day, simplified to the form:
id | status
-------------
1 | Off
1 | Off
1 | On
2 | Off
2 | Off
3 | Off
3 | Off
3 | On
...
such that each id is guaranteed to have at least 2 rows with an 'Off' status, but doesn't have to have an 'On' status. I'm trying to get a list of only the ids that do not have an 'On' status. For example, in the above data set I'd want the query to return only '2'.
The current query is:
SELECT DISTINCT id FROM table
EXCEPT
SELECT DISTINCT id FROM table WHERE status <> 'Off'
Which seems to work, but it's having to iterate over the entire table twice which ends up taking ~10-12 minutes to run per query. Is there a simpler way to do this with only a single query?

You can use WHERE NOT EXISTS instead:
Select Distinct Id
From Table A
Where Not Exists
(
Select *
From Table B
Where A.Id = B.Id
And B.Status = 'On'
)
I would also recommend looking at the indexes on the Status column. 10-12 minutes to run is excessively long. Even with 50m records, with proper indexing, a query like this shouldn't take longer than a second.
To add an index to the column, you can run this (I'm assuming SQL Server, your syntax may vary):
Create NonClustered Index Ix_YourTable_Status On YourTable (Status Asc);

You can use conditional aggregation.
select id
from table
group by id
having count(case when status='On' then 1 end)=0
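As a quick check that the single-pass form matches the original EXCEPT query, the sketch below runs both against the sample rows from the question in an in-memory SQLite database. The table name `device_status` is an assumption, since the question only calls it "table":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table device_status (id int, status text)")
conn.executemany(
    "insert into device_status values (?, ?)",
    [(1, "Off"), (1, "Off"), (1, "On"),
     (2, "Off"), (2, "Off"),
     (3, "Off"), (3, "Off"), (3, "On")],
)

# Original approach: two passes combined with EXCEPT.
except_ids = conn.execute(
    """
    select distinct id from device_status
    except
    select distinct id from device_status where status <> 'Off'
    """
).fetchall()

# Conditional aggregation: one pass, keep ids with zero 'On' rows.
agg_ids = conn.execute(
    """
    select id
    from device_status
    group by id
    having count(case when status = 'On' then 1 end) = 0
    """
).fetchall()

print(except_ids, agg_ids)  # [(2,)] [(2,)]
```

The `case` expression yields NULL for non-matching rows, and `count` ignores NULLs, which is why the count is zero exactly for ids that never have an 'On' row.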

You can use a self join. The 'On' filter has to go in the ON clause so that ids without an 'On' row come back with NULLs:
SELECT DISTINCT A.Id
FROM Table A
LEFT JOIN Table B ON A.Id=B.Id
AND B.Status='On'
WHERE B.Id IS NULL

Assign a random order to each group

I want to expand each row in TableA into 4 rows. The result hold all the columns from TableA and two additional columns: SetID = ranging from 0 to 3 and unique when grouped by TableA. Random = a random permutation of SetID within the same grouping.
I use SQLite and would prefer a pure SQL solution.
Table A:
Description
-----------
A
B
Desired output:
Description | SetID | Random
------------|-------|-------
A | 0 | 2
A | 1 | 0
A | 2 | 3
A | 3 | 1
B | 0 | 3
B | 1 | 2
B | 2 | 0
B | 3 | 1
My attempt so far solves creating 4 rows for each row in TableA but doesn't get the permutation correctly. wrong will contain a random number ranging from 0 to 3. I need exactly one 0, 1, 2 and 3 for each unique value in Description and their order should be random.
SELECT
Description,
SetID,
abs(random()) % 4 AS wrong
FROM
TableA
LEFT JOIN
TableB
ON
1 = 1
Table B:
SetID
-----
0
1
2
3

Use a cross join
SELECT Description,
SetID,
abs(random()) % 4 AS wrong
FROM TableA
CROSS JOIN TableB
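The cross join produces the four rows per description, but the `wrong` column can still repeat values within a group. One way to get a true permutation in pure SQL, assuming SQLite 3.25 or later (the first release with window functions), is to number the rows of each group in random order. This is a sketch of a different technique from the answers here, run from Python only so that it is self-contained:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table TableA (Description text)")
conn.execute("create table TableB (SetID int)")
conn.executemany("insert into TableA values (?)", [("A",), ("B",)])
conn.executemany("insert into TableB values (?)", [(0,), (1,), (2,), (3,)])

# ROW_NUMBER() hands out 1..4 exactly once per Description; ordering the
# window by random() makes the assignment a random permutation.
rows = conn.execute(
    """
    SELECT Description,
           SetID,
           ROW_NUMBER() OVER (PARTITION BY Description ORDER BY random()) - 1 AS "Random"
    FROM TableA
    CROSS JOIN TableB
    ORDER BY Description, SetID
    """
).fetchall()

for row in rows:
    print(row)
```

Each run prints a different assignment, but every description always receives each of 0, 1, 2 and 3 exactly once.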

Consider a solution in your specialty, R. As you know, R maintains excellent database packages, one of which is RSQLite. Additionally, R can run commands via the connection without the need to import very large datasets.
Your problem is essentially random sampling without replacement. Simply have R run the sampling and concatenate the list items into an SQL string.
The code below creates a table in the SQLite database, with R sending the CREATE TABLE command to the SQL engine. There is no import or export of data. Should you need to run it for every four rows, run an iterative loop in a defined function that outputs the SQL string. For append queries, change the CREATE TABLE AS to an INSERT INTO ... SELECT statement.
library(RSQLite)
sqlite <- dbDriver("SQLite")
conn <- dbConnect(sqlite,"C:\\Path\\To\\Database\\File\\newexample.db")
# SAMPLE WITHOUT REPLACEMENT
randomnums <- as.list(sample(0:3, 4, replace=F))
# SQL CONCATENATION
sql <- sprintf("CREATE TABLE PermutationsTable AS
SELECT a.Description, b.SetID,
(select %d from TableB WHERE TableB.SetID = b.SetID AND TableB.SetID=0
union select %d from TableB WHERE TableB.SetID = b.SetID AND TableB.SetID=1
union select %d from TableB WHERE TableB.SetID = b.SetID AND TableB.SetID=2
union select %d from TableB WHERE TableB.SetID = b.SetID AND TableB.SetID=3)
As RandomNumber
from TableA a, TableB b;",
randomnums[[1]], randomnums[[2]],
randomnums[[3]], randomnums[[4]])
# RUN QUERY
dbSendQuery(conn, sql)
dbDisconnect(conn)
You will notice a nested union subquery. This is used to achieve the inline random numbers for each row. Also, to return all possible combinations from all tables, no join statements are needed, simply list tables in FROM clause.

Select data from a table where only the first two columns are distinct

Background
I have a table which has six columns. The first three columns create the pk. I'm tasked with removing one of the pk columns.
I selected (using distinct) the data into a temp table (excluding the third column), and tried inserting all of that data back into the original table with the third column being '11' for every row as this is what I was instructed to do. (this column is going to be removed by a DBA after I do this)
However, when I went to insert this data back into the original table I get a pk constraint error. (shocking, I know)
The other three columns are just date columns, so the distinct select didn't create a unique pk for each record. What I'm trying to achieve is just calling a distinct on the first two columns, and then just arbitrarily selecting the three other columns as it doesn't matter which dates I choose (at least not on dev).
What I've tried
I found the following post which seems to achieve what I want:
How do I (or can I) SELECT DISTINCT on multiple columns?
I tried the answers from both Joel and Erwin.
Attempt 1:
However, with Joel's answer the set returned is too large; the inner join isn't doing what I thought it would do. Selecting distinct col1 and col2 returns 400 rows, but his solution returns 600 rows. I checked the data and there were in fact duplicate pks. Here is my attempt at duplicating Joel's answer:
select a.emp_no,
a.eec_planning_unit_cde,
'11' as area, create_dte,
create_by_emp_no, modify_dte,
modify_by_emp_no
from tempdb.guest.temp_part_time_evaluator b
inner join
(
select emp_no, eec_planning_unit_cde
from tempdb.guest.temp_part_time_evaluator
group by emp_no, eec_planning_unit_cde
) a
ON b.emp_no = a.emp_no AND b.eec_planning_unit_cde = a.eec_planning_unit_cde
Now, if I execute just the inner select statement 400 rows are returned. If I select the whole query 600 rows are returned? Isn't inner join supposed to only show the intersection of the two sets?
Attempt 2:
I also tried the answer from Erwin. This one has a syntax error and I'm having trouble googling the spec on the where clause (specifically, the trick he is using with (emp_no, eec_planning_unit_cde))
Here is the attempt:
select emp_no,
eec_planning_unit_cde,
'11' as area, create_dte,
create_by_emp_no,
modify_dte,
modify_by_emp_no
where (emp_no, eec_planning_unit_cde) IN
(
select emp_no, eec_planning_unit_cde
from tempdb.guest.temp_part_time_evaluator
group by emp_no, eec_planning_unit_cde
)
Now, I realize that the post I referenced is for postgresql. Doesn't T-SQL have something similar? Trying to google parenthesis isn't working too well.
Overview of Questions:
Why doesn't inner join return an intersection of two sets? From googling this is what I thought it was supposed to do
Is there another way to achieve the same method that I was trying in attempt 2 in t-sql?
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?

A select distinct is based on all selected columns, so it does not guarantee that the first two are distinct. Group by the first two columns and aggregate the rest:
select pk1, pk2, '11', max(c1), max(c2), max(c3)
from table
group by pk1, pk2
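A small sketch of that GROUP BY, run against an in-memory SQLite database from Python. The table name `evaluator`, the column names and the sample rows are all made up for illustration; they only mirror the shape described in the question (three key columns, three date columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table evaluator (pk1 int, pk2 text, pk3 text, c1 text, c2 text, c3 text)"
)
conn.executemany(
    "insert into evaluator values (?, ?, ?, ?, ?, ?)",
    [
        (1, "U1", "10", "2013-01-01", "2013-01-02", "2013-01-03"),
        (1, "U1", "12", "2013-02-01", "2013-02-02", "2013-02-03"),  # same (pk1, pk2)
        (2, "U1", "10", "2013-03-01", "2013-03-02", "2013-03-03"),
    ],
)

# One row per (pk1, pk2); the third key column becomes the constant '11'
# and an arbitrary (here: the latest) value is kept for each date column.
collapsed = conn.execute(
    """
    select pk1, pk2, '11' as area, max(c1), max(c2), max(c3)
    from evaluator
    group by pk1, pk2
    order by pk1, pk2
    """
).fetchall()

for row in collapsed:
    print(row)
```

Because the result has exactly one row per (pk1, pk2) pair, inserting it back with the constant third column cannot violate the primary key.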

You could TRY this:
SELECT a.emp_no,
a.eec_planning_unit_cde,
'11' as area,
b.create_dte,
b.create_by_emp_no,
b.modify_dte,
b.modify_by_emp_no
FROM
(
SELECT emp_no, eec_planning_unit_cde
FROM tempdb.guest.temp_part_time_evaluator
GROUP BY emp_no, eec_planning_unit_cde
) a
JOIN tempdb.guest.temp_part_time_evaluator b
ON a.emp_no = b.emp_no AND a.eec_planning_unit_cde = b.eec_planning_unit_cde
Note that this join still returns every matching row of the table, so if a pair of emp_no and eec_planning_unit_cde has more than one row you will have to try a more brute-force approach that numbers the rows within each pair and keeps only the first:
SELECT a.emp_no,
a.eec_planning_unit_cde,
'11' as area,
a.create_dte,
a.create_by_emp_no,
a.modify_dte,
a.modify_by_emp_no
FROM
(
SELECT ROW_NUMBER() OVER(PARTITION BY emp_no, eec_planning_unit_cde ORDER BY create_dte) rownumber,
emp_no,
eec_planning_unit_cde,
create_dte,
create_by_emp_no,
modify_dte,
modify_by_emp_no
FROM tempdb.guest.temp_part_time_evaluator
) a
WHERE rownumber = 1

I'll reply one by one:
Why doesn't inner join return an intersection of two sets? From googling this is what I thought it was supposed to do
An inner join doesn't compute an intersection. Let's suppose these tables:
T1        T2
n  s      n  s
1  A      2  X
2  B      2  Y
2  C
3  D
If you join both tables by numeric column you don't get the intersection (2 rows). You get:
select *
from t1 inner join t2
on t1.n = t2.n;
| N | S |
---------
| 2 | B |
| 2 | B |
| 2 | C |
| 2 | C |
And, your second query approach:
select *
from t1
where t1.n in (select n from t2);
| N | S |
---------
| 2 | B |
| 2 | C |
Is there another way to achieve the same method that I was trying in attempt 2 in t-sql?
Yes, this subquery:
select *
from t1
where exists (
select 1
from t2
where t2.n = t1.n
);
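The difference between the join and the membership tests can be reproduced with the T1 and T2 rows above. This sketch uses an in-memory SQLite database from Python purely as a convenient way to run the SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t1 (n int, s text)")
conn.execute("create table t2 (n int, s text)")
conn.executemany("insert into t1 values (?, ?)", [(1, "A"), (2, "B"), (2, "C"), (3, "D")])
conn.executemany("insert into t2 values (?, ?)", [(2, "X"), (2, "Y")])

# Inner join: every matching pair of rows, so each t1 row repeats per t2 match.
joined = conn.execute(
    "select t1.n, t1.s from t1 inner join t2 on t1.n = t2.n order by t1.s"
).fetchall()

# IN and EXISTS: each t1 row appears at most once, however many matches exist.
with_in = conn.execute(
    "select n, s from t1 where n in (select n from t2) order by s"
).fetchall()
with_exists = conn.execute(
    "select n, s from t1 where exists (select 1 from t2 where t2.n = t1.n) order by s"
).fetchall()

print(len(joined), with_in, with_exists)
# 4 [(2, 'B'), (2, 'C')] [(2, 'B'), (2, 'C')]
```

The join returns four rows because each of the two matching T1 rows pairs with both T2 rows, while IN and EXISTS only test for membership.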
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?
Yes, use JTC's second query.

How can I select unique rows in a database over two columns?

I have found similar solutions online but none that I've been able to apply to my specific problem.
I'm trying to "unique-ify" data from one table to another. In my original table, data looks like the following:
USERIDP1 USERIDP2 QUALIFIER DATA
1 2 TRUE AB
1 2 CD
1 3 EF
1 3 GH
The user IDs are composed of two parts, USERIDP1 and USERIDP2 concatenated. I want to transfer all the rows that correspond to a user who has QUALIFIER=TRUE in ANY row they own, but ignore users who do not have a TRUE QUALIFIER in any of their rows.
To clarify, all of User 12's rows would be transferred, but not User 13's. The output would then look like:
USERIDP1 USERIDP2 QUALIFIER DATA
1 2 TRUE AB
1 2 CD
So basically, I need to find rows with distinct user ID components (involving two unique fields) that also possess a row with QUALIFIER=TRUE and copy all and only all of those users' rows.

This nested query could do it, although it may be slow for large tables:
SELECT DISTINCT X.USERIDP1, X.USERIDP2, X.QUALIFIER, X.DATA
FROM YOUR_TABLE_NAME AS X
WHERE EXISTS (SELECT 1 FROM YOUR_TABLE_NAME AS Y WHERE Y.USERIDP1 = X.USERIDP1
AND Y.USERIDP2 = X.USERIDP2 AND Y.QUALIFIER = TRUE)
It could be written as an inner join with itself too:
SELECT DISTINCT X.USERIDP1, X.USERIDP2, X.QUALIFIER, X.DATA
FROM YOUR_TABLE_NAME AS X
INNER JOIN YOUR_TABLE_NAME AS Y ON Y.USERIDP1 = X.USERIDP1
AND Y.USERIDP2 = X.USERIDP2 AND Y.QUALIFIER = TRUE
For a large table, create a new auxiliary table containing only USERIDP1 and USERIDP2 columns for rows that have QUALIFIER = TRUE and then join this table with your original table using inner join similar to the second option above. Remember to create appropriate indexes.
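For illustration, the EXISTS variant can be run against the sample rows from the question. This sketch uses an in-memory SQLite database from Python, with the assumptions that the table is called `users` and that QUALIFIER is stored as the text 'TRUE' or NULL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table users (useridp1 int, useridp2 int, qualifier text, data text)"
)
conn.executemany(
    "insert into users values (?, ?, ?, ?)",
    [
        (1, 2, "TRUE", "AB"),
        (1, 2, None, "CD"),
        (1, 3, None, "EF"),
        (1, 3, None, "GH"),
    ],
)

# Keep every row whose (useridp1, useridp2) pair has at least one TRUE row.
kept = conn.execute(
    """
    select x.useridp1, x.useridp2, x.qualifier, x.data
    from users as x
    where exists (
        select 1 from users as y
        where y.useridp1 = x.useridp1
          and y.useridp2 = x.useridp2
          and y.qualifier = 'TRUE'
    )
    order by x.data
    """
).fetchall()

print(kept)  # [(1, 2, 'TRUE', 'AB'), (1, 2, None, 'CD')]
```

Both rows of user (1, 2) are kept, including the one without a qualifier, and user (1, 3) is dropped entirely.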

This should do the trick. If the id fields are stored as integers, you will need to cast them to varchar before concatenating them (or simply partition by id1, id2 instead):
SELECT 1 as id1,2 as id2,'TRUE' as qualifier,'AB' as data into #sampled
UNION ALL SELECT 1,2,NULL,'CD'
UNION ALL SELECT 1,3,NULL,'EF'
UNION ALL SELECT 1,3,NULL,'GH'
;WITH data as
(
SELECT
id1
,id2
,qualifier
,data
,SUM(CASE WHEN qualifier = 'TRUE' THEN 1 ELSE 0 END)
OVER (PARTITION BY id1 + '' + id2) as num_qualifier
from #sampled
)
SELECT
id1
,id2
,qualifier
,data
from data
where num_qualifier > 0

Select *
from yourTable
INNER JOIN (Select UserIDP1, UserIDP2 FROM yourTable WHERE Qualifier=TRUE) B
ON yourTable.UserIDP1 = B.UserIDP1 and YourTable.UserIDP2 = B.UserIDP2

How about a subquery as a where clause?
SELECT *
FROM theTable t1
WHERE CAST(t1.useridp1 AS VARCHAR) + CAST(t1.useridp2 AS VARCHAR) IN
(SELECT CAST(t2.useridp1 AS VARCHAR) + CAST(t2.useridp2 AS VARCHAR)
FROM theTable t2
WHERE t2.qualifier = 'TRUE'
);

This is a solution in MySQL, but I believe it should transfer to SQL Server pretty easily. Use a subquery to pick out the (id1, id2) combinations with at least one true 'qualifier' row, then join that to the original table on (id1, id2).
mysql> SELECT u1.*
FROM users u1
JOIN (SELECT id1,id2
FROM users
WHERE qualifier
GROUP BY id1, id2) u2
USING(id1, id2);
+------+------+-----------+------+
| id1 | id2 | qualifier | data |
+------+------+-----------+------+
| 1 | 2 | 1 | aa |
| 1 | 2 | 0 | bb |
+------+------+-----------+------+
2 rows in set (0.00 sec)
