How can I retrieve data from a Hive table from two columns with non null values and top 500 records in one query? - hive

I have a Hive table (my_table) in ORC format with 30 columns. Two of the columns (col_us, col_ds) store numeric values which can be 0, null, or some integer. The table is partitioned by day and hour.
A day's partition holds approximately 8 million x 96 records, and I am querying 15 daily partitions.
Currently I am running two separate queries to retrieve the top 500 records with a value greater than 0, using a rank function: one query for col_us and another for col_ds.
It is possible that col_us has a numeric value while col_ds is 0 or null.
Question:
I want to retrieve top 500 non null and non 0 records from each of these columns from one query.
My Query:
FROM (
    SELECT D.COL_US, D.DATESTAMP,
           ROW_NUMBER() OVER (PARTITION BY D.ID, D.SUB_ID
                              ORDER BY CONCAT(D.DATESTAMP, D.HOURSTAMP, D.TIMESTAMP) DESC) AS RNK
    FROM ${wf_table_name} D
    WHERE DATESTAMP >= '${datestamp_15}' AND DATESTAMP < '${datestamp}'
      AND COL_US > 0
) T
INSERT OVERWRITE TABLE ${wf_us_table}
SELECT T.COL_US, T.DATESTAMP, T.RNK WHERE T.RNK < 500;

As per your query, I guess that you are trying to get the top 500 rows from your table based on date/time, i.e. the latest 500 rows where col_us and col_ds both have a value greater than 0, but not the top 500 from each of these columns.
As per your question, your table may contain rows like the following:

col_us | col_ds
0      | 5
NULL   | 10
10     | 0
5      | NULL

or rows where both columns have values greater than 0.
So instead of 'AND COL_US > 0' in the WHERE clause, use 'AND (COL_US > 0 AND COL_DS > 0)'.
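A minimal sketch of the query with that combined filter, assuming the same table and workflow variables as your original query:

FROM (
    SELECT D.COL_US, D.COL_DS, D.DATESTAMP,
           ROW_NUMBER() OVER (PARTITION BY D.ID, D.SUB_ID
                              ORDER BY CONCAT(D.DATESTAMP, D.HOURSTAMP, D.TIMESTAMP) DESC) AS RNK
    FROM ${wf_table_name} D
    WHERE DATESTAMP >= '${datestamp_15}' AND DATESTAMP < '${datestamp}'
      AND (COL_US > 0 AND COL_DS > 0)  -- combined filter: keep rows where both columns are positive
) T
INSERT OVERWRITE TABLE ${wf_us_table}
SELECT T.COL_US, T.COL_DS, T.DATESTAMP, T.RNK WHERE T.RNK < 500;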
But with that condition you will not get any of the four rows shown above.
So if you want to get 10 and 5 from col_us along with 5 and 10 from col_ds, then I should say it is not possible using a single query.
Again, since your question states "I want to retrieve top 500 non null and non 0 records from each of these columns from one query.",
I guess that you want to get the top 500 records based on the values of col_us/col_ds themselves; in that case you must use these columns in the ranking clause instead of date/time.
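For example, a sketch ranking by the column value itself instead of date/time (shown for col_us; the same pattern would apply to col_ds):

FROM (
    SELECT D.COL_US, D.DATESTAMP,
           ROW_NUMBER() OVER (ORDER BY D.COL_US DESC) AS RNK  -- rank on the value, not the timestamp
    FROM ${wf_table_name} D
    WHERE DATESTAMP >= '${datestamp_15}' AND DATESTAMP < '${datestamp}'
      AND COL_US > 0
) T
INSERT OVERWRITE TABLE ${wf_us_table}
SELECT T.COL_US, T.DATESTAMP, T.RNK WHERE T.RNK <= 500;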
What you want to retrieve may be achievable with an UPDATE query depending on the other available columns, but before that I would ask you to share exactly what you want (top 500 based on col_us/col_ds, or the latest 500) along with your base and target table structures.

Related

Firebird select distinct with count

In Firebird 2.5 I have a table of hardware device events; each row contains a timestamp, a device ID and an integer status of the event. I need to retrieve a rowset of the subset of IDs with non-0 statuses and the number of instances of the non-0 events for each ID, within a specified date range. I can get the subset of IDs with non-0 statuses in the specified date range, but I can't figure out how to get the count of non-0-status rows associated with each ID in the same rowset. I'd prefer to do this in a query rather than a stored proc, if possible.
The table is:
RPR_HISTORY
TSTAMP timestamp
RPRID integer
PARID integer
LASTRES integer
LASTCUR float
The rowset I want is like
RPRID ERRORCOUNT
-------------------
18 4
19 2
66 7
The query
select distinct RPRID from RPR_HISTORY
where (LASTRES <> 0)
and (TSTAMP >= :STARTSTAMP);
gives me the IDs I'm looking for, but obviously not the count of non-0-status rows for each ID. I've tried a bunch of combinations of nested queries derived from the above; all fail, usually with grouping or aggregation errors. It seems like a straightforward thing to do, but it keeps escaping me.
Got it! The query
select rh.RPRID, count(rh.RPRID) from RPR_HISTORY rh
where (rh.LASTRES <> 0)
and (rh.TSTAMP >= :STARTSTAMP)
and rh.RPRID in
(select distinct rd.RPRID from RPR_HISTORY rd where rd.LASTRES <> 0)
group by rh.RPRID;
returns the rowset I need.
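Note that the IN subquery is redundant: the outer WHERE clause already excludes rows with LASTRES = 0, so a plain grouped query should return the same rowset:

select RPRID, count(*) as ERRORCOUNT
from RPR_HISTORY
where (LASTRES <> 0)
  and (TSTAMP >= :STARTSTAMP)
group by RPRID;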

How do I aggregate data in sql for multiple rows of data by column name?

Hi, I'm new to SQL and trying to understand how to work with data structures. I have a table
fact.userinteraction
interactionuserkey visitdatecode
0 20220404
1 20220404
5 20220402
5 20220128
If an interactionuserkey value repeats, I want a column called number of visits. In this case, interactionuserkey 5 has 2 total visits since it is repeated twice; for interactionuserkey 0, number of visits = 1, and so on. Basically, count the duplicates in column 1 and give the total count AS number of visits. How do I do this?
In SQL, this is solved with basic aggregation:
select interactionuserkey, count(*) as number_of_visits
from your_table
group by interactionuserkey;
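If you instead want to keep every visit row and carry the count alongside it as an extra column, a window-function variant (assuming your DBMS supports window functions) would be:

select interactionuserkey,
       visitdatecode,
       count(*) over (partition by interactionuserkey) as number_of_visits
from your_table;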

Most Efficient way to Search Massive Redshift Table for Duplicate Values

I have a large Redshift table (hundreds of millions of rows, with ~50 columns per row).
I need to find rows that have duplicate values in a specific column.
Example:
if my table has the columns 'column_of_interest' and 'date_time', then in those hundreds of millions of rows I need to find all the instances where 'column_of_interest' occurs more than once within a certain 'date_time' range.
eg:
column_of_interest date_time
ROW 1: ABCD-1234 165895896565
ROW 2: FCEG-3434 165895896577
ROW 3: ABCD-1234 165895986688
ROW 4: ZZZZ-9999 165895986689
ROW 5: ZZZZ-9999 165895987790
In the above, since ROW 1 and ROW 3 have the same column_of_interest, I would like that column_of_interest returned; ROW 4 and ROW 5 match as well, so I would like that value returned too.
So the end result would be:
duplicates
ABCD-1234
ZZZZ-9999
I have found a few things online, but the table is so large that the query times out before any results are returned. Am I going about this the wrong way? Here are a couple of queries I tried just to get some results back (but they time out before returning):
SELECT column_of_interest, COUNT(*)
FROM my_table
GROUP BY column_of_interest
HAVING COUNT(*) > 1
WHERE date_time >= 1601510400000 AND date_time < 1601596800000
LIMIT 200
SELECT a.*
FROM my_table a
JOIN (SELECT column_of_interest, COUNT(*)
FROM my_table
GROUP BY column_of_interest
HAVING count(*) > 1 ) b
ON a.column_of_interest = b.column_of_interest
ORDER BY a.column_of_interest
LIMIT 200
This should be a fine method, and it should not time out. Your first version has a syntax error: the WHERE clause must come before GROUP BY and HAVING.
So try:
SELECT column_of_interest, COUNT(*)
FROM my_table
WHERE date_time >= 1601510400000 AND date_time < 1601596800000
GROUP BY column_of_interest
HAVING COUNT(*) > 1
LIMIT 200
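If you need the full rows rather than just the duplicated values, the join variant from your question should also work once the same date_time filter is pushed into both the outer query and the subquery; a sketch reusing the epoch-millisecond bounds above:

SELECT a.*
FROM my_table a
JOIN (SELECT column_of_interest
      FROM my_table
      WHERE date_time >= 1601510400000 AND date_time < 1601596800000
      GROUP BY column_of_interest
      HAVING COUNT(*) > 1) b
  ON a.column_of_interest = b.column_of_interest
WHERE a.date_time >= 1601510400000 AND a.date_time < 1601596800000
ORDER BY a.column_of_interest
LIMIT 200;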

SQL - Update top n records for each value in column a where n = count of column b

I have one table with the following columns and sample values:
[test]
ID | Sample | Org | EmployeeNumber
1  |        | 100 | 6513241
2  |        | 200 | 3216542
3  |        | 300 | 5649841
4  |        | 100 | 9879871
5  |        | 200 | 6546548
6  |        | 100 | 1116594
My example count query based on [test] returns these sample values grouped by Org:
Org | Count of EmployeeNumber
100 | 3
200 | 2
300 | 1
My question is can I use this count to update test.Sample to 'x' for the top 3 records of Org 100, the top 2 records of Org 200, and the top 1 record of Org 300? It does not matter which records are updated, as long as the number of records updated for the Org = the count of EmployeeNumber.
I realize that I could just update all records in this example but I have 175 Orgs and 900,000 records and my real count query includes an iif that only returns a partial count based on other columns.
The db that I am taking over uses a recordset and loop to update. I am trying to write this in one SQL update statement. I have tried several variations of nested select statements but can't quite figure it out. Any help would save my brain from exploding. Thanks!
Assuming that id is the unique ID of the row, you could use a correlated subquery: for each row, count the rows that share the current row's organization and have an ID less than or equal to the current row's ID, and check that this count is less than or equal to the number of records from that organization you want to mark.
For example to mark 3 records of the organization 100 you could use:
UPDATE test
SET sample = 'x'
WHERE org = 100
AND (SELECT count(*)
FROM test t
WHERE t.org = test.org
AND t.id <= test.id) <= 3;
And analogously for the other cases.
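If you want to fold the per-org threshold into a single statement, here is a sketch that takes the threshold from the count of EmployeeNumber per organization (untested, and assuming your DBMS accepts correlated subqueries on both sides of the comparison; Access may be more restrictive):

UPDATE test
SET sample = 'x'
WHERE (SELECT COUNT(*)
       FROM test t
       WHERE t.org = test.org
         AND t.id <= test.id)           -- position of this row within its org
   <= (SELECT COUNT(t2.EmployeeNumber)  -- per-org threshold
       FROM test t2
       WHERE t2.org = test.org);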
(Disclaimer: I don't have access to Access (ha, ha, pun), so I could not test it. But I guess it's basic enough, to work in almost every DBMS, also in Access.)

Split a query result based on the result count

I have a query based on basic criteria that will return X number of records on any given day.
I'm trying to check the result of the basic query and then apply a percentage split to it based on the total X, dividing it into 2 buckets. Each bucket will be a percentage of the total query result returned in X.
For example:
Query A returns 3500 records.
If the number of records returned from Query A is <= 3000, then split the 3500 records into a 40% / 60% split (1,400 / 2,100).
If the number of records returned from Query A is >= 3001 and <= 50,000, then split the records into a 10% / 90% split. Etc.
I want the actual records returned, not just the math acting on the records that returns one row with a number in it.
I'm not sure how you want to display the different parts of the resulting set of rows, so I've just added an additional column (part) to the result set, containing the value 1 for rows that belong to the first part and 2 for the second part.
select z.*
     , case
         when cnt_all <= 3000 and cnt <= 40 then 1
         when (cnt_all between 3001 and 50000) and (cnt <= 10) then 1
         else 2
       end part
from (select t.*
           , 100 * (count(col1) over (order by col1) / count(col1) over ()) cnt
           , count(col1) over () cnt_all
      from split_rowset t
      order by col1
     ) z
Demo #1: number of rows 3000.
Demo #2: number of rows 3500.
For better usability you can create a view using the query above and then query that view, filtering by the part column.
Demo #3: using a view.
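A minimal sketch of such a view (the view name split_parts is assumed; the order by is dropped because many DBMSs reject order by inside a view definition):

create view split_parts as
select z.*
     , case
         when cnt_all <= 3000 and cnt <= 40 then 1
         when (cnt_all between 3001 and 50000) and (cnt <= 10) then 1
         else 2
       end part
from (select t.*
           , 100 * (count(col1) over (order by col1) / count(col1) over ()) cnt
           , count(col1) over () cnt_all
      from split_rowset t) z;

-- then pull out one bucket at a time:
select * from split_parts where part = 1;
select * from split_parts where part = 2;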