HIVE table - Get records which are closest to a given numeric value - sql

From a Hive table, I want the records that are closest to a given value for each of several columns.
For example:
The table has columns total_score, avg_score, etc. I want the records whose total_score and avg_score are close or equal to a given value.
Note: the table has approximately 183 million rows, and I want the 150,000 records that are closest or equal to the given value for each column.
Please help me with how to do this.

The general approach is a top-x query: order the rows by the absolute value of the difference between the parameter value and the column value, then keep the first x rows.
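A minimal sketch of that idea, run against an in-memory SQLite database rather than Hive so it can be executed anywhere. The table name scores, the sample rows, and the choice to add the two absolute differences into one distance are assumptions for illustration; the question does not say how the columns should be combined. Hive also provides abs(), ORDER BY and LIMIT, so the query shape should carry over, though a global ORDER BY on 183 million rows is expensive there.

```python
import sqlite3

# Hypothetical miniature of the table; column names follow the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (id INTEGER, total_score REAL, avg_score REAL)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [(1, 100, 10.0), (2, 480, 48.0), (3, 510, 50.5), (4, 900, 90.0), (5, 495, 52.0)],
)


def closest_rows(conn, target_total, target_avg, n):
    """Return the ids of the n rows nearest to both target values."""
    # Assumption: the sum of the two absolute differences is the distance.
    # The id tie-breaker only makes the ordering deterministic.
    query = """
        SELECT id
        FROM scores
        ORDER BY ABS(total_score - ?) + ABS(avg_score - ?), id
        LIMIT ?
    """
    return [row[0] for row in conn.execute(query, (target_total, target_avg, n))]


nearest = closest_rows(conn, 500, 50.0, 3)
print(nearest)
```

If each column needs its own independent top x instead, the same query can be run once per column with a single ABS(...) term in the ORDER BY.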

Related

How can I remove the NULL values from the first column but keep the values of the second and third columns

I am Omar, a new learner of SQL.
I have a large Excel sheet that I want to analyze with SQL.
It has the following columns: Manufacturers, Products, Sales.
The problem is that in the first column, Manufacturers, each manufacturer's name has been entered only once. The cells in the rows below it are empty until the next manufacturer.
Please refer to the attached image for more understanding.
How can I remove these NULL values from my query results while keeping the values of the product column?
Thank you.
The main problem you have is that SQL tables represent unordered sets. So, if you have only your specified columns, you cannot reconstruct the Excel format.
To solve this, you want to load the data into a table that has an identity or auto-incremented column, in order to preserve the insertion order. The exact details depend on the database. Let me call this column id.
Then you can "spread" the value where it is missing. One method is:
select t.*,
       max(manufacturer) over (partition by manufacturer_grp) as imputed_manufacturer
from (select t.*,
             count(manufacturer) over (order by id) as manufacturer_grp
      from t
     ) t;
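To see why this works, here is a small runnable sketch using Python's built-in sqlite3 module. It assumes an SQLite build with window-function support (3.25 or later), and the sample rows are invented for illustration. count(manufacturer) ignores NULLs, so the running count only increases on rows where a name is present, which gives every row in the same block the same group number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# id is the auto-incremented column that preserves the spreadsheet's row order.
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "manufacturer TEXT, product TEXT, sales INTEGER)"
)
# Invented sample data: the name appears only on the first row of each block.
conn.executemany(
    "INSERT INTO t (manufacturer, product, sales) VALUES (?, ?, ?)",
    [("Acme", "Bolt", 10), (None, "Nut", 5), (None, "Washer", 7),
     ("Zenith", "Gear", 3), (None, "Chain", 8)],
)

query = """
    SELECT MAX(manufacturer) OVER (PARTITION BY manufacturer_grp) AS imputed_manufacturer,
           product, sales
    FROM (SELECT t.*,
                 COUNT(manufacturer) OVER (ORDER BY id) AS manufacturer_grp
          FROM t) AS g
    ORDER BY id
"""
filled = conn.execute(query).fetchall()
for row in filled:
    print(row)
```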

Get latest data for all people in a table and then filter based on some criteria

I am attempting to return the row with the highest timestamp (an integer) for each person (each has multiple entries) in a table. Additionally, I am only interested in rows whose field contains ABCD, but this filter should be applied after finding the latest (max timestamp) entry for each person.
SELECT table."person", max(table."timestamp")
FROM table
WHERE table."type" = 1
HAVING table."field" LIKE '%ABCD%'
GROUP BY table."person"
For some reason, I am not receiving the data I expect. The returned table is nearly twice the expected size. Is there some step here that I am not getting right?
You can first build a derived table holding max(timestamp) per person, join it back to the table to recover each person's latest row, and then filter in the outer query:
SELECT t."person", t."timestamp"
FROM table t
JOIN (SELECT "person", MAX("timestamp") AS max_ts FROM table GROUP BY "person") m
  ON m."person" = t."person" AND m.max_ts = t."timestamp"
WHERE t."type" = 1 AND t."field" LIKE '%ABCD%'
Direct answer: as I understand your end goal, just move the HAVING clause to the WHERE section:
SELECT
table."person", MAX(table."timestamp")
FROM table
WHERE
table."type" = 1
AND table."field" LIKE '%ABCD%'
GROUP BY table."person";
This should return no more than 1 row per table."person", with their associated maximum timestamp.
As an aside, I am surprised your query worked at all. Your HAVING clause references a column that appears in neither the SELECT list nor the GROUP BY clause. From the documentation (and my experience):
The fundamental difference between WHERE and HAVING is this: WHERE selects input rows before groups and aggregates are computed (thus, it controls which rows go into the aggregate computation), whereas HAVING selects group rows after groups and aggregates are computed.
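The order of filtering and aggregating changes the result, which a small experiment makes visible. The sketch below uses Python's sqlite3 module with a hypothetical table named events (table itself is a reserved word) and invented rows. It compares filtering on the field before taking the maximum with finding each person's latest row first and filtering afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE events (person TEXT, "timestamp" INTEGER, type INTEGER, field TEXT)'
)
# Invented data: ann's latest row does not contain ABCD, bob's latest row does.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [("ann", 1, 1, "xxABCDxx"), ("ann", 2, 1, "other"),
     ("bob", 1, 1, "other"), ("bob", 2, 1, "ABCD")],
)

# Filter first (WHERE), then aggregate: ann's older ABCD row still qualifies.
filter_then_latest = conn.execute("""
    SELECT person, MAX("timestamp")
    FROM events
    WHERE type = 1 AND field LIKE '%ABCD%'
    GROUP BY person
    ORDER BY person
""").fetchall()

# Find each person's latest row first, then filter: ann drops out entirely.
latest_then_filter = conn.execute("""
    SELECT e.person, e."timestamp"
    FROM events AS e
    JOIN (SELECT person, MAX("timestamp") AS max_ts
          FROM events
          WHERE type = 1
          GROUP BY person) AS m
      ON m.person = e.person AND m.max_ts = e."timestamp"
    WHERE e.type = 1 AND e.field LIKE '%ABCD%'
    ORDER BY e.person
""").fetchall()

print(filter_then_latest)
print(latest_then_filter)
```

Which of the two is wanted depends on whether the ABCD condition should apply to any of a person's rows or only to their most recent one.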

SQL Query to count multiple values from one table into specific view

I would like to request your help. I can get the results separately, but now I want to create a single query that presents them cleanly for an external person. My explanation:
I have a statistics database containing a table that receives records, and each record has several columns of values.
One of these columns is called "MT".
The MT column can hold only one of the following values per record: A, B, C, D, E.
The records also have a column called TotalAmount, which indicates the size of a value outside the database. TotalAmount is numeric without decimals and can have a value between 1 and 10,000.
The last part is the records themselves: the table has X records.
So basically I need a query that separates the MT values and calculates the number of records per MT and the sum of TotalAmount.
This is on SQL Server 2005.
Many thanks for your assistance!
It is hard to guess without the full database schema, but I think you need:
SELECT MT, COUNT(*), SUM(TotalAmount)
FROM YourTable
GROUP BY MT
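As a quick check of what this grouping returns, here is a sketch that runs the same kind of query through Python's sqlite3 module instead of SQL Server 2005. The sample rows and the column aliases record_count and total_amount are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (MT TEXT, TotalAmount INTEGER)")
# Invented sample records; MT is one of A..E and TotalAmount is 1..10000.
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?)",
    [("A", 100), ("B", 250), ("A", 50), ("C", 10000), ("B", 1)],
)

# One output row per MT value, with its record count and summed amount.
summary = conn.execute("""
    SELECT MT, COUNT(*) AS record_count, SUM(TotalAmount) AS total_amount
    FROM YourTable
    GROUP BY MT
    ORDER BY MT
""").fetchall()

for mt, record_count, total_amount in summary:
    print(mt, record_count, total_amount)
```

MT values that never occur in the table (D and E in this sample) produce no row at all; listing them with a count of zero would need a join against a list of the allowed values.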

Drill on Hbase: issue with SELECT *

I created a test table in HBase with 2 column families and 3 columns in each, all holding the same value, 100,000 rows in total.
SELECT COUNT(*) does not return the correct row count (it returns far fewer rows).
However, a SELECT on a subset of those columns gives the right count.
If I reduce the number of columns to 2 in this case, SELECT COUNT(*) gives the right count.
Later I tried tables with only one column family; Drill returns the right count only when the number of columns is smaller than 6.
What could be the reason for this?
Have I missed any Drill configuration?

Find first of a set of related records, dependent on a value in another row

I have a table in which a row represents a subsection of a larger block of data.
Given an input parameter which identifies one of these rows, I would like to return a different row which represents the root record.
Specifically I would like to retrieve the first record in this set.
For example:
Find the row with column X value of Y.
Get the value A, of column Z.
Return the first row with column Z value of A.
What is the best way of doing this?
Two separate queries on the original table?
A single query on the original table?
Construct a new view that would enable a single query?
Something else?
If you do not order the rows by some column, "first" is undefined and you may simply get the same row back, but the general answer is:
select top 1 *
from table
where Z = (select Z from table where X = #parameter)
order by ???
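A runnable sketch of this single-query approach, using Python's sqlite3 module. The table name sections and the ordering column seq are hypothetical: the answer leaves the ordering column open, and some such column is needed before "first" means anything. SQLite uses LIMIT 1 where SQL Server uses top 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# X identifies a row, Z identifies its block, seq is a hypothetical ordering column.
conn.execute("CREATE TABLE sections (X TEXT, Z TEXT, seq INTEGER)")
conn.executemany(
    "INSERT INTO sections VALUES (?, ?, ?)",
    [("r1", "A", 1), ("r2", "A", 2), ("r3", "A", 3), ("r4", "B", 1)],
)


def root_row(conn, x_value):
    """Return the first row (by seq) of the block that contains x_value."""
    query = """
        SELECT X, Z, seq
        FROM sections
        WHERE Z = (SELECT Z FROM sections WHERE X = ?)
        ORDER BY seq
        LIMIT 1
    """
    return conn.execute(query, (x_value,)).fetchone()


root = root_row(conn, "r3")
print(root)
```

The scalar subquery assumes X is unique; if several rows could share the same X value, the lookup of Z would need its own rule for picking one.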