How to query the front and back n rows of a piece of random row in hive - hive

After selecting a random row, I want to be able to select n number of records preceding and following it.
example:
id content
1 add
2 bob
3 cdf
4 asd
random row id is 3,i need select result:
2 bob
3 cdf
4 asd

Related

How to sum values of two columns by an ID column, keeping some columns with repeated values and excluding others?

I need to organize a large df adding values of a column by a column ID (the ID is not sequencial), keeping some columns of the df that have repeated values by ID and excluding column that have different values by ID. Below I inserted a reproducible example and the output I need. I think there is a simple way to do that, but I am not soo familiar with R.
df=read.table(textConnection("
ID spp effort generalist specialist
1 a 10 1 0
1 b 10 1 0
1 c 10 0 1
1 d 10 0 1
2 a 16 1 0
2 b 16 1 0
2 e 16 0 1
"), header = TRUE)
The output I need:
ID effort generalist specialist
1 10 2 2
2 16 2 1

Can I split a dynamic semicolon delimited string into columns using T-SQL?

I have a table that looks like so:
Id SubNumber Values
1 1 1;4;8;3
2 2 8;9;7;10
3 3 41;45;23;0
I will not always only have 4 values and the number of "SubNumbers" can be greater than 3. Is there any way I can query this table to look like this?
Id SubNumber 1 2 3 4
1 1 1 4 8 3
2 2 8 9 7 10
2 3 41 45 23 0
The rows will always have the same number of values delimited by a semicolon but the amount separated by a semicolon can vary. So a table may even have 10 values or 1 or more.
The 2nd table doesn't have to have numbers to represent the values. It can even be blank or the default that is given by SQL when no name is provided.
This is not a duplicate of the example provided because this deals with separating a dynamic number of values into columns.

SQL query to pull ONLY data which contains range of records

I need your help to write query with select statement to pull only claim IDs which contains code from this range: 99213, 99214, 99215, 99217.
So my results should be claim ID 1 (all lines) and claim ID 3 (all lines). Since claim ID 2 has codes which are outside of the range, i do not want that in my results.
Claim id line # code
1 1 99213
1 2 99214
1 3 99215
1 4 99217
2 1 99213
2 2 89557
2 3 36415
3 1 99215
3 2 99217
Result should be like this
Claim id line # code
1 1 99213
1 2 99214
1 3 99215
1 4 99217
3 1 99215
3 2 99217
Use a subquery to isolate ClaimIDs that have Codes outside of your list of values. Then rule them out of the main query with a not in.
SELECT *
FROM Table
WHERE ClaimID NOT IN (
SELECT ClaimID FROM Table WHERE Code NOT IN (99213,99214,99215,99217)
);

Removing duplicates from many excel sheets

I got a question if there is any fast way to remove duplicate rows across two excel spreadsheets. After searching I can do it by comparing the same rows in the spreadsheets (VBA). But I want to check whether the row from one is included anywhere in two. If exactly the same row exists in two it should be removed. So far I can do it if they are the same rows (e.g. 1 and 1).
Thanks in advance for any kind of help.
I can think of a workaround for this:
Create a column at the end of each row which is concatenation of all the columns of that particular row: Lets sat below are the two tables on the two excel sheets:
sheet1
A B C D(Concat)
1 2 3 123
4 5 6 456
7 8 9 789
1 3 5 135
4 3 2 432
sheet2
A B C D(Concat)
2 3 4 234
1 1 1 111
1 2 3 123
2 2 2 222
4 5 6 456
We will now identify the duplicate rows based on the last concatenated column. Using the formula =IF(ISNUMBER(MATCH(D4,Sheet1!D:D,0)),"DUP","NONDUP") in the second sheet, we can identify the rows which are already present in sheet1 irrespective of the sequence of the row in sheet1 wrt sheet2.
Result on Sheet2 shows up as below:
A B C D E(Result)
2 3 4 234 NONDUP
1 1 1 111 NONDUP
1 2 3 123 DUP
2 2 2 222 NONDUP
4 5 6 456 DUP

How can I identify the text values that have the lowest row IDs across 4 columns?

I found a few articles that are close, but not the same as what I am trying to do. I have an Excel file that has 4 columns of duplicated data, each column is sorted based on a numeric value that came from a different worksheet.
I need to identify the 25(or so?) rows where the value of the four columns match, and the row ID is the lowest. There will be roughly 250 rows of data to sift through, so I only really need the top 10%.
I don't HAVE to approach it this way. I can dump this data into Access if this cannot be done in Excel. Or I can assign columns next to each text column (a way of assigning IDs to each field in column 1, 2, 3, and 4) and use those values. The approach is negotiable, as long as the outcome works.
Here's what my data looks like in Excel:
A B C D
abc bcd abc def
cde fgh def bcd
def def bcd abc
bcd hji xyz lmn
So in this case I would want to highlight (or somehow identify) the value "def" because it appears closest to the top of all 4 columns, hence it has the lowest row ID. The value "bcd" would be second on the list since it also is identified in all 4 and has a low row id.
Any suggestions would be appreciated. I know SQL fairly well, so if you think dumping it in a DB would be best and you can suggest a query that would be awesome. But ideally... keeping it in Excel would be the least amount of work for me. I'm open to formulas, conditional formatting, etc.
Thanks!!
I THINK I came up with a fairly cool solution...
So, supposing you have this data in columns A-D, begining in cell A2, say.
Now, you know that you ONLY want values if they already exist in column A - Otherwise they're not in all 4 columns.
So:
In E2, type in the formula =Row() - This basically says where A's value is located
In F2, type in =Match($A2,B:B,0) - This will find the first match for A2's value in columns B
Drag that formula across to G2 & H2 (to find the first match for A2's value in C & D respectively).
In I2, type in the formula =Sum(E2:H2)
Now, drag E:H down for your entire dataset.
So, If H = #N/A, that means the values weren't in all 4 columns
And the lower the value for H, the lower the rank of the match - (Column A's text being the value you're matching for).
Now you could sort according to Column H, etc, to suit your needs.
Hope this does the trick (and makes sense)!
Cool Q, BTW!!!
Do you have, or can you create, a master list of all of the possible cell values? If so, then some simple VLOOKUPs on each of the 4 data columns could give, for each unique cell value, the row number in each column. Add up the 4 reesults and sort on the total.
If you don't have the master list of unique values, I'd tend to go to Access because it's a pretty easy set of queries to get what you want.
Clarification Needed
When I first came up with this answer I used the same approach that John used in his clever Excel answer, namely to use the sum of the minimum rows per column to produce the rank. That produces the sample result in the question, but consider the following modified test data:
F1 F2 F3 F4 RowNum
--- --- --- --- ------
XXX bar baz bat 1
foo XXX baz bat 2
YYY bar XXX bat 3
foo YYY baz bat 4
foo bar YYY bat 5
foo bar baz YYY 6
foo bar baz bat 7
foo bar baz bat 8
foo bar baz bat 9
foo bar baz XXX 10
XXX appears in rows 1, 2, 3, and 10, so the sum would be 16. YYY appears in rows 3, 4, 5, and 6 so the sum would be 18. Ranking by sum would declare XXX the winner, even though if you started scanning for XXX from row 1 you would have to go all the way to row 10 to reach the last XXX, whereas if you started scanning for YYY from row 1 you would only have to go down to row 6 to reach the last YYY.
In this case should YYY actually be the winner?
(original answer)
The following code will import the Excel data into Access and add a [RowNum] column
Sub ImportExcelData()
On Error Resume Next '' in case it doesn't already exist
DoCmd.DeleteObject acTable, "ExcelData"
On Error GoTo 0
DoCmd.TransferSpreadsheet acImport, acSpreadsheetTypeExcel12Xml, "ExcelData", "C:\Users\Gord\Documents\ExcelData.xlsx", False
CurrentDb.Execute "ALTER TABLE ExcelData ADD COLUMN RowNum AUTOINCREMENT(1,1)", dbFailOnError
End Sub
So now we have an [ExcelData] table in Access like this
F1 F2 F3 F4 RowNum
--- --- --- --- ------
abc bcd abc def 1
cde fgh def bcd 2
def def bcd abc 3
bcd hji xyz lmn 4
Let's create a saved query named ExcelItems in Access to string the entries out in a long "list"...
SELECT F1 AS Item, RowNum, 1 AS ColNum FROM ExcelData
UNION ALL
SELECT F2 AS Item, RowNum, 2 AS ColNum FROM ExcelData
UNION ALL
SELECT F3 AS Item, RowNum, 3 AS ColNum FROM ExcelData
UNION ALL
SELECT F4 AS Item, RowNum, 4 AS ColNum FROM ExcelData
...returning...
Item RowNum ColNum
---- ------ ------
abc 1 1
cde 2 1
def 3 1
bcd 4 1
bcd 1 2
fgh 2 2
def 3 2
hji 4 2
abc 1 3
def 2 3
bcd 3 3
xyz 4 3
def 1 4
bcd 2 4
abc 3 4
lmn 4 4
Now we can find the lowest RowNum where Item is found for each ColNum...
TRANSFORM Min(ExcelItems.[RowNum]) AS MinOfRowNum
SELECT ExcelItems.[Item]
FROM ExcelItems
GROUP BY ExcelItems.[Item]
PIVOT ExcelItems.[ColNum] In (1,2,3,4);
...returning...
Item 1 2 3 4
---- - - - -
abc 1 1 3
bcd 4 1 3 2
cde 2
def 3 3 2 1
fgh 2
hji 4
lmn 4
xyz 4
If we save that query as ExcelItems_Crosstab then we can use it to rank the items that appear in all four columns:
SELECT Item, [1]+[2]+[3]+[4] AS Rank
FROM ExcelItems_Crosstab
WHERE ([1]+[2]+[3]+[4]) IS NOT NULL
ORDER BY 2
...returning...
Item Rank
---- ----
def 9
bcd 10