How to count all rows in raw data file using Hive? - sql

I am reading some raw input which looks something like this:
20 abc def
21 ghi jkl
mno pqr
23 stu
Note the first two rows are "good" rows and the last two rows are "bad" rows since they are missing some data.
Here is the snippet of my hive query which is reading this raw data into a readonly external table:
DROP TABLE IF EXISTS readonly_s3;
CREATE EXTERNAL TABLE readonly_s3 (id string, name string, data string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
I need to get the count of ALL the rows, both "good" and "bad." The problem is some of the data is missing, and if I do SELECT count(id) as total_rows for example, that doesn't work since not all the rows have an id.
Any suggestions on how I can count ALL the rows in this raw data file?

Hmmm . . . You can use:
select sum(case when id is not null and name is not null and data is not null then 1 else 0 end) as num_good,
sum(case when id is null or name is null or data is null then 1 else 0 end) as num_bad
from readonly_s3;
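If all you need is the total, COUNT(*) counts every row whether or not any column is NULL, while COUNT(id) skips rows where id is NULL. Here is a minimal sketch of that difference. It uses Python's built-in sqlite3 as a stand-in for Hive, and it assumes the missing fields end up as NULL after loading; how Hive actually splits a short line across the columns depends on the delimiters in the file. The function name count_rows and the sample tuples are made up for illustration.

```python
import sqlite3

def count_rows(rows):
    """Return (total_rows, rows_with_id, num_good, num_bad) for (id, name, data) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readonly_s3 (id TEXT, name TEXT, data TEXT)")
    conn.executemany("INSERT INTO readonly_s3 VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT COUNT(*)  AS total_rows,    -- every row, NULLs or not
               COUNT(id) AS rows_with_id,  -- only rows where id is not NULL
               SUM(CASE WHEN id IS NOT NULL AND name IS NOT NULL AND data IS NOT NULL
                        THEN 1 ELSE 0 END) AS num_good,
               SUM(CASE WHEN id IS NULL OR name IS NULL OR data IS NULL
                        THEN 1 ELSE 0 END) AS num_bad
        FROM readonly_s3
        """
    ).fetchone()
    conn.close()
    return result

# Assumption: the two "bad" rows are loaded with NULL in the missing column.
sample = [
    ("20", "abc", "def"),
    ("21", "ghi", "jkl"),
    (None, "mno", "pqr"),
    ("23", "stu", None),
]
print(count_rows(sample))  # (4, 3, 2, 2)
```

Note that num_good + num_bad equals total_rows, so either form gives the full count.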

Related

How to update a Column from a Bunch of other columns

I have a table A with five columns: Column 1, Column 2, Column 3, Column 4, and Column 5.
Columns 1 through 4 already contain data, and we need to update Column 5 from that data according to priority.
Column 1 has priority 5, Column 2 has priority 4, Column 3 has priority 3, and Column 4 has priority 2.
So if a row has values in all four columns, Column 5 should take the value from Column 1, since it has the highest priority.
If a row has data only in Columns 3 and 4, Column 5 should take the value from Column 3, since it has a higher priority than Column 4.
If Columns 1 through 4 are all empty, Column 5 should be null.
I have 24k records in my table and I need to run this for all rows.
Any pointers for this query would be highly appreciated.
I think you want coalesce() -- assuming that the columns with no values have NULL:
update t
set col5 = coalesce(col1, col2, col3, col4);
You can also put the coalesce() in a select, if you don't want to actually change the data.
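To see the priority order in action, here is a small sketch that runs the same UPDATE against an in-memory SQLite table from Python. The table layout, the helper name fill_col5, and the sample values are assumptions for illustration only. COALESCE returns its first non-NULL argument, so listing the columns from highest to lowest priority is all that is needed. It only skips NULL, so if "no data" is stored as an empty string, that empty string would be picked.

```python
import sqlite3

def fill_col5(rows):
    """Load (col1..col4) tuples, set col5 by priority, and return col5 per row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t (col1 TEXT, col2 TEXT, col3 TEXT, col4 TEXT, col5 TEXT)"
    )
    conn.executemany(
        "INSERT INTO t (col1, col2, col3, col4) VALUES (?, ?, ?, ?)", rows
    )
    # First non-NULL value wins, checked in priority order col1 -> col4.
    conn.execute("UPDATE t SET col5 = COALESCE(col1, col2, col3, col4)")
    result = [row[0] for row in conn.execute("SELECT col5 FROM t ORDER BY rowid")]
    conn.close()
    return result

sample = [
    ("a", "b", "c", "d"),      # all present: col1 wins
    (None, None, "c", "d"),    # only col3 and col4: col3 wins
    (None, None, None, None),  # nothing present: col5 stays NULL
]
print(fill_col5(sample))  # ['a', 'c', None]
```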

Get the sum of partial value in a column

Consider there is a table tableA
col1 col2
1 some random string and number 1213 aa5 string aaasome number
2 some random string 432682 aa3 test
1 aa7
I need to get the result as below.
1 12
2 3
The result is grouped by col1, so for col1 = 1 it is 5 + 7 = 12 (the integer that follows the 'aa' prefix in each row).
To clarify, col2 contains other strings as well, such as test test test aa2 again test test 23u45 ajsdk 4834... In that case I need to pick out only the 2.
Kindly suggest a solution for this.
You need to get rid of the prefix, cast to a number, and sum. One method looks like:
select col1, sum(cast(replace(col2, 'aa', '') as number))
from tablea a
group by col1;
You can use a regular expression to get the required digits from the string:
Select col1, sum(regexp_replace(col2,'(^|.*\s)aa(\d+)(\s.*|$)', '\2'))
From t
Group by col1
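For checking the expected numbers outside the database, the same idea can be sketched in Python: find the token made of 'aa' followed by digits, take the digits, and sum them per col1. The pattern below is my own approximation of the one above, not a translation of it, and the function name sum_aa_values is hypothetical. It requires 'aa' to start a whitespace-separated token, so text such as aaasome or 23u45 is ignored, and rows without such a token are skipped.

```python
import re
from collections import defaultdict

# 'aa' at the start of a token, followed only by digits up to the next space or end.
AA_NUMBER = re.compile(r"(?:^|\s)aa(\d+)(?=\s|$)")

def sum_aa_values(rows):
    """Sum the integer after the 'aa' prefix in col2, grouped by col1."""
    totals = defaultdict(int)
    for col1, col2 in rows:
        match = AA_NUMBER.search(col2)
        if match:  # skip rows with no 'aa<digits>' token
            totals[col1] += int(match.group(1))
    return dict(totals)

table_a = [
    (1, "some random string and number 1213 aa5 string aaasome number"),
    (2, "some random string 432682 aa3 test"),
    (1, "aa7"),
]
print(sum_aa_values(table_a))  # {1: 12, 2: 3}
```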

How to convert multiple rows into one row with multiple columns using Pivot in SQL

I have a log file that comes in as one column, and each record consists of multiple rows. Records are delimited by an empty (blank) row. How can I move each row of a record into its own column, and do it dynamically? Be gentle... I'm very new.
Column1
----------------------------
**empty row -Begin Record 1
row1data1
row1data2
row1data3
**empty row -Begin Record 2
row2data1
row2data2
row2data3
NULL
row2data5
**empty row - Begin Record 3
.
.
.
The results I would like:
Column1 Column2 Column3 Column4 Column5
----------------------------------------------------
row1data1 row1data2 row1data3
row2data1 row2data2 row2data3 NULL row2data5
.
.
.
Thanks in advance!
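One way to think about the reshaping, independent of any SQL dialect, is sketched below in Python. This is not a SQL PIVOT, only an illustration of the logic: start a new record at every blank row, collect the rows in between, and pad shorter records so every record has the same number of columns. It assumes a blank row arrives as an empty string and a NULL value as None; the function name lines_to_records is made up.

```python
def lines_to_records(lines):
    """Group rows separated by blank rows into records, padded to equal width."""
    records, current = [], []
    for line in lines:
        if line == "":           # blank row: close the current record
            if current:
                records.append(current)
            current = []
        else:                    # data row (None counts as a NULL value, not a blank)
            current.append(line)
    if current:                  # last record has no trailing blank row
        records.append(current)
    width = max((len(r) for r in records), default=0)
    return [r + [None] * (width - len(r)) for r in records]

log = [
    "", "row1data1", "row1data2", "row1data3",
    "", "row2data1", "row2data2", "row2data3", None, "row2data5",
]
for record in lines_to_records(log):
    print(record)
```

The column count here is taken from the longest record, which is what "dynamically" would have to mean in SQL as well: the number of output columns is not known until the data has been read.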

SQL query to return matrix

I have a set of rows with one column of actual data. The goal is to display this data in matrix format. The number of columns will remain the same; the number of rows may vary.
For example:
With 20 records and 5 columns, the number of rows would be 4.
With 24 records and 5 columns, the number of rows would be 5, and the 5th column in the 5th row would be empty.
With 18 records and 5 columns, the number of rows would be 4, and the 4th and 5th columns in the 4th row would be empty.
I was thinking of generating a column value for each row, with the value repeating every 5 rows. But I cannot get it to work; the error is "A SELECT statement that assigns a value to a variable must not be combined with data-retrieval operations".
I am not sure how this can be achieved.
Any advice will be helpful.
Further Addition - I have managed to generate the name value association with column name and value. Example -
Name1 Col01
Name2 Col02
Name3 Col03
Name4 Col01
Name5 Col02
You can use ROW_NUMBER to assign a sequential integer from 0 up. Then group by the result of integer division whilst pivoting on the remainder.
WITH T AS
(
SELECT number,
ROW_NUMBER() OVER (ORDER BY number) -1 AS RN
FROM master..spt_values
)
SELECT MAX(CASE WHEN RN%5 = 0 THEN number END) AS Col1,
MAX(CASE WHEN RN%5 = 1 THEN number END) AS Col2,
MAX(CASE WHEN RN%5 = 2 THEN number END) AS Col3,
MAX(CASE WHEN RN%5 = 3 THEN number END) AS Col4,
MAX(CASE WHEN RN%5 = 4 THEN number END) AS Col5
FROM T
GROUP BY RN/5
ORDER BY RN/5
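The arithmetic behind that query can be sketched in a few lines of Python, in case it helps to see it without the SQL around it: number the values from 0, use integer division by the column count to pick the output row, and use the remainder to pick the column. The function name to_matrix is hypothetical, and None stands in for the empty cells.

```python
def to_matrix(values, columns=5):
    """Lay values out row by row; cells past the last value stay None."""
    row_count = (len(values) + columns - 1) // columns  # round up
    matrix = [[None] * columns for _ in range(row_count)]
    for rn, value in enumerate(values):                  # rn plays the role of RN
        matrix[rn // columns][rn % columns] = value      # row = RN/5, column = RN%5
    return matrix

for row in to_matrix(list(range(1, 19))):  # 18 records, 5 columns
    print(row)
# The fourth row prints as [16, 17, 18, None, None]
```

With 18 values this gives 4 rows with the last two cells empty, which matches the third example in the question.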
In general:
SQL is for retrieving data, which in your case means all X records in one column.
Making a nice display of your data is usually the job of the software that queries SQL, e.g. your web/desktop application.
However, if you really want to build the display output in SQL, you could use a WHILE loop in connection with a row limit (TOP or OFFSET ... FETCH in SQL Server, which has no LIMIT clause) and PIVOT. You would select the first 5 records, then the next 5, and so on until finished.
Here is an example of how to use WHILE: http://msdn.microsoft.com/de-de/library/ms178642.aspx

Select MAX of multiple Attributes

I have a table that contains 3000 attributes (it's for a data mining experiment).
The table looks like this:
id attr1 attr2 attr3
a 0 1 0
a 1 0 0
a 0 0 0
a 0 0 1
I wish to have it in the format
id attr1 attr2 attr3
a 1 1 1
The values can only be 0 or 1, so I think taking the max of each column and grouping by the id would achieve this.
However, I don't want to type MAX(attr X) for each and every attribute.
Does anyone know a quick way of implementing this?
Thank you very much for your help in advance.
This is easy enough with group by:
select id, max(attr1) as attr1, max(attr2) as attr2, max(attr3) as attr3
from t
group by id
If you don't want to do all this typing, put your list of columns in Excel. Add in a formula such as =" max("&A1&") as "&A1&",". Then copy the cell down and copy the result to where your query is.
You can also do this in SQL, with something like:
select ' max('||column_name||') as '||column_name||','
from INFORMATION_SCHEMA.columns c
where table_name = <your table name here> and column_name like 'attr%'
When you do these last two, remember to remove the final comma from the last row.
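If a scripting language is at hand, the same text generation can be done there. A rough sketch in Python follows; the helper name build_max_query and its parameters are my own, and the attr1..attr3000 naming is an assumption based on the question. Joining the pieces with a separator avoids the trailing-comma cleanup. The names are pasted into the SQL text as-is, so this is only suitable for column names you control.

```python
def build_max_query(table, columns, key="id"):
    """Build a 'select key, max(col) as col, ... group by key' statement as text."""
    select_list = ",\n       ".join(f"max({c}) as {c}" for c in columns)
    return f"select {key},\n       {select_list}\nfrom {table}\ngroup by {key}"

# Assumption: the attributes are named attr1 .. attr3000 as in the question.
columns = [f"attr{i}" for i in range(1, 3001)]
query = build_max_query("t", columns)
print(query[:80])  # preview the start of the generated statement
```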
You have to use an aggregate function for any column that is not in the GROUP BY clause, so there is no quicker way.