Hive COUNT and COUNT DISTINCT not correct - SQL

I have a table in Hive that has 20 columns and I want to count unique records and all records per hour.
Table looks like:
CREATE EXTERNAL TABLE test1(
log_date string,
advertiser_creatives_id string,
cookieID string
)
STORED AS ORC
LOCATION "/day1orc"
tblproperties ("orc.compress"="ZLIB");
And my query looks like this:
SELECT Hour(log_date),
Count(DISTINCT cookieid) AS UNIQUE,
Count(1) AS impressions
FROM test1
GROUP BY Hour(log_date);
But the results are not correct. I have about 70 million entries, but when I sum the impressions I only get about 8 million, so I suspect the DISTINCT takes too many columns into account.
So how can I fix this so that I get the correct number of impressions?
** Extra information **
hive.vectorized.execution.enabled is undefined so it is not active.
The same query on the TEXT-format version of the table returns even fewer rows (about 2.7 million)
Result of COUNT(*): 70643229
Result of COUNT(cookieID): 70643229
Result of COUNT(DISTINCT cookieID): 1440195
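(A diagnostic sketch added here for illustration, not part of the original post: if some log_date values do not parse, HOUR() returns NULL and those rows collapse into a single NULL group, which is easy to miss when reading per-hour output.)
-- Illustrative check: count rows whose log_date does not yield an hour.
SELECT COUNT(*) AS total_rows,
       SUM(CASE WHEN HOUR(log_date) IS NULL THEN 1 ELSE 0 END) AS null_hour_rows
FROM test1;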
Cheers

I have an example that may be useful for you. I think your "row format delimited fields terminated by" clause has some problems.
I have a text file, separated by "\t", like below:
id date value
1 01-01-2014 10
1 03-01-2014 05
1 07-01-2014 40
1 05-01-2014 20
2 05-01-2014 10
but I create a table with only 2 columns, like below:
use tmp ;
create table sw_test(id string,td string) row format delimited fields terminated by '\t' ;
LOAD DATA LOCAL INPATH '/home/hadoop/b.txt' INTO TABLE sw_test;
What do you think the result of "select td from sw_test;" is?
It is NOT:
td
01-01-2014 10
03-01-2014 05
07-01-2014 40
05-01-2014 20
05-01-2014 10
BUT rather:
td
01-01-2014
03-01-2014
07-01-2014
05-01-2014
05-01-2014
So, I think your cookieID column contains some special characters, including your defined separator.
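If that theory is right, it should be visible in the data itself. A minimal check (illustrative only; test1_text is a hypothetical name for the text-format copy of the table):
-- If stray delimiters shift the columns, log_date will often fail to parse.
SELECT COUNT(*) AS bad_rows
FROM test1_text
WHERE HOUR(log_date) IS NULL;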
I hope this can help you.
Good luck!

Related

BigQuery query performance when using STARTS_WITH() on a table of 12M rows

I have a table company_totals that has the following schema:
column_name       | column_data_type
company           | STRING
link              | STRING
full_count        | FLOAT
starts_with_count | FLOAT
Number of rows = 12,000,000. Table size = 1.6 GB. CLUSTERED BY = company, link. SEARCH INDEX created on column = link.
I have the following SELECT statement, which runs beyond 6 hours and ends in a timeout ("Operation timed out after 6.0 hours. Consider reducing the amount of work performed by your operation so that it can complete within this limit.").
SELECT first_table.company, first_table.link, null as full_count, SUM(second_table.full_count) AS starts_with_count
FROM company_totals first_table, company_totals second_table
WHERE STARTS_WITH(second_table.link, first_table.link)
GROUP BY first_table.company, first_table.link
The above query calculates the column starts_with_count as the sum of full_count over every row whose link starts with the current row's link (the STARTS_WITH() condition). That is the column I want to fill in company_totals; the other columns are already populated. The table below shows the expected values, which I added manually:
company | link                              | full_count | starts_with_count (expected)
abc     | http://www.abc.net1               | 1          | 15 (= sum(full_count) where link like 'http://www.abc.net1%')
abc     | http://www.abc.net1/page1         | 2          | 9 (= sum(full_count) where link like 'http://www.abc.net1/page1%')
abc     | http://www.abc.net1/page1/folder1 | 3          | 3 (= sum(full_count) where link like 'http://www.abc.net1/page1/folder1%')
abc     | http://www.abc.net1/page1/folder2 | 4          | 4
abc     | http://www.abc.net1/page2         | 5          | 5
xyz     | http://www.xyz.net1/              | 6          | 21
xyz     | http://www.xyz.net1/page1/        | 7          | 15
xyz     | http://www.xyz.net1/page1/file1   | 8          | 8
I would highly appreciate any help with this issue.
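(An editorial sketch of one direction to try, not from the original post: the unconstrained self-join forces BigQuery to evaluate STARTS_WITH over every pair of the 12M rows. In the sample data a link can only be a prefix of links from the same company, so joining on company first gives the engine an equality key and prunes the pair space.)
SELECT t1.company,
       t1.link,
       SUM(t2.full_count) AS starts_with_count
FROM company_totals AS t1
JOIN company_totals AS t2
  ON t2.company = t1.company           -- equality key avoids a full cross join
 AND STARTS_WITH(t2.link, t1.link)     -- original prefix condition
GROUP BY t1.company, t1.link;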

How to select last element for each ID

I would like to select some elements from the last row of each id.
Here is an example of what I have:
id money
1 200
1 150
1 500
3 50
4 40
4 300
5 110
Here is what I would like:
1 500
3 50
4 300
5 110
So as you can see, for each id I keep the last row and the money value that corresponds to it.
I tried a GROUP BY id with ORDER BY id DESC and LIMIT 1, but LIMIT is not available in PROC SQL in SAS, so it doesn't work.
Thanks in advance
Unlike SAS datasets, SQL tables represent unordered sets. In your case, it looks like you want the maximum value in the second column, in which case you can use aggregation:
proc sql;
select id, max(money)
from t
group by id;
quit;
If you actually mean the last row per id based on the ordering in the SAS dataset, I would suggest using a data step instead.
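A minimal sketch of that data step, assuming the dataset is named t and its rows are already sorted or grouped by id:
data last_per_id;
  set t;
  by id;        /* requires rows sorted or grouped by id */
  if last.id;   /* keep only the final row of each id group */
run;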

Aggregating / concatenating very long VARCHAR2 strings and finding keywords in the text || Oracle

I have been given a task to develop a script/function/query that aggregates groups of rows in a table and then searches for specific keywords in the result. The column to be aggregated is a VARCHAR2 column of size 3200, and some of the aggregated results have lengths well beyond 5000.
(I understand that the maximum size of VARCHAR2 in SQL is 4000 bytes.)
When I try to aggregate the data into a single column, I get a "result of string concatenation is too long" error (ORA-01489).
I have tried built-in aggregate functions like LISTAGG and XMLAGG, and also some custom functions, but I have been asked to prefer a SQL query over a function or procedure.
Once I can get the data to be aggregated, I have to then search through the rows for matching keywords.
(I can't just search the rows without aggregating, as some of the words are split across rows, e.g. row1 ends with "KEYW" and row2 starts with "ORD" when I need to look for "KEYWORD" in the table.)
My table looks something like this (I can't post the real table data, sorry):
id_1 | id_2 | name | row_num | description
1    | 5    | A    | 0       | this has so
1    | 5    | A    | 1       | me keyword
1    | 5    | B    | 0       | this is
1    | 3    | E    | 0       | new some
2    | 12   | A    | 0       | diff str
Here the unique rows are identified by the first 3 columns, and the 4th column gives the order in which the "description" strings need to be concatenated.
I would like to get the output as:
id_1 | id_2 | name | description (concatenated)
1    | 5    | A    | this has **some** keyword
1    | 3    | E    | new **some**
when looking for the keyword "some".
Please help as I am fairly new to DBs and any help will be highly appreciated.
Thanks & Regards
Kunal
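(An editorial sketch of one commonly used direction, not from the original post: XMLAGG can return a CLOB, which sidesteps ORA-01489, and INSTR works on the aggregated CLOB. The table name src_table is hypothetical; the pieces are concatenated without a separator so that words split across rows are rejoined.)
SELECT id_1, id_2, name, description
FROM (
  SELECT id_1, id_2, name,
         -- GETCLOBVAL() returns a CLOB, so the 4000-byte VARCHAR2 limit does not apply.
         -- Note: XML special characters (&, <, >) come back entity-escaped.
         XMLAGG(XMLELEMENT(e, description)
                ORDER BY row_num).EXTRACT('//text()').GETCLOBVAL() AS description
  FROM src_table
  GROUP BY id_1, id_2, name
)
WHERE INSTR(description, 'some') > 0;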

SQL percentage usage calculation using 2 columns

Trying to get the percentage usage for a report based on the following columns:
Dept Ext Sec1 Sec2 StartDate EndDate
---------------------------------------------------------------
1 1234 5 5 2017-05-01:08:00:00 2017-05-04:08:00:10
2 1230 8 8 2017-05-01:09:10:00 2017-05-04:09:10:11
1 1234 15 15 2017-05-02:08:01:00 2017-05-04:08:01:20
I need to display the percentage of time the user spent on the phone, based on the total seconds in Sec1, for the time period. If need be, I can create a third column with the percentage total as part of the creation job (the final table is generated from a join of 2 other tables). Thanks
I had to add these lines to my createDB query to get the right results:
alter table compinfo.dbo.pabxreport add TotalSec Int
alter table compinfo.dbo.pabxreport add TotalPer Decimal(14,8)
update compinfo.dbo.pabxreport
set TotalSec= (
select sum(billsec1) from pabxreport)
update compinfo.dbo.pabxreport
set TotalPer= (billsec1 * 100.00 / Totalsec)
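For reference, the same number can be computed in one query with a window function instead of extra columns (a sketch using SQL Server syntax, since the table lives in compinfo.dbo; billsec1 follows the column name used in the code above):
SELECT Dept, Ext, billsec1,
       billsec1 * 100.0 / SUM(billsec1) OVER () AS TotalPer  -- share of grand total
FROM compinfo.dbo.pabxreport;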

How can I get an incremental counter with SQL?

Can you help me with a SQL query to get the desired result?
Database used: Redshift
The requirement: I have 3 columns, dish_id, category_id, and counter.
I want the counter to increase by 1 each time a dish_id is repeated, and otherwise stay at 1.
The query should read the source table and produce results like:
dish_id category_id counter
21 4 1
21 6 2
21 6 3
12 1 1
Unless I misunderstood your question, you can accomplish that using window functions:
SELECT *,
       ROW_NUMBER() OVER (PARTITION BY dish_id ORDER BY category_id) AS counter
FROM my_table;
Without an ORDER BY inside the OVER clause, the numbering within each dish_id is nondeterministic; ordering by category_id reproduces the sample output.