What is the appropriate hive query? - hive

This is my case study
Identify which numbers made calls as well as sent SMS messages, where the number of calls made should be more than 10 and the number of messages should be more than 5.
I have written the query:
select call.user_no
from (select user_no, count(other_no) as total_calls
      from calls group by user_no having total_calls > 10) as call,
     (select user_no, count(other_no) as total_msgs
      from messages group by user_no having total_msgs > 5) as msgs
where call.user_no = msgs.user_no;
These are the tables I have created
create table calls (user_no string,other_no string,direction string,duration smallint,call_timestamp string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
tblproperties("skip.header.line.count"="1");
create table messages (user_no string,other_no string,direction string,msg_len string,call_timestamp string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
tblproperties("skip.header.line.count"="1");
The output shows no user.
What is the error?
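For reference, the same logic can be written with an explicit INNER JOIN; this is only a sketch using the table and column names from the question and is logically equivalent to the comma join above:

select c.user_no
from (select user_no, count(other_no) as total_calls
      from calls
      group by user_no
      having count(other_no) > 10) c   -- same call filter as in the question
join (select user_no, count(other_no) as total_msgs
      from messages
      group by user_no
      having count(other_no) > 5) m    -- same message filter as in the question
  on c.user_no = m.user_no;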

Related

summarize word count for a timeframe

I have the below table which stores the response text and the keyword search associated with it.
create table nlp.search(response string, words string,inquiry_time timestamp);
insert into nlp.search values('how to reset password','reset word password',TIMESTAMP ("2021-09-19 05:30:00+00"));
insert into nlp.search values('how to reset password','reset passphrase',TIMESTAMP ("2021-09-20 07:30:00+00"));
insert into nlp.search values('how to reset password','password',TIMESTAMP ("2021-09-16 08:30:00+00"));
insert into nlp.search values('how to reset password','reset',TIMESTAMP ("2021-09-14 08:30:00+00"));
I want to provide a summary report in this format:
the response and the count of each individual word associated with it.
response individual_word_count
how to reset password reset(3) word(1) password(2) passphrase(1)
Also, the timestamp column inquiry_time can be passed to narrow down the date range, and the summary values must be computed accordingly,
e.g. for the timeframe filter 2021-09-19 till 2021-09-20:
response individual_word_count
how to reset password reset(2) word(1) password(1) passphrase(1)
Can this be accomplished using a view?
Use below
select response, word, count(1) individual_word_count
from `nlp.search`,
unnest(split(words, ' ')) word
where date(inquiry_time) between '2021-09-19' and '2021-09-20'
group by response, word
If applied to the sample data in your question, the output is one row per response and word with its count (for that date range: reset 2, word 1, password 1, passphrase 1).
I need to display the word and count in a single column.
Use below then:
select response,
string_agg(format('%s (%i)', word, individual_word_count)) counts
from (
select response, word, count(1) individual_word_count
from `nlp.search`,
unnest(split(words, ' ')) word
where date(inquiry_time) between '2021-09-19' and '2021-09-20'
group by response, word
)
group by response
with output in the single-column format shown in your question
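Since a plain view cannot take the date range as a parameter, one sketch (assuming BigQuery, which the UNNEST/STRING_AGG/FORMAT syntax above suggests; the view name nlp.search_words is made up here) is to put only the word explosion in the view and apply the date filter and aggregation when querying it:

create or replace view `nlp.search_words` as
select response, word, inquiry_time
from `nlp.search`,
unnest(split(words, ' ')) word;

-- usage: filter the date range first, then aggregate into the single column
select response,
       string_agg(format('%s(%i)', word, cnt)) individual_word_count
from (
  select response, word, count(1) cnt
  from `nlp.search_words`
  where date(inquiry_time) between '2021-09-19' and '2021-09-20'
  group by response, word
)
group by response;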

find differences between 2 tables sql and how can i get the changed value?

I have this query:
insert into changes (id_registro)
select d2.id_registro
from daily2 d2
where exists (
select 1
from daily d1
where
d1.id_registro = d2.id_registro
and (d2.origen, d2.sector, d2.entidad_um, d2.sexo, d2.entidad_nac, d2.entidad_res,
d2.municipio_res, d2.tipo_paciente,d2.fecha_ingreso, d2.fecha_sintomas,
d2.fecha_def, d2.intubado, d2.neumonia, d2.edad, d2.nacionalidad, d2.embarazo,
d2.habla_lengua_indig, d2.diabetes, d2.epoc, d2.asma, d2.inmusupr, d2.hipertension,
d2.otra_com, d2.cardiovascular, d2.obesidad,
d2.renal_cronica, d2.tabaquismo, d2.otro_caso, d2.resultado, d2.migrante,
d2.pais_nacionalidad, d2.pais_origen, d2.uci )
<>
(d1.origen, d1.sector, d1.entidad_um, d1.sexo, d1.entidad_nac, d1.entidad_res,
d1.municipio_res, d1.tipo_paciente, d1.fecha_ingreso, d1.fecha_sintomas,
d1.fecha_def, d1.intubado, d1.neumonia, d1.edad, d1.nacionalidad, d1.embarazo,
d1.habla_lengua_indig, d1.diabetes, d1.epoc, d1.asma, d1.inmusupr, d1.hipertension,
d1.otra_com, d1.cardiovascular, d1.obesidad,
d1.renal_cronica, d1.tabaquismo, d1.otro_caso, d1.resultado, d1.migrante,
d1.pais_nacionalidad, d1.pais_origen, d1.uci ))
It results in inserting the ids of rows whose data differs from the other table, and that's fine, but I want to know exactly which field has changed so I can store it in a log table.
You don't mention precisely what you expect to see in your output, but basically, to accomplish what you're after, you'll need a long sequence of CASE expressions, one for each column.
e.g. one approach might be to create a comma-separated list of the column names that have changed:
INSERT INTO changes (id_registro, column_diffs)
SELECT d2.id_registro,
       CONCAT(
           CASE WHEN d1.origen <> d2.origen THEN 'Origen,' ELSE '' END,
           CASE WHEN d1.sector <> d2.sector THEN 'Sector,' ELSE '' END
           -- ...one CASE per remaining column...
       ) AS column_diffs
FROM daily2 d2
JOIN daily d1 ON d1.id_registro = d2.id_registro
-- keep the same row-value comparison from your original query as the filter
WHERE (d2.origen, d2.sector /* , ... */) <> (d1.origen, d1.sector /* , ... */);
Within the THEN part of the CASE you can build whatever detail you want to show,
e.g. a string showing the before and after values of the column: CONCAT('Origen: Was==> ', d1.origen, ' Now==> ', d2.origen). Presumably, though, you'll also need to record the times of these changes if there can be multiple updates to the same record throughout the day.
Essentially you'll need to decide what information you want to show in your logfile, but based on your example query you should have all the information you need.
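A sketch of that before/after detail (assuming a MySQL/PostgreSQL-style dialect, which the row-value comparison in the original query suggests; the change_log table and its changed_at column are made up here, and note that a plain <> will not flag changes where one side is NULL):

INSERT INTO change_log (id_registro, change_detail, changed_at)
SELECT d2.id_registro,
       CONCAT(
           CASE WHEN d1.origen <> d2.origen
                THEN CONCAT('Origen: Was==> ', d1.origen, ' Now==> ', d2.origen, '; ')
                ELSE '' END
           -- ...one CASE per remaining column, same pattern...
       ),
       CURRENT_TIMESTAMP   -- when the change was logged
FROM daily2 d2
JOIN daily d1 ON d1.id_registro = d2.id_registro
-- same row-value comparison as in the original query
WHERE (d2.origen /* , ... */) <> (d1.origen /* , ... */);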

SQL - When a column has a value from a list and a value not in that same list

Not sure of the best way to word this, but I'm looking for a way to specify a condition: when a column has at least one value in a given list AND a value not in that same list, then that column's value should show up. An example table:
email program
john#john.com program1
john#john.com program2
john#john.com program3
jeff#jeff.com program3
jeff#jeff.com program4
steve#steve.com program1
steve#steve.com program2
If I have this table and a list of (program1, program2), I would like the corresponding email to show up if the programs associated with a given email match at least one in the given list AND if the given email has a program NOT in the given list.
So for the table and the given list above, all we would have show up with the correct query would be:
email
john#john.com
Any help on this would be greatly appreciated. Note: this would be in Redshift/PostgreSQL
I like doing this with group by and having. Here is a pretty general approach:
select email
from t
group by email
having sum( (program = 'program1')::int ) > 0 and
sum( (program = 'program2')::int ) = 0;
In this case, "program1" is required and "program2" must not be present. And, you can keep adding conditions -- as many as you like.
I forget if Redshift supports the :: syntax. You can always express this using standard SQL:
having sum( case when program = 'program1' then 1 else 0 end ) > 0 and
sum( case when program = 'program2' then 1 else 0 end ) = 0;
EDIT:
I think @dnswit is right on the parsing of the OP's question. The logic would be:
having sum( (program in ('program1', 'program2'))::int ) > 0 and
sum( (program not in ('program1', 'program2'))::int ) > 0;
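If the :: cast syntax is not available, the same IN / NOT IN logic can be written in the CASE form shown earlier (a sketch; this form works in both Redshift and PostgreSQL):

having sum( case when program in ('program1', 'program2') then 1 else 0 end ) > 0 and
       sum( case when program not in ('program1', 'program2') then 1 else 0 end ) > 0;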
If you just want a single list of emails, no matter how many times they appear in the table by having multiple programs, it is just:
select distinct email from tablename;
First, your data table is constructed wrong; you should use a unique identifier so you can retrieve the program version you are specifying.
So your database should look like this:
> email program1 program2 program3
john#john.com ProgVersion1 ProgVersion2 ProgVersion3
steve#steve.com ProgVersion1 ProgVersion2 ProgVersion3
If you notice in the table above, you can now query to get the program value you need for the specified email. Your data fields for the table are email, program1, program2, program3; when retrieving the values of the fields to be displayed, you do not need to repeat the email address multiple times for each version of the program. That would not be acceptable methodology.
SQL query you can use:
Instructions: you will create a parameter to use as a variable to query the data table from the list.
CREATE PROCEDURE spLoadMyProgramVersion
    @email nvarchar(50)
AS
BEGIN
    SELECT program1, program2, program3
    FROM MyTableName
    WHERE email LIKE @email;
    RETURN;
END
This will allow you to load all your program versions in a list by just specifying the email address you want to load. This is a loading stored procedure; just use it when you make a SQLCommand object, and you can call your stored procedure from it.

extract data in hive

I have data in a hive table column as below:
customer reason=Other#customer reason free text=Space#customer type=Regular#customer end date=2020-12-31 19:50:00#customer offering criterion=0#customer type=KK#Customer factor=0.00#customer period=0#customer type=TN#customer value=0#customer plan type=M#cttype type=KK#
I want to extract the data value 0 against 'customer value'.
I tried the below query but it gives the full 'customer value=0', whereas I want only 0.
Please suggest.
select regexp_replace(regexp_extract(information,'customer period=[^#]*',0),'customer period=','') from detail;
The question is not clear at all.
If you want to limit the output, use LIMIT 1 (where 1 is the number of rows you want to limit to).
Please give the following a shot,
---table creation----
create table ch7(custdetails map<string,string>)
row format delimited
collection items terminated by '#'
map keys terminated by '=';
----data loading----
Load data local inpath 'your-file-location-in-local' overwrite into table ch7;
----data selection-----
select custdetails["customer value"] from ch7;
Hope this helps!!
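Alternatively, sticking with the regexp approach from the question, a capture group returns just the value in one step, with no regexp_replace needed (a sketch, assuming the key of interest is literally 'customer value'):

select regexp_extract(information, 'customer value=([^#]*)', 1) from detail;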

Pig Latin script treats different columns in CSV file as one single column

I am just pasting a line from the file as an example.
The following line is from the file 'airlines_new.txt' which I am loading into a relation:
2008,1,3,4,617,615,652,650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,,0,NA,NA,NA,NA,NA
====================================================
I am using the following query :
Airlines_data_schema = LOAD '/user/Jig13517/airlines_new.txt'
USING PigStorage(' ') AS
(Year, Month, DayofMonth, DayofWeek, DepTime_actual:chararray, CRSDeptime:chararray, Arrtime_actual:chararray, CRSArrtime:chararray, UniqueCarrier, FlightNum, TailNum_Plane ,ActualElapsedTime, CRSElapsedTime, Airtime, Arrdelay, Depdelay, Origin,Dest, Distance, Taxiin, Taxiout, Cancelled, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay);
==========================================================
B = FOREACH Airlines_data_schema generate $0 ;
dump B ;
=========================================================
Result :
(Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay)
(2008,1,3,4,617,615,652,650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,,0,NA,NA,NA,NA,NA)
It is giving all the columns as a single column, but the intention is to break these into different columns. Ideally, according to my script, it should give only the column "Year".
The records are separated by a comma, but in the script you are using ' ' as the delimiter. Modify your script to use ',' as the delimiter in PigStorage.
Airlines_data_schema = LOAD '/user/Jig13517/airlines_new.txt' USING PigStorage(',') AS (Year,Month,DayofMonth,DayofWeek,DepTime_actual:chararray,CRSDeptime:chararray,Arrtime_actual:chararray,CRSArrtime:chararray,UniqueCarrier,FlightNum,TailNum_Plane,ActualElapsedTime,CRSElapsedTime,Airtime,Arrdelay,Depdelay,Origin,Dest,Distance,Taxiin,Taxiout,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay)
An appropriate delimiter needs to be used in this scenario to ensure that the fields are separated.
Airlines_data_schema = LOAD '/user/Jig13517/airlines_new.txt'
USING PigStorage(',') AS
(Year, Month, DayofMonth, DayofWeek, DepTime_actual:chararray, CRSDeptime:chararray, Arrtime_actual:chararray, CRSArrtime:chararray, UniqueCarrier, FlightNum, TailNum_Plane, ActualElapsedTime, CRSElapsedTime, Airtime, Arrdelay, Depdelay, Origin, Dest, Distance, Taxiin, Taxiout, Cancelled, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay);
This will ensure you can access each of the fields, which are separated in the CSV by ','.
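For example (a sketch using the relation name from the answer above), after reloading with the ',' delimiter, projecting the first field returns only Year:

B = FOREACH Airlines_data_schema GENERATE Year;
DUMP B;
-- expected output for the sample row: (2008)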