How to subset a remote tbl in R? - sql

I have loaded a SQL table into a tbl using dplyr (rr is my DB connection):
s_log = rr %>% tbl("s_log")
Then I extracted three columns and put them in a new tbl:
id_date_amount = s_log %>% select(id, created_at, amount)
When I run head(id_date_amount) it works properly:
id created_at amount
1 34101 2016-07-20 10:41:23 19750
2 11426 2016-07-20 10:38:15 19750
3 26694 2016-07-20 10:38:18 49750
4 25656 2016-07-20 10:42:05 49750
5 23987 2016-07-20 10:40:31 19750
6 24564 2016-07-20 10:38:35 19750
Now I need to filter id_date_amount so that it only contains the past 21 days:
filtered_ADP = subset(id_date_amount, as.Date('2016-08-22') - as.Date(created_at) > 0 & as.Date('2016-08-22') - as.Date(created_at) <= 21)
I get the following error:
Error in as.Date(created_at) : object 'created_at' not found
I think that's because I don't have id_date_amount locally, but how can I express that subset so it is sent to the database and the results come back?
I tried using dplyr::filter:
id_date_amount = id_date_amount %>% filter(as.Date('2016-08-22') - as.Date(created_at) > 0 & as.Date('2016-08-22') - as.Date(created_at) <= 21)
but I get this error:
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: syntax error at or near "AS"
LINE 3: WHERE "status" = 'success' AND AS.DATE('2016-08-22') - AS.DA...
^
)
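The filter fails because dplyr forwards as.Date() into the generated SQL, where PostgreSQL chokes on AS.DATE(...). The usual workaround is to compute the date bounds on the client and compare created_at against plain literals. A minimal Python sketch of that idea (date_window_filter is a hypothetical helper; in R/dplyr the analogous fix is to precompute the cutoff dates before calling filter()):

```python
from datetime import date, timedelta

def date_window_filter(ref_date, days=21):
    # Compute the bounds client-side so the generated WHERE clause only
    # compares created_at against plain literals -- nothing the database
    # has to translate.
    lower = ref_date - timedelta(days=days)
    return ("created_at >= '{}' AND created_at < '{}'"
            .format(lower.isoformat(), ref_date.isoformat()))

print(date_window_filter(date(2016, 8, 22)))
# created_at >= '2016-08-01' AND created_at < '2016-08-22'
```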

Related

How to generate a unique weekid using weekofyear in Hive

I have a table in which I'm just iterating over dates spanning 50 years.
Using the values of weekofyear("date") -> week_no_in_this_year,
I would like to create a column (week_id) that is unique for each week:
a concatenation of Year + two_digit_week_no_in_this_year + Some_number (to make the id unique within one week). I tried the following:
concat(concat(YEAR,IF(week_no_in_this_year<10,
concat(0,week_no_in_this_year),week_no_in_this_year)),'2') AS week_id.
But I'm getting the wrong result for a few dates, in scenarios like these:
SELECT weekofyear("2019-01-01") ;
SELECT concat(concat("2019",IF(1<10, concat(0,1),1)),'2') AS week_id;
Expected Result: 2019012
SELECT weekofyear("2019-12-31");
SELECT concat(concat("2019",IF(1<10, concat(0,1),1)),'2') AS week_id;
Expected Result: 2020012
One way to do it is with a UDF. Create a Python script and push it to HDFS:
mypy.py
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    (y, m, d) = line.split("-")
    iso = datetime.date(int(y), int(m), int(d)).isocalendar()
    print(str(iso[0]) + str(iso[1]))
In Hive
add file hdfs:/user/cloudera/mypy.py;
select transform("2019-1-1") using "python mypy.py" as (week_id);
INFO : OK
+----------+--+
| week_id |
+----------+--+
| 20191 |
+----------+--+
select transform("2019-12-30") using "python mypy.py" as (week_id)
+----------+--+
| week_id |
+----------+--+
| 20201 |
+----------+--+
1 row selected (33.413 seconds)
This scenario only happens at the year boundary: when the last days of December (e.g. Dec 30-31) fall into ISO week 1, the week number rolls over to the next year. If we add a condition for this case, we get the expected result.
The RIGHT function is the same as SUBSTR(str, -n); SUBSTR(CONCAT('0', ...), -2) below zero-pads the week number.
SELECT DTE AS Date,
       CONCAT(IF(MONTH(DTE) = 12 AND WEEKOFYEAR(DTE) = 1, YEAR(DTE) + 1, YEAR(DTE)),
              SUBSTR(CONCAT('0', WEEKOFYEAR(DTE)), -2), '2') AS weekid
FROM tbl;
Result:
Date WeekId
2019-01-01 2019012
2019-11-01 2019442
2019-12-31 2020012
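As a sanity check of the boundary logic, Python's datetime.date.isocalendar() implements the same ISO-8601 rollover that the SQL above corrects for. A sketch (note this relies on ISO week numbering; Hive's weekofyear may not agree with it for every date):

```python
from datetime import date

def week_id(d, seq="2"):
    # ISO year + zero-padded ISO week + the sequence digit from the question.
    # isocalendar() already rolls the last days of December into week 1 of
    # the next year when they belong to it.
    iso_year, iso_week, _ = d.isocalendar()
    return "{}{:02d}{}".format(iso_year, iso_week, seq)

print(week_id(date(2019, 1, 1)))    # 2019012
print(week_id(date(2019, 12, 31)))  # 2020012
```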

I'm looking to find the average difference between pairs of rows in the same column in SQL

So I've looked through a lot of questions about subtraction and the like in SQL, but haven't found this exact use case.
I'm using a single table and trying to find an average response time between two people talking on my site. Here's the data sample:
id created_at conversation_id sender_id receiver_id
307165 2017-05-03 20:03:27 96557 24 1755
307166 2017-05-03 20:04:22 96557 1755 24
303130 2017-04-20 18:03:53 102458 2518 4475
302671 2017-04-18 20:11:20 102505 3100 1079
302670 2017-04-18 20:09:38 103014 3100 2676
350570 2017-09-18 20:59:56 103496 5453 929
290458 2017-02-16 13:38:47 103575 2841 2282
300001 2017-04-08 16:42:16 104159 2740 1689
304204 2017-04-24 17:31:25 104531 5963 1118
284873 2017-01-12 22:33:19 104712 3657 3967
284872 2017-01-12 22:31:38 104712 3967 3657
What I want is to find the average response time, based on the conversation_id.
Hmmm . . . You can get the "response" for a given row by getting the next row between the two conversers. The rest is getting the average -- which is database-dependent.
Something like this:
select avg(next_created_at - created_at)  -- exact syntax depends on the database
from (select m.*,
             (select min(m2.created_at)
              from messages m2
              where m2.sender_id = m.receiver_id and m.sender_id = m2.receiver_id and
                    m2.conversation_id = m.conversation_id and
                    m2.created_at > m.created_at
             ) next_created_at
      from messages m
     ) mm
where next_created_at is not null;
A CTE will take care of bringing the conversation start and end into the same row.
Then use DATEDIFF to compute the response time, and average it.
This assumes there are only ever two entries per conversation (conversations with one entry, or more than two, are ignored).
WITH X AS (
SELECT conversation_id, MIN(created_at) AS convstart, MAX(created_at) AS convend
FROM theTable
GROUP BY conversation_id
HAVING COUNT(*) = 2
)
SELECT AVG(DATEDIFF(second,convstart,convend)) AS AvgResponse
FROM X
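The correlated-subquery logic can be checked off-database. This Python sketch replays a few of the sample rows and applies the same rule: for each message, take the earliest later message in the same conversation with sender and receiver swapped, then average the gaps (field names follow the question; the data is just the two complete conversations from the sample):

```python
from datetime import datetime

def avg_response_seconds(messages):
    # messages: list of (created_at, conversation_id, sender_id, receiver_id).
    # For each message, find the earliest later reply in the same conversation
    # with sender/receiver swapped, then average the gaps in seconds.
    gaps = []
    for created, conv, snd, rcv in messages:
        replies = [c for c, cv, s, r in messages
                   if cv == conv and s == rcv and r == snd and c > created]
        if replies:
            gaps.append((min(replies) - created).total_seconds())
    return sum(gaps) / len(gaps) if gaps else None

msgs = [
    (datetime(2017, 5, 3, 20, 3, 27), 96557, 24, 1755),
    (datetime(2017, 5, 3, 20, 4, 22), 96557, 1755, 24),
    (datetime(2017, 1, 12, 22, 31, 38), 104712, 3967, 3657),
    (datetime(2017, 1, 12, 22, 33, 19), 104712, 3657, 3967),
]
print(avg_response_seconds(msgs))  # 78.0 (the 55 s and 101 s gaps averaged)
```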

Regular Expression and Hive

I'm trying to create a Hive external table using org.apache.hadoop.hive.serde2.RegexSerDe to analyse comments. The sample rows are:
0 #chef/maintain fix the problem 2017-05-25 20:34:45 1 2017-05-25 20:34:27 0
6 ^ trailing comma is trolling you 2017-05-23 23:08:46 0 2017-05-24 04:40:42 1
This is my regex:
("input.regex" = "(d{1,5}\\s\\w+\\s\\.{19}\\.{1}\\s\\.{1}");
I am getting a table full of NULLs and couldn't figure out the regex.
Table definition:
order 1,2,3,4...
comment #chef/maintain fix the problem
comment_time 2017-05-25 20:34:45
merged 1 or 0
merged_time 2017-05-25 20:34:27
resolved 1 or 0
Can anyone help with this?
Try this regex:
(\\d)\\s+([^\\d{4}]*)\\s(\\d{4}-\\d{2}-\\d{2}\\s\\d{2}:\\d{2}:\\d{2})\\s+(\\d)\\s+(\\d{4}-\\d{2}-\\d{2}\\s\\d{2}:\\d{2}:\\d{2})\\s(\\d)
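A quick way to validate the suggested pattern is to run it over a sample row in Python (Hive's RegexSerDe uses Java regexes, but this particular pattern behaves the same in both engines; note that [^\d{4}] is actually a character class excluding digits and the literal characters '{', '4', '}', which works here only because the comment text contains no digits):

```python
import re

# The answer's pattern, written as a raw string (single backslashes in
# Python source, versus the doubled backslashes in the Hive DDL).
PATTERN = re.compile(
    r"(\d)\s+([^\d{4}]*)\s(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})"
    r"\s+(\d)\s+(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})\s(\d)"
)

line = "0 #chef/maintain fix the problem 2017-05-25 20:34:45 1 2017-05-25 20:34:27 0"
m = PATTERN.match(line)
print(m.groups())
# ('0', '#chef/maintain fix the problem', '2017-05-25 20:34:45',
#  '1', '2017-05-25 20:34:27', '0')
```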

SQL OpenOffice Base "Not a Condition in Statement"

So I tried using OpenOffice Base and I had a hard time. Now, I have this SQL query here and it works well:
SELECT "CUSTOMER"."CREDIT_LIMIT" AS "CREDIT_LIMIT",
COUNT(*) AS "TOTAL_NUMBER"
FROM "CUSTOMER"
WHERE "SLSREP_NUMBER" = 6
GROUP BY "CREDIT_LIMIT";
Result:
| CREDIT_LIMIT | TOTAL_NUMBER |
| 1500         | 1            |
| 750          | 2            |
| 1000         | 1            |
Now my problem is when I add this: AND ("TOTAL_NUMBER" > 1)
SELECT "CUSTOMER"."CREDIT_LIMIT" AS "CREDIT_LIMIT",
COUNT(*) AS "TOTAL_NUMBER"
FROM "CUSTOMER"
WHERE "SLSREP_NUMBER" = 6 AND "TOTAL_NUMBER" > 1
GROUP BY "CREDIT_LIMIT";
OpenOffice throws an error: "Not a condition in statement".
My questions are: is there something wrong with my syntax? Have I written something wrong? Or is my copy of OOBase defective? Or am I missing something?
Update: I tried using HAVING as suggested by potashin (Thank you for answering) and it seems like it's still not working.
#potashin was close but didn't quite have it right. Do not write AS "TOTAL_NUMBER". Also, Base does not require quotes around upper-case names.
SELECT CUSTOMER.CREDIT_LIMIT AS CREDIT_LIMIT, COUNT(*)
FROM CUSTOMER
WHERE SLSREP_NUMBER = 6
GROUP BY CREDIT_LIMIT
HAVING COUNT(*) > 1
See also: http://www.w3resource.com/sql/aggregate-functions/count-having.php
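The distinction generalizes beyond Base: WHERE is evaluated per row before grouping, so aggregates like COUNT(*) are only available in HAVING. A sketch using Python's built-in sqlite3 (the table and values are invented to reproduce the counts 1, 2, 1 from the question's first result):

```python
import sqlite3

# In-memory stand-in for the CUSTOMER table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (slsrep_number INT, credit_limit INT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(6, 1500), (6, 750), (6, 750), (6, 1000), (5, 750)],
)

# The aggregate condition lives in HAVING: WHERE filters rows before
# grouping, so COUNT(*) does not exist there yet.
rows = conn.execute(
    """SELECT credit_limit, COUNT(*) AS total_number
       FROM customer
       WHERE slsrep_number = 6
       GROUP BY credit_limit
       HAVING COUNT(*) > 1"""
).fetchall()
print(rows)  # [(750, 2)]
```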

Presto can't fetch content from a Hive table

My environment:
hadoop 1.0.4
hive 0.12
hbase 0.94.14
presto 0.56
All packages are installed on a single pseudo-distributed machine. The services are not running on localhost but on a hostname with a static IP.
presto conf:
coordinator=false
datasources=jmx,hive
http-server.http.port=8081
presto-metastore.db.type=h2
presto-metastore.db.filename=/root
task.max-memory=1GB
discovery.uri=http://<HOSTNAME>:8081
In presto cli I can get the table in hive successfully:
presto:default> show tables;
Table
-------------------
ht1
k_business_d_
k_os_business_d_
...
tt1_
(11 rows)
Query 20140114_072809_00002_5zhjn, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:11 [11 rows, 291B] [0 rows/s, 26B/s]
but when I try to query data from any table, the result is always empty (no error information):
presto:default> select * from k_business_d_;
key | business | business_name | collect_time | numofalarm | numofhost | test
-----+----------+---------------+--------------+------------+-----------+------
(0 rows)
Query 20140114_072839_00003_5zhjn, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:02 [0 rows, 0B] [0 rows/s, 0B/s]
If I execute the same SQL in Hive, the result shows there is 1 row in the table:
hive> select * from k_business_d_;
OK
9223370648089975807|2 2 测试机 2014-01-04 00:00:00 NULL 1.0 NULL
Time taken: 2.574 seconds, Fetched: 1 row(s)
Why can't Presto fetch from Hive tables?
It looks like this is an external table that uses HBase via org.apache.hadoop.hive.hbase.HBaseStorageHandler. This is not supported yet, but one mailing list post indicates it might be possible if you copy the appropriate jars to the Hive plugin directory: https://groups.google.com/d/msg/presto-users/U7vx8PhnZAA/9edzcK76tD8J