Hive query results to new table - sql

I have a very simple query below, which counts the number of transactions that happen each hour on our platform.
The numbers are in the billions so the query takes some time.
As such, I'd like to be able to run the query hourly, appending the results to another table - so we can have less latency & less load on the cluster.
I have access to Hue to do this, and I am using Hive. Is either of the statements below the correct way to do this?
INSERT INTO TABLE udsuser.healthcheck
SELECT dt, hour, count(*) AS transactions, 'dpi_datasum' AS feed, 'FULL' AS environment
FROM dpi_datasum
WHERE hour = hour(from_unixtime(unix_timestamp())) - 2
GROUP BY dt, hour

or
INSERT OVERWRITE TABLE udsuser.healthcheck
SELECT dt, hour, count(*) AS transactions, 'dpi_datasum' AS feed, 'FULL' AS environment
FROM dpi_datasum
WHERE hour = hour(from_unixtime(unix_timestamp())) - 2
GROUP BY dt, hour
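For what it's worth, INSERT INTO appends rows to the target table, while INSERT OVERWRITE replaces what is already there, so only the first form accumulates an hourly history. The sketch below is a hedged variant of the append form, not a tested solution. It assumes dt is stored as a 'yyyy-MM-dd' string and hour as an integer from 0 to 23; neither is stated in the question. It subtracts two hours from the timestamp before extracting the date and the hour, so the filter does not go negative just after midnight and also pins the day.

```sql
-- Sketch only: assumes dt is a 'yyyy-MM-dd' string and hour is an integer 0-23.
-- 7200 seconds = 2 hours; subtracting before extracting the parts keeps
-- the date and the hour consistent across midnight.
INSERT INTO TABLE udsuser.healthcheck
SELECT dt, hour, count(*) AS transactions, 'dpi_datasum' AS feed, 'FULL' AS environment
FROM dpi_datasum
WHERE dt = from_unixtime(unix_timestamp() - 7200, 'yyyy-MM-dd')
  AND hour = hour(from_unixtime(unix_timestamp() - 7200))
GROUP BY dt, hour;
```

If the job is ever re-run for the same hour, the append form will write a duplicate row, so the scheduler needs to guarantee one run per hour.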

Related

Creating a partitioned table from query in Big Query does not yield same as without partitioning

When I create a table, say "orders", with partitioning as shown below, the result is truncated compared with creating it without partitioning (toggled by commenting and uncommenting lines 5 and 6).
I suspect it has something to do with the BQ limits (found here), but I can't figure out what. The ts column is a timestamp and order_id is a UUID string.
That is, the COUNT(DISTINCT ...) on the last line yields very different results: the partitioned table returns far fewer order_ids than the non-partitioned one.
DROP TABLE IF EXISTS
`project.dataset.orders`;
CREATE OR REPLACE TABLE
`project.dataset.orders`
-- PARTITION BY
-- DATE(ts)
AS
SELECT
ts,
order_id,
SUM(order_value) AS order_value
FROM
`project.dataset.raw_orders`
GROUP BY
1, 2;
SELECT COUNT(DISTINCT order_id) FROM `project.dataset.orders`;
(This is not a valid 'answer'; I just need a better place to write SQL than the comment box. I don't mind if a moderator converts this answer into a comment AFTER it serves its purpose.)
What number do you get if you run the query below, and which one does it align with (partitioned or non-partitioned)?
SELECT COUNT(DISTINCT order_id) FROM (
SELECT
ts,
order_id,
SUM(order_value) AS order_value
FROM
`project.dataset.raw_orders`
GROUP BY
1, 2
) t;
It turns out that there's a 60 day partition expiration!
https://cloud.google.com/bigquery/docs/managing-partitioned-tables#partition-expiration
So by updating the partition expiration I could get the full range.
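As a rough sketch of that fix, the partition expiration can be cleared with a DDL statement. The table name is the one from the question. Because the script above drops and recreates the table, a dataset-level default expiration would presumably be applied again on each rebuild, so the statement would need to be run again afterwards; that part is an assumption about this particular setup.

```sql
-- Sketch only: clears the partition expiration on the rebuilt table so that
-- partitions older than 60 days are no longer dropped.
-- Setting an option to NULL removes the table's value for that option.
ALTER TABLE `project.dataset.orders`
SET OPTIONS (partition_expiration_days = NULL);
```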

Count the number of rows inserted in a table per second

I want to count the number of rows inserted into a table per second in a SQL database. The count has to cover all the rows in the table. Sometimes there could be 100 rows in a second and at other times 10, so this is just for stats. I managed to count rows per day but need more detail. Any advice or scripts would be appreciated.
Thanks
If you truncate the datetime column to the second, you can then aggregate on it to get totals per second.
For example:
SELECT
CAST(dt AS DATE) as [Date],
MIN(Total) as MinRecordsPerSec,
MAX(Total) as MaxRecordsPerSec,
AVG(Total) as AverageRecordsPerSec
FROM
(
SELECT
CONVERT(datetime, CONVERT(char(19), YourDatetimeColumn, 120), 120) as dt,
COUNT(*) AS Total
FROM YourTable
GROUP BY CONVERT(char(19), YourDatetimeColumn, 120)
) q
GROUP BY CAST(dt AS DATE)
ORDER BY 1;
Well, it depends on the language you are using. One way would be to fetch the rows from your DB, convert the date column to a timestamp, and then group by that timestamp, since each timestamp represents one second.
OR
Alternatively, you can store timestamps in the DB instead of the actual date; then it will be easy to query from the DB.
OR
Use the function UNIX_TIMESTAMP() in MySQL to get the timestamp of the column; then you can do whatever comparison you want on it.
https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_unix-timestamp
Hope this gives you an idea.
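A minimal sketch of the UNIX_TIMESTAMP() idea in MySQL might look like the following. The names your_table and created_at are placeholders, not taken from the question. FLOOR() is there in case the column stores fractional seconds.

```sql
-- Sketch only: your_table and created_at are hypothetical names.
-- Each distinct epoch second becomes one group; seconds with no inserts
-- simply do not appear in the result.
SELECT FLOOR(UNIX_TIMESTAMP(created_at)) AS epoch_second,
       COUNT(*) AS rows_inserted
FROM your_table
GROUP BY epoch_second
ORDER BY epoch_second;
```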

Oracle sub-partition shows data, but oracle query using filter does not show data

In Oracle 11g, I have created a fact table partitioned by date and sub-partitioned by site_id.
ANALYZE runs daily on this table, with the analyse step performed at a one-day interval.
In SQL Developer, when I open the table definition, the Partitions tab shows the partition for 23-JAN-2016, and I can see a sub-partition for every site_id.
Select * from NPM.EH_MODEM_HIST_PRFRM_FACT subpartition(SYS_SUBP1256625);
When I run the above query, I can see the data.
But when I run the query below from the report SQL, it returns no data:
select * from NPM.EH_MODEM_HIST_PRFRM_FACT
where time_stamp ='23-JAN-16' and site_id =580
Is there any problem in managing this table?
Probably, what you're actually after is something like:
select *
from NPM.EH_MODEM_HIST_PRFRM_FACT
where time_stamp >= to_date('23/01/2016', 'dd/mm/yyyy')
and time_stamp < to_date('23/01/2016', 'dd/mm/yyyy') + 1
and site_id = 580;
The above assumes that the datatype for the time_stamp column is DATE. If it's actually TIMESTAMP then you should use the SQL below:
select *
from NPM.EH_MODEM_HIST_PRFRM_FACT
where time_stamp >= to_date('23/01/2016', 'dd/mm/yyyy')
and time_stamp < to_date('23/01/2016', 'dd/mm/yyyy') + interval '1' day
and site_id = 580;
Note also that I have specified the date with a four digit year. Two digit years are just soooo pre-y2k! *{;-)

How to update or insert a record in a postgres table which is obtained by doing another query?

I want to write a simple statistics tool that runs some queries and saves the results in another table in the same database.
Mainly I want to track the number of items in different tables, the number of items touched during a month, and so on. This would give me some analytics on usage of the system, information that I cannot get just by looking at the database state at one moment.
Let's say that I have this query:
select count(*) as mytab_mcount from mytab where updated > CURRENT_DATE - INTERVAL '1 months';
Now I want to store the result of this query in a stats table so I can query it to get some trend data.
Clearly I could code this in some language, but I am wondering if I can do it in SQL alone, in the Postgres dialect.
I want to put the result in a table like
date        mytab_mcount  some_stat
2013-09-01  1234          NULL
Clearly the SQL should insert a new row or update the existing one.
Is this possible? Can you give a basic example?
If this could be done in a single query it would be very easy to automate, keeping all the logic in one place and having a cron job execute it.
Have you tried something like:
INSERT INTO stat_table (stat_date, table_name, row_count, some_stat)
SELECT CURRENT_DATE, 'mytab', count(*), 2+3
FROM mytab
WHERE updated > CURRENT_DATE - INTERVAL '1 months';
Or
UPDATE stat_table
SET row_count = (SELECT count(*) FROM mytab WHERE updated > CURRENT_DATE - INTERVAL '1 months'),
stat_date = CURRENT_DATE,
some_stat = (SELECT 1+3)
WHERE table_name = 'mytab';
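If the server is PostgreSQL 9.5 or later, the insert-or-update can be folded into one statement with ON CONFLICT. This is only a sketch: it reuses the stat_table columns from the answer above and assumes a unique constraint on (stat_date, table_name), which the question does not mention.

```sql
-- Sketch only: requires PostgreSQL 9.5+ and assumes a constraint such as
--   UNIQUE (stat_date, table_name)
-- exists on stat_table (an assumption, not from the original post).
INSERT INTO stat_table (stat_date, table_name, row_count)
SELECT CURRENT_DATE, 'mytab', count(*)
FROM mytab
WHERE updated > CURRENT_DATE - INTERVAL '1 month'
ON CONFLICT (stat_date, table_name)
DO UPDATE SET row_count = EXCLUDED.row_count;
```

Run from cron once or several times a day, this leaves exactly one row per table per date, with the latest count.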

Query to get the duration and details from a table

I have a scenario and am not quite sure how to query it. As a sample, I have the following table structure and want to get the history of actions for a bus:
ID-----TIME---------BUSID----OPID----MOVING----STOPPED----PARKED----COUNT
1------10:10:10-----101------1101-----1---------0----------0---------15
2------10:10:11-----102------1102-----0---------1----------0---------5
3------10:11:10-----101------1101-----1---------0----------0---------15
4------10:12:10-----101------1101-----0---------1----------0---------15
5------10:13:10-----101------1101-----1---------0----------0---------19
6------10:14:10-----101------1101-----1---------0----------0---------19
7------10:15:10-----101------1101-----0---------1----------0---------19
8------10:16:10-----101------1101-----0---------0----------1---------0
9------10:17:10-----101------1101-----0---------0----------1---------0
I want to write a query to get the status of a bus like:
BUSID----OPID----STATUS-----TIME---------DURATION---COUNT
101------1101----MOVING-----10:10:10-----2-----------15
101------1101----STOPPED----10:12:10-----1-----------15
101------1101----MOVING-----10:13:10-----2-----------19
101------1101----STOPPED----10:15:10-----1-----------19
101------1101----PARKED-----10:16:10-----2-----------0
I am using SQL Server 2008.
Thanks for your help.
You can use Common Table Expressions to calculate the duration between the different rows.
WITH cte_log AS
(
    SELECT
        -- number the rows from newest to oldest so that row_id + 1 is the previous record
        ROW_NUMBER() OVER (ORDER BY [time] DESC) AS row_id,
        [time], busid, opid, moving, stopped, parked, [count]
    FROM log_table
    WHERE busid = 101
)
SELECT
    current_rows.busid,
    current_rows.opid,
    current_rows.[time],
    -- seconds elapsed since the previous record (NULL for the earliest record)
    DATEDIFF(second, previous_rows.[time], current_rows.[time]) AS duration,
    current_rows.[count]
FROM cte_log AS current_rows
LEFT OUTER JOIN cte_log AS previous_rows
    ON current_rows.row_id + 1 = previous_rows.row_id
ORDER BY current_rows.[time] DESC;
The WITH statement creates a temporary result set that is defined within the execution scope of this query. We are using it to fetch the previous record of each row and to calculate the time difference between the current and the previous record.
This example was not tested, and it may not work perfectly, but I hope it gets you going in the correct direction. Feel free to leave feedback.
You may also want to check the following external links on how to use Common Table Expressions:
SQL Select Next Row and SQL Select Previous Row with Current Row using T-SQL CTE
Calculate Difference between current and previous rows... CTE and Row_Number() rocks!
4 Guys From Rolla: Common Table Expressions (CTE) in SQL Server 2005
MSDN: Using Common Table Expressions
Personally, I would denormalize the data so you have start_time and end_time in the one row. This will make the query much more efficient.
I don't have access to SQL Server at the moment, so there may be syntax errors in the following:
SELECT
    BUSID,
    OPID,
    CASE WHEN MOVING = 1 THEN 'MOVING'
         WHEN STOPPED = 1 THEN 'STOPPED'
         ELSE 'PARKED'
    END AS STATUS,
    [TIME],
    [COUNT]
FROM BUS_DATA_TABLE
ORDER BY BUSID, [TIME]
You'll note that this does not include duration. Until you order your data, you don't know which is the previous entry. Once the data is ordered you can calculate the duration as the difference between the times in consecutive records. You could do this by SELECTing into a new table and then running a second query.
Ordering by BUSID first should give you your report for all buses.
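One way to get the status runs and their durations in a single query on SQL Server 2008 is the "gaps and islands" trick with two ROW_NUMBER() calls. This is a sketch under assumptions: log_table is a placeholder table name, and DURATION is read as the number of consecutive samples in a run, which is what the expected output in the question appears to show (two rows of MOVING give 2, one row of STOPPED gives 1).

```sql
-- Sketch only: log_table is a hypothetical name; columns follow the question.
WITH labelled AS
(
    SELECT busid, opid, [time], [count],
           CASE WHEN moving = 1 THEN 'MOVING'
                WHEN stopped = 1 THEN 'STOPPED'
                ELSE 'PARKED'
           END AS status
    FROM log_table
),
runs AS
(
    SELECT busid, opid, [time], [count], status,
           -- the difference between the two numberings stays constant
           -- for as long as the status does not change
           ROW_NUMBER() OVER (PARTITION BY busid ORDER BY [time])
         - ROW_NUMBER() OVER (PARTITION BY busid, status ORDER BY [time]) AS run_id
    FROM labelled
)
SELECT busid,
       opid,
       status,
       MIN([time])  AS [time],     -- start of the run
       COUNT(*)     AS duration,   -- number of samples in the run
       MAX([count]) AS [count]
FROM runs
GROUP BY busid, opid, status, run_id
ORDER BY busid, MIN([time]);
```

On the sample rows for bus 101 this yields five runs (MOVING, STOPPED, MOVING, STOPPED, PARKED) with durations 2, 1, 2, 1, 2. If duration should instead be elapsed time, DATEDIFF between MIN([time]) and MAX([time]) of each run could replace COUNT(*).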
Making certain assumptions about column type, etc:
SELECT
BUSID,
OPID,
STATUS,
TIME,
DURATION,
COUNT
FROM
TABLENAME
WHERE
BUSID = 101
ORDER BY
TIME
;