Track table counts from a schema

I use Postgres and need some help from you PG experts...
I'm looking to track counts from a large set of source tables whose counts change every day. I want to record the table name, row count, and table size in a tracker table, with a created_dttm column showing when the row count was captured from the source table. This is for trending how table counts change over time and spotting peaks.
insert into tracker_table (tablename, rowcount, tablesize, timestamp)
select
    schema.tablename ...  -- not sure how to drive this to pick up a list of tables??
  , (select count(*) from schema.tablename)
  , (select pg_size_pretty(pg_total_relation_size('"schema"."tablename"')))
  , (select created_dttm from schema.tablename)
;
Additionally, I want to pull a particular column from the source table into a fourth column. This is a created_dttm timestamp field in the source table, and I want a simple SQL statement to carry that date into the tracker table. Any suggestions on how to attack this problem?

Before reading the code, please consider this:
Instead of running several subqueries, join them into one query where you can; e.g. select (select 1 from t), (select 2 from t) can be refactored to select 1, 2 from t.
pg_total_relation_size is the sum of the data pages, so it is the size of the table on disk, not the size of the data in it.
You need aggregation on your created_dttm column (I used oid instead); otherwise your subquery returns more than one row and you won't be able to insert the result.
Instead of select count(*), maybe use the pg_stat_all_tables statistics (a sketch of this follows the demo output below)? Counting can be very expensive, and the accuracy of count(*) hardly matters given that the same count run a minute later would return a different number, and you probably won't run this count every two seconds.
Code:
t=# create table so30 (n text, c int, s text, o int);
CREATE TABLE
t=# do
$$
declare
_r record;
_s text;
begin
for _r in (values('pg_database'),('pg_roles')) loop
_s := format('select %1$L,(select count(*) from %1$I), (SELECT pg_size_pretty(pg_total_relation_size(%1$L))), (select max(oid) from %1$I)',_r.column1);
execute format('insert into so30 %s',_s);
end loop;
end;
$$
;
DO
t=# select * from so30;
n | c | s | o
-------------+---+---------+-------
pg_database | 4 | 72 kB | 16384
pg_roles | 2 | 0 bytes | 4200
(2 rows)
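On the pg_stat point above: if the statistics collector's estimates are good enough, the whole loop collapses into a single insert. A minimal sketch against the same so30 tracker table (n_live_tup is only an estimate, and the o column is left null here):
insert into so30 (n, c, s, o)
select relname,
       n_live_tup,                                     -- estimated live rows, no table scan needed
       pg_size_pretty(pg_total_relation_size(relid)),
       null
from pg_stat_user_tables;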


For loop with output arrays

In Snowflake:
I have two tables available:
"SEG_HISTO": This is a segmentation run once a month.
columns: Client ID /date (1st of each month) /segment.
"TCK": a table that contains the tickets with the columns: Ticket ID / Customer ID / Date / Amount.
For each customer ID in the "SEG_HISTO" table, I searched for all the customer's tickets over a rolling year and associated the sum of the amount spent:
SELECT SEG_OMNI.*, TCK_12M.TOTAL_AMOUNT_HT
FROM "SHARE"."DATAMARTS_DATASCIENCE"."SEG_OMNI" SEG_OMNI
LEFT OUTER JOIN
(
SELECT DISTINCT PR_ID_BU,
SUM(TOTAL_AMOUNT_HT) AS "TOTAL_AMOUNT_HT",
COUNT(*) "NB_ACHAT"
FROM
(
SELECT * FROM "SHARE"."RAW_BDC"."TCK"
WHERE TO_DATE(DT_SALE) >= DATEADD(YEAR, -1, '2022-07-01') -- <<<===== date add manually
)
GROUP BY PR_ID_BU
) TCK_12M
ON SEG_OMNI."pr_id_bu" = TCK_12M.PR_ID_BU
Now I need to create a for loop that iterates this for each date in the SEG_HISTO table (SELECT DISTINCT TO_DATE(DT_MAJ) DT FROM "SHARE"."DATAMARTS_DATASCIENCE"."SEG_HISTO") and stack the output in a view.
This is where I'm stuck.
Thanks in advance for your help.
As Dave said in the comments, it would be better if you could figure out how to run all this in one query, instead of running the same query multiple times.
But as you are asking how to output the results of multiple queries out of one stored procedure I'm going to give you the pattern for that here. I'm also assuming you want this in a SQL script (we could use Python/Java/JS instead):
declare
  your_var string;
  all_dates cursor for (
    select dates
    from your_table
  );
begin
  -- create a table to store results
  create or replace temp table discovery_results(x string, y string, z int);
  for record in all_dates do
    -- for each date, run the query and insert the results into the table created above
    insert into discovery_results
    select x, y, z
    from the_query
    where (:dates_cursor_data)  -- placeholder: filter by the current date from the cursor
    ;
  end for;
  return 'run [select * from discovery_results] to find the results';
end;

select *
from discovery_results;
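As for the one-query approach: a rough sketch below, reusing the names from the question (SEG_HISTO's DT_MAJ and "pr_id_bu", TCK's DT_SALE and TOTAL_AMOUNT_HT are taken from the original queries; the join condition is my assumption about what "rolling year" means here). Each segmentation date is joined to its trailing 12 months of tickets in a single pass, so no loop is needed and the result can be wrapped directly in a view:
SELECT s."pr_id_bu",
       TO_DATE(s.DT_MAJ)      AS dt,
       SUM(t.TOTAL_AMOUNT_HT) AS total_amount_ht,
       COUNT(t.PR_ID_BU)      AS nb_achat
FROM "SHARE"."DATAMARTS_DATASCIENCE"."SEG_HISTO" s
LEFT OUTER JOIN "SHARE"."RAW_BDC"."TCK" t
  ON t.PR_ID_BU = s."pr_id_bu"
 AND TO_DATE(t.DT_SALE) >  DATEADD(YEAR, -1, TO_DATE(s.DT_MAJ))
 AND TO_DATE(t.DT_SALE) <= TO_DATE(s.DT_MAJ)
GROUP BY s."pr_id_bu", TO_DATE(s.DT_MAJ);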

Create an Impala text table where rows meet a condition

I am trying to create a table in Impala (SQL) that takes rows from a Parquet table. The data represents bike rides in a city. Rows should be imported into the new table if their starting code (a string, e.g. '6100') shows up more than 100 times in the first table. Here's what I have so far:
-- I am using Apache Impala via the Hue Editor
invalidate metadata;
set compression_codec=none;
invalidate metadata;
Set compression_codec=gzip;
create table bixirides_parquet (
start_date string, start_station_code string,
end_date string, end_station_code string,
duration_sec int, is_member int)
stored as parquet;
Insert overwrite table bixirides_parquet select * from bixirides_avro;
invalidate metadata;
set compression_codec=none;
create table impala_out stored as textfile as select start_date, start_station_code, end_date, end_station_code, duration_sec, is_member, count(start_station_code) as count
from bixirides_parquet
having count(start_station_code)>100;
For some reason the statement runs, but no rows are inserted into the new table. It should import a row into the new table if that row's starting code shows up more than 100 times in the original table. I think I'm wording my select statement improperly, but I'm not sure how exactly.
I think the final query you want is:
select start_date, start_station_code, end_date,
end_station_code, duration_sec, is_member, cnt
from (select bp.*,
count(*) over (partition by start_station_code) as cnt
from bixirides_parquet bp
) bp
where cnt > 100;
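To land that in a text table as the question intends, the same query can be wrapped in the CTAS from the original attempt (a sketch, reusing the question's own table and column names):
set compression_codec=none;
create table impala_out stored as textfile as
select start_date, start_station_code, end_date,
       end_station_code, duration_sec, is_member, cnt
from (select bp.*,
             count(*) over (partition by start_station_code) as cnt
      from bixirides_parquet bp
     ) bp
where cnt > 100;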

Oracle Sequence wastes/reserves values (in INSERT SELECT)

I've been struggling with sequences for a few days. I have an origin (source) table called "datos" with the following columns:
CENTRO
CODV
TEXT
INCIDENCY
And a destination table called "anda" with the following:
TIPO = 31 (for all rows)
DESCRI = 'Site' (for all rows)
SECU = sequence number generated with Myseq.NEXTVAL
CENTRO
CODV
TEXT
The last three columns must be filled in with data from "datos" table.
When I execute my query, it all works fine: my table is filled and the sequence generates its values. But the INSERT INTO SELECT has the following conditions:
Every row in the origin "datos" must not already be in the destination "anda" (so it won't be duplicated), and every row in "datos" must have its INCIDENCY flag set to 'N' or NULL.
If a row matches the conditions, it should be inserted.
The query works fine and I have tried it with many different values. Here comes the problem:
When a row has its INCIDENCY value set to 'Y' (so it must not be copied into the destination table), it doesn't appear, but the sequence DOES consume one value, and when I check Myseq.NEXTVAL its value is higher.
How can I prevent the sequence from consuming a value when a row doesn't match the conditions? I've read that Oracle first reserves all the possible values returned by the SELECT query, but I can't find how to prevent it.
Here's the SQL:
INSERT INTO anda (TIPO, DESCRI, SECU, CENTRO, CODV, TEXT)
SELECT 31 TIPO,
       'Site' DESCRI,
       Myseq.NEXTVAL,
       datos.CENTRO,
       datos.CODV,
       datos.TEXT
FROM datos
WHERE (CENTRO, CODV) NOT IN
      (SELECT CENTRO, CODV
       FROM anda)
  AND (datos.INCIDENCY = 'N' OR datos.INCIDENCY IS NULL)
Thanks in advance!!
Definition of MySeq:
CREATE SEQUENCE "BBDD"."MySeq" MINVALUE 800000000000
MAXVALUE 899999999999 INCREMENT BY 1 START WITH 800000000000 CACHE 20 ORDER NOCYCLE;
You might be able to trick Oracle into doing this with a CTE:
INSERT INTO anda (TIPO, DESCRI, SECU, CENTRO, CODV, TEXT)
WITH toinsert as (
SELECT d.*
FROM datos d
WHERE (CENTRO, CODV) NOT IN (SELECT CENTRO, CODV FROM anda) AND
(d.INCIDENCY = 'N' OR d.INCIDENCY IS NULL)
)
SELECT 31 as TIPO, 'Site' as DESCRI, Myseq.NEXTVAL,
d.CENTRO, d.CODV, d.TEXT
FROM toinsert d;
I'm not quite sure whether that will work. A more reliable approach is a before-insert trigger (or an identity column if you are on 12c+), assigning the sequence value in the trigger.
However, I do agree with Hugh Jones: you should be content that the sequence gives each row a unique, increasing value. Gaps can appear for other reasons, such as deletes. Also, I know SQL Server can create gaps when doing parallel inserts; I'm not sure whether the same happens with Oracle.
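For completeness, a minimal sketch of that trigger approach, assuming the anda table and Myseq sequence from the question (the trigger name is made up). The INSERT ... SELECT then omits SECU, and the sequence is only touched for rows that actually pass the WHERE conditions:
CREATE OR REPLACE TRIGGER anda_secu_trg
BEFORE INSERT ON anda
FOR EACH ROW
BEGIN
  :NEW.SECU := Myseq.NEXTVAL;  -- draw a value only for rows actually being inserted
END;
/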
I don't believe you have a real problem (the gaps are not really an issue), but you can put a before-insert row-level trigger on the anda table and set SECU there from your sequence.
Keep in mind, though, that this only keeps the SECU numbers consecutive within a single statement. You'll get gaps anyway for other reasons.
UPDATE: as Alex Poole has commented, the insert itself does not generate gaps.
See a test below:
> drop sequence tst_fgg_seq;
sequence TST_FGG_SEQ dropped.
> drop table tst_fgg;
table TST_FGG dropped.
> drop table tst_insert_fgg;
table TST_INSERT_FGG dropped.
> create sequence tst_fgg_seq start with 1 nocycle;
sequence TST_FGG_SEQ created.
> create table tst_fgg as select level l from dual connect by level < 11;
table TST_FGG created.
> create table tst_insert_fgg as
select tst_fgg_seq.nextval
from tst_fgg
where l between 3 and 5;
table TST_INSERT_FGG created.
> select * from tst_insert_fgg;
NEXTVAL
----------
1
2
3
> insert into tst_insert_fgg
select tst_fgg_seq.nextval
from tst_fgg
where l between 3 and 5;
3 rows inserted.
> select * from tst_insert_fgg;
NEXTVAL
----------
1
2
3
4
5
6
6 rows selected

Why can't I use SELECT ... FOR UPDATE with aggregate functions?

I have an application where I find a Sum() of a database column for a set of records and later use that sum in a separate query, similar to the following (made up tables, but the idea is the same):
SELECT Sum(cost)
INTO v_cost_total
FROM materials
WHERE material_id >=0
AND material_id <= 10;
[a little bit of interim work]
SELECT material_id, cost/v_cost_total
INTO v_material_id_collection, v_pct_collection
FROM materials
WHERE material_id >=0
AND material_id <= 10
FOR UPDATE;
However, in theory someone could update the cost column on the materials table between the two queries, in which case the calculated percents will be off.
Ideally, I would just use a FOR UPDATE clause on the first query, but when I try that, I get an error:
ORA-01786: FOR UPDATE of this query expression is not allowed
Now, the work-around isn't the problem - just do an extra query to lock the rows before finding the Sum(), but that query would serve no other purpose than locking the tables. While this particular example is not time consuming, the extra query could cause a performance hit in certain situations, and it's not as clean, so I'd like to avoid having to do that.
Does anyone know of a particular reason why this is not allowed? In my head, the FOR UPDATE clause should just lock the rows that match the WHERE clause - I don't see why it matters what we are doing with those rows.
EDIT: It looks like SELECT ... FOR UPDATE can be used with analytic functions, as suggested by David Aldridge below. Here's the test script I used to prove this works.
SET serveroutput ON;
CREATE TABLE materials (
material_id NUMBER(10,0),
cost NUMBER(10,2)
);
ALTER TABLE materials ADD PRIMARY KEY (material_id);
INSERT INTO materials VALUES (1,10);
INSERT INTO materials VALUES (2,30);
INSERT INTO materials VALUES (3,90);
<<LOCAL>>
DECLARE
l_material_id materials.material_id%TYPE;
l_cost materials.cost%TYPE;
l_total_cost materials.cost%TYPE;
CURSOR test IS
SELECT material_id,
cost,
Sum(cost) OVER () total_cost
FROM materials
WHERE material_id BETWEEN 1 AND 3
FOR UPDATE OF cost;
BEGIN
OPEN test;
FETCH test INTO l_material_id, l_cost, l_total_cost;
Dbms_Output.put_line(l_material_id||' '||l_cost||' '||l_total_cost);
FETCH test INTO l_material_id, l_cost, l_total_cost;
Dbms_Output.put_line(l_material_id||' '||l_cost||' '||l_total_cost);
FETCH test INTO l_material_id, l_cost, l_total_cost;
Dbms_Output.put_line(l_material_id||' '||l_cost||' '||l_total_cost);
END LOCAL;
/
Which gives the output:
1 10 130
2 30 130
3 90 130
The syntax SELECT ... FOR UPDATE locks records in a table to prepare for an update. When you do an aggregation, the result set no longer refers to the original rows.
In other words, there are no records in the database to update, just a temporary result set.
You might try something like:
<<LOCAL>>
declare
material_id materials.material_id%Type;
cost materials.cost%Type;
total_cost materials.cost%Type;
begin
select material_id,
cost,
sum(cost) over () total_cost
into local.material_id,
local.cost,
local.total_cost
from materials
where material_id between 1 and 3
for update of cost;
...
end local;
The first row gives you the total cost, but it selects all the rows and in theory they could be locked.
I don't know if this is allowed, mind you -- be interesting to hear whether it is.
For example, there is a product table with id, name and stock, as shown below.
product table:
id | name   | stock
---+--------+------
 1 | Apple  |     3
 2 | Orange |     5
 3 | Lemon  |     8
Then both of the queries below can run sum() and SELECT ... FOR UPDATE together:
SELECT sum(stock) FROM (SELECT * FROM product FOR UPDATE) AS result;
WITH result AS (SELECT * FROM product FOR UPDATE) SELECT sum(stock) FROM result;
Output:
sum
-----
16
(1 row)
For that, you can use the WITH clause.
Example:
WITH result AS (
-- your select
) SELECT * FROM result GROUP BY material_id;
Is your problem "However, in theory someone could update the cost column on the materials table between the two queries, in which case the calculated percents will be off."?
In that case, you can probably simply use an inner query:
SELECT material_id, cost/(SELECT Sum(cost)
FROM materials
WHERE material_id >=0
AND material_id <= 10)
INTO v_material_id_collection, v_pct_collection
FROM materials
WHERE material_id >=0
AND material_id <= 10;
Why do you want to lock a table? Other applications might fail if they try to update it during that time, right?

Get last record of a table in Postgres

I'm using Postgres and cannot manage to get the last record of my table:
my_query = client.query("SELECT timestamp, value, card FROM my_table");
How can I do that, knowing that timestamp is a unique identifier of the record?
If by "last record" you mean the record with the latest timestamp value, then try this:
my_query = client.query("
SELECT TIMESTAMP,
value,
card
FROM my_table
ORDER BY TIMESTAMP DESC
LIMIT 1
");
you can use
SELECT timestamp, value, card
FROM my_table
ORDER BY timestamp DESC
LIMIT 1
assuming you also want to sort by timestamp.
Easy way: ORDER BY in conjunction with LIMIT
SELECT timestamp, value, card
FROM my_table
ORDER BY timestamp DESC
LIMIT 1;
However, LIMIT is not standard and, as stated by Wikipedia, "The SQL standard's core functionality does not explicitly define a default sort order for Nulls." Finally, only one row is returned even when several records share the maximum timestamp.
Relational way:
The typical way of doing this is to check that no row has a higher timestamp than any row we retrieve.
SELECT timestamp, value, card
FROM my_table t1
WHERE NOT EXISTS (
SELECT *
FROM my_table t2
WHERE t2.timestamp > t1.timestamp
);
It is my favorite solution, and the one I tend to use. The drawback is that our intent is not immediately clear when glancing at this query.
Instructive way: MAX
To circumvent this, one can use MAX in the subquery instead of the correlation.
SELECT timestamp, value, card
FROM my_table
WHERE timestamp = (
SELECT MAX(timestamp)
FROM my_table
);
But without an index, two passes over the data are necessary, whereas the previous query can find the solution in a single scan. That said, we should not take performance into consideration when designing queries unless necessary, as we can expect optimizers to improve over time. However, this particular kind of query is quite common.
Show off way: Windowing functions
I don't recommend doing this, but maybe you can make a good impression on your boss or something ;-)
SELECT DISTINCT
first_value(timestamp) OVER w,
first_value(value) OVER w,
first_value(card) OVER w
FROM my_table
WINDOW w AS (ORDER BY timestamp DESC);
Actually this has the virtue of showing that a simple query can be expressed in a wide variety of ways (there are several others I can think of), and that picking one or the other form should be done according to several criteria such as:
portability (Relational/Instructive ways)
efficiency (Relational way)
expressiveness (Easy/Instructive way)
If your table has no id (such as an auto-incrementing integer) and no timestamp, you can still get the last row of a table with the following query:
select * from <tablename> offset ((select count(*) from <tablename>)-1)
For example, that could allow you to search through an updated flat file, find/confirm where the previous version ended, and copy the remaining lines to your table.
The last inserted record can be queried with the following, assuming "id" is the primary key:
SELECT timestamp,value,card FROM my_table WHERE id=(select max(id) from my_table)
Assuming every new row inserted will use the highest integer value for the table's id.
A tip: create a serial id in this table. Its default will be:
nextval('table_name_field_seq'::regclass).
Then you can query the last assigned value. Using your example:
pg_query($connection, "SELECT currval('table_name_field_seq') AS id;");
Note that currval() is session-local: it only works after nextval() has been called in the same session (e.g. by your insert). I hope this tip helps you.
To get the last row,
Get Last row in the sorted order: In case the table has a column specifying time/primary key,
Using LIMIT clause
SELECT * FROM USERS ORDER BY CREATED_TIME DESC LIMIT 1;
Using FETCH clause
SELECT * FROM USERS ORDER BY CREATED_TIME DESC FETCH FIRST ROW ONLY;
Get Last row in the rows insertion order: In case the table has no columns specifying time/any unique identifiers
Using CTID system column, where ctid represents the physical location of the row in a table
SELECT * FROM USERS WHERE CTID = (SELECT MAX(CTID) FROM USERS);
Consider the following table,
userid |username | createdtime |
1 | A | 1535012279455 |
2 | B | 1535042279423 | //as per created time, this is the last row
3 | C | 1535012279443 |
4 | D | 1535012212311 |
5 | E | 1535012254634 | //as per insertion order, this is the last row
Queries 1 and 2 return:
userid |username | createdtime |
2 | B | 1535042279423 |
while query 3 returns:
userid |username | createdtime |
5 | E | 1535012254634 |
Note: when an existing row is updated, Postgres removes the old row version and inserts the updated data as a new row in the table, so the CTID query returns the tuple whose data was modified most recently.
Now, updating a row using
UPDATE USERS SET USERNAME = 'Z' WHERE USERID='3'
the table becomes:
userid |username | createdtime |
1 | A | 1535012279455 |
2 | B | 1535042279423 |
4 | D | 1535012212311 |
5 | E | 1535012254634 |
3 | Z | 1535012279443 |
Now query 3 returns:
userid |username | createdtime |
3 | Z | 1535012279443 |
These are all good answers, but if you want an aggregate function that grabs the last row of the result set generated by an arbitrary query, there's a standard way to do it (taken from the Postgres wiki, but it should work in anything conforming reasonably to the SQL standard as of a decade or more ago):
-- Create a function that always returns the last non-NULL item
CREATE OR REPLACE FUNCTION public.last_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE SQL IMMUTABLE STRICT AS $$
SELECT $2;
$$;
-- And then wrap an aggregate around it
CREATE AGGREGATE public.LAST (
sfunc = public.last_agg,
basetype = anyelement,
stype = anyelement
);
It's usually preferable to do select ... limit 1 if you have a reasonable ordering, but this is useful if you need to do this within an aggregate and would prefer to avoid a subquery.
See also this question for a case where this is the natural answer.
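A quick usage sketch against the my_table example from earlier in this thread (Postgres accepts an ORDER BY inside an aggregate call, so this returns the card value of the row with the latest timestamp):
SELECT public.last(card ORDER BY timestamp) AS last_card
FROM my_table;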
The column name plays an important role in the descending order:
select <COLUMN_NAME1>, <COLUMN_NAME2> from <TABLENAME> ORDER BY <COLUMN_NAME_THAT_MENTIONS_TIME> DESC LIMIT 1;
For example: The below-mentioned table(user_details) consists of the column name 'created_at' that has timestamp for the table.
SELECT userid, username FROM user_details ORDER BY created_at DESC LIMIT 1;
In Oracle SQL,
select * from (select row_number() over (order by rowid desc) rn, emp.* from emp) where rn=1;
select * from table_name LIMIT 1; -- note: without an ORDER BY this returns an arbitrary row, not necessarily the last one