How do I efficiently create a TRIGGER which, on occasion, generates a VIEW?

On a small pre-registration database, I'm using the following SQL to generate a VIEW whenever a specific user name is given. I'm using this mainly to get a snapshot whenever a sysadmin suspects duplicate names are registering. This will be done rarely (at most once per hour), so the database schema shouldn't grow excessively.
CREATE OR REPLACE TRIGGER upd_on_su_entry
AFTER UPDATE OR INSERT
ON PRE_REG_MEMBER
FOR EACH ROW
BEGIN
    IF :new.MEMBER_NAME = 'SysAdmin Dup Tester' THEN
        EXECUTE IMMEDIATE 'CREATE OR REPLACE VIEW mem_n AS SELECT :old.MEMBER_NAME, COUNT(:old.MEMBER_NAME) FROM MEMBER GROUP BY MEMBER_NAME';
    END IF;
END;
However, this appears to be a bloated, inefficient and erroneous way of working (according to my admin). Is there a fundamental error here? Can I take an equivalent snapshot some other way?
I'm very new to SQL, so please bear with me.
Also, I want to be using the view as :
public void dups() throws Exception
{
    Calendar cal = Calendar.getInstance();
    jt.setText("Duplicate List at : " + cal.getTime());
    try {
        rs = stat.executeQuery("select * from mem_n");
        while (rs.next())
        {
            jt.append(rs.getString("MEMBER_NAME") + "\t");
            jt.append(rs.getString(2) + "\t");
        }
    }
    catch (Exception e) { System.out.print("\n" + e); }
}

There seem to be some misunderstandings here.
1.) Views are basically stored SQL statements, not stored SQL results, so your view will always display the data as it is at the point of querying the view.
2.) Never ever use DDL (CREATE statements) and the like during normal processing of an application. It's just not the way databases are intended to work.
If you want a snapshot at a point in time, create a secondary table that contains all the columns of the original table plus a snapshot timestamp.
Whenever you want to make a snapshot, copy all the data you want from the original table into the snapshot table, adding the current timestamp.
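For example, a minimal sketch in Oracle-style SQL (the snapshot table name and timestamp column here are assumptions, not from the original post):
-- Create an empty copy of the source table plus a snapshot timestamp
-- (WHERE 1 = 0 copies the structure without copying any rows).
CREATE TABLE pre_reg_member_snapshot AS
SELECT m.*, CAST(NULL AS TIMESTAMP) AS snapshot_ts
FROM pre_reg_member m
WHERE 1 = 0;

-- Whenever a snapshot is wanted, copy the current rows with a timestamp:
INSERT INTO pre_reg_member_snapshot
SELECT m.*, SYSTIMESTAMP
FROM pre_reg_member m;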

Based on your comment, it sounds like you want something like this:
SELECT MEMBER_NAME FROM PRE_REG_MEMBER
GROUP BY MEMBER_NAME HAVING COUNT(*) > 1;
This will return all members with more than one row in the table.
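Since your Java code expects two columns (the name and a count), you could also include the count in the projection, e.g.:
SELECT MEMBER_NAME, COUNT(*) AS dup_count
FROM PRE_REG_MEMBER
GROUP BY MEMBER_NAME
HAVING COUNT(*) > 1;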

Again, ask yourself: "What am I trying to do?"
Don't focus on the technology. Do that after you have a clear idea of what your goal is.
If you are just trying to avoid duplicate registrations on your database, just search the users table and show an error if that username is already there.
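For example, a unique constraint lets the database enforce this for you (a sketch; the constraint name here is made up):
-- Reject duplicate member names at the database level
ALTER TABLE PRE_REG_MEMBER
ADD CONSTRAINT uq_member_name UNIQUE (MEMBER_NAME);
With that in place, a duplicate registration fails with a constraint violation that the application can catch and report.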
Also, think carefully about your data model before going into the implementation details.

Related

How do you deduplicate records in a BigQuery table?

We have a script that should run daily at 12 am, via a GCP Cloud Function and Cloud Scheduler, sending data to a table in BigQuery.
Unfortunately, the cron job used to send the data every minute at 12 am, which means the file was uploaded 60 times instead of once.
The cron timer was * * 3 * * * instead of 00 3 * * *
How can we fix the table?
Note that the transferred data has since been deleted from the source. So far we depend on selecting the unique values, but the table is getting too large.
Any help would be much appreciated
I have two options for you, plus a comment on how to avoid this in future. I recommend reading and comparing both options before proceeding.
Option One
If this is a one-off fix, I recommend you simply:
1.) navigate to the table (your_dataset.your_table) in the UI
2.) click 'snapshot' and create a snapshot in case you make a mistake in the next part
3.) run SELECT DISTINCT * FROM your_dataset.your_table in the UI
4.) click 'save results' and select 'bigquery table', then save as a new table (e.g. your_dataset.your_table_deduplicated)
5.) navigate back to the old table and click the 'delete' button, then authorise the deletion
6.) navigate to the new table and click the 'copy' button, then save it in the location the old table was in before (i.e. call the copy your_dataset.your_table)
7.) delete your_dataset.your_table_deduplicated
This procedure replaces the current table with one that has the same schema but no duplicated records. You should check that it looks as you expect before you discard your snapshot.
Option Two
A quicker approach, if you're comfortable with it, would be using the Data Manipulation Language (DML).
There is a DELETE statement, but you'd have to construct an appropriate WHERE clause to only delete the duplicate rows.
There is a simpler approach, which is equivalent to option one and just requires you to run this query:
CREATE OR REPLACE TABLE your_dataset.your_table AS
SELECT DISTINCT * FROM your_dataset.your_table
Again, you may wish to take a snapshot before running this.
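If you'd rather take that snapshot in SQL than through the UI, BigQuery has snapshot DDL for this; a minimal sketch (the snapshot name is an assumption):
-- Create a lightweight, point-in-time snapshot of the table
CREATE SNAPSHOT TABLE your_dataset.your_table_backup
CLONE your_dataset.your_table;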
The Future
If you have a cloud function that sends data to BigQuery on a schedule, then best practice would be for this function to be idempotent (i.e. no matter how many times you run it, if the input is the same, the output is the same).
A typical pattern would be to add a stage to your function to pre-filter the new records.
Depending on your requirements, this stage could:
1.) prepare the new records you want to insert, which should have some unique, immutable ID field
2.) SELECT some_unique_id FROM your_dataset.your_table -> old_record_ids
3.) filter the new records, e.g. in Python: new_records = [record for record in prepared_records if record["id"] not in old_record_ids]
4.) upload only the records that don't exist yet
This will prevent the sort of issues you have encountered here.
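If you'd rather push the "upload only new records" step into SQL, a MERGE against a staging table achieves the same thing; a minimal sketch, assuming a staging table and ID column that are not in the original post:
-- Load the new batch into a staging table first, then:
MERGE your_dataset.your_table t
USING your_dataset.your_table_staging s
ON t.some_unique_id = s.some_unique_id
WHEN NOT MATCHED THEN
  INSERT ROW;
Rows whose some_unique_id already exists in the target are simply skipped, so re-running the job is harmless.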

How to reference the latest table from a manually partitioned BigQuery table

We have a manually partitioned "video metadata" table being fed fresh data each day. In our system, old data is only kept for historical reasons since the latest data is the most up to date.
What we can't figure out is how to reference only the latest partition in this table using LookML.
So far we have attempted to store views in BigQuery. We have tried and failed to store a simple "fetch the newest partition" query as a view, in both standard and legacy SQL, and upon some searching, this seems to be by design, even though the error message states "Dataset not found" instead of something more relevant.
We've also tried to build the filtering into Looker, but we're having trouble with getting things to actually work and only having the latest data returned to us through it.
Any help would be appreciated.
We've managed to find a solution: derived tables.
We figured that since we couldn't define a view on BigQuery's side, we could do it on Looker's side instead, so we defined the table in a derived_table block inside a view.
derived_table: {
  sql: SELECT * FROM `dataset.table_*`
    WHERE _TABLE_SUFFIX = (
      SELECT max(_TABLE_SUFFIX) FROM `dataset.table_*`
    );;
  sql_trigger_value: SELECT max(_TABLE_SUFFIX) FROM `dataset.table_*`;;
}
This gave us a view with just the newest data in it.

Oracle Apex page that loads its data on entering or refreshing the page

Good evening!
At the moment I'm working on a page in an Oracle Apex application, which works the following way. The page contains a big and complex report, with data differentiated by one feature (call it feature A). On the left side of the page there is a catalog menu through which the user can pick the data matching feature A, on the right side the data is shown, and above there is a search bar for finding specific data by other features (feature B, feature C, etc.).
I had a view (call it V_REPORT_BIG_DATA) for showing the report, but it was so big and loaded so slowly that I decided to back the page with a table having the same fields as V_REPORT_BIG_DATA (call it T_REPORT_BIG_DATA_TEMP). It also has an additional field for a process identifier (call it PID) and is temporary not physically, but by its purpose. I thought it should work this way: the user enters the page and receives his own PID for the session (it is assigned only if PID is null, otherwise it doesn't change), and then a procedure (P_REPORT_BIG_DATA_RELOAD) deletes the "old" data and uploads the "new" data, with both actions executed under that one PID and scoped to the current user.
But my idea didn't work correctly. The procedure P_REPORT_BIG_DATA_RELOAD itself works fine and is executed from a page process, and PID is a global application item (generated from a database sequence). But I was stunned to see that my table contains duplicate data belonging to one user and one PID! By adding a log table (recording, in the code of P_REPORT_BIG_DATA_RELOAD, how many rows had been deleted and inserted), I saw a very strange thing: some users "loaded" duplicates as if the upload procedure had been executed several times simultaneously!
Taking all of this into account, my question is: what am I doing wrong? What should I do so that I don't have to use DISTINCT when querying T_REPORT_BIG_DATA_TEMP?
UPD: Some additional facts. Excuse my inattention; I thought I couldn't edit my earlier posts. :-/
Well, I'll explain my problem further. :) Firstly, I did my best to make my view V_REPORT_BIG_DATA load faster, but it involves many, many rows. Secondly, the code executed from the page process (i.e. during the loading of my page) is this:
begin
    if :PID is null then
        :PID := NEW_PID;
    end if;
    P_REPORT_BIG_DATA_RELOAD(AUTH => :SUSER, PID => :PID);
end;
NEW_PID is a function which generates a new PID, and P_REPORT_BIG_DATA_RELOAD is my procedure which refreshes the data for the given user and PID.
And the code of my procedure is this:
procedure P_REPORT_BIG_DATA_RELOAD
    (AUTH in varchar2, PID in number)
is
    NCOUNT_DELETED number;
    NCOUNT_INSERTED number;
begin
    --first of all I check that both parameters are not null - let me omit this part

    --I find the count of data to be deleted (for debug only)
    select count(*)
      into NCOUNT_DELETED
      from T_REPORT_BIG_DATA_TEMP T
     where T.AUTHID = AUTH
       and T.PID = P_REPORT_BIG_DATA_RELOAD.PID;

    --I delete the "old" data
    delete from T_REPORT_BIG_DATA_TEMP T
     where T.AUTHID = AUTH
       and T.PID = P_REPORT_BIG_DATA_RELOAD.PID;

    --I upload the "new" data
    insert into T_REPORT_BIG_DATA_TEMP
    select V.*, PID
      from (select S.*
              from V_REPORT_BIG_DATA S
             where S.AUTHID = AUTH) V;

    --I find the count of uploaded data (for debug only)
    NCOUNT_INSERTED := SQL%ROWCOUNT;

    --I write the logs (for debug only)
    insert into T_REPORT_BIG_DATA_TEMP_LG(AUTHID,PID,INS_CNT,DLD_CNT,WHEN)
    values(AUTH,PID,NCOUNT_INSERTED,NCOUNT_DELETED,sysdate);
end P_REPORT_BIG_DATA_RELOAD;
And one more fact: I tried to turn :PID into a page item, but it was cleared after every refresh even though the Maintain session state option is set to Per session, so I couldn't even count on the same PID being used by a given user within a session.

Selecting specific joined record from findAll() with a hasMany() include

(I tried posting this to the CFWheels Google Group (twice), but for some reason my message never appears. Is that list moderated?)
Here's my problem: I'm working on a social networking app in CF on Wheels, not too dissimilar from the one we're all familiar with in Chris Peters's awesome tutorials. In mine, though, I'm required to display the most recent status message in the user directory. I've got a User model with hasMany("statuses") and a Status model with belongsTo("user"). So here's the code I started with:
users = model("user").findAll(include="userprofile, statuses");
This of course returns one record for every status message in the statuses table. Massive overkill. So next I try:
users = model("user").findAll(include="userprofile, statuses", group="users.id");
Getting closer, but now we're getting the first status record for each user (the one with the lowest status.id), when I want the most recent status. In straight SQL I think I would use a subquery to reorder the statuses first, but that's not available to me in the Wheels ORM. So is there another clean way to achieve this, or will I have to drag a huge query result or object into my CFML and then filter out the statuses while I loop?
You can grab the most recent status using a calculated property:
// models/User.cfc
function init() {
    property(
        name="mostRecentStatusMessage",
        sql="SELECT message FROM statuses WHERE userid = users.id ORDER BY createdat DESC LIMIT 1"
    );
}
Of course, the syntax of the SELECT statement will depend on your RDBMS, but that should get you started.
The downside is that you'll need to create a calculated property for each column that you need available in your query.
The other option is to create a method in your model and write custom SQL in <cfquery> tags. That way is perfectly valid as well.
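As an illustration of that custom-SQL route, a query along these lines returns each user with only their latest status (a sketch reusing the statuses/userid/createdat names from the calculated property above; u.username is assumed):
SELECT u.id, u.username, s.message
FROM users u
LEFT JOIN statuses s
  ON s.userid = u.id
 AND s.createdat = (
       -- latest status per user; ties on createdat would return extra rows
       SELECT MAX(s2.createdat)
       FROM statuses s2
       WHERE s2.userid = u.id
     );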
I don't know your exact DB schema, but shouldn't your findAll() look more like this:
statuses = model("status").findAll(include="userprofile(user)", where="userid = users.id");
That should get all statuses from a specific user... or do you need it for all users? I'm finding your question a little tricky to work out. What exactly are you trying to get returned?

Can I insert in a programmatically defined PostgreSQL table using SQL language?

Context: I'm trying to INSERT data in a partitioned table. My table is partitioned by months, because I have lots of data (and a volume expected to increase) and the most recent data is more often queried. Any comment on the partition choice is welcome (but an answer to my question would be more than welcome).
The documentation has a partitioning example in which, when inserting a row, a trigger is called that checks the new row's date and inserts it accordingly into the right "child" table. It uses a sequence of IF and ELSIF statements, one for each month. The guy (or gal) maintaining this has to create a new table and update the trigger function every month.
I don't really like this solution. I want to code something that will work perfectly, that I won't need to update every now and then, and that will outlive me and my great-great-grandchildren.
So I was wondering if I could make a trigger that would look like this:
INSERT INTO get_the_appropriate_table_name(NEW.date) VALUES (NEW.*);
Unfortunately, all my attempts have failed. I tried using "regclass" stuff, but with no success.
In short, I want to make up a string and use it as a table name. Is that possible?
I was just about to write a trigger function using EXECUTE to insert into a table according to the date_parts of now(), or create it first if it should not exist... when I found that somebody had already done that for us - right under the chapter of the docs you are referring to yourself:
http://www.postgresql.org/docs/9.0/interactive/ddl-partitioning.html
Scroll all the way down to "User Comments". Laban Mwangi posted an example.
Update:
The /interactive branch of the Postgres manual has since been removed, and links are redirected. So the comment I was referring to is gone. Look to these later, closely related answers for detailed instructions:
INSERT with dynamic table name in trigger function
Table name as a PostgreSQL function parameter
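In the spirit of those answers, the core of such a trigger function looks roughly like this (a sketch only; the child-table naming scheme and the date column are assumptions):
-- Route each row to a child table named after the month of NEW.date,
-- e.g. mytable_2024_01. The child tables must already exist.
CREATE OR REPLACE FUNCTION insert_into_partition()
  RETURNS trigger AS
$$
BEGIN
    EXECUTE format('INSERT INTO %I SELECT ($1).*',
                   'mytable_' || to_char(NEW.date, 'YYYY_MM'))
    USING NEW;
    RETURN NULL;  -- cancel the insert into the parent table
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_route_to_partition
BEFORE INSERT ON mytable
FOR EACH ROW EXECUTE PROCEDURE insert_into_partition();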
For partitioning, there are better solutions by now, like declarative range partitioning in Postgres 10 or later. Example:
Better database for “keep always the 5 latest entries per ID and delete older”?
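For comparison, declarative range partitioning (Postgres 10+) does away with the trigger entirely; a minimal sketch with made-up names:
CREATE TABLE mytable (
    id      bigint GENERATED ALWAYS AS IDENTITY,
    date    date NOT NULL,
    payload text
) PARTITION BY RANGE (date);

-- One child table per month; new months still need a new child table,
-- but no trigger function ever has to change.
CREATE TABLE mytable_2024_01 PARTITION OF mytable
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');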