Fetch and update large amounts of data in PostgreSQL - sql

I am building a web application and using PostgreSQL as the database. I need to fetch and update thousands of rows every 5-10 minutes. Let's say I have 1M rows in my table with the following schema:
ServiceStatus {
id: string,
userid: string,
status: string,
}
I will be fetching all the rows with a given service status (let's assume 100,000 rows every 5 minutes), doing some processing based on the status, and updating the status in the database. As I said, I'll do this every 5-10 minutes. What's the most efficient approach to this?

Fetching 100,000 rows once every 5 minutes will not be strenuous. Updating them should not be either, but avoid updating a row just to set it back to the same value it already holds.
If this turns out to be a problem, it will come down to some detail you haven't described to us, and which we can't guess.
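To make that concrete, here is a minimal sketch of what the periodic cycle could look like, assuming placeholder status values ('PENDING', 'DONE'), an index name of my choosing, and that the processed results are written back in a single statement (none of these specifics are from the question):
-- an index on status keeps the periodic fetch cheap
CREATE INDEX IF NOT EXISTS servicestatus_status_idx ON ServiceStatus (status);

-- fetch the batch to process
SELECT id, userid, status
FROM ServiceStatus
WHERE status = 'PENDING';

-- write the new statuses back in one statement,
-- skipping rows whose status would not actually change
UPDATE ServiceStatus AS s
SET status = v.new_status
FROM (VALUES
        ('id-1', 'DONE'),
        ('id-2', 'DONE')
     ) AS v(id, new_status)
WHERE s.id = v.id
  AND s.status IS DISTINCT FROM v.new_status;
In practice the VALUES list would be built by the application from the batch it just processed.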

Related

TMSL command to process big tabular cube

Currently I am spending a lot of time processing the cube, and that time increases with every new wave of data.
The cube processing job is a sequence of steps containing ANALYSISCOMMANDs:
1 refresh type: full for all Dimensional tables.
2 refresh type: clearValues for Fact table 1
3 refresh type: dataOnly for Fact table 1
4 refresh type: calculate database (without table indication)
...
22 refresh type: clearValues for Fact table N
23 refresh type: dataOnly for Fact table N
24 refresh type: calculate database (without table indication)
One of the considerations was to split processing into steps so that when processing fails with "out of memory", we can restart from the previous step.
The current sequence does not seem very efficient, and I am looking for ways to decrease the processing time.
Specifically, having "refresh type: calculate database (without table indication)" come after each Fact table block makes me wonder:
can I either indicate the relevant Fact table here, or get rid of those steps and leave a single calculate step at the database level at the very end?
Calculate database (or recalc) needs to be the last operation you do, and it is only needed once because it is a database (model) level command.
More info here: http://bifuture.blogspot.com/2017/02/ssas-processing-tabular-model.html

How long do a temporary table and a job last on BigQuery by default?

By default, how long is a temp table kept so that data can be fetched from it? I know that we can set up the expiration, but what is the default?
And what about the job? What's the default expiration time, if it has one?
I tried to find these in the documentation but couldn't. We return the jobId to the client so they can get the data when the job is complete, but some of them like to store it and try to fetch data with a jobId from 2 weeks or 1 month ago.
What's the default time here, so I can explain it to them better?
Query results are stored for 24 hours:
All query results, including both interactive and batch queries, are cached in temporary tables for approximately 24 hours with some exceptions.
https://cloud.google.com/bigquery/docs/cached-results
As was mentioned by Alexey, the query results are stored for 24 hours when using the cache.
Regarding the lifetime of BigQuery jobs, you can get the job history for the last six months.
On the other hand, based on your description, creating a new table from your query results with an expiration time seems to be the most appropriate strategy. You could also check whether materialized views could help you store the results of recurring queries.
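As a sketch of that strategy, the result of a query can be saved into a table that cleans itself up after 30 days (the dataset, table, and column names below are placeholders, not from the question):
CREATE TABLE mydataset.client_job_results
OPTIONS (
  -- the table is deleted automatically once this timestamp passes
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
) AS
SELECT order_id, status, created_at
FROM mydataset.orders
WHERE status = 'COMPLETE';
The client can then read from this table with its jobId-like name for as long as the expiration allows, instead of relying on the 24-hour temporary results.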

Find how many times a user logged in to the application with different versions

I have an Oracle table which has records of frequent logins with different versions of the application. For Version_1 there is one entry, but for Version_2 there are 8 entries. These 8 entries have different timestamps (differing only in milliseconds). I want to find out, for each user, how many times he is logging in to the Version_2 application. Here we can truncate to minutes to remove the duplicate records.
Here is the sample data with column names
I want to find out how many times each user has logged in, excluding the duplicate entries (these entries have different timestamps that change only in milliseconds).
Try this query:
SELECT count(emp_id) AS log_number, app_version
FROM YOUR_TABLE_NAME
WHERE app_version LIKE 'VER_2'
GROUP BY app_version;
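Note that this counts all VER_2 rows in total. To get the per-user counts the question asks for, and to collapse the near-duplicate rows whose timestamps differ only in milliseconds, the timestamp can be truncated to the minute. A sketch, assuming the timestamp column is called login_time (the column name is a guess, since the sample data is not shown):
SELECT emp_id,
       app_version,
       COUNT(DISTINCT TRUNC(login_time, 'MI')) AS log_number
FROM YOUR_TABLE_NAME
WHERE app_version LIKE 'VER_2'
GROUP BY emp_id, app_version;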

SQL Server Query: Daily Data Snapshot Comparison (Counting Delta Occurrences)

I am working towards counting customer subscription ("package") changes. To do this, I am selecting all data from my package table once every day. I am calling the daily query results "snapshots" (approx 500k rows). I then load the snapshot data into a new table. After 10 days I have a total of 5 million rows in the snapshots table (500k rows * 10 days). The majority of customers (65%) do not change packages. I need to report which customers, of the remaining 35%, are switching packages, when they are switching, what package changes they are making (from "package X" to "package Y"), and which customers are changing packages most frequently.
The query I have written uses a self-join. I am identifying the changes but my results contain duplicate rows.
This is my query:
select *
from UserPackageDump UPD1, UserPackageDump UPD2
where UPD1.user_id = UPD2.user_id
and UPD1.package_id <> UPD2.package_id
How can I change this query to yield only distinct results?
SELECT DISTINCT *
FROM UserPackageDump UPD1
JOIN UserPackageDump UPD2
  ON UPD1.user_id = UPD2.user_id
WHERE UPD1.package_id <> UPD2.package_id
You have many options for doing this, and I'm not sure your approach is the right one to take. Firstly, to answer your specific question, you could perform a DISTINCT as per @sqlab's answer. Or you could include the date in the join, ensuring that UPD1 only matches a record in UPD2 that is one day apart.
However, to come back to the approach, there should be no need to take a full copy of all the data. You have lots of other options for more efficient data storage, some of which are:
Put a "LastUpdated" datetime2 field in the database, to be populated each time the row is changed. Copy only those rows that have a LastUpdated more recent than the last time the copy was made. Assuming the only change possible to the table is to change the package_id then you will now only have rows in the table for users that have changed.
Create a UserPackageHistory table into which rows are written each time a user subscribes to a package, at the same time that UserPackage is updated. This then leaves you with much the same result as the first bullet, but in advance of running the copy job.
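For the second option, a sketch of what the dual write could look like, assuming a UserPackageHistory table with the columns shown and illustrative variable values (none of these names or types are from the question):
DECLARE @UserId int = 42;           -- illustrative values, not from the question
DECLARE @NewPackageId int = 7;

-- record the switch at the same time the package is changed
INSERT INTO UserPackageHistory (user_id, old_package_id, new_package_id, switch_date)
SELECT user_id, package_id, @NewPackageId, GETDATE()
FROM UserPackage
WHERE user_id = @UserId
  AND package_id <> @NewPackageId;  -- only log a real change

UPDATE UserPackage
SET package_id = @NewPackageId
WHERE user_id = @UserId;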
Then, with any one of these sets of data, to satisfy the reporting requirements you could populate a cube. Your source would be a set of rows containing user_id, old_package_id, new_package_id and date. You would create a measure group containing these measures:
Distinct count of user_id
Count of switches (basically just the row count of the source data)
This measure group could then be related to the following dimensions:
Date, so you can see when the switches are taking place
User, so you can drill down on who is switching
Switch Type, which is a dimension built by selecting the old_package_id and new_package_id from your source data. This gives you the ability to see the popularity of particular switches.
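For reference, a rough sketch of how those source rows (user_id, old_package_id, new_package_id, date) could be derived from the snapshot table, assuming a snapshot_date column records which daily copy each row belongs to (that column name is an assumption):
SELECT UPD1.user_id,
       UPD1.package_id AS old_package_id,
       UPD2.package_id AS new_package_id,
       UPD2.snapshot_date AS switch_date
FROM UserPackageDump UPD1
JOIN UserPackageDump UPD2
  ON UPD1.user_id = UPD2.user_id
 AND UPD2.snapshot_date = DATEADD(day, 1, UPD1.snapshot_date)  -- compare consecutive snapshots only
WHERE UPD1.package_id <> UPD2.package_id;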

SQL Update query not completed

I made a SQL query to update records in a table. The table has about 15 million records. The update statement is like:
UPDATE temp_conafe
set apoyo = trim(apoyo)
where cve_status like '%APOYO%';
I keep checking the field v$transaction.used_urec to see if the query is rolling forward or backward, but when the number of records reaches more than 15 million the query starts rolling back.
How do I get the update to complete successfully?
I'm not the DBA, just a programmer, but I can't keep developing until that thing updates my records.
It looks as if your transaction is too big. Try adding another limiting clause in the WHERE. If you have an id field you can add something like this:
where cve_status like '%APOYO%'
AND id > 1 AND id < 100000
You need to run it multiple times and change the range accordingly. If this is not an option, you have to talk to your DBA and ask them to give you more resources.
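As an alternative to adjusting the range by hand, the same idea can be expressed as a PL/SQL loop that updates in chunks and commits after each chunk. This is only a sketch: the 100,000-row chunk size and the extra condition that skips rows already trimmed (which is also what makes the loop terminate) are my assumptions, and committing between batches must be acceptable for your data:
BEGIN
  LOOP
    UPDATE temp_conafe
    SET apoyo = trim(apoyo)
    WHERE cve_status LIKE '%APOYO%'
      AND apoyo <> trim(apoyo)      -- skip rows that are already trimmed
      AND rownum <= 100000;         -- limit the size of each transaction
    EXIT WHEN SQL%ROWCOUNT = 0;     -- stop once nothing is left to update
    COMMIT;
  END LOOP;
  COMMIT;
END;
/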