GroupBy query for billion records - Vertica - sql

I am working on an application with billions of records, and I need to write a query that requires a GROUP BY clause.
Table Schema:
CREATE TABLE event (
eventId INTEGER PRIMARY KEY,
eventTime INTEGER NOT NULL,
sourceId INTEGER NOT NULL,
plateNumber VARCHAR(10) NOT NULL,
plateCodeId INTEGER NOT NULL,
plateCountryId INTEGER NOT NULL,
plateStateId INTEGER NOT NULL
);
CREATE TABLE source (
sourceId INTEGER PRIMARY KEY,
sourceName VARCHAR(32) NOT NULL
);
Scenario:
The user will select sources, say source IDs (1, 2, 3)
We need to get all events that occurred more than once for those sources within an event time range
Events match on the same criteria (same plateNumber, plateCodeId, plateStateId, plateCountryId)
I have prepared a query to perform the above operation, but it's taking a long time to execute.
select plateNumber, plateCodeId, plateStateId,
       plateCountryId, sourceId, count(1)
from event
where sourceId in (1, 2, 3)
group by sourceId, plateCodeId, plateStateId,
         plateCountryId, plateNumber
having count(1) > 1
limit 10 offset 0;
Can you recommend optimized query for it?

Since you didn't supply the projection DDL, I'll assume the projection is the default one created by the CREATE TABLE statement.
Your goal is to get the GROUPBY PIPELINED algorithm instead of GROUPBY HASH, which is usually slower and consumes more memory.
To do so, you need the table's projection to be sorted by the columns in the GROUP BY clause.
More info here: GROUP BY Implementation Options
CREATE TABLE event (
eventId INTEGER PRIMARY KEY,
eventTime INTEGER NOT NULL,
sourceId INTEGER NOT NULL,
plateNumber VARCHAR(10) NOT NULL,
plateCodeId INTEGER NOT NULL,
plateCountryId INTEGER NOT NULL,
plateStateId INTEGER NOT NULL
)
ORDER BY sourceId,
plateCodeId,
plateStateId,
plateCountryId,
plateNumber;
You can see which algorithm is being used by adding EXPLAIN before your query.
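For example, a minimal sketch (prefixing the original query; the plan output should name either GROUPBY PIPELINED or GROUPBY HASH):
EXPLAIN
select plateNumber, plateCodeId, plateStateId,
       plateCountryId, sourceId, count(1)
from event
where sourceId in (1, 2, 3)
group by sourceId, plateCodeId, plateStateId,
         plateCountryId, plateNumber
having count(1) > 1
limit 10 offset 0;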

Related

How to find the columns that need to be indexed?

I'm starting to learn SQL and relational databases. Below is the table that I have; it has around 10 million records. My composite key is (reltype, from_product_id, to_product_id).
What strategy should I follow when selecting the columns that need to be indexed? I have also documented the operations that will be performed on the table. Please help me determine which columns, or combinations of columns, need to be indexed.
Table DDL is shown below.
Table name: prod_rel.
Database schema name: public
CREATE TABLE public.prod_rel (
reltype varchar NULL,
assocsequence float4 NULL,
action varchar NULL,
from_product_id varchar NOT NULL,
to_product_id varchar NOT NULL,
status varchar NULL,
starttime varchar NULL,
endtime varchar null,
PRIMARY KEY (reltype, from_product_id, to_product_id)
);
Operations performed on table:
select distinct(reltype)
from public.prod_rel;
update public.prod_rel
set status = ? , starttime = ?
where from_product_id = ?;
update public.prod_rel
set status = ? , endtime = ?
where from_product_id = ?;
select *
from public.prod_rel
where from_product_id in (select distinct (from_product_id)
from public.prod_rel
where status = ?
and action in ('A', 'E', 'C', 'P')
and reltype = ?
fetch first 1000 rows only);
Note: I'm not performing any JOIN operations. Also, please ignore the casing of table and column names. I'm just getting started.
Ideal would be two indexes:
CREATE INDEX ON prod_rel (from_product_id);
CREATE INDEX ON prod_rel (status, reltype)
WHERE action IN ('A', 'E', 'C', 'P');
Your primary key (which is also implemented using an index) cannot support queries 2 and 3 because from_product_id is not at the beginning. If you redefine the primary key as (from_product_id, to_product_id, reltype), you don't need the first index I suggested.
Why does order matter? Imagine you are looking for a book in a library where the books are ordered by “last name, first name”. You can use this ordering to find all books by “Dickens” quickly, but not all books by any “Charles”.
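In SQL terms, a hedged sketch of the same point, using the redefined primary key (the literal values are made up for illustration):
-- With PRIMARY KEY (from_product_id, to_product_id, reltype), the index's
-- leading column supports this lookup directly:
SELECT * FROM prod_rel WHERE from_product_id = 'P100';
-- ...but a predicate on a trailing column alone cannot use the index efficiently:
SELECT * FROM prod_rel WHERE to_product_id = 'P200';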
But let me also comment on your queries.
The first one will perform badly if there are lots of different reltype values; try raising work_mem in that case. It is always a sequential scan of the whole table, and no index can help.
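A minimal sketch of that knob (standard PostgreSQL; the value is an arbitrary example):
-- Raise the per-operation memory budget for the current session
SET work_mem = '256MB';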
I changed the order of the primary key columns as shown below, per @a_horse_with_no_name's suggestion, and created only one index on the (from_product_id, reltype, status, action) columns.
CREATE TABLE public.prod_rel (
reltype varchar NULL,
assocsequence float4 NULL,
action varchar NULL,
from_product_id varchar NOT NULL,
to_product_id varchar NOT NULL,
status varchar NULL,
starttime varchar NULL,
endtime varchar null,
PRIMARY KEY (from_product_id, to_product_id, reltype)
);
Also, I have gone through the site suggested by @a_horse_with_no_name. It was amazing; I learned a lot of new things about indexing.
https://use-the-index-luke.com/

SQLITE3: find IDs across multiple tables

I would like to analyse which codes appear in multiple tables under certain conditions. However, I don't think the database schema suits the task very well, but maybe there's something I don't know about that can help me. Here's a simplified schema:
CREATE TABLE "batchDescription" (
id INTEGER NOT NULL,
name TEXT NOT NULL UNIQUE,
PRIMARY KEY (id)
);
CREATE TABLE "simulationDetails" (
id INTEGER NOT NULL,
ko_index_id INTEGER NOT NULL,
batch_description_id INTEGER NOT NULL,
data1 REAL NOT NULL,
data2 INTEGER NOT NULL,
PRIMARY KEY (id),
FOREIGN KEY(ko_index_id) REFERENCES "koIndex" (id),
FOREIGN KEY(batch_description_id) REFERENCES "batchDescription" (id)
);
CREATE TABLE "koIndex" (
id INTEGER NOT NULL,
number_of_kos INTEGER NOT NULL,
PRIMARY KEY (id)
);
CREATE TABLE "1kos" (
ko_index_id INTEGER NOT NULL,
ko1 INTEGER NOT NULL,
PRIMARY KEY (ko_index_id),
FOREIGN KEY(ko_index_id) REFERENCES "koIndex" (id)
);
CREATE TABLE "2kos" (
ko_index_id INTEGER NOT NULL,
ko1 INTEGER NOT NULL,
ko2 INTEGER NOT NULL,
PRIMARY KEY (ko_index_id),
FOREIGN KEY(ko_index_id) REFERENCES "koIndex" (id)
);
CREATE TABLE "3kos" (
ko_index_id INTEGER NOT NULL,
ko1 INTEGER NOT NULL,
ko2 INTEGER NOT NULL,
ko3 INTEGER NOT NULL,
PRIMARY KEY (ko_index_id),
FOREIGN KEY(ko_index_id) REFERENCES "koIndex" (id)
);
This goes up to table "525kos" which has ko1 to ko525 in it - ko1 to ko525 are IDs that are primary keys in a table not shown here. I want to do an analysis of how often certain IDs are present under certain conditions. Here is a simple example to illustrate:
I would like to count the number of times a certain ID (let's say 127) occurs in any koX column of the "13kos" table when simulationDetails.data1 is not equal to 0. I would do this on a database called ko.db from the bash command line like:
for ko_idx in {1..13}; do sqlite3 ko.db "select count(ko${ko_idx}) from '13kos' where ko${ko_idx} = 127 and ko_index_id in (select ko_index_id from simulationDetails where data1 != 0);"; done
This is already slow and inefficient, but it is simple compared to what I would like to do. What if I wanted to analyse all the IDs in all possible columns in all "Xkos" tables and compare them between rows where data1 is equal to zero and rows where it is not?
Can anybody direct me to a better way of doing this, or is the schema design just not suited to this kind of analysis, so I'll have to give up?
EDIT: Thought I'd add a bit of extra detail to avoid confusion. I suspect that a good way to achieve what I want would be to somehow combine all the "Xkos" tables into one temporary table and then search for certain IDs in that table. How would I combine all 525 ko tables without writing out each table name?
How would I combine all 525 ko tables without writing out each table name?
1. Create a table with the same number of columns as the largest table (the table into which you merge), allowing NULLs.
2. Query the sqlite_master table using something like:
SELECT * FROM sqlite_master WHERE name LIKE '%kos%' AND type = 'table';
3. Loop through the extracted table names, building an INSERT INTO ... SELECT for each table that inserts that table's rows into the table created in step 1, paying particular attention to the handling of missing columns.
4. When done, the table created in step 1 will be populated accordingly.
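A hedged sketch of steps 1 and 3, with the merged table truncated to the first few ko columns for brevity (the table name all_kos is an assumption):
-- Step 1: merged table with as many ko columns as the widest table ("525kos");
-- only the first three are shown here
CREATE TABLE all_kos (
    ko_index_id INTEGER NOT NULL,
    ko1 INTEGER,
    ko2 INTEGER,
    ko3 INTEGER
    -- ..., ko525 INTEGER
);
-- Step 3: one INSERT ... SELECT per source table; columns a source table
-- lacks are simply left NULL
INSERT INTO all_kos (ko_index_id, ko1) SELECT ko_index_id, ko1 FROM "1kos";
INSERT INTO all_kos (ko_index_id, ko1, ko2) SELECT ko_index_id, ko1, ko2 FROM "2kos";
-- ...and so on for each "Xkos" table, generated in the loop from step 2.
Once populated, a single query over all_kos replaces the per-table bash loop.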

SQL - Selecting random rows and combining into a new table

Here's the creation of my tables...
CREATE TABLE questions(
id INTEGER PRIMARY KEY AUTOINCREMENT,
question VARCHAR(256) UNIQUE NOT NULL,
rangeMin INTEGER,
rangeMax INTEGER,
level INTEGER NOT NULL,
totalRatings INTEGER DEFAULT 0,
totalStars INTEGER DEFAULT 0
);
CREATE TABLE games(
id INTEGER PRIMARY KEY AUTOINCREMENT,
level INTEGER NOT NULL,
inclusive BOOL NOT NULL DEFAULT 0,
active BOOL NOT NULL DEFAULT 0,
questionCount INTEGER NOT NULL,
completedCount INTEGER DEFAULT 0,
startTime DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE gameQuestions(
gameId INTEGER,
questionId INTEGER,
asked BOOL DEFAULT 0,
FOREIGN KEY(gameId) REFERENCES games(id),
FOREIGN KEY(questionId) REFERENCES questions(id)
);
I'll explain the full steps that I'm doing, and then I'll ask for input.
I need to...
1. Using a games.id value, look up the games.questionCount and games.level for that game.
2. Since I now have games.questionCount and games.level, I need to look at all of the rows in the questions table with questions.level = games.level and select games.questionCount of them at random.
3. With the rows (aka questions) I got from step 2, I need to put them into the gameQuestions table using the games.id value and the questions.id values.
What's the best way to accomplish this? I could do it with several different SQL queries, but I feel like someone really skilled with SQL could make it happen more efficiently. I am using sqlite3.
This does it in one statement. Let's assume :game_id is the id of the game you want to process.
insert into gameQuestions (gameId, questionId)
select :game_id, id
from questions
where level = (select level from games where id = :game_id)
order by random()
limit (select questionCount from games where id = :game_id);
@Tony: the SQLite docs say LIMIT takes an expression. The above statement works fine using SQLite 3.8.0.2 and produces the desired results. I have not tested an older version.
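As a hedged usage sketch (the game id 42 is an arbitrary example), the rows inserted above can then be fetched back with a join:
-- Fetch the questions assigned to game 42 that have not yet been asked
SELECT q.id, q.question
FROM gameQuestions gq
JOIN questions q ON q.id = gq.questionId
WHERE gq.gameId = 42
  AND gq.asked = 0;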

SQL: Use values stored in table_1 to populate a column in table_2 based on a variable listed in table_2

I am very new to SQL, taking a class in school, but it's very basic. I have 2 tables I want to link together somehow, but I don't know where to start. I have a month table that lists employees and completed service calls for that month. Each employee is paid differently per call and by the type of call, so I have a commissions table that lists each employee and the types of service calls, where the data in the columns are the dollar amounts each employee makes for each different type of call. I want to link the employee IDs in each table so that I can do something like this...
SELECT sum(TypeOfCall) as "Total Commission"
from December
where TypeOfCall='abc' and EmployeeID='John';
But the data stored in the TypeOfCall column of the month table is a code like 'abc' or 'cdf'; each code is listed and given a value in the commissions table. How can I get a sum for the TypeOfCall column in the month table using the values listed in the commissions table?
The actual tables, columns, and types are as follows:
CREATE TABLE "December" (
"EmployeeID" INTEGER NOT NULL,
"InvoiceType" VARCHAR (6),
"DR" VARCHAR (8),
"TypeOfCall" VARCHAR (6),
"CommissionType" VARCHAR (6),
"Date DD/MM/YY" VARCHAR (10),
"InvoiceNumber" INTEGER,
"InvoiceAmount" FLOAT (6),
"KeyCode" VARCHAR(20)
)
and...
CREATE TABLE "Commissions" (
"EmployeeID" VARCHAR(25) PRIMARY KEY NOT NULL UNIQUE,
"T3" INTEGER NOT NULL,
"T5" INTEGER NOT NULL,
"T7" INTEGER NOT NULL,
"7B" INTEGER NOT NULL,
"Other10" INTEGER NOT NULL,
"Other12" INTEGER NOT NULL,
"Other13" INTEGER NOT NULL,
"Other14" INTEGER NOT NULL,
"Other15" INTEGER NOT NULL
)
What you really need is SUM(CommissionsTable.value) (substitute the correct column name) with an INNER JOIN between the month's table and CommissionsTable.
SELECT
December.TypeOfCall,
SUM(CommissionsTable.value) AS "Total Commission"
FROM
December
/* Substitute the correct column name for CommissionsTable.TypeOfCall */
INNER JOIN CommissionsTable ON December.TypeOfCall = CommissionsTable.TypeOfCall
WHERE
December.TypeOfCall = 'abc'
AND December.EmployeeID = 'John'
/* GROUP BY needed if you were retrieving more than just 'abc' */
GROUP BY December.TypeOfCall
It is not appropriate to have a separate table for each month, however. Instead, your table listing calls should include a date value for each call, along with its type and EmployeeID. Something like:
CREATE TABLE calls (
EmployeeID VARCHAR(32) NOT NULL,
TypeOfCall CHAR(3) NOT NULL,
CallDate INTEGER NOT NULL
)
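With that shape, and assuming the commission amounts are likewise normalized into one row per (EmployeeID, TypeOfCall) pair (a hypothetical commission_rates table, not the wide Commissions table above), the total becomes a plain join:
-- Hypothetical normalized rates table: one row per employee and call type
CREATE TABLE commission_rates (
    EmployeeID VARCHAR(32) NOT NULL,
    TypeOfCall CHAR(3) NOT NULL,
    amount FLOAT NOT NULL
);
-- Total commission for one employee
-- (add a CallDate range predicate to restrict to one month)
SELECT SUM(r.amount) AS "Total Commission"
FROM calls c
INNER JOIN commission_rates r
    ON r.EmployeeID = c.EmployeeID
   AND r.TypeOfCall = c.TypeOfCall
WHERE c.EmployeeID = 'John';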
The calls table can hold its records in one table or several, depending on the number of records and how long the data is retained; if the row count is very large, partitioning or multiple tables are a good choice.
Creating an index on December.TypeOfCall may also improve performance.
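A minimal sketch of that index (the index name is an assumption):
CREATE INDEX idx_december_typeofcall ON "December" ("TypeOfCall");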

Index Guidance for SQL Query

Does anyone have guidance on how to approach building indexes for the following query? The query works as expected, but I can't seem to get around full table scans. I'm working with Oracle 11g.
SELECT v.volume_id
FROM ( SELECT MIN (usv.volume_id) volume_id
FROM user_stage_volume usv
WHERE usv.status = 'NEW'
AND NOT EXISTS
(SELECT 1
FROM user_stage_volume kusv
WHERE kusv.deal_num = usv.deal_num
AND kusv.locked = 'Y')
GROUP BY usv.deal_num, usv.volume_type
ORDER BY MAX (usv.priority) DESC, MIN (usv.last_update) ASC) v
WHERE ROWNUM = 1;
Please request any more info you may need in comments and I'll edit.
Here is the create script for the table. The PK is VOLUME_ID. DEAL_NUM is not unique.
CREATE TABLE ENDUR.USER_STAGE_VOLUME
(
DEAL_NUM NUMBER(38) NOT NULL,
EXTERNAL_ID NUMBER(38) NOT NULL,
VOLUME_TYPE NUMBER(38) NOT NULL,
EXTERNAL_TYPE VARCHAR2(100 BYTE) NOT NULL,
GMT_START DATE NOT NULL,
GMT_END DATE NOT NULL,
VALUE FLOAT(126) NOT NULL,
VOLUME_ID NUMBER(38) NOT NULL,
PRIORITY INTEGER NOT NULL,
STATUS VARCHAR2(100 BYTE) NOT NULL,
LAST_UPDATE DATE NOT NULL,
LOCKED CHAR(1 BYTE) NOT NULL,
RETRY_COUNT INTEGER DEFAULT 0 NOT NULL,
INS_DATE DATE NOT NULL
)
ALTER TABLE ENDUR.USER_STAGE_VOLUME ADD (PRIMARY KEY (VOLUME_ID));
An index on (deal_num) would help the subquery greatly. In fact, an index on (deal_num, locked) would allow the subquery to avoid the table itself altogether.
You should expect a full table scan on the main query, as it filters on status which is not indexed (and most likely would not benefit from being indexed, unless 'NEW' is a fairly rare value for status).
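A minimal sketch of the suggested index (the index name is an assumption):
-- Covers both the correlation column and the filter, so the NOT EXISTS
-- probe can be answered from the index alone
CREATE INDEX idx_usv_deal_num_locked ON ENDUR.USER_STAGE_VOLUME (deal_num, locked);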
I think it's running your inner subquery (inside NOT EXISTS ...) once for every row of the outer query.
That will be where performance takes a hit: it will run through all of user_stage_volume for each row in user_stage_volume, which is O(n^2), n being the number of rows in usv.
An alternative would be to create a view for the inner subquery and use that view, or to define it as a named subquery using a WITH clause.
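A hedged sketch of the WITH-clause variant (untested; since deal_num is declared NOT NULL, the NOT IN here is intended to be equivalent to the original NOT EXISTS):
WITH locked_deals AS (
    SELECT DISTINCT deal_num
    FROM user_stage_volume
    WHERE locked = 'Y'
)
SELECT v.volume_id
FROM ( SELECT MIN (usv.volume_id) volume_id
       FROM user_stage_volume usv
       WHERE usv.status = 'NEW'
         AND usv.deal_num NOT IN (SELECT deal_num FROM locked_deals)
       GROUP BY usv.deal_num, usv.volume_type
       ORDER BY MAX (usv.priority) DESC, MIN (usv.last_update) ASC) v
WHERE ROWNUM = 1;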