Display another field in the referenced table for multiple columns with performance issues in mind - sql

I have an edge table like this:
-------------------------------
| id | arg1 | relation | arg2 |
-------------------------------
| 1 | 1 | 3 | 4 |
-------------------------------
| 2 | 2 | 6 | 5 |
-------------------------------
where arg1, relation and arg2 reference the ids of rows in a separate object table:
--------------------
| id | object_name |
--------------------
| 1 | book |
--------------------
| 2 | pen |
--------------------
| 3 | on |
--------------------
| 4 | table |
--------------------
| 5 | bag |
--------------------
| 6 | in |
--------------------
With performance in mind (edge is a very big table with more than 50 million rows), I want to display the object_name for each edge entry rather than the id, such as:
---------------------------
| arg1 | relation | arg2 |
---------------------------
| book | on | table |
---------------------------
| pen | in | bag |
---------------------------
What is the best SELECT query to do this? I am also open to suggestions for optimizing the query, such as adding more indexes to the tables.
EDIT: Based on the comments below:
1) @Craig Ringer: PostgreSQL version is 8.4.13, and the only index is on id in both tables.
2) @andrefsp: edge is almost twice as big as object.

If you can change the structure of the database, you could denormalize this part: give table edge the fields id, arg1_name, relation_name, arg2_name, and keep table object unchanged as the source of names whenever you insert into or update edge.
This has real drawbacks. The names are duplicated (so the database gets bigger), and inserts and updates become harder to keep consistent.
But selecting should be fast (no JOINs):
SELECT arg1_name, relation_name, arg2_name
FROM edge;

It won't get cheaper than this:
SELECT o1.object_name, r.object_name, o2.object_name
FROM edge e
JOIN object o1 ON o1.id = e.arg1
JOIN object r ON r.id = e.relation
JOIN object o2 ON o2.id = e.arg2;
And you don't need more indexes. The one on object.id is the only one needed for this query.
But I seriously doubt that you want to retrieve 50 million rows at once, and in no particular order. You still didn't give the full picture.
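The three-way join on object can be tried on the sample data from the question. This is a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL; the column aliases and the ORDER BY are additions for readable, deterministic output and are not part of the original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE object (id INTEGER PRIMARY KEY, object_name TEXT NOT NULL);
    CREATE TABLE edge (
        id       INTEGER PRIMARY KEY,
        arg1     INTEGER NOT NULL REFERENCES object(id),
        relation INTEGER NOT NULL REFERENCES object(id),
        arg2     INTEGER NOT NULL REFERENCES object(id)
    );
    INSERT INTO object VALUES
        (1, 'book'), (2, 'pen'), (3, 'on'), (4, 'table'), (5, 'bag'), (6, 'in');
    INSERT INTO edge VALUES (1, 1, 3, 4), (2, 2, 6, 5);
""")

# One join per referencing column; each join is a primary-key lookup on object.id.
rows = conn.execute("""
    SELECT o1.object_name AS arg1, r.object_name AS relation, o2.object_name AS arg2
    FROM edge e
    JOIN object o1 ON o1.id = e.arg1
    JOIN object r  ON r.id  = e.relation
    JOIN object o2 ON o2.id = e.arg2
    ORDER BY e.id
""").fetchall()

for row in rows:
    print(row)
```

Each of the three joins resolves through the primary key of object, which is why no further index is needed for this query.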

Related

Sql Server how to find values in different tables that have different suffix

I'm struggling to find a value that might be in different tables but using UNION is a pain as there are a lot of tables.
[Different table that contains the suffixes from the TestTable_]
| ID | Name|
| -------- | -----------|
| 1 | TestTable1 |
| 2 | TestTable2 |
| 3 | TestTable3 |
| 4 | TestTable4 |
TestTable1 content:
| id | Name | q1 | a1 |
| -------- | ---------------------------------------- |
| 1 | goose | withFeather? |featherID |
| 2 | rooster| withoutFeather?|shinyfeatherID |
| 3 | rooster| age | 20 |
TestTable2 content:
| id | Name | q1 | a1 |
| -------- | ---------------------------------------------------|
| 1 | brazilian_goose | withFeather? |featherID |
| 2 | annoying_rooster | withoutFeather?|shinyfeatherID |
| 3 | annoying_rooster | no_legs? |dead |
TestTable3 content:
| id | Name | q1 | a1 |
| -------- | ---------------------------------------- |
| 1 | goose | withFeather? |featherID |
| 2 | rooster| withoutFeather?|shinyfeatherID |
| 3 | rooster| age | 15 |
Common columns: q1 and a1
Is there a way to go through all of them and look up a specific value without using UNION, given that some of them might have different columns?
Something like: check if "q1='age'" exists in all those tables (from 1 to 50)
Select q1,*
from (something)
where q1 exists in (TestTable_*)... or something like that.
If not possible, not a problem.
You could use dynamic SQL. But when I have a list of tables that I want to quickly perform the same action on, I paste the list into a spreadsheet, type a query into a cell with a placeholder such as #table, and then use the SUBSTITUTE function to replace the placeholder with each table name.
Alternatively, I paste the list into SSMS and use SHIFT+ALT+ArrowKey to make a column (block) selection across all the rows. Whatever I type then goes into every selected row. I then go to the other side of the table names and repeat the action.
It's not a perfect solution, but it's a quick and dirty way of doing something repetitive.
If you want to find all the tables with that column name you can use information schema.
Select table_name from INFORMATION_SCHEMA.COLUMNS where COLUMN_NAME = 'q1'
Given the type of solution you are after I can offer a method that I've had to use on legacy systems.
You can query sys.columns for the name of the column(s) you need to find in N tables and join using object_id to sys.tables where type='U'. This will give you a list of table names.
From this list you can then build a working query for each table and, depending on your requirements (is this ad hoc?), either execute it manually yourself or build a procedure that will do it for you using sp_executesql.
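E.g.
select t.name as table_name, c.name as column_name
into #workingtable
from sys.columns c
join sys.tables t on t.object_id=c.object_id
where c.name in .....
Then loop over the working table (a sketch; adapt the query to each table and its columns):
declare @table sysname, @sql nvarchar(max);
while exists (select 1 from #workingtable)
begin
    select top (1) @table = table_name from #workingtable;
    -- your query specific to that table and column(s)
    set @sql = N'select * from ' + quotename(@table) + N' where q1 = ''age''';
    exec sp_executesql @sql;  -- wrap in try/catch as necessary
    delete from #workingtable where table_name = @table;
end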
Hopefully that gives you some ideas for how you might implement your requirements.
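The same "find the tables that have the column, then query each one" loop can be sketched outside SQL Server. The version below uses Python's sqlite3 module, so sqlite_master and PRAGMA table_info stand in for sys.tables and sys.columns; the helper names tables_with_column and find_value, the extra column on TestTable2, and the Unrelated table are hypothetical and only there for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TestTable1 (id INTEGER, Name TEXT, q1 TEXT, a1 TEXT);
    CREATE TABLE TestTable2 (id INTEGER, Name TEXT, q1 TEXT, a1 TEXT, extra TEXT);
    CREATE TABLE TestTable3 (id INTEGER, Name TEXT, q1 TEXT, a1 TEXT);
    CREATE TABLE Unrelated  (id INTEGER, note TEXT);
    INSERT INTO TestTable1 VALUES (3, 'rooster', 'age', '20');
    INSERT INTO TestTable2 VALUES (3, 'annoying_rooster', 'no_legs?', 'dead', NULL);
    INSERT INTO TestTable3 VALUES (3, 'rooster', 'age', '15');
""")

def tables_with_column(conn, column):
    """Return the names of all tables that have a column called `column`."""
    names = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    found = []
    for name in names:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        cols = [r[1] for r in conn.execute(f'PRAGMA table_info("{name}")')]
        if column in cols:
            found.append(name)
    return found

def find_value(conn, column, value):
    """Run the same lookup against every table that has `column`."""
    hits = []
    for table in tables_with_column(conn, column):
        # Identifiers come from the catalog, the searched value is a bound parameter.
        sql = f'SELECT a1 FROM "{table}" WHERE "{column}" = ?'
        for (a1,) in conn.execute(sql, (value,)):
            hits.append((table, a1))
    return hits

hits = find_value(conn, "q1", "age")
print(hits)
```

Selecting only the shared columns (here a1) sidesteps the problem that the tables do not all have the same shape.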

Filling information from same and other tables in SQL

For my further work I need to create a lookup table where all the different IDs my data has (because of different sources) are noted.
It has to look like this:
Lookup_Table:
| Name | ID_source1 | ID_source2 | ID_source3 |
-----------------------------------------------
| John | EMP_992 | AKK81239K | inv1000003 |
Note, that Name and ID_Source1 are coming from the same table. The other IDs are coming from different tables. They share the same name value, so e.g. source 2 looks like this:
Source2 Table:
| Name | ID |
--------------------
| John | AKK81239K |
What is the SQL code to accomplish this? I'm using Access, and it doesn't seem to work with this code for source 2:
INSERT INTO Lookup_Table ([ID_Source2])
SELECT [Source2].[ID]
FROM Lookup_Table LEFT JOIN [Source2]
ON [Lookup_Table].[Name] = [Source2].[Name]
It just adds the ID from Source2 in a new row:
| Name | ID_source1 | ID_source2 | ID_source3 |
-----------------------------------------------
| John | EMP_992 | | |
| | | AKK81239K | |
Hope you guys can help me.
You're looking for an UPDATE query, not an INSERT query.
An UPDATE query updates existing records. An INSERT query inserts new records into a table.
UPDATE Lookup_Table
INNER JOIN [Source2] ON [Lookup_Table].[Name] = [Source2].[Name]
SET [ID_Source2] = [Source2].[ID]
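To see the difference on the sample data, here is a minimal sketch using Python's sqlite3 module. SQLite does not accept the Access-style UPDATE ... INNER JOIN form shown above, so the sketch expresses the same update with a correlated subquery; treat it as an illustration of UPDATE versus INSERT rather than as Access syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Lookup_Table (Name TEXT, ID_source1 TEXT, ID_source2 TEXT, ID_source3 TEXT);
    CREATE TABLE Source2 (Name TEXT, ID TEXT);
    INSERT INTO Lookup_Table (Name, ID_source1) VALUES ('John', 'EMP_992');
    INSERT INTO Source2 VALUES ('John', 'AKK81239K');
""")

# Fill ID_source2 on the existing row instead of inserting a new one.
# The EXISTS guard leaves rows without a match in Source2 untouched.
conn.execute("""
    UPDATE Lookup_Table
    SET ID_source2 = (SELECT s.ID FROM Source2 AS s WHERE s.Name = Lookup_Table.Name)
    WHERE EXISTS (SELECT 1 FROM Source2 AS s WHERE s.Name = Lookup_Table.Name)
""")

rows = conn.execute("SELECT * FROM Lookup_Table").fetchall()
print(rows)
```

The table still has a single row for John, now with ID_source2 filled in and ID_source3 still empty.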

BigQuery Match Table Lookup for DCM Data Transfer

With DCM's Data Transfer v2 you get 3 main tables of data in GCS:
p_activity_166401
p_click_166401
p_impression_166401
Along with a plethora of match tables like:
p_match_table_advertisers_166401
p_match_table_campaigns_166401
Table 1: p_activity_166401
Row | Event_time | User_ID | Advertiser_ID | Campaign_ID |
------ | ------------- | ------- | ------------- | ----------- |
1 | 149423090566 | AMsySZa | 5487307 | 9638421 |
2 | 149424804284 | 2vmdsXS | 5487307 | 10498283 |
Table 2: p_match_table_advertisers_166401
Row | Advertiser_ID | Advertiser |
------ | ------------- | ----------- |
1 | 5487307 | Company A |
2 | 5487457 | Company B |
How do I reference a value from Table 1 in Table 2 and return the value from Table 2 in a query?
I'd like a result like:
Row | Advertiser | User_ID |
------ | ---------- | ----------- |
1 | Company A | AMsySZa |
2 | Company A | 2vmdsXS |
I've been searching around here and online and I just can't seem to find a clear reference on how to do lookups across tables. Apologies in advance if this is a really simple thing I'm missing :)
EDIT
So with a nudge in the right direction I have found the JOIN function...
SELECT
*
FROM
[dtftv2_sprt.p_activity_166401]
INNER JOIN
[dtftv2_sprt.p_match_table_advertisers_166401]
ON
[p_activity_166401.Advertiser_ID] =
p_match_table_advertisers_166401.Advertiser_ID]
LIMIT
100;
Error: Field 'p_activity_166401.Advertiser_ID' not found.
That is definitely a field in the table.
So this query works great in creating a view with all the data in it.
SELECT
*
FROM
[dtftv2_sprt.p_activity_166401]
INNER JOIN
[dtftv2_sprt.p_match_table_advertisers_166401]
ON
dtftv2_sprt.p_activity_166401.Advertiser_ID = dtftv2_sprt.p_match_table_advertisers_166401.Advertiser_ID;
Using the view I can now run smaller queries to pull the data I want out. Thanks for guiding me in the right direction Mikhail Berlyant.
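For the exact result asked for above (Advertiser next to User_ID), the join can select just those two columns. This is a sketch on the sample rows using Python's sqlite3 module rather than BigQuery, so the dataset prefix and the bracket quoting are omitted; the short aliases a and m are an addition that keeps the ON clause readable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE p_activity_166401 (
        Event_time INTEGER, User_ID TEXT, Advertiser_ID INTEGER, Campaign_ID INTEGER);
    CREATE TABLE p_match_table_advertisers_166401 (
        Advertiser_ID INTEGER, Advertiser TEXT);
    INSERT INTO p_activity_166401 VALUES
        (149423090566, 'AMsySZa', 5487307, 9638421),
        (149424804284, '2vmdsXS', 5487307, 10498283);
    INSERT INTO p_match_table_advertisers_166401 VALUES
        (5487307, 'Company A'), (5487457, 'Company B');
""")

# Alias each table once, then qualify columns with the alias.
rows = conn.execute("""
    SELECT m.Advertiser, a.User_ID
    FROM p_activity_166401 AS a
    INNER JOIN p_match_table_advertisers_166401 AS m
        ON a.Advertiser_ID = m.Advertiser_ID
    ORDER BY a.Event_time
""").fetchall()
print(rows)
```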

1 to Many Query: Help Filtering Results

Problem: SQL Query that looks at the values in the "Many" relationship, and doesn't return values from the "1" relationship.
Tables Example: (this shows two different tables).
+---------------+----------------------------+-------+
| Unique Number | <-- Table 1 -- Table 2 --> | Roles |
+---------------+----------------------------+-------+
| 1 | | A |
| 2 | | B |
| 3 | | C |
| 4 | | D |
| 5 | | |
| 6 | | |
| 7 | | |
| 8 | | |
| 9 | | |
| 10 | | |
+---------------+----------------------------+-------+
When I run my query, I get multiple, unique numbers that show all of the roles associated to each number like so.
+---------------+-------+
| Unique Number | Roles |
+---------------+-------+
| 1 | C |
| 1 | D |
| 2 | A |
| 2 | B |
| 3 | A |
| 3 | B |
| 4 | C |
| 4 | A |
| 5 | B |
| 5 | C |
| 5 | D |
| 6 | D |
| 6 | A |
+---------------+-------+
I would like to be able to run my query and be able to say, "When the role of A is present, don't even show me the unique numbers that have the role of A".
Maybe if SQL could look at the roles and say, WHEN role A comes up, grab unique number and remove it from column 1.
Based on what I would "like" to happen (I put that in quotations as this might not even be possible) the following is what I would expect my query to return.
+---------------+-------+
| Unique Number | Roles |
+---------------+-------+
| 1 | C |
| 1 | D |
| 5 | B |
| 5 | C |
| 5 | D |
+---------------+-------+
UPDATE:
Query Example: I am querying 8 tables, but I condensed it to 4 for simplicity.
SELECT
c.UniqueNumber,
cp.pType,
p.pRole,
a.aRole
FROM c
JOIN cp ON cp.uniqueVal = c.uniqueVal
JOIN p ON p.uniqueVal = cp.uniqueVal
LEFT OUTER JOIN a ON a.uniqueVal = p.uniqueVal
WHERE
--I do some basic filtering to get to the relevant clients data but nothing more than that.
ORDER BY
c.uniqueNumber
Table sizes: these tables can have anywhere from 50,000 rows to 500,000+
Pretending the table name is t and the column names are alpha and numb:
SELECT t.numb, t.alpha
FROM t
LEFT JOIN t AS s ON t.numb = s.numb
AND s.alpha = 'A'
WHERE s.numb IS NULL;
You can also do a subselect:
SELECT numb, alpha
FROM t
WHERE numb NOT IN (SELECT numb FROM t WHERE alpha = 'A');
Or one of the following if the subselect is materialized more than once (pick the one that is faster, i.e., the one with the smaller subtable size). Both rely on the comparison alpha = 'A' evaluating to 0 or 1, as it does in MySQL:
SELECT t.numb, t.alpha
FROM t
JOIN (SELECT numb FROM t GROUP BY numb HAVING SUM(alpha = 'A') = 0) AS s USING (numb);
SELECT t.numb, t.alpha
FROM t
LEFT JOIN (SELECT numb FROM t GROUP BY numb HAVING SUM(alpha = 'A') > 0) AS s USING (numb)
WHERE s.numb IS NULL;
But the first one is probably faster and better[1]. Any of these methods can be folded into a larger query with multiple additional tables being joined in.
[1] Straight joins tend to be easier to read and faster to execute than queries involving subselects. The usual exceptions require a large mismatch in the size of the tables, so they are rare for self-referential joins. You might still hit one if the number of rows that reference the 'A' alpha value is exceptionally small and the column is indexed properly.
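The anti-join and the grouped variant can be checked against the sample data from the question. This is a minimal sketch with Python's sqlite3 module; the grouped query uses SUM(CASE WHEN ...) instead of SUM(alpha = 'A'), a substitution that does not depend on comparisons evaluating to 0 or 1, and the ORDER BY is added for deterministic output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (numb INTEGER NOT NULL, alpha TEXT NOT NULL)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [
    (1, "C"), (1, "D"), (2, "A"), (2, "B"), (3, "A"), (3, "B"), (4, "C"),
    (4, "A"), (5, "B"), (5, "C"), (5, "D"), (6, "D"), (6, "A"),
])

# Self anti-join: keep rows whose numb has no partner row with alpha = 'A'.
anti_join = conn.execute("""
    SELECT t.numb, t.alpha
    FROM t
    LEFT JOIN t AS s ON t.numb = s.numb AND s.alpha = 'A'
    WHERE s.numb IS NULL
    ORDER BY t.numb, t.alpha
""").fetchall()

# Grouped variant: first find the numbs with zero 'A' rows, then join back.
grouped = conn.execute("""
    SELECT t.numb, t.alpha
    FROM t
    JOIN (SELECT numb
          FROM t
          GROUP BY numb
          HAVING SUM(CASE WHEN alpha = 'A' THEN 1 ELSE 0 END) = 0) AS s
      ON s.numb = t.numb
    ORDER BY t.numb, t.alpha
""").fetchall()

print(anti_join)
print(grouped)
```

Both queries return only the rows for unique numbers 1 and 5, matching the expected result in the question.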
There are many ways to do it, and the trade-offs depend on factors such as the size of the tables involved and what indexes are available. On general principles, my first instinct is to avoid a correlated subquery such as another, now-deleted answer proposed, but if the relationship table is small then it probably doesn't matter.
This version instead uses an uncorrelated subquery in the where clause, in conjunction with the not in operator:
select num, role
from one_to_many
where num not in (select otm2.num from one_to_many otm2 where otm2.role = 'A')
That form might be particularly effective if there are many rows in one_to_many, but only a small proportion have role A. Of course you can add an order by clause if the order in which result rows are returned is important.
There are also alternatives involving joining inline views or CTEs, and some of those might have advantages under particular circumstances.
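As one possible shape for the CTE alternative, the sketch below names the set of numbers that have role A once and then anti-joins against it, and compares the result with the NOT IN form. It uses Python's sqlite3 module and the sample data from the question; the CTE name with_a is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE one_to_many (num INTEGER NOT NULL, role TEXT NOT NULL)")
conn.executemany("INSERT INTO one_to_many VALUES (?, ?)", [
    (1, "C"), (1, "D"), (2, "A"), (2, "B"), (3, "A"), (3, "B"), (4, "C"),
    (4, "A"), (5, "B"), (5, "C"), (5, "D"), (6, "D"), (6, "A"),
])

# Uncorrelated subquery with NOT IN, as in the answer.
not_in = conn.execute("""
    SELECT num, role
    FROM one_to_many
    WHERE num NOT IN (SELECT otm2.num FROM one_to_many otm2 WHERE otm2.role = 'A')
    ORDER BY num, role
""").fetchall()

# CTE variant: name the excluded set, then keep rows with no match in it.
with_cte = conn.execute("""
    WITH with_a AS (
        SELECT DISTINCT num FROM one_to_many WHERE role = 'A'
    )
    SELECT o.num, o.role
    FROM one_to_many AS o
    LEFT JOIN with_a AS w ON w.num = o.num
    WHERE w.num IS NULL
    ORDER BY o.num, o.role
""").fetchall()

print(not_in)
print(with_cte)
```

Which form is faster depends on the database and its planner, so it is worth comparing execution plans on the real tables.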

Rolling id based on foreign key in a hierarchical schema

As an example, consider this hierarchical schema.
Assume all id fields are auto incrementing primary keys and that foreign keys are named by [parent_table_name]_id convention.
The problem
As soon as there are multiple companies in the database, then companies will share all primary key sequences between them.
For example, if there are two company rows, the customer_group table could look like this
| id | company_id |
-------------------
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 2 |
| 5 | 1 |
-------------------
But it should look like this
| id | company_id |
-------------------
| 1 | 1 |
| 2 | 1 |
| 1 | 2 |
| 2 | 2 |
| 3 | 1 |
-------------------
This behavior should also be exhibited for customer and any other table in the tree that directly or indirectly references company.
Note that I will most likely make a second id column (named something like relative_id) for this purpose, keeping the unique id column intact, as this is really mostly for display purposes and how users will reference these data entities.
Now if this was just one level of hierarchy, it would be a relatively simple solution.
I could make a table (table_name, company_id, current_id) and a trigger procedure that fires before insert on any of the tables, incrementing the current id by 1 and setting the row's relative_id to that value.
It's trivial when the company_id is right there in the insert query.
But how about the tables that don't reference company directly?
Like the lowest level of the hierarchy in this example, workorder, which only references customer.
Is there a clean, reusable solution to climb the ladder all the way from 'customer_id' to ultimately retrieve the parenting company_id?
Going recursively up the hierarchy with SELECTs on each INSERT doesn't sound too appealing to me, performance wise.
I also do not like the idea of just adding a foreign key to company to each of these tables; the schema would get uglier with each additional table.
These are the only two solutions I can see, but I may not be looking in the right places.
The company shouldn't care what the primary key is if you're using generated keys. They're supposed to be meaningless; compared for equality and nothing else. I grumbled about this earlier, so I'm really glad to see you write:
Note that I will most likely make a second id column (named something
like relative_id) for this purpose, keeping the unique id column
intact, as this is really mostly for display purposes and how users
will reference these data entities.
You're doing it right.
Most of the time it doesn't matter what the ID is, so you can just give them whatever comes out of a sequence and not care about holes/gaps. If you're concerned about inter-company leakage (unlikely) you can obfuscate the IDs by using the sequence as an input to a pseudo-random generator. See the function Daniel Verité wrote in response to my question about this a few years ago, pseudo_encrypt.
There are often specific purposes for which you need perfectly sequential gapless IDs, like invoice numbers. For those you need to use a counter table and - yes - look up the company ID. Such ID generation is slow and has terrible concurrency anyway, so an additional SELECT with a JOIN or two on indexed keys won't hurt much. Don't go recursively up the schema with SELECTs though, just use a series of JOINs. For example, for an insert into workorder your key generation trigger on workorder would be something like this (untested):
CREATE OR REPLACE FUNCTION workorder_id_tgfn() RETURNS trigger AS $$
BEGIN
IF tg_op = 'INSERT' THEN
-- Get a new ID, locking the row so no other transaction can add a
-- workorder until this one commits or rolls back.
UPDATE workorder_ids
SET next_workorder_id = next_workorder_id + 1
WHERE company_id = (SELECT company_id
FROM customer
INNER JOIN customer_group ON (customer.customer_group_id = customer_group.id)
INNER JOIN company ON (customer_group.company_id = company.id)
WHERE customer.id = NEW.customer_id)
RETURNING next_workorder_id
INTO NEW.id;
END IF;
-- Return the (modified) row; a trigger function must end with RETURN.
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
For the UPDATE ... RETURNING ... INTO syntax see Executing a Query with a Single-Row Result.
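The counter-table idea can also be sketched at application level. The example below uses Python's sqlite3 module instead of a PL/pgSQL trigger, resolves the company through a join, and bumps a per-company counter inside one transaction. The add_workorder helper, the relative_id column on workorder, and the sample ids are assumptions for illustration; the table and column names otherwise follow the trigger above.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# so transactions are controlled explicitly with BEGIN/COMMIT.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE company (id INTEGER PRIMARY KEY);
    CREATE TABLE customer_group (
        id INTEGER PRIMARY KEY,
        company_id INTEGER NOT NULL REFERENCES company(id));
    CREATE TABLE customer (
        id INTEGER PRIMARY KEY,
        customer_group_id INTEGER NOT NULL REFERENCES customer_group(id));
    CREATE TABLE workorder (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        relative_id INTEGER NOT NULL);
    CREATE TABLE workorder_ids (
        company_id INTEGER PRIMARY KEY REFERENCES company(id),
        next_workorder_id INTEGER NOT NULL DEFAULT 0);
    INSERT INTO company VALUES (1), (2);
    INSERT INTO customer_group VALUES (1, 1), (2, 2);
    INSERT INTO customer VALUES (10, 1), (20, 2);
    INSERT INTO workorder_ids (company_id) VALUES (1), (2);
""")

def add_workorder(conn, customer_id):
    """Insert a workorder and return (company_id, relative_id)."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading the counter
    try:
        # Climb from customer to company with a join, not one SELECT per level.
        (company_id,) = conn.execute("""
            SELECT cg.company_id
            FROM customer AS c
            JOIN customer_group AS cg ON cg.id = c.customer_group_id
            WHERE c.id = ?
        """, (customer_id,)).fetchone()
        conn.execute(
            "UPDATE workorder_ids SET next_workorder_id = next_workorder_id + 1 "
            "WHERE company_id = ?", (company_id,))
        (relative_id,) = conn.execute(
            "SELECT next_workorder_id FROM workorder_ids WHERE company_id = ?",
            (company_id,)).fetchone()
        conn.execute(
            "INSERT INTO workorder (customer_id, relative_id) VALUES (?, ?)",
            (customer_id, relative_id))
        conn.execute("COMMIT")
        return company_id, relative_id
    except Exception:
        conn.execute("ROLLBACK")  # a failed insert does not consume a number
        raise

allocated = [add_workorder(conn, cid) for cid in (10, 10, 20, 20, 10)]
print(allocated)
```

Because the counter update and the insert commit or roll back together, the per-company numbers stay gapless, at the cost of serializing inserts for the same company.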
There can be gaps in normal sequences even if there's no multi-company problem. Observe:
CREATE TABLE demo (id serial primary key, blah text);
BEGIN;
INSERT INTO demo(blah) values ('aa');
COMMIT;
BEGIN;
INSERT INTO demo(blah) values ('bb');
ROLLBACK;
BEGIN;
INSERT INTO demo(blah) values ('aa');
COMMIT;
SELECT * FROM demo;
Result:
regress=# SELECT * FROM demo;
id | blah
----+------
1 | aa
3 | aa
"But it should look like this"
| id | company_id |
-------------------
| 1 | 1 |
| 2 | 1 |
| 1 | 2 |
| 2 | 2 |
| 3 | 1 |
-------------------
I think it should not; I think you want a many-to-many relationship. The customer_group table:
| id | name |
-------------
| 1 | n1 |
| 2 | n2 |
| 3 | n3 |
-------------
And then the customer_group_company table:
| group_id | company_id |
-------------------------
| 1 | 1 |
| 2 | 1 |
| 1 | 2 |
| 2 | 2 |
| 3 | 1 |
-------------------------