array clustering with unique identifier for file datasets

I have a dataset in S3 with a bigint array column, and I want to filter rows efficiently based on the array values. I know a GIN index can do this for a SQL table, but I need a solution that works on an S3 dataset. I plan to assign a cluster id to each distinct combination of array elements (their cardinality is not huge, 2500 at most) and store it as a new column on which filters can later be applied.
Example,
Table A
+------+------+-----------+
| Col1 | Col2 | Col3 |
+------+------+-----------+
| 1 | 101 | [123,234] |
| 2 | 102 | [123] |
| 3 | 103 | [234,345] |
+------+------+-----------+
I am trying to add new column like,
Table B (column Col3 will be removed from actual schema)
+------+------+-----------+-----------+
| Col1 | Col2 | Col3 | Cid |
+------+------+-----------+-----------+
| 1 | 101 | [123,234] | 1 |
| 2 | 102 | [123] | 2 |
| 3 | 103 | [234,345] | 3 |
+------+------+-----------+-----------+
and there will be another table of mapping for col3 and Cid like,
Table C
+-----------+-----+
| Col3 | Cid |
+-----------+-----+
| [123,234] | 1 |
| [123] | 2 |
| [234,345] | 3 |
+-----------+-----+
A new entry is added to Table C whenever a new combination appears, and Table B is updated whenever an array element is added or removed. The goal is to filter records from Table A efficiently based on the values in the array column. A query like 123 = ANY(Col3) can be served as Cid IN (1, 2), and a query matching either 123 or 345 can be served as Cid IN (1, 2, 3).
Is there a better way to solve this problem?
I am also thinking of creating only the required combinations at runtime to limit their number. Is it a good idea to create the minimum set of combinations?

In Postgres, you can create the mapping table and then use a join to look up the ids:
create table array_dim as
select col3 as arr, row_number() over (order by min(col1)) as array_id
from a
group by col3;
You can then add the new column:
select a.*, ad.array_id
from a join
array_dim ad
on a.col3 = ad.arr;
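
For a file dataset where no Postgres instance is available, the same idea can be sketched outside the database. The following is a minimal Python illustration, not a definitive implementation: it assigns one id per distinct array and builds an element-to-Cid lookup, so that an "any of these values" filter turns into a set of Cids. The function names (build_cluster_ids, build_element_index, cids_matching_any) are hypothetical, and treating arrays as order-insensitive sets is an assumption.

```python
from collections import defaultdict


def build_cluster_ids(rows):
    """Assign one id per distinct array, numbered in first-seen order.

    Assumption: element order and duplicates inside an array do not matter.
    """
    combo_to_cid = {}
    tagged = []
    for col1, col2, arr in rows:
        key = tuple(sorted(set(arr)))
        # len() is evaluated before the insert, so new keys get the next id.
        cid = combo_to_cid.setdefault(key, len(combo_to_cid) + 1)
        tagged.append((col1, col2, cid))
    return tagged, combo_to_cid


def build_element_index(combo_to_cid):
    """Map each array element to the set of cluster ids that contain it."""
    index = defaultdict(set)
    for combo, cid in combo_to_cid.items():
        for element in combo:
            index[element].add(cid)
    return index


def cids_matching_any(index, wanted):
    """Cluster ids whose array contains at least one of the wanted values."""
    result = set()
    for element in wanted:
        result |= index.get(element, set())
    return result


# Sample rows from Table A: (Col1, Col2, Col3)
table_a = [(1, 101, [123, 234]), (2, 102, [123]), (3, 103, [234, 345])]
table_b, table_c = build_cluster_ids(table_a)
element_index = build_element_index(table_c)

print(table_b)
print(sorted(cids_matching_any(element_index, [123])))
```

With the sample rows, a filter on 123 resolves to Cids 1 and 2, which can then be applied as a plain Cid IN (...) predicate on the stored column.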

Related

How do I update a column from a table with data from another column from this same table?

I have a table "table1" like this:
+------+-------------+------+
| id   | barcode     | lot  |
+------+-------------+------+
| 0    | ABC-123-456 |      |
| 1    | ABC-123-654 |      |
| 2    | ABC-789-EFG |      |
| 3    | ABC-456-EFG |      |
+------+-------------+------+
I have to extract the number in the middle of the column "barcode", with a query like this:
SELECT SUBSTR(barcode, 5, 3) AS ToExtract FROM table1;
The result:
+-----------+
| ToExtract |
+-----------+
| 123 |
| 123 |
| 789 |
| 456 |
+-----------+
I then want to store this value in the column "lot".

Follow the general pattern:
UPDATE table_name
SET column1 = value1, column2 = value2, ...
WHERE condition;
In your case:
UPDATE table1
SET lot = SUBSTR(barcode, 5, 3)
WHERE condition; -- only if you need to restrict the rows

UPDATE table1 SET Lot = SUBSTR(barcode, 5, 3)
-- WHERE ...;
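
To try the statement on the sample data, here is a small sketch using Python's built-in sqlite3 module. Using SQLite is an assumption, since the question does not name the database; SQLite's substr is 1-based, like the SUBSTR used above.

```python
import sqlite3

# In-memory database holding the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, barcode TEXT, lot TEXT)")
conn.executemany(
    "INSERT INTO table1 (id, barcode) VALUES (?, ?)",
    [(0, "ABC-123-456"), (1, "ABC-123-654"), (2, "ABC-789-EFG"), (3, "ABC-456-EFG")],
)

# Copy characters 5 to 7 of barcode into lot for every row.
conn.execute("UPDATE table1 SET lot = substr(barcode, 5, 3)")

lots = [row[0] for row in conn.execute("SELECT lot FROM table1 ORDER BY id")]
print(lots)
```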

Many databases support generated (aka "virtual" or "computed") columns. These let you define a column as an expression. The syntax is something like this:
alter table table1 add column lot varchar(3) generated always as (SUBSTR(barcode, 5, 3))
Using a generated column has several advantages:
It is always up-to-date.
It generally does not occupy any space.
There is no overhead when creating the table (although there is overhead when querying the table).
I should note that the syntax varies a bit among databases. Some don't require the type specification. Some use just as instead of generated always as.

CREATE TABLE Table1(id INT,barcode varchar(255),lot varchar(255))
INSERT INTO Table1 VALUES (0,'ABC-123-456',NULL),(1,'ABC-123-654',NULL),(2,'ABC-789-EFG',NULL)
,(3,'ABC-456-EFG',NULL)
UPDATE a
SET a.lot = SUBSTRING(b.barcode, 5, 3)
FROM Table1 a
INNER JOIN Table1 b ON a.id=b.id
WHERE a.lot IS NULL
id | barcode | lot
-: | :---------- | :--
0 | ABC-123-456 | 123
1 | ABC-123-654 | 123
2 | ABC-789-EFG | 789
3 | ABC-456-EFG | 456

How to create INSERT query that adds sequence number in one table to another

I have a table sample_1 in a Postgres 10.7 database with some longitudinal research data and an ascending sequence number per key. I need to INSERT data from a staging table (sample_2) maintaining the sequence column accordingly.
Sequence numbers are 0-based. I assume I need a query that finds the greatest sequence number per key in sample_1 and adds it to each new row's follow-up sequence number. I'm mainly struggling with the sequence number arithmetic at this step. I tried this:
INSERT INTO sample_1 (KEY, SEQUENCE, DATA)
SELECT KEY, sample_2.SEQUENCE + max(sample_1.SEQUENCE), DATA
FROM sample_2;
However, I get errors saying I can't use 'sample_1.SEQUENCE' in line 2 because that's the table being inserted into. I can't figure out how to do the arithmetic with my insert sequence!
Sample data:
sample_1
| KEY | SEQUENCE | DATA |
+-------------+----------+------+
| YMH_0001_XX | 0 | a |
| YMH_0001_XX | 1 | b |
| YMH_0002_YY | 0 | c |
sample_2
| KEY | SEQUENCE | DATA |
+-------------+----------+------+
| YMH_0001_XX | 1 | d |
| YMH_0002_YY | 1 | e |
| YMH_0002_YY | 2 | f |
I want to continue ascending sequence numbers per key for inserted rows.
To be clear, the resultant table in this example would be 3 columns and 6 rows as such:
sample_1
| KEY | SEQUENCE | DATA |
+-------------+----------+------+
| YMH_0001_XX | 0 | a |
| YMH_0001_XX | 1 | b |
| YMH_0001_XX | 2 | d |
| YMH_0002_YY | 0 | c |
| YMH_0002_YY | 1 | e |
| YMH_0002_YY | 2 | f |

That should do what you are after:
INSERT INTO sample_1 (key, sequence, data)
SELECT s2.key
, COALESCE(s1.seq_base, -1)
+ row_number() OVER (PARTITION BY s2.key ORDER BY s2.sequence)
, s2.data
FROM sample_2 s2
LEFT JOIN (
SELECT key, max(sequence) AS seq_base
FROM sample_1
GROUP BY 1
) s1 USING (key);
Notes
You need to build on the existing maximum sequence per key in sample_1. (I named it seq_base.) Compute that in a subquery and join to it.
Add row_number() to it as demonstrated. That preserves the order of input rows, discarding absolute numbers.
We need the LEFT JOIN to avoid losing rows with new keys from sample_2.
Likewise, we need COALESCE to start a fresh sequence for new keys. Default to -1 to effectively start sequences with 0 after adding the 1-based row number.
This is not safe for concurrent execution, but I don't think that's your use case.
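
The query can be checked against the sample data with a short sketch. It runs on SQLite through Python's sqlite3 module rather than Postgres, which is an assumption made only so the example is self-contained; it needs SQLite 3.25 or later for window functions. The column name key is double-quoted because it is also an SQL keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample_1 ("key" TEXT, sequence INTEGER, data TEXT);
CREATE TABLE sample_2 ("key" TEXT, sequence INTEGER, data TEXT);
INSERT INTO sample_1 VALUES
    ('YMH_0001_XX', 0, 'a'), ('YMH_0001_XX', 1, 'b'), ('YMH_0002_YY', 0, 'c');
INSERT INTO sample_2 VALUES
    ('YMH_0001_XX', 1, 'd'), ('YMH_0002_YY', 1, 'e'), ('YMH_0002_YY', 2, 'f');
""")

# Existing maximum per key (seq_base) plus a 1-based row number per key.
# Keys that are new in sample_2 fall back to -1, so they start at 0.
conn.execute("""
INSERT INTO sample_1 ("key", sequence, data)
SELECT s2."key",
       COALESCE(s1.seq_base, -1)
         + row_number() OVER (PARTITION BY s2."key" ORDER BY s2.sequence),
       s2.data
FROM sample_2 s2
LEFT JOIN (
    SELECT "key", max(sequence) AS seq_base
    FROM sample_1
    GROUP BY "key"
) s1 USING ("key")
""")

result = conn.execute(
    'SELECT "key", sequence, data FROM sample_1 ORDER BY "key", sequence'
).fetchall()
for row in result:
    print(row)
```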

Join two tables returning all rows as single row from the second table

I want to get data in a single row from two tables that have a one-to-many relationship.
Primary table
Secondary table
I know that each record in the primary table can have at most 10 rows in the secondary table. Here is the structure of the tables:
Primary Table
-------------------------------------------------
| ImportRecordId | Summary |
--------------------------------------------------
| 1 | Imported Successfully |
| 2 | Failed |
| 3 | Imported Successfully |
-------------------------------------------------
Secondary table
------------------------------------------------------
| ImportRecordId | CodeName | CodeValue |
-------------------------------------------------------
| 1 | ABC | 123456A |
| 1 | DEF | 8766339 |
| 1 | GHI | 887790H |
------------------------------------------------------
I want to write a query with an inner join that returns data from both tables so that each row of the secondary table is shown as columns instead of as multiple rows.
I can hard-code 20 column names (at most 10 records can exist in the secondary table, and I want to display the values of two columns per record in a single row), so if there are fewer than 10 records in the secondary table, all remaining columns will be shown as NULL.
Here is the expected output. For the first record in the primary table there are only three rows, so the two required columns from these three rows are converted into columns and all other columns are NULL.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| ImportRecordId | Summary | CodeName1 | CodeValue1 | CodeName2 | CodeValue2 | CodeName3 | CodeValue3 | CodeName4 | CodeValue4| CodeName5 | CodeValue5| CodeName6 | CodeValue6| CodeName7 | CodeValue7 | CodeName8 | CodeValue8 | CodeName9 | CodeValue9 | CodeName10 | CodeValue10|
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 1 | Imported Successfully | ABC | 123456A | DEF | 8766339 | GHI | 887790H | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Here is my simple SQL query. It returns all data from both tables, but instead of multiple rows from the secondary table I want to get them in a single row like the result set above.
Select p.ImportRecordId,p.Summary,s.*
from [dbo].[primary_table] p
inner join [dbo].[secondary_table] s on p.ImportRecordId = s.ImportRecordId

The following uses Row_Number(), a JOIN and a CROSS APPLY to build the source of the PIVOT.
You'll have to add CodeName/CodeValue 4 through 10 to the PIVOT list.
Example
Select *
From (
Select S.[ImportRecordId]
,P.Summary
,C.*
From (
Select *
,RN = Row_Number() over (Partition by [ImportRecordId] Order by [CodeName])
From [dbo].[secondary_table]
) S
Join [dbo].[primary_table] P on S.[ImportRecordId]=P.[ImportRecordId]
Cross Apply (values (concat('CodeName' ,RN),CodeName)
,(concat('CodeValue',RN),CodeValue)
) C(Item,Value)
) src
Pivot (max(Value) for Item in (CodeName1,CodeValue1,CodeName2,CodeValue2,CodeName3,CodeValue3) ) pvt
Returns
ImportRecordId Summary CodeName1 CodeValue1 CodeName2 CodeValue2 CodeName3 CodeValue3
1 Imported Successfully ABC 123456A DEF 8766339 GHI 887790H
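
Where PIVOT is not available, the same shape can be produced with conditional aggregation. The sketch below swaps that technique in and runs it on SQLite through Python's sqlite3 module; this is an assumption for illustration, not the SQL Server syntax used above. MAX_CODES is a hypothetical constant, set to 4 here to show the NULL padding; the question would use 10.

```python
import sqlite3

MAX_CODES = 4  # hypothetical; the question allows up to 10 codes per record

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE primary_table (ImportRecordId INTEGER, Summary TEXT);
CREATE TABLE secondary_table (ImportRecordId INTEGER, CodeName TEXT, CodeValue TEXT);
INSERT INTO primary_table VALUES
    (1, 'Imported Successfully'), (2, 'Failed'), (3, 'Imported Successfully');
INSERT INTO secondary_table VALUES
    (1, 'ABC', '123456A'), (1, 'DEF', '8766339'), (1, 'GHI', '887790H');
""")

# One CodeName/CodeValue column pair per slot, picked by row number.
slot_columns = ",\n       ".join(
    f"max(CASE WHEN s.rn = {n} THEN s.CodeName END) AS CodeName{n}, "
    f"max(CASE WHEN s.rn = {n} THEN s.CodeValue END) AS CodeValue{n}"
    for n in range(1, MAX_CODES + 1)
)

query = f"""
SELECT p.ImportRecordId, p.Summary,
       {slot_columns}
FROM primary_table p
JOIN (
    SELECT *,
           row_number() OVER (PARTITION BY ImportRecordId ORDER BY CodeName) AS rn
    FROM secondary_table
) s ON s.ImportRecordId = p.ImportRecordId
GROUP BY p.ImportRecordId, p.Summary
ORDER BY p.ImportRecordId
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because this is an inner join, primary records without any secondary rows (2 and 3 in the sample) do not appear, which matches the query in the question.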

Pivot SSRS Dataset

I have a dataset which looks like so
ID | PName | Node | Val |
1 | Tag | Name | XBA |
2 | Tag | Desc | Dec1 |
3 | Tag | unit | Int |
6 | Tag | tids | 100 |
7 | Tag | post | AAA |
1 | Tag | Name | XBB |
2 | Tag | Desc | Des9 |
3 | Tag | unit | Float |
7 | Tag | post | BBB |
6 | Tag | tids | 150 |
I would like the result in my report to be
Name | Desc | Unit | Tids | Post |
XBA | Dec1 | int | 100 | AAA |
XBB | Des9 | Float | 150 | BBB |
I have tried using a SSRS Matrix with
Row: PName
Data: Node
Value: Val
The result was simply one row with Name, the next row with Desc, the next with unit, and so on. The values are not all in the same row, and the second record was missing. This is possibly because there is no grouping on the dataset.
What is a good way of achieving the expected results?

I would not recommend this for a production scenario but if you need to knock out a report quickly or something you can try this. I would just not feel comfortable that the order of the records you get will always be what you expect.
You COULD try to insert the results of the SP into a table (regular table, temp table, table variable...doesn't matter really as long as you can get an identity column added). Assuming that the rows always come out in the correct order (which is probably not a valid assumption 100% of the time) then add an identity column on the table to get a unique row number for each row. From there you should be able to write some math logic to "group" your values together and then pivot out what you want.
create table #temp (ID int, PName varchar(100), Node varchar(100), Val varchar(100))
insert #temp exec (your stored proc)
alter table #temp add UniqueID int identity
then use UniqueID to group records together, for example with (UniqueID - 1) / 5 so that every five consecutive rows share a group number, and then pivot
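
The grouping arithmetic can be illustrated outside SQL. This is a minimal Python sketch that assumes, as the answer does, that the rows arrive in the right order and that every record consists of exactly five rows; GROUP_SIZE and the variable names are hypothetical.

```python
GROUP_SIZE = 5  # assumption: each logical record spans exactly five rows

# Rows in arrival order: (ID, PName, Node, Val)
rows = [
    (1, "Tag", "Name", "XBA"), (2, "Tag", "Desc", "Dec1"),
    (3, "Tag", "unit", "Int"), (6, "Tag", "tids", "100"),
    (7, "Tag", "post", "AAA"),
    (1, "Tag", "Name", "XBB"), (2, "Tag", "Desc", "Des9"),
    (3, "Tag", "unit", "Float"), (7, "Tag", "post", "BBB"),
    (6, "Tag", "tids", "150"),
]

records = []
# unique_id plays the role of the identity column added to the temp table.
for unique_id, (_id, _pname, node, val) in enumerate(rows, start=1):
    group = (unique_id - 1) // GROUP_SIZE  # integer division gives the group
    if group == len(records):
        records.append({})
    records[group][node.lower()] = val  # pivot: Node becomes the column name

for record in records:
    print(record)
```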

select statement for specific value sqlite

Could you help me write a select query for this case?
I have been looking for a way to implement an expandable list view that is filled from a database, but I have not found a proper example yet,
so I am thinking about another way.
I have 2 tables:
table1 :
+------------+----------+
| id_table1 | Item |
+------------+----------+
| 1 | Item1 |
| 2 | Item2 |
| 3 | Item3 |
| 4 | Item2.1 |
| 5 | Item2.2 |
| 6 | Item3.1 |
| 7 | Item3.2 |
+------------+----------+
table2 (table2.id_table2 = table1.id_table1 and table2.id_table = table1.id_table1):
+------------+----------+
| id_table2 | id_table |
+------------+----------+
| 2 | 4 |
| 2 | 5 |
| 3 | 6 |
| 3 | 7 |
+------------+----------+
and with some select query the result should be:
Item1
Item2
Item2.1 //with space
Item2.2 //with space
Item3
Item3.1 //with space
Item3.2 //with space

You can do what you want to do with these tables, a la the following:
select id_table1, Item from table1
where not exists (
select id_table2
from table2 where id_table1=id_table)
union
select id_table2, child.Item
from table1 parent, table2, table1 child
where table2.id_table2=parent.id_table1
and table2.id_table=child.id_table1;
The first query finds those items that are "parent" items. The second one finds those that are children. (You might have some issues with ordering later on, and this assumes only two levels at the moment.) But it is not a very clear way to do it. At least I would suggest column names that indicate what you are doing, e.g.:
table1: ViewItem. Columns: id, Item
table2: ItemChild. Columns: parentId, childId
You will find quite a few hits on this type of question, hierarchical menus being one such application.
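
One way to handle the ordering and the indentation is sketched here with Python's sqlite3 module. It is a variation on the query above, not the exact statement: the depth column and the ORDER BY are additions, and it still assumes only two levels.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id_table1 INTEGER PRIMARY KEY, Item TEXT);
CREATE TABLE table2 (id_table2 INTEGER, id_table INTEGER);
INSERT INTO table1 VALUES
    (1, 'Item1'), (2, 'Item2'), (3, 'Item3'), (4, 'Item2.1'),
    (5, 'Item2.2'), (6, 'Item3.1'), (7, 'Item3.2');
INSERT INTO table2 VALUES (2, 4), (2, 5), (3, 6), (3, 7);
""")

# Top-level items get depth 0; children get depth 1 and sort under their
# parent id. ORDER BY uses column positions, which a compound SELECT accepts.
query = """
SELECT id_table1 AS parent_id, 0 AS depth, Item
FROM table1
WHERE NOT EXISTS (
    SELECT 1 FROM table2 WHERE table2.id_table = table1.id_table1
)
UNION ALL
SELECT table2.id_table2, 1, child.Item
FROM table2
JOIN table1 AS child ON child.id_table1 = table2.id_table
ORDER BY 1, 2, 3
"""

# Indent children by two spaces per level.
lines = ["  " * depth + item for _parent, depth, item in conn.execute(query)]
print("\n".join(lines))
```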
