Postgresql: How to copy rows n times (n=value of original column A) and insert "yes" k-times (k=value column B) randomly within the n new rows - postgresql-9.5

I have a dataset with information about public toilet buildings. Each building (dataset row) can contain several toilets (see the example table below).
To analyze the data, I need every toilet as a single geometry with the attributes female: yes/no and broken: yes/no. It is not important for me to know exactly whether it is the female toilet that is broken or a male one.
So what I am trying to do is copy each row of my dataset n times (n = value of column total_number_toilets), insert 'yes' randomly k times (k = value of column female_toilets) into column female_toilets, and insert 'yes' i times (i = value of column broken_toilets) into column broken_toilets across the n new rows. The data type of each column is string, although numbers are stored at the moment, so inserting 'yes' instead of a number into the columns of the copied rows should not be a problem.
The reason it is not important to know exactly whether the female toilet is broken is that I just need a large sample dataset for an application.
Original row:
total_number_toilets | female_toilets | broken_toilets
---------------------|----------------|---------------
5                    | 2              | 1
new rows:
total_number_toilets | female_toilets | broken_toilets
---------------------|----------------|---------------
1                    | 'yes'          | 'no'
1                    | 'no'           | 'yes'
1                    | 'yes'          | 'no'
1                    | 'no'           | 'no'
1                    | 'no'           | 'no'
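One way to sketch this in PostgreSQL 9.5: expand each row with generate_series, then give every copy a random rank per building and compare that rank with the counts. The table name buildings and the id primary key below are assumptions; only the three shown columns come from the question.

```sql
-- Expand each building into n rows, then mark a random k of them
-- 'yes' for female and, independently, a random i of them 'yes' for broken.
SELECT '1' AS total_number_toilets,
       CASE WHEN row_number() OVER (PARTITION BY id ORDER BY r1) <= female_n
            THEN 'yes' ELSE 'no' END AS female_toilets,
       CASE WHEN row_number() OVER (PARTITION BY id ORDER BY r2) <= broken_n
            THEN 'yes' ELSE 'no' END AS broken_toilets
FROM (
    SELECT b.id,
           b.female_toilets::int AS female_n,
           b.broken_toilets::int AS broken_n,
           random() AS r1,   -- one random value per copy for the female ordering
           random() AS r2    -- a second, independent one for the broken ordering
    FROM buildings b,
         generate_series(1, b.total_number_toilets::int) AS g(n)
) expanded;
```

row_number() ranks the n copies of each building in a random order, so exactly female_n of them fall at or below the threshold and get 'yes'; the second random column does the same for broken_n without coupling the two attributes.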

Related

SQL - Retrieve only one record of related records

I have a table that depicts shares of a particular type of record. Two records are created for each shared item, which results in something like this:
|--------------|------------|
| Shared From | Shared To |
|--------------|------------|
| Record 1 | Record 2 |
|--------------|------------|
| Record 2 | Record 1 |
|--------------|------------|
Is it possible to retrieve a single share record? Meaning that from the table above I get only one record (it doesn't make a difference which):
|--------------|------------|
| Shared From | Shared To |
|--------------|------------|
| Record 1 | Record 2 |
Using DISTINCT on both columns doesn't work since the combinations are different.
Use CASE expressions to return the smaller value in the first column and the larger value in the second column. Then do SELECT DISTINCT to remove duplicates.
select distinct case when SharedFrom < SharedTo then SharedFrom else SharedTo end,
case when SharedFrom > SharedTo then SharedFrom else SharedTo end
from tablename
Note: this may swap the two columns within a row (when SharedFrom > SharedTo); that is exactly what makes each combination unique.
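If the engine supports LEAST/GREATEST (PostgreSQL and MySQL do; SQL Server only from 2022), the same normalization can be written more compactly:

```sql
-- Normalize each pair so the smaller value always comes first,
-- then DISTINCT collapses the mirrored rows.
SELECT DISTINCT
       LEAST(SharedFrom, SharedTo)    AS SharedFrom,
       GREATEST(SharedFrom, SharedTo) AS SharedTo
FROM tablename;
```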
If I understand you correctly, you want to get a single line from the table.
To get the first N rows of a table in SQL Server, you can use TOP(N), for example:
SELECT TOP(1) share_column FROM shares_table

Finding & updating duplicate rows

I need to implement a query (or maybe a stored procedure) that will perform soft de-duplication of data in one of my tables. If any two records are similar enough, I need to "squash" them: deactivate one and update the other.
The similarity is based on a score. Score is calculated the following way:
from both records, take the values of column A;
if the values are equal, add A1 to the score;
if they are not equal, subtract A2 from the score;
move on to the next column.
Once all desired value pairs have been checked:
is the resulting score more than X?
yes – the records are duplicates; mark the older record as "duplicate" and append its id to a duplicate_ids column on the newer record.
no – do nothing.
How would I approach solving this task in SQL?
The table in question is called people. People records are entered by different admins. The de-duplication process exists to make sure that no two records for the same person exist in the system.
The motivation for the task is simple: performance.
Right now the solution is implemented in a scripting language via several sub-par SQL queries with logic on top of them. However, the volume of data is expected to grow to tens of millions of records, and the script will eventually become very slow (it runs via cron every night).
I'm using postgresql.
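As a starting point, the scoring rule described above maps directly onto a self-join. The column names (first_name, last_name, email) and the weights (+3 on a match, -1 on a mismatch, threshold 5) below are made-up placeholders for the real A1/A2/X values:

```sql
-- Score every pair of people records and keep the pairs above the threshold.
SELECT duplicate_id, keep_id, score
FROM (
    SELECT o.id AS duplicate_id,   -- older record: candidate to deactivate
           n.id AS keep_id,        -- newer record: gets the duplicate_ids entry
           CASE WHEN o.first_name = n.first_name THEN 3 ELSE -1 END
         + CASE WHEN o.last_name  = n.last_name  THEN 3 ELSE -1 END
         + CASE WHEN o.email      = n.email      THEN 3 ELSE -1 END AS score
    FROM people o
    JOIN people n ON o.id < n.id   -- each pair once; lower id assumed older
) scored
WHERE score > 5;
```

Note that the bare self-join compares every pair, which is O(n²); at tens of millions of rows you would need a blocking step first (e.g. only join rows that share an indexed prefix or phonetic key), which is essentially what dedupe's predicates provide.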
It appears that the de-duplication is generally a tough problem.
I found this: https://github.com/dedupeio/dedupe. There's a good description of how this works: https://dedupe.io/documentation/how-it-works.html.
I'm going to explore dedupe. I'm not going to try to implement it in SQL.
If I get you correctly, this could help.
You can use PostgreSQL Window Functions to get all the duplicates and use "weights" to determine which records are duplicated so you can do whatever you like with them.
Here is an example:
-- Temporary table for the test; the primary key is id and
-- we have colA, colB, colC columns with a creation date:
CREATE TEMP TABLE test
(id serial, "colA" text, "colB" text, "colC" text,creation_date date);
-- Insert test data:
INSERT INTO test ("colA", "colB", "colC",creation_date) VALUES
('A','B','C','2017-05-01'),('D','E','F','2017-06-01'),('A','B','D','2017-08-01'),
('A','B','R','2017-09-01'),('C','J','K','2017-09-01'),('A','C','J','2017-10-01'),
('C','W','K','2017-10-01'),('R','T','Y','2017-11-01');
-- SELECT * FROM test
-- id | colA | colB | colC | creation_date
-- ----+-------+-------+-------+---------------
-- 1 | A | B | C | 2017-05-01
-- 2 | D | E | F | 2017-06-01
-- 3 | A | B | D | 2017-08-01 <-- Duplicate A,B
-- 4 | A | B | R | 2017-09-01 <-- Duplicate A,B
-- 5 | C | J | K | 2017-09-01
-- 6 | A | C | J | 2017-10-01
-- 7 | C | W | K | 2017-10-01 <-- Duplicate C,K
-- 8 | R | T | Y | 2017-11-01
-- Here is the query you can use to get the ids of the duplicate records
-- (read the comments from the innermost subquery outward):
-- third, you select the id of the duplicates
SELECT id
FROM
(
-- Second, select all the columns needed and weight the duplicates.
-- You don't need to select every column, if only the id is needed
-- then you can only select the id
-- Query this SQL to see results:
SELECT
id,"colA", "colB", "colC",creation_date,
-- The weights are simple: if the row number is more than 1, assign 1;
-- if it is 1, assign 0. Sum them all and you have a total
-- weight of 'duplicity'.
CASE WHEN "num_colA">1 THEN 1 ELSE 0 END +
CASE WHEN "num_colB">1 THEN 1 ELSE 0 END +
CASE WHEN "num_colC">1 THEN 1 ELSE 0 END as weight
FROM
(
-- First, select using window functions and assign a row number.
-- You can run this query separately to see results
SELECT *,
-- NOTE that it is order by id, if needed you can order by creation_date instead
row_number() OVER(PARTITION BY "colA" ORDER BY id) as "num_colA",
row_number() OVER(PARTITION BY "colB" ORDER BY id) as "num_colB",
row_number() OVER(PARTITION BY "colC" ORDER BY id) as "num_colC"
FROM test ORDER BY id
) count_column_duplicates
) duplicates
-- Here you define which weight counts as a duplicate;
-- for this test, the rows whose weight is more than 1:
WHERE weight>1
-- The full query returns all the duplicates according to the selected weight:
-- id
-- ----
-- 3
-- 4
-- 7
You can add this query to a stored procedure so you can run it whenever you like. Hope it helps.

Combine Rows in Access SQL where fields are blank

I have been looking to combine multiple rows of information into a single row to fill in blank spaces.
Id | name | value1 | value2 | value3
---|------|--------|--------|-------
1  | bob  | 3      |        |
1  | bob  |        |        | 6
1  | bob  |        | B      |
How do I get those 3 rows into a single row? It is confirmed that they will fit perfectly where the blank values are; no values will overlap.
Id | name | value1 | value2 | value3
---|------|--------|--------|-------
1  | bob  | 3      | B      | 6
I have come across nothing of much use in my research besides something called ConcatRelated, which I could not modify to fit my needs. I also tried a GROUP BY statement, which I couldn't get to work either. Any ideas? I am new to Access and SQL in general.
How do I get those 3 rows into a single row?
Use a GROUP BY query to consolidate the three "bob" rows into one, and include an aggregate function such as Max() to pick up the non-blank value for each of the valueX columns.
SELECT
y.Id,
y.name,
Max(y.value1) AS MaxOfvalue1,
Max(y.value2) AS MaxOfvalue2,
Max(y.value3) AS MaxOfvalue3
FROM YourTable AS y
GROUP BY
y.Id,
y.name;
If you want to reuse the original column name, valueX, instead of a MaxOfvalueX alias, enclose that name in square brackets ...
Max(y.value1) AS [value1]

Sqlite : Loop through rows and match, break when distinct row encountered

I want to compare two tables, A and B, row by row on the basis of a name column. As soon as I encounter a differing row I want to break.
I want to do this using a query, something like this :
select case
when ( compare row 1 of A and B
if same continue with row+1
else break
)
if all same, then 1
else 0
end
as result
I am not sure how to loop through rows and break? Is it even possible in sqlite?
EDIT
The tables look like this:
Table A          Table B
id | name        id | name
---|------       ---|------
1  | A           1  | A   (same)
2  | C           2  | C   (same)
3  | B           3  | Z   (different, break)
4  | K
Both tables have the same structure. I just want to compare the names row by row, to see whether there is any order difference.
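SQLite has no procedural loop, but you do not need one for this: a single query can report whether any positional difference exists. A sketch, assuming both tables are literally named A and B and the id column defines the row order:

```sql
-- Returns 1 if the tables match row by row, 0 otherwise.
SELECT CASE
         WHEN (SELECT COUNT(*) FROM A) <> (SELECT COUNT(*) FROM B)
           OR EXISTS (SELECT 1
                      FROM A JOIN B ON B.id = A.id
                      WHERE B.name IS NOT A.name)  -- null-safe "not equal"
         THEN 0 ELSE 1
       END AS result;
```

If you also need to know where the first difference occurs, run SELECT MIN(A.id) over the same join and WHERE clause instead of the CASE.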

SQL Join two tables with different number of rows get all rows from one table

I'm trying to join two tables with a different number of rows. Client Number is an identifying field in both tables.
The first table includes the client number for all clients.
The second table includes the client number only for clients who meet certain specifications. Some of the clients in the second table have a second client number. To complicate things further, the ClientNumber2 field is a text field but I need it to be a number like the other client number fields. There is also a region field in the second table that I need to limit to certain region numbers.
I want to create a new column that shows the client number from the first table for all clients who do not exist in the second table; the second client number from the second table if it exists; otherwise, the client number from the first table (which is the same as the client number in the first column of the second table, so either could be referenced).
I've included the syntax I'm using below. It runs without errors. The OriginalCN field returns the desired value for rows with a value in ClientNumber2 of Table 2 but returns null for all others. I cannot figure out how to get it to work correctly. I've also included sample tables and my desired result. Any help is greatly appreciated!
CLIENT TABLE 1
CLIENT NUMBER
1
2
3
4
5
6
7
8
CLIENT TABLE 2
CLIENT NUMBER | 2ND CLIENT NUMBER | REGION
2             | 14                | 1
6             |                   | 2
8             | 15                | 2
DESIRED RESULT
1
14
3
4
5
6
7
15
Here is the syntax I am using:
SELECT
    TABLE2.CLIENTNUMBER,
    TABLE1.CLIENTNUMBER,
    CASE
        WHEN TABLE2.CLIENTNUMBER IS NULL THEN TABLE1.CLIENTNUMBER
        WHEN TABLE2.CLIENTNUMBER2 IS NULL THEN TABLE2.CLIENTNUMBER
        WHEN TABLE2.CLIENTNUMBER2 = ' ' THEN TABLE2.CLIENTNUMBER
        ELSE CAST(TABLE2.CLIENTNUMBER2 AS INT)
    END AS OriginalCN
FROM DSS.DBO.TABLE1
LEFT OUTER JOIN RPTO.DBO.TABLE2
    ON DSS.DBO.TABLE1.CLIENTNUMBER = RPTO.DBO.TABLE2.CLIENTNUMBER
WHERE TABLE2.REGION IN (1,2,3)
Try this: NULLIF turns a blank ClientNumber2 into NULL so that COALESCE falls back to t1.ClientNumber, and the COALESCE on t2.REGION keeps the WHERE clause from discarding the unmatched rows of the LEFT JOIN.
SELECT COALESCE(CAST(NULLIF(t2.ClientNumber2,' ') As Int), t1.ClientNumber) As ClientNumber
FROM DSS.DBO.TABLE1 t1
LEFT JOIN RPTO.DBO.TABLE2 t2 ON t1.CLIENTNUMBER = t2.CLIENTNUMBER
WHERE COALESCE(t2.REGION, 1) IN (1,2,3)
Try this:
CREATE TABLE #CLIENT_TABLE_1
(CLIENTNUMBER INT)
INSERT #CLIENT_TABLE_1
VALUES (1),(2),(3),(4),(5),(6),(7),(8)
CREATE TABLE #CLIENT_TABLE_2
(CLIENTNUMBER INT,scNDCLIENTNUMBER varchar(10),REGION INT)
INSERT #CLIENT_TABLE_2
VALUES( 2,'14',1),(6,' ',2),(8,'15',2)
SELECT CASE
WHEN b.CLIENTNUMBER IS NOT NULL
AND len(b.scNDCLIENTNUMBER)>0 THEN b.scNDCLIENTNUMBER
ELSE a.CLIENTNUMBER
END Result
FROM #CLIENT_TABLE_1 a
LEFT JOIN #CLIENT_TABLE_2 b
ON a.CLIENTNUMBER = b.CLIENTNUMBER
Output :
+------+
|Result|
+------+
| 1 |
| 14 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 15 |
+------+