Redshift IN condition on thousands of values - sql

What's the best way to get data that matches any one of ~100k values?
For this question, I'm using an Amazon Redshift database and have a table something like this with hundreds of millions of rows:
| userID | c1 | c2  |
|--------|----|-----|
| 101000 | 12 | 'a' |
| 101002 | 25 | 'b' |
There are also millions of unique userIDs. I have a CSV list of 98,000 userIDs that I care about, and I want to do math on the columns for those specific users.
select c1, c2 from table where userID in (10101, 10102, ...)
What's the best solution to match against a giant list like this?
My approach was to write a Python script that read in the results for all users in our condition set and then filtered them against the CSV in Python. It was dead slow and wouldn't work in all scenarios, though.
A coworker suggested uploading the 98k users into a temporary table, then joining against it in the query. This seems like the smartest way, but I wanted to ask if you all had ideas.
I also wondered if printing an insanely long SQL query containing all 98k users to match against and running it would work. Out of curiosity, would that even have run?

As your coworker suggests, put your IDs into a temporary table by uploading a CSV to S3 and then using COPY to import the file into a table. You can then use an INNER JOIN condition to filter your main data table on the list of IDs you're interested in.
An alternative option, if uploading a file to S3 isn't possible for you, could be to use CREATE TEMP TABLE to set up a table for your list of IDs and then use a spreadsheet to generate a whole set of INSERT statements to populate the temp table. 100K inserts could be quite slow, though.
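The temp-table-and-join pattern described above can be sketched as follows. This is only an illustration: SQLite (through Python's sqlite3 module) stands in for Redshift, and the table name events, the temp table name wanted, and the sample IDs are assumptions. On Redshift the temp table would be loaded with COPY from S3 rather than with INSERT statements.

```python
import sqlite3

# SQLite stands in for Redshift here; table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (userID INTEGER, c1 INTEGER, c2 TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(101000, 12, "a"), (101002, 25, "b"), (101004, 7, "c")],
)

# IDs of interest; in practice these would be read from the CSV file.
wanted_ids = [101000, 101004]

# Load the ID list into a temporary table (on Redshift: COPY from S3).
conn.execute("CREATE TEMP TABLE wanted (userID INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO wanted VALUES (?)", [(i,) for i in wanted_ids])

# Filter the large table by joining against the temporary table.
rows = conn.execute(
    "SELECT e.userID, e.c1, e.c2 "
    "FROM events e INNER JOIN wanted w ON e.userID = w.userID "
    "ORDER BY e.userID"
).fetchall()
print(rows)
```

The join lets the database do the matching, so only the rows for the listed users leave the server.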

Related

BigQuery Create Table Query from Google Sheet with Variable item string field into Repeated Field

I hope I explain this adequately.
I have a series of Google Sheets with data from an Airtable database. Several of the fields are stringified arrays with recordIds to another table.
These fields can have between 0 and n - comma separated values.
I run a create/overwrite table SELECT statement to create native BigQuery tables for reporting. This works great.
Now I need to add the recordIds to a Repeated field.
I've manually written to a repeated field using:
INSERT INTO `robotic-vista-339622.Insurly_dataset.zzPOLICYTEST` (policyID, locations, carrier)
VALUES ('12334556',[STRUCT('recordId1'),STRUCT('recordId2')], 'name of policy');
However, I need to know how to do this using a SELECT statement rather than INSERT. I also need to know how to do this when the number of recordIds retrieved from Airtable is not known in advance. One record could have none and another record could have 10 or more.
Any given sheet will look like the following, where "locations" contains the recordIds I want to add to a repeated field.
SHEETNAME: POLICIES
|policyId |carrier | locations |
|-----------|-----------|---------------------------------|
|recrTkk |Workman's | |
|rec45Yui |Workman's |recL45x32,recQz70,recPrjE3x |
|recQb17y |ABC Co. |rec5yUlt,recIrW34 |
In the above, the first row/record has no location IDs; the subsequent rows/records have three and two, respectively.
Any help is appreciated.
Thanks.
I'm unsure if answering my own question is the correct way to show that it was solved... but here is what it took.
I created a native table in BigQuery. The field for locations is a string with mode REPEATED.
Then I just run an overwrite table SELECT statement.
SELECT recordId,Name, Amount, SPLIT(locations) as locations FROM `projectid.datasetid.googlesheetsdatatable`;
I tested this, and I can run linked queries on the locations with UNNEST.
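To make the effect of splitting concrete, here is a small plain-Python sketch (not BigQuery code) applied to the sample POLICIES rows. The helper name split_locations is hypothetical, and returning an empty list for an empty cell is an assumption about the desired result; it is worth checking how SPLIT treats empty cells in your own data.

```python
def split_locations(cell):
    """Turn a comma-separated cell into a list of record IDs."""
    # Assumption: an empty or missing cell should yield no locations at all.
    if cell is None or cell.strip() == "":
        return []
    return [part.strip() for part in cell.split(",")]


# Sample rows from the POLICIES sheet: (policyId, carrier, locations).
sheet_rows = [
    ("recrTkk", "Workman's", ""),
    ("rec45Yui", "Workman's", "recL45x32,recQz70,recPrjE3x"),
    ("recQb17y", "ABC Co.", "rec5yUlt,recIrW34"),
]

# Each policy ends up with a variable-length list, like a repeated field.
policies = [
    {"policyId": pid, "carrier": carrier, "locations": split_locations(locs)}
    for pid, carrier, locs in sheet_rows
]
for policy in policies:
    print(policy["policyId"], policy["locations"])
```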

Postgresql query using multiple WHERE conditions

I am wondering if there is a simple / smart way to pass a query to a Postgresql database. I have a database whose headers look something like this:
measurementPointID | parameterA | parameterB | measurement | measurementTIME
There are some dozens of records within the database.
I would like to pass a query that retrieves data only for a set of measurementPointIDs. There are tens of thousands of measurementPointID values that I need to retrieve, and I have all of them available in, for example, a CSV file.
The query should do a GROUP BY measurementTIME and ORDER BY measurementTIME as well. One detail is that if the measurement is zero (measurement = 0) there is no row corresponding to the measurementPointID at all.
Am I trying to do something too complicated or in a stupid way?

VB.NET Access Database 255 Columns Limit

I'm currently developing an application for a client using Visual Basic .NET. It's a rewrite of an application that accessed an Oracle database, filtered the columns and performed some actions on the data. Now, for reasons beyond my control, the client wants to use an Access (.mdb) database for the new application. The problem with this is that the tables have more than the 255 columns Access supports, so the client suggested splitting the data into multiple databases/tables.
Well even when the tables are split, at some point, I have to query all columns simultaneously (I did an INNER JOIN on both tables) which, of course, yields an error. The limit apparently is on number of simultaneously queryable columns not on the total number of columns.
Is there a possibility to circumvent the 255-column limit somehow? I was thinking in the direction of using LINQ to combine queries of both tables, i.e. have an adapter that emulates a single table I can perform queries on. A drawback of this is that .mdb is not a first-class citizen of LINQ-to-SQL (i.e. no insert/update supported etc.).
As a workaround, I might be able to rewrite my stuff so as to only need all columns at one point (I dynamically create control elements depending on the column names in the table). Therefore I would need to query say the first 250 columns and after that the following 150.
Is there an Access SQL query that can achieve something like this? I thought of something like SELECT TOP 255 * FROM dbname or SELECT * FROM dbname LIMIT 1,250, but these are not valid.
Do I have other options?
Thanks a lot for your suggestions.
The ADO.NET DataTable object has no real limitations on the number of columns that it could contain.
So, once you have split the big table into two tables with fewer columns and set the same primary key in both subtables, you can use the DataTable.Merge method on the VB.NET side.
The example on MSDN shows two tables with the same schema merged together, but it also works if you have two totally different schemas with just the primary key in common.
Dim firstPart As DataTable = LoadFirstTable()
Dim secondPart As DataTable = LoadSecondTable()
firstPart.Merge(secondPart)
I have tested this just with only one column of difference, so I am not very sure that this is a viable solution in terms of performance.
As far as I know, there is no way to directly bypass this problem using Access.
If you cannot change the database, the only way I can think of is to write a wrapper that knows where each field is, automatically splits the query into several queries, and then regroups the results in a custom class containing all the columns for every row.
For example, you can split every table into several tables, duplicating the fields you filter on.
TABLEA
Id | ConditionFieldOne | ConditionFieldTwo | Data1 | Data2 | ... | DataN |
becomes
TABLEA_1
Id | ConditionFieldOne | ConditionFieldTwo | Data1 | Data2 | ... | DataN/2 |
TABLEA_2
Id | ConditionFieldOne | ConditionFieldTwo | Data(N/2)+1 | Data(N/2)+2 | ... | DataN |
and a query such as
SELECT * FROM TABLEA WHERE ConditionFieldOne = 'condition'
becomes, with the wrapper,
SELECT * FROM TABLEA_1 WHERE ConditionFieldOne = 'condition'
SELECT * FROM TABLEA_2 WHERE ConditionFieldOne = 'condition'
and then you join the results on Id.
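The final step, recombining the rows returned by the two partial queries, can be sketched like this. It is a minimal illustration in Python rather than VB.NET; the function name merge_by_id and the sample rows are hypothetical, and it assumes every row carries the shared Id key.

```python
def merge_by_id(part_one, part_two, key="Id"):
    """Combine rows from two partial result sets that share a key column."""
    # Index the first result set by key, copying each row so inputs stay untouched.
    merged = {row[key]: dict(row) for row in part_one}
    # Add the columns from the second result set to the matching row.
    for row in part_two:
        merged.setdefault(row[key], {}).update(row)
    return list(merged.values())


# Hypothetical results of the TABLEA_1 and TABLEA_2 queries.
rows_1 = [{"Id": 1, "ConditionFieldOne": "condition", "Data1": "x"}]
rows_2 = [{"Id": 1, "ConditionFieldOne": "condition", "Data2": "y"}]

full_rows = merge_by_id(rows_1, rows_2)
print(full_rows)
```

Because the combined row lives only in application memory, the 255-column limit of a single Access query does not apply to it.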

Compare two tables and save the difference in a file

I have two tables in two different Oracle databases, they look the same (same column names etc) but the data is mostly different. I would like to compare them and save the difference in a third database (or just save it in an easily imported format).
The tables aren't huge, but there are still about 40 million rows in each table, and I would like help doing the comparison efficiently.
There are no keys or unique columns, but no two rows have the same Nr and Name.
Table:
Nr Name AText
1234 Jon Doe Ksjfkjsdkfjksdfsf
3234 Jon Sho sdfsdfasdfsdf
1434 Ian Doe lksjdfkljlkjsdfkj
If you're not trying to do this programmatically, you should take a look at SQL Data Compare from Red Gate. I believe it does exactly what you're looking for.
Depends on what you want to find.
For example, if the tables are very similar, you can export both to text files in a fixed order (select * from table order by 1, 2, 3) and then run diff -h on the two files. This is reasonably fast.
Or you can import one table into the other database and use MINUS, but this is slow. The advantage is that you can run MINUS on (col1, col2) only and exclude col3.
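The set-difference idea can be sketched as below. SQLite is used as a stand-in and its EXCEPT operator plays the role of Oracle's MINUS; the table names table_a and table_b are assumptions, and the rows are taken from the sample in the question.

```python
import sqlite3

# SQLite stands in for Oracle; EXCEPT is its counterpart of MINUS.
conn = sqlite3.connect(":memory:")
for name in ("table_a", "table_b"):
    conn.execute(f"CREATE TABLE {name} (Nr INTEGER, Name TEXT, AText TEXT)")

conn.executemany(
    "INSERT INTO table_a VALUES (?, ?, ?)",
    [(1234, "Jon Doe", "Ksjfkjsdkfjksdfsf"), (3234, "Jon Sho", "sdfsdfasdfsdf")],
)
conn.executemany(
    "INSERT INTO table_b VALUES (?, ?, ?)",
    [(1234, "Jon Doe", "Ksjfkjsdkfjksdfsf"), (1434, "Ian Doe", "lksjdfkljlkjsdfkj")],
)

# Rows present in one table but not in the other, compared on all columns.
only_in_a = conn.execute(
    "SELECT Nr, Name, AText FROM table_a "
    "EXCEPT SELECT Nr, Name, AText FROM table_b ORDER BY 1"
).fetchall()
only_in_b = conn.execute(
    "SELECT Nr, Name, AText FROM table_b "
    "EXCEPT SELECT Nr, Name, AText FROM table_a ORDER BY 1"
).fetchall()
print(only_in_a, only_in_b)
```

Running the difference in both directions is needed to see rows missing from either side.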

CSV Import with Validation

I have a need to import a number of CSV files into corresponding tables in SQL. I am trying to write a stored procedure that will import any of these CSV files, using a number of parameters to set things like file name, destination name etc.
Simple so far. The problem comes from the structure of this DB. Each data table has a number of (usually 5) columns that are of a set format, and then however many data columns you want. There is then a set of data validation tables that contain the specific sets of values these 5 columns can contain. So the problem is that when I do the import from CSV, I need to validate that each imported row meets the criteria in these validation tables, essentially that there is a row in the validation table whose data matches the 5 columns in the imported data.
If it does not, then it needs to write an error to the log and not import it, if it does then it should import it.
Here is an example of what I mean:
Data Table (where the imported data will go)
|datatype|country|currency| datacolumn1 | datacolumn|
|1 | 2 | GBP | 10000 | 400 |
|3 | 4 | USD | 10000 | 400 |
Validation table
|datatype|country|currency|
|1 |2 |GBP |
|2 |3 |USD |
So the first line is valid, it has a matching record in the validation table for the first 3 columns, but the second is not and should be rejected.
The added problem is that each table can reference a different validation table (although many reference the same one) so the columns that have to be checked often vary in amount and name.
My first problem is really how to do a row by row check when importing from CSV, is there any way to do so without importing into a temporary table first?
After that, what is the best way to check that the columns match in a generic way, despite the fact that the name and number of columns change depending on which table is being imported?
You can import the contents of a csv into some temporary tables by using this -
SELECT * into newtable FROM
OPENROWSET ('MSDASQL', 'Driver={Microsoft Text Driver (*.txt; *.csv)};DefaultDir={Directory Path of the CSV File};',
'SELECT * from yourfile.csv');
Once you have your data in some sql table, you can use an inner join to validate the data and narrow down to the valid rows.
-- Match all key columns against the same validation row
SELECT A.* FROM newtable A
INNER JOIN validation_table B
ON A.Datatype = B.Datatype AND A.Country = B.Country AND A.Currency = B.Currency
This should give you the valid rows according to your validation rules.
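One way to check all key columns against the same validation row, and to collect the rejected rows for logging, is sketched below. SQLite stands in for SQL Server, the staging table name and the datacolumn2 column name are assumptions, and the data comes from the example in the question.

```python
import sqlite3

# SQLite stands in for SQL Server; staging holds the raw CSV rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staging (datatype INTEGER, country INTEGER, currency TEXT, "
    "datacolumn1 INTEGER, datacolumn2 INTEGER)"
)
conn.execute(
    "CREATE TABLE validation_table (datatype INTEGER, country INTEGER, currency TEXT)"
)
conn.executemany(
    "INSERT INTO staging VALUES (?, ?, ?, ?, ?)",
    [(1, 2, "GBP", 10000, 400), (3, 4, "USD", 10000, 400)],
)
conn.executemany(
    "INSERT INTO validation_table VALUES (?, ?, ?)",
    [(1, 2, "GBP"), (2, 3, "USD")],
)

# All key columns must match one and the same validation row.
join_on = (
    "s.datatype = v.datatype AND s.country = v.country AND s.currency = v.currency"
)

# Valid rows: a matching validation row exists.
valid = conn.execute(
    f"SELECT s.* FROM staging s INNER JOIN validation_table v ON {join_on}"
).fetchall()

# Rejected rows: no validation row matched, so the joined side is NULL.
rejected = conn.execute(
    f"SELECT s.* FROM staging s LEFT JOIN validation_table v ON {join_on} "
    "WHERE v.datatype IS NULL"
).fetchall()
print(valid, rejected)
```

Since the join condition is an ordinary string, a stored procedure could build it dynamically from the list of key columns for whichever validation table applies.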
SSIS would let you check, filter, and process data while it was being loaded. I'm not aware of any other native SQL tool that does this. Without SSIS (or a third-party tool), you'd have to first load all the data from a file into some kind of "staging" table (#temp or dedicated permanent) and work from there.
@Pavan Reddy's OPENROWSET solution should work. I've used views, where I first determined the rows in the source file, built a "mapping" view on the target table, and then ran BULK INSERT into the view (which also lets you play games with defaults on "skipped columns").
(Just to mention, you can launch an SSIS package from a stored procedure, using xp_cmdshell to call DTEXEC. It's complex and requires a host of parameters, but it can be done.)