Getting NOT IN subset with SQL

I have a SQL/Postgres database where I keep the application data, including references to user-generated content persisted in object storage - image files. Every now and then, these user-generated files get overwritten or deleted. I would like to add garbage collection to clean up the unused storage.
My plan is:
get a list of files from the object store.
randomly select a subset of maybe 100 files at a time.
use FaaS to run something like the below query every few minutes:
SELECT
*
FROM
events e, categories c
WHERE
e.image <> ANY($1) AND
e.cropped_image <> ANY($1) AND
e.cropped_image_thumb <> ANY($1) AND
e.promoted_img <> ANY($1) AND
c.header_image <> ANY($1) AND
c.list_image <> ANY($1)
However, the query works "backwards": it returns the database records that do not contain the images, whereas I want the list of images that are no longer referenced in the database.
Can I somehow join on the array $1 and get elements not matching anything?
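One way to read this is as an anti-join: turn the candidate list into rows, then keep only the candidates that no image column references. In Postgres the array parameter could be expanded into rows with unnest($1). The sketch below instead uses Python's sqlite3 with an in-memory database so that it is self-contained. The helper name unreferenced_images, the scratch table candidates, and the trimmed-down schema in demo are assumptions made for illustration, not part of the original setup.

```python
import sqlite3

# Columns that may reference a stored image, taken from the query in the question.
IMAGE_COLUMNS = [
    ("events", "image"),
    ("events", "cropped_image"),
    ("events", "cropped_image_thumb"),
    ("events", "promoted_img"),
    ("categories", "header_image"),
    ("categories", "list_image"),
]


def unreferenced_images(conn, candidates):
    """Return the candidate file names that no image column references."""
    cur = conn.cursor()
    # Hypothetical scratch table standing in for the array parameter $1.
    cur.execute("CREATE TEMP TABLE IF NOT EXISTS candidates (name TEXT PRIMARY KEY)")
    cur.execute("DELETE FROM candidates")
    cur.executemany(
        "INSERT OR IGNORE INTO candidates (name) VALUES (?)",
        [(name,) for name in candidates],
    )
    # One NOT EXISTS per column: a candidate survives only if nothing points at it.
    conditions = " AND ".join(
        f"NOT EXISTS (SELECT 1 FROM {table} t WHERE t.{column} = c.name)"
        for table, column in IMAGE_COLUMNS
    )
    cur.execute(f"SELECT c.name FROM candidates c WHERE {conditions} ORDER BY c.name")
    return [row[0] for row in cur.fetchall()]


def demo():
    """Build a tiny in-memory schema and check four candidate files against it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE events (image TEXT, cropped_image TEXT,
                             cropped_image_thumb TEXT, promoted_img TEXT);
        CREATE TABLE categories (header_image TEXT, list_image TEXT);
        INSERT INTO events VALUES ('a.png', 'a_crop.png', NULL, NULL);
        INSERT INTO categories VALUES ('head.png', NULL);
        """
    )
    return unreferenced_images(conn, ["a.png", "head.png", "old.png", "stale.png"])
```

NOT EXISTS is used rather than NOT IN because the image columns may contain NULL, and NOT IN against a set that includes NULL never evaluates to true.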

Related

How to set the explicit order for child table rows for one-to-many SQL relation?

Imagine a database with two tables, lists (with id and name) and items (with id, list_id, which is a foreign key linking to lists.id, and name) and the application with ORM and the corresponding models.
The task: the application needs a way to create, edit, and view a list and the items inside it (that should be pretty easy), but also to save the order of the items within a list and to let the user reorder or delete them. For example, a user creates a list of items and then swaps two of them; when the list is displayed later, that order should be preserved.
What is the best way to implement it, database-wise? Which db structure should I use for it?
I see these ways of solving it:
not using a separate table for items, but storing everything in a list document (as a Postgres jsonb column, for example) - this can work, but I suppose it is not the RDBMS way to do it, and if the user wants to update a single item, the whole list object has to be updated
having a position field in the items table and adding a way to manage the position in the API - this can work, but it is quite complicated (handling cases where several items have the same position, swapping items, deleting items and decreasing the position of every item that comes after the deleted one, etc.)
Is there a simple way of implementing it? Like the one used in production by some big companies? I'm really curious about how such cases are handled in real life.
This is a more theoretical question, so there are no code samples here (except for the db structure).
This is a good question, which as far as I know doesn't have any simple answers. I once came up with a solution for a high volume photo sharing site using an item table with columns list_id and position as you describe. The key to performance was to minimize renumbering as this database had millions of photos (and more than 2^32 likes).
The only operation was to move a single item to another point in the list (before or after another item in the list). This works by first assigning positions in large steps, e.g. 1000, 2000, 3000. Whenever an item is moved between two others, the average of their positions is used, e.g. an item moves from pos=3000 to 1500. Eventually you will try to move an item between two items that have consecutive position numbers. Then you renumber the items either above or below, depending on which direction requires fewer updates (e.g. if there is a run of consecutive positions). This was done using RANK and @vars, as I recall, on MySQL 5.7.
This worked well and resolved a problem of intermittent unavailability in production, which had been caused by the massive renumberings that occurred when consecutive positions were used.
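The midpoint idea can be sketched in a few lines of Python. This is a minimal illustration, not the production code described above: the function names are hypothetical, and the fallback simply renumbers the whole list, whereas the approach described above renumbers only the shorter run of neighbours.

```python
STEP = 1000  # spacing used when positions are (re)assigned, as in the example above


def initial_positions(count, step=STEP):
    """Assign widely spaced positions: 1000, 2000, 3000, ..."""
    return [step * (i + 1) for i in range(count)]


def position_between(before, after):
    """Return an integer strictly between two positions, or None if no gap is left."""
    if after - before < 2:
        return None  # consecutive positions: the caller must renumber first
    return (before + after) // 2


def insert_between(positions, index, step=STEP):
    """Place an item between positions[index] and positions[index + 1].

    Returns (new_positions, rows_written). In the common case only one row is
    written. The fallback here is a simplification: it renumbers everything.
    """
    before, after = positions[index], positions[index + 1]
    mid = position_between(before, after)
    if mid is not None:
        return positions[: index + 1] + [mid] + positions[index + 1:], 1
    renumbered = initial_positions(len(positions) + 1, step)
    return renumbered, len(renumbered)
```

With positions 1000, 2000, 3000, placing an item between the first two yields 1500 and touches a single row; only when two neighbours are consecutive does any renumbering happen.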
I was able to dig up a couple of the queries (they were meant to go into a blog post ages ago). It turns out this was MySQL before RANK() was a thing, which is why the @shuffle_rank variable was used. The + 0 (and the + 1) is there because this is the actual SQL sent to the database, but it was generated in code. This one finds the first gap below (greater than) position 120533287:
SELECT shuffle_rank, position
FROM (SELECT @shuffle_rank := @shuffle_rank + 1 AS shuffle_rank, position
      FROM `gallery_items`
      JOIN (SELECT @shuffle_rank := 0) initialize_rank_var
      WHERE `gallery_items`.`gallery_id` = 14103882 AND (position >= 120533287)
      ORDER BY position ASC) positionable_items
WHERE ABS(120533287 - position) >= shuffle_rank + 0 LIMIT 1
Here's the update query issued after the above query and supporting code decided that 3 rows need to be updated to make a gap. The + 1 here may be larger when renumbering with a gap, if there's room.
UPDATE `gallery_items`
SET position = -222 + (@shuffle_rank := @shuffle_rank + 1)
WHERE `gallery_items`.`gallery_id` = 24669422
  AND (position >= -222)
  AND ((SELECT @shuffle_rank := 0) = 0)
ORDER BY position ASC
LIMIT 3
Note that these two queries aren't from the same operation, since they have different gallery_id values (gallery_id plays the role of list_id here).

Applying a filter of unknown elements using array. Or hiding select records from user

Using a split database, everyone gets a front end with a local table I use as a 'cart' like in online shopping.
I'm copying records to a local table from stock. I don't want the record I copied across to be allowed to be transferred over again making duplicates. I also don't want to delete the original record, just modify it.
So I want them to edit the records copy locally then hit a button that will update the record on the database back end. If they don't hit the button and close the front end, no changes are made. Assume the temp table is wiped on start up.
To stop duplicate records I want to hide select records from the particular user of the front end database only. So if the Access app crashes the record isn't hidden for all users.
Idea: What if I add a (hidden) Stock_ID field to the local table? Then I can check the column, and if any Stock_ID matches the ID of the record I want to copy, a message box says "Error, record already exists" and the copy is cancelled?
I think you're saying you want to show the front end user only those stock records whose Stock_ID values are not present in the local table.
If that is correct, you can use an "unmatched query" to display those stock records.
SELECT s.*
FROM
stock AS s
LEFT JOIN [local] AS l
ON s.Stock_ID = l.Stock_ID
WHERE l.Stock_ID Is Null;
The Access query designer has a query wizard for this task. It should be worth a look.
When you say "hide select records", what combinations? Hide all of a certain type from ALL users; hide certain records from SOME users? In your split database, does EACH user have a copy of the front-end, or do all share the same front-end? There must be some criteria that determines who sees what records? Once that is identified, then a solution can follow.

SQL Server: Remove substrings from field data by iterating through a table of city names

I have two databases, Database A and Database B.
Database A contains some data which needs to be placed in a table in Database B. However, before that can happen, some of that data must be “cleaned up” in the following way:
The table in Database A which contains the data to be placed in Database B has a field called “Desc.” Every now and then the users of the system put city names in with the data they enter into the “Desc” field. For example: a user may type in “Move furniture to new cubicle. New York. Add electric.”
Before that data can be imported into Database B the word “New York” needs to be removed from that data so that it only reads “Move furniture to new cubicle. Add electric.” However—and this is important—the original data in Database A must remain untouched. In other words, Database A’s data will still read “Move furniture to new cubicle. New York. Add electric,” while the data in Database B will read “Move furniture to new cubicle. Add electric.”
Database B contains a table which has a list of the city names which need to be removed from the “Desc” field data from Database A before being placed in Database B.
How do I construct a stored procedure or function which will grab the data from Database A, then iterate through the Cities table in Database B and if it finds a city name in the “Desc” field will remove it while keeping the rest of the information in that field thus creating a recordset which I can then use to populate the appropriate table in Database B?
I have tried several things but still haven’t cracked it. Yet I’m sure this is probably fairly easy. Any help is greatly appreciated!
Thanks.
EDIT:
The latest thing I have tried to solve this problem is this:
DECLARE @cityName VarChar(50)
While (Select COUNT(*) From ABCScanSQL.dbo.tblDiscardCitiesList) > 0
Begin
    Select @cityName = ABCScanSQL.dbo.tblDiscardCitiesList.CityName FROM ABCScanSQL.dbo.tblDiscardCitiesList
    SELECT JOB_NO, LTRIM(RTRIM(SUBSTRING(JOB_NO, (LEN(job_no) -2), 5))) AS LOCATION
        ,JOB_DESC, [Date_End] , REPLACE(Job_Desc,@cityName,' ') AS NoCity
    FROM fmcs_tables.dbo.Jobt WHERE Job_No like '%loc%'
End
"Job_Desc" is the field which needs to have the city names removed.
This is a data quality issue. You can always make a copy of the [description] in Database A and call it [cleaned_desc].
One simple solution is to write a function that does the following.
1 - Read data from [tbl_remove_these_words]. These are the phrases you want removed.
2 - Compare the input, @var_description, to the rows in the table.
3 - Upon a match, replace the phrase with an empty string.
This solution depends upon a cleansing table that you maintain and update.
Run an update query that passes [description] to [fn_remove_these_words] and sets [cleaned_desc] to the output.
Another solution is to look at products like the Melissa Data (DQ) product for SSIS, or Data Quality Services in the SQL Server stack, which give you an application framework for solving the problem.
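The cleansing function described in steps 1 to 3 can be sketched as follows. This is a Python illustration of the logic rather than a T-SQL function; the name remove_phrases is hypothetical, the phrase list stands in for the rows of the cleansing table, and the case-insensitive matching and the handling of a trailing period are assumptions chosen so that the example from the question comes out as requested.

```python
import re


def remove_phrases(description, phrases):
    """Return a cleaned copy of description with every listed phrase removed.

    The input string itself is never modified, so the source data stays intact.
    """
    cleaned = description
    # Longest phrases first, so "New York City" is removed before "New York".
    for phrase in sorted(phrases, key=len, reverse=True):
        # Also drop whitespace before the phrase and one trailing period,
        # so "cubicle. New York. Add" becomes "cubicle. Add".
        pattern = r"\s*\b" + re.escape(phrase) + r"\b\.?"
        cleaned = re.sub(pattern, "", cleaned, flags=re.IGNORECASE)
    # Collapse any doubled whitespace left behind and trim the ends.
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

For example, remove_phrases("Move furniture to new cubicle. New York. Add electric.", ["New York", "Boston"]) returns "Move furniture to new cubicle. Add electric.". A plain REPLACE of the city name alone would leave the stray ". ." behind, which is why the sketch also consumes the period that follows the city.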

Is it possible to select and delete in the same query with SOLR (apache)

Documents can be added and removed at any time.
I need to remove some documents for historic storage.
Is it possible to select AND delete in the same request? Ideally:
Select&delete data
Store data somewhere else.
Send commit.
Right now, I do:
Select data with criteria + stats=true&stats.field=dateField
Store data somewhere else.
Query delete (with same criteria but using the getMax() value of dateField to not delete newer documents matching the request)

Optimal way to add / update EF entities if added items may or may not already exist

I need some guidance on adding / updating SQL records using EF. Let's say I am writing an application that stores info about files on a hard disk in an EF4 database. When you press a button, it will scan all the files in a specified path (maybe the whole drive) and store information in the database, such as the file size, change date etc. Sometimes the file will already be recorded from a previous run, so its properties should be updated; sometimes a batch of files will be detected for the first time and will need to be added.
I am using EF4, and I am seeking the most efficient way of adding new file information and updating existing records. As I understand it, when I press the search button and files are detected, I will have to check for the presence of a file entity, retrieve its ID field, and use that to add or update related information; but if it does not exist already, I will need to create a tree that represents it and its related objects (eg. its folder path), and add that. I will also have to handle the merging of the folder path object as well.
It occurs to me that if there are many millions of files, as there might be on a server, loading the whole database into the context is not ideal or practical. So for every file, I might conceivably have to make a round trip to the database on disk to detect if the entry exists already, retrieve its ID if it exists, then another trip to update. Is there a more efficient way I can insert/update multiple file object trees in one trip to the DB? If there was an Entity context method like 'Insert If It Doesnt Exist And Update If It Does' for example, then I could wrap up multiple in a transaction?
I imagine this would be a fairly common requirement; how is it best done in EF? Any thoughts would be appreciated. (Oh, my DB is SQLite, if that makes a difference.)
You can check if the record already exists in the DB. If not, create and add the record. You can then set the fields of the record which will be common to insert and update like the sample code below.
var strategy_property_in_db = _dbContext.ParameterValues().Where(r => r.Name == strategy_property.Name).FirstOrDefault();
if (strategy_property_in_db == null)
{
    strategy_property_in_db = new ParameterValue() { Name = strategy_property.Name };
    _dbContext.AddObject("ParameterValues", strategy_property_in_db);
}
strategy_property_in_db.Value = strategy_property.Value;
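The check-then-insert pattern above costs one lookup per record. One way to cut the round trips is to do the existence check once per batch. The sketch below shows that idea with Python's sqlite3 rather than EF, so it illustrates the batching logic only; the files table, its columns, and the function names are hypothetical.

```python
import sqlite3


def upsert_files(conn, files):
    """Insert new file rows and update existing ones with one lookup per batch.

    files maps path -> size. Returns (inserted, updated) counts.
    Keep batches modest: SQLite limits the number of ? parameters per statement.
    """
    paths = list(files)
    if not paths:
        return 0, 0
    placeholders = ",".join("?" for _ in paths)
    # A single query finds which of the scanned paths are already recorded.
    existing = {
        row[0]
        for row in conn.execute(
            f"SELECT path FROM files WHERE path IN ({placeholders})", paths
        )
    }
    to_update = [(files[p], p) for p in paths if p in existing]
    to_insert = [(p, files[p]) for p in paths if p not in existing]
    with conn:  # both statements are committed together, or rolled back on error
        conn.executemany("UPDATE files SET size = ? WHERE path = ?", to_update)
        conn.executemany("INSERT INTO files (path, size) VALUES (?, ?)", to_insert)
    return len(to_insert), len(to_update)


def demo():
    """One file is already recorded, one is new."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT UNIQUE, size INTEGER)"
    )
    conn.execute("INSERT INTO files (path, size) VALUES ('a.txt', 1)")
    counts = upsert_files(conn, {"a.txt": 10, "b.txt": 20})
    rows = conn.execute("SELECT path, size FROM files ORDER BY path").fetchall()
    return counts, rows
```

Splitting the scan results into batches of a few hundred paths keeps each lookup within SQLite's parameter limit while still avoiding a query per file.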