Efficient way to find if a given string is _not_ listed in (sqlite3) table - sql

I have a Db table listing media files which have been archived to LTO (4.3 million of them). The ongoing archiving process is manual, carried out by different people as and when downtime arises. We need an efficient way of determining which files in a folder are not archived so we can complete the job if needed, or confidently delete the folder if it's all archived.
(For the sake of argument let's assume all filenames are unique, we do need to handle duplicates but that's not this question.)
I should probably just fire up Perl/Python/Ruby and talk to the Db thru them. But it would take me quite a while to get back up to speed in those and I have a nagging feeling that it would be overkill.
I can think of two simpler approaches, but each has drawbacks and I wonder if there's a better way still?
Method 1: simply recurse down each directory structure in Bash, invoking sqlite3 per file and outputting the filename if the query returns an empty result.
This is probably less efficient than
Method 2: recurse through the directory structure and produce an sql file which will:
create a table with all our on-disk files in it (let's call it the "working table")
compare that with the archive table - select all files in the working table but not in the archive table
destroy the working table, or quit without saving
While 2 seems likely more efficient than 1, it seems that building the comparison table in the first place might incur some overhead and I did kind of imagine the backup table as a monolithic read-only thing that people refer to and don't write into.
Is there any way in pure SQL to just output a list of not-founds (without them existing in another table)?

Finding values not in some other table is easy:
SELECT *
FROM SomeTable
WHERE File NOT IN (SELECT File
                   FROM OtherTable);
To create the other table, you can write a series of INSERT statements, or just use the .import command of the shell from a plain text file.
A temporary table will not be saved.
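For example, a rough sketch of that (Method 2 in the question), assuming the on-disk names have been dumped one per line into a hypothetical files.txt, that no filename contains the shell's column separator, and using the filename column of Backup_Table as in the question:

sqlite3 /path/to/mydb.db <<'EOF'
CREATE TEMP TABLE working(filename TEXT);
.import files.txt working
SELECT filename FROM working
WHERE filename NOT IN (SELECT filename FROM Backup_Table);
EOF

The TEMP table vanishes when the sqlite3 session ends, so nothing is written into the archive database.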

Sooo, I think I have to answer my own question.
tl;dr - use a scripting language (the thing I was hoping to avoid)
Trying that and the other two approaches (details below) on my system yields the following numbers when checking a 33-file directory structure against the 4.3 million record Db:
A Ruby script: 0.27s
Bash running sqlite3 once per file ("Method 1"): 0.73s
SQL making a temp table and using "NOT IN" (Method 2): 8s
The surprising thing for me is that the all-SQL approach is an order of magnitude slower than Bash. This was true both with the macOS (10.12) command-line sqlite3 and with the GUI "DB Browser for SQLite".
The details
Script method
This is the crux of my Ruby script. Ruby of course is not the fastest language out there and you could probably do better than this (but if you really need speed, it might be time for C)
require "sqlite3"
db = SQLite3::Database.open 'path/to/mydb.db'
# This will skip Posix hidden files, which is fine by me
Dir.glob("search_path/**/*") do |f|
file = File.stat(f)
next unless file.file?
short_name = File.basename(f)
qouted_short_name = short_name.gsub("'", "''")
size = File.size(f)
sql_cmd = "select * from 'Backup_Table' where filename='#{qouted_short_name}' and sizeinbytesincrsrc=#{size}"
count = db.execute(sql_cmd).length
if count == 0
puts "UNARCHIVED: #{f}"
end
end
(Note: the next two are Not The Answer, but I'll include them in case anyone wants to check my methodology.)
Bash
This is a crude Bash recurse-through-files which will print a list of files that are backed up (not what I want, but gives me an idea of speed):
#! /bin/bash
recurse() {
  for file in *; do
    if [ -d "${file}" ]; then
      thiswd=$(pwd)
      (cd "${file}" && recurse)
      cd "${thiswd}"
    elif [ -f "${file}" ]; then
      fullpath=$(pwd)/${file}
      filesize=$(stat -f%z "${file}")
      sqlite3 /path/to/mydb.db "select filename from 'Backup_Table' where filename='$file'"
    fi
  done
}
cd "$1" && recurse
SQL
CL has detailed Method 2 nicely in their answer.

Related

How to rename photo files using awk, such that they are named (and hence ordered) by "date taken"?

I have 3 groups of photos, from 3 different cameras (with time synchronised on all cameras) but with different naming schemes (e.g. IMG_3142.jpg, DCM_022.jpg). I would like to rename every photo file with the following naming convention:
1_yyyy_mm_dd_hh_mm_ss.jpg for earliest
2_yyyy_mm_dd_hh_mm_ss.jpg for next earliest, and so on,
until we reach around 5000_yyyy_mm_dd_hh_mm_ss.jpg for the last one (i.e. the most recent)
I would like the yyyy_mm_dd_hh_mm_ss field to be replaced by the “date and time taken” value for when the photo was taken, which is saved in the metadata/properties of each file.
I have seen awk used to carry out similar operations but I'm not familiar enough to know how to access the “time taken” metadata, etc.
Also, not that this should make a difference: my computer is a Mac.
You can use jhead for this. The command is:
jhead -n%Y_%m_%d_%H_%M_%S *.jpg
Make a COPY of your files before running it! You can install jhead with Homebrew using:
brew install jhead
Or, if you don't have homebrew, download here for OS X.
That will get you the date in the filename as you wish. The sequence number is a little more difficult. Try what I am suggesting above and, if you are happy with that, we can work on the sequence number maybe. Basically, you would run jhead again to set the file modification times of your files to match the time they were shot - then the files can be made to show up in the listing in date order and we can put your sequence number on the front.
So, to get the file's date set on the computer to match the time it was taken, do:
jhead -ft *.jpg
Now all the files will be dated on your computer to match the time the photos were taken. Then we need to whizz through them in a loop with our script adding in the sequence number:
#!/bin/bash
seq=1
# List files in order, oldest first
for f in $(ls -rt *jpg)
do
   # Work out new name
   new="${seq}_$f"
   echo Rename $f as $new
   # Remove "#" from start of following command if things look good so the renaming is actually done
   # mv "$f" "$new"
   ((seq++))
done
You would save that in your HOME directory as renamer, then you would go into Terminal and make the script executable like this:
chmod +x renamer
Then you need to go to where your photos are, say Desktop/Photos
cd "$HOME/Desktop/Photos"
and run the script
$HOME/renamer
That should do it.
By the way, I wonder how wise it is to use a simple sequence number at the start of your filenames because that will not make them come up in order when you look at them in Finder.
Think of file 20, i.e. 20_2015_02_03_11_45_52.jpg. Now imagine that files starting with 100-199 will be listed BEFORE file 20, and files 1000-1999 will also be listed before file 20 - because their leading 1s sort before file 20's leading 2. So, you may want to name your files:
0001_...
0002_...
0003_...
...
0019_...
0020_...
then they will come up in sequential order in Finder. If you want that, use this script instead:
#!/bin/bash
seq=1
for f in $(ls -rt *jpg)
do
   # Generate new name with zero-padded sequence number
   new=$(printf "%04d_%s" $seq "$f")
   echo Rename $f as $new
   # Remove "#" from start of following command if things look good so the renaming is actually done
   # mv "$f" "$new"
   ((seq++))
done

Buffering output with AWK

I have an input file which consists of three parts:
inputFirst
inputMiddle
inputLast
Currently I have an AWK script which, given this input, creates an output file consisting of two parts:
outputFirst
outputLast
where outputFirst and outputLast are generated (on the fly) from inputFirst and inputLast respectively. However, to calculate the outputMiddle part (which is only one line) I need to scan the entire input, so I store it in a variable. The problem is that the value of this variable should go in between outputFirst and outputLast in the output file.
Is there a way to solve this using a single portable AWK script that takes no arguments? Is there a portable way to create temporary files in an AWK script or should I store the output from outputFirst and outputLast in two variables? I suspect that using variables will be quite inefficient for large files.
All versions of AWK (since at least 1985) can do basic I/O redirection to files or pipelines, just like the shell can, as well as run external commands without I/O redirection.
So, there are any number of ways to approach your problem and solve it without having to read the entire input file into memory. The optimal solution will depend on exactly what you're trying to do, and what constraints you must honour.
A simple approach to the more precise example problem you describe in your comment above might go something like this: in the BEGIN clause, form two unique filenames with rand() (and define your variables). Then read and sum the first 50 numbers from standard input while also writing them to a temporary file, and continue by reading and summing the next 50 numbers while writing them to a second file. Finally, in an END clause, use a loop to read the first temporary file with getline and write it to standard output, print the total sum, read the second temporary file the same way and write it to standard output, and call system("rm " file1 " " file2) to remove the temporary files.
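A hedged sketch of that outline (the 50-number blocks, the input file name and the temp-file naming are just the assumptions restated from the comment, not a drop-in solution):

awk 'BEGIN { srand(); f1 = "tmp." rand(); f2 = "tmp." rand() }   # unique-ish temp names
     NR <= 50             { s += $1; print > f1 }   # first block: sum and stash
     NR > 50 && NR <= 100 { s += $1; print > f2 }   # second block: sum and stash
     END {
         close(f1); close(f2)
         while ((getline line < f1) > 0) print line   # replay the first block
         print s                                      # the "middle" line: the total
         while ((getline line < f2) > 0) print line   # replay the second block
         system("rm " f1 " " f2)                      # clean up the temp files
     }' numbers.txt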
If the output file is not too large (whatever that is), saving outputLast in a variable is quite reasonable. The first part, outputFirst, can (as described) be generated on the fly. I tried this approach and it worked fine.
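A minimal sketch of that variable-based variant, assuming (arbitrarily) that outputFirst is just the first 100 lines echoed through and that the middle value is a running sum of column 1:

awk 'NR <= 100 { print }                 # outputFirst: printed on the fly
     { s += $1 }                         # outputMiddle: needs the whole input
     NR > 100  { last = last $0 ORS }    # outputLast: buffered in a variable
     END { print s; printf "%s", last }' inputfile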
Print the "first" output while processing the file, then write the remainder to a temporary file until you have written the middle.
Here is a self-contained shell script which processes its input files and writes to standard output.
#!/bin/sh
t=$(mktemp -t middle.XXXXXXXXX) || exit 127
trap 'rm -f "$t"' EXIT
trap 'exit 126' HUP INT TERM
awk -v temp="$t" 'NR < 500000 { print }
    { s += $1 }
    NR >= 500000 { print >> temp }
    END { print s }' "$@"
cat "$t"
For illustration purposes, I used really big line numbers. I'm afraid your question is still too vague to really obtain a less general answer, but perhaps this can help you find the right direction.

How to split SQL in Mac OS X? - sql

Is there any app for Mac to split SQL files, or even a script?
I have a large file which I have to upload to a host that doesn't support files over 8 MB.
*I don't have SSH access
You can use this : http://www.ozerov.de/bigdump/
Or
Use this command to split the SQL file:
split -l 5000 ./path/to/mysqldump.sql ./mysqldump/dbpart-
The split command takes a file and breaks it into multiple files. The -l 5000 part tells it to split the file every five thousand lines. The next bit is the path to your file, and the next part is the path you want to save the output to. Files will be saved as whatever filename you specify (e.g. “dbpart-”) with an alphabetical letter combination appended.
Now you should be able to import your files one at a time through phpMyAdmin without issue.
More info http://www.webmaster-source.com/2011/09/26/how-to-import-a-very-large-sql-dump-with-phpmyadmin/
This tool should do the trick: MySQLDumpSplitter
It's free and open source.
Unlike the accepted answer to this question, this app will always keep extended inserts intact so the precise form of your query doesn't matter; the resulting files will always have valid SQL syntax.
Full disclosure: I am a shareholder of the company that hosts this program.
The UploadDir feature in phpMyAdmin could help you, if you have FTP access and can modify your phpMyAdmin's configuration (or are allowed to install your own instance of phpMyAdmin).
http://docs.phpmyadmin.net/en/latest/config.html?highlight=uploaddir#cfg_UploadDir
You can split into working SQL statements with:
csplit -s -f db-part db.sql "/^# Dump of table/" "{99}"
This splits db.sql into files named db-part00, db-part01, and so on, splitting at each "# Dump of table" line (the pattern is repeated up to 99 times).
You can use "CREATE TABLE" or "INSERT INTO" instead of "# Dump of ..."
Also: Avoid installing any programs or uploading your data into any online service. You don't know what will be done with your information!

Finding files in subdirectories created after a certain date

I'm in the process of writing a bash script (just learning it) which needs to find files in subdirectories created after a certain date. I have a folder /images/ with jpegs in various subfolders - I want to find all jpegs uploaded to that directory (or any subdirectories) after a certain date. I know about the -mtime flag, but my "last import" date is stored in %Y-%m-%d format and it'd be nice to use that if possible?
Also, each file/pathname will then be used to generate a MySQL SELECT query. I know find generally outputs the filenames found, line-by-line. But if find isn't actually the command that I should be using, it'd be nice to have a similar output format I could use to generate the SELECT query (WHERE image.file_name IN (...))
Try the script below:
DATE=<<date>>
SEARCH_PATH=/images/
# Strip the dashes from the YYYY-MM-DD date and append hhmm so touch -t accepts it
DATE=`echo $DATE|sed 's/-//g'`
DATE=$DATE"0000"
# Create a reference file carrying that timestamp
FILE=~/timecheck_${RANDOM}_$(date +"%Y%m%d%H%M")
touch -t $DATE $FILE
# List everything newer than the reference file, quoted and comma-separated for the IN (...) clause
find $SEARCH_PATH -newer $FILE 2>/dev/null|awk 'BEGIN{f=0}{if(f==1)printf("\"%s\", ",l);l=$0;f=1}END{printf("\"%s\"",l)}'
rm -f $FILE
You can convert your date into the "last X days" format that find -mtime expects.
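For example, a rough sketch of that conversion (this uses BSD/macOS date syntax; GNU date would use -d "$since" instead of -j -f):

since="2015-02-03"   # example "last import" date
days=$(( ( $(date +%s) - $(date -j -f "%Y-%m-%d" "$since" "+%s") ) / 86400 ))
# Note: -mtime works in whole 24-hour periods, so this is approximate
find /images -type f -name '*.jpg' -mtime -"$days"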
find is the correct command for this task. Send its output somewhere, then parse the file into the query.
Beware of SQL injection attacks if the files were uploaded by users. Beware of special-character quoting even if they weren't.
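A rough sketch of gluing the find output into that query (the table and column names come from the question; it naively assumes filenames contain no quotes or newlines, which is exactly why the warning above matters):

# $FILE is the reference file created with touch -t in the script above
names=$(find /images -type f -name '*.jpg' -newer "$FILE" \
    | sed "s/'/''/g; s/^/'/; s/\$/'/" \
    | paste -sd, -)
echo "SELECT * FROM image WHERE image.file_name IN ($names);"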

Script for Testing with Input files and Output Solutions

I have a set of *.in files and a set of *.soln files with matching file names. I would like to run my program with the *.in file as input and compare the output to the ones found in the *.soln files. What would be the best way to go about this? I can think of 3 options.
Write some driver in Java to list files in the folder, run the program, and compare. This would be tedious and difficult.
Write a bash script to do this. How?
Write a python script to do this?
I would go for the bash solution. Also, given that what you are doing is a test, I would always save the output of myprogram so that if there are failures, you still have the output to compare against.
#!/bin/bash
for infile in *.in; do
  basename=${infile%.*}
  myprogram "$infile" > "$basename.output"
  diff "$basename.output" "$basename.soln"
done
Add checking of exit statuses etc. as required by your report.
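For instance, a hedged sketch of that kind of reporting, reusing the same placeholder names (myprogram, *.in, *.soln) as above:

#!/bin/bash
fails=0
for infile in *.in; do
    base=${infile%.*}
    myprogram "$infile" > "$base.output"
    if ! diff -q "$base.output" "$base.soln" >/dev/null; then
        echo "FAIL: $infile"
        fails=$((fails + 1))
    fi
done
echo "$fails test(s) failed"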
If the program exists, I suspect the bash script is the best bet.
If your soln files are named right, some kind of loop like
for file in base*.soln
do
  myprogram "${file%.soln}.in" > "new_$file"
  diff "$file" "new_$file"
done
Of course, you can check the exit code of diff and
do various other things to create a test report . . .
That looks simplest to me . . .
Karl
This is primarily a problem that requires the use of the file-system with minimal logic. Bash isn't a bad choice for such problems. If it turns out you want to do something more complicated than just comparing for equality, Python becomes a more attractive choice. Java doesn't seem like a good choice for a throwaway script such as this.
A basic bash implementation might look something like this:
cd dir_with_files
program=your_program
input_ext=".in"
compare_to_ext=".soln"
for file in *"$input_ext"; do
  diff <("$program" "$file") "${file%$input_ext}$compare_to_ext"
done
Untested.