Storing a hash in batches

Storing a hash in batches - sql

I'm trying to retrieve a bunch of rows using sql (as a test - lets say 1000 rows in each iteration up to a million rows) and store in a file (.store file in my case but could be a text file - doesn't matter) in batches to avoid an out of memory issue. I am sql within a perl script.
Will appreciate if anyone can share an example.
example would be like -
sub query{
$test = "select * from employees";
return $test;
}
// later in the code -
my $temp;
my $dataset=DBUtils::make_database_iterator({query=> test($temp)});
}
store $dataset, $result_file;
return;

The best I can offer you with the limited amount of information you have given is this, which uses the SELECT statement's LIMIT clause to retrieve a limited number of rows from the table.
Obviously you will have to provide actual values for the DSN, the name of the table, and the store_block subroutine yourself.
use strict;
use warnings;
use autodie;
use DBI;
my $blocksize = 1000;
my ($dsn, $user, $pass) = (...);
my $dbh = DBI->connect($dsn, $user, $pass);
my $sth = $dbh->prepare('SELECT * FROM table LIMIT ? OFFSET ?') or die $DBI::errstr;
open my $fh, '>', 'test.store';
for (my $n = 0; $sth->execute($blocksize, $n * $blocksize); ++$n) {
my $block = $sth->fetchall_arrayref;
last unless #$block;
store_block($block, $fh);
}
close $fh;
sub store_block {
my ($block, $fh) = #_;
...
}

You say you want to work in batches to avoid an out of memory error. This suggests you're doing something like this...
my #all_the_rows = query_the_database($sql);
store_stuff(#all_the_rows);
You want to avoid doing that as much as possible for exactly the reason you gave, if the dataset grows large you might run out of memory.
Instead, you can read one row at a time and write one row at a time using DBI.
use strict;
use warnings;
use DBI;
# The file you're writing results to
my $file = '...';
# Connect to the database using DBI
my $dbh = DBI->connect(
...however you do that...,
{RaiseError => 1} # Turn on exceptions
);
# Prepare and execute the statement
my $sth = $dbh->prepare("SELECT * FROM employees");
$sth->execute;
# Get a row, write a row.
while( my $row = $sth->fetchrow_arrayref ) {
append_row_to_storage($row, $file);
}
I leave writing append_row_to_storage up to you.

Related

Efficient semantic triples with Perl, without external db server

I have several semantic triples. Some examples:
Porky,species,pig // Porky's species is "pig"
Bob,sister,May // Bob's sister is May
May,brother,Sam // May's borther is Sam
Sam,wife,Jane // Sam's wife is Jane
... and so on ...
I store each triple in 6 different hashes. Example:
$ijk{Porky}{species}{pig} = 1;
$ikj{Porky}{pig}{species} = 1;
$jik{species}{Porky}{pig} = 1;
$jki{species}{pig}{Porky} = 1;
$kij{pig}{Porky}{species} = 1;
$kji{pig}{species}{Porky} = 1;
This lets me efficiently ask questions like:
What species is Porky (keys %{$ijk{Porky}{species}})
List all pigs (keys %{$jki{species}{pig}})
What information do I have on Porky? (keys %{$ijk{Porky}})
List all species (keys %{$jik{species}})
and so on. Note that none of the examples above go through a list one element at a time. They all take me "instantly" to my answer. In other words, each answer is a hash value. Of course, the answer itself may be a list, but I don't traverse any lists to get to that answer.
However, defining 6 separate hashes seems really inefficient. Is there
an easier way to do this without using an external database engine
(for this question, SQLite3 counts as an external database engine)?
Or have I just replicated a small subset of SQL into Perl?
EDIT: I guess what I'm trying to say: I love associative arrays, but they seem to be the wrong data structure for this job. What's the right data structure here, and what Perl module implements it?

Have you looked at using RDF::Trine? It has DBI-backed stores, but it also has in-memory stores, and can parse/serialize in RDF/XML, Turtle, N-Triples, etc if you need persistence.
Example:
use strict;
use warnings;
use RDF::Trine qw(statement literal);
my $ns = RDF::Trine::Namespace->new("http://example.com/");
my $data = RDF::Trine::Model->new;
$data->add_statement(statement $ns->Peppa, $ns->species, $ns->Pig);
$data->add_statement(statement $ns->Peppa, $ns->name, literal 'Peppa');
$data->add_statement(statement $ns->George, $ns->species, $ns->Pig);
$data->add_statement(statement $ns->George, $ns->name, literal 'George');
$data->add_statement(statement $ns->Suzy, $ns->species, $ns->Sheep);
$data->add_statement(statement $ns->Suzy, $ns->name, literal 'Suzy');
print "Here are the pigs...\n";
for my $pig ($data->subjects($ns->species, $ns->Pig)) {
my ($name) = $data->objects($pig, $ns->name);
print $name->literal_value, "\n";
}
print "Let's dump all the data...\n";
my $ser = RDF::Trine::Serializer::Turtle->new;
print $ser->serialize_model_to_string($data), "\n";
RDF::Trine is quite a big framework, so has a bit of a compile-time penalty. At run-time it's relatively fast though.
RDF::Trine can be combined with RDF::Query if you wish to query your data using SPARQL.
use RDF::Query;
my $q = RDF::Query->new('
PREFIX : <http://example.com/>
SELECT ?name
WHERE {
?thing :species :Pig ;
:name ?name .
}
');
my $r = $q->execute($data);
print "Here are the pigs...\n";
while (my $row = $r->next) {
print $row->{name}->literal_value, "\n";
}
RDF::Query supports both SPARQL 1.0 and SPARQL 1.1. RDF::Trine and RDF::Query are both written by Gregory Williams who was a member of the SPARQL 1.1 Working Group. RDF::Query was one of the first implementations to achieve 100% on the SPARQL 1.1 Query test suite. (It may have even been the first?)

"Efficient" is not really the right word here since you're worried about improving speed in exchange for memory, which is generally how it works.
Only real alternative is to store the triplets as distinct values, and then just have three "indexes" into them:
$row = [ "Porky", "species", "pig" ];
push #{$subject_index{Porky}}, $row;
push #{$relation_index{species}}, $row;
push #{$target_index{pig}}, $row;
To do something like "list all pigs", you'd have to find the intersection of $relation_index{species} and $target_index{pig}. Which you can do manually, or with your favorite set implementation.
Then wrap it all up in a nice object interface, and you've basically implemented INNER JOIN. :)

A single hash of hash should be sufficient:
use strict;
use warnings;
use List::MoreUtils qw(uniq);
use Data::Dump qw(dump);
my %data;
while (<DATA>) {
chomp;
my ($name, $type, $value) = split ',';
$data{$name}{$type} = $value;
}
# What species is Porky?
print "Porky's species is: $data{Porky}{species}\n";
# List all pigs
print "All pigs: " . join(',', grep {defined $data{$_}{species} && $data{$_}{species} eq 'pig'} keys %data) . "\n";
# What information do I have on Porky?
print "Info on Porky: " . dump($data{Porky}) . "\n";
# List all species
print "All species: " . join(',', uniq grep defined, map $_->{species}, values %data) . "\n";
__DATA__
Porky,species,pig
Bob,sister,May
May,brother,Sam
Sam,wife,Jane
Outputs:
Porky's species is: pig
All pigs: Porky
Info on Porky: { species => "pig" }
All species: pig

I think you are mixing categories and values, such as name=Porky, and species=pig.
Given your example, I'd go with something like this:
my %hash;
$hash{name}{Porky}{species}{pig} = 1;
$hash{species}{pig}{name}{Porky} = 1;
$hash{name}{Bob}{sister}{May} = 1;
$hash{sister}{May}{name}{Bob} = 1;
$hash{name}{May}{brother}{Sam} = 1;
$hash{brother}{Sam}{name}{May} = 1;
$hash{name}{Sam}{wife}{Jane} = 1;
$hash{wife}{Jane}{name}{Sam} = 1;
Yes, this has some apparent redundancy, since we can easily distinguish most names from other values. But the 3rd-level hash key is also a top level hash key, which can be used to get more information on some element.

Or have I just replicated a small subset of SQL into Perl?
It's pretty easy to start using actual SQL, using an SQLite in memory database.
#!/usr/bin/perl
use warnings; use strict;
use DBI;
my $dbh = DBI->connect("dbi:SQLite::memory:", "", "", {
sqlite_use_immediate_transaction => 0,
RaiseError => 1,
});
$dbh->do("CREATE TABLE triple(subject,predicate,object)");
$dbh->do("CREATE INDEX 'triple(subject)' ON triple(subject)");
$dbh->do("CREATE INDEX 'triple(predicate)' ON triple(predicate)");
$dbh->do("CREATE INDEX 'triple(object)' ON triple(object)");
for ([qw<Porky species pig>],
[qw<Porky color pink>],
[qw<Sylvester species cat>]) {
$dbh->do("INSERT INTO triple(subject,predicate,object) VALUES (?, ?, ?)", {}, #$_);
}
use JSON;
print to_json( $dbh->selectall_arrayref('SELECT * from triple WHERE predicate="species"', {Slice => {}}) );
Gives:
[{"object":"pig","predicate":"species","subject":"Porky"},
{"object":"cat","predicate":"species","subject":"Sylvester"}]
You can then query and index the data in a familiar manner. Very scalable as well.

How to check a variable is it the Database and print it out?

I got a list of variables to loop through the database. How can I detect the variable is not in the database? What query should I use? How to print out error message once detected the variable is not in database.
My Code:
$variable = $sql->{'variable'};
foreach my $sql (#Records){
**Below statement will select existed variable, what should I change to make it select not existed variable**
$sqlMySQL = "Select LOT from table where LOT like '%$variable%'";
}
**If not exist**{
print("Not exist")
}
Expected result:
While the $variable loop through the database, if the $variable not exist in the database then print out the $variable or not exist.
Thanks for viewing, comments and answers.

I would go about it similar to the below.
A list of variables - Place those variables in an array (aka a list)
What query should I use - One that will only select exactly what you need and store it in the best dataset for traversal (selectall_hashref)
While the $variable loop through the database - Would require a DBI call for each $variable, so instead loop through your array to check for existence in the hash.
EXAMPLE
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
my $dbh =
DBI->connect( "dbi:SQLite:dbname=test.db", "USERNAME", "PASSWORD",
{ RaiseError => 1 },
) or die $DBI::errstr;
my #vars = ( 25, 30, 40 );
my $hash_ref =
$dbh->selectall_hashref( q / SELECT LOT FROM table /, q / LOT / );
$dbh->disconnect();
foreach (#vars) {
if ( exists $hash_ref->{$_} ) {
print $_ . "\n";
}
else {
print "Does not exist\n";
}
}
Something similar to that will pull all the LOT column values for your table into a hash key value pair that you can then compare against your array in a foreach loop.

How to convert a list of attribute-value pairs into a flat table whose columns are attributes

I'm trying to convert a csv file containing 3 columns (ATTRIBUTE_NAME,ATTRIBUTE_VALUE,ID) into a flat table whose each row is (ID,Attribute1,Attribute2,Attribute3,....). The samples of such tables are provided at the end.
Either Python, Perl or SQL is fine. Thank you very much and I really appreciate your time and efforts!
In fact, my question is very similar to this post, except that in my case the number of attributes is pretty big (~300) and not consistent across each ID, so hard coding each attribute might not be a practical solution.
For me, the challenging/difficult parts are:
There are approximately 270 millions lines of input, the total size of the input table is about 60 GB.
Some single values (string) contain comma (,) within, and the whole string will be enclosed with double-quote (") to make the reader aware of that. For example "JPMORGAN CHASE BANK, NA, TX" in ID=53.
The set of attributes is not the same across ID's. For example, the number of overall attributes is 8, but ID=53, 17 and 23 has only 7, 6 and 5 respectively. ID=17 does not have attributes string_country and string_address, so output blank/nothing after the comma.
The input attribute-value table looks like this. In this sample input and output, we have 3 ID's, whose number of attributes can be different depending on we can obtain such attributes from the server or not.
ATTRIBUTE_NAME,ATTRIBUTE_VALUE,ID
num_integer,100,53
string_country,US (United States),53
string_address,FORT WORTH,53
num_double2,546.0,53
string_acc,My BankAcc,53
string_award,SILVER,53
string_bankname,"JPMORGAN CHASE BANK, NA, TX",53
num_integer,61,17
num_double,34.32,17
num_double2,200.541,17
string_acc,Your BankAcc,17
string_award,GOLD,17
string_bankname,CHASE BANK,17
num_integer,36,23
num_double,78.0,23
string_country,CA (Canada),23
string_address,VAN COUVER,23
string_acc,Her BankAcc,23
The output table should look like this. (The order of attributes in the columns is not fixed. It can be sorted alphabetically or by order-of-appearance.)
ID,num_integer,num_double,string_country,string_address,num_double2,string_acc,string_award,string_bankname
53,100,,US (United States),FORT WORTH,546.0,My BankAcc,SILVER,"JPMORGAN CHASE BANK, NA, TX"
17,61,34.32,,,200.541,Your BankAcc,GOLD,CHASE BANK
23,36,78.0,CA (Canada),VAN COUVER,,Her BankAcc,,

This program will do as you ask. It expects the name of the input file as a parameter on the command line.
Update Looking more carefully at the data I see that not all of the data fields are available for every ID. That makes things more complex if the fields are to be kept in the same order as they appear in the file.
This program works by scanning the file and accumulating all the data for output into hash %data. At the same time it builds a hash %headers, that keeps the position each header appears in the data for each ID value.
Once the file has been scanned, the collected headers are sorted by finding the first ID for each pair that includes information for both headers. The sort order for that pair within the complete set must be the same as the order they appeared in the data for that ID, so it's just a matter of comparing the two position values using <=>.
Once a sorted set of headers has been created, the %data hash is dumped, accessing the complete list of values for each ID using a hash slice.
Update 2 Now that I realise the sheer size of your data I can see that my second attempt was also flawed, as it tried to read all of the information into memory before outputting it. That isn't going to work unless you have a monster machine with about 1TB of memory!
You may get some mileage from this version. It scans twice through the file, the first time to read the data so that the full set of header names can be created and ordered, then again to read the data for each ID and output it.
Let me know if it's not working for you, as there's still things I can do to make it more memory-efficient.
use strict;
use warnings;
use 5.010;
use Text::CSV;
use Fcntl 'SEEK_SET';
my $csv = Text::CSV->new;
open my $fh, '<', $ARGV[0] or die qq{Unable to open "$ARGV[0]" for input: $!};
my %headers = ();
my $last_id;
my $header_num;
my $num_ids;
while (my $row = $csv->getline($fh)) {
next if $. == 1;
my ($key, $val, $id) = #$row;
unless (defined $last_id and $id eq $last_id) {
++$num_ids;
$header_num = 0;
$last_id = $id;
print STDERR "Processing ID $id\n";
}
$headers{$key}[$num_ids-1] = ++$header_num;
}
sub by_position {
for my $id (0 .. $num_ids-1) {
my ($posa, $posb) = map $headers{$_}[$id], our $a, our $b;
return $posa <=> $posb if $posa and $posb;
}
0;
}
my #headers = sort by_position keys %headers;
%headers = ();
print STDERR "List of headers complete\n";
seek $fh, 0, SEEK_SET;
$. = 0;
$csv->combine('ID', #headers);
print $csv->string, "\n";
my %data = ();
$last_id = undef;
while () {
my $row = $csv->getline($fh);
next if $. == 1;
if (not defined $row or defined $last_id and $last_id ne $row->[2]) {
$csv->combine($last_id, #data{#headers});
print $csv->string, "\n";
%data = ();
}
last unless defined $row;
my ($key, $val, $id) = #$row;
$data{$key} = $val;
$last_id = $id;
}
output
ID,num_integer,num_double,string_country,string_address,num_double2,string_acc,string_award,string_bankname
53,100,,"US (United States)","FORT WORTH",546.0,"My BankAcc",SILVER,"JPMORGAN CHASE BANK, NA, TX"
17,61,34.32,,,200.541,"Your BankAcc",GOLD,"CHASE BANK"
23,36,78.0,"CA (Canada)","VAN COUVER",,"Her BankAcc",,

Use Text::CSV from CPAN:
#!/usr/bin/env perl
use strict;
use warnings;
# --------------------------------------
use charnames qw( :full :short );
use English qw( -no_match_vars ); # Avoids regex performance penalty
use Text::CSV;
my $col_csv = Text::CSV->new();
my $id_attr_csv = Text::CSV->new({ eol=>"\n", });
$col_csv->column_names( $col_csv->getline( *DATA ));
while( my $row = $col_csv->getline_hr( *DATA )){
# do all the keys but skip if ID
for my $attribute ( keys %$row ){
next if $attribute eq 'ID';
$id_attr_csv->print( *STDOUT, [ $attribute, $row->{$attribute}, $row->{ID}, ]);
}
}
__DATA__
ID,num_integer,num_double,string_country,string_address,num_double2,string_acc,string_award,string_bankname
53,100,,US (United States),FORT WORTH,546.0,My BankAcc,SILVER,"JPMORGAN CHASE BANK, NA, TX"
17,61,34.32,,,200.541,Your BankAcc,GOLD,CHASE BANK
23,36,78.0,CA (Canada),VAN COUVER,,Her BankAcc,,

How can I UPDATE rows returned by a SELECT in a loop?

I have a loop on the rows returned by an SQL SELECT statement, and, after some processing on a row's data, I sometimes want to UPDATE the row's value. The processing in the loop's body is non-trivial, and I can't write it in SQL. When I try to execute the UPDATE for the selected row I get an error (under Perl's DBD::SQLite::st execute failed: database table is locked). Is there a readable, efficient, and portable way to achieve what I'm trying to do? Failing that, is there a DBD or SQLite-specific way to do it?
Obviously, I can push the updates in separate data structure and execute them after the loop, but I'd hate the code's look after that.
If you're interested, here is the corresponding Perl code.
my $q = $dbh->prepare(q{
SELECT id, confLoc FROM Confs WHERE confLocId ISNULL});
$q->execute or die;
my $u = $dbh->prepare(q{
UPDATE Confs SET confLocId = ? WHERE id = ?});
while (my $r = $q->fetchrow_hashref) {
next unless ($r->{confLoc} =~ m/something-hairy/);
next unless ($locId = unique_name_state($1, $2));
$u->execute($locId, $r->{id}) or die;
}

Temporarily enable AutoCommit:
sqlite> .header on
sqlite> select * from test;
field
one
two
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
my $dbh = DBI->connect('dbi:SQLite:test.db', undef, undef,
{ RaiseError => 1, AutoCommit => 0}
);
test_select_with_update($dbh);
sub test_select_with_update {
my ($dbh) = #_;
local $dbh->{AutoCommit} = 1;
my $q = $dbh->prepare(q{SELECT field FROM test});
my $u = $dbh->prepare(q{UPDATE test SET field = ? WHERE field = ?});
$q->execute or die;
while ( my $r = $q->fetchrow_hashref ) {
if ( (my $f = $r->{field}) eq 'one') {
$u->execute('1', $f) or die;
}
}
}
After the code has been run:
sqlite> .header on
sqlite> select * from test;
field
1
two

Your problem is that you're using the same database handler to perform an update while you're in a fetching loop.
So have another instance of your database handler to perform the updates:
my $dbh = DBI->connect(...);
my $dbhForUpdate = DBI->connect(...) ;
Then use dbhForUpdate in your loop:
while(my $row = $sth->fetch()){
...
$dbhForUpdate->do(...) ;
}
Anyway, I wouldn't recommend doing this since there's good chances you run into concurrency issues at the database level.

More in answer to Zoidberg's comment but if your were able to switch to an ORM like Perl's DBIx::Class then you find that you could write something like this:
my $rs = $schema->resultset('Confs')->search({ confLocId => undef });
while ( my $data = $rs->next ) {
next unless $data->confLoc =~ m/(something)-(hairy)/;
if ( my $locId = unique_name_state( $1, $2 ) ) {
$data->update({ confLocID => $locid });
}
}
And if DBIx::Class doesn't grab your fancy there are a few others on CPAN like Fey::ORM and Rose::DB for example.

How can I avoid the program quitting when Perl's DBI encounters an error preparing a statement?

I'm making a script that goes through a table that contains all the other table names on the database. As it parses each row, it checks to see if the table is empty by
select count(*) cnt from $table_name
Some tables don't exist in the schema anymore and if I do that
select count(*)
directly into the command prompt, it returns the error:
206: The specified table (adm_rpt_rec) is not in the database.
When I run it from inside Perl, it appends this to the beginning:
DBD::Informix::db prepare failed: SQL: -
How can I avoid the program quitting when it tries to prepare this SQL statement?

One option is not to use RaiseError => 1 when constructing $dbh. The other is to wrap the prepare in an eval block.

Just put the calls that may fail in an eval block like this:
for my $table (#tables) {
my $count;
eval {
($count) = $dbi->selectrow_array("select count(*) from $table");
1; #this is here so the block returns true if it succeeds
} or do {
warn $#;
next;
}
print "$table has $count rows\n";
}
Although, in this case, since you are using Informix, you have a much better option: the system catalog tables. Informix keeps metadata like this in a set of system catalog tables. In this case you want systables:
my $sth = $dbh->prepare("select nrows from systables where tabname = ?");
for my $table (#tables) {
$sth->execute($table);
my ($count) = $sth->fetchrow_array;
$sth->finish;
unless (defined $count) {
print "$table does not exist\n";
next;
}
print "$table has $count rows\n";
}
This is faster and safer than count(*) against the table. Full documentation of the system catalog tables can be found in IBM Informix Guide to SQL (warning this is a PDF).

Working code - assuming you have a 'stores' database.
#!/bin/perl -w
use strict;
use DBI;
my $dbh = DBI->connect('dbi:Informix:stores','','',
{RaiseError=>0,PrintError=>1}) or die;
$dbh->do("create temp table tlist(tname varchar(128) not null) with no log");
$dbh->do("insert into tlist values('systables')");
$dbh->do("insert into tlist values('syzygy')");
my $sth = $dbh->prepare("select tname from tlist");
$sth->execute;
while (my($tabname) = $sth->fetchrow_array)
{
my $sql = "select count(*) cnt from $tabname";
my $st2 = $dbh->prepare($sql);
if ($st2)
{
$st2->execute;
if (my($num) = $st2->fetchrow_array)
{
print "$tabname: $num\n";
}
else
{
print "$tabname: error - missing?\n";
}
}
}
$sth->finish;
$dbh->disconnect;
print "Done - finished under control.\n";
Output from running the code above.
systables: 72
DBD::Informix::db prepare failed: SQL: -206: The specified table (syzygy) is not in the database.
ISAM: -111: ISAM error: no record found. at xx.pl line 14.
Done - finished under control.
This printed the error (PrintError=>1), but continued. Change the 1 to 0 and no error appears. The parentheses in the declarations of $tabname and $num are crucial - array context vs scalar context.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas