How to select only a few fields using Kotlin Exposed?

I have a big database of 2,600,000 records and I want to do some advanced searches on it by looping over all records. However, running a script with selectAll() takes a very long time to load.
As a workaround, I'm looping over 100,000 records 26 times using this code:
for (i in 0 until 26) {
    transaction {
        for (app in AppsTable.selectAll().limit(n = 100000, offset = i * 100000L)) {
            //..analysis
        }
    }
}
How can I speed up this query, or, if possible, how can I reduce the result set by querying just the columns I need to work with? For example, can I do something like this:
AppsTable.selectAll(AppsTable.name, AppsTable.downloadCount, AppsTable.developerId)

Just use the slice() method before selectAll() like this:
AppsTable.slice(AppsTable.name, AppsTable.downloadCount, AppsTable.developerId)
.selectAll()
.limit(n = 100000, offset = i * 100000L)

Related

Interpolation of missing values in Pentaho

I am trying to fill in the missing values using Pentaho PDI.
Input:
Desired output:
So far I have only found "Filling data gaps in a stream in Pentaho Data Integration, is it possible?", but that approach fills in the last known value rather than interpolating.
Potentially, I thought I could work with the above solution: I also added the next amount to the Analytic Query step, along with the next date. Then I added a flag in the Clone step, and I filter the original results from the input into a Dummy step and the generated results (from the Calculator) into a Calculator step (at the moment). Then, potentially, I can dump that separate stream to a temp table in a database and run an SQL query that does the rolling subtraction. I am also investigating the JavaScript step.
I disregarded the Python or R Executor step because in the end I will be running the job on an AWS VM, and I already foresee the pain I will go through with the installation.
What would be your suggestions? Is there a simple way to do interpolation?
Updated for the question
The method provided in your link does work in my testing (I am using LAG instead of LEAD for your task, though). I am not looking to replicate that method here; this is just another option for you, using JavaScript to build the logic, which you might also extend to other applications:
In the testing below (done on PDI 8.0), the transformation has 5 steps:
Data Grid step to create testing data with three fields: date, account number and amount
Sort rows step to sort the rows by account number and date. This is required for the Analytic Query step; if your source data are already sorted, skip this step
Analytic Query step to create two more fields: prev_date and prev_amount
Modified Java Script Value step: add the following code; nothing else needs to be configured in this step:
var days_diff = dateDiff(prev_date, date, "d");
if (days_diff > 0) {
    /* retrieve index for two fields: 'date', 'amount'
     * and modify their values accordingly
     */
    var idx_date = getInputRowMeta().indexOfValue("date");
    var idx_amount = getInputRowMeta().indexOfValue("amount");
    /* amount to increment by each row */
    var delta_amount = (amount - prev_amount) / days_diff;
    for (var i = 1; i < days_diff; i++) {
        newRow = createRowCopy(getOutputRowMeta().size());
        newRow[idx_date] = dateAdd(prev_date, "d", i);
        newRow[idx_amount] = prev_amount + delta_amount * i;
        putRow(newRow);
    }
}
Select values step to remove the unwanted fields, i.e. prev_date and prev_amount
Run the transformation and you will see the generated (interpolated) rows under the Preview data tab of the Modified Java Script Value step.
UPDATE:
Per your comments, you can do the following, assuming you have a new field account_type:
In the Analytic Query step, add a new field prev_account_type, similar to the two other prev_ fields, just with a different Subject: account_type
In the Modified Java Script Value step, you need to retrieve the row index for account_type and modify the logic that computes delta_amount, so that when prev_account_type is not the same as the current account_type, delta_amount is ZERO; see the code below:
var days_diff = dateDiff(prev_date, date, "d");
if (days_diff > 0) {
    /* retrieve index for three fields: 'date', 'amount', 'account_type' */
    var idx_date = getInputRowMeta().indexOfValue("date");
    var idx_amount = getInputRowMeta().indexOfValue("amount");
    var idx_act_type = getInputRowMeta().indexOfValue("account_type");
    /* amount to increment by each row */
    var delta_amount = prev_account_type.equals(account_type) ? (amount - prev_amount) / days_diff : 0;
    /* copy the current Row into newRow and modify fields accordingly */
    for (var i = 1; i < days_diff; i++) {
        newRow = createRowCopy(getOutputRowMeta().size());
        newRow[idx_date] = dateAdd(prev_date, "d", i);
        newRow[idx_amount] = prev_amount + delta_amount * i;
        newRow[idx_act_type] = prev_account_type;
        putRow(newRow);
    }
}
Note: invoking the JavaScript interpreter does have some performance cost, so if that matters to you, stick to the method in the link you provided.

Is there a way to fetch records in bundles from a database?

I have a large MySQL database (several hundred thousand records). I use PDO to access it. I need to fetch data in chunks of approximately 100 records.
PDO::fetchAll() returns too many records and exhausts the PC's memory.
PDO::fetch() gets me only one record.
Is there a way to request the next n (say 100) records?
Thanks
PDO::fetch() gets me only one record.
You can always make another call to fetch() to get another record, and so on, until all the records have been fetched, just like it is shown in every example:
while ($row = $stmt->fetch()) {
    print $row[0];
}
Note that you may also set PDO::MYSQL_ATTR_USE_BUFFERED_QUERY to false in order to reduce memory consumption.
The MySQL client-server protocol would allow fetching a certain number (>1) of rows from the result set of a statement in a single response packet (if the data fits within the 2^24-1 byte limit). The COM_STMT_FETCH command has the field num rows, which in the OP's case could be set to 100.
But the mysqlnd implementation currently sets this field explicitly and exclusively to 1.
So, yes, currently your only option seems to be an unbuffered query and living with the (small) network overhead of fetching the records one by one, e.g.:
<?php
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8', 'localonly', 'localonly', array(
    PDO::ATTR_EMULATE_PREPARES => false,
    PDO::MYSQL_ATTR_DIRECT_QUERY => false,
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION
));
setup($pdo);

// use an unbuffered query
// so the complete result set isn't transferred into the php instance's memory
// before ->execute() returns
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

$stmt = $pdo->prepare('SELECT id FROM soFoo WHERE id>?');
$stmt->execute( array(5) );

$rowsPerChunk = 10;
do {
    // not saying that you have to use LimitIterator, it's just one example of how you can process the chunks
    // and a demonstration that PDOStatement implements Traversable
    // all in all the example doesn't do very useful things ;-)
    $lit = new LimitIterator(new IteratorIterator($stmt), 0, $rowsPerChunk);
    doSomething($lit);
} while( $lit->getPosition() === $rowsPerChunk );

function doSomething(Iterator $it) {
    foreach($it as $row) {
        printf('%2d ', $row['id']); // yeah, yeah, not that useful....
    }
    echo "\r\n-------\r\n";
}

function setup($pdo) {
    $pdo->exec('
        CREATE TEMPORARY TABLE soFoo (
            id int auto_increment,
            primary key(id)
        )
    ');
    $stmt = $pdo->prepare('INSERT INTO soFoo VALUES ()');
    for($i=0; $i<44; $i++) {
        $stmt->execute();
    }
}

Best way to retrieve all the table records in Apache Gora 0.5

I know about
query.setStartKey(startKey);
query.setEndKey(endKey);
Isn't there something similar to
SELECT * FROM TABLE;
in Apache Gora when creating queries, something that would return the whole result set?
EDIT:
I executed the program without setting anything. But still, the result set is null.
Query<String, Obj> query = store.newQuery();
Result<String, Obj> result= query.execute();
If you set neither the start key nor the end key, it will retrieve the whole table, like the SELECT you are talking about (at least with HBase).
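For reference, a minimal sketch of how iterating such a full scan usually looks (this reuses the store, Query, Result and Obj types from the question, and leaves out exception handling, which depends on your Gora version):
Query<String, Obj> query = store.newQuery();    // no start/end key set, so the whole table is scanned
Result<String, Obj> result = query.execute();
try {
    while (result.next()) {                     // advance to the next record of the scan
        String key = result.getKey();           // the row key
        Obj obj = result.get();                 // the persistent object for that key
        // ... process obj here
    }
} finally {
    result.close();                             // release the underlying scanner
}
The rows are only produced as you call next()/get() on the Result; execute() by itself does not return the data.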

Passing psql query from perl into string

Currently I have a Perl script that accesses our database, performs certain queries and prints the output to the terminal. Instead, I would like to output the results into a template LaTeX file before generating a PDF. For most of my queries I pull out numbers and store these as scalar variables (e.g. how often a particular operator carries out a given task), e.g.:
foreach $op (@operator) {
    $query = "SELECT count(task_name) FROM table WHERE date <= '$date_stop' and
              date >= '$date_start' and task=\'$operator[$index]\';";
    #execute query
    $result=$conn->exec($query);
    $conres = $conn->errorMessage;
    if ($result->resultStatus eq PGRES_TUPLES_OK) {
        if($result->ntuples > 0) {
            ($task[$index]) = $result->fetchrow;
        }
        printf("$operator[$index] carried out task: %d\n", $task[$index]);
    } else {
        die "Failed.\n$conres\n\n";
        exit -1;
    }
    $index++;
}
printf("**********************************\n\n");
In the final report I will summarise how many times each operator completed each task in a table. In addition to this there will also be some incidents which must be reported. I can print these easily to the terminal using a command such as
$query = "SELECT operator, incident_type from table_name WHERE incident_type = 'Y'
and date <= '$date_stop' and date >= '$date_start';";
$result=$conn->exec($query);
$conres = $conn->errorMessage;
if ($result->resultStatus eq PGRES_TUPLES_OK) {
if($result->ntuples > 0) {
$result->print(STDOUT, 1, 1, 0, 0, 0, 1, "\t", "", "");
}
} else {
die "Failed.\n$conres\n\n";
exit -1;
}
An example of the output of this command is
operator | incident_type
-----------------------------
AB | Incomplete due to staff shortages
-------------------------------
CD | Closed due to weather
-----------------------------
How can I make my perl script pass the operator names and incidents into a string array rather than just sending the results to the terminal?
You should consider updating your script to use DBI. This is the standard for database connectivity in Perl.
DBI has a built in facility for inserting parameters into a query string. It is safer and faster than manually creating the string yourself. Before the loop, do this once:
# $dbh is a database handle that you have already opened.
my $query = $dbh->prepare(
    "SELECT count(task_name) FROM table WHERE date<=? and date>=? and task=?"
);
Then within the loop, you only have to do this each time:
$query->execute($date_stop,$date_start,$op);
Note that the parameters you pass to execute automatically get inserted in place of the ?'s in your statement. It handles the quoting for you.
Also in the loop, after you execute the statement, you can get the results like this:
my $array_ref = $query->fetchall_arrayref;
Now all of the rows are stored in a two-dimensional array structure. $array_ref->[0][0] would get the first column of the first row returned.
See the DBI documentation for more information.
As others have mentioned, there are quite a few other mistakes in your code. Make sure you start with use strict; use warnings;, and ask more questions if you need further help!
Lots of good feedback to your script, but nothing about your actual question.
How can I make my perl script pass the operator names and incidents into a string array rather than just sending the results to the terminal?
Have you tried creating an array and pushing items onto it?
my @array;
push (@array, "foo");
Or using nested arrays:
push (@array, ["operator", "incident"]);

Lucene Field Grouping

Say I have the fields stud_roll_number and date_leave.
select stud_roll_number,count(*) from some_table where date_leave > some_date group by stud_roll_number;
How do I write the same query using Lucene? I tried the following after querying date_leave > some_date:
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document doc = search.doc(scoreDoc.doc);
    String value = doc.get(fieldName);
    Integer key = mapGrouper.get(value);
    if (key == null) {
        key = 1;
    } else {
        key = key + 1;
    }
    mapGrouper.put(value, key);
}
But I have a huge data set and it takes a lot of time to compute this. Is there any other way to do it? Thanks in advance.
Your performance bottleneck is almost certainly the I/O it takes to perform the document and field value lookups. What you want to do in this situation is use a FieldCache for the field you want to group by. Once you have a field cache, you can look up the values by Lucene doc ID, which will be fast because all the values are in memory.
Also remember to give your HashMap an initial capacity to avoid array resizing.
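To illustrate the idea, here is a rough sketch of that approach against a Lucene 3.x-era API (it reuses the search, topDocs and fieldName names from the question; adapt the FieldCache call to your Lucene version):
// org.apache.lucene.search.FieldCache: load every value of the group-by field
// into memory once, indexed by Lucene doc ID
String[] groupValues = FieldCache.DEFAULT.getStrings(search.getIndexReader(), fieldName);

// give the map an initial capacity to avoid repeated resizing
Map<String, Integer> mapGrouper = new HashMap<String, Integer>(1024);

for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    String value = groupValues[scoreDoc.doc];   // in-memory lookup, no stored-field I/O
    Integer count = mapGrouper.get(value);
    mapGrouper.put(value, count == null ? 1 : count + 1);
}
The FieldCache is populated on first use, so the first query pays the loading cost and subsequent groupings stay in memory.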
There is a very new grouping module, available as a patch on https://issues.apache.org/jira/browse/LUCENE-1421, that will do this.