How to get the numFound value from a response in Apache Solr using Perl

I used the code below to search for documents (which have a particular keyword in their content field) in Apache Solr:
use LWP::UserAgent;
use HTTP::Request;

my $solrgetapi = "http://$address:$port/solr/OppsBot/select?q=content:";
my $solrgeturl = $solrgetapi . '"' . $keyword . '"';
my $browser = LWP::UserAgent->new;
my $req = HTTP::Request->new( GET => $solrgeturl );
$req->authorization_basic( "$username", "$pass" );
my $page = $browser->request( $req );
print $page->decoded_content;
The result I get is as follows:
{
  "responseHeader":{
    "status":0,
    "QTime":2,
    "params":{
      "q":"content:\"ABC\""}},
  "response":{"numFound":0,"start":0,"docs":[]
  }}
I want to extract the numFound value to a variable.
I came across some solutions in SolrJ, like this:
queryResponse.getResults().getNumFound();
But I couldn't find an equivalent in Perl.
I also tried the code below, but I couldn't get it to work. Please help.
$numFound = $page->decoded_content->{response}->{numFound};
print $page->{numFound}

You neglected to transform the JSON text into a data structure.
use feature qw(say);
use JSON::MaybeXS qw(decode_json);

say decode_json($page->decoded_content)->{response}{numFound};
# 0
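To get it into a variable, here is a minimal sketch with a basic HTTP status check added (the check is my addition, not part of the original request code):
use JSON::MaybeXS qw(decode_json);

die "Solr request failed: " . $page->status_line
    unless $page->is_success;

my $data     = decode_json( $page->decoded_content );
my $numFound = $data->{response}{numFound};
print "numFound: $numFound\n";
As a side note, if $keyword can contain spaces or other special characters, escaping it (for instance with URI::Escape's uri_escape) before building $solrgeturl avoids malformed queries.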

Related

Perl: Scrape a website and download PDF files from it using Perl Selenium::Chrome

So I'm studying website scraping using Selenium::Chrome in Perl, and I'm wondering how I can download all PDF files from the years 2017 to 2021 and store them in a folder, from this website: https://www.fda.gov/drugs/warning-letters-and-notice-violation-letters-pharmaceutical-companies/untitled-letters-2021 . So far this is what I've done:
use strict;
use warnings;
use Time::Piece;
use POSIX qw(strftime);
use Selenium::Chrome;
use File::Slurp;
use File::Copy qw(copy);
use File::Path qw(make_path remove_tree);
use LWP::Simple;
my $collection_name = "mre_zen_test3";
make_path("$collection_name");
#DECLARE SELENIUM DRIVER
my $driver = Selenium::Chrome->new;
#NAVIGATE TO SITE
print "trying to get toc_url\n";
$driver->navigate('https://www.fda.gov/drugs/warning-letters-and-notice-violation-letters-pharmaceutical-companies/untitled-letters-2021');
sleep(8);
#GET PAGE SOURCE
my $toc_content = $driver->get_page_source();
$toc_content =~ s/[^\x00-\x7f]//g;
write_file("toc.html", $toc_content);
print "writing toc.html\n";
sleep(5);
$toc_content = read_file("toc.html");
This script only downloads the entire content of the page. I hope someone here can help me and teach me. Thank you very much.
Here is some working code, to hopefully help you get going:
use warnings;
use strict;
use feature 'say';
use Path::Tiny;  # only convenience
use Selenium::Chrome;

my $base_url = q(https://www.fda.gov/drugs/)
    . q(warning-letters-and-notice-violation-letters-pharmaceutical-companies/);

my $show = 1;  # to see navigation; set to false for headless operation

# A little demo of how to set some browser options
my %chrome_capab = do {
    my @cfg = ($show)
        ? ('window-position=960,10', 'window-size=950,1180')
        : 'headless';
    'extra_capabilities' => { 'goog:chromeOptions' => { args => [ @cfg ] } }
};

my $drv = Selenium::Chrome->new( %chrome_capab );

my @years = 2017..2021;

foreach my $year (@years) {
    my $url = $base_url . "untitled-letters-$year";

    $drv->get($url);
    say "\nPage title: ", $drv->get_title;
    sleep 1 if $show;

    my $elem = $drv->find_element(
        q{//li[contains(text(), 'PDF')]/a[contains(text(), 'Untitled Letter')]}
    );
    sleep 1 if $show;

    # Downloading the file is surprisingly not simple with Selenium (see text),
    # but as we found the link we can get its url and then use the Selenium-provided
    # user-agent (it's LWP::UserAgent)
    my $href = $elem->get_attribute('href');
    say "pdf's url: $href";

    my $response = $drv->ua->get($href);
    die $response->status_line if not $response->is_success;

    say "Downloading 'Content-Type': ", $response->header('Content-Type');
    my $filename = "download_$year.pdf";
    say "Save as $filename";
    path($filename)->spew( $response->decoded_content );
}
This takes shortcuts, switches approaches, and sidesteps some issues (which one would need to resolve for fuller use of this useful tool). It downloads one pdf from each page; to download all of them we need to change the XPath expression used to locate them:
my @hrefs =
    map { $_->get_attribute('href') }
    $drv->find_elements(
        # There's no ends-with(...) in XPath 1.0 (nor matches() with regex)
        q{//li[contains(text(), '(PDF)')]}
        . q{/a[starts-with(@href, '/media/') and contains(@href, '/download')]}
    );
Now loop over the links, forming filenames more carefully, and download each as in the program above; a sketch follows. I can fill the gaps further if there's need for that.
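A minimal sketch of that loop, under a couple of assumptions: the filename scheme is made up here, and URI->new_abs is a safety net in case the driver returns the href attribute verbatim (site-relative) rather than resolved:
use URI;

my $n = 0;
for my $href (@hrefs) {
    # Resolve site-relative links like /media/NNN/download (no-op if already absolute)
    my $url = URI->new_abs( $href, 'https://www.fda.gov' )->as_string;

    my $response = $drv->ua->get($url);
    die $response->status_line if not $response->is_success;

    # Hypothetical filename scheme; derive something nicer from the link text if needed
    my $filename = sprintf "download_%s_%02d.pdf", $year, ++$n;
    say "Save as $filename";
    path($filename)->spew( $response->decoded_content );
}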
The code puts the pdf files on disk, in its working directory. Please review that before running it, to make sure that nothing gets overwritten!
See Selenium::Remote::Driver for starters.
Note: there is no need for Selenium for this particular task; it's all straight-up HTTP requests, no JavaScript. So LWP::UserAgent or Mojo would do it just fine. But I take it that you want to learn how to use Selenium, since it often is needed and is useful.
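For illustration, a minimal sketch of that non-Selenium route; Mojo::DOM is my choice for the HTML parsing (an assumption, any HTML parser would do), and the CSS selector mirrors the XPath above:
use strict;
use warnings;
use LWP::UserAgent;
use Mojo::DOM;

my $ua  = LWP::UserAgent->new;
my $res = $ua->get('https://www.fda.gov/drugs/'
    . 'warning-letters-and-notice-violation-letters-pharmaceutical-companies/'
    . 'untitled-letters-2021');
die $res->status_line unless $res->is_success;

# Collect the /media/.../download links, as the XPath above does
my @hrefs = Mojo::DOM->new( $res->decoded_content )
    ->find('li a[href^="/media/"][href*="/download"]')
    ->map( attr => 'href' )
    ->each;

print "$_\n" for @hrefs;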

export data from BigQuery to Cloud Storage - PHP client library - there is one extra empty line in the Cloud Storage file

I followed this sample
https://cloud.google.com/bigquery/docs/exporting-data
public function exportDailyRecordsToCloudStorage($date, $tableId)
{
    $validTableIds = ['table1', 'table2'];
    if (!in_array($tableId, $validTableIds)) {
        die("Wrong TableId");
    }
    $date = date("Ymd", date(strtotime($date)));
    $datasetId = $date;
    $dataset = $this->bigQuery->dataset($datasetId);
    $table = $dataset->table($tableId);
    // load the storage object
    $storage = $this->storage;
    $bucketName = 'mybucket';
    $objectName = "daily_records/{$tableId}_" . $date;
    $destinationObject = $storage->bucket($bucketName)->object($objectName);
    // create the import job
    $format = 'NEWLINE_DELIMITED_JSON';
    $options = ['jobConfig' => ['destinationFormat' => $format]];
    $job = $table->export($destinationObject, $options);
    // poll the job until it is complete
    $backoff = new ExponentialBackoff(10);
    $backoff->execute(function () use ($job) {
        print('Waiting for job to complete' . PHP_EOL);
        $job->reload();
        if (!$job->isComplete()) {
            //throw new Exception('Job has not yet completed', 500);
        }
    });
    // check if the job has errors
    if (isset($job->info()['status']['errorResult'])) {
        $error = $job->info()['status']['errorResult']['message'];
        printf('Error running job: %s' . PHP_EOL, $error);
    } else {
        print('Data exported successfully' . PHP_EOL);
    }
}
I have 37670 rows in my table1, and the cloud storage file has 37671 lines.
And I have 388065 rows in my table2, and the cloud storage file has 388066 lines.
The last line in both cloud storage files is an empty line.
Is this a Google BigQuery issue I should report, or did I do something wrong in my code above?
What you described seems like an unexpected outcome. The output file should generally have the same number of lines as the source table.
Your PHP code looks fine and shouldn't be the cause of the issue.
I tried to reproduce the problem but was unable to. Could you double-check whether the last empty line is somehow added by another tool, like a text editor? How are you counting the lines of the resulting output?
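For instance, once a file has been downloaded locally, a small Perl check can tell whether the extra "line" is a genuinely empty trailing line (the filename here is illustrative):
use strict;
use warnings;

# Count lines and inspect the last one
open my $fh, '<', 'table1_export.json' or die "open: $!";
my ($count, $last) = (0, '');
while (my $line = <$fh>) {
    $count++;
    $last = $line;
}
close $fh;

print "lines: $count\n";
print "the last line is empty\n" if $last =~ /^\s*$/;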
If you have ruled that out and are sure the newline is indeed added by the BigQuery export feature, please consider opening a bug in the BigQuery Issue Tracker, as suggested by xuejian, and include your job ID so that we can investigate further.

perl CGI parameters not all showing

I am passing about seven fields from an HTML form to a Perl CGI script.
Some of the values are not getting recovered using a variety of methods (POST, GET, CGI.pm or raw code).
That is, this code
my $variable = $q->param('varname');
resulted in about half the variables either being empty or undef, although the latter may have been a coincidental situation from the HTML page, which uses JavaScript.
I wrote a test page on the same platform with a simple form going to a simple CGI script, and again only half the parameters were represented. The remaining values were empty after the assignment.
I tried both POST and GET. I also tried GET and printed the query string after attempting to write out the variables; everything was in the query string as it should be. I'm using CGI.pm for this.
I tried to see if the variable values had been parsed successfully by CGI.pm by creating a version of my test CGI code which just displays the
parameters on the HTML page. The result is a bunch of odd strings like
CGI=HASH(0x02033)->param('qSetName')
suggesting that assignment of these values results in a cast of some kind, so I was unable to tell if they actually 'contained' the proper values.
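As it happens, output like that is typically what Perl produces when a method call is interpolated inside a double-quoted string: only the $q part is interpolated (stringifying to CGI=HASH(0x...)), and ->param(...) is printed literally. A quick illustration, assuming $q is a CGI object:
my $q = CGI->new;

# Method calls are not interpolated in double-quoted strings:
print "$q->param('qSetName')";          # CGI=HASH(0x...)->param('qSetName')

# Wrapping the call in @{[ ... ]} forces the interpolation:
print "@{[ $q->param('qSetName') ]}";   # the actual parameter value
So that output points at how the values were printed, not at a cast of the parameters.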
My real form uses POST, so I just commented out the CGI.pm code and iterated over STDIN and it had all the name-value pairs as it should have.
Everything I've done points to CGI.pm, so I will try reinstalling it.
Here's the test code that missed half the vars:
#!/usr/bin/perl
use CGI;
my $q = new CGI;
my $subject = $q->param('qSetSubject');
my $topic = $q->param('qTopicName');
my $userName = $q->param('uName');
my $accessLevel = $q->param('accessLevel');
my $category = $q->param('qSetCat');
my $type = $q->param('qSetType');
print "Content-Type: text/html\n\n";
print "<html>\n<head><title>Test CGI<\/title><\/head>\n<body>\n\n<h2>Here Are The Variables:<\/h2>\n";
print "<list>\n";
print "<li>\$q->param(\'qSetSubject\') = $subject\n";
print "<li>\$q->param(\'qTopicName\') = $topic\n";
print "<li>\$q->param(\'uName\') = $userName\n";
print "<li>\$q->param(\'qSetCat\') = $accessLevel\n";
print "<li>\$q->param(\'qSetType\') = $category\n";
print "<li>\$q->param(\'accessLevel\') = $type\n";
print "<\/list>\n";
The results of ikegami's code are here:
qSetSubject: precalculus
qTopicName: polar coordinates
uName: kjtruitt
accessLevel: private
category: mathematics
type: grid-in
My attempt to incorporate ikegami's code:
%NAMES = (
    seqNum      => 'seqNum',
    uName       => 'userName',
    qSetName    => 'setName',
    accessLevel => 'accessLevel',
    qSetCat     => 'category',
    qTopicName  => 'topic',
    qSetType    => 'type',
    qSetSubject => 'subject',
);

use CGI;
my $cgi = CGI->new();

print "Content-Type:text/html\n\n";
#print($cgi->header('text/plain'));

for my $name ($cgi->param) {
    for ($cgi->param($name)) {
        #print("$name: ".( defined($_) ? $_ : '[undef]' )."\n");
        print "$NAMES{$name} = $_\n";
        ${$NAMES{$name}} = $_;
    }
}
print "<html>\n<head><title>Test CGI<\/title><\/head>\n<body>\n\n<h2>Here Are The Variables:<\/h2>\n";
print "Hello World!\n";
print "<list>\n";
print "<li>\$q->param(\'qSetSubject\') = $subject\n";
print "<li>\$q->param(\'qTopicName\') = $topic\n";
print "<li>\$q->param(\'uName\') = $userName\n";
print "<li>\$q->param(\'qSetCat\') = $accessLevel\n";
print "<li>\$q->param(\'qSetType\') = $category\n";
print "<li>\$q->param(\'accessLevel\') = $type\n";
print "<\/list>\n";
You are receiving
qSetSubject: precalculus
qTopicName: polar coordinates
uName: kjtruitt
accessLevel: private
category: mathematics
type: grid-in
so
my $category = $q->param('qSetCat');
my $type = $q->param('qSetType');
should be replaced with
my $category = $q->param('category');
my $type = $q->param('type');
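As an aside, the attempt above stores values via ${$NAMES{$name}} = $_, a symbolic reference, which is fragile and disallowed under use strict. A sketch of the same idea using a plain hash instead (field names taken from the question):
use strict;
use warnings;
use CGI;

my $cgi = CGI->new;

# Map form field names to friendlier keys, then collect values in a hash
my %NAMES = (
    qSetSubject => 'subject',
    qTopicName  => 'topic',
    uName       => 'userName',
    qSetCat     => 'category',
    qSetType    => 'type',
    accessLevel => 'accessLevel',
);

my %value;
for my $name ( $cgi->param ) {
    $value{ $NAMES{$name} // $name } = scalar $cgi->param($name);
}

print "Content-Type: text/html\n\n";
print "<li>$_ = $value{$_}\n" for sort keys %value;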

How can I do a SQL query to an Oracle database with Perl and get the result as JSON?

I'm working with a legacy system and need to get data out of an Oracle database using Perl. Perl is one of the languages I don't spend much time in, so I'd like to be able to run a simple SQL query and pass the data to another system via JSON.
It seems that JSON, DBI, and DBD are available on this system. I'd like to accomplish this without making too many changes or updates to the system, if possible. I believe the JSON library is at version 5.12.2
I found the DBI-Link library on GitHub, and I believe this file is almost exactly what I need:
#!/usr/bin/perl -l
use strict;
use warnings;
$|++;
use JSON;
use DBI;
use DBD::Oracle qw(:ora_types);

my $dbh = DBI->connect(
    'dbi:Oracle:host=localhost;sid=xe',
    'hr',
    'foobar',
    {
        AutoCommit => 1,
        RaiseError => 1,
    }
);

my @methods = qw(table_info column_info primary_key_info);
foreach my $method (@methods) {
    if ( $dbh->can($method) ) {
        print "Handle has method $method. w00t!";
    }
    else {
        $dbh->disconnect;
        print "Sadly, handle does not have method $method. D'oh!";
        exit;
    }
}

my $sth = $dbh->table_info('%', '%', '%', 'TABLE');
while (my $table = $sth->fetchrow_hashref) {
    my $t;
    $t->{'Table Name'}  = $table->{TABLE_NAME};
    $t->{'Column Info'} = $dbh->column_info(
        undef,
        $table->{TABLE_SCHEM},
        $table->{TABLE_NAME},
        '%'
    )->fetchall_arrayref({});
    $t->{'Primary Key Info'} = $dbh->primary_key_info(
        undef,
        $table->{TABLE_SCHEM},
        $table->{TABLE_NAME}
    )->fetchall_arrayref({});
    print map {"$_: ". json_encode($t->{$_})} grep{ defined $t->{$_} } 'Table Name', 'Column Info', 'Primary Key Info';
    print;
}
$sth->finish;
$dbh->disconnect;
The Error
I've installed the dependencies but when I run it I am getting:
Undefined subroutine &main::json_encode called at ./oracle.t line 47.
I searched the rest of the source in that repository and don't see any json_encode definition. Maybe my version of the JSON library is too old, but it seems unlikely that the json_encode function would have changed names.
The Next Steps
After I get json_encode working, I know I will need to execute a custom query and then save the data. It would be something like this:
$sth = $dbh->prepare("select * from records where pending = 1");
$sth->execute;
my $records = new HASH;
while($r = $sth->fetchrow_hashref)
{
$records << $r
}
my $json = json_encode($records)
However, I'm unsure how to build the $records object for encoding, so any help would be appreciated. I have searched Stack Overflow, Google, and GitHub for Perl examples of Oracle to JSON, and only had luck with the code from that DBI-Link repo.
According to the documentation for the JSON module, the function you want is encode_json and not json_encode.
I'd probably store the records in an array of hashes; something like this:
my @records;
while (my $r = $sth->fetchrow_hashref)
{
    push(@records, $r);
}
If you know what field you want a hash-of-hashes keyed on:
my %records;
while (my $r = $sth->fetchrow_hashref)
{
$records{ $r->{key_field} } = $r;
}
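Putting those pieces together with encode_json, here is a minimal sketch; the query is the one from the question, and $dbh is assumed to be connected as above:
use JSON qw(encode_json);

my $sth = $dbh->prepare('select * from records where pending = 1');
$sth->execute;

my @records;
while (my $r = $sth->fetchrow_hashref) {
    push @records, $r;
}

# encode_json takes a reference, so pass the array by reference
my $json = encode_json(\@records);
print $json;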

Compiling a week's worth of tweets automatically?

I'd like to be able to run a script that parses through the Twitter page and compiles a list of tweets for a given time period - one week, to be more exact. Ideally it should return the results as an HTML list that could then be posted in a blog, like here:
http://www.perezfox.com/2009/07/12/the-week-in-tweet-for-2009-07-12/
I'm sure there's a script out there that could do it, unless the guy does it manually (that would be a big pain!). If there is such a script, forgive my ignorance.
Thanks.
Use the Twitter search API. For instance, this query returns my tweets between 2009-07-10 and 2009-07-17:
http://search.twitter.com/search.atom?q=from:tormodfj&since=2009-07-10&until=2009-07-17
For anyone who's interested, I hacked together a quick PHP parser that takes the XML output of the above feed and turns it into a nice list. If you post a lot of tweets, it's sensible to use the rpp parameter so that your feed doesn't get clipped at 15; the maximum limit is 100. So by sticking this URL into NetNewsWire (or an equivalent feed reader):
http://search.twitter.com/search.atom?q=from:yourTwitterAccountHere&since=2009-07-13&until=2009-07-19&rpp=100
and exporting the XML to a file on disk, you can use this script:
<?php
$date = "";
$in = 'links.xml'; // tweets
file_exists($in) ? $xml = simplexml_load_file($in) : die('Failed to open xml data.');

foreach ($xml->entry as $item)
{
    $newdate = date("dS F", strtotime($item->published));
    if ($date == "")
    {
        echo "<h2>$newdate</h2>\n<ul>\n";
    }
    elseif ($newdate != $date)
    {
        echo "</ul>\n<h2>$newdate</h2>\n<ul>\n";
    }
    echo "<li>\n<p>" . $item->content . " *</p>\n</li>\n";
    $date = $newdate;
}
echo "</ul>\n";
?>