Crawler hangs a few URLs from the end - import.io

Whether I paste 2000 URLs, 1000, 500 or whatever, the crawler works perfectly, returning data from the bulk URLs, BUT then it stops a handful of URLs from the end and hangs. As there is no cancel/stop button, I have to quit the program.
For example: 250 URLs pasted - stops at 247; 2000 URLs pasted - stops at 1986.

Rod,
This is usually due to timeouts happening during processing.
Did you try pasting the URLs into the Bulk Extract feature instead?
We recently released error handling for Bulk Extract: the process will time out after 1 minute and you will be able to retry the ones that failed.
See here for an example:
http://support.import.io/knowledgebase/articles/510065-bulk-extract-tutorial

Related

Postgres Import to Google Cloud SQL failing due to missing \N

I am trying to import a dump of my Postgres database from AWS RDS into Google Cloud SQL.
I am getting a repeating error that seems to be caused when a \N is missing from the end of a table row. Example below:
2020-10-15 12:52:19 \N f
If I add a \N to the end like so:
2020-10-15 12:52:19 \N f \N
it will work and progress until the next one, which could be 500 rows down. The problem is, I have a few million rows. Is there a way I can import without this error occurring, or a fix I can put in place?
A find and replace will not work, as there is no whitespace after the f, so I can't search for the missing version. Also, because there is so much data, a find crashes the application half the time I run it. I am using Atom to look at the file.
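For illustration only, here is a sketch of the kind of fix being asked about: padding short rows with a trailing \N before importing. It assumes a tab-delimited dump where complete rows have four fields; the field count and file names are assumptions, not taken from the question.
# Append a literal \N to any row that has only three tab-separated fields (adjust NF to match your table)
awk 'BEGIN { FS = OFS = "\t" } NF == 3 { $4 = "\\N" } { print }' dump.txt > dump_fixed.txt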
I have also added a screenshot of the Google GUI we are using for importing the file as well as the log error from the dashboard.
Any help would be greatly appreciated.
Gerard

Script timeout passed, resubmit same file and import will resume error in phpmyadmin

I am trying to upload some big data into phpMyAdmin.
I am getting this error, where it does partial uploads:
Script timeout passed, if you want to finish import, please resubmit same file and import will resume.
I have followed this link, where it says to change the config in \phpmyadmin\libraries\config.default.php.
I cannot see this directory in my phpMyAdmin installation. OS: Ubuntu.
Go to xampp/phpMyAdmin/libraries/config.default.php,
find $cfg['ExecTimeLimit'] = 300; (line no. 695) and replace it with $cfg['ExecTimeLimit'] = 0;
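If that XAMPP path does not exist on your system (for example with a Debian/Ubuntu package install of phpMyAdmin), you can try locating the setting first; the paths below are an assumption and may differ on your setup:
# Find where ExecTimeLimit is defined (paths are a guess for an apt-installed phpMyAdmin)
grep -rn "ExecTimeLimit" /usr/share/phpmyadmin /etc/phpmyadmin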
There may have been a problem with your SQL; try surrounding your SQL (edit it with a text editor, e.g. Notepad++) with the following statements:
SET autocommit=0;
SET unique_checks=0;
SET foreign_key_checks=0;
at the beginning, and
COMMIT;
SET unique_checks=1;
SET foreign_key_checks=1;
at the end.
Then import your SQL file once again.
Give it a try.
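If the file is too large to open in an editor, one way to wrap an existing dump from the shell (a sketch only; the file names are placeholders):
# Prepend and append the statements without loading the whole file into an editor
{
  printf 'SET autocommit=0;\nSET unique_checks=0;\nSET foreign_key_checks=0;\n'
  cat bigdump.sql
  printf 'COMMIT;\nSET unique_checks=1;\nSET foreign_key_checks=1;\n'
} > bigdump_wrapped.sql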

Monit for "cron-like" tasks

I have some batch-type jobs that I would like to move from cron to Monit, but I am struggling to get them to work properly. These scripts typically run once a day, but on occasion have to be re-run later in the day. The goal is to take advantage of the monit and M/Monit front-ends to re-run them, as well as to be alerted on failure in a similar fashion to other things under monit.
The below was my first attempt. I know the docs say to use a range/wildcard for the minute field, but I have my monit daemon set to cycle every 20 seconds, so I thought I'd be able to get away with this.
check program test.sh
with path "/usr/local/bin/test.sh"
every "0 7 * * *"
if status != 0 then alert
This does not seem to work: it picks up the exit status of the program on the NEXT run. So I have a zombie process sitting around until 7 am the next day, at which time I'll see the status from the previous day's run.
It would be nice if this ran immediately, or if there were a way to schedule something as "batch" that would only run once when started (either from the command line or the web GUI). Example below.
check program test.sh
with path "/usr/local/bin/test.sh"
mode batch
if status != 0 then alert
Is it possible to do what I want? Can a 'check program' be scheduled so that it runs only once when started, or by using the 'every [cron]' syntax supported by monit?
TIA for any suggestions.
The latest version of monit (5.18) now picks up the exit status on the next daemon cycle, not on the next execution of the program like in the past (which might not be until the next day).

Uploading job fails on the same file that was uploaded successfully before

I'm running a regular upload job to load a CSV into BigQuery. The job runs every hour. According to a recent failure log, it says:
Error: [REASON] invalid [MESSAGE] Invalid argument: service.geotab.com [LOCATION] File: 0 / Offset:268436098 / Line:218637 / Field:2
Error: [REASON] invalid [MESSAGE] Too many errors encountered. Limit is: 0. [LOCATION]
I went to line 218638 (the original CSV has a header line, so I assume 218638 should be the actual failed line; let me know if I'm wrong) but it seems all right. I checked the corresponding table in BigQuery, and it has that line too, which means I actually uploaded this line successfully before.
Then why does it cause a failure now?
project id: red-road-574
Job ID: Job_Upload-7EDCB180-2A2E-492B-9143-BEFFB36E5BB5
This indicates that there was a problem with the data in your file, where it didn't match the schema.
The error message says it occurred at File: 0 / Offset:268436098 / Line:218637 / Field:2. This means the first file (it looks like you just had one), and then the chunk of the file starting at 268436098 bytes from the beginning of the file, then the 218637th line from that file offset.
The reason for the offset portion is that BigQuery processes large files in parallel across multiple workers. Each file worker starts at an offset from the beginning of the file. The offset that we include is the offset that the worker started from.
From the rest of the error message, it looks like the string service.geotab.com showed up in the second field, but the second field was a number, and service.geotab.com isn't a valid number. Perhaps there was a stray newline?
You can see what the lines looked like around the error by doing:
cat <yourfile> | tail -c +268436098 | tail -n +218636 | head -3
This will print out three lines... the one before the error (since I used -n +218636 instead of +218637), the one that had the error, and the next line as well.
Note that if this is just one line in the file that has a problem, you may be able to work around the issue by specifying maxBadRecords.
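For instance, with the bq command-line tool the corresponding flag is --max_bad_records; the dataset, table, and source names below are placeholders:
# Tolerate up to 10 bad rows instead of failing the whole load
bq load --source_format=CSV --skip_leading_rows=1 --max_bad_records=10 \
  mydataset.mytable gs://mybucket/myfile.csv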

ASE ISQL output to file, occasionally is empty or blank

Given this Unix script, which runs as a scheduled batch job:
isql -U$USR -S$SRVR -P$PWD -w2000 < $SCRIPTS/sample_report.sql > $TEMP_DIR/sample_report.tmp_1
sed 's/-\{3,\}//g' $TEMP_DIR/sample_report.tmp_1 > $TEMP_DIR/sample_report.htm_1
uuencode $TEMP_DIR/sample_report.htm_1 sample_report.xls > $TEMP_DIR/sample_report.mail_1
mailx -s "Daily Sample Report" email@example.com < $TEMP_DIR/sample_report.mail_1
There are occasionally cases where the sample_report.xls attached to the mail is empty: zero lines.
I have ruled out the following:
Not a command processing timeout - by adding -t30 to isql, I get the xls and it contains the error, not empty.
Not an SQL error - by forcing an error in the SQL, I get the xls and it contains the error, not empty.
Not sure about a login timeout - with -l1 it does not time out, but I can't specify a value lower than 1 second, so I can't say.
I cannot reproduce this, as I do not know the cause. Has anyone else experienced this or found a way to address it? Any suggestions on how to find the cause? Is it Unix or Sybase isql?
I found the cause. This report is scheduled and takes a long time to generate, and I found that other scheduled scripts have this line of code:
rm -f $TEMP_DIR/*
If this long-running report overlaps with one of the scheduled scripts containing the line above, the .tmp_1 file can be deleted, and is therefore blank by the time it is mailed. I replicated this by manually deleting the .tmp_1 while the report was still writing the SQL output into it.
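One way to avoid this kind of collision (a sketch only, assuming the report script can be changed; the directory name is arbitrary) is to write into a private temporary directory that a blanket rm -f $TEMP_DIR/* elsewhere cannot touch:
# Private work directory for this job only; cleaned up when the script exits
WORK_DIR=$(mktemp -d /tmp/sample_report.XXXXXX)
trap 'rm -rf "$WORK_DIR"' EXIT
isql -U$USR -S$SRVR -P$PWD -w2000 < $SCRIPTS/sample_report.sql > $WORK_DIR/sample_report.tmp_1
sed 's/-\{3,\}//g' $WORK_DIR/sample_report.tmp_1 > $WORK_DIR/sample_report.htm_1
uuencode $WORK_DIR/sample_report.htm_1 sample_report.xls > $WORK_DIR/sample_report.mail_1
mailx -s "Daily Sample Report" email@example.com < $WORK_DIR/sample_report.mail_1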