Issue with Date while sqooping data from Oracle

I am trying to Sqoop-import a DATE column from an Oracle database into Avro format.
Below is a sample of how I run it. My options file contains:
--target-dir
/.. /
--delete-target-dir
--as-avrodatafile
--query
Select chg_ts from abc
--map-column-java chg_ts=String
My sqoop import command is:
sqoop import -D oraoop.timestamp.string=false --options-file $1 --options-file $2 --fetch-size=0 -m 1 --mapreduce-job-name $job_name"_"$instance
After these steps, the date column still appears as "type" : [ "null", "long" ] in the Avro file and is read as BIGINT in Hive.
Please tell me if I am missing something here.
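While the mapping is being sorted out, the long values can at least be read back as dates downstream. A minimal sketch, assuming the long holds epoch milliseconds (which is how Sqoop usually writes date and timestamp columns to Avro) and that UTC is the right zone for your data; the function name is made up for illustration:

```python
from datetime import datetime, timezone


def epoch_millis_to_string(millis):
    """Format an epoch-millisecond value as 'YYYY-MM-DD HH:MM:SS' (UTC assumed)."""
    if millis is None:
        # The Avro type is ["null", "long"], so null values can occur.
        return None
    moment = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    print(epoch_millis_to_string(86_400_000))
```

This only converts values after the import; it does not change the type Sqoop writes into the Avro schema.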

Related

Problem changing a column's data type in BigQuery

I am trying to change a column's data type from string to DATETIME (for example '04/12/2016 02:47:30') with the format 'YY/MM/DD HH24:MI:SS', but it shows an error like:
Failed to parse input timestamp string at 8 with format element ' '
The original file was a CSV that I uploaded from my drive. I tried converting the column's data type in Google Sheets and re-uploading it, but the column type still remains a string.
I think you used autodetect mode when you loaded the CSV file into the BigQuery table.
Unfortunately, in this mode BigQuery treats your date as a STRING even if you changed it in Google Sheets.
Instead of autodetect, I suggest using a JSON schema for your BigQuery table.
In the schema, declare the type of your date column as TIMESTAMP.
BigQuery converts the values for you when they use a timestamp layout it accepts (YYYY-MM-DD HH:MM:SS); a value written as 04/12/2016 02:47:30 may have to be rewritten into that layout before loading.
To load the file into BigQuery, you can use the console or the bq command-line tool:
bq load \
--source_format=CSV \
mydataset.mytable \
gs://mybucket/mydata.csv \
./myschema.json
For the BigQuery JSON schema, the timestamp column is declared as:
[
  {
    "name": "yourDate",
    "type": "TIMESTAMP",
    "mode": "NULLABLE",
    "description": "Your date"
  }
]
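If the load rejects a value such as 04/12/2016 02:47:30, one option is to rewrite the column to YYYY-MM-DD HH:MM:SS before loading. A minimal sketch, assuming the values are day-first (DD/MM/YYYY); if yours are month-first, pass "%m/%d/%Y %H:%M:%S" instead. The function name is just for illustration:

```python
from datetime import datetime


def to_bigquery_timestamp(value, source_format="%d/%m/%Y %H:%M:%S"):
    """Rewrite a date string into the 'YYYY-MM-DD HH:MM:SS' layout.

    source_format is an assumption about the input; adjust it to your data.
    """
    parsed = datetime.strptime(value.strip(), source_format)
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    print(to_bigquery_timestamp("04/12/2016 02:47:30"))
```

Note that 04/12/2016 is ambiguous on its own, so check a row where the day is greater than 12 before picking the format.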

sqoop import staging table issue

I am trying to import data from Teradata into an HDFS location.
I have access to a view for that database, so I created a staging table in another database. But when I run the command, it fails with this error:
Error:
Running Sqoop version: 1.4.6.2.6.5.0-292
18/12/23 21:49:41 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
18/12/23 21:49:41 ERROR tool.BaseSqoopTool: Error parsing arguments for import:staging-table, t_hit_data_01_staging, –clear-staging-table, --query, select * from table1 where cast(date1 as Date) <= date '2017-09-02' and $CONDITIONS, --target-dir, <>, --split-by, date1, -m, 25
I specified the staging table in the command and ran it, but it throws the error above
(error parsing arguments for import, with the staging-table options reported as unrecognized arguments).
sqoop import \
--connect jdbc:teradata://<server_link>/Database=db01 \
--connection-manager org.apache.sqoop.teradata.TeradataConnManager \
--username <UN> \
--password <PWD> \
–-staging-table db02.table1_staging –clear-staging-table \
--query "select * from table1 where cast(date1 as Date) <= date '2017-09-02' and \$CONDITIONS " \
--target-dir '<hdfs location>' \
--split-by date1 -m 25
The data should be loaded into the HDFS location using the staging table in the other Teradata database. Later, when I change the where clause, Sqoop should create another file under the same folder in the HDFS location, for example part-0000, then part-0001, and so on.
I don't think a staging option is available for the import command; --staging-table and --clear-staging-table are options of sqoop export.
https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html

BigQuery Could not parse 'null' as int for field

I tried to load CSV files into a BigQuery table. Some columns are of type INTEGER, but some missing values are written as NULL. When I run bq load, I get the following error:
Could not parse 'null' as int for field
What is the best way to deal with this? Do I have to reprocess the data first before bq can load it?
You'll need to transform the data in order to end up with the expected schema and data. Instead of INTEGER, specify the column as having type STRING. Load the CSV file into a table that you don't plan to use long-term, e.g. YourTempTable. In the BigQuery UI, click "Show Options", then select a destination table with the table name that you want. Now run the query:
#standardSQL
SELECT * REPLACE(SAFE_CAST(x AS INT64) AS x)
FROM YourTempTable;
This will convert the string values to integers where 'null' is treated as null.
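If you would rather reprocess the file before loading, as the question suggests, the literal null tokens can be blanked out so that INTEGER columns receive an empty field. A minimal sketch, assuming the default null marker (the empty string) is in effect and that null never appears as genuine data; the function name is hypothetical:

```python
import csv
import io


def blank_out_nulls(src, dst, null_tokens=("null", "NULL")):
    """Copy CSV rows from src to dst, replacing null tokens with empty fields."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        writer.writerow(["" if cell.strip() in null_tokens else cell for cell in row])


if __name__ == "__main__":
    source = io.StringIO("id,x\n1,null\n2,7\n")
    cleaned = io.StringIO()
    blank_out_nulls(source, cleaned)
    print(cleaned.getvalue(), end="")
```

For real files, open the input and output with open(path, newline="") instead of the in-memory buffers used here.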
Try setting the null marker in the job configuration:
job_config.null_marker = 'NULL'
From the REST reference, configuration.load.nullMarker (string):
[Optional] Specifies a string that represents a null value in a CSV file. For example, if you specify "\N", BigQuery interprets "\N" as a null value when loading a CSV file. The default value is the empty string. If you set this property to a custom value, BigQuery throws an error if an empty string is present for all data types except for STRING and BYTE. For STRING and BYTE columns, BigQuery interprets the empty string as an empty value.
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load
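To show where nullMarker sits in the request, here is a rough sketch that builds the configuration.load part of a job body as a plain dictionary. The helper name, the placeholder project/dataset/table IDs and the skipLeadingRows value of 1 (a header row) are assumptions for illustration; only the field names come from the REST reference linked above:

```python
import json


def build_load_config(project_id, dataset_id, table_id, source_uri, null_marker="null"):
    """Return a jobs.insert-style body for a CSV load with a custom null marker."""
    return {
        "configuration": {
            "load": {
                "sourceUris": [source_uri],
                "sourceFormat": "CSV",
                "skipLeadingRows": 1,  # assumption: the first row is a header
                "nullMarker": null_marker,
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
            }
        }
    }


if __name__ == "__main__":
    body = build_load_config("my-project", "mydataset", "mytable", "gs://mybucket/mydata.csv")
    print(json.dumps(body, indent=2))
```

With the Python client, the job_config.null_marker attribute shown above is the counterpart of this nullMarker field.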
The BigQuery console has its limitations and doesn't let you specify a null marker while loading data from a CSV. However, this is easy to do with the bq load command of the BigQuery command-line tool: the --null_marker flag specifies the marker, which is simply null in this case.
bq load --source_format=CSV \
--null_marker=null \
--skip_leading_rows=1 \
dataset.table_name \
./data.csv \
./schema.json
Setting the null_marker as null does the trick here. You can omit the schema.json part if the table is already present with a valid schema. --skip_leading_rows=1 is used because my first row was a header.
You can learn more about the bq load command in the BigQuery documentation.
The load command also lets you create and load a table in a single step. The schema needs to be specified in a JSON file in the following format:
[
  {
    "description": "[DESCRIPTION]",
    "name": "[NAME]",
    "type": "[TYPE]",
    "mode": "[MODE]"
  },
  {
    "description": "[DESCRIPTION]",
    "name": "[NAME]",
    "type": "[TYPE]",
    "mode": "[MODE]"
  }
]

Import data from sqoop to hive

sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba --password=cloudera --table export1 --hive-import \
--hive-table export_3 --create-hive-table --fields-terminated-by "|" \
--lines-terminated-by "\n" --null-string nvl --null-non-string -2 --outdir java_files
If I use the above command, it gives an error saying:
either use split by or -m 1 for sequential import
When I used --split-by, it ignored null values and imported the rest into Hive.
Can you explain the reason?
Thanks
Varun
The NULL value issues you are getting are not related to split-by.
Sqoop will by default import NULL values as the string null. Hive, however, uses the string \N to denote NULL values, so predicates dealing with NULL (like IS NULL) will not work correctly. You should append the parameters --null-string and --null-non-string for an import job, or --input-null-string and --input-null-non-string for an export job, if you wish to properly preserve NULL values. Because Sqoop uses those parameters in generated code, you need to properly escape the value \N as \\N:
$ sqoop import ... --null-string '\\N' --null-non-string '\\N'
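For text files that were already imported with the default setting, re-importing with the flags above is the cleaner fix, but the existing rows can also be patched. A rough sketch, assuming pipe-delimited lines (as in the question) and that the literal token null never occurs as real data; the function name is made up:

```python
def nulls_to_hive_marker(line, delimiter="|", null_token="null"):
    """Replace fields equal to null_token with Hive's \\N marker in one text line."""
    fields = line.rstrip("\n").split(delimiter)
    # r"\N" is a backslash followed by N, the two characters Hive reads as NULL.
    return delimiter.join(r"\N" if field == null_token else field for field in fields)


if __name__ == "__main__":
    print(nulls_to_hive_marker("1|null|abc\n"))
```

Only whole fields are replaced, so a value such as nullable is left untouched.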

Sqoop Export with Missing Data

I am trying to use Sqoop to export data from HDFS into PostgreSQL. However, partway through the export I receive an error saying that it can't parse the input. I opened the file I was exporting and saw that the failing row had two columns missing. I have tried a bunch of different arguments with the Sqoop command, but cannot get it to work. Here is what I have been running so far:
sqoop export --connect jdbc:postgresql://localhost:5432/XX -username XX -password XX \
--table XX --input-fields-terminated-by "\t" --input-lines-terminated-by "\n" \
--input-null-string '\n' --input-null-non-string '\n' -m 1 --export-dir /user/dan/output
I have also tried it without the "--input-null-string" and "--input-null-non-string" args and got the same result. My table has 6 columns and the file I am reading has tab separated values that are inserted into the table if all 6 are there. Any help would be appreciated.
I solved the problem by changing my reduce function so that, when a row did not have the correct number of fields, it output a placeholder value for the missing ones. I was then able to pass that value to --input-null-non-string, and it worked.
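The padding step could look something like the sketch below. It is not the original reduce function: it assumes the missing columns are always the trailing ones, and the placeholder "NULL" and the function name are hypothetical choices. Whatever placeholder you pick is the value you would then pass to --input-null-non-string (and to --input-null-string if string columns can be missing too):

```python
def pad_row(line, expected_fields=6, delimiter="\t", placeholder="NULL"):
    """Pad a delimited line with placeholder values up to expected_fields columns."""
    fields = line.rstrip("\n").split(delimiter)
    missing = expected_fields - len(fields)
    if missing > 0:
        # Assumption: the absent columns are at the end of the row.
        fields.extend([placeholder] * missing)
    return delimiter.join(fields)


if __name__ == "__main__":
    print(pad_row("a\tb\tc\td\n"))
```

Rows that already have six fields pass through unchanged.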