How to delimit a compressed fixed-length file without uncompressing it - gzip

I'm dealing with compressed (gzip) fixed-length flat files which I then need to turn into delimited flat files so I can feed them to gpload. I was told it is possible to delimit a file without decompressing it and feed it directly to gpload, since gpload can handle compressed files.
Does anybody know of a way to delimit the file while it is in .gz format?

There is no way to delimit the gzip-compressed data without decompressing it. But you don't need to delimit it: you can load it as fixed-width data, and gpfdist will decompress it on the fly. Refer to the "Importing and Exporting Fixed Width Data" chapter in the admin guide here: http://gpdb.docs.pivotal.io/4330/admin_guide/load.html
Here's an example:
[gpadmin@localhost ~]$ gunzip -c testfile.txt.gz
Bob Jones 27
Steve Balmer 50
[gpadmin@localhost ~]$ gpfdist -d ~ -p 8080 &
[1] 41525
Serving HTTP on port 8080, directory /home/gpadmin
[gpadmin@localhost ~]$ psql -c "
> CREATE READABLE EXTERNAL TABLE students (
> name varchar(20),
> surname varchar(30),
> age int)
> LOCATION ('gpfdist://127.0.0.1:8080/testfile.txt.gz')
> FORMAT 'CUSTOM' (formatter=fixedwidth_in,
> name='20', surname='30', age='4');
> "
CREATE EXTERNAL TABLE
[gpadmin@localhost ~]$ psql -c "select * from students;"
name | surname | age
-------+---------+-----
Bob | Jones | 27
Steve | Balmer | 50
(2 rows)

Related

Octavia apply for Airbyte gives a json schema validation error

I'm trying to create a new BigQuery destination on Airbyte with Octavia cli.
When launching:
octavia apply
I receive:
Error: {"message":"The provided configuration does not fulfill the specification. Errors: json schema validation failed when comparing the data to the json schema. \nErrors:
$.loading_method.method: must be a constant value Standard
Here is my conf:
# Configuration for airbyte/destination-bigquery
# Documentation about this connector can be found at https://docs.airbyte.com/integrations/destinations/bigquery
resource_name: "BigQueryFromOctavia"
definition_type: destination
definition_id: 22f6c74f-5699-40ff-833c-4a879ea40133
definition_image: airbyte/destination-bigquery
definition_version: 1.2.12
# EDIT THE CONFIGURATION BELOW!
configuration:
  dataset_id: "airbyte_octavia_thibaut" # REQUIRED | string | The default BigQuery Dataset ID that tables are replicated to if the source does not specify a namespace. Read more here.
  project_id: "data-airbyte-poc" # REQUIRED | string | The GCP project ID for the project containing the target BigQuery dataset. Read more here.
  loading_method:
    ## -------- Pick one valid structure among the examples below: --------
    # method: "Standard" # REQUIRED | string
    ## -------- Another valid structure for loading_method: --------
    method: "GCS Staging" # REQUIRED | string
    credential:
      ## -------- Pick one valid structure among the examples below: --------
      credential_type: "HMAC_KEY" # REQUIRED | string
      hmac_key_secret: ${AIRBYTE_BQ1_HMAC_KEY_SECRET} # SECRET (please store in environment variables) | REQUIRED | string | The corresponding secret for the access ID. It is a 40-character base-64 encoded string. | Example: 1234567890abcdefghij1234567890ABCDEFGHIJ
      hmac_key_access_id: ${AIRBYTE_BQ1_HMAC_KEY_ACCESS_ID} # SECRET (please store in environment variables) | REQUIRED | string | HMAC key access ID. When linked to a service account, this ID is 61 characters long; when linked to a user account, it is 24 characters long. | Example: 1234567890abcdefghij1234
      gcs_bucket_name: "airbyte-octavia-thibaut-gcs" # REQUIRED | string | The name of the GCS bucket. Read more here. | Example: airbyte_sync
      gcs_bucket_path: "gcs" # REQUIRED | string | Directory under the GCS bucket where data will be written. | Example: data_sync/test
    # keep_files_in_gcs-bucket: "Delete all tmp files from GCS" # OPTIONAL | string | This upload method is supposed to temporary store records in GCS bucket. By this select you can chose if these records should be removed from GCS when migration has finished. The default "Delete all tmp files from GCS" value is used if not set explicitly.
  credentials_json: ${AIRBYTE_BQ1_CREDENTIALS_JSON} # SECRET (please store in environment variables) | OPTIONAL | string | The contents of the JSON service account key. Check out the docs if you need help generating this key. Default credentials will be used if this field is left empty.
  dataset_location: "europe-west1" # REQUIRED | string | The location of the dataset. Warning: Changes made after creation will not be applied. Read more here.
  transformation_priority: "interactive" # OPTIONAL | string | Interactive run type means that the query is executed as soon as possible, and these queries count towards concurrent rate limit and daily limit. Read more about interactive run type here. Batch queries are queued and started as soon as idle resources are available in the BigQuery shared resource pool, which usually occurs within a few minutes. Batch queries don’t count towards your concurrent rate limit. Read more about batch queries here. The default "interactive" value is used if not set explicitly.
  big_query_client_buffer_size_mb: 15 # OPTIONAL | integer | Google BigQuery client's chunk (buffer) size (MIN=1, MAX = 15) for each table. The size that will be written by a single RPC. Written data will be buffered and only flushed upon reaching this size or closing the channel. The default 15MB value is used if not set explicitly. Read more here. | Example: 15
It was an indentation issue on my side:
gcs_bucket_name: "airbyte-octavia-thibaut-gcs" # REQUIRED | string | The name of the GCS bucket. Read more here. | Example: airbyte_sync
gcs_bucket_path: "gcs" # REQUIRED | string | Directory under the GCS bucket where data will be written. | Example: data_sync/test
These two keys should be one level up (this wasn't clear in the commented template, hence the error, and others will likely make the same mistake).
Here is the full final conf:
# Configuration for airbyte/destination-bigquery
# Documentation about this connector can be found at https://docs.airbyte.com/integrations/destinations/bigquery
resource_name: "BigQueryFromOctavia"
definition_type: destination
definition_id: 22f6c74f-5699-40ff-833c-4a879ea40133
definition_image: airbyte/destination-bigquery
definition_version: 1.2.12
# EDIT THE CONFIGURATION BELOW!
configuration:
  dataset_id: "airbyte_octavia_thibaut" # REQUIRED | string | The default BigQuery Dataset ID that tables are replicated to if the source does not specify a namespace. Read more here.
  project_id: "data-airbyte-poc" # REQUIRED | string | The GCP project ID for the project containing the target BigQuery dataset. Read more here.
  loading_method:
    ## -------- Pick one valid structure among the examples below: --------
    # method: "Standard" # REQUIRED | string
    ## -------- Another valid structure for loading_method: --------
    method: "GCS Staging" # REQUIRED | string
    credential:
      ## -------- Pick one valid structure among the examples below: --------
      credential_type: "HMAC_KEY" # REQUIRED | string
      hmac_key_secret: ${AIRBYTE_BQ1_HMAC_KEY_SECRET} # SECRET (please store in environment variables) | REQUIRED | string | The corresponding secret for the access ID. It is a 40-character base-64 encoded string. | Example: 1234567890abcdefghij1234567890ABCDEFGHIJ
      hmac_key_access_id: ${AIRBYTE_BQ1_HMAC_KEY_ACCESS_ID} # SECRET (please store in environment variables) | REQUIRED | string | HMAC key access ID. When linked to a service account, this ID is 61 characters long; when linked to a user account, it is 24 characters long. | Example: 1234567890abcdefghij1234
    gcs_bucket_name: "airbyte-octavia-thibaut-gcs" # REQUIRED | string | The name of the GCS bucket. Read more here. | Example: airbyte_sync
    gcs_bucket_path: "gcs" # REQUIRED | string | Directory under the GCS bucket where data will be written. | Example: data_sync/test
    # keep_files_in_gcs-bucket: "Delete all tmp files from GCS" # OPTIONAL | string | This upload method is supposed to temporary store records in GCS bucket. By this select you can chose if these records should be removed from GCS when migration has finished. The default "Delete all tmp files from GCS" value is used if not set explicitly.
  credentials_json: ${AIRBYTE_BQ1_CREDENTIALS_JSON} # SECRET (please store in environment variables) | OPTIONAL | string | The contents of the JSON service account key. Check out the docs if you need help generating this key. Default credentials will be used if this field is left empty.
  dataset_location: "europe-west1" # REQUIRED | string | The location of the dataset. Warning: Changes made after creation will not be applied. Read more here.
  transformation_priority: "interactive" # OPTIONAL | string | Interactive run type means that the query is executed as soon as possible, and these queries count towards concurrent rate limit and daily limit. Read more about interactive run type here. Batch queries are queued and started as soon as idle resources are available in the BigQuery shared resource pool, which usually occurs within a few minutes. Batch queries don’t count towards your concurrent rate limit. Read more about batch queries here. The default "interactive" value is used if not set explicitly.
  big_query_client_buffer_size_mb: 15 # OPTIONAL | integer | Google BigQuery client's chunk (buffer) size (MIN=1, MAX = 15) for each table. The size that will be written by a single RPC. Written data will be buffered and only flushed upon reaching this size or closing the channel. The default 15MB value is used if not set explicitly. Read more here. | Example: 15
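If you want a quick way to confirm the nesting before re-running octavia apply, you can parse the YAML and check that gcs_bucket_name and gcs_bucket_path now sit directly under loading_method rather than under credential. This is only a sketch, assuming the file is named configuration.yaml and that python3 with PyYAML is available on the machine where you run octavia:
# print the keys of loading_method; with the corrected indentation,
# gcs_bucket_name and gcs_bucket_path should show up in this list
python3 -c "
import yaml
cfg = yaml.safe_load(open('configuration.yaml'))
print(sorted(cfg['configuration']['loading_method']))
"
octavia apply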

Inserting huge batch of data from multiple csv files into distinct tables with Postgresql

I have a folder with multiple csv files, they all have the same column attributes.
My goal is to turn every csv file into a distinct PostgreSQL table named after the file, but with 1k+ files it would be a pretty long process to do manually.
I've been searching for a solution the whole day, but the closest I've come to solving the problem is this code:
for filename in select pg_ls_dir2 ('/directory_name/') loop
if (filename ~ '.csv$') THEN create table filename as fn
copy '/fullpath/' || filename to table fn
end if;
END loop;
The logic behind this code is to select every filename inside the folder, create a table named after the file, and import the file's contents into that table.
The issue is that I have no idea how to actually put this into practice; for instance, where should I execute this code, given that both for and pg_ls_dir2 are not SQL instructions?
If you use DBeaver, there is a recently added feature in the software which solves exactly this. (On Windows) Right-click the "Tables" section inside your schema (not your target table!), select "Import data", and you can select all the .csv files you want at the same time, creating a new table for each file as you mentioned.
Normally, I don't like giving the answer directly, but I think you will need to change a few things at least.
Based on the example from here, I prepared a small example using a bash script. Let's assume you are in the directory where your files are kept.
postgres@213b483d0f5c:/home$ ls -ltr
total 8
-rwxrwxrwx 1 root root 146 Jul 25 13:58 file1.csv
-rwxrwxrwx 1 root root 146 Jul 25 14:16 file2.csv
On the same directory you can run:
for i in `ls | grep csv`
do
export table_name=`echo $i | cut -d "." -f 1`;
psql -d test -c "CREATE TABLE $table_name(emp_id SERIAL, first_name VARCHAR(50), last_name VARCHAR(50), dob DATE, city VARCHAR(40), PRIMARY KEY(emp_id));";
psql -d test -c "\COPY $table_name(emp_id,first_name,last_name,dob,city) FROM './$i' DELIMITER ',' CSV HEADER;";
done
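Since all of your files share the same column attributes you can hardcode the CREATE TABLE as above; if you would rather derive the columns from each file, a variant of the same loop reads the header row and creates every column as TEXT. This is only a rough sketch, assuming comma-delimited files with a header line, simple column names, and the same test database as above:
for f in *.csv
do
  table_name="${f%.csv}"
  # build "col1 TEXT, col2 TEXT, ..." from the header row
  columns=$(head -n 1 "$f" | tr -d '\r' | awk -F',' '{for (i = 1; i <= NF; i++) printf "%s TEXT%s", $i, (i < NF ? ", " : "")}')
  psql -d test -c "CREATE TABLE \"$table_name\" ($columns);"
  psql -d test -c "\COPY \"$table_name\" FROM '$f' DELIMITER ',' CSV HEADER;"
done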

Store my "Sybase" query result /output into a script variable

I need a variable to keep the result retrieved from a Sybase query that's in a script.
I have built the following script; it works fine and I get the desired result when I run it.
Script: EXECUTE_DAILY:
isql -U database_dba -P password <<EOF!
select the_name from table_name where m_num="NUMB912" and date="17/01/2019"
go
quit
EOF!
echo "All Done"
Output:
"EXECUTE_DAILY" 97 lines, 293 characters
user@zp01$ ./EXECUTE_DAILY
the_name
-----------------------------------
NAME912
(1 row affected)
But now I would like to keep the output (the_name: NAME912) in a variable.
So far this is basically what I'm trying, with no success:
variable=$(isql -U database_dba -P password -se "select the_name from table_name where m_num="NUMB912" and date="17/01/2019" ")
But it's not working; I can't save NAME912 in a variable.
You need to parse the output for the desired string/piece-of-data that you wish to store in your variable. I tend to make my life a bit easier by making sure I can easily/quickly search/parse out what I want.
Keeping a few issues in mind ...
I tend to use isql -s"|" -w10000 to ensure (most of the time) that a) the result set has all columns delimited with the pipe ('|') and b) a single row of data does not span multiple rows; the pipe delimiter makes it easier to parse out columns that may contain white space; obviously (?) use a different delimiter if a pipe may be part of your actual data
to make parsing of the isql output a bit easier I tend to add a unique, grep-able (literal) string to the rows that I'm looking to search/parse
some databases (eg, SQLAnywhere, Oracle) tend to mimic a literal value as the column header if said literal string has not been assigned an explicit alias/header; this means that if you do a simple search on your literal string then you'll get a match for the result set header as well as the actual data row
I tend to capture all isql output to a temporary file; this allows for easier follow-on processing, eg, error checking, data parsing, dumping contents to a logfile, etc
So, with the above in mind my code typically looks something like:
$ outfile=/tmp/.$$.isql.outfile
$ isql -s"|" -w10000 -U database_dba -P password <<-EOF > ${outfile} 2>&1
-- 'GREP'||'ME' ensures that 'GREPME' only shows up in the data row
select 'GREP'||'ME',the_name
from table_name
where m_num = "NUMB912"
and date = "17/01/2019"
go
EOF
$ cat ${outfile}
... snip ...
|'GREP'||'ME'|the_name | # notice the default column header = 'GREP'||'ME' which won't match my search for 'GREPME'
|------------|----------|
|GREPME |NAME912 | # this is the line I want to search/parse
... snip ...
$ read -r namevar < <(egrep GREPME ${outfile} | awk -F"|" '{print $3}')
$ echo ${namevar}
NAME912
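Because everything isql printed is already captured in ${outfile}, it is also easy to bolt on some of the error checking mentioned above before trusting ${namevar}. A rough sketch, assuming your server's error output uses the usual "Msg ..." prefix (adjust the pattern to whatever your isql actually emits on failure):
$ if egrep -q "^Msg " ${outfile}; then echo "isql reported an error:" >&2; cat ${outfile} >&2; fi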

Export an Amazon MySQL database to an Excel sheet

I have an EC2 instance with a MySQL database on it. There are multiple tables holding huge amounts of data which I want to export to an Excel sheet on my local system, or even somewhere on S3. How can I achieve this?
Given that you installed your own MySQL instance on an EC2 node, you should have full access to MySQL's abilities. I don't see any reason why you can't just do a SELECT ... INTO OUTFILE here:
SELECT *
FROM yourTable
INTO OUTFILE 'output.csv'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n';
Once you have the CSV file, you may transfer it to a box running Excel, and use the Excel import wizard to bring in the data.
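For the transfer step, two common options are pulling the file down with scp or pushing it to S3 (which you mentioned would also work). A rough sketch with placeholder key, host, bucket and paths; the S3 variant assumes the AWS CLI is installed on the instance with credentials that can write to the bucket:
# run from your local machine: copy the CSV off the EC2 node
scp -i mykey.pem ec2-user@your-ec2-host:/path/to/output.csv .
# or run on the EC2 node: push the CSV to an S3 bucket
aws s3 cp /path/to/output.csv s3://your-bucket/exports/output.csv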
Edit:
Based on your comments below, it might be the case that you need to carefully select an output path and location to which MySQL and your user have permissions to write.
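One setting worth checking is the server's secure_file_priv variable, which restricts (or, if NULL, disables) the directories INTO OUTFILE can write to. A minimal sketch; the /var/lib/mysql-files path below is only the typical default, so use whatever directory the variable actually reports:
mysql -u username -p -e "SHOW VARIABLES LIKE 'secure_file_priv';"
# then point INTO OUTFILE at the reported directory, for example:
mysql -u username -p yourdb -e "SELECT * FROM yourTable INTO OUTFILE '/var/lib/mysql-files/output.csv' FIELDS TERMINATED BY ',' ENCLOSED BY '\"' LINES TERMINATED BY '\n';"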
Another way to export CSV files from RDS MySQL, without getting Access denied for user '<databasename>'@'%' (using password: YES), is the following command:
mysql -u username -p --database=dbname --host=rdshostname --port=rdsport --batch -e "select * from yourtable" | sed 's/\t/","/g;s/^/"/;s/$/"/;s/\n//g' > yourlocalfilename.csv
The secret is in this part: --batch makes mysql print plain tab-separated rows, and the sed expression replaces every tab with "," and wraps each line in double quotes, so what is redirected into yourlocalfilename.csv is already a quoted CSV:
--batch -e "select * from yourtable" | sed 's/\t/","/g;s/^/"/;s/$/"/;s/\n//g' > yourlocalfilename.csv

Loading Data from Remote Machine to Hive Database

I have a CSV file stored on a remote machine. I need to load this data into my Hive database, which is installed on a different machine. Is there any way to do this?
Note: I am using Hive 0.12.
Since Hive basically applies a schema to data that resides in HDFS, you'll want to create a location in HDFS, move your data there, and then create a Hive table that points to that location. If you're using a commercial distribution, this may be possible from Hue (the Hadoop User Environment web UI).
Here's an example from the command line.
Create csv file on local machine:
$ vi famous_dictators.csv
... and this is what the file looks like:
$ cat famous_dictators.csv
1,Mao Zedong,63000000
2,Jozef Stalin,23000000
3,Adolf Hitler,17000000
4,Leopold II of Belgium,8000000
5,Hideki Tojo,5000000
6,Ismail Enver Pasha,2500000
7,Pol Pot,1700000
8,Kim Il Sung,1600000
9,Mengistu Haile Mariam,950000
10,Yakubu Gowon,1100000
Then scp the csv file to a cluster node:
$ scp famous_dictators.csv hadoop01:/tmp/
ssh into the node:
$ ssh hadoop01
Create a folder in HDFS:
[awoolford@hadoop01 ~]$ hdfs dfs -mkdir /tmp/famous_dictators/
Copy the csv file from the local filesystem into the HDFS folder:
[awoolford@hadoop01 ~]$ hdfs dfs -copyFromLocal /tmp/famous_dictators.csv /tmp/famous_dictators/
Then login to hive and create the table:
[awoolford@hadoop01 ~]$ hive
hive> CREATE TABLE `famous_dictators`(
> `rank` int,
> `name` string,
> `deaths` int)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> LINES TERMINATED BY '\n'
> LOCATION
> 'hdfs:///tmp/famous_dictators';
You should now be able to query your data in Hive:
hive> select * from famous_dictators;
OK
1 Mao Zedong 63000000
2 Jozef Stalin 23000000
3 Adolf Hitler 17000000
4 Leopold II of Belgium 8000000
5 Hideki Tojo 5000000
6 Ismail Enver Pasha 2500000
7 Pol Pot 1700000
8 Kim Il Sung 1600000
9 Mengistu Haile Mariam 950000
10 Yakubu Gowon 1100000
Time taken: 0.789 seconds, Fetched: 10 row(s)