Moving data from hive views to aws s3 - amazon-s3

Hi, is there any way to move data from Hive views to S3? For tables I use DistCp, but since a view doesn't have data residing at an HDFS location, I can't use DistCp, and I don't have access to the tables used to create the views. If I do a CTAS from the view, I won't get the latest data whenever the view is refreshed. Is there any other way, apart from creating a table from the view and running DistCp? Please help; thanks in advance.

The best option would be to write a Spark program that loads the data from your view/table through the Hive context and writes it back to S3 in the required format (Parquet, ORC, CSV, or JSON).
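As a rough sketch of that approach (assuming Spark is built with Hive support and the cluster has S3 credentials configured; the database, view, and bucket names are placeholders), the same idea can be expressed through the spark-sql CLI:

```shell
# Read the view via Spark's Hive integration and write Parquet to S3.
# test_db.t1_view and the s3a:// path are illustrative - substitute your own.
spark-sql --master yarn -e "
  INSERT OVERWRITE DIRECTORY 's3a://testbucket/upload/t1_view_parquet'
  STORED AS PARQUET
  SELECT * FROM test_db.t1_view;
"
```

Because the view is re-evaluated at read time, each run picks up the view's latest data, which addresses the CTAS staleness concern from the question.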

One approach you can use is to copy Hive's standard output stream to S3 using the AWS S3 command line interface (CLI).
Here is a sample command:
beeline -u jdbc:hive2://192.168.0.41:10000/test_db -n user1 -p password --outputformat=csv2 -e "select * from t1_view" | aws s3 cp - s3://testbucket/upload/test_view.txt
Note that you have to install the AWS CLI to use this method.

Related

Moving bigquery data to Redshift

I need to move my BigQuery tables to Redshift.
Currently I have a Python job that fetches data from BigQuery and incrementally loads it into Redshift.
This Python job reads the BigQuery data, creates a CSV file on the server, drops it on S3, and the Redshift table reads the data from the file on S3. But now the data size will be very big, so the server won't be able to handle it.
Do you happen to know anything better than this?
The seven new BigQuery tables I need to move are around 1 TB each, with repeated columns (I am doing an UNNEST join to flatten them).
You could move the data from BigQuery to a Cloud Storage bucket by following the instructions here. After that, you can easily sync the data from the Cloud Storage bucket to the Amazon S3 bucket by running:
gsutil rsync -d -r gs://your-gs-bucket s3://your-s3-bucket
The documentation for this can be found here.
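The BigQuery-to-Cloud-Storage step above can be sketched with bq extract (the dataset, table, and bucket names are placeholders; for tables over 1 GB the destination URI must contain a wildcard so the export is sharded):

```shell
# Export a BigQuery table to sharded, gzipped CSV in Cloud Storage.
bq extract --destination_format=CSV --compression=GZIP \
  'your_dataset.your_table' \
  gs://your-gs-bucket/export/your_table_*.csv.gz
```

The gsutil rsync command above can then mirror the resulting shards to S3.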

How do I export a Hive query result to a Google Cloud Storage bucket as a CSV from GCE

I have a query that writes the query result to a CSV file:
hive -e 'select * from transactions limit 50'>abc.csv
so the result will be stored in abc.csv, which is available only on that GCP instance.
But I need to export it into a GCS bucket so that later I can dump it into BigQuery.
I tried something like this, but it didn't work:
hive -e 'select * from transactions limit 50'>gs://my-bucket/abc.csv
So, how can I store my Hive query result in a GCS bucket?
You can write the Hive query result to your instance first, then use the gsutil command to move it to your bucket:
gsutil mv abc.csv gs://my-bucket/abc.csv
If you do not have gsutil installed on your instance, follow the steps provided here: Install gsutil | Cloud Storage
To find out more about using storage buckets with instances, you can refer to the google docs: Connecting to Cloud Storage buckets
An alternative would be to mount your Cloud Storage bucket within your instance, allowing you to write the Hive query result directly to your bucket.
To do this, you will need to make use of Cloud Storage FUSE; you can follow the steps here to install it: Cloud Storage FUSE | Cloud Storage
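That mounting approach might look like the following (a sketch assuming gcsfuse is installed and the instance's service account can access the bucket; the mount point is illustrative):

```shell
# Mount the bucket, write the query result straight into it, then unmount.
mkdir -p /mnt/my-bucket
gcsfuse my-bucket /mnt/my-bucket
hive -e 'select * from transactions limit 50' > /mnt/my-bucket/abc.csv
fusermount -u /mnt/my-bucket
```

With the bucket mounted, the redirect works exactly like a local file path, which is what the original `> abc.csv` attempt was missing.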
You can also use the query below:
insert overwrite directory 'gs://bucket-name/file_name/'
row format delimited fields terminated by ','
stored as textfile
select * from <db_name>.<table_name> limit 10;
The above query will write the result to the specified bucket location as comma-delimited text files, i.e., CSV.

How does BigQuery use data stored in Google Cloud Storage?

A very basic question, but I'm not able to decipher it. Please help me out.
Q1: When we create a BigQuery table using the command below, does the data reside in the same Cloud Storage?
bq load --source_format=CSV 'market.cust$20170101' \
gs://sp2040/raw/cards/cust/20170101/20170101_cust.csv
Q2: Let's say my data directory is gs://sp2040/raw/cards/cust/ for customer files. The table structure defined is:
bq mk --time_partitioning_type=DAY market.cust \
custid:string,grp:integer,odate:string
Every day I create a new directory in the bucket, such as 20170101, 20170102, ..., to load a new dataset. So after the data is loaded into this bucket, do I need to run the queries below?
D1:
bq load --source_format=CSV 'market.cust$20170101' \
gs://sp2040/raw/cards/cust/20170101/20170101_cust.csv
D2:
bq load --source_format=CSV 'market.cust$20170102' \
gs://sp2040/raw/cards/cust/20170102/20170102_cust.csv
When we create a BigQuery table using the command below, does the data reside in the same Cloud Storage?
Nope! BigQuery does not use Cloud Storage for storing its data (unless it is a federated table linked to Cloud Storage).
Check out BigQuery Under the Hood with Tino Tereshko and Jordan Tigani; you will like it.
Do I need to run the queries below?
Yes. You need to load those files into BigQuery so you can query the data.
Yes, you would need to load the data into BigQuery using those commands.
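Those per-day loads can also be scripted rather than typed by hand. A minimal sketch (GNU date assumed; the dataset, partition decorator, and bucket layout are taken from the question) that prints one bq load command per day, which you can pipe to sh to execute:

```shell
# Emit one "bq load" command per dated folder between start and end.
start=20170101
end=20170103
d="$start"
cmds=""
while [ "$d" -le "$end" ]; do
  # $ in the table name is the day-partition decorator, kept literal here.
  cmds="${cmds}bq load --source_format=CSV 'market.cust\$${d}' gs://sp2040/raw/cards/cust/${d}/${d}_cust.csv
"
  d=$(date -d "$d + 1 day" +%Y%m%d)
done
printf '%s' "$cmds"
```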
However, there are a couple of alternatives:
Pub/Sub and Dataflow: you could configure Cloud Storage to publish a Pub/Sub notification when files are added, as described here. A Dataflow job could then import each new file into BigQuery. Dataflow documentation
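Setting up that notification might look like this (a sketch; the topic name is a placeholder, and gsutil must be authorized on the project):

```shell
# Publish an OBJECT_FINALIZE event to the Pub/Sub topic "new-cust-files"
# whenever a new object lands in the bucket.
gsutil notification create -t new-cust-files -f json -e OBJECT_FINALIZE gs://sp2040
```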
BigQuery external tables: BigQuery can query CSV files stored in Cloud Storage without importing the data, as described here. There is wildcard support for file names, so it can be configured once. Performance may not be as good as storing the data natively in BigQuery.
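Defining such an external table might look like this (a sketch; the table name is a placeholder, and the schema follows the question's bq mk command):

```shell
# External table over all customer CSVs; the single wildcard matches every
# object under the prefix, so new dated folders are picked up automatically.
bq mk \
  --external_table_definition=custid:STRING,grp:INTEGER,odate:STRING@CSV=gs://sp2040/raw/cards/cust/* \
  market.cust_external
```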

Google BigQuery: How to use gsutil to either remove or overwrite a table?

I have a program that downloads some data from the web, saves it as a CSV, and then uploads that data to a Google Cloud Storage bucket. Next, the program uses the bq tool to create a new Google BigQuery table by concatenating all the files in the Google Cloud Storage bucket. To do the concatenation, I run this command at the command prompt:
bq load --project_id=ib-17 da.hi gs://ib/hi/* da:TIMESTAMP,bol:STRING,bp:FLOAT,bg:FLOAT,bi:FLOAT,lo:FLOAT,en:FLOAT,kh:FLOAT,ow:FLOAT,ls:FLOAT
The issue is that for some reason this command appends to the existing table, so I get a lot of duplicate data. The question is: how can I either delete the table first, or overwrite it?
If I understood your question correctly, you could delete and recreate the table with:
bq rm -f -t da.hi
bq mk --schema da:TIMESTAMP,bol:STRING,bp:FLOAT,bg:FLOAT,bi:FLOAT,lo:FLOAT,en:FLOAT,kh:FLOAT,ow:FLOAT,ls:FLOAT -t da.hi
Another possibility is to use the --replace flag, such as:
bq load --replace --project_id=ib-17 da.hi gs://ib/hi/*
I think this flag was once called WRITE_DISPOSITION, but it looks like the CLI has renamed it to --replace.

How to save the results of an impala query

I've loaded a large set of data from S3 into HDFS and then inserted it into a table in Impala.
I then ran a query against this data, and I'm looking to get the results back into S3.
I'm using Amazon EMR with Impala 1.2.4. If it's not possible to get the query results back to S3 directly, are there options to get the data back to HDFS and then somehow send it to S3 from there?
I have messed around with the impala-shell -o filename option, but that appears to work only on the local Linux file system.
I thought this would be a common scenario, but I'm having trouble finding any information about saving the results of a query anywhere.
Any pointers appreciated.
To add to the above, here is a command that writes the query results to a file with a custom delimiter, using the --output_delimiter option together with --delimited, which switches off the default tab delimiter:
impala-shell -q "query " --delimited --output_delimiter='\001' --print_header -o 'filename'
What I usually do, if it's a smallish result set, is run the query from the command line and then upload the output to S3 using the AWS command line tool:
impala-shell -e "select ble from bla" -o filename
aws s3 cp filename s3://mybucket/filename
An alternative is to use Hive as the last step in your data pipeline, after you've run your query in Impala:
1. Impala step:
create table processed_data
as
select blah
--do whatever else you need to do in here
from raw_data1
join raw_data2 on a=b
2. Hive step:
create external table export
like processed_data
location 's3://mybucket/export/';
insert into table export
select * from processed_data;
If you have the AWS CLI installed, you can pipe Impala shell's standard output straight to S3 using a Unix pipe and the stream argument (-):
impala-shell -B -q "select ble from bla" | aws s3 cp - s3://mybucket/outputfilename