Load data from csv in google cloud storage as bigquery 'in' query - google-bigquery

I want to compose a query like the following in BigQuery, with my file stored in Google Cloud Storage:
select * from my_table where id in ('gs://bucket_name/file_name.csv')
I get no results. Is it possible, or am I missing something?

Using the CLI or API, you can run ad-hoc queries against GCS files without creating tables; a full example is covered in Accessing external (federated) data sources with BigQuery's data access layer.
The code snippet is here:
bq query --external_table_definition=healthwatch::date:DATETIME,bpm:INTEGER,sleep:STRING,type:STRING@CSV=gs://healthwatch2/healthwatchdetail*.csv 'SELECT date,bpm,type FROM healthwatch WHERE type = "elevated" and bpm > 150;'
Waiting on bqjob_r5770d3fba8d81732_00000162ad25a6b8_1 ... (0s)
Current status: DONE
+---------------------+-----+----------+
|        date         | bpm |   type   |
+---------------------+-----+----------+
| 2018-02-07T11:14:44 | 186 | elevated |
| 2018-02-07T11:14:49 | 184 | elevated |
+---------------------+-----+----------+
On the other hand, you can create a permanent EXTERNAL table with schema auto-detection, so that it is usable from the web UI and persists; read more about that in Querying Cloud Storage Data.
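To address the original question directly: a GCS path inside an IN list is just a string literal compared against id, which is why you get no rows. A minimal sketch of the table-based approach follows, using the CREATE EXTERNAL TABLE DDL available in newer BigQuery releases; the dataset name my_dataset, the table name id_list, and the single id column are assumptions for illustration:
-- Define the CSV in GCS as an external table (schema assumed: a single id column)
CREATE EXTERNAL TABLE my_dataset.id_list (id STRING)
OPTIONS (
  format = 'CSV',
  uris = ['gs://bucket_name/file_name.csv']
);

-- Then use it as a subquery instead of putting the path in the IN list
SELECT *
FROM my_dataset.my_table
WHERE id IN (SELECT id FROM my_dataset.id_list);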

Related

AWS Athena Table Data Update

I have started testing out AWS Athena, and so far it looks good. One problem I am having is with updating the data in a table.
Here is the scenario: In order to update the data for a given date in the table, I am basically emptying out the S3 bucket that contains the CSV files, and uploading the new files to become the updated data source. However, the period of time during which the bucket is empty (i.e. when the old source is deleted and new source is being uploaded) actually is a bottleneck, because during this interval, anyone querying the table will get no result.
Is there a way around this?
Thanks.
Athena is a web service that allows you to query data residing on AWS S3. In order to run queries, Athena needs to know the table schema and where to look for the data on S3. All of this information is stored in the AWS Glue Data Catalog. This essentially means that each time you get new data, you simply need to upload a new CSV file onto S3.
Let's assume that you get new data everyday at midnight and you store them in an S3 bucket:
my-data-bucket
├── data-file-2019-01-01.csv
├── data-file-2019-01-02.csv
└── data-file-2019-01-03.csv
and each of these files looks like:
| date | volume | product | price |
|------------|---------|---------|-------|
| 2019-01-01 | 100 | apple | 10 |
| 2019-01-01 | 200 | orange | 50 |
| 2019-01-01 | 50 | cherry | 100 |
Then, after you have uploaded them to AWS S3, you can use the following DDL statement to define the table:
CREATE EXTERNAL TABLE `my_table`(
  `date` timestamp,
  `volume` int,
  `product` string,
  `price` double)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION
  's3://my-data-bucket/'
-- Additional table properties
Now, when you get a new file data-file-2019-01-04.csv and upload it to the same location as the other files, Athena will be able to query the new data as well (see the example query after the listing below).
my-data-bucket
├── data-file-2019-01-01.csv
├── data-file-2019-01-02.csv
├── data-file-2019-01-03.csv
└── data-file-2019-01-04.csv
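For illustration, a simple query over the table defined above now also scans data-file-2019-01-04.csv; the column names follow the sample data, so adjust them to your schema:
-- "date" is quoted to avoid clashing with the DATE keyword in Athena DML
SELECT "date", product, volume, price
FROM my_table
WHERE product = 'apple';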
Update 2019-09-19
If your scenario requires updating data that is already in the S3 bucket, then you can try combining views, tables, and keeping different versions of the data.
Let's say you have table_v1 that queries data in s3://my-data-bucket/v1/ location. You create a view for table_v1 which can be seen as a wrapper of some sort:
CREATE VIEW `my_table_view` AS
SELECT *
FROM `table_v1`
Now your users can use my_table_view to query the data in s3://my-data-bucket/v1/ instead of table_v1. When you want to update the data, you can simply upload it to s3://my-data-bucket/v2/ and define a table table_v2. Next, you need to update your my_table_view view, since all queries are run against it:
CREATE OR REPLACE VIEW `my_table_view` AS
SELECT *
FROM `table_v2`
After this is done, you can drop table_v1 and delete the files from s3://my-data-bucket/v1/. Provided that the data schema hasn't changed, all queries that ran against the my_table_view view while it was based on table_v1 should still be valid and succeed after my_table_view is replaced.
I don't know what the downtime of replacing a view would be, but I'd expect it to be less than a second, which is definitely less than the time it takes to upload new files.
What most people probably want to do is MSCK REPAIR TABLE <table_name>.
This updates the metadata if you have added more files to the location, but it is only available if your table has partitions.
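A minimal sketch of what that looks like with a Hive-style partitioned layout; the dt= prefix and paths are assumptions for illustration:
-- Partitioned variant of the table, with data laid out as s3://my-data-bucket/partitioned/dt=2019-01-04/...
CREATE EXTERNAL TABLE `my_table_partitioned`(
  `volume` int,
  `product` string,
  `price` double)
PARTITIONED BY (`dt` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION 's3://my-data-bucket/partitioned/';

-- After uploading files under a new dt=... prefix, register the new partition(s)
MSCK REPAIR TABLE my_table_partitioned;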
You might also want to do this with a Glue Crawler which can be scheduled to refresh the table with new data.
Relevant documentation.

Drill - Parquet IO performance issues with Azure Blob or Azure Files

The problem:
Parquet read performance in Drill appears to be 5x-10x worse when reading from Azure storage, which renders it unusable for bigger data workloads.
It appears to be a problem only when reading Parquet. Reading CSV, on the other hand, runs normally.
Let's have:
Azure Blob storage account with a ~1GB source.csv and Parquet files containing the same data
Azure Premium File Storage with the same files
Local disk folder containing the same files
Drill running on Azure VM in single mode
Drill configuration:
Azure blob storage plugin working as namespace blob
Azure files mounted with SMB to /data/dfs used as namespace dfs
Local disk folder used as namespace local
The VM
Standard E4s v3 (4 vcpus, 32 GiB memory)
256GB SSD
NIC 2Gbps
6400 IOPS / 96MBps
Azure Premium Files Share
1000GB
1000 IOPS base / 3000 IOPS Burst
120MB/s throughput
Storage benchmarks
Measured with dd, 1GB data, various block sizes, conv=fdatasync
FS cache dropped before each read test (sudo sh -c "echo 3 > /proc/sys/vm/drop_caches")
Local disk
+-------+------------+--------+
| Mode  | Block size | Speed  |
+-------+------------+--------+
| Write | 1024       | 37MB/s |
| Write | 64         | 16MB/s |
| Read  | 1024       | 70MB/s |
| Read  | 64         | 44MB/s |
+-------+------------+--------+
Azure Premium File Storage SMB mount
+-------+------------+---------+
| Mode  | Block size | Speed   |
+-------+------------+---------+
| Write | 1024       | 100MB/s |
| Write | 64         | 23MB/s  |
| Read  | 1024       | 88MB/s  |
| Read  | 64         | 40MB/s  |
+-------+------------+---------+
Azure Blob
The max known throughput of Azure Blob storage is 60MB/s. Upload/download speeds are clamped to the target storage read/write speeds.
Drill benchmarks
The filesystem cache was purged before every read test.
IO performance observed with iotop
The queries were deliberately kept simple for demonstration. Execution time growth for more complex queries is linear.
Sample queries:
-- Query A: Reading parquet
select sum(`Price`) as test from namespace.`Parquet/**/*.parquet`;
-- Query B: Reading CSV
select sum(CAST(`Price` as DOUBLE)) as test from namespace.`sales.csv`;
Results
+-------------+--------------------+----------+-----------------+
| Query       | Source (namespace) | Duration | Disk read usage |
+-------------+--------------------+----------+-----------------+
| A (Parquet) | dfs(smb)           | 14.8s    | 2.8 - 3.5 MB/s  |
| A (Parquet) | blob               | 24.5s    | N/A             |
| A (Parquet) | local              | 1.7s     | 40 - 80 MB/s    |
+-------------+--------------------+----------+-----------------+
| B (CSV)     | dfs(smb)           | 22s      | 30 - 60 MB/s    |
| B (CSV)     | blob               | 29s      | N/A             |
| B (CSV)     | local              | 18s      | 68 MB/s         |
+-------------+--------------------+----------+-----------------+
Observations
When reading Parquet, more threads spawn, but only the cifsd process accounts for the IO.
I tried to tune the Parquet reader performance as described here, but without any significant results (see the session-option sketch after these observations).
There is a big peak of egress data when querying Parquet from Azure storage that exceeds the Parquet data size several times: the Parquet files are ~300MB, but the egress peak for one read query is about 2.5GB.
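For reference, the tuning mentioned above revolves around Drill session options for the Parquet reader; the sketch below is illustrative only, and the option names and values are assumptions based on Drill's async Parquet page reader settings, so verify them against the documentation for your Drill release:
-- Illustrative only: adjust the async page reader and per-node parallelism, then re-run Query A
ALTER SESSION SET `store.parquet.reader.pagereader.async` = false;
ALTER SESSION SET `store.parquet.reader.pagereader.buffersize` = 4194304;
ALTER SESSION SET `planner.width.max_per_node` = 4;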
Conclusion
Reading Parquet from Azure Files is for some reason slowed down to ridiculous speeds.
Reading Parquet from Azure Blob is even a bit slower.
Reading Parquet from the local filesystem is nicely fast, but not suitable for real use.
Reading CSV from any source utilizes the storage throughput normally, so I assume some problem or misconfiguration of the Parquet reader.
The questions
What are the reasons that Parquet read performance from Azure Storage is so drastically reduced?
Is there a way to optimize it?
I assume you have cross-checked the IO performance issue using Azure Monitor. If the issue still persists, I would like to work closely with you on it. This may require a deeper investigation, so if you have a support plan, I request that you file a support ticket; otherwise please let us know and we will try to help you get one-time free technical support. In that case, could you send an email to AzCommunity[at]Microsoft[dot]com referencing this thread? Please mention "ATTN subm" in the subject field. Thank you for your cooperation on this matter, and I look forward to your reply.

How to set DTU for Azure Sql Database via SQL when copying?

I know that you can create a new Azure SQL DB by copying an existing one by running the following SQL command in the [master] db of the destination server:
CREATE DATABASE [New_DB_Name] AS COPY OF [Azure_Server_Name].[Existing_DB_Name]
What I want to find out is whether it's possible to change the number of DTUs the copy will have at the time of creating the copy.
As a real-life example, if we're copying a [prod] database to create a new [qa] database, the copy might only need enough resources to handle a small testing team hitting the QA DB, not a full production audience. Scaling down the assigned DTUs would result in a cheaper DB. At the moment we manually scale after the copy is complete, but this takes just as long as the initial copy (several hours for our larger DBs) because it copies the database yet again. In an ideal world we would like to skip that step and fully automate the copy process.
According to the docs, it is possible:
CREATE DATABASE database_name
AS COPY OF [source_server_name.] source_database_name
[ ( SERVICE_OBJECTIVE =
{ 'basic' | 'S0' | 'S1' | 'S2' | 'S3' | 'S4'| 'S6'| 'S7'| 'S9'| 'S12' |
| 'GP_GEN4_1' | 'GP_GEN4_2' | 'GP_GEN4_4' | 'GP_GEN4_8' | 'GP_GEN4_16' | 'GP_GEN4_24' |
| 'BC_GEN4_1' | 'BC_GEN4_2' | 'BC_GEN4_4' | 'BC_GEN4_8' | 'BC_GEN4_16' | 'BC_GEN4_24' |
| 'GP_GEN5_2' | 'GP_GEN5_4' | 'GP_GEN5_8' | 'GP_GEN5_16' | 'GP_GEN5_24' | 'GP_GEN5_32' | 'GP_GEN5_48' | 'GP_GEN5_80' |
| 'BC_GEN5_2' | 'BC_GEN5_4' | 'BC_GEN5_8' | 'BC_GEN5_16' | 'BC_GEN5_24' | 'BC_GEN5_32' | 'BC_GEN5_48' | 'BC_GEN5_80' |
| { ELASTIC_POOL(name = <elastic_pool_name>) } } )
]
[;]
CREATE DATABASE (sqldbls)
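Applied to the prod/qa example from the question, that would look roughly like this; the server and database names are placeholders, and the target objective has to stay within the source database's service tier:
-- Run in the master database of the destination server
CREATE DATABASE [QA_DB]
AS COPY OF [prod_server].[Prod_DB]
( SERVICE_OBJECTIVE = 'S1' );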
You can also change the DTU level during a copy using the PowerShell API:
New-AzureRmSqlDatabaseCopy
But you can only choose "a different performance level within the same service tier (edition)"; see Copy an Azure SQL Database.
You can, however, copy the database into an elastic pool in the same service tier, so you wouldn't be allocating new DTU resources. You might have a single pool for all your dev/test/qa databases and drop the copy there.
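A sketch of that elastic pool variant, using the ELASTIC_POOL form of the syntax shown above; the pool name is hypothetical and must already exist on the destination server:
CREATE DATABASE [QA_DB]
AS COPY OF [prod_server].[Prod_DB]
( SERVICE_OBJECTIVE = ELASTIC_POOL ( name = dev_test_pool ) );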
If you want to change the service tier, you could do a Point-in-Time Restore instead of a Database Copy. The database can be restored to any service tier or performance level using the Portal, PowerShell, or REST.
Recover an Azure SQL database using automated database backups

How do I update a database that's in use?

I'm building a web application using ASP.NET MVC with SQL Server, and my development process is going to be like this:
Make changes in SQL Server locally
Create LINQ-to-SQL classes as necessary
Before committing any change set that includes a database change, script out the database so that I can regenerate it if I ever need to
What I'm confused about is how I'm going to update the production database, which will have live data in it.
For example, let's say I have a table like
People
========================================
Id | FirstName | LastName | FatherId
----------------------------------------
1 | 'Anakin' | 'Skywalker' | NULL
2 | 'Luke' | 'Skywalker' | 1
3 | 'Leah' | 'Skywalker' | 1
in production and locally and let's say I add an extra column locally
ALTER TABLE People ADD LightsaberColor VARCHAR(16)
and update my LINQ to SQL, script it out, test it with sample data and decide that I want to add that column to production.
As part of a deployment process, how would I do that? Does there exist some tool that could read my database generation file (call it GenerateDb.sql) and figure out that it needs to update the production People table to put default values in the new column, like
People
==========================================================
Id | FirstName | LastName | FatherId | LightsaberColor
----------------------------------------------------------
1 | 'Anakin' | 'Skywalker' | NULL | NULL
2 | 'Luke' | 'Skywalker' | 1 | NULL
3 | 'Leah' | 'Skywalker' | 1 | NULL
???
You should have a staging DB that is identical to the production database.
When you make any changes to the database, you should apply those changes to the staging DB first, and you can of course compare the dev and staging DBs to generate a script with the differences.
Visual Studio has a Schema Compare feature that generates a script with the differences between two databases.
There are other tools as well that do the same.
So you can generate the script, apply it to the staging DB, and if everything goes fine, apply the script to the production DB.
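For the LightsaberColor example from the question, the change script that such a compare tool produces (or that you could write by hand) is typically an idempotent ALTER along these lines; existing rows simply get NULL, as in the table shown in the question:
-- Add the column only if it doesn't exist yet, so the script can be re-run safely
IF NOT EXISTS (
    SELECT 1
    FROM sys.columns
    WHERE object_id = OBJECT_ID(N'dbo.People')
      AND name = N'LightsaberColor'
)
BEGIN
    ALTER TABLE dbo.People ADD LightsaberColor VARCHAR(16) NULL;
END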
That is right, you must have a staging process. Whenever we commit features, we use TFS to move from Development to Production; that is the staging step, and you can look up the history in TFS, whether for the database or the solution. This assumes you're using TFS with Visual Studio and MS SQL Server.
I guess that you are committing your features directly to your server, which is effectively your production test; alternatively, you can test the changes on your test server first to see their effect.
Another thing: if you use stored procedures, you can use temporary tables, if you're asking about the script.
I guess this is your first time committing to a live server.

Does DROP PARTITION delete data from external table in HIVE?

An external table in Hive is partitioned on year, month, and day.
So does the following query delete data from the external table for the specific partition referenced in this query?
ALTER TABLE MyTable DROP IF EXISTS PARTITION(year=2016,month=7,day=11);
The partitioning scheme is not data. The partitioning scheme is part of the table DDL stored in metadata (simply put: the partition key value plus the location where the data files are stored).
The data itself is stored in files in the partition location (folder). If you drop a partition of an external table, the location remains untouched, but it is unmounted as a partition (the metadata about this partition is deleted). You can have a few versions of a partition location unmounted (for example, previous versions).
You can drop a partition and mount another location as the partition (ALTER TABLE ... ADD PARTITION), or change an existing partition location. Also, dropping an external table does not delete the table/partition folders with the files in them, and you can later create a table on top of this location.
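For example, dropping the partition and mounting a different folder in its place looks roughly like this; the HDFS paths are hypothetical:
-- Unmount the partition: for an external table this only removes metadata, the files stay put
ALTER TABLE MyTable DROP IF EXISTS PARTITION (year=2016, month=7, day=11);

-- Mount another folder as that partition
ALTER TABLE MyTable ADD PARTITION (year=2016, month=7, day=11)
LOCATION 'hdfs:///data/mytable/2016/07/11_v2';

-- Or point an existing partition at a new location
ALTER TABLE MyTable PARTITION (year=2016, month=7, day=11)
SET LOCATION 'hdfs:///data/mytable/2016/07/11_v2';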
Have a look at this answer for a better understanding of the external table/partition concept: it is possible to create many tables (both managed and external at the same time) on top of the same location in HDFS.
No. An external table holds only references, and those references will be deleted; the actual files will still persist at the location.
External table data files are not owned by the table, nor are they moved to the Hive warehouse directory.
Only the partition metadata will be deleted from the Hive metastore tables.
Difference between Internal & external tables :
For External Tables -
An external table stores its files on the HDFS server, but the table is not completely tied to the source files.
If you delete an external table the file still remains on the HDFS server.
As an example if you create an external table called “table_test” in HIVE using HIVE-QL and link the table to file “file”, then deleting “table_test” from HIVE will not delete “file” from HDFS.
External table files are accessible to anyone who has access to HDFS file structure and therefore security needs to be managed at the HDFS file/folder level.
Meta data is maintained on master node and deleting an external table from HIVE, only deletes the metadata not the data/file.
For Internal Tables-
Stored in a directory based on the hive.metastore.warehouse.dir setting; by default, internal tables are stored in /user/hive/warehouse. You can change this by updating the location in the config file.
Deleting the table deletes the metadata & data from master-node and HDFS respectively.
Internal table file security is controlled solely via Hive. Security needs to be managed within Hive, probably at the schema level (it depends on the organisation).
Hive can have internal or external tables; this is a choice that affects how data is loaded, controlled, and managed (a short DDL sketch of both follows after the lists below).
Use EXTERNAL tables when:
The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
Hive should not own data and control settings, dirs, etc., you may have another program or process that will do those things.
You are not creating a table based on an existing table (AS SELECT).
Use INTERNAL tables when:
The data is temporary.
You want Hive to completely manage the life-cycle of the table and data.
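A minimal DDL sketch of the two flavours side by side; the schema and location are illustrative:
-- External: Hive only tracks metadata; DROP TABLE leaves the files in place
CREATE EXTERNAL TABLE logs_ext (ts STRING, msg STRING)
PARTITIONED BY (day STRING)
STORED AS TEXTFILE
LOCATION 'hdfs:///data/logs';

-- Internal (managed): data lives under hive.metastore.warehouse.dir; DROP TABLE deletes it
CREATE TABLE logs_managed (ts STRING, msg STRING)
STORED AS TEXTFILE;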
Note: these are the metastore tables you will see if you look into the configured metastore database:
| BUCKETING_COLS |
| COLUMNS |
| DBS |
| NUCLEUS_TABLES |
| PARTITIONS |
| PARTITION_KEYS |
| PARTITION_KEY_VALS |
| PARTITION_PARAMS |
| SDS |
| SD_PARAMS |
| SEQUENCE_TABLE |
| SERDES |
| SERDE_PARAMS |
| SORT_COLS |
| TABLE_PARAMS |
| TBLS |