Can Drill query open HDFS directories? - hive

I am successfully querying Hive and HBase tables using Drill. In my use case I am getting data from Storm into an HDFS directory; for that directory I create a Hive table structure and query the data using both Hive and Drill. Whenever Storm is writing data into that directory (i.e. a file in the directory is open and being written to in HDFS), Drill is not able to query that Hive table and gives this error:
Failed with exception java.io.IOException:java.io.IOException: Cannot obtain block length for LocatedBlock{BP-517438351-192.168.1.136-1475035616867:blk_1073793923_53182; getBlockSize()=0; corrupt=false; offset=0; locs=[127.0.0.1:50010]; storageIDs=[DS-be58a5f4-58d9-4c3c-8138-ce18ffa10ef8]; storageTypes=[DISK]}
If we stop writing, Drill is able to query those Hive tables. In both cases Hive itself works properly. I am not able to find the cause.
Can anybody please tell me whether Drill can query HDFS files or directories that are open for writing? I have tried a lot but have not found anything about this.

Technically, any file system (ext2, ext3, or HDFS) should remain consistent for reads and writes. While one process is writing data to a directory, the file system is in write mode for that file and cannot safely grant read access to another process; even if you force a read, the reading process may get inconsistent data.
That is why a file or directory that is in writing mode may not be readable. In my opinion, in HDFS you cannot execute a read query while another process is writing to the same file.
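Not part of the original answer, but one practical way to see which files are behind the "Cannot obtain block length" error is to ask HDFS which files under the table directory are still open for write. A minimal Python sketch, assuming the hdfs CLI is on the PATH and using a hypothetical table path:

```python
import subprocess

# Hypothetical HDFS directory backing the Hive table that Storm writes into.
table_dir = "/user/storm/events"

# `hdfs fsck <path> -files -openforwrite` reports files that are still being
# written; Drill will fail on these until the writer closes them.
result = subprocess.run(
    ["hdfs", "fsck", table_dir, "-files", "-openforwrite"],
    capture_output=True, text=True, check=True,
)

# Print only the lines flagged as open for write.
for line in result.stdout.splitlines():
    if "OPENFORWRITE" in line:
        print(line)
```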

Related

Creating multiple files for uploading to Snowflake

Currently, my company uses SSIS and BCP to export data from SQL Server to CSV files. However, we are only able to create a single file per SQL table (due to the limitations of BCP). Most of these files are quite large; if I am correct, they are too large to get the best performance when loading them into Snowflake. On their website, they state that we should be working with multiple gzip files to offer the best performance.
I am wondering how other people made this work? Splitting up the CSV to multiple files and zipping them? Any good tools that can do this during export from SSIS?
I'd keep the current process that exports the large .csv files using SSIS, then run 7zip via the command line to create a split gzip set for each text file, either within the SSIS package or via PowerShell.
The -v switch is used to specify the volume size.
https://sevenzip.osdn.jp/chm/cmdline/switches/volume.htm
You may be able to start importing/uploading the completed chunks before the later ones are finished to pick up some additional time savings, but I've not tested that.
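As a rough alternative to the 7-Zip approach (not something the answer above describes), the split-and-compress step can also be done directly in Python: write the large CSV out as several gzip files, each with its own header row, so Snowflake can load them in parallel. File names and the chunk size below are arbitrary illustration choices:

```python
import csv
import gzip
import os

def split_and_gzip(src_path, out_dir, rows_per_chunk=1_000_000):
    """Split a large CSV into several gzipped chunks, repeating the header."""
    os.makedirs(out_dir, exist_ok=True)
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        out, writer, chunk = None, None, 0
        for i, row in enumerate(reader):
            if i % rows_per_chunk == 0:
                if out:
                    out.close()
                # Each chunk becomes its own gzip file, e.g. part_0000.csv.gz
                out = gzip.open(
                    os.path.join(out_dir, f"part_{chunk:04d}.csv.gz"),
                    "wt", newline="",
                )
                writer = csv.writer(out)
                writer.writerow(header)
                chunk += 1
            writer.writerow(row)
        if out:
            out.close()

# Example usage with hypothetical file names:
# split_and_gzip("export.csv", "snowflake_chunks")
```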

Use RStudio to connect to, and run queries on, a locally stored, compressed SQL database

I'm trying to connect to and run queries on two large, locally-stored SQL databases with file extensions like so:
filename.sql.zstd.part
filename2.sql.zstd
My preference is to use the RMySQL package; however, I am finding it hard to find documentation on a) how to access locally stored SQL files, and b) how to deal with the zstd extension.
This may be very basic but help is appreciated!
It seems like you are having trouble understanding the file extensions.
filename.sql.zstd.part
.part usually means you are downloading a file from the internet but the download isn't complete yet (i.e. a download that is in progress or has been stopped).
So to get from filename.sql.zstd.part to filename.sql.zstd you need to complete your download.
.zstd means it is a compressed file (to save disk space). You need a decompression program to get from filename.sql.zstd to filename.sql.
The compression algorithm used is called Zstandard, so you need a decompressor specifically for this format. Look at https://facebook.github.io/zstd/ for such a program.
There was also once an R package for this (zstdr), but it has been archived; you could still download an older version from https://cran.r-project.org/web/packages/zstdr/index.html
Finally, filename.sql is actually not a database. An .sql file usually contains SQL statements for creating / modifying database structures (and often the data itself). You would have to install a database server, e.g. MariaDB, and then import this .sql file to actually have the data in a database on your computer. Then you would access that database from R.
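To make the decompression step concrete, here is a minimal sketch using Python's zstandard package (the zstd command-line tool from the link above, or an R binding such as the archived zstdr, does the same job). The file names are the ones from the question:

```python
import zstandard  # pip install zstandard

# Decompress filename.sql.zstd into filename.sql (assumes the .part download
# has already been completed and renamed).
with open("filename.sql.zstd", "rb") as compressed, open("filename.sql", "wb") as out:
    zstandard.ZstdDecompressor().copy_stream(compressed, out)
```

The resulting filename.sql can then be imported into a MariaDB/MySQL database and queried from R via RMySQL, as described above.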

Mosaic Decisions Azure BLOB writer node creating multiple files

I’m using the Mosaic Decisions data flow feature to read a file from Azure Blob, do a few transformations, and write that data back to Azure. It worked fine, except that at the output file path I gave, it created a folder and I can see many files with strange "part-000" etc. in their names. What I need is a single file in that output location, not many. Is there a way around this?
Mosaic Decisions uses Apache Spark as its backend execution engine. In Spark, the data being read is split into multiple partitions, and these partitions are written to the output location in parallel. That's the reason it creates multiple files at the target location named "part-0000", "part-0001", etc. ("part" here represents a partition).
The workaround is to check the "combine-output-files-into-one" option in the writer node. This will combine all of the part files into one big file. But use this with caution and only if you really need a single file, as it comes with a performance tradeoff.
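For reference, and not as Mosaic's actual implementation, this is roughly what such a "single file" option boils down to in plain Spark terms. A minimal PySpark sketch with hypothetical paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("single-file-write").getOrCreate()

# Read the input (hypothetical Azure Blob path).
df = spark.read.csv(
    "wasbs://container@account.blob.core.windows.net/input/", header=True
)

# coalesce(1) forces all data into a single partition, so only one part file
# is written. This is also why a single-file write carries a performance cost
# for large datasets: one task does all the output work.
df.coalesce(1).write.mode("overwrite").csv(
    "wasbs://container@account.blob.core.windows.net/output/", header=True
)

# Note: the output is still a folder, but it contains a single part file.
```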

how to store auto generated files in a different AWS S3 folder while running Tableau using Athena connector?

I am using Athena to connect a single CSV file stored in an AWS S3 folder to Tableau Desktop, and have been successful in connecting to the S3 data using Athena.
However, when I perform any activity in Tableau like drag and drop or slice and dice, for each activity an auto-generated CSV and a metadata file get saved in the same folder as my input file.
Because these additional files get auto-generated in the same input folder, the visuals in Tableau are also affected (due to the additional records).
How do I ensure that, for any activity I perform in Tableau, the auto-generated files get stored in a different folder (rather than the same folder from which the input file is being read)?
This will solve my problem as the visuals and the analysis will show correct numbers.
Currently, the workaround I am using is: after every activity I perform in Tableau (slice, filter, etc.), I go back to the S3 folder, delete the additional files that got auto-generated, then continue the activity in Tableau, then go back to the S3 folder for deletion, and so on (definitely not the ideal way).
While executing Athena query, I am storing the query results in a different folder, because there is a provision for doing the same.
Please suggest whether there is a similar provision for storing the auto-generated files (while working in Tableau) in a different folder.
P.S. If there is an option of preventing these files from getting generated, that will also be helpful.
Anand
How do I ensure that the auto-generated files get stored in a different folder?
In order to store the results of your queries in a different location, you need to specify a different path for the S3 Staging Directory. To do that, edit the connection to AWS Athena in Tableau.
Here everything was done within Tableau itself, but the same result can be accomplished via the AWS Athena settings for the query result location.
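If you want to verify the same setting outside Tableau, a minimal sketch using the pyathena package shows the staging directory being pointed at a dedicated prefix rather than the folder holding the source CSV. Bucket names, region, and table names are hypothetical:

```python
from pyathena import connect  # pip install pyathena

# Athena writes its query results (the auto-generated CSV + metadata files)
# to the staging directory, so point it at a dedicated results bucket/prefix.
conn = connect(
    s3_staging_dir="s3://my-athena-results-bucket/tableau-staging/",
    region_name="us-east-1",
)

cursor = conn.cursor()
cursor.execute("SELECT count(*) FROM my_database.my_table")
print(cursor.fetchone())
```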
If there is an option of preventing these files from getting generated, that will also be helpful.
On the left side of the toolbar, there is an option Pause/Resume Auto Updates. When paused, Tableau doesn't send new queries to AWS Athena.

Approach for large data set for reporting

I have 220 million raw files in AWS S3 which I am considering merging into a single file, estimated at around 10 terabytes. The merged file would serve as a fact table, but in file format, for audit reporting purposes.
The raw files are source data from an application. If there are any new data changes in the application, the contents of the corresponding file will change.
I would like to ask: has anybody come across an end-to-end process for this use case?
S3 --> ETL (file merging) --> S3 --> reporting (Tableau)
I haven't personally tried it, but this is kind of what Athena is made for... Skipping your ETL process, and querying directly from the files. Is there a reason you are dumping this all into a single file instead of keeping it dispersed? Rewriting a 10TB file over and over again is very expensive and time consuming... I'd personally at least investigate keeping the files 1-1 with the source files.
Create an S3 trigger that fires when a file is rewritten on S3
Create a Lambda that creates your "audit ready" report files on S3 (a rough sketch is below)
Use AWS Athena to query those report files
Use the Tableau connector to Athena for your reports
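A rough sketch of the Lambda step from the list above, assuming an S3 PUT trigger and a hypothetical target bucket; the real transformation logic would replace the placeholder comment:

```python
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Triggered by an S3 PUT: rewrite the changed source file as an
    "audit ready" report file under a separate prefix that Athena queries."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Read the source object that was just (re)written.
        obj = s3.get_object(Bucket=bucket, Key=key)
        body = obj["Body"].read()

        # ... transform `body` into the audit-ready format here ...

        # Write the report copy to a dedicated bucket/prefix (hypothetical).
        s3.put_object(
            Bucket="my-audit-report-bucket",
            Key=f"reports/{key}",
            Body=body,
        )

    return {"statusCode": 200, "body": json.dumps("processed")}
```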