Can anyone please help me understand the point below?
I have created a Hive table which is not partitioned, and I am working on a 10-node cluster. In this case, will the data of that table (it is a large table) be spread across different data nodes, or will it sit on only one node?
If it is spread across different data nodes, then why do we see only one file under the /hive/warehouse folder?
Also, please give me a little idea of how storage is allocated for a partitioned table.
The data for the table and the metadata of the table are different things.
The data for the table, which is basically just a file in HDFS, will be stored as per HDFS rules: based on your configuration, a file is split into a number of blocks and distributed across the data nodes.
In your case, the data for one Hive table (one file, or any number of files) will be distributed among all 10 nodes in the cluster.
Also, this distribution is done at the block level and is not visible at the user level.
You can easily check the number of blocks created for the file in the Web UI.
A partitioned table is just like adding another directory inside the table directory in HDFS. So it follows the same rules.
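As an illustration, a partitioned table (hypothetical table and column names) lays its data out like this:

CREATE TABLE sales (
  item_id BIGINT,
  amount  DOUBLE
)
PARTITIONED BY (sale_date STRING)
STORED AS ORC;
-- Each partition value becomes a subdirectory under the table directory, e.g.
--   /user/hive/warehouse/sales/sale_date=2020-01-01/
-- and the files inside each partition directory are still split into HDFS blocks
-- distributed across the data nodes, exactly as for a non-partitioned table.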
I have a folder in HDFS, let's call it /data/users/
Inside that folder, a new csv file is added every 10 days. Basically the new file will contain only active users, so, for example
file_01Jan2020.csv: contains data for 1000 users who are currently active
file_10Jan2020.csv: contains data for 950 users who are currently active (same data as file_01Jan2020.csv with 50 fewer records)
file_20Jan2020.csv: contains data for 920 users who are currently active (same data as file_10Jan2020.csv with 30 fewer records)
In reality, these files are much bigger (~8 million records per file, decreasing by maybe 1K every 10 days). Also, a newer file will never have records that don't exist in the older files; it will just have fewer of them.
I want to create a table in hive using the data in this folder. What I am doing now is:
Create External table from the data in the folder /data/users/
Create Internal table with the same structure
Write the data from the external table to the internal table, where:
Duplicates are removed
If a record no longer exists in the newer files, I mark it as deleted by setting a 'deleted' flag in a new column that I defined in the internal table
I am concerned about the step where I create the external table: since the data is really big, that table will be huge after some time, and I am wondering if there is a more efficient way of doing this instead of loading all the files in the folder each time.
So my question is: what is the best way to ingest data from an HDFS folder into a Hive table, given that the folder contains lots of files with lots of duplication?
I'd suggest partitioning the data by date; that way you don't have to go through all the records every time you read the table.
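A minimal sketch of that idea, with hypothetical table and column names: keep the internal table partitioned by the snapshot date, load each new file into its own partition while dropping duplicates (step 3 from the question), so reads only need to touch the latest partition.

-- Internal table, partitioned by the snapshot date of each file.
CREATE TABLE users_internal (
  user_id   BIGINT,
  user_name STRING,
  deleted   BOOLEAN
)
PARTITIONED BY (load_date STRING)
STORED AS ORC;

-- Rewrite the external data into a dated partition, dropping duplicates by user_id.
INSERT OVERWRITE TABLE users_internal PARTITION (load_date = '2020-01-20')
SELECT user_id, user_name, false AS deleted
FROM (
  SELECT user_id, user_name,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY user_id) AS rn
  FROM users_external
) t
WHERE rn = 1;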
I have a very large parquet table containing nested complex types such as structs and arrays. I have partitioned it by date and would like to restrict certain users to, say, the latest week of data.
The usual way of doing this would be to create a time-limited view on top of the table, e.g.:
CREATE VIEW time_limited_view
AS SELECT * FROM my_table
WHERE partition_date >= '2020-01-01';
This will work fine when querying the view in Hive. However, if I try to query this view from Impala, I get an error:
AnalysisException: Expr 'my_table.struct_column' in select list returns a complex type
The reason for this is that Impala does not allow complex types in the select list. Any view I build which selects the complex columns will cause errors like this. If I flatten/unnest the complex types, this would of course get around this issue. However due to the layers of nesting involved I would like to keep the table structure as is.
I see another suggested workaround has been to use Ranger row-level filtering but I do not have Ranger and will not be able to install it on the cluster. Any suggestions on Hive/Impala SQL workarounds would be appreciated
While working on a different problem I came across a kind of solution that fits my needs (but is by no means a general solution). I figured I'd post it in case anyone has similar needs.
Rather than using a view, I can simply use an external table. So firstly I would create a table in database_1 using Hive, which has a corresponding location, location_1, in hdfs. This is my "production" database/table which I use for ETL and contains a very large amount of data. Only certain users have access to this database.
CREATE TABLE database_1.tablename
(`col_1` BIGINT,
`col_2` array<STRUCT<X:INT, Y:STRING>>)
PARTITIONED BY (`date_col` STRING)
STORED AS PARQUET
LOCATION 'location_1';
Next, I create a second, external table in the same location in hdfs. However this table is stored in a database with a much broader user group (database_2).
CREATE EXTERNAL TABLE database_2.tablename
(`col_1` BIGINT,
`col_2` array<STRUCT<X:INT, Y:STRING>>)
PARTITIONED BY (`date_col` STRING)
STORED AS PARQUET
LOCATION 'location_1';
Since this is an external table, I can add/drop date partitions at will without affecting the underlying data. I can add one week's worth of date partitions to the metastore and, as far as end users can tell, that's all that is available in the table. I can even make this part of my ETL job, where each time new data is added, I add that partition to the external table and then drop the partition from a week ago, resulting in a rolling window of one week's data being made available to this user group without having to duplicate a load of data to a separate location.
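The rolling window itself is then just partition metadata maintenance on the external table, for example (with the example dates below assumed):

-- Expose the newest day and retire the oldest one; only metastore metadata changes,
-- the underlying Parquet files in location_1 are untouched.
ALTER TABLE database_2.tablename ADD IF NOT EXISTS PARTITION (date_col = '2020-01-08');
ALTER TABLE database_2.tablename DROP IF EXISTS PARTITION (date_col = '2020-01-01');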
This is by no means a row-filtering solution, but is a handy way to use partitions to expose a subset of data to a broader user group without having to duplicate that data in a separate location.
I have a bucket in S3 containing hundreds of folders, each of which contains files with the same structure: CSV representations of relational DB tables. The folders differ in the content of the data, but overlap might occur.
In each folder, I want to join 3 tables, and store the output in a dedicated table. The dedicated table should eventually hold joined data from all different folders. Duplications might occur between different folders but the records have a unique key that can help with the aggregation.
The data size for a specific folder, across all its files, can reach 5 GB of disk space. Two of the files contain hundreds of thousands of records; the third file can reach up to 20M records.
The result should be stored in AWS RDS, on a PostgreSQL instance. However, I am considering switching to Redshift. Would it be better for this scale of data?
The 3 tables are:
Providers
Consumers
Transactions
All of them are indexed by the key which is used in the join.
My approach is to iterate over the S3 bucket and, for each folder, load the 3 files into the DB. Then I create the joined table for the 3 tables using SQL, and finally add the joined data to the aggregated table that should contain the data from all folders.
I am currently trying to handle 1 folder, in order to understand better how to make the process optimal, both in time and space.
After loading, I noticed the DB uses around 2x the disk space I expected. Why does joining cost so much disk space? Is there a way of loading and joining with minimal cost? The data loaded initially for each folder is used as a staging table until I drop the duplicates and load it into the aggregated table, so its lifespan will be relatively short. I tried to use CREATE UNLOGGED TABLE but it didn't have much effect.
CREATE UNLOGGED TABLE agg_data AS
SELECT *
FROM transactions t
INNER JOIN consumers c USING (consumer_id)
INNER JOIN providers p USING (provider_id);
This works OK for one folder, time-wise. It does take a lot more disk space than I assumed it would.
How will this work at scale, for hundreds of folders? How will the aggregation behave over time, as I will need to search for duplicated records in a continuously growing table?
To summarize my questions:
How to choose between RDS and Redshift? My concerns are tens of millions of records in the target table, and the need to drop duplicates while adding new data to the target table.
Why does joining the data take so much DB storage? Is there a way to minimize it, for data that is temporary?
What is an efficient way of inserting new data to the destination table while dropping duplications?
Will it be better to join and store the files in S3 using AWS Glue, and then load them to the target db? Currently it does not seem like an option, as Glue takes forever to join the data.
I would recommend using Amazon Athena to join the files and produce the desired output.
First, each directory needs to be recognised as a table. This can be done by manually running a CREATE EXTERNAL TABLE command in Athena and pointing at the folder. All files within the folder will be treated as containing data for the table and they should all be of the same format.
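For example, one such table definition might look like this (a sketch with hypothetical column names, an assumed S3 path, and an assumed header row):

CREATE EXTERNAL TABLE providers (
  provider_id   BIGINT,
  provider_name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/folder_1/providers/'
TBLPROPERTIES ('skip.header.line.count' = '1');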
If desired, an AWS Glue crawler can instead be used to create the table definition. Create a crawler and point it to the folder. Glue will create the table definition in the AWS Glue Data Catalog, which is accessible to Athena.
Once the three input tables have been defined, you can run a query in Amazon Athena that joins the three tables and produces an output table using CREATE TABLE AS.
See: Creating a Table from Query Results (CTAS) - Amazon Athena
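A sketch of what that CTAS could look like, assuming the hypothetical table, column and output-path names below and the join keys from the question:

-- Join the three input tables and write the result to S3 as Parquet,
-- registering it as a new Athena table at the same time.
CREATE TABLE folder_1_joined
WITH (format = 'PARQUET', external_location = 's3://my-bucket/joined/folder_1/') AS
SELECT t.*, c.consumer_name, p.provider_name
FROM transactions t
JOIN consumers c ON t.consumer_id = c.consumer_id
JOIN providers p ON t.provider_id = p.provider_id;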
Glue can also be used to run ETL scripts (see Program AWS Glue ETL Scripts in Python - AWS Glue), but I haven't tried this so I can't offer advice on it. However, I have used AWS Glue crawlers to create tables that I then query via Amazon Athena.
Once you have the output data, you can then load it into the database of your choice. Which database you choose depends upon your use-case. I would suggest starting with Amazon RDS for PostgreSQL, since it is a traditional database and you seem to be comfortable with it. If you later need improved performance (eg billions of rows instead of millions), you can move to Amazon Redshift.
General comment: It is rather strange that you wish to join those 3 tables since there will presumably be a lot of duplicated data (very denormalized). You could instead simply load those tables into your desired database and then do the joins in the database, possibly being selective as to which columns you wish to include.
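On the deduplication point: if the output does end up in PostgreSQL, duplicates can be dropped at insert time with an upsert. A sketch, assuming the aggregated table has a unique constraint on the record key (here called record_id):

-- Rows whose record_id already exists in aggregated_data are silently skipped.
INSERT INTO aggregated_data
SELECT * FROM agg_data
ON CONFLICT (record_id) DO NOTHING;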
My Hive external table (non-partitioned) has data in an S3 bucket. It is an incremental table.
Until today, all the files that came to this location through an Informatica process had the same structure, and all their fields are included as columns in my incremental table.
Now I have a requirement where a new file comes to the same bucket in addition to the existing ones, but with additional columns.
I have re-created the table with the additional columns.
But when I query the table, I see that the newly added columns are blank even for the specific line items where they are not supposed to be.
Can I actually have a single external table pointing to an S3 location having different file structures?
I am trying to copy a BigQuery table using the API from one table to the other in the same dataset.
While copying big tables seems to work just fine, when copying small tables with a limited number of rows (1-10) I noticed that the destination table comes out empty (created, but with 0 rows).
I get the same results using the API and the BigQuery management console.
The issue is replicated for any table in any dataset I have. It looks like either a bug or designed behavior.
I could not find any "minimum lines" directive in the docs... am I missing something?
EDIT:
Screenshots
Original table: video_content_events with 2 rows
Copy table: copy111 with 0 rows
How are you populating the small tables? Are you perchance using streaming insert (bq insert from the command line tool, tabledata.insertAll method)? If so, per the documentation, data can take up to 90 minutes to be copyable/exportable:
https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability
I won't get super detailed, but the reason is that our copy and export operations are optimized to work on materialized files. Data within our streaming buffers are stored in a completely different system, and thus aren't picked up until the buffers are flushed into the traditional storage mechanism. That said, we are working on removing the copy/export delay.
If you aren't using streaming insert to populate the table, then definitely contact support/file a bug here.
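If the rows really are sitting in the streaming buffer, note that queries can see them before they become copyable, so a query-based copy may work as a stopgap where the copy job does not. A sketch using the table names from the question, with an assumed dataset name:

-- Recreate the destination from a query; queries read the streaming buffer,
-- unlike copy/export jobs.
CREATE OR REPLACE TABLE mydataset.copy111 AS
SELECT * FROM mydataset.video_content_events;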
There is no minimum record limit for copying a table within the same dataset or to a different dataset. This applies both to the API and to the BigQuery UI. I just replicated your scenario by creating a new table with just 2 records, and I was able to successfully copy it to another table using the UI.
Attaching screenshot
I tried to copy to a timestamp-partitioned table. I messed up the timestamp (1000 x the current timestamp), so I guess it is beyond BigQuery's max partition range. Despite the copy job succeeding, no data was actually loaded into the destination table.