Create table in Hive from data folder in HDFS - remove duplicated rows

I have a folder in HDFS, let's call it /data/users/
Inside that folder, a new CSV file is added every 10 days. Basically, each new file contains only the users who are currently active, so, for example:
file_01Jan2020.csv: contains data for 1000 users who are currently active
file_10Jan2020.csv: contains data for 950 users who are currently active (the same data as file_01Jan2020.csv, with 50 fewer records)
file_20Jan2020.csv: contains data for 920 users who are currently active (the same data as file_10Jan2020.csv, with 30 fewer records)
In reality, these files are much bigger (~8 million records per file, decreasing by maybe 1K every 10 days). Also, a newer file will never have records that don't exist in the older files; it will just have fewer records.
I want to create a table in Hive using the data in this folder. What I am doing now is:
Create External table from the data in the folder /data/users/
Create Internal table with the same structure
Write the data from the external table to the internal table (roughly as sketched below), where:
Duplicates are removed
If a record doesn't exist in one of the newer files, I mark it as deleted by setting a 'deleted' flag in a new column that I defined in the internal table
I am concerned about the step where I create the external table: since the data are really big, that table will become huge after some time, and I was wondering if there is a more efficient way of doing this instead of loading all the files in the folder every time.
So my question is: what is the best possible way to ingest data from an HDFS folder into a Hive table, given that the folder contains lots of files with lots of duplicates?
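Roughly, the current flow looks like this (a minimal sketch; table names, column names, and file layout such as users_ext, users, user_id, and deleted are placeholders):

    -- External table over the raw CSV files in /data/users/
    CREATE EXTERNAL TABLE users_ext (
      user_id   BIGINT,
      user_name STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/users/';

    -- Internal table with the extra 'deleted' column
    CREATE TABLE users (
      user_id   BIGINT,
      user_name STRING,
      deleted   BOOLEAN
    )
    STORED AS ORC;

    -- Deduplicate: keep one row per user; flagging users that are absent
    -- from the most recent file as deleted happens in a separate step.
    INSERT OVERWRITE TABLE users
    SELECT user_id, MAX(user_name), false
    FROM users_ext
    GROUP BY user_id;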

I'd suggest partitioning the data by date; that way you don't have to go through all the records every time you read the table.
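For example, a rough sketch (assuming each new file can be dropped into its own per-date subdirectory; names are placeholders):

    -- External table partitioned by the snapshot date of each file
    CREATE EXTERNAL TABLE users_ext (
      user_id   BIGINT,
      user_name STRING
    )
    PARTITIONED BY (snapshot_date STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/users/';

    -- Register each new file's directory as a partition
    ALTER TABLE users_ext ADD PARTITION (snapshot_date = '2020-01-20')
    LOCATION '/data/users/snapshot_date=2020-01-20/';

    -- Reading only the newest snapshot scans one partition, not every file
    SELECT * FROM users_ext WHERE snapshot_date = '2020-01-20';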

Related

How to maintain history data whose schema changes quarterly using Hadoop

I have a JSON input file which stores survey data (feedback from the customers).
The columns in the JSON file can vary: for example, in the first quarter there can be 70 columns, and in the next quarter it can have 100 columns, and so on.
I want to store all this quarterly data in the same table on HDFS.
Is there a way to maintain history, for example by dropping and re-creating the table with the changing schema?
How will it behave if the number of columns goes down, say in the 3rd quarter we get only 30 columns?
The first point is that in HDFS you don't store tables, just files; you create tables in Hive, Impala, etc. on top of those files.
Some formats support schema merging at read time, for example Parquet.
In general, you will be able to recreate your table with a super-set of columns. In Impala you have similar capabilities for schema evolution.
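A rough illustration in Hive (table and column names are made up; this assumes Parquet-backed files, where columns missing from older files simply read as NULL):

    -- Initial table over the first quarter's files
    CREATE EXTERNAL TABLE survey_data (
      customer_id STRING,
      q1_answer   STRING
    )
    STORED AS PARQUET
    LOCATION '/data/survey/';

    -- Next quarter, extend the schema with the new columns
    ALTER TABLE survey_data ADD COLUMNS (q2_answer STRING, q3_answer STRING);

    -- If the column count shrinks later (e.g. only 30 columns arrive), keep
    -- the super-set schema: the missing columns also just read as NULL.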

Add column to a huge table

I have a table that has around 13 billion records. The size of this table is around 800 GB. I want to add a column of type tinyint to the table, but it takes a lot of time to run the ADD COLUMN command. Another option would be to create another table with the additional column and copy data from the source table to the new table using BCP (data export and import), or copy the data directly to the new table.
Is there a better way to achieve this?
My preference for tables of this size is to create a new table and then batch the records into it (BCP, Bulk Insert, SSIS, whatever you like). This may take longer, but it keeps your log from blowing out. You can also do the most relevant data (say, the last 30 days) first, swap out the table, then batch in the remaining history so that you can take advantage of the new column immediately... if your application lines up with that strategy.
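A minimal T-SQL sketch of the batching idea (table names, column names, and the batch size are all assumptions):

    -- New table with the extra tinyint column
    CREATE TABLE dbo.BigTable_New (
        Id      BIGINT       NOT NULL PRIMARY KEY,
        Payload VARCHAR(200) NULL,
        NewFlag TINYINT      NOT NULL DEFAULT 0   -- the added column
    );

    DECLARE @BatchSize INT = 500000, @LastId BIGINT = 0, @Rows INT = 1;

    -- Copy in batches so each transaction (and the log) stays small
    WHILE @Rows > 0
    BEGIN
        INSERT INTO dbo.BigTable_New (Id, Payload, NewFlag)
        SELECT TOP (@BatchSize) s.Id, s.Payload, 0
        FROM dbo.BigTable AS s
        WHERE s.Id > @LastId
        ORDER BY s.Id;

        SET @Rows = @@ROWCOUNT;
        SELECT @LastId = MAX(Id) FROM dbo.BigTable_New;
    END;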

Hive external table(non partitioned) with different file structures in the same location

My Hive external table (non-partitioned) has data in an S3 bucket. It is an incremental table.
Until today, all the files that come to this location through an Informatica process were of the same structure, and all the fields are included as columns in my incremental table structure.
Now I have a requirement where a new file comes to the same bucket in addition to the existing ones, but with additional columns.
I have re-created the table with the additional columns (roughly as in the sketch below).
But when I query the table, I see that the newly added columns are still blank, even for those specific line items where they are not supposed to be.
Can I actually have a single external table pointing to an S3 location having different file structures?
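For reference, the setup is roughly like this (a sketch only; bucket, table, and column names are placeholders):

    -- Old files:  col_a,col_b        New files:  col_a,col_b,col_c,col_d
    CREATE EXTERNAL TABLE incr_table (
      col_a STRING,
      col_b STRING,
      col_c STRING,   -- new columns appended at the end of the schema
      col_d STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3://my-bucket/incremental/';

    -- A delimited text table maps fields by position, so rows from the old
    -- two-column files read NULL for col_c and col_d, while rows from the
    -- new file should populate them, provided the new columns really are
    -- appended after the existing ones in the same order.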

BigQuery "copy table" not working for small tables

I am trying to copy a BigQuery table using the API from one table to the other in the same dataset.
While copying big tables seems to work just fine, when copying small tables with a limited number of rows (1-10) I noticed that the destination table comes out empty (created, but with 0 rows).
I get the same results using the API and the BigQuery management console.
The issue is replicated for any table in any dataset I have. It looks like either a bug or designed behavior.
I could not find any "minimum rows" directive in the docs... am I missing something?
EDIT:
Screenshots
Original table: video_content_events with 2 rows
Copy table: copy111 with 0 rows
How are you populating the small tables? Are you perchance using streaming insert (bq insert from the command line tool, tabledata.insertAll method)? If so, per the documentation, data can take up to 90 minutes to be copyable/exportable:
https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability
I won't get super detailed, but the reason is that our copy and export operations are optimized to work on materialized files. Data within our streaming buffers is stored in a completely different system, and thus isn't picked up until the buffers are flushed into the traditional storage mechanism. That said, we are working on removing the copy/export delay.
If you aren't using streaming insert to populate the table, then definitely contact support/file a bug here.
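As a rough sketch (assuming the CREATE TABLE ... COPY DDL statement is available to you; the dataset name is a placeholder), once any streaming buffer has flushed, the copy could also be expressed in SQL:

    -- Sketch only; 'mydataset' is a placeholder. A DDL copy is subject to the
    -- same streaming-buffer delay as a copy job submitted via the API or UI.
    CREATE TABLE mydataset.copy111
    COPY mydataset.video_content_events;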
There is no minimum record limit for copying a table within the same dataset or to a different dataset. This applies both to the API and the BigQuery UI. I just replicated your scenario by creating a new table with just 2 records, and I was able to successfully copy the table to another table using the UI.
Attaching screenshot
I tried to copy to a timestamp-partitioned table. I messed up the timestamp and ended up with 1000 x the current timestamp, which I guess is beyond BigQuery's maximum partition range. Despite the copy job succeeding, no data was actually loaded into the destination table.

Regarding HIVE Table Storage

Can anyone please help me understand the point below?
I have created one Hive table which is not a partitioned table, but I am working on a 10-node cluster. In this case, will the data of that table (a large table) be spread across different data nodes, or will it sit on only one node?
If it is spread across different data nodes, then how can we see only one file under the /hive/warehouse folder?
Also, please give a little idea of how storage is allocated for a partitioned table.
The data for the table and the metadata of the table are different things.
The data for the table, which is basically just a file in HDFS, will be stored as per HDFS rules (that is, based on your configuration, a file will be split into n blocks and stored in a distributed fashion on the datanodes).
In your case, the data for one Hive table (a file, or any number of files) will be distributed among all 10 nodes in the cluster.
Also, this distribution is done at the block level and is not visible at the user level.
You can easily check the number of blocks created for the file in the Web UI.
A partitioned table just adds more directories inside the table directory in HDFS, so it follows the same rules.
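A small illustration (table, column, and partition names are made up):

    CREATE TABLE sales (
      order_id BIGINT,
      amount   DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS ORC;

    -- Each partition becomes a subdirectory under the table directory, e.g.
    --   /user/hive/warehouse/sales/order_date=2020-01-01/
    --   /user/hive/warehouse/sales/order_date=2020-01-02/
    -- and the files inside each subdirectory are still split into HDFS
    -- blocks and replicated across the datanodes like any other file.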