Hive CSV Import Limit

I have a large CSV file with about 3.3 million rows that I have uploaded to the Hive metastore and created a table from.
However when I run a
select count(*) from table
query on it, it only shows about 1.7 million rows.
I've run a
select * from table
query and downloaded the results as a CSV; that file also only has about 1.7 million rows in it.
Is there a size limit on a CSV file that you can import into Hive and create a table from?
Any tips greatly appreciated.

I would suggest checking your file again; the scenario you describe can occur for a couple of reasons:
1.) You don't actually have that many records in the file.
2.) Some of your rows are not separated by a newline, so consecutive records are getting merged, which is why you are seeing fewer records (see the sanity check below).
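A quick way to test the second case is to look for rows whose trailing column came out NULL, which is what merged records typically look like after Hive parses them. The following is only a sketch: your_table and last_col are placeholders for your actual table and its last column.
-- total rows Hive actually parsed (placeholder table name)
select count(*) from your_table;
-- rows where the last column is NULL, often a sign that two source records
-- were merged because a newline was missing between them
select count(*) from your_table where last_col is null;
If the second count is well above zero, compare the raw line count of the source file with the first count to confirm where the rows went.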
Hope this helps...!!!

Related

Merge rows from multiple tables SQL (incremental)

I'm consolidating the information from 7 SQL databases into one.
I've made an SSIS package using a Lookup transformation, and I managed to get the expected result.
The problem: 30 million rows. I want to perform a daily task that adds the new rows from the source tables to the destination table.
As it is, it takes about 4 hours to execute the package...
Any suggestions?
Thanks!
I have only tried full cache mode...
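For context, the pattern usually suggested for the daily step described above is a watermark-based incremental insert, so that each run only touches rows changed since the previous run instead of all 30 million. The sketch below only illustrates the idea; every table and column name in it (SourceTable, DestinationTable, LoadLog, Id, ModifiedDate) is a placeholder, not taken from the post, and it assumes the source rows carry a modification timestamp.
-- Find when the last successful load ran (placeholder log table).
DECLARE @LastLoad datetime = ISNULL((SELECT MAX(LoadedAt) FROM dbo.LoadLog), '1900-01-01');

-- Copy only rows modified since then and not already in the destination.
INSERT INTO dbo.DestinationTable (Id, Col1, Col2)
SELECT s.Id, s.Col1, s.Col2
FROM dbo.SourceTable AS s
WHERE s.ModifiedDate > @LastLoad
  AND NOT EXISTS (SELECT 1 FROM dbo.DestinationTable AS d WHERE d.Id = s.Id);

-- Record this run so the next execution picks up from here.
INSERT INTO dbo.LoadLog (LoadedAt) VALUES (GETDATE());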

Row count results MISMATCH for Select * and Select count(1) for Hive External table for Big Files

I am running queries against a Hive external table. Issue:
The row count that Hive shows for 'Select * from table1' is different from 'Select count(*) from table1'. They should match, but they don't, and I'm not sure why. The results match for small data (20 MB or so) but not for big tables, e.g. 600 MB. Has anyone faced this issue?
Below are some queries I ran to show the result. My source file is an RDS file which I convert to a CSV file, upload to HDFS, and create the external table from.
Additional details: I only face this issue for big files, e.g. 200 MB or more; for small files, e.g. 80 MB, there is no issue.
SELECT count(*) FROM dbname1.cy_tablet WHERE Ranid IS NULL  -- returns zero results
We resolved the issue and all counts match now.
We removed the header row from the CSV file used as the source for the Hive external table, by writing it with col_names = FALSE:
write_delim(df_data, output_file, delim = "|", col_names = FALSE)
We also removed the following line from the CREATE EXTERNAL TABLE command:
TBLPROPERTIES('skip.header.line.count'='1')
The above steps resolved our issue.
The issue was only happening with big files. At our site the HDFS block size is 128 MB; dividing the file size by 128 MB gives the number of blocks, and the difference in row counts I was seeing matched that number. So I think the issue was with the headers.
Note: we used pipe '|' as the delimiter because we faced some other issues when using ','.
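For reference, after those two changes the table definition ends up looking roughly like the sketch below. The column list and the HDFS path are placeholders, since the original post does not show the full statement; the point is that there is no skip.header.line.count property and the field delimiter matches the pipe used when writing the file.
-- placeholder columns and location; no TBLPROPERTIES('skip.header.line.count'='1')
-- because the source file no longer contains a header row
CREATE EXTERNAL TABLE dbname1.cy_tablet (
  Ranid STRING,
  col2  STRING,
  col3  STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/path/to/hdfs/dir';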

BigQuery: faster way to insert millions of rows

I'm using the bq command line tool and trying to insert a large number of JSON files, with one table per day.
My approach:
list all files to be pushed (named by date: YYYMMDDHHMM.meta1.meta2.json)
concatenate them into one file per day => YYYMMDD.ndjson
split the YYYMMDD.ndjson file (500 lines per file) => YYYMMDD.ndjson_splittedij
loop over the YYYMMDD.ndjson_splittedij files and run
bq insert --template_suffix=20160331 --dataset_id=MYDATASET TEMPLATE YYYMMDD.ndjson_splittedij
This approach works. I just wonder if it is possible to improve it.
Again, you are confusing streaming inserts and load jobs.
You don't need to split each file into 500 rows (that applies to streaming inserts).
You can have very large files for a load job; see the command line tab examples listed here: https://cloud.google.com/bigquery/loading-data#loading_csv_files
You only have to run:
bq load --source_format=NEWLINE_DELIMITED_JSON --schema=personsDataSchema.json mydataset.persons_data personsData.json
A compressed JSON file must be under 4 GB; uncompressed, it must be under 5 TB, so larger files are better. Always try with a 10-line sample file until you get the command working.

How to import a CSV file into a MySQL database while limiting the number of rows?

I have a CSV file with 4,500,000 rows and I need a way to limit the number of rows imported into the database to, say, 100,000. How can I import only a limited number of rows into the MySQL database?
What you could do is use phpMyAdmin - it has an option to break up your CSV file into multiple requests when doing the import. Check out this article: http://www.group3solutions.com/blog/tips-on-phpmyadmin-csv-importing/
Also, you could try this application: http://www.ozerov.de/bigdump/. It worked for me before.
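If you would rather stay in plain SQL than use a tool, one option is to load the whole file into a staging table and then copy only as many rows as you want into the real table. This is only a sketch: the file path, table names, and columns are placeholders, and it assumes temporarily loading all rows into the staging table is acceptable.
-- Load the full CSV into a staging table (placeholder path and names).
LOAD DATA LOCAL INFILE '/path/to/file.csv'
INTO TABLE staging_table
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- skip the header row, if the file has one

-- Copy 100,000 rows into the final table
-- (add an ORDER BY if a specific subset matters).
INSERT INTO target_table (col1, col2)
SELECT col1, col2
FROM staging_table
LIMIT 100000;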

Google BigQuery - Error while downloading data to a table

I am trying to work with the GitHub data that has been uploaded to Google BigQuery. I ran a few queries which generated a lot of rows, e.g. the query
SELECT actor_attributes_login, repository_watchers, repository_forks FROM [githubarchive:github.timeline]
where repository_watchers > 2 and REGEXP_MATCH(repository_created_at, '2012-')
ORDER BY actor_attributes_login;
The result had more than 220,000 rows. When I attempted to download it as CSV, it said:
Download Unavailable
This result set contains too many rows for direct download. Please use "Save as Table" and then export the resulting table.
When I tried "Save as Table" I got the following error:
Access Denied: Job publicdata:job_c2338ba91e494b21970854e13cdc4b2a: RUN_JOB
Also, I ran queries where I limited the number of rows to 200 or so; even in those cases I got the error mentioned above when saving as a table, though I was able to download the results as CSV.
Any solution to this problem?
@Anerudh You don't have access to modify the publicdata samples dataset. Create a brand new dataset, and try to save your query results to a new table in that dataset.
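As an aside, if you use standard SQL rather than the legacy SQL shown in the question, you can materialize the result straight into a dataset you own with CREATE TABLE ... AS. This is only a sketch: my_dataset is a placeholder for a dataset you have created yourself, and the query is an approximate standard-SQL rewrite of the one above.
-- my_dataset must be a dataset you created and can write to
CREATE TABLE my_dataset.github_2012_results AS
SELECT actor_attributes_login, repository_watchers, repository_forks
FROM `githubarchive.github.timeline`
WHERE repository_watchers > 2
  AND REGEXP_CONTAINS(repository_created_at, '2012-');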