My file gets truncated in Hive after uploading it completely to Cloudera Hue

I am using Cloudera's Hue. In the File Browser, I upload a .csv file with about 3,000 rows (the file is small, under 400 KB).
After uploading the file I go to the Data Browser, create a table and import the data into it.
When I go to Hive and run a simple query (say SELECT * FROM table) I only see results for 99 rows, while the original .csv has far more than that.
When I do other queries I notice that several rows of data are missing although they show in the preview in the Hue File Browser.
I have tried other files and they also get truncated, sometimes at 65 rows or 165 rows.
I have also removed all the "," from the .csv data before uploading the file.

I finally solved this. There were several issues that appeared to cause the truncation.
The main one was that the column types set automatically after importing the data were inferred from the first few lines of the file. So when the values later grew from the TINYINT range into the INT range, they were truncated or turned into NULL. To solve this, do some exploratory analysis of the data and set the data types yourself before creating the table.
The other issues were that the memory I had assigned to the virtual machine slowed down the preview process, and that the .csv contained commas inside values. You can give the VM more memory and convert the .csv to a tab-separated file.
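As a rough sketch of what "set the types before creating the table" looks like in Hive (table, column and path names below are only placeholders, not from the original setup):

-- Declare INT explicitly instead of letting the import wizard infer TINYINT
-- from the first few rows; load a tab-separated copy of the file.
CREATE TABLE my_table (
  id INT,
  amount INT,
  label STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Load the tab-separated file that was uploaded through the File Browser
LOAD DATA INPATH '/user/myuser/my_file.tsv' INTO TABLE my_table;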

Related

What does this error mean: Required column value for column index: 8 is missing in row starting at position: 0

I'm attempting to upload a CSV file (which is the output of a BCP command) to BigQuery using the bq load CLI command. I have already uploaded a custom schema file (I was having major issues with autodetect).
One resource suggested this could be a datatype mismatch. However, the table from the SQL DB lists the column as a decimal, so in my schema file I have listed it as FLOAT since decimal is not a supported data type.
I couldn't find any documentation for what the error means and what I can do to resolve it.
What does this error mean? In this context, it means a value is REQUIRED for a given column index and one was not found. (By the way, columns are usually 0-indexed, so a fault at column index 8 most likely refers to column number 9.)
This can be caused by any number of different issues, two of which I ran into.
Incorrectly categorizing NULL columns as NOT NULL. After exporting the schema as JSON from SSMS, I needed to clean it up for BQ, and in doing so I mapped IS_NULLABLE:NO to MODE:NULLABLE and IS_NULLABLE:YES to MODE:REQUIRED. These values should have been reversed. This caused the error because there were NULL columns where BQ expected a REQUIRED value.
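As a small sketch of the corrected mapping in a BigQuery JSON schema file (the column names here are invented): a source column with IS_NULLABLE:NO becomes MODE:REQUIRED, and IS_NULLABLE:YES becomes MODE:NULLABLE.

[
  {"name": "customer_id", "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "balance",     "type": "FLOAT",   "mode": "NULLABLE"}
]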
Using the wrong delimiter. The file I was outputting was not only comma-delimited but also contained tabs. I was only able to spot this by importing the data with the Get Data tool in Excel, after which I saw the error caused by tabs inside the cells.
After outputting with a pipe ( | ) delimiter, I was finally able to load the file into BigQuery without any errors.
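The load command then only needs the matching delimiter flag; something like this (dataset, table, schema and file names are placeholders):

bq load --source_format=CSV --field_delimiter='|' \
  --schema=mySchema.json mydataset.mytable myExport.csv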

Glue create_dynamic_frame.from_catalog return empty data

I'm debugging an issue where create_dynamic_frame.from_catalog returns no data, even though I can view the data through Athena.
The Data Catalog points to an S3 folder containing multiple files with the same structure. The files are CSV with a space (" ") delimiter and consist of two columns (a string and a JSON string), with no header.
This is the CSV-format file.
This is the Athena query generated by the crawler.
No results are returned from the dynamic frame when debugging. Any thoughts?
Take a look at whether you have enabled the Bookmark for this job. If you are running it multiple times, you need to reset the bookmark or disable it.
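For reference, a bookmark can be reset from the AWS CLI, or bookmarks can be disabled through a job parameter; the job name below is just a placeholder:

# Reprocess everything on the next run
aws glue reset-job-bookmark --job-name my-glue-job

# Or disable bookmarks entirely by setting this job parameter on the job:
#   Key: --job-bookmark-option    Value: job-bookmark-disable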
The other thing to check is the logs. You might find an AccessDenied error there; the role that is running the job might have no access to this bucket.

BigQuery faster way to insert million of rows

I'm using the bq command line tool and trying to insert a large number of JSON files, with one table per day.
My approach:
list all files to be pushed (named by date: YYYMMDDHHMM.meta1.meta2.json)
concatenate the files from the same day => YYYMMDD.ndjson
split the YYYMMDD.ndjson file into 500-line chunks => YYYMMDD.ndjson_splittedij
loop over the YYYMMDD.ndjson_splittedij files and run
bq insert --template_suffix=20160331 --dataset_id=MYDATASET TEMPLATE YYYMMDD.ndjson_splittedij
This approach works. I just wonder if it is possible to improve it.
Again, you are confusing streaming inserts and load jobs.
You don't need to split each file into 500 rows (that applies to streaming inserts).
You can have very large files for a load job; see the command line examples listed here: https://cloud.google.com/bigquery/loading-data#loading_csv_files
You have to run only:
bq load --source_format=NEWLINE_DELIMITED_JSON --schema=personsDataSchema.json mydataset.persons_data personsData.json
A compressed JSON file must be under 4 GB; uncompressed, it must be under 5 TB, so larger files are better. Always try with a 10-line sample file until you get the command working.
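So the whole pipeline can shrink to one load job per daily file; a rough bash sketch (the per-day table naming and the schema file name are assumptions, not part of the original setup):

for f in *.ndjson; do
  day="${f%.ndjson}"   # e.g. 20160331 from 20160331.ndjson
  bq load --source_format=NEWLINE_DELIMITED_JSON \
    --schema=personsDataSchema.json \
    "MYDATASET.persons_data_${day}" "$f"
done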

Access 2012 importing a 500MB text file results in an out of disk space error

I have a 500MB txt file. It is not a CSV file (it does not have any delimiters other than spaces), so I can't import it using T-SQL. Currently I am trying with the help of Access's import specifications. I figured out how to call the specification in code and polished the code so I can import a small file (the test file was 200 KB). But now I have the actual file I have to import, and it's 500MB in size. When I run my code it gets to around 50% and then throws a "Your computer is out of disk space. You won't be able to undo this paste append. Do you want to continue anyway?" error.
I am inserting into a linked SQL table.
What can I do to get rid of this error and what exactly is causing it (I have plenty of disk space and memory capacity)?
You can bulk insert space delimited files in T-SQL like so:
BULK INSERT yourTable
FROM 'C:/<filepath>/yourTextFile.txt'
WITH
(
--Space delimited
FIELDTERMINATOR =' ',
--New rows start at a new line
ROWTERMINATOR ='\n'
--Use this if you have a header row with your column names.
--,FIRSTROW = 2
)
Are you sure you have disk space? Access does not just make up those messages.
Is the disk space actually allocated to the data and log files?
Well, I managed to solve it using T-SQL. The problem was that the columns in the file were not separated by any delimiters. So I created an import specification in Access (the one where you pick where you want each column to be) and then used that specification as the reference for writing a T-SQL procedure, which uses BULK INSERT and then pulls each "column" out of the txt file using SUBSTRING (the Access specification gives you the ranges to use in SUBSTRING).
It now works without any problems. It takes about 20 minutes to import the 500MB file, but I have a job at night that runs it, so that's not a problem.
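For anyone curious, a rough sketch of the procedure (the staging table, file path and column ranges below are made up; the real ranges come from the Access import specification):

-- Load each raw line into a one-column staging table
CREATE TABLE dbo.RawImport (RawLine VARCHAR(8000));

BULK INSERT dbo.RawImport
FROM 'C:\import\myBigFile.txt'
WITH (ROWTERMINATOR = '\n');

-- Cut the fixed-width "columns" out of each line
INSERT INTO dbo.TargetTable (Col1, Col2, Col3)
SELECT
    RTRIM(SUBSTRING(RawLine, 1, 10)),
    RTRIM(SUBSTRING(RawLine, 11, 25)),
    RTRIM(SUBSTRING(RawLine, 36, 12))
FROM dbo.RawImport;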
Thank you all for your help. This question is now closed; if you have any questions about my solution, please ask.

Importing an .RPT (6 gigs) file into SQL Server 2005

I'm trying to import two separate .RPT files into SQL Server, one small and one large. Both have issues with determining where the columns are separated.
My solution for this was to import the file into access, define the columns and then save it as a txt file.
This worked perfectly.
The problem, however, is that the larger file is 6 gigs and MS Access won't allow me to open it. When I simply change the extension to .txt and import it into SQL, everything comes in under one column (despite there being 10), and there is no way to accurately separate the data.
Please help!
As Tony stated, Access has a hard 2GB limit on database size.
You don't say what kind of file the .RPT file is. If it is a text file, then you could break it into smaller chunks by reading it line by line and appending it into temporary files. Then import/export these smaller files one at a time.
Keep in mind the 2GB limit is on the Access database, so your temporary text files will need to be somewhat smaller because the import will likely introduce some additional overhead. Also, you may need to compact/repair the database in between import/export cycles to reclaim space in the database; simply deleting the records is not enough.
If the file has column delimiters or fixed column widths you can try the following in SQL Management Studio:
Right click on a database, select "Tasks" and then "Import data...". This will take you through a wizard where you can define the source columns and map them to an existing or new table.
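Another option, if the columns really are fixed width, is to skip Access entirely and point BULK INSERT at the 6 GB file with a bcp format file. This is only a sketch with invented column names and widths; format version 9.0 matches SQL Server 2005. The format file (say, fixedwidth.fmt) might look like:

9.0
3
1   SQLCHAR   0   10   ""       1   CustomerId     ""
2   SQLCHAR   0   25   ""       2   CustomerName   ""
3   SQLCHAR   0   12   "\r\n"   3   Amount         ""

and the load itself:

BULK INSERT dbo.TargetTable
FROM 'C:\import\largeFile.txt'
WITH (FORMATFILE = 'C:\import\fixedwidth.fmt', BATCHSIZE = 100000);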