How to validate data types in pig? - apache-pig

I have been trying to validate the data types of data loaded from a flat file in Pig.
A simple cat would do the trick, but the flat files are huge and sometimes contain special characters.
I need to filter out records that contain special characters, as well as records where the value is not an int.
Is there any way to do this in Pig?
I am looking for a substitute for the getType().getName() kind of usage in Java.
Currently we enforce a schema while loading, use DESCRIBE, and then remove the mismatches. Is there any way to do it without enforcing the schema?
Any suggestions will be helpful.

Load each line as a chararray and use a regular expression to filter out the records that contain characters other than digits:
A = LOAD 'data.txt' AS (line:chararray);
B = FILTER A BY (line matches '\\d+$'); -- Change according to your needs.
DUMP B;
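To try the pattern on a small sample before running the Pig job, the same whole-line test can be sketched in Python. This is only an illustration and not part of the Pig script: `is_integer_line` and `keep_integer_lines` are hypothetical helpers, and the sketch assumes, as with Pig's `matches` operator, that the pattern has to cover the entire line.

```python
import re

# Hypothetical helpers mirroring the Pig filter above: a line is kept only
# if the whole line consists of ASCII digits. re.fullmatch requires the
# pattern to cover the entire string.
DIGITS_ONLY = re.compile(r"[0-9]+")

def is_integer_line(line: str) -> bool:
    return DIGITS_ONLY.fullmatch(line) is not None

def keep_integer_lines(lines):
    """Return only the lines made up entirely of ASCII digits."""
    return [line for line in lines if is_integer_line(line)]

sample = ["12345", "12a45", "9#9", "", "007"]
print(keep_integer_lines(sample))
```

Empty lines and lines with any non-digit character are dropped; widen the character class if signs or decimals should be accepted.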

Related

How does one create and load a table into BigQuery with $ value?

I have a CSV with dollar amounts for various products. The column in the CSV already contains the "$" in front of the number. Since the values are not wrapped in double quotes, I assume they can't be treated as strings. How do I load this CSV into a BigQuery table?
You can load your CSV file directly into BigQuery. If you use the schema autodetect option, the column with numbers starting with "$" might be detected as a numeric column.
If it is detected as a string column instead, you can follow @GordonLinoff's advice and transform it directly in BigQuery.
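If cleaning the file before upload is preferred over transforming inside BigQuery, the conversion itself is small. The following is a minimal sketch of that alternative, not the approach the answer recommends; `parse_dollar_amount` is a hypothetical helper, and it assumes amounts look like `$1,234.50` (a leading "$" and optional thousands separators).

```python
from decimal import Decimal, InvalidOperation

def parse_dollar_amount(text: str) -> Decimal:
    """Convert a string such as '$1,234.50' to a Decimal.

    Assumes '$' and ',' are the only decoration around the digits.
    """
    cleaned = text.strip().replace("$", "").replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        raise ValueError(f"not a dollar amount: {text!r}")

print(parse_dollar_amount("$1,234.50"))
```

Decimal is used rather than float so that currency values are not rounded during the cleanup.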

Merge multiple CSV files with different header titles in Objective-C

I have multiple CSV files with different header titles, and I want to merge all of them and keep the combined set of header titles.
I think I can convert each CSV to an array, compare the header titles across all files, and then merge the files. However, it seems this would take a huge amount of processing time because of the many loops involved. Could you suggest a faster solution?
Example:
file1.csv
No,Series,Product,A,B,C,D,E
1, AAA, XX, a1,b1,c1,d1,e1
file2.csv
No,Series,Product,A,C,D,B,E,F,G
1, AAB, XX, a1,c1,d1,b1,e1,f1,g1
file3.csv
No,Series,Product,A,A1,A2,C,D,B1,B,E
1, AAC, XX, a1,a11,a21,c1,d1,b11,b1,e1
My expected merge file is:
merge.csv
No,Series,Product,A,A1,A2,B,B1,C,D,E,F,G
1, AAA, XX, a1,0,0,b1,0,c1,d1,e1,0,0
1, AAB, XX, a1,0,0,b1,0,c1,d1,e1,f1,g1
1, AAC, XX, a1,a11,a21,b1,b11,c1,d1,e1,0,0
Data that is not available in a column should show as "0", "NA", etc.
From your comment it seems you have no code yet but expect your sketch to be slow. That sounds like premature optimisation: code your algorithm and test it; if it's slow, use Instruments to see where the time is being spent, and then look at optimisation.
That said some suggestions:
You need to decide whether you are supporting general CSV files, where field values may contain commas, newlines or double quotes, or simple CSV files where none of those characters is present in a field. See section 2 of Common Format and MIME Type for Comma-Separated Values (CSV) Files (RFC 4180) for what you need to parse to support general CSV files, and remember you need to output values using the same convention. If you stick with simple CSV files then NSString's componentsSeparatedByString: and NSArray's componentsJoinedByString: may be all you need to parse and output respectively.
Consider first iterating over your files reading just the header rows, parse those, and produce the merged list of headers. You will need to preserve the order of the headers, so you can pair them up with the data rows, so arrays are your container of choice here. You may choose to use sets in the merging process, but the final merged list of headers should also be an array in the order you wish them to appear in the merged file. You can use these arrays of headers directly in the dictionary methods below.
Using a dictionary as in your outline is one approach. In this case look at NSDictionary's dictionaryWithObjects:forKeys: for building the dictionary from the parsed header and record. To output the dictionary, look at objectsForKeys:notFoundMarker:, passing the merged list of headers. This supports missing keys, and you supply the value to insert. For standard CSVs the missing value is empty (i.e. two adjacent commas in the text) but you can use NA or 0 as you suggest.
You can process each file in turn, a row at a time: read, parse, make dictionary, get an array of values back from dictionary with the missing value in the appropriate places, combine, write. You never need to hold a complete file in memory at any time.
If, after implementing your code using dictionaries to easily handle the missing columns, you find it is too slow, you can then look at optimising. Instead of breaking each input data row into fields and then recombining them with the missing columns added, you might consider doing direct string replacement on the text of the data row, adding extra delimiters as needed. For example, if column four is missing you can replace the third comma with two commas to insert the missing column.
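The header-then-rows flow described above can be sketched in a language-neutral way. The sketch below uses Python rather than Objective-C, purely to show the shape of the algorithm: `csv.DictReader` plays the role of dictionaryWithObjects:forKeys:, and `DictWriter`'s `restval` plays the role of the notFoundMarker. The function names and the in-memory sample data are hypothetical, and the merged header is kept in first-seen order, which is an assumption; reorder it if a different column order is wanted.

```python
import csv
import io

def merged_header(header_rows):
    """Union of all headers, kept in first-seen order."""
    merged, seen = [], set()
    for header in header_rows:
        for name in header:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged

def merge_csv(sources, out, missing="0"):
    """Merge seekable CSV text streams with differing headers into `out`.

    Pass 1 reads only the header row of each source; pass 2 streams the
    data rows one at a time, so no file is held in memory as a whole.
    """
    headers = []
    for src in sources:
        headers.append(next(csv.reader(src, skipinitialspace=True)))
        src.seek(0)  # rewind so pass 2 starts at the header again
    fieldnames = merged_header(headers)

    # restval fills in any column a given row does not have.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval=missing)
    writer.writeheader()
    for src in sources:
        # DictReader pairs each row with its own file's header.
        for row in csv.DictReader(src, skipinitialspace=True):
            writer.writerow(row)
    return fieldnames

# Tiny in-memory stand-ins for two input files (hypothetical data).
file1 = io.StringIO("No,Series,Product,A,B\n1, AAA, XX, a1,b1\n")
file2 = io.StringIO("No,Series,Product,A,C,B\n1, AAB, XX, a1,c1,b1\n")
merged = io.StringIO()
columns = merge_csv([file1, file2], merged)
print(merged.getvalue())
```

Because each row is looked up by header name, the differing column order between files needs no special handling.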
If after designing your algorithm and coding it you hit problems you can ask a new question, include your algorithm and code, a link back to this question so people can follow the history, and explain what your issue is. Someone will undoubtedly help you take the next step.
HTH

Purpose of a JSON schema file when loading data into BigQuery from a CSV file

Can someone please explain the purpose of providing a JSON schema file when loading a file into a BigQuery table with the bq command? What are the advantages?
Does this file help maintain data integrity by preventing column swaps?
Regards,
Sreekanth
Specifying a JSON schema, instead of relying on auto-detect, means that you are guaranteed to get the expected types for each column being loaded. If you have data that looks like this, for example:
1,'foo',true
2,'bar',false
3,'baz',true
Schema auto-detection would infer that the type of the first column is an INTEGER (a.k.a. INT64). Maybe you plan to load more data in the future, though, that looks like this:
3.14,'foo',true
1.59,'bar',false
-2.001,'baz',true
In that case, you probably want the first column to have type FLOAT (a.k.a. FLOAT64) instead. If you provide a schema when you load the first file, you can specify a type of FLOAT for that column explicitly.
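As an illustration, a schema file for the three-column sample above could be generated as follows. This is a sketch under stated assumptions: the column names `value`, `label` and `flag` are hypothetical (the sample data has no header), and the NULLABLE mode is a guess at what is wanted.

```python
import json

# Hypothetical column names; the sample rows above are unnamed.
# The first column is declared FLOAT explicitly so that a later file
# containing 3.14 loads into the same table without a type conflict.
schema = [
    {"name": "value", "type": "FLOAT", "mode": "NULLABLE"},
    {"name": "label", "type": "STRING", "mode": "NULLABLE"},
    {"name": "flag", "type": "BOOLEAN", "mode": "NULLABLE"},
]

schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

Saved to a file such as schema.json, this text can be passed as the schema argument of the load command, for example `bq load --source_format=CSV mydataset.mytable data.csv ./schema.json`, where the dataset, table and file names are placeholders.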

How to load a TSV containing an Array JSON field in Pig Latin

I am trying to load a file with a schema that is primarily tab separated values, but one of the fields is an ARRAY JSON value.
Each row of the data looks like:
date\tuser_id\tproducts
The products field is a string containing a JSON array.
What is the easiest way to load this data?
One way is to load the TSV file with the standard LOAD command and write a UDF to parse the JSON field.
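The parsing logic such a UDF would need is short. The following is a minimal sketch of that logic in plain Python, not a ready-made Pig UDF: `parse_row` is a hypothetical name, the sample values are made up, and it assumes exactly three tab-separated fields with the JSON array last.

```python
import json

def parse_row(line: str):
    """Split a 'date<TAB>user_id<TAB>products' line and decode products."""
    # maxsplit=2 keeps any tab inside the JSON text in the third field.
    date, user_id, products_json = line.rstrip("\n").split("\t", 2)
    products = json.loads(products_json)
    if not isinstance(products, list):
        raise ValueError("products field is not a JSON array")
    return date, user_id, products

# Illustrative row; the real field contents are not shown in the question.
row = parse_row('2015-01-01\tu42\t[{"id": 1}, {"id": 2}]\n')
print(row)
```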

Writing on HDFS messed the data

I was trying to save the output of a Hive query on HDFS but the data got changed. Any idea?
See the original data and the changed data below (remove the space before the file name in each link):
[Correct]: i.stack.imgur.com/ DLNTT.png
[Messed up]: i.stack.imgur.com/ 7WIO3.png
Any feedback would be appreciated.
Thanks in advance.
It looks like you are importing an array into Hive, which is one of the available complex types. Internally, Hive separates the elements in an array with the ASCII character 002. If you consult an ASCII table, you can see that this is the non-printable character "start of text". It looks like your terminal does actually print the non-printable character, and by comparing the two images you can see that 002 does indeed separate every item of your array.
Similarly, Hive will separate every column in a row with ASCII 001, and it will separate map keys/values and structure fields/values with ASCII 003. These values were chosen because they are unlikely to show up in your data. If you want to change this, you can manually specify delimiters using ROW FORMAT in your CREATE TABLE statement. Be careful though: if you switch the collection items terminator to something like a comma, then any commas in your input will look like collection terminators to Hive.
Unless you need to store the data in human-readable form and are sure there is a printable character that will not collide with your terminators, I would leave them as they are. If you need to read the HDFS files you can always run hadoop fs -cat /exampleWarehouseDir/exampleTable/* | tr '\002' '\t' to display array items separated by tabs. If you write a MapReduce or Pig job against the Hive tables, just be aware of what your delimiters are. Learning how to write and read Hive tables from MapReduce was how I learned about these terminators in the first place. And if you are doing all of your processing in Hive, you shouldn't ever have to worry about what the terminators are (unless they show up in your input data).
Now this would explain why you would see ASCII 002 popping up if you were reading the file contents directly from HDFS, but it looks like you are seeing it in the Hive command line interface, which should be aware of the collection terminators (and therefore use them to separate the elements of the array instead of printing them). My best guess is that the schema is specified incorrectly and the column in the result table is a string where you meant it to be an array. This would explain why it went ahead and printed the ASCII 002s instead of using them as collection terminators.
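For a job that reads these files outside Hive, splitting on the default delimiters might look like the sketch below. It assumes a text-format table with Hive's default terminators as described above; the function names and the sample row are hypothetical.

```python
COLUMN_SEP = "\x01"  # ASCII 001, default separator between columns
ITEM_SEP = "\x02"    # ASCII 002, default separator between array items

def split_hive_line(line: str):
    """Split one row into columns, and each column into its items."""
    columns = line.rstrip("\n").split(COLUMN_SEP)
    return [col.split(ITEM_SEP) for col in columns]

def to_readable(line: str) -> str:
    """Replace the item separator with a tab, like the tr command above."""
    return line.replace(ITEM_SEP, "\t")

# Made-up row: one scalar column followed by a three-item array column.
raw = "k1\x01a\x02b\x02c\n"
print(split_hive_line(raw))
print(to_readable(raw))
```

A scalar column comes back as a one-element list, so the caller has to know from the table schema which columns are arrays.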