I have a string column in Hive which contains mails (as a passage). I would like to process this column for text mining. It is not picking up the data; everything is loaded as nulls. Are there any limitations on the Pig chararray type? Please suggest how to approach this task. Thank you
It appears that you want to use a chararray, but that the data is actually being forced into a double. This can happen explicitly or implicitly (for instance, when you try to use a numeric function like MAX on it).
Try to reduce your code to a minimum, then do a dump after each step and debug the following steps until you achieve success:
Load the data and make sure the right datatype is used: check whether it is still a chararray (if it has already been cast to a double, it should not show any text)
Do a describe; if your column is not yet a chararray, cast it to a chararray (and check the result)
Do your string operations (and check the result)
Write to your output
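As a minimal Pig Latin sketch of those steps (the table name mails, the column body and the HCatalog loader are assumptions; adjust them to your setup):

emails = LOAD 'mails' USING org.apache.hive.hcatalog.pig.HCatLoader();
DESCRIBE emails;                                                -- verify that body is a chararray
emails_cast = FOREACH emails GENERATE (chararray)body AS body;  -- explicit cast, just in case
DUMP emails_cast;                                               -- the text should still be visible here
cleaned = FOREACH emails_cast GENERATE LOWER(body) AS body;     -- example string operation
STORE cleaned INTO 'mails_out' USING PigStorage('\t');

A DUMP after each relation shows exactly at which step the text turns into nulls.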
Please help,
How could I extract 2019-04-02 out of the following string with Azure data flow expression?
ABC_DATASET-2019-04-02T02:10:03.5249248Z.parquet
The first part of the string, received as a ChildItem from a GetMetaData activity, is dynamic. So in this case it is ABC_DATASET that is dynamic.
Kind regards,
D
There are several ways to approach this problem, and they are really dependent on the format of the string value. Each of these approaches uses Derived Column to either create a new column or replace the existing column's value in the Data Flow.
Static format
If the format is always the same, meaning the length of the sections is always the same, then substring is simplest:
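For example, assuming the filename arrives in a column called fileName (the offsets match the sample value above and would need adjusting for a different layout):

substring(fileName, 13, 10)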
This will parse the date portion out of the string, giving 2019-04-02 for the sample value above.
Useful reminder: substring and array indexes in Data Flow are 1-based.
Dynamic format
If the format of the base string is dynamic, things get a tad trickier. For this answer, I will assume that the basic format of {variabledata}-{timestamp}.parquet is consistent, so we can use the hyphen as a base delineator.
Derived Column has support for local variables, which is really useful when solving problems like this one. Let's start by creating a local variable to convert the string into an array based on the hyphen. This will lead to some other problems later since the string includes multiple hyphens thanks to the timestamp data, but we'll deal with that later. Inside the Derived Column Expression Builder, select "Locals":
On the right side, click "New" to create a local variable. We'll name it and define it using a split expression:
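For instance, the split local (its name fileArray, and the incoming column name fileName, are hypothetical here) could be defined as:

split(fileName, '-')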
Press "OK" to save the local and go back to the Derived Column. Next, create another local variable for the yyyy portion of the date:
The cool part of this is I am now referencing the local variable array that I created in the previous step. I'll follow this pattern to create a local variable for MM too:
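Again with the same assumed names, the MM local could be:

:fileArray[3]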
I'll do this one more time for the dd portion, but this time I have to do a bit more to get rid of all the extraneous data at the end of the string. Substring again turns out to be a good solution:
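With the same assumptions, the first two characters of the fourth element give the day, and the rest (the time plus the .parquet suffix) is discarded:

substring(:fileArray[4], 1, 2)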
Now that I have the components I need isolated as variables, we just reconstruct them using string interpolation in the Derived Column:
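A sketch of that final expression, assuming the three locals were named yyyy, MM and dd:

"{:yyyy}-{:MM}-{:dd}"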
Back in our data preview, the new column now shows 2019-04-02 for the sample row.
Where else to go from here
If these solutions don't address your problem, then you have to get creative. Here are some other functions that may help:
regexSplit
left
right
dropLeft
dropRight
The MemSQL pipeline is supposed to dump data from S3 into a columnstore table. The source files are in ORC format; they are then converted to Parquet.
The files have certain columns with DATE datatype (yyyy-mm-dd).
The pipeline runs fine but inserts NULL into all the Date type columns.
The DATE values may be getting written to Parquet as int64 with a timestamp logical type annotation (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp). MemSQL doesn't currently automatically convert these to a format compatible with e.g. DATETIME or TIMESTAMP, but rather attempts to assign to the destination column as if by assigning an integer literal with the raw underlying value. This gives NULL rather than an error for MySQL compatibility reasons, though set global data_conversion_compatibility_level="7.0" will make it an error.
You can investigate by temporarily giving the problem column TEXT type and looking at the resulting value. If it's an integer string, the issue is as described above and you can use the SET clause of CREATE PIPELINE to transform the value to a compatible format via something like CREATE PIPELINE P AS LOAD DATA .... INTO TABLE T(#col_tmp <- parquet_field_name) SET col = timestampadd(microsecond, #col_tmp, from_unixtime(0));.
The value will be a count of some time unit since the unix epoch in some time zone. The unit and time zone depend on the writer, but should become clear if you know which time it's supposed to represent. Once you know that, modify the expression above to correct for units and perhaps call convert_tz as necessary.
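For example, if the raw value turned out to be milliseconds since the epoch in UTC while the destination column should hold UTC+2, the SET clause above could be adjusted along these lines (the unit factor and the offsets are purely illustrative):

SET col = convert_tz(timestampadd(microsecond, #col_tmp * 1000, from_unixtime(0)), '+00:00', '+02:00')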
Yes, it's a pain. We'll be making it automatic.
What is the best way to determine the datatype of a column value if the data has already been loaded and has been classified as STRING (i.e. the BQ table metadata has "STRING" as the datatype for every column)? I've found a few different methods, but I'm not sure if I'm missing any, or whether any of these is substantially more performant. The result should include statistics at the grain of each value, not just per column.
Using a combination of CASE and SAFE_CAST on the STRING value to sum up all the instances where it could successfully be cast to X datatype (where X is any datatype, like INT64 or DATETIME, with a few lines in the query repeating the SAFE_CAST to cover all potential datatypes); see the sketch after this list
Similar to above, but using REGEXP_CONTAINS instead of SAFE_CAST on every value and summing up all instances of TRUE (a community UDF also seems to tackle this: https://github.com/GoogleCloudPlatform/bigquery-utils/blob/master/udfs/community/typeof.sql)
(For the above, COUNTIF(), IF statements, etc. can also be used.)
Loading data into a pandas dataframe and using something like pd.api.types.infer_dtype to infer automatically, but this adds overhead and more components
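A minimal sketch of the first approach (the table my_dataset.my_table and column col1 are placeholders, and the list of candidate types is not exhaustive):

SELECT
  CASE
    WHEN SAFE_CAST(col1 AS INT64) IS NOT NULL THEN 'INT64'
    WHEN SAFE_CAST(col1 AS FLOAT64) IS NOT NULL THEN 'FLOAT64'
    WHEN SAFE_CAST(col1 AS DATE) IS NOT NULL THEN 'DATE'
    WHEN SAFE_CAST(col1 AS DATETIME) IS NOT NULL THEN 'DATETIME'
    ELSE 'STRING'
  END AS inferred_type,
  COUNT(*) AS value_count
FROM my_dataset.my_table
GROUP BY inferred_type

(The INT64 branch has to come before FLOAT64, since anything castable to INT64 also casts to FLOAT64.)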
Thanks!
I am doing concat and cast operation inside spark SQL query as follows:
spark.sql ("select cast(concat(df_view.col1," ") as Long) as new_col from df_view")
But I am getting null values in the resulting DF. If I perform just the cast or just the concat operation, I get the correct results, but with both operations together I get null values.
Please suggest if I'm missing something in the syntax. I checked other answers but couldn't figure out the issue; also, I am using only Spark SQL here, not DF syntax operations.
If you are writing the file as text then just don't cast it to Long, and preferably use pad functions to make sure you are writing the right width.
I take from the comments that your issue is fixed-width files, but as a general thing it makes no sense to concat a space onto a value and then try to cast the result as a number. You've explicitly made it not a number beforehand.
Ideally you deal with the file format as a file format and not by arbitrarily manipulating each field, however the latter can work if you handle each field correctly.
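As a rough Spark SQL sketch (the column name and the width of 12 are made up; adjust them to your fixed-width layout), padding instead of appending a space could look like:

spark.sql("select rpad(cast(df_view.col1 as string), 12, ' ') as col1_fixed from df_view")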
I need help with resolving characters of an unknown type from a database field into a readable format, because I need to overwrite this value at database level with another valid value (in the exact format the application stores it in) to automate system copy activities.
I have a proprietary application that also allows users to configure it via the frontend. This configuration data gets stored in a table, and the values of a configuration property are stored in a column of type "BLOB". For the desired value here, I provide a valid URL in the application frontend (like http://myserver:8080). However, what gets stored in the database is not readable (some square characters). I tried all sorts of HANA conversion functions (HEX, binary), both simple and cascaded (e.g. first to binary, then to varchar), to make it readable. I also tried it the other way around, converting the value that I want to insert into the stored format (conversion to BLOB over hex or binary), but this does not work either. I copied the value to the clipboard and compared it to all sorts of character set tables (although I am not sure if this can work at all).
My conversion tries look somewhat like this:
SELECT TO_ALPHANUM('') FROM DUMMY;
while the brackets would contain the characters in question. I can't even print them here.
How can one approach this and maybe find out the character set that is used by this application? I would be grateful for some more ideas.
What you have in your BLOB column is a series of bytes. As you mentioned, these bytes have been written by an application that uses an unknown character set.
In order to interpret those bytes correctly, you need to know the character set as this is literally the mapping of bytes to characters or character identifiers (e.g. code points in UTF).
Now, HANA doesn't come with a whole lot of options to work on LOB data in the first place and for C(haracter)LOB data most manipulations implicitly perform a conversion to a string data type.
So, what I would recommend is to write a custom application that is able to read out the BLOB bytes and perform the conversion in that custom app. Once successfully converted into a string, you can store the data in a new NCLOB field that keeps it in UTF-8 encoding.
You will have to know the character set in the first place, though. No way around that.
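As a rough, stdlib-only Python sketch of what that conversion step in a custom app might look like (the candidate encodings are guesses; the real character set still has to be identified first, as said above):

CANDIDATE_ENCODINGS = ["utf-16-le", "utf-16-be", "utf-8", "cp1252"]  # guesses, not known facts

def decode_blob(raw: bytes) -> str:
    # Try each candidate encoding and keep the first result that looks like
    # the URL the application frontend originally stored.
    for enc in CANDIDATE_ENCODINGS:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue
        if text.startswith("http"):
            return text
    raise ValueError("none of the candidate encodings produced a readable URL")

The write-back would then re-encode with the same character set (text.encode(enc)) so the application can still read the value.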
I assume you are on Oracle. You can convert BLOB to CLOB as described here.
http://www.dba-oracle.com/t_convert_blob_to_clob_script.htm
In case of your example try this query:
select UTL_RAW.CAST_TO_VARCHAR2(DBMS_LOB.SUBSTR(<your_blob_value>)) from dual;
Obviously this only works for values below 32767 characters.