I want to query data that is stored in MongoDB and exported into a number of JSON files stored in S3.
I am using AWS Glue to read the files into Athena; however, the data type for the id on each table is imported as struct<$oid:string>.
I have tried every variation of adding quotation marks around the fields with no luck. Everything I try results in the error "name expected at the position 7 of 'struct<$oid:string>' but '$' is found".
Is there any way I can read these tables in their current form, or do I need to declare their types in Glue?
Glue Crawlers create schemas that match what they find, without considering whether they will work with, for example, Athena. In Athena you can't have a struct property whose name starts with $, but Glue doesn't take that into account – partly because you might be using the table with something else where that isn't a problem, and partly because what else could it do? That is the name of the property.
There are two ways around it, but neither will work if you continue to use a crawler. You will need to modify the table schema, and if you keep running the crawler it will just revert your change.
The first, and probably simplest, option is to change the type of the column to STRING and then use a JSON function at query time to extract the value using JSONPath ($ is a special character in JSONPath, but you should be able to escape it).
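A minimal sketch of that approach, assuming the column has been re-typed to string so the raw JSON object (e.g. {"$oid": "..."}) is available at query time; the table and column names here are made up, and bracket notation is used to escape the $ in the JSONPath:
-- Athena/Presto: pull the value of the "$oid" key out of the JSON stored in the id column.
SELECT json_extract_scalar(id, '$["$oid"]') AS object_id
FROM my_table
LIMIT 10;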
The second option is to use the "mappings" feature of the Hive JSON serde. I'm not 100% sure if it will work for this case, but it could be worth a try. The docs are not very extensive on how to configure it, unfortunately.
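If you want to try it, the idea would look roughly like the DDL below. This is only a sketch: the table, column, and S3 location names are invented, and whether the mapping is honoured for a nested field name like $oid is exactly the part I'm not sure about.
-- OpenX JSON SerDe: a "mapping.<column>" serde property maps a column name to a JSON key.
CREATE EXTERNAL TABLE my_table (
  id struct<oid:string>,
  name string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('mapping.oid' = '$oid')
LOCATION 's3://my-bucket/my-prefix/';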
I am using Google BigQuery and I have a field named 'AsOfDate' which is set as a string datatype. I have a bunch of data in this field, which I really want to set as DateTime or just Date; either is fine. I Googled for a solution, and I thought this would be pretty easy to do, but I can't seem to get the data type updated. I don't want to run a simple select statement; I want to permanently change the schema. Has anyone run into this and figured out how to do this kind of thing? If so, please share your insights. Thanks!
To quote directly from the official documentation: 'Changing a column's data type is not supported by the BigQuery web UI, the command-line tool, or the API.'
https://cloud.google.com/bigquery/docs/manually-changing-schemas#changing_a_columns_data_type
There are two ways to manually change a column's data type:
Using a SQL query — Choose this option if you are more concerned about simplicity and ease of use, and you are less concerned about costs.
Recreating the table — Choose this option if you are more concerned about costs, and you are less concerned about simplicity and ease of use.
You could use either of the approaches above along with the PARSE_DATE() function to transform your string into a date field.
https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#parse_date
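For example, a minimal sketch of the SQL-query approach, assuming the table lives at mydataset.mytable and AsOfDate holds strings like '2019-03-31' (the dataset, table, and format string are assumptions you would adjust):
-- Rewrites the table in place, replacing the STRING column with a parsed DATE.
-- Note that this rereads the whole table, so it has the usual query costs.
CREATE OR REPLACE TABLE mydataset.mytable AS
SELECT
  * REPLACE (PARSE_DATE('%Y-%m-%d', AsOfDate) AS AsOfDate)
FROM mydataset.mytable;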
Can someone please help me by stating the purpose of providing the JSON schema file while loading a file into a BigQuery table using the bq command? What are the advantages?
Does this file help to maintain data integrity by avoiding any column swap?
Regards,
Sreekanth
Specifying a JSON schema, instead of relying on auto-detect, means that you are guaranteed to get the expected types for each column being loaded. If you have data that looks like this, for example:
1,'foo',true
2,'bar',false
3,'baz',true
Schema auto-detection would infer that the type of the first column is an INTEGER (a.k.a. INT64). Maybe you plan to load more data in the future, though, that looks like this:
3.14,'foo',true
1.59,'bar',false
-2.001,'baz',true
In that case, you probably want the first column to have type FLOAT (a.k.a. FLOAT64) instead. If you provide a schema when you load the first file, you can specify a type of FLOAT for that column explicitly.
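If you prefer to drive the load from SQL rather than the bq tool, more recent BigQuery versions also let you state the schema inline with the LOAD DATA statement; a sketch with made-up dataset, table, and bucket names:
-- Explicit column types instead of auto-detect; all names and the URI are hypothetical.
LOAD DATA INTO mydataset.mytable (value FLOAT64, label STRING, flag BOOL)
FROM FILES (
  format = 'CSV',
  uris = ['gs://my-bucket/data/*.csv']
);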
One of the columns I send (in my code) to BigQuery contains integers. I added the columns in BigQuery, and I was too fast and added them as type string.
Will they be automatically converted? Or will the data be totally corrupted (meaning I cannot trust the resulting string at all)?
Data shouldn't be automatically converted, as this would defeat the purpose of having a table schema.
What I've seen people do is save a whole JSON line as a string and then process that string inside BigQuery. Other than that, if you try to save values that don't correspond to the field's schema definition, you should see an error being thrown.
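For the JSON-as-string pattern, the processing side usually looks something like this (a sketch; the dataset, table, column, and JSON field names are made up):
-- Assumes a table with a STRING column raw_json holding one JSON object per row.
SELECT
  JSON_EXTRACT_SCALAR(raw_json, '$.name') AS name,
  CAST(JSON_EXTRACT_SCALAR(raw_json, '$.count') AS INT64) AS event_count
FROM mydataset.raw_events;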
If you need to change a table schema's definition, you can check this tutorial on updating a table schema.
Actually, BigQuery automatically converted the integers that I sent it to strings, so my table populates OK.
I have a database which has a view created off other views, which are themselves created off other views (a data engineer built the views, not me).
In Hive I can run this, but it's slow, so I want to use Impala:
select * from table limit 5;
In Impala I get an error; I have tried INVALIDATE METADATA and REFRESH with no luck.
"ERROR: AnalysisException: No matching function with signature: lower(BIGINT)."
What could be the reason for this? I've never seen this type of error before. Is there a way to do this recursively?
show create table;
To begin with, be aware that Hive and Impala are distinct solutions, with distinct SQL parsers, supporting a distinct set of functions and features. A syntax that is valid in Hive may not be valid in Impala. Some table formats defined with Hive may not be supported by Impala (e.g. ORC, or Parquet with a BINARY column).
In this specific case, the Hive documentation appears to match the Impala documentation for function lower() (caveat: check what versions you are using).
But there's a big catch: lower() takes a string and produces a string. It is not a number function. That smells like a gross mistake, such as confusing lower() (convert some text to lowercase) with floor() (get the largest integer that is less than or equal to a decimal value).
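To illustrate the difference (the column and table names below are invented):
-- lower() is a string function; floor() is the numeric one that was probably intended.
SELECT lower('HELLO');   -- 'hello'
SELECT floor(3.7);       -- 3
-- Impala rejects lower() on a BIGINT outright, whereas Hive implicitly casts the number
-- to a string first. An explicit cast makes the intent visible in both engines:
SELECT lower(CAST(some_bigint_col AS STRING)) FROM some_table;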
Check with your so-called Data Engineer what he/she was trying to do, and make sure the views were properly tested (or are properly tested after the correction is made). Hive clearly applies some implicit type conversions that let such queries run, even though they make no sense and produce goofy results.
I am looking for an efficient way to use a wildcard search on a text (blob) column.
I have seen that it is internally stored as bytes...
The data amount will be limited, but unfortunately my vendor has decided to use this stupid datatype. I would also consider moving everything into a temp table if there is an easy system-side function to convert it; unfortunately, something like rpad does not work...
I can see the text value correctly when using the column in the select list or when reading the data via Perl's DBI module.
Unfortunately, you are stuck - there are very few operations that you can perform on TEXT or BYTE blobs. In particular, none of these work:
+ create table t (t text in table);
+ select t from t where t[1,3] = "abc";
SQL -615: Blobs are not allowed in this expression.
+ select t from t where t like "%abc%";
SQL -219: Wildcard matching may not be used with non-character types.
+ select t from t where t matches "*abc*";
SQL -219: Wildcard matching may not be used with non-character types.
Depending on the version of IDS, you may have options with BTS - Basic Text Search (requires IDS v11), or with other text search data blades. On the other hand, if the data is already in the DB and cannot be type-converted, then you may be forced to extract the blobs and search them client-side, which is less efficient. If you must do that, ensure you filter on as many other conditions as possible to minimize the traffic that is needed.
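If BTS is an option, the shape of the solution is roughly as follows. This is only a sketch with invented table, column, and sbspace names; it assumes the data can be copied into a type that BTS indexes (which, as noted, may not be possible for your TEXT column), and the exact operator class and storage clause should be checked against your IDS version's BTS documentation.
-- Copy the data into an indexable column, build a BTS index, then search with bts_contains().
CREATE TABLE t2 (id SERIAL, body LVARCHAR(4000));
CREATE INDEX t2_body_bts ON t2 (body bts_lvarchar_ops) USING bts IN my_sbspace;
SELECT id FROM t2 WHERE bts_contains(body, 'abc');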
You might also notice that DBD::Informix has to go through some machinations to make blobs appear to work - machinations that it should not, quite frankly, have to go through. So far, in a decade of trying, I've not persuaded people that these things need fixing.