How to specify the location when creating a BigQuery dataset in Java

How do I specify the location when creating a BigQuery dataset in Java?
I want to create it in the Tokyo (asia-northeast1) region.
The official guide doesn't seem to explain how to specify the location.

It looks like the documentation can help you here.
First, you can find an example of basic dataset creation:
https://cloud.google.com/bigquery/docs/datasets#java
Then you can adapt it to your needs with the setLocation method on the dataset builder:
https://googleapis.dev/java/google-cloud-bigquery/latest/com/google/cloud/bigquery/DatasetInfo.Builder.html#setLocation-java.lang.String-
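Putting those two pieces together, a minimal sketch might look like this (the project comes from the default credentials and the dataset ID "my_dataset" is a placeholder):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Dataset;
import com.google.cloud.bigquery.DatasetInfo;

public class CreateDatasetInTokyo {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Build the dataset metadata, pinning the location to Tokyo.
    DatasetInfo datasetInfo = DatasetInfo.newBuilder("my_dataset") // placeholder dataset ID
        .setLocation("asia-northeast1")
        .build();

    Dataset dataset = bigquery.create(datasetInfo);
    System.out.println("Created dataset " + dataset.getDatasetId()
        + " in " + dataset.getLocation());
  }
}

Note that a dataset's location cannot be changed after creation, so it has to be set at creation time like this.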

Related

How to view stats on Snowflake?

I am looking for a way to visualize the stats of a table in Snowflake.
The long way around is to pull a meaningful sample of the data with Python and profile it with Pandas, but it is somewhat inefficient and unsafe to pull the data out of Snowflake.
Snowflake's new interface shows these stats graphically and I would like to know if there is a way to obtain this data with query or by consulting metadata.
I need something like pandas-profiling but without an external server. Maybe Snowflake stores metadata/statistics about its columns (numeric, categorical).
https://github.com/pandas-profiling/pandas-profiling
Thank you for your advice.
You can find a lot of meta information in the INFORMATION_SCHEMA.
All the views and table functions in the Snowflake INFORMATION_SCHEMA can be found here: https://docs.snowflake.com/en/sql-reference/info-schema.html
Not sure if you're talking about viewing the INFORMATION_SCHEMA as mentioned above, but if you need documentation on the whole new interface, it's called Snowsight.
You can learn more here:
https://docs.snowflake.com/en/user-guide/ui-snowsight.html
Cheers!
The highlight in your screenshot isn't statistics about the data in the table, but merely about the query result (which looks like a DESCRIBE TABLE query). For example, if you look at type, it simply tells you that this table has 6 VARCHAR columns, 2 timestamps, and 1 number.
What you're looking for is something that is provided by most BI tools or data catalogs. I suggest you take a look at those instead.
You could also use an independent tool, like Soda, which is open source.

Does Tableau support complex data types in Hive columns and flatten them?

I am trying to create a dashboard from the data present in Hive. The catch is that the column I want to visualize is a nested JSON type. Will Tableau be able to parse and flatten the JSON column and list out all possible attributes? Thanks!
Unfortunately Tableau will not automatically flatten the JSON structure of the field for you, but you can manually do so.
Here is an article that explains the use of Regex within Tableau to extract pertinent information from your JSON field.
I realize this may not be the answer you were looking for, but hopefully it gets you started down the right path.
(In case it helps, Tableau does have a JSON connector in the event you are able to connect directly to your JSON as a data source, rather than having it embedded in your Hive connection as a complex field type.)

How to read specific columns from a Parquet file in Java

I am using a WriteSupport that knows how to write my custom object 'T' into Parquet. I am interested in only reading 2 or 3 specific columns out of 100 columns of my custom object that are written into the Parquet file.
Most examples online extend ReadSupport and read the entire record. I want to accomplish this without using things like Spark, Hive, Avro, Thrift, etc.
Is there an example in Java that reads selected columns of a custom object from Parquet?
This post may help.
Read specific column from Parquet without using Spark
If you just want to read specific columns, then you need to set a read schema on the configuration that the ParquetReader builder accepts. (This is also known as a projection).
In your case you should be able to call .withConf(conf) on the AvroParquetReader builder class, and in the conf you pass in, invoke conf.set(ReadSupport.PARQUET_READ_SCHEMA, schema) where schema is an Avro schema in String form.
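As a rough sketch of that idea, assuming the file was written with an Avro-compatible schema: the example below uses parquet-avro and sets the projection through AvroReadSupport.setRequestedProjection, parquet-avro's helper for putting the requested read schema on the Configuration. The field names, types and file path are placeholders.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class ReadProjectedColumns {
  public static void main(String[] args) throws Exception {
    // Projection schema listing only the columns we actually want to read.
    // The field names and types are placeholders; they must match the schema
    // that was used to write the file.
    String projectionJson = "{\"type\":\"record\",\"name\":\"Projection\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"name\",\"type\":\"string\"}]}";
    Schema projection = new Schema.Parser().parse(projectionJson);

    Configuration conf = new Configuration();
    // Ask the read support to materialize only the projected columns;
    // the column chunks for everything else are skipped.
    AvroReadSupport.setRequestedProjection(conf, projection);

    try (ParquetReader<GenericRecord> reader =
        AvroParquetReader.<GenericRecord>builder(new Path("/path/to/data.parquet"))
            .withConf(conf)
            .build()) {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        System.out.println(record.get("id") + " " + record.get("name"));
      }
    }
  }
}

Since the question asked to avoid Avro, note that this sketch still relies on parquet-avro for the record representation; a fully Avro-free version would mean implementing your own ReadSupport/RecordMaterializer and returning the requested projection from its init method.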

Azure Stream Analytics -> how much control over path prefix do I really have?

I'd like to set the prefix based on some of the data coming from event hub.
My data is something like:
{"id":"1234",...}
I'd like to write a blob prefix that is something like:
foo/{id}/guid....
Ultimately I'd like to have one blob for each id. This will help how it gets consumed downstream by a couple of things.
What I don't see is a way to create prefixes that aren't related to date and time. In theory I can write another job to pull from blobs and break it up after the stream analytics step. However, it feels like SA should allow me to break it up immediately.
Any ideas?
{date}, {time} and {partition} are the only tokens supported in the blob output prefix. {partition} is a number.
Using a column value in blob prefix is currently not supported.
If you have a limited number of such {id}s, then you could work around this by writing multiple SELECT statements with different filters, each writing to a different output with a hardcoded prefix. Otherwise it is not possible with just ASA.
It should be noted that you actually can do this now. Not sure when it was implemented, but you can now use a single property from your message as a custom partition key, and the syntax is exactly what the OP asked for: foo/{id}/something/else
More details are documented here: https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-custom-path-patterns-blob-storage-output
Key points:
Only one custom property allowed
Must be a direct reference to an existing message property (i.e. no concatenations like {prop1+prop2})
If the custom property results in too many partitions (more than 8,000), then an arbitrary number of blobs may be created for the same partition

Solr 5.3 implementation processes docs but doesn't return results

I have recently set up a local instance of Solr 5.3 in an effort to get it going for my company. As an initial test case I've set up a Data Import Handler (DIH) that imports PDFs stored within a file directory. When I execute the full import in the admin tool, the DIH processes all the files within the directory, and I'm able to run a general query (*:*) which returns all indexed fields for every record in the index.
When I switch to a specific query using a word definitely contained within the files, however, Solr returns no results. What connection am I not making here?
I can provide excerpts from the schema, solrconfig, and custom data config if needed, but I don't want to oversaturate this post.
The answer I came up with involved a simple newbie mistake combined with something I wasn't anticipating.
1) First, I didn't have my field set to indexed="true". I set that. Yeesh, it stinks being new to this!
2) I needed to make a change to solrconfig.xml for the core in question. Thanks to this article, I was able to determine that I needed to add a default field in the /select requestHandler. Uncommenting the relevant line in solrconfig.xml and changing the field name did the trick; I no longer need to supply the field name in df to return results.
My carryover question for anyone coming across this question in the future is whether this latter point is the proper way to go about using default fields. I see that the default search field setting in schema.xml is deprecated (or heading that direction) in 5.3.0. So is it alright to define df in solrconfig instead?