I have seen that there is a Microsoft .NET SDK for Hadoop, and I found that Map/Reduce programs can now be written in .NET for HDInsight.
Is there also a way to write Hive UDFs in .NET for HDInsight?
You can use the same streaming method you would use with a Python UDF to run a .NET program as a UDF.
For example, if you have a .NET program which reads from STDIN and writes a result to STDOUT, you can run it as a Hive UDF as follows:
SELECT TRANSFORM (<columns>)
USING '<PROGRAM.EXE>'
AS (<columns>)
FROM <table>;
Note that you can also pass multiple columns into and out of your UDF; Hive streams them as tab-separated data by default, both into and out of the .NET program.
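For illustration, here is a minimal sketch of what such a streaming program looks like, written in Python rather than .NET (the mechanics are the same: read tab-separated columns from STDIN, do the work, write tab-separated columns back to STDOUT). The two columns and the doubling logic are made up for the example:
#!/usr/bin/env python
# Minimal streaming "UDF" sketch: Hive's TRANSFORM sends each input row to
# STDIN as tab-separated columns and reads result rows back from STDOUT.
import sys

for line in sys.stdin:
    # Hypothetical input columns: (name, amount)
    name, amount = line.rstrip('\n').split('\t')
    # Do some work, then emit the output columns, again tab-separated
    print('\t'.join([name.upper(), str(float(amount) * 2)]))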
As far as performance goes, streaming every row through an external process can be quite slow compared with a native UDF, so be careful about overuse and keep an eye on query times.
Also, don't forget to add the file for PROGRAM.EXE to your Hive job before running the query:
ADD FILE 'wasb://...PROGRAM.EXE';
See "How to add custom Hive UDFs to HDInsight".
When writing dataframes to S3 using the s3a connector, there seems to be no official way of determining the object paths on S3 that were written in the process. What I am trying to achieve is simply determining which objects have been written when writing to S3 (using pyspark 3.2.1 with Hadoop 3.3.2 and the directory committer).
The reason this might be useful:
partitionBy might add a dynamic number of new paths
Spark creates its own "part-..." parquet files with cryptic names and numbers depending on the partitions when writing
With pyspark 3.1.2 and Hadoop 3.2.0 it used to be possible to use the not officially supported "_SUCCESS" file, which was written at the path before the first partitioning on S3 and which contained the paths of all written files. Now, however, the number of paths seems to be limited to 100, so this is not an option anymore.
Is there really no official, reasonable way of achieving this task?
Now, however, the number of paths seems to be limited to 100, so this is not an option anymore.
We had to cut that in HADOOP-16570... one of the scale problems which surfaced during terasorting at 10-100 TB: the time to write the _SUCCESS file started to slow down job commit times. It was only ever intended for testing. Sorry.
It is just a constant in the source tree. If you were to provide a patch to make it configurable, I'll be happy to review and merge, provided you follow the "say which AWS endpoint you ran all the tests against or we ignore your patch" policy.
I don't know where else this stuff is collected. The Spark driver is told the number of files and their total size from each task commit, but it isn't given the actual list by the tasks, not AFAIK.
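For reference, here is a rough sketch of reading that (capped, unofficial) _SUCCESS manifest from Python. The "filenames" field and the JSON layout are assumptions based on the S3A committer's SuccessData format and may change between Hadoop releases; the bucket and key are placeholders:
# Sketch: read the S3A committer's _SUCCESS JSON manifest and print the
# object paths it recorded (the list is capped, so it may be incomplete).
# Assumes boto3 credentials are configured; bucket and key are placeholders.
import json
import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="output/prefix/_SUCCESS")
manifest = json.loads(obj["Body"].read())
for path in manifest.get("filenames", []):
    print(path)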
Spark creates its own "part-..." parquet files with cryptic names and numbers depending on the partitions when writing
The part-0001- bit of the filename comes from the task ID; the bit afterwards is a UUID created to ensure every filename is unique, see SPARK-8406 ("Adding UUID to output file name to avoid accidental overwriting"). You can probably turn that off.
I have some Apache Parquet files. I know I can execute parquet file.parquet in my shell and view them in the terminal, but I would like some GUI tool to view Parquet files in a more user-friendly format. Does such a program exist?
Check out this utility. It works for all Windows versions: https://github.com/mukunku/ParquetViewer
There is the Tad utility, which is cross-platform. It allows you to open Parquet files, pivot them, and export to CSV. It uses DuckDB as its backend; more info is on the DuckDB page.
GitHub repo here:
https://github.com/antonycourtney/tad
Actually, I found some Windows 10 specific solutions. However, I'm working on Linux Mint 18, so I would like some Linux (or ideally cross-platform) GUI tool. Is there some other GUI tool?
https://www.channels.elastacloud.com/channels/parquet-net/how-about-viewing-parquet-files
There is a GUI tool to view Parquet and also other binary-format data like ORC and AVRO. It's a pure Java application, so it can be run on Linux, Mac and also Windows. Please check Bigdata File Viewer for details.
It supports complex data types like array, map, struct, etc., and you can save the file you read in CSV format.
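If a scripted conversion is good enough instead of a GUI, one quick way to dump a Parquet file to CSV is pandas with the pyarrow engine; a minimal sketch (the file names are placeholders):
# Sketch: inspect a Parquet file and export it to CSV for viewing anywhere.
# Requires pandas and pyarrow (pip install pandas pyarrow); paths are placeholders.
import pandas as pd

df = pd.read_parquet("file.parquet")   # pyarrow is used under the hood
print(df.dtypes)                       # column names and types
print(df.head())                       # first few rows
df.to_csv("file.csv", index=False)     # open the CSV in any viewer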
GUI option for Windows, Linux, macOS
You can now use DBeaver to
view Parquet data
view metadata and statistics
run SQL queries on one or multiple files (supports glob expressions)
generate new Parquet files.
DBeaver leverages the DuckDB driver to perform operations on Parquet files. Features like projection and predicate pushdown are also supported by DuckDB.
Simply create an in-memory instance of DuckDB using DBeaver and run queries as described in this document. Right now Parquet and CSV are supported.
Here is a YouTube video that explains the same: https://youtu.be/j9_YmAKSHoA
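Since DBeaver is just driving DuckDB here, you can run the same kind of query straight from Python if you only need a quick look rather than a GUI; a minimal sketch (the file name is a placeholder):
# Sketch: query a Parquet file directly with DuckDB's Python API.
# Requires the duckdb package (pip install duckdb); the file name is a placeholder.
import duckdb

con = duckdb.connect()  # in-memory instance, same idea as the DBeaver setup above
print(con.execute("SELECT COUNT(*) FROM 'file.parquet'").fetchall())
# Glob expressions work here too, e.g. 'part-*.parquet'
for row in con.execute("SELECT * FROM 'file.parquet' LIMIT 10").fetchall():
    print(row)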
JetBrains (IntelliJ, PyCharm, etc.) has a plugin for this, if you have a Professional version: https://plugins.jetbrains.com/plugin/12494-big-data-tools
Hello, I am trying to learn DB2 SQL and I am having some problems.
I want to bind a package, but I don't have any packages to bind.
So when I try to create a package it obviously gives me an error. I know that a package is created when we create a database, but then why doesn't it list any packages when I do
db2 list packages
I have seen a lot of links but no help. I would really appreciate it if someone actually explained it to me.
Thank you very much
In order to understand a package, you first need to understand dynamic and static queries.
Dynamic queries are created at execution time. Everything issued from PHP, Perl, Python, Ruby or Java (JDBC) is a dynamic query. For example, when using Java, you get a PreparedStatement and you assign values (setXXX) to the parameter markers (?).
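As a small illustration, here is a sketch of a dynamic query with a parameter marker in Python using the ibm_db driver (the connection string is a placeholder; STAFF is the table from the DB2 SAMPLE database):
# Sketch: a dynamic query, prepared and executed at run time; the access plan
# is built at execution, not stored in a package at bind time.
# Connection details are placeholders; STAFF is the DB2 SAMPLE database table.
import ibm_db

conn = ibm_db.connect("DATABASE=SAMPLE;HOSTNAME=localhost;PORT=50000;"
                      "UID=db2inst1;PWD=secret;", "", "")
stmt = ibm_db.prepare(conn, "SELECT name FROM staff WHERE dept = ?")
ibm_db.execute(stmt, (20,))
row = ibm_db.fetch_tuple(stmt)
while row:
    print(row[0])
    row = ibm_db.fetch_tuple(stmt)
ibm_db.close(conn)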
However, there are other programming languages, such as C, Java (SQLJ) and COBOL, where you create the program with embedded SQL. For example, when using SQLJ, you write a class in a .sqlj file, and the queries are written in specific tags (not Java, but starting with #sql { }). Then you do a precompilation, a process where the SQL is taken out of the code and the source is translated to the host programming language (i.e. from SQLJ to Java). The SQL is put into a file called a bind file. Once you have that, you need to compile the code (javac to create the .class) and bind the file in the database. This last step is where the packages are created.
A package is a set of data access plans. However, they are calculated at bind time, not at execution time as with dynamic queries. That is the difference between them.
Finally, when creating a package, you may need to change the bind properties, and eventually the bind file itself.
I would like to play with Stack Overflow's data dump in Oracle. The format they gave me is XML, and it is very, very large (one XML file is about 3 GB). I would like to import this data into my Oracle DB. I know one other guy in this topic managed to work with the XML directly. Any ideas or suggestions to make this happen easily?
Check out the Groovy SQL and XML libraries--you should be able to get up and running pretty quickly even with minimal Java/Groovy experience.
http://docs.codehaus.org/display/GROOVY/Tutorial+6+-+Groovy+SQL
Groovy XML
You'll need to install Groovy and get the ojdbc14.jar driver from Oracle. Put your code in a file and run:
groovy -cp ojdbc14.jar myscript.groovy
Which is the better option for deploying databases when using a database project (VS 2010): VSDBCMD or SQLCMD? Is there any major drawback other than the default variables (database name, data path and log path)?
vsdbcmd is a diff tool: it can analyze the .dbschema, compare it with the target database, and bring the target up to the schema in the .dbschema file by selectively adding, dropping and altering existing objects. sqlcmd is only an execution tool: it takes a .sql script and blindly runs it. So it is really apples to oranges; the two tools are quite different in purpose and capabilities.
Your question is not very clear, so it isn't easy to give you a good answer, but based on your comment on Remus's answer, I assume you are trying to execute the .sql script that vsdbcmd.exe generated. If my assumption is correct, you need to use sqlcmd.exe to execute this script.
According to this thread on the MSDN Forums, the VSDB team didn't want to duplicate functionality in vsdbcmd.exe that already exists in sqlcmd.exe.