I have an Apache Parquet file. I know I can execute parquet file.parquet in my shell and view it in the terminal, but I would like a GUI tool to view Parquet files in a more user-friendly format. Does such a program exist?
Check out this utility. It works on all Windows versions: https://github.com/mukunku/ParquetViewer
There is the Tad utility, which is cross-platform. It allows you to open Parquet files, pivot them, and export to CSV. It uses DuckDB as its backend; there is more info on the DuckDB page.
GitHub here:
https://github.com/antonycourtney/tad
Actually, I found a Windows 10 specific solution. However, I'm working on Linux Mint 18, so I would like some Linux (or ideally cross-platform) GUI tool. Is there some other GUI tool?
https://www.channels.elastacloud.com/channels/parquet-net/how-about-viewing-parquet-files
There is a GUI tool to view Parquet and also other binary-format data like ORC and Avro. It's a pure Java application, so it can run on Linux, Mac, and Windows. Please check Bigdata File Viewer for details.
It supports complex data types like array, map, struct, etc., and you can save the file you read in CSV format.
A GUI option for Windows, Linux, and Mac
You can now use DBeaver to:
view Parquet data
view metadata and statistics
run SQL queries on one or multiple files (supports glob expressions)
generate new Parquet files
DBeaver leverages the DuckDB driver to perform operations on Parquet files. Features like projection and predicate pushdown are also supported by DuckDB.
Simply create an in-memory instance of DuckDB in DBeaver and run queries as described in the DuckDB documentation. Right now Parquet and CSV are supported.
Here is a YouTube video that explains the same: https://youtu.be/j9_YmAKSHoA
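If you want to try the same queries outside the GUI, the DuckDB engine DBeaver uses is also available as a Python package. A minimal sketch (the file names are placeholders):

import duckdb  # pip install duckdb

# Query a Parquet file directly; no import step is needed.
duckdb.sql("SELECT * FROM 'events.parquet' LIMIT 10").show()

# Glob expressions work too.
duckdb.sql("SELECT count(*) FROM 'data/*.parquet'").show()

# Write query results back out as a new Parquet file.
duckdb.sql("COPY (SELECT * FROM 'events.parquet') TO 'subset.parquet' (FORMAT PARQUET)")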
JetBrains (IntelliJ, PyCharm, etc.) has a plugin for this if you have a Professional version: https://plugins.jetbrains.com/plugin/12494-big-data-tools
I would like to know if there's any Python library that supports this conversion; currently, the options I've found are SASPy, CSV, or a SQL database, but I was unsuccessful.
This is not really a programming question, but I hope it won't be an issue.
I've found this post:
Export pandas dataframe to SAS sas7bdat format
But I was hoping to find updates on new libraries that support sas7bdat file creation, and on how licensing works for SASPy.
The sas7bdat format is very hard to write. Reading it is fairly doable (but pretty hard); writing it is brutal. SAS costs a LOT of money and cannot be purchased (it is leased). My suggestions:
1. Use one of the products from companies that have done it. Some examples: CoyRoc (SSIS adaptor) $, StatTransfer $, SPSS $$$, SAS (lots of dollar signs). WPS might be able to do it, but they save to their own format to avoid the mess. They probably also support sas7bdat export.
2. Do not use the sas7bdat format. Consider something else, like the SAS Transport format. Look at my GitHub repository (savian-net) for C# code that can do it. Translate it to Python or find a Python library that can handle SAS Transport.
The sas7bdat format is a binary, proprietary format that is 100% unpublished. Any docs are guesses based on binary sleuthing. It is based on an old mainframe format, and 'likely remnants' appear to be included. My suggestion is to avoid it like the plague and find an alternative.
As an alternative to using xport as Stu suggested: as of Viya 2021.2.6, SAS supports reading externally generated Parquet files via the new Parquet import engine. As such, you could export the file to Parquet via Python, then import that directly into SAS and save it as a .sas7bdat file.
https://communities.sas.com/t5/SAS-Communities-Library/Parquet-Support-in-SAS-Compute-Server/ta-p/811733
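The Python side of that flow is short with pandas, assuming the pyarrow (or fastparquet) engine is installed; the file names here are placeholders:

import pandas as pd

# Read the dataset exported from SAS, transform it, then write Parquet
# that SAS Viya's parquet import engine can read directly.
df = pd.read_sas('input.sas7bdat')
df.to_parquet('output.parquet')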
The flow I have in mind is this:
1. Export a sas7bdat from SAS
2. Import that file in Python with pd.read_sas and do some stuff on it
3. Export the pandas dataframe to sas7bdat (or some other SAS binary file format). I thought that pd.to_sas would exist, but it doesn't
4. Open the new file in SAS and do further stuff on it
Is there a solution to point 3 above? As I see it, my only options are CSV or some SQL database.
This is not really a programming question; I hope it won't be an issue.
Python is capable of writing to the SAS .xpt format (see, for example, the xport library), which is SAS's open file format. SAS7BDAT is a closed file format, and not intended to be read or written by other languages; some have reverse-engineered enough of it to at least read it, but from what I've seen no good SAS7BDAT writer exists (R has haven, for example, which is the best one I've seen, but it still has issues and things it can't do).
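For example, a minimal sketch with the xport package, assuming its version 3 interface (the data and file names are made up):

import pandas as pd
import xport
import xport.v56  # pip install xport

df = pd.DataFrame({'HEIGHT': [170.0, 165.5], 'WEIGHT': [60.0, 72.3]})

# Member names are limited to 8 characters in the V5 transport format.
library = xport.Library({'MYDATA': xport.Dataset(df, name='MYDATA')})

with open('mydata.xpt', 'wb') as f:
    xport.v56.dump(library, f)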
More common than XPT files (which can be slow to work with), though, is to write a CSV and then generate a SAS input script from your Python (or other) program. That lets you specify variable labels, value labels, types, etc. very easily, and writing a SAS input script is easy to do. Many other software packages use this as their preferred method of producing SAS files. It has the additional advantage of being easily cross-platform: it doesn't matter whether your SAS program runs on a mainframe, UNIX, or Windows; it's all the same.
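A rough sketch of that approach (all names are illustrative): write the data as a CSV, then generate a small SAS DATA step that reads it back in.

import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
df.to_csv('mydata.csv', index=False)

# Build a SAS DATA step that reads the CSV; character columns get a $
# marker, everything else is read as numeric. Labels, informats, etc.
# could be added here in the same way.
lines = ['data work.mydata;', "  infile 'mydata.csv' dsd firstobs=2;", '  input']
for col, dtype in df.dtypes.items():
    lines.append('    ' + col + (' $' if dtype == object else ''))
lines.append('  ;')
lines.append('run;')

with open('mydata_import.sas', 'w') as f:
    f.write('\n'.join(lines))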
Edit: If you do have SAS licensed locally, either via a server or a local install, another option for exporting Python data to SAS is SASPy, a SAS-maintained open-source project that lets Python connect directly to SAS instances and send data. (Under the hood, I believe the data is usually transmitted as a CSV and then read in using SAS code.) The SAS ODBC driver is also an option, but for Python, SASPy will most likely be the easiest.
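With SASPy the transfer is only a couple of calls. A sketch, assuming you already have a working connection profile (sascfg_personal.py) pointing at your SAS install or server:

import pandas as pd
import saspy  # pip install saspy

df = pd.DataFrame({'x': [1, 2, 3]})

sas = saspy.SASsession()  # connects using your configured profile
sasdata = sas.df2sd(df, table='mydata', libref='work')  # DataFrame -> SAS dataset
print(sasdata.head())
sas.endsas()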
"SAS7BDAT is a closed file format, and not intended to be read/written to by other languages; some have reverse engineered enough of it to read at least, but from what I've seen no good SAS7BDAT writer exists."
Although SAS7BDAT is a proprietary format, it is not closed. It can be read and written by third-party products using SAS's own ODBC drivers: https://support.sas.com/en/software/sas-odbc-drivers.html. Since Python can use ODBC (pyodbc), just use the SAS ODBC driver to write the SAS7BDAT file format.
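A sketch of that route; the DSN name and table are made up, you would first need to define an ODBC data source pointing at a SAS library, and how much SQL the driver accepts depends on its configuration:

import pyodbc

# 'SAS_DSN' is a hypothetical data source configured for a SAS library
# (a directory of .sas7bdat files).
conn = pyodbc.connect('DSN=SAS_DSN')
cur = conn.cursor()

cur.execute('CREATE TABLE mydata (id INTEGER, name CHAR(20))')
cur.execute('INSERT INTO mydata VALUES (?, ?)', 1, 'Alice')
conn.commit()
conn.close()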
IBM SPSS Statistics and IBM SPSS Modeler can also read and write the SAS7BDAT format, as well as the earlier pre-version-7 formats and the SAS Transport file format (the .xpt files) noted above. These products do not require ODBC to do this. In SPSS Statistics Base the capability is provided via the SAVE TRANSLATE command; in SPSS Modeler Professional, via the SAS Source node for reading and the SAS Export node for writing.
I want to open a Parquet file and view the contents of the table in IntelliJ. Is there a way to do this currently, or with a plugin?
You need to install the Avro and Parquet Viewer plugin in order to view this kind of file:
https://plugins.jetbrains.com/plugin/12281-avro-and-parquet-viewer
If you just want to open a Parquet file, that is part of the Big Data Tools plugin (JetBrains' official plugin). Just install it, then double-click the file and it will open in the editor as a table.
The answer is no, at least for now.
But if the reason you want to view Parquet tables in IntelliJ is that you want to view Parquet files with a GUI tool, I suggest you use Bigdata File Viewer.
It's a desktop application to view Parquet and also other binary-format data like ORC and Avro. It's a pure Java application, so it can run on Linux, Mac, and Windows.
It supports complex data types like array, map, etc.
I have a .sql file and I want to convert it to NoSQL, as I have coursework on MongoDB.
What application can I use, or how can I do it?
In a quick Google search I found this website, which converts CREATE and INSERT INTO statements to a JSON or JavaScript format. However, if you want to create a different database structure (which I would probably recommend), you might want to write a Python script that creates a JSON file to import into MongoDB. I guess it all depends on what you want to create.
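As a starting point, here is a rough sketch of such a script. It assumes simple one-row INSERT INTO statements with a column list, and is nowhere near a full SQL parser:

import json
import re

sql = open('dump.sql').read()

docs = []
# Matches e.g.: INSERT INTO users (id, name) VALUES (1, 'Alice');
pattern = re.compile(r"INSERT INTO \w+ \(([^)]*)\) VALUES \(([^)]*)\);")
for cols, vals in pattern.findall(sql):
    keys = [c.strip() for c in cols.split(',')]
    values = [v.strip().strip("'") for v in vals.split(',')]
    docs.append(dict(zip(keys, values)))

# One JSON document per line, ready for mongoimport.
with open('docs.json', 'w') as f:
    for doc in docs:
        f.write(json.dumps(doc) + '\n')

The resulting file can then be loaded with something like: mongoimport --db coursework --collection users --file docs.json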
I have seen that there is a Microsoft .NET SDK for Hadoop, and I found that MapReduce programs can now be written in .NET for HDInsight.
Is there also a way to write Hive UDFs for HDInsight?
You can use the same streaming method you would use with a Python UDF to run a .NET program as a UDF.
For example, if you have a .NET program that does something to STDIN and writes a result to STDOUT, you can run it from Hive as follows:
SELECT TRANSFORM (<columns>)
USING '<PROGRAM.EXE>'
AS (<columns>)
FROM <table>;
Note that you can also pass multiple columns through your UDF; Hive delimits the columns with tabs by default, both into and out of the .NET piece.
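The streaming contract is the same whatever language the executable is written in: Hive writes the selected columns to the process's STDIN, one row per line with tab-separated fields, and reads tab-separated rows back from STDOUT. For illustration, a minimal Python equivalent of PROGRAM.EXE:

import sys

# Hive sends one row per line on STDIN, columns separated by tabs.
for line in sys.stdin:
    cols = line.rstrip('\n').split('\t')
    # Example transformation: upper-case the first column.
    cols[0] = cols[0].upper()
    # Emit the row back to Hive, again tab-separated.
    print('\t'.join(cols))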
As far as performance goes, you might find this is really slow, so be careful about overuse, and keep an eye on it.
Also, don't forget to add the files for PROGRAM.EXE to your Hive job before running the query.
add FILE 'wasb://...PROGRAM.EXE';
See How to add custom Hive UDFs to HDInsight.