Custom file format in Impala

We have a custom internal data format, and I'd like to use Impala with it, just for reading. I'm willing to write the binding for this format myself, but there's no reason to contribute it back, since nobody else uses the format.
Does Impala support file format plugins in some way?
From hdfs-scan-node.cc it looks like the list of file formats is unfortunately hardcoded. If that is the case, is there a plan to change it? Or is this not a common problem for some reason?

No, as stated in How Impala Works with Hadoop File Formats:
Impala can only query the file formats listed in the preceding table. In particular, Impala does not support the ORC file format.
The reasons for this are probably related to run-time code generation, which would be harder to optimize if Impala didn't constrain the set of file formats.
However, Impala is an open source project and there is no reason why you cannot suggest this by filing a JIRA.
http://blog.cloudera.com/blog/2013/02/inside-cloudera-impala-runtime-code-generation/
https://issues.apache.org/jira/projects/IMPALA/issues
https://www.cloudera.com/documentation/enterprise/latest/topics/impala_file_formats.html

Related

Generate sas7bdat files from a pandas dataframe

I would like to know if there's any Python library that supports this conversion. The options I've found so far are SASpy, CSV, or a SQL database, but I was unsuccessful with all of them.
This is not really a programming question, but I hope it won't be an issue.
I've found this post:
Export pandas dataframe to SAS sas7bdat format
But I was hoping to find updates on newer libraries that support sas7bdat file creation, and on how licensing works for SASpy.
The sas7bdat format is very hard to write. Reading it is fairly doable (though still pretty hard), but writing is brutal. SAS costs a LOT of money and cannot be purchased outright (it is leased). My suggestions:
1. Use one of the products from companies that have already done it. Some examples: CoyRoc (SSIS adaptor) $, StatTransfer $, SPSS $$$, SAS (lots of dollar signs). WPS might be able to do it, but they save to their own format to avoid the mess; they probably also support sas7bdat export.
2. Don't use the sas7bdat format at all. Consider something else like the SAS Transport format. Look at my GitHub repository (savian-net) for C# code that can do it. Translate it to Python or find a Python library that can handle SAS Transport (see the sketch after this answer).
The sas7bdat format is a binary, proprietary format that is 100% not published anywhere. Any docs are guesses based on binary sleuthing. It is based on an old mainframe format, and what look like remnants of that still appear in it. My suggestion is to avoid it like the plague and find an alternative.
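As a rough illustration of the second suggestion, here is a minimal sketch that writes a pandas DataFrame to SAS Transport (XPT) format. It uses the pyreadstat package purely as one possible library choice (the answer above doesn't name a specific Python library), and the file and column names are made up for the example.

```python
# A minimal sketch: write a pandas DataFrame to SAS Transport (XPT) format.
# Assumes the pyreadstat package is installed (pip install pyreadstat);
# the DataFrame contents and file name are purely illustrative.
import pandas as pd
import pyreadstat

df = pd.DataFrame({
    "id": [1, 2, 3],
    "score": [89.5, 92.0, 77.25],
})

# write_xport produces a SAS Transport file that SAS can read back in,
# e.g. with: libname mylib xport 'example.xpt';
pyreadstat.write_xport(df, "example.xpt")
```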
An alternative to using xport as Stu suggested: as of Viya 2021.2.6, SAS supports reading externally generated Parquet files via the new Parquet import engine. As such, you could export the file to Parquet from Python, then directly import that into SAS and save it as a .sas7bdat file.
https://communities.sas.com/t5/SAS-Communities-Library/Parquet-Support-in-SAS-Compute-Server/ta-p/811733
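For the Python half of that round trip, a minimal sketch might look like the following. It assumes pandas with pyarrow installed as the Parquet engine; the SAS import step itself is handled by the Viya Parquet engine described above, and the file name is illustrative.

```python
# A minimal sketch: export a pandas DataFrame to Parquet so a Parquet-capable
# SAS release (Viya 2021.2.6+) can import it. Requires pandas plus pyarrow;
# column and file names are illustrative.
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "value": [10.5, 20.25, 30.0],
})

# to_parquet writes a standard Parquet file that SAS's Parquet import engine
# (or any other Parquet reader) should be able to consume.
df.to_parquet("example.parquet", engine="pyarrow", index=False)
```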

Pentaho - PDI: get fields of stream

Pretty simple question here: If I read a .csv file for example, how can I know at runtime what columns that file has?
I want to convert that .csv file to JSON, but I don't know how I could set the fields for the JSON Output step dynamically so that it includes all the rows of that file. Can you help me expand my knowledge?
Thanks in advance
This is definitely a good use case for metadata injection; the step is called ETL Metadata Injection. You'll need to get the fields dynamically, probably using a scripting step (there are Java, JavaScript, and Python scripting steps available, as well as R if you're an Enterprise customer). I don't think there is a built-in step that will dynamically discover the fields at runtime.
Once you have fields, you can use the metadata injection step to inject the field names into CSV Input or Text File Input Step, as well as the JSON Output step.
Here is the official help documentation on the ETL Metadata Injection step: https://help.pentaho.com/Documentation/8.1/Products/Data_Integration/Transformation_Step_Reference/ETL_Metadata_Injection
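Outside of PDI, the field-discovery part is just a matter of reading the header row at runtime. Here is a plain-Python sketch of that logic as a standalone illustration; it is not the PDI scripting-step API (which exposes rows differently), and the file names are made up.

```python
# A standalone sketch of the field-discovery logic a scripting step would need:
# read the CSV header at runtime, then turn each row into JSON. This is plain
# Python for illustration, not the PDI scripting-step API; paths are made up.
import csv
import json

def csv_to_json(csv_path, json_path):
    with open(csv_path, newline="") as src:
        reader = csv.DictReader(src)   # field names come from the header row
        fields = reader.fieldnames     # discovered at runtime, not hardcoded
        rows = [dict(row) for row in reader]

    print("Discovered fields:", fields)
    with open(json_path, "w") as dst:
        json.dump(rows, dst, indent=2)

csv_to_json("input.csv", "output.json")
```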

Export pandas dataframe to SAS sas7bdat format

The flow I have in mind is this:
1. Export a sas7bdat from SAS
2. Import that file in Python with pd.read_sas and do some stuff with it
3. Export the pandas dataframe to sas7bdat (or some other SAS binary file format). I thought that pd.to_sas would exist, but it doesn't
4. Open the new file in SAS and do further stuff on it
Is there a solution to point 3 above? As I see it, my only options are csv or some SQL database.
This is not really a programming question; I hope it won't be an issue.
Python is capable of writing to SAS .xpt format (see, for example, the xport library), which is SAS's open file format. SAS7BDAT is a closed file format not intended to be read or written by other languages; some have reverse engineered enough of it to at least read it, but from what I've seen no good SAS7BDAT writer exists (R has haven, for example, which is the best one I've seen, but it still has issues and things it can't do).
More common than XPT files, though, which can be slow to work with, is to write a CSV and then write a SAS input script in your Python/etc. program. That allows you to specify variable labels, value labels, types, etc., as you wish very easily, and writing a SAS input script is very easy to do. Many other software packages do this as their preferred method of producing SAS files. It has the additional advantage of being easily cross-platform: it doesn't matter whether your SAS program is on a mainframe, UNIX, Windows, etc.; it's all the same.
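A minimal sketch of that CSV-plus-input-script approach, with made-up file names, columns, lengths, and labels: pandas writes the CSV, and Python also writes a small SAS DATA step that reads it back in.

```python
# A minimal sketch of the CSV-plus-input-script approach: pandas writes the
# CSV, and Python also emits a small SAS DATA step that reads it. File names,
# columns, lengths, and labels are illustrative.
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["alpha", "beta", "gamma"],
    "score": [89.5, 92.0, 77.25],
})

df.to_csv("mydata.csv", index=False)

# Generate a SAS input script that reads the CSV into WORK.MYDATA.
sas_script = """
data work.mydata;
    infile 'mydata.csv' dsd firstobs=2 truncover;
    length name $ 32;
    input id name $ score;
    label id    = 'Record ID'
          name  = 'Record name'
          score = 'Score value';
run;
"""

with open("read_mydata.sas", "w") as f:
    f.write(sas_script)
```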
Edit: If you do have SAS licensed locally, either via a server or a local install, another option for exporting Python data to SAS is SASPy, a SAS-maintained open source project that allows Python to connect directly to SAS instances and send data to them directly. (Under the hood, I believe the data is actually transmitted as a CSV most of the time and then read in using SAS code.) The SAS ODBC driver is also an option, but for Python SASPy will most likely be the easiest.
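A minimal SASPy sketch, assuming SASPy is installed and already configured for your SAS deployment (the sascfg_personal.py connection setup is environment-specific); the table and libref names are illustrative.

```python
# A minimal sketch of sending a DataFrame to SAS via SASPy. Assumes SASPy is
# installed and configured (sascfg_personal.py) for your SAS deployment;
# the table and libref names are illustrative.
import pandas as pd
import saspy

df = pd.DataFrame({"id": [1, 2, 3], "score": [89.5, 92.0, 77.25]})

sas = saspy.SASsession()                                # uses your configured connection
sasdata = sas.df2sd(df, table="mydata", libref="work")  # DataFrame -> SAS data set
print(sasdata.head())                                   # preview the data set from SAS
sas.endsas()
```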
"SAS7BDAT is a closed file format, and not intended to be read/written to by other languages; some have reverse engineered enough of it to read at least, but from what I've seen no good SAS7BDAT writer exists."
Although SAS7BDAT is a proprietary format, it is not closed. It can be read and written by third-party products using SAS's own ODBC drivers: https://support.sas.com/en/software/sas-odbc-drivers.html. Since Python can use ODBC (pyodbc), you can use the SAS ODBC Driver to write the SAS7BDAT file format.
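A heavily simplified pyodbc sketch of that idea follows. It assumes a DSN named sas_local has already been configured for the SAS ODBC Driver, which is environment-specific, and the table layout and SQL accepted by the driver are assumptions for illustration only.

```python
# A heavily simplified sketch of writing rows through the SAS ODBC Driver using
# pyodbc. The DSN name ("sas_local"), library, table, and the exact SQL the
# driver accepts are environment-specific assumptions, not tested values.
import pyodbc

conn = pyodbc.connect("DSN=sas_local")
cur = conn.cursor()

cur.execute("CREATE TABLE work.mydata (id INTEGER, score DOUBLE)")
cur.executemany(
    "INSERT INTO work.mydata (id, score) VALUES (?, ?)",
    [(1, 89.5), (2, 92.0), (3, 77.25)],
)
conn.commit()
conn.close()
```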
IBM SPSS Statistics and IBM SPSS Modeler can also read and write the SAS7BDAT format, as well as the earlier pre-version-7 formats and the SAS Transport File format (the .xpt files) noted above. These products do not require ODBC to do this. In SPSS Statistics Base the capability is provided by the SAVE TRANSLATE command; in SPSS Modeler Professional it is provided by the SAS Source node for reading and the SAS Export node for writing.

Export from .xls to .sql / creating sql queries

Okay guys, I've been having this problem for a few weeks now and I'm getting nowhere with it. I have both OpenOffice and regular Office. Both produce flawed .csv files, or at least phpMyAdmin can't read either of them. Yes, I've tried changing the server's upload settings, etc. I also contacted my web hosting service, and they claimed that all the .csv files I've produced are flawed.
Anyway, I'm looking for a way to convert an .xls table to SQL. Most of the software out there costs money that I don't have. Furthermore, I've seen PHP systems that do just that, so I know this is possible.
There's no need to convert to .sql; you can import directly with phpMyAdmin, or use a tool like Navicat for MySQL. In phpMyAdmin, go to the Import option, find the file, and select the file type (CSV or CSV using LOAD DATA); in the section below that, define the column separator (if you don't know it, open the file with Notepad).
If it's a very large file, use Navicat.
By "flawed" do you mean "defective"? I assume the problem is with Excel; maybe you have defined the same character as the column separator and as the thousands or decimal separator. Try opening the file with OpenOffice.
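If scripting is an option, one programmatic alternative (not mentioned in the answers above) is to read the spreadsheet with pandas and emit a .sql file of INSERT statements that phpMyAdmin can import; the table name, file names, and the crude quoting below are purely illustrative.

```python
# One programmatic alternative: read the spreadsheet with pandas and write a
# .sql file of INSERT statements for phpMyAdmin to import. Requires pandas
# plus an Excel engine (xlrd for legacy .xls, openpyxl for .xlsx). The table
# name and paths are made up, and every value is quoted as a string here,
# which is a simplification.
import pandas as pd

df = pd.read_excel("data.xls")   # one sheet -> DataFrame
table = "mytable"
cols = ", ".join(f"`{c}`" for c in df.columns)

with open("data.sql", "w") as f:
    for _, row in df.iterrows():
        vals = ", ".join("'" + str(v).replace("'", "''") + "'" for v in row)
        f.write(f"INSERT INTO {table} ({cols}) VALUES ({vals});\n")
```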

Get a valid schema of large (1 GB) xml files

I need to bulk load huge XML files into SQL Server 2005. I decided to use SQLXMLBULKLOAD in my C# app, but I need valid XSD schemas of those XML files in order to load them. What is the best way to generate the XSD files?
I tried MS VS xsd.exe, but it tries to load the whole file into memory, which causes an OutOfMemory exception.
Thanks!
Strip the file down to create a smaller one that is representative of the whole, then generate an XSD from that. You can then tailor the result if necessary.
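A rough sketch of that idea in Python (purely illustrative; the same approach works in any language with a streaming XML parser): stream through the huge file and copy only the first few top-level records into a small sample file, which xsd.exe can then handle.

```python
# A rough sketch of the "strip the file down" suggestion: stream through a huge
# XML file and copy the first few top-level records into a small sample file
# that xsd.exe (or another schema generator) can handle. The record count and
# file names are illustrative.
import xml.etree.ElementTree as ET

SOURCE = "huge.xml"
SAMPLE = "sample.xml"
MAX_RECORDS = 50                            # how many top-level records to keep

context = ET.iterparse(SOURCE, events=("start", "end"))
_, root = next(context)                     # first event: start of the root element
sample_root = ET.Element(root.tag, root.attrib)

kept = 0
for event, elem in context:
    # an "end" event for a direct child of the root means one full record is parsed
    if event == "end" and elem in root:
        sample_root.append(elem)
        kept += 1
        if kept >= MAX_RECORDS:
            break

ET.ElementTree(sample_root).write(SAMPLE, encoding="utf-8", xml_declaration=True)
```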
There are quite a few tools to generate schemas from instances, but I don't know how many of them are able to operate in pure streaming mode. One tool which will work regardless of the file size is the DTDGenerator that was originally part of Saxon; you can find it here:
http://saxon.sourceforge.net/dtdgen.html
It produces a DTD rather than a schema, but there are plenty of tools available to convert a DTD to a schema.