How to add headers when we DUMP the data in output file using PIG scripts? - apache-pig

I tried to search for it but cannot find the tip/recommendations.
Here is my situation. I have all the data lined up correctly and output working fine using pig script. Stored the files in a output directory. The output files are more than 100 files so what i have done is accumulated the results file using another pig script.
I was wondering if there is anything in PIG LATIN that will help me add "Header" to the accumulated results file so that business users can quickly use it as it also has headers?
Please advise

If you are using DUMP in Pig script and redirecting the result to a single file, you can use DESCRIBE before DUMP. Doing so will append schema information as header to your output file
A = LOAD 'test' USING PigStorage() AS (col1:int, col2:chararray);
DESCRIBE A;
DUMP A;
output will be something like:
A: {col1: int,col2: chararray}
1,test
2,test
...

Pig can store the schema into a different file ".pig_schema" using PigStorage:
store A into 'outputFile' using PigStorage('\t', '-schema');
will save your data in the outputFile using tabs as delimiters and also creates the schema file.

You can store the header in a separate file, LOAD it and UNION it with your data. Then you need to do an ORDER BY (that might be tricky depending on your data).
Another way would be to use hadoop getmerge.
In general, this is not something pig is very good at, you might as well write a script in another language.

Related

Azure Data Factory 2 : How to split a file into multiple output files

I'm using Azure Data Factory and am looking for the complement to the "Lookup" activity. Basically I want to be able to write a single line to a file.
Here's the setup:
Read from a CSV file in blob store using a Lookup activity
Connect the output of that to a For Each
within the For Each, take each record (a line from the file read by the Lookup activity) and write it to a distinct file, named dynamically.
Any clues on how to accomplish that?
Use Data flow, use the derived column activity to create a filename column. Use the filename column in sink. Details on how to implement dynamic filenames in ADF is describe here: https://kromerbigdata.com/2019/04/05/dynamic-file-names-in-adf-with-mapping-data-flows/
Data Flow would probably be better for this, but as a quick hack, you can do the following to read the text file line by line in a pipeline:
Define your source dataset to output a line as a single column. Normally I would use "NoDelimiter" for this, but that isn't supported by Lookup. As a workaround, define it with an incorrect Column Delimiter (like | or \t for a CSV file). You should also go to the Schema tab, and CLEAR the schema. This will generate a column in the output named "Prop_0".
In the foreach activity, set the Items to the Lookup's "output.value" and check "Sequential".
Inside the foreach, you can use item().Prop_0 to grab the text of the line:
To the best of my understanding, creating a blob isn't directly supported by pipelines [hence my suggestion above to look into Data Flow]. It is, however, very simple to do in Logic Apps. If I was tackling this problem, I would create a logic app with an HTTP Request Received trigger, then call it from ADF with a Web activity and send the text line and dynamic file name in the payload.

Trying to create a table and load data into same table using Databricks and SQL

I Googled for a solution to create a table, using Databticks and Azure SQL Server, and load data into this same table. I found some sample code online, which seems pretty straightforward, but apparently there is an issue somewhere. Here is my code.
CREATE TABLE MyTable
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:sqlserver://server_name_here.database.windows.net:1433;database = db_name_here",
user "u_name",
password "p_wd",
dbtable "MyTable"
);
Now, here is my error.
Error in SQL statement: SQLServerException: Invalid object name 'MyTable'.
My password, unfortunately, has spaces in it. That could be the problem, perhaps, but I don't think so.
Basically, I would like to get this to recursively loop through files in a folder and sub-folders, and load data from files with a string pattern, like 'ABC*', and load recursively all these files into a table. The blocker, here, is that I need the file name loaded into a field as well. So, I want to load data from MANY files, into 4 fields of actual data, and 1 field that captures the file name. The only way I can distinguish the different data sets is with the file name. Is this possible? Or, is this an exercise in futility?
my suggestion is to use the Azure SQL Spark library, as also mentioned in documentation:
https://docs.databricks.com/spark/latest/data-sources/sql-databases-azure.html#connect-to-spark-using-this-library
The 'Bulk Copy' is what you want to use to have good performances. Just load your file into a DataFrame and bulk copy it to Azure SQL
https://docs.databricks.com/data/data-sources/sql-databases-azure.html#bulk-copy-to-azure-sql-database-or-sql-server
To read files from subfolders, answer is here:
How to import multiple csv files in a single load?
I finally, finally, finally got this working.
val myDFCsv = spark.read.format("csv")
.option("sep","|")
.option("inferSchema","true")
.option("header","false")
.load("mnt/rawdata/2019/01/01/client/ABC*.gz")
myDFCsv.show()
myDFCsv.count()
Thanks for a point in the right direction mauridb!!

Saving an sqllite3 database as csv file in Ruby on Rails 5

I am currently working on a project and I want to know how to save an sqllite database in rails as a csv file. I want it when you click the button, the current database on the system download. Can anybody help me? Thanks!
Your problem isn't really specific to Rails. Instead, you're mostly dealing with an administrative issue. You should write a script to export your database as csv, something like this:
#!/bin/bash
./bin/sqlite3 ./my_app/db/my_database.db <<!
.headers on
.mode csv
.output my_output_file.csv
select * from my_table;
!
This script exports a single table. If you have additional tables, you'll want to add them to your script.
The only Rails related issue is the matter of calling that script. Save the script within your application structure; I'd suggest my_app/assets or some similar location.
Now you can run that script using system(command) where command is the absolute path for your script, within a set of double-quotes.

Dynamically populate external tables location

I'm trying to use oracle external tables to load flat files into a database but I'm having a bit of an issue with the location clause. The files we receive are appended with several pieces of information including the date so I was hoping to use wildcards in the location clause but it doesn't look like I'm able to.
I think I'm right in assuming I'm unable to use wildcards, does anyone have a suggestion on how I can accomplish this without writing large amounts of code per external table?
Current thoughts:
The only way I can think of doing it at the moment is to have a shell watcher script and parameter table. User can specify: input directory, file mask, external table etc. Then when a file is found in the directory, the shell script generates a list of files found with the file mask. For each file found issue a alter table command to change the location on the given external table to that file and launch the rest of the pl/sql associated with that file. This can be repeated for each file found with the file mask. I guess the benefit to this is I could also add the date to the end of the log and bad files after each run.
I'll post the solution I went with in the end which appears to be the only way.
I have a file watcher than looks for files in a given input dir with a certain file mask. The lookup table also includes the name of the external table. I then simply issue an alter table on the external table with the list of new file names.
For me this wasn't much of an issue as I'm already using shell for most of the file watching and file manipulation. Hopefully this saves someone searching for ages for a solution.

Import part of MySQL dump (not all of it)

I'm going to do some stress tests and right now I have a really really huge MySQL dump file in hand that could be used as the benchmark.
There's only one table inside the dump.
What's awkward is that my server doesn't have that much disk space to actually hold this table. So I would like to just import some random part of the dump, not all of them.
Is it possible? If yes, what does the command line look like?
I have created a shell script for this. If you are on a unix based system, use
https://github.com/JoyceBabu/MySQL-Dump-Table-Extractor
Invoke the script using ./extract_table.sh sqlfile.sql
To extract a single table type the table name
To extract all tables from table1 to table2 type table1 table2
To view a list of all available tables type LIST
MySQL dump files are simply text files full of SQL statements. Write a simple program to read the dump file and write random parts of it to a new dump file.
Couldn't you just manually split the file? These are just flat text files...so open it up in your favorite text editor and delete half of the file (or however much you want).