Append data to an existing file in U-SQL - azure-data-lake

Can we append data to an existing file in U-SQL?
I have created a CSV file as output in U-SQL. I am now writing another U-SQL query and want to append its output to the existing file.
Is this possible?

It's not supported, and would go against the design of a robust, distributed, idempotent big data system (although you could implement that behaviour yourself by reading the previous output as a rowset and doing a UNION ALL with the new data, as sketched below).
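A minimal sketch of that workaround; the file paths, column names and types here are placeholders, not anything from your actual job:

// Read the previous output back in as a rowset (hypothetical schema).
@previous =
    EXTRACT Name string,
            Amount int
    FROM "/output/results.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// The new data produced by this script (hypothetical source).
@new =
    EXTRACT Name string,
            Amount int
    FROM "/input/new_batch.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// Combine the old and new rows.
@combined =
    SELECT * FROM @previous
    UNION ALL
    SELECT * FROM @new;

// Write to a fresh file rather than the one that was just read.
OUTPUT @combined
TO "/output/results_combined.csv"
USING Outputters.Csv(outputHeader: true);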
The best way to deal with this is to use partitions properly, for example by creating one or more new partitions for each of your executions: https://msdn.microsoft.com/en-us/library/azure/mt621324.aspx

Related

How to load a CSV file into multiple tables in Postgres (mainly concerned about best practice)

I'm new to DB/postgres SQL.
Scenario:
I need to load a CSV file into a Postgres DB. The CSV data needs to be loaded into multiple tables according to the DB schema. I'm looking for a good design using a Python script.
My thought:
1. Load the CSV file into an intermediate table in Postgres.
2. Write a trigger on the intermediate table to insert the data into multiple tables on insert.
3. The trigger truncates the intermediate table at the end.
Any suggestions for a better design or other approaches without ETL tools, and any info on relevant Python 3 modules?
Thanks.
Rather than using a trigger, use an explicit INSERT or UPDATE statement. That is probably faster, since it is not invoked per row.
Apart from that, your procedure is fine.
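A minimal sketch of that staging approach; the table names, columns and file path are hypothetical, not taken from your schema:

-- Staging table that mirrors the CSV layout (hypothetical columns).
CREATE TABLE staging_orders (
    customer_name text,
    product_code  text,
    quantity      integer
);

-- Bulk-load the CSV into the staging table.
COPY staging_orders FROM '/tmp/orders.csv' WITH (FORMAT csv, HEADER true);

-- Fan the data out to the real tables with set-based statements.
INSERT INTO customers (name)
SELECT DISTINCT customer_name FROM staging_orders;

INSERT INTO order_lines (product_code, quantity)
SELECT product_code, quantity FROM staging_orders;

-- Clear the staging table for the next load.
TRUNCATE staging_orders;

From Python 3 the same flow can be driven with psycopg2, using cursor.copy_expert() for the COPY step so the file does not have to be readable by the database server.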

PDI or mysqldump to extract data without blocking the database or getting inconsistent data?

I have an ETL process that will run periodically. I was using kettle (PDI) to extract the data from the source database and copy it to a stage database. For this I use several transformations with table input and table output steps. However, I think I could get inconsistent data if the source database is modified during the process, since this way I don't get a snapshot of the data. Furthermore, I don't know if the source database would be blocked. This would be a problem if the extraction takes some minutes (and it will take them). The advantage of PDI is that I can select only the necessary columns and use timestamps to get only the new data.
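To be concrete, the kind of timestamp-based extraction I have in mind is a Table Input query along these lines (the table, columns and the ${LAST_RUN} variable are just placeholders):

-- Incremental extract: only rows changed since the previous run.
-- (Requires "Replace variables in script?" to be enabled on the step.)
SELECT id, customer_id, amount, updated_at
FROM orders
WHERE updated_at > '${LAST_RUN}'
ORDER BY updated_at;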
On the other hand, I think mysqldump with --single-transaction allows me to get the data in a consistent way without blocking the source database (all tables are InnoDB). The disadvantage is that I would get unnecessary data.
Can I use PDI, or do I need mysqldump?
PS: I need to read specific tables from specific databases, so I think xtrabackup is not a good option.
However, I think I could get inconsistent data if the source database is modified during the process, since this way I don't get a snapshot of the data
I think "Table Input" step doesn't take into account any modifications that are happening when you are reading. Try a simple experiment:
Take a .ktr file with a single table input and table output. Try loading the data into the target table. While in the middle of data load, insert few records in the source database. You will find that those records are not read into the target table. (note i tried with postgresql db and the number of rows read is : 1000000)
Now for your question, i suggest you using PDI since it gives you more control on the data in terms of versioning, sequences, SCDs and all the DWBI related activities. PDI makes it easier to load to the stage env. rather than simply dumping the entire tables.
Hope it helps :)
Interesting point. If you do all the table inputs in one transformation then at least they all start at the same time, but while that is likely to be consistent, it's not guaranteed.
There is no reason you can't use PDI to orchestrate the process AND use mysqldump. In fact, for bulk inserts or extracts it's nearly always better to use the vendor-provided tools.

Can a Hive reducer script give output to more than one table?

I want to save some data that the reducer script (written in PHP) accumulates over the course of its execution into another table (with a totally different schema). What is the best way to go about it?
Also, can we write to a file from within a reducer?
One option you have is to upload the data to HDFS via WebHDFS into the appropriate directory (http://hadoop.apache.org/docs/stable/webhdfs.html / CREATE).
If it's more convenient you can also then expose it as a partition using WebHCatalog (http://hive.apache.org/docs/hcat_r0.5.0/resources.html / PUT ddl/database/:db/table/:table/partition/:partition)
It will then show up in the table you wanted to write it into.
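For reference, that WebHCatalog call amounts to roughly the same thing as running Hive DDL along these lines (the table name, partition column and HDFS path are hypothetical):

-- Register the directory the reducer wrote to as a new partition.
ALTER TABLE stats_summary
ADD IF NOT EXISTS PARTITION (run_date = '2014-01-15')
LOCATION '/user/hive/warehouse/stats_summary/run_date=2014-01-15';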

Extract specific data from full mysqldump backup

I am making regular backups of my MySQL database with mysqldump. This gives me a .sql file with CREATE TABLE and INSERT statements, allowing me to restore my database on demand. However, I have yet to find a good way to extract specific data from this backup, e.g. extract all rows from a certain table matching certain conditions.
Thus, my current approach is to restore the entire file into a new temporary database, extract the data I actually want with a new mysqldump call, delete the temporary database and then import the extracted lines into my real database.
Is this really the best way to do this? Is there some sort of script that can directly parse the .sql file and extract the relevant lines? I don't think there is an easy solution with grep and friends unfortunately, as mysqldump generates INSERT statements that insert many values per line.
The solution to this just ended up being to import the whole file, extract the data I needed and drop the database again.
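In SQL terms the temporary-database round trip looks roughly like this (the database, table and condition are hypothetical, and the restore itself is run with the mysql client):

-- Create a scratch database, then restore the dump into it,
-- e.g. from the shell: mysql tmp_restore < backup.sql
CREATE DATABASE tmp_restore;

-- After the restore, pull only the rows you need into the live database.
INSERT INTO mydb.orders
SELECT * FROM tmp_restore.orders
WHERE order_date >= '2014-01-01';

-- Drop the scratch database again.
DROP DATABASE tmp_restore;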

Bteq Scripts to copy data between two Teradata servers

How do I copy data from multiple tables within one database to another database residing on a different server?
Is this possible through a BTEQ Script in Teradata?
If so, provide a sample.
If not, are there other options to do this other than using a flat-file?
This is not possible using BTEQ since, as you mentioned, the two databases reside on different servers.
There are two solutions for this.
Arcmain - You need to run an Arcmain backup first, which creates files containing the data from your tables. Then you run an Arcmain restore, which restores the data from those files.
TPT - Teradata Parallel Transporter. This is a very advanced tool. It does not create any files like Arcmain; it moves the data directly between two Teradata servers. (Wikipedia)
If I am understanding your question, you want to move a set of tables from one DB to another.
You can use the following syntax in a BTEQ Script to copy the tables and data:
CREATE TABLE <NewDB>.<NewTable> AS <OldDB>.<OldTable> WITH DATA AND STATS;
Or just the table structures:
CREATE TABLE <NewDB>.<NewTable> AS <OldDB>.<OldTable> WITH NO DATA AND NO STATS;
If you get really savvy you can create a BTEQ script that dynamically builds the above statement in a SELECT statement, exports the results, and then in turn runs the newly exported file, all within a single BTEQ script (see the sketch below).
There are a number of other options available with CREATE TABLE <...> AS <...>;. You would be best served by reviewing the Teradata manuals for more details.
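A hedged sketch of that dynamic-script idea, assuming the source tables are listed in DBC.TablesV and that the generated statements fit within the export width (the logon placeholders and database names are not from your environment):

.LOGON <tdpid>/<user>,<password>
.SET WIDTH 254
.SET TITLEDASHES OFF
.EXPORT REPORT FILE = copy_tables.btq

/* Generate one CREATE TABLE ... AS ... statement per source table. */
SELECT 'CREATE TABLE NewDB.' || TRIM(TableName) ||
       ' AS OldDB.' || TRIM(TableName) ||
       ' WITH DATA AND STATS;' (TITLE '')
FROM DBC.TablesV
WHERE DatabaseName = 'OldDB'
  AND TableKind = 'T';

.EXPORT RESET

/* Run the file that was just generated. */
.RUN FILE = copy_tables.btq
.LOGOFF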
There are a few more options which will allow you to copy from one table to another.
Possibly the simplest way would be to write a smallish program which uses one of their communication layers (ODBC, the .NET Data Provider, JDBC, CLI, etc.) and runs a SELECT statement against one server and an INSERT statement against the other. This would require some work, but it would have less overhead than trying to learn how to write TPT scripts, and you would not need any 'DBA' permissions to write your own.
Teradata also sells other applications which hide the complexity of some of these tools. Teradata Data Mover provides an abstraction layer over tools like Arcmain and TPT. Access to this tool is most likely restricted to DBA types.
If you want to move data from one server to another, you can do it with a flat file.
First, fetch the data from the source table into a flat file using a utility such as BTEQ or FastExport.
Then load that data into the target table using MultiLoad, FastLoad or BTEQ scripts, as sketched below.
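A minimal BTEQ-only sketch of that flat-file approach (server names, credentials, table and column definitions are all placeholders, and it assumes the exported column types match the USING clause exactly; nullable columns would call for INDICDATA instead):

/* On the source server: export the rows to a flat file. */
.LOGON <source_tdpid>/<user>,<password>
.EXPORT DATA FILE = employee.dat
SELECT EmpId, EmpName FROM SourceDB.Employee;
.EXPORT RESET
.LOGOFF

/* On the target server: read the file back and insert the rows. */
.LOGON <target_tdpid>/<user>,<password>
.IMPORT DATA FILE = employee.dat
.REPEAT *
USING (EmpId INTEGER, EmpName VARCHAR(100))
INSERT INTO TargetDB.Employee (EmpId, EmpName)
VALUES (:EmpId, :EmpName);
.LOGOFF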