Temporary tables using RMySQL

Is there a way to create a temporary table using the RMySQL package? If so, what is the correct way to do it? In particular, I am trying to write a data frame from my R session to the temporary table. I have several processes running in parallel and I don't want to worry about name conflicts, which is why I want the tables to be temporary so they are only visible to each individual session. The solution should somehow involve dbWriteTable and not dbSendQuery("create temporary table tbl;").
NOTE: I found some suggestions online that recommend creating a temporary table manually using dbSendQuery(con, "create temporary table x (x int)") and then simply overwriting it with dbWriteTable(). This does not work.

Depending on your MySQL account restrictions, can you not do
dbSendQuery(con, "create temporary table x (x int);")
dbSendQuery(con, "drop temporary table x;")
etc..

For this type of job, I would avoid reinventing the wheel and use
https://code.google.com/p/sqldf/
By default it is for SQLite, but it also works with MySQL (which I have never tried). This package is rock-solid and well documented.

This is actually a known issue in RMySQL. Your best bet might be to write the data to a temporary file and then construct your own LOAD DATA LOCAL INFILE statement, using RMySQL::mysqlWriteTable as a guide.
For bonus points, if you can patch RMySQL::mysqlWriteTable to work with tempfiles, send a pull request to the GitHub repo.
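As a rough sketch of the statements involved (assuming the data frame has already been written out as a tab-separated file without headers; the file path, table name and column types below are placeholders):
create temporary table mytmp (x int, y varchar(255));
load data local infile '/tmp/mydata.tsv' into table mytmp fields terminated by '\t' lines terminated by '\n';
Both statements would still be issued from R via dbSendQuery; whether LOCAL infile loading is allowed at all depends on the server's local_infile setting.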

Related

How to append new and overwrite existing to SQL from PySpark?

So I have a table in an SQL database and I want to use Synapse (PySpark) to add new records and overwrite existing records. However, in PySpark I can either use overwrite mode (which will delete old records that I am not pushing in this iteration) or append mode (which will not overwrite existing records).
So now I wonder what the best approach would be. I think these are my options:
Option A: Load the old records first, combine in PySpark and then overwrite everything. Downside is I have to load the whole table first.
Option B: Delete the records I will overwrite and then use append mode. Downside is it requires extra steps that might fail.
Option C: A better way I did not think of.
Thanks in advance.
Spark drivers don't really support that. But you can load the data into a staging table and then perform a MERGE or INSERT/UPDATE with T-SQL through pyodbc (Python) or JDBC (Scala).
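For example, once the DataFrame has been written to a staging table (the table and column names here are placeholders), the upsert on the SQL side could look roughly like this:
MERGE INTO dbo.TargetTable AS t
USING dbo.StagingTable AS s
    ON t.Id = s.Id
WHEN MATCHED THEN
    UPDATE SET Value = s.Value
WHEN NOT MATCHED THEN
    INSERT (Id, Value) VALUES (s.Id, s.Value);
The Spark side only needs to write to dbo.StagingTable (overwrite mode is fine there), and the MERGE is then fired through pyodbc or JDBC.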

Hive: create table and write it locally at the same time

Is it possible in hive to create a table and have it saved locally at the same time?
When I get data for my analyses, I usually create temporary tables to track down any
mistakes in the queries/scripts. Some of these are just temporary tables, while others contain the data that I actually need for my analyses.
What I usually do is use hive -e "select * from db.table" > filename.tsv to get the data locally; however, when the tables are big this can take quite some time.
I was wondering if there is some way in my script to create the table and save it locally at the same time. Probably this is not possible, but I thought it is worth asking.
Honestly, the way you are doing it is the better of the two options, but it is worth noting that you can perform a similar task in an .hql file for automation.
Using syntax like this:
INSERT OVERWRITE LOCAL DIRECTORY '/home/user/temp' select * from table;
You can run a query and store the result somewhere in the local directory (as long as there is enough space and the correct privileges).
A disadvantage is that piping the output gives you the data stored nicely delimited and newline-separated, whereas this method will store the values with Hive's default field delimiter ('\001', i.e. Ctrl-A).
A work around is to do something like this:
INSERT OVERWRITE LOCAL DIRECTORY '/home/user/temp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select books from table;
But this is only available in Hive 0.11 or higher.

remove source file from Hive table

When I load a (csv) file into a Hive table I can load without overwriting, thus adding the new file to the table.
Internally the file is just copied to the correct folder in HDFS
(e.g. user/warehouse/dbname/tablName/datafile1.csv). And probably some metadata is updated.
After a few loads I want to remove the contents of a specific file from the table.
I am sure I cannot simply delete the file, because the metadata needs to be adjusted as well. There must be some kind of built-in function for this.
How do I do that?
Why do you need that? I mean, Hive was developed to serve as a warehouse where you put lots and lots of data, not to delete data every now and then. Such a need seems to be a poorly thought-out schema or a poor use of Hive, at least to me.
And if you really have this kind of need, why don't you create partitioned tables? If you need to delete some specific data, just delete that particular partition using either TRUNCATE or ALTER.
TRUNCATE TABLE table_name [PARTITION partition_spec];
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec, PARTITION partition_spec,...
If this feature is needed more than just once in a while, you can use MapR's distribution, which allows this kind of operation with no problem (even via NFS). Otherwise, if you don't have partitions, I think you'll have to create a new table using CTAS, filtering out the data from the bad file, or just copy the good files back to the OS with "hadoop fs -copyToLocal" and move them back into HDFS into a new table.
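For an unpartitioned table, the CTAS route could look something like the sketch below; it relies on Hive's INPUT__FILE__NAME virtual column to drop the rows that came from the unwanted file (table and file names are placeholders):
create table tablename_clean as
select * from tablename
where INPUT__FILE__NAME not like '%datafile1.csv';
You would then drop the original table and rename the cleaned one in its place.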

How do I recreate a VIEW as a local table in SQL Server?

I am using Microsoft SQL Server Management Studio and have access to a bunch of views without the original tables that the view depends on. I have copied some data from this view into a file and would like to import it into a database that I locally created to do some analysis.
The brute-force way of doing this is to manually write the CREATE TABLE statement by looking at the columns in the view, but is there a better way to get a CREATE TABLE or CREATE VIEW statement that I can use directly to recreate a similar table on my localhost?
Create a linked server in your localhost to this server. Then use (while connected to localhost)
SELECT * INTO NewTableName FROM LinkedServer.DBName.SchemaName.View
and a new table will be created in your current DB on localhost.
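If the linked server does not exist yet, it can be created along these lines (the server name, host and login mapping below are placeholders, and the right provider depends on your setup):
EXEC sp_addlinkedserver @server = N'LinkedServer', @srvproduct = N'', @provider = N'SQLNCLI', @datasrc = N'RemoteHostName';
EXEC sp_addlinkedsrvlogin @rmtsrvname = N'LinkedServer', @useself = 'TRUE';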
What I typically prefer to do is use SSIS for data transforms. The first step in the package would be to grab the definition using a SELECT INTO ... WHERE 1=0 so that it doesn't bring over any data and minimizes the locking time (SELECT INTO statements result in database-wide locks). Then, once you have the resulting table with the source view's definition, copy the data over.
If you're afraid the view's definition can change, stick with an INSERT INTO ... SELECT * FROM SQL task. Otherwise, save the definition that you retrieved from the SQL above and create the table (if it does not already exist). Then use a data flow task to transfer the data over.
With either of these approaches, you avoid the potential double hop scenario (if you're using Windows authentication). It's also reusable in a SQL agent job if you need to do this periodically. Otherwise, it may be a little overkill.
Or you can just run the first part in SSMS, but I definitely recommend using the WHERE 1=0 step and then an INSERT INTO rather than a straight SELECT INTO, again to minimize database locking.
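Put together, the SSMS variant of that approach might look like this (reusing the placeholder linked-server path from above):
SELECT * INTO dbo.LocalCopy FROM LinkedServer.DBName.SchemaName.View WHERE 1 = 0;
INSERT INTO dbo.LocalCopy SELECT * FROM LinkedServer.DBName.SchemaName.View;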

Fastest way to clear the content out of many tables

Right now we're using TRUNCATE to clear out the contents of 798 tables in postgres (isolated test runs). Where possible we use transactions. However, in places where it's not possible, we'd like the fastest way to reset the state of the DB.
We're working towards only actually calling truncate on the tables that have been modified (for any given test only a few of the 798 tables will be modified).
What is the fastest way to delete all of the data from many PostgreSQL tables?
Two things come to mind:
Set up the clean DB as a template and createdb a copy from it before each test (see the sketch below).
Set up the clean DB as the default schema, but run the TransactionTests in a different schema (SET search_path TO %s).
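A minimal sketch of the template approach (database names are placeholders; note that the template database must have no active connections while the copy is being made):
CREATE DATABASE test_run TEMPLATE clean_db;
-- run the test against test_run, then
DROP DATABASE test_run;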