I have a very large data set in GPDB from which I need to extract close to 3.5 million records. I use this for a flatfile which is then used to load to different tables. I use Talend, and do a select * from table using the tgreenpluminput component and feed that to a tfileoutputdelimited. However due to the very large volume of the file, I run out of memory while executing it on the Talend server.
I lack the permissions of a super user and unable to do a \copy to output it to a csv file. I think something like a do while or a tloop with more limited number of rows might work for me. But my table doesnt have any row_id or uid to distinguish the rows.
Please help me with suggestions how to solve this. Appreciate any ideas. Thanks!
If your requirement is to load data into different tables from one table, then you do not need to go for load into file and then from file to table.
There is a component named tGreenplumRow which allows you to write direct sql queries (DDL and DML queries) in it.
Below is a sample job,
If you notice, there are three insert statements inside this component. It will be executed one by one separated by semicolon.
Related
I have about 100 tables to which we replicate data, e.g. from the Oracle database.
I would like to quickly check that the data replicated to the tables in db2 is the same as in the source system.
Does anyone have a way to do this? I can create 100 transformations, but that's monotonous and time consuming. I would prefer to process this in a loop.
I thought I would keep the queries in a table and reach into it for records.
I read the data from Table input (sql_db2, sql_source, table_name) and write do copy rows to result. Next I read single record and I read a single record and put it into a loop.
But here came a problem because I don't know how to dynamically compare the data for the tables. Each table has different columns and here I have a problem.
I don't know if this is also possible?
You can inject metadata (in this case your metadata would be the column and table names) to a lot of steps in Pentaho, you create a transformation to collect the metadata to inject to another transformation that has only the steps and some basic information, but the bulk of the information of the columns affected by the different steps is in the transformation injecting the metadata.
Check Pentaho official documentation about Metadata Injection (MDI) and the sample with a basic example of metadata injection available in your PDI installation.
I'm new to DB/postgres SQL.
Scenario:
Need to load an csv file into postgres DB. This CSV data needs to loaded into multiple tables according DB schema. I'm looking for better design using python script.
My thought:
1. Load CSV file to intermediate table in postgres
2. Write a trigger on intermediate table to insert data into multiple tables on event of insert
3. Trigger includes truncate data at end
Any suggestions for better design/other ways without any ETL tools, and also any info on modules in Python 3.
Thanks.
Rather than using a trigger, use an explicit INSERT or UPDATE statement. That is probably faster, since it is not invoked per row.
Apart from that, your procedure is fine.
When I am using the above syntax in "Execute row script" step...it is showing success but the temporary table is not getting created. Plz help me out in this.
Yes, the behavior you're seeing is exactly what I would expect. It works fine from the TSQL prompt, throws no error in the transform, but the table is not there after transform completes.
The problem here is the execution model of PDI transforms. When a transform is run, each step gets its own thread of execution. At startup, any step that needs a DB connection is given its own unique connection. After processing finishes, all steps disconnect from the DB. This includes the connection that defined the temp table. Once that happens (the defining connection goes out of scope), the temp table vanishes.
Note, that this means in a transform (as opposed to a Job), you cannot assume a specific order of completion of anything (without Blocking Steps).
We still don't have many specifics about what you're trying to do with this temp table and how you're using it's data, but I suspect you want its contents to persist outside your transform. In that case, you have some options, but a global temp table like this simply won't work.
Options that come to mind:
Convert temp table to a permanent table. This is the simplest
solution; you're basically making a staging table, loading it with a
Table Output step (or whatever), and then reading it with Table
Input steps in other transforms.
Write table contents to a temp file with something like a Text File
Output or Serialze to File step, then reading it back in from the
other transforms.
Store rows in memory. This involves wrapping your transforms in a
Job, and using the Copy Rows to Results and Get Rows from Results steps.
Each of these approaches has its own pros and cons. For example, storing rows in memory will be faster than writing to disk or network, but memory may be limited.
Another step it sounds like you might need depending on what you're doing is the ETL Metadata Injection step. This step allows you in many cases to dynamically move the metadata from one transform to another. See the docs for descriptions of how each of these work.
If you'd like further assistance here, or I've made a wrong assumption, please edit your question and add as much detail as you can.
Nothing technical here. Suppose I have a lot of different categorized data, and I would like to create a database out of it. Would someone literally hand plug in all that info with SQL code itself? Or do some people make a mock website just to input data? What are some of your strategies?
If there would be no way to do it automatically, then a mock website would be the way to go: you can even use it with more people at once, actually multiplying the input speed (as long as you don't mess up assigning each of them a different part of the data).
What format is your data in? And how much of it is there? If its Excel then SQL Server has tools to import it in. I'm not sure if MySQL has anything similar. Even if it doesn't one other technique I have used with Excel data is to use a formula to concatenate as required to generate the INSERT statements. Then just paste those into a query window and run that.
I wouldn't do a website for it unless I was building an admin site for it already and wanted to test that with the initial load.
Most databases have a way to do bulk inserts or have tools for data import.
My strategies normally involve such tools.
Here is an example of importing a CSV file to SQL Server.
Most database servers provide a way to import data from a variety of formats, you could look into that first.
If not, you could write a simple script or console application to parse your input data, and write out a SQL script to insert the data into appropriate tables.
For example, if you data was in a CSV file, you would parse each line in the file, and generate an insert statement to write out to a .sql file.
MyData.csv
1,2,3,'Test',4
2,3,4,'Test2,6
GeneratedInsert.sql
insert into table (col1,col2,col3,col4,col5) values (1,2,3,'Test',4)
insert into table (cal1,col2,col3,col4,col5) values (2,3,4,'Test2',6)
MySQL has a statement LOAD DATA INFILE that is intended for loading bulk data from flat files. It's easy to use and much faster than alternative methods.
But first you do have to use SQL to design tables with fields that match the field of your import data. That is, if you have some file with comma-separated data:
Titanic;1997;4 stars
Batman Begins;2005;5 stars
"Harry Potter and the Sorcerer's Stone";2001;3 stars
...
You would create a table:
CREATE TABLE Movies (
title VARCHAR(100) NOT NULL,
year YEAR NOT NULL
rating VARCHAR(10)
);
Then load data:
LOAD DATA INFILE 'movies.txt' INTO TABLE Movies
FIELDS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '"';
Most web languages have some sort of auto-scaffolding that you can quickly set up. Useful for admin work as well, if your site is hosted without direct access to DB.
Otherwise, yeah - write the SQL statements. Useful to bring a database up as part of your build process.
I'm looking to execute a series of queries as part of a migration project. The scripts to be generated are produced from a tool which analyses the legacy database then produces a script to map each of the old entities to an appropriate new record. THe scripts run well for small entities but some have records in the hundreds of thousands which produce script files of around 80 MB.
What is the best way to run these scripts?
Is there some SQLCMD from the prompt which deals with larger scripts?
I could also break the scripts down into further smaller scripts but I don't want to have to execute hundreds of scripts to perform the migration.
If possible have the export tool modified to export a BULK INSERT compatible file.
Barring that, you can write a program that will parse the insert statements into something that BULK INSERT will accept.
BULK INSERT uses BCP format files which come in traditional (non-XML) or XML. Does it have to get a new identity and use it in a child and you can't get away with using SET IDENTITY INSERT ON because the database design has changed so much? If so, I think you might be better off using SSIS or similar and doing a Merge Join once the identities are assigned. You could also load the data into staging tables in SQL using SSIS or BCP and then use regular SQL (potentially within SSIS in a SQL task) with the OUTPUT INTO feature to capture the identities and use them in the children.
Just execute the script. We regularly run backup / restore scripts that are 100's Mb in size. It only takes 30 seconds or so.
If it is critical not to block your server for this amount to time, you'll have to really split it up a bit.
Also look into the -tab option of mysqldump with outputs the data using TO OUTFILE, which is more efficient and faster to load.
It sounds like this is generating a single INSERT for each row, which is really going to be pretty slow. If they are all wrapped in a transaction, too, that can be kind of slow (although the number of rows doesn't sound that big that it would cause a transaction to be nearly impossible - like if you were holding a multi-million row insert in a transaction).
You might be better off looking at ETL (DTS, SSIS, BCP or BULK INSERT FROM, or some other tool) to migrate the data instead of scripting each insert.
You could break up the script and execute it in parts (especially if currently it makes it all one big transaction), just automate the execution of the individual scripts using PowerShell or similar.
I've been looking into the "BULK INSERT" from file option but cannot see any examples of the file format. Can the file mix the row formats or does it have to always be consistent in a CSV fashion? The reason I ask is that I've got identities involved across various parent / child tables which is why inserts per row are currently being used.