Is it possible to work on a per-row basis in Kettle?
I am trying to implement a reporting scheme that consists of a table where requests are queued for processing, and a Pentaho job that picks up the records from that table.
My job currently has 3 transformations in it:
The 1st gets records from the queued requests table.
The 2nd analyzes the values on each record and comes up with multiple results based on that record. For example, a user might request records of movies in the horror genre; it should then spit out the horror movies.
The 3rd further retrieves information about the movies, such as the year, director, and so on, which is output to an Excel file.
This is the idea, but it's a bit challenging to do in Pentaho, as it processes everything at the same time. Is there a way I can make my job work on records one by one?
EDIT.
Just to add: I have been trying to extend the implementation from the Pentaho cookbook sample, but compared to my design, it covers only step 2 and step 3.
I can't seem to make the Table Input step work one row at a time.
In the end, I made it act like the implementation in the cookbook, with some adjustments: instead of using two transformations to gather all the necessary fields, I retrieved all the information I need in one transformation.
After that, I copied that information to the next steps, ran some queries to complete the information, and it is now working.
Passing parameters between transformations is a bit confusing: there are parameters to be set on the transformation itself and also on the job where the transformations live, so I went guessing for some time just to make it work.
I am a new employee at the company. The person before me built some tables in BigQuery. I want to investigate the CREATE TABLE query for a particular table.
Things I would want to check using the query are:
What joins were used?
What are the other tables used to make the table in question?
I have not worked with BigQuery before but I did my due diligence by reading tutorials and the documentation. I could not find anything related there.
A brief outline of the steps:
Step 1 - gather all query jobs of that user using the Jobs.list API - you must have the Is Owner permission on the respective projects to get someone else's jobs.
Step 2 - extract only those jobs, run by the user you mentioned, that reference your table of interest - using the destination table attribute.
Step 3 - for those extracted jobs, simply check the respective queries, which will show you how that table was populated.
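If the jobs are recent enough, newer BigQuery releases also expose job history via INFORMATION_SCHEMA, which can stand in for steps 1 and 2. A sketch (the region qualifier and table name are placeholders to adapt; job history retention is limited, roughly 180 days):

    -- Find recent query jobs that wrote to the table of interest
    SELECT creation_time, user_email, query
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE destination_table.table_id = 'your_table_name'
    ORDER BY creation_time DESC;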
Hth!
I had been looking for an answer for a long time.
I finally found it:
Go to the three-bar menu at the top left.
From there, go to the Analytics tab.
Select BigQuery, under which you will find the Scheduled queries option; click on that.
In the filter tab you can enter keywords and get the required query for the table.
For me, I was able to go through my query history and find the query I used.
Step 1.
Go to the BigQuery UI; at the bottom there are personal history and project history tabs. If you can use the same account that executed the query, I recommend personal history.
Step 2.
Click on the tab and there will be a list of queries, ordered from most recently run. Check the time the table was created and find a query that ran just before the table creation time.
Since the query runs first and then creates the table, there will be a slight difference between the two times; for me it was within a few seconds.
Step 3.
After you find the query used to create the table, simply copy it. And you're done.
I have a few tables as shown below.
Polls
PollId Question Option
1 What 1
2 Why 4
Updates
UpdateId Text
1 Sleep
2 Play
Polls and Updates are just two sample tables (in reality there are more tables, like photos, videos, links, etc.). When a user visits his home page (like the Facebook news feed), he must be shown data relevant to him (no such data is included in this example). That is, I want to select data from all the tables with as few query executions as possible (I want to present a mixture of data: polls, photos, videos, etc.).
Currently, I'm fetching only the ids and a type (i.e., which table they come from) from all of the tables, and gathering further data while iterating through this result set (i.e., calling another SQL query from C#).
Is there a way to query the data from all the tables at once (OUTER JOIN? UNION?)?
Or simply,
How can I select different types of entities at once in a single SQL query?
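For example, I imagine something along these lines, though I'm not sure it's the right approach (the ItemType discriminator and the column alignment are just my guess):

    SELECT 'Poll' AS ItemType, PollId AS Id, Question AS Title
    FROM Polls
    UNION ALL
    SELECT 'Update', UpdateId, Text
    FROM Updates;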
You could write your query so that you have one long select list for everything you want, and it all comes back in one result set, but I suspect that wouldn't work too well because you might have varying numbers of the different types of items per user.
If you really must have it all in one hit, then you can issue multiple queries in one go and get multiple result sets back. To handle this you can use an ADO.NET DataSet. See this SO example (but not the accepted answer - see Vikram Dibyal's answer, as that gives a very basic overview of what I think you're asking for).
I won't copy and paste the stuff from the linked thread, just head over and take a look.
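For illustration, the batch you send can simply be several SELECT statements in one command; each one comes back as a separate result set in the DataSet. A sketch (the UserId filter is an assumption, since the sample tables don't show how rows relate to users):

    SELECT PollId, Question, [Option] FROM Polls WHERE UserId = @UserId;
    SELECT UpdateId, Text FROM Updates WHERE UserId = @UserId;
    SELECT PhotoId, Url FROM Photos WHERE UserId = @UserId;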
I have what I consider a real need to create a query with several hundred columns.
We are working on a mailing for our client. In this mailing, they list several locations where their customers can go to get information. As our designers create the template for this mailing, they set up "slots" for each address. The number of slots varies from one mailing to another, from 6 to possibly 50.
My need for the query is to set up the merge of data into the mailing. I need to provide a query where each mailing is 1 record containing all the information they need for that mailing. I am dynamically creating the SQL statement with the max number of slots on that mailing. With up to 50 slots on that mailing, my query needs to look like this:
MailingID,
LogoLocation,
APNCode,
TFN,
CopyVersion,
Slot1_Name,
Slot1_Address,
Slot1_City,
Slot1_State,
Slot1_DateTime,
...
Slot50_Name,
Slot50_Address,
Slot50_City,
Slot50_State,
Slot50_DateTime
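To illustrate, building that column list dynamically might look something like this T-SQL sketch (names as above; this is just the idea, not my actual generation code):

    DECLARE @cols nvarchar(max) = N'MailingID, LogoLocation, APNCode, TFN, CopyVersion';
    DECLARE @i int = 1;
    WHILE @i <= 50  -- the max number of slots on this mailing
    BEGIN
        SET @cols += N', Slot' + CAST(@i AS nvarchar(10)) + N'_Name'
                   + N', Slot' + CAST(@i AS nvarchar(10)) + N'_Address'
                   + N', Slot' + CAST(@i AS nvarchar(10)) + N'_City'
                   + N', Slot' + CAST(@i AS nvarchar(10)) + N'_State'
                   + N', Slot' + CAST(@i AS nvarchar(10)) + N'_DateTime';
        SET @i += 1;
    END;
    -- @cols is then spliced into the final SELECT statement.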
My first attempt was to create a table with all these fields, but I got this error:
The table has been created, but its maximum row size exceeds the allowed maximum of 8060 bytes. INSERT or UPDATE to this table will fail if the resulting row exceeds the size limit.
They only want the data in a CSV file, so I don't need to create a temp table for it.
My problem is that I'm trying to create a standard process, and with the number of fields varying like that, I want to set this up in a way that won't blow up the system every time we try to run it.
I've looked at a few pages and found details on the size limitations of SQL Server and several comments saying a table like this shows a bad database design.
http://msdn.microsoft.com/en-us/library/ms143432(v=sql.105).aspx
http://social.msdn.microsoft.com/Forums/en-US/fec1efbb-94ff-4fe9-8d69-12e95c48587d/its-maximum-row-size-exceeds-the-allowed-maximum-of-8060-bytes-insert-or-update-to-this-table-will?forum=transactsql
Work around SQL Server maximum columns limit 1024 and 8kb record size
I'm hoping that someone out there has some experience doing this and can share some insights on how to make this efficient. Is there another way to accomplish this that I don't know about?
UPDATE:
Thanks for all the quick replies.
More detail on my scenario: you get a flyer in the mail, and when you turn the flyer over, it lists 50 locations in your county where you could go to take a class or attend a meeting. All the details for that flyer need to be in 1 record so they can map the fields onto the one page. If that county has 50 address/date/time combinations, they need them included in the 1 record so they can properly slot the flyer. Think of a giant mail merge where there might only be 100 counties (100 flyers), but each flyer has tons of information.
When the data is actually stored in the database, I'm storing an id for the specific flyer (MailingID) and each address/date/time combo is its own record. It's just the file they need to merge the details onto the creative piece that has to be denormalized like this.
I haven't been able to find any details on limitations on views. Does a View have the same limitations as a table? Would it work to create a view for them that they can download when they need the data?
"All the details for that flyer needs to be in 1 record so they can map the fields on the one page" - that is a questionable assumption. Why can't the data be stored in 50 rows in a 2nd table?
Anyway, if you insist on storing everything in one row, you should probably use XML or JSON. That makes all these problems go away. SQL Server has great support for XML; you can even generate XML on the fly. So you could properly store the 50 items in a 2nd table and only combine them into one XML value for query purposes.
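For example, a sketch of combining slot rows into one XML value per mailing (the table and column names here are made up):

    SELECT m.MailingID,
           (SELECT s.Name, s.Address, s.City, s.State, s.SlotDateTime
            FROM MailingSlots s
            WHERE s.MailingID = m.MailingID
            ORDER BY s.SlotNumber
            FOR XML PATH('Slot'), ROOT('Slots'), TYPE) AS SlotsXml
    FROM Mailings m;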
It might be that I'm searching for the wrong keywords, but so far I couldn't find anything useful.
My problem is quite simple: at the moment I get a list of individual IDs through a report parameter, pass them to a procedure, and show the results.
The new request is this: instead of showing one list for all individuals at once, there should be a separate list for each individual ID.
Since I'm quite a beginner in SSRS, I thought the easiest approach would be the best: create a subreport, copy the shown list into it, and create one subreport per individual ID.
The number of these IDs is dynamic, so I have to create a dynamic number of subreports.
Funnily enough, this doesn't seem to be possible. This URL (http://forums.asp.net/t/1397645.aspx) doesn't show exactly my problem, but it does show the limits of subreports.
I even went through the whole MSDN pages starting at http://technet.microsoft.com/en-us/library/dd220581.aspx, but I couldn't find anything there.
So is there a way to create a loop like:
For each individual ID in the list of individual IDs, create a subreport and pass ONE ID to it?
Or is there another approach I should use to make this work?
I tried to create a 'fake' dataset with no SQL query, just to iterate over the ID list, but it seems a dataset needs a data source...
As usual, thanks so far for all answers!
Matthias Müller
"Or is there another approach I should use to make this work?"
You didn't provide much detail about what sort of information needs to be included in the subreport, but assuming it's a small amount of data (say, showing a personnel record) and not a huge amount (such as a person's sales for the last year), a List might be the way to go.
"I tried to create a 'fake' dataset with no SQL query, just to iterate over the ID list, but it seems a dataset needs a data source..."
All datasets require a data source, though if you're merely hard-coding some fake return data, any data source will do, even a local SQL instance with nothing in it.
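For instance, if the IDs arrive as a comma-separated string, a minimal dataset query that yields one row per ID (each row could then drive one List instance) might look like this; STRING_SPLIT assumes SQL Server 2016 or later:

    -- One row per individual ID
    SELECT value AS IndividualId
    FROM STRING_SPLIT(@Ids, ',');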
I need to schedule events, tasks, appointments, etc. in my DB. Some of them will be one-time appointments, and some will be recurring "to-dos" which must be checked off. After looking at Google Calendar's layout and others, plus doing a lot of reading, here is what I have so far.
Calendar table (could be called a Schedule table, I guess): basic event title, start/end, recurrence info.
Calendar occurrence table: ties to the schedule table; occurrence-specific text; next occurrence date/time?
Looked here at how SQL Server does its jobs: http://technet.microsoft.com/en-us/library/ms178644.aspx
but this is slightly different.
Why two tables? I need to track the status of each instance of the recurring task; otherwise this would be much simpler...
So, on to the questions:
1) Does this seem like the proper way to go about it? Is there a better way to handle the multiple-occurrence issue?
2) How often, and how, should I trigger creation of the occurrences? I really don't want to create a bunch of occurrences... but what if the user wants to view next year's calendar?
It makes sense to have your schedule definition for a task in one table and then a separate table to record each instance separately - that's the approach I've taken in the past.
As for creating the occurrences, there's probably no need to create them all up front, especially when you consider tasks that repeat indefinitely! Again, the approach I've used in the past is to create only the next occurrence. When that instance is actioned, the next instance is calculated and created.
This leaves the issue of viewing future occurrences. For this, you can start with the initial/next scheduled occurrence and just calculate the future occurrences on the fly at display time.
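A minimal sketch of that two-table shape (all names and types here are illustrative):

    CREATE TABLE Schedule (
        ScheduleId     int IDENTITY PRIMARY KEY,
        Title          nvarchar(200) NOT NULL,
        StartAt        datetime NOT NULL,
        EndAt          datetime NULL,           -- NULL = repeats indefinitely
        RecurrenceRule nvarchar(100) NULL       -- e.g. 'WEEKLY;INTERVAL=1'
    );

    CREATE TABLE ScheduleOccurrence (
        OccurrenceId int IDENTITY PRIMARY KEY,
        ScheduleId   int NOT NULL REFERENCES Schedule(ScheduleId),
        DueAt        datetime NOT NULL,         -- when this instance happens
        CompletedAt  datetime NULL              -- NULL = not yet actioned
    );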
While this isn't an exact answer to your question, I've solved this problem before in SQL Server (though the database here is irrelevant) by modeling a solution based on Unix's cron.
Instead of string parsing, we used integer columns in a table to store the various time units.
We had events which could be scheduled; they could point either to a one-time schedule table that represented a distinct point in time (a date/time) or to the recurring schedule table, which is modeled after cron.
Additionally, remember to model your solution correctly. An event has a duration, but the duration is unrelated to the schedule (although an event's duration may impact the schedule by causing conflicts). Do not try to model duration as part of your schedule.
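A rough sketch of that cron-style table (hypothetical names; NULL plays the role of cron's *, meaning "any value"):

    CREATE TABLE RecurringSchedule (
        RecurringScheduleId int IDENTITY PRIMARY KEY,
        [Minute]   tinyint NULL,  -- 0-59, NULL = every minute
        [Hour]     tinyint NULL,  -- 0-23
        DayOfMonth tinyint NULL,  -- 1-31
        [Month]    tinyint NULL,  -- 1-12
        DayOfWeek  tinyint NULL   -- 0-6, Sunday = 0
    );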
In the past when we've done this, we had 2 tables:
1) Schedules -> Includes recurrence information
2) Exceptions -> edits/changes to specific instances
Using SQL, it's possible to get the list of "Schedules" that have at least one instance in a given date range. Then you can expand, in the GUI, where each instance lies.
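A sketch of that query, assuming the Schedules table stores the recurrence window (the column names are hypothetical):

    -- Schedules whose active window overlaps @RangeStart..@RangeEnd,
    -- i.e. those with at least one instance in the range
    SELECT s.ScheduleId
    FROM Schedules s
    WHERE s.FirstOccurrence <= @RangeEnd
      AND (s.RecurrenceEnd IS NULL OR s.RecurrenceEnd >= @RangeStart);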