I have a ~20 million row dataset of US addresses. Currently I query each state's addresses by manually changing the WHERE clause to the proper state abbreviation (e.g. WHERE STATE_ABV = 'NY').
I have to run the queries state by state because I export the results as CSVs, which have a row limit too low to handle the entire US dataset.
Is there a way to automate the process, so that after each run the query results would be exported to a CSV, the WHERE clause would be changed to the next state abbreviation, and the process run again? By the end of the automation I would like to have 50 CSVs saved somewhere (one for each state).
All my searching about automation discussed recurring queries based on time, not variables, so I understand if this is not possible.
The software I am using is SQL Developer.
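For clarity, here is the shape of what I'm after, as a rough sketch (this assumes SQL Developer's script engine: SPOOL for writing files and the /*csv*/ hint for CSV-formatted script output; the paths and the ADDRESSES table name are made up):

    -- Run as a script (F5) in SQL Developer.
    SET TERMOUT OFF
    SET HEADING OFF
    SET FEEDBACK OFF

    -- Step 1: generate one SPOOL / SELECT / SPOOL OFF block per distinct state.
    SPOOL c:\exports\export_all_states.sql
    SELECT 'SPOOL c:\exports\' || state_abv || '.csv' || CHR(10)
        || 'SELECT /*csv*/ * FROM addresses WHERE state_abv = ''' || state_abv || ''';' || CHR(10)
        || 'SPOOL OFF'
      FROM (SELECT DISTINCT state_abv FROM addresses);
    SPOOL OFF

    -- Step 2: run the generated script; each state lands in its own CSV.
    @c:\exports\export_all_states.sql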
Thanks as always
When you are providing a report to an end user and you want to validate the report against the system and database (checking that your SQL code is pulling the details accurately), how much validation of the output Excel file is considered enough?
For example:
10% of 100 would be 10.
10% of 1,000 would be 100, which seems reasonable.
10% of 1,000,000 would be 100,000, which seems completely unreasonable.
Is there a template or scale for human validation across large datasets? Has anyone done or seen something like this before that I could use as a guide?
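For reference, the closest thing to a standard scale is the statistical sample-size calculation (Cochran's formula with a finite population correction) rather than a fixed percentage:

    n_0 = z^2 p (1 - p) / e^2
    n   = n_0 / (1 + (n_0 - 1) / N)

At 95% confidence (z = 1.96), with p = 0.5 and a 5% margin of error, n_0 is about 385; the corrected n is about 80 for N = 100 but still only about 384 for N = 1,000,000. The sample size plateaus rather than growing with the dataset, which is exactly why a flat 10% rule stops making sense at scale.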
Now, when you say validate against the database: I am assuming the data you are giving has already been pulled from the database. So is the question how to validate the data that was input, or how to validate that the queries are accurate?
If it's the first, the only real way to do this is to have controls on the type of entry (i.e. string, date, and integer validation).
If it's the second, then you should evaluate it against some quantitative criteria. For example, if I am trying to validate that I sold 100 computers last month and I have hard receipts for those 100, then I can query to ensure that the report reflects the actual truth. Beyond that, you could have controls to make sure reports don't include duplicates, and so on and so forth, but that has more to do with general administration of input.
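A duplicate control can be as simple as the query below (table and column names are illustrative):

    -- Flag any order that appears more than once in the report extract.
    SELECT order_id, COUNT(*) AS occurrences
      FROM report_extract
     GROUP BY order_id
    HAVING COUNT(*) > 1;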
I have a database running now in which all the data in the phone number column of the "leads" table has been removed. I have created an updated CSV file that has most of the phone numbers present, in addition to the client name, email, and address.
How can I import the phone numbers into the phone number column by matching on the client name, email, or address, without affecting any other columns or rows?
This sounds like the perfect fit for an SSIS package! (This is assuming you are referring to SQL Server... since you didn't list an RDBMS, it is just a guess.)
Some SSIS package basics materials:
http://www.codeproject.com/Articles/155829/SQL-Server-Integration-Services-SSIS-Part-Basics
https://technet.microsoft.com/en-us/library/ms169917(v=sql.110).aspx
http://ssistutorial.blogspot.com/
SSIS is basically an ETL package development tool used with SQL Server that has countless options for moving data around. You would only need one data flow task inside SSIS to accomplish what you are after. I highly recommend reading up on some of the content above and giving it a shot!
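If it does turn out to be SQL Server, the core of the task boils down to an update-from-staging pattern once the CSV has been bulk-loaded into a staging table. A minimal sketch, with all table and column names made up:

    -- Fill in only the missing phone numbers, matching on client name + email;
    -- no other columns or rows are touched.
    UPDATE l
       SET l.phone_number = s.phone_number
      FROM dbo.leads AS l
      JOIN dbo.leads_staging AS s
        ON s.client_name = l.client_name
       AND s.email = l.email
     WHERE l.phone_number IS NULL;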
I have two datasets in my SSRS report: the first table contains 12,000 records and the second one 26,000 records, with 40 columns in each table.
While building the report, each time I go to Preview it takes forever to display.
Is there any way to avoid that, so I can at least not spend so much time building this report?
Thank you in advance.
Add a dummy parameter to limit your dataset, or just change your select to SELECT TOP 100 while building the report.
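A sketch of the dummy-parameter idea, assuming a SQL Server data source (the parameter and table names are made up): default @RowLimit to something small while designing, then raise it when you deploy.

    -- Return only the first @RowLimit rows while laying out the report.
    SELECT TOP (@RowLimit) *
      FROM dbo.my_report_source;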
#vercelli's answer is a good one. In addition, you can change your cache options in the designer (for all result sets, including parameters) so that the queries are not rerun each time.
This is really useful. Plus, a couple of tips for you:
1. I don't recommend caching until you are happy with your dataset results.
2. If you are using the cache and you want to do a quick refresh, the data is stored in a ".data" file in the same location as your .rdl. You can delete this to query the database again if required.
This is quite a long and complicated question; I will do my best to explain exactly what I need to do.
This applies to a flight department. Let's start with what I have: we use spreadsheets to track flight time, landings, and engine cycles. Currently we're using two spreadsheets; one is our "trip" sheet, and the other is our flight "log".
The trip sheet can be one to three worksheets long; it is used to track each flight flown during the trip. The trip could range from one flight (leg) up to 25 flights (legs), and could last from 1 day to 21 days. Each DAY of the trip is its own Log #, i.e. if there are 3 flights on one day, they all share the same Log #. The trip #'s are not in order: one trip could be #672, the next #264543, the next #689. The creation date is the only thing that could be used to put the trip workbooks in order.
The flight log is the FAA-required logbook for the aircraft. The Log #'s run in order, e.g. 459, 460, 461. A flight log is required for each day that the aircraft flies. Some, but not all, of the information from the trip sheet is required on the flight log. The most important thing is that the times, landings, and cycles calculate in order.
Now here is what I'm looking for. I'd like a spreadsheet that contains the three trip sheet worksheets as we have now, but when a flight (leg) is entered, it creates a 4th worksheet, which would be the flight log. Each leg flown that day would have its information transferred to that flight log. Then, when we fly on a NEW day, a 5th worksheet would be created for the new day's flight log. Times, landings, and cycle totals need to carry over from the previous day's flight log, along with the other information needed from the trip sheet, just like the previous log. And so on, and so on, until the end of the TRIP.
Now here's the REAL tricky part, when we start a new TRIP, and create a new workbook for that trip, I need the totals from the previous trip to transfer to the new workbook, so a legal, running total of aircraft times can be kept.
So basically, what I want to do, is take two separate workbooks for each trip we use now, and cram them into one, but each time a new trip workbook is created I need to go grab info from the LAST workbook created to keep a running total.
I'm new to this forum; if there's a way to attach a copy of the two workbooks we use now, please tell me. Looking at what we are using would probably make a lot of this clearer.
Thank you!!! PQ
It sounds like you have a working solution using Excel, which is very good. Oftentimes the biggest challenge is figuring out the process flow and all its branches. Further, it seems like you just want to make your solution more routine and sustainable to work with.
Although making a souped-up macro-enabled Excel document sounds like the right way to go, the features you are asking for are really more suited to a relational database. Not to say it can't be done, but implementing an Excel-based solution is going to be messy. The crucial difference, I believe, is maintaining the logical link between the trip sheet and the log sheet.
For example, if I understand correctly, you will have to create several Excel files, and they are going to need a naming convention in order for the computer to know which ones to look for. This exposes the data to the most basic mishaps like mistakenly renaming or moving a file. If you will be the only one to maintain the system, then perhaps that won't be an issue, but experience tells me that a lot of effort can be instantly undermined by something as simple as opening a file and editing it.
This also means that you will have to maintain a "builder" file that contains the code you develop. Not every machine is set up for macros, and a lot of end users will get scary notices that "this document contains macros which could be a danger... blah blah blah." Which means every output file should probably be macro-free.
Instead, I would recommend recreating your system using a relational database like MS Access. You can create an unlimited number of records/tables and use any number of variables to maintain the logical link (by log #, by date, by flight #, etc.). You can also set rules so data is recorded and reported in a consistent manner. And if you have the need and the programming expertise, VBA macros can also be introduced to an MS Access-based solution.
Lastly, all the data could easily be kept in one central *.mdb file, which would be far easier to maintain and back up than several overlapping Excel files.
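To make the structure concrete, here is a rough sketch in generic SQL (all names are illustrative, and the exact types vary by database; Access calls these Long Integer, Date/Time, and AutoNumber):

    -- One row per trip; trip numbers keep your existing (out-of-order) numbering.
    CREATE TABLE trips (
        trip_num    INTEGER PRIMARY KEY,
        created_on  DATE
    );

    -- One row per FAA flight log (one per flying day), tied to its trip.
    CREATE TABLE flight_logs (
        log_num   INTEGER PRIMARY KEY,   -- runs in order: 459, 460, 461...
        trip_num  INTEGER REFERENCES trips (trip_num),
        log_date  DATE
    );

    -- One row per leg, tied to the day's log. Running totals of times,
    -- landings, and cycles become SUM queries over this table instead of
    -- values copied between workbooks.
    CREATE TABLE legs (
        leg_id        INTEGER PRIMARY KEY,
        log_num       INTEGER REFERENCES flight_logs (log_num),
        flight_time   REAL,              -- hours flown on this leg
        landings      INTEGER,
        engine_cycles INTEGER
    );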
I was wondering if it is possible to work on a per-row basis in Kettle?
I am trying to implement a reporting scheme which consists of a table where the requests get queued for processing, and a Pentaho job that picks up the records from that table.
My job currently has 3 transformations in it:
1. Get the records from the queued requests table.
2. Analyze the values on each record and come up with multiple results based on that record. For example, a user would request records of movies in the horror genre; it should then spit out the horror movies.
3. Further retrieve information about the movies, such as the year, director, etc., which is to be output to an Excel file.
This is the idea, but it's a bit challenging doing it in Pentaho, as it does everything all at the same time. Is there a way that I can make my job work on records one by one?
EDIT.
Just to add: I have been trying to extend the implementation of the Pentaho cookbook sample, but compared to my design, it's like steps 2 and 3 only.
I can't seem to make the Table Input step work one row at a time.
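If I understand Kettle correctly, the usual pattern for this is to end the first transformation with a "Copy rows to result" step, then tick "Execute for every input row?" on the next transformation's job entry so it runs once per queued record, reading the values back in with "Get rows from result" (or as named parameters). I'm going from memory of the PDI docs here, so treat it as a pointer rather than a recipe.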
I just made it act like the implementation in the cookbook, with some adjustments to it. Instead of using two transformations to gather all the necessary fields, I retrieved all the information I need in one transformation.
Then I copied that information to the next steps, ran some queries to complete the information, and it is now working.
Passing parameters between transformations is a bit confusing: there are parameters to be set on the transformation itself and also on the job where the transformations live, so I kind of went guessing for some time just to make it work.