We have a manually partitioned "video metadata" table that is fed fresh data each day. In our system, old data is kept only for historical reasons, since the latest partition is always the most up-to-date.
What we can't figure out is how to reference only the latest partition of this table in LookML.
So far we have attempted to store views in BigQuery. We tried, and failed, to save a simple "fetch the newest partition" query as a view, in both standard and legacy SQL. After some searching, this appears to be by design, even though the error message says "Dataset not found" rather than something more relevant.
We've also tried to build the filtering into Looker, but we're having trouble getting it to work so that only the latest data is returned.
Any help would be appreciated.
We've managed to find a solution: derived tables.
We figured that since we couldn't define a view on BigQuery's side, we could do it on Looker's side instead, so we defined the table in a derived table block inside a view.
derived_table: {
  sql: SELECT *
       FROM `dataset.table_*`
       WHERE _TABLE_SUFFIX = (
         SELECT MAX(_TABLE_SUFFIX) FROM `dataset.table_*`
       ) ;;
  sql_trigger_value: SELECT MAX(_TABLE_SUFFIX) FROM `dataset.table_*` ;;
}
This gave us a view with just the newest data in it.
Related
I have seen some other posts about this, but have not found an answer that permanently works.
We have a table, and I had to add two columns to it. In order to do so, I dropped the table and recreated it. But since it was an external table, it did not drop the associated data files. The data gets loaded from a control file and is partitioned by date. So let's say the dates that were in the table were 2021-01-01 and 2021-01-02, but only 2021-01-02 is in the control file. When I load that date, it gets re-run with the new columns and everything is fine, but 2021-01-01 is still there with a different schema.
This is no issue in Hive, as it seems to default to resolving columns by name, not position. But Impala resolves by position, so the new columns throw it off.
If I have a table that previously had the columns c1, c2, c3 and now also has c4 and c5, and I try to run a query such as
select * from my_table where c5 is null limit 3;
This will give an "incompatible Parquet schema" error in Impala (Hive is fine; it just returns NULL for c4 and c5 for the date 2021-01-01).
If I run the command set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name; and then the above query again, it is fine. But I would have to run set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name; at the beginning of each session, which is not ideal.
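For reference, the per-session workaround looks like this, run together at the start of each Impala session (both statements are the ones already mentioned above):

set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;
select * from my_table where c5 is null limit 3;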
From searching online, I have come up with a few solutions:
Drop all data files when creating the new table and just start loading from scratch (I think we want to keep the old files)
Re-load each date (this might not be ideal as there could be many, many dates that would have to be re-loaded and overwritten)
Change the setting permanently in Cloudera Manager (I do not have access to CM and don't know how feasible it would be to change it)
Are there any other solutions to have it so I don't have to run set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name; each time I want to use this table in Impala?
It is possible to return nested results (RECORD type) if the noflatten_results flag is specified, but is it possible to just view them on screen without writing them to a table first?
For example, here is a simple user table (my actual table is quite large: 400+ columns with multiple levels of nesting):
ID,
name: {first, last}
I want to view the record for a particular user and display it in my application, so my query is
SELECT * FROM dataset.user WHERE id=423421 limit 1
Is it possible to return the result directly?
You should write your output to a "temp" table with the noflatten_results option (and set an appropriate expiration so the table is purged after it is used), then serve your client out of this temp table. All of this can happen "on the fly".
Keep in mind that no matter how small the "temp" table is, if you query it (in the second step above) you will be billed for at least 10 MB, so you are better off using the Tabledata.list API for this step (https://cloud.google.com/bigquery/docs/reference/v2/tabledata/list), which is free!
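A minimal sketch of the two-step idea, assuming the legacy SQL dialect and that the query job is configured with a destination ("temp") table, allowLargeResults enabled and flattenResults disabled (the table and id are the ones from the example above):

-- Step 1: run with a destination table set on the query job,
-- allowLargeResults = true and flattenResults = false, so the RECORD stays nested
SELECT *
FROM [dataset.user]
WHERE id = 423421
LIMIT 1

-- Step 2: read the destination table back via Tabledata.list (free)
-- rather than issuing a second, billed query against it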
So if you try to get repeated records, the query will fail in the interface/BQ console with the error:
Error: Cannot output multiple independently repeated fields at the same time.
The way to get past this error is to FLATTEN your output.
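For example, flattening one of the repeated fields in legacy SQL, here assuming name were a repeated RECORD in the user table above, lets the result be displayed directly:

SELECT id, name.first, name.last
FROM FLATTEN([dataset.user], name)
WHERE id = 423421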
On a small pre-registration database, I'm using the following SQL to generate a VIEW whenever a specific user name is given. I'm using this mainly to get a snapshot whenever a sysadmin suspects duplicate names are registering. This will be done rarely (at most once per hour), so the database schema shouldn't grow excessively.
CREATE OR REPLACE TRIGGER upd_on_su_entry
AFTER UPDATE OR INSERT
ON PRE_REG_MEMBER
FOR EACH ROW
BEGIN
IF :new.MEMBER_NAME = 'SysAdmin Dup Tester' THEN
EXECUTE IMMEDIATE 'CREATE OR REPLACE VIEW mem_n AS SELECT :old.MEMBER_NAME, COUNT(:old.MEMBER_NAME) FROM MEMBER GROUP BY MEMBER_NAME';
END IF;
END;
However, this appears to be a bloated, inefficient and erroneous way of working (according to my admin). Is there a fundamental error here? Can I take an equivalent snapshot some other way?
I'm very new to SQL, so please bear with me.
Also, I want to use the view like this:
public void dups() throws Exception
{
    Calendar cal = Calendar.getInstance();
    jt.setText("Duplicate List at : " + cal.getTime());
    try {
        rs = stat.executeQuery("select * from upd_on_su_entry");
        while (rs.next())
        {
            jt.append(rs.getString("MEMBER_NAME") + "\t");
            jt.append(rs.getString(2) + "\t");
        }
    }
    catch (Exception e) { System.out.print("\n" + e); }
}
There seem to be some misunderstandings here.
1.) Views are basically stored SQL statements, not stored SQL results, so your view will always display the data as it is at the time you query the view.
2.) Never use DDL (CREATE statements) and the like during the normal processing of an application. It's just not the way databases are intended to work.
If you want a snapshot at a point in time, create a secondary table that contains all the columns of the original table plus a snapshot timestamp.
Whenever you want to take a snapshot, copy all the data you want from the original table into the snapshot table, adding the current timestamp.
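A minimal sketch of that idea in Oracle SQL, using hypothetical names (member_snapshot, snapshot_ts); adjust the column list to whatever you actually need:

-- one-time setup: same columns as the original table plus a snapshot timestamp
CREATE TABLE member_snapshot AS
SELECT m.*, SYSDATE AS snapshot_ts
FROM PRE_REG_MEMBER m
WHERE 1 = 0;   -- copies the structure only, no rows

-- run this whenever a snapshot is wanted
INSERT INTO member_snapshot
SELECT m.*, SYSDATE
FROM PRE_REG_MEMBER m;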
Based on your comment, it sounds like you want something like this:
SELECT MEMBER_NAME FROM PRE_REG_MEMBER
GROUP BY MEMBER_NAME HAVING COUNT(*) > 1;
This will return all member names that appear more than once in the table.
Again, ask yourself "what am I trying to do"?
Don't focus on the technology. Do that after you have a clear idea of what your goal is.
If you are just trying to avoid duplicate registrations on your database, just search the users table and show an error if that username is already there.
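For example, a check like this before inserting the new row (the bind variable :candidate_name is only illustrative):

SELECT COUNT(*)
FROM PRE_REG_MEMBER
WHERE MEMBER_NAME = :candidate_name;   -- if this is greater than 0, reject the registration

A unique constraint on MEMBER_NAME would let the database enforce the same rule for you.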
Also, think of your datamodel carefully before going into the implementation details.
All,
I have a package that I'm building as a data importer so I can copy sets of data from my production environment and develop on another instance.
I have two tables that contain header and detail rows for service tickets. Those service tickets are tied back to orders.
I am pulling the service tickets from a certain time window, however, the originating orders fall outside of the date range that I'm pulling for the tickets.
I want to be able to take the following steps in an SSIS package:
Import the header and detail rows within the given date range from prod to dev
Select the relevant order numbers from dev tables
Use the list of order numbers to import only the relevant orders from prod
I poked through other answers and couldn't find one that addressed this directly, so I apologize if there is an answer out there that I missed; I may not have been asking the question correctly. I'm assuming that I would need to pull those order numbers into a temp table or variable in order to apply them as a filter.
As I write this, it crossed my mind to join the ticket and order tables on the source system and still use the date range to limit the rows, but I'm still posting the question to see if anyone has dealt with this before.
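Roughly what I have in mind, with made-up table and column names and parameter placeholders for the date range:

SELECT o.*
FROM OrderHeader AS o
JOIN ServiceTicketHeader AS t
    ON t.OrderNumber = o.OrderNumber
WHERE t.TicketDate >= ?   -- start of the ticket window
  AND t.TicketDate <  ?;  -- end of the ticket window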
Your steps are already fairly clear; are you asking how to actually implement them? It looks like you can do all three steps by using SELECT statements in your data sources:
Build a SELECT statement dynamically with the correct dates to use in your data source. The dates could be programmatically generated in a script task, or saved in a database table and populated into variables. Then you copy the data across to the dev system.
Run a SELECT statement in the dev system that returns the order numbers, and copy the results to a table in the prod database.
Run a SELECT statement in the prod database that joins on the table from step 2 and copy the results back to dev.
An alternative to the table in steps 2 and 3 would be a lookup transformation, but if you have a large number of rows then using a table will probably be faster.
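For example, a minimal sketch of the step 3 query, assuming the order numbers from step 2 were copied into a staging table called DevOrderNumbers in the prod database (all names here are illustrative):

SELECT o.*
FROM Orders AS o
JOIN DevOrderNumbers AS n
    ON n.OrderNumber = o.OrderNumber;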
I'm developing an app which requires the user to select a year range formatted like 1992-1993 from a spinner. The table is also named 1992-1993, and the idea is that the values are pulled from this table with a statement like select * from 1992-1993. When I run the emulator, however, it throws an error.
If I then relabel the Spinner item to NinetyTwo and rename the table to NinetyTwo and run the emulator it runs as expected and the data is pulled through from the table.
Does SQLite have an issue with numbers in table names?
Yes and no. It has an issue with numbers at the beginning of a table name: 1992-1993 is parsed as an arithmetic expression that returns -1. Try renaming the table to something like Year1992.
Here is a similar issue on SO.
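If you really want to keep the year range as the name, SQLite will also accept it if the identifier is quoted; a minimal sketch:

SELECT * FROM "1992-1993";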
Sorry for the late post.
There may be a deeper issue here: is the structure you are using (a table per spinner item) the best one for the job?
You may find that you want a small number of fixed tables instead, e.g.
spinner_value (id, value)
form_data(id, spinner_value_id, etc ....)
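A minimal sketch of that structure as SQLite DDL (the names are the ones suggested above; the data column is just a placeholder for whatever the per-year tables used to hold):

CREATE TABLE spinner_value (
    id    INTEGER PRIMARY KEY,
    value TEXT NOT NULL                 -- e.g. '1992-1993', shown in the spinner
);

CREATE TABLE form_data (
    id               INTEGER PRIMARY KEY,
    spinner_value_id INTEGER NOT NULL REFERENCES spinner_value(id),
    data             TEXT               -- placeholder for the real per-form columns
);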