I wanted to reach out to ask if there is a practical way of finding out a given table's structure/schema, e.g. the column names and some example rows inserted into the table (like the head function in Python), if you only have the table name. I have access to several tables in my current role, but the person who developed the tables has left the team. I was interested in examining the tables more closely via SQL Assistant in Teradata (these tables often contain hundreds of thousands of rows, so there are issues with hitting CPU exception criteria errors).
I have tried the following SELECT statement, but I keep hitting internal CPU exception criteria limits.
SELECT TOP 10 * FROM dbc.table1
Thank you in advance for any tips/advice!
You can use one of these commands to get a table's structure details in Teradata:
SHOW TABLE Database_Name.Table_Name;
or
HELP TABLE Database_Name.Table_Name;
Both show the table structure details.
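For example, against a hypothetical SalesDb.Orders table (the name is purely an illustration), SHOW TABLE returns the full CREATE TABLE DDL and HELP TABLE lists the columns and their types. If you also want a few example rows (like head() in Python), Teradata's SAMPLE clause may be worth trying if TOP keeps tripping the CPU limit:

-- Structure: returns the CREATE TABLE statement for the table
SHOW TABLE SalesDb.Orders;

-- Column names, types, and comments
HELP TABLE SalesDb.Orders;

-- A few example rows (randomly sampled), analogous to head()
SELECT * FROM SalesDb.Orders SAMPLE 10;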
Problem statement:
I need to insert/update a few columns in a BigQuery table that is partitioned by date. So basically I need to make the necessary changes for each partitioned date (partitioned by day).
(It's the sessions table that is created automatically by linking the GA view to BQ, so I haven't done the partitioning manually; it's automatically taken care of by Google.)
query reference from google_docs
My query:
I also tried the below:
Can anyone help me here? Sorry, I am a bit naive with BQ.
You are trying to insert into a wildcard table, a meta-table that is actually composed of multiple tables. A wildcard table is read-only and cannot be inserted into.
As Hua said, ga_sessions_* is not a partitioned table, but represents many tables, each with a different suffix.
You probably want to do this then:
INSERT INTO `p.d.ga_sessions_20191125` (visitNumber, visitId)
SELECT 1, 1574
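If the goal is to update existing columns rather than insert rows, the same idea applies: target one concrete suffix table at a time with standard DML. A hedged sketch (the date, column, and values are purely illustrative):

UPDATE `p.d.ga_sessions_20191125`
SET channelGrouping = 'Direct'
WHERE visitId = 1574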
I am migrating data from CSV files of SQL tables (one per table) to a Cassandra database that uses a pre-determined, standardized format. As a result, I am doing transformations, joins, etc. on the SQL data to get it matching this format before writing it to Cassandra. My issue is that this DB migration is happening in batches (not all at once), and I cannot ensure that information from both sides of a table join will be present when an entry is written to Cassandra.
Example:
Table 1 and Table 2 both have the partitioning and clustering keys (allowing the join, since their combination is unique) and are joined using a full outer join. With the way we are being given data, however, there is a chance that we could get a record from Table 1 but not from Table 2 in a given "batch" of data. When I perform the full outer join, no problems: extra columns from the other table are added and just filled with nulls. In a later batch, I then receive the Table 2 portion that should previously have been joined to Table 1.
How do I get those entries combined?
I have looked for an update-or-insert type method in Spark, depending on whether that set of partitioning and clustering keys already exists, but have not turned up anything. Is this the most efficient way? Will I just have to look up every entry with a spark.sql query and then update/write?
Note: using UUIDs to avoid the primary key conflict will not solve the issue; I do not want two partial entries. All data with that particular primary key needs to end up in the same row.
Thanks for any help that you can provide!
I think you should be able to just write the data directly to Cassandra and not have to worry about it, assuming all primary keys are the same.
Cassandra's inserts are really "insert or update" so I believe when you insert one side of a join, it will just leave some columns empty. Then when you insert the other side of the join, it will update that row with the new columns.
Take this with a grain of salt, as I don't have a Spark+Cassandra cluster available to test and make sure.
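Here is a tiny CQL sketch of that upsert behaviour (the keyspace, table, and columns are made up): two inserts that share the same primary key end up merged into a single row.

-- Hypothetical table: (part_key, cluster_key) is the primary key
CREATE TABLE demo.merged (
    part_key    int,
    cluster_key int,
    col_from_t1 text,
    col_from_t2 text,
    PRIMARY KEY (part_key, cluster_key)
);

-- Batch 1: only the Table 1 side is known, so col_from_t2 stays null
INSERT INTO demo.merged (part_key, cluster_key, col_from_t1)
VALUES (1, 10, 'from table 1');

-- Batch 2: the Table 2 side arrives later with the same key;
-- Cassandra treats this as an update of the existing row
INSERT INTO demo.merged (part_key, cluster_key, col_from_t2)
VALUES (1, 10, 'from table 2');

-- The row now has both columns populated
SELECT * FROM demo.merged WHERE part_key = 1 AND cluster_key = 10;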
I am currently writing an application that needs to be able to select a subset of IDs from millions of users...
I am currently writing software to select a group of 100,000 IDs from a table that contains the whole Brazilian population of 200,000,000 (200M) people, and I need to be able to do this in a reasonable amount of time. ID on the table = ID in the XML.
I am thinking of parsing the XML file and starting a thread that performs a SELECT statement on the database; I would need a connection for each thread. Still, this seems like a brute-force approach; perhaps there is a more elegant way?
1) What is the best database to do this with?
2) What is a reasonable limit on the number of DB connections?
Making 100,000 queries would take a long time, and splitting the work across separate threads won't help you much, as you are reading from the same table.
Don't get a single record at a time; rather, divide the 100,000 items up into reasonably small batches, for example 1,000 items each, which you can send to the database. Create a temporary table in the database with those ID values, and join it against the database table to get those records.
Using MS SQL Server for example, you can send a batch of items as an XML to a stored procedure, which can create the temporary table from that and query the database table.
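A minimal T-SQL sketch of that idea (the procedure, table, and column names are assumptions, not anything from the question): the client sends a batch of IDs as an XML parameter, and the procedure shreds it into a temp table and joins against the big table.

-- The application calls this with e.g. <ids><id>123</id><id>456</id></ids>
CREATE PROCEDURE dbo.GetUsersByIdBatch
    @Ids XML
AS
BEGIN
    SET NOCOUNT ON;

    -- Shred the XML batch into a temp table of IDs
    SELECT x.id.value('.', 'BIGINT') AS UserId
    INTO #BatchIds
    FROM @Ids.nodes('/ids/id') AS x(id);

    -- Join against the 200M-row table; an index on Users.Id is assumed
    SELECT u.*
    FROM dbo.Users AS u
    JOIN #BatchIds AS b ON b.UserId = u.Id;
END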
Any modern DBMS that can handle an existing 200M-row table should have no problem comparing it against a 100K-row table (assuming your hardware is up to scratch).
Ideal solution: import your XML (at least the IDs) into a new table, ensure the columns you're comparing are indexed correctly, and then query.
What language? If you're using .NET, you could load your XML and SQL as data sources, and then I believe there are some enumerable functions that could be used to compare the data.
Do this:
Parse the XML and store the extracted IDs into a temporary table [1].
From the main table, select only the rows whose ID is also present in the temporary table:
SELECT * FROM MAIN_TABLE WHERE ID IN (SELECT ID FROM TEMPORARY_TABLE)
A decent DBMS will typically do the job quicker than you can, even if you employed batching/chunking and parallelization on your end.
[1] Temporary tables are typically created using CREATE [GLOBAL|LOCAL] TEMPORARY TABLE ... syntax and you'll probably want it private for the session (check your DBMS's interpretation of GLOBAL vs. LOCAL for this). If your DBMS of choice doesn't support temporary tables, you can use "normal" tables instead, but be careful not to let concurrent sessions mess with that table while you are still using it.
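A rough sketch of steps 1 and 2 under those caveats (all names are illustrative, and the exact temporary-table syntax depends on your DBMS):

-- Session-private staging table for the IDs parsed from the XML
CREATE GLOBAL TEMPORARY TABLE temp_ids (
    id BIGINT NOT NULL PRIMARY KEY
) ON COMMIT PRESERVE ROWS;

-- Populate it from the parsed XML, ideally with large batched inserts
INSERT INTO temp_ids (id) VALUES (12345);

-- Fetch only the matching rows from the 200M-row main table
SELECT m.*
FROM main_table m
JOIN temp_ids t ON t.id = m.id;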
Working on a project at the moment, and we have to implement soft deletion for the majority of users (user roles). We decided to add an is_deleted='0' field on each table in the database and set it to '1' if particular user roles hit a delete button on a specific record.
For future maintenance, each SELECT query will need to ensure it does not include records where is_deleted='1'.
Is there a better solution for implementing soft deletion?
Update: I should also note that we have an Audit database that tracks changes (field, old value, new value, time, user, ip) to all tables/fields within the Application database.
I would lean towards a deleted_at column that contains the datetime of when the deletion took place. Then you get a little bit of free metadata about the deletion. For your SELECTs, just get rows WHERE deleted_at IS NULL.
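A small sketch of that (table and column names are assumptions):

-- Nullable timestamp instead of a flag; the deletion time comes for free
ALTER TABLE users ADD deleted_at TIMESTAMP NULL;

-- "Delete" a record
UPDATE users SET deleted_at = CURRENT_TIMESTAMP WHERE id = 42;

-- Normal reads only see live rows
SELECT * FROM users WHERE deleted_at IS NULL;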
You could perform all of your queries against a view that contains the WHERE IS_DELETED='0' clause.
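For example (view and table names are assumptions), something along these lines, and then point the application at the view instead of the table:

CREATE VIEW active_records AS
SELECT *
FROM records
WHERE is_deleted = '0';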
Having an is_deleted column is a reasonably good approach.
If it is Oracle, then to further increase performance I'd recommend partitioning the table by creating a list partition on the is_deleted column.
Deleted and non-deleted rows will then physically be in different partitions, though for you it'll be transparent.
As a result, if you type a query like
SELECT * FROM table_name WHERE is_deleted = 1
then Oracle will perform 'partition pruning' and only look into the appropriate partition. Internally a partition is a different table, but it is transparent to you as a user: you'll be able to select across the entire table whether it is partitioned or not. But Oracle will be able to query ONLY the partition it needs. For example, let's assume you have 1,000 rows with is_deleted = 0 and 100,000 rows with is_deleted = 1, and you partition the table on is_deleted. Now if you include the condition
WHERE ... AND IS_DELETED=0
then Oracle will ONLY scan the partition with 1,000 rows. If the table weren't partitioned, it would have to scan 101,000 rows (both partitions).
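A hedged Oracle sketch of that layout (table and partition names are illustrative):

CREATE TABLE orders (
    id         NUMBER PRIMARY KEY,
    payload    VARCHAR2(4000),
    is_deleted NUMBER(1) DEFAULT 0 NOT NULL
)
PARTITION BY LIST (is_deleted) (
    PARTITION p_live    VALUES (0),
    PARTITION p_deleted VALUES (1)
);

-- Partition pruning: only p_live is scanned here
SELECT * FROM orders WHERE is_deleted = 0;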
The best response, sadly, depends on what you're trying to accomplish with your soft deletions and the database you are implementing this within.
In SQL Server, the best solution would be to use a deleted_on/deleted_at column with a type of SMALLDATETIME or DATETIME (depending on the necessary granularity) and to make that column nullable. In SQL Server, the row header data contains a NULL bitmask for each of the columns in the table, so it's marginally faster to perform an IS NULL or IS NOT NULL check than it is to check the value stored in a column.
If you have a large volume of data, you will want to look into partitioning your data, either through the database itself or through two separate tables (e.g. Products and ProductHistory) or through an indexed view.
I typically avoid flag fields like is_deleted, is_archive, etc. because they only carry one piece of meaning. A nullable deleted_at or archived_at field provides an additional level of meaning, both to yourself and to whoever inherits your application. And I avoid bitmask fields like the plague, since they require an understanding of how the bitmask was built in order to grasp any meaning.
If the table is large and performance is an issue, you can always move 'deleted' records to another table, which can carry additional info like the time of deletion, who deleted the record, etc.
That way you don't have to add another column to your primary table.
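A sketch of that move (all names are assumptions), ideally wrapped in a single transaction:

-- Copy the row into the archive table together with some deletion metadata
INSERT INTO users_deleted (id, name, email, deleted_at, deleted_by)
SELECT id, name, email, CURRENT_TIMESTAMP, 'jsmith'
FROM users
WHERE id = 42;

-- Then remove it from the primary table
DELETE FROM users WHERE id = 42;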
That depends on what information you need and what workflows you want to support.
Do you want to be able to:
know what information was there (before it was deleted)?
know when it was deleted?
know who deleted it?
know in what capacity they were acting when they deleted it?
be able to un-delete the record?
be able to tell when it was un-deleted?
etc.
If the record was deleted and un-deleted four times, is it sufficient for you to know that it is currently in an un-deleted state, or do you want to be able to tell what happened in the interim (including any edits between successive deletions!)?
Be careful of soft-deleted records causing uniqueness constraint violations.
If your DB has columns with unique constraints then be careful that the prior soft-deleted records don’t prevent you from recreating the record.
Think of the cycle:
Create user (login=JOE).
Soft-delete (set the deleted column to non-null).
(Re)create user (login=JOE). ERROR: login=JOE is already taken.
The second create results in a constraint violation because login=JOE is already present in the soft-deleted row.
Some techniques:
1. Move the deleted record to a new table.
2. Make your uniqueness constraint span both the login and deleted_at timestamp columns (a sketch follows below).
My own opinion is +1 for moving to a new table. It takes a lot of discipline to maintain the AND deleted_at IS NULL across all your queries (for all of your developers).
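A hedged sketch of technique 2 (names are assumptions). Note that how NULLs count toward uniqueness varies by DBMS, so in PostgreSQL a partial unique index is often the cleaner way to enforce "only one live row per login":

-- Widen the unique key to include the deletion timestamp
ALTER TABLE users
    ADD CONSTRAINT uq_users_login_deleted_at UNIQUE (login, deleted_at);

-- PostgreSQL alternative: uniqueness only among live (not soft-deleted) rows
CREATE UNIQUE INDEX uq_users_login_live
    ON users (login)
    WHERE deleted_at IS NULL;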
You will definitely have better performance if you move your deleted data to another table like Jim said, as well as having a record of when it was deleted, why, and by whom.
Adding WHERE deleted=0 to all your queries will slow them down significantly, and hinder the usage of any indexes you may have on the table. Avoid having "flags" in your tables whenever possible.
You don't mention what product, but SQL Server 2008 and PostgreSQL (and others, I'm sure) allow you to create filtered indexes, so you could create a covering index where is_deleted=0, mitigating some of the negatives of this particular approach.
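For instance (table and column names are assumptions), the filtered/partial index syntax is similar in both products:

-- Index only the live rows, so queries with WHERE is_deleted = 0 stay cheap
CREATE INDEX ix_orders_live
    ON orders (customer_id, created_at)
    WHERE is_deleted = 0;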
Something that I use on projects is a statusInd TINYINT NOT NULL DEFAULT 0 column.
Using statusInd as a bitmask allows me to perform data management (delete, archive, replicate, restore, etc.). Using this in views, I can then do the data distribution, publishing, etc. for the consuming applications. If performance is a concern regarding views, use small fact tables to support this information; dropping the fact drops the relation and allows for scaled deletes.
This scales well and is data-centric, keeping the data footprint pretty small, which is key for 350 GB+ DBs with real-time concerns. Using alternatives such as tables and triggers has some overhead that, depending on the need, may or may not work for you.
SOX-related audits may require more than a field to help in your case, but this may help.
Enjoy
Use a view, function, or procedure that checks is_deleted = 0; i.e. don't select directly on the table in case the table needs to change later for other reasons.
And index the is_deleted column for larger tables.
Since you already have an audit trail, tracking the deletion date is redundant.
I prefer to keep a status column, so I can use it for several different configs, e.g. published, private, deleted, needsApproval...
Create another schema and grant it ALL on your data schema.
Implement VPD on your new schema so that each and every query will have a predicate appended to it that allows selection of the non-deleted rows only.
http://download.oracle.com/docs/cd/E11882_01/server.112/e16508/cmntopc.htm#CNCPT62345
@AdditionalCriteria("this.status <> 'deleted'")
Put this on top of your @Entity.
http://wiki.eclipse.org/EclipseLink/Examples/JPA/SoftDelete