Bad data performance issue in Hive

Currently I am facing a bad data issue.
For example, data in a Hive table with
columns: country, state, customer_name
There is a typo error in the state column,
i.e. TN was typed as TM.
Kindly help me with how to overcome this issue by clearing out the bad data.

I recommend loading the data into a temp table first and then loading the main table with cross-validation against a state reference table, i.e. keep only the rows from temp_tbl whose state exists in STATE_TBL.
This way the program will not fail, and the erroneous rows can be captured in a separate record set or file.
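A sketch of that idea (main_tbl, state_tbl, and error_tbl are assumed names, not from the question):

-- Load only rows whose state exists in the reference table
INSERT INTO TABLE main_tbl
SELECT t.*
FROM temp_tbl t
WHERE EXISTS (SELECT 1 FROM state_tbl s WHERE s.state = t.state);

-- Capture the rejected rows for later review or correction
INSERT INTO TABLE error_tbl
SELECT t.*
FROM temp_tbl t
WHERE NOT EXISTS (SELECT 1 FROM state_tbl s WHERE s.state = t.state);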

ClickHouse deleting data with WHERE

I am trying to delete some data using WHERE. Note that I need 2 tables in order to identify the rows that should be deleted, so I need to join them. I thought of something like:
ALTER TABLE sample_db.test_first_table DELETE WHERE
(
    SELECT ft.value
    FROM sample_db.test_first_table ft
    JOIN sample_db.test_second_table st ON (ft.value = st.value)
    WHERE `EXPRESSION HERE`
)
I understood that this ALTER operation is a mutation, so when checking the system.mutations table I see this fail reason: Code: 125, e.displayText() = DB::Exception: Scalar subquery returned more than one row
I checked that the expression I am writing is fine with a simple SELECT statement, so I am out of ideas on how to delete multiple rows based on an expression. Any help is much appreciated.
First of all: MUTATIONS are admin operations. They are not meant to be used on a daily basis.
ALTER TABLE sample_db.test_first_table DELETE WHERE
value IN
(
    SELECT value
    FROM sample_db.test_second_table
    WHERE `EXPRESSION HERE`
)
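To confirm the rewritten mutation actually finished, you can query system.mutations again (the database and table names are taken from the question):

SELECT mutation_id, command, is_done, latest_fail_reason
FROM system.mutations
WHERE database = 'sample_db' AND table = 'test_first_table'

A row with is_done = 1 and an empty latest_fail_reason means the delete has been fully applied.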

Import a single column dataset (CSV or TXT or XLSX) to act as a list in SQL WHERE IN clause

I have a dataset that I receive on a weekly basis; it is a single column of unique identifiers. Currently this dataset is gathered manually by our support staff. I am trying to query this dataset (CSV file) in the WHERE clause of a SQL query.
In order to add this dataset to my query I do some data transformation to tweak the formatting; the reformatted data is then pasted directly into the WHERE IN part of my query. Ideally I would have the ability to import this list into the SQL query directly, potentially bypassing the manual effort involved in the data formatting and swapping between programs.
I am just wondering if this is possible. I have tried my best to scour the internet and have had no luck finding any reference to this functionality.
Using WHERE IN makes this more complex than it needs to be. Store the IDs you want to filter on in a table called MyTableFilters, with a column of the ID values you want to use as filter(s), and join from MyTable on ID to MyTableFilters on ID. The join will cause MyTable to return only rows whose ID is also in MyTableFilters:
SELECT * FROM MyTable A JOIN MyTableFilters F ON A.ID = F.ID
Since you don't really need any transformations or data manipulation of what you want to ETL, you could also easily truncate and use BULK INSERT to keep MyTableFilters up to date:
TRUNCATE TABLE dbo.MyTableFilters

BULK INSERT dbo.MyTableFilters
FROM 'X:\MyFilterTableIDSourceFile.csv'
WITH
(
    FIRSTROW = 1,
    DATAFILETYPE = 'widechar', -- UTF-16
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    TABLOCK,
    KEEPNULLS -- Treat empty fields as NULLs.
)
I'm guessing that you currently have something like the following:
SELECT *
FROM MyTable t
WHERE t.UniqueID in ('ID12','ID345','ID84')
My recommendation would be to create a table in which to store the IDs referenced in the WHERE clause. So for the above, your table would look like this:
UniqueID
========
ID12
ID345
ID84
Supposing the table is called UniqueIDs, the original query then becomes:
SELECT *
FROM MyTable t
WHERE t.UniqueID in (SELECT u.UniqueID FROM UniqueIDs u)
The question you're asking is then how to populate the UniqueIDs table. You need some means to expose that table to your users. There are several ways you could go about that. A lazy but relatively effective solution would be a simple MS Access database with that table as a "linked" table. You may need to be careful about permissions.
Alternatively, assuming you're wedded to the CSV, set up an SSIS job which clears down the table and then imports from that CSV into the UniqueIDs table.

Data manipulation logic: part of the fetch via DB link, or an independent routine?

The question may not be detailed enough to give full insight.
I have two DB instances, A and B, on the same server.
B reads data from several tables in A (A1, A2, A3, ...) via DB link and maintains a history of the data in replicated tables (A1_ext, A2_ext, A3_ext; they have additional columns, assume a status column). That is, if a new row is added in A1, a row is created in A1_ext with status VALID; if a row is updated in A1, the existing row in A1_ext is set to INVALID and a new row holding the latest data from A1 is created in A1_ext with status VALID.
For now the implemented logic is: read data from A1 via DB link, check whether it exists in A1_ext; if it does, delimit the existing row and create a new one.
Is this an efficient approach?
Or should it instead read all updated data from A1, pull it over in one go (bulk collect, say) into a new staging table A1_stag on instance B, and then run the update/insert logic on A1_ext?
The best I can come up with is something like the following:
-- Insert the new and changed records with status NEW
insert into A1_ext (id, val, status)
with upsert as (
    select id, val from A1@RemoteDB
    minus
    select id, val from A1_ext where status = 'VALID'
) select id, val, 'NEW' from upsert;
-- Update the old VALID records that have NEW records to INVALID
update A1_ext old
set status = 'INVALID'
where status = 'VALID'
and exists (select 1 from A1_ext new
            where new.id = old.id
            and new.status = 'NEW');
-- Update all NEW records to VALID
update A1_ext set status = 'VALID' where status = 'NEW';
Unfortunately, the first query is going to do a full table scan on A1@RemoteDB and transmit all of that data across the database link. That is possibly not a big deal when both DBs reside on the same server, but it could be a performance problem for large tables across a network. The MINUS operation will prune away the unchanged records after they have crossed the link but before they get into the *_EXT table. If you can reliably filter the source data down to just the new and updated records, that would help limit the amount of data crossing the DB link.
The 2nd and 3rd queries are just housekeeping to mark updated records as invalid and to mark the new data as valid.
If possible keep this as pure SQL and avoid context switching between SQL and PL/SQL as much as possible.
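As a sketch of that filtering suggestion, assume (hypothetically; neither column is in the question) that A1 carries a last_updated timestamp and that instance B logs a high-water mark for each load:

-- Hypothetical pre-filter: only ship rows changed since the last load
insert into A1_ext (id, val, status)
with changed as (
    select id, val from A1@RemoteDB
    where last_updated > (select max(loaded_at) from a1_load_log) -- assumed columns
    minus
    select id, val from A1_ext where status = 'VALID'
) select id, val, 'NEW' from changed;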

Checking whether a set of data exists in two different tables

I'm new to Oracle and SQL, but I was assigned this job and I hope someone can help me out with this one.
Basically, I am given a database link to connect to a remote database; I extract some information from a single table there and a few other tables from a local database, then process it and insert it into a table in the local database. I've managed to do this successfully, but now I need a way to confirm that all of the data from the remote database was actually copied into the local database. How would I go about doing this?
This is the code I have to insert the information into my local DB.
INSERT INTO kcrt_requests_int RI (
    RI.TRANSACTION_ID,
    RI.DESCRIPTION,
    RI.CREATED_USERNAME,
    RI.REQUEST_TYPE_ID,
    RI.STATUS_ID,
    RI.WORKFLOW_ID,
    RI.WORKFLOW_STEP_ID,
    RI.RELEASED_FLAG,
    RI.USER_DATA1,
    RI.USER_DATA2,
    RI.USER_DATA3,
    RI.USER_DATA4,
    RI.USER_DATA7)
SELECT
    KCRT_TRANSACTIONS_S.NEXTVAL,
    RD.PARAMETER13 || ' ' || R.DESCRIPTION,
    '[SYS.USERNAME]',
    '0001',
    '31876',
    '34987',
    '1234',
    'Y',
    PP.PROJECT_ID,
    VP.REVIEWDATE,
    RD.PARAMETER9,
    R.REQUEST_ID,
    RD.PARAMETER13
FROM
    KCRT_REQUEST_TYPES_NLS RT,
    KCRT_REQUESTS R,
    KCRT_REQUEST_DETAILS RD,
    v_projects@XXXXX VP,
    PM_PROJECTS PP
WHERE
    R.REQUEST_TYPE = RT.REQUEST_TYPE_ID
    AND R.REQUEST_ID = RD.REQUEST_ID
    AND RD.BATCH_NUMBER = 1
    AND RT.REQUEST_TYPE_NAME = 'AAAAA'
    AND R.STATUS_CODE = 'BBBBB'
    AND RD.PARAMETER13 = TO_CHAR(VP.IDBANK)
    AND VP.REVIEWDATE = (SELECT MAX(VP2.REVIEWDATE) FROM v_projects@XXXXX VP2)
    AND R.REQUEST_ID = PP.PFM_REQUEST_ID
So pretty much I will try to compare RI.USER_DATA7 to VP.IDBANK and see if KCRT_REQUESTS_INT has every row that v_projects@XXXXX has.
Thanks for any help!
If there is a unique identifier column defined as a primary key, you can join both tables on the primary key and see if the count matches the count on the source table without the join.
Assume Table_A is your source and Table_B is where you have loaded the data; let P_key be the primary key column.
You can compare:
SELECT COUNT(1)
FROM Table_A
with:
SELECT COUNT(1)
FROM Table_A, Table_B
WHERE Table_A.P_key = Table_B.P_key
If they match, you have all the records. Hope this helps.
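In this particular case, a MINUS check goes further and shows exactly which rows are missing (a sketch; per the question, USER_DATA7 holds the TO_CHAR(VP.IDBANK) values):

-- Remote IDBANK values that never made it into the local table
SELECT TO_CHAR(VP.IDBANK)
FROM v_projects@XXXXX VP
MINUS
SELECT RI.USER_DATA7
FROM kcrt_requests_int RI;

An empty result means every remote row is represented locally; keep in mind the INSERT's WHERE filters may legitimately exclude some remote rows.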

Should I create a new DB column or not?

I don't know if it is better for me to create a new column in my MySQL database or not.
I have a table :
calculated_data
(id, date, the_value, status)
status is a boolean.
I need an extra value named the_filtered_value.
I can get it easily like this :
SELECT IF(status IS FALSE, 0, the_value) AS the_filtered_value FROM calculated_data
The calculated_data table has millions of entries and I display the_value and the_filtered_value in charts and data tables (using PHP).
Is it better to create a new column the_filtered_value in the calculated_data table or just use the SELECT IF query?
In "better" I see :
better in performance
better in DB design
easier to maintain
...
Thanks for your help!
Do not add a column. Instead, create a VIEW based on the original data table, and in the view add a "virtual" calculated column called the_filtered_value based on your expression.
In this way you will have easy access to the filtered value without having to copy the "logic" of that expression to different places in your code, while at the same time not storing any derived data. In addition, you will be able to operate directly on the view as if it were a table in most circumstances.
CREATE VIEW calculated_data_ex (id, date, the_value, status, the_filtered_value)
AS SELECT id, date, the_value, status, IF(status IS FALSE, 0, the_value)
FROM calculated_data
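The charts can then query the view as if it were a table, for example:

SELECT date, the_value, the_filtered_value
FROM calculated_data_ex
ORDER BY date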
Adding the extra field adds complexity to your app but makes queries easier (especially when joining to other tables).
I personally always try to keep the data as separated as possible in the database, and I handle these cases in my application. Using an MVC pattern makes this task easier.
This works in MS SQL but I do not know if MySQL will support the syntax.
declare @deleteme table (value int, flag bit)
insert @deleteme
values
(1, 'False')
,(2, 'true')
select *, (flag * value) as the_filtered_value from @deleteme
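For reference, a MySQL version of the same demo (a sketch: MySQL has no table variables, so a temporary table stands in; BOOLEAN is just TINYINT, so the multiplication behaves the same way):

CREATE TEMPORARY TABLE deleteme (value INT, flag BOOLEAN);
INSERT INTO deleteme VALUES (1, FALSE), (2, TRUE);
SELECT value, flag, (flag * value) AS the_filtered_value FROM deleteme;
DROP TEMPORARY TABLE deleteme;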