Need to copy data to SQL based on conditions - sql

I have a dataset that comes in an excel file that I need to copy to a SQL table. Before copying it I need to make sure of the following two conditions:
If the entire row is already in the SQL table then I do not need to copy that particular row.
If the value in column 'Step_API' is already in that column in the SQL table and column 'Source' is equal to 'PIDM' then I do not need to copy that row either.
I solved the first part with SELECT EXCEPT and it works. I do not know how to make the second part work. Please help me. Thanks
I tried the code below for the 1st part:
# write the pandas DataFrame into a different table as a placeholder
col_options = dict(
    dtype={
        'Step_ID': sqlalchemy.types.INTEGER(),
        'Step_Level': sqlalchemy.types.VARCHAR(length=50),
        'Step_API': sqlalchemy.types.VARCHAR(length=50),
        'Source_ID': sqlalchemy.types.VARCHAR(length=150),
        'Source_Well_Name': sqlalchemy.types.VARCHAR(length=150),
        'Start_Date': sqlalchemy.types.Date(),
        'Stop_Date': sqlalchemy.types.Date(),
        'Source': sqlalchemy.types.VARCHAR(length=150),
        'Created_By': sqlalchemy.types.VARCHAR(length=50),
        'Created_Dt': sqlalchemy.types.Date(),
        'Updated_By': sqlalchemy.types.VARCHAR(length=50),
        'Updated_Dt': sqlalchemy.types.Date(),
        'ETL_Load_Date': sqlalchemy.types.Date(),
        'Comment': sqlalchemy.types.VARCHAR(length=150)
    }
)
df.to_sql(name="obo_external_xref_temp", con=engine, schema='MDM', if_exists='replace', index=False, **col_options)
# retrieve only new records comparing placeholder table and table where I intend to write
query = """
SELECT Step_ID, Step_Level, Step_API, Source_ID, Source_Well_Name, Start_Date, Stop_Date, Source, Created_By, Created_Dt, Updated_By, Updated_Dt, ETL_Load_Date, Comment FROM MDM.obo_external_xref_temp
EXCEPT
SELECT Step_ID, Step_Level, Step_API, Source_ID, Source_Well_Name, Start_Date, Stop_Date, Source, Created_By, Created_Dt, Updated_By, Updated_Dt, ETL_Load_Date, Comment FROM MDM.obo_external_xref;
"""
new_entries = pd.read_sql(query, con=engine)
# append only new records in the table
new_entries.to_sql(name="obo_external_xref", con=engine, schema= 'MDM', if_exists='append', index=False, **col_options)

I think this could work if you partition the incoming data set by Source.
For the Source != 'PIDM' partition, keep your SELECT ... EXCEPT SELECT approach, but add WHERE Source != 'PIDM' to both sides.
Then INSERT the other partition (Source = 'PIDM'), but only for Step_API values not already in the existing table, using a subquery like this:
INSERT INTO MDM.obo_external_xref
SELECT Step_ID, Step_Level, Step_API, Source_ID, Source_Well_Name, Start_Date, Stop_Date, Source, Created_By, Created_Dt, Updated_By, Updated_Dt, ETL_Load_Date, Comment
FROM MDM.obo_external_xref_temp
WHERE Source = 'PIDM'
  AND Step_API NOT IN (SELECT Step_API FROM MDM.obo_external_xref WHERE Source = 'PIDM');
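For comparison, the same two rules can also be applied on the pandas side before writing. This is only a minimal sketch with small in-memory frames standing in for the real tables, and the columns reduced to the ones the rules touch (the `Comment` column here is just a hypothetical extra column so rule 2 has something to catch):

```python
import pandas as pd

# Hypothetical stand-ins for the incoming file and the existing SQL table.
incoming = pd.DataFrame({
    'Step_API': ['A1', 'A1', 'A2', 'A3', 'A4'],
    'Source':   ['PIDM', 'PIDM', 'PIDM', 'OTHER', 'OTHER'],
    'Comment':  ['x', 'y', 'x', 'x', 'x'],
})
existing = pd.DataFrame({
    'Step_API': ['A1', 'A3'],
    'Source':   ['PIDM', 'OTHER'],
    'Comment':  ['x', 'x'],
})

# Rule 1: drop rows that already exist in full (like SELECT ... EXCEPT).
merged = incoming.merge(existing, how='left', indicator=True)
new_rows = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

# Rule 2: also drop rows whose Source is 'PIDM' and whose Step_API is
# already present with Source = 'PIDM' in the existing table.
pidm_apis = set(existing.loc[existing['Source'] == 'PIDM', 'Step_API'])
mask = (new_rows['Source'] == 'PIDM') & new_rows['Step_API'].isin(pidm_apis)
new_rows = new_rows[~mask]

print(sorted(new_rows['Step_API'].tolist()))
```

The surviving rows can then be appended with `to_sql(..., if_exists='append')` as in your first step.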

INSERT values into table using cursor.execute

I'm writing some code that will pull data from an API and insert the records into a table for me.
I'm unsure how to go about formatting my insert statement. I want to insert values where there is no existing match in the table (based on date), and I don't want to insert values where the column opponents = my school's team.
import datetime
import requests
import cx_Oracle
import os
from pytz import timezone

currentYear = 2020
con = Some_datawarehouse
cursor = con.cursor()
json_obj = requests.get('https://api.collegefootballdata.com/games?year=' + str(currentYear) + '&seasonType=regular&team=myteam')\
    .json()
for item in json_obj:
    EVENTDATE = datetime.datetime.strptime(item['start_date'], '%Y-%m-%dT%H:%M:%S.%fZ').date()
    EVENTTIME = str(datetime.datetime.strptime(item['start_date'], '%Y-%m-%dT%H:%M:%S.%fZ').replace(tzinfo=timezone('EST')).time())
    FINAL_SCORE = item.get("home_points", None)
    OPPONENT = item.get("away_team", None)
    OPPONENT_FINAL_SCORE = item.get("away_points", None)
    cursor.execute('''INSERT INTO mytable(EVENTDATE,EVENTTIME,FINAL_SCORE,OPPONENT,OPPONENT_FINAL_SCORE) VALUES (:1,:2,:3,:4,:5)
        WHERE OPPONENT <> 'my team'
        AND EVENTDATE NOT EXISTS (SELECT EVENTDATE FROM mytable);''',
        [EVENTDATE, EVENTTIME, FINAL_SCORE, OPPONENT, OPPONENT_FINAL_SCORE])
con.commit()
con.close()
This may be more of an ORACLE SQL rather than python question, but I'm not sure if cursor.execute can accept MERGE statements. I also recognize that the WHERE statement will not work here, but this is more of an idea of what I'm trying to accomplish.
Change the SQL query to this. Oracle does not support a standalone VALUES table constructor, so select the bind values from dual instead (and note that cursor.execute can run any single SQL statement, MERGE included, but don't put a trailing semicolon inside the string or Oracle raises ORA-00911):
INSERT INTO mytable (EVENTDATE, EVENTTIME, FINAL_SCORE, OPPONENT, OPPONENT_FINAL_SCORE)
SELECT :1, :2, :3, :4, :5 FROM dual
WHERE :4 <> 'my team'
AND NOT EXISTS (SELECT 1 FROM mytable WHERE EVENTDATE = :1)
With cx_Oracle, repeated placeholders such as :1 and :4 are treated as the same bind variable, so your existing five-element list still works.
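Alternatively, since both conditions are known before the insert, you could filter on the Python side and batch the survivors with executemany. A sketch under that assumption, with hypothetical rows standing in for the parsed API records and a hypothetical set of already-loaded dates:

```python
# Sketch: filter rows in Python, then batch-insert only the survivors.
# 'rows' stands in for the records parsed from the API response;
# 'existing_dates' for dates already present in mytable.
rows = [
    ('2020-09-12', '12:00:00', 34, 'Rival U', 17),
    ('2020-09-19', '15:30:00', 21, 'my team', 28),   # skipped: opponent is my team
    ('2020-09-26', '19:00:00', 10, 'State U', 3),
]
existing_dates = {'2020-09-26'}                      # skipped: date already loaded

to_insert = [
    r for r in rows
    if r[3] != 'my team' and r[0] not in existing_dates
]

# With a live connection you would then run:
# cursor.executemany(
#     "INSERT INTO mytable (EVENTDATE, EVENTTIME, FINAL_SCORE, OPPONENT,"
#     " OPPONENT_FINAL_SCORE) VALUES (:1, :2, :3, :4, :5)",
#     to_insert)
print(to_insert)
```

This keeps the SQL a plain INSERT and avoids re-querying the table once per row.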

Pig: efficient filtering by loaded list

In Apache Pig (version 0.16.x), what are some of the most efficient methods to filter a dataset by an existing list of values for one of the dataset's fields?
For example,
(Updated per #inquisitive_mind's tip)
Input: a line-separated file with one value per line
my_codes.txt
'110'
'100'
'000'
sample_data.txt
'110', 2
'110', 3
'001', 3
'000', 1
Desired Output
'110', 2
'110', 3
'000', 1
Sample script
%default my_codes_file 'my_codes.txt'
%default sample_data_file 'sample_data.txt'
my_codes = LOAD '$my_codes_file' as (code:chararray);
sample_data = LOAD '$sample_data_file' as (code:chararray, point:float);
desired_data = FILTER sample_data BY code IN (my_codes.code);
Error:
Scalar has more than one row in the output. 1st : ('110'), 2nd :('100')
(common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar" )
I had also tried FILTER sample_data BY code IN my_codes; but the "IN" clause seems to require parenthesis.
I also tried FILTER sample_data BY code IN (my_codes); but got the error:
A column needs to be projected from a relation for it to be used as a scalar
The my_codes.txt file has the codes in a row instead of a column. Since you are loading them into a single field, the codes should be one per line, like this:
'110'
'100'
'000'
Alternatively, you can use JOIN:
joined_data = JOIN sample_data BY code, my_codes BY code;
desired_data = FOREACH joined_data GENERATE $0, $1;
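To make the intended semantics concrete, here is the same filter-by-loaded-list expressed in plain Python, with literal values mirroring the two sample files above (quotes kept as part of the data, as in the files):

```python
# Codes loaded from my_codes.txt (one value per line).
my_codes = {"'110'", "'100'", "'000'"}

# Rows loaded from sample_data.txt as (code, point) pairs.
sample_data = [("'110'", 2.0), ("'110'", 3.0), ("'001'", 3.0), ("'000'", 1.0)]

# Equivalent of FILTER sample_data BY code IN (...): keep only rows whose
# code appears in the loaded list.
desired_data = [row for row in sample_data if row[0] in my_codes]
print(desired_data)
```

The JOIN-based Pig answer computes the same result set, just via a relational join instead of a membership test.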

Dynamics Ax 2009: create view using UNION and tables with this same fields name

I want to create a view in Dynamics AX 2009. The view has to combine two or more tables that have the same field names.
I have prepared the SQL query (below), but I don't know how to translate it into an AX view.
select invent.ASSETID, invent.ITEMID, invent.JOURNALID as 'ids'
from inventjournaltrans invent
where invent.ASSETID != ''
UNION
select purch.ASSETID, purch.ITEMID, purch.PURCHID as 'ids'
from purchline purch
where purch.ASSETID != ''
Please find below a sample X++ query. But you must bear in mind that in standard AX the fields PurchId and JournalId have different lengths, so you will get the following error:
There is a field mismatch in the union query. Field JournalId is not compatible with field PurchId.
Query query;
QueryBuildDataSource qbdsInventJournalTrans;
QueryBuildDataSource qbdsPurchLine;
QueryBuildRange qbrInventJournalTrans;
QueryBuildRange qbrPurchLine;
;
query = new Query();
query.queryType(QueryType::Union);
qbdsInventJournalTrans = query.addDataSource(tableNum(InventJournalTrans));
qbdsInventJournalTrans.unionType(UnionType::UnionAll); // Include duplicate records
qbdsInventJournalTrans.fields().dynamic(false);
qbdsInventJournalTrans.fields().clearFieldList();
qbdsInventJournalTrans.fields().addField(fieldNum(InventJournalTrans, AssetId));
qbdsInventJournalTrans.fields().addField(fieldNum(InventJournalTrans, ItemId));
//qbdsInventJournalTrans.fields().addField(fieldNum(InventJournalTrans, JournalId));
qbrInventJournalTrans = qbdsInventJournalTrans.addRange(fieldNum(InventJournalTrans, AssetId));
qbrInventJournalTrans.value(SysQuery::valueNotEmptyString());
qbdsPurchLine = query.addDataSource(tableNum(PurchLine));
qbdsPurchLine.unionType(UnionType::UnionAll); // Include duplicate records
qbdsPurchLine.fields().dynamic(false);
qbdsPurchLine.fields().clearFieldList();
qbdsPurchLine.fields().addField(fieldNum(PurchLine, AssetId));
qbdsPurchLine.fields().addField(fieldNum(PurchLine, ItemId));
//qbdsPurchLine.fields().addField(fieldNum(PurchLine, PurchId));
qbrPurchLine = qbdsPurchLine.addRange(fieldNum(PurchLine, AssetId));
qbrPurchLine.value(SysQuery::valueNotEmptyString());
Please refer to this link if you need to create an AOT query: How to: Combine Data Sources

Postgresql : How to update one field for all duplicate values based at the end of the string of a field except one row

http://sqlfiddle.com/#!9/b98ea/1 (Sample Table)
I have a table with the following fields:
transfer_id
src_path
DH_USER_ID
email
status_state
ip_address
src_path field contains a couple of duplicates filename values but a different folder name at the beginning of the string.
Example:
191915/NequeVestibulumEget.mp3
/191918/NequeVestibulumEget.mp3
191920/NequeVestibulumEget.mp3
I am trying to do the following:
Set status_state field to 'canceled' for all the duplicate filenames within (src_path) field except for one.
I want the results to look like this:
http://sqlfiddle.com/#!9/5e65f/2
*I apologize in advance for being a complete noob, but I am taking SQL at college and I need help.
SQL Fiddle Demo
fix_os_name: fix the Windows path string to Unix format.
file_name: split the path on '/', using char_length to pick the last piece.
drank: create a sequence number per filename, so a unique filename only gets 1, but duplicates also get 2, 3, ...
UPDATE: a row with rn > 1 is a duplicate.
Take note that the color highlighting is wrong, but the code runs OK.
with fix_os_name as (
SELECT transfer_id, replace(src_path,'\','/') src_path,
DH_USER_ID, email, status_state, ip_address
FROM priority_transfer p
),
file_name as (
SELECT
fon.*,
split_part(src_path,
'/',
char_length(src_path) - char_length(replace(src_path,'/','')) + 1
) sfile
FROM fix_os_name fon
),
drank as (
SELECT
f.*,
row_number() over (partition by sfile order by sfile) rn
from file_name f
)
UPDATE priority_transfer p
SET status_state = 'canceled'
WHERE EXISTS ( SELECT *
FROM drank d
WHERE d.transfer_id = p.transfer_id
AND d.rn > 1);
ADD: one row per filename is left untouched.
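The keep-one, cancel-the-rest logic can be sketched in plain Python, with hypothetical rows mirroring the sample paths above (the CTE chain does the same thing set-wise in SQL):

```python
# Rows as (transfer_id, src_path, status_state).
rows = [
    (1, '191915/NequeVestibulumEget.mp3', 'active'),
    (2, '/191918/NequeVestibulumEget.mp3', 'active'),
    (3, '191920/NequeVestibulumEget.mp3', 'active'),
    (4, '191921/Other.mp3', 'active'),
]

seen = set()
updated = []
for transfer_id, src_path, status in rows:
    # Equivalent of fix_os_name + file_name: normalize separators and
    # take the text after the last '/'.
    fname = src_path.replace('\\', '/').rsplit('/', 1)[-1]
    # Equivalent of rn > 1: every occurrence after the first is canceled.
    if fname in seen:
        status = 'canceled'
    seen.add(fname)
    updated.append((transfer_id, src_path, status))

print([s for _, _, s in updated])
```

Only the first row per filename keeps its original status, matching the "except one" requirement.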
Use the regexp_matches function to separate the file name from the directory.
From there you can use distinct() to build a table with unique values for the filename.
select
regexp_matches(src_path, '[a-zA-Z.0-9]*$') , *
from priority_transfer
;

Issue with union query - select list not compatible

Been working on this query for some time now... I keep getting the error "Corresponding select-list expressions are not compatible". I am selecting the same number of columns in each SELECT statement.
create volatile table dt as (
SELECT
gcv.I_SYS_IDV,
gcv.i_pln,
gcv.c_typ_cov,
gcv.d_eff,
gcv.d_eff_pln,
gcv.c_sta,
gcv.d_sta,
gcv.c_mde_bft_fst,
gcv.a_bft_fst,
gcv.c_mde_bft_sec,
gcv.a_bft_sec,
gcv.c_mde_bft_trd,
gcv.a_bft_trd,
gcv.p_cre_hom,
gcv.c_cl_rsk,
gpv.c_val,
gpv.i_val,
gcv.c_pol,
gpv.i_prv
FROM Pearl_P.tltc906_gcv gcv,
pearl_p.tltc912_gpv gpv
WHERE gcv.i_pln > 0
AND gcv.i_pln = gpv.i_pln
and gpv.i_prv = '36'
and gcv.c_pol between 'lac100001' and 'lac100004'
UNION
SELECT
gcv.I_SYS_IDV,
gcv.i_pln,
gcv.c_typ_cov,
gcv.d_eff,
gcv.d_eff_pln,
gcv.c_sta,
gcv.d_sta,
gcv.c_mde_bft_fst,
gcv.a_bft_fst,
gcv.c_mde_bft_sec,
gcv.a_bft_sec,
gcv.c_mde_bft_trd,
gcv.a_bft_trd,
gcv.p_cre_hom,
gcv.c_cl_rsk,
gcv.c_pol,
gpv.i_val,
gpv.i_pln,
gpv.i_prv
FROM Pearl_P.tltc906_gcv gcv,
pearl_p.tltc912_gpv gpv
where NOT EXISTS(
SELECT 1
FROM pearl_p.tltc906_gcv gcv,
pearl_p.tltc912_gpv gpv
WHERE gcv.i_pln > 0
AND gcv.i_pln = gpv.i_pln
and gpv.i_prv = '36'
)
) with data
PRIMARY INDEX (i_sys_idv)
on commit preserve rows;
You should check the data types of each column. The data types must be compatible between each SELECT statement.
The last 4 values of your second select statement don't match the ones in your first statement. Try naming (using aliases) those columns the same thing (pairing them). To UNION you need to have the same columns between the sets.
Check out: http://msdn.microsoft.com/en-us/library/ms180026.aspx
The following are basic rules for combining the result sets of two
queries by using UNION:
The number and the order of the columns must be the same in all
queries.
The data types must be compatible.
Are the data types of each column the same in both portions of the query?
If the first character of the column name indicates the data type (i = integer, c = character, etc.) I'm guessing that the problem is with the second to last column in the select list. The first query is selecting gcv.c_pol, the second query is selecting gpv.i_pln.
Start commenting-out lines until it works. I.e., comment out all of the fields in each select list, and then one-by-one, un-comment the first one out, then the second, etc. You'll find it eventually.