Create an Impala text table where rows meet a condition - sql

I am trying to create a table in Impala (SQL) that takes rows from a Parquet table. The data represents bike rides in a city. Rows should be imported into the new table if their starting code (a string, e.g. '6100') shows up more than 100 times in the first table. Here's what I have so far:
#I am using Apache Impala via the Hue Editor
invalidate metadata;
set compression_codec=none;
invalidate metadata;
Set compression_codec=gzip;
create table bixirides_parquet (
start_date string, start_station_code string,
end_date string, end_station_code string,
duration_sec int, is_member int)
stored as parquet;
Insert overwrite table bixirides_parquet select * from bixirides_avro;
invalidate metadata;
set compression_codec=none;
create table impala_out stored as textfile as select start_date, start_station_code, end_date, end_station_code, duration_sec, is_member, count(start_station_code) as count
from bixirides_parquet
having count(start_station_code)>100;
For some reason the statement will run, but no rows are inserted into the new table. It should import a row into the new table if that row's starting code shows up more than 100 times in the original table. I think I'm wording my select statement improperly, but I'm not sure how exactly.

I think the final query you want is:
select start_date, start_station_code, end_date,
end_station_code, duration_sec, is_member, cnt
from (select bp.*,
count(*) over (partition by start_station_code) as cnt
from bixirides_parquet bp
) bp
where cnt > 100;
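If you want that result written straight to a text table in one step, here is a sketch of wrapping the query above in a CREATE TABLE ... AS SELECT, reusing the impala_out name from your attempt (untested, but Impala supports both analytic functions and CTAS):
set compression_codec=none;
create table impala_out stored as textfile as
select start_date, start_station_code, end_date,
       end_station_code, duration_sec, is_member, cnt
from (select bp.*,
             count(*) over (partition by start_station_code) as cnt
      from bixirides_parquet bp
     ) bp
where cnt > 100;
Unlike the GROUP BY/HAVING attempt, this keeps every qualifying ride row; the analytic count just tags each row with how often its start_station_code appears.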

Related

For loop with output arrays

In Snowflake:
I have two tables available:
"SEG_HISTO": This is a segmentation run once a month.
columns: Client ID /date (1st of each month) /segment.
"TCK": a table that contains the tickets with the columns: Ticket ID / Customer ID / Date / Amount.
For each customer ID in the "SEG_HISTO" table, I searched for all the customer's tickets over a rolling year and associated the sum of the amount spent:
SELECT SEG_OMNI.*, TCK_12M.TOTAL_AMOUNT_HT
FROM "SHARE"."DATAMARTS_DATASCIENCE"."SEG_OMNI" SEG_OMNI
LEFT OUTER JOIN
(
SELECT DISTINCT PR_ID_BU,
SUM(TOTAL_AMOUNT_HT) AS "TOTAL_AMOUNT_HT",
COUNT(*) "NB_ACHAT"
FROM
(
SELECT * FROM "SHARE"."RAW_BDC"."TCK"
WHERE TO_DATE(DT_SALE) >= DATEADD(YEAR, -1, '2022-07-01') -- <<<===== date add manually
)
GROUP BY PR_ID_BU
) TCK_12M
ON SEG_OMNI."pr_id_bu" = TCK_12M.PR_ID_BU
Now I need to create a for loop that iterates this for each date in the SEG_OMNI table (SELECT DISTINCT TO_DATE(DT_MAJ) DT FROM "SHARE"."DATAMARTS_DATASCIENCE"."SEG_HISTO") and stack the output in a view.
And this is where I'm stuck.
Thanks in advance for your help.
As Dave said in the comments, it would be better if you could figure out how to run all this in one query, instead of running the same query multiple times.
But since you are asking how to output the results of multiple queries from one stored procedure, I'm going to give you the pattern for that here. I'm also assuming you want this as a SQL Scripting block (we could use Python/Java/JS instead):
declare
    your_var date;   -- assuming the dates column is a DATE
    all_dates cursor for (
        select dates
        from your_table
    );
begin
    -- create a table to store the results
    create or replace temp table discovery_results(x string, y string, z int);
    for record in all_dates do
        -- for each date, run the query and insert the results into the table created above
        your_var := record.dates;   -- pick the current date off the cursor record
        insert into discovery_results
        select x, y, z
        from the_query
        where your_date_column = :your_var   -- replace your_date_column with the real date column in the_query
        ;
    end for;
    return 'run [select * from discovery_results] to find the results';
end;
select *
from discovery_results
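Applied to the tables in the question, the loop body could look roughly like this; tck_12m_by_date is a made-up name, the column names are taken from the question, and this is only a sketch of the pattern, not tested code:
declare
    _dt date;
    all_dates cursor for (
        select distinct to_date(dt_maj) as dt
        from "SHARE"."DATAMARTS_DATASCIENCE"."SEG_HISTO"
    );
begin
    -- one output row per (segmentation date, customer)
    create or replace temp table tck_12m_by_date (dt date, pr_id_bu string, total_amount_ht number, nb_achat number);
    for record in all_dates do
        _dt := record.dt;
        insert into tck_12m_by_date
        select :_dt, pr_id_bu, sum(total_amount_ht), count(*)
        from "SHARE"."RAW_BDC"."TCK"
        where to_date(dt_sale) >= dateadd(year, -1, :_dt)
          and to_date(dt_sale) < :_dt   -- rolling 12 months ending at the segmentation date; adjust the bounds if needed
        group by pr_id_bu;
    end for;
    return 'run [select * from tck_12m_by_date] to see the results';
end;
You can then left join SEG_HISTO to tck_12m_by_date on both the customer ID and the segmentation date, as in your original query.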

Data Loaded wrongly into Hive Partitioned table after adding a new column using ALTER

I already have a Hive partitioned table. I needed to add a new column to the table, so I used ALTER to add the column like below.
ALTER TABLE TABLE1 ADD COLUMNS(COLUMN6 STRING);
I have my final table load query like this:
INSERT OVERWRITE table Final table PARTITION(COLUMN4, COLUMN5)
select
stg.Column1,
stg.Column2,
stg.Column3,
stg.Column4,   -- partition column; field name: Code, sample value: YAHOO.COM
stg.Column5,   -- partition column; field name: Date, sample value: 2021-06-25
stg.Column6    -- new column; field name: reason, sample value: Adjustment
from (
select fee.* from (
select
fees.* ,
ROW_NUMBER() OVER (PARTITION BY fees.Column1 ORDER BY fees.Column3 DESC) as RNK
from Stage table fee
) fee
where RNK = 1
) stg
left join (
select Column1 from Final table
where Column5(date) in (select distinct column5(date) from Stage table)
) TGT
on tgt.Column1(id) = stg.Column1(id) where tgt.column1 is null
UNION
select
tgt.column1(id),
tgt.column2,
tgt.column3,
tgt.column4(partition column),
tgt.column5(partition column-date),
tgt.column6(New column)
from
Final Table TGT
WHERE TGT.Column5(date) in (select distinct column5(date) from Stage table);
Now when my job ran today and I tried to query the final table, I got the below error:
Invalid partition value 'Adjustment' for DATE partition key: Code=2021-06-25/date=Adjustment
I can tell something went wrong around the partition columns, but I'm unable to figure out what. Can someone help?
Partition columns should be the last ones in the select. When you add a new column, it is added as the last non-partition column; the partition columns remain the last ones. They are not stored in the data files, only the metadata contains information about partitions. The order of all other columns also matters: it should match the table DDL. Check it using DESCRIBE FORMATTED table_name.
INSERT OVERWRITE table Final table PARTITION(COLUMN4, COLUMN5)
select
stg.Column1,
stg.Column2,
stg.Column3,
stg.Column6,   -- new column
stg.Column4,   -- partition columns go last
stg.Column5
...
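As a quick check (using final_table as a stand-in for the real table name), the column order to follow can be read from:
DESCRIBE FORMATTED final_table;
-- non-partition columns are listed first, in the order the SELECT must produce them;
-- the partition columns appear separately under "# Partition Information" and go last in the SELECT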

Dynamically Updating Columns with new Data

I am handling an SQL table with over 10K rows; essentially it tracks the status of a production station over the day. Currently the SQL server reports a new message at the current timestamp, so a new entry can be generated for the same part hundreds of times a day while only the "Production_Status" and "TimeStamp" columns change. I want to create a new table that holds the unique part names, plus two other columns that always show the LATEST entry for THAT part.
So far I have selected the data and reordered it so the latest timestamp is first in the list. I am now trying to build this dynamic table, but I am new to SQL.
select dateTimeStamp,partNumber,lineStatus
from tblPLCData
where lineStatus like '_ Zone %' or lineStatus = 'Production'
order by dateTimeStamp desc;
The expected result is a NewTable whose row count matches how many parts exist in our production facility (this column will be static), plus two other columns that look up the latest status and timestamp for each part in the original table and keep those two columns in the NewTable up to date.
I don't need help with the table creation so much as with the logic for updating rows based on another table.
Much Appreciated.
It looks like you could take advantage of a derived-table join that finds the MAX status date for each partNumber and then joins back to the table itself, so you can pick up the lineStatus value belonging to the record with the max date. I just have you inserting into and updating a temp table, but this is the general approach you could take.
-- New table that might already exist in your db; I am creating it as a temp table here
create table #NewTable(
partNumber int,
lineStatus varchar(max),
last_update datetime
)
-- To initially set up your table or to update your table later with new part numbers that were not added before
insert into #NewTable
select tpd.partNumber, tpd.lineStatus, tpd.lineStatusdate
from tblPLCData tpd
join (
select partNumber, MAX(lineStatusdate) lineStatusDateMax
from tblPLCData
group by partNumber
) maxStatusDate on tpd.partNumber = maxStatusDate.partNumber
and tpd.lineStatusdate = maxStatusDate.lineStatusDateMax
left join #NewTable nt on tpd.partNumber = nt.partNumber
where (tpd.lineStatus like '_ Zone %' or tpd.lineStatus = 'Production') and nt.partNumber is null
-- To update your table whenever you deem it necessary to refresh it. I try to avoid triggers in my dbs
update nt set nt.lineStatus = tpd.lineStatus, nt.last_update = tpd.lineStatusDate
from tblPLCData tpd
join (
select partNumber, MAX(lineStatusdate) lineStatusDateMax
from tblPLCData
group by partNumber
) maxStatusDate on tpd.partNumber = maxStatusDate.partNumber
and tpd.lineStatusdate = maxStatusDate.lineStatusDateMax
join #NewTable nt on tpd.partNumber = nt.partNumber
where tpd.lineStatus like '_ Zone %' or tpd.lineStatus = 'Production'
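For comparison, the same "latest row per part" selection can also be written with ROW_NUMBER, using the column names from the original question (a sketch, not a drop-in replacement for the temp-table approach above):
select partNumber, lineStatus, dateTimeStamp
from (
    select partNumber, lineStatus, dateTimeStamp,
           row_number() over (partition by partNumber order by dateTimeStamp desc) as rn
    from tblPLCData
    where lineStatus like '_ Zone %' or lineStatus = 'Production'
) latest
where rn = 1;
-- returns one row per partNumber, carrying its most recent status and timestamp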

Track table counts from a schema

I use postgres and I need some help from you all PG experts...
I am looking to track counts from a large set of source tables whose counts keep changing every day. I want to store the table name, row count and table size in a tracker table, along with a created_dttm column to show when the row count was recorded from the source table. This is for trending how the table counts change over time and looking for peaks.
insert into tracker_table( tablename, rowcount, tablesize, timestamp)
from
(
(select schema.tablename ... - not sure how to drive this to pick up a list of tables??
, select count(*) from schema.tablename
, SELECT pg_size_pretty(pg_total_relation_size('"schema"."tablename"'))
, select created_dttm from schema.tablename
)
);
Additionally, I want to pull a particular column from the source table into a fourth column. This would be a created_dttm timestamp field in the source table, and I want to run a simple SQL query to get this date into the tracker table. Any suggestions on how to attack this problem?
Before reading the code, please consider this:
instead of running several scalar subqueries, join them into one query where you can, e.g. select (select 1 from t), (select 2 from t) can be refactored to select 1, 2 from t;
pg_total_relation_size is the sum of data pages, so it is the size of the table on disk, not the size of the data in it;
you need aggregation on your created_dttm column (I used oid instead), otherwise your subquery returns more than one row and you won't be able to insert the result;
instead of select count(*), maybe use the pg_stat_all_tables statistics? Counting can be very expensive, and the accuracy of count(*) matters little given that the same count will be different a minute later and you probably won't run it every two seconds...
code:
t=# create table so30 (n text, c int, s text, o int);
CREATE TABLE
t=# do
$$
declare
_r record;
_s text;
begin
for _r in (values('pg_database'),('pg_roles')) loop
_s := format('select %1$L,(select count(*) from %1$I), (SELECT pg_size_pretty(pg_total_relation_size(%1$L))), (select max(oid) from %1$I)',_r.column1);
execute format('insert into so30 %s',_s);
end loop;
end;
$$
;
DO
t=# select * from so30;
n | c | s | o
-------------+---+---------+-------
pg_database | 4 | 72 kB | 16384
pg_roles | 2 | 0 bytes | 4200
(2 rows)
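Following the pg_stat_all_tables suggestion above, here is a minimal sketch that fills a tracker table in one statement; the tracker_table columns are the ones named in the question, n_live_tup is only an estimate, and 'my_schema' stands in for the schema you want to track:
insert into tracker_table (tablename, rowcount, tablesize, timestamp)
select schemaname || '.' || relname,              -- qualified table name
       n_live_tup,                                -- estimated row count from the statistics collector
       pg_size_pretty(pg_total_relation_size(relid)),
       now()                                      -- when this snapshot was taken
from pg_stat_all_tables
where schemaname = 'my_schema';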

Insert output IDs into another table

I have a status table, and another table containing additional data. My object IDs are the PK in the status table, so I need to insert those into the additional data table for each new row.
I need to insert a new row into my statusTable for each new listing, containing just constants.
create table #temp (listingID int)
insert into statusTable(status, date)
output Inserted.listingID into #temp
select 1, getdate()
from anotherImportedTable
This gets me enough new listing IDs to use.
I now need to insert the actual listing data into another table, and map each row to one of those listingIDs -
insert into listingExtraData(listingID, data)
select t.listingID, a.data
from #temp t, anotherImportedTable a
Now this obviously doesn't work, because the rows in anotherImportedTable and the IDs in #temp are unrelated... so I get far too many rows inserted.
How can I insert each row from anotherImportedTable into listingExtraData along with a unique newly created listingID? could I possibly trigger some more sql at the point I do the output in the first block of sql?
edit: thanks for the input so far, here's what the tables look like:
anotherImportedTable:
data
statusTable:
listingID (pk), status, date
listingExtraData:
data, listingID
You see that I only want to create one entry into statusTable per row in anotherImportedTable, then put one listingID with a row from anotherImportedTable into listingExtraData... I'm thinking that I might have to resort to a cursor perhaps?
Ok, here's how you can do it (if I'm right about what you actually want to do):
insert into listingExtraData(listingID, data)
select q1.listingID, q2.data
from
(select ListingID, ROW_NUMBER() OVER (order by ListingID) as rn from #temp t) as q1
inner join (select data, ROW_NUMBER() over (order by data) as rn from anotherImportedTable) q2 on q1.rn = q2.rn
In case your matching logic differs, you will need to change the sorting of anotherImportedTable. If your match order cannot be achieved by ordering anotherImportedTable one way or another, then you're out of luck.
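If the ordering trick is too fragile, another pattern worth knowing is MERGE with OUTPUT: unlike INSERT ... OUTPUT, the OUTPUT clause of a MERGE can reference source columns, so each new listingID can be paired with the source row that produced it in one statement. A sketch using the table and column names from the question:
merge statusTable as tgt
using anotherImportedTable as src
    on 1 = 0                                  -- never matches, so every source row is inserted
when not matched then
    insert (status, date) values (1, getdate())
output inserted.listingID, src.data
    into listingExtraData (listingID, data);
This writes the status rows and the extra-data rows together, so no row-number matching is needed.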