Ignored duplicate property derby.module.dataDictionary in Hive - hive

I have an EMPLOYEES table which is partitioned on the basis of COUNTRY and STATE. Below are the partitions.
hive (human_resources)> show partitions employees ;
OK
country=IN/state=PU
country=US/state=CA
country=US/state=IL
Time taken: 0.119 seconds, Fetched: 3 row(s)
I loaded data in these partitions from local file system.
When I execute SELECT * FROM EMPLOYEES WHERE STATE = 'IL';, these lines show up in the output:
Thread[main 5.0 ["main] Ignored duplicate property derby.module.dataDictionary in jar:file:/usr/local/hive/apache-hive-1.2.1-bin/lib/hive-jdbc-1.2.1-standalone.jar!/org/apache/derby/modules.properties"] NULL NULL US IL
Thread[main 5.0 ["main] \u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"] NULL NULL US IL
---------------------------------------------------------------- NULL NULL NULL NULL US IL
I do not get these messages when I query the table on the other two partitions. Please let me know how to get rid of these messages.

Can you try SELECT * FROM employees WHERE country = 'US' AND state = 'IL' and let me know if it works? If not, can you post a sample of your data and your table metadata?
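For reference, here is the suggested query together with one way to pull the partition metadata mentioned above; DESCRIBE FORMATTED with a partition spec is standard HiveQL, so only the names may need adjusting:
SELECT * FROM employees WHERE country = 'US' AND state = 'IL';
-- one way to share the table/partition metadata
DESCRIBE FORMATTED employees PARTITION (country='US', state='IL');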

Related

Duplicate rows because 1 column has multiple distinct values

I'm running a SELECT query to get data across multiple tables in the same server instance. However, I've just noticed that some of the returned rows are duplicated because the main table I'm pulling from has a few different values in one of its columns. Here's the query:
SELECT DISTINCT BIF030.C_ACCOUNT AS ACCOUNTNUMBER,
BIF003.C_ACCOUNTTYPE AS ACCOUNTTYPECODE,
CON013.C_DESCRIPTION AS ACCOUNTTYPE,
BIF003.C_DIVISION AS ZONE_DIVISONCODE,
CON028.C_DESCRIPTION AS ZONE_DIVISION,
BIF030.C_METER as METERNUMBER,
BIF005.C_METERCUSTOM1 AS REGISTERNUMBER,
CONVERT(DECIMAL(20,2), BIF030.N_CONSUMP) AS CONSUMPTION,
CON007.C_DESCRIPTION AS UNITS,
BIF030.T_READDATE AS READINGDATE,
MONTH(BIF030.T_READDATE) AS READINGMONTH,
DAY(BIF030.T_READDATE) AS READINGDAY,
YEAR(BIF030.T_READDATE) AS READINGYEAR,
BIF030.I_DAYS AS READINGDAYSCOUNT
FROM ADVANCED.BIF030
LEFT JOIN ADVANCED.CON007 ON CON007.C_UNITS=BIF030.C_UNITS
LEFT JOIN ADVANCED.BIF005 ON BIF005.C_METER=BIF030.C_METER
LEFT JOIN ADVANCED.BIF003 ON BIF003.C_ACCOUNT=BIF030.C_ACCOUNT
LEFT JOIN ADVANCED.CON013 ON CON013.C_ACCOUNTTYPE=BIF003.C_ACCOUNTTYPE
LEFT JOIN ADVANCED.CON028 ON CON028.C_DIVISION=BIF003.C_DIVISION
WHERE T_READDATE > '01-01-2014'
ORDER BY ACCOUNTNUMBER, READINGDATE ASC
I know SELECT DISTINCT is frowned upon, but I get even more rows without it. Here's a sample of what the data looks like when pulled:
ACCOUNTNUMBER | ACCOUNTTYPECODE | ACCOUNTTYPE | ZONE_DIVISIONCODE | ZONE_DIVISION | METERNUMBER | REGISTERNUMBER | CONSUMPTION | UNITS | READINGDATE | READINGMONTH | READINGDAY | READINGYEAR | READINGDAYSCOUNT
1234567 | SP | ACCOUNT TYPE 1 | 00 | 00-NO ZONE | 123456789 | 987654321 | 3.00 | Thousands of Gallons | 2014-01-16 00:00:00.00 | 1 | 16 | 2014 | 30
1234567 | MF | ACCOUNT TYPE 2 | 02 | 02-GRAVITY | 123456789 | 987654321 | 3.00 | Thousands of Gallons | 2014-01-16 00:00:00.00 | 1 | 16 | 2014 | 30
1234567 | SR | ACCOUNT TYPE 3 | 02 | 02-GRAVITY | 123456789 | 987654321 | 3.00 | Thousands of Gallons | 2014-01-16 00:00:00.00 | 1 | 16 | 2014 | 30
I also know the column that is messing this up is the "AccountTypeCode" because other accounts that don't have multiple codes associated with the "AccountNumber" only show 1 set of rows. So this one specifically (and probably others) is tripling the amount of rows pulled when it should only pull one for each "ReadingDate".
Also if anyone knows a good way to optimize the query I'd be happy to learn. I know just enough SQL to be dangerous, but not enough to figure this out. Thanks.
Ok. So good news and I want to add this in case it helps anyone else in the future. I found out that since the ACCOUNTTYPECODE and ZONE_DIVISIONCODE were coming from the table BIF003 I needed to add more in the WHERE statement. This is what fixed it for me:
AND BIF030.C_CUSTOMER = BIF003.C_CUSTOMER
Because the C_CUSTOMER column (it exists in both the BIF003 and BIF030 tables) was different, which led to the separate ACCOUNTTYPECODE results, I needed to check it in the WHERE statement.
Thanks everyone for kick starting my brain on this one.
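For what it's worth, the same extra key can also live in the join itself, which avoids the WHERE clause quietly turning the LEFT JOIN into an inner join. A trimmed-down sketch against the tables above (an illustration, not a verified fix):
SELECT DISTINCT BIF030.C_ACCOUNT    AS ACCOUNTNUMBER,
       BIF003.C_ACCOUNTTYPE         AS ACCOUNTTYPECODE,
       BIF030.T_READDATE            AS READINGDATE
FROM ADVANCED.BIF030
LEFT JOIN ADVANCED.BIF003
       ON BIF003.C_ACCOUNT  = BIF030.C_ACCOUNT
      AND BIF003.C_CUSTOMER = BIF030.C_CUSTOMER   -- the extra key that removes the duplicates
WHERE BIF030.T_READDATE > '2014-01-01'
ORDER BY ACCOUNTNUMBER, READINGDATE;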

Scheduling of jobs through SQL Server stored procedure

I have to write a stored procedure for scheduling the Azure pipelines (jobs).
Table A will have static entries for batches. The Frequency column denotes how many times the batch will run in a day, and the Timing column holds the batch start times, separated by commas.
Table A
Batch_ID Batch_Name Frequency Timing
-----------------------------------------------
1 ABC 2 7:00,13:00
Table B will have the listing of jobs corresponding to one particular batch. This table will also be static and have one-time entries, like Table A.
Table B
Batch_ID JOB_ID JOB_NM
--------------------------------
1 1 Job_1
1 2 Job_2
Table C will contain the dependencies of the jobs in a batch
Table C
Batch_ID JOB_ID DEPENDENCY_JOB_ID
----------------------------------------
1 1
1 2 1
When Batch executes, table D will be populated with batch start time.
Table D
Batch_ID Batch_Name Status start_Time end_time
-------------------------------------------------------
1 abc Start 7:00
As soon as Table D is populated, Table E will be populated with the job details. Job 2 will start only when Job 1 finishes.
Table E
Batch_ID Batch_Name JOB_ID JOB_NM Start_Time End_Time
----------------------------------------------------------------------
1 abc 1 Job_1 7:00
1 abc 2 Job_2 7:15
When Job 2 completes, we will update the end_time column in Table D.
Once the first run is completed, we need to check the Frequency column of Table A and, if it's more than 1, run the batch again and repeat the entire exercise.
If the 1st run hasn't completed before the start time of the 2nd run, we have to hold the 2nd run until the 1st one is completed.
Could anyone help me how to start this?
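One possible starting point, just to illustrate the dependency check described above. The table and column names (Table_C, Table_E, DEPENDENCY_JOB_ID, End_Time) are taken from the description and are assumptions about the real schema:
-- Jobs in batch 1 that are ready to run: either they have no dependency,
-- or the dependency already has an end time recorded in Table E.
SELECT c.JOB_ID
FROM Table_C c
LEFT JOIN Table_E e
       ON e.Batch_ID = c.Batch_ID
      AND e.JOB_ID   = c.DEPENDENCY_JOB_ID
WHERE c.Batch_ID = 1
  AND (c.DEPENDENCY_JOB_ID IS NULL
       OR e.End_Time IS NOT NULL);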
As @Gordon Linoff said, you are lacking a question in your "question".
If I can give an opinion on this, I don't think it's a good design idea to split your logic between Data Factory and stored procedures in a database. Be mindful that in the future, the user maintaining the pipelines may not have access to the database and will not be able to understand half of it. Even if YOU are the one maintaining this, two years from now chances are you will have forgotten what you did, and following the logic across 2 resources may take more time than it should. It will also make troubleshooting harder.
It really depends on the scenario you are working on, but to sum it up: try to have everything logic related in one place.
Hope this helped!

Redshift SQL - Skipped sequence

I'm working on applicant pipeline data and need to get a count of applicants who made it to each phase of the pipeline/funnel. If an applicant skips a phase, I need to count them in that phase anyway. Here's an example of how the data might look for one applicant:
Stage name | Entered on
Application Review | 9/7/2018
Recruiter Screen | 9/10/2018
Phone Interview | blank
Interview | 9/17/2018
Interview 2 | 9/20/2018
Offer | blank
this is what the table looks like:
CREATE TABLE application_stages (
application_id bigint,
stage_id bigint,
entered_on timestamp without time zone,
exited_on timestamp without time zone,
stage_name character varying
);
In this example, I want to count Application Review through Interview 2 (including the skipped/blank Phone Interview phase), but not the Offer. How would I write the above in SQL? (Data is stored in Amazon Redshift. Using SQL workbench to query.)
Also, please let me know if there is anything else I can add to my question to make the issue/solution clearer.
You can hardcode the stages of the pipeline in an event_list table like this:
id | stage_name
1 | first stage
2 | second stage
3 | third stage
4 | fourth stage
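For example, a minimal way to materialize that lookup (the table and column names match the example above; the stage names are placeholders for your own funnel):
create table event_list (
    id         int,
    stage_name varchar(64)
);
insert into event_list values
    (1, 'first stage'),
    (2, 'second stage'),
    (3, 'third stage'),
    (4, 'fourth stage');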
UPD: The deeper the stage of the funnel, the higher its ID. This way you can compare them, i.e. the third stage is deeper than the second stage because 3 > 2. So if you need to find people that reached the 2nd stage, that includes people with events with id=2 OR events with id>2, i.e. events deeper in the funnel.
If the second stage is missed but the third stage is recorded for some person, you can still count that person as having "reached the second stage" by joining your event data to this table on stage_name and counting the records with id>=2, like
select count(distinct user_id)
from event_data t1
join event_list t2
using (stage_name)
where t2.id>=2
Alternatively, you can left join your event table to event_list and fill the gaps using the lag window function, which returns the value of a previous row (i.e. assigning the timestamp of the first stage to the missed second stage in the case above).
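A rough, untested sketch of that left-join-and-fill idea, using the application_stages table above plus the event_list lookup (the derived table a just enumerates applicants; anything not shown in the question is an assumption):
select a.application_id,
       l.id as stage_id,
       l.stage_name,
       coalesce(s.entered_on,
                lag(s.entered_on) ignore nulls over (
                    partition by a.application_id
                    order by l.id)) as entered_on_filled
from (select distinct application_id from application_stages) a
cross join event_list l
left join application_stages s
       on s.application_id = a.application_id
      and s.stage_name     = l.stage_name;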
Here is the SQL I ended up with. Thanks for the ideas, @AlexYes!
select stage_name,
       application_stages.application_id,
       entered_on,
       case
         when entered_on is null then
           lead(entered_on, 1) ignore nulls over (
             partition by application_stages.application_id
             order by case stage_name
                        when 'Application Review' then 1
                        when 'Recruiter Screen'   then 2
                        when 'Phone Interview'    then 3
                        when 'Interview'          then 4
                        when 'Interview 2'        then 5
                        when 'Offer'              then 6
                        when 'Hired'              then 7
                      end)
         else entered_on
       end as for_count,
       exited_on
from application_stages
I realize that the above SQL doesn't give me the counts but I am doing the counts in Tableau. Happy to have the format above in case I need to do other calculations on the new "for_count" field.

Poor performance on Amazon Redshift queries based on VARCHAR size

I'm building an Amazon Redshift data warehouse, and experiencing unexpected performance impacts based on the defined size of the VARCHAR column. Details are as follows. Three of my columns are shown from pg_table_def:
schemaname | tablename | column | type | encoding | distkey | sortkey | notnull
------------+-----------+-----------------+-----------------------------+-----------+---------+---------+---------
public | logs | log_timestamp | timestamp without time zone | delta32k | f | 1 | t
public | logs | event | character varying(256) | lzo | f | 0 | f
public | logs | message | character varying(65535) | lzo | f | 0 | f
I've recently run Vacuum and Analyze, I have about 100 million rows in the database, and I'm seeing very different performance depending on which columns I include.
Query 1:
For instance, the following query takes about 3 seconds:
select log_timestamp from logs order by log_timestamp desc limit 5;
Query 2:
A similar query asking for more data runs in 8 seconds:
select log_timestamp, event from logs order by log_timestamp desc limit 5;
Query 3:
However, this query, very similar to the previous, takes 8 minutes to run!
select log_timestamp, message from logs order by log_timestamp desc limit 5;
Query 4:
Finally, this query, identical to the slow one but with explicit range limits, is very fast (~3s):
select log_timestamp, message from logs where log_timestamp > '2014-06-18' order by log_timestamp desc limit 5;
The message column is defined to be able to hold larger messages, but in practice it doesn't hold much data: the average length of the message field is 16 characters (std_dev 10). The average length of the event field is 5 characters (std_dev 2). The only distinction I can really see is the max length of the VARCHAR field, but I wouldn't think that should have an order-of-magnitude effect on the time a simple query takes to return!
Any insight would be appreciated. While this isn't the typical use case for this tool (we'll be aggregating far more than we'll be inspecting individual logs), I'd like to understand any subtle or not-so-subtle effects of my table design.
Thanks!
Dave
Redshift is a "true columnar" database and only reads the columns that are specified in your query. So when you specify 2 small columns, only those 2 columns have to be read at all. However, when you add in the 3rd, large column, the work that Redshift has to do increases dramatically.
This is very different from a "row store" database (SQL Server, MySQL, Postgres, etc.) where the entire row is stored together. In a row store, adding or removing query columns does not make much difference in response time, because the database has to read the whole row anyway.
Finally, the reason your last query is very fast is that you've told Redshift it can skip a large portion of the data. Redshift stores each column in "blocks", and these blocks are sorted according to the sort key you specified. Redshift keeps a record of the min/max of each block and can skip over any blocks that could not contain data to be returned.
The limit clause doesn't reduce the work that has to be done, because you've told Redshift that it must first order everything by log_timestamp descending. The problem is that your ORDER BY … DESC has to be executed over the entire potential result set before any data can be returned or discarded. When the columns are small that's fast; when they're big it's slow.
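If you want to see the effect directly, counting the 1 MB blocks used per column should show the message column dominating the table. A sketch only; it assumes the standard STV_BLOCKLIST and SVV_TABLE_INFO system views and that the table is named logs:
-- blocks (1 MB each) used per column of the logs table
select col, count(*) as blocks
from stv_blocklist
where tbl = (select table_id from svv_table_info where "table" = 'logs')
group by col
order by col;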
Out of curiosity, how long does this take?
select l.log_timestamp, l.message
from logs l
join (select min(log_timestamp) as min_ts
      from (select log_timestamp
            from logs
            order by log_timestamp desc
            limit 5
           ) t
     ) lt
  on l.log_timestamp >= lt.min_ts;

Cumulative average number of records created for specific day of week or date range

Yeah, so I'm filling out a requirements document for a new client project and they're asking for growth trends and performance expectations calculated from existing data within our database.
The best source of data for something like this would be our logs table as we pretty much log every single transaction that occurs within our application.
Now, here's the issue: I don't have a whole lot of experience with MySQL when it comes to calculating cumulative sums and running averages. I've thrown together the following query, which kind of makes sense to me, but it just keeps locking up the command console. The thing takes forever to execute, and there are only 80k records in the test sample.
So, given the following basic table structure:
id | action | date_created
1 | 'merp' | 2007-06-20 17:17:00
2 | 'foo' | 2007-06-21 09:54:48
3 | 'bar' | 2007-06-21 12:47:30
... thousands of records ...
3545 | 'stab' | 2007-07-05 11:28:36
How would I go about calculating the average number of records created for each given day of the week?
day_of_week | average_records_created
1 | 234
2 | 23
3 | 5
4 | 67
5 | 234
6 | 12
7 | 36
I have the following query which makes me want to murderdeathkill myself by casting my body down an elevator shaft... and onto some bullets:
SELECT
    DISTINCT(DAYOFWEEK(DATE(t1.datetime_entry))) AS day_of_week,
    AVG((SELECT COUNT(*) FROM VMS_LOGS t2
         WHERE DAYOFWEEK(DATE(t2.datetime_entry)) = DAYOFWEEK(DATE(t1.datetime_entry)))) AS average_records_created
FROM VMS_LOGS t1
GROUP BY day_of_week;
Halps? Please, don't make me cut myself again. :'(
How far back do you need to go when sampling this information? This solution works as long as it's less than a year.
Because day of week and week number are constant for a record, create a companion table that has the ID, WeekNumber, and DayOfWeek. Whenever you want to run this statistic, just generate the "missing" records from your master table.
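A sketch of that companion table and the back-fill, using the VMS_LOGS / datetime_entry names from the queries below (the companion table's columns are assumptions):
CREATE TABLE MyCompanionTable (
    id         INT PRIMARY KEY,
    WeekNumber INT,
    DayOfWeek  INT
);
-- generate the "missing" records from the master table;
-- the primary key makes re-runs skip rows that already exist
INSERT IGNORE INTO MyCompanionTable (id, WeekNumber, DayOfWeek)
SELECT id, WEEK(datetime_entry), DAYOFWEEK(datetime_entry)
FROM VMS_LOGS;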
Then, your report can be something along the lines of:
select
DayOfWeek
, count(*)/count(distinct(WeekNumber)) as Average
from
MyCompanionTable
group by
DayOfWeek
Of course if the table is too large, then you can instead pre-summarize the data on a daily basis and just use that, and add in "today's" data from your master table when running the report.
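For instance, a daily pre-summary could look like this; the daily_log_counts table is hypothetical, and the VMS_LOGS / datetime_entry names follow the queries below:
CREATE TABLE daily_log_counts (
    log_date        DATE PRIMARY KEY,
    records_created INT
);
INSERT INTO daily_log_counts (log_date, records_created)
SELECT DATE(datetime_entry), COUNT(*)
FROM VMS_LOGS
GROUP BY DATE(datetime_entry);
-- the day-of-week averages then come from the small summary table
SELECT DAYOFWEEK(log_date) AS day_of_week,
       AVG(records_created) AS average_records_created
FROM daily_log_counts
GROUP BY DAYOFWEEK(log_date);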
I rewrote your query as:
SELECT x.day_of_week,
AVG(x.count) 'average_records_created'
FROM (SELECT DAYOFWEEK(t.datetime_entry) 'day_of_week',
COUNT(*) 'count'
FROM VMS_LOGS t
GROUP BY DAYOFWEEK(t.datetime_entry)) x
GROUP BY x.day_of_week
The reason your query takes so long is the inner select: you are essentially running 6,400,000,000 queries. With a query like this, your best solution may be to develop a timed reporting system, where the user receives an email when the query is done and the report is constructed, or the user logs in and checks the report after.
Even with the optimization written by OMG Ponies (below), you are still looking at around the same number of queries.
SELECT x.day_of_week,
AVG(x.count) 'average_records_created'
FROM (SELECT DAYOFWEEK(t.datetime_entry) 'day_of_week',
COUNT(*) 'count'
FROM VMS_LOGS t
GROUP BY DAYOFWEEK(t.datetime_entry)) x
GROUP BY x.day_of_week