Join two hive tables and search for a string

I am new to Hive and I have two tables that contain access logs, created like this:
CREATE EXTERNAL TABLE rwloglines(line string) STORED AS TEXTFILE LOCATION 'hdfs:///rwlogs'
CREATE EXTERNAL TABLE dpxloglines(line string) STORED AS TEXTFILE LOCATION 'hdfs:///dpxlogs'
Both of these will contain an ID made up of 20 characters from [A-Z][0-9]. I want to join these two tables and search for the ID. What is the query I should write in Hive?
Can someone please help me?

The easiest approach would be to split the contents of each file into separate columns such as id, ip address, error message, etc., and then load the data into a Hive table that specifies these columns in its schema.
Then:
select a.id from rwloglines a join dpxloglines b on a.id = b.id where a.id = '';
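As a minimal sketch of that approach (the parsed table names and the regexp are assumptions, based on the ID being described as 20 characters of [A-Z][0-9]), you could extract the ID from each raw line and join on it:
-- parse the raw lines into an id column (placeholder pattern: 20 chars of A-Z/0-9)
create table rwlog_parsed as
select regexp_extract(line, '([A-Z0-9]{20})', 1) as id, line from rwloglines;
create table dpxlog_parsed as
select regexp_extract(line, '([A-Z0-9]{20})', 1) as id, line from dpxloglines;
-- join on the extracted id and filter for the one you are searching for
select a.id, a.line as rw_line, b.line as dpx_line
from rwlog_parsed a
join dpxlog_parsed b on a.id = b.id
where a.id = 'ABCDEFGHIJ0123456789';  -- placeholder ID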

Is there a query to retrieve all source table names and target table names of a particular mapping in informatica?

Is there any query that returns the source table names and target table names for a given mapping or mapping ID in Informatica? This has been very hard and challenging.
For example, when I search with SELECT * FROM opb_mapping WHERE mapping_name LIKE '%CY0..%'
it returns some details, but I cannot find the source table names and target table names. Help if you can.
Thanks.
You can use the below view to get the data (assuming you have full access to the metadata tables):
select source_name, source_field_name,
target_name, target_column_name,
mapping_name, subject_name folder
from REP_FLD_MAPPING
where mapping_name like '%xx%';
The only issue I can see is that if you have a SQL override, then you need to check that SQL for the true source.
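If you only need the table-level names rather than every field, a DISTINCT over the same view should work (a sketch, again assuming access to the metadata views):
select distinct source_name, target_name, mapping_name, subject_name folder
from REP_FLD_MAPPING
where mapping_name like '%xx%';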

Nested select statement in regexp_like does not resolve correctly

I'm very new to AWS &amp; Athena. I'm using Athena to query a data file (CSV) from S3, using Glue crawlers to create the catalog and then querying that information. I have the catalog table created by Glue, containing fName, sName, and mName columns. I'm trying to search for a regexp pattern across all the rows and columns with a single query.
I have created a second table containing the column names of the primary table, i.e. fName, sName, mName.
I would like to loop through the second table's rows, using each value in my regexp_like function to search for any names starting with 'B',
e.g.
where regexp_like(fname,'^B')
where regexp_like(sname,'^B')
where regexp_like(mname,'^B')
and display all of them.
Is this possible? I have not been able to get the first query working even when hardcoding the search criteria
e.g.
select * from primary_table
where regexp_like((Select column from secondary_table where column_name='fname'),'^B')
In the SQL above, Select column from secondary_table where column_name='fname' resolves to the string 'fname', not to the fname column in the primary table.
Storing column names and table names as data is generally not recommended. In order to use the information, you need to use dynamic SQL -- that is, construct the query as a string and execute that.
You can get something similar using a lot of logic, but you have to check each column explicitly:
select p.*
from primary p join
     secondary s
     on (s.column_name = 'fname' and regexp_like(p.fname, '^B')) or
        (s.column_name = 'sname' and regexp_like(p.sname, '^B')) or
        (s.column_name = 'mname' and regexp_like(p.mname, '^B'));
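If the set of searchable columns is known and fixed, the secondary table isn't strictly needed; a plain OR over the columns from the question gives the same effect:
select *
from primary_table
where regexp_like(fname, '^B')
   or regexp_like(sname, '^B')
   or regexp_like(mname, '^B');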

select row from orc snappy table in hive

I have created a table employee_orc which is orc format with snappy compression.
create table employee_orc(emp_id string, name string)
row format delimited fields terminated by '\t' stored as orc tblproperties("orc.compress"="SNAPPY");
I have uploaded data into the table using the insert statement.
employee_orc table has 1000 records.
When I run the query below, it shows all the records:
select * from employee_orc;
But when I run the query below, it shows zero results even though the record exists:
select * from employee_orc where emp_id = "EMP456";
Why am I unable to retrieve a single record from the employee_orc table?
The record does not exist. You may think the values are the same because they look the same, but there is some difference. One possibility is spaces at the beginning or end of the string. For this, you can use like:
where emp_id like '%EMP456%'
This might help you.
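If leading or trailing spaces turn out to be the difference, trimming the column before comparing is another option (a sketch using Hive's built-in trim()):
select * from employee_orc where trim(emp_id) = 'EMP456';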
For my part, I don't understand why you want to specify a delimiter with ORC. Are you confusing CSV and ORC, or external vs. managed tables?
I advise you to create your table differently:
create table employee_orc(emp_id string, name string)
stored as ORC
TBLPROPERTIES (
"orc.compress"="ZLIB");

Bigquery : get the name of the table as column value

I have a dataset that contains several tables that have suffixes in their name:
table_src1_serie1
table_src1_serie2
table_src2_opt1
table_src2_opt2
table_src3_type1_v1
table_src3_type2_v1
table_src3_type2_v2
I know that I can use this type of query in BQ:
select * from `project.dataset.table_*`
to get all the rows from these different tables.
What I am trying to achieve is to have a column that contains, for instance, the type of source (src1, src2, src3).
Assuming the schema of all tables is the same, you can add the below to your select list (for BigQuery Standard SQL):
SPLIT(_TABLE_SUFFIX, '_')[SAFE_OFFSET(0)] AS src
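Putting that together with the wildcard query from the question (project and dataset names are placeholders), a full query might look like this:
SELECT *, SPLIT(_TABLE_SUFFIX, '_')[SAFE_OFFSET(0)] AS src
FROM `project.dataset.table_*`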

Redshift showing 0 rows for external table, though data is viewable in Athena

I created an external table in Redshift and then added some data to the specified S3 folder. I can view all the data perfectly in Athena, but I can't seem to query it from Redshift. What's weird is that select count(*) works, so it can find the data, but it can't actually show anything. I'm guessing it's a misconfiguration somewhere, but I'm not sure what.
Some details that may be relevant (I anonymized some values):
create external schema spectrum_staging
from data catalog
database 'spectrum_db'
iam_role 'arn:aws:iam::############:role/RedshiftSpectrumRole'
create external database if not exists;
create external table spectrum_staging.errors(
id varchar(100),
error varchar(100))
stored as parquet
location 's3://mybucket/errors/';
My sample data is stored in s3://mybucket/errors/2018-08-27-errors.parquet
This query works:
db=# select count(*) from spectrum_staging.errors;
count
-------
11
(1 row)
This query does not:
db=# select * from spectrum_staging.errors;
id | error
----+-------
(0 rows)
Check your parquet file and make sure the column data types in the Spectrum table match up.
Then run SELECT pg_last_query_id(); after your query to get the query number and look in the system tables STL_S3CLIENT and STL_S3CLIENT_ERROR to find further details about the query execution.
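For example, the lookup could be sketched like this (the query number is whatever pg_last_query_id() returns; 12345 is a placeholder):
-- run the problem query, then grab its query number
select * from spectrum_staging.errors;
select pg_last_query_id();
-- inspect the S3 client logs for that query
select * from stl_s3client where query = 12345;
select * from stl_s3client_error where query = 12345;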
You don't need to define external tables when you have defined an external schema based on the Glue Data Catalog. Redshift Spectrum picks up all the tables that are in the catalog.
What's probably going on there is that you somehow have two things with the same name and in one case it picks it up from the data catalog and in the other case it tries to use the external table.
Check these tables from Redshift side to get a better view of what's there:
select * from SVV_EXTERNAL_SCHEMAS
select * from SVV_EXTERNAL_TABLES
select * from SVV_EXTERNAL_PARTITIONS
select * from SVV_EXTERNAL_COLUMNS
And these tables for queries that use the tables from external schema:
select * from SVL_S3QUERY_SUMMARY
select * from SVL_S3LOG order by eventtime desc
select * from SVL_S3QUERY where query = xyz
select * from SVL_S3PARTITION where query = xyz
Was there ever a resolution for this? A year down the line, I have the same problem today.
Nothing stands out in terms of schema differences, but an error does exist:
select recordtime, file, process, errcode, linenum as line,
trim(error) as err
from stl_error order by recordtime desc;
/home/ec2-user/padb/src/sys/cg_util.cpp padbmaster 1 601 Compilation of segment failed: /rds/bin/padb.1.0.10480/data/exec/227/48844003/de67afa670209cb9cffcd4f6a61e1c32a5b3dccc/0
Not sure what this means.
I encountered a similar issue when creating an external table in Athena using the RegexSerDe row format. I was able to query this external table from Athena without any issues. However, when querying the external table from Redshift the results were null.
I resolved it by converting to Parquet format, as Spectrum cannot handle regular expression serialization.
See link below:
Redshift spectrum shows NULL values for all rows