Select a large number of ids from a Hive table - sql

I have a large table with format similar to
+-----+------+------+
|ID |Cat |date |
+-----+------+------+
|12 | A |201602|
|14 | B |201601|
|19 | A |201608|
|12 | F |201605|
|11 | G |201603|
+-----+------+------+
and I need to select entries based on a list of around 5000 thousand IDs. The straightforward way would be to put the list in a WHERE clause, but that would perform very badly and probably would not even work. How can I do this selection?

With a partitioned table, queries run fast. Once you have partitioned the table, add your IDs to the WHERE clause.
You can also extract a subtable from the original one by selecting all the rows whose IDs lie between the min and the max of your ID list.
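
A rough sketch of the second suggestion: keep the IDs in a small table of their own, restrict the big table to the min/max range, and join for the exact match. It runs against SQLite through Python's sqlite3 module rather than Hive, so treat it as an illustration only. The names entries, wanted_ids and select_by_ids are made up for the example, and the helper-table join is an assumption on my part, not something the answer above spells out.

```python
import sqlite3

def select_by_ids(conn, ids):
    """Return rows of `entries` whose id is in `ids` (hypothetical helper)."""
    cur = conn.cursor()
    # Load the id list into its own table instead of a huge WHERE ... IN (...).
    cur.execute("CREATE TEMP TABLE wanted_ids (id INTEGER PRIMARY KEY)")
    cur.executemany("INSERT OR IGNORE INTO wanted_ids (id) VALUES (?)",
                    [(i,) for i in ids])
    rows = cur.execute(
        """
        SELECT e.id, e.cat, e.date
        FROM entries e
        JOIN wanted_ids w ON w.id = e.id      -- exact match (assumption)
        WHERE e.id BETWEEN ? AND ?            -- min/max range from the answer
        ORDER BY e.id, e.date
        """,
        (min(ids), max(ids)),
    ).fetchall()
    cur.execute("DROP TABLE wanted_ids")
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER, cat TEXT, date TEXT)")
conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", [
    (12, "A", "201602"), (14, "B", "201601"), (19, "A", "201608"),
    (12, "F", "201605"), (11, "G", "201603"),
])

# IDs 12 fall inside the 11..14 range but are not in the list, so the join drops them.
result = select_by_ids(conn, [11, 14])
print(result)
```

The range filter alone would still return the rows with ID 12, which is why the sketch keeps the join for the exact match.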

Related

SQL Architecture design

I have the following tables:
+---+---------+
|id | name | foreign_key1 = this table's id
+---+---------+
|1 | White |
|2 | Black |
+---+---------+
+----+------+-------------+
|id | name | foreign_key1|
+----+------+-------------+
|1 | Grey | 1 |
|2 | Grey | 2 |
+----+------+-------------+
Is there a way that I could persist the last table's information with only one row? So that table could represent that grey is both white and black in one row?
You could use an array-like column type (string) and make it a one-row record, but I wouldn't suggest that; it's better to keep them as separate rows. Your approach is fine, but I'd suggest (if I've understood your idea) a slightly different schema:
You can make two tables: Colors, and Related_Colors, like this:
Colors
+---+---------+
|id | name |
+---+---------+
|1 | White |
|2 | Black |
|3 | Gray |
+---+---------+
Related_Colors
+---+---------+---------+
|id |color1_id|color2_id|
+---+---------+---------+
|1 |3 |1 |
|2 |3 |2 |
+---+---------+---------+
You could; it's called denormalization (https://en.wikipedia.org/wiki/Denormalization). Essentially you would need to create (in the second table) a column for each one of the possible IDs in the first table. So the schema of the second table would be:
ID
Name
White
Black
This also explains why you probably should not do it; what happens when you want to add another ID in the first table (e.g. purple)? As things are you would just need to add another row in table 1, and reference it from the relevant rows. If you denormalize this way, you would need to change the schema to accommodate the new possible value. The new column would of course be empty for most rows.
Another possibility would be to maintain the values as a concatenated string; so the schema would be
ID
Name
List Of IDs From Table1
And in this case the last field would contain White, Black. The drawback of this approach is that you can no longer query efficiently by the values from table1. (You can't properly index that field)
Ultimately the question is what your needs are. If you need to read rows quickly, and have them in a 'reporting friendly' format, denormalization may work for you. But in most DB design cases it would not be required.

Database Sql Chain Integers in row

At the moment I have one integer for each user in my database.
Which datatype would I have to use if I want to chain multiple integers for a single user in my database?
Now:
__________________________________
Users Numbers
Tom 2
__________________________________
What I want:
__________________________________
Users Numbers
Tom 2,12
__________________________________
As @jarlh stated, you shouldn't design your database so that a column contains a set of data.
A relational database column should hold a single value of a single kind in each row, not a set of values or different kinds of data across rows.
To fix this, you can create another table named Numbers and associate it with your Users table through a 1:N (one-to-many) relation, as shown here:
_Users___ _Numbers________________
|ID |name | |NumberID |UserID |value |
|1 |Tom | |1 |3 |243 |
|2 |Jess | |2 |1 |12 |
|3 |Luis | |3 |2 |87 |
In the Users table you have an ID and the name; in your Numbers table you associate each number (one new row per number) with its owner through the foreign key UserID.
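
A small sketch of that 1:N layout, using SQLite through Python's sqlite3 module. The snake_case column names and the constraints are assumptions for the example. It stores Tom's two numbers as two rows and only builds the "2,12" string when reading them back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE numbers (
    number_id INTEGER PRIMARY KEY,
    user_id   INTEGER NOT NULL REFERENCES users(id),
    value     INTEGER NOT NULL
);
INSERT INTO users (id, name) VALUES (1, 'Tom');
-- One row per number; both belong to Tom through user_id.
INSERT INTO numbers (user_id, value) VALUES (1, 2), (1, 12);
""")

values = [row[0] for row in conn.execute(
    "SELECT value FROM numbers WHERE user_id = ? ORDER BY number_id", (1,)
)]
# Build the display string in the application, not in the column.
as_text = ",".join(str(v) for v in values)
print(values, as_text)
```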

Is it possible to join 2 tables if one of them has duplicate values in the column that's joining?

Let's say I'm joining the Shoes table to the Clothes table. They both have a column called ShoesID, so to me it would make sense to join the two tables on ShoesID (it also happens to be the primary key of the Shoes table). But here's my problem: it's not the primary key of the Clothes table, so some ShoesID values repeat in the Clothes table, and that's ruining my join.
Is there a way to get around that?
Clothes Table
ClothesID ShoesID NakedVarchar
99 |1 | e|
100 |1 | f|
101 |4 | g|
102 |4 | d|
I want to join this to this:
Shoes Table
ShoesID Descriptionvarchar
|1 | a|
|2 | b|
|3 | c|
|4 | d|
So I figured the logical way of doing this would be:
LEFT JOIN Clothes ON Shoes.ShoesID = Clothes.ShoesID
Unfortunately, because the Clothes table contains duplicates, it seems Postgres cuts them out.
I'd like all the data to be joined, including the duplicates. How can I get around this?
It's not as simple as reversing my join statement, as I'm technically trying to join them in a big query that has many other joins.
You can make any join you want, even a "self-join", but if there is no match between the key values, your result could be empty. If you want to list them all, you should use a UNION instead of a join.
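
For what it's worth, a LEFT JOIN on its own returns one output row per matching pair, so repeated ShoesID values in Clothes should show up as repeated rows rather than being cut out. A quick sketch with the sample data, using SQLite through Python's sqlite3 module as a stand-in for Postgres (the snake_case table and column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shoes (shoes_id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE clothes (clothes_id INTEGER PRIMARY KEY, shoes_id INTEGER, naked TEXT);
INSERT INTO shoes VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
INSERT INTO clothes VALUES (99, 1, 'e'), (100, 1, 'f'), (101, 4, 'g'), (102, 4, 'd');
""")

rows = conn.execute("""
    SELECT s.shoes_id, s.description, c.clothes_id, c.naked
    FROM shoes s
    LEFT JOIN clothes c ON s.shoes_id = c.shoes_id
    ORDER BY s.shoes_id, c.clothes_id
""").fetchall()

for row in rows:
    print(row)
```

Shoes 1 and 4 each come back twice (once per Clothes row), and shoes 2 and 3 come back once with NULLs. If rows go missing in the real query, one of the other joins, a WHERE condition, a DISTINCT or a GROUP BY may be the place to look, though that is a guess without seeing the full statement.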

How to concat all of a select sql query

I have an Access 2000 database and I want to write a SQL query that runs one SELECT and, for each row returned, uses that row's ID to run a nested SELECT and concatenates all of its results. The IDs are linked as a relationship, so I just need to concatenate all the results of the nested second SELECT.
So if the tables are like this...
Table 1 Table 2
|ID | First Name| |ID | Notes|
----------------- ------------
|1 | Mike | |1 | testing|
|2 | Alex | |1 | test2 |
|3 | Jon | |2 | testing|
so when the query is called it returns
1 mike testing test2
2 alex testing
3 jon
A LEFT JOIN or INNER JOIN, such as can be built in the query design window, is only going to get you so far. It seems from the above that you also wish to concatenate several rows in Table 2 when the ID is the same. This cannot be done with Access (Jet) SQL alone; you will need a user-defined function (UDF). You will find two examples here, and a search for concatenate + Access should return others.
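
If the concatenation can happen outside the database, the same split applies: let a LEFT JOIN fetch the rows and do the concatenation in code. The sketch below is not an Access UDF. It uses Python with SQLite as a stand-in, and the table and column names (people, notes, person_id) are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY, first_name TEXT);
CREATE TABLE notes (note_id INTEGER PRIMARY KEY, person_id INTEGER, note TEXT);
INSERT INTO people VALUES (1, 'Mike'), (2, 'Alex'), (3, 'Jon');
INSERT INTO notes (person_id, note) VALUES (1, 'testing'), (1, 'test2'), (2, 'testing');
""")

# The join gets one row per (person, note); people without notes get NULL.
rows = conn.execute("""
    SELECT p.id, p.first_name, n.note
    FROM people p
    LEFT JOIN notes n ON n.person_id = p.id
    ORDER BY p.id, n.note_id
""").fetchall()

# Concatenate the notes per person in application code.
grouped = {}
for person_id, first_name, note in rows:
    entry = grouped.setdefault(person_id, (first_name, []))
    if note is not None:
        entry[1].append(note)

result = [(pid, name, " ".join(notes)) for pid, (name, notes) in grouped.items()]
for line in result:
    print(line)
```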

complex'ish SQL joins across multiple tables with multiple conditions across all tables

Given the following tables:
labels               tags_labels
|id   |name   |      |url   |labelid |
|-----|-------|      |------|--------|
|1    |punk   |      |/a/b  |1       |
|2    |ska    |      |/a/c  |2       |
|3    |stuff  |      |/a/b  |3       |
                     |/a/z  |4       |
artists tags
|id |name | |url |artistid |albumid |
|----|--------| |------|-----------|---------|
|1 |Foobar | |/a/b |1 |2637 |
|2 |Barfoo | |/a/z |2 |23 |
|3 |Spongebob| |/a/c |1 |32 |
I would like to get a list of urls that match a couple of conditions (which can be entered by the user into the script that uses these statements).
For example, the user might want to list all urls that have the labels "(1 OR 2) AND 3", but only if they are by the artists "Spongebob OR Whatever".
Is it possible to do this within a single statement using inner/harry potter/cross/self JOINs?
Or would I have to spread the query across multiple statements and buffer the results inside my script?
Edit:
And if it is possible, what would the statement look like? :p
Yes, you can do this in one query. An efficient way might be to generate the SQL statement dynamically, based on the conditions the user entered.
This query lets you filter by label name or artist name.
Building the SQL dynamically from the user's parameters, or passing the desired parameters into a stored procedure, would obviously change the WHERE clauses, but that really depends on how dynamic your 'script' must be...
SELECT tl.url
FROM labels l INNER JOIN tags_labels tl ON l.id = tl.labelid
WHERE l.name IN ('ska','stuff')
UNION (
SELECT t.url
FROM artists a INNER JOIN tags t ON a.id = t.artistid
WHERE a.name LIKE '%foo%'
)
Good Luck!
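
For the "(1 OR 2) AND 3" part specifically, a UNION gives OR semantics, so here is one possible single-statement sketch using GROUP BY and HAVING instead. It runs on SQLite through Python's sqlite3 module; the helper name urls_matching is made up, the label condition is hard-coded for the example, and the artist filter uses "Foobar" because Spongebob has no tags in the sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE labels (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tags_labels (url TEXT, labelid INTEGER);
CREATE TABLE artists (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tags (url TEXT, artistid INTEGER, albumid INTEGER);
INSERT INTO labels VALUES (1, 'punk'), (2, 'ska'), (3, 'stuff');
INSERT INTO tags_labels VALUES ('/a/b', 1), ('/a/c', 2), ('/a/b', 3), ('/a/z', 4);
INSERT INTO artists VALUES (1, 'Foobar'), (2, 'Barfoo'), (3, 'Spongebob');
INSERT INTO tags VALUES ('/a/b', 1, 2637), ('/a/z', 2, 23), ('/a/c', 1, 32);
""")

def urls_matching(conn, artist_names):
    """URLs with labels (1 OR 2) AND 3, by any of the given artists (hypothetical helper)."""
    # Only the placeholders are generated; the values are bound as parameters.
    placeholders = ", ".join("?" for _ in artist_names)
    sql = f"""
        SELECT tl.url
        FROM tags_labels tl
        JOIN tags t    ON t.url = tl.url
        JOIN artists a ON a.id = t.artistid
        WHERE a.name IN ({placeholders})
        GROUP BY tl.url
        HAVING SUM(CASE WHEN tl.labelid IN (1, 2) THEN 1 ELSE 0 END) > 0
           AND SUM(CASE WHEN tl.labelid = 3 THEN 1 ELSE 0 END) > 0
        ORDER BY tl.url
    """
    return [row[0] for row in conn.execute(sql, list(artist_names))]

print(urls_matching(conn, ["Foobar", "Whatever"]))
print(urls_matching(conn, ["Spongebob", "Whatever"]))
```

Each AND-ed label condition becomes one HAVING term, so the script can build that clause from whatever the user entered.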