Pig Latin: using a field in one table as a position value to access data in another table - apache-pig

Let's say we have two tables. The first table has the following description:
animal_count: {zoo_name:chararray, counts:()}
The meaning of the "zoo_name" field is obvious. The "counts" field is a tuple containing the count of each animal species. To know which species a given field in the "counts" tuple represents, we use another table:
species_position : {species:chararray, position:int}
Let's assume we have the following data in the "species_position" table:
"tiger", 0
"elephant", 1
"lion", 2
This data means the first field in animal_count.counts is the number of tigers in a given zoo. The second field in that tuple is the number of elephants, and so on. So, if we want to represent the fact that "san diego zoo" has 2 tigers, 4 elephants and no lions, we will have the following data in the "animal_count" table:
"san diego zoo", (2, 4, 0)
Given this setup, how can I write a query to extract the number of a given species in all zoos? I was hoping for something like:
FOREACH species_position GENERATE species, animal_count.counts.$position;
Of course, the "animal_count.counts.$position" won't work.
Is this possible without resorting to a UDF?
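To pin down the result being asked for, here is a minimal sketch in plain Python. It is not Pig Latin and does not settle the no-UDF question; it only spells out the intended lookup, using in-memory lists as hypothetical stand-ins for the two relations.

```python
# Hypothetical in-memory stand-ins for the two relations in the question.
species_position = [("tiger", 0), ("elephant", 1), ("lion", 2)]
animal_count = [("san diego zoo", (2, 4, 0))]


def counts_by_species(species_position, animal_count):
    """Pair every species with every zoo and pick counts[position]."""
    rows = []
    for species, position in species_position:
        for zoo_name, counts in animal_count:
            # The position stored in one relation indexes the tuple in the other.
            rows.append((species, zoo_name, counts[position]))
    return rows


result = counts_by_species(species_position, animal_count)
for row in result:
    print(row)
```

The nested loop is effectively a cross join followed by a positional lookup, which is the part that "animal_count.counts.$position" was meant to express.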

Related

In Google BigQuery what is the benefit of ARRAY of STRUCT vs STRUCT of ARRAY?

I'm just getting to grips with ARRAYs and STRUCTs in BigQuery.
I'm wondering why one would choose this formatting:
SELECT [STRUCT('Alice' AS col_1, 'Bob' AS col_2),STRUCT('Charlie' AS col_1, 'David' AS col_2)] AS names;
over formatting like this
SELECT STRUCT(['Alice','Charlie'] AS col_1, ['Bob','David'] AS col_2) AS names;
For both, the output looks the same. What would be an example of why you would use one over the other? To me the first example makes more sense because I'd want Alice and Bob to be on the same record, and that is clearer in the first example. However, I've seen that Google's Vertex AI prediction output uses the second form. I.e., when outputting binary predictions, it outputs
SELECT STRUCT([0,1] AS class, [0.8,0.2] AS col_2) AS prediction;
instead of
SELECT [STRUCT(0 AS class, 0.8 AS prediction),STRUCT(1 AS class, 0.2 AS prediction)] AS names;
When is the right time to use each?
It should be based on what a row/entity represents.
In your first example, each element of the array could be a pair/team and you can add as many as you want. Alice and Bob are pair 1, Charlie and David are pair 2...
The schema would be:
names (repeated field - struct)
- col_1 (string)
- col_2 (string)
In your second example, you have one entity with two kinds of names. An example here would be a list of names you like and a list of names you don't like. Alice and Charlie would be liked (col_1), and you can add as many names as you want to that list; Bob and David would be names that you don't like (col_2), and you can add as many there as well.
The schema would be:
names.col_1 (repeated field - string)
names.col_2 (repeated field - string)
In JSON format:
Example 1: [{"col_1":"Alice", "col_2":"Bob"}, {"col_1":"Charlie", "col_2":"David"}]
Example 2: {"col_1":["Alice", "Charlie"], "col_2":["Bob", "David"]}
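The two JSON shapes can also be compared side by side in code. The following is a small Python sketch (plain dictionaries and lists, not the BigQuery client) that converts between the two layouts; the helper names are made up for illustration.

```python
import json

# Example 1: ARRAY of STRUCT, one element per pair.
array_of_structs = [
    {"col_1": "Alice", "col_2": "Bob"},
    {"col_1": "Charlie", "col_2": "David"},
]

# Example 2: STRUCT of ARRAY, one independent list per field.
struct_of_arrays = {
    "col_1": ["Alice", "Charlie"],
    "col_2": ["Bob", "David"],
}


def to_struct_of_arrays(rows):
    """Collect each field of the row records into its own list."""
    return {key: [row[key] for row in rows] for key in rows[0]}


def to_array_of_structs(columns):
    """Zip the lists together; zip stops at the shortest list."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*columns.values())]


print(json.dumps(to_struct_of_arrays(array_of_structs)))
print(json.dumps(to_array_of_structs(struct_of_arrays)))
```

The conversion back to an array of structs is only meaningful when the lists line up element by element. In the struct-of-arrays form nothing forces col_1 and col_2 to have the same length, which is exactly why it suits independent lists and the array-of-structs form suits records.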

Merge tables in Power BI

I have a problem creating a table in Power BI/SQL.
Basically, I have a CSV file with a dataset of crime reports from a given year.
Each crime has a date (in 3 columns: day, month and year), a location (as lat and long coordinates), a type of crime, a neighborhood and so on.
To make things less "dense" I created a few tables (for example, a "Location_ID" table with a PK and a combined Lat and Long for each ID), and likewise for dates, types of crime, neighborhoods, etc.
The thing is that now my main table is empty, and I need to "replace" each piece of data with the aforementioned PK from each new table. For example, I have crime N°121, which happened in Buenos Aires, Argentina (that's "3" in the new table ID_LOCATION) on 4/3/2022 (that's "Z" in the new table ID_DATE), and so on.
I don't know how to reassign every value in each column to the correct new value from the tables I created without doing it manually (there are over 80k entries, so it would take forever).
Thanks in advance
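One way to avoid the manual work, sketched here on the SQL side, is to keep the imported CSV as a staging table and fill the main table with a single INSERT ... SELECT that joins the staging rows to each lookup table. The sketch below uses SQLite from Python only so that it is self-contained; every table and column name in it (raw_crimes, id_location, id_date, crimes) is hypothetical, and only two lookups are shown. In Power BI itself, a Merge Queries step in Power Query plays the same role as the joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Staging table: the CSV as imported (hypothetical columns).
    CREATE TABLE raw_crimes (crime_no INTEGER, lat REAL, lon REAL,
                             day INTEGER, month INTEGER, year INTEGER);
    -- Lookup tables built from the distinct values of the staging table.
    CREATE TABLE id_location (id INTEGER PRIMARY KEY, lat REAL, lon REAL);
    CREATE TABLE id_date (id INTEGER PRIMARY KEY,
                          day INTEGER, month INTEGER, year INTEGER);
    -- Main table that stores only keys.
    CREATE TABLE crimes (crime_no INTEGER, location_id INTEGER, date_id INTEGER);

    -- Made-up sample rows: same place, two different dates.
    INSERT INTO raw_crimes VALUES (121, -34.6, -58.4, 4, 3, 2022),
                                  (122, -34.6, -58.4, 5, 3, 2022);
    INSERT INTO id_location (lat, lon)
        SELECT DISTINCT lat, lon FROM raw_crimes;
    INSERT INTO id_date (day, month, year)
        SELECT DISTINCT day, month, year FROM raw_crimes;

    -- One set-based statement replaces every value with its key.
    INSERT INTO crimes (crime_no, location_id, date_id)
        SELECT r.crime_no, l.id, d.id
        FROM raw_crimes r
        JOIN id_location l ON l.lat = r.lat AND l.lon = r.lon
        JOIN id_date d ON d.day = r.day AND d.month = r.month AND d.year = r.year;
""")

crime_rows = conn.execute(
    "SELECT crime_no, location_id, date_id FROM crimes ORDER BY crime_no"
).fetchall()
print(crime_rows)
```

Because the join runs over the whole table at once, 80k rows are handled by the same statement as two rows; each additional lookup (type of crime, neighborhood) would add one more JOIN.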

Ordering based on one value of many

I have three SQL tables. Users, Registration Field Values, and Registration Fields.
Name   zip code   favorite food
Sue    55555      sushi
Gary   12345      eggs
Where zip code and favorite food are different registration fields.
The relationship is a user has many registration field values, and those values belong to the registration field.
I'm wondering how I can order my table based on a certain registration field. For example, selecting "favorite food", I would want "eggs" before "sushi".
This is confusing to me because I've only seen ORDER BY for an individual column or series of columns. I can't just ORDER BY registration_field_value.value because it needs to be based on only one of those registration fields.
This is like "ORDER BY field value where the associated field id is 'favorite food'", although I don't want to filter anything out.
I'm using Postgres if that makes a difference.
You can use a CASE expression to order based on a specific value.
For example:
ORDER BY
  CASE "favorite food"
    WHEN 'eggs' THEN 1
    ELSE 2
  END
The above clause moves rows with eggs to the start, and rows with any other value go to the bottom.
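The CASE clause above assumes "favorite food" is already a column of the result. For the three-table layout described in the question, another possible approach is to LEFT JOIN only the values of the chosen field and sort by that joined value. The sketch below uses SQLite from Python so it can run on its own; the table and column names are assumptions based on the description, not the asker's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema inferred from the question.
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE registration_fields (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE registration_field_values (
        user_id INTEGER, registration_field_id INTEGER, value TEXT);
    INSERT INTO users VALUES (1, 'Sue'), (2, 'Gary');
    INSERT INTO registration_fields VALUES (1, 'zip code'), (2, 'favorite food');
    INSERT INTO registration_field_values VALUES
        (1, 1, '55555'), (1, 2, 'sushi'), (2, 1, '12345'), (2, 2, 'eggs');
""")


def users_ordered_by(field_name):
    """Return user names sorted by their value for one registration field."""
    sql = """
        SELECT u.name
        FROM users u
        LEFT JOIN (
            -- Only the values that belong to the chosen field.
            SELECT v.user_id, v.value
            FROM registration_field_values v
            JOIN registration_fields f ON f.id = v.registration_field_id
            WHERE f.name = ?
        ) sort_value ON sort_value.user_id = u.id
        ORDER BY sort_value.value, u.name
    """
    return [row[0] for row in conn.execute(sql, (field_name,))]


print(users_ordered_by("favorite food"))
```

The LEFT JOIN keeps users who have no value for the chosen field, so nothing is filtered out. Where those users land in the ordering depends on how the database sorts NULLs: Postgres puts them last in ascending order by default, while SQLite puts them first.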

How to match entries in SQL based on their ending letter?

I'm trying to match entries in two tables so that each row of the new table is made up of two words that end in the same letter. I'm working with two tables that each have a single column named word. Table 1 contains the following data, in order: Dog, High, It, Weeks, while table 2 contains: Bat, Is, Laugh, Sing. I need to select from both of these tables and match the words so that the rows are as follows: Dog | Sing, High | Laugh, It | Bat, Weeks | Is
The screenshot is what I have so far for my SQL statement. I'm still early on in learning SQL so any info to help on this would be appreciated.
Recommend reading up on SUBSTR() for more information on why the below code works: https://docs.oracle.com/cd/B28359_01/olap.111/b28126/dml_functions_2101.htm#OLADM679
SELECT
    a.word
  , b.word
FROM sec1313_words1 a
JOIN sec1313_words2 b
  ON SUBSTR(b.word, -1) = SUBSTR(a.word, -1)
ORDER BY a.word
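To try the join without an Oracle instance, here is a small sketch that runs the same statement in SQLite from Python. SQLite's SUBSTR also accepts a negative start position counted from the end of the string, so the query carries over unchanged; the table contents are the sample words from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sec1313_words1 (word TEXT);
    CREATE TABLE sec1313_words2 (word TEXT);
    INSERT INTO sec1313_words1 VALUES ('Dog'), ('High'), ('It'), ('Weeks');
    INSERT INTO sec1313_words2 VALUES ('Bat'), ('Is'), ('Laugh'), ('Sing');
""")

# SUBSTR(word, -1) returns the last character of each word.
pairs = conn.execute("""
    SELECT a.word, b.word
    FROM sec1313_words1 a
    JOIN sec1313_words2 b
      ON SUBSTR(b.word, -1) = SUBSTR(a.word, -1)
    ORDER BY a.word
""").fetchall()

for left, right in pairs:
    print(f"{left} | {right}")
```

With this sample data every last letter is lowercase, so case does not matter; if the real data mixed cases, wrapping both sides in LOWER() would be one way to keep the match from depending on it.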

SQL Server Multiple Likes

I have an unusual question that seems simple but has me stumped in a SQL Server stored procedure.
I have 2 tables as described below.
tblMaster
ID, CommitDate, SubUser, OrigFileName
Sample data
ID CommitDate SubUser OrigFileName
----------------------------------------
1 2014-10-07 Test1 Test1.pdf
2 2014-10-08 Test2 Test2.pdf
3 2014-10-09 Test3 Test3.pdf
The above table is basically the header table that tracks the committed files. In addition to this, we have a details table with the following structure.
tblIndex
ID, FileID (Linking column to the header row described above), Word
Sample data:
ID FileID Word
--------------
1 1 Oil
2 1 oil
3 2 oil
4 2 tank
5 3 tank
The above rows represent the words that we want to search on; if certain criteria match, we return the corresponding filename/header row ID. What I would love to figure out is how to handle these two cases:
- One word is searched for (i.e. "oil"): the system should respond with all the files that meet the criteria (the easiest case, and already figured out).
- More than one word is searched for (i.e. "oil" and "tank"): we should only see the second file, since it is the only one that has both oil and tank as its keywords.
I tried using LIKE "%oil%" AND LIKE "%tank%", and that returned no rows, since one value can't be both oil and tank.
I tried LIKE "%oil%" OR LIKE "%tank%", but I get files 1, 2, and 3, since the OR matches every row that has either word.
One last thing, I recognize I could just do a search for the first term and then save the results into a temp table and then search for the second term in that second table and I will get what I am looking for. The problem with that is that I don't exactly know how many items will be searched for. I don't want to have to create a structure where I am constantly having to store data into another temp table if someone does a search for 6 "keywords".
Any help/ideas will be much appreciated.
Try this, which differs slightly from the other answer:
SELECT FileID, COUNT(DISTINCT t.Word)
FROM tblIndex t
WHERE t.Word LIKE '%oil%' OR t.Word LIKE '%tank%'
GROUP BY FileID
HAVING COUNT(DISTINCT t.Word) > 1
One simple option would be to do something like this:
SELECT FileID
FROM tblIndex t
WHERE t.Word LIKE '%oil%' OR t.Word LIKE '%tank%'
GROUP BY FileID
HAVING COUNT(*) > 1
This assumes you do not have duplicates in tblIndex.
I'm also unsure whether you really need the LIKE or not. According to your sample data you don't; a basic equality comparison would be far more efficient and would avoid possible collisions.
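Both queries hard-code two keywords, while the question asks for an unknown number of them. A possible generalization, following the suggestion to use a plain comparison instead of LIKE, is to match against a list of terms and require the number of distinct matched words to equal the number of terms. The sketch below runs against SQLite from Python rather than SQL Server, and the function name is made up; LOWER() is used so that the sample rows 'Oil' and 'oil' for file 1 count as one word regardless of collation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblIndex (ID INTEGER PRIMARY KEY, FileID INTEGER, Word TEXT);
    INSERT INTO tblIndex VALUES
        (1, 1, 'Oil'), (2, 1, 'oil'), (3, 2, 'oil'), (4, 2, 'tank'), (5, 3, 'tank');
""")


def files_with_all_words(words):
    """Return FileIDs whose index rows cover every word in `words`."""
    # Deduplicate and lowercase so the count below matches the term count.
    terms = sorted({w.lower() for w in words})
    placeholders = ", ".join("?" for _ in terms)
    sql = f"""
        SELECT FileID
        FROM tblIndex
        WHERE LOWER(Word) IN ({placeholders})
        GROUP BY FileID
        HAVING COUNT(DISTINCT LOWER(Word)) = ?
        ORDER BY FileID
    """
    return [row[0] for row in conn.execute(sql, (*terms, len(terms)))]


print(files_with_all_words(["oil"]))
print(files_with_all_words(["oil", "tank"]))
```

The same statement works for one keyword or six, so no chain of temp tables is needed. In a SQL Server stored procedure the term list would have to be passed some other way (for example a table-valued parameter), which this sketch does not cover.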