Pig Query - Giving inconsistent results in AWS EMR

I am new to Pig. I have written a query which is not working as expected. I am trying to process the Google Ngrams dataset provided to me.
I load the data, which is 1 GB:
bigrams = LOAD '$(INPUT)' AS (bigram:chararray, year:int, occurrences:int, books:int);
Then I select a subset which is limited to 2000 entries
limbigrams = LIMIT bigrams 2000;
Then I see the dump of the limited data (pasting sample output):
(GB product,2006,1,1)
(GB product,2007,5,5)
(GB wall_NOUN,2007,27,7)
(GB wall_NOUN,2008,35,6)
(GB2 ,_.,1906,1,1)
(GB2 ,_.,1938,1,1)
Now I do a group by on limbigrams
D = GROUP limbigrams BY bigram;
When I dump D, I see an entirely different data set (sample):
(GLABRIO .,1977,3,3),(GLABRIO .,1992,3,3),(GLABRIO .,1997,1,1),(GLABRIO .,2000,6,6),(GLABRIO .,2001,9,1),(GLABRIO .,2002,24,3),(GLABRIO .,2003,3,1)})
(GLASS FILMS,{(GLASS FILMS,1978,1,1),(GLASS FILMS,1976,2,1),(GLASS FILMS,1970,3,3),(GLASS FILMS,1966,7,1),(GLASS FILMS,1962,1,1),(GLASS FILMS,1958,1,1),(GLASS FILMS,1955,1,1),(GLASS FILMS,1899,2,2),(GLASS FILMS,1986,6,3),(GLASS FILMS,1984,1,1),(GLASS FILMS,1980,7,3)})
I am not attaching the entire output because there is not a single row of overlap between the two outputs (i.e. before and after the GROUP BY), so comparing the full output files would not add anything.
Why does this happen?

The dumps are accurate. The GROUP BY operator in Pig creates a single record for each group and puts every record belonging to that group inside a bag. You can see this in the last record of your second dump: it stands for the group GLASS FILMS and holds a bag containing all the records whose bigram is GLASS FILMS. You can read more about the GROUP operator here: https://www.tutorialspoint.com/apache_pig/apache_pig_group_operator.htm
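For illustration, a minimal sketch of what the grouped relation looks like (relation and field names are taken from the question; the DESCRIBE output shown is the typical form Pig prints, not copied from your run):
D = GROUP limbigrams BY bigram;
DESCRIBE D;
-- typically prints a schema like:
-- D: {group: chararray, limbigrams: {(bigram: chararray, year: int, occurrences: int, books: int)}}
-- To get back one row per original record, flatten the bag:
flat = FOREACH D GENERATE group, FLATTEN(limbigrams);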

Related

Nested SQL query: how to return one sample from each log

I'm trying to implement a nested query to pull no more than one sample per log, and I think I know how to implement its components separately:
Query a set of logs that contain data relevant to my analysis:
SELECT
runs.object_type as object_type,
runs.name as log_name
from project.runs.latest_runs
WHERE object_type = "ROCKET"
group by object_type, log_name
This results in a list of log names, e.g. "log_name_2021_09_01", "log_name_2021_09_03" etc.
Query no more than one event with a specific condition from a single known log:
SELECT
object.path_meters as pos,
object_speed as speed,
log.run as run_name
FROM project.events.last30days
WHERE log.run = "log_name_2021_10_01"
AND object.speed > 0.0
LIMIT 1
The above query returns no more than one sample for the specified log.
How can I combine these queries to pull samples from the set of logs returned by Query 1, with no more than one sample per log?
Update:
Let's say a DB contains three logs:
log_name_2021_09_01. The object_type associated with the log is ROCKET. The log contains 100k data samples: 90k of them have object.speed = 0.0 and 10k have speed > 0.0.
log_name_2021_09_02. The object_type associated with the log is CAR. The log also contains 100k samples in a similar proportion to log 1.
log_name_2021_09_03. The object_type associated with the log is ROCKET. The log also contains 100k samples in a similar proportion to log 1.
I'm only interested in logs with the object type ROCKET. Two logs match this condition: log_name_2021_09_01 and log_name_2021_09_03. These log names can be obtained by Query 1 above. I'd like to pull only one sample point (with speed > 0) from each of the two logs. That is, in the end I'd like a query that returns two samples: one from log_name_2021_09_01 and one from log_name_2021_09_03.
Your question omits actual example data, so we're forced to infer much from your description. It is strongly recommended that questions include sample data and the results you'd want from that sample data: this gives us a concrete example to base our understanding on, and a test set to use when developing an answer.
Would you trust any code you've written without having run it against test data?
That said, the following should be something like what you're looking for... (For each log, it only selects the one row with the highest pos.)
WITH
rocket_logs AS
(
    SELECT DISTINCT
        runs.object_type AS object_type,
        runs.name AS log_name
    FROM
        project.runs.latest_runs
    WHERE
        object_type = "ROCKET"
),
sorted_logs AS
(
    SELECT
        log.run AS run_name,
        object.path_meters AS pos,
        object_speed AS speed,
        ROW_NUMBER()
            OVER (
                PARTITION BY log.run
                ORDER BY object.path_meters DESC
            )
            AS seq_num
    FROM
        project.events.last30days
    WHERE
        object.speed > 0.0
)
SELECT
    *
FROM
    rocket_logs r
INNER JOIN
    sorted_logs s
        ON s.run_name = r.log_name
WHERE
    s.seq_num = 1
For a more exact answer, please give:
example data, for both tables
example results for that data
both being sufficient to demonstrate all necessary behaviours

Unnesting a json in Redshift causing nested loop in the query plan

I have a column in my tables called 'data' with JSONs in it like below:
{"tt":"452.95","records":[{"r":"IN184366","t":"812812819910","s":"129.37","d":"982.7","c":"83"},{"r":"IN183714","t":"8028028029093","s":"33.9","d":"892","c":"38"}]}
I have written code to unnest it into separate columns like tr, r, and s.
Below is the code:
with raw as (
SELECT json_extract_path_text(B.Data, 'records', true) as items
FROM tableB as B where B.date::timestamp between
to_timestamp('2019-01-01 00:00:00','YYYY-MM-DD HH24:MI:SS') AND
to_timestamp('2022-12-31 23:59:59','YYYY-MM-DD HH24:MI:SS')
UNION ALL
SELECT json_extract_path_text(C.Data, 'records', true) as items
FROM tableC as C where C.date-5 between
to_timestamp('2019-01-01 00:00:00','YYYY-MM-DD HH24:MI:SS') AND
to_timestamp('2022-12-31 23:59:59','YYYY-MM-DD HH24:MI:SS')
),
numbers as (
SELECT ROW_NUMBER() OVER (ORDER BY TRUE)::integer- 1 as ordinal
FROM <any_random_table> limit 1000
),
joined as (
select raw.*,
json_array_length(raw.items, true) as number_of_items,
json_extract_array_element_text(
raw.items,
numbers.ordinal::int,
true
) as item
from raw
cross join numbers
where numbers.ordinal <
json_array_length(raw.items, true)
),
parsed as (
SELECT J.*,
json_extract_path_text(J.item, 'tr',true) as tr,
json_extract_path_text(J.item, 'r',true) as r,
json_extract_path_text(J.item, 's',true)::float8 as s
from joined J
)
select * from parsed
The above code works when there is a small number of records, but it takes more than a day to run, CPU utilization (in Redshift) reaches 100%, and disk usage also reaches 100%, if I set the date range to the last two years or the number of records is otherwise large.
Can anyone please suggest an alternative way to unnest JSON objects like this in Redshift?
My query plan is saying:
Nested Loop Join in the query plan - review the join predicates to avoid Cartesian products
Goal: to unnest without using any cross joins.
Input: data column having JSON
"tt":"452.95","records":[{"r":"IN184366","t":"812812819910","s":"129.37","d":"982.7","c":"83"},{"r":"IN183714","t":"8028028029093","s":"33.9","d":"892","c":"38"}]}
Output should be, for example:
tr, r, s columns from the above JSON.
You want to unnest up to 1000 JSON records stored in a JSON array, but the nested loop join is taking too long.
The root issue is likely your data model. You have stored structured records (called "records") inside a semi-structured text element (JSON), within a column of a structured columnar database. You want to perform some operation on these buried records that you haven't described, but here's the problem: columnar databases are optimized for read-centric analytic queries, while expanding these internal JSON records into Redshift rows (records) is fundamentally a write operation. This works against the optimizations of the database.
The size of this expanded data is also large compared to the disk storage on your cluster, which is why the disks are filling up. Your CPUs are likely spinning unpacking the JSONs and managing overloaded disk and memory capacity. When its disks are close to full, Redshift shifts into a mode that optimizes disk space utilization at the expense of execution speed. A larger cluster may give you significantly faster execution if you can avoid this effect, but that will cost money you may not have budgeted. Not an ideal solution.
One area that would improve the speed of your query is not carrying all the data along. You keep raw.* and J.* all through the query, but it is not clear you need these. Since part of the issue is data size during execution, and this execution includes loop joining, you are making the execution much harder than it needs to be by carrying all this data (including the original JSONs).
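As a minimal sketch of this idea (same CTE, table, and column names as in the question; the tableC branch of the UNION ALL is omitted for brevity), each step carries only the columns used downstream instead of raw.* and J.*:
with raw as (
    SELECT json_extract_path_text(B.Data, 'records', true) as items
    FROM tableB as B
    WHERE B.date::timestamp between
          to_timestamp('2019-01-01 00:00:00','YYYY-MM-DD HH24:MI:SS') AND
          to_timestamp('2022-12-31 23:59:59','YYYY-MM-DD HH24:MI:SS')
),
numbers as (
    SELECT ROW_NUMBER() OVER (ORDER BY TRUE)::integer - 1 as ordinal
    FROM <any_random_table> limit 1000
),
joined as (
    -- keep only the extracted array element, not the original json text
    select json_extract_array_element_text(raw.items, numbers.ordinal::int, true) as item
    from raw
    cross join numbers
    where numbers.ordinal < json_array_length(raw.items, true)
)
select
    json_extract_path_text(item, 'tr', true) as tr,
    json_extract_path_text(item, 'r', true) as r,
    json_extract_path_text(item, 's', true)::float8 as s
from joined;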
The best way out of this situation is to change your data model and expand these JSON internal records into Redshift records on ingestion. JSON data is fine for seldom-used information, or information that is only needed at the end of a query where the data is small. Needing the expanded JSON at the input end of the query for such a large amount of data is not a good use case for JSON in Redshift. Each of these "records" inside the JSON is a record and needs to be stored as such if you need to work across them as query input.
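For the data-model change, a rough sketch of what a target table for the expanded records might look like (the table name, key column, and types are hypothetical; only the r/t/s/d/c keys come from the JSON in the question):
-- Hypothetical target table: one Redshift row per element of the "records" array
CREATE TABLE record_items (
    source_id BIGINT,      -- hypothetical key back to the row the JSON came from
    r         VARCHAR(32),
    t         VARCHAR(32),
    s         FLOAT8,
    d         FLOAT8,
    c         INTEGER
);
-- Populate this table once at ingestion time (expanding the JSON as the data is loaded),
-- then query it directly instead of unnesting inside every query.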
Now you want to know if there is some slick way to get around this issue in your case, and the answer is "unlikely but maybe". Can you describe how you are using the final values in your query (tr, r, and s)? If you are just using some aspect of this data (a max value or a sum or ...), then there may be a way to get to the answer without the large nested loop join. But if you need all the values, then there is no other way to get them AFAIK. A description of what comes next in the data process could open up such an opportunity.

Why does select result fields double data scanned in BigQuery

I have a table with two integer fields, x and y, and a few million rows.
The fields are created with the following code:
Field.newBuilder("x", LegacySQLTypeName.INTEGER).setMode(Field.Mode.NULLABLE).build();
If I run the following from the web:
SELECT x,y FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: "Valid: This query will process 64.9 MB when run."
compared to:
SELECT x FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: " Valid: This query will process 32.4 MB when run."
The query selecting both fields scans double the data of the query selecting only x.
I would expect it to first find the relevant rows based on the WHERE clause and then fetch the extra field without scanning the entire second column.
Any inputs on why it doubles the data scanned and how to avoid it will be appreciated.
In my application I have hundreds of possible fields which I need to fetch for a very small number of rows (50) that answer the query.
Does this mean I will need to process all of the fields' data?
* I'm aware of how a columnar database works, but wasn't aware of the huge price of bringing back lots of fields based on a very specific WHERE clause.
The following link provides a very clear answer:
best-practices-performance-input
BigQuery does not have a concept of an index or anything like that. When you query a column, BigQuery scans through all the values of that column and then performs the operations you want (for a deeper understanding, they have some pretty cool posts about the inner workings of BQ).
That means that when you select x and y where x = 1, BQ will read through all values of x and y and then find where x = 1.
This ends up being an amazing feature of BQ: you just load your data there and it just works. It does force you to be aware of how much data you retrieve from each query. Queries of the type select * from table should be used only if you really need all columns.
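As an illustration, using the table name and byte estimates from the question (the cost is driven by which columns are referenced, not by LIMIT or by how selective the WHERE clause is):
-- Scans only column x (~32.4 MB for the question's table):
SELECT x FROM [myproject:Test.Test] where x=1 LIMIT 50
-- Scans columns x and y (~64.9 MB), even though only 50 rows are returned:
SELECT x,y FROM [myproject:Test.Test] where x=1 LIMIT 50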

SQL add up rows in a column

I'm running SQL queries in Orion Report Writer for SolarWinds NetFlow Traffic Analyzer and am trying to add up data usage for specific conversations coming from the same general source, in this case Netflix. I've made some progress with my query.
SELECT TOP 10000 FlowCorrelation_Source_FlowCorrelation.FullHostname AS Full_Hostname_A,
SUM(NetflowConversationSummary.TotalBytes) AS SUM_of_Bytes_Transferred,
SUM(NetflowConversationSummary.TotalBytes) AS Total_Bytes
FROM
((NetflowConversationSummary
    LEFT OUTER JOIN FlowCorrelation FlowCorrelation_Source_FlowCorrelation
        ON (NetflowConversationSummary.SourceIPSort = FlowCorrelation_Source_FlowCorrelation.IPAddressSort))
    LEFT OUTER JOIN FlowCorrelation FlowCorrelation_Dest_FlowCorrelation
        ON (NetflowConversationSummary.DestIPSort = FlowCorrelation_Dest_FlowCorrelation.IPAddressSort))
    INNER JOIN Nodes ON (NetflowConversationSummary.NodeID = Nodes.NodeID)
WHERE
( DateTime BETWEEN 41539 AND 41570 )
AND
(
(FlowCorrelation_Source_FlowCorrelation.FullHostname LIKE 'ipv4_1.lagg0%')
)
GROUP BY FlowCorrelation_Source_FlowCorrelation.FullHostname, FlowCorrelation_Dest_FlowCorrelation.FullHostname, Nodes.Caption, Nodes.NodeID, FlowCorrelation_Source_FlowCorrelation.IPAddress
So I've got an output that filters everything but Netflix sessions (Full_Hostname_A) and their total usage for each session (Sum_Of_Bytes_Transferred).
I want to add up Sum_Of_Bytes_Transferred to get a total usage for all Netflix sessions listed, which should be output to Total_Bytes. I created the column Total_Bytes, but don't know how to output a total to it.
For those who asked for clarification, here is the output from the above query:
I want the Total_Bytes Column to be all added up into one number.
I have no familiarity with the reporting tool you are using.
From reading your post, I'm thinking you want the first 2 columns of data that you've got, plus, at a later point in the report, a single figure which is the sum of the Total_Bytes column you're already producing.
Your reporting tool probably has some means of totalling a column, but you may need to get the support people for the reporting tool to tell you how to do that.
Aside from this, if you can find a way of calling a separate query in a later section of the report, or if you can embed a new report inside your existing report after the detail section and use that to run a separate query, then you should be able to get the data you want with this:
SELECT Sum(Total_Bytes) as [Total Total Bytes]
FROM ( yourExistingQuery ) x
yourExistingQuery means the query you've already got, in full (it doesn't have to be put on one line); the parentheses are required, and so is the "x". (The latter provides a syntax-required name for the virtual table which your query defines.)
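For illustration, here is that wrapper applied to the query from the question (the inner query is unchanged apart from layout):
SELECT Sum(Total_Bytes) as [Total Total Bytes]
FROM (
    SELECT TOP 10000
        FlowCorrelation_Source_FlowCorrelation.FullHostname AS Full_Hostname_A,
        SUM(NetflowConversationSummary.TotalBytes) AS SUM_of_Bytes_Transferred,
        SUM(NetflowConversationSummary.TotalBytes) AS Total_Bytes
    FROM
        ((NetflowConversationSummary
            LEFT OUTER JOIN FlowCorrelation FlowCorrelation_Source_FlowCorrelation
                ON (NetflowConversationSummary.SourceIPSort = FlowCorrelation_Source_FlowCorrelation.IPAddressSort))
            LEFT OUTER JOIN FlowCorrelation FlowCorrelation_Dest_FlowCorrelation
                ON (NetflowConversationSummary.DestIPSort = FlowCorrelation_Dest_FlowCorrelation.IPAddressSort))
            INNER JOIN Nodes ON (NetflowConversationSummary.NodeID = Nodes.NodeID)
    WHERE
        ( DateTime BETWEEN 41539 AND 41570 )
        AND (FlowCorrelation_Source_FlowCorrelation.FullHostname LIKE 'ipv4_1.lagg0%')
    GROUP BY FlowCorrelation_Source_FlowCorrelation.FullHostname, FlowCorrelation_Dest_FlowCorrelation.FullHostname, Nodes.Caption, Nodes.NodeID, FlowCorrelation_Source_FlowCorrelation.IPAddress
) x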
Hope this helps.

transform rows into columns in a sql table

Suppose I would like to store a table with 440 rows and 138,672 columns. As the SQL limit is 1024 columns, I would like to transpose the table, i.e. convert the 440 rows and 138,672 columns into 138,672 rows and 440 columns.
Is this possible?
The SQL Server limit is actually 30,000 columns; see Sparse Columns.
But creating a query that returns 30k columns (not to mention 138k+) would be basically unmanageable: the sheer size of the metadata on each query result would slow the client to a crawl. One simply does not design databases like that. Go back to the drawing board; when you reach 10 columns, stop and think, and when you reach 100 columns, erase the board and start anew.
And read this: Best Practices for Semantic Data Modeling for Performance and Scalability.
The description of the data is as follows....
Each attribute describes the measurement of the occupancy rate
(between 0 and 1) of a captor location as recorded by a measuring
station, at a given timestamp in time during the day.
The ID of each station is given in the stations_list text file.
For more information on the location (GPS, Highway, Direction) of each
station please refer to the PEMS website.
There are 963 (stations) x 144 (timestamps) = 138,672 attributes for
each record.
This is perfect for normalisation.
You can have a stations table and a measurements table. Two nice long thin tables.
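A minimal sketch of what those two tables could look like (the table and column names are hypothetical, and SQL Server syntax is assumed; the 963 stations, 144 timestamps per day, and 0-1 occupancy rate come from the description above):
-- One row per measuring station (IDs from the stations_list file)
CREATE TABLE stations (
    station_id INT PRIMARY KEY
);

-- One row per (record, station, timestamp) measurement,
-- instead of 963 x 144 = 138,672 columns per record
CREATE TABLE measurements (
    record_id      INT          NOT NULL,
    station_id     INT          NOT NULL REFERENCES stations (station_id),
    timestamp_slot SMALLINT     NOT NULL,  -- 1..144 slots within the day
    occupancy_rate DECIMAL(5,4) NOT NULL,  -- between 0 and 1
    PRIMARY KEY (record_id, station_id, timestamp_slot)
);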