How can I sum the number of items in an array over time from a column in Hive partitions? - hive

I am trying to add up the instances of multiple tags within an array column.
Table Name: Task_Queue

Partition 1:

item_id | category | task_tags                 | created_time | completed_time | date_stamp
--------|----------|---------------------------|--------------|----------------|-----------
4562    | alert    | windows, secevent, manual | 2022-1-1     | 0              | 2022-7-29
4563    | event    | linux, opsevent, manual   | 2022-2-10    | 0              | 2022-7-29
4564    | alert    | windows, secevent, manual | 2022-2-11    | 0              | 2022-7-29
4565    | event    | mac, secevent, manual     | 2022-2-16    | 0              | 2022-7-29
4585    | alert    | windows, opsevent, manual | 2022-3-13    | 0              | 2022-7-29
4692    | event    | linux, secevent, manual   | 2022-3-14    | 0              | 2022-7-29
4662    | alert    | mac, event, manual        | 2022-5-5     | 0              | 2022-7-29
4673    | event    | linux, secevent, manual   | 2022-5-15    | 0              | 2022-7-29
4854    | alert    | mac, secevent, manual     | 2022-5-16    | 0              | 2022-7-29
4955    | event    | linux, ide-event, manual  | 2022-5-17    | 0              | 2022-7-29
4965    | alert    | windows, secevent, manual | 2022-6-1     | 0              | 2022-7-29
4972    | event    | mac, secevent, manual     | 2022-6-10    | 0              | 2022-7-29

Partition 2:

item_id | category | task_tags                 | created_time | completed_time | date_stamp
--------|----------|---------------------------|--------------|----------------|-----------
4462    | alert    | windows, opsevent, manual | 2021-1-1     | 0              | 2022-6-29
4463    | event    | linux, opsevent, manual   | 2021-2-10    | 0              | 2022-6-29
4464    | alert    | windows, opsevent, manual | 2021-2-11    | 0              | 2022-6-29
4465    | event    | mac, secevent, manual     | 2021-2-16    | 0              | 2022-6-29
4485    | alert    | windows, opsevent, manual | 2021-3-13    | 0              | 2022-6-29
4492    | event    | linux, opsevent, manual   | 2021-3-14    | 0              | 2022-6-29
4462    | alert    | mac, event, manual        | 2021-5-5     | 0              | 2022-6-29
4473    | event    | linux, event, manual      | 2021-5-15    | 0              | 2022-6-29
4454    | alert    | mac, opsevent, manual     | 2021-5-16    | 0              | 2022-6-29
4455    | event    | linux, ide-event, manual  | 2021-5-17    | 0              | 2022-6-29
4465    | alert    | windows, secevent, manual | 2021-6-1     | 0              | 2022-6-29
4472    | event    | mac, opsevent, manual     | 2021-6-10    | 0              | 2022-6-29
My query:
SELECT
    ds,
    COUNT(task_tags) AS "Total Tags"
FROM task_queue
WHERE completed_time = 0
    AND task_tags CONTAIN ('secevent', 'opsevent', 'ide-event')
GROUP BY ds
ORDER BY ds DESC
The result should be a trend of the total count of those tags per partition in our data store. For Partition 1 above, the count would be 11 (4662 is not counted, since 'event' is not one of the searched tags).
Edit: Added all columns and created a second partition. The goal is to find a trend of task tags over time by identifying one of the varchar tags 'secevent', 'opsevent', 'ide-event'.
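Since CONTAIN is not a Hive function, a working version might look like the sketch below, assuming task_tags is an ARRAY<STRING> and date_stamp is the partition column:

SELECT
    date_stamp,
    COUNT(*) AS total_tags
FROM task_queue
LATERAL VIEW explode(task_tags) t AS tag
WHERE completed_time = 0
    AND tag IN ('secevent', 'opsevent', 'ide-event')
GROUP BY date_stamp
ORDER BY date_stamp DESC;

If task_tags is actually a comma-delimited string, split it first: LATERAL VIEW explode(split(task_tags, ',\\s*')) t AS tag. Against Partition 1 above this returns 11, since each qualifying row carries exactly one of the three tags.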

Related

Output progress over time in hashcat

I am analysing the number of hashes cracked over a set period of time.
I am looking to save the current status of the crack every 10 seconds.
'''
Recovered........: 132659/296112 (44.80%) Digests, 0/1 (0.00%) Salts
Recovered/Time...: CUR:3636,N/A,N/A AVG:141703,8502198,204052756 (Min,Hour,Day)
Progress.........: 15287255040/768199139595 (1.99%)
'''
I want these 3 lines of the status saved every 10 seconds or so.
Is it possible to do this within hashcat or will I need to make a separate script in python?
Getting the status every 10 seconds
You can enable printing the status with --status, and you can set it to print every X seconds with --status-timer X. You can see these command line arguments on the hashcat options wiki page, or in hashcat --help.
Example: hashcat -a 0 -m 0 example.hash example.dict --status --status-timer 10
Saving all the statuses
I'm assuming that you just want to save everything that gets printed by hashcat while it's running. An easy way to do this is to copy everything from stdout into a file. This is a popular Stack Overflow question, so we'll just use this answer.
To be safe, let's use -a which appends to the file, so we don't accidentally overwrite previous runs. All we need to do is put | tee -a file.txt after our hashcat call.
Solution
Give this a shot; it should save all the statuses (and everything else from stdout) to output.txt:
hashcat -a A -m M hashes.txt dictionary.txt --status --status-timer 10 | tee -a output.txt
Just swap out A, M, hashes.txt, and dictionary.txt with the arguments you're using.
If you need help getting just the "Recovered" lines from this output file, or if this doesn't work on your computer (I'm on OSX), let me know in a comment.
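For example, a quick way to pull just those status lines back out of the saved file afterwards (assuming they sit at the start of a line, as in the snippet quoted in the question):

grep -E '^(Recovered|Progress)' output.txt

This matches the Recovered, Recovered/Time, and Progress lines in one pass.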
In addition to Andrew Zick's answer, note that hashcat has native support for machine-readable status output - see the --machine-readable option. This produces tab-separated output like so:
STATUS 5 SPEED 111792 1000 EXEC_RUNTIME 0.007486 CURKU 1 PROGRESS 62 62 RECHASH 0 1 RECSALT 0 1 REJECTED 0 UTIL -1
STATUS 5 SPEED 14247323 1000 EXEC_RUNTIME 0.038953 CURKU 36 PROGRESS 2232 2232 RECHASH 0 1 RECSALT 0 1 REJECTED 0 UTIL -1
STATUS 5 SPEED 36929864 1000 EXEC_RUNTIME 1.661804 CURKU 1296 PROGRESS 80352 80352 RECHASH 0 1 RECSALT 0 1 REJECTED 0 UTIL -1
STATUS 5 SPEED 66538858 1000 EXEC_RUNTIME 3.237319 CURKU 46656 PROGRESS 2892672 2892672 RECHASH 0 1 RECSALT 0 1 REJECTED 0 UTIL -1
STATUS 5 SPEED 63562975 1000 EXEC_RUNTIME 3.480536 CURKU 1679616 PROGRESS 104136192 104136192 RECHASH 0 1 RECSALT 0 1 REJECTED 0 UTIL -1
... which is exactly what tools like Hashtopolis use to provide a front-end to hashcat output.
For machine-readable output of cracked hashes, the options --outfile and --outfile-format are also available. See the Format section of the output of hashcat --help for the values --outfile-format accepts:
- [ Outfile Formats ] -
# | Format
===+========
1 | hash[:salt]
2 | plain
3 | hex_plain
4 | crack_pos
5 | timestamp absolute
6 | timestamp relative
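Combined with the tee approach from the earlier answer, a sketch (same placeholder arguments A, M, hashes.txt, and dictionary.txt as before):

hashcat -a A -m M hashes.txt dictionary.txt --machine-readable --status --status-timer 10 | tee -a status.log

The tab-separated STATUS lines are then easy to split in any scripting language.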

TEZ mapper resource request

We recently migrated from MapReduce to Tez for executing Hive queries on EMR. We are seeing cases where the exact same Hive query launches very different numbers of mappers. See the Map 3 phase below: on the first run it requested 305 mappers and on another run it requested 4534. (Please ignore the KILLED status; I manually killed the query.) Why does this happen? How can we change it to be based on the underlying data size instead?
Run 1
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 container KILLED 5 0 0 5 0 0
Map 3 container KILLED 305 0 0 305 0 0
Map 5 container KILLED 16 0 0 16 0 0
Map 6 container KILLED 1 0 0 1 0 0
Reducer 2 container KILLED 333 0 0 333 0 0
Reducer 4 container KILLED 796 0 0 796 0 0
----------------------------------------------------------------------------------------------
VERTICES: 00/06 [>>--------------------------] 0% ELAPSED TIME: 14.16 s
----------------------------------------------------------------------------------------------
Run 2
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 5 5 0 0 0 0
Map 3 container KILLED 4534 0 0 4534 0 0
Map 5 .......... container SUCCEEDED 325 325 0 0 0 0
Map 6 .......... container SUCCEEDED 1 1 0 0 0 0
Reducer 2 container KILLED 333 0 0 333 0 0
Reducer 4 container KILLED 796 0 0 796 0 0
----------------------------------------------------------------------------------------------
VERTICES: 03/06 [=>>-------------------------] 5% ELAPSED TIME: 527.16 s
----------------------------------------------------------------------------------------------
This article explains the process by which Tez allocates resources: https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works

If Tez grouping is enabled for the splits, then a generic grouping logic is run on these splits to group them into larger splits. The idea is to strike a balance between how parallel the processing is and how much work is being done in each parallel process.

First, Tez tries to find out the resource availability in the cluster for these tasks. For that, YARN provides a headroom value (and in future other attributes may be used). Let's say this value is T.

Next, Tez divides T by the resource per task (say M) to find out how many tasks can run in parallel at once (i.e. in a single wave): W = T/M.

Next, W is multiplied by a wave factor (from configuration - tez.grouping.split-waves) to determine the number of tasks to be used. Let's say this value is N.

If there are a total of X splits (input shards) and N tasks, then this would group X/N splits per task. Tez then estimates the size of data per task based on the number of splits per task.

If this value is between tez.grouping.max-size and tez.grouping.min-size, then N is accepted as the number of tasks. If not, then N is adjusted to bring the data per task in line with the max/min depending on which threshold was crossed.

For experimental purposes, tez.grouping.split-count can be set in configuration to specify the desired number of groups. If this config is specified, then the above logic is ignored and Tez tries to group splits into the specified number of groups. This is best effort.

After this, the grouping algorithm is executed. It groups splits by node locality, then rack locality, while respecting the group size limits.
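From Hive, the grouping knobs quoted above can be set per session. A sketch with illustrative values only (the property names come from the wiki text above; check the defaults for your Tez version):

-- Illustrative values only; defaults differ across Tez versions.
-- Wave factor used when computing the task count N from W = T/M:
set tez.grouping.split-waves=1.7;
-- Bounds (in bytes) on the estimated data per task; 50 MB and 1 GB here:
set tez.grouping.min-size=52428800;
set tez.grouping.max-size=1073741824;
-- Experimental: force the group count and bypass the logic above:
-- set tez.grouping.split-count=305;

Setting min-size and max-size to the same value effectively pins the split size, which should make the mapper count depend largely on data volume rather than on cluster headroom at submit time.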

optaplanner vrp file with road time and time window

I am trying to create a VRP file which defines a problem with time windows and distances in seconds. I currently do not need capacity (can I turn it off?).
This is my file:
NAME: almirs-test
COMMENT: Generated for OptaPlanner Examples
TYPE: CVRPTW
DIMENSION: 2
EDGE_WEIGHT_TYPE: EXPLICIT
EDGE_WEIGHT_FORMAT: FULL_MATRIX
EDGE_WEIGHT_UNIT_OF_MEASUREMENT: sec
CAPACITY: 125
NODE_COORD_SECTION
0 0 0 BRUSSEL
55 1 1 ANTHISNES
EDGE_WEIGHT_SECTION
0.0 1
1 0.0
DEMAND_SECTION
0 0 0 100 0
55 1 0 10 1
DEPOT_SECTION
0
-1
EOF
It is correctly parsed, and I see the locations on screen, but when I try to solve it I get the message "Not feasible":
org.optaplanner.examples.vehiclerouting.solver/arrivalAfterDueTime/level0/[ANTHISNES]=-990
Any idea what I am doing wrong? Any samples where I can see how it is done?
Thanks,
Almir

How to proceed with my Spark / Scala project

I am new to Spark and Scala. I am working on a Scala project that will access data from SQL Server.
There is a table in SQL Server with info about clothes. itemCode is the primary key; several attributes hold Boolean 0/1 values (Designer, Exclusive, Handloom), and several other columns hold other attributes of the product.
Code Designer Exclusive Handloom
A 1 0 1
B 1 0 0
C 0 0 1
D 0 1 0
E 0 1 0
F 1 0 1
G 0 1 0
H 0 0 0
I 1 1 1
J 1 1 1
K 0 0 1
L 0 1 0
M 0 1 0
N 1 1 0
O 0 1 1
P 1 1 0
and the list continues.
I have to select a collection of 32 items out of 320 items that have AT LEAST:
8 Designer, 8 Exclusive, 8 Handloom, 8 WeddingStyle, 8 PartyStyle,
8 Silk, 8 Georgette
I solved the problem in MS Excel Solver (it uses a gradient descent algorithm) by adding an extra column and using the SUMPRODUCT function between the added column and the required columns. Solving took around 1 minute 30 seconds.
The problem can also be solved with an SQL query with 32 joins (very many); for example, to select 6 items out of the 16 above with at least 4 Designer, 4 Exclusive, and 4 Handloom items, the query would be like the one in my post: MYSQL - Select rows fulfilling many count conditions.
In production I have to fetch 32 rows this way, so my question is how to proceed further with the project.
I am working in Scala IDE for Eclipse and have added Spark MLlib there. I have fetched the data via JDBC into a DataFrame, and then created a temporary table:
dataFrame.registerTempTable("Data")
There is an Optimizer class in MLlib's optimization package that uses gradient descent (as Excel Solver does) to solve problems, but it is meant for machine learning and takes training data as input.
I am not able to understand how to proceed with my project. Can I use MLlib, or a simpler approach using Spark SQL? I need serious help.
I'd recommend using Spark SQL DataFrames (https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#creating-dataframes) rather than MLlib.
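For reference, a minimal sketch of that DataFrame route against the Spark 1.3 API the linked guide covers; the JDBC URL, credentials, and table name are placeholders, and the SQL Server JDBC driver must be on the classpath:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // sc is an existing SparkContext
// Load the SQL Server table as a DataFrame over JDBC (placeholder url/dbtable).
val clothes = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:sqlserver://host:1433;databaseName=shop;user=...;password=...",
  "dbtable" -> "Clothes"))
clothes.registerTempTable("Data")
// Sanity-check the attribute totals before optimising:
sqlContext.sql("SELECT SUM(Designer), SUM(Exclusive), SUM(Handloom) FROM Data").show()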
I solved this problem through linear programming. I used the lp_solve library for Java in my Scala project, and it gives almost the same result as the Excel Solver.
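To make that concrete, here is a rough sketch (not the poster's actual code) of the ILP formulation using the lp_solve Java wrapper. Item, pick, total, and minPerAttr are illustrative names, lp_solve columns are 1-based, and the native lp_solve library must be on java.library.path:

import lpsolve.LpSolve

case class Item(code: String, designer: Int, exclusive: Int, handloom: Int)

def pick(items: Array[Item], total: Int, minPerAttr: Int): Seq[String] = {
  val n = items.length
  val lp = LpSolve.makeLp(0, n)          // n decision variables, no rows yet
  for (i <- 1 to n) lp.setBinary(i, true) // x_i in {0, 1}: item i selected or not

  // Exactly `total` items selected: sum(x_i) = total
  lp.strAddConstraint(Array.fill(n)("1").mkString(" "), LpSolve.EQ, total)

  // Each attribute covered at least `minPerAttr` times
  for (attr <- Seq[Item => Int](_.designer, _.exclusive, _.handloom)) {
    lp.strAddConstraint(items.map(x => attr(x).toString).mkString(" "),
      LpSolve.GE, minPerAttr)
  }

  lp.solve() // no objective set, so any feasible assignment is accepted
  val chosen = items.zip(lp.getPtrVariables).collect { case (it, v) if v > 0.5 => it.code }
  lp.deleteLp()
  chosen
}

Leaving the objective empty mirrors the Excel Solver setup, which only has to satisfy the constraints rather than optimise anything.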

Dot Net Cisco Command Line Console Parser [closed]

I'm trying to write a Cisco command line parser to provide an automated graphical user interface replacement for the Cisco console output. I have been able to get the ping time from a ping output using regular expressions and graph it, but I am now stuck with the more detailed output of other commands like the "show interfaces" command.
Any ideas how I can parse the "show interfaces" command output and extract all the useful info that I need?
Here is a "show interfaces" output example:
FastEthernet0/0 is up, line protocol is up
Hardware is MV96340 Ethernet, address is 0018.189d.1df0 (bia 0018.189d.1df0)
Description: IP+ connection
Internet address is 164.128.251.50/24
MTU 1500 bytes, BW 100000 Kbit/sec, DLY 100 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 100Mb/s, 100BaseTX/FX
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:00:00, output 00:00:00, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/3718/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 2000 bits/sec, 6 packets/sec
5 minute output rate 3000 bits/sec, 10 packets/sec
152817108 packets input, 1043050554 bytes
Received 77347880 broadcasts (67140888 IP multicasts)
0 runts, 0 giants, 3351 throttles
381823 input errors, 0 CRC, 0 frame, 0 overrun, 381823 ignored
0 watchdog
0 input packets with dribble condition detected
--More-- 99065802 packets output, 440637782 bytes, 0 underruns
0 output errors, 0 collisions, 2 interface resets
300246 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier
0 output buffer failures, 0 output buffers swapped out
FastEthernet0/1 is administratively down, line protocol is down
Hardware is MV96340 Ethernet, address is 0018.189d.1df1 (bia 0018.189d.1df1)
MTU 1500 bytes, BW 100000 Kbit/sec, DLY 100 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Auto-duplex, Auto Speed, 100BaseTX/FX
ARP type: ARPA, ARP Timeout 04:00:00
Last input never, output never, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 0 bits/sec, 0 packets/sec
5 minute output rate 0 bits/sec, 0 packets/sec
0 packets input, 0 bytes
Received 0 broadcasts (0 IP multicasts)
--More-- 0 runts, 0 giants, 0 throttles
--More-- 0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog
0 input packets with dribble condition detected
0 packets output, 0 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier
0 output buffer failures, 0 output buffers swapped out
Tunnel0 is up, line protocol is up
Hardware is Tunnel
Interface is unnumbered. Using address of FastEthernet0/0 (164.128.251.50)
MTU 17912 bytes, BW 100 Kbit/sec, DLY 50000 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation TUNNEL, loopback not set
Keepalive not set
Tunnel source 164.128.251.50 (FastEthernet0/0), destination 164.128.32.1
Tunnel Subblocks:
src-track:
Tunnel0 source tracking subblock associated with FastEthernet0/0
Set of tunnels with source FastEthernet0/0, 1 member (includes iterators), on interface
Tunnel protocol/transport PIM/IPv4
--More-- Tunnel TOS/Traffic Class 0xC0, Tunnel TTL 255
--More-- Tunnel transport MTU 1472 bytes
Tunnel is transmit only
Tunnel transmit bandwidth 8000 (kbps)
Tunnel receive bandwidth 8000 (kbps)
Last input never, output 28w1d, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/0 (size/max)
5 minute input rate 0 bits/sec, 0 packets/sec
5 minute output rate 0 bits/sec, 0 packets/sec
0 packets input, 0 bytes, 0 no buffer
Received 0 broadcasts (0 IP multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
44 packets output, 2464 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 unknown protocol drops
0 output buffer failures, 0 output buffers swapped out
Virtual-Access1 is up, line protocol is up
Hardware is Virtual Access interface
Description: Internally created by SSLVPN context TEST
MTU 1406 bytes, BW 100000 Kbit/sec, DLY 100000 usec,
--More-- reliability 255/255, txload 1/255, rxload 1/255
--More-- Encapsulation SSL
Internal vaccess
Vaccess status 0x0, loopback not set
Keepalive set (10 sec)
DTR is pulsed for 5 seconds on reset
Last input never, output never, output hang never
Last clearing of "show interface" counters 29w5d
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 0 bits/sec, 0 packets/sec
5 minute output rate 0 bits/sec, 0 packets/sec
0 packets input, 0 bytes, 0 no buffer
Received 0 broadcasts (0 IP multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
0 packets output, 0 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 unknown protocol drops
0 output buffer failures, 0 output buffers swapped out
0 carrier transitions
' Split the full "show interfaces" output on interface-name boundaries.
Interface_Long_Split = Regex.Split(Result_Long, "(POS[0-9]/[0-9]/[0-9])|(POS[0-9]/[0-9])|(GigabitEthernet[0-9]/[0-9])|(FastEthernet[0-9]/[0-9])")

Dim count As Integer = 0
' The upper bound must be Length - 1; iterating to Length overruns the array.
For i = 0 To Interface_Long_Split.Length - 1
    If Regex.IsMatch(Interface_Long_Split(i), "(POS[0-9]/[0-9]/[0-9])|(POS[0-9]/[0-9])|(GigabitEthernet[0-9]/[0-9])|(FastEthernet[0-9]/[0-9])") Then
        ReDim Preserve Interfaces_List(count)
        Interfaces_List(count) = Interface_Long_Split(i)
        count += 1
    End If
Next
IMHO you are probably on a hiding to nothing.
You could try parsing those complex outputs a line at a time rather than as one big blob, as in the sketch below.
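A minimal line-at-a-time sketch in that vein (Result_Long is from the question; currentName and sections are illustrative names):

' Walk the output line by line, starting a new bucket whenever a line looks
' like an interface header, e.g. "FastEthernet0/0 is up, line protocol is up".
Dim currentName As String = Nothing
Dim sections As New Dictionary(Of String, List(Of String))

For Each line As String In Result_Long.Split(New String() {vbCrLf, vbLf}, StringSplitOptions.None)
    Dim header As Match = Regex.Match(line, "^(\S+) is (administratively down|up|down),")
    If header.Success Then
        currentName = header.Groups(1).Value
        sections(currentName) = New List(Of String)
    ElseIf currentName IsNot Nothing Then
        sections(currentName).Add(line.Trim())
    End If
Next
' Each sections(name) now holds the detail lines for one interface, ready for
' targeted Regex matches (MTU, input rate, error counters, ...).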