Hive sql pack array based off column - sql

I have multiple columns listed below:
state sport size color name
florida football 1 red Max
nevada football 1 red Max
ohio football 1 red Max
texas football 1 red Max
florida hockey 1 red Max
nevada hockey 1 red Max
ohio hockey 1 red Max
texas hockey 1 red Max
florida tennis 2 green Max
nevada tennis 2 green Max
ohio tennis 2 green Max
texas tennis 2 green Max
Is there a way to combine these into arrays, like the desired output below, based on one column (in this case name)? For Max, the result would be one record instead of repeated rows, with each column's values collected into an array.
state sport
[florida, nevada, ohio, texas] [football, hockey, tennis]
size color
[1,2] [red, green]

You can use collect_set.
select name,
       collect_set(state),
       collect_set(sport),
       collect_set(size),
       collect_set(color)
from tbl
group by name;

You need to use collect_set. Hope this helps. Thanks.
query:
select collect_set(state),
collect_set(sport),
collect_set(size),
collect_set(color)
from myTable
where name = 'Max';
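For readers unfamiliar with collect_set, here is a minimal Python sketch of what the grouped query computes: for each name, the distinct values seen in every other column. It only illustrates the semantics and is not Hive code. The helper name collect_sets is hypothetical, the rows are abbreviated from the question's table, and the sketch keeps first-seen order, whereas the element order returned by collect_set should not be relied on.

```python
from collections import defaultdict

rows = [
    # (state, sport, size, color, name), abbreviated from the question's table
    ("florida", "football", 1, "red", "Max"),
    ("nevada", "football", 1, "red", "Max"),
    ("ohio", "hockey", 1, "red", "Max"),
    ("texas", "tennis", 2, "green", "Max"),
]
COLUMNS = ("state", "sport", "size", "color")


def collect_sets(rows):
    """Group by name and keep each column's distinct values (hypothetical helper)."""
    grouped = defaultdict(lambda: {col: [] for col in COLUMNS})
    for *values, name in rows:
        for col, value in zip(COLUMNS, values):
            if value not in grouped[name][col]:  # skip duplicates, like a set
                grouped[name][col].append(value)
    return dict(grouped)


result = collect_sets(rows)
print(result["Max"])
```

Each name ends up with exactly one record, which is the shape the question asks for.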

Related

pandas: how to create new columns based on two columns and aggregate the results

I am trying to perform a sort of aggregation, but with the creation of new columns.
Let's take the example of the dataframe below:
import pandas as pd

df = pd.DataFrame({'City': ['Los Angeles', 'Denver', 'Denver', 'Los Angeles'],
                   'Car Maker': ['Ford', 'Toyota', 'Ford', 'Toyota'],
                   'Qty': [50000, 100000, 80000, 70000]})
That generates this:
   City         Car Maker  Qty
0  Los Angeles  Ford       50000
1  Denver       Toyota     100000
2  Denver       Ford       80000
3  Los Angeles  Toyota     70000
I would like to have one line per city and the Car Maker as a new column with the Qty related to that City:
   City         Car Maker  Ford   Toyota
0  Los Angeles  Ford       50000  70000
1  Denver       Toyota     80000  100000
Any hints on how to achieve that?
I've tried some options, such as converting it to a dictionary and collapsing it with a function, but I am looking for a more pandas-like solution.
df.pivot(index='City', columns='Car Maker', values='Qty').reset_index()
Try DataFrame.pivot_table(), which also handles duplicate City/Car Maker pairs by aggregating them (mean by default):
df.pivot_table(values='Qty', index='City', columns='Car Maker').reset_index()
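A minimal, runnable sketch of the reshaping on the question's data, assuming pandas is installed. It also shows pivot_table with an explicit aggfunc, which may be useful if a City and Car Maker pair can occur more than once (plain pivot raises an error on duplicate pairs). The extra Denver/Ford row is made up purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['Los Angeles', 'Denver', 'Denver', 'Los Angeles'],
    'Car Maker': ['Ford', 'Toyota', 'Ford', 'Toyota'],
    'Qty': [50000, 100000, 80000, 70000],
})

# One row per city, one column per car maker.
wide = df.pivot(index='City', columns='Car Maker', values='Qty').reset_index()
wide.columns.name = None  # drop the leftover 'Car Maker' axis label
print(wide)

# Hypothetical extra row that duplicates the (Denver, Ford) pair.
extra = pd.DataFrame({'City': ['Denver'], 'Car Maker': ['Ford'], 'Qty': [1000]})
dup = pd.concat([df, extra], ignore_index=True)

# pivot_table aggregates duplicates instead of failing; sum is one possible choice.
summed = dup.pivot_table(index='City', columns='Car Maker', values='Qty',
                         aggfunc='sum').reset_index()
summed.columns.name = None
print(summed)
```

Which aggregation to use for duplicates (sum, mean, first, and so on) depends on what Qty represents, so treat aggfunc='sum' as a placeholder.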

Count count in splunk

Can you do double counting in Splunk via time_span?
I want a count of counts: first the number of each fruit sold per hour, then a count over those hourly results.
My code:
| bucket _time span=1h
| eventstats count as count_in_an_hour by fruit time
| stats count as count_count by fruit
| table fruit count count_count
| sort count_count count
I can run this on a small amount of data, but because I have a huge amount of data, it takes very long and uses a lot of space, resulting in a "not enough space" error.
My sample data:
name fruit location time
mary apple east 5.10
ben pear east 6.10
peter pear east 5.50
ben apple north 7.10
ben mango north 7.40
peter mango north 5.30
mary orange north 7.20
alice pear north 7.20
janet pear north 7.20
janet mango west 6.30
janet mango west 5.50
peter mango west 4.20
janet pear west 5.50
You can try asking your admin to increase your disk space limit, if that's the limiting factor.
If your admin has enabled the search_process_memory_usage_threshold setting then ask for the threshold to be increased.
Perhaps a better option is to reduce the number of results processed. You can do that in a few ways:
Use a smaller time window
Use the fields command early to reduce the amount of data processed
Make the base search as specific as possible to reduce the amount of data processed
For example:
index=foo name=* fruit=* earliest=-24h
| fields _time name fruit
| bucket _time span=1h
| eventstats count as count_in_an_hour by fruit time
| stats count as count_count by fruit
| sort count_count count_in_an_hour
| table fruit count_in_an_hour count_count
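The idea of dropping unneeded fields and aggregating as early as possible can be illustrated outside Splunk. The rough Python sketch below (not Splunk code) counts events per fruit and hour in a single pass, storing one counter per (fruit, hour) pair instead of carrying every event forward. It assumes the sample's time values such as 5.10 mean hour 5, minute 10; the helper name hourly_counts is hypothetical.

```python
from collections import Counter

# (fruit, time) pairs taken from the sample data; the other fields are dropped
# up front, similar in spirit to using `fields` early in a search.
events = [
    ("apple", "5.10"), ("pear", "6.10"), ("pear", "5.50"), ("apple", "7.10"),
    ("mango", "7.40"), ("mango", "5.30"), ("orange", "7.20"), ("pear", "7.20"),
    ("pear", "7.20"), ("mango", "6.30"), ("mango", "5.50"), ("mango", "4.20"),
    ("pear", "5.50"),
]


def hourly_counts(events):
    """Count events per (fruit, hour) bucket (hypothetical helper)."""
    counts = Counter()
    for fruit, time in events:
        hour = int(time.split(".")[0])  # assumption: "5.10" means 5:10
        counts[(fruit, hour)] += 1
    return counts


counts = hourly_counts(events)
# Number of distinct hourly buckets in which each fruit appears.
buckets_per_fruit = Counter(fruit for fruit, _hour in counts)
print(counts[("pear", 5)], buckets_per_fruit["mango"])
```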

shop that has served more than 3 people query sql

I have a table called frequents that has two columns name and pizzeria. Each name is linked to a pizzeria. Some names are mentioned more than once since they are linked to different pizzerias. I need help writing a query that shows all the pizzerias that have served more than 3 people. Thank you.
Name Pizzeria
Amy Pizza Hut
Ben Pizza Hut
Ben Chicago Pizza
Cal Straw Hat
Cal New York Pizza
Dan Straw Hat
Dan New York Pizza
Eli Straw Hat
Eli Chicago Pizza
Fay Dominos
Fay Little Caesars
Gus Chicago Pizza
Gus Pizza Hut
Hil Dominos
Hil Straw Hat
Hil Pizza Hut
Ian New York Pizza
Ian Straw Hat
Ian Dominos
And the query:
SELECT name, count(pizzeria)
FROM frequency
GROUP BY name
HAVING COUNT(pizzeria) >= 3
The result is supposed to show each pizzeria whose name appears more than 3 times.
I need help writing a query that shows all the pizzerias that have served more than 3 people
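You need to GROUP BY pizzeria, not by name:
SELECT pizzeria
FROM frequency
GROUP BY pizzeria
HAVING COUNT(*) >= 3
This keeps the >= 3 condition from your query; use > 3 if "more than 3" should exclude pizzerias with exactly three customers.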
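To check the grouping against the sample data, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The table name frequency follows the queries in this thread (the question text calls it frequents). The helper pizzerias_with_at_least is hypothetical, and COUNT(DISTINCT name) is used on the assumption that the same person could appear twice for one pizzeria.

```python
import sqlite3

rows = [
    ("Amy", "Pizza Hut"), ("Ben", "Pizza Hut"), ("Ben", "Chicago Pizza"),
    ("Cal", "Straw Hat"), ("Cal", "New York Pizza"), ("Dan", "Straw Hat"),
    ("Dan", "New York Pizza"), ("Eli", "Straw Hat"), ("Eli", "Chicago Pizza"),
    ("Fay", "Dominos"), ("Fay", "Little Caesars"), ("Gus", "Chicago Pizza"),
    ("Gus", "Pizza Hut"), ("Hil", "Dominos"), ("Hil", "Straw Hat"),
    ("Hil", "Pizza Hut"), ("Ian", "New York Pizza"), ("Ian", "Straw Hat"),
    ("Ian", "Dominos"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE frequency (name TEXT, pizzeria TEXT)")
conn.executemany("INSERT INTO frequency VALUES (?, ?)", rows)


def pizzerias_with_at_least(min_people):
    """Return pizzerias visited by at least `min_people` distinct names."""
    sql = (
        "SELECT pizzeria FROM frequency "
        "GROUP BY pizzeria "
        "HAVING COUNT(DISTINCT name) >= ? "
        "ORDER BY pizzeria"
    )
    return [row[0] for row in conn.execute(sql, (min_people,))]


at_least_three = pizzerias_with_at_least(3)
more_than_three = pizzerias_with_at_least(4)  # strictly more than 3
print(at_least_three)
print(more_than_three)
```

On this data the two readings of "more than 3" give different answers, so it is worth deciding which one is intended.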

Get data from string of specific values SQL

I'm rather new at SQL programming, and still struggling with the basics. I need to extract some specific rows, from a specified string of IDs.
ID Product City
1 Apple London
2 Banana Berlin
3 Orange Berlin
4 Orange Paris
5 Apple Paris
6 Banana Copenhagen
7 Banana Copenhagen
8 Banana London
9 Apple Paris
10 Orange London
11 Apple Berlin
12 Apple Copenhagen
13 Apple Paris
If I need to select ID=1,2,5,6,10,11,13 how do I extract these specific rows from the database?
I'm using SQLite.
Thanks in advance.
Use the IN operator:
select * from your_table
where id in (1,2,5,6,10,11,13)
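If the IDs arrive as a string from application code, one option is to split the string and bind the values as parameters rather than pasting them into the SQL. The sketch below uses Python's built-in sqlite3 module with an in-memory copy of the sample table; the table name your_table comes from the answer above, and the column names and the id_string variable are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table (id INTEGER PRIMARY KEY, product TEXT, city TEXT)"
)
conn.executemany("INSERT INTO your_table VALUES (?, ?, ?)", [
    (1, "Apple", "London"), (2, "Banana", "Berlin"), (3, "Orange", "Berlin"),
    (4, "Orange", "Paris"), (5, "Apple", "Paris"), (6, "Banana", "Copenhagen"),
    (7, "Banana", "Copenhagen"), (8, "Banana", "London"), (9, "Apple", "Paris"),
    (10, "Orange", "London"), (11, "Apple", "Berlin"),
    (12, "Apple", "Copenhagen"), (13, "Apple", "Paris"),
])

id_string = "1,2,5,6,10,11,13"  # hypothetical input string of IDs
wanted_ids = [int(part) for part in id_string.split(",")]

# One "?" placeholder per ID, so the values are bound, not concatenated.
placeholders = ", ".join("?" for _ in wanted_ids)
sql = (
    "SELECT id, product, city FROM your_table "
    f"WHERE id IN ({placeholders}) ORDER BY id"
)
selected = conn.execute(sql, wanted_ids).fetchall()
for row in selected:
    print(row)
```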

How do I Sum a total based on Grouping

I've got data (which changes every time) in 2 columns - basically state and number. This is an example:
Example Data
State Total
Connecticut 624
Georgia 818
Washington 10
Arkansas 60
New Jersey 118
Ohio 2,797
N. Carolina 336
Illinois 168
California 186
Utah 69
Texas 183
Minnesota 172
Kansas 945
Florida 113
Arizona 1,430
S. Dakota 293
Puerto Rico 184
Each state needs to be grouped. The groupings are as follows:
Groupings
**US Group 1**
California
District of Columbia
Florida
Hawaii
Illinois
Michigan
Nevada
New York
Pennsylvania
Texas
**US Group 3**
Iowa
Idaho
Kansas
Maine
Missouri
Montana
North Dakota
Nebraska
New Hampshire
South Dakota
Utah
Wyoming
Every other state belongs in US Group 2.
What I am trying to do is sum a total for each group. So in this example I would have totals of:
Totals
650 in Group 1 (4 states)
6365 in Group 2 (9 states)
1307 in Group 3 (3 states)
What I would like, each time I get a new spreadsheet with this data, is to avoid building an IF/COUNTIF/SUMIF formula from scratch. I figure it would be much more efficient to select my data and run a macro that does it (possibly checking against some legend or something).
Can anyone point me in the right direction? I have been banging my head against the VBA editor for 2 days now...
Here is one way.
Step 1: Create a named range for each of your groups.
Step 2: Try this formula: =SUMPRODUCT(SUMIF(A2:A18,Group1,B2:B18))
Formula Breakdown:
A2:A18 contains the state names
Group1 is the named range that has each of your states in group 1
B2:B18 is the values you want to sum.
It's important that your state names and the values you want summed are the same size (number of rows). You should also standardize your state names. Having S. Dakota in your data and South Dakota in your named range won't work. Either add in the different variations of the state name(s) to your list, or standardize your data coming in.
To get a clear visual of what the formula is doing, use the Evaluate Formula button on the Formulas Tab, it will be much better than me trying to explain it.
EDIT
Try this formula for summing up values that are not in Group1 or Group3:
=SUMPRODUCT(--(NOT(ISNUMBER(MATCH(A2:A18,Group1,0)))),--(NOT(ISNUMBER(MATCH(A2:A18,Group3,0)))),B2:B18)
Seemed to work on my end. Basically, it sums only the values in B2:B18 where both MATCH functions return #N/A (meaning the state is in neither group list).
Use a VLOOKUP against a table that maps each state to its group. Then, for each group total, add a row's value if its looked-up group matches, or add 0 otherwise.
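The lookup-table idea can also be sketched outside Excel. The Python below maps each state to a group, defaults to Group 2, and sums per group using the question's data. Two things are assumptions rather than facts from the thread: the ALIASES table that normalizes abbreviations such as S. Dakota, and the exclusion of Puerto Rico, which is inferred from the asker's Group 2 total (6365 across 9 states). The helper names are hypothetical.

```python
GROUP_1 = {"California", "District of Columbia", "Florida", "Hawaii", "Illinois",
           "Michigan", "Nevada", "New York", "Pennsylvania", "Texas"}
GROUP_3 = {"Iowa", "Idaho", "Kansas", "Maine", "Missouri", "Montana",
           "North Dakota", "Nebraska", "New Hampshire", "South Dakota",
           "Utah", "Wyoming"}
# Assumption: abbreviations in the data are normalized to the legend's names.
ALIASES = {"S. Dakota": "South Dakota", "N. Carolina": "North Carolina"}
# Assumption: Puerto Rico is skipped, matching the asker's Group 2 total.
EXCLUDED = {"Puerto Rico"}

data = {
    "Connecticut": 624, "Georgia": 818, "Washington": 10, "Arkansas": 60,
    "New Jersey": 118, "Ohio": 2797, "N. Carolina": 336, "Illinois": 168,
    "California": 186, "Utah": 69, "Texas": 183, "Minnesota": 172,
    "Kansas": 945, "Florida": 113, "Arizona": 1430, "S. Dakota": 293,
    "Puerto Rico": 184,
}


def group_of(state):
    """Look up a state's group; anything not listed falls into Group 2."""
    state = ALIASES.get(state, state)
    if state in GROUP_1:
        return "Group 1"
    if state in GROUP_3:
        return "Group 3"
    return "Group 2"


def group_totals(data):
    """Sum the totals per group, skipping excluded entries."""
    totals = {"Group 1": 0, "Group 2": 0, "Group 3": 0}
    for state, total in data.items():
        if state not in EXCLUDED:
            totals[group_of(state)] += total
    return totals


totals = group_totals(data)
print(totals)
```

With these assumptions the sketch reproduces the totals given in the question (650, 6365 and 1307); the same legend-plus-default logic could be ported to a VBA macro or a lookup column.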