SQL in R: HAVING condition with only the condition of one row?

I am learning to use SQL in R.
I want to select cities that are more northern than Berlin from a dataset.
For this, I have tried:
sql4 = "SELECT Name, Latitude FROM city HAVING Latitude > Latitude(Berlin)"
result4 = dbSendQuery(con, sql4)
df4 = dbFetch(result4)
head(df4)
and
sql4 = "SELECT Name, Latitude FROM city HAVING Latitude > (Name IN 'Berlin')"
result4 = dbSendQuery(con, sql4)
df4 = dbFetch(result4)
head(df4)
Unfortunately, neither syntax works.
So my question is: how do I select all cities "north of Berlin", i.e. with a Latitude value higher than that of the row where Name is 'Berlin'? Is there a different, better approach?

Assuming Berlin occurs at most once in the city table, you may use:
SELECT Name, Latitude
FROM city
WHERE Latitude > (SELECT Latitude FROM city WHERE Name = 'Berlin');
You want to be using WHERE here to filter, rather than HAVING. HAVING is for filtering aggregates when using GROUP BY, which you are not using.
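If Berlin could appear more than once, a bare scalar subquery would fail or pick an arbitrary row, depending on the database. A minimal sketch of a defensive variant, aggregating the subquery so it always yields a single value:
SELECT Name, Latitude
FROM city
WHERE Latitude > (SELECT MAX(Latitude) FROM city WHERE Name = 'Berlin');
Using MAX means a city only qualifies if it lies north of every row named Berlin; swap in MIN if the opposite convention suits you.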

You cannot actually use Latitude(Berlin) as if it were a function. I typically use a subquery like this:
SELECT Name, Latitude FROM city WHERE Latitude > (SELECT Latitude FROM city WHERE Name = 'Berlin')
Hope this helps.

Related

SQL query with many 'AND NOT CONTAINS' statements

I am trying to exclude timezones that have a substring in them so I only have records likely from the US.
The query works fine (e.g., the first line after the OR will remove local_timezones that include 'Africa/Abidjan'), but there's got to be a better way to write it.
It's too verbose, repetitive, and I suspect it's slower than it could be. Any advice greatly appreciated. (I'm using Snowflake's flavor of SQL but not sure that matters in this case).
NOTE: I'd like to keep a timezone such as America/Los_Angeles, but not America/El_Salvador, so for this reason I don't think wildcards are a good solution.
SELECT a_col
FROM a_table
WHERE
(country = 'United States')
OR
((country is NULL and not contains (local_timezone, 'Africa')
AND
country is NULL and not contains (local_timezone, 'Asia')
AND
country is NULL and not contains (local_timezone, 'Atlantic')
AND
country is NULL and not contains (local_timezone, 'Australia')
AND
country is NULL and not contains (local_timezone, 'Etc')
AND
country is NULL and not contains (local_timezone, 'Europe')
AND
country is NULL and not contains (local_timezone, 'Araguaina')
etc etc
If you have a known list of "good things", I would make a table and then just JOIN to it. Here I made you a list of good timezones:
CREATE TABLE acceptable_timezone (tz_name text) AS
SELECT * FROM VALUES
('Pacific/Auckland'),
('Pacific/Fiji'),
('Pacific/Tahiti');
I love me some Pacific... now we have some important data in a CTE:
WITH data(id, timezone) AS (
SELECT * FROM VALUES
(1, 'Pacific/Auckland'),
(2, 'Pacific/Fiji'),
(3, 'America/El_Salvador')
)
SELECT d.*
FROM data AS d
JOIN acceptable_timezone AS a
ON a.tz_name = d.timezone
ORDER BY 1;
which correctly does not match the El Salvador row:
ID | TIMEZONE
---+-----------------
 1 | Pacific/Auckland
 2 | Pacific/Fiji
You cannot get much faster than an equijoin. But if your data has the timezones as substrings, the table can instead hold wildcard patterns (%), and you can use a LIKE, just as Felipe's answer does, but as the join condition:
JOIN acceptable_timezone AS a
ON d.timezone LIKE a.tz_name
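For example, a sketch of that pattern table (the pattern values here are hypothetical, picked to keep America/Los_Angeles but not America/El_Salvador):
CREATE TABLE acceptable_timezone (tz_name text) AS
SELECT * FROM VALUES
('Pacific/%'),
('America/Los_Angeles');
Each row now acts as a LIKE pattern rather than an exact name: 'Pacific/%' matches every Pacific zone, while only the one literal America entry gets through.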
You can use LIKE ANY:
with data as
(select null country, 'something Australia maybe' local_timezone)
select *
from data
where country = 'United States'
or (
country is null
and not local_timezone like any ('%Australia%', '%Africa%', '%Atlantic%')
)
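Applied to the original query, LIKE ANY collapses the whole chain of AND NOT contains(...) predicates into one (the list below covers only the substrings shown in the question; extend it as needed):
SELECT a_col
FROM a_table
WHERE country = 'United States'
OR (
  country IS NULL
  AND NOT local_timezone LIKE ANY ('%Africa%', '%Asia%', '%Atlantic%', '%Australia%', '%Etc%', '%Europe%', '%Araguaina%')
);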

How Do I Programmatically Use "Count" In Pyspark?

Trying to do a simple count in PySpark programmatically, but coming up with errors. .count() works at the end of the statement if I drop AS (count(city)), but I need the count to appear inside the query, not outside.
result = spark.sql("SELECT city AS (count(city)) AND business_id FROM business WHERE city = 'Reading'")
One of many errors
Py4JJavaError: An error occurred while calling o24.sql.
: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '(' expecting ')'(line 1, pos 21)
== SQL ==
SELECT city AS (count(city)) AND business_id FROM business WHERE city = 'Reading'
---------------------^^^
Your syntax is incorrect. Maybe you want to do this instead:
result = spark.sql("""
    SELECT
        count(city) OVER (PARTITION BY city) AS city_count,
        business_id
    FROM business
    WHERE city = 'Reading'
""")
You need to provide a window if you use count without group by. In this case, you probably want a count for each city.
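If you don't actually need business_id on every row, a plain GROUP BY sketch returns one row per city with its count (same business table assumed):
SELECT city, count(city) AS city_count
FROM business
WHERE city = 'Reading'
GROUP BY city
The window version keeps every business_id and repeats the count on each row; the GROUP BY version collapses to one row per city.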
This is just my own workaround for the problem I'm trying to solve; the answer above is where I'd like to end up:
result = spark.sql("SELECT count(*) FROM business WHERE city='Reading'")

Groupby issue on multiple join using Grafana's (TimescaleDB) SQL plugin

I'm using Grafana's SQL plugin to query a TimescaleDB database.
The DB stores weather information as
| timestamp | location_id | type_id | value |
where location_id and type_id are foreign keys to the locations table (describing the locations) and the weather_data_types table (defining the measurement types: temperature, relative_humidity, ...).
I'd like to query data on a time range, grouped by location and type.
I manage to group by one of them, but not both.
This works and groups by location:
SELECT
$__timeGroupAlias("timestamp", $__interval),
avg(value),
locations.name
FROM weather_data
JOIN locations ON weather_data.location_id = locations.id
GROUP BY 1, locations.name
ORDER BY 1
This works and groups by type:
SELECT
$__timeGroupAlias("timestamp", $__interval),
avg(value),
weather_data_types.name
FROM weather_data
JOIN weather_data_types ON weather_data.type_id = weather_data_types.id
GROUP BY 1, weather_data_types.name
ORDER BY 1
This does not work:
SELECT
$__timeGroupAlias("timestamp", $__interval),
avg(value),
locations.name,
weather_data_types.name
FROM weather_data
JOIN locations ON weather_data.location_id = locations.id
JOIN weather_data_types ON weather_data.type_id = weather_data_types.id
GROUP BY 1, locations.name, weather_data_types.name
ORDER BY 1
More specifically, I get the following error
Value column must have numeric datatype, column: name type: string value: relative_humidity
It seems the third GROUP BY column (silently) doesn't take effect and weather_data_types.name is returned as a plain column, which Grafana complains about because it can't plot strings.
Changing this to return the (integer) id instead removes the error message:
SELECT
$__timeGroupAlias("timestamp", $__interval),
avg(value),
locations.name,
weather_data_types.id
FROM weather_data
JOIN locations ON weather_data.location_id = locations.id
JOIN weather_data_types ON weather_data.type_id = weather_data_types.id
GROUP BY 1, locations.name, weather_data_types.id
ORDER BY 1
but two series are plotted, avg and id, which shows that the group by on type is not applied as intended.
Is there anything wrong in my query? Is it an issue with the Grafana plugin?
I don't think it matters, but here's the model, defined with SQLAlchemy and hopefully self-explanatory.
# imports and Base implied by the model excerpt
import sqlalchemy as sqla
import sqlalchemy.orm

Base = sqla.orm.declarative_base()

class Location(Base):
    __tablename__ = "locations"

    id = sqla.Column(sqla.Integer, primary_key=True)
    name = sqla.Column(sqla.String(80), unique=True, nullable=False)
    country = sqla.Column(sqla.String(80), nullable=False)
    latitude = sqla.Column(sqla.Float(), nullable=False)
    longitude = sqla.Column(sqla.Float(), nullable=False)

class WeatherDataTypes(Base):
    __tablename__ = "weather_data_types"

    id = sqla.Column(sqla.Integer, primary_key=True)
    name = sqla.Column(sqla.String(80), unique=True, nullable=False)
    description = sqla.Column(sqla.String(500), nullable=False)
    unit = sqla.Column(sqla.String(20), nullable=False)
    min_value = sqla.Column(sqla.Float)
    max_value = sqla.Column(sqla.Float)

class WeatherData(Base):
    __tablename__ = "weather_data"

    timestamp = sqla.Column(sqla.DateTime(timezone=True), primary_key=True)
    location_id = sqla.Column(
        sqla.Integer,
        sqla.ForeignKey('locations.id'),
        nullable=False,
        primary_key=True,
    )
    location = sqla.orm.relationship('Location')
    type_id = sqla.Column(
        sqla.Integer,
        sqla.ForeignKey('weather_data_types.id'),
        nullable=False,
        primary_key=True,
    )
    type = sqla.orm.relationship('WeatherDataTypes')
    value = sqla.Column(sqla.Float)
Sending requests directly to postgresql helped me understand what is happening.
Apparently, when the query returns a column of values and a column of strings, the Grafana plugin assumes the values are to be plotted and the string column is meant to be used as labels for the plots.
I thought the plugin used the GROUP BY columns to extract label information, but this magic doesn't work with two string columns, because the plugin won't concatenate the values itself. It therefore complains that the second string column is not numeric, which is somewhat misleading, since it doesn't complain about the first string column.
I could get it to work by concatenating the values I use for the groupby into a single column:
SELECT
time_bucket('21600s',"timestamp") AS "time",
avg(value),
CONCAT(locations.name, ' ', weather_data_types.name) AS "name"
FROM weather_data
JOIN locations ON weather_data.location_id = locations.id
JOIN weather_data_types ON weather_data.type_id = weather_data_types.id
GROUP BY 1, locations.name, weather_data_types.name
ORDER BY 1
This returns
time | avg | name
------------------------+--------------------+---------------------------
which is correctly interpreted by the plugin.

Creating a Hive view

I have a Hive UDF named find_distance which calculates the coordinate distance between a pair of lat-long coordinates.
I also have a table containing a list of city names and their respective lat-long coordinates.
So currently if I need to find the distance between two cities, say Denver and San Jose, I need to perform a self join:
Select find_Distance(cityA.latitude, cityA.longitude, cityB.latitude, cityB.longitude) from
(select latitude, longitude from city_data.map_info where city = 'Denver') cityA
join
(select latitude, longitude from city_data.map_info where city = 'San Jose') cityB;
How would I go about building a view that would accept just the city names as parameters? So in effect I could just use
SELECT distance from city_distance where cityA = 'Denver' and cityB = 'San Jose'
Try this VIEW:
CREATE VIEW city_distance AS
SELECT
cityA.city as city_from,
cityB.city as city_to,
find_Distance(cityA.latitude, cityA.longitude, cityB.latitude, cityB.longitude) as distance
FROM
(SELECT city, latitude, longitude FROM city_data.map_info) cityA
CROSS JOIN
(SELECT city, latitude, longitude FROM city_data.map_info) cityB;
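With the view in place, the lookup from the question becomes (note the view's column names):
SELECT distance FROM city_distance WHERE city_from = 'Denver' AND city_to = 'San Jose';
One caveat: the CROSS JOIN enumerates every pair of cities, so the view can be expensive to evaluate on a large map_info table.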

Pig Latin query using group by and MAX function

Given the table:
Place(name, province, population, mayorid)
How would you write in Pig Latin the following query?
Return for each province the place(s) with the largest population. Your result set should have the province name, the place name and the population of that place.
Haven't tested this, but something like
-- declare types so ORDER BY population sorts numerically, not lexically
places = LOAD 'placesInput' AS (name:chararray, province:chararray, population:long, mayorid:int);
placesProjected = FOREACH places GENERATE name, province, population;
placesGrouped = GROUP placesProjected BY province;
biggestPlaces = FOREACH placesGrouped {
    sorted = ORDER placesProjected BY population DESC;
    maxPopulation = LIMIT sorted 1;
    GENERATE group AS province, FLATTEN(maxPopulation.name) AS name, FLATTEN(maxPopulation.population) AS population;
};
oughta work. One caveat: LIMIT sorted 1 keeps a single row per province, so if several places tie for the largest population, only one of them is returned.