sqoop import using free form query - sql

sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--target-dir /axl172930/loudacre/pset1 \
--split-by acct_num \
--query 'SELECT first_name,last_name,acct_num,city,state from accounts a
JOIN (SELECT account_id, count(device_id) as num_of_devices
FROM accountdevice group by account_id
HAVING count(device_id) = 1)d ON a.acct_num = d.account_id
WHERE $CONDITIONS'
The question is as follows: Import the first name, last name, account number, city and state of the accounts having exactly 1 device.
accounts and accountdevice are tables. When I used the DISTINCT keyword inside the count function, I got a different number of records. Which approach is correct for the above question? Please also suggest whether the answer can be obtained without a subquery.

I think the query below should satisfy your requirement. It joins accounts directly to accountdevice, so no subquery is needed:
SELECT a.first_name,a.last_name,a.acct_num,a.city,a.state,count(d.device_id)
FROM accounts a JOIN accountdevice d on a.acct_num = d.account_id
GROUP BY a.acct_num HAVING count(d.device_id) = 1;
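On the DISTINCT part of the question, here is a minimal sketch of why the two counts can disagree. It uses Python's built-in sqlite3 instead of MySQL and Sqoop, and the rows are made up for illustration; only the table and column names come from the question. If accountdevice can hold several rows for the same device on one account, count(device_id) counts rows while count(DISTINCT device_id) counts devices, so which one is "correct" depends on what a row in accountdevice means.

```python
import sqlite3

# Hypothetical sample data; only table and column names come from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (acct_num INTEGER PRIMARY KEY, first_name TEXT,
                       last_name TEXT, city TEXT, state TEXT);
CREATE TABLE accountdevice (id INTEGER PRIMARY KEY, account_id INTEGER,
                            device_id INTEGER);
INSERT INTO accounts VALUES
  (1, 'Ann', 'Lee', 'Reno', 'NV'),
  (2, 'Bob', 'Ray', 'Austin', 'TX'),
  (3, 'Cy', 'Poe', 'Salem', 'OR');
-- account 1: one row; account 2: two rows for the same device;
-- account 3: two different devices
INSERT INTO accountdevice (account_id, device_id)
VALUES (1, 10), (2, 20), (2, 20), (3, 30), (3, 31);
""")

# Join + GROUP BY + HAVING, without a subquery.
QUERY = """
SELECT a.acct_num
FROM accounts a JOIN accountdevice d ON a.acct_num = d.account_id
GROUP BY a.acct_num
HAVING COUNT({expr}) = 1
ORDER BY a.acct_num
"""

# Accounts with exactly one accountdevice row.
by_rows = [r[0] for r in conn.execute(QUERY.format(expr="d.device_id"))]
# Accounts with exactly one distinct device_id.
by_distinct = [r[0] for r in conn.execute(QUERY.format(expr="DISTINCT d.device_id"))]

print(by_rows)
print(by_distinct)
```

In this made-up data, account 2 only qualifies under the DISTINCT version, because its two rows point at the same device.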


Pyspark question making count result into a dataframe

I have a PySpark function that looks like this:
spark.sql("select count(*) from student_table where student_id is NULL")
spark.sql("select count(*) from student_table where student_scores is NULL")
spark.sql("select count(*) from student_table where student_health is NULL")
I get a result that looks like this:
+--------+
|count(1)|
+--------+
|       0|
+--------+
+--------+
|count(1)|
+--------+
|     100|
+--------+
+--------+
|count(1)|
+--------+
|   24145|
+--------+
What I want to do is to make the result into a dataframe for each column by using pandas or pyspark function.
The result should have each null value result for each column.
For example,
Thanks in advance if someone can help me out.
You could use union between the 3 queries but actually you can get all null counts for each column using one query:
spark.sql("""
SELECT SUM(INT(student_id IS NULL)) AS student_id_nb_null,
SUM(INT(student_scores IS NULL)) AS student_scores_nb_null,
SUM(INT(student_health IS NULL)) AS student_health_nb_null
FROM student_table
""").show()
#+------------------+----------------------+----------------------+
#|student_id_nb_null|student_scores_nb_null|student_health_nb_null|
#+------------------+----------------------+----------------------+
#|                 0|                   100|                 24145|
#+------------------+----------------------+----------------------+
Or by using DataFrame API with:
import pyspark.sql.functions as F
df.agg(
F.sum(F.col("student_id").isNull().cast("int")).alias("student_id_nb_null"),
F.sum(F.col("student_scores").isNull().cast("int")).alias("student_scores_nb_null"),
F.sum(F.col("student_health").isNull().cast("int")).alias("student_health_nb_null")
)
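To check the single-pass idea without a Spark session, here is a small sketch using Python's built-in sqlite3 as a stand-in for Spark SQL. The rows are invented, and in SQLite the expression "col IS NULL" already evaluates to 0 or 1, so the INT(...) cast from the Spark version is not needed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student_table "
    "(student_id INTEGER, student_scores REAL, student_health TEXT)"
)
# Hypothetical rows, chosen so each column has a different number of NULLs.
conn.executemany(
    "INSERT INTO student_table VALUES (?, ?, ?)",
    [(1, 90.0, None), (2, None, None), (3, None, "ok"), (4, 75.5, None)],
)

columns = ["student_id", "student_scores", "student_health"]

# One SUM(<col> IS NULL) per column, all evaluated in a single scan.
select_list = ", ".join(f"SUM({c} IS NULL) AS {c}_nb_null" for c in columns)
row = conn.execute(f"SELECT {select_list} FROM student_table").fetchone()

# Map each column name to its NULL count.
null_counts = dict(zip(columns, row))
print(null_counts)
```

Building the select list from a list of column names keeps the query in one place if more columns need checking later.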
Use union all and add all your queries in one spark.sql.
Example:
spark.sql("""select "student_id" `column_name`,count(*) `null_result` from tmp where student_id is null \
union all \
select "student_scores" `column_name`,count(*) `null_result` from tmp where student_scores is null \
union all \
select "student_health" `column_name`,count(*) `null_result` from tmp where student_health is null""").\
show()

Join-Group PySpark - SQL to PySpark

I am trying to join 2 tables based on this SQL query using pyspark.
%sql
SELECT c.cust_id, avg(b.gender_score) AS pub_masc
FROM df c
LEFT JOIN pub_df b
ON c.pp = b.pp
GROUP BY c.cust_id
I tried the following in PySpark, but I am not sure it's the right way. I was stuck trying to display my data, so I just chose .max:
df.select('cust_id', 'pp') \
.join(pub_df, on = ['pp'], how = 'left')\
.avg(gender_score) as pub_masc
.groupBy('cust_id').max()
any help would be appreciated.
Thanks in advance
Your Python code contains an invalid line .avg(gender_score) as pub_masc. Also you should group by and then average, not the other way round.
import pyspark.sql.functions as F
df.select('cust_id', 'pp') \
.join(pub_df, on = ['pp'], how = 'left')\
.groupBy('cust_id')\
.agg(F.avg('gender_score').alias('pub_masc'))
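If you want to sanity-check what the join-then-group query should return before porting it, here is a minimal sketch that runs the same SQL in Python's built-in sqlite3 (not Spark). The rows are hypothetical; the table and column names are the ones from the question. Note what happens to a customer with no match in pub_df: the LEFT JOIN keeps the row and the average comes back as NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE df (cust_id INTEGER, pp TEXT);
CREATE TABLE pub_df (pp TEXT, gender_score REAL);
-- Hypothetical rows: customer 1 matches two publications, customer 2 none.
INSERT INTO df VALUES (1, 'a'), (1, 'b'), (2, 'c');
INSERT INTO pub_df VALUES ('a', 0.25), ('b', 0.75);
""")

# Join first, then group, then aggregate: the same order as the PySpark chain.
rows = conn.execute("""
SELECT c.cust_id, AVG(b.gender_score) AS pub_masc
FROM df c
LEFT JOIN pub_df b ON c.pp = b.pp
GROUP BY c.cust_id
ORDER BY c.cust_id
""").fetchall()

for cust_id, pub_masc in rows:
    print(cust_id, pub_masc)
```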

Discriminate data from functions aggregate on PostgreSQL

On PostgreSQL I have a database for a pizza restaurant.
With this code:
SELECT command.id_command, array_agg(history_state_of_command.state)
FROM command JOIN history_state_of_command
ON command.id_command = history_state_of_command.id_command
GROUP BY command.id_command
I obtain these results, with the id of a command and the associated state of command:
command.id_command State of command
1 "{Pizza_Order,Pizza_in_preparation,Pizza_prepared,Pizza_ready_for_delivery,Pizza_delivering,Pizza_deliver}"
2 "{Pizza_Order,Pizza_in_preparation}"
3 "{Pizza_Order,Pizza_in_preparation,Pizza_prepared,Pizza_ready_for_delivery,Pizza_delivering,Pizza_deliver,"Command cancelled"}"
4 "{Pizza_Order,Pizza_in_preparation,Pizza_prepared,Pizza_ready_for_delivery,Pizza_delivering,Pizza_deliver}"
I would like to find an SQL code where I obtain only id of command where the pizza was never prepared:
command.id_command State of command
2 "{Pizza_Order,Pizza_in_preparation}"
Many thanks for your help !
You can use a correlated subquery to find this command:
select distinct h.id_command
from history_state_of_command h
where h.state in ('Pizza_Order', 'Pizza_in_preparation')
and not exists (
select 1
from history_state_of_command i
where i.id_command = h.id_command and i.state = 'Pizza_prepared'
)
You can use aggregation as well:
select hsc.id_command
from history_state_of_command hsc
group by hsc.id_command
having count(*) filter (where hsc.state = 'Pizza_prepared') = 0;
Note: This assumes that every command has at least one row in the history. If not, then use not exists:
select c.*
from command c
where not exists (select 1
from history_state_of_command hsc
where hsc.id_command = c.id_command and hsc.state = 'Pizza_prepared'
);
This is probably the most efficient method, with appropriate indexes.
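A small sketch of the difference between the two approaches, using Python's built-in sqlite3 rather than PostgreSQL. Commands 1 and 2 follow the question's data (shortened); command 5, which has no history rows at all, is a hypothetical addition to show the edge case from the note. Because older SQLite builds lack the FILTER clause, the aggregation here uses SUM(CASE ...) as a portable stand-in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE command (id_command INTEGER PRIMARY KEY);
CREATE TABLE history_state_of_command (id_command INTEGER, state TEXT);
-- Command 5 is hypothetical: it exists but has no history rows.
INSERT INTO command VALUES (1), (2), (5);
INSERT INTO history_state_of_command VALUES
  (1, 'Pizza_Order'), (1, 'Pizza_in_preparation'), (1, 'Pizza_prepared'),
  (2, 'Pizza_Order'), (2, 'Pizza_in_preparation');
""")

# NOT EXISTS against the command table: also finds commands with no history.
never_prepared = [r[0] for r in conn.execute("""
SELECT c.id_command
FROM command c
WHERE NOT EXISTS (SELECT 1
                  FROM history_state_of_command hsc
                  WHERE hsc.id_command = c.id_command
                    AND hsc.state = 'Pizza_prepared')
ORDER BY c.id_command
""")]

# Aggregation over the history table: only sees commands that have history.
from_history = [r[0] for r in conn.execute("""
SELECT hsc.id_command
FROM history_state_of_command hsc
GROUP BY hsc.id_command
HAVING SUM(CASE WHEN hsc.state = 'Pizza_prepared' THEN 1 ELSE 0 END) = 0
ORDER BY hsc.id_command
""")]

print(never_prepared)
print(from_history)
```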

Batch processing versus Single row transactions for atomicity

I have two tables; one to hold records of reports generated, and the other to update a flag that the reports have been generated. This script will be scheduled, and the SQLs have been implemented. However, there are two implementations of the script:
Case 1:
- Insert all the records, then
- Update all the flags,
- Commit if all is well
Case 2:
While (there are records)
- Insert a record,
- Update the flag
- Commit if all is well
Which should be preferred and why?
In Case 1, a single transaction covers all the inserts and then all the updates: it's all or nothing. I believe this is faster, unless the connection to the remote database keeps getting interrupted. It requires very little client-side processing. But if the inserts fail midway, we'll have to rerun from the top.
In Case 2, each transaction covers one insert and one update. This requires keeping track of the inserted records and updating those specific records. I'll have to use placeholders, and while the database may cache the SQL and reuse the query execution plan, I suspect this would be slower than Case 1 because of the additional client-side processing. However, on an unreliable connection, which we can assume, this looks like the better choice.
EDIT 5/11/2015 11:31AM
CASE 1 snippet:
my $sql = "INSERT INTO eval_rep_track_dup\#prod \
select ert.* \
from eval_rep_track ert \
inner join \
(
select erd.evaluation_fk, erd.report_type, LTRIM(erd.assign_group_id, '/site/') course_name \
from eval_report_dup\#prod erd \
inner join eval_report er \
on er.id = erd.id \
where erd.status='queue' \
and er.status='done' \
) cat \
on ert.eval_id = cat.evaluation_fk \
and ert.report_type = cat.report_type \
and ert.course_name = cat.course_name";
my $sth = $dbh->prepare($sql) or die "Error with sql statement : $DBI::errstr\n";
my $noterror = $sth->execute() or die "Error in sql statement : " . $sth->errstr . "\n";
...
# update the status from queue to done
$sql = "UPDATE eval_report_dup\#prod \
SET status='done' \
WHERE id IN \
( \
select erd.id \
from eval_report_dup\#prod erd \
inner join eval_report er \
on er.id = erd.id \
where erd.status='queue' \
and er.status='done' \
)";
$sth = $dbh->prepare($sql);
$sth->execute();
eval_rep_track_dup has 3 number columns, 8 varchar2 columns and 1 timestamp column.
eval_report_dup has 10 number columns, 8 varchar2 columns and 3 timestamp columns.
Hi
Well, if it were up to me I would use the latter method (Case 2). The principal reason is that if the server or program went down in the middle of processing, you could easily restart the job.
Good luck
pj
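To make the restart argument concrete, here is a minimal sketch of Case 2 in Python with the built-in sqlite3 module, not the Perl DBI and remote database from the question. The tables report_queue and report_track, the function process_pending and its fail_on parameter are all hypothetical names invented for the illustration. Each record gets its own transaction, and a failure is simulated on one record to show that a rerun picks up where the first run stopped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE report_queue (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE report_track (report_id INTEGER PRIMARY KEY);
INSERT INTO report_queue VALUES (1, 'queue'), (2, 'queue'), (3, 'queue');
""")


def process_pending(conn, fail_on=None):
    """Case 2: one transaction per record; returns how many were committed."""
    pending = [r[0] for r in conn.execute(
        "SELECT id FROM report_queue WHERE status = 'queue' ORDER BY id")]
    done = 0
    for report_id in pending:
        try:
            with conn:  # commits on success, rolls back if the block raises
                conn.execute(
                    "INSERT INTO report_track (report_id) VALUES (?)",
                    (report_id,))
                if report_id == fail_on:
                    # Simulated failure between the insert and the flag update.
                    raise RuntimeError("simulated connection loss")
                conn.execute(
                    "UPDATE report_queue SET status = 'done' WHERE id = ?",
                    (report_id,))
        except RuntimeError:
            break  # stop here; records committed so far stay committed
        done += 1
    return done


first_run = process_pending(conn, fail_on=2)   # fails on record 2
second_run = process_pending(conn)             # rerun resumes with 2 and 3
tracked = [r[0] for r in conn.execute(
    "SELECT report_id FROM report_track ORDER BY report_id")]
print(first_run, second_run, tracked)
```

Because the half-finished insert for the failing record is rolled back together with its missing flag update, the rerun does not hit a duplicate key and does not need to know what the first run did.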

More than one row returned by a subquery in simple SQL

I have these tables:
person(pid, name,email,phone,city)
ride(rid,pid,date,spots,start,target) [rideID, personID- the person who gives the ride, spots= open slots in the ride,]
participate(pid,rid)- person pid participates in ride rid
I have to write the query
findRidesForUser (pid,date)
which gives me the contact details of all the people who offer a ride on the given date that starts in the city where pid lives, i.e., where ride.start=pid.city.
I'm trying to use
"SELECT person.name, person.email, person.phone, person.city \
FROM person WHERE pid=(\
SELECT pid FROM ride WHERE date='%s' AND \
ride.start= (SELECT city FROM person WHERE person.pid=pid))"
But it gives me the error:
Error executing query: ERROR: more than one row returned by a subquery used as an expression
You should be looking to join the two tables on the appropriate keys:
SELECT
p.name,
p.email,
p.phone,
p.city
FROM person p
JOIN ride r
ON (p.pid = r.pid)
WHERE r.date = 'desiredDate'
AND r.start = (SELECT city FROM person WHERE pid = 'userPid')
Where 'desiredDate' and 'userPid' are the input parameters of findRidesForUser (pid,date)
Inside the innermost subquery, person.pid and pid refer to the same column, so person.pid=pid is the same as saying 1=1. Also, pid= implies that you want only one result back, but you're getting more than one, so either limit the subquery to one row (with top or limit) or change the = to an 'in'. Using 'in' and fixing the sub-subquery looks like this:
"SELECT person.name, person.email, person.phone, person.city \
FROM person WHERE pid in (\
SELECT pid FROM ride WHERE date='%s' AND \
ride.start= (SELECT city FROM person as person1 WHERE person.pid=person1.pid))"
Though I think this is the same thing:
"SELECT person.name, person.email, person.phone, person.city \
FROM person WHERE pid in (\
SELECT pid FROM ride WHERE date='%s' AND \
ride.start= city)"
Try using "In" instead of using "="
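For reference, here is a sketch of the JOIN version wrapped in a function, run against Python's built-in sqlite3 with made-up rows (the question appears to use PostgreSQL). The function name find_rides_for_user mirrors findRidesForUser from the question, and the sample people and rides are hypothetical. It binds pid and date as parameters instead of formatting them into the string with '%s'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (pid INTEGER PRIMARY KEY, name TEXT, email TEXT,
                     phone TEXT, city TEXT);
CREATE TABLE ride (rid INTEGER PRIMARY KEY, pid INTEGER, date TEXT,
                   spots INTEGER, start TEXT, target TEXT);
-- Hypothetical data: person 1 is the user looking for a ride from Haifa.
INSERT INTO person VALUES
  (1, 'Ann', 'ann@example.com', '111', 'Haifa'),
  (2, 'Bob', 'bob@example.com', '222', 'Haifa'),
  (3, 'Cy', 'cy@example.com', '333', 'Eilat');
INSERT INTO ride VALUES
  (10, 2, '2015-05-11', 3, 'Haifa', 'Eilat'),
  (11, 3, '2015-05-11', 2, 'Haifa', 'Acre'),
  (12, 3, '2015-05-11', 1, 'Eilat', 'Haifa'),
  (13, 2, '2015-05-12', 3, 'Haifa', 'Acre');
""")


def find_rides_for_user(conn, pid, date):
    """Contact details of people offering a ride on `date` from pid's city."""
    return conn.execute(
        """
        SELECT p.name, p.email, p.phone, p.city
        FROM person p
        JOIN ride r ON p.pid = r.pid
        WHERE r.date = ?
          AND r.start = (SELECT city FROM person WHERE pid = ?)
        ORDER BY p.pid
        """,
        (date, pid),
    ).fetchall()


offers = find_rides_for_user(conn, 1, "2015-05-11")
for offer in offers:
    print(offer)
```

The inner subquery is safe with = here because pid is the primary key of person, so it returns at most one city.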