Querying Stackoverflow public dataset on BigQuery on Q&A SQL - sql

I have a homework assignment:
We want to find all questions about the Python pandas library, as well as their answers.
Write a query that retrieves all the questions whose title contains the word "pandas" from the posts_questions table, along with all the corresponding answers for each such question from the posts_answers table. Each row in the returned table represents a (question, answer) pair, so a question with several answers appears in multiple rows.
The returned table should include the following fields of the question: id, title, tags, answer_count, score, and creation time (creation_date),
as well as the body of the text (body) of both the question and the answer. In the body, all newline characters ('\n') must be removed.
For this I wrote the following SQL code:
SELECT tb1.id as q_id,tb1.title as q_title,tb1.tags as q_tags
,tb1.creation_date as q_creation_date,tb1.score as q_score,tb1.answer_count as q_answer_count
,REPLACE(tb1.body,'\n',' ') as body_qustion,REPLACE(tb2.body,'\n',' ') as body_answer
from `bigquery-public-data.stackoverflow.posts_questions` as tb1
left join `bigquery-public-data.stackoverflow.posts_answers` as tb2
on tb1.id=tb2.id
where( tb1.title like "%pandas%" or tb1.title like "%Pandas%" or tb1.title like "%PANDAS%")
group by tb1.id ,tb1.title ,tb1.tags,tb1.creation_date,tb1.score
,tb1.answer_count,body_qustion,body_answer
The problem is that when a question has, for example, 3 answers, I expect 3 rows for that question, but the query returns only one, and I don't know why.
The data is in bigquery-public-data.stackoverflow.posts_questions
and bigquery-public-data.stackoverflow.posts_answers.

You joined on the wrong column of the answers table. In posts_answers, the id column is the ID of the answer itself, whereas parent_id is the ID of the question it answers. You can experiment with the query below to get a better understanding.
Query:
SELECT
q.id AS q_id, # id of the question in the questions table
a.id AS a_id, # id of the answer in the answers table
q.title AS q_title,
q.tags AS q_tags,
q.creation_date AS q_creation_date,
q.score AS q_score,
q.answer_count AS q_answer_count,
REPLACE(q.body,'\n',' ') AS body_qustion,
REPLACE(a.body,'\n',' ') AS body_answer
FROM
`bigquery-public-data.stackoverflow.posts_questions` q
LEFT JOIN
`bigquery-public-data.stackoverflow.posts_answers` a
ON
q.id = a.parent_id # joining on the question ID
WHERE
LOWER(q.title) LIKE '%pandas%'
AND q.creation_date BETWEEN '2021-01-01'
AND '2021-01-31'
AND q.answer_count >1
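The effect of the join column can be reproduced on a toy dataset. The following is only a sketch: it uses Python's built-in sqlite3 module instead of BigQuery, and the two tables are tiny stand-ins that keep just the id, title, parent_id and body columns (the sample rows are made up for illustration).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts_questions (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE posts_answers (id INTEGER PRIMARY KEY, parent_id INTEGER, body TEXT);
    INSERT INTO posts_questions VALUES (1, 'pandas groupby question');
    -- Answer ids are independent of question ids; parent_id points at the question.
    INSERT INTO posts_answers VALUES (1, 99, 'answer to some other question');
    INSERT INTO posts_answers VALUES (10, 1, 'first answer');
    INSERT INTO posts_answers VALUES (11, 1, 'second answer');
    INSERT INTO posts_answers VALUES (12, 1, 'third answer');
""")

def answers_for_pandas_questions(join_column):
    # join_column is either "id" (the mistake) or "parent_id" (the fix).
    sql = f"""
        SELECT q.id, a.id, a.body
        FROM posts_questions AS q
        LEFT JOIN posts_answers AS a ON q.id = a.{join_column}
        WHERE LOWER(q.title) LIKE '%pandas%'
        ORDER BY a.id
    """
    return conn.execute(sql).fetchall()

wrong_rows = answers_for_pandas_questions("id")
right_rows = answers_for_pandas_questions("parent_id")
print(len(wrong_rows), len(right_rows))
```

Joining on id matches at most one unrelated answer per question, while joining on parent_id returns one row per answer.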

Related

how get "last url" for each row after joining a table full of urls [closed]

I have two tables, shown in the image linked below. I used this query to join them:
declare @ID INT = 1
select *
from PhotoAlbum_Table,Photo_Table
where PhotoAlbum_Table.ID=@ID
and ( PhotoAlbum_Table.PhotoAlbumID=Photo_Table.PhotoAlbumID)
After joining the tables I need to eliminate some records (those that have the same primary key), but I don't know how. I marked them in the image.
I want all albums of a user, each with the last photoUrl that was uploaded to that album.
link to image: http://oi58.tinypic.com/54uosj.jpg
You need to group your selections together.
DECLARE @ID int = 1;
SELECT <Choose the specific items you want> FROM PhotoAlbum_Table AS a
INNER JOIN Photo_Table AS p ON a.PhotoAlbumID = p.PhotoAlbumID
WHERE a.ID = @ID
GROUP BY <The specific items>; --Add this line
EDIT
Based on the new information, I would use:
DECLARE @ID int = 1;
WITH cte (
PhotoAlbumID,
PhotoID,
PhotoAlbumDate,
PhotoAlbumName,
PrivacyID
) AS (
SELECT a.PhotoAlbumID,
MAX(p.PhotoID),
a.PhotoAlbumDate,
a.PhotoAlbumName,
a.PrivacyID
FROM PhotoAlbum_Table AS a
INNER JOIN Photo_Table AS p ON a.PhotoAlbumID = p.PhotoAlbumID
WHERE a.ID = @ID
GROUP BY a.PhotoAlbumID,
a.PhotoAlbumDate,
a.PhotoAlbumName,
a.PrivacyID
)
SELECT cte.*, u.PhotoUrl FROM cte
INNER JOIN Photo_Table AS u ON cte.PhotoID = u.PhotoID;
Well, queries of this type are generally not easy, at least the first time. That's because SQL does not help much with defining things like "latest URL"; it's easy to fetch the "latest date" or the "greatest ID" (assuming you have some autonumbering).
For the "latest URL" you basically have to split the problem into "the latest URL modification date" and then "the URL that matches that latest date".
However, if you work on a relatively recent SQL Server (2008+), you can write it with the help of row numbering and partitioning functions. If you select everything you are selecting now and apply ROW_NUMBER() over the URL date, you will be able to pick the rows where the row number equals 1.
In short:
write a select-join similar to what you have now
add a rownumbering to it, partitioned by albumId and ordered descending by photoDate
then, you will have to wrap all of that with another query
which will filter the rows by something like where rownumber = 1
As a very rough sketch, it will look like:
select *
from
(
select
al.albumid,
...
ph.photourl,
row_number() over (
partition by al.albumid
order by ph.photoDate desc
) as rn
from albums al
join photos ph on ph.album = al.albumid
) tmp
where tmp.rn = 1
but that's a sketch, adjust that to your needs
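To see the row-numbering approach run end to end, here is a small sketch using Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer). The albums and photos tables, their columns, and the sample rows are assumptions that follow the sketch above, not the asker's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE albums (albumid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE photos (photoid INTEGER PRIMARY KEY, album INTEGER,
                         photourl TEXT, photoDate TEXT);
    INSERT INTO albums VALUES (1, 'Holiday'), (2, 'Family');
    INSERT INTO photos VALUES
        (1, 1, 'a.jpg', '2014-01-01'),
        (2, 1, 'b.jpg', '2014-03-01'),
        (3, 2, 'c.jpg', '2014-02-01');
""")

# Number the photos of each album from newest to oldest, then keep rank 1.
latest = conn.execute("""
    SELECT albumid, photourl
    FROM (
        SELECT al.albumid, ph.photourl,
               ROW_NUMBER() OVER (
                   PARTITION BY al.albumid
                   ORDER BY ph.photoDate DESC
               ) AS rn
        FROM albums AS al
        JOIN photos AS ph ON ph.album = al.albumid
    ) AS tmp
    WHERE tmp.rn = 1
    ORDER BY albumid
""").fetchall()
print(latest)
```

Each album appears once, paired with the URL of its most recent photo.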

SQL Count returning username of who has asked the most questions

I'm having trouble working out how to do an sql query and wondered if anyone could help. In my application I have users who can ask questions and I would like to implement some functionality to work out who the most active question poster is.
The table structure is as follows:
User: UserID (Primary Key), Username
Question: QuestionID (Primary Key), UserID (Foreign Key), QuestionText, DateTime Asked
What I would like to do is to find out who has asked the most questions then return their username. I'm having trouble finding answers to similar solutions on the internet. All I can do is count the number of questions asked, and the number of questions asked by different users, e.g. total number of questions asked is 9 and total number of users who have posted questions is 2.
Thanks for your help.
This selects the single user who has posted the largest number of questions.
SQL Server
SELECT TOP 1 username
FROM
(
Select u.userid,u.username,count(*) as numQuestion
From [user] u
inner join question q
on u.userid=q.userid
Group by u.userid,u.username
)Z
order by numQuestion desc
MySql
SELECT username
FROM
(
Select u.userid,u.username,count(*) as numQuestion
From user u
inner join question q
on u.userid=q.userid
Group by u.userid,u.username
)Z
order by numQuestion desc
Limit 1
or you could also try this:
select count(*) as counter,
user.username from user join question
on user.userid = question.userid group by user.userid, user.username
order by counter desc limit 1
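The group-count-and-sort idea can be checked on toy data. This is a sketch only, using Python's built-in sqlite3 module; the table and column names follow the schema in the question, and the sample users and questions are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (userid INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE question (questionid INTEGER PRIMARY KEY,
                           userid INTEGER, questiontext TEXT);
    INSERT INTO user VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO question (userid, questiontext) VALUES
        (1, 'q1'), (2, 'q2'), (2, 'q3'), (2, 'q4');
""")

# Count questions per user, sort by the count, keep the top row.
top_poster = conn.execute("""
    SELECT u.username, COUNT(*) AS numQuestion
    FROM user AS u
    INNER JOIN question AS q ON u.userid = q.userid
    GROUP BY u.userid, u.username
    ORDER BY numQuestion DESC
    LIMIT 1
""").fetchone()
print(top_poster)
```

Note that LIMIT 1 (like TOP 1) returns a single user even when several users are tied for the highest count.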

SQL count distinct values for records but filter some dups

I have an MS SQL 2008 table of survey responses and I need to produce some reports. The table is fairly basic: it has an autonumber key, a user ID for the person responding, a date, and then a bunch of fields for each individual question. Most of the questions are multiple choice and the data value in the response field is a short varchar text representation of that choice.
What I need to do is count the number of distinct responses for each choice option (ie. for question 1, 10 people answered A, 20 answered B, and so forth). That is not overly complex. However, the twist is that some people have taken the survey multiple times (so they would have the same User ID field). For these responses, I am only supposed to include the latest data in my report (based on the survey date field). What would be the best way to exclude the older survey records for those users that have multiple records?
Since you didn't give us your DB schema I've had to make some assumptions but you should be able to use row_number to identify the latest survey taken by a user.
with cte as
(
SELECT
Row_number() over (partition by userID, surveyID order by id desc) rn,
surveyID
FROM
User_survey
)
SELECT
a.answer_type,
Count(a.answer) answercount
FROM
cte
INNER JOIN Answers a
ON cte.surveyID = a.surveyID
WHERE
cte.rn = 1
GROUP BY
a.answer_type
Maybe not the most efficient query, but what about:
select userid, max(survey_date) from my_table group by userid
then you can inner join on the same table to get additional data.
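As a rough illustration of keeping only each user's latest response before counting, here is a sketch using Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer). The survey_response table, its columns (id, userid, survey_date, q1) and the rows are assumptions, since the real schema was not given.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE survey_response (id INTEGER PRIMARY KEY, userid INTEGER,
                                  survey_date TEXT, q1 TEXT);
    INSERT INTO survey_response (userid, survey_date, q1) VALUES
        (1, '2020-01-01', 'A'),   -- older response of user 1, must be ignored
        (1, '2020-02-01', 'B'),   -- latest response of user 1
        (2, '2020-01-15', 'A'),
        (3, '2020-01-20', 'B');
""")

# Rank each user's responses from newest to oldest, keep rank 1, then count.
counts = conn.execute("""
    WITH ranked AS (
        SELECT userid, q1,
               ROW_NUMBER() OVER (
                   PARTITION BY userid
                   ORDER BY survey_date DESC
               ) AS rn
        FROM survey_response
    )
    SELECT q1, COUNT(*) AS answercount
    FROM ranked
    WHERE rn = 1
    GROUP BY q1
    ORDER BY q1
""").fetchall()
print(counts)
```

User 1's earlier 'A' response is excluded, so the tally reflects one response per user.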

SQL Statement that never returns same row twice?

Requirements: I have a table of several thousand questions. Users can view these multiple choice questions and then answer them. Once a question is answered, it should not be shown to the same user again even if he logs in after a while.
Question
How would I go about doing this efficiently? Would Bloom Filters work?
Create a QuestionsAnswered table and join on it in your select. When the user answers a question, insert the question ID and the user ID into the table.
CREATE TABLE QuestionsAnswered (UserID INT, QuestionID INT);
SELECT *
FROM Question
WHERE ID NOT IN (SELECT QuestionID
FROM QuestionsAnswered
WHERE UserID = @UserID);
INSERT INTO QuestionsAnswered
(UserID, QuestionID)
VALUES
(@UserID, @QuestionID);
Could you add something to the users info in the database which contains a list of answered questions?
So when that user comes back you can only show them questions which are NOT answered?
Create a many-to-many table between users and questions (userQuestions) to store the questions that have been answered already. Then you'd only display questions that don't exist in that userQuestions table for that user.
You insert each question shown into a log table with question_id/user_id, then show him the ones that don't match:
SELECT [TOP 1] ...
FROM questions
WHERE question_id NOT IN (
SELECT question_id
FROM question_user_log
WHERE user_id = <current_user>)
[ORDER BY ...]
or
SELECT [TOP 1] ...
FROM questions AS q
LEFT OUTER JOIN question_user_log AS l ON q.question_id = l.question_id
AND l.user_id = <current_user>
WHERE l.question_id IS NULL
[ORDER BY...]
after you show the question, you
INSERT INTO question_user_log (question_id, user_id)
VALUES (<question shown>, <current_user>);
BTW, if you cannot create a table to track questions shown then you can query the questions in a deterministic order (ie. by Id or by Title) and select each time the one with the rank higher than the last rank shown (using ROW_NUMBER() in SQL Server/Oracle/DB2, or LIMIT in MySQL). You'd track the last rank shown somewhere in your user state (you do have a user state, otherwise the whole question is pointless).
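The log-table approach can be sketched with Python's built-in sqlite3 module. The table and column names follow the answer above; the helper functions unseen_questions and mark_shown, the user ids, and the sample questions are hypothetical and exist only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (question_id INTEGER PRIMARY KEY, text TEXT);
    CREATE TABLE question_user_log (question_id INTEGER, user_id INTEGER,
                                    PRIMARY KEY (question_id, user_id));
    INSERT INTO questions VALUES (1, 'first'), (2, 'second'), (3, 'third');
""")

def unseen_questions(user_id):
    # Anti-join: keep only questions with no log row for this user.
    rows = conn.execute("""
        SELECT q.question_id
        FROM questions AS q
        LEFT OUTER JOIN question_user_log AS l
          ON q.question_id = l.question_id AND l.user_id = ?
        WHERE l.question_id IS NULL
        ORDER BY q.question_id
    """, (user_id,)).fetchall()
    return [r[0] for r in rows]

def mark_shown(user_id, question_id):
    # Record that this question was shown to (or answered by) this user.
    conn.execute(
        "INSERT INTO question_user_log (question_id, user_id) VALUES (?, ?)",
        (question_id, user_id),
    )

before = unseen_questions(7)
mark_shown(7, 2)
after = unseen_questions(7)
other_user = unseen_questions(8)
print(before, after, other_user)
```

The composite primary key on the log table is a design choice of this sketch: it prevents duplicate log rows and gives the anti-join an index to use.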

How to write this challenging SQL (MySQL) command?

This is the scenario:
I am developing a website which is similar to stackoverflow.com.
After an asker/seeker posts his question, other users can post their answers to the question.
A user can post more than one answer to the same question, but usually only the latest answer will be displayed. Users can also comment on an answer. If comments are left out of consideration,
the SQL statement is:
mysql_query("SELECT * , COUNT( memberid ) AS num_revisions
FROM (
SELECT *
FROM answers
WHERE questionid ='$questionid'
ORDER BY create_moment DESC
) AS dyn_table JOIN users
USING ( memberid )
GROUP BY memberid order by answerid asc")or die(mysql_error());
When comments are taken into consideration, there will be three tables.
I want to select the latest answer each solver gave on a particular question, how many answers (num_revisions) the solver gave on the question, the name of the solver, and the comments on that latest answer.
How do I write this SQL statement? I am using MySQL.
I hope you can understand my question. If anything is unclear, just ask for clarification.
It is a little more complex than stackoverflow.com. On stackoverflow.com, you can only give one answer to a question. But on my website, a user can answer a question many times, and only the latest answer is taken seriously.
The columns of comment table are commentid, answerid,comment, giver, comment_time.
So it is question--->answer---->comment.
You can use a correlated subquery so that you only get the latest answer per member. Here's T-SQL that works like your example (only answers for a given question); you'll have to convert it to the MySQL flavour:
select *
from answers a
where questionid = '$questionid'
and answerid in (select top 1 answerid
from answers a2
where a2.questionid = a.questionid
and a2.memberid = a.memberid
order by create_moment desc)
order by create_moment
You haven't provided the schema for your comments table so I can't yet include that :)
-Krip
How about this (obviously answers will repeat if there is more than one comment):
select *
from answers a
left outer join comment c on c.answerid = a.answerid
join users u on u.memberid = a.memberid
where questionid = 1
and a.answerid in (select top 1 answerid
from answers a2
where a2.questionid = a.questionid
and a2.memberid = a.memberid
order by create_moment desc)
order by a.create_moment, c.comment_time
-Krip
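For reference, here is one way the same idea might look without TOP, which MySQL does not support. This is a sketch, not a tested MySQL statement: it runs the SQL through Python's built-in sqlite3 module, swaps the correlated TOP 1 subquery for a join against a per-member aggregate (MAX(create_moment) and COUNT(*)), and assumes that create_moment is unique per member and question. The users.name and answers.body columns and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (memberid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE answers (answerid INTEGER PRIMARY KEY, questionid INTEGER,
                          memberid INTEGER, body TEXT, create_moment TEXT);
    CREATE TABLE comment (commentid INTEGER PRIMARY KEY, answerid INTEGER,
                          comment TEXT, giver INTEGER, comment_time TEXT);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO answers VALUES
        (1, 1, 1, 'alice v1', '2020-01-01'),
        (2, 1, 2, 'bob v1',   '2020-01-02'),
        (3, 1, 1, 'alice v2', '2020-01-03');
    INSERT INTO comment VALUES
        (1, 1, 'old',    2, '2020-01-02'),
        (2, 3, 'nice',   2, '2020-01-04'),
        (3, 3, 'thanks', 2, '2020-01-05');
""")

question_id = 1
rows = conn.execute("""
    SELECT u.name, a.answerid, r.num_revisions, c.comment
    FROM answers AS a
    JOIN (
        -- One row per member: time of the latest answer and the answer count.
        SELECT memberid,
               MAX(create_moment) AS latest,
               COUNT(*) AS num_revisions
        FROM answers
        WHERE questionid = ?
        GROUP BY memberid
    ) AS r ON r.memberid = a.memberid AND r.latest = a.create_moment
    JOIN users AS u ON u.memberid = a.memberid
    LEFT JOIN comment AS c ON c.answerid = a.answerid
    WHERE a.questionid = ?
    ORDER BY a.answerid, c.comment_time
""", (question_id, question_id)).fetchall()
print(rows)
```

As with the T-SQL version, an answer repeats once per comment, and comments attached to superseded answers (such as the one on answer 1 here) are left out.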