Pandas pivot multiple columns, indexes, and values - pandas

I have the following dataset:
   resulted_by  follow_up_result  follow_up_number  #   %
0  User 1       good              1                 30  30
1  User 2       good              2                 65  65
2  User 3       bad               3                 5   0.05
I want to Pivot:
follow up result and resulted by as indexes
follow up number as a column
# and % as values
pivot = df.head(3).pivot(columns=['follow_up_number'], values=["#", '%'], index=['follow_up_result', 'resulted_by'])
However, I want the follow up number to be above the values, here is how I achieved that:
pivot = df.head(3).pivot(columns=['follow_up_result', 'resulted_by'], values=["#", '%'], index=['follow_up_number'])
pivot = pivot.stack(level=0).T
Notice how I switch columns and indexes.
I want the column names to be at the same level as the values.
Is there a way to do that?
Is there a better way to achieve what I need without switching between columns and indexes?
Code Snippet:
https://onecompiler.com/python/3y5gzm7hu
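One possible way to get follow_up_number above the values without switching columns and indexes is to pivot as in the first attempt and then swap the two column levels. This is a minimal sketch, assuming pandas 1.1 or later (for a list-valued index in pivot) and only the three rows shown in the question:

```python
import pandas as pd

# The three rows from the question.
df = pd.DataFrame({
    "resulted_by": ["User 1", "User 2", "User 3"],
    "follow_up_result": ["good", "good", "bad"],
    "follow_up_number": [1, 2, 3],
    "#": [30, 65, 5],
    "%": [30, 65, 0.05],
})

# Pivot with the desired index; the column levels come out as (value name, follow_up_number).
pivot = df.pivot(
    index=["follow_up_result", "resulted_by"],
    columns="follow_up_number",
    values=["#", "%"],
)

# Swap the column levels so follow_up_number is on top, then sort so that
# "#" and "%" sit next to each other under each follow_up_number.
pivot = pivot.swaplevel(0, 1, axis=1).sort_index(axis=1)
print(pivot)
```

Whether this matches the layout the question has in mind ("column names at the same level as the values") is a judgment call; the sketch only addresses the level ordering.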

Related

Storing a large array into a table with 10,000 columns in SQLite

I want to be able to store some 100x100 matrices (covariance matrices) in a table within my database. A good first step for me would be to flatten the matrix and store the matrix structure (among other things) in a parent table.
However, such a table would need about 10,000 columns. Writing out that many field names would make my SQL code extraordinarily large, and I wouldn't know where to start if I wanted to query for that matrix.
Is there a neat way to specify such a table in SQL? Is there a neat way for me to set or get a particular (set of) matrix (matrices) from my database using such a table? Is there a better way?
I am using Sqlite for my databases.
Any wide table whose columns all hold the same type of value can be rotated into a long format.
For example if you have a table A like this:
row col1 col2 col3 ...
1 1 2 3
2 11 12 13
You can simply rotate it into a table with 3 columns:
row col value
1 1 1
1 2 2
1 3 3
2 1 11
2 2 12
2 3 13
So instead of writing a big SQL statement like
select col1, col2, col3 ...... from A where row = 2
you write
select value from A where row = 2 order by col
The result set was originally horizontal and is now vertical: it has been rotated and is easy to handle.
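Applied to the matrix question, the rotated layout could look like the following sketch, which uses Python's built-in sqlite3 module. The table name matrix_entries, its column names, and the helper functions are hypothetical, chosen only for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per matrix cell instead of one column per cell (hypothetical schema).
conn.execute(
    """
    CREATE TABLE matrix_entries (
        matrix_id INTEGER NOT NULL,
        row_idx   INTEGER NOT NULL,
        col_idx   INTEGER NOT NULL,
        value     REAL    NOT NULL,
        PRIMARY KEY (matrix_id, row_idx, col_idx)
    )
    """
)

def store_matrix(conn, matrix_id, matrix):
    """Flatten a list-of-lists matrix into (matrix_id, row, col, value) rows."""
    conn.executemany(
        "INSERT INTO matrix_entries VALUES (?, ?, ?, ?)",
        [(matrix_id, i, j, v)
         for i, row in enumerate(matrix)
         for j, v in enumerate(row)],
    )

def load_matrix(conn, matrix_id):
    """Rebuild the matrix by reading its cells in row-major order."""
    rows = {}
    cursor = conn.execute(
        "SELECT row_idx, value FROM matrix_entries "
        "WHERE matrix_id = ? ORDER BY row_idx, col_idx",
        (matrix_id,),
    )
    for i, v in cursor:
        rows.setdefault(i, []).append(v)
    return [rows[i] for i in sorted(rows)]

cov = [[1.0, 0.5], [0.5, 2.0]]  # small stand-in for a 100x100 covariance matrix
store_matrix(conn, 1, cov)
print(load_matrix(conn, 1))
```

The same three statements work unchanged for a 100x100 matrix, which then occupies 10,000 rows rather than 10,000 columns.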

Spark Dataframe : Group by custom Range

I have a Spark Dataframe which I aggregated based on a column called "Rank", with "Sum" being the sum of all values with that Rank.
df.groupBy("Rank").agg(sum("Col1")).orderBy("Rank").show()
Rank  Sum(Col1)
1     1523
2     785
3     232
4     69
5     126
...   ....
430   7
Instead of having the Sum for every single value of "Rank", I would like to group my data into rank "buckets", to get a more compact output. For example :
Rank Range  Sum(Col1)
1           1523
2-5         1212
5-10        ...
...         ...
100+        ....
Instead of having 4 different rows for Rank 2,3,4,5 - I would like to have one row "2-5" showing the sum for all these ranks.
What would be the best way of doing that? I am quite new to Spark Dataframes and am thankful for any help, especially examples of how to achieve that.
Thank you !
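A few options:
Histogram: build a histogram of the Rank column. See the following post:
Making histogram with Spark DataFrame column
Bucket column: add another column holding the bucket label (see "Apache spark dealing with case statements"):
import org.apache.spark.sql.functions.{col, when}
val bucketed = df.withColumn("Rank Range",
  when(col("Rank") === 1, "1")
    .when(col("Rank").between(2, 5), "2-5")
    // ... one .when(...) per remaining bucket
    .otherwise("100+"))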
Now you can run your group by query on the new Rank Range column.

groupby 2 columns and count into separate columns based on one columns cases

I'm trying to group by 2 columns, of which the first has 5 distinct values and the second has 2.
My data looks like this:
and using
df_counted = (
    df_analysis
    .groupby(['TYPE', 'RESULT'])
    .size()
    .sort_values(ascending=False)
    .reset_index(name='COUNT')
)
I was able to transform it into the cases I want:
However I don't want a column for result, just for counts.
It's supposed to look like this:
COUNT_TRUE COUNT_FALSE
FORWARD 21 182
BACKWARD 34 170
RIGHT 24 298
LEFT 20 242
NEUTRAL 16 82
The best I could do there was this. How do I get there?
Pandas can build a pivot table from a dataframe, and your task can be done that way:
df_counted.pivot_table(index="TYPE", columns="RESULT", values="COUNT")
Result:
Solved it and went a kind of full SQL there. It's not elegant, but it works:
df_counted is the last df from the question with the NaN values.
# drop duplicates for the first counts
df_pos = df_counted.drop_duplicates(subset=['TYPE'], keep='first').drop(columns=['COUNT_POS'])
# drop duplicates for the last counts
df_neg = df_counted.drop_duplicates(subset=['TYPE'], keep='last').drop(columns=['COUNT_NEG'])
# join on TYPE
df = df_pos.set_index('TYPE').join(df_neg.set_index('TYPE'))
If someone has a more elegant way of doing this, I'd be super interested to see it.
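A shorter route is to count on the original frame and unstack RESULT straight into columns. The sketch below uses a small made-up sample, since the question's data is not included in the text, and assumes RESULT holds booleans:

```python
import pandas as pd

# Made-up sample standing in for df_analysis (assumption: RESULT is boolean).
df_analysis = pd.DataFrame({
    "TYPE": ["FORWARD", "FORWARD", "FORWARD", "LEFT", "LEFT"],
    "RESULT": [True, False, False, True, True],
})

# Count each (TYPE, RESULT) pair, then move RESULT from the index into columns.
counts = (
    df_analysis
    .groupby(["TYPE", "RESULT"])
    .size()
    .unstack(fill_value=0)  # missing combinations become 0 instead of NaN
    .rename(columns={True: "COUNT_TRUE", False: "COUNT_FALSE"})
)
counts = counts[["COUNT_TRUE", "COUNT_FALSE"]]  # column order from the question
print(counts)
```

pd.crosstab(df_analysis["TYPE"], df_analysis["RESULT"]) should give the same counts before renaming, if a one-liner is preferred.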

SQL Query to return which columns have different values given two rows

I have one table like this:
id status time days ...
1 optimal 60 21
2 optimal 50 21
3 no solution 60 30
4 optimal 21 31
5 no solution 34 12
.
.
.
There are many more rows and columns.
I need to make a query that will return which columns have different information, given two IDs.
Rephrasing it, I'll provide two IDs, for example 1 and 5 and I need to know if these two rows have any columns with different values. In this case, the result should be something like:
id status time days
1 optimal 60 21
5 no solution 34 12
If I provide IDs 1 and 2, for example, the result should be:
id time
1 60
2 50
The output format doesn't need to be like this, it only needs to show clearly which columns are different and their values
I can tell you off the bat that processing this data in some sort of programming language will greatly help you out in terms of simplicity and readability for this type of solution, but here is a thread on how it can be done in SQL:
Compare two rows and identify columns whose values are different
If you are looking for a solution in R, here is mine:
df <- read.csv(file = "sf.csv", header = TRUE)
diff.eval <- function(first.id, second.id, eval.df) {
  res <- eval.df[c(first.id, second.id), ]
  cols <- colnames(eval.df)
  for (col in cols) {
    if (res[1, col] == res[2, col]) {
      res[, col] <- NULL
    }
  }
  return(res)
}
print(diff.eval(1, 5, df))
print(diff.eval(1, 2, df))
You just need to create a dataframe out of the table. For convenience I created a .csv file locally and imported it into a dataframe.
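The same idea, fetching the two rows and comparing them column by column in a programming language, can be sketched in Python with the built-in sqlite3 module. The table name results and the function differing_columns are hypothetical; the rows are the five shown in the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read values by column name

# Hypothetical table holding the five rows shown in the question.
conn.execute(
    "CREATE TABLE results (id INTEGER PRIMARY KEY, status TEXT, time INTEGER, days INTEGER)"
)
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?)",
    [
        (1, "optimal", 60, 21),
        (2, "optimal", 50, 21),
        (3, "no solution", 60, 30),
        (4, "optimal", 21, 31),
        (5, "no solution", 34, 12),
    ],
)

def differing_columns(conn, first_id, second_id):
    """Return {column: (value in first row, value in second row)} for columns that differ."""
    rows = {
        row["id"]: row
        for row in conn.execute(
            "SELECT * FROM results WHERE id IN (?, ?)", (first_id, second_id)
        )
    }
    a, b = rows[first_id], rows[second_id]
    return {
        col: (a[col], b[col])
        for col in a.keys()
        if col != "id" and a[col] != b[col]
    }

print(differing_columns(conn, 1, 5))
print(differing_columns(conn, 1, 2))
```

Because the column names come from the fetched row itself, the function does not need to change when the table gains more columns.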

SPSS Compute Variable

Below is some data:
Test Day1 Day2 Score
A 1 2 100
B 1 3 62
C 3 4 90
D 2 4 20
E 4 5 80
I am trying to take the values from columns 'Day1' and 'Day2' and use them to select row numbers in the column Score. For example, for Test A I would like to find the sum of 100 and 62, because those are the values in the first and second rows of Score. For Test B I would like to find the sum of 100, 62 and 90.
Is there any way to do this in the Compute Variable window (found in the menu Transform > Compute Variable)?
I tried the following:
Score(MEAN(VALUE(Day1), VALUE(DAY2)))
This is not the proper way to call the cell location of Score and I received an error.
Can anyone help?
Thank you!
You really have two different datasets here. One is a dataset of scores numbered 1 through 5.
The other is a dataset that includes indexes into the score dataset. So the steps would be something like this.
First take the scores dataset and transpose it so that it has one row and 5 columns (Data>Transpose)
Then match that dataset to each case in the main dataset (Data>Merge Files>Add Variables).
Next you have to resort to using syntax directly.
You would declare a vector for the scores (VECTOR)
Finally, you use COMPUTE to index into the scores.
For your real problem, I suppose that you might have batches of scores and maybe there are some gaps. The Restructure Data Wizard can help you generalize this - convert cases into variables, but let's not go there yet.
HTH,
Jon Peck
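To make the intended indexing concrete, here is the computation sketched in Python rather than SPSS syntax. It illustrates the logic only, not the VECTOR/COMPUTE commands mentioned above, and the names scores, cases and score_sum are made up for the example:

```python
# Score column from the question, in row order (rows 1 through 5).
scores = [100, 62, 90, 20, 80]

# (Test, Day1, Day2) for each case in the question.
cases = [("A", 1, 2), ("B", 1, 3), ("C", 3, 4), ("D", 2, 4), ("E", 4, 5)]

def score_sum(day1, day2, scores):
    """Sum the scores in rows day1..day2, inclusive, using 1-based row numbers."""
    return sum(scores[day1 - 1:day2])

totals = {test: score_sum(day1, day2, scores) for test, day1, day2 in cases}
print(totals)
```

For Test A this adds 100 and 62, and for Test B it adds 100, 62 and 90, matching the examples in the question.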