I have problems getting the data into separate rows. At the moment, all my data per column is in one cell. I would really appreciate your support!
The column header is "Dealer" and it shows one value below it, like this:
|Dealer|
|:---- |
|['Automobiles', 'Garage Benz', 'Cencini SA']|
I would like to get three rows out of this:
|Row|Dealer|
|:---- |:---- |
|1|'Automobiles'|
|2|'Garage Benz'|
|3|'Cencini SA'|
|4|....|
|5|....|
|...|...|
what would be the easiest way to achieve this?
Thanks for your support, as I am totally new to pandas!
The easiest way is to convert your data into a dict-like structure:
x = {'Dealer':['Automobiles', 'Garage Benz', 'Cencini SA']}
Then
x = pd.DataFrame(x)
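Put together, a minimal runnable version of the above (the pandas import is assumed):

```python
import pandas as pd

# Each list element becomes its own row in the resulting DataFrame
x = {'Dealer': ['Automobiles', 'Garage Benz', 'Cencini SA']}
df = pd.DataFrame(x)
print(df)
#         Dealer
# 0  Automobiles
# 1  Garage Benz
# 2   Cencini SA
```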
First, I am sorry if this is a recurring question, but with the way I phrased it I did not find any duplicates.
I have a data frame that, among its columns, has one with date values and another with a one-hot encoding of the presence of an event:
date event
20-11-2019 1
20-11-2019 1
12-3-2018 0
I am trying to find a way to obtain the number of events on each of those dates.
I tried to navigate around group by but got nowhere useful. Can anyone help me?
Try groupby and sum
out = df.groupby('date',as_index=False).sum()
Out[75]:
date event
0 12-3-2018 0
1 20-11-2019 2
Does data.groupby('date')[['event']].sum() do what you want?
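Putting it together on the sample rows from the question (a minimal, self-contained sketch; the import and frame construction are assumed):

```python
import pandas as pd

# Rebuild the small example frame from the question
df = pd.DataFrame({
    'date':  ['20-11-2019', '20-11-2019', '12-3-2018'],
    'event': [1, 1, 0],
})

# Number of events per date: group on the date and sum the one-hot column
out = df.groupby('date', as_index=False).sum()
print(out)
#          date  event
# 0   12-3-2018      0
# 1  20-11-2019      2
```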
I'm trying to group by 2 columns, of which the first has 5 different values and the second has 2.
My data looks like this:
and using
df_counted = (df_analysis
    .groupby(['TYPE', 'RESULT'])
    .size()
    .sort_values(ascending=False)
    .reset_index(name='COUNT'))
I was able to transform it into the cases I want:
However, I don't want a column for RESULT, just the counts.
It's supposed to be like:
COUNT_TRUE COUNT_FALSE
FORWARD 21 182
BACKWARD 34 170
RIGHT 24 298
LEFT 20 242
NEUTRAL 16 82
The best I could do there was this. How do I get there?
Pandas can build a pivot table from a DataFrame. Your task can be done by making a pivot table as well.
df_counted.pivot_table(index="TYPE", columns="RESULT", values="COUNT")
Result:
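The original data and result were posted as images, so here is a self-contained sketch of the whole pipeline on toy values (only the column names TYPE, RESULT and COUNT come from the question; the sample rows are invented for illustration):

```python
import pandas as pd

# Toy data purely for illustration; the real df_analysis comes from the question
df_analysis = pd.DataFrame({
    'TYPE':   ['FORWARD', 'FORWARD', 'BACKWARD', 'LEFT', 'LEFT'],
    'RESULT': [True, False, True, False, False],
})

df_counted = (df_analysis
    .groupby(['TYPE', 'RESULT'])
    .size()
    .reset_index(name='COUNT'))

# Pivot RESULT into columns so each TYPE becomes a single row of counts
wide = (df_counted
    .pivot_table(index='TYPE', columns='RESULT', values='COUNT', fill_value=0)
    .rename(columns={True: 'COUNT_TRUE', False: 'COUNT_FALSE'}))
print(wide)
```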
Solved it, and went kind of full SQL there. It's not elegant, but it works:
df_counted is the last df from the question with the NaN values.
# drop duplicates for the first counts
df_pos = df_counted.drop_duplicates(subset=['TYPE'], keep='first').drop(columns=['COUNT_POS'])
# drop duplicates for the last counts
df_neg = df_counted.drop_duplicates(subset=['TYPE'], keep='last').drop(columns=['COUNT_NEG'])
# join on TYPE
df = df_pos.set_index('TYPE').join(df_neg.set_index('TYPE'))
If someone has a more elegant way of doing this, I'd be super interested to see it.
I am new to Pandas, and wanted your help with data slicing.
I have a dump of 10 million rows with duplicates. Please refer to this image for a sample of the rows with the steps I am looking to perform.
As you see in the image, the column for criteria "ABC" from Source 'UK' has 2 duplicate entries in the Trg column. I need help with:
Adding a concatenated new column "All Targets" as shown in image
Removing duplicates from above table so that only unique values without duplicates appear, as shown in step 2 in the image
Any help in this regard will be highly appreciated.
I would do like this:
PART 1:
First define a function that does what you want, then use the apply method:
def my_func(grouped):
    all_target = grouped["Trg"].unique()
    grouped["target"] = ", ".join(all_target)
    return grouped

df1 = df.groupby("Criteria").apply(my_func)
# output: example with the first 4 rows
Criteria Trg target
0 ABC DE DE, FR
1 ABC FR DE, FR
2 DEF UK UK, FR
3 DEF FR UK, FR
PART 2:
df2 = df1.drop_duplicates(subset=["Criteria"])
I tried it only on the first 4 rows, so let me know if it works.
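For reference, both parts run end to end on a toy frame built from the four example rows shown above (the import and frame construction are assumed):

```python
import pandas as pd

# Toy frame matching the four example rows above
df = pd.DataFrame({
    'Criteria': ['ABC', 'ABC', 'DEF', 'DEF'],
    'Trg':      ['DE',  'FR',  'UK',  'FR'],
})

def my_func(grouped):
    # Concatenate the unique targets of the group into one string
    all_target = grouped['Trg'].unique()
    grouped['target'] = ', '.join(all_target)
    return grouped

df1 = df.groupby('Criteria').apply(my_func)      # PART 1: add the concatenated column
df2 = df1.drop_duplicates(subset=['Criteria'])   # PART 2: keep one row per Criteria
print(df2)
```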
I have a data set with 30 columns and multiple rows (some cells have no data). I would like to be able to facet the columns in groups.
1 2 3 4...
Row1 A B C D
Row2 E A D F
Row3 Q A B H
Given the above data, I would like the facet to return the number of instances in a group of columns. For the first three columns I need the facet to return:
A - 3
B - 2
C - 1
D - 1
E - 1
Q - 1
I have tried to combine columns when I loaded the data but the individual data was grouped as well. This is not the desired outcome. For example:
ABC - 1
EAD - 1
QAB - 1
Thanks in advance.
I can't think of a more efficient way to do this off the top of my head, but you can do a custom facet with something like:
[ cells["1"].value, cells["2"].value, cells["3"].value ]
where "1", "2", and "3" are the names of your columns. If your column names are single words, like "V1", "V2", "V3", and so on, you can also change the custom facet to something like:
[ cells.V1.value, cells.V2.value, cells.V3.value ]
With a lot of columns, this solution might be somewhat tedious though...
Did you try to transpose all your columns into one and facet on this 'master column'?
When transposing, add the column name so you know where the data came from. Then you can split your master column into 'source column' and 'data'.
You can find here the JSON code to transpose a large number of columns: http://googlerefine.blogspot.ca/2011/09/json-code-to-transpose-important-number.html
It should work for your project with a limited number of edits.
Hope it helps!
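As an aside, the same counting logic is easy to express in pandas if the data ever leaves OpenRefine (purely illustrative, not part of either answer; column names taken from the example above):

```python
import pandas as pd

# Toy frame mirroring the example rows above
df = pd.DataFrame({
    '1': ['A', 'E', 'Q'],
    '2': ['B', 'A', 'A'],
    '3': ['C', 'D', 'B'],
    '4': ['D', 'F', 'H'],
})

# Stack the chosen group of columns into one series, then count each value
counts = df[['1', '2', '3']].stack().value_counts()
print(counts)
# A appears 3 times, B twice, and C, D, E, Q once each
```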
Below is some data:
Test Day1 Day2 Score
A 1 2 100
B 1 3 62
C 3 4 90
D 2 4 20
E 4 5 80
I am trying to take the values from the columns 'Day1' and 'Day2' and use them to select the row numbers for the column Score. For example, for Test A I would like to find the sum of 100 and 62, because those are the values of the first and second rows of Score. For Test B I would like to find the sum of 100, 62 and 90.
Is there any way to do this in the Compute Variable window, found in the menu Transform > Compute Variable?
I tried the following:
Score(MEAN(VALUE(Day1), VALUE(DAY2)))
This is not the proper way to call the cell location of Score and I received an error.
Can anyone help?
Thank you!
You really have two different datasets here. One is a dataset of scores numbered 1 through 5.
The other is a dataset that includes indexes into the score dataset. So the steps would be something like this.
First take the scores dataset and transpose it so that it has one row and 5 columns (Data > Transpose).
Then match that dataset to each case in the main dataset (Data > Merge Files > Add Variables).
Next you have to resort to using syntax directly.
You would declare a vector for the scores (VECTOR)
Finally, you use COMPUTE to index into the scores.
For your real problem, I suppose that you might have batches of scores and maybe there are some gaps. The Restructure Data Wizard can help you generalize this - convert cases into variables, but let's not go there yet.
HTH,
Jon Peck
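For readers who end up doing this outside SPSS, the same indexing idea can be sketched in pandas (this is not the SPSS procedure described above; only the column names and values come from the question, everything else is assumed):

```python
import pandas as pd

df = pd.DataFrame({
    'Test':  ['A', 'B', 'C', 'D', 'E'],
    'Day1':  [1, 1, 3, 2, 4],
    'Day2':  [2, 3, 4, 4, 5],
    'Score': [100, 62, 90, 20, 80],
})

# Treat Score as a vector indexed 1..5, then sum the slice Day1..Day2 for each row
scores = df['Score'].to_numpy()
df['RangeSum'] = df.apply(
    lambda r: scores[r['Day1'] - 1 : r['Day2']].sum(), axis=1)
print(df)
# Test A -> 100 + 62 = 162, Test B -> 100 + 62 + 90 = 252, ...
```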