Pandas: Error when merging two tables, Error with set_index

Thanks in advance for your help, here's my question:
I've successfully loaded my DataFrame into an IPython notebook and ran a groupby on it:
station_count = station.groupby('landmark').count()
which produced a table of counts per landmark. Now I'm trying to merge it with another table:
dock_count_by_station = station.groupby('landmark').sum()
which is also a simple groupby on the same table, but the merge produces an error:
TypeError: cannot concatenate a non-NDFrame object
with this code:
dock_count_by_station.merge(station_count)
I think the problem is that I need to set the index of the two tables before merging them, but the code below keeps raising the error shown after it:
station_count.set_index('landmark')
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3979)()
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3843)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12265)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)()
KeyError: 'landmark'
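(For context, a minimal sketch not from the original post: the KeyError occurs because groupby('landmark') already made landmark the index, so it is no longer a column that set_index can find. Resetting the index first would sidestep it.)
>>> station_count = station_count.reset_index()   # 'landmark' becomes a column again
>>> station_count = station_count.set_index('landmark')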

Using join
You can use join, which merges the tables on their index. You may also wish to specify the join type (e.g. 'outer', 'inner', 'left' or 'right'). You have overlapping column names (e.g. station_id), so you need to specify a suffix.
>>> dock_count_by_station.join(station_count, rsuffix='_rhs')
dockcount lat long station_id dockcount_rhs installation lat_rhs long_rhs name station_id_rhs
landmark
Mountain View 117 261.767433 -854.623012 210 7 7 7 7 7 7
Palo Alto 75 187.191873 -610.767939 180 5 5 5 5 5 5
Redwood City 115 262.406232 -855.602755 224 7 7 7 7 7 7
San Francisco 665 1322.569239 -4284.054814 2126 35 35 35 35 35 35
San Jose 249 560.039892 -1828.370075 200 15 15 15 15 15 15
Using merge
Note that your landmark index was set by default when you did the groupby. You can always use as_index=False if you don't want this to occur, but then you would have to use merge instead of join.
dock_count_by_station = station.groupby('landmark', as_index=False).sum()
station_count = station.groupby('landmark', as_index=False).count()
>>> dock_count_by_station.merge(station_count, on='landmark', suffixes=['_lhs', '_rhs'])
landmark dockcount_lhs lat_lhs long_lhs station_id_lhs dockcount_rhs installation lat_rhs long_rhs name station_id_rhs
0 Mountain View 117 261.767433 -854.623012 210 7 7 7 7 7 7
1 Palo Alto 75 187.191873 -610.767939 180 5 5 5 5 5 5
2 Redwood City 115 262.406232 -855.602755 224 7 7 7 7 7 7
3 San Francisco 665 1322.569239 -4284.054814 2126 35 35 35 35 35 35
4 San Jose 249 560.039892 -1828.370075 200 15 15 15 15 15 15
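For completeness, here is a self-contained sketch of both approaches, using made-up station rows (not the asker's data) so it can be run directly:
>>> import pandas as pd
>>> station = pd.DataFrame({
...     'landmark': ['San Jose', 'San Jose', 'Palo Alto'],
...     'station_id': [2, 3, 4],
...     'dockcount': [27, 15, 11],
... })
>>> # join: both groupbys keep 'landmark' as the index, so join works on it
>>> station.groupby('landmark').sum().join(
...     station.groupby('landmark').count(), rsuffix='_rhs')
>>> # merge: as_index=False keeps 'landmark' as a column to merge on
>>> station.groupby('landmark', as_index=False).sum().merge(
...     station.groupby('landmark', as_index=False).count(),
...     on='landmark', suffixes=['_lhs', '_rhs'])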

Related

Preparing SQL Data for a CDF Plot

I have the following SQL that I would like to use to plot a cumulative distribution, but I can't seem to get the data right.
Sample Data:
token_Length  Frequency
1             6436
2             7489
3             3724
4             2440
5             667
6             396
7             264
8             215
9             117
10            90
11            61
12            29
13            69
15            40
18            45
How do I prepare this data to create a CDF plot in Looker, so that it looks like this:
token_Length  Frequency  cume_dist
1             6436       0.291459107
2             7489       0.630604112
3             3724       0.799248256
4             2440       0.909745494
5             667        0.939951091
6             396        0.95788425
7             264        0.969839688
8             215        0.979576125
9             117        0.984874558
10            90         0.988950276
11            61         0.991712707
12            29         0.993025994
13            69         0.996150711
15            40         0.997962141
18            45         1
I have tried a measure as follows:
measure: cume_dist {
  type: number
  sql: cume_dist() over (order by ${token_length} ASC) ;;
}
This generates SQL as:
SELECT
token_length,
COUNT(*) AS "count",
cume_dist() over (order by (token_length) ASC) AS "cume_dist"
FROM string_facts
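The snag is that cume_dist() ranks rows without weighting them by Frequency, so it cannot reproduce the target column. What the target table contains is simply the running share of total frequency. A quick pandas sketch (not Looker; it just illustrates the arithmetic) shows this:
import pandas as pd

df = pd.DataFrame({
    'token_Length': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 18],
    'Frequency': [6436, 7489, 3724, 2440, 667, 396, 264, 215, 117,
                  90, 61, 29, 69, 40, 45],
})

# Cumulative share of total frequency: running sum / grand total.
df['cume_dist'] = df['Frequency'].cumsum() / df['Frequency'].sum()
print(df.head(2))  # 6436 / 22082 = 0.29146, matching the target table
In SQL terms that is SUM(Frequency) OVER (ORDER BY token_Length) divided by SUM(Frequency) OVER (), rather than cume_dist().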

How to group merge columns based on one row identifier with pandas?

I have a dataset with many entries for a single location. I am trying to find a way to sum up all of those entries without affecting any of the other columns. So, in case I'm not explaining it well enough, I want to take a dataset like this:
Locations Cyclists maleRunners femaleRunners maleCyclists femaleCyclists
Bedford 10 12 14 17 27
Bedford 11 40 34 9 1
Bedford 7 1 2 3 3
Leeds 1 1 2 0 0
Leeds 20 13 6 1 1
Bath 101 20 33 41 3
Bath 11 2 3 1 0
And turn it into something like this:
Locations Cyclists maleRunners femaleRunners maleCyclists femaleCyclists
Bedford 28 53 50 29 31
Leeds 21 33 39 1 1
Bath 111 22 36 42 3
Now, I have read that a groupby should work here, but from my understanding a groupby will reduce the frame to 2 columns, and I don't particularly want to make hundreds of 2-column frames and then merge them all. Surely there's a much simpler way to do this?
IIUC, groupby+sum will work for you:
df.groupby('Locations', as_index=False, sort=False).sum()
Output:
Locations Cyclists maleRunners femaleRunners maleCyclists femaleCyclists
0 Bedford 28 53 50 29 31
1 Leeds 21 14 8 1 1
2 Bath 112 22 36 42 3
A pivot table should also work for you:
import numpy as np
new_df = pd.pivot_table(df, values=['Cyclists', 'maleRunners', 'femaleRunners',
                                    'maleCyclists', 'femaleCyclists'],
                        index='Locations', aggfunc=np.sum)

How to calculate maximum of three in combined query

[image: usage (screenshot of the Access queries)]
The image shows the work I have done in Access. The query "overall usage review" counts the number of usages in each month using a combined (union) query.
After this step, I want to take the three largest usage.quantity values among the 12 months and apply the formula (e.g. "total of max three months usage" / 3).
E.g. (the 12 numbers per row are the monthly usage quantities):
Warehouse  Part Number  Monthly usage (12 months)
A          X01          9 16 7 14 10 5 9 11 6 3 11 5
A          X02          20 22 10 12 20 17 18 29 14 13 11 19
B          X01          8 7 3 26 17 6 3 2 5 10 8 14
B          X05          9 10 16 6 10 4 13 12 6 4 3 6
I want the result to look like this:
Warehouse  Part Number  Max three months usage  Result
A          X01          41                      41/3
A          X02          71                      71/3
B          X01          57                      19
B          X05          39                      13
Someone told me to use dynamic SQL, but I don't know what that is... Please tell me how to solve this problem in detail. It has been stuck in my mind for a very long time...
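No Access answer is attached here, but the calculation itself is straightforward: take the three largest monthly values per row, sum them, and divide by 3. A sketch of that logic in pandas (hypothetical column names, not the asker's Access schema):
import pandas as pd

# Hypothetical recreation of the usage table: one column per month.
months = [f'M{i}' for i in range(1, 13)]
usage = pd.DataFrame(
    [['A', 'X01', 9, 16, 7, 14, 10, 5, 9, 11, 6, 3, 11, 5],
     ['A', 'X02', 20, 22, 10, 12, 20, 17, 18, 29, 14, 13, 11, 19]],
    columns=['Warehouse', 'PartNumber'] + months,
)

# Sum of the three largest monthly usages per row, then their average.
usage['MaxThree'] = usage[months].apply(lambda r: r.nlargest(3).sum(), axis=1)
usage['Result'] = usage['MaxThree'] / 3
print(usage[['Warehouse', 'PartNumber', 'MaxThree', 'Result']])
For the first row this gives 16 + 14 + 11 = 41 and 41/3, matching the expected output above.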

Generate Seaborn Countplot using column value as count

For the following table:
                               count_value
CPUCore  Offline_RetentionAge
i7       183                   4184
         7                     1981
         30                    471
i5       183                   2327
         7                     831
         30                    250
Pentium  183                   333
         7                     125
         30                    43
2        183                   575
         7                     236
         31                    96
Is it possible to generate a seaborn countplot (or a normal countplot) like the one generated using sns.countplot(x='CPUCore', hue='Offline_BackupSchemaIncrementType', data=dfCombined_df)?
The problem here is that I need to use count_value as the count, rather than actually going and counting the Offline_RetentionAge rows.
I think you need seaborn.barplot, which plots precomputed heights instead of counting rows:
sns.barplot(x='CPUCore', y='count_value', hue='Offline_RetentionAge', data=df.reset_index())
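A runnable sketch of that idea, rebuilding the table above as a flat DataFrame (column names taken from the question):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# The aggregated counts from the question, one row per group.
df = pd.DataFrame({
    'CPUCore': ['i7'] * 3 + ['i5'] * 3 + ['Pentium'] * 3 + ['2'] * 3,
    'Offline_RetentionAge': [183, 7, 30, 183, 7, 30, 183, 7, 30, 183, 7, 31],
    'count_value': [4184, 1981, 471, 2327, 831, 250, 333, 125, 43, 575, 236, 96],
})

# barplot draws the precomputed heights; countplot would count rows instead.
sns.barplot(x='CPUCore', y='count_value', hue='Offline_RetentionAge', data=df)
plt.show()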

Getting a larger total when converting int to float, dividing, and multiplying by an int?

I have three columns, as shown in TableA below:
Student Day Shifts
129 11 4
91 9 6
166 19 8
164 26 12
146 11 6
147 16 8
201 8 3
164 4 2
186 8 6
165 7 4
171 10 4
104 5 4
1834 134 67
I am writing a TVF (table-valued function) to calculate the value of points generated for students, as below:
ALTER function Statagic(
    @StartDate date
)
RETURNS TABLE
AS
RETURN
(
    with src as
    (
        select
            Division = case when Shifts = 0 then 0
                            else cast(Day as float) / cast(Shifts as float) end,
            *
        from TableA
    ),
    tgt as
    (
        select *, Points = Student * Division from src
    )
    select * from tgt
)
When I execute the above TVF (select * from Statagic('3/16/2014')), my output is below (columns: Student, Day, Shifts, Division, Points):
129 11 4 2.75 354.75
91 9 6 1.5 136.5
166 19 8 2.375 394.25
164 26 12 2.16666666666667 355.333333333333
146 11 6 1.83333333333333 267.666666666667
147 16 8 2 294
201 8 3 2.66666666666667 536
164 4 2 2 328
186 8 6 1.33333333333333 248
165 7 4 1.75 288.75
171 10 4 2.5 427.5
104 5 4 1.25 130
1834 134 67 2 3668
Note:
The last row of the input table is the total of the columns above it. But in the TVF output, the last two columns of that totals row are not what I get by adding up the corresponding values from the other rows; the sum comes out higher.
Please help me fix this; I have tried every way I can think of.
select 354.75+136.5+394.25+355.333333333333+267.666666666667+294+536+328+248+288.75+427.5+130 = 3760.750000000000
3668 is not equal to 3760.75 (the per-row sum is 92.75 more).
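No answer is attached here, but the discrepancy is expected arithmetic rather than a bug: the totals row computes its points as 1834 * (134 / 67) = 3668, while the per-row points sum to 3760.75, and a sum of ratios is not the ratio of the sums. A quick Python check of the document's own numbers:
# Per-row data from the question: (Student, Day, Shifts).
rows = [(129, 11, 4), (91, 9, 6), (166, 19, 8), (164, 26, 12),
        (146, 11, 6), (147, 16, 8), (201, 8, 3), (164, 4, 2),
        (186, 8, 6), (165, 7, 4), (171, 10, 4), (104, 5, 4)]

# Sum of per-row points: Student * (Day / Shifts) for each row.
per_row_total = sum(s * d / sh for s, d, sh in rows)
print(per_row_total)             # 3760.75

# Points computed on the totals row instead: 1834 * (134 / 67).
totals_row_points = 1834 * (134 / 67)
print(totals_row_points)         # 3668.0; smaller, as expected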