x axis duplicate values are added up on SSRS charts - sql

Forgive my ignorance. I am new to SSRS reporting.
My x axis on a scatter chart is a simple ID value and my y axis is Price. The values in the table are as follows:
ID Price
1 1.47
1 1.52
1 1.46
2 1.40
2 1.44
2 1.38
When the chart is plotted, the value for ID 1 on the y axis is the sum of 1.47 + 1.52 + 1.46, and the same happens for ID 2, and so on.
I would like individual dots on the chart, e.g. 1.47 as one dot, 1.52 as another dot, 1.46 as a third dot, and so on.
Thanks

You need to set the Category Group to the Id of your result set.
Of course I'm guessing as you really need to provide a bit more information (like a screenshot of your chart?)

Related

Histogram multiple columns, percentage per columns same plot

I have this dataframe (df_histo):
ID SA WE EE
1 NaN 320 23.4
2 234 2.3 NaN
.. .. .. ..
345 1.2 NaN 234
I can plot a correct histogram if I use density=True:
buckets = [0, 250, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000]
plt.hist(df_histo, bins=buckets, label=df_histo.columns, density=True)
However, I want to visualize the data (one histogram) as total percentages for each country, i.e. what percentage each bin represents within that country.
The closest attempt has been this (it works with individual columns):
plt.hist(df_histo, bins=buckets, label=df_histo.columns, weights=np.ones(len(df_histo)) / len(df_histo))
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
ValueError: weights should have the same shape as x
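The error comes from passing a whole DataFrame as x while weights is a single 1-D array. One way around it is to pass each country as its own array with a matching weights array, so every country's bars sum to 100%. A minimal sketch, assuming df_histo has one column per country (SA, WE, EE) plus an ID column that should not be histogrammed:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

buckets = [0, 250, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000]

countries = [c for c in df_histo.columns if c != 'ID']      # skip the ID column
cols = [df_histo[c].dropna() for c in countries]            # NaNs dropped per country
weights = [np.ones(len(c)) / len(c) for c in cols]          # each country sums to 1

plt.hist(cols, bins=buckets, label=countries, weights=weights)
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
plt.legend()
plt.show()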

Select last row from each column of multi-index Pandas DataFrame based on time, when columns are unequal length

I have the following Pandas multi-index DataFrame with the top level index being a group ID and the second level index being when, in ISO 8601 time format (shown here without the time):
                                     value  weight
id                       when
5e33c4bb4265514aab106a1a 2011-05-12   1.34    0.79
                         2011-05-07   1.22    0.83
                         2011-05-03   2.94    0.25
                         2011-04-28   1.78    0.89
                         2011-04-22   1.35    0.92
...                                    ...     ...
5e33c514392b77d517961f06 2009-01-31  30.75    0.12
                         2009-01-24  30.50    0.21
                         2009-01-23  29.50    0.96
                         2009-01-10  28.50    0.98
                         2008-12-08  28.50    0.65
when is currently defined as an index but this is not a requirement.
Assertions
when may be non-unique.
Columns may be of unequal length across groups.
Within groups, when, value and weight will always be of equal length (for each when there will always be a value and a weight).
Question
Using the parameter index_time, how do you retrieve:
The most recent past value and weight from each group relative to index_time, along with the difference (in seconds) between index_time and when.
index_time may be a time in the past, such that only entries where when <= index_time are selected.
The result should be indexed in some way so that the group id of each result can be deduced.
Example
From the above, if the index_time was 2011-05-10 then the result should be:
                           value  weight       age
5e33c4bb4265514aab106a1a    1.22    0.83    259200
5e33c514392b77d517961f06   30.75    0.12  72576000
Where the original DataFrame given in the question is df:
import pandas as pd

index_time = pd.Timestamp('2011-05-10')
df.sort_index(inplace=True)
# last row per group (index level 'id') with when <= index_time
result = df.loc[pd.IndexSlice[:, :index_time], :].groupby('id').tail(1)
# age in whole seconds between index_time and each row's when
result['age'] = (index_time - result.index.get_level_values(level=1)).total_seconds()
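If the result should be keyed only by the group id, as in the example output above, the when level can then be dropped:
result.index = result.index.droplevel(1)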

plot dataframe column on one axis and other columns as separate lines on the same plot (in different color)

I have the following dataframe.
precision recall F1 cutoff
cutoff
0 0.690148 1.000000 0.814610 0
1 0.727498 1.000000 0.839943 1
2 0.769298 0.916667 0.834051 2
3 0.813232 0.916667 0.859741 3
4 0.838062 0.833333 0.833659 4
5 0.881454 0.833333 0.854946 5
6 0.925455 0.750000 0.827202 6
7 0.961111 0.666667 0.786459 7
8 0.971786 0.500000 0.659684 8
9 0.970000 0.166667 0.284000 9
10 0.955000 0.083333 0.152857 10
I want to plot the cutoff column on the x-axis and the precision, recall and F1 values as separate lines on the same plot (in different colors). How can I do it?
When I try to plot the dataframe, it also uses the cutoff column for plotting.
Thanks
Remove the column before plotting:
df.drop('cutoff', axis=1).plot()
But maybe the problem is how the index was created; it may help to change:
df = df.set_index(df['cutoff'])
df.drop('cutoff', axis=1).plot()
to:
df = df.set_index('cutoff')
df.plot()
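Alternatively, keep cutoff as a plain column and tell plot explicitly which column to use for the x axis; the remaining columns are drawn as separate lines in different colors:
df.plot(x='cutoff', y=['precision', 'recall', 'F1'])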

How to manipulate data in arrays using pandas

I have data in a DataFrame and need to compare the current value of one column with the prior value of another column. The current time is row 5 in this DataFrame.
The target data is streamed and captured into a DataFrame, and that array is multiplied by a constant to generate another column. However, I am unable to generate the third column, comp, which should compare the current value of prod with the prior value of comp.
df['temp'] = self.temp
df['prod'] = df['temp'].multiply(other=const1)
Another user had suggested using this logic, but it generates errors because the routine's array doesn't match the size of the DataFrame:
for i in range(2, len(df['temp'])):
    df['comp'].append(max(df['prod'][i], df['comp'][i - 1]))
Let's try this, I think this will capture your intended logic:
import pandas as pd

df = pd.DataFrame({'col0': [1, 2, 3, 4, 5],
                   'col1': [5, 4.9, 5.5, 3.5, 6.3],
                   'col2': [2.5, 2.45, 2.75, 1.75, 3.15]})
df['col3'] = df['col2'].shift(-1).cummax().shift()
print(df)
Output:
col0 col1 col2 col3
0 1 5.0 2.50 NaN
1 2 4.9 2.45 2.45
2 3 5.5 2.75 2.75
3 4 3.5 1.75 2.75
4 5 6.3 3.15 3.15
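If the loop in the question is read as a running maximum of prod (comp[i] = max(prod[i], comp[i - 1])), a vectorized sketch on the question's own columns, assuming df['prod'] has already been computed as above, would be:
df['comp'] = df['prod'].cummax()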

Select every nth row as a Pandas DataFrame without reading the entire file

I am reading a large file that contains ~9.5 million rows x 16 cols.
I am interested in retrieving a representative sample, and since the data is organized by time, I want to do this by selecting every 500th element.
I am able to load the data, and then select every 500th row.
My question: Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read first and then filter my data?
Question 2: How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.
Here is a snippet of what the data looks like (first five rows). The first 4 rows are out of order, but the remaining dataset looks ordered (by time):
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 1 2017-01-09 11:13:28 2017-01-09 11:25:45 1 3.30 1 N 263 161 1 12.5 0.0 0.5 2.00 0.00 0.3 15.30
1 1 2017-01-09 11:32:27 2017-01-09 11:36:01 1 0.90 1 N 186 234 1 5.0 0.0 0.5 1.45 0.00 0.3 7.25
2 1 2017-01-09 11:38:20 2017-01-09 11:42:05 1 1.10 1 N 164 161 1 5.5 0.0 0.5 1.00 0.00 0.3 7.30
3 1 2017-01-09 11:52:13 2017-01-09 11:57:36 1 1.10 1 N 236 75 1 6.0 0.0 0.5 1.70 0.00 0.3 8.50
4 2 2017-01-01 00:00:00 2017-01-01 00:00:00 1 0.02 2 N 249 234 2 52.0 0.0 0.5 0.00 0.00 0.3 52.80
Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read first and then filter my data?
Something you could do is use the skiprows parameter of read_csv, which accepts a list-like argument specifying the rows to discard (and thus, implicitly, the rows to keep). You could create an np.arange with a length equal to the number of rows in the file and remove every 500th element from it with np.delete, so that only every 500th row is read:
import numpy as np
import pandas as pd

n_rows = int(9.5e6)   # approximate number of rows in the file
skip = np.arange(n_rows)
skip = np.delete(skip, np.arange(0, n_rows, 500))
df = pd.read_csv('my_file.csv', skiprows=skip)
Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read first and then filter my data?
First get the length of the file with a custom function, remove every 500th row with numpy.setdiff1d, and pass the result to the skiprows parameter of read_csv:
# https://stackoverflow.com/q/845058
import numpy as np
import pandas as pd

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

len_of_file = file_len('test.csv')
print(len_of_file)

skipped = np.setdiff1d(np.arange(len_of_file), np.arange(0, len_of_file, 500))
print(skipped)

df = pd.read_csv('test.csv', skiprows=skipped)
How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.
The idea is to read only the datetime column via the usecols parameter, sort it, take every 500th index value, compute the set difference, and pass that to skiprows as well:
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

len_of_file = file_len('test.csv')

df1 = pd.read_csv('test.csv',
                  usecols=['tpep_pickup_datetime'],
                  parse_dates=['tpep_pickup_datetime'])

# index labels of every 500th row in pickup-time order (df1 has len_of_file - 1 data rows)
sorted_idx = (df1['tpep_pickup_datetime'].sort_values()
              .iloc[np.arange(0, len_of_file - 1, 500)].index)

# +1 converts DataFrame positions to file row numbers, so the header (row 0) is never skipped
skipped = np.setdiff1d(np.arange(1, len_of_file), sorted_idx + 1)
print(skipped)

df = pd.read_csv('test.csv', skiprows=skipped).sort_values(by=['tpep_pickup_datetime'])
Use a lambda with skiprows:
pd.read_csv(path, skiprows=lambda i: i % N)
This skips every row whose number is not a multiple of N, i.e. it keeps the header (row 0) and every Nth row after it.
Source: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
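For example, to keep the header plus every 500th data row of the file used above (a sketch; the file name is assumed):
import pandas as pd
df = pd.read_csv('test.csv', skiprows=lambda i: i % 500)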
You can use the csv module (csv.reader returns an iterator) together with itertools.cycle to select every nth row.
import csv
from itertools import cycle

source_file = 'D:/a.txt'
cycle_size = 500

# True on the first row of every cycle of 500, False otherwise
chooser = (x == 0 for x in cycle(range(cycle_size)))

with open(source_file) as f1:
    rdr = csv.reader(f1)
    data = [row for pick, row in zip(chooser, rdr) if pick]
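If a DataFrame is wanted at the end, the collected rows can then be wrapped up; because the generator picks the very first line, data[0] here is the header row (values remain strings unless converted):
import pandas as pd
df = pd.DataFrame(data[1:], columns=data[0])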