I have a data table with time-series data. How do I write a query to add daily data together for selected series?
My table looks like this...
Day,Y,Series
1,1,A
1,2,A
1,3,A
2,2,A
2,3,B
2,5,C
3,4,A
3,1,B
3,4,C
etc.
I want to return an array (dY) based on a list of series, e.g. {"A","C"}, giving the summed Y value (for A+C) on each day (day 1: 1+2+3 = 6; day 2: 2+5 = 7; day 3: 4+4 = 8)...
dY = {6, 7, 8}
I have managed to write the query in SQL:
SELECT Sum(myTable.Y) AS [Total Of Y]
FROM myTable
WHERE myTable.Series In ("A", "C")
GROUP BY myTable.Day;
and I think it should be something like this in LINQ (VB.NET):
Dim mySeries = {"A", "C"}
Dim Ys = (From row In oSubData
          Where mySeries.Contains(row("Series"))
          Group row By Day = row("Day") Into Group
          Select Total = Group.Sum(Function(r) Val(r("Y")))).ToArray()
I can create an index vs the previous year when I have just one item, but I'm trying to figure out how to do this when I have multiple items. Here is my data set:
rng = pd.date_range('1/1/2011', periods=3, freq='Y')
rng = np.repeat(rng,3)
country = ["USA","Brazil","Japan"]*3
df = pd.DataFrame({'Country':country,'date':rng,'value':range(20,29)})
If I only had one item/country I can do something like this:
df['pct_iya'] = 100*(df['value'].pct_change()+1)
I'm trying to get this to work with multiple items. Maybe this could work with a groupby, but my attempt did not work:
df['pct_iya2'] = df.groupby(['Country','date'])['value'].pct_change()
Answer: use a groupby on Country only (excluding date: grouping by date as well makes every group a single row, so pct_change returns only NaN), then add one to the percent change (e.g. +15 percent goes from 0.15 to 1.15), then multiply by 100:
df['pct_iya'] = 100*(df.groupby(['Country'])['value'].pct_change()+1)
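Putting it together on the sample frame from the question, here is a quick self-contained check of the result (values in the comments rounded to two decimals):
import numpy as np
import pandas as pd

rng = np.repeat(pd.date_range('1/1/2011', periods=3, freq='Y'), 3)
df = pd.DataFrame({'Country': ["USA", "Brazil", "Japan"] * 3,
                   'date': rng, 'value': range(20, 29)})
# Group by Country only, so each country is compared with its own previous year
df['pct_iya'] = 100 * (df.groupby(['Country'])['value'].pct_change() + 1)
# USA:    NaN, 115.00, 113.04
# Brazil: NaN, 114.29, 112.50
# Japan:  NaN, 113.64, 112.00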
I am working with several CSVs whose first N columns hold general information and whose next M columns (M is big) hold information regarding dates.
This is the dataframe picture
I need to convert just the names of columns N+1 through N+M-1 to date format.
I tried this (in this case N+1 = 5; no matter what M is, I suppose I can use -1 so the last column name is not affected):
ContDiarios.columns[5:-1] = pd.to_datetime(ContDiarios.columns[5:-1])
but I get the following error:
TypeError: Index does not support mutable operations
The way you are doing it is not feasible: a pandas Index is immutable, so you cannot assign to a slice of it. Please try this way:
def convert(col):
    try:
        return pd.to_datetime(col)
    except Exception:
        # leave non-date column names (the first N informational columns) unchanged
        return col

ContDiarios.columns = [convert(c) for c in ContDiarios.columns]
Or you can use the DataFrame.rename method to convert only the columns you want.
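A minimal sketch of that alternative, assuming the same N+1 = 5 offset as in the question (date_cols and mapping are just illustrative names):
# Rename only the date-like columns, leaving the first five and the last one untouched
date_cols = ContDiarios.columns[5:-1]
mapping = dict(zip(date_cols, pd.to_datetime(date_cols)))
ContDiarios = ContDiarios.rename(columns=mapping)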
I have performed a stratified sample on a multi-label dataset before training a classifier and want to check how balanced it is now. The columns in the dataset are:
|_Body|label_0|label_1|label_10|label_100|label_101|label_102|label_103|label_104|label_11|label_12|label_13|label_14|label_15|label_16|label_17|label_18|label_19|label_2|label_20|label_21|label_22|label_23|label_24|label_25|label_26|label_27|label_28|label_29|label_3|label_30|label_31|label_32|label_33|label_34|label_35|label_36|label_37|label_38|label_39|label_4|label_40|label_41|label_42|label_43|label_44|label_45|label_46|label_47|label_48|label_49|label_5|label_50|label_51|label_52|label_53|label_54|label_55|label_56|label_57|label_58|label_59|label_6|label_60|label_61|label_62|label_63|label_64|label_65|label_66|label_67|label_68|label_69|label_7|label_70|label_71|label_72|label_73|label_74|label_75|label_76|label_77|label_78|label_79|label_8|label_80|label_81|label_82|label_83|label_84|label_85|label_86|label_87|label_88|label_89|label_9|label_90|label_91|label_92|label_93|label_94|label_95|label_96|label_97|label_98|label_99|
I want to group by every label_* column once, and create a dictionary of the results with positive/negative counts. At the moment I am accomplishing this in PySpark SQL like this:
# Evaluate how skewed the sample is after balancing it by resampling
stratified_sample = spark.read.json('s3://stackoverflow-events/1901/Sample.Stratified.{}.*.jsonl'.format(limit))
stratified_sample.registerTempTable('stratified_sample')
label_counts = {}
for i in range(0, 100):
    count_df = spark.sql('SELECT label_{}, COUNT(*) as total FROM stratified_sample GROUP BY label_{}'.format(i, i))
    rows = count_df.rdd.take(2)
    neg_count = getattr(rows[0], 'total')
    pos_count = getattr(rows[1], 'total')
    label_counts[i] = [neg_count, pos_count]
The output is thus:
{0: [1034673, 14491],
1: [1023250, 25914],
2: [1030462, 18702],
3: [1035645, 13519],
4: [1037445, 11719],
5: [1010664, 38500],
6: [1031699, 17465],
...}
This feels like it should be possible in one SQL statement, but I can't figure out how to do this or find an existing solution. Obviously I don't want to write out all the column names, and generating the SQL seems worse than this solution.
Can SQL do this? Thanks!
You can indeed do that in one statement, but I am not sure the performance will be good.
from pyspark.sql import functions as F
from functools import reduce
dataframes_list = [
    stratified_sample.groupBy(
        "label_{}".format(i)
    ).count().select(
        F.lit("label_{}".format(i)).alias("col"),
        F.col("label_{}".format(i)).alias("value"),
        "count"
    )
    for i in range(0, 100)
]
count_df = reduce(
    lambda a, b: a.union(b),
    dataframes_list
)
This will create a dataframe with three columns: col, which contains the name of the label column being counted; value, the label value for that group (0 or 1); and count, the number of rows with that value.
To change it to a dict, you can collect the result, which is small, and fold it up.
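A rough sketch of that last step, assuming the three-column frame (col, value, count) built above:
# Collect the small unioned result to the driver and fold it into {column_name: [neg, pos]}
label_counts = {}
for row in count_df.collect():
    counts = label_counts.setdefault(row['col'], [0, 0])
    counts[1 if row['value'] else 0] = row['count']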
Here is a solution with a single SQL statement to get all the pos and neg counts:
sql = 'select '
for i in range(0, 100):
    # assuming labels are 0/1: count zeros as negatives, positive values as positives,
    # with the neg count first so the dict matches the [neg, pos] format in the question
    sql = sql + ' sum(CASE WHEN label_{} = 0 THEN 1 ELSE 0 END) as label{}_neg_count, '.format(i, i)
    sql = sql + ' sum(CASE WHEN label_{} > 0 THEN 1 ELSE 0 END) as label{}_pos_count'.format(i, i)
    if i < 99:
        sql = sql + ', '
sql = sql + ' from stratified_sample '

df = spark.sql(sql)
rows = df.rdd.take(1)

label_counts = {}
for i in range(0, 100):
    label_counts[i] = [rows[0][2*i], rows[0][2*i + 1]]
print(label_counts)
You can also generate SQL without a GROUP BY.
Something like:
SELECT COUNT(*) AS total, SUM(label_k) as positive_k, ... FROM table
And then use the result to produce your dict {k: [total - positive_k, positive_k]}, as sketched below.
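A minimal sketch of that idea, assuming the labels are 0/1 (the exprs name and positive_ aliases are just illustrative):
# One SELECT with a SUM per label column, plus the overall row count
exprs = ', '.join('SUM(label_{i}) AS positive_{i}'.format(i=i) for i in range(100))
row = spark.sql('SELECT COUNT(*) AS total, ' + exprs + ' FROM stratified_sample').first()
# Everything that is not positive is negative
label_counts = {i: [row['total'] - row['positive_{}'.format(i)], row['positive_{}'.format(i)]]
                for i in range(100)}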
I'm trying to perform calculations based on the entries in a pandas dataframe with 1466 rows. I'll have to run similar calculations on other dfs with more rows later.
What I'm trying to do is calculate something like mag = (U-V)/(R-I) (but ignoring any values that are -999), put that in a new column, and then put z_pred = 10**((mag-c)/m) in another new column (c and m are just hard-coded variables). I have other columns I need to add too, but I figure that'll just be an extension of the same method.
I started out by trying
for i in range(1):
    current = qso[:]
    mag = (U - V) / (R - I)
    name = current['NED']
    z_pred = 10**((mag - c) / m)
    z_meas = current['z']
but I got either a Series for z, which I couldn't operate on, or various type errors when I tried to print the values or write them to a file.
I found this question which gave me a start, but I can't see how to apply it to multiple calculations, as in my situation.
How can I achieve this?
Conditionally calculated columns are usually added row-wise with numpy's np.where:
df['mag'] = np.where(~df[['U', 'V', 'R', 'I']].eq(-999).any(axis=1), (df.U - df.V) / (df.R - df.I), -999)
Note: this assumes that when any of the four columns contains -999, the row is not calculated and -999 is returned instead.
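The second column then follows the same pattern; a small sketch, assuming c and m are the hard-coded constants from the question and -999 stays as the sentinel:
# Only compute z_pred where mag itself is valid
df['z_pred'] = np.where(df['mag'].ne(-999), 10 ** ((df['mag'] - c) / m), -999)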
I have a technical issue with Access graphs. I have a table in an Access database with 4 fields: xValue, yValue, round, partOfRound.
What I want: there are always 2 rounds, each round has 2 parts. I need to get a series per round per part (so from round 1 part 1, round 1 part 2, round 2 part 1, round 2 part 2) with all xValues and yValues in a chart.
But then I have another problem: the xValue isn't a good number to show. It needs to be divided by a number from another table (call it the factor in Table3), where the row of Table3 matches the identifier I use for my chart (IDtable2 = IDtable3).
The final result will be 4 lines with the data in my graph, so 4 series.
But when I use the wizard for making graphs, I can only set one field as the series value, so it treats a round as just one series instead of two.
How do I solve this problem?
Kind regards
Kristof
What type of graph - just a column?
Concatenate the round and partOfRound fields.
Try changing the graph RowSource to:
TRANSFORM Sum(Table2.yValue) AS SumOfyValue
SELECT Table2.xValue
FROM Table2
GROUP BY Table2.xValue
PIVOT [round] & "_" & [partOfRound];
Possible SQL with the table join included, to calculate the division:
TRANSFORM Sum(Table2.yValue) AS SumOfyValue
SELECT Round([xValue]/[Factor],0) AS x
FROM Table3 INNER JOIN Table2 ON Table3.PK_Table3 = Table2.FK_Table3
GROUP BY Round([xValue]/[Factor],0)
PIVOT [round] & "_" & [partOfRound];
For both queries, I had to open the graph editor (double click the graph) and from the menu click on "By Column" button to get the x values on the x axis.
I do hope round is not the actual field name: it is a reserved word, and you should not use reserved words as names for anything.