how to run each sampler in parallel with same csv data - api

Need to run these sampler request in parallel with same csv file provided
CSV have LN,Userid columns that with three values
I want to run all three samplers at same time for each values. So total would be 9 requests ( 3 values ins CSV)
Submit1|Submit2|Submit3 (all should run in parallel for same userid's in csv)
(for value Userid1) | for value Userid2 | Userid3
Submit1 | Submit1 | Submit1
Submit2 | Submit2 | Submit2
Submit3 | Submit3 | Submit3

Did able to manage by setting csv for all Threads and Recycle on EOF = True and Stop at EOF to False.
When used No of threads to 3, It reaches 18 instead was expecting 9 only

Related

Conditionally remove a field in Splunk

I have a table generated by chart that lists the results of a compliance scan
These results are typically Pass, Fail, and Error - but sometimes there is "Unknown" as a response
I want to show the percentage of each (Pass, Fail, Error, Unknown), so I do the following:
| fillnull value=0 Pass Fail Error Unknown
| eval _total=Pass+Fail+Error+Unknown
<calculate percentages for each field>
<append "%" to each value (Pass, Fail, Error, Unknown)>
What I want to do is eliminate a "totally" empty column, and only display it if it actually exists somewhere in the source data (not merely because of the fillnull command)
Is this possible?
I was thinking something like this, but cannot figure out the second step:
| eventstats max(Unknown) as _unk
| <if _unk is 0, drop the field>
edit
This could just as easily be reworded to:
if every entry for a given field is identical, remove it
Logically, this would look something like:
if(mvcount(values(fieldname))<2), fields - fieldname
Except, of course, that's not valid SPL
could you try that logic after the chart :
``` fill with null values ```
| fillnull value=null()
``` do 90° two time, droping empty/null ```
| transpose 0 include_empty=false | transpose 0 header_field=column | fields - column
[edit:] it is working when I do the following but not sure it is easy to make it working on all conditions
| stats count | eval keep=split("1 2 3 4 5"," ") | mvexpand keep
| table keep nokeep
| fillnull value=null()
| transpose 0 include_empty=false | transpose 0 header_field=column | fields - column
[edit2:] and if you need to add more null() could be done like that
| stats count | eval keep=split("1 2 3 4 5"," "), nokeep=0 | mvexpand keep
| table keep nokeep
| foreach nokeep [ eval nokeep=if(nokeep==0,null(),nokeep) ]
| transpose 0 include_empty=false | transpose 0 header_field=column | fields - column

Dynamic creation of Table type

I have a column table with a single column.
I would like to create a table type with all the elements in the column of the above mentioned table as column names with fixed datatype and size and use it in a function.
similarly like below:
Dynamic creation of table in tsql
Any suggestions would be appreciated.
EDIT:
To finish a product, a machine has to perform different Jobs on the material with different tools.
I have a list of Jobs a machine can perform and a list of Tools. a specific tool for a specific Job.
Each job needs a specific tool and number of hours (to change the tool once it reached its change time). A Job can be performed many times on a product. (in this case if a Job is performed for 1 hour = tool has been used for 1 hour)
For each product, a set of tools will be at work in a sequence. so I Need a report for each product, number of hours the tool has worked.
EDIT 2:
Product table
---------+-----+
ProductID|Jobs |
---------+-----+
1 | job1 |
1 | job2 |
1 | job3 |
1 | . |
1 | . |
1 |100th |
2 | job1 |
2 | . |
2 | . |
2 |200th |
Jobs table
-------+-------+-------
Jobs | tool | time
-------+-------+-------
job1 |tool 10| 2
job1 |tool 09| 1
job2 |tool 11| 4
job3 |tool 17| 0.5
required report (this table does not physically exist)
----------+------+------+------+------+------+-----
productID | job1 | job2 | job3 | job4 | job5 | . . .
----------+------+------+------+------+------+------
1 | 20 | 10 | 5 | . | . | .
----------+------+------+------+------+------+------
2 | 10 | 13 | 5 | . | . | .
----------+------+------+------+------+------+------
Based on the added information, there are two main requirements here:
You want to sum up the time spent for producing each product grouped by the jobs involved
and
You want to have a cross-table report showing the times from step 1 against products and jobs.
For the first bit, you probably could do this with a query like this:
SELECT
p.product_id,
j.jobs,
SUM(j.time) as SUM_TIME
FROM
products p
INNER JOIN jobs j
ON p.jobs = j.jobs
GROUP BY
p.product_id,
j.jobs;
For the second part: this is usually called a PIVOT report.
SAP HANA does not provide a dynamic SQL command for generating output in this form (other DBMS have that).
However, this dynamic transformation is usually relevant for the data presentation and not so much for the processing.
So, as you probably want to use some form of front end for this report (e.g. MS Excel, Crystal Reports, Business Objects X, Tableau, ...) I would recommend doing the transformation and formatting in the frontend report. Look for "PIVOT" or "CROSSTAB" options to do that.

Creating a session ID and applying it to events

I have event data from an app that helps tell me what people are doing inside my app.
userID|timestamp |name | value |
A | 1 |Launch | 23 |
A | 3 |ClickButton| Header|
B | 2 |Launch | 10 |
B | 5 |ClickBanner| ad |
etc
I am defining a Session as anytime someone has been out of the app for more than 5 minutes, the next entry is a new session. So if you come back in after 4 minutes, it is still the same session.
I use a lag to select the previous launch timestamp, add the value of time in seconds for that and then take the difference for the next launch. So I can select the first timestamp for each 'Session'
Now I need to map each non Launch event back to the session it is a part of so I can easily analyze things such as 'What percent of sessions include an ad click?'
I'm pulling my data using HIVE and am not having success finding an efficient way to do this as my dataset is fairly large.

SQL parse csv string into multiple columns and rows

I have a CSV string from another application that i need to parse into a number of columns with a new row started every 5th element
source:
Amikacin,AMI,S,≤ 8 µg/ml,
Ampicillin,AMP,R,> 32
µg/ml,Ampicillin/Sulbactam,A/S,I,= 16 µg/ml......
the length of the string varies
desired output:
Amikacin | AMI | S | ≤ 8 µg/ml
Ampicillin | AMP | R | > 32 µg/ml
Ampicillin/Sulbactam | A/S | I | = 16 µg/ml
etc
i have looked at the other posts but have not found a solution. I have tried parsing to xml and can get the data into a single column but am struggling to split the data into the desired format.
if anyone has any ideas i would be grateful
thanks
pete
csv_parse.sql - рarse CSV strings with PostgreSQL

Storing large SQL datasets with variable numbers of columns

In America’s Cup yachting, we generate large datasets where at every time-stamp (e.g. 100Hz) we need to store maybe 100-1000 channels of sensor data (e.g. speed, loads, pressures). We store this in MS SQL Server and need to be able to retrieve subsets of channels of the data for analysis, and perform queries such as the maximum pressure on a particular sensor in a test, or over an entire season.
The set of channels to be stored stays the same for several thousand time-stamps, but day-to-day will change as new sensors are added, renamed, etc... and depending on testing, racing or simulating, the number of channels can vary greatly.
The textbook way to structure the SQL tables would probably be:
OPTION 1
ChannelNames
+-----------+-------------+
| ChannelID | ChannelName |
+-----------+-------------+
| 50 | Pressure |
| 51 | Speed |
| ... | ... |
+-----------+-------------+
Sessions
+-----------+---------------+-------+----------+
| SessionID | Location | Boat | Helmsman |
+-----------+---------------+-------+----------+
| 789 | San Francisco | BoatA | SailorA |
| 790 | San Francisco | BoatB | SailorB |
| ... | ... | ... | |
+-----------+---------------+-------+----------+
SessionTimestamps
+-------------+-------------+------------------------+
| SessionID | TimestampID | DateTime |
+-------------+-------------+------------------------+
| 789 | 12345 | 2013/08/17 10:30:00:00 |
| 789 | 12346 | 2013/08/17 10:30:00:01 |
| ... | ... | ... |
+-------------+-------------+------------------------+
ChannelData
+-------------+-----------+-----------+
| TimestampID | ChannelID | DataValue |
+-------------+-----------+-----------+
| 12345 | 50 | 1015.23 |
| 12345 | 51 | 12.23 |
| ... | ... | ... |
+-------------+-----------+-----------+
This structure is neat but inefficient. Each DataValue requires three storage fields, and at each time-stamp we need to INSERT 100-1000 rows.
If we always had the same channels, it would be more sensible to use one row per time-stamp and structure like this:
OPTION 2
+-----------+------------------------+----------+-------+----------+--------+-----+
| SessionID | DateTime | Pressure | Speed | LoadPt | LoadSb | ... |
+-----------+------------------------+----------+-------+----------+--------+-----+
| 789 | 2013/08/17 10:30:00:00 | 1015.23 | 12.23 | 101.12 | 98.23 | ... |
| 789 | 2013/08/17 10:30:00:01 | 1012.51 | 12.44 | 100.33 | 96.82 | ... |
| ... | ... | ... | | | | |
+-----------+------------------------+----------+-------+----------+--------+-----+
However, the channels change every day, and over the months the number of columns would grow and grow, with most cells ending up empty. We could create a new table for every new Session, but it doesn’t feel right to be using a table name as a variable, and would ultimately result in tens of thousands of tables – also, it becomes very difficult to query over a season, with data stored in multiple tables.
Another option would be:
OPTION 3
+-----------+------------------------+----------+----------+----------+----------+-----+
| SessionID | DateTime | Channel1 | Channel2 | Channel3 | Channel4 | ... |
+-----------+------------------------+----------+----------+----------+----------+-----+
| 789 | 2013/08/17 10:30:00:00 | 1015.23 | 12.23 | 101.12 | 98.23 | ... |
| 789 | 2013/08/17 10:30:00:01 | 1012.51 | 12.44 | 100.33 | 96.82 | ... |
| ... | ... | ... | | | | |
+-----------+------------------------+----------+----------+----------+----------+-----+
with a look-up from Channel column IDs to channel names – but this requires an EXEC or eval to execute a pre-constructed query to obtain the channel we want – because SQL isn’t designed to have column names as variables. On the plus side, we can re-use columns when channels change, but there will still be many empty cells because the table has to be as wide as the largest number of channels we ever encounter. Using a SPARSE table may help here, but I am uncomfortable with the EXEC/eval issue above.
What is the right solution to this problem, that achieves efficiency of storage, inserts and queries?
I would go with Option 1.
Data integrity is first, optimization (if needed) - second.
Other options would eventually have a lot of NULL values and other problems stemming from not being normalized. Managing the data and making efficient queries would be difficult.
Besides, there is a limit on the number of columns that a table can have - 1024, so if you have 1000 sensors/channels you are already dangerously close to the limit. Even if you make your table a wide table, which allows 30,000 columns, still there is a limitation on the size of the row in a table - 8,060 bytes per row. And there are certain performance considerations.
I would not use wide tables in this case, even if I was sure that the data for each row would never exceed 8060 bytes and growing number of channels would never exceed 30,000.
I don't see a problem with inserting 100 - 1000 rows in Option 1 vs 1 row in other options. To do such INSERT efficiently don't make 1000 individual INSERT statements, do it in bulk. In various places in my system I use the following two approaches:
1) Build one long INSERT statement
INSERT INTO ChannelData (TimestampID, ChannelID, DataValue) VALUES
(12345, 50, 1015.23),
(12345, 51, 12.23),
...
(), (), (), (), ........... ();
that contains 1000 rows and execute it as normal INSERT in one transaction, rather than 1000 transactions (check the syntax details).
2) Have a stored procedure that accepts a table-valued parameter. Call such procedure passing 1000 rows as a table.
CREATE TYPE [dbo].[ChannelDataTableType] AS TABLE(
[TimestampID] [int] NOT NULL,
[ChannelID] [int] NOT NULL,
[DataValue] [float] NOT NULL
)
GO
CREATE PROCEDURE [dbo].[InsertChannelData]
-- Add the parameters for the stored procedure here
#ParamRows dbo.ChannelDataTableType READONLY
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
BEGIN TRANSACTION;
BEGIN TRY
INSERT INTO [dbo].[ChannelData]
([TimestampID],
[ChannelID],
[DataValue])
SELECT
TT.[TimestampID]
,TT.[ChannelID]
,TT.[DataValue]
FROM
#ParamRows AS TT
;
COMMIT TRANSACTION;
END TRY
BEGIN CATCH
ROLLBACK TRANSACTION;
END CATCH;
END
GO
If possible, accumulate data from several timestamps before inserting to make the batches larger. You should try with your system and find the optimal size of the batch. I have batches around 10K rows using the stored procedure.
If you have your data coming from sensors 100 times a second, then I would at first dump the incoming raw data in some very simple CSV file(s) and have a parallel background process that would insert it into the database in chunks. In other words, have some buffer for incoming data, so that if the server can't cope with the incoming volume, you would not loose your data.
Based on your comments, when you said that some channels are likely to be more interesting and queried several times, while others are less interesting, here is one optimization that I would consider. In addition to having one table ChannelData for all channels have another table InterestingChannelData. ChannelData would have the whole set of data, just in case. InterestingChannelData would have a subset only for the most interesting channels. It should be much smaller and it should take less time to query it. In any case, this is an optimization (denormalization/data duplication) built on top of properly normalized structure.
Is your process like this:
Generate data during the day
Analyse data afterwards
If these are separate activities then you might want to consider using different 'insert' and 'select' schemas. You could create a schema that's fast for inserting on the boat, then afterwards you batch upload this data into an analysis optimised schema. This requires a transformation step (where for example you map generic column names into useful column names)
This is along the lines of data warehousing and data marts. In this kind of design, you batch load and optimise the schema for reporting. Does your current daily upload have much of a window?