Event Hub input data is three times larger in the output when read using Stream Analytics - azure-data-lake

When I ingest a 100 KB data file into Event Hub and then read the data from Event Hub using Stream Analytics, the output file size is three times bigger than the input file.
Please confirm.

I had the same issue, and while investigating it I found that the input count (by bytes and by count) was multiplied by the partition factor.
The partitions are created and mapped as inputs (we have only 2 inputs), but when you look at your job diagram, click ... and select "Expand partitions".
In the attached picture our IoT Hub input is expanded to 20 partitions.

Related

Calculations of NTFS Partition Table Starting Points

I have a disk image. I'm able to see partition start and end values with gparted or other tools. However, I want to calculate them manually. I inserted an image which shows my disk image's partition start and end values. I also inserted the $MFT file with a link. As you can see in the picture, the starting point for partition 2 is 7968240. How can I determine this number with an actual calculation? I tried to divide this value by the sector size, which is 512, but the results don't fit. I'd appreciate a formula for it. Start and End Points of Partitions.
$MFT File : https://file.io/r7sy2A7itdur
How can I determine this number with a real calculation?
The information about how a hard disk has been partitioned is stored in its first sector (that is, the first sector of the first track on the first disk surface). The first sector is the master boot record (MBR) of the disk; this is the sector that the BIOS reads in and starts when the machine is first booted.
For the current partitioning scheme (GPT) you can get more information here. The $MFT is only a part of the NTFS file system in question, whose location is calculated from the partition start recorded in the GPT or MBR.
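To make the calculation concrete, here is a minimal Python sketch (not part of the original answer) that reads the four classic MBR partition entries from a raw disk image and prints each entry's starting LBA together with its byte offset; the image file name is a placeholder and classic 512-byte sectors are assumed:

import struct

SECTOR_SIZE = 512        # assumed classic 512-byte sectors
TABLE_OFFSET = 446       # MBR partition table starts at byte 0x1BE
ENTRY_SIZE = 16          # four 16-byte entries follow

with open("disk.img", "rb") as f:   # placeholder image file name
    mbr = f.read(512)

for i in range(4):
    entry = mbr[TABLE_OFFSET + i * ENTRY_SIZE : TABLE_OFFSET + (i + 1) * ENTRY_SIZE]
    ptype = entry[4]                                   # partition type byte
    start_lba, num_sectors = struct.unpack("<II", entry[8:16])
    if ptype == 0:                                     # empty slot
        continue
    print("partition %d: type=0x%02x start_lba=%d byte_offset=%d size_bytes=%d"
          % (i + 1, ptype, start_lba, start_lba * SECTOR_SIZE, num_sectors * SECTOR_SIZE))

Note that on a GPT disk the MBR slot usually holds a single protective 0xEE entry, and the real partition starts come from the GPT entries instead.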

CloudWatch dashboard Insights graphs - can I set the bin size dynamically?

I'm using dashboards to monitor various output stats on AWS.
Let's say it looks something like this:
stats avg(myfield1), min(myfield2), max(myfield3) by bin(1m)
This works fine; however, by default I am using a bin size of 1 minute, so the data retention period is only 3 days. If I want to look at a week or a month I have to use a separate widget with a larger bin size. I still want the 1-minute resolution for the shorter time periods, and I'd rather not double up the graphs, as the dashboard is already very busy.
Obviously, all the built-in metrics graphs adjust the bin size they query dynamically as the viewed time range changes.
Is it possible to do this within a CloudWatch Insights query, and if so, what is the syntax?

Not able to load a 5 GB CSV file from Cloud Storage to a BigQuery table

I have a file of more than 5 GB on Google Cloud Storage, and I am not able to load that file into a BigQuery table.
The errors thrown are:
1) too many errors
2) Too many values in row starting at position: 2213820542
I searched and found it could be because the maximum file size was reached, so my question is: how can I upload a file whose size exceeds the quota policy? Please help. I have a billing account on BigQuery.
5 GB is OK. The error says that the row starting at byte position 2213820542 has more columns than specified, e.g. you provided a schema of n columns, and that row has more than n columns after splitting.
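As an illustration of how to track the bad row down (this is not from the original answer), a short Python sketch that seeks to the reported byte position in a local copy of the CSV and counts the fields in the offending row; the file name and the expected column count are placeholders:

import csv

REPORTED_POSITION = 2213820542   # byte offset from the BigQuery error message
EXPECTED_COLUMNS = 25            # placeholder: number of columns in your schema

with open("data.csv", "rb") as f:            # placeholder: local copy of the GCS file
    f.seek(REPORTED_POSITION)                # the error points at the start of the row
    raw_line = f.readline().decode("utf-8", errors="replace")

fields = next(csv.reader([raw_line]))
print("row has %d fields, schema expects %d" % (len(fields), EXPECTED_COLUMNS))
print(fields)

A common cause is an unquoted delimiter inside a field; fixing the quoting (or the schema) usually resolves it.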

How to calculate throughput for individual links (and thereby the fairness index) using an old wireless trace format file in ns2?

I am stuck on how to identify the different connections (flows) in the trace file.
The following is the format in which the trace file is created:
1. event
2. time
3. source node
4. destination node
5. packet type
6. packet size
7. flags
8. fid (flow id)
9. source address
10. destination address
11. sequence number
12. packet id
If you take a look at the trace format above, the 8th column is the flow id, which you can extract using an awk file.
Read up on awk and on how to isolate or count sent and received packets along with the flow id. Once you have those counts, just divide the two.
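As a rough illustration (using Python instead of the awk file the answer suggests), here is a sketch that splits each trace line into columns, groups packets by the 8th column (fid), counts sent and received packets, and sums received bytes to estimate per-flow throughput and Jain's fairness index. The trace file name and the choice of which event codes count as "sent" are assumptions based on the column list above:

from collections import defaultdict

sent = defaultdict(int)        # packets sent per flow id
recv = defaultdict(int)        # packets received per flow id
recv_bytes = defaultdict(int)  # bytes received per flow id
first_rx = {}                  # time of first received packet per flow
last_rx = {}                   # time of last received packet per flow

with open("out.tr") as trace:                      # placeholder trace file name
    for line in trace:
        cols = line.split()
        if len(cols) < 8:
            continue
        event, t = cols[0], float(cols[1])
        size, fid = int(cols[5]), cols[7]          # columns 6 and 8 of the format above
        if event in ("s", "+"):                    # assumed send/enqueue event codes
            sent[fid] += 1
        elif event == "r":                         # receive events
            recv[fid] += 1
            recv_bytes[fid] += size
            first_rx.setdefault(fid, t)
            last_rx[fid] = t

throughput = {}
for fid in sorted(recv):
    duration = last_rx[fid] - first_rx[fid]
    if duration > 0:
        throughput[fid] = recv_bytes[fid] * 8.0 / duration   # bits per second
    ratio = float(recv[fid]) / sent[fid] if sent[fid] else float("nan")
    print("flow %s: sent=%d recv=%d delivery_ratio=%.3f throughput=%.1f bit/s"
          % (fid, sent[fid], recv[fid], ratio, throughput.get(fid, 0.0)))

# Jain's fairness index over the per-flow throughputs
x = list(throughput.values())
if x:
    fairness = sum(x) ** 2 / (len(x) * sum(v * v for v in x))
    print("Jain's fairness index: %.3f" % fairness)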

How to send data to only one Azure SQL DB Table from Azure Streaming Analytics?

Background
I have set up an IoT project using an Azure Event Hub and Azure Stream Analytics (ASA) based on tutorials from here and here. JSON-formatted messages are sent from a Wi-Fi-enabled device to the event hub using webhooks; they are then fed through an ASA query and stored in one of three Azure SQL databases based on the input stream they came from.
The device (Particle Photon) transmits 3 different messages with different payloads, for which there are 3 SQL tables defined for long term storage/analysis. The next step includes real-time alerts, and visualization through Power BI.
Here is a visual representation of the idea:
The ASA Query
SELECT
ParticleId,
TimePublished,
PH,
-- and other fields
INTO TpEnvStateOutputToSQL
FROM TpEnvStateInput
SELECT
ParticleId,
TimePublished,
EventCode,
-- and other fields
INTO TpEventsOutputToSQL
FROM TpEventsInput
SELECT
ParticleId,
TimePublished,
FreshWater,
-- and other fields
INTO TpConsLevelOutputToSQL
FROM TpConsLevelInput
Problem: For every message received, the data is pushed to all three tables in the database, and not only the output specified in the query. The table in which the data belongs gets populated with a new row as expected, while the two other tables get populated with NULLs for columns which no data existed for.
From the ASA documentation it was my understanding that the INTO keyword would direct the output to the specified sink. But that does not seem to be the case, as the output from all three inputs gets pushed to all sinks (all 3 SQL tables).
The test script I wrote for the Particle Photon will send one of each type of message with hardcoded fields, in the order: EnvState, Event, ConsLevels, each 15 seconds apart, repeating.
Here is an example of the output being sent to all tables, showing one column from each table:
Which was generated using this query (in Visual Studio):
SELECT
t1.TimePublished as t1_t2_t3_TimePublished,
t1.ParticleId as t1_t2_t3_ParticleID,
t1.PH as t1_PH,
t2.EventCode as t2_EventCode,
t3.FreshWater as t3_FreshWater
FROM dbo.EnvironmentState as t1, dbo.Event as t2, dbo.ConsumableLevel as t3
WHERE t1.TimePublished = t2.TimePublished AND t2.TimePublished = t3.TimePublished
For an input event of type TpEnvStateInput where the key 'PH' would exist (and not keys 'EventCode' or 'FreshWater', which belong to TpEventInput and TpConsLevelInput, respectively), an entry into only the EnvironmentState table is desired.
Question:
Is there a bug somewhere in the ASA query, or a misunderstanding on my part on how ASA should be used/setup?
I was hoping I would not have to define three separate Stream Analytics containers, as they tend to be rather pricey. After running through this tutorial, and leaving 4 ASA containers running for one day, I used up nearly $5 in Azure credits. At a projected $150/mo cost, there's just no way I could justify sticking with Azure.
ASA is meant for Complex Event Processing. In your queries you are using ASA essentially to pass data from the event hub to tables. It will be much cheaper if you instead host a simple "worker" web app to process the incoming events.
This blog post covers the best practices:
http://blogs.msdn.com/b/servicebus/archive/2015/01/16/event-processor-host-best-practices-part-1.aspx
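If you do go that route, here is a minimal sketch using the azure-eventhub Python SDK (v5) rather than the .NET EventProcessorHost the linked post describes; the connection string, hub name, and the idea of routing on which key is present are placeholders you would adapt:

from azure.eventhub import EventHubConsumerClient

CONNECTION_STR = "<event hub namespace connection string>"  # placeholder
EVENTHUB_NAME = "<event hub name>"                          # placeholder

def on_event(partition_context, event):
    payload = event.body_as_json()        # the device sends JSON payloads
    # Here you would insert into the right SQL table yourself, e.g. based on
    # which key (PH / EventCode / FreshWater) is present in the payload.
    print(partition_context.partition_id, payload)
    # Persists progress only if a checkpoint store is configured on the client.
    partition_context.update_checkpoint(event)

client = EventHubConsumerClient.from_connection_string(
    CONNECTION_STR,
    consumer_group="$Default",
    eventhub_name=EVENTHUB_NAME,
)

with client:
    # starting_position="-1" reads from the beginning of each partition
    client.receive(on_event=on_event, starting_position="-1")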
ASA is great if you are doing some transformations, filters, light analytics on your input data in real-time. Furthermore, it also works great if you have some Azure Machine Learning models that are exposed as functions (currently in preview).
In your example, all three "select into" statements are reading from the same input source, and don't have any filter clauses, so all rows would be selected.
If you only want to select specific rows for each of the outputs, you have to specify a filter condition. For example, assuming you only want records with a non-null value in column "PH" for the output "TpEnvStateOutputToSQL", the ASA query would look like the one below:
SELECT
ParticleId,
TimePublished,
PH
-- and other fields
INTO TpEnvStateOutputToSQL
FROM TpEnvStateInput
WHERE PH IS NOT NULL