Here is a simple example:
Scenario: Table example
* table dogs
| name | age |
| 'Charlie' | 2 |
| 'Jack' | 4 |
| 'Rock' | 9 |
* match dogs == [{name: 'Charlie', age: 2}, {name: 'Jack', age: 4}, {name: 'Rock', age: 9}]
Is it possible to move the table to another file and just import it? If yes, how exactly?
Thanks in advance
If you want to do this at design time, say, to import a table from another data source, you can use a design tool such as CukeTest, which allows you to edit the table visually and to import data from a *.csv file into a Table or Examples section of the Gherkin file. You can save an Excel data file in *.csv format.
If you want to do it at runtime, there are a lot of ways to read data and parse it programmatically, typically from a *.json or *.csv file.
For this use case I recommend using JSON instead of a table; it will import nicely and has the advantage of being editable by most IDEs etc.
* def dogs = read('dogs.json')
* match dogs == [{name: 'Charlie', age: 2}, {name: 'Jack', age: 4}, {name: 'Rock', age: 9}]
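For reference, a dogs.json holding the same data as the table above would simply contain:
[
  { "name": "Charlie", "age": 2 },
  { "name": "Jack", "age": 4 },
  { "name": "Rock", "age": 9 }
]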
You could do this also:
* call read('dogs.feature')
* match dogs == [{name: 'Charlie', age: 2}, {name: 'Jack', age: 4}, {name: 'Rock', age: 9}]
And in dogs.feature
Feature:
Scenario:
* table dogs
| name | age |
| 'Charlie' | 2 |
| 'Jack' | 4 |
| 'Rock' | 9 |
EDIT: since some teams insist on Excel (which I don't recommend) refer to this answer also if applicable: https://stackoverflow.com/a/47954946/143475
Let's assume we have a DataFrame df with N rows:
| multiple-index     | ordinary columns   |
| I_A, I_B, I_C, I_D | C_A, C_B, C_C, C_D |
How can we extract all N values of the I_B index column? df.index gives us all combinations of I_A...I_D, but that is not what we need. Of course, we can iterate over it, but that would be costly; there must be an easier, more straightforward way?
Thank you for your time.
UPDATE
E.g., we have a DataFrame generated by:
import pandas as pd

data = {
    "animal": ["cat", "dog", "parrot", "hamster"],
    "size": ["big", "big", "small", "small"],
    "feet": [4, 4, 2, 4]
}
multi = pd.DataFrame(data)
multi.set_index(["size", "feet"], inplace=True)
and which is:
size  feet | animal
big   4    | cat
big   4    | dog
small 2    | parrot
small 4    | hamster
Its index is:
MultiIndex([( 'big', 4),
( 'big', 4),
('small', 2),
('small', 4)],
names=['size', 'feet'])
from which we would like to get all sizes:
['big', 'big', 'small', 'small']
How can we do that?
I think you're looking for MultiIndex.get_level_values:
multi.index.get_level_values('size')
Output: Index(['big', 'big', 'small', 'small'], dtype='object', name='size')
Or as list:
multi.index.get_level_values('size').to_list()
Output: ['big', 'big', 'small', 'small']
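If you prefer to address the level by position instead of by name, the same call accepts an integer ('size' is level 0 here, since it was listed first in set_index):
multi.index.get_level_values(0).to_list()
Output: ['big', 'big', 'small', 'small']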
columns: datetime | clientid | amounts | new_column_to_be_implemented (rolling mean of the previous amounts, but only over rows with the same clientid)
day 1 | 2 | 50 | (na)
day 2 | 2 | 60 | 50
day 3 | 1 | 45 | (na)
day 4 | 2 | 45 | 110
day 5 | 3 | 90 | (na)
day 6 | 3 | 10 | 90
day 7 | 2 | 10 | 105
So this takes, for example, the mean of the last 2 amounts for the same clientid.
I know it is possible to keep a list and append/pop values to remember them, but is there a better way in pandas?
Please make sure to follow the guidelines described in How to make good reproducible pandas examples when asking pandas-related questions; it helps a lot with reproducibility.
The key element of the answer is the pairing of the groupby and rolling methods: groupby groups all the records with the same clientid, and rolling selects the correct number of records for the mean calculation.
import pandas as pd
import numpy as np
# setting up the dataframe
data = [
['day 1', 2, 50],
['day 2', 2, 60],
['day 3', 1, 45],
['day 4', 2, 45],
['day 5', 3, 90],
['day 6', 3, 10],
['day 7', 2, 10]
]
columns = ['date', 'clientid', 'amounts']
df = pd.DataFrame(data=data, columns=columns)
# rolling(2) takes the mean over the current and previous amount within each clientid group
rolling_mean = df.groupby('clientid').rolling(2)['amounts'].mean()
# drop the clientid level from the result's MultiIndex so it aligns with df's original row index
rolling_mean.index = rolling_mean.index.get_level_values(1)
df['client_rolling_mean'] = rolling_mean
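With the data above the resulting frame should look roughly like this:
    date  clientid  amounts  client_rolling_mean
0  day 1         2       50                  NaN
1  day 2         2       60                 55.0
2  day 3         1       45                  NaN
3  day 4         2       45                 52.5
4  day 5         3       90                  NaN
5  day 6         3       10                 50.0
6  day 7         2       10                 27.5
Note that rolling(2) includes the current row in the window; if, as in the expected column of the question, only the rows before should count, you would additionally shift(1) within each group before rolling.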
The following is an example Dataframe snippet:
+-------------------+--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|_lid |trace |message |
+-------------------+--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1103960793391132675|47c10fda9b40407c998c154dc71a9e8c|[app.py:208] Prediction label: {"id": 617, "name": "CENSORED"}, score=0.3874854505062103 |
|1103960793391132676|47c10fda9b40407c998c154dc71a9e8c|[app.py:224] Similarity values: [0.6530804801919593, 0.6359653379418201] |
|1103960793391132677|47c10fda9b40407c998c154dc71a9e8c|[app.py:317] Predict=s3://CENSORED/scan_4745/scan4745_t1_r0_c9_2019-07-15-10-32-43.jpg trait_id=112 result=InferenceResult(predictions=[Prediction(label_id='230', label_name='H3', probability=0.0), Prediction(label_id='231', label_name='Other', probability=1.0)], selected=Prediction(label_id='231', label_name='Other', probability=1.0)). Took 1.3637824058532715 seconds |
+-------------------+--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I have millions of these log-like structures, and they can all be grouped by trace, which is unique to a session.
I'm looking to transform these sets of rows into single rows, essentially mapping over them. For this example I would extract from the first row the "id": 617, from the second row the values 0.6530804801919593, 0.6359653379418201, and from the third row the Prediction(label_id='231', label_name='Other', probability=1.0) value.
Then I would compose a new table having the columns:
| trace | id | similarity | selected |
with the values:
| 47c10fda9b40407c998c154dc71a9e8c | 617 | 0.6530804801919593, 0.6359653379418201 | 231 |
How should I implement this group-map transform over several rows in PySpark?
I've written the below example in Scala for my own convenience, but it should translate readily to PySpark.
1) Create the new columns in your dataframe via regexp_extract on the "message" field. This will produce the desired values if the regex matches, or empty strings if not:
scala> val dss = ds.select(
| 'trace,
| regexp_extract('message, "\"id\": (\\d+),", 1) as "id",
| regexp_extract('message, "Similarity values: \\[(\\-?[0-9\\.]+, \\-?[0-9\\.]+)\\]", 1) as "similarity",
| regexp_extract('message, "selected=Prediction\\(label_id='(\\d+)'", 1) as "selected"
| )
dss: org.apache.spark.sql.DataFrame = [trace: string, id: string ... 2 more fields]
scala> dss.show(false)
+--------------------------------+---+--------------------------------------+--------+
|trace |id |similarity |selected|
+--------------------------------+---+--------------------------------------+--------+
|47c10fda9b40407c998c154dc71a9e8c|617| | |
|47c10fda9b40407c998c154dc71a9e8c| |0.6530804801919593, 0.6359653379418201| |
|47c10fda9b40407c998c154dc71a9e8c| | |231 |
+--------------------------------+---+--------------------------------------+--------+
2) Group by "trace" and eliminate the cases where the regex didn't match. The quick and dirty way (shown below) is to select the max of each column, but you might need to do something more sophisticated if you expect to encounter more than one match per trace:
scala> val ds_final = dss.groupBy('trace).agg(max('id) as "id", max('similarity) as "similarity", max('selected) as "selected")
ds_final: org.apache.spark.sql.DataFrame = [trace: string, id: string ... 2 more fields]
scala> ds_final.show(false)
+--------------------------------+---+--------------------------------------+--------+
|trace |id |similarity |selected|
+--------------------------------+---+--------------------------------------+--------+
|47c10fda9b40407c998c154dc71a9e8c|617|0.6530804801919593, 0.6359653379418201|231 |
+--------------------------------+---+--------------------------------------+--------+
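For a PySpark version, a rough equivalent of the two steps above might look like this (an untested sketch; ds is assumed to be the input DataFrame with the trace and message columns, as in the Scala snippet):
from pyspark.sql import functions as F

# 1) extract the fields of interest; regexp_extract returns '' when the pattern does not match
dss = ds.select(
    'trace',
    F.regexp_extract('message', r'"id": (\d+),', 1).alias('id'),
    F.regexp_extract('message', r'Similarity values: \[(-?[0-9.]+, -?[0-9.]+)\]', 1).alias('similarity'),
    F.regexp_extract('message', r"selected=Prediction\(label_id='(\d+)'", 1).alias('selected'),
)

# 2) collapse each trace into one row; an empty string sorts before any non-empty string,
#    so max() keeps the single extracted value in each column
ds_final = dss.groupBy('trace').agg(
    F.max('id').alias('id'),
    F.max('similarity').alias('similarity'),
    F.max('selected').alias('selected'),
)
ds_final.show(truncate=False)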
I ended up using something along the lines of:
import re
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, TimestampType, StringType

expected_schema = StructType([
    StructField("event_timestamp", TimestampType(), False),
    StructField("trace", StringType(), False),
    ...
])

@F.pandas_udf(expected_schema, F.PandasUDFType.GROUPED_MAP)
def transform(pdf):
    # Input/output are both a pandas.DataFrame
    output = {}
    for l in pdf.to_dict(orient='records'):
        # strip the "[file.py:lineno]" prefix and keep the message body
        x = re.findall(r'^(\[.*:\d+\]) (.*)', l['message'])[0][1]
        ...
    return pd.DataFrame(data=[output])

df.groupby('trace').apply(transform)
Let's say I have some data in BigQuery which includes a nested array of objects like so:
{
"name" : "Bob",
"age": "24",
"customFields": [
{
"index": "1",
"value": "1.98"
},
{
"index": "2",
"value": "Nintendo"
},
{
"index": "3",
"value": "Yellow"
}
]
}
I've only been able to unnest this data so that the "index" and "value" fields are columns:
+------+-----+-------+----------+
| name | age | index | value |
+------+-----+-------+----------+
| Bob | 24 | 1 | 1.98 |
| Bob | 24 | 2 | Nintendo |
| Bob | 24 | 3 | Yellow |
+------+-----+-------+----------+
In most cases this would be the desired output, but as the data I'm using refers to Google Analytics custom dimensions I require something a bit more complex. I'm trying to get the index value to be used in the name of the column the data appears in, like so:
+------+-----+---------+----------+---------+
| name | age | index_1 | index_2 | index_3 |
+------+-----+---------+----------+---------+
| Bob | 24 | 1.98 | Nintendo | Yellow |
+------+-----+---------+----------+---------+
Is this possible? What would be the SQL query required to generate this output? It should use the "index" value in the column name, as the indexes won't always be in the order "1, 2, 3, ...".
What you are describing is often referred to as a pivot table - a transformation where values are used as columns. SQL doesn't generally support this, as SQL is designed around the concept of a fixed schema, while a pivot table requires a dynamic schema.
However, if you have a fixed set of indexes you can emulate it with something like:
SELECT
name,
age,
ARRAY(SELECT value FROM UNNEST(customFields) WHERE index="1")[SAFE_OFFSET(0)] AS index_1,
ARRAY(SELECT value FROM UNNEST(customFields) WHERE index="2")[SAFE_OFFSET(0)] AS index_2,
ARRAY(SELECT value FROM UNNEST(customFields) WHERE index="3")[SAFE_OFFSET(0)] AS index_3
FROM your_table;
What this does is explicitly define a column for each index, picking the right value out of the customFields array.
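An equivalent formulation, if you prefer a single UNNEST with conditional aggregation, might look like this (a sketch against the same hypothetical your_table; not tested on your data):
SELECT
  name,
  age,
  MAX(IF(cf.index = "1", cf.value, NULL)) AS index_1,
  MAX(IF(cf.index = "2", cf.value, NULL)) AS index_2,
  MAX(IF(cf.index = "3", cf.value, NULL)) AS index_3
FROM your_table, UNNEST(customFields) AS cf
GROUP BY name, age;
Here the grouping key is just name and age for illustration; in practice you would group by whatever uniquely identifies a record.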
Say I have a Product table with a json array attribute called "name". For example, Product.first.name == ["large", "black", "hoodie"]. I want to search through my database for Products with names that contain words in my search query. So if I type in "large hoodie", Product.first should be returned in the results.
So first I have to turn the search key into an array of strings:
def search
search_array = params[:search].split(" ")
results = #???
but how can I search for Products with names that include values also contained in search_array? I've found documentation on how to search for values within arrays, but not on how to search for arrays themselves.
You can simply use the @> (contains) operator.
select * from products;
id | name | tags | created_at | updated_at
----+---------+--------------------------------+----------------------------+----------------------------
3 | T-Shirt | {clothing,summer} | 2017-10-30 05:28:19.394888 | 2017-10-30 05:28:19.394888
4 | Sweater | {clothing,winter,large,hoodie} | 2017-10-30 05:28:38.189589 | 2017-10-30 05:28:38.189589
(2 rows)
select * from products where tags @> '{large, hoodie}';
id | name | tags | created_at | updated_at
----+---------+--------------------------------+----------------------------+----------------------------
4 | Sweater | {clothing,winter,large,hoodie} | 2017-10-30 05:28:38.189589 | 2017-10-30 05:28:38.189589
(1 row)
Or, as an AR query,
2.3.1 :002 > Product.where("tags @> '{large, hoodie}'")
Product Load (0.4ms) SELECT "products".* FROM "products" WHERE (tags @> '{large, hoodie}')
=> #<ActiveRecord::Relation [#<Product id: 4, name: "Sweater", tags: ["clothing", "winter", "large", "hoodie"], created_at: "2017-10-30 05:28:38", updated_at: "2017-10-30 05:28:38">]>
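To plug the search_array from the question into that query instead of a hard-coded list, something along these lines should work (a sketch, assuming tags is a native Postgres string array column):
def search
  search_array = params[:search].split(" ")
  results = Product.where("tags @> ARRAY[?]::varchar[]", search_array)
end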
Okay, as you are using PostgreSQL, you can use the pg_search gem.
Add a search scope in the model:
include PgSearch
pg_search_scope :search_on_text_columns,
against: %i(name),
using: { tsearch: { prefix: true } }
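With that scope in place, the search from the question becomes a single call with the raw query string (the scope name is the one defined above):
Product.search_on_text_columns(params[:search])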
For more details check out the documentation. Cheers!