How to do a simple matrix transpose in OpenRefine

I have data that I just need to perform a transpose on. It seems simple, but I can't make heads or tails of the transpose function.
The data looks like this:
name, requirement_1, requirement_2, requirement_3
label, 1.1, 1.2, 1.3
threshold, 10, 20, 30
objective, 100, 200, 300
floor, 0, .5, .5
I need:
name, label, threshold, objective, floor
requirement_1, 1.1, 10, 100, 0
requirement_2, 1.2, 20, 200, 0.5
requirement_3, 1.3, 30, 300, 0.5
In Power Query this is simply a matter of clicking the Transpose button.
Thanks

This is a bit more complicated in OpenRefine, since you have to perform two operations: "Transpose cells across columns into rows", then "Columnize by key/value columns".
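If it helps to see the target reshape expressed in code, here is a minimal pandas sketch of the same transpose (pandas rather than OpenRefine, purely for comparison with the result the two operations above produce):

import io
import pandas as pd

csv = """name,requirement_1,requirement_2,requirement_3
label,1.1,1.2,1.3
threshold,10,20,30
objective,100,200,300
floor,0,.5,.5"""

df = pd.read_csv(io.StringIO(csv))
# Use 'name' as the index, transpose, then bring the requirement names back as a column
transposed = df.set_index('name').T.reset_index().rename(columns={'index': 'name'})
print(transposed)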

Related

Recurrent neural network LSTM problem solved using 1 epoch

Looking at a solved problem where the goal is to predict stock prices, I found that only 1 epoch is used to train the model. The data consists of a little less than 1500 points, each corresponding to a daily closing price. So we have a dataset of dates (days) and prices.
Using the LSTM approach, the X_train training set is generated as:
Original dataset:
Date Price
1-1-2010 100
2-1-2010 80
3-1-2010 50
4-1-2010 40
5-1-2010 70
...
30-10-2012 130
...
X_train:
[[100, 80, 50, 40, 70, ...],
[80, 50, 40, 70, 90, ...],
[50, 40, 70, 90, 95, ...],
...
[..., 78, 85, 72, 60, 105],
[..., 85, 72, 60, 105, 130]]
Each training window is 60 days long and is shifted forward by one day each time, until a fraction of the total dataframe (the training set) has been covered. Please don't worry about things like normalization, etc.; this is just an example.
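For reference, the windowing described above could be built roughly like this (a minimal NumPy sketch with a placeholder price array and a window length of 60, not the original code):

import numpy as np

prices = np.arange(1500, dtype=float)  # placeholder for the ~1500 daily closing prices
window = 60

# Each sample is a window of 60 consecutive prices, shifted forward by one day
X_train = np.array([prices[i:i + window] for i in range(len(prices) - window)])
y_train = prices[window:]              # the price following each window is the target

print(X_train.shape)  # (1440, 60)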
The thing is that in the training part of the problem the number of epochs is set to 1. This is the first time I have seen the approach of using just one pass through the data to train the model, and I've searched for it to no avail.
Does anyone know what this technique is called (if it has a name) so I can search for more about it?

Histogram as stacked bar chart based on categories

I have data with a numeric and categorical variable. I want to show a histogram of the numeric column, where each bar is stacked by the categorical variable. I tried to do this with ax.hist(data, histtype='bar', stacked=True), but couldn't quite get it to work.
If my data is
df = pd.DataFrame({'age': np.random.normal(45, 5, 100),
                   'job': np.random.choice(['engineer', 'barista', 'quantity surveyor'], size=100)})
I've organised it like this:
df['binned_age'] = pd.qcut(df.age, 5)
df.groupby('binned_age')['job'].value_counts().plot(kind='bar')
This gives me a bar chart divided the way I want, but with the bars side by side rather than stacked, and without different colours for each category.
Is there a way to stack this plot? Or to just do a regular histogram, but stacked by category?
IIUC, you will need to reshape your dataset first. I will do that using pivot_table, with len as the aggregator, since that gives you the frequency.
Then you can use code similar to what you provided above.
# Count rows per (age bin, job) pair; 'age' is only used as the column whose values are counted
df_reshaped = df.pivot_table(index=['binned_age'], columns=['job'], values='age', aggfunc=len)
df_reshaped.plot(kind='bar', stacked=True, ylabel='Frequency', xlabel='Age binned',
                 title='Age group frequency by Job', rot=45)
This produces a stacked bar chart of the frequency in each age bin, split by job. You can use the documentation to tailor the chart to your needs.
df['age_bin'] = pd.cut(df['age'], [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70])
df.groupby('age_bin')['job'].value_counts().unstack().plot(kind='bar', stacked=True)
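For completeness, matplotlib's hist can also stack directly if you pass it one array of ages per job, which is essentially what the stacked=True call mentioned in the question expects. A minimal sketch, assuming the df from the question with its age and job columns:

import matplotlib.pyplot as plt

# Collect one array of ages per job category
groups, labels = [], []
for name, grp in df.groupby('job'):
    labels.append(name)
    groups.append(grp['age'].values)

fig, ax = plt.subplots()
ax.hist(groups, bins=10, histtype='bar', stacked=True, label=labels)
ax.set_xlabel('Age')
ax.set_ylabel('Frequency')
ax.legend()
plt.show()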

Count rows across certain columns in a dataframe if they are greater than another value and groupby another column

I have a dataframe:
df = pd.DataFrame({
    'BU': ['Total', 'Total', 'Total', 'CRS', 'CRS', 'CRS'],
    'Line_Item': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    '1Q16': [100, 120, 0, 200, 190, 210],
    '2Q16': [100, 0, 130, 200, 190, 210],
    '3Q16': [200, 250, 0, 120, 0, 190]})
I wish to count the number of rows in 1Q16, 2Q16, 3Q16, grouped by "BU", that are greater than zero. As was just explained to me, to count rows in 1Q16, 2Q16, 3Q16 I can use:
cols = ['1Q16','2Q16','3Q16']
df[cols].gt(0).sum()
In addition, I want to group them by BU.
With your shown samples, please try the following.
cols = ['1Q16','2Q16','3Q16']
df[cols].gt(0).groupby(df['BU']).sum()
Output will be as follows:
1Q16 2Q16 3Q16
BU
CRS 3.0 3.0 2.0
Total 2.0 2.0 2.0
Explanation: a detailed explanation of the above follows.
Create a cols list containing the names of the columns we want to work on.
Use the gt function to flag the values greater than 0 in those columns.
Then use groupby, passing df['BU'], to group the result by the BU column.
Then apply sum to count the values greater than 0 in each group.
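For reference, a self-contained version of the same approach (identical data and code to the above, with the import added):

import pandas as pd

df = pd.DataFrame({
    'BU': ['Total', 'Total', 'Total', 'CRS', 'CRS', 'CRS'],
    'Line_Item': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    '1Q16': [100, 120, 0, 200, 190, 210],
    '2Q16': [100, 0, 130, 200, 190, 210],
    '3Q16': [200, 250, 0, 120, 0, 190]})

cols = ['1Q16', '2Q16', '3Q16']
# Boolean mask of values > 0, grouped by BU, then summed to count the True values per group
counts = df[cols].gt(0).groupby(df['BU']).sum()
print(counts)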

How to bin a numerical pandas Series into n groups of approximately the same size without qcut?

I would like to split my series into exactly n groups (assuming there are at least n distinct values in the series), where the group sizes are approximately equal.
The code needs to be generic, so I cannot know the distribution of the data in advance, hence using pd.cut with pre-defined bins is not an option for me.
I tried using pd.qcut or pd.cut with pd.Series.quantile but they all fall short when some value is repeated very often in the series.
For instance, if I want exactly 3 groups:
series = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 4, 4, 4, 4])
pd.qcut(series, q=3, duplicates="drop")
creates only 2 categories: Categories (2, interval[float64]): [(0.999, 3.0] < (3.0, 4.0]], whereas I would like to get something like [(0.999, 1.0] < (1.0, 3.0] < (3.0, 4.0]].
Is there any way to do this easily with pandas' built-in methods?
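One workaround that is sometimes suggested for this situation (not part of the original question, just a sketch) is to qcut the rank of the values rather than the values themselves. Ranking with method='first' breaks ties by position, so qcut can always form n bins of roughly equal size; the trade-off is that tied values can end up in different groups and the labels are no longer intervals of the original values:

import pandas as pd

series = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 4, 4, 4, 4])

# qcut the ranks instead of the values; method='first' gives every element a distinct rank
groups = pd.qcut(series.rank(method='first'), q=3, labels=False)
print(series.groupby(groups).size())  # three groups of sizes 5, 4 and 5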

How to quantify the relevant value change

I have two lists of values:
accuracy: 0.6, 0.6, 0.6
hit: 1, 1, 1
Now after some processing, these two lists of values become:
accuracy: 0.59, 0.58, 0.55
hit: 0.5, 0.3, 0.1
so the drop can be calculated as:
accuracy_drop: 0.01, 0.02, 0.05
hit_drop: 0.5, 0.8, 0.9
In my case the third case, with accuracy_drop = 0.05 and hit_drop = 0.9, is the optimal one, and I want to quantify this 'optimal' case by doing some calculation on the values I have. If I calculate hit_drop/accuracy_drop, the first case gives the highest value; if I calculate hit_drop - accuracy_drop, the third case gives the highest value, but I am not sure whether this is a reasonable quantification.
I tried to search for quantification methods for this kind of value change but found nothing. Any ideas?
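To make the comparison concrete, here is a small sketch that just reproduces the arithmetic from the question for both candidate scores (it is not a recommendation of either metric):

accuracy_drop = [0.01, 0.02, 0.05]
hit_drop = [0.5, 0.8, 0.9]

# Ratio favours the first case; difference favours the third case
ratios = [h / a for h, a in zip(hit_drop, accuracy_drop)]
diffs = [h - a for h, a in zip(hit_drop, accuracy_drop)]
print(ratios)  # [50.0, 40.0, 18.0]
print(diffs)   # approximately [0.49, 0.78, 0.85]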