Reshape unbalanced data from wide to long using the reshape function - dataframe

I am currently working with longitudinal data and trying to reshape it from the wide format to the long format. The naming pattern of the time-varying variables is r*variable (for example, height data collected in wave 1 is r1height). The identifiers are hhid (household id) and pn (person id). The data are unbalanced: some variables are observed from the first wave to the last, while others are only observed from the middle of the study (e.g., wave 3 to wave 5).
I have already reshaped the data using merged.stack from the splitstackshape package (see code below).
df <- data.frame(hhid = c("10001", "10002", "10003", "10004"),
                 pn = c("001", "001", "001", "002"),
                 r1weight = c(56, 76, 87, 64),
                 r2weight = c(57, 75, 88, 66),
                 r3weight = c(56, 76, 87, 65),
                 r4weight = c(78, 99, 23, 32),
                 r5weight = c(55, 77, 84, 65),
                 r1height = c(151, 163, 173, 153),
                 r2height = c(154, 164, NA, 154),
                 r3height = c(NA, 165, NA, 152),
                 r4height = c(153, 162, 172, 154),
                 r5height = c(152, 161, 171, 154),
                 r3bmi = c(22, 23, 24, 25),
                 r4bmi = c(23, 24, 20, 19),
                 r5bmi = c(21, 14, 22, 19))
library(splitstackshape)

# merged.stack (this is what I want)
long1 <- merged.stack(df, id.vars = c("hhid", "pn"),
                      var.stubs = c("weight", "height", "bmi"),
                      sep = "var.stubs", atStart = FALSE, keep.all = FALSE)
Now I want to know whether I can use the "reshape" function to get the same results. I have tried, but failed: the reshape call shown below returns bizarre longitudinal data. I suspect the "sep" argument is causing the problem, but I don't know how to specify a pattern that matches my time-varying variables.
# Reshape (wrong results)
library(reshape)  # note: reshape() below is actually base R's stats::reshape
namelist <- names(df)
namelist <- namelist[namelist %in% c("hhid", "pn") == FALSE]
long2 <- reshape(data = df,
                 varying = namelist,
                 sep = "",
                 direction = "long",
                 idvar = c("hhid", "pn"))
Could anyone let me know how to address this problem?
Thanks
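One way to reproduce this with base R's reshape() (a minimal sketch, not verified against the full data): pad the waves where bmi is missing so every stub is balanced, move the wave number to the end of each name, and let reshape() split on a separator it understands.

# Sketch: balance the bmi waves, rename r1weight-style columns to
# weight.1-style, then reshape on the "." separator.
df2 <- df
df2[c("r1bmi", "r2bmi")] <- NA                              # pad missing waves
names(df2) <- sub("^r(\\d+)(.*)$", "\\2.\\1", names(df2))   # r1weight -> weight.1
long2 <- reshape(df2,
                 varying = setdiff(names(df2), c("hhid", "pn")),
                 sep = ".",
                 direction = "long",
                 idvar = c("hhid", "pn"))

Up to column names and row order, this should match the merged.stack() result, with NA values for the padded bmi waves.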

Related

Formatting 2 columns into list of tuples (for NER)

I'm looking to format data held in a df, so that it can be used in an NER model. I'm starting with the data in 2 columns, example below:
df['text']         df['annotation']
some text          [('Consequence', 23, 47)]
some other text    [('Consequence', 33, 46), ('Cause', 101, 150)]
And I need to format it to:
TRAIN_DATA = [("some text", {'entities': [(23, 47, 'Consequence')]}),
              ("some other text", {'entities': [(33, 46, 'Consequence'), (101, 150, 'Cause')]})]
I've been attempting to iterate over each row, for example trying:
TRAIN_DATA = []
for row in df['annotation']:
    entities = []
    label, start, end = entity
    entities.append((start, end, label))
    # add to dataset
    TRAIN_DATA.append((df['text'], {'entities': entities}))
However, I can't get it to iterate over each row to populate TRAIN_DATA. Sometimes there are multiple entities within the annotation column.
Grateful if anyone can highlight where I'm going wrong and how to correct it!
You can use the zip() function:
TRAIN_DATA = [
    (t, {"entities": [(s, e, l) for (l, s, e) in a]})
    for t, a in zip(df["text"], df["annotation"])
]
print(TRAIN_DATA)
Prints:
[
    ("some text", {"entities": [(23, 47, "Consequence")]}),
    (
        "some other text",
        {"entities": [(33, 46, "Consequence"), (101, 150, "Cause")]},
    ),
]
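For reference, the loop in the question fails because entity is never bound: the inner loop over the entities of each row is missing, and df['text'] (the whole column) is appended instead of the matching row's text. A corrected sketch of the same row-wise approach:

TRAIN_DATA = []
for text, annotation in zip(df['text'], df['annotation']):
    entities = []
    for entity in annotation:            # the missing inner loop
        label, start, end = entity
        entities.append((start, end, label))
    # add the row's text together with its reshaped entities
    TRAIN_DATA.append((text, {'entities': entities}))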

Stratified dataframe in R: managing empty values

I would like to perform stratified sampling on two columns of a dataframe. I am working with very small probabilities, and I run into a problem at the end. Here is my methodology.
library(splitstackshape)

# Create a dataframe similar to the one I'm working on.
data1 <- data.frame(
  categorie_metier = sample(c("agriculteur", "artisan", "autre", "cadres",
                              "employes", "ouvriers", "prof_int"),
                            429, replace = TRUE,
                            prob = c(0.01, 0.05, 0.14, 0.41, 0.25, 0.04, 0.10)),
  en_teletravail = sample(c("0", "1"), 429, replace = TRUE,
                          prob = c(0.59, 0.41)),
  stringsAsFactors = TRUE)

# Create a dataframe to simulate my probabilities.
data2 <- data.frame(
  categorie_metier = sample(c("agriculteur", "artisan", "autre", "cadres",
                              "employes", "ouvriers", "prof_int"),
                            1000000, replace = TRUE,
                            prob = c(0.01, 0.03, 0.27, 0.21, 0.13, 0.10, 0.25)),
  en_teletravail = sample(c("0", "1"), 1000000, replace = TRUE,
                          prob = c(0.991, 0.009)),
  stringsAsFactors = TRUE)
# Group the two columns.
data2$groupe <- paste(data2$categorie_metier, data2$en_teletravail)

# Derive the target group sizes. Objective: create an output dataframe of 50 rows.
gsize <- 50 * round(prop.table(table(data2$groupe)), 2)
gsize <- as.list(gsize)

# Generate the output dataframe.
data3 <- stratified(data1, c("categorie_metier", "en_teletravail"), gsize)
Error in stratified(data1, c("categorie_metier", "en_teletravail"), gsize) :
Incompatible sizes supplied
According to my research, this error is due to values of 0 in gsize. This is unavoidable, because I am working with very small probabilities.
How can I handle these 0 values, given that I cannot enlarge data3?
Thank you.
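One possible workaround (a sketch, untested against stratified()'s exact group-name matching) is to drop the zero-count strata before sampling, since a target size of 0 cannot be drawn:

# Sketch: keep only the strata with a non-zero target size.
gsize <- 50 * round(prop.table(table(data2$groupe)), 2)
gsize <- gsize[gsize > 0]                      # drop the empty strata
gsize <- as.list(gsize)
data3 <- stratified(data1, c("categorie_metier", "en_teletravail"), gsize)

If data1 still contains combinations that no longer appear in gsize, you may also need to filter those rows out of data1 first.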

Pandas HDFStore: append fails when min_itemsize is set to the maximum of the string column

I'm detecting the maximum lengths of all string columns of multiple dataframes, then attempting to build an HDFStore:
import pandas as pd

# Detect the max string length for each column across all DataFrames
max_lens = {}
for df_path in paths:
    df = pd.read_pickle(df_path)
    for col in df.columns:
        ser = df[col]
        if ser.dtype == 'object' and isinstance(
            ser.loc[ser.first_valid_index()], str
        ):
            max_lens[col] = max(
                ser.dropna().map(len).max(), max_lens.setdefault(col, 0)
            )
print('Setting min itemsizes:', max_lens)

hdf_path.unlink()  # delete the file for a clean retry
store = pd.HDFStore(hdf_path, complevel=9)
for df_path in paths:
    df = pd.read_pickle(df_path)
    store.append(hdf_key, df, min_itemsize=max_lens, data_columns=True)
store.close()
The detected maximum string lengths are as follows:
max_lens = {'hashtags': 139,
            'id': 19,
            'source': 157,
            'text': 233,
            'urls': 2352,
            'user_mentions_user_ids': 199,
            'in_reply_to_screen_name': 17,
            'in_reply_to_status_id': 19,
            'in_reply_to_user_id': 19,
            'media': 286,
            'place': 56,
            'quoted_status_id': 19,
            'user_id': 19}
Yet I'm still getting this error:
ValueError: Trying to store a string with len [220] in [hashtags] column but
this column has a limit of [194]!
Consider using min_itemsize to preset the sizes on these columns
Which is weird, because the detected maximum length of hashtags is 139.
HDF stores strings in UTF-8, so you need to encode the strings as UTF-8 before measuring the maximum length:
a_pandas_string_series.str.encode('utf-8').str.len().max()
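Applied to the detection loop above, that means measuring UTF-8 byte lengths rather than character counts (a sketch reusing the question's variables):

max_lens[col] = max(
    ser.dropna().str.encode('utf-8').str.len().max(),
    max_lens.setdefault(col, 0),
)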

Group a numpy array into multiple sub-arrays using an array of values

I have an array of points along a line:
import numpy as np

a = np.array([18, 56, 32, 75, 55, 55])
I have another array, b, holding the group IDs I want to use to access the information in a (the two arrays always have equal lengths). Neither array a nor array b is sorted.
b = np.array([0, 2, 3, 2, 2, 2])
I want to group a into multiple sub-arrays such that the following would be possible:
c[0] -> array([18])
c[2] -> array([56, 75, 55, 55])
c[3] -> array([32])
Although the above example is simple, I will be dealing with millions of points, so efficient methods are preferred. It is also essential that any sub-array of points can be accessed in this fashion later in the program by automated methods.
Here's one approach -
def groupby(a, b):
    # Get argsort indices, to be used to sort a and b in the next steps
    sidx = b.argsort(kind='mergesort')
    a_sorted = a[sidx]
    b_sorted = b[sidx]
    # Get the group limit indices (start, stop of groups)
    cut_idx = np.flatnonzero(np.r_[True, b_sorted[1:] != b_sorted[:-1], True])
    # Split the input array at those start/stop indices
    out = [a_sorted[i:j] for i, j in zip(cut_idx[:-1], cut_idx[1:])]
    return out
A simpler, but less efficient, approach would be to use np.split to replace the last few lines and get the output, like so -
out = np.split(a_sorted, np.flatnonzero(b_sorted[1:] != b_sorted[:-1]) + 1)
Sample run -
In [38]: a
Out[38]: array([18, 56, 32, 75, 55, 55])
In [39]: b
Out[39]: array([0, 2, 3, 2, 2, 2])
In [40]: groupby(a, b)
Out[40]: [array([18]), array([56, 75, 55, 55]), array([32])]
To get sub-arrays covering the entire range of IDs in b -
def groupby_perID(a, b):
    # Get argsort indices, to be used to sort a and b in the next steps
    sidx = b.argsort(kind='mergesort')
    a_sorted = a[sidx]
    b_sorted = b[sidx]
    # Get the group limit indices (start, stop of groups)
    cut_idx = np.flatnonzero(np.r_[True, b_sorted[1:] != b_sorted[:-1], True])
    # Create cut indices for all unique IDs in b
    n = b_sorted[-1] + 2
    cut_idxe = np.full(n, cut_idx[-1], dtype=int)
    insert_idx = b_sorted[cut_idx[:-1]]
    cut_idxe[insert_idx] = cut_idx[:-1]
    cut_idxe = np.minimum.accumulate(cut_idxe[::-1])[::-1]
    # Split the input array at those start/stop indices
    out = [a_sorted[i:j] for i, j in zip(cut_idxe[:-1], cut_idxe[1:])]
    return out
Sample run -
In [241]: a
Out[241]: array([18, 56, 32, 75, 55, 55])
In [242]: b
Out[242]: array([0, 2, 3, 2, 2, 2])
In [243]: groupby_perID(a, b)
Out[243]: [array([18]), array([], dtype=int64),
array([56, 75, 55, 55]), array([32])]
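If you would rather look groups up by the ID value itself (c[2] and so on) without padding empty arrays, a dict keyed by the IDs is one alternative (a sketch along the same lines as the answer above):

import numpy as np

def groupby_dict(a, b):
    # Stable argsort by group ID, as in groupby() above
    sidx = b.argsort(kind='mergesort')
    a_sorted, b_sorted = a[sidx], b[sidx]
    # Start/stop boundaries where the sorted IDs change
    cut_idx = np.flatnonzero(np.r_[True, b_sorted[1:] != b_sorted[:-1], True])
    return {b_sorted[i]: a_sorted[i:j]
            for i, j in zip(cut_idx[:-1], cut_idx[1:])}

c = groupby_dict(a, b)
c[2]   # -> array([56, 75, 55, 55])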

TensorFlow: How to get Intermediate value of a variable in tf.while_loop()?

I need to fetch the intermediate value of a tensor in tf.while_loop(); however, it only gives me the last returned value.
For example, I have a variable x with 3 pages, whose shape is 3×2×4. I want to fetch each page in turn and calculate the total sum, the page sum, and the mean, max, and min values of each page. I define the condition and body functions and use tf.while_loop() to calculate the needed results. The source code is below.
import tensorflow as tf

x = tf.constant([[[41, 8, 48, 82],
                  [9, 56, 67, 23]],
                 [[95, 89, 44, 54],
                  [11, 33, 29, 1]],
                 [[34, 9, 5, 70],
                  [14, 35, 18, 17]]], dtype=tf.int32)

def cond(out, count, x):
    return count < 3

def body(out, count, x):
    outTemp = tf.slice(x, [count, 0, 0], [1, -1, -1])
    count += 1
    outPack = tf.unpack(out)
    outPack[0] += tf.reduce_sum(outTemp)
    outPack[1] = tf.reduce_sum(outTemp)
    outPack[2] = tf.reduce_mean(outTemp)
    outPack[3] = tf.reduce_max(outTemp)
    outPack[4] = tf.reduce_min(outTemp)
    out = tf.pack(outPack)
    return out, count, x

out = tf.Variable(tf.constant([0, 0, 0, 0, 0]))  # total sum, page sum, mean, max, min
count = tf.Variable(tf.constant(0))
result = tf.while_loop(cond, body, [out, count, x])
init = tf.initialize_all_variables()

with tf.Session() as sess:
    sess.run(init)
    print(sess.run(x))
    print(sess.run(result)[0])
When I run the program, it only gives me the value returned by the final iteration, so I can only get the results of the last page.
So the question is: how can I get the results of each page, i.e., the intermediate values from tf.while_loop()?
Thank you.
To get the "intermediate value" of any variable, you can make use of the tf.Print op, which is an identity operation with the side effect of printing a given message when the variable is evaluated.
As an example,
x = tf.Print(x, [x], "Value of x is: ")
can be placed on any line where you want the value to be reported.
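In the question's loop, one natural place for it is the end of body(), so the per-page statistics are printed on every iteration (a sketch against the TF 0.x API used in the question):

def body(out, count, x):
    outTemp = tf.slice(x, [count, 0, 0], [1, -1, -1])
    count += 1
    outPack = tf.unpack(out)
    outPack[0] += tf.reduce_sum(outTemp)
    outPack[1] = tf.reduce_sum(outTemp)
    outPack[2] = tf.reduce_mean(outTemp)
    outPack[3] = tf.reduce_max(outTemp)
    outPack[4] = tf.reduce_min(outTemp)
    out = tf.pack(outPack)
    # Identity op that prints the stats as a side effect on each iteration
    out = tf.Print(out, [out], "Per-page stats: ")
    return out, count, x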