Error in dimnames(x) <- dn : length of 'dimnames' [2] not equal to array extent when using "sqlSave"

I'm trying to use the sqlSave command (from the RODBC package) to import an R data frame into a SQL database. Below is my code:
> head(final_series)
         Price       Time FactorID CountryID    id
1 5.363334e+01 1980-01-01        1         1     1
2 5.143333e+01 1980-04-01        1         1 16384
3 5.060000e+01 1980-07-01        1         1 32767
4 5.250000e+01 1980-10-01        1         1 49150
5 5.266667e+01 1981-01-01        1         1 65533
6 5.280000e+01 1981-04-01        1         1 81916
> sqlSave(dbhandle, final_series, tablename = "db_time_price", varTypes = c(id="uniqueidentifier", FactorID= "float", CountryID="float", Time="date", Price="float"), append=TRUE, verbose = T, fast = F)
But I got the following error:
Error in dimnames(x) <- dn :
length of 'dimnames' [2] not equal to array extent
Does anyone know why? Thanks!

Did you check whether the table already exists? If the table already exists but with different dimensions (for example, a different number of columns), you would see this error. You can check from R, as sketched below.
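A minimal sketch with RODBC, assuming dbhandle is an open RODBC connection as in the question:
library(RODBC)
# List the columns the database currently has for the target table,
# and compare them against names(final_series)
sqlColumns(dbhandle, "db_time_price")
# If they don't match and the old data is disposable, drop the table
# and let sqlSave recreate it with the requested varTypes
sqlDrop(dbhandle, "db_time_price")
sqlSave(dbhandle, final_series, tablename = "db_time_price",
        varTypes = c(id = "uniqueidentifier", FactorID = "float",
                     CountryID = "float", Time = "date", Price = "float"),
        verbose = TRUE, fast = FALSE)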

Related

How to compute the sum of some variables in Stata?

I have 48 variables in my dataset: the first 12 concern year 2000, the second 12 year 2001, the third 12 year 2002, and the fourth 12 year 2003.
Each variable contains values laid out like this:
ID   var1   var2   var3   ...   var12   ...   var48
xx      0      0      1   ...       1   ...       0
yy      1      0      0   ...       9   ...       0
zz      3      2      1   ...       0   ...       0
Now, I want to collect the sum of the values of the first 12 variables in another variable called, say, tot_2000, which should contain just one number (in this example, 18).
Then I must repeat this step for the 3 remaining years, giving 4 variables (tot_2000, tot_2001, tot_2002, tot_2003) to be plotted in a histogram.
What I'm looking for is a variable like this:
tot_2000
18
ORIGINAL QUESTION, addressed by @TheIceBear and myself.
I have a dataset that contains, say, 12 variables with values 0, 1, 2, ..., like this:
ID   var1   var2   var3   ...   var12
xx      0      0      1   ...       1
yy      1      0      0   ...       9
zz      3      2      1   ...       0
and I want to create a variable that is just the sum of all the values (18 in this case), like:
tot_var
18
What is the command?
FIRST ANSWER FROM ME
Here is another way to do it, as indicated in a comment on the first answer by @TheIceBear.
* Example generated by -dataex-. For more info, type help dataex
clear
input str2 ID byte(var1 var2 var3 var4)
"xx" 0 0 1 1
"yy" 1 0 0 9
"zz" 3 2 1 0
end
mata : total = sum(st_data(., "var1 var2 var3 var4"))
mata : st_numscalar("total", total)
di scalar(total)
18
The two Mata commands could be telescoped.
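For example, a telescoped one-liner (same data as above):
mata : st_numscalar("total", sum(st_data(., "var1 var2 var3 var4")))
di scalar(total)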
SECOND ANSWER
A quite different question is emerging slowly from comments and edits. The question is still unfocused, but here is an attempt to sharpen it up.
You have monthly data for various identifiers. You want to see bar charts (not histograms) with annual totals.
The data structure or layout you have is a poor fit for handling such data in Stata. You have a so-called wide layout but a long layout is greatly preferable. Then your totals can be put in a variable for graphing.
* fake dataset
clear
set obs 3
gen id = word("xx yy zz", _n)
forval j = 1/48 {
    gen var`j' = _n * `j'
}
* you start here
reshape long var, i(id) j(time)
gen mdate = ym(1999, 12) + time
format mdate %tm
gen year = year(dofm(mdate))
* not clear that you want this, but it could be useful
egen total = total(var), by(id year)
twoway bar total year, by(id) xla(2000/2003) name(G1, replace)
* this seems to be what you are asking for
egen TOTAL = total(var), by(year)
twoway bar TOTAL year, base(0) xla(2000/2003) name(G2, replace)
Here is a solution in two steps:
* Example generated by -dataex-. For more info, type help dataex
clear
input str2 ID byte(var1 var2 var3 var4)
"xx" 0 0 1 1
"yy" 1 0 0 9
"zz" 3 2 1 0
end
egen row_sum = rowtotal(var*)  // Sum each row into a variable
egen tot_var = sum(row_sum)    // Sum the row_sum variable
* Get the value of the first observation and store in a local macro
local total = tot_var[1]
display `total'
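As a hedged alternative sketch, the grand total can also be read from -summarize-'s saved results, without creating tot_var:
quietly summarize row_sum, meanonly
display r(sum)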

How to get same grouping result using data.table comparing to the sqldf?

I am trying to implement a SQL query using both sqldf and data.table.
I need to do this separately with these two libraries.
Unfortunately, I cannot reproduce the same result using data.table.
library(sqldf)
library(data.table)
Id <- c(1,2,3,4)
HasPet <- c(0,0,1,1)
Age <- c(20,1,14,10)
Posts <- data.table(Id, HasPet, Age)
# sqldf way
ref <- sqldf("
SELECT Id, HasPet, MAX(Age) AS MaxAge
FROM Posts
GROUP BY HasPet
")
# data.table way
res <- Posts[,
             list(Id, HasPet, MaxAge = max(Age)),
             by = list(HasPet)]
head(ref)
head(res)
Output for sqldf is:
> head(ref)
Id HasPet MaxAge
1 1 0 20
2 3 1 14
while the output for data.table is different:
> head(res)
HasPet Id HasPet MaxAge
1: 0 1 0 20
2: 0 2 0 20
3: 1 3 1 14
4: 1 4 1 14
Please note that the SQL query cannot be modified.
This comes up a lot with data.table. If you want the max or min by group, the best way is a self-join. It's fast, and only a little arcane.
You can build it up step by step:
In data.table, you can select in i, do in j, and group afterwards. So first step is to find the thing we want within each level of the group
Posts[, Age == max(Age), by = HasPet]
# HasPet V1
# 1: 0 TRUE
# 2: 0 FALSE
# 3: 1 TRUE
# 4: 1 FALSE
We can use .I to retrieve the row index of each row in the original table; the logical vector that was previously V1 now subsets those indices within each group, so we keep only the row containing the max per group.
Posts[, .I[Age == max(Age)], by=HasPet]
# From the data.table special symbols help:
# .I is an integer vector equal to seq_len(nrow(x)). While grouping,
# it holds for each item in the group, its row location in x. This is useful
# to subset in j; e.g. DT[, .I[which.max(somecol)], by=grp].
# HasPet V1
# 1: 0 1
# 2: 1 3
We then use the column V1 that we just made to select those specific rows (1 and 3) from the data.table. That's it!
Posts[Posts[, .I[Age == max(Age)], by=HasPet]$V1]
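With the question's data this returns rows 1 and 3. To mirror the sqldf column names exactly, a small chained rename (a sketch) works:
Posts[Posts[, .I[Age == max(Age)], by = HasPet]$V1][, .(Id, HasPet, MaxAge = Age)]
#    Id HasPet MaxAge
# 1:  1      0     20
# 2:  3      1     14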
You can use .SD to get subset of rows for each value of HasPet.
library(data.table)
Posts[, .SD[Age==max(Age)], HasPet]
# HasPet Id Age
#1: 0 1 20
#2: 1 3 14

Can I use pandas to create a biased sample?

My code uses a column called booking status that is 1 for yes and 0 for no (there are multiple other columns that information will be pulled from, depending on the booking status). There are many more no's than yes's, so I would like to take a sample containing all the yes rows and the same number of no rows.
When I use
samp = rslt_df.sample(n=298, random_state=1, weights='bookingstatus')
I get the error:
ValueError: Fewer non-zero entries in p than size
Is there a way to take the sample this way?
If our entire dataset looks like this:
print(df)
   c1  c2
0   1   1
1   0   2
2   0   3
3   0   4
4   0   5
5   0   6
6   0   7
7   1   8
8   0   9
9   0  10
We may decide to sample from it using the DataFrame.sample function. By default, this function samples without replacement, so you'll receive an error if you specify a number of observations larger than the number of observations in your initial dataset:
df.sample(20)
ValueError: Cannot take a larger sample than population when 'replace=False'
In your situation, the ValueError comes from the weights parameter:
df.sample(3,weights='c1')
ValueError: Fewer non-zero entries in p than size
To paraphrase the DataFrame.sample docs, using the c1 column as our weights parameter implies that rows with a larger value in the c1 column are more likely to be sampled. Specifically, the sample function will not pick values from this column that are zero. We can fix this error using either one of the following methods.
Method 1: Set the replace parameter to be true:
m1 = df.sample(3,weights='c1', replace=True)
print(m1)
c1 c2
0 1 1
7 1 8
0 1 1
Method 2: Make sure the n parameter is equal to or less than the number of 1s in the c1 column:
m2 = df.sample(2,weights='c1')
print(m2)
c1 c2
7 1 8
0 1 1
If you decide to use this method, you won't really be sampling. You're really just filtering out any rows where the value of c1 is 0.
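If the goal is the balanced sample described in the question (all the 1s plus an equally sized random draw of 0s), a sketch without weights, reusing rslt_df and the bookingstatus column from the question (assuming pandas is imported as pd):
yes = rslt_df[rslt_df['bookingstatus'] == 1]
no = rslt_df[rslt_df['bookingstatus'] == 0].sample(n=len(yes), random_state=1)
samp = pd.concat([yes, no])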
I was able to do this in the end; here is how I did it:
import pandas as pd

bookingstatus_count = df.bookingstatus.value_counts()
print('Class 0:', bookingstatus_count[0])
print('Class 1:', bookingstatus_count[1])
print('Proportion:', round(bookingstatus_count[0] / bookingstatus_count[1], 2), ': 1')
# Class count
count_class_0, count_class_1 = df.bookingstatus.value_counts()
# Divide by class
df_class_0 = df[df['bookingstatus'] == 0]
df_class_1 = df[df['bookingstatus'] == 1]
# Undersample the majority class to the size of the minority class
df_class_0_under = df_class_0.sample(count_class_1)
df_test_under = pd.concat([df_class_0_under, df_class_1], axis=0)
Based on https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
Thanks everyone

Add 'document_id' column to pandas dataframe of word-id's and wordcounts

I have the following dataset:
import pandas as pd
jsonDF = pd.DataFrame({'DOCUMENT_ID':[263403828328665088,264142543883739136], 'MESSAGE':['#Zuora wants to help #Network4Good with Hurric...','#ztrip please help spread the good word on hel...']})
          DOCUMENT_ID                                             MESSAGE
0  263403828328665088  #Zuora wants to help #Network4Good with Hurric...
1  264142543883739136  #ztrip please help spread the good word on hel...
I am trying to reshape my data into the form:
docID wordID count
0 1 118 1
1 1 285 1
2 1 1229 1
3 1 1688 1
4 1 2068 1
I used the following:
r = []
for i in jsonDF['MESSAGE']:
    for j in sortedValues(wordsplit(i)):
        r.append(j)
IDCount_Re = pd.DataFrame(r)
IDCount_Re[:5]
which gives me the following result:
0 17
1 help 2
2 wants 1
3 hurricane 1
4 relief 1
5 text 1
6 sandy 1
7 donate 1
8 6
9 please 1
I can get word counts, but I have no idea how to append DOCUMENT_ID to the above dataframe.
The following functions were used to split the words:
from collections import Counter, OrderedDict
import itertools
import re

from nltk.corpus import stopwords

def wordsplit(wordlist):
    j = wordlist
    j = re.sub(r'\d+', '', j)
    j = re.sub('RT', '', j)
    j = re.sub('http', '', j)
    j = re.sub(r"(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", j)
    j = j.lower()
    j = j.strip()
    if j not in stopwords.words('english'):
        yield j

def wordSplitCount(wordlist):
    '''Merges a list into a string, splits it, removes stop words,
    and then counts the occurrences, returning an ordered dictionary.'''
    string1 = ''.join(list(itertools.chain(filter(None, wordlist))))
    cnt = Counter()
    for i in string1.split(" "):
        i = re.sub(r'&', ' ', i.lower())
        if i not in stopwords.words('english'):
            cnt[i] += 1
    return OrderedDict(cnt)

def sortedValues(wordlist):
    '''Creates a list of (word, count) pairs with counts descending.'''
    d = wordSplitCount(wordlist)
    return sorted(d.items(), key=lambda t: t[1], reverse=True)
UPDATE: SOLUTION HERE:
string split and assign unique ids to Pandas DataFrame
'DOCUMENT_ID' is one of the two fields in each row of jsonDF. Your current code doesn't access it because it works directly on jsonDF['MESSAGE'].
Here is some non-working pseudocode - something like:
for _, row in jsonDF.iterrows():
    doc_id, msg = row
    words = [word for word in wordsplit(msg)][0].split()  # hack
    wordcounts = Counter(words).most_common()  # sort by decreasing frequency
Then do a pd.concat(pd.DataFrame({'DOCUMENT_ID': doc_id, ...
and get the 'wordId' and 'count' fields from wordcounts.
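A fuller sketch along those lines. Hedged: it reuses wordsplit from the question, keeps the same [0].split() hack, and emits the words themselves; turning words into numeric wordIDs would need a separate vocabulary mapping (e.g. pd.factorize on the final 'word' column):
from collections import Counter
import pandas as pd

frames = []
for _, row in jsonDF.iterrows():
    doc_id, msg = row['DOCUMENT_ID'], row['MESSAGE']
    words = [word for word in wordsplit(msg)][0].split()   # same hack as above
    wordcounts = Counter(words).most_common()              # sorted by decreasing count
    frames.append(pd.DataFrame({'DOCUMENT_ID': doc_id,
                                'word': [w for w, _ in wordcounts],
                                'count': [c for _, c in wordcounts]}))
result = pd.concat(frames, ignore_index=True)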

Replicate a SQL Join + Group By in data.table

I am trying to replicate a SQL sparse matrix multiplication using data.table. The SQL expression would be:
SELECT a.i, b.j, SUM(a.value*b.value)
FROM a, b
WHERE a.j = b.i
GROUP BY a.i, b.j;
where my data is structured as |i|j|value| in each table.
To create this data in R you can use:
library(reshape2)
library(data.table)
A <- matrix(runif(25),5,5)
B <- matrix(runif(25),5,5)
ADT <- data.table(melt(A))
BDT <- data.table(melt(B))
setnames(ADT,old = c("Var1","Var2","value"), new = c("Ai","Aj","AVal"))
setnames(BDT,old = c("Var1","Var2","value"), new = c("Bi","Bj","BVal"))
To merge using the X[Y] join syntax, we need to set the keys that we join on:
setkey(ADT,"Aj")
setkey(BDT,"Bi")
To build it up piece by piece:
ADT[BDT, allow.cartesian = T]
Ai Aj AVal Bj BVal
1: 1 1 0.39230905 1 0.7083956
2: 2 1 0.89523490 1 0.7083956
3: 3 1 0.92464689 1 0.7083956
4: 4 1 0.15127499 1 0.7083956
5: 5 1 0.88838458 1 0.7083956
---
121: 1 5 0.70144360 5 0.7924433
122: 2 5 0.50409075 5 0.7924433
123: 3 5 0.15693879 5 0.7924433
124: 4 5 0.09164371 5 0.7924433
125: 5 5 0.63787487 5 0.7924433
So far so good. The merge worked properly: Bi has disappeared, but it is encoded by Aj anyway. We now want to multiply AVal by BVal, and then sum within the created groups (in the same expression; I know that I could store the merge and apply a second expression here). I had thought this would be:
ADT[BDT, j = list(Ai, Bj, Value = sum(AVal*BVal)), by = c("Ai","Bj") , allow.cartesian = T]
but I get the error: object Bj not found. In fact, none of the values from BDT are usable once I insert the by = clause (try systematically removing Bj, BVal and "Bj" from the expression above, left to right, and you will see what I mean).
Looking into by = .EACHI, it seems intended for what I want, but .EACHI groups on the merged index, not on a separate variable.
Sounds like you simply want to aggregate after the merge:
ADT[BDT, allow.cartesian = T][, sum(AVal * BVal), by = .(Ai, Bj)]
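A named-column version plus a quick sanity check against the dense product (a sketch; assumes A and B as generated above):
res <- ADT[BDT, allow.cartesian = TRUE][, .(Value = sum(AVal * BVal)), by = .(Ai, Bj)]
# res ordered with Bj varying fastest within Ai is the row-major layout of A %*% B
all.equal(res[order(Ai, Bj)]$Value, as.vector(t(A %*% B)))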