"No data available" in Zeppelin charts - apache-spark-sql

I'm having problems creating visualizations with Zeppelin. I've got a dataset with about 600 million records. It's stored in an HDFS cluster, and I'm able to load it as a Spark DataFrame:
%spark.pyspark
input_hdfs_path = u'hdfs://cluster-master:9000/data/CDR_*.parquet'
df = spark.read.format('parquet').load(input_hdfs_path)
df.registerTempTable("df")
I'm interested in creating a histogram of the CDR length (field CDR_LENGTH):
%sql
select ROUND(CDR_LENGTH, -2) as duration, count(*) as count
from df
group by 1
order by 1
I do get the appropriate results in the Table tab (with two columns, duration and count), but when I switch to the bar chart tab (or any other chart tab), it simply says "No data available". Can you tell what I'm doing wrong? Thanks

You can find the settings link to the right of the chart buttons. Open it and assign the fields to Keys, Groups, and Values as you like (for this query, duration as the Key and count as the Value).
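If the chart still shows nothing after setting Keys and Values, one workaround worth trying (a sketch of an alternative, not part of the answer above) is to run the same aggregation in PySpark and pass the result to Zeppelin's built-in display with z.show(), which drives the same chart tabs:
%spark.pyspark
from pyspark.sql import functions as F
# same aggregation as the %sql paragraph: bucket CDR_LENGTH to the nearest 100
hist = (df.groupBy(F.round('CDR_LENGTH', -2).alias('duration'))
          .count()
          .orderBy('duration'))
# z.show() renders a DataFrame with Zeppelin's table/chart tabs
z.show(hist)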

Related

In Python, is there a way to divide an element in a list by the next element in a list using a for loop or list comprehension?

I have a list of metrics that each have values for multiple time periods. I would like to write a script that takes a value of a metric for a particular time period and divides it by the previous year.
Currently my code looks like this:
for metric in metric:
    iya_df[metric+' '+period[0][-4:]+' IYA'] = pivot[metric][period[0]]/pivot[metric][period[1]]*100
    iya_df[metric+' '+period[1][-4:]+' IYA'] = pivot[metric][period[1]]/pivot[metric][period[2]]*100
    iya_df[metric+' '+period[2][-4:]+' IYA'] = pivot[metric][period[2]]/pivot[metric][period[3]]*100
    iya_df[metric+' '+period[3][-4:]+' IYA'] = pivot[metric][period[3]]/pivot[metric][period[4]]*100
I have a list of metrics and a list of periods. (The slice after period is just to grab the 4-digit year.)
The source table is a pivot table with multiple indices.
I would like to change the code so that I don't have to change it if my list of time periods changes in length.
There's probably a more efficient way to do this with a list comprehension than with loops, but I'm still getting stronger in Python.
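The pattern generalizes by pairing each period with the one that follows it. A minimal sketch, assuming metric is the list of metric names, period is ordered from most recent to oldest, and pivot and iya_df are the DataFrames from the question:
# pair each period with the following (previous-year) period and
# compute the year-ago index for every metric
for m in metric:
    for cur, prev in zip(period, period[1:]):
        iya_df[m + ' ' + cur[-4:] + ' IYA'] = pivot[m][cur] / pivot[m][prev] * 100
This way the loop adapts automatically to however many periods are in the list.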

Python selenium get table values into List of Lists

I'm just trying to get the data from this table:
https://www.listcorp.com/asx/sectors/materials
and put all the values (the TEXT) into a list of lists.
I've tried so many different methods: XPath, find by class name, By.tagName.
------------
rws = driver.find_elements_by_xpath("//table/tbody/tr/td")
---------------
table = driver.find_element_by_class_name("v-datatable v-table theme--light")
--------------
findElements(By.tagName("table"))
--------------
# to identify the table rows
l = driver.find_elements_by_xpath("//*[@class='v-datatable.v-table.theme--light']/tbody/tr")
# to get the row count, use len()
print(len(l))
# THIS RETURNS '1', which can't be right because there are hundreds of rows
And nothing seems to work to get the values in an easy-to-understand manner.
(EDIT: SOLVED)
Before using the solution below, first do time.sleep(10); this allows the page to load so that the table can actually be retrieved. Then just append all the cells to a new list. You will need multiple lists to fit all the rows.
So basically you can use find_elements_by_tag_name, with code like this:
row = driver.find_elements_by_tag_name("tr")
data = driver.find_elements_by_tag_name("td")
print('Rows --> {}'.format(len(row)))
print('Data --> {}'.format(len(data)))
for value in row:
    print(value.text)
Add a proper wait so the data has time to populate.
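To make the wait explicit and actually build the list of lists the question asks for, a minimal sketch along these lines should work (the URL comes from the question; the Chrome driver, the locators, and the 10-second timeout are assumptions, using Selenium's standard WebDriverWait):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.listcorp.com/asx/sectors/materials")

# wait until the table rows are present instead of sleeping blindly
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, "//table/tbody/tr"))
)

# one inner list per table row, holding the text of each cell
table_data = []
for row in driver.find_elements_by_xpath("//table/tbody/tr"):
    cells = row.find_elements_by_tag_name("td")
    table_data.append([cell.text for cell in cells])

print(table_data[:3])  # first few rows as lists of cell text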

How to select columns based on the values they contain in pandas

I am working in pandas with a dataset that describes the population of a certain country per year. The dataset is constructed in a weird way: the years aren't the columns themselves, but rather values within the first row of the set. The dataset describes every year from 1960 up until now, but I only need 1970, 1980, 1990, etc. For this purpose I've created a list with all those years and tried to make a new dataset which is equivalent to the old one but only has the columns that contain a value from said list, so I don't have all this extra info I'm not using. Online I can only find instructions for removing rows or selecting by column name; since neither criterion applies in this situation, I thought I should ask here.
The dataset is a csv file which I've downloaded off some world population site. Here is a link to a screenshot of the data.
As you can see, some of the years are given in scientific notation, which is also how I've added them to my list.
import pandas as pd

pop = pd.read_csv('./maps/API_SP.POP.TOTL_DS2_en_csv_v2_10576638.csv',
                  header=None, engine='python', skiprows=4)
display(pop)
years = ['1.970000e+03','1.980000e+03','1.990000e+03','2.000000e+03','2.010000e+03','2.015000e+03', 'Country Name']
pop[pop.columns[pop.isin(years).any()]]
This is one of the things I've tried so far which I thought made the most sense, but I am still very new to pandas so any help would be greatly appreciated.
Using the data at https://data.worldbank.org/indicator/sp.pop.totl, copied into pastebin (first time using the service, so apologies if it doesn't work for some reason):
import pandas as pd

# actual code using CSV file saved to desktop
# df = pd.read_csv(<path to CSV>, skiprows=4)
# pastebin for reproducibility
df = pd.read_csv(r'https://pastebin.com/raw/LmdGySCf', sep='\t')
# manually select years and other columns of interest
colsX = ['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
         '1990', '1995', '2000']
dfX = df[colsX]
# select every fifth year
year_cols = df.filter(regex='19|20', axis=1).columns
colsY = year_cols[[int(col) % 5 == 0 for col in year_cols]]
dfY = df[colsY]
As a general comment:
The dataset is construed in a weird way wherein the years aren't the columns themselves but rather the years are a value within the first row of the set.
This is not correct. Viewing the CSV file, it is quite clear that the values in row 5 (Country Name, Country Code, Indicator Name, Indicator Code, 1960, 1961, ...) are indeed column names. You have read the data into pandas in such a way that those values are not column names, but your first step, before trying to subset your data, should be to ensure you have read the data in properly, which, in this case, would give you column headers named for each year.
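Putting that advice together, a minimal sketch (the file path is the one from the question; treating the decades as the years of interest is an assumption):
import pandas as pd

# skip the preamble so that row 5 of the file becomes the header row,
# giving real column names such as 'Country Name', '1960', '1961', ...
pop = pd.read_csv('./maps/API_SP.POP.TOTL_DS2_en_csv_v2_10576638.csv', skiprows=4)

# keep only the country name plus the decades of interest
wanted_years = [str(y) for y in range(1970, 2020, 10)]  # '1970', '1980', ...
subset = pop[['Country Name'] + wanted_years]
print(subset.head())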

icCube: multiple dimensions in MDX output

The documentation of icCube states:
However, a SELECT is not limited to two axes. We could have columns,
rows, pages, chapters, and sections. And you could still continue
beyond these by specifying a number for the axis.
Indeed, when I try using three dimensions on the demo Sales cube, it works:
select
{[paris], [london]} on 0,
{[2005], [2006]} on 1,
product.members on 2
from sales
However, when I try four dimensions:
select
{[paris], [london]} on 0,
{[2005], [2006]} on 1,
product.members on 2,
measures.members on 3
from sales
I get an error message: Unexpected number of axes (4) for the pivot table (expected:0..3)
What am I missing?
There is nothing wrong with using a 4-axis query. However, it is left up to the client you are using to be able to display it.
For example, Excel accepts 2D results, while the icCube pivot table is able to display results with up to (and including) 3 axes.
Hope that helps.

store matrix data in SQLite for fast retrieval in R

I have 48 matrices of dimensions 1,000 rows and 300,000 columns where each column has a respective ID, and each row is a measurement at one time point. Each of the 48 matrices is of the same dimension and their column IDs are all the same.
The way I have the matrices stored now is as RData objects and also as text files. I guess for SQL I'd have to transpose and store by ID, in which case the matrix would be of dimensions 300,000 rows and 1,000 columns.
I guess if I transpose it, a small version of the data would look like this:
id1 1.5 3.4 10 8.6 .... 10 (with 1,000 columns and 300,000 rows now)
I want to store them in a way such that I can use R to retrieve a few of the rows (~ 5 to 100 each time).
The general strategy I have in mind is as follows:
(1) Create a database in sqlite3 using R that I will use to store the matrices (in different tables)
For file 1 to 48 (each file is of dim 1,000 rows and 300,000 columns):
(2) Read in file into R
(3) Store the file as a matrix in R
(4) Transpose the matrix (now it's of dimensions 300,000 rows and 1,000 columns). Each row now is the unique id in the table in sqlite.
(5) Dump/write the matrix into the sqlite3 database created in (1) (dump it into a new table probably?)
Steps 1-5 are to create the DB.
Next, I need step 6 to read-in the database:
(6) Read some rows (at most 100 or so at a time) into R as a (sub)matrix.
A simple example code doing steps 1-6 would be best.
Some Thoughts:
I have used SQL before, but it was mostly to store tabular data where each column had a name; in this case each column is just one point of the data matrix. I guess I could just name them col1 ... to col1000, or are there better tricks?
If I look at: http://sandymuspratt.blogspot.com/2012/11/r-and-sqlite-part-1.html they show this example:
dbSendQuery(conn = db,
"CREATE TABLE School
(SchID INTEGER,
Location TEXT,
Authority TEXT,
SchSize TEXT)")
But in my case this would look like:
dbSendQuery(conn = db,
"CREATE TABLE mymatrixdata
(myid TEXT,
col1 float,
col2 float,
.... etc.....
col1000 float)")
I.e., I would have to type in col1 to ... col1000 manually, which doesn't sound very smart. This is where I am mostly stuck; some code snippet would help me.
Then I need to dump the text files into the SQLite database. Again, I'm unsure how to do this from R.
Seems I could do something like this:
setwd(<directory where to save the database>)
db <- dbConnect(SQLite(), dbname="myDBname")
mymatrix.df = read.table(<full name to my text file containing one of the matrices>)
mymatrix = as.matrix(mymatrix.df)
Here I need to know the code for dumping this into the database...
Finally, how do I quickly retrieve the values (without having to read the entire matrices each time) for some of the rows (by ID) using R?
From the tutorial it'd look like this:
sqldf("SELECT id1,id2,id30 FROM mymatrixdata", dbname = "Test2.sqlite")
But there id1, id2, id30 are hardcoded in the code, and I need to obtain them dynamically. I.e., sometimes I may want id1, id2, id10, id100; and another time I may want id80, id90, id250000, etc.
Something like this would be more appropriate for my needs:
cols.i.want = c("id1","id2","id30")
sqldf("SELECT cols.i.want FROM mymatrixdata", dbname = "Test2.sqlite")
Again, unsure how to proceed here. Code snippets would also help.
A simple example would help me a lot here, no need to code the whole 48 files, etc. just a simple example would be great!
Note: I am using a Linux server, SQLite 3 and R 2.13 (I could update it as well).
In the comments the poster explained that it is only necessary to retrieve specific rows, not columns:
library(RSQLite)
m <- matrix(1:24, 6, dimnames = list(LETTERS[1:6], NULL)) # test matrix
con <- dbConnect(SQLite()) # could add dbname= arg. Here use in-memory so not needed.
dbWriteTable(con, "m", as.data.frame(m)) # write
dbGetQuery(con, "create unique index mi on m(row_names)")
# retrieve submatrix back as m2
m2.df <- dbGetQuery(con, "select * from m where row_names in ('A', 'C')
order by row_names")
m2 <- as.matrix(m2.df[-1])
rownames(m2) <- m2.df$row_names
Note that relational databases are set-based and the order that the rows are stored in is not guaranteed. We have used order by row_names to get out a specific order. If that is not good enough, then add a column giving the row index: 1, 2, 3, ... .
REVISED based on comments.