Dealing with data stacked in columns in Python by splitting and merging dataframes

I need to use Python to automate the processing of files that have data stacked in columns like this:
User Name    Why would you...?
name1        Because
name2        Because
User Name    Which do you...?
name3        C
name2        D
User Name    How much...?
name4        23
name1        425
My first thought is to split the dataframe into one for each question and then merge them on User Name. How can I extract the chunk of rows for each question though? Perhaps there are other ways as well?
I have not found a way to do this yet.
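A minimal pandas sketch of that split-and-merge idea, assuming the file reads into a two-column dataframe and that each question block starts with a "User Name" header row (the file name, separator, and column names here are assumptions):

import pandas as pd

# Assumption: the raw file reads into two columns, with a "User Name" row starting each block
raw = pd.read_csv("responses.txt", sep="\t", header=None, names=["user", "answer"])

# Number each block: a new block starts at every "User Name" header row
block = (raw["user"] == "User Name").cumsum()

pieces = []
for _, chunk in raw.groupby(block):
    question = chunk.iloc[0]["answer"]    # question text sits in the header row
    pieces.append(chunk.iloc[1:].rename(columns={"user": "User Name", "answer": question}))

# Merge the per-question frames on User Name; an outer join keeps users who skipped a question
result = pieces[0]
for piece in pieces[1:]:
    result = result.merge(piece, on="User Name", how="outer")
print(result)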

Related

PySpark - change dataframe column value based on its existence in other dataframe

I am relatively new to Spark and haven't found a solution for this. I found several similar questions, but couldn't work out how to apply them to my use case.
I have two dataframes. The first is based on a CSV and looks like this (displayed as a table):
id      license_no
2005    1011
2006    1022
2007    3911
The second dataframe is also based on a CSV and looks like this:
license_no    active
1011          y
1022          y
3911          n
I need to check whether the license_no value exists in the second dataframe. If it exists and active=y, I need to add a prefix (99 at the beginning) to both id and license_no in the first dataframe; for example, license_no 1011 exists in the second dataframe and is active, so its id in the first dataframe should be changed to 992005 and its license_no to 991011. If it doesn't exist, 88 should be added instead.
The dataframe should look like this after the transformation:
id        license_no
992005    991011
992006    991022
882007    883911
I am not sure what the best solution for this is. Can I do this transformation directly in one Spark command?
from pyspark.sql.functions import when, col, concat, lit

# Left join so every row of df gets the matching 'active' flag (null when there is no match)
s = df.join(df1, how='left', on='license_no')
# Conditionally prepend the prefix using a list comprehension; a null 'active' fails the
# == 'y' test, so unmatched rows fall through to the '88' prefix
s.select(*[when(col('active') == 'y', concat(lit('99'), col(x)))
           .otherwise(concat(lit('88'), col(x))).alias(x) for x in s.columns if x != 'active']).show()
+----------+------+
|license_no| id|
+----------+------+
| 991011|992005|
| 991022|992006|
| 883911|882007|
+----------+------+
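For completeness, a hedged sketch of how the two input dataframes might be loaded before the join (the file names here are placeholders, not from the question):

# Hypothetical file names and options; adjust to your actual CSVs
df = spark.read.csv("licenses.csv", header=True, inferSchema=True)          # columns: id, license_no
df1 = spark.read.csv("license_status.csv", header=True, inferSchema=True)   # columns: license_no, active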

Storing a large array into a table with 10,000 columns in SQLite

I want to be able to store some 100x100 matrices (covariance matrices) in a table in my database. A good first step for me would be to flatten each matrix and store its structure (among other things) in a parent table.
However, creating such a table would require about 10,000 columns. Writing out that many field names would make my SQL extraordinarily long, and I wouldn't know where to start if I wanted to query for that matrix.
Is there a neat way to specify such a table in SQL? Is there a neat way for me to set or get a particular (set of) matrix (matrices) from my database using such a table? Is there a better way?
I am using SQLite for my databases.
Any table with a large number of columns of the same type can be rotated.
For example, if you have a table A like this:
row  col1  col2  col3  ...
1    1     2     3
2    11    12    13
you can simply rotate it into a table with 3 columns:
row  col  value
1    1    1
1    2    2
1    3    3
2    1    11
2    2    12
2    3    13
So instead of writing big SQL like
select col1, col2, col3 ... from A where row = 2
you write SQL like
select value from A where row = 2 order by col
The result set that was originally horizontal is now vertical: it has been rotated and is easy to handle.
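A minimal sketch of this rotated, one-row-per-cell layout using Python's sqlite3 module; the table and column names are illustrative, not prescribed by the question:

import sqlite3

conn = sqlite3.connect("matrices.db")
# One row per matrix cell instead of ~10,000 columns (illustrative schema)
conn.execute("""
    CREATE TABLE IF NOT EXISTS matrix_cell (
        matrix_id INTEGER,
        row_idx   INTEGER,
        col_idx   INTEGER,
        value     REAL,
        PRIMARY KEY (matrix_id, row_idx, col_idx)
    )
""")

# Stand-in for a real 100x100 covariance matrix
cov = [[float(r == c) for c in range(100)] for r in range(100)]

# Flatten the matrix and store it under matrix_id = 1
cells = [(1, r, c, cov[r][c]) for r in range(100) for c in range(100)]
conn.executemany("INSERT INTO matrix_cell VALUES (?, ?, ?, ?)", cells)
conn.commit()

# Read it back in row-major order and rebuild the 100x100 structure
values = [v for (v,) in conn.execute(
    "SELECT value FROM matrix_cell WHERE matrix_id = ? ORDER BY row_idx, col_idx", (1,))]
matrix = [values[i * 100:(i + 1) * 100] for i in range(100)]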

How can I read and parse files with varying spaces as the delimiter?

I need help solving this problem:
I have a directory full of .txt files that look like this:
file1.no
file2.no
file3.no
And every file has the following structure (I only care for the first two "columns" in the .txt):
#POS SEQ SCORE QQ-INTERVAL STD MSA DATA
#The alpha parameter 0.75858
#The likelihood of the data given alpha and the tree is:
#LL=-4797.62
1 M 0.3821 [0.01331,0.5465] 0.4421 7/7
2 E 0.4508 [0.05393,0.6788] 0.5331 7/7
3 L 0.5334 [0.05393,0.6788] 0.6279 7/7
4 G 0.5339 [0.05393,0.6788] 0.624 7/7
And I want to parse all of them into one DataFrame, while also converting the columns into lists for each row (i.e., the first column should be converted into a string like this: ["MELG"]).
But now I am running into two issues:
How to read the different files and append all of them to a single DataFrame, while also making a single column out of all the rows inside each file
How to parse these files, given that the spaces between the columns vary in almost all of them
My output should look like this:
|File |SEQ |SCORE|
| --- | ---| --- |
|File1|MELG|0.3821, 0.4508, 0.5334, 0.5339|
|File2|AAHG|0.5412, 1.2345, 0.0241, 0.5901|
|File3|LLKM|0.9812, 0.2145, 0.4142, 0.4921|
So, the first column for the first file (file1.no), the one with single letters, is now in a list, in a row with all the information from that file, and the DataFrame has one row for each file.
Any help is welcome, thanks in advance.
Here is example code in Julia that should work for you:
using DataFrames

function parsefile(filename)
    l = readlines(filename)
    filter!(x -> !startswith(x, "#"), l)   # drop the comment/header lines
    sl = split.(l)                          # split each line on runs of whitespace
    return (File=filename,
            SEQ=join(getindex.(sl, 2)),
            SCORE=parse.(Float64, getindex.(sl, 3)))
end

df = DataFrame()
foreach(fn -> push!(df, parsefile(fn)), ["file$i.no" for i in 1:3])
The result will be in the df data frame.
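Since the question is about DataFrames in general, here is an equivalent sketch in Python with pandas under the same assumptions (runs of spaces as the delimiter, lines starting with # skipped; the file-name pattern follows the example):

import pandas as pd
from pathlib import Path

rows = []
for path in sorted(Path(".").glob("file*.no")):
    # sep=r"\s+" treats any run of spaces as one delimiter; comment="#" drops the header lines
    data = pd.read_csv(path, sep=r"\s+", comment="#", header=None)
    rows.append({"File": path.stem,
                 "SEQ": "".join(data[1]),      # second column: the residue letters
                 "SCORE": data[2].tolist()})   # third column: the scores

df = pd.DataFrame(rows)
print(df)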

How to transpose columns when they encode multiple "records"?

I have a spreadsheet I have imported into OpenRefine. The creator encoded groups of information (records) in columns. I need to bring each of those groups of columns into its own row, along with all the relevant columns.
Using a simplified example, how would I go from this:
id foo1 foo2 foo3 bar1 bar2 bar3
1 4 6 a 7 9 b
2 5 5 a 8 8 b
3 6 4 a 9 7 b
To this:
id foobar1 foobar2 foobar3
1 4 6 a
1 7 9 b
2 5 5 a
2 8 8 b
3 6 4 a
3 9 7 b
I've been trying to think of a way forward with intermediate columns, but there are 6 groups of 5 columns and I'm currently stuck.
I found a solution. The steps are:
Concat each group of columns into a single column (FOO_CONCAT, BAR_CONCAT)
Delete the now unneeded columns (foo1..3, bar1..3)
Transpose your CONCAT columns into a single column, no prefix, ignoring blanks, filling down other columns
Now FOO_CONCATs and BAR_CONCATs are all in the same column
Split that column into several columns...(using the separator you used in step 1)
Rename columns
Strip out prefixes (I had foo1:4, bar2:8, etc for clarity)
Transform to numbers (Edit cells -> Common Transforms -> toNumber)
Now you're ready to transpose, facet, etc.
I think this is essentially the same as the solution you describe, but possibly with some shortcuts to avoid all the steps.
Given the example data you post I would:
On "Id" column select Edit column->Add column based on this column
from menu
Make new column name "foobar"
Use the GREL forEach(row.columnNames,cn,if(cn.startsWith("foo"),cells[cn].value,null)).join("|")+"~"+forEach(row.columnNames,cn,if(cn.startsWith("bar"),cells[cn].value,null)).join("|")
Once new "foobar" column exists, on this column use menu option Edit cells->Split multi-valued cells using the "~" character (as used in the GREL above)
The also on the "foobar" column use menu option Edit columns->Split into several columns, using the "|" character as in the GREL above
Finally on ID column use menu Edit cells->Fill down
This should result in the output you describe - if you don't need the original columns at this point you can either remove them, or (sometimes quicker) export the first X columns that have the reconfigured data using the custom tabular exporter, and then import that data into a new project.
You can modify the GREL to deal with the exact column groupings you have. In my example I've used the column naming to group the values, but if that isn't the reality of the data you are dealing with you can use GREL like:
forEach(row.columnNames.slice(1,4),cn,cells[cn].value).join("|")+"~"+forEach(row.columnNames.slice(4,8),cn,cells[cn].value).join("|")
This uses the 'slice' function to select certain columns rather than using some aspect of the column name to select them.

Pentaho Join table values

I'm a rookie at Kettle/Pentaho and queries. What I'm trying to do is check whether a value (e.g. A) in file 1 is also present in file 2.
I've got 2 files that I export from my DB:
File 1:
Row1, Row2
A 3
B 5
C 99
Z 65
File 2:
Row1, Row2
A 3
D 11
E 22
Z 65
And I want to create one output file:
File Output
Row1, Row2
A 3
Z 65
What I'm doing: 2 file inputs and a merge join, but no output file is produced. Something is missing here.
Any suggestion would be great!
You can join the two streams with a "Merge join" step, which lets you set the join keys freely (in your case it seems you want both fields to be used) as well as the type of join: Inner, Left outer, Right outer or Full outer.
You can use stream lookup for that. Start with the file input for file 1, then create a stream lookup step that uses the input stream for file 2 as its lookup stream. Now just match the columns and you can add the column from file 2 to your data stream.
Sort both files in ascending order
Use the Merge Join step to join both tables on the sorted field (in this case Row1)
Use the Select values step to delete unwanted fields produced as a result of the join
Output your result using the Dummy step or whatever output step you prefer
This should work fine