Frequency table from multiple lists

I have a data frame containing a list of genes from 60 different high-throughput experiments. It looks something like this:
experiment 1 experiment 2
APOE DAPK
PAK POA2
GALC APOE
JNK
And this goes on for 60 experiments, with a total of about 5000 genes.
I simply want a list of the genes that occur most frequently across all these lists, along with which experiments each gene is found in. For example, the output might look like this:
Gene  Frequency  Present_In
APOE  4          experiment 5, experiment 11, experiment 27, experiment 53
SNCA  3          experiment 2, experiment 43, experiment 48
MAPT  3          experiment 5, experiment 44, experiment 57
GAK   2          experiment 23, experiment 31
I've been trying to do this for 5 hours!
EDIT - This is the online tool I've been using that does what I need.
http://molbiotools.com/listcompare.html
I copy-paste each gene list into the website and it spits out what I need. But I'm nearing 70 gene lists and I need R to do it automatically; otherwise, whenever I add a new list I have to re-copy-paste all 70 gene lists 0.o
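For what it's worth, the counting itself is only two steps: reshape the 60 columns into long (experiment, gene) pairs, then aggregate per gene. Below is a rough sketch of that logic in Python/pandas (the file name and column layout are assumptions); the same reshape-then-aggregate approach carries over to R with tidyr and dplyr.
import pandas as pd

# Assumed layout: one column per experiment, gene symbols in the cells,
# shorter lists padded with empty cells (read in as NaN); file name is made up.
df = pd.read_csv("gene_lists.csv")

# Step 1: reshape to long format, one (experiment, gene) pair per row.
pairs = df.melt(var_name="Experiment", value_name="Gene").dropna().drop_duplicates()

# Step 2: aggregate per gene: how many experiments, and which ones.
summary = (pairs.groupby("Gene")["Experiment"]
                .agg(Frequency="size", Present_In=", ".join)
                .sort_values("Frequency", ascending=False)
                .reset_index())
print(summary)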

Multithreading PDF workload slows the program down

I tried building a program that scans PDF files to find certain elements and then outputs a new PDF file with all the pages that contain those elements. It was originally single-threaded and a bit slow to run: it took about 36 seconds on a 6-core 5600X. So I tried parallelizing it with concurrent.futures:
def process_pdf(filename):
    # Open the PDF file
    f = open(filename, "rb")
    print("Searching: " + filename)
    # Create a PDF object
    pdf = PyPDF2.PdfReader(f)
    # Extract the text from each page in the PDF
    extracted_text = [page.extract_text() for page in pdf.pages]
    # Initialize a list
    matching_pages = []
    # Iterate through the extracted text
    for j, text in enumerate(extracted_text):
        # Search for the symbol in the text
        if symbol in text:
            # If the symbol is found, get a new PageObject instance for the page
            page = pdf.pages[j]
            # Add the page to the list
            matching_pages.append(page)
    return matching_pages
Multiprocessing Block:
with concurrent.futures.ThreadPoolExecutor() as executor:
    # Get a list of the file paths for all PDF files in the directory
    file_paths = [
        os.path.join(directory, filename)
        for filename in os.listdir(directory)
        if filename.endswith(".pdf")
    ]
    futures = [executor.submit(process_pdf, file_path) for file_path in file_paths]
    # Initialize a list to store the results
    results = []
    # Iterate through the futures as they complete
    for future in concurrent.futures.as_completed(futures):
        # Get the result of the completed future
        result = future.result()
        # Add the result to the list
        results.extend(result)

# Add each page to the new PDF file
for page in results:
    output_pdf.add_page(page)
The parallel version works, as evident from the printed text, but it somehow doesn't scale at all: 1 thread ~35 seconds, 12 threads ~38 seconds.
Why? Where's my bottleneck?
I tried using other libraries, but most were broken or slower.
I also tried using re instead of in to search for the symbol, with no improvement.
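A note on the title: the code above uses ThreadPoolExecutor, and in CPython threads cannot run Python bytecode in parallel because of the GIL, so for CPU-bound work like extract_text() they mostly add scheduling overhead. A process pool is the usual way to get real parallelism. Below is a rough sketch under the question's setup (symbol, directory, and output_pdf defined at module level); it returns page numbers instead of PageObjects because results from worker processes must be picklable. Whether it actually scales on a given machine is another matter, as the answer below illustrates.
import concurrent.futures
import os

import PyPDF2

def find_matching_pages(path):
    # Runs in a worker process; returns only picklable data
    # (the path plus the matching page numbers), not PageObjects.
    reader = PyPDF2.PdfReader(path)
    return path, [i for i, page in enumerate(reader.pages)
                  if symbol in (page.extract_text() or "")]

if __name__ == "__main__":
    file_paths = [os.path.join(directory, name)
                  for name in os.listdir(directory) if name.endswith(".pdf")]
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for path, page_numbers in executor.map(find_matching_pages, file_paths):
            reader = PyPDF2.PdfReader(path)  # re-open in the parent process
            for i in page_numbers:
                output_pdf.add_page(reader.pages[i])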
In general, PDF processing consumes much of a device's primary resources. Every device and system is different, so by way of illustration here is one simple PDF task.
Searching for text with a common library component, Poppler's pdftotext (others may be faster, but the aim here is to show the usual attempts and issues):
As a ballpark yardstick, in 1 minute I scanned roughly 2,500 pages of one file for the word "AMENDMENTS" and found about 900 occurrences, such as:
page 82
[1] SHORT TITLE OF 1971 AMENDMENTS
[18] SHORT TITLE OF 1970 AMENDMENTS
[44] SHORT TITLE OF 1968 AMENDMENTS
[54] SHORT TITLE OF 1967 AMENDMENTS
page 83
[11] SHORT TITLE OF 1966 AMENDMENTS
[23] SHORT TITLE OF 1965 AMENDMENTS
[42] SHORT TITLE OF 1964 AMENDMENTS
page 84
[16] SHORT TITLE OF 1956 AMENDMENTS
[26] SHORT TITLE OF 1948 AMENDMENTS
page 85
[43] DRUG ABUSE, AND MENTAL HEALTH AMENDMENTS ACT OF 1988
To scan the whole file (13,234 pages) would take about 5 minutes 20 seconds, and I know from testing that 4 CPUs could process a quarter of that file (3,308 of the 13,234 pages) in under 1 minute (there is a gain from using smaller files).
So a 4-core device should do, say, 3,000 pages per core in a short time. Well, let's see.
If I thread that as 3 x 1,000 pages, one thread finishes after about 50 seconds, another after about 60 seconds, and a third after about 70 seconds. Overall there is little or only a slight gain, because it is still one application splitting its time across the threads.
Overall: about 3,000 pages per minute.
Let's try being clever and use 3 application instances on the one file. Surely they can each take less time? Guess what: one finishes after about 50 seconds, another after about 60 seconds, and a third after about 70 seconds. No gain from using 3 applications.
Overall: about 3,000 pages per minute.
Let's try being even more clever and use 3 application instances on 3 similar but different files. Surely they can each take less time? Guess what: one finishes after about 50 seconds, another after about 60 seconds, and a third after about 70 seconds. No gain from using 3 applications on 3 tasks.
Overall: about 3,000 pages per minute.
Whichever way I try, the resources on this device are constrained to about 3,000 pages per minute overall. I may just as well let 1 thread run unfettered.
So the basic answer is to use multiple devices, the same way render farming is done for graphics.
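For reference, the kind of page-range split described above can be scripted around Poppler's pdftotext with its -f/-l options. The sketch below just mirrors this illustration's numbers; the file name, search word, and ranges are placeholders.
import subprocess
from concurrent.futures import ProcessPoolExecutor

def count_in_range(page_range):
    first, last = page_range
    # pdftotext -f/-l extract only the given page range; "-" sends the text to stdout
    result = subprocess.run(
        ["pdftotext", "-f", str(first), "-l", str(last), "big.pdf", "-"],
        capture_output=True, text=True)
    return result.stdout.count("AMENDMENTS")

if __name__ == "__main__":
    quarters = [(1, 3308), (3309, 6617), (6618, 9925), (9926, 13234)]
    with ProcessPoolExecutor() as pool:
        print(sum(pool.map(count_in_range, quarters)))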

Read a text file using Fortran, where the first 18 lines are strings and the rest of the lines are real numbers in a 1000*8 array?

I have a text file where the first few lines are text and the rest of the lines contain data in the form of real numbers. I just need the array of real numbers to be stored in a new file. For this, I read the total number of lines of the file (for which the output is correct) and then try to read the real numbers starting from a particular line number. I am unable to work out how to read this data.
Below is part of the file. I also have many files like this to read.
AptitudeQT paperI: 12233105
Latitude : 30.00 S
Longitude: 46.45 E
Attemptone Time: 2017-03-30-09-03
End Time: 2017-03-30-14-55
Height(agl): m
Pressure: hPa
Temperature: deg C
Humidity: %RH
Uvelocity: cm/s
Vvelocity: cm/s
WindSpeed: cm/s
WindDirection: deg
---------------------------------------
10 1008.383 27.655 62.200 -718.801 -45.665 720.250 266.500
20 1007.175 27.407 62.950 -792.284 -18.481 792.500 268.800
There are many examples of how to skip/read lines like this.
But to sum it up, option A is to skip the header and read only the data:
! Skip first 17 lines
do i = 1, 17
    read (unit,*,IOSTAT=stat) ! Dummy read
    if ( stat /= 0 ) stop "error"
end do

! Read data
do i = 1, 1000
    read (unit,*,IOSTAT=stat) data(:,i)
    if ( stat /= 0 ) stop "error"
end do
If you have many files like this, I suggest wrapping this in a subroutine/function.
Option B is to use the Unix tail utility to discard the header (more info here):
tail -n +18 file.txt

Interlacing text from multiple files based on contents of line

I'm trying to take N files, which, incidentally, are all syslog log files, and interlace them based on the timestamp, which is the first part of each line. I can do this naively, but I fear that my approach will not scale well beyond just a handful of these files.
So let's say I just have two files, 1.log and 2.log. 1.log looks like this:
2016-04-06T21:13:23.655446+00:00 foo 1
2016-04-06T21:13:24.384521+00:00 bar 1
and 2.log looks like this:
2016-04-06T21:13:24.372946+00:00 foo 2
2016-04-06T21:13:24.373171+00:00 bar 2
Given that example, I would want the output to be:
2016-04-06T21:13:23.655446+00:00 foo 1
2016-04-06T21:13:24.372946+00:00 foo 2
2016-04-06T21:13:24.373171+00:00 bar 2
2016-04-06T21:13:24.384521+00:00 bar 1
As that would be the lines of the files, combined, and sorted by the timestamp with which each line begins.
We can assume that each file is internally sorted before the program is run. (If it isn't, rsyslog and I have some talking to do.)
So quite naively I could write something like this, ignoring memory concerns and whatnot:
interlaced_lines = []
first_lines = [[f.readline(), f] for f in files]
while first_lines:
    first_lines.sort()
    oldest_line, f = first_lines[0]
    while oldest_line and (len(first_lines) == 1 or (first_lines[1][0] and oldest_line < first_lines[1][0])):
        interlaced_lines.append(oldest_line)
        oldest_line = f.readline()
    if oldest_line:
        first_lines[0][0] = oldest_line
    else:
        first_lines = first_lines[1:]
I fear that this might be quite slow, reading line by line like this. However, I'm not sure how else to do it. Can I perform this task faster with a different algorithm or through parallelization? I am largely indifferent to which languages and tools to use.
As it turns out, since each file is internally presorted, I can get pretty far with sort --merge. With over 2GB of logs it sorted them in 15 seconds. Using my example:
% sort --merge 1.log 2.log
2016-04-06T21:13:23.655446+00:00 foo 1
2016-04-06T21:13:24.372946+00:00 foo 2
2016-04-06T21:13:24.373171+00:00 bar 2
2016-04-06T21:13:24.384521+00:00 bar 1
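If you would rather stay in Python, the standard library's heapq.merge does the same lazy k-way merge of already-sorted inputs. A minimal sketch (the output file name is arbitrary), relying on the fact that the leading ISO timestamps sort lexicographically in time order:
import heapq
from contextlib import ExitStack

log_names = ["1.log", "2.log"]  # any number of internally sorted files
with ExitStack() as stack, open("merged.log", "w") as out:
    logs = [stack.enter_context(open(name)) for name in log_names]
    # heapq.merge lazily merges the already-sorted line iterators,
    # comparing whole lines, so memory use stays small.
    out.writelines(heapq.merge(*logs))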

Automating Netlogo based on input from a spreadsheet or txt file

I have developed a model in NetLogo and I want to automate the model runs.
Basically, what I want to do is read the input from an Excel, CSV, or .txt file and ask NetLogo to change the inputs in the model accordingly, run the model for, let's say, 100 ticks, and store the required output from the 100th tick either in the same file from which the input was read or export it to a different file. Something like this:
Trial  Input1  Input2  Output
1      10      20
2      20      20
3      10      30
...
100    20      100
The variables Input1 and Input2 are in the interface, either as a slider or an input widget.
Use the BehaviorSpace feature in NetLogo. It's available under the Tools menu, and below is the documentation on the topic:
https://ccl.northwestern.edu/netlogo/docs/behaviorspace.html
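BehaviorSpace covers exactly this vary-the-inputs-and-record-the-outputs workflow and writes its results to a spreadsheet for you. If the inputs really must come from an existing spreadsheet, a scripted alternative is to drive the model from Python with the pyNetLogo library; the sketch below is only a rough outline under assumed names (the model path, the globals input1/input2, and the reporter output-value are all placeholders):
import pandas as pd
import pyNetLogo  # published as "pynetlogo" in newer releases

netlogo = pyNetLogo.NetLogoLink(gui=False)
netlogo.load_model("my_model.nlogo")             # placeholder model path

trials = pd.read_csv("inputs.csv")               # columns: Trial, Input1, Input2
for i, row in trials.iterrows():
    netlogo.command(f"set input1 {row['Input1']}")   # global/slider names are placeholders
    netlogo.command(f"set input2 {row['Input2']}")
    netlogo.command("setup")
    netlogo.command("repeat 100 [ go ]")             # run the model for 100 ticks
    trials.loc[i, "Output"] = netlogo.report("output-value")

trials.to_csv("results.csv", index=False)            # write inputs + outputs back out
netlogo.kill_workspace()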

3D Graph in Octave/Matlab from a CSV file

I'm new to Octave/Matlab and I want to plot a 3D graph.
I was able to do so using a predefined formula, like this:
x=1:.1:5;
y=1:.1:5;
[xx,yy] = meshgrid(x,y);
z = sin(xx)+sin(yy);
mesh(x,y,z);
But now the question is how to do the same thing, getting the data from a CSV file (for example). I know I can use the function csvread, but the big question is how to format the CSV to contain such data.
An example of doing the same graph above but this time grabbing the data from Excel/CSV would be appreciated. Thanks!
Done! I was finally able to do it!
Here's how I did it:
1) I've created a file in Excel with the X values in the cells A2:A42, and the Y values in the cells B1:AP1 (so you form a rectangle).
2) Then, in the cells in the middle, I put the formula I want (i.e. =sin($A2)+sin(B$1)).
3) I saved the file as CSV (but separated by spaces!) and manually edited it to look this way (the way QtOctave opens matrix files; in Matlab it might be different). For example (note the extra space before each column):
# Created by Octave 3.2.4, Thu Jan 12 19:32:05 2012 ART <diego@notebook2>
# name: z
# type: matrix
# rows: 3
# columns: 3
1 2 3
4 5 6
7 8 9
(If you're not sure how to do it, do what I did: create a simple matrix and export it to see what the exported file looks like!)
4) Octave has a function under Data -> Load matrix from file, which loads that kind of file. Or you can just run this command (varname is the name of the resulting variable):
load("-text", "file-where-the-data-is", "varname")
5) Create the graph (ex is the name of the matrix I've just imported):
x=1:.1:5;
y=1:.1:5;
mesh(x,y,ex)