Pandas: how to concat or merge/join/append two csv files with the same ID index but different columns in grouped data?

I'd like to concat, merge, or append/join two csv files that share the same index ID but carry different columns for that ID. The data are also grouped by ID. The 1st file looks like this:
ID,year,age
810006862,2000,49
810006862,2001,
810006862,2002,
810006862,2003,52
810023112,2003,27
810023112,2004,28
810023112,2005,29
810023112,2006,30
810033622,2000,24
810033622,2001,25
and the 2nd file looks like this:
ID,year,from1,to1
810006862,2002,15341,15705
810006862,2003,15706,16070
810006862,2004,16071,16436
810006862,2005,,
810023112,2000,14610,14975
810023112,2001,14976,15340
810023112,2003,15825,16523
810033622,2000,13211,14876
810033622,2001,14761,14987
I set ID as the index for both files after reading them into DataFrames and then concatenated them, but I get the error "ValueError: Shape of passed values is (25, 2914), indices imply (25, 251)".
I've tried the following code:
import pandas as pd

sp = pd.read_csv('sp1.csv')
sp = sp.set_index('ID')
op = pd.read_csv('op1.csv')
op = op.set_index('ID')
ff = pd.concat([sp, op], join='outer', sort=False, axis=1)
I've also tried concatenating the two files without setting an index; the result seemed to have the correct rows, but the values were misaligned across the columns.
I've also tried merge, but it produced many unnecessary duplicated rows within each group. Since each group has different year and age values, I found it quite difficult to delete those newly generated rows with this method.
full = pd.merge(sp, op, on = 'ID', how = 'outer', sort = False)
Maybe somebody can suggest a way to easily delete these duplicates; that would also work for me, because the merged file became huge! Thanks in advance!
The expected result would include all the distinct values from both csv files, somewhat like this:
ID,year,age,from1,to1
810006862,2000,49,,
810006862,2001,,,
810006862,2002,,15341,15705
810006862,2003,52,15706,16070
810006862,2004,,16071,16436
810006862,2005,,,
810023112,2000,,14610,14975
810023112,2001,,14976,15340
810023112,2003,27,15825,16523
810023112,2004,28,,
810023112,2005,29,,
810023112,2006,30,,
810033622,2000,24,13211,14876
810033622,2001,25,14761,14987
I've searched online for similar posts for quite some time but have been unable to solve my problem. Can anybody offer a clue how to do this? Thanks a lot!
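One approach that should produce exactly the expected output is to merge on both ID and year instead of ID alone: each row then pairs with at most one partner row, so no duplicates are generated. A minimal sketch, using the file names from the question:
import pandas as pd

sp = pd.read_csv('sp1.csv')
op = pd.read_csv('op1.csv')
# Merging on the composite key (ID, year) matches each row to at most one
# partner, so the outer join keeps every row without duplicating any.
full = pd.merge(sp, op, on=['ID', 'year'], how='outer', sort=False)
# Restore the grouped ordering shown in the expected output.
full = full.sort_values(['ID', 'year']).reset_index(drop=True)
The concat error, for what it's worth, most likely comes from the duplicated ID index labels: with non-unique indexes on both sides, pd.concat(..., axis=1) cannot align the rows one-to-one.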

Related

How should I merge monthly datasets into one dataset for cleaning?

I am working on a case study for a ride share. The data is broken up into monthly datasets, and in order to analyze the data over the last year I need to merge them. I uploaded all the data to both BigQuery and RStudio but am unsure of the best way to make one large dataset.
I may not even have to do this, but I believe that to find trends I should have all the data in one data table. If that is not the case, I will clean the data one month at a time.
Maybe use purrr::map_dfr()? It's like lapply() and rbind() rolled into one.
library(tidyverse)
all_the_tables <-
  map_dfr(                              # union as it loops over the function
    .x = list.files(pattern = ".csv"),  # input for the function
    .f = read_csv                       # the function
  )
If it's more complicated and you need to vary something by source file, you can use something like:
map_dfr(
  .x = list.files(pattern = ".csv"),
  .f = # the tilde lets you write a more complex sequence of steps
    ~ read_csv(file = .x) |>
        mutate(source = .x)
)
If it's a lot of files, consider using vroom::vroom().
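For reference, since this page is pandas-tagged: the same read-and-stack pattern in pandas might look like the sketch below, assuming the csv files share a schema and sit in the working directory.
import glob
import pandas as pd

# Read every csv in the working directory, tag each row with its source
# file name, and stack everything into one DataFrame.
all_the_tables = pd.concat(
    (pd.read_csv(f).assign(source=f) for f in glob.glob('*.csv')),
    ignore_index=True,
)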

How to create a DataFrame based on a condition

I want to create a new DataFrame from another for rows that meet a condition such as:
uk_cities_df['location'] = cities_df['El Tarter'].where(cities_df['AD'] == 'GB')
uk_cities_df[:5]
but the resulting uk_cities_df is all NaN.
The csv file that I am extracting from has no headers, so pandas used the first row's values as column names. I need uk_cities_df to include only the rows with the ISO code "GB"; "El Tarter" ended up denoting the location column and "AD" the ISO-code column.
Could you please provide a visual of what uk_cities_df and cities_df look like?
From what I can gather, I think you might be looking for the .loc operator.
You could try, for example:
uk_cities_df['location'] = cities_df.loc[cities_df['AD'] == 'GB', 'location']
Also, I did not really get what role 'El Tarter' plays here; maybe you could give more details?
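If the underlying problem is the headerless csv, one option is to assign column names at read time and then filter. A sketch, where the file name and column names are placeholders to adjust to the real data:
import pandas as pd

# header=None stops pandas from using the first data row as column names;
# the names below are hypothetical stand-ins for the real columns.
cities_df = pd.read_csv('cities.csv', header=None, names=['location', 'iso_code'])
uk_cities_df = cities_df[cities_df['iso_code'] == 'GB']
print(uk_cities_df[:5])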

Why am I getting a `Data type mismatch` error when I add "CF" to the end of my search string in a SQL statement in Access?

The following query (qryCurLotNewProducts) produces a data set that I want to process further with another query (qryBNP_CFRecordset):
//qryCurLotNewProducts
SELECT tblNewProducts.*
FROM tblNewProducts INNER JOIN tblCurLot ON (tblCurLot.CatalogNum = tblNewProducts.CatalogNum) AND
(tblNewProducts.LotNum = tblCurLot.CurLot);
When I run this second query to list only the "CF" products found in the first query, I get the "Data type mismatch in criteria expression" error.
//qryBNP_CFRecordset
SELECT qryCurLotNewProducts.*, tblABCategory.UNSPSC, tblAmount.ProductSize
FROM tblAmount RIGHT JOIN (tblABCategory RIGHT JOIN qryCurLotNewProducts ON tblABCategory.ABCategory = qryCurLotNewProducts.ABCategory) ON tblAmount.Amount = qryCurLotNewProducts.Amount
WHERE (((qryCurLotNewProducts.CatalogNum) Like "A700-###CF") AND ((qryCurLotNewProducts.DateEntered) Between #1/1/2000# And #3/1/2020#))
ORDER BY qryCurLotNewProducts.CatalogNum, Abs(qryCurLotNewProducts.LotNum);
If I remove the CF from the search string (so "A700-###"), the query correctly outputs a list containing all items that match that pattern.
If I use strings like "A700-####F" or "A700-###ZZ" or other combinations like that, I don't get an error but rather an empty result set.
Notably, "A700-001CF", "A700-002CF", etc. all produce the data type error. It seems there is something about the CF character combination that is causing trouble.
Has anybody else seen this issue? Do I need some kind of delimiter to tell SQL not to treat CF as a special switch?
Abs(qryCurLotNewProducts.LotNum) won't work with the LotNum values of the products ending in CF: your LotNum column has a text type, and Abs() fails to convert those values to numbers, which raises the data type mismatch.
Edit: you can see that LotNum is a text column in your first screenshot.

Splunk: formatting a csv file during indexing, values are being treated as new columns?

I am trying to create a new field during indexing; however, the fields become columns instead of values when I try to concat. What am I doing wrong? I have looked in the docs but haven't been able to work it out.
Would appreciate some help on this.
e.g.
.csv file
Header1,Header2
Value1 ,121244
transforms.conf
[test_transformstanza]
SOURCE_KEY = fields:Header1,Header2
REGEX = ^(\w+\s+)(\d+)
FORMAT = testresult::$1.$2
WRITE_META = true
fields.conf
[testresult]
INDEXED = True
The regex is good and creates two groups from the data, but why is it creating a new field instead of assigning the value to testresult? If I use testresult::$1 or testresult::$2 it works fine, but when concatenating it creates multiple headers with the value as the header name. Is there an easier way to concat fields? E.g., if you have a csv file with header names, can you not just refer to the header names? (I know how to do this with calculated fields, but I want to do it during indexing.)
Thanks

How to use a BioProject ID, for example PRJNA12997, in Biopython?

I have an Excel file listing more than 2,000 organisms, each with an associated BioProject ID (like PRJNA12997). The idea is to use these IDs to get the sequences for a later multiple alignment with five other sequences that I have in a text file.
Can anyone help me understand how I can do this using Biopython? At least the part with the BioProject ID.
You can first get the info using Bio.Entrez:
from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
# This call to efetch fails sometimes with a 400 error.
handle = Entrez.efetch(db="bioproject", id="PRJNA12997")
I've been trying, and Entrez.read(handle) doesn't seem to work. But if you do record_xml = handle.read() you'll get the XML entry for this record. In this XML you can find the ID for the organism, in this case 12997.
handle = Entrez.esearch(db="nuccore", term="12997[BioProject]")
search_results = Entrez.read(handle)
Now you can efetch from your search results. At this point you should use Biopython to parse whatever you get from the efetch step, playing with the rettype: http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/
for result in search_results["IdList"]:
    entry = Entrez.efetch(db="nuccore", id=result, rettype="fasta")
    this_seq_in_fasta = entry.read()
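Putting the pieces together, a rough end-to-end sketch might look like this; the Excel file name, its column name, and the PRJNA-prefix stripping are assumptions to adapt to the actual spreadsheet:
import pandas as pd
from Bio import Entrez

Entrez.email = "Your.Name.Here@example.org"

# 'organisms.xlsx' and the 'BioProjectID' column are hypothetical names.
ids = pd.read_excel('organisms.xlsx')['BioProjectID']

sequences = []
for bioproject in ids:
    # Assumed format: strip the accession prefix, e.g. "PRJNA12997" -> "12997".
    numeric_id = bioproject.replace('PRJNA', '')
    handle = Entrez.esearch(db="nuccore", term=numeric_id + "[BioProject]")
    search_results = Entrez.read(handle)
    for result in search_results["IdList"]:
        entry = Entrez.efetch(db="nuccore", id=result, rettype="fasta")
        sequences.append(entry.read())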