0 votes
311 views
Hi,

I have large >5GB CSV file and I would like to cross match that without cutting it to smaller pieces.

I am looking for method where I can use anaconda / Jupyter notebook to do this.

 My CSV has 20 million rows and I would like to build a code that would query for example 100 000 row blocks and keep writing the result file locally?.

CSV file would have RA and DEC columns that I would like to xmatch with for example GAIA edr3 with radius 1arc sec.

Result: The file with all original rows completed with GAIA data  for those objects that have gaia data available.

I hope this makes sense and I will be open for other solutions as well. I am able to cut the file to smaller pieces but that does not feel the best way.

Edit: probably better way would be to use FITS file? So I am open for that because I suck with fits files!

Best LaaLaa
asked Mar 22, 2022 by LaaLaa | 311 views

1 Answer

0 votes

Hi, thanks for reaching out.

If you want to work in a NB on the DL Jupyter NB server, please first upload your CSV, e.g. to the NB space. The format should have one header line, followed by the data lines, e.g.:

ra,dec,source_id
2.503706331038843,53.16659819583169,396011034259836416
2.5165612839280316,53.171095728972396,396011072920766848
2.519798375886651,53.17298647736945,396011072916668800

Then, in the NB server, upload the table to your MyDB (that's the user relational database). The reason to do so is that then you can run the xmatch directly on the DB, which is faster and should work for 20M rows, without the need to split things. Here's how:

from dl import queryClient as qc
qc.mydb_import('mytable','./mycsvfile.csv')  # table name in mydb, csv file name on local disk

How to write the xmatch queries is shown e.g. in this DL example NB:

https://github.com/astro-datalab/notebooks-latest/blob/master/04_HowTos/CrossmatchTables/How_to_crossmatch_tables.ipynb

(in the section "Run the crossmatch"). Details on writing the SQL string will depend whether you want to keep all matches within the 1 arcsec or just the nearest one, and whether you want to keep rows that have no counterpart in gaia_edr.gaia_source table or not.

Once you have written an xmatch query, you can run it in the NB and write out the result to a new CSV file like this:

xmatchsql = "SELECT ..." # your xmatch query SQL, see the example NB
qc.query(xmatchsql,timeout=600,out='./myxmatch.csv')

BTW, I tried these steps with a 20M rows CSV file (from gaia_edr3 itself), and the qc.mydb_import() step took 22 minutes for me, while a crossmatch query on it vs gaia_edr3.gaia_source took 13 minutes.

Please don't hesitate to contact us again if needed.

Robert

answered Mar 22, 2022 by robertdemo (5,160 points)
Thank you for your time Robert,

I was hoping that I don't need to upload the whole file to mydb but do the cross match in smaller blocks/ in a similar way how TOPCAT runs cross matching with local file. TOPCAT keeps sending 50k row blocks while writing single result file.

Maybe I will just cut it to pieces like usually.
Is there a particular reason why you prefer a chunked approach over cross-matching everything at once? (maybe you are writing a semi-interactive application?) Other wise, the mydb JOIN approach is probably the simplest, given how many rows there are in gaia_edr3...

If you wanted to run everything locally in a NB you'd of course have to have gaia_edr3.gaia_source locally, too, and than you could use the xmatching functions in astropy to accomplish what you want. Problem is of course that the gaia table is 1.6 billion rows, so even with just the three ra,dec,source_id columns you are looking at maybe 18 GB in RAM (if you are ok with 32bit/4byte datatypes. If you add other gaia columns, the size grows quickly, and your RAM will be the limiting factor.
I don't know if it would be possible to do all because I need to cross match it with sources like GAIA / DES / VPHAS+ and GLIMPSE.

GLIMPSE I think is only at IRSA /IPAC
NOIRLAB  Has DES and nsc
ESO has VPHAS+
VSA has some VISTA data
And previously I have used PAN-STARRS dr2 which was exclusive at Mast.

Usually people search certain kind of object and can define constraints for it. I can't do that because I use large amount of data to remove objects that I am not looking for.

If I have 10 files it takes me long time to drag them thru IRSA/Noirlab/Mast and ESO. And after every server I need chance format or null values etc. So I was thinking if there was faster way to get this done.
Yes it's a bit of a pain when one works with many datasets. Please note that at Data Lab we also have VHS DR, in case this makes life a little bit easier.

359 questions

372 answers

385 comments

2,451 users

Welcome to Data Lab Help Desk, where you can ask questions and receive answers from other members of the community.