0 votes
184 views
Dear datalaber,

I have asked one question about the long query, which can not be completed. But even though it can be completed, there is still going to be a very large table. I have imposed more cuts to insure the output is not going to be too large. But the timeout happens very soon. I have no idea how to avoid that.

Thanks for any help. Below is the new query.

sql = '''SELECT  lsls.ra, lsls.dec, lsls.dered_mag_g as LS_mag_g, lsls.dered_mag_r as LS_mag_r,
               lsls.dered_mag_z as LS_mag_z, lsls.dered_mag_w1 as mag_w1, lsls.dered_mag_w2 as mag_w2,
               ng.ra as target_ra, ng.dec as target_dec
         FROM mydb://nearbygalaxy AS ng
         LEFT JOIN LATERAL (
               SELECT ls.*
                    FROM
                        ls_dr9.tractor as ls
                    WHERE
                        ABS((ng.mag_g-ng.mag_r)-(ls.dered_mag_g-ls.dered_mag_r))<0.25 AND q3c_join(ng.ra, ng.dec, ls.ra, ls.dec, 0.25) AND ls.dered_mag_r between 16.0 and 24.0
                    
                    ORDER BY
                        q3c_dist(ng.ra,ng.dec,ls.ra,ls.dec)
               ) as lsls ON true;'''

res = convert(qc.query(sql=sql))
asked Jan 11, 2023 by huanian (250 points) | 184 views

1 Answer

0 votes

The query is slow because a LATERAL JOIN acts as a loop, i.e. the inner SELECT statement is excuted for each row in the outer (nearbygalaxy) table.  The inner statement slow because even though the constrained mag columns are indexed, you're computing a color difference and so a full table-scan of the tractor table is required each time.  Additionally there is the overhead caused by the current implementation of MyDB tables being stored is a separate database.

To optimize the query you can break it up so that you first do the spatial matching using the Q3C index to find all neighbors within the 0.25deg radius of each galaxy and save the result to an intermediate table.  Since the ORDER BY isn't used it can be dropped, and computing the color difference to a new column at this stage also means those are only computed once, this table can also be indexed for speed.  Next, query this neighbors table using the color cut for the final selection. 

The script below executes using only synchronous queries in about 18min.  The neighbors tables has ~28M rows which is then reduced to ~7M for the final color cut.  If you'll be using this final table often you may want to save the result to a local CSV file and load from there.  It could also be saved to another MyDB table but will require 2-3min to readback from the database each time you access it.

Hope this helps.

from dl import queryClient as qc
from dl.helpers.utils import convert
import time

# Separate the queries to avoid full table scans.   
#
# The 'neighbors' query finds all neighbors w/in 0.25 degrees since this can
# use the Q3C index.  We also compute the color difference here and eliminate
# the ORDER BY clause as optimizations.
#
# The 'color_cut' query does the final selection.

neighbors = '''
   SELECT l.ra, l.dec,  
          l.dered_mag_g as LS_mag_g,  
          l.dered_mag_r as LS_mag_r,  
          l.dered_mag_z as LS_mag_z,  
          l.dered_mag_w1 as mag_w1, l.dered_mag_w2 as mag_w2,
          ABS((ng.mag_g-ng.mag_r)-(l.dered_mag_g-l.dered_mag_r)) as cdiff,
          ng.ra as target_ra, ng.dec as target_dec
   FROM mydb://nearbygalaxy AS ng
   LEFT JOIN LATERAL (
       SELECT ls.*
           FROM
             ls_dr9.tractor as ls
           WHERE
             q3c_join(ng.ra, ng.dec, ls.ra, ls.dec, 0.25)  
           -- ORDER BY
             -- q3c_dist(ng.ra,ng.dec,ls.ra,ls.dec)
   ) as l ON true
'''

color_cut = '''select * from mydb://nbors as nb
              where cdiff < 0.25 AND nb.LS_mag_r > 16.0  
                and nb.LS_mag_r < 24.0;
           '''

# Do the initial spatial query, saving results to a MyDB table (28M rows).
_start = time.time()
print('Doing Neighbors table: ')
print('    Result: ' + qc.query(sql=neighbors, out='mydb://nbors',
                               drop=True, timeout=600))

# Index the columns for speed.
print('Indexing....' + qc.mydb_index('mydb://nbors','cdiff'))
print('Indexing....' + qc.mydb_index('mydb://nbors','ls_mag_r'))
print('Total neighbors time: ' + str(time.time() - _start))

# Do the final select by querying the color cut, save the result as a DF.
_start_f = time.time()
print('Final selection: ....')
dat = convert(qc.query(sql=color_cut, timeout=600))

# If the Sync query times out, this ASync call coule be used instead.
#dat = convert(qc.query(sql=color_cut, async_=True, wait=True))

print('Final Table Len: ' + str(len(dat)))
print('Final Table Time: ' + str(time.time() - _start_f))
print('Total Time: ' + str(time.time() - _start))


 

answered Jan 16, 2023 by datalab (19,200 points)
Thank you so much for the detailed explanation. I think I can only query the related columns in the Join sentence, LEFT JOIN LATERAL (
       SELECT la.ra, ls.dec, ls.dered_mag_g, ls.dered_mag_r, ls.dered_mag_z, ls.dered_mag_w1,ls. dered_mag_w2
           FROM
             ls_dr9.tractor as ls
           WHERE
             q3c_join(ng.ra, ng.dec, ls.ra, ls.dec, 0.25)  to save time.

359 questions

372 answers

385 comments

2,451 users

Welcome to Data Lab Help Desk, where you can ask questions and receive answers from other members of the community.