Azure Synapse - Optimize SQL Pools - Query Optimization

How to optimize performance with dedicated sql pool?



OPTIMIZE QUERIES

Transactions

  • Transactions in Dedicated SQL Pool
    • Isolation levels
      • Default: READ UNCOMMITTED.
      • Can be changed to READ COMMITTED SNAPSHOT ISOLATION at DB level, but then cant use READ UNCOMMITTED at session level
    • Transaction size
      • varies with DWU capacity
      • ceilings are presumed when data is evenly distributed with HASH or ROUND_ROBIN. If data is skewed, limits will be reached earlier
    • Trasaction state : XACT_STATE() returned with value -2
    • Limitations
      • No distributed transactions
      • No nested transactions permitted
      • No save points allowed
      • No named transactions
      • No marked transactions
      • No support for DDL such as CREATE TABLE inside a user-defined transaction
    • THROW, RAISERROR
      • THROW is the more modern implementation for raising exceptions in dedicated SQL pool but RAISERROR is also supported
      • THROW: 100K-150K. RAISERROR: 50K
      • Use of sys.messages not supported
  • Transaction Optimization Best Practices
    • Transactions and logging
      • Each distribution has its own transaction log, log writes are automatic but can be made efficient with good code
        • use minimal logging where possible
        • process data as scoped batch instead of individual transactions
        • adopt partition switching patter for large updates
      • Min vs Full logging
        • minimal logging involves logging only the informati on that is required to roll back the transaction after a failure, or for an explicit request (ROLLBACK TRAN). Since minimally logged operations can participate in explicit transactions, they can be rolled back as well.
          • CREATE INDEX, ALTER INDEX REBUILD, DROP INDEX, TRUNCATE TABLE, DROP TABLE, ALTER TABLE SWITCH PARTITION
          • CREATE TABLE AS SELECT (CTAS), INSERT..SELECT (they both are bulk load options)
            • Any writes to update CI or NCI will always be fully logged operation
            • When loading to non-empty, CI table, operation is mix of full logged if writing to non empty page, and min logged if writing to empty page.
          • Internal data movement operations (such as BROADCAST and SHUFFLE) are not affected by the transaction safety limit
        • Optimize for DELETE and UPDATE using mil-logged options, or use Partition Switching. Another option is to use smaller batch.
      • Pause, Scaling
        • When you pause or scale your dedicated SQL pool, it is important to understand that any in-flight transactions are terminated immediately; causing any open transactions to be rolled back.
  • Key Call out’s
    • Dynamic SQL
      • Synapse doesnt support blob data type, meaning very large strings cant stored in memory. You need to break the code into chunks and use the EXEC statement instead.
    • GROUP BY
      • Limited support by dedicated SQL Pool, and not supported by serverless sql pool

Execution Plan

  • MPP Plan (D-SQL)
  • SMP Plan

Statistics

Stats in dedicated SQL Pool

  • REF
  • Auto creation of stats
  • Update stats
  • Last stats update
  • Managing Stats
    • USP to create stats on all columns in a Database
    • Stats Metadata
      • Catalog Views
      • System Functions
      • DBCC SHOW_STATISTICS()

        Stats in serverless SQL Pool

  • Auto creation of stats
  • Manual creation of stats
  • Update stats
  • Managing stats
    • Stats Metadata
      • Catalog Views
      • System Functions

Data Loading

  • Best Practices
    • Prepare data in Azure Storage
    • Run loads with enough compute
    • Allow multiple users to load (DENY CONTROL across different schemas)
    • Load to a staging table
    • Load to a columnstore index
    • Increase batch size when using SQLBulkCopy API or BCP
    • Manage loading failures
    • Insert data into a production table
    • Create statistics after the load
    • Rotate storage keys
  • Delta Lake integration is a key ask which can be referenced here

Data Movement

  • Shuffle_Move, and this, and another example, and [this]
  • Broadcast_Move, and this
  • ReturnOperation: ReturnOperation is the process of getting results at each Compute Node based on redistribution data.
  • Distribute_Replicated_Table_Move
  • Master_Table_Move
  • Partition_Move
  • Trim_Move