transformations - more decoupling #2718


Open · sh-rp wants to merge 10 commits into devel from feat/transformations-more-decoupling

Conversation

@sh-rp sh-rp commented Jun 5, 2025

Description

This PR changes the readable dataset to handle sqlglot expressions where possible without intermediary steps where sql strings are generated and re-parsed.
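A minimal sketch of the difference, assuming a relation that keeps the parsed expression around (the variable names are illustrative, not the dlt API):

```python
import sqlglot
import sqlglot.expressions as sge

dialect = "duckdb"

# before: intermediate steps rendered the query back to a SQL string and re-parsed it
query_str = sqlglot.parse_one("SELECT a, b FROM items", read=dialect).sql(dialect=dialect)
reparsed = sqlglot.parse_one(query_str, read=dialect)

# after: the readable dataset hands the sqlglot tree through directly, so
# qualification and normalization operate on the same sge.Query object
query: sge.Query = sqlglot.parse_one("SELECT a, b FROM items", read=dialect)
assert isinstance(query, sge.Query)
```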

netlify bot commented Jun 5, 2025

Deploy Preview for dlt-hub-docs ready!

🔨 Latest commit: 28b7361
🔍 Latest deploy log: https://app.netlify.com/projects/dlt-hub-docs/deploys/684989b0a2bb5c0008cf3856
😎 Deploy Preview: https://deploy-preview-2718--dlt-hub-docs.netlify.app


# TODO: why? don't we prevent empty column schemas above?
all_columns = {**computed_columns, **(columns or {})}
Collaborator Author

I have added a test for column merging. The columns yielded from here overwrite all incoming column definitions from the decorator, which I think generally makes sense in dlt, but not for transformations. I would keep this topic open for now until we decide how exactly this should work.

Collaborator

Wait: you are overwriting computed_columns with the columns that come from the decorator, so how are they not overwritten at the end?
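For reference, a minimal illustration of the merge precedence in question (the column values are made up): in a dict merge the right-hand operand wins, so in this expression the decorator columns take priority over the computed ones.

```python
computed_columns = {"price": {"data_type": "double"}}
columns = {"price": {"data_type": "decimal", "precision": 20, "scale": 2}}

# right-hand operand wins on key conflicts
all_columns = {**computed_columns, **(columns or {})}
assert all_columns["price"]["data_type"] == "decimal"
```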

@@ -31,7 +33,6 @@ def sqlglot_schema() -> SQLGlotSchema:
QUERY_KNOWN_TABLE_STAR_SELECT = "SELECT * FROM table_1"
QUERY_UNKNOWN_TABLE_STAR_SELECT = "SELECT * FROM table_unknown"
QUERY_ANONYMOUS_SELECT = "SELECT LEN(col_varchar) FROM table_1"
QUERY_GIBBERISH = "%&/ GIBBERISH"
Collaborator Author

This cannot happen anymore, as non-parseable SQL strings are already caught in the dataset.
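A minimal sketch of catching unparseable SQL up front, assuming the dataset simply wraps sqlglot's parse error (the function name is illustrative):

```python
import sqlglot
from sqlglot.errors import ParseError

def parse_or_raise(query: str, dialect: str = "duckdb") -> sqlglot.exp.Expression:
    try:
        return sqlglot.parse_one(query, read=dialect)
    except ParseError as exc:
        # gibberish like "%&/ GIBBERISH" never reaches schema inference
        raise ValueError(f"Not a valid SQL statement: {query!r}") from exc
```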

def inventory_original(dataset: SupportsReadableDataset[Any]) -> Any:
return dataset["inventory"]

@dlt.transformation(columns={"price": {"precision": 20, "scale": 2}})
Collaborator Author

I think this is the correct behavior: we should allow setting the precision for an existing column here. Or would you rather do this in the function body? I am not sure.
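For comparison, the function-body alternative could look roughly like the sketch below, attaching the hint via dlt.mark instead of the decorator argument (illustrative only, not code from this PR):

```python
from typing import Any
import dlt

@dlt.transformation()
def copied_inventory(dataset: Any) -> Any:
    # attach the precision hint to the yielded relation instead of the decorator
    yield dlt.mark.with_hints(
        dataset["inventory"],
        dlt.mark.make_hints(columns={"price": {"precision": 20, "scale": 2}}),
    )
```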

Will be translated to

```sql
INSERT INTO
Collaborator Author

This is the actual generated insert statement. I realized we were not quoting identifiers, so I added this. We are doing two subqueries here now, one stemming from the normalization in the extract and one from @anuunchin's normalizer work. We can probably still optimize here?

)
elif not add_dlt_load_id and add_dlt_id:
assert (
normalized_query
== 'SELECT _dlt_subquery."a" AS a, _dlt_subquery."b" AS b, UUID() AS _dlt_id FROM'
== 'SELECT _dlt_subquery."a" AS "a", _dlt_subquery."b" AS "b", UUID() AS "_dlt_id" FROM'
Collaborator Author

The model item normalizer will now always use quoted identifiers on the outer queries, regardless of whether the inner select has quoted identifiers. Since all queries arriving here now go through query normalization in the dataset, where quotes are applied, this should actually be fine.
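A small illustration (not the normalizer code itself) of building the quoted outer projection with sqlglot, independent of how the inner select was quoted:

```python
import sqlglot.expressions as sge

cols = [
    sge.alias_(
        sge.Column(
            this=sge.to_identifier(name, quoted=True),
            table=sge.to_identifier("_dlt_subquery"),
        ),
        name,
        quoted=True,
    )
    for name in ("a", "b")
]
outer = sge.select(*cols).from_("_dlt_subquery")
print(outer.sql(dialect="duckdb"))
# roughly: SELECT _dlt_subquery."a" AS "a", _dlt_subquery."b" AS "b" FROM _dlt_subquery
```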

def _query(self) -> sge.Query:
from dlt.helpers.ibis import duckdb_compiler

select_query = duckdb_compiler.to_sqlglot(self._ibis_object)
Collaborator Author

For converting an ibis expression into sqlglot, you always need a compiler, which is destination dependent. I think the compiler will fail if you use functions that are known not to exist on a specific destination, and certain materializations also occur in this step. If you have an expression that counts rows, the alias for the result is created here; it will be CountStar(*) for duckdb, but that is not a valid identifier for bigquery, for example (see the comment in the query normalizer).
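A sketch of why the compiler choice matters, assuming duckdb_compiler.to_sqlglot accepts an unbound ibis table expression as in the snippet above:

```python
import ibis
from dlt.helpers.ibis import duckdb_compiler

items = ibis.table({"a": "int64", "b": "string"}, name="items")
# an unnamed aggregate: ibis invents the result alias during compilation, and that
# auto-generated alias (something like CountStar(...)) is not a valid unquoted
# identifier on bigquery
row_count = items.aggregate(items.count())

select_query = duckdb_compiler.to_sqlglot(row_count)
print(select_query.sql(dialect="duckdb"))
```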

@@ -144,6 +145,8 @@ def normalize_query(
if len(expanded_path) == 3:
if node.db != expanded_path[0]:
node.set("catalog", sqlglot.to_identifier(expanded_path[0], quoted=False))
if isinstance(node, sge.Alias):
node.set("alias", naming.normalize_identifier(node.alias))
Collaborator Author

Here I make sure that aliases created by the user, or automatically by ibis in the compile step above, adhere to the naming convention of the current destination. The alternative would be to use a different compiler for every destination. Normalizing the alias in the ModelNormalizer step is too late, since this code is also used for just accessing the data. Without this change, test_row_counts will fail for bigquery because an alias with invalid symbols remains in the query; all other destinations seem to accept this.
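An illustrative version of what the two added lines do: re-alias every Alias node with the destination's naming convention so auto-generated ibis aliases survive on bigquery (snake_case is assumed as the convention here):

```python
import sqlglot
import sqlglot.expressions as sge
from dlt.common.normalizers.naming.snake_case import NamingConvention

naming = NamingConvention()
query = sqlglot.parse_one('SELECT COUNT(*) AS "CountStar(items)" FROM items', read="duckdb")

for node in query.find_all(sge.Alias):
    # replace the alias with its normalized form
    node.set("alias", sqlglot.to_identifier(naming.normalize_identifier(node.alias)))

print(query.sql(dialect="bigquery"))
```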

Collaborator Author

I have reverted this change and am now using the bigquery compiler only for bigquery destinations. Maybe we should use the matching compiler for each destination type, but it seems the duckdb compiler works for all the others. I am not sure about this one.
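The dispatch could be as small as the sketch below; `bigquery_compiler` is an assumed name for whichever ibis compiler dlt exposes for BigQuery, while `duckdb_compiler` is the one already imported in this PR:

```python
from dlt.helpers.ibis import duckdb_compiler  # used for all other dialects

def compiler_for(destination_dialect: str):
    if destination_dialect == "bigquery":
        # assumed import: bigquery rejects the identifiers the duckdb compiler
        # generates for anonymous columns
        from dlt.helpers.ibis import bigquery_compiler
        return bigquery_compiler
    return duckdb_compiler
```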

@sh-rp sh-rp force-pushed the feat/transformations-more-decoupling branch from 211cbfc to 1f5a787 Compare June 6, 2025 12:44
@sh-rp sh-rp marked this pull request as ready for review June 11, 2025 07:21
@sh-rp sh-rp requested a review from rudolfix June 11, 2025 07:21
…-more-decoupling

# Conflicts:
#	docs/website/docs/general-usage/transformations/index.md
@sh-rp sh-rp force-pushed the feat/transformations-more-decoupling branch from 848233e to 28b7361 Compare June 11, 2025 13:50
# NOTE: We can use the duckdb compiler for all dialects except for bigquery
# as bigquery is more strict about identifier naming and will not accept
# identifiers generated by the duckdb compiler for anonymous columns
destination_dialect = self._dataset.sql_client.capabilities.sqlglot_dialect
Collaborator

I'll keep the old notes:
# NOTE: ibis is optimized for reading a real schema from db and pushing back optimized sql
# - it quotes all identifiers, there's no option to get an unquoted query
# - it converts full lists of column names back into star
# - it optimizes the query inside, adds meaningless aliases to all tables etc.

For example: I bet the BigQuery problem is a quotation problem where " quotes are not accepted.

"Must be an SQL SELECT statement."
)

return query.sql(dialect=self._dataset.sql_client.capabilities.sqlglot_dialect)
Collaborator

Why is provided_dialect not used here?


# derived / cached properties
self._opened_sql_client: SqlClientBase[Any] = None
self._columns_schema: TTableSchemaColumns = None
self._qualified_query: sge.Query = None
self._normalized_query: sge.Query = None
self.__qualified_query: sge.Query = None
Collaborator

Is this intended? Names will be mangled, which is maybe good because we implement __getitem__ to get column names here.
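For reference, the mangling behavior in plain Python (nothing dlt-specific):

```python
class Relation:
    def __init__(self) -> None:
        # double leading underscore: Python mangles the attribute name to
        # _Relation__qualified_query, so __getattr__/__getitem__ lookups on the
        # relation cannot collide with it
        self.__qualified_query = None

r = Relation()
print("_Relation__qualified_query" in vars(r))  # True
print(hasattr(r, "__qualified_query"))          # False
```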

@@ -108,6 +108,7 @@ def normalize_query(
sqlglot_schema: SQLGlotSchema,
qualified_query: sge.Query,
sql_client: SqlClientBase[Any],
naming: NamingConvention,
Collaborator

Not needed anymore, and I think it should not be needed here.



"For sql transformations all data_types of columns must be known. "
+ "Please run with strict lineage or provide data_type hints "
+ f"for following columns: {unknown_column_types}",
)
yield dlt.mark.with_hints(
Collaborator

This is something I wrote in the old PR: could we place all_columns in the SqlModel? Or better: could we place a relation in it? Then:

  • the user is able to use dlt.mark on the model (i.e. their own custom hints or table name)
  • all_columns = {**computed_columns, **(columns or {})} is not needed; columns will be applied by compute_table_schema in DltResource
  • you can use the relation/computed_columns in SqlModel to apply hints to the resource in the extractor. Look at how we call this method:
root_table_schema = resource.compute_table_schema(items, meta)

and

def compute_table_schema(self, item: TDataItem = None, meta: Any = None) -> TTableSchema:
        """Computes the table schema based on hints and column definitions passed during resource creation.
        `item` parameter is used to resolve table hints based on data.
        `meta` parameter is taken from Pipe and may further specify table name if variant is to be used
        """

so SqlModel will be passed there and you can use it to extract computed hints.

Or you can implement _compute_schema on ModelExtractor for fully custom logic
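A self-contained illustration of that idea (all names here are hypothetical, not the dlt API): the model object carries the lineage-computed columns, and they are merged into the resource's table schema at extract time, with user/decorator hints keeping precedence, so the all_columns merge inside the transformation is no longer needed.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class SqlModelSketch:
    query: str
    computed_columns: Dict[str, Dict[str, Any]] = field(default_factory=dict)

def apply_computed_hints(table_schema: Dict[str, Any], model: SqlModelSketch) -> Dict[str, Any]:
    # hints already present on the resource (decorator / dlt.mark) keep precedence;
    # lineage-computed columns only fill in what is missing
    columns = table_schema.setdefault("columns", {})
    for name, column in model.computed_columns.items():
        columns.setdefault(name, column)
    return table_schema
```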

@sh-rp sh-rp mentioned this pull request Jun 18, 2025