API Endpoints for Dataset Creation and Updating #36723
-
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; there is no need to wait for approval.
-
I also believe this would be helpful and aligns with the data-aware scheduling feature, as well as the dataset listener feature.
-
I think this is far too big a feature and should be discussed on the devlist. Currently all of the objects (DAGs and Datasets alike) are created by parsing DAG files, NOT by creating DB entities. IMHO it makes very little sense to start creating those datasets via APIs, especially since Datasets are not "standalone" entities and nothing can happen if you create a dataset via API but there is no DAG file that uses it. I am not sure what the consequences of that would be, as I know there are other discussions happening about the future of datasets - but if you want to start anything about that, starting a devlist discussion and explaining what you want is really the right way of approaching it. Converting this into a discussion, as it is definitely not a "feature" scope.
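For reference, today a dataset only comes into existence because some DAG file references it, typically as a task outlet. A minimal producer sketch (the URI and DAG/task ids are illustrative, not from this discussion):

```python
from airflow import DAG
from airflow.datasets import Dataset
from airflow.decorators import task
import pendulum

# The dataset "exists" only because this parsed DAG file references it.
daily_report = Dataset("s3://example-bucket/daily-report.csv")  # illustrative URI

with DAG(
    dag_id="daily-report-producer",  # made-up DAG id
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule="@daily",
    catchup=False,
):
    @task(outlets=[daily_report])
    def produce_report():
        ...  # writing the file; a dataset event is emitted when the task succeeds

    produce_report()
```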
-
See https://airflow.apache.org/community/ for devlist information.
-
Hi @potiuk and the Airflow community, thank you, @potiuk, for your insights. I understand the concerns about deviating from the current method of creating objects via DAG file parsing. However, I'd like to briefly outline our use case in a multi-tenant Airflow setup and seek your guidance or alternative solutions. In our organization, we're dealing with multiple Airflow tenants, let's say Tenant 1 and Tenant 2 as examples. Our goal is to create a more interconnected workflow between these tenants, specifically focusing on Datasets.
Our Current Setup:
Proposed Workflow:
Here's a breakdown of our proposed workflow:
Why This Matters:
Do you see an alternative method within the current Airflow framework that could achieve this goal? We're open to different approaches as long as they provide a practical solution for our use case. I will be sending a proposal to the devlist so this can be discussed further, as suggested. Thank you again for taking the time to consider this proposal; I look forward to any suggestions or guidance you might have.
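To make the cross-tenant idea concrete, here is a minimal sketch of what the notifying side could look like, assuming a hypothetical dataset-event endpoint on Tenant 1. The endpoint path, payload shape, host, and credentials are illustrative assumptions mirroring what this discussion asks for, not an existing Airflow API:

```python
# Illustrative only: posts a "dataset updated" signal from Tenant 2 to Tenant 1.
# The /datasets/events endpoint and payload shown here are hypothetical.
from typing import Optional

import requests

TENANT_1_API = "https://tenant1.example.com/api/v1"  # placeholder base URL


def notify_dataset_update(dataset_uri: str, extra: Optional[dict] = None) -> None:
    response = requests.post(
        f"{TENANT_1_API}/datasets/events",
        json={"dataset_uri": dataset_uri, "extra": extra or {}},
        auth=("api_user", "api_password"),  # placeholder credentials
        timeout=30,
    )
    response.raise_for_status()


# Called from a task (or an on_success_callback) in Tenant 2 once the data lands:
# notify_dataset_update("s3://shared-bucket/orders/2024-01-01", {"rows": 1234})
```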
-
I would just like to mention that this (i.e. the possibility to update or add a Dataset event via API, web UI, whatever) can also be very helpful in a single-Airflow environment, in cases where you run "daily" dataset-scheduled DAGs - see my issue and question here: #36618. In my case, the possibility to edit the DAG's dataset schedule time (i.e. the last time that is considered for the datasets to be presented as ready) would also be helpful.
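For context, this is roughly what a dataset-scheduled consumer DAG looks like with standard Airflow 2.4+ syntax (the URI and ids below are made up). Today only another DAG's outlet can mark the dataset as updated; an API or UI action that records the same event is what is being requested here:

```python
from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator
import pendulum

reports_dataset = Dataset("s3://example-bucket/daily-report.csv")  # illustrative URI

with DAG(
    dag_id="daily-report-consumer",  # made-up DAG id
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule=[reports_dataset],  # runs whenever the dataset receives an update event
    catchup=False,
):
    EmptyOperator(task_id="consume")
```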
-
We were eagerly waiting for the MR as it would solve a lot of our problems. We have a very large flow of data in our company, of which a big part is still being handled "the old-fashioned way". Also, we often have more than one DAG awaiting the end of an external workflow, and that means we have to put a DAG in between to trigger all the related DAGs. Because we don't control the code that does the API triggering, we'd have to ask for a cross-team code update every time a new DAG was added (if we didn't add the in-between DAG). All this to say: while this works, it is a mess to maintain, and having external Dataset support would be a blessing. I get that there is a big discussion about this, and it is a huge change, but for our (the company's) way of working, being able to use datasets has made life so much easier. As a remark on the side (not related to this discussion, but I wanted to share): we've also overridden …
-
For people wanting to use something now already, we've been emulating this functionality via a DAG that receives a set of Datasets as a string parameter (JSON) and adds them to the task as outlets. It looks pretty much like this (imports added for completeness):

```python
from airflow import DAG
from airflow.datasets import Dataset
from airflow.decorators import task
from airflow.models.param import Param

with DAG(
    "datasets-dispatcher",
    # ... (other DAG arguments omitted)
    params={
        'datasets': Param('', description='Datasets (JSON list)', type='string')
    }
):
    @task(task_id="dispatch")
    def dispatcher(params: dict = None, task=None):
        from json import loads

        # The param arrives as a JSON string; decode it before building the outlets.
        datasets = (params or {}).get('datasets', [])
        if datasets:
            datasets = loads(datasets)
        # Registering the outlets makes the task emit a dataset event
        # for each entry when it finishes successfully.
        task.add_outlets([Dataset(**ds) for ds in datasets])

    dispatcher()
```

The received JSON for the …
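For completeness, a dispatcher DAG like this can be triggered from outside Airflow with the existing stable REST API. The `POST /api/v1/dags/{dag_id}/dagRuns` endpoint is real; the host, credentials, and dataset URIs below are placeholders, and the `conf` shape is simply an assumption matching the `datasets` param declared above (with the default `core.dag_run_conf_overrides_params=True`, the conf value ends up in `params['datasets']`):

```python
# Trigger the dispatcher DAG through Airflow's stable REST API.
import json

import requests

payload = {
    "conf": {
        # Must be a JSON *string*, because the DAG declares `datasets` as a string param.
        "datasets": json.dumps([
            {"uri": "s3://example-bucket/table_a"},
            {"uri": "s3://example-bucket/table_b"},
        ])
    }
}

resp = requests.post(
    "https://airflow.example.com/api/v1/dags/datasets-dispatcher/dagRuns",
    json=payload,
    auth=("api_user", "api_password"),  # placeholder credentials
    timeout=30,
)
resp.raise_for_status()
```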
-
Hello everyone. I appreciate the diverse perspectives shared in this discussion on API endpoints for dataset creation and updating in Airflow. Let me address some of the key points raised:
@potiuk, I recognize the significance of the traditional method of defining DAG objects via code and agree that any deviation from this needs careful consideration. However, as @cmarteepants and others have highlighted, there's a growing need for Airflow to adapt to more dynamic and interconnected workflows, particularly in multi-tenant environments.
Agreeing with many here, an API endpoint to trigger Dataset updates fits well with Airflow's existing features and addresses a critical need for more flexible scheduling. As @Blizzke and @funes79 pointed out, the ability to handle dataset events via API would greatly simplify and improve the efficiency of workflows that depend on external data sources or cross-team interactions, not to mention the impact it would have on facilitating synchronization with other technologies. Focusing on the update endpoint first seems sensible given its broader acceptance.
On creating datasets via API, despite some reservations raised by @jedcunningham and @RNHTTR, I think it's a path worth exploring. As @dantheman0207 points out, the ability to sync datasets across different technologies is essential, and API creation of datasets could significantly boost Airflow's integration power and workflow orchestration.
Here's my take on this: data-aware scheduling was a transformative step for Airflow because it acknowledged data as the primary workflow trigger. This proposal is essentially an extension of that concept, aiming to further decouple Airflow from the assumption that only DAGs can influence datasets. Strategically, it would enable Airflow to serve a wider array of use cases and technological setups, allowing it to embrace the diversity of data sources and workflows, something particularly interesting and necessary for larger organizations.
-
We also need something like this feature in multi-tenant environments, to publish dataset events for a targeted user group in a given environment via the REST API. Targeted user groups are dynamic for each environment and can be created/added periodically in the system of record, so when a dataset does not exist yet, creating a dataset event via the POST API fails with a "dataset not exists" error.
-
Description
I would like to propose the addition of new API endpoints for creating and updating datasets in Airflow. This feature would be a valuable extension to the current dataset capabilities and would align with the direction Airflow is heading, especially considering the dataset listeners introduced in Airflow 2.8.
Proposed Changes:
Use case/motivation
In a multi-instance Airflow architecture, managing dataset dependencies across instances can be challenging, as we are currently experiencing in our organization.
This feature also aligns with the recent advancements in Airflow 2.8, particularly with the introduction of dataset listeners. These developments have opened the door for improved cross-instance dataset awareness, an area where this proposal would be extremely beneficial.
We believe that with the introduction of these new endpoints, Airflow would offer a more efficient and streamlined approach to cross-instance dataset-aware scheduling. This enhancement would not only benefit our organization but also the broader Airflow community, as it is likely a common challenge already faced by many, and one that more users will encounter in the future.
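Purely as an illustration of what calls to such endpoints might look like (the paths, payload shapes, host, and credentials below are assumptions made for this sketch; neither endpoint exists in Airflow today, and this is not a spec from the proposal itself):

```python
# Hypothetical sketch of the proposed "create dataset" and "create dataset event" calls.
import requests

BASE = "https://airflow.example.com/api/v1"  # placeholder host
AUTH = ("api_user", "api_password")          # placeholder credentials

# 1. Register a dataset so DAGs in this instance can schedule on it.
requests.post(
    f"{BASE}/datasets",
    json={"uri": "s3://example-bucket/orders/", "extra": {"owner": "team-data"}},
    auth=AUTH,
    timeout=30,
).raise_for_status()

# 2. Record an update event for that dataset, e.g. when an external system
#    (or another Airflow instance) has finished producing the data.
requests.post(
    f"{BASE}/datasets/events",
    json={"dataset_uri": "s3://example-bucket/orders/", "extra": {"run": "2024-01-31"}},
    auth=AUTH,
    timeout=30,
).raise_for_status()
```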
Related issues
This feature complements the discussions and contributions already seen in the community, especially those related to enhancing dataset management and integration in Airflow.
There have been some ongoing discussions and contributions on GitHub, e.g. #36308 #29162, including a previously closed Pull Request (#29433).
These discussions highlight the community's interest in and need for enhanced dataset management capabilities.
Are you willing to submit a PR?
Code of Conduct