Since 7/19/2015, @edublancas has earned 75 karma points across 19 contributions.
Recent @edublancas Activity
Hi all! This is Eduardo, Ploomber co-founder. I'm excited to show the HN community what we've been working on! We'd love to get your feedback, so please give it a try and let us know what you think!
This is great feedback, thanks a lot!
I'll make sure we display "extract_upstream" more prominently in the docs; we've been getting this feedback a couple of times now :)
Re: the Jupyter extension: it injects the cell when the file you're opening is declared in the pipeline.yaml file. You can turn off the extension if you prefer.
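For reference, here's a minimal pipeline.yaml sketch (the paths are hypothetical); the extension only injects the cell into files declared as task sources like this:

    # minimal pipeline.yaml sketch (hypothetical paths)
    tasks:
      - source: scripts/clean.py     # opening this file in Jupyter triggers the cell injection
        product:
          nb: output/clean.ipynb     # executed copy of the script, saved as a notebook
          data: output/clean.csv     # data file the script writes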
Feel free to join our community, this feedback helps us make Ploomber better!
(ploomber maintainer here)
Any feedback for us? What can we do to improve Ploomber?
One of my deal breakers when choosing tooling is how easy it is to move from a local environment to a distributed environment. Ideally, you want to start locally and move to a distributed env only if you need to. So choose a tool that allows you to get started quickly and move from there.
As an example: one of the reasons I don't use Kubeflow is that it requires having a Kubernetes cluster up and running, which is overkill in many cases.
Check out the project I'm working on: https://github.com/ploomber/ploomber
Congrats on the launch! As a former data scientist, I suffered from bad data on a daily basis. Can you provide some details on how anomalies are detected? Is it a threshold-based approach defined by the user, or are you running statistical analysis on users' data? Curious to learn more!
On the Myths and Problems of Jupyter Notebooks
2 points • 1 comment
Congrats on the launch! As a former data scientist, it pumps me up to see more notebook-centric tooling, as I believe notebooks are the best environment for data exploration and rapid iteration.
We're working on notebook tooling as well (https://github.com/ploomber/ploomber), but our focus is at the macro level, so to speak (how to develop projects that are made up of several notebooks). In recent conversations with data teams, the question "how do you ensure the code quality of each notebook?" has come up a lot, and it's great to see you tackling that problem. It'll be exciting to see people using both MutableAI and Ploomber! Very cool stuff! I'll give it a try!
Congrats on the launch! Coming from a data science role, this could've been pretty useful for my previous projects. I had to rewrite all of my feature engineering queries when the company I worked at moved to Snowflake.
One question I have is how Hydra balances writing standard Postgres scripts vs. leveraging system-specific features. For example, I remember going through Snowflake's documentation and finding interesting functions for data aggregation. Can I leverage Snowflake-specific features when using Hydra?
Interesting, can you share a link to the docs? I'd like to learn more about their approach.
Thanks! Yes, we know Deepnote! We want to focus on pipelines and deployment, so we currently integrate with Jupyter distributions (as long as they keep the standard format); we have users who run Ploomber on JupyterHub, Domino, SageMaker, and others. I don't think any of our users runs on Deepnote, but it should work as well. We're thinking of ourselves as the "backend": users can develop their notebooks in whatever distribution they use; Ploomber will provide the tools to help them build modular pipelines, and we'll help them deploy those pipelines. I'd love to learn more about your use case; please ping me at eduardo@ploomber.io
I'm not a DVC user, so I'll speak to what I've seen in the documentation and the couple of examples I ran a while ago. DVC's core is data versioning, and the pipeline features are an extension of it. The main difference is that DVC's pipeline feature is agnostic: you define the command, inputs, and outputs, and DVC executes the pipeline. On the other hand, Ploomber has a deeper integration between your code and the pipeline. For example, our SQL integration allows you to tell Ploomber how to connect to a database and then list a bunch of SQL files as stages in your pipeline (example: https://github.com/ploomber/projects/blob/master/templates/s...). This reduces the boilerplate a lot since you only have to write SQL; if you wanted to do the same thing with DVC, you'd have to manage the connections and create bash scripts to submit the queries.
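To make that concrete, here's a sketch of what the SQL setup looks like in pipeline.yaml (the paths and the clients.get_client function are hypothetical; you'd define a function that returns a database client):

    # sketch: SQL tasks in pipeline.yaml (hypothetical names)
    clients:
      SQLScript: clients.get_client      # dotted path to a function returning a DB client

    tasks:
      - source: sql/clean.sql            # just SQL, no connection boilerplate
        product: [my_schema, clean, table]
      - source: sql/aggregate.sql
        product: [my_schema, agg, table]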
The other important difference is that, AFAIK, DVC can only run your pipelines locally, while Ploomber can export your pipelines to run in other environments (Kubernetes, Airflow, AWS, SLURM, Kubeflow). This allows you to run experiments locally but easily move to a distributed environment when you need to train models at a larger scale or want to deploy an ML pipeline.
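For a rough sense of the export workflow, a sketch using our companion soopervisor package (the environment name is hypothetical, and the exact flags may differ between versions):

    pip install soopervisor
    soopervisor add train --backend argo-workflows   # generate config to run on Kubernetes
    soopervisor export train                         # package and export the pipeline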
That totally resonates with me! I spent 6 years working as a data scientist, and notebooks just make it a lot simpler to explore and interact with the data, so I totally understand why my data science peers stick with notebooks.
Having said that, the challenge now is to hit a sweet spot between keeping the Jupyter interactive experience and providing some features to help data scientists develop modular work. That's where most frameworks fail, so we want to keep our eyes open and get feedback from both scientists and engineers to develop something that works for everyone.
We allow people to write pipeline tasks in .py files but open them as notebooks in Jupyter. They keep the same workflow they're used to, but under the hood they're writing .py files, so they can do code reviews (jupytext handles the .py to .ipynb conversion). Also, when executing the pipeline, Ploomber generates an output report for each script, so teams can use this to review any outputs generated by the code.
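For illustration, a task script might look like this on disk (jupytext's percent format; the upstream task name and paths are hypothetical). Jupyter renders each # %% marker as a cell:

    # %% tags=["parameters"]
    # declare dependencies here; at runtime, Ploomber injects a cell with the
    # actual values (upstream becomes a dict of paths, product a mapping of outputs)
    upstream = ['clean']
    product = None

    # %%
    import pandas as pd

    df = pd.read_csv(upstream['clean']['data'])

    # %%
    # ...feature engineering on df...
    df.to_csv(product['data'], index=False)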
Finally, since the pipeline is modularized, it's easier to split the work. Some people may work on data cleaning, others on feature engineering, and they can all orchestrate the pipeline with "ploomber build".
You can read more about our approach in this guest blog post we published a few months ago on the Jupyter blog: https://blog.jupyter.org/ploomber-maintainable-and-collabora...
We'd love to hear about your experience! Please send me an email at eduardo@ploomber.io
Yes! We want to help people keep enjoying Jupyter and produce tidy pipelines!
We allow users to open .py files as notebooks in Jupyter, so you get the best of both worlds: interactivity with Jupyter and nice code reviews. jupytext does the heavy lifting for us (it's a great package!), and we add some extra things to improve the experience.
More in the docs: https://docs.ploomber.io/en/latest/user-guide/jupyter.html
Thanks for your feedback! Do you have any stories to share? I'd love to hear about your experience with the notebooks-to-production gap.
Launch HN: Ploomber (YC W22) – Quickly deploy data pipelines from Jupyter/VSCode
126 points • 23 comments
The case against data versioning
2 points • 0 comments