Dan Ofer: Is This Feature Interesting? InterFeat: AI for Insight Discovery | PyData Tel Aviv 2025
The goal of data analysis isn't just prediction; it's often the discovery of new, actionable insights. However, identifying genuinely "interesting" phenomena – those that are novel, mechanistically plausible, and potentially useful – remains a largely manual and subjective process. Standard techniques like feature importance ranking (e.g., SHAP) or statistical significance testing often highlight known or trivial relationships.
This talk presents InterFeat, a systematic framework designed to automate the discovery of interesting hypotheses from structured data. We'll walk through the pipeline's components:
Utility Filters: Using statistical tests and model-based importance (ML models) to identify features with predictive value.
Novelty Filters: Leveraging knowledge graphs (like SemMedDB) and large-scale literature databases (like PubMed) to automatically screen out well-established associations.
LLM Annotation Layer: Employing retrieval-augmented Large Language Models (LLMs like GPT-4) to assess the novelty and plausibility of remaining candidates, integrating vast domain knowledge and generating natural language explanations for why a feature might be interesting.
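The three stages above can be sketched on toy data as follows. This is only a minimal illustration under stated assumptions, not the InterFeat implementation: the function names, thresholds, feature names, and co-occurrence counts are all hypothetical, a plain Pearson correlation stands in for the statistical and model-based utility tests, a dictionary of made-up counts stands in for SemMedDB/PubMed lookups, and the last step only builds a prompt string rather than calling any LLM.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sx * sy)


def utility_filter(features, target, min_abs_corr=0.3):
    """Keep features whose absolute correlation with the target is high enough.

    Hypothetical stand-in for the talk's statistical / model-based utility tests.
    """
    scores = {name: pearson(values, target) for name, values in features.items()}
    return {name: r for name, r in scores.items() if abs(r) >= min_abs_corr}


def novelty_filter(candidates, outcome, cooccurrence_counts, max_count=5):
    """Drop features already well documented alongside the outcome.

    `cooccurrence_counts` is an assumed lookup of (feature, outcome) -> number
    of known mentions, standing in for knowledge-graph or literature queries.
    """
    return [
        name
        for name in candidates
        if cooccurrence_counts.get((name, outcome), 0) <= max_count
    ]


def build_annotation_prompt(feature, outcome, correlation, snippets):
    """Assemble a retrieval-grounded prompt for an LLM (no model is called here)."""
    context = "\n".join(f"- {s}" for s in snippets) or "- (no literature retrieved)"
    return (
        f"Feature: {feature}\nOutcome: {outcome}\n"
        f"Observed correlation: {correlation:.2f}\n"
        f"Retrieved context:\n{context}\n"
        "Rate the novelty and plausibility of this association and explain why."
    )


# Toy data: every value below is invented purely for illustration.
target = [1, 1, 1, 0, 0, 0, 1, 0]
features = {
    "smoking": [1, 1, 1, 0, 0, 0, 0, 0],
    "sleep_score": [1, 1, 0, 0, 0, 0, 1, 0],
    "shoe_size": [1, 0, 1, 0, 1, 0, 0, 1],
}
counts = {("smoking", "disease"): 12000, ("sleep_score", "disease"): 2}

useful = utility_filter(features, target)
novel = novelty_filter(useful, "disease", counts)
prompt = build_annotation_prompt(novel[0], "disease", useful[novel[0]], [])
print(novel)
print(prompt)
```

In this toy run the uncorrelated feature is dropped by the utility step, the heavily documented one by the novelty step, and only the remaining candidate reaches the prompt. Ordering the stages from cheapest to most expensive is one reasonable design choice, since it limits how many candidates ever need an LLM call.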
We'll showcase InterFeat's application to identifying novel disease risk factors in the large-scale UK Biobank dataset, demonstrating its ability to recover factors years before they appeared in literature and achieve significantly higher rates of expert-validated interesting features compared to baseline methods.
Key Takeaways for the PyData Audience:
A formal definition and operationalization of "interestingness" beyond mere prediction.
Practical techniques for integrating ML, KGs, text mining, and LLMs in a discovery pipeline.
How to use LLMs effectively for hypothesis evaluation and explanation, grounded in data and external knowledge sources (mitigating hallucination).
Insights from applying this framework to real-world, large-scale data.
Access to the open-source codebase (github.com/ddofer/InterFeat) to adapt the pipeline for their own research or business problems.
Speaker Bio:
Dan Ofer received a B.Sc. in psychobiology in 2013 and a dual M.Sc. in bioinformatics and neurobiology from The Hebrew University. He is currently a PhD candidate with Professors Dafna Shahaf and Michal Linial, and has been an AI researcher in industry since 2015. Previously, at SparkBeyond/McKinsey, he developed AI solutions in multiple industries, including insurance, finance, healthcare, and novel biomarker discovery with CRI. His research interests include biological foundation models, explainable AI, automated feature engineering on tabular data, protein LLMs, and AI in healthcare.
Passionate bookworm, geek, and photographer.
Follow PyData Tel Aviv on:
https://www.meetup.com/pydata-tel-aviv/
https://www.linkedin.com/company/pydata-tlv
https://x.com/PyDataTLV
https://www.facebook.com/PyDataTLV
PyData is an educational program of NumFOCUS, a 501(c)(3) non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.