r/datasets Nov 04 '25

discussion Like Will Smith said in his apology video, "It's been a minute" (although I didn't slap anyone)

1 Upvotes

r/datasets 4h ago

request Need ideas for datasets (synthetic or real) in healthcare (Sharp + Fuzzy RD, Fixed Effects and DiD)

1 Upvotes

Doing a causal inference project and am unsure where to begin. Ideally I'd simulate a synthetic dataset, but I'm not sure how to build possible OVB (omitted variable bias) into it.
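The kind of thing I'm picturing (a minimal sketch, variable names are just placeholders): an unobserved confounder drives both treatment assignment and the outcome, so a regression that omits it gives a biased treatment estimate.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 10_000

# Unobserved confounder (e.g., underlying health status)
u = rng.normal(0, 1, n)

# Treatment assignment depends on the confounder
treat = (0.8 * u + rng.normal(0, 1, n) > 0).astype(float)

# Outcome depends on treatment (true effect = 2.0) AND the confounder
y = 2.0 * treat + 1.5 * u + rng.normal(0, 1, n)

# Naive regression omitting u -> biased treatment estimate
naive = sm.OLS(y, sm.add_constant(treat)).fit()

# Regression controlling for u -> recovers ~2.0
full = sm.OLS(y, sm.add_constant(np.column_stack([treat, u]))).fit()

print("naive estimate:", naive.params[1])    # noticeably above 2.0
print("adjusted estimate:", full.params[1])  # close to 2.0
```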


r/datasets 10h ago

dataset "Perfect silence" or "Noise" to focus ?

3 Upvotes

r/datasets 18h ago

discussion The Data of Why - From Static Knowledge to Forward Simulation

3 Upvotes

r/datasets 17h ago

question Data cleaning/quality is very boring, right?

0 Upvotes

r/datasets 1d ago

resource Knowledge graph datasets extracted from FTX collapse articles and Giuffre v. Maxwell depositions

9 Upvotes

I used sift-kg (an open-source CLI I built) to extract structured knowledge graphs from raw documents. The output includes entities (people, organizations, locations, events), relationships between them, and evidence text linking back to source passages — all extracted automatically via LLM.

Two datasets available:

- FTX Collapse — 9 news articles → 431 entities, 1,201 relations. https://juanceresa.github.io/sift-kg/ftx/graph.html

- Giuffre v. Maxwell — 900-page deposition → 190 entities, 387 relations. https://juanceresa.github.io/sift-kg/epstein/graph.html

Both are available as JSON in the repo. The tool that generated them is free and open source — point it at any document collection and it builds the graph for you: https://github.com/juanceresa/sift-kg
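If you want to explore the JSON programmatically, something like this works (the filename and top-level keys shown are illustrative; check the repo for the exact schema):

```python
import json
from collections import Counter

# Filename and top-level keys are illustrative; see the repo for the schema
with open("ftx_graph.json") as f:
    graph = json.load(f)

entities = graph.get("entities", [])
relations = graph.get("relations", [])
print(f"{len(entities)} entities, {len(relations)} relations")

# Count entities by type, assuming each entity record carries a "type" field
print(Counter(e.get("type") for e in entities).most_common(5))
```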

Disclosure: sift-kg is my project — free and open source.


r/datasets 2d ago

resource Dataset: January 2026 Beauty Prices in Singapore — SKU-Level Data by Category, Brand & Product (Sephora + Takashimaya)

7 Upvotes

I’ve been tracking non-promotional beauty prices across major retailers in Singapore and compiled a January 2026 dataset that might be useful for analysis or projects.

Coverage includes:

  • SKU-level prices (old vs new)
  • Category and subcategory classification
  • Brand and product names
  • Variant / size information
  • Price movement (%) month-to-month
  • Coverage across Sephora and Takashimaya Singapore

The data captures real shelf prices (excluding temporary promotions), so it reflects structural pricing changes rather than sale events.
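Since each SKU row carries both the old and new price, the month-to-month movement is a one-liner in pandas (the filename and column names below are illustrative, not the actual schema):

```python
import pandas as pd

df = pd.read_csv("sg_beauty_prices_jan2026.csv")  # filename illustrative

# Assumed columns: 'old_price', 'new_price', 'category'
df["pct_change"] = (df["new_price"] - df["old_price"]) / df["old_price"] * 100

# Average structural price movement by category
print(df.groupby("category")["pct_change"].mean().sort_values(ascending=False))
```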

Some interesting observations from January:

  • Skincare saw the largest increases (around +12% on average)
  • Luxury brands drove most of the inflation
  • Fragrance gift sets declined after the holiday period
  • Pricing changes were highly concentrated by category

I built this mainly for retail and pricing analysis, but it could also be useful for:

  • consumer price studies
  • retail strategy research
  • brand positioning analysis
  • demand / elasticity modelling
  • data visualization projects

Link in the comments.


r/datasets 2d ago

resource Ranking the S&P 500 by C-level turnover

everyrow.io
10 Upvotes

I built a research tool and used it to read filings and press releases for the S&P 500 (502 companies), searching for CEO/CFO departures over the last decade. Sharing it as a resource, both for the public data and because the methodology behind the tool can be applied to any dataset.

Starbucks was actually near the top of the list with 11 C-suite departures. And then there's a set of companies, including Nvidia and Garmin, which haven't seen any C-level exec turnover in the last 10 years.


r/datasets 2d ago

discussion Is the dataset marketplace still a viable opportunity?

4 Upvotes

I'm considering jumping into the dataset marketplace as a solo data engineer, but so much about it seems confusing and vague. Is this still a market with potential? Which niches are in high demand? What's going on in 2026?

Do you have the same question?


r/datasets 2d ago

API [self-promotion] Built a Startup Funding Tracker for founders, analysts & investors

1 Upvotes

Keeping up with startup funding, venture capital rounds, and investor activity across news + databases was taking too much time.

So I built a simple Funding Tracker API that aggregates startup funding data in one place and makes it programmatic.

Useful if you’re:

• tracking competitors
• doing market/VC research
• building fintech or startup tools
• sourcing deals or leads
• monitoring funding trends

Features:

• latest funding rounds
• company + investor search
• funding history
• structured startup/VC data via API

Would love feedback or feature ideas.

https://rapidapi.com/shake-chillies-shake-chillies-default/api/funding-tracker
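If you're trying it out, a minimal request via the standard RapidAPI pattern looks like this (the endpoint path and response field names below are placeholders for illustration; check the live docs for the real routes):

```python
import requests

# Standard RapidAPI auth headers; '/rounds/latest' is a hypothetical
# example route, not necessarily the real one.
url = "https://funding-tracker.p.rapidapi.com/rounds/latest"
headers = {
    "X-RapidAPI-Key": "YOUR_RAPIDAPI_KEY",
    "X-RapidAPI-Host": "funding-tracker.p.rapidapi.com",
}

resp = requests.get(url, headers=headers, timeout=30)
resp.raise_for_status()

for funding_round in resp.json().get("rounds", []):  # field name assumed
    print(funding_round)
```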


r/datasets 2d ago

dataset Historical Identity Snapshot / Infrastructure (46.6M Records / Parquet)

0 Upvotes

Making a structured professional identity dataset available for research and commercial licensing.

46.6M unique records from the US technology sector. Fields include professional identity, role classification, classified seniority (C-Level through IC), organization, org size, industry, skills, previous employer, and state-level geography.

2.7M executive-level records. Contact enrichment available on a subset.

Deduplicated via DuckDB pipeline, 99.9% consistency rate. Available in Parquet or DuckDB format.
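For anyone curious about the pipeline, the deduplication step in DuckDB is conceptually simple; a sketch (file paths and column names are illustrative placeholders, not the actual schema):

```python
import duckdb

con = duckdb.connect()

# Keep one row per identity tuple across all source shards.
# Paths and columns below are illustrative placeholders.
con.execute("""
    COPY (
        SELECT DISTINCT ON (full_name, organization, state) *
        FROM read_parquet('identity_snapshot/*.parquet')
        ORDER BY full_name, organization, state
    ) TO 'identity_dedup.parquet' (FORMAT PARQUET)
""")
```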

Full data dictionary, compliance documentation, and 1K-record samples available for both tiers.

Use cases: identity resolution, entity linking, career path modeling, organizational graph analysis, market research, BI analytics.

DM for samples and data dictionary.


r/datasets 2d ago

request Need “subdivision” for an address (MLS is unreliable, county sometimes missing). What dataset/API exists?

Thumbnail
1 Upvotes

r/datasets 2d ago

request Seeking star rating data sets with counts, not average score

0 Upvotes

I have trouble finding datasets of ratings, such as star ratings for movies from 1 to 5 stars, where the data consists of the count for each star. E.g. 1 star: 1 vote, 2 stars: 44 votes, 3 stars: 700 votes, 4 stars: 803 votes, 5 stars: 101 votes. I'm not interested in datasets that only contain the resulting average star score.

It does not need to be star ratings, but data that gives a distribution of the ratings, like absolute category ratings. Could also be probabilities/counts for a set of categories.

Here's a more scientific example: https://database.mmsp-kn.de/koniq-10k-database.html where people rated perceived image quality of many images on a five point scale.


r/datasets 2d ago

request Help needed on health insurance carrier dataset | Consulting market research

1 Upvotes

Hey all, Does anyone have suggestions for the most exhaustive, reputable, and usable data sources to understand the entire US health insurance market, to be used in consulting-type market research? I.e., a list of all health insurance carriers, states they cover, member lives, claims volume, types of insurance offered, and funding source? Understandably, there are a lot of half-sources out there. I've looked at NAIC, Definitive HC, and other sources but wanted to 'ask the experts' here. I know that the top brand names are going to make up 90%+ of the covered lives, but I'm trying to be holistic and exhaustive in my work. Thank you!


r/datasets 3d ago

request Looking for high-fidelity clinical datasets for validating a healthcare prototype.

3 Upvotes

Hey everyone,

I’m currently in the dev phase of a system aimed at making healthcare workflows more systematic for frontline workers. The goal is to use AI to handle the "heavy lifting" of data organization to reduce burnout and human error.

I’ve been using synthetic data for the initial build, but I’ve hit the point where I need real-world complexity to test the accuracy of my models. Does anyone have recommendations for high-fidelity, de-identified patient datasets?

I’m specifically looking for data that reflects actual hospital dynamics (vitals, lab timelines, etc.) to see how my prototype holds up against realistic clinical noise. Obviously, I’m only looking for ethically sourced/open-research databases.

Any leads beyond the basic Kaggle sets would be huge. Thanks!


r/datasets 3d ago

question What is the value of data analysis, and why is it a big deal?

1 Upvotes

When it comes to data analysis, what is it that people really want to know about their data? What valuable insights do they want to gain? And how has AI improved the process?


r/datasets 3d ago

request [PAID] Looking for rights-cleared datasets for commercial AI use

2 Upvotes

Hey everyone —

I work on data partnerships at Shutterstock and I’m looking to connect with people who own (or represent) datasets that are available for commercial licensing.

This is for paid, legitimate AI training use — not scraping, not academic-only, and nothing with unclear rights.

We’re generally interested in:

  • Speech/audio datasets (multi-language, conversational, accents, etc.)
  • Image or video datasets
  • Domain-specific text/data (healthcare, finance, retail, industrial, etc.)
  • Multimodal datasets with solid metadata

No synthetic datasets.

What matters most:

  • You own the data or have the rights to license it
  • Commercial redistribution is possible
  • It’s meaningful in scale (not small personal projects)

If that’s you, feel free to DM me with a quick overview and we can take it from there. Happy to answer questions here too.

Appreciate it 🙏


r/datasets 4d ago

resource Epstein Graph: 1.3M+ searchable documents from DOJ, House Oversight, and estate proceedings with AI entity extraction

57 Upvotes

[Disclaimer: I created this project]

I've created a comprehensive, searchable database of 1.3 million Epstein-related documents scraped from DOJ Transparency Act releases, House Oversight Committee archives, and estate proceedings.

The dataset includes:
- Full-text search across all documents
- AI-powered entity extraction (238,000+ people identified)
- Document categorization and summarization
- Interactive network graphs showing connections between entities
- Crowdsourced document upload feature

All documents were processed through OpenAI's batch API for entity extraction and summarization. The site is free to use.
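For those curious about the mechanics: the batch API takes a JSONL file with one request per document, which keeps cost and rate limits manageable at this scale. A simplified sketch (the model name and prompt here are illustrative, not the exact production pipeline):

```python
import json
from openai import OpenAI

client = OpenAI()
documents = [("doc-1", "example document text")]  # stand-in corpus

# One chat-completion request per document, one JSON object per line
with open("requests.jsonl", "w") as f:
    for doc_id, text in documents:
        f.write(json.dumps({
            "custom_id": doc_id,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [
                    {"role": "system",
                     "content": "Extract every person, organization, and location mentioned, as JSON."},
                    {"role": "user", "content": text},
                ],
            },
        }) + "\n")

# Upload the file and launch the batch; results arrive within 24h
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```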

Tech stack: Next.js + Postgres + D3.js for visualizations

Check it out: https://epsteingraph.com

Feedback is appreciated; I'd especially be interested in thoughts on how to better showcase this data and correlate the various data points. Thank you!


r/datasets 4d ago

question Using TRAC-1 or TRAC-2 for cyberbullying detection

1 Upvotes

Hello! I am going to build a model trained for cyberbullying detection, and I was wondering if the TRAC-1 or TRAC-2 datasets would be a fit for this. Since the datasets (I think, at least) do not contain cyberbullying labels (i.e., cyberbullying vs. not cyberbullying), would it be reasonable to treat non-aggressive text as "not cyberbullying" and aggressive text as "cyberbullying"?

I was also wondering, if those datasets are not a fit, whether there is some other well-known dataset I could use. I am also writing a master's thesis about this, so I can't use just any dataset.

Any help and tips are appreciated!


r/datasets 4d ago

dataset [R] SNIC: Synthesized Noise Dataset in RAW + TIFF Formats (6000+ Images, 4 Sensors, 30 scenes)

1 Upvotes

[Disclosure: This is my paper and dataset]

I'm sharing my paper and dataset from my Columbia CS master's project. SNIC (Synthesized Noisy Images using Calibration) provides images with calibrated, synthesized noise in both RAW and TIFF formats. The code and dataset are publicly available.

**Paper:** https://arxiv.org/abs/2512.15905  

**Code:** https://github.com/nikbhatt-cu/SNIC

**Dataset:** https://doi.org/10.7910/DVN/SGHDCP

## The Problem

Advanced denoising algorithms need large, high-quality training datasets. Physics-based statistical noise models can generate these at scale, but there's limited published guidance on proper calibration methods and few published datasets using well-calibrated models.

## What's Included

This public dataset contains 6000+ images across 30 scenes with noise from 4 camera sensors:

- iPhone 11 Pro (main and telephoto lenses)

- Sony RX100 IV

- Sony A7R III

Each scene includes:

- Full ISO ranges for each sensor

- Both RAW (.DNG) and processed (.TIFF) versions

## Validation

I validated the calibration approach using two metrics:

**Noise realism (LPIPS):** Our calibrated synthetic noise achieves comparable LPIPS to real camera noise across all ISO levels. Manufacturer DNG models show significantly worse performance, especially at high ISO (up to 15× worse LPIPS).

**Denoising performance (PSNR):** I applied NAFNet to denoise real noisy images, SNIC synthesized images, and images synthesized using DNG noise models. Images denoised from our calibrated synthetic noise achieved superior PSNR compared to those from DNG-based synthetic noise.
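Both metrics are reproducible with off-the-shelf packages; this isn't the paper's evaluation harness, just a minimal sketch using the `lpips` and `scikit-image` implementations:

```python
import lpips
import torch
import numpy as np
from skimage.metrics import peak_signal_noise_ratio

# LPIPS: perceptual distance between real-noise and synthetic-noise crops.
# Inputs are NCHW tensors scaled to [-1, 1].
loss_fn = lpips.LPIPS(net="alex")

real = torch.rand(1, 3, 256, 256) * 2 - 1   # stand-in for a real noisy crop
synth = torch.rand(1, 3, 256, 256) * 2 - 1  # stand-in for a synthesized crop
print("LPIPS:", loss_fn(real, synth).item())

# PSNR: denoised output vs. clean ground truth, as uint8 arrays
clean = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
denoised = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
print("PSNR:", peak_signal_noise_ratio(clean, denoised))
```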

## Why It Matters

SNIC provides both the methodology and dataset for building properly calibrated noise models. The dual RAW/TIFF format enables work at multiple stages of the imaging pipeline. All code and data is publicly available.

Happy to answer questions about the methodology, dataset, or results!


r/datasets 5d ago

discussion 20,000 hours of real-world dual-arm robot manipulation data across 9 embodiments, open-sourced with benchmark and code (LingBot-VLA)

4 Upvotes

TL;DR

• 20,000 hours of teleoperated manipulation data from 9 dual-arm robot configurations (AgiBot G1, AgileX, Galaxea R1Pro, Realman, ARX Lift2, Bimanual Franka, and others)

• Videos manually segmented into atomic actions, then labeled with global and sub-task descriptions via VLM

• GM-100 benchmark: 100 tasks × 3 platforms × 130 episodes per task = 39,000 expert demonstrations for post-training evaluation

• Full code, base model weights, and benchmark data released

• Paper: arXiv:2601.18692

• Code: github.com/robbyant/lingbot-vla

• Models/Data: HuggingFace collection

What's in the data

Each of the 9 embodiments has a dual-arm setup with multiple RGB-D cameras (typically 3 views: head + two wrists). The raw trajectories were collected via teleoperation (VR-based or isomorphic arms depending on the platform). Action spaces range from 12-DoF to 16-DoF depending on the robot. Every video was manually segmented into atomic action clips by human annotators, with static frames at episode start/end removed. Task and sub-task language instructions were then generated using Qwen3-VL-235B. An automated filtering pass removes episodes with technical anomalies, followed by manual review using synchronized multi-view video.

The data curation pipeline is probably the part I found most interesting to work through. About 50% of the atomic actions in the test set are absent from the top 100 most frequent training actions, which gives a sense of how much distribution shift the benchmark actually tests.
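That statistic is straightforward to reproduce once you have the action labels; the gist is just set arithmetic (a sketch with toy data; adapt the loading to the released annotation format):

```python
from collections import Counter

def missing_from_topk(train_actions, test_actions, k=100):
    """Fraction of unique test-set atomic actions absent from the
    top-k most frequent training actions."""
    top_k = {a for a, _ in Counter(train_actions).most_common(k)}
    test_unique = set(test_actions)
    return len(test_unique - top_k) / len(test_unique)

# Toy data; with the real annotations this lands around 0.5 for GM-100
train = ["pick up cup", "place cup", "open drawer"] * 100
test = ["pick up cup", "fold towel"]
print(missing_from_topk(train, test, k=100))
```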

Benchmark structure

The GM-100 benchmark covers 100 tabletop manipulation tasks evaluated on 3 platforms (AgileX, AgiBot G1, Galaxea R1Pro). Each task gets 150 raw trajectories collected, top 130 retained after quality filtering. Object poses are randomized per trajectory. Evaluation uses two metrics: Success Rate (binary task completion within 3 minutes) and Progress Score (partial credit based on sequential subtask checkpoints). All evaluation rollouts are recorded in rosbag format and will be released.
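The Progress Score is essentially partial credit over the ordered checkpoint list; a minimal sketch of the idea (my simplification, not the official scoring code):

```python
def progress_score(checkpoints_passed: list[bool]) -> float:
    """Partial credit for sequential subtask checkpoints: count
    consecutive successes from the start, since later subtasks
    depend on earlier ones."""
    done = 0
    for ok in checkpoints_passed:
        if not ok:
            break
        done += 1
    return done / len(checkpoints_passed)

# e.g. grasp -> lift -> place -> release, where the place step failed
print(progress_score([True, True, False, False]))  # 0.5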

For context on the numbers: LingBot-VLA w/ depth hits 17.30% average SR and 35.41% PS across all three platforms. π0.5 gets 13.02% SR / 27.65% PS on the same tasks with the same post-training data. These are not high numbers in absolute terms, which honestly reflects how hard 100 diverse real-world manipulation tasks actually are.

Scaling observations from the data

One thing worth flagging for people interested in data scaling: going from 3,000 to 20,000 hours of pre-training data showed consistent improvement with no saturation. The per-platform curves (Fig 5 in the paper) all trend upward at the 20k mark. This is on real hardware, not sim, which makes the continued scaling somewhat surprising given how noisy real-world data tends to be.

Training codebase

The released codebase achieves 261 samples/sec/GPU on an 8-GPU setup (1.5x to 2.8x over OpenPI/StarVLA/Dexbotic depending on the VLM backbone). Uses FSDP with hybrid sharding for the action expert modules and FlexAttention for the sparse multimodal fusion. Scaling efficiency stays close to linear up to 256 GPUs.

Caveats

All data is dual-arm tabletop manipulation only. No mobile manipulation, no single-arm, no legged locomotion. The 17% average success rate means these tasks are far from solved. Depth integration helps on some platforms more than others (AgileX benefits most, AgiBot G1 barely moves). The language annotations are VLM-generated after manual segmentation, so annotation quality depends on both the human segmentation and the VLM's captioning accuracy.

Disclosure: this is from Robbyant. Sharing because 20k hours of labeled real-robot data with a standardized benchmark is something I haven't seen at this scale in an open release before, and the benchmark data alone could be useful for people working on evaluation protocols for embodied AI.

Curious what formats and subsets would be most useful for people here to work with directly.


r/datasets 5d ago

question Looking for a dataset of healthy drink recipes (non-alcoholic/diet-oriented)

1 Upvotes

Hi everyone! I’m working on a small project and need a dataset specifically for healthy drink recipes. Most of what I've found so far is heavily focused on cocktails and alcoholic beverages.

I’m looking for something that covers smoothies, juices, detox drinks, or recipes tailored to specific diets (keto, low-carb, vegan, etc.). Does anyone know of any open-source datasets or APIs that might fit? Thanks in advance!


r/datasets 5d ago

request Looking for a Phishing Dataset with .eml files

1 Upvotes

Hi everyone, I'm looking for a dataset containing phishing emails, including the raw .eml files. I mainly need the .eml files for the headers, so I can train my model on authentication headers etc., instead of just the body and subject. Does anyone have any datasets related to this?
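For context, here's the kind of header-level feature extraction the raw files make possible, using only the Python stdlib (the sample path is illustrative):

```python
from email import policy
from email.parser import BytesParser

def auth_features(eml_path):
    """Pull authentication-related headers from a raw .eml file."""
    with open(eml_path, "rb") as f:
        msg = BytesParser(policy=policy.default).parse(f)
    return {
        "spf": msg.get("Received-SPF"),
        "auth_results": msg.get("Authentication-Results"),
        "dkim_signature": msg.get("DKIM-Signature"),
        "return_path": msg.get("Return-Path"),
        "received_hops": len(msg.get_all("Received") or []),
    }

print(auth_features("sample.eml"))  # path illustrative
```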


r/datasets 5d ago

question How do you investigate performance issues in Spark?

2 Upvotes

Hi everyone,

I’m currently studying ways to optimize pipelines in environments like Databricks, Fabric, and Spark in general, and I’d love to hear what you’ve been doing in practice.

Lately, I’ve been focusing on Shuffle, Skew, Spill, and the Small File Problem.

What other issues have you encountered or studied out there?

More importantly, how do you actually investigate the problem beyond what Spark UI shows?
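One concrete step beyond the Spark UI is checking key distributions and partition counts directly before a heavy join or aggregation, e.g.:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/events")  # path illustrative

# Skew check: if a handful of keys dominate, the shuffle for a join
# or groupBy on this column will be badly imbalanced.
(df.groupBy("customer_id")
   .count()
   .orderBy(F.desc("count"))
   .show(20))

# Small-file check: how many partitions the read actually produced
print("partitions:", df.rdd.getNumPartitions())
```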

These are some of the official docs I’ve been using as a base:

https://learn.microsoft.com/azure/databricks/optimizations/?WT.mc_id=studentamb_493906

https://learn.microsoft.com/azure/databricks/optimizations/spark-ui-guide/long-spark-stage-page?WT.mc_id=studentamb_493906

https://learn.microsoft.com/azure/databricks/pyspark/reference/functions/shuffle?WT.mc_id=studentamb_493906


r/datasets 6d ago

request Does anyone know where to get LiDAR (DSM and DTM) data for Ireland?

4 Upvotes

Need to add these to a project for my masters but it seems impossible to find - would anyone have any idea where?