[Feature] Automated stream creation #3

Open
opened 2025-03-09 10:18:16 +00:00 by CJ_Clippy · 0 comments

This is a key part of Futureporn's success. To gather statistics on archive status, we need to know about every stream that has ever happened, which requires a combination of automation and crowdsourcing.

This is the plan for the automation side. Each component performs a task that maximizes data ingestion into the db.

Crawler component

  • [x] For each vtuber we know about, crawl their social media posts
  • [x] Act on posts that contain CB/Fansly/OF links
  • [x] Create stream (Oban task)
  • [x] X.com (strictly X only for now; Bsky and other social media can be added in V3 or V4)
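The "act on posts which contain CB/Fansly/OF links" step can be sketched as a simple link matcher. This is an illustrative Python sketch, not the actual Elixir/Oban crawler; the pattern set is an assumption and the real crawler would need to handle shortlinks and mirror domains too.

```python
import re

# Assumed platform URL patterns; illustrative only, the real crawler
# likely needs more variants (shortlinks, tracking params, mirrors).
PLATFORM_PATTERNS = {
    "chaturbate": re.compile(r"https?://(?:\w+\.)?chaturbate\.com/\S+", re.I),
    "fansly": re.compile(r"https?://(?:\w+\.)?fansly\.com/\S+", re.I),
    "onlyfans": re.compile(r"https?://(?:\w+\.)?onlyfans\.com/\S+", re.I),
}

def find_platform_links(post_body: str) -> dict[str, list[str]]:
    """Return platform -> list of matching links found in a post body."""
    hits = {}
    for platform, pattern in PLATFORM_PATTERNS.items():
        matches = pattern.findall(post_body)
        if matches:
            hits[platform] = matches
    return hits

# A post that should trigger stream creation:
post = "going live now!! https://chaturbate.com/example_room come say hi"
print(find_platform_links(post))
# {'chaturbate': ['https://chaturbate.com/example_room']}
```

A post with no platform link returns an empty dict, so the crawler can skip it without enqueueing anything.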

API Component (/streams/new)

  • [x] Accepts an X post URL
  • [x] Fails when the X post URL is a duplicate
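The duplicate check only works if trivially different forms of the same URL compare equal. A minimal Python sketch of the idea, assuming canonicalization rules (twitter.com vs x.com, dropped query params) and using an in-memory set where the real endpoint would hit the db:

```python
from urllib.parse import urlsplit

def canonical_x_url(url: str) -> str:
    """Normalize an X post URL so trivially different forms compare
    equal (host case, www prefix, twitter.com vs x.com, query params)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower().removeprefix("www.")
    if host == "twitter.com":
        host = "x.com"
    return f"https://{host}{parts.path.rstrip('/')}"

# In-memory stand-in for the streams table; the real check would be
# a unique constraint / db lookup.
seen: set[str] = set()

def create_stream(x_post_url: str) -> bool:
    """Return True if accepted, False if the URL is a duplicate."""
    key = canonical_x_url(x_post_url)
    if key in seen:
        return False
    seen.add(key)
    return True

assert create_stream("https://x.com/someone/status/123")
assert not create_stream("https://twitter.com/someone/status/123?s=20")
```

Canonicalizing before the lookup means resubmitting the same post via a twitter.com link or with tracking params still fails as a duplicate.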

Parser components (common)

  • [x] Parses the X post body, extracting:
    • [ ] Title (LLM parser, maybe)
    • [x] X post URL
    • [x] UTC date
    • [x] Lewdtuber (reference to our db)
  • [ ] Categorization and acceptance
    • [ ] Ignore socials-reminder tweets (linktrees)
    • [ ] Ignore SFW stream announcements
    • [ ] Ignore retweets
    • [ ] Ignore links to vods
    • [ ] Ignore misc. tweets
    • [ ] Ask a human if unsure
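The categorization rules above amount to a small rule chain with a human fallback. A minimal Python sketch under assumed post fields (`body`, optional `is_retweet`); the rules here are crude placeholders, not the real heuristics:

```python
import re

# Ordered ignore rules; each is (reason, predicate on a post dict).
# These predicates are illustrative stand-ins for the real checks.
IGNORE_RULES = [
    ("retweet", lambda p: p["body"].startswith("RT @") or p.get("is_retweet", False)),
    ("linktree", lambda p: "linktr.ee" in p["body"].lower()),
    ("vod_link", lambda p: bool(re.search(r"\bvods?\b", p["body"], re.I))),
]

def categorize(post: dict) -> str:
    """Return 'accept', 'ignore:<reason>', or 'needs_human'."""
    for reason, matches in IGNORE_RULES:
        if matches(post):
            return f"ignore:{reason}"
    # A live platform link is the acceptance signal.
    if re.search(r"chaturbate\.com|fansly\.com|onlyfans\.com", post["body"], re.I):
        return "accept"
    # "Ask a human if unsure" -- everything else goes to review.
    return "needs_human"

print(categorize({"body": "RT @someone going live"}))          # ignore:retweet
print(categorize({"body": "live on https://fansly.com/x"}))    # accept
```

The explicit `needs_human` outcome keeps ambiguous tweets out of the db without silently dropping them, which matches the crowdsourcing side of the plan.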

Scrape & Cache

We need an x_posts database schema so we can cache posts in the db. The db is the source of truth, rather than Nitter. This lets Nitter be ephemeral rather than a point of failure (Nitter data loss should not cause problems).

  • [ ] Nitter
    • [ ] user accounts
    • [ ] proxies
  • [x] rss.app
  • [x] XPost database type
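One possible shape for the x_posts cache, sketched with SQLite for illustration; the real schema would presumably be an Ecto migration, and every column name here is an assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE x_posts (
    id          INTEGER PRIMARY KEY,
    url         TEXT NOT NULL UNIQUE,  -- canonical X post URL
    body        TEXT NOT NULL,         -- cached post text (db is source of truth)
    posted_at   TEXT NOT NULL,         -- UTC timestamp
    vtuber_id   INTEGER,               -- assumed reference to a vtubers table
    fetched_via TEXT                   -- e.g. 'nitter', 'rss.app'
)
""")

conn.execute(
    "INSERT INTO x_posts (url, body, posted_at, fetched_via) VALUES (?, ?, ?, ?)",
    ("https://x.com/someone/status/123", "going live!",
     "2025-03-09T10:00:00Z", "nitter"),
)

# The UNIQUE constraint rejects re-caching the same post, so a Nitter
# re-scrape after data loss is an idempotent no-op.
try:
    conn.execute(
        "INSERT INTO x_posts (url, body, posted_at) VALUES (?, ?, ?)",
        ("https://x.com/someone/status/123", "going live!",
         "2025-03-09T10:00:00Z"),
    )
except sqlite3.IntegrityError:
    print("duplicate rejected")
```

Keeping the cached body in the db is what makes Nitter disposable: a lost Nitter instance can be rebuilt and re-pointed without touching the archive.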
CJ_Clippy added the enhancement label 2025-03-09 10:18:16 +00:00
CJ_Clippy changed title from Automated stream creation to [Feature] Automated stream creation 2025-03-09 10:21:35 +00:00
CJ_Clippy added this to the 2.0 milestone 2025-03-09 10:23:36 +00:00
CJ_Clippy pinned this 2025-03-09 11:31:46 +00:00
Reference: futureporn/fp#3