Scan 4chan archives for Stable Diffusion-related threads:
- Wednesday, 1 March 2023
- Small update to strip_links.py to detect and fix broken addresses.
Search_Pages.py
Change the value of current_page to how many pages back you want to scan (it is set to 5 by default).
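For reference, the setting looks roughly like this near the top of Search_Pages.py (a minimal sketch; only the current_page name and the value 5 come from the notes above, the rest is illustrative):

```python
# Search_Pages.py - how far back to go through each archive's search results.
# Higher values scan more pages and take longer.
current_page = 5  # number of pages back to scan
```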
Needs:
scan_sites.txt
Search settings for the 4chan archives. Edit as you want. Change the date in the links.
take_only.txt
No need to edit. It just lists all the boards.
```
https://archived.moe/3/thread/
https://archived.moe/a/thread/
https://archived.moe/aco/thread/
https://archived.moe/adv/thread/
https://archived.moe/an/thread/
https://archived.moe/asp/thread/
https://archived.moe/b/thread/
https://archived.moe/bant/thread/
https://archived.moe/biz/thread/
https://archived.moe/c/thread/
https://archived.moe/can/thread/
https://archived.moe/cgl/thread/
https://archived.moe/ck/thread/
https://archived.moe/cm/thread/
https://archived.moe/co/thread/
https://archived.moe/cock/thread/
https://archived.moe/con/thread/
https://archived.moe/d/thread/
https://archived.moe/diy/thread/
https://archived.moe/e/thread/
https://archived.moe/f/thread/
https://archived.moe/fa/thread/
https://archived.moe/fap/thread/
https://archived.moe/fit/thread/
https://archived.moe/fitlit/thread/
https://archived.moe/g/thread/
https://archived.moe/gd/thread/
https://archived.moe/gif/thread/
https://archived.moe/h/thread/
https://archived.moe/hc/thread/
https://archived.moe/his/thread/
https://archived.moe/hm/thread/
https://archived.moe/hr/thread/
https://archived.moe/i/thread/
https://archived.moe/ic/thread/
https://archived.moe/int/thread/
https://archived.moe/jp/thread/
https://archived.moe/k/thread/
https://archived.moe/lgbt/thread/
https://archived.moe/lit/thread/
https://archived.moe/m/thread/
https://archived.moe/mlp/thread/
https://archived.moe/mlpol/thread/
https://archived.moe/mo/thread/
https://archived.moe/mtv/thread/
https://archived.moe/mu/thread/
https://archived.moe/n/thread/
https://archived.moe/news/thread/
https://archived.moe/o/thread/
https://archived.moe/out/thread/
https://archived.moe/outaoc/thread/
https://archived.moe/p/thread/
https://archived.moe/po/thread/
https://archived.moe/pol/thread/
https://archived.moe/pw/thread/
https://archived.moe/q/thread/
https://archived.moe/qa/thread/
https://archived.moe/qb/thread/
https://archived.moe/qst/thread/
https://archived.moe/r/thread/
https://archived.moe/r9k/thread/
https://archived.moe/s/thread/
https://archived.moe/s4s/thread/
https://archived.moe/sci/thread/
https://archived.moe/soc/thread/
https://archived.moe/sp/thread/
https://archived.moe/spa/thread/
https://archived.moe/t/thread/
https://archived.moe/tg/thread/
https://archived.moe/toy/thread/
https://archived.moe/trash/thread/
https://archived.moe/trv/thread/
https://archived.moe/tv/thread/
https://archived.moe/u/thread/
https://archived.moe/v/thread/
https://archived.moe/vg/thread/
https://archived.moe/vint/thread/
https://archived.moe/vip/thread/
https://archived.moe/vm/thread/
https://archived.moe/vmg/thread/
https://archived.moe/vp/thread/
https://archived.moe/vr/thread/
https://archived.moe/vrpg/thread/
https://archived.moe/vdt/thread/
https://archived.moe/vt/thread/
https://archived.moe/w/thread/
https://archived.moe/wg/thread/
https://archived.moe/wsg/thread/
https://archived.moe/wsr/thread/
https://archived.moe/x/thread/
https://archived.moe/xs/thread/
https://archived.moe/y/thread/
https://archived.moe/de/thread/
https://archived.moe/rp/thread/
https://archived.moe/talk/thread/
https://desuarchive.org/a/thread/
https://desuarchive.org/aco/thread/
https://desuarchive.org/an/thread/
https://desuarchive.org/c/thread/
https://desuarchive.org/cgl/thread/
https://desuarchive.org/co/thread/
https://desuarchive.org/d/thread/
https://desuarchive.org/fit/thread/
https://desuarchive.org/g/thread/
https://desuarchive.org/his/thread/
https://desuarchive.org/int/thread/
https://desuarchive.org/k/thread/
https://desuarchive.org/m/thread/
https://desuarchive.org/mlp/thread/
https://desuarchive.org/mu/thread/
https://desuarchive.org/q/thread/
https://desuarchive.org/qa/thread/
https://desuarchive.org/r9k/thread/
https://desuarchive.org/tg/thread/
https://desuarchive.org/trash/thread/
https://desuarchive.org/vr/thread/
https://desuarchive.org/wsg/thread/
https://desuarchive.org/desu/thread/
https://desuarchive.org/meta/thread/
https://arch.b4k.co/g/thread/
https://arch.b4k.co/mlp/thread/
https://arch.b4k.co/qb/thread/
https://arch.b4k.co/v/thread/
https://arch.b4k.co/vg/thread/
https://arch.b4k.co/vm/thread/
https://arch.b4k.co/vmg/thread/
https://arch.b4k.co/vp/thread/
https://arch.b4k.co/vrpg/thread/
https://arch.b4k.co/vst/thread/
https://arch.b4k.co/meta/thread/
https://archive.palanq.win/bant/thread/
https://archive.palanq.win/c/thread/
https://archive.palanq.win/con/thread/
https://archive.palanq.win/e/thread/
https://archive.palanq.win/i/thread/
https://archive.palanq.win/n/thread/
https://archive.palanq.win/news/thread/
https://archive.palanq.win/out/thread/
https://archive.palanq.win/p/thread/
https://archive.palanq.win/pw/thread/
https://archive.palanq.win/qst/thread/
https://archive.palanq.win/toy/thread/
https://archive.palanq.win/vip/thread/
https://archive.palanq.win/vp/thread/
https://archive.palanq.win/vt/thread/
https://archive.palanq.win/w/thread/
https://archive.palanq.win/wg/thread/
https://archive.palanq.win/wsr/thread/
https://archive.palanq.win/meta/thread/
```
Saves to: links_saved.txt
Strip links from the saved threads:
strip_links.py
No need to edit. It compares the found links to the ones on my rentry and saves whatever is new.
```python
import os
import re
import requests
from bs4 import BeautifulSoup


def fix_links(text):
    # Regular expression to match common ways of bypassing link regulations (kept for reference, not used below)
    pattern = re.compile(r"(https?://\S+\.\S+)")
    # Replace common link bypasses with actual links
    text = re.sub(r"(https?://\S+)\s*<dot>\s*(\S+)", r"\1.\2", text)
    text = re.sub(r"(https?://\S+)\s*\(dot\)\s*(\S+)", r"\1.\2", text)
    text = re.sub(r"(https?://\S+)\s*\[dot\]\s*(\S+)", r"\1.\2", text)
    text = re.sub(r"(https?://\S+)\s*{dot}\s*(\S+)", r"\1.\2", text)
    text = re.sub(r"(https?://\S+)\s*\|\s*(\S+)", r"\1.\2", text)
    text = re.sub(r"(https?://\S+)\s*space\s*(\S+)", r"\1/\2", text)
    # Replace spaces in links with %20
    text = re.sub(r"(https?://\S+)\s+(\S+)", r"\1%20\2", text)
    return text


# STEP 1: Delete old files and create new ones
files = ['links_found.txt', 'links_old.txt', 'links_new.txt',
         'skipped.txt', 'skipped_2nd.txt', 'skipped_3rd.txt']
for file in files:
    if os.path.exists(file):
        os.remove(file)
    open(file, 'w').close()
print('Step 1: Deleted old files and created new ones')

# STEP 2: Find links on main website and save to links_old.txt
main_site = open('main_site.txt').read().strip()
response = requests.get(main_site, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.content, 'html.parser')
with open('links_old.txt', 'w') as f:
    for link in soup.find_all('a', href=True):
        href = link['href'].strip()
        if href.startswith('https://') or href.startswith('magnet:?xt'):
            # Fix the link and write it to the file (fix_links returns the cleaned-up string)
            fixed_href = fix_links(href)
            f.write(fixed_href + '\n')
print('Step 2: Found links on main website and saved to links_old.txt')

# STEP 3: Find links on secondary websites and save to links_found.txt
sites = open('sites.txt').read().strip().split()
skip_sites = open('skip_sites.txt').read().strip().split()
skip_extensions = open('skip_extensions.txt').read().strip().split()
force_sites = open('force_sites.txt').read().strip().split()
num_skipped_links = 0
num_added_links = 0
for site in sites:
    if site in skip_sites:
        print(f'Skipping site {site}...')
        continue
    try:
        response = requests.get(site, headers={'User-Agent': 'Mozilla/5.0'}, timeout=25)
        soup = BeautifulSoup(response.content, 'html.parser')
        added_links = []
        skipped_links = []
        for link in soup.find_all('a', href=True):
            href = link['href'].strip()
            if any(href.endswith(ext) for ext in skip_extensions):
                num_skipped_links += 1
                skipped_links.append(href)
                continue
            if href.startswith('https://') or href.startswith('magnet:?xt'):
                # Extract the bare domain (between "https://" and the next "/")
                domain = re.search(r'(?<=https://)(.+?)(?=/|$)', href)
                if domain and domain.group(1) in force_sites:
                    # Whitelisted domain: keep the link regardless of the comparison
                    added_links.append(href)
                    num_added_links += 1
                elif (href not in open('links_old.txt').read()
                      and not any(href.startswith(skip) for skip in skip_sites)
                      and not any(href.startswith(old) or href.startswith(old + '/')
                                  for old in open('links_old.txt').read().split())):
                    fixed_link = fix_links(href)
                    added_links.append(fixed_link)
                    num_added_links += 1
                else:
                    num_skipped_links += 1
                    skipped_links.append(href)
        with open('links_found.txt', 'a') as f:
            for link in added_links:
                f.write(link + '\n')
        print(f"Site {site} - Added {len(added_links)} links. Skipped {len(skipped_links)} links.")
    except Exception as e:
        print(f'Error: Could not connect to site {site}')
        print(e)
        with open('skipped.txt', 'a') as f:
            f.write(site + '\n')
print(f'Step 3: Found links on secondary websites. Skipped {num_skipped_links} links. Added {num_added_links} links.')

# STEP 4: Copy links from links_found.txt to links_new.txt
with open('links_found.txt', 'r') as f:
    found_links = [line.strip() for line in f]
new_links = []
skipped_links = []
for link in found_links:
    if link in open('links_old.txt').read():
        skipped_links.append(link)
    elif link not in new_links:
        new_links.append(link)
with open('links_new.txt', 'a') as f:
    for link in new_links:
        f.write(link + '\n')
print(f'Step 4: Copied {len(new_links)} new links from links_found.txt to links_new.txt')
if skipped_links:
    with open('skipped_2nd.txt', 'w') as f:
        f.write('\n'.join(skipped_links))
    print(f'Skipped {len(skipped_links)} duplicate links. '
          f'They were saved to skipped_2nd.txt')
else:
    print('No duplicate links were found.')

# STEP 5: Remove duplicates and sort links_new.txt
with open('links_old.txt') as old_file:
    old_links = old_file.read().splitlines()
with open('links_found.txt') as new_file:
    added_links = []
    skipped_links = []
    for link in new_file:
        link = link.strip()
        if link not in old_links:
            added_links.append(link)
        else:
            skipped_links.append(link)
with open('links_new.txt', 'w') as f:
    for link in sorted(set(added_links)):
        f.write(link + '\n')
if len(skipped_links) > 0:
    with open('skipped_3rd.txt', 'w') as f:
        for link in skipped_links:
            f.write(link + '\n')
    print(f'Skipped {len(skipped_links)} duplicate links. See skipped_3rd.txt for details.')
else:
    if os.path.exists('skipped_3rd.txt'):
        os.remove('skipped_3rd.txt')
    print('No duplicate links were found.')
if len(added_links) > 0:
    print(f'Added {len(added_links)} new links.')
else:
    print('No new links were found.')
print('Step 5: Completed')
```
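As a quick sanity check, this is what fix_links() does to a couple of obfuscated links (the example URLs are made up; fix_links() is the function defined in strip_links.py above, e.g. pasted into the same session):

```python
# Example inputs are illustrative only.
print(fix_links("https://example<dot>com/model"))   # -> https://example.com/model
print(fix_links("https://example.com/some file"))   # -> https://example.com/some%20file
```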
Needs:
main_site.txt
Here you put links to the sites you want to compare against the ones found in the 4chan archives.
sites.txt
This is the same file as links_saved.txt.
Rename the file, or edit the code so it saves/uses the correct name. It is named differently so the previous run can serve as a backup.
skip_extensions.txt
Put the link endings you want to ignore. It only checks the last part of each link, not the whole link. And it's Python, so letter case matters (the check is case-sensitive).
force_sites.txt
These are the domains of whitelisted links, which are forced to be saved regardless of the comparison.
skip_sites.txt
Edit as you want. These are the sites you do not want at all. The first four entries are the archive sites themselves, so unnecessary copies are not created.
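As an illustration of the format (these entries are examples, not the real file contents), skip_extensions.txt is just one ending per line; the script compares them with href.endswith(), so include the leading dot:

```
.jpg
.png
.webm
```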
It creates:
All of those will be deleted every time you start the script.
links_found.txt
- All links found (before being compared to the main site).
links_new.txt
- Links remaining after the comparison.
links_old.txt
- Links used for the comparison (taken from the sites listed in main_site.txt).
skipped.txt
- The list of threads it was not able to connect to, so you can retry them later.
skipped_2nd.txt
- The list of links that were "erased" when comparing them to the sites listed in main_site.txt.
skipped_3rd.txt
- The list of links that were removed from links_new.txt due to duplication.
Requirements
requirements.txt
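Judging by the imports in strip_links.py above, requirements.txt should contain at least the following (PyPI package names):

```
requests
beautifulsoup4
```

They can be installed with pip install -r requirements.txt.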