My PyTorch 2.0.0-2.1.0 ROCm multi-GPU experience

6800 XT only: working
6800 XT and 1 CPU layer: working
6800 XT and Vega 64 plugged in, all layers on the 6800 XT: working
6800 XT and Vega 64 plugged in, both GPUs in use:

 File "/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/site-packages/transformers/generation/utils.py", line 2560, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

6800 XT and Vega 64 plugged in, Vega 64 and CPU in use:

  File "/mnt/UbuntuData/AI/KoboldAI4bit/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking)

6800 XT and Vega 64 plugged in, all layers on the Vega 64:

  File "/mnt/UbuntuData/AI/KoboldAI4bit/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/site-packages/transformers/generation/utils.py", line 2560, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)
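
For reference, both multinomial failures above are exactly what torch.multinomial raises when the probability row is degenerate; a standalone sketch, independent of KoboldAI:

import torch

probs = torch.zeros(1, 8)                  # every probability filtered down to zero
# torch.multinomial(probs, num_samples=1)  # RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)
probs[0, 0] = float("nan")                 # or a NaN leaking in from upstream
# torch.multinomial(probs, num_samples=1)  # RuntimeError: probability tensor contains either `inf`, `nan` or element < 0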

This was done on a fresh copy of KoboldAI United, trying both the regular PyTorch 2.0 build as well as building from source in an attempt to fix the bug.
Here is my environment:
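(This is the output of python -m torch.utils.collect_env.)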

Collecting environment information...
PyTorch version: 2.1.0a0+git93a71cf
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 5.4.22804-474e8620

OS: Linux Mint 21.1 (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.8.11 (default, Aug  3 2021, 15:09:35)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.19.0-41-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Radeon RX 6800 XT
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 5.4.22804
MIOpen runtime version: 2.19.0
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          16
On-line CPU(s) list:             0-15
Vendor ID:                       GenuineIntel
Model name:                      12th Gen Intel(R) Core(TM) i5-12600K
CPU family:                      6
Model:                           151
Thread(s) per core:              2
Core(s) per socket:              10
Socket(s):                       1
Stepping:                        2
CPU max MHz:                     5000.0000
CPU min MHz:                     800.0000
BogoMIPS:                        7372.80
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
Virtualization:                  VT-x
L1d cache:                       416 KiB (10 instances)
L1i cache:                       448 KiB (10 instances)
L2 cache:                        9.5 MiB (7 instances)
L3 cache:                        20 MiB (1 instance)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-15
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] pytorch-triton-rocm==2.0.2
[pip3] torch==2.1.0a0+git93a71cf
[conda] mkl                       2022.1.0           hc2b9512_224  
[conda] mkl-include               2023.1.0         h06a4308_46342  
[conda] numpy                     1.24.3                   pypi_0    pypi
[conda] pytorch-triton-rocm       2.0.2                    pypi_0    pypi
[conda] torch                     2.1.0a0+git93a71cf           dev_0    <develop>

Here is my pip list:

Package               Version            Editable project location
--------------------- ------------------ -----------------------------------
accelerate            0.18.0
ansi2html             1.8.0
apispec               5.2.2
apispec-webframeworks 0.5.2
astunparse            1.6.3
attrs                 23.1.0
bidict                0.22.1
bleach                4.1.0
Brotli                1.0.9
cachelib              0.10.2
certifi               2023.5.7
cffi                  1.15.1
charset-normalizer    3.1.0
click                 8.1.3
cmake                 3.26.3
colorama              0.4.6
cryptography          39.0.1
diffusers             0.14.0
dnspython             2.2.1
eventlet              0.33.3
exceptiongroup        1.1.1
expecttest            0.1.4
filelock              3.12.0
Flask                 2.2.3
flask-cloudflared     0.0.10
Flask-Compress        1.13
Flask-Cors            3.0.10
flask-ngrok           0.0.25
Flask-Session         0.4.0
Flask-SocketIO        5.3.2
fsspec                2023.5.0
ftfy                  6.1.1
greenlet              2.0.2
huggingface-hub       0.12.1
hypothesis            6.75.2
idna                  3.4
ijson                 3.2.0.post0
importlib-metadata    6.6.0
itsdangerous          2.1.2
Jinja2                3.1.2
lit                   16.0.3
loguru                0.7.0
lupa                  1.10
Markdown              3.4.3
MarkupSafe            2.1.2
marshmallow           3.19.0
mkultra               0.1
mpmath                1.3.0
networkx              3.1
ninja                 1.11.1
numpy                 1.24.3
packaging             23.1
peft                  0.3.0
Pillow                9.5.0
pip                   23.1.2
protobuf              4.21.12
psutil                5.9.5
pycparser             2.21
pydub                 0.25.1
pyOpenSSL             23.1.1
python-engineio       4.4.1
python-socketio       5.7.2
pytorch-triton-rocm   2.0.2
PyYAML                6.0
regex                 2023.5.5
requests              2.30.0
safetensors           0.3.1
sentencepiece         0.1.97
setuptools            67.7.2
six                   1.16.0
sortedcontainers      2.4.0
sympy                 1.11.1
termcolor             2.3.0
tokenizers            0.13.3
torch                 2.1.0a0+git93a71cf /mnt/UbuntuData/AI/KoboldAI/pytorch
tqdm                  4.65.0
transformers          4.28.0
types-dataclasses     0.6.6
typing_extensions     4.5.0
urllib3               2.0.2
wcwidth               0.2.6
webencodings          0.5.1
Werkzeug              2.3.3
wheel                 0.40.0
zipp                  3.15.0

Here is an example of successful output with an RX 6800 XT, using PyTorch 2.0 built from source on May 8th, 2023:

Colab Check: False, TPU: False
INFO       | __main__:general_startup:1310 - Running on Repo: https://github.com/henk717/KoboldAI.git Branch: united
INIT       | Starting   | Flask
INIT       | OK         | Flask
INIT       | Starting   | Webserver
INIT       | Starting   | LUA bridge
INIT       | OK         | LUA bridge
INIT       | Starting   | LUA Scripts
INIT       | OK         | Webserver
MESSAGE    | Webserver started! You may now connect with a browser at http://127.0.0.1:5000
INIT       | OK         | LUA Scripts
Setting Seed
Opening in existing browser session.
Connection Attempt: 127.0.0.1
INFO       | __main__:do_connect:2796 - Client connected! UI_1
ERROR      | koboldai_settings:__setattr__:1203 - __setattr__ just set model_selected to NeoCustom in koboldai_vars. That variable isn't defined!
INFO       | __main__:get_model_info:1517 - Selected: NeoCustom, /mnt/UbuntuData/AI/KoboldAI/models/PygmalionAI_pygmalion-2.7b
INIT       | Searching  | GPU support
INIT       | Found      | GPU support
INIT       | Starting   | Transformers
INIT       | Info       | Final device configuration:
       DEVICE ID  |  LAYERS  |  DEVICE NAME
   (primary)   0  |      32  |  AMD Radeon RX 6800 XT
               1  |       0  |  Radeon RX Vega
             N/A  |       0  |  (Disk cache)
             N/A  |       0  |  (CPU)
Loading model tensors:   0%|          | 0/484 [00:00<?, ?it/s]/mnt/UbuntuData/AI/KoboldAI/modeling/lazy_loader.py:149: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage = STORAGE_TYPE_MAP[dtype].from_buffer(f.read(nbytes), "little")
Loading model tensors: 100%|##########| 484/484 [00:06<00:00, 71.13it/s] INFO       | __main__:load_model:1975 - Pipeline created: PygmalionAI_pygmalion-2.7b
INFO       | koboldai_settings:__setattr__:761 - Changing preset to Default
INIT       | Starting   | LUA bridge
INIT       | OK         | LUA bridge
INIT       | Starting   | LUA Scripts
INIT       | OK         | LUA Scripts
Setting Seed
Connection Attempt: 127.0.0.1
INFO       | __main__:do_connect:2796 - Client connected! UI_1
PROMPT     @ 2023-05-09 15:48:15 | You generate the following  story concept : 
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
the eos token: 50256
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
50256
modelkwargs 519: {}
578 tensor([[1639, 7716,  262, 1708,  220, 1621, 3721, 1058,  220]],
       device='cuda:0')
560: tensor([[1639, 7716,  262, 1708,  220, 1621, 3721, 1058,  220]],
       device='cuda:0')
model kwargs: {}
692 input ids: tensor([[1639, 7716,  262, 1708,  220, 1621, 3721, 1058,  220]],
       device='cuda:0')
694 input ids interleave: tensor([[1639, 7716,  262, 1708,  220, 1621, 3721, 1058,  220]],
       device='cuda:0')
dict to expand model kwards 701: {'output_attentions': False, 'output_hidden_states': False, 'use_cache': True, 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
attention mask 704: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')
dict to expand before loop: {'output_attentions': False, 'output_hidden_states': False, 'use_cache': True, 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
what is the key?: output_attentions
dict to expand after loop: {'output_attentions': False, 'output_hidden_states': False, 'use_cache': True, 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
what is the key?: output_hidden_states
dict to expand after loop: {'output_attentions': False, 'output_hidden_states': False, 'use_cache': True, 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
what is the key?: use_cache
dict to expand after loop: {'output_attentions': False, 'output_hidden_states': False, 'use_cache': True, 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
what is the key?: attention_mask
dict to expand after loop: {'output_attentions': False, 'output_hidden_states': False, 'use_cache': True, 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
698 inputids: tensor([[1639, 7716,  262, 1708,  220, 1621, 3721, 1058,  220]],
       device='cuda:0')
print logits warper: []
model inputs 2531: {'input_ids': tensor([[1639, 7716,  262, 1708,  220, 1621, 3721, 1058,  220]],
       device='cuda:0'), 'past_key_values': None, 'use_cache': True, 'position_ids': tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0'), 'token_type_ids': None}
/mnt/UbuntuData/AI/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py:195: UserWarning: where received a uint8 condition tensor. This behavior is deprecated and will be removed in a future version of PyTorch. Use a boolean condition instead. (Triggered internally at /mnt/UbuntuData/AI/KoboldAI/pytorch/aten/src/ATen/native/TensorCompare.cpp:497.)
  attn_weights = torch.where(causal_mask, attn_weights, mask_value)
these are your 2555 outputs: CausalLMOutputWithPast(loss=None, logits=tensor([[[  5.9141,   3.2344,  -0.9761,  ...,  -8.1406,  -0.7656,   1.6104],
         [  5.9492,   3.5215,  -0.1315,  ...,  -5.3633,  -1.6357,   0.6392],
         [  4.4492,   2.4531,   1.1680,  ...,  -1.9980,   0.9658,   1.3965],
         ...,
         [  3.6543,   0.7285,  -3.0078,  ..., -10.3047,  -7.4297,  -2.6973],
         [  5.7109,   7.9883,   3.1367,  ...,  -6.2773,   1.4424,   2.0176],
         [  2.1855,   1.5078,  -0.8696,  ...,  -5.3672,   0.7788,   4.1680]]],
       device='cuda:0', dtype=torch.float16), past_key_values=((tensor([[[[-0.5195, -0.5195, -0.0773,  ...,  0.3716, -0.5278, -0.2148],
          [-0.9155, -0.8638, -0.1818,  ...,  0.2438, -0.0994, -0.2361],
          [-0.8604, -0.2693, -0.0911,  ...,  0.2451, -0.4109, -0.3909],
          ...,
          [-0.6367, -0.6631,  0.1520,  ...,  0.5176, -0.1594, -0.3325],
          [-0.6421, -0.5298,  0.1146,  ...,  0.5986, -0.1389, -0.0269],
          [-0.6016, -0.2847, -0.2590,  ...,  0.4741, -0.3179, -0.0516]],
         [[ 0.4983,  0.6016,  1.1904,  ...,  0.0179, -0.2810,  0.5142],
          [ 0.2905,  0.5308,  0.4758,  ..., -0.1766, -0.0623,  0.2627],
          [ 0.3965,  0.8062,  0.9531,  ...,  0.4258, -0.5859,  0.4355],
          ...,
          [-0.0966,  0.4402,  0.6538,  ...,  0.4590, -0.1388,  0.0258],
          [ 0.4573,  0.4321,  0.6558,  ...,  0.7354,  0.5479, -0.0069],
          [ 0.3030,  0.2830,  0.5581,  ...,  0.2703, -0.3604,  0.4133]],
         [[ 0.1461,  0.1089, -0.6504,  ...,  0.1085,  0.1030, -0.0263],
          [ 0.2195, -0.1393, -0.4382,  ...,  0.5747,  1.1738, -0.0179],
          [ 0.0617, -0.5698, -0.8789,  ...,  0.7119,  0.3359, -0.2542],
          ...,
          [ 0.0320,  0.2009, -0.2617,  ...,  0.1041,  0.1785, -0.4893],
          [ 0.7349, -0.3191, -0.2751,  ...,  0.4570, -0.1652, -0.3154],
          [ 0.1288, -0.2288, -0.6226,  ...,  0.1220,  0.1877,  0.0588]],
         ...,
         [[ 0.1699,  0.0024,  0.5181,  ..., -0.0181, -0.2778,  0.1206],
          [ 0.1053, -0.4224,  0.4041,  ...,  0.3350,  0.0746, -0.1072],
          [ 0.4458,  0.4971,  0.5791,  ...,  0.6182, -0.0105,  0.0856],
          ...,
          [-0.5146,  0.3232, -0.3552,  ...,  0.0636,  0.0112,  0.3833],
          [-0.2197,  0.2303,  0.3132,  ...,  0.2793,  0.1322,  0.3259],
          [-0.1255,  0.3281,  0.3186,  ...,  0.3879, -0.1395,  0.2312]],
         [[-0.4937,  0.5562, -0.6558,  ...,  1.1240,  0.9502, -0.1401],
          [-0.6289, -0.5454, -0.1553,  ...,  1.2793,  0.8955, -0.1749],
          [-0.4573,  0.4116, -0.3037,  ...,  0.9131,  0.4019,  0.4624],
          ...,
          [-0.1299, -0.0217, -0.6543,  ...,  1.0283,  0.7700, -0.1146],
          [-0.0559, -0.4453, -0.3865,  ...,  0.8564,  0.3760, -0.0221],
          [-0.6841, -0.0766, -0.5215,  ...,  0.7559,  0.4646,  0.4824]],
         [[ 0.0817, -0.2251, -0.7314,  ..., -0.0720,  0.0841, -0.1260],
          [ 0.1659, -0.5391, -0.4731,  ..., -0.4324,  0.2749,  0.1700],
          [-0.1685, -0.0894, -0.1472,  ..., -0.4929,  0.2147, -0.2737],
          ...,
          [ 0.1075,  0.0658,  0.0640,  ..., -0.7104,  0.1135,  0.0446],
          [-0.2396,  0.0383, -0.6724,  ..., -0.1892, -0.2198, -0.2491],
          [ 0.1415,  0.3093, -0.7583,  ..., -0.2070,  0.6152, -0.0528]]]],



These prints then repeat hundreds of times during generation, with the attention_mask tensor growing by one element each time. The log then ends like this:

       device='cuda:0', dtype=torch.float16))), hidden_states=None, attentions=None)
what are the next token logits?: tensor([[ 3.0547,  1.9287,  2.1895,  ..., -4.3555, -0.6821,  1.6748]],
       device='cuda:0', dtype=torch.float16)
NextTokenScores Processor: tensor([[ 3.0547,  1.9287,  2.1895,  ..., -4.3555, -0.6821,    -inf]],
       device='cuda:0', dtype=torch.float16)
NextTokenScores Warper: tensor([[-inf, -inf, -inf,  ..., -inf, -inf, -inf]], device='cuda:0',
       dtype=torch.float16)
probs: tensor([[0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0', dtype=torch.float16)
attention_mask from model_kwargs["attention_mask"] is: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')
""model_kwargs["attention_mask"] after torch.cat is: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')
INFO       | modeling.inference_model:raw_generate:574 - Generated 3 tokens in 3.36 seconds, for an average rate of 0.89 tokens per second.
GENERATION @ 2023-05-09 15:48:18 | \nA young

And now, here is the problem I run into when using PyTorch 2.0 with the GPU in the second slot, the Vega 64:

ERROR      | __main__:generate:4113 - Traceback (most recent call last):
  File "aiserver.py", line 4100, in generate
    genout, already_generated = tpool.execute(model.core_generate, txt, found_entries)
  File "/mnt/UbuntuData/AI/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/site-packages/eventlet/tpool.py", line 132, in execute
    six.reraise(c, e, tb)
  File "/mnt/UbuntuData/AI/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/site-packages/six.py", line 719, in reraise
    raise value
  File "/mnt/UbuntuData/AI/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/site-packages/eventlet/tpool.py", line 86, in tworker
    rv = meth(*args, **kwargs)
  File "/mnt/UbuntuData/AI/KoboldAI/modeling/inference_model.py", line 313, in core_generate
    result = self.raw_generate(
  File "/mnt/UbuntuData/AI/KoboldAI/modeling/inference_model.py", line 560, in raw_generate
    result = self._raw_generate(
  File "/mnt/UbuntuData/AI/KoboldAI/modeling/inference_models/hf_torch.py", line 241, in _raw_generate
    genout = self.model.generate(
  File "/mnt/UbuntuData/AI/KoboldAI/pytorch/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/UbuntuData/AI/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/site-packages/transformers/generation/utils.py", line 1507, in generate
    return self.sample(
  File "/mnt/UbuntuData/AI/KoboldAI/modeling/inference_models/hf_torch.py", line 209, in new_sample
    return new_sample.old_sample(self, *args, **kwargs)
  File "/mnt/UbuntuData/AI/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/site-packages/transformers/generation/utils.py", line 2588, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)

In my attempt to debug, I went to the last file in the traceback, "KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/site-packages/transformers/generation/utils.py", and started printing out variables to trace back where my data began to corrupt.
First I printed "probs", then went back to what created "probs" and printed that:

probs = nn.functional.softmax(next_token_scores, dim=-1)
probs: tensor([[0., 0., 0.,  ..., nan, nan, nan]], device='cuda:1',
       dtype=torch.float16)
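
As an aside, this is the expected softmax behavior once the scores row has been poisoned: with every score at -inf (or any NaN present) the normalizer collapses to 0/0, and depending on backend and dtype the result comes out as zeros or NaNs, both of which multinomial rejects. A minimal sketch, independent of the transformers code:

import torch
import torch.nn.functional as F

# A row where every score has been filtered to -inf: exp(-inf) sums to 0,
# so the softmax degenerates (all zeros or all NaNs depending on backend/dtype).
scores = torch.full((1, 4), float("-inf"), dtype=torch.float16)
print(F.softmax(scores, dim=-1))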

So I went back further in the code to see what next_token_scores consisted of:

next_token_logits = outputs.logits[:, -1, :]
print(f"what are the next token logits?: {next_token_logits}")
# pre-process distribution
next_token_scores = logits_processor(input_ids, next_token_logits)
print(f"NextTokenScores Processor: {next_token_scores}")
next_token_scores = logits_warper(input_ids, next_token_scores)
print(f"NextTokenScores Warper: {next_token_scores}")

And the printed debug output showed me this:

what are the next token logits?: tensor([[-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01, -3.7255e-01,
         -3.7255e-01]], device='cuda:1', dtype=torch.float16)
NextTokenScores Processor: tensor([[0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., -3.7255e-01, -3.7255e-01,
         -3.7255e-01]], device='cuda:1', dtype=torch.float16)
NextTokenScores Warper: tensor([[0., 0., 0.,  ..., nan, nan, nan]], device='cuda:1',
       dtype=torch.float16)
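
Rather than dumping full tensors at every step, a small helper like this (hypothetical, not part of the transformers code) can make the bisection quicker by summarizing each tensor's health:

def dump(name, t):
    # Hypothetical debug helper: report device, dtype and non-finite counts.
    import torch
    if isinstance(t, torch.Tensor) and t.is_floating_point():
        bad = (~torch.isfinite(t)).sum().item()
        print(f"{name}: device={t.device} dtype={t.dtype} non-finite={bad}")
    else:
        print(f"{name}: {t!r}")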

The values inside the tensor are extremely small and in general just don't seem right, so I decided to keep printing values further back to see where things went wrong:

       DEVICE ID  |  LAYERS  |  DEVICE NAME
   (primary)   0  |       0  |  AMD Radeon RX 6800 XT
               1  |      32  |  Radeon RX Vega
             N/A  |       0  |  (Disk cache)
             N/A  |       0  |  (CPU)
Loading model tensors:   0%|          | 0/484 [00:00<?, ?it/s]/mnt/UbuntuData/AI/KoboldAI/modeling/lazy_loader.py:149: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage = STORAGE_TYPE_MAP[dtype].from_buffer(f.read(nbytes), "little")
Loading model tensors: 100%|##########| 484/484 [00:13<00:00, 36.51it/s] INFO       | __main__:load_model:1975 - Pipeline created: PygmalionAI_pygmalion-2.7b
INFO       | koboldai_settings:__setattr__:761 - Changing preset to Default
INIT       | Starting   | LUA bridge
INIT       | OK         | LUA bridge
INIT       | Starting   | LUA Scripts
INIT       | OK         | LUA Scripts
Setting Seed
Connection Attempt: 127.0.0.1
INFO       | __main__:do_connect:2796 - Client connected! UI_1
PROMPT     @ 2023-05-09 15:53:18 | You generate the following  story concept : 
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
the eos token: 50256
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
50256
modelkwargs 519: {}
line 578 tensor([[1639, 7716,  262, 1708,  220, 1621, 3721, 1058,  220]],
       device='cuda:1')
line 560: tensor([[1639, 7716,  262, 1708,  220, 1621, 3721, 1058,  220]],
       device='cuda:1')
model kwargs: {}
line 692 input ids: tensor([[1639, 7716,  262, 1708,  220, 1621, 3721, 1058,  220]],
       device='cuda:1')
line 694 input ids interleave: tensor([[                   1, -4702111234474983746, -4702111234474983746,
         -4702111234474983746, -4702111234474983746, -4702111234474983746,
         -4702111234474983746, -4702111234474983746, -4702111234474983746]],
       device='cuda:1')
Line 689 dict to expand model kwards 701: {'output_attentions': False, 'output_hidden_states': False, 'use_cache': True, 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:1')}
attention mask 704: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:1')
dict to expand before loop: {'output_attentions': False, 'output_hidden_states': False, 'use_cache': True, 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:1')}
what is the key?: output_attentions
dict to expand after loop: {'output_attentions': False, 'output_hidden_states': False, 'use_cache': True, 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:1')}
what is the key?: output_hidden_states
dict to expand after loop: {'output_attentions': False, 'output_hidden_states': False, 'use_cache': True, 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:1')}
what is the key?: use_cache
dict to expand after loop: {'output_attentions': False, 'output_hidden_states': False, 'use_cache': True, 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:1')}
what is the key?: attention_mask
dict to expand after loop: {'output_attentions': False, 'output_hidden_states': False, 'use_cache': True, 'attention_mask': tensor([[                   1, -4702111234474983746, -4702111234474983746,
         -4702111234474983746, -4702111234474983746, -4702111234474983746,
         -4702111234474983746, -4702111234474983746, -4702111234474983746]],
       device='cuda:1')}
698 inputids: tensor([[                   1, -4702111234474983746, -4702111234474983746,
         -4702111234474983746, -4702111234474983746, -4702111234474983746,
         -4702111234474983746, -4702111234474983746, -4702111234474983746]],
       device='cuda:1')
print logits warper: []
model inputs 2531: {'input_ids': tensor([[                   1, -4702111234474983746, -4702111234474983746,
         -4702111234474983746, -4702111234474983746, -4702111234474983746,
         -4702111234474983746, -4702111234474983746, -4702111234474983746]],
       device='cuda:1'), 'past_key_values': None, 'use_cache': True, 'position_ids': tensor([[                   0, -4702111234474983746,  9042521604759584124,
          4340410370284600378,  -361700864190383368, -5063812098665367114,
          8680820740569200756,  3978709506094217010,  -723401728380766736]],
       device='cuda:1'), 'attention_mask': tensor([[                   1, -4702111234474983746, -4702111234474983746,
         -4702111234474983746, -4702111234474983746, -4702111234474983746,
         -4702111234474983746, -4702111234474983746, -4702111234474983746]],
       device='cuda:1'), 'token_type_ids': None}
/mnt/UbuntuData/AI/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py:195: UserWarning: where received a uint8 condition tensor. This behavior is deprecated and will be removed in a future version of PyTorch. Use a boolean condition instead. (Triggered internally at /mnt/UbuntuData/AI/KoboldAI/pytorch/aten/src/ATen/native/TensorCompare.cpp:497.)
  attn_weights = torch.where(causal_mask, attn_weights, mask_value)
these are your 2555 outputs: CausalLMOutputWithPast(loss=None, logits=tensor([[[-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
          -3.7255e-01, -3.7255e-01],
         [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
          -3.7255e-01, -3.7255e-01],
         [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
          -3.7255e-01, -3.7255e-01],
         ...,
         [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
          -3.7255e-01, -3.7255e-01],
         [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
          -3.7255e-01, -3.7255e-01],
         [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
          -3.7255e-01, -3.7255e-01]]], device='cuda:1', dtype=torch.float16), past_key_values=((tensor([[[[       nan,        nan,        nan,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          ...,
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01]],
         [[-2.6583e+36, -2.6583e+36, -2.6789e+36,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          ...,
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01]],
         [[-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          ...,
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01]],
         ...,
         [[-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          ...,
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01]],
         [[-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          ...,
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01]],
         [[-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          ...,
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01]]]],

This brought me to the dictionary expansion loop, where it looks like, for attention_mask, the first value gets set correctly but the rest get set to some (default?) value, in this part of the code:

def _expand_inputs_for_generation(
    expand_size: int = 1,
    is_encoder_decoder: bool = False,
    input_ids: Optional[torch.LongTensor] = None,
    **model_kwargs,
) -> Tuple[torch.LongTensor, Dict[str, Any]]:
    """Expands tensors from [batch_size, ...] to [batch_size * expand_size, ...]"""

    def _expand_dict_for_generation(dict_to_expand):
        print(f"dict to expand before loop: {dict_to_expand}")
        for key in dict_to_expand:
            print(f"what is the key?: {key}")
            if dict_to_expand[key] is not None and isinstance(dict_to_expand[key], torch.Tensor):
                dict_to_expand[key] = dict_to_expand[key].repeat_interleave(expand_size, dim=0)
            print(f"dict to expand after loop: {dict_to_expand}")

        return dict_to_expand

    if input_ids is not None:
        print(f"692 input ids: {input_ids}")
        input_ids = input_ids.repeat_interleave(expand_size, dim=0)
        print(f"694 input ids interleave: {input_ids}")
    print(f"dict to expand model kwards 701: {model_kwargs}")
    print(f"attention mask 704: {model_kwargs['attention_mask']}")
    model_kwargs = _expand_dict_for_generation(model_kwargs)
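
For context, repeat_interleave(expand_size, dim=0) is only supposed to repeat each row expand_size times; a quick CPU illustration of the intended behavior:

import torch

x = torch.tensor([[1, 2, 3]])
print(x.repeat_interleave(2, dim=0))  # tensor([[1, 2, 3],
                                      #         [1, 2, 3]])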

Remember: this is all code from inside the utils.py file of the transformers library. This setup works with PyTorch 1.13.1 but not with PyTorch 2.0, so surely the problem has to lie somewhere in a torch function?
On further observation, I noticed invalid results coming out of the repeat_interleave functions, so as a test I commented them out. That "fixed" the attention_mask tensor's values, but the data would still get corrupted further down the line, between line 2531 and line 2555 (these numbers might vary slightly from your copy of the code):

model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
print(f"model inputs 2531: {model_inputs}")
# forward pass to get next token
outputs = self(
    **model_inputs,
    return_dict=True,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
)
print(f"these are your 2555 outputs: {outputs}")

However, in an effort not to change any of the transformers library code, I reverted those changes and brought everything back to normal operation.

Another observation is in this code:

if input_ids is not None:
    print(f"692 input ids: {input_ids}")
    input_ids = input_ids.repeat_interleave(expand_size, dim=0)
    print(f"694 input ids interleave: {input_ids}")

It outputs this:

692 input ids: tensor([[1639, 7716,  262, 1708,  220, 1621, 3721, 1058,  220]],
       device='cuda:1')
694 input ids interleave: tensor([[                   1, -4702111234474983746, -4702111234474983746,
         -4702111234474983746, -4702111234474983746, -4702111234474983746,
         -4702111234474983746, -4702111234474983746, -4702111234474983746]],
       device='cuda:1')
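
Since the shape is unchanged, expand_size is presumably 1 here, which makes repeat_interleave(1, dim=0), supposedly a pure copy, the smallest reproducer; a standalone sketch of the check, assuming two visible devices:

import torch

x = torch.tensor([[1639, 7716, 262, 1708, 220, 1621, 3721, 1058, 220]], device="cuda:1")
y = x.repeat_interleave(1, dim=0)     # expand_size=1 should be an exact copy
print(torch.equal(x.cpu(), y.cpu()))  # expected True; on this setup the copy comes back as garbage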

And to help summarize this all, here's the full terminal log (minus the hundreds of repeating line 2555 outputs):

Colab Check: False, TPU: False
INFO       | __main__:general_startup:1310 - Running on Repo: https://github.com/henk717/KoboldAI.git Branch: united
INIT       | Starting   | Flask
INIT       | OK         | Flask
INIT       | Starting   | Webserver
INIT       | Starting   | LUA bridge
INIT       | OK         | LUA bridge
INIT       | Starting   | LUA Scripts
INIT       | OK         | Webserver
MESSAGE    | Webserver started! You may now connect with a browser at http://127.0.0.1:5000
INIT       | OK         | LUA Scripts
Setting Seed
Opening in existing browser session.
Connection Attempt: 127.0.0.1
INFO       | __main__:do_connect:2796 - Client connected! UI_1
Connection Attempt: 127.0.0.1
INFO       | __main__:do_connect:2796 - Client connected! UI_1
ERROR      | koboldai_settings:__setattr__:1203 - __setattr__ just set model_selected to NeoCustom in koboldai_vars. That variable isn't defined!
INFO       | __main__:get_model_info:1517 - Selected: NeoCustom, /mnt/UbuntuData/AI/KoboldAI/models/PygmalionAI_pygmalion-2.7b
INIT       | Searching  | GPU support
INIT       | Found      | GPU support
INIT       | Starting   | Transformers
INIT       | Info       | Final device configuration:
       DEVICE ID  |  LAYERS  |  DEVICE NAME
   (primary)   0  |       0  |  AMD Radeon RX 6800 XT
               1  |      32  |  Radeon RX Vega
             N/A  |       0  |  (Disk cache)
             N/A  |       0  |  (CPU)
Loading model tensors:   0%|          | 0/484 [00:00<?, ?it/s]/mnt/UbuntuData/AI/KoboldAI/modeling/lazy_loader.py:149: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage = STORAGE_TYPE_MAP[dtype].from_buffer(f.read(nbytes), "little")
Loading model tensors: 100%|##########| 484/484 [00:13<00:00, 36.51it/s] INFO       | __main__:load_model:1975 - Pipeline created: PygmalionAI_pygmalion-2.7b
INFO       | koboldai_settings:__setattr__:761 - Changing preset to Default
INIT       | Starting   | LUA bridge
INIT       | OK         | LUA bridge
INIT       | Starting   | LUA Scripts
INIT       | OK         | LUA Scripts
Setting Seed
Connection Attempt: 127.0.0.1
INFO       | __main__:do_connect:2796 - Client connected! UI_1
PROMPT     @ 2023-05-09 15:53:18 | You generate the following  story concept : 
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
the eos token: 50256
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
50256
modelkwargs 519: {}
578 tensor([[1639, 7716,  262, 1708,  220, 1621, 3721, 1058,  220]],
       device='cuda:1')
560: tensor([[1639, 7716,  262, 1708,  220, 1621, 3721, 1058,  220]],
       device='cuda:1')
model kwargs: {}
692 input ids: tensor([[1639, 7716,  262, 1708,  220, 1621, 3721, 1058,  220]],
       device='cuda:1')
694 input ids interleave: tensor([[                   1, -4702111234474983746, -4702111234474983746,
         -4702111234474983746, -4702111234474983746, -4702111234474983746,
         -4702111234474983746, -4702111234474983746, -4702111234474983746]],
       device='cuda:1')
dict to expand model kwards 701: {'output_attentions': False, 'output_hidden_states': False, 'use_cache': True, 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:1')}
attention mask 704: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:1')
dict to expand before loop: {'output_attentions': False, 'output_hidden_states': False, 'use_cache': True, 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:1')}
what is the key?: output_attentions
dict to expand after loop: {'output_attentions': False, 'output_hidden_states': False, 'use_cache': True, 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:1')}
what is the key?: output_hidden_states
dict to expand after loop: {'output_attentions': False, 'output_hidden_states': False, 'use_cache': True, 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:1')}
what is the key?: use_cache
dict to expand after loop: {'output_attentions': False, 'output_hidden_states': False, 'use_cache': True, 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:1')}
what is the key?: attention_mask
dict to expand after loop: {'output_attentions': False, 'output_hidden_states': False, 'use_cache': True, 'attention_mask': tensor([[                   1, -4702111234474983746, -4702111234474983746,
         -4702111234474983746, -4702111234474983746, -4702111234474983746,
         -4702111234474983746, -4702111234474983746, -4702111234474983746]],
       device='cuda:1')}
698 inputids: tensor([[                   1, -4702111234474983746, -4702111234474983746,
         -4702111234474983746, -4702111234474983746, -4702111234474983746,
         -4702111234474983746, -4702111234474983746, -4702111234474983746]],
       device='cuda:1')
print logits warper: []
model inputs 2531: {'input_ids': tensor([[                   1, -4702111234474983746, -4702111234474983746,
         -4702111234474983746, -4702111234474983746, -4702111234474983746,
         -4702111234474983746, -4702111234474983746, -4702111234474983746]],
       device='cuda:1'), 'past_key_values': None, 'use_cache': True, 'position_ids': tensor([[                   0, -4702111234474983746,  9042521604759584124,
          4340410370284600378,  -361700864190383368, -5063812098665367114,
          8680820740569200756,  3978709506094217010,  -723401728380766736]],
       device='cuda:1'), 'attention_mask': tensor([[                   1, -4702111234474983746, -4702111234474983746,
         -4702111234474983746, -4702111234474983746, -4702111234474983746,
         -4702111234474983746, -4702111234474983746, -4702111234474983746]],
       device='cuda:1'), 'token_type_ids': None}
/mnt/UbuntuData/AI/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py:195: UserWarning: where received a uint8 condition tensor. This behavior is deprecated and will be removed in a future version of PyTorch. Use a boolean condition instead. (Triggered internally at /mnt/UbuntuData/AI/KoboldAI/pytorch/aten/src/ATen/native/TensorCompare.cpp:497.)
  attn_weights = torch.where(causal_mask, attn_weights, mask_value)
these are your 2555 outputs: CausalLMOutputWithPast(loss=None, logits=tensor([[[-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
          -3.7255e-01, -3.7255e-01],
         [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
          -3.7255e-01, -3.7255e-01],
         [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
          -3.7255e-01, -3.7255e-01],
         ...,
         [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
          -3.7255e-01, -3.7255e-01],
         [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
          -3.7255e-01, -3.7255e-01],
         [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
          -3.7255e-01, -3.7255e-01]]], device='cuda:1', dtype=torch.float16), past_key_values=((tensor([[[[       nan,        nan,        nan,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          ...,
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01]],
         [[-2.6583e+36, -2.6583e+36, -2.6789e+36,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          ...,
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01]],
         [[-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          ...,
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01]],
         ...,
         [[-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          ...,
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01]],
         [[-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          ...,
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01]],
         [[-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          ...,
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01],
          [-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01,
           -3.7255e-01, -3.7255e-01]]]], device='cuda:1', dtype=torch.float16))), hidden_states=None, attentions=None)
what are the next token logits?: tensor([[-3.7255e-01, -3.7255e-01, -3.7255e-01,  ..., -3.7255e-01, -3.7255e-01,
         -3.7255e-01]], device='cuda:1', dtype=torch.float16)
NextTokenScores Processor: tensor([[0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., -3.7255e-01, -3.7255e-01,
         -3.7255e-01]], device='cuda:1', dtype=torch.float16)
NextTokenScores Warper: tensor([[0., 0., 0.,  ..., nan, nan, nan]], device='cuda:1',
       dtype=torch.float16)
probs: tensor([[0., 0., 0.,  ..., nan, nan, nan]], device='cuda:1',
       dtype=torch.float16)
ERROR      | __main__:generate:4113 - Traceback (most recent call last):
  File "aiserver.py", line 4100, in generate
    genout, already_generated = tpool.execute(model.core_generate, txt, found_entries)
  File "/mnt/UbuntuData/AI/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/site-packages/eventlet/tpool.py", line 132, in execute
    six.reraise(c, e, tb)
  File "/mnt/UbuntuData/AI/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/site-packages/six.py", line 719, in reraise
    raise value
  File "/mnt/UbuntuData/AI/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/site-packages/eventlet/tpool.py", line 86, in tworker
    rv = meth(*args, **kwargs)
  File "/mnt/UbuntuData/AI/KoboldAI/modeling/inference_model.py", line 313, in core_generate
    result = self.raw_generate(
  File "/mnt/UbuntuData/AI/KoboldAI/modeling/inference_model.py", line 560, in raw_generate
    result = self._raw_generate(
  File "/mnt/UbuntuData/AI/KoboldAI/modeling/inference_models/hf_torch.py", line 241, in _raw_generate
    genout = self.model.generate(
  File "/mnt/UbuntuData/AI/KoboldAI/pytorch/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/UbuntuData/AI/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/site-packages/transformers/generation/utils.py", line 1506, in generate
    return self.sample(
  File "/mnt/UbuntuData/AI/KoboldAI/modeling/inference_models/hf_torch.py", line 209, in new_sample
    return new_sample.old_sample(self, *args, **kwargs)
  File "/mnt/UbuntuData/AI/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/site-packages/transformers/generation/utils.py", line 2587, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)