Encoding problems come up often when writing parsers or reading data from XML and CSV files. Below are several ways to solve them.
1
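windows-1251 to UTF-8
With iconv (the //IGNORE suffix is only meaningful on the target encoding):
$text = iconv('windows-1251', 'UTF-8//IGNORE', $text);
echo $text;
Or with mb_convert_encoding (note the argument order: string, target, source):
$text = mb_convert_encoding($text, 'UTF-8', 'windows-1251');
echo $text;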
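A quick way to check both variants is to feed them bytes whose meaning is known in advance. The sketch below assumes the iconv and mbstring extensions are loaded; the sample bytes are the word "Привет" in windows-1251, and the variable names are made up for the example.

```php
<?php
// "Привет" encoded in windows-1251 (one byte per letter).
$cp1251 = "\xCF\xF0\xE8\xE2\xE5\xF2";

// The same conversion done two ways.
$viaIconv = iconv('windows-1251', 'UTF-8', $cp1251);
$viaMb    = mb_convert_encoding($cp1251, 'UTF-8', 'windows-1251');

// Both results should be the same UTF-8 string.
var_dump($viaIconv === $viaMb);
echo $viaIconv, PHP_EOL;
```

If the two results ever differ, the input most likely contains bytes that one of the libraries treats as undefined.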
2
UTF-8 to windows-1251
With iconv:
$text = iconv('utf-8', 'windows-1251//IGNORE', $text);
echo $text;
Or with mb_convert_encoding:
$text = mb_convert_encoding($text, 'windows-1251', 'utf-8');
echo $text;
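Because windows-1251 text is unreadable in a UTF-8 terminal, it helps to inspect the result as hex. A small sketch, assuming mbstring is available; the helper name to_cp1251 is made up for this example.

```php
<?php
// Hypothetical helper: convert a UTF-8 string to windows-1251.
function to_cp1251(string $utf8): string
{
    return mb_convert_encoding($utf8, 'windows-1251', 'UTF-8');
}

$utf8   = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82"; // "Привет" in UTF-8
$cp1251 = to_cp1251($utf8);

echo strlen($utf8), ' bytes in UTF-8', PHP_EOL;         // 12
echo strlen($cp1251), ' bytes in windows-1251', PHP_EOL; // 6
echo bin2hex($cp1251), PHP_EOL;                          // cff0e8e2e5f2
```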
3
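When nothing else helps
If Cyrillic text shows up as accented Latin letters (for example "Ïðèâåò"), the windows-1251 bytes were probably read as cp1252 and then converted to UTF-8. Undo the wrong step first, then convert correctly:
$text = iconv('utf-8', 'cp1252//IGNORE', $text);
$text = iconv('cp1251', 'utf-8//IGNORE', $text);
echo $text;
Sometimes it gets absurd, but a round trip works, because //IGNORE drops the characters that cannot be converted:
$text = iconv('utf-8', 'windows-1251//IGNORE', $text);
$text = iconv('windows-1251', 'utf-8//IGNORE', $text);
echo $text;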
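The double conversion is easier to trust once the failure is reproduced on purpose. The sketch below, which assumes the iconv extension knows the names cp1251 and cp1252, first breaks a known string the same way and then repairs it; the variable names are illustrative.

```php
<?php
// Original text as windows-1251 bytes: "Привет".
$cp1251 = "\xCF\xF0\xE8\xE2\xE5\xF2";

// Simulate the mistake: the bytes are treated as cp1252 and converted to UTF-8.
$broken = iconv('cp1252', 'UTF-8', $cp1251); // displays as accented Latin letters

// Undo the wrong step, then apply the right one.
$bytes = iconv('UTF-8', 'cp1252//IGNORE', $broken);
$fixed = iconv('cp1251', 'UTF-8//IGNORE', $bytes);

echo $broken, ' -> ', $fixed, PHP_EOL;
```

This only works when the damaged text still contains every original byte; if some characters were already lost along the way, they cannot be recovered.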
4
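file_get_contents / cURL
Sometimes file_get_contents() or cURL returns garbage such as "ÐлмазнÑе боÑÑ". The cause here is not the encoding of the data, which is usually valid UTF-8, but the missing BOM: without it the output gets displayed in a single-byte encoding. Prepending the UTF-8 BOM fixes the display:
$text = file_get_contents('https://example.com');
$text = "\xEF\xBB\xBF" . $text;
echo $text;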
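Prepending the BOM blindly can leave two of them if the source already had one. A minimal sketch of a guard, with a made-up helper name with_utf8_bom:

```php
<?php
// Hypothetical helper: prepend the UTF-8 BOM only when it is missing.
function with_utf8_bom(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    if (strncmp($text, $bom, 3) === 0) {
        return $text; // already marked
    }
    return $bom . $text;
}

$once  = with_utf8_bom('hello');
$twice = with_utf8_bom($once);

var_dump($once === $twice); // bool(true): the second call adds nothing
```

If the text goes to a browser, sending the charset in the Content-Type header, for example header('Content-Type: text/html; charset=utf-8'), is another way to give the same hint.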
There are also cases when file_get_contents() returns text like this:
�mw�Ƒ0�����&IkAI��f��j4/{�</�&�h�� ��({�o�����:/��<g���g��(�=�9�Paɭ
This is GZIP-compressed text: file_get_contents() does not send the right headers and does not decompress the response. A solution using cURL:
function getcontents($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip');     // ask for gzip and decompress automatically
    // Certificate checks are switched off here; keep these two lines only for hosts you trust.
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}
echo getcontents('https://example.com');
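If switching to cURL is not an option, the compressed body can also be detected and unpacked after the fact. A minimal sketch, assuming the zlib extension is available; maybe_gunzip is a made-up name, and the check relies on gzip data starting with the bytes 0x1F 0x8B.

```php
<?php
// Hypothetical helper: decompress the body if it starts with the gzip magic bytes.
function maybe_gunzip(string $body): string
{
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body; // not gzip, return as is
    }
    $decoded = @gzdecode($body);
    return $decoded === false ? $body : $decoded;
}

$compressed = gzencode('plain text'); // stand-in for a compressed HTTP response
echo maybe_gunzip($compressed), PHP_EOL;
echo maybe_gunzip('already plain'), PHP_EOL;
```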
12.01.2017, updated 02.11.2021
To convert a string in PHP from utf-8 to windows-1251 and back, you can use the following function:
Description of the iconv function:
string iconv ( string in_charset, string out_charset, string str )
Converts the string str from the source encoding in_charset to the target encoding out_charset. Returns the string in the new encoding, or FALSE on failure.
Appending //TRANSLIT to out_charset enables transliteration: when a character does not exist in the target encoding, it is replaced with one or more similar characters. Appending //IGNORE drops the characters that do not exist in the target encoding. Without either suffix, older PHP versions returned str cut off at the first invalid character; current versions raise an E_NOTICE and return FALSE.
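Since iconv reports failure through its return value, it is easy to pass FALSE along by accident. The sketch below wraps the call so that a failed conversion becomes an exception; the wrapper name convert_or_fail is an assumption made for this example, not part of PHP.

```php
<?php
// Hypothetical wrapper: turn iconv's FALSE return value into an exception.
function convert_or_fail(string $from, string $to, string $str): string
{
    // @ hides the notice; the failure is reported through the exception instead.
    $result = @iconv($from, $to, $str);
    if ($result === false) {
        throw new RuntimeException("Cannot convert from $from to $to");
    }
    return $result;
}

try {
    // "Привет" in UTF-8, converted to windows-1251.
    $converted = convert_or_fail('UTF-8', 'windows-1251', "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82");
    echo bin2hex($converted), PHP_EOL;
} catch (RuntimeException $e) {
    echo $e->getMessage(), PHP_EOL;
}
```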
If your hosting does not support iconv, you can use the following functions to convert from utf-8 to win-1251 and back:
function utf8_to_cp1251($s) {
    $tbl = $GLOBALS['unicode_to_cp1251_tbl'];
    $uc = 0;
    $bits = 0;
    $r = "";
    for ($i = 0, $l = strlen($s); $i < $l; $i++) {
        $c = $s[$i]; // the $s{$i} syntax was removed in PHP 8
        $b = ord($c);
        if ($b & 0x80) {
            if ($b & 0x40) {
                // Lead byte of a multi-byte sequence
                if ($b & 0x20) {
                    $uc = ($b & 0x0F) << 12;
                    $bits = 12;
                } else {
                    $uc = ($b & 0x1F) << 6;
                    $bits = 6;
                }
            } else {
                // Continuation byte
                $bits -= 6;
                if ($bits) {
                    $uc |= ($b & 0x3F) << $bits;
                } else {
                    $uc |= $b & 0x3F;
                    // Characters missing from the table become '?'
                    $r .= $tbl[$uc] ?? '?';
                }
            }
        } else {
            $r .= $c;
        }
    }
    return $r;
}
function cp1251_to_utf8($s) {
    $tbl = $GLOBALS['cp1251_to_utf8_tbl'];
    $r = "";
    for ($i = 0, $l = strlen($s); $i < $l; $i++) {
        $c = $s[$i]; // the $s{$i} syntax was removed in PHP 8
        $b = ord($c);
        if ($b < 128) {
            $r .= $c;
        } else {
            // Bytes missing from the table are dropped
            $r .= $tbl[$b] ?? '';
        }
    }
    return $r;
}
$unicode_to_cp1251_tbl = array(
0x0402 => "\x80", 0x0403 => "\x81", 0x201A => "\x82", 0x0453 => "\x83",
0x201E => "\x84", 0x2026 => "\x85", 0x2020 => "\x86", 0x2021 => "\x87",
0x20AC => "\x88", 0x2030 => "\x89", 0x0409 => "\x8A", 0x2039 => "\x8B",
0x040A => "\x8C", 0x040C => "\x8D", 0x040B => "\x8E", 0x040F => "\x8F",
0x0452 => "\x90", 0x2018 => "\x91", 0x2019 => "\x92", 0x201C => "\x93",
0x201D => "\x94", 0x2022 => "\x95", 0x2013 => "\x96", 0x2014 => "\x97",
0x2122 => "\x99", 0x0459 => "\x9A", 0x203A => "\x9B", 0x045A => "\x9C",
0x045C => "\x9D", 0x045B => "\x9E", 0x045F => "\x9F", 0x00A0 => "\xA0",
0x040E => "\xA1", 0x045E => "\xA2", 0x0408 => "\xA3", 0x00A4 => "\xA4",
0x0490 => "\xA5", 0x00A6 => "\xA6", 0x00A7 => "\xA7", 0x0401 => "\xA8",
0x00A9 => "\xA9", 0x0404 => "\xAA", 0x00AB => "\xAB", 0x00AC => "\xAC",
0x00AD => "\xAD", 0x00AE => "\xAE", 0x0407 => "\xAF", 0x00B0 => "\xB0",
0x00B1 => "\xB1", 0x0406 => "\xB2", 0x0456 => "\xB3", 0x0491 => "\xB4",
0x00B5 => "\xB5", 0x00B6 => "\xB6", 0x00B7 => "\xB7", 0x0451 => "\xB8",
0x2116 => "\xB9", 0x0454 => "\xBA", 0x00BB => "\xBB", 0x0458 => "\xBC",
0x0405 => "\xBD", 0x0455 => "\xBE", 0x0457 => "\xBF", 0x0410 => "\xC0",
0x0411 => "\xC1", 0x0412 => "\xC2", 0x0413 => "\xC3", 0x0414 => "\xC4",
0x0415 => "\xC5", 0x0416 => "\xC6", 0x0417 => "\xC7", 0x0418 => "\xC8",
0x0419 => "\xC9", 0x041A => "\xCA", 0x041B => "\xCB", 0x041C => "\xCC",
0x041D => "\xCD", 0x041E => "\xCE", 0x041F => "\xCF", 0x0420 => "\xD0",
0x0421 => "\xD1", 0x0422 => "\xD2", 0x0423 => "\xD3", 0x0424 => "\xD4",
0x0425 => "\xD5", 0x0426 => "\xD6", 0x0427 => "\xD7", 0x0428 => "\xD8",
0x0429 => "\xD9", 0x042A => "\xDA", 0x042B => "\xDB", 0x042C => "\xDC",
0x042D => "\xDD", 0x042E => "\xDE", 0x042F => "\xDF", 0x0430 => "\xE0",
0x0431 => "\xE1",
0x0432 => "\xE2", 0x0433 => "\xE3", 0x0434 => "\xE4", 0x0435 => "\xE5", 0x0436 => "\xE6", 0x0437 => "\xE7", 0x0438 => "\xE8", 0x0439 => "\xE9", 0x043A => "\xEA", 0x043B => "\xEB", 0x043C => "\xEC", 0x043D => "\xED", 0x043E => "\xEE", 0x043F => "\xEF", 0x0440 => "\xF0", 0x0441 => "\xF1", 0x0442 => "\xF2", 0x0443 => "\xF3", 0x0444 => "\xF4", 0x0445 => "\xF5", 0x0446 => "\xF6", 0x0447 => "\xF7", 0x0448 => "\xF8", 0x0449 => "\xF9", 0x044A => "\xFA", 0x044B => "\xFB", 0x044C => "\xFC", 0x044D => "\xFD", 0x044E => "\xFE", 0x044F => "\xFF", ); $cp1251_to_utf8_tbl = array( 0x80 => "\xD0\x82", 0x81 => "\xD0\x83", 0x82 => "\xE2\x80\x9A", 0x83 => "\xD1\x93", 0x84 => "\xE2\x80\x9E", 0x85 => "\xE2\x80\xA6", 0x86 => "\xE2\x80\xA0", 0x87 => "\xE2\x80\xA1", 0x88 => "\xE2\x82\xAC", 0x89 => "\xE2\x80\xB0", 0x8A => "\xD0\x89", 0x8B => "\xE2\x80\xB9", 0x8C => "\xD0\x8A", 0x8D => "\xD0\x8C", 0x8E => "\xD0\x8B", 0x8F => "\xD0\x8F", 0x90 => "\xD1\x92", 0x91 => "\xE2\x80\x98", 0x92 => "\xE2\x80\x99", 0x93 => "\xE2\x80\x9C", 0x94 => "\xE2\x80\x9D", 0x95 => "\xE2\x80\xA2", 0x96 => "\xE2\x80\x93", 0x97 => "\xE2\x80\x94", 0x99 => "\xE2\x84\xA2", 0x9A => "\xD1\x99", 0x9B => "\xE2\x80\xBA", 0x9C => "\xD1\x9A", 0x9D => "\xD1\x9C", 0x9E => "\xD1\x9B", 0x9F => "\xD1\x9F", 0xA0 => "\xC2\xA0", 0xA1 => "\xD0\x8E", 0xA2 => "\xD1\x9E", 0xA3 => "\xD0\x88", 0xA4 => "\xC2\xA4", 0xA5 => "\xD2\x90", 0xA6 => "\xC2\xA6", 0xA7 => "\xC2\xA7", 0xA8 => "\xD0\x81", 0xA9 => "\xC2\xA9", 0xAA => "\xD0\x84", 0xAB => "\xC2\xAB", 0xAC => "\xC2\xAC", 0xAD => "\xC2\xAD", 0xAE => "\xC2\xAE", 0xAF => "\xD0\x87", 0xB0 => "\xC2\xB0", 0xB1 => "\xC2\xB1", 0xB2 => "\xD0\x86", 0xB3 => "\xD1\x96", 0xB4 => "\xD2\x91", 0xB5 => "\xC2\xB5", 0xB6 => "\xC2\xB6", 0xB7 => "\xC2\xB7", 0xB8 => "\xD1\x91", 0xB9 => "\xE2\x84\x96", 0xBA => "\xD1\x94", 0xBB => "\xC2\xBB", 0xBC => "\xD1\x98", 0xBD => "\xD0\x85", 0xBE => "\xD1\x95", 0xBF => "\xD1\x97", 0xC0 => "\xD0\x90", 0xC1 => "\xD0\x91", 0xC2 => "\xD0\x92", 0xC3 => "\xD0\x93", 0xC4 => 
"\xD0\x94", 0xC5 => "\xD0\x95", 0xC6 => "\xD0\x96", 0xC7 => "\xD0\x97", 0xC8 => "\xD0\x98", 0xC9 => "\xD0\x99", 0xCA => "\xD0\x9A", 0xCB => "\xD0\x9B", 0xCC => "\xD0\x9C", 0xCD => "\xD0\x9D", 0xCE => "\xD0\x9E", 0xCF => "\xD0\x9F", 0xD0 => "\xD0\xA0", 0xD1 => "\xD0\xA1", 0xD2 => "\xD0\xA2", 0xD3 => "\xD0\xA3", 0xD4 => "\xD0\xA4", 0xD5 => "\xD0\xA5", 0xD6 => "\xD0\xA6", 0xD7 => "\xD0\xA7", 0xD8 => "\xD0\xA8", 0xD9 => "\xD0\xA9", 0xDA => "\xD0\xAA", 0xDB => "\xD0\xAB", 0xDC => "\xD0\xAC", 0xDD => "\xD0\xAD", 0xDE => "\xD0\xAE", 0xDF => "\xD0\xAF", 0xE0 => "\xD0\xB0", 0xE1 => "\xD0\xB1", 0xE2 => "\xD0\xB2", 0xE3 => "\xD0\xB3", 0xE4 => "\xD0\xB4", 0xE5 => "\xD0\xB5", 0xE6 => "\xD0\xB6", 0xE7 => "\xD0\xB7", 0xE8 => "\xD0\xB8", 0xE9 => "\xD0\xB9", 0xEA => "\xD0\xBA", 0xEB => "\xD0\xBB", 0xEC => "\xD0\xBC", 0xED => "\xD0\xBD", 0xEE => "\xD0\xBE", 0xEF => "\xD0\xBF", 0xF0 => "\xD1\x80", 0xF1 => "\xD1\x81", 0xF2 => "\xD1\x82", 0xF3 => "\xD1\x83", 0xF4 => "\xD1\x84", 0xF5 => "\xD1\x85", 0xF6 => "\xD1\x86", 0xF7 => "\xD1\x87", 0xF8 => "\xD1\x88", 0xF9 => "\xD1\x89", 0xFA => "\xD1\x8A", 0xFB => "\xD1\x8B", 0xFC => "\xD1\x8C", 0xFD => "\xD1\x8D", 0xFE => "\xD1\x8E", 0xFF => "\xD1\x8F", );
If you need to convert a string from Windows-1251 to 866, note that some characters of 1251 have no representation in DOS 866. For example, the long dash chr(150) is converted to 0, after which iconv stops and the remaining characters are skipped. The problem characters in win1251 are in the ranges 128-159, 163, 165-167, 169, 171-174, 177-182, 187-190.
Use this instead:
// $text - input text in windows-1251
// $cout - output text in 866 (cp866, DOS Russian)
$cout = '';
for ($i = 0, $len = strlen($text); $i < $len; $i++) {
    $ord = ord($text[$i]);
    if ($ord >= 192 && $ord <= 239) $cout .= chr($ord - 64);      // А..п
    elseif ($ord >= 240 && $ord <= 255) $cout .= chr($ord - 16);  // р..я
    elseif ($ord == 168) $cout .= chr(240);                       // Ё
    elseif ($ord == 184) $cout .= chr(241);                       // ё
    elseif ($ord == 185) $cout .= chr(252);                       // №
    elseif ($ord == 150 || $ord == 151) $cout .= chr(45);         // dashes become "-"
    elseif ($ord == 147 || $ord == 148 || $ord == 171 || $ord == 187) $cout .= chr(34); // quotes become "
    elseif ($ord >= 128 && $ord <= 190) continue;                 // no representation for this character
    else $cout .= chr($ord);
}
Windows-1251 Encoding : PHP
This guide covers Windows-1251 encoding in PHP. If you work with Cyrillic text or need data to be represented correctly in your web applications, it helps to understand what Windows-1251 is, how it works with PHP, and how to handle it in practice, whether you are dealing with legacy systems or modern applications.
Introduction to Windows-1251
Windows-1251 is a character encoding that is widely used for representing Cyrillic characters in digital formats. Developed by Microsoft, this encoding is designed to support languages that use the Cyrillic script, such as Russian, Bulgarian, and Serbian. As a single-byte encoding it has room for 256 code points: the standard Latin characters and control characters in the lower half, and Cyrillic letters and additional symbols in the upper half. Windows-1251 is popular in legacy systems and applications where text needs to be displayed or processed in Cyrillic languages.
In PHP, encoding text to Windows-1251 can be achieved using the iconv() function or the mb_convert_encoding() function. These functions allow developers to convert strings from one character encoding to another, making it easier to handle text data that includes Cyrillic characters.
Example of Encoding
// Original string in UTF-8
$originalString = "Привет, мир!"; // Hello, World!
// Convert the string to Windows-1251 encoding
$encodedString = iconv("UTF-8", "Windows-1251//IGNORE", $originalString);
// Output the encoded string
echo $encodedString; // Displays the string in Windows-1251 encoding
In this example, the iconv() function takes the original UTF-8 string and converts it to Windows-1251 encoding, ignoring any characters that cannot be represented.
Decoding with Windows-1251 in PHP
Decoding text from Windows-1251 back to a more universal encoding such as UTF-8 is also straightforward in PHP. The same iconv() function can be used for this purpose.
Example of Decoding
// A Windows-1251 encoded string
$windows1251String = "\xCF\xF0\xE8\xE2\xE5\xF2, \xEC\xE8\xF0!"; // Привет, мир!
// Convert the string back to UTF-8
$decodedString = iconv("Windows-1251", "UTF-8//IGNORE", $windows1251String);
// Output the decoded string
echo $decodedString; // Displays "Привет, мир!" in UTF-8
This code snippet illustrates how to decode a string from Windows-1251 back to UTF-8, allowing for proper display and manipulation of text.
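Decoding a string that is already UTF-8 would corrupt it, so it can be worth checking first. The sketch below is a heuristic under a stated assumption: any input that is not valid UTF-8 is treated as Windows-1251. The helper name ensure_utf8 is made up, and mbstring is assumed to be available.

```php
<?php
// Hypothetical helper: return UTF-8, converting from Windows-1251 only when needed.
function ensure_utf8(string $s): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already valid UTF-8 (pure ASCII also lands here)
    }
    return mb_convert_encoding($s, 'UTF-8', 'Windows-1251');
}

$legacy = "\xCF\xF0\xE8\xE2\xE5\xF2, \xEC\xE8\xF0!"; // Windows-1251 bytes
$modern = ensure_utf8($legacy);

var_dump($modern === ensure_utf8($modern)); // bool(true): the second call changes nothing
```

A short Windows-1251 string can occasionally happen to be valid UTF-8 as well, so this check is a convenience rather than a guarantee.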
Advantages and Disadvantages of Windows-1251
Advantages
- Compatibility: Windows-1251 is widely supported in many applications and systems, making it a reliable choice for legacy software that requires Cyrillic support.
- Simplicity: As a single-byte encoding, Windows-1251 is straightforward to implement and requires less memory compared to multi-byte encodings.
Disadvantages
- Limited Character Set: Windows-1251 can only accommodate 256 characters, which may not be sufficient for all Cyrillic languages or specialized symbols.
- Obsolescence: With the rise of Unicode (UTF-8), which supports a much broader range of characters, Windows-1251 is becoming less common in modern applications.
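The limited character set can be checked in code before committing to Windows-1251. One possible sketch, assuming mbstring is available: convert the text there and back, and see whether it survives. The helper name fits_cp1251 is made up for this example.

```php
<?php
// Hypothetical helper: true if every character survives a round trip through Windows-1251.
function fits_cp1251(string $utf8): bool
{
    $roundTrip = mb_convert_encoding(
        mb_convert_encoding($utf8, 'Windows-1251', 'UTF-8'),
        'UTF-8',
        'Windows-1251'
    );
    return $roundTrip === $utf8;
}

var_dump(fits_cp1251("\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82")); // "Привет": bool(true)
var_dump(fits_cp1251("\xE6\x97\xA5"));                                     // a CJK character: bool(false)
```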
Key Applications of Windows-1251
Windows-1251 is primarily used in systems and applications that require the display and processing of text in Cyrillic languages. Common applications include:
- Legacy Databases: Many older databases still utilize Windows-1251 for storing and retrieving text data.
- Email Clients: Some email clients support Windows-1251, allowing users to read and send messages in Cyrillic.
- Web Applications: Certain web applications and sites still use Windows-1251 encoding for compatibility with older browsers and systems.
Popular Frameworks and Tools for Windows-1251
Several frameworks and tools support Windows-1251 encoding, making integration easier for developers:
- PHP: As demonstrated, PHP provides built-in functions like iconv() and mb_convert_encoding() for encoding and decoding.
- MySQL: MySQL databases can be configured to use Windows-1251 character sets, facilitating storage and retrieval of Cyrillic text.
- Java: The Java programming language offers support for Windows-1251 through its Charset class, enabling easy manipulation of encoded strings.
By leveraging these frameworks and tools, developers can effectively manage Windows-1251 encoded text in their applications, ensuring compatibility and proper display across different platforms.
Sometimes you need to change the encoding of an entire file, for example from utf-8 to windows-1251. This is usually done in a code editor, but if it has to be done programmatically, the PHP function iconv() will help.
To avoid converting every line of the file with iconv(string $input_charset, string $output_charset, string $string), we can convert a single string: the whole file, read with file_get_contents($path).
As an example, let's convert a file completely from UTF-8 to WINDOWS-1251.
The result looks like this:
$file_string = file_get_contents("tmp/test_file.csv");
$file_string = iconv("UTF-8", "WINDOWS-1251", $file_string);
file_put_contents("tmp/test_file.csv", $file_string);
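Since the file is overwritten in place, a failed conversion could wipe its contents. The sketch below adds the missing checks; convert_file is a made-up helper name, and the demo runs on a temporary file so that no real data is touched.

```php
<?php
// Hypothetical helper: re-encode a whole file in place, stopping on any failure.
function convert_file(string $path, string $from, string $to): void
{
    $contents = file_get_contents($path);
    if ($contents === false) {
        throw new RuntimeException("Cannot read $path");
    }
    $converted = @iconv($from, $to, $contents);
    if ($converted === false) {
        // The original file is left untouched if the conversion fails.
        throw new RuntimeException("Cannot convert $path from $from to $to");
    }
    if (file_put_contents($path, $converted) === false) {
        throw new RuntimeException("Cannot write $path");
    }
}

// Demo: a small UTF-8 CSV containing the word "Привет".
$tmp = tempnam(sys_get_temp_dir(), 'enc');
file_put_contents($tmp, "id;name\n1;\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82\n");
convert_file($tmp, 'UTF-8', 'WINDOWS-1251');
$result = file_get_contents($tmp);
unlink($tmp);

echo bin2hex($result), PHP_EOL;
```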
Also, if you want to change the line endings, for example from the Mac ( \r ) format to the Windows ( \r\n ) or Unix ( \n ) format, use one of the following:
// Windows CRLF
$string = preg_replace('~(*BSR_ANYCRLF)\R~', "\r\n", $string);
// Unix LF
$string = preg_replace('~(*BSR_ANYCRLF)\R~', "\n", $string);
// Mac CR
$string = preg_replace('~(*BSR_ANYCRLF)\R~', "\r", $string);
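The same pattern can be wrapped in a small function that takes the target line ending as a parameter. This is a sketch with a made-up name, normalize_eol; it shows that a mix of all three styles ends up uniform.

```php
<?php
// Hypothetical helper: rewrite every CR, LF or CRLF line break as $eol.
function normalize_eol(string $string, string $eol = "\n"): string
{
    return preg_replace('~(*BSR_ANYCRLF)\R~', $eol, $string);
}

$mixed = "one\r\ntwo\rthree\nfour";
echo json_encode(normalize_eol($mixed)), PHP_EOL;         // "one\ntwo\nthree\nfour"
echo json_encode(normalize_eol($mixed, "\r\n")), PHP_EOL; // "one\r\ntwo\r\nthree\r\nfour"
```

With (*BSR_ANYCRLF), \R matches a CRLF pair as a single break, so Windows line endings are not doubled.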