SCIENTIFIC FEAST
(Propositions, Ideas, Realizations — PIR)
Chris MYRSKI, 1992 and further
— — — — —
Remark: The original folder PIR on other sites begins with "An Illiterate World", which here goes under number Eng. № 20; then comes "Idea About New Calendar", which here is part (the last one) of Eng. № 7. Oth.A. of my publicistics; after this follows "Reflections About The Numbers", which here is under number Eng. № 19; then follows this Eng. № 21.A. PIR, where, after a small "Introduction", come specifically programmer ideas; then comes "Just Injustice", which is also in the publicistics, in Eng. № 7. Jour. D; and then follows the second part of this folder, Eng. № 21.B., where there are quite original ideas for the sphere of mass services, as well as a long list with a heap of various ideas in all of my works. A third part of this folder with some linguistic ideas can also be expected.
— — — — —
Contents Of This Booklet
Introduction to the whole PIR
Computer Program For Splitting Of Words Of Different Languages
Computer Program For Compressing Of Files Of Different Types
Ideas About Browsers Searching In The Internet
— — — — —
INTRODUCTION
I think some introductory remarks are necessary here because this is not a book but a folder with quite motley materials, and they are surely not fiction, and the name is not only "Feast" but also PIR. About the name it is easy: the word "pir" in Russian — and I usually publish everything on the Internet initially in Russian, not even in my mother Bulgarian language — means "feast", but the letters work almost the same way in English if taken to stand for "Propositions, Ideas, Realizations" (or, then, Researches).
Then the things are quite different, because having been left without permanent work with the coming of our democracy, and being, after all, a scientific worker, I have chosen different fields in which to spend my time interestingly and to try to apply whatever ideas emerged in my head. This means chiefly that the ideas were set to me by nobody, the results can be questionable, but they, surely, are non-traditional and some of them are even urgent. Initially and mostly these are linguistic themes, like the one about a worldwide alphabet, or about the many ideas hidden behind the numbers; then in recent time emerged my English (and not only) Latin transliteration (and there are a pair of other relatively related materials about the Bulgarian language, which, however, for the moment are placed in another folder — or rather in two, in different languages and with different approaches).
But there are other ideas: there is proposed a new decimal calendar; there is a quite serious idea for jurisprudence about unification of damages and guilt in lawsuits, together with personal modification of punishments. Then quite recently emerged three programmer ideas, two of which discuss very old programs of mine, for DOS: one for splitting of the words in every possible (Latin or Cyrillic) language, and one for compressing of files of any (I just like the universality and the related word "any") type — programs which worked pretty well and can be transferred to another operating environment — and one new idea about bettering of the browsers' searching in the Internet. You see, approaching my 65th year I decided that there is no sense in keeping valuable ideas or realizations for myself, and chose to publish them, even without payment and however amateurish they may seem. Then there are two, maybe winning (but for about a year there has been no answer to them) business ideas, about bank deposits, and about a kind of advertising in the supermarkets, yet not of the products, but of the very shops.
There may be expected to emerge (if I live long enough) a pair of other linguistic materials, about a kind of bettering or correction of the English language; maybe also something else, it depends.
Ah, and because I am not a traditional fiction writer but rather a popularizer of many simple (well, relatively — I mean, without higher mathematics) ideas, I try to open the eyes of people (who, willingly or not, don't like this, they like chiefly to be deluded), and because of this I have in almost every one of my books or journalistic papers a PIR-idea, sometimes a heap of such ideas; so, in order to provide the reader with some kind of guide through my creative works, showing what can be found where in them, I have included at the end the last material, about other PIR ideas outside this folder.
This is all. If you like my ideas then read them and try to implement some of them in reality, they are worth it, at least I think so. But if you don't like them then don't worry: such people like you are highly necessary to build the background against which such clever heads like mine can stand out and be noticed, so I am only thankful to you. Ha-ha.
March 2016, Sofia, Bulgaria
— — — — —
COMPUTER PROGRAM FOR SPLITTING OF WORDS OF DIFFERENT LANGUAGES
This is not just an idea, it has been realized by me, I used it for some time and was quite satisfied with it, so that here exists a precedent, which is important, because it is known that the thing is possible — not just a search for something without knowing whether it can be found or not (as, for example, is the matter with the existence of God). This program worked, but that was about 20 years ago (well, at least 15), and under DOS, and with the emergence of Windows everything became more complicated, and I ceased to waste time on it (and to try to find the necessary software free). And since on the new platform everything has, in any case, to be made anew, and I also have the right to keep the details to myself, and besides I am publishing this on a popular site, where it is not the rule to give fragments of programs in algorithmic languages — for all these reasons I will not make efforts to find the program and look at it precisely but will describe everything from memory. Nevertheless I am usually verbose, so that if one shows desire, and if he (or she) is a programmer, then he will be able to concoct something similar, although in Windows everything is more complicated: one must work at the level of pixels, not character positions, all fonts must be taken into account, and so on, so that this will also not be so easy. Anyway, I will return to this point at the end, because such a possibility is simply obligatory for each browser or screen of a mobile computer device (even for a phone, if it has enough memory).
What is necessary to stress is that the program must work satisfactorily well (and mine did) for a mixture of languages even in one and the same file, or for the pair of languages with which one usually works; it must allow easy adjusting, if in the given languages there are contradictory concepts about splitting of the words, and must also reflect the preferences of the person, must perform justification to the right border of the page, and other such moments. I had a variant that worked standalone (as a separate program): the name of the input file (.txt) was given, the name of the output file (.prt) was given, and the program transformed the first into the second, after which it only remained to print the latter file. Ah, I worked only with Cyrillic and Latin alphabets; if one wants to do splitting of words also, say, in Arabic, then there can be other difficulties, but for the Greek alphabet at least there should be no special problems (at first sight — I have not thought about this). So, but because in various languages people have quite different views on the splitting of words, I must say a pair of paragraphs about this in the beginning.
1. Basic notions about syllables, types of letters, and universal rules for splitting
Here I don't discover America, yet there is some personal contribution of mine to this topic; I touch these matters in my idea about a universal world alphabet, and they must be kept in mind, because, precisely speaking, for all, at least European, languages there are simply no universal rules for splitting! In Russian or Bulgarian they are relatively simple, but in English this is not at all so, and there exist dictionaries where the splitting is given for every word; it is based mainly on the syllables, but even here opinions can differ. In German there are their own rules, usually by syllables, but it is wholly inappropriate to split some letters, and they normally use 2-3 characters (further I will write only chars) for one letter (consonant), sometimes even 4; in English (and French), on the other hand, it is quite ugly to split the vowels, for the reason that several of them are read together. Because of this I first asked myself the question: what is a syllable, with what is it characterized, how can it be distinguished?
Well, it turned out that a syllable usually, not always but in most cases (say, in 90% of them), begins with a consonant (which I will shorten to only C., and similarly the vowel will be given only as V.) and ends how it happens, but blessed are the languages where the Vs are simple. And now we come to my views about what kinds of V. sounds exist at all (in the world), because we must somehow recognize — without dictionaries, of course, for we can't use dictionaries for all possible world languages — where the V. is as if one (although there can be several chars) and builds a syllable, and where these are several separate Vs. For example: the word "piano" has for the Slavs 3 syllables, but in English "ia" must not be split, and this is true also for the word "ready" or for "beauty". Taking this into account I have come to the conclusion that there exist three types of Vs (chiefly, but this can be applied also to the Cs), namely: basic, modified, and diphthongs or combined Vs.
The basic Vs are only six, they are: "e", "i", "o", "u", "a" and the Bulgarian "ъ", which is the same as in the English "girl" (I can't indulge in profound explanations here), and which I will mark as "ə". The modified are when one wants to say one V. but says another one, like in your "but" which is pronounced as 'bəat', or in the classical (i.e. from Roman times) case of "ae" which is read as 'ae' (like in your "hand") — and I suppose you have guessed why I am using subscript, as well as that in single quotes I give how the words are pronounced, while in double ones how they are written. In Russian there exist their own modified Vs, for example their usual "e", which is pronounced as 'ie' (so that their "devka"-girl sounds more like the Ukrainians pronounce it, 'divka'); also their "donkey" sound 'əi' ('ы' in Cyrillic), as in 'məi'-we; also their unstressed "o" is pronounced nearer to "a", or exactly (in my opinion) like the English 'əa' (in, say, 'əakno'-window, written as "okno" — the alphabet is different, but I have somehow to write it with the Latin alphabet only). There is no necessity to go into bigger details, because in this program I don't do such analysis, but this explains why I have accepted my concept of splitting of Vs, which I will give below. And diphthongs, in my understanding — because some people, even linguists, call, for example, the French "ai", which is read as pure 'e', also a diphthong — are such combinations of Vs where they are really several, like in your "pear" read as 'peə'; in Russian and some other languages there are no such Vs except the classical 'aj', 'ej', etc. (and here with "j" I signify the semi-C. "jot", like in the English "may" read as 'maj').
So, and now we come to this, that the letters are to be distinguished: there has to be fixed in the program, for each letter of both alphabets (and what has to be done with the various "chicks", i.e. diacritical marks, mainly above the letters, I will also say at the end of the paper), a characteristic of its type, and these types must be three! Id est, Vs, Cs, and undefined — but still letters, not other chars. These undefined letters (if I have not forgotten something, but in all cases this question has to be handled somehow), say, the Cyrillic 'j' (or also the German one, like in their "Johann"), or the Latin "h", as well as "w" (especially in English), play the role of modifiers of the previous letter and must be joined with it, or take the same characteristic as it has (what is to be done if such a symbol is at the beginning of the word I can't recall exactly, but this is not important, because one letter must not remain alone on the line).
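For illustration only — I have said that on the literary sites it is not accepted to give program fragments, and my program was in Pascal anyway, but let a small sketch be forgiven me — such a type table could look, in, say, Python, like this (the concrete letter sets here are merely examples, far from complete):

    # A minimal sketch of the three-way letter classification; the sets
    # below are illustrative only and must be filled for real alphabets.
    VOWELS     = set("aeiou") | set("аеиоуъ")      # Latin plus some Cyrillic
    CONSONANTS = set("bcdfgklmnprstvz") | set("бвгджз")
    UNDEFINED  = set("hwj")                        # modifiers of the previous letter

    def letter_type(ch):
        c = ch.lower()
        if c in VOWELS:     return "V"
        if c in CONSONANTS: return "C"
        if c in UNDEFINED:  return "U"             # splitting allowed only after it
        return None                                # not a letter at all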
When the syllable is more or less correctly recognized, then we can move to the performing of the splitting of the word, to the separating of a full-fledged syllable, which we will consider in the next point about my realization. Yet I want to stress that this program works, as it is accepted to say, heuristically, i.e. it gives some solution which is sufficiently good, but not necessarily the ideal or grammatically correct one. (For example, it turns out that the Latin "ck" has to be read as "kk", and this could be taken into account by the splitting, leaving a letter "k" on each of both lines, but I think this is unnecessary luxury.)
2. How I have done the very splitting of the words
Well, it starts from the end of the word that does not fit within the given length of the line. Working only with chars, going from the end to the beginning of the word, and heeding the number of chars in both of its parts, the possible syllable is detached (though there may be several variants), and, adding the dividing hyphen, it is checked whether the first part of the word can be placed; if not, the process continues further, initially without adding a new syllable, and, when this becomes necessary, beginning to form a new syllable, whereupon it is checked again whether the splitting can be performed; and so on until the minimum of allowable chars at the beginning of the word is reached (this is 2, but 3 can also be required, or that words with less than 5 chars are not split at all; I usually split also 4-char words), and then the word is left as it is and a new line is begun with it. When the word is processed and a splitting is found or not, and when something is taken from the initial line, it still remains to expand the line to the right end, adding intervals between the words, again moving from the end to the beginning of the line.
Naturally, here are necessary procedures for recognizing the end of the word, in the phonetic sense — such as spaces, punctuation signs, other symbols; and also, I think, to leave alone such words that are not pure words, i.e. in which there are numbers, or special symbols, or mixed alphabets — they must be left as they are, for these can be passwords, new symbols, and others. It is also natural that it is necessary to give the length of line that we have to get (by default, I think, 70), the number of chars that can be left at the end or at the beginning of the line (provided a syllable is found, because otherwise the process continues further), and some other parameters of the program. But it is necessary to give some other important things, too, because till this moment we have only said that it is not clear whether it is necessary to split two Vs or not, or whether a sequence of Cs can be separated or not, but have not explained how exactly the problem has to be solved, and we can't leave the program to stand and think like Buridan's donkey, from which haystack to begin eating. So let us now proceed to this.
Here I have introduced the following heuristic rules: vowels as a rule are not split, unless we allow some exceptions, and then it is necessary to input a list of exceptions (I think it was made so, but am not 100 percent sure; maybe I inputted a list of prefixes ending in a V.); consonants as a rule are split, but a list of combinations of Cs which must not be split has to be entered (here I am convinced that this was so); undefined symbols (neither Vs nor Cs) must go with the previous char, i.e. splitting can be done only after them. Inputting into a DOS program a list of things that have to be honoured is not so easy, but I have made this (at the worst it can be inputted as variants in the declaration of constants, and a new translation, i.e. compilation, performed). So the program must be given a list of non-splittable combinations — such as the German "sch", also "ch", "ck", "sh", "ph", etc. (10 - 15 combinations, not really much) — and a list of splittable combinations, chiefly of Vs — say, "ea" in German is allowable to split (but "ae" still is not), or maybe we will want to split "eu", or "ia".
Then the sequence of working of the program is such: if we can't split then we skip the place and go further (to the beginning); if splitting is possible (listed Vs) then we split them and finish; otherwise (what happens most often) we always split equal Cs (in some languages this occurs frequently), always split V-C, sometimes split C-V (but if before this there is another C., then split there), never split before an undefined char (as I have said), and that was about all.
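For the programmers among the readers, this decision could be sketched roughly so, continuing the illustrative Python from above (the lists are, again, only examples, and everything must be adjustable):

    # Sketch of the split decision for the boundary between word[i-1] and word[i].
    NO_SPLIT = ["sch", "ch", "ck", "sh", "ph", "th"]   # never divide these
    DO_SPLIT = ["ea", "eu"]                            # vowel pairs allowed to split

    def may_split(word, i):
        w = word.lower()
        for combo in NO_SPLIT:                 # a forbidden combination always wins
            for k in range(1, len(combo)):
                if i - k >= 0 and w[i-k : i-k+len(combo)] == combo:
                    return False
        a, b = letter_type(w[i-1]), letter_type(w[i])
        if a == "U" and i >= 2:                # a modifier takes the type of its letter
            a = letter_type(w[i-2])
        if b == "U":                           # never split before an undefined char
            return False
        if a == "V" and b == "V":              # vowels stay together unless listed
            return w[i-1 : i+1] in DO_SPLIT
        if a == "C" and b == "C":              # consonant pairs are split
            return True
        if a == "V" and b == "C":              # V-C is always split
            return True
        return False                           # C-V: the C-C boundary before it, if
                                               # any, will be found by the scan itself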
It is good that in DOS no special chicks above the letters exist, and that all letters are equally wide. But on the screen, as also in printing under Windows or another contemporary operating system, all these problems exist, so that it is necessary to write the next subsection.
3. What has to be done when the letters are of various sizes, types, and fonts?
Here I will mainly guess, because I have no experience in programming in another operating environment, but I think that I will guess rightly; i.e. there simply must exist special procedures which do this work for you, for the reason that you all know well that when you type something in a file you write on the given line while this is possible, and when it is no more possible you automatically happen to be on the next line. That's it. But what these procedures are, where they are and how they are written (most probably in Visual Basic), how they are called (i.e. their names and parameters), is unknown to me, for the simple reason that nowadays it is necessary to pay for everything, no matter whether you do something useful for the others, or want to gain something only for yourself. I used to work for the others, but when with the coming of democracy I was forced to work only for myself, then I ceased to work for the others, and have begun to write various things per il mio diletto (for my personal pleasure, in Italian), right?
But well, so be it, I will tell you, after all, my opinion also in regard to these problems. The program has to be called by hitting some button on the panel of your editor or browser; there must be a possibility for its adjustment by entering new non-splittable sequences, also new splittable Vs (or prefixes, syllables), as well as new Vs and Cs, for there can be various alphabets. So we have come to the question of what has to be done with letters that have something above or below them: the program must recognize these letters, too, no matter how many they may happen to be (and they, in theory, have long been recognized, because in the making of indexes in Word they are treated correctly), and this for every alphabet. Maybe it will be necessary to maintain for each letter also its belonging to an alphabet, in order to be able to recognize mixed words, but maybe this can be left out, because there are a heap of languages where there are a pair of symbols added to the main ones (say, "i" in Ukrainian). Then, when an unknown letter (but not just any symbol) is met, the program has to process it as an undefined letter and allow splitting only after it.
As to the various fonts and possibilities for adornment of letters that change their size, for this the very operating system must be responsible, with the calling of some function (say, DoesFitInLine(string, linelength): Boolean). But in order that one were able to correct what the program has done, there must be a possibility to call it only for the marked area, if such exists, according to the situation on the screen (i.e. to continue the paragraph), and to acquire all parameters for it, which are to be taken from the current paragraph (i.e. here are necessary calls of other service functions and procedures of the operating system or the browser). Ah, texts in tables have also to be treated according to their parameters for the paragraphs.
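How such a function could be used I can only guess, but the line-filling loop might look so, in the same illustrative Python (measure_width here stands for whatever text-measuring call the given system really offers; its name is invented by me):

    # Sketch of filling one line using a pixel-measuring function.
    def does_fit_in_line(text, line_width_px, measure_width):
        return measure_width(text) <= line_width_px

    def fill_line(words, line_width_px, measure_width):
        # Take words from the front while they fit; return (line, remainder).
        line = ""
        while words and does_fit_in_line((line + " " + words[0]).strip(),
                                         line_width_px, measure_width):
            line = (line + " " + words[0]).strip()
            words = words[1:]
        return line, words       # the next word may then be tried for splitting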
But, mark this: however imperfect such a program may be, it will in any case be much better than the contemporary state of things, on which we will dwell in the next subsection. And if some of you want to object that in real time, when one can with the hitting of a key (or with the mouse) change the width of the screen, this will trigger reformatting of the whole document, then I will answer briefly: this is not at all significant for today's computers. I personally (till 2015) work on a computer with only 256 MB of operating memory, can't watch whatever movies, have only 10 GB on the hard disk, of which I keep about 1/3 free, the processor speed is about 600 MHz, and under these conditions, when I edit a whole book of roughly 200 pages, I can perform spell checking in a pair of seconds, and this, positively, is more difficult than my program for splitting of the words. But even if such processing required time, if it goes about printing this does not matter, and if it is about showing on the screen then it suffices to process the first pair of thousands of symbols, until the screen fills, and to continue further in background mode; and, besides, one can calmly bet that the contemporary computers are at least an order of magnitude faster than my own. So that the point here is not in the slowing down but in the lack of desire, and in not paying the necessary attention to the problem.
4. And is it worth the bother at all?
Of course it is worth it, because with all the fantastic power of contemporary computers and software products, without such a program we live as if in the ... stone age, or at least a thousand years before Christ, when people chiseled on stones letter after letter, paying attention neither to intervals, nor to punctuation signs, nor to paragraphs and pages! It is really so. Even when messages are sent to mobile phones they are placed exactly so, and the words are split wherever it happens, without whatever hyphen (say, I could have quietly received the following message on my phone: "Hello, Mister M|yrski, you have i|nvoice for used se|rvices for the amo|unt of lv 12.3|4." ... and so on). And even without such a program it would be necessary only to accept one symbol as "optional hyphen" and interpret it correctly on every terminal device, and all messages could be written inserting it everywhere where it can be needed — but till the present day nothing of the kind exists.
Well, here everything started with the English, where the conditions for splitting are complicated, and not clear to the average individual, and the words are significantly (probably by 3-4 symbols) shorter than in Russian or German, but this is simply not nice, and it means about 10 percent unneeded waste of paper, if the texts are really printed. But I suppose that there is another reason: people in commerce are used to perfection, and when this is difficult to get then they say: let us do without this thing (which, in my opinion, is judgement on the level of kindergarten). Even the simplest splitting after the last of a sequence of Vs (i.e. before a C.) and splitting of some Cs would have been quite acceptable.
Yet an alternative approach is possible, which consists in this, that grammatical analysis has to be employed, where for each of the words its whole splitting is maintained. But this not only will require working in every language separately, which is not at all universal, and I like universality, but will hardly be faster; the grammar is far more complicated, I think. Or then each file could be passed through a program (once, and not in real time) which, based on a whole dictionary of all possible words and their variants (because of the cases, genders, plurality), adds all possible optional hyphens, and then from the browsers it is required only not to show those of them on which the line does not end. This is also a very good variant, at least for official documents, like laws and other instructions, but for some reasons of theirs the Internet people have decided to abolish this special sign, together with the end of line or section, et cetera. If the browsers allowed it, then the writing of a dictionary with splittings of the words would have been only a tedious but not at all difficult task. But they have done what they like, only not to sit still and be satisfied, at least in some aspects, with the abilities of the mentioned stone age.
Well, gentlemen, as you want. I have given the example, have made a decent program, and if somebody wants to renew the idea for every editor or browser then let him (or her) try. I personally don't intend to occupy myself with programming, for the simple reason that I have no time — putting it briefly: I am little read and because of this have begun to translate myself, firstly into Russian, then (at the moment still) into English, think to make partial translations of my things also into German, and in addition to this I have planned work for myself for at least the next 5 (but maybe whole 10) years, while according to the statistics I have at my disposal approximately 5 years of life. That's the situation. But if some company thinks to become engaged with this, then let it first send me 1,000 euros, to make the sum round, only for the presenting of this quite realizable idea, and then we will see what more they can require and whether I will be in a position to offer it to them and on what conditions.
Nov 2014
— — — — —
COMPUTER PROGRAM FOR COMPRESSING OF FILES OF DIFFERENT TYPES
0. Introductory remarks
This idea was realized by me about 20 years ago, when I worked still on a 286 computer, under DOS, and using a Pascal translator (compiler) from 1986, if I am not wrong. And the program worked, and still works for me, but when everybody began to use Winzip then in the end I also moved to it. But I wrote it, for one thing, in order to use the diskettes better — my files could not fit on them and it was necessary to split them in parts — and, for another thing, in order to experiment with my hashing, because in this translator it was impossible to allocate more than 48 KB for all variables, and I needed one array of at least 64 KB, but better twice bigger, and as much for other arrays. Id est this was a challenge for me, but not only this. I decided to use one directly anti-scientific idea: to make compression without analyzing the type of the file (they are so many, and all the time new ones emerge; I did not intend to study them all, nor to compete with branded products), but to compress the file to the size which ... the very God (or the nature of the file) allows, by coding of the characters!
I will explain this further, but at the moment want to add that, in spite of all possible difficulties with the hashing (I think I defined this main array on the order of several KB, i.e. at least ten times less than really necessary), the program worked pretty well: for text files it gave compression of about 45% (Winzip gave somewhere about 55%), and for some other types, where there are images, even down to 15% of the initial file. What isn't good, however, is that it works terribly slowly, because it makes several iterations, reads everything byte after byte, compares them, forms some arrays, then processes these arrays with the information from the entire file; and if the file was on the order of 100 - 200 KB (as it was in the beginning, under DOS) then everything finished in mere seconds, but if the file was, say, 2 MB, then sometimes a whole ... hour was necessary (the dependence on the length of the file is probably exponential, but in no way linear). So that I, as far as I remember, informed either an office of IBM or some other company, but decided that this is enough — the program is mine, it works, but I don't intend to spread it further, because nowadays the files grow very fast, and if this is video information, then who knows how long the program will work.
All this is so, but ... Now see, with me there are often "buts", especially with nontraditional approaches. So here the point is that if there were no hashing, if I could allocate a whole MB for variables, and even more, make virtual arrays for the files (of, say, a pair of MBs) and perform block reading into them, then this could have quietly reduced the time by a whole decimal order, if not more. But this is not all, because the output file by me looked like ... something out of a meat mincer: I changed the very bytes, formed my own "bytes" (or words, putting it more correctly), and this could have given very good ciphering (encryption)! Because people still search for all possibilities for better ciphering — it turns out that this is so for providing better security of bank operations, where over-large prime numbers are used (say, with several hundreds or even thousands of digits, where the finding of their prime factors is a quite time-consuming operation, but the checking is nearly trivial, via multiplying, although not of the usual floating point kind, but integer with an unlimited number of digits).
So, see, I think that there is still some "bread" (as the Bulgarians like to say) in my "crazy" idea, and it is better to explain it in general outline, to bring it to the knowledge of all who are interested, than to take it with me to the grave. Only that I will purposely not look at the exact realization but narrate from memory. This will preserve some know-how for me, and if the program has to be transformed for a new programming environment then it, anyway, must be rewritten (i.e. written anew). And also such a method of presenting dry professional matters will make them accessible to all, as it has to be on a literary site. Well, this is enough as introduction.
1. My idea about compressing the file to what its very nature allows
Everything began with this, that I noticed that in one file far from all possible symbols are used. For example, if this is a text file in Latin, then usually about 50 possible characters (and let me shorten this to chars) remain unused, and even if it is in Cyrillic (under DOS, of course, where the char table is with 256 chars, but in the first Windows, too) there also remain about 10-20 chars; only .exe files use nearly 100 percent of all possible chars. These chars can be used. How? Well, for coding of sequences of chars, and exactly the most often used ones! This is always done in my zero pass. But this, obviously, is a drop in the ocean; I do it only in order not to leave anything unused. Further, I simply extend the size of the bytes and make my own "bytes", of 10 bits, or of 12 (or more, yet the experiments have shown that 16-bit "bytes" have too big dimensions for it to be possible to save something by their usage). If the new bytes are of 10 bits, then if the first 2 bits are "00" these are the old chars, and there remain also 3 variants, i.e. 3 char tables for coding! This is something decent. Now it remains only, for each given file, to find what more has to be added, what new symbols must be formed, adding one such table in every pass, analyzing sequences of old and new symbols. But you see already that the bytes are shifted, and it is not known what every real byte means, so that repeating of some bytes does not mean real repeating of symbols; here the things are mixed.
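To show what I mean by these elongated "bytes", here is how the packing of a stream of 10-bit codes into ordinary bytes could be sketched (again in illustrative Python; this is not my old program):

    # Sketch: pack codes of a given bit width (here 10) into ordinary bytes.
    def pack_codes(codes, width=10):
        buf, nbits, out = 0, 0, bytearray()
        for code in codes:                  # each code is 0 .. 2**width - 1
            buf = (buf << width) | code
            nbits += width
            while nbits >= 8:               # flush the whole bytes
                nbits -= 8
                out.append((buf >> nbits) & 0xFF)
        if nbits:                           # pad the tail with zero bits
            out.append((buf << (8 - nbits)) & 0xFF)
        return bytes(out)

Codes 0 to 255 (i.e. with "00" in front) are the old chars; the three remaining quarters of the range are the new tables.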
So, and what to add? Well, that which is necessary for each individual file, what occurs most often in it, roughly speaking. In theory I wanted to do an analysis of all possible sequences of symbols, at least within reasonable limits, i.e. first of 2-byte sequences (which does not mean dividing the file into couples of bytes, but taking all possible such couples, i.e. n and n+1, n+1 and n+2, etc.), then, if necessary, of 3 bytes, then of 4, and maybe of five. After this, somehow to weigh the number of occurrences according to their length and choose the most often met sequences, which are to be coded by adding new symbols for them, where one such symbol will now mean two (or more) initial symbols. Do you see where the economy can hide, and why it depends on the very file, not only on its type but on the concrete file? And really, with the hashing I inserted intermediate printouts of the maximal repetitions of the pairs, and in the beginning they were, it seems to me, several thousands (if I am not wrong, because this can be so if I used 2-byte numbers for the counters, but maybe I used 1-byte ones), and at the end of a new char table they were about 10 times less.
Soon, however, I decided that this is unnecessarily complicated and that in this way I would increase the time by a pair of decimal orders (!), so that I have left only the 2-byte sequences, but in recompense the existence of several passes is obligatory; so with 3 passes, if in the file there are met, say, series of spaces, then in the end I will substitute 2^3 = 8 such spaces with one new symbol. Put otherwise, with some approximation, a substitution of two adjacent symbols several times over must be equivalent to a substitution of a group of (4-5) symbols. Because of this the number of passes is necessary; it would not be good if in one single pass I had added all possible chars.
So, and how, after all, must the sequences be analyzed? By usual counting of the number of their occurrences, of course; so that, in principle, if we want to abandon the hashing altogether, we must maintain an array indexed by 2*newbytelength bits, which in the variant used most by me (because the experiments have shown that so it is better), of 10 bits, would have given an index of 20 bits, i.e. 1M elements, and with 2 bytes for the integer counters this makes 2 MB. The dimension of the new "bytes" must be used also for the old bytes, in order that it be possible in the second pass to use, together with the old bytes, also the new, elongated "bytes" representing now 2-byte sequences. In this situation it turns out that my hashing, really (I have said that I purposely don't want to look at how it was in reality), has used less than one percent of the real arrays, and for this reason the time consumed by me is so overwhelming, and also the economy is not so big (i.e. if the beginning of the file is not very representative of the entire file then the hash table becomes full very soon, and further, even if more often met combinations appear, they are simply skipped, there being no more space for placing of counters).
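The statistics pass itself is elementary if the whole array (or a dictionary) fits in memory, which today it does; sketched in the same illustrative manner:

    # Sketch: count all adjacent (overlapping) pairs of codes in the file.
    from collections import Counter

    def count_pairs(codes):
        pairs = Counter()
        for a, b in zip(codes, codes[1:]):   # positions n,n+1, then n+1,n+2, ...
            pairs[(a, b)] += 1
        return pairs                         # no hashing compromise is needed now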
Further, even if it is possible to allocate space for arrays with dimension of a pair of MB, this is hardly reasonable to do, because these arrays have to be all the time in the operating memory, otherwise we will save nothing regarding the time; and virtual arrays are also necessary for the very files (for portions of them, for some buffers), because they are thoroughly scanned, byte after byte, and this within the new limits and sizes of bytes. So that some hashing will maybe always be necessary, but then for about one quarter of all possible combinations, not for ridiculously small portions.
2. Forming of the new files
When sufficient statistics are accumulated about the number of occurrences of all sequences of 2 new bytes, then this array is ordered and the first 256 pairs are separated from it, in order of diminishing use. If I am not wrong, I don't order the entire array, but find, 256 times, that pair which occurs most often of all (a search for the maximum value of the counter field) and write it in a temporary table of new symbols. When it is known what must be changed, I go once more through the entire file and, reading all "bytes", substitute them or not with new ones and rewrite them in the new variant of the file; only that before all this I put the very table with the new symbols, and a pair of other indicators, like the number of the cycle. With some trick I manage to use only two files, alternating them: I read from the one and write to the other, and then, in the next pass, vice versa. Well, and it is necessary to pay special attention to padding the new file to a whole number of bytes (with usual zeros), so as to avoid possible errors in the reading of the new non-byte "bytes". Naturally the tables for all i-1 previous iterations are also kept (either after the current one, or at the end of the file), but each iteration works with "bytes" of the new length. It makes no sense to provide a possibility to finish one iteration earlier, because the place for the new table is established in advance. In this way, after the executing of the given iteration, the files change their places, the arrays are zeroed, the just written file is read again, its symbol table in the beginning is detached and rewritten in proper time into the new file, and this main part of the file, without the tables, is analyzed anew for the frequency of occurrence of two consecutive new "bytes", in the same way as in the previous iteration; then again a new table is formed, the file is coded again, and written.
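One such pass, stripped of all file work and bit shifting, could be sketched so (the real program, I repeat, worked with files, by portions of bits; this illustration works on a list of codes in memory):

    # Sketch of one coding pass: the 256 most frequent pairs become new codes.
    def one_pass(codes, first_new_code):
        top = [p for p, _ in count_pairs(codes).most_common(256)]
        table = {pair: first_new_code + i for i, pair in enumerate(top)}
        out, i = [], 0
        while i < len(codes):
            pair = tuple(codes[i:i+2])
            if len(pair) == 2 and pair in table:
                out.append(table[pair]); i += 2   # two old codes become one new
            else:
                out.append(codes[i]);    i += 1
        return out, table    # the table goes at the head of the output of this pass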
Exactly this information at the beginning of each iteration of the file, containing the last table with 256 new signs, is used during the deciphering or decoding of the file, by the decoding program, which works significantly faster, because it only reads portions of bits and either substitutes them with a new sequence or leaves them as they are. So that the time of working is not such a significant factor, because in the deciphering (or expanding) of the file the work goes pretty fast. And as far as 256 new chars are always added here, and as far as the number of the iteration is established by the first bits, only the pair of old symbols (from the previous iteration) is written, and not the new symbol with which it has to be substituted (this is unnecessary).
If, though, these tables are written to a separate file, then they can be used also for encrypting of the file, which we will consider in the next subsection.
3. Variant of encrypted compression
Because without the tables of new symbols even the devil, as is said, cannot understand anything in the file, it is clear that if they are sent separately the security of the compressed file can be preserved. Yet I have in mind not such security where one (the decoding) program can, after all, decode various files (having at its disposal their new tables), but new personal ciphers for every user! This can be done elementarily if, before the writing of the table, a simple swapping of several entries is performed. Since the table entries are 256, i.e. 2 hexadecimal digits each, a record of, say, 6FA7 will mean that the 111-th element of the table (a pair of symbols, with the necessary number of bits, for the given iteration) has to be exchanged with its 167-th element, and vice versa. This is really a cipher, so that the decoding program can be made universal and the tables with symbols need not be written to separate files, but these swappings, for each iteration, are to be sent secretly, not by email, but in a separate envelope, something of the kind. Then the decoding and decompressing program can simply ask for strings of hexadecimal numbers, say 4 swaps per iteration, or 12 such quadruples of digits like the one cited above, and when one enters them (or copies them from some file on his computer), then everything will be OK; but if he does not enter them, then it will be assumed that there are no swappings, and in this case some bit-mince will be got, as I said in the beginning.
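In the same illustrative Python this swapping is a couple of lines:

    # Sketch: apply the personal cipher, a list of hex quadruples like "6FA7",
    # each ordering the exchange of two entries of the 256-element table.
    def apply_swaps(table, hex_swaps):
        for quad in hex_swaps:                    # e.g. ["6FA7", "03C2", ...]
            i, j = int(quad[:2], 16), int(quad[2:], 16)
            table[i], table[j] = table[j], table[i]
        return table

The decoding program applies the same swaps before using the table; without them one gets the mentioned bit-mince.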
The possibilities for changes, in theory, for 10-bit "bytes" and 3 iterations, are not even 3*256 but much more, because several swappings involving one and the same entry can be done, which results in cyclical changings. Generating the ciphers in a random way (for example, making a program that starts and all the time generates random hexadecimal numbers until someone hits the necessary key) will allow absolute uniqueness of these ciphers. The only minus of this method is that remembering these sequences will also be difficult, and one will be forced to look them up somewhere, which means that the cipher can be found. But it is always so with the ciphers, isn't it? I have not worked with such things but suppose that it is so. Another variant is working with a running code from some book, if some method is established for converting letters (from every alphabet) into hexadecimal numbers, which isn't so difficult (say, the remainder modulo 16 is the simplest way — and that there will be some repetitions, as I said, does not matter). And if these are video files then the compression will become better by a decimal order, if not more, and the time for decoding, on contemporary computers, will not be significant.
But — this is my favorite word, at least in this material — if the char tables are already 2-byte (I am not 100 percent sure of this, but "feel in my guts" that it is so), and for video information words longer than a byte have also been used for a long time, then another alternative method of working is possible too, which we will discuss in the next point.
4. Coding in new character tables
In the beginning I want to remark here that if the tables are already 2-byte, then even in the classical variant it is worth, already on the zero pass, to make a swapping of the bytes in every 2-byte word, i.e. so that the bytes alternate: I II, II I, I II, II I, ... . In this way, as I suppose, a massive compression will be obtained in the 1st bytes, and the very tables will be in the 2nd bytes. This can lead to significantly higher compression of the files, and if they are not 2-byte (which can sometimes happen, I am speaking about any types of files), then this will not hinder the program, simply the comparing will be done by one byte. I personally have long thought to try this in my program, but can never find the necessary time — ever since I began to publish myself on the Internet, and also to translate myself, I have no time to spend on programmer work (more so with translators from the year '86 of the last century).
But in this subsection I mean that, when the tables are already so extended, then there is no need of extending the words (at least for text files, but I suspect that also for video ones, because with the coding of colour in 24 bits it turns out that the words are even 3-byte); in this case they are simply very slightly used, and it can happen that there remains place for thousands and more new symbols! How to find this? Well, it is needed simply to work with 2-byte words, and to make the whole analysis on the level of 2 bytes, which will bring with itself the necessity of compulsory hashing and work with words twice longer than here. This will at once increase the dimension of the task, but, as I have said, this might not become necessary, if the simple trick with swapping of the bytes in each word is done. This must be tried out somehow.
And everything depends on the size of the file. If the files are of MBs and more, and video files, then although the adjacent points will be very similar, still, in a big file all possible colours could be used. But maybe for audio files there will be not so many variants? The important thing in my opinion is to choose one universal method, so that it can be applied in all cases with quite good results, not to seek perfection; then also the ciphering will make sense. Ultimately some limit can be set (say, of 1, or 10 MBs) and each file processed in such portions and coded separately, although saved in one file.
Well, there are variants, and the program can be modified, but the significant moment is that I propose a quite working, and not badly working, idea; there exists a precedent. Only that I personally don't intend to interrupt my literary and translating activity, so I don't have in mind to spend much time on programmer experiments. If some company takes it into its head that I might be needed by them, then let them first send me 2,000 euros only for the realized idea, and we shall see further.
Dec 2014
— — — — —
IDEAS ABOUT BROWSERS SEARCHING IN THE INTERNET
This time these are only my ideas, maybe even naked ideas (but nowadays naked ideas are allowed, aren't they?), so that I want nothing from anybody, I just share my opinion. Because these are big and complicated programs: they are expert systems, and they are learning, and perform grammatical analysis in different languages, and all the time search through the web and actualize their tables for access, and so on. Besides, I am not at all a specialist in the Internet, I was only a programmer about 25 years ago, but our time is dynamic, so that I was left far behind. Despite this there are obvious things that simply poke you in the eyes if one is not prejudiced in something, if he does not defend somebody's private position — although it can't be said that the people there don't work properly. No, they work, but as if not in the right direction; they do what is easier and more impressive, not what is necessary. So that, ladies and gentlemen, browser specialists — as well as clients, because when the users decide to require something, it will soon emerge — if you want to listen to me, then good, but if you don't want to, then I have fulfilled my duty.
So, then I will begin.
1. General impression
The general impression from using any browser is reduced to this, that these are private companies and they try to jump ahead of the others with something — like, in fact, the various supermarkets — but these are usually nonessential things, this is throwing of dust in the eyes, and were it not for their concept of showing first what is said about the given request in a pair of (well, in a pair of hundreds of, maybe) sites, people would have given up using them at all! That's how it is. I think that I don't exaggerate: they have enviable achievements, but not in the area of searching; in one old Word there exist far better abilities for searching, by parts of the words, by a pattern, even, for the English, the strange search for sound-alike words.
Well, there are reasons for this. In the Internet the search can't be done at the moment when people ask; no, it is searched there all the time, and tables are maintained for every word (I suppose, but how else?), and later, when the necessity arises, these tables are joined or intersected. And a search by parts of the words is, in principle, silly. But, on the other hand, they search by one-letter words (say, by the English "I", or the Russian "ja"-I — it is one Cyrillic letter — etc.), so why not search, for example, for "multy-", or "-brev-", or "ang-", and so on, but without the hyphen, which I insert only to stress that this isn't a whole word? Yet the biggest difficulty for the browsers arises not when they do the search, but when they show the found information. And do you know why? I personally have begun to think about this only now, and for me it is clear that this is because on the web no order relation is defined, and it can't be said which site has to be shown before which, in principle! Due to this it turns out that what any browser shows you is ordered not in the necessary order, still less in the order in which you would have wanted it (although you have not told it in what order you want the things, but you have no such opportunity either).
What has to be done in such cases? As if there are only two variants: one is to cluster somehow the occurrences (by date, by languages, by countries, etc.) and choose to show only part of these groups, and the other is to introduce some counter for the priority of showing of the given site (there are not so many sites, it seems to me, in any case fewer than all the words in a given language, and there are a heap of languages, as also there can be many variants of each word). I think that the browser people use both this and that, where the priority is computed by the number of requests to the given site, or maybe even to the given page of the site, and also by introducing a list of most important sites. They do all this, I don't say that they don't, but not well enough for the end user. Let us look at this in more detail.
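For clarity, the word tables and their intersection, which I suppose are maintained, can be sketched so (illustrative Python again, unimaginably far from the real systems):

    # Sketch: an inverted index, one table (set of pages) per word,
    # intersected (AND) on request and ordered by some priority counter.
    def build_index(pages):                      # pages: {url: text}
        index = {}
        for url, text in pages.items():
            for word in set(text.lower().split()):
                index.setdefault(word, set()).add(url)
        return index

    def search_all_words(index, words, priority):    # priority: {url: count}
        sets = [index.get(w.lower(), set()) for w in words]
        hits = set.intersection(*sets) if sets else set()
        return sorted(hits, key=lambda u: -priority.get(u, 0))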
Now, when you write one single word in the window, everything goes more or less well, where "more" means that in the beginning emerge the Wikipedia and a pair of other, as if important, sites, and what comes after this is not interesting for you, and "less" means — well, why should sites be shown that are of no interest for you? The only reason for showing all occurrences of a given word is this: to check what is correct, because the writing of the word might be erroneous, and this is valid also for combinations of words (like the German — to give an example from a language with cases — "in diesem Abend" or "in diesen Abend" — and surprisingly for me I found that there are about 100,000 occurrences of the erroneous variant —, or you search for what is more correct in English, to say "depends on" or "depends from" — and here too there are many occurrences of the non-correct variant, but it is in another meaning —, yet it is preferable not to forget the quotes). This is a very good opportunity (as a by-product), but it is entirely useless to show all these occurrences; it would be enough to look at the statistics of usage and choose what is more often used. But well, what comes in addition is not so bad, after all; though the point is that there are not a few sites which in this way ensure a place for themselves by such requests, and when you look there they show you various ads, so that it turns out that the showing of unnecessary things serves the advertising and hinders the users, or you are caught on the bait, as is said.
But the worst work of the browsers happens when you search for several words, because then, despite all sorts of tricks with finding the roots of the words and skipping (as a rule) the conjunctions, i.e. in spite of the performed grammatical analyses of your request and the forming of various variants for searching, what is applied is, as a rule, union (OR), not intersection (AND) of the words. In this way, adding more words you don't narrow the search but, on the contrary, extend it, which contradicts common sense. And if you decide to write the words in quotes, for literal search, then you may miss many similar words. While what the user wants is some possibility, after performing the first search, to add something more and restrict the number of occurrences; but there is almost nothing to be added, because even the language and the country do not correspond exactly to their names, this being only something placed on the web in the given country, while it may be even in Swahili. All the attractiveness of the browsers is based first of all on the maintaining of many big ordered lists of often met requests and on the frequency of use of some sites; the systems seem quite intelligent, but their intellect is nearly the same as the intellect of a parrot.
So it makes sense to share with you some of my propositions, and how they could be implemented; what has to be taken away from the existing things, if it contradicts common sense, remains, naturally, at the discretion of the specialists making this software. But quite drastic changes are necessary, because the condition of the web is approximately on the level of the beginning of its emerging (like, say, in 1990), while the information has increased since that time surely more than 1,000 times, and what will be after a pair of decades is incomprehensible for the mind. So, I will not specially order my propositions but simply express some different ideas.
2. Reliability of the source, and other types of pages
For me it is obvious that there must be some opinion about the reliability of the source, because what the official instances, like state agencies and others, or also scientific organizations, express cannot be placed on one level with what the media say (this is generally cheating, I think you have no other impressions about them, just nice deception, which is liked by the majority of readers), or also various (competing, and for this reason contradicting one another) companies (the media also contradict one another), and especially with what everyone says who can speak (in fact, type on the keyboard), like school children, young people, pensioners, clients, et cetera. I don't say that the ones and the others and the thirds must not be listened to, but they must be distinguished.
What I mean is the following: there must be introduced a type of reliability of the site, as one of three (at least) variants: a) authorized, which must apply for this status; there must be some indicators that are to be satisfied by them, but first of all unity and centralization of the opinions, an official view of the things, at least in the framework of the state — these are official instances, and even not all of their sites, for there can be unauthorized sites even of Ministries; also official academic and educational institutions, and so on, but again with a single and official opinion, not one thinking so and another otherwise; and here, surely, belong the national variants of Wikipedia; b) companies and all kinds of organizations, media, societies, literary sites, and so on, which prove their belonging to this category by being registered as legal entities; and c) physical persons, i.e. everybody who wants (Sulyu and Pulyu, as we in Bulgaria say), who prove nothing; and if some source can prove nothing it is included in this category (say, blogs, where personal opinions can be added, talks and chats, questions and answers open to everybody, etc.). Then the search must be conducted by default only over the authorized instances (and such ones have to be not more than one percent, I suppose), and for the second and third categories only the statistics of occurrences are to be shown.
Only in this case can the Internet be used as an alternative to the former encyclopedias, for education, not for deluding of the easily gullible. But to take such measures in one single country there is simply no sense; here is needed the most difficult thing, a united decision of the whole Internet, and it just does not have, as I think, a global administrative body. Hence such a body has to be built, at the UN, maybe.
Further, stricter monitoring of the languages and countries is necessary on each site, i.e. it is necessary to introduce such parameters at the beginning of each page. For example, I am writing this material in Russian and place it in Russia, but I might place it also in another country and again in Russian, or it may be (as really happens with me) that I place something in Bulgarian, or English, or German, et cetera, on a Russian site; this is valid also for all ads, because despite the efforts of many computerized translators the language is still the most important parameter of each textual material. In reality one can give credence to the Internet only as to the date of appearance of the things — here everything is precise — but otherwise all is conditional.
3. Search in vicinity
As I said, I am not a specialist in the field of the Internet (only somewhere around), but I have not heard vicinity search spoken about, and without it the conducting of a more or less good search by more than one word happens to be quite unsuccessful, due to the lack of an order relation, while such a search introduces some order. What I mean is the following: the introduction, say via square brackets, of a sequence of words, which without quotes will be multiplied in all case and other forms, but with them will not, and which will be searched at a distance from one another, or at a maximal distance if there are more than two words, where the very amount of this distance (in words) will be given by the last parameter (or maybe with one more parameter for the whole group). By default 3 words have to be understood, or 2 to the left or to the right, but not more than 5 for the whole group. For example
[Myrski "Chris" 2]
or also
[population number world 3 7]
or also
[ [ "Chris" Myrski 1] [religion communism 2] Bulgarian 100 ]
and other variants, which surely are not so difficult, so that even housewives, as is said, will be able to write such requests to the browser; if there are no quotes it is supposed that the word can be varied, giving, say, "Myrski's" (but if this is in another language, then there can be many more variants, like — given in English — "Myrskij"), "The communism as religion" (or "Religion of the communism", or "communist religion", etc.), and somewhere the word Bulgarian (if needed, 10,000 can also be given). The basic work of the browsers during the search will not be affected by this, but the way of ordering of the results when they are shown will be changed, and the results can even be reduced to only one (i.e. to copies of this work on various sites). In addition, in this way wider requests can also be given, which can further be narrowed by changing some of the numbers, or adding new words, and this is very significant, because, as I have said, one must decrease the number of occurrences, not increase them.
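The very vicinity check is nothing frightening; it could be sketched so (naively, in the same illustrative Python, with startswith as a crude substitute for the varying of word forms):

    # Sketch: do all given words occur within max_dist words of one another?
    from itertools import product

    def within_distance(text, words, max_dist=3):
        tokens = text.lower().split()
        positions = []
        for w in words:
            occ = [i for i, t in enumerate(tokens) if t.startswith(w.lower())]
            if not occ:
                return False                # some word is absent entirely
            positions.append(occ)
        # is there one occurrence of each word inside a tight window?
        return any(max(combo) - min(combo) <= max_dist
                   for combo in product(*positions))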
4. Search by important parameters of the page
Now, when one looks at the abilities of contemporary browsers, one can come to the conclusion that people have always done things the way the browsers do today, yet this is not at all so. Libraries have existed for thousands of years, but nowhere and never was it possible to perform a search by the occurrences of some words (like, say, communism, party, Government, duty of the citizens, and so on)! What was possible to be done, and what can be done also today in every library, is to perform a search by the author, by the title, in Cyrillic or in Latin (or, maybe, in Arabic), by an index of the bookshelf on which they are placed, and also by a thematic catalog (when you don't know exactly the author or the title, or you are interested in several similar books). That's how it is. You could not find in them by this the word, I beg to be excused by the readers, "arse" (or only "ass"). I agree that the new must not always look back at the old, but it must somehow be coordinated with it; one should not reject the whole history before us and begin to live anew (as many of the young, probably, think). When till now it was worked so, then such possibilities have to be available also today, and if something else can also be allowed — well, so much the better, but as vertical upgrading with preserving of all old features. This was invented not by me, this is the right way of working in any area.
Well, as to the author and the title of the work, they surely will be found if all possible words are searched (although many citations of them will also be found, which isn't exactly the same), but where remain the keywords, by which, in fact, the search has to be performed (not by the conjunctions and occasional words), and the thematics? Here I also am not a coryphaeus, but there exists library education and the people there know these things, these are elementary truths for them, it can't be otherwise. As to the keywords, now everybody knows this word and uses it (as I do too on various sites), but this is not correct, this is amateurism: everybody puts whatever keywords he/she wants, which is not the right way, though it may be allowed (for lack of anything better). And then don't forget that if these keywords are in the language of narration of the document, then no difference can be made between their occurrences as words in the text and as keywords. For this reason they must be preceded by something that must not be detached from them, say the word "Index", or "Theme" (like ThemeDemocracy, which is my favorite theme). This now is better, but it is not enough.
"Ah well, and how is it right then?", some of the readers can exclaim, and I will say again: ask the specialists in library matters. They will tell you that there have to be established indexes or thematics for all libraries (and the Internet is one enormous library), which must even be written at the beginning of the books, on the second page (as I, if am not wrong, have seen in some American books, that they are cataloged in their catalog). So that here, I repeat, must exist some administrative body for the entire world, which has to represent the Internet, for example a "Commission on Internet by the UN", and they have simply to work out the necessary requirements, and in one language, let this be the English for the time being (although it leaves a lot to be desired). In general outline, special tables with the thematics of all possible areas have to be approved, for which there have to be translations in all possible languages and a way of calling them in every browser, in order to copy the right words, as well as some standards for giving the name of the author, the title, a short abstract, such things. But if I begin now to explain in detail what has to be done I can ... deprive good specialists of their deserved earnings, isn't it so? Well, jokes aside, but this is not a field for enthusiasts.
Still, I will risk proposing to you one brilliant idea in the next subsection, because otherwise I would not be Myrski, right?
5. Introduction of at least one special character as a letter in all alphabets
Here there is no need to search long, this is the known underscore ("_"). It is good in this, that it looks like a hyphen, but is not a sign for splitting of the words and is used, on the contrary, for joining of several words. In this case, if some word begins even only with it, then this will distinguish it in all languages, but it is far better to write, say, Ind_word, or I_word, where the word "word", obviously, signifies any word in any language. I personally use the second variant in my unique book Urrh, in order to allow a search of only the words marked in this way. Similarly a pair of other special designations can be entered, like: Au_name, or Tit_title, or The_theme. And it might also be possible to insert several such signs, if there are sub-themes to the given theme. You see how elementary everything is.
But in order that everybody were able to use this proposition, even at once, a very slight effort is necessary on the part of the developers (and maintainers of these software products): it is necessary only that they process this symbol with all alphabets (even with Arabic or Swahili), not reject it and not take it for a delimiter on which the previous word ends. Then this can also be done which I mentioned in the beginning, that Word can do but in the Internet is impossible, namely: to conduct the search down to every character (say, to write only "The_math*" and search for all possible variants like "mathematics", "mathematical", only "math", and others). Such luxury can be allowed for the whole web, because this will not be a word from whatever language; it will be met tens of thousands (if not millions of) times more rarely than all the listed variants of the words; it is necessary only, for all words where the character "_" is met, to maintain indexes up to every possible symbol of this combined word.
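Sketched in the same illustrative way, this indexing down to every character is trivial:

    # Sketch: for a "_"-marked word, every prefix past the "_" becomes a key,
    # so that a request like "The_math*" finds "The_mathematics" and the rest.
    def index_marked_word(index, word, url):
        if "_" not in word:
            return                          # only the specially marked words
        for end in range(word.index("_") + 2, len(word) + 1):
            index.setdefault(word[:end], set()).add(url)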
Well, I think to finish with this, but, as you see, there is a lot to be wished from all the browsers, and not only in regard to their colour decoration, or to all their complicated functions, but in the very mechanisms of searching on the web; otherwise there is no real sense in showing all the possible millions and milliards of occurrences of some required string of characters.
Dec 2014
— — — — —